Azure Data Factory
NOTE
This article applies to version 1 of Azure Data Factory. If you are using the current version of the Data Factory service, see
Introduction to Data Factory V2.
Azure Data Factory is the platform for these kinds of scenarios. It is a cloud-based data integration service that
allows you to create data-driven workflows in the cloud that orchestrate and automate data movement and data
transformation. Using Azure Data Factory, you can do the following tasks:
Create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data
stores.
Process or transform the data by using compute services such as Azure HDInsight Hadoop, Spark, Azure
Data Lake Analytics, and Azure Machine Learning.
Publish output data to data stores such as Azure Synapse Analytics for business intelligence (BI)
applications to consume.
It's more of an Extract-and-Load (EL) and Transform-and-Load (TL) platform than a traditional Extract-Transform-and-Load (ETL) platform. The transformations process data by using compute services rather than by
adding derived columns, counting the number of rows, sorting data, and so on.
Currently, in Azure Data Factory, the data that workflows consume and produce is time-sliced data (hourly, daily,
weekly, and so on). For example, a pipeline might read input data, process data, and produce output data once a
day. You can also run a workflow just one time.
Key components
An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory
is composed of four key components. These components work together to provide the platform on which you
can compose data-driven workflows with steps to move and transform data.
Pipeline
A data factory can have one or more pipelines. A pipeline is a group of activities. Together, the activities in a
pipeline perform a task.
For example, a pipeline can contain a group of activities that ingests data from an Azure blob, and then runs a
Hive query on an HDInsight cluster to partition the data. The benefit of this is that the pipeline allows you to
manage the activities as a set instead of each one individually. For example, you can deploy and schedule the
pipeline, instead of scheduling independent activities.
Activity
A pipeline can have one or more activities. Activities define the actions to perform on your data. For example,
you can use a copy activity to copy data from one data store to another data store. Similarly, you can use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data
Factory supports two types of activities: data movement activities and data transformation activities.
Data movement activities
Copy Activity in Data Factory copies data from a source data store to a sink data store. Data from any source can
be written to any sink. Select a data store to learn how to copy data to and from that store. Data Factory
supports the following data stores:
CATEGORY    DATA STORE      SUPPORTED AS A SOURCE    SUPPORTED AS A SINK
            DB2*            ✓
            MySQL*          ✓
            Oracle*         ✓                        ✓
            PostgreSQL*     ✓
            SAP HANA*       ✓
            SQL Server*     ✓                        ✓
            Sybase*         ✓
            Teradata*       ✓
NoSQL       Cassandra*      ✓
            MongoDB*        ✓
File        Amazon S3       ✓
            File System*    ✓                        ✓
            FTP             ✓
            HDFS*           ✓
            SFTP            ✓
Generic     OData           ✓
            ODBC*           ✓
            Salesforce      ✓
Supported regions
Currently, you can create data factories in the West US, East US, and North Europe regions. However, a data
factory can access data stores and compute services in other Azure regions to move data between data stores or
process data by using compute services.
Azure Data Factory itself does not store any data. It lets you create data-driven workflows to orchestrate the
movement of data between supported data stores, and to process data by using compute services in other regions or in an on-premises environment. It also allows you to monitor and manage workflows by using both programmatic and UI mechanisms.
Data Factory is available in only the West US, East US, and North Europe regions. However, the service that powers
the data movement in Data Factory is available globally in several regions. If a data store is behind a firewall,
then a Data Management Gateway that's installed in your on-premises environment moves the data instead.
For example, assume that your compute environments, such as an Azure HDInsight cluster and Azure Machine Learning, are located in the West Europe region. You can create and use an Azure Data Factory instance
in North Europe. Then you can use it to schedule jobs on your compute environments in West Europe. It takes a
few milliseconds for Data Factory to trigger the job on your compute environment, but the time for running the
job on your computing environment does not change.
Move data between two cloud data stores: Create a data factory with a pipeline that moves data from blob storage to SQL Database.
Transform data by using a Hadoop cluster: Build your first Azure data factory with a data pipeline that processes data by running a Hive script on an Azure HDInsight (Hadoop) cluster.
Move data between an on-premises data store and a cloud data store by using Data Management Gateway: Build a data factory with a pipeline that moves data from a SQL Server database to an Azure blob. As part of the walkthrough, you install and configure the Data Management Gateway on your machine.
What is Azure Data Factory?
7/16/2021
Connect and collect
Enterprises have data of various types (structured, unstructured, and semi-structured) located in disparate sources, on-premises and in the cloud, all arriving at different intervals and speeds.
The first step in building an information production system is to connect to all the required sources of data and
processing, such as software-as-a-service (SaaS) services, databases, file shares, and FTP web services. The next
step is to move the data as needed to a centralized location for subsequent processing.
Without Data Factory, enterprises must build custom data movement components or write custom services to
integrate these data sources and processing. It's expensive and hard to integrate and maintain such systems. In
addition, they often lack the enterprise-grade monitoring, alerting, and the controls that a fully managed service
can offer.
With Data Factory, you can use the Copy Activity in a data pipeline to move data from both on-premises and
cloud source data stores to a centralized data store in the cloud for further analysis. For example, you can
collect data in Azure Data Lake Storage and transform the data later by using an Azure Data Lake Analytics
compute service. You can also collect data in Azure Blob storage and transform it later by using an Azure
HDInsight Hadoop cluster.
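To make this concrete, the following is a minimal sketch of the copy activity's JSON shape inside a pipeline. The activity and dataset names are placeholders, and the binary Blob-to-Blob source and sink settings mirror the quickstart examples later in this collection; other data stores use their own source, sink, and store settings types.

{
    "name": "CopyToCentralStore",
    "type": "Copy",
    "description": "Illustrative placeholder: copy binary files from a source dataset to a central store dataset.",
    "inputs": [ { "referenceName": "SourceDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "CentralStoreDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "BinarySource", "storeSettings": { "type": "AzureBlobStorageReadSettings", "recursive": true } },
        "sink": { "type": "BinarySink", "storeSettings": { "type": "AzureBlobStorageWriteSettings" } }
    }
}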
Transform and enrich
After data is present in a centralized data store in the cloud, process or transform the collected data by using
ADF mapping data flows. Data flows enable data engineers to build and maintain data transformation graphs
that execute on Spark without needing to understand Spark clusters or Spark programming.
If you prefer to code transformations by hand, ADF supports external activities for executing your
transformations on compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine
Learning.
CI/CD and publish
Data Factory offers full support for CI/CD of your data pipelines using Azure DevOps and GitHub. This allows
you to incrementally develop and deliver your ETL processes before publishing the finished product. After the
raw data has been refined into a business-ready consumable form, load the data into Azure Synapse Analytics, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can point to from
their business intelligence tools.
Monitor
After you have successfully built and deployed your data integration pipeline, providing business value from
refined data, monitor the scheduled activities and pipelines for success and failure rates. Azure Data Factory has
built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health
panels on the Azure portal.
Top-level concepts
An Azure subscription might have one or more Azure Data Factory instances (or data factories). Azure Data
Factory is composed of the following key components:
Pipelines
Activities
Datasets
Linked services
Data Flows
Integration Runtimes
These components work together to provide the platform on which you can compose data-driven workflows
with steps to move and transform data.
Pipeline
A data factory might have one or more pipelines. A pipeline is a logical grouping of activities that performs a
unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of
activities that ingests data from an Azure blob, and then runs a Hive query on an HDInsight cluster to partition
the data.
The benefit of this is that the pipeline allows you to manage the activities as a set instead of managing each one
individually. The activities in a pipeline can be chained together to operate sequentially, or they can operate
independently in parallel.
Mapping data flows
Create and manage graphs of data transformation logic that you can use to transform any-sized data. You can
build up a reusable library of data transformation routines and execute those processes in a scaled-out manner from your ADF pipelines. Data Factory executes your logic on a Spark cluster that spins up and spins down
when you need it. You won't ever have to manage or maintain clusters.
Activity
Activities represent a processing step in a pipeline. For example, you might use a copy activity to copy data from
one data store to another data store. Similarly, you might use a Hive activity, which runs a Hive query on an
Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data
movement activities, data transformation activities, and control activities.
Datasets
Datasets represent data structures within the data stores, which simply point to or reference the data you want
to use in your activities as inputs or outputs.
Linked services
Linked services are much like connection strings, which define the connection information that's needed for
Data Factory to connect to external resources. Think of it this way: a linked service defines the connection to the
data source, and a dataset represents the structure of the data. For example, an Azure Storage-linked service
specifies a connection string to connect to the Azure Storage account. Additionally, an Azure blob dataset
specifies the blob container and the folder that contains the data.
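For illustration, here is a minimal sketch of an Azure Blob storage linked service and a binary dataset that references it. The names, container, and folder are placeholders, and <accountName> and <accountKey> stand in for your storage account values.

{
    "name": "MyStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>;EndpointSuffix=core.windows.net"
        }
    }
}

{
    "name": "MyBlobDataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "MyStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "Binary",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "mycontainer",
                "folderPath": "myfolder"
            }
        }
    }
}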
Linked services are used for two purposes in Data Factory:
To represent a data store that includes, but isn't limited to, a SQL Server database, Oracle database, file
share, or Azure blob storage account. For a list of supported data stores, see the copy activity article.
To represent a compute resource that can host the execution of an activity. For example, the
HDInsight Hive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and
supported compute environments, see the transform data article.
Integration Runtime
In Data Factory, an activity defines the action to be performed. A linked service defines a target data store or a
compute service. An integration runtime provides the bridge between the activity and linked services. It's referenced by the linked service or activity, and provides the compute environment where the activity either runs or gets dispatched from. This way, the activity can be performed in the region that's closest to the
target data store or compute service in the most performant way while meeting security and compliance needs.
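For example, a linked service can point at a self-hosted integration runtime through its connectVia property, so activities that use the linked service are dispatched from that runtime. The following is a minimal sketch; the server, database, integration runtime name, and authentication settings are placeholders.

{
    "name": "OnPremSqlServerLinkedService",
    "properties": {
        "description": "Illustrative placeholder: SQL Server reached through a self-hosted integration runtime.",
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Data Source=<serverName>;Initial Catalog=<databaseName>;Integrated Security=True"
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}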
Triggers
Triggers represent the unit of processing that determines when a pipeline execution needs to be kicked off. There
are different types of triggers for different types of events.
Pipeline runs
A pipeline run is an instance of the pipeline execution. Pipeline runs are typically instantiated by passing the
arguments to the parameters that are defined in pipelines. The arguments can be passed manually or within the
trigger definition.
Parameters
Parameters are key-value pairs of read-only configuration. Parameters are defined in the pipeline. The
arguments for the defined parameters are passed during execution from the run context that was created by a
trigger or a pipeline that was executed manually. Activities within the pipeline consume the parameter values.
A dataset is a strongly typed parameter and a reusable/referenceable entity. An activity can reference datasets
and can consume the properties that are defined in the dataset definition.
A linked service is also a strongly typed parameter that contains the connection information to either a data
store or a compute environment. It is also a reusable/referenceable entity.
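As a minimal sketch, the following pipeline declares one parameter and consumes it in a Web activity; the parameter name, default value, and URL usage are illustrative only.

{
    "name": "ParameterizedPipeline",
    "properties": {
        "parameters": {
            "endpointUrl": { "type": "String", "defaultValue": "https://example.com/api/status" }
        },
        "activities": [
            {
                "name": "CallEndpoint",
                "type": "WebActivity",
                "typeProperties": {
                    "url": "@pipeline().parameters.endpointUrl",
                    "method": "GET"
                }
            }
        ]
    }
}

When you run the pipeline manually or from a trigger, you can supply a different value for endpointUrl as an argument.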
Control flow
Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching,
defining parameters at the pipeline level, and passing arguments while invoking the pipeline on-demand or
from a trigger. It also includes custom-state passing and looping containers, that is, For-each iterators.
Variables
Variables can be used inside of pipelines to store temporary values and can also be used in conjunction with
parameters to enable passing values between pipelines, data flows, and other activities.
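As a minimal sketch, the following pipeline declares a string variable and assigns it with a Set Variable activity; the names and the expression are illustrative only.

{
    "name": "VariableDemoPipeline",
    "properties": {
        "variables": {
            "processedFolder": { "type": "String" }
        },
        "activities": [
            {
                "name": "SetProcessedFolder",
                "type": "SetVariable",
                "typeProperties": {
                    "variableName": "processedFolder",
                    "value": "@concat('output/', formatDateTime(utcnow(), 'yyyy-MM-dd'))"
                }
            }
        ]
    }
}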
Next steps
Here are important next step documents to explore:
Dataset and linked services
Pipelines and activities
Integration runtime
Mapping Data Flows
Data Factory UI in the Azure portal
Copy Data tool in the Azure portal
PowerShell
.NET
Python
REST
Azure Resource Manager template
What's New in Azure Data Factory
7/15/2021
Azure Data Factory receives improvements on an ongoing basis. To stay up to date with the most recent
developments, this article provides you with information about:
The latest releases
Known issues
Bug fixes
Deprecated functionality
Plans for changes
This page will be updated monthly, so revisit it regularly.
June 2021
Data Movement: New user experience with Azure Data Factory Copy Data Tool. Redesigned Copy Data Tool is now available with improved data ingestion experience. Learn more

Data Flow: SQL Server is now supported as a source and sink in data flows. Follow the link for instructions on how to configure your networking using the Azure Integration Runtime managed VNET feature to talk to your SQL Server on-premise and cloud VM-based instances. Learn more

Data Flow: Cluster quick reuse is now enabled by default for all new Azure Integration Runtimes. ADF is happy to announce the general availability of the popular data flow quick start-up reuse feature. All new Azure Integration Runtimes will now have quick reuse enabled by default. Learn more

Power Query activity in ADF public preview: You can now build complex field mappings to your Power Query sink using Azure Data Factory data wrangling. The sink is now configured in the pipeline in the Power Query (Preview) activity to accommodate this update. Learn more

Updated data flows monitoring UI in Azure Data Factory: Azure Data Factory has a new update for the monitoring UI to make it easier to view your data flow ETL job executions and quickly identify areas for performance tuning. Learn more

SQL Server Integration Services (SSIS): Run any SQL anywhere in 3 simple steps with SSIS in Azure Data Factory. This post provides 3 simple steps to run any SQL statements/scripts anywhere with SSIS in Azure Data Factory.
1. Prepare your Self-Hosted Integration Runtime/SSIS Integration Runtime.
2. Prepare an Execute SSIS Package activity in an Azure Data Factory pipeline.
3. Run the Execute SSIS Package activity on your Self-Hosted Integration Runtime/SSIS Integration Runtime.
Learn more
More information
Blog - Azure Data Factory
Stack Overflow forum
Twitter
Videos
Compare Azure Data Factory with Data Factory
version 1
3/5/2021
Feature comparison
The following table compares the features of Data Factory with the features of Data Factory version 1.
Datasets
Version 1: A named view of data that references the data that you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Azure Blob storage from which the activity should read the data.
Current version: Datasets are the same in the current version. However, you do not need to define availability schedules for datasets. You can define a trigger resource that can schedule pipelines from a clock scheduler paradigm. For more information, see Triggers and Datasets.

Linked services
Version 1: Linked services are much like connection strings, which define the connection information that's necessary for Data Factory to connect to external resources.
Current version: Linked services are the same as in Data Factory V1, but with a new connectVia property to utilize the Integration Runtime compute environment of the current version of Data Factory. For more information, see Integration runtime in Azure Data Factory and Linked service properties for Azure Blob storage.

Pipelines
Version 1: A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. You use startTime, endTime, and isPaused to schedule and run pipelines.
Current version: Pipelines are groups of activities that are performed on data. However, the scheduling of activities in the pipeline has been separated into new trigger resources. You can think of pipelines in the current version of Data Factory more as "workflow units" that you schedule separately via triggers.

Activities
Version 1: Activities define actions to perform on your data within a pipeline. Data movement (copy activity) and data transformation activities (such as Hive, Pig, and MapReduce) are supported.
Current version: In the current version of Data Factory, activities still are defined actions within a pipeline. The current version of Data Factory introduces new control flow activities. You use these activities in a control flow (looping and branching). Data movement and data transformation activities that were supported in V1 are supported in the current version. You can define transformation activities without using datasets in the current version.

Hybrid data movement and activity dispatch
Version 1: Now called Integration Runtime, Data Management Gateway supported moving data between on-premises and cloud.
Current version: Data Management Gateway is now called Self-Hosted Integration Runtime. It provides the same capability as it did in V1.

Expressions
Version 1: Data Factory V1 allows you to use functions and system variables in data selection queries and activity/dataset properties.
Current version: In the current version of Data Factory, you can use expressions anywhere in a JSON string value. For more information, see Expressions and functions in the current version of Data Factory.
The following sections provide more information about the capabilities of the current version.
Control flow
To support diverse integration flows and patterns in the modern data warehouse, the current version of Data
Factory has enabled a new flexible data pipeline model that is no longer tied to time-series data. A few common
flows that were previously not possible are now enabled. They are described in the following sections.
Chaining activities
In V1, you had to configure the output of an activity as an input of another activity to chain them. In the current
version, you can chain activities in a sequence within a pipeline. You can use the dependsOn property in an
activity definition to chain it with an upstream activity. For more information and an example, see Pipelines and
activities and Branching and chaining activities.
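As a minimal sketch (using Wait activities as stand-ins for real work), the second activity below runs only after the first one succeeds because of its dependsOn setting:

{
    "name": "ChainedActivitiesPipeline",
    "properties": {
        "activities": [
            {
                "name": "FirstStep",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 10 }
            },
            {
                "name": "SecondStep",
                "type": "Wait",
                "dependsOn": [
                    { "activity": "FirstStep", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": { "waitTimeInSeconds": 10 }
            }
        ]
    }
}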
Branching activities
In the current version, you can branch activities within a pipeline. The If-condition activity provides the same functionality that an if statement provides in programming languages. It evaluates a set of activities when the condition evaluates to true and another set of activities when the condition evaluates to false. For examples
of branching activities, see the Branching and chaining activities tutorial.
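A minimal sketch of an If-condition activity follows; the parameter name, the expression, and the Wait activities are placeholders for real branch logic.

{
    "name": "CheckEnvironment",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@equals(pipeline().parameters.environment, 'production')",
            "type": "Expression"
        },
        "ifTrueActivities": [
            { "name": "WaitIfTrue", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
        ],
        "ifFalseActivities": [
            { "name": "WaitIfFalse", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 1 } }
        ]
    }
}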
Parameters
You can define parameters at the pipeline level and pass arguments while you're invoking the pipeline on-
demand or from a trigger. Activities can consume the arguments that are passed to the pipeline. For more
information, see Pipelines and triggers.
Custom state passing
Activity outputs, including state, can be consumed by a subsequent activity in the pipeline. For example, in the
JSON definition of an activity, you can access the output of the previous activity by using the following syntax:
@activity('NameofPreviousActivity').output.value . By using this feature, you can build workflows where values
can pass through activities.
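For example, assuming a copy activity named CopyFromBlobToBlob (as in the quickstarts later in this collection) and a string pipeline variable named filesWrittenCount, a subsequent Set Variable activity could capture one of the copy activity's output properties, such as filesWritten:

{
    "name": "SaveCopyOutput",
    "type": "SetVariable",
    "dependsOn": [
        { "activity": "CopyFromBlobToBlob", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "variableName": "filesWrittenCount",
        "value": "@string(activity('CopyFromBlobToBlob').output.filesWritten)"
    }
}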
Looping containers
The ForEach activity defines a repeating control flow in your pipeline. This activity iterates over a collection and
runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping
structure in programming languages.
The Until activity provides the same functionality that a do-until looping structure provides in programming
languages. It runs a set of activities in a loop until the condition that's associated with the activity evaluates to
true. You can specify a timeout value for the Until activity in Data Factory.
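As a minimal sketch, the following ForEach activity iterates over an array pipeline parameter and runs a placeholder Wait activity for each item; inside the loop, @item() refers to the current element. The parameter name and inner activity are illustrative only.

{
    "name": "ProcessEachFile",
    "type": "ForEach",
    "typeProperties": {
        "items": {
            "value": "@pipeline().parameters.fileNames",
            "type": "Expression"
        },
        "isSequential": false,
        "activities": [
            {
                "name": "WaitPerFile",
                "type": "Wait",
                "typeProperties": { "waitTimeInSeconds": 1 }
            }
        ]
    }
}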
Trigger-based flows
Pipelines can be triggered on demand (by an event, such as a blob being posted) or by wall-clock time. The pipelines and
triggers article has detailed information about triggers.
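As a minimal sketch, a schedule trigger that runs a pipeline once a day might look like the following; the trigger name, start time, time zone, and pipeline reference are placeholders.

{
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2021-07-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "Adfv2QuickStartPipeline",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}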
Invoking a pipeline from another pipeline
The Execute Pipeline activity allows a Data Factory pipeline to invoke another pipeline.
Delta flows
A key use case in ETL patterns is "delta loads," in which only data that has changed since the last iteration of a
pipeline is loaded. New capabilities in the current version, such as lookup activity, flexible scheduling, and
control flow, enable this use case in a natural way. For a tutorial with step-by-step instructions, see Tutorial:
Incremental copy.
Other control flow activities
Following are a few more control flow activities that are supported by the current version of Data Factory.
ForEach activity: Defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and runs specified activities in a loop. The loop implementation of this activity is similar to the Foreach looping structure in programming languages.
Web activity: Calls a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Lookup activity: Reads or looks up a record or table name value from any external source. This output can further be referenced by succeeding activities.
Get metadata activity: Retrieves the metadata of any data in Azure Data Factory.
Flexible scheduling
In the current version of Data Factory, you do not need to define dataset availability schedules. You can define a
trigger resource that can schedule pipelines from a clock scheduler paradigm. You can also pass parameters to
pipelines from a trigger for a flexible scheduling and execution model.
Pipelines do not have "windows" of time execution in the current version of Data Factory. The Data Factory V1
concepts of startTime, endTime, and isPaused don't exist in the current version of Data Factory. For more
information about how to build and then schedule a pipeline in the current version of Data Factory, see Pipeline
execution and triggers.
Custom activities
In V1, you implement (custom) DotNet activity code by creating a .NET class library project with a class that
implements the Execute method of the IDotNetActivity interface. Therefore, you need to write your custom code
in .NET Framework 4.5.2 and run it on Windows-based Azure Batch Pool nodes.
In a custom activity in the current version, you don't have to implement a .NET interface. You can directly run
commands, scripts, and your own custom code compiled as an executable.
For more information, see Difference between custom activity in Data Factory and version 1.
SDKs
The current version of Data Factory provides a richer set of SDKs that can be used to author, manage, and
monitor pipelines.
.NET SDK : The .NET SDK is updated in the current version.
PowerShell : The PowerShell cmdlets are updated in the current version. The cmdlets for the current
version have DataFactoryV2 in the name, for example: Get-AzDataFactoryV2.
Python SDK : This SDK is new in the current version.
REST API : The REST API is updated in the current version.
The SDKs that are updated in the current version are not backward-compatible with V1 clients.
Authoring experience
Monitoring experience
In the current version, you can also monitor data factories by using Azure Monitor. The new PowerShell cmdlets
support monitoring of integration runtimes. Both V1 and V2 support visual monitoring via a monitoring
application that can be launched from the Azure portal.
Next steps
Learn how to create a data factory by following step-by-step instructions in the following quickstarts:
PowerShell, .NET, Python, REST API.
Quickstart: Create a data factory by using the Azure
Data Factory UI
7/7/2021
NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor or owner role, or an administrator of the Azure subscription. To view the permissions that you have
in the subscription, go to the Azure portal, select your username in the upper-right corner, select "..." icon for
more options, and then select My permissions . If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers,
and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure Storage account
You use a general-purpose Azure Storage account (specifically Blob storage) as both source and destination data
stores in this quickstart. If you don't have a general-purpose Azure Storage account, see Create a storage
account to create one.
Get the storage account name
You need the name of your Azure Storage account for this quickstart. The following procedure provides steps to
get the name of your storage account:
1. In a web browser, go to the Azure portal and sign in using your Azure username and password.
2. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can also
search for and select Storage accounts from any page.
3. In the Storage accounts page, filter for your storage account (if needed), and then select your storage
account.
You can also search for and select Storage accounts from any page.
Create a blob container
In this section, you create a blob container named adftutorial in Azure Blob storage.
1. From the storage account page, select Overview > Containers.
2. On the <Account name> - Containers page's toolbar, select Container.
3. In the New container dialog box, enter adftutorial for the name, and then select OK . The <Account
name> - Containers page is updated to include adftutorial in the list of containers.
John, Doe
Jane, Doe
Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.) Then return to
the Azure portal and follow these steps:
1. In the <Account name> - Containers page where you left off, select adftutorial from the updated list of
containers.
a. If you closed the window or went to another page, sign in to the Azure portal again.
b. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can
also search for and select Storage accounts from any page.
c. Select your storage account, and then select Containers > adftutorial .
2. On the adftutorial container page's toolbar, select Upload .
3. In the Upload blob page, select the Files box, and then browse to and select the emp.txt file.
4. Expand the Advanced heading.
5. On the Create Data Factory page, under the Basics tab, select the Azure subscription in which you
want to create the data factory.
6. For Resource Group , take one of the following steps:
a. Select an existing resource group from the drop-down list.
b. Select Create new , and enter the name of a new resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
7. For Region , select the location for the data factory.
The list shows only locations that Data Factory supports, and where your Azure Data Factory meta data
will be stored. The associated data stores (like Azure Storage and Azure SQL Database) and computes
(like Azure HDInsight) that Data Factory uses can run in other regions.
8. For Name, enter ADFTutorialDataFactory. The name of the Azure data factory must be globally unique. If you see the following error, change the name of the data factory (for example, <yourname>ADFTutorialDataFactory) and try creating again. For naming rules for Data Factory
artifacts, see the Data Factory - naming rules article.
3. On the New Dataset page, select Azure Blob Storage , and then select Continue .
4. On the Select Format page, choose the format type of your data, and then select Continue . In this case,
select Binary to copy files as-is without parsing the content.
5. On the Set Properties page, complete the following steps:
a. Under Name , enter InputDataset .
b. For Linked service, select AzureStorageLinkedService.
c. For File path , select the Browse button.
d. In the Choose a file or folder window, browse to the input folder in the adftutorial container, select
the emp.txt file, and then select OK .
e. Select OK .
4. Switch to the Source tab in the copy activity settings, and select InputDataset for Source Dataset .
5. Switch to the Sink tab in the copy activity settings, and select OutputDataset for Sink Dataset .
6. Click Validate on the pipeline toolbar above the canvas to validate the pipeline settings. Confirm that the
pipeline has been successfully validated. To close the validation output, select the Validation button in the
top-right corner.
Debug the pipeline
In this step, you debug the pipeline before deploying it to Data Factory.
1. On the pipeline toolbar above the canvas, click Debug to trigger a test run.
2. Confirm that you see the status of the pipeline run on the Output tab of the pipeline settings at the
bottom.
3. Confirm that you see an output file in the output folder of the adftutorial container. If the output folder
doesn't exist, the Data Factory service automatically creates it.
2. To trigger the pipeline manually, select Add Trigger on the pipeline toolbar, and then select Trigger
Now . On the Pipeline run page, select OK .
2. Select the CopyPipeline link; you'll see the status of the copy activity run on this page.
3. To view details about the copy operation, select the Details (eyeglasses image) link. For details about the
properties, see Copy Activity overview.
Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To learn
about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Use the Copy Data tool to copy data
7/7/2021
NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory before doing this quickstart.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor or owner role, or an administrator of the Azure subscription. To view the permissions that you have
in the subscription, go to the Azure portal, select your username in the upper-right corner, select "..." icon for
more options, and then select My permissions . If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers,
and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure Storage account
You use a general-purpose Azure Storage account (specifically Blob storage) as both source and destination data
stores in this quickstart. If you don't have a general-purpose Azure Storage account, see Create a storage
account to create one.
Get the storage account name
You need the name of your Azure Storage account for this quickstart. The following procedure provides steps to
get the name of your storage account:
1. In a web browser, go to the Azure portal and sign in using your Azure username and password.
2. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can also
search for and select Storage accounts from any page.
3. In the Storage accounts page, filter for your storage account (if needed), and then select your storage
account.
You can also search for and select Storage accounts from any page.
Create a blob container
In this section, you create a blob container named adftutorial in Azure Blob storage.
1. From the storage account page, select Overview > Containers.
2. On the <Account name> - Containers page's toolbar, select Container.
3. In the New container dialog box, enter adftutorial for the name, and then select OK . The <Account
name> - Containers page is updated to include adftutorial in the list of containers.
John, Doe
Jane, Doe
Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.) Then return to
the Azure portal and follow these steps:
1. In the <Account name> - Containers page where you left off, select adftutorial from the updated list of
containers.
a. If you closed the window or went to another page, sign in to the Azure portal again.
b. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can
also search for and select Storage accounts from any page.
c. Select your storage account, and then select Containers > adftutorial .
2. On the adftutorial container page's toolbar, select Upload .
3. In the Upload blob page, select the Files box, and then browse to and select the emp.txt file.
4. Expand the Advanced heading.
2. On the Properties page of the Copy Data tool, choose Built-in copy task under Task type, then select
Next .
3. On the Source data store page, complete the following steps:
a. Click + Create new connection to add a connection.
b. Select the linked service type that you want to create for the source connection. In this tutorial, we
use Azure Blob Storage . Select it from the gallery, and then select Continue .
c. On the New connection (Azure Blob Storage) page, specify a name for your connection. Select
your Azure subscription from the Azure subscription list and your storage account from the
Storage account name list, test connection, and then select Create .
d. Select the newly created connection in the Connection block.
e. In the File or folder section, select Browse to navigate to the adftutorial/input folder, select the
emp.txt file, and then click OK .
f. Select the Binary copy checkbox to copy the file as-is, and then select Next.
4. On the Destination data store page, complete the following steps:
a. Select the AzureBlobStorage connection that you created in the Connection block.
b. In the Folder path section, enter adftutorial/output for the folder path.
c. Leave other settings as default and then select Next .
5. On the Settings page, specify a name for the pipeline and its description, then select Next to use other
default configurations.
6. On the Summary page, review all settings, and select Next.
7. On the Deployment complete page, select Monitor to monitor the pipeline that you created.
8. The application switches to the Monitor tab. You see the status of the pipeline on this tab. Select Refresh
to refresh the list. Click the link under Pipeline name to view activity run details or rerun the pipeline.
9. On the Activity runs page, select the Details link (eyeglasses icon) under the Activity name column for
more details about copy operation. For details about the properties, see Copy Activity overview.
10. To go back to the Pipeline Runs view, select the All pipeline runs link in the breadcrumb menu. To
refresh the view, select Refresh .
11. Verify that the emp.txt file is created in the output folder of the adftutorial container. If the output
folder doesn't exist, the Data Factory service automatically creates it.
12. Switch to the Author tab above the Monitor tab on the left panel so that you can edit linked services,
datasets, and pipelines. To learn about editing them in the Data Factory UI, see Create a data factory by
using the Azure portal.
Next steps
The pipeline in this sample copies data from one location to another location in Azure Blob storage. To learn
about using Data Factory in more scenarios, go through the tutorials.
Quickstart: Create an Azure Data Factory using
Azure CLI
6/8/2021
This quickstart describes how to use Azure CLI to create an Azure Data Factory. The pipeline you create in this
data factory copies data from one folder to another folder in an Azure Blob Storage. For information on how to
transform data using Azure Data Factory, see Transform data in Azure Data Factory.
For an introduction to the Azure Data Factory service, see Introduction to Azure Data Factory.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Use the Bash environment in Azure Cloud Shell.
If you prefer, install the Azure CLI to run CLI reference commands.
If you're using a local installation, sign in to the Azure CLI by using the az login command. To finish
the authentication process, follow the steps displayed in your terminal. For additional sign-in
options, see Sign in with the Azure CLI.
When you're prompted, install Azure CLI extensions on first use. For more information about
extensions, see Use extensions with the Azure CLI.
Run az version to find the version and dependent libraries that are installed. To upgrade to the
latest version, run az upgrade.
NOTE
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the contributor
or owner role, or an administrator of the Azure subscription. For more information, see Azure roles.
3. Create a container named adftutorial by using the az storage container create command:
az storage container create --resource-group ADFQuickStartRG --name adftutorial \
--account-name adfquickstartstorage --auth-mode key
4. In the local directory, create a file named emp.txt to upload. If you're working in Azure Cloud Shell, you
can find the current working directory by using the echo $PWD Bash command. You can use standard
Bash commands, like cat , to create a file:
IMPORTANT
Replace ADFTutorialFactory with a globally unique data factory name, for example, ADFTutorialFactorySP1127.
You can see the data factory that you created by using the az datafactory factory show command:
2. In your working directory, create a JSON file with this content, which includes your own connection string
from the previous step. Name the file AzureStorageLinkedService.json :
{
"type":"AzureStorage",
"typeProperties":{
"connectionString":{
"type": "SecureString",
"value":"DefaultEndpointsProtocol=https;AccountName=adfquickstartstorage;AccountKey=K9F4Xk/EhYrMBIR98
rtgJ0HRSIDU4eWQILLh2iXo05Xnr145+syIKNczQfORkQ3QIOZAd/eSDsvED19dAwW/tw==;EndpointSuffix=core.windows.n
et"
}
}
}
4. In your working directory, create a JSON file with this content, named InputDataset.json :
{
"type":
"AzureBlob",
"linkedServiceName": {
"type":"LinkedServiceReference",
"referenceName":"AzureStorageLinkedService"
},
"annotations": [],
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "emp.txt",
"folderPath": "input",
"container": "adftutorial"
}
}
}
5. Create an input dataset named InputDataset by using the az datafactory dataset create command:
6. In your working directory, create a JSON file with this content, named OutputDataset.json :
{
"type":
"AzureBlob",
"linkedServiceName": {
"type":"LinkedServiceReference",
"referenceName":"AzureStorageLinkedService"
},
"annotations": [],
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "emp.txt",
"folderPath": "output",
"container": "adftutorial"
}
}
}
7. Create an output dataset named OutputDataset by using the az datafactory dataset create command:
2. Create a pipeline named Adfv2QuickStartPipeline by using the az datafactory pipeline create command:
This command returns a run ID. Copy it for use in the next command.
4. Verify that the pipeline run succeeded by using the az datafactory pipeline-run show command:
You can also verify that your pipeline ran as expected by using the Azure portal. For more information, see
Review deployed resources.
Clean up resources
All of the resources in this quickstart are part of the same resource group. To remove them all, use the az group
delete command:
If you're using this resource group for anything else, delete individual resources instead. For instance, to remove
the linked service, use the az datafactory linked-service delete command.
In this quickstart, you created the following JSON files:
AzureStorageLinkedService.json
InputDataset.json
OutputDataset.json
Adfv2QuickStartPipeline.json
Delete them by using standard Bash commands.
Next steps
Pipelines and activities in Azure Data Factory
Linked services in Azure Data Factory
Datasets in Azure Data Factory
Transform data in Azure Data Factory
Quickstart: Create an Azure Data Factory using
PowerShell
5/28/2021
NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure Data
Factory service, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor or owner role, or an administrator of the Azure subscription. To view the permissions that you have
in the subscription, go to the Azure portal, select your username in the upper-right corner, select "..." icon for
more options, and then select My permissions . If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers,
and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure Storage account
You use a general-purpose Azure Storage account (specifically Blob storage) as both source and destination data
stores in this quickstart. If you don't have a general-purpose Azure Storage account, see Create a storage
account to create one.
Get the storage account name
You need the name of your Azure Storage account for this quickstart. The following procedure provides steps to
get the name of your storage account:
1. In a web browser, go to the Azure portal and sign in using your Azure username and password.
2. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can also
search for and select Storage accounts from any page.
3. In the Storage accounts page, filter for your storage account (if needed), and then select your storage
account.
You can also search for and select Storage accounts from any page.
Create a blob container
In this section, you create a blob container named adftutorial in Azure Blob storage.
1. From the storage account page, select Overview > Containers.
2. On the <Account name> - Containers page's toolbar, select Container.
3. In the New container dialog box, enter adftutorial for the name, and then select OK . The <Account
name> - Containers page is updated to include adftutorial in the list of containers.
John, Doe
Jane, Doe
Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.) Then return to
the Azure portal and follow these steps:
1. In the <Account name> - Containers page where you left off, select adftutorial from the updated list of
containers.
a. If you closed the window or went to another page, sign in to the Azure portal again.
b. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can
also search for and select Storage accounts from any page.
c. Select your storage account, and then select Containers > adftutorial .
2. On the adftutorial container page's toolbar, select Upload .
3. In the Upload blob page, select the Files box, and then browse to and select the emp.txt file.
4. Expand the Advanced heading.
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
WARNING
If you do not use latest versions of PowerShell and Data Factory module, you may run into deserialization errors while
running the commands.
Log in to PowerShell
1. Launch PowerShell on your machine. Keep PowerShell open until the end of this quickstart. If you close
and reopen, you need to run these commands again.
2. Run the following command, and enter the same Azure user name and password that you use to sign in
to the Azure portal:
Connect-AzAccount
3. Run the following command to view all the subscriptions for this account:
Get-AzSubscription
4. If you see multiple subscriptions associated with your account, run the following command to select the
subscription that you want to work with. Replace SubscriptionId with the ID of your Azure subscription:
$resourceGroupName = "ADFQuickStartRG";
If the resource group already exists, you may not want to overwrite it. Assign a different value to the $ResourceGroupName variable and run the command again.
IMPORTANT
Update the data factory name to be globally unique. For example, ADFTutorialFactorySP1127.
$dataFactoryName = "ADFQuickStartFactory";
4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location and
ResourceGroupName property from the $ResGrp variable:
The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names
must be globally unique.
To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
TIP
In this quickstart, you use Account key as the authentication type for your data store, but you can choose other
supported authentication methods, such as SAS URI, Service Principal, and Managed Identity, if needed. Refer to the corresponding
sections in this article for details. To store secrets for data stores securely, it's also recommended to use an Azure Key
Vault. Refer to this article for detailed illustrations.
1. Create a JSON file named AzureStorageLinkedService.json in the C:\ADFv2QuickStartPSH folder with the following content. (Create the folder ADFv2QuickStartPSH if it does not already exist.)
IMPORTANT
Replace <accountName> and <accountKey> with name and key of your Azure storage account before saving the
file.
{
"name": "AzureStorageLinkedService",
"properties": {
"annotations": [],
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>;EndpointSuffix=core.windows.net"
}
}
}
If you are using Notepad, select All files for the Save as type field in the Save as dialog box. Otherwise, it may add the .txt extension to the file, for example, AzureStorageLinkedService.json.txt. If you create the file in File Explorer before opening it in Notepad, you may not see the .txt extension since the Hide extensions for known file types option is set by default. Remove the .txt extension
before proceeding to the next step.
2. In PowerShell, switch to the ADFv2QuickStartPSH folder.
Set-Location 'C:\ADFv2QuickStartPSH'
3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service AzureStorageLinkedService.
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobStorageLinkedService
Create datasets
In this procedure, you create two datasets: InputDataset and OutputDataset . These datasets are of type
Binary. They refer to the Azure Storage linked service that you created in the previous section. The input dataset
represents the source data in the input folder. In the input dataset definition, you specify the blob container
(adftutorial ), the folder (input ), and the file (emp.txt ) that contain the source data. The output dataset
represents the data that's copied to the destination. In the output dataset definition, you specify the blob
container (adftutorial ), the folder (output ), and the file to which the data is copied.
1. Create a JSON file named InputDataset.json in the C:\ADFv2QuickStartPSH folder, with the following
content:
{
"name": "InputDataset",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "emp.txt",
"folderPath": "input",
"container": "adftutorial"
}
}
}
}
DatasetName : InputDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.BinaryDataset
3. Repeat the steps to create the output dataset. Create a JSON file named OutputDataset.json in the
C:\ADFv2QuickStartPSH folder, with the following content:
{
"name": "OutputDataset",
"properties": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"folderPath": "output",
"container": "adftutorial"
}
}
}
}
DatasetName : OutputDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.BinaryDataset
Create a pipeline
In this procedure, you create a pipeline with a copy activity that uses the input and output datasets. The copy
activity copies data from the file you specified in the input dataset settings to the file you specified in the output
dataset settings.
1. Create a JSON file named Adfv2QuickStartPipeline.json in the C:\ADFv2QuickStartPSH folder with
the following content:
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "InputDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "OutputDataset",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}
2. To create the pipeline Adfv2QuickStartPipeline, run the Set-AzDataFactoryV2Pipeline cmdlet.
$DFPipeLine = Set-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-Name "Adfv2QuickStartPipeline" `
-DefinitionFile ".\Adfv2QuickStartPipeline.json"
$RunId = Invoke-AzDataFactoryV2Pipeline `
-DataFactoryName $DataFactory.DataFactoryName `
-ResourceGroupName $ResGrp.ResourceGroupName `
-PipelineName $DFPipeLine.Name
while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun `
-ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId
if ($Run) {
if ( ($Run.Status -ne "InProgress") -and ($Run.Status -ne "Queued") ) {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output ("Pipeline is running...status: " + $Run.Status)
}
Start-Sleep -Seconds 10
}
ResourceGroupName : ADFQuickStartRG
DataFactoryName : ADFQuickStartFactory
RunId : 00000000-0000-0000-0000-0000000000000
PipelineName : Adfv2QuickStartPipeline
LastUpdated : 8/27/2019 7:23:07 AM
Parameters : {}
RunStart : 8/27/2019 7:22:56 AM
RunEnd : 8/27/2019 7:23:07 AM
DurationInMs : 11324
Status : Succeeded
Message :
2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.
ResourceGroupName : ADFQuickStartRG
DataFactoryName : ADFQuickStartFactory
ActivityRunId : 00000000-0000-0000-0000-000000000000
ActivityName : CopyFromBlobToBlob
PipelineRunId : 00000000-0000-0000-0000-000000000000
PipelineName : Adfv2QuickStartPipeline
Input : {source, sink, enableStaging}
Output : {dataRead, dataWritten, filesRead, filesWritten...}
LinkedServiceName :
ActivityRunStart : 8/27/2019 7:22:58 AM
ActivityRunEnd : 8/27/2019 7:23:05 AM
DurationInMs : 6828
Status : Succeeded
Error : {errorCode, message, failureType, target}
Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure resource
group, which includes all the resources in the resource group. If you want to keep the other resources intact,
delete only the data factory you created in this tutorial.
Deleting a resource group deletes all resources including data factories in it. Run the following command to
delete the entire resource group:
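For example, reusing the $ResGrp variable from the earlier steps of this quickstart, the command might look like this:
Remove-AzResourceGroup -ResourceGroupName $ResGrp.ResourceGroupName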
NOTE
Deleting a resource group may take some time. Please be patient with the process.
If you want to delete just the data factory, not the entire resource group, run the following command:
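For example, again reusing the variables from the earlier steps, a sketch of that command is:
Remove-AzDataFactoryV2 -Name $DataFactory.DataFactoryName -ResourceGroupName $ResGrp.ResourceGroupName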
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. Go
through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline using
.NET SDK
NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure Data
Factory service, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Azure roles
To create Data Factory instances, the user account that you use to sign in to Azure must be a member of the
contributor or owner role, or an administrator of the Azure subscription. To view the permissions that you have
in the subscription, go to the Azure portal, select your username in the upper-right corner, select "..." icon for
more options, and then select My permissions . If you have access to multiple subscriptions, select the
appropriate subscription.
To create and manage child resources for Data Factory - including datasets, linked services, pipelines, triggers,
and integration runtimes - the following requirements are applicable:
To create and manage child resources in the Azure portal, you must belong to the Data Factory
Contributor role at the resource group level or above.
To create and manage child resources with PowerShell or the SDK, the contributor role at the resource level
or above is sufficient.
For sample instructions about how to add a user to a role, see the Add roles article.
For more info, see the following articles:
Data Factory Contributor role
Roles and permissions for Azure Data Factory
Azure Storage account
You use a general-purpose Azure Storage account (specifically Blob storage) as both source and destination data
stores in this quickstart. If you don't have a general-purpose Azure Storage account, see Create a storage
account to create one.
Get the storage account name
You need the name of your Azure Storage account for this quickstart. The following procedure provides steps to
get the name of your storage account:
1. In a web browser, go to the Azure portal and sign in using your Azure username and password.
2. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can also
search for and select Storage accounts from any page.
3. In the Storage accounts page, filter for your storage account (if needed), and then select your storage
account.
Create a blob container
In this section, you create a blob container named adftutorial in Azure Blob storage.
1. From the storage account page, select Overview > Containers.
2. On the <Account name> - Containers page's toolbar, select Container .
3. In the New container dialog box, enter adftutorial for the name, and then select OK . The <Account
name> - Containers page is updated to include adftutorial in the list of containers.
Create a file named emp.txt with the following content:
John, Doe
Jane, Doe
Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.) Then return to
the Azure portal and follow these steps:
1. In the <Account name> - Containers page where you left off, select adftutorial from the updated list of
containers.
a. If you closed the window or went to another page, sign in to the Azure portal again.
b. From the Azure portal menu, select All services, then select Storage > Storage accounts. You can
also search for and select Storage accounts from any page.
c. Select your storage account, and then select Containers > adftutorial .
2. On the adftutorial container page's toolbar, select Upload .
3. In the Upload blob page, select the Files box, and then browse to and select the emp.txt file.
4. Expand the Advanced heading. In the Upload to folder box, enter input, and then select Upload.
Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager -IncludePrerelease
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Rest.Serialization;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
2. Add the following code to the Main method that sets the variables. Replace the placeholders with your
own values. For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory: Products
available by region. The data stores (Azure Storage, Azure SQL Database, and more) and computes
(HDInsight and others) used by data factory can be in other regions.
// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID where the data factory resides>";
string resourceGroup = "<your resource group where the data factory resides>";
string region = "<the location of your resource group>";
string dataFactoryName =
"<specify the name of data factory to create. It must be globally unique.>";
string storageAccount = "<your storage account name to copy data>";
string storageKey = "<your storage account key>";
// specify the container and input folder from which all files
// need to be copied to the output folder.
string inputBlobPath =
"<path to existing blob(s) to copy data from, e.g. containername/inputdir>";
// specify the container and output folder where the files are copied
string outputBlobPath =
"<the blob path to copy data to, e.g. containername/outputdir>";
// name of the Azure Storage linked service, blob dataset, and the pipeline
string storageLinkedServiceName = "AzureStorageLinkedService";
string blobDatasetName = "BlobDataset";
string pipelineName = "Adfv2QuickStartPipeline";
NOTE
For Sovereign clouds, you must use the appropriate cloud-specific endpoints for ActiveDirectoryAuthority and
ResourceManagerUrl (BaseUri). For example, in US Azure Gov you would use authority of https://login.microsoftonline.us
instead of https://login.microsoftonline.com, and use https://management.usgovcloudapi.net instead of
https://management.azure.com/, and then create the data factory management client. You can use PowerShell to easily
get the endpoint URLs for various clouds by executing "Get-AzEnvironment | Format-List", which returns a list of
endpoints for each cloud environment.
3. Add the following code to the Main method that creates an instance of
DataFactoryManagementClient class. You use this object to create a data factory, a linked service,
datasets, and a pipeline. You also use this object to monitor the pipeline run details.
Create a dataset
Add the following code to the Main method that creates an Azure blob dataset.
You define a dataset that represents the data to copy from a source to a sink. In this example, this Blob dataset
refers to the Azure Storage linked service you created in the previous step. The dataset takes a parameter
whose value is set in an activity that consumes the dataset. The parameter is used to construct the "folderPath"
pointing to where the data resides.
// Create an Azure Blob dataset
Console.WriteLine("Creating dataset " + blobDatasetName + "...");
DatasetResource blobDataset = new DatasetResource(
new AzureBlobDataset
{
LinkedServiceName = new LinkedServiceReference
{
ReferenceName = storageLinkedServiceName
},
FolderPath = new Expression { Value = "@{dataset().path}" },
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "path", new ParameterSpecification { Type = ParameterType.String } }
}
}
);
client.Datasets.CreateOrUpdate(
resourceGroup, dataFactoryName, blobDatasetName, blobDataset);
Console.WriteLine(
SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings));
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.
In this example, this pipeline contains one activity and takes two parameters: the input blob path and the output
blob path. The values for these parameters are set when the pipeline is triggered or run. The copy activity refers to
the same blob dataset created in the previous step as input and output. When the dataset is used as an input
dataset, the input path is specified; when it's used as an output dataset, the output path is specified.
// Create a pipeline with a copy activity
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Parameters = new Dictionary<string, ParameterSpecification>
{
{ "inputPath", new ParameterSpecification { Type = ParameterType.String } },
{ "outputPath", new ParameterSpecification { Type = ParameterType.String } }
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToBlob",
Inputs = new List<DatasetReference>
{
new DatasetReference()
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.inputPath" }
}
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobDatasetName,
Parameters = new Dictionary<string, object>
{
{ "path", "@pipeline().parameters.outputPath" }
}
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
}
}
};
client.Pipelines.CreateOrUpdate(resourceGroup, dataFactoryName, pipelineName, pipeline);
Console.WriteLine(SafeJsonConvert.SerializeObject(pipeline, client.SerializationSettings));
2. Add the following code to the Main method that retrieves copy activity run details, such as the size of the
data that's read or written.
Clean up resources
To programmatically delete the data factory, add the following lines of code to the program:
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. Go
through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create a data factory and pipeline using
Python
Prerequisites
An Azure account with an active subscription. Create one for free.
Python 3.6+.
An Azure Storage account.
Azure Storage Explorer (optional).
An application in Azure Active Directory. Create the application by following the steps in this link, using
Authentication Option 2 (application secret), and assign the application to the Contributor role by
following instructions in the same article. Make note of the following values as shown in the article to use
in later steps: Application (client) ID, client secret value, and tenant ID.
1. Create a file named input.txt on your disk with the following content:
John|Doe
Jane|Doe
2. Use tools such as Azure Storage Explorer to create the adfv2tutorial container, and input folder in the
container. Then, upload the input.txt file to the input folder.
The Python SDK for Data Factory supports Python 2.7 and 3.6+.
4. To install the Python package for Azure Identity authentication, run the following command:
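For example, from a command prompt or PowerShell session this is typically:
pip install azure-identity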
NOTE
The "azure-identity" package might have conflicts with "azure-cli" on some common dependencies. If you meet
any authentication issue, remove "azure-cli" and its dependencies, or use a clean machine without installing
"azure-cli" package to make it work. For Sovereign clouds, you must use the appropriate cloud-specific constants.
Please refer to Connect to all regions using Azure libraries for Python Multi-cloud | Microsoft Docs for instructions
to connect with Python in Sovereign clouds.
def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n\n")
def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))
3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create the data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details. Set subscription_id variable to the ID of your Azure
subscription. For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory: Products
available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight,
etc.) used by data factory can be in other regions.
def main():
# Azure subscription ID
subscription_id = '<subscription ID>'
# This program creates this resource group. If it's an existing resource group, comment out the code that creates the resource group
rg_name = '<resource group>'
# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ClientSecretCredential(client_id='<Application (client) ID>', client_secret='<client secret value>', tenant_id='<tenant ID>')
# Specify the following for Sovereign clouds: import the right cloud constant and then use it to connect.
# from msrestazure.azure_cloud import AZURE_PUBLIC_CLOUD as CLOUD
# credentials = DefaultAzureCredential(authority=CLOUD.endpoints.active_directory, tenant_id=tenant_id)
rg_params = {'location':'westus'}
df_params = {'location':'westus'}
Create a data factory
Add the following code to the Main method that creates a data factory. If your resource group already exists,
comment out the first create_or_update statement.
# IMPORTANT: specify the name and key of your Azure Storage account.
storage_string = SecureString(value='DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;EndpointSuffix=<suffix>')
ls_azure_storage = LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=storage_string))
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)
Create datasets
In this section, you create two datasets: one for the source and the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For information about the properties
of the Azure Blob dataset, see the Azure Blob connector article.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you created in the previous step.
# Create an Azure blob dataset (input)
ds_name = 'ds_in'
ds_ls = LinkedServiceReference(reference_name=ls_name)
blob_path = '<container>/<folder path>'
blob_filename = '<file name>'
ds_azure_blob = DatasetResource(properties=AzureBlobDataset(
linked_service_name=ds_ls, folder_path=blob_path, file_name=blob_filename))
ds = adf_client.datasets.create_or_update(
rg_name, df_name, ds_name, ds_azure_blob)
print_item(ds)
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity.
# Note1: To pass parameters to the pipeline, add them to the json string params_for_pipeline shown below in the format { "ParameterName1" : "ParameterValue1" } for each of the parameters needed in the pipeline.
# Note2: To pass parameters to a dataflow, create a pipeline parameter to hold the parameter name/value, and then consume the pipeline parameter in the dataflow parameter in the format @pipeline().parameters.parametername.
p_name = 'copyPipeline'
params_for_pipeline = {}
p_obj = PipelineResource(activities=[copy_activity], parameters=params_for_pipeline)
p = adf_client.pipelines.create_or_update(rg_name, df_name, p_name, p_obj)
print_item(p)
Now, add the following statement to invoke the main method when the program is run:
Full script
Here is the full Python code:
def print_item(group):
"""Print an Azure object instance."""
print("\tName: {}".format(group.name))
print("\tId: {}".format(group.id))
if hasattr(group, 'location'):
print("\tLocation: {}".format(group.location))
if hasattr(group, 'tags'):
print("\tTags: {}".format(group.tags))
if hasattr(group, 'properties'):
print_properties(group.properties)
def print_properties(props):
"""Print a ResourceGroup properties instance."""
if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
print("\tProperties:")
print("\t\tProvisioning State: {}".format(props.provisioning_state))
print("\n\n")
def print_activity_run_details(activity_run):
"""Print activity run details."""
print("\n\tActivity run details\n")
print("\tActivity run status: {}".format(activity_run.status))
if activity_run.status == 'Succeeded':
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
else:
print("\tErrors: {}".format(activity_run.error['message']))
def main():
# Azure subscription ID
subscription_id = '<subscription ID>'
# This program creates this resource group. If it's an existing resource group, comment out the code that creates the resource group
rg_name = '<resource group>'
# Specify your Active Directory client ID, client secret, and tenant ID
credentials = ClientSecretCredential(client_id='<service principal ID>', client_secret='<service principal key>', tenant_id='<tenant ID>')
resource_client = ResourceManagementClient(credentials, subscription_id)
adf_client = DataFactoryManagementClient(credentials, subscription_id)
rg_params = {'location':'westus'}
df_params = {'location':'westus'}
# IMPORTANT: specify the name and key of your Azure Storage account.
storage_string = SecureString(value='DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;EndpointSuffix=<suffix>')
ls_azure_storage = LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=storage_string))
ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
print_item(ls)
Name: storageLinkedService
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/linkedservices/storageLinkedService
Name: ds_in
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_in
Name: ds_out
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/datasets/ds_out
Name: copyPipeline
Id: /subscriptions/<subscription ID>/resourceGroups/<resource group
name>/providers/Microsoft.DataFactory/factories/<data factory name>/pipelines/copyPipeline
Clean up resources
To delete the data factory, add the following code to the program:
adf_client.factories.delete(rg_name, df_name)
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. Go
through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create an Azure data factory and
pipeline by using the REST API
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure subscription . If you don't have a subscription, you can create a free trial account.
Azure Storage account . You use the blob storage as source and sink data store. If you don't have an
Azure storage account, see the Create a storage account article for steps to create one.
Create a blob container in Blob Storage, create an input folder in the container, and upload some files to
the folder. You can use tools such as Azure Storage Explorer to connect to Azure Blob storage, create a blob
container, upload input file, and verify the output file.
Install Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell. This
quickstart uses PowerShell to invoke REST API calls.
Create an application in Azure Active Directory by following this instruction. Make note of the following
values that you use in later steps: application ID, clientSecrets, and tenant ID. Assign the application to the
"Contributor" role.
NOTE
For Sovereign clouds, you must use the appropriate cloud-specific endpoints for ActiveDirectoryAuthority and
ResourceManagerUrl (BaseUri). You can use PowerShell to easily get the endpoint URLs for various clouds by executing
"Get-AzEnvironment | Format-List", which returns a list of endpoints for each cloud environment.
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:
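For example, with the Az module this can be done as follows (the subscription ID value is a placeholder):
Set-AzContext -SubscriptionId "<SubscriptionId>"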
2. Run the following commands, after replacing the placeholders with your own values, to set global
variables to be used in later steps.
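A sketch of those variable assignments, based on the placeholders and variable names that the REST calls later in this quickstart reference (the api-version value matches the version shown in the sample responses), is:
$tenantId = "<your tenant ID>"
$appId = "<application (client) ID>"
$clientSecrets = "<client secret value>"
$subscriptionId = "<your Azure subscription ID>"
$resourceGroupName = "<your resource group name>"
$factoryName = "<globally unique data factory name>"
$apiVersion = "2018-06-01"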
$AuthContext = [Microsoft.IdentityModel.Clients.ActiveDirectory.AuthenticationContext]"https://login.microsoftonline.com/${tenantId}"
$cred = New-Object -TypeName Microsoft.IdentityModel.Clients.ActiveDirectory.ClientCredential -ArgumentList ($appId, $clientSecrets)
$result = $AuthContext.AcquireTokenAsync("https://management.core.windows.net/", $cred).GetAwaiter().GetResult()
$authHeader = @{
'Content-Type'='application/json'
'Accept'='application/json'
'Authorization'=$result.CreateAuthorizationHeader()
}
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
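The request that creates the data factory is a sketch following the same pattern as the other calls in this quickstart, consistent with the sample response below (East US location, system-assigned identity):
$request = "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factories/${factoryName}?api-version=${apiVersion}"
$body = @"
{
"location": "East US"
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json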
Here is the sample response:
{
"name":"<dataFactoryName>",
"identity":{
"type":"SystemAssigned",
"principalId":"<service principal ID>",
"tenantId":"<tenant ID>"
},
"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>",
"type":"Microsoft.DataFactory/factories",
"properties":{
"provisioningState":"Succeeded",
"createTime":"2019-09-03T02:10:27.056273Z",
"version":"2018-06-01"
},
"eTag":"\"0200c876-0000-0100-0000-5d6dcb930000\"",
"location":"East US",
"tags":{
}
}
$request = "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factories/${factoryName}/linkedservices/AzureStorageLinkedService?api-version=${apiVersion}"
$body = @"
{
"name":"AzureStorageLinkedService",
"properties":{
"annotations":[
],
"type":"AzureBlobStorage",
"typeProperties":{
"connectionString":"DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>"
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>/linkedservices/AzureStorageLinkedService",
"name":"AzureStorageLinkedService",
"type":"Microsoft.DataFactory/factories/linkedservices",
"properties":{
"annotations":[
],
"type":"AzureBlobStorage",
"typeProperties":{
"connectionString":"DefaultEndpointsProtocol=https;AccountName=<accountName>;"
}
},
"etag":"07011a57-0000-0100-0000-5d6e14a20000"
}
Create datasets
You define a dataset that represents the data to copy from a source to a sink. In this example, you create two
datasets: InputDataset and OutputDataset. They refer to the Azure Storage linked service that you created in the
previous section. The input dataset represents the source data in the input folder. In the input dataset definition,
you specify the blob container (adftutorial), the folder (input), and the file (emp.txt) that contain the source data.
The output dataset represents the data that's copied to the destination. In the output dataset definition, you
specify the blob container (adftutorial), the folder (output), and the file to which the data is copied.
Create InputDataset
$request = "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factories/${factoryName}/datasets/InputDataset?api-version=${apiVersion}"
$body = @"
{
"name":"InputDataset",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[
],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":"emp.txt",
"folderPath":"input",
"container":"adftutorial"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>/datasets/InputDataset",
"name":"InputDataset",
"type":"Microsoft.DataFactory/factories/datasets",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[
],
"type":"Binary",
"typeProperties":{
"location":"@{type=AzureBlobStorageLocation; fileName=emp.txt; folderPath=input;
container=adftutorial}"
}
},
"etag":"07011c57-0000-0100-0000-5d6e14b40000"
}
Create OutputDataset
$request = "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factories/${factoryName}/datasets/OutputDataset?api-version=${apiVersion}"
$body = @"
{
"name":"OutputDataset",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[
],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"folderPath":"output",
"container":"adftutorial"
}
}
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>/datasets/OutputDataset",
"name":"OutputDataset",
"type":"Microsoft.DataFactory/factories/datasets",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[
],
"type":"Binary",
"typeProperties":{
"location":"@{type=AzureBlobStorageLocation; folderPath=output; container=adftutorial}"
}
},
"etag":"07013257-0000-0100-0000-5d6e18920000"
}
Create pipeline
In this example, this pipeline contains one Copy activity. The Copy activity refers to the "InputDataset" and the
"OutputDataset" created in the previous step as input and output.
$request = "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factories/${factoryName}/pipelines/Adfv2QuickStartPipeline?api-version=${apiVersion}"
$body = @"
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "InputDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "OutputDataset",
"type": "DatasetReference"
}
]
}
],
"annotations": []
}
}
"@
$response = Invoke-RestMethod -Method PUT -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>/pipelines/Adfv2QuickStartPipeline",
"name":"Adfv2QuickStartPipeline",
"type":"Microsoft.DataFactory/factories/pipelines",
"properties":{
"activities":[
"@{name=CopyFromBlobToBlob; type=Copy; dependsOn=System.Object[]; policy=;
userProperties=System.Object[]; typeProperties=; inputs=System.Object[]; outputs=System.Object[]}"
],
"annotations":[
]
},
"etag":"07012057-0000-0100-0000-5d6e14c00000"
}
$request = "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factories/${factoryName}/pipelines/Adfv2QuickStartPipeline/createRun?api-version=${apiVersion}"
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader -Body $body
$response | ConvertTo-Json
$runId = $response.runId
{
"runId":"04a2bb9a-71ea-4c31-b46e-75276b61bafc"
}
Monitor pipeline
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.
$request = "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factories/${factoryName}/pipelineruns/${runId}?api-version=${apiVersion}"
while ($True) {
$response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
Write-Host "Pipeline run status: " $response.Status -ForegroundColor "Yellow"
if (($response.Status -ne "InProgress") -and ($response.Status -ne "Queued")) { break }
Start-Sleep -Seconds 15
}
$response | ConvertTo-Json
Here is the sample output of the pipeline run:
},
"invokedBy":{
"id":"2bb3938176ee43439752475aa12b2251",
"name":"Manual",
"invokedByType":"Manual"
},
"runStart":"2019-09-03T07:22:47.0075159Z",
"runEnd":"2019-09-03T07:22:57.8862692Z",
"durationInMs":10878,
"status":"Succeeded",
"message":"",
"lastUpdated":"2019-09-03T07:22:57.8862692Z",
"annotations":[
],
"runDimension":{
},
"isLatest":true
}
2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.
$request = "https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factories/${factoryName}/pipelineruns/${runId}/queryActivityruns?api-version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader
$response | ConvertTo-Json
Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure resource
group, which includes all the resources in the resource group. If you want to keep the other resources intact,
delete only the data factory you created in this tutorial.
Run the following command to delete the entire resource group:
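For example, assuming the $resourceGroupName variable set earlier still holds your resource group name:
Remove-AzResourceGroup -ResourceGroupName $resourceGroupName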
Next steps
The pipeline in this sample copies data from one location to another location in an Azure blob storage. Go
through the tutorials to learn about using Data Factory in more scenarios.
Quickstart: Create an Azure Data Factory using
ARM template
NOTE
This article does not provide a detailed introduction of the Data Factory service. For an introduction to the Azure Data
Factory service, see Introduction to Azure Data Factory.
If your environment meets the prerequisites and you're familiar with using ARM templates, select the Deploy to
Azure button. The template will open in the Azure portal.
Prerequisites
Azure subscription
If you don't have an Azure subscription, create a free account before you begin.
Create a file
Open a text editor such as Notepad , and create a file named emp.txt with the following content:
John, Doe
Jane, Doe
Save the file in the C:\ADFv2QuickStartPSH folder. (If the folder doesn't already exist, create it.)
Review template
The template used in this quickstart is from Azure Quickstart Templates.
{
"$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"metadata": {
"_generator": {
"name": "bicep",
"version": "0.4.1.14562",
"templateHash": "8367564219536411224"
}
},
"parameters": {
"dataFactoryName": {
"type": "string",
"defaultValue": "[format('datafactory{0}', uniqueString(resourceGroup().id))]",
"metadata": {
"description": "Data Factory Name"
}
},
"location": {
"type": "string",
"defaultValue": "[resourceGroup().location]",
"metadata": {
"description": "Location of the data factory."
}
},
"storageAccountName": {
"type": "string",
"defaultValue": "[format('storage{0}', uniqueString(resourceGroup().id))]",
"metadata": {
"description": "Name of the Azure storage account that contains the input/output data."
}
},
"blobContainerName": {
"type": "string",
"defaultValue": "[format('blob{0}', uniqueString(resourceGroup().id))]",
"metadata": {
"description": "Name of the blob container in the Azure Storage account."
}
}
},
"functions": [],
"variables": {
"dataFactoryLinkedServiceName": "ArmtemplateStorageLinkedService",
"dataFactoryDataSetInName": "ArmtemplateTestDatasetIn",
"dataFactoryDataSetOutName": "ArmtemplateTestDatasetOut",
"pipelineName": "ArmtemplateSampleCopyPipeline"
},
"resources": [
{
"type": "Microsoft.Storage/storageAccounts",
"apiVersion": "2021-04-01",
"name": "[parameters('storageAccountName')]",
"location": "[parameters('location')]",
"sku": {
"name": "Standard_LRS"
},
"kind": "StorageV2"
},
{
"type": "Microsoft.Storage/storageAccounts/blobServices/containers",
"apiVersion": "2021-04-01",
"name": "[format('{0}/default/{1}', parameters('storageAccountName'),
parameters('blobContainerName'))]",
"dependsOn": [
"[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]"
]
},
{
"type": "Microsoft.DataFactory/factories",
"apiVersion": "2018-06-01",
"name": "[parameters('dataFactoryName')]",
"location": "[parameters('location')]",
"identity": {
"type": "SystemAssigned"
}
},
{
"type": "Microsoft.DataFactory/factories/linkedservices",
"type": "Microsoft.DataFactory/factories/linkedservices",
"apiVersion": "2018-06-01",
"name": "[format('{0}/{1}', parameters('dataFactoryName'),
variables('dataFactoryLinkedServiceName'))]",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "[format('DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}',
parameters('storageAccountName'), listKeys(resourceId('Microsoft.Storage/storageAccounts',
parameters('storageAccountName')), '2021-04-01').keys[0].value)]"
}
},
"dependsOn": [
"[resourceId('Microsoft.DataFactory/factories', parameters('dataFactoryName'))]",
"[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]"
]
},
{
"type": "Microsoft.DataFactory/factories/datasets",
"apiVersion": "2018-06-01",
"name": "[format('{0}/{1}', parameters('dataFactoryName'), variables('dataFactoryDataSetInName'))]",
"properties": {
"linkedServiceName": {
"referenceName": "[variables('dataFactoryLinkedServiceName')]",
"type": "LinkedServiceReference"
},
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "[format('{0}/default/{1}', parameters('storageAccountName'),
parameters('blobContainerName'))]",
"folderPath": "input",
"fileName": "emp.txt"
}
}
},
"dependsOn": [
"[resourceId('Microsoft.Storage/storageAccounts/blobServices/containers',
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')[0],
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')[1],
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')
[2])]",
"[resourceId('Microsoft.DataFactory/factories', parameters('dataFactoryName'))]",
"[resourceId('Microsoft.DataFactory/factories/linkedservices', parameters('dataFactoryName'),
variables('dataFactoryLinkedServiceName'))]"
]
},
{
"type": "Microsoft.DataFactory/factories/datasets",
"apiVersion": "2018-06-01",
"name": "[format('{0}/{1}', parameters('dataFactoryName'), variables('dataFactoryDataSetOutName'))]",
"properties": {
"linkedServiceName": {
"referenceName": "[variables('dataFactoryLinkedServiceName')]",
"type": "LinkedServiceReference"
},
"type": "Binary",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "[format('{0}/default/{1}', parameters('storageAccountName'),
parameters('blobContainerName'))]",
"folderPath": "output"
}
}
},
"dependsOn": [
"[resourceId('Microsoft.Storage/storageAccounts/blobServices/containers',
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')[0],
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')[1],
split(format('{0}/default/{1}', parameters('storageAccountName'), parameters('blobContainerName')), '/')
[2])]",
"[resourceId('Microsoft.DataFactory/factories', parameters('dataFactoryName'))]",
"[resourceId('Microsoft.DataFactory/factories/linkedservices', parameters('dataFactoryName'),
variables('dataFactoryLinkedServiceName'))]"
]
},
{
"type": "Microsoft.DataFactory/factories/pipelines",
"apiVersion": "2018-06-01",
"name": "[format('{0}/{1}', parameters('dataFactoryName'), variables('pipelineName'))]",
"properties": {
"activities": [
{
"name": "MyCopyActivity",
"type": "Copy",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobStorageWriterSettings"
}
},
"enableStaging": false
},
"inputs": [
{
"referenceName": "[variables('dataFactoryDataSetInName')]",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "[variables('dataFactoryDataSetOutName')]",
"type": "DatasetReference"
}
]
}
]
},
"dependsOn": [
"[resourceId('Microsoft.DataFactory/factories', parameters('dataFactoryName'))]",
"[resourceId('Microsoft.DataFactory/factories/datasets', parameters('dataFactoryName'),
variables('dataFactoryDataSetInName'))]",
"[resourceId('Microsoft.DataFactory/factories/datasets', parameters('dataFactoryName'),
variables('dataFactoryDataSetOutName'))]"
]
}
]
}
Unless it's specified, use the default values to create the Azure Data Factory resources:
Subscription : Select an Azure subscription.
Resource group : Select Create new , enter a unique name for the resource group, and then select
OK .
Region : Select a location. For example, East US .
Data Factory Name : Use default value.
Location : Use default value.
Storage Account Name : Use default value.
Blob Container : Use default value.
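The template is normally deployed through the portal's Deploy to Azure experience described above, but you can also deploy it with Azure PowerShell. A rough sketch, assuming you've saved the template JSON shown above as azuredeploy.json in the current folder and that the ADFQuickStartRG resource group name is acceptable:
New-AzResourceGroup -Name ADFQuickStartRG -Location "East US"
New-AzResourceGroupDeployment `
-ResourceGroupName ADFQuickStartRG `
-TemplateFile ".\azuredeploy.json"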
Upload a file
1. On the Containers page, select Upload .
2. In the right pane, select the Files box, and then browse to and select the emp.txt file that you created
earlier.
3. Expand the Advanced heading.
4. In the Upload to folder box, enter input.
5. Select the Upload button. You should see the emp.txt file and the status of the upload in the list.
6. Select the Close icon (an X ) to close the Upload blob page.
Keep the container page open, because you can use it to verify the output at the end of this quickstart.
Start Trigger
1. Navigate to the Data factories page, and select the data factory you created.
2. Select Open on the Open Azure Data Factory Studio tile.
Clean up resources
You can clean up the resources that you created in the Quickstart in two ways. You can delete the Azure resource
group, which includes all the resources in the resource group. If you want to keep the other resources intact,
delete only the data factory you created in this tutorial.
Deleting a resource group deletes all resources including data factories in it. Run the following command to
delete the entire resource group:
Remove-AzResourceGroup -ResourceGroupName $resourcegroupname
If you want to delete just the data factory, and not the entire resource group, run the following command:
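For example, with the Az PowerShell module and the data factory name shown in the portal (a placeholder here), the command would look something like this:
Remove-AzDataFactoryV2 -Name "<your data factory name>" -ResourceGroupName $resourcegroupname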
Next steps
In this quickstart, you created an Azure Data Factory using an ARM template and validated the deployment. To
learn more about Azure Data Factory and Azure Resource Manager, continue on to the articles below.
Azure Data Factory documentation
Learn more about Azure Resource Manager
Get other Azure Data Factory ARM templates
Create Azure Data Factory Data Flow
Once you are in the Data Factory UI, you can use sample Data Flows. The samples are available from the ADF
Template Gallery. In ADF, select the Pipeline templates tile in the Discover more section of the homepage, and
select the Data Flow category from the template gallery.
You will be prompted to enter your Azure Blob Storage account information.
The data used for these samples can be found here. Download the sample data and store the files in your Azure
Blob storage accounts so that you can execute the samples.
Data flows
Data flow tutorial videos
Code-free data transformation at scale
Delta lake transformations
Data wrangling with Power Query
Data flows inside managed VNet
Best practices for lake data in ADLS Gen2
Dynamically set column names
Pipelines
Control flow
SSIS
SSIS integration runtime
Data share
Data integration with Azure Data Share
Data lineage
Azure Purview
Next steps
Learn more about Data Factory pipelines and data flows.
Copy data from Azure Blob storage to a SQL
Database by using the Copy Data tool
NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription : If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account : Use Blob storage as the source data store. If you don't have an Azure Storage
account, see the instructions in Create a storage account.
Azure SQL Database : Use a SQL Database as the sink data store. If you don't have a SQL Database, see the
instructions in Create a SQL Database.
Create a blob and a SQL table
Prepare your Blob storage and your SQL Database for the tutorial by performing these steps.
Create a source blob
1. Launch Notepad . Copy the following text and save it in a file named inputEmp.txt on your disk:
FirstName|LastName
John|Doe
Jane|Doe
2. Create a container named adfv2tutorial and upload the inputEmp.txt file to the container. You can use
the Azure portal or various tools like Azure Storage Explorer to perform these tasks.
Create a sink SQL table
1. Use the following SQL script to create a table named dbo.emp in your SQL Database:
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
CREATE CLUSTERED INDEX IX_emp_ID ON dbo.emp (ID);
2. Allow Azure services to access SQL Server. Verify that the setting Allow Azure services and resources
to access this server is enabled for your server that's running SQL Database. This setting lets Data
Factory write data to your database instance. To verify and turn on this setting, go to logical SQL server >
Security > Firewalls and virtual networks > set the Allow Azure services and resources to access
this server option to ON.
NOTE
The option to Allow Azure services and resources to access this server enables network access to your
SQL Server from any Azure resource, not just those in your subscription. For more information, see Azure SQL
Server Firewall rules. Instead, you can use Private endpoints to connect to Azure PaaS services without using
public IPs.
2. On the Properties page of the Copy Data tool, choose Built-in copy task under Task type, then select
Next .
3. On the Source data store page, complete the following steps:
a. Select + Create new connection to add a connection.
b. Select Azure Blob Storage from the gallery, and then select Continue .
c. On the New connection (Azure Blob Storage) page, select your Azure subscription from the Azure
subscription list, and select your storage account from the Storage account name list. Test connection
and then select Create .
d. Select the newly created linked service as source in the Connection block.
e. In the File or folder section, select Browse to navigate to the adfv2tutorial folder, select the
inputEmp.txt file, then select OK .
f. Select Next to move to next step.
4. On the File format settings page, enable the checkbox for First row as header. Notice that the tool
automatically detects the column and row delimiters, and you can preview data and view the schema of
the input data by selecting Preview data button on this page. Then select Next .
5. On the Destination data store page, complete the following steps:
a. Select + Create new connection to add a connection.
b. Select Azure SQL Database from the gallery, and then select Continue .
c. On the New connection (Azure SQL Database) page, select your Azure subscription, server name
and database name from the dropdown list. Then select SQL authentication under Authentication
type , specify the username and password. Test connection and select Create .
d. Select the newly created linked service as sink, then select Next .
6. On the Destination data store page, select Use existing table and select the dbo.emp table. Then
select Next .
7. On the Column mapping page, notice that the second and the third columns in the input file are
mapped to the FirstName and LastName columns of the emp table. Adjust the mapping to make sure
that there is no error, and then select Next .
8. On the Settings page, under Task name , enter CopyFromBlobToSqlPipeline , and then select Next .
9. On the Summary page, review the settings, and then select Next.
10. On the Deployment page, select Monitor to monitor the pipeline (task).
11. On the Pipeline runs page, select Refresh to refresh the list. Select the link under Pipeline name to view
activity run details or rerun the pipeline.
12. On the "Activity runs" page, select the Details link (eyeglasses icon) under Activity name column for
more details about copy operation. To go back to the "Pipeline runs" view, select the All pipeline runs
link in the breadcrumb menu. To refresh the view, select Refresh .
13. Verify that the data is inserted into the dbo.emp table in your SQL Database.
14. Select the Author tab on the left to switch to the editor mode. You can update the linked services,
datasets, and pipelines that were created via the tool by using the editor. For details on editing these
entities in the Data Factory UI, see the Azure portal version of this tutorial.
Next steps
The pipeline in this sample copies data from Blob storage to a SQL Database. You learned how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob storage to a database
in Azure SQL Database by using Azure Data
Factory
NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use Blob storage as a source data store. If you don't have a storage account,
see Create an Azure storage account for steps to create one.
Azure SQL Database . You use the database as a sink data store. If you don't have a database in Azure SQL
Database, see the Create a database in Azure SQL Database for steps to create one.
Create a blob and a SQL table
Now, prepare your Blob storage and SQL database for the tutorial by performing the following steps.
Create a source blob
1. Launch Notepad. Copy the following text, and save it as an emp.txt file on your disk:
FirstName,LastName
John,Doe
Jane,Doe
2. Create a container named adftutorial in your Blob storage. Create a folder named input in this
container. Then, upload the emp.txt file to the input folder. Use the Azure portal or tools such as Azure
Storage Explorer to do these tasks.
Create a sink SQL table
1. Use the following SQL script to create the dbo.emp table in your database:
2. Allow Azure services to access SQL Server. Ensure that Allow access to Azure services is turned ON
for your SQL Server so that Data Factory can write data to your SQL Server. To verify and turn on this
setting, go to logical SQL server > Overview > Set server firewall > set the Allow access to Azure
services option to ON.
Create a pipeline
In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob
storage to SQL Database. In the Quickstart tutorial, you created a pipeline by following these steps:
1. Create the linked service.
2. Create input and output datasets.
3. Create a pipeline.
In this tutorial, you start with creating the pipeline. Then you create linked services and datasets when you need
them to configure the pipeline.
1. On the home page, select Orchestrate .
2. In the General panel under Properties, specify CopyPipeline for Name. Then collapse the panel by
clicking the Properties icon in the top-right corner.
3. In the Activities tool box, expand the Move and Transform category, and drag and drop the Copy
Data activity from the tool box to the pipeline designer surface. Specify CopyFromBlobToSql for
Name .
Configure source
TIP
In this tutorial, you use Account key as the authentication type for your source data store, but you can choose other
supported authentication methods: SAS URI, Service Principal, and Managed Identity, if needed. Refer to the corresponding
sections in this article for details. To store secrets for data stores securely, it's also recommended to use an Azure Key
Vault. Refer to this article for detailed illustrations.
TIP
In this tutorial, you use SQL authentication as the authentication type for your sink data store, but you can choose other
supported authentication methods: Service Principal and Managed Identity if needed. Refer to corresponding sections in
this article for details. To store secrets for data stores securely, it's also recommended to use an Azure Key Vault. Refer to
this article for detailed illustrations.
4. Verify that two more rows are added to the emp table in the database.
5. On the Edit trigger page, review the warning, and then select Save . The pipeline in this example doesn't
take any parameters.
6. Click Publish all to publish the change.
7. Go to the Monitor tab on the left to see the triggered pipeline runs.
8. To switch from the Pipeline Runs view to the Trigger Runs view, select Trigger Runs on the left side
of the window.
9. You see the trigger runs in a list.
10. Verify that two rows per minute (for each pipeline run) are inserted into the emp table until the specified
end time.
Next steps
The pipeline in this sample copies data from one location to another location in Blob storage. You learned how
to:
Create a data factory.
Create a pipeline with a copy activity.
Test run the pipeline.
Trigger the pipeline manually.
Trigger the pipeline on a schedule.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn how to copy data from on-premises to the cloud:
Copy data from on-premises to the cloud
Copy data from Azure Blob to Azure SQL Database
using Azure Data Factory
Prerequisites
Azure Storage account. You use the blob storage as source data store. If you don't have an Azure storage
account, see Create a general-purpose storage account.
Azure SQL Database. You use the database as sink data store. If you don't have a database in Azure SQL
Database, see the Create a database in Azure SQL Database.
Visual Studio. The walkthrough in this article uses Visual Studio 2019.
Azure SDK for .NET.
Azure Active Directory application. If you don't have an Azure Active Directory application, see the Create an
Azure Active Directory application section of How to: Use the portal to create an Azure AD application. Copy
the following values for use in later steps: Application (client) ID, authentication key, and Directory
(tenant) ID. Assign the application to the Contributor role by following the instructions in the same article.
Create a blob and a SQL table
Now, prepare your Azure Blob and Azure SQL Database for the tutorial by creating a source blob and a sink SQL
table.
Create a source blob
First, create a source blob by creating a container and uploading an input text file to it:
1. Open Notepad. Copy the following text and save it locally to a file named inputEmp.txt.
John|Doe
Jane|Doe
2. Use a tool such as Azure Storage Explorer to create the adfv2tutorial container, and to upload the
inputEmp.txt file to the container.
Create a sink SQL table
Next, create a sink SQL table:
1. Use the following SQL script to create the dbo.emp table in your Azure SQL Database.
2. Allow Azure services to access SQL Database. Ensure that you allow access to Azure services in your
server so that the Data Factory service can write data to SQL Database. To verify and turn on this setting,
do the following steps:
a. Go to the Azure portal to manage your SQL server. Search for and select SQL servers.
b. Select your server.
c. Under the SQL server menu's Security heading, select Firewalls and virtual networks.
d. In the Firewall and virtual networks page, under Allow Azure services and resources to
access this server, select ON.
Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager -IncludePrerelease
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Rest;
using Microsoft.Rest.Serialization;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
2. Add the following code to the Main method that sets variables. Replace the 14 placeholders with your
own values.
To see the list of Azure regions in which Data Factory is currently available, see Products available by
region. Under the Products drop-down list, choose Browse > Analytics > Data Factory. Then in the
Regions drop-down list, choose the regions that interest you. A grid appears with the availability status
of Data Factory products for your selected regions.
NOTE
Data stores, such as Azure Storage and Azure SQL Database, and computes, such as HDInsight, that Data Factory
uses can be in other regions than what you choose for Data Factory.
// Set variables
string tenantID = "<your tenant ID>";
string applicationId = "<your application ID>";
string authenticationKey = "<your authentication key for the application>";
string subscriptionId = "<your subscription ID to create the factory>";
string resourceGroup = "<your resource group to create the factory>";
string region = "<location to create the data factory in, such as East US>";
string dataFactoryName = "<name of data factory to create (must be globally unique)>";
3. Add the following code to the Main method that creates an instance of DataFactoryManagementClient
class. You use this object to create a data factory, linked service, datasets, and pipeline. You also use this
object to monitor the pipeline run details.
while (
client.Factories.Get(
resourceGroup, dataFactoryName
).ProvisioningState == "PendingCreation"
)
{
System.Threading.Thread.Sleep(1000);
}
client.LinkedServices.CreateOrUpdate(
resourceGroup, dataFactoryName, storageLinkedServiceName, storageLinkedService
);
Console.WriteLine(
SafeJsonConvert.SerializeObject(storageLinkedService, client.SerializationSettings)
);
client.LinkedServices.CreateOrUpdate(
resourceGroup, dataFactoryName, sqlDbLinkedServiceName, sqlDbLinkedService
);
Console.WriteLine(
SafeJsonConvert.SerializeObject(sqlDbLinkedService, client.SerializationSettings)
);
Create datasets
In this section, you create two datasets: one for the source, the other for the sink.
Create a dataset for source Azure Blob
Add the following code to the Main method that creates an Azure blob dataset. For information about
supported properties and details, see Azure Blob dataset properties.
You define a dataset that represents the source data in Azure Blob. This Blob dataset refers to the Azure Storage
linked service you create in the previous step, and describes:
The location of the blob to copy from: FolderPath and FileName
The blob format indicating how to parse the content: TextFormat and its settings, such as column delimiter
The data structure, including column names and data types, which map in this example to the sink SQL table
// Create an Azure Blob dataset
Console.WriteLine("Creating dataset " + blobDatasetName + "...");
DatasetResource blobDataset = new DatasetResource(
new AzureBlobDataset
{
LinkedServiceName = new LinkedServiceReference {
ReferenceName = storageLinkedServiceName
},
FolderPath = inputBlobPath,
FileName = inputBlobName,
Format = new TextFormat { ColumnDelimiter = "|" },
Structure = new List<DatasetDataElement>
{
new DatasetDataElement { Name = "FirstName", Type = "String" },
new DatasetDataElement { Name = "LastName", Type = "String" }
}
}
);
client.Datasets.CreateOrUpdate(
resourceGroup, dataFactoryName, blobDatasetName, blobDataset
);
Console.WriteLine(
SafeJsonConvert.SerializeObject(blobDataset, client.SerializationSettings)
);
client.Datasets.CreateOrUpdate(
resourceGroup, dataFactoryName, sqlDatasetName, sqlDataset
);
Console.WriteLine(
SafeJsonConvert.SerializeObject(sqlDataset, client.SerializationSettings)
);
Create a pipeline
Add the following code to the Main method that creates a pipeline with a copy activity. In this tutorial, this
pipeline contains one activity: CopyActivity , which takes in the Blob dataset as source and the SQL dataset as
sink. For information about copy activity details, see Copy activity in Azure Data Factory.
// Create a pipeline with copy activity
Console.WriteLine("Creating pipeline " + pipelineName + "...");
PipelineResource pipeline = new PipelineResource
{
Activities = new List<Activity>
{
new CopyActivity
{
Name = "CopyFromBlobToSQL",
Inputs = new List<DatasetReference>
{
new DatasetReference() { ReferenceName = blobDatasetName }
},
Outputs = new List<DatasetReference>
{
new DatasetReference { ReferenceName = sqlDatasetName }
},
Source = new BlobSource { },
Sink = new SqlSink { }
}
}
};
if (pipelineRun.Status == "Succeeded")
{
Console.WriteLine(queryResponse.Value.First().Output);
}
else
Console.WriteLine(queryResponse.Value.First().Error);
Next steps
The pipeline in this sample copies data from Azure Blob storage to Azure SQL Database. You learned how to:
Create a data factory.
Create Azure Storage and Azure SQL Database linked services.
Create Azure Blob and Azure SQL Database datasets.
Create a pipeline containing a copy activity.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data from on-premises to cloud:
Copy data from on-premises to cloud
Copy data from a SQL Server database to Azure
Blob storage by using the Copy Data tool
7/13/2021 • 8 minutes to read
NOTE
If you're new to Azure Data Factory, see Introduction to Data Factory.
Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to log in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. Select your user name in the
upper-right corner, and then select Permissions . If you have access to multiple subscriptions, select the
appropriate subscription. For sample instructions on how to add a user to a role, see Assign Azure roles using
the Azure portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use a SQL Server database as a source data store. The pipeline in the data factory you create
in this tutorial copies data from this SQL Server database (source) to Blob storage (sink). You then create a table
named emp in your SQL Server database and insert a couple of sample entries into the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases , and then select New Database .
4. In the New Database window, enter a name for the database, and then select OK .
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Quer y .
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
-- Insert sample rows (the names here are sample values)
INSERT INTO emp (FirstName, LastName) VALUES ('John', 'Doe')
INSERT INTO emp (FirstName, LastName) VALUES ('Jane', 'Doe')
GO
3. In the list of storage accounts, filter for your storage account, if needed. Then select your storage account.
4. In the Storage account window, select Access keys .
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
2. On the Properties page of the Copy Data tool, choose Built-in copy task under Task type, and choose
Run once now under Task cadence or task schedule, then select Next.
3. On the Source data store page, select + Create new connection.
4. Under New connection, search for SQL Server, and then select Continue.
5. In the New connection (SQL Server) dialog box, under Name, enter SqlServerLinkedService.
Select +New under Connect via integration runtime . You must create a self-hosted integration
runtime, download it to your machine, and register it with Data Factory. The self-hosted integration
runtime copies data between your on-premises environment and the cloud.
6. In the Integration runtime setup dialog box, select Self-Hosted . Then select Continue .
7. In the Integration runtime setup dialog box, under Name , enter TutorialIntegrationRuntime . Then
select Create .
8. In the Integration runtime setup dialog box, select Click here to launch the express setup for
this computer . This action installs the integration runtime on your machine and registers it with Data
Factory. Alternatively, you can use the manual setup option to download the installation file, run it, and
use the key to register the integration runtime.
9. Run the downloaded application. You see the status of the express setup in the window.
10. In the New Connection (SQL Server) dialog box, confirm that TutorialIntegrationRuntime is
selected under Connect via integration runtime. Then, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Server name, enter the name of your SQL Server instance.
c. Under Database name, enter the name of your on-premises database.
d. Under Authentication type, select the appropriate authentication.
e. Under User name, enter the name of a user with access to SQL Server.
f. Enter the Password for the user.
g. Test the connection, and select Create.
11. On the Source data store page, ensure that the newly created SQL Server connection is selected in
the Connection block. Then in the Source tables section, choose EXISTING TABLES, select the
dbo.emp table in the list, and select Next. You can select any other table based on your database.
12. On the Apply filter page, you can preview data and view the schema of the input data by selecting the
Preview data button. Then select Next .
13. On the Destination data store page, select + Create new connection
14. In New connection , search and select Azure Blob Storage , and then select Continue .
15. On the New connection (Azure Blob Storage) dialog, take the following steps:
a. Under Name, enter AzureStorageLinkedService.
b. Under Connect via integration runtime , select TutorialIntegrationRuntime , and select Account
key under Authentication method .
c. Under Azure subscription , select your Azure subscription from the drop-down list.
d. Under Storage account name , select your storage account from the drop-down list.
e. Test connection and select Create .
16. In the Destination data store dialog, make sure that the newly created Azure Blob Storage
connection is selected in the Connection block. Then under Folder path , enter
adftutorial/fromonprem . You created the adftutorial container as part of the prerequisites. If the
output folder doesn't exist (in this case fromonprem ), Data Factory automatically creates it. You can also
use the Browse button to browse the blob storage and its containers/folders. If you do not specify any
value under File name , by default the name from the source would be used (in this case dbo.emp ).
17. On the File format settings dialog, select Next .
18. On the Settings dialog, under Task name , enter CopyFromOnPremSqlToAzureBlobPipeline , and
then select Next . The Copy Data tool creates a pipeline with the name you specify for this field.
19. On the Summary dialog, review values for all the settings, and select Next.
20. On the Deployment page, select Monitor to monitor the pipeline (task).
21. When the pipeline run completes, you can view the status of the pipeline you created.
22. On the "Pipeline runs" page, select Refresh to refresh the list. Select the link under Pipeline name to
view activity run details or rerun the pipeline.
23. On the "Activity runs" page, select the Details link (eyeglasses icon) under the Activity name column
for more details about copy operation. To go back to the "Pipeline runs" page, select the All pipeline
runs link in the breadcrumb menu. To refresh the view, select Refresh .
24. Confirm that you see the output file in the fromonprem folder of the adftutorial container.
25. Select the Author tab on the left to switch to the editor mode. You can update the linked services,
datasets, and pipelines created by the tool by using the editor. Select Code to view the JSON code
associated with the entity opened in the editor. For details on how to edit these entities in the Data
Factory UI, see the Azure portal version of this tutorial.
Next steps
The pipeline in this sample copies data from a SQL Server database to Blob storage. You learned how to:
Create a data factory.
Use the Copy Data tool to create a pipeline.
Monitor the pipeline and activity runs.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn about how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Copy data from a SQL Server database to Azure
Blob storage
7/7/2021 • 9 minutes to read
NOTE
This article doesn't provide a detailed introduction to Data Factory. For more information, see Introduction to Data
Factory.
Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to sign in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal. In the upper-right corner, select
your user name, and then select Permissions . If you have access to multiple subscriptions, select the
appropriate subscription. For sample instructions on how to add a user to a role, see Assign Azure roles using
the Azure portal.
SQL Server 2014, 2016, and 2017
In this tutorial, you use a SQL Server database as a source data store. The pipeline in the data factory you create
in this tutorial copies data from this SQL Server database (source) to Blob storage (sink). You then create a table
named emp in your SQL Server database and insert a couple of sample entries into the table.
1. Start SQL Server Management Studio. If it's not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases , and then select New Database .
4. In the New Database window, enter a name for the database, and then select OK .
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Quer y .
3. In the list of storage accounts, filter for your storage account if needed. Then select your storage account.
4. In the Storage account window, select Access keys .
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Blob storage.
1. In the Storage account window, go to Overview, and then select Containers.
Create a pipeline
1. On the Azure Data Factory home page, select Orchestrate . A pipeline is automatically created for you.
You see the pipeline in the tree view, and its editor opens.
2. In the General panel under Properties, specify SQLServerToBlobPipeline for Name. Then collapse
the panel by clicking the Properties icon in the top-right corner.
3. In the Activities tool box, expand Move & Transform. Drag and drop the Copy activity to the pipeline
design surface. Set the name of the activity to CopySqlServerToAzureBlobActivity.
4. In the Properties window, go to the Source tab, and select + New.
5. In the New Dataset dialog box, search for SQL Server. Select SQL Server, and then select Continue.
6. In the Set Properties dialog box, under Name, enter SqlServerDataset. Under Linked service, select
+ New. You create a connection to the source data store (SQL Server database) in this step.
7. In the New Linked Service dialog box, add Name as SqlServerLinkedService. Under Connect via
integration runtime , select +New . In this section, you create a self-hosted integration runtime and
associate it with an on-premises machine with the SQL Server database. The self-hosted integration
runtime is the component that copies data from the SQL Server database on your machine to Blob
storage.
8. In the Integration Runtime Setup dialog box, select Self-Hosted , and then select Continue .
9. Under Name, enter TutorialIntegrationRuntime. Then select Create.
10. For Settings, select Click here to launch the express setup for this computer . This action installs
the integration runtime on your machine and registers it with Data Factory. Alternatively, you can use the
manual setup option to download the installation file, run it, and use the key to register the integration
runtime.
11. In the Integration Runtime (Self-hosted) Express Setup window, select Close when the process is
finished.
12. In the New linked service (SQL Server) dialog box, confirm that TutorialIntegrationRuntime is
selected under Connect via integration runtime. Then, take the following steps:
a. Under Name, enter SqlServerLinkedService.
b. Under Server name, enter the name of your SQL Server instance.
c. Under Database name , enter the name of the database with the emp table.
d. Under Authentication type , select the appropriate authentication type that Data Factory should use
to connect to your SQL Server database.
e. Under User name and Password , enter the user name and password. Use mydomain\myuser as user
name if needed.
f. Select Test connection . This step is to confirm that Data Factory can connect to your SQL Server
database by using the self-hosted integration runtime you created.
g. To save the linked service, select Create .
13. After the linked service is created, you're back to the Set properties page for the SqlServerDataset. Take
the following steps:
a. In Linked service, confirm that you see SqlServerLinkedService.
b. Under Table name, select [dbo].[emp].
c. Select OK.
14. Go to the tab with SQLServerToBlobPipeline, or select SQLServerToBlobPipeline in the tree view.
15. Go to the Sink tab at the bottom of the Properties window, and select + New.
16. In the New Dataset dialog box, select Azure Blob Storage . Then select Continue .
17. In Select Format dialog box, choose the format type of your data. Then select Continue .
18. In the Set Properties dialog box, enter AzureBlobDataset for Name. Next to the Linked service text
box, select + New.
19. In the New Linked Service (Azure Blob Storage) dialog box, enter AzureStorageLinkedService as
the name, and select your storage account from the Storage account name list. Test the connection, and then select
Create to deploy the linked service.
20. After the linked service is created, you're back to the Set properties page. Select OK.
21. Open the sink dataset. On the Connection tab, take the following steps:
a. In Linked service, confirm that AzureStorageLinkedService is selected.
b. In File path, enter adftutorial/fromonprem for the Container/Directory part. If the output folder
doesn't exist in the adftutorial container, Data Factory automatically creates the output folder.
c. For the File part, select Add dynamic content .
d. Add @CONCAT(pipeline().RunId, '.txt') , and then select Finish . This action will rename the file with
PipelineRunID.txt.
22. Go to the tab with the pipeline opened, or select the pipeline in the tree view. In Sink Dataset , confirm
that AzureBlobDataset is selected.
23. To validate the pipeline settings, select Validate on the toolbar for the pipeline. To close the Pipeline
validation output, select the >> icon.
24. To publish entities you created to Data Factory, select Publish all .
25. Wait until you see the Publishing completed pop-up. To check the status of publishing, select the Show
notifications link on the top of the window. To close the notification window, select Close .
3. On the Activity runs page, select the Details (eyeglasses image) link to see details about the copy
operation. To go back to the Pipeline Runs view, select All pipeline runs at the top.
Next steps
The pipeline in this sample copies data from a SQL Server database to Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Storage linked services.
Create SQL Server and Blob storage datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see Supported data stores.
To learn how to copy data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Tutorial: Copy data from a SQL Server database to
Azure Blob storage
3/5/2021 • 15 minutes to read
NOTE
This article does not provide a detailed introduction to the Data Factory service. For more information, see Introduction
to Azure Data Factory.
Prerequisites
Azure subscription
Before you begin, if you don't already have an Azure subscription, create a free account.
Azure roles
To create data factory instances, the user account you use to sign in to Azure must be assigned a Contributor or
Owner role or must be an administrator of the Azure subscription.
To view the permissions you have in the subscription, go to the Azure portal, select your username at the top-
right corner, and then select Permissions . If you have access to multiple subscriptions, select the appropriate
subscription. For sample instructions on adding a user to a role, see the Assign Azure roles using the Azure
portal article.
SQL Server 2014, 2016, and 2017
In this tutorial, you use a SQL Server database as a source data store. The pipeline in the data factory you create
in this tutorial copies data from this SQL Server database (source) to Azure Blob storage (sink). You then create a
table named emp in your SQL Server database, and insert a couple of sample entries into the table.
1. Start SQL Server Management Studio. If it is not already installed on your machine, go to Download SQL
Server Management Studio.
2. Connect to your SQL Server instance by using your credentials.
3. Create a sample database. In the tree view, right-click Databases , and then select New Database .
4. In the New Database window, enter a name for the database, and then select OK .
5. To create the emp table and insert some sample data into it, run the following query script against the
database. In the tree view, right-click the database that you created, and then select New Quer y .
3. In the list of storage accounts, filter for your storage account (if needed), and then select your storage
account.
4. In the Storage account window, select Access keys .
5. In the Storage account name and key1 boxes, copy the values, and then paste them into Notepad or
another editor for later use in the tutorial.
Create the adftutorial container
In this section, you create a blob container named adftutorial in your Azure Blob storage.
1. In the Storage account window, switch to Overview, and then select Blobs.
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Install the latest version of Azure PowerShell if you don't already have it on your machine. For detailed
instructions, see How to install and configure Azure PowerShell.
Log in to PowerShell
1. Start PowerShell on your machine, and keep it open through completion of this quickstart tutorial. If you
close and reopen it, you'll need to run these commands again.
2. Run the following command, and then enter the Azure username and password that you use to sign in to
the Azure portal:
Connect-AzAccount
3. If you have multiple Azure subscriptions, run the following command to select the subscription that you
want to work with. Replace SubscriptionId with the ID of your Azure subscription:
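A minimal sketch of the subscription-selection command, using the Az module (replace SubscriptionId with your own value):
# Select the subscription that the remaining commands run against
Select-AzSubscription -SubscriptionId "<SubscriptionId>"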
$resourceGroupName = "ADFTutorialResourceGroup"
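A minimal sketch of the resource group creation step, assuming the $resourceGroupName variable above and an East US location:
# Create the resource group that will contain the data factory
New-AzResourceGroup -Name $resourceGroupName -Location "East US"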
If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable and run the command again.
3. Define a variable for the data factory name that you can use in PowerShell commands later. The name
must start with a letter or a number, and it can contain only letters, numbers, and the dash (-) character.
IMPORTANT
Update the data factory name with a globally unique name. An example is ADFTutorialFactorySP1127.
$dataFactoryName = "ADFTutorialFactory"
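A minimal sketch of the factory-creation step, assuming the variables defined above and an East US location. If the name you chose is already taken, the command fails with an error like the one that follows.
# Create the V2 data factory in the resource group
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName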
The specified data factory name 'ADFv2TutorialDataFactory' is already in use. Data factory names
must be globally unique.
To create data-factory instances, the user account that you use to sign in to Azure must be assigned a contributor or
owner role or must be an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on the
following page, and then expand Analytics to locate Data Factory: Products available by region. The data stores
(Azure Storage, Azure SQL Database, and so on) and computes (Azure HDInsight and so on) used by the data factory
can be in other regions.
$integrationRuntimeName = "ADFTutorialIR"
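A hedged sketch of the command that creates the self-hosted integration runtime, assuming the variables defined earlier; it would return output like the block below (the description text mirrors that output):
# Create a self-hosted integration runtime in the data factory
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name $integrationRuntimeName -Type SelfHosted -Description "selfhosted IR description"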
Name : ADFTutorialIR
Type : SelfHosted
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Description : selfhosted IR description
Id : /subscriptions/<subscription
ID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/<dataFactoryName>/in
tegrationruntimes/<integrationRuntimeName>
3. To retrieve the status of the created integration runtime, run the following command:
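A minimal sketch of the status command, assuming the variables defined earlier:
# Check the status of the integration runtime and its registered nodes
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name $integrationRuntimeName -Status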
4. To retrieve the authentication keys for registering the self-hosted integration runtime with the Data
Factory service in the cloud, run the following command. Copy one of the keys (excluding the quotation
marks) for registering the self-hosted integration runtime that you install on your machine in the next
step.
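A sketch of the key-retrieval command, assuming the same variables; it returns JSON like the block below:
# Retrieve the authentication keys used to register the self-hosted integration runtime node
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name $integrationRuntimeName | ConvertTo-Json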
{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}
9. When the self-hosted integration runtime is registered successfully, the following message is displayed:
10. In the Register Integration Runtime (Self-hosted) window, select Launch Configuration
Manager .
11. When the node is connected to the cloud service, the following message is displayed:
12. Test the connectivity to your SQL Server database by doing the following:
a. In the Configuration Manager window, switch to the Diagnostics tab.
b. In the Data source type box, select SqlServer.
c. Enter the server name.
d. Enter the database name.
e. Select the authentication mode.
f. Enter the username.
g. Enter the password that's associated with the username.
h. To confirm that integration runtime can connect to the SQL Server, select Test .
If the connection is successful, a green checkmark icon is displayed. Otherwise, you'll receive an error
message associated with the failure. Fix any issues, and ensure that the integration runtime can connect
to your SQL Server instance.
Note all the preceding values for later use in this tutorial.
IMPORTANT
Before you save the file, replace <accountName> and <accountKey> with the name and key of your Azure
storage account. You noted them in the Prerequisites section.
{
"name": "AzureStorageLinkedService",
"properties": {
"annotations": [],
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=
<accountName>;AccountKey=<accountKey>;EndpointSuffix=core.windows.net"
}
}
}
Set-Location 'C:\ADFv2Tutorial'
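A minimal sketch of the deployment command for this linked service, assuming the AzureStorageLinkedService.json file created above; it produces output like the block below:
# Deploy the Azure Storage linked service from its JSON definition file
Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name "AzureStorageLinkedService" -DefinitionFile ".\AzureStorageLinkedService.json"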
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroup name>
DataFactoryName : <dataFactory name>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobStorageLinkedService
If you receive a "file not found" error, confirm that the file exists by running the dir command. If the file
name has a .txt extension (for example, AzureStorageLinkedService.json.txt), remove it, and then run the
PowerShell command again.
Create and encrypt a SQL Server linked service (source)
In this step, you link your SQL Server instance to the data factory.
1. Create a JSON file named SqlServerLinkedService.json in the C:\ADFv2Tutorial folder by using the
following code:
IMPORTANT
Select the section that's based on the authentication that you use to connect to SQL Server.
{
"name":"SqlServerLinkedService",
"type":"Microsoft.DataFactory/factories/linkedservices",
"properties":{
"annotations":[
],
"type":"SqlServer",
"typeProperties":{
"connectionString":"integrated security=False;data source=<serverName>;initial catalog=
<databaseName>;user id=<userName>;password=<password>"
},
"connectVia":{
"referenceName":"<integration runtime name> ",
"type":"IntegrationRuntimeReference"
}
}
}
{
"name":"SqlServerLinkedService",
"type":"Microsoft.DataFactory/factories/linkedservices",
"properties":{
"annotations":[
],
"type":"SqlServer",
"typeProperties":{
"connectionString":"integrated security=True;data source=<serverName>;initial catalog=
<databaseName>",
"userName":"<username> or <domain>\\<username>",
"password":{
"type":"SecureString",
"value":"<password>"
}
},
"connectVia":{
"referenceName":"<integration runtime name>",
"type":"IntegrationRuntimeReference"
}
}
}
IMPORTANT
Select the section that's based on the authentication you use to connect to your SQL Server instance.
Replace <integration runtime name> with the name of your integration runtime.
Before you save the file, replace <ser vername> , <databasename> , <username> , and <password>
with the values of your SQL Server instance.
If you need to use a backslash (\) in your user account or server name, precede it with the escape character (\).
For example, use mydomain\\myuser.
2. To encrypt the sensitive data (username, password, and so on), run the
New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet.
This encryption ensures that the credentials are encrypted using Data Protection Application
Programming Interface (DPAPI). The encrypted credentials are stored locally on the self-hosted
integration runtime node (local machine). The output payload can be redirected to another JSON file (in
this case, encryptedLinkedService.json) that contains encrypted credentials.
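A hedged sketch of these two steps, assuming the SqlServerLinkedService.json file above, the integration runtime created earlier, and the EncryptedSqlServerLinkedService name that the source dataset references later:
# Encrypt the credentials with the self-hosted integration runtime and capture the resulting definition
New-AzDataFactoryV2LinkedServiceEncryptedCredential -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -IntegrationRuntimeName $integrationRuntimeName -DefinitionFile ".\SqlServerLinkedService.json" > encryptedLinkedService.json
# Deploy the encrypted linked service under the name that the source dataset expects
Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name "EncryptedSqlServerLinkedService" -DefinitionFile ".\encryptedLinkedService.json"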
Create datasets
In this step, you create input and output datasets. They represent input and output data for the copy operation,
which copies data from the SQL Server database to Azure Blob storage.
Create a dataset for the source SQL Server database
In this step, you define a dataset that represents data in the SQL Server database instance. The dataset is of type
SqlServerTable. It refers to the SQL Server linked service that you created in the preceding step. The linked
service has the connection information that the Data Factory service uses to connect to your SQL Server
instance at runtime. This dataset specifies the SQL table in the database that contains the data. In this tutorial,
the emp table contains the source data.
1. Create a JSON file named SqlServerDataset.json in the C:\ADFv2Tutorial folder, with the following code:
{
"name":"SqlServerDataset",
"properties":{
"linkedServiceName":{
"referenceName":"EncryptedSqlServerLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[
],
"type":"SqlServerTable",
"schema":[
],
"typeProperties":{
"schema":"dbo",
"table":"emp"
}
}
}
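A minimal sketch of the command that deploys this source dataset, assuming the SqlServerDataset.json file above. The JSON block that follows defines the sink dataset (AzureBlobDataset); save it as AzureBlobDataset.json in the same folder.
# Deploy the source SQL Server dataset from its JSON definition file
Set-AzDataFactoryV2Dataset -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name "SqlServerDataset" -DefinitionFile ".\SqlServerDataset.json"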
{
"name":"AzureBlobDataset",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[
],
"type":"DelimitedText",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"folderPath":"fromonprem",
"container":"adftutorial"
},
"columnDelimiter":",",
"escapeChar":"\\",
"quoteChar":"\""
},
"schema":[
]
},
"type":"Microsoft.DataFactory/factories/datasets"
}
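A matching sketch for the sink dataset, assuming the AzureBlobDataset.json file above; it would produce output like the block below:
# Deploy the sink Azure Blob dataset from its JSON definition file
Set-AzDataFactoryV2Dataset -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name "AzureBlobDataset" -DefinitionFile ".\AzureBlobDataset.json"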
DatasetName : AzureBlobDataset
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.DelimitedTextDataset
Create a pipeline
In this tutorial, you create a pipeline with a copy activity. The copy activity uses SqlServerDataset as the input
dataset and AzureBlobDataset as the output dataset. The source type is set to SqlServerSource and the sink type is set
to DelimitedTextSink, which writes delimited text to Blob storage.
1. Create a JSON file named SqlServerToBlobPipeline.json in the C:\ADFv2Tutorial folder, with the following
code:
{
"name":"SqlServerToBlobPipeline",
"properties":{
"activities":[
{
"name":"CopySqlServerToAzureBlobActivity",
"type":"Copy",
"dependsOn":[
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"source":{
"type":"SqlServerSource"
},
"sink":{
"type":"DelimitedTextSink",
"storeSettings":{
"type":"AzureBlobStorageWriteSettings"
},
"formatSettings":{
"type":"DelimitedTextWriteSettings",
"quoteAllText":true,
"fileExtension":".txt"
}
},
"enableStaging":false
},
"inputs":[
{
"referenceName":"SqlServerDataset",
"type":"DatasetReference"
}
],
"outputs":[
{
"referenceName":"AzureBlobDataset",
"type":"DatasetReference"
}
]
}
],
"annotations":[
]
}
}
2. To create the pipeline SQLServerToBlobPipeline, run the Set-AzDataFactoryV2Pipeline cmdlet.
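A minimal sketch of the cmdlet call, assuming the SqlServerToBlobPipeline.json file above; it produces output like the block below:
# Deploy the pipeline from its JSON definition file
Set-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name "SQLServerToBlobPipeline" -DefinitionFile ".\SqlServerToBlobPipeline.json"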
PipelineName : SQLServerToBlobPipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {CopySqlServerToAzureBlobActivity}
Parameters :
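The monitoring loop below relies on a $runId value; a minimal sketch of starting a run of the pipeline and capturing its run ID is:
# Start a pipeline run and capture the run ID for monitoring
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -PipelineName "SQLServerToBlobPipeline"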
while ($True) {
    $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
        -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
    if ($result.Status -ne "InProgress") { $result; break }   # finished (Succeeded or Failed): show the result and stop
    Start-Sleep -Seconds 30                                   # still running: wait and poll again
}
2. You can get the run ID of pipeline SQLServerToBlobPipeline and check the detailed activity run result by
running the following command:
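One way to print the copy activity's detailed output (the JSON block that follows is an example of that output), assuming the $result object returned by the monitoring loop above:
# Print the detailed output of the copy activity run (rows read/copied, throughput, durations, and so on)
($result | Where-Object { $_.ActivityName -eq "CopySqlServerToAzureBlobActivity" }).Output.ToString()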
{
"dataRead":36,
"dataWritten":32,
"filesWritten":1,
"sourcePeakConnections":1,
"sinkPeakConnections":1,
"rowsRead":2,
"rowsCopied":2,
"copyDuration":18,
"throughput":0.01,
"errors":[
],
"effectiveIntegrationRuntime":"ADFTutorialIR",
"usedParallelCopies":1,
"executionDetails":[
{
"source":{
"type":"SqlServer"
},
"sink":{
"type":"AzureBlobStorage",
"region":"CentralUS"
},
"status":"Succeeded",
"start":"2019-09-11T07:10:38.2342905Z",
"duration":18,
"usedParallelCopies":1,
"detailedDurations":{
"queuingDuration":6,
"timeToFirstByte":0,
"transferDuration":5
}
}
]
}
Verify the output
The pipeline automatically creates the output folder named fromonprem in the adftutorial blob container.
Confirm that you see the dbo.emp.txt file in the output folder.
1. In the Azure portal, in the adftutorial container window, select Refresh to see the output folder.
2. Select fromonprem in the list of folders.
3. Confirm that you see a file named dbo.emp.txt .
Next steps
The pipeline in this sample copies data from a SQL Server database to Azure Blob storage. You learned how to:
Create a data factory.
Create a self-hosted integration runtime.
Create SQL Server and Azure Storage linked services.
Create SQL Server and Azure Blob datasets.
Create a pipeline with a copy activity to move the data.
Start a pipeline run.
Monitor the pipeline run.
For a list of data stores that are supported by Data Factory, see supported data stores.
To learn about copying data in bulk from a source to a destination, advance to the following tutorial:
Copy data in bulk
Load data into Azure Data Lake Storage Gen2 with
Azure Data Factory
7/7/2021 • 4 minutes to read
TIP
For copying data from Azure Data Lake Storage Gen1 into Gen2, refer to this specific walkthrough.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account with Data Lake Storage Gen2 enabled: If you don't have a Storage account, create an
account.
AWS account with an S3 bucket that contains data: This article shows how to copy data from Amazon S3. You
can use other data stores by following similar steps.
Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration Application in
a separate tab.
4. On the New linked service (Amazon S3) page, do the following steps:
a. Specify the Access Key ID value.
b. Specify the Secret Access Key value.
c. Select Test connection to validate the settings, and then select Create.
d. You see that a new Amazon S3 connection is created. Select Next.
5. In the Choose the input file or folder page, browse to the folder and file that you want to copy over.
Select the folder/file, and then select Choose .
6. Specify the copy behavior by checking the Recursively and Binary copy options. Select Next.
7. In the Destination data store page, click + Create new connection , and then select Azure Data
Lake Storage Gen2 , and select Continue .
8. On the New linked service (Azure Data Lake Storage Gen2) page, do the following steps:
a. Select your Data Lake Storage Gen2 capable account from the "Storage account name" drop-down
list.
b. Select Create to create the connection. Then select Next .
9. On the Choose the output file or folder page, enter copyfroms3 as the output folder name, and
select Next. Data Factory creates the corresponding ADLS Gen2 file system and subfolders during copy if they
don't exist.
10. In the Settings page, select Next to use the default settings.
11. On the Summary page, review the settings, and select Next.
12. On the Deployment page, select Monitor to monitor the pipeline (task).
13. When the pipeline run completes successfully, you see a pipeline run that is triggered by a manual trigger.
You can use links under the PIPELINE NAME column to view activity details and to rerun the pipeline.
14. To see activity runs associated with the pipeline run, select the CopyFromAmazonS3ToADLS link under
the PIPELINE NAME column. For details about the copy operation, select the Details link (eyeglasses
icon) under the ACTIVITY NAME column. You can monitor details like the volume of data copied from the
source to the sink, data throughput, execution steps with corresponding duration, and used configuration.
15. To refresh the view, select Refresh. Select All pipeline runs at the top to go back to the Pipeline Runs
view.
16. Verify that the data is copied into your Data Lake Storage Gen2 account.
Next steps
Copy activity overview
Azure Data Lake Storage Gen2 connector
Load data into Azure Data Lake Storage Gen1 by
using Azure Data Factory
7/7/2021 • 4 minutes to read
NOTE
For more information, see Copy data to or from Data Lake Storage Gen1 by using Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Data Lake Storage Gen1 account: If you don't have a Data Lake Storage Gen1 account, see the instructions in
Create a Data Lake Storage Gen1 account.
Amazon S3: This article shows how to copy data from Amazon S3. You can use other data stores by following
similar steps.
Name: Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name "LoadADLSG1Demo" is not available," enter a different name for the data factory. For example,
you could use the name yournameADFTutorialDataFactory. Try creating the data factory again. For
the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription : Select your Azure subscription in which to create the data factory.
Resource Group : Select an existing resource group from the drop-down list, or select the Create
new option and enter the name of a resource group. To learn about resource groups, see Using
resource groups to manage your Azure resources.
Version : Select V2 .
Location : Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These
data stores include Azure Data Lake Storage Gen1, Azure Storage, Azure SQL Database, and so on.
3. Select Create .
4. After creation is complete, go to your data factory. You see the Data Factor y home page as shown in the
following image:
Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration Application in
a separate tab.
2. On the Properties page, specify CopyFromAmazonS3ToADLS for the Task name field, and select
Next :
6. Choose the copy behavior by selecting the Copy files recursively and Binary copy (copy files as-is)
options. Select Next :
7. In the Destination data store page, click + Create new connection , and then select Azure Data
Lake Storage Gen1 , and select Continue :
8. On the New Linked Service (Azure Data Lake Storage Gen1) page, do the following steps:
a. Select your Data Lake Storage Gen1 account for the Data Lake Store account name .
b. Specify the Tenant , and select Finish.
c. Select Next .
IMPORTANT
In this walkthrough, you use a managed identity for Azure resources to authenticate your Data Lake Storage
Gen1 account. Be sure to grant the MSI the proper permissions in Data Lake Storage Gen1 by following these
instructions.
9. In the Choose the output file or folder page, enter copyfroms3 as the output folder name, and
select Next :
14. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To
switch back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the
list.
15. To monitor the execution details for each copy activity, select the Details link under Actions in the
activity monitoring view. You can monitor details like the volume of data copied from the source to the
sink, data throughput, execution steps with corresponding duration, and used configurations:
16. Verify that the data is copied into your Data Lake Storage Gen1 account:
Next steps
Advance to the following article to learn about Data Lake Storage Gen1 support:
Azure Data Lake Storage Gen1 connector
Copy data from Azure Data Lake Storage Gen1 to
Gen2 with Azure Data Factory
7/7/2021 • 7 minutes to read
Prerequisites
An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
Azure Data Lake Storage Gen1 account with data in it.
Azure Storage account with Data Lake Storage Gen2 enabled. If you don't have a Storage account, create an
account.
2. On the Properties page, specify CopyFromADLSGen1ToGen2 for the Task name field. Select Next.
3. On the Source data store page, select + Create new connection .
4. Select Azure Data Lake Storage Gen1 from the connector gallery, and select Continue .
5. On the Specify Azure Data Lake Storage Gen1 connection page, follow these steps:
a. Select your Data Lake Storage Gen1 for the account name, and specify or validate the Tenant .
b. Select Test connection to validate the settings. Then select Finish .
c. You see that a new connection was created. Select Next .
IMPORTANT
In this walk-through, you use a managed identity for Azure resources to authenticate your Azure Data Lake
Storage Gen1. To grant the managed identity the proper permissions in Azure Data Lake Storage Gen1, follow
these instructions.
6. On the Choose the input file or folder page, browse to the folder and file that you want to copy over.
Select the folder or file, and select Choose .
7. Specify the copy behavior by selecting the Copy files recursively and Binary copy options. Select
Next .
8. On the Destination data store page, select + Create new connection > Azure Data Lake Storage
Gen2 > Continue .
9. On the Specify Azure Data Lake Storage Gen2 connection page, follow these steps:
a. Select your Data Lake Storage Gen2 capable account from the Storage account name drop-down list.
b. Select Finish to create the connection. Then select Next .
10. On the Choose the output file or folder page, enter copyfromadlsgen1 as the output folder name,
and select Next . Data Factory creates the corresponding Azure Data Lake Storage Gen2 file system and
subfolders during copy if they don't exist.
11. On the Settings page, select Next to use the default settings.
12. On the Summary page, review the settings, and select Next.
13. On the Deployment page, select Monitor to monitor the pipeline.
14. Notice that the Monitor tab on the left is automatically selected. The Actions column includes links to
view activity run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select the View Activity Runs link in the
Actions column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To
switch back to the pipeline runs view, select the Pipelines link at the top. Select Refresh to refresh the
list.
16. To monitor the execution details for each copy activity, select the Details link (eyeglasses image) under
Actions in the activity monitoring view. You can monitor details like the volume of data copied from the
source to the sink, data throughput, execution steps with corresponding duration, and used
configurations.
17. Verify that the data is copied into your Azure Data Lake Storage Gen2 account.
Best practices
To assess upgrading from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2 in general, see
Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2.
The following sections introduce best practices for using Data Factory for a data upgrade from Data Lake
Storage Gen1 to Data Lake Storage Gen2.
Data partition for historical data copy
If your total data size in Data Lake Storage Gen1 is less than 30 TB and the number of files is less than 1
million, you can copy all data in a single copy activity run.
If you have a larger amount of data to copy, or you want the flexibility to manage data migration in batches
and make each of them complete within a specific time frame, partition the data. Partitioning also reduces
the risk of any unexpected issue.
Use a proof of concept to verify the end-to-end solution and test the copy throughput in your environment.
Major proof-of-concept steps:
1. Create one Data Factory pipeline with a single copy activity to copy several TBs of data from Data Lake
Storage Gen1 to Data Lake Storage Gen2 to get a copy performance baseline. Start with data integration
units (DIUs) as 128.
2. Based on the copy throughput you get in step 1, calculate the estimated time that's required for the entire
data migration.
3. (Optional) Create a control table and define the file filter to partition the files to be migrated. The way to
partition the files is to:
Partition by folder name or folder name with a wildcard filter. We recommend this method.
Partition by a file's last modified time.
Network bandwidth and storage I/O
You can control the concurrency of Data Factory copy jobs that read data from Data Lake Storage Gen1 and
write data to Data Lake Storage Gen2. In this way, you can manage the use on that storage I/O to avoid affecting
the normal business work on Data Lake Storage Gen1 during the migration.
Permissions
In Data Factory, the Data Lake Storage Gen1 connector supports service principal and managed identity for
Azure resource authentications. The Data Lake Storage Gen2 connector supports account key, service principal,
and managed identity for Azure resource authentications. To make Data Factory able to navigate and copy all the
files or access control lists (ACLs) you need, grant high enough permissions for the account you provide to
access, read, or write all files and set ACLs if you choose to. Grant it a super-user or owner role during the
migration period.
Preserve ACLs from Data Lake Storage Gen1
If you want to replicate the ACLs along with data files when you upgrade from Data Lake Storage Gen1 to Data
Lake Storage Gen2, see Preserve ACLs from Data Lake Storage Gen1.
Incremental copy
You can use several approaches to load only the new or updated files from Data Lake Storage Gen1:
Load new or updated files by time partitioned folder or file name. An example is /2019/05/13/*.
Load new or updated files by LastModifiedDate.
Identify new or updated files by any third-party tool or solution. Then pass the file or folder name to the Data
Factory pipeline via parameter or a table or file.
The proper frequency to do incremental load depends on the total number of files in Azure Data Lake Storage
Gen1 and the volume of new or updated files to be loaded every time.
Next steps
Copy activity overview Azure Data Lake Storage Gen1 connector Azure Data Lake Storage Gen2 connector
Load data into Azure Synapse Analytics by using
Azure Data Factory
7/7/2021 • 7 minutes to read
NOTE
For more information, see Copy data to or from Azure Synapse Analytics by using Azure Data Factory.
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure Synapse Analytics: The data warehouse holds the data that's copied over from the SQL database. If you
don't have an Azure Synapse Analytics, see the instructions in Create an Azure Synapse Analytics.
Azure SQL Database: This tutorial copies data from the Adventure Works LT sample dataset in Azure SQL
Database. You can create this sample database in SQL Database by following the instructions in Create a
sample database in Azure SQL Database.
Azure storage account: Azure Storage is used as the staging blob in the bulk copy operation. If you don't have
an Azure storage account, see the instructions in Create a storage account.
Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration Application in
a separate tab.
TIP
In this tutorial, you use SQL authentication as the authentication type for your source data store, but you can
choose other supported authentication methods: Service Principal and Managed Identity if needed. Refer to
corresponding sections in this article for details. To store secrets for data stores securely, it's also recommended to
use an Azure Key Vault. Refer to this article for detailed illustrations.
TIP
In this tutorial, you use SQL authentication as the authentication type for your destination data store, but you can
choose other supported authentication methods: Service Principal and Managed Identity if needed. Refer to
corresponding sections in this article for details. To store secrets for data stores securely, it's also recommended to
use an Azure Key Vault. Refer to this article for detailed illustrations.
10. On the Summary page, review the settings, and select Next.
11. On the Deployment page, select Monitor to monitor the pipeline (task).
12. Notice that the Monitor tab on the left is automatically selected. When the pipeline run completes
successfully, select the CopyFromSQLToSQLDW link under the PIPELINE NAME column to view
activity run details or to rerun the pipeline.
13. To switch back to the pipeline runs view, select the All pipeline runs link at the top. Select Refresh to
refresh the list.
14. To monitor the execution details for each copy activity, select the Details link (eyeglasses icon) under
ACTIVITY NAME in the activity runs view. You can monitor details like the volume of data copied from
the source to the sink, data throughput, execution steps with corresponding duration, and used
configurations.
Next steps
Advance to the following article to learn about Azure Synapse Analytics support:
Azure Synapse Analytics connector
Copy data from SAP Business Warehouse by using
Azure Data Factory
7/7/2021 • 10 minutes to read
TIP
For general information about copying data from SAP BW, including SAP BW Open Hub integration and delta extraction
flow, see Copy data from SAP Business Warehouse via Open Hub by using Azure Data Factory.
Prerequisites
Azure Data Factor y : If you don't have one, follow the steps to create a data factory.
SAP BW Open Hub Destination (OHD) with destination type "Database Table" : To create an OHD
or to check that your OHD is configured correctly for Data Factory integration, see the SAP BW Open Hub
Destination configurations section of this article.
The SAP BW user needs the following permissions :
Authorization for Remote Function Calls (RFC) and SAP BW.
Permissions to the “Execute” activity of the S_SDSAUTH authorization object.
A self-hosted integration runtime (IR) with SAP .NET connector 3.0 . Follow these setup steps:
1. Install and register the self-hosted integration runtime, version 3.13 or later. (This process is
described later in this article.)
2. Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the
same computer as the self-hosted IR. During installation, make sure that you select Install
Assemblies to GAC in the Optional setup steps dialog box, as the following image shows:
Do a full copy from SAP BW Open Hub
In the Azure portal, go to your data factory. Select Open on the Open Azure Data Factory Studio tile to open
the Data Factory UI in a separate tab.
1. On the home page, select Ingest to open the Copy Data tool.
2. On the Properties page, specify a Task name, and then select Next.
3. On the Source data store page, select +Create new connection . Select SAP BW Open Hub from
the connector gallery, and then select Continue . To filter the connectors, you can type SAP in the search
box.
4. On the Specify SAP BW Open Hub connection page, follow these steps to create a new connection.
a. From the Connect via integration runtime list, select an existing self-hosted IR. Or, choose to
create one if you don't have one yet.
To create a new self-hosted IR, select +New , and then select Self-hosted . Enter a Name , and then
select Next . Select Express setup to install on the current computer, or follow the Manual setup
steps that are provided.
As mentioned in Prerequisites, make sure that you have SAP Connector for Microsoft .NET 3.0
installed on the same computer where the self-hosted IR is running.
b. Fill in the SAP BW Server name, System number, Client ID, Language (if other than EN),
User name , and Password .
c. Select Test connection to validate the settings, and then select Finish .
d. A new connection is created. Select Next .
5. On the Select Open Hub Destinations page, browse the Open Hub Destinations that are available in
your SAP BW. Select the OHD to copy data from, and then select Next .
6. Specify a filter, if you need one. If your OHD only contains data from a single data-transfer process (DTP)
execution with a single request ID, or you're sure that your DTP is finished and you want to copy the data,
clear the Exclude Last Request check box.
Learn more about these settings in the SAP BW Open Hub Destination configurations section of this
article. Select Validate to double-check what data will be returned. Then select Next .
7. On the Destination data store page, select +Create new connection > Azure Data Lake Storage
Gen2 > Continue .
8. On the Specify Azure Data Lake Storage connection page, follow these steps to create a connection.
a. Select your Data Lake Storage Gen2-capable account from the Name drop-down list.
b. Select Finish to create the connection. Then select Next .
9. On the Choose the output file or folder page, enter copyfromopenhub as the output folder name.
Then select Next .
10. On the File format setting page, select Next to use the default settings.
11. On the Settings page, expand Performance settings . Enter a value for Degree of copy parallelism
such as 5 to load from SAP BW in parallel. Then select Next .
12. On the Summary page, review the settings. Then select Next.
13. On the Deployment page, select Monitor to monitor the pipeline.
14. Notice that the Monitor tab on the left side of the page is automatically selected. The Actions column
includes links to view activity-run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select View Activity Runs in the Actions
column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch back
to the pipeline-runs view, select the Pipelines link at the top. Select Refresh to refresh the list.
16. To monitor the execution details for each copy activity, select the Details link, which is an eyeglasses icon
below Actions in the activity-monitoring view. Available details include the data volume copied from the
source to the sink, data throughput, execution steps and duration, and configurations used.
17. To view the maximum Request ID , go back to the activity-monitoring view and select Output under
Actions .
On the data factory home page, select Pipeline templates in the Discover more section to use the built-in
template.
1. Search for SAP BW to find and select the Incremental copy from SAP BW to Azure Data Lake
Storage Gen2 template. This template copies data into Azure Data Lake Storage Gen2. You can use a
similar workflow to copy to other sink types.
2. On the template's main page, select or create the following three connections, and then select Use this
template in the lower-right corner of the window.
Azure Blob storage : In this walkthrough, we use Azure Blob storage to store the high watermark,
which is the max copied request ID.
SAP BW Open Hub : This is the source to copy data from. Refer to the previous full-copy
walkthrough for detailed configuration.
Azure Data Lake Storage Gen2 : This is the sink to copy data to. Refer to the previous full-copy
walkthrough for detailed configuration.
3. This template generates a pipeline with the following three activities and makes them chained on-
success: Lookup, Copy Data, and Web.
Go to the pipeline Parameters tab. You see all the configurations that you need to provide.
SAPOpenHubDestinationName : Specify the Open Hub table name to copy data from.
Data_Destination_Container : Specify the destination Azure Data Lake Storage Gen2 container
to copy data to. If the container doesn't exist, the Data Factory copy activity creates one during
execution.
Data_Destination_Directory: Specify the folder path under the Azure Data Lake Storage Gen2
container to copy data to. If the path doesn't exist, the Data Factory copy activity creates a path
during execution.
HighWatermarkBlobContainer : Specify the container to store the high-watermark value.
HighWatermarkBlobDirectory : Specify the folder path under the container to store the high-
watermark value.
HighWatermarkBlobName : Specify the blob name to store the high-watermark value, such as
requestIdCache.txt . In Blob storage, go to the corresponding path of
HighWatermarkBlobContainer+HighWatermarkBlobDirectory+HighWatermarkBlobName, such as
container/path/requestIdCache.txt. Create a blob with content 0 (see the sketch after this list).
LogicAppURL : In this template, we use WebActivity to call Azure Logic Apps to set the high-
watermark value in Blob storage. Or, you can use Azure SQL Database to store it. Use a stored
procedure activity to update the value.
You must first create a logic app, as the following image shows. Then, paste in the HTTP POST
URL .
a. Go to the Azure portal. Select a new Logic Apps service. Select +Blank Logic App to go
to Logic Apps Designer .
b. Create a trigger of When an HTTP request is received . Specify the HTTP request body as
follows:
{
"properties": {
"sapOpenHubMaxRequestId": {
"type": "string"
}
},
"type": "object"
}
c. Add a Create blob action. For Folder path and Blob name , use the same values that you
configured previously in HighWatermarkBlobContainer+HighWatermarkBlobDirectory and
HighWatermarkBlobName.
d. Select Save . Then, copy the value of HTTP POST URL to use in the Data Factory pipeline.
4. After you provide the Data Factory pipeline parameters, select Debug > Finish to invoke a run to
validate the configuration. Or, select Publish to publish all the changes, and then select Add trigger to
execute a run.
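If you prefer to seed the high-watermark blob from PowerShell rather than the portal, the following is a minimal sketch. The storage account, container, and blob path are placeholders that must match the HighWatermarkBlobContainer, HighWatermarkBlobDirectory, and HighWatermarkBlobName values you configured.
# Hedged sketch: upload a blob whose content is "0" as the initial high watermark.
$ctx = (Get-AzStorageAccount -ResourceGroupName "<resource group>" -Name "<storage account>").Context
$localFile = Join-Path $env:TEMP "requestIdCache.txt"
Set-Content -Path $localFile -Value "0" -NoNewline
Set-AzStorageBlobContent -Context $ctx -Container "container" -File $localFile -Blob "path/requestIdCache.txt" -Force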
You might increase the number of parallel running SAP work processes for the DTP.
For a full-load OHD, choose different options than for delta extraction:
In OHD: Set the Extraction option to Delete Data and Insert Records . Otherwise, data will be
extracted many times when you repeat the DTP in a BW process chain.
In the DTP: Set Extraction Mode to Full . You must change the automatically created DTP from Delta to
Full immediately after the OHD is created, as this image shows:
In the BW Open Hub connector of Data Factory: Turn off Exclude last request . Otherwise, nothing will
be extracted.
You typically run the full DTP manually. Or, you can create a process chain for the full DTP. It's typically a separate
chain that's independent of your existing process chains. In either case, make sure that the DTP is finished before
you start the extraction by using Data Factory copy. Otherwise, only partial data will be copied.
Run delta extraction the first time
The first delta extraction is technically a full extraction. By default, the SAP BW Open Hub connector excludes the
last request when it copies data. For the first delta extraction, no data is extracted by the Data Factory copy
activity until a subsequent DTP generates delta data in the table with a separate request ID. There are two ways
to avoid this scenario:
Turn off the Exclude last request option for the first delta extraction. Make sure that the first delta DTP is
finished before you start the delta extraction the first time.
Use the procedure for resyncing the delta extraction, as described in the next section.
Resync delta extraction
The following scenarios change the data in SAP BW cubes but are not considered by the delta DTP:
SAP BW selective deletion (of rows by using any filter condition)
SAP BW request deletion (of faulty requests)
An SAP Open Hub Destination isn't a data-mart-controlled data target (in all SAP BW support packages since
2015). So, you can delete data from a cube without changing the data in the OHD. You must then resync the data
of the cube with Data Factory:
1. Run a full extraction in Data Factory (by using a full DTP in SAP).
2. Delete all rows in the Open Hub table for the delta DTP.
3. Set the status of the delta DTP to Fetched .
After this, all subsequent delta DTPs and Data Factory delta extractions work as expected.
To set the status of the delta DTP to Fetched , you can use the following option to run the delta DTP manually:
No Data Transfer; Delta Status in Source: Fetched
Next steps
Learn about SAP BW Open Hub connector support:
SAP Business Warehouse Open Hub connector
Load data from Office 365 by using Azure Data
Factory
7/7/2021 • 5 minutes to read
2. In the New data factory page, provide values for the fields that are shown in the following image:
Name : Enter a globally unique name for your Azure data factory. If you receive the error "Data factory
name LoadFromOffice365Demo is not available", enter a different name for the data factory. For
example, you could use the name yourname LoadFromOffice365Demo . Try creating the data
factory again. For the naming rules for Data Factory artifacts, see Data Factory naming rules.
Subscription : Select your Azure subscription in which to create the data factory.
Resource Group : Select an existing resource group from the drop-down list, or select the Create
new option and enter the name of a resource group. To learn about resource groups, see Using
resource groups to manage your Azure resources.
Version : Select V2 .
Location : Select the location for the data factory. Only supported locations are displayed in the drop-
down list. The data stores that are used by data factory can be in other locations and regions. These
data stores include Azure Data Lake Store, Azure Storage, Azure SQL Database, and so on.
3. Select Create .
4. After creation is complete, go to your data factory. You see the Data Factory home page as shown in the
following image:
5. Select Open on the Open Azure Data Factory Studio tile to launch the Data Integration Application in
a separate tab.
Create a pipeline
1. On the home page, select Orchestrate .
2. In the General tab for the pipeline, enter "CopyPipeline" for Name of the pipeline.
3. In the Activities toolbox, under the Move & Transform category, drag and drop the Copy activity from the
toolbox to the pipeline designer surface. Specify "CopyFromOffice365ToBlob" as the activity name.
Configure source
1. Go to the Source tab of the pipeline, and click + New to create a source dataset.
2. In the New Dataset window, select Office 365 , and then select Continue .
3. You are now in the copy activity configuration tab. Click on the Edit button next to the Office 365 dataset
to continue the data configuration.
4. You see a new tab opened for the Office 365 dataset. In the General tab at the bottom of the Properties
window, enter "SourceOffice365Dataset" for Name.
5. Go to the Connection tab of the Properties window. Next to the Linked service text box, click + New .
6. In the New Linked Service window, enter "Office365LinkedService" as the name, enter the service principal ID
and service principal key, then test the connection and select Create to deploy the linked service (a JSON sketch of this linked service follows this list).
7. After the linked service is created, you are back in the dataset settings. Next to Table , choose the down-
arrow to expand the list of available Office 365 datasets, and choose "BasicDataSet_v0.Message_v0" from
the drop-down list:
8. Now go back to the pipeline > Source tab to continue configuring additional properties for Office 365
data extraction. User scope and user scope filter are optional predicates that you can define to restrict the
data you want to extract out of Office 365. See Office 365 dataset properties section for how you
configure these settings.
9. You are required to choose one of the date filters and provide the start time and end time values.
10. Click the Import Schema tab to import the schema for the Message dataset.
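If you prefer to author the Office 365 linked service as JSON and deploy it with PowerShell instead of through the UI, the following hedged sketch shows the typical shape of the definition. The tenant and service principal values, and the resource group and factory names, are placeholders.
$office365Json = @'
{
    "name": "Office365LinkedService",
    "properties": {
        "type": "Office365",
        "typeProperties": {
            "office365TenantId": "<Office 365 tenant id>",
            "servicePrincipalTenantId": "<Azure AD tenant id>",
            "servicePrincipalId": "<service principal application id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            }
        }
    }
}
'@
Set-Content -Path ".\Office365LinkedService.json" -Value $office365Json
Set-AzDataFactoryV2LinkedService -ResourceGroupName "<resource group>" -DataFactoryName "<data factory>" -Name "Office365LinkedService" -DefinitionFile ".\Office365LinkedService.json"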
Configure sink
1. Go to the Sink tab of the pipeline, and select + New to create a sink dataset.
2. In the New Dataset window, notice that only the supported destinations are selected when copying from
Office 365. Select Azure Blob Storage , select Binary format, and then select Continue . In this tutorial,
you copy Office 365 data into an Azure Blob Storage.
3. Click the Edit button next to the Azure Blob Storage dataset to continue the data configuration.
4. On the General tab of the Properties window, in Name, enter "OutputBlobDataset".
5. Go to the Connection tab of the Properties window. Next to the Linked service text box, select + New .
6. In the New Linked Service window, enter "AzureStorageLinkedService" as the name, select "Service Principal"
from the dropdown list of authentication methods, fill in the Service Endpoint, Tenant, Service principal
ID, and Service principal key, and then select Save to deploy the linked service. Refer here for how to set up
service principal authentication for Azure Blob Storage (a JSON sketch of this linked service follows this list).
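Similarly, a hedged sketch of the sink linked service as JSON, using service principal authentication against Azure Blob Storage; the endpoint, tenant, and principal values are placeholders.
$blobJson = @'
{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "serviceEndpoint": "https://<account name>.blob.core.windows.net",
            "servicePrincipalId": "<service principal application id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            },
            "tenant": "<Azure AD tenant id>"
        }
    }
}
'@
Set-Content -Path ".\AzureStorageLinkedService.json" -Value $blobJson
Set-AzDataFactoryV2LinkedService -ResourceGroupName "<resource group>" -DataFactoryName "<data factory>" -Name "AzureStorageLinkedService" -DefinitionFile ".\AzureStorageLinkedService.json"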
To see activity runs associated with the pipeline run, select the View Activity Runs link in the Actions column.
In this example, there is only one activity, so you see only one entry in the list. For details about the copy
operation, select the Details link (eyeglasses icon) in the Actions column.
If this is the first time you are requesting data for this context (a combination of which data table is being accessed,
which destination account the data is being loaded into, and which user identity is making the data access
request), the copy activity status shows as In Progress . Only when you select the Details link under
Actions do you see the status RequestingConsent . A member of the data access approver group needs to
approve the request in Privileged Access Management before the data extraction can proceed.
Status as requesting consent:
Status as extracting data:
Once the consent is provided, data extraction will continue and, after some time, the pipeline run will show as
succeeded.
Now go to the destination Azure Blob Storage and verify that Office 365 data has been extracted in Binary
format.
Next steps
Advance to the following article to learn about Office 365 connector support:
Office 365 connector
Copy multiple tables in bulk by using Azure Data
Factory in the Azure portal
7/7/2021 • 14 minutes to read
NOTE
If you are new to Azure Data Factory, see Introduction to Azure Data Factory.
End-to-end workflow
In this scenario, you have a number of tables in Azure SQL Database that you want to copy to Azure Synapse
Analytics. Here is the logical sequence of steps in the workflow that happens in pipelines:
The first pipeline looks up the list of tables that need to be copied over to the sink data stores. Alternatively
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the
pipeline triggers another pipeline, which iterates over each table in the database and performs the data copy
operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the
list, copy the specific table in Azure SQL Database to the corresponding table in Azure Synapse Analytics
using staged copy via Blob storage and PolyBase for best performance. In this example, the first pipeline
passes the list of tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure Storage account . The Azure Storage account is used as staging blob storage in the bulk copy
operation.
Azure SQL Database . This database contains the source data. Create a database in SQL Database with
Adventure Works LT sample data following Create a database in Azure SQL Database article. This tutorial
copies all the tables from this sample database to Azure Synapse Analytics.
Azure Synapse Analytics . This data warehouse holds the data copied over from the SQL Database. If you
don't have an Azure Synapse Analytics workspace, see the Get started with Azure Synapse Analytics article
for steps to create one.
5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version .
8. Select the location for the data factory. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory : Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.)
and computes (HDInsight, etc.) used by data factory can be in other regions.
9. Click Create .
10. After the creation is complete, select Go to resource to navigate to the Data Factory page.
11. Select Open on the Open Azure Data Factory Studio tile to launch the Data Factory UI application in a
separate tab.
Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored.
The input dataset AzureSqlDatabaseDataset refers to the AzureSqlDatabaseLinkedSer vice . The linked
service specifies the connection string to connect to the database. The dataset specifies the name of the
database and the table that contains the source data.
The output dataset AzureSqlDWDataset refers to the AzureSqlDWLinkedSer vice . The linked service
specifies the connection string to connect to the Azure Synapse Analytics. The dataset specifies the database and
the table to which the data is copied.
In this tutorial, the source and destination SQL tables are not hard-coded in the dataset definitions. Instead, the
ForEach activity passes the name of the table at runtime to the Copy activity.
Create a dataset for source SQL Database
1. Select Author tab from the left pane.
2. Select the + (plus) in the left pane, and then select Dataset .
3. In the New Dataset window, select Azure SQL Database , and then click Continue .
4. In the Set properties window, under Name , enter AzureSqlDatabaseDataset . Under Linked
service , select AzureSqlDatabaseLinkedService . Then click OK .
5. Switch to the Connection tab, and select any table for Table . This table is a dummy table. You specify a query
on the source dataset when creating a pipeline; the query is used to extract data from your database.
Alternatively, you can select the Edit check box and enter dbo.dummyName as the table name.
Create a dataset for sink Azure Synapse Analytics
1. Click + (plus) in the left pane, and click Dataset .
2. In the New Dataset window, select Azure Synapse Analytics , and then click Continue .
3. In the Set properties window, under Name , enter AzureSqlDWDataset . Under Linked service , select
AzureSqlDWLinkedService . Then click OK .
4. Switch to the Parameters tab, click + New , and enter DWTableName for the parameter name. Click +
New again, and enter DWSchema for the parameter name. If you copy/paste this name from the page,
ensure that there's no trailing space character at the end of DWTableName and DWSchema.
5. Switch to the Connection tab,
a. For Table , check the Edit option. Click inside the first input box and select the Add dynamic
content link below. On the Add Dynamic Content page, select DWSchema under
Parameters , which automatically populates the top expression text box with @dataset().DWSchema ,
and then click Finish .
b. Click inside the second input box and select the Add dynamic content link below. On the Add
Dynamic Content page, select DWTableName under Parameters , which automatically
populates the top expression text box with @dataset().DWTableName , and then click Finish .
c. The tableName property of the dataset is set to the values that are passed as arguments for the
DWSchema and DWTableName parameters. The ForEach activity iterates through a list of tables,
and passes one by one to the Copy activity.
Create pipelines
In this tutorial, you create two pipelines: IterateAndCopySQLTables and GetTableListAndTriggerCopyData .
The GetTableListAndTriggerCopyData pipeline performs two actions:
Looks up the Azure SQL Database system table to get the list of tables to be copied.
Triggers the pipeline IterateAndCopySQLTables to do the actual data copy.
The IterateAndCopySQLTables pipeline takes a list of tables as a parameter. For each table in the list, it copies
data from the table in Azure SQL Database to Azure Synapse Analytics using staged copy and PolyBase.
Create the pipeline IterateAndCopySQLTables
1. In the left pane, click + (plus) , and click Pipeline .
2. In the General panel under Properties , specify IterateAndCopySQLTables for Name . Then collapse
the panel by clicking the Properties icon in the top-right corner.
3. Switch to the Parameters tab, and do the following actions:
a. Click + New .
b. Enter tableList for the parameter Name .
c. Select Array for Type .
4. In the Activities toolbox, expand Iteration & Conditions , and drag-drop the ForEach activity to the
pipeline design surface. You can also search for activities in the Activities toolbox.
a. In the General tab at the bottom, enter IterateSQLTables for Name .
b. Switch to the Settings tab, click the input box for Items , then click the Add dynamic content link
below.
c. On the Add Dynamic Content page, collapse the System Variables and Functions sections, select
tableList under Parameters , which automatically populates the top expression text box as
@pipeline().parameters.tableList . Then click Finish .
d. Switch to Activities tab, click the pencil icon to add a child activity to the ForEach activity.
5. In the Activities toolbox, expand Move & Transfer , and drag-drop the Copy data activity onto the pipeline
designer surface. Notice the breadcrumb menu at the top: IterateAndCopySQLTables is the pipeline
name and IterateSQLTables is the ForEach activity name. The designer is in the activity scope. To switch
back to the pipeline editor from the ForEach editor, you can click the link in the breadcrumb menu.
5. Drag-drop Execute Pipeline activity from the Activities toolbox to the pipeline designer surface, and set
the name to TriggerCopy .
6. To connect the Lookup activity to the Execute Pipeline activity, drag the green box attached to the
Lookup activity to the left edge of the Execute Pipeline activity.
7. Switch to the Settings tab of Execute Pipeline activity, and do the following steps:
a. Select IterateAndCopySQLTables for Invoked pipeline .
b. Clear the checkbox for Wait on completion .
c. In the Parameters section, click the input box under VALUE , select Add dynamic content ,
enter @activity('LookupTableList').output.value as the value, and then select Finish .
You're setting the result list from the Lookup activity as an input to the second pipeline. The result
list contains the list of tables whose data needs to be copied to the destination.
8. To validate the pipeline, click Validate on the toolbar. Confirm that there are no validation errors. To close
the Pipeline Validation Report , click >> .
9. To publish entities (datasets, pipelines, etc.) to the Data Factory service, click Publish all on top of the
window. Wait until the publishing succeeds.
3. To view the output of the Lookup activity, click the Output link next to the activity under the ACTIVITY
NAME column. You can maximize and restore the Output window. After reviewing, click X to close the
Output window.
{
"count": 9,
"value": [
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Customer"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Product"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductModelProductDescription"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "ProductCategory"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "Address"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "CustomerAddress"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderDetail"
},
{
"TABLE_SCHEMA": "SalesLT",
"TABLE_NAME": "SalesOrderHeader"
}
],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"effectiveIntegrationRuntimes": [
{
"name": "DefaultIntegrationRuntime",
"type": "Managed",
"location": "East US",
"billedDuration": 0,
"nodes": null
}
]
}
4. To switch back to the Pipeline Runs view, click All Pipeline runs link at the top of the breadcrumb
menu. Click IterateAndCopySQLTables link (under PIPELINE NAME column) to view activity runs of
the pipeline. Notice that there's one Copy activity run for each table in the Lookup activity output.
5. Confirm that the data was copied to the target Azure Synapse Analytics you used in this tutorial.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure Synapse Analytics, and Azure Storage linked services.
Create Azure SQL Database and Azure Synapse Analytics datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy
operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data incrementally from a source to a destination:
Copy data incrementally
Copy multiple tables in bulk by using Azure Data
Factory using PowerShell
5/6/2021 • 11 minutes to read
End-to-end workflow
In this scenario, we have a number of tables in Azure SQL Database that we want to copy to Azure Synapse
Analytics. Here is the logical sequence of steps in the workflow that happens in pipelines:
The first pipeline looks up the list of tables that need to be copied over to the sink data stores. Alternatively
you can maintain a metadata table that lists all the tables to be copied to the sink data store. Then, the
pipeline triggers another pipeline, which iterates over each table in the database and performs the data copy
operation.
The second pipeline performs the actual copy. It takes the list of tables as a parameter. For each table in the
list, copy the specific table in Azure SQL Database to the corresponding table in Azure Synapse Analytics
using staged copy via Blob storage and PolyBase for best performance. In this example, the first pipeline
passes the list of tables as a value for the parameter.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell.
Azure Storage account . The Azure Storage account is used as staging blob storage in the bulk copy
operation.
Azure SQL Database . This database contains the source data.
Azure Synapse Analytics . This data warehouse holds the data copied over from the SQL Database.
Prepare SQL Database and Azure Synapse Analytics
Prepare the source Azure SQL Database :
Create a database with the Adventure Works LT sample data in SQL Database by following Create a database in
Azure SQL Database article. This tutorial copies all the tables from this sample database to Azure Synapse
Analytics.
Prepare the sink Azure Synapse Analytics :
1. If you don't have an Azure Synapse Analytics workspace, see the Get started with Azure Synapse
Analytics article for steps to create one.
2. Create corresponding table schemas in Azure Synapse Analytics. You use Azure Data Factory to
migrate/copy data in a later step.
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:
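The selection command itself is not reproduced above; a minimal sketch is:
# Hedged sketch: switch to the subscription you want to use.
Select-AzSubscription -SubscriptionId "<SubscriptionId>"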
2. Run the Set-AzDataFactoryV2 cmdlet to create a data factory. Replace the placeholders with your own
values before executing the command.
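The cmdlet call is not reproduced here; a minimal sketch, assuming $resourceGroupName and $dataFactoryName are already defined and that East US is an acceptable location, is:
# Hedged sketch: create the data factory (the location value is an assumption; pick any supported region).
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName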
The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory
names must be globally unique.
To create Data Factory instances, you must be a Contributor or Administrator of the Azure
subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that
interest you on the following page, and then expand Analytics to locate Data Factory : Products
available by region. The data stores (Azure Storage, Azure SQL Database, etc.) and computes
(HDInsight, etc.) used by data factory can be in other regions.
IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your
Azure SQL Database before saving the file.
{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
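The deployment cmdlet that produces the output below is not shown; a hedged sketch, assuming the JSON above is saved as AzureSqlDatabaseLinkedService.json in the C:\ADFv2TutorialBulkCopy folder, is:
# Hedged sketch: deploy the linked service. The same pattern applies to the AzureSqlDWLinkedService
# and AzureStorageLinkedService definitions that follow.
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDatabaseLinkedService" -DefinitionFile ".\AzureSqlDatabaseLinkedService.json"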
LinkedServiceName : AzureSqlDatabaseLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
IMPORTANT
Replace <servername>, <databasename>, <username>@<servername> and <password> with values of your
Azure SQL Database before saving the file.
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
LinkedServiceName : AzureSqlDWLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDWLinkedService
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>"
}
}
}
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
Create datasets
In this tutorial, you create source and sink datasets, which specify the location where the data is stored:
Create a dataset for source SQL Database
1. Create a JSON file named AzureSqlDatabaseDataset.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content. The tableName value is a dummy one; later, you use a SQL query in the copy
activity to retrieve the data.
{
"name": "AzureSqlDatabaseDataset",
"properties": {
"type": "AzureSqlTable",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "dummy"
}
}
}
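Again, the deployment cmdlet is not reproduced; a hedged sketch is:
# Hedged sketch: create the source dataset from the JSON file saved above.
Set-AzDataFactoryV2Dataset -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "AzureSqlDatabaseDataset" -DefinitionFile ".\AzureSqlDatabaseDataset.json"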
DatasetName : AzureSqlDatabaseDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
{
"name": "AzureSqlDWDataset",
"properties": {
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "AzureSqlDWLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": {
"value": "@{dataset().DWTableName}",
"type": "Expression"
}
},
"parameters":{
"DWTableName":{
"type":"String"
}
}
}
}
DatasetName : AzureSqlDWDataset
ResourceGroupName : <resourceGroupname>
DataFactoryName : <dataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDwTableDataset
Create pipelines
In this tutorial, you create two pipelines:
Create the pipeline "IterateAndCopySQLTables"
This pipeline takes a list of tables as a parameter. For each table in the list, it copies data from the table in Azure
SQL Database to Azure Synapse Analytics using staged copy and PolyBase.
1. Create a JSON file named IterateAndCopySQLTables.json in the C:\ADFv2TutorialBulkCopy folder,
with the following content:
{
"name": "IterateAndCopySQLTables",
"properties": {
"activities": [
{
"name": "IterateSQLTables",
"type": "ForEach",
"typeProperties": {
"isSequential": "false",
"items": {
"value": "@pipeline().parameters.tableList",
"type": "Expression"
},
"activities": [
{
"name": "CopyData",
"description": "Copy data from Azure SQL Database to Azure Synapse
Analytics",
"type": "Copy",
"inputs": [
{
"referenceName": "AzureSqlDatabaseDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSqlDWDataset",
"type": "DatasetReference",
"parameters": {
"DWTableName": "[@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]"
}
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT * FROM [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]"
},
"sink": {
"type": "SqlDWSink",
"preCopyScript": "TRUNCATE TABLE [@{item().TABLE_SCHEMA}].
[@{item().TABLE_NAME}]",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}
]
}
}
],
"parameters": {
"tableList": {
"type": "Object"
}
}
}
}
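The cmdlet that deploys this pipeline is not shown; a hedged sketch is:
# Hedged sketch: create the IterateAndCopySQLTables pipeline from the JSON file saved above.
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -Name "IterateAndCopySQLTables" -DefinitionFile ".\IterateAndCopySQLTables.json"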
PipelineName : IterateAndCopySQLTables
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {IterateSQLTables}
Parameters : {[tableList,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}
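The step that starts the parent pipeline, and captures the run ID used by the monitoring script below, is not reproduced here. A hedged sketch, assuming the GetTableListAndTriggerCopyData pipeline has already been deployed in the same way, is:
# Hedged sketch: start the parent pipeline and capture its run ID for the monitoring loop that follows.
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineName "GetTableListAndTriggerCopyData"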
2. Run the following script to continuously check the run status of pipeline
GetTableListAndTriggerCopyData , and print out the final pipeline run and activity run result.
while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName
$DataFactoryName -PipelineRunId $runId
if ($run) {
if ($run.Status -ne 'InProgress') {
Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
Write-Host "Pipeline run details:" -foregroundcolor "Yellow"
$run
break
}
Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
}
Start-Sleep -Seconds 15
}
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
ActivityName : TriggerCopy
PipelineRunId : 0000000000-00000-0000-0000-000000000000
PipelineName : GetTableListAndTriggerCopyData
Input : {pipeline, parameters, waitOnCompletion}
Output : {pipelineRunId}
LinkedServiceName :
ActivityRunStart : 9/18/2017 4:07:11 PM
ActivityRunEnd : 9/18/2017 4:08:14 PM
DurationInMs : 62581
Status : Succeeded
Error : {errorCode, message, failureType, target}
3. You can get the run ID of the pipeline IterateAndCopySQLTables , and check the detailed activity run
result, as shown in the following output.
{
"pipelineRunId": "7514d165-14bf-41fb-b5fb-789bea6c9e58"
}
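The cmdlet that retrieves those child activity runs is not shown; a hedged sketch, using the pipelineRunId value returned above and a time window that covers the run, is:
# Hedged sketch: inspect the activity runs of the triggered IterateAndCopySQLTables run.
Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId "7514d165-14bf-41fb-b5fb-789bea6c9e58" -RunStartedAfter "2017/09/18" -RunStartedBefore "2017/09/19"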
4. Connect to your sink Azure Synapse Analytics and confirm that data has been copied from Azure SQL
Database properly.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create Azure SQL Database, Azure Synapse Analytics, and Azure Storage linked services.
Create Azure SQL Database and Azure Synapse Analytics datasets.
Create a pipeline to look up the tables to be copied and another pipeline to perform the actual copy
operation.
Start a pipeline run.
Monitor the pipeline and activity runs.
Advance to the following tutorial to learn about copying data incrementally from a source to a destination:
Copy data incrementally
Incrementally load data from a source data store to
a destination data store
3/5/2021 • 2 minutes to read
Loading new files only by using a time-partitioned folder or file name.
You can copy new files only, where files or folders have already been time partitioned with time-slice information
as part of the file or folder name (for example, /yyyy/mm/dd/file.csv). It is the most performant approach for
incrementally loading new files.
For step-by-step instructions, see the following tutorial:
Incrementally copy new files based on time partitioned folder or file name from Azure Blob storage to Azure
Blob storage
Next steps
Advance to the following tutorial:
Incrementally copy data from one table in Azure SQL Database to Azure Blob storage
Incrementally load data from Azure SQL Database
to Azure Blob storage using the Azure portal
7/7/2021 • 13 minutes to read
Overview
Here is the high-level solution diagram:
Prerequisites
Azure SQL Database . You use the database as the source data store. If you don't have a database in Azure
SQL Database, see Create a database in Azure SQL Database for steps to create one.
Azure Storage . You use the blob storage as the sink data store. If you don't have a storage account, see
Create a storage account for steps to create one. Create a container named adftutorial.
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer , right-click the database, and choose New
Query .
2. Run the following SQL command against your SQL database to create a table named data_source_table
as the data source store:
In this tutorial, you use LastModifytime as the watermark column. The sample data used in the data source
store (five rows with PersonID, Name, and LastModifytime values) is shown in the sketch that follows this section.
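The CREATE TABLE statement is not reproduced above. The following is a hedged sketch of a compatible schema and the sample rows that appear later in this tutorial, run here through the SqlServer module's Invoke-Sqlcmd cmdlet; you can equally run the embedded T-SQL directly in SQL Server Management Studio. The server, database, and credential values are placeholders.
# Hedged sketch: column names and types are inferred from the outputs shown later in this tutorial.
$createSourceTable = @'
create table data_source_table
(
    PersonID int,
    Name varchar(255),
    LastModifytime datetime
);
INSERT INTO data_source_table (PersonID, Name, LastModifytime)
VALUES
(1, 'aaaa', '9/1/2017 12:56:00 AM'),
(2, 'bbbb', '9/2/2017 5:23:00 AM'),
(3, 'cccc', '9/3/2017 2:36:00 AM'),
(4, 'dddd', '9/4/2017 3:21:00 AM'),
(5, 'eeee', '9/5/2017 8:06:00 AM');
'@
Invoke-Sqlcmd -ServerInstance "<server>.database.windows.net" -Database "<database>" -Username "<user>" -Password "<password>" -Query $createSourceTable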
Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime,
);
2. Set the default value of the high watermark with the table name of the source data store. In this tutorial, the
table name is data_source_table; a sketch of the seeding statement follows the output below.
Output:
TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000
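The statement that seeds this default value is not shown; a hedged sketch that produces the output above is:
# Hedged sketch: seed the watermark with a value older than any row in the source table.
Invoke-Sqlcmd -ServerInstance "<server>.database.windows.net" -Database "<database>" -Username "<user>" -Password "<password>" -Query "INSERT INTO watermarktable VALUES ('data_source_table','1/1/2010 12:00:00 AM')"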
Create a stored procedure in your SQL database
Run the following command to create a stored procedure that updates the watermark value; the pipeline calls it after each copy run:
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
10. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory user
interface (UI) in a separate tab.
Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. On the home page of Data Factory UI, click the Orchestrate tile.
2. In the General panel under Properties , specify IncrementalCopyPipeline for Name . Then collapse the
panel by clicking the Properties icon in the top-right corner.
3. Let's add the first lookup activity to get the old watermark value. In the Activities toolbox, expand
General , and drag-drop the Lookup activity to the pipeline designer surface. Change the name of the
activity to LookupOldWaterMarkActivity .
4. Switch to the Settings tab, and click + New for Source Dataset . In this step, you create a dataset to
represent data in the watermarktable . This table contains the old watermark that was used in the
previous copy operation.
5. In the New Dataset window, select Azure SQL Database , and click Continue . You see a new window
opened for the dataset.
6. In the Set properties window for the dataset, enter WatermarkDataset for Name .
7. For Linked Service , select New , and then do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name .
b. Select your server for Server name .
c. Select your Database name from the dropdown list.
d. Enter your User name & Password .
e. To test the connection to your SQL database, click Test connection .
f. Click Finish .
g. Confirm that AzureSqlDatabaseLinkedService is selected for Linked service .
h. Select Finish .
8. In the Connection tab, select [dbo].[watermarktable] for Table . If you want to preview data in the
table, click Preview data .
9. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline
in the tree view on the left. In the properties window for the Lookup activity, confirm that
WatermarkDataset is selected for the Source Dataset field.
10. In the Activities toolbox, expand General , and drag-drop another Lookup activity to the pipeline
designer surface, and set the name to LookupNewWaterMarkActivity in the General tab of the
properties window. This Lookup activity gets the new watermark value from the table with the source
data to be copied to the destination.
11. In the properties window for the second Lookup activity, switch to the Settings tab, and click New . You
create a dataset to point to the source table that contains the new watermark value (maximum value of
LastModifyTime).
12. In the New Dataset window, select Azure SQL Database , and click Continue .
13. In the Set properties window, enter SourceDataset for Name . Select
AzureSqlDatabaseLinkedService for Linked service .
14. Select [dbo].[data_source_table] for Table. You specify a query on this dataset later in the tutorial. The
query takes precedence over the table you specify in this step.
15. Select Finish .
16. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline
in the tree view on the left. In the properties window for the Lookup activity, confirm that
SourceDataset is selected for the Source Dataset field.
17. Select Query for the Use Query field, and enter the following query, which selects only the
maximum value of LastModifytime from data_source_table . Make sure you have also
checked First row only .
select MAX(LastModifytime) as NewWatermarkvalue from data_source_table
20. Select the Copy activity and confirm that you see the properties for the activity in the Properties
window.
21. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for the Use Query field.
c. Enter the following SQL query for the Query field.
select * from data_source_table where LastModifytime >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime
<= '@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'
22. Switch to the Sink tab, and click + New for the Sink Dataset field.
23. In this tutorial, the sink data store is of type Azure Blob Storage. Therefore, select Azure Blob Storage , and
click Continue in the New Dataset window.
24. In the Select Format window, select the format type of your data, and click Continue .
25. In the Set Properties window, enter SinkDataset for Name . For Linked Service , select + New . In this
step, you create a connection (linked service) to your Azure Blob storage .
26. In the New Linked Service (Azure Blob Storage) window, do the following steps:
a. Enter AzureStorageLinkedService for Name .
b. Select your Azure Storage account for Storage account name .
c. Select Test connection , and then click Finish .
27. In the Set Properties window, confirm that AzureStorageLinkedService is selected for Linked
service . Then select Finish .
28. Go to the Connection tab of SinkDataset and do the following steps:
a. For the File path field, enter adftutorial/incrementalcopy . adftutorial is the blob container name
and incrementalcopy is the folder name. This snippet assumes that you have a blob container
named adftutorial in your blob storage. Create the container if it doesn't exist, or set it to the name of
an existing one. Azure Data Factory automatically creates the output folder incrementalcopy if it
does not exist. You can also use the Browse button for the File path to navigate to a folder in a blob
container.
b. For the File part of the File path field, select Add dynamic content [Alt+P] , and then enter
@CONCAT('Incremental-', pipeline().RunId, '.txt') in the opened window. Then select Finish . The file
name is dynamically generated by using the expression. Each pipeline run has a unique ID. The Copy
activity uses the run ID to generate the file name.
29. Switch to the pipeline editor by clicking the pipeline tab at the top or by clicking the name of the pipeline
in the tree view on the left.
30. In the Activities toolbox, expand General , and drag-drop the Stored Procedure activity from the
Activities toolbox to the pipeline designer surface. Connect the green (Success) output of the Copy
activity to the Stored Procedure activity.
31. Select Stored Procedure Activity in the pipeline designer, change its name to
StoredProceduretoWriteWatermarkActivity .
32. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service .
33. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name , select usp_write_watermark .
b. To specify values for the stored procedure parameters, click Import parameter , and enter the
following values for the parameters:
NAME | TYPE | VALUE
LastModifiedtime | DateTime | @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
TableName | String | @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}
34. To validate the pipeline settings, click Validate on the toolbar. Confirm that there are no validation errors.
To close the Pipeline Validation Report window, click >>.
35. Publish entities (linked services, datasets, and pipelines) to the Azure Data Factory service by selecting the
Publish All button. Wait until you see a message that the publishing succeeded.
2. Open the output file and notice that all the data is copied from the data_source_table to the blob file.
1,aaaa,2017-09-01 00:56:00.0000000
2,bbbb,2017-09-02 05:23:00.0000000
3,cccc,2017-09-03 02:36:00.0000000
4,dddd,2017-09-04 03:21:00.0000000
5,eeee,2017-09-05 08:06:00.0000000
3. Check the latest value from watermarktable . You see that the watermark value was updated.
TABLENAME | WATERMARKVALUE
data_source_table | 2017-09-05 08:06:00.000
6,newdata,2017-09-06 02:23:00.0000000
7,newdata,2017-09-07 09:01:00.0000000
2. Check the latest value from watermarktable . You see that the watermark value was updated again.
sample output:
TABLENAME | WATERMARKVALUE
data_source_table | 2017-09-07 09:01:00.000
Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
Review results
Add more data to the source.
Run the pipeline again.
Monitor the second pipeline run
Review results from the second run
In this tutorial, the pipeline copied data from a single table in SQL Database to Blob storage. Advance to the
following tutorial to learn how to copy data from multiple tables in a SQL Server database to SQL Database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from Azure SQL Database
to Azure Blob storage using PowerShell
3/5/2021 • 13 minutes to read
Overview
Here is the high-level solution diagram:
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure SQL Database . You use the database as the source data store. If you don't have a database in Azure
SQL Database, see Create a database in Azure SQL Database for steps to create one.
Azure Storage . You use the blob storage as the sink data store. If you don't have a storage account, see
Create a storage account for steps to create one. Create a container named adftutorial.
Azure PowerShell . Follow the instructions in Install and configure Azure PowerShell.
Create a data source table in your SQL database
1. Open SQL Server Management Studio. In Server Explorer , right-click the database, and choose New
Query .
2. Run the following SQL command against your SQL database to create a table named data_source_table
as the data source store:
In this tutorial, you use LastModifytime as the watermark column. The data in the data source store is
shown in the following table:
Create another table in your SQL database to store the high watermark value
1. Run the following SQL command against your SQL database to create a table named watermarktable to
store the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime,
);
2. Set the default value of the high watermark with the table name of source data store. In this tutorial, the
table name is data_source_table.
Output:
TableName | WatermarkValue
---------- | --------------
data_source_table | 2010-01-01 00:00:00.000
Create a stored procedure in your SQL database to update the watermark value:
CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(50)
AS
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.
IMPORTANT
Update the data factory name to make it globally unique. An example is ADFTutorialFactorySP1127.
$dataFactoryName = "ADFIncCopyTutorialFactory";
5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:
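The exact command is not reproduced here; a minimal sketch, assuming the $resourceGroupName and $dataFactoryName variables set above and that East US is an acceptable location, is:
# Hedged sketch: create the data factory (the location value is an assumption; pick any supported region).
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location "East US" -Name $dataFactoryName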
The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names
must be globally unique.
To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory : Products available by region.
The data stores (Storage, SQL Database, Azure SQL Managed Instance, and so on) and computes (Azure
HDInsight, etc.) used by the data factory can be in other regions.
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=
<database>; Persist Security Info=False; User ID=<user> ; Password=<password>;
MultipleActiveResultSets = False; Encrypt = True; TrustServerCertificate = False; Connection Timeout
= 30;"
}
}
}
Create datasets
In this step, you create datasets to represent source and sink data.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
In this tutorial, you use the table name data_source_table. Replace it if you use a table with a different
name.
2. Run the Set-AzDataFactor yV2Dataset cmdlet to create the dataset SourceDataset.
DatasetName : SourceDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
IMPORTANT
This snippet assumes that you have a blob container named adftutorial in your blob storage. Create the
container if it doesn't exist, or set it to the name of an existing one. The output folder incrementalcopy is
automatically created if it doesn't exist in the container. In this tutorial, the file name is dynamically generated by
using the expression @CONCAT('Incremental-', pipeline().RunId, '.txt') .
DatasetName : SinkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
2. Run the Set-AzDataFactor yV2Dataset cmdlet to create the dataset WatermarkDataset.
DatasetName : WatermarkDataset
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
Create a pipeline
In this tutorial, you create a pipeline with two Lookup activities, one Copy activity, and one StoredProcedure
activity chained in one pipeline.
1. Create a JSON file IncrementalCopyPipeline.json in the same folder with the following content:
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupOldWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from watermarktable"
},
"dataset": {
"referenceName": "WatermarkDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupNewWaterMarkActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select MAX(LastModifytime) as NewWatermarkvalue from
data_source_table"
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from data_source_table where LastModifytime >
"sqlReaderQuery": "select * from data_source_table where LastModifytime >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and LastModifytime <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupNewWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupOldWaterMarkActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},
{
"name": "StoredProceduretoWriteWatermarkActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "usp_write_watermark",
"storedProcedureParameters": {
"LastModifiedtime": {"value":
"@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}", "type": "datetime" },
"TableName": {
"value":"@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}", "type":"String"}
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}
2. Run the Set-AzDataFactor yV2Pipeline cmdlet to create the pipeline IncrementalCopyPipeline.
PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
Activities : {LookupOldWaterMarkActivity, LookupNewWaterMarkActivity,
IncrementalCopyActivity, StoredProceduretoWriteWatermarkActivity}
Parameters :
2. Check the status of the pipeline by running the Get-AzDataFactor yV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:42:42 AM
ActivityRunEnd : 9/14/2017 7:43:07 AM
DurationInMs : 25437
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:10 AM
ActivityRunEnd : 9/14/2017 7:43:29 AM
DurationInMs : 19769
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : d4bf3ce2-5d60-43f3-9318-923155f61037
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 7:43:32 AM
ActivityRunEnd : 9/14/2017 7:43:47 AM
DurationInMs : 14467
Status : Succeeded
Error : {errorCode, message, failureType, target}
2. Check the latest value from watermarktable . You see that the watermark value was updated.
TABLENAME | WATERMARKVALUE
Insert data into the data source store to verify delta data loading
1. Insert new data into the SQL database (data source store).
3. Check the status of the pipeline by running the Get-AzDataFactor yV2ActivityRun cmdlet until you see
all the activities running successfully. Replace placeholders with your own appropriate time for the
parameters RunStartedAfter and RunStartedBefore. In this tutorial, you use -RunStartedAfter
"2017/09/14" and -RunStartedBefore "2017/09/15".
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : LookupOldWaterMarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, dataset}
Output : {TableName, WatermarkValue}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:52:26 AM
ActivityRunEnd : 9/14/2017 8:52:52 AM
DurationInMs : 25497
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : IncrementalCopyActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {source, sink}
Output : {dataRead, dataWritten, rowsCopied, copyDuration...}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:00 AM
ActivityRunEnd : 9/14/2017 8:53:20 AM
DurationInMs : 20194
Status : Succeeded
Error : {errorCode, message, failureType, target}
ResourceGroupName : ADF
DataFactoryName : incrementalloadingADF
ActivityName : StoredProceduretoWriteWatermarkActivity
PipelineRunId : 2fc90ab8-d42c-4583-aa64-755dba9925d7
PipelineName : IncrementalCopyPipeline
Input : {storedProcedureName, storedProcedureParameters}
Output : {}
LinkedServiceName :
ActivityRunStart : 9/14/2017 8:53:23 AM
ActivityRunEnd : 9/14/2017 8:53:41 AM
DurationInMs : 18502
Status : Succeeded
Error : {errorCode, message, failureType, target}
4. In the blob storage, you see that another file was created. In this tutorial, the new file name is
Incremental-2fc90ab8-d42c-4583-aa64-755dba9925d7.txt . Open that file, and you see two rows of records in
it.
5. Check the latest value from watermarktable . You see that the watermark value was updated again.
TABLENAME | WATERMARKVALUE
Next steps
You performed the following steps in this tutorial:
Prepare the data store to store the watermark value.
Create a data factory.
Create linked services.
Create source, sink, and watermark datasets.
Create a pipeline.
Run the pipeline.
Monitor the pipeline run.
In this tutorial, the pipeline copied data from a single table in Azure SQL Database to Blob storage. Advance to
the following tutorial to learn how to copy data from multiple tables in a SQL Server database to SQL Database.
Incrementally load data from multiple tables in SQL Server to Azure SQL Database
Incrementally load data from multiple tables in SQL
Server to a database in Azure SQL Database using
the Azure portal
7/7/2021 • 17 minutes to read
Overview
Here are the important steps to create this solution:
1. Select the watermark column .
Select one column for each table in the source data store, which can be used to identify the new or
updated records for every run. Normally, the data in this selected column (for example, last_modify_time
or ID) keeps increasing when rows are created or updated. The maximum value in this column is used as
a watermark.
2. Prepare a data store to store the watermark value .
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities :
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter
to the pipeline. For each source table, it invokes the following activities to perform delta loading for that
table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than the new watermark value. Then, it copies the
delta data from the source data store to Azure Blob storage as a new file.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
SQL Server. You use a SQL Server database as the source data store in this tutorial.
Azure SQL Database . You use a database in Azure SQL Database as the sink data store. If you don't have a
database in SQL Database, see Create a database in Azure SQL Database for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio, and connect to your SQL Server database.
2. In Server Explorer, right-click the database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :
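The table definitions are not shown above. The following sketch matches the customer_table schema used in the PowerShell version of this tutorial later in this document; the project_table columns are inferred from the upsert stored procedure that appears later in this section:

create table customer_table
(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table
(
Project varchar(255),
Creationtime datetime
);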
Create another table in your database to store the high watermark value
1. Run the following SQL command against your database to create a table named watermarktable to store
the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Insert initial watermark values for both source tables into the watermark table.
3. Create a stored procedure named usp_write_watermark in the database that contains the watermark table. The
pipeline invokes it to update the watermark value after each copy run:

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(255)
AS
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
GO
Create data types and additional stored procedures in your database
Run the following query to create two stored procedures and two data types in your database. They're used to
merge the delta data from the source tables into the destination tables.

CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);
GO

CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name, LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO

CREATE TYPE DataTypeforProjectTable AS TABLE(
Project varchar(255),
Creationtime datetime
);
GO

CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY
AS
BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
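To verify the upsert logic outside of Data Factory, you can call one of the procedures directly with a table variable. This is an illustrative smoke test only; the @t variable and its sample row are hypothetical:

-- Hypothetical test call for usp_upsert_customer_table.
DECLARE @t AS DataTypeforCustomerTable;

INSERT INTO @t (PersonID, Name, LastModifytime)
VALUES (100, 'TestName', '2017-09-10 00:00:00');

EXEC usp_upsert_customer_table @customer_table = @t;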
4. Select your Azure subscription in which you want to create the data factory.
5. For the Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
6. Select V2 for the version .
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-
down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by data factory can be in other regions.
8. Click Create .
9. After the creation is complete, you see the Data Factory page as shown in the image.
10. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory user
interface (UI) in a separate tab.
2. Select Integration runtimes on the left pane, and then select +New .
3. In the Integration Runtime Setup window, select Perform data movement and dispatch
activities to external computes , and click Continue .
4. Select Self-Hosted , and click Continue .
5. Enter MySelfHostedIR for Name , and click Create .
6. Click Click here to launch the express setup for this computer in the Option 1: Express setup
section.
8. In the Web browser, in the Integration Runtime Setup window, click Finish .
9. Confirm that you see MySelfHostedIR in the list of integration runtimes.
2. In the New Linked Service window, select SQL Server, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter SqlServerLinkedService for Name.
b. Select MySelfHostedIR for Connect via integration runtime. This is an important step. The
default integration runtime cannot connect to an on-premises data store. Use the self-hosted
integration runtime you created earlier.
c. For Server name, enter the name of your computer that has the SQL Server database.
d. For Database name, enter the name of the database in your SQL Server instance that has the source data. You
created a table and inserted data into this database as part of the prerequisites.
e. For Authentication type, select the type of authentication you want to use to connect to the
database.
f. For User name, enter the name of the user that has access to the SQL Server database. If you need to use
a slash character ( \ ) in your user account or server name, use the escape character ( \ ). An example
is mydomain\\myuser.
g. For Password, enter the password for the user.
h. To test whether Data Factory can connect to your SQL Server database, click Test connection. Fix any
errors until the connection succeeds.
i. To save the linked service, click Finish.
Create the Azure SQL Database linked service
In the previous step, you created a linked service to link your source SQL Server database to the data factory. In this
step, you link your destination/sink database to the data factory.
1. In the Connections window, switch from the Integration Runtimes tab to the Linked Services tab, and
click + New.
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for Name.
b. For Server name, select the name of your server from the drop-down list.
c. For Database name, select the database in which you created customer_table and project_table as
part of the prerequisites.
d. For User name, enter the name of the user that has access to the database.
e. For Password, enter the password for the user.
f. To test whether Data Factory can connect to the database, click Test connection. Fix any
errors until the connection succeeds.
g. To save the linked service, click Finish.
4. Confirm that you see two linked services in the list.
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
watermark.
Create a source dataset
1. In the left pane, click + (plus) , and click Dataset .
2. In the New Dataset window, select SQL Server, and click Continue.
3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the
tree view. In the General tab of the Properties window at the bottom, enter SourceDataset for Name .
4. Switch to the Connection tab in the Properties window, and select SqlServerLinkedService for
Linked service. You do not select a table here. The Copy activity in the pipeline uses a SQL query to load
the data rather than load the entire table.
6. Select the ForEach activity in the pipeline if it isn't already selected. Click the Edit (Pencil icon) button.
7. In the Activities toolbox, expand General , drag-drop the Lookup activity to the pipeline designer
surface, and enter LookupOldWaterMarkActivity for Name .
8. Switch to the Settings tab of the Properties window, and do the following steps:
a. Select WatermarkDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query:

select * from watermarktable where TableName = '@{item().TABLE_NAME}'
9. Drag-drop the Lookup activity from the Activities toolbox, and enter LookupNewWaterMarkActivity
for Name .
10. Switch to the Settings tab.
a. Select SourceDataset for Source Dataset .
b. Select Query for Use Query.
c. Enter the following SQL query for Query:

select MAX(@{item().WaterMark_Column}) as NewWatermarkvalue from @{item().TABLE_NAME}
11. Drag-drop the Copy activity from the Activities toolbox, and enter IncrementalCopyActivity for
Name .
12. Connect Lookup activities to the Copy activity one by one. To connect, start dragging at the green box
attached to the Lookup activity and drop it on the Copy activity. Release the mouse button when the
border color of the Copy activity changes to blue .
13. Select the Copy activity in the pipeline. Switch to the Source tab in the Properties window.
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query:

select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} > '@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and @{item().WaterMark_Column} <= '@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'
14. Switch to the Sink tab, and select SinkDataset for Sink Dataset .
15. Do the following steps:
a. In the Dataset properties, for the SinkTableName parameter, enter @{item().TABLE_NAME}.
b. For Stored Procedure Name property, enter @{item().StoredProcedureNameForMergeOperation} .
c. For Table type property, enter @{item().TableType} .
d. For Table type parameter name , enter @{item().TABLE_NAME} .
16. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer
surface. Connect the Copy activity to the Stored Procedure activity.
17. Select the Stored Procedure activity in the pipeline, and enter
StoredProceduretoWriteWatermarkActivity for Name in the General tab of the Properties
window.
18. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked Service.
19. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select [dbo].[usp_write_watermark].
b. Select Import parameter.
c. Specify the following values for the parameters:

NAME               TYPE       VALUE
LastModifiedtime   DateTime   @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
TableName          String     @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}
20. Select Publish All to publish the entities you created to the Data Factory service.
21. Wait until you see the Successfully published message. To see the notifications, click the Show
Notifications link. Close the notifications window by clicking X .
To run the pipeline, trigger a pipeline run and pass the following JSON array as the value of the tableList
pipeline parameter:
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
Monitor the pipeline
1. Switch to the Monitor tab on the left. You see the pipeline run triggered by the manual trigger . You can
use links under the PIPELINE NAME column to view activity details and to rerun the pipeline.
2. To see activity runs associated with the pipeline run, select the link under the PIPELINE NAME column.
For details about the activity runs, select the Details link (eyeglasses icon) under the ACTIVITY NAME
column.
3. Select All pipeline runs at the top to go back to the Pipeline Runs view. To refresh the view, select
Refresh .
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
Query
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000
Notice that the watermark values for both tables were updated.
Update existing data and add a row in the source SQL Server database by running the following queries. The
new project row appears in the results later in this section:

UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3

INSERT INTO project_table (Project, Creationtime)
VALUES ('NewProject', '2017-10-01 00:00:00');
Rerun the pipeline, passing the same value for the tableList parameter:
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Notice the new values of Name and LastModifytime for the row with PersonID 3.
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000
Notice that the watermark values for both tables were updated.
Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn how to incrementally load data by using Change Tracking technology:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from multiple tables in SQL
Server to Azure SQL Database using PowerShell
7/7/2021 • 18 minutes to read
Overview
Here are the important steps to create this solution:
1. Select the watermark column .
Select one column for each table in the source data store, which you can use to identify the new or updated
records for every run. Normally, the data in this selected column (for example, last_modify_time or ID)
keeps increasing when rows are created or updated. The maximum value in this column is used as a
watermark.
2. Prepare a data store to store the watermark value .
In this tutorial, you store the watermark value in a SQL database.
3. Create a pipeline with the following activities :
a. Create a ForEach activity that iterates through a list of source table names that is passed as a parameter
to the pipeline. For each source table, it invokes the following activities to perform delta loading for that
table.
b. Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the
second Lookup activity to retrieve the new watermark value. These watermark values are passed to the
Copy activity.
c. Create a Copy activity that copies rows from the source data store with the value of the watermark
column greater than the old watermark value and less than the new watermark value. Then, it copies the
delta data from the source data store to Azure Blob storage as a new file.
d. Create a StoredProcedure activity that updates the watermark value for the pipeline that runs next time.
Here is the high-level solution diagram:
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
SQL Server. You use a SQL Server database as the source data store in this tutorial.
Azure SQL Database . You use a database in Azure SQL Database as the sink data store. If you don't have a
SQL database, see Create a database in Azure SQL Database for steps to create one.
Create source tables in your SQL Server database
1. Open SQL Server Management Studio (SSMS) or Azure Data Studio, and connect to your SQL Server
database.
2. In Server Explorer (SSMS) or in the Connections pane (Azure Data Studio), right-click the
database and choose New Query.
3. Run the following SQL command against your database to create tables named customer_table and
project_table :
create table customer_table
(
PersonID int,
Name varchar(255),
LastModifytime datetime
);

create table project_table
(
Project varchar(255),
Creationtime datetime
);
Create another table in Azure SQL Database to store the high watermark value
1. Run the following SQL command against your database to create a table named watermarktable to store
the watermark value:
create table watermarktable
(
TableName varchar(255),
WatermarkValue datetime
);
2. Insert initial watermark values for both source tables into the watermark table.
3. Create a stored procedure named usp_write_watermark in the database that contains the watermark table. The
pipeline invokes it to update the watermark value after each copy run:

CREATE PROCEDURE usp_write_watermark @LastModifiedtime datetime, @TableName varchar(255)
AS
BEGIN
UPDATE watermarktable
SET [WatermarkValue] = @LastModifiedtime
WHERE [TableName] = @TableName
END
Create data types and additional stored procedures in Azure SQL Database
Run the following query to create two stored procedures and two data types in your database. They're used to
merge the data from source tables into destination tables.
To keep the tutorial easy to start with, these stored procedures receive the delta data through a table variable
and then merge it into the destination store. Be aware that this approach does not expect a "large" number of
delta rows (more than 100) to be stored in the table variable.
If you do need to merge a large number of delta rows into the destination store, we suggest that you use a copy
activity to copy all the delta data into a temporary "staging" table in the destination store first, and then build
your own stored procedure, without a table variable, to merge the rows from the "staging" table into the "final"
table.
CREATE TYPE DataTypeforCustomerTable AS TABLE(
PersonID int,
Name varchar(255),
LastModifytime datetime
);
GO

CREATE PROCEDURE usp_upsert_customer_table @customer_table DataTypeforCustomerTable READONLY
AS
BEGIN
MERGE customer_table AS target
USING @customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
UPDATE SET Name = source.Name, LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
INSERT (PersonID, Name, LastModifytime)
VALUES (source.PersonID, source.Name, source.LastModifytime);
END
GO

CREATE TYPE DataTypeforProjectTable AS TABLE(
Project varchar(255),
Creationtime datetime
);
GO

CREATE PROCEDURE usp_upsert_project_table @project_table DataTypeforProjectTable READONLY
AS
BEGIN
MERGE project_table AS target
USING @project_table AS source
ON (target.Project = source.Project)
WHEN MATCHED THEN
UPDATE SET Creationtime = source.Creationtime
WHEN NOT MATCHED THEN
INSERT (Project, Creationtime)
VALUES (source.Project, source.Creationtime);
END
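If you follow the staging-table approach described above instead, the merge can run directly between two permanent tables and no table variable is needed. A minimal sketch, assuming a hypothetical staging table named stage_customer_table that a copy activity loads first:

-- Hypothetical staging-table variant of the customer upsert.
MERGE customer_table AS target
USING stage_customer_table AS source
ON (target.PersonID = source.PersonID)
WHEN MATCHED THEN
    UPDATE SET Name = source.Name, LastModifytime = source.LastModifytime
WHEN NOT MATCHED THEN
    INSERT (PersonID, Name, LastModifytime)
    VALUES (source.PersonID, source.Name, source.LastModifytime);

-- Clear the staging table so the next run starts empty.
TRUNCATE TABLE stage_customer_table;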
Azure PowerShell
Install the latest Azure PowerShell modules by following the instructions in Install and configure Azure
PowerShell.
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you might not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.
IMPORTANT
Update the data factory name to make it globally unique. An example is ADFIncMultiCopyTutorialFactorySP1127.
$dataFactoryName = "ADFIncMultiCopyTutorialFactory";
5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:
To create Data Factory instances, the user account you use to sign in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Azure Storage, SQL Database, SQL Managed Instance, and so on) and computes (Azure
HDInsight, etc.) used by the data factory can be in other regions.
$integrationRuntimeName = "ADFTutorialIR"
3. To retrieve the status of the created integration runtime, run the following command. Confirm that the
value of the State property is set to NeedRegistration .
State : NeedRegistration
Version :
CreateTime : 9/24/2019 6:00:00 AM
AutoUpdate : On
ScheduledUpdateDate :
UpdateDelayOffset :
LocalTimeZoneOffset :
InternalChannelEncryption :
Capabilities : {}
ServiceUrls : {eu.frontend.clouddatahub.net}
Nodes : {}
Links : {}
Name : ADFTutorialIR
Type : SelfHosted
ResourceGroupName : <ResourceGroup name>
DataFactoryName : <DataFactory name>
Description :
Id : /subscriptions/<subscription ID>/resourceGroups/<ResourceGroup
name>/providers/Microsoft.DataFactory/factories/<DataFactory name>/integrationruntimes/<Integration
Runtime name>
4. To retrieve the authentication keys used to register the self-hosted integration runtime with Azure Data
Factory service in the cloud, run the following command:
{
"AuthKey1": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=",
"AuthKey2": "IR@0000000000-0000-0000-0000-
000000000000@xy0@xy@yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy="
}
5. Copy one of the keys (exclude the double quotation marks) used to register the self-hosted integration
runtime that you install on your machine in the following steps.
NOTE
Make a note of the values for authentication type, server, database, user, and password. You use them later in this
tutorial.
{
"name":"SqlServerLinkedService",
"properties":{
"annotations":[
],
"type":"SqlServer",
"typeProperties":{
"connectionString":"integrated security=False;data source=<servername>;initial catalog=
<database name>;user id=<username>;Password=<password>"
},
"connectVia":{
"referenceName":"<integration runtime name>",
"type":"IntegrationRuntimeReference"
}
}
}
{
"name":"SqlServerLinkedService",
"properties":{
"annotations":[
],
"type":"SqlServer",
"typeProperties":{
"connectionString":"integrated security=True;data source=<servername>;initial catalog=
<database name>",
"userName":"<username> or <domain>\\<username>",
"password":{
"type":"SecureString",
"value":"<password>"
}
},
"connectVia":{
"referenceName":"<integration runtime name>",
"type":"IntegrationRuntimeReference"
}
}
}
IMPORTANT
Select the right section based on the authentication you use to connect to SQL Server.
Replace <integration runtime name> with the name of your integration runtime.
Replace <servername>, <databasename>, <username>, and <password> with values of your SQL Server
database before you save the file.
If you need to use a slash character ( \ ) in your user account or server name, use the escape character ( \ ).
An example is mydomain\\myuser .
3. Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
SqlServerLinkedService. In the following example, you pass values for the ResourceGroupName and
DataFactoryName parameters:
LinkedServiceName : SqlServerLinkedService
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerLinkedService
{
"name":"AzureSQLDatabaseLinkedService",
"properties":{
"annotations":[
],
"type":"AzureSqlDatabase",
"typeProperties":{
"connectionString":"integrated security=False;encrypt=True;connection timeout=30;data
source=<servername>.database.windows.net;initial catalog=<database name>;user id=<user
name>;Password=<password>;"
}
}
}
2. In PowerShell, run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service
AzureSQLDatabaseLinkedService.
LinkedServiceName : AzureSQLDatabaseLinkedService
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlDatabaseLinkedService
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
watermark.
Create a source dataset
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name":"SourceDataset",
"properties":{
"linkedServiceName":{
"referenceName":"SqlServerLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[
],
"type":"SqlServerTable",
"schema":[
]
}
}
The Copy activity in the pipeline uses a SQL query to load the data rather than load the entire table.
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset SourceDataset.
DatasetName : SourceDataset
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.SqlServerTableDataset
Create a sink dataset. Create a JSON file named SinkDataset.json in the same folder with the following content.
The SinkTableName dataset parameter receives the table name from the pipeline for each iteration:

{
"name":"SinkDataset",
"properties":{
"linkedServiceName":{ "referenceName":"AzureSQLDatabaseLinkedService", "type":"LinkedServiceReference" },
"parameters":{ "SinkTableName":{ "type":"String" } },
"annotations":[
],
"type":"AzureSqlTable",
"typeProperties":{
"tableName":{
"value":"@dataset().SinkTableName",
"type":"Expression"
}
}
}
}
DatasetName : SinkDataset
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
{
"name": " WatermarkDataset ",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "watermarktable"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
DatasetName : WatermarkDataset
ResourceGroupName : <ResourceGroupName>
DataFactoryName : <DataFactoryName>
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
Create a pipeline
The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table
names and performs the following operations:
1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in
the last iteration).
2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark
column in the source table).
3. Use the Copy activity to copy data between these two watermark values from the source database to
the destination database.
4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the
next iteration.
Create the pipeline
1. Create a JSON file named IncrementalCopyPipeline.json in the same folder with the following
content:
{
"name":"IncrementalCopyPipeline",
"properties":{
"activities":[
{
"name":"IterateSQLTables",
"type":"ForEach",
"dependsOn":[
],
"userProperties":[
],
"typeProperties":{
"items":{
"value":"@pipeline().parameters.tableList",
"type":"Expression"
},
"isSequential":false,
"activities":[
{
"name":"LookupOldWaterMarkActivity",
"type":"Lookup",
"dependsOn":[
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"source":{
"type":"AzureSqlSource",
"sqlReaderQuery":{
"value":"select * from watermarktable where TableName =
'@{item().TABLE_NAME}'",
"type":"Expression"
}
},
"dataset":{
"referenceName":"WatermarkDataset",
"type":"DatasetReference"
}
}
},
{
"name":"LookupNewWaterMarkActivity",
"type":"Lookup",
"dependsOn":[
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"source":{
"type":"SqlServerSource",
"sqlReaderQuery":{
"value":"select MAX(@{item().WaterMark_Column}) as
NewWatermarkvalue from @{item().TABLE_NAME}",
"type":"Expression"
}
},
"dataset":{
"referenceName":"SourceDataset",
"type":"DatasetReference"
},
"firstRowOnly":true
}
},
{
"name":"IncrementalCopyActivity",
"type":"Copy",
"dependsOn":[
{
"activity":"LookupOldWaterMarkActivity",
"dependencyConditions":[
"Succeeded"
]
},
{
"activity":"LookupNewWaterMarkActivity",
"dependencyConditions":[
"Succeeded"
]
}
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"source":{
"type":"SqlServerSource",
"sqlReaderQuery":{
"value":"select * from @{item().TABLE_NAME} where
@{item().WaterMark_Column} >
'@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and
@{item().WaterMark_Column} <=
'@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'",
"type":"Expression"
}
},
"sink":{
"type":"AzureSqlSink",
"sqlWriterStoredProcedureName":{
"value":"@{item().StoredProcedureNameForMergeOperation}",
"type":"Expression"
},
"sqlWriterTableType":{
"value":"@{item().TableType}",
"type":"Expression"
},
"storedProcedureTableTypeParameterName":{
"value":"@{item().TABLE_NAME}",
"type":"Expression"
},
"disableMetricsCollection":false
},
"enableStaging":false
},
"inputs":[
{
"referenceName":"SourceDataset",
"type":"DatasetReference"
}
],
"outputs":[
{
"referenceName":"SinkDataset",
"type":"DatasetReference",
"parameters":{
"SinkTableName":{
"value":"@{item().TABLE_NAME}",
"type":"Expression"
}
}
}
]
},
{
"name":"StoredProceduretoWriteWatermarkActivity",
"type":"SqlServerStoredProcedure",
"dependsOn":[
{
"activity":"IncrementalCopyActivity",
"dependencyConditions":[
"Succeeded"
]
}
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"storedProcedureName":"[dbo].[usp_write_watermark]",
"storedProcedureParameters":{
"LastModifiedtime":{
"value":{
"value":"@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}",
"type":"Expression"
},
"type":"DateTime"
},
"TableName":{
"value":{
"value":"@{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}",
"type":"Expression"
},
"type":"String"
}
}
},
"linkedServiceName":{
"referenceName":"AzureSQLDatabaseLinkedService",
"type":"LinkedServiceReference"
}
}
]
}
}
],
"parameters":{
"tableList":{
"type":"array"
}
},
"annotations":[
]
}
}
{
"tableList":
[
{
"TABLE_NAME": "customer_table",
"WaterMark_Column": "LastModifytime",
"TableType": "DataTypeforCustomerTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
},
{
"TABLE_NAME": "project_table",
"WaterMark_Column": "Creationtime",
"TableType": "DataTypeforProjectTable",
"StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
}
]
}
7. When you select the link in the Actions column, you see all the activity runs for the pipeline.
8. To go back to the Pipeline Runs view, select All Pipeline Runs .
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 Alice 2017-09-03 02:36:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
Query
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-05 08:06:00.000
project_table 2017-03-04 05:16:00.000
Notice that the watermark values for both tables were updated.
1. Update existing data and add a row in the source SQL Server database by running the following queries, and
then rerun the pipeline. The new project row appears in the results later in this section:

UPDATE customer_table
SET [LastModifytime] = '2017-09-08T00:00:00Z', [name]='NewName' where [PersonID] = 3

INSERT INTO project_table (Project, Creationtime)
VALUES ('NewProject', '2017-10-01 00:00:00');
2. Monitor the pipeline runs by following the instructions in the Monitor the pipeline section. When the
pipeline status is In Progress , you see another action link under Actions to cancel the pipeline run.
3. Select Refresh to refresh the list until the pipeline run succeeds.
4. Optionally, select the View Activity Runs link under Actions to see all the activity runs associated with
this pipeline run.
Output
===========================================
PersonID Name LastModifytime
===========================================
1 John 2017-09-01 00:56:00.000
2 Mike 2017-09-02 05:23:00.000
3 NewName 2017-09-08 00:00:00.000
4 Andy 2017-09-04 03:21:00.000
5 Anny 2017-09-05 08:06:00.000
Notice the new values of Name and LastModifytime for the row with PersonID 3.
Query
Output
===================================
Project Creationtime
===================================
project1 2015-01-01 00:00:00.000
project2 2016-02-02 01:23:00.000
project3 2017-03-04 05:16:00.000
NewProject 2017-10-01 00:00:00.000
Output
======================================
TableName WatermarkValue
======================================
customer_table 2017-09-08 00:00:00.000
project_table 2017-10-01 00:00:00.000
Notice that the watermark values for both tables were updated.
Next steps
You performed the following steps in this tutorial:
Prepare source and destination data stores.
Create a data factory.
Create a self-hosted integration runtime (IR).
Install the integration runtime.
Create linked services.
Create source, sink, and watermark datasets.
Create, run, and monitor a pipeline.
Review the results.
Add or update data in source tables.
Rerun and monitor the pipeline.
Review the final results.
Advance to the following tutorial to learn how to incrementally load data by using Change Tracking technology:
Incrementally load data from Azure SQL Database to Azure Blob storage by using Change Tracking technology
Incrementally load data from Azure SQL Database
to Azure Blob Storage using change tracking
information using the Azure portal
7/7/2021 • 15 minutes to read
Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In
some cases, the changed data within a period in your source data store can be easily sliced up (for example,
LastModifyTime, CreationTime). In other cases, there is no explicit way to identify the delta data from the last time
you processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database
and SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with
SQL Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob
Storage. For more concrete information about SQL Change Tracking technology, see Change tracking in SQL
Server.
End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking
technology.
NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL
Database as the source data store. You can also use a SQL Server instance.
High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data
store (Azure SQL Database) to the destination data store (Azure Blob Storage).
2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure SQL Database . You use the database as the source data store. If you don't have a database in Azure
SQL Database, see the Create a database in Azure SQL Database article for steps to create one.
Azure Storage account . You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named
adftutorial .
Create a data source table in Azure SQL Database
1. Launch SQL Server Management Studio, and connect to SQL Database.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your database to create a table named data_source_table as
the data source store.
create table data_source_table
(
PersonID int NOT NULL,
Name varchar(255),
Age int,
PRIMARY KEY (PersonID)
);
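The tutorial assumes the table starts with a few rows. The following INSERT is an illustration that seeds the table with the five rows shown in the full-copy output later in this tutorial:

INSERT INTO data_source_table (PersonID, Name, Age)
VALUES
(1, 'aaaa', 21),
(2, 'bbbb', 24),
(3, 'cccc', 20),
(4, 'dddd', 26),
(5, 'eeee', 22);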
4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:
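The enable statements themselves are not shown above. The following is a minimal sketch that matches the two-day retention described in the note that follows; TRACK_COLUMNS_UPDATED is optional and shown here as an assumption. Replace <your database name> as described in the note:

-- Enable change tracking at the database level with two-day retention.
ALTER DATABASE <your database name>
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

-- Enable change tracking on the source table.
ALTER TABLE data_source_table
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON); -- TRACK_COLUMNS_UPDATED is optional (assumption)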
NOTE
Replace <your database name> with the name of the database in Azure SQL Database that has the
data_source_table.
The changed data is kept for two days in the current example. If you load the changed data for every three
days or more, some changed data is not included. You need to either change the value of
CHANGE_RETENTION to a bigger number or ensure that your period for loading the changed data is
within two days. For more information, see Enable change tracking for a database.
5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:
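A sketch of that query follows. The table definition matches table_store_ChangeTracking_version as shown in the PowerShell version of this tutorial later in this document, and the initial row stores the current change tracking version as the baseline:

create table table_store_ChangeTracking_version
(
TableName varchar(255),
SYS_CHANGE_VERSION BIGINT
);

DECLARE @ChangeTracking_version BIGINT;
SET @ChangeTracking_version = CHANGE_TRACKING_CURRENT_VERSION();

INSERT INTO table_store_ChangeTracking_version
VALUES ('data_source_table', @ChangeTracking_version);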
NOTE
If the data is not changed after you enabled the change tracking for SQL Database, the value of the change
tracking version is 0.
6. Run the following query to create a stored procedure in your database. The pipeline invokes this stored
procedure to update the change tracking version in the table you created in the previous step.
CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(255)
AS
BEGIN
UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName
END
Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory user
interface (UI) in a separate tab.
13. In the home page, switch to the Manage tab in the left panel as shown in the following image:
2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name .
c. Click Save .
Create Azure SQL Database linked service.
In this step, you link your database to the data factory.
1. Click Connections , and click + New .
2. In the New Linked Service window, select Azure SQL Database, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureSqlDatabaseLinkedService for the Name field.
b. Select your server for the Server name field.
c. Select your database for the Database name field.
d. Enter name of the user for the User name field.
e. Enter password for the user for the Password field.
f. Click Test connection to test the connection.
g. Click Save to save the linked service.
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a dataset to represent source data
In this step, you create a dataset to represent the source data.
1. In the treeview, click + (plus) , and click Dataset .
2. Select Azure SQL Database , and click Finish .
3. You see a new tab for configuring the dataset. You also see the dataset in the treeview. In the Properties
window, change the name of the dataset to SourceDataset.
4. Create another dataset, named SinkDataset, to represent the data that is copied to Azure Blob storage. Switch
to its Connection tab in the Properties window, and do the following steps:
a. Select AzureStorageLinkedService for Linked service.
b. Enter adftutorial/incchgtracking for the folder part of the filePath.
c. Enter @CONCAT('Incremental-', pipeline().RunId, '.txt') for the file part of the filePath.
2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the
Properties window, change the name of the pipeline to FullCopyPipeline.
3. In the Activities toolbox, expand Data Flow , and drag-drop the Copy activity to the pipeline designer
surface, and set the name FullCopyActivity .
4. Switch to the Source tab, and select SourceDataset for the Source Dataset field.
5. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.
6. To validate the pipeline definition, click Validate on the toolbar. Confirm that there is no validation error.
Close the Pipeline Validation Report by clicking >>.
7. To publish entities (linked services, datasets, and pipelines), click Publish . Wait until the publishing
succeeds.
9. You can also see notifications by clicking the Show Notifications button on the left. To close the
notifications window, click X .
Run the full copy pipeline
Click Trigger on the toolbar for the pipeline, and click Trigger Now.
Review the results. You see a file named incremental-<GUID>.txt in the incchgtracking folder of the adftutorial
container. It contains the full data copied from data_source_table:
1,aaaa,21
2,bbbb,24
3,cccc,20
4,dddd,26
5,eeee,22
Add a new row to the source table and update an existing row by running the following queries. The incremental
copy pipeline that you create next picks up both changes:

INSERT INTO data_source_table (PersonID, Name, Age) VALUES (6, 'new', 50);

UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1
2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the
Properties window, change the name of the pipeline to IncrementalCopyPipeline.
3. Expand General in the Activities toolbox, and drag-drop the Lookup activity to the pipeline designer
surface. Set the name of the activity to LookupLastChangeTrackingVersionActivity . This activity gets
the change tracking version used in the last copy operation that is stored in the table
table_store_ChangeTracking_version .
4. Switch to the Settings tab in the Properties window, and select ChangeTrackingDataset for the Source
Dataset field.
5. Drag-and-drop the Lookup activity from the Activities toolbox to the pipeline designer surface. Set the
name of the activity to LookupCurrentChangeTrackingVersionActivity . This activity gets the current
change tracking version.
6. Switch to the Settings tab in the Properties window, and do the following steps:
a. Select SourceDataset for the Source Dataset field.
b. Select Query for Use Query.
c. Enter the following SQL query for Query:

SELECT CHANGE_TRACKING_CURRENT_VERSION() as CurrentChangeTrackingVersion
7. Drag-drop the Copy activity from the Activities toolbox to the pipeline designer surface, and set the name of
the activity to IncrementalCopyActivity. This activity copies the rows between the old and new change tracking
versions to the sink data store.
8. Switch to the Source tab in the Properties window, and do the following steps:
a. Select SourceDataset for Source Dataset.
b. Select Query for Use Query.
c. Enter the following SQL query for Query.
select data_source_table.PersonID,data_source_table.Name,data_source_table.Age,
CT.SYS_CHANGE_VERSION, SYS_CHANGE_OPERATION from data_source_table RIGHT OUTER JOIN
CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as
CT on data_source_table.PersonID = CT.PersonID where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTracking
Version}
9. Switch to the Sink tab, and select SinkDataset for the Sink Dataset field.
10. Connect both Lookup activities to the Copy activity one by one. Drag the green button attached
to the Lookup activity to the Copy activity.
11. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer
surface. Set the name of the activity to StoredProceduretoUpdateChangeTrackingActivity . This
activity updates the change tracking version in the table_store_ChangeTracking_version table.
12. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked service.
13. Switch to the Stored Procedure tab, and do the following steps:
a. For Stored procedure name, select Update_ChangeTracking_Version.
b. Select Import parameter.
c. In the Stored procedure parameters section, specify the following values for the parameters:

NAME                     TYPE     VALUE
CurrentTrackingVersion   Int64    @{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion}
TableName                String   @{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}

14. Connect the Copy activity to the Stored Procedure activity.
15. Click Validate on the toolbar. Confirm that there are no validation errors. Close the Pipeline Validation
Report window by clicking >>.
16. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the
Publish All button. Wait until you see the Publishing succeeded message.
Run the incremental copy pipeline
1. Click Trigger on the toolbar for the pipeline, and click Trigger Now .
The file should have only the delta data from your database. The record with U is the updated row in the
database, and the record with I is the newly added row.
1,update,10,2,U
6,new,50,1,I
The first three columns are changed data from data_source_table. The last two columns are the metadata from
the change tracking system table. The fourth column is the SYS_CHANGE_VERSION for each changed row. The fifth
column is the operation: U = update, I = insert. For details about the change tracking information, see
CHANGETABLE.
==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I
Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally load data from Azure SQL Database
to Azure Blob Storage using change tracking
information using PowerShell
3/5/2021 • 14 minutes to read
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Overview
In a data integration solution, incrementally loading data after initial data loads is a widely used scenario. In
some cases, the changed data within a period in your source data store can be easily sliced up (for example,
LastModifyTime, CreationTime). In other cases, there is no explicit way to identify the delta data from the last time
you processed the data. The Change Tracking technology supported by data stores such as Azure SQL Database
and SQL Server can be used to identify the delta data. This tutorial describes how to use Azure Data Factory with
SQL Change Tracking technology to incrementally load delta data from Azure SQL Database into Azure Blob
Storage. For more concrete information about SQL Change Tracking technology, see Change tracking in SQL
Server.
End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Tracking
technology.
NOTE
Both Azure SQL Database and SQL Server support the Change Tracking technology. This tutorial uses Azure SQL
Database as the source data store. You can also use a SQL Server instance.
1. Initial loading of historical data (run once):
a. Enable Change Tracking technology in the source database in Azure SQL Database.
b. Get the initial value of SYS_CHANGE_VERSION in the database as the baseline to capture changed
data.
c. Load full data from the source database into an Azure blob storage.
2. Incremental loading of delta data on a schedule (run periodically after the initial loading of data):
a. Get the old and new SYS_CHANGE_VERSION values.
b. Load the delta data by joining the primary keys of changed rows (between two
SYS_CHANGE_VERSION values) from sys.change_tracking_tables with data in the source table,
and then move the delta data to the destination.
c. Update the SYS_CHANGE_VERSION for the delta loading next time.
High-level solution
In this tutorial, you create two pipelines that perform the following two operations:
1. Initial load: you create a pipeline with a copy activity that copies the entire data from the source data
store (Azure SQL Database) to the destination data store (Azure Blob Storage).
2. Incremental load: you create a pipeline with the following activities, and run it periodically.
a. Create two lookup activities to get the old and new SYS_CHANGE_VERSION from Azure SQL
Database and pass it to copy activity.
b. Create one copy activity to copy the inserted/updated/deleted data between the two
SYS_CHANGE_VERSION values from Azure SQL Database to Azure Blob Storage.
c. Create one stored procedure activity to update the value of SYS_CHANGE_VERSION for the next
pipeline run.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure PowerShell. Install the latest Azure PowerShell modules by following instructions in How to install and
configure Azure PowerShell.
Azure SQL Database . You use the database as the source data store. If you don't have a database in Azure
SQL Database, see the Create a database in Azure SQL Database article for steps to create one.
Azure Storage account . You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named
adftutorial .
Create a data source table in your database
1. Launch SQL Server Management Studio, and connect to SQL Database.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your database to create a table named data_source_table as
the data source store.
4. Enable Change Tracking mechanism on your database and the source table (data_source_table) by
running the following SQL query:
NOTE
Replace <your database name> with the name of your database that has the data_source_table.
The changed data is kept for two days in the current example. If you load the changed data for every three
days or more, some changed data is not included. You need to either change the value of
CHANGE_RETENTION to a bigger number or ensure that your period for loading the changed data is
within two days. For more information, see Enable change tracking for a database.
5. Create a new table and store the ChangeTracking_version with a default value by running the following
query:
create table table_store_ChangeTracking_version
(
TableName varchar(255),
SYS_CHANGE_VERSION BIGINT
);
NOTE
If the data is not changed after you enabled the change tracking for SQL Database, the value of the change
tracking version is 0.
6. Run the following query to create a stored procedure in your database. The pipeline invokes this stored
procedure to update the change tracking version in the table you created in the previous step.
CREATE PROCEDURE Update_ChangeTracking_Version @CurrentTrackingVersion BIGINT, @TableName varchar(255)
AS
BEGIN
UPDATE table_store_ChangeTracking_version
SET [SYS_CHANGE_VERSION] = @CurrentTrackingVersion
WHERE [TableName] = @TableName
END
Azure PowerShell
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$resourceGroupName variable, and run the command again.
IMPORTANT
Update the data factory name to be globally unique.
$dataFactoryName = "IncCopyChgTrackingDF";
5. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet:
The specified Data Factory name 'ADFIncCopyChangeTrackingTestFactory' is already in use. Data Factory
names must be globally unique.
To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
LinkedServiceName : AzureStorageLinkedService
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureStorageLinkedService
{
"name": "AzureSQLDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server = tcp:<server>.database.windows.net,1433;Initial Catalog=
<database name>; Persist Security Info=False; User ID=<user name>; Password=<password>;
MultipleActiveResultSets = False; Encrypt = True; TrustServerCertificate = False; Connection Timeout
= 30;"
}
}
}
2. In Azure PowerShell, run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked
service AzureSQLDatabaseLinkedService.
Create datasets
In this step, you create datasets to represent the data source, the data destination, and the place to store the
SYS_CHANGE_VERSION.
Create a source dataset
In this step, you create a dataset to represent the source data.
1. Create a JSON file named SourceDataset.json in the same folder with the following content:
{
"name": "SourceDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "data_source_table"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
DatasetName : SourceDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
You create the adftutorial container in your Azure Blob storage as part of the prerequisites. Create the
container if it does not exist, or set it to the name of an existing one. In this tutorial, the output file name
is dynamically generated by using the expression: @CONCAT('Incremental-', pipeline().RunId, '.txt').
2. Run the Set-AzDataFactoryV2Dataset cmdlet to create the dataset: SinkDataset
DatasetName : SinkDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureBlobDataset
{
"name": " ChangeTrackingDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": "table_store_ChangeTracking_version"
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
}
}
}
DatasetName : ChangeTrackingDataset
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Structure :
Properties : Microsoft.Azure.Management.DataFactory.Models.AzureSqlTableDataset
{
"name": "FullCopyPipeline",
"properties": {
"activities": [{
"name": "FullCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource"
},
"sink": {
"type": "BlobSink"
}
},
"inputs": [{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}],
"outputs": [{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}]
}]
}
}
PipelineName : FullCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {FullCopyActivity}
Parameters :
Run the full copy pipeline
Run the pipeline FullCopyPipeline by using the Invoke-AzDataFactoryV2Pipeline cmdlet.
3. Search for your data factory in the list of data factories, and select it to launch the Data factory page.
6. When you click the link in the Actions column, you see the following page that shows all the activity
runs for the pipeline.
7. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see a file named incremental-<GUID>.txt in the incchgtracking folder of the adftutorial container.
Add a new row to the source table and update an existing row by running the following queries. The incremental
copy pipeline that you create next picks up both changes:

INSERT INTO data_source_table (PersonID, Name, Age) VALUES (6, 'new', 50);

UPDATE data_source_table
SET [Age] = '10', [name]='update' where [PersonID] = 1
{
"name": "IncrementalCopyPipeline",
"properties": {
"activities": [
{
"name": "LookupLastChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select * from table_store_ChangeTracking_version"
},
"dataset": {
"referenceName": "ChangeTrackingDataset",
"type": "DatasetReference"
}
}
},
{
"name": "LookupCurrentChangeTrackingVersionActivity",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "SELECT CHANGE_TRACKING_CURRENT_VERSION() as
CurrentChangeTrackingVersion"
},
"dataset": {
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
}
},
{
"name": "IncrementalCopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderQuery": "select
data_source_table.PersonID,data_source_table.Name,data_source_table.Age, CT.SYS_CHANGE_VERSION,
SYS_CHANGE_OPERATION from data_source_table RIGHT OUTER JOIN CHANGETABLE(CHANGES data_source_table,
@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.SYS_CHANGE_VERSION}) as CT on
data_source_table.PersonID = CT.PersonID where CT.SYS_CHANGE_VERSION <=
@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersion
}"
},
"sink": {
"type": "BlobSink"
}
},
"dependsOn": [
{
"activity": "LookupLastChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
},
{
"activity": "LookupCurrentChangeTrackingVersionActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference"
}
]
},
{
"name": "StoredProceduretoUpdateChangeTrackingActivity",
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "Update_ChangeTracking_Version",
"storedProcedureParameters": {
"CurrentTrackingVersion": {
"value":
"@{activity('LookupCurrentChangeTrackingVersionActivity').output.firstRow.CurrentChangeTrackingVersio
n}",
"type": "INT64"
},
"TableName": {
"value":
"@{activity('LookupLastChangeTrackingVersionActivity').output.firstRow.TableName}",
"type": "String"
}
}
},
"linkedServiceName": {
"referenceName": "AzureSQLDatabaseLinkedService",
"type": "LinkedServiceReference"
"type": "LinkedServiceReference"
},
"dependsOn": [
{
"activity": "IncrementalCopyActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
]
}
}
PipelineName : IncrementalCopyPipeline
ResourceGroupName : ADFTutorialResourceGroup
DataFactoryName : IncCopyChgTrackingDF
Activities : {LookupLastChangeTrackingVersionActivity,
LookupCurrentChangeTrackingVersionActivity, IncrementalCopyActivity,
StoredProceduretoUpdateChangeTrackingActivity}
Parameters :
2. When you click the link in the Actions column, you see the following page that shows all the activity
runs for the pipeline.
3. To switch back to the Pipeline runs view, click Pipelines as shown in the image.
Review the results
You see the second file in the incchgtracking folder of the adftutorial container.
The file should have only the delta data from your database. The record with U is the updated row in the
database, and the record with I is the newly added row.
1,update,10,2,U
6,new,50,1,I
The first three columns are the changed data from data_source_table. The last two columns are metadata from the change tracking system table: the fourth column is the SYS_CHANGE_VERSION for each changed row, and the fifth column is the operation (U = update, I = insert). For details about the change tracking information, see CHANGETABLE.
==================================================================
PersonID Name Age SYS_CHANGE_VERSION SYS_CHANGE_OPERATION
==================================================================
1 update 10 2 U
6 new 50 1 I
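To inspect the same change tracking metadata directly in the database, you can query the CHANGETABLE function yourself. The following is a minimal sketch, not part of the original tutorial steps; passing 0 as the last-synced version returns every tracked change:

SELECT CT.PersonID, CT.SYS_CHANGE_VERSION, CT.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES data_source_table, 0) AS CT
ORDER BY CT.SYS_CHANGE_VERSION;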
Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally load data from Azure SQL Managed
Instance to Azure Storage using change data
capture (CDC)
7/7/2021 • 14 minutes to read • Edit Online
Overview
The Change Data Capture technology supported by data stores such as Azure SQL Managed Instance (MI) and
SQL Server can be used to identify changed data. This tutorial describes how to use Azure Data Factory with
SQL Change Data Capture technology to incrementally load delta data from Azure SQL Managed Instance into
Azure Blob Storage. For more information about SQL Change Data Capture technology, see Change data capture
in SQL Server.
End-to-end workflow
Here are the typical end-to-end workflow steps to incrementally load data using the Change Data Capture
technology.
NOTE
Both Azure SQL MI and SQL Server support the Change Data Capture technology. This tutorial uses Azure SQL Managed
Instance as the source data store. You can also use an on-premises SQL Server.
High-level solution
In this tutorial, you create a pipeline that performs the following operations:
1. Create a lookup activity to count the number of changed records in the SQL Database CDC table and pass
it to an IF Condition activity.
2. Create an If Condition to check whether there are changed records and if so, invoke the copy activity.
3. Create a copy activity to copy the inserted/updated/deleted data from the CDC table to Azure Blob Storage.
If you don't have an Azure subscription, create a free account before you begin.
Prerequisites
Azure SQL Database Managed Instance . You use the database as the source data store. If you don't
have an Azure SQL Database Managed Instance, see the Create an Azure SQL Database Managed Instance
article for steps to create one.
Azure Storage account . You use the blob storage as the sink data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one. Create a container named raw .
Create a data source table in Azure SQL Database
1. Launch SQL Server Management Studio, and connect to your Azure SQL Managed Instance.
2. In Server Explorer, right-click your database and choose New Query.
3. Run the following SQL command against your Azure SQL Managed Instance database to create a table
named customers as the data source store (the original table definition isn't shown in this excerpt; a sketch
appears after this step).
4. Enable Change Data Capture mechanism on your database and the source table (customers) by
running the following SQL query:
NOTE
Replace <your source schema name> with the schema of your Azure SQL MI that has the customers table.
Change data capture doesn't do anything as part of the transactions that change the table being tracked.
Instead, the insert, update, and delete operations are written to the transaction log. Data that is deposited in
change tables will grow unmanageably if you do not periodically and systematically prune the data. For more
information, see Enable Change Data Capture for a database
EXEC sys.sp_cdc_enable_db
EXEC sys.sp_cdc_enable_table
@source_schema = 'dbo',
@source_name = 'customers',
@role_name = 'null',
@supports_net_changes = 1
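This verification step isn't part of the original tutorial, but the catalog views it uses are standard SQL Server metadata. To confirm that CDC is now enabled for the database and the customers table, you can run:

-- 1 means enabled
SELECT name, is_cdc_enabled FROM sys.databases WHERE name = DB_NAME();
SELECT name, is_tracked_by_cdc FROM sys.tables WHERE name = 'customers';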
5. Insert data into the customers table by running the following command:
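The original sample rows aren't reproduced in this excerpt. A hedged example that uses the same columns, and IDs 1-3 so that the later inserts of customers 4-6 don't collide, would be (the names and emails below are placeholders):

insert into customers (customer_id, first_name, last_name, email, city) values
    (1, 'Alice', 'Example', '[email protected]', 'London'),
    (2, 'Bob', 'Sample', '[email protected]', 'Oxford'),
    (3, 'Carol', 'Demo', '[email protected]', 'Bristol');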
3. In the New data factory page, enter ADFTutorialDataFactory for the name.
The name of the Azure data factory must be globally unique. If you receive the following error, change
the name of the data factory (for example, yournameADFTutorialDataFactory) and try creating again. See the
Data Factory - Naming Rules article for naming rules for Data Factory artifacts.
Data factory name "ADFTutorialDataFactory" is not available.
4. Select V2 for the version .
5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group , do one of the following steps:
a. Select Use existing , and select an existing resource group from the drop-down list.
b. Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select the location for the data factory. Only locations that are supported are displayed in the drop-
down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used
by data factory can be in other regions.
8. De-select Enable GIT.
9. Click Create.
10. Once the deployment is complete, click Go to resource.
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory user
interface (UI) in a separate tab.
13. On the home page, switch to the Manage tab in the left panel as shown in the following image:
2. In the New Linked Service window, select Azure Blob Storage, and click Continue.
3. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name.
b. Select your Azure Storage account for Storage account name.
c. Click Save.
Create Azure SQL MI Database linked service.
In this step, you link your Azure SQL MI database to the data factory.
NOTE
If you are using SQL MI, see here for information about access via public versus private endpoint. If you are using a private
endpoint, you need to run this pipeline by using a self-hosted integration runtime. The same applies if you are running
SQL Server on-premises, in a VM, or in a VNet scenario.
Create datasets
In this step, you create datasets to represent data source and data destination.
Create a dataset to represent source data
In this step, you create a dataset to represent the source data.
1. In the treeview, click + (plus) , and click Dataset .
4. In the Set Properties tab, set the dataset name and connection information:
a. Select AzureStorageLinkedService for Linked service.
b. Enter raw for the container part of the filePath.
c. Enable First row as header.
d. Click OK.
Create a pipeline to copy the changed data
In this step, you create a pipeline, which first checks the number of changed records present in the change table
using a lookup activity . An IF condition activity checks whether the number of changed records is greater than
zero and runs a copy activity to copy the inserted/updated/deleted data from Azure SQL Database to Azure
Blob Storage. Lastly, a tumbling window trigger is configured and the start and end times will be passed to the
activities as the start and end window parameters.
1. In the Data Factory UI, switch to the Edit tab. Click + (plus) in the left pane, and click Pipeline .
2. You see a new tab for configuring the pipeline. You also see the pipeline in the treeview. In the
Properties window, change the name of the pipeline to IncrementalCopyPipeline.
3. Expand General in the Activities toolbox, and drag-drop the Lookup activity to the pipeline designer
surface. Set the name of the activity to GetChangeCount . This activity gets the number of records in the
change table for a given time window.
4. Switch to the Settings tab in the Properties window:
a. Specify the SQL MI dataset name for the Source Dataset field.
b. Select the Query option and enter the query into the query box (the original query isn't reproduced in this excerpt; a sketch follows below):
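The query from the original tutorial isn't included in this excerpt. A sketch that returns a changecount column, which is the name the If Condition expression below expects, could use the standard CDC table-valued functions. The capture instance name dbo_customers is an assumption based on the schema and table enabled earlier, and the original query parameterizes the time range with the trigger window start and end times rather than using GETDATE():

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_customers');
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE());
SELECT count(1) AS changecount
FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, 'all');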
6. Expand Iteration & conditionals in the Activities toolbox, and drag-drop the If Condition activity to
the pipeline designer surface. Set the name of the activity to HasChangedRows .
7. Switch to the Activities tab in the Properties window:
a. Enter the following Expression
@greater(int(activity('GetChangeCount').output.firstRow.changecount),0)
11. Click preview to verify that the query returns the changed rows correctly.
12. Switch to the Sink tab, and specify the Azure Storage dataset for the Sink Dataset field.
13. Click back to the main pipeline canvas and connect the Lookup activity to the If Condition activity one
by one. Drag the green button attached to the Lookup activity to the If Condition activity.
14. Click Validate on the toolbar. Confirm that there are no validation errors. Close the Pipeline Validation
Report window by clicking >>.
15. Click Debug to test the pipeline and verify that a file is generated in the storage location.
16. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the
Publish all button. Wait until you see the Publishing succeeded message.
3. Navigate to the Copy activity in the True case of the If Condition activity and click on the Source tab.
Copy the query into the query box (the original query isn't reproduced in this excerpt; a sketch follows below):
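The query from the original tutorial isn't included in this excerpt. A sketch of a source query that returns the changed rows for the trigger window follows; the capture instance name dbo_customers and the triggerEndTime parameter name are assumptions (the tutorial only shows triggerStartTime explicitly), so adjust them to match your pipeline parameters:

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', '@{pipeline().parameters.triggerStartTime}');
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', '@{pipeline().parameters.triggerEndTime}');
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_customers(@from_lsn, @to_lsn, 'all');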
4. Click on the Sink tab of the Copy activity and click Open to edit the dataset properties. Click on the
Parameters tab and add a new parameter called triggerStart.
5. Next, configure the dataset properties to store the data in a customers/incremental subdirectory with
date-based partitions.
a. Click on the Connection tab of the dataset properties and add dynamic content for both the
Directory and the File sections.
b. Enter the following expression in the Directory section by clicking on the dynamic content link under
the textbox:
the textbox:
@concat('customers/incremental/',formatDateTime(dataset().triggerStart,'yyyy/MM/dd'))
c. Enter the following expression in the File section. This will create file names based on the trigger start
date and time, suffixed with the csv extension:
@concat(formatDateTime(dataset().triggerStart,'yyyyMMddHHmmssfff'),'.csv')
d. Navigate back to the Sink settings in Copy activity by clicking on the IncrementalCopyPipeline tab.
e. Expand the dataset properties and enter dynamic content in the triggerStart parameter value with the
following expression:
@pipeline().parameters.triggerStartTime
6. Click Debug to test the pipeline and ensure the folder structure and output file is generated as expected.
Download and open the file to verify the contents.
7. Ensure the parameters are being injected into the query by reviewing the Input parameters of the
pipeline run.
8. Publish entities (linked services, datasets, and pipelines) to the Data Factory service by clicking the
Publish all button. Wait until you see the Publishing succeeded message.
9. Finally, configure a tumbling window trigger to run the pipeline at a regular interval and set start and end
time parameters.
a. Click the Add trigger button, and select New/Edit
b. Enter a trigger name and specify a start time, which is equal to the end time of the debug window
above.
c. On the next screen, specify the following values for the start and end parameters respectively.
@formatDateTime(trigger().outputs.windowStartTime,'yyyy-MM-dd HH:mm:ss.fff')
@formatDateTime(trigger().outputs.windowEndTime,'yyyy-MM-dd HH:mm:ss.fff')
NOTE
The trigger runs only after it has been published. Additionally, the expected behavior of a tumbling window trigger is to run
all historical intervals from the start date until now. More information about tumbling window triggers can be found
here.
10. Using SQL Server Management Studio, make some additional changes to the customers table by running
the following SQL:
insert into customers (customer_id, first_name, last_name, email, city) values (4, 'Farlie',
'Hadigate', '[email protected]', 'Reading');
insert into customers (customer_id, first_name, last_name, email, city) values (5, 'Anet', 'MacColm',
'[email protected]', 'Portsmouth');
insert into customers (customer_id, first_name, last_name, email, city) values (6, 'Elonore',
'Bearham', '[email protected]', 'Portsmouth');
update customers set first_name='Elon' where customer_id=6;
delete from customers where customer_id=5;
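If you want to see how these three inserts, one update, and one delete land in the CDC change table before the pipeline picks them up, you can query the change table directly. This isn't an original tutorial step; cdc.dbo_customers_CT is the default change table name for the dbo_customers capture instance, which is an assumption based on the earlier setup:

-- __$operation: 1 = delete, 2 = insert, 3 = update (before image), 4 = update (after image)
SELECT __$operation, customer_id, first_name, last_name, email, city
FROM cdc.dbo_customers_CT;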
11. Click the Publish all button. Wait until you see the Publishing succeeded message.
12. After a few minutes, the pipeline is triggered and a new file is loaded into Azure Storage.
Monitor the incremental copy pipeline
1. Click the Monitor tab on the left. You see the pipeline run in the list and its status. To refresh the list, click
Refresh . Hover near the name of the pipeline to access the Rerun action and Consumption report.
2. To view activity runs associated with the pipeline run, click the Pipeline name. If changed data was
detected, there will be three activities, including the copy activity; otherwise, there will be only two entries
in the list. To switch back to the pipeline runs view, click the All Pipelines link at the top.
Review the results
You see the second file in the customers/incremental/YYYY/MM/DD folder of the raw container.
Next steps
Advance to the following tutorial to learn about copying new and changed files only based on their
LastModifiedDate:
Copy new files by lastmodifieddate
Incrementally copy new and changed files based on
LastModifiedDate by using the Copy Data tool
7/20/2021 • 5 minutes to read • Edit Online
NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription : If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account : Use Blob storage for the source and sink data stores. If you don't have an Azure
Storage account, follow the instructions in Create a storage account.
9. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the
copy operation, on the Activity runs page, select the Details link (the eyeglasses icon) in the Activity
name column. For details about the properties, see Copy activity overview.
Because there are no files in the source container in your Blob storage account, you won't see any files
copied to the destination container in the account:
10. Create an empty text file and name it file1.txt . Upload this text file to the source container in your
storage account. You can use various tools to perform these tasks, like Azure Storage Explorer.
11. To go back to the Pipeline runs view, select All pipeline runs link in the breadcrumb menu on the
Activity runs page, and wait for the same pipeline to be automatically triggered again.
12. When the second pipeline run completes, follow the same steps mentioned previously to review the
activity run details.
You'll see that one file (file1.txt) has been copied from the source container to the destination container of
your Blob storage account:
13. Create another empty text file and name it file2.txt . Upload this text file to the source container in your
Blob storage account.
14. Repeat steps 11 and 12 for the second text file. You'll see that only the new file (file2.txt) was copied from
the source container to the destination container of your storage account during this pipeline run.
You can also verify that only one file has been copied by using Azure Storage Explorer to scan the files:
Next steps
Go to the following tutorial to learn how to transform data by using an Apache Spark cluster on Azure:
Transform data in the cloud by using an Apache Spark cluster
Incrementally copy new files based on time
partitioned file name by using the Copy Data tool
7/20/2021 • 5 minutes to read • Edit Online
NOTE
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription : If you don't have an Azure subscription, create a free account before you begin.
Azure storage account : Use Blob storage as the source and sink data store. If you don't have an Azure
storage account, see the instructions in Create a storage account.
Create two containers in Blob storage
Prepare your Blob storage for the tutorial by performing these steps.
1. Create a container named source . Create a folder path as 2021/07/15/06 in your container. Create an
empty text file, and name it as file1.txt . Upload the file1.txt to the folder path source/2021/07/15/06
in your storage account. You can use various tools to perform these tasks, such as Azure Storage Explorer.
NOTE
Adjust the folder name to match your UTC time. For example, if the current UTC time is 6:10 AM on July 15,
2021, create the folder path source/2021/07/15/06/ by the rule of
source/{Year}/{Month}/{Day}/{Hour}/.
2. Create a container named destination . You can use various tools to perform these tasks, such as Azure
Storage Explorer.
Create a data factory
1. On the left menu, select Create a resource > Integration > Data Factor y :
9. There's only one activity (the copy activity) in the pipeline, so you see only one entry. Adjust the column width
of the Source and Destination columns (if necessary) to display more details. You can see that the source
file (file1.txt) has been copied from source/2021/07/15/06/ to destination/2021/07/15/06/ with the
same file name.
You can also verify the same by using Azure Storage Explorer (https://storageexplorer.com/) to scan the
files.
10. Create another empty text file with the new name as file2.txt . Upload the file2.txt file to the folder path
source/2021/07/15/07 in your storage account. You can use various tools to perform these tasks, such
as Azure Storage Explorer.
NOTE
A new folder path is required. Adjust the folder name to match your UTC time. For example, if the current UTC
time is 7:30 AM on July 15, 2021, create the folder path source/2021/07/15/07/ by the rule of
{Year}/{Month}/{Day}/{Hour}/.
11. To go back to the Pipeline runs view, select All pipeline runs, and wait for the same pipeline to be
triggered again automatically after another hour.
12. Select the new DeltaCopyFromBlobPipeline link for the second pipeline run when it appears, and do the
same to review the details. You will see that the source file (file2.txt) has been copied from
source/2021/07/15/07/ to destination/2021/07/15/07/ with the same file name. You can also
verify this by using Azure Storage Explorer (https://storageexplorer.com/) to scan the files in the
destination container.
Next steps
Advance to the following tutorial to learn about transforming data by using a Spark cluster on Azure:
Transform data using Spark cluster in cloud
Copy data securely from Azure Blob storage to a
SQL database by using private endpoints
7/7/2021 • 11 minutes to read • Edit Online
NOTE
If you're new to Data Factory, see Introduction to Azure Data Factory.
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use Blob storage as a source data store. If you don't have a storage account,
see Create an Azure storage account for steps to create one. Ensure the storage account allows access only
from selected networks.
Azure SQL Database . You use the database as a sink data store. If you don't have an Azure SQL database,
see Create a SQL database for steps to create one. Ensure the SQL Database account allows access only from
selected networks.
Create a blob and a SQL table
Now, prepare your blob storage and SQL database for the tutorial by performing the following steps.
Create a source blob
1. Open Notepad. Copy the following text, and save it as an emp.txt file on your disk:
FirstName,LastName
John,Doe
Jane,Doe
2. Create a container named adftutorial in your blob storage. Create a folder named input in this
container. Then, upload the emp.txt file to the input folder. Use the Azure portal or tools such as Azure
Storage Explorer to do these tasks.
Create a sink SQL table
Use the following SQL script to create the dbo.emp table in your SQL database:
CREATE TABLE dbo.emp
(
ID int IDENTITY(1,1) NOT NULL,
FirstName varchar(50),
LastName varchar(50)
)
GO
Create a pipeline
In this step, you create a pipeline with a copy activity in the data factory. The copy activity copies data from Blob
storage to SQL Database. In the Quickstart tutorial, you created a pipeline by following these steps:
1. Create the linked service.
2. Create input and output datasets.
3. Create a pipeline.
In this tutorial, you start by creating a pipeline. Then you create linked services and datasets when you need
them to configure the pipeline.
1. On the home page, select Orchestrate .
2. In the properties pane for the pipeline, enter CopyPipeline for the pipeline name.
3. In the Activities tool box, expand the Move and Transform category, and drag the Copy data activity
from the tool box to the pipeline designer surface. Enter CopyFromBlobToSql for the name.
Configure a source
TIP
In this tutorial, you use Account key as the authentication type for your source data store. You can also choose other
supported authentication methods, such as SAS URI, Service Principal, and Managed Identity if needed. For more
information, see the corresponding sections in Copy and transform data in Azure Blob storage by using Azure Data
Factory.
To store secrets for data stores securely, we also recommend that you use Azure Key Vault. For more information and
illustrations, see Store credentials in Azure Key Vault.
7. Select Test connection. The connection should fail when the storage account allows access only from Selected
networks, because Data Factory must create a private endpoint to it and that endpoint must be approved before
use. In the error message, you should see a link to create a private endpoint that you can follow to
create a managed private endpoint. An alternative is to go directly to the Manage tab and follow the
instructions in the next section to create a managed private endpoint.
NOTE
The Manage tab might not be available for all data factory instances. If you don't see it, you can access private
endpoints by selecting Author > Connections > Private Endpoint .
8. Keep the dialog box open, and then go to your storage account.
9. Follow instructions in this section to approve the private link.
10. Go back to the dialog box. Select Test connection again, and select Create to deploy the linked service.
11. After the linked service is created, it goes back to the Set properties page. Next to File path, select
Browse.
12. Go to the adftutorial/input folder, select the emp.txt file, and then select OK .
13. Select OK . It automatically goes to the pipeline page. On the Source tab, confirm that
SourceBlobDataset is selected. To preview data on this page, select Preview data .
NOTE
The Manage tab might not be available for all Data Factory instances. If you don't see it, you can access private
endpoints by selecting Author > Connections > Private Endpoint .
4. Select the Azure Blob Storage tile from the list, and select Continue .
5. Enter the name of the storage account you created.
6. Select Create .
7. After a few seconds, you should see that the private link created needs an approval.
8. Select the private endpoint that you created. You can see a hyperlink that will lead you to approve the
private endpoint at the storage account level.
TIP
In this tutorial, you use SQL authentication as the authentication type for your sink data store. You can also choose
other supported authentication methods, such as Service Principal and Managed Identity if needed. For more
information, see corresponding sections in Copy and transform data in Azure SQL Database by using Azure Data Factory.
To store secrets for data stores securely, we also recommend that you use Azure Key Vault. For more information and
illustrations, see Store credentials in Azure Key Vault.
4. Select the Azure SQL Database tile from the list, and select Continue .
5. Enter the name of the SQL server you selected.
6. Select Create .
7. After a few seconds, you should see that the private link created needs an approval.
8. Select the private endpoint that you created. You can see a hyperlink that will lead you to approve the
private endpoint at the SQL server level.
Approval of a private link in SQL Server
1. In the SQL server, go to Private endpoint connections under the Settings section.
2. Select the check box for the private endpoint you created, and select Approve .
3. Add a description, and select Yes.
4. Go back to the Managed private endpoints section of the Manage tab in Data Factory.
5. It should take one or two minutes for the approval to appear for your private endpoint.
Debug and publish the pipeline
You can debug a pipeline before you publish artifacts (linked services, datasets, and pipeline) to Data Factory or
your own Azure Repos Git repository.
1. To debug the pipeline, select Debug on the toolbar. You see the status of the pipeline run in the Output tab
at the bottom of the window.
2. After the pipeline runs successfully, in the top toolbar, select Publish all. This action publishes entities
(datasets and pipelines) you created to Data Factory.
3. Wait until you see the Successfully published message. To see notification messages, select Show
Notifications in the upper-right corner (bell button).
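As an optional check that isn't part of the original steps, you can confirm that the copied rows landed in the sink table by querying it in your SQL database:

SELECT ID, FirstName, LastName FROM dbo.emp;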
Summary
The pipeline in this sample copies data from Blob storage to SQL Database by using private endpoints in Data
Factory Managed Virtual Network. You learned how to:
Create a data factory.
Create a pipeline with a copy activity.
Best practices for writing to files to data lake with
data flows
7/2/2021 • 6 minutes to read • Edit Online
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use ADLS storage as a source and sink data stores. If you don't have a storage
account, see Create an Azure storage account for steps to create one.
The steps in this tutorial will assume that you have
2. In the General tab for the pipeline, enter DeltaLake for Name of the pipeline.
3. In the factory top bar, slide the Data Flow debug slider on. Debug mode allows for interactive testing of
transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes to warm up, and we
recommend turning on debug first if you plan to do Data Flow development. For more
information, see Debug Mode.
4. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow
activity from the pane to the pipeline canvas.
5. In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow
DeltaLake . Click Finish when done.
Next steps
Learn more about data flow sinks.
Dynamically set column names in data flows
6/23/2021 • 6 minutes to read • Edit Online
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use ADLS storage as a source and sink data stores. If you don't have a storage
account, see Create an Azure storage account for steps to create one.
4. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow
activity from the pane to the pipeline canvas.
5. In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow
DynaCols . Click Finish when done.
Tutorial objectives
You'll learn how to dynamically set column names using a data flow
1. Create a source dataset for the movies CSV file.
2. Create a lookup dataset for a field mapping JSON configuration file.
3. Convert the columns from the source to your target column names.
Start from a blank data flow canvas
First, let's set up the data flow environment for each of the mechanisms described below for landing data in
ADLS Gen2.
1. Click on the source transformation and call it movies1 .
2. Click the new button next to dataset in the bottom panel.
3. Choose either Blob or ADLS Gen2 depending on where you stored the moviesDB.csv file from above.
4. Add a 2nd source, which we will use to source the configuration JSON file to lookup field mappings.
5. Name this source columnmappings.
6. For the dataset, point to a new JSON file that will store a configuration for column mapping. You can
paste the following into the JSON file for this tutorial example:
[
{"prevcolumn":"title","newcolumn":"movietitle"},
{"prevcolumn":"year","newcolumn":"releaseyear"}
]
1. Go back to the data flow designer and edit the data flow created above.
2. Click on the parameters tab
3. Create a new parameter and choose string array data type
4. For the default value, enter ['a','b','c']
5. Use the top movies1 source to modify the column names to map to these array values
6. Add a Select transformation. The Select transformation will be used to map incoming columns to new
column names for output.
7. We're going to change the first 3 column names to the new names defined in the parameter
8. To do this, add 3 rule-based mapping entries in the bottom pane
9. For the first column, the matching rule will be position==1 and the name will be $parameter1[1]
Next steps
The completed pipeline from this tutorial can be downloaded from here
Learn more about data flow sinks.
Transform data in delta lake using mapping data
flows
7/2/2021 • 5 minutes to read • Edit Online
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use ADLS storage as a source and sink data stores. If you don't have a storage
account, see Create an Azure storage account for steps to create one.
The file that we are transforming in this tutorial is MoviesDB.csv, which can be found here. To retrieve the file
from GitHub, copy the contents to a text editor of your choice to save locally as a .csv file. To upload the file to
your storage account, see Upload blobs with the Azure portal. The examples will be referencing a container
named 'sample-data'.
2. In the General tab for the pipeline, enter DeltaLake for Name of the pipeline.
3. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow
activity from the pane to the pipeline canvas.
4. In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow
DeltaLake . Click Finish when done.
5. In the top bar of the pipeline canvas, slide the Data Flow debug slider on. Debug mode allows for
interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes
to warm up, and we recommend turning on debug first if you plan to do Data Flow development.
For more information, see Debug Mode.
11. Choose a folder name in your storage container where you would like ADF to create the Delta Lake
12. Go back to the pipeline designer and click Debug to execute the pipeline in debug mode with just this
data flow activity on the canvas. This will generate your new Delta Lake in ADLS Gen2.
13. From Factory Resources, click new > Data flow
14. Use the MoviesCSV again as a source and click "Detect data types" again
15. Add a filter transformation to your source transformation in the graph
16. Only allow movie rows that match the three years you are going to work with: 1950, 1988, and 1960.
17. Update the rating for each 1988 movie to '1' by adding a derived column transformation after your filter
transformation.
18. In that same derived column, create movies for 2021 by taking an existing year and changing the year to
2021. Let's pick 1960.
19. This is what your three derived columns will look like.
20. Update, insert, delete, and upsert policies are created in the Alter Row transformation. Add an alter row
transformation after your derived column.
21. Your alter row policies should look like this.
22. Now that you've set the proper policy for each alter row type, check that the proper update rules have
been set on the sink transformation.
23. Here we are using the Delta Lake sink to your ADLS Gen2 data lake and allowing inserts, updates, and
deletes.
24. Note that Key Columns is a composite key made up of the Movie primary key column and the year
column. This is because we created fake 2021 movies by duplicating the 1960 rows. This avoids collisions
when looking up the existing rows by providing uniqueness.
Download completed sample
Here is a sample solution for the Delta pipeline with a data flow for update/delete rows in the lake:
Next steps
Learn more about the data flow expression language.
Transform data using mapping data flows
7/2/2021 • 8 minutes to read • Edit Online
NOTE
This tutorial is meant for mapping data flows in general. Data flows are available both in Azure Data Factory and Synapse
Pipelines. If you are new to data flows in Azure Synapse Pipelines, please follow Data Flow using Azure Synapse Pipelines
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use ADLS storage as a source and sink data stores. If you don't have a storage
account, see Create an Azure storage account for steps to create one.
The file that we are transforming in this tutorial is MoviesDB.csv, which can be found here. To retrieve the file
from GitHub, copy the contents to a text editor of your choice to save locally as a .csv file. To upload the file to
your storage account, see Upload blobs with the Azure portal. The examples will be referencing a container
named 'sample-data'.
4. In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow
TransformMovies . Click Finish when done.
5. In the top bar of the pipeline canvas, slide the Data Flow debug slider on. Debug mode allows for
interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes
to warm up, and we recommend turning on debug first if you plan to do Data Flow development.
For more information, see Debug Mode.
5. Name your dataset MoviesDB . In the linked service dropdown, choose New .
6. In the linked service creation screen, name your ADLS gen2 linked service ADLSGen2 and specify your
authentication method. Then enter your connection credentials. In this tutorial, we're using Account key to
connect to our storage account. You can click Test connection to verify your credentials were entered
correctly. Click Create when finished.
7. Once you're back at the dataset creation screen, enter where your file is located under the File path field.
In this tutorial, the file moviesDB.csv is located in container sample-data. As the file has headers, check
First row as header . Select From connection/store to import the header schema directly from the
file in storage. Click OK when done.
8. If your debug cluster has started, go to the Data Preview tab of the source transformation and click
Refresh to get a snapshot of the data. You can use data preview to verify your transformation is
configured correctly.
9. Next to your source node on the data flow canvas, click on the plus icon to add a new transformation. The
first transformation you're adding is a Filter .
10. Name your filter transformation FilterYears . Click on the expression box next to Filter on to open the
expression builder. Here you'll specify your filtering condition.
11. The data flow expression builder lets you interactively build expressions to use in various
transformations. Expressions can include built-in functions, columns from the input schema, and user-
defined parameters. For more information on how to build expressions, see Data Flow expression builder.
In this tutorial, you want to filter movies of genre comedy that came out between the years 1910 and
2000. As year is currently a string, you need to convert it to an integer using the toInteger() function.
Use the greater than or equals to (>=) and less than or equals to (<=) operators to compare against
literal year values 1910 and 2000. Union these expressions together with the and (&&) operator. The
expression comes out as:
toInteger(year) >= 1910 && toInteger(year) <= 2000
To find which movies are comedies, you can use the rlike() function to find pattern 'Comedy' in the
column genres. Union the rlike expression with the year comparison to get:
toInteger(year) >= 1910 && toInteger(year) <= 2000 && rlike(genres, 'Comedy')
If you have a debug cluster active, you can verify your logic by clicking Refresh to see expression output
compared to the inputs used. There's more than one right answer on how you can accomplish this logic
using the data flow expression language.
Click Save and Finish once you're done with your expression.
12. Fetch a Data Preview to verify the filter is working correctly.
13. The next transformation you'll add is an Aggregate transformation under Schema modifier .
14. Name your aggregate transformation AggregateComedyRatings . In the Group by tab, select year
from the dropdown to group the aggregations by the year the movie came out.
15. Go to the Aggregates tab. In the left text box, name the aggregate column AverageComedyRating .
Click on the right expression box to enter the aggregate expression via the expression builder.
16. To get the average of column Rating , use the avg() aggregate function. As Rating is a string and
avg() takes in a numerical input, we must convert the value to a number via the toInteger() function.
The expression looks like:
avg(toInteger(Rating))
17. Go to the Data Preview tab to view the transformation output. Notice only two columns are there, year
and AverageComedyRating .
22. Name your sink dataset MoviesSink . For linked service, choose the ADLS gen2 linked service you
created in step 6. Enter an output folder to write your data to. In this tutorial, we're writing to folder
'output' in container 'sample-data'. The folder doesn't need to exist beforehand and can be dynamically
created. Set First row as header as true and select None for Import schema. Click Finish.
Now you've finished building your data flow. You're ready to run it in your pipeline.
Running and monitoring the Data Flow
You can debug a pipeline before you publish it. In this step, you're going to trigger a debug run of the data flow
pipeline. While data preview doesn't write data, a debug run will write data to your sink destination.
1. Go to the pipeline canvas. Click Debug to trigger a debug run.
2. A pipeline debug run of Data Flow activities uses the active debug cluster but still takes at least a minute to
initialize. You can track the progress via the Output tab. Once the run is successful, click on the
eyeglasses icon to open the monitoring pane.
3. In the monitoring pane, you can see the number of rows and time spent in each transformation step.
4. Click on a transformation to get detailed information about the columns and partitioning of the data.
If you followed this tutorial correctly, you should have written 83 rows and 2 columns into your sink folder. You
can verify the data is correct by checking your blob storage.
Next steps
The pipeline in this tutorial runs a data flow that aggregates the average rating of comedies from 1910 to 2000
and writes the data to ADLS. You learned how to:
Create a data factory.
Create a pipeline with a Data Flow activity.
Build a mapping data flow with four transformations.
Test run the pipeline.
Monitor a Data Flow activity
Learn more about the data flow expression language.
Mapping data flow video tutorials
6/23/2021 • 2 minutes to read • Edit Online
Getting Started
Getting started with mapping data flows in Azure Data Factory
Transformation overviews
Aggregate transformation
Alter row transformation
Derived Column transformation
Join transformation
Self-join pattern
Lookup transformation
Lookup Transformation Updates & Tips
Pivot transformation
Pivot transformation: mapping drifted columns
Select transformation
Select transformation: Rule-based mapping
Select transformation: Large Datasets
Surrogate key transformation
Union transformation
Unpivot transformation
Window Transformation
Filter Transformation
Conditional Split Transformation
Exists Transformation
Dynamic Joins and Dynamic Lookups
Flatten transformation
Transform hierarchical data
Rank transformation
Cached lookup
Row context via Window transformation
Parse transformation
Transform complex data types
Output to next activity
Metadata
NOTE
Power Query activity in ADF is currently available in public preview
NOTE
Previously, the data wrangling feature was located in the data flow workflow. Now, you will build your data wrangling
mash-up from New > Power query
The other method is in the activities pane of the pipeline canvas. Open the Power Query accordion and drag
the Power Query activity onto the canvas.
Author a Power Query data wrangling activity
Add a Source dataset for your Power Query mash-up. You can either choose an existing dataset or create a
new one. After you have saved your mash-up, you can then add the Power Query data wrangling activity to your
pipeline and select a sink dataset to tell ADF where to land your data. While you can choose one or more source
datasets, only one sink is allowed at this time. Choosing a sink dataset is optional, but at least one source dataset
is required.
Author your wrangling Power Query using code-free data preparation. For the list of available functions, see
transformation functions. ADF translates the M script into a data flow script so that you can execute your Power
Query at scale using the Azure Data Factory data flow Spark environment.
Running and monitoring a Power Query data wrangling activity
To execute a pipeline debug run of a Power Query activity, click Debug in the pipeline canvas. Once you publish
your pipeline, Trigger now executes an on-demand run of the last published pipeline. Power Query pipelines
can be scheduled with all existing Azure Data Factory triggers.
Go to the Monitor tab to visualize the output of a triggered Power Query activity run.
Next steps
Learn how to create a mapping data flow.
Transform data in the cloud by using a Spark
activity in Azure Data Factory
7/2/2021 • 7 minutes to read • Edit Online
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure storage account . You create a Python script and an input file, and you upload them to Azure Storage.
The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.
NOTE
HDInsight supports only general-purpose storage accounts with the standard tier. Make sure that the account is not a
premium or blob-only storage account.
Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell.
Upload the Python script to your Blob storage account
1. Create a Python file named WordCount_Spark.py with the following content:
import sys
from operator import add
from pyspark.sql import SparkSession  # required for SparkSession.builder below

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)
    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()
2. Replace <storageAccountName> with the name of your Azure storage account. Then, save the file.
3. In Azure Blob storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark .
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in
this text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder.
4. For Subscription , select your Azure subscription in which you want to create the data factory.
5. For Resource Group , take one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
Some of the steps in this quickstart assume that you use the name ADFTutorialResourceGroup for the
resource group. To learn about resource groups, see Using resource groups to manage your Azure
resources.
6. For Version , select V2 .
7. For Location , select the location for the data factory.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory: Products available by region.
The data stores (like Azure Storage and Azure SQL Database) and computes (like HDInsight) that Data
Factory uses can be in other regions.
8. Select Create .
9. After the creation is complete, you see the Data factory page. Select the Author & Monitor tile to start
the Data Factory UI application on a separate tab.
2. Select Connections at the bottom of the window, and then select + New .
3. In the New Linked Service window, select Data Store > Azure Blob Storage, and then select Continue.
4. For Storage account name, select the name from the list, and then select Save.
Create an on-demand HDInsight linked service
1. Select the + New button again to create another linked service.
2. In the New Linked Service window, select Compute > Azure HDInsight, and then select Continue.
3. In the New Linked Service window, complete the following steps:
a. For Name, enter AzureHDInsightLinkedService.
b. For Type, confirm that On-demand HDInsight is selected.
c. For Azure Storage Linked Service, select AzureBlobStorage1. You created this linked service
earlier. If you used a different name, specify the right name here.
d. For Cluster type, select spark.
e. For Service principal id, enter the ID of the service principal that has permission to create an
HDInsight cluster.
This service principal needs to be a member of the Contributor role of the subscription or the resource
group in which the cluster is created. For more information, see Create an Azure Active Directory
application and service principal. The Service principal id is equivalent to the Application ID, and a
Service principal key is equivalent to the value for a Client secret.
f. For Service principal key, enter the key.
g. For Resource group, select the same resource group that you used when you created the data
factory. The Spark cluster is created in this resource group.
h. Expand OS type.
i. Enter a name for Cluster user name.
j. Enter the Cluster password for the user.
k. Select Finish.
NOTE
Azure HDInsight limits the total number of cores that you can use in each Azure region that it supports. For the on-
demand HDInsight linked service, the HDInsight cluster is created in the same Azure Storage location that's used as its
primary storage. Ensure that you have enough core quotas for the cluster to be created successfully. For more
information, see Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more.
Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.
2. In the Activities toolbox, expand HDInsight . Drag the Spark activity from the Activities toolbox to the
pipeline designer surface.
3. In the properties for the Spark activity window at the bottom, complete the following steps:
a. Switch to the HDI Cluster tab.
b. Select AzureHDInsightLinkedService (which you created in the previous procedure).
4. Switch to the Script/Jar tab, and complete the following steps:
a. For Job Linked Service, select AzureBlobStorage1.
b. Select Browse Storage.
c. Browse to the adftutorial/spark/script folder, select WordCount_Spark.py, and then select Finish.
5. To validate the pipeline, select the Validate button on the toolbar. Select the >> (right arrow) button to
close the validation window.
6. Select Publish All . The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
Trigger a pipeline run
Select Add Trigger on the toolbar, and then select Trigger Now .
3. To see activity runs associated with the pipeline run, select View Activity Runs in the Actions column.
You can switch back to the pipeline runs view by selecting the All Pipeline Runs link at the top.
The file should have each word from the input text file and the number of times the word appeared in the file.
For example:
(u'This', 1)
(u'a', 1)
(u'is', 1)
(u'test', 1)
(u'file', 1)
Next steps
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked
service. You learned how to:
Create a data factory.
Create a pipeline that uses a Spark activity.
Trigger a pipeline run.
Monitor the pipeline run.
To learn how to transform data by running a Hive script on an Azure HDInsight cluster that's in a virtual
network, advance to the next tutorial:
Tutorial: Transform data using Hive in Azure Virtual Network.
Transform data in the cloud by using Spark activity
in Azure Data Factory
3/5/2021 • 7 minutes to read • Edit Online
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure Storage account. You create a Python script and an input file, and upload them to Azure Storage.
The output from the Spark program is stored in this storage account. The on-demand Spark cluster uses the
same storage account as its primary storage.
Azure PowerShell. Follow the instructions in How to install and configure Azure PowerShell.
Upload the Python script to your Blob storage account
1. Create a Python file named WordCount_Spark.py with the following content:
import sys
from operator import add
from pyspark.sql import SparkSession  # required for SparkSession.builder below

def main():
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/inputfiles/minecraftstory.txt").rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
        .map(lambda x: (x, 1)) \
        .reduceByKey(add)
    counts.saveAsTextFile("wasbs://adftutorial@<storageaccountname>.blob.core.windows.net/spark/outputfiles/wordcount")

    spark.stop()

if __name__ == "__main__":
    main()
2. Replace <storageAccountName> with the name of your Azure Storage account. Then, save the file.
3. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
4. Create a folder named spark .
5. Create a subfolder named script under the spark folder.
6. Upload the WordCount_Spark.py file to the script subfolder.
Upload the input file
1. Create a file named minecraftstory.txt with some text. The Spark program counts the number of words in
this text.
2. Create a subfolder named inputfiles in the spark folder.
3. Upload the minecraftstory.txt file to the inputfiles subfolder.
Update the <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage
account.
On-demand HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyOnDemandSparkLinkedService.json.
{
"name": "MyOnDemandSparkLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterSize": 2,
"clusterType": "spark",
"timeToLive": "00:15:00",
"hostSubscriptionId": "<subscriptionID> ",
"servicePrincipalId": "<servicePrincipalID>",
"servicePrincipalKey": {
"value": "<servicePrincipalKey>",
"type": "SecureString"
},
"tenant": "<tenant ID>",
"clusterResourceGroup": "<resourceGroupofHDICluster>",
"version": "3.6",
"osType": "Linux",
"clusterNamePrefix":"ADFSparkSample",
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
}
Update values for the following properties in the linked service definition:
hostSubscriptionId . Replace <subscriptionID> with the ID of your Azure subscription. The on-demand
HDInsight cluster is created in this subscription.
tenant. Replace <tenantID> with the ID of your Azure tenant.
servicePrincipalId, servicePrincipalKey. Replace <servicePrincipalID> and <servicePrincipalKey> with the ID
and key of your service principal in Azure Active Directory. This service principal needs to be a member
of the Contributor role of the subscription or the resource group in which the cluster is created. See Create an
Azure Active Directory application and service principal for details. The Service principal id is equivalent to
the Application ID, and a Service principal key is equivalent to the value for a Client secret.
clusterResourceGroup . Replace <resourceGroupOfHDICluster> with the name of the resource group in
which the HDInsight cluster needs to be created.
NOTE
Azure HDInsight has a limitation on the total number of cores you can use in each Azure region it supports. For the on-
demand HDInsight linked service, the HDInsight cluster will be created in the same location as the Azure Storage account used as
its primary storage. Ensure that you have enough core quotas for the cluster to be created successfully. For more
information, see Set up clusters in HDInsight with Hadoop, Spark, Kafka, and more.
Author a pipeline
In this step, you create a new pipeline with a Spark activity. The activity uses the word count sample. Download
the contents from this location if you haven't already done so.
Create a JSON file in your preferred editor, copy the following JSON definition of a pipeline definition, and save
it as MySparkOnDemandPipeline.json .
{
"name": "MySparkOnDemandPipeline",
"properties": {
"activities": [
{
"name": "MySparkActivity",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyOnDemandSparkLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"rootPath": "adftutorial/spark",
"entryFilePath": "script/WordCount_Spark.py",
"getDebugInfo": "Failure",
"sparkJobLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}
$resourceGroupName = "ADFTutorialResourceGroup"
Pipeline name
2. Launch PowerShell . Keep Azure PowerShell open until the end of this quickstart. If you close and reopen,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.)
and computes (HDInsight, etc.) used by data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:
$df
5. Switch to the folder where you created JSON files, and run the following command to deploy an Azure
Storage linked service:
2. Run the following script to continuously check the pipeline run status until it finishes.
while ($True) {
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}
4. Confirm that a folder named outputfiles is created in the spark folder of the adftutorial container with the
output from the Spark program.
Next steps
The pipeline in this sample transforms data by using a Spark activity and an on-demand HDInsight linked
service. You learned how to:
Create a data factory.
Author and deploy linked services.
Author and deploy a pipeline.
Start a pipeline run.
Monitor the pipeline run.
Advance to the next tutorial to learn how to transform data by running a Hive script on an Azure HDInsight cluster
that is in a virtual network:
Tutorial: transform data using Hive in Azure Virtual Network.
Run a Databricks notebook with the Databricks
Notebook Activity in Azure Data Factory
7/2/2021 • 5 minutes to read
Prerequisites
Azure Databricks workspace . Create a Databricks workspace or use an existing one. You create a Python
notebook in your Azure Databricks workspace. Then you execute the notebook and pass parameters to it
using Azure Data Factory.
2. Select Connections at the bottom of the window, and then select + New .
3. In the New Linked Service window, select Compute > Azure Databricks , and then select Continue .
4. In the New Linked Service window, complete the following steps:
a. For Name , enter AzureDatabricks_LinkedService .
b. Select the appropriate Databricks workspace that you will run your notebook in.
c. For Select cluster , select New job cluster .
d. For Domain/Region , the information should auto-populate.
e. For Access Token , generate it from your Azure Databricks workspace. You can find the steps here.
f. For Cluster version , select 4.2 (with Apache Spark 2.3.1, Scala 2.11) .
g. For Cluster node type , select Standard_D3_v2 under the General Purpose (HDD) category for
this tutorial.
h. For Workers , enter 2 .
i. Select Finish .
Create a pipeline
1. Select the + (plus) button, and then select Pipeline on the menu.
2. Create a parameter to be used in the pipeline. Later you pass this parameter to the Databricks
Notebook activity. In the empty pipeline, click the Parameters tab, then New , and name it 'name'.
3. In the Activities toolbox, expand Databricks . Drag the Notebook activity from the Activities toolbox
to the pipeline designer surface.
4. In the properties for the Databricks Notebook activity window at the bottom, complete the following
steps:
a. Switch to the Azure Databricks tab.
b. Select AzureDatabricks_LinkedService (which you created in the previous procedure).
c. Switch to the Settings tab.
d. Browse to select a Databricks Notebook path . Let's create a notebook and specify the path here. You
get the notebook path by following the next few steps.
a. Launch your Azure Databricks workspace.
b. Create a new folder in the workspace and name it adftutorial .
c. Create a new Python notebook named mynotebook under the adftutorial folder, and click Create .
d. In the newly created notebook mynotebook , add the following code:
dbutils.widgets.text("input", "","")
y = dbutils.widgets.get("input")
print ("Param -\'input':")
print (y)
7. Select Publish All . The Data Factory UI publishes entities (linked services and pipeline) to the Azure Data
Factory service.
Trigger a pipeline run
Select Trigger on the toolbar, and then select Trigger Now .
The Pipeline Run dialog box asks for the name parameter. Use /path/filename as the parameter here. Click
Finish.
Monitor the pipeline run
1. Switch to the Monitor tab. Confirm that you see a pipeline run. It takes approximately 5-8 minutes to
create a Databricks job cluster, where the notebook is executed.
You can switch back to the pipeline runs view by selecting the Pipelines link at the top.
Verify the output
You can log on to your Azure Databricks workspace , go to Clusters , and see the Job status as
pending execution, running, or terminated.
You can click the Job name and navigate to see further details. On a successful run, you can validate the
parameters passed and the output of the Python notebook.
Next steps
The pipeline in this sample triggers a Databricks Notebook activity and passes a parameter to it. You learned
how to:
Create a data factory.
Create a pipeline that uses a Databricks Notebook activity.
Trigger a pipeline run.
Monitor the pipeline run.
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory using the Azure
portal
7/2/2021 • 9 minutes to read
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure Storage account . You create a Hive script and upload it to Azure Storage. The output from
the Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as its primary storage.
Azure Virtual Network . If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight cluster is in an Azure virtual network. Here is a sample configuration
of the Azure virtual network.
HDInsight cluster. Create an HDInsight cluster and join it to the virtual network you created in the
previous step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a
sample configuration of HDInsight in a virtual network.
Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell.
A virtual machine . Create an Azure virtual machine (VM) and join it to the same virtual network that
contains your HDInsight cluster. For details, see How to create virtual machines.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:
DROP TABLE IF EXISTS HiveSampleOut;
CREATE EXTERNAL TABLE HiveSampleOut (clientid string, market string, devicemodel string, state string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${hiveconf:Output}';
2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts .
4. Upload the hivescript.hql file to the hivescripts subfolder.
13. Click Author & Monitor to launch the Data Factory User Interface (UI) in a separate tab.
14. In the home page, switch to the Manage tab in the left panel as shown in the following image:
5. Copy the authentication key for the integration runtime by clicking the copy button, and save it. Keep
the window open. You use this key to register the IR installed in a virtual machine.
Install IR on a virtual machine
1. On the Azure VM, download the self-hosted integration runtime. Use the authentication key obtained in
the previous step to manually register the self-hosted integration runtime.
2. You see the following message when the self-hosted integration runtime is registered successfully.
3. Click Launch Configuration Manager . You see the following page when the node is connected to the
cloud service:
Self-hosted IR in the Azure Data Factory UI
1. In the Azure Data Factory UI , you should see the name of the self-hosted IR VM and its status.
2. Click Finish to close the Integration Runtime Setup window. You see the self-hosted IR in the list of
integration runtimes.
2. In the New Linked Service window, select Azure Blob Storage , and click Continue .
Create a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes a Hive script to return data from a
sample table and saves it to a path you defined.
Note the following points:
scriptPath points to the path of the Hive script on the Azure Storage account you used for MyStorageLinkedService.
The path is case-sensitive.
Output is an argument used in the Hive script. Use the format of
wasbs://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/ to point it to an existing folder on
your Azure Storage. The path is case-sensitive.
1. In the Data Factory UI, click + (plus) in the left pane, and click Pipeline .
2. In the Activities toolbox, expand HDInsight , and drag-drop Hive activity to the pipeline designer
surface.
3. In the properties window, switch to the HDI Cluster tab, and select AzureHDInsightLinkedService for
HDInsight Linked Service .
4. You see only one activity run because the pipeline has only one activity, of type HDInsightHive . To
switch back to the previous view, click the Pipelines link at the top.
5. Confirm that you see an output file in the outputfolder of the adftutorial container.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create a self-hosted integration runtime
Create Azure Storage and Azure HDInsight linked services
Create a pipeline with Hive activity.
Trigger a pipeline run.
Monitor the pipeline run
Verify the output
Advance to the following tutorial to learn about branching and chaining Data Factory control flow:
Branching and chaining Data Factory control flow
Transform data in Azure Virtual Network using Hive
activity in Azure Data Factory
5/28/2021 • 9 minutes to read
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure Storage account . You create a Hive script and upload it to Azure Storage. The output from
the Hive script is stored in this storage account. In this sample, the HDInsight cluster uses this Azure Storage
account as its primary storage.
Azure Virtual Network . If you don't have an Azure virtual network, create it by following these
instructions. In this sample, the HDInsight cluster is in an Azure virtual network. Here is a sample configuration
of the Azure virtual network.
HDInsight cluster. Create an HDInsight cluster and join it to the virtual network you created in the
previous step by following this article: Extend Azure HDInsight using an Azure Virtual Network. Here is a
sample configuration of HDInsight in a virtual network.
Azure PowerShell . Follow the instructions in How to install and configure Azure PowerShell.
Upload Hive script to your Blob Storage account
1. Create a Hive SQL file named hivescript.hql with the following content:
2. In your Azure Blob Storage, create a container named adftutorial if it does not exist.
3. Create a folder named hivescripts .
4. Upload the hivescript.hql file to the hivescripts subfolder.
Create a data factory
1. Set the resource group name. You create a resource group as part of this tutorial. However, you can use
an existing resource group if you like.
$resourceGroupName = "ADFTutorialResourceGroup"
$dataFactoryName = "MyDataFactory09142017"
$pipelineName = "MyHivePipeline" #
4. Specify a name for the self-hosted integration runtime. You need a self-hosted integration runtime when
the Data Factory needs to access resources (such as Azure SQL Database) inside a VNet.
$selfHostedIntegrationRuntimeName = "MySelfHostedIR09142017"
5. Launch PowerShell. Keep Azure PowerShell open until the end of this quickstart. If you close and reopen it,
you need to run the commands again. For a list of Azure regions in which Data Factory is currently
available, select the regions that interest you on the following page, and then expand Analytics to locate
Data Factory: Products available by region. The data stores (Azure Storage, Azure SQL Database, etc.)
and computes (HDInsight, etc.) used by the data factory can be in other regions.
Run the following command, and enter the user name and password that you use to sign in to the Azure
portal:
Connect-AzAccount
Run the following command to view all the subscriptions for this account:
Get-AzSubscription
Run the following command to select the subscription that you want to work with. Replace
SubscriptionId with the ID of your Azure subscription:
6. Create the resource group: ADFTutorialResourceGroup if it does not exist already in your subscription.
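The command itself is not shown in this excerpt; a minimal sketch with the standard Az cmdlet (the region shown is only an example) is:
New-AzResourceGroup -Name $resourceGroupName -Location "East US"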
$df
Create self-hosted IR
In this section, you create a self-hosted integration runtime and associate it with an Azure VM in the same Azure
Virtual Network as your HDInsight cluster.
1. Create the self-hosted integration runtime. Use a unique name if another integration runtime with the
same name already exists.
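The creation commands are not shown in this excerpt. A minimal sketch with the Az.DataFactory cmdlets might look like the following; running the second command returns authentication keys similar to the JSON shown below:
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name $selfHostedIntegrationRuntimeName `
    -Type SelfHosted `
    -Description "Self-hosted IR for HDInsight in a virtual network"

# Retrieve the authentication keys used to register the IR node on your VM.
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name $selfHostedIntegrationRuntimeName | ConvertTo-Json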
{
"AuthKey1": "IR@0000000000000000000000000000000000000=",
"AuthKey2": "IR@0000000000000000000000000000000000000="
}
You see the following page when the node is connected to the cloud service:
Author linked services
You author and deploy two Linked Services in this section:
An Azure Storage Linked Service that links an Azure Storage account to the data factory. This storage is the
primary storage used by your HDInsight cluster. In this case, we also use this Azure Storage account to keep
the Hive script and output of the script.
An HDInsight Linked Service. Azure Data Factory submits the Hive script to this HDInsight cluster for
execution.
Azure Storage linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure Storage linked
service, and then save the file as MyStorageLinkedService.json .
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<storageAccountName>;AccountKey=
<storageAccountKey>"
},
"connectVia": {
"referenceName": "MySelfhostedIR",
"type": "IntegrationRuntimeReference"
}
}
}
Replace <storageAccountName> and <storageAccountKey> with the name and key of your Azure Storage account.
HDInsight linked service
Create a JSON file using your preferred editor, copy the following JSON definition of an Azure HDInsight linked
service, and save the file as MyHDInsightLinkedService.json .
{
"name": "MyHDInsightLinkedService",
"properties": {
"type": "HDInsight",
"typeProperties": {
"clusterUri": "https://<clustername>.azurehdinsight.net",
"userName": "<username>",
"password": {
"value": "<password>",
"type": "SecureString"
},
"linkedServiceName": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "MySelfhostedIR",
"type": "IntegrationRuntimeReference"
}
}
}
Update values for the following properties in the linked service definition:
userName . Name of the cluster login user that you specified when creating the cluster.
password . The password for the user.
clusterUri . Specify the URL of your HDInsight cluster in the following format:
https://<clustername>.azurehdinsight.net . This article assumes that you have access to the cluster over
the internet. For example, you can connect to the cluster at https://clustername.azurehdinsight.net . This
address uses the public gateway, which is not available if you have used network security groups (NSGs)
or user-defined routes (UDRs) to restrict access from the internet. For Data Factory to submit jobs to
HDInsight clusters in Azure Virtual Network, your Azure Virtual Network needs to be configured in such
a way that the URL can be resolved to the private IP address of the gateway used by HDInsight.
1. From Azure portal, open the Virtual Network the HDInsight is in. Open the network interface with
name starting with nic-gateway-0 . Note down its private IP address. For example, 10.6.0.15.
2. If your Azure Virtual Network has a DNS server, update the DNS record so the HDInsight cluster
URL https://<clustername>.azurehdinsight.net can be resolved to 10.6.0.15 . This is the
recommended approach. If you don't have a DNS server in your Azure Virtual Network, you can
temporarily work around this by editing the hosts file (C:\Windows\System32\drivers\etc) of all
VMs that are registered as self-hosted integration runtime nodes, adding an entry like this:
10.6.0.15 myHDIClusterName.azurehdinsight.net
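The commands that deploy these two linked services are not shown in this excerpt. A minimal sketch with the Az.DataFactory cmdlets, assuming both JSON files were saved in the current folder:
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "MyStorageLinkedService" -DefinitionFile ".\MyStorageLinkedService.json"
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name "MyHDInsightLinkedService" -DefinitionFile ".\MyHDInsightLinkedService.json"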
Author a pipeline
In this step, you create a new pipeline with a Hive activity. The activity executes a Hive script to return data from a
sample table and saves it to a path you defined. Create a JSON file in your preferred editor, copy the following
JSON definition of the pipeline, and save it as MyHivePipeline.json .
{
"name": "MyHivePipeline",
"properties": {
"activities": [
{
"name": "MyHiveActivity",
"type": "HDInsightHive",
"linkedServiceName": {
"referenceName": "MyHDILinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptPath": "adftutorial\\hivescripts\\hivescript.hql",
"getDebugInfo": "Failure",
"defines": {
"Output": "wasb://<Container>@<StorageAccount>.blob.core.windows.net/outputfolder/"
},
"scriptLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
]
}
}
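1. The step that deploys and starts the pipeline is not shown in this excerpt; a minimal sketch follows (the
monitoring script in the next step assumes $runId comes from Invoke-AzDataFactoryV2Pipeline):
Set-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -Name $pipelineName -DefinitionFile ".\MyHivePipeline.json"
$runId = Invoke-AzDataFactoryV2Pipeline -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName `
    -PipelineName $pipelineName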
2. Run the following script to continuously check the pipeline run status until it finishes.
while ($True) {
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}
ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 000000000-0000-0000-000000000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output :
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd :
DurationInMs :
Status : InProgress
Error :
ResourceGroupName : ADFV2SampleRG2
DataFactoryName : SampleV2DataFactory2
ActivityName : MyHiveActivity
PipelineRunId : 0000000-0000-0000-0000-000000000000
PipelineName : MyHivePipeline
Input : {getDebugInfo, scriptPath, scriptLinkedService, defines}
Output : {logLocation, clusterInUse, jobId, ExecutionProgress...}
LinkedServiceName :
ActivityRunStart : 9/18/2017 6:58:13 AM
ActivityRunEnd : 9/18/2017 6:59:16 AM
DurationInMs : 63636
Status : Succeeded
Error : {errorCode, message, failureType, target}
3. Check the outputfolder folder for a new file created as the Hive query result. It should look like the
following sample output:
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure storage account . You use Data Lake Storage as source and sink data stores. If you don't have a
storage account, see Create an Azure storage account for steps to create one. Ensure the storage account
allows access only from selected networks.
The file that we'll transform in this tutorial is moviesDB.csv, which can be found at this GitHub content site. To
retrieve the file from GitHub, copy the contents to a text editor of your choice to save it locally as a .csv file. To
upload the file to your storage account, see Upload blobs with the Azure portal. The examples will reference a
container named sample-data .
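If you prefer to script the upload, a minimal PowerShell sketch with the Az.Storage cmdlets follows; the storage account name, key, and local file path are placeholders you supply, and your client IP must be allowed through the storage account's network rules:
# Build a storage context from the account key, create the container if needed, and upload the file.
$ctx = New-AzStorageContext -StorageAccountName "<storageAccountName>" -StorageAccountKey "<storageAccountKey>"
New-AzStorageContainer -Name "sample-data" -Context $ctx -ErrorAction SilentlyContinue
Set-AzStorageBlobContent -File ".\moviesDB.csv" -Container "sample-data" -Blob "moviesDB.csv" -Context $ctx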
2. On the Integration runtime setup page, choose what integration runtime to create based on required
capabilities. In this tutorial, select Azure, Self-Hosted and then click Continue .
3. Select Azure and then click Continue to create an Azure Integration runtime.
4. Under Virtual network configuration (Preview) , select Enable .
5. Select Create .
4. Select the Azure Data Lake Storage Gen2 tile from the list, and select Continue .
5. Enter the name of the storage account you created.
6. Select Create .
7. After a few seconds, you should see that the private link created needs an approval.
8. Select the private endpoint that you created. You can see a hyperlink that will lead you to approve the
private endpoint at the storage account level.
Approval of a private link in a storage account
1. In the storage account, go to Private endpoint connections under the Settings section.
2. Select the check box by the private endpoint you created, and select Approve .
2. Name your filter transformation FilterYears . Select the expression box next to Filter on to open the
expression builder. Here you'll specify your filtering condition.
3. The data flow expression builder lets you interactively build expressions to use in various
transformations. Expressions can include built-in functions, columns from the input schema, and user-
defined parameters. For more information on how to build expressions, see Data flow expression builder.
In this tutorial, you want to filter movies in the comedy genre that came out between the years
1910 and 2000. Because the year is currently a string, you need to convert it to an integer by using
the toInteger() function. Use the greater than or equal to (>=) and less than or equal to (<=)
operators to compare against the literal year values 1910 and 2000. Union these expressions
together with the and (&&) operator. The expression comes out as:
toInteger(year) >= 1910 && toInteger(year) <= 2000
To find which movies are comedies, you can use the rlike() function to find the pattern 'Comedy'
in the column genres. Union the rlike expression with the year comparison to get:
toInteger(year) >= 1910 && toInteger(year) <= 2000 && rlike(genres, 'Comedy')
If you have a debug cluster active, you can verify your logic by selecting Refresh to see the
expression output compared to the inputs used. There's more than one right answer on how you
can accomplish this logic by using the data flow expression language.
Select Save and finish after you're finished with your expression.
4. Fetch a Data Preview to verify the filter is working correctly.
2. Name your aggregate transformation AggregateComedyRating . On the Group by tab, select year
from the drop-down box to group the aggregations by the year the movie came out.
3. Go to the Aggregates tab. In the left text box, name the aggregate column AverageComedyRating .
Select the right expression box to enter the aggregate expression via the expression builder.
4. To get the average of column Rating , use the avg() aggregate function. Because Rating is a string and
avg() takes in a numerical input, we must convert the value to a number via the toInteger() function.
This expression looks like:
avg(toInteger(Rating))
6. Go to the Data Preview tab to view the transformation output. Notice only two columns are there, year
and AverageComedyRating .
Add the sink transformation
1. Next, you want to add a Sink transformation under Destination .
2. Name your sink Sink . Select New to create your sink dataset.
3. On the New dataset page, select Azure Data Lake Storage Gen2 and then select Continue .
4. On the Select format page, select DelimitedText and then select Continue .
5. Name your sink dataset MoviesSink . For linked service, choose the same ADLSGen2 linked service you
created for source transformation. Enter an output folder to write your data to. In this tutorial, we're
writing to the folder output in the container sample-data . The folder doesn't need to exist beforehand
and can be dynamically created. Select the First row as header check box, and select None for Import
schema . Select OK .
Now you've finished building your data flow. You're ready to run it in your pipeline.
4. Select a transformation to get detailed information about the columns and partitioning of the data.
If you followed this tutorial correctly, you should have written 83 rows and 2 columns into your sink folder. You
can verify the data is correct by checking your blob storage.
Summary
In this tutorial, you used the Data Factory UI to create a pipeline that copies and transforms data from a Data
Lake Storage Gen2 source to a Data Lake Storage Gen2 sink (both allowing access to only selected networks) by
using mapping data flow in Data Factory Managed Virtual Network.
Branching and chaining activities in an Azure Data
Factory pipeline using the Azure portal
7/2/2021 • 10 minutes to read
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Azure Storage account . You use the blob storage as source data store. If you don't have an Azure storage
account, see the Create a storage account article for steps to create one.
Azure SQL Database . You use the database as sink data store. If you don't have a database in Azure SQL
Database, see the Create a database in Azure SQL Database article for steps to create one.
Create blob table
1. Launch Notepad. Copy the following text and save it as input.txt file on your disk.
John,Doe
Jane,Doe
For your request trigger, fill in the Request Body JSON Schema with the following JSON:
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}
The Request in the Logic App Designer should look like the following image:
For the Send Email action, customize how you wish to format the email, utilizing the properties passed in the
request Body JSON schema. Here is an example:
Save the workflow. Make a note of your HTTP Post request URL for your success email workflow:
11. After the creation is complete, you see the Data Factory page as shown in the image.
12. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) in a separate tab.
Create a pipeline
In this step, you create a pipeline with one Copy activity and two Web activities. You use the following features to
create the pipeline:
Parameters for the pipeline that are accessed by datasets.
Web activity to invoke logic apps workflows to send success/failure emails.
Connecting one activity with another activity (on success and failure)
Using output from an activity as an input to the subsequent activity
1. In the home page of Data Factory UI, click the Orchestrate tile.
2. In the properties window for the pipeline, switch to the Parameters tab, and use the New button to add
the following three parameters of type String: sourceBlobContainer, sinkBlobContainer, and receiver.
sourceBlobContainer - parameter in the pipeline consumed by the source blob dataset.
sinkBlobContainer – parameter in the pipeline consumed by the sink blob dataset
receiver – this parameter is used by the two Web activities in the pipeline that send success or failure
emails to the receiver whose email address is specified by this parameter.
3. In the Activities toolbox, expand Data Flow , and drag-drop Copy activity to the pipeline designer
surface.
4. In the Properties window for the Copy activity at the bottom, switch to the Source tab, and click +
New . You create a source dataset for the copy activity in this step.
5. In the New Dataset window, select Azure Blob Storage , and click Finish .
6. You see a new tab titled AzureBlob1 . Change the name of the dataset to SourceBlobDataset .
7. Switch to the Connection tab in the Properties window, and click New for the Linked service . You
create a linked service to link your Azure Storage account to the data factory in this step.
8. In the New Linked Service window, do the following steps:
a. Enter AzureStorageLinkedService for Name .
b. Select your Azure storage account for the Storage account name .
c. Click Save .
9. Enter @pipeline().parameters.sourceBlobContainer for the folder and emp.txt for the file name. You use
the sourceBlobContainer pipeline parameter to set the folder path for the dataset.
10. Switch to the pipeline tab (or) click the pipeline in the treeview. Confirm that SourceBlobDataset is
selected for Source Dataset .
13. In the properties window, switch to the Sink tab, and click + New for Sink Dataset . You create a sink
dataset for the copy activity in this step similar to the way you created the source dataset.
14. In the New Dataset window, select Azure Blob Storage , and click Finish .
15. In the General settings page for the dataset, enter SinkBlobDataset for Name .
16. Switch to the Connection tab, and do the following steps:
a. Select AzureStorageLinkedService for Linked Service .
b. Enter @pipeline().parameters.sinkBlobContainer for the folder.
c. Enter @CONCAT(pipeline().RunId, '.txt') for the file name. The expression uses the ID of the
current pipeline run for the file name. For the supported list of system variables and expressions,
see System variables and Expression language.
17. Switch to the pipeline tab at the top. Expand General in the Activities toolbox, and drag-drop a Web
activity to the pipeline designer surface. Set the name of the activity to SendSuccessEmailActivity . The
Web Activity allows a call to any REST endpoint. For more information about the activity, see Web
Activity. This pipeline uses a Web Activity to call the Logic Apps email workflow.
18. Switch to the Settings tab from the General tab, and do the following steps:
a. For URL , specify URL for the logic apps workflow that sends the success email.
b. Select POST for Method .
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json .
e. Specify the following JSON for Body .
{
"message": "@{activity('Copy1').output.dataWritten}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
19. Connect the Copy activity to the Web activity by dragging the green button next to the Copy activity and
dropping on the Web activity.
20. Drag-drop another Web activity from the Activities toolbox to the pipeline designer surface, and set the
name to SendFailureEmailActivity .
21. Switch to the Settings tab, and do the following steps:
a. For URL , specify URL for the logic apps workflow that sends the failure email.
b. Select POST for Method .
c. Click + Add header link in the Headers section.
d. Add a header Content-Type and set it to application/json .
e. Specify the following JSON for Body .
{
"message": "@{activity('Copy1').error.message}",
"dataFactoryName": "@{pipeline().DataFactory}",
"pipelineName": "@{pipeline().Pipeline}",
"receiver": "@pipeline().parameters.receiver"
}
22. Select Copy activity in the pipeline designer, and click +-> button, and select Error .
23. Drag the red button next to the Copy activity to the second Web activity SendFailureEmailActivity . You
can move the activities around so that the pipeline looks like in the following image:
24. To validate the pipeline, click Validate button on the toolbar. Close the Pipeline Validation Output
window by clicking the >> button.
25. To publish the entities (datasets, pipelines, etc.) to Data Factory service, select Publish All . Wait until you
see the Successfully published message.
Trigger a pipeline run that succeeds
1. To trigger a pipeline run, click Trigger on the toolbar, and click Trigger Now .
2. To view activity runs associated with this pipeline run, click the first link in the Actions column. You can
switch back to the previous view by clicking Pipelines at the top. Use the Refresh button to refresh the
list.
3. To view activity runs associated with this pipeline run, click the first link in the Actions column. Use the
Refresh button to refresh the list. Notice that the Copy activity in the pipeline failed. The Web activity
succeeded in sending the failure email to the specified receiver.
4. Click Error link in the Actions column to see details about the error.
Next steps
You performed the following steps in this tutorial:
Create a data factory.
Create an Azure Storage linked service.
Create an Azure Blob dataset
Create a pipeline that contains a copy activity and a web activity
Send outputs of activities to subsequent activities
Utilize parameter passing and system variables
Start a pipeline run
Monitor the pipeline and activity runs
You can now proceed to the Concepts section for more information about Azure Data Factory.
Pipelines and activities
Branching and chaining activities in a Data Factory
pipeline
4/22/2021 • 14 minutes to read
Prerequisites
Azure Storage account. You use blob storage as a source data store. If you don't have an Azure storage
account, see Create a storage account.
Azure Storage Explorer. To install this tool, see Azure Storage Explorer.
Azure SQL Database. You use the database as a sink data store. If you don't have a database in Azure SQL
Database, see the Create a database in Azure SQL Database.
Visual Studio. This article uses Visual Studio 2019.
Azure .NET SDK. Download and install the Azure .NET SDK.
For a list of Azure regions in which Data Factory is currently available, see Products available by region. The data
stores and computes can be in other regions. The stores include Azure Storage and Azure SQL Database. The
computes include HDInsight, which Data Factory uses.
Create an application as described in Create an Azure Active Directory application. Assign the application to the
Contributor role by following instructions in the same article. You'll need several values for later parts of this
tutorial, such as Application (client) ID and Directory (tenant) ID .
Create a blob table
1. Open a text editor. Copy the following text and save it locally as input.txt.
Ethel|Berg
Tamika|Walsh
2. Open Azure Storage Explorer. Expand your storage account. Right-click Blob Containers and select
Create Blob Container .
3. Name the new container adfv2branch and select Upload to add your input.txt file to the container.
Install-Package Microsoft.Azure.Management.DataFactory
Install-Package Microsoft.Azure.Management.ResourceManager -IncludePrerelease
Install-Package Microsoft.IdentityModel.Clients.ActiveDirectory
2. Add these static variables to the Program class. Replace place-holders with your own values.
// Set variables
static string tenantID = "<tenant ID>";
static string applicationId = "<application ID>";
static string authenticationKey = "<Authentication key for your application>";
static string subscriptionId = "<Azure subscription ID>";
static string resourceGroup = "<Azure resource group name>";
3. Add the following code to the Main method. This code creates an instance of the
DataFactoryManagementClient class. You then use this object to create a data factory, linked service, datasets,
and a pipeline. You can also use this object to monitor the pipeline run details.
Factory response;
{
response = client.Factories.CreateOrUpdate(resourceGroup, dataFactoryName, resource);
}
2. Add the following line to the Main method that creates a data factory:
Factory df = CreateOrUpdateDataFactory(client);
2. Add the following line to the Main method that creates an Azure Storage linked service:
For more information about supported properties and details, see Linked service properties.
Create datasets
In this section, you create two datasets, one for the source and one for the sink.
Create a dataset for a source Azure Blob
Add a method that creates an Azure blob dataset. For more information about supported properties and details,
see Azure Blob dataset properties.
Add a SourceBlobDatasetDefinition method to your Program.cs file:
You define a dataset that represents the source data in Azure Blob storage. This Blob dataset refers to the Azure
Storage linked service created in the previous step. The Blob dataset describes the location of the blob to copy
from: FolderPath and FileName.
Notice the use of parameters for the FolderPath. sourceBlobContainer is the name of the parameter and the
expression is replaced with the values passed in the pipeline run. The syntax to define parameters is
@pipeline().parameters.<parameterName>
2. Add the following code to the Main method that creates both Azure Blob source and sink datasets.
class EmailRequest
{
    [Newtonsoft.Json.JsonProperty(PropertyName = "message")]
    public string message;
    [Newtonsoft.Json.JsonProperty(PropertyName = "dataFactoryName")]
    public string dataFactoryName;
    [Newtonsoft.Json.JsonProperty(PropertyName = "pipelineName")]
    public string pipelineName;
    [Newtonsoft.Json.JsonProperty(PropertyName = "receiver")]
    public string receiver;

    // Constructor matching the four-argument usage shown later in this tutorial:
    // new EmailRequest(message, dataFactoryName, pipelineName, receiver)
    public EmailRequest(string input, string df, string pipeline, string receiverName)
    {
        message = input;
        dataFactoryName = df;
        pipelineName = pipeline;
        receiver = receiverName;
    }
}
{
"properties": {
"dataFactoryName": {
"type": "string"
},
"message": {
"type": "string"
},
"pipelineName": {
"type": "string"
},
"receiver": {
"type": "string"
}
},
"type": "object"
}
Your workflow looks something like the following example:
This JSON content aligns with the EmailRequest class you created in the previous section.
Add an action of Office 365 Outlook – Send an email . For the Send an email action, customize how you wish
to format the email, using the properties passed in the request Body JSON schema. Here's an example:
After you save the workflow, copy and save the HTTP POST URL value from the trigger.
Create a pipeline
Go back to your project in Visual Studio. We'll now add the code that creates a pipeline with a copy activity and
DependsOn property. In this tutorial, the pipeline contains one activity, a copy activity, which takes in the Blob
dataset as a source and another Blob dataset as a sink. If the copy activity succeeds or fails, it calls different
email tasks.
In this pipeline, you use the following features:
Parameters
Web activity
Activity dependency
Using output from an activity as an input to another activity
1. Add this method to your project. The following sections describe this method in more detail.
},
Activities = new List<Activity>
{
new CopyActivity
{
Name = copyBlobActivity,
Inputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSourceDatasetName
}
},
Outputs = new List<DatasetReference>
{
new DatasetReference
{
ReferenceName = blobSinkDatasetName
}
},
Source = new BlobSource { },
Sink = new BlobSink { }
},
new WebActivity
{
Name = sendSuccessEmailActivity,
Method = WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/00000000000000000000000000000000000/triggers/manual/paths/invoke?api-version=2016-10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
},
new WebActivity
{
Name = sendFailEmailActivity,
Method =WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/000000000000000000000000000000000/triggers/manual/paths/invoke?api-version=2016-10-01&sp=%2Ftriggers%2Fmanual%2Frun&sv=1.0&sig=0000000000000000000000000000000000000000000",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').error.message}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Failed" }
}
}
}
}
};
Console.WriteLine(SafeJsonConvert.SerializeObject(resource,
client.SerializationSettings));
return resource;
}
2. Add the following line to the Main method that creates the pipeline:
Parameters
The first section of our pipeline code defines parameters.
sourceBlobContainer . The source blob dataset consumes this parameter in the pipeline.
sinkBlobContainer . The sink blob dataset consumes this parameter in the pipeline.
receiver . The two Web activities in the pipeline that send success or failure emails to the receiver use this
parameter.
Web activity
The Web activity allows a call to any REST endpoint. For more information about the activity, see Web activity in
Azure Data Factory. This pipeline uses a web activity to call the Logic Apps email workflow. You create two web
activities: one that calls the CopySuccessEmail workflow and one that calls the CopyFailWorkFlow .
new WebActivity
{
Name = sendCopyEmailActivity,
Method = WebActivityMethod.POST,
Url = "https://prodxxx.eastus.logic.azure.com:443/workflows/12345",
Body = new EmailRequest("@{activity('CopyBlobtoBlob').output.dataWritten}",
"@{pipeline().DataFactory}", "@{pipeline().Pipeline}", "@pipeline().parameters.receiver"),
DependsOn = new List<ActivityDependency>
{
new ActivityDependency
{
Activity = copyBlobActivity,
DependencyConditions = new List<String> { "Succeeded" }
}
}
}
In the Url property, paste the HTTP POST URL endpoints from your Logic Apps workflows. In the Body
property, pass an instance of the EmailRequest class. The email request contains the following properties:
Message. Passes the value of @{activity('CopyBlobtoBlob').output.dataWritten} . Accesses a property of the
previous copy activity and passes the value of dataWritten . For the failure case, pass the error output
@{activity('CopyBlobtoBlob').error.message} instead.
Data Factory Name. Passes the value of @{pipeline().DataFactory} . This system variable allows you to access the
corresponding data factory name. For a list of system variables, see System Variables.
Pipeline Name. Passes value of @{pipeline().Pipeline} . This system variable allows you to access the
corresponding pipeline name.
Receiver. Passes value of "@pipeline().parameters.receiver" . Accesses the pipeline parameters.
This code creates a new Activity Dependency that depends on the previous copy activity.
Main class
Your final Main method should look like this.
Factory df = CreateOrUpdateDataFactory(client);
This code continuously checks the status of the run until it finishes copying the data.
2. Add the following code to the Main method that retrieves copy activity run details, for example, size of
the data read/written:
if (pipelineRun.Status == "Succeeded")
{
Console.WriteLine(activityRuns.First().Output);
//SaveToJson(SafeJsonConvert.SerializeObject(activityRuns.First().Output,
client.SerializationSettings), "ActivityRunResult.json", folderForJsons);
}
else
Console.WriteLine(activityRuns.First().Error);
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database server (optional) . If you don't already have a database server, create one in the
Azure portal before you get started. Data Factory will in turn create an SSISDB instance on this database
server.
We recommend that you create the database server in the same Azure region as the integration runtime.
This configuration lets the integration runtime write execution logs into SSISDB without crossing Azure
regions.
Keep these points in mind:
Based on the selected database server, the SSISDB instance can be created on your behalf as a
single database, as part of an elastic pool, or in a managed instance. It can be accessible in a public
network or by joining a virtual network. For guidance in choosing the type of database server to
host SSISDB, see Compare SQL Database and SQL Managed Instance.
If you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a managed instance with private endpoint to host SSISDB, or if you require access to on-
premises data without configuring a self-hosted IR, you need to join your Azure-SSIS IR to a virtual
network. For more information, see Create an Azure-SSIS IR in a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This
setting is not applicable when you use an Azure SQL Database server with IP firewall rules/virtual
network service endpoints or a managed instance with private endpoint to host SSISDB. For more
information, see Secure Azure SQL Database. To enable this setting by using PowerShell, see
New-AzSqlServerFirewallRule.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of
the client machine, to the client IP address list in the firewall settings for the database server. For
more information, see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server by using SQL authentication with your server admin
credentials, or by using Azure Active Directory (Azure AD) authentication with the specified
system/user-assigned managed identity for your data factory. For the latter, you need to add the
specified system/user-assigned managed identity for your data factory into an Azure AD group
with access permissions to the database server. For more information, see Create an Azure-SSIS IR
with Azure AD authentication.
Confirm that your database server does not have an SSISDB instance already. The provisioning of
an Azure-SSIS IR does not support using an existing SSISDB instance.
NOTE
For a list of Azure regions in which Data Factory and an Azure-SSIS IR are currently available, see Data Factory and SSIS IR
availability by region.
2. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.
From the authoring UI
1. In the Azure Data Factory UI, switch to the Manage tab, and then switch to the Integration runtimes
tab to view existing integration runtimes in your data factory.
2. Select New to create an Azure-SSIS IR and open the Integration runtime setup pane.
3. In the Integration runtime setup pane, select the Lift-and-shift existing SSIS packages to
execute in Azure tile, and then select Continue .
4. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure-SSIS integration runtime
section.
NOTE
You can use either Azure File Storage or File System linked services to access Azure Files. If you use Azure
File Storage linked service, Azure-SSIS IR package store supports only Basic (not Account key nor SAS URI )
authentication method for now.
a. For Name , enter the name of your linked service.
b. For Description , enter the description of your linked service.
c. For Type , select Azure File Storage , Azure SQL Managed Instance , or File System .
d. You can ignore Connect via integration runtime , since we always use your Azure-SSIS IR to
fetch the access information for package stores.
e. If you select Azure File Storage , for Authentication method , select Basic , and then complete
the following steps.
a. For Account selection method , select From Azure subscription or Enter manually .
b. If you select From Azure subscription , select the relevant Azure subscription , Storage
account name , and File share .
c. If you select Enter manually , enter
\\<storage account name>.file.core.windows.net\<file share name> for Host ,
Azure\<storage account name> for Username , and <storage account key> for Password or
select your Azure Key Vault where it's stored as a secret.
f. If you select Azure SQL Managed Instance , complete the following steps.
a. Select Connection string or your Azure Key Vault where it's stored as a secret.
b. If you select Connection string , complete the following steps.
a. For Account selection method , if you choose From Azure subscription , select
the relevant Azure subscription , Server name , Endpoint type and Database
name . If you choose Enter manually , complete the following steps.
a. For Fully qualified domain name , enter
<server name>.<dns prefix>.database.windows.net or
<server name>.public.<dns prefix>.database.windows.net,3342 as the private
or public endpoint of your Azure SQL Managed Instance, respectively. If you
enter the private endpoint, Test connection isn't applicable, since ADF UI
can't reach it.
b. For Database name , enter msdb .
b. For Authentication type , select SQL Authentication , Managed Identity ,
Service Principal , or User-Assigned Managed Identity .
If you select SQL Authentication , enter the relevant Username and
Password or select your Azure Key Vault where it's stored as a secret.
If you select Managed Identity , grant the system managed identity for your
ADF access to your Azure SQL Managed Instance.
If you select Service Principal , enter the relevant Service principal ID and
Service principal key or select your Azure Key Vault where it's stored as a
secret.
If you select User-Assigned Managed Identity , grant the specified user-
assigned managed identity for your ADF access to your Azure SQL Managed
Instance. You can then select any existing credentials created using your
specified user-assigned managed identities or create new ones.
g. If you select File system , enter the UNC path of folder where your packages are deployed for
Host , as well as the relevant Username and Password or select your Azure Key Vault where it's
stored as a secret.
h. Select Test connection when applicable and if it's successful, select Create .
3. Your added package stores will appear on the Deployment settings page. To remove them, select their
check boxes, and then select Delete .
Select Test connection when applicable and if it's successful, select Continue .
Advanced settings page
On the Advanced settings page of Integration runtime setup pane, complete the following steps.
1. For Maximum Parallel Executions Per Node , select the maximum number of packages to run
concurrently per node in your integration runtime cluster. Only supported package numbers are
displayed. Select a low number if you want to use more than one core to run a single large package that's
compute or memory intensive. Select a high number if you want to run one or more small packages in a
single core.
2. Select the Customize your Azure-SSIS Integration Runtime with additional system
configurations/component installations check box to choose whether you want to add
standard/express custom setups on your Azure-SSIS IR. For more information, see Custom setup for an
Azure-SSIS IR.
3. Select the Select a VNet for your Azure-SSIS Integration Runtime to join, allow ADF to create
certain network resources, and optionally bring your own static public IP addresses check box
to choose whether you want to join your Azure-SSIS IR to a virtual network.
Select it if you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a managed instance with private endpoint to host SSISDB, or if you require access to on-premises data
without configuring a self-hosted IR. For more information, see Create an Azure-SSIS IR in a virtual
network.
4. Select the Set up Self-Hosted Integration Runtime as a proxy for your Azure-SSIS Integration
Runtime check box to choose whether you want to configure a self-hosted IR as proxy for your Azure-
SSIS IR. For more information, see Set up a self-hosted IR as proxy.
5. Select Continue .
On the Summary page of Integration runtime setup pane, review all provisioning settings, bookmark the
recommended documentation links, and select Create to start the creation of your integration runtime.
NOTE
Excluding any custom setup time, this process should finish within 5 minutes.
If you use SSISDB, the Data Factory service will connect to your database server to prepare SSISDB.
When you provision an Azure-SSIS IR, Access Redistributable and Azure Feature Pack for SSIS are also installed. These
components provide connectivity to Excel files, Access files, and various Azure data sources, in addition to the data
sources that built-in components already support. For more information about built-in/preinstalled components, see
Built-in/preinstalled components on Azure-SSIS IR. For more information about additional components that you can
install, see Custom setups for Azure-SSIS IR.
Connections pane
On the Connections pane of Manage hub, switch to the Integration runtimes page and select Refresh .
You can edit/reconfigure your Azure-SSIS IR by selecting its name. You can also select the relevant buttons to
monitor/start/stop/delete your Azure-SSIS IR, auto-generate an ADF pipeline with Execute SSIS Package activity
to run on your Azure-SSIS IR, and view the JSON code/payload of your Azure-SSIS IR. Editing/deleting your
Azure-SSIS IR can only be done when it's stopped.
If you don't use SSISDB, you can deploy your packages into file system, Azure Files, or MSDB hosted by your
Azure SQL Managed Instance and run them on your Azure-SSIS IR by using dtutil and AzureDTExec command-
line utilities.
For more information, see Deploy SSIS projects/packages.
In both cases, you can also run your deployed packages on Azure-SSIS IR by using the Execute SSIS Package
activity in Data Factory pipelines. For more information, see Invoke SSIS package execution as a first-class Data
Factory activity.
See also the following SSIS documentation:
Deploy, run, and monitor SSIS packages in Azure
Connect to SSISDB in Azure
Schedule package executions in Azure
Connect to on-premises data sources with Windows authentication
Next steps
To learn about customizing your Azure-SSIS integration runtime, advance to the following article:
Customize an Azure-SSIS IR
Set up an Azure-SSIS IR in Azure Data Factory by
using PowerShell
7/20/2021 • 16 minutes to read
NOTE
This article demonstrates using Azure PowerShell to set up an Azure-SSIS IR. To use the Azure portal or an Azure Data
Factory app to set up the Azure-SSIS IR, see Tutorial: Set up an Azure-SSIS IR.
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database server or managed instance (optional) . If you don't already have a database
server, create one in the Azure portal before you get started. Data Factory will in turn create an SSISDB
instance on this database server.
We recommend that you create the database server in the same Azure region as the integration runtime.
This configuration lets the integration runtime write execution logs into SSISDB without crossing Azure
regions.
Keep these points in mind:
Based on the selected database server, the SSISDB instance can be created on your behalf as a
single database, as part of an elastic pool, or in a managed instance. It can be accessible in a public
network or by joining a virtual network. For guidance in choosing the type of database server to
host SSISDB, see Compare SQL Database and SQL Managed Instance.
If you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a managed instance with private endpoint to host SSISDB, or if you require access to on-
premises data without configuring a self-hosted IR, you need to join your Azure-SSIS IR to a virtual
network. For more information, see Create an Azure-SSIS IR in a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This
setting is not applicable when you use an Azure SQL Database server with IP firewall rules/virtual
network service endpoints or a managed instance with private endpoint to host SSISDB. For more
information, see Secure Azure SQL Database. To enable this setting by using PowerShell, see
New-AzSqlServerFirewallRule.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of
the client machine, to the client IP address list in the firewall settings for the database server. For
more information, see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server by using SQL authentication with your server admin
credentials, or by using Azure AD authentication with the specified system/user-assigned managed
identity for your data factory. For the latter, you need to add the specified system/user-assigned
managed identity for your data factory into an Azure AD group with access permissions to the
database server. For more information, see Create an Azure-SSIS IR with Azure AD authentication.
Confirm that your database server does not have an SSISDB instance already. The provisioning of
an Azure-SSIS IR does not support using an existing SSISDB instance.
Azure PowerShell . To run a PowerShell script to set up your Azure-SSIS IR, follow the instructions in
Install and configure Azure PowerShell.
NOTE
For a list of Azure regions in which Azure Data Factory and Azure-SSIS IR are currently available, see Azure Data Factory
and Azure-SSIS IR availability by region.
Create variables
Copy the following script to the ISE. Specify values for the variables.
### Azure Data Factory info
# If your input contains a PSH special character (for example, "$"), precede it with the escape character
"`" (for example, "`$")
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
# Data factory name - Must be globally unique
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$DataFactoryLocation = "EastUS"
### Azure-SSIS Integration Runtime info; this is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[your Azure-SSIS IR name]"
$AzureSSISDescription = "[your Azure-SSIS IR description]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, although Enterprise lets you use advanced/premium features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your existing SQL Server license with Software Assurance to earn cost savings from the Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to (2 x number of cores) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info: Standard/express custom setups
$SetupScriptContainerSasUri = "" # OPTIONAL to provide a SAS URI of blob container for standard custom setup where your script and its associated files are stored
$ExpressCustomSetup = "[RunCmdkey|SetEnvironmentVariable|InstallAzurePowerShell|SentryOne.TaskFactory|oh22is.SQLPhonetics.NET|oh22is.HEDDA.IO|KingswaySoft.IntegrationToolkit|KingswaySoft.ProductivityPack|Theobald.XtractIS|AecorSoft.IntegrationService|CData.Standard|CData.Extended or leave it empty]" # OPTIONAL to configure an express custom setup without script
### Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access
To create an Azure SQL Database instance as part of the script, see the following example. Set values for the
variables that haven't been defined already (for example, SSISDBServerName, FirewallIPAddress).
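As a rough sketch under the variable names mentioned above, the database server could be created as follows; $SSISDBServerName and the admin credential are placeholders you define yourself, and the firewall rules for the new server can be added as sketched in the prerequisites.

# Sketch only: create the Azure SQL Database logical server that will host SSISDB.
# $SSISDBServerName and the admin credential are placeholders.
$SSISDBServerName = "[your Azure SQL Database server name]"
$SSISDBServerAdmin = Get-Credential -Message "Server admin user name and password"

New-AzSqlServer -ResourceGroupName $ResourceGroupName `
    -ServerName $SSISDBServerName `
    -Location $DataFactoryLocation `
    -SqlAdministratorCredentials $SSISDBServerAdmin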
# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and ![string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName `
        -Name $AzureSSISName `
        -DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
        -DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName

    if(![string]::IsNullOrEmpty($DataProxyStagingPath))
    {
        Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
            -DataFactoryName $DataFactoryName `
            -Name $AzureSSISName `
            -DataProxyStagingPath $DataProxyStagingPath
    }
}
NOTE
Excluding any custom setup time, this process should finish within 5 minutes.
If you use SSISDB, the Data Factory service will connect to your database server to prepare SSISDB.
When you provision an Azure-SSIS IR, Access Redistributable and Azure Feature Pack for SSIS are also installed. These
components provide connectivity to Excel files, Access files, and various Azure data sources, in addition to the data
sources that built-in components already support. For more information about built-in/preinstalled components, see
Built-in/preinstalled components on Azure-SSIS IR. For more information about additional components that you can
install, see Custom setups for Azure-SSIS IR.
Full script
The PowerShell script in this section configures an instance of Azure-SSIS IR that runs SSIS packages. After you
run this script successfully, you can deploy and run SSIS packages in Azure.
1. Open the ISE.
2. At the ISE command prompt, run the following command:
### Azure-SSIS Integration Runtime info - This is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[your Azure-SSIS IR name]"
$AzureSSISDescription = "[your Azure-SSIS IR description]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your existing SQL Server license with Software Assurance to earn cost savings from the Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to (2 x the number of cores) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info: Standard/express custom setups
$SetupScriptContainerSasUri = "" # OPTIONAL to provide a SAS URI of blob container for standard custom setup where your script and its associated files are stored
$ExpressCustomSetup = "[RunCmdkey|SetEnvironmentVariable|InstallAzurePowerShell|SentryOne.TaskFactory|oh22is.SQLPhonetics.NET|oh22is.HEDDA.IO|KingswaySoft.IntegrationToolkit|KingswaySoft.ProductivityPack|Theobald.XtractIS|AecorSoft.IntegrationService|CData.Standard|CData.Extended or leave it empty]" # OPTIONAL to configure an express custom setup without script
### Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access
# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and ![string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName `
        -Name $AzureSSISName `
        -DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
        -DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName

    if(![string]::IsNullOrEmpty($DataProxyStagingPath))
    {
        Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
            -DataFactoryName $DataFactoryName `
            -Name $AzureSSISName `
            -DataProxyStagingPath $DataProxyStagingPath
    }
}
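The commands that actually create the data factory, provision the Azure-SSIS IR from the variables above, and start it can be sketched as follows; this is a condensed outline rather than the complete script, and the SSISDB and custom setup parameters are omitted here.

# Sketch only: sign in, create the data factory, provision the Azure-SSIS IR, and start it.
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName

Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
    -Name $DataFactoryName `
    -Location $DataFactoryLocation

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Description $AzureSSISDescription `
    -Type Managed `
    -Location $AzureSSISLocation `
    -NodeSize $AzureSSISNodeSize `
    -NodeCount $AzureSSISNodeNumber `
    -Edition $AzureSSISEdition `
    -LicenseType $AzureSSISLicenseType `
    -MaxParallelExecutionsPerNode $AzureSSISMaxParallelExecutionsPerNode

# Start the IR; -Force skips the confirmation prompt.
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName `
    -Name $AzureSSISName `
    -Force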
If you don't use SSISDB, you can deploy your packages into the file system, Azure Files, or MSDB hosted by your
Azure SQL Managed Instance and run them on your Azure-SSIS IR by using dtutil and AzureDTExec command-
line utilities.
For more information, see Deploy SSIS projects/packages.
In both cases, you can also run your deployed packages on Azure-SSIS IR by using the Execute SSIS Package
activity in Data Factory pipelines. For more information, see Invoke SSIS package execution as a first-class Data
Factory activity.
For more SSIS documentation, see:
Deploy, run, and monitor SSIS packages in Azure
Connect to SSISDB in Azure
Connect to on-premises data sources with Windows authentication
Schedule package executions in Azure
Next steps
In this tutorial, you learned how to:
Create a data factory.
Create an Azure-SSIS Integration Runtime.
Start the Azure-SSIS Integration Runtime.
Review the complete script.
Deploy SSIS packages.
To learn about customizing your Azure-SSIS Integration Runtime, see the following article:
Customize your Azure-SSIS IR
Configure an Azure-SQL Server Integration
Services (SSIS) integration runtime (IR) to join a
virtual network
3/5/2021 • 6 minutes to read • Edit Online
Prerequisites
Azure-SSIS integration runtime . If you do not have an Azure-SSIS integration runtime, provision an
Azure-SSIS integration runtime in Azure Data Factory before you begin.
User permission. The user who creates the Azure-SSIS IR must have at least a role assignment on the
Azure Data Factory resource, with one of the options below:
Use the built-in Network Contributor role. This role comes with the Microsoft.Network/* permission,
which has a much larger scope than necessary.
Create a custom role that includes only the necessary
Microsoft.Network/virtualNetworks/*/join/action permission. If you also want to bring your own
public IP addresses for Azure-SSIS IR while joining it to an Azure Resource Manager virtual network,
please also include Microsoft.Network/publicIPAddresses/*/join/action permission in the role.
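If you go with the custom role option, one possible sketch using Az PowerShell follows; the role name and subscription ID are placeholders, and the role is built by trimming an existing built-in definition.

# Sketch only: define a custom role that carries just the virtual network join action.
# The role name and subscription ID are placeholders.
$role = Get-AzRoleDefinition -Name "Network Contributor"
$role.Id = $null
$role.Name = "Azure-SSIS IR VNet Join"
$role.Description = "Can join an Azure-SSIS IR to a virtual network."
$role.Actions.Clear()
$role.Actions.Add("Microsoft.Network/virtualNetworks/*/join/action")
# Uncomment if you also bring your own public IP addresses for the Azure-SSIS IR:
# $role.Actions.Add("Microsoft.Network/publicIPAddresses/*/join/action")
$role.AssignableScopes.Clear()
$role.AssignableScopes.Add("/subscriptions/<your subscription ID>")
New-AzRoleDefinition -Role $role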
Virtual network.
If you do not have a virtual network, create a virtual network using the Azure portal.
Make sure that the virtual network's resource group can create and delete certain Azure network
resources.
The Azure-SSIS IR needs to create certain network resources under the same resource group as
the virtual network. These resources include:
An Azure load balancer, with the name <Guid>-azurebatch-cloudserviceloadbalancer
A network security group, with the name <Guid>-azurebatch-cloudservicenetworksecuritygroup
An Azure public IP address, with the name <Guid>-azurebatch-cloudservicepublicip
Those resources will be created when your Azure-SSIS IR starts. They'll be deleted when your
Azure-SSIS IR stops. To avoid blocking your Azure-SSIS IR from stopping, don't reuse these
network resources in your other resources.
Make sure that you have no resource lock on the resource group/subscription to which the virtual
network belongs. If you configure a read-only/delete lock, starting and stopping your Azure-SSIS
IR will fail, or it will stop responding.
Make sure that you don't have an Azure Policy assignment that prevents the following resources
from being created under the resource group/subscription to which the virtual network belongs:
Microsoft.Network/LoadBalancers
Microsoft.Network/NetworkSecurityGroups
The following network configuration scenarios are not covered in this tutorial:
If you bring your own public IP addresses for the Azure-SSIS IR.
If you use your own Domain Name System (DNS) server.
If you use a network security group (NSG) on the subnet.
If you use Azure ExpressRoute or a user-defined route (UDR).
If you use a customized Azure-SSIS IR.
For more information, see virtual network configuration.
3. Select your data factory with the Azure-SSIS IR in the list. You see the home page for your data factory.
Select the Author & Monitor tile. You see the Data Factory UI on a separate tab.
4. In the Data Factory UI, switch to the Edit tab, select Connections , and switch to the Integration
Runtimes tab.
5. If your Azure-SSIS IR is running, in the Integration Runtimes list, in the Actions column, select the
Stop button for your Azure-SSIS IR. You can't edit your Azure-SSIS IR until you stop it.
6. In the Integration Runtimes list, in the Actions column, select the Edit button for your Azure-SSIS IR.
7. On the integration runtime setup panel, advance through the General Settings and SQL Settings
sections by selecting the Next button.
8. On the Advanced Settings section:
a. Select the Select a VNet for your Azure-SSIS Integration Runtime to join, allow ADF to
create certain network resources, and optionally bring your own static public IP
addresses check box.
b. For Subscription , select the Azure subscription that has your virtual network.
c. For Location, the same location as your integration runtime is selected.
d. For Type , select the type of your virtual network: classic or Azure Resource Manager. We
recommend that you select an Azure Resource Manager virtual network, because classic virtual
networks will be deprecated soon.
e. For VNet Name , select the name of your virtual network. It should be the same one used for SQL
Database with virtual network service endpoints or SQL Managed Instance with private endpoint
to host SSISDB. Or it should be the same one connected to your on-premises network. Otherwise,
it can be any virtual network to bring your own static public IP addresses for Azure-SSIS IR.
f. For Subnet Name, select the name of the subnet for your virtual network. It should be the same one
used for SQL Database with virtual network service endpoints to host SSISDB. Or it should be a
different subnet from the one used for your SQL Managed Instance with private endpoint to host
SSISDB. Otherwise, it can be any subnet to bring your own static public IP addresses for Azure-
SSIS IR.
g. Select VNet Validation . If the validation is successful, select Continue .
9. On the Summary section, review all settings for your Azure-SSIS IR. Then select Update.
10. Start your Azure-SSIS IR by selecting the Start button in the Actions column for your Azure-SSIS IR. It
takes about 20 to 30 minutes to start the Azure-SSIS IR that joins a virtual network.
Next steps
Learn more about joining Azure-SSIS IR to a virtual network.
Push Data Factory lineage data to Azure Purview
(Preview)
5/6/2021 • 2 minutes to read • Edit Online
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free Azure account before you begin.
Azure Data Factory. If you don't have an Azure Data Factory, see Create an Azure Data Factory.
Azure Purview account. The Purview account captures all lineage data generated by the data factory. If you
don't have an Azure Purview account, see Create an Azure Purview account.
Run Data Factory activities and push lineage data to Azure Purview
Step 1: Connect Data Factory to your Purview account
Log in to your Purview account in the Purview portal and go to the Management Center. Choose Data Factory in
External connections and click the New button to create a connection to a new Data Factory.
In the pop-up page, you can choose the Data Factory you want to connect to this Purview account.
You can check the status after creating the connection. Connected means the connection between Data Factory
and this Purview account has been established successfully.
NOTE
You need to be assigned any of the roles below in the Purview account, along with the Data Factory Contributor role, to create the
connection between Data Factory and Azure Purview.
Owner
User Access Administrator
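The Data Factory Contributor assignment can also be granted with Az PowerShell; the following is a hedged sketch in which the sign-in name and resource IDs are placeholders.

# Sketch only: grant the Data Factory Contributor role on the data factory.
# The sign-in name and resource IDs are placeholders.
New-AzRoleAssignment -SignInName "user@contoso.com" `
    -RoleDefinitionName "Data Factory Contributor" `
    -Scope "/subscriptions/<subscription ID>/resourceGroups/<resource group>/providers/Microsoft.DataFactory/factories/<data factory name>"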
If you don't know how to create Copy and Dataflow activities, see Copy data from Azure Blob storage to a
database in Azure SQL Database by using Azure Data Factory and Transform data using mapping data flows.
Step 3: Run Execute SSIS Package activities in Data Factory
You can create pipelines with Execute SSIS Package activities in Data Factory. You don't need any additional
configuration for lineage data capture. The lineage data is automatically captured during activity
execution.
If you don't know how to create Execute SSIS Package activities, see Run SSIS Packages in Azure.
Step 4: View lineage information in your Purview account
Go back to your Purview account. On the home page, select Browse assets. Choose the asset you want, and
click the Lineage tab. You will see all the lineage information.
You also can see lineage data for Execute SSIS Package activity.
NOTE
For the lineage of Execute SSIS Package activity, we only support source and destination. The lineage for transformation is
not supported yet.
Next steps
Catalog lineage user guide
Connect Data Factory to Azure Purview
Data integration using Azure Data Factory and
Azure Data Share
7/2/2021 • 22 minutes to read • Edit Online
Prerequisites
Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
Azure SQL Database: If you don't have a SQL DB, learn how to create a SQL DB account.
Azure Data Lake Storage Gen2 storage account: If you don't have an ADLS Gen2 storage account, learn how to create an ADLS Gen2 storage account.
Azure Synapse Analytics: If you don't have an Azure Synapse Analytics instance, learn how to create an Azure Synapse Analytics instance.
Azure Data Factory: If you haven't created a data factory, see how to create a data factory.
Azure Data Share: If you haven't created a data share, see how to create a data share.
4. Click on Author and Monitor to open up the ADF UX. The ADF UX can also be accessed at
adf.azure.com.
5. You'll be redirected to the homepage of the ADF UX. This page contains quick-starts, instructional videos,
and links to tutorials to learn data factory concepts. To start authoring, click on the pencil icon in left side-
bar.
Create an Azure SQL Database linked service
1. To create a linked service, select the Manage hub in the left side-bar. On the Connections pane, select
Linked services and then select New to add a new linked service.
2. The first linked service you'll configure is an Azure SQL DB. You can use the search bar to filter the data
store list. Click on the Azure SQL Database tile and click continue.
3. In the SQL DB configuration pane, enter 'SQLDB' as your linked service name. Enter in your credentials to
allow data factory to connect to your database. If you're using SQL authentication, enter in the server
name, the database, your user name and password. You can verify your connection information is correct
by clicking Test connection . Click Create when finished.
Create an Azure Synapse Analytics linked service
1. Repeat the same process to add an Azure Synapse Analytics linked service. In the connections tab, click
New . Select the Azure Synapse Analytics tile and click continue.
2. In the linked service configuration pane, enter 'SQLDW' as your linked service name. Enter in your
credentials to allow data factory to connect to your database. If you're using SQL authentication, enter in
the server name, the database, your user name and password. You can verify your connection
information is correct by clicking Test connection . Click Create when finished.
Create an Azure Data Lake Storage Gen2 linked service
1. The last linked service needed for this lab is an Azure Data Lake Storage gen2. In the connections tab,
click New . Select the Azure Data Lake Storage Gen2 tile and click continue.
2. In the linked service configuration pane, enter 'ADLSGen2' as your linked service name. If you're using
Account key authentication, select your ADLS Gen2 storage account from the Storage account name
dropdown. You can verify your connection information is correct by clicking Test connection . Click
Create when finished.
2. In the General tab of the pipeline canvas, name your pipeline something descriptive such as
'IngestAndTransformTaxiData'.
3. In the activities pane of the pipeline canvas, open the Move and Transform accordion and drag the
Copy data activity onto the canvas. Give the copy activity a descriptive name such as 'IngestIntoADLS'.
3. Call your dataset 'TripData'. Select 'SQLDB' as your linked service. Select table name 'dbo.TripData' from
the table name dropdown. Import the schema From connection/store . Click OK when finished.
You have successfully created your source dataset. Make sure in the source settings, the default value Table is
selected in the use query field.
Configure ADLS Gen2 sink dataset
1. Click on the Sink tab of the copy activity. To create a new dataset, click New .
2. Search for Azure Data Lake Storage Gen2 and click continue.
3. In the select format pane, select DelimitedText as you're writing to a csv file. Click continue.
4. Name your sink dataset 'TripDataCSV'. Select 'ADLSGen2' as your linked service. Enter where you want to
write your csv file. For example, you can write your data to file trip-data.csv in container
staging-container . Set First row as header to true as you want your output data to have headers.
Since no file exists in the destination yet, set Import schema to None. Click OK when finished.
Test the copy activity with a pipeline debug run
1. To verify your copy activity is working correctly, click Debug at the top of the pipeline canvas to execute a
debug run. A debug run allows you to test your pipeline either end-to-end or until a breakpoint before
publishing it to the data factory service.
2. To monitor your debug run, go to the Output tab of the pipeline canvas. The monitoring screen will
autorefresh every 20 seconds or when you manually click the refresh button. The copy activity has a
special monitoring view, which can be accessed by clicking the eyeglasses icon in the Actions column.
3. The copy monitoring view gives the activity's execution details and performance characteristics. You can
see information such as data read/written, rows read/written, files read/written, and throughput. If you
have configured everything correctly, you should see 49,999 rows written into one file in your ADLS sink.
4. Before moving on to the next section, it's suggested that you publish your changes to the data factory
service by clicking Publish all in the factory top bar. While not covered in this lab, Azure Data Factory
supports full git integration. Git integration allows for version control, iterative saving in a repository, and
collaboration on a data factory. For more information, see source control in Azure Data Factory.
Transform data using mapping data flow
Now that you have successfully copied data into Azure Data Lake Storage, it is time to join and aggregate that
data into a data warehouse. We will use mapping data flow, Azure Data Factory's visually designed
transformation service. Mapping data flows allow users to develop transformation logic code-free and execute
it on Spark clusters managed by the ADF service.
The data flow created in this step inner joins the 'TripDataCSV' dataset created in the previous section with a
table 'dbo.TripFares' stored in 'SQLDB' based on four key columns. Then the data gets aggregated based upon the
column payment_type to calculate the average of certain fields, and the result is written to an Azure Synapse Analytics table.
Add a data flow activity to your pipeline
1. In the activities pane of the pipeline canvas, open the Move and Transform accordion and drag the
Data flow activity onto the canvas.
2. In the side pane that opens, select Create new data flow and choose Mapping data flow . Click OK .
3. You'll be directed to the data flow canvas where you'll be building your transformation logic. In the
general tab, name your data flow 'JoinAndAggregateData'.
3. Go to the Schema tab and click Import schema. Select From connection/store to import directly from
the file store. 14 columns of type string should appear.
4. Go back to data flow 'JoinAndAggregateData'. If your debug cluster has started (indicated by a green
circle next to the debug slider), you can get a snapshot of the data in the Data Preview tab. Click
Refresh to fetch a data preview.
NOTE
Data preview does not write data.
2. Name this source 'TripFaresSQL'. Click New next to the source dataset field to create a new SQL DB
dataset.
3. Select the Azure SQL Database tile and click continue. Note: You may notice many of the connectors in
data factory are not supported in mapping data flow. To transform data from one of these sources, ingest
it into a supported source using the copy activity.
4. Call your dataset 'TripFares'. Select 'SQLDB' as your linked service. Select table name 'dbo.TripFares' from
the table name dropdown. Import the schema From connection/store . Click OK when finished.
5. To verify your data, fetch a data preview in the Data Preview tab.
Inner join TripDataCSV and TripFaresSQL
1. To add a new transformation, click the plus icon in the bottom-right corner of 'TripDataCSV'. Under
Multiple inputs/outputs , select Join .
2. Name your join transformation 'InnerJoinWithTripFares'. Select 'TripFaresSQL' from the right stream
dropdown. Select Inner as the join type. To learn more about the different join types in mapping data
flow, see join types.
Select which columns you wish to match on from each stream via the Join conditions dropdown. To
add an additional join condition, click on the plus icon next to an existing condition. By default, all join
conditions are combined with an AND operator, which means all conditions must be met for a match. In
this lab, we want to match on columns medallion , hack_license , vendor_id , and pickup_datetime
4. To enter an aggregation expression, click the blue box labeled Enter expression . This will open up the
data flow expression builder, a tool used to visually create data flow expressions using input schema,
built-in functions and operations, and user-defined parameters. For more information on the capabilities
of the expression builder, see the expression builder documentation.
To get the average fare, use the avg() aggregation function to aggregate the total_amount column cast
to an integer with toInteger() . In the data flow expression language, this is defined as
avg(toInteger(total_amount)) . Click Save and finish when you're done.
5. To add an additional aggregation expression, click on the plus icon next to average_fare . Select Add
column .
6. In the text box labeled Add or select a column , enter 'total_trip_distance'. As in the last step, open the
expression builder to enter in the expression.
To get the total trip distance, use the sum() aggregation function to aggregate the trip_distance
column cast to an integer with toInteger() . In the data flow expression language, this is defined as
sum(toInteger(trip_distance)) . Click Save and finish when you're done.
7. Test your transformation logic in the Data Preview tab. As you can see, there are significantly fewer
rows and columns than previously. Only the three group-by and aggregation columns defined in this
transformation continue downstream. Because there are only five payment type groups in the sample, only five
rows are output.
Configure your Azure Synapse Analytics sink
1. Now that we have finished our transformation logic, we are ready to sink our data in an Azure Synapse
Analytics table. Add a sink transformation under the Destination section.
2. Name your sink 'SQLDWSink'. Click New next to the sink dataset field to create a new Azure Synapse
Analytics dataset.
5. Go to the Settings tab of the sink. Since we are creating a new table, we need to select Recreate table
under table action. Unselect Enable staging , which toggles whether we are inserting row-by-row or in
batch.
You have successfully created your data flow. Now it's time to run it in a pipeline activity.
Debug your pipeline end-to-end
1. Go back to the tab for the 'IngestAndTransformTaxiData' pipeline. Notice the green box on the
'IngestIntoADLS' copy activity. Drag it over to the 'JoinAndAggregateData' data flow activity. This creates
an 'on success' condition, which causes the data flow activity to run only if the copy is successful.
2. As we did for the copy activity, click Debug to execute a debug run. For debug runs, the data flow activity
will use the active debug cluster instead of spinning up a new cluster. This pipeline will take a little over a
minute to execute.
3. Like the copy activity, the data flow has a special monitoring view accessed by the eyeglasses icon on
completion of the activity.
4. In the monitoring view, you can see a simplified data flow graph along with the execution times and rows
at each execution stage. If done correctly, you should have aggregated 49,999 rows into five rows in this
activity.
5. You can click a transformation to get additional details on its execution such as partitioning information
and new/updated/dropped columns.
You have now completed the data factory portion of this lab. Publish your resources if you wish to
operationalize them with triggers. You successfully ran a pipeline that ingested data from Azure SQL Database to
Azure Data Lake Storage using the copy activity and then aggregated that data into Azure Synapse Analytics.
You can verify the data was successfully written by looking at the SQL Server itself.
12. You'll be given a script to run before you can proceed. The script provided creates a user in the SQL
database to allow the Azure Data Share MSI to authenticate on its behalf.
IMPORTANT
Before running the script, you must set yourself as the Active Directory Admin for the SQL Server.
1. Open a new tab and navigate to the Azure portal. Copy the script provided to create a user in the
database that you want to share data from. Do this by logging into the EDW database using Query
Explorer (preview) using AAD authentication.
You'll need to modify the script so that the user created is contained within brackets. For example:
create user [dataprovider-xxxx] from external login; exec sp_addrolemember db_owner, [dataprovider-xxxx];
2. Switch back to Azure Data Share where you were adding datasets to your data share.
3. Select EDW , then select AggregatedTaxiData for the table.
4. Select Add dataset
We now have a SQL table that is part of our dataset. Next, we will add additional datasets from Azure
Data Lake Store.
5. Select Add dataset and select Azure Data Lake Store Gen2
6. Select Next
7. Expand wwtaxidata. Expand Boston Taxi Data. Notice that you can share down to the file level.
8. Select the Boston Taxi Data folder to add the entire folder to your data share.
9. Select Add datasets
10. Review the datasets that have been added. You should have a SQL table and an ADLS Gen2 folder added
to your data share.
11. Select Continue
12. In this screen, you can add recipients to your data share. The recipients you add will receive invitations to
your data share. For the purpose of this lab, you must add in 2 e-mail addresses:
a. The e-mail address of the Azure subscription you're in.
21. Select the invitation to [email protected]. Select Delete. If your recipient hasn't yet accepted the
invitation, they will no longer be able to do so.
22. Select the History tab. Nothing is displayed as yet because your data consumer hasn't yet accepted your
invitation and triggered a snapshot.
Receiving data (Data consumer flow)
Now that we have reviewed our data share, we are ready to switch context and wear our data consumer hat.
You should now have an Azure Data Share invitation in your inbox from Microsoft Azure. Launch Outlook Web
Access (outlook.com) and log in using the credentials supplied for your Azure subscription.
In the e-mail that you should have received, click on "View invitation >". At this point, you're going to be
simulating the data consumer experience when accepting a data provider's invitation to their data share.
You may be prompted to select a subscription. Make sure you select the subscription you have been working in
for this lab.
1. Click on the invitation titled DataProvider.
2. In this Invitation screen, you'll notice various details about the data share that you configured earlier as a
data provider. Review the details and accept the terms of use if provided.
3. Select the Subscription and Resource Group that already exists for your lab.
4. For Data share account , select DataConsumer . You can also create a new data share account.
5. Next to Received share name , you'll notice the default share name is the name that was specified by
the data provider. Give the share a friendly name that describes the data you're about to receive, for example
TaxiDataShare.
6. You can choose to Accept and configure now or Accept and configure later . If you choose to accept
and configure now, you'll specify a storage account where all data should be copied. If you choose to
accept and configure later, the datasets in the share will be unmapped and you'll need to manually map
them. We'll opt to accept and configure later.
7. Select Accept and configure later .
In configuring this option, a share subscription is created but there is nowhere for the data to land since
no destination has been mapped.
Next, we will configure dataset mappings for the data share.
8. Select the Received Share (the name you specified in step 5).
Trigger snapshot is greyed out but the share is Active.
9. Select the Datasets tab. Notice that each dataset is Unmapped, which means that it has no destination to
copy data to.
10. Select the Azure Synapse Analytics Table and then select + Map to Target .
11. On the right-hand side of the screen, select the Target Data Type drop down.
You can map the SQL data to a wide range of data stores. In this case, we'll be mapping to an Azure SQL
Database.
(Optional) Select Azure Data Lake Store Gen2 as the target data type.
(Optional) Select the Subscription, Resource Group and Storage account you have been working in.
(Optional) You can choose to receive the data into your data lake in either csv or parquet format.
12. Next to Target data type , select Azure SQL Database.
13. Select the Subscription, Resource Group and Storage account you have been working in.
14. Before you can proceed, you'll need to create a new user in the SQL Server by running the script
provided. First, copy the script provided to your clipboard.
15. Open a new Azure portal tab. Don't close your existing tab as you'll need to come back to it in a moment.
16. In the new tab you opened, navigate to SQL databases .
17. Select the SQL database (there should only be one in your subscription). Be careful not to select the data
warehouse.
18. Select Query editor (preview)
19. Use AAD authentication to log in to Query editor.
20. Run the query provided in your data share (copied to clipboard in step 14).
This command allows the Azure Data Share service to use Managed Identities for Azure Services to
authenticate to the SQL Server to be able to copy data into it.
21. Go back to the original tab, and select Map to target .
22. Next, select the Azure Data Lake Gen2 folder that is part of the dataset and map it to an Azure Blob
Storage account.
With all datasets mapped, you're now ready to start receiving data from the data provider.
This tutorial provides steps for using the Azure portal to set up Private Link Service and access an on-premises SQL
Server from a Managed VNet using Private Endpoint.
NOTE
The solution presented in this article describes SQL Server connectivity, but you can use a similar approach to connect
and query other available on-premises connectors that are supported in Azure Data Factory.
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Virtual Network. If you don't have a Virtual Network, create one following Create Virtual Network.
Virtual network to on-premises network. Create a connection between virtual network and on-
premises network either using ExpressRoute or VPN.
Data Factory with Managed VNet enabled. If you don't have a Data Factory or Managed VNet is not
enabled, create one following Create Data Factory with Managed VNet.
3. Accept the defaults for the remaining settings, and then select Review + create .
4. In the Review + create tab, select Create .
Create load balancer resources
Create a backend pool
A backend address pool contains the IP addresses of the virtual network interfaces (NICs) connected to the load balancer.
Create the backend address pool myBackendPool to include virtual machines for load-balancing internet
traffic.
1. Select All services in the left-hand menu, select All resources , and then select myLoadBalancer from the
resources list.
2. Under Settings , select Backend pools , then select Add .
3. On the Add a backend pool page, for name, type myBackendPool , as the name for your backend pool,
and then select Add .
Create a health probe
The load balancer monitors the status of your app with a health probe.
The health probe adds or removes VMs from the load balancer based on their response to health checks.
Create a health probe named myHealthProbe to monitor the health of the VMs.
1. Select All services in the left-hand menu, select All resources , and then select myLoadBalancer from
the resources list.
2. Under Settings , select Health probes , then select Add .
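If you prefer scripting, the same backend pool and health probe can be sketched with the Az.Network PowerShell module; the resource group name is a placeholder, and probing TCP port 1433 is an assumption for SQL Server traffic.

# Sketch only: add the backend pool and a TCP health probe to the existing load balancer.
$lb = Get-AzLoadBalancer -Name "myLoadBalancer" -ResourceGroupName "MyResourceGroup"

Add-AzLoadBalancerBackendAddressPoolConfig -LoadBalancer $lb -Name "myBackendPool"

Add-AzLoadBalancerProbeConfig -LoadBalancer $lb `
    -Name "myHealthProbe" `
    -Protocol Tcp `
    -Port 1433 `
    -IntervalInSeconds 15 `
    -ProbeCount 2

# Persist both changes to the load balancer.
Set-AzLoadBalancer -LoadBalancer $lb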
Project details
Instance details
6. Select the Outbound settings tab or select Next: Outbound settings at the bottom of the page.
7. In the Outbound settings tab, enter or select the following information:
8. Select the Access security tab or select Next: Access security at the bottom of the page.
9. Leave the default of Role-based access control only in the Access security tab.
10. Select the Tags tab or select Next: Tags at the bottom of the page.
11. Select the Review + create tab or select Next: Review + create at the bottom of the page.
12. Select Create in the Review + create tab.
Project details
Instance details
Administrator account
3. Select the Networking tab, or select Next: Disks , then Next: Networking .
4. In the Networking tab, select or enter:
Network interface
Subnet: be-subnet
Load balancing
NOTE
FQDN doesn't work for on-premises SQL Server unless you add a record in the Azure DNS zone.
3. Run the command below and check the iptables in your backend server VMs. You should see one record in your
iptables with your target IP.
sudo iptables -t nat -v -L PREROUTING -n --line-number
NOTE
If you have more than one SQL Server or other data sources, you need to define multiple load balancer rules and IP
table records with different ports; otherwise, they will conflict with each other.
Next steps
Advance to the following tutorial to learn about accessing Microsoft Azure SQL Managed Instance from Data
Factory Managed VNet using Private Endpoint:
Access SQL Managed Instance from Data Factory Managed VNet
Tutorial: How to access SQL Managed Instance from
Data Factory Managed VNET using Private
Endpoint
6/10/2021 • 7 minutes to read • Edit Online
This tutorial provides steps for using the Azure portal to set up Private Link Service and access SQL Managed
Instance from a Managed VNET using Private Endpoint.
Prerequisites
Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Virtual Network. If you don't have a Virtual Network, create one following Create Virtual Network.
Virtual network to on-premises network. Create a connection between virtual network and on-
premises network either using ExpressRoute or VPN.
Data Factory with Managed VNET enabled. If you don't have a Data Factory or Managed VNET is not
enabled, create one following Create Data Factory with Managed VNET.
3. Accept the defaults for the remaining settings, and then select Review + create .
4. In the Review + create tab, select Create .
Create load balancer resources
Create a backend pool
A backend address pool contains the IP addresses of the virtual network interfaces (NICs) connected to the load balancer.
Create the backend address pool myBackendPool to include virtual machines for load-balancing internet
traffic.
1. Select All services in the left-hand menu, select All resources , and then select myLoadBalancer from the
resources list.
2. Under Settings , select Backend pools , then select Add .
3. On the Add a backend pool page, for name, type myBackendPool , as the name for your backend pool,
and then select Add .
Create a health probe
The load balancer monitors the status of your app with a health probe.
The health probe adds or removes VMs from the load balancer based on their response to health checks.
Create a health probe named myHealthProbe to monitor the health of the VMs.
1. Select All services in the left-hand menu, select All resources , and then select myLoadBalancer from
the resources list.
2. Under Settings , select Health probes , then select Add .
Project details
Instance details
6. Select the Outbound settings tab or select Next: Outbound settings at the bottom of the page.
7. In the Outbound settings tab, enter or select the following information:
8. Select the Access security tab or select Next: Access security at the bottom of the page.
9. Leave the default of Role-based access control only in the Access security tab.
10. Select the Tags tab or select Next: Tags at the bottom of the page.
11. Select the Review + create tab or select Next: Review + create at the bottom of the page.
12. Select Create in the Review + create tab.
Project details
Instance details
Administrator account
3. Select the Networking tab, or select Next: Disks , then Next: Networking .
4. In the Networking tab, select or enter:
Network interface
Subnet: be-subnet
Load balancing
3. Run the command below and check the iptables in your backend server VMs. You should see one record in your
iptables with your target IP.
sudo iptables -t nat -v -L PREROUTING -n --line-number
NOTE
If you have more than one SQL Managed Instance or other data sources, you need to define multiple load balancer rules
and IP table records with different ports; otherwise, they will conflict with each other.
NOTE
Enter the SQL Managed Instance host name manually. Otherwise, the name shown in the selection list is not a fully
qualified domain name.
Copy data
Copy blobs from a folder to another folder in an Azure Blob Storage: This PowerShell script copies blobs from a folder in Azure Blob Storage to another folder in the same Blob Storage.
Copy data from SQL Server to Azure Blob Storage: This PowerShell script copies data from a SQL Server database to an Azure blob storage.
Bulk copy: This sample PowerShell script copies data from multiple tables in a database in Azure SQL Database to Azure Synapse Analytics.
Incremental copy: This sample PowerShell script loads only new or updated records from a source data store to a sink data store after the initial full copy of data from the source to the sink.
Transform data
Transform data using a Spark cluster: This PowerShell script transforms data by running a program on a Spark cluster.
Create Azure-SSIS integration runtime: This PowerShell script provisions an Azure-SSIS integration runtime that runs SQL Server Integration Services (SSIS) packages in Azure.
Pipelines and activities in Azure Data Factory
6/24/2021 • 16 minutes to read • Edit Online
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together perform
a task. For example, a pipeline could contain a set of activities that ingest and clean log data, and then kick off a
mapping data flow to analyze the log data. The pipeline allows you to manage the activities as a set instead of
each one individually. You deploy and schedule the pipeline instead of the activities independently.
The activities in a pipeline define actions to perform on your data. For example, you may use a copy activity to
copy data from SQL Server to an Azure Blob Storage. Then, use a data flow activity or a Databricks Notebook
activity to process and transform data from the blob storage to an Azure Synapse Analytics pool on top of which
business intelligence reporting solutions are built.
Data Factory has three groupings of activities: data movement activities, data transformation activities, and
control activities. An activity can take zero or more input datasets and produce one or more output datasets. The
following diagram shows the relationship between pipeline, activity, and dataset in Data Factory:
An input dataset represents the input for an activity in the pipeline, and an output dataset represents the output
for the activity. Datasets identify data within different data stores, such as tables, files, folders, and documents.
After you create a dataset, you can use it with activities in a pipeline. For example, a dataset can be an
input/output dataset of a Copy Activity or an HDInsight Hive Activity. For more information about datasets, see the
Datasets in Azure Data Factory article.
Azure Cognitive ✓ ✓ ✓
Search index
Azure Cosmos ✓ ✓ ✓ ✓
DB (SQL API)
CATEGORY | DATA STORE | SUPPORTED AS A SOURCE | SUPPORTED AS A SINK | SUPPORTED BY AZURE IR | SUPPORTED BY SELF-HOSTED IR
Azure Cosmos ✓ ✓ ✓ ✓
DB's API for
MongoDB
Azure Data ✓ ✓ ✓ ✓
Explorer
Azure Database ✓ ✓ ✓
for MariaDB
Azure Database ✓ ✓ ✓ ✓
for MySQL
Azure Database ✓ ✓ ✓ ✓
for PostgreSQL
Azure Databricks ✓ ✓ ✓ ✓
Delta Lake
Azure File ✓ ✓ ✓ ✓
Storage
Azure SQL ✓ ✓ ✓ ✓
Database
Azure SQL ✓ ✓ ✓ ✓
Managed
Instance
Azure Synapse ✓ ✓ ✓ ✓
Analytics
Azure Table ✓ ✓ ✓ ✓
storage
DB2 ✓ ✓ ✓
Drill ✓ ✓ ✓
Google ✓ ✓ ✓
BigQuery
Greenplum ✓ ✓ ✓
HBase ✓ ✓ ✓
Hive ✓ ✓ ✓
Apache Impala ✓ ✓ ✓
Informix ✓ ✓ ✓
MariaDB ✓ ✓ ✓
Microsoft Access ✓ ✓ ✓
MySQL ✓ ✓ ✓
Netezza ✓ ✓ ✓
Oracle ✓ ✓ ✓ ✓
Phoenix ✓ ✓ ✓
PostgreSQL ✓ ✓ ✓
Presto ✓ ✓ ✓
SAP Business ✓ ✓
Warehouse via
Open Hub
SAP Business ✓ ✓
Warehouse via
MDX
SAP HANA ✓ ✓ ✓
SAP table ✓ ✓
Snowflake ✓ ✓ ✓ ✓
Spark ✓ ✓ ✓
SQL Server ✓ ✓ ✓ ✓
Sybase ✓ ✓
Teradata ✓ ✓ ✓
Vertica ✓ ✓ ✓
NoSQL Cassandra ✓ ✓ ✓
Couchbase ✓ ✓ ✓
(Preview)
MongoDB ✓ ✓ ✓ ✓
MongoDB Atlas ✓ ✓ ✓ ✓
File Amazon S3 ✓ ✓ ✓
Amazon S3 ✓ ✓ ✓
Compatible
Storage
File system ✓ ✓ ✓ ✓
FTP ✓ ✓ ✓
Google Cloud ✓ ✓ ✓
Storage
HDFS ✓ ✓ ✓
Oracle Cloud ✓ ✓ ✓
Storage
SFTP ✓ ✓ ✓ ✓
Generic OData ✓ ✓ ✓
Generic ODBC ✓ ✓ ✓
Generic REST ✓ ✓ ✓ ✓
Concur (Preview) ✓ ✓ ✓
Dataverse ✓ ✓ ✓ ✓
Dynamics 365 ✓ ✓ ✓ ✓
Dynamics AX ✓ ✓ ✓
Dynamics CRM ✓ ✓ ✓ ✓
Google AdWords ✓ ✓ ✓
HubSpot ✓ ✓ ✓
Jira ✓ ✓ ✓
Magento ✓ ✓ ✓
(Preview)
Marketo ✓ ✓ ✓
(Preview)
Microsoft 365 ✓ ✓ ✓
Oracle Eloqua ✓ ✓ ✓
(Preview)
Oracle ✓ ✓ ✓
Responsys
(Preview)
Oracle Service ✓ ✓ ✓
Cloud (Preview)
PayPal (Preview) ✓ ✓ ✓
QuickBooks ✓ ✓ ✓
(Preview)
Salesforce ✓ ✓ ✓ ✓
Salesforce ✓ ✓ ✓ ✓
Service Cloud
Salesforce ✓ ✓ ✓
Marketing Cloud
SAP ECC ✓ ✓ ✓
ServiceNow ✓ ✓ ✓
SharePoint ✓ ✓ ✓
Online List
Shopify (Preview) ✓ ✓ ✓
Square (Preview) ✓ ✓ ✓
Web table ✓ ✓
(HTML table)
Xero ✓ ✓ ✓
Zoho (Preview) ✓ ✓ ✓
NOTE
If a connector is marked Preview, you can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, contact Azure support.
Wait Activity: When you use a Wait activity in a pipeline, the pipeline waits for the specified time before continuing with execution of subsequent activities.
Web Activity: Web Activity can be used to call a custom REST endpoint from a Data Factory pipeline. You can pass datasets and linked services to be consumed and accessed by the activity.
Webhook Activity: Using the webhook activity, call an endpoint and pass a callback URL. The pipeline run waits for the callback to be invoked before proceeding to the next activity.
Pipeline JSON
Here is how a pipeline is defined in JSON format:
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities":
[
],
"parameters": {
},
"concurrency": <your max pipeline concurrency>,
"annotations": [
]
}
}
Activity JSON
The activities section can have one or more activities defined within it. There are two main types of activities:
Execution and Control Activities.
Execution activities
Execution activities include data movement and data transformation activities. They have the following top-level
structure:
{
"name": "Execution Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"linkedServiceName": "MyLinkedService",
"policy":
{
},
"dependsOn":
{
}
}
linkedServiceName: Name of the linked service used by the activity. An activity may require that you specify the linked service that links to the required compute environment. Required: Yes for HDInsight Activity, Azure Machine Learning Studio (classic) Batch Scoring Activity, and Stored Procedure Activity; No for all others.
Activity policy
Policies affect the run-time behavior of an activity, giving configurability options. Activity Policies are only
available for execution activities.
Activity policy JSON definition
{
"name": "MyPipelineName",
"properties": {
"activities": [
{
"name": "MyCopyBlobtoSqlActivity",
"type": "Copy",
"typeProperties": {
...
},
"policy": {
"timeout": "00:10:00",
"retry": 1,
"retryIntervalInSeconds": 60,
"secureOutput": true
}
}
],
"parameters": {
...
}
}
}
timeout: Specifies the timeout for the activity to run. Allowed value: Timespan. Required: No; the default timeout is 7 days.
Control activity
Control activities have the following top-level structure:
{
"name": "Control Activity Name",
"description": "description",
"type": "<ActivityType>",
"typeProperties":
{
},
"dependsOn":
{
}
}
Activity dependency
Activity Dependency defines how subsequent activities depend on previous activities, determining the condition
of whether to continue executing the next task. An activity can depend on one or multiple previous activities
with different dependency conditions.
The different dependency conditions are: Succeeded, Failed, Skipped, Completed.
For example, if a pipeline has Activity A -> Activity B, the different scenarios that can happen are:
Activity B has dependency condition on Activity A with succeeded : Activity B only runs if Activity A has a
final status of succeeded
Activity B has dependency condition on Activity A with failed : Activity B only runs if Activity A has a final
status of failed
Activity B has dependency condition on Activity A with completed : Activity B runs if Activity A has a final
status of succeeded or failed
Activity B has a dependency condition on Activity A with skipped : Activity B runs if Activity A has a final
status of skipped. Skipped occurs in the scenario of Activity X -> Activity Y -> Activity Z, where each activity
runs only if the previous activity succeeds. If Activity X fails, then Activity Y has a status of "Skipped" because
it never executes. Similarly, Activity Z has a status of "Skipped" as well.
Example: Activity 2 depends on the Activity 1 succeeding
{
"name": "PipelineName",
"properties":
{
"description": "pipeline description",
"activities": [
{
"name": "MyFirstActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
}
},
{
"name": "MySecondActivity",
"type": "Copy",
"typeProperties": {
},
"linkedServiceName": {
},
"dependsOn": [
{
"activity": "MyFirstActivity",
"dependencyConditions": [
"Succeeded"
]
}
]
}
],
"parameters": {
}
}
}
The typeProper ties section is different for each transformation activity. To learn about type properties
supported for a transformation activity, click the transformation activity in the Data transformation activities.
For a complete walkthrough of creating this pipeline, see Tutorial: transform data using Spark.
Scheduling pipelines
Pipelines are scheduled by triggers. There are different types of triggers (Scheduler trigger, which allows
pipelines to be triggered on a wall-clock schedule, as well as the manual trigger, which triggers pipelines on-
demand). For more information about triggers, see pipeline execution and triggers article.
To have your trigger kick off a pipeline run, you must include a pipeline reference of the particular pipeline in the
trigger definition. Pipelines & triggers have an n-m relationship. Multiple triggers can kick off a single pipeline,
and the same trigger can kick off multiple pipelines. Once the trigger is defined, you must start the trigger to
have it start triggering the pipeline. For more information about triggers, see pipeline execution and triggers
article.
For example, say you have a Scheduler trigger, "Trigger A," that you want to kick off your pipeline, "MyCopyPipeline."
You define the trigger as shown in the following example:
Trigger A definition
{
"name": "TriggerA",
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
...
},
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "MyCopyPipeline"
},
"parameters": {
"copySourceName": "FileSource"
}
}
}
}
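As a hedged sketch, a trigger definition like the one above, saved to a local JSON file, could be deployed and started with the Az.DataFactory module; the resource names and file path are placeholders. Invoke-AzDataFactoryV2Pipeline is shown as the on-demand (manual) alternative mentioned earlier.

# Sketch only: deploy and start the trigger; names and paths are placeholders.
Set-AzDataFactoryV2Trigger -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" `
    -Name "TriggerA" `
    -DefinitionFile ".\TriggerA.json"

Start-AzDataFactoryV2Trigger -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" `
    -Name "TriggerA" `
    -Force

# For an on-demand run instead of a scheduled one, invoke the pipeline directly.
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" `
    -PipelineName "MyCopyPipeline"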
Next steps
See the following tutorials for step-by-step instructions for creating pipelines with activities:
Build a pipeline with a copy activity
Build a pipeline with a data transformation activity
How to achieve CI/CD (continuous integration and delivery) using Azure Data Factory
Continuous integration and delivery in Azure Data Factory
Linked services in Azure Data Factory
4/22/2021 • 4 minutes to read • Edit Online
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together
perform a task. The activities in a pipeline define actions to perform on your data. For example, you might use a
copy activity to copy data from SQL Server to Azure Blob storage. Then, you might use a Hive activity that runs a
Hive script on an Azure HDInsight cluster to process data from Blob storage to produce output data. Finally, you
might use a second copy activity to copy the output data to Azure Synapse Analytics, on top of which business
intelligence (BI) reporting solutions are built. For more information about pipelines and activities, see Pipelines
and activities in Azure Data Factory.
Now, a dataset is a named view of data that simply points or references the data you want to use in your
activities as inputs and outputs.
Before you create a dataset, you must create a linked ser vice to link your data store to the data factory. Linked
services are much like connection strings, which define the connection information needed for Data Factory to
connect to external resources. Think of it this way: the dataset represents the structure of the data within the
linked data stores, and the linked service defines the connection to the data source. For example, an Azure
Storage linked service links a storage account to the data factory. An Azure Blob dataset represents the blob
container and the folder within that Azure Storage account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL Database, you create two linked services:
Azure Storage and Azure SQL Database. Then, create two datasets: Azure Blob dataset (which refers to the Azure
Storage linked service) and Azure SQL Table dataset (which refers to the Azure SQL Database linked service).
The Azure Storage and Azure SQL Database linked services contain connection strings that Data Factory uses at
runtime to connect to your Azure Storage and Azure SQL Database, respectively. The Azure Blob dataset
specifies the blob container and blob folder that contains the input blobs in your Blob storage. The Azure SQL
Table dataset specifies the SQL table in your SQL Database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data
Factory:
Linked service JSON
A linked service in Data Factory is defined in JSON format as follows:
{
"name": "<Name of the linked service>",
"properties": {
"type": "<Type of the linked service>",
"typeProperties": {
"<data store or compute-specific type properties>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
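As a hedged sketch, a linked service JSON definition like the one above, saved to a local file, can be deployed with the Az.DataFactory PowerShell module; the resource names and file path are placeholders.

# Sketch only: deploy a linked service from a JSON definition file; names and paths are placeholders.
Set-AzDataFactoryV2LinkedService -ResourceGroupName "MyResourceGroup" `
    -DataFactoryName "MyDataFactory" `
    -Name "AzureStorageLinkedService" `
    -DefinitionFile ".\AzureStorageLinkedService.json"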
Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these
tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Datasets in Azure Data Factory
3/22/2021 • 4 minutes to read • Edit Online
Overview
A data factory can have one or more pipelines. A pipeline is a logical grouping of activities that together
perform a task. The activities in a pipeline define actions to perform on your data. Now, a dataset is a named
view of data that simply points or references the data you want to use in your activities as inputs and outputs.
Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an
Azure Blob dataset specifies the blob container and folder in Blob storage from which the activity should read
the data.
Before you create a dataset, you must create a linked service to link your data store to the data factory. Linked
services are much like connection strings, which define the connection information needed for Data Factory to
connect to external resources. Think of it this way: the dataset represents the structure of the data within the
linked data stores, and the linked service defines the connection to the data source. For example, an Azure
Storage linked service links a storage account to the data factory. An Azure Blob dataset represents the blob
container and the folder within that Azure Storage account that contains the input blobs to be processed.
Here is a sample scenario. To copy data from Blob storage to a SQL Database, you create two linked services:
Azure Blob Storage and Azure SQL Database. Then, create two datasets: Delimited Text dataset (which refers to
the Azure Blob Storage linked service, assuming you have text files as source) and Azure SQL Table dataset
(which refers to the Azure SQL Database linked service). The Azure Blob Storage and Azure SQL Database linked
services contain connection strings that Data Factory uses at runtime to connect to your Azure Storage and
Azure SQL Database, respectively. The Delimited Text dataset specifies the blob container and blob folder that
contains the input blobs in your Blob storage, along with format-related settings. The Azure SQL Table dataset
specifies the SQL table in your SQL Database to which the data is to be copied.
The following diagram shows the relationships among pipeline, activity, dataset, and linked service in Data
Factory:
Dataset JSON
A dataset in Data Factory is defined in the following JSON format:
{
"name": "<name of dataset>",
"properties": {
"type": "<type of dataset: DelimitedText, AzureSqlTable etc...>",
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference",
},
"schema":[
],
"typeProperties": {
"<type specific property>": "<value>",
"<type specific property 2>": "<value 2>",
}
}
}
When you import the schema of a dataset, select the Import Schema button and choose to import from the
source or from a local file. In most cases, you'll import the schema directly from the source. But if you already
have a local schema file (a Parquet file or CSV with headers), you can direct Data Factory to base the schema on
that file.
In a copy activity, datasets are used in the source and the sink. The schema defined in a dataset is optional and serves as a reference. If you
want to apply column/field mapping between source and sink, refer to Schema and type mapping.
In Data Flow, datasets are used in source and sink transformations. The datasets define the basic data schemas. If
your data has no schema, you can use schema drift for your source and sink. Metadata from the datasets
appears in your source transformation as the source projection. The projection in the source transformation
represents the Data Flow data with defined names and types.
Dataset type
Azure Data Factory supports many different types of datasets, depending on the data stores you use. You can
find the list of data stores supported by Data Factory in the Connector overview article. Click a data store to learn
how to create a linked service and a dataset for it.
For example, for a Delimited Text dataset, the dataset type is set to DelimitedText as shown in the following
JSON sample:
{
"name": "DelimitedTextInput",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorage",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "DelimitedText",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "input.log",
"folderPath": "inputdata",
"container": "adfgetstarted"
},
"columnDelimiter": ",",
"escapeChar": "\\",
"quoteChar": "\""
},
"schema": []
}
}
Create datasets
You can create datasets by using one of these tools or SDKs: .NET API, PowerShell, REST API, Azure Resource
Manager template, and the Azure portal.
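As one possible illustration, a PowerShell sketch of deploying the DelimitedTextInput dataset shown earlier might look like the following, assuming the Az.DataFactory module is installed and the dataset JSON has been saved to a local file (the resource group and factory names are placeholders):
# Deploy a dataset definition from a local JSON file into an existing data factory
Set-AzDataFactoryV2Dataset -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "DelimitedTextInput" `
    -DefinitionFile ".\DelimitedTextInput.json"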
Next steps
See the following tutorial for step-by-step instructions for creating pipelines and datasets by using one of these
tools or SDKs.
Quickstart: create a data factory using .NET
Quickstart: create a data factory using PowerShell
Quickstart: create a data factory using REST API
Quickstart: create a data factory using Azure portal
Pipeline execution and triggers in Azure Data
Factory
In the JSON definition, the pipeline takes two parameters: sourceBlobContainer and sinkBlobContainer. You
pass values to these parameters at runtime.
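For reference, parameter declarations in a pipeline JSON definition generally take a shape like the following sketch (the names come from the example above; the string types are assumed):
"parameters": {
    "sourceBlobContainer": {
        "type": "String"
    },
    "sinkBlobContainer": {
        "type": "String"
    }
}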
You can manually run your pipeline by using one of the following methods:
.NET SDK
Azure PowerShell module
REST API
Python SDK
REST API
The following sample command shows you how to run your pipeline by using the REST API manually:
POST https://management.azure.com/subscriptions/mySubId/resourceGroups/myResourceGroup/providers/Microsoft.DataFactory/factories/myDataFactory/pipelines/copyPipeline/createRun?api-version=2017-03-01-preview
For a complete sample, see Quickstart: Create a data factory by using the REST API.
Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
The following sample command shows you how to manually run your pipeline by using Azure PowerShell:
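A minimal sketch of such a command, assuming the Az.DataFactory module is installed (the resource group, factory, and pipeline names are placeholders):
# Trigger a single run of the pipeline and capture the returned run ID
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -PipelineName "copyPipeline" `
    -Parameter @{ "sourceBlobContainer" = "MySourceFolder"; "sinkBlobContainer" = "MySinkFolder" }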
You pass parameters in the body of the request payload. In the .NET SDK, Azure PowerShell, and the Python SDK,
you pass values in a dictionary that's passed as an argument to the call:
{
"sourceBlobContainer": "MySourceFolder",
"sinkBlobContainer": "MySinkFolder"
}
The response payload is a unique ID of the pipeline run:
{
"runId": "0448d45a-a0bd-23f3-90a5-bfeea9264aed"
}
For a complete sample, see Quickstart: Create a data factory by using Azure PowerShell.
.NET SDK
The following sample call shows you how to run your pipeline by using the .NET SDK manually:
For a complete sample, see Quickstart: Create a data factory by using the .NET SDK.
NOTE
You can use the .NET SDK to invoke Data Factory pipelines from Azure Functions, from your web services, and so on.
Trigger execution
Triggers are another way that you can execute a pipeline run. Triggers represent a unit of processing that
determines when a pipeline execution needs to be kicked off. Currently, Data Factory supports three types of
triggers:
Schedule trigger: A trigger that invokes a pipeline on a wall-clock schedule.
Tumbling window trigger: A trigger that operates on a periodic interval, while also retaining state.
Event-based trigger: A trigger that responds to an event.
Pipelines and triggers have a many-to-many relationship (except for the tumbling window trigger). Multiple
triggers can kick off a single pipeline, or a single trigger can kick off multiple pipelines. In the following trigger
definition, the pipelines property refers to a list of pipelines that are triggered by the particular trigger. The
property definition includes values for the pipeline parameters.
Basic trigger definition
{
"properties": {
"name": "MyTrigger",
"type": "<type of trigger>",
"typeProperties": {...},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>": "<parameter 2 Value>"
}
}
]
}
}
Schedule trigger
A schedule trigger runs pipelines on a wall-clock schedule. This trigger supports periodic and advanced calendar
options. For example, the trigger supports intervals like "weekly" or "Monday at 5:00 PM and Thursday at 9:00
PM." The schedule trigger is flexible because the dataset pattern is agnostic, and the trigger doesn't discern
between time-series and non-time-series data.
For more information about schedule triggers and, for examples, see Create a trigger that runs a pipeline on a
schedule.
IMPORTANT
The parameters property is a mandatory property of the pipelines element. If your pipeline doesn't take any
parameters, you must include an empty JSON definition for the parameters property.
Schema overview
The following list provides a high-level overview of the major schema elements that are related to recurrence
and scheduling a trigger:
startTime: A date-time value. For basic schedules, the value of the startTime property applies to the first occurrence. For complex schedules, the trigger starts no sooner than the specified startTime value.
endTime: The end date and time for the trigger. The trigger doesn't execute after the specified end date and time. The value for the property can't be in the past.
timeZone: The time zone. For a list of supported time zones, see Create a trigger that runs a pipeline on a schedule.
recurrence: A recurrence object that specifies the recurrence rules for the trigger. The recurrence object supports the frequency, interval, endTime, count, and schedule elements. When a recurrence object is defined, the frequency element is required. The other elements of the recurrence object are optional.
interval: A positive integer that denotes the interval for the frequency value. The frequency value determines how often the trigger runs. For example, if the interval is 3 and the frequency is "week", the trigger recurs every three weeks.
{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-11-01T09:00:00-08:00",
"endTime": "2017-11-02T22:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "SQLServerToBlobPipeline"
},
"parameters": {}
},
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "SQLServerToAzureSQLPipeline"
},
"parameters": {}
}
]
}
}
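To deploy and start a trigger definition like the one above from PowerShell, a hedged sketch might look like the following, assuming the definition has been saved to a local JSON file and the Az.DataFactory module is installed (all names are placeholders):
# Create or update the trigger from its JSON definition, then start it
Set-AzDataFactoryV2Trigger -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "MyTrigger" `
    -DefinitionFile ".\MyTrigger.json"
Start-AzDataFactoryV2Trigger -ResourceGroupName "myResourceGroup" `
    -DataFactoryName "myDataFactory" `
    -Name "MyTrigger"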
startTime property
The following shows how the startTime property controls a trigger run, with and without a schedule element:
Start time is in the past: Without a schedule, the trigger calculates the first future execution time after the start time and runs at that time; subsequent executions are calculated from the last execution time. See the example that follows. With a schedule, the trigger starts no sooner than the specified start time; the first occurrence is based on the schedule, calculated from the start time, and subsequent executions run based on the recurrence schedule.
Start time is in the future or the current time: Without a schedule, the trigger runs once at the specified start time; subsequent executions are calculated from the last execution time. With a schedule, the trigger starts no sooner than the specified start time, and the first occurrence is based on the schedule, calculated from the start time.
Let's look at an example of what happens when the start time is in the past, with a recurrence, but no schedule.
Assume that the current time is 2017-04-08 13:00, the start time is 2017-04-07 14:00, and the recurrence is
every two days. (The recurrence value is defined by setting the frequency property to "day" and the interval
property to 2.) Notice that the startTime value is in the past and occurs before the current time.
Under these conditions, the first execution is 2017-04-09 at 14:00. The Scheduler engine calculates execution
occurrences from the start time. Any instances in the past are discarded. The engine uses the next instance that
occurs in the future. In this scenario, the start time is 2017-04-07 at 2:00 PM. The next instance is two days from
that time, which is on 2017-04-09 at 2:00 PM.
The first execution time is the same whether startTime is 2017-04-05 14:00 or 2017-04-01 14:00. After
the first execution, subsequent executions are calculated by using the schedule. Therefore, the subsequent
executions are on 2017-04-11 at 2:00 PM, then on 2017-04-13 at 2:00 PM, then on 2017-04-15 at 2:00 PM, and
so on.
Finally, when hours or minutes aren't set in the schedule for a trigger, the hours or minutes of the first execution
are used as defaults.
schedule property
You can use schedule to limit the number of trigger executions. For example, if a trigger with a monthly
frequency is scheduled to run only on day 31, the trigger runs only in those months that have a thirty-first day.
You can also use schedule to expand the number of trigger executions. For example, a trigger with a monthly
frequency that's scheduled to run on month days 1 and 2, runs on the first and second days of the month, rather
than once a month.
If multiple schedule elements are specified, the order of evaluation is from the largest to the smallest schedule
setting: week number, month day, weekday, hour, minute.
The following describes the monthDays schedule element in detail:
monthDays: Day of the month on which the trigger runs. The value can be specified with a monthly frequency only. Valid values are any value from -31 to -1, any value from 1 to 31, or an array of such values.
{"minutes":[15,45], "hours":[5,17]} Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM every day.
{"hours":[17], "weekDays":["monday", "wednesday", Run at 5:00 PM on Monday, Wednesday, and Friday every
"friday"]} week.
{"minutes":[15,45], "hours":[17], "weekDays": Run at 5:15 PM and 5:45 PM on Monday, Wednesday, and
["monday", "wednesday", "friday"]} Friday every week.
{"minutes":[0,15,30,45], "hours": [9, 10, 11, 12, Run every 15 minutes on weekdays between 9:00 AM and
13, 14, 15, 16] "weekDays":["monday", "tuesday", 4:45 PM.
"wednesday", "thursday", "friday"]}
{"weekDays":["tuesday", "thursday"]} Run on Tuesdays and Thursdays at the specified start time.
{"minutes":[0], "hours":[6], "monthDays":[28]} Run at 6:00 AM on the twenty-eighth day of every month
(assuming a frequency value of "month").
{"minutes":[0], "hours":[6], "monthDays":[-1]} Run at 6:00 AM on the last day of the month.
{"minutes":[0], "hours":[6], "monthDays":[1,-1]} Run at 6:00 AM on the first and last day of every month.
{monthDays":[1,14]} Run on the first and fourteenth day of every month at the
specified start time.
{"minutes":[0], "hours":[5], "monthlyOccurrences": Run on the first Friday of every month at 5:00 AM.
[{"day":"friday", "occurrence":1}]}
{"monthlyOccurrences":[{"day":"friday", Run on the first Friday of every month at the specified start
"occurrence":1}]} time.
EXA M P L E DESC RIP T IO N
{"monthlyOccurrences":[{"day":"friday", Run on the third Friday from the end of the month, every
"occurrence":-3}]} month, at the specified start time.
{"minutes":[15], "hours":[5], "monthlyOccurrences": Run on the first and last Friday of every month at 5:15 AM.
[{"day":"friday", "occurrence":1},{"day":"friday",
"occurrence":-1}]}
{"monthlyOccurrences":[{"day":"friday", Run on the first and last Friday of every month at the
"occurrence":1},{"day":"friday", "occurrence":-1}]} specified start time.
{"monthlyOccurrences":[{"day":"friday", Run on the fifth Friday of every month at the specified start
"occurrence":5}]} time.
{"minutes":[0,15,30,45], "monthlyOccurrences": Run every 15 minutes on the last Friday of the month.
[{"day":"friday", "occurrence":-1}]}
{"minutes":[15,45], "hours":[5,17], Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM on the
"monthlyOccurrences":[{"day":"wednesday", third Wednesday of every month.
"occurrence":3}]}
NOTE
The tumbling window trigger run waits for the triggered pipeline run to finish. Its run state reflects the state of the
triggered pipeline run. For example, if a triggered pipeline run is cancelled, the corresponding tumbling window trigger run
is marked cancelled. This is different from the "fire and forget" behavior of the schedule trigger, which is marked successful
as long as a pipeline run started.
The following table provides a comparison of the tumbling window trigger and schedule trigger:
Backfill scenarios Supported. Pipeline runs can be Not supported. Pipeline runs can be
scheduled for windows in the past. executed only on time periods from
the current time and the future.
Event-based trigger
An event-based trigger runs pipelines in response to an event. There are two flavors of event-based triggers:
Storage event trigger: runs a pipeline in response to events in a storage account, such as the arrival or deletion of a file in an Azure Blob Storage account.
Custom event trigger: processes and handles custom topics in Event Grid.
For more information about event-based triggers, see Storage Event Trigger and Custom Event Trigger.
Next steps
See the following tutorials:
Quickstart: Create a data factory by using the .NET SDK
Create a schedule trigger
Create a tumbling window trigger
Integration runtime in Azure Data Factory
NOTE
The Azure integration runtime has properties related to the Data Flow runtime, which define the underlying compute
infrastructure used to run data flows.
NOTE
Use a self-hosted integration runtime to support data stores that require a bring-your-own driver, such as SAP HANA,
MySQL, and so on. For more information, see supported data stores.
NOTE
Java Runtime Environment (JRE) is a dependency of the self-hosted IR. Make sure that JRE is installed on the same
host.
TIP
If you have strict data compliance requirements and need to ensure that data does not leave a certain geography, you
can explicitly create an Azure IR in a certain region and point the linked service to this IR by using the connectVia
property. For example, if you want to copy data from Blob storage in UK South to Azure Synapse Analytics in UK South
and want to ensure data does not leave the UK, create an Azure IR in UK South and link both linked services to this IR.
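As a sketch of that pattern, each linked service would reference the regional IR through its connectVia property; the IR name below is a placeholder for an Azure IR created in UK South:
"connectVia": {
    "referenceName": "AzureIR-UKSouth",
    "type": "IntegrationRuntimeReference"
}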
TIP
A good practice is to ensure that the data flow runs in the same region as your corresponding data stores, if
possible. You can achieve this either by using the auto-resolve Azure IR (if the data store location is the same as the
Data Factory location), or by creating a new Azure IR instance in the same region as your data stores and then
executing the data flow on it.
If you enable Managed Virtual Network for auto-resolve Azure IR, ADF uses the IR in the data factory region.
You can monitor which IR location takes effect during activity execution in pipeline activity monitoring view on
UI or activity monitoring payload.
Self-hosted IR location
The self-hosted IR is logically registered to the data factory, and the compute used to support its functionality
is provided by you. Therefore, there is no explicit location property for the self-hosted IR.
When used to perform data movement, the self-hosted IR extracts data from the source and writes it to the
destination.
Azure -SSIS IR location
Selecting the right location for your Azure-SSIS IR is essential to achieve high performance in your extract-
transform-load (ETL) workflows.
The location of your Azure-SSIS IR does not need to be the same as the location of your data factory, but it
should be the same as the location of your own Azure SQL Database or SQL Managed Instance that hosts
SSISDB. This way, your Azure-SSIS integration runtime can easily access SSISDB without incurring excessive
traffic between different locations.
If you do not have an existing SQL Database or SQL Managed Instance, but you have on-premises data
sources/destinations, create a new Azure SQL Database or SQL Managed Instance in the same
location as a virtual network connected to your on-premises network. This way, you can create your Azure-
SSIS IR by using the new Azure SQL Database or SQL Managed Instance and joining that virtual network, all in
the same location, effectively minimizing data movement across different locations.
If the location of your existing Azure SQL Database or SQL Managed Instance is not the same as the location
of a virtual network connected to your on-premises network, first create your Azure-SSIS IR by using the existing
Azure SQL Database or SQL Managed Instance and joining a virtual network in that same location, and
then configure a virtual-network-to-virtual-network connection between the two locations.
The following diagram shows the location settings of a data factory and its integration runtimes:
Next steps
See the following articles:
Create Azure integration runtime
Create self-hosted integration runtime
Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides instructions on
using SQL Managed Instance and joining the IR to a virtual network.
Mapping data flows in Azure Data Factory
Getting started
Data flows are created from the factory resources pane like pipelines and datasets. To create a data flow, select
the plus sign next to Factory Resources, and then select Data Flow.
This action takes you to the data flow canvas, where you can create your transformation logic. Select Add
source to start configuring your source transformation. For more information, see Source transformation.
Configuration panel
The configuration panel shows the settings specific to the currently selected transformation. If no transformation
is selected, it shows the data flow. In the overall data flow configuration, you can add parameters via the
Parameters tab. For more information, see Mapping data flow parameters.
Each transformation contains at least four configuration tabs.
Transformation settings
The first tab in each transformation's configuration pane contains the settings specific to that transformation. For
more information, see that transformation's documentation page.
Optimize
The Optimize tab contains settings to configure partitioning schemes. To learn more about how to optimize
your data flows, see the mapping data flow performance guide.
Inspect
The Inspect tab provides a view into the metadata of the data stream that you're transforming. You can see
column counts, the columns changed, the columns added, data types, the column order, and column references.
Inspect is a read-only view of your metadata. You don't need to have debug mode enabled to see metadata in
the Inspect pane.
As you change the shape of your data through transformations, you'll see the metadata changes flow in the
Inspect pane. If there isn't a defined schema in your source transformation, then metadata won't be visible in
the Inspect pane. Lack of metadata is common in schema drift scenarios.
Data preview
If debug mode is on, the Data Preview tab gives you an interactive snapshot of the data at each transform. For
more information, see Data preview in debug mode.
Top bar
The top bar contains actions that affect the whole data flow, like saving and validation. You can view the
underlying JSON code and data flow script of your transformation logic as well. For more information, learn
about the data flow script.
Available transformations
View the mapping data flow transformation overview to get a list of available transformations.
Debug mode
Debug mode allows you to interactively see the results of each transformation step while you build and debug
your data flows. The debug session can be used both when building your data flow logic and when running pipeline
debug runs with data flow activities. To learn more, see the debug mode documentation.
Available regions
Mapping data flows are available in the following Azure regions; a check mark (✓) indicates that data flows in ADF are supported in that region:
Australia Central
Australia Central 2
Australia East ✓
Australia Southeast ✓
Brazil South ✓
Canada Central ✓
Central India ✓
Central US ✓
China East
China East 2
China Non-Regional
China North ✓
China North 2 ✓
East Asia ✓
East US ✓
East US 2 ✓
France Central ✓
France South
Japan East ✓
Japan West
Korea Central ✓
Korea South
North Central US ✓
North Europe ✓
Norway East ✓
Norway West
South Central US
South India
Southeast Asia ✓
Switzerland North
Switzerland West
UAE Central
UAE North ✓
UK South ✓
UK West
US DoD Central
US DoD East
US Gov Arizona ✓
US Gov Non-Regional
US Gov Texas
US Gov Virginia ✓
West Central US
West Europe ✓
West India
West US ✓
West US 2 ✓
Next steps
Learn how to create a source transformation.
Learn how to build your data flows in debug mode.
Mapping data flow Debug Mode
Overview
Azure Data Factory mapping data flow's debug mode allows you to interactively watch the data shape transform
while you build and debug your data flows. The debug session can be used both in Data Flow design sessions as
well as during pipeline debug execution of data flows. To turn on debug mode, use the Data Flow Debug
button in the top bar of data flow canvas or pipeline canvas when you have data flow activities.
Once you turn on the slider, you will be prompted to select which integration runtime configuration you wish to
use. If AutoResolveIntegrationRuntime is chosen, a cluster with eight cores of general compute with a default
60-minute time to live will be spun up. If you'd like to allow for more idle time before your session times out,
you can choose a higher TTL setting. For more information on data flow integration runtimes, see Data flow
performance.
When Debug mode is on, you'll interactively build your data flow with an active Spark cluster. The session will
close once you turn debug off in Azure Data Factory. You should be aware of the hourly charges incurred by
Azure Databricks during the time that you have the debug session turned on.
In most cases, it's a good practice to build your Data Flows in debug mode so that you can validate your
business logic and view your data transformations before publishing your work in Azure Data Factory. Use the
"Debug" button on the pipeline panel to test your data flow in a pipeline.
NOTE
Every debug session that a user starts from their ADF browser UI is a new session with its own Spark cluster. You can use
the monitoring view for debug sessions above to view and manage debug sessions per factory. You are charged for every
hour that each debug session is executing including the TTL time.
Cluster status
The cluster status indicator at the top of the design surface turns green when the cluster is ready for debug. If
your cluster is already warm, then the green indicator will appear almost instantly. If your cluster wasn't already
running when you entered debug mode, then the Spark cluster will perform a cold boot. The indicator will spin
until the environment is ready for interactive debugging.
When you are finished with your debugging, turn the Debug switch off so that your Spark cluster can terminate
and you'll no longer be billed for debug activity.
Debug settings
Once you turn on debug mode, you can edit how a data flow previews data. Debug settings can be edited by
clicking "Debug Settings" on the Data Flow canvas toolbar. You can select the row limit or file source to use for
each of your Source transformations here. The row limits in this setting are only for the current debug session.
You can also select the staging linked service to be used for an Azure Synapse Analytics source.
If you have parameters in your Data Flow or any of its referenced datasets, you can specify what values to use
during debugging by selecting the Parameters tab.
Use the sampling settings here to point to sample files or sample tables of data so that you do not have to
change your source datasets. By using a sample file or table here, you can maintain the same logic and property
settings in your data flow while testing against a subset of data.
The default IR used for debug mode in ADF data flows is a small 4-core single worker node with a 4-core single
driver node. This works fine with smaller samples of data when testing your data flow logic. If you expand the
row limits in your debug settings during data preview or set a higher number of sampled rows in your source
during pipeline debug, then you may wish to consider setting a larger compute environment in a new Azure
Integration Runtime. Then you can restart your debug session using the larger compute environment.
Data preview
With debug on, the Data Preview tab will light up on the bottom panel. Without debug mode on, Data Flow will
show you only the current metadata in and out of each of your transformations in the Inspect tab. The data
preview will only query the number of rows that you have set as your limit in your debug settings. Click
Refresh to fetch the data preview.
NOTE
File sources only limit the rows that you see, not the rows being read. For very large datasets, it is recommended that you
take a small portion of that file and use it for your testing. You can select a temporary file in Debug Settings for each
source that is a file dataset type.
When running in Debug Mode in Data Flow, your data will not be written to the Sink transform. A Debug session
is intended to serve as a test harness for your transformations. Sinks are not required during debug and are
ignored in your data flow. If you wish to test writing the data in your Sink, execute the Data Flow from an Azure
Data Factory Pipeline and use the Debug execution from a pipeline.
Data Preview is a snapshot of your transformed data using row limits and data sampling from data frames in
Spark memory. Therefore, the sink drivers are not utilized or tested in this scenario.
Testing join conditions
When unit testing Joins, Exists, or Lookup transformations, make sure that you use a small set of known data for
your test. You can use the Debug Settings option above to set a temporary file to use for your testing. This is
needed because when limiting or sampling rows from a large dataset, you cannot predict which rows and which
keys will be read into the flow for testing. The result is non-deterministic, meaning that your join conditions may
fail.
Quick actions
Once you see the data preview, you can generate a quick transformation to typecast, remove, or do a
modification on a column. Click on the column header and then select one of the options from the data preview
toolbar.
Once you select a modification, the data preview will immediately refresh. Click Confirm in the top-right corner
to generate a new transformation.
Typecast and Modify will generate a Derived Column transformation and Remove will generate a Select
transformation.
NOTE
If you edit your Data Flow, you need to re-fetch the data preview before adding a quick transformation.
Data profiling
Selecting a column in your data preview tab and clicking Statistics in the data preview toolbar will pop up a
chart on the far-right of your data grid with detailed statistics about each field. Azure Data Factory will make a
determination based upon the data sampling of which type of chart to display. High-cardinality fields will default
to NULL/NOT NULL charts while categorical and numeric data that has low cardinality will display bar charts
showing data value frequency. You'll also see the maximum length of string fields, the minimum and maximum values
of numeric fields, standard deviation, percentiles, counts, and averages.
Next steps
Once you're finished building and debugging your data flow, execute it from a pipeline.
When testing your pipeline with a data flow, use the pipeline Debug run execution option.
Schema drift in mapping data flow
If schema drift is enabled, make sure the Auto-mapping slider in the Mapping tab is turned on. With this slider
on, all incoming columns are written to your destination. Otherwise you must use rule-based mapping to write
drifted columns.
Transforming drifted columns
When your data flow has drifted columns, you can access them in your transformations with the following
methods:
Use the byPosition and byName expressions to explicitly reference a column by name or position number.
Add a column pattern in a Derived Column or Aggregate transformation to match on any combination of
name, stream, position, origin, or type
Add rule-based mapping in a Select or Sink transformation to match drifted columns to columns aliases via a
pattern
For more information on how to implement column patterns, see Column patterns in mapping data flow.
Map drifted columns quick action
To explicitly reference drifted columns, you can quickly generate mappings for these columns via a data preview
quick action. Once debug mode is on, go to the Data Preview tab and click Refresh to fetch a data preview. If
data factory detects that drifted columns exist, you can click Map Drifted and generate a derived column that
allows you to reference all drifted columns in schema views downstream.
In the generated Derived Column transformation, each drifted column is mapped to its detected name and data
type. In the above data preview, the column 'movieId' is detected as an integer. After Map Drifted is clicked,
movieId is defined in the Derived Column as toInteger(byName('movieId')) and included in schema views in
downstream transformations.
Next steps
In the Data Flow Expression Language, you'll find additional facilities for column patterns and schema drift
including "byName" and "byPosition".
Using column patterns in mapping data flow
Use the expression builder to enter the match condition. Create a boolean expression that matches columns
based on the name , type , stream , origin , and position of the column. The pattern will affect any column,
drifted or defined, where the condition returns true.
The above column pattern matches every column of type double and creates one derived column per match. By
stating $$ as the column name field, each matched column is updated with the same name. The value of each
column is the existing value rounded to two decimal points.
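For reference, the three fields of that pattern could be sketched in the data flow expression language as follows (the field labels are descriptive, not exact UI names):
Matching condition: type == 'double'
Column name expression: $$
Value expression: round($$, 2)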
To verify your matching condition is correct, you can validate the output schema of defined columns in the
Inspect tab or get a snapshot of the data in the Data preview tab.
Each rule-based mapping requires two inputs: the condition on which to match by and what to name each
mapped column. Both values are inputted via the expression builder. In the left expression box, enter your
boolean match condition. In the right expression box, specify what the matched column will be mapped to.
Use $$ syntax to reference the input name of a matched column. Using the above image as an example, say a
user wants to match on all string columns whose names are shorter than six characters. If one incoming column
was named test , the expression $$ + '_short' will rename the column test_short . If that's the only mapping
that exists, all columns that don't meet the condition will be dropped from the outputted data.
Patterns match both drifted and defined columns. To see which defined columns are mapped by a rule, click the
eyeglasses icon next to the rule. Verify your output using data preview.
Regex mapping
If you click the downward chevron icon, you can specify a regex-mapping condition. A regex-mapping condition
matches all column names that match the specified regex condition. This can be used in combination with
standard rule-based mappings.
The above example matches on regex pattern (r) or any column name that contains a lower case r. Similar to
standard rule-based mapping, all matched columns are altered by the condition on the right using $$ syntax.
Rule -based hierarchies
If your defined projection has a hierarchy, you can use rule-based mapping to map the hierarchy's subcolumns.
Specify a matching condition and the complex column whose subcolumns you wish to map. Every matched
subcolumn will be outputted using the 'Name as' rule specified on the right.
The above example matches on all subcolumns of complex column a, which contains two subcolumns b and c.
The output schema will include two columns b and c, as the 'Name as' condition is $$.
Next steps
Learn more about the mapping data flow expression language for data transformations
Use column patterns in the sink transformation and select transformation with rule-based mapping
Monitor Data Flows
When you execute your pipeline, you can monitor the pipeline and all of the activities contained in the pipeline
including the Data Flow activity. Click on the monitor icon in the left-hand Azure Data Factory UI panel. You can
see a screen similar to the one below. The highlighted icons allow you to drill into the activities in the pipeline,
including the Data Flow activity.
You see statistics at this level as well including the run times and status. The Run ID at the activity level is
different than the Run ID at the pipeline level. The Run ID at the previous level is for the pipeline. Selecting the
eyeglasses gives you deep details on your data flow execution.
When you're in the graphical node monitoring view, you can see a simplified view-only version of your data flow
graph. To see the details view with larger graph nodes that include transformation stage labels, use the zoom
slider on the right side of your canvas. You can also use the search button on the right side to find parts of your
data flow logic in the graph.
View Data Flow Execution Plans
When your Data Flow is executed in Spark, Azure Data Factory determines optimal code paths based on the
entirety of your data flow. Additionally, the execution paths may occur on different scale-out nodes and data
partitions. Therefore, the monitoring graph represents the design of your flow, taking into account the execution
path of your transformations. When you select individual nodes, you can see "stages" that represent code that
was executed together on the cluster. The timings and counts that you see represent those groups or stages as
opposed to the individual steps in your design.
When you select the open space in the monitoring window, the stats in the bottom pane display timing
and row counts for each Sink and the transformations that led to the sink data for transformation lineage.
When you select individual transformations, you receive additional feedback on the right-hand panel that
shows partition stats, column counts, skewness (how evenly the data is distributed across partitions), and
kurtosis (how spiky the data is).
Sorting by processing time will help you to identify which stages in your data flow took the most time.
To find which transformations inside each stage took the most time, sort on highest processing time.
The rows written is also sortable as a way to identify which streams inside your data flow are writing the
most data.
When you select the Sink in the node view, you can see column lineage. There are three different
methods that columns are accumulated throughout your data flow to land in the Sink. They are:
Computed: You use the column for conditional processing or within an expression in your data flow,
but don't land it in the Sink
Derived: The column is a new column that you generated in your flow, that is, it was not present in the
Source
Mapped: The column originated from the source and you are mapping it to a sink field
Data flow status: The current status of your execution
Cluster startup time: Amount of time to acquire the JIT Spark compute environment for your data flow
execution
Number of transforms: How many transformation steps are being executed in your flow
{
"stage": 4,
"partitionTimes": [
14353,
14914,
14246,
14912,
...
]
}
Error rows
Enabling error row handling in your data flow sink will be reflected in the monitoring output. When you set the
sink to "report success on error", the monitoring output will show the number of success and failed rows when
you click on the sink monitoring node.
When you select "report failure on error", the same output will be shown only in the activity monitoring output
text. This is because the data flow activity will return failure for execution and the detailed monitoring view will
be unavailable.
Monitor Icons
This icon means that the transformation data was already cached on the cluster, so the timings and execution
path have taken that into account:
You also see green circle icons in the transformation. They represent a count of the number of sinks that data is
flowing into.
Mapping data flows performance and tuning guide
When monitoring data flow performance, there are four possible bottlenecks to look out for:
Cluster start-up time
Reading from a source
Transformation time
Writing to a sink
Cluster start-up time is the time it takes to spin up an Apache Spark cluster. This value is located in the top-right
corner of the monitoring screen. Data flows run on a just-in-time model where each job uses an isolated cluster.
This start-up time generally takes 3-5 minutes. For sequential jobs, this can be reduced by enabling a time to live
value. For more information, see optimizing the Azure Integration Runtime.
Data flows utilize a Spark optimizer that reorders and runs your business logic in 'stages' to perform as quickly
as possible. For each sink that your data flow writes to, the monitoring output lists the duration of each
transformation stage, along with the time it takes to write data into the sink. The time that is the largest is likely
the bottleneck of your data flow. If the transformation stage that takes the longest contains a source, then you
may want to look at further optimizing your read time. If a transformation is taking a long time, then you may
need to repartition or increase the size of your integration runtime. If the sink processing time is large, you may
need to scale up your database or verify you are not outputting to a single file.
Once you have identified the bottleneck of your data flow, use the optimization strategies below to improve
performance.
Optimize tab
The Optimize tab contains settings to configure the partitioning scheme of the Spark cluster. This tab exists in
every transformation of data flow and specifies whether you want to repartition the data after the
transformation has completed. Adjusting the partitioning provides control over the distribution of your data
across compute nodes and data locality optimizations that can have both positive and negative effects on your
overall data flow performance.
By default, Use current partitioning is selected, which instructs Azure Data Factory to keep the current output
partitioning of the transformation. As repartitioning data takes time, Use current partitioning is recommended
in most scenarios. Scenarios where you may want to repartition your data include after aggregates and joins
that significantly skew your data or when using Source partitioning on a SQL DB.
To change the partitioning on any transformation, select the Optimize tab and select the Set Partitioning
radio button. You are presented with a series of options for partitioning. The best method of partitioning differs
based on your data volumes, candidate keys, null values, and cardinality.
IMPORTANT
Single partition combines all the distributed data into a single partition. This is a very slow operation that also significantly
affects all downstream transformations and writes. Azure Data Factory highly recommends against using this option
unless there is an explicit business reason to do so.
Logging level
If you do not require every pipeline execution of your data flow activities to fully log all verbose telemetry logs,
you can optionally set your logging level to "Basic" or "None". When executing your data flows in "Verbose"
mode (default), you are requesting ADF to fully log activity at each individual partition level during your data
transformation. This can be an expensive operation, so only enabling verbose when troubleshooting can
improve your overall data flow and pipeline performance. "Basic" mode will only log transformation durations
while "None" will only provide a summary of durations.
Cluster size
Data flow clusters consist of worker nodes plus a driver node. The following shows the available core counts (worker cores, driver cores, and the resulting total cores):
8 worker cores and 8 driver cores: 16 total cores
16 worker cores and 16 driver cores: 32 total cores
32 worker cores and 16 driver cores: 48 total cores
64 worker cores and 16 driver cores: 80 total cores
128 worker cores and 16 driver cores: 144 total cores
256 worker cores and 16 driver cores: 272 total cores
Data flows are priced at vcore-hrs meaning that both cluster size and execution-time factor into this. As you
scale up, your cluster cost per minute will increase, but your overall time will decrease.
TIP
There is a ceiling on how much the size of a cluster affects the performance of a data flow. Depending on the size of your
data, there is a point where increasing the size of a cluster will stop improving performance. For example, if you have
more nodes than partitions of data, adding additional nodes won't help. A best practice is to start small and scale up to
meet your performance needs.
Time to live
By default, every data flow activity spins up a new Spark cluster based upon the Azure IR configuration. Cold
cluster start-up time takes a few minutes and data processing can't start until it is complete. If your pipelines
contain multiple sequential data flows, you can enable a time to live (TTL) value. Specifying a time to live value
keeps a cluster alive for a certain period of time after its execution completes. If a new job starts using the IR
during the TTL time, it will reuse the existing cluster and start-up time will be greatly reduced. After the second job
completes, the cluster will again stay alive for the TTL time.
You can additionally minimize the startup time of warm clusters by setting the "Quick re-use" option in the
Azure Integration runtime under Data Flow Properties. Setting this to true will tell ADF to not teardown the
existing cluster after each job and instead re-use the existing cluster, essentially keeping the compute
environment you've set in your Azure IR alive for up to the period of time specified in your TTL. This option
makes for the shortest start-up time of your data flow activities when executing from a pipeline.
However, if most of your data flows execute in parallel, it is not recommended that you enable TTL for the IR that
you use for those activities. Only one job can run on a single cluster at a time. If there is an available cluster, but
two data flows start, only one will use the live cluster. The second job will spin up its own isolated cluster.
NOTE
Time to live is not available when using the auto-resolve integration runtime
Optimizing sources
For every source except Azure SQL Database, it is recommended that you keep Use current partitioning as
the selected value. When reading from all other source systems, data flows automatically partition data evenly
based upon the size of the data. A new partition is created for about every 128 MB of data. As your data size
increases, the number of partitions increases.
Any custom partitioning happens after Spark reads in the data and will negatively impact your data flow
performance. As the data is evenly partitioned on read, this is not recommended.
NOTE
Read speeds can be limited by the throughput of your source system.
TIP
For source partitioning, the I/O of the SQL Server is the bottleneck. Adding too many partitions may saturate your
source database. Generally four or five partitions is ideal when using this option.
Isolation level
The isolation level of the read on an Azure SQL source system has an impact on performance. Choosing 'Read
uncommitted' will provide the fastest performance and prevent any database locks. To learn more about SQL
Isolation levels, please see Understanding isolation levels.
Read using query
You can read from Azure SQL Database using a table or a SQL query. If you are executing a SQL query, the
query must complete before transformation can start. SQL queries can be useful for pushing down operations that
may execute faster and reduce the amount of data read from SQL Server, such as SELECT, WHERE, and JOIN
statements. When pushing down operations, you lose the ability to track lineage and performance of the
transformations before the data comes into the data flow.
Azure Synapse Analytics sources
When using Azure Synapse Analytics, a setting called Enable staging exists in the source options. This allows
ADF to read from Synapse using Staging , which greatly improves read performance. Enabling Staging
requires you to specify an Azure Blob Storage or Azure Data Lake Storage gen2 staging location in the data flow
activity settings.
File -based sources
While data flows support a variety of file types, Azure Data Factory recommends using the Spark-native
Parquet format for optimal read and write times.
If you're running the same data flow on a set of files, we recommend reading from a folder, using wildcard paths
or reading from a list of files. A single data flow activity run can process all of your files in batch. More
information on how to set these settings can be found in the connector documentation such as Azure Blob
Storage.
If possible, avoid using the For-Each activity to run data flows over a set of files. This will cause each iteration of
the for-each to spin up its own Spark cluster, which is often not necessary and can be expensive.
Optimizing sinks
When data flows write to sinks, any custom partitioning will happen immediately before the write. Like the
source, in most cases it is recommended that you keep Use current partitioning as the selected partition
option. Partitioned data will write significantly quicker than unpartitioned data, even if your destination is not
partitioned. Below are the individual considerations for various sink types.
Azure SQL Database sinks
With Azure SQL Database, the default partitioning should work in most cases. There is a chance that your sink
may have too many partitions for your SQL database to handle. If you are running into this, reduce the number
of partitions outputted by your SQL Database sink.
Impact of error row handling to performance
When you enable error row handling ("continue on error") in the sink transformation, ADF will take an
additional step before writing the compatible rows to your destination table. This additional step incurs a
small performance penalty, in the range of 5%, with an additional small performance hit if you also set the
option to write the incompatible rows to a log file.
Disabling indexes using a SQL Script
Disabling indexes before a load in a SQL database can greatly improve performance of writing to the table. Run
the below command before writing to your SQL sink.
ALTER INDEX ALL ON dbo.[Table Name] DISABLE
After the write has completed, rebuild the indexes using the following command:
ALTER INDEX ALL ON dbo.[Table Name] REBUILD
These can both be done natively using Pre and Post-SQL scripts within an Azure SQL DB or Synapse sink in
mapping data flows.
WARNING
When disabling indexes, the data flow is effectively taking control of a database and queries are unlikely to succeed at this
time. As a result, many ETL jobs are triggered in the middle of the night to avoid this conflict. For more information, learn
about the constraints of disabling indexes
Optimizing transformations
Optimizing Joins, Exists, and Lookups
Broadcasting
In joins, lookups, and exists transformations, if one or both data streams are small enough to fit into worker
node memory, you can optimize performance by enabling Broadcasting . Broadcasting is when you send small
data frames to all nodes in the cluster. This allows for the Spark engine to perform a join without reshuffling the
data in the large stream. By default, the Spark engine will automatically decide whether or not to broadcast one
side of a join. If you are familiar with your incoming data and know that one stream will be significantly smaller
than the other, you can select Fixed broadcasting. Fixed broadcasting forces Spark to broadcast the selected
stream.
If the size of the broadcasted data is too large for the Spark node, you may get an out of memory error. To avoid
out of memory errors, use memory optimized clusters. If you experience broadcast timeouts during data flow
executions, you can switch off the broadcast optimization. However, this will result in slower performing data
flows.
When working with data sources that can take longer to query, like large database queries, it is recommended to
turn broadcast off for joins. Sources with long query times can cause Spark timeouts when the cluster attempts
to broadcast to compute nodes. Another good choice for turning off broadcast is when you have a stream in
your data flow that is aggregating values for use in a lookup transformation later. This pattern can confuse the
Spark optimizer and cause timeouts.
Cross joins
If you use literal values in your join conditions or have multiple matches on both sides of a join, Spark will run
the join as a cross join. A cross join is a full cartesian product that then filters out the joined values. This is
significantly slower than other join types. Ensure that you have column references on both sides of your join
conditions to avoid the performance impact.
Sorting before joins
Unlike merge join in tools like SSIS, the join transformation isn't a mandatory merge join operation. The join
keys don't require sorting prior to the transformation. The Azure Data Factory team doesn't recommend using
Sort transformations in mapping data flows.
Window transformation performance
The Window transformation partitions your data by value in columns that you select as part of the over()
clause in the transformation settings. There are a number of very popular aggregate and analytical functions
that are exposed in the Windows transformation. However, if your use case is to generate a window over your
entire dataset for the purpose of ranking rank() or row number rowNumber() , it is recommended that you
instead use the Rank transformation and the Surrogate Key transformation. Those transformations will perform
better against full-dataset operations using those functions.
Repartitioning skewed data
Certain transformations such as joins and aggregates reshuffle your data partitions and can occasionally lead to
skewed data. Skewed data means that data is not evenly distributed across the partitions. Heavily skewed data
can lead to slower downstream transformations and sink writes. You can check the skewness of your data at any
point in a data flow run by clicking on the transformation in the monitoring display.
The monitoring display will show how the data is distributed across each partition along with two metrics,
skewness and kurtosis. Skewness is a measure of how asymmetrical the data is and can have a positive, zero,
negative, or undefined value. Negative skew means the left tail is longer than the right. Kurtosis is the measure
of whether the data is heavy-tailed or light-tailed. High kurtosis values are not desirable. Ideal ranges of
skewness lie between -3 and 3 and ranges of kurtosis are less than 10. An easy way to interpret these numbers
is looking at the partition chart and seeing if 1 bar is significantly larger than the rest.
If your data is not evenly partitioned after a transformation, you can use the optimize tab to repartition.
Reshuffling data takes time and may not improve your data flow performance.
TIP
If you repartition your data, but have downstream transformations that reshuffle your data, use hash partitioning on a
column used as a join key.
Using data flows in pipelines
When building complex pipelines with multiple data flows, your logical flow can have a big impact on timing
and cost. This section covers the impact of different architecture strategies.
Executing data flows in parallel
If you execute multiple data flows in parallel, ADF spins up separate Spark clusters for each activity. This allows
for each job to be isolated and run in parallel, but will lead to multiple clusters running at the same time.
If your data flows execute in parallel, it's recommended that you not enable the Azure IR time to live property, as
doing so will lead to multiple unused warm pools.
TIP
Instead of running the same data flow multiple times in a for each activity, stage your data in a data lake and use wildcard
paths to process the data in a single data flow.
Next steps
See other Data Flow articles related to performance:
Data Flow activity
Monitor Data Flow performance
Managing the mapping data flow graph
As your data flows get more complex, use the following mechanisms to effectively navigate and manage the
data flow graph.
Moving transformations
In mapping data flows, a set of connected transformation logic is known as a stream . The Incoming stream
field dictates which data stream is feeding the current transformation. Each transformation has one or two
incoming streams depending on its function and represents an output stream. The output schema of the
incoming streams determines which column metadata can be referenced by the current transformation.
Unlike the pipeline canvas, data flow transformations aren't edited using a drag and drop model. To change the
incoming stream of a transformation, or to "move" it, choose a different value from the Incoming stream
dropdown. When you do this, all downstream transformations will move alongside the edited transformation.
The graph will automatically update to show the new logical flow. If you change the incoming stream to a
transformation that already has a downstream transformation, a new branch or parallel data stream will be
created. Learn more about new branches in mapping data flow.
In some transformations like filter, clicking on a blue expression text box will open the expression builder.
When you reference columns in a matching or group-by condition, an expression can extract values from
columns. To create an expression, select Computed column .
In cases where either an expression or a literal value is a valid input, select Add dynamic content to build an
expression that evaluates to a literal value.
Expression elements
In mapping data flows, expressions can be composed of column values, parameters, functions, local variables,
operators, and literals. These expressions must evaluate to a Spark data type such as string, boolean, or integer.
Functions
Mapping data flows has built-in functions and operators that can be used in expressions. For a list of available
functions, see the mapping data flow language reference.
Address array indexes
When dealing with columns or functions that return array types, use brackets ([]) to access a specific element. If
the index doesn't exist, the expression evaluates to NULL.
IMPORTANT
In mapping data flows, arrays are one-based, meaning the first element is referenced by index one. For example,
myArray[1] will access the first element of an array called 'myArray'.
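As a quick illustration of one-based indexing, this sketch uses the split function (documented later in this article)
to take the first and last elements of a three-element array:
split('one-two-three', '-')[1] -> 'one'
split('one-two-three', '-')[3] -> 'three'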
Input schema
If your data flow uses a defined schema in any of its sources, you can reference a column by name in many
expressions. If you are utilizing schema drift, you can reference columns explicitly using the byName() or
byNames() functions or match using column patterns.
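For example, the following sketch reads a drifted column by name; the column name 'title' is a hypothetical
placeholder, and the result must be type converted (here with toString) before it can be used:
toString(byName('title'))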
Parameters
Parameters are values that are passed into a data flow at run time from a pipeline. To reference a parameter,
either click on the parameter from the Expression elements view or reference it with a dollar sign in front of
its name. For example, a parameter called parameter1 would be referenced by $parameter1 . To learn more, see
parameterizing mapping data flows.
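As a minimal sketch, assuming a data flow parameter named minOrderTotal and a source column named
orderTotal (both hypothetical), a filter transformation could use the parameter like this:
orderTotal > $minOrderTotal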
Cached lookup
A cached lookup allows you to do an inline lookup of the output of a cached sink. There are two functions
available to use on each sink, lookup() and outputs() . The syntax to reference these functions is
cacheSinkName#functionName() . For more information, see cache sinks.
lookup() takes in the matching columns in the current transformation as parameters and returns a complex
column equal to the row matching the key columns in the cache sink. The complex column returned contains a
subcolumn for each column mapped in the cache sink. For example, if you have an error code cache sink
errorCodeCache with a key column matching on the code and a column called Message, calling
errorCodeCache#lookup(errorCode).Message returns the message corresponding to the code passed in.
outputs() takes no parameters and returns the entire cache sink as an array of complex columns. This can't be
called if key columns are specified in the sink and should only be used if there is a small number of rows in the
cache sink. A common use case is appending the max value of an incrementing key. If a cached single
aggregated row CacheMaxKey contains a column MaxKey , you can reference the first value by calling
CacheMaxKey#outputs()[1].MaxKey .
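As a hedged sketch of the max-key pattern described above, a derived column could offset a hypothetical
surrogate key column named surrogateKey by the cached maximum:
surrogateKey + CacheMaxKey#outputs()[1].MaxKey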
Locals
If you are sharing logic across multiple columns or want to compartmentalize your logic, you can create a local
within a derived column. To reference a local, either click on the local from the Expression elements view or
reference it with a colon in front of its name. For example, a local called local1 would be referenced by :local1 .
Learn more about locals in the derived column documentation.
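For example, assuming a derived column that defines a local named trimmedName as trim(name) (both names
hypothetical), other expressions in the same transformation can reuse it:
upper(:trimmedName)
:trimmedName + ' (verified)'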
String interpolation
When creating long strings that use expression elements, use string interpolation to easily build up complex
string logic. String interpolation avoids extensive use of string concatenation when parameters are included in
query strings. Use double quotation marks to enclose literal string text together with expressions. You can
include expression functions, columns, and parameters. To use expression syntax, enclose it in curly braces.
Some examples of string interpolation:
"My favorite movie is {iif(instr(title,', The')>0,"The {split(title,', The')[1]}",title)}"
NOTE
When using string interpolation syntax in SQL source queries, the query string must be on one single line, without '\n'.
Commenting expressions
Add comments to your expressions by using single-line and multiline comment syntax.
The following examples are valid comments:
/* This is my comment */
/* This is a
multi-line comment */
If you put a comment at the top of your expression, it appears in the transformation text box to document your
transformation expressions.
Regular expressions
Many expression language functions use regular expression syntax. When you use regular expression functions,
Expression Builder tries to interpret a backslash (\) as an escape character sequence. When you use backslashes
in your regular expression, either enclose the entire regex in backticks (`) or use a double backslash.
An example that uses backticks:
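The following sketch, assuming a hypothetical string column named rawText, replaces Windows line breaks
without double-escaping the regex:
regexReplace(rawText, `\r\n`, ' ')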
Keyboard shortcuts
Below is a list of shortcuts available in the expression builder. Most IntelliSense shortcuts are available when
creating expressions.
Ctrl+K Ctrl+C: Comment entire line.
Ctrl+K Ctrl+U: Uncomment.
F1: Provide editor help commands.
Alt+Down arrow key: Move down current line.
Alt+Up arrow key: Move up current line.
Ctrl+Spacebar: Show context help.
To convert milliseconds from epoch to a date or timestamp, use toTimestamp(<number of milliseconds>) . If time
is coming in seconds, multiply by 1,000.
toTimestamp(1574127407*1000l)
The trailing "l" at the end of the previous expression signifies conversion to a long type as inline syntax.
Find time from epoch or Unix Time
toLong( currentTimestamp() - toTimestamp('1970-01-01 00:00:00.000', 'yyyy-MM-dd HH:mm:ss.SSS') ) * 1000l
Data flow time evaluation
Data flows process timestamps with millisecond precision. For 2018-07-31T20:00:00.2170000, you will see
2018-07-31T20:00:00.217 in the output. In the ADF portal, the timestamp is shown in the current browser setting,
which can hide the 217 milliseconds, but when you run the data flow end to end, the milliseconds part (217) is
processed as well. You can use toString(myDateTimeColumn) as an expression to see the full-precision data in
preview. For all practical purposes, process datetime values as datetime rather than as string.
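As a hedged sketch, you can also pass an explicit SimpleDateFormat pattern to toString to confirm the
milliseconds are preserved (myDateTimeColumn is a hypothetical column):
toString(myDateTimeColumn, 'yyyy-MM-dd HH:mm:ss.SSS')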
Next steps
Begin building data transformation expressions
Data transformation expressions in mapping data
flow
Expression functions
In Data Factory, use the expression language of the mapping data flow feature to configure data
transformations.
abs
acos
add
Adds a pair of strings or numbers. Adds a date to a number of days. Adds a duration to a timestamp. Appends
one array of similar type to another. Same as the + operator.
add(10, 20) -> 30
10 + 20 -> 30
add('ice', 'cream') -> 'icecream'
'ice' + 'cream' + ' cone' -> 'icecream cone'
add(toDate('2012-12-12'), 3) -> toDate('2012-12-15')
toDate('2012-12-12') + 3 -> toDate('2012-12-15')
[10, 20] + [30, 40] -> [10, 20, 30, 40]
toTimestamp('2019-02-03 05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS') + (days(1) + hours(2) - seconds(10)) ->
toTimestamp('2019-02-04 07:19:18.871', 'yyyy-MM-dd HH:mm:ss.SSS')
addDays
addMonths
and
asin
atan
atan2
Returns the angle in radians between the positive x-axis of a plane and the point given by the coordinates.
atan2(0, 0) -> 0.0
between
Checks if the first value is in between two other values inclusively. Numeric, string, and datetime values can be
compared.
between(10, 5, 24) -> true
between(currentDate(), currentDate() + 10, currentDate() + 20) -> false
bitwiseAnd
bitwiseOr
bitwiseXor
blake2b
Calculates the Blake2 digest of a set of columns of varying primitive datatypes, given a bit length, which can only
be a multiple of 8 between 8 and 512. It can be used to calculate a fingerprint for a row.
blake2b(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4'))
'c9521a5080d8da30dffb430c50ce253c345cc4c4effc315dab2162dac974711d'
blake2bBinary
Calculates the Blake2 digest of a set of columns of varying primitive datatypes, given a bit length, which can only
be a multiple of 8 between 8 and 512. It can be used to calculate a fingerprint for a row.
blake2bBinary(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4'))
unHex('c9521a5080d8da30dffb430c50ce253c345cc4c4effc315dab2162dac974711d')
case
Based on alternating conditions, applies one value or the other. If the number of inputs is even, the other value
defaults to NULL for the last condition.
case(10 + 20 == 30, 'dumbo', 'gumbo') -> 'dumbo'
case(10 + 20 == 25, 'bojjus', 'do' < 'go', 'gunchus') -> 'gunchus'
isNull(case(10 + 20 == 25, 'bojjus', 'do' > 'go', 'gunchus')) -> true
case(10 + 20 == 25, 'bojjus', 'do' > 'go', 'gunchus', 'dumbo') -> 'dumbo'
cbrt
ceil
coalesce
Returns the first non-null value from a set of inputs. All inputs should be of the same type.
coalesce(10, 20) -> 10
coalesce(toString(null), toString(null), 'dumbo', 'bo', 'go') -> 'dumbo'
columnNames
Gets the names of all output columns for a stream. You can pass an optional stream name as the second
argument.
columnNames()
columnNames('DeriveStream')
columns
Gets the values of all output columns for a stream. You can pass an optional stream name as the second
argument.
columns()
columns('DeriveStream')
compare
Compares two values of the same type. Returns negative integer if value1 < value2, 0 if value1 == value2,
positive value if value1 > value2.
(compare(12, 24) < 1) -> true
(compare('dumbo', 'dum') > 0) -> true
concat
Concatenates a variable number of strings together. Same as the + operator with strings.
concat('dataflow', 'is', 'awesome') -> 'dataflowisawesome'
'dataflow' + 'is' + 'awesome' -> 'dataflowisawesome'
isNull('sql' + null) -> true
concatWS
Concatenates a variable number of strings together with a separator. The first parameter is the separator.
concatWS(' ', 'dataflow', 'is', 'awesome') -> 'dataflow is awesome'
isNull(concatWS(null, 'dataflow', 'is', 'awesome')) -> true
concatWS(' is ', 'dataflow', 'awesome') -> 'dataflow is awesome'
cos
crc32
Calculates the CRC32 hash of a set of columns of varying primitive datatypes, given a bit length, which can only
be of values 0(256), 224, 256, 384, 512. It can be used to calculate a fingerprint for a row.
crc32(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> 3630253689L
currentDate
Gets the current date when this job starts to run. You can pass an optional timezone in the form of 'GMT', 'PST',
'UTC', 'America/Cayman'. The local timezone is used as the default. Refer Java's SimpleDateFormat class for
available formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
currentDate() == toDate('2250-12-31') -> false
currentDate('PST') == toDate('2250-12-31') -> false
currentDate('America/New_York') == toDate('2250-12-31') -> false
currentTimestamp
Gets the current timestamp when the job starts to run with local time zone.
currentTimestamp() == toTimestamp('2250-12-31 12:12:12') -> false
currentUTC
Gets the current timestamp as UTC. If you want your current time to be interpreted in a different timezone than
your cluster time zone, you can pass an optional timezone in the form of 'GMT', 'PST', 'UTC', 'America/Cayman'.
It is defaulted to the current timezone. Refer Java's SimpleDateFormat class for available formats.
https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html. To convert the UTC time to a
different timezone use fromUTC() .
currentUTC() == toTimestamp('2050-12-12 19:18:12') -> false
currentUTC() != toTimestamp('2050-12-12 19:18:12') -> true
fromUTC(currentUTC(), 'Asia/Seoul') != toTimestamp('2050-12-12 19:18:12') -> true
dayOfMonth
dayOfWeek
Gets the day of the week given a date. 1 - Sunday, 2 - Monday ..., 7 - Saturday.
dayOfWeek(toDate('2018-06-08')) -> 6
dayOfYear
days
degrees
dropLeft
Removes the specified number of characters from the left of the string. If the drop requested exceeds the length
of the string, an empty string is returned.
dropLeft('bojjus', 2) => 'jjus'
dropLeft('cake', 10) => ''
dropRight
Removes the specified number of characters from the right of the string. If the drop requested exceeds the
length of the string, an empty string is returned.
dropRight('bojjus', 2) => 'bojj'
dropRight('cake', 10) => ''
endsWith
equals
equalsIgnoreCase
escape
Escapes a string according to a format. Literal values for acceptable format are 'json', 'xml', 'ecmascript', 'html',
'java'.
expr
Results in an expression from a string. This is the same as writing the expression in a non-literal form. This can be
used to pass parameters as string representations.
expr('price * discount') => any
factorial
false
Always returns a false value. Use the function syntax(false()) if there is a column named 'false'.
(10 + 20 > 30) -> false
(10 + 20 > 30) -> false()
floor
fromBase64
fromUTC
Converts to the timestamp from UTC. You can optionally pass the timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It is defaulted to the current timezone. Refer Java's SimpleDateFormat class for available
formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
fromUTC(currentTimestamp()) == toTimestamp('2050-12-12 19:18:12') -> false
fromUTC(currentTimestamp(), 'Asia/Seoul') != toTimestamp('2050-12-12 19:18:12') -> true
greater
greaterOrEqual
greatest
Returns the greatest value among the list of values as input skipping null values. Returns null if all inputs are
null.
greatest(10, 30, 15, 20) -> 30
greatest(10, toInteger(null), 20) -> 20
greatest(toDate('2010-12-12'), toDate('2011-12-12'), toDate('2000-12-12')) -> toDate('2011-12-12')
greatest(toTimestamp('2019-02-03 05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS'), toTimestamp('2019-02-05
08:21:34.890', 'yyyy-MM-dd HH:mm:ss.SSS')) -> toTimestamp('2019-02-05 08:21:34.890', 'yyyy-MM-dd
HH:mm:ss.SSS')
hasColumn
Checks for a column value by name in the stream. You can pass an optional stream name as the second
argument. Column names known at design time should be addressed just by their name. Computed inputs are
not supported, but you can use parameter substitutions.
hasColumn('parent')
hour
Gets the hour value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default. Refer Java's SimpleDateFormat class for available
formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
hour(toTimestamp('2009-07-30 12:58:59')) -> 12
hour(toTimestamp('2009-07-30 12:58:59'), 'PST') -> 12
hours
iif
Based on a condition, applies one value or the other. If the other value is unspecified, it is considered NULL. Both
values must be compatible (numeric, string, and so on).
iif(10 + 20 == 30, 'dumbo', 'gumbo') -> 'dumbo'
iif(10 > 30, 'dumbo', 'gumbo') -> 'gumbo'
iif(month(toDate('2018-12-01')) == 12, 345.12, 102.67) -> 345.12
iifNull
Checks if the first parameter is null. If not null, the first parameter is returned. If null, the second parameter is
returned. If three parameters are specified, the behavior is the same as iif(isNull(value1), value2, value3) and the
third parameter is returned if the first value is not null.
iifNull(10, 20) -> 10
iifNull(null, 20, 40) -> 20
iifNull('azure', 'data', 'factory') -> 'factory'
iifNull(null, 'data', 'factory') -> 'data'
initCap
Converts the first letter of every word to uppercase. Words are identified as separated by whitespace.
initCap('cool iceCREAM') -> 'Cool Icecream'
instr
Finds the position (1-based) of the substring within a string. 0 is returned if not found.
instr('dumbo', 'mbo') -> 3
instr('microsoft', 'o') -> 5
instr('good', 'bad') -> 0
isDelete
Checks if the row is marked for delete. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isDelete()
isDelete(1)
isError
Checks if the row is marked as error. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isError()
isError(1)
isIgnore
Checks if the row is marked to be ignored. For transformations taking more than one input stream you can pass
the (1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isIgnore()
isIgnore(1)
isInsert
Checks if the row is marked for insert. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isInsert()
isInsert(1)
isMatch
Checks if the row is matched at lookup. For transformations taking more than one input stream you can pass
the (1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isMatch()
isMatch(1)
isNull
isUpdate
Checks if the row is marked for update. For transformations taking more than one input stream you can pass
the (1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isUpdate()
isUpdate(1)
isUpsert
Checks if the row is marked for insert. For transformations taking more than one input stream you can pass the
(1-based) index of the stream. The stream index should be either 1 or 2 and the default value is 1.
isUpsert()
isUpsert(1)
jaroWinkler
lastDayOfMonth
least
left
Extracts a substring starting at index 1 with the specified number of characters. Same as SUBSTRING(str, 1, n).
left('bojjus', 2) -> 'bo'
left('bojjus', 20) -> 'bojjus'
length
lesser
lesserOrEqual
levenshtein
like
The pattern is a string that is matched literally. The exceptions are the following special symbols: _ matches any
one character in the input (similar to . in POSIX regular expressions), and % matches zero or more characters in
the input (similar to .* in POSIX regular expressions). The escape character is '\'. If an escape character precedes a
special symbol or another escape character, the following character is matched literally. It is invalid to escape any
other character.
like('icecream', 'ice%') -> true
locate
locate(<substring to find> : string, <string> : string, [<from index - 1-based> : integral]) => integer
Finds the position (1-based) of the substring within a string, starting at a certain position. If the position is
omitted, the search starts from the beginning of the string. 0 is returned if not found.
locate('mbo', 'dumbo') -> 3
locate('o', 'microsoft', 6) -> 7
locate('bad', 'good') -> 0
log
Calculates the log value. An optional base can be supplied; otherwise Euler's number is used.
log(100, 10) -> 2
log10
lower
Lowercases a string.
lower('GunChus') -> 'gunchus'
lpad
lpad(<string to pad> : string, <final padded length> : integral, <padding> : string) => string
Left pads the string by the supplied padding until it is of a certain length. If the string is equal to or greater than
the length, then it is trimmed to the length.
lpad('dumbo', 10, '-') -> '-----dumbo'
lpad('dumbo', 4, '-') -> 'dumb'
lpad('dumbo', 8, '<>') -> '<><dumbo'
ltrim
Left trims a string of leading characters. If the second parameter is unspecified, it trims whitespace; otherwise it
trims any character specified in the second parameter.
ltrim(' dumbo ') -> 'dumbo '
ltrim('!--!du!mbo!', '-!') -> 'du!mbo!'
md5
Calculates the MD5 digest of a set of columns of varying primitive datatypes and returns a 32-character hex
string. It can be used to calculate a fingerprint for a row.
md5(5, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> '4ce8a880bd621a1ffad0bca905e1bc5a'
millisecond
millisecond(<value1> : timestamp, [<value2> : string]) => integer
Gets the millisecond value of a date. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default. Refer Java's SimpleDateFormat class for available
formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
millisecond(toTimestamp('2009-07-30 12:58:59.871', 'yyyy-MM-dd HH:mm:ss.SSS')) -> 871
milliseconds
minus
Subtracts numbers. Subtract number of days from a date. Subtract duration from a timestamp. Subtract two
timestamps to get difference in milliseconds. Same as the - operator.
minus(20, 10) -> 10
20 - 10 -> 10
minus(toDate('2012-12-15'), 3) -> toDate('2012-12-12')
toDate('2012-12-15') - 3 -> toDate('2012-12-12')
toTimestamp('2019-02-03 05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS') + (days(1) + hours(2) - seconds(10)) ->
toTimestamp('2019-02-04 07:19:18.871', 'yyyy-MM-dd HH:mm:ss.SSS')
toTimestamp('2019-02-03 05:21:34.851', 'yyyy-MM-dd HH:mm:ss.SSS') - toTimestamp('2019-02-03
05:21:36.923', 'yyyy-MM-dd HH:mm:ss.SSS') -> -2072
minute
Gets the minute value of a timestamp. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default. Refer Java's SimpleDateFormat class for available
formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
minute(toTimestamp('2009-07-30 12:58:59')) -> 58
minute(toTimestamp('2009-07-30 12:58:59'), 'PST') -> 58
minutes
mod
month
monthsBetween
Gets the number of months between two dates. You can round off the calculation. You can pass an optional
timezone in the form of 'GMT', 'PST', 'UTC', 'America/Cayman'. The local timezone is used as the default. Refer to
Java's SimpleDateFormat class for available formats:
https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
monthsBetween(toTimestamp('1997-02-28 10:30:00'), toDate('1996-10-30')) -> 3.94959677
multiply
nextSequence
Returns the next unique sequence. The number is consecutive only within a partition and is prefixed by the
partitionId.
nextSequence() == 12313112 -> false
normalize
not
notEquals
notNull
null
Returns a NULL value. Use the function syntax (null()) if there is a column named 'null'. Any operation that
uses a NULL will result in a NULL.
isNull('dumbo' + null) -> true
isNull(10 * null) -> true
isNull('') -> false
isNull(10 + 20) -> false
isNull(10/0) -> true
or
pMod
partitionId
radians
random
Returns a random number given an optional seed within a partition. The seed should be a fixed value and is
used in conjunction with the partitionId to produce random values
random(1) == 1 -> false
regexExtract
regexExtract(<string> : string, <regex to find> : string, [<match group 1-based index> : integral]) => string
Extracts a matching substring for a given regex pattern. The last parameter identifies the match group and
defaults to 1 if omitted. Use backticks (`) around the regex to match a string without escaping.
regexExtract('Cost is between 600 and 800 dollars', '(\\d+) and (\\d+)', 2) -> '800'
regexExtract('Cost is between 600 and 800 dollars', `(\d+) and (\d+)`, 2) -> '800'
regexMatch
Checks if the string matches the given regex pattern. Use backticks (`) around the regex to match a string
without escaping.
regexMatch('200.50', '(\\d+).(\\d+)') -> true
regexMatch('200.50', `(\d+).(\d+)`) -> true
regexReplace
regexReplace(<string> : string, <regex to find> : string, <substring to replace> : string) => string
Replaces all occurrences of a regex pattern with another substring in the given string. Use backticks (`) around
the regex to match a string without escaping.
regexReplace('100 and 200', '(\\d+)', 'bojjus') -> 'bojjus and bojjus'
regexReplace('100 and 200', `(\d+)`, 'gunchus') -> 'gunchus and gunchus'
regexSplit
Splits a string based on a delimiter based on regex and returns an array of strings.
regexSplit('bojjusAgunchusBdumbo', `[CAB]`) -> ['bojjus', 'gunchus', 'dumbo']
regexSplit('bojjusAgunchusBdumboC', `[CAB]`) -> ['bojjus', 'gunchus', 'dumbo', '']
(regexSplit('bojjusAgunchusBdumboC', `[CAB]`)[1]) -> 'bojjus'
isNull(regexSplit('bojjusAgunchusBdumboC', `[CAB]`)[20]) -> true
replace
replace(<string> : string, <substring to find> : string, [<substring to replace> : string]) => string
Replaces all occurrences of a substring with another substring in the given string. If the last parameter is
omitted, it defaults to an empty string.
replace('doggie dog', 'dog', 'cat') -> 'catgie cat'
replace('doggie dog', 'dog', '') -> 'gie '
replace('doggie dog', 'dog') -> 'gie '
reverse
Reverses a string.
reverse('gunchus') -> 'suhcnug'
right
rlike
round
round(<number> : number, [<scale to round> : number], [<rounding option> : integral]) => double
Rounds a number given an optional scale and an optional rounding mode. If the scale is omitted, it is defaulted
to 0. If the mode is omitted, it is defaulted to ROUND_HALF_UP(5). The values for rounding include 1 -
ROUND_UP 2 - ROUND_DOWN 3 - ROUND_CEILING 4 - ROUND_FLOOR 5 - ROUND_HALF_UP 6 -
ROUND_HALF_DOWN 7 - ROUND_HALF_EVEN 8 - ROUND_UNNECESSARY.
round(100.123) -> 100.0
round(2.5, 0) -> 3.0
round(5.3999999999999995, 2, 7) -> 5.40
rpad
rpad(<string to pad> : string, <final padded length> : integral, <padding> : string) => string
Right pads the string by the supplied padding until it is of a certain length. If the string is equal to or greater
than the length, then it is trimmed to the length.
rpad('dumbo', 10, '-') -> 'dumbo-----'
rpad('dumbo', 4, '-') -> 'dumb'
rpad('dumbo', 8, '<>') -> 'dumbo<><'
rtrim
Right trims a string of trailing characters. If the second parameter is unspecified, it trims whitespace; otherwise it
trims any character specified in the second parameter.
rtrim(' dumbo ') -> ' dumbo'
rtrim('!--!du!mbo!', '-!') -> '!--!du!mbo'
second
Gets the second value of a date. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. The local timezone is used as the default. Refer Java's SimpleDateFormat class for available
formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
second(toTimestamp('2009-07-30 12:58:59')) -> 59
seconds
sha1
Calculates the SHA-1 digest of a set of columns of varying primitive datatypes and returns a 40-character hex
string. It can be used to calculate a fingerprint for a row.
sha1(5, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) -> '46d3b478e8ec4e1f3b453ac3d8e59d5854e282bb'
sha2
Calculates the SHA-2 digest of a set of columns of varying primitive datatypes, given a bit length, which can only
be of values 0(256), 224, 256, 384, 512. It can be used to calculate a fingerprint for a row.
sha2(256, 'gunchus', 8.2, 'bojjus', true, toDate('2010-4-4')) ->
'afe8a553b1761c67d76f8c31ceef7f71b66a1ee6f4e6d3b5478bf68b47d06bd3'
sin
sinh
soundex
split
sqrt
startsWith
subDays
Subtracts days from a date or timestamp. Same as the - operator for dates.
subDays(toDate('2016-08-08'), 1) -> toDate('2016-08-07')
subMonths
substring
substring(<string to subset> : string, <from 1-based index> : integral, [<number of characters> : integral])
=> string
Extracts a substring of a certain length from a position. Position is 1 based. If the length is omitted, it is defaulted
to end of the string.
substring('Cat in the hat', 5, 2) -> 'in'
substring('Cat in the hat', 5, 100) -> 'in the hat'
substring('Cat in the hat', 5) -> 'in the hat'
substring('Cat in the hat', 100, 100) -> ''
tan
tanh
translate
translate(<string to translate> : string, <lookup characters> : string, <replace characters> : string) =>
string
Replace one set of characters by another set of characters in the string. Characters have 1 to 1 replacement.
translate('(bojjus)', '()', '[]') -> '[bojjus]'
translate('(gunchus)', '()', '[') -> '[gunchus'
trim
Trims a string of leading and trailing characters. If the second parameter is unspecified, it trims whitespace;
otherwise it trims any character specified in the second parameter.
trim(' dumbo ') -> 'dumbo'
trim('!--!du!mbo!', '-!') -> 'du!mbo'
true
Always returns a true value. Use the function syntax(true()) if there is a column named 'true'.
(10 + 20 == 30) -> true
(10 + 20 == 30) -> true()
typeMatch
Matches the type of the column. Can only be used in pattern expressions. 'number' matches short, integer, long,
double, float, or decimal; 'integral' matches short, integer, or long; 'fractional' matches double, float, or decimal;
and 'datetime' matches date or timestamp types.
typeMatch(type, 'number')
typeMatch('date', 'datetime')
unescape
Unescapes a string according to a format. Literal values for acceptable format are 'json', 'xml', 'ecmascript',
'html', 'java'.
unescape('{\\\\\"value\\\\\": 10}', 'json')
'{\\\"value\\\": 10}'
upper
Uppercases a string.
upper('bojjus') -> 'BOJJUS'
uuid
weekOfYear
weeks
xor
year
Aggregate functions
The following functions are only available in aggregate, pivot, unpivot, and window transformations.
approxDistinctCount
Gets the approximate aggregate count of distinct values for a column. The optional second parameter is to
control the estimation error.
approxDistinctCount(ProductID, .05) => long
avg
avgIf
collect
Collects all values of the expression in the aggregated group into an array. Structures can be collected and
transformed to alternate structures during this process. The number of items will be equal to the number of
rows in that group and can contain null values. The number of collected items should be small.
collect(salesPerson)
collect(firstName + lastName)
collect(@(name = salesPerson, sales = salesAmount) )
count
Gets the aggregate count of values. If the optional column(s) is specified, it ignores NULL values in the count.
count(custId)
count(custId, custName)
count()
count(iif(isNull(custId), 1, NULL))
countDistinct
countIf
Based on a criteria gets the aggregate count of values. If the optional column is specified, it ignores NULL values
in the count.
countIf(state == 'CA' && commission < 10000, name)
covariancePopulation
covarianceSample
covarianceSampleIf
first
Gets the first value of a column group. If the second parameter ignoreNulls is omitted, it is assumed false.
first(sales)
first(sales, false)
isDistinct
Finds if a column or set of columns is distinct. It does not count null as a distinct value
isDistinct(custId, custName) => boolean
kurtosis
kurtosisIf
last
Gets the last value of a column group. If the second parameter ignoreNulls is omitted, it is assumed false.
last(sales)
last(sales, false)
max
maxIf
mean
meanIf
meanIf(<value1> : boolean, <value2> : number) => number
min
minIf
skewness
skewnessIf
stddev
stddevIf
stddevPopulation
stddevPopulationIf
stddevSample
stddevSampleIf
sum
sumDistinct
sumDistinctIf
Based on criteria gets the aggregate sum of a numeric column. The condition can be based on any column.
sumDistinctIf(state == 'CA' && commission < 10000, sales)
sumDistinctIf(true, sales)
sumIf
Based on criteria gets the aggregate sum of a numeric column. The condition can be based on any column.
sumIf(state == 'CA' && commission < 10000, sales)
sumIf(true, sales)
variance
varianceIf
variancePopulation
variancePopulationIf
varianceSample
varianceSampleIf
Array functions
Array functions perform transformations on data structures that are arrays. These include special keywords to
address array elements and indexes:
#acc represents a value that you wish to include in your single output when reducing an array
#index represents the current array index, along with array index numbers #index2, #index3 ...
#item represents the current element value in the array
array
Creates an array of items. All items should be of the same type. If no items are specified, an empty string array is
the default. Same as a [] creation operator.
array('Seattle', 'Washington')
['Seattle', 'Washington']
['Seattle', 'Washington'][1]
'Washington'
at
Finds the element at an array index. The index is 1-based. Out of bounds index results in a null value. Finds a
value in a map given a key. If the key is not found it returns null.
at(['apples', 'pears'], 1) => 'apples'
at(['fruit' -> 'apples', 'vegetable' -> 'carrot'], 'fruit') => 'apples'
contains
Returns true if any element in the provided array evaluates as true in the provided predicate. Contains expects a
reference to one element in the predicate function as #item.
contains([1, 2, 3, 4], #item == 3) -> true
contains([1, 2, 3, 4], #item > 5) -> false
distinct
except
filter
Filters elements out of the array that do not meet the provided predicate. Filter expects a reference to one
element in the predicate function as #item.
filter([1, 2, 3, 4], #item > 2) -> [3, 4]
filter(['a', 'b', 'c', 'd'], #item == 'a' || #item == 'b') -> ['a', 'b']
find
Finds the first item from an array that matches the condition. It takes a filter function where you can address the
item in the array as #item. For deeply nested maps you can refer to the parent maps using the #item_n (#item_1,
#item_2...) notation.
find([10, 20, 30], #item > 10) -> 20
find(['azure', 'data', 'factory'], length(#item) > 4) -> 'azure'
find([ @( name = 'Daniel', types = [ @(mood = 'jovial', behavior = 'terrific'), @(mood = 'grumpy',
behavior = 'bad') ] ), @( name = 'Mark', types = [ @(mood = 'happy', behavior = 'awesome'), @(mood =
'calm', behavior = 'reclusive') ] ) ], contains(#item.types, #item.mood=='happy') /*Filter out the happy
kid*/ )
@( name = 'Mark', types = [ @(mood = 'happy', behavior = 'awesome'), @(mood = 'calm', behavior =
'reclusive') ] )
flatten
Flattens an array or arrays into a single array. Arrays of atomic items are returned unaltered. The last argument is
optional and defaults to false; set it to true to flatten recursively more than one level deep.
flatten([['bojjus', 'girl'], ['gunchus', 'boy']]) => ['bojjus', 'girl', 'gunchus', 'boy']
flatten([[['bojjus', 'gunchus']]] , true) => ['bojjus', 'gunchus']
in
intersect
map
Maps each element of the array to a new element using the provided expression. Map expects a reference to one
element in the expression function as #item.
map([1, 2, 3, 4], #item + 2) -> [3, 4, 5, 6]
map(['a', 'b', 'c', 'd'], #item + '_processed') -> ['a_processed', 'b_processed', 'c_processed',
'd_processed']
mapIf
Conditionally maps an array to another array of same or smaller length. The values can be of any datatype
including structTypes. It takes a mapping function where you can address the item in the array as #item and
current index as #index. For deeply nested maps you can refer to the parent maps using the
#item_[n](#item_1, #index_1...) notation.
mapIf([10, 20, 30], #item > 10, #item + 5) -> [25, 35]
mapIf(['icecream', 'cake', 'soda'], length(#item) > 4, upper(#item)) -> ['ICECREAM', 'CAKE']
mapIndex
Maps each element of the array to a new element using the provided expression. Map expects a reference to one
element in the expression function as #item and a reference to the element index as #index.
mapIndex([1, 2, 3, 4], #item + 2 + #index) -> [4, 6, 8, 10]
mapLoop
Loops through from 1 to length to create an array of that length. It takes a mapping function where you can
address the index in the array as #index. For deeply nested maps you can refer to the parent maps using the
#index_n(#index_1, #index_2...) notation.
mapLoop(3, #index * 10) -> [10, 20, 30]
reduce
reduce(<value1> : array, <value2> : any, <value3> : binaryfunction, <value4> : unaryfunction) => any
Accumulates elements in an array. Reduce expects a reference to an accumulator and one element in the first
expression function as #acc and #item and it expects the resulting value as #result to be used in the second
expression function.
toString(reduce(['1', '2', '3', '4'], '0', #acc + #item, #result)) -> '01234'
size
slice
slice(<array to slice> : array, <from 1-based index> : integral, [<number of items> : integral]) => array
Extracts a subset of an array from a position. Position is 1-based. If the length is omitted, it is defaulted to the end
of the array.
slice([10, 20, 30, 40], 1, 2) -> [10, 20]
slice([10, 20, 30, 40], 2) -> [20, 30, 40]
slice([10, 20, 30, 40], 2)[1] -> 20
isNull(slice([10, 20, 30, 40], 2)[0]) -> true
isNull(slice([10, 20, 30, 40], 2)[20]) -> true
slice(['a', 'b', 'c', 'd'], 8) -> []
sort
Sorts the array using the provided predicate function. Sort expects a reference to two consecutive elements in
the expression function as #item1 and #item2.
sort([4, 8, 2, 3], compare(#item1, #item2)) -> [2, 3, 4, 8]
sort(['a3', 'b2', 'c1'], iif(right(#item1, 1) >= right(#item2, 1), 1, -1)) -> ['c1', 'b2', 'a3']
unfold
Unfolds an array into a set of rows and repeats the values for the remaining columns in every row.
unfold(addresses) => any
unfold( @(name = salesPerson, sales = salesAmount) ) => any
union
lookup
Looks up the first row from the cached sink using the specified keys that match the keys from the cached sink.
cacheSink#lookup(movieId)
mlookup
Looks up all matching rows from the cached sink using the specified keys that match the keys from the cached
sink.
cacheSink#mlookup(movieId)
output
outputs
Returns the entire output row set of the results of the cache sink
cacheSink#outputs()
Conversion functions
Conversion functions are used to convert data and test for data types
isBitSet
setBitSet
isBoolean
Checks if the string value is a boolean value according to the rules of toBoolean()
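For example (a hedged sketch based on the toBoolean rules listed later in this article):
isBoolean('true') -> true
isBoolean('microsoft') -> false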
isByte
Checks if the string value is a byte value given an optional format according to the rules of toByte()
isByte('123') -> true
isByte('chocolate') -> false
isDate
Checks if the input date string is a date using an optional input date format. Refer Java's SimpleDateFormat for
available formats. If the input date format is omitted, default format is yyyy-[M]M-[d]d . Accepted formats are
[ yyyy, yyyy-[M]M, yyyy-[M]M-[d]d, yyyy-[M]M-[d]dT* ]
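For example (a hedged sketch mirroring the toDate examples later in this article):
isDate('2012-8-18') -> true
isDate('12/18/2012', 'MM/dd/yyyy') -> true
isDate('torphy') -> false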
isShort
Checks if the string value is a short value given an optional format according to the rules of toShort()
isInteger
Checks if the string value is an integer value given an optional format according to the rules of toInteger()
isLong
Checks if the string value is a long value given an optional format according to the rules of toLong()
isNan
isFloat
Checks if the string value is a float value given an optional format according to the rules of toFloat()
isDouble
Checks if the string value is a double value given an optional format according to the rules of toDouble()
isDecimal
Checks if the string value is a decimal value given an optional format according to the rules of toDecimal()
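For example (a hedged sketch; these checks follow the corresponding to* conversion rules):
isInteger('123') -> true
isInteger('chocolate') -> false
isDecimal('123.45') -> true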
isTimestamp
Checks if the input date string is a timestamp using an optional input timestamp format. Refer to Java's
SimpleDateFormat for available formats. If the timestamp format is omitted, the default pattern
yyyy-[M]M-[d]d hh:mm:ss[.f...] is used. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. Timestamp supports up to millisecond accuracy with a value of 999.
isTimestamp('2016-12-31 00:12:00') -> true
isTimestamp('2016-12-31T00:12:00', 'yyyy-MM-dd\'T\'HH:mm:ss', 'PST') -> true
isTimestamp('2012-8222.18') -> false
toBase64
toBinary
toBoolean
Converts a value of ('t', 'true', 'y', 'yes', '1') to true and ('f', 'false', 'n', 'no', '0') to false and NULL for any other
value.
toBoolean('true') -> true
toBoolean('n') -> false
isNull(toBoolean('truthy')) -> true
toByte
Converts any numeric or string to a byte value. An optional Java decimal format can be used for the conversion.
toByte(123)
123
toByte(0xFF)
-1
toByte('123')
123
toDate
Converts input date string to date using an optional input date format. Refer Java's SimpleDateFormat class for
available formats. If the input date format is omitted, default format is yyyy-[M]M-[d]d. Accepted formats are :[
yyyy, yyyy-[M]M, yyyy-[M]M-[d]d, yyyy-[M]M-[d]dT* ].
toDate('2012-8-18') -> toDate('2012-08-18')
toDate('12/18/2012', 'MM/dd/yyyy') -> toDate('2012-12-18')
toDecimal
Converts any numeric or string to a decimal value. If precision and scale are not specified, it is defaulted to
(10,2). An optional Java decimal format can be used for the conversion. An optional locale format in the form of
BCP47 language like en-US, de, zh-CN.
toDecimal(123.45) -> 123.45
toDecimal('123.45', 8, 4) -> 123.4500
toDecimal('$123.45', 8, 4,'$###.00') -> 123.4500
toDecimal('Ç123,45', 10, 2, 'Ç###,##', 'de') -> 123.45
toDouble
Converts any numeric or string to a double value. An optional Java decimal format can be used for the
conversion. An optional locale format in the form of BCP47 language like en-US, de, zh-CN.
toDouble(123.45) -> 123.45
toDouble('123.45') -> 123.45
toDouble('$123.45', '$###.00') -> 123.45
toDouble('Ç123,45', 'Ç###,##', 'de') -> 123.45
toFloat
toFloat(<value> : any, [<format> : string], [<locale> : string]) => float
Converts any numeric or string to a float value. An optional Java decimal format can be used for the conversion.
Truncates any double.
toFloat(123.45) -> 123.45f
toFloat('123.45') -> 123.45f
toFloat('$123.45', '$###.00') -> 123.45f
toInteger
Converts any numeric or string to an integer value. An optional Java decimal format can be used for the
conversion. Truncates any long, float, double.
toInteger(123) -> 123
toInteger('123') -> 123
toInteger('$123', '$###') -> 123
toLong
Converts any numeric or string to a long value. An optional Java decimal format can be used for the conversion.
Truncates any float, double.
toLong(123) -> 123
toLong('123') -> 123
toLong('$123', '$###') -> 123
toShort
Converts any numeric or string to a short value. An optional Java decimal format can be used for the
conversion. Truncates any integer, long, float, double.
toShort(123) -> 123
toShort('123') -> 123
toShort('$123', '$###') -> 123
toString
Converts a primitive datatype to a string. For numbers and dates a format can be specified. If unspecified, the
system default is picked. Java decimal format is used for numbers. Refer to Java SimpleDateFormat for all
possible date formats; the default format is yyyy-MM-dd.
toString(10) -> '10'
toString('engineer') -> 'engineer'
toString(123456.789, '##,###.##') -> '123,456.79'
toString(123.78, '000000.000') -> '000123.780'
toString(12345, '##0.#####E0') -> '12.345E3'
toString(toDate('2018-12-31')) -> '2018-12-31'
isNull(toString(toDate('2018-12-31', 'MM/dd/yy'))) -> true
toString(4 == 20) -> 'false'
toTimestamp
toTimestamp(<string> : any, [<timestamp format> : string], [<time zone> : string]) => timestamp
Converts a string to a timestamp given an optional timestamp format. If the timestamp is omitted the default
pattern yyyy-[M]M-[d]d hh:mm:ss[.f...] is used. You can pass an optional timezone in the form of 'GMT', 'PST',
'UTC', 'America/Cayman'. Timestamp supports up to millisecond accuracy with value of 999. Refer Java's
SimpleDateFormat class for available formats.
https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
toTimestamp('2016-12-31 00:12:00') -> toTimestamp('2016-12-31 00:12:00')
toTimestamp('2016-12-31T00:12:00', 'yyyy-MM-dd\'T\'HH:mm:ss', 'PST') -> toTimestamp('2016-12-31
00:12:00')
toTimestamp('12/31/2016T00:12:00', 'MM/dd/yyyy\'T\'HH:mm:ss') -> toTimestamp('2016-12-31 00:12:00')
millisecond(toTimestamp('2019-02-03 05:19:28.871', 'yyyy-MM-dd HH:mm:ss.SSS')) -> 871
toUTC
Converts the timestamp to UTC. You can pass an optional timezone in the form of 'GMT', 'PST', 'UTC',
'America/Cayman'. It is defaulted to the current timezone. Refer Java's SimpleDateFormat class for available
formats. https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html.
toUTC(currentTimestamp()) == toTimestamp('2050-12-12 19:18:12') -> false
toUTC(currentTimestamp(), 'Asia/Seoul') != toTimestamp('2050-12-12 19:18:12') -> true
Map functions
Map functions perform operations on map data types
associate
Creates a map of key/values. All the keys and values should be of the same type. If no items are specified, it is
defaulted to a map of string to string type. Same as the [ -> ] creation operator. Keys and values should alternate
with each other.
associate('fruit', 'apple', 'vegetable', 'carrot' )=> ['fruit' -> 'apple', 'vegetable' -> 'carrot']
keyValues
Creates a map of key/values. The first parameter is an array of keys and second is the array of values. Both
arrays should have equal length.
keyValues(['bojjus', 'appa'], ['gunchus', 'ammi']) => ['bojjus' -> 'gunchus', 'appa' -> 'ammi']
mapAssociation
Transforms a map by associating the keys to new values. Returns an array. It takes a mapping function where
you can address the item as #key and current value as #value.
mapAssociation(['bojjus' -> 'gunchus', 'appa' -> 'ammi'], @(key = #key, value = #value)) => [@(key =
'bojjus', value = 'gunchus'), @(key = 'appa', value = 'ammi')]
reassociate
Transforms a map by associating the keys to new values. It takes a mapping function where you can address the
item as #key and current value as #value.
reassociate(['fruit' -> 'apple', 'vegetable' -> 'tomato'], substring(#key, 1, 1) + substring(#value, 1,
1)) => ['fruit' -> 'fa', 'vegetable' -> 'vt']
Metafunctions
Metafunctions primarily function on metadata in your data flow
byItem
Finds a subitem within a structure or an array of structures. If there are multiple matches, the first match is
returned. If there is no match, it returns a NULL value. The returned value has to be type converted by one of the
type conversion actions (? date, ? string ...). Column names known at design time should be addressed just by
their name. Computed inputs are not supported, but you can use parameter substitutions.
byItem( byName('customer'), 'orderItems') ? (itemName as string, itemQty as integer)
byOrigin
Selects a column value by name in the origin stream. The second argument is the origin stream name. If there
are multiple matches, the first match is returned. If no match it returns a NULL value. The returned value has to
be type converted by one of the type conversion functions(TO_DATE, TO_STRING ...). Column names known at
design time should be addressed just by their name. Computed inputs are not supported but you can use
parameter substitutions.
toString(byOrigin('ancestor', 'ancestorStream'))
byOrigins
Selects an array of columns by name in the stream. The second argument is the stream where it originated from.
If there are multiple matches, the first match is returned. If no match it returns a NULL value. The returned value
has to be type converted by one of the type conversion functions(TO_DATE, TO_STRING ...) Column names
known at design time should be addressed just by their name. Computed inputs are not supported but you can
use parameter substitutions.
toString(byOrigins(['ancestor1', 'ancestor2'], 'ancestorStream'))
byName
Selects a column value by name in the stream. You can pass an optional stream name as the second argument. If
there are multiple matches, the first match is returned. If no match it returns a NULL value. The returned value
has to be type converted by one of the type conversion functions(TO_DATE, TO_STRING ...). Column names
known at design time should be addressed just by their name. Computed inputs are not supported but you can
use parameter substitutions.
toString(byName('parent'))
toLong(byName('income'))
toBoolean(byName('foster'))
toLong(byName($debtCol))
toString(byName('Bogus Column'))
toString(byName('Bogus Column', 'DeriveStream'))
byNames
Selects an array of columns by name in the stream. You can pass an optional stream name as the second
argument. If there are multiple matches, the first match is returned. If there are no matches for a column, the
entire output is a NULL value. The returned value requires a type conversion functions (toDate, toString, ...).
Column names known at design time should be addressed just by their name. Computed inputs are not
supported but you can use parameter substitutions.
toString(byNames(['parent', 'child']))
byNames(['parent']) ? string
toLong(byNames(['income']))
byNames(['income']) ? long
toBoolean(byNames(['foster']))
toLong(byNames($debtCols))
toString(byNames(['a Column']))
toString(byNames(['a Column'], 'DeriveStream'))
byNames(['orderItem']) ? (itemName as string, itemQty as integer)
byPath
Finds a hierarchical path by name in the stream. You can pass an optional stream name as the second argument.
If no such path is found it returns null. Column names/paths known at design time should be addressed just by
their name or dot notation path. Computed inputs are not supported but you can use parameter substitutions.
byPath('grandpa.parent.child') => column
byPosition
Selects a column value by its relative position(1 based) in the stream. If the position is out of bounds it returns a
NULL value. The returned value has to be type converted by one of the type conversion functions(TO_DATE,
TO_STRING ...) Computed inputs are not supported but you can use parameter substitutions.
toString(byPosition(1))
toDecimal(byPosition(2), 10, 2)
toBoolean(byName(4))
toString(byName($colName))
toString(byPosition(1234))
hasPath
Checks if a certain hierarchical path exists by name in the stream. You can pass an optional stream name as the
second argument. Column names/paths known at design time should be addressed just by their name or dot
notation path. Computed inputs are not supported but you can use parameter substitutions.
hasPath('grandpa.parent.child') => boolean
originColumns
Gets all output columns for an origin stream where columns were created. Must be enclosed in another function.
array(toString(originColumns('source1')))
hex
unhex
Unhexes a binary value from its string representation. This can be used in conjunction with sha2, md5 to
convert from string to binary representation
unhex('1fadbe') -> toBinary([toByte(0x1f), toByte(0xad), toByte(0xbe)])
Window functions
The following functions are only available in window transformations.
cumeDist
The CumeDist function computes the position of a value relative to all values in the partition. The result is the
number of rows preceding or equal to the current row in the ordering of the partition divided by the total
number of rows in the window partition. Any tie values in the ordering will evaluate to the same position.
cumeDist()
denseRank
Computes the rank of a value in a group of values specified in a window's order by clause. The result is one plus
the number of rows preceding or equal to the current row in the ordering of the partition. The values will not
produce gaps in the sequence. Dense Rank works even when data is not sorted and looks for change in values.
denseRank()
lag
lag(<value> : any, [<number of rows to look before> : number], [<default value> : any]) => any
Gets the value of the first parameter evaluated n rows before the current row. The second parameter is the
number of rows to look back and the default value is 1. If there are not as many rows a value of null is returned
unless a default value is specified.
lag(amount, 2)
lag(amount, 2000, 100)
lead
lead(<value> : any, [<number of rows to look after> : number], [<default value> : any]) => any
Gets the value of the first parameter evaluated n rows after the current row. The second parameter is the
number of rows to look forward and the default value is 1. If there are not as many rows a value of null is
returned unless a default value is specified.
lead(amount, 2)
lead(amount, 2000, 100)
nTile
The NTile function divides the rows for each window partition into n buckets ranging from 1 to at most n .
Bucket values will differ by at most 1. If the number of rows in the partition does not divide evenly into the
number of buckets, then the remainder values are distributed one per bucket, starting with the first bucket. The
NTile function is useful for the calculation of tertiles , quartiles, deciles, and other common summary
statistics. The function calculates two variables during initialization: the size of a regular bucket and the number
of buckets that will have one extra row added to them. Both variables are based on the size of the current
partition. During the calculation process
the function keeps track of the current row number, the current bucket number, and the row number at which
the bucket will change (bucketThreshold). When the current row number reaches bucket threshold, the bucket
value is increased by one and the threshold is increased by the bucket size (plus one extra if the current bucket is
padded).
nTile()
nTile(numOfBuckets)
rank
Computes the rank of a value in a group of values specified in a window's order by clause. The result is one plus
the number of rows preceding or equal to the current row in the ordering of the partition. The values will
produce gaps in the sequence. Rank works even when data is not sorted and looks for change in values.
rank()
rowNumber
Next steps
Learn how to use Expression Builder.
What is data wrangling?
NOTE
The Power Query activity in Azure Data Factory is currently available in public preview
Use cases
Supported sources
CONNECTOR | DATA FORMAT | AUTHENTICATION TYPE
Azure Data Lake Storage Gen2 | CSV, Parquet | Account Key, Service Principal
Currently not all Power Query M functions are supported for data wrangling despite being available during
authoring. While building your Power Query activities, you'll be prompted with the following error message if a
function isn't supported:
The wrangling data flow is invalid. Expression.Error: The transformation logic isn't supported. Please try a
simpler expression
Next steps
Learn how to create a data wrangling Power Query mash-up.
Transformation functions in Power Query for data
wrangling
NOTE
Power Query in ADF is currently available in public preview
Currently not all Power Query M functions are supported for data wrangling despite being available during
authoring. While building your mash-ups, you'll be prompted with the following error message if a function isn't
supported:
UserQuery : Expression.Error: The transformation logic is not supported as it requires dynamic access to rows
of data, which cannot be scaled out.
Column Management
Selection: Table.SelectColumns
Removal: Table.RemoveColumns
Renaming: Table.RenameColumns, Table.PrefixColumns, Table.TransformColumnNames
Reordering: Table.ReorderColumns
Row Filtering
Use M function Table.SelectRows to filter on the following conditions:
Equality and inequality
Numeric, text, and date comparisons (but not DateTime)
Numeric information such as Number.IsEven/Odd
Text containment using Text.Contains, Text.StartsWith, or Text.EndsWith
Date ranges (including all the 'IsIn' Date functions)
Combinations of these using and, or, or not conditions
Merging/Joining tables
Power Query will generate a nested join (Table.NestedJoin; users can also manually write
Table.AddJoinColumn). Users must then expand the nested join column into a non-nested join
(Table.ExpandTableColumn, not supported in any other context).
The M function Table.Join can be written directly to avoid the need for an additional expansion step, but the
user must ensure that there are no duplicate column names among the joined tables
Supported Join Kinds: Inner, LeftOuter, RightOuter, FullOuter
Both Value.Equals and Value.NullableEquals are supported as key equality comparers
Group by
Use Table.Group to aggregate values.
Must be used with an aggregation function
Supported aggregation functions: List.Sum, List.Count, List.Average, List.Min, List.Max, List.StandardDeviation,
List.First, List.Last
Sorting
Use Table.Sort to sort values.
Reducing Rows
Keep and Remove Top, Keep Range (corresponding M functions, only supporting counts, not conditions:
Table.FirstN, Table.Skip, Table.RemoveFirstN, Table.Range, Table.MinN, Table.MaxN)
Table.NestedJoin: Just doing a join will result in a validation error. The columns must be expanded for it to work.
Row level error handling: Row level error handling is currently not supported. For example, to filter out
non-numeric values from a column, one approach would be to transform the text column to a number. Every
cell which fails to transform will be in an error state and needs to be filtered. This scenario isn't possible in
scaled-out M.
M script workarounds
For SplitColumn, there is an alternative for splitting by length and by position, as shown in the following examples.
Table.AddColumn(Source, "First characters", each Text.Start([Email], 7), type text)
Table.AddColumn(#"Inserted first characters", "Text range", each Text.Middle([Email], 4, 9), type text)
This option is accessible from the Extract option in the ribbon
For Table.CombineColumns
Next steps
Learn how to create a data wrangling Power Query in ADF.
Roles and permissions for Azure Data Factory
Set up permissions
After you create a data factory, you may want to let other users work with it. To give this access to other users,
you have to add them to the built-in Data Factory Contributor role on the resource group that contains the
data factory.
Scope of the Data Factory Contributor role
Membership of the Data Factory Contributor role lets users do the following things:
Create, edit, and delete data factories and child resources including datasets, linked services, pipelines,
triggers, and integration runtimes.
Deploy Resource Manager templates. Resource Manager deployment is the deployment method used by
Data Factory in the Azure portal.
Manage App Insights alerts for a data factory.
Create support tickets.
For more info about this role, see Data Factory Contributor role.
Resource Manager template deployment
The Data Factory Contributor role, at the resource group level or above, lets users deploy Resource Manager
templates. As a result, members of the role can use Resource Manager templates to deploy both data factories
and their child resources, including datasets, linked services, pipelines, triggers, and integration runtimes.
Membership in this role does not let the user create other resources.
Permissions on Azure Repos and GitHub are independent of Data Factory permissions. As a result, a user with
repo permissions who is only a member of the Reader role can edit Data Factory child resources and commit
changes to the repo, but can't publish these changes.
IMPORTANT
Resource Manager template deployment with the Data Factory Contributor role does not elevate your permissions.
For example, if you deploy a template that creates an Azure virtual machine, and you don't have permission to create
virtual machines, the deployment fails with an authorization error.
Next steps
Learn more about roles in Azure - Understand role definitions
Learn more about the Data Factory Contributor role - Data Factory Contributor role.
Azure Data Factory - naming rules
4/22/2021 • 2 minutes to read • Edit Online
Data factory
Name uniqueness: Unique across Microsoft Azure. Names are case-insensitive; that is, MyDF and mydf refer to the same data factory.
Validation checks: Each data factory is tied to exactly one Azure subscription. Object names must start with a letter or a number, and can contain only letters, numbers, and the dash (-) character. Every dash (-) character must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in container names. Name can be 3-63 characters long.
Linked services/Datasets/Pipelines/Data Flows
Name uniqueness: Unique within a data factory. Names are case-insensitive.
Validation checks: Object names must start with a letter. The following characters are not allowed: ".", "+", "?", "/", "<", ">", "*", "%", "&", ":", "\". Dashes ("-") are not allowed in the names of linked services, data flows, and datasets.
Integration Runtime
Name uniqueness: Unique within a data factory. Names are case-insensitive.
Validation checks: Integration runtime names can contain only letters, numbers, and the dash (-) character. The first and last characters must be a letter or number. Every dash (-) character must be immediately preceded and followed by a letter or a number. Consecutive dashes are not permitted in the integration runtime name.
Data flow transformations
Name uniqueness: Unique within a data flow. Names are case-insensitive.
Validation checks: Data flow transformation names can contain only letters and numbers. The first character must be a letter.
Resource Group
Name uniqueness: Unique across Microsoft Azure. Names are case-insensitive.
Validation checks: For more info, see Azure naming rules and restrictions.
Pipeline parameters & variables
Name uniqueness: Unique within the pipeline. Names are case-insensitive.
Validation checks: Validation of parameter names and variable names is limited to uniqueness because of backward compatibility reasons. When parameters or variables are used to reference entity names (for example, a linked service), the entity naming rules apply. A good practice is to follow the data flow transformation naming rules when naming your pipeline parameters and variables.
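As an illustrative sketch, a candidate factory name can be checked against these rules simply by attempting to create the factory with Azure PowerShell; the resource group, name, and region below are placeholders:
# A compliant name: letters, numbers, and single dashes, 3-63 characters
Set-AzDataFactoryV2 -ResourceGroupName "<your resource group>" -Name "my-data-factory-01" -Location "East US"
Per the rules above, a name such as my--df (consecutive dashes) or a name shorter than 3 characters would be rejected by the service.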
Next steps
Learn how to create data factories by following step-by-step instructions in Quickstart: create a data factory
article.
Azure Data Factory data redundancy
3/5/2021 • 2 minutes to read • Edit Online
Azure Data Factory data includes metadata (pipeline, datasets, linked services, integration runtime and triggers)
and monitoring data (pipeline, trigger, and activity runs).
In all regions (except Brazil South and Southeast Asia), Azure Data Factory data is stored and replicated in the
paired region to protect against metadata loss. During regional datacenter failures, Microsoft may initiate a
regional failover of your Azure Data Factory instance. In most cases, no action is required on your part. When
the Microsoft-managed failover has completed, you will be able to access your Azure Data Factory in the failover
region.
Due to data residency requirements in Brazil South and Southeast Asia, Azure Data Factory data is stored in the
local region only. For Southeast Asia, all data is stored in Singapore. For Brazil South, all data is stored in
Brazil. If the region is lost due to a significant disaster, Microsoft will not be able to recover your Azure Data
Factory data.
NOTE
Microsoft-managed failover does not apply to the self-hosted integration runtime (SHIR), since this infrastructure is typically
customer-managed. If the SHIR is set up on an Azure VM, the recommendation is to leverage Azure Site Recovery to
handle the Azure VM failover to another region.
NOTE
In case of a disaster (loss of region), a new data factory can be provisioned manually or in an automated fashion. Once the
new data factory has been created, you can restore your pipelines, datasets, and linked services JSON from the existing Git
repository.
Data stores
Azure Data Factory enables you to move data among data stores located on-premises and in the cloud. To
ensure business continuity with your data stores, you should refer to the business continuity recommendations
for each of these data stores.
See also
Azure Regional Pairs
Data residency in Azure
Visual authoring in Azure Data Factory
4/22/2021 • 2 minutes to read • Edit Online
Authoring canvas
To open the authoring canvas, click on the pencil icon.
Here, you author the pipelines, activities, datasets, linked services, data flows, triggers, and integration runtimes
that comprise your factory. To get started building a pipeline using the authoring canvas, see Copy data using
the copy Activity.
The default visual authoring experience is directly working with the Data Factory service. Azure Repos Git or
GitHub integration is also supported to allow source control and collaboration for work on your data factory
pipelines. To learn more about the differences between these authoring experiences, see Source control in Azure
Data Factory.
Properties pane
For top-level resources such as pipelines, datasets, and data flows, high-level properties are editable in the
properties pane on the right-hand side of the canvas. The properties pane contains properties such as name,
description, annotations, and other high-level properties. Subresources such as pipeline activities and data flow
transformations are edited using the panel at the bottom of the canvas.
The properties pane only opens by default on resource creation. To edit it, click on the properties pane icon
located in the top-right corner of the canvas.
Related resources
In the properties pane, you can see what resources are dependent on the selected resource by selecting the
Related tab. Any resource that references the current resource will be listed here.
For example, in the above image, one pipeline and two data flows use the dataset currently selected.
Management hub
The management hub, accessed by the Manage tab in the Azure Data Factory UX, is a portal that hosts global
management actions for your data factory. Here, you can manage your connections to data stores and external
computes, source control configuration, and trigger settings. For more information, learn about the capabilities
of the management hub.
This opens the Data Factory Expression Builder where you can build expressions from supported system
variables, activity output, functions, and user-specified variables or parameters.
For information about the expression language, see Expressions and functions in Azure Data Factory.
Provide feedback
Select Feedback to comment about features or to notify Microsoft about issues with the tool.
Next steps
To learn more about monitoring and managing pipelines, see Monitor and manage pipelines programmatically.
Iterative development and debugging with Azure
Data Factory
4/22/2021 • 4 minutes to read • Edit Online
Debugging a pipeline
As you author using the pipeline canvas, you can test your activities using the Debug capability. When you do
test runs, you don't have to publish your changes to the data factory before you select Debug. This feature is
helpful in scenarios where you want to make sure that the changes work as expected before you update the data
factory workflow.
As the pipeline is running, you can see the results of each activity in the Output tab of the pipeline canvas.
View the results of your test runs in the Output window of the pipeline canvas.
After a test run succeeds, add more activities to your pipeline and continue debugging in an iterative manner.
You can also Cancel a test run while it is in progress.
IMPORTANT
Selecting Debug actually runs the pipeline. For example, if the pipeline contains a copy activity, the test run copies data
from source to destination. As a result, we recommend that you use test folders in your copy activities and other activities
when debugging. After you've debugged the pipeline, switch to the actual folders that you want to use in normal
operations.
Setting breakpoints
Azure Data Factory allows you to debug a pipeline until you reach a particular activity on the pipeline canvas.
Put a breakpoint on the activity up to which you want to test, and select Debug. Data Factory ensures that the
test runs only until the breakpoint activity on the pipeline canvas. This Debug Until feature is useful when you
don't want to test the entire pipeline, but only a subset of activities inside the pipeline.
To set a breakpoint, select an element on the pipeline canvas. A Debug Until option appears as an empty red
circle at the upper right corner of the element.
After you select the Debug Until option, it changes to a filled red circle to indicate the breakpoint is enabled.
NOTE
The Azure Data Factory service only persists debug run history for 15 days.
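Debug runs are surfaced in the Output tab of the pipeline canvas, while runs of the published pipelines can also be inspected programmatically. A minimal sketch with the Az.DataFactory cmdlets; the resource names are placeholders:
# List pipeline runs from the last day and summarize their outcome
$runs = Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<rg>" -DataFactoryName "<factory>" -LastUpdatedAfter (Get-Date).AddDays(-1) -LastUpdatedBefore (Get-Date)
$runs | Select-Object PipelineName, Status, RunStart, RunEnd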
Data preview in the data flow designer and pipeline debugging of data flows are intended to work best with
small samples of data. However, if you need to test your logic in a pipeline or data flow against large amounts of
data, increase the size of the Azure Integration Runtime being used in the debug session with more cores and a
minimum of general purpose compute.
Debugging a pipeline with a data flow activity
When executing a debug pipeline run with a data flow, you have two options on which compute to use. You can
either use an existing debug cluster or spin up a new just-in-time cluster for your data flows.
Using an existing debug session will greatly reduce the data flow start up time as the cluster is already running,
but is not recommended for complex or parallel workloads as it may fail when multiple jobs are run at once.
Using the activity runtime will create a new cluster using the settings specified in each data flow activity's
integration runtime. This allows each job to be isolated and should be used for complex workloads or
performance testing. You can also control the TTL in the Azure IR so that the cluster resources used for
debugging will still be available for that time period to serve additional job requests.
NOTE
If you have a pipeline with data flows executing in parallel or data flows that need to be tested with large datasets, choose
"Use Activity Runtime" so that Data Factory can use the Integration Runtime that you've selected in your data flow
activity. This will allow the data flows to execute on multiple clusters and can accommodate your parallel data flow
executions.
Next steps
After testing your changes, promote them to higher environments using continuous integration and
deployment in Azure Data Factory.
Management hub in Azure Data Factory
4/28/2021 • 2 minutes to read • Edit Online
Manage connections
Linked services
Linked services define the connection information for Azure Data Factory to connect to external data stores and
compute environments. For more information, see linked services concepts. Linked service creation, editing, and
deletion is done in the management hub.
Integration runtimes
An integration runtime is a compute infrastructure used by Azure Data Factory to provide data integration
capabilities across different network environments. For more information, learn about integration runtime
concepts. In the management hub, you can create, delete, and monitor your integration runtimes.
Manage authoring
Triggers
Triggers determine when a pipeline run should be kicked off. Currently triggers can be on a wall clock schedule,
operate on a periodic interval, or depend on an event. For more information, learn about trigger execution. In
the management hub, you can create, edit, delete, or view the current state of a trigger.
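The same trigger state operations can be scripted. A minimal sketch with Azure PowerShell; the resource and trigger names are placeholders:
# View the current state of all triggers in the factory
Get-AzDataFactoryV2Trigger -ResourceGroupName "<rg>" -DataFactoryName "<factory>" | Select-Object Name, RuntimeState
# Start a specific trigger
Start-AzDataFactoryV2Trigger -ResourceGroupName "<rg>" -DataFactoryName "<factory>" -Name "DailyTrigger" -Force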
Global parameters
Global parameters are constants across a data factory that can be consumed by a pipeline in any expression. For
more information, learn about global parameters.
Next steps
Learn how to configure a git repository to your ADF
Source control in Azure Data Factory
7/2/2021 • 15 minutes to read • Edit Online
NOTE
For Azure Government Cloud, only GitHub Enterprise Server is available.
To learn more about how Azure Data Factory integrates with Git, view the 15-minute tutorial video below:
NOTE
Authoring directly with the Data Factory service is disabled in the Azure Data Factory UX when a Git repository is
configured. Changes made via PowerShell or an SDK are published directly to the Data Factory service, and are not
entered into Git.
NOTE
When configuring Git in the Azure portal, settings like project name and repo name have to be manually entered instead
of being part of a dropdown.
The configuration pane shows the following Azure Repos code repository settings:
Repository Type: The type of the Azure Repos code repository. Value: Azure DevOps Git or GitHub.
Azure Active Directory: Your Azure AD tenant name. Value: <your tenant name>.
Azure Repos Organization: Your Azure Repos organization name. You can locate your Azure Repos organization name at https://{organization name}.visualstudio.com. You can sign in to your Azure Repos organization to access your Visual Studio profile and see your repositories and projects. Value: <your organization name>.
ProjectName: Your Azure Repos project name. You can locate your Azure Repos project name at https://{organization name}.visualstudio.com/{project name}. Value: <your Azure Repos project name>.
RepositoryName: Your Azure Repos code repository name. Azure Repos projects contain Git repositories to manage your source code as your project grows. You can create a new repository or use an existing repository that's already in your project. Value: <your Azure Repos code repository name>.
Collaboration branch: Your Azure Repos collaboration branch that is used for publishing. By default, it's main. Change this setting in case you want to publish resources from another branch. Value: <your collaboration branch name>.
Root folder: Your root folder in your Azure Repos collaboration branch. Value: <your root folder name>.
Import existing Data Factory resources to repository: Specifies whether to import existing data factory resources from the UX authoring canvas into an Azure Repos Git repository. Select the box to import your data factory resources into the associated Git repository in JSON format. This action exports each resource individually (that is, the linked services and datasets are exported into separate JSONs). When this box isn't selected, the existing resources aren't imported. Value: Selected (default).
Branch to import resource into: Specifies into which branch the data factory resources (pipelines, datasets, linked services, etc.) are imported. You can import resources into one of the following branches: a. Collaboration, b. Create new, c. Use existing.
NOTE
If you are using Microsoft Edge and do not see any values in your Azure DevOps Account dropdown, add
https://*.visualstudio.com to the trusted sites list.
IMPORTANT
To connect to another Azure Active Directory, the user logged in must be a part of that active directory.
GitHub Enterprise URL: The GitHub Enterprise root URL (must be HTTPS for a local GitHub Enterprise server). For example: https://github.mydomain.com. Required only if Use GitHub Enterprise is selected. Value: <your GitHub enterprise url>.
GitHub account: Your GitHub account name. This name can be found from https://github.com/{account name}/{repository name}. Navigating to this page prompts you to enter GitHub OAuth credentials to your GitHub account. Value: <your GitHub account name>.
Repository Name: Your GitHub code repository name. GitHub accounts contain Git repositories to manage your source code. You can create a new repository or use an existing repository that's already in your account. Value: <your repository name>.
Collaboration branch: Your GitHub collaboration branch that is used for publishing. By default, it's main. Change this setting in case you want to publish resources from another branch. Value: <your collaboration branch>.
Root folder: Your root folder in your GitHub collaboration branch. Value: <your root folder name>.
Import existing Data Factory resources to repository: Specifies whether to import existing data factory resources from the UX authoring canvas into a GitHub repository. Select the box to import your data factory resources into the associated Git repository in JSON format. This action exports each resource individually (that is, the linked services and datasets are exported into separate JSONs). When this box isn't selected, the existing resources aren't imported. Value: Selected (default).
Branch to import resource into: Specifies into which branch the data factory resources (pipelines, datasets, linked services, etc.) are imported. You can import resources into one of the following branches: a. Collaboration, b. Create new, c. Use existing.
GitHub organizations
Connecting to a GitHub organization requires the organization to grant permission to Azure Data Factory. A user
with ADMIN permissions on the organization must perform the below steps to allow data factory to connect.
Connecting to GitHub for the first time in Azure Data Factory
If you're connecting to GitHub from Azure Data Factory for the first time, follow these steps to connect to a
GitHub organization.
1. In the Git configuration pane, enter the organization name in the GitHub Account field. A prompt to login into
GitHub will appear.
2. Login using your user credentials.
3. You'll be asked to authorize Azure Data Factory as an application called AzureDataFactory. On this screen, you
will see an option to grant permission for ADF to access the organization. If you don't see the option to grant
permission, ask an admin to manually grant the permission through GitHub.
Once you follow these steps, your factory will be able to connect to both public and private repositories within
your organization. If you are unable to connect, try clearing the browser cache and retrying.
Already connected to GitHub using a personal account
If you have already connected to GitHub and only granted permission to access a personal account, follow the
below steps to grant permissions to an organization.
1. Go to GitHub and open Settings.
2. Select Applications. In the Authorized OAuth apps tab, you should see AzureDataFactory.
3. Select the application and grant the application access to your organization.
Once you follow these steps, your factory will be able to connect to both public and private repositories within
your organization.
Known GitHub limitations
You can store script and data files in a GitHub repository. However, you have to upload the files manually
to Azure Storage. A Data Factory pipeline does not automatically upload script or data files stored in a
GitHub repository to Azure Storage.
GitHub Enterprise with a version older than 2.14.0 doesn't work in the Microsoft Edge browser.
GitHub integration with the Data Factory visual authoring tools only works in the generally available
version of Data Factory.
A maximum of 1,000 entities per resource type (such as pipelines and datasets) can be fetched from a
single GitHub branch. If this limit is reached, it is suggested to split your resources into separate factories.
Azure DevOps Git does not have this limitation.
Version control
Version control systems (also known as source control) let developers collaborate on code and track changes
that are made to the code base. Source control is an essential tool for multi-developer projects.
Creating feature branches
Each Azure Repos Git repository that's associated with a data factory has a collaboration branch (main is the
default collaboration branch). Users can also create feature branches by clicking + New Branch in the branch
dropdown. Once the new branch pane appears, enter the name of your feature branch.
When you are ready to merge the changes from your feature branch to your collaboration branch, click on the
branch dropdown and select Create pull request. This action takes you to Azure Repos Git where you can
raise pull requests, do code reviews, and merge changes to your collaboration branch (main is the default). You
are only allowed to publish to the Data Factory service from your collaboration branch.
You can configure a different publish branch by adding a publish_config.json file to the root folder of the collaboration branch, for example:
{
"publishBranch": "factory/adf_publish"
}
Azure Data Factory can only have one publish branch at a time. When you specify a new publish branch, Data
Factory doesn't delete the previous publish branch. If you want to remove the previous publish branch, delete it
manually.
NOTE
Data Factory only reads the publish_config.json file when it loads the factory. If you already have the factory loaded
in the portal, refresh the browser to make your changes take effect.
A side pane will open where you confirm that the publish branch and pending changes are correct. Once you
verify your changes, click OK to confirm the publish.
IMPORTANT
The main branch is not representative of what's deployed in the Data Factory service. The main branch must be published
manually to the Data Factory service.
Enter your data factory name and click confirm to remove the Git repository associated with your data factory.
After you remove the association with the current repo, you can configure your Git settings to use a different
repo and then import existing Data Factory resources to the new repo.
IMPORTANT
Removing Git configuration from a data factory doesn't delete anything from the repository. The factory will contain all
published resources. You can continue to edit the factory directly against the service.
Next steps
To learn more about monitoring and managing pipelines, see Monitor and manage pipelines
programmatically.
To implement continuous integration and deployment, see Continuous integration and delivery (CI/CD) in
Azure Data Factory.
Continuous integration and delivery in Azure Data
Factory
6/10/2021 • 29 minutes to read • Edit Online
Overview
Continuous integration is the practice of testing each change made to your codebase automatically and as early
as possible. Continuous delivery follows the testing that happens during continuous integration and pushes
changes to a staging or production system.
In Azure Data Factory, continuous integration and delivery (CI/CD) means moving Data Factory pipelines from
one environment (development, test, production) to another. Azure Data Factory utilizes Azure Resource
Manager templates to store the configuration of your various ADF entities (pipelines, datasets, data flows, and
so on). There are two suggested methods to promote a data factory to another environment:
Automated deployment using Data Factory's integration with Azure Pipelines
Manually upload a Resource Manager template using Data Factory UX integration with Azure Resource
Manager.
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
CI/CD lifecycle
Below is a sample overview of the CI/CD lifecycle in an Azure data factory that's configured with Azure Repos
Git. For more information on how to configure a Git repository, see Source control in Azure Data Factory.
1. A development data factory is created and configured with Azure Repos Git. All developers should have
permission to author Data Factory resources like pipelines and datasets.
2. A developer creates a feature branch to make a change. They debug their pipeline runs with their most
recent changes. For more information on how to debug a pipeline run, see Iterative development and
debugging with Azure Data Factory.
3. After a developer is satisfied with their changes, they create a pull request from their feature branch to
the main or collaboration branch to get their changes reviewed by peers.
4. After a pull request is approved and changes are merged in the main branch, the changes get published
to the development factory.
5. When the team is ready to deploy the changes to a test or UAT (User Acceptance Testing) factory, the
team goes to their Azure Pipelines release and deploys the desired version of the development factory to
UAT. This deployment takes place as part of an Azure Pipelines task and uses Resource Manager template
parameters to apply the appropriate configuration.
6. After the changes have been verified in the test factory, deploy to the production factory by using the
next task of the pipelines release.
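Outside of an Azure Pipelines release, the same promotion step can be sketched with Azure PowerShell by deploying the Resource Manager template generated from the development factory; the paths and names below are placeholders:
# Deploy the exported factory ARM template into the target environment's resource group
New-AzResourceGroupDeployment -ResourceGroupName "<test factory resource group>" -TemplateFile ".\ARMTemplateForFactory.json" -TemplateParameterFile ".\ARMTemplateParametersForFactory.json" -Mode Incremental
The parameter file carries the environment-specific values, such as linked service connection strings, for the target factory.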
NOTE
Only the development factory is associated with a git repository. The test and production factories shouldn't have a git
repository associated with them and should only be updated via an Azure DevOps pipeline or via a Resource
Management template.
f. Select … next to the Template parameters box to choose the parameters file. Look for the file
ARMTemplateParametersForFactory.json in the folder of the adf_publish branch.
g. Select … next to the Override template parameters box, and enter the desired parameter values for
the target data factory. For credentials that come from Azure Key Vault, enter the secret's name between
double quotation marks. For example, if the secret's name is cred1, enter "$(cred1)" for this value.
h. Select Incremental for the Deployment mode.
WARNING
In Complete deployment mode, resources that exist in the resource group but aren't specified in the new Resource
Manager template will be deleted. For more information, please refer to Azure Resource Manager Deployment
Modes.
8. Save the release pipeline.
9. To trigger a release, select Create release. To automate the creation of releases, see Azure DevOps
release triggers.
IMPORTANT
In CI/CD scenarios, the integration runtime (IR) type in different environments must be the same. For example, if you have
a self-hosted IR in the development environment, the same IR must also be of type self-hosted in other environments,
such as test and production. Similarly, if you're sharing integration runtimes across multiple stages, you have to configure
the integration runtimes as linked self-hosted in all environments, such as development, test, and production.
When you use this method, the secret is pulled from the key vault automatically.
The parameters file needs to be in the publish branch as well.
2. Add an Azure Key Vault task before the Azure Resource Manager Deployment task described in the
previous section:
a. On the Tasks tab, create a new task. Search for Azure Key Vault and add it.
b. In the Key Vault task, select the subscription in which you created the key vault. Provide credentials
if necessary, and then select the key vault.
Deployment can fail if you try to update active triggers. To update active triggers, you need to manually stop
them and then restart them after the deployment. You can do this by using an Azure PowerShell task:
1. On the Tasks tab of the release, add an Azure PowerShell task. Choose the latest Azure PowerShell version
for the task version.
2. Select the subscription your factory is in.
3. Select Script File Path as the script type. This requires you to save your PowerShell script in your
repository. The following PowerShell script can be used to stop triggers:
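A minimal sketch of such a stop-triggers script, using the Az.DataFactory cmdlets ($ResourceGroupName and $DataFactoryName are assumed to be passed in as script arguments or pipeline variables):
# Retrieve all triggers in the factory and stop each one before deployment
$triggersADF = Get-AzDataFactoryV2Trigger -DataFactoryName $DataFactoryName -ResourceGroupName $ResourceGroupName
$triggersADF | ForEach-Object { Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $_.Name -Force }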
You can complete similar steps (with the Start-AzDataFactoryV2Trigger function) to restart the triggers after
deployment.
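For example, again as a sketch with the same assumed variables:
# Restart the previously stopped triggers after the deployment completes
$triggersADF | ForEach-Object { Start-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName -Name $_.Name -Force }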
The data factory team has provided a sample pre- and post-deployment script located at the bottom of this
article.
2. In your test and production data factories, select Import ARM Template. This action takes you to the
Azure portal, where you can import the exported template. Select Build your own template in the
editor to open the Resource Manager template editor.
3. Select Load file, and then select the generated Resource Manager template. This is the
arm_template.json file located in the .zip file exported in step 1.
4. In the settings section, enter the configuration values, like linked service credentials. When you're done,
select Purchase to deploy the Resource Manager template.
Use custom parameters with the Resource Manager template
If your development factory has an associated git repository, you can override the default Resource Manager
template parameters of the Resource Manager template generated by publishing or exporting the template. You
might want to override the default Resource Manager parameter configuration in these scenarios:
You use automated CI/CD and you want to change some properties during Resource Manager
deployment, but the properties aren't parameterized by default.
Your factory is so large that the default Resource Manager template is invalid because it has more than
the maximum allowed parameters (256).
To handle the custom parameter 256 limit, there are three options:
Use the custom parameter file and remove properties that don't need parameterization, that is, properties
that can keep a default value, which decreases the parameter count.
Refactor logic in the data flow to reduce parameters. For example, if pipeline parameters all have the
same value, you can just use global parameters instead.
Split one data factory into multiple data factories.
To override the default Resource Manager parameter configuration, go to the Manage hub and select ARM
template in the "Source control" section. Under the ARM parameter configuration section, click the Edit icon in
"Edit parameter configuration" to open the Resource Manager parameter configuration code editor.
NOTE
ARM parameter configuration is only enabled in "GIT mode". Currently it is disabled in "live mode" or "Data Factory"
mode.
Creating a custom Resource Manager parameter configuration creates a file named arm-template-
parameters-definition.json in the root folder of your git branch. You must use that exact file name.
When publishing from the collaboration branch, Data Factory will read this file and use its configuration to
generate which properties get parameterized. If no file is found, the default template is used.
When exporting a Resource Manager template, Data Factory reads this file from whichever branch you're
currently working on, not the collaboration branch. You can create or edit the file from a private branch, where
you can test your changes by selecting Export ARM Template in the UI. You can then merge the file into the
collaboration branch.
NOTE
A custom Resource Manager parameter configuration doesn't change the ARM template parameter limit of 256. It lets
you choose and decrease the number of parameterized properties.
Here's an explanation of how the preceding template is constructed, broken down by resource type.
Pipelines
Any property in the path activities/typeProperties/waitTimeInSeconds is parameterized. Any activity in a
pipeline that has a code-level property named waitTimeInSeconds (for example, the Wait activity) is
parameterized as a number, with a default name. But it won't have a default value in the Resource Manager
template. It will be a mandatory input during the Resource Manager deployment.
Similarly, a property called headers (for example, in a Web activity) is parameterized with type object
(JObject). It has a default value, which is the same value as that of the source factory.
IntegrationRuntimes
All properties under the path typeProperties are parameterized with their respective default values. For
example, there are two properties under IntegrationRuntimes type properties: computeProperties and
ssisProperties . Both property types are created with their respective default values and types (Object).
Triggers
Under typeProperties , two properties are parameterized. The first one is maxConcurrency , which is specified
to have a default value and is of type string . It has the default parameter name
<entityName>_properties_typeProperties_maxConcurrency .
The recurrence property also is parameterized. Under it, all properties at that level are specified to be
parameterized as strings, with default values and parameter names. An exception is the interval property,
which is parameterized as type int . The parameter name is suffixed with
<entityName>_properties_typeProperties_recurrence_triggerSuffix . Similarly, the freq property is a string
and is parameterized as a string. However, the freq property is parameterized without a default value. The
name is shortened and suffixed. For example, <entityName>_freq .
LinkedServices
Linked services are unique. Because linked services and datasets have a wide range of types, you can provide
type-specific customization. In this example, for all linked services of type AzureDataLakeStore , a specific
template will be applied. For all others (via * ), a different template will be applied.
The connectionString property will be parameterized as a securestring value. It won't have a default value.
It will have a shortened parameter name that's suffixed with connectionString .
The property secretAccessKey happens to be an AzureKeyVaultSecret (for example, in an Amazon S3 linked
service). It's automatically parameterized as an Azure Key Vault secret and fetched from the configured key
vault. You can also parameterize the key vault itself.
Datasets
Although type-specific customization is available for datasets, you can provide configuration without
explicitly having a *-level configuration. In the preceding example, all dataset properties under
typeProperties are parameterized.
NOTE
Azure alerts and matrices, if configured for a pipeline, are not currently supported as parameters for ARM
deployments. To reapply the alerts and matrices in the new environment, please follow Data Factory Monitoring, Alerts and
Matrices.
{
"Microsoft.DataFactory/factories": {
"properties": {
"globalParameters": {
"*": {
"value": "="
}
}
},
"location": "="
},
"Microsoft.DataFactory/factories/pipelines": {
"Microsoft.DataFactory/factories/pipelines": {
},
"Microsoft.DataFactory/factories/dataflows": {
},
"Microsoft.DataFactory/factories/integrationRuntimes":{
"properties": {
"typeProperties": {
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "=",
"catalogAdminUserName": "=",
"catalogAdminPassword": {
"value": "-::secureString"
}
},
"customSetupScriptProperties": {
"sasToken": {
"value": "-::secureString"
}
}
},
"linkedInfo": {
"key": {
"value": "-::secureString"
},
"resourceId": "="
},
"computeProperties": {
"dataFlowProperties": {
"externalComputeInfo": [{
"accessToken": "-::secureString"
}
]
}
}
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"pipelines": [{
"parameters": {
"*": "="
}
},
"pipelineReference.referenceName"
],
"pipeline": {
"parameters": {
"*": "="
}
},
"typeProperties": {
"scope": "="
}
}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"userName": "=",
"accessKeyId": "=",
"servicePrincipalId": "=",
"userId": "=",
"host": "=",
"clientId": "=",
"clusterUserName": "=",
"clusterSshUserName": "=",
"hostSubscriptionId": "=",
"clusterResourceGroup": "=",
"subscriptionId": "=",
"resourceGroupName": "=",
"tenant": "=",
"dataLakeStoreUri": "=",
"baseUrl": "=",
"database": "=",
"serviceEndpoint": "=",
"batchUri": "=",
"poolName": "=",
"databaseName": "=",
"systemNumber": "=",
"server": "=",
"url":"=",
"functionAppUrl":"=",
"environmentUrl": "=",
"aadResourceId": "=",
"sasUri": "|:-sasUri:secureString",
"sasToken": "|",
"connectionString": "|:-connectionString:secureString",
"hostKeyFingerprint": "="
}
}
},
"Odbc": {
"properties": {
"typeProperties": {
"userName": "=",
"connectionString": {
"secretName": "="
}
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"*": {
"properties": {
"typeProperties": {
"folderPath": "=",
"fileName": "="
}
}
}
},
"Microsoft.DataFactory/factories/managedVirtualNetworks/managedPrivateEndpoints": {
"properties": {
"*": "="
}
}
}
{
"Microsoft.DataFactory/factories": {
"properties": {
"globalParameters": {
"*": {
"*": {
"value": "="
}
}
},
"location": "="
},
"Microsoft.DataFactory/factories/pipelines": {
},
"Microsoft.DataFactory/factories/dataflows": {
},
"Microsoft.DataFactory/factories/integrationRuntimes":{
"properties": {
"typeProperties": {
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "=",
"catalogAdminUserName": "=",
"catalogAdminPassword": {
"value": "-::secureString"
}
},
"customSetupScriptProperties": {
"sasToken": {
"value": "-::secureString"
}
}
},
"linkedInfo": {
"key": {
"value": "-::secureString"
},
"resourceId": "="
}
}
}
},
"Microsoft.DataFactory/factories/triggers": {
"properties": {
"pipelines": [{
"parameters": {
"*": "="
}
},
"pipelineReference.referenceName"
],
"pipeline": {
"parameters": {
"*": "="
}
},
"typeProperties": {
"scope": "="
}
}
},
"Microsoft.DataFactory/factories/linkedServices": {
"*": {
"properties": {
"typeProperties": {
"accountName": "=",
"username": "=",
"userName": "=",
"accessKeyId": "=",
"servicePrincipalId": "=",
"userId": "=",
"clientId": "=",
"clusterUserName": "=",
"clusterSshUserName": "=",
"clusterSshUserName": "=",
"hostSubscriptionId": "=",
"clusterResourceGroup": "=",
"subscriptionId": "=",
"resourceGroupName": "=",
"tenant": "=",
"dataLakeStoreUri": "=",
"baseUrl": "=",
"database": "=",
"serviceEndpoint": "=",
"batchUri": "=",
"poolName": "=",
"databaseName": "=",
"systemNumber": "=",
"server": "=",
"url":"=",
"aadResourceId": "=",
"connectionString": "|:-connectionString:secureString",
"existingClusterId": "-"
}
}
},
"Odbc": {
"properties": {
"typeProperties": {
"userName": "=",
"connectionString": {
"secretName": "="
}
}
}
}
},
"Microsoft.DataFactory/factories/datasets": {
"*": {
"properties": {
"typeProperties": {
"folderPath": "=",
"fileName": "="
}
}
}}
}
WARNING
If you do not use the latest versions of PowerShell and the Data Factory module, you may run into deserialization errors while
running the commands.
The following sample script can be used to stop triggers before deployment and restart them afterward. The
script also includes code to delete resources that have been removed. Save the script in an Azure DevOps Git
repository and reference it via an Azure PowerShell task, using the latest Azure PowerShell version.
When running a pre-deployment script, you will need to specify a variation of the following parameters in the
Script Arguments field.
-armTemplate "$(System.DefaultWorkingDirectory)/<your-arm-template-location>" -ResourceGroupName <your-
resource-group-name> -DataFactoryName <your-data-factory-name> -predeployment $true -deleteDeployment $false
When running a post-deployment script, you will need to specify a variation of the following parameters in the
Script Arguments field.
-armTemplate "$(System.DefaultWorkingDirectory)/<your-arm-template-location>" -ResourceGroupName <your-
resource-group-name> -DataFactoryName <your-data-factory-name> -predeployment $false -deleteDeployment $true
NOTE
The -deleteDeployment flag is used to specify the deletion of the ADF deployment entry from the deployment history
in ARM.
Here is the script that can be used for pre- and post-deployment. It accounts for deleted resources and resource
references.
param
(
[parameter(Mandatory = $false)] [String] $armTemplate,
[parameter(Mandatory = $false)] [String] $ResourceGroupName,
[parameter(Mandatory = $false)] [String] $DataFactoryName,
[parameter(Mandatory = $false)] [Bool] $predeployment=$true,
[parameter(Mandatory = $false)] [Bool] $deleteDeployment=$false
)
function getPipelineDependencies {
param([System.Object] $activity)
if ($activity.Pipeline) {
return @($activity.Pipeline.ReferenceName)
} elseif ($activity.Activities) {
$result = @()
$activity.Activities | ForEach-Object{ $result += getPipelineDependencies -activity $_ }
return $result
} elseif ($activity.ifFalseActivities -or $activity.ifTrueActivities) {
$result = @()
$activity.ifFalseActivities | Where-Object {$_ -ne $null} | ForEach-Object{ $result +=
getPipelineDependencies -activity $_ }
$activity.ifTrueActivities | Where-Object {$_ -ne $null} | ForEach-Object{ $result +=
getPipelineDependencies -activity $_ }
return $result
} elseif ($activity.defaultActivities) {
$result = @()
$activity.defaultActivities | ForEach-Object{ $result += getPipelineDependencies -activity $_ }
if ($activity.cases) {
$activity.cases | ForEach-Object{ $_.activities } | ForEach-Object{$result +=
getPipelineDependencies -activity $_ }
}
return $result
} else {
return @()
}
}
function pipelineSortUtil {
param([Microsoft.Azure.Commands.DataFactoryV2.Models.PSPipeline]$pipeline,
[Hashtable] $pipelineNameResourceDict,
[Hashtable] $visited,
[System.Collections.Stack] $sortedList)
if ($visited[$pipeline.Name] -eq $true) {
return;
}
$visited[$pipeline.Name] = $true;
$pipeline.Activities | ForEach-Object{ getPipelineDependencies -activity $_ -pipelineNameResourceDict
$pipelineNameResourceDict} | ForEach-Object{
pipelineSortUtil -pipeline $pipelineNameResourceDict[$_] -pipelineNameResourceDict
$pipelineNameResourceDict -visited $visited -sortedList $sortedList
}
$sortedList.Push($pipeline)
}
function Get-SortedPipelines {
param(
[string] $DataFactoryName,
[string] $ResourceGroupName
)
$pipelines = Get-AzDataFactoryV2Pipeline -DataFactoryName $DataFactoryName -ResourceGroupName
$ResourceGroupName
$ppDict = @{}
$visited = @{}
$stack = new-object System.Collections.Stack
$pipelines | ForEach-Object{ $ppDict[$_.Name] = $_ }
$pipelines | ForEach-Object{ pipelineSortUtil -pipeline $_ -pipelineNameResourceDict $ppDict -visited
$visited -sortedList $stack }
# Unwind the stack to produce the dependency-ordered list of pipelines
$sortedList = New-Object Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSPipeline]($stack.Count)
while ($stack.Count -gt 0) {
$sortedList.Add($stack.Pop())
}
$sortedList
}
function triggerSortUtil {
param([Microsoft.Azure.Commands.DataFactoryV2.Models.PSTrigger]$trigger,
[Hashtable] $triggerNameResourceDict,
[Hashtable] $visited,
[System.Collections.Stack] $sortedList)
if ($visited[$trigger.Name] -eq $true) {
return;
}
$visited[$trigger.Name] = $true;
if ($trigger.Properties.DependsOn) {
$trigger.Properties.DependsOn | Where-Object {$_ -and $_.ReferenceTrigger} | ForEach-Object{
triggerSortUtil -trigger $triggerNameResourceDict[$_.ReferenceTrigger.ReferenceName] -
triggerNameResourceDict $triggerNameResourceDict -visited $visited -sortedList $sortedList
}
}
$sortedList.Push($trigger)
}
function Get-SortedTriggers {
param(
[string] $DataFactoryName,
[string] $ResourceGroupName
)
$triggers = Get-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName
$triggerDict = @{}
$visited = @{}
$stack = new-object System.Collections.Stack
$triggers | ForEach-Object{ $triggerDict[$_.Name] = $_ }
$triggers | ForEach-Object{ triggerSortUtil -trigger $_ -triggerNameResourceDict $triggerDict -visited
$visited -sortedList $stack }
# Unwind the stack to produce the dependency-ordered list of triggers
$sortedList = New-Object Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSTrigger]($stack.Count)
while ($stack.Count -gt 0) {
$sortedList.Add($stack.Pop())
}
$sortedList
}
function Get-SortedLinkedServices {
param(
[string] $DataFactoryName,
[string] $ResourceGroupName
)
$linkedServices = Get-AzDataFactoryV2LinkedService -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName
$LinkedServiceHasDependencies = @('HDInsightLinkedService', 'HDInsightOnDemandLinkedService',
'AzureBatchLinkedService')
$Akv = 'AzureKeyVaultLinkedService'
$HighOrderList = New-Object
Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSLinkedService]
$RegularList = New-Object
Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSLinkedService]
$AkvList = New-Object
Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSLinkedService]
$linkedServices | ForEach-Object {
if ($_.Properties.GetType().Name -in $LinkedServiceHasDependencies) {
$HighOrderList.Add($_)
}
elseif ($_.Properties.GetType().Name -eq $Akv) {
$AkvList.Add($_)
}
else {
$RegularList.Add($_)
}
}
$SortedList = New-Object
Collections.Generic.List[Microsoft.Azure.Commands.DataFactoryV2.Models.PSLinkedService]($HighOrderList.Count
+ $RegularList.Count + $AkvList.Count)
$SortedList.AddRange($HighOrderList)
$SortedList.AddRange($RegularList)
$SortedList.AddRange($AkvList)
$SortedList
}
#Triggers
Write-Host "Getting triggers"
$triggersInTemplate = $resources | Where-Object { $_.type -eq "Microsoft.DataFactory/factories/triggers" }
$triggerNamesInTemplate = $triggersInTemplate | ForEach-Object {$_.name.Substring(37, $_.name.Length-40)}
#Delete resources
Write-Host "Deleting triggers"
$triggersToDelete | ForEach-Object {
Write-Host "Deleting trigger " $_.Name
$trig = Get-AzDataFactoryV2Trigger -name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName
if ($trig.RuntimeState -eq "Started") {
if ($_.TriggerType -eq "BlobEventsTrigger" -or $_.TriggerType -eq "CustomEventsTrigger") {
Write-Host "Unsubscribing trigger" $_.Name "from events"
$status = Remove-AzDataFactoryV2TriggerSubscription -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Name $_.Name
while ($status.Status -ne "Disabled"){
Start-Sleep -s 15
$status = Get-AzDataFactoryV2TriggerSubscriptionStatus -ResourceGroupName
$ResourceGroupName -DataFactoryName $DataFactoryName -Name $_.Name
}
}
Stop-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Name $_.Name -Force
}
Remove-AzDataFactoryV2Trigger -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting pipelines"
$deletedpipelines | ForEach-Object {
Write-Host "Deleting pipeline " $_.Name
Remove-AzDataFactoryV2Pipeline -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting dataflows"
$deleteddataflow | ForEach-Object {
Write-Host "Deleting dataflow " $_.Name
Remove-AzDataFactoryV2DataFlow -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting datasets"
$deleteddataset | ForEach-Object {
Write-Host "Deleting dataset " $_.Name
Remove-AzDataFactoryV2Dataset -Name $_.Name -ResourceGroupName $ResourceGroupName -DataFactoryName
$DataFactoryName -Force
}
Write-Host "Deleting linked services"
$deletedlinkedservices | ForEach-Object {
Write-Host "Deleting Linked Service " $_.Name
Remove-AzDataFactoryV2LinkedService -Name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Force
}
Write-Host "Deleting integration runtimes"
$deletedintegrationruntimes | ForEach-Object {
Write-Host "Deleting integration runtime " $_.Name
Remove-AzDataFactoryV2IntegrationRuntime -Name $_.Name -ResourceGroupName $ResourceGroupName -
DataFactoryName $DataFactoryName -Force
}
$deploymentsToDelete | ForEach-Object {
Write-host "Deleting inner deployment: " $_.properties.targetResource.id
Remove-AzResourceGroupDeployment -Id $_.properties.targetResource.id
}
Write-Host "Deleting deployment: " $deploymentName
Remove-AzResourceGroupDeployment -ResourceGroupName $ResourceGroupName -Name $deploymentName
}
Overview
Continuous integration is the practice of testing each change made to your codebase automatically and as early as
possible. Continuous delivery follows the testing that happens during continuous integration and pushes
changes to a staging or production system.
In Azure Data Factory, continuous integration and continuous delivery (CI/CD) means moving Data Factory
pipelines from one environment, such as development, test, and production, to another. Data Factory uses Azure
Resource Manager templates (ARM templates) to store the configuration of your various Data Factory entities,
such as pipelines, datasets, and data flows.
There are two suggested methods to promote a data factory to another environment:
Automated deployment using the integration of Data Factory with Azure Pipelines.
Manually uploading an ARM template by using Data Factory user experience integration with Azure Resource
Manager.
For more information, see Continuous integration and delivery in Azure Data Factory.
This article focuses on the continuous deployment improvements and the automated publish feature for CI/CD.
NOTE
You can continue to use the existing mechanism, which is the adf_publish branch, or you can use the new flow. Both
are supported.
Package overview
Two commands are currently available in the package:
Export ARM template
Validate
Export ARM template
Run npm run start export <rootFolder> <factoryId> [outputFolder] to export the ARM template by using the
resources of a given folder. This command also runs a validation check prior to generating the ARM template.
Here's an example:
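Using placeholder values for the folder and the factory resource ID:
npm run start export C:\DataFactories\DevDataFactory /subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<dfName> ArmTemplateOutput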
RootFolder is a mandatory field that represents where the Data Factory resources are located.
FactoryId is a mandatory field that represents the Data Factory resource ID in the format
/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<dfName> .
OutputFolder is an optional parameter that specifies the relative path to save the generated ARM template.
NOTE
The ARM template generated isn't published to the live version of the factory. Deployment should be done by using a
CI/CD pipeline.
Validate
Run npm run start validate <rootFolder> <factoryId> to validate all the resources of a given folder. Here's an
example:
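Again with placeholder values:
npm run start validate C:\DataFactories\DevDataFactory /subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<dfName>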
RootFolder is a mandatory field that represents where the Data Factory resources are located.
FactoryId is a mandatory field that represents the Data Factory resource ID in the format
/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.DataFactory/factories/<dfName> .
2. Select the repository where you want to save your pipeline YAML script. We recommend saving it in a
build folder in the same repository as your Data Factory resources. Ensure there's a package.json file in
the repository that contains the package name, as shown in the following example:
{
"scripts":{
"build":"node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
},
"dependencies":{
"@microsoft/azure-data-factory-utilities":"^0.1.5"
}
}
3. Select Starter pipeline. If you've uploaded or merged the YAML file, as shown in the following example,
you can also point directly at that and edit it.
# Sample YAML file to validate and export an ARM template into a build artifact
# Requires a package.json file located in the target repository
trigger:
- main #collaboration branch
pool:
vmImage: 'ubuntu-latest'
steps:
# Installs Node and the npm packages saved in your package.json file in the build
- task: NodeTool@0
inputs:
versionSpec: '10.x'
displayName: 'Install Node.js'
- task: Npm@1
inputs:
command: 'install'
workingDir: '$(Build.Repository.LocalPath)/<folder-of-the-package.json-file>' #replace with the
package.json folder
verbose: true
displayName: 'Install npm package'
# Validates all of the Data Factory resources in the repository. You'll get the same validation
errors as when "Validate All" is selected.
# Enter the appropriate subscription and name for the source factory.
- task: Npm@1
inputs:
command: 'custom'
workingDir: '$(Build.Repository.LocalPath)/<folder-of-the-package.json-file>' #replace with the
package.json folder
customCommand: 'run build validate $(Build.Repository.LocalPath) /subscriptions/xxxxxxxx-xxxx-
xxxx-xxxx-
xxxxxxxxxxxx/resourceGroups/testResourceGroup/providers/Microsoft.DataFactory/factories/yourFactoryNa
me'
displayName: 'Validate'
# Validate and then generate the ARM template into the destination folder, which is the same as
selecting "Publish" from the UX.
# The ARM template generated isn't published to the live version of the factory. Deployment should be
done by using a CI/CD pipeline.
- task: Npm@1
inputs:
command: 'custom'
workingDir: '$(Build.Repository.LocalPath)/<folder-of-the-package.json-file>' #replace with the
package.json folder
customCommand: 'run build export $(Build.Repository.LocalPath) /subscriptions/xxxxxxxx-xxxx-xxxx-
xxxx-
xxxxxxxxxxxx/resourceGroups/testResourceGroup/providers/Microsoft.DataFactory/factories/yourFactoryNa
me "ArmTemplate"'
displayName: 'Validate and Generate ARM template'
- task: PublishPipelineArtifact@1
inputs:
targetPath: '$(Build.Repository.LocalPath)/<folder-of-the-package.json-file>/ArmTemplate'
#replace with the package.json folder
artifact: 'ArmTemplates'
publishLocation: 'pipeline'
4. Enter your YAML code. We recommend that you use the YAML file as a starting point.
5. Save and run. If you used the YAML, it gets triggered every time the main branch is updated.
Next steps
Learn more information about continuous integration and delivery in Data Factory: Continuous integration and
delivery in Azure Data Factory.
Azure Data Factory connector overview
6/1/2021 • 4 minutes to read • Edit Online
CATEGORY  DATA STORE  COPY ACTIVITY (SOURCE/SINK)  MAPPING DATA FLOW (SOURCE/SINK)  LOOKUP ACTIVITY  GET METADATA ACTIVITY/VALIDATION ACTIVITY  DELETE ACTIVITY
Azure Cognitive Search Index −/✓
Azure Cosmos DB's API for MongoDB ✓/✓
Azure Database for MariaDB ✓/− ✓
DB2 ✓/− ✓
Drill ✓/− ✓
Google BigQuery ✓/− ✓
Greenplum ✓/− ✓
HBase ✓/− ✓
Apache Impala ✓/− ✓
Informix ✓/✓ ✓
MariaDB ✓/− ✓
Microsoft Access ✓/✓ ✓
MySQL ✓/− ✓
Netezza ✓/− ✓
Oracle ✓/✓ ✓
Phoenix ✓/− ✓
PostgreSQL ✓/− ✓
Presto (Preview) ✓/− ✓
Spark ✓/− ✓
Sybase ✓/− ✓
Teradata ✓/− ✓
Vertica ✓/− ✓
Couchbase (Preview) ✓/− ✓
MongoDB ✓/✓
MongoDB Atlas ✓/✓
Amazon S3 Compatible Storage ✓/− ✓ ✓ ✓
FTP ✓/− ✓ ✓ ✓
HDFS ✓/− ✓ ✓
SFTP ✓/✓ ✓ ✓ ✓
Generic OData ✓/− ✓
Generic ODBC ✓/✓ ✓
Concur (Preview) ✓/− ✓
Dataverse ✓/✓ ✓
Dynamics 365 ✓/✓ ✓
Dynamics AX ✓/− ✓
Dynamics CRM ✓/✓ ✓
Google AdWords ✓/− ✓
HubSpot (Preview) ✓/− ✓
Jira ✓/− ✓
Magento (Preview) ✓/− ✓
Marketo (Preview) ✓/− ✓
Oracle Responsys (Preview) ✓/− ✓
PayPal (Preview) ✓/− ✓
QuickBooks (Preview) ✓/− ✓
Salesforce ✓/✓ ✓
Salesforce Service Cloud ✓/✓ ✓
Salesforce Marketing Cloud ✓/− ✓
ServiceNow ✓/− ✓
SharePoint Online List ✓/− ✓
Shopify (Preview) ✓/− ✓
Square (Preview) ✓/− ✓
Xero ✓/− ✓
Zoho (Preview) ✓/− ✓
NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a dependency
on preview connectors in your solution, please contact Azure support.
Next steps
Copy activity
Mapping Data Flow
Lookup Activity
Get Metadata Activity
Delete Activity
Copy data from Amazon Marketplace Web Service
using Azure Data Factory
5/6/2021 • 3 minutes to read • Edit Online
Supported capabilities
This Amazon Marketplace Web Service connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Amazon Marketplace Web Service to any supported sink data store. For a list of data
stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon Marketplace Web Service connector.
Example:
{
"name": "AmazonMWSLinkedService",
"properties": {
"type": "AmazonMWS",
"typeProperties": {
"endpoint" : "mws.amazonservices.com",
"marketplaceID" : "A2EUQ1WTGCTBG2",
"sellerID" : "<sellerID>",
"mwsAuthToken": {
"type": "SecureString",
"value": "<mwsAuthToken>"
},
"accessKeyId" : "<accessKeyId>",
"secretKey": {
"type": "SecureString",
"value": "<secretKey>"
}
}
}
}
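As a sketch of deploying this definition with Azure PowerShell, assuming the JSON above is saved locally as AmazonMWSLinkedService.json and the resource group and factory names below are placeholders:
# Create or update the linked service from the JSON definition file
Set-AzDataFactoryV2LinkedService -ResourceGroupName "<rg>" -DataFactoryName "<factory>" -Name "AmazonMWSLinkedService" -DefinitionFile ".\AmazonMWSLinkedService.json"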
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Amazon Marketplace Web Service dataset.
To copy data from Amazon Marketplace Web Service, set the type property of the dataset to
AmazonMWSObject . The following properties are supported:
Example
{
"name": "AmazonMWSDataset",
"properties": {
"type": "AmazonMWSObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<AmazonMWS linked service name>",
"type": "LinkedServiceReference"
}
}
}
query — Use the custom SQL query to read data. For example:
"SELECT * FROM Orders where Amazon_Order_Id = 'xx'". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromAmazonMWS",
"type": "Copy",
"inputs": [
{
"referenceName": "<AmazonMWS input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AmazonMWSSource",
"query": "SELECT * FROM Orders where Amazon_Order_Id = 'xx'"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Amazon Redshift using Azure Data
Factory
Supported capabilities
This Amazon Redshift connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Amazon Redshift to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Amazon Redshift connector supports retrieving data from Redshift using query or built-in
Redshift UNLOAD support.
TIP
To achieve the best performance when copying large amounts of data from Redshift, consider using the built-in Redshift
UNLOAD through Amazon S3. See Use UNLOAD to copy data from Amazon Redshift section for details.
Prerequisites
If you are copying data to an on-premises data store using a self-hosted integration runtime, grant the integration
runtime (use the IP address of the machine) access to the Amazon Redshift cluster. See Authorize access to the
cluster for instructions.
If you are copying data to an Azure data store, see Azure Data Center IP Ranges for the Compute IP address
and SQL ranges used by the Azure data centers.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon Redshift connector.
Linked service properties
The following properties are supported for Amazon Redshift linked service:
port — The number of the TCP port that the Amazon Redshift server uses to listen for client connections.
Required: No (default is 5439).
Example:
{
"name": "AmazonRedshiftLinkedService",
"properties":
{
"type": "AmazonRedshift",
"typeProperties":
{
"server": "<server name>",
"database": "<database name>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Amazon Redshift dataset.
To copy data from Amazon Redshift, the following properties are supported:
tableName — Name of the table with schema. This property is supported for backward compatibility. Use schema and
table for new workloads. Required: No (if "query" in the activity source is specified).
Example
{
"name": "AmazonRedshiftDataset",
"properties":
{
"type": "AmazonRedshiftTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Amazon Redshift linked service name>",
"type": "LinkedServiceReference"
}
}
}
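The example above leaves typeProperties empty. To point the dataset at a specific table by using the newer schema and table properties mentioned earlier, you can fill it in as in the following sketch (the schema and table names are placeholders):
{
    "name": "AmazonRedshiftDataset",
    "properties":
    {
        "type": "AmazonRedshiftTable",
        "typeProperties": {
            "schema": "<schema name>",
            "table": "<table name>"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<Amazon Redshift linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}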
If you were using the RelationalTable typed dataset, it is still supported as-is, but we suggest that you use the
new one going forward.
query — Use the custom query to read data. For example: select * from MyTable. Required: No (if "tableName" in
dataset is specified).
"source": {
"type": "AmazonRedshiftSource",
"query": "<SQL query>",
"redshiftUnloadSettings": {
"s3LinkedServiceName": {
"referenceName": "<Amazon S3 linked service>",
"type": "LinkedServiceReference"
},
"bucketName": "bucketForUnload"
}
}
Learn more about how to use UNLOAD to copy data from Amazon Redshift efficiently in the next section.
AMAZON REDSHIFT DATA TYPE | DATA FACTORY INTERIM DATA TYPE
BIGINT | Int64
BOOLEAN | String
CHAR | String
DATE | DateTime
DECIMAL | Decimal
INTEGER | Int32
REAL | Single
SMALLINT | Int16
TEXT | String
TIMESTAMP | DateTime
VARCHAR | String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Amazon Simple Storage Service by
using Azure Data Factory
TIP
To learn more about the data migration scenario from Amazon S3 to Azure Storage, see Use Azure Data Factory to
migrate data from Amazon S3 to Azure Storage.
Supported capabilities
This Amazon S3 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this Amazon S3 connector supports copying files as is or parsing files with the supported file
formats and compression codecs. You can also choose to preserve file metadata during copy. The connector
uses AWS Signature Version 4 to authenticate requests to S3.
TIP
If you want to copy data from any S3-compatible storage provider, see Amazon S3 Compatible Storage.
Required permissions
To copy data from Amazon S3, make sure you've been granted the following permissions for Amazon S3 object
operations: s3:GetObject and s3:GetObjectVersion .
If you use Data Factory UI to author, additional s3:ListAllMyBuckets and s3:ListBucket / s3:GetBucketLocation
permissions are required for operations like testing connection to linked service and browsing from root. If you
don't want to grant these permissions, you can choose "Test connection to file path" or "Browse from specified
path" options from the UI.
For the full list of Amazon S3 permissions, see Specifying Permissions in a Policy on the AWS site.
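As a rough illustration only, an IAM policy scoped to a single bucket that grants these permissions might look like the following sketch (the bucket name is a placeholder; adjust the actions and resources to your own security requirements):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [ "s3:GetObject", "s3:GetObjectVersion" ],
            "Resource": "arn:aws:s3:::<your-bucket-name>/*"
        },
        {
            "Effect": "Allow",
            "Action": [ "s3:ListBucket", "s3:GetBucketLocation" ],
            "Resource": "arn:aws:s3:::<your-bucket-name>"
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        }
    ]
}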
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon S3.
{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AmazonS3",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AmazonS3LinkedService",
"properties": {
"type": "AmazonS3",
"typeProperties": {
"authenticationType": "TemporarySecurityCredentials",
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
},
"sessionToken": {
"type": "SecureString",
"value": "<session token>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Amazon S3 under location settings in a format-based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AmazonS3Location",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
Additional settings:
PROPERTY | DESCRIPTION | REQUIRED
Example:
"activities":[
{
"name": "CopyFromAmazonS3",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AmazonS3ReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
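Besides wildcard filters, the AmazonS3ReadSettings block in the source can also filter files by their last-modified time (or, alternatively, read an explicit file list via fileListPath). A hedged fragment with placeholder timestamps might look like this:
"storeSettings":{
    "type": "AmazonS3ReadSettings",
    "recursive": true,
    "modifiedDatetimeStart": "<start datetime in UTC, e.g. 2019-02-13T00:00:00Z>",
    "modifiedDatetimeEnd": "<end datetime in UTC, e.g. 2019-02-14T00:00:00Z>"
}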
BUCKET | KEY | RECURSIVE | SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN BOLD ARE RETRIEVED)
Legacy models
NOTE
The following models are still supported as is for backward compatibility. We suggest that you use the new model
mentioned earlier. The Data Factory authoring UI has switched to generating the new model.
bucketName — The S3 bucket name. The wildcard filter is not supported. Required: Yes for the Copy or Lookup
activity, no for the GetMetadata activity.
format — If you want to copy files as is between file-based stores (binary copy), skip the format section in both
input and output dataset definitions. Required: No (only for binary copy scenario).
{
"name": "AmazonS3Dataset",
"properties": {
"type": "AmazonS3Object",
"linkedServiceName": {
"referenceName": "<Amazon S3 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"bucketName": "testbucket",
"prefix": "testFolder/test",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Example:
"activities":[
{
"name": "CopyFromAmazonS3",
"type": "Copy",
"inputs": [
{
"referenceName": "<Amazon S3 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores that the Copy activity in Azure Data Factory supports as sources and sinks, see
Supported data stores.
Copy data from Amazon S3 Compatible Storage by
using Azure Data Factory
Supported capabilities
This Amazon S3 Compatible Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this Amazon S3 Compatible Storage connector supports copying files as is or parsing files with the
supported file formats and compression codecs. The connector uses AWS Signature Version 4 to authenticate
requests to S3. You can use this Amazon S3 Compatible Storage connector to copy data from any S3-compatible
storage provider. Specify the corresponding service URL in the linked service configuration.
Required permissions
To copy data from Amazon S3 Compatible Storage, make sure you've been granted the following permissions
for Amazon S3 object operations: s3:GetObject and s3:GetObjectVersion .
If you use Data Factory UI to author, additional s3:ListAllMyBuckets and s3:ListBucket / s3:GetBucketLocation
permissions are required for operations like testing connection to linked service and browsing from root. If you
don't want to grant these permissions, you can choose "Test connection to file path" or "Browse from specified
path" options from the UI.
For the full list of Amazon S3 permissions, see Specifying Permissions in a Policy on the AWS site.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Amazon S3 Compatible Storage.
Linked service properties
The following properties are supported for an Amazon S3 Compatible linked service:
Example:
{
"name": "AmazonS3CompatibleLinkedService",
"properties": {
"type": "AmazonS3Compatible",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
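The service URL mentioned earlier is set on the linked service itself through the serviceUrl type property. For example, a linked service that points at a self-hosted S3-compatible endpoint might look like the following sketch (the endpoint URL is a placeholder, and whether you need forcePathStyle depends on your storage provider):
{
    "name": "AmazonS3CompatibleLinkedService",
    "properties": {
        "type": "AmazonS3Compatible",
        "typeProperties": {
            "serviceUrl": "https://<service url>",
            "forcePathStyle": true,
            "accessKeyId": "<access key id>",
            "secretAccessKey": {
                "type": "SecureString",
                "value": "<secret access key>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}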
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Amazon S3 Compatible under location settings in a format-based
dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Amazon S3 Compatible Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AmazonS3CompatibleLocation",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
Additional settings:
PROPERTY | DESCRIPTION | REQUIRED
Example:
"activities":[
{
"name": "CopyFromAmazonS3CompatibleStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AmazonS3CompatibleReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
BUCKET | KEY | RECURSIVE | SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN BOLD ARE RETRIEVED)
Next steps
For a list of data stores that the Copy activity in Azure Data Factory supports as sources and sinks, see
Supported data stores.
Avro format in Azure Data Factory
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Avro dataset.
NOTE
White space in column name is not supported for Avro files.
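For illustration, a typical Avro dataset stored on Azure Blob storage might look like the following sketch (the linked service name, location, and compression codec are placeholder assumptions):
{
    "name": "AvroDataset",
    "properties": {
        "type": "Avro",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder"
            },
            "avroCompressionCodec": "snappy"
        }
    }
}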
Avro as sink
The following properties are supported in the copy activity sink section.
NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Sink properties
The table below lists the properties supported by an Avro sink. You can edit these properties in the Settings tab.
NAME | DESCRIPTION | REQUIRED | ALLOWED VALUES | DATA FLOW SCRIPT PROPERTY
Next steps
Copy activity overview
Lookup activity
GetMetadata activity
Copy and transform data in Azure Blob storage by
using Azure Data Factory
TIP
To learn about a migration scenario for a data lake or a data warehouse, see Use Azure Data Factory to migrate data from
your data lake or data warehouse to Azure.
Supported capabilities
This Azure Blob storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Delete activity
For the Copy activity, this Blob storage connector supports:
Copying blobs to and from general-purpose Azure storage accounts and hot/cool blob storage.
Copying blobs by using an account key, a service shared access signature (SAS), a service principal, or
managed identities for Azure resource authentications.
Copying blobs from block, append, or page blobs and copying data to only block blobs.
Copying blobs as is, or parsing or generating blobs with supported file formats and compression codecs.
Preserving file metadata during copy.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Blob storage.
Linked service properties
This Blob storage connector supports the following authentication types. See the corresponding sections for
details.
Account key authentication
Shared access signature authentication
Service principal authentication
Managed identities for Azure resource authentication
NOTE
If you want to use the public Azure integration runtime to connect to your Blob storage by leveraging the Allow trusted
Microsoft services to access this storage account option enabled on Azure Storage firewall, you must use
managed identity authentication.
When you use PolyBase or COPY statement to load data into Azure Synapse Analytics, if your source or staging Blob
storage is configured with an Azure Virtual Network endpoint, you must use managed identity authentication as
required by Synapse. See the Managed identity authentication section for more configuration prerequisites.
NOTE
Azure HDInsight and Azure Machine Learning activities only support authentication that uses Azure Blob storage account
keys.
NOTE
If you're using the AzureStorage type linked service, it's still supported as is. But we suggest that you use the new
AzureBlobStorage linked service type going forward.
Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;",
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
Data Factory now supports both service shared access signatures and account shared access signatures. For more
information about shared access signatures, see Grant limited access to Azure Storage resources using shared access
signatures.
In later dataset configurations, the folder path is the absolute path starting from the container level. You need to
configure one aligned with the path in your SAS URI.
Data Factory supports the following properties for using shared access signature authentication:
NOTE
If you're using the AzureStorage type linked service, it's still supported as is. But we suggest that you use the new
AzureBlobStorage linked service type going forward.
Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource e.g.
https://<accountname>.blob.core.windows.net/?sv=<storage version>&st=<start time>&se=<expire time>&sr=
<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource without token e.g.
https://<accountname>.blob.core.windows.net/>"
},
"sasToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName with value of SAS token e.g. ?sv=<storage version>&st=<start
time>&se=<expire time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
When you create a shared access signature URI, consider the following points:
Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is
used in your data factory.
Set Expiry time appropriately. Make sure that the access to Storage objects doesn't expire within the active
period of the pipeline.
The URI should be created at the right container or blob based on the need. A shared access signature URI to
a blob allows Data Factory to access that particular blob. A shared access signature URI to a Blob storage
container allows Data Factory to iterate through blobs in that container. To provide access to more or fewer
objects later, or to update the shared access signature URI, remember to update the linked service with the
new URI.
Service principal authentication
For general information about Azure Storage service principal authentication, see Authenticate access to Azure
Storage using Azure Active Directory.
To use service principal authentication, follow these steps:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application
with an Azure AD tenant. Make note of these values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Grant the service principal proper permission in Azure Blob storage. For more information on the roles,
see Use the Azure portal to assign an Azure role for access to blob and queue data.
As source, in Access control (IAM), grant at least the Storage Blob Data Reader role.
As sink, in Access control (IAM), grant at least the Storage Blob Data Contributor role.
These properties are supported for an Azure Blob storage linked service:
NOTE
If your blob account enables soft delete, service principal authentication is not supported in Data Flow.
If you access blob storage through a private endpoint using Data Flow, note that when service principal authentication
is used, Data Flow connects to the ADLS Gen2 endpoint instead of the Blob endpoint. Make sure you create the
corresponding private endpoint in ADF to enable access.
NOTE
Service principal authentication is supported only by the "AzureBlobStorage" type linked service, not the previous
"AzureStorage" type linked service.
Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<accountName>.blob.core.windows.net/",
"accountKind": "StorageV2",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
IMPORTANT
If you use PolyBase or COPY statement to load data from Blob storage (as a source or as staging) into Azure Synapse
Analytics, when you use managed identity authentication for Blob storage, make sure you also follow steps 1 to 3 in this
guidance. Those steps will register your server with Azure AD and assign the Storage Blob Data Contributor role to your
server. Data Factory handles the rest. If you configure Blob storage with an Azure Virtual Network endpoint, you also
need to have Allow trusted Microsoft services to access this storage account turned on under Azure Storage
account Firewalls and Virtual networks settings menu as required by Synapse.
These properties are supported for an Azure Blob storage linked service:
NOTE
If your blob account enables soft delete, managed identity authentication is not supported in Data Flow.
If you access blob storage through a private endpoint using Data Flow, note that when managed identity authentication
is used, Data Flow connects to the ADLS Gen2 endpoint instead of the Blob endpoint. Make sure you create the
corresponding private endpoint in ADF to enable access.
NOTE
Managed identities for Azure resource authentication are supported only by the "AzureBlobStorage" type linked service,
not the previous "AzureStorage" type linked service.
Example:
{
"name": "AzureBlobStorageLinkedService",
"properties": {
"type": "AzureBlobStorage",
"typeProperties": {
"serviceEndpoint": "https://<accountName>.blob.core.windows.net/",
"accountKind": "StorageV2"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Azure Blob storage under location settings in a format-based
dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
OPTION 2: blob prefix ( prefix ) — Prefix for the blob name under the given container configured in a dataset to
filter source blobs. Blobs whose names start with container_in_dataset/this_prefix are selected. It utilizes the
service-side filter for Blob storage, which provides better performance than a wildcard filter. Required: No.
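For example, a copy source that uses the prefix option instead of wildcards might look like the following hedged fragment (the prefix value is a placeholder):
"source": {
    "type": "DelimitedTextSource",
    "storeSettings":{
        "type": "AzureBlobStorageReadSettings",
        "prefix": "myfolder/myfileprefix"
    }
}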
Additional settings:
NOTE
For Parquet/delimited text format, the BlobSource type for the Copy activity source mentioned in the next section is still
supported as is for backward compatibility. We suggest that you use the new model going forward; the Data Factory
authoring UI has switched to generating these new types.
Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
The $logs container, which is automatically created when Storage Analytics is enabled for a storage account, isn't shown
when a container listing operation is performed via the Data Factory UI. The file path must be provided directly for Data
Factory to consume files from the $logs container.
Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureBlobStorageWriteSettings",
"copyBehavior": "PreserveHierarchy",
"metadata": [
{
"name": "testKey1",
"value": "value1"
},
{
"name": "testKey2",
"value": "value2"
},
{
"name": "lastModifiedKey",
"value": "$$LASTMODIFIED"
}
]
}
}
}
}
]
FOLDERPATH | FILENAME | RECURSIVE | SOURCE FOLDER STRUCTURE AND FILTER RESULT (FILES IN BOLD ARE RETRIEVED)
RECURSIVE | COPYBEHAVIOR | SOURCE FOLDER STRUCTURE | RESULTING TARGET
List of files: This is a file set. Create a text file that includes a list of relative paths to the files to process. Point to this
text file.
Column to store file name: Store the name of the source file in a column in your data. Enter a new column
name here to store the file name string.
After completion: Choose to do nothing with the source file after the data flow runs, delete the source file, or
move the source file. The paths for the move are relative.
To move source files to another location post-processing, first select "Move" for file operation. Then, set the
"from" directory. If you're not using any wildcards for your path, then the "from" setting will be the same folder
as your source folder.
If you have a source path with a wildcard, your syntax will look like this:
/data/sales/20??/**/*.csv
You can specify "from" as /data/sales and "to" as /backup/priorSales. In this case, all files that were sourced
under /data/sales are moved to /backup/priorSales.
NOTE
File operations run only when you start the data flow from a pipeline run (a pipeline debug or execution run) that uses
the Execute Data Flow activity in a pipeline. File operations do not run in Data Flow debug mode.
Filter by last modified: You can filter which files you process by specifying a date range of when they were
last modified. All datetimes are in UTC.
Sink properties
In the sink transformation, you can write to either a container or a folder in Azure Blob storage. Use the
Settings tab to manage how the files get written.
Clear the folder : Determines whether or not the destination folder gets cleared before the data is written.
File name option: Determines how the destination files are named in the destination folder. The file name
options are:
Default : Allow Spark to name files based on PART defaults.
Pattern : Enter a pattern that enumerates your output files per partition. For example, loans[n].csv will
create loans1.csv , loans2.csv , and so on.
Per partition : Enter one file name per partition.
As data in column : Set the output file to the value of a column. The path is relative to the dataset container,
not the destination folder. If you have a folder path in your dataset, it will be overridden.
Output to a single file : Combine the partitioned output files into a single named file. The path is relative to
the dataset folder. Be aware that the merge operation can possibly fail based on node size. We don't
recommend this option for large datasets.
Quote all: Determines whether to enclose all values in quotation marks.
Legacy models
NOTE
The following models are still supported as is for backward compatibility. We suggest that you use the new model
mentioned earlier. The Data Factory authoring UI has switched to generating the new model.
folderPath — Path to the container and folder in Blob storage. An example is: myblobcontainer/myblobfolder/ .
See more examples in Folder and file filter examples. Required: Yes for the Copy or Lookup activity, No for the
GetMetadata activity.
format — If you want to copy files as is between file-based stores (binary copy), skip the format section in both
the input and output dataset definitions. Required: No (only for binary copy scenario).
TIP
To copy all blobs under a folder, specify folderPath only.
To copy a single blob with a given name, specify folderPath for the folder part and fileName for the file name.
To copy a subset of blobs under a folder, specify folderPath for the folder part and fileName with a wildcard filter.
Example:
{
"name": "AzureBlobDataset",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "<Azure Blob storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Example:
"activities":[
{
"name": "CopyFromBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Blob input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
"activities":[
{
"name": "CopyToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Blob output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "BlobSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
Next steps
For a list of data stores that the Copy activity in Data Factory supports as sources and sinks, see Supported data
stores.
Copy data to an Azure Cognitive Search index
using Azure Data Factory
Supported capabilities
You can copy data from any supported source data store into a search index. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Cognitive Search connector.
IMPORTANT
When copying data from a cloud data store into a search index, in the Azure Cognitive Search linked service, you need to
reference an Azure Integration Runtime with an explicit region in connectVia. Set the region as the one where your search
service resides. Learn more from Azure Integration Runtime.
Example:
{
"name": "AzureSearchLinkedService",
"properties": {
"type": "AzureSearch",
"typeProperties": {
"url": "https://<service>.search.windows.net",
"key": {
"type": "SecureString",
"value": "<AdminKey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Cognitive Search dataset.
To copy data into Azure Cognitive Search, the following properties are supported:
Example:
{
"name": "AzureSearchIndexDataset",
"properties": {
"type": "AzureSearchIndex",
"typeProperties" : {
"indexName": "products"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Azure Cognitive Search linked service name>",
"type": "LinkedServiceReference"
}
}
}
WriteBehavior property
AzureSearchSink upserts when writing data. In other words, when writing a document, if the document key
already exists in the search index, Azure Cognitive Search updates the existing document rather than throwing a
conflict exception.
The AzureSearchSink provides the following two upsert behaviors (by using AzureSearch SDK):
Merge : combine all the columns in the new document with the existing one. For columns with null value in
the new document, the value in the existing one is preserved.
Upload : The new document replaces the existing one. For columns not specified in the new document, the
value is set to null whether there is a non-null value in the existing document or not.
The default behavior is Merge .
WriteBatchSize Property
Azure Cognitive Search service supports writing documents as a batch. A batch can contain 1 to 1,000 Actions.
An action handles one document to perform the upload/merge operation.
Example:
"activities":[
{
"name": "CopyToAzureSearch",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Cognitive Search output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureSearchIndexSink",
"writeBehavior": "Merge"
}
}
}
]
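As a sketch, an explicit batch size (the value of 1,000 here is an assumption; tune it to your workload) can be set alongside the write behavior in the sink:
"sink": {
    "type": "AzureSearchIndexSink",
    "writeBehavior": "Merge",
    "writeBatchSize": 1000
}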
AZURE COGNITIVE SEARCH DATA TYPE | SUPPORTED IN AZURE COGNITIVE SEARCH SINK
String | Y
Int32 | Y
Int64 | Y
Double | Y
Boolean | Y
DateTimeOffset | Y
String Array | N
GeographyPoint | N
Currently, other data types (for example, ComplexType) are not supported. For a full list of Azure Cognitive Search
supported data types, see Supported data types (Azure Cognitive Search).
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure Cosmos DB (SQL
API) by using Azure Data Factory
NOTE
This connector only supports the Cosmos DB SQL API. For the MongoDB API, refer to the connector for Azure Cosmos DB's API for
MongoDB. Other API types are not supported now.
Supported capabilities
This Azure Cosmos DB (SQL API) connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
For Copy activity, this Azure Cosmos DB (SQL API) connector supports:
Copy data from and to the Azure Cosmos DB SQL API using key, service principal, or managed identities for
Azure resources authentications.
Write to Azure Cosmos DB as insert or upsert.
Import and export JSON documents as-is, or copy data from or to a tabular dataset. Examples include a SQL
database and a CSV file. To copy documents as-is to or from JSON files or to or from another Azure Cosmos
DB collection, see Import and export JSON documents.
Data Factory integrates with the Azure Cosmos DB bulk executor library to provide the best performance when
you write to Azure Cosmos DB.
TIP
The Data Migration video walks you through the steps of copying data from Azure Blob storage to Azure Cosmos DB.
The video also describes performance-tuning considerations for ingesting data to Azure Cosmos DB in general.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Azure Cosmos DB (SQL API).
Example
{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example: store account key in Azure Key Vault
{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"connectionString": "AccountEndpoint=<EndpointUrl>;Database=<Database>",
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
Currently, the service principal authentication is not supported in data flow.
{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"accountEndpoint": "<account endpoint>",
"database": "<database name>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalCredential": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<AKV reference>",
"type": "LinkedServiceReference"
},
"secretName": "<certificate name in AKV>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
Currently, the managed identity authentication is not supported in data flow.
A data factory can be associated with a managed identity for Azure resources, which represents this specific data
factory. You can directly use this managed identity for Cosmos DB authentication, similar to using your own
service principal. It allows this designated factory to access and copy data to or from your Cosmos DB.
To use managed identities for Azure resource authentication, follow these steps.
1. Retrieve the Data Factory managed identity information by copying the value of the managed identity
object ID generated along with your factory.
2. Grant the managed identity proper permission. See examples on how permission works in Cosmos DB
from Access control lists on files and directories. More specifically, create a role definition, and assign the
role to the managed identity.
These properties are supported for the linked service:
Example:
{
"name": "CosmosDbSQLAPILinkedService",
"properties": {
"type": "CosmosDb",
"typeProperties": {
"accountEndpoint": "<account endpoint>",
"database": "<database name>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for Azure Cosmos DB (SQL API) dataset:
{
"name": "CosmosDbSQLAPIDataset",
"properties": {
"type": "CosmosDbSqlApiCollection",
"linkedServiceName":{
"referenceName": "<Azure Cosmos DB linked service name>",
"type": "LinkedServiceReference"
},
"schema": [],
"typeProperties": {
"collectionName": "<collection name>"
}
}
}
If you use "DocumentDbCollectionSource" type source, it is still supported as-is for backward compatibility. You
are suggested to use the new model going forward which provide richer capabilities to copy data from Cosmos
DB.
Example
"activities":[
{
"name": "CopyFromCosmosDBSQLAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Cosmos DB SQL API input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CosmosDbSqlApiSource",
"query": "SELECT c.BusinessEntityID, c.Name.First AS FirstName, c.Name.Middle AS MiddleName,
c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\"",
"preferredRegions": [
"East US"
]
},
"sink": {
"type": "<sink type>"
}
}
}
]
When copying data from Cosmos DB, unless you want to export JSON documents as-is, the best practice is to
specify the mapping in the copy activity. Data Factory honors the mapping you specified on the activity - if a row
doesn't contain a value for a column, a null value is provided for the column value. If you don't specify a
mapping, Data Factory infers the schema by using the first row in the data. If the first row doesn't contain the full
schema, some columns will be missing in the result of the activity operation.
Azure Cosmos DB (SQL API) as sink
To copy data to Azure Cosmos DB (SQL API), set the sink type in Copy Activity to CosmosDbSqlApiSink.
The following properties are supported in the Copy Activity sink section:
TIP
To import JSON documents as-is, refer to Import or export JSON documents section; to copy from tabular-shaped data,
refer to Migrate from relational database to Cosmos DB.
TIP
Cosmos DB limits a single request's size to 2 MB. The formula is: request size = single document size * write batch size. If
you hit an error saying "Request size is too large", reduce the writeBatchSize value in the copy sink configuration.
If you use "DocumentDbCollectionSink" type source, it is still supported as-is for backward compatibility. You are
suggested to use the new model going forward which provide richer capabilities to copy data from Cosmos DB.
Example
"activities":[
{
"name": "CopyToCosmosDBSQLAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "CosmosDbSqlApiSink",
"writeBehavior": "upsert"
}
}
}
]
Schema mapping
To copy data from Azure Cosmos DB to tabular sink or reversed, refer to schema mapping.
Throughput: Set an optional value for the number of RUs you'd like to apply to your CosmosDB collection for
each execution of this data flow. Minimum is 400.
Write throughput budget: An integer that represents the RUs you want to allocate for this Data Flow write
operation, out of the total throughput allocated to the collection.
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported
data stores.
Copy data to or from Azure Cosmos DB's API for
MongoDB by using Azure Data Factory
NOTE
This connector only supports copying data to/from Azure Cosmos DB's API for MongoDB. For the SQL API, refer to the Cosmos DB
SQL API connector. Other API types are not supported now.
Supported capabilities
You can copy data from Azure Cosmos DB's API for MongoDB to any supported sink data store, or copy data
from any supported source data store to Azure Cosmos DB's API for MongoDB. For a list of data stores that
Copy Activity supports as sources and sinks, see Supported data stores and formats.
You can use the Azure Cosmos DB's API for MongoDB connector to:
Copy data from and to the Azure Cosmos DB's API for MongoDB.
Write to Azure Cosmos DB as insert or upsert.
Import and export JSON documents as-is, or copy data from or to a tabular dataset. Examples include a SQL
database and a CSV file. To copy documents as-is to or from JSON files or to or from another Azure Cosmos
DB collection, see Import or export JSON documents.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Azure Cosmos DB's API for MongoDB.
Example
{
"name": "CosmosDbMongoDBAPILinkedService",
"properties": {
"type": "CosmosDbMongoDbApi",
"typeProperties": {
"connectionString": "mongodb://<cosmosdb-name>:<password>@<cosmosdb-
name>.documents.azure.com:10255/?ssl=true&replicaSet=globaldb",
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for Azure Cosmos DB's API for MongoDB dataset:
PROPERTY | DESCRIPTION | REQUIRED
Example
{
"name": "CosmosDbMongoDBAPIDataset",
"properties": {
"type": "CosmosDbMongoDbApiCollection",
"typeProperties": {
"collectionName": "<collection name>"
},
"schema": [],
"linkedServiceName":{
"referenceName": "<Azure Cosmos DB's API for MongoDB linked service name>",
"type": "LinkedServiceReference"
}
}
}
TIP
ADF supports consuming BSON documents in Strict mode. Make sure your filter query is in Strict mode instead of Shell
mode. More description can be found in the MongoDB manual.
Example
"activities":[
{
"name": "CopyFromCosmosDBMongoDBAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Cosmos DB's API for MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CosmosDbMongoDbApiSource",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-
12-12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
To import JSON documents as-is, refer to Import or export JSON documents section; to copy from tabular-shaped data,
refer to Schema mapping.
Example
"activities":[
{
"name": "CopyToCosmosDBMongoDBAPI",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "CosmosDbMongoDbApiSink",
"writeBehavior": "upsert"
}
}
}
]
Schema mapping
To copy data from Azure Cosmos DB's API for MongoDB to tabular sink or reversed, refer to schema mapping.
Specifically, when writing into Cosmos DB, to make sure you populate Cosmos DB with the right object ID from
your source data (for example, you have an "id" column in a SQL database table and want to use its value as the
document ID in MongoDB for insert/upsert), you need to set the proper schema mapping according to the
MongoDB strict mode definition ( _id.$oid ) as follows:
{
"_id": ObjectId("592e07800000000000000000")
}
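A hedged sketch of such a mapping in the copy activity follows; it assumes the source column is named id, and the exact sink path syntax should be confirmed against the schema mapping article:
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        {
            "source": { "name": "id" },
            "sink": { "path": "$['_id']['$oid']" }
        }
    ]
}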
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported
data stores.
Copy data to or from Azure Data Explorer by using
Azure Data Factory
TIP
For Azure Data Factory and Azure Data Explorer integration in general, learn more from Integrate Azure Data Explorer
with Azure Data Factory.
Supported capabilities
This Azure Data Explorer connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from any supported source data store to Azure Data Explorer. You can also copy data from
Azure Data Explorer to any supported sink data store. For a list of data stores that the copy activity supports as
sources or sinks, see the Supported data stores table.
NOTE
Copying data to or from Azure Data Explorer through an on-premises data store by using self-hosted integration runtime
is supported in version 3.14 and later.
With the Azure Data Explorer connector, you can do the following:
Copy data by using Azure Active Directory (Azure AD) application token authentication with a service
principal.
As a source, retrieve data by using a KQL (Kusto) query.
As a sink, append data to a destination table.
Getting started
TIP
For a walkthrough of Azure Data Explorer connector, see Copy data to/from Azure Data Explorer using Azure Data Factory
and Bulk copy from a database to Azure Data Explorer.
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Data Explorer connector.
NOTE
When you use the Data Factory UI to author, by default your login user account is used to list Azure Data Explorer
clusters, databases, and tables. You can choose to list the objects using the service principal by clicking the dropdown next
to the refresh button, or manually enter the name if you don't have permission for these operations.
The following properties are supported for the Azure Data Explorer linked service:
{
"name": "AzureDataExplorerLinkedService",
"properties": {
"type": "AzureDataExplorer",
"typeProperties": {
"endpoint": "https://<clusterName>.<regionName>.kusto.windows.net ",
"database": "<database name>",
"tenant": "<tenant name/id e.g. microsoft.onmicrosoft.com>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
}
}
}
The following properties are supported for the Azure Data Explorer linked service:
{
"name": "AzureDataExplorerLinkedService",
"properties": {
"type": "AzureDataExplorer",
"typeProperties": {
"endpoint": "https://<clusterName>.<regionName>.kusto.windows.net ",
"database": "<database name>",
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see Datasets in Azure Data Factory. This
section lists properties that the Azure Data Explorer dataset supports.
To copy data to Azure Data Explorer, set the type property of the dataset to AzureDataExplorerTable .
The following properties are supported:
table — The name of the table that the linked service refers to. Required: Yes for sink; No for source.
{
"name": "AzureDataExplorerDataset",
"properties": {
"type": "AzureDataExplorerTable",
"typeProperties": {
"table": "<table name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Azure Data Explorer linked service name>",
"type": "LinkedServiceReference"
}
}
}
Example:
"activities":[
{
"name": "CopyFromAzureDataExplorer",
"type": "Copy",
"typeProperties": {
"source": {
"type": "AzureDataExplorerSource",
"query": "TestTable1 | take 10",
"queryTimeout": "00:10:00"
},
"sink": {
"type": "<sink type>"
}
},
"inputs": [
{
"referenceName": "<Azure Data Explorer input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
]
}
]
Example:
"activities":[
{
"name": "CopyToAzureDataExplorer",
"type": "Copy",
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDataExplorerSink",
"ingestionMappingName": "<optional Azure Data Explorer mapping name>",
"additionalProperties": {<additional settings for data ingestion>}
}
},
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Data Explorer output dataset name>",
"type": "DatasetReference"
}
]
}
]
Next steps
For a list of data stores that the copy activity in Azure Data Factory supports as sources and sinks, see
supported data stores.
Learn more about how to copy data from Azure Data Factory to Azure Data Explorer.
Copy data to or from Azure Data Lake Storage
Gen1 using Azure Data Factory
Supported capabilities
This Azure Data Lake Storage Gen1 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Delete activity
Specifically, with this connector you can:
Copy files by using one of the following methods of authentication: service principal or managed identities
for Azure resources.
Copy files as is or parse or generate files with the supported file formats and compression codecs.
Preserve ACLs when copying into Azure Data Lake Storage Gen2.
IMPORTANT
If you copy data by using the self-hosted integration runtime, configure the corporate firewall to allow outbound traffic to
<ADLS account name>.azuredatalakestore.net and login.microsoftonline.com/<tenant>/oauth2/token on port
443. The latter is the Azure Security Token Service that the integration runtime needs to communicate with to get the
access token.
Get started
TIP
For a walk-through of how to use the Azure Data Lake Store connector, see Load data into Azure Data Lake Store.
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide information about properties that are used to define Data Factory entities
specific to Azure Data Lake Store.
Example:
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureDataLakeStoreLinkedService",
"properties": {
"type": "AzureDataLakeStore",
"typeProperties": {
"dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
"subscriptionId": "<subscription of ADLS>",
"resourceGroupName": "<resource group of ADLS>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Azure Data Lake Store Gen1 under location settings in the format-
based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<ADLS Gen1 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureDataLakeStoreLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
Additional settings:
Example:
"activities":[
{
"name": "CopyFromADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureDataLakeStoreReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyToADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureDataLakeStoreWriteSettings",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
If you want to replicate the access control lists (ACLs) along with data files when you upgrade from Data Lake
Storage Gen1 to Data Lake Storage Gen2, see Preserve ACLs from Data Lake Storage Gen1.
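As a minimal, hedged sketch of that scenario (the dataset names are placeholders, and the preserve values follow the Preserve metadata and ACLs documentation; verify them against that article for your copy):
"activities":[
    {
        "name": "CopyFromGen1ToGen2PreserveAcls",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<ADLS Gen1 input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<ADLS Gen2 output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "AzureDataLakeStoreSource",
                "recursive": true
            },
            "sink": {
                "type": "AzureBlobFSSink",
                "copyBehavior": "PreserveHierarchy"
            },
            "preserve": [ "ACL", "owner", "group" ]
        }
    }
]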
Wildcard path: Using a wildcard pattern will instruct ADF to loop through each matching folder and file in a
single Source transformation. This is an effective way to process multiple files within a single flow. Add multiple
wildcard matching patterns with the + sign that appears when hovering over your existing wildcard pattern.
From your source container, choose a series of files that match a pattern. Only the container can be specified in the dataset. Your wildcard path must therefore also include your folder path from the root folder.
Wildcard examples:
* Represents any set of characters
** Represents recursive directory nesting
? Replaces one character
[] Matches one or more characters in the brackets
/data/sales/**/*.csv Gets all csv files under /data/sales
/data/sales/20??/**/ Gets all files recursively under any folder that matches 20?? (for example, year folders 2000 through 2099)
/data/sales/*/*/*.csv Gets csv files two levels under /data/sales
/data/sales/2004/*/12/[XY]1?.csv Gets all csv files from December 2004 whose names start with X or Y followed by 1 and any single character
Partition Root Path: If you have partitioned folders in your file source with a key=value format (for example,
year=2019), then you can assign the top level of that partition folder tree to a column name in your data flow
data stream.
First, set a wildcard to include all paths that are the partitioned folders plus the leaf files that you wish to read.
Use the Partition Root Path setting to define what the top level of the folder structure is. When you view the
contents of your data via a data preview, you'll see that ADF will add the resolved partitions found in each of
your folder levels.
List of files: This is a file set. Create a text file that includes a list of relative path files to process. Point to this
text file.
Column to store file name: Store the name of the source file in a column in your data. Enter a new column
name here to store the file name string.
After completion: Choose to do nothing with the source file after the data flow runs, delete the source file, or
move the source file. The paths for the move are relative.
To move source files to another location post-processing, first select "Move" for file operation. Then, set the
"from" directory. If you're not using any wildcards for your path, then the "from" setting will be the same folder
as your source folder.
If you have a source path with a wildcard, your syntax will look like this:
/data/sales/20??/**/*.csv
And "to" as
/backup/priorSales
In this case, all files that were sourced under /data/sales are moved to /backup/priorSales.
NOTE
File operations run only when you start the data flow from a pipeline run (a pipeline debug or execution run) that uses
the Execute Data Flow activity in a pipeline. File operations do not run in Data Flow debug mode.
Filter by last modified: You can filter which files you process by specifying a date range of when they were
last modified. All date-times are in UTC.
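In the Copy activity, a comparable filter can be expressed with the modifiedDatetimeStart and modifiedDatetimeEnd properties under storeSettings; a minimal sketch of the source side (the time window shown is illustrative):
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureDataLakeStoreReadSettings",
        "recursive": true,
        "modifiedDatetimeStart": "2019-02-01T00:00:00Z",
        "modifiedDatetimeEnd": "2019-02-03T00:00:00Z"
    }
}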
Sink properties
In the sink transformation, you can write to either a container or a folder in Azure Data Lake Storage Gen1. The Settings tab lets you manage how the files get written.
Clear the folder : Determines whether or not the destination folder gets cleared before the data is written.
File name option: Determines how the destination files are named in the destination folder. The file name
options are:
Default : Allow Spark to name files based on PART defaults.
Pattern : Enter a pattern that enumerates your output files per partition. For example, loans[n].csv will
create loans1.csv, loans2.csv, and so on.
Per partition : Enter one file name per partition.
As data in column : Set the output file to the value of a column. The path is relative to the dataset container, not the destination folder. If you have a folder path in your dataset, it will be overridden.
Output to a single file : Combine the partitioned output files into a single named file. The path is relative to the dataset folder. Be aware that the merge operation can fail depending on node size. This option is not recommended for large datasets.
Quote all: Determines whether to enclose all values in quotes.
Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We suggest that you use the new model mentioned in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. (Required: No; only for the binary copy scenario.)
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a particular name, specify folderPath with a folder part and fileName with a file name.
To copy a subset of files under a folder, specify folderPath with a folder part and fileName with a wildcard filter.
Example:
{
"name": "ADLSDataset",
"properties": {
"type": "AzureDataLakeStoreFile",
"linkedServiceName":{
"referenceName": "<ADLS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "datalake/myfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Example:
"activities":[
{
"name": "CopyFromADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<ADLS Gen1 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureDataLakeStoreSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
"activities":[
{
"name": "CopyToADLSGen1",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ADLS Gen1 output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDataLakeStoreSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure Data Lake
Storage Gen2 using Azure Data Factory
7/15/2021 • 29 minutes to read
TIP
For data lake or data warehouse migration scenario, learn more from Use Azure Data Factory to migrate data from your
data lake or data warehouse to Azure.
Supported capabilities
This Azure Data Lake Storage Gen2 connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
Delete activity
For Copy activity, with this connector you can:
Copy data from/to Azure Data Lake Storage Gen2 by using account key, service principal, or managed
identities for Azure resources authentications.
Copy files as-is or parse or generate files with supported file formats and compression codecs.
Preserve file metadata during copy.
Preserve ACLs when copying from Azure Data Lake Storage Gen1/Gen2.
Get started
TIP
For a walk-through of how to use the Data Lake Storage Gen2 connector, see Load data into Azure Data Lake Storage
Gen2.
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide information about properties that are used to define Data Factory entities
specific to Data Lake Storage Gen2.
NOTE
If you want to use the public Azure integration runtime to connect to Data Lake Storage Gen2 by leveraging the Allow trusted Microsoft services to access this storage account option enabled on the Azure Storage firewall, you must use managed identity authentication.
When you use PolyBase or the COPY statement to load data into Azure Synapse Analytics, if your source or staging Data Lake Storage Gen2 is configured with an Azure Virtual Network endpoint, you must use managed identity authentication as required by Synapse. See the managed identity authentication section for more configuration prerequisites.
Example:
{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"accountkey": {
"type": "SecureString",
"value": "<accountkey>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
If you use the Data Factory UI to author and the service principal is not assigned the "Storage Blob Data Reader/Contributor" role in IAM, then when you test the connection or browse and navigate folders, choose "Test connection to file path" or "Browse from specified path" and specify a path with Read + Execute permission to continue.
{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalCredential": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalCredential": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<AKV reference>",
"type": "LinkedServiceReference"
},
"secretName": "<certificate name in AKV>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
If you use the Data Factory UI to author and the managed identity is not assigned the "Storage Blob Data Reader/Contributor" role in IAM, then when you test the connection or browse and navigate folders, choose "Test connection to file path" or "Browse from specified path" and specify a path with Read + Execute permission to continue.
IMPORTANT
If you use PolyBase or the COPY statement to load data from Data Lake Storage Gen2 into Azure Synapse Analytics, when you use managed identity authentication for Data Lake Storage Gen2, make sure you also follow steps 1 to 3 in this guidance. Those steps will register your server with Azure AD and assign the Storage Blob Data Contributor role to your server. Data Factory handles the rest. If you configure Blob storage with an Azure Virtual Network endpoint, you also need to have Allow trusted Microsoft services to access this storage account turned on under the Azure Storage account Firewalls and Virtual networks settings menu as required by Synapse.
Example:
{
"name": "AzureDataLakeStorageGen2LinkedService",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://<accountname>.dfs.core.windows.net",
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see Datasets.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Data Lake Storage Gen2 under location settings in the format-
based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Data Lake Storage Gen2 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobFSLocation",
"fileSystem": "filesystemname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
The following properties are supported for Data Lake Storage Gen2 under storeSettings settings in format-
based copy source:
Additional settings:
Example:
"activities":[
{
"name": "CopyFromADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureBlobFSReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyToADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureBlobFSWriteSettings",
"copyBehavior": "PreserveHierarchy",
"metadata": [
{
"name": "testKey1",
"value": "value1"
},
{
"name": "testKey2",
"value": "value2"
},
{
"name": "lastModifiedKey",
"value": "$$LASTMODIFIED"
}
]
}
}
}
}
]
TIP
To copy data from Azure Data Lake Storage Gen1 into Gen2 in general, see Copy data from Azure Data Lake Storage
Gen1 to Gen2 with Azure Data Factory for a walk-through and best practices.
Wildcard path: Using a wildcard pattern will instruct ADF to loop through each matching folder and file in a
single Source transformation. This is an effective way to process multiple files within a single flow. Add multiple
wildcard matching patterns with the + sign that appears when hovering over your existing wildcard pattern.
From your source container, choose a series of files that match a pattern. Only the container can be specified in the dataset. Your wildcard path must therefore also include your folder path from the root folder.
Wildcard examples:
* Represents any set of characters
** Represents recursive directory nesting
? Replaces one character
[] Matches one or more characters in the brackets
/data/sales/**/*.csv Gets all csv files under /data/sales
/data/sales/20??/**/ Gets all files recursively under any folder that matches 20?? (for example, year folders 2000 through 2099)
/data/sales/*/*/*.csv Gets csv files two levels under /data/sales
/data/sales/2004/*/12/[XY]1?.csv Gets all csv files from December 2004 whose names start with X or Y followed by 1 and any single character
Partition Root Path: If you have partitioned folders in your file source with a key=value format (for example,
year=2019), then you can assign the top level of that partition folder tree to a column name in your data flow
data stream.
First, set a wildcard to include all paths that are the partitioned folders plus the leaf files that you wish to read.
Use the Partition Root Path setting to define what the top level of the folder structure is. When you view the
contents of your data via a data preview, you'll see that ADF will add the resolved partitions found in each of
your folder levels.
List of files: This is a file set. Create a text file that includes a list of relative path files to process. Point to this
text file.
Column to store file name: Store the name of the source file in a column in your data. Enter a new column
name here to store the file name string.
After completion: Choose to do nothing with the source file after the data flow runs, delete the source file, or
move the source file. The paths for the move are relative.
To move source files to another location post-processing, first select "Move" for file operation. Then, set the
"from" directory. If you're not using any wildcards for your path, then the "from" setting will be the same folder
as your source folder.
If you have a source path with a wildcard, your syntax will look like this:
/data/sales/20??/**/*.csv
And "to" as
/backup/priorSales
In this case, all files that were sourced under /data/sales are moved to /backup/priorSales.
NOTE
File operations run only when you start the data flow from a pipeline run (a pipeline debug or execution run) that uses
the Execute Data Flow activity in a pipeline. File operations do not run in Data Flow debug mode.
Filter by last modified: You can filter which files you process by specifying a date range of when they were
last modified. All date-times are in UTC.
Sink properties
In the sink transformation, you can write to either a container or a folder in Azure Data Lake Storage Gen2. The Settings tab lets you manage how the files get written.
Clear the folder : Determines whether or not the destination folder gets cleared before the data is written.
File name option: Determines how the destination files are named in the destination folder. The file name
options are:
Default : Allow Spark to name files based on PART defaults.
Pattern : Enter a pattern that enumerates your output files per partition. For example, loans[n].csv will
create loans1.csv, loans2.csv, and so on.
Per partition : Enter one file name per partition.
As data in column : Set the output file to the value of a column. The path is relative to the dataset container, not the destination folder. If you have a folder path in your dataset, it will be overridden.
Output to a single file : Combine the partitioned output files into a single named file. The path is relative to the dataset folder. Be aware that the merge operation can fail depending on node size. This option is not recommended for large datasets.
Quote all: Determines whether to enclose all values in quotes.
Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We suggest that you use the new model mentioned in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. (Required: No; only for the binary copy scenario.)
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with a folder part and fileName with a file name.
To copy a subset of files under a folder, specify folderPath with a folder part and fileName with a wildcard filter.
Example:
{
"name": "ADLSGen2Dataset",
"properties": {
"type": "AzureBlobFSFile",
"linkedServiceName": {
"referenceName": "<Azure Data Lake Storage Gen2 linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "myfilesystem/myfolder",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Example:
"activities":[
{
"name": "CopyFromADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<ADLS Gen2 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureBlobFSSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
"activities":[
{
"name": "CopyToADLSGen2",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ADLS Gen2 output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureBlobFSSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Azure Database for MariaDB using
Azure Data Factory
5/6/2021 • 3 minutes to read
Supported capabilities
This Azure Database for MariaDB connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Azure Database for MariaDB to any supported sink data store. For a list of data stores
that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for MariaDB connector.
Example:
{
"name": "AzureDatabaseForMariaDBLinkedService",
"properties": {
"type": "AzureMariaDB",
"typeProperties": {
"connectionString": "Server={your_server}.mariadb.database.azure.com; Port=3306; Database=
{your_database}; Uid={your_user}@{your_server}; Pwd={your_password}; SslMode=Preferred;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureDatabaseForMariaDBLinkedService",
"properties": {
"type": "AzureMariaDB",
"typeProperties": {
"connectionString": "Server={your_server}.mariadb.database.azure.com; Port=3306; Database=
{your_database}; Uid={your_user}@{your_server}; SslMode=Preferred;",
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for MariaDB dataset.
To copy data from Azure Database for MariaDB, the following properties are supported:
Example
{
"name": "AzureDatabaseForMariaDBDataset",
"properties": {
"type": "AzureMariaDBTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Azure Database for MariaDB linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". (Required: No, if "tableName" in the dataset is specified.)
Example:
"activities":[
{
"name": "CopyFromAzureDatabaseForMariaDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Database for MariaDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureMariaDBSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure Database for
MySQL by using Azure Data Factory
5/6/2021 • 8 minutes to read
Supported capabilities
This Azure Database for MySQL connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure Database for MySQL connector.
Example:
{
"name": "AzureDatabaseForMySQLLinkedService",
"properties": {
"type": "AzureMySql",
"typeProperties": {
"connectionString": "Server=<server>.mysql.database.azure.com;Port=<port>;Database=
<database>;UID=<username>;PWD=<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Azure Database for MySQL dataset.
To copy data from Azure Database for MySQL, set the type property of the dataset to AzureMySqlTable . The
following properties are supported:
tableName: Name of the table in the MySQL database. (Required: No, if "query" in the activity source is specified.)
Example
{
"name": "AzureMySQLDataset",
"properties": {
"type": "AzureMySqlTable",
"linkedServiceName": {
"referenceName": "<Azure MySQL linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". (Required: No, if "tableName" in the dataset is specified.)
Example:
"activities":[
{
"name": "CopyFromAzureDatabaseForMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure MySQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureMySqlSource",
"query": "<custom query e.g. SELECT * FROM MyTable>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyToAzureDatabaseForMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure MySQL output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureMySqlSink",
"preCopyScript": "<custom SQL script>",
"writeBatchSize": 100000
}
}
}
]
Source transformation
When you use a query as the source, an Order By clause is not supported, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table that you can use in data flow.
Query example: select * from mytable where customerId > 1000 and customerId < 2000, or select * from "MyTable".
Sink transformation
The below table lists the properties supported by Azure Database for MySQL sink. You can edit these properties
in the Sink options tab.
(Sink property table columns: Name, Description, Required, Allowed values, Data flow script property.)
When copying data from Azure Database for MySQL, the following mappings are used from Azure Database for MySQL data types to Azure Data Factory interim data types:
bigint Int64
bit Boolean
blob Byte[]
bool Int16
char String
date Datetime
datetime Datetime
double Double
enum String
float Single
int Int32
integer Int32
longblob Byte[]
longtext String
mediumblob Byte[]
mediumint Int32
mediumtext String
numeric Decimal
real Double
set String
smallint Int16
text String
time TimeSpan
timestamp Datetime
tinyblob Byte[]
tinyint Int16
tinytext String
varchar String
year Int32
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure Database for
PostgreSQL by using Azure Data Factory
6/16/2021 • 8 minutes to read
Supported capabilities
This Azure Database for PostgreSQL connector is supported for the following activities:
Copy activity with a supported source/sink matrix
Mapping data flow
Lookup activity
Currently, data flow in Azure Data Factory supports Azure Database for PostgreSQL Single Server but not Flexible Server or Hyperscale (Citus); data flow in Azure Synapse Analytics supports all PostgreSQL flavors.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections offer details about properties that are used to define Data Factory entities specific to
Azure Database for PostgreSQL connector.
Example :
{
"name": "AzurePostgreSqlLinkedService",
"properties": {
"type": "AzurePostgreSql",
"typeProperties": {
"connectionString": "Server=<server>.postgres.database.azure.com;Database=<database>;Port=
<port>;UID=<username>;Password=<Password>"
}
}
}
Example :
Store password in Azure Key Vault
{
"name": "AzurePostgreSqlLinkedService",
"properties": {
"type": "AzurePostgreSql",
"typeProperties": {
"connectionString": "Server=<server>.postgres.database.azure.com;Database=<database>;Port=
<port>;UID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see Datasets in Azure Data Factory. This
section provides a list of properties that Azure Database for PostgreSQL supports in datasets.
To copy data from Azure Database for PostgreSQL, set the type property of the dataset to
AzurePostgreSqlTable . The following properties are supported:
Example :
{
"name": "AzurePostgreSqlDataset",
"properties": {
"type": "AzurePostgreSqlTable",
"linkedServiceName": {
"referenceName": "<AzurePostgreSql linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
query: Use the custom SQL query to read data. For example: SELECT * FROM mytable or SELECT * FROM "MyTable". Note that in PostgreSQL, the entity name is treated as case-insensitive if not quoted. (Required: No, if the tableName property in the dataset is specified.)
Example :
"activities":[
{
"name": "CopyFromAzurePostgreSql",
"type": "Copy",
"inputs": [
{
"referenceName": "<AzurePostgreSql input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzurePostgreSqlSource",
"query": "<custom query e.g. SELECT * FROM mytable>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example :
"activities":[
{
"name": "CopyToAzureDatabaseForPostgreSQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure PostgreSQL output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzurePostgreSQLSink",
"preCopyScript": "<custom SQL script>",
"writeMethod": "CopyCommand",
"writeBatchSize": 1000000
}
}
}
]
Source transformation
When you use a query as the source, an Order By clause is not supported, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table that you can use in data flow.
Query example: select * from mytable where customerId > 1000 and customerId < 2000, or select * from "MyTable". Note that in PostgreSQL, the entity name is treated as case-insensitive if not quoted.
source(allowSchemaDrift: true,
validateSchema: false,
isolationLevel: 'READ_UNCOMMITTED',
query: 'select * from mytable',
format: 'query') ~> AzurePostgreSQLSource
Sink transformation
The below table lists the properties supported by Azure Database for PostgreSQL sink. You can edit these
properties in the Sink options tab.
(Sink property table columns: Name, Description, Required, Allowed values, Data flow script property.)
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores.
Copy data to and from Azure Databricks Delta Lake
by using Azure Data Factory
6/17/2021 • 10 minutes to read
Supported capabilities
This Azure Databricks Delta Lake connector is supported for the following activities:
Copy activity with a supported source/sink matrix table
Lookup activity
In general, Azure Data Factory supports Delta Lake with the following capabilities to meet your various needs.
Copy activity supports the Azure Databricks Delta Lake connector to copy data from any supported source data store to an Azure Databricks delta lake table, and from a delta lake table to any supported sink data store. It leverages your Databricks cluster to perform the data movement; see details in the Prerequisites section.
Mapping Data Flow supports the generic Delta format on Azure Storage as source and sink to read and write Delta files for code-free ETL, and runs on the managed Azure Integration Runtime.
Databricks activities support orchestrating your code-centric ETL or machine learning workload on top of delta lake.
Prerequisites
To use this Azure Databricks Delta Lake connector, you need to set up a cluster in Azure Databricks.
To copy data to delta lake, the Copy activity invokes the Azure Databricks cluster to read data from an Azure Storage account, which is either your original source or a staging area to which Data Factory first writes the source data via built-in staged copy. Learn more from Delta lake as the sink.
Similarly, to copy data from delta lake, the Copy activity invokes the Azure Databricks cluster to write data to an Azure Storage account, which is either your original sink or a staging area from which Data Factory continues to write data to the final sink via built-in staged copy. Learn more from Delta lake as the source.
The Databricks cluster needs to have access to the Azure Blob or Azure Data Lake Storage Gen2 account: both the storage container/file system used for source/sink/staging and the container/file system where you want to write the Delta Lake tables.
To use Azure Data Lake Storage Gen2 , you can configure a service principal on the Databricks cluster as part of the Apache Spark configuration. Follow the steps in Access directly with service principal.
To use Azure Blob storage , you can configure a storage account access key or SAS token on the Databricks cluster as part of the Apache Spark configuration. Follow the steps in Access Azure Blob storage using the RDD API.
During copy activity execution, if the cluster you configured has been terminated, Data Factory automatically starts it. If you author a pipeline by using the Data Factory authoring UI, you need a live cluster for operations like data preview; Data Factory won't start the cluster on your behalf.
Specify the cluster configuration
1. In the Cluster Mode drop-down, select Standard .
2. In the Databricks Runtime Version drop-down, select a Databricks runtime version.
3. Turn on Auto Optimize by adding the following properties to your Spark configuration:
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that define Data Factory entities specific to an Azure
Databricks Delta Lake connector.
Example:
{
"name": "AzureDatabricksDeltaLakeLinkedService",
"properties": {
"type": "AzureDatabricksDeltaLake",
"typeProperties": {
"domain": "https://adb-xxxxxxxxx.xx.azuredatabricks.net",
"clusterId": "<cluster id>",
"accessToken": {
"type": "SecureString",
"value": "<access token>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
The following properties are supported for the Azure Databricks Delta Lake dataset.
table: Name of the delta table. (Required: No for source, Yes for sink.)
Example:
{
"name": "AzureDatabricksDeltaLakeDataset",
"properties": {
"type": "AzureDatabricksDeltaLakeDataset",
"typeProperties": {
"database": "<database name>",
"table": "<delta table name>"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference"
}
}
}
Under exportSettings :
NOTE
The staging storage account credential should be pre-configured in the Azure Databricks cluster configuration; learn more from Prerequisites.
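A hedged sketch of exportSettings inside the copy source (the type name and the optional date/timestamp format properties reflect this connector's export command settings; the format strings shown are illustrative):
"exportSettings": {
    "type": "AzureDatabricksDeltaLakeExportCommand",
    "dateFormat": "yyyy-MM-dd",
    "timestampFormat": "yyyy-MM-dd HH:mm:ss"
}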
Example:
"activities":[
{
"name": "CopyFromDeltaLake",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delta lake input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureDatabricksDeltaLakeSource",
"sqlReaderQuery": "SELECT * FROM events TIMESTAMP AS OF timestamp_expression"
},
"sink": {
"type": "<sink type>"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingStorage",
"type": "LinkedServiceReference"
},
"path": "mystagingpath"
}
}
}
]
Under importSettings :
NOTE
The staging storage account credential should be pre-configured in the Azure Databricks cluster configuration; learn more from Prerequisites.
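A hedged sketch of importSettings inside the copy sink (the type name and the optional date/timestamp format properties reflect this connector's import command settings; the format strings shown are illustrative):
"importSettings": {
    "type": "AzureDatabricksDeltaLakeImportCommand",
    "dateFormat": "yyyy-MM-dd",
    "timestampFormat": "yyyy-MM-dd HH:mm:ss"
}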
Example:
"activities":[
{
"name": "CopyToDeltaLake",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Delta lake output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureDatabricksDeltaLakeSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
},
"path": "mystagingpath"
}
}
}
]
Monitoring
Azure Data Factory provides the same copy activity monitoring experience as other connectors. In addition,
because loading data from/to delta lake is running on your Azure Databricks cluster, you can further view
detailed cluster logs and monitor performance.
Next steps
For a list of data stores supported as sources and sinks by Copy activity in Data Factory, see supported data
stores and formats.
Copy data from or to Azure File Storage by using
Azure Data Factory
5/6/2021 • 19 minutes to read
Supported capabilities
This Azure File Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
You can copy data from Azure File Storage to any supported sink data store, or copy data from any supported
source data store to Azure File Storage. For a list of data stores that Copy Activity supports as sources and sinks,
see Supported data stores and formats.
Specifically, this Azure File Storage connector supports:
Copying files by using account key or service shared access signature (SAS) authentications.
Copying files as-is or parsing/generating files with the supported file formats and compression codecs.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Azure File Storage.
Example:
{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=
<accountKey>;EndpointSuffix=core.windows.net;",
"fileShare": "<file share name>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example: store the account key in Azure Key Vault
{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;",
"fileShare": "<file share name>",
"accountKey": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example:
{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the resource e.g. https://<accountname>.file.core.windows.net/?sv=
<storage version>&st=<start time>&se=<expire time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=
<protocol>&sig=<signature>>"
},
"fileShare": "<file share name>",
"snapshot": "<snapshot version>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource without token e.g.
https://<accountname>.file.core.windows.net/>"
},
"sasToken": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName with value of SAS token e.g. ?sv=<storage version>&st=<start
time>&se=<expire time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Legacy model
connectVia: The Integration Runtime to be used to connect to the data store. You can use the Azure Integration Runtime or a Self-hosted Integration Runtime (if your data store is located in a private network). If not specified, it uses the default Azure Integration Runtime. (Required: No.)
Example:
{
"name": "AzureFileStorageLinkedService",
"properties": {
"type": "AzureFileStorage",
"typeProperties": {
"host": "\\\\<storage name>.file.core.windows.net\\<file service name>",
"userid": "AZURE\\<storage name>",
"password": {
"type": "SecureString",
"value": "<storage access key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Azure File Storage under location settings in format-based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure File Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "AzureFileStorageLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
OPTION 2: file prefix - prefix: Prefix for the file name under the given file share configured in a dataset to filter source files. Files whose names start with fileshare_in_linked_service/this_prefix are selected. It utilizes the service-side filter for Azure File Storage, which provides better performance than a wildcard filter. This feature is not supported when using a legacy linked service model. (Required: No.)
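A hedged sketch of a copy source that uses the prefix option under storeSettings (the prefix value is a placeholder):
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureFileStorageReadSettings",
        "recursive": true,
        "prefix": "myfolder/myfileprefix"
    }
}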
Additional settings:
Example:
"activities":[
{
"name": "CopyFromAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "AzureFileStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyToAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "AzureFileStorageWriteSettings",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We suggest that you use the new model mentioned in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.
format: If you want to copy files as-is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. (Required: No; only for the binary copy scenario.)
NOTE
If you were using the "fileFilter" property for file filtering, it is still supported as-is, but we suggest that you use the new filter capability added to "fileName" going forward.
Example:
{
"name": "AzureFileStorageDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<Azure File Storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Example:
"activities":[
{
"name": "CopyFromAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure File Storage input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyToAzureFileStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure File Storage output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "FileSystemSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Azure SQL Database
by using Azure Data Factory
7/16/2021 • 29 minutes to read
Supported capabilities
This Azure SQL Database connector is supported for the following activities:
Copy activity with supported source/sink matrix table
Mapping data flow
Lookup activity
GetMetadata activity
For Copy activity, this Azure SQL Database connector supports these functions:
Copying data by using SQL authentication and Azure Active Directory (Azure AD) Application token
authentication with a service principal or managed identities for Azure resources.
As a source, retrieving data by using a SQL query or a stored procedure. You can also choose to parallel copy
from an Azure SQL Database source, see the Parallel copy from SQL database section for details.
As a sink, automatically creating destination table if not exists based on the source schema; appending data
to a table or invoking a stored procedure with custom logic during the copy.
If you use the Azure SQL Database serverless tier, note that when the server is paused, an activity run fails instead of waiting for the auto-resume to be ready. You can add an activity retry or chain additional activities to make sure the server is live upon the actual execution.
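One way to guard against a paused serverless database, as a sketch, is to set a retry policy on the copy activity so a failed run is retried after the auto-resume has had time to complete (the retry count and interval below are illustrative):
{
    "name": "CopyFromServerlessAzureSQL",
    "type": "Copy",
    "policy": {
        "retry": 3,
        "retryIntervalInSeconds": 120
    },
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource"
        },
        "sink": {
            "type": "<sink type>"
        }
    }
}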
IMPORTANT
If you copy data by using the Azure integration runtime, configure a server-level firewall rule so that Azure services can
access the server. If you copy data by using a self-hosted integration runtime, configure the firewall to allow the
appropriate IP range. This range includes the machine's IP that's used to connect to Azure SQL Database.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Azure Data Factory entities
specific to an Azure SQL Database connector.
servicePrincipalId: Specify the application's client ID. (Required: Yes, when you use Azure AD authentication with a service principal.)
servicePrincipalKey: Specify the application's key. Mark this field as SecureString to store it securely in Azure Data Factory, or reference a secret stored in Azure Key Vault. (Required: Yes, when you use Azure AD authentication with a service principal.)
tenant: Specify the tenant information, like the domain name or tenant ID, under which your application resides. Retrieve it by hovering the mouse in the upper-right corner of the Azure portal. (Required: Yes, when you use Azure AD authentication with a service principal.)
NOTE
Azure SQL Database Always Encr ypted is not supported in data flow.
For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources
TIP
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the
database is XXX and has been reached," add Pooling=false to your connection string and try again. Pooling=false is
also recommended for a SHIR (Self-Hosted Integration Runtime) linked service setup. Pooling and other connection parameters can be added as new parameter names and values in the Additional connection properties section of the linked service creation form.
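For example, a connection string with pooling disabled might look like this (a sketch; the server and database names are placeholders):
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30;Pooling=false"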
SQL authentication
Example: using SQL authentication
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"alwaysEncryptedSettings": {
"alwaysEncryptedAkvAuthType": "ServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
4. Grant the service principal the needed permissions as you normally do for SQL users or others, for example by creating a contained database user for the application and adding it to the required database role. For more options, see this document.
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;Connection Timeout=30",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
3. Grant the Data Factory managed identity needed permissions as you normally do for SQL users and
others. Run the following code. For more options, see this document.
ALTER ROLE [role name] ADD MEMBER [your Data Factory name];
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Data Source=tcp:<servername>.database.windows.net,1433;Initial Catalog=
<databasename>;Connection Timeout=30"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available to define datasets, see Datasets.
The following properties are supported for Azure SQL Database dataset:
tableName: Name of the table/view with schema. This property is supported for backward compatibility. For new workloads, use schema and table. (Required: No for source, Yes for sink.)
TIP
To load data from Azure SQL Database efficiently by using data partitioning, learn more from Parallel copy from SQL
database.
To copy data from Azure SQL Database, the following properties are supported in the copy activity source
section:
Under partitionSettings :
"activities":[
{
"name": "CopyFromAzureSQLDatabase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure SQL Database input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
Learn more about the supported write behaviors, configurations, and best practices from Best practice for loading data
into Azure SQL Database.
To copy data to Azure SQL Database, the following properties are supported in the copy activity sink section:
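The sink property list itself is not reproduced in this excerpt. As a minimal sketch, a plain sink that relies on the default bulk insert behavior can look like the following; writeBatchSize is the batch size property referenced later in this section, and the value is only illustrative:
"sink": {
    "type": "AzureSqlSink",
    "writeBatchSize": 10000
}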
When you enable partitioned copy, copy activity runs parallel queries against your Azure SQL Database source
to load data by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity.
For example, if you set parallelCopies to four, Data Factory concurrently generates and runs four queries based
on your specified partition option and settings, and each query retrieves a portion of data from your Azure SQL
Database.
We recommend that you enable parallel copy with data partitioning, especially when you load a large amount of data from your Azure SQL Database. The following are suggested configurations for different scenarios. When copying data into a file-based data store, it's recommended to write to a folder as multiple files (only specify the folder name); the performance is better than writing to a single file.
Full load from large table, with physical partitions:
Partition option: Physical partitions of table.
Full load from large table, without physical partitions, while with an integer or datetime column for data partitioning:
Partition option: Dynamic range partition.
Partition column (optional): Specify the column used to partition data. If not specified, the index or primary key column is used.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the table will be partitioned and copied. If not specified, copy activity auto detects the values.
Load a large amount of data by using a custom query, without physical partitions, while with an integer or date/datetime column for data partitioning:
Partition option: Dynamic range partition.
Query: SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>.
Partition column: Specify the column used to partition data.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the query result; all rows in the query result will be partitioned and copied. If not specified, copy activity auto detects the value.
"source": {
"type": "AzureSqlSource",
"partitionOption": "PhysicalPartitionsOfTable"
}
If the table has physical partitions, "HasPartition" is returned as "yes".
In your database, define a stored procedure with MERGE logic that the stored procedure activity points to. Assume that the target is the Marketing table with three columns: ProfileID, State, and Category. Do the upsert based on the ProfileID column.
Option 2: You can choose to invoke a stored procedure within the copy activity. This approach runs each batch
(as governed by the writeBatchSize property) in the source table instead of using bulk insert as the default
approach in the copy activity.
Option 3: You can use Mapping Data Flow which offers built-in insert/upsert/update methods.
Overwrite the entire table
You can configure the preCopyScript property in the copy activity sink. In this case, for each copy activity that
runs, Azure Data Factory runs the script first. Then it runs the copy to insert the data. For example, to overwrite
the entire table with the latest data, specify a script to first delete all the records before you bulk load the new
data from the source.
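For example, a sink configured to overwrite the target table could be sketched as follows; the table name is a placeholder, and TRUNCATE TABLE is just one way to remove the existing records before the bulk load:
"sink": {
    "type": "AzureSqlSink",
    "preCopyScript": "TRUNCATE TABLE <schema_name>.<table_name>"
}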
Write data with custom logic
The steps to write data with custom logic are similar to those described in the Upsert data section. When you need to apply extra processing before the final insertion of source data into the destination table, you can load the data into a staging table and then invoke a stored procedure activity, invoke a stored procedure in the copy activity sink to apply the data, or use Mapping Data Flow.
2. In your database, define the stored procedure with the same name as
sqlWriterStoredProcedureName . It handles input data from your specified source and merges into
the output table. The parameter name of the table type in the stored procedure is the same as
tableName defined in the dataset.
3. In Azure Data Factory, define the SQL sink section in the copy activity as follows:
"sink": {
"type": "AzureSqlSink",
"sqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureTableTypeParameterName": "Marketing",
"sqlWriterTableType": "MarketingType",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}
SQL Example: Select * from MyTable where customerId > 1000 and customerId < 2000
Parameterized SQL Example: "select * from {$tablename} where orderyear > {$year}"
Batch size : Enter a batch size to chunk large data into reads.
Isolation Level : The default for SQL sources in mapping data flow is read uncommitted. You can change the
isolation level here to one of these values:
Read Committed
Read Uncommitted
Repeatable Read
Serializable
None (ignore isolation level)
Sink transformation
Settings specific to Azure SQL Database are available in the Settings tab of the sink transformation.
Update method: Determines what operations are allowed on your database destination. The default is to only allow inserts. To update, upsert, or delete rows, an alter-row transformation is required to tag rows for those actions. For updates, upserts, and deletes, a key column or columns must be set to determine which row to alter. The column name that you pick as the key here is used by ADF as part of the subsequent update, upsert, or delete, so you must pick a column that exists in the sink mapping. If you do not wish to write the value to this key column, select "Skip writing key columns".
You can parameterize the key column used here for updating your target Azure SQL Database table. If you have multiple columns for a composite key, click "Custom Expression" to add dynamic content by using the ADF data flow expression language, which can include an array of strings with column names for a composite key.
Table action: Determines whether to recreate or remove all rows from the destination table prior to writing.
None: No action will be done to the table.
Recreate: The table will get dropped and recreated. Required if creating a new table dynamically.
Truncate: All rows from the target table will get removed.
Batch size : Controls how many rows are being written in each bucket. Larger batch sizes improve compression
and memory optimization, but risk out of memory exceptions when caching data.
Use TempDB: By default, Data Factory uses a global temporary table to store data as part of the loading process. You can alternatively uncheck the "Use TempDB" option and instead ask Data Factory to store the temporary holding table in the user database that is being used for this sink.
Pre and Post SQL scripts : Enter multi-line SQL scripts that will execute before (pre-processing) and after
(post-processing) data is written to your Sink database
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
sql_variant Object
time TimeSpan
timestamp Byte[]
tinyint Byte
uniqueidentifier Guid
varbinary Byte[]
xml String
NOTE
For data types that map to the Decimal interim type, currently Copy activity supports precision up to 28. If you have data
with precision larger than 28, consider converting to a string in SQL query.
Lookup activity properties
To learn details about the properties, check Lookup activity.
NOTE
SQL Server Always Encrypted supports the following scenarios:
1. Either source or sink data stores is using managed identity or service principal as key provider authentication type.
2. Both source and sink data stores are using managed identity as key provider authentication type.
3. Both source and sink data stores are using the same service principal as key provider authentication type.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores and formats.
Copy and transform data in Azure SQL Managed
Instance by using Azure Data Factory
7/16/2021 • 27 minutes to read • Edit Online
Supported capabilities
This SQL Managed Instance connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
For Copy activity, this SQL Managed Instance connector supports these functions:
Copying data by using SQL authentication and Azure Active Directory (Azure AD) Application token
authentication with a service principal or managed identities for Azure resources.
As a source, retrieving data by using a SQL query or a stored procedure. You can also choose to copy in parallel from a SQL MI source; see the Parallel copy from SQL MI section for details.
As a sink, automatically creating destination table if not exists based on the source schema; appending data
to a table or invoking a stored procedure with custom logic during copy.
Prerequisites
To access the SQL Managed Instance public endpoint, you can use an Azure Data Factory managed Azure
integration runtime. Make sure that you enable the public endpoint and also allow public endpoint traffic on the
network security group so that Azure Data Factory can connect to your database. For more information, see this
guidance.
To access the SQL Managed Instance private endpoint, set up a self-hosted integration runtime that can access
the database. If you provision the self-hosted integration runtime in the same virtual network as your managed
instance, make sure that your integration runtime machine is in a different subnet than your managed instance.
If you provision your self-hosted integration runtime in a different virtual network than your managed instance,
you can use either a virtual network peering or a virtual network to virtual network connection. For more
information, see Connect your application to SQL Managed Instance.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Azure Data Factory entities
specific to the SQL Managed Instance connector.
servicePrincipalId: Specify the application's client ID. Required: Yes, when you use Azure AD authentication with a service principal.
servicePrincipalKey: Specify the application's key. Mark this field as SecureString to store it securely in Azure Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes, when you use Azure AD authentication with a service principal.
tenant: Specify the tenant information, like the domain name or tenant ID, under which your application resides. Retrieve it by hovering the mouse in the upper-right corner of the Azure portal. Required: Yes, when you use Azure AD authentication with a service principal.
NOTE
SQL Managed Instance Always Encrypted is not supported in data flow.
For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources
SQL authentication
Example 1: use SQL authentication
{
"name": "AzureSqlMILinkedService",
"properties": {
"type": "AzureSqlMI",
"typeProperties": {
"connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;Integrated
Security=False;User ID=<username>;Password=<password>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "AzureSqlMILinkedService",
"properties": {
"type": "AzureSqlMI",
"typeProperties": {
"connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;Integrated
Security=False;User ID=<username>;Password=<password>;"
},
"alwaysEncryptedSettings": {
"alwaysEncryptedAkvAuthType": "ServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
4. Create contained database users for the service principal. Connect to the database from or to which you want to copy data, and run the following T-SQL:
5. Grant the service principal the needed permissions as you normally do for SQL users and others. Run the following code. For more options, see this document.
ALTER ROLE [role name e.g. db_owner] ADD MEMBER [your application name]
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlMI",
"typeProperties": {
"connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
3. Create contained database users for the Azure Data Factory managed identity. Connect to the database from or to which you want to copy data, and run the following T-SQL:
CREATE USER [your Data Factory name] FROM EXTERNAL PROVIDER
4. Grant the Data Factory managed identity the needed permissions as you normally do for SQL users and others. Run the following code. For more options, see this document.
ALTER ROLE [role name e.g. db_owner] ADD MEMBER [your Data Factory name]
{
"name": "AzureSqlDbLinkedService",
"properties": {
"type": "AzureSqlMI",
"typeProperties": {
"connectionString": "Data Source=<hostname,port>;Initial Catalog=<databasename>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for use to define datasets, see the datasets article. This section
provides a list of properties supported by the SQL Managed Instance dataset.
To copy data to and from SQL Managed Instance, the following properties are supported:
tableName: Name of the table/view with schema. This property is supported for backward compatibility; for new workloads, use schema and table. Required: No for source, Yes for sink.
Example
{
"name": "AzureSqlMIDataset",
"properties":
{
"type": "AzureSqlMITable",
"linkedServiceName": {
"referenceName": "<SQL Managed Instance linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"schema": "<schema_name>",
"table": "<table_name>"
}
}
}
TIP
To load data from SQL MI efficiently by using data partitioning, learn more from Parallel copy from SQL MI.
To copy data from SQL Managed Instance, the following properties are supported in the copy activity source
section:
Under partitionSettings :
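As in the Azure SQL Database section, the individual partitionSettings properties are not listed in this excerpt; a dynamic range source for SQL Managed Instance can be sketched the same way, with placeholder values:
"source": {
    "type": "SqlMISource",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "<partition_column_name>",
        "partitionUpperBound": "<upper_value_of_partition_column (optional)>",
        "partitionLowerBound": "<lower_value_of_partition_column (optional)>"
    }
}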
"activities":[
{
"name": "CopyFromAzureSqlMI",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Managed Instance input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlMISource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
Learn more about the supported write behaviors, configurations, and best practices from Best practice for loading data
into SQL Managed Instance.
To copy data to SQL Managed Instance, the following properties are supported in the copy activity sink section:
When you enable partitioned copy, copy activity runs parallel queries against your SQL MI source to load data
by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For example, if
you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on your
specified partition option and settings, and each query retrieves a portion of data from your SQL MI.
We recommend that you enable parallel copy with data partitioning, especially when you load a large amount of data from your SQL MI. The following are suggested configurations for different scenarios. When copying data into a file-based data store, it's recommended to write to a folder as multiple files (only specify the folder name); the performance is better than writing to a single file.
Full load from large table, with physical partitions:
Partition option: Physical partitions of table.
Full load from large table, without physical partitions, while with an integer or datetime column for data partitioning:
Partition option: Dynamic range partition.
Partition column (optional): Specify the column used to partition data. If not specified, the index or primary key column is used.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the table will be partitioned and copied. If not specified, copy activity auto detects the values.
Load a large amount of data by using a custom query, without physical partitions, while with an integer or date/datetime column for data partitioning:
Partition option: Dynamic range partition.
Query: SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>.
Partition column: Specify the column used to partition data.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the query result; all rows in the query result will be partitioned and copied. If not specified, copy activity auto detects the value.
"source": {
"type": "SqlMISource",
"partitionOption": "PhysicalPartitionsOfTable"
}
If the table has physical partitions, "HasPartition" is returned as "yes".
In your database, define a stored procedure with MERGE logic that the stored procedure activity points to. Assume that the target is the Marketing table with three columns: ProfileID, State, and Category. Do the upsert based on the ProfileID column.
Option 2: You can choose to invoke a stored procedure within the copy activity. This approach runs each batch
(as governed by the writeBatchSize property) in the source table instead of using bulk insert as the default
approach in the copy activity.
Overwrite the entire table
You can configure the preCopyScript property in a copy activity sink. In this case, for each copy activity that
runs, Azure Data Factory runs the script first. Then it runs the copy to insert the data. For example, to overwrite
the entire table with the latest data, specify a script to first delete all the records before you bulk load the new
data from the source.
Write data with custom logic
The steps to write data with custom logic are similar to those described in the Upsert data section. When you need to apply extra processing before the final insertion of source data into the destination table, you can load the data into a staging table and then invoke a stored procedure activity, or invoke a stored procedure in the copy activity sink to apply the data.
2. In your database, define the stored procedure with the same name as
sqlWriterStoredProcedureName . It handles input data from your specified source and merges into
the output table. The parameter name of the table type in the stored procedure is the same as
tableName defined in the dataset.
3. In Azure Data Factory, define the SQL MI sink section in the copy activity as follows:
"sink": {
"type": "SqlMISink",
"sqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureTableTypeParameterName": "Marketing",
"sqlWriterTableType": "MarketingType",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}
The Order By clause is not supported, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table that you can use in data flow.
Query example: Select * from MyTable where customerId > 1000 and customerId < 2000
Sink transformation
The below table lists the properties supported by Azure SQL Managed Instance sink. You can edit these
properties in the Sink options tab.
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
SQL MANAGED INSTANCE DATA TYPE | AZURE DATA FACTORY INTERIM DATA TYPE
sql_variant Object
time TimeSpan
timestamp Byte[]
tinyint Int16
uniqueidentifier Guid
varbinary Byte[]
xml String
NOTE
For data types that map to the Decimal interim type, currently Copy activity supports precision up to 28. If you have data
that requires precision larger than 28, consider converting to a string in a SQL query.
NOTE
SQL Server Always Encrypted supports the following scenarios:
1. Either source or sink data stores is using managed identity or service principal as key provider authentication type.
2. Both source and sink data stores are using managed identity as key provider authentication type.
3. Both source and sink data stores are using the same service principal as key provider authentication type.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores.
Copy and transform data in Azure Synapse
Analytics by using Azure Data Factory
5/11/2021 • 34 minutes to read • Edit Online
Supported capabilities
This Azure Synapse Analytics connector is supported for the following activities:
Copy activity with supported source/sink matrix table
Mapping data flow
Lookup activity
GetMetadata activity
For Copy activity, this Azure Synapse Analytics connector supports these functions:
Copy data by using SQL authentication and Azure Active Directory (Azure AD) Application token
authentication with a service principal or managed identities for Azure resources.
As a source, retrieve data by using a SQL query or stored procedure. You can also choose to parallel copy
from an Azure Synapse Analytics source, see the Parallel copy from Azure Synapse Analytics section for
details.
As a sink, load data by using PolyBase or COPY statement or bulk insert. We recommend PolyBase or COPY
statement for better copy performance. The connector also supports automatically creating destination table
if not exists based on the source schema.
IMPORTANT
If you copy data by using Azure Data Factory Integration Runtime, configure a server-level firewall rule so that Azure
services can access the logical SQL server. If you copy data by using a self-hosted integration runtime, configure the
firewall to allow the appropriate IP range. This range includes the machine's IP that is used to connect to Azure Synapse
Analytics.
Get started
TIP
To achieve best performance, use PolyBase or COPY statement to load data into Azure Synapse Analytics. The Use
PolyBase to load data into Azure Synapse Analytics and Use COPY statement to load data into Azure Synapse Analytics
sections have details. For a walkthrough with a use case, see Load 1 TB into Azure Synapse Analytics under 15 minutes
with Azure Data Factory.
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that define Data Factory entities specific to an Azure
Synapse Analytics connector.
servicePrincipalId: Specify the application's client ID. Required: Yes, when you use Azure AD authentication with a service principal.
servicePrincipalKey: Specify the application's key. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes, when you use Azure AD authentication with a service principal.
tenant: Specify the tenant information (domain name or tenant ID) under which your application resides. You can retrieve it by hovering the mouse in the top-right corner of the Azure portal. Required: Yes, when you use Azure AD authentication with a service principal.
For different authentication types, refer to the following sections on prerequisites and JSON samples,
respectively:
SQL authentication
Azure AD application token authentication: Service principal
Azure AD application token authentication: Managed identities for Azure resources
TIP
When creating a linked service for an Azure Synapse serverless SQL pool from the UI, choose "enter manually" instead of browsing from the subscription.
TIP
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the database is XXX and has been reached," add Pooling=false to your connection string and try again.
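For example, the connection string from the SQL authentication sample below could carry the extra setting like this; only the Pooling=false part is the point of the sketch:
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=<databasename>;User ID=<username>@<servername>;Password=<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30;Pooling=false"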
SQL authentication
Linked service example that uses SQL authentication
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;User ID=<username>@<servername>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
4. Grant the service principal the needed permissions as you normally do for SQL users or others. Run the following code, or refer to more options here. If you want to use PolyBase to load the data, learn the required database permissions.
5. Configure an Azure Synapse Analytics linked service in Azure Data Factory.
Linked service example that uses service principal authentication
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
3. Grant the Data Factory managed identity the needed permissions as you normally do for SQL users and others. Run the following code, or refer to more options here. If you want to use PolyBase to load the data, learn the required database permissions.
4. Configure an Azure Synapse Analytics linked service in Azure Data Factory.
Example:
{
"name": "AzureSqlDWLinkedService",
"properties": {
"type": "AzureSqlDW",
"typeProperties": {
"connectionString": "Server=tcp:<servername>.database.windows.net,1433;Database=
<databasename>;Connection Timeout=30"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
The following properties are supported for Azure Synapse Analytics dataset:
tableName: Name of the table/view with schema. This property is supported for backward compatibility; for new workloads, use schema and table. Required: No for source, Yes for sink.
{
"name": "AzureSQLDWDataset",
"properties":
{
"type": "AzureSqlDWTable",
"linkedServiceName": {
"referenceName": "<Azure Synapse Analytics linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"schema": "<schema_name>",
"table": "<table_name>"
}
}
}
TIP
To load data from Azure Synapse Analytics efficiently by using data partitioning, learn more from Parallel copy from Azure
Synapse Analytics.
To copy data from Azure Synapse Analytics, set the type property in the Copy Activity source to SqlDWSource .
The following properties are supported in the Copy Activity source section:
Under partitionSettings :
"activities":[
{
"name": "CopyFromAzureSQLDW",
"type": "Copy",
"inputs": [
{
"referenceName": "<Azure Synapse Analytics input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlDWSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Use PolyBase
Use COPY statement
Use bulk insert
The fastest and most scalable way to load data is through PolyBase or the COPY statement.
To copy data to Azure Synapse Analytics, set the sink type in Copy Activity to SqlDWSink . The following
properties are supported in the Copy Activity sink section:
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true,
"polyBaseSettings":
{
"rejectType": "percentage",
"rejectValue": 10.0,
"rejectSampleValue": 100,
"useTypeDefault": true
}
}
When you enable partitioned copy, copy activity runs parallel queries against your Azure Synapse Analytics
source to load data by partitions. The parallel degree is controlled by the parallelCopies setting on the copy
activity. For example, if you set parallelCopies to four, Data Factory concurrently generates and runs four
queries based on your specified partition option and settings, and each query retrieves a portion of data from
your Azure Synapse Analytics.
We recommend that you enable parallel copy with data partitioning, especially when you load a large amount of data from your Azure Synapse Analytics. The following are suggested configurations for different scenarios. When copying data into a file-based data store, it's recommended to write to a folder as multiple files (only specify the folder name); the performance is better than writing to a single file.
Full load from large table, with physical partitions:
Partition option: Physical partitions of table.
Full load from large table, without physical partitions, while with an integer or datetime column for data partitioning:
Partition option: Dynamic range partition.
Partition column (optional): Specify the column used to partition data. If not specified, the index or primary key column is used.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the table will be partitioned and copied. If not specified, copy activity auto detects the values.
Load a large amount of data by using a custom query, without physical partitions, while with an integer or date/datetime column for data partitioning:
Partition option: Dynamic range partition.
Query: SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>.
Partition column: Specify the column used to partition data.
Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the query result; all rows in the query result will be partitioned and copied. If not specified, copy activity auto detects the value.
"source": {
"type": "SqlDWSource",
"query":"SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND
<your_additional_where_clause>",
"partitionOption": "DynamicRange",
"partitionSettings": {
"partitionColumnName": "<partition_column_name>",
"partitionUpperBound": "<upper_value_of_partition_column (optional) to decide the partition stride,
not as data filter>",
"partitionLowerBound": "<lower_value_of_partition_column (optional) to decide the partition stride,
not as data filter>"
}
}
SELECT DISTINCT s.name AS SchemaName, t.name AS TableName, c.name AS ColumnName, CASE WHEN c.name IS NULL
THEN 'no' ELSE 'yes' END AS HasPartition
FROM sys.tables AS t
LEFT JOIN sys.objects AS o ON t.object_id = o.object_id
LEFT JOIN sys.schemas AS s ON o.schema_id = s.schema_id
LEFT JOIN sys.indexes AS i ON t.object_id = i.object_id
LEFT JOIN sys.index_columns AS ic ON ic.partition_ordinal > 0 AND ic.index_id = i.index_id AND ic.object_id
= t.object_id
LEFT JOIN sys.columns AS c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
LEFT JOIN sys.types AS y ON c.system_type_id = y.system_type_id
WHERE s.name='[your schema]' AND t.name = '[your table name]'
If the table has physical partitions, "HasPartition" is returned as "yes".
TIP
Learn more on Best practices for using PolyBase. When using PolyBase with Azure Integration Runtime, effective Data
Integration Units (DIU) for direct or staged storage-to-Synapse is always 2. Tuning the DIU doesn't impact the
performance, as loading data from storage is powered by Synapse engine.
The following PolyBase settings are supported under polyBaseSettings in copy activity:
TIP
To copy data efficiently to Azure Synapse Analytics, learn more from Azure Data Factory makes it even easier and
convenient to uncover insights from data when using Data Lake Store with Azure Synapse Analytics.
If the requirements aren't met, Azure Data Factory checks the settings and automatically falls back to the
BULKINSERT mechanism for the data movement.
1. The source linked service uses the following store types and authentication methods:
SUPPORTED SOURCE DATA STORE TYPE | SUPPORTED SOURCE AUTHENTICATION TYPE
Azure Data Lake Storage Gen2 | Account key authentication, managed identity authentication
IMPORTANT
When you use managed identity authentication for your storage linked service, learn the needed
configurations for Azure Blob and Azure Data Lake Storage Gen2 respectively.
If your Azure Storage is configured with VNet service endpoint, you must use managed identity authentication
with "allow trusted Microsoft service" enabled on storage account, refer to Impact of using VNet Service
Endpoints with Azure storage.
2. The source data format is of Parquet , ORC , or Delimited text , with the following configurations:
a. Folder path doesn't contain wildcard filter.
b. File name is empty, or points to a single file. If you specify wildcard file name in copy activity, it can
only be * or *.* .
c. rowDelimiter is default , \n , \r\n , or \r .
d. nullValue is left as default or set to empty string (""), and treatEmptyAsNull is left as default or set
to true.
e. encodingName is left as default or set to utf-8 .
f. quoteChar , escapeChar , and skipLineCount aren't specified. PolyBase supports skipping header rows, which can be configured as firstRowAsHeader in ADF.
g. compression can be no compression , GZip , or Deflate .
3. If your source is a folder, recursive in copy activity must be set to true.
4. wildcardFolderPath , wildcardFilename , modifiedDateTimeStart , modifiedDateTimeEnd , prefix ,
enablePartitionDiscovery , and additionalColumns are not specified.
NOTE
If your source is a folder, note PolyBase retrieves files from the folder and all of its subfolders, and it doesn't retrieve data
from files for which the file name begins with an underline (_) or a period (.), as documented here - LOCATION argument.
"activities":[
{
"name": "CopyFromAzureBlobToSQLDataWarehouseViaPolyBase",
"type": "Copy",
"inputs": [
{
"referenceName": "ParquetDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ParquetSource",
"storeSettings":{
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
}
}
}
]
IMPORTANT
When you use managed identity authentication for your staging linked service, learn the needed configurations for
Azure Blob and Azure Data Lake Storage Gen2 respectively.
If your staging Azure Storage is configured with VNet service endpoint, you must use managed identity authentication
with "allow trusted Microsoft service" enabled on storage account, refer to Impact of using VNet Service Endpoints
with Azure storage.
IMPORTANT
If your staging Azure Storage is configured with Managed Private Endpoint and has the storage firewall enabled, you
must use managed identity authentication and grant Storage Blob Data Reader permissions to the Synapse SQL Server to
ensure it can access the staged files during the PolyBase load.
"activities":[
{
"name": "CopyFromSQLServerToSQLDataWarehouseViaPolyBase",
"type": "Copy",
"inputs": [
{
"referenceName": "SQLServerDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "AzureSQLDWDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlDWSink",
"allowPolyBase": true
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingStorage",
"type": "LinkedServiceReference"
}
}
}
}
]
The solution is to unselect the "Use type default" option (set it to false) in the copy activity sink -> PolyBase settings. "USE_TYPE_DEFAULT" is a PolyBase native configuration, which specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file.
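In JSON, the same change corresponds to setting useTypeDefault to false under polyBaseSettings, as sketched below; the other values are carried over from the earlier PolyBase sink example:
"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "useTypeDefault": false
    }
}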
Check the tableName property in Azure Synapse Analytics
The following table gives examples of how to specify the tableName property in the JSON dataset. It shows
several combinations of schema and table names.
DB SCHEMA | TABLE NAME | TABLENAME JSON PROPERTY
If you see the following error, the problem might be the value you specified for the tableName property. See
the preceding table for the correct way to specify values for the tableName JSON property.
All columns of the table must be specified in the INSERT BULK statement.
The NULL value is a special form of the default value. If the column is nullable, the input data in the blob for that
column might be empty. But it can't be missing from the input dataset. PolyBase inserts NULL for missing values
in Azure Synapse Analytics.
External file access failed
If you receive the following error, ensure that you are using managed identity authentication and have granted
Storage Blob Data Reader permissions to the Azure Synapse workspace's managed identity.
For more information, see Grant permissions to managed identity after workspace creation.
Use COPY statement to load data into Azure Synapse Analytics
Azure Synapse Analytics COPY statement directly supports loading data from Azure Blob and Azure Data
Lake Storage Gen2 . If your source data meets the criteria described in this section, you can choose to use
COPY statement in ADF to load data into Azure Synapse Analytics. Azure Data Factory checks the settings and
fails the copy activity run if the criteria is not met.
NOTE
Currently, Data Factory only supports copying from the COPY statement-compatible sources mentioned below.
TIP
When using COPY statement with Azure Integration Runtime, effective Data Integration Units (DIU) is always 2. Tuning
the DIU doesn't impact the performance, as loading data from storage is powered by Synapse engine.
Azure Data Lake Storage Gen2 | Delimited text, Parquet, ORC | Account key authentication, service principal authentication, managed identity authentication
IMPORTANT
When you use managed identity authentication for your storage linked service, learn the needed
configurations for Azure Blob and Azure Data Lake Storage Gen2 respectively.
If your Azure Storage is configured with VNet service endpoint, you must use managed identity authentication
with "allow trusted Microsoft service" enabled on storage account, refer to Impact of using VNet Service
Endpoints with Azure storage.
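A sink that opts in to the COPY statement can be sketched roughly as follows; the allowCopyCommand property name is an assumption not shown in this excerpt, so verify it against the current connector reference before relying on it:
"sink": {
    "type": "SqlDWSink",
    "allowCopyCommand": true
}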
Batch size : Enter a batch size to chunk large data into reads. In data flows, ADF uses this setting to set Spark columnar caching. This is an optional field, which uses Spark defaults if it is left blank.
Isolation Level : The default for SQL sources in mapping data flow is read uncommitted. You can change the
isolation level here to one of these values:
Read Committed
Read Uncommitted
Repeatable Read
Serializable
None (ignore isolation level)
Sink transformation
Settings specific to Azure Synapse Analytics are available in the Settings tab of the sink transformation.
Update method: Determines what operations are allowed on your database destination. The default is to only
allow inserts. To update, upsert, or delete rows, an alter-row transformation is required to tag rows for those
actions. For updates, upserts and deletes, a key column or columns must be set to determine which row to alter.
Table action: Determines whether to recreate or remove all rows from the destination table prior to writing.
None: No action will be done to the table.
Recreate: The table will get dropped and recreated. Required if creating a new table dynamically.
Truncate: All rows from the target table will get removed.
Enable staging: This enables loading into Azure Synapse Analytics SQL pools by using the copy command and is recommended for most Synapse sinks. The staging storage is configured in the Execute Data Flow activity.
When you use managed identity authentication for your storage linked service, learn the needed
configurations for Azure Blob and Azure Data Lake Storage Gen2 respectively.
If your Azure Storage is configured with VNet service endpoint, you must use managed identity
authentication with "allow trusted Microsoft service" enabled on storage account, refer to Impact of using
VNet Service Endpoints with Azure storage.
Batch size : Controls how many rows are being written in each bucket. Larger batch sizes improve compression
and memory optimization, but risk out of memory exceptions when caching data.
Pre and Post SQL scripts : Enter multi-line SQL scripts that will execute before (pre-processing) and after
(post-processing) data is written to your Sink database
TIP
Refer to Table data types in Azure Synapse Analytics article on Azure Synapse Analytics supported data types and the
workarounds for unsupported ones.
AZURE SYNAPSE ANALYTICS DATA TYPE | DATA FACTORY INTERIM DATA TYPE
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
time TimeSpan
tinyint Byte
uniqueidentifier Guid
varbinary Byte[]
Next steps
For a list of data stores supported as sources and sinks by Copy Activity in Azure Data Factory, see supported
data stores and formats.
Copy data to and from Azure Table storage by
using Azure Data Factory
5/28/2021 • 10 minutes to read • Edit Online
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Supported capabilities
This Azure Table storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from any supported source data store to Table storage. You also can copy data from Table
storage to any supported sink data store. For a list of data stores that are supported as sources or sinks by the
copy activity, see the Supported data stores table.
Specifically, this Azure Table connector supports copying data by using account key and service shared access
signature authentications.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Table storage.
NOTE
If you were using "AzureStorage" type linked service, it is still supported as-is, while you are suggested to use this new
"AzureTableStorage" linked service type going forward.
Example:
{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=
<accountkey>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
Data Factory now supports both service shared access signatures and account shared access signatures. For more information about shared access signatures, see Grant limited access to Azure Storage resources using shared access signatures (SAS).
TIP
To generate a service shared access signature for your storage account, you can execute the following PowerShell
commands. Replace the placeholders and grant the needed permission.
$context = New-AzStorageContext -StorageAccountName <accountName> -StorageAccountKey <accountKey>
New-AzStorageContainerSASToken -Name <containerName> -Context $context -Permission rwdl -StartTime
<startTime> -ExpiryTime <endTime> -FullUri
To use shared access signature authentication, the following properties are supported.
NOTE
If you were using "AzureStorage" type linked service, it is still supported as-is, while you are suggested to use this new
"AzureTableStorage" linked service type going forward.
Example:
{
"name": "AzureTableStorageLinkedService",
"properties": {
"type": "AzureTableStorage",
"typeProperties": {
"sasUri": {
"type": "SecureString",
"value": "<SAS URI of the Azure Storage resource e.g.
https://<account>.table.core.windows.net/<table>?sv=<storage version>&st=<start time>&se=<expire
time>&sr=<resource>&sp=<permissions>&sip=<ip range>&spr=<protocol>&sig=<signature>>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
When you create a shared access signature URI, consider the following points:
Set appropriate read/write permissions on objects based on how the linked service (read, write, read/write) is
used in your data factory.
Set the Expiry time appropriately. Make sure that the access to Storage objects doesn't expire within the active period of the pipeline.
The URI should be created at the right table level based on the need.
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Azure Table dataset.
To copy data to and from Azure Table, set the type property of the dataset to AzureTable . The following
properties are supported.
Example:
{
"name": "AzureTableDataset",
"properties":
{
"type": "AzureTable",
"typeProperties": {
"tableName": "MyTable"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Azure Table storage linked service name>",
"type": "LinkedServiceReference"
}
}
}
azureTableSourceQuery examples
NOTE
Azure Table query operation times out in 30 seconds as enforced by Azure Table service. Learn how to optimize the query
from Design for querying article.
In Azure Data Factory, if you want to filter the data against a datetime type column, refer to this example:
If you want to filter the data against a string type column, refer to this example:
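The referenced query examples are not included in this excerpt. As a sketch, the filters below use the Azure Table OData filter syntax; the column names LastModifiedTime and DepartmentName are hypothetical:
"source": {
    "type": "AzureTableSource",
    "azureTableSourceQuery": "LastModifiedTime gt datetime'2017-10-01T00:00:00' and LastModifiedTime le datetime'2017-10-02T00:00:00'"
}
"source": {
    "type": "AzureTableSource",
    "azureTableSourceQuery": "DepartmentName eq 'Sales'"
}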
If you use the pipeline parameter, cast the datetime value to proper format according to the previous samples.
Azure Table as a sink type
To copy data to Azure Table, set the sink type in the copy activity to AzureTableSink . The following properties
are supported in the copy activity sink section.
writeBatchTimeout: Inserts data into Azure Table when writeBatchSize or writeBatchTimeout is hit. Allowed values are timespan; an example is "00:20:00" (20 minutes). Required: No (default is 90 seconds, the storage client's default timeout).
Example:
"activities":[
{
"name": "CopyToAzureTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Azure Table output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "<column name>",
"azureTableRowKeyName": "<column name>"
}
}
}
]
azureTablePartitionKeyName
Map a source column to a destination column by using the "translator" property before you can use the
destination column as azureTablePartitionKeyName.
In the following example, source column DivisionID is mapped to the destination column DivisionID:
"translator": {
"type": "TabularTranslator",
"columnMappings": "DivisionID: DivisionID, FirstName: FirstName, LastName: LastName"
}
"sink": {
"type": "AzureTableSink",
"azureTablePartitionKeyName": "DivisionID"
}
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Binary format in Azure Data Factory
5/14/2021 • 4 minutes to read • Edit Online
NOTE
When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset.
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Binary dataset.
{
"name": "BinaryDataset",
"properties": {
"type": "Binary",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"compression": {
"type": "ZipDeflate"
}
}
}
}
Binary as source
The following properties are supported in the copy activity source section.
"activities": [
{
"name": "CopyFromBinary",
"type": "Copy",
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true,
"deleteFilesAfterCompletion": true
},
"formatSettings": {
"type": "BinaryReadSettings",
"compressionProperties": {
"type": "ZipDeflateReadSettings",
"preserveZipFileNameAsFolder": false
}
}
},
...
}
...
}
]
Binary as sink
The following properties are supported in the copy activity sink section.
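The sink property list is not reproduced in this excerpt. As a sketch, a Binary sink that writes to Azure Blob Storage can look like the following; AzureBlobStorageWriteSettings mirrors the read settings used in the source example above:
"sink": {
    "type": "BinarySink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    }
}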
Copy data from Cassandra using Azure Data Factory
Supported capabilities
This Cassandra connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Cassandra database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Cassandra connector supports:
Cassandra versions 2.x and 3.x .
Copying data using Basic or Anonymous authentication.
NOTE
For activity running on Self-hosted Integration Runtime, Cassandra 3.x is supported since IR version 3.7 and above.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in Cassandra driver, therefore you don't need to manually install any
driver when copying data from/to Cassandra.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Cassandra connector.
username: Specify the user name for the user account. Required: Yes, if authenticationType is set to Basic.
password: Specify the password for the user account. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes, if authenticationType is set to Basic.
NOTE
Currently connection to Cassandra using TLS is not supported.
Example:
{
"name": "CassandraLinkedService",
"properties": {
"type": "Cassandra",
"typeProperties": {
"host": "<host>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Cassandra dataset.
To copy data from Cassandra, set the type property of the dataset to CassandraTable . The following properties
are supported:
Example:
{
"name": "CassandraDataset",
"properties": {
"type": "CassandraTable",
"typeProperties": {
"keySpace": "<keyspace name>",
"tableName": "<table name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Cassandra linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom query to read data. SQL-92 query or CQL query. See CQL reference. Required: No (if "tableName" and "keyspace" in the dataset are specified).
Example:
"activities":[
{
"name": "CopyFromCassandra",
"type": "Copy",
"inputs": [
{
"referenceName": "<Cassandra input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CassandraSource",
"query": "select id, firstname, lastname from mykeyspace.mytable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
ASCII String
BIGINT Int64
BLOB Byte[]
BOOLEAN Boolean
DECIMAL Decimal
DOUBLE Double
FLOAT Single
INET String
INT Int32
TEXT String
TIMESTAMP DateTime
CASSANDRA DATA TYPE | DATA FACTORY INTERIM DATA TYPE
TIMEUUID Guid
UUID Guid
VARCHAR String
VARINT Decimal
NOTE
For collection types (map, set, list, etc.), refer to Work with Cassandra collection types using virtual table section.
User-defined types are not supported.
The lengths of Binary and String columns cannot be greater than 4000.
1 "sample value 1" ["1", "2", "3"] {"S1": "a", "S2": "b"} {"A", "B", "C"}
3 "sample value 3" ["100", "101", "102", {"S1": "t"} {"A", "E"}
"105"]
The driver would generate multiple virtual tables to represent this single table. The foreign key columns in the
virtual tables reference the primary key columns in the real table, and indicate which real table row the virtual
table row corresponds to.
The first virtual table is the base table, named "ExampleTable", shown in the following table:
PK_INT VALUE
The base table contains the same data as the original database table except for the collections, which are omitted
from this table and expanded in other virtual tables.
The following tables show the virtual tables that renormalize the data from the List, Map, and StringSet columns.
The columns with names that end with "_index" or "_key" indicate the position of the data within the original list
or map. The columns with names that end with "_value" contain the expanded data from the collection.
Table "ExampleTable_vt_List":
1 0 1
1 1 2
1 2 3
3 0 100
3 1 101
3 2 102
3 3 103
Table "ExampleTable_vt_Map":
P K _IN T M A P _K EY M A P _VA L UE
1 S1 A
1 S2 b
3 S1 t
Table "ExampleTable_vt_StringSet":
1 A
1 B
1 C
3 A
3 E
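To copy the expanded collection data, point the copy activity source at the corresponding virtual table. The fragment below is a minimal sketch that reuses the CassandraSource type shown earlier; the exact keyspace qualification and quoting depend on how the driver exposes the virtual table in your environment:
"source": {
    "type": "CassandraSource",
    "query": "select * from \"ExampleTable_vt_List\""
}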
Lookup activity properties
To learn details about the properties, check Lookup activity.
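As an illustration only, a Lookup activity that reuses the Cassandra dataset and source type from this article might look like the following minimal sketch (the activity name and query are hypothetical, and firstRowOnly is optional):
{
    "name": "LookupFromCassandra",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "CassandraSource",
            "query": "select id from mykeyspace.mytable"
        },
        "dataset": {
            "referenceName": "<Cassandra input dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}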
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Common Data Model format in Azure Data Factory
3/5/2021 • 5 minutes to read • Edit Online
NOTE
When writing CDM entities, you must have an existing CDM entity definition (metadata schema) already defined to use as
a reference. The ADF data flow sink will read that CDM entity file and import the schema into your sink for field mapping.
Source properties
The below table lists the properties supported by a CDM source. You can edit these properties in the Source
options tab.
Schema linked service: The linked service where the corpus is located. Required: yes, if using manifest. Allowed values: 'adlsgen2' or 'github'. Data flow script property: corpusStore.
Corpus folder: The root location of the corpus. Required: yes, if using manifest. Allowed values: String. Data flow script property: corpusPath.
When selecting "Entity Reference" both in the Source and Sink transformations, you can select from these three
options for the location of your entity reference:
Local uses the entity defined in the manifest file already being used by ADF
Custom will ask you to point to an entity manifest file that is different from the manifest file ADF is using
Standard will use an entity reference from the standard library of CDM entities maintained in GitHub.
Sink settings
Point to the CDM entity reference file that contains the definition of the entity you would like to write.
Define the partition path and format of the output files that you want ADF to use for writing your entities.
Set the output file location and the location and name for the manifest file.
Import schema
CDM is only available as an inline dataset and, by default, doesn't have an associated schema. To get column
metadata, click the Import schema button in the Projection tab. This will allow you to reference the column
names and data types specified by the corpus. To import the schema, a data flow debug session must be active
and you must have an existing CDM entity definition file to point to.
When mapping data flow columns to entity properties in the Sink transformation, click on the "Mapping" tab
and select "Import Schema". ADF will read the entity reference that you pointed to in your Sink options, allowing
you to map to the target CDM schema.
NOTE
When using model.json source type that originates from Power BI or Power Platform dataflows, you may encounter
"corpus path is null or empty" errors from the source transformation. This is likely due to formatting issues of the
partition location path in the model.json file. To fix this, follow these steps:
Sink properties
The below table lists the properties supported by a CDM sink. You can edit these properties in the Settings tab.
NAME    DESCRIPTION    REQUIRED    ALLOWED VALUES    DATA FLOW SCRIPT PROPERTY
Next steps
Create a source transformation in mapping data flow.
Copy data from Concur using Azure Data Factory
(Preview)
5/6/2021 • 4 minutes to read • Edit Online
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Concur connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Concur to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
NOTE
Partner account is currently not supported.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Concur connector.
Under connectionProperties:
Example:
{
"name":"ConcurLinkedService",
"properties":{
"type":"Concur",
"typeProperties":{
"connectionProperties":{
"host":"<host e.g. implementation.concursolutions.com>",
"baseUrl":"<base URL for authorization e.g. us-impl.api.concursolutions.com>",
"authenticationType":"OAuth_2.0_Bearer",
"clientId":"<client id>",
"clientSecret":{
"type": "SecureString",
"value": "<client secret>"
},
"username":"fakeUserName",
"password":{
"type": "SecureString",
"value": "<password>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}
Example (legacy):
Note that the following is a legacy linked service model that does not use connectionProperties and uses OAuth_2.0 authentication.
{
"name": "ConcurLinkedService",
"properties": {
"type": "Concur",
"typeProperties": {
"clientId" : "<clientId>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Concur dataset.
To copy data from Concur, set the type property of the dataset to ConcurObject . There is no additional type-
specific property in this type of dataset. The following properties are supported:
Example
{
"name": "ConcurDataset",
"properties": {
"type": "ConcurObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Concur linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Opportunities where Id = xxx". Required: no (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromConcur",
"type": "Copy",
"inputs": [
{
"referenceName": "<Concur input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ConcurSource",
"query": "SELECT * FROM Opportunities where Id = xxx"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Couchbase using Azure Data
Factory (Preview)
5/6/2021 • 3 minutes to read • Edit Online
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Couchbase connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Couchbase to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Couchbase connector.
Example:
{
"name": "CouchbaseLinkedService",
"properties": {
"type": "Couchbase",
"typeProperties": {
"connectionString": "Server=<server>; Port=<port>;AuthMech=1;CredString=[{\"user\": \"JSmith\",
\"pass\":\"access123\"}, {\"user\": \"Admin\", \"pass\":\"simba123\"}];"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Couchbase dataset.
To copy data from Couchbase, set the type property of the dataset to CouchbaseTable . The following
properties are supported:
Example
{
"name": "CouchbaseDataset",
"properties": {
"type": "CouchbaseTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Couchbase linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: no (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromCouchbase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Couchbase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "CouchbaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from DB2 by using Azure Data Factory
5/6/2021 • 7 minutes to read • Edit Online
Supported capabilities
This DB2 database connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from DB2 database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this DB2 connector supports the following IBM DB2 platforms and versions with Distributed
Relational Database Architecture (DRDA) SQL Access Manager (SQLAM) version 9, 10 and 11. It utilizes the
DDM/DRDA protocol.
IBM DB2 for z/OS 12.1
IBM DB2 for z/OS 11.1
IBM DB2 for z/OS 10.1
IBM DB2 for i 7.3
IBM DB2 for i 7.2
IBM DB2 for i 7.1
IBM DB2 for LUW 11
IBM DB2 for LUW 10.5
IBM DB2 for LUW 10.1
TIP
DB2 connector is built on top of Microsoft OLE DB Provider for DB2. To troubleshoot DB2 connector errors, refer to Data
Provider Error Codes.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in DB2 driver, therefore you don't need to manually install any driver
when copying data from DB2.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
DB2 connector.
TIP
If you receive an error message that states "The package corresponding to an SQL statement execution request was not found. SQLSTATE=51002 SQLCODE=-805", the reason is that a needed package is not created for the user. By default, ADF tries to create the package under a collection named after the user you used to connect to DB2. Specify the package collection property to indicate where you want ADF to create the needed packages when it queries the database.
Example:
{
"name": "Db2LinkedService",
"properties": {
"type": "Db2",
"typeProperties": {
"connectionString":"server=<server:port>;database=<database>;authenticationType=Basic;username=
<username>;password=<password>;packageCollection=<packagecollection>;certificateCommonName=<certname>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example: store the password in Azure Key Vault
{
"name": "Db2LinkedService",
"properties": {
"type": "Db2",
"typeProperties": {
"connectionString": "server=<server:port>;database=<database>;authenticationType=Basic;username=
<username>;packageCollection=<packagecollection>;certificateCommonName=<certname>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
If you were using a DB2 linked service with the following payload, it is still supported as-is, but we suggest that you use the new one going forward.
Previous payload:
{
"name": "Db2LinkedService",
"properties": {
"type": "Db2",
"typeProperties": {
"server": "<servername:port>",
"database": "<dbname>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by DB2 dataset.
To copy data from DB2, the following properties are supported:
tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workloads. Required: no (if "query" in the activity source is specified).
Example
{
"name": "DB2Dataset",
"properties":
{
"type": "Db2Table",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<DB2 linked service name>",
"type": "LinkedServiceReference"
}
}
}
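As a sketch of the newer schema and table properties mentioned above, the typeProperties block can name the table explicitly; the schema and table values here simply echo the query example later in this section:
{
    "name": "DB2Dataset",
    "properties": {
        "type": "Db2Table",
        "typeProperties": {
            "schema": "DB2ADMIN",
            "table": "Customers"
        },
        "schema": [],
        "linkedServiceName": {
            "referenceName": "<DB2 linked service name>",
            "type": "LinkedServiceReference"
        }
    }
}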
If you were using the RelationalTable typed dataset, it is still supported as-is, but we suggest that you use the new one going forward.
query: Use the custom SQL query to read data. For example: "query": "SELECT * FROM \"DB2ADMIN\".\"Customers\"". Required: no (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromDB2",
"type": "Copy",
"inputs": [
{
"referenceName": "<DB2 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "Db2Source",
"query": "SELECT * FROM \"DB2ADMIN\".\"Customers\""
},
"sink": {
"type": "<sink type>"
}
}
}
]
If you were using the RelationalSource typed source, it is still supported as-is, but we suggest that you use the new one going forward.
DB2 DATABASE TYPE    DATA FACTORY INTERIM DATA TYPE
BigInt               Int64
Binary               Byte[]
Blob                 Byte[]
Char                 String
Clob                 String
Date                 Datetime
DB2DynArray          String
DbClob               String
Decimal              Decimal
DecimalFloat         Decimal
Double               Double
Float                Double
Graphic              String
Integer              Int32
LongVarBinary        Byte[]
LongVarChar          String
LongVarGraphic       String
Numeric              Decimal
Real                 Single
SmallInt             Int16
Time                 TimeSpan
Timestamp            DateTime
VarBinary            Byte[]
VarChar              String
VarGraphic           String
Xml                  Byte[]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Dynamics 365 (Microsoft
Dataverse) or Dynamics CRM by using Azure Data
Factory
6/16/2021 • 14 minutes to read • Edit Online
Supported capabilities
This connector is supported for the following activities:
Copy activity with supported source and sink matrix
Lookup activity
You can copy data from Dynamics 365 (Microsoft Dataverse) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Microsoft Dataverse) or
Dynamics CRM. For a list of data stores that a copy activity supports as sources and sinks, see the Supported
data stores table.
NOTE
Effective November 2020, Common Data Service has been renamed to Microsoft Dataverse. This article is updated to
reflect the latest terminology.
This Dynamics connector supports Dynamics versions 7 through 9 for both online and on-premises. More
specifically:
Version 7 maps to Dynamics CRM 2015.
Version 8 maps to Dynamics CRM 2016 and the early version of Dynamics 365.
Version 9 maps to the later version of Dynamics 365.
Refer to the following table of supported authentication types and configurations for Dynamics versions and
products.
Dataverse, Dynamics 365 online, and Dynamics CRM online: authentication types are Azure Active Directory (Azure AD) service principal and Office 365. Linked service samples: Dynamics online and Azure AD service-principal or Office 365 authentication.
Dynamics 365 on-premises with internet-facing deployment (IFD): authentication type is IFD. Linked service sample: Dynamics on-premises with IFD and IFD authentication.
NOTE
With the deprecation of regional Discovery Service, Azure Data Factory has upgraded to leverage global Discovery Service
while using Office 365 Authentication.
IMPORTANT
If your tenant and user is configured in Azure Active Directory for conditional access and/or Multi-Factor Authentication is
required, you will not be able to use the Office 365 Authentication type. For those situations, you must use Azure Active Directory (Azure AD) service principal authentication.
For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing
This connector doesn't support other application types like Finance, Operations, and Talent.
TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.
Prerequisites
To use this connector with Azure AD service-principal authentication, you must set up server-to-server (S2S)
authentication in Dataverse or Dynamics. First register the application user (Service Principal) in Azure Active
Directory. You can find out how to do this here. During application registration you will need to create that user
in Dataverse or Dynamics and grant permissions. Those permissions can either be granted directly or indirectly
by adding the application user to a team which has been granted permissions in Dataverse or Dynamics. You
can find more information on how to set up an application user to authenticate with Dataverse here.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.
servicePrincipalCredentialType: The credential type to use for service-principal authentication. Valid values are "ServicePrincipalKey" and "ServicePrincipalCert". Required: yes, when authentication is "AADServicePrincipal".
password: The password for the user account you specified as the username. Mark this field with "SecureString" to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: yes, when authentication is "Office365".
NOTE
The Dynamics connector formerly used the optional organizationName property to identify your Dynamics CRM or Dynamics 365 online instance. While that property still works, we suggest you specify the new serviceUri property instead to gain better performance for instance discovery.
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "AADServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalCredential": "<service principal key>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
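If you use a certificate credential rather than a key, the servicePrincipalCredentialType and servicePrincipalCredential pair changes along the following lines. This is a minimal sketch only: it assumes the certificate is stored in Azure Key Vault and reuses the Key Vault reference pattern from other examples in this documentation, with a placeholder secret name.
{
    "name": "DynamicsLinkedService",
    "properties": {
        "type": "Dynamics",
        "typeProperties": {
            "deploymentType": "Online",
            "serviceUri": "https://<organization-name>.crm[x].dynamics.com",
            "authenticationType": "AADServicePrincipal",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalCredentialType": "ServicePrincipalCert",
            "servicePrincipalCredential": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<name of the certificate secret>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}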
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "Office365",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
port: The port of the on-premises Dynamics server. Required: no. The default value is 443.
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, the following properties are supported:
entityName: The logical name of the entity to retrieve. Required: no for source if the activity source is specified as "query"; yes for sink.
Example
{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"schema": [],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}
IMPORTANT
When you copy data from Dynamics, explicit column mapping from Dynamics to sink is optional. But we highly
recommend the mapping to ensure a deterministic copy result.
When Data Factory imports a schema in the authoring UI, it infers the schema. It does so by sampling the top rows
from the Dynamics query result to initialize the source column list. In that case, columns with no values in the top
rows are omitted. The same behavior applies to copy executions if there is no explicit mapping. You can review and add
more columns into the mapping, which are honored during copy runtime.
Example
"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
writeBatchSize: The row count of data written to Dynamics in each batch. Required: no. The default value is 10.
ignoreNullValues: Whether to ignore null values from input data other than key fields during a write operation. Required: no. The default value is FALSE.
For Dynamics 365 online, there's a limit of two concurrent batch calls per organization. If that limit is exceeded, a
"Server Busy" exception is thrown before the first request is ever run. Keep writeBatchSize at 10 or less to
avoid such throttling of concurrent calls.
The optimal combination of writeBatchSize and parallelCopies depends on the schema of your entity.
Schema elements include the number of columns, row size, and number of plug-ins, workflows, or workflow
activities hooked up to those calls. The default setting of writeBatchSize (10) × parallelCopies (10) is the
recommendation according to the Dynamics service. This value works for most Dynamics entities, although it
might not give the best performance. You can tune the performance by adjusting the combination in your copy
activity settings.
Example
"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
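To tune the combination discussed above, parallelCopies sits at the copy activity level next to the sink settings. The following is a minimal sketch of the relevant typeProperties fragment, using the default values mentioned earlier:
"typeProperties": {
    "source": {
        "type": "<source type>"
    },
    "sink": {
        "type": "DynamicsSink",
        "writeBehavior": "Upsert",
        "writeBatchSize": 10,
        "ignoreNullValues": true
    },
    "parallelCopies": 10
}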
You can also add filters to filter the views. For example, add the following filter to get a view named "My Active
Accounts" in account entity.
DYNAMICS DATA TYPE               DATA FACTORY INTERIM DATA TYPE    SUPPORTED AS SOURCE    SUPPORTED AS SINK
AttributeTypeCode.BigInt         Long                              ✓                      ✓
AttributeTypeCode.Boolean        Boolean                           ✓                      ✓
AttributeType.DateTime           Datetime                          ✓                      ✓
AttributeType.Decimal            Decimal                           ✓                      ✓
AttributeType.Double             Double                            ✓                      ✓
AttributeType.EntityName         String                            ✓                      ✓
AttributeType.Integer            Int32                             ✓                      ✓
AttributeType.ManagedProperty    Boolean                           ✓
AttributeType.Memo               String                            ✓                      ✓
AttributeType.Money              Decimal                           ✓                      ✓
AttributeType.Picklist           Int32                             ✓                      ✓
AttributeType.Uniqueidentifier   GUID                              ✓                      ✓
AttributeType.String             String                            ✓                      ✓
AttributeType.State              Int32                             ✓                      ✓
AttributeType.Status             Int32                             ✓                      ✓
NOTE
The Dynamics data types AttributeType.CalendarRules, AttributeType.MultiSelectPicklist, and AttributeType.PartyList aren't supported.
Next steps
For a list of data stores the copy activity in Data Factory supports as sources and sinks, see Supported data
stores.
Delimited text format in Azure Data Factory
7/12/2021 • 9 minutes to read • Edit Online
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the delimited text dataset.
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"columnDelimiter": ",",
"quoteChar": "\"",
"escapeChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
"activities": [
{
"name": "CopyFromDelimitedText",
"type": "Copy",
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
},
"formatSettings": {
"type": "DelimitedTextReadSettings",
"skipLineCount": 3,
"compressionProperties": {
"type": "ZipDeflateReadSettings",
"preserveZipFileNameAsFolder": false
}
}
},
...
}
...
}
]
fileExtension: The file extension used to name the output files, for example, .csv or .txt. It must be specified when fileName is not specified in the output DelimitedText dataset. When a file name is configured in the output dataset, it is used as the sink file name and the file extension setting is ignored. Required: yes, when a file name is not specified in the output dataset.
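For orientation, a copy activity sink fragment that sets fileExtension through DelimitedTextReadSettings' write-side counterpart, DelimitedTextWriteSettings, might look like the following minimal sketch; the Blob storage write settings are assumed for the example:
"sink": {
    "type": "DelimitedTextSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    },
    "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "fileExtension": ".csv"
    }
}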
NOTE
In data flow sources, the list of files setting is limited to 1024 entries in your file. To include more files, use wildcards in your file list.
Source example
The below image is an example of a delimited text source configuration in mapping data flows.
The associated data flow script is:
source(
allowSchemaDrift: true,
validateSchema: false,
multiLineRow: true,
wildcardPaths:['*.csv']) ~> CSVSource
NOTE
Data flow sources support a limited set of Linux globbing that is supported by Hadoop file systems.
Sink properties
The below table lists the properties supported by a delimited text sink. You can edit these properties in the
Settings tab.
NAME    DESCRIPTION    REQUIRED    ALLOWED VALUES    DATA FLOW SCRIPT PROPERTY
Sink example
The below image is an example of a delimited text sink configuration in mapping data flows.
The associated data flow script is:
Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Delta format in Azure Data Factory
4/22/2021 • 3 minutes to read • Edit Online
Import schema
Delta is only available as an inline dataset and, by default, doesn't have an associated schema. To get column
metadata, click the Import schema button in the Projection tab. This will allow you to reference the column
names and data types specified by the corpus. To import the schema, a data flow debug session must be active
and you must have an existing CDM entity definition file to point to.
Delta source script example
source(output(movieId as integer,
title as string,
releaseDate as date,
rated as boolean,
screenedOn as timestamp,
ticketPrice as decimal(10,2)
),
store: 'local',
format: 'delta',
versionAsOf: 0,
allowSchemaDrift: false,
folderPath: $tempPath + '/delta'
) ~> movies
Sink properties
The below table lists the properties supported by a delta sink. You can edit these properties in the Settings tab.
NAME    DESCRIPTION    REQUIRED    ALLOWED VALUES    DATA FLOW SCRIPT PROPERTY
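As a rough illustration, a minimal sketch of a delta sink in data flow script, mirroring the store, format, and folder path used in the source example above and assuming an upstream stream named movies:
movies sink(allowSchemaDrift: true,
    validateSchema: false,
    store: 'local',
    format: 'delta',
    folderPath: $tempPath + '/delta') ~> DeltaSink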
Known limitations
When writing to a delta sink, there is a known limitation where the number of rows written won't be returned in the monitoring output.
Next steps
Create a source transformation in mapping data flow.
Create a sink transformation in mapping data flow.
Create an alter row transformation to mark rows as insert, update, upsert, or delete.
Copy data from Drill using Azure Data Factory
5/6/2021 • 3 minutes to read • Edit Online
Supported capabilities
This Drill connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Drill to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, therefore you don't need to manually install
any driver using this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Drill connector.
Linked service properties
The following properties are supported for Drill linked service:
Example:
{
"name": "DrillLinkedService",
"properties": {
"type": "Drill",
"typeProperties": {
"connectionString": "ConnectionType=Direct;Host=<host>;Port=<port>;AuthenticationType=Plain;UID=
<user name>;PWD=<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Drill dataset.
To copy data from Drill, set the type property of the dataset to DrillTable . The following properties are
supported:
tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workloads. Required: no (if "query" in the activity source is specified).
Example
{
"name": "DrillDataset",
"properties": {
"type": "DrillTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Drill linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: no (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromDrill",
"type": "Copy",
"inputs": [
{
"referenceName": "<Drill input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DrillSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Lookup activity properties
To learn details about the properties, check Lookup activity.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Dynamics AX by using Azure Data
Factory
5/6/2021 • 4 minutes to read • Edit Online
Supported capabilities
This Dynamics AX connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Dynamics AX to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Specifically, this Dynamics AX connector supports copying data from Dynamics AX using OData protocol with Service Principal authentication.
TIP
You can also use this connector to copy data from Dynamics 365 Finance and Operations. Refer to Dynamics 365's
OData support and Authentication method.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to Dynamics AX connector.
Prerequisites
To use service principal authentication, follow these steps:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application
with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Go to Dynamics AX, and grant this service principal proper permission to access your Dynamics AX.
Example
{
"name": "DynamicsAXLinkedService",
"properties": {
"type": "DynamicsAX",
"typeProperties": {
"url": "<Dynamics AX instance OData endpoint>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource, e.g. https://sampledynamics.sandbox.operations.dynamics.com>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
This section provides a list of properties that the Dynamics AX dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from Dynamics AX, set the type property of the dataset to DynamicsAXResource . The following
properties are supported:
Example
{
"name": "DynamicsAXResourceDataset",
"properties": {
"type": "DynamicsAXResource",
"typeProperties": {
"path": "<entity path e.g. dd04tentitySet>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Dynamics AX linked service name>",
"type": "LinkedServiceReference"
}
}
}
Example
"activities":[
{
"name": "CopyFromDynamicsAX",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics AX input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsAXSource",
"query": "$top=10"
},
"sink": {
"type": "<sink type>"
}
}
}
]
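The query property accepts OData query options, which can be combined with an ampersand. The following is a sketch only; the field name dataAreaId is illustrative and depends on the entity you copy from:
"source": {
    "type": "DynamicsAXSource",
    "query": "$filter=dataAreaId eq 'usmf'&$top=10"
}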
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to Dynamics 365 (Microsoft
Dataverse) or Dynamics CRM by using Azure Data
Factory
6/16/2021 • 14 minutes to read • Edit Online
Supported capabilities
This connector is supported for the following activities:
Copy activity with supported source and sink matrix
Lookup activity
You can copy data from Dynamics 365 (Microsoft Dataverse) or Dynamics CRM to any supported sink data
store. You also can copy data from any supported source data store to Dynamics 365 (Microsoft Dataverse) or
Dynamics CRM. For a list of data stores that a copy activity supports as sources and sinks, see the Supported
data stores table.
NOTE
Effective November 2020, Common Data Service has been renamed to Microsoft Dataverse. This article is updated to
reflect the latest terminology.
This Dynamics connector supports Dynamics versions 7 through 9 for both online and on-premises. More
specifically:
Version 7 maps to Dynamics CRM 2015.
Version 8 maps to Dynamics CRM 2016 and the early version of Dynamics 365.
Version 9 maps to the later version of Dynamics 365.
Refer to the following table of supported authentication types and configurations for Dynamics versions and
products.
DY N A M IC S VERSIO N S A UT H EN T IC AT IO N T Y P ES L IN K ED SERVIC E SA M P L ES
Dataverse Azure Active Directory (Azure AD) Dynamics online and Azure AD
service principal service-principal or Office 365
Dynamics 365 online authentication
Office 365
Dynamics CRM online
DY N A M IC S VERSIO N S A UT H EN T IC AT IO N T Y P ES L IN K ED SERVIC E SA M P L ES
Dynamics 365 on-premises with IFD Dynamics on-premises with IFD and
internet-facing deployment (IFD) IFD authentication
NOTE
With the deprecation of regional Discovery Service, Azure Data Factory has upgraded to leverage global Discovery Service
while using Office 365 Authentication.
IMPORTANT
If your tenant and user is configured in Azure Active Directory for conditional access and/or Multi-Factor Authentication is
required, you will not be able to use Office 365 Authentication type. For those situations, you must use a Azure Active
Directory (Azure AD) service principal authentication.
For Dynamics 365 specifically, the following application types are supported:
Dynamics 365 for Sales
Dynamics 365 for Customer Service
Dynamics 365 for Field Service
Dynamics 365 for Project Service Automation
Dynamics 365 for Marketing This connector doesn't support other application types like Finance, Operations,
and Talent.
TIP
To copy data from Dynamics 365 Finance and Operations, you can use the Dynamics AX connector.
Prerequisites
To use this connector with Azure AD service-principal authentication, you must set up server-to-server (S2S)
authentication in Dataverse or Dynamics. First register the application user (Service Principal) in Azure Active
Directory. You can find out how to do this here. During application registration you will need to create that user
in Dataverse or Dynamics and grant permissions. Those permissions can either be granted directly or indirectly
by adding the application user to a team which has been granted permissions in Dataverse or Dynamics. You
can find more information on how to set up an application user to authenticate with Dataverse here.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Dynamics.
servicePrincipalCredentialType The credential type to use for service- Yes when authentication is
principal authentication. Valid values "AADServicePrincipal"
are "ServicePrincipalKey" and
"ServicePrincipalCert".
password The password for the user account you Yes when authentication is "Office365"
specified as the username. Mark this
field with "SecureString" to store it
securely in Data Factory, or reference a
secret stored in Azure Key Vault.
NOTE
The Dynamics connector formerly used the optional organizationName property to identify your Dynamics CRM or
Dynamics 365 online instance. While that property still works, we suggest you specify the new ser viceUri property
instead to gain better performance for instance discovery.
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "AADServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalCredentialType": "ServicePrincipalKey",
"servicePrincipalCredential": "<service principal key>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "Online",
"serviceUri": "https://<organization-name>.crm[x].dynamics.com",
"authenticationType": "Office365",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
port The port of the on-premises Dynamics No. The default value is 443.
server.
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"description": "Dynamics on-premises with IFD linked service using IFD authentication",
"typeProperties": {
"deploymentType": "OnPremisesWithIFD",
"hostName": "contosodynamicsserver.contoso.com",
"port": 443,
"organizationName": "admsDynamicsTest",
"authenticationType": "Ifd",
"username": "[email protected]",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by Dynamics dataset.
To copy data from and to Dynamics, the following properties are supported:
entityName The logical name of the entity to No for source if the activity source is
retrieve. specified as "query" and yes for sink
Example
{
"name": "DynamicsDataset",
"properties": {
"type": "DynamicsEntity",
"schema": [],
"typeProperties": {
"entityName": "account"
},
"linkedServiceName": {
"referenceName": "<Dynamics linked service name>",
"type": "linkedservicereference"
}
}
}
IMPORTANT
When you copy data from Dynamics, explicit column mapping from Dynamics to sink is optional. But we highly
recommend the mapping to ensure a deterministic copy result.
When Data Factory imports a schema in the authoring UI, it infers the schema. It does so by sampling the top rows
from the Dynamics query result to initialize the source column list. In that case, columns with no values in the top
rows are omitted. The same behavior applies to copy executions if there is no explicit mapping. You can review and add
more columns into the mapping, which are honored during copy runtime.
Example
"activities":[
{
"name": "CopyFromDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<Dynamics input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DynamicsSource",
"query": "<FetchXML Query>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
writeBatchSize The row count of data written to No. The default value is 10.
Dynamics in each batch.
ignoreNullValues Whether to ignore null values from No. The default value is FALSE .
input data other than key fields during
a write operation.
For Dynamics 365 online, there's a limit of two concurrent batch calls per organization. If that limit is exceeded, a
"Server Busy" exception is thrown before the first request is ever run. Keep writeBatchSize at 10 or less to
avoid such throttling of concurrent calls.
The optimal combination of writeBatchSize and parallelCopies depends on the schema of your entity.
Schema elements include the number of columns, row size, and number of plug-ins, workflows, or workflow
activities hooked up to those calls. The default setting of writeBatchSize (10) × parallelCopies (10) is the
recommendation according to the Dynamics service. This value works for most Dynamics entities, although it
might not give the best performance. You can tune the performance by adjusting the combination in your copy
activity settings.
Example
"activities":[
{
"name": "CopyToDynamics",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Dynamics output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "DynamicsSink",
"writeBehavior": "Upsert",
"writeBatchSize": 10,
"ignoreNullValues": true
}
}
}
]
You can also add filters to filter the views. For example, add the following filter to get a view named "My Active
Accounts" in the account entity.
DYNAMICS DATA TYPE              DATA FACTORY INTERIM DATA TYPE    SUPPORTED AS SOURCE    SUPPORTED AS SINK
AttributeTypeCode.BigInt        Long                              ✓                      ✓
AttributeTypeCode.Boolean       Boolean                           ✓                      ✓
AttributeType.DateTime          Datetime                          ✓                      ✓
AttributeType.Decimal           Decimal                           ✓                      ✓
AttributeType.Double            Double                            ✓                      ✓
AttributeType.EntityName        String                            ✓                      ✓
AttributeType.Integer           Int32                             ✓                      ✓
AttributeType.ManagedProperty   Boolean                           ✓
AttributeType.Memo              String                            ✓                      ✓
AttributeType.Money             Decimal                           ✓                      ✓
AttributeType.Picklist          Int32                             ✓                      ✓
AttributeType.Uniqueidentifier  GUID                              ✓                      ✓
AttributeType.String            String                            ✓                      ✓
AttributeType.State             Int32                             ✓                      ✓
AttributeType.Status            Int32                             ✓                      ✓
NOTE
The Dynamics data types AttributeType.CalendarRules, AttributeType.MultiSelectPicklist, and
AttributeType.PartyList aren't supported.
Next steps
For a list of data stores the copy activity in Data Factory supports as sources and sinks, see Supported data
stores.
Excel format in Azure Data Factory
NOTE
".xls" format is not supported while using HTTP.
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Excel dataset.
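As a hedged sketch only (the container, folder, file, and sheet names are placeholders and not values from this article), an Excel dataset stored on Azure Blob Storage can be defined like this:
{
    "name": "ExcelDataset",
    "properties": {
        "type": "Excel",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder",
                "fileName": "sample.xlsx"
            },
            "sheetName": "MyWorksheet",
            "firstRowAsHeader": true
        }
    }
}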
"activities": [
{
"name": "CopyFromExcel",
"type": "Copy",
"typeProperties": {
"source": {
"type": "ExcelSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
}
},
...
}
...
}
]
Source example
The following image is an example of an Excel source configuration in mapping data flows using dataset mode.
The associated data flow script is:
source(allowSchemaDrift: true,
validateSchema: false,
wildcardPaths:['*.xls']) ~> ExcelSource
If you use an inline dataset, you see the following source options in mapping data flows.
source(allowSchemaDrift: true,
validateSchema: false,
format: 'excel',
fileSystem: 'container',
folderPath: 'path',
fileName: 'sample.xls',
sheetName: 'worksheet',
firstRowAsHeader: true) ~> ExcelSourceInlineDataset
Next steps
Copy activity overview
Lookup activity
GetMetadata activity
Copy data to or from a file system by using Azure
Data Factory
Supported capabilities
This file system connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this file system connector supports:
Copying files from/to local machine or network file share. To use a Linux file share, install Samba on your
Linux server.
Copying files using Windows authentication.
Copying files as-is or parsing/generating files with the supported file formats and compression codecs.
NOTE
A mapped network drive is not supported when loading data from a network file share. Use the actual path instead, for example,
\\server\share .
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
file system.
Example:
{
"name": "FileLinkedService",
"properties": {
"type": "FileServer",
"typeProperties": {
"host": "<host>",
"userId": "<domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for file system under location settings in format-based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<File system linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FileServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
PROPERTY                                    DESCRIPTION                                                                                   REQUIRED
OPTION 2: server-side filter - fileFilter   File-server-side native filter, which provides better performance than the OPTION 3          No
                                            wildcard filter. Use * to match zero or more characters and ? to match zero or a single
                                            character. Learn more about the syntax and notes in the Remarks under this section.
OPTION 3: client-side filter -              The file name with wildcard characters under the given folderPath/wildcardFolderPath to      Yes
wildcardFileName                            filter source files. Such filtering happens on the ADF side; ADF enumerates the files
                                            under the given path and then applies the wildcard filter. Allowed wildcards are *
                                            (matches zero or more characters) and ? (matches zero or a single character); use ^ to
                                            escape if your actual file name has a wildcard or this escape character inside. See more
                                            examples in Folder and file filter examples.
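As a hedged fragment (the *.csv pattern is only an illustration), the server-side fileFilter option is set in the copy activity source's storeSettings, for example:
"storeSettings": {
    "type": "FileServerReadSettings",
    "recursive": true,
    "fileFilter": "*.csv"
}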
Additional settings:
Example:
"activities":[
{
"name": "CopyFromFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "FileServerReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyToFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Parquet output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "ParquetSink",
"storeSettings":{
"type": "FileServerWriteSettings",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We suggest that you use the new model
described in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.
PROPERTY   DESCRIPTION                                                                                     REQUIRED
format     If you want to copy files as-is between file-based stores (binary copy), skip the format        No (only for binary copy scenario)
           section in both input and output dataset definitions.
NOTE
If you were using "fileFilter" property for file filter, it is still supported as-is, while you are suggested to use the new filter
capability added to "fileName" going forward.
Example:
{
"name": "FileSystemDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<file system linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Example:
"activities":[
{
"name": "CopyFromFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<file system input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
Example:
"activities":[
{
"name": "CopyToFileSystem",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<file system output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "FileSystemSink",
"copyBehavior": "PreserveHierarchy"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from FTP server by using Azure Data
Factory
Supported capabilities
This FTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this FTP connector supports:
Copying files using Basic or Anonymous authentication.
Copying files as-is or parsing files with the supported file formats and compression codecs.
The FTP connector supports FTP servers running in passive mode. Active mode is not supported.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
FTP.
NOTE
The FTP connector supports accessing an FTP server with either no encryption or explicit SSL/TLS encryption; it doesn't
support implicit SSL/TLS encryption.
{
"name": "FTPLinkedService",
"properties": {
"type": "FtpServer",
"typeProperties": {
"host": "<ftp server>",
"port": 21,
"enableSsl": true,
"enableServerCertificateValidation": true,
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for FTP under location settings in format-based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<FTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "FtpServerLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
Additional settings:
When copying data from FTP, ADF currently tries to get the file length first, then divides the file into multiple parts
and reads them in parallel. If your FTP server doesn't support getting the file length or seeking to read from a certain
offset, you might encounter a failure.
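If your server has this limitation, one hedged option (assuming the disableChunking read setting is available in your Data Factory version; check the connector reference before relying on it) is to turn off parallel chunked reads in the source storeSettings:
"storeSettings": {
    "type": "FtpReadSettings",
    "recursive": true,
    "disableChunking": true
}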
Example:
"activities":[
{
"name": "CopyFromFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "FtpReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We suggest that you use the new model
described in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.
PROPERTY   DESCRIPTION                                                                                     REQUIRED
format     If you want to copy files as-is between file-based stores (binary copy), skip the format        No (only for binary copy scenario)
           section in both input and output dataset definitions.
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a given name, specify folderPath with folder part and fileName with file name.
To copy a subset of files under a folder, specify folderPath with folder part and fileName with wildcard filter.
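As a hedged fragment for the third case (the *.csv.gz pattern is only an illustration), the legacy dataset can combine a folder path with a wildcard file name in its typeProperties:
"typeProperties": {
    "folderPath": "folder/subfolder/",
    "fileName": "*.csv.gz"
}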
NOTE
If you were using "fileFilter" property for file filter, it is still supported as-is, while you are suggested to use the new filter
capability added to "fileName" going forward.
Example:
{
"name": "FTPDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<FTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "myfile.csv.gz",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Example:
"activities":[
{
"name": "CopyFromFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<FTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Use GitHub to read Common Data Model entity
references
Next Steps
Create a source dataset in mapping data flow.
Copy data from Google AdWords using Azure Data
Factory
Supported capabilities
This Google AdWords connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Google AdWords to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install
any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Google AdWords connector.
Example:
{
"name": "GoogleAdWordsLinkedService",
"properties": {
"type": "GoogleAdWords",
"typeProperties": {
"clientCustomerID" : "<clientCustomerID>",
"developerToken": {
"type": "SecureString",
"value": "<developerToken>"
},
"authenticationType" : "ServiceAuthentication",
"refreshToken": {
"type": "SecureString",
"value": "<refreshToken>"
},
"clientId": {
"type": "SecureString",
"value": "<clientId>"
},
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"email" : "<email>",
"keyFilePath" : "<keyFilePath>",
"trustedCertPath" : "<trustedCertPath>",
"useSystemTrustStore" : true,
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Google AdWords dataset.
To copy data from Google AdWords, set the type property of the dataset to GoogleAdWordsObject . The
following properties are supported:
Example
{
"name": "GoogleAdWordsDataset",
"properties": {
"type": "GoogleAdWordsObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<GoogleAdWords linked service name>",
"type": "LinkedServiceReference"
}
}
}
query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM MyTable" .
Example:
"activities":[
{
"name": "CopyFromGoogleAdWords",
"type": "Copy",
"inputs": [
{
"referenceName": "<GoogleAdWords input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GoogleAdWordsSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Google BigQuery by using Azure
Data Factory
Supported capabilities
This Google BigQuery connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Google BigQuery to any supported sink data store. For a list of data stores that are
supported as sources or sinks by the copy activity, see the Supported data stores table.
Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install a
driver to use this connector.
NOTE
This Google BigQuery connector is built on top of the BigQuery APIs. Be aware that BigQuery limits the maximum rate of
incoming requests and enforces appropriate quotas on a per-project basis; refer to Quotas & Limits - API requests. Make
sure you do not trigger too many concurrent requests to the account.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Google BigQuery connector.
Example:
{
"name": "GoogleBigQueryLinkedService",
"properties": {
"type": "GoogleBigQuery",
"typeProperties": {
"project" : "<project ID>",
"additionalProjects" : "<additional project IDs>",
"requestGoogleDriveScope" : true,
"authenticationType" : "UserAuthentication",
"clientId": "<id of the application used to generate the refresh token>",
"clientSecret": {
"type": "SecureString",
"value":"<secret of the application used to generate the refresh token>"
},
"refreshToken": {
"type": "SecureString",
"value": "<refresh token>"
}
}
}
}
Example:
{
"name": "GoogleBigQueryLinkedService",
"properties": {
"type": "GoogleBigQuery",
"typeProperties": {
"project" : "<project id>",
"requestGoogleDriveScope" : true,
"authenticationType" : "ServiceAuthentication",
"email": "<email>",
"keyFilePath": "<.p12 key path on the IR machine>"
},
"connectVia": {
"referenceName": "<name of Self-hosted Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Google BigQuery dataset.
To copy data from Google BigQuery, set the type property of the dataset to GoogleBigQueryObject. The
following properties are supported:
PROPERTY    DESCRIPTION                                                                                REQUIRED
dataset     Name of the Google BigQuery dataset.                                                       No (if "query" in activity source is specified)
tableName   Name of the table. This property is supported for backward compatibility. For new         No (if "query" in activity source is specified)
            workload, use dataset and table.
Example
{
"name": "GoogleBigQueryDataset",
"properties": {
"type": "GoogleBigQueryObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<GoogleBigQuery linked service name>",
"type": "LinkedServiceReference"
}
}
}
PROPERTY   DESCRIPTION                                                                      REQUIRED
query      Use the custom SQL query to read data. An example is "SELECT * FROM MyTable".    No (if "tableName" in dataset is specified)
Example:
"activities":[
{
"name": "CopyFromGoogleBigQuery",
"type": "Copy",
"inputs": [
{
"referenceName": "<GoogleBigQuery input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GoogleBigQuerySource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Google Cloud Storage by using
Azure Data Factory
Supported capabilities
This Google Cloud Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this Google Cloud Storage connector supports copying files as is or parsing files with the supported
file formats and compression codecs. It takes advantage of GCS's S3-compatible interoperability.
Prerequisites
The following setup is required on your Google Cloud Storage account:
1. Enable interoperability for your Google Cloud Storage account
2. Set the default project that contains the data you want to copy from the target GCS bucket.
3. Create a service account and define the right levels of permissions by using Cloud IAM on GCP.
4. Generate the access keys for this service account.
Required permissions
To copy data from Google Cloud Storage, make sure you've been granted the following permissions for object
operations: storage.objects.get and storage.objects.list .
If you use the Data Factory UI to author, an additional storage.buckets.list permission is required for operations like
testing the connection to the linked service and browsing from the root. If you don't want to grant this permission, you can
choose the "Test connection to file path" or "Browse from specified path" options in the UI.
For the full list of Google Cloud Storage roles and associated permissions, see IAM roles for Cloud Storage on
the Google Cloud site.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Google Cloud Storage.
Here's an example:
{
"name": "GoogleCloudStorageLinkedService",
"properties": {
"type": "GoogleCloudStorage",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
},
"serviceUrl": "https://storage.googleapis.com"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Google Cloud Storage under location settings in a format-based
dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Google Cloud Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "GoogleCloudStorageLocation",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
PROPERTY               DESCRIPTION                                                                                REQUIRED
OPTION 2: GCS prefix   Prefix for the GCS key name under the given bucket configured in the dataset to filter     No
- prefix               source GCS files. GCS keys whose names start with bucket_in_dataset/this_prefix are
                       selected. It utilizes GCS's service-side filter, which provides better performance than
                       a wildcard filter.
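As a hedged fragment (the prefix value is only an illustration), the prefix option goes in the copy activity source's storeSettings:
"storeSettings": {
    "type": "GoogleCloudStorageReadSettings",
    "recursive": true,
    "prefix": "folder/subfolder"
}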
Additional settings:
Example:
"activities":[
{
"name": "CopyFromGoogleCloudStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "GoogleCloudStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Legacy models
If you were using an Amazon S3 connector to copy data from Google Cloud Storage, it's still supported as is for
backward compatibility. We suggest that you use the new model mentioned earlier. The Data Factory authoring
UI has switched to generating the new model.
Next steps
For a list of data stores that the Copy activity in Azure Data Factory supports as sources and sinks, see
Supported data stores.
Copy data from Greenplum using Azure Data
Factory
Supported capabilities
This Greenplum connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Greenplum to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install
any driver to use this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Greenplum connector.
Linked service properties
The following properties are supported for Greenplum linked service:
Example:
{
"name": "GreenplumLinkedService",
"properties": {
"type": "Greenplum",
"typeProperties": {
"connectionString": "HOST=<server>;PORT=<port>;DB=<database>;UID=<user name>;PWD=<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Greenplum dataset.
To copy data from Greenplum, set the type property of the dataset to GreenplumTable . The following
properties are supported:
PROPERTY    DESCRIPTION                                                                                REQUIRED
tableName   Name of the table with schema. This property is supported for backward compatibility.     No (if "query" in activity source is specified)
            Use schema and table for new workload.
Example
{
"name": "GreenplumDataset",
"properties": {
"type": "GreenplumTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Greenplum linked service name>",
"type": "LinkedServiceReference"
}
}
}
PROPERTY   DESCRIPTION                                                                      REQUIRED
query      Use the custom SQL query to read data. For example: "SELECT * FROM MyTable".     No (if "tableName" in dataset is specified)
Example:
"activities":[
{
"name": "CopyFromGreenplum",
"type": "Copy",
"inputs": [
{
"referenceName": "<Greenplum input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "GreenplumSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from HBase using Azure Data Factory
Supported capabilities
This HBase connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from HBase to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install
any driver to use this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HBase connector.
Linked service properties
The following properties are supported for HBase linked service:
NOTE
If your cluster doesn't support sticky sessions (for example, HDInsight), explicitly add the node index at the end of the
HTTP path setting: for example, specify /hbaserest0 instead of /hbaserest .
{
"name": "HBaseLinkedService",
"properties": {
"type": "HBase",
"typeProperties": {
"host" : "<cluster name>.azurehdinsight.net",
"port" : "443",
"httpPath" : "/hbaserest0",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"enableSsl" : true
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by HBase dataset.
To copy data from HBase, set the type property of the dataset to HBaseObject . The following properties are
supported:
Example
{
"name": "HBaseDataset",
"properties": {
"type": "HBaseObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<HBase linked service name>",
"type": "LinkedServiceReference"
}
}
}
PROPERTY   DESCRIPTION                                                                      REQUIRED
query      Use the custom SQL query to read data. For example: "SELECT * FROM MyTable".     No (if "tableName" in dataset is specified)
Example:
"activities":[
{
"name": "CopyFromHBase",
"type": "Copy",
"inputs": [
{
"referenceName": "<HBase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HBaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from the HDFS server by using Azure
Data Factory
Supported capabilities
The HDFS connector is supported for the following activities:
Copy activity with supported source and sink matrix
Lookup activity
Delete activity
Specifically, the HDFS connector supports:
Copying files by using Windows (Kerberos) or Anonymous authentication.
Copying files by using the webhdfs protocol or built-in DistCp support.
Copying files as is or by parsing or generating files with the supported file formats and compression codecs.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
NOTE
Make sure that the integration runtime can access all the [name node server]:[name node port] and [data node servers]:
[data node port] of the Hadoop cluster. The default [name node port] is 50070, and the default [data node port] is 50075.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HDFS.
{
"name": "HDFSLinkedService",
"properties": {
"type": "Hdfs",
"typeProperties": {
"url" : "http://<machine>:50070/webhdfs/v1/",
"authenticationType": "Windows",
"userName": "<username>@<domain>.com (for Kerberos auth)",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets in Azure Data
Factory.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for HDFS under location settings in the format-based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<HDFS linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "HdfsLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
PROPERTY                DESCRIPTION
OPTION 1: static path   Copy from the folder or file path that's specified in the dataset. If you want to copy all
                        files from a folder, additionally specify wildcardFileName as *.
Additional settings
DistCp settings
PROPERTY         DESCRIPTION                                                                                REQUIRED
tempScriptPath   A folder path that's used to store the temp DistCp command script. The script file is      Yes, if using DistCp
                 generated by Data Factory and will be removed after the Copy job is finished.
Example:
"activities":[
{
"name": "CopyFromHDFS",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "HdfsReadSettings",
"recursive": true,
"distcpSettings": {
"resourceManagerEndpoint": "resourcemanagerendpoint:8088",
"tempScriptPath": "/usr/hadoop/tempscript",
"distcpOptions": "-m 100"
}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
IMPORTANT
The HTTP Kerberos principal must start with "HTTP/" according to Kerberos HTTP SPNEGO specification. Learn
more from here.
IMPORTANT
The username should not contain the hostname.
C:> Ksetup
default realm = REALM.COM (external)
REALM.com:
kdc = <your_kdc_server_address>
NOTE
Replace REALM.COM and AD.COM in the following tutorial with your own realm name and domain controller.
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = REALM.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
[realms]
REALM.COM = {
kdc = node.REALM.COM
admin_server = node.REALM.COM
}
AD.COM = {
kdc = windc.ad.com
admin_server = windc.ad.com
}
[domain_realm]
.REALM.COM = REALM.COM
REALM.COM = REALM.COM
.ad.com = AD.COM
ad.com = AD.COM
[capaths]
AD.COM = {
REALM.COM = .
}
2. Establish trust from the Windows domain to the Kerberos realm, where [password] is the password for the
principal krbtgt/REALM.COM@AD.COM (see the command sketch after step d).
d. Use the Ksetup command to specify the encryption algorithm to be used on the specified realm.
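A hedged sketch of what the commands for steps 2 and d might look like (the realm name, domain name, password placeholder, and encryption type are assumptions; verify the exact syntax against the netdom and ksetup documentation):
REM placeholders: REALM.COM, AD.COM, [password], and the encryption type
netdom trust REALM.COM /Domain:AD.COM /add /realm /passwordt:[password]
ksetup /SetEncTypeAttr REALM.COM AES256-CTS-HMAC-SHA1-96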
4. Create the mapping between the domain account and the Kerberos principal, so that you can use the
Kerberos principal in the Windows domain.
a. Select Administrative tools > Active Directory Users and Computers.
b. Configure advanced features by selecting View > Advanced Features .
c. On the Advanced Features pane, right-click the account to which you want to create mappings and,
on the Name Mappings pane, select the Kerberos Names tab.
d. Add a principal from the realm.
On the self-hosted integration runtime machine:
Run the following Ksetup commands to add a realm entry.
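A hedged sketch of such a command (the realm name and KDC address are placeholders; confirm against the ksetup documentation):
REM placeholders: REALM.COM and <your_kdc_server_address>
ksetup /addkdc REALM.COM <your_kdc_server_address>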
Legacy models
NOTE
The following models are still supported as is for backward compatibility. We recommend that you use the previously
discussed new model, because the Azure Data Factory authoring UI has switched to generating the new model.
format If you want to copy files as is between No (only for binary copy scenario)
file-based stores (binary copy), skip the
format section in both the input and
output dataset definitions.
Example:
{
"name": "HDFSDataset",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<HDFS linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
PROPERTY         DESCRIPTION                                                                                REQUIRED
tempScriptPath   A folder path that's used to store the temp DistCp command script. The script file is      Yes, if using DistCp
                 generated by Data Factory and will be removed after the Copy job is finished.
"source": {
"type": "HdfsSource",
"distcpSettings": {
"resourceManagerEndpoint": "resourcemanagerendpoint:8088",
"tempScriptPath": "/usr/hadoop/tempscript",
"distcpOptions": "-m 100"
}
}
Next steps
For a list of data stores that are supported as sources and sinks by the Copy activity in Azure Data Factory, see
supported data stores.
Copy and transform data from Hive using Azure
Data Factory
Supported capabilities
This Hive connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Hive to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install
any driver to use this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Hive connector.
Linked service properties
The following properties are supported for Hive linked service:
PROPERTY   DESCRIPTION                                                                                  REQUIRED
port       The TCP port that the Hive server uses to listen for client connections. If you connect      Yes
           to Azure HDInsight, specify port as 443.
Example:
{
"name": "HiveLinkedService",
"properties": {
"type": "Hive",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "<port>",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Hive dataset.
To copy data from Hive, set the type property of the dataset to HiveObject . The following properties are
supported:
PROPERTY    DESCRIPTION                                                                                REQUIRED
tableName   Name of the table including schema part. This property is supported for backward          No (if "query" in activity source is specified)
            compatibility. For new workload, use schema and table.
Example
{
"name": "HiveDataset",
"properties": {
"type": "HiveObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Hive linked service name>",
"type": "LinkedServiceReference"
}
}
}
PROPERTY   DESCRIPTION                                                                      REQUIRED
query      Use the custom SQL query to read data. For example: "SELECT * FROM MyTable".     No (if "tableName" in dataset is specified)
Example:
"activities":[
{
"name": "CopyFromHive",
"type": "Copy",
"inputs": [
{
"referenceName": "<Hive input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HiveSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Source example
Below is an example of a Hive source configuration:
These settings translate into the following data flow script:
source(
allowSchemaDrift: true,
validateSchema: false,
ignoreNoFilesFound: false,
format: 'table',
store: 'hive',
schemaName: 'default',
tableName: 'hivesampletable',
staged: true,
storageContainer: 'khive',
storageFolderPath: '',
stagingDatabaseName: 'default') ~> hivesource
Known limitations
Complex types such as arrays, maps, structs, and unions are not supported for read.
The Hive connector only supports Hive tables in Azure HDInsight version 4.0 or later (Apache Hive 3.1.0).
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from an HTTP endpoint by using Azure
Data Factory
Supported capabilities
This HTTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an HTTP source to any supported sink data store. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
You can use this HTTP connector to:
Retrieve data from an HTTP/S endpoint by using the HTTP GET or POST methods.
Retrieve data by using one of the following authentications: Anonymous, Basic, Digest, Windows, or
ClientCertificate.
Copy the HTTP response as-is or parse it by using supported file formats and compression codecs.
TIP
To test an HTTP request for data retrieval before you configure the HTTP connector in Data Factory, learn about the API
specification for header and body requirements. You can use tools like Postman or a web browser to validate.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to the HTTP connector.
Example
{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "Basic",
"url" : "<HTTP endpoint>",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
PROPERTY         DESCRIPTION                                                                              REQUIRED
certThumbprint   The thumbprint of the certificate that's installed on your self-hosted Integration       Specify either embeddedCertData or certThumbprint.
                 Runtime machine's cert store. Applies only when the self-hosted type of Integration
                 Runtime is specified in the connectVia property.
If you use certThumbprint for authentication and the certificate is installed in the personal store of the local
computer, grant read permissions to the self-hosted Integration Runtime:
1. Open the Microsoft Management Console (MMC). Add the Certificates snap-in that targets Local Computer.
2. Expand Certificates > Personal, and then select Certificates.
3. Right-click the certificate from the personal store, and then select All Tasks > Manage Private Keys.
4. On the Security tab, add the user account under which the Integration Runtime Host Service
(DIAHostService) is running, with read access to the certificate.
Example 1: Using certThumbprint
{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"authenticationType": "ClientCertificate",
"url": "<HTTP endpoint>",
"certThumbprint": "<thumbprint of certificate>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "HttpLinkedService",
"properties": {
"type": "HttpServer",
"typeProperties": {
"url": "<HTTP endpoint>",
"authenticationType": "Anonymous",
"authHeader": {
"x-api-key": {
"type": "SecureString",
"value": "<API key>"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for HTTP under location settings in format-based dataset:
NOTE
The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is
larger than 500 KB, consider batching the payload in smaller chunks.
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "HttpServerLocation",
"relativeUrl": "<relative url>"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
Example:
"activities":[
{
"name": "CopyFromHTTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "HttpReadSettings",
"requestMethod": "Post",
"additionalHeaders": "<header key: header value>\n<header key: header value>\n",
"requestBody": "<body for POST HTTP request>"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
Legacy models
NOTE
The following models are still supported as-is for backward compatibility. We suggest that you use the new model
described in the preceding sections going forward; the ADF authoring UI has switched to generating the new model.
NOTE
The supported HTTP request payload size is around 500 KB. If the payload size you want to pass to your web endpoint is
larger than 500 KB, consider batching the payload in smaller chunks.
{
"name": "HttpSourceDataInput",
"properties": {
"type": "HttpFile",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"additionalHeaders": "Connection: keep-alive\nUser-Agent: Mozilla/5.0\n"
}
}
}
Example 2: Using the Post method
{
"name": "HttpSourceDataInput",
"properties": {
"type": "HttpFile",
"linkedServiceName": {
"referenceName": "<HTTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"relativeUrl": "<relative url>",
"requestMethod": "Post",
"requestBody": "<body for POST HTTP request>"
}
}
}
Example
"activities":[
{
"name": "CopyFromHTTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<HTTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HttpSource",
"httpRequestTimeout": "00:01:00"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from HubSpot using Azure Data Factory
Supported capabilities
This HubSpot connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from HubSpot to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install
any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
HubSpot connector.
Example:
{
"name": "HubSpotLinkedService",
"properties": {
"type": "Hubspot",
"typeProperties": {
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"refreshToken": {
"type": "SecureString",
"value": "<refreshToken>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by HubSpot dataset.
To copy data from HubSpot, set the type property of the dataset to HubspotObject . The following properties
are supported:
Example
{
"name": "HubSpotDataset",
"properties": {
"type": "HubspotObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<HubSpot linked service name>",
"type": "LinkedServiceReference"
}
}
}
query Use the custom SQL query to read No (if "tableName" in dataset is
data. For example: specified)
"SELECT * FROM Companies where
Company_Id = xxx"
.
Example:
"activities":[
{
"name": "CopyFromHubspot",
"type": "Copy",
"inputs": [
{
"referenceName": "<HubSpot input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "HubspotSource",
"query": "SELECT * FROM Companies where Company_Id = xxx"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Impala by using Azure Data
Factory
Supported capabilities
This Impala connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Impala to any supported sink data store. For a list of data stores that are supported as
sources or sinks by the copy activity, see the Supported data stores table.
Data Factory provides a built-in driver to enable connectivity. Therefore, you don't need to manually install a
driver to use this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Impala connector.
Linked service properties
The following properties are supported for Impala linked service.
Example:
{
"name": "ImpalaLinkedService",
"properties": {
"type": "Impala",
"typeProperties": {
"host" : "<host>",
"port" : "<port>",
"authenticationType" : "UsernameAndPassword",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Impala dataset.
To copy data from Impala, set the type property of the dataset to ImpalaObject . The following properties are
supported:
tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. Required: No (if "query" in activity source is specified).
Example
{
"name": "ImpalaDataset",
"properties": {
"type": "ImpalaObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Impala linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. An example is "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromImpala",
"type": "Copy",
"inputs": [
{
"referenceName": "<Impala input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ImpalaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from and to IBM Informix using Azure
Data Factory
5/6/2021 • 4 minutes to read
Supported capabilities
This Informix connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Informix source to any supported sink data store, or copy from any supported source
data store to Informix sink. For a list of data stores that are supported as sources/sinks by the copy activity, see
the Supported data stores table.
Prerequisites
To use this Informix connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the Informix ODBC driver for the data store on the Integration Runtime machine. For driver installation
and setup, refer to the Informix ODBC Driver Guide article in IBM Knowledge Center for details, or contact the IBM
support team for driver installation guidance.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Informix connector.
Example:
{
"name": "InformixLinkedService",
"properties": {
"type": "Informix",
"typeProperties": {
"connectionString": "<Informix connection string or DSN>",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Informix dataset.
To copy data from Informix, the following properties are supported:
tableName: Name of the table in the Informix. Required: No for source (if "query" in activity source is specified); Yes for sink.
Example
{
"name": "InformixDataset",
"properties": {
"type": "InformixTable",
"linkedServiceName": {
"referenceName": "<Informix linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}
query: Use the custom query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromInformix",
"type": "Copy",
"inputs": [
{
"referenceName": "<Informix input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "InformixSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Informix as sink
To copy data to Informix, the following properties are supported in the copy activity sink section:
writeBatchSize: Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). Required: No (default is 0 - auto detected).
Example:
"activities":[
{
"name": "CopyToInformix",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Informix output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "InformixSink"
}
}
}
]
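If you want batched inserts rather than row-by-row writes, you can set writeBatchSize on the sink. The fragment below is a sketch only; the value 10000 is illustrative, not a recommendation.
"sink": {
    "type": "InformixSink",
    "writeBatchSize": 10000
}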
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Jira using Azure Data Factory
5/6/2021 • 3 minutes to read
Supported capabilities
This Jira connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Jira to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any
driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Jira connector.
Example:
{
"name": "JiraLinkedService",
"properties": {
"type": "Jira",
"typeProperties": {
"host" : "<host>",
"port" : "<port>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Jira dataset.
To copy data from Jira, set the type property of the dataset to JiraObject . The following properties are
supported:
Example
{
"name": "JiraDataset",
"properties": {
"type": "JiraObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Jira linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromJira",
"type": "Copy",
"inputs": [
{
"referenceName": "<Jira input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "JiraSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
JSON format in Azure Data Factory
5/14/2021 • 10 minutes to read
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the JSON dataset.
{
"name": "JSONDataset",
"properties": {
"type": "Json",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"compression": {
"type": "gzip"
}
}
}
}
JSON as sink
The following properties are supported in the copy activity sink section.
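As a sketch of what such a sink can look like when writing JSON files to Azure Blob Storage (the JsonWriteSettings block and the setOfObjects file pattern are assumptions based on the copy activity JSON sink options, so verify them against your environment):
"sink": {
    "type": "JsonSink",
    "storeSettings": {
        "type": "AzureBlobStorageWriteSettings"
    },
    "formatSettings": {
        "type": "JsonWriteSettings",
        "filePattern": "setOfObjects"
    }
}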
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
}
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
Single document
If Single document is selected, mapping data flows read one JSON document from each file.
File1.json
{
"json": "record 1"
}
File2.json
{
"json": "record 2"
}
File3.json
{
"json": "record 3"
}
If Document per line is selected, mapping data flows read one JSON document from each line in a file.
File1.json
{"json": "record 1 }
File2.json
{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"567834760","s
witch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"789037573","s
witch1":"US","switch2":"UK"}
File3.json
{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":"567834760","s
witch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":"789037573","s
witch1":"US","switch2":"UK"}
{"time":"2015-04-
29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":"345626404","s
witch1":"Germany","switch2":"UK"}
If Array of documents is selected, mapping data flows read one array of document from a file.
File.json
[
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
},
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
},
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
]
NOTE
If data flows throw an error stating "corrupt_record" when previewing your JSON data, it is likely that your data
contains a single document in your JSON file. Setting "single document" should clear that error.
Has comments
Select Has comments if the JSON data has C or C++ style commenting.
Single quoted
Select Single quoted if the JSON fields and values use single quotes instead of double quotes.
Backslash escaped
Select Backslash escaped if backslashes are used to escape characters in the JSON data.
Sink Properties
The following table lists the properties supported by a JSON sink. You can edit these properties in the Settings tab.
@(
field1=0,
field2=@(
field1=0
)
)
If this expression were entered for a column named "complexColumn", then it would be written to the sink as
the following JSON:
{
"complexColumn": {
"field1": 0,
"field2": {
"field1": 0
}
}
}
Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from Magento using Azure Data Factory
(Preview)
5/6/2021 • 3 minutes to read
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Magento connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Magento to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any
driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Magento connector.
Example:
{
"name": "MagentoLinkedService",
"properties": {
"type": "Magento",
"typeProperties": {
"host" : "192.168.222.110/magento3",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Magento dataset.
To copy data from Magento, set the type property of the dataset to MagentoObject . The following properties
are supported:
Example
{
"name": "MagentoDataset",
"properties": {
"type": "MagentoObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Magento linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Customers". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMagento",
"type": "Copy",
"inputs": [
{
"referenceName": "<Magento input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MagentoSource",
"query": "SELECT * FROM Customers where Id > XXX"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MariaDB using Azure Data Factory
5/6/2021 • 3 minutes to read
Supported capabilities
This MariaDB connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from MariaDB to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity, so you don't need to manually install any
driver to use this connector.
This connector currently supports MariaDB of version 10.0 to 10.2.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MariaDB connector.
Linked service properties
The following properties are supported for MariaDB linked service:
Example:
{
"name": "MariaDBLinkedService",
"properties": {
"type": "MariaDB",
"typeProperties": {
"connectionString": "Server=<host>;Port=<port>;Database=<database>;UID=<user name>;PWD=
<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
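If you prefer not to keep the password in the connection string, you can reference a secret stored in Azure Key Vault. The following is a sketch only; the pwd field name mirrors the Netezza Key Vault example later in this article and should be verified for MariaDB.
{
    "name": "MariaDBLinkedService",
    "properties": {
        "type": "MariaDB",
        "typeProperties": {
            "connectionString": "Server=<host>;Port=<port>;Database=<database>;UID=<user name>;",
            "pwd": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secretName>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}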
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by MariaDB dataset.
To copy data from MariaDB, set the type property of the dataset to MariaDBTable . There is no additional type-
specific property in this type of dataset.
Example
{
"name": "MariaDBDataset",
"properties": {
"type": "MariaDBTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<MariaDB linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMariaDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MariaDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MariaDBSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Marketo using Azure Data Factory
(Preview)
5/6/2021 • 3 minutes to read
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Marketo connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Marketo to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Currently, Marketo instances that are integrated with an external CRM are not supported.
NOTE
This Marketo connector is built on top of the Marketo REST API. Be aware that Marketo enforces a concurrent request limit
on the service side. If you hit errors saying "Error while attempting to use REST API: Max rate limit '100' exceeded with in '20'
secs (606)" or "Error while attempting to use REST API: Concurrent access limit '10' reached (615)", consider reducing the number of
concurrent copy activity runs to reduce the number of requests to the service.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Marketo connector.
Example:
{
"name": "MarketoLinkedService",
"properties": {
"type": "Marketo",
"typeProperties": {
"endpoint" : "123-ABC-321.mktorest.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Marketo dataset.
To copy data from Marketo, set the type property of the dataset to MarketoObject . The following properties
are supported:
Example
{
"name": "MarketoDataset",
"properties": {
"type": "MarketoObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Marketo linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Activitiy_Types". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMarketo",
"type": "Copy",
"inputs": [
{
"referenceName": "<Marketo input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MarketoSource",
"query": "SELECT top 1000 * FROM Activitiy_Types"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Microsoft Access using
Azure Data Factory
5/6/2021 • 4 minutes to read
Supported capabilities
This Microsoft Access connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Microsoft Access source to any supported sink data store, or copy from any supported
source data store to Microsoft Access sink. For a list of data stores that are supported as sources/sinks by the
copy activity, see the Supported data stores table.
Prerequisites
To use this Microsoft Access connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the Microsoft Access ODBC driver for the data store on the Integration Runtime machine.
NOTE
Microsoft Access 2016 version of ODBC driver doesn't work with this connector. Use driver version 2013 or 2010 instead.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Microsoft Access connector.
Example:
{
"name": "MicrosoftAccessLinkedService",
"properties": {
"type": "MicrosoftAccess",
"typeProperties": {
"connectionString": "Driver={Microsoft Access Driver (*.mdb, *.accdb)};Dbq=<path to your DB file
e.g. C:\\mydatabase.accdb>;",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Microsoft Access dataset.
To copy data from Microsoft Access, the following properties are supported:
tableName: Name of the table in the Microsoft Access. Required: No for source (if "query" in activity source is specified); Yes for sink.
Example
{
"name": "MicrosoftAccessDataset",
"properties": {
"type": "MicrosoftAccessTable",
"linkedServiceName": {
"referenceName": "<Microsoft Access linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}
query: Use the custom query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMicrosoftAccess",
"type": "Copy",
"inputs": [
{
"referenceName": "<Microsoft Access input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MicrosoftAccessSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
writeBatchSize: Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). Required: No (default is 0 - auto detected).
Example:
"activities":[
{
"name": "CopyToMicrosoftAccess",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Microsoft Access output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "MicrosoftAccessSink"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from or to MongoDB by using Azure
Data Factory
6/8/2021 • 6 minutes to read
IMPORTANT
ADF released this new version of the MongoDB connector, which provides better native MongoDB support. If you are using
the previous MongoDB connector in your solution, which is supported as-is for backward compatibility, refer to the MongoDB
connector (legacy) article.
Supported capabilities
You can copy data from MongoDB database to any supported sink data store, or copy data from any supported
source data store to MongoDB database. For a list of data stores that are supported as sources/sinks by the copy
activity, see the Supported data stores table.
Specifically, this MongoDB connector supports versions up to 4.2 .
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB connector.
Example:
{
"name": "MongoDBLinkedService",
"properties": {
"type": "MongoDbV2",
"typeProperties": {
"connectionString": "mongodb://[username:password@]host[:port][/[database][?options]]",
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB dataset:
Example:
{
"name": "MongoDbDataset",
"properties": {
"type": "MongoDbV2Collection",
"typeProperties": {
"collectionName": "<Collection name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<MongoDB linked service name>",
"type": "LinkedServiceReference"
}
}
}
TIP
ADF supports consuming BSON documents in Strict mode. Make sure your filter query is in Strict mode instead of Shell
mode. More description can be found in the MongoDB manual.
Example:
"activities":[
{
"name": "CopyFromMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbV2Source",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-
12-12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
MongoDB as sink
The following properties are supported in the Copy Activity sink section:
TIP
To import JSON documents as-is, refer to Import or export JSON documents section; to copy from tabular-shaped data,
refer to Schema mapping.
Example
"activities":[
{
"name": "CopyToMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "MongoDbV2Sink",
"writeBehavior": "upsert"
}
}
}
]
Schema mapping
To copy data from MongoDB to a tabular sink or the reverse, refer to schema mapping.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MongoDB using Azure Data
Factory (legacy)
5/6/2021 • 7 minutes to read
IMPORTANT
ADF released a new MongoDB connector that provides better native MongoDB support compared to this ODBC-based
implementation; refer to the MongoDB connector article for details. This legacy MongoDB connector is kept supported as-is
for backward compatibility, while for any new workload, please use the new connector.
Supported capabilities
You can copy data from MongoDB database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MongoDB connector supports:
MongoDB versions 2.4, 2.6, 3.0, 3.2, 3.4 and 3.6 .
Copying data using Basic or Anonymous authentication.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in MongoDB driver; therefore, you don't need to manually install any
driver when copying data from MongoDB.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB connector.
username: User account to access MongoDB. Required: Yes (if basic authentication is used).
password: Password for the user. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes (if basic authentication is used).
authSource: Name of the MongoDB database that you want to use to check your credentials for authentication. Required: No. For basic authentication, default is to use the admin account and the database specified using the databaseName property.
Example:
{
"name": "MongoDBLinkedService",
"properties": {
"type": "MongoDb",
"typeProperties": {
"server": "<server name>",
"databaseName": "<database name>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
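If your credentials must be validated against a different database than the one you copy from, you can add the authSource property described above. This is a sketch only; confirm the exact behavior for your MongoDB deployment.
{
    "name": "MongoDBLinkedService",
    "properties": {
        "type": "MongoDb",
        "typeProperties": {
            "server": "<server name>",
            "databaseName": "<database name>",
            "authenticationType": "Basic",
            "username": "<username>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            },
            "authSource": "<authentication database name>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}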
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB dataset:
Example:
{
"name": "MongoDbDataset",
"properties": {
"type": "MongoDbCollection",
"linkedServiceName": {
"referenceName": "<MongoDB linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"collectionName": "<Collection name>"
}
}
}
query: Use the custom SQL-92 query to read data. For example: select * from MyTable. Required: No (if "collectionName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMongoDB",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
When you specify the SQL query, pay attention to the DateTime format. For example:
SELECT * FROM Account WHERE LastModifiedDate >= '2018-06-01' AND LastModifiedDate < '2018-06-02'
or, to use a parameter:
SELECT * FROM Account WHERE LastModifiedDate >= '@{formatDateTime(pipeline().parameters.StartTime,'yyyy-MM-dd HH:mm:ss')}' AND LastModifiedDate < '@{formatDateTime(pipeline().parameters.EndTime,'yyyy-MM-dd HH:mm:ss')}'
MONGODB DATA TYPE → DATA FACTORY INTERIM DATA TYPE
Binary → Byte[]
Boolean → Boolean
Date → DateTime
NumberDouble → Double
NumberInt → Int32
NumberLong → Int64
ObjectID → String
String → String
UUID → Guid
NOTE
To learn about support for arrays using virtual tables, refer to Support for complex types using virtual tables section.
Currently, the following MongoDB data types are not supported: DBPointer, JavaScript, Max/Min key, Regular Expression,
Symbol, Timestamp, Undefined.
The driver would generate multiple virtual tables to represent this single table. The first virtual table is the base
table named "ExampleTable", shown in the example. The base table contains all the data of the original table, but
the data from the arrays has been omitted and is expanded in the virtual tables.
The following tables show the virtual tables that represent the original arrays in the example. These tables
contain the following:
A reference back to the original primary key column corresponding to the row of the original array (via the
_id column)
An indication of the position of the data within the original array
The expanded data for each element within the array
Table "ExampleTable_Invoices" (columns: _id, ExampleTable_Invoices_dim1_idx, Invoice_Id, Item, Price, Discount)
Table "ExampleTable_Ratings":
_id | ExampleTable_Ratings_dim1_idx | ExampleTable_Ratings
1111 | 0 | 5
1111 | 1 | 6
2222 | 0 | 1
2222 | 1 | 2
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from or to MongoDB Atlas using Azure
Data Factory
6/8/2021 • 5 minutes to read
Supported capabilities
You can copy data from MongoDB Atlas database to any supported sink data store, or copy data from any
supported source data store to MongoDB Atlas database. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MongoDB Atlas connector supports versions up to 4.2 .
Prerequisites
If you use Azure Integration Runtime for copy, make sure you add the effective region's Azure Integration
Runtime IPs to the MongoDB Atlas IP Access List.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MongoDB Atlas connector.
Example:
{
"name": "MongoDbAtlasLinkedService",
"properties": {
"type": "MongoDbAtlas",
"typeProperties": {
"connectionString": "mongodb+srv://<username>:<password>@<clustername>.<randomString>.
<hostName>/<dbname>?<otherProperties>",
"database": "myDatabase"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following properties are supported for MongoDB Atlas dataset:
Example:
{
"name": "MongoDbAtlasDataset",
"properties": {
"type": "MongoDbAtlasCollection",
"typeProperties": {
"collectionName": "<Collection name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<MongoDB Atlas linked service name>",
"type": "LinkedServiceReference"
}
}
}
TIP
ADF supports consuming BSON documents in Strict mode. Make sure your filter query is in Strict mode instead of Shell
mode. More description can be found in the MongoDB manual.
Example:
"activities":[
{
"name": "CopyFromMongoDbAtlas",
"type": "Copy",
"inputs": [
{
"referenceName": "<MongoDB Atlas input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MongoDbAtlasSource",
"filter": "{datetimeData: {$gte: ISODate(\"2018-12-11T00:00:00.000Z\"),$lt: ISODate(\"2018-
12-12T00:00:00.000Z\")}, _id: ObjectId(\"5acd7c3d0000000000000000\") }",
"cursorMethods": {
"project": "{ _id : 1, name : 1, age: 1, datetimeData: 1 }",
"sort": "{ age : 1 }",
"skip": 3,
"limit": 3
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
To import JSON documents as-is, refer to Import or export JSON documents section; to copy from tabular-shaped data,
refer to Schema mapping.
Example
"activities":[
{
"name": "CopyToMongoDBAtlas",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Document DB output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "MongoDbAtlasSink",
"writeBehavior": "upsert"
}
}
}
]
Schema mapping
To copy data from MongoDB Atlas to a tabular sink or the reverse, refer to schema mapping.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from MySQL using Azure Data Factory
5/6/2021 • 5 minutes to read
NOTE
To copy data from or to Azure Database for MySQL service, use the specialized Azure Database for MySQL connector.
Supported capabilities
This MySQL connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from MySQL database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this MySQL connector supports MySQL version 5.6, 5.7 and 8.0 .
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in MySQL driver starting from version 3.7; therefore, you don't need to
manually install any driver.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
MySQL connector.
SSLCert: The full path and name of a .pem file containing the SSL certificate used for proving the identity of the client. To specify a private key for encrypting this certificate before sending it to the server, use the SSLKey property. Required: Yes, if using two-way SSL verification.
SSLKey: The full path and name of a file containing the private key used for encrypting the client-side certificate during two-way SSL verification. Required: Yes, if using two-way SSL verification.
Example:
{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;PWD=
<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
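If your server requires two-way SSL, the SSLCert and SSLKey properties described above are appended to the connection string. The following is a sketch only; any additional SSL-related settings your server may require are not shown.
{
    "name": "MySQLLinkedService",
    "properties": {
        "type": "MySql",
        "typeProperties": {
            "connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;PWD=<password>;SSLCert=<full path to the client certificate .pem file>;SSLKey=<full path to the client private key file>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}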
{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
If you were using a MySQL linked service with the following payload, it is still supported as-is, but you are
encouraged to use the new one going forward.
Previous payload:
{
"name": "MySQLLinkedService",
"properties": {
"type": "MySql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by MySQL dataset.
To copy data from MySQL, the following properties are supported:
tableName: Name of the table in the MySQL database. Required: No (if "query" in activity source is specified).
Example
{
"name": "MySQLDataset",
"properties":
{
"type": "MySqlTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<MySQL linked service name>",
"type": "LinkedServiceReference"
}
}
}
If you were using the RelationalTable typed dataset, it is still supported as-is, but you are encouraged to use the
new one going forward.
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromMySQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<MySQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "MySqlSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
If you were using the RelationalSource typed source, it is still supported as-is, but you are encouraged to use the
new one going forward.
MYSQL DATA TYPE → DATA FACTORY INTERIM DATA TYPE
bigint → Int64
bit(1) → Boolean
blob → Byte[]
bool → Int16
char → String
date → Datetime
datetime → Datetime
double → Double
enum → String
float → Single
int → Int32
integer → Int32
longblob → Byte[]
longtext → String
mediumblob → Byte[]
mediumint → Int32
mediumtext → String
numeric → Decimal
real → Double
set → String
smallint → Int16
text → String
time → TimeSpan
timestamp → Datetime
tinyblob → Byte[]
tinyint → Int16
tinytext → String
varchar → String
year → Int
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Netezza by using Azure Data
Factory
5/6/2021 • 8 minutes to read
TIP
For data migration scenario from Netezza to Azure, learn more from Use Azure Data Factory to migrate data from on-
premises Netezza server to Azure.
Supported capabilities
This Netezza connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Netezza to any supported sink data store. For a list of data stores that Copy Activity
supports as sources and sinks, see Supported data stores and formats.
Netezza connector supports parallel copying from source. See the Parallel copy from Netezza section for details.
Azure Data Factory provides a built-in driver to enable connectivity. You don't need to manually install any driver
to use this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Get started
You can create a pipeline that uses a copy activity by using the .NET SDK, the Python SDK, Azure PowerShell, the
REST API, or an Azure Resource Manager template. See the Copy Activity tutorial for step-by-step instructions
on how to create a pipeline that has a copy activity.
The following sections provide details about properties you can use to define Data Factory entities that are
specific to the Netezza connector.
Linked service properties
The following properties are supported for the Netezza linked service:
CaCertFile: The full path to the SSL certificate that's used by the server. Example: CaCertFile=<cert path>; Required: Yes, if SSL is enabled.
Example
{
"name": "NetezzaLinkedService",
"properties": {
"type": "Netezza",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=
<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "NetezzaLinkedService",
"properties": {
"type": "Netezza",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;",
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
This section provides a list of properties that the Netezza dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets.
To copy data from Netezza, set the type property of the dataset to NetezzaTable . The following properties are
supported:
tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. Required: No (if "query" in activity source is specified).
Example
{
"name": "NetezzaDataset",
"properties": {
"type": "NetezzaTable",
"linkedServiceName": {
"referenceName": "<Netezza linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
TIP
To load data from Netezza efficiently by using data partitioning, learn more from Parallel copy from Netezza section.
To copy data from Netezza, set the source type in Copy Activity to NetezzaSource . The following properties
are supported in the Copy Activity source section:
query: Use the custom SQL query to read data. Example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromNetezza",
"type": "Copy",
"inputs": [
{
"referenceName": "<Netezza input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "NetezzaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
When you enable partitioned copy, Data Factory runs parallel queries against your Netezza source to load data
by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For example, if
you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on your
specified partition option and settings, and each query retrieves a portion of data from your Netezza database.
We suggest that you enable parallel copy with data partitioning, especially when you load a large amount of data
from your Netezza database. The following are suggested configurations for different scenarios. When copying
data into a file-based data store, it's recommended to write to a folder as multiple files (only specify the folder name),
in which case the performance is better than writing to a single file.
Full load from large table:
Partition option: Data Slice.

Load large amount of data by using a custom query:
Partition option: Data Slice.
Query: SELECT * FROM <TABLENAME> WHERE mod(datasliceid, ?AdfPartitionCount) = ?AdfDataSliceCondition AND <your_additional_where_clause>.
During execution, Data Factory replaces ?AdfPartitionCount (with the parallel copy number set on the copy activity) and ?AdfDataSliceCondition with the data slice partition logic, and sends the query to Netezza.

Load large amount of data by using a custom query, having an integer column with evenly distributed value for range partitioning:
Partition option: Dynamic range partition.
Query: SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>.
Partition column: Specify the column used to partition data. You can partition against the column with integer data type.
Partition upper bound and partition lower bound: Specify if you want to filter against the partition column to retrieve data only between the lower and upper range.
"source": {
"type": "NetezzaSource",
"query":"SELECT * FROM <TABLENAME> WHERE mod(datasliceid, ?AdfPartitionCount) = ?AdfDataSliceCondition
AND <your_additional_where_clause>",
"partitionOption": "DataSlice"
}
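For the dynamic range option, a source fragment might look like the following sketch. The partitionSettings property names (partitionColumnName, partitionUpperBound, partitionLowerBound) and the placement of parallelCopies are assumptions used to illustrate the idea; confirm them before use.
"typeProperties": {
    "parallelCopies": 4,
    "source": {
        "type": "NetezzaSource",
        "query": "SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>",
        "partitionOption": "DynamicRange",
        "partitionSettings": {
            "partitionColumnName": "<integer_partition_column>",
            "partitionUpperBound": "<upper_value>",
            "partitionLowerBound": "<lower_value>"
        }
    },
    "sink": {
        "type": "<sink type>"
    }
}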
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from an OData source by using Azure
Data Factory
5/6/2021 • 8 minutes to read
Supported capabilities
This OData connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an OData source to any supported sink data store. For a list of data stores that Copy
Activity supports as sources and sinks, see Supported data stores and formats.
Specifically, this OData connector supports:
OData version 3.0 and 4.0.
Copying data by using one of the following authentications: Anonymous , Basic , Windows , and AAD
ser vice principal .
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to an OData connector.
{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "https://services.odata.org/OData/OData.svc",
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
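The connector also supports Basic authentication; a linked service for it would presumably look like the following sketch, built by analogy with the Windows example that follows, so verify the property names for your source.
{
    "name": "ODataLinkedService",
    "properties": {
        "type": "OData",
        "typeProperties": {
            "url": "<endpoint of OData source>",
            "authenticationType": "Basic",
            "userName": "<user name>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}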
{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "Windows",
"userName": "<domain>\\<user>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "ODataLinkedService",
"properties": {
"type": "OData",
"typeProperties": {
"url": "<endpoint of OData source>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"aadServicePrincipalCredentialType": "ServicePrincipalCert",
"servicePrincipalEmbeddedCert": {
"type": "SecureString",
"value": "<base64 encoded string of (.pfx) certificate data>"
},
"servicePrincipalEmbeddedCertPassword": {
"type": "SecureString",
"value": "<password of your certificate>"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource e.g. https://tenant.sharepoint.com>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
Dataset properties
This section provides a list of properties that the OData dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from OData, set the type property of the dataset to ODataResource . The following properties are
supported:
Example
{
"name": "ODataDataset",
"properties":
{
"type": "ODataResource",
"schema": [],
"linkedServiceName": {
"referenceName": "<OData linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties":
{
"path": "Products"
}
}
}
Example
"activities":[
{
"name": "CopyFromOData",
"type": "Copy",
"inputs": [
{
"referenceName": "<OData input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ODataSource",
"query": "$select=Name,Description&$top=5"
},
"sink": {
"type": "<sink type>"
}
}
}
]
If you were using the RelationalSource typed source, it is still supported as-is, but you are encouraged to use the
new one going forward.
The following mappings are used from OData data types to Data Factory interim data types:
Edm.Binary -> Byte[]
Edm.Boolean -> Bool
Edm.Byte -> Byte[]
Edm.DateTime -> DateTime
Edm.Decimal -> Decimal
Edm.Double -> Double
Edm.Single -> Single
Edm.Guid -> Guid
Edm.Int16 -> Int16
Edm.Int32 -> Int32
Edm.Int64 -> Int64
Edm.SByte -> Int16
Edm.String -> String
Edm.Time -> TimeSpan
Edm.DateTimeOffset -> DateTimeOffset
NOTE
OData complex data types (such as Object ) aren't supported.
The access token expires in 1 hour by default; you need to get a new access token when it expires.
1. Use Postman to get the access token:
a. Navigate to the Authorization tab in Postman.
b. In the Type box, select OAuth 2.0 , and in the Add authorization data to box, select Request
Headers .
c. Fill in the following information on the Configure New Token page to get a new access token:
Grant type : Select Authorization Code .
Callback URL : Enter https://www.localhost.com/ .
Auth URL : Enter
https://login.microsoftonline.com/common/oauth2/authorize?resource=https://<your tenant
name>.sharepoint.com
. Replace <your tenant name> with your own tenant name.
Access Token URL : Enter https://login.microsoftonline.com/common/oauth2/token .
Client ID : Enter your AAD service principal ID.
Client Secret : Enter your service principal secret.
Client Authentication : Select Send as Basic Auth header .
d. You will be asked to sign in with your username and password.
e. Once you get your access token, please copy and save it for the next step.
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to ODBC data stores using
Azure Data Factory
5/10/2021 • 5 minutes to read
Supported capabilities
This ODBC connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an ODBC source to any supported sink data store, or copy data from any supported source data
store to an ODBC sink. For a list of data stores that are supported as sources/sinks by the copy activity, see the
Supported data stores table.
Specifically, this ODBC connector supports copying data from/to any ODBC-compatible data store using
Basic or Anonymous authentication. A 64-bit ODBC driver is required. For the ODBC sink, ADF supports the ODBC
version 2.0 standard.
Prerequisites
To use this ODBC connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the 64-bit ODBC driver for the data store on the Integration Runtime machine.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ODBC connector.
{
"name": "ODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": "<connection string>",
"authenticationType": "Anonymous",
"credential": {
"type": "SecureString",
"value": "RefreshToken=<secret refresh token>;"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
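The example above uses Anonymous authentication. As a minimal sketch of the Basic authentication variant (the placeholder connection string and user name here are illustrative assumptions), the linked service might look like the following:
{
    "name": "ODBCLinkedService",
    "properties": {
        "type": "Odbc",
        "typeProperties": {
            "connectionString": "<connection string>",
            "authenticationType": "Basic",
            "userName": "<user name>",
            "password": {
                "type": "SecureString",
                "value": "<password>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}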
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by ODBC dataset.
To copy data from/to ODBC-compatible data store, the following properties are supported:
tableName: Name of the table in the ODBC data store. Required: No for source (if "query" in activity source is specified); Yes for sink.
Example
{
"name": "ODBCDataset",
"properties": {
"type": "OdbcTable",
"schema": [],
"linkedServiceName": {
"referenceName": "<ODBC linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "<table name>"
}
}
}
If you were using the RelationalTable typed dataset, it is still supported as-is, but we suggest that you use the
new one going forward.
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable" . Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<ODBC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "OdbcSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
If you were using the RelationalSource typed source, it is still supported as-is, but we suggest that you use the
new one going forward.
ODBC as sink
To copy data to ODBC-compatible data store, set the sink type in the copy activity to OdbcSink . The following
properties are supported in the copy activity sink section:
writeBatchSize: Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are: integer (number of rows). Required: No (default is 0 - auto detected).
Example:
"activities":[
{
"name": "CopyToODBC",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<ODBC output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OdbcSink",
"writeBatchSize": 100000
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Office 365 into Azure using Azure
Data Factory
6/8/2021 • 7 minutes to read
Supported capabilities
The ADF Office 365 connector and Microsoft Graph Data Connect enable at-scale ingestion of different types of
datasets from Exchange email-enabled mailboxes, including address book contacts, calendar events, email
messages, user information, mailbox settings, and so on. Refer here to see the complete list of datasets available.
For now, within a single copy activity you can only copy data from Office 365 into Azure Blob Storage ,
Azure Data Lake Storage Gen1 , and Azure Data Lake Storage Gen2 in JSON format (type
setOfObjects). If you want to load Office 365 into other types of data stores or in other formats, you can chain
the first copy activity with a subsequent copy activity to further load data into any of the supported ADF
destination stores (refer to "supported as a sink" column in the "Supported data stores and formats" table).
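As a hedged sketch of that chaining pattern (the activity and dataset names used here are hypothetical placeholders), the second copy activity can be made to run only after the Office 365 copy succeeds by using the pipeline dependsOn property:
"activities": [
    {
        "name": "CopyFromOffice365ToBlob",
        "type": "Copy",
        "inputs": [ { "referenceName": "<Office 365 input dataset name>", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "<intermediate Blob dataset name>", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "Office365Source" },
            "sink": { "type": "BlobSink" }
        }
    },
    {
        "name": "CopyFromBlobToFinalStore",
        "type": "Copy",
        "dependsOn": [
            { "activity": "CopyFromOffice365ToBlob", "dependencyConditions": [ "Succeeded" ] }
        ],
        "inputs": [ { "referenceName": "<intermediate Blob dataset name>", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "<final output dataset name>", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "<source type matching the intermediate format>" },
            "sink": { "type": "<sink type>" }
        }
    }
]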
IMPORTANT
The Azure subscription containing the data factory and the sink data store must be under the same Azure Active
Directory (Azure AD) tenant as the Office 365 tenant.
Ensure that the Azure Integration Runtime region used for the copy activity, as well as the destination, is in the same
region where the Office 365 tenant users' mailboxes are located. Refer here to understand how the Azure IR location is
determined. Refer to the table here for the list of supported Office regions and corresponding Azure regions.
Service Principal authentication is the only authentication mechanism supported for Azure Blob Storage, Azure Data
Lake Storage Gen1, and Azure Data Lake Storage Gen2 as destination stores.
Prerequisites
To copy data from Office 365 into Azure, you need to complete the following prerequisite steps:
Your Office 365 tenant admin must complete on-boarding actions as described here.
Create and configure an Azure AD web application in Azure Active Directory. For instructions, see Create an
Azure AD application.
Make note of the following values, which you will use to define the linked service for Office 365:
Tenant ID. For instructions, see Get tenant ID.
Application ID and Application key. For instructions, see Get application ID and authentication key.
Add the user identity who will be making the data access request as the owner of the Azure AD web
application (from the Azure AD web application > Settings > Owners > Add owner).
The user identity must be in the Office 365 organization you are getting data from and must not be a
Guest user.
Policy validation
If ADF is created as part of a managed app and Azure Policy assignments are made on resources within the
management resource group, then for every copy activity run, ADF checks to make sure the policy
assignments are enforced. Refer here for a list of supported policies.
Getting started
TIP
For a walkthrough of using Office 365 connector, see Load data from Office 365 article.
You can create a pipeline with the copy activity by using one of the following tools or SDKs. Select a link to go to
a tutorial with step-by-step instructions to create a pipeline with a copy activity.
Azure portal
.NET SDK
Python SDK
Azure PowerShell
REST API
Azure Resource Manager template.
The following sections provide details about properties that are used to define Data Factory entities specific to
Office 365 connector.
NOTE
The difference between office365TenantId and servicePrincipalTenantId and the corresponding value to provide:
If you are an enterprise developer developing an application against Office 365 data for your own organization's usage,
then you should supply the same tenant ID for both properties, which is your organization's AAD tenant ID.
If you are an ISV developer developing an application for your customers, then office365TenantId will be your
customer's (application installer) AAD tenant ID and servicePrincipalTenantId will be your company's AAD tenant ID.
Example:
{
"name": "Office365LinkedService",
"properties": {
"type": "Office365",
"typeProperties": {
"office365TenantId": "<Office 365 tenant id>",
"servicePrincipalTenantId": "<AAD app service principal tenant id>",
"servicePrincipalId": "<AAD app service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<AAD app service principal key>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Office 365 dataset.
To copy data from Office 365, the following properties are supported:
If you were setting dateFilterColumn, startTime, endTime, and userScopeFilterUri in the dataset, they are still
supported as-is, but we suggest that you use the new model in the activity source going forward.
Example
{
"name": "DS_May2019_O365_Message",
"properties": {
"type": "Office365Table",
"linkedServiceName": {
"referenceName": "<Office 365 linked service name>",
"type": "LinkedServiceReference"
},
"schema": [],
"typeProperties": {
"tableName": "BasicDataSet_v0.Event_v1"
}
}
}
dateFilterColumn: Name of the DateTime filter column. Use this property to limit the time range for which Office 365 data is extracted. Required: Yes if the dataset has one or more DateTime columns. Refer here for the list of datasets that require this DateTime filter.
Example:
"activities": [
{
"name": "CopyFromO365ToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "<Office 365 input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "Office365Source",
"dateFilterColumn": "CreatedDateTime",
"startTime": "2019-04-28T16:00:00.000Z",
"endTime": "2019-05-05T16:00:00.000Z",
"userScopeFilterUri": "https://graph.microsoft.com/v1.0/users?$filter=Department eq
'Finance'",
"outputColumns": [
{
"name": "Id"
},
{
"name": "CreatedDateTime"
},
{
"name": "LastModifiedDateTime"
},
{
"name": "ChangeKey"
},
{
"name": "Categories"
},
{
"name": "OriginalStartTimeZone"
},
{
"name": "OriginalEndTimeZone"
},
{
"name": "ResponseStatus"
},
{
"name": "iCalUId"
},
{
"name": "ReminderMinutesBeforeStart"
},
{
"name": "IsReminderOn"
},
{
"name": "HasAttachments"
},
{
"name": "Subject"
},
{
"name": "Body"
},
{
"name": "Importance"
},
{
"name": "Sensitivity"
},
{
"name": "Start"
},
{
"name": "End"
},
{
"name": "Location"
},
{
"name": "IsAllDay"
},
{
"name": "IsCancelled"
},
{
"name": "IsOrganizer"
},
{
"name": "Recurrence"
},
{
"name": "ResponseRequested"
},
{
"name": "ShowAs"
},
{
"name": "Type"
},
{
"name": "Attendees"
},
{
"name": "Organizer"
},
{
"name": "WebLink"
},
{
"name": "Attachments"
},
{
"name": "BodyPreview"
},
{
"name": "Locations"
},
{
"name": "OnlineMeetingUrl"
},
{
"name": "OriginalStart"
},
{
"name": "SeriesMasterId"
}
]
},
"sink": {
"type": "BlobSink"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to Oracle by using Azure Data
Factory
5/6/2021 • 13 minutes to read
Supported capabilities
This Oracle connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an Oracle database to any supported sink data store. You also can copy data from any
supported source data store to an Oracle database. For a list of data stores that are supported as sources or
sinks by the copy activity, see the Supported data stores table.
Specifically, this Oracle connector supports:
The following versions of an Oracle database:
Oracle 19c R1 (19.1) and higher
Oracle 18c R1 (18.1) and higher
Oracle 12c R1 (12.1) and higher
Oracle 11g R1 (11.1) and higher
Oracle 10g R1 (10.1) and higher
Oracle 9i R2 (9.2) and higher
Oracle 8i R3 (8.1.7) and higher
Oracle Database Cloud Exadata Service
Parallel copying from an Oracle source. See the Parallel copy from Oracle section for details.
NOTE
Oracle proxy server isn't supported.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The integration runtime provides a built-in Oracle driver. Therefore, you don't need to manually install a driver
when you copy data from and to Oracle.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Oracle connector.
TIP
If you get an error, "ORA-01025: UPI parameter out of range", and your Oracle version is 8i, add WireProtocolMode=1
to your connection string. Then try again.
If you have multiple Oracle instances for a failover scenario, you can create an Oracle linked service and fill in the
primary host, port, user name, password, and so on, and then add a new "Additional connection properties " entry with
the property name AlternateServers and the value
(HostName=<secondary host>:PortNumber=<secondary port>:ServiceName=<secondary service name>) - do not miss
the brackets and pay attention to the colons ( : ) used as separators. As an example, the following value of alternate
servers defines two alternate database servers for connection failover:
(HostName=AccountingOracleServer:PortNumber=1521:SID=Accounting,HostName=255.201.11.24:PortNumber=1522:ServiceName=ABackup.NA.MyCompany).
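A minimal sketch of how this might look in the linked service JSON, assuming the additional property is simply appended to the connection string (the host names and ports shown are illustrative placeholders):
{
    "name": "OracleLinkedService",
    "properties": {
        "type": "Oracle",
        "typeProperties": {
            "connectionString": "Host=<primary host>;Port=<primary port>;Sid=<sid>;User Id=<username>;Password=<password>;AlternateServers=(HostName=<secondary host>:PortNumber=<secondary port>:ServiceName=<secondary service name>);"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}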
More connection properties that you can set in the connection string, depending on your case:
openssl x509 -inform DER -in [Full Path to the DER Certificate including the name of the DER
Certificate] -text
Example: Extract cert info from DERcert.cer, and then save the output to cert.txt.
2. Build the keystore or truststore . The following command creates the truststore file, with or
without a password, in PKCS-12 format.
openssl pkcs12 -in [Path to the file created in the previous step] -out [Path and name of
TrustStore] -passout pass:[Keystore PWD] -nokeys -export
openssl pkcs12 -in cert.txt -out MyTrustStoreFile -passout pass:ThePWD -nokeys -export
3. Place the truststore file on the self-hosted IR machine. For example, place the file at
C:\MyTrustStoreFile.
4. In Azure Data Factory, configure the Oracle connection string with EncryptionMethod=1 and the
corresponding TrustStore / TrustStorePassword value. For example,
Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=
<password>;EncryptionMethod=1;TrustStore=C:\\MyTrustStoreFile;TrustStorePassword=
<trust_store_password>
.
Example:
{
"name": "OracleLinkedService",
"properties": {
"type": "Oracle",
"typeProperties": {
"connectionString": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "OracleLinkedService",
"properties": {
"type": "Oracle",
"typeProperties": {
"connectionString": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
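Combining the TLS steps above, a sketch of a linked service that uses the encrypted connection (reusing the connection string from step 4 and the example truststore location C:\MyTrustStoreFile) might look like the following:
{
    "name": "OracleLinkedService",
    "properties": {
        "type": "Oracle",
        "typeProperties": {
            "connectionString": "Host=<host>;Port=<port>;Sid=<sid>;User Id=<username>;Password=<password>;EncryptionMethod=1;TrustStore=C:\\MyTrustStoreFile;TrustStorePassword=<trust_store_password>"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}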
Dataset properties
This section provides a list of properties supported by the Oracle dataset. For a full list of sections and
properties available for defining datasets, see Datasets.
To copy data from and to Oracle, set the type property of the dataset to OracleTable . The following properties
are supported.
tableName: Name of the table/view with schema. This property is supported for backward compatibility. For new workload, use schema and table. Required: No for source, Yes for sink.
Example:
{
"name": "OracleDataset",
"properties":
{
"type": "OracleTable",
"schema": [],
"typeProperties": {
"schema": "<schema_name>",
"table": "<table_name>"
},
"linkedServiceName": {
"referenceName": "<Oracle linked service name>",
"type": "LinkedServiceReference"
}
}
}
TIP
To load data from Oracle efficiently by using data partitioning, learn more from Parallel copy from Oracle.
To copy data from Oracle, set the source type in the copy activity to OracleSource . The following properties are
supported in the copy activity source section.
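Because the source property table is not reproduced here, the following is only a minimal sketch of a copy activity with an Oracle source; the oracleReaderQuery property shown for a custom query is an assumption and should be verified against the connector reference:
"activities":[
    {
        "name": "CopyFromOracle",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Oracle input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "OracleSource",
                "oracleReaderQuery": "SELECT * FROM MyTable"
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]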
Oracle as sink
To copy data to Oracle, set the sink type in the copy activity to OracleSink . The following properties are
supported in the copy activity sink section.
writeBatchSize: Inserts data into the SQL table when the buffer size reaches writeBatchSize. Allowed values are Integer (number of rows). Required: No (default is 10,000).
Example:
"activities":[
{
"name": "CopyToOracle",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Oracle output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "OracleSink"
}
}
}
]
When you enable partitioned copy, Data Factory runs parallel queries against your Oracle source to load data by
partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For example, if
you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on your
specified partition option and settings, and each query retrieves a portion of data from your Oracle database.
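As a brief hedged sketch of where that setting lives (the partition option and parallel degree shown are illustrative), parallelCopies sits alongside the source and sink in the copy activity's typeProperties:
"typeProperties": {
    "parallelCopies": 4,
    "source": {
        "type": "OracleSource",
        "partitionOption": "PhysicalPartitionsOfTable"
    },
    "sink": {
        "type": "<sink type>"
    }
}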
We suggest that you enable parallel copy with data partitioning, especially when you load a large amount of data
from your Oracle database. The following are suggested configurations for different scenarios. When copying
data into a file-based data store, it's recommended to write to a folder as multiple files (only specify the folder name),
in which case the performance is better than writing to a single file.
Scenario: Full load from large table, with physical partitions.
Suggested settings: Partition option: Physical partitions of table.

Scenario: Full load from large table, without physical partitions, while with an integer column for data partitioning.
Suggested settings: Partition option: Dynamic range partition. Partition column: Specify the column used to partition data. If not specified, the primary key column is used.

Scenario: Load a large amount of data by using a custom query, with physical partitions.
Suggested settings: Partition option: Physical partitions of table. Query: SELECT * FROM <TABLENAME> PARTITION("?AdfTabularPartitionName") WHERE <your_additional_where_clause>. Partition name: Specify the partition name(s) to copy data from. If not specified, Data Factory automatically detects the physical partitions on the table you specified in the Oracle dataset.

Scenario: Load a large amount of data by using a custom query, without physical partitions, while with an integer column for data partitioning.
Suggested settings: Partition option: Dynamic range partition. Query: SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>. Partition column: Specify the column used to partition data. You can partition against a column with integer data type. Partition upper bound and partition lower bound: Specify if you want to filter against the partition column to retrieve data only between the lower and upper range.
TIP
When copying data from a non-partitioned table, you can use the "Dynamic range" partition option to partition against an
integer column. If your source data doesn't have such a column, you can leverage the ORA_HASH function in the source
query to generate a column and use it as the partition column.
"source": {
"type": "OracleSource",
"query":"SELECT * FROM <TABLENAME> PARTITION(\"?AdfTabularPartitionName\") WHERE
<your_additional_where_clause>",
"partitionOption": "PhysicalPartitionsOfTable",
"partitionSettings": {
"partitionNames": [
"<partitionA_name>",
"<partitionB_name>"
]
}
}
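For the dynamic range option described in the table above, the following is only a hedged sketch of the source configuration; the partitionSettings property names are assumptions based on the documented partition column and upper/lower bound settings:
"source": {
    "type": "OracleSource",
    "query": "SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "<partition_column_name>",
        "partitionUpperBound": "<upper_value_of_partition_column>",
        "partitionLowerBound": "<lower_value_of_partition_column>"
    }
}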
The following mappings are used from Oracle data types to Data Factory interim data types:
BFILE -> Byte[]
BLOB -> Byte[] (only supported on Oracle 10g and higher)
CHAR -> String
CLOB -> String
DATE -> DateTime
LONG -> String
NCHAR -> String
NCLOB -> String
NVARCHAR2 -> String
RAW -> Byte[]
ROWID -> String
TIMESTAMP -> DateTime
VARCHAR2 -> String
XML -> String
NOTE
The data types INTERVAL YEAR TO MONTH and INTERVAL DAY TO SECOND aren't supported.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Oracle Cloud Storage by using
Azure Data Factory
5/14/2021 • 8 minutes to read
Supported capabilities
This Oracle Cloud Storage connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, this Oracle Cloud Storage connector supports copying files as is or parsing files with the supported
file formats and compression codecs. It takes advantage of Oracle Cloud Storage's S3-compatible
interoperability.
Prerequisites
To copy data from Oracle Cloud Storage, please refer here for the prerequisites and required permission.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Cloud Storage.
Here's an example:
{
"name": "OracleCloudStorageLinkedService",
"properties": {
"type": "OracleCloudStorage",
"typeProperties": {
"accessKeyId": "<access key id>",
"secretAccessKey": {
"type": "SecureString",
"value": "<secret access key>"
},
"serviceUrl": "https://<namespace>.compat.objectstorage.<region identifier>.oraclecloud.com"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for Oracle Cloud Storage under location settings in a format-based
dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<Oracle Cloud Storage linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "OracleCloudStorageLocation",
"bucketName": "bucketname",
"folderPath": "folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
OPTION 2: Oracle Cloud Storage prefix - prefix: Prefix for the Oracle Cloud Storage key name under the given bucket configured in the dataset to filter source Oracle Cloud Storage files. Oracle Cloud Storage keys whose names start with bucket_in_dataset/this_prefix are selected. It utilizes Oracle Cloud Storage's service-side filter, which provides better performance than a wildcard filter. Required: No.
Additional settings:
Example:
"activities":[
{
"name": "CopyFromOracleCloudStorage",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "OracleCloudStorageReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
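For the prefix option described earlier, a hedged sketch of the source store settings (assuming the property is simply named prefix under OracleCloudStorageReadSettings) might look like:
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "OracleCloudStorageReadSettings",
        "prefix": "folder/subfolder/filenameprefix"
    }
}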
Next steps
For a list of data stores that the Copy activity in Azure Data Factory supports as sources and sinks, see
Supported data stores.
Copy data from Oracle Eloqua using Azure Data
Factory (Preview)
5/6/2021 • 3 minutes to read
IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Oracle Eloqua connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Oracle Eloqua to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Eloqua connector.
Example:
{
"name": "EloquaLinkedService",
"properties": {
"type": "Eloqua",
"typeProperties": {
"endpoint" : "<base URL e.g. xxx.xxx.eloqua.com>",
"username" : "<site name>\\<user name e.g. Eloqua\\Alice>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Eloqua dataset.
To copy data from Oracle Eloqua, set the type property of the dataset to EloquaObject . The following
properties are supported:
Example
{
"name": "EloquaDataset",
"properties": {
"type": "EloquaObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Eloqua linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Accounts" . Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromEloqua",
"type": "Copy",
"inputs": [
{
"referenceName": "<Eloqua input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "EloquaSource",
"query": "SELECT * FROM Accounts"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of supported data stored by Azure Data Factory, see supported data stores.
Copy data from Oracle Responsys using Azure Data
Factory (Preview)
5/6/2021 • 3 minutes to read
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Oracle Responsys connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Oracle Responsys to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.
Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Responsys connector.
Example:
{
"name": "OracleResponsysLinkedService",
"properties": {
"type": "Responsys",
"typeProperties": {
"endpoint" : "<endpoint>",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Responsys dataset.
To copy data from Oracle Responsys, set the type property of the dataset to ResponsysObject . The following
properties are supported:
Example
{
"name": "OracleResponsysDataset",
"properties": {
"type": "ResponsysObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Oracle Responsys linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable" . Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromOracleResponsys",
"type": "Copy",
"inputs": [
{
"referenceName": "<Oracle Responsys input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ResponsysSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Oracle Service Cloud using Azure
Data Factory (Preview)
5/6/2021 • 3 minutes to read
IMPORTANT
This connector is currently in preview. You can try it out and provide feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Oracle Service Cloud connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Oracle Service Cloud to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Oracle Service Cloud connector.
Example:
{
"name": "OracleServiceCloudLinkedService",
"properties": {
"type": "OracleServiceCloud",
"typeProperties": {
"host" : "<host>",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"useEncryptedEndpoints" : true,
"useHostVerification" : true,
"usePeerVerification" : true,
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Oracle Service Cloud dataset.
To copy data from Oracle Service Cloud, set the type property of the dataset to OracleSer viceCloudObject .
The following properties are supported:
Example
{
"name": "OracleServiceCloudDataset",
"properties": {
"type": "OracleServiceCloudObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<OracleServiceCloud linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable" . Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromOracleServiceCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<OracleServiceCloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "OracleServiceCloudSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
ORC format in Azure Data Factory
5/14/2021 • 6 minutes to read
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the ORC dataset.
ORC as sink
The following properties are supported in the copy activity sink section.
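Because the sink property table is not reproduced here, the following is only a minimal hedged sketch of a copy activity that writes ORC files to Azure Blob storage; the OrcSink and AzureBlobStorageWriteSettings type names are assumptions and should be adjusted to your own storage linked service:
"activities":[
    {
        "name": "CopyToOrc",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<ORC output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>"
            },
            "sink": {
                "type": "OrcSink",
                "storeSettings": {
                    "type": "AzureBlobStorageWriteSettings"
                }
            }
        }
    }
]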
Source example
The associated data flow script of an ORC source configuration is:
source(allowSchemaDrift: true,
validateSchema: false,
rowUrlColumn: 'fileName',
format: 'orc') ~> OrcSource
Sink properties
The below table lists the properties supported by an ORC sink. You can edit these properties in the Settings
tab.
When using an inline dataset, you will see additional file settings, which are the same as the properties described in
the dataset properties section.
Sink example
The associated data flow script of an ORC sink configuration is:
OrcSource sink(
format: 'orc',
filePattern:'output[n].orc',
truncate: true,
allowSchemaDrift: true,
validateSchema: false,
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> OrcSink
For a copy running on the Self-hosted IR with ORC file serialization/deserialization, ADF locates the Java runtime by
first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE; if it is
not found, ADF then checks the system variable JAVA_HOME for OpenJDK.
To use JRE : The 64-bit IR requires a 64-bit JRE. You can find it here.
To use OpenJDK : It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies
of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.
To install the Visual C++ 2010 Redistributable Package : The Visual C++ 2010 Redistributable Package is not
installed with self-hosted IR installations. You can find it here.
TIP
If you copy data to/from ORC format using the Self-hosted Integration Runtime and hit an error saying "An error occurred when
invoking java, message: java.lang.OutOfMemoryError : Java heap space ", you can add an environment variable
_JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size for the JVM to support such a
copy, and then rerun the pipeline.
Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g . The flag Xms specifies the initial memory
allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool.
This means that the JVM will be started with Xms amount of memory and will be able to use a maximum of Xmx
amount of memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.
Next steps
Copy activity overview
Lookup activity
GetMetadata activity
Parquet format in Azure Data Factory
5/14/2021 • 6 minutes to read
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Parquet dataset.
NOTE
White space in column names is not supported for Parquet files.
Parquet as sink
The following properties are supported in the copy activity sink section.
Source example
The associated data flow script of a parquet source configuration is:
source(allowSchemaDrift: true,
validateSchema: false,
rowUrlColumn: 'fileName',
format: 'parquet') ~> ParquetSource
Sink properties
The below table lists the properties supported by a parquet sink. You can edit these properties in the Settings
tab.
Sink example
The associated data flow script of a parquet sink configuration is:
ParquetSource sink(
format: 'parquet',
filePattern:'output[n].parquet',
truncate: true,
allowSchemaDrift: true,
validateSchema: false,
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> ParquetSink
For a copy running on the Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java runtime
by first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for
JRE; if it is not found, ADF then checks the system variable JAVA_HOME for OpenJDK.
To use JRE : The 64-bit IR requires a 64-bit JRE. You can find it here.
To use OpenJDK : It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies
of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.
To install the Visual C++ 2010 Redistributable Package : The Visual C++ 2010 Redistributable Package is not
installed with self-hosted IR installations. You can find it here.
TIP
If you copy data to/from Parquet format using the Self-hosted Integration Runtime and hit an error saying "An error occurred
when invoking java, message: java.lang.OutOfMemoryError : Java heap space ", you can add an environment variable
_JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size for the JVM to support such a
copy, and then rerun the pipeline.
Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g . The flag Xms specifies the initial memory
allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool.
This means that the JVM will be started with Xms amount of memory and will be able to use a maximum of Xmx
amount of memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.
Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from PayPal using Azure Data Factory
(Preview)
5/6/2021 • 3 minutes to read
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This PayPal connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from PayPal to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
PayPal connector.
Example:
{
"name": "PayPalLinkedService",
"properties": {
"type": "PayPal",
"typeProperties": {
"host" : "api.sandbox.paypal.com",
"clientId" : "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by PayPal dataset.
To copy data from PayPal, set the type property of the dataset to PayPalObject . The following properties are
supported:
Example
{
"name": "PayPalDataset",
"properties": {
"type": "PayPalObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<PayPal linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Payment_Experience" . Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromPayPal",
"type": "Copy",
"inputs": [
{
"referenceName": "<PayPal input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PayPalSource",
"query": "SELECT * FROM Payment_Experience"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Phoenix using Azure Data Factory
5/6/2021 • 4 minutes to read
Supported capabilities
This Phoenix connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Phoenix to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Phoenix connector.
Linked service properties
The following properties are supported for Phoenix linked service:
NOTE
If your cluster doesn't support sticky sessions (for example, HDInsight), explicitly add the node index at the end of the HTTP path setting;
for example, specify /hbasephoenix0 instead of /hbasephoenix .
Example:
{
"name": "PhoenixLinkedService",
"properties": {
"type": "Phoenix",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "443",
"httpPath" : "/hbasephoenix0",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Phoenix dataset.
To copy data from Phoenix, set the type property of the dataset to PhoenixObject . The following properties are
supported:
tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. Required: No (if "query" in activity source is specified).
Example
{
"name": "PhoenixDataset",
"properties": {
"type": "PhoenixObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Phoenix linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable" . Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromPhoenix",
"type": "Copy",
"inputs": [
{
"referenceName": "<Phoenix input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PhoenixSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from PostgreSQL by using Azure Data
Factory
5/6/2021 • 4 minutes to read
Supported capabilities
This PostgreSQL connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from PostgreSQL database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this PostgreSQL connector supports PostgreSQL version 7.4 and above .
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
The Integration Runtime provides a built-in PostgreSQL driver starting from version 3.7; therefore, you don't
need to manually install any driver.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
PostgreSQL connector.
Example:
{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"connectionString": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;Password=
<Password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"connectionString": "Server=<server>;Database=<database>;Port=<port>;UID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
If you were using a PostgreSQL linked service with the following payload, it is still supported as-is, but we
suggest that you use the new one going forward.
Previous payload:
{
"name": "PostgreSqlLinkedService",
"properties": {
"type": "PostgreSql",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by PostgreSQL dataset.
To copy data from PostgreSQL, the following properties are supported:
tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. Required: No (if "query" in activity source is specified).
Example
{
"name": "PostgreSQLDataset",
"properties":
{
"type": "PostgreSqlTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<PostgreSQL linked service name>",
"type": "LinkedServiceReference"
}
}
}
If you were using the RelationalTable typed dataset, it's still supported as-is, but we suggest that you use the
new one going forward.
query: Use the custom SQL query to read data. For example: "query": "SELECT * FROM \"MySchema\".\"MyTable\"" . Required: No (if "tableName" in dataset is specified).
NOTE
Schema and table names are case-sensitive. Enclose them in "" (double quotes) in the query.
Example:
"activities":[
{
"name": "CopyFromPostgreSQL",
"type": "Copy",
"inputs": [
{
"referenceName": "<PostgreSQL input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PostgreSqlSource",
"query": "SELECT * FROM \"MySchema\".\"MyTable\""
},
"sink": {
"type": "<sink type>"
}
}
}
]
If you were using the RelationalSource typed source, it is still supported as-is, but we suggest that you use the
new one going forward.
Lookup activity properties
To learn details about the properties, check Lookup activity.
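As a hedged illustration only (assuming the standard Lookup activity payload with a dataset reference and the firstRowOnly flag), the PostgreSQL source settings shown above could be reused in a Lookup activity like this:
{
    "name": "LookupPostgreSQL",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "PostgreSqlSource",
            "query": "SELECT * FROM \"MySchema\".\"MyTable\""
        },
        "dataset": {
            "referenceName": "<PostgreSQL input dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}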
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Presto using Azure Data Factory
5/6/2021 • 4 minutes to read
Supported capabilities
This Presto connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Presto to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Presto connector.
Example:
{
"name": "PrestoLinkedService",
"properties": {
"type": "Presto",
"typeProperties": {
"host" : "<host>",
"serverVersion" : "0.148-t",
"catalog" : "<catalog>",
"port" : "<port>",
"authenticationType" : "LDAP",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"timeZoneID" : "Europe/Berlin"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Presto dataset.
To copy data from Presto, set the type property of the dataset to PrestoObject . The following properties are
supported:
tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workload. Required: No (if "query" in activity source is specified).
Example
{
"name": "PrestoDataset",
"properties": {
"type": "PrestoObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Presto linked service name>",
"type": "LinkedServiceReference"
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by Presto source.
Presto as source
To copy data from Presto, set the source type in the copy activity to PrestoSource . The following properties are
supported in the copy activity source section:
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable" . Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromPresto",
"type": "Copy",
"inputs": [
{
"referenceName": "<Presto input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "PrestoSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from QuickBooks Online using Azure
Data Factory (Preview)
5/6/2021 • 3 minutes to read
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This QuickBooks connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from QuickBooks Online to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
This connector supports QuickBooks OAuth 2.0 authentication.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
QuickBooks connector.
Under connectionProperties :
Example:
{
"name": "QuickBooksLinkedService",
"properties": {
"type": "QuickBooks",
"typeProperties": {
"connectionProperties":{
"endpoint":"quickbooks.api.intuit.com",
"companyId":"<company id>",
"consumerKey":"<consumer key>",
"consumerSecret":{
"type": "SecureString",
"value": "<clientSecret>"
},
"refreshToken":{
"type": "SecureString",
"value": "<refresh token>"
},
"useEncryptedEndpoints":true
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by QuickBooks dataset.
To copy data from QuickBooks Online, set the type property of the dataset to QuickBooksObject . The
following properties are supported:
Example
{
"name": "QuickBooksDataset",
"properties": {
"type": "QuickBooksObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<QuickBooks linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM "Bill" WHERE Id = '123'" . Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromQuickBooks",
"type": "Copy",
"inputs": [
{
"referenceName": "<QuickBooks input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "QuickBooksSource",
"query": "SELECT * FROM \"Bill\" WHERE Id = '123' "
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to a REST endpoint by using
Azure Data Factory
5/6/2021 • 13 minutes to read
Supported capabilities
You can copy data from a REST source to any supported sink data store. You also can copy data from any
supported source data store to a REST sink. For a list of data stores that Copy Activity supports as sources and
sinks, see Supported data stores and formats.
Specifically, this generic REST connector supports:
Copying data from a REST endpoint by using the GET or POST methods and copying data to a REST
endpoint by using the POST , PUT or PATCH methods.
Copying data by using one of the following authentications: Anonymous , Basic , AAD service principal ,
and managed identities for Azure resources .
Pagination in the REST APIs.
For REST as source, copying the REST JSON response as-is or parsing it by using schema mapping. Only
response payloads in JSON are supported.
TIP
To test a request for data retrieval before you configure the REST connector in Data Factory, learn about the API
specification for header and body requirements. You can use tools like Postman or a web browser to validate.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to the REST connector.
Example
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"authenticationType": "Basic",
"url" : "<REST endpoint>",
"userName": "<user name>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint e.g. https://www.example.com/>",
"authenticationType": "AadServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"aadResourceId": "<AAD resource URL e.g. https://management.core.windows.net>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint e.g. https://www.example.com/>",
"authenticationType": "ManagedServiceIdentity",
"aadResourceId": "<AAD resource URL e.g. https://management.core.windows.net>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "RESTLinkedService",
"properties": {
"type": "RestService",
"typeProperties": {
"url": "<REST endpoint>",
"authenticationType": "Anonymous",
"authHeader": {
"x-api-key": {
"type": "SecureString",
"value": "<API key>"
}
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
This section provides a list of properties that the REST dataset supports.
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
To copy data from REST, the following properties are supported:
If you were setting requestMethod , additionalHeaders , requestBody , and paginationRules in the dataset, they are
still supported as-is, but we suggest that you use the new model in the activity going forward.
Example:
{
"name": "RESTDataset",
"properties": {
"type": "RestResource",
"typeProperties": {
"relativeUrl": "<relative url>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<REST linked service name>",
"type": "LinkedServiceReference"
}
}
}
NOTE
The REST connector ignores any "Accept" header specified in additionalHeaders . Because the REST connector only
supports responses in JSON, it automatically generates a header of Accept: application/json .
"activities":[
{
"name": "CopyFromREST",
"type": "Copy",
"inputs": [
{
"referenceName": "<REST input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "RestSource",
"additionalHeaders": {
"x-user-defined": "helloworld"
},
"paginationRules": {
"AbsoluteUrl": "$.paging.next"
},
"httpRequestTimeout": "00:01:00"
},
"sink": {
"type": "<sink type>"
}
}
}
]
REST as sink
To copy data to a REST endpoint, set the sink type in the copy activity to RestSink . The following properties are
supported in the copy activity sink section.
The REST connector as sink works with REST APIs that accept JSON. The data is sent in JSON with the
following pattern. As needed, you can use the copy activity schema mapping to reshape the source data to
conform to the payload expected by the REST API.
[
{ <data object> },
{ <data object> },
...
]
Example:
"activities":[
{
"name": "CopyToREST",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<REST output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "RestSink",
"requestMethod": "POST",
"httpRequestTimeout": "00:01:40",
"requestInterval": 10,
"writeBatchSize": 10000,
"httpCompressionType": "none",
},
}
}
]
Pagination support
When copying data from REST APIs, the REST API typically limits the response payload size of a single request
to a reasonable number. To return a large amount of data, it splits the result into multiple pages and requires the
caller to send consecutive requests to get the next page of the result. Usually, the request for one page is
dynamic and composed from information returned in the response of the previous page.
This generic REST connector supports the following pagination patterns:
Next request’s absolute or relative URL = property value in current response body
Next request’s absolute or relative URL = header value in current response headers
Next request’s query parameter = property value in current response body
Next request’s query parameter = header value in current response headers
Next request’s header = property value in current response body
Next request’s header = header value in current response headers
Pagination rules are defined as a dictionary that contains one or more case-sensitive key-value
pairs. The configuration is used to generate the requests starting from the second page. The connector stops
iterating when it gets HTTP status code 204 (No Content), or when any of the JSONPath expressions in
"paginationRules" returns null.
Supported keys in pagination rules:
AbsoluteUrl: Indicates the URL to issue the next request. It can be either an absolute URL or a relative URL.
Supported values in pagination rules:
A JSONPath expression starting with "$" (representing the root of the response body): The response body should contain only one JSON object, and the JSONPath expression should return a single primitive value, which will be used to issue the next request.
Example:
Facebook Graph API returns response in the following structure, in which case next page's URL is represented in
paging.next :
{
"data": [
{
"created_time": "2017-12-12T14:12:20+0000",
"name": "album1",
"id": "1809938745705498_1809939942372045"
},
{
"created_time": "2017-12-12T14:14:03+0000",
"name": "album2",
"id": "1809938745705498_1809941802371859"
},
{
"created_time": "2017-12-12T14:14:11+0000",
"name": "album3",
"id": "1809938745705498_1809941879038518"
}
],
"paging": {
"cursors": {
"after": "MTAxNTExOTQ1MjAwNzI5NDE=",
"before": "NDMyNzQyODI3OTQw"
},
"previous": "https://graph.facebook.com/me/albums?limit=25&before=NDMyNzQyODI3OTQw",
"next": "https://graph.facebook.com/me/albums?limit=25&after=MTAxNTExOTQ1MjAwNzI5NDE="
}
}
The corresponding REST copy activity source configuration, especially the paginationRules , is as follows:
"typeProperties": {
"source": {
"type": "RestSource",
"paginationRules": {
"AbsoluteUrl": "$.paging.next"
},
...
},
"sink": {
"type": "<sink type>"
}
}
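As another illustration of the patterns listed above, suppose an API returns the token for the next page in the response body and expects it back as a query parameter. A minimal sketch of such a source configuration follows; the key QueryParameters.pageToken and the JSONPath $.nextPageToken are illustrative assumptions rather than values taken from a specific API, so verify the exact key syntax against the connector reference for your Data Factory version:
"typeProperties": {
    "source": {
        "type": "RestSource",
        "paginationRules": {
            "QueryParameters.pageToken": "$.nextPageToken"
        }
    },
    "sink": {
        "type": "<sink type>"
    }
}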
Use OAuth
This section describes how to use a solution template to copy data from a REST source into Azure Data Lake
Storage in JSON format by using OAuth.
About the solution template
The template contains two activities:
A Web activity retrieves the bearer token and then passes it to the subsequent Copy activity as authorization.
A Copy activity copies data from REST to Azure Data Lake Storage.
The template defines two parameters:
SinkContainer is the root folder path where the data is copied to in your Azure Data Lake Storage.
SinkDirectory is the directory path under the root where the data is copied to in your Azure Data Lake
Storage.
How to use this solution template
1. Go to the Copy from REST or HTTP using OAuth template. Create a new connection for Source
Connection.
Below are key steps for new linked service (REST) settings:
a. Under Base URL , specify the url parameter for your own source REST service.
b. For Authentication type , choose Anonymous.
2. Create a new connection for Destination Connection.
5. Select Web activity. In Settings , specify the corresponding URL , Method , Headers , and Body to
retrieve the OAuth bearer token from the login API of the service that you want to copy data from. The
placeholder in the template showcases a sample of Azure Active Directory (AAD) OAuth. Note that AAD
authentication is natively supported by the REST connector; this is just an example of an OAuth flow.
URL: Specify the URL to retrieve the OAuth bearer token from. For example, in the sample here it's
https://login.microsoftonline.com/microsoft.onmicrosoft.com/oauth2/token .
Method: The HTTP method. Allowed values are Post and Get .
6. In the Copy data activity, select the Source tab. You can see that the bearer token (access_token) retrieved
in the previous step is passed to the Copy data activity as Authorization under Additional headers.
Confirm the settings for the following properties before starting a pipeline run.
Request method: The HTTP method. Allowed values are Get (default) and Post .
9. Click the "Output" icon of WebActivity in Actions column, you would see the access_token returned by
the service.
10. Click the "Input" icon of CopyActivity in Actions column, you would see the access_token retrieved by
WebActivity is passed to CopyActivity for authentication.
Cau t i on
To avoid token being logged in plain text, enable "Secure output" in Web activity and "Secure input" in
Copy activity.
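For orientation, here is a minimal sketch of how the token-retrieval Web activity and the downstream copy source could be expressed in pipeline JSON. The AAD token endpoint, the request body, and the activity('GetBearerToken').output.access_token expression are assumptions based on the AAD client-credentials flow that the template placeholders illustrate; adapt them to the login API of your own service:
"activities":[
    {
        "name": "GetBearerToken",
        "type": "WebActivity",
        "policy": { "secureOutput": true },
        "typeProperties": {
            "url": "https://login.microsoftonline.com/<tenant>/oauth2/token",
            "method": "POST",
            "headers": { "Content-Type": "application/x-www-form-urlencoded" },
            "body": "grant_type=client_credentials&client_id=<client id>&client_secret=<client secret>&resource=<resource>"
        }
    },
    {
        "name": "CopyFromRESTWithOAuth",
        "type": "Copy",
        "dependsOn": [ { "activity": "GetBearerToken", "dependencyConditions": [ "Succeeded" ] } ],
        "policy": { "secureInput": true },
        "inputs": [ { "referenceName": "<REST input dataset name>", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "<output dataset name>", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": {
                "type": "RestSource",
                "additionalHeaders": {
                    "Authorization": "@{concat('Bearer ', activity('GetBearerToken').output.access_token)}"
                }
            },
            "sink": { "type": "<sink type>" }
        }
    }
]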
Schema mapping
To copy data from REST endpoint to tabular sink, refer to schema mapping.
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from and to Salesforce by using Azure
Data Factory
5/26/2021 • 9 minutes to read
Supported capabilities
This Salesforce connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Salesforce to any supported sink data store. You also can copy data from any supported
source data store to Salesforce. For a list of data stores that are supported as sources or sinks by the Copy
activity, see the Supported data stores table.
Specifically, this Salesforce connector supports:
Salesforce Developer, Professional, Enterprise, or Unlimited editions.
Copying data from and to Salesforce production, sandbox, and custom domain.
The Salesforce connector is built on top of the Salesforce REST/Bulk API. By default, when copying data from
Salesforce, the connector uses v45 and automatically chooses between the REST and Bulk APIs based on the data
size: when the result set is large, the Bulk API is used for better performance. When writing data to Salesforce, the
connector uses v40 of the Bulk API. You can also explicitly set the API version used to read and write data via the
apiVersion property in the linked service.
Prerequisites
API permission must be enabled in Salesforce.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Salesforce connector.
{
"name": "SalesforceLinkedService",
"properties": {
"type": "Salesforce",
"typeProperties": {
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"securityToken": {
"type": "SecureString",
"value": "<security token>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
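If you need to pin the API version mentioned earlier, you can add apiVersion to the typeProperties of the linked service above. A minimal sketch (the version string shown is only an example):
"typeProperties": {
    "username": "<username>",
    "password": {
        "type": "SecureString",
        "value": "<password>"
    },
    "securityToken": {
        "type": "SecureString",
        "value": "<security token>"
    },
    "apiVersion": "47.0"
}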
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Salesforce dataset.
To copy data from and to Salesforce, set the type property of the dataset to SalesforceObject . The following
properties are supported.
objectApiName: The Salesforce object name to retrieve data from. Required: No for source, Yes for sink.
IMPORTANT
The "__c" part of API Name is needed for any custom object.
Example:
{
"name": "SalesforceDataset",
"properties": {
"type": "SalesforceObject",
"typeProperties": {
"objectApiName": "MyTable__c"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Salesforce linked service name>",
"type": "LinkedServiceReference"
}
}
}
NOTE
For backward compatibility: When you copy data from Salesforce, if you use the previous "RelationalTable" type dataset, it
keeps working while you see a suggestion to switch to the new "SalesforceObject" type.
tableName: Name of the table in Salesforce. Required: No (if "query" in the activity source is specified).
query: Use the custom query to read data. You can use a Salesforce Object Query Language (SOQL) query or a SQL-92 query. See more tips in the query tips section. If query is not specified, all the data of the Salesforce object specified in "objectApiName" in the dataset will be retrieved. Required: No (if "objectApiName" in the dataset is specified).
IMPORTANT
The "__c" part of API Name is needed for any custom object.
Example:
"activities":[
{
"name": "CopyFromSalesforce",
"type": "Copy",
"inputs": [
{
"referenceName": "<Salesforce input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceSource",
"query": "SELECT Col_Currency__c, Col_Date__c, Col_Email__c FROM AllDataType__c"
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
For backward compatibility: When you copy data from Salesforce, if you use the previous "RelationalSource" type copy, the
source keeps working while you see a suggestion to switch to the new "SalesforceSource" type.
externalIdFieldName: The name of the external ID field for the upsert operation. The specified field must be defined as an "External ID Field" in the Salesforce object. It can't have NULL values in the corresponding input data. Required: Yes for "Upsert".
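For reference, the following is a minimal sketch of a copy activity that writes to Salesforce with upsert behavior. It is modeled on the Salesforce Service Cloud sink example later in this document; the SalesforceSink type name and the field values are illustrative, so adjust them to your own objects:
"activities":[
    {
        "name": "CopyToSalesforce",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Salesforce output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>"
            },
            "sink": {
                "type": "SalesforceSink",
                "writeBehavior": "Upsert",
                "externalIdFieldName": "CustomerId__c",
                "writeBatchSize": 10000,
                "ignoreNullValues": true
            }
        }
    }
]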
Query tips
Retrieve data from a Salesforce report
You can retrieve data from Salesforce reports by specifying a query as {call "<report name>"} . An example is
"query": "{call \"TestReport\"}" .
Quotation marks: In SOQL mode, field/object names cannot be quoted. In SQL mode, field/object names can be quoted, e.g. SELECT "id" FROM "Account" .
Datetime format: Refer to details here and samples in the next section (for both SOQL and SQL modes).
SALESFORCE DATA TYPE    DATA FACTORY INTERIM DATA TYPE
Checkbox Boolean
Currency Decimal
Date DateTime
Date/Time DateTime
Email String
ID String
Number Decimal
Percent Decimal
Phone String
Picklist String
Text String
URL String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from and to Salesforce Service Cloud by
using Azure Data Factory
5/6/2021 • 9 minutes to read
Supported capabilities
This Salesforce Service Cloud connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Salesforce Service Cloud to any supported sink data store. You also can copy data from
any supported source data store to Salesforce Service Cloud. For a list of data stores that are supported as
sources or sinks by the Copy activity, see the Supported data stores table.
Specifically, this Salesforce Service Cloud connector supports:
Salesforce Developer, Professional, Enterprise, or Unlimited editions.
Copying data from and to Salesforce production, sandbox, and custom domain.
The Salesforce connector is built on top of the Salesforce REST/Bulk API. By default, when copying data from
Salesforce, the connector uses v45 and automatically chooses between the REST and Bulk APIs based on the data
size: when the result set is large, the Bulk API is used for better performance. When writing data to Salesforce, the
connector uses v40 of the Bulk API. You can also explicitly set the API version used to read and write data via the
apiVersion property in the linked service.
Prerequisites
API permission must be enabled in Salesforce. For more information, see Enable API access in Salesforce by
permission set
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Salesforce Service Cloud connector.
{
"name": "SalesforceServiceCloudLinkedService",
"properties": {
"type": "SalesforceServiceCloud",
"typeProperties": {
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
},
"securityToken": {
"type": "SecureString",
"value": "<security token>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the Salesforce Service Cloud dataset.
To copy data from and to Salesforce Service Cloud, the following properties are supported.
objectApiName: The Salesforce object name to retrieve data from. Required: No for source, Yes for sink.
IMPORTANT
The "__c" part of API Name is needed for any custom object.
Example:
{
"name": "SalesforceServiceCloudDataset",
"properties": {
"type": "SalesforceServiceCloudObject",
"typeProperties": {
"objectApiName": "MyTable__c"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Salesforce Service Cloud linked service name>",
"type": "LinkedServiceReference"
}
}
}
tableName: Name of the table in Salesforce Service Cloud. Required: No (if "query" in the activity source is specified).
query: Use the custom query to read data. You can use a Salesforce Object Query Language (SOQL) query or a SQL-92 query. See more tips in the query tips section. If query is not specified, all the data of the Salesforce Service Cloud object specified in "objectApiName" in the dataset will be retrieved. Required: No (if "objectApiName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromSalesforceServiceCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<Salesforce Service Cloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceServiceCloudSource",
"query": "SELECT Col_Currency__c, Col_Date__c, Col_Email__c FROM AllDataType__c"
},
"sink": {
"type": "<sink type>"
}
}
}
]
externalIdFieldName: The name of the external ID field for the upsert operation. The specified field must be defined as an "External ID Field" in the Salesforce Service Cloud object. It can't have NULL values in the corresponding input data. Required: Yes for "Upsert".
Example:
"activities":[
{
"name": "CopyToSalesforceServiceCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Salesforce Service Cloud output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SalesforceServiceCloudSink",
"writeBehavior": "Upsert",
"externalIdFieldName": "CustomerId__c",
"writeBatchSize": 10000,
"ignoreNullValues": true
}
}
}
]
Query tips
Retrieve data from a Salesforce Service Cloud report
You can retrieve data from Salesforce Service Cloud reports by specifying a query as {call "<report name>"} .
An example is "query": "{call \"TestReport\"}" .
Retrieve deleted records from the Salesforce Service Cloud Recycle Bin
To query the soft deleted records from the Salesforce Service Cloud Recycle Bin, you can specify readBehavior
as queryAll .
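For example, a copy activity source that reads soft-deleted rows could look like the following sketch; the SOQL filter on IsDeleted is illustrative:
"source": {
    "type": "SalesforceServiceCloudSource",
    "query": "SELECT Id, Name FROM Account WHERE IsDeleted = true",
    "readBehavior": "queryAll"
}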
Difference between SOQL and SQL query syntax
When copying data from Salesforce Service Cloud, you can use either a SOQL query or a SQL query. Note that
these two have different syntax and functionality support; do not mix them. We suggest that you use the SOQL
query, which is natively supported by Salesforce Service Cloud. The following table lists the main differences:
Quotation marks: In SOQL mode, field/object names cannot be quoted. In SQL mode, field/object names can be quoted, e.g. SELECT "id" FROM "Account" .
Datetime format: Refer to details here and samples in the next section (for both SOQL and SQL modes).
SALESFORCE DATA TYPE    DATA FACTORY INTERIM DATA TYPE
Checkbox Boolean
Currency Decimal
Date DateTime
Date/Time DateTime
Email String
ID String
Number Decimal
Percent Decimal
Phone String
Picklist String
Text String
URL String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Salesforce Marketing Cloud using
Azure Data Factory
5/6/2021 • 4 minutes to read
Supported capabilities
This Salesforce Marketing Cloud connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Salesforce Marketing Cloud to any supported sink data store. For a list of data stores
that are supported as sources/sinks by the copy activity, see the Supported data stores table.
The Salesforce Marketing Cloud connector supports OAuth 2 authentication, and it supports both legacy and
enhanced package types. The connector is built on top of the Salesforce Marketing Cloud REST API.
NOTE
This connector doesn't support retrieving custom objects or custom data extensions.
Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Salesforce Marketing Cloud connector.
The connection settings are specified under connectionProperties , as shown in the following example.
{
"name": "SalesforceMarketingCloudLinkedService",
"properties": {
"type": "SalesforceMarketingCloud",
"typeProperties": {
"connectionProperties": {
"host": "www.exacttargetapis.com",
"authenticationType": "OAuth_2.0",
"clientId": "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints": true,
"useHostVerification": true,
"usePeerVerification": true
}
}
}
}
If you were using the Salesforce Marketing Cloud linked service with the following payload, it is still supported
as-is, but we suggest that you use the new one going forward, which adds enhanced package support.
{
"name": "SalesforceMarketingCloudLinkedService",
"properties": {
"type": "SalesforceMarketingCloud",
"typeProperties": {
"clientId": "<clientId>",
"clientSecret": {
"type": "SecureString",
"value": "<clientSecret>"
},
"useEncryptedEndpoints": true,
"useHostVerification": true,
"usePeerVerification": true
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Salesforce Marketing Cloud dataset.
To copy data from Salesforce Marketing Cloud, set the type property of the dataset to
SalesforceMarketingCloudObject . The following properties are supported:
Example
{
"name": "SalesforceMarketingCloudDataset",
"properties": {
"type": "SalesforceMarketingCloudObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<SalesforceMarketingCloud linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data, for example: "SELECT * FROM MyTable" . Required: No (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromSalesforceMarketingCloud",
"type": "Copy",
"inputs": [
{
"referenceName": "<SalesforceMarketingCloud input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SalesforceMarketingCloudSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Business Warehouse via Open
Hub using Azure Data Factory
5/11/2021 • 10 minutes to read
TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory
whitepaper, with a detailed introduction, comparison, and guidance for each SAP connector.
Supported capabilities
This SAP Business Warehouse via Open Hub connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP Business Warehouse via Open Hub to any supported sink data store. For a list of
data stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Business Warehouse Open Hub connector supports:
SAP Business Warehouse version 7.01 or higher (in a recent SAP Support Package Stack released
after the year 2015). SAP BW/4HANA is not supported by this connector.
Copying data via Open Hub Destination local table, which underneath can be DSO, InfoCube, MultiProvider,
DataSource, etc.
Copying data using basic authentication.
Connecting to an SAP application server or SAP message server.
Retrieving data via RFC.
In the first step, a DTP is executed. Each execution creates a new SAP request ID. The request ID is stored in the
Open Hub table and is then used by the ADF connector to identify the delta. The two steps run asynchronously:
the DTP is triggered by SAP, and the ADF data copy is triggered through ADF.
By default, ADF does not read the latest delta from the Open Hub table (the "exclude last request" option is true).
Hence the data in ADF is not 100% up to date with the data in the Open Hub table (the last delta is missing). In
return, this procedure ensures that no rows are lost because of the asynchronous extraction. It works fine even
when ADF reads the Open Hub table while the DTP is still writing into the same table.
You typically store the maximum request ID copied in the last ADF run in a staging data store (such as Azure Blob
storage). Therefore, the same request is not read a second time by ADF in the subsequent run.
Meanwhile, note that the data is not automatically deleted from the Open Hub table.
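As a sketch of that pattern, the copy activity source can pass the stored high watermark as baseRequestId so that only newer request IDs are read. How you feed the value in (for example, from a Lookup activity or a pipeline parameter) depends on your pipeline design, and the placeholder below is illustrative:
"source": {
    "type": "SapOpenHubSource",
    "excludeLastRequest": true,
    "baseRequestId": "<max request ID copied in the previous run>"
}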
For proper delta handling, it is not allowed to have request IDs from different DTPs in the same Open Hub table.
Therefore, you must not create more than one DTP for each Open Hub Destination (OHD). If you need both full
and delta extraction from the same InfoProvider, create two OHDs for that InfoProvider.
Prerequisites
To use this SAP Business Warehouse Open Hub connector, you need to:
Set up a Self-hosted Integration Runtime with version 3.13 or above. See Self-hosted Integration Runtime
article for details.
Download the 64-bit SAP .NET Connector 3.0 from SAP's website, and install it on the Self-hosted IR
machine. When installing, in the optional setup steps window, make sure you select the Install
Assemblies to GAC option as shown in the following image.
The SAP user used in the Data Factory BW connector needs to have the following permissions:
Authorization for RFC and SAP BW.
Permissions to the "Execute" activity of the authorization object S_SDSAUTH.
Create an SAP Open Hub Destination of type Database Table with the "Technical Key" option checked. It is
also recommended to leave Deleting Data from Table unchecked, although it is not required. Use the
DTP (directly execute or integrate into an existing process chain) to land data from the source object (such as
a cube) you have chosen into the Open Hub Destination table.
Getting started
TIP
For a walkthrough of using SAP BW Open Hub connector, see Load data from SAP Business Warehouse (BW) by using
Azure Data Factory.
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Business Warehouse Open Hub connector.
language: Language that the SAP system uses. Required: No (the default value is EN).
Example:
{
"name": "SapBwOpenHubLinkedService",
"properties": {
"type": "SapOpenHub",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the SAP BW Open Hub dataset.
To copy data from and to SAP BW Open Hub, set the type property of the dataset to SapOpenHubTable . The
following properties are supported.
If you were setting excludeLastRequest and baseRequestId in the dataset, they are still supported as-is, but we
suggest that you use the new model in the activity source going forward.
Example:
{
"name": "SAPBWOpenHubDataset",
"properties": {
"type": "SapOpenHubTable",
"typeProperties": {
"openHubDestinationName": "<open hub destination name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP BW Open Hub linked service name>",
"type": "LinkedServiceReference"
}
}
}
TIP
If your Open Hub table only contains data generated by a single request ID (for example, you always do a full load and
overwrite the existing data in the table, or you only run the DTP once for a test), remember to uncheck the
"excludeLastRequest" option in order to copy the data out.
To speed up the data loading, you can set parallelCopies on the copy activity to load data from SAP BW Open
Hub in parallel. For example, if you set parallelCopies to four, Data Factory concurrently executes four RFC
calls, and each RFC call retrieves a portion of data from your SAP BW Open Hub table, partitioned by DTP
request ID and package ID. This applies when the number of unique DTP request ID + package ID combinations is
bigger than the value of parallelCopies . When copying data into a file-based data store, it's also recommended
to write to a folder as multiple files (only specify the folder name), in which case the performance is better than
writing to a single file.
Example:
"activities":[
{
"name": "CopyFromSAPBWOpenHub",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP BW Open Hub input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapOpenHubSource",
"excludeLastRequest": true
},
"sink": {
"type": "<sink type>"
},
"parallelCopies": 4
}
}
]
SAP BW OPEN HUB DATA TYPE    DATA FACTORY INTERIM DATA TYPE
C (String) String
I (Integer) Int32
F (Float) Double
D (Date) String
T (Time) String
N (Numc) String
Troubleshooting tips
Symptoms: If you are running SAP BW on HANA and observe that only a subset of the data is copied over by the ADF
copy activity (1 million rows), the possible cause is that the "SAP HANA Execution" option is enabled in your DTP, in
which case ADF can only retrieve the first batch of data.
Resolution: Disable the "SAP HANA Execution" option in the DTP, reprocess the data, and then try executing the copy
activity again.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Business Warehouse by using
Azure Data Factory
7/7/2021 • 10 minutes to read
TIP
For general information about copying data from SAP BW, including SAP BW Open Hub integration and delta extraction
flow, see Copy data from SAP Business Warehouse via Open Hub by using Azure Data Factory.
Prerequisites
Azure Data Factor y : If you don't have one, follow the steps to create a data factory.
SAP BW Open Hub Destination (OHD) with destination type "Database Table" : To create an OHD
or to check that your OHD is configured correctly for Data Factory integration, see the SAP BW Open Hub
Destination configurations section of this article.
The SAP BW user needs the following permissions :
Authorization for Remote Function Calls (RFC) and SAP BW.
Permissions to the “Execute” activity of the S_SDSAUTH authorization object.
A self-hosted integration runtime (IR) with SAP .NET connector 3.0 . Follow these setup steps:
1. Install and register the self-hosted integration runtime, version 3.13 or later. (This process is
described later in this article.)
2. Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the
same computer as the self-hosted IR. During installation, make sure that you select Install
Assemblies to GAC in the Optional setup steps dialog box, as the following image shows:
Do a full copy from SAP BW Open Hub
In the Azure portal, go to your data factory. Select Open on the Open Azure Data Factor y Studio tile to open
the Data Factory UI in a separate tab.
1. On the home page, select Ingest to open the Copy Data tool.
2. On the Properties page, specify a Task name , and then select Next .
3. On the Source data store page, select +Create new connection . Select SAP BW Open Hub from
the connector gallery, and then select Continue . To filter the connectors, you can type SAP in the search
box.
4. On the Specify SAP BW Open Hub connection page, follow these steps to create a new connection.
a. From the Connect via integration runtime list, select an existing self-hosted IR. Or, choose to
create one if you don't have one yet.
To create a new self-hosted IR, select +New , and then select Self-hosted . Enter a Name , and then
select Next . Select Express setup to install on the current computer, or follow the Manual setup
steps that are provided.
As mentioned in Prerequisites, make sure that you have SAP Connector for Microsoft .NET 3.0
installed on the same computer where the self-hosted IR is running.
b. Fill in the SAP BW Server name , System number , Client ID, Language (if other than EN ),
User name , and Password .
c. Select Test connection to validate the settings, and then select Finish .
d. A new connection is created. Select Next .
5. On the Select Open Hub Destinations page, browse the Open Hub Destinations that are available in
your SAP BW. Select the OHD to copy data from, and then select Next .
6. Specify a filter, if you need one. If your OHD only contains data from a single data-transfer process (DTP)
execution with a single request ID, or you're sure that your DTP is finished and you want to copy the data,
clear the Exclude Last Request check box.
Learn more about these settings in the SAP BW Open Hub Destination configurations section of this
article. Select Validate to double-check what data will be returned. Then select Next .
7. On the Destination data store page, select +Create new connection > Azure Data Lake Storage
Gen2 > Continue .
8. On the Specify Azure Data Lake Storage connection page, follow these steps to create a connection.
a. Select your Data Lake Storage Gen2-capable account from the Name drop-down list.
b. Select Finish to create the connection. Then select Next .
9. On the Choose the output file or folder page, enter copyfromopenhub as the output folder name.
Then select Next .
10. On the File format setting page, select Next to use the default settings.
11. On the Settings page, expand Performance settings . Enter a value for Degree of copy parallelism
such as 5 to load from SAP BW in parallel. Then select Next .
12. On the Summary page, review the settings. Then select Next .
13. On the Deployment page, select Monitor to monitor the pipeline.
14. Notice that the Monitor tab on the left side of the page is automatically selected. The Actions column
includes links to view activity-run details and to rerun the pipeline.
15. To view activity runs that are associated with the pipeline run, select View Activity Runs in the Actions
column. There's only one activity (copy activity) in the pipeline, so you see only one entry. To switch back
to the pipeline-runs view, select the Pipelines link at the top. Select Refresh to refresh the list.
16. To monitor the execution details for each copy activity, select the Details link, which is an eyeglasses icon
below Actions in the activity-monitoring view. Available details include the data volume copied from the
source to the sink, data throughput, execution steps and duration, and configurations used.
17. To view the maximum Request ID , go back to the activity-monitoring view and select Output under
Actions .
On the data factory home page, select Pipeline templates in the Discover more section to use the built-in
template.
1. Search for SAP BW to find and select the Incremental copy from SAP BW to Azure Data Lake
Storage Gen2 template. This template copies data into Azure Data Lake Storage Gen2. You can use a
similar workflow to copy to other sink types.
2. On the template's main page, select or create the following three connections, and then select Use this
template in the lower-right corner of the window.
Azure Blob storage : In this walkthrough, we use Azure Blob storage to store the high watermark,
which is the max copied request ID.
SAP BW Open Hub : This is the source to copy data from. Refer to the previous full-copy
walkthrough for detailed configuration.
Azure Data Lake Storage Gen2 : This is the sink to copy data to. Refer to the previous full-copy
walkthrough for detailed configuration.
3. This template generates a pipeline with the following three activities and chains them on success:
Lookup, Copy Data, and Web.
Go to the pipeline Parameters tab. You see all the configurations that you need to provide.
SAPOpenHubDestinationName : Specify the Open Hub table name to copy data from.
Data_Destination_Container : Specify the destination Azure Data Lake Storage Gen2 container
to copy data to. If the container doesn't exist, the Data Factory copy activity creates one during
execution.
Data_Destination_Directory : Specify the folder path under the Azure Data Lake Storage Gen2
container to copy data to. If the path doesn't exist, the Data Factory copy activity creates a path
during execution.
HighWatermarkBlobContainer : Specify the container to store the high-watermark value.
HighWatermarkBlobDirectory : Specify the folder path under the container to store the high-
watermark value.
HighWatermarkBlobName : Specify the blob name to store the high watermark value, such as
requestIdCache.txt . In Blob storage, go to the corresponding path of
HighWatermarkBlobContainer+HighWatermarkBlobDirectory+HighWatermarkBlobName, such as
container/path/requestIdCache.txt. Create a blob with content 0.
LogicAppURL : In this template, we use WebActivity to call Azure Logic Apps to set the high-
watermark value in Blob storage (see the request body sketch after these steps). Or, you can use Azure
SQL Database to store it. Use a stored procedure activity to update the value.
You must first create a logic app, as the following image shows. Then, paste in the HTTP POST
URL .
a. Go to the Azure portal. Select a new Logic Apps service. Select +Blank Logic App to go
to Logic Apps Designer .
b. Create a trigger of When an HTTP request is received . Specify the HTTP request body as
follows:
{
"properties": {
"sapOpenHubMaxRequestId": {
"type": "string"
}
},
"type": "object"
}
c. Add a Create blob action. For Folder path and Blob name , use the same values that you
configured previously in HighWatermarkBlobContainer+HighWatermarkBlobDirectory and
HighWatermarkBlobName.
d. Select Save . Then, copy the value of HTTP POST URL to use in the Data Factory pipeline.
4. After you provide the Data Factory pipeline parameters, select Debug > Finish to invoke a run to
validate the configuration. Or, select Publish to publish all the changes, and then select Add trigger to
execute a run.
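For reference, the body that the Web activity posts to the logic app only needs to match the trigger schema shown above. A minimal sketch follows; how you obtain the max request ID from the Copy activity output depends on the template, so the value is left as a placeholder:
{
    "sapOpenHubMaxRequestId": "<max request ID taken from the Copy activity output>"
}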
You might increase the number of parallel running SAP work processes for the DTP:
For a full load OHD, choose different options than for delta extraction:
In OHD: Set the Extraction option to Delete Data and Insert Records . Otherwise, data will be
extracted many times when you repeat the DTP in a BW process chain.
In the DTP: Set Extraction Mode to Full . You must change the automatically created DTP from Delta to
Full immediately after the OHD is created, as this image shows:
In the BW Open Hub connector of Data Factory: Turn off Exclude last request . Otherwise, nothing will
be extracted.
You typically run the full DTP manually. Or, you can create a process chain for the full DTP. It's typically a separate
chain that's independent of your existing process chains. In either case, make sure that the DTP is finished before
you start the extraction by using Data Factory copy. Otherwise, only partial data will be copied.
Run delta extraction the first time
The first delta extraction is technically a full extraction. By default, the SAP BW Open Hub connector excludes the
last request when it copies data. For the first delta extraction, no data is extracted by the Data Factory copy
activity until a subsequent DTP generates delta data in the table with a separate request ID. There are two ways
to avoid this scenario:
Turn off the Exclude last request option for the first delta extraction. Make sure that the first delta DTP is
finished before you start the delta extraction the first time.
Use the procedure for resyncing the delta extraction, as described in the next section.
Resync delta extraction
The following scenarios change the data in SAP BW cubes but are not considered by the delta DTP:
SAP BW selective deletion (of rows by using any filter condition)
SAP BW request deletion (of faulty requests)
An SAP Open Hub Destination isn't a data-mart-controlled data target (in all SAP BW support packages since
2015). So, you can delete data from a cube without changing the data in the OHD. You must then resync the data
of the cube with Data Factory:
1. Run a full extraction in Data Factory (by using a full DTP in SAP).
2. Delete all rows in the Open Hub table for the delta DTP.
3. Set the status of the delta DTP to Fetched .
After this, all subsequent delta DTPs and Data Factory delta extractions work as expected.
To set the status of the delta DTP to Fetched , you can use the following option to run the delta DTP manually:
No Data Transfer; Delta Status in Source: Fetched
Next steps
Learn about SAP BW Open Hub connector support:
SAP Business Warehouse Open Hub connector
Copy data from SAP Business Warehouse using
Azure Data Factory
5/11/2021 • 4 minutes to read
TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory
whitepaper, with a detailed introduction, comparison, and guidance for each SAP connector.
Supported capabilities
This SAP Business Warehouse connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP Business Warehouse to any supported sink data store. For a list of data stores that
are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP Business Warehouse connector supports:
SAP Business Warehouse version 7.x .
Copying data from InfoCubes and QueryCubes (including BEx queries) using MDX queries.
Copying data using basic authentication.
Prerequisites
To use this SAP Business Warehouse connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the SAP NetWeaver library on the Integration Runtime machine. You can get the SAP NetWeaver
library from your SAP administrator, or directly from the SAP Software Download Center. Search for the SAP
Note #1025361 to get the download location for the most recent version. Make sure that you pick the 64-
bit SAP NetWeaver library which matches your Integration Runtime installation. Then install all files included
in the SAP NetWeaver RFC SDK according to the SAP Note. The SAP NetWeaver library is also included in the
SAP Client Tools installation.
TIP
To troubleshoot connectivity issue to SAP BW, make sure:
All dependency libraries extracted from the NetWeaver RFC SDK are in place in the %windir%\system32 folder. Usually
it has icudt34.dll, icuin34.dll, icuuc34.dll, libicudecnumber.dll, librfc32.dll, libsapucum.dll, sapcrypto.dll, sapcryto_old.dll,
sapnwrfc.dll.
The needed ports used to connect to SAP Server are enabled on the Self-hosted IR machine, which usually are port
3300 and 3201.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Business Warehouse connector.
Example:
{
"name": "SapBwLinkedService",
"properties": {
"type": "SapBw",
"typeProperties": {
"server": "<server name>",
"systemNumber": "<system number>",
"clientId": "<client id>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP BW dataset.
To copy data from SAP BW, set the type property of the dataset to SapBwCube . There are no type-specific
properties supported for the SAP BW dataset.
Example:
{
"name": "SAPBWDataset",
"properties": {
"type": "SapBwCube",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP BW linked service name>",
"type": "LinkedServiceReference"
}
}
}
If you were using the RelationalTable typed dataset, it is still supported as-is, but we suggest that you use the
new one going forward.
Example:
"activities":[
{
"name": "CopyFromSAPBW",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP BW input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapBwSource",
"query": "<MDX query for SAP BW>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
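To make the query placeholder above concrete, here is a hypothetical MDX query against an InfoCube, wrapped in the same source configuration. The cube and dimension technical names ($0D_DECU, 0MATERIAL) are placeholders to replace with your own BW objects, and the MDX itself is only a sketch:
"source": {
    "type": "SapBwSource",
    "query": "SELECT NON EMPTY [Measures].MEMBERS ON COLUMNS, NON EMPTY [0MATERIAL].[LEVEL01].MEMBERS ON ROWS FROM [$0D_DECU]"
}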
If you were using the RelationalSource typed source, it is still supported as-is, but we suggest that you use the
new one going forward.
SAP BW DATA TYPE    DATA FACTORY INTERIM DATA TYPE
ACCP Int
CHAR String
CLNT String
CURR Decimal
CUKY String
DEC Decimal
FLTP Double
INT1 Byte
INT2 Int16
INT4 Int
LANG String
LCHR String
LRAW Byte[]
PREC Int16
QUAN Decimal
RAW Byte[]
RAWSTRING Byte[]
STRING String
UNIT String
DATS String
NUMC String
TIMS String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP Cloud for Customer (C4C)
using Azure Data Factory
5/11/2021 • 4 minutes to read
TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory
whitepaper, with a detailed introduction, comparison, and guidance for each SAP connector.
Supported capabilities
This SAP Cloud for Customer connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP Cloud for Customer to any supported sink data store, or copy data from any
supported source data store to SAP Cloud for Customer. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this connector enables Azure Data Factory to copy data from/to SAP Cloud for Customer including
the SAP Cloud for Sales, SAP Cloud for Service, and SAP Cloud for Social Engagement solutions.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP Cloud for Customer connector.
Example:
{
"name": "SAPC4CLinkedService",
"properties": {
"type": "SapCloudForCustomer",
"typeProperties": {
"url": "https://<tenantname>.crm.ondemand.com/sap/c4c/odata/v1/c4codata/" ,
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP Cloud for Customer dataset.
To copy data from SAP Cloud for Customer, set the type property of the dataset to
SapCloudForCustomerResource . The following properties are supported:
Example:
{
"name": "SAPC4CDataset",
"properties": {
"type": "SapCloudForCustomerResource",
"typeProperties": {
"path": "<path e.g. LeadCollection>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP C4C linked service>",
"type": "LinkedServiceReference"
}
}
}
Example:
"activities":[
{
"name": "CopyFromSAPC4C",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP C4C input dataset>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapCloudForCustomerSource",
"query": "<custom query e.g. $top=10>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
writeBatchSize: The batch size of the write operation. The batch size that gives the best performance may be different for different tables or servers. Required: No. Default is 10.
Example:
"activities":[
{
"name": "CopyToSapC4c",
"type": "Copy",
"inputs": [{
"type": "DatasetReference",
"referenceName": "<dataset type>"
}],
"outputs": [{
"type": "DatasetReference",
"referenceName": "SapC4cDataset"
}],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SapCloudForCustomerSink",
"writeBehavior": "Insert",
"writeBatchSize": 30
},
"parallelCopies": 10,
"dataIntegrationUnits": 4,
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": {
"referenceName": "ErrorLogBlobLinkedService",
"type": "LinkedServiceReference"
},
"path": "incompatiblerows"
}
}
}
]
SAP C4C ODATA DATA TYPE    DATA FACTORY INTERIM DATA TYPE
Edm.Binary Byte[]
Edm.Boolean Bool
Edm.Byte Byte[]
Edm.DateTime DateTime
Edm.Decimal Decimal
Edm.Double Double
Edm.Single Single
Edm.Guid Guid
Edm.Int16 Int16
Edm.Int32 Int32
Edm.Int64 Int64
Edm.SByte Int16
Edm.String String
Edm.Time TimeSpan
Edm.DateTimeOffset DateTimeOffset
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from SAP ECC by using Azure Data
Factory
5/11/2021 • 5 minutes to read
TIP
To learn about ADF's overall support for the SAP data integration scenario, see the SAP data integration using Azure Data Factory
whitepaper, with a detailed introduction, comparison, and guidance for each SAP connector.
Supported capabilities
This SAP ECC connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP ECC to any supported sink data store. For a list of data stores that are supported as
sources or sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP ECC connector supports:
Copying data from SAP ECC on SAP NetWeaver version 7.0 and later.
Copying data from any objects exposed by SAP ECC OData services, such as:
SAP tables or views.
Business Application Programming Interface [BAPI] objects.
Data extractors.
Data or intermediate documents (IDOCs) sent to SAP Process Integration (PI) that can be received as
OData via relative adapters.
Copying data by using basic authentication.
The version 7.0 or later refers to the SAP NetWeaver version instead of the SAP ECC version. For example, SAP ECC 6.0
EHP 7 in general has NetWeaver version >= 7.4. If you are unsure about your environment, here are the
steps to confirm the version from your SAP system:
1. Use SAP GUI to connect to the SAP System.
2. Go to System -> Status .
3. Check the release of SAP_BASIS, and ensure that it is equal to or larger than 701.
TIP
To copy data from SAP ECC via an SAP table or view, use the SAP table connector, which is faster and more scalable.
Prerequisites
To use this SAP ECC connector, you need to expose the SAP ECC entities via OData services through SAP
Gateway. More specifically:
Set up SAP Gateway . For servers with SAP NetWeaver versions later than 7.4, SAP Gateway is already
installed. For earlier versions, you must install the embedded SAP Gateway or the SAP Gateway hub
system before exposing SAP ECC data through OData services. To set up SAP Gateway, see the installation
guide.
Activate and configure the SAP OData service . You can activate the OData service through TCODE
SICF in seconds. You can also configure which objects need to be exposed. For more information, see the
step-by-step guidance.
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define the Data Factory entities specific
to the SAP ECC connector.
Example
{
"name": "SapECCLinkedService",
"properties": {
"type": "SapEcc",
"typeProperties": {
"url": "<SAP ECC OData URL, e.g.,
http://eccsvrname:8000/sap/opu/odata/sap/zgw100_dd02l_so_srv/>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
Dataset properties
For a full list of the sections and properties available for defining datasets, see Datasets. The following section
provides a list of the properties supported by the SAP ECC dataset.
To copy data from SAP ECC, set the type property of the dataset to SapEccResource .
The following properties are supported:
Example
{
"name": "SapEccDataset",
"properties": {
"type": "SapEccResource",
"typeProperties": {
"path": "<entity path, e.g., dd04tentitySet>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP ECC linked service name>",
"type": "LinkedServiceReference"
}
}
}
"$select=Name,Description&$top=10"
Example
"activities":[
{
"name": "CopyFromSAPECC",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP ECC input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapEccSource",
"query": "$top=10"
},
"sink": {
"type": "<sink type>"
}
}
}
]
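As a variation of the preceding example, the source query can carry OData query options such as $select and $top. A minimal source sketch (the selected column names are illustrative):
"source": {
    "type": "SapEccSource",
    "query": "$select=Name,Description&$top=10"
}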
When copying data from SAP ECC, the following mappings are used from OData data types to Azure Data Factory interim data types:
ODATA DATA TYPE              DATA FACTORY INTERIM DATA TYPE
Edm.Binary                   String
Edm.Boolean                  Bool
Edm.Byte                     String
Edm.DateTime                 DateTime
Edm.Decimal                  Decimal
Edm.Double                   Double
Edm.Single                   Single
Edm.Guid                     String
Edm.Int16                    Int16
Edm.Int32                    Int32
Edm.Int64                    Int64
Edm.SByte                    Int16
Edm.String                   String
Edm.Time                     TimeSpan
Edm.DateTimeOffset           DateTimeOffset
NOTE
Complex data types aren't currently supported.
Next steps
For a list of the data stores supported as sources and sinks by the copy activity in Azure Data Factory, see
Supported data stores.
Copy data from SAP HANA using Azure Data
Factory
TIP
To learn about ADF's overall support for SAP data integration scenarios, see the SAP data integration using Azure Data Factory
whitepaper, which provides a detailed introduction to each SAP connector along with comparisons and guidance.
Supported capabilities
This SAP HANA connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SAP HANA database to any supported sink data store. For a list of data stores supported
as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP HANA connector supports:
Copying data from any version of SAP HANA database.
Copying data from HANA information models (such as Analytic and Calculation views) and from row and column tables.
Copying data by using Basic or Windows authentication.
Parallel copying from an SAP HANA source. See the Parallel copy from SAP HANA section for details.
TIP
To copy data into an SAP HANA data store, use the generic ODBC connector. See the SAP HANA sink section for details. Note that the
linked services for the SAP HANA connector and the ODBC connector have different types and therefore can't be reused.
Prerequisites
To use this SAP HANA connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the SAP HANA ODBC driver on the Integration Runtime machine. You can download the SAP HANA
ODBC driver from the SAP Software Download Center. Search with the keyword SAP HANA CLIENT for
Windows .
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SAP HANA connector.
Example:
{
"name": "SapHanaLinkedService",
"properties": {
"type": "SapHana",
"typeProperties": {
"connectionString": "SERVERNODE=<server>:<port (optional)>;",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
If you were using an SAP HANA linked service with the following payload, it is still supported as is, but we
recommend that you use the new one going forward.
Example:
{
"name": "SapHanaLinkedService",
"properties": {
"type": "SapHana",
"typeProperties": {
"server": "<server>:<port (optional)>",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by SAP HANA dataset.
To copy data from SAP HANA, the following properties are supported:
schema: Name of the schema in the SAP HANA database. Required: No (if "query" in the activity source is specified).
table: Name of the table in the SAP HANA database. Required: No (if "query" in the activity source is specified).
Example:
{
"name": "SAPHANADataset",
"properties": {
"type": "SapHanaTable",
"typeProperties": {
"schema": "<schema name>",
"table": "<table name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP HANA linked service name>",
"type": "LinkedServiceReference"
}
}
}
If you were using the RelationalTable typed dataset, it is still supported as is, but we recommend that you use the
new one going forward.
TIP
To ingest data from SAP HANA efficiently by using data partitioning, see the Parallel copy from SAP HANA section.
To copy data from SAP HANA, the following properties are supported in the copy activity source section:
partitionColumnName: Specify the name of the source column that will be used by partitioning for parallel copy. If not specified, the index or the primary key of the table is auto-detected and used as the partition column. Applies when the partition option is SapHanaDynamicRange. If you use a query to retrieve the source data, hook ?AdfHanaDynamicRangePartitionCondition into the WHERE clause. See the example in the Parallel copy from SAP HANA section. Required: Yes when using the SapHanaDynamicRange partition option.
Example:
"activities":[
{
"name": "CopyFromSAPHANA",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP HANA input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapHanaSource",
"query": "<SQL query for SAP HANA>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
If you were using the RelationalSource typed copy source, it is still supported as is, but we recommend that you use
the new one going forward.
When you enable partitioned copy, Data Factory runs parallel queries against your SAP HANA source to retrieve
data by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For
example, if you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on
your specified partition option and settings, and each query retrieves a portion of data from your SAP HANA.
We recommend that you enable parallel copy with data partitioning, especially when you ingest a large amount of
data from your SAP HANA. The following are suggested configurations for different scenarios. When copying
data into a file-based data store, it's recommended to write to a folder as multiple files (specify only the folder name),
in which case the performance is better than writing to a single file.
Full load from a large table. Partition option: Physical partitions of table.
Load a large amount of data by using a custom query. Partition option: Dynamic range partition. Query: SELECT * FROM <TABLENAME> WHERE ?AdfHanaDynamicRangePartitionCondition AND <your_additional_where_clause>. Partition column: Specify the column used to apply the dynamic range partition.
"source": {
"type": "SapHanaSource",
"partitionOption": "PhysicalPartitionsOfTable"
}
"source": {
"type": "SapHanaSource",
"query":"SELECT * FROM <TABLENAME> WHERE ?AdfHanaDynamicRangePartitionCondition AND
<your_additional_where_clause>",
"partitionOption": "SapHanaDynamicRange",
"partitionSettings": {
"partitionColumnName": "<Partition_column_name>"
}
}
Data type mapping for SAP HANA
When copying data from SAP HANA, the following mappings are used from SAP HANA data types to Azure
Data Factory interim data types. See Schema and data type mappings to learn about how copy activity maps the
source schema and data type to the sink.
SAP HANA DATA TYPE           DATA FACTORY INTERIM DATA TYPE
ALPHANUM                     String
BIGINT                       Int64
BINARY                       Byte[]
BINTEXT                      String
BLOB                         Byte[]
BOOL                         Byte
CLOB                         String
DATE                         DateTime
DECIMAL                      Decimal
DOUBLE                       Double
FLOAT                        Double
INTEGER                      Int32
NCLOB                        String
NVARCHAR                     String
REAL                         Single
SECONDDATE                   DateTime
SHORTTEXT                    String
SMALLDECIMAL                 Decimal
SMALLINT                     Int16
STGEOMETRYTYPE               Byte[]
STPOINTTYPE                  Byte[]
TEXT                         String
TIME                         TimeSpan
TIMESTAMP                    DateTime
TINYINT                      Byte
VARBINARY                    Byte[]
VARCHAR                      String
SAP HANA as sink
As noted in the tip earlier, to write data into SAP HANA, use the generic ODBC connector with the SAP HANA ODBC driver. The following example shows an ODBC linked service that connects to SAP HANA:
{
"name": "SAPHANAViaODBCLinkedService",
"properties": {
"type": "Odbc",
"typeProperties": {
"connectionString": "Driver={HDBODBC};servernode=<HANA server>.clouddatahub-int.net:30015",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from an SAP table by using Azure Data
Factory
TIP
To learn about ADF's overall support for SAP data integration scenarios, see the SAP data integration using Azure Data Factory
whitepaper, which provides a detailed introduction to each SAP connector along with comparisons and guidance.
Supported capabilities
This SAP table connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from an SAP table to any supported sink data store. For a list of the data stores that are
supported as sources or sinks by the copy activity, see the Supported data stores table.
Specifically, this SAP table connector supports:
Copying data from an SAP table in:
SAP ERP Central Component (SAP ECC) version 7.01 or later (in a recent SAP Support Package Stack
released after 2015).
SAP Business Warehouse (SAP BW) version 7.01 or later (in a recent SAP Support Package Stack
released after 2015).
SAP S/4HANA.
Other products in SAP Business Suite version 7.01 or later (in a recent SAP Support Package Stack
released after 2015).
Copying data from an SAP transparent table, a pooled table, a clustered table, or a view.
Copying data by using basic authentication or Secure Network Communications (SNC), if SNC is
configured.
Connecting to an SAP application server or SAP message server.
Retrieving data via default or custom RFC.
The version 7.01 or later refers to the SAP NetWeaver version, not the SAP ECC version. For example, SAP ECC 6.0
EHP 7 in general has NetWeaver version 7.4 or later. If you are unsure about your environment, use the following
steps to confirm the version from your SAP system:
1. Use SAP GUI to connect to the SAP System.
2. Go to System -> Status .
3. Check the release of SAP_BASIS and ensure that it is 701 or later.
Prerequisites
To use this SAP table connector, you need to:
Set up a self-hosted integration runtime (version 3.17 or later). For more information, see Create and
configure a self-hosted integration runtime.
Download the 64-bit SAP Connector for Microsoft .NET 3.0 from SAP's website, and install it on the self-
hosted integration runtime machine. During installation, make sure you select the Install Assemblies to
GAC option in the Optional setup steps window.
The SAP user who's being used in the Data Factory SAP table connector must have the following
permissions:
Authorization for using Remote Function Call (RFC) destinations.
Permissions for the Execute activity of the S_SDSAUTH authorization object. You can refer to SAP Note
460089 for the majority of the authorization objects. Certain RFCs, such as RFC_FUNCTION_SEARCH, are also
required by the underlying NCo connector.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define the Data Factory entities specific
to the SAP table connector.
Example: Connect to an SAP message server
{
"name": "SapTableLinkedService",
"properties": {
"type": "SapTable",
"typeProperties": {
"messageServer": "<message server name>",
"messageServerService": "<service name or port>",
"systemId": "<system ID>",
"logonGroup": "<logon group>",
"clientId": "<client ID>",
"userName": "<SAP user>",
"password": {
"type": "SecureString",
"value": "<Password for SAP user>"
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
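The preceding example connects through an SAP message server. To connect directly to an SAP application server instead, the linked service typically specifies the server and system number rather than the message-server properties. The following is a sketch; the property names are based on this SAP table connector, and all values are placeholders:
{
    "name": "SapTableLinkedService",
    "properties": {
        "type": "SapTable",
        "typeProperties": {
            "server": "<application server name>",
            "systemNumber": "<system number, e.g. 00>",
            "clientId": "<client ID, e.g. 100>",
            "userName": "<SAP user>",
            "password": {
                "type": "SecureString",
                "value": "<Password for SAP user>"
            }
        },
        "connectVia": {
            "referenceName": "<name of integration runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}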
Dataset properties
For a full list of the sections and properties for defining datasets, see Datasets. The following section provides a
list of the properties supported by the SAP table dataset.
To copy data from an SAP table, set the type property of the dataset to SapTableResource. The following properties are supported:
Example
{
"name": "SAPTableDataset",
"properties": {
"type": "SapTableResource",
"typeProperties": {
"tableName": "<SAP table name>"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<SAP table linked service name>",
"type": "LinkedServiceReference"
}
}
}
TIP
If your SAP table has a large volume of data, such as several billion rows, use partitionOption and partitionSetting
to split the data into smaller partitions. In this case, the data is read per partition, and each data partition is retrieved
from your SAP server via a single RFC call.
Taking partitionOption as partitionOnInt as an example, the number of rows in each partition is calculated with this
formula: (total rows falling between partitionUpperBound and partitionLowerBound )/ maxPartitionsNumber .
To load data partitions in parallel to speed up copy, the parallel degree is controlled by the parallelCopies setting on
the copy activity. For example, if you set parallelCopies to four, Data Factory concurrently generates and runs four
queries based on your specified partition option and settings, and each query retrieves a portion of data from your SAP
table. We strongly recommend making maxPartitionsNumber a multiple of the value of the parallelCopies property.
When copying data into a file-based data store, it's also recommended to write to a folder as multiple files (specify only the
folder name), in which case the performance is better than writing to a single file.
TIP
BASXML is enabled by default for this SAP table connector on the Azure Data Factory side.
In rfcTableOptions , you can use the following common SAP query operators to filter the rows:
OPERATOR     DESCRIPTION
EQ           Equal to
NE           Not equal to
LT           Less than
GT           Greater than
Example
"activities":[
{
"name": "CopyFromSAPTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<SAP table input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SapTableSource",
"partitionOption": "PartitionOnInt",
"partitionSettings": {
"partitionColumnName": "<partition column name>",
"partitionUpperBound": "2000",
"partitionLowerBound": "1",
"maxPartitionsNumber": 500
}
},
"sink": {
"type": "<sink type>"
},
"parallelCopies": 4
}
}
]
TIP
You can also consider having the joined data aggregated in a VIEW, which is supported by the SAP table connector. Alternatively, you can
extract the related tables to Azure (for example, to Azure Storage or Azure SQL Database), and then use a data flow
to perform further joins or filters.
The following steps illustrate how the SAP table connector works with a custom function module:
1. Build a connection to the SAP server via SAP NCo.
2. Invoke the custom function module with the parameters set as follows:
QUERY_TABLE: the table name you set in the ADF SAP table dataset;
Delimiter: the delimiter you set in the ADF SAP table source;
ROWCOUNT/Option/Fields: the row count, aggregated options, and fields you set in the ADF SAP table source.
3. Get the result and parse the data in the following ways:
a. Parse the values in the Fields table to get the schema.
b. Get the values of the output table to see which table contains these values.
c. Get the values in the OUT_TABLE, parse the data, and then write it to the sink.
The following mappings are used from SAP table data types to Azure Data Factory interim data types:
SAP TABLE DATA TYPE          DATA FACTORY INTERIM DATA TYPE
C (String)                   String
I (Integer)                  Int32
F (Float)                    Double
D (Date)                     String
T (Time)                     String
N (Numeric)                  String
Next steps
For a list of the data stores supported as sources and sinks by the copy activity in Azure Data Factory, see
Supported data stores.
Copy data from ServiceNow using Azure Data
Factory
Supported capabilities
This ServiceNow connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from ServiceNow to any supported sink data store. For a list of data stores that are supported
as sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
ServiceNow connector.
Example:
{
"name": "ServiceNowLinkedService",
"properties": {
"type": "ServiceNow",
"typeProperties": {
"endpoint" : "http://<instance>.service-now.com",
"authenticationType" : "Basic",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by ServiceNow dataset.
To copy data from ServiceNow, set the type property of the dataset to ServiceNowObject. The following
properties are supported:
Example
{
"name": "ServiceNowDataset",
"properties": {
"type": "ServiceNowObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<ServiceNow linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Actual.alm_asset". Required: No (if "tableName" in the dataset is specified).
Note the following when specifying the schema and column for ServiceNow in the query, and refer to
Performance tips for the copy performance implications:
Schema: specify the schema as Actual or Display in the ServiceNow query. You can think of it as the
sysparm_display_value parameter being set to true or false when calling the ServiceNow RESTful APIs.
Column: the column name for the actual value under the Actual schema is [column name]_value, while for the display
value under the Display schema it is [column name]_display_value. Note that the column name needs to map to the
schema being used in the query.
Sample query: SELECT col_value FROM Actual.alm_asset or SELECT col_display_value FROM Display.alm_asset
Example:
"activities":[
{
"name": "CopyFromServiceNow",
"type": "Copy",
"inputs": [
{
"referenceName": "<ServiceNow input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ServiceNowSource",
"query": "SELECT * FROM Actual.alm_asset"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Performance tips
Schema to use
ServiceNow has two different schemas: "Actual", which returns actual data, and "Display", which
returns the display values of the data.
If you have a filter in your query, use the "Actual" schema, which has better copy performance. When querying
against the "Actual" schema, ServiceNow natively supports the filter when fetching the data and returns only the filtered
result set, whereas when querying the "Display" schema, ADF retrieves all the data and applies the filter internally.
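For example, a copy activity source that filters on the "Actual" schema could look like the following sketch; the table and column names are illustrative, and the _value suffix follows the column naming convention described earlier:
"source": {
    "type": "ServiceNowSource",
    "query": "SELECT sys_id_value, state_value FROM Actual.incident WHERE state_value = '1'"
}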
Index
A ServiceNow table index can help improve query performance. See Create a table index.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from and to the SFTP server by using
Azure Data Factory
Supported capabilities
The SFTP connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
GetMetadata activity
Delete activity
Specifically, the SFTP connector supports:
Copying files from and to the SFTP server by using Basic, SSH public key, or multi-factor authentication.
Copying files as is or by parsing or generating files with the supported file formats and compression codecs.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
SFTP.
hostKeyFingerprint: Specify the fingerprint of the host key. Required: Yes, if "skipHostKeyValidation" is set to false.
Example:
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": false,
"hostKeyFingerPrint": "ssh-rsa 2048 xx:00:00:00:xx:00:x0:0x:0x:0x:0x:00:00:x0:x0:00",
"authenticationType": "Basic",
"userName": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
passPhrase: Specify the pass phrase or password to decrypt the private key if the key file or the key content is protected by a pass phrase. Mark this field as a SecureString to store it securely in your data factory, or reference a secret stored in an Azure key vault. Required: Yes, if the private key file or the key content is protected by a pass phrase.
NOTE
The SFTP connector supports an RSA/DSA OpenSSH key. Make sure that your key file content starts with "-----BEGIN
[RSA/DSA] PRIVATE KEY-----". If the private key file is a PPK-format file, use the PuTTY tool to convert from PPK to
OpenSSH format.
Example 1: SshPublicKey authentication using private key filePath
{
"name": "SftpLinkedService",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": true,
"authenticationType": "SshPublicKey",
"userName": "xxx",
"privateKeyPath": "D:\\privatekey_openssh",
"passPhrase": {
"type": "SecureString",
"value": "<pass phrase>"
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example 2: SshPublicKey authentication using private key content
{
"name": "SftpLinkedService",
"type": "Linkedservices",
"properties": {
"type": "Sftp",
"typeProperties": {
"host": "<sftp server>",
"port": 22,
"skipHostKeyValidation": true,
"authenticationType": "SshPublicKey",
"userName": "<username>",
"privateKeyContent": {
"type": "SecureString",
"value": "<base64 string of the private key content>"
},
"passPhrase": {
"type": "SecureString",
"value": "<pass phrase>"
}
},
"connectVia": {
"referenceName": "<name of integration runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
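As mentioned for the passPhrase property, secrets can also be referenced from Azure Key Vault instead of being stored inline. The following sketch assumes an existing Azure Key Vault linked service and secret name, both of which are placeholders:
{
    "name": "SftpLinkedService",
    "properties": {
        "type": "Sftp",
        "typeProperties": {
            "host": "<sftp server>",
            "port": 22,
            "skipHostKeyValidation": true,
            "authenticationType": "SshPublicKey",
            "userName": "<username>",
            "privateKeyPath": "D:\\privatekey_openssh",
            "passPhrase": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret that holds the pass phrase>"
            }
        },
        "connectVia": {
            "referenceName": "<name of integration runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}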
Dataset properties
For a full list of sections and properties that are available for defining datasets, see the Datasets article.
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
Excel format
JSON format
ORC format
Parquet format
XML format
The following properties are supported for SFTP under location settings in the format-based dataset:
Example:
{
"name": "DelimitedTextDataset",
"properties": {
"type": "DelimitedText",
"linkedServiceName": {
"referenceName": "<SFTP linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, auto retrieved during authoring > ],
"typeProperties": {
"location": {
"type": "SftpLocation",
"folderPath": "root/folder/subfolder"
},
"columnDelimiter": ",",
"quoteChar": "\"",
"firstRowAsHeader": true,
"compressionCodec": "gzip"
}
}
}
Additional settings
Example:
"activities":[
{
"name": "CopyFromSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<Delimited text input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "DelimitedTextSource",
"formatSettings":{
"type": "DelimitedTextReadSettings",
"skipLineCount": 10
},
"storeSettings":{
"type": "SftpReadSettings",
"recursive": true,
"wildcardFolderPath": "myfolder*A",
"wildcardFileName": "*.csv"
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
SFTP as a sink
Azure Data Factory supports the following file formats. Refer to each article for format-based settings.
Avro format
Binary format
Delimited text format
JSON format
ORC format
Parquet format
The following properties are supported for SFTP under storeSettings settings in a format-based Copy sink:
TIP
If you receive the error "UserErrorSftpPathNotFound," "UserErrorSftpPermissionDenied," or "SftpOperationFail" when
you're writing data into SFTP, and the SFTP user you use does have the proper permissions, check whether your
SFTP server supports the file rename operation. If it doesn't, disable the Upload with temp file (useTempFileRename)
option and try again. To learn more about this property, see the preceding table. If you use a self-hosted integration
runtime for the Copy activity, be sure to use version 4.6 or later.
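If you need to disable the temp-file upload described in the tip above, the sink's storeSettings can set useTempFileRename to false. A minimal sketch:
"sink": {
    "type": "BinarySink",
    "storeSettings": {
        "type": "SftpWriteSettings",
        "copyBehavior": "PreserveHierarchy",
        "useTempFileRename": false
    }
}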
Example:
"activities":[
{
"name": "CopyToSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "BinarySink",
"storeSettings":{
"type": "SftpWriteSettings",
"copyBehavior": "PreserveHierarchy"
}
}
}
}
]
Examples showing how the folderPath, fileName, and recursive settings determine which source files are retrieved, along with a sample fileListToCopy.txt configuration, are not reproduced here.
Legacy models
NOTE
The following models are still supported as is for backward compatibility. We recommend that you use the previously
discussed new model, because the Azure Data Factory authoring UI has switched to generating the new model.
format: If you want to copy files as is between file-based stores (binary copy), skip the format section in both the input and output dataset definitions. Required: No (only for the binary copy scenario).
TIP
To copy all files under a folder, specify folderPath only.
To copy a single file with a specified name, specify folderPath with the folder part and fileName with the file name.
To copy a subset of files under a folder, specify folderPath with the folder part and fileName with the wildcard filter.
NOTE
If you were using the fileFilter property for the file filter, it is still supported as is, but we recommend that you use the new
filter capability added to fileName going forward.
Example:
{
"name": "SFTPDataset",
"type": "Datasets",
"properties": {
"type": "FileShare",
"linkedServiceName":{
"referenceName": "<SFTP linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"folderPath": "folder/subfolder/",
"fileName": "*",
"modifiedDatetimeStart": "2018-12-01T05:00:00Z",
"modifiedDatetimeEnd": "2018-12-01T06:00:00Z",
"format": {
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": "\n"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
Example:
"activities":[
{
"name": "CopyFromSFTP",
"type": "Copy",
"inputs": [
{
"referenceName": "<SFTP input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "FileSystemSource",
"recursive": true
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores that are supported as sources and sinks by the Copy activity in Azure Data Factory, see
supported data stores.
Copy data from SharePoint Online List by using
Azure Data Factory
Supported capabilities
This SharePoint Online List connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from SharePoint Online List to any supported sink data store. For a list of data stores that
Copy Activity supports as sources and sinks, see Supported data stores and formats.
Specifically, this SharePoint List Online connector uses service principal authentication and retrieves data via
OData protocol.
TIP
This connector supports copying data from a SharePoint Online list but not files. To learn how to copy files, see the Copy file from
SharePoint Online section.
Prerequisites
The SharePoint List Online connector uses service principal authentication to connect to SharePoint. Follow
these steps to set it up:
1. Register an application entity in Azure Active Directory (Azure AD) by following Register your application
with an Azure AD tenant. Make note of the following values, which you use to define the linked service:
Application ID
Application key
Tenant ID
2. Grant SharePoint Online site permission to your registered application:
NOTE
This operation requires SharePoint Online site owner permission. You can find the owner by going to the site
home page, selecting "X members" in the right corner, and checking who has the "Owner" role.
<AppPermissionRequests AllowAppOnlyPolicy="true">
<AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web" Right="Read"/>
</AppPermissionRequests>
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties you can use to define Data Factory entities that are
specific to SharePoint Online List connector.
Example:
{
"name": "SharePointOnlineList",
"properties": {
"type": "SharePointOnlineList",
"typeProperties": {
"siteUrl": "<site URL>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
},
"tenantId": "<tenant ID>"
}
}
}
Dataset properties
For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.
The following section provides a list of the properties supported by the SharePoint Online List dataset.
Example
{
"name": "SharePointOnlineListDataset",
"properties":
{
"type": "SharePointOnlineListResource",
"linkedServiceName": {
"referenceName": "<SharePoint Online List linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties":
{
"listName":"<name of the list>"
}
}
}
Example
"activities":[
{
"name": "CopyFromSharePointOnlineList",
"type": "Copy",
"inputs": [
{
"referenceName": "<SharePoint Online List input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source":{
"type":"SharePointOnlineListSource",
"query":"<ODataquerye.g.$top=10&$select=Title,Number>"
},
"sink": {
"type": "<sink type>"
}
}
}
]
NOTE
In Azure Data Factory, you can't select more than one choice data type for a SharePoint Online List source.
Calculated (calculation based on other columns)    Edm.String / Edm.Double / Edm.DateTime / Edm.Boolean    String / Double / DateTime / Boolean
1. Follow the Prerequisites section to create AAD application and grant permission to SharePoint Online.
2. Create a Web Activity to get the access token from SharePoint Online (a minimal activity sketch is shown after these steps):
URL : https://accounts.accesscontrol.windows.net/[Tenant-ID]/tokens/OAuth/2 . Replace the tenant ID.
Method : POST
Headers :
Content-Type: application/x-www-form-urlencoded
Body :
grant_type=client_credentials&client_id=[Client-ID]@[Tenant-ID]&client_secret=[Client-
Secret]&resource=00000003-0000-0ff1-ce00-000000000000/[Tenant-Name].sharepoint.com@[Tenant-ID]
. Replace the client ID (application ID), client secret (application key), tenant ID, and tenant name (of the
SharePoint tenant).
Caution
Set the Secure Output option to true in the Web activity to prevent the token value from being logged in
plain text. Any further activities that consume this value should have their Secure Input option set to true.
3. Chain with a Copy activity with HTTP connector as source to copy SharePoint Online file content:
HTTP linked service:
Base URL :
https://[site-url]/_api/web/GetFileByServerRelativeUrl('[relative-path-to-file]')/$value .
Replace the site URL and the relative path to the file. A sample relative path to the file is
/sites/site2/Shared Documents/TestBook.xlsx.
Authentication type: Anonymous (to use the Bearer token configured in copy activity source
later)
Dataset: choose the format you want. To copy file as-is, select "Binary" type.
Copy activity source:
Request method : GET
Additional header: use the following expression:
@{concat('Authorization: Bearer ', activity('<Web-activity-name>').output.access_token)},
which uses the Bearer token generated by the upstream Web activity as the authorization header.
Replace the Web activity name.
Configure the copy activity sink as usual.
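A minimal sketch of the Web activity from step 2 might look like the following; the activity name is illustrative, the placeholder values must be replaced, and Secure Output is enabled as advised above:
{
    "name": "GetSharePointAccessToken",
    "type": "WebActivity",
    "policy": {
        "secureOutput": true
    },
    "typeProperties": {
        "url": "https://accounts.accesscontrol.windows.net/<tenant-id>/tokens/OAuth/2",
        "method": "POST",
        "headers": {
            "Content-Type": "application/x-www-form-urlencoded"
        },
        "body": "grant_type=client_credentials&client_id=<client-id>@<tenant-id>&client_secret=<client-secret>&resource=00000003-0000-0ff1-ce00-000000000000/<tenant-name>.sharepoint.com@<tenant-id>"
    }
}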
NOTE
Even if an Azure AD application has FullControl permissions on SharePoint Online, you can't copy files from document
libraries with IRM enabled.
Next steps
For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see Supported
data stores and formats.
Copy data from Shopify using Azure Data Factory
(Preview)
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Shopify connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Shopify to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Shopify connector.
Example:
{
"name": "ShopifyLinkedService",
"properties": {
"type": "Shopify",
"typeProperties": {
"host" : "mystore.myshopify.com",
"accessToken": {
"type": "SecureString",
"value": "<accessToken>"
}
}
}
}
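If you prefer not to store the access token inline, secret-type properties such as accessToken can generally reference an Azure Key Vault secret instead, following the same pattern used elsewhere in this article. A hedged sketch, assuming an existing Key Vault linked service:
{
    "name": "ShopifyLinkedService",
    "properties": {
        "type": "Shopify",
        "typeProperties": {
            "host": "mystore.myshopify.com",
            "accessToken": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret that holds the access token>"
            }
        }
    }
}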
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Shopify dataset.
To copy data from Shopify, set the type property of the dataset to ShopifyObject . The following properties are
supported:
Example
{
"name": "ShopifyDataset",
"properties": {
"type": "ShopifyObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Shopify linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM "Products" WHERE Product_Id = '123'". Required: No (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromShopify",
"type": "Copy",
"inputs": [
{
"referenceName": "<Shopify input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ShopifySource",
"query": "SELECT * FROM \"Products\" WHERE Product_Id = '123'"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data in Snowflake by using
Azure Data Factory
Supported capabilities
This Snowflake connector is supported for the following activities:
Copy activity with a supported source/sink matrix table
Mapping data flow
Lookup activity
For the Copy activity, this Snowflake connector supports the following functions:
Copy data from Snowflake that utilizes Snowflake's COPY into [location] command to achieve the best
performance.
Copy data to Snowflake that takes advantage of Snowflake's COPY into [table] command to achieve the best
performance. It supports Snowflake on Azure.
If a proxy is required to connect to Snowflake from a self-hosted Integration Runtime, you must configure the
environment variables for HTTP_PROXY and HTTPS_PROXY on the Integration Runtime host.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that define Data Factory entities specific to a Snowflake
connector.
Example:
{
"name": "SnowflakeLinkedService",
"properties": {
"type": "Snowflake",
"typeProperties": {
"connectionString": "jdbc:snowflake://<accountname>.snowflakecomputing.com/?user=
<username>&password=<password>&db=<database>&warehouse=<warehouse>&role=<myRole>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Password in Azure Key Vault:
{
"name": "SnowflakeLinkedService",
"properties": {
"type": "Snowflake",
"typeProperties": {
"connectionString": "jdbc:snowflake://<accountname>.snowflakecomputing.com/?user=<username>&db=
<database>&warehouse=<warehouse>&role=<myRole>",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article.
The following properties are supported for the Snowflake dataset.
schema: Name of the schema. Note that the schema name is case-sensitive in ADF. Required: No for source, yes for sink.
table: Name of the table/view. Note that the table name is case-sensitive in ADF. Required: No for source, yes for sink.
Example:
{
"name": "SnowflakeDataset",
"properties": {
"type": "SnowflakeTable",
"typeProperties": {
"schema": "<Schema name for your Snowflake database>",
"table": "<Table name for your Snowflake database>"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference"
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the Snowflake source and sink.
Snowflake as the source
The Snowflake connector utilizes Snowflake's COPY into [location] command to achieve the best performance.
If the sink data store and format are natively supported by the Snowflake COPY command, you can use the Copy
activity to copy directly from Snowflake to the sink. For details, see Direct copy from Snowflake. Otherwise, use the
built-in staged copy from Snowflake.
To copy data from Snowflake, the following properties are supported in the Copy activity source section.
Under exportSettings :
NOTE
The staging Azure Blob storage linked service must use shared access signature authentication, as required by the
Snowflake COPY command.
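For direct copy from Snowflake, the source can also carry export options under exportSettings, mirroring the importSettings block shown later for the sink. A sketch; the specific copy options shown here are assumptions for illustration:
"source": {
    "type": "SnowflakeSource",
    "sqlReaderQuery": "SELECT * FROM MYTABLE",
    "exportSettings": {
        "type": "SnowflakeExportCopyCommand",
        "additionalCopyOptions": {
            "MAX_FILE_SIZE": "64000000",
            "OVERWRITE": true
        }
    }
}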
Example:
"activities":[
{
"name": "CopyFromSnowflake",
"type": "Copy",
"inputs": [
{
"referenceName": "<Snowflake input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SnowflakeSource",
"sqlReaderQuery": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
},
"path": "mystagingpath"
}
}
}
]
Snowflake as sink
The Snowflake connector utilizes Snowflake's COPY into [table] command to achieve the best performance. It
supports writing data to Snowflake on Azure.
If the source data store and format are natively supported by the Snowflake COPY command, you can use the Copy
activity to copy directly from the source to Snowflake. For details, see Direct copy to Snowflake. Otherwise, use the
built-in staged copy to Snowflake.
To copy data to Snowflake, the following properties are supported in the Copy activity sink section.
Under importSettings :
"activities":[
{
"name": "CopyToSnowflake",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Snowflake output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SnowflakeSink",
"importSettings": {
"type": "SnowflakeImportCopyCommand",
"copyOptions": {
"FORCE": "TRUE",
"ON_ERROR": "SKIP_FILE",
},
"fileFormatOptions": {
"DATE_FORMAT": "YYYY-MM-DD",
}
}
}
}
}
]
Example:
"activities":[
{
"name": "CopyToSnowflake",
"type": "Copy",
"inputs": [
{
"referenceName": "<input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<Snowflake output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>"
},
"sink": {
"type": "SnowflakeSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingBlob",
"type": "LinkedServiceReference"
},
"path": "mystagingpath"
}
}
}
]
Source transformation
The table of Snowflake source properties (name, description, required, allowed values, and data flow script property) is omitted here.
The associated data flow script of a Snowflake source configuration is:
source(allowSchemaDrift: true,
validateSchema: false,
query: 'select * from MYTABLE',
format: 'query') ~> SnowflakeSource
If you use inline dataset, the associated data flow script is:
source(allowSchemaDrift: true,
validateSchema: false,
format: 'query',
query: 'select * from MYTABLE',
store: 'snowflake') ~> SnowflakeSource
Sink transformation
The properties supported by a Snowflake sink can be edited in the Settings tab. When using an inline dataset, you
will see additional settings, which are the same as the properties described in the dataset properties section. The
connector utilizes Snowflake internal data transfer. The table of sink properties (name, description, required,
allowed values, and data flow script property) is omitted here.
If you use inline dataset, the associated data flow script is:
IncomingStream sink(allowSchemaDrift: true,
validateSchema: false,
format: 'table',
tableName: 'table',
schemaName: 'schema',
deletable: true,
insertable: true,
updateable: true,
upsertable: false,
store: 'snowflake',
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> SnowflakeSink
Next steps
For a list of data stores supported as sources and sinks by Copy activity in Data Factory, see supported data
stores and formats.
Copy data from Spark using Azure Data Factory
Supported capabilities
This Spark connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Spark to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install
any driver to use this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Spark connector.
Linked service properties
The following properties are supported for Spark linked service:
Example:
{
"name": "SparkLinkedService",
"properties": {
"type": "Spark",
"typeProperties": {
"host" : "<cluster>.azurehdinsight.net",
"port" : "<port>",
"authenticationType" : "WindowsAzureHDInsightService",
"username" : "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Spark dataset.
To copy data from Spark, set the type property of the dataset to SparkObject . The following properties are
supported:
tableName: Name of the table with schema. This property is supported for backward compatibility. Use schema and table for new workloads. Required: No (if "query" in the activity source is specified).
Example
{
"name": "SparkDataset",
"properties": {
"type": "SparkObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Spark linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in the dataset is specified).
Example:
"activities":[
{
"name": "CopyFromSpark",
"type": "Copy",
"inputs": [
{
"referenceName": "<Spark input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SparkSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy and transform data to and from SQL Server
by using Azure Data Factory
Supported capabilities
This SQL Server connector is supported for the following activities:
Copy activity with supported source/sink matrix
Mapping data flow
Lookup activity
GetMetadata activity
You can copy data from a SQL Server database to any supported sink data store. Or, you can copy data from any
supported source data store to a SQL Server database. For a list of data stores that are supported as sources or
sinks by the copy activity, see the Supported data stores table.
Specifically, this SQL Server connector supports:
SQL Server version 2005 and above.
Copying data by using SQL or Windows authentication.
As a source, retrieving data by using a SQL query or a stored procedure. You can also choose to copy in parallel
from a SQL Server source; see the Parallel copy from SQL database section for details.
As a sink, automatically creating destination table if not exists based on the source schema; appending data
to a table or invoking a stored procedure with custom logic during copy.
SQL Server Express LocalDB is not supported.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Get started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the SQL Server database connector.
NOTE
SQL Server Always Encrypted is not supported in data flow.
TIP
If you hit an error with the error code "UserErrorFailedToConnectToSqlServer" and a message like "The session limit for the
database is XXX and has been reached," add Pooling=false to your connection string and try again.
Example 1: Use SQL authentication
{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>\\<instance name if using named instance>;Initial
Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example 2: Use SQL authentication with a password in Azure Key Vault
{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>\\<instance name if using named instance>;Initial
Catalog=<databasename>;Integrated Security=False;User ID=<username>;",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example 3: Use Windows authentication
{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>\\<instance name if using named instance>;Initial
Catalog=<databasename>;Integrated Security=True;",
"userName": "<domain\\username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Example 4: Use Always Encrypted
{
"name": "SqlServerLinkedService",
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Data Source=<servername>\\<instance name if using named instance>;Initial
Catalog=<databasename>;Integrated Security=False;User ID=<username>;Password=<password>;"
},
"alwaysEncryptedSettings": {
"alwaysEncryptedAkvAuthType": "ServicePrincipal",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"type": "SecureString",
"value": "<service principal key>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by the SQL Server dataset.
To copy data from and to a SQL Server database, the following properties are supported:
tableName: Name of the table/view with schema. This property is supported for backward compatibility. For new workloads, use schema and table. Required: No for source, Yes for sink.
Example
{
"name": "SQLServerDataset",
"properties":
{
"type": "SqlServerTable",
"linkedServiceName": {
"referenceName": "<SQL Server linked service name>",
"type": "LinkedServiceReference"
},
"schema": [ < physical schema, optional, retrievable during authoring > ],
"typeProperties": {
"schema": "<schema_name>",
"table": "<table_name>"
}
}
}
TIP
To load data from SQL Server efficiently by using data partitioning, learn more from Parallel copy from SQL database.
To copy data from SQL Server, set the source type in the copy activity to SqlSource . The following properties
are supported in the copy activity source section:
Under partitionSettings :
"activities":[
{
"name": "CopyFromSQLServer",
"type": "Copy",
"inputs": [
{
"referenceName": "<SQL Server input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SqlSource",
"sqlReaderStoredProcedureName": "CopyTestSrcStoredProcedureWithParameters",
"storedProcedureParameters": {
"stringData": { "value": "str3" },
"identifier": { "value": "$$Text.Format('{0:yyyy}', <datetime parameter>)", "type":
"Int"}
}
},
"sink": {
"type": "<sink type>"
}
}
}
]
TIP
Learn more about the supported write behaviors, configurations, and best practices from Best practice for loading data
into SQL Server.
To copy data to SQL Server, set the sink type in the copy activity to SqlSink . The following properties are
supported in the copy activity sink section:
When you enable partitioned copy, copy activity runs parallel queries against your SQL Server source to load
data by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For
example, if you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on
your specified partition option and settings, and each query retrieves a portion of data from your SQL Server.
We recommend that you enable parallel copy with data partitioning, especially when you load a large amount of data
from your SQL Server. The following are suggested configurations for different scenarios. When copying data
into a file-based data store, it's recommended to write to a folder as multiple files (specify only the folder name), in
which case the performance is better than writing to a single file.
Scenario: Full load from a large table, with physical partitions.
Suggested settings: Partition option: Physical partitions of table.

Scenario: Full load from a large table, without physical partitions, but with an integer or datetime column for data partitioning.
Suggested settings: Partition option: Dynamic range partition. Partition column (optional): Specify the column used to partition data. If not specified, the primary key column is used. Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the table will be partitioned and copied. If not specified, copy activity auto-detects the values, which can take a long time depending on the MIN and MAX values. It is recommended to provide the upper bound and lower bound.

Scenario: Load a large amount of data by using a custom query, without physical partitions, but with an integer or date/datetime column for data partitioning.
Suggested settings: Partition option: Dynamic range partition. Query: SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>. Partition column: Specify the column used to partition data. Partition upper bound and partition lower bound (optional): Specify if you want to determine the partition stride. This is not for filtering the rows in the table; all rows in the query result will be partitioned and copied. If not specified, copy activity auto-detects the values.
"source": {
"type": "SqlSource",
"partitionOption": "PhysicalPartitionsOfTable"
}
If the table has physical partitions, "HasPartition" shows as "yes".
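For the dynamic range option described above, a source configuration can look like the following sketch; the partition column and bounds are placeholders that should match your own table.
"source": {
    "type": "SqlSource",
    "query": "SELECT * FROM <TableName> WHERE ?AdfDynamicRangePartitionCondition AND <your_additional_where_clause>",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "<partition_column_name>",
        "partitionUpperBound": "<upper_value_of_partition_column>",
        "partitionLowerBound": "<lower_value_of_partition_column>"
    }
}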
In your database, define a stored procedure with MERGE logic, like the following example, which is pointed to from the previous stored procedure activity. Assume that the target is the Marketing table with three columns: ProfileID, State, and Category. Do the upsert based on the ProfileID column.
Option 2: You can choose to invoke a stored procedure within the copy activity. This approach runs each batch
(as governed by the writeBatchSize property) in the source table instead of using bulk insert as the default
approach in the copy activity.
Overwrite the entire table
You can configure the preCopyScript property in a copy activity sink. In this case, for each copy activity that
runs, Azure Data Factory runs the script first. Then it runs the copy to insert the data. For example, to overwrite
the entire table with the latest data, specify a script to first delete all the records before you bulk load the new
data from the source.
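As a minimal sketch of this pattern, the sink below clears the target table before each run; the table name and the script itself are illustrative.
"sink": {
    "type": "SqlSink",
    "preCopyScript": "TRUNCATE TABLE <target_table_name>"
}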
Write data with custom logic
The steps to write data with custom logic are similar to those described in the Upsert data section. When you need to apply extra processing before the final insertion of source data into the destination table, you can load the data to a staging table and then invoke a stored procedure activity, or invoke a stored procedure in the copy activity sink to apply the data.
2. In your database, define the stored procedure with the same name as
sqlWriterStoredProcedureName . It handles input data from your specified source and merges into
the output table. The parameter name of the table type in the stored procedure is the same as
tableName defined in the dataset.
3. In Azure Data Factory, define the SQL sink section in the copy activity as follows:
"sink": {
"type": "SqlSink",
"sqlWriterStoredProcedureName": "spOverwriteMarketing",
"storedProcedureTableTypeParameterName": "Marketing",
"sqlWriterTableType": "MarketingType",
"storedProcedureParameters": {
"category": {
"value": "ProductA"
}
}
}
NOTE
To access an on-premises SQL Server, you need to use the Azure Data Factory managed virtual network with a private endpoint. Refer to this tutorial for detailed steps.
Source transformation
The below table lists the properties supported by SQL Server source. You can edit these properties in the
Source options tab.
Query: An Order By clause is not supported, but you can set a full SELECT FROM statement. You can also use user-defined table functions. select * from udfGetData() is a UDF in SQL that returns a table that you can use in data flow.
Query example: Select * from MyTable where customerId > 1000 and customerId < 2000
Sink transformation
The below table lists the properties supported by SQL Server sink. You can edit these properties in the Sink
options tab.
When you copy data from and to SQL Server, the following mappings are used from SQL Server data types to Azure Data Factory interim data types.
SQL SERVER DATA TYPE    DATA FACTORY INTERIM DATA TYPE
bigint Int64
binary Byte[]
bit Boolean
date DateTime
Datetime DateTime
datetime2 DateTime
Datetimeoffset DateTimeOffset
Decimal Decimal
Float Double
image Byte[]
int Int32
money Decimal
numeric Decimal
real Single
rowversion Byte[]
smalldatetime DateTime
smallint Int16
smallmoney Decimal
sql_variant Object
time TimeSpan
timestamp Byte[]
tinyint Int16
uniqueidentifier Guid
varbinary Byte[]
xml String
NOTE
For data types that map to the Decimal interim type, currently Copy activity supports precision up to 28. If you have data
that requires precision larger than 28, consider converting to a string in a SQL query.
NOTE
SQL Server Always Encrypted supports the following scenarios:
1. Either the source or the sink data store uses managed identity or service principal as the key provider authentication type.
2. Both the source and sink data stores use managed identity as the key provider authentication type.
3. Both the source and sink data stores use the same service principal as the key provider authentication type.
For detailed steps, see Configure the remote access server configuration option.
2. Start SQL Ser ver Configuration Manager . Expand SQL Ser ver Network Configuration for the
instance you want, and select Protocols for MSSQLSERVER . Protocols appear in the right pane. Enable
TCP/IP by right-clicking TCP/IP and selecting Enable .
For more information and alternate ways of enabling TCP/IP protocol, see Enable or disable a server
network protocol.
3. In the same window, double-click TCP/IP to launch the TCP/IP Proper ties window.
4. Switch to the IP Addresses tab. Scroll down to see the IPAll section. Write down the TCP Por t . The
default is 1433 .
5. Create a rule for the Windows Firewall on the machine to allow incoming traffic through this port.
6. Verify connection : To connect to SQL Server by using a fully qualified name, use SQL Server
Management Studio from a different machine. An example is
"<machine>.<domain>.corp.<company>.com,1433" .
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see Supported
data stores.
Copy data from Square using Azure Data Factory
(Preview)
5/6/2021 • 4 minutes to read
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Square connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Square to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Square connector.
Under connectionProperties :
{
"name": "SquareLinkedService",
"properties": {
"type": "Square",
"typeProperties": {
"connectionProperties":{
"host":"<e.g. mystore.mysquare.com>",
"clientId":"<client ID>",
"clientSecrect":{
"type": "SecureString",
"value": "<clientSecret>"
},
"accessToken":{
"type": "SecureString",
"value": "<access token>"
},
"refreshToken":{
"type": "SecureString",
"value": "<refresh token>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Square dataset.
To copy data from Square, set the type property of the dataset to SquareObject . The following properties are
supported:
Example
{
"name": "SquareDataset",
"properties": {
"type": "SquareObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Square linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Business". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromSquare",
"type": "Copy",
"inputs": [
{
"referenceName": "<Square input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SquareSource",
"query": "SELECT * FROM Business"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Lookup activity properties
To learn details about the properties, check Lookup activity.
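As a minimal sketch (assuming the common Lookup activity payload shape), a Lookup activity that runs a Square query might look like the following; the activity name, dataset name, and query are illustrative.
{
    "name": "LookupFromSquare",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "SquareSource",
            "query": "SELECT * FROM Business"
        },
        "dataset": {
            "referenceName": "<Square input dataset name>",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
}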
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Sybase using Azure Data Factory
5/6/2021 • 3 minutes to read
Supported capabilities
This Sybase connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Sybase database to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Sybase connector supports:
SAP Sybase SQL Anywhere (ASA) version 16 and above.
Copying data using Basic or Windows authentication.
Sybase IQ and ASE are not supported. You can use the generic ODBC connector with a Sybase driver instead.
Prerequisites
To use this Sybase connector, you need to:
Set up a Self-hosted Integration Runtime. See Self-hosted Integration Runtime article for details.
Install the data provider for Sybase iAnywhere.Data.SQLAnywhere 16 or above on the Integration Runtime
machine.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Sybase connector.
Example:
{
"name": "SybaseLinkedService",
"properties": {
"type": "Sybase",
"typeProperties": {
"server": "<server>",
"database": "<database>",
"authenticationType": "Basic",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Sybase dataset.
To copy data from Sybase, the following properties are supported:
tableName: Name of the table in the Sybase database. Required: No (if "query" in activity source is specified).
Example
{
"name": "SybaseDataset",
"properties": {
"type": "SybaseTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Sybase linked service name>",
"type": "LinkedServiceReference"
}
}
}
If you were using a RelationalTable typed dataset, it is still supported as-is, but we suggest that you use the new one going forward.
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromSybase",
"type": "Copy",
"inputs": [
{
"referenceName": "<Sybase input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "SybaseSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
If you were using a RelationalSource typed source, it is still supported as-is, but we suggest that you use the new one going forward.
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Teradata Vantage by using Azure
Data Factory
5/6/2021 • 11 minutes to read
Supported capabilities
This Teradata connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Teradata Vantage to any supported sink data store. For a list of data stores that are
supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Teradata connector supports:
Teradata version 14.10, 15.0, 15.10, 16.0, 16.10, and 16.20 .
Copying data by using Basic , Windows , or LDAP authentication.
Parallel copying from a Teradata source. See the Parallel copy from Teradata section for details.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
If you use Self-hosted Integration Runtime, note it provides a built-in Teradata driver starting from version 3.18.
You don't need to manually install any driver. The driver requires "Visual C++ Redistributable 2012 Update 4" on
the self-hosted integration runtime machine. If you don't yet have it installed, download it from here.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
the Teradata connector.
More connection properties that you can set in the connection string, depending on your case:
{
"name": "TeradataLinkedService",
"properties": {
"type": "Teradata",
"typeProperties": {
"connectionString": "DBCName=<server>",
"username": "<username>",
"password": "<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "TeradataLinkedService",
"properties": {
"type": "Teradata",
"typeProperties": {
"connectionString": "DBCName=<server>;MechanismName=LDAP;Uid=<username>;Pwd=<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
NOTE
The following payload is still supported. Going forward, however, you should use the new one.
Previous payload:
{
"name": "TeradataLinkedService",
"properties": {
"type": "Teradata",
"typeProperties": {
"server": "<server>",
"authenticationType": "<Basic/Windows>",
"username": "<username>",
"password": {
"type": "SecureString",
"value": "<password>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
This section provides a list of properties supported by the Teradata dataset. For a full list of sections and
properties available for defining datasets, see Datasets.
To copy data from Teradata, the following properties are supported:
database: The name of the Teradata instance. Required: No (if "query" in activity source is specified).
table: The name of the table in the Teradata instance. Required: No (if "query" in activity source is specified).
Example:
{
"name": "TeradataDataset",
"properties": {
"type": "TeradataTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Teradata linked service name>",
"type": "LinkedServiceReference"
}
}
}
NOTE
RelationalTable type dataset is still supported. However, we recommend that you use the new dataset.
Previous payload:
{
"name": "TeradataDataset",
"properties": {
"type": "RelationalTable",
"linkedServiceName": {
"referenceName": "<Teradata linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {}
}
}
TIP
To load data from Teradata efficiently by using data partitioning, learn more from Parallel copy from Teradata section.
To copy data from Teradata, the following properties are supported in the copy activity source section:
query: Use the custom SQL query to read data. An example is "SELECT * FROM MyTable". When you enable partitioned load, you need to hook any corresponding built-in partition parameters in your query. For examples, see the Parallel copy from Teradata section. Required: No (if table in dataset is specified).
NOTE
RelationalSource type copy source is still supported, but it doesn't support the new built-in parallel load from Teradata
(partition options). However, we recommend that you use the new dataset.
When you enable partitioned copy, Data Factory runs parallel queries against your Teradata source to load data
by partitions. The parallel degree is controlled by the parallelCopies setting on the copy activity. For example, if
you set parallelCopies to four, Data Factory concurrently generates and runs four queries based on your
specified partition option and settings, and each query retrieves a portion of data from your Teradata.
We suggest that you enable parallel copy with data partitioning, especially when you load a large amount of data from your Teradata. The following are suggested configurations for different scenarios. When copying data into a file-based data store, it's recommended to write to a folder as multiple files (only specify the folder name); in that case, the performance is better than writing to a single file.
Scenario: Load a large amount of data by using a custom query.
Suggested settings: Partition option: Hash. Query: SELECT * FROM <TABLENAME> WHERE ?AdfHashPartitionCondition AND <your_additional_where_clause>. Partition column: Specify the column used to apply the hash partition. If not specified, Data Factory automatically detects the PK column of the table you specified in the Teradata dataset.

Scenario: Load a large amount of data by using a custom query, having an integer column with evenly distributed values for range partitioning.
Suggested settings: Partition option: Dynamic range partition. Query: SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>. Partition column: Specify the column used to partition data. You can partition against a column with an integer data type. Partition upper bound and partition lower bound: Specify if you want to filter against the partition column to retrieve data only between the lower and upper range.
"source": {
"type": "TeradataSource",
"query":"SELECT * FROM <TABLENAME> WHERE ?AdfHashPartitionCondition AND <your_additional_where_clause>",
"partitionOption": "Hash",
"partitionSettings": {
"partitionColumnName": "<hash_partition_column_name>"
}
}
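For the dynamic range option described above, the source configuration can look like this sketch; the partition column and bounds are placeholders.
"source": {
    "type": "TeradataSource",
    "query": "SELECT * FROM <TABLENAME> WHERE ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound AND ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound AND <your_additional_where_clause>",
    "partitionOption": "DynamicRange",
    "partitionSettings": {
        "partitionColumnName": "<dynamic_range_partition_column_name>",
        "partitionUpperBound": "<upper_value_of_partition_column>",
        "partitionLowerBound": "<lower_value_of_partition_column>"
    }
}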
TERADATA DATA TYPE    DATA FACTORY INTERIM DATA TYPE
BigInt Int64
Blob Byte[]
Byte Byte[]
ByteInt Int16
Char String
Clob String
Date DateTime
Decimal Decimal
Double Double
Integer Int32
Interval Day To Hour Not supported. Apply explicit cast in source query.
Interval Day To Minute Not supported. Apply explicit cast in source query.
Interval Day To Second Not supported. Apply explicit cast in source query.
Interval Hour To Minute Not supported. Apply explicit cast in source query.
Interval Hour To Second Not supported. Apply explicit cast in source query.
Interval Minute To Second Not supported. Apply explicit cast in source query.
Interval Year To Month Not supported. Apply explicit cast in source query.
Number Double
Period (Time With Time Zone) Not supported. Apply explicit cast in source query.
Period (Timestamp With Time Zone) Not supported. Apply explicit cast in source query.
SmallInt Int16
Time TimeSpan
Timestamp DateTime
VarByte Byte[]
VarChar String
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Data Factory, see Supported data
stores.
Copy data from Vertica using Azure Data Factory
5/6/2021 • 3 minutes to read
Supported capabilities
This Vertica connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Vertica to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.
Prerequisites
If your data store is located inside an on-premises network, an Azure virtual network, or Amazon Virtual Private
Cloud, you need to configure a self-hosted integration runtime to connect to it.
If your data store is a managed cloud data service, you can use the Azure Integration Runtime. If the access is
restricted to IPs that are approved in the firewall rules, you can add Azure Integration Runtime IPs to the allow
list.
You can also use the managed virtual network integration runtime feature in Azure Data Factory to access the
on-premises network without installing and configuring a self-hosted integration runtime.
For more information about the network security mechanisms and options supported by Data Factory, see Data
access strategies.
Getting started
You can create a pipeline with copy activity using .NET SDK, Python SDK, Azure PowerShell, REST API, or Azure
Resource Manager template. See Copy activity tutorial for step-by-step instructions to create a pipeline with a
copy activity.
The following sections provide details about properties that are used to define Data Factory entities specific to
Vertica connector.
Example:
{
"name": "VerticaLinkedService",
"properties": {
"type": "Vertica",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;PWD=
<password>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
{
"name": "VerticaLinkedService",
"properties": {
"type": "Vertica",
"typeProperties": {
"connectionString": "Server=<server>;Port=<port>;Database=<database>;UID=<user name>;",
"pwd": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "<Azure Key Vault linked service name>",
"type": "LinkedServiceReference"
},
"secretName": "<secretName>"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Vertica dataset.
To copy data from Vertica, set the type property of the dataset to Ver ticaTable . The following properties are
supported:
tableName: Name of the table with schema. This property is supported for backward compatibility; use schema and table for new workloads. Required: No (if "query" in activity source is specified).
Example
{
"name": "VerticaDataset",
"properties": {
"type": "VerticaTable",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Vertica linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM MyTable". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromVertica",
"type": "Copy",
"inputs": [
{
"referenceName": "<Vertica input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "VerticaSource",
"query": "SELECT * FROM MyTable"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy data from Web table by using Azure Data
Factory
5/6/2021 • 4 minutes to read
Supported capabilities
This Web table connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from a Web table to any supported sink data store. For a list of data stores that are supported as sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Web table connector supports extracting table content from an HTML page.
Prerequisites
To use this Web table connector, you need to set up a Self-hosted Integration Runtime. See Self-hosted
Integration Runtime article for details.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Web table connector.
Example:
{
"name": "WebLinkedService",
"properties": {
"type": "Web",
"typeProperties": {
"url" : "https://en.wikipedia.org/wiki/",
"authenticationType": "Anonymous"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Web table dataset.
To copy data from Web table, set the type property of the dataset to WebTable . The following properties are
supported:
path: A relative URL to the resource that contains the table. Required: No. When path is not specified, only the URL specified in the linked service definition is used.
Example:
{
"name": "WebTableInput",
"properties": {
"type": "WebTable",
"typeProperties": {
"index": 1,
"path": "AFI's_100_Years...100_Movies"
},
"schema": [],
"linkedServiceName": {
"referenceName": "<Web linked service name>",
"type": "LinkedServiceReference"
}
}
}
"activities":[
{
"name": "CopyFromWebTable",
"type": "Copy",
"inputs": [
{
"referenceName": "<Web table input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "WebSource"
},
"sink": {
"type": "<sink type>"
}
}
}
]
3. In the From Web dialog box, enter URL that you would use in linked service JSON (for example:
https://en.wikipedia.org/wiki/) along with path you would specify for the dataset (for example:
AFI%27s_100_Years...100_Movies), and click OK .
6. In the Quer y Editor window, click Advanced Editor button on the toolbar.
7. In the Advanced Editor dialog box, the number next to "Source" is the index.
If you are using Excel 2013, use Microsoft Power Query for Excel to get the index. See the Connect to a web page article for details. The steps are similar if you are using Power BI Desktop.
Supported capabilities
This Xero connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Xero to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
Specifically, this Xero connector supports:
OAuth 2.0 and OAuth 1.0 authentication. For OAuth 1.0, the connector supports Xero private applications but not public applications.
All Xero tables (API endpoints) except "Reports".
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Xero connector.
Under connectionProperties :
tenantId: The tenant ID associated with your Xero application. Applicable for OAuth 2.0 authentication. Learn how to get the tenant ID from the Check the tenants you're authorized to access section. Required: Yes for OAuth 2.0 authentication.

refreshToken: Applicable for OAuth 2.0 authentication. The OAuth 2.0 refresh token is associated with the Xero application and is used to refresh the access token; the access token expires after 30 minutes. Learn about how the Xero authorization flow works and how to get the refresh token from this article. To get a refresh token, you must request the offline_access scope. Known limitation: Xero resets the refresh token after it's used for an access token refresh. For an operationalized workload, before each copy activity run you need to set a valid refresh token for ADF to use. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes for OAuth 2.0 authentication.
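Because Xero rotates the refresh token on every use, referencing the token from Azure Key Vault is usually easier to operationalize than editing the linked service before each run. The snippet below is a sketch that reuses the AzureKeyVaultSecret pattern shown elsewhere in this article; the secret name is a placeholder.
"refreshToken": {
    "type": "AzureKeyVaultSecret",
    "store": {
        "referenceName": "<Azure Key Vault linked service name>",
        "type": "LinkedServiceReference"
    },
    "secretName": "<name of the secret that holds the current refresh token>"
}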
{
"name": "XeroLinkedService",
"properties": {
"type": "Xero",
"typeProperties": {
"connectionProperties": {
"host":"api.xero.com",
"authenticationType":"OAuth_1.0",
"consumerKey": {
"type": "SecureString",
"value": "<consumer key>"
},
"privateKey": {
"type": "SecureString",
"value": "<private key>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}
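The preceding example uses OAuth 1.0. For OAuth 2.0, the linked service additionally carries the tenant ID and refresh token described in the table above. The following is only a sketch: it assumes the client ID and client secret are supplied through the same consumerKey and privateKey properties as in the OAuth 1.0 example, which you should verify against the connector reference.
{
    "name": "XeroLinkedService",
    "properties": {
        "type": "Xero",
        "typeProperties": {
            "connectionProperties": {
                "host": "api.xero.com",
                "authenticationType": "OAuth_2.0",
                "tenantId": "<tenant ID>",
                "consumerKey": {
                    "type": "SecureString",
                    "value": "<client ID>"
                },
                "privateKey": {
                    "type": "SecureString",
                    "value": "<client secret>"
                },
                "refreshToken": {
                    "type": "SecureString",
                    "value": "<refresh token>"
                },
                "useEncryptedEndpoints": true,
                "useHostVerification": true,
                "usePeerVerification": true
            }
        }
    }
}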
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Xero dataset.
To copy data from Xero, set the type property of the dataset to XeroObject . The following properties are
supported:
Example
{
"name": "XeroDataset",
"properties": {
"type": "XeroObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Xero linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Contacts". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromXero",
"type": "Copy",
"inputs": [
{
"referenceName": "<Xero input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "XeroSource",
"query": "SELECT * FROM Contacts"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Xero data is available through two schemas: Minimal (default) and Complete. The Complete schema contains prerequisite call tables that require additional data (for example, an ID column) before making the desired query.
The following tables have the same information in the Minimal and Complete schemas. To reduce the number of API calls, use the Minimal schema (default).
Bank_Transactions
Contact_Groups
Contacts
Contacts_Sales_Tracking_Categories
Contacts_Phones
Contacts_Addresses
Contacts_Purchases_Tracking_Categories
Credit_Notes
Credit_Notes_Allocations
Expense_Claims
Expense_Claim_Validation_Errors
Invoices
Invoices_Credit_Notes
Invoices_Prepayments
Invoices_Overpayments
Manual_Journals
Overpayments
Overpayments_Allocations
Prepayments
Prepayments_Allocations
Receipts
Receipt_Validation_Errors
Tracking_Categories
The following tables can only be queried with complete schema:
Complete.Bank_Transaction_Line_Items
Complete.Bank_Transaction_Line_Item_Tracking
Complete.Contact_Group_Contacts
Complete.Contacts_Contact_Persons
Complete.Credit_Note_Line_Items
Complete.Credit_Notes_Line_Items_Tracking
Complete.Expense_Claim_Payments
Complete.Expense_Claim_Receipts
Complete.Invoice_Line_Items
Complete.Invoices_Line_Items_Tracking
Complete.Manual_Journal_Lines
Complete.Manual_Journal_Line_Tracking
Complete.Overpayment_Line_Items
Complete.Overpayment_Line_Items_Tracking
Complete.Prepayment_Line_Items
Complete.Prepayment_Line_Item_Tracking
Complete.Receipt_Line_Items
Complete.Receipt_Line_Item_Tracking
Complete.Tracking_Category_Options
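For example, to read one of the Complete-only tables listed above, point the copy source query at the Complete schema; the table name is taken from the list above, and the rest of the source section is the same as in the earlier copy example.
"source": {
    "type": "XeroSource",
    "query": "SELECT * FROM Complete.Invoice_Line_Items"
}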
Next steps
For a list of supported data stores by the copy activity, see supported data stores.
XML format in Azure Data Factory
5/14/2021 • 8 minutes to read
Dataset properties
For a full list of sections and properties available for defining datasets, see the Datasets article. This section
provides a list of properties supported by the XML dataset.
{
"name": "XMLDataset",
"properties": {
"type": "Xml",
"linkedServiceName": {
"referenceName": "<Azure Blob Storage linked service name>",
"type": "LinkedServiceReference"
},
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"container": "containername",
"folderPath": "folder/subfolder",
},
"compression": {
"type": "ZipDeflate"
}
}
}
}
Copy activity properties
For a full list of sections and properties available for defining activities, see the Pipelines article. This section
provides a list of properties supported by the XML source.
Learn how to map XML data and the sink data store/format from schema mapping. When you preview XML files, the data is shown in a JSON hierarchy, and you use a JSON path to point to the fields.
XML as source
The following properties are supported in the copy activity *source* section. Learn more from XML connector
behavior.
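As a sketch of a copy activity source for XML stored in Azure Blob Storage: the storeSettings block follows the patterns used elsewhere in this article, and the formatSettings property names and values shown here should be checked against the XML format reference.
"source": {
    "type": "XmlSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFileName": "*.xml"
    },
    "formatSettings": {
        "type": "XmlReadSettings",
        "validationMode": "none",
        "namespaces": true
    }
}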
The below script is an example of an XML source configuration in mapping data flow using dataset mode.
source(allowSchemaDrift: true,
validateSchema: false,
validationMode: 'xsd',
namespaces: true) ~> XMLSource
The below script is an example of an XML source configuration using inline dataset mode.
source(allowSchemaDrift: true,
validateSchema: false,
format: 'xml',
fileSystem: 'filesystem',
folderPath: 'folder',
validationMode: 'xsd',
namespaces: true) ~> XMLSource
Next steps
Copy activity overview
Mapping data flow
Lookup activity
GetMetadata activity
Copy data from Zoho using Azure Data Factory
(Preview)
5/6/2021 • 4 minutes to read
IMPORTANT
This connector is currently in preview. You can try it out and give us feedback. If you want to take a dependency on
preview connectors in your solution, please contact Azure support.
Supported capabilities
This Zoho connector is supported for the following activities:
Copy activity with supported source/sink matrix
Lookup activity
You can copy data from Zoho to any supported sink data store. For a list of data stores that are supported as
sources/sinks by the copy activity, see the Supported data stores table.
This connector supports Zoho access token authentication and OAuth 2.0 authentication.
Azure Data Factory provides a built-in driver to enable connectivity; therefore, you don't need to manually install any driver to use this connector.
Getting started
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
The following sections provide details about properties that are used to define Data Factory entities specific to
Zoho connector.
Under connectionProperties :
clientId: The client ID associated with your Zoho application. Required: Yes for OAuth 2.0 authentication.

clientSecrect: The client secret associated with your Zoho application. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes for OAuth 2.0 authentication.

refreshToken: The OAuth 2.0 refresh token associated with your Zoho application, used to refresh the access token when it expires. The refresh token never expires. To get a refresh token, you must request the offline access_type; learn more from this article. Mark this field as a SecureString to store it securely in Data Factory, or reference a secret stored in Azure Key Vault. Required: Yes for OAuth 2.0 authentication.
{
"name": "ZohoLinkedService",
"properties": {
"type": "Zoho",
"typeProperties": {
"connectionProperties": {
"authenticationType":"OAuth_2.0",
"endpoint":"crm.zoho.com/crm/private",
"clientId":"<client ID>",
"clientSecrect":{
"type": "SecureString",
"value": "<client secret>"
},
"accessToken":{
"type": "SecureString",
"value": "<access token>"
},
"refreshToken":{
"type": "SecureString",
"value": "<refresh token>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}
{
"name": "ZohoLinkedService",
"properties": {
"type": "Zoho",
"typeProperties": {
"connectionProperties": {
"authenticationType":"Access Token",
"endpoint":"crm.zoho.com/crm/private",
"accessToken":{
"type": "SecureString",
"value": "<access token>"
},
"useEncryptedEndpoints":true,
"useHostVerification":true,
"usePeerVerification":true
}
}
}
}
Dataset properties
For a full list of sections and properties available for defining datasets, see the datasets article. This section
provides a list of properties supported by Zoho dataset.
To copy data from Zoho, set the type property of the dataset to ZohoObject . The following properties are
supported:
tableName: Name of the table. Required: No (if "query" in activity source is specified).
Example
{
"name": "ZohoDataset",
"properties": {
"type": "ZohoObject",
"typeProperties": {},
"schema": [],
"linkedServiceName": {
"referenceName": "<Zoho linked service name>",
"type": "LinkedServiceReference"
}
}
}
query: Use the custom SQL query to read data. For example: "SELECT * FROM Accounts". Required: No (if "tableName" in dataset is specified).
Example:
"activities":[
{
"name": "CopyFromZoho",
"type": "Copy",
"inputs": [
{
"referenceName": "<Zoho input dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<output dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "ZohoSource",
"query": "SELECT * FROM Accounts"
},
"sink": {
"type": "<sink type>"
}
}
}
]
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Copy activity in Azure Data Factory
6/8/2021 • 14 minutes to read
The Copy activity is executed on an integration runtime. You can use different types of integration runtimes for
different data copy scenarios:
When you're copying data between two data stores that are publicly accessible through the internet from any
IP, you can use the Azure integration runtime for the copy activity. This integration runtime is secure, reliable,
scalable, and globally available.
When you're copying data to and from data stores that are located on-premises or in a network with access
control (for example, an Azure virtual network), you need to set up a self-hosted integration runtime.
An integration runtime needs to be associated with each source and sink data store. For information about how
the Copy activity determines which integration runtime to use, see Determining which IR to use.
To copy data from a source to a sink, the service that runs the Copy activity performs these steps:
1. Reads data from a source data store.
2. Performs serialization/deserialization, compression/decompression, column mapping, and so on. It performs
these operations based on the configuration of the input dataset, output dataset, and Copy activity.
3. Writes data to the sink/destination data store.
Azure Cognitive ✓ ✓ ✓
Search index
CATEGORY    DATA STORE    SUPPORTED AS A SOURCE    SUPPORTED AS A SINK    SUPPORTED BY AZURE IR    SUPPORTED BY SELF-HOSTED IR
Azure Cosmos ✓ ✓ ✓ ✓
DB (SQL API)
Azure Cosmos ✓ ✓ ✓ ✓
DB's API for
MongoDB
Azure Data ✓ ✓ ✓ ✓
Explorer
Azure Database ✓ ✓ ✓
for MariaDB
Azure Database ✓ ✓ ✓ ✓
for MySQL
Azure Database ✓ ✓ ✓ ✓
for PostgreSQL
Azure Databricks ✓ ✓ ✓ ✓
Delta Lake
Azure File ✓ ✓ ✓ ✓
Storage
Azure SQL ✓ ✓ ✓ ✓
Database
Azure SQL ✓ ✓ ✓ ✓
Managed
Instance
Azure Synapse ✓ ✓ ✓ ✓
Analytics
Azure Table ✓ ✓ ✓ ✓
storage
DB2 ✓ ✓ ✓
Drill ✓ ✓ ✓
Google ✓ ✓ ✓
BigQuery
Greenplum ✓ ✓ ✓
HBase ✓ ✓ ✓
Hive ✓ ✓ ✓
Apache Impala ✓ ✓ ✓
Informix ✓ ✓ ✓
MariaDB ✓ ✓ ✓
Microsoft Access ✓ ✓ ✓
MySQL ✓ ✓ ✓
Netezza ✓ ✓ ✓
Oracle ✓ ✓ ✓ ✓
Phoenix ✓ ✓ ✓
PostgreSQL ✓ ✓ ✓
Presto ✓ ✓ ✓
SAP Business ✓ ✓
Warehouse via
Open Hub
SAP Business ✓ ✓
Warehouse via
MDX
SAP HANA ✓ ✓ ✓
SAP table ✓ ✓
Snowflake ✓ ✓ ✓ ✓
Spark ✓ ✓ ✓
SQL Server ✓ ✓ ✓ ✓
Sybase ✓ ✓
Teradata ✓ ✓ ✓
Vertica ✓ ✓ ✓
NoSQL Cassandra ✓ ✓ ✓
Couchbase ✓ ✓ ✓
(Preview)
MongoDB ✓ ✓ ✓ ✓
MongoDB Atlas ✓ ✓ ✓ ✓
File Amazon S3 ✓ ✓ ✓
Amazon S3 ✓ ✓ ✓
Compatible
Storage
File system ✓ ✓ ✓ ✓
FTP ✓ ✓ ✓
Google Cloud ✓ ✓ ✓
Storage
HDFS ✓ ✓ ✓
Oracle Cloud ✓ ✓ ✓
Storage
SFTP ✓ ✓ ✓ ✓
Generic OData ✓ ✓ ✓
Generic ODBC ✓ ✓ ✓
Generic REST ✓ ✓ ✓ ✓
Concur (Preview) ✓ ✓ ✓
Dataverse ✓ ✓ ✓ ✓
Dynamics 365 ✓ ✓ ✓ ✓
Dynamics AX ✓ ✓ ✓
Dynamics CRM ✓ ✓ ✓ ✓
Google AdWords ✓ ✓ ✓
HubSpot ✓ ✓ ✓
Jira ✓ ✓ ✓
Magento ✓ ✓ ✓
(Preview)
Marketo ✓ ✓ ✓
(Preview)
Microsoft 365 ✓ ✓ ✓
Oracle Eloqua ✓ ✓ ✓
(Preview)
Oracle ✓ ✓ ✓
Responsys
(Preview)
Oracle Service ✓ ✓ ✓
Cloud (Preview)
PayPal (Preview) ✓ ✓ ✓
QuickBooks ✓ ✓ ✓
(Preview)
Salesforce ✓ ✓ ✓ ✓
Salesforce ✓ ✓ ✓ ✓
Service Cloud
Salesforce ✓ ✓ ✓
Marketing Cloud
SAP ECC ✓ ✓ ✓
ServiceNow ✓ ✓ ✓
SharePoint ✓ ✓ ✓
Online List
Shopify (Preview) ✓ ✓ ✓
Square (Preview) ✓ ✓ ✓
Web table ✓ ✓
(HTML table)
Xero ✓ ✓ ✓
Zoho (Preview) ✓ ✓ ✓
NOTE
If a connector is marked Preview, you can try it out and give us feedback. If you want to take a dependency on preview
connectors in your solution, contact Azure support.
Supported regions
The service that enables the Copy activity is available globally in the regions and geographies listed in Azure
integration runtime locations. The globally available topology ensures efficient data movement that usually
avoids cross-region hops. See Products by region to check the availability of Data Factory and data movement in
a specific region.
Configuration
To perform the Copy activity with a pipeline, you can use one of the following tools or SDKs:
The Copy Data tool
The Azure portal
The .NET SDK
The Python SDK
Azure PowerShell
The REST API
The Azure Resource Manager template
In general, to use the Copy activity in Azure Data Factory, you need to:
1. Create linked ser vices for the source data store and the sink data store. You can find the list of
supported connectors in the Supported data stores and formats section of this article. Refer to the connector
article's "Linked service properties" section for configuration information and supported properties.
2. Create datasets for the source and sink . Refer to the "Dataset properties" sections of the source and
sink connector articles for configuration information and supported properties.
3. Create a pipeline with the Copy activity. The next section provides an example.
Syntax
The following template of a Copy activity contains a complete list of supported properties. Specify the ones that
fit your scenario.
"activities":[
{
"name": "CopyActivityTemplate",
"type": "Copy",
"inputs": [
{
"referenceName": "<source dataset name>",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "<sink dataset name>",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "<source type>",
<properties>
},
"sink": {
"type": "<sink type>"
<properties>
},
"translator":
{
"type": "TabularTranslator",
"columnMappings": "<column mapping>"
},
"dataIntegrationUnits": <number>,
"parallelCopies": <number>,
"enableStaging": true/false,
"stagingSettings": {
<properties>
},
"enableSkipIncompatibleRow": true/false,
"redirectIncompatibleRowSettings": {
<properties>
}
}
}
]
Syntax details
Monitoring
You can monitor the Copy activity run in the Azure Data Factory both visually and programmatically. For details,
see Monitor copy activity.
Incremental copy
Data Factory enables you to incrementally copy delta data from a source data store to a sink data store. For
details, see Tutorial: Incrementally copy data.
TIP
This feature works with the latest dataset model. If you don't see this option from the UI, try creating a new dataset.
To configure it programmatically, add the additionalColumns property in your copy activity source:
Example:
"activities":[
{
"name": "CopyWithAdditionalColumns",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "<source type>",
"additionalColumns": [
{
"name": "filePath",
"value": "$$FILEPATH"
},
{
"name": "newColName",
"value": "$$COLUMN:SourceColumnA"
},
{
"name": "pipelineName",
"value": {
"value": "@pipeline().Pipeline",
"type": "Expression"
}
},
{
"name": "staticValue",
"value": "sampleValue"
}
],
...
},
"sink": {
"type": "<sink type>"
}
}
}
]
Session log
You can log the names of the files that you copy. By reviewing the copy activity session logs, you can further verify that the data was not only copied successfully from the source to the destination store, but is also consistent between the two stores. See Session log in copy activity for details.
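As a rough sketch of what this can look like in the copy activity payload, the logSettings property names below are an assumption and should be verified against the Session log in copy activity article; the linked service and path are placeholders.
"typeProperties": {
    "source": { "type": "<source type>" },
    "sink": { "type": "<sink type>" },
    "logSettings": {
        "enableCopyActivityLog": true,
        "copyActivityLogSettings": {
            "logLevel": "Warning",
            "enableReliableLogging": false
        },
        "logLocationSettings": {
            "linkedServiceName": {
                "referenceName": "<log storage linked service name>",
                "type": "LinkedServiceReference"
            },
            "path": "<container/folder path for the log files>"
        }
    }
}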
Next steps
See the following quickstarts, tutorials, and samples:
Copy data from one location to another location in the same Azure Blob storage account
Copy data from Azure Blob storage to Azure SQL Database
Copy data from a SQL Server database to Azure
Monitor copy activity
5/6/2021 • 5 minutes to read
Monitor visually
Once you've created and published a pipeline in Azure Data Factory, you can associate it with a trigger or
manually kick off an ad hoc run. You can monitor all of your pipeline runs natively in the Azure Data Factory user
experience. Learn about Azure Data Factory monitoring in general from Visually monitor Azure Data Factory.
To monitor the Copy activity run, go to your data factory Author & Monitor UI. On the Monitor tab, you see a list of pipeline runs. Click the pipeline name link to access the list of activity runs in the pipeline run.
At this level, you can see links to copy activity input, output, and errors (if the Copy activity run fails), as well as
statistics like duration/status. Clicking the Details button (eyeglasses) next to the copy activity name will give
you deep details on your copy activity execution.
In this graphical monitoring view, Azure Data Factory presents you with the copy activity execution information, including data read/written volume, number of files/rows of data copied from source to sink, throughput, the configurations applied for your copy scenario, the steps the copy activity goes through with corresponding durations and details, and more. Refer to this table for each possible metric and its detailed description.
In some scenarios, when you run a Copy activity in Data Factory, you'll see "Performance tuning tips" at the
top of the copy activity monitoring view as shown in the example. The tips tell you the bottleneck identified by
ADF for the specific copy run, along with suggestion on what to change to boost copy throughput. Learn more
about auto performance tuning tips.
The bottom execution details and durations section describes the key steps your copy activity goes through, which is especially useful for troubleshooting copy performance. The bottleneck of your copy run is the step with the longest duration. Refer to Troubleshoot copy activity performance for what each stage represents and for detailed troubleshooting guidance.
Example: Copy from Amazon S3 to Azure Data Lake Storage Gen2
Monitor programmatically
Copy activity execution details and performance characteristics are also returned in the Copy Activity run
result > Output section, which is used to render the UI monitoring view. Following is a complete list of
properties that might be returned. You'll see only the properties that are applicable to your copy scenario. For
information about how to monitor activity runs programmatically in general, see Programmatically monitor an
Azure data factory.
dataRead: The actual amount of data read from the source. Unit in output: Int64 value, in bytes.
filesRead: The number of files read from the file-based source. Unit in output: Int64 value (no unit).
filesSkipped: The number of files skipped from the file-based source. Unit in output: Int64 value (no unit).
rowsRead: Number of rows read from the source. This metric does not apply when copying files as-is without parsing them, for example, when source and sink datasets are binary format type, or other format type with identical settings. Unit in output: Int64 value (no unit).
rowsCopied: Number of rows copied to sink. This metric does not apply when copying files as-is without parsing them, for example, when source and sink datasets are binary format type, or other format type with identical settings. Unit in output: Int64 value (no unit).
Example:
"output": {
"dataRead": 1180089300500,
"dataWritten": 1180089300500,
"filesRead": 110,
"filesWritten": 110,
"filesSkipped": 0,
"sourcePeakConnections": 640,
"sinkPeakConnections": 1024,
"copyDuration": 388,
"throughput": 2970183,
"errors": [],
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)",
"usedDataIntegrationUnits": 128,
"billingReference": "{\"activityType\":\"DataMovement\",\"billableDuration\":
[{\"Managed\":11.733333333333336}]}",
"usedParallelCopies": 64,
"dataConsistencyVerification":
{
"VerificationResult": "Verified",
"InconsistentData": "None"
},
"executionDetails": [
{
"source": {
"type": "AmazonS3"
},
"sink": {
"type": "AzureBlobFS",
"region": "East US",
"throttlingErrors": 6
},
"status": "Succeeded",
"start": "2020-03-04T02:13:25.1454206Z",
"duration": 388,
"usedDataIntegrationUnits": 128,
"usedParallelCopies": 64,
"profile": {
"queue": {
"status": "Completed",
"duration": 2
},
"transfer": {
"status": "Completed",
"duration": 386,
"details": {
"listingSource": {
"type": "AmazonS3",
"workingDuration": 0
},
"readingFromSource": {
"type": "AmazonS3",
"workingDuration": 301
},
"writingToSink": {
"type": "AzureBlobFS",
"workingDuration": 335
}
}
}
},
"detailedDurations": {
"queuingDuration": 2,
"transferDuration": 386
}
}
],
"perfRecommendation": [
{
"Tip": "6 write operations were throttled by the sink data store. To achieve better performance,
you are suggested to check and increase the allowed request rate for Azure Data Lake Storage Gen2, or reduce
the number of concurrent copy runs and other data access, or reduce the DIU or parallel copy.",
"ReferUrl": "https://go.microsoft.com/fwlink/?linkid=2102534 ",
"RuleName": "ReduceThrottlingErrorPerfRecommendationRule"
}
],
"durationInQueue": {
"integrationRuntimeQueue": 0
}
}
Next steps
See the other Copy Activity articles:
- Copy activity overview
- Copy activity performance
Delete Activity in Azure Data Factory
5/14/2021 • 8 minutes to read
WARNING
Deleted files or folders cannot be restored (unless the storage has soft-delete enabled). Be cautious when using the Delete
activity to delete files or folders.
Best practices
Here are some recommendations for using the Delete activity:
Back up your files before deleting them with the Delete activity in case you need to restore them in the
future.
Make sure that Data Factory has write permissions to delete folders or files from the storage store.
Make sure you are not deleting files that are being written at the same time.
If you want to delete files or folders from an on-premises system, make sure you are using a self-hosted integration runtime with a version greater than 3.14.
Syntax
{
"name": "DeleteActivity",
"type": "Delete",
"typeProperties": {
"dataset": {
"referenceName": "<dataset name>",
"type": "DatasetReference"
},
"storeSettings": {
"type": "<source type>",
"recursive": true/false,
"maxConcurrentConnections": <number>
},
"enableLogging": true/false,
"logStorageSettings": {
"linkedServiceName": {
"referenceName": "<name of linked service>",
"type": "LinkedServiceReference"
},
"path": "<path to save log file>"
}
}
}
Type properties
P RO P ERT Y DESC RIP T IO N REQ UIRED
recursive Indicates whether the files are deleted No. The default is false .
recursively from the subfolders or only
from the specified folder.
Monitoring
There are two places where you can see and monitor the results of the Delete activity:
From the output of the Delete activity.
From the log file.
Sample output of the Delete activity
{
"datasetName": "AmazonS3",
"type": "AmazonS3Object",
"prefix": "test",
"bucketName": "adf",
"recursive": true,
"isWildcardUsed": false,
"maxConcurrentConnections": 2,
"filesDeleted": 4,
"logPath": "https://sample.blob.core.windows.net/mycontainer/5c698705-a6e2-40bf-911e-e0a927de3f07",
"effectiveIntegrationRuntime": "MyAzureIR (West Central US)",
"executionDuration": 650
}
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"dataset":{
"referenceName":"PartitionedFolder",
"type":"DatasetReference",
"parameters":{
"TriggerTime":{
"value":"@formatDateTime(pipeline().parameters.TriggerTime, 'yyyy/MM/dd')",
"type":"Expression"
}
}
},
"logStorageSettings":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"path":"mycontainer/log"
},
"enableLogging":true,
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":true
}
}
}
],
"parameters":{
"TriggerTime":{
"type":"string"
}
},
"annotations":[
]
}
}
Sample dataset
{
"name":"PartitionedFolder",
"properties":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"TriggerTime":{
"type":"string"
}
},
"annotations":[
],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"folderPath":{
"value":"@dataset().TriggerTime",
"type":"Expression"
},
"container":{
"value":"mycontainer",
"type":"Expression"
}
}
}
}
}
Sample trigger
{
"name": "DailyTrigger",
"properties": {
"runtimeState": "Started",
"pipelines": [
{
"pipelineReference": {
"referenceName": "cleanup_time_partitioned_folder",
"type": "PipelineReference"
},
"parameters": {
"TriggerTime": "@trigger().scheduledTime"
}
}
],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Day",
"interval": 1,
"startTime": "2018-12-13T00:00:00.000Z",
"timeZone": "UTC",
"schedule": {
"minutes": [
59
],
"hours": [
23
]
}
}
}
}
}
Clean up the expired files that were last modified before 2018.1.1
You can create a pipeline to clean up old or expired files by using the file attribute filter "LastModified" in the dataset.
Sample pipeline
{
"name":"CleanupExpiredFiles",
"properties":{
"activities":[
{
"name":"DeleteFilebyLastModified",
"type":"Delete",
"dependsOn":[
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"dataset":{
"referenceName":"BlobFilesLastModifiedBefore201811",
"type":"DatasetReference"
},
"logStorageSettings":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"path":"mycontainer/log"
},
"enableLogging":true,
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":true,
"modifiedDatetimeEnd":"2018-01-01T00:00:00.000Z"
}
}
}
],
"annotations":[
]
}
}
Sample dataset
{
"name":"BlobFilesLastModifiedBefore201811",
"properties":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"annotations":[
],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":"*",
"folderPath":"mydirectory",
"container":"mycontainer"
}
}
}
}
Move files by chaining the Copy activity and the Delete activity
You can move a file by using a copy activity to copy a file and then a delete activity to delete a file in a pipeline.
When you want to move multiple files, you can use the GetMetadata activity + Filter activity + Foreach activity +
Copy activity + Delete activity as in the following sample:
NOTE
If you want to move an entire folder by defining a dataset that contains only a folder path, and then using a copy activity and a Delete activity that reference the same dataset representing the folder, be very careful. You must make sure that no new files arrive in the folder between the copy operation and the delete operation. If new files arrive in the folder at the moment when your copy activity has just completed the copy job but the Delete activity has not yet started, the Delete activity may delete these newly arriving files, which have NOT been copied to the destination yet, by deleting the entire folder.
Sample pipeline
{
"name":"MoveFiles",
"properties":{
"activities":[
{
"name":"GetFileList",
"type":"GetMetadata",
"dependsOn":[
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"dataset":{
"referenceName":"OneSourceFolder",
"type":"DatasetReference",
"parameters":{
"Container":{
"value":"@pipeline().parameters.SourceStore_Location",
"type":"Expression"
},
"Directory":{
"value":"@pipeline().parameters.SourceStore_Directory",
"type":"Expression"
}
}
},
"fieldList":[
"childItems"
],
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":true
},
"formatSettings":{
"type":"BinaryReadSettings"
}
}
},
{
"name":"FilterFiles",
"type":"Filter",
"dependsOn":[
{
"activity":"GetFileList",
"dependencyConditions":[
"Succeeded"
]
}
],
"userProperties":[
],
"typeProperties":{
"items":{
"value":"@activity('GetFileList').output.childItems",
"type":"Expression"
},
"condition":{
"value":"@equals(item().type, 'File')",
"type":"Expression"
}
}
},
{
"name":"ForEachFile",
"type":"ForEach",
"dependsOn":[
{
"activity":"FilterFiles",
"dependencyConditions":[
"Succeeded"
]
}
],
"userProperties":[
],
"typeProperties":{
"items":{
"value":"@activity('FilterFiles').output.value",
"type":"Expression"
},
"batchCount":20,
"activities":[
{
"name":"CopyAFile",
"type":"Copy",
"dependsOn":[
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"source":{
"type":"BinarySource",
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":false,
"deleteFilesAfterCompletion":false
},
"formatSettings":{
"type":"BinaryReadSettings"
},
"recursive":false
},
"sink":{
"type":"BinarySink",
"storeSettings":{
"type":"AzureBlobStorageWriteSettings"
}
},
"enableStaging":false,
"dataIntegrationUnits":0
},
"inputs":[
{
"referenceName":"OneSourceFile",
"type":"DatasetReference",
"parameters":{
"Container":{
"value":"@pipeline().parameters.SourceStore_Location",
"type":"Expression"
},
"Directory":{
"value":"@pipeline().parameters.SourceStore_Directory",
"type":"Expression"
},
"filename":{
"value":"@item().name",
"type":"Expression"
}
}
}
],
"outputs":[
{
"referenceName":"OneDestinationFile",
"type":"DatasetReference",
"parameters":{
"Container":{
"value":"@pipeline().parameters.DestinationStore_Location",
"type":"Expression"
},
"Directory":{
"value":"@pipeline().parameters.DestinationStore_Directory",
"value":"@pipeline().parameters.DestinationStore_Directory",
"type":"Expression"
},
"filename":{
"value":"@item().name",
"type":"Expression"
}
}
}
]
},
{
"name":"DeleteAFile",
"type":"Delete",
"dependsOn":[
{
"activity":"CopyAFile",
"dependencyConditions":[
"Succeeded"
]
}
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"dataset":{
"referenceName":"OneSourceFile",
"type":"DatasetReference",
"parameters":{
"Container":{
"value":"@pipeline().parameters.SourceStore_Location",
"type":"Expression"
},
"Directory":{
"value":"@pipeline().parameters.SourceStore_Directory",
"type":"Expression"
},
"filename":{
"value":"@item().name",
"type":"Expression"
}
}
},
"logStorageSettings":{
"linkedServiceName":{
"referenceName":"BloblinkedService",
"type":"LinkedServiceReference"
},
"path":"container/log"
},
"enableLogging":true,
"storeSettings":{
"type":"AzureBlobStorageReadSettings",
"recursive":true
}
}
}
]
}
}
],
"parameters":{
"parameters":{
"SourceStore_Location":{
"type":"String"
},
"SourceStore_Directory":{
"type":"String"
},
"DestinationStore_Location":{
"type":"String"
},
"DestinationStore_Directory":{
"type":"String"
}
},
"annotations":[
]
}
}
Sample datasets
Dataset used by GetMetadata activity to enumerate the file list.
{
"name":"OneSourceFolder",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"Container":{
"type":"String"
},
"Directory":{
"type":"String"
}
},
"annotations":[
],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"folderPath":{
"value":"@{dataset().Directory}",
"type":"Expression"
},
"container":{
"value":"@{dataset().Container}",
"type":"Expression"
}
}
}
}
}
Dataset for data source used by copy activity and the Delete activity.
{
"name":"OneSourceFile",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"Container":{
"type":"String"
},
"Directory":{
"type":"String"
},
"filename":{
"type":"string"
}
},
"annotations":[
],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":{
"value":"@dataset().filename",
"type":"Expression"
},
"folderPath":{
"value":"@{dataset().Directory}",
"type":"Expression"
},
"container":{
"value":"@{dataset().Container}",
"type":"Expression"
}
}
}
}
}
Dataset for the data destination used by the Copy activity.
{
"name":"OneDestinationFile",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"parameters":{
"Container":{
"type":"String"
},
"Directory":{
"type":"String"
},
"filename":{
"type":"string"
}
},
"annotations":[
],
"type":"Binary",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":{
"value":"@dataset().filename",
"type":"Expression"
},
"folderPath":{
"value":"@{dataset().Directory}",
"type":"Expression"
},
"container":{
"value":"@{dataset().Container}",
"type":"Expression"
}
}
}
}
}
You can also get the template to move files from here.
Known limitations
The Delete activity does not support deleting a list of folders described by a wildcard.
When you use the file attribute filters modifiedDatetimeStart and modifiedDatetimeEnd in the Delete activity to select the files to be deleted, make sure to also set "wildcardFileName": "*" in the Delete activity, as sketched below.
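A hedged sketch of a storeSettings block that satisfies this limitation; the datetime values are illustrative placeholders.
"storeSettings": {
    "type": "AzureBlobStorageReadSettings",
    "recursive": true,
    "wildcardFileName": "*",
    "modifiedDatetimeStart": "2018-01-01T00:00:00.000Z",
    "modifiedDatetimeEnd": "2018-06-01T00:00:00.000Z"
}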
Next steps
Learn more about moving files in Azure Data Factory.
Copy Data tool in Azure Data Factory
Use the Copy Data tool when:
You want to easily build a data loading task without learning about Azure Data Factory entities (linked services, datasets, pipelines, and so on).
You want to quickly load a large number of data artifacts into a data lake.
Use the per-activity authoring canvas when:
You want to implement complex and flexible logic for loading data into a data lake.
You want to chain the Copy activity with subsequent activities for cleansing or processing data.
To start the Copy Data tool, click the Ingest tile on the home page of your data factory.
After you launch the Copy Data tool, you will see two types of tasks: the built-in copy task and the metadata-driven copy task . The built-in copy task leads you to create a pipeline within five minutes to replicate data without learning about Azure Data Factory entities. The metadata-driven copy task eases your journey of creating parameterized pipelines and an external control table in order to manage copying large amounts of objects (for example, thousands of tables) at scale. You can see more details in metadata driven copy data.
Intuitive flow for loading data into a data lake
This tool allows you to easily move data from a wide variety of sources to destinations in minutes with an
intuitive flow:
1. Configure settings for the source.
2. Configure settings for the destination.
3. Configure advanced settings for the copy operation such as column mapping, performance settings, and fault tolerance settings.
4. Specify a schedule for the data loading task.
5. Review the summary of Data Factory entities to be created.
6. Edit the pipeline to update settings for the copy activity as needed.
The tool is designed with big data in mind from the start, with support for diverse data and object types.
You can use it to move hundreds of folders, files, or tables. The tool supports automatic data preview,
schema capture and automatic mapping, and data filtering as well.
NOTE
When copying data from SQL Server or Azure SQL Database into Azure Synapse Analytics, if the table does not exist in
the destination store, Copy Data tool supports creation of the table automatically by using the source schema.
Filter data
You can filter source data to select only the data that needs to be copied to the sink data store. Filtering reduces
the volume of the data to be copied to the sink data store and therefore enhances the throughput of the copy
operation. Copy Data tool provides a flexible way to filter data in a relational database by using the SQL query
language, or files in an Azure blob folder.
Filter data in a database
The following screenshot shows a SQL query to filter the data.
Filter data in an Azure blob folder
For example, time-partitioned folders in the blob store might be organized as follows:
2016/03/01/01
2016/03/01/02
2016/03/01/03
...
Click the Browse button for File or folder, browse to one of these folders (for example, 2016->03->01->02), and click Choose. You should see 2016/03/01/02 in the text box.
Then, replace 2016 with {year}, 03 with {month}, 01 with {day}, and 02 with {hour}, and press the Tab key.
When you select Incremental load: time-partitioned folder/file names in the File loading behavior section and you select Schedule or Tumbling window on the Properties page, you should see drop-down lists to select the format for these four variables.
The Copy Data tool generates parameters with expressions, functions, and system variables that can be used to represent {year}, {month}, {day}, {hour}, and {minute} when creating the pipeline, as sketched below.
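As an illustration only (not the tool's exact output), the generated pipeline typically passes a time-partitioned path into a dataset parameter with an expression similar to the following sketch; windowStart and TimeSlice are assumed names.
"parameters": {
    "TimeSlice": {
        "value": "@formatDateTime(pipeline().parameters.windowStart, 'yyyy/MM/dd/HH')",
        "type": "Expression"
    }
}
The dataset can then use @dataset().TimeSlice as its folderPath, mirroring the time-partitioned folder sample shown earlier in this article.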
Scheduling options
You can run the copy operation once or on a schedule (hourly, daily, and so on). These options can be used for
the connectors across different environments, including on-premises, cloud, and local desktop.
A one-time copy operation enables data movement from a source to a destination only once. It applies to data
of any size and any supported format. The scheduled copy allows you to copy data on a recurrence that you
specify. You can use rich settings (like retry, timeout, and alerts) to configure the scheduled copy.
Next steps
Try these tutorials that use the Copy Data tool:
Quickstart: create a data factory using the Copy Data tool
Tutorial: copy data in Azure using the Copy Data tool
Tutorial: copy on-premises data to Azure using the Copy Data tool
Build large-scale data copy pipelines with
metadata-driven approach in copy data tool
(Preview)
2. Input the connection of your source database . You can use parameterized linked service as well.
3. Select the table name to copy.
NOTE
If you select a tabular data store, you will have the chance to further select either full load or incremental load on the next page. If you select a storage store, you can select full load only on the next page. Incrementally loading new files only from a storage store is currently not supported.
TIP
If you want to do a full copy of all the tables, select Full load all tables . If you want to do an incremental copy, select Configure for each table individually , and select Delta load as well as the watermark column name and value to start with for each table.
9. Query the main control table and connection control table to review the metadata in it.
Main control table
Connection control table
10. Go back to the ADF portal to view and debug the pipelines. You will see a folder named "MetadataDrivenCopyTask_#########". Open the pipeline named "MetadataDrivenCopyTask###_TopLevel" and click Debug run.
You are required to input the following parameters:
MainControlTableName    You can always change the main control table name, so the pipeline will get the metadata from that table before the run.
NOTE
The pipeline will NOT be redeployed. The newly created SQL script helps you update the control table only.
Control tables
Main control table
Each row in the control table contains the metadata for one object (for example, one table) to be copied.
COLUMN NAME    DESCRIPTION
TriggerName    Trigger name, which can trigger the pipeline to copy this object. If it is a debug run, the name is Sandbox. If it is a manual execution, the name is Manual.
Pipelines
You will see that three levels of pipelines are generated by the Copy Data tool.
MetadataDrivenCopyTask_xxx_TopLevel
This pipeline calculates the total number of objects (tables, and so on) required to be copied in this run, comes up with the number of sequential batches based on the maximum allowed concurrent copy tasks, and then executes another pipeline to copy the different batches sequentially.
Parameters
PARAMETER NAME    DESCRIPTION
MaxNumberOfConcurrentTasks    You can always change the maximum number of concurrent copy activity runs before the pipeline run. The default value is the one you input in the Copy Data tool.
MainControlTableName    The table name of the main control table. The pipeline gets the metadata from this table before the run.
Activities
MetadataDrivenCopyTask_xxx_MiddleLevel
This pipeline will copy one batch of objects. The objects belonging to this batch will be copied in parallel.
Parameters
Activities
MetadataDrivenCopyTask_xxx_BottomLevel
This pipeline will copy objects from one group. The objects belonging to this group will be copied in parallel.
Parameters
Activities
Known limitations
The Copy Data tool does not currently support metadata-driven ingestion for incrementally copying new files only. But you can bring your own parameterized pipelines to achieve that.
IR name, database type, and file format type cannot be parameterized in ADF. For example, if you want to ingest data from both Oracle Server and SQL Server, you need two different parameterized pipelines. But the single control table can be shared by the two sets of pipelines.
Next steps
Try these tutorials that use the Copy Data tool:
Quickstart: create a data factory using the Copy Data tool
Tutorial: copy data in Azure using the Copy Data tool
Tutorial: copy on-premises data to Azure using the Copy Data tool
Supported file formats and compression codecs by
copy activity in Azure Data Factory
Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity performance
Copy activity performance and scalability guide
NOTE
If you aren't familiar with the copy activity in general, see the copy activity overview before you read this article.
DATA SIZE \ BANDWIDTH    50 Mbps       100 Mbps     500 Mbps     1 Gbps       5 Gbps       10 Gbps      50 Gbps
1 GB                     2.7 min       1.4 min      0.3 min      0.1 min      0.03 min     0.01 min     0.0 min
10 GB                    27.3 min      13.7 min     2.7 min      1.3 min      0.3 min      0.1 min      0.03 min
100 GB                   4.6 hrs       2.3 hrs      0.5 hrs      0.2 hrs      0.05 hrs     0.02 hrs     0.0 hrs
1 TB                     46.6 hrs      23.3 hrs     4.7 hrs      2.3 hrs      0.5 hrs      0.2 hrs      0.05 hrs
10 TB                    19.4 days     9.7 days     1.9 days     0.9 days     0.2 days     0.1 days     0.02 days
100 TB                   194.2 days    97.1 days    19.4 days    9.7 days     1.9 days     1 day        0.2 days
Next steps
See the other copy activity articles:
Copy activity overview
Troubleshoot copy activity performance
Copy activity performance optimization features
Use Azure Data Factory to migrate data from your data lake or data warehouse to Azure
Migrate data from Amazon S3 to Azure Storage
Troubleshoot copy activity performance
CATEGORY    PERFORMANCE TUNING TIPS
Data store specific    Loading data into Azure Synapse Analytics: suggest using PolyBase or the COPY statement if it's not used.
Staged copy    If staged copy is configured but not helpful for your source-sink pair, suggest removing it.
Resume    When the copy activity is resumed from the last failure point but you happen to change the DIU setting after the original run, note that the new DIU setting doesn't take effect.
STAGE    DESCRIPTION
Queue    The elapsed time until the copy activity actually starts on the integration runtime.
Pre-copy script    The elapsed time between the copy activity starting on the IR and the copy activity finishing executing the pre-copy script in the sink data store. Applies when you configure the pre-copy script for database sinks, for example, when writing data into Azure SQL Database and doing cleanup before copying new data.
Transfer    The elapsed time between the end of the previous step and the IR transferring all the data from source to sink. Note that the sub-steps under transfer run in parallel, and some operations are not shown now, for example, parsing/generating file format.
Other references
Here are performance monitoring and tuning references for some of the supported data stores:
Azure Blob storage: Scalability and performance targets for Blob storage and Performance and scalability
checklist for Blob storage.
Azure Table storage: Scalability and performance targets for Table storage and Performance and scalability
checklist for Table storage.
Azure SQL Database: You can monitor the performance and check the Database Transaction Unit (DTU)
percentage.
Azure Synapse Analytics: Its capability is measured in Data Warehouse Units (DWUs). See Manage compute
power in Azure Synapse Analytics (Overview).
Azure Cosmos DB: Performance levels in Azure Cosmos DB.
SQL Server: Monitor and tune for performance.
On-premises file server: Performance tuning for file servers.
Next steps
See the other copy activity articles:
Copy activity overview
Copy activity performance and scalability guide
Copy activity performance optimization features
Use Azure Data Factory to migrate data from your data lake or data warehouse to Azure
Migrate data from Amazon S3 to Azure Storage
Copy activity performance optimization features
COPY SCENARIO    SUPPORTED DIU RANGE    DEFAULT DIUs DETERMINED BY SERVICE
Between file stores
Supported DIU range: Copy from or to a single file: 2-4. Copy from and to multiple files: 2-256, depending on the number and size of the files.
Default DIUs determined by service: Between 4 and 32, depending on the number and size of the files.
From file store to non-file store
Supported DIU range: Copy from a single file: 2-4. Copy from multiple files: 2-256, depending on the number and size of the files. For example, if you copy data from a folder with 4 large files, the max effective DIU is 16.
Default DIUs determined by service: Copy into Azure SQL Database or Azure Cosmos DB: between 4 and 16, depending on the sink tier (DTUs/RUs) and source file pattern. Copy into Azure Synapse Analytics using PolyBase or COPY statement: 2. Other scenario: 4.
From non-file store to file store
Supported DIU range: Copy from partition-option-enabled data stores (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note that each source data partition can use up to 4 DIUs. Other scenarios: 2-4.
Default DIUs determined by service: Copy from REST or HTTP: 1. Copy from Amazon Redshift using UNLOAD: 2. Other scenario: 4.
Between non-file stores
Supported DIU range: Copy from partition-option-enabled data stores (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SQL Server, and Teradata): 2-256 when writing to a folder, and 2-4 when writing to one single file. Note that each source data partition can use up to 4 DIUs. Other scenarios: 2-4.
Default DIUs determined by service: Copy from REST or HTTP: 1. Other scenario: 4.
You can see the DIUs used for each copy run in the copy activity monitoring view or activity output. For more
information, see Copy activity monitoring. To override this default, specify a value for the dataIntegrationUnits
property as follows. The actual number of DIUs that the copy operation uses at run time is equal to or less than
the configured value, depending on your data pattern.
You will be charged # of used DIUs * copy duration * unit price/DIU-hour . See the current prices here.
Local currency and separate discounting may apply per subscription type.
Example:
"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"dataIntegrationUnits": 128
}
}
]
TIP
The default behavior of parallel copy usually gives you the best throughput, which is auto-determined by ADF based on
your source-sink pair, data pattern and number of DIUs or the Self-hosted IR's CPU/memory/node count. Refer to
Troubleshoot copy activity performance on when to tune parallel copy.
COPY SCENARIO    PARALLEL COPY BEHAVIOR
From file store to non-file store
When copying data into Azure SQL Database or Azure Cosmos DB, the default parallel copy also depends on the sink tier (number of DTUs/RUs).
When copying data into Azure Table, the default parallel copy is 4.
From non-file store to file store
When copying data from a partition-option-enabled data store (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SAP HANA, SAP Open Hub, SAP Table, SQL Server, and Teradata), the default parallel copy is 4. The actual number of parallel copies the copy activity uses at run time is no more than the number of data partitions you have. When using a Self-hosted Integration Runtime and copying to Azure Blob/ADLS Gen2, note that the max effective parallel copy is 4 or 5 per IR node.
For other scenarios, parallel copy doesn't take effect. Even if parallelism is specified, it's not applied.
Between non-file stores
When copying data into Azure SQL Database or Azure Cosmos DB, the default parallel copy also depends on the sink tier (number of DTUs/RUs).
When copying data from a partition-option-enabled data store (including Azure SQL Database, Azure SQL Managed Instance, Azure Synapse Analytics, Oracle, Netezza, SAP HANA, SAP Open Hub, SAP Table, SQL Server, and Teradata), the default parallel copy is 4.
When copying data into Azure Table, the default parallel copy is 4.
To control the load on machines that host your data stores, or to tune copy performance, you can override the
default value and specify a value for the parallelCopies property. The value must be an integer greater than or
equal to 1. At run time, for the best performance, the copy activity uses a value that is less than or equal to the
value that you set.
When you specify a value for the parallelCopies property, take the load increase on your source and sink data
stores into account. Also consider the load increase on the self-hosted integration runtime if the copy activity runs on it. This load increase happens especially when you have multiple activities or concurrent runs of
the same activities that run against the same data store. If you notice that either the data store or the self-hosted
integration runtime is overwhelmed with the load, decrease the parallelCopies value to relieve the load.
Example:
"activities":[
{
"name": "Sample copy activity",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"parallelCopies": 32
}
}
]
Staged copy
When you copy data from a source data store to a sink data store, you might choose to use Azure Blob storage
or Azure Data Lake Storage Gen2 as an interim staging store. Staging is especially useful in the following cases:
You want to ingest data from various data stores into Azure Synapse Analytics via PolyBase,
copy data from/to Snowflake, or ingest data from Amazon Redshift/HDFS performantly. Learn
more details from:
Use PolyBase to load data into Azure Synapse Analytics.
Snowflake connector
Amazon Redshift connector
HDFS connector
You don't want to open ports other than port 80 and port 443 in your firewall because of corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database or an Azure Synapse Analytics, you need to activate outbound TCP communication on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, staged copy can take advantage of the self-hosted integration runtime to first copy data to a staging storage over HTTP or HTTPS on port 443, then load the data from staging into SQL Database or Azure Synapse Analytics. In this flow, you don't need to enable port 1433.
Sometimes it takes a while to perform a hybrid data movement (that is, to copy from an on-
premises data store to a cloud data store) over a slow network connection. To improve
performance, you can use staged copy to compress the data on-premises so that it takes less time to move
data to the staging data store in the cloud. Then you can decompress the data in the staging store before you
load into the destination data store.
How staged copy works
When you activate the staging feature, first the data is copied from the source data store to the staging storage
(bring your own Azure Blob or Azure Data Lake Storage Gen2). Next, the data is copied from the staging to the
sink data store. Azure Data Factory copy activity automatically manages the two-stage flow for you, and also
cleans up temporary data from the staging storage after the data movement is complete.
When you activate data movement by using a staging store, you can specify whether you want the data to be
compressed before you move data from the source data store to the staging store and then decompressed
before you move data from an interim or staging data store to the sink data store.
Currently, you can't copy data between two data stores that are connected via different self-hosted IRs, with or without staged copy. For such a scenario, you can configure two explicitly chained copy activities to copy from source to staging and then from staging to sink, as sketched below.
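A hedged sketch of that workaround with two chained copy activities; the activity and dataset names (CopySourceToStaging, StagingBlob, and so on) are illustrative placeholders.
"activities": [
    {
        "name": "CopySourceToStaging",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceOnIR1", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "StagingBlob", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "BinarySource" },
            "sink": { "type": "BinarySink" }
        }
    },
    {
        "name": "CopyStagingToSink",
        "type": "Copy",
        "dependsOn": [
            { "activity": "CopySourceToStaging", "dependencyConditions": [ "Succeeded" ] }
        ],
        "inputs": [ { "referenceName": "StagingBlob", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SinkOnIR2", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "BinarySource" },
            "sink": { "type": "BinarySink" }
        }
    }
]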
Configuration
Configure the enableStaging setting in the copy activity to specify whether you want the data to be staged in
storage before you load it into a destination data store. When you set enableStaging to TRUE , specify the
additional properties listed in the following table.
NOTE
If you use staged copy with compression enabled, the service principal or MSI authentication for staging blob linked
service isn't supported.
Here's a sample definition of a copy activity with the properties that are described in the preceding table:
"activities":[
{
"name": "CopyActivityWithStaging",
"type": "Copy",
"inputs": [...],
"outputs": [...],
"typeProperties": {
"source": {
"type": "OracleSource",
},
"sink": {
"type": "SqlDWSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": {
"referenceName": "MyStagingStorage",
"type": "LinkedServiceReference"
},
"path": "stagingcontainer/path"
}
}
}
]
IMPORTANT
When you choose to preserve ACLs, make sure you grant high enough permissions for Data Factory to operate against
your sink Data Lake Storage Gen2 account. For example, use account key authentication or assign the Storage Blob Data
Owner role to the service principal or managed identity.
When you configure the source as Data Lake Storage Gen1/Gen2 with binary format or the binary copy option, and the sink as Data Lake Storage Gen2 with binary format or the binary copy option, you can find the Preserve option on the Settings page in the Copy Data tool or on the Copy Activity > Settings tab for activity authoring, as sketched below.
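For illustration, assuming the preserve setting accepts the ACL, owner, and group values for Data Lake Storage Gen2 binary copy, the copy activity type properties could be sketched as follows:
"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings"
        }
    },
    "preserve": [ "ACL", "owner", "group" ]
}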
Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity performance
Schema and data type mapping in copy activity
Schema mapping
Default mapping
By default, the copy activity maps source data to the sink by column names in a case-sensitive manner. If the sink doesn't exist, for example, when writing to file(s), the source field names are persisted as sink names. If the sink already exists, it must contain all the columns being copied from the source. Such default mapping supports flexible schemas and schema drift from source to sink from execution to execution - all the data returned by the source data store can be copied to the sink.
If your source is a text file without a header line, explicit mapping is required because the source doesn't contain column names.
Explicit mapping
You can also specify explicit mapping to customize the column/field mapping from source to sink based on your
need. With explicit mapping, you can copy only partial source data to sink, or map source data to sink with
different names, or reshape tabular/hierarchical data. The copy activity:
1. Reads the data from the source and determines the source schema.
2. Applies your defined mapping.
3. Writes the data to the sink.
Learn more about:
Tabular source to tabular sink
Hierarchical source to tabular sink
Tabular/Hierarchical source to hierarchical sink
You can configure the mapping on Data Factory authoring UI -> copy activity -> mapping tab, or
programmatically specify the mapping in copy activity -> translator property. The following properties are
supported in translator -> mappings array -> objects -> source and sink , which points to the specific
column/field to map data.
{
"name": "CopyActivityTabularToTabular",
"type": "Copy",
"typeProperties": {
"source": { "type": "SalesforceSource" },
"sink": { "type": "SqlSink" },
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": { "name": "Id" },
"sink": { "name": "CustomerID" }
},
{
"source": { "name": "Name" },
"sink": { "name": "LastName" }
},
{
"source": { "name": "LastModifiedDate" },
"sink": { "name": "ModifiedDate" }
}
]
}
},
...
}
To copy data from delimited text file(s) without header line, the columns are represented by ordinal instead of
names.
{
"name": "CopyActivityTabularToTabular",
"type": "Copy",
"typeProperties": {
"source": { "type": "DelimitedTextSource" },
"sink": { "type": "SqlSink" },
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": { "ordinal": "1" },
"sink": { "name": "CustomerID" }
},
{
"source": { "ordinal": "2" },
"sink": { "name": "LastName" }
},
{
"source": { "ordinal": "3" },
"sink": { "name": "ModifiedDate" }
}
]
}
},
...
}
{
"id": {
"$oid": "592e07800000000000000000"
},
"number": "01",
"date": "20170122",
"orders": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "name": "Seattle" } ]
}
And you want to copy it into a text file in the following format with a header line, by flattening the data inside the array (order_pd and order_price) and cross joining with the common root info (number, date, and city):
ORDERNUMBER    ORDERDATE    ORDER_PD    ORDER_PRICE    CITY
01             20170122     P1          23             Seattle
01             20170122     P2          13             Seattle
NOTE
For records where the array marked as collection reference is empty and the check box is selected, the entire record is
skipped.
You can also switch to Advanced editor , in which case you can directly see and edit the fields' JSON paths. If
you choose to add new mapping in this view, specify the JSON path.
The same mapping can be configured as the following in copy activity payload (see translator ):
{
"name": "CopyActivityHierarchicalToTabular",
"type": "Copy",
"typeProperties": {
"source": { "type": "MongoDbV2Source" },
"sink": { "type": "DelimitedTextSink" },
"translator": {
"type": "TabularTranslator",
"mappings": [
{
"source": { "path": "$['number']" },
"sink": { "name": "orderNumber" }
},
{
"source": { "path": "$['date']" },
"sink": { "name": "orderDate" }
},
{
"source": { "path": "['prod']" },
"sink": { "name": "order_pd" }
},
{
"source": { "path": "['price']" },
"sink": { "name": "order_price" }
},
{
"source": { "path": "$['city'][0]['name']" },
"sink": { "name": "city" }
}
],
"collectionReference": "$['orders']"
}
},
...
}
{
"name": "CopyActivityHierarchicalToTabular",
"type": "Copy",
"typeProperties": {
"source": {...},
"sink": {...},
"translator": {
"value": "@pipeline().parameters.mapping",
"type": "Expression"
},
...
}
}
3. Construct the value to pass into the mapping parameter. It should be the entire object of translator
definition, refer to the samples in explicit mapping section. For example, for tabular source to tabular sink
copy, the value should be
{"type":"TabularTranslator","mappings":[{"source":{"name":"Id"},"sink":{"name":"CustomerID"}},
{"source":{"name":"Name"},"sink":{"name":"LastName"}},{"source":{"name":"LastModifiedDate"},"sink":
{"name":"ModifiedDate"}}]}
.
The supported interim data type conversions (source \ sink) are as follows:
Boolean: converts to Boolean, Decimal, Float-point, Integer, and String.
Byte array: converts to Byte array and String.
Date/Time (1): converts to Date/Time and String.
Decimal: converts to Boolean, Decimal, Float-point, Integer, and String.
Float-point (2): converts to Boolean, Decimal, Float-point, Integer, and String.
GUID: converts to GUID and String.
Integer (3): converts to Boolean, Decimal, Float-point, Integer, and String.
String: converts to Boolean, Byte array, Decimal, Date/Time, Float-point, GUID, Integer, String, and TimeSpan.
TimeSpan: converts to String and TimeSpan.
NOTE
Currently such data type conversion is supported when copying between tabular data. Hierarchical sources/sinks are
not supported, which means there is no system-defined data type conversion between source and sink interim types.
This feature works with the latest dataset model. If you don't see this option from the UI, try creating a new dataset.
The following properties are supported in copy activity for data type conversion (under translator section for
programmatical authoring):
Under typeConversionSettings
Example:
{
"name": "CopyActivity",
"type": "Copy",
"typeProperties": {
"source": {
"type": "ParquetSource"
},
"sink": {
"type": "SqlSink"
},
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": true,
"dateTimeFormat": "yyyy-MM-dd HH:mm:ss.fff",
"dateTimeOffsetFormat": "yyyy-MM-dd HH:mm:ss.fff zzz",
"timeSpanFormat": "dd\.hh\:mm",
"culture": "en-gb"
}
}
},
...
}
Legacy models
NOTE
The following models to map source columns/fields to sink are still supported as is for backward compatibility. We suggest
that you use the new model mentioned in schema mapping. Data Factory authoring UI has switched to generating the
new model.
{
"name": "OracleDataset",
"properties": {
"structure":
[
{ "name": "UserId"},
{ "name": "Name"},
{ "name": "Group"}
],
"type": "OracleTable",
"linkedServiceName": {
"referenceName": "OracleLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SourceTable"
}
}
}
In this sample, the output dataset has a structure and it points to a table in Salesforce.
{
"name": "SalesforceDataset",
"properties": {
"structure":
[
{ "name": "MyUserId"},
{ "name": "MyName" },
{ "name": "MyGroup"}
],
"type": "SalesforceObject",
"linkedServiceName": {
"referenceName": "SalesforceLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"tableName": "SinkTable"
}
}
}
The following JSON defines a copy activity in a pipeline. The columns from the source are mapped to columns in the sink by using the translator -> columnMappings property.
{
"name": "CopyActivity",
"type": "Copy",
"inputs": [
{
"referenceName": "OracleDataset",
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "SalesforceDataset",
"type": "DatasetReference"
}
],
"typeProperties": {
"source": { "type": "OracleSource" },
"sink": { "type": "SalesforceSink" },
"translator":
{
"type": "TabularTranslator",
"columnMappings":
{
"UserId": "MyUserId",
"Group": "MyGroup",
"Name": "MyName"
}
}
}
}
If you are using the syntax of "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName" to specify
column mapping, it is still supported as-is.
Alternative schema-mapping (legacy model)
You can specify copy activity -> translator -> schemaMapping to map between hierarchical-shaped data and
tabular-shaped data, for example, copy from MongoDB/REST to text file and copy from Oracle to Azure Cosmos
DB's API for MongoDB. The following properties are supported in copy activity translator section:
{
"id": {
"$oid": "592e07800000000000000000"
},
"number": "01",
"date": "20170122",
"orders": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "name": "Seattle" } ]
}
and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array (order_pd and order_price) and cross joining with the common root info (number, date, and city):
ORDERNUMBER    ORDERDATE    ORDER_PD    ORDER_PRICE    CITY
01             20170122     P1          23             Seattle
01             20170122     P2          13             Seattle
Configure the schema-mapping rule as the following copy activity JSON sample:
{
"name": "CopyFromMongoDBToOracle",
"type": "Copy",
"typeProperties": {
"source": {
"type": "MongoDbV2Source"
},
"sink": {
"type": "OracleSink"
},
"translator": {
"type": "TabularTranslator",
"schemaMapping": {
"$.number": "orderNumber",
"$.date": "orderDate",
"prod": "order_pd",
"price": "order_price",
"$.city[0].name": "city"
},
"collectionReference": "$.orders"
}
}
}
Next steps
See the other Copy Activity articles:
Copy activity overview
Fault tolerance of copy activity in Azure Data
Factory
path    The path of the log files.    Specify the path that you want to use to store the log files. If you do not provide a path, the service creates a container for you.    Required: No
NOTE
The following are the prerequisites for enabling fault tolerance in the copy activity when copying binary files. For skipping particular files when they are being deleted from the source store:
The source dataset and sink dataset have to be in binary format, and the compression type cannot be specified.
The supported data store types are Azure Blob storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, SFTP, Amazon S3, Google Cloud Storage and HDFS.
Only when you specify multiple files in the source dataset, which can be a folder, a wildcard, or a list of files, can the copy activity skip the particular error files. If a single file is specified in the source dataset to be copied to the destination, the copy activity fails if any error occurs.
For skipping particular files when access to them is forbidden from the source store:
The source dataset and sink dataset have to be in binary format, and the compression type cannot be specified.
The supported data store types are Azure Blob storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, SFTP, Amazon S3 and HDFS.
Only when you specify multiple files in the source dataset, which can be a folder, a wildcard, or a list of files, can the copy activity skip the particular error files. If a single file is specified in the source dataset to be copied to the destination, the copy activity fails if any error occurs.
For skipping particular files when they are verified to be inconsistent between the source and destination store:
You can get more details from the data consistency doc here. A configuration sketch follows this note.
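A minimal sketch of the copy activity type properties for skipping problem binary files; the fileMissing flag is assumed here alongside the fileForbidden and dataInconsistency flags that appear elsewhere in this article.
"typeProperties": {
    "source": {
        "type": "BinarySource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings"
        }
    },
    "skipErrorFile": {
        "fileMissing": true,
        "fileForbidden": true,
        "dataInconsistency": true
    }
}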
Monitoring
Output from copy activity
You can get the number of files being read, written, and skipped via the output of each copy activity run.
"output": {
"dataRead": 695,
"dataWritten": 186,
"filesRead": 3,
"filesWritten": 1,
"filesSkipped": 2,
"throughput": 297,
"logFilePath": "myfolder/a84bf8d4-233f-4216-8cb5-45962831cd1b/",
"dataConsistencyVerification":
{
"VerificationResult": "Verified",
"InconsistentData": "Skipped"
}
}
COLUMN    DESCRIPTION
Level    The log level of this item. It will be at the 'Warning' level for the item showing file skipping.
Timestamp,Level,OperationName,OperationItem,Message
2020-03-24 05:35:41.0209942,Warning,FileSkip,"bigfile.csv","File is skipped after read 322961408 bytes:
ErrorCode=UserErrorSourceBlobNotExist,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Mes
sage=The required Blob is missing. ContainerName:
https://transferserviceonebox.blob.core.windows.net/skipfaultyfile, path:
bigfile.csv.,Source=Microsoft.DataTransfer.ClientLibrary,'."
2020-03-24 05:38:41.2595989,Warning,FileSkip,"3_nopermission.txt","File is skipped after read 0 bytes:
ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message
=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Forbidden'. Account:
'adlsgen2perfsource'. FileSystem: 'skipfaultyfilesforbidden'. Path: '3_nopermission.txt'. ErrorCode:
'AuthorizationPermissionMismatch'. Message: 'This request is not authorized to perform this operation using
this permission.'. RequestId: '35089f5d-101f-008c-489e-
01cce4000000'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.DataTransfer.Common.Shared.Hybr
idDeliveryException,Message=Operation returned an invalid status code
'Forbidden',Source=,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message='Type=Microsoft.
Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code
'Forbidden',Source=Microsoft.DataTransfer.ClientLibrary,',Source=Microsoft.DataTransfer.ClientLibrary,'."
From the log above, you can see that bigfile.csv has been skipped because another application deleted the file while ADF was copying it, and that 3_nopermission.txt has been skipped because ADF is not allowed to access it due to a permission issue.
NOTE
To load data into Azure Synapse Analytics using PolyBase, configure PolyBase's native fault tolerance settings by specifying reject policies via "polyBaseSettings" in the copy activity (see the sketch after this note). You can still enable redirecting PolyBase-incompatible rows to Blob or ADLS as normal, as shown below.
This feature doesn't apply when the copy activity is configured to invoke Amazon Redshift Unload.
This feature doesn't apply when the copy activity is configured to invoke a stored procedure from a SQL sink.
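As a sketch only, a Synapse sink configured with PolyBase reject settings might look like the following; the reject values are placeholders.
"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "rejectType": "percentage",
        "rejectValue": 10.0,
        "rejectSampleValue": 100,
        "useTypeDefault": true
    }
}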
Configuration
The following example provides a JSON definition to configure skipping the incompatible rows in copy activity:
"typeProperties": {
"source": {
"type": "AzureSqlSource"
},
"sink": {
"type": "AzureSqlSink"
},
"enableSkipIncompatibleRow": true,
"logSettings": {
"enableCopyActivityLog": true,
"copyActivityLogSettings": {
"logLevel": "Warning",
"enableReliableLogging": false
},
"logLocationSettings": {
"linkedServiceName": {
"referenceName": "ADLSGen2",
"type": "LinkedServiceReference"
},
"path": "sessionlog/"
}
}
},
path    The path of the log files that contains the skipped rows.    Specify the path that you want to use to log the incompatible data. If you do not provide a path, the service creates a container for you.    Required: No
If you configure logging of the incompatible rows, you can find the log file at this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/copyactivity-logs/[copy-activity-
name]/[copy-activity-run-id]/[auto-generated-GUID].csv
.
The log files will be csv files. The schema of the log file is as follows:
COLUMN    DESCRIPTION
Level    The log level of this item. It will be at the 'Warning' level if this item shows the skipped rows.
From the sample log file above, you can see that one row "data1, data2, data3" has been skipped due to a type conversion issue from the source to the destination store. Another row "data4, data5, data6" has been skipped due to a PK violation issue from the source to the destination store.
path    The path of the log file that contains the skipped rows.    Specify the path that you want to use to log the incompatible data. If you do not provide a path, the service creates a container for you.    Required: No
"output": {
"dataRead": 95,
"dataWritten": 186,
"rowsCopied": 9,
"rowsSkipped": 2,
"copyDuration": 16,
"throughput": 0.01,
"redirectRowPath": "https://myblobstorage.blob.core.windows.net//myfolder/a84bf8d4-233f-4216-
8cb5-45962831cd1b/",
"errors": []
},
If you configure logging of the incompatible rows, you can find the log file at this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/[copy-activity-run-id]/[auto-
generated-GUID].csv
.
The log files can only be csv files. The original data being skipped will be logged with a comma as the column delimiter if needed. We add two more columns, "ErrorCode" and "ErrorMessage", in addition to the original source data in the log file, where you can see the root cause of the incompatibility. The ErrorCode and ErrorMessage will be quoted by double quotes.
An example of the log file content is as follows:
data1, data2, data3, "UserErrorInvalidDataValue", "Column 'Prop_2' contains an invalid value 'data3'. Cannot
convert 'data3' to type 'DateTime'."
data4, data5, data6, "2627", "Violation of PRIMARY KEY constraint 'PK_tblintstrdatetimewithpk'. Cannot
insert duplicate key in object 'dbo.tblintstrdatetimewithpk'. The duplicate key value is (data4)."
Next steps
See the other copy activity articles:
Copy activity overview
Copy activity performance
Data consistency verification in copy activity
Configuration
The following example provides a JSON definition to enable data consistency verification in Copy Activity:
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureDataLakeStoreReadSettings",
"recursive": true
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureDataLakeStoreWriteSettings"
}
},
"validateDataConsistency": true,
"skipErrorFile": {
"dataInconsistency": true
},
"logSettings": {
"enableCopyActivityLog": true,
"copyActivityLogSettings": {
"logLevel": "Warning",
"enableReliableLogging": false
},
"logLocationSettings": {
"linkedServiceName": {
"referenceName": "ADLSGen2",
"type": "LinkedServiceReference"
},
"path": "sessionlog/"
}
}
}
path    The path of the log files.    Specify the path that you want to use to store the log files. If you do not provide a path, the service creates a container for you.    Required: No
NOTE
When copying binary files from or to Azure Blob or Azure Data Lake Storage Gen2, ADF does block-level MD5 checksum verification leveraging the Azure Blob API and the Azure Data Lake Storage Gen2 API. If ContentMD5 exists on files in Azure Blob or Azure Data Lake Storage Gen2 as data sources, ADF does file-level MD5 checksum verification after reading the files as well. After copying files to Azure Blob or Azure Data Lake Storage Gen2 as the data destination, ADF writes ContentMD5 to Azure Blob or Azure Data Lake Storage Gen2, which can be further consumed by downstream applications for data consistency verification.
ADF does file size verification when copying binary files between any storage stores.
Monitoring
Output from copy activity
After the copy activity runs completely, you can see the result of data consistency verification from the output of
each copy activity run:
"output": {
"dataRead": 695,
"dataWritten": 186,
"filesRead": 3,
"filesWritten": 1,
"filesSkipped": 2,
"throughput": 297,
"logFilePath": "myfolder/a84bf8d4-233f-4216-8cb5-45962831cd1b/",
"dataConsistencyVerification":
{
"VerificationResult": "Verified",
"InconsistentData": "Skipped"
}
}
You can see the details of data consistency verification from the "dataConsistencyVerification" property.
Value of VerificationResult :
Verified : Your copied data has been verified to be consistent between source and destination store.
NotVerified : Your copied data has not been verified to be consistent because you have not enabled the validateDataConsistency setting in the copy activity.
Unsupported : Your copied data has not been verified to be consistent because data consistency verification is not supported for this particular copy pair.
Value of InconsistentData :
Found : ADF copy activity has found inconsistent data.
Skipped : ADF copy activity has found and skipped inconsistent data.
None : ADF copy activity has not found any inconsistent data. It can be either because your data has been
verified to be consistent between source and destination store or because you disabled
validateDataConsistency in copy activity.
Session log from copy activity
If you configure logging of the inconsistent files, you can find the log file at this path:
https://[your-blob-account].blob.core.windows.net/[path-if-configured]/copyactivity-logs/[copy-activity-
name]/[copy-activity-run-id]/[auto-generated-GUID].csv
. The log files will be the csv files.
The schema of a log file is as follows:
COLUMN    DESCRIPTION
Level    The log level of this item. It will be at the 'Warning' level for the item showing file skipping.
From the log file above, you can see that sample1.csv has been skipped because it failed to be verified as consistent between the source and destination store. It became inconsistent because it was being changed by other applications while the ADF copy activity was copying it.
Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity fault tolerance
Session log in copy activity
Configuration
The following example provides a JSON definition to enable session log in Copy Activity:
"typeProperties": {
"source": {
"type": "BinarySource",
"storeSettings": {
"type": "AzureDataLakeStoreReadSettings",
"recursive": true
},
"formatSettings": {
"type": "BinaryReadSettings"
}
},
"sink": {
"type": "BinarySink",
"storeSettings": {
"type": "AzureBlobFSWriteSettings"
}
},
"skipErrorFile": {
"fileForbidden": true,
"dataInconsistency": true
},
"validateDataConsistency": true,
"logSettings": {
"enableCopyActivityLog": true,
"copyActivityLogSettings": {
"logLevel": "Warning",
"enableReliableLogging": false
},
"logLocationSettings": {
"linkedServiceName": {
"referenceName": "ADLSGen2",
"type": "LinkedServiceReference"
},
"path": "sessionlog/"
}
}
}
path    The path of the log files.    Specify the path that you want to use to store the log files. If you do not provide a path, the service creates a container for you.    Required: No
Monitoring
Output from copy activity
After the copy activity runs completely, you can see the path of log files from the output of each copy activity
run. You can find the log files from the path:
https://[your-blob-account].blob.core.windows.net/[logFilePath]/copyactivity-logs/[copy-activity-name]/[copy-
activity-run-id]/[auto-generated-GUID].txt
. The log files generated have the .txt extension and their data is in CSV format.
"output": {
"dataRead": 695,
"dataWritten": 186,
"filesRead": 3,
"filesWritten": 1,
"filesSkipped": 2,
"throughput": 297,
"logFilePath": "myfolder/a84bf8d4-233f-4216-8cb5-45962831cd1b/",
"dataConsistencyVerification":
{
"VerificationResult": "Verified",
"InconsistentData": "Skipped"
}
}
NOTE
When the enableCopyActivityLog property is set to Enabled , the log file names are system generated.
COLUMN    DESCRIPTION
Timestamp    The timestamp when ADF reads, writes, or skips the object.
Message    More information to show whether the file has been read from the source store or written to the destination store. It can also show why the file or rows were skipped.
From the log file above, you can see that sample1.csv has been skipped because it failed to be verified as consistent between the source and destination store. It became inconsistent because it was being changed by other applications while the ADF copy activity was copying it. You can also see that sample2.csv was successfully copied from the source to the destination store.
You can use multiple analysis engines to further analyze the log files. Below are a few examples that use SQL queries to analyze the log file after importing the csv log file into a SQL database table named, for example, SessionLogDemo.
Give me the copied file list.
select OperationItem from SessionLogDemo where Message like '%File is successfully copied%'
Give me the file list copied within a time range.
select OperationItem from SessionLogDemo where TIMESTAMP >= '<start time>' and TIMESTAMP <= '<end time>' and
Message like '%File is successfully copied%'
Give me a list of files with their metadata copied within a time range.
select * from SessionLogDemo where OperationName='FileRead' and Message like 'Start to read%' and
OperationItem in (select OperationItem from SessionLogDemo where TIMESTAMP >= '<start time>' and TIMESTAMP
<= '<end time>' and Message like '%File is successfully copied%')
Give me the list of files skipped because the source blob no longer exists.
select TIMESTAMP, OperationItem, Message from SessionLogDemo where OperationName='FileSkip' and Message like
'%UserErrorSourceBlobNotExist%'
Give me the file name that requires the longest time to copy.
Next steps
See the other Copy Activity articles:
Copy activity overview
Copy activity fault tolerance
Copy activity data consistency
Supported file formats and compression codecs in
Azure Data Factory (legacy)
IMPORTANT
Data Factory introduced new format-based dataset model, see corresponding format article with details:
- Avro format
- Binary format
- Delimited text format
- JSON format
- ORC format
- Parquet format
The rest of the configurations mentioned in this article are still supported as-is for backward compatibility. We suggest that you use the new model going forward.
If you want to read from a text file or write to a text file, set the type property in the format section of the
dataset to TextFormat . You can also specify the following optional properties in the format section. See
TextFormat example section on how to configure.
TextFormat example
In the following JSON definition for a dataset, some of the optional properties are specified.
"typeProperties":
{
"folderPath": "mycontainer/myfolder",
"fileName": "myblobname",
"format":
{
"type": "TextFormat",
"columnDelimiter": ",",
"rowDelimiter": ";",
"quoteChar": "\"",
"NullValue": "NaN",
"firstRowAsHeader": true,
"skipLineCount": 0,
"treatEmptyAsNull": true
}
},
To use an escapeChar instead of a quoteChar, replace the quoteChar line with the following escapeChar line:
"escapeChar": "$",
To import/export a JSON file as-is into/from Azure Cosmos DB , see the Import/export JSON documents section in the Move data to/from Azure Cosmos DB article.
If you want to parse the JSON files or write the data in JSON format, set the type property in the format
section to JsonFormat . You can also specify the following optional properties in the format section. See
JsonFormat example section on how to configure.
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
}
{"time":"2015-04-
29T07:12:20.9100000Z","callingimsi":"466920403025604","callingnum1":"678948008","callingnum2":
"567834760","switch1":"China","switch2":"Germany"}
{"time":"2015-04-
29T07:13:21.0220000Z","callingimsi":"466922202613463","callingnum1":"123436380","callingnum2":
"789037573","switch1":"US","switch2":"UK"}
{"time":"2015-04-
29T07:13:21.4370000Z","callingimsi":"466923101048691","callingnum1":"678901578","callingnum2":
"345626404","switch1":"Germany","switch2":"UK"}
[
{
"time": "2015-04-29T07:12:20.9100000Z",
"callingimsi": "466920403025604",
"callingnum1": "678948008",
"callingnum2": "567834760",
"switch1": "China",
"switch2": "Germany"
},
{
"time": "2015-04-29T07:13:21.0220000Z",
"callingimsi": "466922202613463",
"callingnum1": "123436380",
"callingnum2": "789037573",
"switch1": "US",
"switch2": "UK"
},
{
"time": "2015-04-29T07:13:21.4370000Z",
"callingimsi": "466923101048691",
"callingnum1": "678901578",
"callingnum2": "345626404",
"switch1": "Germany",
"switch2": "UK"
}
]
JsonFormat example
Case 1: Copying data from JSON files
Sample 1: extract data from object and array
In this sample, you expect one root JSON object to map to a single record in the tabular result. If you have a JSON file with the following content:
{
"id": "ed0e4960-d9c5-11e6-85dc-d7996816aad3",
"context": {
"device": {
"type": "PC"
},
"custom": {
"dimensions": [
{
"TargetResourceType": "Microsoft.Compute/virtualMachines"
},
{
"ResourceManagementProcessRunId": "827f8aaa-ab72-437c-ba48-d8917a7336a3"
},
{
"OccurrenceTime": "1/13/2017 11:24:37 AM"
}
]
}
}
}
and you want to copy it into an Azure SQL table in the following format, by extracting data from both objects
and array:
ID    DEVICETYPE    TARGETRESOURCETYPE    RESOURCEMANAGEMENTPROCESSRUNID    OCCURRENCETIME
The input dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically:
The structure section defines the customized column names and the corresponding data type while converting to tabular data. This section is optional unless you need to do column mapping. For more information, see Map source dataset columns to destination dataset columns.
jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. To
copy data from an array, you can use array[x].property to extract the value of the given property from the xth
object, or you can use array[*].property to find the value from any object that contains the property.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "deviceType",
"type": "String"
},
{
"name": "targetResourceType",
"type": "String"
},
{
"name": "resourceManagementProcessRunId",
"type": "String"
},
{
"name": "occurrenceTime",
"type": "DateTime"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonPathDefinition": {"id": "$.id", "deviceType": "$.context.device.type",
"targetResourceType": "$.context.custom.dimensions[0].TargetResourceType", "resourceManagementProcessRunId":
"$.context.custom.dimensions[1].ResourceManagementProcessRunId", "occurrenceTime": "
$.context.custom.dimensions[2].OccurrenceTime"}
}
}
}
Sample 2: cross apply multiple objects with the same pattern from array
In this sample, one root JSON object is transformed into multiple records in the tabular result. If you have
a JSON file with the following content:
{
"ordernumber": "01",
"orderdate": "20170122",
"orderlines": [
{
"prod": "p1",
"price": 23
},
{
"prod": "p2",
"price": 13
},
{
"prod": "p3",
"price": 231
}
],
"city": [ { "sanmateo": "No 1" } ]
}
and you want to copy it into an Azure SQL table in the following format, by flattening the data inside the array
and cross joining it with the common root info:

ORDERNUMBER | ORDERDATE | ORDER_PD | ORDER_PRICE | CITY
01 | 20170122 | P1 | 23 | [{"sanmateo":"No 1"}]
01 | 20170122 | P2 | 13 | [{"sanmateo":"No 1"}]
01 | 20170122 | P3 | 231 | [{"sanmateo":"No 1"}]
The input dataset with JsonFormat type is defined as follows: (partial definition with only the relevant parts).
More specifically:
structure section defines the customized column names and the corresponding data type while converting
to tabular data. This section is optional unless you need to do column mapping. For more information, see
Map source dataset columns to destination dataset columns.
jsonNodeReference indicates to iterate and extract data from the objects with the same pattern under array
orderlines .
jsonPathDefinition specifies the JSON path for each column, indicating where to extract the data from. In
this example, ordernumber, orderdate, and city are under the root object with JSON paths starting with $.,
while order_pd and order_price are defined with paths derived from the array element, without the $. prefix.
"properties": {
"structure": [
{
"name": "ordernumber",
"type": "String"
},
{
"name": "orderdate",
"type": "String"
},
{
"name": "order_pd",
"type": "String"
},
{
"name": "order_price",
"type": "Int64"
},
{
"name": "city",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat",
"filePattern": "setOfObjects",
"jsonNodeReference": "$.orderlines",
"jsonPathDefinition": {"ordernumber": "$.ordernumber", "orderdate": "$.orderdate", "order_pd":
"prod", "order_price": "price", "city": " $.city"}
}
}
}
Case 2: Copying data to JSON files
For each record from the source (for example, a table row with id, date, price, and customer columns), you want to write a JSON object in the following format:
{
"id": "1",
"order": {
"date": "20170119",
"price": 2000,
"customer": "David"
}
}
The output dataset with JsonFormat type is defined as follows (partial definition with only the relevant parts).
More specifically, the structure section defines the customized property names in the destination file, and
nestingSeparator (default is ".") is used to identify the nesting level from the name. This section is optional
unless you want to change the property names compared with the source column names, or to nest some of the
properties.
"properties": {
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "order.date",
"type": "String"
},
{
"name": "order.price",
"type": "Int64"
},
{
"name": "order.customer",
"type": "String"
}
],
"typeProperties": {
"folderPath": "mycontainer/myfolder",
"format": {
"type": "JsonFormat"
}
}
}
If you want to parse the Parquet files or write the data in Parquet format, set the format type property to
ParquetFormat . You do not need to specify any properties in the Format section within the typeProperties
section. Example:
"format":
{
"type": "ParquetFormat"
}
For copy running on the Self-hosted IR with Parquet file serialization/deserialization, ADF locates the Java runtime
by first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for
JRE and, if not found, by checking the system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires 64-bit JRE. You can find it from here.
To use OpenJDK: it's supported since IR version 3.13. Package the jvm.dll with all other required assemblies
of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.
TIP
If you copy data to/from Parquet format using the Self-hosted Integration Runtime and hit an error saying "An error occurred
when invoking java, message: java.lang.OutOfMemoryError: Java heap space", you can add an environment variable
_JAVA_OPTIONS on the machine that hosts the Self-hosted IR to adjust the min/max heap size for the JVM to enable such a
copy, and then rerun the pipeline.
Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory
allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool.
This means that the JVM starts with Xms amount of memory and can use at most Xmx amount of memory.
By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.
Data type mapping for Parquet files
DATA FACTORY INTERIM DATA TYPE | PARQUET PRIMITIVE TYPE | PARQUET ORIGINAL TYPE (DESERIALIZE) | PARQUET ORIGINAL TYPE (SERIALIZE)
If you want to parse the ORC files or write the data in ORC format, set the format type property to
OrcFormat . You do not need to specify any properties in the Format section within the typeProperties section.
Example:
"format":
{
"type": "OrcFormat"
}
IMPORTANT
For copy empowered by the Self-hosted Integration Runtime, for example between on-premises and cloud data stores, if you are not
copying ORC files as-is, you need to install the 64-bit JRE 8 (Java Runtime Environment) or OpenJDK on your IR
machine. See the following paragraph for more details.
For copy running on the Self-hosted IR with ORC file serialization/deserialization, ADF locates the Java runtime by
first checking the registry (SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome) for JRE and, if
not found, by checking the system variable JAVA_HOME for OpenJDK.
To use JRE: The 64-bit IR requires 64-bit JRE. You can find it from here.
To use OpenJDK: it's supported since IR version 3.13. Package the jvm.dll with all other required assemblies
of OpenJDK onto the Self-hosted IR machine, and set the system environment variable JAVA_HOME accordingly.
Data type mapping for ORC files
DATA FACTORY INTERIM DATA TYPE | ORC TYPES
Boolean | Boolean
SByte | Byte
Byte | Short
Int16 | Short
UInt16 | Int
Int32 | Int
UInt32 | Long
Int64 | Long
UInt64 | String
Single | Float
Double | Double
Decimal | Decimal
String | String
DateTime | Timestamp
DateTimeOffset | Timestamp
TimeSpan | Timestamp
ByteArray | Binary
Guid | String
Char | Char(1)
If you want to parse the Avro files or write the data in Avro format, set the format type property to
AvroFormat . You do not need to specify any properties in the Format section within the typeProperties section.
Example:
"format":
{
"type": "AvroFormat",
}
To use Avro format in a Hive table, you can refer to Apache Hive's tutorial.
Note the following points:
Complex data types are not supported (records, enums, arrays, maps, unions, and fixed).
To specify compression for a dataset, use the compression property in the dataset JSON as in the following
example:
{
"name": "AzureBlobDataSet",
"properties": {
"type": "AzureBlob",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"fileName": "pagecounts.csv.gz",
"folderPath": "compression/file/",
"format": {
"type": "TextFormat"
},
"compression": {
"type": "GZip",
"level": "Optimal"
}
}
}
}
NOTE
Compression settings are not supported for data in the AvroFormat , OrcFormat , or ParquetFormat . When reading
files in these formats, Data Factory detects and uses the compression codec in the metadata. When writing to files in
these formats, Data Factory chooses the default compression codec for that format. For example, ZLIB for OrcFormat and
SNAPPY for ParquetFormat.
Next steps
Learn the latest supported file formats and compressions from Supported file formats and compressions.
Transform data in Azure Data Factory
5/26/2021 • 5 minutes to read • Edit Online
Overview
This article explains data transformation activities in Azure Data Factory that you can use to transform and
process your raw data into predictions and insights at scale. A transformation activity executes in a computing
environment such as Azure Databricks or Azure HDInsight. It provides links to articles with detailed information
on each transformation activity.
Data Factory supports the following data transformation activities that can be added to pipelines either
individually or chained with another activity.
External transformations
Optionally, you can hand-code transformations and manage the external compute environment yourself.
HDInsight Hive activity
The HDInsight Hive activity in a Data Factory pipeline executes Hive queries on your own or on-demand
Windows/Linux-based HDInsight cluster. See Hive activity article for details about this activity.
HDInsight Pig activity
The HDInsight Pig activity in a Data Factory pipeline executes Pig queries on your own or on-demand
Windows/Linux-based HDInsight cluster. See Pig activity article for details about this activity.
HDInsight MapReduce activity
The HDInsight MapReduce activity in a Data Factory pipeline executes MapReduce programs on your own or on-
demand Windows/Linux-based HDInsight cluster. See MapReduce activity article for details about this activity.
HDInsight Streaming activity
The HDInsight Streaming activity in a Data Factory pipeline executes Hadoop Streaming programs on your own
or on-demand Windows/Linux-based HDInsight cluster. See HDInsight Streaming activity for details about this
activity.
HDInsight Spark activity
The HDInsight Spark activity in a Data Factory pipeline executes Spark programs on your own HDInsight cluster.
For details, see Invoke Spark programs from Azure Data Factory.
Azure Machine Learning Studio (classic) activities
Azure Data Factory enables you to easily create pipelines that use a published Azure Machine Learning Studio
(classic) web service for predictive analytics. Using the Batch Execution activity in an Azure Data Factory pipeline,
you can invoke a Studio (classic) web service to make predictions on the data in batch.
Over time, the predictive models in the Studio (classic) scoring experiments need to be retrained using new
input datasets. After you are done with retraining, you want to update the scoring web service with the retrained
machine learning model. You can use the Update Resource activity to update the web service with the newly
trained model.
See Use Azure Machine Learning Studio (classic) activities for details about these Studio (classic) activities.
Stored procedure activity
You can use the SQL Server Stored Procedure activity in a Data Factory pipeline to invoke a stored procedure in
one of the following data stores: Azure SQL Database, Azure Synapse Analytics, SQL Server Database in your
enterprise or an Azure VM. See Stored Procedure activity article for details.
Data Lake Analytics U -SQL activity
Data Lake Analytics U-SQL activity runs a U-SQL script on an Azure Data Lake Analytics cluster. See Data
Analytics U-SQL activity article for details.
Synapse Notebook activity
The Azure Synapse Notebook Activity in a Synapse pipeline runs a Synapse notebook in your Azure Synapse
workspace. See Transform data by running a Synapse notebook.
Databricks Notebook activity
The Azure Databricks Notebook Activity in a Data Factory pipeline runs a Databricks notebook in your Azure
Databricks workspace. Azure Databricks is a managed platform for running Apache Spark. See Transform data
by running a Databricks notebook.
Databricks Jar activity
The Azure Databricks Jar Activity in a Data Factory pipeline runs a Spark Jar in your Azure Databricks cluster.
Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a Jar activity
in Azure Databricks.
Databricks Python activity
The Azure Databricks Python Activity in a Data Factory pipeline runs a Python file in your Azure Databricks
cluster. Azure Databricks is a managed platform for running Apache Spark. See Transform data by running a
Python activity in Azure Databricks.
Custom activity
If you need to transform data in a way that is not supported by Data Factory, you can create a custom activity
with your own data processing logic and use the activity in the pipeline. You can configure the custom .NET
activity to run using either an Azure Batch service or an Azure HDInsight cluster. See Use custom activities article
for details.
You can create a custom activity to run R scripts on your HDInsight cluster with R installed. See Run R Script
using Azure Data Factory.
Compute environments
You create a linked service for the compute environment and then use the linked service when defining a
transformation activity. There are two types of compute environments supported by Data Factory.
On-Demand : In this case, the computing environment is fully managed by Data Factory. It is automatically
created by the Data Factory service before a job is submitted to process data and removed when the job is
completed. You can configure and control granular settings of the on-demand compute environment for job
execution, cluster management, and bootstrapping actions.
Bring Your Own : In this case, you can register your own computing environment (for example HDInsight
cluster) as a linked service in Data Factory. The computing environment is managed by you and the Data
Factory service uses it to execute the activities.
See Compute Linked Services article to learn about compute services supported by Data Factory.
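As a sketch of the Bring Your Own option, an existing HDInsight cluster might be registered with a linked service similar to the following; the cluster URI, credentials, and storage linked service name are placeholders:

{
    "name": "MyHDInsightLinkedService",
    "properties": {
        "type": "HDInsight",
        "typeProperties": {
            "clusterUri": "https://<clustername>.azurehdinsight.net",
            "userName": "<cluster login user name>",
            "password": {
                "type": "SecureString",
                "value": "<cluster login password>"
            },
            "linkedServiceName": {
                "referenceName": "MyAzureStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        }
    }
}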
Next steps
See the following tutorial for an example of using a transformation activity: Tutorial: transform data using Spark
Data Flow activity in Azure Data Factory
5/25/2021 • 6 minutes to read • Edit Online
Syntax
{
"name": "MyDataFlowActivity",
"type": "ExecuteDataFlow",
"typeProperties": {
"dataflow": {
"referenceName": "MyDataFlow",
"type": "DataFlowReference"
},
"compute": {
"coreCount": 8,
"computeType": "General"
},
"traceLevel": "Fine",
"runConcurrently": true,
"continueOnError": true,
"staging": {
"linkedService": {
"referenceName": "MyStagingLinkedService",
"type": "LinkedServiceReference"
},
"folderPath": "my-container/my-folder"
},
"integrationRuntime": {
"referenceName": "MyDataFlowIntegrationRuntime",
"type": "IntegrationRuntimeReference"
}
    }
}
Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
compute.coreCount | The number of cores used in the Spark cluster. Can only be specified if the auto-resolve Azure Integration Runtime is used. | 8, 16, 32, 48, 80, 144, 272 | No
staging.linkedService | If you're using an Azure Synapse Analytics source or sink, specify the storage account used for PolyBase staging. | LinkedServiceReference | Only if the data flow reads or writes to an Azure Synapse Analytics
staging.folderPath | If you're using an Azure Synapse Analytics source or sink, the folder path in the blob storage account used for PolyBase staging. | String | Only if the data flow reads or writes to Azure Synapse Analytics
NOTE
When choosing driver and worker node cores in Synapse Data Flows, a minimum of 3 nodes will always be utilized.
PolyBase
If you're using Azure Synapse Analytics as a sink or source, you must choose a staging location for your
PolyBase batch load. PolyBase allows for batch loading in bulk instead of loading the data row-by-row, which
drastically reduces the load time into Azure Synapse Analytics.
Logging level
If you do not require every pipeline execution of your data flow activities to fully log all verbose telemetry,
you can optionally set the logging level to "Basic" or "None". When executing your data flows in "Verbose"
mode (the default), you are requesting ADF to fully log activity at each individual partition level during your data
transformation. This can be an expensive operation, so enabling verbose logging only when troubleshooting can
improve your overall data flow and pipeline performance. "Basic" mode logs only transformation durations,
while "None" provides only a summary of durations.
Sink properties
The grouping feature in data flows allows you both to set the order of execution of your sinks and to group
sinks together using the same group number. To help manage groups, you can ask ADF to run sinks in the same
group in parallel. You can also set a sink group to continue even after one of the sinks encounters an error.
The default behavior of data flow sinks is to execute each sink sequentially, in a serial manner, and to fail the data
flow when an error is encountered in a sink. Additionally, all sinks default to the same group unless you
go into the data flow properties and set different priorities for the sinks.
First row only
This option is only available for data flows that have cache sinks enabled for "Output to activity". The output
from the data flow that is injected directly into your pipeline is limited to 2MB. Setting "first row only" helps you
to limit the data output from data flow when injecting the data flow activity output directly to your pipeline.
The debug pipeline runs against the active debug cluster, not the integration runtime environment specified in
the Data Flow activity settings. You can choose the debug compute environment when starting up debug mode.
For example, to get the number of rows written to a sink named 'sink1' in an activity named 'dataflowActivity',
use @activity('dataflowActivity').output.runStatus.metrics.sink1.rowsWritten .
To get the number of rows read from a source named 'source1' that was used in that sink, use
@activity('dataflowActivity').output.runStatus.metrics.sink1.sources.source1.rowsRead .
NOTE
If a sink has zero rows written, it will not show up in metrics. Existence can be verified using the contains function. For
example, contains(activity('dataflowActivity').output.runStatus.metrics, 'sink1') will check whether any
rows were written to sink1.
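For example, a hypothetical If Condition activity could gate downstream work on whether sink1 wrote any rows; the activity name and the empty branch arrays below are illustrative:

{
    "name": "IfSink1WroteRows",
    "type": "IfCondition",
    "dependsOn": [
        {
            "activity": "dataflowActivity",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "expression": {
            "value": "@contains(activity('dataflowActivity').output.runStatus.metrics, 'sink1')",
            "type": "Expression"
        },
        "ifTrueActivities": [],
        "ifFalseActivities": []
    }
}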
Next steps
See control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Power query activity in data factory
3/5/2021 • 2 minutes to read • Edit Online
The Power Query activity allows you to build and execute Power Query mash-ups to execute data wrangling at
scale in a Data Factory pipeline. You can create a new Power Query mash-up from the New resources menu
option or by adding a Power Query activity to your pipeline.
Previously, data wrangling in Azure Data Factory was authored from the Data Flow menu option. This has been
changed to authoring from a new Power Query activity. You can work directly inside of the Power Query mash-
up editor to perform interactive data exploration and then save your work. Once complete, you can take your
Power Query activity and add it to a pipeline. Azure Data Factory will automatically scale it out and
operationalize your data wrangling using Azure Data Factory's data flow Spark environment.
Next steps
Learn more about data wrangling concepts using Power Query in Azure Data Factory
Azure Function activity in Azure Data Factory
4/22/2021 • 3 minutes to read • Edit Online
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
function app url | URL for the Azure Function App. Format is https://<accountname>.azurewebsites.net. This URL is the value under the URL section when viewing your Function App in the Azure portal. | | yes
linked service | The Azure Function linked service for the corresponding Azure Function App. | Linked service reference | yes
method | REST API method for the function call. | String. Supported types: "GET", "POST", "PUT" | yes
body | Body that is sent along with the request to the function API method. | String (or expression with resultType of string) or object | Required for PUT/POST methods
See the schema of the request payload in Request payload schema section.
Next steps
Learn more about activities in Data Factory in Pipelines and activities in Azure Data Factory.
Use custom activities in an Azure Data Factory
pipeline
5/28/2021 • 12 minutes to read • Edit Online
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
IMPORTANT
When creating a new Azure Batch pool, 'VirtualMachineConfiguration' must be used and NOT
'CloudServiceConfiguration'. For more details, refer to the Azure Batch pool migration guidance.
To learn more about Azure Batch linked service, see Compute linked services article.
Custom activity
The following JSON snippet defines a pipeline with a simple Custom Activity. The activity definition has a
reference to the Azure Batch linked service.
{
"name": "MyCustomActivityPipeline",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "helloworld.exe",
"folderPath": "customactv2/helloworld",
"resourceLinkedService": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
}
}]
}
}
In this sample, helloworld.exe is a custom application stored in the customactv2/helloworld folder of the
Azure Storage account used in the resourceLinkedService. The Custom activity submits this custom application
to be executed on Azure Batch. You can replace the command with any preferred application that can be executed
on the target operating system of the Azure Batch pool nodes.
The following table describes names and descriptions of properties that are specific to this activity.
* The properties resourceLinkedService and folderPath must either both be specified or both be omitted.
NOTE
If you are passing linked services as referenceObjects in the Custom activity, it is a good security practice to pass an Azure
Key Vault enabled linked service (since it does not contain any secure strings) and to fetch the credentials by secret name
directly from Key Vault in your code. You can find an example here that references an AKV enabled linked service, retrieves
the credentials from Key Vault, and then accesses the storage in the code.
Executing commands
You can directly execute a command using Custom Activity. The following example runs the "echo hello world"
command on the target Azure Batch Pool nodes and prints the output to stdout.
{
"name": "MyCustomActivity",
"properties": {
"description": "Custom activity sample",
"activities": [{
"type": "Custom",
"name": "MyCustomActivity",
"linkedServiceName": {
"referenceName": "AzureBatchLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"command": "cmd /c echo hello world"
}
}]
}
}
When the activity is executed, referenceObjects and extendedProperties are stored in the following files that are
deployed to the same execution folder of the SampleApp.exe:
activity.json
linkedServices.json
datasets.json
using System;
using System.IO;
using Newtonsoft.Json;

namespace SampleApp
{
class Program
{
static void Main(string[] args)
{
//From Extend Properties
dynamic activity = JsonConvert.DeserializeObject(File.ReadAllText("activity.json"));
Console.WriteLine(activity.typeProperties.extendedProperties.connectionString.value);
// From LinkedServices
dynamic linkedServices = JsonConvert.DeserializeObject(File.ReadAllText("linkedServices.json"));
Console.WriteLine(linkedServices[0].properties.typeProperties.accountName);
}
}
}
When the pipeline is running, you can check the execution output using the following commands:
while ($True) {
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
if(!$result) {
Write-Host "Waiting for pipeline to start..." -foregroundcolor "Yellow"
}
elseif (($result | Where-Object { $_.Status -eq "InProgress" } | Measure-Object).count -ne 0) {
Write-Host "Pipeline run status: In Progress" -foregroundcolor "Yellow"
}
else {
Write-Host "Pipeline '"$pipelineName"' run finished. Result:" -foregroundcolor "Yellow"
$result
break
}
($result | Format-List | Out-String)
Start-Sleep -Seconds 15
}
The stdout and stderr of your custom application are saved to the adfjobs container in the Azure Storage
linked service you defined when creating the Azure Batch linked service, under a folder named with the GUID of the
task. You can get the detailed path from the Activity Run output as shown in the following snippet:
Pipeline ' MyCustomActivity' run finished. Result:
ResourceGroupName : resourcegroupname
DataFactoryName : datafactoryname
ActivityName : MyCustomActivity
PipelineRunId : xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
PipelineName : MyCustomActivity
Input : {command}
Output : {exitcode, outputs, effectiveIntegrationRuntime}
LinkedServiceName :
ActivityRunStart : 10/5/2017 3:33:06 PM
ActivityRunEnd : 10/5/2017 3:33:28 PM
DurationInMs : 21203
Status : Succeeded
Error : {errorCode, message, failureType, target}
If you would like to consume the content of stdout.txt in downstream activities, you can get the path to the
stdout.txt file using the expression "@activity('MyCustomActivity').output.outputs[0]".
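For instance, a hypothetical downstream Custom activity could pass this path along through extendedProperties; the command and property name below are illustrative:

{
    "name": "MyDownstreamActivity",
    "type": "Custom",
    "dependsOn": [
        {
            "activity": "MyCustomActivity",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "linkedServiceName": {
        "referenceName": "AzureBatchLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "command": "processoutput.exe",
        "extendedProperties": {
            "stdoutPath": "@activity('MyCustomActivity').output.outputs[0]"
        }
    }
}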
IMPORTANT
The activity.json, linkedServices.json, and datasets.json are stored in the runtime folder of the Batch task. For this
example, the activity.json, linkedServices.json, and datasets.json are stored in
https://adfv2storage.blob.core.windows.net/adfjobs/<GUID>/runtime/ path. If needed, you need to clean them
up separately.
For linked services that use the Self-Hosted Integration Runtime, sensitive information like keys or passwords is
encrypted by the Self-Hosted Integration Runtime to ensure that credentials stay in the customer-defined private network
environment. Some sensitive fields could be missing when referenced by your custom application code in this way. Use a
SecureString in extendedProperties instead of a linked service reference if needed.
This serialization is not truly secure, and is not intended to be secure. The intent is to hint to Data Factory to
mask the value in the Monitoring tab.
To access properties of type SecureString from a custom activity, read the activity.json file, which is placed in
the same folder as your .EXE, deserialize the JSON, and then access the JSON property (extendedProperties =>
[propertyName] => value).
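As a sketch, an extendedProperties section that passes such a SecureString (matching the connectionString property read by the preceding SampleApp code) might look like the following; the value is a placeholder:

"typeProperties": {
    "command": "SampleApp.exe",
    "extendedProperties": {
        "connectionString": {
            "type": "SecureString",
            "value": "<connection string>"
        }
    }
}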
 | CUSTOM ACTIVITY | VERSION 1 (CUSTOM) DOTNET ACTIVITY
Execution environment of the custom logic | Windows or Linux | Windows (.NET Framework 4.5.2)
Executing scripts | Supports executing scripts directly (for example "cmd /c echo hello world" on a Windows VM) | Requires implementation in the .NET DLL
Retrieve information in custom logic | Parses activity.json, linkedServices.json, and datasets.json stored in the same folder as the executable | Through .NET SDK (.NET Framework 4.5.2)
If you have existing .NET code written for a version 1 (Custom) DotNet Activity, you need to modify your code
for it to work with the current version of the Custom Activity. Update your code by following these high-level
guidelines:
Change the project from a .NET Class Library to a Console App.
Start your application with the Main method. The Execute method of the IDotNetActivity interface is no
longer required.
Read and parse the Linked Services, Datasets and Activity with a JSON serializer, and not as strongly-typed
objects. Pass the values of required properties to your main custom code logic. Refer to the preceding
SampleApp.exe code as an example.
The Logger object is no longer supported. Output from your executable can be printed to the console and is
saved to stdout.txt.
The Microsoft.Azure.Management.DataFactories NuGet package is no longer required.
Compile your code, upload the executable and its dependencies to Azure Storage, and define the path in the
folderPath property.
For a complete sample of how the end-to-end DLL and pipeline sample described in the Data Factory version 1
article Use custom activities in an Azure Data Factory pipeline can be rewritten as a Data Factory Custom
Activity, see Data Factory Custom Activity sample.
For example, you can create an Azure Batch pool with the autoscale feature, using a formula based on the number of pending tasks such as the following:

startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicated = min(maxNumberofVMs, pendingTaskSamples);
See Automatically scale compute nodes in an Azure Batch pool for details.
If the pool is using the default autoScaleEvaluationInterval, the Batch service could take 15-30 minutes to
prepare the VM before running the custom activity. If the pool is using a different autoScaleEvaluationInterval,
the Batch service could take autoScaleEvaluationInterval + 10 minutes.
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data by running a Jar activity in Azure
Databricks
3/5/2021 • 2 minutes to read • Edit Online
{
"name": "SparkJarActivity",
"type": "DatabricksSparkJar",
"linkedServiceName": {
"referenceName": "AzureDatabricks",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mainClassName": "org.apache.spark.examples.SparkPi",
"parameters": [ "10" ],
"libraries": [
{
"jar": "dbfs:/docs/sparkpi.jar"
}
]
}
}
PROPERTY | DESCRIPTION | REQUIRED
libraries | A list of libraries to be installed on the cluster that will execute the job. It can be an array of <string, object>. | Yes (at least one containing the mainClassName method)
NOTE
Known issue - When using the same interactive cluster for running concurrent Databricks Jar activities (without a cluster
restart), there is a known issue in Databricks where the parameters of the first activity are also used by the following
activities, resulting in incorrect parameters being passed to the subsequent jobs. To mitigate this, use a Job cluster
instead.
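As a hedged sketch, a Databricks linked service that uses a new job cluster for each run (instead of an existing interactive cluster) might look like the following; the domain, access token, runtime version, and node type are placeholders:

{
    "name": "AzureDatabricksJobCluster",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<region>.azuredatabricks.net",
            "accessToken": {
                "type": "SecureString",
                "value": "<access token>"
            },
            "newClusterVersion": "<Databricks runtime version>",
            "newClusterNumOfWorker": "2",
            "newClusterNodeType": "Standard_DS3_v2"
        }
    }
}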
For more information, see the Databricks documentation for library types.
Next steps
For an eleven-minute introduction and demonstration of this feature, watch the video.
Transform data by running a Databricks notebook
3/5/2021 • 2 minutes to read • Edit Online
{
"activity": {
"name": "MyActivity",
"description": "MyActivity description",
"type": "DatabricksNotebook",
"linkedServiceName": {
"referenceName": "MyDatabricksLinkedservice",
"type": "LinkedServiceReference"
},
"typeProperties": {
"notebookPath": "/Users/[email protected]/ScalaExampleNotebook",
"baseParameters": {
"inputpath": "input/folder1/",
"outputpath": "output/"
},
"libraries": [
{
"jar": "dbfs:/docs/library.jar"
}
]
}
}
}
For more details, see the Databricks documentation for library types.
IMPORTANT
If you are passing a JSON object, you can retrieve values by appending property names. Example:
@{activity('databricks notebook activity name').output.runOutput.PropertyName}
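For example, a hypothetical downstream notebook activity could consume that value through its baseParameters; the notebook path and parameter name here are illustrative:

"typeProperties": {
    "notebookPath": "/Users/[email protected]/DownstreamNotebook",
    "baseParameters": {
        "upstreamResult": "@{activity('MyActivity').output.runOutput.PropertyName}"
    }
}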
{
"activity": {
"name": "MyActivity",
"description": "MyActivity description",
"type": "DatabricksSparkPython",
"linkedServiceName": {
"referenceName": "MyDatabricksLinkedservice",
"type": "LinkedServiceReference"
},
"typeProperties": {
"pythonFile": "dbfs:/docs/pi.py",
"parameters": [
"10"
],
"libraries": [
{
"pypi": {
"package": "tensorflow"
}
}
]
}
}
}
{
"libraries": [
{
"jar": "dbfs:/mnt/libraries/library.jar"
},
{
"egg": "dbfs:/mnt/libraries/library.egg"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2",
"exclusions": [ "slf4j:slf4j" ]
}
},
{
"pypi": {
"package": "simplejson",
"repo": "http://my-pypi-mirror.com"
}
},
{
"cran": {
"package": "ada",
"repo": "https://cran.us.r-project.org"
}
}
]
}
{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "<account name>",
"dataLakeAnalyticsUri": "<azure data lake analytics URI>",
"servicePrincipalId": "<service principal id>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
},
"tenant": "<tenant info, e.g. microsoft.onmicrosoft.com>",
"subscriptionId": "<optional, subscription id of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
To learn more about the linked service, see Compute linked services.
The following table describes names and descriptions of properties that are specific to this activity.
// The @searchlog rowset is extracted from the input file passed in through the @in parameter.
// (This EXTRACT column list is illustrative; adjust it to match your input schema.)
@searchlog =
    EXTRACT UserId int, Start DateTime, Region string, Query string, Duration int, Urls string, ClickedUrls string
    FROM @in
    USING Extractors.Tsv(nullEscape:"#NULL#");

@rs1 =
SELECT Start, Region, Duration
FROM @searchlog
WHERE Region == "en-gb";
@rs1 =
SELECT Start, Region, Duration
FROM @rs1
WHERE Start <= DateTime.Parse("2012/02/19");
OUTPUT @rs1
TO @out
USING Outputters.Tsv(quoting:false, dateTimeFormat:null);
In the above script example, the input and output of the script are defined by the @in and @out parameters. The values
for the @in and @out parameters in the U-SQL script are passed dynamically by Data Factory using the
'parameters' section.
You can specify other properties such as degreeOfParallelism and priority as well in your pipeline definition for
the jobs that run on the Azure Data Lake Analytics service.
Dynamic parameters
In the sample pipeline definition, in and out parameters are assigned with hard-coded values.
"parameters": {
"in": "/datalake/input/SearchLog.tsv",
"out": "/datalake/output/Result.tsv"
}
"parameters": {
"in": "/datalake/input/@{formatDateTime(pipeline().parameters.WindowStart,'yyyy/MM/dd')}/data.tsv",
"out": "/datalake/output/@{formatDateTime(pipeline().parameters.WindowStart,'yyyy/MM/dd')}/result.tsv"
}
In this case, input files are still picked up from the /datalake/input folder and output files are generated in the
/datalake/output folder. The file names are dynamic, based on the window start time that is passed in when the
pipeline is triggered.
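For the preceding expressions to resolve, WindowStart must be declared as a pipeline parameter. A minimal declaration (shown in isolation; it sits alongside the activities in the pipeline definition) might look like this:

"parameters": {
    "WindowStart": {
        "type": "String"
    }
}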
Next steps
See the following articles that explain how to transform data in other ways:
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Hadoop Hive activity in Azure
Data Factory
3/21/2021 • 2 minutes to read • Edit Online
Syntax
{
"name": "Hive Activity",
"description": "description",
"type": "HDInsightHive",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"scriptPath": "MyAzureStorage\\HiveScripts\\MyHiveSript.hql",
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}
Syntax details
PROPERTY | DESCRIPTION | REQUIRED
NOTE
The default value for queryTimeout is 120 minutes.
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Hadoop MapReduce activity
in Azure Data Factory
3/5/2021 • 2 minutes to read • Edit Online
Syntax
{
"name": "Map Reduce Activity",
"description": "Description",
"type": "HDInsightMapReduce",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"className": "org.myorg.SampleClass",
"jarLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"jarFilePath": "MyAzureStorage/jars/sample.jar",
"getDebugInfo": "Failure",
"arguments": [
"-SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}
Syntax details
PROPERTY | DESCRIPTION | REQUIRED
Example
You can use the HDInsight MapReduce Activity to run any MapReduce jar file on an HDInsight cluster. In the
following sample JSON definition of a pipeline, the HDInsight Activity is configured to run a Mahout JAR file.
{
"name": "MapReduce Activity for Mahout",
"description": "Custom MapReduce to generate Mahout result",
"type": "HDInsightMapReduce",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"className": "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
"jarLinkedService": {
"referenceName": "MyStorageLinkedService",
"type": "LinkedServiceReference"
},
"jarFilePath": "adfsamples/Mahout/jars/mahout-examples-0.9.0.2.2.7.1-34.jar",
"arguments": [
"-s",
"SIMILARITY_LOGLIKELIHOOD",
"--input",
"wasb://[email protected]/Mahout/input",
"--output",
"wasb://[email protected]/Mahout/output/",
"--maxSimilaritiesPerItem",
"500",
"--tempDir",
"wasb://[email protected]/Mahout/temp/mahout"
]
}
}
You can specify any arguments for the MapReduce program in the arguments section. At runtime, you see a
few extra arguments (for example: mapreduce.job.tags) from the MapReduce framework. To differentiate your
arguments from the MapReduce arguments, consider using both the option and its value as arguments, as shown in the
following example (-s, --input, --output, and so on, are options immediately followed by their values).
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Hadoop Pig activity in Azure
Data Factory
3/5/2021 • 2 minutes to read • Edit Online
Syntax
{
"name": "Pig Activity",
"description": "description",
"type": "HDInsightPig",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"scriptLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"scriptPath": "MyAzureStorage\\PigScripts\\MyPigSript.pig",
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}
Syntax details
PROPERTY | DESCRIPTION | REQUIRED
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Spark activity in Azure Data
Factory
6/10/2021 • 3 minutes to read • Edit Online
{
"name": "Spark Activity",
"description": "Description",
"type": "HDInsightSpark",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"sparkJobLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"rootPath": "adfspark",
"entryFilePath": "test.py",
"sparkConfig": {
"ConfigItem1": "Value"
},
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
]
}
}
The following table describes the JSON properties used in the JSON definition:
Folder structure
Spark jobs are more extensible than Pig/Hive jobs. For Spark jobs, you can provide multiple dependencies such
as jar packages (placed in the Java CLASSPATH), Python files (placed on the PYTHONPATH), and any other files.
Create the following folder structure in the Azure Blob storage referenced by the HDInsight linked service. Then,
upload dependent files to the appropriate subfolders in the root folder represented by entryFilePath. For
example, upload Python files to the pyFiles subfolder and jar files to the jars subfolder of the root folder. At
runtime, the Data Factory service expects the following folder structure in the Azure Blob storage:
PATH | DESCRIPTION | REQUIRED | TYPE
Here is an example for a storage containing two Spark job files in the Azure Blob Storage referenced by the
HDInsight linked service.
SparkJob1
    main.jar
    files
        input1.txt
        input2.txt
    jars
        package1.jar
        package2.jar
    logs
    archives
    pyFiles
SparkJob2
    main.py
    pyFiles
        script1.py
        script2.py
    logs
    archives
    jars
    files
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Transform data using Hadoop Streaming activity in
Azure Data Factory
3/5/2021 • 2 minutes to read • Edit Online
JSON sample
{
"name": "Streaming Activity",
"description": "Description",
"type": "HDInsightStreaming",
"linkedServiceName": {
"referenceName": "MyHDInsightLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mapper": "MyMapper.exe",
"reducer": "MyReducer.exe",
"combiner": "MyCombiner.exe",
"fileLinkedService": {
"referenceName": "MyAzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"filePaths": [
"<containername>/example/apps/MyMapper.exe",
"<containername>/example/apps/MyReducer.exe",
"<containername>/example/apps/MyCombiner.exe"
],
"input": "wasb://<containername>@<accountname>.blob.core.windows.net/example/input/MapperInput.txt",
"output":
"wasb://<containername>@<accountname>.blob.core.windows.net/example/output/ReducerOutput.txt",
"commandEnvironment": [
"CmdEnvVarName=CmdEnvVarValue"
],
"getDebugInfo": "Failure",
"arguments": [
"SampleHadoopJobArgument1"
],
"defines": {
"param1": "param1Value"
}
}
}
Syntax details
PROPERTY | DESCRIPTION | REQUIRED
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Spark activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution activity
Stored procedure activity
Execute Azure Machine Learning pipelines in Azure
Data Factory
4/22/2021 • 2 minutes to read • Edit Online
Syntax
{
"name": "Machine Learning Execute Pipeline",
"type": "AzureMLExecutePipeline",
"linkedServiceName": {
"referenceName": "AzureMLService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"mlPipelineId": "machine learning pipeline ID",
"experimentName": "experimentName",
"mlPipelineParameters": {
"mlParameterName": "mlParameterValue"
}
}
}
Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
NOTE
To populate the dropdown items in Machine Learning pipeline name and ID, the user needs to have permission to list ML
pipelines. ADF UX calls AzureMLService APIs directly using the logged in user's credentials.
Next steps
See the following articles that explain how to transform data in other ways:
Execute Data Flow activity
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Create a predictive pipeline using Azure Machine
Learning Studio (classic) and Azure Data Factory
3/5/2021 • 8 minutes to read • Edit Online
See Compute linked services article for descriptions about properties in the JSON definition.
Azure Machine Learning Studio (classic) supports both Classic Web Services and New Web Services for your
predictive experiment. You can choose the right one to use from Data Factory. To get the information required to
create the Azure Machine Learning Studio (classic) Linked Service, go to https://services.azureml.net, where all
your (new) Web Services and Classic Web Services are listed. Click the Web Service you would like to access,
and then click the Consume page. Copy Primary Key for the apiKey property, and Batch Requests for the mlEndpoint
property.
Scenario 1: Experiments using Web service inputs/outputs that refer to data in Azure Blob Storage
In this scenario, the Azure Machine Learning Studio (classic) Web service makes predictions using data from a
file in Azure Blob storage and stores the prediction results in the blob storage. The following JSON defines a
Data Factory pipeline with an AzureMLBatchExecution activity. The input and output data in Azure Blob Storage
is referenced using a LinkedServiceName and FilePath pair. In the sample, the linked services of the inputs and outputs
are different; you can use different linked services for each of your inputs/outputs so that Data Factory can
pick up the right files and send them to the Azure Machine Learning Studio (classic) Web Service.
IMPORTANT
In your Azure Machine Learning Studio (classic) experiment, web service input and output ports, and global parameters
have default names ("input1", "input2") that you can customize. The names you use for webServiceInputs,
webServiceOutputs, and globalParameters settings must exactly match the names in the experiments. You can view the
sample request payload on the Batch Execution Help page for your Azure Machine Learning Studio (classic) endpoint to
verify the expected mapping.
{
"name": "AzureMLExecutionActivityTemplate",
"description": "description",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "AzureMLLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"input1": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"amltest/input/in1.csv"
},
"input2": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService1",
"type": "LinkedServiceReference"
},
"FilePath":"amltest/input/in2.csv"
}
},
"webServiceOutputs": {
"outputName1": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"amltest2/output/out1.csv"
},
"outputName2": {
"LinkedServiceName":{
"referenceName": "AzureStorageLinkedService2",
"type": "LinkedServiceReference"
},
"FilePath":"amltest2/output/out2.csv"
}
}
}
}
Let's look at a scenario for using Web service parameters. You have a deployed Azure Machine Learning Studio
(classic) web service that uses a reader module to read data from one of the data sources supported by Azure
Machine Learning Studio (classic) (for example: Azure SQL Database). After the batch execution is performed, the
results are written using a Writer module (Azure SQL Database). No web service inputs and outputs are defined
in the experiments. In this case, we recommend that you configure relevant web service parameters for the
reader and writer modules. This configuration allows the reader/writer modules to be configured when using
the AzureMLBatchExecution activity. You specify Web service parameters in the globalParameters section in
the activity JSON as follows.
"typeProperties": {
"globalParameters": {
"Database server name": "<myserver>.database.windows.net",
"Database name": "<database>",
"Server user account name": "<user name>",
"Server user account password": "<password>"
}
}
NOTE
The Web service parameters are case-sensitive, so ensure that the names you specify in the activity JSON match the ones
exposed by the Web service.
After you are done with retraining, update the scoring web service (predictive experiment exposed as a web
service) with the newly trained model by using the Azure Machine Learning Studio (classic) Update
Resource Activity . See Updating models using Update Resource Activity article for details.
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Update Azure Machine Learning Studio (classic)
models by using Update Resource activity
3/22/2021 • 6 minutes to read • Edit Online
Overview
As part of the process of operationalizing Azure Machine Learning Studio (classic) models, your model is trained
and saved. You then use it to create a predictive Web service. The Web service can then be consumed in web
sites, dashboards, and mobile apps.
Models you create using Azure Machine Learning Studio (classic) are typically not static. As new data becomes
available, or when the consumer of the API has their own data, the model needs to be retrained.
Retraining may occur frequently. With the Batch Execution activity and the Update Resource activity, you can use
Data Factory to operationalize retraining the Azure Machine Learning Studio (classic) model and updating the
predictive Web Service.
The following picture depicts the relationship between training and predictive Web Services.
End-to-end workflow
The entire process of operationalizing the retraining of a model and updating the predictive Web Service involves the
following steps:
Invoke the training Web Service by using the Batch Execution activity. Invoking a training Web Service
is the same as invoking a predictive Web Service described in Create predictive pipelines using Azure
Machine Learning Studio (classic) and Data Factory Batch Execution activity. The output of the training Web
Service is an iLearner file that you can use to update the predictive Web Service.
Invoke the update resource endpoint of the predictive Web Service by using the Update Resource
activity to update the Web Service with the newly trained model.
Azure Machine Learning Studio (classic) linked service
For the above mentioned end-to-end workflow to work, you need to create two Azure Machine Learning Studio
(classic) linked services:
1. An Azure Machine Learning Studio (classic) linked service to the training web service. This linked service is
used by the Batch Execution activity in the same way as described in Create predictive pipelines using
Azure Machine Learning Studio (classic) and Data Factory Batch Execution activity. The difference is that the output of
the training web service is an iLearner file, which is then used by the Update Resource activity to update the
predictive web service.
2. An Azure Machine Learning Studio (classic) linked service to the update resource endpoint of the predictive
web service. This linked service is used by the Update Resource activity to update the predictive web service
using the iLearner file returned from the above step.
For the second Azure Machine Learning Studio (classic) linked service, the configuration is different when your
Azure Machine Learning Studio (classic) Web Service is a classic Web Service or a new Web Service. The
differences are discussed separately in the following sections.
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-
name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview
You can get values for the placeholders in the URL when querying the web service on the Azure Machine Learning
Studio (classic) Web Services Portal.
The new type of update resource endpoint requires service principal authentication. To use service principal
authentication, register an application entity in Azure Active Directory (Azure AD) and grant it the Contributor
or Owner role of the subscription or the resource group where the web service belongs. See how to
create a service principal and assign permissions to manage Azure resources. Make note of the following values,
which you use to define the linked service:
Application ID
Application key
Tenant ID
Here is a sample linked service definition:
{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"description": "The linked service for AML web service.",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/0000000000000000
000000000000000000000/services/0000000000000000000000000000000000000/jobs?api-version=2.0",
"apiKey": {
"type": "SecureString",
"value": "APIKeyOfEndpoint1"
},
"updateResourceEndpoint":
"https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resource-group-
name}/providers/Microsoft.MachineLearning/webServices/{web-service-name}?api-version=2016-05-01-preview",
"servicePrincipalId": "000000000-0000-0000-0000-0000000000000",
"servicePrincipalKey": {
"type": "SecureString",
"value": "servicePrincipalKey"
},
"tenant": "mycompany.com"
}
}
}
The following scenario provides more details. It has an example for retraining and updating Azure Machine
Learning Studio (classic) models from an Azure Data Factory pipeline.
{
"name": "StorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=name;AccountKey=key"
}
}
}
Linked service for Azure Machine Learning Studio (classic) training endpoint
The following JSON snippet defines an Azure Machine Learning Studio (classic) linked service that points to the
default endpoint of the training web service.
{
"name": "trainingEndpoint",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/xxx/services/--training
experiment--/jobs",
"apiKey": "myKey"
}
}
}
In Azure Machine Learning Studio (classic) , do the following to get values for mlEndpoint and apiKey :
1. Click WEB SERVICES on the left menu.
2. Click the training web ser vice in the list of web services.
3. Click copy next to API key text box. Paste the key in the clipboard into the Data Factory JSON editor.
4. In the Azure Machine Learning Studio (classic) , click BATCH EXECUTION link.
5. Copy the Request URI from the Request section and paste it into the Data Factory JSON editor.
Linked service for Azure Machine Learning Studio (classic) updatable scoring endpoint:
The following JSON snippet defines an Azure Machine Learning Studio (classic) linked service that points to
updatable endpoint of the scoring web service.
{
"name": "updatableScoringEndpoint2",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint":
"https://ussouthcentral.services.azureml.net/workspaces/00000000eb0abe4d6bbb1d7886062747d7/services/00000000
026734a5889e02fbb1f65cefd/jobs?api-version=2.0",
"apiKey":
"sooooooooooh3WvG1hBfKS2BNNcfwSO7hhY6dY98noLfOdqQydYDIXyf2KoIaN3JpALu/AKtflHWMOCuicm/Q==",
"updateResourceEndpoint": "https://management.azure.com/subscriptions/00000000-0000-0000-0000-
000000000000/resourceGroups/Default-MachineLearning-
SouthCentralUS/providers/Microsoft.MachineLearning/webServices/myWebService?api-version=2016-05-01-preview",
"servicePrincipalId": "fe200044-c008-4008-a005-94000000731",
"servicePrincipalKey": "zWa0000000000Tp6FjtZOspK/WMA2tQ08c8U+gZRBlw=",
"tenant": "mycompany.com"
}
}
}
Pipeline
The pipeline has two activities: AzureMLBatchExecution and AzureMLUpdateResource. The Batch Execution
activity takes the training data as input and produces an iLearner file as output. The Update Resource activity
then takes this iLearner file and uses it to update the predictive web service.
{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "amlBEGetilearner",
"description": "Use AML BES to get the ileaner file from training web service",
"type": "AzureMLBatchExecution",
"linkedServiceName": {
"referenceName": "trainingEndpoint",
"type": "LinkedServiceReference"
},
"typeProperties": {
"webServiceInputs": {
"input1": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/input"
},
"input2": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/input"
}
},
"webServiceOutputs": {
"output1": {
"LinkedServiceName":{
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
},
"FilePath":"azuremltesting/output"
}
}
}
},
{
"name": "amlUpdateResource",
"type": "AzureMLUpdateResource",
"description": "Use AML Update Resource to update the predict web service",
"linkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "updatableScoringEndpoint2"
},
"typeProperties": {
"trainedModelName": "ADFV2Sample Model [trained model]",
"trainedModelLinkedServiceName": {
"type": "LinkedServiceReference",
"referenceName": "StorageLinkedService"
},
"trainedModelFilePath": "azuremltesting/output/newModelForArm.ilearner"
},
"dependsOn": [
{
"activity": "amlbeGetilearner",
"dependencyConditions": [ "Succeeded" ]
}
]
}
]
}
}
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL activity
Hive activity
Pig activity
MapReduce activity
Hadoop Streaming activity
Spark activity
.NET custom activity
Stored procedure activity
Transform data by using the SQL Server Stored
Procedure activity in Azure Data Factory
NOTE
If you are new to Azure Data Factory, read through Introduction to Azure Data Factory and do the tutorial: Tutorial:
transform data before reading this article.
You can use the Stored Procedure Activity to invoke a stored procedure in one of the following data stores in
your enterprise or on an Azure virtual machine (VM):
Azure SQL Database
Azure Synapse Analytics
SQL Server Database. If you are using SQL Server, install a self-hosted integration runtime on the same
machine that hosts the database or on a separate machine that has access to the database. The self-hosted
integration runtime is a component that connects data sources on-premises or on an Azure VM with cloud
services in a secure and managed way. See the Self-hosted integration runtime article for details.
IMPORTANT
When copying data into Azure SQL Database or SQL Server, you can configure the SqlSink in the copy activity to invoke a
stored procedure by using the sqlWriterStoredProcedureName property. For details about the property, see the following
connector articles: Azure SQL Database, SQL Server. Invoking a stored procedure while copying data into Azure
Synapse Analytics by using a copy activity is not supported. But, you can use the Stored Procedure activity to invoke a
stored procedure in Azure Synapse Analytics.
When copying data from Azure SQL Database, SQL Server, or Azure Synapse Analytics, you can configure the SqlSource in
the copy activity to invoke a stored procedure to read data from the source database by using the
sqlReaderStoredProcedureName property. For more information, see the following connector articles: Azure SQL
Database, SQL Server, Azure Synapse Analytics.
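For illustration, here is a minimal sketch of how a SqlSource in a copy activity might call a stored procedure through the sqlReaderStoredProcedureName property. The procedure and parameter names (usp_GetChangedRecords, cutoffDate) are hypothetical and are not part of this article; see the connector articles above for the authoritative property list.

"source": {
    "type": "SqlSource",
    "sqlReaderStoredProcedureName": "usp_GetChangedRecords",
    "storedProcedureParameters": {
        "cutoffDate": { "value": "2021-01-01", "type": "String" }
    }
}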
Syntax details
Here is the JSON format for defining a Stored Procedure Activity:
{
"name": "Stored Procedure Activity",
"description":"Description",
"type": "SqlServerStoredProcedure",
"linkedServiceName": {
"referenceName": "AzureSqlLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"storedProcedureName": "usp_sample",
"storedProcedureParameters": {
"identifier": { "value": "1", "type": "Int" },
"stringData": { "value": "str1" }
}
}
}
Next steps
See the following articles that explain how to transform data in other ways:
U-SQL Activity
Hive Activity
Pig Activity
MapReduce Activity
Hadoop Streaming Activity
Spark Activity
.NET custom activity
Azure Machine Learning Studio (classic) Batch Execution Activity
Stored procedure activity
Compute environments supported by Azure Data
Factory
On-demand HDInsight cluster or your own HDInsight Hive, Pig, Spark, MapReduce, Hadoop Streaming
cluster
Azure Machine Learning Studio (classic) Machine Learning Studio (classic) activities: Batch Execution
and Update Resource
PROPERTY NAME IN COMPUTE LINKED SERVICE    DESCRIPTION    BLOB    ADLS GEN 2    AZURE SQL DB    ADLS GEN 1
additionalLinkedServiceNames    Specifies additional storage accounts for the HDInsight linked service so that the Data Factory service can register them on your behalf.    No    No    No    No
hcatalogLinkedServiceName    A reference to the Azure SQL linked service that points to the HCatalog database.    No    No    No    No
NOTE
The on-demand configuration is currently supported only for Azure HDInsight clusters. Azure Databricks also supports
on-demand jobs using job clusters. For more information, see the Azure Databricks linked service.
The Azure Data Factory service can automatically create an on-demand HDInsight cluster to process data. The
cluster is created in the same region as the storage account (linkedServiceName property in the JSON)
associated with the cluster. The storage account must be a general-purpose standard Azure Storage account.
Note the following important points about the on-demand HDInsight linked service:
The on-demand HDInsight cluster is created under your Azure subscription. You are able to see the cluster in
your Azure portal when the cluster is up and running.
The logs for jobs that are run on an on-demand HDInsight cluster are copied to the storage account
associated with the HDInsight cluster. The clusterUserName, clusterPassword, clusterSshUserName,
clusterSshPassword defined in your linked service definition are used to log in to the cluster for in-depth
troubleshooting during the lifecycle of the cluster.
You are charged only for the time when the HDInsight cluster is up and running jobs.
You can use a Script Action with the Azure HDInsight on-demand linked service.
IMPORTANT
It typically takes 20 minutes or more to provision an Azure HDInsight cluster on demand.
Example
The following JSON defines a Linux-based on-demand HDInsight linked service. The Data Factory service
automatically creates a Linux-based HDInsight cluster to process the required activity.
{
"name": "HDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"clusterType": "hadoop",
"clusterSize": 1,
"timeToLive": "00:15:00",
"hostSubscriptionId": "<subscription ID>",
"servicePrincipalId": "<service principal ID>",
"servicePrincipalKey": {
"value": "<service principal key>",
"type": "SecureString"
            },
            "tenant": "<tenant id>",
"clusterResourceGroup": "<resource group name>",
"version": "3.6",
"osType": "Linux",
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
IMPORTANT
The HDInsight cluster creates a default container in the blob storage you specified in the JSON (linkedServiceName).
HDInsight does not delete this container when the cluster is deleted. This behavior is by design. With the on-demand
HDInsight linked service, an HDInsight cluster is created every time a slice needs to be processed, unless there is an
existing live cluster (timeToLive), and the cluster is deleted when the processing is done.
As more activities run, you see many containers in your Azure blob storage. If you do not need them for troubleshooting
of the jobs, you may want to delete them to reduce the storage cost. The names of these containers follow a pattern:
adf**yourdatafactoryname**-**linkedservicename**-datetimestamp. Use tools such as Microsoft Azure Storage
Explorer to delete containers in your Azure blob storage.
Properties
IMPORTANT
Currently, the HDInsight linked service does not support HBase, Interactive Query (Hive LLAP), or Storm.
"additionalLinkedServiceNames": [{
"referenceName": "MyStorageLinkedService2",
"type": "LinkedServiceReference"
}]
Advanced Properties
You can also specify the following properties for the granular configuration of the on-demand HDInsight cluster.
Node sizes
You can specify the sizes of head, data, and zookeeper nodes using the following properties:
Specifying node sizes
See the Sizes of Virtual Machines article for the string values you need to specify for the properties mentioned in
the previous section. The values need to conform to the CMDLETs and APIs referenced in that article. As you can see
in the article, the data node of Large (default) size has 7-GB memory, which may not be good enough for your scenario.
If you want to create D4-sized head nodes and worker nodes, specify Standard_D4 as the value for the
headNodeSize and dataNodeSize properties.
"headNodeSize": "Standard_D4",
"dataNodeSize": "Standard_D4",
If you specify a wrong value for these properties, you may receive the following error: Failed to create cluster.
Exception: Unable to complete the cluster create operation. Operation failed with code '400'. Cluster left behind
state: 'Error'. Message: 'PreClusterCreationValidationFailure'. When you receive this error, ensure that you are
using the CMDLETs and APIs name from the table in the Sizes of Virtual Machines article.
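For context, a minimal sketch of where these node-size properties would sit in the on-demand HDInsight linked service shown earlier. The cluster size and node sizes here are illustrative assumptions, not recommendations:

{
    "name": "HDInsightOnDemandLinkedService",
    "properties": {
        "type": "HDInsightOnDemand",
        "typeProperties": {
            "clusterType": "hadoop",
            "clusterSize": 4,
            "timeToLive": "00:15:00",
            "headNodeSize": "Standard_D4",
            "dataNodeSize": "Standard_D4",
            "hostSubscriptionId": "<subscription ID>",
            "clusterResourceGroup": "<resource group name>",
            "linkedServiceName": {
                "referenceName": "AzureStorageLinkedService",
                "type": "LinkedServiceReference"
            }
        }
    }
}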
Bring your own compute environment
In this type of configuration, users can register an already existing computing environment as a linked service in
Data Factory. The computing environment is managed by the user and the Data Factory service uses it to
execute the activities.
This type of configuration is supported for the following compute environments:
Azure HDInsight
Azure Batch
Azure Machine Learning
Azure Data Lake Analytics
Azure SQL DB, Azure Synapse Analytics, SQL Server
Properties
PROPERTY    DESCRIPTION    REQUIRED
IMPORTANT
HDInsight supports multiple Hadoop cluster versions that can be deployed. Each version choice creates a specific version
of the Hortonworks Data Platform (HDP) distribution and a set of components that are contained within that distribution.
The list of supported HDInsight versions is updated regularly to provide the latest Hadoop ecosystem components and fixes.
Always refer to the latest information about supported HDInsight versions and OS types to ensure that you are using a
supported version of HDInsight.
IMPORTANT
Currently, the HDInsight linked service does not support HBase, Interactive Query (Hive LLAP), or Storm.
You can create an Azure Batch linked service to register a Batch pool of virtual machines (VMs) to a data factory.
You can run a Custom activity using Azure Batch.
See the following articles if you are new to the Azure Batch service:
Azure Batch basics for an overview of the Azure Batch service.
New-AzBatchAccount cmdlet to create an Azure Batch account, or create the account in the Azure portal. See the
Using PowerShell to manage Azure Batch Account article for detailed instructions on using the cmdlet.
New-AzBatchPool cmdlet to create an Azure Batch pool.
New-AzBatchPool cmdlet to create an Azure Batch pool.
IMPORTANT
When creating a new Azure Batch pool, 'VirtualMachineConfiguration' must be used, NOT
'CloudServiceConfiguration'. For more details, refer to the Azure Batch pool migration guidance.
Example
{
"name": "AzureBatchLinkedService",
"properties": {
"type": "AzureBatch",
"typeProperties": {
"accountName": "batchaccount",
"accessKey": {
"type": "SecureString",
"value": "access key"
},
"batchUri": "https://batchaccount.region.batch.azure.com",
"poolName": "poolname",
"linkedServiceName": {
"referenceName": "StorageLinkedService",
"type": "LinkedServiceReference"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Properties
PROPERTY    DESCRIPTION    REQUIRED
{
"name": "AzureMLLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://[batch scoring endpoint]/jobs",
"apiKey": {
"type": "SecureString",
"value": "access key"
}
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Properties
PROPERTY    DESCRIPTION    REQUIRED
NOTE
Currently only service principal authentication is supported for the Azure Machine Learning linked service.
Example
{
"name": "AzureMLServiceLinkedService",
"properties": {
"type": "AzureMLService",
"typeProperties": {
"subscriptionId": "subscriptionId",
"resourceGroupName": "resourceGroupName",
"mlWorkspaceName": "mlWorkspaceName",
"servicePrincipalId": "service principal id",
"servicePrincipalKey": {
"value": "service principal key",
"type": "SecureString"
},
"tenant": "tenant ID"
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Properties
PROPERTY    DESCRIPTION    REQUIRED
{
"name": "AzureDataLakeAnalyticsLinkedService",
"properties": {
"type": "AzureDataLakeAnalytics",
"typeProperties": {
"accountName": "adftestaccount",
"dataLakeAnalyticsUri": "azuredatalakeanalytics URI",
"servicePrincipalId": "service principal id",
"servicePrincipalKey": {
"value": "service principal key",
"type": "SecureString"
},
"tenant": "tenant ID",
"subscriptionId": "<optional, subscription ID of ADLA>",
"resourceGroupName": "<optional, resource group name of ADLA>"
},
"connectVia": {
"referenceName": "<name of Integration Runtime>",
"type": "IntegrationRuntimeReference"
}
}
}
Properties
PROPERTY    DESCRIPTION    REQUIRED
IMPORTANT
The Databricks linked service supports instance pools and system-assigned managed identity authentication.
{
"name": "AzureDatabricks_LS",
"properties": {
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://eastus.azuredatabricks.net",
"newClusterNodeType": "Standard_D3_v2",
"newClusterNumOfWorker": "1:10",
"newClusterVersion": "4.0.x-scala2.11",
"accessToken": {
"type": "SecureString",
"value": "dapif33c9c721144c3a790b35000b57f7124f"
}
}
}
}
Properties
PROPERTY    DESCRIPTION    REQUIRED
function app url    URL for the Azure Function App. The format is https://<accountname>.azurewebsites.net. This URL is the value under the URL section when viewing your Function App in the Azure portal.    Yes
Next steps
For a list of the transformation activities supported by Azure Data Factory, see Transform data.
Append Variable Activity in Azure Data Factory
Type properties
PROPERTY    DESCRIPTION    REQUIRED
Next steps
Learn about a related control flow activity supported by Data Factory:
Set Variable Activity
Execute Pipeline activity in Azure Data Factory
Syntax
{
"name": "MyPipeline",
"properties": {
"activities": [
{
"name": "ExecutePipelineActivity",
"type": "ExecutePipeline",
"typeProperties": {
"parameters": {
"mySourceDatasetFolderPath": {
"value": "@pipeline().parameters.mySourceDatasetFolderPath",
"type": "Expression"
}
},
"pipeline": {
"referenceName": "<InvokedPipelineName>",
"type": "PipelineReference"
},
"waitOnCompletion": true
}
}
        ],
        "parameters": {
            "mySourceDatasetFolderPath": {
                "type": "String"
            }
        }
}
}
Type properties
PROPERTY    DESCRIPTION    ALLOWED VALUES    REQUIRED
Sample
This scenario has two pipelines:
Master pipeline - This pipeline has one Execute Pipeline activity that calls the invoked pipeline. The master
pipeline takes two parameters: masterSourceBlobContainer , masterSinkBlobContainer .
Invoked pipeline - This pipeline has one Copy activity that copies data from an Azure Blob source to Azure
Blob sink. The invoked pipeline takes two parameters: sourceBlobContainer , sinkBlobContainer .
Master pipeline definition
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "invokedPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceBlobContainer": {
"value": "@pipeline().parameters.masterSourceBlobContainer",
"type": "Expression"
},
"sinkBlobContainer": {
"value": "@pipeline().parameters.masterSinkBlobContainer",
"type": "Expression"
}
},
"waitOnCompletion": true
},
"name": "MyExecutePipelineActivity"
}
],
"parameters": {
"masterSourceBlobContainer": {
"type": "String"
},
"masterSinkBlobContainer": {
"type": "String"
}
}
}
}
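The invoked pipeline's definition is not reproduced in full here. A minimal sketch of what invokedPipeline might look like, assuming the SourceBlobDataset and sinkBlobDataset datasets shown below (the Copy activity name is hypothetical):

{
    "name": "invokedPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyBlobToBlob",
                "type": "Copy",
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "BlobSink" }
                },
                "inputs": [
                    {
                        "referenceName": "SourceBlobDataset",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "sinkBlobDataset",
                        "type": "DatasetReference"
                    }
                ]
            }
        ],
        "parameters": {
            "sourceBlobContainer": {
                "type": "String"
            },
            "sinkBlobContainer": {
                "type": "String"
            }
        }
    }
}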
{
"name": "BlobStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=*****;AccountKey=*****"
}
}
}
Source dataset
{
"name": "SourceBlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@pipeline().parameters.sourceBlobContainer",
"type": "Expression"
},
"fileName": "salesforce.txt"
},
"linkedServiceName": {
"referenceName": "BlobStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
Sink dataset
{
"name": "sinkBlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@pipeline().parameters.sinkBlobContainer",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "BlobStorageLinkedService",
"type": "LinkedServiceReference"
}
}
}
{
"masterSourceBlobContainer": "executetest",
"masterSinkBlobContainer": "executesink"
}
The master pipeline forwards these values to the invoked pipeline as shown in the following example:
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "invokedPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceBlobContainer": {
"value": "@pipeline().parameters.masterSourceBlobContainer",
"type": "Expression"
},
"sinkBlobContainer": {
"value": "@pipeline().parameters.masterSinkBlobContainer",
"type": "Expression"
}
},
....
}
Next steps
See other control flow activities supported by Data Factory:
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Filter activity in Azure Data Factory
You can use a Filter activity in a pipeline to apply a filter expression to an input array.
APPLIES TO: Azure Data Factory Azure Synapse Analytics
Syntax
{
"name": "MyFilterActivity",
"type": "filter",
"typeProperties": {
"condition": "<condition>",
"items": "<input array>"
}
}
Type properties
PROPERTY    DESCRIPTION    ALLOWED VALUES    REQUIRED
Example
In this example, the pipeline has two activities: Filter and ForEach . The Filter activity is configured to filter the
input array for items with a value greater than 3. The ForEach activity then iterates over the filtered values and
sets the variable test to the current value.
{
"name": "PipelineName",
"properties": {
"activities": [{
"name": "MyFilterActivity",
"type": "filter",
"typeProperties": {
"condition": "@greater(item(),3)",
"items": "@pipeline().parameters.inputs"
}
},
{
"name": "MyForEach",
"type": "ForEach",
"dependsOn": [
{
"activity": "MyFilterActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"items": {
"value": "@activity('MyFilterActivity').output.value",
"type": "Expression"
},
"isSequential": "false",
"batchCount": 1,
"activities": [
{
"name": "Set Variable1",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "test",
"value": {
"value": "@string(item())",
"type": "Expression"
}
}
}
]
}
}],
"parameters": {
"inputs": {
"type": "Array",
"defaultValue": [1, 2, 3, 4, 5, 6]
}
},
"variables": {
"test": {
"type": "String"
}
},
"annotations": []
}
}
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
ForEach activity in Azure Data Factory
Syntax
The properties are described later in this article. The items property is the collection, and each item in the
collection is referred to by using @item(), as shown in the following syntax:
{
"name":"MyForEachActivityName",
"type":"ForEach",
"typeProperties":{
"isSequential":"true",
"items": {
"value": "@pipeline().parameters.mySinkDatasetFolderPathCollection",
"type": "Expression"
},
"activities":[
{
"name":"MyCopyActivity",
"type":"Copy",
"typeProperties":{
...
},
"inputs":[
{
"referenceName":"MyDataset",
"type":"DatasetReference",
"parameters":{
"MyFolderPath":"@pipeline().parameters.mySourceDatasetFolderPath"
}
}
],
"outputs":[
{
"referenceName":"MyDataset",
"type":"DatasetReference",
"parameters":{
"MyFolderPath":"@item()"
}
}
]
}
]
}
}
Type properties
PROPERTY    DESCRIPTION    ALLOWED VALUES    REQUIRED
If "isSequential" is set to False, ensure that there is a correct configuration to run multiple executables. Otherwise, this property should be used with caution to avoid incurring write conflicts. For more information, see the Parallel execution section.
batchCount    Batch count to be used for controlling the number of parallel executions (when isSequential is set to false). This is the upper concurrency limit, but the ForEach activity will not always execute at this number.    Integer (maximum 50)    No. Default is 20.
Parallel execution
If isSequential is set to false, the activity iterates in parallel with a maximum of 50 concurrent iterations. This
setting should be used with caution. If the concurrent iterations are writing to the same folder but to different
files, this approach is fine. If the concurrent iterations are writing concurrently to the exact same file, this
approach most likely causes an error.
{
"mySourceDatasetFolderPath": "input/",
"mySinkDatasetFolderPath": [ "outputs/file1", "outputs/file2" ]
}
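A minimal sketch of the ForEach settings discussed above with parallel execution enabled. The batchCount value and the inner activity are illustrative assumptions, not values prescribed by this article:

"typeProperties": {
    "isSequential": false,
    "batchCount": 10,
    "items": {
        "value": "@pipeline().parameters.mySinkDatasetFolderPath",
        "type": "Expression"
    },
    "activities": [
        {
            "<Activity definition that writes to a different file per iteration>"
        }
    ]
}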
Example
Scenario: Iterate over an inner pipeline within a ForEach activity by using the Execute Pipeline activity. The inner
pipeline copies data with schema definitions parameterized.
Master Pipeline definition
{
"name": "masterPipeline",
"properties": {
"activities": [
{
"type": "ForEach",
"name": "MyForEachActivity",
"typeProperties": {
"isSequential": true,
"items": {
"value": "@pipeline().parameters.inputtables",
"type": "Expression"
},
"activities": [
{
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": {
"referenceName": "InnerCopyPipeline",
"type": "PipelineReference"
},
"parameters": {
"sourceTableName": {
"value": "@item().SourceTable",
"type": "Expression"
},
"sourceTableStructure": {
"value": "@item().SourceTableStructure",
"type": "Expression"
},
"sinkTableName": {
"value": "@item().DestTable",
"type": "Expression"
},
"sinkTableStructure": {
"value": "@item().DestTableStructure",
"type": "Expression"
}
},
"waitOnCompletion": true
},
"name": "ExecuteCopyPipeline"
}
]
}
}
],
"parameters": {
"inputtables": {
"type": "Array"
}
}
}
}
{
"name": "InnerCopyPipeline",
"properties": {
"activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "SqlSource"
                    },
                    "sink": {
                        "type": "SqlSink"
                    }
                },
"name": "CopyActivity",
"inputs": [
{
"referenceName": "sqlSourceDataset",
"parameters": {
"SqlTableName": {
"value": "@pipeline().parameters.sourceTableName",
"type": "Expression"
},
"SqlTableStructure": {
"value": "@pipeline().parameters.sourceTableStructure",
"type": "Expression"
}
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "sqlSinkDataset",
"parameters": {
"SqlTableName": {
"value": "@pipeline().parameters.sinkTableName",
"type": "Expression"
},
"SqlTableStructure": {
"value": "@pipeline().parameters.sinkTableStructure",
"type": "Expression"
}
},
"type": "DatasetReference"
}
]
}
],
"parameters": {
"sourceTableName": {
"type": "String"
},
"sourceTableStructure": {
"type": "String"
},
"sinkTableName": {
"type": "String"
},
"sinkTableStructure": {
"type": "String"
}
}
}
}
{
    "name": "sqlSinkDataset",
"properties": {
"type": "AzureSqlTable",
"typeProperties": {
"tableName": {
"value": "@dataset().SqlTableName",
"type": "Expression"
}
},
"structure": {
"value": "@dataset().SqlTableStructure",
"type": "Expression"
},
"linkedServiceName": {
"referenceName": "azureSqlLS",
"type": "LinkedServiceReference"
},
"parameters": {
"SqlTableName": {
"type": "String"
},
"SqlTableStructure": {
"type": "String"
}
}
}
}
Aggregating outputs
To aggregate the outputs of a ForEach activity, use variables and the Append Variable activity.
First, declare an array variable in the pipeline. Then, invoke the Append Variable activity inside each ForEach iteration.
Subsequently, you can retrieve the aggregation from the array variable.
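A minimal sketch of this pattern, assuming a pipeline-level array variable named aggregated (the variable and activity names are hypothetical): declare "variables": { "aggregated": { "type": "Array" } } on the pipeline, and place an Append Variable activity such as the following inside the ForEach loop:

{
    "name": "AppendToAggregate",
    "type": "AppendVariable",
    "typeProperties": {
        "variableName": "aggregated",
        "value": {
            "value": "@item()",
            "type": "Expression"
        }
    }
}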
LIMITATION    WORKAROUND
You can't nest a ForEach loop inside another ForEach loop (or an Until loop).    Design a two-level pipeline where the outer pipeline with the outer ForEach loop iterates over an inner pipeline with the nested loop.
The ForEach activity has a maximum batchCount of 50 for parallel processing, and a maximum of 100,000 items.    Design a two-level pipeline where the outer pipeline with the ForEach activity iterates over an inner pipeline.
SetVariable can't be used inside a ForEach activity that runs in parallel, as the variables are global to the whole pipeline; they are not scoped to a ForEach or any other activity.    Consider using a sequential ForEach, or use Execute Pipeline inside ForEach (with the variable/parameter handled in the child pipeline).
Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
Get Metadata Activity
Lookup Activity
Web Activity
Get Metadata activity in Azure Data Factory
Supported capabilities
The Get Metadata activity takes a dataset as an input and returns metadata information as output. Currently, the
following connectors and the corresponding retrievable metadata are supported. The maximum size of returned
metadata is 4 MB.
Supported connectors
File storage
For file storage connectors, the retrievable metadata fields are: itemName (file/folder), itemType (file/folder), size (file), created (file/folder), lastModified (file/folder) 1, childItems (folder), contentMD5 (file), structure (file) 2, columnCount (file) 2, and exists (file/folder) 3.
1 Metadata lastModified :
For Amazon S3, Amazon S3 Compatible Storage, Google Cloud Storage and Oracle Cloud Storage,
lastModified applies to the bucket and the key but not to the virtual folder, and exists applies to the
bucket and the key but not to the prefix or virtual folder.
For Azure Blob storage, lastModified applies to the container and the blob but not to the virtual folder.
2 Metadata structure and columnCount are not supported when getting metadata from Binary, JSON, or XML
files.
3 Metadata exists : For Amazon S3, Amazon S3 Compatible Storage, Google Cloud Storage and Oracle Cloud
Storage, exists applies to the bucket and the key but not to the prefix or virtual folder.
Note the following:
When using Get Metadata activity against a folder, make sure you have LIST/EXECUTE permission to the
given folder.
Wildcard filter on folders/files is not supported for Get Metadata activity.
modifiedDatetimeStart and modifiedDatetimeEnd filter set on the connector:
These two properties are used to filter the child items when getting metadata from a folder. They do not
apply when getting metadata from a file.
When such a filter is used, the childItems in the output includes only the files that are modified within the
specified range, but not folders.
To apply such a filter, the Get Metadata activity enumerates all the files in the specified folder and checks
the modified time. Avoid pointing to a folder with a large number of files, even if the expected qualified
file count is small.
Relational database
SQL Server ✓ ✓ ✓
Metadata options
You can specify the following metadata types in the Get Metadata activity field list to retrieve the corresponding
information:
childItems List of subfolders and files in the given folder. Applicable only
to folders. Returned value is a list of the name and type of
each child item.
TIP
When you want to validate that a file, folder, or table exists, specify exists in the Get Metadata activity field list. You can
then check the exists: true/false result in the activity output. If exists isn't specified in the field list, the Get
Metadata activity will fail if the object isn't found.
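For example, a minimal sketch of an If Condition activity that branches on the exists result. The activity names are hypothetical and assume a Get Metadata activity named MyGetMetadataActivity with exists in its field list:

{
    "name": "IfFileExists",
    "type": "IfCondition",
    "typeProperties": {
        "expression": {
            "value": "@activity('MyGetMetadataActivity').output.exists",
            "type": "Expression"
        },
        "ifTrueActivities": [
            {
                "<Activity definition to run when the item exists>"
            }
        ],
        "ifFalseActivities": [
            {
                "<Activity definition to run when the item is missing>"
            }
        ]
    }
}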
NOTE
When you get metadata from file stores and configure modifiedDatetimeStart or modifiedDatetimeEnd , the
childItems in the output includes only files in the specified path that have a last modified time within the specified
range. Items in subfolders are not included.
NOTE
For the Structure field list to provide the actual data structure for delimited text and Excel format datasets, you must
enable the First Row as Header property, which is supported only for these data sources.
Syntax
Get Metadata activity
{
"name":"MyActivity",
"type":"GetMetadata",
"dependsOn":[
],
"policy":{
"timeout":"7.00:00:00",
"retry":0,
"retryIntervalInSeconds":30,
"secureOutput":false,
"secureInput":false
},
"userProperties":[
],
"typeProperties":{
"dataset":{
"referenceName":"MyDataset",
"type":"DatasetReference"
},
"fieldList":[
"size",
"lastModified",
"structure"
],
"storeSettings":{
"type":"AzureBlobStorageReadSettings"
},
"formatSettings":{
"type":"JsonReadSettings"
}
}
}
Dataset
{
"name":"MyDataset",
"properties":{
"linkedServiceName":{
"referenceName":"AzureStorageLinkedService",
"type":"LinkedServiceReference"
},
"annotations":[
],
"type":"Json",
"typeProperties":{
"location":{
"type":"AzureBlobStorageLocation",
"fileName":"file.json",
"folderPath":"folder",
"container":"container"
}
}
}
}
Type properties
Currently, the Get Metadata activity can return the following types of metadata information:
Sample output
The Get Metadata results are shown in the activity output. Following are two samples showing extensive
metadata options. To use the results in a subsequent activity, use this pattern:
@{activity('MyGetMetadataActivity').output.itemName} .
{
"exists": true,
"itemName": "testFolder",
"itemType": "Folder",
"lastModified": "2017-02-23T06:17:09Z",
"created": "2017-02-23T06:17:09Z",
"childItems": [
{
"name": "test.avro",
"type": "File"
},
{
"name": "folder hello",
"type": "Folder"
}
]
}
Next steps
Learn about other control flow activities supported by Data Factory:
Execute Pipeline activity
ForEach activity
Lookup activity
Web activity
If Condition activity in Azure Data Factory
Syntax
{
"name": "<Name of the activity>",
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to true or false>",
"type": "Expression"
},
"ifTrueActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
],
"ifFalseActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
}
}
Type properties
PROPERTY    DESCRIPTION    ALLOWED VALUES    REQUIRED
Example
The pipeline in this example copies data from an input folder to an output folder. The output folder is
determined by the value of pipeline parameter: routeSelection. If the value of routeSelection is true, the data is
copied to outputPath1. And, if the value of routeSelection is false, the data is copied to outputPath2.
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with
step-by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial:
create a data factory by using Azure PowerShell.
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "MyIfCondition",
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "@bool(pipeline().parameters.routeSelection)",
"type": "Expression"
},
"ifTrueActivities": [
{
"name": "CopyFromBlobToBlob1",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
                            ],
                            "outputs": [
                                {
                                    "referenceName": "BlobDataset",
                                    "parameters": {
                                        "path": "@pipeline().parameters.outputPath1"
                                    },
                                    "type": "DatasetReference"
                                }
                            ],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
],
"ifFalseActivities": [
{
"name": "CopyFromBlobToBlob2",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath2"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath1": {
"type": "String"
},
"outputPath2": {
"type": "String"
},
"routeSelection": {
"type": "String"
}
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account
name>;AccountKey=<Azure Storage account key>"
}
}
}
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
{
"inputPath": "adftutorial/input",
"outputPath1": "adftutorial/outputIf",
"outputPath2": "adftutorial/outputElse",
"routeSelection": "false"
}
PowerShell commands
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
These commands assume that you have saved the JSON files into the folder: C:\ADF.
Connect-AzAccount
Select-AzSubscription "<Your subscription name>"
while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId

    if ($run) {
        if ($run.Status -ne 'InProgress') {
            Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
            $run
            break
        }
        Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
    }
    Start-Sleep -Seconds 30
}

Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result
Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Lookup activity in Azure Data Factory
Supported capabilities
Note the following:
The Lookup activity can return up to 5,000 rows; if the result set contains more records, the first 5,000 rows
will be returned.
The Lookup activity output supports up to 4 MB in size; the activity will fail if the size exceeds the limit.
The longest duration for the Lookup activity before timeout is 24 hours.
When you use a query or stored procedure to look up data, make sure it returns one and exactly one result set.
Otherwise, the Lookup activity fails.
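As an illustration of the query case, a minimal sketch of Lookup typeProperties that runs a single-result-set query against an Azure SQL source. The dataset name, table, and column here are hypothetical:

"typeProperties": {
    "source": {
        "type": "AzureSqlSource",
        "sqlReaderQuery": "SELECT TOP 1 WatermarkValue FROM dbo.Watermark"
    },
    "dataset": {
        "referenceName": "WatermarkDataset",
        "type": "DatasetReference"
    },
    "firstRowOnly": true
}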
The following data sources are supported for Lookup activity.
Azure Files
DB2
Drill
Google BigQuery
Greenplum
HBase
Hive
Apache Impala
Informix
MariaDB
Microsoft Access
MySQL
Netezza
Oracle
Phoenix
PostgreSQL
Presto
SAP HANA
SAP Table
Snowflake
Spark
SQL Server
Sybase
Teradata
Vertica
NoSQL Cassandra
Couchbase (Preview)
File Amazon S3
File System
FTP
HDFS
SFTP
Generic OData
Generic ODBC
Concur (Preview)
Dataverse
Dynamics 365
Dynamics AX
Dynamics CRM
Google AdWords
HubSpot
Jira
Magento (Preview)
Marketo (Preview)
PayPal (Preview)
QuickBooks (Preview)
Salesforce
SAP ECC
ServiceNow
Shopify (Preview)
Square (Preview)
Xero
Zoho (Preview)
NOTE
Any connector marked as Preview means that you can try it out and give us feedback. If you want to take a dependency
on preview connectors in your solution, please contact Azure support.
Syntax
{
"name":"LookupActivity",
"type":"Lookup",
"typeProperties":{
"source":{
"type":"<source type>"
},
"dataset":{
"referenceName":"<source dataset name>",
"type":"DatasetReference"
},
"firstRowOnly":<true or false>
}
}
Type properties
NAME    DESCRIPTION    TYPE    REQUIRED?
NOTE
Source columns with ByteArray type aren't supported.
Structure isn't supported in dataset definitions. For text-format files, use the header row to provide the column name.
If your lookup source is a JSON file, the jsonPathDefinition setting for reshaping the JSON object isn't supported.
The entire object will be retrieved.
When firstRowOnly is set to false , the output format is as shown in the following code. A count
field indicates how many records are returned. Detailed values are displayed under a fixed value array.
In such a case, the Lookup activity is followed by a Foreach activity. You pass the value array to the
ForEach activity items field by using the pattern of @activity('MyLookupActivity').output.value . To
access elements in the value array, use the following syntax:
@{activity('lookupActivity').output.value[zero based index].propertyname} . An example is
@{activity('lookupActivity').output.value[0].schema} .
{
"count": "2",
"value": [
{
"Id": "1",
"schema":"dbo",
"table":"Table1"
},
{
"Id": "2",
"schema":"dbo",
"table":"Table2"
}
]
}
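For example, a minimal sketch (with hypothetical activity names) of wiring this Lookup output into a ForEach activity:

{
    "name": "ForEachTable",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "MyLookupActivity",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('MyLookupActivity').output.value",
            "type": "Expression"
        },
        "activities": [
            {
                "<Activity definition that uses @item().schema and @item().table>"
            }
        ]
    }
}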
Example
In this example, the pipeline contains two activities: Lookup and Copy . The Copy Activity copies data from a
SQL table in your Azure SQL Database instance to Azure Blob storage. The name of the SQL table is stored in a
JSON file in Blob storage. The Lookup activity looks up the table name at runtime. JSON is modified dynamically
by using this approach. You don't need to redeploy pipelines or datasets.
This example demonstrates lookup for the first row only. For lookup for all rows and to chain the results with
ForEach activity, see the samples in Copy multiple tables in bulk by using Azure Data Factory.
Pipeline
The Lookup activity is configured to use LookupDataset , which refers to a location in Azure Blob storage.
The Lookup activity reads the name of the SQL table from a JSON file in this location.
The Copy Activity uses the output of the Lookup activity, which is the name of the SQL table. The tableName
property in the SourceDataset is configured to use the output from the Lookup activity. Copy Activity
copies data from the SQL table to a location in Azure Blob storage. The location is specified by the
SinkDataset property.
{
"name": "LookupPipelineDemo",
"properties": {
"activities": [
{
"name": "LookupActivity",
"type": "Lookup",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "JsonSource",
"storeSettings": {
"type": "AzureBlobStorageReadSettings",
"recursive": true
},
"formatSettings": {
"type": "JsonReadSettings"
}
},
"dataset": {
"referenceName": "LookupDataset",
"type": "DatasetReference"
},
"firstRowOnly": true
}
},
{
"name": "CopyActivity",
"type": "Copy",
"dependsOn": [
{
"activity": "LookupActivity",
"dependencyConditions": [
"Succeeded"
]
}
],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": {
"value": "select * from [@{activity('LookupActivity').output.firstRow.schema}].
[@{activity('LookupActivity').output.firstRow.table}]",
"type": "Expression"
},
"queryTimeout": "02:00:00",
"partitionOption": "None"
},
"sink": {
"type": "DelimitedTextSink",
"storeSettings": {
"type": "AzureBlobStorageWriteSettings"
},
"formatSettings": {
"type": "DelimitedTextWriteSettings",
"quoteAllText": true,
"fileExtension": ".txt"
}
},
"enableStaging": false,
"translator": {
"type": "TabularTranslator",
"typeConversion": true,
"typeConversionSettings": {
"allowDataTruncation": true,
"treatBooleanAsNumber": false
}
}
},
"inputs": [
{
"referenceName": "SourceDataset",
"type": "DatasetReference",
"parameters": {
"schemaName": {
"value": "@activity('LookupActivity').output.firstRow.schema",
"type": "Expression"
},
"tableName": {
"value": "@activity('LookupActivity').output.firstRow.table",
"type": "Expression"
}
}
}
],
"outputs": [
{
"referenceName": "SinkDataset",
"type": "DatasetReference",
"parameters": {
"schema": {
"value": "@activity('LookupActivity').output.firstRow.schema",
"type": "Expression"
},
"table": {
"value": "@activity('LookupActivity').output.firstRow.table",
"type": "Expression"
}
}
}
]
}
],
"annotations": [],
"lastPublishTime": "2020-08-17T10:48:25Z"
}
}
Lookup dataset
The lookup dataset is the sourcetable.json file in the Azure Storage lookup folder specified by the
AzureBlobStorageLinkedService type.
{
"name": "LookupDataset",
"properties": {
"linkedServiceName": {
"referenceName": "AzureBlobStorageLinkedService",
"type": "LinkedServiceReference"
},
"annotations": [],
"type": "Json",
"typeProperties": {
"location": {
"type": "AzureBlobStorageLocation",
"fileName": "sourcetable.json",
"container": "lookup"
}
}
}
}
{
"name": "SourceDataset",
"properties": {
"linkedServiceName": {
"referenceName": "AzureSqlDatabase",
"type": "LinkedServiceReference"
},
"parameters": {
"schemaName": {
"type": "string"
},
"tableName": {
"type": "string"
}
},
"annotations": [],
"type": "AzureSqlTable",
"schema": [],
"typeProperties": {
"schema": {
"value": "@dataset().schemaName",
"type": "Expression"
},
"table": {
"value": "@dataset().tableName",
"type": "Expression"
}
}
}
}
sourcetable.json
You can use following two kinds of formats for sourcetable.json file.
Set of objects
{
"Id":"1",
"schema":"dbo",
"table":"Table1"
}
{
"Id":"2",
"schema":"dbo",
"table":"Table2"
}
Array of objects
[
{
"Id": "1",
"schema":"dbo",
"table":"Table1"
},
{
"Id": "2",
"schema":"dbo",
"table":"Table2"
}
]
LIMITATION    WORKAROUND
The Lookup activity has a maximum of 5,000 rows and a maximum size of 4 MB.    Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.
Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline activity
ForEach activity
GetMetadata activity
Web activity
Set Variable Activity in Azure Data Factory
Type properties
PROPERTY    DESCRIPTION    REQUIRED
Incrementing a variable
A common scenario involving variables in Azure Data Factory is using a variable as an iterator within an Until or
ForEach activity. In a Set Variable activity, you cannot reference the variable being set in the value field. To
work around this limitation, set a temporary variable and then create a second Set Variable activity. The second
Set Variable activity sets the value of the iterator to the temporary variable.
Below is an example of this pattern:
{
"name": "pipeline3",
"properties": {
"activities": [
{
"name": "Set I",
"type": "SetVariable",
"dependsOn": [
{
"activity": "Increment J",
"dependencyConditions": [
"Succeeded"
]
}
],
"userProperties": [],
"typeProperties": {
"variableName": "i",
"value": {
"value": "@variables('j')",
"type": "Expression"
}
}
},
{
"name": "Increment J",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "j",
"value": {
"value": "@string(add(int(variables('i')), 1))",
"type": "Expression"
}
}
}
],
"variables": {
"i": {
"type": "String",
"defaultValue": "0"
},
"j": {
"type": "String",
"defaultValue": "0"
}
},
"annotations": []
}
}
Variables are currently scoped at the pipeline level. This means that they are not thread safe and can cause
unexpected and undesired behavior if they are accessed from within a parallel iteration activity such as a foreach
loop, especially when the value is also being modified within that foreach activity.
Next steps
Learn about a related control flow activity supported by Data Factory:
Append Variable Activity
Switch activity in Azure Data Factory
Syntax
{
"name": "<Name of the activity>",
"type": "Switch",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to some string value>",
"type": "Expression"
},
"cases": [
{
"value": "<string value that matches expression evaluation>",
"activities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
}
],
"defaultActivities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
}
}
Type properties
PROPERTY    DESCRIPTION    ALLOWED VALUES    REQUIRED
Example
The pipeline in this example copies data from an input folder to an output folder. The output folder is
determined by the value of pipeline parameter: routeSelection.
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with
step-by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial:
create a data factory by using Azure PowerShell.
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"name": "MySwitch",
"type": "Switch",
"typeProperties": {
"expression": {
"value": "@pipeline().parameters.routeSelection",
"type": "Expression"
},
"cases": [
{
"value": "1",
"activities": [
{
"name": "CopyFromBlobToBlob1",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
                                        {
                                            "referenceName": "BlobDataset",
                                            "parameters": {
                                                "path": "@pipeline().parameters.outputPath1"
                                            },
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
},
{
"value": "2",
"activities": [
{
"name": "CopyFromBlobToBlob2",
"type": "Copy",
"inputs": [
                                        {
                                            "referenceName": "BlobDataset",
                                            "parameters": {
                                                "path": "@pipeline().parameters.inputPath"
                                            },
"type": "DatasetReference"
}
],
"outputs": [
                                        {
                                            "referenceName": "BlobDataset",
                                            "parameters": {
                                                "path": "@pipeline().parameters.outputPath2"
                                            },
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
},
{
"value": "3",
"activities": [
{
"name": "CopyFromBlobToBlob3",
"type": "Copy",
"inputs": [
                                        {
                                            "referenceName": "BlobDataset",
                                            "parameters": {
                                                "path": "@pipeline().parameters.inputPath"
                                            },
"type": "DatasetReference"
}
                                    ],
                                    "outputs": [
                                        {
                                            "referenceName": "BlobDataset",
                                            "parameters": {
                                                "path": "@pipeline().parameters.outputPath3"
                                            },
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}
}
]
                        }
                    ],
"defaultActivities": []
}
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath1": {
"type": "String"
},
"outputPath2": {
"type": "String"
},
"outputPath3": {
"type": "String"
},
"routeSelection": {
"type": "String"
}
}
}
}
{
"name": "AzureStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=<Azure Storage account
name>;AccountKey=<Azure Storage account key>"
}
}
}
{
    "inputPath": "adftutorial/input",
    "outputPath1": "adftutorial/outputCase1",
    "outputPath2": "adftutorial/outputCase2",
    "outputPath3": "adftutorial/outputCase3",
"routeSelection": "1"
}
PowerShell commands
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
These commands assume that you've saved the JSON files into the folder: C:\ADF.
Connect-AzAccount
Select-AzSubscription "<Your subscription name>"
while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId

    if ($run) {
        if ($run.Status -ne 'InProgress') {
            Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
            $run
            break
        }
        Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
    }
    Start-Sleep -Seconds 30
}

Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
$result
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until activity in Azure Data Factory
Syntax
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "<expression that evaluates to true or false>",
"type": "Expression"
},
"timeout": "<time out for the loop. for example: 00:01:00 (1 minute)>",
"activities": [
{
"<Activity 1 definition>"
},
{
"<Activity 2 definition>"
},
{
"<Activity N definition>"
}
]
},
"name": "MyUntilActivity"
}
Type properties
PROPERTY    DESCRIPTION    ALLOWED VALUES    REQUIRED
Example 1
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with
step-by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial:
create a data factory by using Azure PowerShell.
Example 2
The pipeline in this sample copies data from an input folder to an output folder in a loop. The loop terminates
when the value for the repeat parameter is set to false or it times out after one minute.
Pipeline with Until activity (Adfv2QuickStartPipeline.json)
{
"name": "Adfv2QuickStartPipeline",
"properties": {
"activities": [
{
"type": "Until",
"typeProperties": {
"expression": {
"value": "@equals('false', pipeline().parameters.repeat)",
"type": "Expression"
                    },
"timeout": "00:01:00",
"activities": [
{
"name": "CopyFromBlobToBlob",
"type": "Copy",
"inputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.inputPath"
},
"type": "DatasetReference"
}
],
"outputs": [
{
"referenceName": "BlobDataset",
"parameters": {
"path": "@pipeline().parameters.outputPath"
},
"type": "DatasetReference"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
},
"policy": {
"retry": 1,
"timeout": "00:10:00",
"retryIntervalInSeconds": 60
}
}
]
},
"name": "MyUntilActivity"
}
],
"parameters": {
"inputPath": {
"type": "String"
},
"outputPath": {
"type": "String"
},
"repeat": {
"type": "String"
}
}
}
}
{
"name": "BlobDataset",
"properties": {
"type": "AzureBlob",
"typeProperties": {
"folderPath": {
"value": "@{dataset().path}",
"type": "Expression"
}
},
"linkedServiceName": {
"referenceName": "AzureStorageLinkedService",
"type": "LinkedServiceReference"
},
"parameters": {
"path": {
"type": "String"
}
}
}
}
{
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/outputUntil",
"repeat": "true"
}
PowerShell commands
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
These commands assume that you have saved the JSON files into the folder: C:\ADF.
Connect-AzAccount
Select-AzSubscription "<Your subscription name>"
while ($True) {
    $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $DataFactoryName -PipelineRunId $runId

    if ($run) {
        if ($run.Status -ne 'InProgress') {
            Write-Host "Pipeline run finished. The status is: " $run.Status -foregroundcolor "Yellow"
            $run
            break
        }
        Write-Host "Pipeline is running...status: InProgress" -foregroundcolor "Yellow"
        Write-Host "Activity run details:" -foregroundcolor "Yellow"
        $result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName $resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -RunStartedBefore (Get-Date).AddMinutes(30)
        $result
    }
    Start-Sleep -Seconds 15
}
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Validation activity in Azure Data Factory
Syntax
{
"name": "Validation_Activity",
"type": "Validation",
"typeProperties": {
"dataset": {
"referenceName": "Storage_File",
"type": "DatasetReference"
},
"timeout": "7.00:00:00",
"sleep": 10,
"minimumSize": 20
}
},
{
"name": "Validation_Activity_Folder",
"type": "Validation",
"typeProperties": {
"dataset": {
"referenceName": "Storage_Folder",
"type": "DatasetReference"
},
"timeout": "7.00:00:00",
"sleep": 10,
"childItems": true
}
}
Type properties
PROPERTY    DESCRIPTION    ALLOWED VALUES    REQUIRED
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Execute wait activity in Azure Data Factory
When you use a Wait activity in a pipeline, the pipeline waits for the specified period of time before continuing
with execution of subsequent activities.
APPLIES TO: Azure Data Factory Azure Synapse Analytics
Syntax
{
"name": "MyWaitActivity",
"type": "Wait",
"typeProperties": {
"waitTimeInSeconds": 1
}
}
Type properties
PROPERTY    DESCRIPTION    ALLOWED VALUES    REQUIRED
Example
NOTE
This section provides JSON definitions and sample PowerShell commands to run the pipeline. For a walkthrough with
step-by-step instructions to create a Data Factory pipeline by using Azure PowerShell and JSON definitions, see tutorial:
create a data factory by using Azure PowerShell.
Next steps
See other control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Web activity in Azure Data Factory
NOTE
Web Activity can also be used to invoke URLs that are hosted in a private virtual network by leveraging a self-hosted
integration runtime. The integration runtime should have a line of sight to the URL endpoint.
NOTE
The maximum supported output response payload size is 4 MB.
Syntax
{
"name":"MyWebActivity",
"type":"WebActivity",
"typeProperties":{
"method":"Post",
"url":"<URLEndpoint>",
"connectVia": {
"referenceName": "<integrationRuntimeName>",
"type": "IntegrationRuntimeReference"
        },
"headers":{
"Content-Type":"application/json"
},
"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
},
"datasets":[
{
"referenceName":"<ConsumedDatasetName>",
"type":"DatasetReference",
"parameters":{
...
}
}
],
"linkedServices":[
{
"referenceName":"<ConsumedLinkedServiceName>",
"type":"LinkedServiceReference"
}
]
}
}
Type properties
PROPERTY    DESCRIPTION    ALLOWED VALUES    REQUIRED
url    Target endpoint and path.    String (or expression with resultType of string). The activity will time out at 1 minute with an error if it does not receive a response from the endpoint.    Yes
headers    Headers that are sent to the request. For example, to set the language and type on a request: "headers" : { "Accept-Language": "en-us", "Content-Type": "application/json" }.    String (or expression with resultType of string)    Yes, the Content-Type header is required: "headers":{ "Content-Type":"application/json" }
body    Represents the payload that is sent to the endpoint.    String (or expression with resultType of string).    Required for POST/PUT methods.
NOTE
REST endpoints that the web activity invokes must return a response of type JSON. The activity will timeout at 1 minute
with an error if it does not receive a response from the endpoint.
Authentication
Below are the supported authentication types in the web activity.
None
If authentication is not required, do not include the "authentication" property.
Basic
Specify user name and password to use with the basic authentication.
"authentication":{
"type":"Basic",
"username":"****",
"password":"****"
}
Client certificate
Specify base64-encoded contents of a PFX file and the password.
"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
}
Managed Identity
Specify the resource uri for which the access token will be requested using the managed identity for the data
factory. To call the Azure Resource Management API, use https://management.azure.com/ . For more information
about how managed identities works see the managed identities for Azure resources overview page.
"authentication": {
"type": "MSI",
"resource": "https://management.azure.com/"
}
NOTE
If your data factory is configured with a git repository, you must store your credentials in Azure Key Vault to use basic or
client certificate authentication. Azure Data Factory doesn't store passwords in git.
{
"body": {
"myMessage": "Sample",
"datasets": [{
"name": "MyDataset1",
"properties": {
...
}
}],
"linkedServices": [{
"name": "MyStorageLinkedService1",
"properties": {
...
}
}]
}
}
Example
In this example, the web activity in the pipeline calls a REST endpoint. It passes an Azure SQL linked service and
an Azure SQL dataset to the endpoint. The REST endpoint uses the Azure SQL connection string to connect to
the logical SQL server and returns the name of the instance of SQL server.
Pipeline definition
{
"name": "<MyWebActivityPipeline>",
"properties": {
"activities": [
{
"name": "<MyWebActivity>",
"type": "WebActivity",
"typeProperties": {
"method": "Post",
"url": "@pipeline().parameters.url",
"headers": {
"Content-Type": "application/json"
},
"authentication": {
"type": "ClientCertificate",
"pfx": "*****",
"password": "*****"
},
"datasets": [
{
"referenceName": "MySQLDataset",
"type": "DatasetReference",
"parameters": {
"SqlTableName": "@pipeline().parameters.sqlTableName"
}
}
],
"linkedServices": [
{
"referenceName": "SqlLinkedService",
"type": "LinkedServiceReference"
}
]
}
}
],
"parameters": {
"sqlTableName": {
"type": "String"
},
"url": {
"type": "String"
}
}
}
}
Pipeline parameter values
{
"sqlTableName": "department",
"url": "https://adftes.azurewebsites.net/api/execute/running"
}
Next steps
See other control flow activities supported by Data Factory:
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Webhook activity in Azure Data Factory
4/22/2021 • 3 minutes to read • Edit Online
IMPORTANT
WebHook activity now allows you to surface error status and custom messages back to the activity and pipeline. Set
reportStatusOnCallBack to true, and include StatusCode and Error in the callback payload. For more information, see the
Additional notes section.
Syntax
{
"name": "MyWebHookActivity",
"type": "WebHook",
"typeProperties": {
"method": "POST",
"url": "<URLEndpoint>",
"headers": {
"Content-Type": "application/json"
},
"body": {
"key": "value"
},
"timeout": "00:03:00",
"reportStatusOnCallBack": false,
"authentication": {
"type": "ClientCertificate",
"pfx": "****",
"password": "****"
}
}
}
Type properties
PROPERTY | DESCRIPTION | ALLOWED VALUES | REQUIRED
method | The REST API method for the target endpoint. | String. The supported type is "POST". | Yes
Authentication
A webhook activity supports the following authentication types.
None
If authentication isn't required, don't include the authentication property.
Basic
Specify the username and password to use with basic authentication.
"authentication":{
"type":"Basic",
"username":"****",
"password":"****"
}
Client certificate
Specify the Base64-encoded contents of a PFX file and a password.
"authentication":{
"type":"ClientCertificate",
"pfx":"****",
"password":"****"
}
Managed identity
Use the data factory's managed identity to specify the resource URI for which the access token is requested. To
call the Azure Resource Management API, use https://management.azure.com/ . For more information about how
managed identities work, see the managed identities for Azure resources overview.
"authentication": {
"type": "MSI",
"resource": "https://management.azure.com/"
}
NOTE
If your data factory is configured with a Git repository, you must store your credentials in Azure Key Vault to use basic or
client-certificate authentication. Azure Data Factory doesn't store passwords in Git.
Additional notes
Data Factory passes the additional property callBackUri in the body sent to the URL endpoint. Data Factory
expects this URI to be invoked before the specified timeout value. If the URI isn't invoked, the activity fails with
the status "TimedOut".
The webhook activity fails when the call to the custom endpoint fails. Any error message can be added to the
callback body and used in a later activity.
For every REST API call, the client times out if the endpoint doesn't respond within one minute. This behavior is
standard HTTP best practice. To fix this problem, implement a 202 pattern. In the current case, the endpoint
returns 202 (Accepted) and the client polls.
The one-minute timeout on the request has nothing to do with the activity timeout. The latter is used to wait for
the callback specified by callBackUri .
The body passed back to the callback URI must be valid JSON. Set the Content-Type header to
application/json .
When you use the Report status on callback property, you must add the following code to the body when
you make the callback:
{
"Output": {
// output object is used in activity output
"testProp": "testPropValue"
},
"Error": {
// Optional, set it when you want to fail the activity
"ErrorCode": "testErrorCode",
"Message": "error message to show in activity error"
},
"StatusCode": "403" // when status code is >=400, activity is marked as failed
}
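After the callback, the Output object becomes the webhook activity's output, and a later pipeline activity can reference it with a standard activity-output expression. For example (the activity name comes from the syntax example above; testProp is the illustrative property from the callback body):

@activity('MyWebHookActivity').output.testProp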
Next steps
See the following control flow activities supported by Data Factory:
If Condition Activity
Execute Pipeline Activity
For Each Activity
Get Metadata Activity
Lookup Activity
Web Activity
Until Activity
Mapping data flow transformation overview
4/22/2021 • 2 minutes to read • Edit Online
Alter row | Row modifier | Set insert, delete, update, and upsert policies on rows.
Aggregate transformation in mapping data flow
Group by
Select an existing column or create a new computed column to use as a group by clause for your aggregation.
To use an existing column, select it from the dropdown. To create a new computed column, hover over the clause
and click Computed column . This opens the data flow expression builder. Once you create your computed
column, enter the output column name under the Name as field. If you wish to add an additional group by
clause, hover over an existing clause and click the plus icon.
Aggregate columns
Go to the Aggregates tab to build aggregation expressions. You can either overwrite an existing column with
an aggregation, or create a new field with a new name. The aggregation expression is entered in the right-hand
box next to the column name selector. To edit the expression, click on the text box and open the expression
builder. To add more aggregate columns, click on Add above the column list or the plus icon next to an existing
aggregate column. Choose either Add column or Add column pattern . Each aggregation expression must
contain at least one aggregate function.
NOTE
In Debug mode, the expression builder cannot produce data previews with aggregate functions. To view data previews for
aggregate transformations, close the expression builder and view the data via the 'Data Preview' tab.
Column patterns
Use column patterns to apply the same aggregation to a set of columns. This is useful if you wish to persist
many columns from the input schema as they are dropped by default. Use a heuristic such as first() to persist
input columns through the aggregation.
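For example, the following sketch (reusing the stream and column names from the example later in this article) keeps the first value of every column that is not part of the group-by or the aggregation:

MoviesYear aggregate(
    groupBy(year),
    avgrating = avg(toInteger(Rating)),
    each(match(name != 'year' && name != 'Rating'), $$ = first($$))
) ~> AggregateAndPersist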
<incomingStream>
aggregate(
groupBy(
<groupByColumnName> = <groupByExpression1>,
<groupByExpression2>
),
<aggregateColumn1> = <aggregateExpression1>,
<aggregateColumn2> = <aggregateExpression2>,
each(
match(matchExpression),
<metadataColumn1> = <metadataExpression1>,
<metadataColumn2> = <metadataExpression2>
)
) ~> <aggregateTransformationName>
Example
The below example takes an incoming stream MoviesYear and groups rows by column year . The
transformation creates an aggregate column avgrating that evaluates to the average of column Rating . This
aggregate transformation is named AvgComedyRatingByYear .
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below.
MoviesYear aggregate(
groupBy(year),
avgrating = avg(toInteger(Rating))
) ~> AvgComedyRatingByYear
MoviesYear : Derived Column defining year and title columns
AvgComedyRatingByYear : Aggregate transformation for average rating of comedies grouped by year
avgrating : Name of new column being created to hold the aggregated value
Next steps
Define window-based aggregation using the Window transformation
Alter row transformation in mapping data flow
5/8/2020 • 4 minutes to read • Edit Online
Alter Row transformations will only operate on database or CosmosDB sinks in your data flow. The actions that
you assign to rows (insert, update, delete, upsert) won't occur during debug sessions. Run an Execute Data Flow
activity in a pipeline to enact the alter row policies on your database tables.
NOTE
To mark all rows with one policy, you can create a condition for that policy and specify the condition as true() .
View policies in data preview
Use debug mode to view the results of your alter row policies in the data preview pane. A data preview of an
alter row transformation won't produce DDL or DML actions against your target.
Each alter row policy is represented by an icon that indicates whether an insert, update, upsert, or delete action
will occur. The top header shows how many rows are affected by each policy in the preview.
The default behavior is to only allow inserts. To allow updates, upserts, or deletes, check the box in the sink
corresponding to that condition. If updates, upserts, or deletes are enabled, you must specify which key columns
in the sink to match on.
NOTE
If your inserts, updates, or upserts modify the schema of the target table in the sink, the data flow will fail. To modify the
target schema in your database, choose Recreate table as the table action. This will drop and recreate your table with
the new schema definition.
The sink transformation requires either a single key or a series of keys for unique row identification in your
target database. For SQL sinks, set the keys in the sink settings tab. For CosmosDB, set the partition key in the
settings and also set the CosmosDB system field "id" in your sink mapping. For CosmosDB, it is mandatory to
include the system column "id" for updates, upserts, and deletes.
<incomingStream>
alterRow(
insertIf(<condition>?),
updateIf(<condition>?),
deleteIf(<condition>?),
upsertIf(<condition>?),
) ~> <alterRowTransformationName>
Example
The below example is an alter row transformation named CleanData that takes an incoming stream
SpecifyUpsertConditions and creates three alter row conditions. In the previous transformation, a column
named alterRowCondition is calculated that determines whether or not a row is inserted, updated, or deleted in
the database. If the column's string value matches an alter row rule, the row is assigned that policy.
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below:
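A sketch of that script, assuming the alterRowCondition values are the strings 'insert', 'update', and 'delete':

SpecifyUpsertConditions alterRow(
    insertIf(alterRowCondition == 'insert'),
    updateIf(alterRowCondition == 'update'),
    deleteIf(alterRowCondition == 'delete')
) ~> CleanData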
Conditional split transformation in mapping data flow
Configuration
The Split on setting determines whether the row of data flows to the first matching stream or every stream it
matches to.
Use the data flow expression builder to enter an expression for the split condition. To add a new condition, click
on the plus icon in an existing row. A default stream can be added as well for rows that don't match any
condition.
<incomingStream>
split(
<conditionalExpression1>,
<conditionalExpression2>,
...
disjoint: {true | false}
) ~> <splitTx>@(stream1, stream2, ..., <defaultStream>)
Example
The below example is a conditional split transformation named SplitByYear that takes in incoming stream
CleanData . This transformation has two split conditions year < 1960 and year > 1980 . disjoint is false
because the data goes to the first matching condition. Every row matching the first condition goes to output
stream moviesBefore1960 . All remaining rows matching the second condition go to output stream
moviesAfter1980 . All other rows flow through the default stream AllOtherMovies .
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below:
CleanData
split(
year < 1960,
year > 1980,
disjoint: false
) ~> SplitByYear@(moviesBefore1960, moviesAfter1980, AllOtherMovies)
Next steps
Common data flow transformations used with conditional split are the join transformation, lookup
transformation, and the select transformation.
Derived column transformation in mapping data
flow
11/2/2020 • 3 minutes to read • Edit Online
To add more derived columns, click on Add above the column list or the plus icon next to an existing derived
column. Choose either Add column or Add column pattern .
Column patterns
In cases where your schema is not explicitly defined or if you want to update a set of columns in bulk, you will
want to create column patterns. Column patterns allow you to match columns using rules based upon the
column metadata and create derived columns for each matched column. For more information, learn how to
build column patterns in the derived column transformation.
Building schemas using the expression builder
When using the mapping data flow expression builder, you can create, edit, and manage your derived columns
in the Derived Columns section. All columns that are created or changed in the transformation are listed.
Interactively choose which column or pattern you are editing by clicking on the column name. To add an
additional column select Create new and choose whether you wish to add a single column or a pattern.
When working with complex columns, you can create subcolumns. To do this, click on the plus icon next to any
column and select Add subcolumn . For more information on handling complex types in data flow, see JSON
handling in mapping data flow.
Locals
If you are sharing logic across multiple columns or want to compartmentalize your logic, you can create a local
within a derived column transformation. A local is a set of logic that doesn't get propagated downstream to the
following transformation. Locals can be created within the expression builder by going to Expression
elements and selecting Locals . Create a new one by selecting Create new .
Locals can reference any expression element of a derived column, including functions, input schema, parameters,
and other locals. When referencing other locals, order does matter as the referenced local needs to be "above"
the current one.
To reference a local in a derived column, either click on the local from the Expression elements view or
reference it with a colon in front of its name. For example, a local called local1 would be referenced by :local1 .
To edit a local definition, hover over it in the expression elements view and click on the pencil icon.
Data flow script
Syntax
<incomingStream>
derive(
<columnName1> = <expression1>,
<columnName2> = <expression2>,
each(
match(matchExpression),
<metadataColumn1> = <metadataExpression1>,
<metadataColumn2> = <metadataExpression2>
)
) ~> <deriveTransformationName>
Example
The below example is a derived column named CleanData that takes an incoming stream MoviesYear and
creates two derived columns. The first derived column replaces column Rating with Rating's value as an integer
type. The second derived column is a pattern that matches each column whose name starts with 'movies'. For
each matched column, it creates a column movie that is equal to the value of the matched column prefixed with
'movie_'.
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below:
MoviesYear derive(
Rating = toInteger(Rating),
each(
match(startsWith(name,'movies')),
'movie' = 'movie_' + toString($$)
)
) ~> CleanData
Next steps
Learn more about the Mapping Data Flow expression language.
Exists transformation in mapping data flow
5/8/2020 • 2 minutes to read • Edit Online
Configuration
1. Choose which data stream you're checking for existence in the Right stream dropdown.
2. Specify whether you're looking for the data to exist or not exist in the Exist type setting.
3. Select whether or not you want a Custom expression .
4. Choose which key columns you want to compare as your exists conditions. By default, data flow looks for
equality between one column in each stream. To compare via a computed value, hover over the column
dropdown and select Computed column .
Custom expression
To create a free-form expression that contains operators other than "and" and "equals to", select the Custom
expression field. Enter a custom expression via the data flow expression builder by clicking on the blue box.
Broadcast optimization
In joins, lookups and exists transformation, if one or both data streams fit into worker node memory, you can
optimize performance by enabling Broadcasting . By default, the spark engine will automatically decide
whether or not to broadcast one side. To manually choose which side to broadcast, select Fixed .
It's not recommended to disable broadcasting via the Off option unless your joins are running into timeout
errors.
<leftStream>, <rightStream>
exists(
<conditionalExpression>,
negate: { true | false },
broadcast: { 'auto' | 'left' | 'right' | 'both' | 'off' }
) ~> <existsTransformationName>
Example
The below example is an exists transformation named checkForChanges that takes left stream NameNorm2 and
right stream TypeConversions . The exists condition is the expression
NameNorm2@EmpID == TypeConversions@EmpID && NameNorm2@Region == TypeConversions@Region that returns true if both
the EmpID and Region columns in each stream match. As we're checking for existence, negate is false. Because we
aren't manually choosing a side to broadcast in the optimize tab, broadcast has the value 'auto' .
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below:
NameNorm2, TypeConversions
exists(
NameNorm2@EmpID == TypeConversions@EmpID && NameNorm2@Region == TypeConversions@Region,
negate:false,
broadcast: 'auto'
) ~> checkForChanges
Next steps
Similar transformations are Lookup and Join.
Filter transformation in mapping data flow
11/2/2020 • 2 minutes to read • Edit Online
Configuration
Use the data flow expression builder to enter an expression for the filter condition. To open the expression
builder, click on the blue box. The filter condition must be of type boolean. For more information on how to
create an expression, see the expression builder documentation.
<incomingStream>
filter(
<conditionalExpression>
) ~> <filterTransformationName>
Example
The below example is a filter transformation named FilterBefore1960 that takes in incoming stream CleanData .
The filter condition is the expression year <= 1960 .
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below:
CleanData
filter(
year <= 1960
) ~> FilterBefore1960
Next steps
Filter out columns with the select transformation
Flatten transformation in mapping data flow
7/9/2021 • 3 minutes to read • Edit Online
Configuration
The flatten transformation contains the following configuration settings
Unroll by
Select an array to unroll. The output data will have one row per item in each array. If the unroll by array in the
input row is null or empty, there will be one output row with unrolled values as null.
Unroll root
By default, the flatten transformation unrolls an array to the top of the hierarchy it exists in. You can optionally
select an array as your unroll root. The unroll root must be an array of complex objects that either is or contains
the unroll by array. If an unroll root is selected, the output data will contain at least one row per item in the
unroll root. If the input row doesn't have any items in the unroll root, it will be dropped from the output data.
Choosing an unroll root will always output a number of rows less than or equal to the default behavior.
Flatten mapping
Similar to the select transformation, choose the projection of the new structure from incoming fields and the
denormalized array. If a denormalized array is mapped, the output column will be the same data type as the
array. If the unroll by array is an array of complex objects that contains subarrays, mapping an item of that
subarray will output an array.
Refer to the inspect tab and data preview to verify your mapping output.
Rule-based mapping
The flatten transformation supports rule-based mapping allowing you to create dynamic and flexible
transformations that will flatten arrays based on rules and flatten structures based on hierarchy levels.
Matching condition
Enter a pattern matching condition for the column or columns that you wish to flatten using either exact
matching or patterns. Example: like(name,'cust%')
Deep column traversal
Optional setting that tells ADF to handle all subcolumns of a complex object individually instead of handling the
complex object as a whole column.
Hierarchy level
Choose the level of the hierarchy that you would like to expand.
Name matches (regex)
Optionally choose to express your name matching as a regular expression in this box, instead of using the
matching condition above.
Examples
Refer to the following JSON object for the below examples of the flatten transformation
{
"name":"MSFT","location":"Redmond", "satellites": ["Bay Area", "Shanghai"],
"goods": {
"trade":true, "customers":["government", "distributer", "retail"],
"orders":[
{"orderId":1,"orderTotal":123.34,"shipped":{"orderItems":[{"itemName":"Laptop","itemQty":20},
{"itemName":"Charger","itemQty":2}]}},
{"orderId":2,"orderTotal":323.34,"shipped":{"orderItems":[{"itemName":"Mice","itemQty":2},
{"itemName":"Keyboard","itemQty":1}]}}
]}}
{"name":"Company1","location":"Seattle", "satellites": ["New York"],
"goods":{"trade":false, "customers":["store1", "store2"],
"orders":[
{"orderId":4,"orderTotal":123.34,"shipped":{"orderItems":[{"itemName":"Laptop","itemQty":20},
{"itemName":"Charger","itemQty":3}]}},
{"orderId":5,"orderTotal":343.24,"shipped":{"orderItems":[{"itemName":"Chair","itemQty":4},
{"itemName":"Lamp","itemQty":2}]}}
]}}
{"name": "Company2", "location": "Bellevue",
"goods": {"trade": true, "customers":["Bank"], "orders": [{"orderId": 4, "orderTotal": 123.34}]}}
{"name": "Company3", "location": "Kirkland"}
Output
{ 'MSFT', 'government'}
{ 'MSFT', 'distributer'}
{ 'MSFT', 'retail'}
{ 'Company1', 'store1'}
{ 'Company1', 'store2'}
{ 'Company2', 'Bank'}
{ 'Company3', null}
<incomingStream>
foldDown(unroll(<unroll cols>),
mapColumn(
name,
each(<array>(type == '<arrayDataType>')),
each(<array>, match(true())),
location
)) ~> <transformationName>
Next steps
Use the Pivot transformation to pivot rows to columns.
Use the Unpivot transformation to pivot columns to rows.
Join transformation in mapping data flow
11/2/2020 • 5 minutes to read • Edit Online
Join types
Mapping data flows currently supports five different join types.
Inner Join
Inner join only outputs rows that have matching values in both tables.
Left Outer
Left outer join returns all rows from the left stream and matched records from the right stream. If a row from
the left stream has no match, the output columns from the right stream are set to NULL. The output will be the
rows returned by an inner join plus the unmatched rows from the left stream.
NOTE
The Spark engine used by data flows will occasionally fail due to possible cartesian products in your join conditions. If this
occurs, you can switch to a custom cross join and manually enter your join condition. This may result in slower
performance in your data flows as the execution engine may need to calculate all rows from both sides of the relationship
and then filter rows.
Right Outer
Right outer join returns all rows from the right stream and matched records from the left stream. If a row from
the right stream has no match, the output columns from the left stream are set to NULL. The output will be the
rows returned by an inner join plus the unmatched rows from the right stream.
Full Outer
Full outer join outputs all columns and rows from both sides with NULL values for columns that aren't matched.
Custom cross join
Cross join outputs the cross product of the two streams based upon a condition. If you're using a condition that
isn't equality, specify a custom expression as your cross join condition. The output stream will be all rows that
meet the join condition.
You can use this join type for non-equi joins and OR conditions.
If you would like to explicitly produce a full cartesian product, use the Derived Column transformation in each of
the two independent streams before the join to create a synthetic key to match on. For example, create a new
column in Derived Column in each stream called SyntheticKey and set it equal to 1 . Then use
a.SyntheticKey == b.SyntheticKey as your custom join expression.
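A sketch of this pattern in data flow script (the incoming stream names are illustrative; the SyntheticKey columns come from the derived columns described above):

StreamA derive(SyntheticKey = 1) ~> LeftWithKey
StreamB derive(SyntheticKey = 1) ~> RightWithKey
LeftWithKey, RightWithKey join(
    LeftWithKey@SyntheticKey == RightWithKey@SyntheticKey,
    joinType: 'cross',
    broadcast: 'auto'
) ~> FullCartesianProduct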
NOTE
Make sure to include at least one column from each side of your left and right relationship in a custom cross join.
Executing cross joins with static values instead of columns from each side results in full scans of the entire dataset, causing
your data flow to perform poorly.
Configuration
1. Choose which data stream you're joining with in the Right stream dropdown.
2. Select your Join type
3. Choose which key columns you want to match on for your join condition. By default, data flow looks for
equality between one column in each stream. To compare via a computed value, hover over the column
dropdown and select Computed column .
Non-equi joins
To use a conditional operator such as not equals (!=) or greater than (>) in your join conditions, change the
operator dropdown between the two columns. Non-equi joins require at least one of the two streams to be
broadcasted using Fixed broadcasting in the Optimize tab.
Optimizing join performance
Unlike merge join in tools like SSIS, the join transformation isn't a mandatory merge join operation. The join
keys don't require sorting. The join operation occurs based on the optimal join operation in Spark, either
broadcast or map-side join.
In joins, lookups and exists transformation, if one or both data streams fit into worker node memory, you can
optimize performance by enabling Broadcasting . By default, the spark engine will automatically decide
whether or not to broadcast one side. To manually choose which side to broadcast, select Fixed .
It's not recommended to disable broadcasting via the Off option unless your joins are running into timeout
errors.
Self-Join
To self-join a data stream with itself, alias an existing stream with a select transformation. Create a new branch
by clicking on the plus icon next to a transformation and selecting New branch . Add a select transformation to
alias the original stream. Add a join transformation and choose the original stream as the Left stream and the
select transformation as the Right stream .
Testing join conditions
When testing the join transformations with data preview in debug mode, use a small set of known data. When
sampling rows from a large dataset, you can't predict which rows and keys will be read for testing. The result is
non-deterministic, meaning that your join conditions may not return any matches.
<leftStream>, <rightStream>
join(
<conditionalExpression>,
joinType: { 'inner' | 'outer' | 'left_outer' | 'right_outer' | 'cross' },
broadcast: { 'auto' | 'left' | 'right' | 'both' | 'off' }
) ~> <joinTransformationName>
In the Data Factory UX, this transformation looks like the below image:
The data flow script for this transformation is in the snippet below:
TripData, TripFare
join(
hack_license == { hack_license}
&& TripData@medallion == TripFare@medallion
&& vendor_id == { vendor_id}
&& pickup_datetime == { pickup_datetime},
joinType:'inner',
broadcast: 'left'
)~> JoinMatchedData
LeftStream, RightStream
join(
leftstreamcolumn > rightstreamcolumn,
joinType:'cross',
broadcast: 'none'
)~> JoiningColumns
Next steps
After joining data, create a derived column and sink your data to a destination data store.
Lookup transformation in mapping data flow
5/11/2021 • 3 minutes to read • Edit Online
Configuration
Primary stream: The incoming stream of data. This stream is equivalent to the left side of a join.
Lookup stream: The data that is appended to the primary stream. Which data is added is determined by the
lookup conditions. This stream is equivalent to the right side of a join.
Match multiple rows: If enabled, a row in the primary stream with multiple matches in the lookup stream will return multiple rows.
Otherwise, only a single row will be returned based upon the 'Match on' condition.
Match on: Only visible if 'Match multiple rows' is not selected. Choose whether to match on any row, the first
match, or the last match. Any row is recommended as it executes the fastest. If first row or last row is selected,
you'll be required to specify sort conditions.
Lookup conditions: Choose which columns to match on. If the equality condition is met, then the rows will be
considered a match. Hover and select 'Computed column' to extract a value using the data flow expression
language.
All columns from both streams are included in the output data. To drop duplicate or unwanted columns, add a
select transformation after your lookup transformation. Columns can also be dropped or renamed in a sink
transformation.
Non-equi joins
To use a conditional operator such as not equals (!=) or greater than (>) in your lookup conditions, change the
operator dropdown between the two columns. Non-equi joins require at least one of the two streams to be
broadcasted using Fixed broadcasting in the Optimize tab.
Analyzing matched rows
After your lookup transformation, the function isMatch() can be used to see if the lookup matched for
individual rows.
An example of this pattern is using the conditional split transformation to split on the isMatch() function. In the
example above, matching rows go through the top stream and non-matching rows flow through the NoMatch
stream.
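A sketch of that split in data flow script, assuming the lookup output stream is named LookupKeys as in the example later in this article (the output stream names are illustrative):

LookupKeys split(
    isMatch(),
    disjoint: false
) ~> SplitMatched@(Matched, NoMatch)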
Broadcast optimization
In joins, lookups and exists transformation, if one or both data streams fit into worker node memory, you can
optimize performance by enabling Broadcasting . By default, the spark engine will automatically decide
whether or not to broadcast one side. To manually choose which side to broadcast, select Fixed .
It's not recommended to disable broadcasting via the Off option unless your joins are running into timeout
errors.
Cached lookup
If you're doing multiple smaller lookups on the same source, a cached sink and lookup may be a better choice
than the lookup transformation. Common examples where a cache sink may be better are looking up a max
value on a data store and matching error codes to an error message database. For more information, learn
about cache sinks and cached lookups.
<leftStream>, <rightStream>
lookup(
<lookupConditionExpression>,
multiple: { true | false },
pickup: { 'first' | 'last' | 'any' }, ## Only required if false is selected for multiple
{ desc | asc }( <sortColumn>, { true | false }), ## Only required if 'first' or 'last' is selected; true/false determines whether to put nulls first
broadcast: { 'auto' | 'left' | 'right' | 'both' | 'off' }
) ~> <lookupTransformationName>
Example
The data flow script for the above lookup configuration is in the code snippet below.
SQLProducts, DimProd lookup(ProductID == ProductKey,
multiple: false,
pickup: 'first',
asc(ProductKey, true),
broadcast: 'auto')~> LookupKeys
Next steps
The join and exists transformations both take in multiple stream inputs
Use a conditional split transformation with isMatch() to split rows on matching and non-matching values
Creating a new branch in mapping data flow
4/17/2021 • 2 minutes to read • Edit Online
In the below example, the data flow is reading taxi trip data. Output aggregated by both day and vendor is
required. Instead of creating two separate data flows that read from the same source, a new branch can be
added. This way both aggregations can be executed as part of the same data flow.
NOTE
When clicking the plus (+) to add transformations to your graph, you will only see the New Branch option when there are
subsequent transformation blocks. This is because New Branch creates a reference to the existing stream and requires
further upstream processing to operate on. If you do not see the New Branch option, add a Derived Column or other
transformation first, then return to the previous block and you will see New Branch as an option.
Next steps
After branching, you may want to use the data flow transformations
Parse transformation in mapping data flow
5/11/2021 • 2 minutes to read • Edit Online
Configuration
In the parse transformation configuration panel, you will first pick the type of data contained in the columns that
you wish to parse inline. The parse transformation also contains the following configuration settings.
Column
Similar to derived columns and aggregates, this is where you either modify an existing column by selecting it
from the drop-down picker or type in the name of a new column. ADF stores the parsed source data in this
column. In most cases, you will want to define a new column that parses the incoming embedded document field.
Expression
Use the expression builder to set the source for your parsing. This can be as simple as just selecting the source
column with the self-contained data that you wish to parse, or you can create complex expressions to parse.
Example expressions
Source string data: chrome|steel|plastic
Refer to the inspect tab and data preview to verify your output is mapped properly.
Examples
source(output(
name as string,
location as string,
satellites as string[],
goods as (trade as boolean, customers as string[], orders as (orderId as string, orderTotal as double,
shipped as (orderItems as (itemName as string, itemQty as string)[]))[])
),
allowSchemaDrift: true,
validateSchema: false,
ignoreNoFilesFound: false,
documentForm: 'documentPerLine') ~> JsonSource
source(output(
movieId as string,
title as string,
genres as string
),
allowSchemaDrift: true,
validateSchema: false,
ignoreNoFilesFound: false) ~> CsvSource
JsonSource derive(jsonString = toString(goods)) ~> StringifyJson
StringifyJson parse(json = jsonString ? (trade as boolean,
customers as string[]),
format: 'json',
documentForm: 'arrayOfDocuments') ~> ParseJson
CsvSource derive(csvString = 'Id|name|year\n\'1\'|\'test1\'|\'1999\'') ~> CsvString
CsvString parse(csv = csvString ? (id as integer,
name as string,
year as string),
format: 'delimited',
columnNamesAsHeader: true,
columnDelimiter: '|',
nullValue: '',
documentForm: 'documentPerLine') ~> ParseCsv
ParseJson select(mapColumn(
jsonString,
json
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> KeepStringAndParsedJson
ParseCsv select(mapColumn(
csvString,
csv
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> KeepStringAndParsedCsv
Next steps
Use the Flatten transformation to pivot rows to columns.
Use the Derived column transformation to pivot columns to rows.
Pivot transformation in mapping data flow
11/2/2020 • 3 minutes to read • Edit Online
Configuration
The pivot transformation requires three different inputs: group by columns, the pivot key, and how to generate
the pivoted columns
Group by
Select which columns to aggregate the pivoted columns over. The output data will group all rows with the same
group by values into one row. The aggregation done in the pivoted column will occur over each group.
This section is optional. If no group by columns are selected, the entire data stream will be aggregated and only
one row will be outputted.
Pivot key
The pivot key is the column whose row values get pivoted into new columns. By default, the pivot
transformation will create a new column for each unique row value.
In the section labeled Value , you can enter specific row values to be pivoted. Only the row values entered in this
section will be pivoted. Enabling Null value will create a pivoted column for the null values in the column.
Pivoted columns
For each unique pivot key value that becomes a column, generate an aggregated row value for each group. You
can create multiple columns per pivot key. Each pivot column must contain at least one aggregate function.
Column name pattern: Select how to format the column name of each pivot column. The output column
name will be a combination of the pivot key value, the column prefix, and optional prefix, suffix, and middle characters.
Column arrangement: If you generate more than one pivot column per pivot key, choose how you want the
columns to be ordered.
Column prefix: If you generate more than one pivot column per pivot key, enter a column prefix for each
column. This setting is optional if you only have one pivoted column.
Help graphic
The below help graphic shows how the different pivot components interact with one another
Pivot metadata
If no values are specified in the pivot key configuration, the pivoted columns will be dynamically generated at
run time. The number of pivoted columns will equal the number of unique pivot key values multiplied by the
number of pivot columns. As this can be a changing number, the UX will not display the column metadata in the
Inspect tab and there will be no column propagation. To transform these columns, use the column pattern
capabilities of mapping data flow.
If specific pivot key values are set, the pivoted columns will appear in the metadata. The column names will be
available to you in the Inspect and Sink mapping.
Generate metadata from drifted columns
Pivot generates new column names dynamically based on row values. You can add these new columns into the
metadata that can be referenced later in your data flow. To do this, use the map drifted quick action in data
preview.
<incomingStreamName>
pivot(groupBy(<groupByColumnName>),
pivotBy(<pivotKeyColumn>, [<specifiedColumnName1>,...,<specifiedColumnNameN>]),
<pivotColumnPrefix> = <pivotedColumnValue>,
columnNaming: '< prefix >< $N | $V >< middle >< $N | $V >< suffix >',
lateral: { 'true' | 'false'}
) ~> <pivotTransformationName>
Example
The screens shown in the configuration section have the following data flow script:
BasketballPlayerStats pivot(groupBy(Tm),
pivotBy(Pos),
{} = count(),
columnNaming: '$V$N count',
lateral: true) ~> PivotExample
Next steps
Try the unpivot transformation to turn column values into row values.
Rank transformation in mapping data flow
4/22/2021 • 2 minutes to read • Edit Online
Configuration
Case insensitive: If a sort column is of type string, choose whether case is factored into the ranking.
Dense: If enabled, the rank column will be dense ranked. Each rank count will be a consecutive number and
rank values won't be skipped after a tie.
Rank column: The name of the rank column generated. This column will be of type long.
Sort conditions: Choose which columns you're sorting by and in which order the sort happens. The order
determines sorting priority.
The above configuration takes incoming basketball data and creates a rank column called 'pointsRanking'. The
row with the highest value of the column PTS will have a pointsRanking value of 1.
<incomingStream>
rank(
desc(<sortColumn1>),
asc(<sortColumn2>),
...,
caseInsensitive: { true | false },
dense: { true | false },
output(<rankColumn> as long)
) ~> <rankTransformationName>
Example
The data flow script for the above rank configuration is in the following code snippet.
PruneColumns
rank(
desc(PTS, true),
caseInsensitive: false,
output(pointsRanking as long),
dense: false
) ~> RankByPoints
Next steps
Filter rows based upon the rank values using the filter transformation.
Select transformation in mapping data flow
11/2/2020 • 5 minutes to read • Edit Online
Fixed mapping
If there are fewer than 50 columns defined in your projection, all defined columns will have a fixed mapping by
default. A fixed mapping takes a defined, incoming column and maps it to an exact name.
NOTE
You can't map or rename a drifted column using a fixed mapping
Rule-based mapping
Each rule-based mapping requires two inputs: the condition to match on and what to name each
mapped column. Both values are inputted via the expression builder. In the left expression box, enter your
boolean match condition. In the right expression box, specify what the matched column will be mapped to.
Use $$ syntax to reference the input name of a matched column. Using the above image as an example, say a
user wants to match on all string columns whose names are shorter than six characters. If one incoming column
was named test , the expression $$ + '_short' will rename the column test_short . If that's the only mapping
that exists, all columns that don't meet the condition will be dropped from the outputted data.
Patterns match both drifted and defined columns. To see which defined columns are mapped by a rule, click the
eyeglasses icon next to the rule. Verify your output using data preview.
Regex mapping
If you click the downward chevron icon, you can specify a regex-mapping condition. A regex-mapping condition
matches all column names that match the specified regex condition. This can be used in combination with
standard rule-based mappings.
The above example matches on regex pattern (r) or any column name that contains a lower case r. Similar to
standard rule-based mapping, all matched columns are altered by the condition on the right using $$ syntax.
If you have multiple regex matches in your column name, you can refer to specific matches using $n where 'n'
refers to which match. For example, '$2' refers to the second match within a column name.
Rule -based hierarchies
If your defined projection has a hierarchy, you can use rule-based mapping to map the hierarchy's subcolumns.
Specify a matching condition and the complex column whose subcolumns you wish to map. Every matched
subcolumn will be outputted using the 'Name as' rule specified on the right.
The above example matches on all subcolumns of complex column a . a contains two subcolumns b and c .
The output schema will include two columns b and c as the 'Name as' condition is $$ .
Parameterization
You can parameterize column names using rule-based mapping. Use the keyword name to match incoming
column names against a parameter. For example, if you have a data flow parameter mycolumn , you can create a
rule that matches any column name that is equal to mycolumn . You can rename the matched column to a hard-
coded string such as 'business key' and reference it explicitly. In this example, the matching condition is
name == $mycolumn and the name condition is 'business key'.
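A sketch of that rule in data flow script, assuming a data flow parameter named mycolumn and an incoming stream named DerivedColumn1 (both illustrative):

DerivedColumn1 select(mapColumn(
        each(match(name == $mycolumn), 'business key' = $$)
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> SelectBusinessKey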
Auto mapping
When adding a select transformation, Auto mapping can be enabled by switching the Auto mapping slider.
With auto mapping, the select transformation maps all incoming columns, excluding duplicates, with the same
name as their input. This will include drifted columns, which means the output data may contain columns not
defined in your schema. For more information on drifted columns, see schema drift.
With auto mapping on, the select transformation will honor the skip duplicate settings and provide a new alias
for the existing columns. Aliasing is useful when doing multiple joins or lookups on the same stream and in self-
join scenarios.
Duplicate columns
By default, the select transformation drops duplicate columns in both the input and output projection. Duplicate
input columns often come from join and lookup transformations where column names are duplicated on each
side of the join. Duplicate output columns can occur if you map two different input columns to the same name.
Choose whether to drop or pass on duplicate columns by toggling the checkbox.
Ordering of columns
The order of mappings determines the order of the output columns. If an input column is mapped multiple
times, only the first mapping will be honored. For any duplicate column dropping, the first match will be kept.
<incomingStream>
select(mapColumn(
each(<hierarchicalColumn>, match(<matchCondition>), <nameCondition> = $$), ## hierarchical rule-based matching
<fixedColumn>, ## fixed mapping, no rename
<renamedFixedColumn> = <fixedColumn>, ## fixed mapping, rename
each(match(<matchCondition>), <nameCondition> = $$), ## rule-based mapping
each(patternMatch(<regexMatching>), <nameCondition> = $$) ## regex mapping
),
skipDuplicateMapInputs: { true | false },
skipDuplicateMapOutputs: { true | false }) ~> <selectTransformationName>
Example
Below is an example of a select mapping and its data flow script:
DerivedColumn1 select(mapColumn(
each(a, match(true())),
movie,
title1 = title,
each(match(name == 'Rating')),
each(patternMatch(`(y)`),
$1 + 'regex' = $$)
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> Select1
Next steps
After using Select to rename, reorder, and alias columns, use the Sink transformation to land your data into a
data store.
Sink transformation in mapping data flow
7/21/2021 • 6 minutes to read • Edit Online
Inline datasets
When you create a sink transformation, choose whether your sink information is defined inside a dataset object
or within the sink transformation. Most formats are available in only one or the other. To learn how to use a
specific connector, see the appropriate connector document.
When a format is supported for both inline and in a dataset object, there are benefits to both. Dataset objects
are reusable entities that can be used in other data flows and activities such as Copy. These reusable entities are
especially useful when you use a hardened schema. Datasets aren't based in Spark. Occasionally, you might
need to override certain settings or schema projection in the sink transformation.
Inline datasets are recommended when you use flexible schemas, one-off sink instances, or parameterized sinks.
If your sink is heavily parameterized, inline datasets allow you to not create a "dummy" object. Inline datasets
are based in Spark, and their properties are native to data flow.
To use an inline dataset, select the format you want in the Sink type selector. Instead of selecting a sink dataset,
you select the linked service you want to connect to.
Snowflake ✓/✓
Settings specific to these connectors are located on the Settings tab. Information and data flow script examples
on these settings are located in the connector documentation.
Azure Data Factory has access to more than 90 native connectors. To write data to those other sources from
your data flow, use the Copy Activity to load that data from a supported sink.
Sink settings
After you've added a sink, configure via the Sink tab. Here you can pick or create the dataset your sink writes to.
Development values for dataset parameters can be configured in Debug settings. (Debug mode must be turned
on.)
The following video explains a number of different sink options for text-delimited file types.
Schema drift : Schema drift is the ability of Data Factory to natively handle flexible schemas in your data flows
without needing to explicitly define column changes. Enable Allow schema drift to write additional columns
on top of what's defined in the sink data schema.
Validate schema : If validate schema is selected, the data flow will fail if any column of the incoming source
schema isn't found in the source projection, or if the data types don't match. Use this setting to enforce that the
source data meets the contract of your defined projection. It's useful in database source scenarios to signal that
column names or types have changed.
Cache sink
A cache sink is when a data flow writes data into the Spark cache instead of a data store. In mapping data flows,
you can reference this data within the same flow many times using a cache lookup. This is useful when you want
to reference data as part of an expression but don't want to explicitly join the columns to it. Common examples
where a cache sink can help are looking up a max value on a data store and matching error codes to an error
message database.
To write to a cache sink, add a sink transformation and select Cache as the sink type. Unlike other sink types,
you don't need to select a dataset or linked service because you aren't writing to an external store.
In the sink settings, you can optionally specify the key columns of the cache sink. These are used as matching
conditions when using the lookup() function in a cache lookup. If you specify key columns, you can't use the
outputs() function in a cache lookup. To learn more about the cache lookup syntax, see cached lookups.
For example, if you specify a single key column of column1 in a cache sink called cacheExample , calling
cacheExample#lookup() would take one parameter that specifies which row in the cache sink to match on. The
function outputs a single complex column with subcolumns for each column mapped.
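For instance, a downstream expression could read a subcolumn of the matched row like the following sketch (the key value and subcolumn name are illustrative):

cacheExample#lookup(errorCode).errorMessage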
NOTE
A cache sink must be in a completely independent data stream from any transformation referencing it via a cache lookup.
A cache sink must also be written as the first sink.
Write to activity output The cached sink can optionally write your output data to the input of the next
pipeline activity. This will allow you to quickly and easily pass data out of your data flow activity without needing
to persist the data in a data store.
Field mapping
Similar to a select transformation, on the Mapping tab of the sink, you can decide which incoming columns will
get written. By default, all input columns, including drifted columns, are mapped. This behavior is known as
automapping.
When you turn off automapping, you can add either fixed column-based mappings or rule-based mappings.
With rule-based mappings, you can write expressions with pattern matching. Fixed mapping maps logical and
physical column names. For more information on rule-based mapping, see Column patterns in mapping data
flow.
NOTE
When utilizing cached lookups, make sure that your sink ordering has the cached sinks set to 1, the lowest (or first) in
ordering.
Sink groups
You can group sinks together by applying the same order number for a series of sinks. ADF will treat those sinks
as groups that can execute in parallel. Options for parallel execution will surface in the pipeline data flow activity.
Next steps
Now that you've created your data flow, add a data flow activity to your pipeline.
Sort transformation in mapping data flow
4/17/2021 • 2 minutes to read • Edit Online
NOTE
Mapping data flows are executed on spark clusters which distribute data across multiple nodes and partitions. If you
choose to repartition your data in a subsequent transformation, you may lose your sorting due to reshuffling of data. The
best way to maintain sort order in your data flow is to set single partition in the Optimize tab on the transformation and
keep the Sort transformation as close to the Sink as possible.
Configuration
Case insensitive: Whether or not you wish to ignore case when sorting string or text fields
Sort Only Within Partitions: As data flows are run on spark, each data stream is divided into partitions. This
setting sorts data only within the incoming partitions rather than sorting the entire data stream.
Sort conditions: Choose which columns you are sorting by and in which order the sort happens. The order
determines sorting priority. Choose whether or not nulls will appear at the beginning or end of the data stream.
Computed columns
To modify or extract a column value before applying the sort, hover over the column and select "computed
column". This will open the expression builder to create an expression for the sort operation instead of using a
column value.
<incomingStream>
sort(
desc(<sortColumn1>, { true | false }),
asc(<sortColumn2>, { true | false }),
...
) ~> <sortTransformationName>
Example
The data flow script for the above sort configuration is in the code snippet below.
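As an illustrative sketch (the stream and column names are assumptions, not taken from the configuration above), a single descending sort would look like:

BasketballStats sort(
    desc(PTS, true)
) ~> Sort1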
Next steps
After sorting, you may want to use the Aggregate Transformation
Source transformation in mapping data flow
7/15/2021 • 6 minutes to read • Edit Online
Inline datasets
The first decision you make when you create a source transformation is whether your source information is
defined inside a dataset object or within the source transformation. Most formats are available in only one or
the other. To learn how to use a specific connector, see the appropriate connector document.
When a format is supported for both inline and in a dataset object, there are benefits to both. Dataset objects
are reusable entities that can be used in other data flows and activities such as Copy. These reusable entities are
especially useful when you use a hardened schema. Datasets aren't based in Spark. Occasionally, you might
need to override certain settings or schema projection in the source transformation.
Inline datasets are recommended when you use flexible schemas, one-off source instances, or parameterized
sources. If your source is heavily parameterized, inline datasets allow you to not create a "dummy" object. Inline
datasets are based in Spark, and their properties are native to data flow.
To use an inline dataset, select the format you want in the Source type selector. Instead of selecting a source
dataset, you select the linked service you want to connect to.
Hive -/✓
Snowflake ✓/✓
Settings specific to these connectors are located on the Source options tab. Information and data flow script
examples on these settings are located in the connector documentation.
Azure Data Factory has access to more than 90 native connectors. To include data from those other sources in
your data flow, use the Copy Activity to load that data into one of the supported staging areas.
Source settings
After you've added a source, configure via the Source settings tab. Here you can pick or create the dataset
your source points at. You can also select schema and sampling options for your data.
Development values for dataset parameters can be configured in debug settings. (Debug mode must be turned
on.)
NOTE
When debug mode is turned on, the row limit configuration in debug settings will overwrite the sampling setting in the
source during data preview.
Source options
The Source options tab contains settings specific to the connector and format chosen. For more information
and examples, see the relevant connector documentation.
Projection
Like schemas in datasets, the projection in a source defines the data columns, types, and formats from the
source data. For most dataset types, such as SQL and Parquet, the projection in a source is fixed to reflect the
schema defined in a dataset. When your source files aren't strongly typed (for example, flat .csv files rather than
Parquet files), you can define the data types for each field in the source transformation.
If your text file has no defined schema, select Detect data type so that Data Factory will sample and infer the
data types. Select Define default format to autodetect the default data formats.
Reset schema resets the projection to what is defined in the referenced dataset.
You can modify the column data types in a downstream derived-column transformation. Use a select
transformation to modify the column names.
Import schema
Select the Import schema button on the Projection tab to use an active debug cluster to create a schema
projection. It's available in every source type. Importing the schema here will override the projection defined in
the dataset. The dataset object won't be changed.
Importing schema is useful in datasets like Avro and Azure Cosmos DB that support complex data structures
that don't require schema definitions to exist in the dataset. For inline datasets, importing schema is the only
way to reference column metadata without schema drift.
Next steps
Begin building your data flow with a derived-column transformation and a select transformation.
Surrogate key transformation in mapping data flow
11/2/2020 • 2 minutes to read • Edit Online
Configuration
File sources
If your previous max value is in a file, use the max() function in the aggregate transformation to get the
previous max value:
In both cases, you will need to write to a cache sink and look up the value, as sketched below.
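A sketch of that seeding step, assuming the previous maximum was written by a cache sink named CacheMaxKey holding a single row with a maxKey column (these names are illustrative):

SurrogateKey1 derive(
    seededKey = key + CacheMaxKey#outputs()[1].maxKey
) ~> SeededSurrogateKey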
Example
The data flow script for the above surrogate key configuration is in the code snippet below.
AggregateDayStats
keyGenerate(
output(key as long),
startAt: 1L
) ~> SurrogateKey1
Next steps
These examples use the Join and Derived Column transformations.
Union transformation in mapping data flow
3/5/2021 • 2 minutes to read • Edit Online
In this case, you can combine disparate metadata from multiple sources (in this example, three different source
files) and combine them into a single stream:
To achieve this, add additional rows in the Union Settings by including all sources you wish to add. There is no
need for a common lookup or join key:
If you set a Select transformation after your Union, you will be able to rename overlapping fields or fields that
were not named from headerless sources. Click on "Inspect" to see the combined metadata with 132 total
columns in this example from three different sources:
Unpivot transformation in mapping data flow
Ungroup By
First, set the columns that you wish to ungroup by for your unpivot aggregation. Set one or more columns for
ungrouping with the + sign next to the column list.
Unpivot Key
The Unpivot Key is the column that ADF will pivot from column to row. By default, each unique value in the
dataset for this field will pivot to a row. However, you can optionally enter the values from the dataset that you
wish to pivot to row values.
Unpivoted Columns
Lastly, choose the column name for storing the values for unpivoted columns that are transformed into rows.
(Optional) You can drop rows with Null values.
For instance, SumCost is the column name that is chosen in the example shared above.
Setting the Column Arrangement to "Normal" will group together all of the new unpivoted columns from a
single value. Setting the column arrangement to "Lateral" will group together new unpivoted columns
generated from an existing column.
The final unpivoted data result set shows the column totals now unpivoted into separate row values.
Next steps
Use the Pivot transformation to pivot rows to columns.
Window transformation in mapping data flow
3/5/2021 • 2 minutes to read • Edit Online
Over
Set the partitioning of column data for your window transformation. The SQL equivalent is the Partition By in
the Over clause in SQL. If you wish to create a calculation or an expression to use for the partitioning, you
can do so by hovering over the column name and selecting "computed column".
Sort
Another part of the Over clause is setting the Order By. This sets the data sort ordering. You can also create
an expression for a calculated value in this column field for sorting.
Range By
Next, set the window frame as Unbounded or Bounded. To set an unbounded window frame, set the slider to
Unbounded on both ends. If you choose a setting between Unbounded and Current Row, then you must set the
Offset start and end values. Both values will be positive integers. You can use either relative numbers or values
from your data.
The window slider has two values to set: the values before the current row and the values after the current row.
The Start and End offset matches the two selectors on the slider.
Window columns
Lastly, use the Expression Builder to define the aggregations you wish to use with the data windows such as
RANK, COUNT, MIN, MAX, DENSE RANK, LEAD, LAG, etc.
The full list of aggregation and analytical functions available for you to use in the ADF Data Flow Expression
Language via the Expression Builder are listed here: https://aka.ms/dataflowexpressions.
Next steps
If you are looking for a simple group-by aggregation, use the Aggregate transformation.
Parameterize linked services in Azure Data Factory
6/8/2021 • 2 minutes to read • Edit Online
TIP
We recommend that you do not parameterize passwords or secrets. Store all secrets in Azure Key Vault instead, and parameterize
the Secret Name.
NOTE
There is an open bug affecting the use of "-" in parameter names; we recommend using names without "-" until the bug is resolved.
For a seven-minute introduction and demonstration of this feature, watch the following video:
Data Factory UI
JSON
{
    "name": "AzureSqlDatabase",
    "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:myserver.database.windows.net,1433;Database=@{linkedService().DBName};User ID=user;Password=fake;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
        },
        "connectVia": null,
        "parameters": {
            "DBName": {
                "type": "String"
            }
        }
    }
}
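A dataset that uses this linked service can then supply the parameter value. The following is an illustrative sketch only; the dataset name, table name, and DBName value are placeholders:
{
    "name": "SalesDataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlDatabase",
            "type": "LinkedServiceReference",
            "parameters": {
                "DBName": "SalesDB"
            }
        },
        "typeProperties": {
            "tableName": "dbo.Orders"
        }
    }
}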
Global parameters in Azure Data Factory
5/28/2021 • 3 minutes to read • Edit Online
In the side-nav, enter a name, select a data type, and specify the value of your parameter.
After a global parameter is created, you can edit it by clicking the parameter's name. To alter multiple
parameters at once, select Edit all.
Using global parameters in a pipeline
Global parameters can be used in any pipeline expression. If a pipeline is referencing another resource such as a
dataset or data flow, you can pass down the global parameter value via that resource's parameters. Global
parameters are referenced as pipeline().globalParameters.<parameterName> .
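For example (illustrative only, assuming a global parameter named environment of type string), an expression such as @concat('https://api-', pipeline().globalParameters.environment, '.contoso.com') builds an environment-specific value, and @{pipeline().globalParameters.environment} can be used inside string interpolation.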
NOTE
The Include in ARM template configuration is only available in "Git mode". Currently it is disabled in "live mode" or
"Data Factory" mode. If you use automatic publishing or a Purview connection, do not use the Include global parameters
method; use the PowerShell script method instead.
WARNING
You cannot use '-' in the parameter name. You will receive an error code
"{"code":"BadRequest","message":"ErrorCode=InvalidTemplate,ErrorMessage=The expression
'pipeline().globalParameters.myparam-dbtest-url' is not valid: .....}". However, you can use '_' in the parameter name.
Adding global parameters to the ARM template adds a factory-level setting that will override other factory-level
settings such as a customer-managed key or git configuration in other environments. If you have these settings
enabled in an elevated environment such as UAT or PROD, it's better to deploy global parameters via a
PowerShell script in the steps highlighted below.
Deploying using PowerShell
The following steps outline how to deploy global parameters via PowerShell. This is useful when your target
factory has a factory-level setting such as a customer-managed key.
When you publish a factory or export an ARM template with global parameters, a folder called
globalParameters is created with a file called your-factory-name_GlobalParameters.json. This file is a JSON
object that contains each global parameter type and value in the published factory.
If you're deploying to a new environment such as TEST or PROD, it's recommended to create a copy of this
global parameters file and overwrite the appropriate environment-specific values. When you republish, the
original global parameters file will be overwritten, but the copy for the other environment will be untouched.
For example, if you have a factory named 'ADF-DEV' and a global parameter of type string named 'environment'
with a value 'dev', a file named ADF-DEV_GlobalParameters.json will be generated when you publish. If
deploying to a test factory named 'ADF-TEST', create a copy of the JSON file (for example, named ADF-
TEST_GlobalParameters.json) and replace the parameter values with the environment-specific values. The
parameter 'environment' may have a value 'test' now.
Use the below PowerShell script to promote global parameters to additional environments. Add an Azure
PowerShell DevOps task before your ARM Template deployment. In the DevOps task, you must specify the
location of the new parameters file, the target resource group, and the target data factory.
NOTE
To deploy global parameters using PowerShell, you must use at least version 4.4.0 of the Az module.
param
(
[parameter(Mandatory = $true)] [String] $globalParametersFilePath,
[parameter(Mandatory = $true)] [String] $resourceGroupName,
[parameter(Mandatory = $true)] [String] $dataFactoryName
)
Import-Module Az.DataFactory
$newGlobalParameters = New-Object 'system.collections.generic.dictionary[string,Microsoft.Azure.Management.DataFactory.Models.GlobalParameterSpecification]'
Next steps
Learn about Azure Data Factory's continuous integration and deployment process
Learn how to use the control flow expression language
Expressions and functions in Azure Data Factory
7/16/2021 • 53 minutes to read • Edit Online
Expressions
JSON values in the definition can be literal or expressions that are evaluated at runtime. For example:
"name": "value"
or
"name": "@pipeline().parameters.password"
Expressions can appear anywhere in a JSON string value and always result in another JSON value. If a JSON
value is an expression, the body of the expression is extracted by removing the at-sign (@). If a literal string is
needed that starts with @, it must be escaped by using @@. The following examples show how expressions are
evaluated.
JSON VALUE    RESULT
Expressions can also appear inside strings, using a feature called string interpolation where expressions are
wrapped in @{ ... } . For example:
"name" : "First Name: @{pipeline().parameters.firstName} Last Name: @{pipeline().parameters.lastName}"
Using string interpolation, the result is always a string. Say I have defined myNumber as 42 and myString as
foo :
JSON VALUE    RESULT
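For illustration (not the original table): @pipeline().parameters.myNumber returns 42 as a number, whereas "@{pipeline().parameters.myNumber}" returns the string "42", and "Answer is: @{pipeline().parameters.myNumber}" returns the string "Answer is: 42".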
In the control flow activities like ForEach activity, you can provide an array to be iterated over for the property
items and use @item() to iterate over a single enumeration in ForEach activity. For example, if items is an array:
[1, 2, 3], @item() returns 1 in the first iteration, 2 in the second iteration, and 3 in the third iteration. You can also
use @range(0,10) like expression to iterate ten times starting with 0 ending with 9.
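For instance (a minimal sketch; the file-name pattern is made up): with items set to @range(0, 10), an expression inside the loop such as @concat('file_', string(item()), '.csv') produces file_0.csv through file_9.csv; item() drops the leading @ because it is nested inside another expression.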
You can use @activity('activity name') to capture output of activity and make decisions. Consider a web activity
called Web1. For placing the output of the first activity in the body of the second, the expression generally looks
like: @activity('Web1').output or @activity('Web1').output.data or something similar depending upon what the
output of the first activity looks like.
Examples
Complex expression example
The below example shows a complex example that references a deep sub-field of activity output. To reference a
pipeline parameter that evaluates to a sub-field, use [] syntax instead of dot(.) operator (as in case of subfield1
and subfield2), as part of an activity output.
@activity('*activityName*').output.*subfield1*.*subfield2*[pipeline().parameters.*subfield3*].*subfield4*
{
"type": "@{if(equals(1, 2), 'Blob', 'Table' )}",
"name": "@{toUpper('myData')}"
}
{
"type": "Table",
"name": "MYDATA"
}
Tutorial
This tutorial walks you through how to pass parameters between a pipeline and activity as well as between the
activities.
Functions
You can call functions within expressions. The following sections provide information about the functions that
can be used in an expression.
String functions
To work with strings, you can use these string functions and also some collection functions. String functions
work only on strings.
STRING FUNCTION    TASK
replace Replace a substring with the specified string, and return the
updated string.
Collection functions
To work with collections, generally arrays, strings, and sometimes, dictionaries, you can use these collection
functions.
COLLECTION FUNCTION    TASK
intersection Return a collection that has only the common items across
the specified collections.
join Return a string that has all the items from an array,
separated by the specified character.
skip Remove items from the front of a collection, and return all
the other items.
union Return a collection that has all the items from the specified
collections.
Logical functions
These functions are useful inside conditions; they can be used to evaluate any type of logic.
LOGICAL COMPARISON FUNCTION    TASK
greater Check whether the first value is greater than the second
value.
greaterOrEquals Check whether the first value is greater than or equal to the
second value.
less Check whether the first value is less than the second value.
lessOrEquals Check whether the first value is less than or equal to the
second value.
Conversion functions
These functions are used to convert between each of the native types in the language:
string
integer
float
boolean
arrays
dictionaries
CONVERSION FUNCTION    TASK
coalesce Return the first non-null value from one or more parameters.
xpath Check XML for nodes or values that match an XPath (XML
Path Language) expression, and return the matching nodes
or values.
Math functions
These functions can be used for either type of number: integers and floats.
MATH FUNCTION    TASK
sub Return the result from subtracting the second number from
the first number.
Date functions
DATE OR TIME FUNCTION    TASK
getFutureTime Return the current timestamp plus the specified time units.
See also addToTime.
getPastTime Return the current timestamp minus the specified time units.
See also subtractFromTime.
Function reference
This section lists all the available functions in alphabetical order.
add
Return the result from adding two numbers.
add(<summand_1>, <summand_2>)
Example
This example adds the specified numbers:
add(1, 1.5)
addDays
Add a number of days to a timestamp.
Example 1
This example adds 10 days to the specified timestamp:
addDays('2018-03-15T13:00:00Z', 10)
Example 2
This example subtracts five days from the specified timestamp:
addDays('2018-03-15T00:00:00Z', -5)
addHours
Add a number of hours to a timestamp.
Example 1
This example adds 10 hours to the specified timestamp:
addHours('2018-03-15T00:00:00Z', 10)
Example 2
This example subtracts five hours from the specified timestamp:
addHours('2018-03-15T15:00:00Z', -5)
addMinutes
Add a number of minutes to a timestamp.
Example 1
This example adds 10 minutes to the specified timestamp:
addMinutes('2018-03-15T00:10:00Z', 10)
Example 2
This example subtracts five minutes from the specified timestamp:
addMinutes('2018-03-15T00:20:00Z', -5)
addSeconds
Add a number of seconds to a timestamp.
Example 1
This example adds 10 seconds to the specified timestamp:
addSeconds('2018-03-15T00:00:00Z', 10)
Example 2
This example subtracts five seconds from the specified timestamp:
addSeconds('2018-03-15T00:00:30Z', -5)
addToTime
Add a number of time units to a timestamp. See also getFutureTime().
Example 1
This example adds one day to the specified timestamp:
addToTime('2018-01-01T00:00:00Z', 1, 'Day')
Example 2
This example adds one day to the specified timestamp:
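An illustrative call that includes the optional format parameter:
addToTime('2018-01-01T00:00:00Z', 1, 'Day', 'D')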
And returns the result using the optional "D" format: "Tuesday, January 2, 2018"
and
Check whether both expressions are true. Return true when both expressions are true, or return false when at
least one expression is false.
and(<expression1>, <expression2>)
Example 1
These examples check whether the specified Boolean values are both true:
and(true, true)
and(false, true)
and(false, false)
Example 2
These examples check whether the specified expressions are both true:
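For example, such checks might look like the following (illustrative):
and(equals(1, 1), equals(2, 2))
and(equals(1, 2), equals(1, 1))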
array
Return an array from a single specified input. For multiple inputs, see createArray().
array('<value>')
Example
This example creates an array from the "hello" string:
array('hello')
base64
Return the base64-encoded version for a string.
base64('<value>')
Example
This example converts the "hello" string to a base64-encoded string:
base64('hello')
base64ToBinary
Return the binary version for a base64-encoded string.
base64ToBinary('<value>')
Example
This example converts the "aGVsbG8=" base64-encoded string to a binary string:
base64ToBinary('aGVsbG8=')
base64ToString
Return the string version for a base64-encoded string, effectively decoding the base64 string. Use this function
rather than decodeBase64(). Although both functions work the same way, base64ToString() is preferred.
base64ToString('<value>')
Example
This example converts the "aGVsbG8=" base64-encoded string to just a string:
base64ToString('aGVsbG8=')
binary
Return the binary version for a string.
binary('<value>')
Example
This example converts the "hello" string to a binary string:
binary('hello')
bool
Return the Boolean version for a value.
bool(<value>)
Example
These examples convert the specified values to Boolean values:
bool(1)
bool(0)
coalesce
Return the first non-null value from one or more parameters. Empty strings, empty arrays, and empty objects
are not null.
PARAMETER    REQUIRED    TYPE    DESCRIPTION
<object_1>, <object_2>, ...    Yes    Any, can mix types    One or more items to check for null
Example
These examples return the first non-null value from the specified values, or null when all the values are null:
concat
Combine two or more strings, and return the combined string.
Example
This example combines the strings "Hello" and "World":
concat('Hello', 'World')
contains
Check whether a collection has a specific item. Return true when the item is found, or return false when not
found. This function is case-sensitive.
contains('<collection>', '<value>')
contains([<collection>], '<value>')
Example 1
This example checks the string "hello world" for the substring "world" and returns true:
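Such a check might be written as (illustrative):
contains('hello world', 'world')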
Example 2
This example checks the string "hello world" for the substring "universe" and returns false:
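Similarly (illustrative):
contains('hello world', 'universe')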
convertFromUtc
Convert a timestamp from Universal Time Coordinated (UTC) to the target time zone.
Example 1
This example converts a timestamp to the specified time zone:
Example 2
This example converts a timestamp to the specified time zone and format:
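Illustrative calls (the time zone name and format are assumptions, using a Windows time zone identifier):
convertFromUtc('2018-01-01T08:00:00.0000000Z', 'Pacific Standard Time')
convertFromUtc('2018-01-01T08:00:00.0000000Z', 'Pacific Standard Time', 'D')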
convertTimeZone
Convert a timestamp from the source time zone to the target time zone.
Example 1
This example converts the source time zone to the target time zone:
Example 2
This example converts a time zone to the specified time zone and format:
convertToUtc
Convert a timestamp from the source time zone to Universal Time Coordinated (UTC).
Example 1
This example converts a timestamp to UTC:
Example 2
This example converts a timestamp to UTC:
createArray
Return an array from multiple inputs. For single input arrays, see array().
<object1>, <object2>, ...    Yes    Any, but not mixed    At least two items to create the array
[<object1>, <object2>, ...]    Array    The array created from all the input items
Example
This example creates an array from these inputs:
dataUri
Return a data uniform resource identifier (URI) for a string.
dataUri('<value>')
Example
This example creates a data URI for the "hello" string:
dataUri('hello')
dataUriToBinary
Return the binary version for a data uniform resource identifier (URI). Use this function rather than
decodeDataUri(). Although both functions work the same way, dataUriToBinary() is preferred.
dataUriToBinary('<value>')
Example
This example creates a binary version for this data URI:
dataUriToBinary('data:text/plain;charset=utf-8;base64,aGVsbG8=')
dataUriToString
Return the string version for a data uniform resource identifier (URI).
dataUriToString('<value>')
Example
This example creates a string for this data URI:
dataUriToString('data:text/plain;charset=utf-8;base64,aGVsbG8=')
dayOfMonth
Return the day of the month from a timestamp.
dayOfMonth('<timestamp>')
Example
This example returns the number for the day of the month from this timestamp:
dayOfMonth('2018-03-15T13:27:36Z')
dayOfWeek
Return the day of the week from a timestamp.
dayOfWeek('<timestamp>')
Example
This example returns the number for the day of the week from this timestamp:
dayOfWeek('2018-03-15T13:27:36Z')
dayOfYear
Return the day of the year from a timestamp.
dayOfYear('<timestamp>')
Example
This example returns the number of the day of the year from this timestamp:
dayOfYear('2018-03-15T13:27:36Z')
decodeBase64
Return the string version for a base64-encoded string, effectively decoding the base64 string. Consider using
base64ToString() rather than decodeBase64() . Although both functions work the same way, base64ToString() is
preferred.
decodeBase64('<value>')
Example
This example creates a string for a base64-encoded string:
decodeBase64('aGVsbG8=')
decodeDataUri
Return the binary version for a data uniform resource identifier (URI). Consider using dataUriToBinary(), rather
than decodeDataUri() . Although both functions work the same way, dataUriToBinary() is preferred.
decodeDataUri('<value>')
Example
This example returns the binary version for this data URI:
decodeDataUri('data:text/plain;charset=utf-8;base64,aGVsbG8=')
decodeUriComponent
Return a string that replaces escape characters with decoded versions.
decodeUriComponent('<value>')
Example
This example replaces the escape characters in this string with decoded versions:
decodeUriComponent('http%3A%2F%2Fcontoso.com')
div
Return the integer result from dividing two numbers. To get the remainder result, see mod().
div(<dividend>, <divisor>)
Example
Both examples divide the first number by the second number:
div(10, 5)
div(11, 5)
encodeUriComponent
Return a uniform resource identifier (URI) encoded version for a string by replacing URL-unsafe characters with
escape characters. Consider using uriComponent(), rather than encodeUriComponent() . Although both functions
work the same way, uriComponent() is preferred.
encodeUriComponent('<value>')
Example
This example creates a URI-encoded version for this string:
encodeUriComponent('https://contoso.com')
empty
Check whether a collection is empty. Return true when the collection is empty, or return false when not empty.
empty('<collection>')
empty([<collection>])
Example
These examples check whether the specified collections are empty:
empty('')
empty('abc')
endsWith
Check whether a string ends with a specific substring. Return true when the substring is found, or return false
when not found. This function is not case-sensitive.
endsWith('<text>', '<searchText>')
Example 1
This example checks whether the "hello world" string ends with the "world" string:
Example 2
This example checks whether the "hello world" string ends with the "universe" string:
equals
Check whether both values, expressions, or objects are equivalent. Return true when both are equivalent, or
return false when they're not equivalent.
equals('<object1>', '<object2>')
Example
These examples check whether the specified inputs are equivalent.
equals(true, 1)
equals('abc', 'abcd')
first
Return the first item from a string or array.
first('<collection>')
first([<collection>])
Example
These examples find the first item in these collections:
first('hello')
first(createArray(0, 1, 2))
float
Convert a string version for a floating-point number to an actual floating point number.
float('<value>')
Example
This example creates a floating-point number from this string:
float('10.333')
formatDateTime
Return a timestamp in the specified format.
formatDateTime('<timestamp>', '<format>'?)
Example
This example converts a timestamp to the specified format:
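An illustrative call (the format string is an assumption):
formatDateTime('2018-03-15T12:00:00Z', 'yyyy-MM-ddTHH:mm:ss')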
getFutureTime
Return the current timestamp plus the specified time units.
getFutureTime(<interval>, <timeUnit>, <format>?)
Example 1
Suppose the current timestamp is "2018-03-01T00:00:00.0000000Z". This example adds five days to that
timestamp:
getFutureTime(5, 'Day')
Example 2
Suppose the current timestamp is "2018-03-01T00:00:00.0000000Z". This example adds five days and converts
the result to "D" format:
getPastTime
Return the current timestamp minus the specified time units.
Example 1
Suppose the current timestamp is "2018-02-01T00:00:00.0000000Z". This example subtracts five days from that
timestamp:
getPastTime(5, 'Day')
Example 2
Suppose the current timestamp is "2018-02-01T00:00:00.0000000Z". This example subtracts five days and
converts the result to "D" format:
greater
Check whether the first value is greater than the second value. Return true when the first value is more, or return
false when less.
greater(<value>, <compareTo>)
greater('<value>', '<compareTo>')
Example
These examples check whether the first value is greater than the second value:
greater(10, 5)
greater('apple', 'banana')
greaterOrEquals
Check whether the first value is greater than or equal to the second value. Return true when the first value is
greater or equal, or return false when the first value is less.
greaterOrEquals(<value>, <compareTo>)
greaterOrEquals('<value>', '<compareTo>')
Example
These examples check whether the first value is greater than or equal to the second value:
greaterOrEquals(5, 5)
greaterOrEquals('apple', 'banana')
guid
Generate a globally unique identifier (GUID) as a string, for example, "c2ecc88d-88c8-4096-912c-
d6f2e2b138ce":
guid()
Also, you can specify a different format for the GUID other than the default format, "D", which is 32 digits
separated by hyphens.
guid('<format>')
Example
This example generates the same GUID, but as 32 digits, separated by hyphens, and enclosed in parentheses:
guid('P')
And returns this result: "(c2ecc88d-88c8-4096-912c-d6f2e2b138ce)"
if
Check whether an expression is true or false. Based on the result, return a specified value.
Example
This example returns "yes" because the specified expression returns true. Otherwise, the example returns
"no" :
indexOf
Return the starting position or index value for a substring. This function is not case-sensitive, and indexes start
with the number 0.
indexOf('<text>', '<searchText>')
Example
This example finds the starting index value for the "world" substring in the "hello world" string:
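An illustrative call, which returns 6:
indexOf('hello world', 'world')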
int
Return the integer version for a string.
int('<value>')
Example
This example creates an integer version for the string "10":
int('10')
json
Return the JavaScript Object Notation (JSON) type value or object for a string or XML.
json('<value>')
<JSON-result>    JSON native type or object    The JSON native type value or object for the specified string or XML. If the string is null, the function returns an empty object.
Example 1
This example converts this string to the JSON value:
json('[1, 2, 3]')
Example 2
This example converts this string to JSON:
{
"fullName": "Sophia Owen"
}
Example 3
This example converts this XML to JSON:
{
"?xml": { "@version": "1.0" },
"root": {
"person": [ {
"@id": "1",
"name": "Sophia Owen",
"occupation": "Engineer"
} ]
}
}
intersection
Return a collection that has only the common items across the specified collections. To appear in the result, an
item must appear in all the collections passed to this function. If one or more items have the same name, the last
item with that name appears in the result.
<collection1>, <collection2>, ...    Yes    Array or Object, but not both    The collections from which you want only the common items
<common-items>    Array or Object, respectively    A collection that has only the common items across the specified collections
Example
This example finds the common items across these arrays:
join
Return a string that has all the items from an array, with each item separated by a delimiter.
join([<collection>], '<delimiter>')
Example
This example creates a string from all the items in this array with the specified character as the delimiter:
join(createArray('a', 'b', 'c'), '.')
last
Return the last item from a collection.
last('<collection>')
last([<collection>])
Example
These examples find the last item in these collections:
last('abcd')
last(createArray(0, 1, 2, 3))
lastIndexOf
Return the starting position or index value for the last occurrence of a substring. This function is not case-
sensitive, and indexes start with the number 0.
lastIndexOf('<text>', '<searchText>')
Example
This example finds the starting index value for the last occurrence of the "world" substring in the "hello world"
string:
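An illustrative call, which returns 6:
lastIndexOf('hello world', 'world')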
length
Return the number of items in a collection.
length('<collection>')
length([<collection>])
Example
These examples count the number of items in these collections:
length('abcd')
length(createArray(0, 1, 2, 3))
less
Check whether the first value is less than the second value. Return true when the first value is less, or return
false when the first value is more.
less(<value>, <compareTo>)
less('<value>', '<compareTo>')
true or false    Boolean    Return true when the first value is less than the second value. Return false when the first value is equal to or greater than the second value.
Example
These examples check whether the first value is less than the second value.
less(5, 10)
less('banana', 'apple')
lessOrEquals
Check whether the first value is less than or equal to the second value. Return true when the first value is less
than or equal, or return false when the first value is more.
lessOrEquals(<value>, <compareTo>)
lessOrEquals('<value>', '<compareTo>')
true or false    Boolean    Return true when the first value is less than or equal to the second value. Return false when the first value is greater than the second value.
Example
These examples check whether the first value is less than or equal to the second value.
lessOrEquals(10, 10)
lessOrEquals('apply', 'apple')
max
Return the highest value from a list or array with numbers that is inclusive at both ends.
<number1>, <number2>, ...    Yes    Integer, Float, or both    The set of numbers from which you want the highest value
[<number1>, <number2>, ...]    Yes    Array - Integer, Float, or both    The array of numbers from which you want the highest value
Example
These examples get the highest value from the set of numbers and the array:
max(1, 2, 3)
max(createArray(1, 2, 3))
min
Return the lowest value from a set of numbers or an array.
<number1>, <number2>, ...    Yes    Integer, Float, or both    The set of numbers from which you want the lowest value
[<number1>, <number2>, ...]    Yes    Array - Integer, Float, or both    The array of numbers from which you want the lowest value
Example
These examples get the lowest value in the set of numbers and the array:
min(1, 2, 3)
min(createArray(1, 2, 3))
mod
Return the remainder from dividing two numbers. To get the integer result, see div().
mod(<dividend>, <divisor>)
mod(3, 2)
mul
Return the product from multiplying two numbers.
mul(<multiplicand1>, <multiplicand2>)
Example
These examples multiply the first number by the second number:
mul(1, 2)
mul(1.5, 2)
not
Check whether an expression is false. Return true when the expression is false, or return false when true.
not(<expression>)
Example 1
These examples check whether the specified expressions are false:
not(false)
not(true)
Example 2
These examples check whether the specified expressions are false:
not(equals(1, 2))
not(equals(1, 1))
or
Check whether at least one expression is true. Return true when at least one expression is true, or return false
when both are false.
or(<expression1>, <expression2>)
Example 1
These examples check whether at least one expression is true:
or(true, false)
or(false, false)
Example 2
These examples check whether at least one expression is true:
rand
Return a random integer from a specified range, which is inclusive only at the starting end.
rand(<minValue>, <maxValue>)
Example
This example gets a random integer from the specified range, excluding the maximum value:
rand(1, 5)
range
Return an integer array that starts from a specified integer.
range(<startIndex>, <count>)
Example
This example creates an integer array that starts from the specified index and has the specified number of
integers:
range(1, 4)
replace
Replace a substring with the specified string, and return the result string. This function is case-sensitive.
Example
This example finds the "old" substring in "the old string" and replaces "old" with "new":
skip
Remove items from the front of a collection, and return all the other items.
skip([<collection>], <count>)
Example
This example removes one item, the number 0, from the front of the specified array:
skip(createArray(0, 1, 2, 3), 1)
split
Return an array that contains substrings, separated by commas, based on the specified delimiter character in the
original string.
split('<text>', '<delimiter>')
Example
This example creates an array with substrings from the specified string based on the specified character as the
delimiter:
split('a_b_c', '_')
startOfDay
Return the start of the day for a timestamp.
startOfDay('<timestamp>', '<format>'?)
Example
This example finds the start of the day for this timestamp:
startOfDay('2018-03-15T13:30:30Z')
startOfHour
Return the start of the hour for a timestamp.
startOfHour('<timestamp>', '<format>'?)
Example
This example finds the start of the hour for this timestamp:
startOfHour('2018-03-15T13:30:30Z')
startOfMonth
Return the start of the month for a timestamp.
startOfMonth('<timestamp>', '<format>'?)
Example
This example returns the start of the month for this timestamp:
startOfMonth('2018-03-15T13:30:30Z')
startsWith
Check whether a string starts with a specific substring. Return true when the substring is found, or return false
when not found. This function is not case-sensitive.
startsWith('<text>', '<searchText>')
Example 1
This example checks whether the "hello world" string starts with the "hello" substring:
startsWith('hello world', 'hello')
Example 2
This example checks whether the "hello world" string starts with the "greetings" substring:
string
Return the string version for a value.
string(<value>)
Example 1
This example creates the string version for this number:
string(10)
Example 2
This example creates a string for the specified JSON object and uses the backslash character (\) as an escape
character for the double-quotation mark (").
sub
Return the result from subtracting the second number from the first number.
sub(<minuend>, <subtrahend>)
Example
This example subtracts the second number from the first number:
sub(10.3, .3)
substring
Return characters from a string, starting from the specified position, or index. Index values start with the number
0.
Example
This example creates a five-character substring from the specified string, starting from the index value 6:
substring('hello world', 6, 5)
subtractFromTime
Subtract a number of time units from a timestamp. See also getPastTime.
Example 1
This example subtracts one day from this timestamp:
subtractFromTime('2018-01-02T00:00:00Z', 1, 'Day')
Example 2
This example subtracts one day from this timestamp:
subtractFromTime('2018-01-02T00:00:00Z', 1, 'Day', 'D')
And returns this result using the optional "D" format: "Monday, January 1, 2018"
take
Return items from the front of a collection.
take('<collection>', <count>)
take([<collection>], <count>)
<subset> or [<subset>]    String or Array, respectively    A string or array that has the specified number of items taken from the front of the original collection
Example
These examples get the specified number of items from the front of these collections:
take('abcde', 3)
take(createArray(0, 1, 2, 3, 4), 3)
ticks
Return the ticks property value for a specified timestamp. A tick is a 100-nanosecond interval.
ticks('<timestamp>')
toLower
Return a string in lowercase format. If a character in the string doesn't have a lowercase version, that character
stays unchanged in the returned string.
toLower('<text>')
Example
This example converts this string to lowercase:
toLower('Hello World')
toUpper
Return a string in uppercase format. If a character in the string doesn't have an uppercase version, that character
stays unchanged in the returned string.
toUpper('<text>')
Example
This example converts this string to uppercase:
toUpper('Hello World')
trim
Remove leading and trailing whitespace from a string, and return the updated string.
trim('<text>')
Example
This example removes the leading and trailing whitespace from the string " Hello World ":
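An illustrative call matching that description:
trim(' Hello World ')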
union
Return a collection that has all the items from the specified collections. To appear in the result, an item can
appear in any collection passed to this function. If one or more items have the same name, the last item with
that name appears in the result.
<collection1>, <collection2>, ...    Yes    Array or Object, but not both    The collections from which you want all the items
RETURN VALUE    TYPE    DESCRIPTION
<updatedCollection>    Array or Object, respectively    A collection with all the items from the specified collections - no duplicates
Example
This example gets all the items from these collections:
uriComponent
Return a uniform resource identifier (URI) encoded version for a string by replacing URL-unsafe characters with
escape characters. Use this function rather than encodeUriComponent(). Although both functions work the same
way, uriComponent() is preferred.
uriComponent('<value>')
Example
This example creates a URI-encoded version for this string:
uriComponent('https://contoso.com')
uriComponentToBinary
Return the binary version for a uniform resource identifier (URI) component.
uriComponentToBinary('<value>')
Example
This example creates the binary version for this URI-encoded string:
uriComponentToBinary('http%3A%2F%2Fcontoso.com')
uriComponentToString
Return the string version for a uniform resource identifier (URI) encoded string, effectively decoding the URI-
encoded string.
uriComponentToString('<value>')
Example
This example creates the decoded string version for this URI-encoded string:
uriComponentToString('http%3A%2F%2Fcontoso.com')
utcNow('<format>')
Optionally, you can specify a different format with the <format> parameter.
Example 1
Suppose today is April 15, 2018 at 1:00:00 PM. This example gets the current timestamp:
utcNow()
Example 2
Suppose today is April 15, 2018 at 1:00:00 PM. This example gets the current timestamp using the optional "D"
format:
utcNow('D')
xml
Return the XML version for a string that contains a JSON object.
xml('<value>')
Example 1
This example creates the XML version for this string, which contains a JSON object:
xml(json('{ \"name\": \"Sophia Owen\" }'))
<name>Sophia Owen</name>
Example 2
Suppose you have this JSON object:
{
"person": {
"name": "Sophia Owen",
"city": "Seattle"
}
}
This example creates XML for a string that contains this JSON object:
xml(json('{\"person\": {\"name\": \"Sophia Owen\", \"city\": \"Seattle\"}}'))
<person>
<name>Sophia Owen</name>
<city>Seattle</city>
</person>
xpath
Check XML for nodes or values that match an XPath (XML Path Language) expression, and return the matching
nodes or values. An XPath expression, or just "XPath", helps you navigate an XML document structure so that
you can select nodes or compute values in the XML content.
xpath('<xml>', '<xpath>')
Example 1
Following on Example 1, this example finds nodes that match the <count></count> node and adds those node
values with the sum() function:
xpath(xml(parameters('items')), 'sum(/produce/item/count)')
Example 2
For this example, both expressions find nodes that match the <location></location> node, in the specified
arguments, which include XML with a namespace. The expressions use the backslash character (\) as an escape
character for the double quotation mark (").
Expression 1
xpath(xml(body('Http')), '/*[name()=\"file\"]/*[name()=\"location\"]')
Expression 2
xpath(xml(body('Http')), '/*[local-name()=\"file\" and namespace-uri()=\"http://contoso.com\"]/*
[local-name()=\"location\"]')
<location xmlns="https://contoso.com">Paris</location>
Example 3
Following on Example 3, this example finds the value in the <location></location> node:
xpath(xml(body('Http')), 'string(/*[name()=\"file\"]/*[name()=\"location\"])')
Next steps
For a list of system variables you can use in expressions, see System variables.
System variables supported by Azure Data Factory
7/20/2021 • 3 minutes to read • Edit Online
Pipeline scope
These system variables can be referenced anywhere in the pipeline JSON.
@pipeline().TriggerType The type of trigger that invoked the pipeline (for example,
ScheduleTrigger , BlobEventsTrigger ). For a list of
supported trigger types, see Pipeline execution and triggers
in Azure Data Factory. A trigger type of Manual indicates
that the pipeline was triggered manually.
@pipeline().TriggerTime Time of the trigger run that invoked the pipeline. This is the
time at which the trigger actually fired to invoke the
pipeline run, and it may differ slightly from the trigger's
scheduled time.
@pipeline()?.TriggeredByPipelineName Name of the pipeline that triggered the pipeline run. Applicable
when the pipeline run is triggered by an ExecutePipeline
activity. Evaluates to Null when used in other circumstances.
Note the question mark after @pipeline()
NOTE
Trigger-related date/time system variables (in both pipeline and trigger scopes) return UTC dates in ISO 8601 format, for
example, 2017-06-01T22:20:00.4061448Z .
Schedule trigger scope
These system variables can be referenced anywhere in the trigger JSON for triggers of type ScheduleTrigger.
@trigger().startTime Time at which the trigger fired to invoke the pipeline run.
@triggerBody().event.eventType Type of events that triggered the Custom Event Trigger run.
The event type is a customer-defined field and can take any
value of string type.
@triggerBody().event.subject Subject of the custom event that caused the trigger to fire.
@triggerBody().event.data._keyName_ The data field in a custom event is a free-form JSON blob, which
customers can use to send messages and data. Use
data.keyName to reference each field. For example,
@triggerBody().event.data.callback returns the value for the
callback field stored under data.
@trigger().startTime Time at which the trigger fired to invoke the pipeline run.
Next steps
For information about how these variables are used in expressions, see Expression language & functions.
To use trigger scope system variables in pipeline, see Reference trigger metadata in pipeline
Parameterizing mapping data flows
4/20/2021 • 3 minutes to read • Edit Online
You can quickly add additional parameters by selecting New parameter and specifying the name and type.
Assign parameter values from a pipeline
Once you've created a data flow with parameters, you can execute it from a pipeline with the Execute Data Flow
Activity. After you add the activity to your pipeline canvas, you will be presented with the available data flow
parameters in the activity's Parameters tab.
When assigning parameter values, you can use either the pipeline expression language or the data flow
expression language based on spark types. Each mapping data flow can have any combination of pipeline and
data flow expression parameters.
Suppose the data flow parameter stringParam references a pipeline parameter with the value upper(column1).
If expression is checked, $stringParam evaluates to the value of column1, all uppercase.
If expression is not checked (default behavior), $stringParam evaluates to 'upper(column1)'.
Passing in timestamps
In the pipeline expression language, System variables such as pipeline().TriggerTime and functions like
utcNow() return timestamps as strings in format 'yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ'. To convert these into
data flow parameters of type timestamp, use string interpolation to include the desired timestamp in a
toTimestamp() function. For example, to convert the pipeline trigger time into a data flow parameter, you can
use toTimestamp(left('@{pipeline().TriggerTime}', 23), 'yyyy-MM-dd\'T\'HH:mm:ss.SSS') .
NOTE
Data flows can only support up to 3 millisecond digits. The left() function is used to trim off the additional digits.
When $intParam is referenced in an expression such as a derived column, it will evaluate to abs(1), which returns 1.
NOTE
If you pass in an invalid expression or reference a schema column that doesn't exist in that transformation, the parameter
will evaluate to null.
Passing in a column name as a parameter
A common pattern is to pass in a column name as a parameter value. If the column is defined in the data flow
schema, you can reference it directly as a string expression. If the column isn't defined in the schema, use the
byName() function. Remember to cast the column to its appropriate type with a casting function such as
toString() .
For example, if you wanted to map a string column based upon a parameter columnName , you can add a derived
column transformation equal to toString(byName($columnName)) .
Next steps
Execute data flow activity
Control flow expressions
How to use parameters, expressions and functions
in Azure Data Factory
3/26/2021 • 9 minutes to read • Edit Online
"name": "value"
or
"name": "@pipeline().parameters.password"
Expressions can appear anywhere in a JSON string value and always result in another JSON value. Here,
password is a pipeline parameter in the expression. If a JSON value is an expression, the body of the expression
is extracted by removing the at-sign (@). If a literal string is needed that starts with @, it must be escaped by
using @@. The following examples show how expressions are evaluated.
JSON VALUE    RESULT
Expressions can also appear inside strings, using a feature called string interpolation where expressions are
wrapped in @{ ... } . For example:
"name" : "First Name: @{pipeline().parameters.firstName} Last Name: @{pipeline().parameters.lastName}"
Using string interpolation, the result is always a string. Say I have defined myNumber as 42 and myString as
foo :
JSON VALUE    RESULT
{
"type": "@{if(equals(1, 2), 'Blob', 'Table' )}",
"name": "@{toUpper('myData')}"
}
{
    "name": "BlobDataset",
    "properties": {
        "type": "AzureBlob",
        "typeProperties": {
            "folderPath": "@dataset().path"
        },
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "path": {
                "type": "String"
            }
        }
    }
}
STRING FUNCTION    TASK
replace Replace a substring with the specified string, and return the
updated string.
Collection functions
To work with collections, generally arrays, strings, and sometimes, dictionaries, you can use these collection
functions.
COLLECTION FUNCTION    TASK
intersection Return a collection that has only the common items across
the specified collections.
join Return a string that has all the items from an array,
separated by the specified character.
skip Remove items from the front of a collection, and return all
the other items.
union Return a collection that has all the items from the specified
collections.
Logical functions
These functions are useful inside conditions; they can be used to evaluate any type of logic.
LOGICAL COMPARISON FUNCTION    TASK
greater Check whether the first value is greater than the second
value.
greaterOrEquals Check whether the first value is greater than or equal to the
second value.
less Check whether the first value is less than the second value.
lessOrEquals Check whether the first value is less than or equal to the
second value.
Conversion functions
These functions are used to convert between each of the native types in the language:
string
integer
float
boolean
arrays
dictionaries
CONVERSION FUNCTION    TASK
coalesce Return the first non-null value from one or more parameters.
xpath Check XML for nodes or values that match an XPath (XML
Path Language) expression, and return the matching nodes
or values.
Math functions
These functions can be used for either type of number: integers and floats.
MATH FUNCTION    TASK
sub Return the result from subtracting the second number from
the first number.
Date functions
DATE OR TIME FUNCTION    TASK
getFutureTime Return the current timestamp plus the specified time units.
See also addToTime.
getPastTime Return the current timestamp minus the specified time units.
See also subtractFromTime.
Next steps
For a list of system variables you can use in expressions, see System variables.
Security considerations for data movement in Azure
Data Factory
7/6/2021 • 11 minutes to read • Edit Online
CSA STAR Certification
ISO 20000-1:2011
ISO 22301:2012
ISO 27001:2013
ISO 27017:2015
ISO 27018:2014
ISO 9001:2015
SOC 1, 2, 3
HIPAA BAA
HITRUST
If you're interested in Azure compliance and how Azure secures its own infrastructure, visit the Microsoft Trust
Center. For the latest list of all Azure Compliance offerings check - https://aka.ms/AzureCompliance.
In this article, we review security considerations in the following two data movement scenarios:
Cloud scenario: In this scenario, both your source and your destination are publicly accessible through the
internet. These include managed cloud storage services such as Azure Storage, Azure Synapse Analytics,
Azure SQL Database, Azure Data Lake Store, Amazon S3, Amazon Redshift, SaaS services such as Salesforce,
and web protocols such as FTP and OData. Find a complete list of supported data sources in Supported data
stores and formats.
Hybrid scenario: In this scenario, either your source or your destination is behind a firewall or inside an on-
premises corporate network. Or, the data store is in a private network or virtual network (most often the
source) and is not publicly accessible. Database servers hosted on virtual machines also fall under this
scenario.
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Cloud scenarios
Securing data store credentials
Store encrypted credentials in an Azure Data Factory managed store. Data Factory helps protect
your data store credentials by encrypting them with certificates managed by Microsoft. These certificates are
rotated every two years (which includes certificate renewal and the migration of credentials). For more
information about Azure Storage security, see Azure Storage security overview.
Store credentials in Azure Key Vault. You can also store the data store's credential in Azure Key Vault.
Data Factory retrieves the credential during the execution of an activity. For more information, see Store
credential in Azure Key Vault.
Data encryption in transit
If the cloud data store supports HTTPS or TLS, all data transfers between data movement services in Data
Factory and a cloud data store are via secure channel HTTPS or TLS.
NOTE
All connections to Azure SQL Database and Azure Synapse Analytics require encryption (SSL/TLS) while data is in transit
to and from the database. When you're authoring a pipeline by using JSON, add the encryption property and set it to
true in the connection string. For Azure Storage, you can use HTTPS in the connection string.
NOTE
To enable encryption in transit while moving data from Oracle follow one of the below options:
1. On the Oracle server, go to Oracle Advanced Security (OAS) and configure the encryption settings, which support Triple-
DES Encryption (3DES) and Advanced Encryption Standard (AES); refer here for details. ADF automatically negotiates
the encryption method to use the one you configure in OAS when establishing the connection to Oracle.
2. In ADF, you can add EncryptionMethod=1 in the connection string (in the Linked Service). This will use SSL/TLS as the
encryption method. To use this, you need to disable non-SSL encryption settings in OAS on the Oracle server side to
avoid encryption conflict.
NOTE
The TLS version used is 1.2.
Hybrid scenarios
Hybrid scenarios require self-hosted integration runtime to be installed in an on-premises network, inside a
virtual network (Azure), or inside a virtual private cloud (Amazon). The self-hosted integration runtime must be
able to access the local data stores. For more information about self-hosted integration runtime, see How to
create and configure self-hosted integration runtime.
The command channel allows communication between data movement services in Data Factory and self-hosted
integration runtime. The communication contains information related to the activity. The data channel is used for
transferring data between on-premises data stores and cloud data stores.
On-premises data store credentials
The credentials can be stored within Data Factory or be referenced by Data Factory during runtime from
Azure Key Vault. If you store credentials within Data Factory, they are always stored encrypted on the self-hosted
integration runtime.
Store credentials locally. If you directly use the Set-AzDataFactoryV2LinkedService cmdlet with
the connection strings and credentials inline in the JSON, the linked service is encrypted and stored on the
self-hosted integration runtime. In this case the credentials flow through the Azure backend service, which is
extremely secure, to the self-hosted integration machine, where they are finally encrypted and stored. The self-
hosted integration runtime uses Windows DPAPI to encrypt the sensitive data and credential information.
Store credentials in Azure Key Vault. You can also store the data store's credential in Azure Key Vault.
Data Factory retrieves the credential during the execution of an activity. For more information, see Store
credential in Azure Key Vault.
Store credentials locally without flowing the credentials through the Azure backend to the self-hosted
integration runtime. If you want to encrypt and store credentials locally on the self-hosted
integration runtime without having to flow the credentials through data factory backend, follow the steps
in Encrypt credentials for on-premises data stores in Azure Data Factory. All connectors support this
option. The self-hosted integration runtime uses Windows DPAPI to encrypt the sensitive data and
credential information.
Use the New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet to encrypt linked service
credentials and sensitive details in the linked service. You can then use the JSON returned (with the
EncryptedCredential element in the connection string) to create a linked service by using the
Set-AzDataFactoryV2LinkedService cmdlet.
Ports used when encrypting linked service on self-hosted integration runtime
By default, when remote access from intranet is enabled, PowerShell uses port 8060 on the machine with self-
hosted integration runtime for secure communication. If necessary, this port can be changed from the
Integration Runtime Configuration Manager on the Settings tab:
Encryption in transit
All data transfers are via secure channel HTTPS and TLS over TCP to prevent man-in-the-middle attacks during
communication with Azure services.
You can also use IPSec VPN or Azure ExpressRoute to further secure the communication channel between your
on-premises network and Azure.
Azure Virtual Network is a logical representation of your network in the cloud. You can connect an on-premises
network to your virtual network by setting up IPSec VPN (site-to-site) or ExpressRoute (private peering).
The following table summarizes the network and self-hosted integration runtime configuration
recommendations based on different combinations of source and destination locations for hybrid data
movement.
SOURCE    DESTINATION    NETWORK CONFIGURATION    INTEGRATION RUNTIME SETUP
On-premises    Virtual machines and cloud services deployed in virtual networks    IPSec VPN (point-to-site or site-to-site)    The self-hosted integration runtime should be installed on an Azure virtual machine in the virtual network.
On-premises    Virtual machines and cloud services deployed in virtual networks    ExpressRoute (private peering)    The self-hosted integration runtime should be installed on an Azure virtual machine in the virtual network.
The following images show the use of self-hosted integration runtime for moving data between an on-premises
database and Azure services by using ExpressRoute and IPSec VPN (with Azure Virtual Network):
Express Route
IPSec VPN
Firewall configurations and allow-list setup for IP addresses
NOTE
You might have to manage ports or set up allow list for domains at the corporate firewall level as required by the
respective data sources. This table only uses Azure SQL Database, Azure Synapse Analytics, and Azure Data Lake Store as
examples.
NOTE
For details about data access strategies through Azure Data Factory, see this article.
DOMAIN NAMES    OUTBOUND PORTS    DESCRIPTION
The following table provides inbound port requirements for Windows Firewall:
INBOUND PORTS    DESCRIPTION
Next steps
For information about Azure Data Factory Copy Activity performance, see Copy Activity performance and tuning
guide.
Data access strategies
5/6/2021 • 4 minutes to read • Edit Online
TIP
With the introduction of Static IP address range, you can now allow list IP ranges for the particular Azure integration
runtime region to ensure you don’t have to allow all Azure IP addresses in your cloud data stores. This way, you can
restrict the IP addresses that are permitted to access the data stores.
NOTE
The IP address ranges are blocked for Azure Integration Runtime and are currently only used for Data Movement, pipeline,
and external activities. Data flows and Azure Integration Runtime that enable Managed Virtual Network now do not use
these IP ranges.
This should work in many scenarios, and we do understand that a unique Static IP address per integration
runtime would be desirable, but this wouldn't be possible using Azure Integration Runtime currently, which is
serverless. If necessary, you can always set up a Self-hosted Integration Runtime and use your Static IP with it.
DATA STORES    SUPPORTED NETWORK SECURITY MECHANISM ON DATA STORES (PRIVATE LINK, TRUSTED SERVICE, STATIC IP RANGE, SERVICE TAGS, ALLOW AZURE SERVICES)
*Applicable only when Azure Data Explorer is virtual network injected, and IP range can be applied on
NSG/ Firewall.
Self-hosted Integration Runtime (in Vnet/on-premise)
DATA STORES    SUPPORTED NETWORK SECURITY MECHANISM ON DATA STORES (STATIC IP, TRUSTED SERVICES)
Next steps
For more information, see the following related articles:
Supported data stores
Azure Key Vault ‘Trusted Services’
Azure Storage ‘Trusted Microsoft Services’
Managed identity for Data Factory
Azure Integration Runtime IP addresses
5/6/2021 • 2 minutes to read • Edit Online
IMPORTANT
Data flows and Azure Integration Runtime which enable Managed Virtual Network don't support the use of fixed IP
ranges.
You can use these IP ranges for Data Movement, Pipeline and External activities executions. These IP ranges can be used
for filtering in data stores/ Network Security Group (NSG)/ Firewalls for inbound access from Azure Integration runtime.
Next steps
Security considerations for data movement in Azure Data Factory
Store credential in Azure Key Vault
5/6/2021 • 3 minutes to read • Edit Online
Prerequisites
This feature relies on the data factory managed identity. Learn how it works from Managed identity for Data Factory and make sure your data factory has an associated one.
Steps
To reference a credential stored in Azure Key Vault, you need to:
1. Retrieve the data factory managed identity by copying the value of "Managed Identity Object ID" generated along with your factory. If you use the ADF authoring UI, the managed identity object ID is shown on the Azure Key Vault linked service creation window; you can also retrieve it from the Azure portal, see Retrieve data factory managed identity.
2. Grant the managed identity access to your Azure Key Vault. In your key vault -> Access policies -> Add Access Policy, search for this managed identity and grant it the Get permission in the Secret permissions dropdown. This allows the designated factory to access secrets in the key vault (see the PowerShell sketch after the JSON example below).
3. Create a linked service pointing to your Azure Key Vault. Refer to Azure Key Vault linked service.
4. Create the data store linked service, and inside it reference the corresponding secret stored in the key vault. Refer to reference secret stored in key vault.
{
"name": "AzureKeyVaultLinkedService",
"properties": {
"type": "AzureKeyVault",
"typeProperties": {
"baseUrl": "https://<azureKeyVaultName>.vault.azure.net"
}
}
}
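Step 2 above can also be scripted. A minimal Az PowerShell sketch, assuming placeholder values for the vault name and the managed identity object ID:
# Grant the factory's managed identity Get permission on secrets.
Set-AzKeyVaultAccessPolicy -VaultName "<azureKeyVaultName>" `
    -ObjectId "<managed identity object ID>" `
    -PermissionsToSecrets Get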
TIP
For connectors that use a connection string in the linked service (such as SQL Server and Blob storage), you can choose either to store only the secret field (for example, the password) in AKV, or to store the entire connection string in AKV. You can find both options on the UI.
JSON example: (see the "password" section)
{
"name": "DynamicsLinkedService",
"properties": {
"type": "Dynamics",
"typeProperties": {
"deploymentType": "<>",
"organizationName": "<>",
"authenticationType": "<>",
"username": "<>",
"password": {
"type": "AzureKeyVaultSecret",
"secretName": "<secret name in AKV>",
"store":{
"referenceName": "<Azure Key Vault linked service>",
"type": "LinkedServiceReference"
}
}
}
}
}
Next steps
For a list of data stores supported as sources and sinks by the copy activity in Azure Data Factory, see supported
data stores.
Use Azure Key Vault secrets in pipeline activities
4/22/2021 • 2 minutes to read • Edit Online
Prerequisites
This feature relies on the data factory managed identity. Learn how it works from Managed identity for Data Factory and make sure your data factory has an associated one.
Steps
1. Open the properties of your data factory and copy the Managed Identity Application ID value.
2. Open the key vault access policies and add the managed identity permissions to Get and List secrets.
Click Add , then click Save .
3. Navigate to your Key Vault secret and copy the Secret Identifier.
Make a note of your secret URI that you want to get during your data factory pipeline run.
4. In your Data Factory pipeline, add a new Web activity and configure it as follows.
P RO P ERT Y VA L UE
Method GET
Authentication MSI
P RO P ERT Y VA L UE
Resource https://vault.azure.net
IMPORTANT
You must add ?api-version=7.0 to the end of your secret URI.
Caution
Set the Secure Output option to true to prevent the secret value from being logged in plain text. Any
further activities that consume this value should have their Secure Input option set to true.
5. To use the value in another activity, use the following code expression @activity('Web1').output.value .
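Outside the pipeline, the same retrieval can be reproduced with Az PowerShell to sanity-check the secret URI and the api-version suffix. This sketch authenticates as your signed-in user rather than the factory's managed identity, and the secret URI is a placeholder.
# Acquire a Key Vault token and call the same REST endpoint the Web activity calls.
$token = (Get-AzAccessToken -ResourceUrl "https://vault.azure.net").Token
$secretUri = "https://<keyVaultName>.vault.azure.net/secrets/<secretName>/<secretVersion>?api-version=7.0"
$response = Invoke-RestMethod -Method Get -Uri $secretUri -Headers @{ Authorization = "Bearer $token" }
$response.value   # the same value the pipeline reads as @activity('Web1').output.value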
Next steps
To learn how to use Azure Key Vault to store credentials for data stores and computes, see Store credentials in
Azure Key Vault
Encrypt credentials for on-premises data stores in
Azure Data Factory
5/28/2021 • 2 minutes to read • Edit Online
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
{
"properties": {
"type": "SqlServer",
"typeProperties": {
"connectionString": "Server=<servername>;Database=<databasename>;User ID=<username>;Password=
<password>;Timeout=60"
},
"connectVia": {
"type": "integrationRuntimeReference",
"referenceName": "<integration runtime name>"
},
"name": "SqlServerLinkedService"
}
}
Encrypt credentials
To encrypt the sensitive data from the JSON payload on an on-premises self-hosted integration runtime, run New-AzDataFactoryV2LinkedServiceEncryptedCredential and pass in the JSON payload. This cmdlet ensures the credentials are encrypted using DPAPI and stored locally on the self-hosted integration runtime node. The output payload containing the encrypted reference to the credential can be redirected to another JSON file (in this case 'encryptedLinkedService.json').
New-AzDataFactoryV2LinkedServiceEncryptedCredential -DataFactoryName $dataFactoryName -ResourceGroupName
$ResourceGroupName -Name "SqlServerLinkedService" -DefinitionFile ".\SQLServerLinkedService.json" >
encryptedSQLServerLinkedService.json
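You can then deploy the linked service that uses the encrypted credential by passing the generated file to Set-AzDataFactoryV2LinkedService. A minimal sketch, reusing the variables above and an assumed linked service name:
Set-AzDataFactoryV2LinkedService -DataFactoryName $dataFactoryName -ResourceGroupName $ResourceGroupName `
    -Name "EncryptedSqlServerLinkedService" -DefinitionFile ".\encryptedSQLServerLinkedService.json"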
Next steps
For information about security considerations for data movement, see Data movement security considerations.
Managed identity for Data Factory
5/28/2021 • 5 minutes to read • Edit Online
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Overview
When creating a data factory, a managed identity can be created along with factory creation. The managed
identity is a managed application registered to Azure Active Directory, and represents this specific data factory.
Managed identity for Data Factory supports the following features:
Store credential in Azure Key Vault, in which case data factory managed identity is used for Azure Key Vault
authentication.
Access data stores or computes using managed identity authentication, including Azure Blob storage, Azure
Data Explorer, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure SQL Database, Azure SQL
Managed Instance, Azure Synapse Analytics, REST, Databricks activity, Web activity, and more. Check the
connector and activity articles for details.
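The PowerShell output shown below is the kind of result you get when you create or update the factory with Az PowerShell, which generates the system-assigned identity. A likely form of the command (resource names are placeholders):
Set-AzDataFactoryV2 -ResourceGroupName "<resourceGroupName>" -Name "ADFV2DemoFactory" -Location "East US"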
DataFactoryName : ADFV2DemoFactory
DataFactoryId :
/subscriptions/<subsID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/ADFV2De
moFactory
ResourceGroupName : <resourceGroupName>
Location : East US
Tags : {}
Identity : Microsoft.Azure.Management.DataFactory.Models.FactoryIdentity
ProvisioningState : Succeeded
PATCH
https://management.azure.com/subscriptions/<subsID>/resourceGroups/<resourceGroupName>/providers/Microsoft.D
ataFactory/factories/<data factory name>?api-version=2018-06-01
{
"name": "<dataFactoryName>",
"location": "<region>",
"properties": {},
"identity": {
"type": "SystemAssigned"
}
}
Response: the managed identity is created automatically, and the "identity" section is populated accordingly.
{
"name": "<dataFactoryName>",
"tags": {},
"properties": {
"provisioningState": "Succeeded",
"loggingStorageAccountKey": "**********",
"createTime": "2017-09-26T04:10:01.1135678Z",
"version": "2018-06-01"
},
"identity": {
"type": "SystemAssigned",
"principalId": "765ad4ab-XXXX-XXXX-XXXX-51ed985819dc",
"tenantId": "72f988bf-XXXX-XXXX-XXXX-2d7cd011db47"
},
"id":
"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factorie
s/ADFV2DemoFactory",
"type": "Microsoft.DataFactory/factories",
"location": "<region>"
}
{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"resources": [{
"name": "<dataFactoryName>",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "<region>",
"identity": {
"type": "SystemAssigned"
}
}]
}
TIP
If you don't see the managed identity, generate it by updating your factory.
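For example, you can retrieve the identity with Az PowerShell; a minimal sketch with placeholder names, which produces output like the PrincipalId/TenantId listing below:
(Get-AzDataFactoryV2 -ResourceGroupName "<resourceGroupName>" -Name "<dataFactoryName>").Identity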
PrincipalId TenantId
----------- --------
765ad4ab-XXXX-XXXX-XXXX-51ed985819dc 72f988bf-XXXX-XXXX-XXXX-2d7cd011db47
You can get the application ID by copying the principal ID above, and then running the Azure Active Directory command below with the principal ID as a parameter.
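The command itself is not shown in this extract; based on the principal ID in the sample output, it likely has the following form:
Get-AzADServicePrincipal -ObjectId "765ad4ab-XXXX-XXXX-XXXX-51ed985819dc"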
ServicePrincipalNames : {76f668b3-XXXX-XXXX-XXXX-1b3348c75e02,
https://identity.azure.net/P86P8g6nt1QxfPJx22om8MOooMf/Ag0Qf/nnREppHkU=}
ApplicationId : 76f668b3-XXXX-XXXX-XXXX-1b3348c75e02
DisplayName : ADFV2DemoFactory
Id : 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc
Type : ServicePrincipal
GET
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Mic
rosoft.DataFactory/factories/{factoryName}?api-version=2018-06-01
Response: You will get a response like the example shown below. The "identity" section is populated accordingly.
{
"name":"<dataFactoryName>",
"identity":{
"type":"SystemAssigned",
"principalId":"554cff9e-XXXX-XXXX-XXXX-90c7d9ff2ead",
"tenantId":"72f988bf-XXXX-XXXX-XXXX-2d7cd011db47"
},
"id":"/subscriptions/<subscriptionId>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/fac
tories/<dataFactoryName>",
"type":"Microsoft.DataFactory/factories",
"properties":{
"provisioningState":"Succeeded",
"createTime":"2020-02-12T02:22:50.2384387Z",
"version":"2018-06-01",
"factoryStatistics":{
"totalResourceCount":0,
"maxAllowedResourceCount":0,
"factorySizeInGbUnits":0,
"maxAllowedFactorySizeInGbUnits":0
}
},
"eTag":"\"03006b40-XXXX-XXXX-XXXX-5e43617a0000\"",
"location":"<region>",
"tags":{
}
}
TIP
To retrieve the managed identity from an ARM template, add an outputs section in the ARM JSON:
{
"outputs":{
"managedIdentityObjectId":{
"type":"string",
"value":"[reference(resourceId('Microsoft.DataFactory/factories',
parameters('<dataFactoryName>')), '2018-06-01', 'Full').identity.principalId]"
}
}
}
Next steps
See the following topics that introduce when and how to use data factory managed identity:
Store credential in Azure Key Vault
Copy data from/to Azure Data Lake Store using managed identities for Azure resources authentication
See Managed Identities for Azure Resources Overview for more background on managed identities for Azure
resources, which data factory managed identity is based upon.
Encrypt Azure Data Factory with customer-
managed keys
4/2/2021 • 6 minutes to read • Edit Online
NOTE
A customer-managed key can only be configured on an empty data factory. The data factory can't contain any resources such as linked services, pipelines, and data flows. It is recommended to enable the customer-managed key right after factory creation.
IMPORTANT
This approach does not work with managed virtual network enabled factories. Consider the alternative route if you want to encrypt such factories.
1. Make sure that the data factory's Managed Service Identity (MSI) has Get, Unwrap Key, and Wrap Key permissions to Key Vault (a scripted sketch of this step follows the list below).
2. Ensure the Data Factory is empty. The data factory can't contain any resources such as linked services,
pipelines, and data flows. For now, deploying customer-managed key to a non-empty factory will result in
an error.
3. To locate the key URI in the Azure portal, navigate to Azure Key Vault and select the Keys setting. Select the key you want, then select it to view its versions. Select a key version to view its settings.
4. Copy the value of the Key Identifier field, which provides the URI.
5. Launch the Azure Data Factory portal and, using the navigation bar on the left, jump to the Data Factory Management Portal.
6. Click the Customer managed key icon.
7. Enter the URI for the customer-managed key that you copied before.
8. Click Save; customer-managed key encryption is now enabled for the data factory.
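To grant the permissions described in step 1 with a script rather than the portal, a minimal Az PowerShell sketch follows; the vault name and the factory's managed identity object ID are placeholders.
# Grant the factory's system-assigned identity the key permissions required for customer-managed keys.
Set-AzKeyVaultAccessPolicy -VaultName "<keyVaultName>" `
    -ObjectId "<data factory managed identity object ID>" `
    -PermissionsToKeys get,wrapKey,unwrapKey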
During factory creation in Azure portal
This section walks through steps to add customer managed key encryption in Azure portal, during factory
deployment.
To encrypt the factory, Data Factory needs to first retrieve the customer-managed key from Key Vault. Because factory deployment is still in progress, a Managed Service Identity (MSI) isn't available yet to authenticate with Key Vault. As such, to use this approach, you need to assign a user-assigned managed identity (UA-MI) to the data factory. Data Factory assumes the roles granted to the UA-MI and authenticates with Key Vault.
To learn more about user-assigned managed identity, see Managed identity types and Role assignment for user
assigned managed identity.
1. Make sure that User-assigned Managed Identity (UA-MI) has Get, Unwrap Key and Wrap Key permissions
to Key Vault
2. Under Advanced tab, check the box for Enable encryption using a customer managed key
3. Provide the url for the customer managed key stored in Key Vault
4. Select an appropriate user assigned managed identity to authenticate with Key Vault
5. Continue with factory deployment
NOTE
Adding the encryption setting to the ARM templates adds a factory-level setting that will override other factory level
settings, such as git configurations, in other environments. If you have these settings enabled in an elevated environment
such as UAT or PROD, please refer to Global Parameters in CI/CD.
Next steps
Go through the tutorials to learn about using Data Factory in more scenarios.
Azure Data Factory Managed Virtual Network
(preview)
7/20/2021 • 6 minutes to read • Edit Online
IMPORTANT
Currently, the managed Virtual Network is only supported in the same region as the Azure Data Factory region.
NOTE
As Azure Data Factory managed Virtual Network is still in public preview, there is no SLA guarantee.
NOTE
Existing public Azure integration runtime can't switch to Azure integration runtime in Azure Data Factory managed virtual
network and vice versa.
Managed private endpoints
Managed private endpoints are private endpoints created in the Azure Data Factory Managed Virtual Network
establishing a private link to Azure resources. Azure Data Factory manages these private endpoints on your
behalf.
Azure Data Factory supports private links. Private link enables you to access Azure (PaaS) services (such as
Azure Storage, Azure Cosmos DB, Azure Synapse Analytics).
When you use a private link, traffic between your data stores and managed Virtual Network traverses entirely
over the Microsoft backbone network. Private Link protects against data exfiltration risks. You establish a private
link to a resource by creating a private endpoint.
Private endpoint uses a private IP address in the managed Virtual Network to effectively bring the service into it.
Private endpoints are mapped to a specific resource in Azure and not the entire service. Customers can limit
connectivity to a specific resource approved by their organization. Learn more about private links and private
endpoints.
NOTE
It's recommended that you create Managed private endpoints to connect to all your Azure data sources.
WARNING
If a PaaS data store (Blob, ADLS Gen2, Azure Synapse Analytics) has a private endpoint already created against it, and
even if it allows access from all networks, ADF would only be able to access it using a managed private endpoint. If a
private endpoint does not already exist, you must create one in such scenarios.
A private endpoint connection is created in a "Pending" state when you create a managed private endpoint in
Azure Data Factory. An approval workflow is initiated. The private link resource owner is responsible to approve
or reject the connection.
If the owner approves the connection, the private link is established. Otherwise, the private link won't be
established. In either case, the Managed private endpoint will be updated with the status of the connection.
Only a Managed private endpoint in an approved state can send traffic to a given private link resource.
Interactive Authoring
Interactive authoring capabilities are used for functionality like test connection, browse folder list and table list, get schema, and preview data. You can enable interactive authoring when creating or editing an Azure Integration Runtime that is in an ADF-managed virtual network. The backend service then pre-allocates compute for interactive authoring functionality. Otherwise, the compute is allocated every time an interactive operation is performed, which takes more time. The Time To Live (TTL) for interactive authoring is 60 minutes, which means it is automatically disabled 60 minutes after the last interactive authoring operation.
Activity execution time using managed virtual network
By design, an Azure integration runtime in a managed virtual network takes longer queue time than the public Azure integration runtime, because one compute node is not reserved per data factory. There is a warm-up before each activity starts, and it occurs primarily on the virtual network join rather than on the Azure integration runtime. For non-copy activities, including pipeline activities and external activities, there is a 60-minute Time To Live (TTL) when you trigger them for the first time. Within the TTL, the queue time is shorter because the node is already warmed up.
NOTE
Copy activity doesn't have TTL support yet.
$vnetResourceId =
"subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factori
es/${factoryName}/managedVirtualNetworks/default"
$privateEndpointResourceId =
"subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factori
es/${factoryName}/managedVirtualNetworks/default/managedprivateendpoints/${managedPrivateEndpointName}"
$integrationRuntimeResourceId =
"subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/providers/Microsoft.DataFactory/factori
es/${factoryName}/integrationRuntimes/${integrationRuntimeName}"
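The resource IDs above can then be passed to the generic New-AzResource cmdlet to create the corresponding managed Virtual Network, managed private endpoint, and integration runtime resources. A minimal sketch for the managed private endpoint is shown below; the target storage account and group ID are placeholder assumptions.
# Create a managed private endpoint against a storage account (blob sub-resource).
New-AzResource -ApiVersion "2018-06-01" -ResourceId "${privateEndpointResourceId}" -Force -Properties @{
    privateLinkResourceId = "/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storageAccountName>"
    groupId = "blob"
}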
NOTE
You can still access all data sources that are supported by Data Factory through the public network.
NOTE
Because Azure SQL Managed Instance doesn't support native Private Endpoint right now, you can access it from
managed Virtual Network using Private Linked Service and Load Balancer. Please see How to access SQL Managed
Instance from Data Factory Managed VNET using Private Endpoint.
Next steps
Tutorial: Build a copy pipeline using managed Virtual Network and private endpoints
Tutorial: Build mapping dataflow pipeline using managed Virtual Network and private endpoints
Azure Private Link for Azure Data Factory
6/18/2021 • 9 minutes to read • Edit Online
DO M A IN P O RT DESC RIP T IO N
With the support of Private Link for Azure Data Factory, you can:
Create a private endpoint in your virtual network.
Enable the private connection to a specific data factory instance.
The communications to Azure Data Factory service go through Private Link and help provide secure private
connectivity.
Enabling the Private Link service for each of the preceding communication channels offers the following
functionality:
Supported:
You can author and monitor the data factory in your virtual network, even if you block all outbound
communications.
The command communications between the self-hosted integration runtime and the Azure Data
Factory service can be performed securely in a private network environment. The traffic between the
self-hosted integration runtime and the Azure Data Factory service goes through Private Link.
Not currently supported:
Interactive authoring that uses a self-hosted integration runtime, such as test connection, browse
folder list and table list, get schema, and preview data, goes through Private Link.
The new version of the self-hosted integration runtime, which can be automatically downloaded from Microsoft Download Center if you enable Auto-Update, is not supported at this time.
NOTE
For functionality that's not currently supported, you still need to configure the previously mentioned domain and
port in the virtual network or your corporate firewall.
NOTE
Connecting to Azure Data Factory via private endpoint is only applicable to self-hosted integration runtime in data
factory. It's not supported in Synapse.
WARNING
If you enable Private Link in Azure Data Factory and block public access at the same time, make sure when you create a
linked service, your credentials are stored in an Azure key vault. Otherwise, the credentials won't work.
DNS changes for private endpoints
When you create a private endpoint, the DNS CNAME resource record for the Data Factory is updated to an alias
in a subdomain with the prefix 'privatelink'. By default, we also create a private DNS zone, corresponding to the
'privatelink' subdomain, with the DNS A resource records for the private endpoints.
When you resolve the data factory endpoint URL from outside the VNet with the private endpoint, it resolves to the public endpoint of the Data Factory service. When resolved from the VNet hosting the private endpoint, the data factory endpoint URL resolves to the private endpoint's IP address.
For the illustrated example above, the DNS resource records for the data factory 'DataFactoryA', when resolved from outside the VNet hosting the private endpoint, will be:
NAME: <data factory service public endpoint> | TYPE: A | VALUE: <data factory service public IP address>
The DNS resource records for DataFactoryA, when resolved in the VNet hosting the private endpoint, will be:
NAME | TYPE | VALUE
If you are using a custom DNS server on your network, clients must be able to resolve the FQDN for the Data
Factory endpoint to the private endpoint IP address. You should configure your DNS server to delegate your
private link subdomain to the private DNS zone for the VNet, or configure the A records for ' DataFactoryA.
{region}.privatelink.datafactory.azure.net' with the private endpoint IP address.
For more information on configuring your own DNS server to support private endpoints, refer to the following
articles:
Name resolution for resources in Azure virtual networks
DNS configuration for private endpoints
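To check which address the factory FQDN resolves to, you can run a quick DNS check from a Windows machine; the factory name and region are placeholders. Inside the VNet hosting the private endpoint this should return the private IP address, and outside it should return the public endpoint.
Resolve-DnsName "<dataFactoryName>.<region>.datafactory.azure.net"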
SET T IN G VA L UE
Project Details
Instance details
4. Select the IP Addresses tab or select the Next: IP Addresses button at the bottom of the page.
5. In the IP Addresses tab, enter this information:
SET T IN G VA L UE
SET T IN G VA L UE
8. Select Save .
9. Select the Review + create tab or select the Review + create button.
10. Select Create .
Create a virtual machine for the Self-Hosted Integration Runtime (SHIR)
You must also create or assign an existing virtual machine to run the Self-Hosted Integration Runtime in the new
subnet created above.
1. On the upper-left side of the portal, select Create a resource > Compute > Virtual machine or search for Virtual machine in the search box.
2. In Create a virtual machine, type or select the values in the Basics tab:
SET T IN G VA L UE
Project Details
Instance details
Region Select the region used above for your virtual network
Administrator account
3. Select the Networking tab, or select Next: Disks , then Next: Networking .
4. In the Networking tab, select or enter:
SET T IN G VA L UE
Network interface
NOTE
Azure provides an ephemeral IP for Azure Virtual Machines which aren't assigned a public IP address, or are in the
backend pool of an internal Basic Azure Load Balancer. The ephemeral IP mechanism provides an outbound IP address
that isn't configurable.
The ephemeral IP is disabled when a public IP address is assigned to the virtual machine or the virtual machine is placed in the backend pool of a Standard Load Balancer, with or without outbound rules. If an Azure Virtual Network NAT gateway resource is assigned to the subnet of the virtual machine, the ephemeral IP is disabled.
For more information on outbound connections in Azure, see Using Source Network Address Translation (SNAT) for
outbound connections.
SET T IN G VA L UE
Project details
Instance details
3. Select the Resource tab or the Next: Resource button at the bottom of the page.
4. In Resource , enter or select this information:
SET T IN G VA L UE
Target sub-resource: If you want to use the private endpoint for command communications between the self-hosted integration runtime and the Azure Data Factory service, select datafactory as Target sub-resource. If you want to use the private endpoint for authoring and monitoring the data factory in your virtual network, select portal as Target sub-resource.
5. Select the Configuration tab or the Next: Configuration button at the bottom of the screen.
6. In Configuration , enter or select this information:
SET T IN G VA L UE
Networking
NOTE
Disabling public network access is applicable only to the self-hosted integration runtime, not to Azure Integration Runtime
and SQL Server Integration Services (SSIS) Integration Runtime.
NOTE
You can still access the Azure Data Factory portal through a public network after you create private endpoint for portal.
Next steps
Create a data factory by using the Azure Data Factory UI
Introduction to Azure Data Factory
Visual authoring in Azure Data Factory
Visually monitor Azure Data Factory
4/22/2021 • 5 minutes to read • Edit Online
COLUMN NAME | DESCRIPTION
Run Start | Start date and time for the pipeline run (MM/DD/YYYY, HH:MM:SS AM/PM)
Run End | End date and time for the pipeline run (MM/DD/YYYY, HH:MM:SS AM/PM)
You need to manually select the Refresh button to refresh the list of pipeline and activity runs. Autorefresh is
currently not supported.
COLUMN NAME | DESCRIPTION
Actions | Icons that allow you to see JSON input information, JSON output information, or detailed activity-specific monitoring experiences
Run Start | Start date and time for the activity run (MM/DD/YYYY, HH:MM:SS AM/PM)
If an activity failed, you can see the detailed error message by clicking on the icon in the error column.
NOTE
You can only promote up to five pipeline activity properties as user properties.
After you create the user properties, you can monitor them in the monitoring list views.
If the source for the copy activity is a table name, you can monitor the source table name as a column in the list
view for activity runs.
If you wish to rerun starting at a specific point, you can do so from the activity runs view. Select the activity you
wish to start from and select Rerun from activity .
Rerun from failed activity
If an activity fails, times out, or is canceled, you can rerun the pipeline from that failed activity by selecting
Rerun from failed activity .
You can also view rerun history for a particular pipeline run.
Monitor consumption
You can see the resources consumed by a pipeline run by clicking the consumption icon next to the run.
Clicking the icon opens a consumption report of resources used by that pipeline run.
You can plug these values into the Azure pricing calculator to estimate the cost of the pipeline run. For more
information on Azure Data Factory pricing, see Understanding pricing.
NOTE
The values returned by the pricing calculator are an estimate. They don't reflect the exact amount you will be billed by Azure Data Factory.
Gantt views
A Gantt chart is a view that allows you to see the run history over a time range. By switching to a Gantt view, you see all pipeline runs grouped by name, displayed as bars whose length reflects how long each run took. You can also group by annotations/tags that you've created on your pipeline. The Gantt view is also available at the activity run level.
The length of the bar indicates the duration of the pipeline run. You can also select the bar to see more details.
Alerts
You can raise alerts on supported metrics in Data Factory. Select Monitor > Alerts & metrics on the Data Factory monitoring page to get started.
For a seven-minute introduction and demonstration of this feature, watch the following video:
Create alerts
1. Select New alert rule to create a new alert.
2. Specify the rule name and select the alert severity.
Monitor and Alert Data Factory by using Azure
Monitor
4/22/2021 • 27 minutes to read • Edit Online
If there are existing settings on the data factory, you see a list of settings already configured on the data
factory. Select Add diagnostic setting .
4. Give your setting a name, select Send to Log Analytics , and then select a workspace from Log
Analytics Workspace .
In Azure-Diagnostics mode, diagnostic logs flow into the AzureDiagnostics table.
In Resource-Specific mode, diagnostic logs from Azure Data Factory flow into the following tables:
ADFActivityRun
ADFPipelineRun
ADFTriggerRun
ADFSSISIntegrationRuntimeLogs
ADFSSISPackageEventMessageContext
ADFSSISPackageEventMessages
ADFSSISPackageExecutableStatistics
ADFSSISPackageExecutionComponentPhases
ADFSSISPackageExecutionDataStatistics
You can select various logs relevant to your workloads to send to Log Analytics tables. For
example, if you don't use SQL Server Integration Services (SSIS) at all, you need not select any
SSIS logs. If you want to log SSIS Integration Runtime (IR) start/stop/maintenance operations, you
can select SSIS IR logs. If you invoke SSIS package executions via T-SQL on SQL Server
Management Studio (SSMS), SQL Server Agent, or other designated tools, you can select SSIS
package logs. If you invoke SSIS package executions via Execute SSIS Package activities in ADF
pipelines, you can select all logs.
If you select AllMetrics, various ADF metrics will be made available for you to monitor or raise
alerts on, including the metrics for ADF activity, pipeline, and trigger runs, as well as for SSIS IR
operations and SSIS package executions.
NOTE
Because an Azure log table can't have more than 500 columns, we highly recommend that you select Resource-Specific mode. For more information, see AzureDiagnostics Logs reference.
5. Select Save .
After a few moments, the new setting appears in your list of settings for this data factory. Diagnostic logs are
streamed to that workspace as soon as new event data is generated. Up to 15 minutes might elapse between
when an event is emitted and when it appears in Log Analytics.
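If you prefer to script steps 3 through 5 instead of using the portal, a minimal sketch with the Az.Monitor module follows; the setting name, factory resource ID, and workspace resource ID are placeholders.
# Send the three run-log categories from the factory to a Log Analytics workspace.
Set-AzDiagnosticSetting -Name "adf-diagnostics" `
    -ResourceId "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.DataFactory/factories/<dataFactoryName>" `
    -WorkspaceId "/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.OperationalInsights/workspaces/<workspaceName>" `
    -Category PipelineRuns,TriggerRuns,ActivityRuns -Enabled $true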
3. Select Create and then create or select the Log Analytics Workspace .
Monitor Data Factory metrics
Installing this solution creates a default set of views inside the workbooks section of the chosen Log Analytics
workspace. As a result, the following metrics become enabled:
ADF Runs - 1) Pipeline Runs by Data Factory
ADF Runs - 2) Activity Runs by Data Factory
ADF Runs - 3) Trigger Runs by Data Factory
ADF Errors - 1) Top 10 Pipeline Errors by Data Factory
ADF Errors - 2) Top 10 Activity Runs by Data Factory
ADF Errors - 3) Top 10 Trigger Errors by Data Factory
ADF Statistics - 1) Activity Runs by Type
ADF Statistics - 2) Trigger Runs by Type
ADF Statistics - 3) Max Pipeline Runs Duration
You can visualize the preceding metrics, look at the queries behind these metrics, edit the queries, create alerts,
and take other actions.
NOTE
Azure Data Factory Analytics (Preview) sends diagnostic logs to Resource-specific destination tables. You can write queries
against the following tables: ADFPipelineRun, ADFTriggerRun, and ADFActivityRun.
METRIC | METRIC DISPLAY NAME | UNIT | AGGREGATION TYPE | DESCRIPTION
To access the metrics, complete the instructions in Azure Monitor data platform.
NOTE
Only events from completed, triggered activity and pipeline runs are emitted. In progress and debug runs are not
emitted. On the other hand, events from all SSIS package executions are emitted, including those that are completed and
in progress, regardless of their invocation methods. For example, you can invoke package executions on Azure-enabled
SQL Server Data Tools (SSDT), via T-SQL on SSMS, SQL Server Agent, or other designated tools, and as triggered or
debug runs of Execute SSIS Package activities in ADF pipelines.
NOTE
Make sure to select All in the Filter by resource type drop-down list.
3. Define the alert details.
Headers
{
"properties": {
"storageAccountId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.Storage/storageAccounts/<stor
ageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.EventHub/namespaces/<eventHub
Name>/authorizationrules/RootManageSharedAccessKey",
"workspaceId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>/providers/Microsoft.OperationalInsights/workspace
s/<LogAnalyticsName>",
"metrics": [
],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"location": ""
}
Response
200 OK.
{
"id":
"/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/provider
s/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.Storage/storageAccounts/<sto
rageAccountName>",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.EventHub/namespaces/<eventHu
bName>/authorizationrules/RootManageSharedAccessKey",
"workspaceId":
"/subscriptions/<subID>/resourceGroups/<resourceGroupName>//providers/Microsoft.OperationalInsights/workspac
es/<LogAnalyticsName>",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}
GET
https://management.azure.com/{resource-id}/providers/microsoft.insights/diagnosticSettings/service?api-
version={api-version}
Headers
200 OK.
{
"id":
"/subscriptions/<subID>/resourcegroups/adf/providers/microsoft.datafactory/factories/shloadobetest2/provider
s/microsoft.insights/diagnosticSettings/service",
"type": null,
"name": "service",
"location": null,
"kind": null,
"tags": null,
"properties": {
"storageAccountId":
"/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.Storage/storageAccounts/azmonlogs",
"serviceBusRuleId":
"/subscriptions/<subID>/resourceGroups/shloprivate/providers/Microsoft.EventHub/namespaces/shloeventhub/auth
orizationrules/RootManageSharedAccessKey",
"workspaceId":
"/subscriptions/<subID>/resourceGroups/ADF/providers/Microsoft.OperationalInsights/workspaces/mihaipie",
"eventHubAuthorizationRuleId": null,
"eventHubName": null,
"metrics": [],
"logs": [
{
"category": "PipelineRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "TriggerRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
},
{
"category": "ActivityRuns",
"enabled": true,
"retentionPolicy": {
"enabled": false,
"days": 0
}
}
]
},
"identity": null
}
{
"Level": "",
"correlationId":"",
"time":"",
"activityRunId":"",
"pipelineRunId":"",
"resourceId":"",
"category":"ActivityRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"activityName":"",
"start":"",
"end":"",
"properties":
{
"Input": "{
"source": {
"type": "BlobSource"
},
"sink": {
"type": "BlobSink"
}
}",
"Output": "{"dataRead":121,"dataWritten":121,"copyDuration":5,
"throughput":0.0236328132,"errors":[]}",
"Error": "{
"errorCode": "null",
"message": "null",
"failureType": "null",
"target": "CopyBlobtoBlob"
}
}
}
{
"Level": "",
"correlationId":"",
"time":"",
"runId":"",
"resourceId":"",
"category":"PipelineRuns",
"level":"Informational",
"operationName":"",
"pipelineName":"",
"start":"",
"end":"",
"status":"",
"properties":
{
"Parameters": {
"<parameter1Name>": "<parameter1Value>"
},
"SystemParameters": {
"ExecutionStart": "",
"TriggerId": "",
"SubscriptionId": ""
}
}
}
{
"Level": "",
"correlationId":"",
"time":"",
"triggerId":"",
"resourceId":"",
"category":"TriggerRuns",
"level":"Informational",
"operationName":"",
"triggerName":"",
"triggerType":"",
"triggerEvent":"",
"start":"",
"status":"",
"properties":
{
"Parameters": {
"TriggerTime": "",
"ScheduleTime": ""
},
"SystemParameters": {}
}
}
{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"resultType": "",
"properties": {
"message": ""
},
"resourceId": ""
}
{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"executionId": "",
"executionPath": "",
"startTime": "",
"endTime": "",
"executionDuration": "",
"executionResult": "",
"executionValue": ""
},
"resourceId": ""
}
{
"time": "",
"operationName": "",
"category": "",
"correlationId": "",
"dataFactoryName": "",
"integrationRuntimeName": "",
"level": "",
"properties": {
"executionId": "",
"packageName": "",
"taskName": "",
"subcomponentName": "",
"phase": "",
"startTime": "",
"endTime": "",
"executionPath": ""
},
"resourceId": ""
}
dataflowPathName | String | The name of the data flow path | ADO NET Source Output
To raise alerts on SSIS operational metrics from the Azure portal, select the Alerts page of the Azure Monitor hub and follow the step-by-step instructions provided.
The schemas and content of SSIS package execution logs in Azure Monitor and Log Analytics are similar to the
schemas of SSISDB internal tables or views.
SSISIntegrationRuntimeLogs ADFSSISIntegrationRuntimeLogs
For more info on SSIS operational log attributes/properties, see Azure Monitor and Log Analytics schemas for
ADF.
Your selected SSIS package execution logs are always sent to Log Analytics regardless of their invocation
methods. For example, you can invoke package executions on Azure-enabled SSDT, via T-SQL on SSMS, SQL
Server Agent, or other designated tools, and as triggered or debug runs of Execute SSIS Package activities in
ADF pipelines.
When querying SSIS IR operation logs in Log Analytics, you can use the OperationName and ResultType properties, which are set to Start/Stop/Maintenance and Started/InProgress/Succeeded/Failed, respectively.
When querying SSIS package execution logs in Log Analytics, you can join them using the OperationId/ExecutionId/CorrelationId properties. OperationId/ExecutionId are always set to 1 for all operations/executions related to packages not stored in SSISDB or invoked via T-SQL.
Next steps
Monitor and manage pipelines programmatically
Programmatically monitor an Azure data factory
7/6/2021 • 3 minutes to read • Edit Online
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Data range
Data Factory only stores pipeline run data for 45 days. When you query programmatically for data about Data
Factory pipeline runs - for example, with the PowerShell command Get-AzDataFactoryV2PipelineRun - there are
no maximum dates for the optional LastUpdatedAfter and LastUpdatedBefore parameters. But if you query for
data for the past year, for example, you won't get an error but only pipeline run data from the last 45 days.
If you want to keep pipeline run data for more than 45 days, set up your own diagnostic logging with Azure
Monitor.
.NET
For a complete walk-through of creating and monitoring a pipeline using .NET SDK, see Create a data factory
and pipeline using .NET.
1. Add the following code to continuously check the status of the pipeline run until it finishes copying the
data.
// Monitor the pipeline run
Console.WriteLine("Checking pipeline run status...");
PipelineRun pipelineRun;
while (true)
{
pipelineRun = client.PipelineRuns.Get(resourceGroup, dataFactoryName, runResponse.RunId);
Console.WriteLine("Status: " + pipelineRun.Status);
if (pipelineRun.Status == "InProgress" || pipelineRun.Status == "Queued")
System.Threading.Thread.Sleep(15000);
else
break;
}
2. Add the following code that retrieves copy activity run details, for example, the size of the data read/written.
For complete documentation on .NET SDK, see Data Factory .NET SDK reference.
Python
For a complete walk-through of creating and monitoring a pipeline using Python SDK, see Create a data factory
and pipeline using Python.
To monitor the pipeline run, add the following code:
For complete documentation on Python SDK, see Data Factory Python SDK reference.
REST API
For a complete walk-through of creating and monitoring a pipeline using REST API, see Create a data factory
and pipeline using REST API.
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.
$request =
"https://management.azure.com/subscriptions/${subsId}/resourceGroups/${resourceGroup}/providers/Micro
soft.DataFactory/factories/${dataFactoryName}/pipelineruns/${runId}?api-version=${apiVersion}"
while ($True) {
    $response = Invoke-RestMethod -Method GET -Uri $request -Header $authHeader
    Write-Host "Pipeline run status: " $response.Status -foregroundcolor "Yellow"
    if (($response.Status -ne "InProgress") -and ($response.Status -ne "Queued")) { break }
    Start-Sleep -Seconds 15
}
2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.
$request =
"https://management.azure.com/subscriptions/${subscriptionId}/resourceGroups/${resourceGroupName}/pro
viders/Microsoft.DataFactory/factories/${factoryName}/pipelineruns/${runId}/queryActivityruns?api-
version=${apiVersion}&startTime="+(Get-Date).ToString('yyyy-MM-dd')+"&endTime="+(Get-
Date).AddDays(1).ToString('yyyy-MM-dd')+"&pipelineName=Adfv2QuickStartPipeline"
$response = Invoke-RestMethod -Method POST -Uri $request -Header $authHeader
$response | ConvertTo-Json
For complete documentation on REST API, see Data Factory REST API reference.
PowerShell
For a complete walk-through of creating and monitoring a pipeline using PowerShell, see Create a data factory
and pipeline using PowerShell.
1. Run the following script to continuously check the pipeline run status until it finishes copying the data.
while ($True) {
$run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName
$DataFactoryName -PipelineRunId $runId
if ($run) {
if ( ($run.Status -ne "InProgress") -and ($run.Status -ne "Queued") ) {
Write-Output ("Pipeline run finished. The status is: " + $run.Status)
$run
break
}
Write-Output ("Pipeline is running...status: " + $run.Status)
}
Start-Sleep -Seconds 30
}
2. Run the following script to retrieve copy activity run details, for example, size of the data read/written.
Write-Host "Activity run details:" -foregroundcolor "Yellow"
$result = Get-AzDataFactoryV2ActivityRun -DataFactoryName $dataFactoryName -ResourceGroupName
$resourceGroupName -PipelineRunId $runId -RunStartedAfter (Get-Date).AddMinutes(-30) -
RunStartedBefore (Get-Date).AddMinutes(30)
$result
For complete documentation on PowerShell cmdlets, see Data Factory PowerShell cmdlet reference.
Next steps
See Monitor pipelines using Azure Monitor article to learn about using Azure Monitor to monitor Data Factory
pipelines.
Monitor an integration runtime in Azure Data
Factory
5/28/2021 • 14 minutes to read • Edit Online
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
To get the status of an instance of integration runtime (IR), run the following PowerShell command:
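The command itself is not shown in this extract; a likely form, based on the Az.DataFactory module (names are placeholders), is:
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<resourceGroupName>" `
    -DataFactoryName "<dataFactoryName>" -Name "<integrationRuntimeName>" -Status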
The cmdlet returns different information for different types of integration runtime. This article explains the
properties and statuses for each type of integration runtime.
DataFactoryName Name of the data factory that the Azure integration runtime
belongs to.
ResourceGroupName Name of the resource group that the data factory belongs
to.
Status
The following table provides possible statuses of an Azure integration runtime:
STATUS | COMMENTS/SCENARIOS
NOTE
The returned properties and status contain information about overall self-hosted integration runtime and each node in
the runtime.
Properties
The following table provides descriptions of monitoring properties for each node:
Concurrent Jobs (Running/ Limit) Running . Number of jobs or tasks running on each node.
This value is a near real-time snapshot.
Some settings of the properties make more sense when there are two or more nodes in the self-hosted
integration runtime (that is, in a scale out scenario).
Concurrent jobs limit
The default value of the concurrent jobs limit is set based on the machine size. The factors used to calculate this
value depend on the amount of RAM and the number of CPU cores of the machine. So the more cores and the
more memory, the higher the default limit of concurrent jobs.
You scale out by increasing the number of nodes. When you increase the number of nodes, the concurrent jobs
limit is the sum of the concurrent job limit values of all the available nodes. For example, if one node lets you
run a maximum of twelve concurrent jobs, then adding three more similar nodes lets you run a maximum of 48
concurrent jobs (that is, 4 x 12). We recommend that you increase the concurrent jobs limit only when you see
low resource usage with the default values on each node.
You can override the calculated default value in the Azure portal. Select Author > Connections > Integration Runtimes > Edit > Nodes > Modify concurrent job value per node. You can also use the PowerShell Update-AzDataFactoryV2IntegrationRuntimeNode command.
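For example, a minimal sketch of updating the limit on a single node; all names and the limit value are placeholders.
Update-AzDataFactoryV2IntegrationRuntimeNode -ResourceGroupName "<resourceGroupName>" `
    -DataFactoryName "<dataFactoryName>" -IntegrationRuntimeName "<integrationRuntimeName>" `
    -Name "<nodeName>" -ConcurrentJobsLimit 5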
Status (per node )
The following table provides possible statuses of a self-hosted integration runtime node:
Use the Get-AzDataFactoryV2IntegrationRuntimeMetric cmdlet to fetch the JSON payload containing the detailed self-hosted integration runtime properties and their snapshot values at the time the cmdlet runs.
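A typical invocation looks like the following (resource names are placeholders); the sample output below shows the kind of payload it returns.
Get-AzDataFactoryV2IntegrationRuntimeMetric -ResourceGroupName "<resourceGroupName>" `
    -DataFactoryName "<dataFactoryName>" -Name "<integrationRuntimeName>" | ConvertTo-Json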
Sample output (assumes that there are two nodes associated with this self-hosted integration runtime):
{
"IntegrationRuntimeName": "<Name of your integration runtime>",
"ResourceGroupName": "<Resource Group Name>",
"DataFactoryName": "<Data Factory Name>",
"Nodes": [
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
},
{
"NodeName": "<Node Name>",
"AvailableMemoryInMB": <Value>,
"CpuUtilization": <Value>,
"ConcurrentJobsLimit": <Value>,
"ConcurrentJobsRunning": <Value>,
"MaxConcurrentJobs": <Value>,
"SentBytes": <Value>,
"ReceivedBytes": <Value>
}
]
}
Properties
The following table provides descriptions of properties returned by the above cmdlet for an Azure-SSIS IR.
CatalogAdminUserName The admin username for your existing Azure SQL Database
server or managed instance. ADF uses this information to
prepare and manage SSISDB on your behalf.
CatalogAdminPassword The admin password for your existing Azure SQL Database
server or managed instance.
CatalogPricingTier The pricing tier for SSISDB hosted by Azure SQL Database
server. Not applicable to Azure SQL Managed Instance
hosting SSISDB.
ResourceGroupName The name of your Azure Resource Group, in which your ADF
and Azure-SSIS IR were created.
Next, select the name of your Azure-SSIS IR to open its monitoring page, where you can see its overall/node-
specific properties and statuses. On this page, depending on how you configure the general, deployment, and
advanced settings of your Azure-SSIS IR, you'll find various informational/functional tiles.
The TYPE and REGION informational tiles show the type and region of your Azure-SSIS IR, respectively.
The NODE SIZE informational tile shows the SKU (SSIS edition_VM tier_VM series), number of CPU cores, and
size of RAM per node for your Azure-SSIS IR.
The RUNNING / REQUESTED NODE(S) informational tile compares the number of nodes currently running
to the total number of nodes previously requested for your Azure-SSIS IR.
The DUAL STANDBY PAIR / ROLE informational tile shows the name of your dual standby Azure-SSIS IR pair
that works in sync with Azure SQL Database/Managed Instance failover group for business continuity and
disaster recovery (BCDR) and the current primary/secondary role of your Azure-SSIS IR. When SSISDB failover
occurs, your primary and secondary Azure-SSIS IRs will swap roles (see Configuring your Azure-SSIS IR for
BCDR).
The functional tiles are described in more detail below.
STATUS tile
On the STATUS tile of your Azure-SSIS IR monitoring page, you can see its overall status, for example Running
or Stopped. Selecting the Running status pops up a window with a live Stop button to stop your Azure-SSIS IR.
Selecting the Stopped status pops up a window with a live Start button to start your Azure-SSIS IR. The pop-up
window also has an Execute SSIS package button to auto-generate an ADF pipeline with Execute SSIS
Package activity that runs on your Azure-SSIS IR (see Running SSIS packages as Execute SSIS Package activities
in ADF pipelines) and a Resource ID text box, from which you can copy your Azure-SSIS IR resource ID (
/subscriptions/YourAzureSubscripton/resourcegroups/YourResourceGroup/providers/Microsoft.DataFactory/factories/YourADF/integrationruntimes/YourAzur
). The suffix of your Azure-SSIS IR resource ID that contains your ADF and Azure-SSIS IR names forms a cluster
ID that can be used to purchase additional premium/licensed SSIS components from independent software
vendors (ISVs) and bind them to your Azure-SSIS IR (see Installing premium/licensed components on your
Azure-SSIS IR).
SSISDB SERVER ENDPOINT tile
If you use Project Deployment Model where packages are stored in SSISDB hosted by your Azure SQL Database
server or managed instance, you'll see the SSISDB SERVER ENDPOINT tile on your Azure-SSIS IR monitoring
page (see Configuring your Azure-SSIS IR deployment settings). On this tile, you can select a link designating
your Azure SQL Database server or managed instance to pop up a window, where you can copy the server
endpoint from a text box and use it when connecting from SSMS to deploy, configure, run, and manage your
packages. On the pop-up window, you can also select the See your Azure SQL Database or managed
instance settings link to reconfigure/resize your SSISDB in Azure portal.
ERROR(S) tile
If there are issues with the starting/stopping/maintenance/upgrade of your Azure-SSIS IR, you'll see an
additional ERROR(S) tile on your Azure-SSIS IR monitoring page. On this tile, you can select a link designating
the number of errors generated by your Azure-SSIS IR to pop up a window, where you can see those errors in
more detail and copy them to find the recommended solutions in our troubleshooting guide (see
Troubleshooting your Azure-SSIS IR).
Next steps
See the following articles for monitoring pipelines in different ways:
Quickstart: create a data factory.
Use Azure Monitor to monitor Data Factory pipelines
Reconfigure the Azure-SSIS integration runtime
3/5/2021 • 3 minutes to read • Edit Online
Data Factory UI
You can use Data Factory UI to stop, edit/reconfigure, or delete an Azure-SSIS IR.
1. Open Data Factory UI by selecting the Author & Monitor tile on the home page of your data factory.
2. Select the Manage hub below Home , Edit , and Monitor hubs to show the Connections pane.
To reconfigure an Azure-SSIS IR
On the Connections pane of Manage hub, switch to the Integration runtimes page and select Refresh .
You can edit/reconfigure your Azure-SSIS IR by selecting its name. You can also select the relevant buttons to
monitor/start/stop/delete your Azure-SSIS IR, auto-generate an ADF pipeline with Execute SSIS Package activity
to run on your Azure-SSIS IR, and view the JSON code/payload of your Azure-SSIS IR. Editing/deleting your
Azure-SSIS IR can only be done when it's stopped.
Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
After you provision and start an instance of Azure-SSIS integration runtime, you can reconfigure it by running a
sequence of Stop - Set - Start PowerShell cmdlets consecutively. For example, the following PowerShell
script changes the number of nodes allocated for the Azure-SSIS integration runtime instance to five.
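A hedged sketch of such a script, assuming $ResourceGroupName, $DataFactoryName, and $AzureSSISName are variables you have set to identify your resource group, data factory, and Azure-SSIS IR:

Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName -Name $AzureSSISName -Force

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName -Name $AzureSSISName -NodeCount 5

Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName -Name $AzureSSISName -Force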
Reconfigure an Azure-SSIS IR
1. First, stop the Azure-SSIS integration runtime by using the Stop-AzDataFactoryV2IntegrationRuntime
cmdlet. This command releases all of its nodes and stops billing.
2. Next, stop all existing Azure-SSIS IRs in your data factory.
3. Next, remove all existing Azure-SSIS IRs in your data factory one by one.
4. If you had created a new resource group, remove the resource group (see the sketch after this list).
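Assuming the same placeholder variables ($ResourceGroupName, $DataFactoryName), a hedged sketch of those cleanup steps:

# Stop and remove every Azure-SSIS (Managed) IR in the data factory, then remove the resource group.
$integrationRuntimes = Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
    -DataFactoryName $DataFactoryName
foreach ($ir in $integrationRuntimes)
{
    if ($ir.Type -eq "Managed")
    {
        Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
            -DataFactoryName $DataFactoryName -Name $ir.Name -Force
        Remove-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
            -DataFactoryName $DataFactoryName -Name $ir.Name -Force
    }
}
Remove-AzResourceGroup -Name $ResourceGroupName -Force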
Next steps
For more information about Azure-SSIS runtime, see the following topics:
Azure-SSIS Integration Runtime. This article provides conceptual information about integration runtimes in
general including the Azure-SSIS IR.
Tutorial: deploy SSIS packages to Azure. This article provides step-by-step instructions to create an Azure-
SSIS IR and uses Azure SQL Database to host the SSIS catalog.
How to: Create an Azure-SSIS integration runtime. This article expands on the tutorial and provides
instructions on using Azure SQL Managed Instance and joining the IR to a virtual network.
Join an Azure-SSIS IR to a virtual network. This article provides conceptual information about joining an
Azure-SSIS IR to an Azure virtual network. It also provides steps to use Azure portal to configure virtual
network so that Azure-SSIS IR can join the virtual network.
Monitor an Azure-SSIS IR. This article shows you how to retrieve information about an Azure-SSIS IR and
descriptions of statuses in the returned information.
Copy or clone a data factory in Azure Data Factory
4/22/2021 • 2 minutes to read • Edit Online
Next steps
Review the guidance for creating a data factory in the Azure portal in Create a data factory by using the Azure
Data Factory UI.
How to create and configure Azure Integration
Runtime
7/2/2021 • 2 minutes to read • Edit Online
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Default Azure IR
By default, each data factory has an Azure IR in the backend that supports operations on cloud data stores and
compute services in a public network. The location of that Azure IR is auto-resolve. If the connectVia property is
not specified in the linked service definition, the default Azure IR is used. You only need to explicitly create an
Azure IR when you want to explicitly define the location of the IR, or when you want to virtually group the
activity executions on different IRs for management purposes.
Create Azure IR
To create and set up an Azure IR, you can use the following procedures.
Create an Azure IR via Azure PowerShell
An integration runtime can be created by using the Set-AzDataFactoryV2IntegrationRuntime PowerShell cmdlet.
To create an Azure IR, you specify the name, location, and type to the command. Here is a sample command to
create an Azure IR with the location set to "West Europe":
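A hedged sketch of such a command (the data factory and resource group names are placeholders; the IR name matches the MySampleAzureIR reference used in the linked service sample later in this article):

Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "ADFV2SampleRG" `
    -DataFactoryName "SampleV2DataFactory" `
    -Name "MySampleAzureIR" `
    -Type Managed `
    -Location "West Europe"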
For an Azure IR, the type must be set to Managed. You do not need to specify compute details because it is fully
managed and elastically scaled in the cloud. Specify compute details like node size and node count when you would like to
create Azure-SSIS IR. For more information, see Create and Configure Azure-SSIS IR.
You can configure an existing Azure IR to change its location using the Set-AzDataFactoryV2IntegrationRuntime
PowerShell cmdlet. For more information about the location of an Azure IR, see Introduction to integration
runtime.
Create an Azure IR via Azure Data Factory UI
Use the following steps to create an Azure IR using Azure Data Factory UI.
1. On the home page of Azure Data Factory UI, select the Manage tab from the leftmost pane.
2. Select Integration runtimes on the left pane, and then select +New .
3. On the Integration runtime setup page, select Azure, Self-Hosted , and then select Continue .
4. On the following page, select Azure to create an Azure IR, and then select Continue .
5. Enter a name for your Azure IR, and select Create .
6. You'll see a pop-up notification when the creation completes. On the Integration runtimes page, make
sure that you see the newly created IR in the list.
Use Azure IR
Once an Azure IR is created, you can reference it in your Linked Service definition. Below is a sample of how you
can reference the Azure Integration Runtime created above from an Azure Storage Linked Service:
{
"name": "MyStorageLinkedService",
"properties": {
"type": "AzureStorage",
"typeProperties": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=myaccountname;AccountKey=..."
},
"connectVia": {
"referenceName": "MySampleAzureIR",
"type": "IntegrationRuntimeReference"
}
}
}
Next steps
See the following articles on how to create other types of integration runtimes:
Create self-hosted integration runtime
Create Azure-SSIS integration runtime
Create and configure a self-hosted integration
runtime
7/7/2021 • 23 minutes to read • Edit Online
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
1. A data developer first creates a self-hosted integration runtime within an Azure data factory by using the
Azure portal or the PowerShell cmdlet. Then the data developer creates a linked service for an on-
premises data store, specifying the self-hosted integration runtime instance that the service should use to
connect to data stores.
2. The self-hosted integration runtime node encrypts the credentials by using Windows Data Protection
Application Programming Interface (DPAPI) and saves the credentials locally. If multiple nodes are set for
high availability, the credentials are further synchronized across other nodes. Each node encrypts the
credentials by using DPAPI and stores them locally. Credential synchronization is transparent to the data
developer and is handled by the self-hosted IR.
3. Azure Data Factory communicates with the self-hosted integration runtime to schedule and manage jobs.
Communication is via a control channel that uses a shared Azure Relay connection. When an activity job
needs to be run, Data Factory queues the request along with any credential information. It does so in case
credentials aren't already stored on the self-hosted integration runtime. The self-hosted integration
runtime starts the job after it polls the queue.
4. The self-hosted integration runtime copies data between an on-premises store and cloud storage. The
direction of the copy depends on how the copy activity is configured in the data pipeline. For this step, the
self-hosted integration runtime directly communicates with cloud-based storage services like Azure Blob
storage over a secure HTTPS channel.
Prerequisites
The supported versions of Windows are:
Windows 8.1
Windows 10
Windows Server 2012
Windows Server 2012 R2
Windows Server 2016
Windows Server 2019
Installation of the self-hosted integration runtime on a domain controller isn't supported.
The self-hosted integration runtime requires a 64-bit operating system with .NET Framework 4.7.2 or later. See
.NET Framework System Requirements for details.
The recommended minimum configuration for the self-hosted integration runtime machine is a 2-GHz
processor with 4 cores, 8 GB of RAM, and 80 GB of available hard drive space. For the details of system
requirements, see Download.
If the host machine hibernates, the self-hosted integration runtime doesn't respond to data requests.
Configure an appropriate power plan on the computer before you install the self-hosted integration runtime.
If the machine is configured to hibernate, the self-hosted integration runtime installer prompts with a
message.
You must be an administrator on the machine to successfully install and configure the self-hosted integration
runtime.
Copy-activity runs happen with a specific frequency. Processor and RAM usage on the machine follows the
same pattern with peak and idle times. Resource usage also depends heavily on the amount of data that is
moved. When multiple copy jobs are in progress, you see resource usage go up during peak times.
Tasks might fail during extraction of data in Parquet, ORC, or Avro formats. For more on Parquet, see Parquet
format in Azure Data Factory. File creation runs on the self-hosted integration machine. To work as expected,
file creation requires the following prerequisites:
Visual C++ 2010 Redistributable Package (x64)
Java Runtime (JRE) version 8 from a JRE provider such as Adopt OpenJDK. Ensure that the JAVA_HOME
environment variable is set to the JRE folder (and not just the JDK folder).
NOTE
If you are running in government cloud, please review Connect to government cloud.
NOTE
To run PowerShell commands in Azure Government, see Connect to Azure Government with PowerShell.
2. Select Integration runtimes on the left pane, and then select +New .
3. On the Integration runtime setup page, select Azure, Self-Hosted , and then select Continue .
4. On the following page, select Self-Hosted to create a Self-Hosted IR, and then select Continue .
-era, -EnableRemoteAccess "<port>" ["<thumbprint>"]
Enable remote access on the current node to set up a high-availability cluster, or enable setting credentials
directly against the self-hosted IR without going through Azure Data Factory. You do the latter by using the
New-AzDataFactoryV2LinkedServiceEncryptedCredential cmdlet from a remote machine in the same network.
-erac, -EnableRemoteAccessInContainer "<port>" ["<thumbprint>"]
Enable remote access to the current node when the node runs in a container.
-gbf, -GenerateBackupFile "<filePath>" "<password>"
Generate a backup file for the current node. The backup file includes the node key and data-store credentials.
-ibf, -ImportBackupFile "<filePath>" "<password>"
Restore the node from a backup file.
-ssa, -SwitchServiceAccount "<domain\user>" ["<password>"]
Set DIAHostService to run as a new account. Use the empty password "" for system accounts and virtual
accounts.
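As an illustration, a hedged sketch of running this utility: the command-line tool is dmgcmd.exe in the self-hosted IR installation folder (the path below assumes a default installation and may differ on your machine):

cd "C:\Program Files\Microsoft Integration Runtime\5.0\Shared"
.\dmgcmd.exe -EnableRemoteAccess "8060"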
10. On the Register Integration Runtime (Self-hosted) window of Microsoft Integration Runtime
Configuration Manager running on your machine, take the following steps:
a. Paste the authentication key in the text area.
b. Optionally, select Show authentication key to see the key text.
c. Select Register .
NOTE
Release Notes are available on the same Microsoft integration runtime download page.
Make sure the account has the Log on as a service permission. Otherwise, the self-hosted integration runtime
can't start successfully. You can check the permission in Local Security Policy -> Security Settings ->
Local Policies -> User Rights Assignment -> Log on as a service.
Notification area icons and notifications
If you move your cursor over the icon or message in the notification area, you can see details about the state of
the self-hosted integration runtime.
NOTE
You don't need to create a new self-hosted integration runtime to associate each node. You can install the self-hosted
integration runtime on another machine and register it by using the same authentication key.
NOTE
Before you add another node for high availability and scalability, ensure that the Remote access to intranet option is
enabled on the first node. To do so, select Microsoft Integration Runtime Configuration Manager > Settings >
Remote access to intranet .
Scale considerations
Scale out
When processor usage is high and available memory is low on the self-hosted IR, add a new node to help scale
out the load across machines. If activities fail because they time out or the self-hosted IR node is offline, it helps
if you add a node to the gateway.
Scale up
When the processor and available RAM aren't well utilized, but the execution of concurrent jobs reaches a node's
limits, scale up by increasing the number of concurrent jobs that a node can run. You might also want to scale up
when activities time out because the self-hosted IR is overloaded. As shown in the following image, you can
increase the maximum capacity for a node:
Credential Sync
If you don't store credentials or secret values in an Azure Key Vault, the credentials or secret values are stored
on the machines where your self-hosted integration runtime is located. Each node has a copy of the credentials
with a certain version. For all nodes to work together, the version number must be the same for all
nodes.
When configured, the self-hosted integration runtime uses the proxy server to connect to the cloud service's
source and destination (which use the HTTP or HTTPS protocol). This is why you select Change link during
initial setup.
There are three configuration options:
Do not use proxy : The self-hosted integration runtime doesn't explicitly use any proxy to connect to cloud
services.
Use system proxy : The self-hosted integration runtime uses the proxy setting that is configured in
diahost.exe.config and diawp.exe.config. If these files specify no proxy configuration, the self-hosted
integration runtime connects to the cloud service directly without going through a proxy.
Use custom proxy : Configure the HTTP proxy setting to use for the self-hosted integration runtime, instead
of using configurations in diahost.exe.config and diawp.exe.config. Address and Por t values are required.
User Name and Password values are optional, depending on your proxy's authentication setting. All
settings are encrypted with Windows DPAPI on the self-hosted integration runtime and stored locally on the
machine.
The integration runtime host service restarts automatically after you save the updated proxy settings.
After you register the self-hosted integration runtime, if you want to view or update proxy settings, use
Microsoft Integration Runtime Configuration Manager.
1. Open Microsoft Integration Runtime Configuration Manager .
2. Select the Settings tab.
3. Under HTTP Proxy , select the Change link to open the Set HTTP Proxy dialog box.
4. Select Next . You then see a warning that asks for your permission to save the proxy setting and restart the
integration runtime host service.
You can use the configuration manager tool to view and update the HTTP proxy.
NOTE
If you set up a proxy server with NTLM authentication, the integration runtime host service runs under the domain
account. If you later change the password for the domain account, remember to update the configuration settings for the
service and restart the service. Because of this requirement, we suggest that you access the proxy server by using a
dedicated domain account that doesn't require you to update the password frequently.
<system.net>
<defaultProxy useDefaultCredentials="true" />
</system.net>
You can then add proxy server details as shown in the following example:
<system.net>
<defaultProxy enabled="true">
<proxy bypassonlocal="true" proxyaddress="http://proxy.domain.org:8888/" />
</defaultProxy>
</system.net>
The proxy tag allows additional properties to specify required settings like scriptLocation . See <proxy>
Element (Network Settings) for syntax.
5. Save the configuration file in its original location. Then restart the self-hosted integration runtime host
service, which picks up the changes.
To restart the service, use the services applet from Control Panel. Or from Integration Runtime
Configuration Manager, select the Stop Service button, and then select Start Service.
If the service doesn't start, you likely added incorrect XML tag syntax in the application configuration file
that you edited.
IMPORTANT
Don't forget to update both diahost.exe.config and diawp.exe.config.
You also need to make sure that Microsoft Azure is in your company's allowlist. You can download the list of
valid Azure IP addresses from Microsoft Download Center.
Possible symptoms for issues related to the firewall and proxy server
If you see error messages like the following ones, the likely reason is improper configuration of the firewall or
proxy server. Such configuration prevents the self-hosted integration runtime from connecting to Data Factory
to authenticate itself. To ensure that your firewall and proxy server are properly configured, refer to the previous
section.
When you try to register the self-hosted integration runtime, you receive the following error message:
"Failed to register this Integration Runtime node! Confirm that the Authentication key is valid and the
integration service host service is running on this machine."
When you open Integration Runtime Configuration Manager, you see a status of Disconnected or
Connecting . When you view Windows event logs, under Event Viewer > Application and Ser vices
Logs > Microsoft Integration Runtime , you see error messages like this one:
If you choose not to open port 8060 on the self-hosted integration runtime machine, use mechanisms other
than the Setting Credentials application to configure data-store credentials. For example, you can use the
New-AzDataFactoryV2LinkedServiceEncryptedCredential PowerShell cmdlet.
At the corporate firewall level, you need to configure the following domains and outbound ports:
DOMAIN NAMES    OUTBOUND PORTS    DESCRIPTION
At the Windows firewall level or machine level, these outbound ports are normally enabled. If they aren't, you
can configure the domains and ports on a self-hosted integration runtime machine.
NOTE
Because Azure Relay doesn't currently support service tags, you have to use the AzureCloud or Internet service tag in NSG
rules for the communication to Azure Relay. For the communication to Azure Data Factory, you can use the
DataFactoryManagement service tag in the NSG rule setup.
Based on your source and sinks, you might need to allow additional domains and outbound ports in your
corporate firewall or Windows firewall.
DOMAIN NAMES    OUTBOUND PORTS    DESCRIPTION
For some cloud databases, such as Azure SQL Database and Azure Data Lake, you might need to allow IP
addresses of self-hosted integration runtime machines on their firewall configuration.
Get URL of Azure Relay
One required domain and port that need to be put in the allowlist of your firewall is for the communication to
Azure Relay. The self-hosted integration runtime uses it for interactive authoring such as test connection, browse
folder list and table list, get schema, and preview data. If you don't want to allow .servicebus.windows.net and
would like to have more specific URLs, then you can see all the FQDNs that are required by your self-hosted
integration runtime from the ADF portal. Follow these steps:
1. Go to ADF portal and select your self-hosted integration runtime.
2. In Edit page, select Nodes .
3. Select View Service URLs to get all FQDNs.
NOTE
For the details related to Azure Relay connections protocol, see Azure Relay Hybrid Connections protocol.
NOTE
If your firewall doesn't allow outbound port 1433, the self-hosted integration runtime can't access the SQL database
directly. In this case, you can use a staged copy to SQL Database and Azure Synapse Analytics. In this scenario, you
require only HTTPS (port 443) for the data movement.
Next steps
For step-by-step instructions, see Tutorial: Copy on-premises data to cloud.
Self-hosted integration runtime auto-update and
expire notification
7/15/2021 • 2 minutes to read • Edit Online
You can check the last update datetime in your self-hosted integration runtime client.
You can also use PowerShell to get the auto-update setting and the current version.
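A minimal sketch, assuming placeholder resource group, data factory, and IR names (the -Status output for a self-hosted IR includes its version and auto-update information):

Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "ADFV2SampleRG" `
    -DataFactoryName "SampleV2DataFactory" `
    -Name "SampleSelfHostedIR" `
    -Status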
NOTE
If you have multiple self-hosted integration runtime nodes, there is no downtime during auto-update. The auto-update
happens on one node first while the others are working on tasks. When the first node finishes the update, it takes over
the remaining tasks while the other nodes are updating. If you only have one self-hosted integration runtime node, it has
some downtime during the auto-update.
Next steps
Review integration runtime concepts in Azure Data Factory.
Learn how to create a self-hosted integration runtime in the Azure portal.
Create a shared self-hosted integration runtime in
Azure Data Factory
7/21/2021 • 6 minutes to read • Edit Online
Terminology
Shared IR : An original self-hosted IR that runs on a physical infrastructure.
Linked IR : An IR that references another shared IR. The linked IR is a logical IR and uses the infrastructure of
another shared self-hosted IR.
2. Note and copy the above "Resource ID" of the self-hosted IR to be shared.
3. In the data factory to which the permissions were granted, create a new self-hosted IR (linked) and enter
the resource ID.
Create a shared self-hosted IR using Azure PowerShell
To create a shared self-hosted IR using Azure PowerShell, you can take following steps:
1. Create a data factory.
2. Create a self-hosted integration runtime.
3. Share the self-hosted integration runtime with other data factories.
4. Create a linked integration runtime.
5. Revoke the sharing.
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure subscription . If you don't have an Azure subscription, create a free account before you begin.
Azure PowerShell . Follow the instructions in Install Azure PowerShell on Windows with PowerShellGet.
You use PowerShell to run a script to create a self-hosted integration runtime that can be shared with
other data factories.
NOTE
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you on Products
available by region.
# If input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$".
$SubscriptionName = "[Azure subscription name]"
$ResourceGroupName = "[Azure resource group name]"
$DataFactoryLocation = "EastUS"

# Shared self-hosted integration runtime information. This is a Data Factory compute resource for running any activities.
# Data factory name. Must be globally unique
$SharedDataFactoryName = "[Shared Data factory name]"
$SharedIntegrationRuntimeName = "[Shared Integration Runtime Name]"
$SharedIntegrationRuntimeDescription = "[Description for Shared Integration Runtime]"

# Linked integration runtime information. This is a Data Factory compute resource for running any activities.
# Data factory name. Must be globally unique
$LinkedDataFactoryName = "[Linked Data factory name]"
$LinkedIntegrationRuntimeName = "[Linked Integration Runtime Name]"
$LinkedIntegrationRuntimeDescription = "[Description for Linked Integration Runtime]"
3. Sign in and select a subscription. Add the following code to the script to sign in and select your Azure
subscription:
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName
NOTE
This step is optional. If you already have a data factory, skip this step.
Create an Azure resource group by using the New-AzResourceGroup command. A resource group is a
logical container into which Azure resources are deployed and managed as a group. The following
example creates a resource group named myResourceGroup in the WestEurope location:
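A minimal sketch of that command (in the full script, you would typically pass the $ResourceGroupName and $DataFactoryLocation variables defined earlier instead of literal values):

New-AzResourceGroup -Name "myResourceGroup" -Location "WestEurope"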
NOTE
This step is optional. If you already have the self-hosted integration runtime that you want to share with other data
factories, skip this step.
$SharedIR = Set-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName `
-Type SelfHosted `
-Description $SharedIntegrationRuntimeDescription
Get-AzDataFactoryV2IntegrationRuntimeKey `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName
The response contains the authentication key for this self-hosted integration runtime. You use this key when you
register the integration runtime node.
Install and register the self-hosted integration runtime
1. Download the self-hosted integration runtime installer from Azure Data Factory Integration Runtime.
2. Run the installer to install the self-hosted integration on a local computer.
3. Register the new self-hosted integration with the authentication key that you retrieved in a previous step.
Share the self-hosted integration runtime with another data factory
Create another data factory
NOTE
This step is optional. If you already have the data factory that you want to share with, skip this step. But in order to add
or remove role assignments for the other data factory, you must have Microsoft.Authorization/roleAssignments/write
and Microsoft.Authorization/roleAssignments/delete permissions, such as User Access Administrator or Owner.
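If you do need to create it, a hedged sketch using the $LinkedDataFactoryName variable defined earlier; capturing the result lets the role assignment below reference the factory's managed identity via $factory.Identity.PrincipalId:

$factory = Set-AzDataFactoryV2 -ResourceGroupName $ResourceGroupName `
    -Name $LinkedDataFactoryName `
    -Location $DataFactoryLocation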
Grant permission
Grant permission to the data factory that needs to access the self-hosted integration runtime you created and
registered.
IMPORTANT
Do not skip this step!
# $factory is the data factory with which the shared IR needs to be shared; its managed identity (MSI) is granted the role.
New-AzRoleAssignment `
    -ObjectId $factory.Identity.PrincipalId `
    -RoleDefinitionName 'Contributor' `
    -Scope $SharedIR.Id
Set-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $LinkedDataFactoryName `
-Name $LinkedIntegrationRuntimeName `
-Type SelfHosted `
-SharedIntegrationRuntimeResourceId $SharedIR.Id `
-Description $LinkedIntegrationRuntimeDescription
Now you can use this linked integration runtime in any linked service. The linked integration runtime uses the
shared integration runtime to run activities.
Revoke integration runtime sharing from a data factory
To revoke the access of a data factory from the shared integration runtime, run the following command:
Remove-AzRoleAssignment `
-ObjectId $factory.Identity.PrincipalId `
-RoleDefinitionName 'Contributor' `
-Scope $SharedIR.Id
To remove the existing linked integration runtime, run the following command against the shared integration
runtime:
Remove-AzDataFactoryV2IntegrationRuntime `
-ResourceGroupName $ResourceGroupName `
-DataFactoryName $SharedDataFactoryName `
-Name $SharedIntegrationRuntimeName `
-LinkedDataFactoryName $LinkedDataFactoryName
Monitoring
Shared IR
Linked IR
NOTE
This feature is available only in Data Factory V2.
Next steps
Review integration runtime concepts in Azure Data Factory.
Learn how to create a self-hosted integration runtime in the Azure portal.
Automating self-hosted integration runtime
installation using local PowerShell scripts
5/6/2021 • 2 minutes to read • Edit Online
To automate installation of Self-hosted Integration Runtime on local machines (other than Azure VMs where we
can leverage the Resource Manager template instead), you can use local PowerShell scripts. This article
introduces two scripts you can use.
Prerequisites
Launch PowerShell on your local machine. To run the scripts, you need to choose Run as Administrator .
Download the self-hosted integration runtime software. Copy the path where the downloaded file is.
You also need an authentication key to register the self-hosted integration runtime.
For automating manual updates, you need to have a pre-configured self-hosted integration runtime.
Scripts introduction
NOTE
These scripts are created using the documented command-line utility in the self-hosted integration runtime. If needed,
you can customize these scripts to fit your automation needs. The scripts need to be applied per node, so
make sure to run them across all nodes in case of a high-availability setup (2 or more nodes).
For automating setup: Install and register a new self-hosted integration runtime node using
InstallGatewayOnLocalMachine.ps1 - The script can be used to install self-hosted integration runtime
node and register it with an authentication key. The script accepts two arguments, first specifying the
location of the self-hosted integration runtime on a local disk, second specifying the authentication
key (for registering self-hosted IR node).
For automating manual updates: Update the self-hosted IR node with a specific version or to the latest
version script-update-gateway.ps1 - This is also supported in case you have turned off the auto-
update, or want to have more control over updates. The script can be used to update the self-hosted
integration runtime node to the latest version or to a specified higher version (downgrade doesn’t work).
It accepts an argument for specifying version number (example: -version 3.13.6942.1). When no version
is specified, it always updates the self-hosted IR to the latest version found in the downloads.
NOTE
Only the last 3 versions can be specified. Ideally, this is used to update an existing node to the latest version. It
assumes that you already have a registered self-hosted IR.
Usage examples
For automating setup
1. Download the self-hosted IR from here.
2. Specify the path where the downloaded SHIR MSI (installation file) is located. For example, if the path is
C:\Users\username\Downloads\IntegrationRuntime_4.7.7368.1.msi, then you can use the following PowerShell
command-line example for this task:
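A hedged example invocation (the -path and -authKey parameter names are assumptions based on the script's two documented arguments; adjust them to match your copy of the script):

PS C:\windows\system32> C:\Users\username\Desktop\InstallGatewayOnLocalMachine.ps1 -path "C:\Users\username\Downloads\IntegrationRuntime_4.7.7368.1.msi" -authKey "[key]"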
NOTE
Replace [key] with the authentication key to register your IR. Replace "username" with your user name. Specify the
location of the "InstallGatewayOnLocalMachine.ps1" file when running the script. In this example we stored it on
Desktop.
3. If there is a pre-installed self-hosted IR on your machine, the script automatically uninstalls it and then
configures a new one. You'll see the following window pop up:
4. When the installation and key registration completes, you'll see Succeed to install gateway and Succeed
to register gateway results in your local PowerShell.
For automating manual updates
To update the self-hosted IR to the latest version, run the script without a version argument, for example:
PS C:\windows\system32> C:\Users\username\Desktop\script-update-gateway.ps1
If your current version is already the latest one, you'll see the following result, suggesting that no update is
required.
How to run Self-Hosted Integration Runtime in
Windows container
5/25/2021 • 2 minutes to read • Edit Online
Prerequisites
Windows container requirements
Docker Version 2.3 and later
Self-Hosted Integration Runtime Version 5.2.7713.1 and later
Get started
1. Install Docker and enable Windows Container
2. Download the source code from https://github.com/Azure/Azure-Data-Factory-Integration-Runtime-in-
Windows-Container
3. Download the latest version of the SHIR into the ‘SHIR’ folder
4. Open your folder in the shell:
cd"yourFolderPath"
dockerbuild.-t"yourDockerImageName"
dockerrun-d-eNODE_NAME="irNodeName"-eAUTH_KEY="IR_AUTHENTICATION_KEY"-eENABLE_HA=true-e HA_PORT=8060
"yourDockerImageName"
NOTE
AUTH_KEY is mandatory for this command. NODE_NAME, ENABLE_HA and HA_PORT are optional. If you don't set the
value, the command will use default values. The default value of ENABLE_HA is false and HA_PORT is 8060.
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Azure subscription . If you don't already have a subscription, you can create a free trial account.
Azure SQL Database ser ver or SQL Managed Instance (optional) . If you don't already have a
database server or managed instance, create one in the Azure portal before you get started. Data Factory
will in turn create an SSISDB instance on this database server.
We recommend that you create the database server or managed instance in the same Azure region as
the integration runtime. This configuration lets the integration runtime write execution logs into SSISDB
without crossing Azure regions.
Keep these points in mind:
The SSISDB instance can be created on your behalf as a single database, as part of an elastic pool,
or in a managed instance. It can be accessible in a public network or by joining a virtual network.
For guidance in choosing between SQL Database and SQL Managed Instance to host SSISDB, see
the Compare SQL Database and SQL Managed Instance section in this article.
If you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a SQL managed instance with private endpoint to host SSISDB, or if you require access to on-
premises data without configuring a self-hosted IR, you need to join your Azure-SSIS IR to a virtual
network. For more information, see Join an Azure-SSIS IR to a virtual network.
Confirm that the Allow access to Azure services setting is enabled for the database server. This
setting is not applicable when you use an Azure SQL Database server with IP firewall rules/virtual
network service endpoints or a SQL managed instance with private endpoint to host SSISDB. For
more information, see Secure Azure SQL Database. To enable this setting by using PowerShell, see
New-AzSqlServerFirewallRule.
Add the IP address of the client machine, or a range of IP addresses that includes the IP address of
the client machine, to the client IP address list in the firewall settings for the database server. For
more information, see Azure SQL Database server-level and database-level firewall rules.
You can connect to the database server by using SQL authentication with your server admin
credentials, or by using Azure AD authentication with the specified system/user-assigned managed
identity for your data factory. For the latter, you need to add the specified system/user-assigned
managed identity for your data factory into an Azure AD group with access permissions to the
database server. For more information, see Enable Azure AD authentication for an Azure-SSIS IR.
Confirm that your database server does not have an SSISDB instance already. The provisioning of
an Azure-SSIS IR does not support using an existing SSISDB instance.
Azure Resource Manager vir tual network (optional) . You must have an Azure Resource Manager
virtual network if at least one of the following conditions is true:
You're hosting SSISDB on an Azure SQL Database server with IP firewall rules/virtual network
service endpoints or a managed instance with private endpoint.
You want to connect to on-premises data stores from SSIS packages running on your Azure-SSIS
IR without configuring a self-hosted IR.
Azure PowerShell (optional) . Follow the instructions in How to install and configure Azure PowerShell,
if you want to run a PowerShell script to provision your Azure-SSIS IR.
Regional support
For a list of Azure regions in which Data Factory and an Azure-SSIS IR are available, see Data Factory and SSIS IR
availability by region.
Comparison of SQL Database and SQL Managed Instance
The following table compares certain features of an Azure SQL Database server and SQL Managed Instance as
they relate to the Azure-SSIS IR:
Scheduling
SQL Database: The SQL Server Agent is not available. See Schedule a package execution in a Data Factory pipeline.
SQL Managed Instance: The Managed Instance Agent is available.
Authentication
SQL Database: You can create an SSISDB instance with a contained database user who represents any Azure AD
group with the managed identity of your data factory as a member in the db_owner role. See Enable Azure AD
authentication to create an SSISDB in Azure SQL Database server.
SQL Managed Instance: You can create an SSISDB instance with a contained database user who represents the
managed identity of your data factory. See Enable Azure AD authentication to create an SSISDB in Azure SQL
Managed Instance.
Service tier
SQL Database: When you create an Azure-SSIS IR with your Azure SQL Database server, you can select the
service tier for SSISDB. There are multiple service tiers.
SQL Managed Instance: When you create an Azure-SSIS IR with your managed instance, you can't select the
service tier for SSISDB. All databases in your managed instance share the same resource allocated to that
instance.
Virtual network
SQL Database: Your Azure-SSIS IR can join an Azure Resource Manager virtual network if you use an Azure SQL
Database server with IP firewall rules/virtual network service endpoints.
SQL Managed Instance: Your Azure-SSIS IR can join an Azure Resource Manager virtual network if you use a
managed instance with private endpoint. The virtual network is required when you don't enable a public endpoint
for your managed instance.
The Integration runtime setup pane has three pages where you successively configure general, deployment,
and advanced settings.
General settings page
On the General settings page of Integration runtime setup pane, complete the following steps.
1. For Name , enter the name of your integration runtime.
2. For Description , enter the description of your integration runtime.
3. For Location , select the location of your integration runtime. Only supported locations are displayed. We
recommend that you select the same location of your database server to host SSISDB.
4. For Node Size , select the size of the node in your integration runtime cluster. Only supported node sizes
are displayed. Select a large node size (scale up) if you want to run many compute-intensive or memory-
intensive packages.
NOTE
If you require compute isolation, please select the Standard_E64i_v3 node size. This node size represents
isolated virtual machines that consume their entire physical host and provide the necessary level of isolation
required by certain workloads, such as the US Department of Defense's Impact Level 5 (IL5) workloads.
5. For Node Number , select the number of nodes in your integration runtime cluster. Only supported node
numbers are displayed. Select a large cluster with many nodes (scale out) if you want to run many
packages in parallel.
6. For Edition/License , select the SQL Server edition for your integration runtime: Standard or Enterprise.
Select Enterprise if you want to use advanced features on your integration runtime.
7. For Save Money , select the Azure Hybrid Benefit option for your integration runtime: Yes or No . Select
Yes if you want to bring your own SQL Server license with Software Assurance to benefit from cost
savings with hybrid use.
8. Select Continue .
Deployment settings page
On the Deployment settings page of Integration runtime setup pane, you have the options to create
SSISDB and or Azure-SSIS IR package stores.
Creating SSISDB
On the Deployment settings page of Integration runtime setup pane, if you want to deploy your packages
into SSISDB (Project Deployment Model), select the Create SSIS catalog (SSISDB) hosted by Azure SQL
Database server/Managed Instance to store your projects/packages/environments/execution logs
check box. Alternatively, if you want to deploy your packages into file system, Azure Files, or SQL Server
database (MSDB) hosted by Azure SQL Managed Instance (Package Deployment Model), there's no need to create
SSISDB or select the check box.
Regardless of your deployment model, if you want to use SQL Server Agent hosted by Azure SQL Managed
Instance to orchestrate/schedule your package executions, it's enabled by SSISDB, so select the check box
anyway. For more information, see Schedule SSIS package executions via Azure SQL Managed Instance Agent.
If you select the check box, complete the following steps to bring your own database server to host SSISDB that
we'll create and manage on your behalf.
1. For Subscription , select the Azure subscription that has your database server to host SSISDB.
2. For Location , select the location of your database server to host SSISDB. We recommend that you select
the same location of your integration runtime.
3. For Catalog Database Server Endpoint, select the endpoint of your database server to host SSISDB.
Based on the selected database server, the SSISDB instance can be created on your behalf as a single
database, as part of an elastic pool, or in a managed instance. It can be accessible in a public network or
by joining a virtual network. For guidance in choosing the type of database server to host SSISDB, see
Compare SQL Database and SQL Managed Instance.
If you select an Azure SQL Database server with IP firewall rules/virtual network service endpoints or a
managed instance with private endpoint to host SSISDB, or if you require access to on-premises data
without configuring a self-hosted IR, you need to join your Azure-SSIS IR to a virtual network. For more
information, see Join an Azure-SSIS IR to a virtual network.
4. Select either the Use AAD authentication with the system managed identity for Data Factory or
Use AAD authentication with a user-assigned managed identity for Data Factory check box to
choose Azure AD authentication method for Azure-SSIS IR to access your database server that hosts
SSISDB. Don't select any of the check boxes to choose SQL authentication method instead.
If you select any of the check boxes, you'll need to add the specified system/user-assigned managed
identity for your data factory into an Azure AD group with access permissions to your database server. If
you select the Use AAD authentication with a user-assigned managed identity for Data Factory
check box, you can then select any existing credentials created using your specified user-assigned
managed identities or create new ones. For more information, see Enable Azure AD authentication for an
Azure-SSIS IR.
5. For Admin Username , enter the SQL authentication username for your database server that hosts
SSISDB.
6. For Admin Password , enter the SQL authentication password for your database server that hosts
SSISDB.
7. Select the Use dual standby Azure-SSIS Integration Runtime pair with SSISDB failover check
box to configure a dual standby Azure SSIS IR pair that works in sync with Azure SQL Database/Managed
Instance failover group for business continuity and disaster recovery (BCDR).
If you select the check box, enter a name to identify your pair of primary and secondary Azure-SSIS IRs in
the Dual standby pair name text box. You need to enter the same pair name when creating your
primary and secondary Azure-SSIS IRs.
For more information, see Configure your Azure-SSIS IR for BCDR.
8. For Catalog Database Service Tier, select the service tier for your database server to host SSISDB.
Select the Basic, Standard, or Premium tier, or select an elastic pool name.
Select Test connection when applicable, and if it's successful, select Continue .
NOTE
If you use Azure SQL Database server to host SSISDB, your data will be stored in geo-redundant storage for backups by
default. If you don't want your data to be replicated in other regions, please follow the instructions to Configure backup
storage redundancy by using PowerShell.
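If you prefer to script these catalog settings, here's a minimal PowerShell sketch of the equivalent configuration, assuming $ResourceGroupName, $DataFactoryName, and $AzureSSISName are already defined as in the scripts later in this article (the endpoint and tier values are placeholders):
# Hedged sketch: configure SSISDB hosting for your Azure-SSIS IR via PowerShell
$SSISDBServerEndpoint = "<your server name>.database.windows.net"
$SSISDBPricingTier = "Basic" # Basic/Standard/Premium tier or elastic pool name; leave it empty for SQL Managed Instance
$serverCreds = Get-Credential # SQL authentication admin username/password; omit -CatalogAdminCredential if you use Azure AD authentication
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                      -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName `
                                      -CatalogServerEndpoint $SSISDBServerEndpoint `
                                      -CatalogPricingTier $SSISDBPricingTier `
                                      -CatalogAdminCredential $serverCreds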
Creating Azure-SSIS IR package stores
On the Deployment settings page of the Integration runtime setup pane, if you want to manage your
packages that are deployed into MSDB, file system, or Azure Files (Package Deployment Model) with Azure-SSIS
IR package stores, select the Create package stores to manage your packages that are deployed into
file system/Azure Files/SQL Server database (MSDB) hosted by Azure SQL Managed Instance check
box.
The Azure-SSIS IR package store allows you to import/export/delete/run packages and monitor/stop running
packages via SSMS, similarly to the legacy SSIS package store. For more information, see Manage SSIS packages
with Azure-SSIS IR package stores.
If you select this check box, you can add multiple package stores to your Azure-SSIS IR by selecting New .
Conversely, one package store can be shared by multiple Azure-SSIS IRs.
On the Add package store pane, complete the following steps.
1. For Package store name , enter the name of your package store.
2. For Package store linked service , select your existing linked service that stores the access information
for file system/Azure Files/Azure SQL Managed Instance where your packages are deployed or create a
new one by selecting New . On the New linked service pane, complete the following steps.
NOTE
You can use either Azure File Storage or File System linked services to access Azure Files. If you use the Azure
File Storage linked service, the Azure-SSIS IR package store supports only the Basic authentication method (not
Account key or SAS URI ) for now.
a. For Name , enter the name of your linked service.
b. For Description , enter the description of your linked service.
c. For Type , select Azure File Storage , Azure SQL Managed Instance , or File System .
d. You can ignore Connect via integration runtime , since we always use your Azure-SSIS IR to
fetch the access information for package stores.
e. If you select Azure File Storage , for Authentication method , select Basic , and then complete
the following steps.
a. For Account selection method , select From Azure subscription or Enter manually .
b. If you select From Azure subscription , select the relevant Azure subscription , Storage
account name , and File share .
c. If you select Enter manually , enter
\\<storage account name>.file.core.windows.net\<file share name> for Host ,
Azure\<storage account name> for Username , and <storage account key> for Password , or
select your Azure Key Vault where it's stored as a secret.
f. If you select Azure SQL Managed Instance , complete the following steps.
a. Select Connection string or your Azure Key Vault where it's stored as a secret.
b. If you select Connection string , complete the following steps.
a. For Account selection method , if you choose From Azure subscription , select
the relevant Azure subscription , Server name , Endpoint type , and Database
name . If you choose Enter manually , complete the following steps.
a. For Fully qualified domain name , enter
<server name>.<dns prefix>.database.windows.net or
<server name>.public.<dns prefix>.database.windows.net,3342 as the private
or public endpoint of your Azure SQL Managed Instance, respectively. If you
enter the private endpoint, Test connection isn't applicable, since ADF UI
can't reach it.
b. For Database name , enter msdb .
b. For Authentication type , select SQL Authentication , Managed Identity ,
Service Principal , or User-Assigned Managed Identity .
If you select SQL Authentication , enter the relevant Username and
Password or select your Azure Key Vault where it's stored as a secret.
If you select Managed Identity , grant the system managed identity for your
ADF access to your Azure SQL Managed Instance.
If you select Service Principal , enter the relevant Service principal ID and
Service principal key or select your Azure Key Vault where it's stored as a
secret.
If you select User-Assigned Managed Identity , grant the specified user-
assigned managed identity for your ADF access to your Azure SQL Managed
Instance. You can then select any existing credentials created using your
specified user-assigned managed identities or create new ones.
g. If you select File system , enter the UNC path of the folder where your packages are deployed for
Host , as well as the relevant Username and Password , or select your Azure Key Vault where it's
stored as a secret.
h. Select Test connection when applicable and if it's successful, select Create .
3. Your added package stores will appear on the Deployment settings page. To remove them, select their
check boxes, and then select Delete .
Select Test connection when applicable and if it's successful, select Continue .
Advanced settings page
On the Advanced settings page of Integration runtime setup pane, complete the following steps.
1. For Maximum Parallel Executions Per Node , select the maximum number of packages to run
concurrently per node in your integration runtime cluster. Only supported package numbers are
displayed. Select a low number if you want to use more than one core to run a single large package that's
compute or memory intensive. Select a high number if you want to run one or more small packages in a
single core.
2. Select the Customize your Azure-SSIS Integration Runtime with additional system
configurations/component installations check box to choose whether you want to add
standard/express custom setups on your Azure-SSIS IR. For more information, see Custom setup for an
Azure-SSIS IR.
If you select the check box, complete the following steps.
a. For Custom setup container SAS URI , enter the SAS URI of your container where you store
scripts and associated files for standard custom setups.
b. For Express custom setup , select New to open the Add express custom setup panel and then
select any type under the Express custom setup type dropdown menu, for example, Run cmdkey
command , Add environment variable , Install licensed component , and so on.
If you select the Install licensed component type, you can then select any integrated
component from our ISV partners under the Component name dropdown menu and, if
required, enter the product license key or upload the product license file that you purchased from
them into the License key /License file box.
Your added express custom setups will appear on the Advanced settings page. To remove them,
you can select their check boxes and then select Delete .
3. Select the Select a VNet for your Azure-SSIS Integration Runtime to join, allow ADF to create
certain network resources, and optionally bring your own static public IP addresses check box
to choose whether you want to join your integration runtime to a virtual network.
Select it if you use an Azure SQL Database server with IP firewall rules/virtual network service endpoints
or a managed instance with private endpoint to host SSISDB, or if you require access to on-premises data
(that is, you have on-premises data sources or destinations in your SSIS packages) without configuring a
self-hosted IR. For more information, see Join Azure-SSIS IR to a virtual network.
If you select the check box, complete the following steps.
a. For Subscription , select the Azure subscription that has your virtual network.
b. For Location , the same location as your integration runtime is selected automatically.
c. For Type , select the type of your virtual network: classic or Azure Resource Manager. We
recommend that you select an Azure Resource Manager virtual network, because classic virtual
networks will be deprecated soon.
d. For VNet Name , select the name of your virtual network. It should be the same one used for your
Azure SQL Database server with virtual network service endpoints or managed instance with
private endpoint to host SSISDB. Or it should be the same one connected to your on-premises
network. Otherwise, it can be any virtual network to bring your own static public IP addresses for
Azure-SSIS IR.
e. For Subnet Name , select the name of subnet for your virtual network. It should be the same one
used for your Azure SQL Database server with virtual network service endpoints to host SSISDB.
Or it should be a different subnet from the one used for your managed instance with private
endpoint to host SSISDB. Otherwise, it can be any subnet to bring your own static public IP
addresses for Azure-SSIS IR.
f. Select the Bring static public IP addresses for your Azure-SSIS Integration Runtime
check box to choose whether you want to bring your own static public IP addresses for Azure-SSIS
IR, so you can allow them on the firewall for your data sources.
If you select the check box, complete the following steps (a PowerShell sketch for creating such
addresses follows this list).
a. For First static public IP address , select the first static public IP address that meets the
requirements for your Azure-SSIS IR. If you don't have any, select the Create new link to create
static public IP addresses in the Azure portal and then select the refresh button here, so you can
select them.
b. For Second static public IP address , select the second static public IP address that meets
the requirements for your Azure-SSIS IR. If you don't have any, select the Create new link to
create static public IP addresses in the Azure portal and then select the refresh button here, so
you can select them.
4. Select the Set up Self-Hosted Integration Runtime as a proxy for your Azure-SSIS Integration
Runtime check box to choose whether you want to configure a self-hosted IR as proxy for your Azure-
SSIS IR. For more information, see Set up a self-hosted IR as proxy.
If you select the check box, complete the following steps.
a. For Self-Hosted Integration Runtime , select your existing self-hosted IR as a proxy for Azure-
SSIS IR.
b. For Staging Storage Linked Service , select your existing Azure Blob storage linked service or
create a new one for staging.
c. For Staging Path , specify a blob container in your selected Azure Blob storage account or leave it
empty to use a default one for staging.
5. Select VNet Validation > Continue .
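For the bring-your-own static public IP option in step 3 above, here's a hedged PowerShell sketch of creating the two standard static public IP addresses with DNS names in the same region as your virtual network (the resource names are placeholders, not from this article):
# Hedged sketch: create two standard static public IP addresses for your Azure-SSIS IR
$vnetResourceGroupName = "<your virtual network resource group>"
$location = "EastUS"
New-AzPublicIpAddress -Name "ssisir-publicip1" -ResourceGroupName $vnetResourceGroupName -Location $location `
                      -Sku Standard -AllocationMethod Static -DomainNameLabel "ssisir-publicip1"
New-AzPublicIpAddress -Name "ssisir-publicip2" -ResourceGroupName $vnetResourceGroupName -Location $location `
                      -Sku Standard -AllocationMethod Static -DomainNameLabel "ssisir-publicip2"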
On the Summary section, review all provisioning settings, bookmark the recommended documentation links,
and select Finish to start the creation of your integration runtime.
NOTE
Excluding any custom setup time, this process should finish within 5 minutes. But it might take 20-30 minutes for the
Azure-SSIS IR to join a virtual network.
If you use SSISDB, the Data Factory service will connect to your database server to prepare SSISDB. It also configures
permissions and settings for your virtual network, if specified, and joins your Azure-SSIS IR to the virtual network.
When you provision an Azure-SSIS IR, Access Redistributable and Azure Feature Pack for SSIS are also installed. These
components provide connectivity to Excel files, Access files, and various Azure data sources, in addition to the data
sources that built-in components already support. For more information about built-in/preinstalled components, see
Built-in/preinstalled components on Azure-SSIS IR. For more information about additional components that you can
install, see Custom setups for Azure-SSIS IR.
Connections pane
On the Connections pane of Manage hub, switch to the Integration runtimes page and select Refresh .
You can edit/reconfigure your Azure-SSIS IR by selecting its name. You can also select the relevant buttons to
monitor/start/stop/delete your Azure-SSIS IR, auto-generate an ADF pipeline with Execute SSIS Package activity
to run on your Azure-SSIS IR, and view the JSON code/payload of your Azure-SSIS IR. You can edit or delete your
Azure-SSIS IR only when it's stopped.
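If you prefer PowerShell for these monitor/start/stop operations, here's a minimal sketch using Az.DataFactory cmdlets (the variable names are assumed from the scripts later in this article):
# Hedged sketch: check the status of, stop, and start your Azure-SSIS IR from PowerShell
Get-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName `
                                      -Name $AzureSSISName -Status
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName `
                                       -Name $AzureSSISName -Force
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName `
                                        -Name $AzureSSISName -Force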
Azure SSIS integration runtimes in the portal
1. In the Azure Data Factory UI, switch to the Manage tab and then switch to the Integration runtimes
tab on the Connections pane to view existing integration runtimes in your data factory.
2. Select New to create a new Azure-SSIS IR and open the Integration runtime setup pane.
3. In the Integration runtime setup pane, select the Lift-and-shift existing SSIS packages to
execute in Azure tile, and then select Continue .
4. For the remaining steps to set up an Azure-SSIS IR, see the Provision an Azure SSIS integration runtime
section.
### Azure-SSIS integration runtime info - This is a Data Factory compute resource for running SSIS packages.
$AzureSSISName = "[your Azure-SSIS IR name]"
$AzureSSISDescription = "[your Azure-SSIS IR description]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, whereas Enterprise lets you use advanced features on
your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, whereas BasePrice lets you bring
your own on-premises SQL Server license with Software Assurance to earn cost savings from Azure Hybrid
Benefit option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported. For other nodes, up to (2 x
number of cores) are currently supported.
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info: Standard/express custom setups
$SetupScriptContainerSasUri = "" # OPTIONAL to provide a SAS URI of blob container for standard custom setup
where your script and its associated files are stored
$ExpressCustomSetup = "[RunCmdkey|SetEnvironmentVariable|InstallAzurePowerShell|SentryOne.TaskFactory|oh22is.SQLPhonetics.NET|oh22is.HEDDA.IO|KingswaySoft.IntegrationToolkit|KingswaySoft.ProductivityPack|Theobald.XtractIS|AecorSoft.IntegrationService|CData.Standard|CData.Extended or leave it empty]" # OPTIONAL to configure an express custom setup without script
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use an Azure SQL Database
server with IP firewall rules/virtual network service endpoints or a managed instance with private endpoint
to host SSISDB, or if you require access to on-premises data without configuring a self-hosted IR. We
recommend an Azure Resource Manager virtual network, because classic virtual networks will be deprecated
soon.
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Use the same subnet as the one used for your
Azure SQL Database server with virtual network service endpoints, or a different subnet from the one used
for your managed instance with a private endpoint
# Public IP address info: OPTIONAL to provide two standard static public IP addresses with DNS name under
the same subscription and in the same region as your virtual network
$FirstPublicIP = "[your first public IP address resource ID or leave it empty]"
$SecondPublicIP = "[your second public IP address resource ID or leave it empty]"
### Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access
Sign in and select a subscription
Add the following script to sign in and select your Azure subscription.
Connect-AzAccount
Select-AzSubscription -SubscriptionName $SubscriptionName
# Validate only if you use SSISDB and you don't use virtual network or Azure AD authentication
if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
    if([string]::IsNullOrEmpty($VnetId) -and [string]::IsNullOrEmpty($SubnetName))
    {
        if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and ![string]::IsNullOrEmpty($SSISDBServerAdminPassword))
        {
            $SSISDBConnectionString = "Data Source=" + $SSISDBServerEndpoint + ";User ID=" + $SSISDBServerAdminUserName + ";Password=" + $SSISDBServerAdminPassword
            $sqlConnection = New-Object System.Data.SqlClient.SqlConnection $SSISDBConnectionString;
            Try
            {
                $sqlConnection.Open();
            }
            Catch [System.Data.SqlClient.SqlException]
            {
                Write-Warning "Cannot connect to your Azure SQL Database server, exception: $_";
                Write-Warning "Please make sure the server you specified has already been created. Do you want to proceed? [Y/N]"
                $yn = Read-Host
                if(!($yn -ieq "Y"))
                {
                    Return;
                }
            }
        }
    }
}
If you don't use an Azure SQL Database server with IP firewall rules/virtual network service endpoints or a
managed instance with private endpoint to host SSISDB, and don't require access to on-premises data, you can
omit the VNetId and Subnet parameters or pass empty values for them. You can also omit them if you configure
a self-hosted IR as a proxy for your Azure-SSIS IR to access data on-premises. Otherwise, you can't omit them and
must pass valid values from your virtual network configuration. For more information, see Join an Azure-SSIS IR
to a virtual network.
If you use managed instance to host SSISDB, you can omit the CatalogPricingTier parameter or pass an empty
value for it. Otherwise, you can't omit it and must pass a valid value from the list of supported pricing tiers for
Azure SQL Database. For more information, see SQL Database resource limits.
If you use Azure AD authentication with the specified system/user-assigned managed identity for your data
factory to connect to the database server, you can omit the CatalogAdminCredential parameter. But you must
add the specified system/user-assigned managed identity for your data factory into an Azure AD group with
access permissions to the database server. For more information, see Enable Azure AD authentication for an
Azure-SSIS IR. Otherwise, you can't omit it and must pass a valid object formed from your server admin
username and password for SQL authentication.
# Add the CatalogServerEndpoint, CatalogPricingTier, and CatalogAdminCredential parameters if you use SSISDB
if(![string]::IsNullOrEmpty($SSISDBServerEndpoint))
{
    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                          -DataFactoryName $DataFactoryName `
                                          -Name $AzureSSISName `
                                          -CatalogServerEndpoint $SSISDBServerEndpoint `
                                          -CatalogPricingTier $SSISDBPricingTier

    if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and ![string]::IsNullOrEmpty($SSISDBServerAdminPassword)) # Add the CatalogAdminCredential parameter if you don't use Azure AD authentication
    {
        $secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
        $serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName, $secpasswd)
        # Pass the SQL authentication credentials to the catalog (call reconstructed from the surrounding text)
        Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                              -DataFactoryName $DataFactoryName `
                                              -Name $AzureSSISName `
                                              -CatalogAdminCredential $serverCreds
    }
}

# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and ![string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                          -DataFactoryName $DataFactoryName `
                                          -Name $AzureSSISName `
                                          -DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
                                          -DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName
    if(![string]::IsNullOrEmpty($DataProxyStagingPath))
    {
        Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                              -DataFactoryName $DataFactoryName `
                                              -Name $AzureSSISName `
                                              -DataProxyStagingPath $DataProxyStagingPath
    }
}

# Add public IP address parameters if you bring your own static public IP addresses
if(![string]::IsNullOrEmpty($FirstPublicIP) -and ![string]::IsNullOrEmpty($SecondPublicIP))
{
    $publicIPs = @($FirstPublicIP, $SecondPublicIP)
    Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
                                          -DataFactoryName $DataFactoryName `
                                          -Name $AzureSSISName `
                                          -PublicIPs $publicIPs
}
Full script
Here's the full script that creates an Azure-SSIS integration runtime.
### Azure-SSIS integration runtime info - This is a Data Factory compute resource for running SSIS packages.
$AzureSSISName = "[your Azure-SSIS IR name]"
$AzureSSISDescription = "[your Azure-SSIS IR description]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, whereas Enterprise lets you use advanced features on
your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, whereas BasePrice lets you bring
your own on-premises SQL Server license with Software Assurance to earn cost savings from the Azure Hybrid
Benefit option
# For a Standard_D1_v2 node, up to four parallel executions per node are supported. For other nodes, up to
(2 x number of cores) are currently supported.
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info: Standard/express custom setups
$SetupScriptContainerSasUri = "" # OPTIONAL to provide a SAS URI of blob container for standard custom setup
where your script and its associated files are stored
$ExpressCustomSetup = "[RunCmdkey|SetEnvironmentVariable|InstallAzurePowerShell|SentryOne.TaskFactory|oh22is.SQLPhonetics.NET|oh22is.HEDDA.IO|KingswaySoft.IntegrationToolkit|KingswaySoft.ProductivityPack|Theobald.XtractIS|AecorSoft.IntegrationService|CData.Standard|CData.Extended or leave it empty]" # OPTIONAL to configure an express custom setup without script
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use an Azure SQL Database
server with IP firewall rules/virtual network service endpoints or a managed instance with private endpoint
to host SSISDB, or if you require access to on-premises data without configuring a self-hosted IR. We
recommend an Azure Resource Manager virtual network, because classic virtual networks will be deprecated
soon.
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Use the same subnet as the one used for your
Azure SQL Database server with virtual network service endpoints, or a different subnet from the one used
for your managed instance with a private endpoint
# Public IP address info: OPTIONAL to provide two standard static public IP addresses with DNS name under
the same subscription and in the same region as your virtual network
$FirstPublicIP = "[your first public IP address resource ID or leave it empty]"
$SecondPublicIP = "[your second public IP address resource ID or leave it empty]"
### Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($SSISDBServerAdminUserName) -and ![string]::IsNullOrEmpty($SSISDBServerAdminPassword)) # Add the CatalogAdminCredential parameter if you don't use Azure AD authentication
{
$secpasswd = ConvertTo-SecureString $SSISDBServerAdminPassword -AsPlainText -Force
$serverCreds = New-Object System.Management.Automation.PSCredential($SSISDBServerAdminUserName,
$secpasswd)
# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and !
[string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
-DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName
if(![string]::IsNullOrEmpty($DataProxyStagingPath))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyStagingPath $DataProxyStagingPath
}
}
# Add public IP address parameters if you bring your own static public IP addresses
if(![string]::IsNullOrEmpty($FirstPublicIP) -and ![string]::IsNullOrEmpty($SecondPublicIP))
{
$publicIPs = @($FirstPublicIP, $SecondPublicIP)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-PublicIPs $publicIPs
}
{
"contentVersion": "1.0.0.0",
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"parameters": {},
"variables": {},
"resources": [{
"name": "<Specify a name for your data factory>",
"apiVersion": "2018-06-01",
"type": "Microsoft.DataFactory/factories",
"location": "East US",
"properties": {},
"resources": [{
"type": "integrationruntimes",
"name": "<Specify a name for your Azure-SSIS IR>",
"dependsOn": [ "<The name of the data factory you specified at the beginning>" ],
"apiVersion": "2018-06-01",
"properties": {
"type": "Managed",
"typeProperties": {
"computeProperties": {
"location": "East US",
"nodeSize": "Standard_D8_v3",
"numberOfNodes": 1,
"maxParallelExecutionsPerNode": 8
},
"ssisProperties": {
"catalogInfo": {
"catalogServerEndpoint": "<Azure SQL Database server
name>.database.windows.net",
"catalogAdminUserName": "<Azure SQL Database server admin username>",
"catalogAdminPassword": {
"type": "SecureString",
"value": "<Azure SQL Database server admin password>"
},
"catalogPricingTier": "Basic"
}
}
}
}
}]
}]
}
2. To deploy the Azure Resource Manager template, run the New-AzResourceGroupDeployment command as
shown in the following example. In the example, ADFTutorialResourceGroup is the name of your resource
group. ADFTutorialARM.json is the file that contains the JSON definition for your data factory and the
Azure-SSIS IR.
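A minimal sketch of that deployment command, assuming the template file is in your current directory:
New-AzResourceGroupDeployment -ResourceGroupName "ADFTutorialResourceGroup" -TemplateFile "ADFTutorialARM.json"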
This command creates your data factory and Azure-SSIS IR in it, but it doesn't start the IR.
3. To start your Azure-SSIS IR, run the Start-AzDataFactoryV2IntegrationRuntime command:
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "<Resource Group Name>" `
-DataFactoryName "<Data Factory Name>" `
-Name "<Azure SSIS IR Name>" `
-Force
NOTE
Excluding any custom setup time, this process should finish within 5 minutes. But it might take 20-30 minutes for the
Azure-SSIS IR to join a virtual network.
If you use SSISDB, the Data Factory service will connect to your database server to prepare SSISDB. It also configures
permissions and settings for your virtual network, if specified, and joins your Azure-SSIS IR to the virtual network.
When you provision an Azure-SSIS IR, Access Redistributable and Azure Feature Pack for SSIS are also installed. These
components provide connectivity to Excel files, Access files, and various Azure data sources, in addition to the data
sources that built-in components already support. For more information about built-in/preinstalled components, see
Built-in/preinstalled components on Azure-SSIS IR. For more information about additional components that you can
install, see Custom setups for Azure-SSIS IR.
If you don't use SSISDB, you can deploy your packages into file system, Azure Files, or MSDB hosted by your
Azure SQL Managed Instance and run them on your Azure-SSIS IR by using dtutil and AzureDTExec command-
line utilities.
For more information, see Deploy SSIS projects/packages.
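As a hedged illustration of such a deployment, the following dtutil invocation (run from PowerShell or a command prompt) copies a local package into MSDB hosted by your Azure SQL Managed Instance; the paths, folder, and credentials are placeholders, not values from this article:
# Hedged sketch: deploy a local package into MSDB on your Azure SQL Managed Instance with dtutil
dtutil /FILE C:\SSIS\MyPackage.dtsx /DestServer "<managed instance endpoint>" /DestUser "<username>" /DestPassword "<password>" /COPY 'SQL;MyPackage'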
In both cases, you can also run your deployed packages on Azure-SSIS IR by using the Execute SSIS Package
activity in Data Factory pipelines. For more information, see Invoke SSIS package execution as a first-class Data
Factory activity.
Next steps
See other Azure-SSIS IR topics in this documentation:
Azure-SSIS integration runtime. This article provides information about integration runtimes in general,
including Azure-SSIS IR.
Monitor an Azure-SSIS IR. This article shows you how to retrieve and understand information about your
Azure-SSIS IR.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or delete your Azure-SSIS IR. It also
shows you how to scale out your Azure-SSIS IR by adding more nodes.
Deploy, run, and monitor SSIS packages in Azure
Connect to SSISDB in Azure
Connect to on-premises data sources with Windows authentication
Schedule package executions in Azure
Execute SSIS packages in Azure from SSDT
3/5/2021 • 11 minutes to read
Prerequisites
To use this feature, download and install the latest SSDT with SSIS Projects extension for Visual Studio
(VS) from here. Alternatively, you can download and install the latest SSDT as a standalone installer from
here.
If you want to connect to your Azure-SSIS IR right away, see Connecting to Azure-SSIS IR for more details. You
can also connect later by right-clicking on your project node in the Solution Explorer window of SSDT to pop up
a menu. Next, select the Connect to SSIS in Azure Data Factory item in the SSIS in Azure Data Factory
submenu.
Azure-enabling existing SSIS projects
For existing SSIS projects, you can Azure-enable them by following these steps:
1. Right-click on your project node in the Solution Explorer window of SSDT to pop up a menu. Next, select
the Azure-Enabled Project item in the SSIS in Azure Data Factory submenu to launch the Azure-
Enabled Project Wizard .
2. On the Select Visual Studio Configuration page, select your existing VS configuration to apply
package execution settings in Azure. You can also create a new one if you haven't done so already, see
Creating a new VS configuration. We recommend that you have at least two different VS configurations
for package executions in the local and cloud environments, so you can Azure-enable your project against
the cloud configuration. In this way, if you've parameterized your project or packages, you can assign
different values to your project or package parameters at run-time based on the different execution
environments (either on your local machine or in Azure). For example, see Switching package execution
environments.
3. Azure-enabling your existing SSIS projects requires you to set their target server version to the latest
one supported by Azure-SSIS IR. Azure-SSIS IR is currently based on SQL Server 2017 . Ensure
that your packages don't contain additional components that are unsupported on SQL Server 2017.
Also ensure that all compatible additional components have been installed on your Azure-
SSIS IR via custom setups; see Customizing your Azure-SSIS IR. Select the Next button to continue.
4. See Connecting to Azure-SSIS IR to complete connecting your project to Azure-SSIS IR.
Alternatively, right-click on your project node in the Solution Explorer window of SSDT to pop up a menu.
Select the Azure-Enabled Settings item in the SSIS in Azure Data Factory submenu to pop up a
window containing your project property pages. Select the Suppressed Assessment Rule IDs
property in Azure-Enabled Settings section. Finally, select its ellipsis (...) button to pop up the
Assessment Rule Suppression Settings window, where you can select the assessment rules to
suppress.
Execute SSIS packages in Azure
Configuring Azure-enabled settings
Before executing your packages in Azure, you can configure your Azure-enabled settings for them. For example,
you can enable Windows authentication on your Azure-SSIS IR to access on-premises/cloud data stores by
following these steps:
1. Right-click on your project node in the Solution Explorer window of SSDT to pop up a menu. Next, select
the Azure-Enabled Settings item in the SSIS in Azure Data Factory submenu to pop up a window
containing your project property pages.
2. Select the Enable Windows Authentication property in Azure-Enabled Settings section and then
select True in its dropdown menu. Next, select the Windows Authentication Credentials property
and then select its ellipsis (...) button to pop up the Windows Authentication Credentials window.
3. Enter your Windows authentication credentials. For example, to access Azure Files, you can enter Azure ,
YourStorageAccountName , and YourStorageAccountKey for Domain , Username , and Password ,
respectively.
Alternatively, right-click on your package node in the Solution Explorer window of SSDT to pop up a
menu. Next, select the Execute Package in Azure item.
NOTE
Executing your packages in Azure requires you to have a running Azure-SSIS IR, so if your Azure-SSIS IR is stopped, a
dialog window will pop up to start it. Excluding any custom setup time, this process should be completed within 5
minutes, but it could take approximately 20-30 minutes for your Azure-SSIS IR to join a virtual network. After executing
your packages in Azure, you can stop your Azure-SSIS IR to manage its running cost by right-clicking on its node in the
Solution Explorer window of SSDT to pop up a menu and then selecting the Start\Stop\Manage item that takes you to
the ADF portal to do so.
2. Replace the file path of those child packages in the File Connection Manager of Execute Package Tasks
with their new UNC path.
If your local machine running SSDT can't access the new UNC path, you can enter it on the Properties
panel of the File Connection Manager.
Alternatively, you can use a variable for the file path to assign the right value at run time.
If your packages contain Execute Package Tasks that refer to child packages in the same project, no additional
step is necessary.
Switching package protection level
Executing SSIS packages in Azure doesn't support the EncryptSensitiveWithUserKey /EncryptAllWithUserKey
protection levels. Consequently, if your packages are configured to use those, we'll temporarily convert them
to use the EncryptSensitiveWithPassword /EncryptAllWithPassword protection levels, respectively. We'll
also randomly generate encryption passwords when we upload your packages into Azure Files for executions
on your Azure-SSIS IR.
NOTE
If your packages contain Execute Package Tasks that refer to child packages configured to use
EncryptSensitiveWithUserKey /EncryptAllWithUserKey protection levels, you need to manually reconfigure those
child packages to use EncryptSensitiveWithPassword /EncryptAllWithPassword protection levels, respectively,
before executing your packages.
If your packages are already configured to use EncryptSensitiveWithPassword /EncryptAllWithPassword
protection levels, we'll keep them unchanged. We'll still randomly generate encryption passwords when we
upload your packages into Azure Files for executions on your Azure-SSIS IR.
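If you need to reconfigure such child packages yourself, one hedged option is the dtutil command-line utility; the sketch below assumes protection level 2 maps to EncryptSensitiveWithPassword, and the path and password are placeholders:
# Hedged sketch: re-save a child package with the EncryptSensitiveWithPassword protection level via dtutil
dtutil /FILE C:\SSIS\ChildPackage.dtsx /ENCRYPT 'FILE;C:\SSIS\ChildPackage.dtsx;2;MyEncryptionPassword'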
Switching package execution environments
If you parameterize your project/packages in Project Deployment Model, you can create multiple VS
configurations to switch package execution environments. In this way, you can assign environment-specific
values to your project/package parameters at run-time. We recommend that you have at least two different VS
configurations for package executions in the local and cloud environments, so you can Azure-enable your
projects against the cloud configuration. Here's a step-by-step example of switching package execution
environments between your local machine and Azure:
1. Let's say your package contains a File System Task that sets the attributes of a file. When you run it on
your local machine, it sets the attributes of a file stored on your local file system. When you run it on your
Azure-SSIS IR, you want it to set the attributes of a file stored in Azure Files. First, create a package
parameter of string type and name it FilePath to hold the value of target file path.
2. Next, on the General page of File System Task Editor window, parameterize the SourceVariable
property in Source Connection section with the FilePath package parameter.
3. By default, you have an existing VS configuration for package executions in the local environment named
Development . Create a new VS configuration for package executions in the cloud environment named
Azure , see Creating a new VS configuration, if you haven't done so already.
4. When viewing the parameters of your package, select the Add Parameters to Configurations button
to open the Manage Parameter Values window for your package. Next, assign different values of
target file path to the FilePath package parameter under the Development and Azure configurations.
5. Azure-enable your project against the cloud configuration, see Azure-enabling existing SSIS projects, if
you haven't done so already. Next, configure Azure-enabled settings to enable Windows authentication
for your Azure-SSIS IR to access Azure Files, see Configuring Azure-enabled settings, if you haven't done
so already.
6. Execute your package in Azure. You can switch your package execution environment back to your local
machine by selecting the Development configuration.
Current limitations
The Azure-enabled SSDT supports only commercial/global cloud regions and doesn't support
governmental/national cloud regions for now.
Next steps
Once you're satisfied with running your packages in Azure from SSDT, you can deploy and run them as Execute
SSIS Package activities in ADF pipelines; see Running SSIS packages as Execute SSIS Package activities in ADF
pipelines.
Run SSIS packages by using Azure SQL Managed
Instance Agent
4/22/2021 • 5 minutes to read
This article describes how to run a SQL Server Integration Services (SSIS) package by using Azure SQL
Managed Instance Agent. This feature provides behaviors that are similar to when you schedule SSIS packages
by using SQL Server Agent in your on-premises environment.
With this feature, you can run SSIS packages that are stored in SSISDB in a SQL Managed Instance, a file system
like Azure Files, or an Azure-SSIS integration runtime package store.
Prerequisites
To use this feature, download and install the latest SQL Server Management Studio (SSMS). Version support
details are as follows:
To run packages in SSISDB or file system, install SSMS version 18.5 or above.
To run packages in package store, install SSMS version 18.6 or above.
You also need to provision an Azure-SSIS integration runtime in Azure Data Factory. It uses a SQL Managed
Instance as an endpoint server.
6. On the Execution options tab, you can choose whether to use Windows authentication or 32-bit
runtime to run the SSIS package.
7. On the Logging tab, you can choose the logging path and corresponding logging access credential to
store the log files. By default, the logging path is the same as the package folder path, and the logging
access credential is the same as the package access credential. If you store your logs in Azure Files, your
logging path will be \\<storage account name>.file.core.windows.net\<file share name>\<log folder name>
.
8. On the Set values tab, you can enter the property path and value to override the package properties.
For example, to override the value of your user variable, enter its path in the following format:
\Package.Variables[User::<variable name>].Value .
3. On the New Job Step page, select SQL Server Integration Services Package as the type.
4. On the Package tab:
a. For Package location , select Package Store .
b. For Package path :
The package path is <package store name>\<folder name>\<package name> .
c. If your package file is encrypted with a password, select Encryption password and enter the
password.
5. On the Configurations tab, enter the configuration file path if you need a configuration file to run the
SSIS package. If you store your configuration in Azure Files, its configuration path will be
\\<storage account name>.file.core.windows.net\<file share name>\<configuration name>.dtsConfig .
6. On the Execution options tab, you can choose whether to use Windows authentication or 32-bit
runtime to run the SSIS package.
7. On the Logging tab, you can choose the logging path and corresponding logging access credential to
store the log files. By default, the logging path is the same as the package folder path, and the logging
access credential is the same as the package access credential. If you store your logs in Azure Files, your
logging path will be \\<storage account name>.file.core.windows.net\<file share name>\<log folder name>
.
8. On the Set values tab, you can enter the property path and value to override the package properties.
For example, to override the value of your user variable, enter its path in the following format:
\Package.Variables[User::<variable name>].Value .
select * from '{table for job execution}' where parameter_value = 'SQL_Agent_Job_{jobId}' order by
execution_id desc
Next steps
You can also schedule SSIS packages by using Azure Data Factory. For step-by-step instructions, see Azure Data
Factory event trigger.
Run SQL Server Integration Services packages with
the Azure-enabled dtexec utility
3/5/2021 • 6 minutes to read
Prerequisites
To use AzureDTExec, download and install the latest version of SSMS, which is version 18.3 or later. Download it
from this website.
This action opens an AzureDTExecConfig window that needs to be opened with administrative privileges for it
to write into the AzureDTExec.settings file. If you haven't run SSMS as an administrator, a User Account Control
(UAC) window opens. Enter your admin password to elevate your privileges.
Invoking AzureDTExec offers similar options as invoking dtexec. For more information, see dtexec Utility. Here
are the options that are currently supported:
/F[ile] : Loads a package that's stored in file system, file share, or Azure Files. As the value for this option, you
can specify the UNC path for your package file in file system, file share, or Azure Files with its .dtsx extension.
If the UNC path specified contains any space, put quotation marks around the whole path.
/Conf[igFile] : Specifies a configuration file to extract values from. Using this option, you can set a run-time
configuration for your package that differs from the one specified at design time. You can store different
settings in an XML configuration file and then load them before your package execution. For more
information, see SSIS package configurations. To specify the value for this option, use the UNC path for your
configuration file in file system, file share, or Azure Files with its dtsConfig extension. If the UNC path
specified contains any space, put quotation marks around the whole path.
/Conn[ection] : Specifies connection strings for existing connection managers in your package. Using this
option, you can set run-time connection strings for existing connection managers in your package that differ
from the ones specified at design time. Specify the value for this option as follows:
connection_manager_name_or_id;connection_string [[;connection_manager_name_or_id;connection_string]...] .
/Set : Overrides the configuration of a parameter, variable, property, container, log provider, Foreach
enumerator, or connection in your package. This option can be specified multiple times. Specify the value for
this option as follows: property_path;value . For example, \package.variables[counter].Value;1 overrides the
value of counter variable as 1. You can use the Package Configuration wizard to find, copy, and paste the
value of property_path for items in your package whose value you want to override. For more information,
see Package Configuration wizard.
/De[crypt] : Sets the decryption password for your package that's configured with the
EncryptAllWithPassword /EncryptSensitiveWithPassword protection level.
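Putting these options together, here's a hedged sketch of an invocation; the UNC paths, password, and property value are placeholders, and the exact executable name and option casing should be checked against your AzureDTExec installation:
AzureDTExec.exe /F "\\<storage account name>.file.core.windows.net\<file share name>\MyPackage.dtsx" /Conf "\\<storage account name>.file.core.windows.net\<file share name>\MyPackage.dtsConfig" /Set "\package.variables[counter].Value;1" /De "<encryption password>"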
NOTE
Invoking AzureDTExec with new values for its options generates a new pipeline, except for the option /De[crypt] .
Next steps
After unique pipelines with the Execute SSIS Package activity in them are generated and run when you invoke
AzureDTExec, they can be monitored on the Data Factory portal. You can also assign Data Factory triggers to
them if you want to orchestrate/schedule them using Data Factory. For more information, see Run SSIS
packages as Data Factory activities.
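If you'd rather check those runs from PowerShell than from the portal, here's a minimal sketch using the Az.DataFactory module (the resource names and time window are placeholders):
# Hedged sketch: list recent pipeline runs, including those of AzureDTExec-generated pipelines
Get-AzDataFactoryV2PipelineRun -ResourceGroupName "<your resource group>" -DataFactoryName "<your data factory>" `
                               -LastUpdatedAfter (Get-Date).AddDays(-1) -LastUpdatedBefore (Get-Date)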
WARNING
The generated pipeline is expected to be used only by AzureDTExec. Its properties or parameters might change in the
future, so don't modify or reuse them for any other purposes. Modifications might break AzureDTExec. If this happens,
delete the pipeline. AzureDTExec generates a new pipeline the next time it's invoked.
Run an SSIS package with the Execute SSIS Package
activity in Azure Data Factory
7/2/2021 • 30 minutes to read
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Create an Azure-SSIS integration runtime (IR) if you don't have one already by following the step-by-step
instructions in the Tutorial: Provisioning Azure-SSIS IR.
2. In the Activities toolbox, expand General . Then drag an Execute SSIS Package activity to the pipeline
designer surface.
Select the Execute SSIS Package activity object to configure its General , Settings , SSIS Parameters ,
Connection Managers , and Property Overrides tabs.
General tab
On the General tab of Execute SSIS Package activity, complete the following steps.
1. For Name , enter the name of your Execute SSIS Package activity.
2. For Description , enter the description of your Execute SSIS Package activity.
3. For Timeout , enter the maximum amount of time your Execute SSIS Package activity can run. The default is
7 days; the format is D.HH:MM:SS.
4. For Retry , enter the maximum number of retry attempts for your Execute SSIS Package activity.
5. For Retry interval , enter the number of seconds between each retry attempt for your Execute SSIS
Package activity. The default is 30 seconds.
6. Select the Secure output check box to choose whether you want to exclude the output of your Execute
SSIS Package activity from logging.
7. Select the Secure input check box to choose whether you want to exclude the input of your Execute SSIS
Package activity from logging.
Settings tab
On the Settings tab of Execute SSIS Package activity, complete the following steps.
1. For Azure-SSIS IR , select the designated Azure-SSIS IR to run your Execute SSIS Package activity.
2. For Description , enter the description of your Execute SSIS Package activity.
3. Select the Windows authentication check box to choose whether you want to use Windows
authentication to access data stores, such as SQL servers/file shares on-premises or Azure Files.
If you select this check box, enter the values for your package execution credentials in the Domain ,
Username , and Password boxes. For example, to access Azure Files, the domain is Azure , the username
is <storage account name> , and the password is <storage account key> .
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the
AZURE KEY VAULT check box next to them. Select or edit your existing key vault linked service or create
a new one. Then select the secret name and version for your value. When you create or edit your key
vault linked service, you can select or edit your existing key vault or create a new one. Make sure to grant
Data Factory managed identity access to your key vault if you haven't done so already. You can also enter
your secret directly in the following format:
<key vault linked service name>/<secret name>/<secret version> .
4. Select the 32-Bit runtime check box to choose whether your package needs 32-bit runtime to run.
5. For Package location , select SSISDB , File System (Package) , File System (Project) , Embedded
package , or Package store .
Package location: SSISDB
SSISDB as your package location is automatically selected if your Azure-SSIS IR was provisioned with an SSIS
catalog (SSISDB) hosted by Azure SQL Database server/Managed Instance or you can select it yourself. If it's
selected, complete the following steps.
1. If your Azure-SSIS IR is running and the Manual entries check box is cleared, browse and select your
existing folders, projects, packages, and environments from SSISDB. Select Refresh to fetch your newly
added folders, projects, packages, or environments from SSISDB, so that they're available for browsing
and selection. To browse and select the environments for your package executions, you must configure
your projects beforehand to add those environments as references from the same folders under SSISDB.
For more information, see Create and map SSIS environments.
2. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
3. If your Azure-SSIS IR isn't running or the Manual entries check box is selected, enter your package and
environment paths from SSISDB directly in the following formats:
<folder name>/<project name>/<package name>.dtsx and <folder name>/<environment name> .
Package location: File System (Package)
File System (Package) as your package location is automatically selected if your Azure-SSIS IR was
provisioned without SSISDB or you can select it yourself. If it's selected, complete the following steps.
1. Specify your package to run by providing a Universal Naming Convention (UNC) path to your package
file (with .dtsx ) in the Package path box. You can browse and select your package by selecting Browse
file storage or enter its path manually. For example, if you store your package in Azure Files, its path is
\\<storage account name>.file.core.windows.net\<file share name>\<package name>.dtsx .
2. If you configure your package in a separate file, you also need to provide a UNC path to your
configuration file (with .dtsConfig ) in the Configuration path box. You can browse and select your
configuration by selecting Browse file storage or enter its path manually. For example, if you store your
configuration in Azure Files, its path is
\\<storage account name>.file.core.windows.net\<file share name>\<configuration name>.dtsConfig .
3. Specify the credentials to access your package and configuration files. If you previously entered the
values for your package execution credentials (for Windows authentication ), you can reuse them by
selecting the Same as package execution credentials check box. Otherwise, enter the values for your
package access credentials in the Domain , Username , and Password boxes. For example, if you store
your package and configuration in Azure Files, the domain is Azure , the username is
<storage account name> , and the password is <storage account key> .
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the
AZURE KEY VAULT check box next to them. Select or edit your existing key vault linked service or create
a new one. Then select the secret name and version for your value. When you create or edit your key
vault linked service, you can select or edit your existing key vault or create a new one. Make sure to grant
Data Factory managed identity access to your key vault if you haven't done so already. You can also enter
your secret directly in the following format:
<key vault linked service name>/<secret name>/<secret version> .
These credentials are also used to access your child packages in Execute Package Task that are referenced
by their own path and other configurations specified in your packages.
4. If you used the EncryptAllWithPassword or EncryptSensitiveWithPassword protection level when
you created your package via SQL Server Data Tools (SSDT), enter the value for your password in the
Encryption password box. Alternatively, you can use a secret stored in your Azure Key Vault as its value
(see above).
If you used the EncryptSensitiveWithUserKey protection level, reenter your sensitive values in
configuration files or on the SSIS Parameters , Connection Managers , or Property Overrides tabs
(see below).
If you used the EncryptAllWithUserKey protection level, it's unsupported. You need to reconfigure your
package to use another protection level via SSDT or the dtutil command-line utility.
5. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
6. If you want to log your package executions beyond using the standard log providers that can be specified
in your package, specify your log folder by providing its UNC path in the Logging path box. You can
browse and select your log folder by selecting Browse file storage or enter its path manually. For
example, if you store your logs in Azure Files, your logging path is
\\<storage account name>.file.core.windows.net\<file share name>\<log folder name> . A subfolder is
created in this path for each individual package run, named after the Execute SSIS Package activity run ID,
and in which log files are generated every five minutes.
7. Specify the credentials to access your log folder. If you previously entered the values for your package
access credentials (see above), you can reuse them by selecting the Same as package access
credentials check box. Otherwise, enter the values for your logging access credentials in the Domain ,
Username , and Password boxes. For example, if you store your logs in Azure Files, the domain is Azure
, the username is <storage account name> , and the password is <storage account key> . Alternatively, you
can use secrets stored in your Azure Key Vault as their values (see above).
For all UNC paths previously mentioned, the fully qualified file name must be fewer than 260 characters. The
directory name must be fewer than 248 characters.
Package location: File System (Project)
If you select File System (Project) as your package location, complete the following steps.
1. Specify your package to run by providing a UNC path to your project file (with .ispac ) in the Project
path box and a package file (with .dtsx ) from your project in the Package name box. You can browse
and select your project by selecting Browse file storage or enter its path manually. For example, if you
store your project in Azure Files, its path is
\\<storage account name>.file.core.windows.net\<file share name>\<project name>.ispac .
2. Specify the credentials to access your project and package files. If you previously entered the values for
your package execution credentials (for Windows authentication ), you can reuse them by selecting the
Same as package execution credentials check box. Otherwise, enter the values for your package
access credentials in the Domain , Username , and Password boxes. For example, if you store your
project and package in Azure Files, the domain is Azure , the username is <storage account name> , and
the password is <storage account key> .
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the
AZURE KEY VAULT check box next to them. Select or edit your existing key vault linked service or create
a new one. Then select the secret name and version for your value. When you create or edit your key
vault linked service, you can select or edit your existing key vault or create a new one. Make sure to grant
Data Factory managed identity access to your key vault if you haven't done so already. You can also enter
your secret directly in the following format:
<key vault linked service name>/<secret name>/<secret version> .
These credentials are also used to access your child packages in Execute Package Task that are referenced
from the same project.
3. If you used the EncryptAllWithPassword or EncryptSensitiveWithPassword protection level when
you created your package via SSDT, enter the value for your password in the Encryption password box.
Alternatively, you can use a secret stored in your Azure Key Vault as its value (see above).
If you used the EncryptSensitiveWithUserKey protection level, reenter your sensitive values on the
SSIS Parameters , Connection Managers , or Property Overrides tabs (see below).
The EncryptAllWithUserKey protection level isn't supported. If you used it, you need to reconfigure your
package to use another protection level via SSDT or the dtutil command-line utility, as sketched below.
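One way to switch a package away from the EncryptAllWithUserKey protection level without opening it in SSDT is the dtutil command-line utility that ships with SQL Server. The following is a minimal sketch, run from a PowerShell prompt on a machine where dtutil.exe is installed; the file path and password are placeholders, and the /ENCRYPT level value 2 corresponds to EncryptSensitiveWithPassword (3 corresponds to EncryptAllWithPassword):

# Re-encrypt MyPackage.dtsx in place with a password-based protection level (placeholder path and password).
# The quotes keep PowerShell from treating the semicolons inside the /ENCRYPT argument as statement separators.
dtutil /FILE C:\SSIS\MyPackage.dtsx /ENCRYPT "FILE;C:\SSIS\MyPackage.dtsx;2;MyEncryptionPassword"

After re-encrypting, enter the new password in the Encryption password box as described in the step above.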
4. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
5. If you want to log your package executions beyond using the standard log providers that can be specified
in your package, specify your log folder by providing its UNC path in the Logging path box. You can
browse and select your log folder by selecting Browse file storage or enter its path manually. For
example, if you store your logs in Azure Files, your logging path is
\\<storage account name>.file.core.windows.net\<file share name>\<log folder name> . A subfolder is
created in this path for each individual package run, named after the Execute SSIS Package activity run ID,
and in which log files are generated every five minutes.
6. Specify the credentials to access your log folder. If you previously entered the values for your package
access credentials (see above), you can reuse them by selecting the Same as package access
credentials check box. Otherwise, enter the values for your logging access credentials in the Domain ,
Username , and Password boxes. For example, if you store your logs in Azure Files, the domain is Azure
, the username is <storage account name> , and the password is <storage account key> . Alternatively, you
can use secrets stored in your Azure Key Vault as their values (see above).
For all UNC paths previously mentioned, the fully qualified file name must be fewer than 260 characters. The
directory name must be fewer than 248 characters.
Package location: Embedded package
If you select Embedded package as your package location, complete the following steps.
1. Drag and drop your package file (with .dtsx ) or Upload it from a file folder into the box provided. Your
package will be automatically compressed and embedded in the activity payload. Once embedded, you
can Download your package later for editing. You can also Parameterize your embedded package by
assigning it to a pipeline parameter that can be used in multiple activities, hence optimizing the size of
your pipeline payload. Embedding project files (with .ispac ) is currently unsupported, so you can't use
SSIS parameters/connection managers with project-level scope in your embedded packages.
2. If your embedded package is not all encrypted and we detect the use of Execute Package Task (EPT) in it,
the Execute Package Task check box will be automatically selected and your child packages that are
referenced by their file system path will be automatically added, so you can also embed them.
If we can't detect the use of EPT, you need to manually select the Execute Package Task check box and
add your child packages that are referenced by their file system path one by one, so you can also embed
them. If your child packages are stored in SQL Server database (MSDB), you can't embed them, so you
need to ensure that your Azure-SSIS IR can access MSDB to fetch them using their SQL Server references.
Embedding project files (with .ispac ) is currently unsupported, so you can't use project-based
references for your child packages.
3. If you used the EncryptAllWithPassword or EncryptSensitiveWithPassword protection level when
you created your package via SSDT, enter the value for your password in the Encryption password box.
Alternatively, you can use a secret stored in your Azure Key Vault as its value. To do so, select the AZURE
KEY VAULT check box next to it. Select or edit your existing key vault linked service or create a new one.
Then select the secret name and version for your value. When you create or edit your key vault linked
service, you can select or edit your existing key vault or create a new one. Make sure to grant Data
Factory managed identity access to your key vault if you haven't done so already. You can also enter your
secret directly in the following format: <key vault linked service name>/<secret name>/<secret version> .
If you used the EncryptSensitiveWithUserKey protection level, reenter your sensitive values in
configuration files or on the SSIS Parameters , Connection Managers , or Property Overrides tabs
(see below).
The EncryptAllWithUserKey protection level isn't supported. If you used it, you need to reconfigure your
package to use another protection level via SSDT or the dtutil command-line utility.
4. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
5. If you want to log your package executions beyond using the standard log providers that can be specified
in your package, specify your log folder by providing its UNC path in the Logging path box. You can
browse and select your log folder by selecting Browse file storage or enter its path manually. For
example, if you store your logs in Azure Files, your logging path is
\\<storage account name>.file.core.windows.net\<file share name>\<log folder name> . A subfolder is
created in this path for each individual package run, named after the Execute SSIS Package activity run ID,
and in which log files are generated every five minutes.
6. Specify the credentials to access your log folder by entering their values in the Domain , Username , and
Password boxes. For example, if you store your logs in Azure Files, the domain is Azure , the username
is <storage account name> , and the password is <storage account key> . Alternatively, you can use secrets
stored in your Azure Key Vault as their values (see above).
For all UNC paths previously mentioned, the fully qualified file name must be fewer than 260 characters. The
directory name must be fewer than 248 characters.
Package location: Package store
If you select Package store as your package location, complete the following steps.
1. For Package store name , select an existing package store that's attached to your Azure-SSIS IR.
2. Specify your package to run by providing its path (without .dtsx ) from the selected package store in the
Package path box. If the selected package store is on top of file system/Azure Files, you can browse and
select your package by selecting Browse file storage , otherwise you can enter its path in the format of
<folder name>\<package name> . You can also import new packages into the selected package store via SQL
Server Management Studio (SSMS) similar to the legacy SSIS package store. For more information, see
Manage SSIS packages with Azure-SSIS IR package stores.
3. If you configure your package in a separate file, you need to provide a UNC path to your configuration
file (with .dtsConfig ) in the Configuration path box. You can browse and select your configuration by
selecting Browse file storage or enter its path manually. For example, if you store your configuration in
Azure Files, its path is
\\<storage account name>.file.core.windows.net\<file share name>\<configuration name>.dtsConfig .
4. Select the Configuration access credentials check box to choose whether you want to specify the
credentials to access your configuration file separately. This is needed when the selected package store is
on top of SQL Server database (MSDB) hosted by your Azure SQL Managed Instance or doesn't also store
your configuration file.
If you previously entered the values for your package execution credentials (for Windows
authentication ), you can reuse them by selecting the Same as package execution credentials check
box. Otherwise, enter the values for your configuration access credentials in the Domain , Username ,
and Password boxes. For example, if you store your configuration in Azure Files, the domain is Azure ,
the username is <storage account name> , and the password is <storage account key> .
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the
AZURE KEY VAULT check box next to them. Select or edit your existing key vault linked service or create
a new one. Then select the secret name and version for your value. When you create or edit your key
vault linked service, you can select or edit your existing key vault or create a new one. Make sure to grant
Data Factory managed identity access to your key vault if you haven't done so already. You can also enter
your secret directly in the following format:
<key vault linked service name>/<secret name>/<secret version> .
5. If you used the EncryptAllWithPassword or EncryptSensitiveWithPassword protection level when
you created your package via SSDT, enter the value for your password in the Encryption password box.
Alternatively, you can use a secret stored in your Azure Key Vault as its value (see above).
If you used the EncryptSensitiveWithUserKey protection level, reenter your sensitive values in
configuration files or on the SSIS Parameters , Connection Managers , or Property Overrides tabs
(see below).
The EncryptAllWithUserKey protection level isn't supported. If you used it, you need to reconfigure your
package to use another protection level via SSDT or the dtutil command-line utility.
6. For Logging level , select a predefined scope of logging for your package execution. Select the
Customized check box if you want to enter your customized logging name instead.
7. If you want to log your package executions beyond using the standard log providers that can be specified
in your package, specify your log folder by providing its UNC path in the Logging path box. You can
browse and select your log folder by selecting Browse file storage or enter its path manually. For
example, if you store your logs in Azure Files, your logging path is
\\<storage account name>.file.core.windows.net\<file share name>\<log folder name> . A subfolder is
created in this path for each individual package run, named after the Execute SSIS Package activity run ID,
and in which log files are generated every five minutes.
8. Specify the credentials to access your log folder by entering their values in the Domain , Username , and
Password boxes. For example, if you store your logs in Azure Files, the domain is Azure , the username
is <storage account name> , and the password is <storage account key> . Alternatively, you can use secrets
stored in your Azure Key Vault as their values (see above).
For all UNC paths previously mentioned, the fully qualified file name must be fewer than 260 characters. The
directory name must be fewer than 248 characters.
SSIS Parameters tab
On the SSIS Parameters tab of Execute SSIS Package activity, complete the following steps.
1. If your Azure-SSIS IR is running, SSISDB is selected as your package location, and the Manual entries
check box on the Settings tab is cleared, the existing SSIS parameters in your selected project and
package from SSISDB are displayed for you to assign values to them. Otherwise, you can enter them one
by one to assign values to them manually. Make sure that they exist and are correctly entered for your
package execution to succeed.
2. If you used the EncryptSensitiveWithUserKey protection level when you created your package via
SSDT and File System (Package) , File System (Project) , Embedded package , or Package store is
selected as your package location, you also need to reenter your sensitive parameters to assign values to
them on this tab.
When you assign values to your parameters, you can add dynamic content by using expressions, functions, Data
Factory system variables, and Data Factory pipeline parameters or variables.
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the AZURE KEY
VAULT check box next to them. Select or edit your existing key vault linked service or create a new one. Then
select the secret name and version for your value. When you create or edit your key vault linked service, you can
select or edit your existing key vault or create a new one. Make sure to grant Data Factory managed identity
access to your key vault if you haven't done so already. You can also enter your secret directly in the following
format: <key vault linked service name>/<secret name>/<secret version> .
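Granting the Data Factory managed identity access to your key vault can also be scripted. The following is a minimal sketch with Azure PowerShell, assuming your key vault uses access policies (rather than Azure RBAC); the resource group, factory, and vault names are placeholders:

# Look up the managed identity (service principal) of the data factory.
$df = Get-AzDataFactoryV2 -ResourceGroupName "MyResourceGroup" -Name "MyDataFactory"

# Allow that identity to read secrets from the key vault.
Set-AzKeyVaultAccessPolicy -VaultName "MyKeyVault" `
    -ObjectId $df.Identity.PrincipalId `
    -PermissionsToSecrets Get,List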
Connection Managers tab
On the Connection Managers tab of Execute SSIS Package activity, complete the following steps.
1. If your Azure-SSIS IR is running, SSISDB is selected as your package location, and the Manual entries
check box on the Settings tab is cleared, the existing connection managers in your selected project and
package from SSISDB are displayed for you to assign values to their properties. Otherwise, you can enter
them one by one to assign values to their properties manually. Make sure that they exist and are correctly
entered for your package execution to succeed.
You can obtain the correct SCOPE , NAME , and PROPERTY names for any connection manager by
opening the package that contains it on SSDT. After the package is opened, select the relevant connection
manager to show the names and values for all of its properties on the Properties window of SSDT. With
this info, you can override the values of any connection manager properties at run-time.
For example, without modifying your original package on SSDT, you can convert its on-premises-to-on-
premises data flows running on SQL Server into on-premises-to-cloud data flows running on SSIS IR in
ADF by overriding the values of ConnectByProxy , ConnectionString , and
ConnectUsingManagedIdentity properties in existing connection managers at run-time.
These run-time overrides can enable two scenarios: using a self-hosted IR (SHIR) as a proxy for your SSIS
IR when accessing data on premises (see Configuring SHIR as a proxy for SSIS IR), and connecting to
Azure SQL Database/Managed Instance with the latest MSOLEDBSQL driver, which in turn enables Azure
Active Directory (AAD) authentication with the ADF managed identity (see Configuring AAD
authentication with ADF managed identity for OLEDB connections).
2. If you used the EncryptSensitiveWithUserKey protection level when you created your package via
SSDT and File System (Package) , File System (Project) , Embedded package , or Package store is
selected as your package location, you also need to reenter your sensitive connection manager properties
to assign values to them on this tab.
When you assign values to your connection manager properties, you can add dynamic content by using
expressions, functions, Data Factory system variables, and Data Factory pipeline parameters or variables.
Alternatively, you can use secrets stored in your Azure Key Vault as their values. To do so, select the AZURE KEY
VAULT check box next to them. Select or edit your existing key vault linked service or create a new one. Then
select the secret name and version for your value. When you create or edit your key vault linked service, you can
select or edit your existing key vault or create a new one. Make sure to grant Data Factory managed identity
access to your key vault if you haven't done so already. You can also enter your secret directly in the following
format: <key vault linked service name>/<secret name>/<secret version> .
Property Overrides tab
On the Proper ty Overrides tab of Execute SSIS Package activity, complete the following steps.
1. Enter the paths of existing properties in your selected package one by one to assign values to them
manually. Make sure that they exist and are correctly entered for your package execution to succeed. For
example, to override the value of your user variable, enter its path in the following format:
\Package.Variables[User::<variable name>].Value .
You can obtain the correct PROPERTY PATH for any package property by opening the package that
contains it on SSDT. After the package is opened, select its control flow and Configurations property on
the Properties window of SSDT. Next, select the ellipsis (...) button next to its Configurations property
to open the Package Configurations Organizer that's normally used to create package configurations
in Package Deployment Model.
On the Package Configurations Organizer , select the Enable package configurations check box
and the Add... button to open the Package Configuration Wizard .
On the Package Configuration Wizard , select the XML configuration file item in Configuration
type dropdown menu and the Specify configuration settings directly button, enter your
configuration file name, and select the Next > button.
Finally, select the package properties whose path you want and the Next > button. You can now see,
copy & paste the package property paths you want and save them in your configuration file. With this
info, you can override the values of any package properties at run-time.
2. If you used the EncryptSensitiveWithUserKey protection level when you created your package via
SSDT and File System (Package) , File System (Project) , Embedded package , or Package store is
selected as your package location, you also need to reenter your sensitive package properties to assign
values to them on this tab.
When you assign values to your package properties, you can add dynamic content by using expressions,
functions, Data Factory system variables, and Data Factory pipeline parameters or variables.
The values assigned in configuration files and on the SSIS Parameters tab can be overridden by using the
Connection Managers or Property Overrides tabs. The values assigned on the Connection Managers tab
can also be overridden by using the Property Overrides tab.
To validate the pipeline configuration, select Validate on the toolbar. To close the Pipeline Validation Report ,
select >> .
To publish the pipeline to Data Factory, select Publish All .
Run the pipeline
In this step, you trigger a pipeline run.
1. To trigger a pipeline run, select Trigger on the toolbar, and select Trigger now .
2. In the Pipeline Run window, select Finish .
Monitor the pipeline
1. Switch to the Monitor tab on the left. You see the pipeline run and its status along with other
information, such as the Run Start time. To refresh the view, select Refresh .
2. Select the View Activity Runs link in the Actions column. You see only one activity run because the
pipeline has only one activity. It's the Execute SSIS Package activity.
3. Run the following query against the SSISDB database in your SQL server to verify that the package
executed.
IMPORTANT
Replace object names, descriptions, and paths, property or parameter values, passwords, and other variable values
before you save the file.
{
"name": "RunSSISPackagePipeline",
"properties": {
"activities": [{
"name": "MySSISActivity",
"description": "My SSIS package/activity description",
"type": "ExecuteSSISPackage",
"typeProperties": {
"connectVia": {
"referenceName": "MyAzureSSISIR",
"type": "IntegrationRuntimeReference"
},
"executionCredential": {
"domain": "MyExecutionDomain",
"username": "MyExecutionUsername",
"password": {
"type": "SecureString",
"value": "MyExecutionPassword"
}
},
"runtime": "x64",
"loggingLevel": "Basic",
"packageLocation": {
"type": "SSISDB",
"packagePath": "MyFolder/MyProject/MyPackage.dtsx"
},
"environmentPath": "MyFolder/MyEnvironment",
"projectParameters": {
"project_param_1": {
"value": "123"
},
"project_param_2": {
"value": {
"value": "@pipeline().parameters.MyProjectParameter",
"type": "Expression"
}
}
},
"packageParameters": {
"package_param_1": {
"value": "345"
"value": "345"
},
"package_param_2": {
"value": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyPackageParameter"
}
}
},
"projectConnectionManagers": {
"MyAdonetCM": {
"username": {
"value": "MyConnectionUsername"
},
"password": {
"value": {
"type": "SecureString",
"value": "MyConnectionPassword"
}
}
}
},
"packageConnectionManagers": {
"MyOledbCM": {
"username": {
"value": {
"value": "@pipeline().parameters.MyConnectionUsername",
"type": "Expression"
}
},
"password": {
"value": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyConnectionPassword",
"secretVersion": "MyConnectionPasswordVersion"
}
}
}
},
"propertyOverrides": {
"\\Package.MaxConcurrentExecutables": {
"value": 8,
"isSensitive": false
}
}
},
"policy": {
"timeout": "0.01:00:00",
"retry": 0,
"retryIntervalInSeconds": 30
}
}]
}
}
To execute packages stored in file system/Azure Files, enter the values for your package and log location
properties as follows:
{
"packageLocation": {
"type": "File",
"packagePath":
"//MyStorageAccount.file.core.windows.net/MyFileShare/MyPackage.dtsx",
"typeProperties": {
"packagePassword": {
"type": "SecureString",
"value": "MyEncryptionPassword"
},
"accessCredential": {
"domain": "Azure",
"username": "MyStorageAccount",
"password": {
"type": "SecureString",
"value": "MyAccountKey"
}
}
}
},
"logLocation": {
"type": "File",
"logPath": "//MyStorageAccount.file.core.windows.net/MyFileShare/MyLogFolder",
"typeProperties": {
"accessCredential": {
"domain": "Azure",
"username": "MyStorageAccount",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyAccountKey"
}
}
}
}
}
To execute packages within projects stored in file system/Azure Files, enter the values for your package
location properties as follows:
{
"packageLocation": {
"type": "File",
"packagePath":
"//MyStorageAccount.file.core.windows.net/MyFileShare/MyProject.ispac:MyPackage.dtsx",
"typeProperties": {
"packagePassword": {
"type": "SecureString",
"value": "MyEncryptionPassword"
},
"accessCredential": {
"domain": "Azure",
"userName": "MyStorageAccount",
"password": {
"type": "SecureString",
"value": "MyAccountKey"
}
}
}
}
}
To execute embedded packages, enter the values for your package location properties as follows:
{
"packageLocation": {
"type": "InlinePackage",
"typeProperties": {
"packagePassword": {
"type": "SecureString",
"value": "MyEncryptionPassword"
},
"packageName": "MyPackage.dtsx",
"packageContent":"My compressed/uncompressed package content",
"packageLastModifiedDate": "YYYY-MM-DDTHH:MM:SSZ UTC-/+HH:MM"
}
}
}
To execute packages stored in package stores, enter the values for your package and configuration
location properties as follows:
{
"packageLocation": {
"type": "PackageStore",
"packagePath": "myPackageStore/MyFolder/MyPackage",
"typeProperties": {
"packagePassword": {
"type": "SecureString",
"value": "MyEncryptionPassword"
},
"accessCredential": {
"domain": "Azure",
"username": "MyStorageAccount",
"password": {
"type": "SecureString",
"value": "MyAccountKey"
}
},
"configurationPath":
"//MyStorageAccount.file.core.windows.net/MyFileShare/MyConfiguration.dtsConfig",
"configurationAccessCredential": {
"domain": "Azure",
"userName": "MyStorageAccount",
"password": {
"type": "AzureKeyVaultSecret",
"store": {
"referenceName": "myAKV",
"type": "LinkedServiceReference"
},
"secretName": "MyAccountKey"
}
}
}
}
}
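If you manage this pipeline with Azure PowerShell instead of the Data Factory UI, a minimal sketch for deploying the JSON definition saved above and starting a run might look like the following; the definition file path is a placeholder, and the Set-AzDataFactoryV2Pipeline cmdlet prints a pipeline summary similar to the output shown below:

# Deploy the pipeline definition saved as RunSSISPackagePipeline.json.
Set-AzDataFactoryV2Pipeline -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "RunSSISPackagePipeline" `
    -DefinitionFile ".\RunSSISPackagePipeline.json"

# Start a pipeline run and capture its run ID; the monitoring loop below polls this ID.
$RunId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -PipelineName "RunSSISPackagePipeline"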
PipelineName : Adfv2QuickStartPipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {CopyFromBlobToBlob}
Parameters : {[inputPath,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification], [outputPath,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}
while ($True) {
$Run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResGrp.ResourceGroupName `
-DataFactoryName $DataFactory.DataFactoryName `
-PipelineRunId $RunId
if ($Run) {
if ($run.Status -ne 'InProgress') {
Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
$Run
break
}
Write-Output "Pipeline is running...status: InProgress"
}
Start-Sleep -Seconds 10
}
You can also monitor the pipeline by using the Azure portal. For step-by-step instructions, see Monitor the
pipeline.
Schedule the pipeline with a trigger
In the previous step, you ran the pipeline on demand. You can also create a schedule trigger to run the pipeline
on a schedule, such as hourly or daily.
1. Create a JSON file named MyTrigger.json in the C:\ADF\RunSSISPackage folder with the following
content:
{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-12-07T00:00:00-08:00",
"endTime": "2017-12-08T00:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "RunSSISPackagePipeline"
},
"parameters": {}
}]
}
}
4. By default, the trigger is in stopped state. Start the trigger by running the Start-AzDataFactoryV2Trigger cmdlet.
5. Confirm that the trigger is started by running the Get-AzDataFactoryV2Trigger cmdlet.
6. Run the following command after the next hour. For example, if the current time is 3:25 PM UTC, run the
command at 4 PM UTC.
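A minimal Azure PowerShell sketch covering steps 4 through 6 might look like the following; it reuses the $ResGrp and $DataFactory objects from earlier and assumes the trigger definition was saved as MyTrigger.json in the folder mentioned above:

# Deploy the trigger definition, then start it (triggers are created in the stopped state).
Set-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger" -DefinitionFile "C:\ADF\RunSSISPackage\MyTrigger.json"

Start-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger" -Force

# Confirm that the trigger is started.
Get-AzDataFactoryV2Trigger -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "MyTrigger"

# After the next hour boundary, list the trigger runs in the trigger's active window.
Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -TriggerName "MyTrigger" `
    -TriggerRunStartedAfter "2017-12-07" -TriggerRunStartedBefore "2017-12-08"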
Run the following query against the SSISDB database in your SQL server to verify that the package
executed.
Next steps
See the following blog post:
Modernize and extend your ETL/ELT workflows with SSIS activities in Azure Data Factory pipelines
Run an SSIS package with the Stored Procedure
activity in Azure Data Factory
7/2/2021 • 10 minutes to read • Edit Online
Prerequisites
Azure SQL Database
The walkthrough in this article uses Azure SQL Database to host the SSIS catalog. You can also use Azure SQL
Managed Instance.
5. Select your Azure subscription in which you want to create the data factory.
6. For the Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of a resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources.
7. Select V2 for the version .
8. Select the location for the data factory. Only locations that are supported by Data Factory are shown in
the drop-down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight,
etc.) used by data factory can be in other locations.
9. Select Pin to dashboard .
10. Click Create .
11. On the dashboard, you see the following tile with status: Deploying data factory .
12. After the creation is complete, you see the Data Factory page as shown in the image.
13. Click Author & Monitor tile to launch the Azure Data Factory user interface (UI) application in a
separate tab.
Create a pipeline with stored procedure activity
In this step, you use the Data Factory UI to create a pipeline. You add a stored procedure activity to the pipeline
and configure it to run the SSIS package by using the sp_executesql stored procedure.
1. In the home page, click Orchestrate :
2. In the Activities toolbox, expand General , and drag-drop Stored Procedure activity to the pipeline
designer surface.
3. In the properties window for the stored procedure activity, switch to the SQL Account tab, and click +
New . You create a connection to the database in Azure SQL Database that hosts the SSIS Catalog (SSISDB
database).
4. In the New Linked Service window, do the following steps:
a. Select Azure SQL Database for Type .
b. Select the Default Azure Integration Runtime to connect to the Azure SQL Database that hosts the
SSISDB database.
c. Select the Azure SQL Database that hosts the SSISDB database for the Server name field.
d. Select SSISDB for Database name .
e. For User name , enter the name of user who has access to the database.
f. For Password , enter the password of the user.
g. Test the connection to the database by clicking Test connection button.
h. Save the linked service by clicking the Save button.
5. In the properties window, switch to the Stored Procedure tab from the SQL Account tab, and do the
following steps:
a. Select Edit .
b. For the Stored procedure name field, enter sp_executesql .
c. Click + New in the Stored procedure parameters section.
d. For name of the parameter, enter stmt .
e. For type of the parameter, enter String .
f. For value of the parameter, enter the following SQL query:
In the SQL query, specify the right values for the folder_name , project_name , and
package_name parameters.
4. Click View Activity Runs link in the Actions column. You see only one activity run as the pipeline has
only one activity (stored procedure activity).
5. You can run the following query against the SSISDB database in SQL Database to verify that the package
executed.
NOTE
You can also create a scheduled trigger for your pipeline so that the pipeline runs on a schedule (hourly, daily, etc.). For an
example, see Create a data factory - Data Factory UI.
Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
In this section, you use Azure PowerShell to create a Data Factory pipeline with a stored procedure activity that
invokes an SSIS package.
Install the latest Azure PowerShell modules by following instructions in How to install and configure Azure
PowerShell.
Create a data factory
You can either use the same data factory that has the Azure-SSIS IR or create a separate data factory. The
following procedure provides steps to create a data factory. You create a pipeline with a stored procedure
activity in this data factory. The stored procedure activity executes a stored procedure in the SSISDB database to
run your SSIS package.
1. Define a variable for the resource group name that you use in PowerShell commands later. Copy the
following command text to PowerShell, specify a name for the Azure resource group in double quotes,
and then run the command. For example: "adfrg" .
$resourceGroupName = "ADFTutorialResourceGroup";
If the resource group already exists, you may not want to overwrite it. Assign a different value to the
$ResourceGroupName variable and run the command again.
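The later steps reference a $ResGrp variable that holds the resource group object. A minimal sketch for creating it might look like the following; the location is a placeholder:

# Create the resource group and keep the returned object for later steps.
$ResGrp = New-AzResourceGroup -Name $resourceGroupName -Location "East US"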
IMPORTANT
Update the data factory name to be globally unique.
$DataFactoryName = "ADFTutorialFactory";
4. To create the data factory, run the following Set-AzDataFactoryV2 cmdlet, using the Location and
ResourceGroupName property from the $ResGrp variable:
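A minimal sketch of that call might look like the following; it reuses the $ResGrp object and the $DataFactoryName variable defined above:

# Create the data factory and keep the returned object for later steps.
$DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
    -Location $ResGrp.Location `
    -Name $DataFactoryName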
The specified Data Factory name 'ADFv2QuickStartDataFactory' is already in use. Data Factory names
must be globally unique.
To create Data Factory instances, the user account you use to log in to Azure must be a member of
contributor or owner roles, or an administrator of the Azure subscription.
For a list of Azure regions in which Data Factory is currently available, select the regions that interest you
on the following page, and then expand Analytics to locate Data Factory : Products available by region.
The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data
factory can be in other regions.
Create an Azure SQL Database linked service
Create a linked service to link your database that hosts the SSIS catalog to your data factory. Data Factory uses
information in this linked service to connect to SSISDB database, and executes a stored procedure to run an SSIS
package.
1. Create a JSON file named AzureSqlDatabaseLinkedService.json in C:\ADF\RunSSISPackage folder
with the following content:
IMPORTANT
Replace <servername>, <username>, and <password> with values of your Azure SQL Database before saving
the file.
{
"name": "AzureSqlDatabaseLinkedService",
"properties": {
"type": "AzureSqlDatabase",
"typeProperties": {
"connectionString": "Server=tcp:
<servername>.database.windows.net,1433;Database=SSISDB;User ID=<username>;Password=
<password>;Trusted_Connection=False;Encrypt=True;Connection Timeout=30"
}
}
}
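To deploy this linked service definition with Azure PowerShell, a minimal sketch might look like the following; it assumes the file was saved in the folder mentioned above:

# Deploy the linked service from the JSON definition file.
Set-AzDataFactoryV2LinkedService -ResourceGroupName $ResGrp.ResourceGroupName `
    -DataFactoryName $DataFactory.DataFactoryName `
    -Name "AzureSqlDatabaseLinkedService" `
    -DefinitionFile "C:\ADF\RunSSISPackage\AzureSqlDatabaseLinkedService.json"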
IMPORTANT
Replace <FOLDER NAME>, <PROJECT NAME>, <PACKAGE NAME> with names of folder, project, and package in
the SSIS catalog before saving the file.
{
"name": "RunSSISPackagePipeline",
"properties": {
"activities": [
{
"name": "My SProc Activity",
"description":"Runs an SSIS package",
"type": "SqlServerStoredProcedure",
"linkedServiceName": {
"referenceName": "AzureSqlDatabaseLinkedService",
"type": "LinkedServiceReference"
},
"typeProperties": {
"storedProcedureName": "sp_executesql",
"storedProcedureParameters": {
"stmt": {
"value": "DECLARE @return_value INT, @exe_id BIGINT, @err_msg
NVARCHAR(150) EXEC @return_value=[SSISDB].[catalog].[create_execution] @folder_name=N'<FOLDER
NAME>', @project_name=N'<PROJECT NAME>', @package_name=N'<PACKAGE NAME>', @use32bitruntime=0,
@runinscaleout=1, @useanyworker=1, @execution_id=@exe_id OUTPUT EXEC [SSISDB].[catalog].
[set_execution_parameter_value] @exe_id, @object_type=50, @parameter_name=N'SYNCHRONIZED',
@parameter_value=1 EXEC [SSISDB].[catalog].[start_execution] @execution_id=@exe_id, @retry_count=0
IF(SELECT [status] FROM [SSISDB].[catalog].[executions] WHERE execution_id=@exe_id)<>7 BEGIN SET
@err_msg=N'Your package execution did not succeed for execution ID: ' + CAST(@exe_id AS NVARCHAR(20))
RAISERROR(@err_msg,15,1) END"
}
}
}
}
]
}
}
PipelineName : Adfv2QuickStartPipeline
ResourceGroupName : <resourceGroupName>
DataFactoryName : <dataFactoryName>
Activities : {CopyFromBlobToBlob}
Parameters : {[inputPath,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification], [outputPath,
Microsoft.Azure.Management.DataFactory.Models.ParameterSpecification]}
while ($True) {
    $Run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $ResGrp.ResourceGroupName `
        -DataFactoryName $DataFactory.DataFactoryName `
        -PipelineRunId $RunId

    if ($Run) {
        if ($Run.Status -ne 'InProgress') {
            Write-Output ("Pipeline run finished. The status is: " + $Run.Status)
            $Run
            break
        }
        Write-Output "Pipeline is running...status: InProgress"
    }

    Start-Sleep -Seconds 10
}
Create a trigger
In the previous step, you invoked the pipeline on-demand. You can also create a schedule trigger to run the
pipeline on a schedule (hourly, daily, etc.).
1. Create a JSON file named MyTrigger.json in C:\ADF\RunSSISPackage folder with the following
content:
{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2017-12-07T00:00:00-08:00",
"endTime": "2017-12-08T00:00:00-08:00"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "RunSSISPackagePipeline"
},
"parameters": {}
}
]
}
}
4. By default, the trigger is in stopped state. Start the trigger by running the Start-AzDataFactoryV2Trigger cmdlet.
6. Run the following command after the next hour. For example, if the current time is 3:25 PM UTC, run the
command at 4 PM UTC.
You can run the following query against the SSISDB database in SQL Database to verify that the package
executed.
Next steps
You can also monitor the pipeline using the Azure portal. For step-by-step instructions, see Monitor the pipeline.
How to start and stop Azure-SSIS Integration
Runtime on a schedule
7/2/2021 • 14 minutes to read • Edit Online
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Prerequisites
If you have not provisioned your Azure-SSIS IR already, provision it by following instructions in the tutorial.
Create and schedule ADF pipelines that start and stop Azure-SSIS IR
This section shows you how to use Web activities in ADF pipelines to start or stop your Azure-SSIS IR on a
schedule, or to start and stop it on demand. We'll walk you through creating three pipelines:
1. The first pipeline contains a Web activity that starts your Azure-SSIS IR.
2. The second pipeline contains a Web activity that stops your Azure-SSIS IR.
3. The third pipeline contains an Execute SSIS Package activity chained between two Web activities that
start/stop your Azure-SSIS IR.
After you create and test those pipelines, you can create a schedule trigger and associate it with any pipeline.
The schedule trigger defines a schedule for running the associated pipeline.
For example, you can create two triggers, the first one is scheduled to run daily at 6 AM and associated with the
first pipeline, while the second one is scheduled to run daily at 6 PM and associated with the second pipeline. In
this way, you have a period between 6 AM to 6 PM every day when your IR is running, ready to execute your
daily ETL workloads.
If you create a third trigger that is scheduled to run daily at midnight and associated with the third pipeline, that
pipeline will run at midnight every day, starting your IR just before package execution, subsequently executing
your package, and immediately stopping your IR just after package execution, so your IR will not be running idly.
Create your ADF
1. Sign in to Azure portal.
2. Click New on the left menu, click Data + Analytics , and click Data Factory .
3. In the New data factory page, enter MyAzureSsisDataFactory for Name .
The name of your ADF must be globally unique. If you receive the following error, change the name of
your ADF (e.g. yournameMyAzureSsisDataFactory) and try creating it again. See Data Factory - Naming
Rules article to learn about naming rules for ADF artifacts.
Data factory name MyAzureSsisDataFactory is not available
4. Select your Azure Subscription under which you want to create your ADF.
5. For Resource Group , do one of the following steps:
Select Use existing , and select an existing resource group from the drop-down list.
Select Create new , and enter the name of your new resource group.
To learn about resource groups, see Using resource groups to manage your Azure resources article.
6. For Version , select V2 .
7. For Location , select one of the locations supported for ADF creation from the drop-down list.
8. Select Pin to dashboard .
9. Click Create .
10. On Azure dashboard, you will see the following tile with status: Deploying Data Factory .
11. After the creation is complete, you can see your ADF page as shown below.
12. Click Author & Monitor to launch ADF UI/app in a separate tab.
Create your pipelines
1. In the home page, select Orchestrate .
2. In Activities toolbox, expand General menu, and drag & drop a Web activity onto the pipeline designer
surface. In General tab of the activity properties window, change the activity name to startMyIR . Switch
to Settings tab, and do the following actions.
a. For URL , enter the following URL for the REST API that starts the Azure-SSIS IR, replacing
{subscriptionId} , {resourceGroupName} , {factoryName} , and {integrationRuntimeName} with the
actual values for your IR:
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}/start?api-version=2018-06-01
Alternatively, you can also copy & paste the resource ID of your IR from its monitoring page on the
ADF UI/app to replace the following part of the above URL:
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.DataFactory/factories/{factoryName}/integrationRuntimes/{integrationRuntimeName}
5. Assign your ADF's managed identity the Contributor role on the ADF itself, so that Web activities in its
pipelines can call the REST API to start/stop the Azure-SSIS IRs provisioned in it. On your ADF page in the
Azure portal, click Access control (IAM) , click + Add role assignment , and then on the Add role
assignment blade, do the following actions.
a. For Role , select Contributor .
b. For Assign access to , select Azure AD user, group, or service principal .
c. For Select , search for your ADF name and select it.
d. Click Save .
6. Validate your ADF and all pipeline settings by clicking Validate all/Validate on the factory/pipeline
toolbar. Close Factory/Pipeline Validation Output by clicking the >> button.
Test run your pipelines
1. Select Test Run on the toolbar for each pipeline and see Output window in the bottom pane.
2. To test the third pipeline, launch SQL Server Management Studio (SSMS). In Connect to Server window,
do the following actions.
a. For Server name , enter <your server name>.database.windows.net .
b. Select Options >> .
c. For Connect to database , select SSISDB .
d. Select Connect .
e. Expand Integration Services Catalogs -> SSISDB -> Your folder -> Projects -> Your SSIS project
-> Packages .
f. Right-click the specified SSIS package to run and select Reports -> Standard Reports -> All
Executions .
g. Verify that it ran.
4. In Trigger Run Parameters page, review any warning, and select Finish .
5. Publish the whole ADF settings by selecting Publish All in the factory toolbar.
2. To view the activity runs associated with a pipeline run, select the first link (View Activity Runs ) in
Actions column. For the third pipeline, you will see three activity runs, one for each chained activity in
the pipeline (Web activity to start your IR, Execute SSIS Package activity to run your package, and Web
activity to stop your IR). To view the pipeline runs again, select Pipelines link at the top.
3. To view the trigger runs, select Trigger Runs from the drop-down list under Pipeline Runs at the top.
5. You will see the deployment status of your Azure Automation account in Azure dashboard and
notifications.
6. You will see the homepage of your Azure Automation account after it is created successfully.
2. If you do not have Az.DataFactory , go to the PowerShell Gallery for Az.DataFactory module, select
Deploy to Azure Automation , select your Azure Automation account, and then select OK . Go back to
view Modules in SHARED RESOURCES section on the left menu and wait until you see STATUS of
Az.DataFactory module changed to Available .
3. If you do not have Az.Profile , go to the PowerShell Gallery for Az.Profile module, select Deploy to
Azure Automation , select your Azure Automation account, and then select OK . Go back to view
Modules in SHARED RESOURCES section on the left menu and wait until you see STATUS of the
Az.Profile module changed to Available .
Create your PowerShell runbook
The following section provides steps for creating a PowerShell runbook. The script associated with your runbook
either starts or stops the Azure-SSIS IR, based on the command you specify for the OPERATION parameter. This
section does not provide the complete details for creating a runbook. For more information, see the Create a
runbook article.
1. Switch to Runbooks tab and select + Add a runbook from the toolbar.
3. Copy & paste the following PowerShell script to your runbook script window. Save and then publish your
runbook by using Save and Publish buttons on the toolbar.
Param
(
    [Parameter (Mandatory= $true)]
    [String] $ResourceGroupName,

    # The data factory name, Azure-SSIS IR name, and operation (START/STOP) parameter names below are
    # assumed from the step descriptions that follow this script.
    [Parameter (Mandatory= $true)]
    [String] $DataFactoryName,

    [Parameter (Mandatory= $true)]
    [String] $AzureSSISName,

    [Parameter (Mandatory= $true)]
    [String] $Operation
)

$connectionName = "AzureRunAsConnection"
try
{
# Get the connection "AzureRunAsConnection "
$servicePrincipalConnection=Get-AutomationConnection -Name $connectionName
"Logging in to Azure..."
Connect-AzAccount `
-ServicePrincipal `
-TenantId $servicePrincipalConnection.TenantId `
-ApplicationId $servicePrincipalConnection.ApplicationId `
-CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
}
catch {
if (!$servicePrincipalConnection)
{
$ErrorMessage = "Connection $connectionName not found."
throw $ErrorMessage
} else{
Write-Error -Message $_.Exception
throw $_.Exception
}
}
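The start/stop logic that follows this sign-in block isn't shown above. A minimal sketch of it, using the Az.DataFactory cmdlets and the runbook parameters declared earlier (whose names are assumptions), might look like the following; the exact messages and error handling in your runbook may differ:

# Start or stop the Azure-SSIS IR based on the OPERATION parameter (START or STOP).
if ($Operation -eq "START") {
    "##### Starting #####"
    Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName `
        -Name $AzureSSISName `
        -Force
}
elseif ($Operation -eq "STOP") {
    "##### Stopping #####"
    Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
        -DataFactoryName $DataFactoryName `
        -Name $AzureSSISName `
        -Force
}
"##### Completed #####"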
6. In the job window, select Output tile. In the output window, wait for the message ##### Completed
##### after you see ##### Starting ##### . Starting Azure-SSIS IR takes approximately 20 minutes.
Close Job window and get back to Runbook window.
7. Repeat the previous two steps using STOP as the value for OPERATION . Start your runbook again by
selecting Start button on the toolbar. Enter your resource group, ADF, and Azure-SSIS IR names. For
OPERATION , enter STOP . In the output window, wait for the message ##### Completed ##### after
you see ##### Stopping ##### . Stopping Azure-SSIS IR does not take as long as starting it. Close Job
window and get back to Runbook window.
8. You can also trigger your runbook via a webhook that can be created by selecting the Webhooks menu
item or on a schedule that can be created by selecting the Schedules menu item as specified below.
3. Switch to Parameters and run settings tab. Specify your resource group, ADF, and Azure-SSIS IR
names. For OPERATION , enter START and select OK . Select OK again to see the schedule on Schedules
page of your runbook.
4. Repeat the previous two steps to create a schedule named Stop IR daily . Enter a time that is at least 30
minutes after the time you specified for Start IR daily schedule. For OPERATION , enter STOP and
select OK . Select OK again to see the schedule on Schedules page of your runbook.
5. In Runbook window, select Jobs on the left menu. You should see the jobs created by your schedules at
the specified times and their statuses. You can see the job details, such as its output, similar to what you
have seen after you tested your runbook.
6. After you are done testing, disable your schedules by editing them. Select Schedules on the left menu,
select Start IR daily/Stop IR daily , and select No for Enabled .
Next steps
See the following blog post:
Modernize and extend your ETL/ELT workflows with SSIS activities in ADF pipelines
See the following articles from SSIS documentation:
Deploy, run, and monitor an SSIS package on Azure
Connect to SSIS catalog on Azure
Schedule package execution on Azure
Connect to on-premises data sources with Windows authentication
Join an Azure-SSIS integration runtime to a virtual
network
7/16/2021 • 31 minutes to read • Edit Online
IMPORTANT
The classic virtual network is being deprecated, so use the Azure Resource Manager virtual network instead. If you already
use the classic virtual network, switch to the Azure Resource Manager virtual network as soon as possible.
The Configure an Azure-SQL Server Integration Services (SSIS) integration runtime (IR) to join a virtual
network tutorial shows the minimum steps via the Azure portal. This article expands on the tutorial and describes
all of the optional tasks:
If you are using virtual network (classic).
If you bring your own public IP addresses for the Azure-SSIS IR.
If you use your own Domain Name System (DNS) server.
If you use a network security group (NSG) on the subnet.
If you use Azure ExpressRoute or a user-defined route (UDR).
If you use customized Azure-SSIS IR.
If you use Azure PowerShell provisioning.
Set up permissions
The user who creates the Azure-SSIS IR must have the following permissions:
If you're joining your SSIS IR to an Azure Resource Manager virtual network, you have two options:
Use the built-in Network Contributor role. This role comes with the Microsoft.Network/*
permission, which has a much larger scope than necessary.
Create a custom role that includes only the necessary
Microsoft.Network/virtualNetworks/*/join/action permission. If you also want to bring your own
public IP addresses for Azure-SSIS IR while joining it to an Azure Resource Manager virtual
network, please also include Microsoft.Network/publicIPAddresses/*/join/action permission in the
role.
If you're joining your SSIS IR to a classic virtual network, we recommend that you use the built-in Classic
Virtual Machine Contributor role. Otherwise you have to define a custom role that includes the
permission to join the virtual network.
Select the subnet
As you choose a subnet:
Don't select the GatewaySubnet to deploy an Azure-SSIS IR. It's dedicated for virtual network gateways.
Ensure that the subnet you select has enough available address space for the Azure-SSIS IR to use. Leave
available IP addresses for at least two times the IR node number. Azure reserves some IP addresses
within each subnet. These addresses can't be used. The first and last IP addresses of the subnets are
reserved for protocol conformance, and three more addresses are used for Azure services. For more
information, see Are there any restrictions on using IP addresses within these subnets?
Don’t use a subnet that is exclusively occupied by other Azure services (for example, SQL Database SQL
Managed Instance, App Service, and so on).
Select the static public IP addresses
If you want to bring your own static public IP addresses for Azure-SSIS IR while joining it to a virtual network,
make sure they meet the following requirements:
Provide exactly two unused addresses that aren't already associated with other Azure resources.
The extra one will be used when we periodically upgrade your Azure-SSIS IR. Note that one public IP
address can't be shared among your active Azure-SSIS IRs.
They should both be static ones of standard type. Refer to SKUs of Public IP Address for more details.
They should both have a DNS name. If you have not provided a DNS name when creating them, you can
do so on Azure portal.
They and the virtual network should be under the same subscription and in the same region.
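If you create the static public IP addresses with Azure PowerShell, a minimal sketch might look like the following; the names, DNS labels, resource group, and region are placeholders, and both addresses must be in the same subscription and region as your virtual network:

# Create two unused Standard-SKU static public IP addresses with DNS names.
1..2 | ForEach-Object {
    New-AzPublicIpAddress -ResourceGroupName "MyResourceGroup" `
        -Name "ssisir-pip-$_" `
        -Location "EastUS" `
        -Sku Standard `
        -AllocationMethod Static `
        -DomainNameLabel "myssisirpip$_"
}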
Set up the DNS server
If you need to use your own DNS server in a virtual network joined by your Azure-SSIS IR to resolve your
private host name, make sure it can also resolve global Azure host names (for example, an Azure Storage blob
named <your storage account>.blob.core.windows.net ).
One recommended approach is below:
Configure the custom DNS to forward requests to Azure DNS. You can forward unresolved DNS records to
the IP address of the Azure recursive resolvers (168.63.129.16) on your own DNS server.
For more information, see Name resolution that uses your own DNS server.
NOTE
Please use a Fully Qualified Domain Name (FQDN) for your private host name (for example, use
<your_private_server>.contoso.com instead of <your_private_server> ). Alternatively, you can use a standard
custom setup on your Azure-SSIS IR to automatically append your own DNS suffix (for example contoso.com ) to any
unqualified single label domain name and turn it into an FQDN before using it in DNS queries, see standard custom setup
samples.
Set up an NSG
If you need to implement an NSG for the subnet used by your Azure-SSIS IR, allow inbound and outbound
traffic through the following ports:
Inbound requirement of Azure-SSIS IR
The inbound rules table lists, for each required rule, the Direction, Transport protocol, Source, Source port range,
Destination, Destination port range, and Comments.
In particular, at the NIC-level NSG, port 3389 is open by default, and we allow you to control port 3389 at the
subnet-level NSG; meanwhile, Azure-SSIS IR disallows port 3389 outbound by default through a Windows
Firewall rule on each IR node for protection. The outbound security rule for database traffic isn't applicable to an
SSISDB hosted by your SQL Managed Instance in the virtual network or to SQL Database configured with a
private endpoint.
NOTE
This approach incurs an additional maintenance cost. Regularly check the IP range and add new IP ranges into your UDR
to avoid breaking the Azure-SSIS IR. We recommend checking the IP range monthly, because when a new IP appears in
the service tag, it takes another month to go into effect.
To make the setup of UDR rules easier, you can run a PowerShell script to add UDR rules for Azure Batch
management services.
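A minimal sketch of such a script might look like the following; it assumes an existing route table already associated with your subnet, and the route table name, resource group, region, and service tag name are placeholders:

# Look up the current IP ranges of the Azure Batch node management service tag for your region.
$routeTable = Get-AzRouteTable -ResourceGroupName "MyResourceGroup" -Name "MyRouteTable"
$serviceTags = Get-AzNetworkServiceTag -Location "eastus"
$batchTag = $serviceTags.Values | Where-Object { $_.Name -eq "BatchNodeManagement.EastUS" }

# Add a UDR rule for each range that sends the traffic to the next hop type Internet.
$i = 0
foreach ($prefix in $batchTag.Properties.AddressPrefixes) {
    $i++
    Add-AzRouteConfig -RouteTable $routeTable -Name "BatchNodeManagement-$i" `
        -AddressPrefix $prefix -NextHopType Internet | Out-Null
}

# Persist the updated route table.
Set-AzRouteTable -RouteTable $routeTable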
For a firewall appliance to allow outbound traffic, you need to allow outbound traffic to the same ports as
required by the NSG outbound rules.
Port 443 with destination as Azure Cloud services.
If you use Azure Firewall, you can specify a network rule with the AzureCloud service tag. For other types of
firewall, you can either simply allow all destinations for port 443 or allow the following FQDNs based on
the type of your Azure environment:
The FQDNs to allow are listed by Azure environment and endpoint.
As for the FQDNs of Azure Storage, Azure Container Registry and Event Hub, you can also choose to
enable the following service endpoints for your virtual network so that network traffic to these endpoints
goes through Azure backbone network instead of being routed to your firewall appliance:
Microsoft.Storage
Microsoft.ContainerRegistry
Microsoft.EventHub
Port 80 with destination as CRL download sites.
You should allow the following FQDNs, which are used as CRL (Certificate Revocation List) download sites
for certificates used for Azure-SSIS IR management purposes:
crl.microsoft.com:80
mscrl.microsoft.com:80
crl3.digicert.com:80
crl4.digicert.com:80
ocsp.digicert.com:80
cacerts.digicert.com:80
If you are using certificates that have a different CRL, you should include those sites as well. You can read
this article to understand more about the Certificate Revocation List.
If you disallow this traffic, you might experience performance degradation when starting the Azure-SSIS IR
and lose the capability to check the certificate revocation list for certificate usage, which is not
recommended from a security point of view.
Port 1433, 11000-11999 with destination as Azure SQL Database (only required when the nodes of your
Azure-SSIS IR in the virtual network access an SSISDB hosted by your server).
If you use Azure Firewall, you can specify a network rule with the Azure SQL service tag; otherwise, you can
allow the specific Azure SQL URL as the destination in your firewall appliance.
Port 445 with destination as Azure Storage (only required when you execute SSIS package stored in
Azure Files).
If you use Azure Firewall, you can specify a network rule with the Storage service tag; otherwise, you can
allow the specific Azure file storage URL as the destination in your firewall appliance.
NOTE
For Azure SQL and Storage, if you configure virtual network service endpoints on your subnet, then traffic between
the Azure-SSIS IR and Azure SQL in the same region, or Azure Storage in the same or paired region, is routed directly to the
Microsoft Azure backbone network instead of to your firewall appliance.
If you don't need the capability to inspect the outbound traffic of your Azure-SSIS IR, you can simply apply a route
to force all traffic to the next hop type Internet :
In an Azure ExpressRoute scenario, you can apply a 0.0.0.0/0 route with the next hop type as Internet on the
subnet that hosts the Azure-SSIS IR.
In an NVA scenario, you can modify the existing 0.0.0.0/0 route applied on the subnet that hosts the Azure-
SSIS IR from the next hop type as Vir tual appliance to Internet .
NOTE
Specifying a route with the next hop type Internet doesn't mean that all traffic goes over the Internet. As long as the
destination address is for one of Azure's services, Azure routes the traffic directly to the service over Azure's backbone
network, rather than routing the traffic to the Internet.
NOTE
You can now bring your own static public IP addresses for Azure-SSIS IR. In this scenario, we will create only the Azure
load balancer and network security group under the same resource group as your static public IP addresses instead of the
virtual network.
Those resources will be created when your Azure-SSIS IR starts. They'll be deleted when your Azure-SSIS IR
stops. If you bring your own static public IP addresses for Azure-SSIS IR, your own static public IP addresses
won't be deleted when your Azure-SSIS IR stops. To avoid blocking your Azure-SSIS IR from stopping, don't
reuse these network resources in your other resources.
Make sure that you have no resource lock on the resource group/subscription to which the virtual network/your
static public IP addresses belong. If you configure a read-only/delete lock, starting and stopping your Azure-
SSIS IR will fail, or it will stop responding.
Make sure that you don't have an Azure Policy assignment that prevents the following resources from being
created under the resource group/subscription to which the virtual network/your static public IP addresses
belong:
Microsoft.Network/LoadBalancers
Microsoft.Network/NetworkSecurityGroups
Microsoft.Network/PublicIPAddresses
Make sure that the resource quota of your subscription is enough for the above three network resources.
Specifically, for each Azure-SSIS IR created in virtual network, you need to reserve two free quotas for each of
the above three network resources. The extra one quota will be used when we periodically upgrade your Azure-
SSIS IR.
FAQ
How can I protect the public IP address exposed on my Azure-SSIS IR for inbound connection? Is it
possible to remove the public IP address?
Right now, a public IP address will be automatically created when your Azure-SSIS IR joins a virtual
network. We do have an NIC-level NSG to allow only Azure Batch management services to inbound-
connect to your Azure-SSIS IR. You can also specify a subnet-level NSG for inbound protection.
If you don't want any public IP address to be exposed, consider configuring a self-hosted IR as proxy for
your Azure-SSIS IR instead of joining your Azure-SSIS IR to a virtual network, if this applies to your
scenario.
Can I add the public IP address of my Azure-SSIS IR to the firewall's allow list for my data sources?
You can now bring your own static public IP addresses for Azure-SSIS IR. In this case, you can add your IP
addresses to the firewall's allow list for your data sources. You can also consider other options below to
secure data access from your Azure-SSIS IR depending on your scenario:
If your data source is on premises, after connecting a virtual network to your on-premises network
and joining your Azure-SSIS IR to the virtual network subnet, you can then add the private IP address
range of that subnet to the firewall's allow list for your data source.
If your data source is an Azure service that supports virtual network service endpoints, you can
configure a virtual network service endpoint on your virtual network subnet and join your Azure-SSIS
IR to that subnet. You can then add a virtual network rule with that subnet to the firewall for your data
source.
If your data source is a non-Azure cloud service, you can use a UDR to route outbound traffic from
your Azure-SSIS IR to an NVA/Azure Firewall via a static public IP address. You can then add the static
public IP address of your NVA/Azure Firewall to the firewall's allow list for your data source.
If none of the above options meets your needs, consider configuring a self-hosted IR as proxy for your
Azure-SSIS IR. You can then add the static public IP address of the machine that hosts your self-hosted
IR to the firewall's allow list for your data source.
Why do I need to provide two static public IP addresses if I want to bring my own for Azure-SSIS IR?
The Azure-SSIS IR is automatically updated on a regular basis. New nodes are created during the upgrade, and the
old ones are deleted. However, to avoid downtime, the old nodes aren't deleted until the new
ones are ready. Thus, your first static public IP address, used by the old nodes, can't be released
immediately, and we need your second static public IP address to create the new nodes.
I've brought my own static public IP addresses for the Azure-SSIS IR, so why can't it access my data
sources?
Confirm that the two static public IP addresses are both added to the firewall's allow list for your data
sources. Each time your Azure-SSIS IR is upgraded, its static public IP address is switched between the
two brought by you. If you add only one of them to the allow list, data access for your Azure-SSIS IR
will be broken after its upgrade.
If your data source is an Azure service, please check whether you have configured it with virtual
network service endpoints. If that's the case, the traffic from Azure-SSIS IR to your data source will
switch to use the private IP addresses managed by Azure services and adding your own static public
IP addresses to the firewall's allow list for your data source will not take effect.
6. Select the copy button for RESOURCE ID to copy the resource ID for the classic network to the
clipboard. Save the ID from the clipboard in OneNote or a file.
7. On the left menu, select Subnets . Ensure that the number of available addresses is greater than the
number of nodes in your Azure-SSIS IR.
8. Join MicrosoftAzureBatch to the Classic Virtual Machine Contributor role for the virtual network.
a. On the left menu, select Access control (IAM) , and select the Role assignments tab.
9. Verify that the Azure Batch provider is registered in the Azure subscription that has the virtual network.
Or register the Azure Batch provider. If you already have an Azure Batch account in your subscription,
your subscription is registered for Azure Batch. (If you create the Azure-SSIS IR in the Data Factory portal,
the Azure Batch provider is automatically registered for you.)
a. In the Azure portal, on the left menu, select Subscriptions .
b. Select your subscription.
c. On the left, select Resource providers , and confirm that Microsoft.Batch is a registered
provider.
If you don't see Microsoft.Batch in the list, to register it, create an empty Azure Batch account in your
subscription (you can delete it later), or register the provider with PowerShell as sketched below.
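A minimal PowerShell sketch of that check and registration (assuming the Az.Resources module):
# Check the registration state of the Azure Batch resource provider and register it if needed
(Get-AzResourceProvider -ProviderNamespace Microsoft.Batch).RegistrationState
Register-AzResourceProvider -ProviderNamespace Microsoft.Batch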
Join the Azure-SSIS IR to a virtual network
After you've configured your Azure Resource Manager virtual network or classic virtual network, you can join
the Azure-SSIS IR to the virtual network:
1. Start Microsoft Edge or Google Chrome. Currently, only these web browsers support the Data Factory UI.
2. In the Azure portal, on the left menu, select Data factories . If you don't see Data factories on the
menu, select More services , and then in the INTELLIGENCE + ANALYTICS section, select Data
factories .
3. Select your data factory with the Azure-SSIS IR in the list. You see the home page for your data factory.
Select the Author & Monitor tile. You see the Data Factory UI on a separate tab.
4. In the Data Factory UI, switch to the Edit tab, select Connections , and switch to the Integration
Runtimes tab.
5. If your Azure-SSIS IR is running, in the Integration Runtimes list, in the Actions column, select the
Stop button for your Azure-SSIS IR. You can't edit your Azure-SSIS IR until you stop it.
6. In the Integration Runtimes list, in the Actions column, select the Edit button for your Azure-SSIS IR.
7. On the integration runtime setup panel, advance through the General Settings and SQL Settings
sections by selecting the Next button.
8. On the Advanced Settings section:
a. Select the Select a VNet for your Azure-SSIS Integration Runtime to join, allow ADF to
create certain network resources, and optionally bring your own static public IP
addresses check box.
b. For Subscription , select the Azure subscription that has your virtual network.
c. For Location , the same location as your integration runtime is selected.
d. For Type , select the type of your virtual network: classic or Azure Resource Manager. We
recommend that you select an Azure Resource Manager virtual network, because classic virtual
networks will be deprecated soon.
e. For VNet Name , select the name of your virtual network. It should be the same one used for SQL
Database with virtual network service endpoints or SQL Managed Instance with private endpoint
to host SSISDB. Or it should be the same one connected to your on-premises network. Otherwise,
it can be any virtual network to bring your own static public IP addresses for Azure-SSIS IR.
f. For Subnet Name , select the name of subnet for your virtual network. It should be the same one
used for SQL Database with virtual network service endpoints to host SSISDB. Or it should be a
different subnet from the one used for SQL Managed Instance with private endpoint to host
SSISDB. Otherwise, it can be any subnet to bring your own static public IP addresses for Azure-
SSIS IR.
g. Select the Bring static public IP addresses for your Azure-SSIS Integration Runtime
check box to choose whether you want to bring your own static public IP addresses for Azure-SSIS
IR, so you can allow them on the firewall for your data sources.
If you select the check box, complete the following steps.
a. For First static public IP address , select the first static public IP address that meets the
requirements for your Azure-SSIS IR. If you don't have any, click Create new link to create
static public IP addresses on Azure portal and then click the refresh button here, so you can
select them.
b. For Second static public IP address , select the second static public IP address that meets
the requirements for your Azure-SSIS IR. If you don't have any, click Create new link to
create static public IP addresses on Azure portal and then click the refresh button here, so
you can select them.
h. Select VNet Validation . If the validation is successful, select Continue .
9. On the Summary section, review all settings for your Azure-SSIS IR. Then select Update .
10. Start your Azure-SSIS IR by selecting the Start button in the Actions column for your Azure-SSIS IR. It
takes about 20 to 30 minutes to start the Azure-SSIS IR that joins a virtual network.
Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
# Make sure to run this script against the subscription to which the virtual network belongs.
if(![string]::IsNullOrEmpty($VnetId) -and ![string]::IsNullOrEmpty($SubnetName))
{
    # Register to the Azure Batch resource provider
    $BatchApplicationId = "ddbf3205-c6bd-46ae-8127-60eb93363864"
    $BatchObjectId = (Get-AzADServicePrincipal -ServicePrincipalName $BatchApplicationId).Id
    Register-AzResourceProvider -ProviderNamespace Microsoft.Batch
    while(!(Get-AzResourceProvider -ProviderNamespace "Microsoft.Batch").RegistrationState.Contains("Registered"))
    {
        Start-Sleep -s 10
    }
    if($VnetId -match "/providers/Microsoft.ClassicNetwork/")
    {
        # Assign the VM contributor role to Microsoft.Batch
        New-AzRoleAssignment -ObjectId $BatchObjectId -RoleDefinitionName "Classic Virtual Machine Contributor" -Scope $VnetId
    }
}
# Add public IP address parameters if you bring your own static public IP addresses
if(![string]::IsNullOrEmpty($FirstPublicIP) -and ![string]::IsNullOrEmpty($SecondPublicIP))
{
$publicIPs = @($FirstPublicIP, $SecondPublicIP)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-PublicIPs $publicIPs
}
Next steps
For more information about Azure-SSIS IR, see the following articles:
Azure-SSIS IR. This article provides general conceptual information about IRs, including Azure-SSIS IR.
Tutorial: Deploy SSIS packages to Azure. This tutorial provides step-by-step instructions to create your Azure-
SSIS IR. It uses Azure SQL Database to host the SSIS catalog.
Create an Azure-SSIS IR. This article expands on the tutorial. It provides instructions about using Azure SQL
Database with virtual network service endpoints or SQL Managed Instance in a virtual network to host the
SSIS catalog. It shows how to join your Azure-SSIS IR to a virtual network.
Monitor an Azure-SSIS IR. This article shows you how to get information about your Azure-SSIS IR. It
provides status descriptions for the returned information.
Manage an Azure-SSIS IR. This article shows you how to stop, start, or delete your Azure-SSIS IR. It also
shows you how to scale out your Azure-SSIS IR by adding nodes.
Configure a self-hosted IR as a proxy for an Azure-
SSIS IR in Azure Data Factory
7/21/2021 • 11 minutes to read • Edit Online
TIP
If you select the Service Principal method, grant your service principal at least a Storage Blob Data Contributor role.
For more information, see Azure Blob Storage connector. If you select the Managed Identity/User-Assigned
Managed Identity method, grant the specified system/user-assigned managed identity for your ADF a proper role to
access Azure Blob Storage. For more information, see Access Azure Blob Storage using Azure Active Directory (Azure AD)
authentication with the specified system/user-assigned managed identity for your ADF.
Configure an Azure-SSIS IR with your self-hosted IR as a proxy
Having prepared your self-hosted IR and Azure Blob Storage linked service for staging, you can now configure
your new or existing Azure-SSIS IR with the self-hosted IR as a proxy in your data factory portal or app. Before
you do so, if your existing Azure-SSIS IR is already running, stop it, edit it, and then restart it.
1. In the Integration runtime setup pane, skip past the General settings and Deployment settings
pages by selecting the Continue button.
2. On the Advanced settings page, do the following:
a. Select the Set up Self-Hosted Integration Runtime as a proxy for your Azure-SSIS
Integration Runtime check box.
b. In the Self-Hosted Integration Runtime drop-down list, select your existing self-hosted IR as a
proxy for the Azure-SSIS IR.
c. In the Staging storage linked service drop-down list, select your existing Azure Blob Storage
linked service or create a new one for staging.
d. In the Staging path box, specify a blob container in your selected Azure Storage account or leave
it empty to use a default one for staging.
e. Select the Continue button.
You can also configure your new or existing Azure-SSIS IR with the self-hosted IR as a proxy by using
PowerShell.
$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
$AzureSSISName = "[your Azure-SSIS IR name]"
# Self-hosted integration runtime info - This can be configured as a proxy for on-premises data access
$DataProxyIntegrationRuntimeName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingLinkedServiceName = "" # OPTIONAL to configure a proxy for on-premises data access
$DataProxyStagingPath = "" # OPTIONAL to configure a proxy for on-premises data access
# Add self-hosted integration runtime parameters if you configure a proxy for on-premises data access
if(![string]::IsNullOrEmpty($DataProxyIntegrationRuntimeName) -and ![string]::IsNullOrEmpty($DataProxyStagingLinkedServiceName))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyIntegrationRuntimeName $DataProxyIntegrationRuntimeName `
-DataProxyStagingLinkedServiceName $DataProxyStagingLinkedServiceName
if(![string]::IsNullOrEmpty($DataProxyStagingPath))
{
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-DataProxyStagingPath $DataProxyStagingPath
}
}
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName $ResourceGroupName `
-DataFactoryName $DataFactoryName `
-Name $AzureSSISName `
-Force
You can also enable the ConnectByProxy property by setting it to True for the relevant connection
managers that appear on the Connection Managers tab of Execute SSIS Package activity when you're
running packages in Data Factory pipelines.
Option B: Redeploy the project containing those packages to run on your SSIS IR. You can then enable
the ConnectByProxy / ExecuteOnProxy properties by providing their property paths,
\Package.Connections[YourConnectionManagerName].Properties[ConnectByProxy] /
\Package\YourExecuteSQLTaskName.Properties[ExecuteOnProxy] /
\Package\YourExecuteProcessTaskName.Properties[ExecuteOnProxy] , and setting them to True as property
overrides on the Advanced tab of Execute Package pop-up window when you're running packages
from SSMS.
You can also enable the ConnectByProxy / ExecuteOnProxy properties by providing their property paths,
\Package.Connections[YourConnectionManagerName].Properties[ConnectByProxy] /
\Package\YourExecuteSQLTaskName.Properties[ExecuteOnProxy] /
\Package\YourExecuteProcessTaskName.Properties[ExecuteOnProxy] , and setting them to True as property
overrides on the Proper ty Overrides tab of Execute SSIS Package activity when you're running
packages in Data Factory pipelines.
Debug the on-premises tasks and cloud staging tasks
On your self-hosted IR, you can find the runtime logs in the C:\ProgramData\SSISTelemetry folder and the
execution logs of on-premises staging tasks and Execute SQL/Process Tasks in the
C:\ProgramData\SSISTelemetry\ExecutionLog folder. You can find the execution logs of cloud staging tasks in
your SSISDB, specified logging file paths, or Azure Monitor depending on whether you store your packages in
SSISDB, enable Azure Monitor integration, etc. You can also find the unique IDs of on-premises staging tasks in
the execution logs of cloud staging tasks.
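As a quick way to surface the latest logs on the self-hosted IR machine, a PowerShell one-liner like the following sketch can help (the path is the one described above; the file count is arbitrary):
# List the 20 most recently written execution log files on the self-hosted IR machine
Get-ChildItem -Path 'C:\ProgramData\SSISTelemetry\ExecutionLog' -Recurse -File |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 20 FullName, LastWriteTime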
If you've raised customer support tickets, you can select the Send logs button on the Diagnostics tab of
Microsoft Integration Runtime Configuration Manager that's installed on your self-hosted IR to send
recent operation/execution logs for us to investigate.
Current limitations
Only data flow components that are built in or preinstalled on the Azure-SSIS IR Standard Edition, except
Hadoop/HDFS/DQS components, are currently supported; see all built-in/preinstalled components on Azure-
SSIS IR.
Only custom/third-party data flow components that are written in managed code (.NET Framework) are
currently supported; those written in native code (C++) are currently unsupported.
Changing variable values in both on-premises and cloud staging tasks is currently unsupported.
Changing variable values of type object in on-premises staging tasks won't be reflected in other tasks.
ParameterMapping in the OLEDB Source is currently unsupported. As a workaround, use SQL Command
From Variable as the AccessMode and use Expression to insert your variables/parameters in a SQL
command. For an illustration, see the ParameterMappingSample.dtsx package that can be found in the
SelfHostedIRProxy/Limitations folder of our public preview blob container. Using Azure Storage Explorer, you
can connect to our public preview blob container by entering the above SAS URI.
Next steps
After you've configured your self-hosted IR as a proxy for your Azure-SSIS IR, you can deploy and run your
packages to access data on-premises as Execute SSIS Package activities in Data Factory pipelines. To learn how,
see Run SSIS packages as Execute SSIS Package activities in Data Factory pipelines.
Enable Azure Active Directory authentication for
Azure-SSIS integration runtime
7/21/2021 • 7 minutes to read • Edit Online
NOTE
In this scenario, Azure AD authentication with the specified system/user-assigned managed identity for your ADF
is only used in the creation and subsequent starting operations of your Azure-SSIS IR that will in turn provision
and connect to SSISDB. For SSIS package executions, your Azure-SSIS IR will still connect to SSISDB using SQL
authentication with fully managed accounts that are created during SSISDB provisioning.
If you have already created your Azure-SSIS IR using SQL authentication, you cannot reconfigure it to use Azure
AD authentication via PowerShell at this time, but you can do so via the Azure portal/ADF app.
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
The result looks like the following example, which also displays the variable value:
$Group
3. Add the specified system/user-assigned managed identity for your ADF to the group. You can follow the
Managed identity for Data Factory article to get the Object ID of the specified system/user-assigned managed
identity for your ADF (for example, 765ad4ab-XXXX-XXXX-XXXX-51ed985819dc, but do not use the Application ID
for this purpose). See the sketch after this step.
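As a minimal sketch covering the group creation from the earlier step and this step with the Az modules (the group name, resource group, and data factory name are hypothetical placeholders; the original steps can also be done with the Azure AD PowerShell module or the Azure portal):
# Minimal sketch: create an Azure AD group and add the ADF system managed identity to it
$Group = New-AzADGroup -DisplayName "SSISIrGroup" -MailNickname "SSISIrGroup"
$AdfIdentity = (Get-AzDataFactoryV2 -ResourceGroupName "MyResourceGroup" -Name "MyDataFactory").Identity.PrincipalId
Add-AzADGroupMember -TargetGroupObjectId $Group.Id -MemberObjectId $AdfIdentity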
The command should complete successfully, creating a contained user to represent the group.
9. Clear the query window, enter the following T-SQL command, and select Execute on the toolbar.
The command should complete successfully, granting the contained user the ability to create a database
(SSISDB).
10. If your SSISDB was created using SQL authentication and you want to switch to use Azure AD
authentication for your Azure-SSIS IR to access it, first make sure that the steps to grant permissions to
the master database have finished successfully. Then, right-click on the SSISDB database and select
New query .
11. In the query window, enter the following T-SQL command, and select Execute on the toolbar.
The command should complete successfully, creating a contained user to represent the group.
12. Clear the query window, enter the following T-SQL command, and select Execute on the toolbar.
The command should complete successfully, granting the contained user the ability to access SSISDB.
If you use the system managed identity for your ADF, then your managed identity name should be your
ADF name. If you use a user-assigned managed identity for your ADF, then your managed identity name
should be the specified user-assigned managed identity name.
The command should complete successfully, granting the system/user-assigned managed identity for
your ADF the ability to create a database (SSISDB).
6. If your SSISDB was created using SQL authentication and you want to switch to use Azure AD
authentication for your Azure-SSIS IR to access it, first make sure that the steps to grant permissions to
the master database have finished successfully. Then, right-click on the SSISDB database and select
New query .
7. In the query window, enter the following T-SQL command, and select Execute on the toolbar.
CREATE USER [{your managed identity name}] FOR LOGIN [{your managed identity name}] WITH
DEFAULT_SCHEMA = dbo
ALTER ROLE db_owner ADD MEMBER [{your managed identity name}]
The command should complete successfully, granting the system/user-assigned managed identity for
your ADF the ability to access SSISDB.
The following methods are available for accessing data stores and file shares with Windows authentication. For each method, this list shows its effective scope, its setup step, how packages access connected resources, the number of credential sets supported, and the types of connected resources.

Setting up an activity-level execution context
- Effective scope: Per Execute SSIS Package activity.
- Setup step: Configure the Windows authentication property to set up an "Execution/Run as" context when running SSIS packages as Execute SSIS Package activities in ADF pipelines. For more info, see Configure Execute SSIS Package activity.
- Method in packages: Access resources directly in packages, for example, use a UNC path to access file shares or Azure Files: \\YourFileShareServerName\YourFolderName or \\YourAzureStorageAccountName.file.core.windows.net\YourFolderName
- Credential sets: Support only one credential set for all connected resources.
- Type of connected resources: File shares on premises/Azure VMs; Azure Files (see Use an Azure file share); SQL Servers on premises/Azure VMs with Windows authentication; other resources with Windows authentication.

Setting up a catalog-level execution context
- Effective scope: Per Azure-SSIS IR, but is overridden when setting up an activity-level execution context (see above).
- Setup step: Execute the SSISDB catalog.set_execution_credential stored procedure to set up an "Execution/Run as" context. For more info, see the rest of this article below.
- Method in packages: Access resources directly in packages via a UNC path, as above.
- Credential sets: Support only one credential set for all connected resources.
- Type of connected resources: Same as above.

Persisting credentials via the cmdkey command
- Effective scope: Per Azure-SSIS IR, but is overridden when setting up an activity/catalog-level execution context (see above).
- Setup step: Execute the cmdkey command in a custom setup script ( main.cmd ) when provisioning your Azure-SSIS IR, for example, if you use file shares, Azure Files, or SQL Server:
  cmdkey /add:YourFileShareServerName /user:YourDomainName\YourUsername /pass:YourPassword
  cmdkey /add:YourAzureStorageAccountName.file.core.windows.net /user:azure\YourAzureStorageAccountName /pass:YourAccessKey
  cmdkey /add:YourSQLServerFullyQualifiedDomainNameOrIPAddress:YorSQLServerPort /user:YourDomainName\YourUsername /pass:YourPassword
- Method in packages: Access resources directly in packages via a UNC path, as above.
- Credential sets: Support multiple credential sets for different connected resources.
- Type of connected resources: Same as above.

Mounting drives at package execution time (non-persistent)
- Effective scope: Per package.
- Setup step: Execute the net use command in an Execute Process Task that is added at the beginning of the control flow in your packages, for example, net use D: \\YourFileShareServerName\YourFolderName
- Method in packages: Access file shares via mapped drives.
- Credential sets: Support multiple drives for different file shares.
- Type of connected resources: File shares on premises/Azure VMs; Azure Files (see Use an Azure file share).
WARNING
If you do not use any of the above methods to access data stores with Windows authentication, your packages that
depend on Windows authentication are not able to access them and fail at run time.
The rest of this article describes how to configure SSIS catalog (SSISDB) hosted in SQL Database/SQL Managed
Instance to run packages on Azure-SSIS IR that use Windows authentication to access data stores.
4. Run your SSIS packages. The packages use the credentials that you provided to access data stores on
premises with Windows authentication.
View domain credentials
To view the active domain credentials, do the following things:
1. With SSMS or another tool, connect to SQL Database/SQL Managed Instance that hosts SSISDB. For
more info, see Connect to SSISDB in Azure.
2. With SSISDB as the current database, open a query window.
3. Run the following query and check the output:
SELECT *
FROM catalog.master_properties
WHERE property_name = 'EXECUTION_DOMAIN' OR property_name = 'EXECUTION_USER'
3. From SSMS, check whether you can connect to the SQL Server on premises.
Prerequisites
To access a SQL Server on premises from packages running in Azure, do the following things:
1. In SQL Server Configuration Manager, enable TCP/IP protocol.
2. Allow access through Windows firewall. For more info, see Configure Windows firewall to access SQL
Server.
3. Join your Azure-SSIS IR to a Microsoft Azure Virtual Network that is connected to the SQL Server on
premises. For more info, see Join Azure-SSIS IR to a Microsoft Azure Virtual Network.
4. Use the SSISDB catalog.set_execution_credential stored procedure to provide credentials as described in
this article (see the sketch after this list).
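A minimal sketch of calling that stored procedure from PowerShell (assuming the SqlServer module's Invoke-Sqlcmd and placeholder server, login, and credential values; you can run the same T-SQL directly from SSMS instead):
# Minimal sketch: set the catalog-level "Execution/Run as" credentials in SSISDB (placeholder values)
Invoke-Sqlcmd -ServerInstance "yourserver.database.windows.net" -Database "SSISDB" -Username "ssisadmin" -Password "yourStrongPassword" -Query "EXEC catalog.set_execution_credential @user = N'YourUsername', @domain = N'YourDomainName', @password = N'YourPassword'"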
3. Check whether the directory listing is returned for the file share on premises.
Prerequisites
To access a file share on premises from packages running in Azure, do the following things:
1. Allow access through Windows firewall.
2. Join your Azure-SSIS IR to a Microsoft Azure Virtual Network that is connected to the file share on
premises. For more info, see Join Azure-SSIS IR to a Microsoft Azure Virtual Network.
3. Use SSISDB catalog.set_execution_credential stored procedure to provide credentials as described in
this article.
Next steps
Deploy your packages. For more info, see Deploy an SSIS project to Azure with SSMS.
Run your packages. For more info, see Run SSIS packages in Azure with SSMS.
Schedule your packages. For more info, see Schedule SSIS packages in Azure.
Open and save files on premises and in Azure with
SSIS packages deployed in Azure
3/22/2021 • 2 minutes to read • Edit Online
Next steps
Deploy your packages. For more info, see Deploy an SSIS project to Azure with SSMS.
Run your packages. For more info, see Run SSIS packages in Azure with SSMS.
Schedule your packages. For more info, see Schedule SSIS packages in Azure.
Provision Enterprise Edition for the Azure-SSIS
Integration Runtime
3/5/2021 • 3 minutes to read • Edit Online
Enterprise features
CDC components: The CDC Source, Control Task, and Splitter Transformation are preinstalled on the Azure-SSIS IR Enterprise Edition. To connect to Oracle, you also need to install the CDC Designer and Service on another computer.
Oracle connectors: The Oracle Connection Manager, Source, and Destination are preinstalled on the Azure-SSIS IR Enterprise Edition. You also need to install the Oracle Call Interface (OCI) driver, and if necessary configure the Oracle Transport Network Substrate (TNS), on the Azure-SSIS IR. For more info, see Custom setup for the Azure-SSIS integration runtime.
Analysis Services components: The Data Mining Model Training Destination, the Dimension Processing Destination, and the Partition Processing Destination, as well as the Data Mining Query Transformation, are preinstalled on the Azure-SSIS IR Enterprise Edition. All these components support SQL Server Analysis Services (SSAS), but only the Partition Processing Destination supports Azure Analysis Services (AAS). To connect to SSAS, you also need to configure Windows Authentication credentials in SSISDB. In addition to these components, the Analysis Services Execute DDL Task, the Analysis Services Processing Task, and the Data Mining Query Task are also preinstalled on the Azure-SSIS IR Standard/Enterprise Edition.
Fuzzy Grouping and Fuzzy Lookup transformations: The Fuzzy Grouping and Fuzzy Lookup transformations are preinstalled on the Azure-SSIS IR Enterprise Edition. These components support both SQL Server and Azure SQL Database for storing reference data.
Term Extraction and Term Lookup transformations: The Term Extraction and Term Lookup transformations are preinstalled on the Azure-SSIS IR Enterprise Edition. These components support both SQL Server and Azure SQL Database for storing reference data.
Instructions
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
$MyAzureSsisIrEdition = "Enterprise"
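A minimal sketch of applying this setting with Azure PowerShell (placeholder resource names; stop the Azure-SSIS IR before reconfiguring it and restart it afterward):
# Minimal sketch: switch an existing Azure-SSIS IR to the edition selected above (placeholder names)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" -DataFactoryName "MyDataFactory" -Name "MyAzureSsisIr" -Edition $MyAzureSsisIrEdition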
Next steps
Custom setup for the Azure-SSIS integration runtime
How to develop paid or licensed custom components for the Azure-SSIS integration runtime
Built-in and preinstalled components on Azure-SSIS
Integration Runtime
3/5/2021 • 3 minutes to read • Edit Online
Preinstalled connection managers (Azure Feature Pack)
Azure Data Lake Analytics Connection Manager
Azure Data Lake Store Connection Manager
Excel Source
OData Source
ODBC Source
OLEDB Source
XML Source
DataReader Destination
Excel Destination
ODBC Destination
OLEDB Destination
Recordset Destination
Script Component
Pivot Transformation
Sort Transformation
Unpivot Transformation
Cache Transform
Lookup Transformation
Merge Transformation
Multicast Transformation
Built-in Analysis Services tasks
Analysis Services Execute DDL Task
FTP Task
XML Task
Expression Task
Next steps
To install additional custom/Open Source/3rd party components on your SSIS IR, follow the instructions in
Customize Azure-SSIS IR.
Customize the setup for an Azure-SSIS Integration
Runtime
5/25/2021 • 19 minutes to read • Edit Online
IMPORTANT
To benefit from future enhancements, we recommend using v3 or later series of nodes for your Azure-SSIS IR with custom
setup.
Current limitations
The following limitations apply only to standard custom setups:
If you want to use gacutil.exe in your script to install assemblies in the global assembly cache (GAC), you
need to provide gacutil.exe as part of your custom setup. Or you can use the copy that's provided in the
Sample folder of our Public Preview blob container, see the Standard custom setup samples section
below.
If you want to reference a subfolder in your script, msiexec.exe doesn't support the .\ notation to
reference the root folder. Use a command such as msiexec /i "MySubfolder\MyInstallerx64.msi" ...
instead of msiexec /i ".\MySubfolder\MyInstallerx64.msi" ... .
Administrative shares, or hidden network shares that are automatically created by Windows, are currently
not supported on the Azure-SSIS IR.
The IBM iSeries Access ODBC driver is not supported on the Azure-SSIS IR. You might see installation
errors during your custom setup. If you do, contact IBM support for assistance.
Prerequisites
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
Instructions
You can provision or reconfigure your Azure-SSIS IR with custom setups on ADF UI. If you want to do the same
using PowerShell, download and install Azure PowerShell.
Standard custom setup
To provision or reconfigure your Azure-SSIS IR with standard custom setups on ADF UI, complete the following
steps.
1. Prepare your custom setup script and its associated files (for example, .bat, .cmd, .exe, .dll, .msi, or .ps1
files).
You must have a script file named main.cmd, which is the entry point of your custom setup.
To ensure that the script can be silently executed, you should test it on your local machine first.
If you want additional logs generated by other tools (for example, msiexec.exe) to be uploaded to your
blob container, specify the predefined environment variable, CUSTOM_SETUP_SCRIPT_LOG_DIR , as the log
folder in your scripts (for example, msiexec /i xxx.msi /quiet /lv
%CUSTOM_SETUP_SCRIPT_LOG_DIR%\install.log).
2. Download, install, and open Azure Storage Explorer.
a. Under Local and Attached , right-click Storage Accounts , and then select Connect to Azure
Storage .
b. Select Storage account or ser vice , select Account name and key , and then select Next .
c. Enter your Azure Storage account name and key, select Next , and then select Connect .
d. Under your connected Azure Storage account, right-click Blob Containers , select Create Blob
Container , and name the new blob container.
e. Select the new blob container, and upload your custom setup script and its associated files. Make sure
that you upload main.cmd at the top level of your blob container, not in any folder. Your blob container
should contain only the necessary custom setup files, so downloading them to your Azure-SSIS IR later
won't take a long time. The maximum duration of a custom setup is currently set at 45 minutes before it
times out. This includes the time to download all files from your blob container and install them on the
Azure-SSIS IR. If setup requires more time, raise a support ticket.
f. Right-click the blob container, and then select Get Shared Access Signature .
g. Create the SAS URI for your blob container with a sufficiently long expiration time and with
read/write/list permission. You need the SAS URI to download and run your custom setup script and its
associated files. This happens whenever any node of your Azure-SSIS IR is reimaged or restarted. You
also need write permission to upload setup execution logs.
IMPORTANT
Ensure that the SAS URI doesn't expire and the custom setup resources are always available during the whole
lifecycle of your Azure-SSIS IR, from creation to deletion, especially if you regularly stop and start your Azure-SSIS
IR during this period.
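If you prefer to script this step, the following sketch (assuming the Az.Storage module and placeholder account, key, and container names) generates such a container SAS URI with read/write/list permissions and a long expiration:
# Minimal sketch: generate a long-lived read/write/list SAS URI for the custom setup blob container
$ctx = New-AzStorageContext -StorageAccountName "yourstorageaccount" -StorageAccountKey "yourStorageAccountKey"
New-AzStorageContainerSASToken -Name "customsetup" -Context $ctx -Permission rwl -ExpiryTime (Get-Date).AddYears(5) -FullUri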
b. Select Blob container , select Shared access signature URL (SAS) , and then select Next .
c. In the Blob container SAS URL text box, enter the SAS URI for our Public Preview blob container
below, select Next , and then select Connect .
https://ssisazurefileshare.blob.core.windows.net/publicpreview?sp=rl&st=2020-03-25T04:00:00Z&se=2025-03-25T04:00:00Z&sv=2019-02-02&sr=c&sig=WAD3DATezJjhBCO3ezrQ7TUZ8syEUxZZtGIhhP6Pt4I%3D
d. In the left pane, select the connected publicpreview blob container, and then double-click the
CustomSetupScript folder. In this folder are the following items:
A Sample folder, which contains a custom setup to install a basic task on each node of your Azure-
SSIS IR. The task does nothing but sleep for a few seconds. The folder also contains a gacutil folder,
whose entire content (gacutil.exe, gacutil.exe.config, and 1033\gacutlrc.dll) can be copied as is to
your blob container.
A UserScenarios folder, which contains several custom setup samples from real user scenarios. If
you want to install multiple samples on your Azure-SSIS IR, you can combine their custom setup
script (main.cmd) files into a single one and upload it with all of their associated files into your
blob container.
f. To reuse these standard custom setup samples, copy the content of the selected folder to your blob
container.
2. When you provision or reconfigure your Azure-SSIS IR on ADF UI, select the Customize your Azure-
SSIS Integration Runtime with additional system configurations/component installations
check box on the Advanced settings page of Integration runtime setup pane. Next, enter the SAS
URI of your blob container in the Custom setup container SAS URI text box.
3. When you provision or reconfigure your Azure-SSIS IR using Azure PowerShell, stop it if it's already
started/running, run the Set-AzDataFactoryV2IntegrationRuntime cmdlet with the SAS URI of your blob
container as the value for the SetupScriptContainerSasUri parameter, and then start your Azure-SSIS IR
(see the sketch after these steps).
4. After your standard custom setup finishes and your Azure-SSIS IR starts, you can find all custom setup
logs in the main.cmd.log folder of your blob container. They include the standard output of main.cmd and
other execution logs.
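A minimal sketch of step 3 with Azure PowerShell (placeholder resource names and SAS URI; SetupScriptContainerSasUri is the parameter used for standard custom setups):
# Minimal sketch: apply a standard custom setup to an existing Azure-SSIS IR (placeholder names)
$SetupScriptContainerSasUri = "https://yourstorageaccount.blob.core.windows.net/customsetup?<your SAS token>"
Stop-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" -DataFactoryName "MyDataFactory" -Name "MyAzureSsisIr" -Force
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" -DataFactoryName "MyDataFactory" -Name "MyAzureSsisIr" -SetupScriptContainerSasUri $SetupScriptContainerSasUri
Start-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" -DataFactoryName "MyDataFactory" -Name "MyAzureSsisIr" -Force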
Next steps
Set up the Enterprise Edition of Azure-SSIS IR
Develop paid or licensed components for Azure-SSIS IR
Install paid or licensed custom components for the
Azure-SSIS integration runtime
3/5/2021 • 3 minutes to read • Edit Online
The problem
The nature of the Azure-SSIS integration runtime presents several challenges, which make the typical licensing
methods used for the on-premises installation of custom components inadequate. As a result, the Azure-SSIS IR
requires a different approach.
The nodes of the Azure-SSIS IR are volatile and can be allocated or released at any time. For example, you
can start or stop nodes to manage the cost, or scale up and down through various node sizes. As a result,
binding a third-party component license to a particular node by using machine-specific info such as MAC
address or CPU ID is no longer viable.
You can also scale the Azure-SSIS IR in or out, so that the number of nodes can shrink or expand at any
time.
The solution
As a result of the limitations of traditional licensing methods described in the previous section, the Azure-SSIS IR
provides a new solution. This solution uses Windows environment variables and SSIS system variables for the
license binding and validation of third-party components. ISVs can use these variables to obtain unique and
persistent info for an Azure-SSIS IR, such as Cluster ID and Cluster Node Count. With this info, ISVs can then
bind the license for their component to an Azure-SSIS IR as a cluster. This binding uses an ID that doesn't
change when customers start or stop, scale up or down, scale in or out, or reconfigure the Azure-SSIS IR in any
way.
The following diagram shows the typical installation, activation and license binding, and validation flows for
third-party components that use these new variables:
Instructions
1. ISVs can offer their licensed components in various SKUs or tiers (for example, single node, up to 5 nodes,
up to 10 nodes, and so forth). The ISV provides the corresponding Product Key when customers purchase
a product. The ISV can also provide an Azure Storage blob container that contains an ISV Setup script and
associated files. Customers can copy these files into their own storage container and modify them with
their own Product Key (for example, by running IsvSetup.exe -pid xxxx-xxxx-xxxx ). Customers can then
provision or reconfigure the Azure-SSIS IR with the SAS URI of their container as parameter. For more
info, see Custom setup for the Azure-SSIS integration runtime.
2. When the Azure-SSIS IR is provisioned or reconfigured, ISV Setup runs on each node to query the
Windows environment variables, SSIS_CLUSTERID and SSIS_CLUSTERNODECOUNT . Then the Azure-SSIS IR
submits its Cluster ID and the Product Key for the licensed product to the ISV Activation Server to
generate an Activation Key.
3. After receiving the Activation Key, ISV Setup can store the key locally on each node (for example, in the
Registry).
4. When customers run a package that uses the ISV's licensed component on a node of the Azure-SSIS IR,
the package reads the locally stored Activation Key and validates it against the node's Cluster ID. The
package can also optionally report the Cluster Node Count to the ISV activation server.
Here is an example of code that validates the activation key and reports the cluster node count:
public override DTSExecResult Validate(Connections connections, VariableDispenser variableDispenser, IDTSComponentEvents componentEvents, IDTSLogging log)
{
    Variables vars = null;
    // Read the cluster info that the Azure-SSIS IR exposes as SSIS system variables
    variableDispenser.LockForRead("System::ClusterID");
    variableDispenser.LockForRead("System::ClusterNodeCount");
    variableDispenser.GetVariables(ref vars);
    // Validate the locally stored Activation Key against System::ClusterID here,
    // and optionally report System::ClusterNodeCount to the ISV activation server
    vars.Unlock();
    return DTSExecResult.Success;
}
ISV partners
You can find a list of ISV partners who have adapted their components and extensions for the Azure-SSIS IR at
the end of this blog post - Enterprise Edition, Custom Setup, and 3rd Party Extensibility for SSIS in ADF.
Next steps
Custom setup for the Azure-SSIS integration runtime
Enterprise Edition of the Azure-SSIS Integration Runtime
Configure the Azure-SSIS Integration Runtime for
high performance
3/5/2021 • 8 minutes to read • Edit Online
IMPORTANT
This article contains performance results and observations from in-house testing done by members of the SSIS
development team. Your results may vary. Do your own testing before you finalize your configuration settings, which
affect both cost and performance.
Properties to configure
The following portion of a configuration script shows the properties that you can configure when you create an
Azure-SSIS Integration Runtime. For the complete PowerShell script and description, see Deploy SQL Server
Integration Services packages to Azure.
# If your input contains a PSH special character, e.g. "$", precede it with the escape character "`" like "`$"
$SubscriptionName = "[your Azure subscription name]"
$ResourceGroupName = "[your Azure resource group name]"
$DataFactoryName = "[your data factory name]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$DataFactoryLocation = "EastUS"
### Azure-SSIS integration runtime information - This is a Data Factory compute resource for running SSIS packages
$AzureSSISName = "[specify a name for your Azure-SSIS IR]"
$AzureSSISDescription = "[specify a description for your Azure-SSIS IR]"
# For supported regions, see https://azure.microsoft.com/global-infrastructure/services/?products=data-factory&regions=all
$AzureSSISLocation = "EastUS"
# For supported node sizes, see https://azure.microsoft.com/pricing/details/data-factory/ssis/
$AzureSSISNodeSize = "Standard_D8_v3"
# 1-10 nodes are currently supported
$AzureSSISNodeNumber = 2
# Azure-SSIS IR edition/license info: Standard or Enterprise
$AzureSSISEdition = "Standard" # Standard by default, while Enterprise lets you use advanced/premium features on your Azure-SSIS IR
# Azure-SSIS IR hybrid usage info: LicenseIncluded or BasePrice
$AzureSSISLicenseType = "LicenseIncluded" # LicenseIncluded by default, while BasePrice lets you bring your existing SQL Server license with Software Assurance to earn cost savings from Azure Hybrid Benefit (AHB) option
# For a Standard_D1_v2 node, up to 4 parallel executions per node are supported, but for other nodes, up to max(2 x number of cores, 8) are currently supported
$AzureSSISMaxParallelExecutionsPerNode = 8
# Custom setup info
$SetupScriptContainerSasUri = "" # OPTIONAL to provide SAS URI of blob container where your custom setup script and its associated files are stored
# Virtual network info: Classic or Azure Resource Manager
$VnetId = "[your virtual network resource ID or leave it empty]" # REQUIRED if you use Azure SQL Database with virtual network service endpoints/SQL Managed Instance/on-premises data, Azure Resource Manager virtual network is recommended, Classic virtual network will be deprecated soon
$SubnetName = "[your subnet name or leave it empty]" # WARNING: Please use the same subnet as the one used with your Azure SQL Database with virtual network service endpoints or a different subnet than the one used for your SQL Managed Instance
AzureSSISLocation
AzureSSISLocation is the location for the integration runtime worker node. The worker node maintains a
constant connection to the SSIS Catalog database (SSISDB) in Azure SQL Database. Set AzureSSISLocation
to the same location as the logical SQL server that hosts SSISDB, which lets the integration runtime work as
efficiently as possible.
AzureSSISNodeSize
Data Factory, including the Azure-SSIS IR, supports the following options:
Standard_A4_v2
Standard_A8_v2
Standard_D1_v2
Standard_D2_v2
Standard_D3_v2
Standard_D4_v2
Standard_D2_v3
Standard_D4_v3
Standard_D8_v3
Standard_D16_v3
Standard_D32_v3
Standard_D64_v3
Standard_E2_v3
Standard_E4_v3
Standard_E8_v3
Standard_E16_v3
Standard_E32_v3
Standard_E64_v3
In unofficial in-house testing by the SSIS engineering team, the D series appears to be more suitable for SSIS
package execution than the A series.
The performance/price ratio of the D series is higher than that of the A series, and the performance/price ratio of the
v3 series is higher than that of the v2 series.
The throughput of the D series is higher than that of the A series at the same price, and the throughput of the v3
series is higher than that of the v2 series at the same price.
The v2 series nodes of the Azure-SSIS IR are not suitable for custom setup, so use the v3 series nodes
instead. If you already use v2 series nodes, switch to v3 series nodes as soon as possible.
The E series consists of memory-optimized VM sizes that provide a higher memory-to-CPU ratio than other
machines. If your package requires a lot of memory, consider choosing an E series VM.
Configure for execution speed
If you don't have many packages to run, and you want packages to run quickly, use the information in the
following chart to choose a virtual machine type suitable for your scenario.
This data represents a single package execution on a single worker node. The package loads 3 million records
with first name and last name columns from Azure Blob Storage, generates a full name column, and writes the
records that have the full name longer than 20 characters to Azure Blob Storage.
The y-axis is the number of packages that completed execution in one hour. Please note that this is only a test
result for one memory-consuming package. If you want to know the throughput of your package, we
recommend that you perform the test yourself.
Configure for overall throughput
If you have lots of packages to run, and you care most about the overall throughput, use the information in the
following chart to choose a virtual machine type suitable for your scenario.
The y-axis is the number of packages that completed execution in one hour. Please note that this is only a test
result for one memory-consuming package. If you want to know the throughput of your package, we
recommend that you perform the test yourself.
AzureSSISNodeNumber
AzureSSISNodeNumber adjusts the scalability of the integration runtime. The throughput of the integration
runtime is proportional to the AzureSSISNodeNumber . Set the AzureSSISNodeNumber to a small value at
first, monitor the throughput of the integration runtime, then adjust the value for your scenario. To reconfigure
the worker node count, see Manage an Azure-SSIS integration runtime.
AzureSSISMaxParallelExecutionsPerNode
When you're already using a powerful worker node to run packages, increasing
AzureSSISMaxParallelExecutionsPerNode may increase the overall throughput of the integration runtime. If
you want to increase the maximum value, you need to use Azure PowerShell to update
AzureSSISMaxParallelExecutionsPerNode (see the sketch after the guidelines below). You can estimate the
appropriate value based on the cost of your package and the following configurations for the worker nodes.
For more information, see General-purpose virtual machine sizes.
(The referenced table lists, for each node size: vCPU, memory in GiB, temp storage (SSD) in GiB, max temp storage throughput in IOPS and read/write MBps, max data disks and throughput in IOPS, and max NICs and expected network performance in Mbps.)
Here are the guidelines for setting the right value for the AzureSSISMaxParallelExecutionsPerNode
property:
1. Set it to a small value at first.
2. Increase it by a small amount to check whether the overall throughput is improved.
3. Stop increasing the value when the overall throughput reaches the maximum value.
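A minimal sketch of updating this property with Azure PowerShell (placeholder resource names; stop the Azure-SSIS IR before reconfiguring it and restart it afterward):
# Minimal sketch: raise the maximum parallel executions per node on an existing Azure-SSIS IR (placeholder names)
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "MyResourceGroup" -DataFactoryName "MyDataFactory" -Name "MyAzureSsisIr" -MaxParallelExecutionsPerNode 16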
SSISDBPricingTier
SSISDBPricingTier is the pricing tier for the SSIS Catalog database (SSISDB) in Azure SQL Database. This
setting affects the maximum number of workers in the IR instance, the speed to queue a package execution, and
the speed to load the execution log.
If you don't care about the speed to queue package executions and to load the execution log, you can
choose the lowest database pricing tier. Azure SQL Database with Basic pricing supports 8 workers in an
integration runtime instance.
Choose a more powerful database than Basic if the worker count is more than 8, or the core count is
more than 50. Otherwise the database becomes the bottleneck of the integration runtime instance and
the overall performance is negatively impacted.
Choose a more powerful database, such as S3, if the logging level is set to verbose. According to our
unofficial in-house testing, the S3 pricing tier can support SSIS package execution with 2 nodes, 128 parallel
counts, and the verbose logging level.
You can also adjust the database pricing tier based on database transaction unit (DTU) usage information
available on the Azure portal (see the sketch below).
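For example, scaling SSISDB to a higher pricing tier with Azure PowerShell might look like the following sketch (assuming the Az.Sql module and placeholder server and resource group names):
# Minimal sketch: scale the database that hosts the SSIS catalog to the S3 pricing tier (placeholder names)
Set-AzSqlDatabase -ResourceGroupName "MyResourceGroup" -ServerName "myssisdbserver" -DatabaseName "SSISDB" -RequestedServiceObjectiveName "S3"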
b. Run the following command for SSISDB in both your primary and secondary Azure SQL Managed
Instances to add the new password for decrypting DMK.
5. If you want to have near-zero downtime when SSISDB failover occurs, keep both of your Azure-SSIS IRs
running. Only your primary Azure-SSIS IR can access the primary SSISDB to fetch and execute packages,
as well as write package execution logs, while your secondary Azure-SSIS IR can only do the same for
packages deployed somewhere else, for example in Azure Files.
If you want to minimize your running cost, you can stop your secondary Azure-SSIS IR after it's created.
When SSISDB failover occurs, your primary and secondary Azure-SSIS IRs will swap roles. If your
primary Azure-SSIS IR is stopped, you need to restart it. Depending on whether it's injected into a virtual
network and the injection method used, it will take up to 5 minutes or around 20-30 minutes for it to
run.
6. If you use Azure SQL Managed Instance Agent for orchestration/scheduling package executions, make
sure that all relevant SSIS jobs with their job steps and associated schedules are copied to your
secondary Azure SQL Managed Instance with the schedules initially disabled. Using SSMS, complete the
following steps.
a. For each SSIS job, right-click and select the Script Job as , CREATE To , and New Query Editor
Window dropdown menu items to generate its script.
b. For each generated SSIS job script, find the command to execute sp_add_job stored procedure
and modify/remove the value assignment to @owner_login_name argument as necessary.
c. For each updated SSIS job script, run it on your secondary Azure SQL Managed Instance to copy
the job with its job steps and associated schedules.
d. Using the following script, create a new T-SQL job to enable/disable SSIS job schedules based on
the primary/secondary SSISDB role, respectively, in both your primary and secondary Azure SQL
Managed Instances and run it regularly. When SSISDB failover occurs, SSIS job schedules that
were disabled will be enabled and vice versa.
7. If you use ADF for orchestration/scheduling package executions, make sure that all relevant ADF pipelines
with Execute SSIS Package activities and associated triggers are copied to your secondary ADF with the
triggers initially disabled. When SSISDB failover occurs, you need to enable them.
8. You can test your Azure SQL Managed Instance failover group and check on Azure-SSIS IR monitoring
page in ADF portal whether your primary and secondary Azure-SSIS IRs have swapped roles.
3. Using Azure portal/ADF UI or Azure PowerShell, create your new ADF/Azure-SSIS IR named
YourNewADF/YourNewAzureSSISIR, respectively, in another region. If you use Azure portal/ADF UI, you
can ignore the test connection error on Deployment settings page of Integration runtime setup
pane.
Next steps
You can consider these other configuration options for your Azure-SSIS IR:
Configure package stores for your Azure-SSIS IR
Configure custom setups for your Azure-SSIS IR
Configure virtual network injection for your Azure-SSIS IR
Configure self-hosted IR as a proxy for your Azure-SSIS IR
How to clean up SSISDB logs automatically
7/19/2021 • 12 minutes to read • Edit Online
USE msdb
IF EXISTS(SELECT * FROM sys.server_principals where name = '##MS_SSISServerCleanupJobLogin##')
DROP LOGIN ##MS_SSISServerCleanupJobLogin##
USE master
GRANT VIEW SERVER STATE TO ##MS_SSISServerCleanupJobLogin##
USE SSISDB
IF EXISTS (SELECT name FROM sys.database_principals WHERE name = '##MS_SSISServerCleanupJobUser##')
DROP USER ##MS_SSISServerCleanupJobUser##
CREATE USER ##MS_SSISServerCleanupJobUser## FOR LOGIN ##MS_SSISServerCleanupJobLogin##
GRANT EXECUTE ON [internal].[cleanup_server_retention_window_exclusive] TO ##MS_SSISServerCleanupJobUser##
GRANT EXECUTE ON [internal].[cleanup_server_project_version] TO ##MS_SSISServerCleanupJobUser##
USE msdb
EXEC dbo.sp_add_job
@job_name = N'SSIS Server Maintenance Job',
@enabled = 0,
@owner_login_name = '##MS_SSISServerCleanupJobLogin##',
@description = N'Runs every day. The job removes operation records from the database that are outside
the retention window and maintains a maximum number of versions per project.'
EXEC sp_add_jobstep
@job_name = N'SSIS Server Maintenance Job',
@step_name = N'SSIS Server Operation Records Maintenance',
@subsystem = N'TSQL',
@command = N'
DECLARE @role int
SET @role = (SELECT [role] FROM [sys].[dm_hadr_availability_replica_states] hars INNER JOIN [sys].
[availability_databases_cluster] adc ON hars.[group_id] = adc.[group_id] WHERE hars.[is_local] = 1 AND adc.
[database_name] =''SSISDB'')
IF DB_ID(''SSISDB'') IS NOT NULL AND (@role IS NULL OR @role = 1)
EXEC [SSISDB].[internal].[cleanup_server_retention_window_exclusive]',
@database_name = N'msdb',
@on_success_action = 3,
@retry_attempts = 3,
@retry_interval = 3;
EXEC sp_add_jobstep
@job_name = N'SSIS Server Maintenance Job',
@step_name = N'SSIS Server Max Version Per Project Maintenance',
@subsystem = N'TSQL',
@command = N'
DECLARE @role int
SET @role = (SELECT [role] FROM [sys].[dm_hadr_availability_replica_states] hars INNER JOIN [sys].
[availability_databases_cluster] adc ON hars.[group_id] = adc.[group_id] WHERE hars.[is_local] = 1 AND adc.
[database_name] =''SSISDB'')
IF DB_ID(''SSISDB'') IS NOT NULL AND (@role IS NULL OR @role = 1)
EXEC [SSISDB].[internal].[cleanup_server_project_version]',
@database_name = N'msdb',
@retry_attempts = 3,
@retry_interval = 3;
EXEC sp_add_jobschedule
@job_name = N'SSIS Server Maintenance Job',
@name = 'SSISDB Scheduler',
@enabled = 1,
@freq_type = 4, /*daily*/
@freq_interval = 1,/*every day*/
@freq_subday_type = 0x1,
@active_start_date = 20001231,
@active_end_date = 99991231,
@active_start_time = 0,
@active_end_time = 120000
The following Azure PowerShell scripts create a new Elastic Job that invokes SSISDB log clean-up stored
procedure. For more info, see Create an Elastic Job agent using PowerShell.
Create parameters
# Your job database should be a clean, empty S0 or higher service tier. We set S0 as default.
$PricingTier = "S0",
# Parameters needed to create credentials in your job database for connecting to SSISDB
$PasswordForSSISDBCleanupUser = $(Read-Host "Please provide a new password for the log clean-up job user to connect to SSISDB"),
# Parameters needed to set the job schedule for invoking SSISDB log clean-up stored procedure
$RunJobOrNot = $(Read-Host "Please indicate whether you want to run the job to clean up SSISDB logs outside the retention window immediately (Y/N). Make sure the retention window is set properly before running the following scripts as deleted logs cannot be recovered."),
$IntervalType = $(Read-Host "Please enter the interval type for SSISDB log clean-up schedule: Year, Month, Day, Hour, Minute, Second are supported."),
$IntervalCount = $(Read-Host "Please enter the count of interval type for SSISDB log clean-up schedule."),
# The start time for SSISDB log clean-up schedule is set to current time by default.
$StartTime = (Get-Date)
# Install the latest PowerShell PackageManagement module that PowerShellGet v1.6.5 depends on
Find-Package PackageManagement -RequiredVersion 1.1.7.2 | Install-Package -Force
# Install AzureRM.Sql preview cmdlets side by side with the existing AzureRM.Sql version
Install-Module -Name AzureRM.Sql -AllowPrerelease -Force
# Create your job database for defining SSISDB log clean-up job and tracking the job history
Write-Output "Creating a blank SQL database to be used as your job database ..."
$JobDatabase = New-AzureRmSqlDatabase -ResourceGroupName $ResourceGroupName -ServerName $AgentServerName -
DatabaseName $SSISDBLogCleanupJobDB -RequestedServiceObjectiveName $PricingTier
$JobDatabase
# Create job credentials in your job database for connecting to SSISDB in target server
Write-Output "Creating job credentials for connecting to SSISDB..."
$JobCredSecure = ConvertTo-SecureString -String $PasswordForSSISDBCleanupUser -AsPlainText -Force
$JobCred = New-Object -TypeName "System.Management.Automation.PSCredential" -ArgumentList
"SSISDBLogCleanupUser", $JobCredSecure
$JobCred = $JobAgent | New-AzureRmSqlElasticJobCredential -Name "SSISDBLogCleanupUser" -Credential $JobCred
# Create SSISDB log clean-up user from login in SSISDB and grant it permissions to invoke SSISDB log clean-
up stored procedure
Write-Output "Grant appropriate permissions on SSISDB..."
$TargetDatabase = $SSISDBName
$CreateJobUser = "CREATE USER SSISDBLogCleanupUser FROM LOGIN SSISDBLogCleanupUser"
$GrantStoredProcedureExecution = "GRANT EXECUTE ON internal.cleanup_server_retention_window_exclusive TO
SSISDBLogCleanupUser"
# Run your job to immediately invoke SSISDB log clean-up stored procedure once
if ($RunJobOrNot -eq 'Y')
{
Write-Output "Invoking SSISDB log clean-up stored procedure immediately..."
$JobExecution = $Job | Start-AzureRmSqlElasticJob
$JobExecution
}
# Schedule your job to invoke SSISDB log clean-up stored procedure periodically, deleting SSISDB logs
outside the retention window
Write-Output "Starting your schedule to invoke SSISDB log clean-up stored procedure periodically..."
$Job | Set-AzureRmSqlElasticJob -IntervalType $IntervalType -IntervalCount $IntervalCount -StartTime
$StartTime -Enable
-- Connect to the job database specified when creating your job agent.
-- Create a database master key if one doesn't already exist, using your own password.
CREATE MASTER KEY ENCRYPTION BY PASSWORD= '<EnterStrongPasswordHere>';
3. Define your target group that includes only SSISDB to clean up.
4. Create SSISDB log clean-up user from login in SSISDB and grant it permissions to invoke SSISDB log
clean-up stored procedure. For detailed guidance, see Manage logins.
-- Connect to SSISDB
CREATE USER SSISDBLogCleanupUser FROM LOGIN SSISDBLogCleanupUser;
GRANT EXECUTE ON internal.cleanup_server_retention_window_exclusive TO SSISDBLogCleanupUser
5. Create your job and add your job step to invoke SSISDB log clean-up stored procedure.
6. Before continuing, make sure you set the retention window properly. SSISDB logs outside this window
will be deleted and can't be recovered. You can then run your job immediately to start SSISDB log clean-
up.
7. Optionally, you can delete SSISDB logs outside the retention window on a schedule. Configure your job
parameters as follows.
Next steps
To manage and monitor your Azure-SSIS IR, see the following articles.
Reconfigure the Azure-SSIS integration runtime
Monitor the Azure-SSIS integration runtime.
Use Azure SQL Managed Instance with SQL Server
Integration Services (SSIS) in Azure Data Factory
3/26/2021 • 8 minutes to read • Edit Online
For more information, see Allow public endpoint traffic on the network security group.
When Azure-SSIS IR is inside a virtual network
There is a special scenario where SQL Managed Instance is in a region that Azure-SSIS IR does not support,
and Azure-SSIS IR is inside a virtual network without VNet peering due to the global VNet peering
limitation. In this scenario, Azure-SSIS IR inside a virtual network connects to SQL Managed Instance over
the public endpoint. Use the following Network Security Group (NSG) rules to allow traffic between SQL
Managed Instance and Azure-SSIS IR:
a. Inbound requirement of SQL Managed Instance, to allow inbound traffic from Azure-SSIS IR.
[Table: NSG rules with columns Transport protocol, Source, Source port range, Destination, Destination port range, and Comments. The rule rows are not reproduced here. One comment notes that at the NIC-level NSG, port 3389 is open by default and you can control port 3389 at the subnet-level NSG, while Azure-SSIS IR disallows port 3389 outbound by default through a Windows Firewall rule on each IR node for protection.]
Next steps
Execute SSIS packages by Azure SQL Managed Instance Agent job
Set up Business continuity and disaster recovery (BCDR)
Migrate on-premises SSIS workloads to SSIS in ADF
Migrate SQL Server Agent jobs to ADF with SSMS
3/5/2021 • 3 minutes to read • Edit Online
NOTE
Only the File System package location is supported.
migrate applicable jobs with applicable job steps to corresponding ADF resources as follows:
SQL Agent job object: SSIS job step
ADF resource: Execute SSIS Package activity
Notes: The name of the activity will be <step name>. The proxy account used in the job step will be migrated as the Windows authentication of this activity. Execution options except Use 32-bit runtime defined in the job step will be ignored in migration. Verification defined in the job step will be ignored in migration.
generate Azure Resource Manager (ARM) templates in a local output folder, and deploy them to the data factory
directly or manually later. For more information about ADF Resource Manager templates, see Microsoft.DataFactory
resource types.
Prerequisites
The feature described in this article requires SQL Server Management Studio version 18.5 or higher. To get the
latest version of SSMS, see Download SQL Server Management Studio (SSMS).
2. Sign in to Azure, then select your Azure Subscription, Data Factory, and Integration Runtime. Azure Storage is
optional; it's used in the package location mapping step if the SSIS jobs to be migrated have SSIS File System
packages.
3. Map the paths of SSIS packages and configuration files in SSIS jobs to destination paths where migrated
pipelines can access. In this mapping step, you can:
a. Select a source folder, then select Add Mapping.
b. Update the source folder path. Valid paths are folder paths or parent folder paths of packages.
c. Update the destination folder path. The default is a relative path to the default storage account, which is
selected in step 1.
d. Delete a selected mapping via Delete Mapping.
4. Select applicable jobs to migrate, and configure the settings of the corresponding Execute SSIS Package
activity.
Default Setting applies to all selected steps by default. For more information about each property, see the
Settings tab for the Execute SSIS Package activity when the package location is File System (Package).
Connect to Azure-SSIS IR
Once your Azure-SSIS IR is provisioned, you can connect to it to browse its package stores on SSMS.
On the Object Explorer window of SSMS, select Azure-SSIS Integration Runtime in the Connect drop-
down menu. Next, sign in to Azure and select the relevant subscription, ADF, and Azure-SSIS IR that you've
provisioned with package stores. Your Azure-SSIS IR will appear with Running Packages and Stored
Packages nodes underneath. Expand the Stored Packages node to see your package stores underneath.
Expand your package stores to see folders and packages underneath. You may be asked to enter the access
credentials for your package stores, if SSMS fails to connect to them automatically. For example, if you expand a
package store on top of MSDB, you may be asked to connect to your Azure SQL Managed Instance first.
Manage folders and packages
After you connect to your Azure-SSIS IR on SSMS, you can right-click on any package stores, folders, or
packages to pop up a menu and select New Folder, Import Package, Export Package, Delete, or Refresh.
Select New Folder to create a new folder for imported packages.
Select Import Package to import packages from File System, SQL Server (MSDB), or the legacy SSIS
Package Store into your package store.
Depending on the package location to import from, select the relevant Server/Authentication type,
enter the access credentials if necessary, select the Package path, and enter the new Package name.
When importing packages, their protection level can't be changed. To change it, use SQL Server Data
Tools (SSDT) or the dtutil command-line utility.
NOTE
Importing SSIS packages into Azure-SSIS IR package stores can only be done one-by-one and will simply copy
them into the underlying MSDB/file system/Azure Files while preserving their SQL Server/SSIS version.
Since Azure-SSIS IR is currently based on SQL Server 2017, executing lower-version packages on it will upgrade
them into SSIS 2017 packages at run time. Executing higher-version packages is unsupported.
Additionally, since legacy SSIS package stores are bound to a specific SQL Server version and accessible only in
SSMS for that version, lower-version packages in legacy SSIS package stores need to be exported into the file
system first, using the designated SSMS version, before they can be imported into Azure-SSIS IR package stores
using SSMS 2019 or later versions.
Alternatively, to import multiple SSIS packages into Azure-SSIS IR package stores while switching their protection
level, you can use the dtutil command-line utility; see Deploying multiple packages with dtutil.
Select Export Package to export packages from your package store into File System, SQL Server
(MSDB), or the legacy SSIS Package Store.
Depending on the package location to export into, select the relevant Server/Authentication type,
enter the access credentials if necessary, and select the Package path. When exporting packages, if
they're encrypted, enter the passwords to decrypt them first; you can then change their protection
level, for example to avoid storing any sensitive data, or to encrypt it or all data with a user key or
password.
NOTE
Exporting SSIS packages from Azure-SSIS IR package stores can only be done one-by-one and doing so without
switching their protection level will simply copy them while preserving their SQL Server/SSIS version, otherwise it
will upgrade them into SSIS 2019 or later-version packages.
Since Azure-SSIS IR is currently based on SQL Server 2017, executing lower-version packages on it will upgrade
them into SSIS 2017 packages at run time. Executing higher-version packages is unsupported.
Alternatively, to export multiple SSIS packages from Azure-SSIS IR package stores while switching their protection
level, you can use the dtutil command-line utility; see Deploying multiple packages with dtutil.
Execute packages
After you connect to your Azure-SSIS IR on SSMS, you can right-click on any stored packages to pop up a menu
and select Run Package . This will open the Execute Package Utility dialog, where you can configure your
package executions on Azure-SSIS IR as Execute SSIS Package activities in ADF pipelines.
The General , Configurations , Execution Options , and Logging pages of Execute Package Utility dialog
correspond to the Settings tab of Execute SSIS Package activity. On these pages, you can enter the encryption
password for your package and access information for your package configuration file. You can also enter your
package execution credentials and properties, as well as the access information for your log folder. The Set
Values page of the Execute Package Utility dialog corresponds to the Property Overrides tab of the Execute SSIS
Package activity, where you can enter your existing package properties to override. For more information, see
Run SSIS packages as Execute SSIS Package activities in ADF pipelines.
When you select the Execute button, a new ADF pipeline with Execute SSIS Package activity will be
automatically generated and triggered. If an ADF pipeline with the same settings already exists, it will be rerun
and a new pipeline won't be generated. The ADF pipeline and Execute SSIS Package activity will be named
Pipeline_SSMS_YourPackageName_HashString and Activity_SSMS_YourPackageName , respectively.
Monitor and stop running packages
After you connect to your Azure-SSIS IR on SSMS, you can expand the Running Packages node to see your
currently running packages underneath. Right-click on any of them to pop up a menu and select Stop or
Refresh .
Select Stop to cancel the currently running ADF pipeline that runs the package as Execute SSIS Package
activity.
Select Refresh to show newly added folders/packages in your package stores and running packages
from your package stores.
Deploying multiple packages with dtutil
To lift & shift your on-premises SSIS workloads onto SSIS in ADF while maintaining the legacy Package
Deployment Model, you need to deploy your packages from file system, MSDB hosted by SQL Server, or legacy
SSIS package stores into Azure Files, MSDB hosted by Azure SQL Managed Instance, or Azure-SSIS IR package
stores. At the same time, you should also switch their protection level from encryption by user key to
unencrypted or encryption by password if you haven't done so already.
You can use the dtutil command-line utility that comes with the SQL Server/SSIS installation to deploy multiple
packages in batches. It's bound to a specific SSIS version, so if you use it to deploy lower-version packages
without switching their protection level, it will simply copy them while preserving their SSIS version. If you use it
to deploy them and switch their protection level at the same time, it will upgrade them into its SSIS version.
Since Azure-SSIS IR is currently based on SQL Server 2017, executing lower-version packages on it will
upgrade them into SSIS 2017 packages at run time. Executing higher-version packages is unsupported.
Consequently, to avoid run-time upgrades, deploying packages to run on Azure-SSIS IR in Package Deployment
Model should use dtutil 2017, which comes with the SQL Server/SSIS 2017 installation. You can download and
install the free SQL Server/SSIS 2017 Developer Edition for this purpose. Once installed, you can find dtutil 2017
in this folder: YourLocalDrive:\Program Files\Microsoft SQL Server\140\DTS\Binn.
Deploying multiple packages from file system on premises into Azure Files with dtutil
To deploy multiple packages from file system into Azure Files and switch their protection level at the same time,
you can run the following commands at a command prompt. Please replace all strings that are specific to your
case.
REM Persist the access credentials for Azure Files on your local machine
cmdkey /ADD:YourStorageAccountName.file.core.windows.net /USER:azure\YourStorageAccountName
/PASS:YourStorageAccountKey
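REM Map drive Z: to your Azure Files share before running dtutil below (assumed drive letter; adjust the share name to your environment)
REM For example: net use Z: \\YourStorageAccountName.file.core.windows.net\YourFileShare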
REM Run dtutil in a loop to deploy your packages from the local folder into Azure Files while switching
their protection level
for %f in (*.dtsx) do dtutil.exe /FILE %f /ENCRYPT FILE;Z:\%f;2;YourEncryptionPassword
BEGIN
SELECT 'dtutil /SQL '+f.foldername+'\'+NAME+' /ENCRYPT
SQL;'+f.foldername+'\'+NAME+';2;YourEncryptionPassword /DestServer YourSQLManagedInstanceEndpoint /DestUser
YourSQLAuthUsername /DestPassword YourSQLAuthPassword'
FROM msdb.dbo.sysssispackages p
inner join msdb.dbo.sysssispackagefolders f
ON p.folderid = f.folderid
END
To use the private/public endpoint of your Azure SQL Managed Instance, replace
YourSQLManagedInstanceEndpoint with YourSQLMIName.YourDNSPrefix.database.windows.net /
YourSQLMIName.public.YourDNSPrefix.database.windows.net,3342 , respectively.
The script will generate dtutil command lines for all packages in MSDB that you can multiselect, copy & paste,
and run at a command prompt.
If you've configured Azure-SSIS IR package stores on top of MSDB, your deployed packages will appear in them
when you connect to your Azure-SSIS IR on SSMS 2019 or later versions.
Deploying multiple packages from MSDB on premises into Azure Files with dtutil
To deploy multiple packages from MSDB hosted by SQL Server or legacy SSIS package stores on top of MSDB
into Azure Files and switch their protection level at the same time, you can connect to your SQL Server in
SSMS, right-click the Databases -> System Databases -> msdb node in the Object Explorer of SSMS to open a New
Query window, and run the following T-SQL script. Replace all strings that are specific to your case:
BEGIN
SELECT 'dtutil /SQL '+f.foldername+'\'+NAME+' /ENCRYPT
FILE;Z:\'+f.foldername+'\'+NAME+'.dtsx;2;YourEncryptionPassword'
FROM msdb.dbo.sysssispackages p
inner join msdb.dbo.sysssispackagefolders f
ON p.folderid = f.folderid
END
The script will generate dtutil command lines for all packages in MSDB that you can multiselect, copy & paste,
and run at a command prompt.
REM Persist the access credentials for Azure Files on your local machine
cmdkey /ADD:YourStorageAccountName.file.core.windows.net /USER:azure\YourStorageAccountName
/PASS:YourStorageAccountKey
REM Multiselect, copy & paste, and run the T-SQL-generated dtutil command lines to deploy your packages from
MSDB on premises into Azure Files while switching their protection level
dtutil /SQL YourFolder\YourPackage1 /ENCRYPT FILE;Z:\YourFolder\YourPackage1.dtsx;2;YourEncryptionPassword
dtutil /SQL YourFolder\YourPackage2 /ENCRYPT FILE;Z:\YourFolder\YourPackage2.dtsx;2;YourEncryptionPassword
dtutil /SQL YourFolder\YourPackage3 /ENCRYPT FILE;Z:\YourFolder\YourPackage3.dtsx;2;YourEncryptionPassword
If you've configured Azure-SSIS IR package stores on top of Azure Files, your deployed packages will appear in
them when you connect to your Azure-SSIS IR on SSMS 2019 or later versions.
Next steps
You can rerun/edit the auto-generated ADF pipelines with Execute SSIS Package activities or create new ones on
ADF portal. For more information, see Run SSIS packages as Execute SSIS Package activities in ADF pipelines.
Create a trigger that runs a pipeline on a schedule
6/25/2021 • 20 minutes to read • Edit Online
Data Factory UI
You can create a schedule trigger to schedule a pipeline to run periodically (hourly, daily, etc.).
NOTE
For a complete walkthrough of creating a pipeline and a schedule trigger, which associates the trigger with the pipeline,
and runs and monitors the pipeline, see Quickstart: create a data factory using Data Factory UI.
NOTE
For time zones that observe daylight saving time, the trigger time auto-adjusts for the twice-a-year change. To
opt out of the daylight saving change, select a time zone that does not observe daylight saving time, for
instance UTC.
d. Specify Recurrence for the trigger. Select one of the values from the drop-down list (Every
minute, Hourly, Daily, Weekly, and Monthly). Enter the multiplier in the text box. For example, if you
want the trigger to run once every 15 minutes, you select Every Minute and enter 15 in the
text box.
e. Under Recurrence, if you choose "Day(s), Week(s) or Month(s)" from the drop-down, you can find
Advanced recurrence options.
f. To specify an end date and time, select Specify an End Date, specify Ends On, and then select OK.
There is a cost associated with each pipeline run. If you are testing, you may want to ensure that
the pipeline is triggered only a couple of times. However, ensure that there is enough time for the
pipeline to run between the publish time and the end time. The trigger comes into effect only after
you publish the solution to Data Factory, not when you save the trigger in the UI.
5. In the New Trigger window, select Yes for the Activated option, and then select OK. You can use this
checkbox to deactivate the trigger later.
6. In the New Trigger window, review the warning message, then select OK .
7. Select Publish all to publish the changes to Data Factory. Until you publish the changes to Data Factory,
the trigger doesn't start triggering the pipeline runs.
8. Switch to the Pipeline runs tab on the left, then select Refresh to refresh the list. You will see the
pipeline runs triggered by the scheduled trigger. Notice the values in the Triggered By column. If you
use the Trigger Now option, you will see the manual trigger run in the list.
Azure PowerShell
NOTE
This article has been updated to use the Azure Az PowerShell module. The Az PowerShell module is the recommended
PowerShell module for interacting with Azure. To get started with the Az PowerShell module, see Install Azure PowerShell.
To learn how to migrate to the Az PowerShell module, see Migrate Azure PowerShell from AzureRM to Az.
This section shows you how to use Azure PowerShell to create, start, and monitor a schedule trigger. To see this
sample working, first go through the Quickstart: Create a data factory by using Azure PowerShell. Then, add the
following code to the main method, which creates and starts a schedule trigger that runs every 15 minutes. The
trigger is associated with a pipeline named Adfv2QuickStartPipeline that you create as part of the Quickstart.
1. Create a JSON file named MyTrigger.json in the C:\ADFv2QuickStartPSH\ folder with the following
content:
IMPORTANT
Before you save the JSON file, set the value of the startTime element to the current UTC time. Set the value of
the endTime element to one hour past the current UTC time.
{
"properties": {
"name": "MyTrigger",
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Minute",
"interval": 15,
"startTime": "2017-12-08T00:00:00Z",
"endTime": "2017-12-08T01:00:00Z",
"timeZone": "UTC"
}
},
"pipelines": [{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "Adfv2QuickStartPipeline"
},
"parameters": {
"inputPath": "adftutorial/input",
"outputPath": "adftutorial/output"
}
}
]
}
}
IMPORTANT
For the UTC time zone, startTime and endTime need to follow the format 'yyyy-MM-ddTHH:mm:ssZ', while for
other time zones, startTime and endTime follow 'yyyy-MM-ddTHH:mm:ss'.
Per the ISO 8601 standard, the Z suffix marks the datetime as UTC and makes the timeZone field redundant;
a missing Z suffix for the UTC time zone results in an error upon trigger activation.
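For example, a minimal PowerShell sketch for generating startTime and endTime values in the required UTC format (using the current time, per the note above):
$startTime = (Get-Date).ToUniversalTime().ToString("s") + "Z"   # e.g. 2021-06-25T13:00:00Z
$endTime   = (Get-Date).ToUniversalTime().AddHours(1).ToString("s") + "Z"
$startTime
$endTime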
The trigger is associated with the Adfv2QuickStartPipeline pipeline. To associate multiple
pipelines with a trigger, add more pipelineReference sections.
The pipeline in the Quickstart takes two parameter values: inputPath and outputPath, and you
pass the values for these parameters from the trigger.
2. Create a trigger by using the Set-AzDataFactoryV2Trigger cmdlet:
5. Confirm that the status of the trigger is Started by using the Get-AzDataFactoryV2Trigger cmdlet:
6. Get the trigger runs in Azure PowerShell by using the Get-AzDataFactoryV2TriggerRun cmdlet. To get
the information about the trigger runs, execute the following command periodically. Update the
TriggerRunStartedAfter and TriggerRunStartedBefore values to match the values in your trigger
definition:
NOTE
Trigger times of schedule triggers are specified in UTC timestamps. TriggerRunStartedAfter and
TriggerRunStartedBefore also expect UTC timestamps.
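A minimal Azure PowerShell sketch of steps 2 through 6 above (the $resourceGroupName and $dataFactoryName variables are assumed to come from the Quickstart; adjust them to your own names):
# Step 2: create the trigger from the JSON definition file
Set-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name "MyTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyTrigger.json"

# Start the trigger (triggers are created in the Stopped state)
Start-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -Name "MyTrigger" -Force

# Step 5: confirm that the trigger's runtime state is Started
Get-AzDataFactoryV2Trigger -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -Name "MyTrigger"

# Step 6: list trigger runs within a UTC time window that matches your trigger definition
Get-AzDataFactoryV2TriggerRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName `
    -TriggerName "MyTrigger" -TriggerRunStartedAfter "2017-12-08T00:00:00" -TriggerRunStartedBefore "2017-12-08T01:00:00"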
To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
.NET SDK
This section shows you how to use the .NET SDK to create, start, and monitor a trigger. To see this sample
working, first go through the Quickstart: Create a data factory by using the .NET SDK. Then, add the following
code to the main method, which creates and starts a schedule trigger that runs every 15 minutes. The trigger is
associated with a pipeline named Adfv2QuickStartPipeline that you create as part of the Quickstart.
To create and start a schedule trigger that runs every 15 minutes, add the following code to the main method:
// Create the trigger
Console.WriteLine("Creating the trigger");
To create triggers in a time zone other than UTC, the following settings are required:
<<ClientInstance>>.SerializationSettings.DateFormatHandling =
Newtonsoft.Json.DateFormatHandling.IsoDateFormat;
<<ClientInstance>>.SerializationSettings.DateTimeZoneHandling =
Newtonsoft.Json.DateTimeZoneHandling.Unspecified;
<<ClientInstance>>.SerializationSettings.DateParseHandling = DateParseHandling.None;
<<ClientInstance>>.DeserializationSettings.DateParseHandling = DateParseHandling.None;
<<ClientInstance>>.DeserializationSettings.DateFormatHandling =
Newtonsoft.Json.DateFormatHandling.IsoDateFormat;
<<ClientInstance>>.DeserializationSettings.DateTimeZoneHandling =
Newtonsoft.Json.DateTimeZoneHandling.Unspecified;
To monitor a trigger run, add the following code before the last Console.WriteLine statement in the sample:
// Check that the trigger runs every 15 minutes
Console.WriteLine("Trigger runs. You see the output every 15 minutes");
To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
Python SDK
This section shows you how to use the Python SDK to create, start, and monitor a trigger. To see this sample
working, first go through the Quickstart: Create a data factory by using the Python SDK. Then, add the following
code block after the "monitor the pipeline run" code block in the Python script. This code creates a schedule
trigger that runs every 15 minutes between the specified start and end times. Update the start_time variable to
the current UTC time, and the end_time variable to one hour past the current UTC time.
# Create a trigger
tr_name = 'mytrigger'
scheduler_recurrence = ScheduleTriggerRecurrence(frequency='Minute', interval='15', start_time='2017-12-12T04:00:00Z', end_time='2017-12-12T05:00:00Z', time_zone='UTC')
pipeline_parameters = {'inputPath':'adftutorial/input', 'outputPath':'adftutorial/output'}
pipelines_to_run = []
pipeline_reference = PipelineReference('copyPipeline')
pipelines_to_run.append(TriggerPipelineReference(pipeline_reference, pipeline_parameters))
tr_properties = ScheduleTrigger(description='My scheduler trigger', pipelines = pipelines_to_run,
recurrence=scheduler_recurrence)
adf_client.triggers.create_or_update(rg_name, df_name, tr_name, tr_properties)
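# To start the trigger so it begins firing, call the triggers start operation on the client.
# The exact method name depends on your azure-mgmt-datafactory version (older releases expose
# triggers.start, newer ones triggers.begin_start), for example:
# adf_client.triggers.start(rg_name, df_name, tr_name)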
To monitor the trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
"parameters": {
"scheduledRunTime": "@trigger().scheduledTime"
}
JSON schema
The following JSON definition shows you how to create a schedule trigger with scheduling and recurrence:
{
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": <<Minute, Hour, Day, Week, Month>>,
"interval": <<int>>, // Optional, specifies how often to fire (default to 1)
"startTime": <<datetime>>,
"endTime": <<datetime - optional>>,
"timeZone": "UTC"
"schedule": { // Optional (advanced scheduling specifics)
"hours": [<<0-23>>],
"weekDays": [<<Monday-Sunday>>],
"minutes": [<<0-59>>],
"monthDays": [<<1-31>>],
"monthlyOccurrences": [
{
"day": <<Monday-Sunday>>,
"occurrence": <<1-5>>
}
]
}
}
},
"pipelines": [
{
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "<Name of your pipeline>"
},
"parameters": {
"<parameter 1 Name>": {
"type": "Expression",
"value": "<parameter 1 Value>"
},
"<parameter 2 Name>" : "<parameter 2 Value>"
}
}
]
}
}
IMPORTANT
The parameters property is a mandatory property of the pipelines element. If your pipeline doesn't take any
parameters, you must include an empty JSON definition for the parameters property.
Schema overview
The following table provides a high-level overview of the major schema elements that are related to recurrence
and scheduling of a trigger:
startTime: A date-time value. For simple schedules, the value of the startTime property applies to the first
occurrence. For complex schedules, the trigger starts no sooner than the specified startTime value. For the UTC
time zone, the format is 'yyyy-MM-ddTHH:mm:ssZ'; for other time zones, the format is 'yyyy-MM-ddTHH:mm:ss'.
endTime: The end date and time for the trigger. The trigger doesn't execute after the specified end date and
time. The value for the property can't be in the past. This property is optional. For the UTC time zone, the format
is 'yyyy-MM-ddTHH:mm:ssZ'; for other time zones, the format is 'yyyy-MM-ddTHH:mm:ss'.
timeZone: The time zone the trigger is created in. This setting impacts startTime, endTime, and schedule. See
the list of supported time zones.
recurrence: A recurrence object that specifies the recurrence rules for the trigger. The recurrence object
supports the frequency, interval, endTime, count, and schedule elements. When a recurrence object is defined,
the frequency element is required. The other elements of the recurrence object are optional.
interval: A positive integer that denotes the interval for the frequency value, which determines how often the
trigger runs. For example, if the interval is 3 and the frequency is "week," the trigger recurs every 3 weeks.
IMPORTANT
For the UTC time zone, startTime and endTime need to follow the format 'yyyy-MM-ddTHH:mm:ssZ', while for other
time zones, startTime and endTime follow 'yyyy-MM-ddTHH:mm:ss'.
Per the ISO 8601 standard, the Z suffix marks the datetime as UTC and makes the timeZone field redundant; a
missing Z suffix for the UTC time zone results in an error upon trigger activation.
startTime: String. Required: Yes. Default value: None. Allowed values: ISO-8601 date-times, for example
"startTime": "2013-01-09T09:30:00Z" for the UTC time zone, or "2013-01-09T09:30:00-08:00" for other time zones.
This list is incomplete. For the complete list of time zone options, explore the trigger creation page in the Data
Factory portal.
startTime property
The following table shows you how the startTime property controls a trigger run:
Start time in the past, without a schedule: The trigger calculates the first future execution time after the start
time and runs at that time. Subsequent executions are calculated from the last execution time.
Start time in the past, with a schedule: The trigger starts no sooner than the specified start time. The first
occurrence is based on the schedule, calculated from the start time. Subsequent executions are based on the
recurrence schedule.
Start time in the future or at present, without a schedule: The trigger runs once at the specified start time.
Subsequent executions are calculated from the last execution time.
Start time in the future or at present, with a schedule: The trigger starts no sooner than the specified start
time. The first occurrence is based on the schedule, calculated from the start time. Subsequent executions are
based on the recurrence schedule.
Let's see an example of what happens when the start time is in the past, with a recurrence, but no schedule.
Assume that the current time is 2017-04-08 13:00 , the start time is 2017-04-07 14:00 , and the recurrence is
every two days. (The recurrence value is defined by setting the frequency property to "day" and the interval
property to 2.) Notice that the startTime value is in the past and occurs before the current time.
Under these conditions, the first execution is at 2017-04-09 at 14:00 . The Scheduler engine calculates execution
occurrences from the start time. Any instances in the past are discarded. The engine uses the next instance that
occurs in the future. In this scenario, the start time is 2017-04-07 at 2:00pm , so the next instance is two days
from that time, which is 2017-04-09 at 2:00pm .
The first execution time is the same even if the startTime value is 2017-04-05 14:00 or 2017-04-01 14:00. After
the first execution, subsequent executions are calculated by using the schedule. Therefore, the subsequent
executions are at 2017-04-11 at 2:00pm , then 2017-04-13 at 2:00pm , then 2017-04-15 at 2:00pm , and so on.
Finally, when the hours or minutes aren’t set in the schedule for a trigger, the hours or minutes of the first
execution are used as the defaults.
schedule property
On one hand, the use of a schedule can limit the number of trigger executions. For example, if a trigger with a
monthly frequency is scheduled to run only on day 31, the trigger runs only in those months that have a 31st
day.
On the other hand, a schedule can also expand the number of trigger executions. For example, a trigger with a
monthly frequency that's scheduled to run on month days 1 and 2 runs on the 1st and 2nd days of the month,
rather than once a month.
If multiple schedule elements are specified, the order of evaluation is from the largest to the smallest schedule
setting. The evaluation starts with week number, and then month day, weekday, hour, and finally, minute.
The following table describes the schedule elements in detail:
weekDays: Days of the week on which the trigger runs. The value can be specified with a weekly frequency only.
Allowed values: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, as an array of day values
(maximum array size is 7); day values are not case-sensitive.
monthDays: Days of the month on which the trigger runs. The value can be specified with a monthly frequency
only. Allowed values: any value >= 1 and <= 31, or any value <= -1 and >= -31, as an array of values.
{"minutes":[15,45], "hours":[5,17]} Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM every day.
{hours":[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, Run every hour. This trigger runs every hour. The minutes
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]} are controlled by the star tTime value, when a value is
specified. If a value not specified, the minutes are controlled
by the creation time. For example, if the start time or
creation time (whichever applies) is 12:25 PM, the trigger
runs at 00:25, 01:25, 02:25, ..., and 23:25.
{"minutes":[0]} Run every hour on the hour. This trigger runs every hour on
the hour starting at 12:00 AM, 1:00 AM, 2:00 AM, and so
on.
{"minutes":[15]} Run at 15 minutes past every hour. This trigger runs every
hour at 15 minutes past the hour starting at 00:15 AM, 1:15
AM, 2:15 AM, and so on, and ending at 11:15 PM.
{"hours":[17], "weekDays":["monday", "wednesday", Run at 5:00 PM on Monday, Wednesday, and Friday every
"friday"]} week.
{"minutes":[15,45], "hours":[17], "weekDays": Run at 5:15 PM and 5:45 PM on Monday, Wednesday, and
["monday", "wednesday", "friday"]} Friday every week.
{"minutes":[0,15,30,45], "hours": [9, 10, 11, 12, Run every 15 minutes on weekdays between 9:00 AM and
13, 14, 15, 16] "weekDays":["monday", "tuesday", 4:45 PM.
"wednesday", "thursday", "friday"]}
{"weekDays":["tuesday", "thursday"]} Run on Tuesdays and Thursdays at the specified start time.
{"minutes":[0], "hours":[6], "monthDays":[28]} Run at 6:00 AM on the 28th day of every month (assuming
a frequency value of "month").
{"minutes":[0], "hours":[6], "monthDays":[-1]} Run at 6:00 AM on the last day of the month. To run a
trigger on the last day of a month, use -1 instead of day 28,
29, 30, or 31.
{"minutes":[0], "hours":[6], "monthDays":[1,-1]} Run at 6:00 AM on the first and last day of every month.
{monthDays":[1,14]} Run on the first and 14th day of every month at the
specified start time.
{"minutes":[0], "hours":[5], "monthlyOccurrences": Run on the first Friday of every month at 5:00 AM.
[{"day":"friday", "occurrence":1}]}
{"monthlyOccurrences":[{"day":"friday", Run on the first Friday of every month at the specified start
"occurrence":1}]} time.
{"monthlyOccurrences":[{"day":"friday", Run on the third Friday from the end of the month, every
"occurrence":-3}]} month, at the specified start time.
{"minutes":[15], "hours":[5], "monthlyOccurrences": Run on the first and last Friday of every month at 5:15 AM.
[{"day":"friday", "occurrence":1},{"day":"friday",
"occurrence":-1}]}
{"monthlyOccurrences":[{"day":"friday", Run on the first and last Friday of every month at the
"occurrence":1},{"day":"friday", "occurrence":-1}]} specified start time.
{"monthlyOccurrences":[{"day":"friday", Run on the fifth Friday of every month at the specified start
"occurrence":5}]} time. When there's no fifth Friday in a month, the pipeline
doesn't run, since it's scheduled to run only on fifth Fridays.
To run the trigger on the last occurring Friday of the month,
consider using -1 instead of 5 for the occurrence value.
{"minutes":[0,15,30,45], "monthlyOccurrences": Run every 15 minutes on the last Friday of the month.
[{"day":"friday", "occurrence":-1}]}
{"minutes":[15,45], "hours":[5,17], Run at 5:15 AM, 5:45 AM, 5:15 PM, and 5:45 PM on the
"monthlyOccurrences":[{"day":"wednesday", third Wednesday of every month.
"occurrence":3}]}
Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Learn how to reference trigger metadata in pipeline, see Reference Trigger Metadata in Pipeline Runs
Create a trigger that runs a pipeline on a tumbling
window
3/22/2021 • 8 minutes to read • Edit Online
Data Factory UI
1. To create a tumbling window trigger in the Data Factory UI, select the Triggers tab, and then select New .
2. After the trigger configuration pane opens, select Tumbling Window , and then define your tumbling
window trigger properties.
3. When you're done, select Save .
The following table provides a high-level overview of the major JSON elements that are related to recurrence
and scheduling of a tumbling window trigger:
retryPolicy: Count — The number of retries before the pipeline run is marked as "Failed." Type: Integer.
Allowed values: an integer, where the default is 0 (no retries). Required: No.
dependsOn: offset — The offset of the dependency trigger. Type: Timespan (hh:mm:ss). Allowed values: a
timespan value that must be negative in a self-dependency; if no value is specified, the window is the same as
the trigger itself. Required: Self-Dependency: Yes; Other: No.
NOTE
After a tumbling window trigger is published, interval and frequency can't be edited.
To use the WindowStart and WindowEnd system variable values in the pipeline definition, use your
"MyWindowStart" and "MyWindowEnd" parameters, accordingly.
Execution order of windows in a backfill scenario
If the startTime of the trigger is in the past, then based on the formula M = (CurrentTime -
TriggerStartTime) / TumblingWindowSize, the trigger will generate {M} backfill (past) runs in parallel, honoring
trigger concurrency, before executing the future runs. The order of execution for windows is deterministic, from
oldest to newest intervals. Currently, this behavior can't be modified.
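For example, a minimal PowerShell sketch that estimates M for an assumed one-hour window size and a start time in the past:
$triggerStartTime   = ([datetime]"2021-01-01T00:00:00Z").ToUniversalTime()
$tumblingWindowSize = New-TimeSpan -Hours 1
$elapsed            = (Get-Date).ToUniversalTime() - $triggerStartTime
$M = [math]::Floor($elapsed.Ticks / $tumblingWindowSize.Ticks)
$M   # number of past windows the trigger will backfill, oldest first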
Existing TriggerResource elements
The following points apply to update of existing TriggerResource elements:
The value for the frequency element (or window size) of the trigger, along with the interval element, cannot be
changed once the trigger is created. This is required for proper functioning of triggerRun reruns and
dependency evaluations.
If the value for the endTime element of the trigger changes (added or updated), the state of the windows
that are already processed is not reset. The trigger honors the new endTime value. If the new endTime
value is before the windows that are already executed, the trigger stops. Otherwise, the trigger stops when
the new endTime value is encountered.
User assigned retries of pipelines
In case of pipeline failures, tumbling window trigger can retry the execution of the referenced pipeline
automatically, using the same input parameters, without the user intervention. This can be specified using the
property "retryPolicy" in the trigger definition.
Tumbling window trigger dependency
If you want to make sure that a tumbling window trigger is executed only after the successful execution of
another tumbling window trigger in the data factory, create a tumbling window trigger dependency.
Cancel tumbling window run
You can cancel runs for a tumbling window trigger if the specific window is in the Waiting, Waiting on Dependency,
or Running state.
If the window is in the Running state, cancel the associated pipeline run, and the trigger run will be marked as
Canceled afterwards.
If the window is in the Waiting or Waiting on Dependency state, you can cancel the window from Monitoring.
You can also rerun a canceled window. The rerun will take the latest published definitions of the trigger, and
dependencies for the specified window will be re-evaluated upon rerun.
This section shows you how to use Azure PowerShell to create, start, and monitor a trigger.
1. Create a JSON file named MyTrigger.json in the C:\ADFv2QuickStartPSH\ folder with the following
content:
IMPORTANT
Before you save the JSON file, set the value of the star tTime element to the current UTC time. Set the value of
the endTime element to one hour past the current UTC time.
{
"name": "PerfTWTrigger",
"properties": {
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Minute",
"interval": "15",
"startTime": "2017-09-08T05:30:00Z",
"delay": "00:00:01",
"retryPolicy": {
"count": 2,
"intervalInSeconds": 30
},
"maxConcurrency": 50
},
"pipeline": {
"pipelineReference": {
"type": "PipelineReference",
"referenceName": "DynamicsToBlobPerfPipeline"
},
"parameters": {
"windowStart": "@trigger().outputs.windowStartTime",
"windowEnd": "@trigger().outputs.windowEndTime"
}
},
"runtimeState": "Started"
}
}
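To create the trigger from this definition file, you can use the Set-AzDataFactoryV2Trigger cmdlet (a minimal sketch; the variable names follow the Quickstart):
Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName `
    -Name "PerfTWTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\MyTrigger.json"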
3. Confirm that the status of the trigger is Stopped by using the Get-AzDataFactoryV2Trigger cmdlet:
5. Confirm that the status of the trigger is Started by using the Get-AzDataFactoryV2Trigger cmdlet:
6. Get the trigger runs in Azure PowerShell by using the Get-AzDataFactoryV2TriggerRun cmdlet. To get
information about the trigger runs, execute the following command periodically. Update the
TriggerRunStartedAfter and TriggerRunStartedBefore values to match the values in your trigger
definition:
Get-AzDataFactoryV2TriggerRun -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName
-TriggerName "MyTrigger" -TriggerRunStartedAfter "2017-12-08T00:00:00" -TriggerRunStartedBefore
"2017-12-08T01:00:00"
To monitor trigger runs and pipeline runs in the Azure portal, see Monitor pipeline runs.
Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Create a tumbling window trigger dependency.
Learn how to reference trigger metadata in pipeline, see Reference Trigger Metadata in Pipeline Runs
Create a tumbling window trigger dependency
3/5/2021 • 4 minutes to read • Edit Online
The following table provides the list of attributes needed to define a Tumbling Window dependency.
NOTE
A tumbling window trigger can depend on a maximum of five other triggers.
NOTE
If your triggered pipeline relies on the output of pipelines in previously triggered windows, we recommend using only
tumbling window trigger self-dependency. To limit parallel trigger runs, set the maximum trigger concurrency.
{
"name": "DemoSelfDependency",
"properties": {
"runtimeState": "Started",
"pipeline": {
"pipelineReference": {
"referenceName": "Demo",
"type": "PipelineReference"
}
},
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Hour",
"interval": 1,
"startTime": "2018-10-04T00:00:00Z",
"delay": "00:01:00",
"maxConcurrency": 50,
"retryPolicy": {
"intervalInSeconds": 30
},
"dependsOn": [
{
"type": "SelfDependencyTumblingWindowTriggerReference",
"size": "01:00:00",
"offset": "-01:00:00"
}
]
}
}
}
Self-dependency
Monitor dependencies
You can monitor the dependency chain and the corresponding windows from the trigger run monitoring page.
Navigate to Monitoring > Trigger Runs . If a Tumbling Window trigger has dependencies, Trigger Name will
bear a hyperlink to dependency monitoring view.
Click through the trigger name to view trigger dependencies. The right-hand panel shows detailed trigger run
information, such as run ID, window time, status, and so on.
You can see the status of the dependencies and windows for each dependent trigger. If one of the dependency
triggers fails, you must successfully rerun it in order for the dependent trigger to run.
A tumbling window trigger will wait on dependencies for seven days before timing out. After seven days, the
trigger run will fail.
For a more visual way to view the trigger dependency schedule, select the Gantt view.
Transparent boxes show the dependency windows for each downstream dependent trigger, while solid-colored
boxes above show individual window runs. Here are some tips for interpreting the Gantt chart view:
A transparent box renders blue when dependent windows are in a pending or running state.
After all windows succeed for a dependent trigger, the transparent box turns green.
A transparent box renders red when a dependent window fails. Look for a solid red box to identify the
failed window run.
To rerun a window in Gantt chart view, select the solid-colored box for the window, and an action panel will pop up
with details and rerun options.
Next steps
Review How to create a tumbling window trigger
Create a trigger that runs a pipeline in response to
a storage event
4/2/2021 • 9 minutes to read • Edit Online
NOTE
The integration described in this article depends on Azure Event Grid. Make sure that your subscription is registered with
the Event Grid resource provider. For more info, see Resource providers and types. You must be able to do the
Microsoft.EventGrid/eventSubscriptions/* action. This action is part of the EventGrid EventSubscription Contributor built-
in role.
Data Factory UI
This section shows you how to create a storage event trigger within the Azure Data Factory User Interface.
1. Switch to the Edit tab, shown with a pencil symbol.
2. Select Trigger on the menu, then select New/Edit .
3. On the Add Triggers page, select Choose trigger..., then select +New .
4. Select trigger type Storage Event
5. Select your storage account from the Azure subscription dropdown or manually using its Storage account
resource ID. Choose which container you wish the events to occur on. Container selection is required, but
be mindful that selecting all containers can lead to a large number of events.
NOTE
The Storage Event Trigger currently supports only Azure Data Lake Storage Gen2 and General-purpose version 2
storage accounts. Due to an Azure Event Grid limitation, Azure Data Factory only supports a maximum of 500
storage event triggers per storage account. If you hit the limit, please contact support for recommendations;
the limit can be increased upon evaluation by the Event Grid team.
NOTE
To create a new or modify an existing Storage Event Trigger, the Azure account used to log into Data Factory and
publish the storage event trigger must have appropriate role based access control (Azure RBAC) permission on
the storage account. No additional permission is required: Service Principal for the Azure Data Factory does not
need special permission to either the Storage account or Event Grid. For more information about access control,
see Role based access control section.
6. The Blob path begins with and Blob path ends with properties allow you to specify the containers,
folders, and blob names for which you want to receive events. Your storage event trigger requires at least
one of these properties to be defined. You can use variety of patterns for both Blob path begins with
and Blob path ends with properties, as shown in the examples later in this article.
Blob path begins with: The blob path must start with a folder path. Valid values include 2018/ and
2018/april/shoes.csv . This field can't be selected if a container isn't selected.
Blob path ends with: The blob path must end with a file name or extension. Valid values include
shoes.csv and .csv. Container and folder names, when specified, must be separated by a
/blobs/ segment. For example, a container named 'orders' can have a value of
/orders/blobs/2018/april/shoes.csv. To specify a folder in any container, omit the leading '/' character.
For example, april/shoes.csv will trigger an event on any file named shoes.csv in a folder called
'april' in any container.
Note that Blob path begins with and ends with are the only pattern matching allowed in Storage
Event Trigger. Other types of wildcard matching aren't supported for the trigger type.
7. Select whether your trigger will respond to a Blob created event, Blob deleted event, or both. In your
specified storage location, each event will trigger the Data Factory pipelines associated with the trigger.
8. Select whether or not your trigger ignores blobs with zero bytes.
9. After you configure your trigger, click on Next: Data preview. This screen shows the existing blobs
matched by your storage event trigger configuration. Make sure you've configured specific filters. Configuring
filters that are too broad can match a large number of files created/deleted and may significantly impact your
cost. Once your filter conditions have been verified, click Finish.
10. To attach a pipeline to this trigger, go to the pipeline canvas and click Trigger and select New/Edit . When
the side nav appears, click on the Choose trigger... dropdown and select the trigger you created. Click
Next: Data preview to confirm the configuration is correct and then Next to validate the Data preview
is correct.
11. If your pipeline has parameters, you can specify them on the trigger runs parameter side nav. The storage
event trigger captures the folder path and file name of the blob into the properties
@triggerBody().folderPath and @triggerBody().fileName . To use the values of these properties in a
pipeline, you must map the properties to pipeline parameters. After mapping the properties to
parameters, you can access the values captured by the trigger through the
@pipeline().parameters.parameterName expression throughout the pipeline. For detailed explanation, see
Reference Trigger Metadata in Pipelines
In the preceding example, the trigger is configured to fire when a blob path ending in .csv is created in the
folder event-testing in the container sample-data. The folderPath and fileName properties capture the
location of the new blob. For example, when MoviesDB.csv is added to the path sample-data/event-
testing, @triggerBody().folderPath has a value of sample-data/event-testing and
@triggerBody().fileName has a value of moviesDB.csv . These values are mapped, in the example, to the
pipeline parameters sourceFolder and sourceFile , which can be used throughout the pipeline as
@pipeline().parameters.sourceFolder and @pipeline().parameters.sourceFile respectively.
NOTE
If you are creating your pipeline and trigger in Azure Synapse Analytics, you must use
@trigger().outputs.body.fileName and @trigger().outputs.body.folderPath as parameters. Those two
properties capture blob information. Use those properties instead of using @triggerBody().fileName and
@triggerBody().folderPath .
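As a hedged illustration of the mapping described above, the sketch below writes a storage event trigger definition to a file and creates it with Azure PowerShell; the BlobEventsTrigger property names, the scope resource ID, and the file path are assumptions for this example, not values taken from this article:
# Assumed sketch: trigger on .csv blobs created under sample-data/event-testing and pass the
# blob folder path and file name to pipeline parameters.
$triggerDefinition = @'
{
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/sample-data/blobs/event-testing/",
            "blobPathEndsWith": ".csv",
            "ignoreEmptyBlobs": true,
            "scope": "/subscriptions/<subscriptionId>/resourceGroups/<resourceGroup>/providers/Microsoft.Storage/storageAccounts/<storageAccount>",
            "events": [ "Microsoft.Storage.BlobCreated" ]
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "<Name of your pipeline>"
                },
                "parameters": {
                    "sourceFolder": "@triggerBody().folderPath",
                    "sourceFile": "@triggerBody().fileName"
                }
            }
        ]
    }
}
'@
Set-Content -Path "C:\ADFv2QuickStartPSH\StorageEventTrigger.json" -Value $triggerDefinition
Set-AzDataFactoryV2Trigger -ResourceGroupName $ResourceGroupName -DataFactoryName $DataFactoryName `
    -Name "StorageEventTrigger" -DefinitionFile "C:\ADFv2QuickStartPSH\StorageEventTrigger.json"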
JSON schema
The following table provides an overview of the schema elements that are related to storage event triggers:
IMPORTANT
You have to include the /blobs/ segment of the path, as shown in the following examples, whenever you specify
container and folder, container and file, or container, folder, and file. For blobPathBeginsWith , the Data Factory UI will
automatically add /blobs/ between the folder and container name in the trigger JSON.
Blob path begins with /containername/ Receives events for any blob in the
container.
Blob path begins with /containername/blobs/foldername/ Receives events for any blobs in the
containername container and
foldername folder.
Blob path ends with file.txt Receives events for a blob named
file.txt in any path.
Blob path ends with /containername/blobs/file.txt Receives events for a blob named
file.txt under container
containername .
Blob path ends with foldername/file.txt Receives events for a blob named
file.txt in foldername folder
under any container.
Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Learn how to reference trigger metadata in pipeline, see Reference Trigger Metadata in Pipeline Runs
Create a custom event trigger to run a pipeline in
Azure Data Factory (preview)
5/7/2021 • 4 minutes to read • Edit Online
NOTE
The integration described in this article depends on Azure Event Grid. Make sure that your subscription is registered with
the Event Grid resource provider. For more information, see Resource providers and types. You must be able to do the
Microsoft.EventGrid/eventSubscriptions/* action. This action is part of the EventGrid EventSubscription Contributor
built-in role.
If you combine pipeline parameters and a custom event trigger, you can parse and reference custom data
payloads in pipeline runs. Because the data field in a custom event payload is a free-form, JSON key-value
structure, you can control event-driven pipeline runs.
IMPORTANT
If a key referenced in parameterization is missing in the custom event payload, trigger run will fail. You'll get an error
that states the expression cannot be evaluated because property keyName doesn't exist. In this case, no pipeline run
will be triggered by the event.
NOTE
The workflow is different from Storage Event Trigger. Here, Data Factory doesn't set up the topic for you.
Data Factory expects events to follow the Event Grid event schema. Make sure event payloads have the
following fields:
[
{
"topic": string,
"subject": string,
"id": string,
"eventType": string,
"eventTime": string,
"data":{
object-unique-to-each-publisher
},
"dataVersion": string,
"metadataVersion": string
}
]
6. Select your custom topic from the Azure subscription dropdown or manually enter the event topic scope.
NOTE
To create or modify a custom event trigger in Data Factory, you need to use an Azure account with appropriate
role-based access control (Azure RBAC). No additional permission is required. The Data Factory service principal
does not require special permission to your Event Grid. For more information about access control, see the Role-
based access control section.
7. The Subject begins with and Subject ends with properties allow you to filter for trigger events. Both
properties are optional.
8. Use + New to add event types to filter on. The list of event types uses an OR relationship: when a
custom event's eventType property matches one on the list, a pipeline run is triggered. The event type is
case insensitive. For example, in the following screenshot, the trigger matches all copycompleted or
copysucceeded events that have a subject that begins with factories.
9. A custom event trigger can parse and send a custom data payload to your pipeline. You create the
pipeline parameters, and then fill in the values on the Parameters page. Use the format
@triggerBody().event.data._keyName_ to parse the data payload and pass values to the pipeline
parameters.
For a detailed explanation, see the following articles:
Reference trigger metadata in pipelines
System variables in custom event trigger
10. After you've entered the parameters, select OK .
JSON schema
The following table provides an overview of the schema elements that are related to custom event triggers:
Next steps
Get detailed information about trigger execution.
Learn how to reference trigger metadata in pipeline runs.
Reference trigger metadata in pipeline runs
3/17/2021 • 2 minutes to read • Edit Online
NOTE
Different trigger types provide different metadata. For more information, see System Variables.
Data Factory UI
This section shows you how to pass meta data information from trigger to pipeline, within the Azure Data
Factory User Interface.
1. Go to the Authoring Canvas and edit a pipeline
2. Click on the blank canvas to bring up pipeline settings. Do not select any activity. You may need to pull up
the settings panel from the bottom of the canvas, as it may have been collapsed.
3. Select the Parameters section and select + New to add parameters.
4. Add triggers to the pipeline by clicking on + Trigger.
5. Create or attach a trigger to the pipeline, and click OK.
6. On the following page, fill in trigger metadata for each parameter. Use the format defined in System Variables
to retrieve trigger information. You don't need to fill in the information for all parameters, just the ones
that will assume trigger metadata values. For instance, here we assign the trigger run start time to
parameter_1.
JSON schema
To pass trigger information to pipeline runs, both the trigger and the pipeline JSON definitions need to be
updated with a parameters section.
Pipeline definition
Under the properties section, add parameter definitions to the parameters section:
{
"name": "demo_pipeline",
"properties": {
"activities": [
{
"name": "demo_activity",
"type": "WebActivity",
"dependsOn": [],
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"userProperties": [],
"typeProperties": {
"url": {
"value": "@pipeline().parameters.parameter_2",
"type": "Expression"
},
"method": "GET"
}
}
],
"parameters": {
"parameter_1": {
"type": "string"
},
"parameter_2": {
"type": "string"
},
"parameter_3": {
"type": "string"
},
"parameter_4": {
"type": "string"
},
"parameter_5": {
"type": "string"
}
},
"annotations": [],
"lastPublishTime": "2021-02-24T03:06:23Z"
},
"type": "Microsoft.DataFactory/factories/pipelines"
}
Trigger definition
Under the pipelines section, assign parameter values in the parameters section. You don't need to fill in the
information for all parameters, just the ones that will assume trigger metadata values:
{
"name": "trigger1",
"properties": {
"annotations": [],
"runtimeState": "Started",
"pipelines": [
{
"pipelineReference": {
"referenceName": "demo_pipeline",
"type": "PipelineReference"
},
"parameters": {
"parameter_1": "@trigger().startTime"
}
}
],
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Minute",
"interval": 15,
"startTime": "2021-03-03T04:38:00Z",
"timeZone": "UTC"
}
}
}
}
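The same pattern applies to other trigger types; only the expressions change. As a sketch, a storage event trigger
could map the triggering blob's folder path and file name alongside the trigger start time. The expressions below
follow the storage event trigger system variables, and the parameter names reuse the pipeline above; see System
variables for the full list per trigger type.
"parameters": {
    "parameter_1": "@trigger().startTime",
    "parameter_2": "@triggerBody().folderPath",
    "parameter_3": "@triggerBody().fileName"
}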
Next steps
For detailed information about triggers, see Pipeline execution and triggers.
Connect Data Factory to Azure Purview (Preview)
5/6/2021 • 2 minutes to read
2. You can choose From Azure subscription or Enter manually . From Azure subscription , you can select
the account that you have access to.
3. Once connected, you should be able to see the name of the Purview account in the tab Pur view account .
4. You can use the Search bar at the top center of the Azure Data Factory portal to search for data.
If you see a warning in the Azure Data Factory portal after you register an Azure Purview account to Data Factory,
follow the steps below to fix the issue:
1. Go to the Azure portal and find your data factory. Choose the "Tags" section and check whether there is a tag
named catalogUri . If not, disconnect and reconnect the Azure Purview account in the ADF portal.
2. Check whether permission is granted for registering an Azure Purview account to Data Factory. See How to
connect Azure Data Factory and Azure Purview.
Register Data Factory in Azure Purview
For how to register Data Factory in Azure Purview, see How to connect Azure Data Factory and Azure Purview.
Next steps
Catalog lineage user guide
Tutorial: Push Data Factory lineage data to Azure Purview
Discover and explore data in ADF using Purview
5/6/2021 • 2 minutes to read
Prerequisites
Azure Purview account
Data Factory
Connect an Azure Purview account to Data Factory
Actions that you can perform over datasets with Data Factory resources
You can directly create a Linked Service, Dataset, or data flow over the data that you find through Azure Purview search.
Next steps
Register and scan Azure Data Factory assets in Azure Purview
How to Search Data in Azure Purview Data Catalog
Use Azure Data Factory to migrate data from your
data lake or data warehouse to Azure
3/5/2021 • 2 minutes to read
NOTE
By using online migration, you can achieve both historical data loading and incremental feeds end-to-end through a
single tool. Through this approach, your data can be kept synchronized between the existing store and the new store
during the entire migration window. This means you can rebuild your ETL logic on the new store with refreshed data.
Next steps
Migrate data from AWS S3 to Azure
Migrate data from on-premises Hadoop cluster to Azure
Migrate data from on-premises Netezza server to Azure
Use Azure Data Factory to migrate data from
Amazon S3 to Azure Storage
3/5/2021 • 8 minutes to read
Performance
ADF offers a serverless architecture that allows parallelism at different levels, so developers can build
pipelines that fully utilize your network bandwidth as well as storage IOPS and bandwidth to maximize data
movement throughput for your environment.
Customers have successfully migrated petabytes of data consisting of hundreds of millions of files from
Amazon S3 to Azure Blob Storage, with a sustained throughput of 2 GBps and higher.
You can achieve these data movement speeds through different levels of parallelism (a configuration sketch
follows this list):
A single copy activity can take advantage of scalable compute resources: when using Azure Integration
Runtime, you can specify up to 256 DIUs for each copy activity in a serverless manner; when using self-
hosted Integration Runtime, you can manually scale up the machine or scale out to multiple machines (up to
4 nodes), and a single copy activity will partition its file set across all nodes.
A single copy activity reads from and writes to the data store using multiple threads.
ADF control flow can start multiple copy activities in parallel, for example, by using a ForEach loop.
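As a minimal sketch of how these knobs appear in a copy activity definition, you can set the data integration
units and the degree of parallel copies explicitly. The fragment below omits dataset references and store
settings, and the values shown are illustrative starting points rather than recommendations.
{
    "name": "CopyFromS3ToBlob",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "BinarySource" },
        "sink": { "type": "BinarySink" },
        "dataIntegrationUnits": 256,
        "parallelCopies": 32
    }
}
If you leave dataIntegrationUnits and parallelCopies unset, the service chooses values automatically, which is
often the better starting point until you've measured throughput for your environment.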
Resilience
Within a single copy activity run, ADF has a built-in retry mechanism, so it can handle a certain level of transient
failures in the data stores or in the underlying network.
When doing binary copying from S3 to Blob and from S3 to ADLS Gen2, ADF automatically performs
checkpointing. If a copy activity run has failed or timed out, on a subsequent retry, the copy resumes from the
last failure point instead of starting from the beginning.
Network security
By default, ADF transfers data from Amazon S3 to Azure Blob Storage or Azure Data Lake Storage Gen2 using
encrypted connection over HTTPS protocol. HTTPS provides data encryption in transit and prevents
eavesdropping and man-in-the-middle attacks.
Alternatively, if you don't want data to be transferred over the public internet, you can achieve higher security by
transferring data over a private peering link between AWS Direct Connect and Azure ExpressRoute. Refer to the
solution architecture below for how this can be achieved.
Solution architecture
Migrate data over the public internet:
In this architecture, data is transferred securely using HTTPS over the public internet.
Both the source Amazon S3 and the destination Azure Blob Storage or Azure Data Lake Storage Gen2 are
configured to allow traffic from all network IP addresses. Refer to the second architecture below for how
you can restrict network access to a specific IP range.
You can easily scale up the amount of horsepower in a serverless manner to fully utilize your network and
storage bandwidth so that you can get the best throughput for your environment.
Both initial snapshot migration and delta data migration can be achieved using this architecture.
Migrate data over private link:
In this architecture, data migration is done over a private peering link between AWS Direct Connect and
Azure ExpressRoute, so that data never traverses the public internet. It requires the use of an AWS VPC and an
Azure virtual network.
You need to install ADF self-hosted integration runtime on a Windows VM within your Azure virtual network
to achieve this architecture. You can manually scale up your self-hosted IR VMs or scale out to multiple VMs
(up to 4 nodes) to fully utilize your network and storage IOPS/bandwidth.
If it's acceptable to transfer data over HTTPS but you want to lock down network access to the source S3 to a
specific IP range, you can adopt a variation of this architecture by removing the AWS VPC and replacing the
private link with HTTPS. You'll want to keep the Azure virtual network and the self-hosted IR on an Azure VM so
that you have a static, publicly routable IP for filtering purposes.
Both initial snapshot data migration and delta data migration can be achieved using this architecture.
NOTE
This is a hypothetical pricing example. Your actual pricing depends on the actual throughput in your environment.
Consider the following pipeline constructed for migrating data from S3 to Azure Blob Storage:
Additional references
Amazon Simple Storage Service connector
Azure Blob Storage connector
Azure Data Lake Storage Gen2 connector
Copy activity performance tuning guide
Creating and configuring self-hosted Integration Runtime
Self-hosted integration runtime HA and scalability
Data movement security considerations
Store credentials in Azure Key Vault
Copy file incrementally based on time partitioned file name
Copy new and changed files based on LastModifiedDate
ADF pricing page
Template
Here is the template to start with to migrate petabytes of data consisting of hundreds of millions of files from
Amazon S3 to Azure Data Lake Storage Gen2.
Next steps
Copy files from multiple containers with Azure Data Factory
Use Azure Data Factory to migrate data from an
on-premises Hadoop cluster to Azure Storage
3/5/2021 • 9 minutes to read
Performance
In Data Factory DistCp mode, throughput is the same as if you use the DistCp tool independently. Data Factory
DistCp mode maximizes the capacity of your existing Hadoop cluster. You can use DistCp for large inter-cluster
or intra-cluster copying.
DistCp uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of
files and directories into input for task mapping. Each task copies a file partition that's specified in the source list.
You can use Data Factory integrated with DistCp to build pipelines to fully utilize your network bandwidth,
storage IOPS, and bandwidth to maximize data movement throughput for your environment.
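As a sketch of how DistCp mode is enabled, the HDFS source of a copy activity carries a distcpSettings block. The
endpoint, script path, and options below are placeholders, and you should confirm the exact property names and
placement against the HDFS connector article for your Data Factory version.
"source": {
    "type": "HdfsSource",
    "distcpSettings": {
        "resourceManagerEndpoint": "https://<resource-manager-host>:8088",
        "tempScriptPath": "/tmp/adf-distcp-scripts",
        "distcpOptions": "-m 100 -strategy dynamic"
    }
}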
Data Factory native integration runtime mode also allows parallelism at different levels. You can use parallelism
to fully utilize your network bandwidth, storage IOPS, and bandwidth to maximize data movement throughput:
A single copy activity can take advantage of scalable compute resources. With a self-hosted integration
runtime, you can manually scale up the machine or scale out to multiple machines (up to four nodes). A
single copy activity partitions its file set across all nodes.
A single copy activity reads from and writes to the data store by using multiple threads.
Data Factory control flow can start multiple copy activities in parallel. For example, you can use a For Each
loop.
For more information, see the copy activity performance guide.
Resilience
In Data Factory DistCp mode, you can use different DistCp command-line parameters (for example, -i to ignore
failures, or -update to write data when the source file and the destination file differ in size) for different levels of
resilience.
In the Data Factory native integration runtime mode, in a single copy activity run, Data Factory has a built-in
retry mechanism. It can handle a certain level of transient failures in the data stores or in the underlying
network.
When doing binary copying from on-premises HDFS to Blob storage and from on-premises HDFS to Data Lake
Storage Gen2, Data Factory automatically performs checkpointing to a large extent. If a copy activity run fails or
times out, on a subsequent retry (make sure that retry count is > 1), the copy resumes from the last failure point
instead of starting at the beginning.
Network security
By default, Data Factory transfers data from on-premises HDFS to Blob storage or Azure Data Lake Storage
Gen2 by using an encrypted connection over HTTPS protocol. HTTPS provides data encryption in transit and
prevents eavesdropping and man-in-the-middle attacks.
Alternatively, if you don't want data to be transferred over the public internet, for higher security, you can
transfer data over a private peering link via ExpressRoute.
Solution architecture
This image depicts migrating data over the public internet:
In this architecture, data is transferred securely by using HTTPS over the public internet.
We recommend using Data Factory DistCp mode in a public network environment. You can take advantage
of a powerful existing cluster to achieve the best copy throughput. You also get the benefit of flexible
scheduling and unified monitoring experience from Data Factory.
For this architecture, you must install the Data Factory self-hosted integration runtime on a Windows
machine behind a corporate firewall to submit the DistCp command to your Hadoop cluster and to monitor
the copy status. Because the machine isn't the engine that moves data (it's used for control purposes only), the
capacity of the machine doesn't affect the throughput of data movement.
Existing parameters from the DistCp command are supported.
This image depicts migrating data over a private link:
In this architecture, data is migrated over a private peering link via Azure ExpressRoute. Data never traverses
over the public internet.
The DistCp tool doesn't support ExpressRoute private peering with an Azure Storage virtual network
endpoint. We recommend that you use Data Factory's native capability via the integration runtime to migrate
the data.
For this architecture, you must install the Data Factory self-hosted integration runtime on a Windows VM in
your Azure virtual network. You can manually scale up your VM or scale out to multiple VMs to fully utilize
your network and storage IOPS or bandwidth.
The recommended configuration to start with for each Azure VM (with the Data Factory self-hosted
integration runtime installed) is Standard_D32s_v3 with 32 vCPUs and 128 GB of memory. You can monitor
the CPU and memory usage of the VM during data migration to see whether you need to scale up the VM for
better performance or to scale down the VM to reduce cost.
You can also scale out by associating up to four VM nodes with a single self-hosted integration runtime. A
single copy job running against a self-hosted integration runtime automatically partitions the file set and
makes use of all VM nodes to copy the files in parallel. For high availability, we recommend that you start
with two VM nodes to avoid a single-point-of-failure scenario during data migration.
When you use this architecture, initial snapshot data migration and delta data migration are available to you.
Additional references
HDFS connector
Azure Blob storage connector
Azure Data Lake Storage Gen2 connector
Copy activity performance tuning guide
Create and configure a self-hosted integration runtime
Self-hosted integration runtime high availability and scalability
Data movement security considerations
Store credentials in Azure Key Vault
Copy a file incrementally based on a time-partitioned file name
Copy new and changed files based on LastModifiedDate
Data Factory pricing page
Next steps
Copy files from multiple containers by using Azure Data Factory
Use Azure Data Factory to migrate data fro