Azure Data Factory Compressed
Azure Data Factory (ADF) is a cloud service for data integration. Its components include the following:

A tool for authoring and monitoring the execution of pipelines and the activities they depend on.

Figure 1: An ADF pipeline controls the execution of activities, each of which runs on an integration runtime.
It’s often possible to configure a direct connection between Azure and on-premises data sources (if you do, you don’t need to use the Self-Hosted IR), but not always. For example, setting up a direct connection from Azure to an on-premises data source might require working with your network administrator to configure your firewall in a specific way, something admins aren’t always happy to do.
The Self-Hosted IR exists for situations like this. It provides a way for an
ADF pipeline to use an activity that runs outside Azure while giving it a
direct connection back to the cloud.
A single pipeline can use many different Self-Hosted IRs, along with the
Azure IR, depending on where its activities need to execute. It’s entirely
possible, for example, that a single pipeline uses activities running on
Azure, on AWS, inside your organization, and in a partner organization.
All but the activities on Azure could run on instances of the Self-Hosted
IR.
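To make this concrete, here is a minimal sketch of registering a Self-Hosted IR with a data factory using the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and runtime names are hypothetical placeholders, and exact model names and parameters can vary by SDK version; the returned authentication key is what you would use when installing the runtime software on a machine outside Azure.

```python
# Minimal sketch: define a Self-Hosted IR in a data factory and fetch its auth keys.
# Assumes the azure-identity and azure-mgmt-datafactory packages; all resource names
# below are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

subscription_id = "<subscription-id>"
rg_name = "example-rg"              # hypothetical resource group
df_name = "example-factory"         # hypothetical data factory
ir_name = "example-selfhosted-ir"   # hypothetical integration runtime name

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Register the Self-Hosted IR with the factory. The runtime software is then
# installed on a machine inside your network (or on another cloud) and
# registered using one of the auth keys retrieved below.
ir_resource = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(description="Runs activities outside Azure")
)
adf_client.integration_runtimes.create_or_update(rg_name, df_name, ir_name, ir_resource)

keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, ir_name)
print(keys.auth_key1)
```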
Scenarios
To get a sense of how you can use ADF pipelines, it’s helpful to look at real scenarios. This section describes two:

1. Building a modern data warehouse on Azure, and
2. Providing the data analysis back end for a Software as a Service (SaaS) application.

Data warehouses have traditionally run in an organization’s own data center. Going forward, however, data warehouses are moving into the cloud. There are some excellent reasons for this, including low-cost data storage (which means you can store more data) and massive amounts of processing power (which lets you do more analysis on that data).

In any case, creating a modern data warehouse in the cloud requires a way to automate data integration throughout your environment. ADF pipelines are designed to do precisely this. Figure 2 shows an example of data movement and processing that can be automated using ADF pipelines.

Figure 2: A modern data warehouse loads diverse data into a data lake, does some processing on that data, then loads a relevant subset into a relational data warehouse for analysis.

In this scenario, data is first extracted from an on-premises Oracle database and Salesforce.com (step 1). This data isn’t moved directly into the data warehouse, however. Instead, it’s copied into a data lake, a much less expensive form of storage implemented using either Blob Storage or Azure Data Lake.
Unlike a relational data warehouse, a data lake typically stores data in its
original form. If this data is relational, the data lake can store traditional
tables. But if it’s not relational (you might be working with a stream of
tweets, for example, or clickstream data from a web application), the
data lake stores your data in whatever form it’s in.
Why do this?
Rather than using a data lake, why not transform the data as
needed and dump it directly into a data warehouse?
The answer stems from the fact that organizations are storing ever-
larger amounts of increasingly diverse data. Some of that data might be
worth processing and copying into a data warehouse, but much of it
might not.
Now that the data has been prepared and had some initial analysis, it’s finally time to load it into SQL Data Warehouse (step 4).

This processing looks much like what’s required to create and maintain an enterprise data warehouse, and ADF pipelines can be used to automate the work. Figure 3 shows an example of how this might look.

This scenario looks much like the previous example. It begins with data extracted from various sources into a data lake (step 1).

The resulting data isn’t typically loaded into a relational data warehouse, however. Instead, this data is a fundamental part of the service the application provides to its users. Accordingly, it’s copied into the operational database this application uses, which in this example is Azure Cosmos DB (step 4).

Unlike the scenario shown in Figure 2, the primary goal here isn’t to allow interactive queries on the data through standard BI tools (although an ISV might also provide that for its internal use). Instead, it’s to give the SaaS application the data it needs to support its users, who access this app through a browser or device (step 5). And as in the previous scenario, an ADF pipeline can be used to automate this entire process.

Several applications already use ADF for scenarios like these, including Adobe Marketing Cloud and Lumdex, a healthcare data intelligence company. As big data becomes increasingly important, expect to see others follow suit.
A Closer Look at Pipelines
Understanding the basics of ADF pipelines isn’t hard. Figure 4 shows the components of a simple example.

One way to start a pipeline running is to execute it on demand. You can do this through PowerShell, by calling a RESTful API, through .NET, or by using Python.

A pipeline can also start executing because of some trigger. For example, ADF provides a scheduler trigger that starts a pipeline running at a specific time. However it starts, a pipeline always runs in some Azure data center.

The activities a pipeline uses might run either on the Azure IR, which is also in an Azure data center, or on the Self-Hosted IR, which runs either on-premises or on another cloud platform. The pipeline shown in Figure 4 uses both options.

Figure 4: A pipeline executes one or more activities, each carrying out a step in a data integration workflow.
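As a concrete illustration of the on-demand option, the sketch below starts a pipeline run with the azure-mgmt-datafactory Python SDK and then checks its status. The subscription, resource group, factory, and pipeline names are placeholders, and the same operation is also available through PowerShell, the REST API, and .NET.

```python
# Minimal sketch: start a pipeline run on demand and check its status.
# Assumes azure-identity and azure-mgmt-datafactory; all names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"
rg_name = "example-rg"              # hypothetical resource group
df_name = "example-factory"         # hypothetical data factory
pipeline_name = "example-pipeline"  # hypothetical pipeline

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Kick off the pipeline; pipeline parameters (if any) are passed as a dict.
run = adf_client.pipelines.create_run(rg_name, df_name, pipeline_name, parameters={})

# Look up the run to see whether it is queued, in progress, succeeded, or failed.
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
print(pipeline_run.status)
```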
Using Activities

Pipelines are the operation’s boss, but activities do the actual work. For example, in the pipeline shown in Figure 4:

4. Once the processing is complete, the pipeline invokes another Copy activity, this time to move the processed data from Blobs into SQL Data Warehouse.

The example in Figure 4 gives you an idea of what activities can do, but it’s pretty simple. Activities can do much more. For example, the Copy activity is a general-purpose tool for moving data between supported data stores. These activities can also scale out, letting you run loops and more in parallel for better performance.
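To give a feel for what a Copy activity looks like when defined programmatically, here is a minimal sketch using the azure-mgmt-datafactory model classes. The dataset names are hypothetical placeholders, a Blob-to-Blob copy is assumed purely for illustration, and exact constructor parameters can vary by SDK version.

```python
# Minimal sketch: define a Copy activity that moves data between two datasets.
# Assumes azure-mgmt-datafactory; dataset names are hypothetical placeholders.
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
)

copy_activity = CopyActivity(
    name="CopyRawData",
    inputs=[DatasetReference(type="DatasetReference", reference_name="RawInputDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="CuratedOutputDataset")],
    source=BlobSource(),  # where the data is read from
    sink=BlobSink(),      # where the data is written to
)
print(copy_activity.name)
```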
Authoring Pipelines

Pipelines are described using JavaScript Object Notation (JSON). But many of the people who create pipelines for data integration aren’t developers; they prefer graphical tools. For this audience, ADF provides a web-based tool for authoring and monitoring pipelines. There’s no need to use Visual Studio. Figure 5 shows an example of authoring a simple pipeline.

This example shows the same simple pipeline illustrated earlier in Figure 4. Each of the pipeline’s activities (the two Copies, Spark, and Web) is represented by a rectangle, with arrows defining the connections between them. Some other available activities are shown on the left.

The first Copy activity is highlighted, bringing up space at the bottom to give it a name (used in monitoring the pipeline’s execution), a description, and a way to set parameters for this activity.
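For those who do work with the underlying definitions rather than the graphical tool, here is a minimal sketch of authoring a pipeline programmatically with the azure-mgmt-datafactory Python SDK; the JSON the service stores mirrors this structure. All resource, dataset, and pipeline names are hypothetical placeholders.

```python
# Minimal sketch: author (create or update) a pipeline containing one Copy activity.
# Assumes azure-identity and azure-mgmt-datafactory; all names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink,
    BlobSource,
    CopyActivity,
    DatasetReference,
    PipelineResource,
)

subscription_id = "<subscription-id>"
rg_name = "example-rg"
df_name = "example-factory"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

copy_raw = CopyActivity(
    name="CopyRawData",
    inputs=[DatasetReference(type="DatasetReference", reference_name="RawInputDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagedOutputDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# A pipeline is an ordered collection of activities (plus optional parameters).
pipeline = PipelineResource(activities=[copy_raw], parameters={})
adf_client.pipelines.create_or_update(rg_name, df_name, "ExamplePipeline", pipeline)
```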
Monitoring Pipelines

In a perfect world, all pipelines would complete successfully, and there would be no need to monitor their execution.

In the real world, however, pipelines can fail. One reason is that a single pipeline might interact with multiple cloud services, each of which has its own failure modes.

But whatever the reason, the reality is the same: we need an effective tool for monitoring pipelines. ADF provides this as part of the authoring and monitoring tool. Figure 6 shows an example.

Figure 6: The ADF authoring and monitoring tool lets you monitor pipeline execution, showing when each pipeline started, how long it ran, its current status, and more.

As this example shows, the tool lets you monitor the execution of individual pipelines. You can see when each one started, for example, how it was started, whether it succeeded or failed, and more. A primary goal of this tool is to help you find and fix failures. To help do this, the tool lets you look further into the execution of each pipeline.

The tool also pushes all of its monitoring data to Azure Monitor, the common clearinghouse for monitoring data on Azure.
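The same monitoring information is also exposed programmatically. Below is a minimal sketch that checks a pipeline run and then queries the activity runs inside it using the azure-mgmt-datafactory Python SDK; the run ID and resource names are hypothetical placeholders.

```python
# Minimal sketch: inspect a pipeline run and the activity runs inside it.
# Assumes azure-identity and azure-mgmt-datafactory; names and the run ID are placeholders.
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

subscription_id = "<subscription-id>"
rg_name = "example-rg"
df_name = "example-factory"
run_id = "<pipeline-run-id>"  # e.g., the run_id returned by pipelines.create_run

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Overall status of the pipeline run (Queued, InProgress, Succeeded, Failed, ...).
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run_id)
print(pipeline_run.status)

# Drill into the individual activity runs from the last day to find what failed.
filters = RunFilterParameters(
    last_updated_after=datetime.now() - timedelta(days=1),
    last_updated_before=datetime.now() + timedelta(days=1),
)
activity_runs = adf_client.activity_runs.query_by_pipeline_run(rg_name, df_name, run_id, filters)
for activity_run in activity_runs.value:
    print(activity_run.activity_name, activity_run.status)
```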
Pricing
Pricing for ADF pipelines depends primarily on two factors: how many activities your pipelines run and how much data they move.

Activities that run on the Azure IR are a bit cheaper than those run on the Self-Hosted IR.

You pay by the hour for the compute resources used for data movement, e.g., the data moved by a Copy activity. As with activities, the prices for data movement with the Azure IR vs. the Self-Hosted IR differ (although, in this case, using the Self-Hosted IR is cheaper). You will also incur the standard charges for moving data out of an Azure data center.
It’s also worth noting that you’ll be charged separately for any other
Azure resources your pipeline uses, such as blob storage or a Spark
cluster.
For current details, see the Azure Data Factory pricing page on the Azure website.
Conclusion
Data integration is a critical function in many on-premises data centers. As our industry moves to the cloud, it will remain a fundamental requirement.

Azure Data Factory addresses two main data integration concerns that organizations have today:

1. A way to automate data workflows in Azure, on-premises, and across other clouds using ADF pipelines. This includes the ability to run data transformation activities both on Azure and elsewhere, along with a single view for scheduling, monitoring, and managing your pipelines.
If you’re an Azure user facing these challenges, ADF is almost certainly in your future.