
MODERN DATA ARCHITECTURE
WITH TALEND AND MICROSOFT AZURE

White Paper

Brought to you by:
Table of Contents
Introduction
The Reference Architecture for Talend Cloud with Azure
Modern Data Integration Architecture with Talend on Azure
Data Management Principles for Modern Data Architecture
Agile Delivery Principles for Scalable Enterprise Analytics
Adopting Cloud-Native Architecture Principles
Conclusion



Introduction
As companies face the mandate to quickly migrate their data and analytics
systems into the cloud, many of them miss their target expectations of
increased analytics speed and agility. Modernization efforts that begin with a
“lift-and-shift” strategy of merely taking on-premises systems and moving them
to cloud-based platforms will quickly reveal that a more holistic cloud strategy
including new cloud-native principles is the key to unlocking the potential
of the cloud. For existing data and analytics programs, the modern data
architecture is the next-generation mindset that challenges organizations to
unlearn old habits and pioneer new best practices. Companies need to take the
next steps in what was always intended to be an integration journey – rather
than a one-time migration activity – to fulfill the promises of the cloud with a
modern data architecture.

The goal of a modern data architecture in the cloud is to leverage the benefits
that allow agile teams to build and deliver high-quality data and analytics
environments faster and more efficiently to respond to business needs. Within
this context, modern data management principles and best practices are
emerging and, when combined with cloud-native principles, help fulfill the
cloud promises. This white paper explains these principles and establishes a
reference architecture based on Radiant Advisors’ research of Talend Cloud
customers on the Microsoft Azure platform. Additionally, this paper distills key
success factors customers discovered when using Talend Cloud and modern
data integration.



The Reference Architecture for Talend Cloud
with Azure
The modern data architecture is designed for data architects and data
engineers to enable agile analytic delivery teams, business analysts, and
data scientists to deliver three primary analytics capabilities to the business
in a faster and more agile manner (see Figure 1). The combination of data
architecture and integration is based on data management principles and the
pipelines that support it are designed to deliver cloud-native capabilities. The
reference architecture incorporates technologies and techniques that accelerate
analytics delivery.

Figure 1: Conceptual enterprise analytics ecosystem based on a modern data architecture

[Figure 1 shows a layered ecosystem: data sources (operational systems, 2nd- and 3rd-party systems, public external data, and user data) feed data ingestion (streaming and batch) and engineering data pipelines into an enterprise data lake (raw, curated, processed). A data management layer (data warehouse, labs and sandboxes, data science engines, analytic databases, Spark cluster, data integration, data prep) and a data unification layer (data catalog, governance, collaboration, semantic layer) support the three analytics capabilities at the top: business intelligence and reporting, enterprise self-service data analytics, and data science and AI, served by data platform, support, self-service, and data science teams.]



Two separate – but related – architecture principles are represented in the
reference architecture shown in Figure 2. The data integration architecture
supports batch-oriented and streaming data processing with DataOps principles
for continuous integration and continuous deployment (CI/CD) through open-
source languages and orchestrated deployment options for data pipelines.
Additionally, the data architecture is based on the data management principle
of polyglot persistence, which dictates that the best-suited database technology
is selected for the data classification and workload.

Figure 2: Reference architecture for Talend Cloud deployments with Microsoft Azure

[Figure 2 traces the flow from data sources (on-premises operational systems, websites, cloud apps, Dynamics, Office apps, eCommerce, external and public data, social media, IoT devices) through data ingestion (Replicate, Apache Kafka on HDInsight, Talend Remote Engines, Azure IoT Hub) and data pipelines (streaming processing on Azure Databricks, batch processing on Talend Remote Engines, IoT processing on a Talend Remote Cluster) into the data platform (Azure Synapse Analytics, Azure SQL Database, Azure Cosmos DB, Azure Data Lake Storage), then through analytics services (Azure Machine Learning, Azure Analysis Services, Azure Data Lake Analytics, Talend Data Prep, Jupyter Notebooks, Azure API Manager) to the user experience (Power BI Embedded for consumers, Power BI for executives, Power BI Pro for report developers, business analysts, and data scientists). Talend Cloud provides orchestration, data quality, and governance across the flow.]

This reference architecture illustrates at a high level how data moves from
data sources to analytics end-users through separate data ingestion and data
pipeline stages before arriving in the multi-tiered data platform analytics. These
architecture components will be discussed in more detail throughout this paper.
As we will show, Talend plays a key role in data pipeline deployments and
orchestration, along with data quality and governance.

Talend customers working on the Azure platform shared that
the goals of having "enterprise trusted data" and "more accurate
information" were among the primary reasons
for choosing Talend.



Modern Data Integration Architecture with
Talend on Azure
Following the well-established agile development principle of polyglot
programming, data engineers select the programming language and
deployment option that is best suited for their pipeline development project.
Further, these pipelines are then developed and deployed based on the cloud-
native principle of high accessibility with managed APIs. In this way, a data
engineer can deploy a corresponding API to access the data pipeline or function
from any application. Both Talend Cloud and Azure have the ability to manage
APIs, and interviewed Talend customers tend to leverage both.
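As a minimal sketch of this pattern, an application could invoke a published data pipeline through its managed API as shown below. The endpoint URL, payload fields, and token handling are illustrative assumptions, not the actual Talend Cloud or Azure API Management contract.

    # Minimal sketch: calling a managed API that fronts a data pipeline.
    # The URL, payload fields, and auth scheme are illustrative assumptions,
    # not the actual Talend Cloud or Azure API Management contract.
    import requests

    API_URL = "https://api.example.com/pipelines/customer-cleansing/run"  # hypothetical endpoint
    API_TOKEN = "..."  # issued by the API gateway

    def trigger_pipeline(source_table: str) -> dict:
        """Request an execution of the pipeline for one source table."""
        response = requests.post(
            API_URL,
            json={"sourceTable": source_table},
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()  # e.g., an execution id and status to poll

    if __name__ == "__main__":
        print(trigger_pipeline("sales.orders"))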

Figure 3: Deployment options with Talend Cloud

[Figure 3 shows data ingestion and data pipeline stages under Talend Cloud orchestration: streaming data processing with Apache Kafka on HDInsight and Azure Databricks, batch data processing with Talend Remote Engines, and IoT data processing with Azure IoT Hub and a Talend Remote Cluster.]

As an example, a data engineer can use Talend Cloud to orchestrate their


data pipeline in Spark and deploy it to a Spark cluster, while another data
engineer may choose to generate and orchestrate their data pipeline as a Java
program running on a Linux virtual machine. One Talend customer cited a
preference for deploying to Windows virtual machines because they can run
both .NET and Java data pipelines.



The flexible deployment options in Talend Cloud include remote engines on
Windows or Linux virtual machines (single or clustered) and Spark clusters such as Spark on
Azure HDInsight and Azure Databricks. Figure 3 illustrates how teams can
configure and orchestrate multiple deployment options for data ingestion and
data processing for batch, streaming, and IoT streaming data pipelines.

Talend Cloud customers we interviewed cited the "speed and
flexibility of deploying servers" for development teams with
their own deployment options as one of the top benefits of
selecting Talend Cloud on Azure.

Based on orchestration tips from Talend customer interviews, Figure 4


represents the best practice of having an on-premises remote engine for local
processing and pairs of clustered Linux virtual machines for deployments,
with plans to move to a Spark cluster or Azure Databricks in the future. Further,
one Talend customer attributed part of their success to creating a deployment
server in each of their Dev, QA, and Prod environments and naming each
server accordingly; without this naming convention, they struggled to keep
track of where they were deploying code. They also highly recommended
others should “follow the Talend model for configuration” with the “use
of generic context variables” and “avoid hardcoding anything” to prevent
additional complexity and potential rework in the future. In other tips for
success, another customer found benefits of using Talend Cloud orchestration
rather than Windows Task Scheduler in Windows virtual machine deployments
because Talend facilitates overall centralized management (whereas Windows
Task Scheduler manages only that particular machine). The benefit of
centralized orchestration was also cited by a customer who noted that the way
Talend Cloud “retained the orchestration for Databricks jobs was very handy.”

Talend customers shared that the Talend development environment


speeds pipeline development time because it offers so many components
that minimize manual coding – as opposed to sorting through third-party
component tools. One Talend Cloud customer we interviewed shared the
additional benefit of using the development environment “for maintenance
activities, such as versioning, branching, and promoting pipelines”
to production.



Many Talend customers are still focused on batch-oriented data warehousing
refreshes with an orchestrated series of data reads and writes from data
lakes and staging areas before loading into the data warehouse (see Figure
4). These Talend customers on Azure stated that they intend to adopt
streaming data processing in the future and that they will consider Talend's
streaming capability. This facilitates the data requirements for analytic
applications to deliver answers in near real-time scenarios, such as with
predictions and recommendations.

Figure 4: Data integration principle for separating ingestion and processing

[Figure 4 shows data sources (on-premises operational systems, websites, cloud apps, Dynamics, Office apps, eCommerce) flowing through data ingestion (Replicate, Talend Remote Engine) and batch data processing on a Talend Remote Engine into the data platform (Azure Synapse Analytics, Azure SQL Database, Azure Cosmos DB, Azure Data Lake Storage), with Talend Cloud providing orchestration, data quality, and governance.]

When modernizing traditional ETL to modern data integration architectures,


a recommended data integration principle is to create smaller data pipeline
segments for improved delivery speed and management. (Note: This is
similar to the cloud-native principle to employ microservices and functions, as
discussed later in the paper.) As an initial step, a best practice recommendation
is to isolate the data ingestion processes in order to serve all data consumers
and agile development product teams (as represented in Figure 4).



Figure 5 illustrates the value of having a streaming data hub, such as Apache
Kafka, in the reference architecture to improve data pipeline development
speed by leveraging the work of other data pipelines’ functionality already
deployed upstream (i.e., reusability). Azure offers the Apache Kafka on HDInsight
service, Azure Event Hubs, and Azure IoT Hub as options for streaming
data hubs. A data ingestion pipeline or streaming application is developed to
continuously receive source data and publish it to a Kafka topic dedicated to
that data source, such as a database table. For scheduled data acquisitions, a
Talend data pipeline is developed to connect and acquire a set of records for
the data source. Traditionally, the acquired data is written to a data warehouse
staging area or the data lake, but, ideally, the records are also published to
their own Kafka topic to become a stream of data records.

Figure 5: Integration architecture pattern for streaming data with Talend

[Figure 5 shows Apache Kafka on HDInsight as the streaming data hub: a Talend Remote Cluster acts as a producer, while Azure Databricks, Azure Synapse Analytics, Azure SQL Database, Azure Cosmos DB, and Azure Data Lake Storage subscribe to its topics, with Talend Cloud providing orchestration, data quality, and governance.]

The streaming data hub is the most influential aspect of a modern data
integration architecture, moving it away from the batch-oriented ETL paradigm of
extracting from a source, transforming data, and loading into a data warehouse. This
architecture follows a publish-subscribe paradigm where there can be many



independent and asynchronous subscribers of the same data. Therefore, what
used to be data targets are now subscribers to the streaming data hub. This
isolates the traditional extraction (ingestion) process, transformation processes,
and loading process for more reusability and fault tolerance in an enterprise
environment. (See Figure 5.)

To further break down this process, a data pipeline can be dedicated to data
cleansing or data masking so that other data pipelines don’t have to duplicate
that process. As an example, a traditional ETL job is broken down into several
data pipeline segments. First, a data pipeline is created that uses SQL to acquire
changed data from a source database; it is deployed to an on-premises Talend
remote engine, where Talend Cloud is scheduled to execute the job every 15
minutes. This data ingestion pipeline publishes its data to a topic in Apache
Kafka on Azure HDInsight. The Kafka connector for Azure Data Lake Storage
subscribes to the topic and automatically receives data every 15 minutes.
Another data pipeline is developed to integrate data from several Kafka topics
and writes its output to a different Kafka topic named “integrated data” where
other downstream data pipelines or databases can subscribe to it.
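A minimal sketch of that first ingestion segment is shown below. The real pipeline would be a Talend job deployed to the remote engine; this Python illustration of the same pattern assumes the kafka-python and pyodbc packages, and the connection string, table, and column names are hypothetical.

    # Sketch of the first pipeline segment: acquire changed rows with SQL and
    # publish them to a per-source Kafka topic on HDInsight. Names and
    # connection details are hypothetical.
    import json
    import pyodbc
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="hdinsight-kafka:9092",            # Kafka on HDInsight brokers
        value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
    )

    def publish_changes(last_run_ts: str) -> None:
        """Read rows changed since the last run and publish them to the source topic."""
        conn = pyodbc.connect("DSN=SourceDB")                 # hypothetical DSN
        cursor = conn.cursor()
        cursor.execute(
            "SELECT order_id, customer_id, amount, updated_at "
            "FROM sales.orders WHERE updated_at > ?", last_run_ts)
        columns = [col[0] for col in cursor.description]
        for row in cursor:
            record = dict(zip(columns, row))
            producer.send("sales.orders", value=record)       # one topic per source table
        producer.flush()
        conn.close()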

Talend customer interviews revealed that developing


and deploying a data pipeline that executes in-database
transformations with SQL is a performant way to leverage the
database compute and keep the resource load on the Talend
remote engine low.

This type of data pipeline that uses SQL statements to transform data inside
of the database is often referred to as extract, load, transform (ELT), or in-
database processing.
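As a minimal illustration of the ELT pattern, the sketch below issues a single set-based SQL statement so the transformation runs inside the database engine rather than on the remote engine. The connection string, tables, and SQL are hypothetical.

    # Sketch of ELT / in-database processing: the pipeline submits a set-based
    # SQL statement and the database does the transformation work, keeping the
    # resource load on the Talend remote engine low. Names are hypothetical.
    import pyodbc

    ELT_SQL = """
    INSERT INTO dw.fact_orders (order_id, customer_key, order_amount, order_date)
    SELECT s.order_id, d.customer_key, s.amount, CAST(s.updated_at AS DATE)
    FROM staging.orders AS s
    JOIN dw.dim_customer AS d ON d.customer_id = s.customer_id
    WHERE s.load_batch_id = ?;
    """

    def run_elt(batch_id: int) -> None:
        """Push the transformation down to the database as a single statement."""
        conn = pyodbc.connect("DSN=AzureSynapse", autocommit=False)
        conn.execute(ELT_SQL, batch_id)   # work happens in the database engine
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        run_elt(batch_id=42)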



Data Management Principles for
Modern Data Architecture
The data management principle of polyglot persistence states that data should
be persisted (or stored) in the database technology that is the most optimal for
its workload. In a modern data architecture, the most common methods for
working with data is file-based, SQL access, or a REST API data service that
decouples the database. When applied to a data architecture for enterprise
analytics, Radiant Advisors specifies three classifications of data technologies
for analytics: a flexible class, an analytics-optimized class, and a data
management class.

Figure 6: Data management with polyglot persistence principle

[Figure 6 shows the data platform composed of Azure Synapse Analytics, Azure SQL Database, Azure Cosmos DB, and Azure Data Lake Storage.]

The flexible class serves as the important data architecture foundation and
repository of all enterprise data assets. Most commonly referred to as the data
lake, the data technology best-suited for flexible access is an object store such
as Azure Blob Storage or Hadoop Distributed File System (HDFS). A Talend
Cloud customer shared that a successful technique for them is to use Talend data
pipelines to write files to Azure Blob Storage in the columnar Parquet format.
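A minimal sketch of that technique is shown below. In practice the write is performed by a Talend pipeline component; this illustration assumes the pandas, pyarrow, and azure-storage-blob packages, and the account, container, and path names are hypothetical.

    # Sketch: land a dataset in Azure Blob Storage / ADLS as a Parquet file.
    # Account, container, and blob paths are hypothetical.
    import io
    import pandas as pd
    from azure.storage.blob import BlobServiceClient

    def write_parquet_to_blob(df: pd.DataFrame, conn_str: str) -> None:
        """Serialize a DataFrame to Parquet in memory and upload it to the data lake."""
        buffer = io.BytesIO()
        df.to_parquet(buffer, engine="pyarrow", index=False)   # columnar, analytics-friendly
        buffer.seek(0)

        service = BlobServiceClient.from_connection_string(conn_str)
        blob = service.get_blob_client(container="datalake-raw",
                                       blob="sales/orders/2020/11/orders.parquet")
        blob.upload_blob(buffer, overwrite=True)

    if __name__ == "__main__":
        sample = pd.DataFrame({"order_id": [1, 2], "amount": [10.5, 20.0]})
        write_parquet_to_blob(sample, conn_str="<storage-connection-string>")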



The analytics-optimized class includes SQL database engines that leverage
techniques such as massively parallel processing or shared-nothing architecture.
Other analytics databases include columnar data storage, in-memory data
storage, and OLAP cubes through Azure Analysis Services. The analytics-
optimized class also includes NoSQL document stores and graph databases,
such as Azure’s popular Cosmos DB.

The data management class refers to SQL databases that retain row-based storage
(similar to OLTP databases), such as Azure Database for PostgreSQL and Azure
SQL Database, and in-memory persistence with streaming ingestion for high-
performance data loading and updates.

The organization of these technology classes means the modern data


architecture is fundamentally a two-tiered data architecture of a scalable
data lake of all enterprise data assets and an optimized database layer for
analytics workloads. Over time, we anticipate that cloud-native architectures
for databases will evolve and improve to compete with the current optimized
databases, and therefore more data will be persisted in object stores with
analytic engines that leverage elastic compute resources.

The data lake serves as a single repository of all enterprise data, including the
data sources’ raw data formats (structured, semi-structured, and unstructured).
The default technology is an object store such as the Azure Blob Storage or
Azure Data Lake Storage (ADLS). In the past, it was common for on-premises
Hadoop clusters and Azure HDInsight to facilitate a distributed file system to
meet this data architecture role. In a modern data architecture, a data lake
allows for the data to be well-organized, managed, and cataloged, in addition
to being secure and governed.

The data warehouse and data marts continue to provide business


performance analytics with reporting and OLAP over historical trends.
MPP databases, columnar databases, and OLAP cubes are proven database
technologies that have been optimized for analytics and can support faster
query response time for large amounts of event data or high-performance



slicing and dicing of data in dimensional data models. Azure Synapse
Analytics represents the cloud-native evolution of the Azure SQL Data
Warehouse, while Azure SQL and Azure Analysis Services combine with

Power BI for delivering the reports, data visualizations, and dashboards that
organizations need to run their businesses.

Figure 7: Talend data pipelines on Azure Databricks for optimized analytics

[Figure 7 shows enterprise data hubs (Apache Kafka on HDInsight) feeding Spark Streaming pipelines on Azure Databricks, which deliver data to Azure Synapse Analytics and Azure Machine Learning; analytics access goes through Azure API Manager and Azure Analysis Services to Power BI Embedded for consumers and Power BI for internal users.]

Figure 7 illustrates a data integration pattern for Talend Cloud deploying


data pipelines to Azure Databricks that delivers transformed data to Azure
Synapse Analytics. This combines the benefits of a SQL data warehouse,
Spark-like compute, and Jupyter-like notebooks (with Python and R)
in order to minimize the number of components involved and streamline the
data scientists' data flow. Azure Synapse also integrates easily with Power
BI and Azure Machine Learning services. Azure Cosmos DB is also
easily integrated to act as the analytics-optimized NoSQL database in the
modern data architecture. Azure Databricks is a popular choice on the Azure
platform for developers and data engineers who prefer to work in the latest
Spark environments.
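To make the pattern in Figure 7 concrete, the sketch below shows a PySpark structured-streaming job on Databricks reading a Kafka topic and writing micro-batches to Azure Synapse. The broker address, topic, schema, JDBC URL, and staging path are hypothetical, and the Synapse connector options follow the commonly documented Databricks pattern but should be verified for your runtime.

    # Sketch of a Databricks structured-streaming pipeline: consume a Kafka topic
    # and write each micro-batch to Azure Synapse Analytics. All endpoints and
    # names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("orders-stream").getOrCreate()

    schema = (StructType()
              .add("order_id", StringType())
              .add("customer_id", StringType())
              .add("amount", DoubleType()))

    orders = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "hdinsight-kafka:9092")
              .option("subscribe", "sales.orders")
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("r"))
              .select("r.*"))

    def write_to_synapse(batch_df, batch_id):
        """Write one micro-batch to a Synapse table via the Databricks connector."""
        (batch_df.write.format("com.databricks.spark.sqldw")
            .option("url", "jdbc:sqlserver://mysynapse.sql.azuresynapse.net;database=dw")
            .option("tempDir", "abfss://staging@mystorage.dfs.core.windows.net/tmp")
            .option("forwardSparkAzureStorageCredentials", "true")
            .option("dbTable", "dw.fact_orders")
            .mode("append")
            .save())

    orders.writeStream.foreachBatch(write_to_synapse).start().awaitTermination()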



Agile Delivery Principles
for Scalable Enterprise Analytics
Many organizations have embraced an agile methodology for delivering
analytics products and features rather than project-oriented development.
These agile delivery teams can now work independently for their product
owners and customers while leveraging the variety of Azure services, options,
and resources available to them. The independent delivery teams work in a
federated organizational model with centralized support in IT Cloud Ops and
Platform Architecture teams.

Speed and agility for delivering analytics are mostly derived from having an
agile methodology for delivering data pipelines and data preparation. Agile
delivery teams engage in sprints that focus on incremental products and features
for engineering data pipelines and analytics model development. Having a
data pipeline platform that allows data engineers to minimize the amount
of time needed for packaging, deploying, and monitoring code increases the
amount of time available for product development and business value within a
given sprint. This agile methodology can be amplified with DataOps (the data
engineering equivalent to DevOps) and CI/CD processes that can speed release
cycles if updates are needed to respond to production changes.

It is the responsibility of the Platform Architecture team to recommend best


practices, architecture patterns, and standards for the modern data architecture
intended to enable the agile delivery teams with fewer architectural and
standards decisions. Every agile delivery team has the option to develop their
data pipelines in any language they choose and operate them on various Azure
services. This includes Python, Java, Scala, or Julia running on Windows or
Linux virtual machines, Apache Spark clusters, or Azure Databricks services.
The Platform Architecture team also recommends a data pipeline development
tool, such as Talend, which gives the agile delivery teams faster development
times from a component-based development environment, the flexibility of
embedding custom code and exporting to open source languages such as
Python and Java, and the independence to deploy the data pipelines to
remote engines or Spark clusters.



Even with DataOps principles, monitoring data pipelines in production can
be a challenge due to inconsistency and volume if there is not a dedicated
data management console such as Talend Management Console. The Talend
Management Console monitors all data pipelines in remote engines and
Spark clusters – both in Azure and on-premises – and includes configurable
alert notifications for jobs and impacted dependencies. Azure Monitor
offers a similar ability to monitor application logs and server metrics
universally for analysis and notification but does not have the specifics for
data pipeline operations.

Orchestration is one of the most challenging aspects of cloud data pipeline


execution. This is where data pipeline job scheduling, environments, and
notifications allow users to set up and orchestrate jobs. This also promotes a
consistent understanding and terminology for all agile delivery teams when
working with each other and with the Platform Architecture and Cloud Ops
teams. Deploying custom code into the cloud environments has proven to be
challenging and time-consuming for many agile delivery teams when trying to
sort out the many options in cloud services that need to be leveraged together,
but the Talend Management Console can create users, projects, environments,
and workspaces for every agile delivery team and their corresponding remote
execution engines. Agile delivery teams can log in to the console to manage
and monitor all tasks (both scheduled and running in production) and
configure their alerts and notifications.

One of the Talend Cloud apps is Talend Pipeline Designer, used for agile
delivery to build data pipelines directly in the browser, while Talend Data
Inventory centralizes connectivity to enterprise data sets for agile delivery
teams to share. The Talend Cloud API Designer can be used to support the
deployment of APIs for data services.



Adopting Cloud-Native Architecture Principles
Migrating to the cloud with Microsoft Azure and Talend presents the potential
for speed and agility to create value with enterprise analytics. In order to
realize that potential, this journey must be guided by cloud-native principles
and an organization’s ability to adopt them. Cloud-native principles are
primarily focused on application development through DevOps, CI/CD,
microservices, and containers, and these principles can be adapted to facilitate
analytics delivery speed (as previously discussed), development scalability, and
portability of data engineering and enterprise analytics.

Scalability in analytics delivery will come from the combination of adopting


microservices, serverless functions, and automation. Data pipelines need
to evolve away from large ETL packages and, rather, be designed for
microservices and serverless functions. We have already discussed how the
larger sub-modules dedicated to extracting and loading can be isolated and
fully automated with streaming data topics.

Further, Radiant Advisors recommends as a best practice to


isolate the cleansing, integrating, and calculating sub-modules
of data transformations. Talend customers agreed that this is a
best practice they plan to adopt.

For example, a data pipeline can subscribe to a Kafka topic representing a data
source table, be dedicated to cleansing each column of data, leverage API calls
for conversions or validations if needed, and then publish cleansed data back
to a new Kafka topic (for cleansed data only) that can be further transformed
by multiple consumer applications and analytics. A data pipeline can then
subscribe to the cleansed data Kafka topic and be dedicated to enrichment
of the data with additional information from an external service API for geo-
spatial information, demographics, or economic data for data scientists to
leverage in analytic models. Serverless functions can be called when needed
as part of a data process, and execution is measured and billed in hundreds of
milliseconds of use without the need to provision computing resources.
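A minimal sketch of such a dedicated cleansing segment is shown below, again assuming kafka-python; the topic names and the cleansing rules are deliberately simple, hypothetical illustrations of the consume-cleanse-republish pattern.

    # Sketch of a dedicated cleansing segment: subscribe to the raw topic,
    # standardize each record, and republish to a cleansed-data topic that
    # downstream pipelines and analytics consume. Names are hypothetical.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        "sales.orders",                                  # raw source topic
        bootstrap_servers="hdinsight-kafka:9092",
        group_id="orders-cleansing",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="hdinsight-kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def cleanse(record: dict) -> dict:
        """Apply simple, reusable cleansing rules to one record."""
        record["customer_id"] = str(record.get("customer_id", "")).strip().upper()
        record["amount"] = round(float(record.get("amount", 0.0)), 2)
        return record

    for message in consumer:
        producer.send("sales.orders.cleansed", value=cleanse(message.value))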

For analytics portability, data pipelines and analytics model work can be
deployed in lightweight containers, such as Docker containers orchestrated by Kubernetes, with APIs



and URL designations for data persistence. With an orchestration service, these
containers can be deployed across on-premises data centers, Microsoft Azure,
or other public clouds when needed for execution. Containers empower data
engineers and data scientists to write-once and deploy-anywhere without
being concerned about which computing resources are available. The use of
containers is a mature cloud-native capability that we recommend if there is an
appropriate specific use case or when agile delivery teams have the experience
and expertise to increase their speed and agility with containers. Still, the
most significant initial benefits will come from adopting the CI/CD process and
DataOps, followed by microservices and serverless functions.

Conclusion
For companies modernizing their data architecture for analytics delivery on
Microsoft Azure, the journey requires that they embrace these modern data and
cloud principles found in this reference architecture. As the journey advances,
this cloud strategy can be holistically characterized as “rehost, replatform,
then rearchitect,” which requires tools that share these principles and
comprehensive functionality, such as Talend Cloud. A modern data architecture
designed with the principles and best practices distilled within this paper can
fully leverage the potential of the cloud to enable analytics delivery speed,
agility, and scalability.

While every attempt has been made to ensure that the information in this
document is accurate and complete, some typographical errors or technical
inaccuracies may exist. Radiant Advisors does not accept responsibility for any
kind of loss resulting from the use of information contained in this document.
The information contained in this document is subject to change without notice.

All brands and their products are trademarks or registered trademarks of their
respective holders and should be noted as such.

This edition published November 2020.



About the Author

John O’Brien is Principal Advisor and CEO of Radiant


Advisors. A recognized thought leader in data strategy and
analytics, John provides research, strategic advisory services
and mentoring that guide companies in data strategy,
architecture, analytics and emerging technologies.

This research report sponsored by:

About Talend

Talend (NASDAQ: TLND), a leader in cloud data integration


and data integrity, enables companies to transform by
delivering trusted data at the speed of business. Learn more at
www.Talend.com.

About Radiant Advisors

Radiant Advisors is an independent research and advisory firm that


delivers innovative, cutting-edge research and thought-leadership
to transform today’s organizations into tomorrow’s data-centric
industry leaders. To learn more, visit www.RadiantAdvisors.com.

© 2020 Radiant Advisors. All Rights Reserved.


Radiant Advisors
Boulder, CO USA
Email: info@radiantadvisors.com
