WITH TALEND AND MICROSOFT AZURE
White Paper
Introduction
The goal of a modern data architecture in the cloud is to let agile teams build and deliver high-quality data and analytics environments faster and more efficiently in response to business needs. Within this context, modern data management principles and best practices are emerging and, when combined with cloud-native principles, help fulfill the promises of the cloud. This white paper explains these principles and establishes a reference architecture based on Radiant Advisors' research of Talend Cloud customers on the Microsoft Azure platform. It also distills the key success factors those customers discovered when using Talend Cloud and modern data integration.
Figure: Reference architecture for modern data management on Microsoft Azure. Data flows from data sources (operational systems, 2nd- and 3rd-party systems, public external data, user data, and IoT devices) through data ingestion (Replicate, Apache Kafka on Azure HDInsight, Azure IoT Hub) and data pipelines (Azure Databricks, Talend remote engines) into the data platform (Azure Synapse Analytics, Azure Cosmos DB, Azure Data Lake Storage, Azure Data Lake Analytics). Analytics and user experience tiers (Power BI, Azure Analysis Services, Talend Data Prep, Jupyter notebooks, embedded APIs and reports) serve business intelligence and reporting, enterprise data analytics, self-service, and data science and AI consumers. Talend Cloud provides orchestration, data quality and governance, and data unification (data catalog, governance, collaboration, semantic layer).
This reference architecture illustrates at a high level how data moves from data sources to analytics end users through separate data ingestion and data pipeline stages before arriving in the multi-tiered data platform for analytics. These architecture components are discussed in more detail throughout this paper. As we will show, Talend plays a key role in data pipeline deployment and orchestration, along with data quality and governance.
Figure: Integration architecture pattern for batch data processing with Talend. On-premises operational systems and cloud applications (websites, Dynamics, Office apps, eCommerce) are replicated through Talend remote engines into Azure SQL Database, Azure Cosmos DB, Azure Data Lake Storage, and Azure Synapse Analytics, with Talend Cloud providing orchestration, data quality, and governance.
Figure 5: Integration architecture pattern for streaming data with Talend, spanning the data ingestion, data pipeline, and data platform stages. Producers publish to the streaming data hub, and subscribers such as Azure Databricks, Azure Synapse Analytics, Azure SQL Database, and Azure Data Lake Storage consume from it, with Talend Cloud providing data quality and governance.
The streaming data hub is the most influential aspect of a modern data integration architecture, moving away from the batch-oriented ETL paradigm of extracting from a source, transforming the data, and loading it into a data warehouse. This architecture follows a publish-subscribe paradigm in which a topic can have many producers and many subscribers.
To break this process down further, a data pipeline can be dedicated to data cleansing or data masking so that other data pipelines don't have to duplicate that work. As an example, a traditional ETL job is broken down into several data pipeline segments. First, a data pipeline that uses SQL to acquire changed data from a source database is deployed to an on-premises Talend remote engine, and Talend Cloud schedules it to execute every 15 minutes. This data ingestion pipeline publishes its data to a topic in Apache Kafka on Azure HDInsight. The Kafka connector for Azure Data Lake Storage subscribes to the topic and automatically receives the data every 15 minutes. Another data pipeline integrates data from several Kafka topics and writes its output to a different Kafka topic named "integrated data", which other downstream data pipelines or databases can subscribe to.
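As a rough illustration of this publish-subscribe pattern, the sketch below uses the open-source kafka-python client to consume change records from two source topics and republish them to an integrated-data topic. The broker address and topic names are hypothetical, and in practice Talend pipelines would generate and manage this logic rather than hand-written code.

```python
# Minimal publish-subscribe sketch with kafka-python (hypothetical broker and topics).
# A consumer reads change records from two source topics, normalizes them into a
# common shape, and republishes them for downstream subscribers.
import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "my-hdinsight-kafka:9092"  # hypothetical Kafka on HDInsight endpoint

consumer = KafkaConsumer(
    "sales_changes", "crm_changes",  # source topics fed by the ingestion pipelines
    bootstrap_servers=BROKER,
    group_id="integration-pipeline",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for message in consumer:
    record = message.value
    integrated = {
        "source_topic": message.topic,  # keep lineage of where the record came from
        "key": record.get("id"),
        "payload": record,
    }
    producer.send("integrated_data", integrated)  # downstream pipelines subscribe here
```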
This type of data pipeline, which uses SQL statements to transform data inside the database, is often referred to as extract, load, transform (ELT) or in-database processing.
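A minimal sketch of the ELT style is shown below, assuming the data has already been loaded into staging tables in an Azure SQL Database: the transformation runs as a single SQL statement inside the database, issued here with the pyodbc driver. The connection string, schemas, and table names are placeholders.

```python
# ELT sketch: push the transformation down to the database as one SQL statement.
# All object names and credentials are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=analytics;"
    "UID=etl_user;PWD=<password>"
)

transform_sql = """
INSERT INTO dbo.fact_orders (order_id, customer_id, order_total, load_date)
SELECT o.order_id, c.customer_id, SUM(o.line_amount), CAST(GETDATE() AS date)
FROM stg.orders AS o
JOIN stg.customers AS c ON c.source_key = o.customer_key
GROUP BY o.order_id, c.customer_id;
"""

with conn:
    conn.execute(transform_sql)  # pyodbc commits when the connection context exits cleanly
```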
Figure 6: Data management in the data platform with the polyglot persistence principle (Azure Synapse Analytics, Azure SQL Database, Azure Cosmos DB, and Azure Data Lake Storage).
The flexible class serves as the foundation of the data architecture and the repository of all enterprise data assets. Most commonly referred to as the data lake, the technology best suited for flexible access is an object store such as Azure Blob Storage or the Hadoop Distributed File System (HDFS). One Talend Cloud customer shared that their successful technique is to write files to Azure Blob Storage with Talend data pipelines using the Parquet format.
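As a hedged sketch of that technique, the snippet below writes a small DataFrame as a Parquet file and uploads it to Azure Blob Storage with the azure-storage-blob SDK; the connection string, container, and blob path are placeholders, and a Talend pipeline would normally handle this step.

```python
# Sketch: serialize a DataFrame to Parquet and upload it to Azure Blob Storage.
# The connection string, container, and blob path are placeholders.
import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

df = pd.DataFrame({"customer_id": [1, 2], "order_total": [120.50, 89.99]})

# Serialize to Parquet in memory (pandas uses pyarrow as its default Parquet engine).
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
buffer.seek(0)

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="datalake", blob="raw/orders/orders.parquet")
blob.upload_blob(buffer, overwrite=True)
```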
The reference class refers to SQL databases that retain row-based storage (similar to OLTP databases), such as Azure Database for PostgreSQL and Azure SQL Database, along with in-memory persistence and streaming ingestion for high-performance data loading and updates.
The data lake serves as a single repository of all enterprise data, including the raw data formats of the data sources (structured, semi-structured, and unstructured). The default technology is an object store such as Azure Blob Storage or Azure Data Lake Storage (ADLS). In the past, it was common for on-premises Hadoop clusters and Azure HDInsight to provide a distributed file system for this role. In a modern data architecture, a data lake allows the data to be well organized, managed, and cataloged, in addition to being secure and governed.
Power BI delivers the reports, data visualizations, and dashboards that organizations need to run their businesses.
Figure 7: Talend data pipelines on Azure Databricks for optimized analytics (Apache Kafka on HDInsight, Spark Streaming, Azure Machine Learning, Azure Analysis Services, and Power BI for internal consumers).
Speed and agility in delivering analytics are mostly derived from having an agile methodology for delivering data pipelines and data preparation. Agile delivery teams work in sprints that focus on incremental product features, engineering data pipelines and developing analytics models. Having a data pipeline platform that minimizes the time data engineers spend packaging, deploying, and monitoring code increases the time available for product development and business value within a given sprint. This agile methodology can be amplified with DataOps (the data engineering equivalent of DevOps) and CI/CD processes that speed release cycles when updates are needed in response to production changes.
One of the Talend Cloud apps is Talend Pipeline Designer, which agile delivery teams use to build data pipelines directly in the browser, while Talend Data Inventory centralizes connectivity to enterprise data sets for those teams to share. Talend Cloud API Designer can be used to support the deployment of APIs for data services.
For example, a data pipeline can subscribe to a Kafka topic representing a data
source table, be dedicated to cleansing each column of data, leverage API calls
for conversions or validations if needed, and then publish cleansed data back
to a new Kafka topic (for cleansed data only) that can be further transformed
by multiple consumer applications and analytics. A data pipeline can then
subscribe to the cleansed data Kafka topic and be dedicated to enrichment
of the data with additional information from an external service API for geo-
spatial information, demographics, or economic data for data scientists to
leverage in analytic models. Serverless functions can be called when needed
as part of a data process, and execution is measured and billed in hundreds of
milliseconds of use without the need to provision computing resources.
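The following sketch illustrates the enrichment step in this pattern, again with the kafka-python client and the requests library: a pipeline subscribes to a hypothetical cleansed-data topic, calls a hypothetical external geo-spatial API, and republishes the enriched records. The topic names, broker address, and API endpoint are assumptions for illustration only.

```python
# Enrichment sketch: subscribe to a cleansed-data topic, look up geo-spatial
# attributes from an external REST service, and publish enriched records.
# The broker, topic names, and enrichment URL are hypothetical.
import json

import requests
from kafka import KafkaConsumer, KafkaProducer

BROKER = "my-hdinsight-kafka:9092"

consumer = KafkaConsumer(
    "customers_cleansed",
    bootstrap_servers=BROKER,
    group_id="enrichment-pipeline",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for message in consumer:
    record = message.value
    # Call a hypothetical geo-spatial enrichment API with the record's postal code.
    response = requests.get(
        "https://geo.example.com/lookup",
        params={"postal_code": record.get("postal_code")},
        timeout=10,
    )
    record["geo"] = response.json() if response.ok else None
    producer.send("customers_enriched", record)  # data scientists consume this topic
```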
For analytics portability, data pipelines and analytics models can be packaged in lightweight containers (for example, Docker containers orchestrated with Kubernetes) and exposed through APIs.
Conclusion
For companies modernizing their data architecture for analytics delivery on
Microsoft Azure, the journey requires that they embrace these modern data and
cloud principles found in this reference architecture. As the journey advances,
this cloud strategy can be holistically characterized as “rehost, replatform,
then rearchitect,” which requires tools that share these principles and
comprehensive functionality, such as Talend Cloud. A modern data architecture
designed with the principles and best practices distilled within this paper can
fully leverage the potential of the cloud to enable analytics delivery speed,
agility, and scalability.
While every attempt has been made to ensure that the information in this
document is accurate and complete, some typographical errors or technical
inaccuracies may exist. Radiant Advisors does not accept responsibility for any
kind of loss resulting from the use of information contained in this document.
The information contained in this document is subject to change without notice.
All brands and their products are trademarks or registered trademarks of their
respective holders and should be noted as such.
About Talend