G00270543

Use Data Integration Patterns to Build Optimal Architecture
Published: 11 December 2014

Analyst(s): Mei Yang Selvage

Data integration is indispensable for many mission-critical initiatives, but the multitude of options can be overwhelming and confusing. This document and its accompanying spreadsheet present a repeatable process to choose an appropriate DI pattern for a given use case.

Table of Contents

Decision Point
Decision Context
    Business Scenario
    Architectural Context
    Related Decisions
Evaluation Criteria
    Requirements and Constraints
    Principles
Alternatives
    Batch Data Movement
    Data Federation/Data Virtualization
    Replication
    Messaging
Future Developments
    Cloud Computing Impacts DI Requirements and Delivery Modes
    Integrating Structured and Unstructured Data
Decision Tool
    Step 1: Choose One Subpattern Using the Attached Spreadsheet
    Step 2: Perform a Trade-Off Analysis
    Step 3: Document the Decision
Decision Justification
    Batch Data Movement
    Data Federation/Data Virtualization (DF/DV)
    Replication
    Messaging
Gartner Recommended Reading

List of Tables

Table 1. Ten Subpatterns to Integrate Data
Table 2. Common Usages of DI Patterns
Table 3. Key Questions for Evaluation Criteria
Table 4. Architectural Principle Examples
Table 5. ETL Versus ELT Style of Batch Data Movement
Table 6. How to Prioritize Decision Criteria

List of Figures

Figure 1. DI Conceptual Architecture Diagram
Figure 2. Batch Data Movement Architecture Diagram
Figure 3. DF/DV Architecture Diagram
Figure 4. Replication Architecture Diagram
Figure 5. Messaging Architecture Diagram
Figure 6. Perform a Trade-Off Analysis
Figure 7. Batch Data Movement Characteristics
Figure 8. Data Federation/Data Virtualization Characteristics
Figure 9. Replication Characteristics
Figure 10. Messaging Characteristics

Decision Point
How do we create a repeatable process to choose an appropriate data integration (DI) pattern for a
given use case?

Decision Context
This document was revised on 31 March 2016. The document you are viewing is the corrected
version. For more information, see the Corrections page on gartner.com.

DI technologies have been widely adopted in the last decade. They are indispensable for many
mission-critical initiatives, such as logical data warehouses (LDWs), master data management
(MDM), and system migration and integration. But the multitude of DI options and interchangeable
terms can be confusing. The vendor market is constantly changing; for example, analytical and
database vendors are adding capabilities to support DI. Moreover, constraints — such as
budget, skill sets or timelines — often dominate DI decisions, which don't always result in optimal
DI architecture.

These problems can be addressed by using DI patterns. Patterns provide repeatable solutions for recurring problems. They help technical professionals eliminate confusion and improve the decision-making process. Focusing on structured data, this document and its accompanying spreadsheet — "DI_Decision_Tool" — present a repeatable process to perform an architectural trade-off analysis between requirements and constraints among DI patterns. The patterns can be implemented with stand-alone DI tools, embedded capabilities in analytical tools or databases, or even custom coding. However, embedded capabilities and custom coding are not recommended in the long term, for reasons such as agility, maintainability and vendor lock-in.

There are four groups of DI patterns, corresponding to four distinct groups of commercial stand-
alone DI products. In "Critical Capabilities: Data Delivery Styles for Data Integration Tools," Gartner
defines the four main data delivery styles as the following:

■ Batch data movement: Batch or bulk data movement — simply referred to as "batch data
movement" here — involves batch data extraction and delivery approaches (such as support for
extraction, transformation and loading [ETL] processes) to consolidate data from data sources.
■ Data federation/data virtualization (DF/DV): Instead of physically moving the data, data
federation/virtualization executes queries against multiple data sources to create virtual
integrated views of data on the fly. DF/DV requires adapters to various data sources, a
metadata repository and a query engine that can provide results in various ways (for example,
as an SQL row set, XML or a Web services interface).
■ Message-oriented movement: Message-oriented movement delivers data through message
queues or services (in XML) to the target systems. It's often associated with service-oriented
architecture (SOA). Data is often delivered in near real time.
■ Data replication and synchronization: Typically through change data capture (CDC), this
process replicates and synchronizes data among database management systems (DBMSs).
Schemas and other data structures in DBMS may be identical or slightly different. This pattern
keeps operational data current across multiple systems.

Finally, DI patterns consist of 10 subpatterns, based on their interfaces to data sources and targets
(see Table 1).


Table 1. Ten Subpatterns to Integrate Data

Batch:
■ SQL + [Transform] + SQL
■ CDC + [Transform] + SQL
■ API + [Transform] + SQL

Data Federation/Data Virtualization:
■ SQL + [Transform] + SQL
■ SQL + [Transform] + XML
■ API + [Transform] + SQL

Data Replication:
■ CDC + [Transform] + SQL
■ CDC + [Transform] + Queue

Messaging:
■ API + [Transform] + Queue
■ API + [Transform] + XML

Source: Gartner (December 2014)


Business Scenario
DI is critical for various business scenarios. Table 2 lists common usages of DI patterns in key
business scenarios and puts DI patterns in a business context. Note that systems or data sources
can sit on-premises or in the cloud.

Table 2. Common Usages of DI Patterns

LDW
    Batch data movement (ETL or ELT): Initial load; updates data in batch or trickle feed (i.e., mini-batch).
    Data federation/virtualization: Real-time or drill-down access; extends or unifies data access; prototype or front interface for ETL.
    Replication: Consolidates data from databases to data warehouses in near real time using change data capture (CDC).
    Messaging: Extracts data from applications and delivers to data warehouses.

System Integration
    Batch data movement (ETL or ELT): Exchanges data among on-premises and cloud systems in batch or trickle feed.
    Data federation/virtualization: Replaces or reduces the need for data movement; delivers a reusable data access layer.
    Replication: Copies data from databases in near real time.
    Messaging: Exchanges data among applications.

System Migration
    Batch data movement (ETL or ELT): Initial load; updates data in batch or trickle feed.
    Data federation/virtualization: Provides a data access layer to reduce impact to applications and end users.
    Replication: Synchronizes data between old and new databases during the migration transition.
    Messaging: Synchronizes data between old and new applications during the migration transition.

MDM
    Batch data movement (ETL or ELT): Primarily used in the consolidation, centralized and coexistence styles of MDM for initial load and batch updates. (See Note 1 on the four MDM styles.)
    Data federation/virtualization: Extends or unifies data access in all styles of MDM.
    Replication: Copies data from databases in near real time; used in the MDM consolidation, centralized and coexistence styles.
    Messaging: Extracts data from applications and delivers to various MDM systems.

B2B Data Exchange
    Batch data movement (ETL or ELT): Batch-extracts and sends data to external entities.
    Data federation/virtualization: Provides a data access layer for extracting data.
    Replication: Copies data from databases in near real time and sends it to external parties.
    Messaging: Extracts data from applications and delivers to external entities.

Source: Gartner (December 2014)

Architectural Context
The following Gartner templates provide an architectural context for DI decisions:

■ "Data Management": Focuses on data management domain and establishes a high-level


context for examining technologies, techniques and related domains for data management.


■ "Data Integration and Master Data Management": Provides a detailed view of DI and MDM
components within the data management environment.

Related Decisions
The following Decision Points are relevant to this Decision Point:

■ "Decision Point for Selecting a Master Data Management Implementation Style"


■ "Decision Point for Selecting a Database Technical Architecture"

Evaluation Criteria
Evaluation criteria determine which DI patterns are appropriate for a given use case. The nine
important evaluation criteria are:

1. Variable data schema: Data sources often have different variability of data schemas. Some
data sources have highly variable schemas, such as sensor data, whereas others have more stable schemas, such as financial transaction systems.
2. Source data volume: Batch data movement is suitable for extracting or accessing large
volumes of data, whereas the other three patterns are suitable for smaller sets of data.
3. Impact to data sources: Some DI technologies are less intrusive on data sources than others.
For example, log-based CDC scrapes changes from the database transaction logs, which
minimizes the impact on databases. The extra load volume that data sources can handle is
determined by the amount of spare computing resources that data sources have. Their criticality
also determines their tolerance levels to extra workload.
4. Granularity is defined as the data volume per process, and it is often determined by
consumption patterns of target systems. For example, partner systems or data warehouses
may be set up to process data daily, so batch data movement is sufficient. On the other hand, a
customer-facing system may need fresher data, so near-real-time DI patterns are more
appropriate.
5. Data quality deals with semantics — the meaning of words or sentences — of data instances.
Data cleansing ensures that data is suitable for its intended purposes within the business
context. Not all use cases require high data quality. If high data quality is necessary, a variety of
technical techniques is required, such as data augmentation, validation, standardization,
matching and monitoring.
6. Data format transformation reconciles the data format differences between a source and
target data structures. For example, convert data among XML, electronic data interchange (EDI)
or comma-delimited files.
7. Data latency: This is measured by how closely a target system stays up-to-date with source
systems. It is influenced by frequencies of data changes. Real-time integration has become an
important requirement as enterprises strive to be more responsive to changing environments.


Moreover, data latency, combined with the variability of schemas and data volumes, creates huge challenges for DI.
8. Reliability: The criticality of a use case drives the reliability requirement for DI technologies. For
example, customer data integration for cross-sell and upsell scenarios requires higher reliability
than back-office systems.
9. Flexibility: Once foundational DI is in place, organizations experience different demands on
integrating and performing a variety of analytics. For example, what-if analysis requires more DI
flexibility than static dashboards.

Requirements and Constraints


Decision criteria for choosing a DI pattern can be mapped to the source, the target and three
computing stages (as illustrated in the DI conceptual architecture diagram in Figure 1):

1. Extract/access: DI technologies physically extract and move data, or virtually access source
data through SQL, CDC or API interfaces.
2. Transform/cleanse: Data format transformation is required when sources and targets have
different data structures. Also, data cleansing is required to reconcile semantic disparity or to
correct erroneous data. Transformation and cleansing are omitted for some use cases, such as
high-availability replication.
3. Move/present: DI technologies either move data into target systems physically or present data
directly to client applications. Three main interfaces for target systems are SQL, queues or XML
(for example, Web services).
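
To make the stage decomposition and the subpattern naming concrete, here is a minimal sketch. The interface names mirror Table 1 and Figure 1; the class and enum names themselves are hypothetical and are not part of any DI product:

```python
from dataclasses import dataclass
from enum import Enum


class SourceInterface(Enum):
    SQL = "SQL"   # query a relational source directly
    CDC = "CDC"   # scrape changes from transaction logs
    API = "API"   # call an application or SaaS API


class TargetInterface(Enum):
    SQL = "SQL"       # load into a target DBMS
    QUEUE = "Queue"   # publish to a message queue
    XML = "XML"       # present via XML/Web services


@dataclass(frozen=True)
class Subpattern:
    """A DI subpattern: extract/access + [transform/cleanse] + move/present."""
    source: SourceInterface
    transform: bool  # the transform/cleanse stage is optional
    target: TargetInterface

    def name(self) -> str:
        middle = " + [Transform] + " if self.transform else " + "
        return f"{self.source.value}{middle}{self.target.value}"


# Example: the dominant batch data movement subpattern from Table 1.
print(Subpattern(SourceInterface.SQL, True, TargetInterface.SQL).name())
# -> SQL + [Transform] + SQL
```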


Figure 1. DI Conceptual Architecture Diagram

[Figure: a vertical flow from Source through Extract/Access, an optional Transform/Cleanse stage and Move/Present, up to Target.]

Source: Gartner (December 2014)

Table 3 maps the nine decision criteria into the source, the target and the DI computing stages.
Each decision criterion is associated with a key question, which is helpful in clarifying the requirements
and constraints of a specific use case. To reiterate, the DI decision-making process is driven by use
cases, not by generic system characteristics. For example, if a use case only needs to extract or
access a small subset of data in KBs, the requirement is low for source data volume, even though a
system may have TBs of data.


Table 3. Key Questions for Evaluation Criteria

Stage | Decision Criteria | Key Question to Ask
Source | Variable data schema | How variable are data schemas in the data sources?
Source | Source data volume | How large is the average data volume for extracting or accessing in the data sources?
Source | Impact to data sources | How much additional impact can the data sources absorb?
Extract/Access | Granularity | How large is the average data volume for movement or presentation per transaction?
[Transform/Cleanse] | Data quality | How much data cleansing is required in order to achieve a desired level of data quality?
[Transform/Cleanse] | Data format transformation | How complex is data format transformation?
Move/Present | Data latency | How closely are the target systems expected to be up-to-date with the data sources?
Move/Present | Reliability | How reliable does the data integration technology need to be in response to power outages and system failures?
Target | Flexibility | Once DI is established, how much change is anticipated for different types of analytics?

Source: Gartner (December 2014)

The following are important DI requirements (which are listed in "Toolkit: RFP Template for Data
Integration Tools"):

■ Connectivity or adapters: The ability to interact with a range of different data structure types
(for example, relational databases, software as a service [SaaS], packaged applications and
messages)
■ Data delivery modes: The ability to provide data to consuming applications, processes and
databases in a variety of modes (for example, batch, real time or event-driven)
■ Data format transformation: Built-in capabilities for performing both simple and complex data
format transformations
■ Metadata management and data modeling: The ability to capture, reconcile and interchange
metadata and to create and maintain data models
■ Design and development environment: Capabilities for facilitating design and construction of
DI processes


■ Data governance support: Support for the understanding, collaboration and enforcement of
data governance for data quality and data access
■ Deployment environment: Hardware and operating system options for deployment of DI
processes
■ Operations and administration capabilities: Facilities for supporting, managing and
controlling DI processes
■ Architecture and integration: Commonality, consistency and interoperability among the
various components of a DI toolset
■ Service enablement: Service-oriented characteristics and support for SOA deployments

Principles
Architectural principles (see "Reference Architecture Principles") build a foundation for guiding
architectural decisions. The following principles help choose an appropriate DI pattern for a given
use case:

■ Technical maturity: DI technologies have matured over the years, but individual components fall at different points along the maturity range. For example, integration components for Hadoop are less mature than those for traditional database integration.
■ Cost center versus competitive advantage: Investment in DI technologies depends on
whether IT is viewed as a cost center or as a primary contributor to an enterprise's competitive
advantage. When you tie DI initiatives directly to business objectives, you can increase funding
and obtain stronger business support.
■ Single vendor versus best-of-breed vendors: Some organizations prefer to partner with one
strategic vendor for all their integration needs, while others prefer to partner with numerous
vendors and choose the best-of-breed technology from each category. Your current technology
and vendor choices influence this preference greatly.
■ Vendor risk: Vendors' risk levels are measured by their organizational maturity, financial viability
and market share. A small, niche startup may not survive the market competition, so some
organizations may choose to avoid startups.
■ Build versus buy: Although commercial DI middleware is the norm nowadays, some
organizations still resort to custom-coded solutions.

Table 4 gives examples of opposite extremes for a given principle. Most organizations choose
a position somewhere between the two extremes.


Table 4. Architectural Principle Examples

Principle Area | Extreme Principle | Opposite Extreme Principle
Technology maturity | Deploy only those technologies and services that are mature, stable, secure and proven in the field. | Deploy any technology with the potential for competitive and market advantage, regardless of technical maturity, stability or security.
Cost center vs. competitive advantage | Cost containment is the first priority for investment in IT infrastructure. | Competitive advantage is the first priority for investment in IT infrastructure.
Single vendor vs. best-of-breed vendors | Where possible, buy products and services from strategic vendors. | Buy products and services based on the best-of-breed principle.
Vendor risk | We will buy only from well-established vendors with large market share. | Our choice of vendor products will not be limited by vendor maturity, viability or market share considerations.
Build vs. buy | We will build our own software as much as possible. | We will buy commercial off-the-shelf (COTS) software as much as possible.

Source: Gartner (December 2014)

Alternatives
This section compares the four groups of data integration patterns, which consist of 10 subpatterns based on interfaces to data sources and targets. Refer to the summary in Table 1, "Ten Subpatterns to Integrate Data."

Batch Data Movement


Batch data movement includes two styles: ETL and ELT — extract, load, and transform. They are
combined in a single category here because they share the same evaluation criteria, except that the
execution orders of transformation and load are different. The transformation process is often
combined with data cleansing.

Often scheduled hourly or daily, ETL extracts data from sources, and then transforms and loads
data into targets in batches. ETL is usually implemented in a stand-alone environment to provide
extra processing power and to reduce the impact during the transformation phase. A trickle-feed
style is sometimes used to reduce batch windows and to improve data latency. The trickle feed is
essentially micro batch jobs that are scheduled to run more frequently, such as in five- to 15-minute
intervals. However, trickle feed does not have wide adoption, due to the extra complexity of managing near-real-time infrastructure in addition to the batch infrastructure.

In comparison, ELT extracts and loads data into staging areas in target systems first, and then transforms and cleanses data using the computing capabilities of the target DBMS. The transformed and cleansed data is subsequently loaded into production tables. ELT is often associated with powerful data warehouses, MDM and enterprise resource planning (ERP) systems.

Table 5 compares ETL and ELT at a high level.

Table 5. ETL Versus ELT Style of Batch Data Movement

ETL pros:
■ Isolates the transformation workload from the target production environment; is scalable for high-volume, complex transformation and data cleansing
■ Is flexible and decouples data sources and targets; extracts and transforms data once and then populates many targets
■ Offers comprehensive out-of-the-box capabilities on connectivity, transformation and discovery
■ Provides improved graphical user interface (GUI) design tools, governance and operational support for DI

ETL cons:
■ Adds complexity and cost to manage a separate ETL environment
■ Moves data over the network twice: once to the ETL engine, and then to targets
■ Requires skills on specific ETL tools

ELT pros:
■ Reduces the footprint by avoiding a separate ETL server
■ Reuses existing DBMS engines; is suitable for workloads with limited transformation and cleansing
■ Reuses existing skills on DBMS triggers and stored procedures

ELT cons:
■ Is inflexible and couples data sources and targets
■ Shares workload with the target DBMS, which may cause performance and scalability issues
■ May require more custom coding, which is harder to manage in the long run
■ Lacks comprehensive governance and operational support for complex DI

Source: Gartner (December 2014)

Figure 2 shows ETL and ELT styles of batch data movement. In addition, batch data movement
includes three subpatterns based on their interfaces to data sources and targets:

■ SQL + [Transform] + SQL: As the dominant form of batch data movement, this subpattern
extracts data through a standard SQL interface in batches, then transforms and inserts data
into a target DBMS.


■ CDC + [Transform] + SQL: CDC described in this document is assumed to be log-based CDC,
which scrapes changes from transactional logs, such as DBMS or Virtual Storage Access
Method (VSAM). Log-based CDC creates far less impact than trigger-based CDC. After data
changes are captured by log-based CDC, they are transformed and loaded into a target DBMS.
■ API + [Transform] + SQL: This subpattern extracts data through APIs, then transforms and
loads data into targets. It masks the complexity of internal data models and may be the only
choice for ERPs or SaaS.

Figure 2. Batch Data Movement Architecture Diagram

[Figure: side-by-side ETL and ELT flows. In ETL, a data integration engine extracts data through SQL, CDC or API interfaces (from DBMSs, logs, ERPs, SaaS or NoSQL sources), transforms it, then loads it into the target DBMS over SQL. In ELT, the data integration engine extracts and loads the data into the target DBMS first, where the transformation runs.]

Source: Gartner (December 2014)
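
To make the dominant SQL + [Transform] + SQL batch subpattern concrete, here is a minimal hand-coded sketch using Python's built-in sqlite3 module. The table and column names are hypothetical, both tables are assumed to exist, and a real deployment would typically use a stand-alone DI tool rather than custom code, as noted earlier:

```python
import sqlite3


def etl_batch(source_db: str, target_db: str) -> None:
    """Minimal ETL sketch: extract over SQL, transform in the DI layer, load over SQL."""
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)

    # Extract: pull the batch from the source through a standard SQL interface.
    rows = src.execute(
        "SELECT customer_id, full_name, country FROM customers"
    ).fetchall()

    # Transform/cleanse: standardize values before loading (runs in the DI engine).
    cleaned = [
        (cid, name.strip().title(), (country or "UNKNOWN").upper())
        for cid, name, country in rows
    ]

    # Load: write the transformed batch into the target DBMS.
    tgt.executemany(
        "INSERT OR REPLACE INTO dim_customer (customer_id, full_name, country) "
        "VALUES (?, ?, ?)",
        cleaned,
    )
    tgt.commit()
    src.close()
    tgt.close()


# ELT variant: load raw rows into a staging table first, then push the
# transformation down to the target DBMS, for example:
#   INSERT INTO dim_customer
#   SELECT customer_id, TRIM(full_name), UPPER(country) FROM stg_customer;
```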

Batch data movement is the top choice for the following sample use cases:

■ Consolidated financial or risk reports: The high stakes associated with financial or risk
reporting require data meeting certain data quality standards. Batch data movement
consolidates data into a single place and provides a governance environment to deliver
trustworthy data for reports or queries.
■ 360-degree view of a master data entity: Master data entities are key business data entities
critical for enterprises. Examples are customers, products or locations. Batch data movement is
essential for initial data loading and for performing ongoing batch updates for MDM.


■ Data migration: Since old and new systems often have different data models, batch data
movement reconciles data models during the migration process. It is also necessary for initial
data loading.

Data Federation/Data Virtualization


DF/DV speeds up development, increases reuse and reduces data sprawl. During runtime, it
presents heterogeneous data sources to consumers as a single virtual source. Sample data sources
include DBMS, XML and Web services. DF/DV simplifies data access and reduces solution delivery
time. Optional transformation can be performed on the fly.

DF/DV is ideal for relatively simple and direct data access. Capabilities of vendor products vary
significantly in query optimization, transformation and transaction coordination to support create,
read, update, delete (CRUD) on heterogeneous sources. "Four Critical Steps for Successful Data
Virtualization" provides a four-step road map to plan strategies, choose a vendor, develop and
manage DV, and address people and processes.

As shown in Figure 3, DF/DV is composed of three subpatterns:

■ SQL + [Transform] + SQL: As a dominant form among DF/DV choices, this subpattern
accesses, transforms and presents data through an SQL interface of databases.
■ SQL + [Transform] + XML: This subpattern accesses data through an SQL interface, and then
transforms and presents data in XML.
■ API + [Transform] + SQL: This subpattern accesses data through an API, and then transforms
and presents data in SQL.

DF/DV is the first choice in the following sample use cases:

■ Abstract data access layer: DF/DV provides an abstraction layer between data sources and
consumers. It is also beneficial to reduce point-to-point integration. By unifying data access
through an abstraction layer, DV simplifies data access, improves reuse, reduces change impact
and achieves consistent semantics.
■ Drill-down access to detailed data and extended data sources: DF/DV enables drill-down
access to detailed data in real time. It is commonly used in interactive and ad hoc queries. It
also extends data sources and supports self-service analytics.
■ Rapid prototyping for ETL: DF/DV can be a precursor or integration façade of ETL. Integrated
data is eventually moved and stored using ETL to improve performance or enhance data quality.
As a rapid prototyping method, DF/DV reduces time to develop ETL. It is valuable for
understanding new and evolving requirements.
■ Regulatory constraints on moving data: DF/DV can be instrumental in integrating data when
there are regulatory constraints, such as the Data Protection Directive within the EU, that restrict
movement of personal data to third countries.
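
As an illustration of the federation idea (not any vendor's API), the following sketch assembles a virtual customer view from two hypothetical sources at query time. Nothing is persisted, and the sqlite3 connections stand in for heterogeneous sources sitting behind adapters:

```python
import sqlite3


def virtual_customer_view(crm_db: str, billing_db: str) -> list[dict]:
    """Minimal DF/DV sketch: query two sources on the fly and present one virtual row set."""
    crm = sqlite3.connect(crm_db)
    billing = sqlite3.connect(billing_db)

    customers = crm.execute(
        "SELECT customer_id, full_name FROM customers"
    ).fetchall()
    balances = dict(billing.execute(
        "SELECT customer_id, balance FROM accounts"
    ).fetchall())

    crm.close()
    billing.close()

    # Join in the virtualization layer; the consumer sees a single integrated view.
    return [
        {"customer_id": cid, "full_name": name, "balance": balances.get(cid, 0.0)}
        for cid, name in customers
    ]
```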


Figure 3. DF/DV Architecture Diagram

Source: Gartner (December 2014)

Replication
Replication copies data changes using CDC and applies them to target DBMSs through SQL statements or queues. Schemas between the source and the target are most likely identical.
Occasionally, light data transformation and cleansing are performed. Replication is composed of
two subpatterns (see Figure 4):

■ CDC + [Transform] + SQL: This subpattern extracts changes from transaction logs using a
CDC component and applies data changes to a target DBMS using SQL statements. This is
considered as a "push" mechanism to the target system.


■ CDC + [Transform] + Queue: This subpattern also uses a log-based CDC to capture changes,
but it delivers changes into a message queue. This is considered as a "pull" mechanism to the
target system.

Replication is the first choice for the following sample use cases:

■ Providing scalability: Replication is commonly used to improve scalability and load balancing
of mission-critical systems. Replication synchronizes the entire database or a logical subset
(that is, data for a geographic region) to another database.
■ High availability (HA) and disaster recovery (DR): Replication ships database logs to remote
locations to achieve HA and DR. It's often planned in conjunction with DR of data centers. See
"Decision Point for Data Replication for HA/DR" for more details.
■ Offloading analytical workload: Replication copies a whole database or subsets of data from
a mission-critical system to a separate system, which helps to offload workloads for analytical
purposes.
■ Ensuring data consistency: Replication is a good choice for synchronizing data in multiple
directions across different data stores, whether they reside on-premises or in the cloud.
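
For illustration, a minimal sketch of the CDC + [Transform] + SQL subpattern follows. Real log-based CDC depends on vendor-specific log readers, so the change records here are a hypothetical stand-in for what such a reader would emit; the target table and keys are also assumed:

```python
import sqlite3
from typing import Iterable


def apply_changes(changes: Iterable[dict], target_db: str) -> None:
    """Minimal replication sketch: apply captured changes to the target DBMS over SQL."""
    tgt = sqlite3.connect(target_db)
    for change in changes:
        op, row = change["op"], change["row"]
        if op in ("insert", "update"):
            # An upsert keeps the replica close to the source in near real time.
            tgt.execute(
                "INSERT OR REPLACE INTO customers (customer_id, full_name) VALUES (?, ?)",
                (row["customer_id"], row["full_name"]),
            )
        elif op == "delete":
            tgt.execute(
                "DELETE FROM customers WHERE customer_id = ?",
                (row["customer_id"],),
            )
    tgt.commit()
    tgt.close()


# Example change records, as a log-based CDC reader might produce them:
sample = [
    {"op": "insert", "row": {"customer_id": 1, "full_name": "Ada Lovelace"}},
    {"op": "delete", "row": {"customer_id": 7}},
]
```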


Figure 4. Replication Architecture Diagram

[Figure: changes are captured from DBMS transaction logs via CDC, optionally transformed, and either applied to a target DBMS over SQL or delivered to a message queue for consuming applications.]

Source: Gartner (December 2014)

Messaging
Messaging shares some common characteristics with replication, such as near-real-time delivery, small processing volumes and the use of message queues. In addition, messaging accesses APIs directly and offers greater flexibility. It is often associated with service-oriented architecture (SOA).

Messaging is composed of two subpatterns (see Figure 5):

■ API + [Transform] + Queue: When transaction logs are not available, or data models are not exposed, an API may be the only choice to access and extract data. This subpattern is commonly used in legacy systems, ERPs and SaaS applications.


■ API + [Transform] + XML: This subpattern accesses data via API too, and it presents data to
consuming applications in XML, such as Web services or REST services.

Messaging is the first choice for the following sample use cases:

■ Exchanging data in B2B: Many industries use standard message formats (for example, Health
Level Seven [HL7]) to exchange data among organizations. Messaging is the default choice to
exchange data in a common data format and common semantics.
■ Accessing packaged applications: To reduce the impact of database schema changes and to
mask complicated internal schemas, many ERPs or SaaS publish changes to message queues,
rather than allowing direct access to the database using SQL statements.
■ Integrating business processes and applications: Messaging makes business processes and
applications more agile by decoupling data providers and data consumers. Service-oriented
architecture depends on messaging as the backbone of integration.
Figure 5. Messaging Architecture Diagram

[Figure: data is captured from providing applications through an API, optionally transformed, and delivered either to a message queue or as XML to consuming applications and DBMSs.]

Source: Gartner (December 2014)
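
As a sketch of the API + [Transform] + Queue subpattern, the in-process queue.Queue below stands in for a real message broker, and the order payload is hypothetical:

```python
import json
import queue

message_queue: "queue.Queue[str]" = queue.Queue()  # stand-in for a real message broker


def publish_order_events(orders: list[dict]) -> None:
    """Minimal messaging sketch: data pulled from an application API is reshaped
    into a canonical message and published to a queue for consumers to pull."""
    for order in orders:  # `orders` stands in for the payload returned by an ERP/SaaS API
        message = {
            "event": "order_created",
            "order_id": order["id"],
            "amount": round(float(order["total"]), 2),  # light format transformation
        }
        message_queue.put(json.dumps(message))


# A consuming application reads and applies messages at its own pace:
# while not message_queue.empty():
#     handle(json.loads(message_queue.get()))
```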


Future Developments
This section discusses trends that are likely to influence DI technologies in the next 18 months to
three years.

Cloud Computing Impacts DI Requirements and Delivery Modes


On-premises computing ties up a large portion of an IT budget in acquiring and maintaining
infrastructure, software, skills and so on. SaaS offloads the burden of managing infrastructure and
software in some business areas, but it also brings new challenges for integrating data on-premises
and in the cloud. Without proper foresight, organizations can end up with many data silos and governance problems.

As a counterpart to traditional, on-premises integration middleware (the focus of this research), integration platform as a service (iPaaS) increases speed and elasticity and reduces initial costs. For the past several years, many organizations have adopted iPaaS actively. For the next several years, iPaaS products will remain a complement to, rather than a replacement for, on-premises DI middleware.

However, iPaaS also has its issues, such as incomplete functionality, an immature ecosystem and cloud-inherent risks, including data privacy. To maximize the benefits and reduce the risks
of iPaaS, organizations need to choose appropriate hybrid DI architectures, which include on-
premises-centric DI architecture, iPaaS-centric DI architecture and on-premises deployment
architecture. Refer to "Assessing Data Integration in the Cloud" and its associated paper "Assessing
Business Analytics in the Cloud."

Last, iPaaS shows a clear trend of convergence between application and data integration. These tools enable many hybrid features that have traditionally been segmented into different categories of DI technologies. Among them, iPaaS' support for the messaging and batch DI styles is the strongest, whereas support for DF/DV and replication is the weakest.

Integrating Structured and Unstructured Data


Although this paper focuses on structured DI, unstructured DI is gaining more attention as part of the big data trend. Creating a coherent strategy for structured and unstructured DI helps organizations increase profits, meet compliance requirements and understand customers better.

Big data's high data volume and wide data variety have imposed big challenges on data integration.
Not only is data volume growing exponentially each year, but data variety also includes more
semistructured and unstructured data, such as graphs, geospatial information, images and sensor
data.

To develop a coherent DI strategy spanning structured and unstructured data, technical
professionals should first define and focus on high-value use cases, such as fraud detection,
customer analytics, operational effectiveness, risk management and product management.

Finally, adopt the following actions to develop a coherent DI strategy:


■ Short term: Form a cross-disciplinary working group that consists of both IT and
businesspeople. IT people should come from both structured and unstructured data
management areas. The working group would define a focused business subject area and high-
value use cases. To facilitate semantic consistency, start defining a shared business
vocabulary based on selected use cases.
■ Midterm: Define architectural principles and reference architectures (RAs) for integrating
structured and unstructured data. The principles, RAs, and architectural patterns would provide
a solid foundation for project teams.
■ Long term: First, improve knowledge sharing among various DI practices. Second, standardize
integration technology and approaches. Third, create or fine-tune an enterprise data
governance program that supports optimal DI for all types of data.

Decision Tool
To ensure the quality of a system, it is important to evaluate DI patterns based on requirements up front, before letting the budget or timeline overshadow the decision. Even if you intuitively know
which DI pattern to use for a use case, it's recommended to go through the following three steps to
gain additional credibility, document the decisions and potentially build a business case for
technology gaps:

■ Step 1: Choose one subpattern using the attached spreadsheet.


■ Step 2: Perform a trade-off analysis.
■ Step 3: Document the decision.

Step 1: Choose One Subpattern Using the Attached Spreadsheet


Before you start, click the menu bar on the left side of the Web page where you saw this article.
Download the attached spreadsheet, "DI_Decision_Tool." The numeric values in the "Algorithm" tab assess the capabilities of DI patterns in general terms and are not directly related to a specific use case's requirements. You can customize these values based on your implementation specifics. The
most valuable part of this decision-making process comes from educating and involving the
technical team to reach consensus on architectural decisions.

For each decision criterion, pick a corresponding numeric scale that matches the requirement and
write the scale from 1 to 3 in Column G, "Use Case Requirements," in the "Questionnaire" tab.
The numeral "1" means low requirement for your use case; "2" means medium and "3" means high.
Once you put numeric values in the "Questionnaire" tab, the spreadsheet will automatically
populate inputs into the bar charts in the four DI pattern tabs.

The recommended DI subpattern is the one with the closest match to your use case's profile. It
normally doesn't cause issues if a pattern has a higher capability than the level required for a use
case.
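
To show the spirit of the spreadsheet's matching logic, here is a minimal sketch. The capability numbers below are illustrative placeholders, not the values from the "Algorithm" tab, and only a subset of the nine criteria is shown; the scoring penalizes only shortfalls, mirroring the note above that extra capability normally causes no issues:

```python
# Illustrative capability profiles (1 = low, 3 = high); replace with the values
# from the "Algorithm" tab of "DI_Decision_Tool" or your own customizations.
PATTERN_CAPABILITIES = {
    "Batch: SQL + [Transform] + SQL": {
        "source_data_volume": 3, "data_quality": 3, "data_latency": 1, "flexibility": 1},
    "DF/DV: SQL + [Transform] + SQL": {
        "source_data_volume": 2, "data_quality": 1, "data_latency": 3, "flexibility": 3},
    "Replication: CDC + [Transform] + SQL": {
        "source_data_volume": 2, "data_quality": 1, "data_latency": 3, "flexibility": 1},
    "Messaging: API + [Transform] + Queue": {
        "source_data_volume": 1, "data_quality": 2, "data_latency": 3, "flexibility": 3},
}


def shortfall(requirements: dict, capabilities: dict) -> int:
    """Sum how far each capability falls below the requirement; exceeding it costs nothing."""
    return sum(max(req - capabilities.get(criterion, 0), 0)
               for criterion, req in requirements.items())


# A use-case profile as captured in the "Questionnaire" tab (1 = low, 2 = medium, 3 = high).
use_case = {"source_data_volume": 1, "data_quality": 1, "data_latency": 3, "flexibility": 2}
best = min(PATTERN_CAPABILITIES, key=lambda p: shortfall(use_case, PATTERN_CAPABILITIES[p]))
print(best)  # closest match for this illustrative profile
```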


Last, Table 6 provides some concrete values to facilitate classifying requirements into high, medium or low priorities. These values are also shown in the "Questionnaire" tab of "DI_Decision_Tool." Refer to the Evaluation Criteria section and Table 3, "Key Questions for Evaluation Criteria," for definitions of the decision criteria.

Table 6. How to Prioritize Decision Criteria

Source | Variable data schema
    High (3): Highly variable (e.g., sensor, log or graph data)
    Medium (2): Semivariable (e.g., EDI, ERPs or cloud systems)
    Low (1): Low variability (e.g., financial transaction systems)

Source | Source data volume
    High (3): Terabytes (TBs)
    Medium (2): Gigabytes (GBs)
    Low (1): Kilobytes (KBs)

Source | Impact to data sources
    High (3): Can take only minimal impact (minimal spare capacity left)
    Medium (2): Some impact (some spare capacity left)
    Low (1): Fair amount of impact (abundant spare capacity left)

Extract/Access | Granularity
    High (3): TBs
    Medium (2): GBs
    Low (1): KBs to megabytes (MBs)

[Transform/Cleanse] | Data quality
    High (3): Probabilistic matching, deduplication, sophisticated free-form parsing and standardization (e.g., name, address), and enrichment from external sources
    Medium (2): Basic text parsing based on delimiters, data augmentation, deterministic matching
    Low (1): Data validation, look-up and replace

[Transform/Cleanse] | Data format transformation
    High (3): Rule-based transformation; complex industry-standard messages
    Medium (2): Content filter; relabel data elements; aggregation; normalize or denormalize the structure; data structure change (e.g., split or merge)
    Low (1): Data type conversion; simple calculations or string manipulation

Move/Present | Data latency
    High (3): Seconds (real time)
    Medium (2): Minutes (near real time)
    Low (1): Hours and days (batch)

Move/Present | Reliability (assured delivery)
    High (3): Recover automatically, typically in seconds to minutes
    Medium (2): Recover automatically within a given time frame, typically in a couple of hours
    Low (1): Some manual work to recover is acceptable, typically in several hours

Target | Flexibility
    High (3): Frequently adding new data targets; constantly changing BA tools and requirements
    Medium (2): Moderately changing data targets, BA tools and requirements
    Low (1): Stable data targets, BA tools and requirements

Source: Gartner (December 2014)


Step 2: Perform a Trade-Off Analysis


Technical professionals constantly battle with pressures imposed by short deadlines, low budgets
and legacy systems with poor architectures. If they dismiss these limitations and strive for a
"perfect" architecture, projects can go over budget and fall behind schedule. However, if they
operate only within the "reality" of limitations, the systems can deteriorate and become difficult to
maintain in the long run.

Moreover, although large enterprises typically implement DI using stand-alone DI tools, project
teams for small/medium businesses or lines of business may prefer using embedded DI functions
from analytical or database tools, especially for data virtualization functions. To perform a trade-off
analysis, you need to ask two key questions: "What qualities do I need to provide for the system?"
and "What are my limitations?"

In addition to the evaluation criteria listed in this document, there are many important factors to be
considered such as agility, maintainability, sharing, speed, skill sets and total cost of ownership. The
"sweet spot" is to balance the competing limitations and nonfunctional requirements against the
short-term and long-term benefits (see Figure 6).

Figure 6. Perform a Trade-Off Analysis

[Figure: the "sweet spot" lies between two poles: limitations ("What are my limitations?"; reality; short term) and nonfunctional requirements ("Which qualities do I need to provide?"; perfection; long term).]

Source: Gartner (December 2014)


Step 3: Document the Decision


Once stakeholders reach consensus after Steps 1 and 2, document the decision as part of the knowledge base for enterprise architecture. Documenting decisions is beneficial not only for increasing credibility and tracing decision rationales later, but also for building a business case for filling key technology gaps in the future. Over time, you can analyze the decision patterns and
discover technology overlaps and gaps.

After following several decision-making processes, your team should have an increased
architectural competency. At this point, it's beneficial to operationalize this decision-making process
at an enterprise level.

Decision Justification
This section shows the characteristics of and justifications for four DI pattern groups. Intended for
educational purposes, this section compares how well four groups of DI subpatterns satisfy
specified decision criteria. Keep in mind that you should always make a decision based on the
requirements of a specific use case (as described in the Requirements and Constraints section).
Starting from this section without understanding the requirements may lead you to either
overengineer or underengineer certain aspects of DI architecture.

Batch Data Movement


The bar chart in Figure 7 shows the characteristics of three subpatterns of batch data movement.


Figure 7. Batch Data Movement Characteristics

Source: Gartner (December 2014)


Adopt the batch data movement group when you need:

■ High data quality: Improving data quality often requires complex data format transformation
and cleansing. Batch data movement is the top choice when organizations require high data
quality, because it has many built-in transformation capabilities and it is also tightly integrated
with data quality tools.
■ High source data volume: ETL can efficiently process large volumes of data by leveraging
parallel and grid technologies.
■ Great performance and scalability: Because all data is consolidated and transformed
beforehand, client applications enjoy high performance and scalability during runtime.
■ Strong metadata management: ETL often provides end-to-end lineage and impact analysis
among data sources, DI components, data warehouses and BA tools.

Avoid the batch data movement group when in these situations:

■ Low data latency: The majority of ETL jobs run in batches, which introduces data latency.
Although ETL's trickle-feed style can reduce data latency, it also increases administrative and
infrastructure cost because trickle-feed typically requires a separate computing environment.
■ High reliability: In the case of power outages and system failures, you may need to restart an ETL
job from scratch and manually clean up data that has been loaded into the target system.
Therefore, batch data movement is not suitable for use cases that require high reliability (for
example, operational business analytics).
■ High flexibility: Although batch data movement is powerful in many aspects, it takes a long
time to validate the requirements. Once jobs are developed, it takes significant effort to maintain and
change them.

Data Federation/Data Virtualization (DF/DV)


The bar chart in Figure 8 shows the characteristics of three subpatterns of DF/DV.


Figure 8. Data Federation/Data Virtualization Characteristics

Source: Gartner (December 2014)

Adopt the DF/DV group when you need:

■ Low data latency: DF/DV accesses heterogeneous data sources on the fly. Client applications
see the latest data from the source systems. They do not experience data latency except when
caching is used.
■ Improved reuse and delivery time: Organizations can reduce delivery time and reuse assets
built on top of a shared DF/DV layer (for example, logics, views, services and business
glossary).
■ Reduced potential data sprawl: Since data is not physically moved, DF/DV lowers total cost of
ownership and improves data governance.
■ Improved agility and flexibility: DF/DV decouples data providers and consumers via a
virtualization layer. DF/DV reduces change impact from data sources and helps system
migration. It is also indispensable for understanding the new and evolving requirements for
consuming applications.

Avoid the DF/DV group in general when your use case requires:

■ Complex data cleansing: Invoking complex cleansing or data format transformation services
on the fly can cause performance and scalability issues. Although you can use caching to
improve performance, caching also increases data latency.


■ Low impact to data sources: Avoid using DF/DV if source systems have limited spare
capacities to satisfy additional workloads.
■ High availability: Availability of DF/DV depends on data sources. If a consuming application
has a higher HA requirement than any of the data sources, it is better to either improve HA of the data
sources or copy the data to an improved HA environment.

Replication
The bar chart in Figure 9 shows the characteristics of two subpatterns of replication.

Figure 9. Replication Characteristics

Source: Gartner (December 2014)

Adopt the replication group when you need:

■ Minimized impact to data sources: Log-based CDC captures changes from transaction logs
and has lower overhead than the direct SQL approach.
■ High reliability: Replication provides guaranteed data delivery in the event of a system failure or
outage.

Avoid the replication group when your situation requires:

■ Variable data schema: Replication works mostly with relatively stable schemas in DBMSs; schemas typically follow relational data models.


■ High data quality: Performing complex data cleansing and transformation during runtime can
cause performance and scalability concerns.

Messaging
The bar chart in Figure 10 shows the characteristics of two subpatterns of messaging.

Figure 10. Messaging Characteristics

Source: Gartner (December 2014)

Adopt the messaging group when you require:

■ Variable data schema: Messaging can easily handle schema changes and deal with a variety of data schemas.
■ High transformation: Messaging can perform extensive data format transformation, such as conversions among comma-delimited files, XML and EDI.
■ High reliability: Messaging provides high reliability for guaranteed data delivery in the event of
system failures or power outages.
■ Improved flexibility: Messaging increases agility by decoupling data providers and consumers.

Avoid the messaging group when you require:

■ High source data volume: Messaging excels at extracting small amounts of data frequently.
But it may not be efficient for dealing with high data volumes in data sources through an API.


■ Low impact to data sources: Avoid using messaging if source systems have limited spare
capacities to satisfy additional workloads. Messaging typically hits the data sources
frequently.
■ High granularity: Messaging excels at delivering small amounts of data frequently, so
performance for large transactional data volume can be poor.

Gartner Recommended Reading


Some documents may not be available as part of your current Gartner subscription.

"Four Critical Steps for Successful Data Virtualization"

"Solution Path for Building the Next-Generation Data Warehouse"

"Embrace Sound Design Principles to Architect a Successful Logical Data Warehouse"

"Assessing Data Integration in the Cloud"

"Assessing Business Analytics in the Cloud"

"Reference Architecture Principles"

"Decision Point for Selecting a Master Data Management Implementation Style"

"Toolkit: RFP Template for Data Integration Tools"

Note 1 Four Different Styles of MDM


Four MDM implementation styles are compared in "A Comparison of Master Data Management
Implementation Styles":

■ Consolidation: Used primarily to support business intelligence (BI) or data warehousing initiatives. This is generally referred to as downstream MDM because MDM is applied downstream of the operational systems, where master data is originally created. In this style, there is no attempt to remediate the data in operational systems.
■ Registry: Used primarily as a central index to master data that is authored in a distributed
fashion and remains fragmented across those systems.
■ Centralized (formerly transactional): Where master data is authored, stored and accessed
from a central system — either in a workflow or a transaction use case.
■ Coexistence: Used primarily where master data authoring is distributed, but a golden copy is
maintained centrally. The central system publishes the golden copy master data to subscribing
systems.


GARTNER HEADQUARTERS

Corporate Headquarters
56 Top Gallant Road
Stamford, CT 06902-7700
USA
+1 203 964 0096

Regional Headquarters
AUSTRALIA
BRAZIL
JAPAN
UNITED KINGDOM

For a complete list of worldwide locations, visit http://www.gartner.com/technology/about.jsp

© 2014 Gartner, Inc. and/or its affiliates. All rights reserved. Gartner is a registered trademark of Gartner, Inc. or its affiliates. This
publication may not be reproduced or distributed in any form without Gartner’s prior written permission. If you are authorized to access
this publication, your use of it is subject to the Usage Guidelines for Gartner Services posted on gartner.com. The information contained
in this publication has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy,
completeness or adequacy of such information and shall have no liability for errors, omissions or inadequacies in such information. This
publication consists of the opinions of Gartner’s research organization and should not be construed as statements of fact. The opinions
expressed herein are subject to change without notice. Although Gartner research may include a discussion of related legal issues,
Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner is a public company,
and its shareholders may include firms and funds that have financial interests in entities covered in Gartner research. Gartner’s Board of
Directors may include senior managers of these firms or funds. Gartner research is produced independently by its research organization
without input or influence from these firms, funds or their managers. For further information on the independence and integrity of Gartner
research, see “Guiding Principles on Independence and Objectivity.”
