G00270543
Table of Contents
Decision Point
Decision Context
    Business Scenario
    Architectural Context
    Related Decisions
Evaluation Criteria
    Requirements and Constraints
    Principles
Alternatives
    Batch Data Movement
    Data Federation/Data Virtualization
    Replication
    Messaging
Future Developments
    Cloud Computing Impacts DI Requirements and Delivery Modes
    Integrating Structured and Unstructured Data
Decision Tool
    Step 1: Choose One Subpattern Using the Attached Spreadsheet
    Step 2: Perform a Trade-Off Analysis
    Step 3: Document the Decision
Decision Justification
Decision Point
How do we create a repeatable process to choose an appropriate data integration (DI) pattern for a
given use case?
Decision Context
This document was revised on 31 March 2016. The document you are viewing is the corrected
version. For more information, see the Corrections page on gartner.com.
DI technologies have been widely adopted in the last decade. They are indispensable for many
mission-critical initiatives, such as logical data warehouses (LDWs), master data management
(MDM), and system migration and integration. But the multitude of DI options and interchangeable
terms can be confusing. The vendor market is constantly changing; for example, analytical and
database vendors are adding capabilities to support DI. Moreover, constraints, such as budget, skill
sets or timelines, often dominate DI decisions, which does not always result in an optimal DI
architecture.
These problems can be addressed by using DI patterns. Patterns provide repeatable solutions to
recurring problems. They help technical professionals eliminate confusion and improve the
decision-making process. Focusing on structured data, this document and its accompanying
spreadsheet — "DI_Decision_Tool" — present a repeatable process for performing an architectural
trade-off analysis of requirements and constraints across DI patterns. DI patterns can be
implemented with stand-alone DI tools, with capabilities embedded in analytical tools or databases,
or even with custom coding. However, embedded capabilities and custom coding are not
recommended in the long term, for reasons such as agility, maintainability and vendor lock-in.
There are four groups of DI patterns, corresponding to four distinct groups of commercial
stand-alone DI products. In "Critical Capabilities: Data Delivery Styles for Data Integration Tools," Gartner
defines the four main data delivery styles as the following:
■ Batch data movement: Batch or bulk data movement — simply referred to as "batch data
movement" here — involves batch data extraction and delivery approaches (such as support for
extraction, transformation and loading [ETL] processes) to consolidate data from data sources.
■ Data federation/data virtualization (DF/DV): Instead of physically moving the data, data
federation/virtualization executes queries against multiple data sources to create virtual
integrated views of data on the fly. DF/DV requires adapters to various data sources, a
metadata repository and a query engine that can provide results in various ways (for example,
as an SQL row set, XML or a Web services interface).
■ Message-oriented movement: Message-oriented movement delivers data through message
queues or services (in XML) to the target systems. It's often associated with service-oriented
architecture (SOA). Data is often delivered in near real time.
■ Data replication and synchronization: Typically through change data capture (CDC), this
process replicates and synchronizes data among database management systems (DBMSs).
Schemas and other data structures in DBMS may be identical or slightly different. This pattern
keeps operational data current across multiple systems.
Finally, DI patterns consist of 10 subpatterns, based on their interfaces to data sources and targets
(see Table 1).
Table 1. Ten Subpatterns to Integrate Data

Batch data movement:
■ SQL + [Transform] + SQL
■ CDC + [Transform] + SQL
■ API + [Transform] + SQL

Data federation/data virtualization:
■ SQL + [Transform] + SQL
■ SQL + [Transform] + XML
■ API + [Transform] + SQL

Replication:
■ CDC + [Transform] + SQL
■ CDC + [Transform] + Queue

Messaging:
■ API + [Transform] + Queue
■ API + [Transform] + XML
Business Scenario
DI is critical for various business scenarios. Table 2 lists common usages of DI patterns in key
business scenarios and puts DI patterns in a business context. Note that systems or data sources
can sit on-premises or in the cloud.
Table 2. Common Usages of DI Patterns in Key Business Scenarios

LDW
■ Batch data movement: Initial load; updates data in batch or trickle feed (i.e., mini-batch).
■ DF/DV: Real-time or drill-down access; extends or unifies data access; prototype or front interface for ETL.
■ Replication: Consolidates data from databases to data warehouses in near real time using change data capture (CDC).
■ Messaging: Extracts data from applications and delivers it to data warehouses.

System Integration
■ Batch data movement: Exchanges data among on-premises and cloud systems in batch or trickle feed.
■ DF/DV: Replaces or reduces the need for data movement; delivers a reusable data access layer.
■ Replication: Copies data from databases in near real time.
■ Messaging: Exchanges data among applications.

System Migration
■ Batch data movement: Initial load; updates data in batch or trickle feed.
■ DF/DV: Provides a data access layer to reduce impact to applications and end users.
■ Replication: Synchronizes data between old and new databases during the migration transition.
■ Messaging: Synchronizes data between old and new applications during the migration transition.

MDM
■ Batch data movement: Primarily used in the consolidation, centralized and coexistence styles of MDM for initial load and batch updates. (See Note 1 on the four MDM styles.)
■ DF/DV: Extends or unifies data access in all styles of MDM.
■ Replication: Copies data from databases in near real time; used in the consolidation, centralized and coexistence styles of MDM.
■ Messaging: Extracts data from applications and delivers it to various MDM systems.

B2B Data Exchange
■ Batch data movement: Batch-extracts and sends data to external entities.
■ DF/DV: Provides a data access layer for extracting data.
■ Replication: Copies data from databases in near real time and sends it to external parties.
■ Messaging: Extracts data from applications and delivers it to external entities.
Architectural Context
The following Gartner templates provide an architectural context for DI decisions:
■ "Data Integration and Master Data Management": Provides a detailed view of DI and MDM
components within the data management environment.
Related Decisions
The following Decision Points are relevant to this Decision Point:
Evaluation Criteria
Evaluation criteria determine which DI patterns are appropriate for a given use case. The nine
important evaluation criteria are:
1. Variable data schema: Data sources often have different variability of data schemas. Some
data sources have highly variable schema, such as sensor data; whereas others may have more
stable schema, such as financial transactional systems.
2. Source data volume: Batch data movement is suitable for extracting or accessing large
volumes of data, whereas the other three patterns are suited to smaller sets of data.
3. Impact to data sources: Some DI technologies are less intrusive on data sources than others.
For example, log-based CDC scrapes changes from the database transaction logs, which
minimizes the impact on databases. The extra load volume that data sources can handle is
determined by the amount of spare computing resources that data sources have. Their criticality
also determines their tolerance levels to extra workload.
4. Granularity is defined as the data volume per process, and it is often determined by
consumption patterns of target systems. For example, partner systems or data warehouses
may be set up to process data daily, so batch data movement is sufficient. On the other hand, a
customer-facing system may need fresher data, so near-real-time DI patterns are more
appropriate.
5. Data quality deals with semantics — the meaning of words or sentences — of data instances.
Data cleansing ensures that data is suitable for its intended purposes within the business
context. Not all use cases require high data quality. If high data quality is necessary, a variety of
technical techniques is required, such as data augmentation, validation, standardization,
matching and monitoring.
6. Data format transformation reconciles the data format differences between source and
target data structures; for example, converting data among XML, electronic data interchange (EDI)
and comma-delimited files.
7. Data latency: This is measured by how closely a target system stays up-to-date with source
systems. It is influenced by frequencies of data changes. Real-time integration has become an
important requirement as enterprises strive to be more responsive to changing environments.
Moreover, data latency, combined with the variability of schemas and data volumes, creates huge
challenges for DI.
8. Reliability: The criticality of a use case drives the reliability requirement for DI technologies. For
example, customer data integration for cross-sell and upsell scenarios requires higher reliability
than back-office systems do.
9. Flexibility: Once foundational DI is in place, organizations face varying demands for integrating
data and performing a variety of analytics. For example, what-if analysis requires more DI
flexibility than static dashboards.
DI technologies operate in three computing stages between data sources and targets:
1. Extract/access: DI technologies physically extract and move data, or virtually access source
data through SQL, CDC or API interfaces.
2. Transform/cleanse: Data format transformation is required when sources and targets have
different data structures. Also, data cleansing is required to reconcile semantic disparity or to
correct erroneous data. Transformation and cleansing are omitted for some use cases, such as
high-availability replication.
3. Move/present: DI technologies either move data into target systems physically or present data
directly to client applications. Three main interfaces for target systems are SQL, queues or XML
(for example, Web services).
Figure: DI computing stages (Source → Extract/Access → [Transform/Cleanse] → Move/Present → Target)
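To make the three stages concrete, the following Python sketch chains an extract, a transform/cleanse and a load step over in-memory SQLite databases that stand in for a source and a target. The table names, columns and cleansing rules are illustrative assumptions for this sketch only, not details taken from the research.

import sqlite3

# Illustrative stand-ins for a source and a target system (assumed names).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
source.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                   [(1, " Ada ", "ADA@EXAMPLE.COM"), (2, "Bob", None)])
target.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT, email TEXT)")

def extract(conn):
    """Stage 1 - Extract/Access: read rows through a SQL interface."""
    return conn.execute("SELECT id, name, email FROM customers").fetchall()

def transform(rows):
    """Stage 2 - [Transform/Cleanse]: trim names, normalize emails and drop
    rows that fail a simple validation rule (missing email)."""
    cleansed = []
    for id_, name, email in rows:
        if email is None:
            continue  # data quality rule: reject incomplete records
        cleansed.append((id_, name.strip(), email.lower()))
    return cleansed

def load(conn, rows):
    """Stage 3 - Move/Present: write the result into the target via SQL."""
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    conn.commit()

load(target, transform(extract(source)))
print(target.execute("SELECT * FROM dim_customer").fetchall())
# [(1, 'Ada', 'ada@example.com')]

In a real deployment, each stage would typically be configured in a DI tool rather than hand-coded, but the division of responsibility across the three stages is the same.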
Table 3 maps the nine decision criteria to the source, the target and the DI computing stages.
Each decision criterion is associated with a key question, which helps clarify the requirements
and constraints of a specific use case. To reiterate, the DI decision-making process is driven by use
cases, not by generic system characteristics. For example, if a use case only needs to extract or
access a small subset of data (kilobytes), the source data volume requirement is low, even though
the system may hold terabytes of data.
Table 3. Key Questions for Evaluation Criteria

Stage | Criterion | Key Question
Source | Variable data schema | How variable are data schemas in the data sources?
Source | Source data volume | How large is the average data volume for extracting or accessing in the data sources?
Source | Impact to data sources | How much additional impact can the data sources absorb?
Extract/Access | Granularity | How large is the average data volume for movement or presentation per transaction?
[Transform/Cleanse] | Data quality | How much data cleansing is required in order to achieve a desired level of data quality?
Move/Present | Data latency | How closely are the target systems expected to be up-to-date with the data sources?
Move/Present | Reliability | How reliable does the data integration technology need to be in response to power outages and system failures?
The following are important DI requirements (which are listed in "Toolkit: RFP Template for Data
Integration Tools"):
■ Connectivity or adapters: The ability to interact with a range of different data structure types
(for example, relational databases, software as a service [SaaS], packaged applications and
messages)
■ Data delivery modes: The ability to provide data to consuming applications, processes and
databases in a variety of modes (for example, batch, real time or event-driven)
■ Data format transformation: Built-in capabilities for performing both simple and complex data
format transformations
■ Metadata management and data modeling: The ability to capture, reconcile and interchange
metadata and to create and maintain data models
■ Design and development environment: Capabilities for facilitating design and construction of
DI processes
■ Data governance support: Support for the understanding, collaboration and enforcement of
data governance for data quality and data access
■ Deployment environment: Hardware and operating system options for deployment of DI
processes
■ Operations and administration capabilities: Facilities for supporting, managing and
controlling DI processes
■ Architecture and integration: Commonality, consistency and interoperability among the
various components of a DI toolset
■ Service enablement: Service-oriented characteristics and support for SOA deployments
Principles
Architectural principles (see "Reference Architecture Principles") build a foundation for guiding
architectural decisions. The following principles help choose an appropriate DI pattern for a given
use case:
■ Technology maturity: DI technologies have matured over the years, but individual components
span a range of maturity levels, so this principle concerns which point in that range you are
willing to adopt. For example, integration components for Hadoop are less mature than those for
traditional database integration.
■ Cost center versus competitive advantage: Investment in DI technologies depends on
whether IT is viewed as a cost center or as a primary contributor to an enterprise's competitive
advantage. When you tie DI initiatives directly to business objectives, you can increase funding
and obtain stronger business support.
■ Single vendor versus best-of-breed vendors: Some organizations prefer to partner with one
strategic vendor for all their integration needs, while others prefer to partner with numerous
vendors and choose the best-of-breed technology from each category. Your current technology
and vendor choices influence this preference greatly.
■ Vendor risk: Vendors' risk levels are measured by their organizational maturity, financial viability
and market share. A small, niche startup may not survive the market competition, so some
organizations choose to avoid startups.
■ Build versus buy: Although commercial DI middleware is the norm nowadays, some
organizations still resort to custom-coded solutions.
Table 4 gives examples of opposite extremes for each principle. Most organizations choose a
position somewhere between the two extremes.
Table 4. Opposite Extremes for Each Architectural Principle

■ Technology maturity: "Deploy only those technologies and services that are mature, stable, secure and proven in the field." vs. "Deploy any technology with the potential for competitive and market advantage, regardless of technical maturity, stability or security."
■ Cost center vs. competitive advantage: "Cost containment is the first priority for investment in IT infrastructure." vs. "Competitive advantage is the first priority for investment in IT infrastructure."
■ Single vendor vs. best-of-breed vendors: "Where possible, buy products and services from strategic vendors." vs. "Buy products and services based on the best-of-breed principle."
■ Vendor risk: "We will buy only from well-established vendors with large market share." vs. "Our choice of vendor products will not be limited by vendor maturity, viability or market share considerations."
■ Build vs. buy: "We will build our own software as much as possible." vs. "We will buy commercial off-the-shelf (COTS) software as much as possible."
Alternatives
This section compares the four groups of data integration patterns, which consist of 10 subpatterns
based on their interfaces to data sources and targets. Refer to the summary in Table 1, "Ten
Subpatterns to Integrate Data."

Batch Data Movement
Often scheduled hourly or daily, ETL extracts data from sources, and then transforms and loads
data into targets in batches. ETL is usually implemented in a stand-alone environment to provide
extra processing power and to reduce the impact during the transformation phase. A trickle-feed
style is sometimes used to reduce batch windows and to improve data latency. Trickle feed is
essentially a series of micro-batch jobs scheduled to run more frequently, such as at five- to 15-minute
intervals. However, trickle feed does not have wide adoption, due to the extra complexity of
managing a near-real-time infrastructure in addition to the batch infrastructure.
In comparison, ELT extracts and loads data into staging areas in target systems first, and then
transforms and cleanses data using the computing capabilities of the target DBMS. The transformed
and cleansed data is then loaded into production tables. ELT is often associated with powerful
data warehouses, MDM and enterprise resource planning (ERP) systems.
Table 5. Pros and Cons of ETL

ETL pros:
■ Isolates the transformation workload from the target production environment; is scalable for high-volume, complex transformation and data cleansing
■ Is flexible and decouples data sources and targets; extracts and transforms data once and then populates many targets

ETL cons:
■ Adds complexity and cost to manage a separate ETL environment
■ Moves data over the network twice: once to the ETL engine, and then to the targets
■ Requires skills on specific ETL tools
Figure 2 shows ETL and ELT styles of batch data movement. In addition, batch data movement
includes three subpatterns based on their interfaces to data sources and targets:
■ SQL + [Transform] + SQL: As the dominant form of batch data movement, this subpattern
extracts data through a standard SQL interface in batches, then transforms and inserts data
into a target DBMS.
■ CDC + [Transform] + SQL: CDC described in this document is assumed to be log-based CDC,
which scrapes changes from transactional logs, such as DBMS or Virtual Storage Access
Method (VSAM). Log-based CDC creates far less impact than trigger-based CDC. After data
changes are captured by log-based CDC, they are transformed and loaded into a target DBMS.
■ API + [Transform] + SQL: This subpattern extracts data through APIs, then transforms and
loads data into targets. It masks the complexity of internal data models and may be the only
choice for ERPs or SaaS.
Figure 2. Batch Data Movement Architecture Diagram (ETL: data is extracted from the source DBMS, transformed by the data integration engine and loaded into the target DBMS via SQL; ELT: data is extracted and loaded into the target DBMS first, then transformed there)
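As a rough illustration of the ELT variant described above, the sketch below loads raw extracts into a staging table and then lets the target DBMS do the transformation in SQL, which is the defining difference from ETL. An in-memory SQLite database stands in for the target data warehouse; the table names and sample rows are assumptions.

import sqlite3

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE stg_orders (order_id INTEGER, amount TEXT)")
target.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")

# Extract + Load: raw rows (as extracted from an assumed source) land in staging as-is.
raw_rows = [(101, "19.99"), (102, "5.00"), (103, None)]
target.executemany("INSERT INTO stg_orders VALUES (?, ?)", raw_rows)

# Transform: the target DBMS does the cleansing and conversion work in SQL,
# which is the defining trait of ELT versus ETL.
target.execute("""
    INSERT INTO fact_orders (order_id, amount)
    SELECT order_id, CAST(amount AS REAL)
    FROM stg_orders
    WHERE amount IS NOT NULL
""")
target.commit()
print(target.execute("SELECT * FROM fact_orders").fetchall())
# [(101, 19.99), (102, 5.0)]

With ETL, the casting and filtering logic would instead run in the data integration engine before the load.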
Batch data movement is the top choice for the following sample use cases:
■ Consolidated financial or risk reports: The high stakes associated with financial or risk
reporting require data meeting certain data quality standards. Batch data movement
consolidates data into a single place and provides a governance environment to deliver
trustworthy data for reports or queries.
■ 360-degree view of a master data entity: Master data entities are key business data entities
critical for enterprises. Examples are customers, products or locations. Batch data movement is
essential for initial data loading and for performing ongoing batch updates for MDM.
■ Data migration: Since old and new systems often have different data models, batch data
movement reconciles data models during the migration process. It is also necessary for initial
data loading.
Data Federation/Data Virtualization

DF/DV is ideal for relatively simple and direct data access. Capabilities of vendor products vary
significantly in query optimization, transformation and transaction coordination to support create,
read, update and delete (CRUD) operations on heterogeneous sources. "Four Critical Steps for Successful
Data Virtualization" provides a four-step road map to plan strategies, choose a vendor, develop and
manage DV, and address people and processes. DF/DV is composed of three subpatterns:
■ SQL + [Transform] + SQL: As a dominant form among DF/DV choices, this subpattern
accesses, transforms and presents data through an SQL interface of databases.
■ SQL + [Transform] + XML: This subpattern accesses data through an SQL interface, and then
transforms and presents data in XML.
■ API + [Transform] + SQL: This subpattern accesses data through an API, and then transforms
and presents data in SQL.
DF/DV is the first choice for the following sample use cases:

■ Abstract data access layer: DF/DV provides an abstraction layer between data sources and
consumers, which also helps reduce point-to-point integration. By unifying data access
through an abstraction layer, DV simplifies data access, improves reuse, reduces change impact
and achieves consistent semantics.
■ Drill-down access to detailed data and extended data sources: DF/DV enables drill-down
access to detailed data in real time. It is commonly used in interactive and ad hoc queries. It
also extends data sources and supports self-service analytics.
■ Rapid prototyping for ETL: DF/DV can be a precursor or integration façade of ETL. Integrated
data is eventually moved and stored using ETL to improve performance or enhance data quality.
As a rapid prototyping method, DF/DV reduces time to develop ETL. It is valuable for
understanding new and evolving requirements.
■ Regulatory constraints on moving data: DF/DV can be instrumental in integrating data when
there are regulatory constraints, such as the Data Protection Directive within the EU, that restrict
movement of personal data to third countries.
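The following sketch illustrates the core DF/DV idea behind the SQL + [Transform] + SQL subpattern: queries are pushed to the underlying sources at request time and combined into a virtual view, with nothing persisted in a target store. The two SQLite connections stand in for assumed CRM and billing systems; the names and schemas are illustrative only.

import sqlite3

# Two assumed, independent source systems.
crm = sqlite3.connect(":memory:")
billing = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])
billing.execute("CREATE TABLE invoices (customer_id INTEGER, total REAL)")
billing.executemany("INSERT INTO invoices VALUES (?, ?)", [(1, 120.0), (1, 30.0)])

def virtual_customer_view(customer_id):
    """Resolve the integrated view on the fly; no data is copied or stored."""
    name = crm.execute(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)).fetchone()
    total = billing.execute(
        "SELECT COALESCE(SUM(total), 0) FROM invoices WHERE customer_id = ?",
        (customer_id,)).fetchone()
    return {"id": customer_id, "name": name[0] if name else None,
            "invoice_total": total[0]}

print(virtual_customer_view(1))  # {'id': 1, 'name': 'Ada', 'invoice_total': 150.0}
print(virtual_customer_view(2))  # {'id': 2, 'name': 'Bob', 'invoice_total': 0}

A commercial DV product adds the query optimization, caching and metadata management discussed above; the sketch only shows the on-the-fly composition of a virtual view.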
Replication
Replication copies data changes using CDC and applies changes to target DBMS through SQL
statements or queues. Schemas between the source and the target most likely are identical.
Occasionally, light data transformation and cleansing are performed. Replication is composed of
two subpatterns (see Figure 4):
■ CDC + [Transform] + SQL: This subpattern extracts changes from transaction logs using a
CDC component and applies the changes to a target DBMS using SQL statements. This is
considered a "push" mechanism toward the target system.
■ CDC + [Transform] + Queue: This subpattern also uses log-based CDC to capture changes,
but it delivers the changes into a message queue. This is considered a "pull" mechanism,
because the target system consumes the changes from the queue.
Replication is the first choice for the following sample use cases:
■ Providing scalability: Replication is commonly used to improve scalability and load balancing
of mission-critical systems. Replication synchronizes the entire database or a logical subset
(that is, data for a geographic region) to another database.
■ High availability (HA) and disaster recovery (DR): Replication ships database logs to remote
locations to achieve HA and DR. It's often planned in conjunction with DR of data centers. See
"Decision Point for Data Replication for HA/DR" for more details.
■ Offloading analytical workload: Replication copies a whole database or subsets of data from
a mission-critical system to a separate system, which helps to offload workloads for analytical
purposes.
■ Ensuring data consistency: Replication is a good choice for synchronizing data in multiple
directions across different data stores, whether they reside on-premises or in the cloud.
Figure 4. Replication architecture (a CDC component captures changes from the DBMS transaction logs; changes are optionally transformed and then applied to a target DBMS via SQL or delivered through a message queue to consuming applications)
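A production log-based CDC component reads a DBMS transaction log, which cannot be reproduced portably here, so the sketch below simulates the captured change stream as a list of change records and focuses on the apply side of the CDC + [Transform] + SQL subpattern. The record layout and table names are assumptions.

import sqlite3

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Simulated output of a log-based CDC capture component:
# (operation, primary key, payload) tuples in commit order.
change_stream = [
    ("INSERT", 1, "Ada"),
    ("INSERT", 2, "Bob"),
    ("UPDATE", 2, "Robert"),
    ("DELETE", 1, None),
]

def apply_change(conn, op, key, name):
    """Push each captured change to the target DBMS as a SQL statement."""
    if op == "INSERT":
        conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (key, name))
    elif op == "UPDATE":
        conn.execute("UPDATE customers SET name = ? WHERE id = ?", (name, key))
    elif op == "DELETE":
        conn.execute("DELETE FROM customers WHERE id = ?", (key,))

for op, key, name in change_stream:
    apply_change(target, op, key, name)
target.commit()

print(target.execute("SELECT * FROM customers").fetchall())  # [(2, 'Robert')]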
Messaging
Messaging shares some characteristics with replication, such as near-real-time delivery, small
processing volumes and the use of message queues. In addition, messaging accesses APIs directly
and offers greater flexibility. It is closely associated with service-oriented architecture (SOA).
Messaging is composed of two subpatterns (see Figure 5):
■ API + [Transform] + Queue: When transaction logs are not available, or data models are not
exposed, an API may be the only way to access and extract data. This subpattern is commonly
used with legacy systems, ERPs and SaaS applications.
■ API + [Transform] + XML: This subpattern also accesses data via an API, and it presents data to
consuming applications in XML (for example, through Web services or REST services).
Messaging is the first choice for the following sample use cases:
■ Exchanging data in B2B: Many industries use standard message formats (for example, Health
Level Seven [HL7]) to exchange data among organizations. Messaging is the default choice to
exchange data in a common data format and common semantics.
■ Accessing packaged applications: To reduce the impact of database schema changes and to
mask complicated internal schemas, many ERPs or SaaS publish changes to message queues,
rather than allowing direct access to the database using SQL statements.
■ Integrating business processes and applications: Messaging makes business processes and
applications more agile by decoupling data providers and data consumers. Service-oriented
architecture depends on messaging as the backbone of integration.
Figure 5. Messaging Architecture Diagram (data is captured from providing applications via an API, transformed, and delivered to consuming applications through a message queue or as XML)
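To illustrate the API + [Transform] + Queue subpattern, the sketch below pulls records from a stand-in for an application API, transforms each record into a simple XML message and publishes it to an in-process queue. In practice, the API call would be an ERP or SaaS interface and the queue a message broker; every name here is illustrative.

import queue
import xml.etree.ElementTree as ET

message_queue = queue.Queue()  # stand-in for a real message broker

def fetch_orders_from_api():
    """Stand-in for calling a packaged application's API."""
    return [{"order_id": 101, "status": "SHIPPED"},
            {"order_id": 102, "status": "OPEN"}]

def to_xml(record):
    """Transform an API record into an XML message payload."""
    root = ET.Element("order")
    for key, value in record.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Capture -> transform -> deliver to the queue.
for record in fetch_orders_from_api():
    message_queue.put(to_xml(record))

# A consuming application would pull messages like this:
while not message_queue.empty():
    print(message_queue.get())
# <order><order_id>101</order_id><status>SHIPPED</status></order> ...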
Future Developments
This section discusses trends that are likely to influence DI technologies in the next 18 months to
three years.
Cloud Computing Impacts DI Requirements and Delivery Modes

However, iPaaS also has its issues, such as incomplete functionality, an immature ecosystem and
cloud-inherited risks, including data privacy. To maximize the benefits and reduce the risks of iPaaS,
organizations need to choose an appropriate hybrid DI architecture: on-premises-centric DI
architecture, iPaaS-centric DI architecture or on-premises deployment architecture. Refer to
"Assessing Data Integration in the Cloud" and its associated paper "Assessing Business Analytics in
the Cloud."

Last, iPaaS shows a clear trend toward the convergence of application and data integration. These
tools offer many hybrid features that have traditionally been segmented into different categories of
DI technologies. Among them, iPaaS' support for the messaging and batch DI styles is the strongest,
whereas its support for DF/DV and replication is the weakest.
Integrating Structured and Unstructured Data

Big data's high data volume and wide data variety have imposed big challenges on data integration.
Not only is data volume growing exponentially each year, but data variety also includes more
semistructured and unstructured data, such as graphs, geospatial information, images and sensor
data.
To develop a coherent DI strategy spanning structured and unstructured data, technical
professionals should first define and focus on high-value use cases, such as fraud detection,
customer analytics, operational effectiveness, risk management and product management.
■ Short term: Form a cross-disciplinary working group that consists of both IT and
businesspeople. IT people should come from both the structured and unstructured data
management areas. The working group would define a focused business subject area and high-
value use cases. To facilitate semantic consistency, start defining a shared business vocabulary
based on the selected use cases.
■ Midterm: Define architectural principles and reference architectures (RAs) for integrating
structured and unstructured data. The principles, RAs and architectural patterns would provide
a solid foundation for project teams.
■ Long term: First, improve knowledge sharing among the various DI practices. Second, standardize
integration technology and approaches. Third, create or fine-tune an enterprise data
governance program that supports optimal DI for all types of data.
Decision Tool
To ensure the quality of a system, it is important to evaluate DI patterns based on requirements
upfront, before letting the budget or timeline overshadow the decision. Even if you intuitively know
which DI pattern to use for a use case, it is recommended that you go through the following three
steps to gain additional credibility, document the decisions and potentially build a business case for
technology gaps:

Step 1: Choose One Subpattern Using the Attached Spreadsheet
For each decision criterion, pick the numeric value that matches the requirement and enter it, on a
scale from 1 to 3, in Column G, "Use Case Requirements," in the "Questionnaire" tab.
The numeral "1" means a low requirement for your use case; "2" means medium; and "3" means high.
Once you put numeric values in the "Questionnaire" tab, the spreadsheet will automatically
populate inputs into the bar charts in the four DI pattern tabs.
The recommended DI subpattern is the one that most closely matches your use case's profile. It
normally doesn't cause issues if a pattern has a higher capability than the level required for a use
case.
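The spreadsheet's matching logic can be approximated in a few lines of Python: rate the use case from 1 to 3 on each criterion, compare that profile with each subpattern's capability profile, and prefer the subpattern with the smallest shortfall. Exceeding a requirement is not penalized, consistent with the guidance above. The capability numbers below are invented placeholders for illustration, not values taken from "DI_Decision_Tool."

# Use-case requirement profile: 1 = low, 2 = medium, 3 = high (per the Questionnaire tab).
use_case = {"source_data_volume": 3, "data_quality": 3,
            "data_latency": 1, "reliability": 2, "flexibility": 1}

# Hypothetical capability profiles for a few subpatterns (illustrative numbers only).
subpatterns = {
    "Batch: SQL + [Transform] + SQL": {"source_data_volume": 3, "data_quality": 3,
                                       "data_latency": 1, "reliability": 1, "flexibility": 1},
    "DF/DV: SQL + [Transform] + SQL": {"source_data_volume": 1, "data_quality": 1,
                                       "data_latency": 3, "reliability": 2, "flexibility": 3},
    "Replication: CDC + [Transform] + SQL": {"source_data_volume": 2, "data_quality": 1,
                                             "data_latency": 3, "reliability": 3, "flexibility": 1},
}

def shortfall(requirements, capabilities):
    """Sum how far a subpattern falls short; exceeding a requirement costs nothing."""
    return sum(max(req - capabilities[criterion], 0)
               for criterion, req in requirements.items())

best = min(subpatterns, key=lambda name: shortfall(use_case, subpatterns[name]))
for name, caps in subpatterns.items():
    print(f"{name}: shortfall = {shortfall(use_case, caps)}")
print("Closest match:", best)

The actual spreadsheet presents the comparison visually as bar charts rather than as a single score, but the underlying idea of matching a requirement profile against capability profiles is the same.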
Last, Table 6 provides some concrete values to facilitate classifying requirements into high, medium
or low priorities. These values are also shown in the "Questionnaire" tab of "DI_Decision_Tool." Refer
to the Evaluation Criteria section and Table 3, "Key Questions for Evaluation Criteria," for definitions
of the decision criteria.
Table 6. Classifying Requirements as High, Medium or Low

Impact to data sources (Source)
■ High: Can take minimal impact (minimal spare capacity left)
■ Medium: Some impact (some spare capacity left)
■ Low: Fair amount of impact (abundant spare capacity left)

Data quality ([Transform/Cleanse])
■ High: Probabilistic matching, deduplication, sophisticated free-form parsing and standardization (e.g., name, address), and enrichment from external sources
■ Medium: Basic text parsing based on delimiters, data augmentation, deterministic matching
■ Low: Data validation, look-up and replace

Data latency (Move/Present)
■ High: Seconds (real time)
■ Medium: Minutes (near real time)
■ Low: Hours and days (batch)

Assured delivery (Move/Present)
■ High: Recover automatically, typically in seconds to minutes
■ Medium: Recover automatically within a given time frame, typically in a couple of hours
■ Low: Some manual work to recover is acceptable, typically in several hours

Flexibility (Target)
■ High: Frequently adding new data targets; constantly changing BA tools and requirements
■ Medium: Moderately changing data targets, BA tools and requirements
■ Low: Stable data targets, BA tools and requirements
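As a worked example of turning one Table 6 row into a questionnaire score, the helper below maps a required data latency, expressed in seconds, onto the 1-to-3 scale. The exact numeric boundaries are an interpretation of "seconds / minutes / hours and days," not thresholds stated in the research.

def latency_requirement_score(required_latency_seconds: float) -> int:
    """Map a required data latency onto the questionnaire scale (1 = low, 3 = high)."""
    if required_latency_seconds < 60:     # seconds -> real time
        return 3
    if required_latency_seconds < 3600:   # minutes -> near real time
        return 2
    return 1                              # hours and days -> batch

print(latency_requirement_score(5))      # 3: real-time requirement
print(latency_requirement_score(900))    # 2: near real time
print(latency_requirement_score(86400))  # 1: daily batch is acceptable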
Moreover, although large enterprises typically implement DI using stand-alone DI tools, project
teams at small or midsize businesses, or within individual lines of business, may prefer to use DI
functions embedded in analytical or database tools, especially for data virtualization. To perform a
trade-off analysis, you need to ask two key questions: "What qualities do I need to provide for the
system?" and "What are my limitations?"
In addition to the evaluation criteria listed in this document, there are many important factors to be
considered such as agility, maintainability, sharing, speed, skill sets and total cost of ownership. The
"sweet spot" is to balance the competing limitations and nonfunctional requirements against the
short-term and long-term benefits (see Figure 6).
Figure 6. The sweet spot balances limitations (reality: What are my limitations?) against nonfunctional requirements (perfection: Which qualities do I need to provide?)
After working through this decision-making process several times, your team should have increased
architectural competency. At this point, it is beneficial to operationalize the decision-making process
at an enterprise level.
Decision Justification
This section shows the characteristics of and justifications for four DI pattern groups. Intended for
educational purposes, this section compares how well four groups of DI subpatterns satisfy
specified decision criteria. Keep in mind that you should always make a decision based on the
requirements of a specific use case (as described in the Requirements and Constraints section).
Starting from this section without understanding the requirements may lead you to either
overengineer or underengineer certain aspects of DI architecture.
Batch Data Movement

Choose the batch data movement group when your use case requires:

■ High data quality: Improving data quality often requires complex data format transformation
and cleansing. Batch data movement is the top choice when organizations require high data
quality, because it has many built-in transformation capabilities and is tightly integrated
with data quality tools.
■ High source data volume: ETL can efficiently process large volumes of data by leveraging
parallel and grid technologies.
■ Great performance and scalability: Because all data is consolidated and transformed
beforehand, client applications enjoy high performance and scalability during runtime.
■ Strong metadata management: ETL often provides end-to-end lineage and impact analysis
among data sources, DI components, data warehouses and BA tools.
Avoid the batch data movement group in general when your use case requires:

■ Low data latency: The majority of ETL jobs run in batches, which introduces data latency.
Although ETL's trickle-feed style can reduce data latency, it also increases administrative and
infrastructure cost, because trickle feed typically requires a separate computing environment.
■ High reliability: In the case of power outages and system failures, you may need to restart ETL
job from scratch and manually clean up data that has been loaded into the target system.
Therefore, batch data movement is not suitable for use cases that require high reliability (for
example, operational business analytics).
■ High flexibility: Although batch data movement is powerful in many aspects, it takes a long
time to validate the requirements. Once jobs are developed, it takes significant effort to maintain
and change them.
Data Federation/Data Virtualization

Choose the DF/DV group when your use case requires:

■ Low data latency: DF/DV accesses heterogeneous data sources on the fly. Client applications
see the latest data from the source systems. They do not experience data latency except when
caching is used.
■ Improved reuse and delivery time: Organizations can reduce delivery time and reuse assets
built on top of a shared DF/DV layer (for example, logics, views, services and business
glossary).
■ Reduced potential data sprawl: Since data is not physically moved, DF/DV lowers total cost of
ownership and improves data governance.
■ Improved agility and flexibility: DF/DV decouples data providers and consumers via a
virtualization layer. DF/DV reduces change impact from data sources and helps system
migration. It is also indispensable for understanding the new and evolving requirements for
consuming applications.
Avoid the DF/DV group in general when your use case requires:
■ Complex data cleansing: Invoking complex cleansing or data format transformation services
on the fly can cause performance and scalability issues. Although you can use caching to
improve performance, caching also increases data latency.
■ Low impact to data sources: Avoid using DF/DV if source systems have limited spare
capacities to satisfy additional workloads.
■ High availability: Availability of DF/DV depends on data sources. If a consuming application
has a higher HA requirement than any of data sources, it is better to either improve HA of data
sources or copy the data to an improved HA environment.
Replication
The bar chart in Figure 9 shows the characteristics of two subpatterns of replication.
Choose the replication group when your use case requires:

■ Minimized impact to data sources: Log-based CDC captures changes from transaction logs
and has lower overhead than the direct SQL approach.
■ High reliability: Replication provides guaranteed data delivery in the event of a system failure or
outage.
Avoid the replication group in general when your use case requires:

■ Variable data schema: Replication works mostly with relatively stable schemas from DBMSs,
which typically follow relational data models.
■ High data quality: Performing complex data cleansing and transformation during runtime can
cause performance and scalability concerns.
Messaging
The bar chart in Figure 10 shows the characteristics of two subpatterns of messaging.
Choose the messaging group when your use case requires:

■ Variable data schema: Messaging can easily handle schema changes and deal with a variety
of data schemas.
■ High transformation: Messaging can perform extensive data format transformation, such as
conversions among comma-delimited files, XML and EDI.
■ High reliability: Messaging provides high reliability for guaranteed data delivery in the event of
system failures or power outages.
■ Improved flexibility: Messaging increases agility by decoupling data providers and consumers.
Avoid the messaging group in general when your use case requires:

■ High source data volume: Messaging excels at extracting small amounts of data frequently,
but it may not be efficient for high data volumes in data sources accessed through APIs.
■ Low impact to data sources: Avoid using messaging if source systems have limited spare
capacity to handle additional workloads; messaging typically hits the data sources
frequently.
■ High granularity: Messaging excels at delivering small amounts of data frequently, so
performance for large transactional data volume can be poor.
© 2014 Gartner, Inc. and/or its affiliates. All rights reserved. Gartner is a registered trademark of Gartner, Inc. or its affiliates. This
publication may not be reproduced or distributed in any form without Gartner’s prior written permission. If you are authorized to access
this publication, your use of it is subject to the Usage Guidelines for Gartner Services posted on gartner.com. The information contained
in this publication has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy,
completeness or adequacy of such information and shall have no liability for errors, omissions or inadequacies in such information. This
publication consists of the opinions of Gartner’s research organization and should not be construed as statements of fact. The opinions
expressed herein are subject to change without notice. Although Gartner research may include a discussion of related legal issues,
Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner is a public company,
and its shareholders may include firms and funds that have financial interests in entities covered in Gartner research. Gartner’s Board of
Directors may include senior managers of these firms or funds. Gartner research is produced independently by its research organization
without input or influence from these firms, funds or their managers. For further information on the independence and integrity of Gartner
research, see “Guiding Principles on Independence and Objectivity.”