Session Five - Data Integration

The document provides an overview of data warehousing and data integration concepts, highlighting the characteristics of data warehouses, data marts, and operational data stores. It discusses the ETL process, data mapping techniques, and various data integration approaches, including federated databases and memory-mapped data structures. Additionally, it covers data quality considerations and the importance of using ETL tools for effective data integration.

Basics of Data Warehouse and Data Integration

Prof. Pradeep Dharmadasa,
Dean, Faculty of Management & Finance,
University of Colombo

Data Warehouse

 A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of the management decision-making process (William H. Inmon).
 It is a large store of data, and a set of processes, collected into a database for the primary purpose of helping a business analyze data to make decisions.
 Subject-oriented - A data warehouse typically provides information on a topic (such as sales inventory, customers, suppliers or the supply chain) rather than on company operations.

Data Warehouse (cont.)

 Integrated - A data warehouse combines data from various sources. These may include a cloud, relational databases, flat files, structured and semi-structured data, metadata, and master data. The sources are combined in a manner that is consistent, relatable, and ideally certifiable, providing a business with confidence in the data's quality.
 Time Variant - The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from the historical point of view.
 Non-volatile - Non-volatile means the previous data is not erased when new data is added. A data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse. Prior data is not deleted when new data is added; historical data is preserved for comparisons, trends, and analytics.
Data Mart
 A data mart is similar to a data warehouse, but it holds data only for a specific department or line of business, such as sales, finance, or human resources.
 A data warehouse can feed data to a data mart, or a data mart can feed a data warehouse.
 The data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.

Data Mart (cont.)
 Based on their relation to the data warehouse and the data sources that are used to create the system, there are three types of data marts:
 Dependent - A dependent data mart is created from an existing enterprise data warehouse.
 Independent - An independent data mart is a stand-alone system, created without the use of a data warehouse, that focuses on one subject area or business function.
 Hybrid - A hybrid data mart combines data from an existing data warehouse and other operational source systems.

Operational Data Store (ODS)

 An Operational Data Store (ODS) is a database management system where data is stored and processed in real time, typically supporting OLTP (On-Line Transaction Processing) workloads.
 Operational data stores are data repositories that store a snapshot of an organization's current data. This is a highly volatile data repository that is ideally suited for real-time analysis.

Ralph Kimball's Approach vs W. H. Inmon's Approach

 There are two schools of thought when it comes to building a DW.
 According to Kimball, a data warehouse is made up of all the data marts in an enterprise. This is a bottom-up approach.
 According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management decisions. This is a top-down approach.
Comparison: Kimball vs Inmon
 How to decide between Kimball's and Inmon's architectures?
 It is all in the tradeoffs between the comparative advantages and disadvantages.
 Kimball is the better choice if you want to see results faster, have a small team of engineers, and foresee few changes in the business requirements. Otherwise, the data redundancy could cause anomalies and maintenance costs down the line.
 Inmon is the go-to for huge enterprises that wish to see a complete picture of their enterprise data, even if the deployment of the data warehouse is going to cost them more and take longer than Kimball's counterpart.

Goals of Data Warehousing (DW)
 Information accessibility - Data in a DW must be easy to comprehend, by business users and developers alike. The business user should be allowed to slice and dice the data in every possible way.
 Information credibility - The data in a DW should be credible, complete and of the desired quality.
 Flexible to change - The DW must be adaptable to change.
 Support for more fact-based decision making
 Support for data security
 Information consistency

Data Warehouse - Advantages and Limitations

Advantages
 Integration at the lowest level, eliminating the need for integration queries.
 Runtime schematic cleaning is not needed - it is performed in the data staging environment.
 Independent of the original data source.
 Query optimization is possible.

Limitations
 The process would take a considerable amount of time and effort.
 Requires an understanding of the domain.
 More scalable when accompanied by a metadata repository - increased load.
 Tightly coupled architecture.
Extract, Transform, Load (ETL)
• ETL is a process that extracts data from different source systems, then transforms the data (applying calculations, concatenations, etc.) and finally loads the data into the data warehouse system. ETL provides the foundation for data analytics and machine learning workstreams. ETL is often used by an organization to:
• Extract data from legacy systems
• Cleanse the data to improve data quality and establish consistency
• Load data into a target database, usually a DW
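
The extract-transform-load flow described above can be sketched in a few lines of Python. The CSV source, the name standardization and the SQLite target below are illustrative assumptions, not part of the original slides:

```python
import csv
import io
import sqlite3

# Extract: read rows from a source system (an in-memory CSV standing in
# for a legacy export).
source = io.StringIO("id,name,amount\n1,alice,10.5\n2,bob,3.25\n")
rows = list(csv.DictReader(source))

# Transform: apply simple calculations and standardization.
for row in rows:
    row["name"] = row["name"].title()                     # standardize names
    row["amount"] = round(float(row["amount"]) * 1.1, 2)  # e.g. add 10% tax

# Load: write the transformed rows into the warehouse table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (id INTEGER, name TEXT, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (:id, :name, :amount)", rows)

loaded = dw.execute("SELECT name, amount FROM sales ORDER BY id").fetchall()
print(loaded)  # [('Alice', 11.55), ('Bob', 3.58)]
```

A production pipeline would extract from real legacy systems and apply far heavier cleansing, but the three phases keep this same shape.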

Data Mapping

 Data mapping is the process of creating data element mappings between two distinct data models.
 It is used as the first step towards a wide variety of data integration tasks, which include:
 Data transformation between data sources and destinations
 Identification of data relationships
 Discovery of hidden sensitive data
 Consolidation of multiple databases into a single database

Data Mapping Techniques

There are three main data mapping techniques:
 Manual Data Mapping - It requires IT professionals to hand-code or manually map the data source to the target schema.
 Schema Mapping - It is a semi-automated strategy. A data mapping solution establishes a relationship between a data source and the target schema. IT professionals check the connections made by the schema mapping tool and make any required adjustments.
 Fully Automated Data Mapping - The most convenient, simple, and efficient data mapping technique uses a code-free, drag-and-drop data mapping UI. Even non-technical users can carry out mapping tasks in just a few clicks.
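
As a minimal illustration of the hand-coded (manual) technique above, a source-to-target field mapping can be expressed as a plain dictionary; all field names here are made up for the example:

```python
# Manual data mapping: each source field is hand-mapped to a target field,
# paired with a conversion function for the target type.
source_record = {"cust_no": "C-1001", "full_nm": "Jane Doe", "amt": "250.00"}

# Hypothetical mapping: source field -> (target field, converter)
mapping = {
    "cust_no": ("customer_id", str),
    "full_nm": ("customer_name", str),
    "amt":     ("order_amount", float),
}

target_record = {
    target: convert(source_record[src])
    for src, (target, convert) in mapping.items()
}

print(target_record)
# {'customer_id': 'C-1001', 'customer_name': 'Jane Doe', 'order_amount': 250.0}
```

Schema-mapping tools automate the construction of such a table; fully automated tools hide it behind a drag-and-drop UI.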
Data Staging

 A data staging area can be defined as an intermediate storage area that falls between the operational/transactional sources of data and the DW or data mart. A staging area can be used to:
 Gather data from different sources, ready to be processed at different times
 Quickly load information from the operational database
 Find changes against current DB/DM values
 Cleanse data and recalculate aggregates

Data Extraction

 Extraction is the operation of extracting data from the source system for further use in a data warehouse environment. This is the first step in the ETL process.
 Designing this process means making decisions about the following main aspects:
 Which extraction method do I choose?
 How do I provide the extracted data for further processing?

Data Extraction (cont.)

 The data has to be extracted both logically and physically.
 The logical extraction methods:
 Full extraction
 Incremental extraction
 The physical extraction methods:
 Online extraction
 Offline extraction

Data Transformation

 Transformation is the most complex and, in terms of production, the most costly part of the ETL process.
 Transformations can range from simple data conversions to extreme data scrubbing techniques.
 From an architectural perspective, transformations can be performed in two ways:
 Multistage data transformation - Extracted data is moved to a staging area where transformations occur prior to loading the data into the warehouse.
 In-warehouse data transformation - Data is extracted and loaded into the analytics warehouse, and transformations are done there. This is sometimes referred to as Extract, Load, Transform (ELT).
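
Incremental extraction is commonly implemented with a watermark: only rows changed since the previous run are pulled. The sketch below assumes the source table carries a `last_modified` column; the table, column and dates are invented for the example:

```python
import sqlite3

# Source system with a last_modified timestamp on every row (assumed schema).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, total REAL, last_modified TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01"),
    (2, 20.0, "2024-02-15"),
    (3, 30.0, "2024-03-10"),
])

def extract_incremental(conn, watermark):
    """Pull only rows modified after the previous extraction's watermark."""
    return conn.execute(
        "SELECT id, total, last_modified FROM orders WHERE last_modified > ?",
        (watermark,),
    ).fetchall()

# A run after a full extraction on 2024-02-01 returns only the newer rows.
changed = extract_incremental(src, "2024-02-01")
print(changed)  # rows 2 and 3 only

# The new watermark is the max last_modified seen, saved for the next run.
new_watermark = max(row[2] for row in changed)
```

A full extraction would simply omit the WHERE clause and re-read the entire table.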
Data Loading
 The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely.
 The timing and scope of replacing or appending data in the DW are strategic design choices dependent on the time available and the business needs.
 More complex systems can maintain a history and audit trail of all changes to the data loaded into the DW.

What Is Data Integration?
 Data integration (DI) is the process of coherently merging data from various data sources and presenting a cohesive/consolidated view to the user.
 It involves combining data residing at different sources and providing users with a unified view of the data.
 It is significant in a variety of situations, both:
 commercial (e.g., two similar companies trying to merge their databases)
 scientific (e.g., combining research results from different bioinformatics research repositories)
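
The replace-versus-append design choice in the load phase can be sketched as two small loader functions; the `sales` table and its rows are made up for the illustration:

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (day TEXT, amount REAL)")

def load_replace(conn, rows):
    """Full refresh: wipe the target table, then load the new snapshot."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def load_append(conn, rows):
    """Append: keep existing history and add the new rows on top."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

load_replace(dw, [("2024-01-01", 100.0)])
load_append(dw, [("2024-01-02", 150.0)])
row_count = dw.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)  # 2: the replaced snapshot plus one appended row
```

Systems that keep an audit trail typically append with extra load-timestamp columns rather than deleting anything.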

Need for Data Integration

 Enables quick access to information based on a key variable, along with querying existing data for meaningful insights.
 Helps reduce costs, overlaps, and redundancies; the business will be less exposed to risks and losses.
 Helps in better monitoring of the trending patterns of key variables, which alleviates the need to conduct more studies and surveys and brings down R&D spending.

Knowledge Required for Data Integration

 Concepts and skills required:
 Development challenges
o Translation of relational databases to object-oriented applications
o Consistent and inconsistent metadata
o Handling redundant and missing data
o Normalization of data from different sources
 Technological challenges
o Various formats of data
o Structured and unstructured data
o Huge volumes of data
 Organizational challenges
o Unavailability of data
o Manual integration risk and failure
Common Data Integration Approaches
 Federated database (virtual database):
 A type of meta-database management system which transparently integrates multiple autonomous databases into a single federated database.
 The constituent databases are interconnected via a computer network and may be geographically decentralized.
 The federated database is the fully integrated, logical composite of all constituent databases in a federated database management system.
 Data warehousing
 Memory-mapped data structure - Memory mapping is a process or command in computer programming that requests that files, code, or objects be brought into system memory. It allows files or data to be processed temporarily as main memory by a central processing unit.
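
The federated idea, one logical query spanning several autonomous databases, can be approximated with SQLite's ATTACH, which lets a single query join tables from separate databases; the databases and tables below are invented for the sketch:

```python
import sqlite3

# Two "autonomous" databases (in-memory here; normally separate files/servers).
main = sqlite3.connect(":memory:")
main.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
main.execute("INSERT INTO customers VALUES (1, 'Acme')")

# Attach a second database; it keeps its own tables and identity.
main.execute("ATTACH DATABASE ':memory:' AS ordersdb")
main.execute("CREATE TABLE ordersdb.orders (customer_id INTEGER, total REAL)")
main.execute("INSERT INTO ordersdb.orders VALUES (1, 99.5)")

# One federated-style query transparently spanning both databases.
row = main.execute("""
    SELECT c.name, o.total
    FROM customers AS c
    JOIN ordersdb.orders AS o ON o.customer_id = c.id
""").fetchone()
print(row)  # ('Acme', 99.5)
```

A real federated DBMS does this across networked, heterogeneous servers; the query-side experience of a single logical database is the same.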

Data Integration Approaches (cont.)

 Memory-mapped data structure:
 Useful when in-memory data manipulation is needed and the data structure is large. It is mainly used on the .NET platform and is typically performed with C# or VB.NET.
 It is a much faster way of accessing the data than using a memory stream.

Difference between a Federated Database and a DW

Federated:
 Preferred when the databases are present across various locations over a large area
 Data would be present in various servers
 Requires high-speed network connections
 It is easier to create as compared to a DW
 Requires no creation of a new database
 Requires a network expert to set up the network connection

DW:
 Preferred when the source of information can be taken from one location
 The entire DW would be present in one server
 Requires no network connections
 Its creation is not as easy as that of a federated database
 The DW must be created from scratch
 Requires database experts such as data stewards
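
Although the slides tie memory-mapped structures to .NET, the mechanism exists in most environments. As an assumed, minimal illustration, Python's mmap module maps a file into memory so it can be read and modified like an in-memory byte array:

```python
import mmap
import os
import tempfile

# Create a small data file to map (a stand-in for a large dataset).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello warehouse")

# Map the file into memory and access it like a byte array.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        first_word = bytes(mm[:5])  # read straight from the mapping
        mm[:5] = b"HELLO"           # in-place modification through memory

# The change made through the mapping is visible in the file itself.
with open(path, "rb") as f:
    content = f.read()
print(first_word, content)  # b'hello' b'HELLO warehouse'

os.remove(path)
```

The speed advantage comes from the operating system paging file contents directly into the process's address space instead of copying through read/write buffers.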
Data Integration Technologies
The technologies that are used for data integration include:
 Data interchange - a structured transmission of organizational data between two or more organizations through electronic means, used for the transfer of electronic documents from one computer to another.
 Object brokering - an object request broker (ORB) is middleware software. It gives programmers the freedom to make calls from one computer to another over a computer network.
 Modeling techniques
 Entity-Relationship Modeling - An entity-relationship model (ER model) describes the structure of a database with the help of a diagram, known as an Entity Relationship Diagram (ER Diagram).

Data Integration Technologies (cont.)

 Modeling techniques
 Dimensional Modeling - Dimensional Modeling (DM) is a data structure technique optimized for data storage in a data warehouse. The purpose of dimensional modeling is to optimize the database for faster retrieval of data.

Difference between ER Modelling and Dimensional Modelling

ER modelling:
 Optimised for transactional data
 Eliminates redundant data
 Highly normalised
 A complex maze of hundreds of entities linked with each other
 Useful for transactional systems
 Split as per the entities

Dimensional modelling:
 Optimised for queryability and performance
 Does not eliminate redundant data where appropriate
 Aggregates most of the attributes and hierarchies of a dimension into a single entity
 A logically grouped set of star schemas
 Useful for analytical systems
 Split as per the dimensions and facts
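
A dimensional model's star schema, a central fact table joined to denormalized dimension tables, can be sketched as follows; the sales fact and the date/product dimensions are invented for the example:

```python
import sqlite3

dw = sqlite3.connect(":memory:")

# Dimension tables: denormalized, one row per member, descriptive attributes.
dw.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER)")
dw.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT)")

# Fact table: foreign keys into each dimension plus numeric measures.
dw.execute("""CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date,
    product_key INTEGER REFERENCES dim_product,
    units INTEGER, revenue REAL)""")

dw.execute("INSERT INTO dim_date VALUES (20240101, 2024, 1)")
dw.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
dw.execute("INSERT INTO fact_sales VALUES (20240101, 1, 3, 29.97)")

# A typical analytical query: slice the facts by dimension attributes.
result = dw.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    WHERE d.year = 2024
    GROUP BY p.category
""").fetchall()
print(result)  # [('Hardware', 29.97)]
```

Every analytical query follows this star-shaped join pattern, which is why the layout is fast to retrieve despite its deliberate redundancy.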
Advantages of Using Data Integration
 Of benefit to decision-makers, who have access to important information from past studies
 Reduces cost, overlaps and redundancies; reduces exposure to risks
 Helps to monitor key variables like trends, consumer behaviour, etc.

Challenges in Data Integration
 Development challenges
o Translation of relational databases to object-oriented applications
o Consistent and inconsistent metadata
o Handling redundant and missing data
o Normalization of data from different sources
 Technological challenges
o Various formats of data
o Structured and unstructured data
o Huge volumes of data
 Organizational challenges
o Unavailability of data
o Manual integration risk and failure

Data Quality

 Consistency - When one piece of data is stored in multiple locations, do the copies have the same values?
 Accuracy - Does the data accurately describe the properties of the object it is meant to model?
 Relevance - Is the data appropriate to support the objective?
 Existence - Does the organization have the right data?
 Integrity - How accurate are the relationships between data elements and data sets?
 Validity - Are the values acceptable?

Data Quality (cont.)

 Correcting, standardizing and validating the information
 Creating business rules to correct, standardize and validate your data
 High-quality data is essential to successful business operations
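
Business rules for validity and consistency checks can be sketched as small predicate functions run over each record; the customer fields and the rules themselves are made-up examples:

```python
# Each business rule is a (name, predicate) pair applied to a record.
rules = [
    ("id present",   lambda r: bool(r.get("id"))),
    ("valid email",  lambda r: "@" in r.get("email", "")),
    ("non-negative", lambda r: r.get("balance", 0) >= 0),
]

def validate(record):
    """Return the names of the business rules the record violates."""
    return [name for name, check in rules if not check(record)]

good = {"id": 7, "email": "a@example.com", "balance": 12.0}
bad = {"id": None, "email": "not-an-email", "balance": -5}

print(validate(good))  # []
print(validate(bad))   # ['id present', 'valid email', 'non-negative']
```

In practice such rules drive both rejection reports and the correction/standardization steps mentioned above.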
Data Quality (cont.)

Data quality helps you to:
 Plan and prioritize data
 Parse data
 Standardize, correct and normalize data
 Verify and validate data accuracy
 Apply business rules
 Increase the quality of information

Data Quality in Data Integration

 An effective data integration strategy can lower costs and improve productivity by ensuring the consistency, accuracy and reliability of data.
 Data integration enables you to:
 Match, link and consolidate multiple data sources
 Gain access to the right data sources at the right time
 Deliver high-quality information

Data Quality in Data Integration (cont.)

 Understand corporate information anywhere in the enterprise.
 Data integration involves combining processes and technology to ensure effective use of the data can be made.
 Data integration can include:
 Data movement
 Data linking and matching
 Data householding

ETL Tools

 ETL tools can be grouped into four categories based on their infrastructure and supporting organization or vendor: enterprise-grade, open-source, cloud-based, and custom ETL tools.
 Enterprise Software ETL Tools - These tools are developed and supported by commercial organizations. These solutions tend to be the most robust and mature in the marketplace.
 Open-Source ETL Tools
 Cloud-Based ETL Tools
 Custom ETL Tools - Companies with development resources may produce proprietary ETL tools using general programming languages.
ETL Tools (cont.)

ETL processes can also be created using a programming language. Some open-source ETL framework tools:
 Hevo Data
 Apache Camel
 Airbyte
 Apache Kafka
 Logstash
 Pentaho Kettle
 Talend Open Studio
 Singer
 KETL, Apache NiFi and CloverDX

Some popular ETL tools:
 IBM DataStage
 Oracle Data Integrator
 Informatica PowerCenter
 SAS Data Management
 Talend Open Studio
 Pentaho Data Integration
 Singer
 Hadoop
 Dataddo
 AWS Glue, Azure Data Factory, Google Cloud Dataflow
