Data Integration - Techniques For Extracting, Transforming and Loading Data
25 March 2019
Module 1 –
Source Data Analysis and Modeling
Data Acquisition Concepts
• What is Data Acquisition?
  – Data acquisition is the process of moving data to and within the data warehousing environment
• Goal oriented
  – Not an isolated activity. Like data modeling, it is driven by the goals and purpose of the data warehouse
• Source Driven / Target Driven
  – Source Driven – the activities necessary for getting data into the warehousing environment from the various sources (operational data, external data)
Module 1 – Source Data Analysis & Modeling
Source Data Analysis – Scope of Data Sources
• Data acquisition issues
– Identifying and understanding data sources
– Mapping source data to target data
– Deciding which data to capture
– Effectively and efficiently capturing the data, and
– Determining how and when to transform the data
• Acquisition process design must pay attention to:
  – Ensuring data quality
  – Efficiently loading warehouse databases
  – Using tools effectively
  – Planning for error recovery

[Diagram: What kinds of data? Subjects (Product, Customer, Finance, Process, HR, Organization); Entities (Customer, Order, Product…); Business events (Receive order, Ship order, Cancel order…); Enterprise events (Merger with…, Acquisition of…, Termination of…); History? (2003, 2004, 2005)]
Source Data Analysis – Identification of sources
• Types of sources
– Operational Systems
– Secondary Systems
– Backups, Logs, and Archives
– Shadow Systems
– DSS/EIS Systems, and
– External Data
Source Data Analysis – Evaluation of Sources
Qualifying criteria Assessment Questions
Availability How available and accessible is the data?
Are there technical obstacles to access?
Or ownership and access authority issues?
Understandability How easily understood is the data?
Is it well documented?
Does someone in the organization have depth of knowledge?
Who works regularly with this data?
Stability How frequently do data structures change?
What is the history of change for the data?
What is the expected life span of the potential data source?
Granularity Is the source the lowest available grain (most detailed level) for this data?
Evaluation of Sources – Origin of data
• Original Point of Entry
  – Best practice is to evaluate the original point of entry. Ask: “Is this the very first place that the data is recorded anywhere within the business?”
  – If “yes”, then you have found the original point of entry. If “no”, then the source may not be the original point of entry. Ask the follow-up question: “Can the element be updated in this file/table?”
  – If not, then this is not the original point of entry. If “yes”, then the data element may be useful as data warehousing source data
CUSTOMER MASTER FILE
  CUSTOMER-NUMBER  N(9)
  CUSTOMER-NAME    A(32)
  GENDER           A(1)
  DATE-OF-BIRTH    N(6)

CLAIM FILE
  CLAIM-NUMBER     Unique number to identify the claim                   N(9)
  POLICY-NUMBER    ID of policy against which the claim is filed         NNNN-XXX-NN
  CUSTOMER-NUMBER  ID number of the customer who filed the claim         N(9)
  CUSTOMER-NAME    Name of the policy holder                             A(32)
  DRIVER-NAME      Name of the driver (if any) involved in the accident  A(32)
  DRIVER-ID        State and driver's license number of the driver       X(22)
Evaluation of Sources – Origin of data
• Original Point of Entry – This practice has many benefits
– Data timeliness and accuracy are improved
– Simplifies the set of extracts from the source system
• Data Stewardship
  – In organizations that have a data stewardship program, involve the data stewards
Evaluation of Sources – An example

[Matrix: each candidate source file rated (+ / - / X) against the qualifying criteria – availability, understandability, stability, granularity, completeness, timeliness, accuracy, point of origin, system of record – with the point-of-origin / system-of-record subject noted where known:]

  APMS Policy Master File   + - - + + + +   policy   policy
  APMS Driver File          X - - + - X +   driver   driver
  APMS Premium File         - X + - X - +   premium
  CPS Claim Master File     + X + - X + +
  CPS Claim Detail File     + X - X X + +
  CPS Claim Action File     - + + - + - +

FIELD            ATTRIBUTE                                             ID(KEY)  ENTITY    RELATIONSHIP  POE  SOR
CLAIM-NUMBER     Unique number to identify the claim                   Y        Claim     Y             Y    Y
POLICY-NUMBER    ID of policy against which the claim is filed         N        Policy    Y             N    Y
CUSTOMER-NUMBER  ID number of the customer who filed the claim         N        Customer  Y             N    N
CUSTOMER-NAME    Name of the policy holder                             N        Customer  N             N    N
DRIVER-NAME      Name of the driver (if any) involved in the accident  N        Claim     N             Y    N
DRIVER-ID        State and driver's license number of the driver       N        Claim     N             Y    N
VIN              Vehicle ID number of involved vehicle                 N        Claim     N             Y    Y
Source Data Modeling – Overview of Warehouse Data Modeling

• Contextual Models
  – Business Goals and Drivers
  – Information Needs
  – Warehousing subjects
  – Business Questions
• Conceptual Models
  – Source composition
  – Facts and Qualifiers
  – Source subjects
  – Targets configuration
Source Data Modeling
Source Data Modeling Objectives
• A single logical model representing a design view of all source data within scope
• An entity relationship model in 3rd normal form (a business model without implementation redundancies)
• Traceability from logical entities to the specific data sources that implement those
entities
• Traceability from logical relationships to the specific data sources that implement those
relationships
• Verification that each logical attribute is identifiable in implemented data sources
Source Data Modeling – The activities
[Workflow diagram: business drivers and existing source data stores feed source data modeling, which in turn feeds target data modeling.
 Contextual (scope): what kinds of source data stores exist? → source composition model
 Conceptual (analyze): does a source model exist? If yes, validate the existing data model; if no, choose a modeling approach – top-down (source subject model) or bottom-up (from existing data)
 Logical (design): integrate into the source logical model (ERM)
 Structural (specify): structure of data store (matrix)
 Physical (optimize): existing file descriptions
 Functional (implement): locate and extract from the existing data store]
Module 1 – Source Data Analysis & Modeling
Understand the Scope of Data Sources
[Workflow diagram excerpt – Context (scope): what kinds of source data stores exist?]
• Source composition model uses set notation to develop a subject area model
• Classifies each source by the business subjects that it supports
• Helps to understand
• which subjects have a robust set of sources
• which sources address a broad range of business subjects
• Helpful to plan, size, sequence and schedule development of the DW increments
Composition Subject Model - Example
[Set diagram relating source files to business subjects – subjects: CUSTOMER, REVENUE, POLICY, CLAIM, EXPENSE, INCIDENT, ORGANIZATION, PARTY, MARKETPLACE; files: MIS customer table, MIS product table, MIS auto marketplace table, MIS residential marketplace table, LIS policy file, LIS claim file, APMS premium file, APMS policy master, RPS policy file, CPS claim master, CPS claim action file, CPS claim detail file, CPS party file]
Composition Subject Matrix Example
[Matrix: systems and their files/tables (rows) mapped against subjects (columns): CUSTOMER, REVENUE, POLICY, CLAIM, EXPENSE, INCIDENT, ORGANIZATION, PARTY, MARKETPLACE]
Module 1 – Source Data Analysis & Modeling
Understanding source content - Integrated view of Non-integrated sources
Source Data Modeling…
“Not within the charter of the warehousing program to redesign data sources”
Understanding Source content – Using existing models

[Workflow diagram: when a source model exists, validate the existing data model during conceptual analysis, then integrate it into the source logical model (ERM)]
Understanding Source content – Working Top Down

[Workflow diagram: when no source model exists, the top-down approach builds a source subject model during conceptual analysis, then integrates it into the source logical model (ERM)]
Understanding Source Content – Working Bottom-Up
• Derive the data model from the File descriptions
• The source data element matrix serves as the tool to perform source data modeling
• Source modeling and source assessment work well together and share the same set of
documentation techniques.
Integrating Multiple Views – Resolving Redundancy & Conflict

[Diagram: model states – discover patterns, resolve redundancy and conflict, normalize, verify model]
Module 1 – Source Data Analysis & Modeling
Logical to Physical Mapping
Tracing Business data to Physical Implementation
• Two-way connection
– What attribute is implemented by this field/column?
– Which fields/columns implement this attribute?
• Documenting the mapping
– Extend the source data element matrix to include all data sources
– Provides comprehensive documentation of source data, and detailed mapping from
the business view to implemented data elements
Module 1 – Source Data Analysis & Modeling
Understanding the source systems – A Team Effort
• Understanding the source systems feeding the warehousing environment is a critical success factor
Module 2 –
Data Capture Analysis and Design
Source/Target Mapping
Data Capture Concepts – An overview
• Data capture – activities involved in getting data out of sources
  • Synonym for Data Extraction

Data capture Analysis – What to extract?
– Performed to understand requirements for data capture
  • Which data entities and elements are required by target data stores?
  • Which data sources are needed to meet target requirements?
  • What data elements need to be extracted from each data source?

Data capture Design – When to extract? How to extract?
– Performed to understand and specify methods of data capture
  • Timing of extracts from each data source
  • Kinds of data to be extracted
  • Occurrences of data (all or changes only) to be captured
  • Change detection methods
  • Extract technique (snapshot or audit trail)
  • Data capture method (push from source or pull from warehouse)
Module 2 – Data Capture Analysis & Design
Source/Target Mapping
Source/Target Mapping – Mapping Objectives
– Primary technique used to perform data capture analysis
– Mapping source data to target data
– Three levels of detail
  • Data Stores
  • Entities
  • Data Elements
– The terms “source” and “target” describe roles that a data store may play, not fixed characteristics of a data store
Source and Target as Roles

[Diagram: data sources → ETL → data warehouse]
Mapping Techniques

[Diagram: source & target data models (e.g. customer, product, service entities) and structural models are the inputs to mapping]
Source/Target Mapping: EXAMPLES

[Three mapping matrices, one per level of detail:
 1. Entity mapping – entities from the source logical data model (MEMBER, SALES TRANSACTION, PRODUCT, PRODUCT INVENTORY, RESTOCK TRANSACTION) mapped (√) to entities from target logical models (CUSTOMER, PRODUCT, SALES TRANSACTION, INVENTORY TRANSACTION, PRODUCT TURNOVER).
 2. Data store mapping – files/tables from the source structural model (MEMBERSHIP MASTER FILE, CUSTOMER ADDRESS FILE, POINT-OF-SALE DETAIL FILE, PRODUCT INVENTORY FILE, WAREHOUSE TO STORE ACTIVITY TABLE) mapped to target tables (CUSTOMER, PRODUCT, SALES TRANSACTION, INVENTORY TRANSACTION, PRODUCT TURNOVER).
 3. Element mapping – source data elements (MEMBERSHIP MASTER: member-number, membership-type, date-joined, date-last-renewed, term-last-renewed, date-of-last-activity, zip-code; POINT OF SALE DETAIL: transaction-ID, line-number, …) mapped to target data elements (customer number, customer last name, customer city, customer state, renewal date, transaction ID, line number, terminal ID, SKU ID, transaction quantity).]
Source/target Mapping: Full set of Data elements

[Diagram: the full set of data elements = elements identified by source/target mapping + elements added by triage + elements added by transform logic (from triage and transform design)]
Module 2 – Data Capture Analysis & Design
Source/Target Mapping
Source Data Triage
What to extract - opportunities
• Source/target mapping analyzes need, triage analyzes opportunity
• Triage is about extracting all data with potential value
What is Triage?
• Source data structures are analyzed to determine the appropriate data elements for inclusion
Why Triage?
• Ensure that a complete set of attributes is captured in the warehousing environment.
• Rework is minimized
The Triage Technique

[Diagram: triage is applied to the source systems needed for the increment]
Module 2 – Data Capture Analysis & Design
Source/Target Mapping
Kinds of Data

Event Data
  Sales Transaction: Transaction Date, Transaction Time, Payment Method, Transaction Status, Register ID, Store Number, Checker Employee ID, Member Number

Reference Data
  Customer: Member Number, Membership Type, Last Name, First Name, Date Joined, Last Activity Date, Last Update Date, Last Update Time, Last Update User ID
  Store: Store Number, Store Name, Store Address, Phone Number, Manager Employee ID
  Employee: Employee ID, Employee Name, Employee Address, Store Number

Source system metadata; source system keys
Data Capture Methods
                     ALL DATA                        CHANGED DATA
PUSH TO WAREHOUSE    Replicate source files/tables   Replicate source changes or transactions
Detecting Data Changes
• Detecting changes at source – how to know which data has changed?
  • source date/time stamps
  • source transaction files/logs
  • replicate source data
  • DBMS logs
  • compare back-up files
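The simplest of these methods, source date/time stamps, can be sketched in Python; the row layout and field names here are illustrative, not from the slides:

```python
from datetime import datetime

def changed_rows(rows, last_extract_time):
    # Date/time-stamp method: keep only rows stamped after the previous extract
    return [r for r in rows if r["last_update"] > last_extract_time]

source_rows = [
    {"id": 1, "last_update": datetime(2019, 3, 24, 8, 0)},
    {"id": 2, "last_update": datetime(2019, 3, 25, 9, 30)},
]
# Extract only what changed since midnight of the 25th
delta = changed_rows(source_rows, datetime(2019, 3, 25, 0, 0))
```

The same shape applies to the other methods: transaction logs and DBMS logs supply the changed keys directly, while back-up comparison derives them by diffing two snapshots.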
Module 2 – Data Capture Analysis & Design
Source/Target Mapping
Timing issues

[Diagram: OLTP sources → data extraction → work tables → data transformation → warehouse loading → intake layer → data marts. Timing concerns: frequency of acquisition, latency of load, periodicity of data mart data]
Source System Considerations

[Diagram: OLTP sources → data extraction → work tables]

When is the data ready in each source system?
Handling time variance – techniques and methods
• SNAPSHOT
– Periodically posts records as of a specific point in time
– Records all data of interest without regard to changes
– Acquisition techniques to create snapshots
• DBMS replication
• Full File Unload or Copy
• AUDIT TRAIL
– Records details of each change to data of interest
– Details may include date and time of change, how the change was detected, reason for
change, before and after data values, etc.
– Acquisition techniques
• DBMS triggers
• DBMS replication
• Incremental selection
• Full file unload/copy
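The two techniques can be contrasted in a small Python sketch (the record shapes are illustrative, not from the slides): a snapshot copies everything as of a point in time, while an audit trail records each change with its before and after values.

```python
import copy
from datetime import datetime

def take_snapshot(table, as_of):
    # Snapshot: record all data of interest as of a point in time,
    # without regard to what changed
    return {"as_of": as_of, "rows": copy.deepcopy(table)}

def audit_record(key, before, after, reason, changed_at):
    # Audit trail: record the details of each individual change
    return {"key": key, "changed_at": changed_at, "reason": reason,
            "before": before, "after": after}

policies = [{"policy": "A-1", "premium": 100}]
snap = take_snapshot(policies, datetime(2019, 3, 25))
entry = audit_record("A-1", {"premium": 100}, {"premium": 120},
                     "rate change", datetime(2019, 3, 26))
```

A snapshot answers "what did the data look like then?"; an audit trail answers "what happened to it, and when?".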
Module 3 –
Data Transformation Analysis &
Design
Transformation Analysis
Transformation Design
Transformation concepts – An overview
• Data Transformation
• Changes that occur to the data after it is extracted
• Transformation processing removes
– Complexities of operational environments
– Conflicts and redundancies of multiple databases
– Details of daily operations
– Obscurity of highly encoded data
• Transformation Analysis
• Integrate disparate data
• Change granularity of data
• Assure data quality
• Transformation Design
• Specifies the processing needed to meet the requirements that are determined by
transformation analysis
• Determining kinds of transformations
– Selection
– Filtering
– Conversion
– Translation
– Derivation
– Summarization
– Organized into programs, scripts, modules, jobs, etc. that are compatible with chosen tools
and technology
Module 3 – Data Transformation Analysis & Design
Transformation Analysis
Transformation Design
Data Integration Requirements
• Integration
– Create a single view of the data
Data Granularity Requirements
• Granularity
– Each change of data grain, from atomic data to progressively higher levels of
summary – achieved through transformation
Data Quality Requirements
• Data Cleansing
  – process by which data quality needs are met
  – ranges from filtering bad data to replacing data values with some alternative default or derived values
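Both ends of that range can be sketched in a few lines of Python (the valid-value sets and rules here are illustrative assumptions, not from the slides):

```python
def cleanse(value, valid_values, default):
    # Replace a bad value with an alternative default value
    return value if value in valid_values else default

def filter_bad(rows, is_good):
    # Or simply filter bad data out of the load entirely
    return [r for r in rows if is_good(r)]

# Hypothetical gender code cleansing: unknown codes default to "U"
fixed = cleanse("X", {"M", "F", "U"}, "U")

# Hypothetical filter: drop rows with a missing customer number
rows = [{"customer": 101}, {"customer": None}]
kept = filter_bad(rows, lambda r: r["customer"] is not None)
```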
Module 3 – Data Transformation Analysis & Design
Transformation Analysis
Transformation Design
Transformation Design - Approach

transformation requirements → determine transformation sequences → specify transformation processes → transformation specifications
Module 3 – Data Transformation Analysis & Design
Transformation Analysis
Transformation Design
Kinds of transformations

This transformation type… is used to… (each type – Selection, Filtering, Conversion, Translation, Derivation, Summarization – is detailed on the slides that follow)
Selection

Choose among alternative sources based upon selection rules.

[Diagram: Extracted Source #1 and Extracted Source #2 → Select → Transformed target data; target data comes sometimes from source 1, sometimes from source 2]
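A minimal selection rule in Python; the prefer-source-1 policy and field names are illustrative assumptions, not from the slides:

```python
def select_customer(customer_number, source1, source2):
    # Selection rule (hypothetical): prefer source 1 when it holds the record,
    # otherwise take the record from source 2
    if customer_number in source1:
        return source1[customer_number]
    return source2.get(customer_number)

s1 = {101: {"name": "Ames"}}
s2 = {101: {"name": "AMES, J"}, 102: {"name": "Baker"}}
chosen = select_customer(102, s1, s2)
```

The target row comes sometimes from source 1 and sometimes from source 2, exactly as the diagram describes.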
Filtering

Eliminate some data from the target set of data based on filtering rules.

[Diagram: Extracted source data → Filter → target data]
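Filtering is the simplest transformation to sketch; the cancelled-transaction rule here is an illustrative assumption:

```python
def apply_filter(rows, rule):
    # Keep only the rows that satisfy the filtering rule
    return [r for r in rows if rule(r)]

transactions = [
    {"id": 1, "status": "complete"},
    {"id": 2, "status": "cancelled"},
    {"id": 3, "status": "complete"},
]
# Hypothetical rule: cancelled transactions never reach the warehouse
kept = apply_filter(transactions, lambda t: t["status"] != "cancelled")
```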
Conversion

Change data content and/or format based on conversion rules.

[Diagram: Extracted source data → Convert → Transformed target data; value/format in is different than value/format out]
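A typical conversion, sketched in Python: a six-digit N(6) YYMMDD date (like the DATE-OF-BIRTH field earlier in the deck) converted to ISO format. The century pivot rule is an illustrative assumption:

```python
def convert_date(yymmdd):
    # Value/format in differs from value/format out:
    # N(6) "YYMMDD" becomes ISO "YYYY-MM-DD"
    yy, mm, dd = yymmdd[:2], yymmdd[2:4], yymmdd[4:6]
    # Hypothetical pivot: two-digit years >= 30 are 19xx, otherwise 20xx
    century = "19" if int(yy) >= 30 else "20"
    return f"{century}{yy}-{mm}-{dd}"
```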
Translation

Decode data whose values are encoded, based on rules for translation.

[Diagram: Extracted source data (encoded values in) → Translate → Transformed target data (both encoded and decoded values out)]
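A translation sketch using the GENDER A(1) field from the customer master example earlier in the deck; the decode table is an illustrative assumption. Note that, as the diagram says, both the encoded and the decoded value appear in the output:

```python
GENDER_DECODE = {"M": "Male", "F": "Female"}  # hypothetical translation rules

def translate_gender(code):
    # Keep the encoded value and add the decoded value,
    # so both are available in the target
    return {"gender-code": code,
            "gender": GENDER_DECODE.get(code, "Unknown")}
```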
Derivation

Use existing data values to create new data based on derivation rules.

[Diagram: Extracted source data → Derive → Transformed target data; new data values created – more values out than in]
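A derivation sketch; the line-total rule is an illustrative assumption, chosen because it fits the point-of-sale detail data used elsewhere in the deck:

```python
def derive_line_total(quantity, unit_price):
    # A new data value created from existing values:
    # more values out than in (quantity, price in; quantity, price, total out)
    return round(quantity * unit_price, 2)
```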
Summarization

Change data granularity based on rules of summarization.

[Diagram: Extracted source data (atomic or base data in) → Summarize → Transformed target data (summary data out)]

‘for each store (for each product line (for each day (count the number of transactions, accumulate the total dollar value of the transactions)))’
‘for each week (sum daily transaction count, sum daily dollar total)’
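The first rule quoted above can be implemented directly; the transaction record shape is an illustrative assumption:

```python
from collections import defaultdict

def summarize(transactions):
    # for each store (for each product line (for each day
    #   (count the transactions, accumulate the total dollar value)))
    totals = defaultdict(lambda: {"count": 0, "dollars": 0.0})
    for t in transactions:
        key = (t["store"], t["product_line"], t["day"])
        totals[key]["count"] += 1
        totals[key]["dollars"] += t["amount"]
    return dict(totals)

txns = [
    {"store": 1, "product_line": "A", "day": "2019-03-25", "amount": 10.0},
    {"store": 1, "product_line": "A", "day": "2019-03-25", "amount": 5.0},
    {"store": 2, "product_line": "B", "day": "2019-03-25", "amount": 7.0},
]
daily = summarize(txns)
```

The second rule (weekly rollup) would run the same pattern again over the daily output, with the week as the grouping key.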
Identifying Transformation Rules
CUSTOMER SALES
CUSTOMER ADDRESS TRANSACTION
transaction quantity
customer last name
customer biz name
customer number
customer state
transaction ID
customer city
renewal date
line number
terminal ID
SKU ID
member-number √ √
MEMBERSHIP MASTER
membership-type √ √ √
date-joined
date-last-renewed √
term-last-renewed √
date-of-last-activity
name √ √ √ for any source-to-target data
address √ element association, what
city-and-state √ √
zip-code √needs exist for:
transcation-ID • selection?
√
SALE DETAIL
line-number √
• filtering?
POINT OF
• conversion?
• translation?
• derivation?
• summarization?
25 March 2019
Specifying Transformation Rules

[The mapping matrix cells expand to identify transformations by type and rule ID – e.g. membership-type → cleansing DTR027 (default value); name → derivation DTR008 (derive name)]

DTR027 (Default Membership Type)
  If membership-type is null or invalid
    assume “family” membership

DTR008 (Derive Name)
  If membership-type is “family”
    separate name using comma
    insert characters prior to comma in customer-last-name
    insert characters after comma in customer-first-name
  else move name to customer-biz-name
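The two rules above translate almost line-for-line into code; the set of valid membership types is an illustrative assumption, since the slides name only “family”:

```python
VALID_MEMBERSHIP_TYPES = {"family", "individual"}  # hypothetical valid set

def dtr027_default_membership_type(membership_type):
    # DTR027: if membership-type is null or invalid, assume "family" membership
    if membership_type not in VALID_MEMBERSHIP_TYPES:
        return "family"
    return membership_type

def dtr008_derive_name(name, membership_type):
    # DTR008: for "family" memberships, separate name on the comma into
    # customer-last-name / customer-first-name; otherwise move name
    # to customer-biz-name
    out = {"customer-last-name": None, "customer-first-name": None,
           "customer-biz-name": None}
    if membership_type == "family":
        last, _, first = name.partition(",")
        out["customer-last-name"] = last.strip()
        out["customer-first-name"] = first.strip()
    else:
        out["customer-biz-name"] = name
    return out
```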
Module 3 – Data Transformation Analysis & Design
Transformation Analysis
Transformation Design
Dependencies and Sequences
• Time Dependency – when one transformation rule must execute before another
• example: summarization of derived data cannot occur before the derivation
• Rule Dependency – when execution of a transformation rule is based upon the result of
another rule
• example: different translations occur depending on source chosen by a selection
rule
Dependencies and Sequences

[Diagram: transformation rule specification steps – specify selection, specify filtering, specify derivation, specify summarization]

1. Identify the transformation rules
2. Understand rule dependency – package as modules
3. Understand time dependency – package as processes
4. Validate and define the test plan
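Ordering rules by their dependencies is a topological sort; the standard library does this directly. The rule IDs and dependencies below are illustrative (DTR027 and DTR008 echo the earlier slides; the summarization rule DTR040 is hypothetical):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each rule maps to the set of rules it depends on:
# DTR008 (derive name) must run after DTR027 (default membership type);
# DTR040 (summarize) must run after the derivation DTR008
dependencies = {"DTR008": {"DTR027"}, "DTR040": {"DTR008"}}

order = list(TopologicalSorter(dependencies).static_order())
```

`static_order` raises an error on cycles, which also makes it a cheap validity check on the rule-dependency packaging in step 2.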
Modules and Programs

DTR027 (Default Membership Type)
  If membership-type is null or invalid
    assume “family” membership

DTR008 (Derive Name)
  If membership-type is “family”
    separate name using comma
    insert characters prior to comma in customer-last-name
    insert characters after comma in customer-first-name
  else move name to customer-biz-name

Transformation rules + dependencies among rules → structures of modules, programs, scripts, etc.
Job Streams & Manual Procedures - completing the ETL design

[Diagram: extract & load job streams – scheduling, dependencies, execution, communication]
Module 4 –
Data Transportation & Loading Design
Overview

Source Data → Extract → (data transport) → Transform → Load → Target Data
Module 4 – Data Transport & Load Design
Data Transport Issues

• transport frequency?
• network capacity?
• ASCII vs EBCDIC?
• data security?
• transport methods?

Considerations: alternatives to FTP, data compression, data encryption, ETL tools
Database Load Issues

• which DBMS?
• relational vs dimensional?
• tables & indices?
• load frequency?
• load timing?
• data volumes?
• exception handling?
• restart & recovery?
• load methods?
• referential integrity?
Populating Tables
• Drop and rebuild the tables
• Insert (only) rows into a table
• Delete old rows and insert changed rows
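The third option can be sketched with the standard library's sqlite3 module; the table and the changed-row set are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_number INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ames"), (2, "Baker")])

# "Delete old rows and insert changed rows": remove each changed key,
# then re-insert the fresh version (new keys simply insert)
changed = [(2, "Barker"), (3, "Cole")]
conn.executemany("DELETE FROM customer WHERE customer_number = ?",
                 [(k,) for k, _ in changed])
conn.executemany("INSERT INTO customer VALUES (?, ?)", changed)

rows = sorted(conn.execute("SELECT customer_number, name FROM customer"))
```

Drop-and-rebuild replaces the whole table each load, while insert-only appends; delete-and-insert is the middle ground for tables that receive corrections.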
Indexing

[Diagram: the load process maintains both tables and their indices]
Updating

[Diagram: should the load allow updating of rows in tables? Load → tables and indices]
Referential Integrity
• RI is the condition where every reference to another table has a foreign key/primary key
match.
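A load-time RI check can be sketched as a simple lookup against the parent table's keys; the claim/policy example reuses entities from earlier in the deck, and the field names are illustrative:

```python
def ri_violations(child_rows, fk_field, parent_keys):
    # Rows whose foreign key has no matching primary key in the parent table
    return [r for r in child_rows if r[fk_field] not in parent_keys]

claims = [{"claim": 1, "policy": "P-1"}, {"claim": 2, "policy": "P-9"}]
policies_loaded = {"P-1", "P-2"}

bad = ri_violations(claims, "policy", policies_loaded)
```

Rows returned here are candidates for the exception processing described below: suspend, report, log, or discard.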
Timing Considerations
• User Expectations
• Data Readiness
• Database synchronization
Exception Processing

[Diagram: Transform → Load; rows that pass (“ok”) are loaded to the target data; exceptions are suspended, reported, logged, or discarded]
Integrating with ETL processes

[Diagram: the EXTRACT, TRANSFORM, and LOAD processes each involve scheduling, dependencies, execution, verification, communication, restart/recovery, process metadata, and parallel processing/tool capabilities]

• Loading as a part of a single transform & load job stream
• Loads triggered by completion of the transform job stream
• Loads triggered by verification of transforms
• Parallel ETL processing
• Loading partitions
• Updating summary tables
Module 5 – Implementation Guidelines
ETL Summary
Technology in Data Acquisition
ETL Technology
Data Mapping
Database Management
Data Transformation
Source Systems
Database Loading
Data Access
Data Conversion
Data Cleansing
Data Movement
Storage Management
Metadata Management
ETL - Critical Success Factors

[Matrix: each critical success factor is marked (v) against data store roles (intake, integration, distribution, information delivery) and data transformation roles (integration, granularity, cleansing)]
1. Design for the Future, Not for the Present v v v v v v
2. Capture and store only changed data v
3. Fully understand source systems and data v v v v v
4. Allow enough time to do the job right v v v v v v
5. Use the right sources, not the easy ones v v v
6. Pay attention to data quality v v v v v v
7. Capture comprehensive ETL metadata v v v v v v
8. Test thoroughly and according to a test plan v v v v v v
9. Distinguish between one-time and ongoing loads v v v
10. Use the right technology for the right reasons v v v v v v
11. Triage source attributes v v v
12. Capture atomic level detail v v
13. Strive for subject orientation and integration v v
14. Capture history of changes in audit trail form v
15. Modularize ETL processing v v v v v v
16. Ensure that business data is non-volatile v v v
17. Use bulk loads and/or insert-only processing v
18. Complete subject orientation and integration v v
19. Use the right data structures (relational vs. dimensional) v v v v v
20. Use shared transformation rules and logic v v v v v
21. Design for distribution first, then for access v
22. Fully understand each unique access need v v v
23. Use DBMS update capabilities v v v
24. Design for access before other purposes v v
25. Design for access tool capabilities v v
26. Capture quality metadata and report data quality v v v
Exercises