
Data Integration –
Techniques for Extracting, Transforming and Loading Data

Internal and Confidential


Agenda
• Module 1 – Source Data Analysis & Modeling
• Module 2 – Data Capture Analysis & Design
• Module 3 – Data Transformation Analysis & Design
• Module 4 – Data Transport & Load Design
• Module 5 – Implementation Guidelines
Module 1 –
Source Data Analysis and Modeling



Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Data Acquisition Concepts
• What is Data Acquisition?
– Data acquisition is the process of moving data to and within the data warehousing environment
• Goal oriented
– Not an isolated activity. Like data modeling, it is driven by the goals and purpose of the data warehouse
• Source Driven / Target Driven
– Source Driven – the activities necessary for getting data into the warehousing environment from various sources
– Target Driven – activities for acquisition within the warehousing environment
• Data Acquisition activities
– Identifying the set of data needed
– Choosing the extract approach
– Extracting data from the source
– Applying transformations
– Loading the data warehouse
[Diagram: data sources (operational and external data) feed Data Intake (ETL into staging), Data Distribution (ETL from staging into the data warehouse), and Information Delivery (ETL from the warehouse into data marts for access).]
Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Source Data Analysis – Scope of Data Sources
• Data acquisition issues
– Identifying and understanding data sources
– Mapping source data to target data
– Deciding which data to capture
– Effectively and efficiently capturing the data, and
– Determining how and when to transform the data
• Acquisition process design must pay attention to:
– Ensuring data quality
– Efficiently loading warehouse databases
– Using tools effectively
– Planning for error recovery
[Diagram: scoping questions for data sources – Which subjects (product, customer, finance, process, HR, organization)? Which entities (customer, order, product…)? What kinds of data? How much history (2003, 2004, 2005)? Which business events (receive order, ship order, cancel order…) and enterprise events (merger with.., acquisition of.., termination of…)?]
Source Data Analysis – Identification of sources
• Types of sources
– Operational Systems
– Secondary Systems
– Backups, Logs, and Archives
– Shadow Systems
– DSS/EIS Systems, and
– External Data

• On-Going versus Single Load Sources

Source Data Analysis – Evaluation of Sources
Qualifying criteria Assessment Questions
Availability How available and accessible is the data?
Are there technical obstacles to access?
Or ownership and access authority issues?
Understandability How easily understood is the data?
Is it well documented?
Does someone in the organization have depth of knowledge?
Who works regularly with this data?
Stability How frequently do data structures change?
What is the history of change for the data?
What is the expected life span of the potential data source?

Accuracy How reliable is the data?


Do the business people who work with the data trust it?

Timeliness When and how often is the data updated?


How current is the data?
How much history is available?
How available is it for extraction?
Completeness Does the scope of data correspond to the scope of the data warehouse?
Is any data missing?

Granularity Is the source the lowest available grain (most detailed level) for this data?
Evaluation of Sources – Origin of data
• Original Point of Entry
– A best-practice technique is to evaluate the original point of entry: “Is this the very first
place that the data is recorded anywhere within the business?”
– If “yes”, then you have found the original point of entry. If “no”, then the source may not be
the original point of entry. Ask the follow-up question: “Can the element be updated in this
file/table?”
– If not, then this is not the original point of entry. If “yes”, then the data element may still be
useful as a data warehousing source
Point of Origin?

CUSTOMER MASTER FILE
CUSTOMER-NUMBER N(9)
CUSTOMER-NAME A(32)
GENDER A(1)
DATE-OF-BIRTH N(6)
SSN N(9)

CLAIM MASTER FILE
CLAIM-NUMBER Unique number to identify the claim N(9)
POLICY-NUMBER ID of policy against which the claim is filed NNNN-XXX-NN
CUSTOMER-NUMBER ID number of the customer who filed the claim N(9)
CUSTOMER-NAME Name of the policy holder A(32)
DRIVER-NAME Name of the driver (if any) involved in the accident A(32)
DRIVER-ID State and driver's license number of the driver X(22)
VIN Vehicle ID number of involved vehicle X(40)
STATUS-CODE Status of claim, blank=open, C=closed A(1)
INCIDENT-DATE Date of incident that initiated the claim YYYYMMDD
FILING-DATE Date that the claim was received YYYYMMDD
INCIDENT-TYPE-CODE What kind of incident resulted in a claim? N(1)
(1=accident, 2=theft, 3=vandalism, 4=fire, 5=earthquake, 6=hurricane, 7=tornado, 8=flood, 9=act of god)
Evaluation of Sources – Origin of data
• Original Point of Entry – This practice has many benefits
– Data timeliness and accuracy are improved
– Simplifies the set of extracts from the source system

• Business System of Record


– To what system do the business people go when they are validating results?
– If business identifies a system as the “System of Record” then it must be considered as a
probable warehousing data source

• Data Stewardship
– In organizations that have a data stewardship program, involve the data stewards
Evaluation of Sources – An example
[Example matrix: each candidate file – APMS Policy Master File, APMS Driver File, APMS Premium File, CPS Claim Master File, CPS Claim Detail File, CPS Claim Action File – is rated (+, -, or X) against the qualifying criteria (availability, understandability, stability, accuracy, timeliness, completeness, granularity) plus point of origin and system of record.]

FIELD | ATTRIBUTE | ID (KEY) | ENTITY | RELATIONSHIP | POE | SOR
CLAIM-NUMBER | Unique number to identify the claim | Y | Claim | Y | Y | Y
POLICY-NUMBER | ID of policy against which the claim is filed | N | Policy | Y | N | Y
CUSTOMER-NUMBER | ID number of the customer who filed the claim | N | Customer | Y | N | N
CUSTOMER-NAME | Name of the policy holder | N | Customer | N | N | N
DRIVER-NAME | Name of the driver (if any) involved in the accident | N | Claim | N | Y | N
DRIVER-ID | State and driver's license number of the driver | N | Claim | N | Y | N
VIN | Vehicle ID number of involved vehicle | N | Claim | N | Y | Y
STATUS-CODE | Status of claim, blank=open, C=closed | N | Claim | N | Y | Y
INCIDENT-DATE | Date of incident that initiated the claim | N | Claim | N | Y | Y
FILING-DATE | Date that the claim was received | N | Claim | N | Y | Y
Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Source Data Modeling – Overview of Warehouse Data Modeling
• Contextual models – business goals and drivers; information needs
• Conceptual models – warehousing subjects; business questions; facts and qualifiers; source composition; source subjects; targets configuration
• Logical models – staging, warehouse & mart ER models; data mart dimensional models; integrated source data model (ERM); triage
• Structural models – staging area structure; warehouse structure; relational mart structures; dimensional mart structures; source data structure model
• Physical models – staging physical design; warehouse physical design; data mart physical designs; source data file descriptions
• Functional databases – implemented warehousing databases; source data files
Source Data Modeling
Source Data Modeling Objectives
• A single logical model representing a design view of all source data within scope
• An entity relationship model in third normal form (a business model without
implementation redundancies)
• Traceability from logical entities to the specific data sources that implement those
entities
• Traceability from logical relationships to the specific data sources that implement those
relationships
• Verification that each logical attribute is identifiable in implemented data sources

Source Data Modeling Challenges
• Many data sources do not have data models
• Where data models exist, they are probably outdated and almost certainly not
integrated
• Many source structures are only documented in code (e.g. COBOL definitions of
VSAM files)
• Sometimes multiple and conflicting file descriptions exist for a single data structure
Source Data Modeling – The activities
[Flowchart: source data stores feed warehousing data modeling at each level.]
• Contextual (scope) – business drivers, business goals, information needs: what kinds of data? which target data stores?
• Conceptual (analyze) – build the source composition model; for each source, if a source model exists, validate it; otherwise choose a modeling approach – top-down (source subject model) or bottom-up (existing data)
• Logical (design) – integrate into a single source logical model (ERM)
• Structural (specify) – structure of data store (matrix)
• Physical (optimize) – existing file descriptions
• Functional (implement) – existing data stores; locate and extract
Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Understand the Scope of Data Sources
[Flowchart excerpt: the contextual and conceptual steps of source data modeling – within the source composition model, identify & name subjects, then associate subjects with sources.]
• The source composition model uses set notation to develop a subject area model
• It classifies each source by the business subjects that it supports
• It helps to understand
• which subjects have a robust set of sources
• which sources address a broad range of business subjects
• Helpful to plan, size, sequence and schedule development of the DW increments
Composition Subject Model – Example
[Venn diagram: subject areas (CUSTOMER, REVENUE, POLICY, CLAIM, EXPENSE, INCIDENT, ORGANIZATION, PARTY, MARKETPLACE) overlaid with the sources that support them – MIS customer table, MIS product table, MIS auto marketplace table, MIS residential marketplace table, LIS policy file, LIS claim file, APMS premium file, APMS policy master, RPS policy file, CPS claim master, CPS claim action file, CPS claim detail file, CPS party file.]
Composition Subject Matrix Example
[Matrix: files/tables grouped by system (CPS: claim master file, claim detail file, claim action file, party file; APMS: policy master, premium file, driver file; RIP: residential policy file; LIS: life claims file, life policy file; MIS: customer table, product table, auto market table, residential market table) are checked (√) against the subjects they support – CUSTOMER, POLICY, CLAIM, REVENUE, EXPENSE, INCIDENT, PARTY, ORGANIZATION, MARKETPLACE.]
Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Understanding Source Content – Integrated View of Non-integrated Sources
Source Data Modeling…
“It is not within the charter of the warehousing program to redesign data sources”
• Understand the existing source data designs
• Merge all of the designs into one representative model
• The source logical data modeling process is not one of design but of integration
• To develop a logical source data model, you will need to integrate design information from
multiple inputs including
– Merging and integrating existing data models
– Extending the subject model that represents any source
– Extracting design structures from source data stores with reverse engineering
techniques
Understanding Source Content – Using Existing Models
[Flowchart excerpt: where a source model exists, it is validated, checked for currency & accuracy, and combined into the single source logical model (ERM).]
• This modeling activity begins with collection of existing data models
• Models must be validated – to ensure accuracy and currency
• Existing models “jump start” the process
• Merging models – identifying and resolving redundancy and conflict across source models
Understanding Source Content – Working Top-Down
[Flowchart excerpt: when no source model exists, the top-down path builds a source subject model and proceeds to the logical model.]
• Identify, name & describe entities
• Identify, name & describe relationships
• Identify, name & describe attributes
• Map to data stores
Understanding Source Content – Working Bottom-Up
• Derive the data model from the file descriptions
• The source data element matrix serves as the tool to perform source data modeling
• Source modeling and source assessment work well together and share the same set of
documentation techniques.

File | Field | Attribute (what fact?) | ID (key?) | Entity (what subject?) | Relationship (foreign key?)
APMS Premium | POLICY-NUMBER | Unique Policy ID | Yes | POLICY |
APMS Premium | NAME | Policy Holder Name | | CUSTOMER |
APMS Premium | ADDRESS | Policy Holder Address | | CUSTOMER |
APMS Premium | PREMIUM-AMOUNT | Cost of Policy Premium | | POLICY |
APMS Premium | POLICY-TERM | Coverage Duration | | POLICY |
APMS Premium | BEGIN-DATE | Start date of coverage | | POLICY |
APMS Premium | END-DATE | End date of coverage | | POLICY |
APMS Premium | DISCOUNT-CD | Identify kind of discount | Partial | DISCOUNT |
APMS Premium | SCHEDULE | Basis of discount amt | | DISCOUNT |
APMS Policy | POLICY-NUMBER | Unique Policy ID | Yes | POLICY |
APMS Policy | CUSTOMER-NUMBER | Unique customer ID | Yes | CUSTOMER | POLICY->CUSTOMER
APMS Policy | VIN | Vehicle ID number | Yes | VEHICLE | POLICY->VEHICLE
APMS Policy | MAKE | Vehicle Manufacturer | | VEHICLE |
Integrating Multiple Views – Resolving Redundancy & Conflict
Resolve redundancy and conflict → model states → normalize → verify model

Examine sets of entities and relationships:
• “customer places order” and “person places order” → customer and person are redundant
• “customer places order” and “customer sends order” → places is redundant with sends
Examine sets of entities and attributes:
• when differently named entities have a high degree of similarity in their sets of attributes
Understanding Source Content – Data Profiling: Looking at the Data
Look at the data to:
• discover patterns
• know how it is used
• understand data quality
• identify all data values
• classify and organize

3 types of profiling:
• Column profiling
• Dependency profiling
• Redundancy profiling
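To make column profiling concrete, here is a minimal sketch in Python over a delimited extract file; the file name and its columns are hypothetical, and real profiling tools layer pattern, dependency, and redundancy analysis on top of simple counts like these:

    import csv
    from collections import Counter, defaultdict

    def profile_columns(path):
        # For each column: count nulls/blanks and tally distinct values.
        stats = defaultdict(lambda: {"nulls": 0, "values": Counter()})
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                for col, val in row.items():
                    if val is None or val.strip() == "":
                        stats[col]["nulls"] += 1
                    else:
                        stats[col]["values"][val] += 1
        for col, s in stats.items():
            print(col, len(s["values"]), "distinct,", s["nulls"], "nulls,",
                  "top:", s["values"].most_common(3))

    profile_columns("claim_master_extract.csv")  # hypothetical extract file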
Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Logical to Physical Mapping
Tracing business data to its physical implementation
• Two-way connection
– What attribute is implemented by this field/column?
– Which fields/columns implement this attribute?
• Documenting the mapping
– Extend the source data element matrix to include all data sources
– Provides comprehensive documentation of source data, and detailed mapping from
the business view to implemented data elements
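One lightweight way to hold the two-way connection is a pair of lookups built from the element matrix; a minimal Python sketch, reusing field names that appear in the matrices above (the attribute labels are illustrative):

    # Attribute -> implementing (file, field) pairs, taken from the
    # bottom-up matrix above; extend with all data sources in practice.
    attribute_to_fields = {
        "POLICY: Unique Policy ID": [("APMS Premium", "POLICY-NUMBER"),
                                     ("APMS Policy", "POLICY-NUMBER")],
        "CUSTOMER: Policy Holder Name": [("APMS Premium", "NAME")],
    }

    # Invert it: which attribute does each field/column implement?
    field_to_attribute = {
        (file_, field): attr
        for attr, pairs in attribute_to_fields.items()
        for file_, field in pairs
    }

    print(field_to_attribute[("APMS Premium", "NAME")])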
Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Understanding the source systems – A Team Effort
• Understanding the source systems feeding the warehousing environment is a critical success factor

• All members of the warehousing team have a role in this effort

• The acquisition (ETL) team is generally responsible to


– document the source layouts
– perform reverse engineering as needed
– determine point of entry
– identify system of record
– and look at actual data values

• The data modelers are likely to


– create the single source logical model (using the inputs gathered by the acquisition team)

• Business Analysts/representatives are involved to


– look at the data and help understand the values
– help to identify point of entry
– and help to determine system of record

Module 2 –
Data Capture Analysis and Design



Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Data Capture Concepts – An overview
• Data capture – the activities involved in getting data out of sources
• Synonym for data extraction
[Diagram: source → Extract → Transform → Load → target]

Data capture analysis – what to extract?
– Performed to understand requirements for data capture
• Which data entities and elements are required by target data stores?
• Which data sources are needed to meet target requirements?
• What data elements need to be extracted from each data source?

Data capture design – when to extract? how to extract?
– Performed to understand and specify methods of data capture
• Timing of extracts from each data source
• Kinds of data to be extracted
• Occurrences of data (all, or changes only) to be captured
• Change detection methods
• Extract technique (snapshot or audit trail)
• Data capture method (push from source or pull from warehouse)
Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Source/Target Mapping – Mapping Objectives
– The primary technique used to perform data capture analysis
– Maps source data to target data
– Three levels of detail:
• Entities
• Data stores
• Data elements
– The terms “source” and “target” describe roles that a data store may play, not fixed
characteristics of a data store
[Diagram: source → Extract → Transform → Load → target, with source/target mapping spanning the flow.]
Source and Target as Roles
The terms “source” and “target” describe roles that a data store may play, not fixed characteristics of a data store:
• Data intake – source: operational/external data; target: staging data
• Data distribution – source: staging data; target: data warehouse
• Information delivery – source: data warehouse; target: data marts
[Diagram: the ETL flow from data sources through staging and the warehouse to the data marts, annotated with the shifting source/target roles.]
Mapping Techniques
• Map source entities to target entities (logical data models)
• Map source data stores to target data stores
• Map source data elements to target data elements (structural & physical models; design transformations)
[Diagram: source & target data models (customer, product, service) aligned at each level of detail.]
Source/Target Mapping: Examples
[Example 1 – entity mapping: source logical entities (MEMBER, SALES TRANSACTION, PRODUCT, PRODUCT INVENTORY, RESTOCK TRANSACTION) are mapped (√) to entities from the target logical model (CUSTOMER, PRODUCT, SALES TRANSACTION, INVENTORY TURNOVER).]
[Example 2 – data store mapping: source files/tables (MEMBERSHIP MASTER FILE, POINT-OF-SALE DETAIL FILE, PRODUCT INVENTORY FILE, WAREHOUSE-TO-STORE ACTIVITY TABLE) are mapped (√) to target tables (CUSTOMER, CUSTOMER ADDRESS, SALES TRANSACTION, PRODUCT, INVENTORY TURNOVER).]
[Example 3 – data element mapping: source elements of the MEMBERSHIP MASTER and POINT OF SALE DETAIL (member-number, membership-type, date-joined, date-last-renewed, term-last-renewed, date-of-last-activity, name, address, city-and-state, zip-code, transaction-ID, line-number) are mapped (√) to target elements (customer number, customer first name, customer last name, customer biz name, customer address, customer city, customer state, customer zip code, renewal date, transaction ID, transaction quantity, line number, terminal ID, SKU ID).]
Source/Target Mapping: Full Set of Data Elements
[Diagram: the full target element set is the union of elements identified by source/target mapping, elements added by triage, elements added by transform logic (transform design), and elements added by the business.]
Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Source Data Triage
What to extract – opportunities
• Source/target mapping analyzes need; triage analyzes opportunity
• Triage is about extracting all data with potential value

What is Triage?
• Source data structures are analyzed to determine the appropriate data elements for inclusion

Why Triage?
• Ensures that a complete set of attributes is captured in the warehousing environment
• Rework is minimized

Triage and Acquisition
• Performing triage is a joint effort between the acquisition team and the warehousing data modelers
The Triage Technique
Starting from the source systems needed for the increment:
• Select needed files
• Identify elements addressing known business questions
• Eliminate operational and redundant elements
• Take all other business elements
Result: first draft of element mapping for the staging area or atomic DW
Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Kinds of Data
• Event data – Sales Transaction: Transaction Date, Transaction Time, Payment Method, Transaction Status, Register ID, Store Number, Checker Employee ID, Member Number
• Reference data
– Customer: Member Number, Membership Type, Last Name, First Name, Date Joined, Last Activity Date, Last Update Date, Last Update Time, Last Update User ID
– Store: Store Number, Store Name, Store Address, Phone Number, Manager Employee ID
– Employee: Employee ID, Employee Name, Employee Address, Store Number
• Source system metadata (e.g., the last-update fields) and source system keys (e.g., the ID fields)
Data Capture Methods
                    | ALL DATA                      | CHANGED DATA
PUSH TO WAREHOUSE   | Replicate source files/tables | Replicate changes or transactions
PULL FROM SOURCE    | Extract source files/tables   | Extract changes or transactions
Detecting Data Changes
How do we know which data has changed?
• Detecting changes at the source
• source date/time stamps
• source transaction files/logs
• replicated source data changes
• DBMS logs
• compare back-up files
• Detecting changes after extract
• compare warehouse extract generations
• compare warehouse extract to source system
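A minimal sketch of the date/time-stamp approach on the pull side (Python with sqlite3 for illustration; the table and column names are placeholders): each run extracts only rows stamped after the watermark saved by the previous run.

    import sqlite3

    def extract_changes(conn, last_watermark):
        # Pull only rows whose timestamp is newer than the saved watermark.
        cur = conn.execute(
            "SELECT member_number, membership_type, last_update_ts "
            "FROM membership_master "
            "WHERE last_update_ts > ? ORDER BY last_update_ts",
            (last_watermark,),
        )
        rows = cur.fetchall()
        # Carry the new watermark forward for the next run.
        new_watermark = rows[-1][2] if rows else last_watermark
        return rows, new_watermark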
Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Timing Issues
[Diagram: OLTP sources → data extraction → data transformation (work tables) → warehouse loading → data marts, annotated with the timing concern at each stage:]
• frequency of acquisition
• latency of load into the intake layer
• periodicity of the data marts
Source System Considerations
[Diagram: OLTP sources feeding data extraction into work tables.]
• When is the data ready in each source system?
• How will I know when it’s ready?
• How long will it remain in the steady state?
• How will I know when source systems fail?
• How will I respond to source system failures?
• How will I recover from a failure?
Handling time variance – techniques and methods
• SNAPSHOT
– Periodically posts records as of a specific point in time
– Records all data of interest without regard to changes
– Acquisition techniques to create snapshots
• DBMS replication
• Full File Unload or Copy

• AUDIT TRAIL
– Records details of each change to data of interest
– Details may include date and time of change, how the change was detected, reason for
change, before and after data values, etc.
– Acquisition techniques
• DBMS triggers
• DBMS replication
• Incremental selection
• Full file unload/copy

Important distinction between snapshot and audit trail:
With audit trail techniques, only changed data is extracted and loaded;
with snapshot techniques, all data is extracted and loaded, whether changed or not
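To illustrate the distinction, a minimal Python sketch that turns two successive snapshots into audit-trail-style change records by diffing them on a business key (the key name and record layout are placeholders):

    def diff_snapshots(previous, current, key="policy_number"):
        # Compare two full snapshots and emit only the changes.
        prev = {r[key]: r for r in previous}
        curr = {r[key]: r for r in current}
        changes = []
        for k, row in curr.items():
            if k not in prev:
                changes.append({"key": k, "change": "insert", "after": row})
            elif row != prev[k]:
                changes.append({"key": k, "change": "update",
                                "before": prev[k], "after": row})
        for k in prev.keys() - curr.keys():
            changes.append({"key": k, "change": "delete", "before": prev[k]})
        return changes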
Module 3 –
Data Transformation Analysis & Design

Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Transformation concepts – An overview
• Data Transformation
• Changes that occur to the data after it is extracted
• Transformation processing removes
– Complexities of operational environments
– Conflicts and redundancies of multiple databases
– Details of daily operations
– Obscurity of highly encoded data

• Transformation Analysis
• Integrate disparate data
• Change granularity of data
• Assure data quality

• Transformation Design
• Specifies the processing needed to meet the requirements that are determined by
transformation analysis
• Determining kinds of transformations
– Selection
– Filtering
– Conversion
– Translation
– Derivation
– Summarization
– Organized into programs, scripts, modules, jobs, etc. that are compatible with chosen tools
and technology

Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Data Integration Requirements
• Integration
– Create a single view of the data

• Integration & Staging Data


– organize data by business subjects
– ensure integrated identity through use of common, shared business keys

• Integration & Warehouse data


– implement data standards, including derivation of conformed facts and structuring
of conformed dimensions
– ensure integration of internal identifiers – where staging integrates real world keys,
the warehouse needs to do the same for surrogate keys

• Integration & Data marts


– Intended to satisfy business specific /department specific requirements

Data Granularity Requirements
• Granularity
– Each change of data grain, from atomic data to progressively higher levels of
summary – achieved through transformation

• Granularity & Staging Data


– Staging data kept at atomic level

• Granularity & warehouse data


– In a 3 tier environment, warehouse should contain all common and standard
summaries

• Granularity & Data marts


– Derivation of summaries specific to individual needs

Data Quality Requirements
• Data Cleansing
– process by which data quality needs are met
– ranges from filtering bad data to replacing data values with some alternative default or derived
values

• Cleansing & Staging Data


– the earlier the data is cleansed, the better the result
– sometimes important for staging data to reflect what was contained in the source systems
– Delay data cleansing transformation until data is moved from staging to warehouse
– Keep both cleansed and un-cleansed data in staging area

• Cleansing & Warehouse data


– data not cleansed in staging is cleansed before loaded into the warehouse

• Cleansing & Data marts


– cleansing at data marts is not necessarily desirable, however as a practical matter may be
necessary

Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Transformation Design – Approach
Input: transformation requirements
• Identify transformation rules & logic
• Determine transformation sequences
• Specify transformation processes
Output: transformation specifications
Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Kinds of Transformations
This transformation type… | is used to…
Selection | Choose one source to be used among multiple possibilities
Filtering | Choose a subset of rows from a source data table, or a subset of records from a source data file
Conversion and Translation | Change the format of data elements
Derivation | Create new data values, which can be inferred from the values of existing data elements
Summarization | Change the granularity of the data, creating summary values from detail-level values
Selection
Choose among alternative sources, extracting based upon selection rules.
[Diagram: two extracted sources feed a Select step; the transformed target data comes sometimes from source 1, sometimes from source 2.]
‘If membership type is individual, use member name from the membership master file; otherwise use member name from the business contact table’
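A minimal sketch of that selection rule in Python (the record shapes and field names are placeholders):

    def select_member_name(membership_type, master_rec, contact_rec):
        # Selection: choose which source supplies the member name.
        if membership_type == "individual":
            return master_rec["member_name"]   # membership master file
        return contact_rec["member_name"]      # business contact table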
Filtering
Eliminate some data from the target set of data based on filtering rules.
[Diagram: extracted source data passes through a Filter; some rows or values are discarded before reaching the transformed target data.]
‘If the last 2 digits of the policy number are 04, 27, 46, or 89, extract the data for the data mart; otherwise exclude the policy and all associated data’
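A minimal sketch of that filtering rule in Python (the field name and sample rows are placeholders):

    SAMPLED_SUFFIXES = {"04", "27", "46", "89"}

    def keep_policy(row):
        # Filtering: keep only policies whose number ends in a sampled suffix.
        return row["policy_number"][-2:] in SAMPLED_SUFFIXES

    rows = [{"policy_number": "1234-ABC-04"}, {"policy_number": "1234-ABC-11"}]
    print([r for r in rows if keep_policy(r)])  # only ...-04 survives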
Conversion
Change data content and/or format based on conversion rules.
[Diagram: extracted source data passes through a Convert step; the value/format in is different than the value/format out.]
‘For policy history prior to 1994, reformat from Julian date to YYYYMMDD format. Default century to 19’
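A minimal sketch of that conversion rule in Python, assuming the legacy Julian dates are in YYDDD form (an assumption about the source layout):

    from datetime import datetime

    def julian_to_yyyymmdd(julian):
        # Conversion: YYDDD Julian date -> YYYYMMDD, century defaulted to 19.
        yy, ddd = julian[:2], julian[2:]
        return datetime.strptime("19" + yy + ddd, "%Y%j").strftime("%Y%m%d")

    print(julian_to_yyyymmdd("93032"))  # -> 19930201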
Translation
Decode data whose values are encoded, based on rules for translation; both the encoded and the decoded values flow to the target.
[Diagram: extracted source data with encoded values in; transformed target data with both encoded and decoded values out.]
‘If membership-type-code is ‘C’, translate to ‘Business’; if membership-type-code is ‘P’, blank, or null, translate to ‘Individual’; otherwise translate to ‘Unknown’’
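A minimal sketch of that translation rule in Python:

    def translate_membership_type(code):
        # Translation: decode membership-type-code per the stated rule.
        if code == "C":
            return "Business"
        if code is None or code.strip() in ("P", ""):
            return "Individual"
        return "Unknown"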
Derivation
Use existing data values to create new data based on derivation rules; new data values are created, so more values come out than went in.
[Diagram: extracted source data passes through a Derive step into the transformed target data.]
‘Total Premium Cost = base-premium-amount + (sum of all additional coverage amounts) – (sum of all discount amounts)’
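A minimal sketch of that derivation rule in Python:

    def total_premium_cost(base_premium, coverage_amounts, discount_amounts):
        # Derivation: create a new value from existing ones.
        return base_premium + sum(coverage_amounts) - sum(discount_amounts)

    print(total_premium_cost(500.0, [75.0, 40.0], [25.0]))  # -> 590.0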
Summarization
Change data granularity based on rules of summarization; atomic or base data in, summary data out.
[Diagram: extracted source data passes through a Summarize step into summary-level target data.]
‘For each store (for each product line (for each day (count the number of transactions, accumulate the total dollar value of the transactions)))’
‘For each week (sum daily transaction count, sum daily dollar total)’
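A minimal sketch of the daily summarization rule in Python (the transaction record layout is a placeholder):

    from collections import defaultdict

    def summarize_daily(transactions):
        # Summarization: roll detail up to store/product-line/day grain.
        summary = defaultdict(lambda: {"txn_count": 0, "dollar_total": 0.0})
        for t in transactions:
            cell = summary[(t["store"], t["product_line"], t["date"])]
            cell["txn_count"] += 1
            cell["dollar_total"] += t["amount"]
        return summary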
Identifying Transformation Rules
[The data element mapping matrix from the source/target mapping examples, revisited: for any source-to-target data element association, ask what needs exist for:]
• selection?
• filtering?
• conversion?
• translation?
• derivation?
• summarization?
Specifying Transformation Rules
[The same mapping matrix, with cells expanded to identify transformations by type & name – e.g., cleansing DTR027 (default value) on membership-type, derivation DTR008 (derive name) on name.]

DTR027 (Default Membership Type)
If membership-type is null or invalid
assume “family” membership

DTR008 (Derive Name)
If membership-type is “family”
separate name using comma
insert characters prior to comma in customer-last-name
insert characters after comma in customer-first-name
else move name to customer-biz-name
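A minimal Python sketch of the two rules as specified above (the record shape and the set of valid membership types are assumptions; the comma guard in DTR008 is an added safety check):

    VALID_TYPES = {"individual", "family", "business"}  # assumed domain

    def dtr027_default_membership_type(rec):
        # DTR027: if membership-type is null or invalid, assume "family".
        if rec.get("membership_type") not in VALID_TYPES:
            rec["membership_type"] = "family"
        return rec

    def dtr008_derive_name(rec):
        # DTR008: family names are "last, first"; others are business names.
        if rec["membership_type"] == "family" and "," in rec["name"]:
            last, first = [p.strip() for p in rec["name"].split(",", 1)]
            rec["customer_last_name"] = last
            rec["customer_first_name"] = first
        else:
            rec["customer_biz_name"] = rec["name"]
        return rec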
Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Dependencies and Sequences
• Time dependency – when one transformation rule must execute before another
• example: summarization of derived data cannot occur before the derivation
• Rule dependency – when execution of a transformation rule is based upon the result of
another rule
• example: different translations occur depending on the source chosen by a selection
rule
• Grain dependency – when developing one level of summary is based on the results of a
previous summarization
• example: quarters can’t be summarized annually before months are summarized
on a quarterly basis
Dependencies and Sequences
2 4

Specify selection

Specify filtering

Specify conversion & translation

Specify derivation

Specify summarization

1 3
1. Identify the transformation rules
2. Understand rule dependency – package as modules
3. Understand time dependency – package as processes
4. Validate and define the test plan
25 March 2019
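Packaging by dependency amounts to ordering the rules so each runs after its prerequisites; a minimal sketch using Python's standard topological sort (the dependency map is a hypothetical example, not from the deck):

    from graphlib import TopologicalSorter  # Python 3.9+

    # Map each rule to the rules it depends on (hypothetical example).
    depends_on = {
        "summarize_weekly": {"summarize_daily"},
        "summarize_daily": {"derive_total_premium"},
        "derive_total_premium": {"translate_membership_type"},
        "translate_membership_type": set(),
    }

    print(list(TopologicalSorter(depends_on).static_order()))
    # ['translate_membership_type', 'derive_total_premium',
    #  'summarize_daily', 'summarize_weekly']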
Modules and Programs
[Diagram: transformation rules (e.g., DTR027 Default Membership Type and DTR008 Derive Name) plus the dependencies among rules yield structures of modules, programs, scripts, etc. DTR008 is packaged to run after DTR027, since the name derivation branches on the membership type that DTR027 may have defaulted.]
Job Streams & Manual Procedures – Completing the ETL Design
[Diagram: transformation rules and their implementation, extract & load dependencies, and automated & manual procedures come together in the ETL design, covering scheduling, execution, verification, and communication.]
Module 4 –
Data Transportation & Loading Design



Module 4 – Data Transport & Load Design

Data Transport and Load Concepts

Data Transport Design

Database Load Design

Overview
[Diagram: Source Data → Extract → (data transport) → Transform → Load → Target Data. Where do platform changes occur?]
Module 4 – Data Transport & Load Design

Data Transport and Load Concepts

Data Transport Design

Database Load Design

Data Transport Issues
[Diagram: data transport sits between Extract and Transform.]
• Which platforms?
• Data volumes?
• Transport frequency?
• Network capacity?
• ASCII vs EBCDIC?
• Data security?
• Transport methods?


Data Transport Techniques
• Open FTP
• Secure FTP
• Alternatives to FTP
• Data compression
• Data encryption
• ETL tools
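As one illustration, a minimal Python sketch of the compression step with an integrity checksum the receiving side can verify after transport (file names are placeholders; the transfer itself, e.g. secure FTP, is out of scope here):

    import gzip, hashlib, shutil

    def compress_and_checksum(src, dst):
        # Compress the extract, then hash the compressed file so the
        # receiver can verify it arrived intact.
        with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        digest = hashlib.sha256()
        with open(dst, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    print(compress_and_checksum("claim_extract.dat", "claim_extract.dat.gz"))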


Module 4 – Data Transport & Load Design

Data Transport and Load Concepts

Data Transport Design

Database Load Design

Database Load Issues
[Diagram: load issues arise between Transform and the target data.]
• Which DBMS?
• Relational vs dimensional?
• Tables & indices?
• Load frequency? Load timing?
• Data volumes?
• Exception handling?
• Restart & recovery?
• Load methods?
• Referential integrity?
Populating Tables
• Drop and rebuild the tables
• Insert (only) rows into a table
• Delete old rows and insert changed rows

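A minimal sketch of the third option, delete-then-insert for changed rows, using Python with sqlite3 for illustration (table and column names are placeholders):

    import sqlite3

    def load_changed_rows(conn, rows):
        # One transaction: delete superseded rows, then insert replacements.
        with conn:
            conn.executemany(
                "DELETE FROM customer_dim WHERE member_number = ?",
                [(r["member_number"],) for r in rows])
            conn.executemany(
                "INSERT INTO customer_dim (member_number, name) VALUES (?, ?)",
                [(r["member_number"], r["name"]) for r in rows])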
Indexing
[Diagram: the load process maintains both the tables and their indices.]
Updating
Allow updating of rows in tables?
• Isn’t the warehouse read-only?
• Updating business data
• Updating row-level metadata
Referential Integrity
• RI is the condition where every reference to another table has a foreign key/primary key
match.

• Three common options for RI
• DBMS checking
• Test load files before load using a tool/custom application
• Test database(s) after load using a tool/custom application
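A minimal sketch of the second option, testing a load file for orphan foreign keys before the load (Python; the key and field names are placeholders):

    def find_orphans(fact_rows, dim_keys, fk="member_number"):
        # Rows whose foreign key has no matching primary key in the dimension.
        return [r for r in fact_rows if r[fk] not in dim_keys]

    dim_keys = {"M001", "M002"}
    facts = [{"member_number": "M001"}, {"member_number": "M999"}]
    print(find_orphans(facts, dim_keys))  # -> [{'member_number': 'M999'}]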
Timing Considerations
• User Expectations
• Data Readiness
• Database synchronization

Exception Processing
[Diagram: rows leaving Transform are checked at Load; rows that are ok flow to the target data, while exceptions are suspended, reported, logged, or discarded.]
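A minimal sketch of the routing logic in Python (the validation rule is a placeholder):

    def route_rows(rows, validate):
        # Split rows into those that load and those suspended with a reason.
        loaded, suspended = [], []
        for row in rows:
            ok, reason = validate(row)
            if ok:
                loaded.append(row)
            else:
                suspended.append({"row": row, "reason": reason})
        return loaded, suspended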
Integrating with ETL Processes
[Diagram: EXTRACT, TRANSFORM, and LOAD job streams, each with scheduling, dependencies, execution, verification, communication, restart/recovery, and process metadata.]
• Loading as a part of a single transform & load job stream
• Loads triggered by completion of the transform job stream
• Loads triggered by verification of transforms
• Parallel ETL processing (tool capabilities)
• Loading partitions
• Updating summary tables
Module 5 – Implementation Guidelines



Module 5 – Implementation Guidelines

Data Acquisition Technology

ETL Summary

Technology in Data Acquisition
ETL technology touches many categories:
• Data mapping
• Data transformation
• Data conversion
• Data cleansing
• Data movement
• Data access
• Database loading
• Database management
• Source systems
• Storage management
• Metadata management
ETL – Critical Success Factors
[Matrix: each factor is checked against the data store roles (intake, integration, distribution, information delivery) and data transformation roles (granularity, cleansing) to which it applies.]
1. Design for the future, not for the present
2. Capture and store only changed data
3. Fully understand source systems and data
4. Allow enough time to do the job right
5. Use the right sources, not the easy ones
6. Pay attention to data quality
7. Capture comprehensive ETL metadata
8. Test thoroughly and according to a test plan
9. Distinguish between one-time and ongoing loads
10. Use the right technology for the right reasons
11. Triage source attributes
12. Capture atomic level detail
13. Strive for subject orientation and integration
14. Capture history of changes in audit trail form
15. Modularize ETL processing
16. Ensure that business data is non-volatile
17. Use bulk loads and/or insert-only processing
18. Complete subject orientation and integration
19. Use the right data structures (relational vs. dimensional)
20. Use shared transformation rules and logic
21. Design for distribution first, then for access
22. Fully understand each unique access need
23. Use DBMS update capabilities
24. Design for access before other purposes
25. Design for access tool capabilities
26. Capture quality metadata and report data quality
Exercises

Exercise 1: Source Data Options

Exercise 2: Source Data Modeling

Exercise 3: Data Capture

Exercise 4: Data Transformation

Exercise 5: Data Acquisition Decision

