Lecture 3
Objective
Survey broadly all the various aspects of the data extraction, transformation, and loading (ETL) functions
Examine the data extraction function, its challenges, and its techniques
Study the tasks and types of the data transformation function
Understand the meaning of data integration and consolidation
Perceive the importance of the data load function and probe the major methods for applying data to the warehouse
Gain a true insight into why ETL is crucial, time consuming, and arduous
Challenges
Lack of consistency
For example, data on salary may be represented as a monthly salary, a weekly salary, or a bimonthly salary in different source payroll systems
For inconsistent data, there is often no ready means of resolving the mismatches
Most source systems do not represent data in types or formats that are meaningful to the users; many representations are cryptic and ambiguous
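The salary example above can be sketched as a small normalization step. This is a minimal illustration, not a real payroll integration; the pay-frequency labels and field names are hypothetical assumptions.

```python
# Hypothetical sketch: normalizing salary figures quoted at different
# pay frequencies (as in different source payroll systems) into one
# consistent annual figure before loading into the warehouse.
PERIODS_PER_YEAR = {"monthly": 12, "weekly": 52, "biweekly": 26}

def annualize(amount, frequency):
    """Convert a salary amount quoted per pay period into an annual salary."""
    try:
        return amount * PERIODS_PER_YEAR[frequency]
    except KeyError:
        raise ValueError(f"unknown pay frequency: {frequency!r}")

# Records from three imagined source payroll systems, each with its own convention.
records = [
    {"emp": "A", "salary": 5000, "freq": "monthly"},
    {"emp": "B", "salary": 1200, "freq": "weekly"},
    {"emp": "C", "salary": 2500, "freq": "biweekly"},
]
annual = {r["emp"]: annualize(r["salary"], r["freq"]) for r in records}
```

The unknown-frequency branch matters: a source value the mapping does not recognize should fail loudly rather than load a wrong number into the warehouse.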
ETL
The process of culling out the data that is required for the Data Warehouse from the source systems
• Getting data out of the source and loading it into the data warehouse is not simply a process of copying data from one database to another
• Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database
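The extract, transform, and load steps can be sketched end to end. This is a minimal illustration using in-memory SQLite; the table names (`orders`, `fact_orders`) and columns are hypothetical assumptions, not a prescribed warehouse schema.

```python
# Minimal ETL sketch: extract rows from an OLTP-style source database,
# transform them to match the warehouse schema, and load them into the
# warehouse fact table. All names here are illustrative assumptions.
import sqlite3

source = sqlite3.connect(":memory:")   # stands in for the OLTP database
source.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, order_date TEXT)")
source.execute("INSERT INTO orders VALUES (1, 9.5, '2024-06-01')")

warehouse = sqlite3.connect(":memory:")  # stands in for the DW database
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, amount REAL, order_date TEXT, year INTEGER)")

# Extract: pull the relevant operational rows.
rows = source.execute("SELECT order_id, amount, order_date FROM orders").fetchall()
# Transform: reshape each row to the warehouse schema (derive a 'year' column).
transformed = [(oid, amt, day, int(day[:4])) for oid, amt, day in rows]
# Load: apply the transformed rows to the warehouse fact table.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", transformed)
warehouse.commit()
```

The point is the shape of the process, not the copy itself: the transform step is where source rows are reconciled with the warehouse schema.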
ETL
• Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading
• When defining ETL for a data warehouse, it is
important to think of ETL as a process, not a
physical implementation
ETL could involve some degree of cleansing and transformation. It requires active input from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging.
Time Consuming and Arduous
Why ETL
There are many reasons for adopting ETL in an organization:
1. It helps companies analyze their business data to make critical business decisions.
2. Transactional databases cannot answer the complex business questions that ETL can; for example, a Data Warehouse built by ETL provides a common data repository.
3. ETL provides a method of moving the data from various sources into a data warehouse.
Why ETL
• A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
• Allows verification of data transformation, aggregation, and calculation rules.
• Allows sample data comparison between the source and the target system.
• Can perform complex transformations, though it requires an extra staging area to store the data.
• Helps migrate data into a Data Warehouse, converting various formats and types to adhere to one consistent system.
Why ETL
• ETL is a predefined process for accessing and manipulating source data into the target database.
• ETL in a data warehouse offers deep historical context for the business.
• It helps improve productivity because it codifies and reuses processes without a need for technical skills.
Refreshment Workflow
ETL Requirements and Steps
The most underestimated process in DW development
The most time-consuming process in DW development
• 50-70% of development time is spent on ETL!
• Extract: Extract relevant data
• Transform: Transform data to DW format
• Build keys, etc.
• Cleansing of data
• Load: Load data into DW
• Build aggregates, etc.
ETL Requirements and Steps
1. Source identification
2. Method of extraction
3. Extraction frequency
4. Time window
5. Job sequencing
6. Exception handling
Source Identification
Data in operational system
Current value
Most attributes hold the value at the current time
May change as transactions occur
Periodic status
The value of attributes and their history are preserved
Values are stored with a reference to time
For an insurance policy, each event in an insurance claim, such as claim initiation, verification, assessment, and settlement, is usually recorded at the point in time when something in the policy changes.
Data extraction
Real-time data extraction uses three strategies:
1. Capture through transaction log
2. Capture through database trigger
3. Capture in source application
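Strategy 2, capture through a database trigger, can be sketched as follows. This is an illustrative assumption using SQLite (the table names and the single-operation trigger are hypothetical); real CDC setups typically cover UPDATE and DELETE triggers as well.

```python
# Sketch of change capture through a database trigger: every INSERT on the
# source table is also written to a change-capture table, which a later
# ETL extract run can read as the delta. SQLite is used for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_changes (id INTEGER, name TEXT, op TEXT);
    CREATE TRIGGER capture_insert AFTER INSERT ON customer
    BEGIN
        INSERT INTO customer_changes VALUES (NEW.id, NEW.name, 'I');
    END;
""")

# An ordinary operational insert; the trigger records it as a change.
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
changes = conn.execute("SELECT * FROM customer_changes").fetchall()
# 'changes' now holds the captured row for the next extract run.
```

The appeal of this strategy is that the source application needs no modification; the cost is extra write load on the operational database.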
Capture through transaction log file
For each key in F2, find a tuple with the matching key in F1.
1. If we find a matching tuple <ki, bi> in F1 and its value differs, the tuple in F1 needs to be updated:
UPDATE <ki, bj> in F1
2. If there is no tuple in F1 corresponding to a tuple in F2, insert the tuple into F1:
INSERT <ki, bj> into F1
3. If there is no tuple in F2 with the key ki corresponding to a tuple in F1, the tuple in F1 has to be deleted:
DELETE <ki, bi> from F1
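The three cases above can be sketched as a comparison of two extracts. Here F1 (previous extract) and F2 (current extract) are modeled as key-to-value dicts; this is a minimal sketch, not the full sorted-file merge a production staging area would use.

```python
# Full-extract comparison sketch: F1 is the previous extract, F2 the
# current one, both as {key: value} mappings. Comparing them yields the
# delta as the UPDATE / INSERT / DELETE actions described above.
def compare_extracts(f1, f2):
    actions = []
    for key, value in f2.items():
        if key in f1:
            if f1[key] != value:
                actions.append(("UPDATE", key, value))   # key exists, value changed
        else:
            actions.append(("INSERT", key, value))       # new key appears in F2
    for key in f1:
        if key not in f2:
            actions.append(("DELETE", key))              # key vanished from F2
    return actions

# k1 changed, k3 is new, k2 disappeared.
delta_actions = compare_extracts({"k1": 1, "k2": 2}, {"k1": 9, "k3": 3})
```

Note that this method detects deletions (case 3), which timestamp-based extraction alone cannot, at the cost of comparing full extracts.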
FAST Data LOAD
Extract/ETL only the changes since the last load (the delta)
Extracts only the data that has changed since the last time a build was run
FAST Data LOAD
Delta = changes since last load
Store sorted total extracts in the DSA (Data Staging Area)
Delta = current extract - last extract
Always possible
Handles deletions
High extraction time
Put an update timestamp on all rows (in the sources)
Updated by a DB trigger
Extract only rows where "timestamp > time of last extract"
Reduces extract time
Cannot (alone) handle deletions
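The timestamp technique can be sketched as a filter over the source rows. This is a minimal illustration; the `updated_at` field name is a hypothetical assumption, and in practice the timestamp would be maintained by a DB trigger on the source.

```python
# Timestamp-based incremental extraction sketch: rows carry an update
# timestamp, and each run pulls only rows touched since the previous
# extract. Deleted rows leave no timestamp behind, so deletions are
# invisible to this method on its own.
from datetime import datetime

def extract_delta(rows, last_extract):
    """rows: dicts with an 'updated_at' datetime field (assumed name)."""
    return [r for r in rows if r["updated_at"] > last_extract]

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 3, 1)},
]
# Only the row updated after the last extract time qualifies.
delta = extract_delta(rows, datetime(2024, 2, 1))
```

This is why the slide pairs the technique with another mechanism for deletions: the filter can only see rows that still exist.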
Deferred CDC