Lecture 3

Data Warehouse

Objective
• Survey broadly the various aspects of the data extraction, transformation, and loading (ETL) functions
• Examine the data extraction function, its challenges, and its techniques
• Review the tasks and types of the data transformation function
• Understand the meaning of data integration and consolidation
• Perceive the importance of the data load function and probe the major methods for applying data to the warehouse
• Gain a true insight into why ETL is crucial, time-consuming, and arduous
Challenges

1. To change data into information, first capture the data.
2. Perform transformations so that the data is fit to be converted into strategic information.
3. The data is still not useful to end users until it is moved to the data warehouse repository.
Challenges
• The DWH environment has three functional areas:
  1. Data acquisition
  2. Data storage
  3. Information delivery
• ETL functions are challenging primarily because of:
  1. The nature of the source systems
  2. The disparities among the source operational systems
Challenges
• Source systems are very diverse and disparate.
• You must deal with source systems on multiple platforms and different operating systems.
• Many sources are older legacy applications.
• Historical information is critical in a data warehouse.
• The quality of data is dubious in many old source systems that have evolved over time.
• Source system structures keep changing over time because of new business conditions.

Challenges
• Lack of consistency: for example, salary may be represented as a monthly salary, a weekly salary, or a bimonthly salary in different source payroll systems (see the sketch after this list).
• For inconsistent data, there is often no ready means for resolving mismatches.
• Most source systems do not represent data in types or formats that are meaningful to users; many representations are cryptic and ambiguous.

ETL
• The process of culling out the data required for the data warehouse from the source systems.
• Getting data out of the source and loading it into the data warehouse is not simply a process of copying data from one database to another.
• Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database.
ETL
• Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.
• When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation.
• ETL may involve some degree of cleansing and transformation.
• ETL requires active input from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging.
Time Consuming and Arduous
• Designing the ETL functions, testing the various processes, and deployment can take 50% to 70% of the project effort.
• Data extraction effort depends on the nature and complexity of the source systems.
• The metadata must contain information on every database and data structure.
• You need detailed information, including database size.
• You must establish the time window for extracting the data.
• You must determine the mechanism for capturing the changes.

Time Consuming and Arduous
• Transformation involves many methods within the transformation activity (a sketch follows):
  • Reformat internal data structures, re-sequence data, apply various forms of conversion techniques, supply default values wherever values are missing, and design the whole set of aggregates needed for performance improvement.
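A minimal sketch of a few of these transformation tasks in Python, with hypothetical record fields: it supplies a default for a missing value, reformats a date, re-sequences the rows, and builds a simple aggregate.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source records; the field names are illustrative assumptions.
records = [
    {"cust_id": 1, "region": "N", "sale_date": "31/12/2024", "amount": 120.0},
    {"cust_id": 2, "region": None, "sale_date": "01/01/2025", "amount": 80.0},
]

def transform(rec):
    out = dict(rec)
    # Supply a default value where the region is missing.
    out["region"] = rec["region"] or "UNKNOWN"
    # Reformat the date from DD/MM/YYYY to the ISO format assumed for the DW.
    out["sale_date"] = datetime.strptime(rec["sale_date"], "%d/%m/%Y").date().isoformat()
    return out

# Re-sequence the transformed rows by sale date.
transformed = sorted((transform(r) for r in records), key=lambda r: r["sale_date"])

# Build a simple aggregate (total sales per region) for query performance.
totals = defaultdict(float)
for r in transformed:
    totals[r["region"]] += r["amount"]
print(dict(totals))  # {'UNKNOWN': 80.0, 'N': 120.0}
```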

Time Consuming and Arduous
• Initial loading can populate millions of rows in the data warehouse database.
• Even more difficult is the task of testing and applying the load images to actually populate the physical files in the data warehouse.
• It may take two or more weeks to complete the initial physical loading.

Why ETL
• There are many reasons for adopting ETL in the organization:
1. It helps companies analyze their business data to make critical business decisions.
2. Transactional databases cannot answer the complex business questions that can be answered with ETL; for example, a data warehouse provides a common data repository for analysis.
3. ETL provides a method of moving data from various sources into a data warehouse.
Why ETL
• A well-designed and documented ETL system is almost essential to the success of a data warehouse project.
• It allows verification of data transformation, aggregation, and calculation rules.
• It allows sample data comparison between the source and the target system.
• It can perform complex transformations, though it requires an extra area (the staging area) to store the data.
• It helps migrate data into a data warehouse, converting the various formats and types to adhere to one consistent system.
Why ETL
• ETL is a predefined process for accessing source data and manipulating it into the target database.
• ETL in a data warehouse offers deep historical context for the business.
• It helps improve productivity because it codifies and reuses transformation logic without requiring additional technical skills.
Refreshment Workflow
ETL Requirements and Steps
• The most underestimated process in DW development
• The most time-consuming process in DW development
  • 50-70% of development time is spent on ETL!
• Extract: extract the relevant data
• Transform: transform the data to DW format
  • Build keys, etc.
  • Cleanse the data
• Load: load the data into the DW
  • Build aggregates, etc.
(A minimal end-to-end sketch of these three steps follows.)
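A minimal, self-contained sketch of extract, transform, and load in Python, using in-memory SQLite databases as stand-ins for both the source and the warehouse; all table and column names are illustrative assumptions.

```python
import sqlite3

# Stand-in source (OLTP) and warehouse databases; schemas are assumptions.
src = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")

src.execute("CREATE TABLE orders (id INTEGER, amount REAL, country TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, "us"), (2, 20.0, None), (3, 5.0, "de")])
dw.execute("CREATE TABLE fact_orders (order_key INTEGER, amount REAL, country TEXT)")

# Extract: pull the relevant rows from the source.
rows = src.execute("SELECT id, amount, country FROM orders").fetchall()

# Transform: cleanse (default missing country) and reformat to DW conventions.
cleansed = [(oid, amt, (country or "unknown").upper()) for oid, amt, country in rows]

# Load: apply the transformed rows to the warehouse table.
dw.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleansed)
dw.commit()

print(dw.execute("SELECT * FROM fact_orders").fetchall())
# [(1, 10.0, 'US'), (2, 20.0, 'UNKNOWN'), (3, 5.0, 'DE')]
```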
ETL Requirements and Steps
ETL Requirements and Steps
• For the initial bulk refresh as well as for the incremental data loads, the sequence is simply as noted:
  • Triggering for incremental changes, filtering for refreshes and incremental loads, data extraction, transformation, integration, cleansing, and applying to the data warehouse database.
ETL Architecture
Data Staging Area (DSA)
• Transit storage for data in the ETL process
  • Transformations/cleansing done here
  • No user queries
  • Sequential operations on large data volumes
  • Performed by central ETL logic
  • No need for locking, logging, etc.
  • RDBMS or flat files? (DBMSs have become better at this)
  • Finished dimensions are copied from the DSA to the relevant marts
Data Staging Area (DSA)
• Allows centralized backup/recovery
  • It is often too time-consuming to re-run the initial load of all data marts after a failure
  • Backup/recovery facilities are needed
  • Better to provide these centrally in the DSA than in all data marts
Key Factors
• The complexity of the data extraction and transformation functions
• The diversity of the sources
• The data loading function
• Data refresh: load jobs can run too long
  • A full refresh requires a proper time window
  • An incremental refresh needs a special technique to capture changes at the proper time
Data Extraction
• Two major factors for a data warehouse:
1. Extract data from many disparate sources.
2. Extract data on the changes, for ongoing incremental loads as well as for a one-time initial full load.
Data Extraction
• The previous factors, therefore, warrant the use of third-party data extraction tools in addition to in-house programs or scripts.
  • Third-party tools are more expensive but record their own metadata.
• In-house programs increase the cost of maintenance and are hard to maintain as source systems change.
• Effective data extraction is a key to the success of your data warehouse.
Third-Party Tools
1. Teradata Warehouse Builder from Teradata
2. DataStage from Ascential Software
3. SAS System from SAS Institute
4. PowerMart/PowerCenter from Informatica
5. Sagent Solution from Sagent Software
List of Issues in Data Extraction

1. Source identification
2. Method of extraction
3. Extraction frequency
4. Time window
5. Job sequencing
6. Exception handling
Source Identification
Data in Operational Systems
• Current value
  • Applies to most attributes
  • Holds the value at the current time
  • May change as transactions occur
• Periodic status
  • The value of the attribute and its history are preserved
  • Each value is stored with a reference to time
  • For an insurance policy, each event, such as claim initiation, verification, assessment, and settlement, is usually recorded at the point of time when something in the policy changes (see the sketch after this list).
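A minimal sketch of the two representations with hypothetical fields: a current-value record holds only the latest state, while a periodic-status history keeps every status with its time reference, so any past state can be reconstructed.

```python
from datetime import date

# Current value: only the latest state of the policy is kept.
policy_current = {"policy_id": "P-100", "status": "settled"}

# Periodic status: every status change is preserved with its time reference.
policy_history = [
    {"policy_id": "P-100", "status": "claim_initiated", "on": date(2024, 1, 5)},
    {"policy_id": "P-100", "status": "verified",        "on": date(2024, 1, 12)},
    {"policy_id": "P-100", "status": "assessed",        "on": date(2024, 2, 1)},
    {"policy_id": "P-100", "status": "settled",         "on": date(2024, 2, 20)},
]

def status_as_of(history, when):
    """Reconstruct the policy status at any past date from the history."""
    events = [e for e in history if e["on"] <= when]
    return max(events, key=lambda e: e["on"])["status"] if events else None

print(status_as_of(policy_history, date(2024, 1, 31)))  # verified
```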
Data in Operational Systems
Data Extraction
• The data warehouse must be kept updated so that the history of changes and statuses is reflected in the DWH.
• There are two major types of data extraction techniques:
1. Static data (as is)
2. Data of revisions
Data Extraction
• Static data (as is): capture of data at a given point in time.
  • For current and periodic data, includes all transient data identified for extraction.
  • Used for the initial load.
• Data of revisions (incremental data capture)
  • Captures the revisions since the last time data was captured.
  • If the source data is transient/current, then capture of revisions is not easy.
  • If periodic, extracts the statuses and events that have been recorded since the last data extraction.
Incremental Data Capture
• Immediate incremental data capture
• Deferred data capture
Immediate Data Capture
• Happens in real time.
• Uses three strategies:
1. Capture through transaction log
2. Capture through database trigger
3. Capture in source application
Capture Through Transaction Log File
• The DBMS writes a log file to recover from failures.
• Read the log file and select all committed transactions.
• The log contains all the changes to the various source database tables.
• Replication can be used to capture changes to source data (see the sketch after this list):
1. Identify the source system database table
2. Identify and define target files in the staging area
3. Create the mapping between the source table and target files
4. Define the replication mode
Capture Through Transaction Log File
5. Schedule the replication process
6. Capture the changes from the transaction logs
7. Transfer captured data from logs to target files
8. Verify transfer of data changes
9. Confirm success or failure of replication
10. In metadata, document the outcome of replication
11. Maintain definitions of sources, targets, and mappings
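A minimal sketch of log-based capture, assuming a hypothetical change_log table that stands in for the DBMS transaction log; a real system would read the actual log or use a replication tool, so every name here is an illustrative assumption.

```python
import sqlite3

# Hypothetical source with a change_log table standing in for the DBMS log.
src = sqlite3.connect(":memory:")
src.execute("""CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY, tbl TEXT, op TEXT, pk INTEGER, payload TEXT)""")
src.executemany("INSERT INTO change_log VALUES (?, ?, ?, ?, ?)", [
    (1, "customer", "INSERT", 7, '{"name": "Ada"}'),
    (2, "customer", "UPDATE", 7, '{"name": "Ada L."}'),
])

last_applied_seq = 0  # checkpoint kept in ETL metadata between runs

# Capture only the changes recorded after the last replication run.
changes = src.execute(
    "SELECT seq, tbl, op, pk, payload FROM change_log WHERE seq > ? ORDER BY seq",
    (last_applied_seq,),
).fetchall()

# Transfer the captured changes to a target file in the staging area.
with open("staging_customer_changes.txt", "w") as target:
    for seq, tbl, op, pk, payload in changes:
        target.write(f"{seq}\t{tbl}\t{op}\t{pk}\t{payload}\n")

last_applied_seq = changes[-1][0] if changes else last_applied_seq
print(f"Replicated {len(changes)} changes; new checkpoint = {last_applied_seq}")
```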
Immediate Data Capture
Capture Through Trigger
• Triggers are special stored procedures that are stored in the database and fired when certain predefined events occur.
• They execute actions on INSERT/UPDATE/DELETE.
• Operational applications need not be changed.
• Triggers enable real-time update of the DW.
• For example, if you need to capture all changes to the records in the customer table, write a trigger program to capture all updates and deletes in that table, as sketched below.
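A minimal sketch of the customer-table example using SQLite triggers; the table and trigger names are illustrative assumptions. Each update or delete writes a row to a change table that an ETL job can later extract.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
-- Change table read later by the ETL extraction job (illustrative name).
CREATE TABLE customer_changes (id INTEGER, op TEXT, changed_at TEXT);

-- Fire on updates and deletes; the application code is untouched.
CREATE TRIGGER trg_customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes VALUES (OLD.id, 'UPDATE', datetime('now'));
END;

CREATE TRIGGER trg_customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO customer_changes VALUES (OLD.id, 'DELETE', datetime('now'));
END;
""")

db.execute("INSERT INTO customer VALUES (1, 'Ada')")
db.execute("UPDATE customer SET name = 'Ada L.' WHERE id = 1")
db.execute("DELETE FROM customer WHERE id = 1")

print(db.execute("SELECT * FROM customer_changes").fetchall())
# [(1, 'UPDATE', '...'), (1, 'DELETE', '...')]
```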
Capture in Source Application
• The source application is made to assist in the data capture for the data warehouse.
• You have to modify the relevant application programs that write to the source files and databases (a sketch follows).
• Works for all types of updates and systems.
• Drawback: operational applications must be changed, which adds operational overhead.
Deferred Data Extraction

• The techniques under deferred data extraction do not capture the changes in real time; the capture happens later.
• There are two types (a timestamp-based sketch follows this list):
1. Capture based on date and time stamp
   • A timestamp is recorded for each update.
   • Captures the latest state of the source data.
2. Capture by comparing files (snapshot differential)
   • Compares copies of the data; simple and straightforward.
   • May be the only feasible option for some legacy data sources that have neither transaction logs nor time stamps on source records.
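A minimal sketch of timestamp-based capture, assuming a hypothetical updated_at column on the source rows; each run extracts only the rows touched since the last extraction time.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE product (id INTEGER, price REAL, updated_at TEXT)")
src.executemany("INSERT INTO product VALUES (?, ?, ?)", [
    (1, 9.99, "2025-01-10T08:00:00"),
    (2, 4.50, "2025-02-01T12:30:00"),
    (3, 7.25, "2025-02-03T09:15:00"),
])

# Time of the previous extraction, kept in ETL metadata between runs.
last_extract_time = "2025-01-31T00:00:00"

# Deferred capture: pull only rows updated since the last extraction.
# ISO-formatted timestamps compare correctly as strings.
delta = src.execute(
    "SELECT id, price, updated_at FROM product WHERE updated_at > ?",
    (last_extract_time,),
).fetchall()

print(delta)  # rows 2 and 3 only; row 1 was already extracted earlier
```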
Snapshot Differential
• F1: the existing set of data in the data warehouse.
  • Each entry of snapshot F1 is denoted by a tuple <ki, bi>, where ki is the key and bi is the set of fields.
• F2: the data source as a snapshot.
  • Each entry of snapshot F2 is denoted by a tuple <ki, bj>, where ki is the key and bj is the corresponding set of fields holding the present value of the tuple.
Snapshot Differential
For each key ki in F2, find the tuple with the matching key in F1 (the sketch below implements this comparison):
1. If we find a tuple <ki, bi> in F1 and bj differs from bi, the tuple in F1 needs to be updated:
   UPDATE <ki, bj> in F1
2. If there is no tuple in F1 corresponding to the tuple in F2, insert the tuple into F1:
   INSERT <ki, bj> into F1
3. If there is no tuple in F2 with key ki corresponding to a tuple <ki, bi> in F1, the tuple in F1 has to be deleted:
   DELETE <ki, bi> from F1
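A minimal sketch of this comparison in Python, representing each snapshot as a dict from key ki to its field set; the sample values are illustrative.

```python
# F1 is the warehouse snapshot; F2 is the current source snapshot.
F1 = {1: {"name": "Ada", "city": "London"},
      2: {"name": "Alan", "city": "Leeds"},
      3: {"name": "Mary", "city": "York"}}
F2 = {1: {"name": "Ada", "city": "Paris"},
      2: {"name": "Alan", "city": "Leeds"},
      4: {"name": "Grace", "city": "Hull"}}

# Case 1: key in both snapshots but fields differ -> UPDATE.
updates = {k: b for k, b in F2.items() if k in F1 and F1[k] != b}
# Case 2: key only in the source snapshot -> INSERT.
inserts = {k: b for k, b in F2.items() if k not in F1}
# Case 3: key only in the warehouse snapshot -> DELETE.
deletes = {k: b for k, b in F1.items() if k not in F2}

print("UPDATE:", updates)  # {1: {'name': 'Ada', 'city': 'Paris'}}
print("INSERT:", inserts)  # {4: {'name': 'Grace', 'city': 'Hull'}}
print("DELETE:", deletes)  # {3: {'name': 'Mary', 'city': 'York'}}
```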
Fast Data Load
• Extract/ETL only the changes since the last load (the delta).
• Extracts only the data that has changed since the last time a build was run.
Fast Data Load
• Delta = changes since the last load.
• Strategy 1: store sorted total extracts in the DSA (Data Staging Area); compute the delta by comparing the current extract with the last one (the snapshot differential above).
  • Always possible
  • Handles deletions
  • High extraction time
• Strategy 2: put an update timestamp on all rows (in the sources), maintained by a DB trigger, and extract only where "timestamp > time of last extract".
  • Reduces extract time
  • Cannot (alone) handle deletions
Deferred CDC