Kabul University: Computer Science Faculty
Putting the pieces together
[Figure: data warehouse architecture. Labels: Data sources (Operational Data Bases, Semistructured Sources, www data, Archived data); Extract, Transform, Load (ETL); Data Warehouse with Meta Data; Data Marts (MOLAP, ROLAP); Tools (Query/Reporting, Data Analysis, Data Mining); Business Users, IT Users.]
[Figure: ETL flow. Labels: www data, MIS Systems (Acct, HR), Legacy Systems, Archived data, Other indigenous applications (COBOL, VB, C++, Java); EXTRACT, TRANSFORM, CLEANSE, LOAD; Temporary Data storage; Data Warehouse; OLAP; Business Users.]
ETL Processing
ETL consists of independent yet interrelated steps.
It is important to look at the big picture.
Data acquisition time may include…
Extracts from source systems
Data Movement
Data Transformation
Data Cleansing
Data Loading
Index Maintenance
Statistics Collection
Backup
Overview of Data Extraction
Extraction is the first step of ETL; many further steps follow it.
Types of Data Extraction
Logical Extraction
  Full Extraction
  Incremental Extraction
Physical Extraction
  Online Extraction
  Offline Extraction
  Legacy vs. OLTP
Logical Data Extraction
Full Extraction
The data is extracted completely from the source system.
Incremental Extraction
Only the data that has changed since a well-defined point or event in time is extracted.
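As a rough illustration of the difference between full and incremental extraction, the sketch below uses an in-memory SQLite table. The table name `orders`, the `updated_at` column, and the cut-off value are hypothetical; a real source system may expose changes through timestamps, triggers, or logs instead.

```python
import sqlite3

# Hypothetical source table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "1997-03-10"),
        (2, 25.5, "1997-03-13"),
        (3, 40.0, "1997-03-14"),
    ],
)

def full_extract(conn):
    """Full extraction: read every row from the source table."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()

def incremental_extract(conn, last_extract_time):
    """Incremental extraction: read only rows changed after the given point in time."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (last_extract_time,),
    ).fetchall()

all_rows = full_extract(conn)
new_rows = incremental_extract(conn, "1997-03-12")
```

Here the ISO-formatted date strings compare correctly as text, which is why a plain `>` comparison is enough for the sketch.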
Physical Data Extraction
Offline Extraction
Data is NOT extracted directly from the source system; instead it is staged
explicitly outside the original source system.
Data Transformation
Basic tasks
1. Selection
2. Splitting/Joining
3. Conversion
4. Summarization
5. Enrichment
Data Transformation Basic Tasks
Selection
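The body of this slide is not reproduced here, so the following is only a minimal sketch of what selection commonly means in ETL: keeping just the records and fields the warehouse needs. The record layout and the `region` filter are assumptions made for illustration.

```python
# Hypothetical source records; field names are illustrative only.
source_rows = [
    {"id": 1, "name": "Ali", "region": "Kabul", "fax": "n/a"},
    {"id": 2, "name": "Sara", "region": "Herat", "fax": "n/a"},
    {"id": 3, "name": "Omar", "region": "Kabul", "fax": "n/a"},
]

def select(rows, wanted_fields, predicate):
    """Keep only rows that satisfy the predicate, and only the wanted fields."""
    return [
        {field: row[field] for field in wanted_fields}
        for row in rows
        if predicate(row)
    ]

selected = select(source_rows, ["id", "name"], lambda r: r["region"] == "Kabul")
```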
Data Transformation Basic Tasks
Splitting/joining
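One possible reading of splitting/joining at the field level is sketched below: one source field is split into several target fields, and several source fields are joined into one. The function names and sample values are hypothetical; joining can also refer to combining records from different sources, which this sketch does not cover.

```python
def split_full_name(full_name):
    """Split one source field into two target fields at the first space."""
    first, _, last = full_name.strip().partition(" ")
    return {"first_name": first, "last_name": last}

def join_phone(area_code, number):
    """Join two source fields into one target field."""
    return f"{area_code}-{number}"

name_parts = split_full_name("Ahmad Karimi")
phone = join_phone("020", "2500000")
```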
Data Transformation Basic Tasks
Conversion
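Conversion typically covers mapping source-specific codes to one standard code and converting units. The sketch below shows both under assumed inputs; the code mapping and the function names are hypothetical and are not the examples from the original slides.

```python
# Hypothetical mapping from source-specific gender codes to one warehouse code.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "2": "F"}

def convert_gender(raw):
    """Map a source gender code to 'M' or 'F'; return 'U' (unknown) otherwise."""
    return GENDER_CODES.get(str(raw).strip().lower(), "U")

def pounds_to_kilograms(pounds):
    """Convert a weight in pounds to kilograms (1 lb = 0.45359237 kg)."""
    return pounds * 0.45359237

codes = [convert_gender(value) for value in ["Male", "f", 1, "x"]]
weight_kg = round(pounds_to_kilograms(10), 3)
```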
Data Transformation Basic Tasks: Conversion Example-1
Data Transformation Basic Tasks: Conversion Example-2
Data Transformation Basic Tasks
Summarization
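Summarization usually means storing data at a coarser grain than the source keeps it. A minimal sketch, assuming transaction-level sales records that are rolled up to one total per store per day (the record layout is hypothetical):

```python
from collections import defaultdict

# Hypothetical transaction-level sales records: (store, date, amount).
sales = [
    ("Store A", "1997-03-14", 100.0),
    ("Store A", "1997-03-14", 50.0),
    ("Store B", "1997-03-14", 70.0),
    ("Store A", "1997-03-15", 30.0),
]

def summarize_daily(records):
    """Roll transaction-level amounts up to one total per (store, date)."""
    totals = defaultdict(float)
    for store, day, amount in records:
        totals[(store, day)] += amount
    return dict(totals)

daily_totals = summarize_daily(sales)
```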
Data Transformation Basic Tasks
Enrichment
Data Transformation Basic Tasks: Enrichment Example
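The original example is not reproduced here. As a stand-in, the sketch below treats enrichment as deriving more useful fields from existing ones; the field names (`sale_date`, `first_name`, `last_name`) and the derived fields are assumptions made for illustration.

```python
from datetime import date

def enrich(record):
    """Return a copy of the record with derived fields added."""
    enriched = dict(record)
    sale_date = date.fromisoformat(record["sale_date"])
    enriched["sale_year"] = sale_date.year
    # Quarter 1 covers months 1-3, quarter 2 covers months 4-6, and so on.
    enriched["sale_quarter"] = (sale_date.month - 1) // 3 + 1
    enriched["full_name"] = f'{record["first_name"]} {record["last_name"]}'
    return enriched

enriched = enrich(
    {"first_name": "Ahmad", "last_name": "Karimi", "sale_date": "1997-03-14"}
)
```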
Why ETL Issues?
Data from different source systems will be
different, poorly documented, and dirty. A lot of
analysis is required.
“Some” Issues
Complexity of the work is usually, if not always, underestimated
Diversity in source systems and platforms
Inconsistent data representations
Complexity of transformations
Rigidity and unavailability of legacy systems
Volume of legacy data
Web scraping
Complexity of problem/work underestimated
Diversity in source systems and platforms
Platform        OS       DBMS        MIS/ERP
Mainframe       VMS      Oracle      SAP
Mini Computer   Unix     Informix    PeopleSoft
Desktop         Win NT   Access      JD Edwards
                DOS      Text file
Inconsistent data representations
Same data, different representation
Date value representations
Examples:
970314
1997-03-14
03/14/1997
14-MAR-1997
March 14 1997
2450521.5 (Julian date format)
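The representations above all denote the same day, so a transformation step can normalize them to one type. The sketch below is one way to do that with Python's standard library; the format list, the helper name `normalize_date`, and the order in which formats are tried are assumptions, and the month-name formats rely on an English locale.

```python
import re
from datetime import date, datetime

# Formats for the representations listed above, tried in order.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%B %d %Y", "%y%m%d"]

# Offset between a Julian date and Python's proleptic Gregorian ordinal.
JULIAN_TO_ORDINAL = 1721424.5

def normalize_date(text):
    """Parse one of the listed representations into a datetime.date."""
    text = text.strip()
    if re.fullmatch(r"\d+\.\d+", text):
        # Treat a decimal number as a Julian date; the time of day is dropped.
        return date.fromordinal(int(float(text) - JULIAN_TO_ORDINAL))
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date representation: {text!r}")

samples = [
    "970314", "1997-03-14", "03/14/1997",
    "14-MAR-1997", "March 14 1997", "2450521.5",
]
normalized = [normalize_date(sample) for sample in samples]
```

Note that a two-digit year such as `97` is inherently ambiguous; `strptime` resolves it by a fixed convention, which may not match what a given source system intended.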
Multiple sources for same data element
Source systems need to be ranked on a per-data-element basis.
Take each data element from the highest-ranked source system in which the
element exists.
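The ranking rule can be sketched as a lookup that walks the sources for a data element from highest to lowest rank and stops at the first one that actually holds a value. The source names (`crm`, `billing`, `legacy`), the rankings, and the sample values are hypothetical.

```python
# Hypothetical ranking: for each data element, source systems from highest to lowest rank.
SOURCE_RANKS = {
    "address": ["crm", "billing", "legacy"],
    "phone": ["billing", "crm", "legacy"],
}

def pick_value(element, values_by_source):
    """Return (source, value) from the highest-ranked source that has the element."""
    for source in SOURCE_RANKS[element]:
        value = values_by_source.get(source)
        if value not in (None, ""):
            return source, value
    return None, None

address = pick_value(
    "address", {"crm": "", "billing": "Jamal Mina, Kabul", "legacy": "Kabul"}
)
phone = pick_value("phone", {"crm": "020-2500000", "legacy": "2500000"})
```

In this sketch an empty string counts as "element does not exist", so the lookup falls through to the next source in rank order.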
Beware of data quality (or lack of it)
Data quality is always worse than expected.