Kabul University: Computer Science Faculty

The document discusses Extract, Transform, Load (ETL) processes in data warehousing. It describes the ETL cycle with the extract, transform, and load steps. It discusses issues that can arise in each step, such as inconsistencies in data from different source systems and lack of data standards. Transforming data to standardized formats and enriching data with additional attributes are challenging due to dirty, poorly documented source data and lack of address/name standards. Manual data entry also introduces potential errors.

Uploaded by

Sohail Azizof

KABUL UNIVERSITY

Computer Science Faculty

Data Warehousing & BI


Lecture 2: Extract, Transform, Load (ETL), Part 1
Lecturer: Ahmad Javid Mayar

1
Putting the pieces together

[Figure: four-tier data warehouse architecture.
 Data sources (Tier 0): operational databases, semistructured sources, www data, archived data.
 Data Warehouse Server (Tier 1): Extract, Transform, Load (ETL) into the warehouse, with metadata and data marts.
 OLAP Servers (Tier 2): MOLAP and ROLAP.
 Clients (Tier 3): query/reporting, data analysis and data mining tools, used by business users and IT users.
 Comment: everything except ETL is shown washed out.]



2
The ETL Cycle
EXTRACT:   The process of reading data from different sources.
TRANSFORM: The process of transforming the extracted data from its original
           state into a consistent state so that it can be placed into
           another database.
LOAD:      The process of writing the data into the target database.


[Figure: the ETL cycle.
 Sources: www data, MIS systems (Acct, HR), legacy systems, archived data, other indigenous applications (COBOL, VB, C++, Java).
 Flow: EXTRACT into temporary data storage, then TRANSFORM and CLEANSE, then LOAD into the data warehouse, which feeds OLAP.]
3
ETL Processing
ETL consists of independent yet interrelated steps.
It is important to look at the big picture.
Data acquisition time may include…

- Extracts from source systems
- Data movement
- Data transformation
- Data cleansing
- Data loading
- Index maintenance
- Statistics collection
- Backup

4
Overview of Data Extraction
The first step of ETL, followed by many others.

Source systems for extraction are typically OLTP systems.

A very complex task for a number of reasons:

- The source systems are very complex and poorly documented.
- Data has to be extracted not once, but a number of times.

The process design depends on:

- Which extraction method to choose?
- How to make the extracted data available for further processing?

5
Types of Data Extraction
- Logical Extraction
  - Full Extraction
  - Incremental Extraction

- Physical Extraction
  - Online Extraction
  - Offline Extraction
  - Legacy vs. OLTP

6
Logical Data Extraction
- Full Extraction
  - The data is extracted completely from the source system.
  - No need to keep track of changes.
  - Source data is made available as-is, without any additional information.

- Incremental Extraction
  - Data is extracted after a well-defined point/event in time.
  - A mechanism (a column or a table) is used to reflect/record the temporal changes in data.
  - Sometimes entire tables are off-loaded from the source system into the DWH.
  - This can have a significant performance impact on the data warehouse server.
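The difference between the two logical methods can be sketched in a few lines of Python. This is only an illustration under assumptions: the `customers` table, its `updated_at` change-tracking column and the sample rows are hypothetical, and an in-memory SQLite database stands in for the source system.

```python
import sqlite3

def full_extract(conn):
    """Full extraction: read every row; no change tracking is needed."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers ORDER BY id"
    ).fetchall()

def incremental_extract(conn, last_extracted_at):
    """Incremental extraction: read only rows changed after a point in time."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY id",
        (last_extracted_at,),
    ).fetchall()

# Hypothetical source table with a timestamp column that records changes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ahmad", "2024-01-01 09:00:00"),
        (2, "Farid", "2024-01-02 10:30:00"),
        (3, "Zahra", "2024-01-03 08:15:00"),
    ],
)

all_rows = full_extract(conn)
# Only rows modified after the previous extraction run are returned.
changed_rows = incremental_extract(conn, "2024-01-01 23:59:59")
```

The timestamp of the last successful run would normally be stored somewhere between runs; here it is simply passed in as a string.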
7
Physical Data Extraction…
- Online Extraction
  - Data is extracted directly from the source system.
  - May access source tables through an intermediate system.
  - The intermediate system is usually similar to the source system.

- Offline Extraction
  - Data is NOT extracted directly from the source system; instead it is staged
    explicitly outside the original source system.
  - Data is either already structured or was created by an extraction routine.
  - Some of the prevalent structures are:
    - Flat files
    - Dump files
    - Redo and archive logs
    - Transportable tablespaces

8
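As a rough illustration of offline extraction, the sketch below stages the result of a query as a flat file (CSV) outside the source system. The `orders` table and the function name `stage_to_flat_file` are hypothetical, and an in-memory buffer stands in for a file in the staging area.

```python
import csv
import io
import sqlite3

def stage_to_flat_file(conn, query, out):
    """Write the result of `query` to `out` as CSV with a header row.

    Returns the number of data rows written.
    """
    cursor = conn.execute(query)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # column names
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count

# Hypothetical source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 250.0), (2, 99.5)])

staged = io.StringIO()  # stands in for a file in the staging area
rows_written = stage_to_flat_file(
    conn, "SELECT order_id, amount FROM orders ORDER BY order_id", staged
)
```

Later ETL steps would then read the staged file instead of querying the source system again.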
Physical Data Extraction

- Legacy vs. OLTP
  - Data is moved from the source system
  - A copy is made of the source system data
  - A staging area is used for performance reasons

9
Data Transformation

Basic tasks:
1. Selection
2. Splitting/Joining
3. Conversion
4. Summarization
5. Enrichment

10
Data Transformation Basic Tasks

- Selection

11
Data Transformation Basic Tasks

- Splitting/joining

12
Data Transformation Basic Tasks

- Conversion

13
Data Transformation Basic Tasks: Conversion Example-1

- Convert common data elements into a consistent form, e.g. name and address.

Field format                Field data
First-Family-title          Muhammad Ibrahim Contractor
Family-title-comma-first    Ibrahim Contractor, Muhammad
Family-comma-first-title    Ibrahim, Muhammad Contractor

- Translation of dissimilar codes into a standard code.

Source codes                                            Standard code
Natl. ID, National ID                                   NID
F/NO-2, F-2, FL.NO.2, FL.2, FL/NO.2, FL-2, FLAT-2,      FLAT No. 2
FLAT#, FLAT,2, FLAT-NO-2, FL-NO.2
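Both conversions can be sketched in Python. The function names, the regular expression and the simplifying assumption that every name part is a single word are illustrative choices, not part of the lecture.

```python
import re

def parse_name(value, field_format):
    """Parse a name in one of the three slide formats into (first, family, title).

    Simplifying assumption: each part is a single word.
    """
    if field_format == "First-Family-title":
        first, family, title = value.split()
    elif field_format == "Family-title-comma-first":
        left, first = [part.strip() for part in value.split(",")]
        family, title = left.split()
    elif field_format == "Family-comma-first-title":
        family, right = [part.strip() for part in value.split(",")]
        first, title = right.split()
    else:
        raise ValueError(f"unknown field format: {field_format}")
    return first, family, title

# Hypothetical lookup table for dissimilar ID codes.
ID_CODES = {"Natl. ID": "NID", "National ID": "NID"}

# Hypothetical pattern: F, FL or FLAT, optional separators, optional NO,
# optional separators, then the flat number.
FLAT_PATTERN = re.compile(
    r"^F(?:L(?:AT)?)?[\s./,#-]*(?:NO)?[\s./,#-]*(\d+)$", re.IGNORECASE
)

def standardize_flat(code):
    """Translate a flat-number variant into the standard form 'FLAT No. <n>'."""
    match = FLAT_PATTERN.match(code.strip())
    if match is None:
        return None  # unrecognized codes are left for manual review
    return f"FLAT No. {match.group(1)}"
```

A variant such as `FLAT#`, which carries no number, is not matched and comes back as `None`, so it can be routed to a person rather than guessed.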

14
Data Transformation Basic Tasks: Conversion Example-2

- Data representation change
  - EBCDIC to ASCII

- Operating system change
  - Mainframe (MVS) to UNIX
  - UNIX to NT or XP

- Data type change
  - Program (Excel to Access), database format (FoxPro to Access).
  - Character, numeric and date type.
  - Fixed and variable length.
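A minimal Python sketch of these conversions follows. It assumes the EBCDIC data uses code page 037 (Python's `cp037` codec), and the fixed-length record layout (name 10 characters, amount 8, date 8) is hypothetical.

```python
from datetime import date, datetime
from decimal import Decimal

def ebcdic_to_text(raw: bytes) -> str:
    """Decode EBCDIC bytes (code page 037 assumed) into a Python string."""
    return raw.decode("cp037")

def parse_fixed_record(line: str) -> dict:
    """Split a fixed-length record and convert each field to a proper type.

    Hypothetical layout: name (10 chars), amount (8 chars), date YYYYMMDD (8 chars).
    """
    name = line[0:10].rstrip()                             # fixed -> variable length
    amount = Decimal(line[10:18].strip())                  # character -> numeric
    day = datetime.strptime(line[18:26], "%Y%m%d").date()  # character -> date
    return {"name": name, "amount": amount, "date": day}

# Simulate a record as it might arrive from a mainframe.
raw = "IBRAHIM   00125.5019970314".encode("cp037")
record = parse_fixed_record(ebcdic_to_text(raw))
```

Other EBCDIC code pages exist, so the codec has to be confirmed for each source system before decoding.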
15
Data Transformation Basic Tasks

- Summarization

16
Data Transformation Basic Tasks

- Enrichment

17
Data Transformation Basic Tasks: Enrichment Example

- Data elements are mapped from source tables and files to destination fact
  and dimension tables.

Input Data:
  HAJI MUHAMMAD IBRAHIM, GOVT. CONT.
  K. S. ABDULLAH & BROTHERS,
  Kote Sangay Road, Asia Market
  Kabul, Ph 0703157848

Parsed Data:
  First Name:    HAJI MUHAMMAD
  Family Name:   IBRAHIM
  Title:         GOVT. CONT.
  Firm:          K. S. ABDULLAH & BROTHERS
  Firm Location: Asia Market
  Road:          Kote Sangay
  Phone:         0093-703157848
  City:          Kabul
  Code:          46200

- Default values are used in the absence of source data.

- Fields are added for unique keys and time elements.

18
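The enrichment steps can be sketched as a small Python function that takes an already parsed record and adds what the source does not provide. The lookup table, the default values and the field names `customer_key` and `load_time` are hypothetical; the code value is the one shown in the slide.

```python
from datetime import datetime, timezone
from itertools import count

# Hypothetical reference data; the code value is the one shown in the slide.
CITY_CODES = {"Kabul": "46200"}
DEFAULTS = {"Title": "UNKNOWN", "Firm": "UNKNOWN"}
_next_key = count(1)  # simple surrogate key generator

def enrich(parsed: dict, load_time: datetime) -> dict:
    """Return a copy of a parsed record with defaults, a code, a key and a load time."""
    row = dict(parsed)
    for field, default in DEFAULTS.items():
        if not row.get(field):           # default value in the absence of source data
            row[field] = default
    # Attribute that is looked up rather than read from the source record.
    row["Code"] = CITY_CODES.get(row.get("City"), "UNKNOWN")
    row["customer_key"] = next(_next_key)     # field added for a unique key
    row["load_time"] = load_time.isoformat()  # field added for a time element
    return row

parsed = {
    "First Name": "HAJI MUHAMMAD",
    "Family Name": "IBRAHIM",
    "City": "Kabul",
    "Title": "",
}
row = enrich(parsed, datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Passing the load time in as an argument keeps the function deterministic, which makes it easier to test than calling the clock inside it.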
Issues of ETL

19
Why ETL Issues?
Data from different source systems will be different, poorly documented and
dirty. A lot of analysis is required.

Easy to collate addresses and names? Not really. There are no address or name
standards.

Use software for standardization? Very expensive, as any “standards” vary from
country to country and the market is not large enough.
20
Why ETL Issues?
Things would have been simpler in the presence of operational systems, but
that is not always the case.

Manual data collection and entry. Nothing wrong with that, but it has the
potential to introduce lots of problems.

Data is never perfect. The cost of perfection is extremely high compared with
its value.

21
“Some” Issues
- Usually, if not always, underestimated
- Diversity in source systems and platforms
- Inconsistent data representations
- Complexity of transformations
- Rigidity and unavailability of legacy systems
- Volume of legacy data
- Web scraping

22
Complexity of problem/work underestimated

- Work seems to be deceptively simple.

- People start manually building the DWH.

- Programmers underestimate the task.

- Impressions could be deceiving.

- Traditional DBMS rules and concepts break down for very large heterogeneous
  historical databases.

23
Diversity in source systems and platforms
Platform         OS        DBMS         MIS/ERP
Mainframe        VMS       Oracle       SAP
Mini Computer    Unix      Informix     PeopleSoft
Desktop          Win NT    Access       JD Edwards
                 DOS       Text file

Dozens of source systems across organizations

Numerous source systems within an organization

Need a specialist for each

24
Inconsistent data representations
Same data, different representation

Date value representations
Examples:
- 970314
- 1997-03-14
- 03/14/1997
- 14-MAR-1997
- March 14 1997
- 2450521.5 (Julian date format)

Gender value representations
Examples:
- Male/Female
- M/F
- 0/1
- PM/PF
25
Multiple sources for same data element
Need to rank source systems on a per data element basis.

Take each data element from the highest-ranked source system in which the
element exists.

“Guessing” gender from name

Something is better than nothing?
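The ranking rule can be sketched in Python as follows. The source names, the ranking itself and the sample records are hypothetical.

```python
# Hypothetical ranking: for each data element, source systems listed
# from most trusted to least trusted.
SOURCE_RANK = {
    "phone": ["crm", "billing", "legacy"],
    "gender": ["hr", "crm"],
}

def pick_value(element, records_by_source):
    """Return (value, source) from the highest-ranked source where the element exists."""
    for source in SOURCE_RANK[element]:
        value = records_by_source.get(source, {}).get(element)
        if value not in (None, ""):
            return value, source
    return None, None  # the element exists in no source

# The same customer as seen by three hypothetical source systems.
records = {
    "crm": {"phone": "", "gender": "F"},
    "billing": {"phone": "0703157848"},
    "legacy": {"phone": "703157848"},
}

phone, phone_source = pick_value("phone", records)
gender, gender_source = pick_value("gender", records)
```

Here the phone number comes from `billing` because the higher-ranked `crm` record is empty, and the gender comes from `crm` because `hr` has no record at all.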

26
Beware of data quality (or lack of it)
- Data quality is always worse than expected.

- It is not a matter of a few hundred rows.

- Data recorded for running operations is not usually good enough for
  decision support.
  - Not knowing gender does not hurt POS (point of sale).

27
