DW M4 L1 - ETL Introduction
Extract, Transform, Load
• Extract
– Extract relevant data
• Transform
– Transform data to DW format
– Build keys, etc.
– Cleansing of data
• Load
– Load data into DW
– Build aggregates, etc.
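The three steps above can be sketched as a small pipeline. This is a minimal illustration only, not a production design: the source rows, the table names (dim_customer, fact_sales, agg_sales) and the cleansing rule (drop rows whose amount cannot be parsed) are all assumptions made for the example, and SQLite stands in for the DW.

```python
import sqlite3

# Hypothetical source rows standing in for an operational system.
SOURCE_ROWS = [
    {"customer": " Alice ", "amount": "10.50", "channel": "web"},
    {"customer": "Bob", "amount": "4.25", "channel": "store"},
    {"customer": "alice", "amount": "2.00", "channel": "web"},
    {"customer": "Carol", "amount": "n/a", "channel": "web"},  # dirty amount
]


def extract(rows):
    """Extract: keep only the fields the warehouse needs."""
    return [{"customer": r["customer"], "amount": r["amount"]} for r in rows]


def transform(rows):
    """Transform: cleanse values and build surrogate keys."""
    keys = {}   # natural key (cleaned name) -> surrogate key
    facts = []  # (customer_key, amount) tuples in DW format
    for r in rows:
        name = r["customer"].strip().title()
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # cleansing rule (assumed): drop unparseable amounts
        key = keys.setdefault(name, len(keys) + 1)
        facts.append((key, amount))
    return keys, facts


def load(conn, keys, facts):
    """Load: write dimension and fact rows, then build an aggregate."""
    conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE fact_sales (customer_key INTEGER, amount REAL)")
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                     [(k, n) for n, k in keys.items()])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", facts)
    conn.execute(
        "CREATE TABLE agg_sales AS "
        "SELECT customer_key, SUM(amount) AS total "
        "FROM fact_sales GROUP BY customer_key"
    )
    conn.commit()


def run_etl():
    conn = sqlite3.connect(":memory:")
    keys, facts = transform(extract(SOURCE_ROWS))
    load(conn, keys, facts)
    return conn.execute(
        "SELECT customer_key, total FROM agg_sales ORDER BY customer_key"
    ).fetchall()
```

Calling run_etl() returns one aggregate row per customer key. The two spellings of "Alice" collapse to a single key, and the "Carol" row is rejected during cleansing before any key is assigned.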
ETL System
• Back room or “Green room” of the DW
• Analogy - Kitchen of a restaurant
– A restaurant’s kitchen is designed for efficiency,
quality & integrity
– Throughput is critical when the restaurant is
packed
– Meals coming out should be consistent and
hygienic
– Skilled chefs
– Patrons not allowed inside
• Dangerous place to be in – sharp knives and hot
plates
• Trade secrets
ETL Design & Development
• Most challenging problem faced by the DW
project team
• 70% of the risk & effort in a DW project
comes from ETL
• Has 34 subsystems!!
• Not a one-time effort!
– Initial load
– Subsequent loads (periodic refresh of the DW)
• Automation is critical!
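One common way to automate the difference between the initial load and the periodic refreshes is a "watermark" on a last-updated timestamp. The slides do not prescribe this technique; the sketch below assumes each source row carries an id and an ISO-formatted updated_at field, and uses plain dictionaries as hypothetical stand-ins for the warehouse and the job state.

```python
def select_rows_to_load(source_rows, watermark=None):
    """Initial load when there is no watermark, otherwise only newer rows."""
    if watermark is None:
        return list(source_rows)
    # ISO-formatted date strings compare correctly as plain strings.
    return [r for r in source_rows if r["updated_at"] > watermark]


def run_load(source_rows, warehouse, state):
    """Load new or changed rows and advance the watermark.

    `warehouse` maps id -> row; `state` remembers the last watermark
    between runs. Both are hypothetical stand-ins for real storage.
    """
    rows = select_rows_to_load(source_rows, state.get("watermark"))
    for r in rows:
        warehouse[r["id"]] = r  # insert or overwrite (upsert)
    if rows:
        state["watermark"] = max(r["updated_at"] for r in rows)
    return len(rows)
```

The first call with an empty state performs the initial load. Later calls, for example from a scheduler, pick up only rows changed since the previous run, so the same job can be rerun without manual intervention.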
Back Room Architecture
• ETL processing happens here
• Availability of right data from point A to
point B with appropriate transformations
applied at the appropriate time
• ETL tools are largely automated, but are
still very complex systems
General ETL Requirements
• Productivity support
– Basic development environment capabilities like code
library management, check in/check out, version control
etc.
• Usability
– Must be as usable as possible
– GUI based
– System documentation: developers should easily capture
information about processes they are creating
– This metadata should be available to all
– Data compliance
• Metadata Driven
– Services that support ETL process must be metadata
driven
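As a rough illustration of "metadata driven", the mapping from source columns to DW columns can live in data rather than in code, with a generic routine that interprets it. The column names, the metadata layout and the transform names below are all hypothetical, chosen only for the example.

```python
# Registry of named transforms that the metadata may refer to (assumed set).
TRANSFORMS = {
    "strip": str.strip,
    "upper": str.upper,
    "to_float": float,
}

# Metadata: one entry per target column. Changing the mapping means
# editing this table, not the code that applies it.
COLUMN_METADATA = [
    {"source": "cust_nm", "target": "customer_name", "transforms": ["strip"]},
    {"source": "ctry", "target": "country_code", "transforms": ["strip", "upper"]},
    {"source": "amt", "target": "amount", "transforms": ["to_float"]},
]


def apply_metadata(row, metadata=COLUMN_METADATA):
    """Build a DW-format row by applying the transforms named in the metadata."""
    out = {}
    for col in metadata:
        value = row[col["source"]]
        for name in col["transforms"]:
            value = TRANSFORMS[name](value)
        out[col["target"]] = value
    return out
```

Because the same metadata table describes what the process does, it can also be published as documentation for everyone who needs it.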
General ETL Requirements
• Business needs – Users’ information requirements
• Compliance – must provide proof that the data reported is not
manipulated in any way
• Data Quality – garbage in garbage out!!
• Security – do not publish data widely to all decision makers
• Data Integration – Master Data Management System (MDM).
Conforming dimensions and facts
• Data Latency – huge effect on ETL architecture
– Use efficient data processing algorithms, parallelization, and
powerful hardware to speed up batch-oriented data flows
– If the requirement is for Real-time, then architecture must make a
switch from batch to microbatch or stream-oriented
• Archiving & Lineage – a must for compliance & security reasons
– After every major activity of the ETL pipeline, writing the data to disk
(staging) is recommended
– All staged data should be archived
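The staging and archiving recommendation can be sketched as follows. This is an assumption-laden example: JSON files, the step names and the run_id folder layout are illustrative choices, not something the slides specify.

```python
import json
import shutil
from pathlib import Path


def stage(rows, step, staging_dir):
    """Write the output of one ETL step to disk and return the file path."""
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    path = staging_dir / f"{step}.json"
    path.write_text(json.dumps(rows))
    return path


def archive(staging_dir, archive_dir, run_id):
    """Copy every staged file of this run into its own archive folder."""
    target = Path(archive_dir) / run_id
    shutil.copytree(staging_dir, target)  # fails if this run_id already exists
    return sorted(p.name for p in target.iterdir())


def run_pipeline(rows, staging_dir, archive_dir, run_id):
    """Stage after each major step, then archive all staged data."""
    stage(rows, "1_extracted", staging_dir)
    cleaned = [r for r in rows if r.get("amount") is not None]
    stage(cleaned, "2_cleaned", staging_dir)
    return archive(staging_dir, archive_dir, run_id)
```

Keeping one file per step makes it possible to restart from the last staged result after a failure, and the archived copies give a simple lineage trail showing what the data looked like at each stage of a given run.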
Choice of Architecture
Tool Based ETL
Data Flow
Extract → Clean → Conform → Deliver
• Back room of a DW is often called the data
staging area
• Staging means ‘writing to disk’
• ETL team needs a number of different data
structures for all kinds of staging needs
To stage or not to stage
• Flat files
– fast to write, append to, sort and filter (grep)
but slow to update, access or join
• XML Data Sets
– Used as a medium of data transfer between
incompatible data sources
• Relational Tables
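The trade-off between flat files and relational tables as staging structures can be illustrated with a small sketch. The data, the column names and the stg_ table names are hypothetical, and SQLite is used only as a convenient stand-in for a relational staging database.

```python
import csv
import io
import sqlite3

# Hypothetical staged flat file (kept in memory here for simplicity).
ORDERS_CSV = "customer_id,country,amount\n1,IN,10.5\n2,US,4.25\n1,IN,2.0\n"


def filter_flat_file(text, country):
    """Flat-file style: a single sequential scan, similar in spirit to grep."""
    return [row for row in csv.DictReader(io.StringIO(text))
            if row["country"] == country]


def join_in_relational_staging(orders_text, customers):
    """Relational style: load into staging tables and let SQL do the join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_orders (customer_id INTEGER, country TEXT, amount REAL)")
    conn.execute("CREATE TABLE stg_customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
    orders = [(int(r["customer_id"]), r["country"], float(r["amount"]))
              for r in csv.DictReader(io.StringIO(orders_text))]
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", orders)
    conn.executemany("INSERT INTO stg_customers VALUES (?, ?)", customers)
    return conn.execute(
        "SELECT c.name, o.amount FROM stg_orders o "
        "JOIN stg_customers c USING (customer_id) ORDER BY o.amount"
    ).fetchall()
```

Filtering the flat file needs nothing more than a scan, whereas joining orders to customers is expressed far more naturally once both are in relational staging tables.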
Coming up next …
• 34 subsystems of ETL