ETL Training - Day 1
A process of transforming data into information and making it available to users in a timely enough manner to make a difference.
[Figure: Data → Information]
Data Warehouse Defined
A data warehouse is a
subject-oriented
integrated
time-varying
non-volatile
accessible
collection of data that is used primarily in organizational
decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
What is a Data Warehouse? (Cont.)
Single-tier architecture
The objective of a single-tier architecture is to minimize the amount of data stored by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
A two-tier architecture physically separates the available sources from the data warehouse. This architecture is not expandable and does not support a large number of end users. It also has connectivity problems because of network limitations.
Three-tier architecture
This is the most widely used architecture.
It consists of the Top, Middle, and Bottom tiers.
Generic two-level architecture
[Figure: operational source systems feed, through an ETL process, one company-wide warehouse]
Reliable reporting
Rapid access to data
Integrated data
Flexible presentation of data
Better decision making
Why Separate Data Warehouse?
Performance
Operational databases are designed & tuned for known transactions & workloads.
Complex OLAP queries would degrade performance for operational transactions.
Special data organization, access & implementation methods needed for
multidimensional views & queries.
Function
Missing data: decision support requires historical data, which operational databases do not typically maintain.
After the update (the old record is expired; the new record stays open/current):

ID  Customer ID  Customer Name  Marital Status  Effective Date  Expiration Date
3   1125         Steve          Single          04-04-1999      01-13-2001    <- old record (expired)
8   1125         Steve          Married         01-14-2001      NULL          <- new, open/current record
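A minimal sketch of how a Type 2 change like this could be applied in code, assuming a simple in-memory list of dimension rows; the apply_scd2_update helper and its field names are illustrative, not part of any particular ETL tool.

```python
from datetime import date, timedelta

# Illustrative customer dimension; columns mirror the table above.
dim_customer = [
    {"id": 3, "customer_id": 1125, "name": "Steve", "marital_status": "Single",
     "effective_date": date(1999, 4, 4), "expiration_date": None},
]

def apply_scd2_update(dim, customer_id, new_status, change_date):
    """SCD Type 2: expire the open record and insert a new open/current record."""
    new_id = max(row["id"] for row in dim) + 1  # in practice a surrogate-key sequence assigns this
    for row in dim:
        if row["customer_id"] == customer_id and row["expiration_date"] is None:
            row["expiration_date"] = change_date - timedelta(days=1)  # close the old record
            dim.append({"id": new_id, "customer_id": customer_id, "name": row["name"],
                        "marital_status": new_status, "effective_date": change_date,
                        "expiration_date": None})                     # open the new record
            break

apply_scd2_update(dim_customer, 1125, "Married", date(2001, 1, 14))
for row in dim_customer:
    print(row)
```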
IT personnel need to know data sources and targets at the database, table, and column level.
Example of Star Schema
[Figure: a Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales) at the centre, linked to four dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), and location (location_key, street, city, province_or_state, country)]
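The star schema in the figure can also be written out as table definitions. The following Python/SQLite snippet is an illustrative sketch based on the column names shown above, not the DDL of any specific warehouse product.

```python
import sqlite3

# Star schema: one fact table referencing four denormalized dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, day_of_the_week TEXT,
                       month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,   -- measures
    dollars_sold REAL,
    avg_sales    REAL
);
""")
print([r[0] for r in con.execute("SELECT name FROM sqlite_master WHERE type='table'")])
```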
Example of Snowflake Schema
[Figure: the same Sales Fact Table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales), but with normalized dimensions: item (item_key, item_name, brand, type, supplier_key) points to supplier (supplier_key, supplier_type), and location (location_key, street, city_key) points to city (city_key, city, province_or_state, country); the time and branch dimensions are as in the star schema]
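For comparison, a short illustrative sketch of how the item and location dimensions above are normalized (snowflaked) into supplier and city lookup tables; the same caveats as the star schema snippet apply.

```python
import sqlite3

# Snowflake schema: the item and location dimensions are normalized into lookup tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
                       supplier_key INTEGER REFERENCES supplier(supplier_key));
CREATE TABLE city     (city_key INTEGER PRIMARY KEY, city TEXT,
                       province_or_state TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
""")
```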
Example of Fact Constellation
[Figure: two fact tables sharing dimensions: the Sales Fact Table (time_key, item_key, branch_key, location_key, ...) and a Shipping Fact Table (time_key, item_key, shipper_key, from_location, ...), both linked to the time and item dimension tables shown earlier]
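A fact constellation simply adds a second fact table that shares the existing dimension tables. A brief illustrative sketch follows; since the figure above is truncated, the shipper columns and the to_location field are assumptions.

```python
import sqlite3

# Fact constellation: sales and shipping fact tables share the time and item dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time    (time_key INTEGER PRIMARY KEY);
CREATE TABLE item    (item_key INTEGER PRIMARY KEY);
CREATE TABLE shipper (shipper_key INTEGER PRIMARY KEY, shipper_name TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time(time_key),
    item_key INTEGER REFERENCES item(item_key),
    units_sold INTEGER, dollars_sold REAL
);
CREATE TABLE shipping_fact (
    time_key      INTEGER REFERENCES time(time_key),
    item_key      INTEGER REFERENCES item(item_key),
    shipper_key   INTEGER REFERENCES shipper(shipper_key),
    from_location TEXT, to_location TEXT
);
""")
```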
Types of Facts
Additive
Able to add the facts along all the dimensions
Discrete numerical measures, e.g. retail sales in $
Semi Additive
Snapshot, taken at a point in time
Measures of intensity
Not additive along the time dimension, e.g. account balance, inventory balance
Can be added and divided by the number of time periods to get a time-average (see the sketch below)
Types of Facts (In Continuation)
Non Additive
Cannot be added along any dimension, e.g. ratios and percentages
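A small illustrative Python example of the additive vs. semi-additive distinction: a sales measure can be summed over time, while an account-balance measure is instead averaged over the number of time periods. The numbers are made up.

```python
# Monthly figures (illustrative, made-up numbers).
monthly_sales = [1200.0, 1500.0, 900.0]       # additive: summing across time is meaningful
month_end_balance = [5000.0, 5200.0, 4800.0]  # semi-additive: summing across time is not

total_sales = sum(monthly_sales)                                # 3600.0 - a valid quarterly total
avg_balance = sum(month_end_balance) / len(month_end_balance)   # 5000.0 - time-average instead

print(f"Quarterly sales total: {total_sales}")
print(f"Average balance over the quarter: {avg_balance}")
```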
The abbreviation ETL stands for Extract, Transform, Load; these three database functions are combined into one tool that automates the process of pulling data out of one database and placing it into another.
ETL can consolidate the scattered data of any organization working with different departments, and it handles data coming from different departments very well. ETL can transform not only data from different departments but also data from entirely different sources. For example, an organization may run its business on different environments, such as SAP and Oracle Apps. If higher management wants to make decisions about the business, they need the data integrated and usable for reporting purposes. ETL can take the data from these two source systems, integrate it into a single format, and load it into the target tables.
Why use an ETL Tool?
ETL Tool Function
A typical ETL tool-based data warehouse uses staging, data integration, and access layers to perform its functions. It is normally a three-layer architecture.
Staging Layer – The staging layer or staging database is used to store the data
extracted from different source data systems.
Data Integration Layer – The integration layer transforms the data from the
staging layer and moves the data to a database, where the data is arranged into
hierarchical groups, often called dimensions, and into facts and aggregate facts.
The combination of facts and dimensions tables in a DW system is called a
schema.
Access Layer – The access layer is used by end-users to retrieve the data for
analytical reporting and information.
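A minimal end-to-end sketch of this three-layer flow using Python and an in-memory SQLite database; the table and view names (stg_sales, dim_item, fact_sales, vw_sales_report) are illustrative assumptions, not the conventions of any specific ETL tool.

```python
import sqlite3

# Three layers: staging -> integration (dimensions/facts) -> access view.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Staging layer: raw extracts land here unchanged.
cur.execute("CREATE TABLE stg_sales (sold_on TEXT, item_name TEXT, amount REAL)")
cur.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)",
                [("2024-01-05", "Widget", 19.99), ("2024-01-05", "Gadget", 5.00)])

# Data integration layer: data is arranged into a dimension and a fact table.
cur.execute("CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT UNIQUE)")
cur.execute("CREATE TABLE fact_sales (sold_on TEXT, item_key INTEGER, amount REAL)")
cur.execute("INSERT INTO dim_item (item_name) SELECT DISTINCT item_name FROM stg_sales")
cur.execute("""INSERT INTO fact_sales (sold_on, item_key, amount)
               SELECT s.sold_on, d.item_key, s.amount
               FROM stg_sales s JOIN dim_item d ON d.item_name = s.item_name""")

# Access layer: end users query a reporting view, not the staging tables.
cur.execute("""CREATE VIEW vw_sales_report AS
               SELECT d.item_name, SUM(f.amount) AS dollars_sold
               FROM fact_sales f JOIN dim_item d ON d.item_key = f.item_key
               GROUP BY d.item_name""")
print(cur.execute("SELECT * FROM vw_sales_report").fetchall())
```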
The ETL Process
Extract
Extract relevant data
Transform
Transform data to DW format
Build keys, etc.
Cleansing of data
Load
Load data into DW
Build aggregates, etc.
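A compact sketch of the three steps as plain Python functions, highlighting the key building and cleansing mentioned above; the record layout and the cleansing rules are illustrative assumptions.

```python
# Illustrative extract/transform/load steps over simple in-memory records.
raw_rows = [
    {"customer": " Steve ", "city": "boston", "amount": "19.99"},
    {"customer": "Anna",    "city": "CHICAGO", "amount": "5.00"},
]

def extract():
    """Extract: pull only the relevant rows/columns from the source."""
    return [dict(row) for row in raw_rows]

def transform(rows):
    """Transform: cleanse values and build a surrogate key for each record."""
    out = []
    for key, row in enumerate(rows, start=1):
        out.append({
            "customer_key": key,                      # build keys
            "customer": row["customer"].strip(),      # cleansing: trim whitespace
            "city": row["city"].title(),              # cleansing: normalize case
            "amount": float(row["amount"]),           # convert to the DW data type
        })
    return out

def load(rows):
    """Load: here we just print; a real job would insert into the DW and build aggregates."""
    for row in rows:
        print(row)

load(transform(extract()))
```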
The following illustration shows how the three layers interact with each other.
[Figure: Staging Layer → Data Integration Layer → Access Layer]
Example
For example, a health insurance organization might have information on a customer in
several departments and each department might have that customer's information listed
in a different way. The membership department might list the customer by name,
whereas the claims department might list the customer by number. ETL can bundle all
this data and consolidate it into a uniform presentation, such as for storing in a
database or data warehouse.
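A tiny illustrative Python sketch of that consolidation; the department record layouts and the shared member id are made up for the example.

```python
# Hypothetical records for the same customer held by two departments.
membership = [{"member_id": 1125, "name": "Steve Doe"}]                      # lists by name
claims     = [{"claim_id": 901, "member_number": 1125, "claim_amount": 250.0}]  # lists by number

# Consolidate into one uniform presentation keyed by the member id.
by_member = {m["member_id"]: {"member_id": m["member_id"], "name": m["name"], "claims": []}
             for m in membership}
for c in claims:
    by_member[c["member_number"]]["claims"].append(
        {"claim_id": c["claim_id"], "amount": c["claim_amount"]})

print(list(by_member.values()))
```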
Development of traditional Chinese medicine clinical data
warehouse
Data Extraction
• The integration of all of the disparate systems across the enterprise is the real challenge in getting the data warehouse built
• Data is extracted from heterogeneous data sources
• The majority of data extraction comes from unstructured data sources and different data formats. Unstructured data can be in any form, such as flat files, while structured data can be tables, indexes, and analytics.
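A minimal illustration of extracting records from one kind of flat-file source; the sales CSV content here is a made-up stand-in for a real export.

```python
import csv, io

# Stand-in for a flat file exported by a source system (hypothetical contents).
flat_file = io.StringIO("sold_on,item,amount\n2024-01-05,Widget,19.99\n2024-01-05,Gadget,5.00\n")

# Extract: read the delimited file into plain records headed for the staging area.
extracted = list(csv.DictReader(flat_file))
print(extracted)
```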
Data Staging
Often used as an interim step between data extraction and later steps
Accumulates data from asynchronous sources using native interfaces, flat
files, FTP sessions, or other processes
At a predefined cutoff time, data in the staging file is transformed and
loaded to the warehouse
There is usually no end user access to the staging file
An operational data store may be used for data staging
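A minimal sketch of the predefined-cutoff idea, assuming staged records arrive tagged with a timestamp; the records and the cutoff value are illustrative.

```python
from datetime import datetime

# Records accumulated asynchronously in the staging area, tagged with arrival time.
staged = [
    {"arrived": datetime(2024, 1, 5, 22, 15), "payload": {"item": "Widget", "amount": 19.99}},
    {"arrived": datetime(2024, 1, 6, 1, 30),  "payload": {"item": "Gadget", "amount": 5.00}},
]

cutoff = datetime(2024, 1, 6, 0, 0)  # predefined cutoff time for this load run

# Only data staged before the cutoff is transformed and loaded in this run.
to_load = [r["payload"] for r in staged if r["arrived"] < cutoff]
held_back = [r["payload"] for r in staged if r["arrived"] >= cutoff]

print("Loading this run:", to_load)
print("Remains in staging:", held_back)
```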
Data Transformation
Transforms the data in accordance with the business rules and standards that have
been established
Examples include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
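A few of those transformations shown as a small illustrative Python snippet; the input record, the code table, and the date format are assumptions.

```python
# Illustrative transformations: splitting a field, format change, code replacement, derived value.
status_codes = {"M": "Married", "S": "Single"}           # lookup table for replacement of codes
record = {"full_name": "Steve Doe", "dob": "04/14/1970", "status": "M", "qty": 3, "price": 19.99}

first, last = record["full_name"].split(" ", 1)          # splitting up fields
month, day, year = record["dob"].split("/")              # format change: MM/DD/YYYY -> YYYY-MM-DD

transformed = {
    "first_name": first,
    "last_name": last,
    "dob": f"{year}-{month}-{day}",
    "status": status_codes[record["status"]],                 # replacement of codes
    "line_total": round(record["qty"] * record["price"], 2),  # derived value
}
print(transformed)
```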
Data Loading
Appendix
Business Intelligence
Business Intelligence (BI): In simple terms, it is the accumulation, analysis, reporting, budgeting & presentation of your business data. The goal of using business intelligence is to improve visibility into your organization's operations and financial status so that you can manage the business better.
Companies need to translate data into information to plan future business strategies; data can thus help maximize revenues and reduce costs. A Business Intelligence (BI) solution helps produce accurate reports by extracting data directly from your data source.
Terms and definitions used in ETL testing
ODS: An operational data store (or "ODS") is a database designed to integrate data from multiple sources for additional operations on the data.
Staging: Staging area is an intermediate area that sits between data sources and
data warehouse/data marts systems. Staging areas can be designed to provide
many benefits, but the primary motivations for their use are to increase efficiency
of ETL processes, ensure data integrity, and support data quality operations.
Data Mining: Data mining involves extracting hidden information from data and interpreting it for future predictions.
OLAP and OLTP: OLTP systems provide source data to data warehouses,
whereas OLAP systems help to analyze it.
Terms and definitions used in ETL testing contd..
Slowly Changing Dimensions
Type 0 - The passive method
Type 1 - Overwriting the old value
Type 2 - Creating a new additional record
Type 3 - Adding a new column
Type 4 - Using historical table
Keys in Database :
A Surrogate key has sequence-generated numbers with no meaning. It is meant to
identify the rows uniquely.
A Primary key is used to identify the rows uniquely. It is visible to users and can be
changed as per requirement.
A Foreign Key refers to the primary key of another table.
Terms and definitions used in ETL testing contd..
Data Types: A data type defines what kind of value a column can contain.
Char(n), VarChar(n), Integer(p), Decimal(p,s), TimeStamp, Date
Data Purging : Data purging is a process of deleting data from a data warehouse.
It removes junk data like rows with null values or extra spaces.
Meta Data: Models the organization of data and applications in the different
OLAP components. Meta data describes objects such as tables in OLTP
databases, cubes in data warehouses and data marts, and also records which
applications reference the various pieces of data.
Questions