ETL Training - Day 1

The document provides an overview of key concepts related to data warehousing including: 1. What is a data warehouse and its key characteristics like being subject-oriented, integrated, time-varying and non-volatile collection of data to support decision making. 2. Common data warehouse architectures like single-tier, two-tier and most widely used three-tier architecture. 3. Types of slowly changing dimensions (SCD) like Type 1, 2 and 3 that store and manage historical data over time in the data warehouse.


Day 1 Agenda

What is Data Warehousing?

Data warehousing is a process of transforming data into information and making it available to users in a timely enough manner to make a difference.
Data Warehouse Defined

"A data warehouse is a collection of corporate information, derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations."
Data Warehouse

A data warehouse is a
 subject-oriented
 integrated
 time-varying
 non-volatile
 accessible
collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
What is Data Warehouse? (Cont.)

 A data warehouse is a copy of transaction data specifically structured for querying and reporting.
 A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
 A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect.
Data Warehouse Architectures

Single-tier architecture
 The objective of a single tier is to minimize the amount of data stored by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
 A two-tier architecture physically separates the available sources from the data warehouse. This architecture is not expandable, does not support a large number of end-users, and has connectivity problems because of network limitations.
Three-tier architecture
 This is the most widely used architecture.
 It consists of a Top, Middle and Bottom tier.
Generic two-level architecture

(Diagram: source systems feed, via extract/transform/load, one company-wide warehouse.)

Periodic extraction  data is not completely current in the warehouse.


 Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
 Middle Tier: The middle tier is an OLAP server, implemented using either the ROLAP or the MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end-user and the database.
 Top Tier: The top tier is the front-end client layer: the tools and APIs that you connect to in order to get data out of the warehouse. These could be query tools, reporting tools, managed query tools, analysis tools and data mining tools.
Three Tier Architecture
Why Use a Data Warehouse

 Data exploration and discovery
 Integrated and consistent data
 Quality-assured data
 Easily accessible data
 Production and performance awareness
 Access to data in a timely manner
Benefits of a Data Warehouse

 Reliable reporting
 Rapid access to data
 Integrated data
 Flexible presentation of data
 Better decision making
Why Separate Data Warehouse?

 Performance
 Operational databases are designed & tuned for known transactions & workloads.
 Complex OLAP queries would degrade performance for operational transactions.
 Special data organization, access & implementation methods are needed for multidimensional views & queries.

 Function
 Missing data: Decision support requires historical data, which operational databases do not typically maintain.
 Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
 Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
What & How Is Tested?

 Data quality: entails an accurate check on the correctness of the data loaded by ETL procedures and accessed by front-end tools.
 Design quality: implies verifying that user requirements are well expressed by the conceptual and the logical schema.
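Data-quality checks of this kind are usually automated. The sketch below is a minimal, hypothetical example: it uses an in-memory SQLite database to stand in for a source system and a warehouse target, and reconciles row counts plus a simple column checksum between the two. The table and column names (`src_orders`, `dw_orders`, `amount`) are invented for illustration.

```python
import sqlite3

# Hypothetical stand-ins for a source system and the warehouse target.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_orders (id INTEGER, amount REAL);
    CREATE TABLE dw_orders  (id INTEGER, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, 25.5), (3, 4.5);
    INSERT INTO dw_orders  VALUES (1, 10.0), (2, 25.5), (3, 4.5);
""")

def reconcile(con, src, tgt, col):
    """Compare row counts and a column checksum between two tables."""
    src_cnt, src_sum = con.execute(
        f"SELECT COUNT(*), TOTAL({col}) FROM {src}").fetchone()
    tgt_cnt, tgt_sum = con.execute(
        f"SELECT COUNT(*), TOTAL({col}) FROM {tgt}").fetchone()
    return src_cnt == tgt_cnt and abs(src_sum - tgt_sum) < 1e-9

print(reconcile(con, "src_orders", "dw_orders", "amount"))  # True
```

In practice the two queries would run against the actual source and target connections, and the checksum column would be a key business measure.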
Cont.

 Functional test: verifies that the item is compliant with its specified business requirements.
 Usability test: evaluates the item by letting users interact with it, in order to verify that the item is easy to use and comprehensible.
 Performance test: checks that the item's performance is satisfactory under typical workload conditions.
 Stress test: shows how well the item performs with peak loads of data and very heavy workloads.
Cont.

 Recovery test: checks how well an item is able to recover from crashes, hardware failures and other similar problems.
 Security test: checks that the item protects data and maintains functionality as intended.
 Regression test: checks that the item still functions correctly after a change has occurred.
What Is a Slowly Changing Dimension?

 A slowly changing dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse.
 Historical attribute values are retained when the attributes are updated.
 Used when the organization does not want to lose track of what actually happened.
 Example: a customer moves from Connecticut to Seattle.
Type 1 SCD: Does Not Store History

Type 1 overwrites old values.

Old record

ID | Customer ID | Customer Name | Marital Status
3  | 1125        | Steve         | Single

New record

ID | Customer ID | Customer Name | Marital Status
3  | 1125        | Steve         | Married
Overwriting a Record

 As taught earlier, this is referred to as a type 1 slowly changing dimension.
 Implementation is easy.
 History is lost.
 This technique is not recommended.

(Illustration: the John Doe record is simply overwritten with the new Married status.)
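The Type 1 overwrite can be sketched in a few lines of Python. This is an illustrative in-memory model of a dimension table (a list of dicts with invented field names), not a real warehouse load:

```python
def scd_type1_update(dimension, customer_id, **changes):
    """Type 1: overwrite attributes in place; no history is kept."""
    for row in dimension:
        if row["customer_id"] == customer_id:
            row.update(changes)  # old values are lost
    return dimension

dim = [{"id": 3, "customer_id": 1125,
        "name": "Steve", "marital_status": "Single"}]
scd_type1_update(dim, 1125, marital_status="Married")
print(dim[0]["marital_status"])  # Married
```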
Type 2 SCD: Preserves Complete History

Type 2 stores complete change history in a new record.

Before

ID | Customer ID | Customer Name | Marital Status | Effective Date | Expiration Date
3  | 1125        | Steve         | Single         | 04-04-1999     | NULL          <- open/current record

After

ID | Customer ID | Customer Name | Marital Status | Effective Date | Expiration Date
3  | 1125        | Steve         | Single         | 04-04-1999     | 01-13-2001    <- closed record
8  | 1125        | Steve         | Married        | 01-14-2001     | NULL          <- open/current record

Adding a New Record

 This is an example of a type 2 slowly changing dimension.
 History is preserved; dimensions grow.
 Time constraints are required.
 A generalized key is created.
 Metadata tracks the use of keys.

1  | Customer ID | John Doe | Single  | 1-Feb-41 | 31-Dec-95
42 | Customer ID | John Doe | Married | 1-Jan-96 |
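The close-and-insert mechanics of Type 2 can be sketched as follows. This is again an in-memory model with invented field names; here the old record is expired on the day before the new effective date, matching the Steve example, though some designs expire it on the effective date itself:

```python
from datetime import date, timedelta

def scd_type2_update(dimension, customer_id, changes, effective, next_key):
    """Type 2: close the current row and add a new row under a new surrogate key."""
    for row in list(dimension):
        if row["customer_id"] == customer_id and row["expiration"] is None:
            row["expiration"] = effective - timedelta(days=1)  # close old record
            dimension.append({**row, **changes, "id": next_key,
                              "effective": effective, "expiration": None})
            break
    return dimension

dim = [{"id": 3, "customer_id": 1125, "name": "Steve",
        "marital_status": "Single",
        "effective": date(1999, 4, 4), "expiration": None}]
scd_type2_update(dim, 1125, {"marital_status": "Married"},
                 effective=date(2001, 1, 14), next_key=8)
print(len(dim))  # 2
```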
Type 3 SCD: Stores Only the Previous Value

Type 3 stores the current and previous version of a selected attribute.

ID | Customer ID | Customer Name | Marital Status | Previous Marital Status | Effective Date
3  | 1125        | Steve         | Married        | Single                  | 01-14-2001
3  | 1125        | Steve         | Widower        | Married                 | 10-30-2004

Adding a Current Field

 This is an example of a type 3 slowly changing dimension.
 Some history is maintained.
 Intermediate values are lost.
 This method is enhanced by adding an Effective Date field.
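Type 3 reduces to shifting the current value into a "previous" column before overwriting it, which is why intermediate values are lost. A minimal sketch (in-memory row, invented field names):

```python
def scd_type3_update(row, new_status, effective):
    """Type 3: keep exactly one prior value in a 'previous' column."""
    row["previous_marital_status"] = row["marital_status"]  # older history is discarded
    row["marital_status"] = new_status
    row["effective_date"] = effective
    return row

row = {"id": 3, "customer_id": 1125, "name": "Steve",
       "marital_status": "Single", "previous_marital_status": None,
       "effective_date": None}
scd_type3_update(row, "Married", "01-14-2001")
scd_type3_update(row, "Widower", "10-30-2004")
print(row["previous_marital_status"])  # Married -- 'Single' is gone
```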
Day 2
Meta Data

 Data about data
 Needed by both information technology personnel and users
 IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.
 Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information; etc.

Metadata can be classified into the following categories:

 Technical Metadata: contains information about the warehouse that is used by data warehouse designers and administrators.
 Business Metadata: contains detail that gives end-users an easy way to understand the information stored in the data warehouse.
Data Model

A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an activity model. The data model will normally consist of entity types, attributes, relationships, integrity rules, and the definitions of those objects.
Types of data models

 Hierarchical database model.


 Relational model.
 Network model.
 Object-oriented database model.
 Entity-relationship model.
 Document model.
 Entity-attribute-value model.
 Star schema.
Conceptual Modeling of Data Warehouses

 Modeling data warehouses: dimensions & measures
 Star schema: a fact table in the middle connected to a set of dimension tables.
 Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
 Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema.
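As a rough illustration, a star schema like the sales example can be declared with plain SQL. The sketch below uses Python's built-in sqlite3 module; the table and column names are modeled on the slide diagrams but are otherwise invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables (denormalized, as in a star schema)
    CREATE TABLE dim_time   (time_key INTEGER PRIMARY KEY, day TEXT,
                             month TEXT, quarter TEXT, year INTEGER);
    CREATE TABLE dim_item   (item_key INTEGER PRIMARY KEY, item_name TEXT,
                             brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                             branch_type TEXT);
    -- Fact table in the middle: foreign keys plus measures
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES dim_time(time_key),
        item_key     INTEGER REFERENCES dim_item(item_key),
        branch_key   INTEGER REFERENCES dim_branch(branch_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );
""")
con.execute("INSERT INTO dim_item VALUES (1, 'Widget', 'Acme', 'gadget', 'retail')")
con.execute("INSERT INTO sales_fact VALUES (NULL, 1, NULL, 5, 49.5)")
row = con.execute("""
    SELECT i.item_name, f.units_sold
    FROM sales_fact f JOIN dim_item i ON f.item_key = i.item_key
""").fetchone()
print(row)  # ('Widget', 5)
```

A snowflake variant would normalize further, e.g. moving supplier attributes out of `dim_item` into their own table keyed by `supplier_key`.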
Example of Star Schema

(Diagram: a central Sales Fact Table with keys time_key, item_key, branch_key, location_key and measures units_sold, dollars_sold, avg_sales, connected to dimension tables time (day, day_of_the_week, month, quarter, year), item (item_name, brand, type, supplier_type), branch (branch_name, branch_type) and location (street, city, province_or_state, country).)
Example of Snowflake Schema

(Diagram: the same Sales Fact Table, but with normalized dimensions: item references a separate supplier table (supplier_key, supplier_type), and location references a separate city table (city_key, city, province_or_state, country).)
Example of Fact Constellation

(Diagram: the Sales Fact Table and a Shipping Fact Table (time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped) share the time, item and location dimension tables; shipping also has its own shipper dimension (shipper_key, shipper_name, location_key, shipper_type).)
Dimensions

 Dimensions determine the contextual background for the facts.
 A dimension is a collection of members or units of the same type of view.
 Dimensions describe the who, what, when, where and why of the facts.
 Dimensions should consist of the following data types:
 1. A surrogate key.
 2. The primary key of the loaded source(s).
 3. Any additional descriptive attributes (columns).
Facts

 A fact is a collection of related data items, consisting of measures and context data.
 Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
 Facts are measured, "continuously valued", rapidly changing information. They can be calculated and/or derived.
Facts (Cont.)

Facts are the key metrics used to measure business results:
 Sales
 Production
 Inventory
Facts can be additive, semi-additive, or non-additive.
A fact table consists of at least two types of data: keys and measures. Keys are usually surrogate keys that link to the dimension tables.
Fact Table

 A table that is used to store business information (measures) that can be used in mathematical equations:
 Quantities
 Percentages
 Prices
Types of Facts

Additive
 Able to add the facts along all the dimensions
 Discrete numerical measures, e.g. retail sales in $
Semi-additive
 Snapshot, taken at a point in time
 Measures of intensity
 Not additive along the time dimension, e.g. account balance, inventory balance
 Added and divided by the number of time periods to get a time-average
Types of Facts (Cont.)

 Non-additive
 Numeric measures that cannot be added across any dimension
 Intensity measures averaged across all dimensions, e.g. room temperature
 Textual facts - AVOID THEM
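The additivity distinction comes down to which aggregations are meaningful. A trivial illustration with hypothetical numbers: monthly sales (additive) may be summed over time, while end-of-month account balances (semi-additive) are averaged over the number of periods instead:

```python
sales    = [100, 150, 120]   # per-month retail sales (additive)
balances = [500, 700, 600]   # end-of-month account balances (semi-additive)

total_sales = sum(sales)                      # 370: a meaningful total
avg_balance = sum(balances) / len(balances)   # 600.0: time-average, NOT a sum
print(total_sales, avg_balance)  # 370 600.0
```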
Summary

 A data warehouse is an information system that contains historical and cumulative data from single or multiple sources.
 A data warehouse is subject-oriented as it offers information regarding a subject instead of the organization's ongoing operations.
 In a data warehouse, integration means the establishment of a common unit of measure for all similar data from the different databases.
 A data warehouse is also non-volatile, meaning previous data is not erased when new data is entered.
 A data warehouse is time-variant as the data in a DW has a long shelf life.
Summary cont…..

 There are 5 main components of a data warehouse: 1) Database 2) ETL tools 3) Metadata 4) Query tools 5) Data marts
 There are four main categories of query tools: 1. Query and reporting tools 2. Application development tools 3. Data mining tools 4. OLAP tools
 The data sourcing, transformation, and migration tools are used for performing all the conversions and summarizations.
 In the data warehouse architecture, metadata plays an important role as it specifies the source, usage, values, and features of data warehouse data.
ETL Overview

 Extraction, Transformation, Loading – ETL
 Gets data out of the source and loads it into the data warehouse – simply a process of copying data from one database to another
 Data is extracted from an OLTP database, transformed to match the data warehouse schema and loaded into the data warehouse database
 Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading
 When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation
ETL Overview

 ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers
 It is not a one-time event, as new data is added to the data warehouse periodically – monthly, daily, hourly
 Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be:
 Automated
 Well documented
 Easily changeable
What is ETL Testing

ETL stands for Extract, Transform, Load; these three database functions are combined into one tool that automates the process of pulling data out of one database and placing it into another.
ETL can consolidate scattered data for any organization working with different departments. It can handle data coming not only from different departments but also from different sources altogether. For example, an organization may run its business on different environments, such as SAP and Oracle Apps. If higher management wants to make decisions about the business, they want the data integrated and usable for their reporting purposes. ETL can take the data from these two source systems, integrate it into a single format, and load it into the warehouse tables.
Why use an ETL Tool?

Use an ETL tool to:
 Simplify the process of migrating data
 Standardize the method of data migration
 Store all data transformation logic/rules as metadata
 Enable users, managers and architects to understand, review, and modify the various interfaces
 Reduce the cost and effort associated with building interfaces
ETL Tool Function

 A typical ETL tool-based data warehouse uses staging area, data integration, and
access layers to perform its functions. It’s normally a 3-layer architecture.
 Staging Layer – The staging layer or staging database is used to store the data
extracted from different source data systems.
 Data Integration Layer – The integration layer transforms the data from the
staging layer and moves the data to a database, where the data is arranged into
hierarchical groups, often called dimensions, and into facts and aggregate facts.
The combination of facts and dimensions tables in a DW system is called a
schema.
 Access Layer – The access layer is used by end-users to retrieve the data for
analytical reporting and information.
The ETL Process

 Extract
 Extract relevant data
 Transform
 Transform data to DW format
 Build keys, etc.
 Cleansing of data
 Load
 Load data into DW
 Build aggregates, etc.
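A toy end-to-end version of these three steps might look like the sketch below. Everything here is hypothetical: the CSV snippet stands in for a source extract, the field names are invented, and a plain list stands in for the warehouse table. The transform step strips whitespace, drops a duplicate row, and cleanses a missing value:

```python
import csv
import io

RAW = "id,name,amount\n1, Alice ,10\n2,Bob,\n1, Alice ,10\n"

def extract(text):
    """Extract: read the relevant rows from the (simulated) source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cleanse values, fix types, and deduplicate."""
    seen, out = set(), []
    for r in rows:
        key = (r["id"], r["name"].strip())
        if key in seen:                      # drop duplicate rows
            continue
        seen.add(key)
        out.append({"id": int(r["id"]),
                    "name": r["name"].strip(),
                    "amount": float(r["amount"] or 0)})  # cleanse missing value
    return out

warehouse = []                               # stand-in for the DW table

def load(rows):
    """Load: append the transformed rows to the warehouse."""
    warehouse.extend(rows)

load(transform(extract(RAW)))
print(warehouse)
```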

(Illustration: how the staging, data integration, and access layers interact.)
Example
For example, a health insurance organization might have information on a customer in
several departments and each department might have that customer's information listed
in a different way. The membership department might list the customer by name,
whereas the claims department might list the customer by number. ETL can bundle all
this data and consolidate it into a uniform presentation, such as for storing in a
database or data warehouse.

Development of traditional Chinese medicine clinical data
warehouse
Data Extraction

• The integration of all the disparate systems across the enterprise is the real challenge in getting the data warehouse built
• Data is extracted from heterogeneous data sources
• The majority of data extraction comes from unstructured data sources and different data formats. Unstructured data can be in any form, such as flat files, while structured data can be tables, indexes, and analytics.
Data Staging
 Often used as an interim step between data extraction and later steps
 Accumulates data from asynchronous sources using native interfaces, flat
files, FTP sessions, or other processes
 At a predefined cutoff time, data in the staging file is transformed and
loaded to the warehouse
 There is usually no end user access to the staging file
 An operational data store may be used for data staging

Data Transformation

 Transforms the data in accordance with the business rules and standards that have been established
 Examples include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates

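Several of the transformation examples above (a field split, a code replacement, a format change, and a derived value) can be shown on a single record. The field names, code map, and values below are all invented for illustration:

```python
CODE_MAP = {"M": "Married", "S": "Single"}          # replacement of codes

def transform_row(row):
    first, last = row["full_name"].split(" ", 1)     # splitting up a field
    return {
        "first_name": first,
        "last_name": last,
        "marital_status": CODE_MAP.get(row["status"], "Unknown"),
        "order_date": row["date"].replace("/", "-"), # format change
        "total": row["qty"] * row["unit_price"],     # derived value
    }

result = transform_row({"full_name": "John Doe", "status": "M",
                        "date": "2024/01/15", "qty": 3, "unit_price": 9.5})
print(result["total"])  # 28.5
```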
Data Loading

 Data is physically moved to the data warehouse
 The loading takes place within a "load window"
 The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications

Appendix
Business Intelligence
 Business Intelligence (BI): in simple terms, it is the accumulation, analysis, reporting, budgeting & presentation of your business data. The goal of utilizing business intelligence is to improve visibility of your organizational operations and financial status, so as to better manage your business.
 Companies need to translate data into information to plan future business strategies. Data can help maximize revenues and reduce costs. A Business Intelligence (BI) solution helps produce accurate reports by extracting data directly from your data source.
Terms and definition used in ETL testing

 Data Warehouse: a database that is designed for query and analysis rather than for transaction processing. The data warehouse is constructed by integrating data from multiple heterogeneous sources.
 Database Schema: a logical description of the entire database.
 Star schema, snowflake schema, fact constellation
 Fact and dimension tables
 Data Marts: a data mart is a simple form of data warehouse that is focused on a single subject (or functional area), such as Sales, Finance or Marketing.
 Dependent and independent data marts
Terms and definition used in ETL testing contd..

 ODS :An operational data store (or "ODS") is a database designed to integrate
data from multiple sources for additional operations on the data.
 Staging: Staging area is an intermediate area that sits between data sources and
data warehouse/data marts systems. Staging areas can be designed to provide
many benefits, but the primary motivations for their use are to increase efficiency
of ETL processes, ensure data integrity, and support data quality operations.
 Data Mining: data mining involves extracting hidden information from data and interpreting it for future predictions.
 OLAP and OLTP: OLTP systems provide source data to data warehouses,
whereas OLAP systems help to analyze it.
Terms and definition used in ETL testing contd..
 Slowly Changing Dimensions
 Type 0 - The passive method
 Type 1 - Overwriting the old value
 Type 2 - Creating a new additional record
 Type 3 - Adding a new column
 Type 4 - Using historical table
 Keys in Database :
 A Surrogate key has sequence-generated numbers with no meaning. It is meant to
identify the rows uniquely.
 A Primary key is used to identify the rows uniquely. It is visible to users and can be
changed as per requirement.
 A Foreign Key references the primary key of another table.
Terms and definition used in ETL testing contd..

 Data Types: A data type defines what kind of value a column can contain.
Char(n), VarChar(n), Integer(p), Decimal(p,s), TimeStamp, Date
 Data Purging: a process of deleting data from a data warehouse. It removes junk data like rows with null values or extra spaces.
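A minimal purge routine might look like the sketch below, assuming rows are dicts and "junk" means a null or whitespace-only value in a required field (the row data and field names are invented):

```python
def purge(rows, required):
    """Drop junk rows (null or whitespace-only required fields); trim the rest."""
    clean = []
    for r in rows:
        vals = [r.get(c) for c in required]
        if any(v is None or (isinstance(v, str) and not v.strip()) for v in vals):
            continue                         # junk row: discard
        clean.append({k: v.strip() if isinstance(v, str) else v
                      for k, v in r.items()})  # remove extra spaces
    return clean

rows = [{"id": 1, "name": " Ann "},
        {"id": 2, "name": None},
        {"id": 3, "name": "   "}]
print(purge(rows, ["id", "name"]))  # [{'id': 1, 'name': 'Ann'}]
```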
 Meta Data: models the organization of data and applications in the different OLAP components. Metadata describes objects such as tables in OLTP databases, cubes in data warehouses and data marts, and also records which applications reference the various pieces of data.
 Questions
