ETL Training - Day 1

The document provides an overview of key concepts related to data warehousing including: 1. What is a data warehouse and its key characteristics like being subject-oriented, integrated, time-varying and non-volatile collection of data to support decision making. 2. Common data warehouse architectures like single-tier, two-tier and most widely used three-tier architecture. 3. Types of slowly changing dimensions (SCD) like Type 1, 2 and 3 that store and manage historical data over time in the data warehouse.


Day 1 Agenda

What is Data Warehousing?

Data warehousing is a process of transforming data into information and making it available to users in a timely enough manner to make a difference.
Data Warehouse Defined

"A data warehouse is a collection of corporate information, derived directly from operational systems and some external data sources. Its specific purpose is to support business decisions, not business operations."
Data Warehouse

A data warehouse is a
 subject-oriented
 integrated
 time-varying
 non-volatile
 accessible
collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
What is Data Warehouse? (Cont.)

 A data warehouse is a copy of transaction data specifically structured for querying and reporting.
 A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
 A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect.
Data Warehouse Architectures

Single-tier architecture
 The objective of a single tier is to minimize the amount of data stored by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
 A two-tier architecture physically separates the available sources from the data warehouse. This architecture is not expandable, does not support a large number of end-users, and has connectivity problems because of network limitations.
Three-tier architecture
 This is the most widely used architecture.
 It consists of a Top, Middle and Bottom tier.
Generic two-level architecture

(Diagram: source systems feed, via extract/transform/load, one company-wide warehouse.)

Periodic extraction  data is not completely current in the warehouse.


 Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
 Middle Tier: The middle tier is an OLAP server, implemented using either the ROLAP or the MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end-user and the database.
 Top Tier: The top tier is the front-end client layer: the tools and APIs that you connect to in order to get data out of the warehouse. These could be query tools, reporting tools, managed query tools, analysis tools and data mining tools.
Three Tier Architecture
Why Use a Data Warehouse

 Data exploration and discovery
 Integrated and consistent data
 Quality-assured data
 Easily accessible data
 Production and performance awareness
 Access to data in a timely manner
Benefits of a Data Warehouse

 Reliable reporting
 Rapid access to data
 Integrated data
 Flexible presentation of data
 Better decision making
Why Separate Data Warehouse?

 Performance
 Operational databases are designed & tuned for known transactions & workloads.
 Complex OLAP queries would degrade performance for operational transactions.
 Special data organization, access & implementation methods are needed for multidimensional views & queries.

 Function
 Missing data: Decision support requires historical data, which operational databases do not typically maintain.
 Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
 Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
What & How Is Tested?

 Data quality: entails an accurate check on the correctness of the data loaded by ETL procedures and accessed by front-end tools.
 Design quality: implies verifying that user requirements are well expressed by the conceptual and the logical schema.
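Data-quality checks of this kind are usually automated. The sketch below is a minimal, hypothetical example: it uses an in-memory SQLite database to stand in for a source system and a warehouse target, and reconciles row counts plus a simple column checksum between the two. The table and column names (`src_orders`, `dw_orders`, `amount`) are invented for illustration.

```python
import sqlite3

# Hypothetical stand-ins for a source system and the warehouse target.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_orders (id INTEGER, amount REAL);
    CREATE TABLE dw_orders  (id INTEGER, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, 25.5), (3, 4.5);
    INSERT INTO dw_orders  VALUES (1, 10.0), (2, 25.5), (3, 4.5);
""")

def reconcile(con, src, tgt, col):
    """Compare row counts and a column checksum between two tables."""
    src_cnt, src_sum = con.execute(
        f"SELECT COUNT(*), TOTAL({col}) FROM {src}").fetchone()
    tgt_cnt, tgt_sum = con.execute(
        f"SELECT COUNT(*), TOTAL({col}) FROM {tgt}").fetchone()
    return src_cnt == tgt_cnt and abs(src_sum - tgt_sum) < 1e-9

print(reconcile(con, "src_orders", "dw_orders", "amount"))  # True
```

In practice the two queries would run against the actual source and target connections, and the checksum column would be a key business measure.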
Cont.

 Functional test: verifies that the item is compliant with its specified business requirements.
 Usability test: evaluates the item by letting users interact with it, in order to verify that the item is easy to use and comprehensible.
 Performance test: checks that the item's performance is satisfactory under typical workload conditions.
 Stress test: shows how well the item performs with peak loads of data and very heavy workloads.
Cont.

 Recovery test: checks how well an item is able to recover from crashes, hardware failures and other similar problems.
 Security test: checks that the item protects data and maintains functionality as intended.
 Regression test: checks that the item still functions correctly after a change has occurred.
What Is a Slowly Changing Dimension?

 A slowly changing dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse.
 Historical attribute values are retained when the attributes are updated.
 Used when the organization does not want to lose track of what actually happened.
 Example: a customer moves from Connecticut to Seattle.
Type 1 SCD: Does Not Store History

Type 1 overwrites old values.

Old record

ID | Customer ID | Customer Name | Marital Status
3  | 1125        | Steve         | Single

New record

ID | Customer ID | Customer Name | Marital Status
3  | 1125        | Steve         | Married
Overwriting a Record

 As taught earlier, this is referred to as a type 1 slowly changing dimension.
 Implementation is easy.
 History is lost.
 This technique is not recommended.

(Illustration: the John Doe record is simply overwritten with the new Married status.)
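The Type 1 overwrite can be sketched in a few lines of Python. This is an illustrative in-memory model of a dimension table (a list of dicts with invented field names), not a real warehouse load:

```python
def scd_type1_update(dimension, customer_id, **changes):
    """Type 1: overwrite attributes in place; no history is kept."""
    for row in dimension:
        if row["customer_id"] == customer_id:
            row.update(changes)  # old values are lost
    return dimension

dim = [{"id": 3, "customer_id": 1125,
        "name": "Steve", "marital_status": "Single"}]
scd_type1_update(dim, 1125, marital_status="Married")
print(dim[0]["marital_status"])  # Married
```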
Type 2 SCD: Preserves Complete History

Type 2 stores complete change history in a new record.

Before

ID | Customer ID | Customer Name | Marital Status | Effective Date | Expiration Date
3  | 1125        | Steve         | Single         | 04-04-1999     | NULL          <- open/current record

After

ID | Customer ID | Customer Name | Marital Status | Effective Date | Expiration Date
3  | 1125        | Steve         | Single         | 04-04-1999     | 01-13-2001    <- closed record
8  | 1125        | Steve         | Married        | 01-14-2001     | NULL          <- open/current record

Adding a New Record

 This is an example of a type 2 slowly changing dimension.
 History is preserved; dimensions grow.
 Time constraints are required.
 A generalized key is created.
 Metadata tracks the use of keys.

1  | Customer ID | John Doe | Single  | 1-Feb-41 | 31-Dec-95
42 | Customer ID | John Doe | Married | 1-Jan-96 |
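The close-and-insert mechanics of Type 2 can be sketched as follows. This is again an in-memory model with invented field names; here the old record is expired on the day before the new effective date, matching the Steve example, though some designs expire it on the effective date itself:

```python
from datetime import date, timedelta

def scd_type2_update(dimension, customer_id, changes, effective, next_key):
    """Type 2: close the current row and add a new row under a new surrogate key."""
    for row in list(dimension):
        if row["customer_id"] == customer_id and row["expiration"] is None:
            row["expiration"] = effective - timedelta(days=1)  # close old record
            dimension.append({**row, **changes, "id": next_key,
                              "effective": effective, "expiration": None})
            break
    return dimension

dim = [{"id": 3, "customer_id": 1125, "name": "Steve",
        "marital_status": "Single",
        "effective": date(1999, 4, 4), "expiration": None}]
scd_type2_update(dim, 1125, {"marital_status": "Married"},
                 effective=date(2001, 1, 14), next_key=8)
print(len(dim))  # 2
```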
Type 3 SCD: Stores Only the Previous Value

Type 3 stores the current and previous version of a selected attribute.

ID | Customer ID | Customer Name | Marital Status | Previous Marital Status | Effective Date
3  | 1125        | Steve         | Married        | Single                  | 01-14-2001
3  | 1125        | Steve         | Widower        | Married                 | 10-30-2004

Adding a Current Field

 This is an example of a type 3 slowly changing dimension.
 Some history is maintained.
 Intermediate values are lost.
 This method is enhanced by adding an Effective Date field.
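Type 3 reduces to shifting the current value into a "previous" column before overwriting it, which is why intermediate values are lost. A minimal sketch (in-memory row, invented field names):

```python
def scd_type3_update(row, new_status, effective):
    """Type 3: keep exactly one prior value in a 'previous' column."""
    row["previous_marital_status"] = row["marital_status"]  # older history is discarded
    row["marital_status"] = new_status
    row["effective_date"] = effective
    return row

row = {"id": 3, "customer_id": 1125, "name": "Steve",
       "marital_status": "Single", "previous_marital_status": None,
       "effective_date": None}
scd_type3_update(row, "Married", "01-14-2001")
scd_type3_update(row, "Widower", "10-30-2004")
print(row["previous_marital_status"])  # Married -- 'Single' is gone
```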
Day 2
Meta Data

 Data about data
 Needed by both information technology personnel and users
 IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.
 Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information; etc.

Metadata can be classified into the following categories:

 Technical Metadata: contains information about the warehouse that is used by data warehouse designers and administrators.
 Business Metadata: contains detail that gives end-users an easy way to understand the information stored in the data warehouse.
Data Model

A conceptual data model is developed based on the data requirements for the application that is being developed, perhaps in the context of an activity model. The data model will normally consist of entity types, attributes, relationships, integrity rules, and the definitions of those objects.
Types of data models

 Hierarchical database model.


 Relational model.
 Network model.
 Object-oriented database model.
 Entity-relationship model.
 Document model.
 Entity-attribute-value model.
 Star schema.
Conceptual Modeling of Data Warehouses

 Modeling data warehouses: dimensions & measures
 Star schema: a fact table in the middle connected to a set of dimension tables.
 Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
 Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema.
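As a rough illustration, a star schema like the sales example can be declared with plain SQL. The sketch below uses Python's built-in sqlite3 module; the table and column names are modeled on the slide diagrams but are otherwise invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables (denormalized, as in a star schema)
    CREATE TABLE dim_time   (time_key INTEGER PRIMARY KEY, day TEXT,
                             month TEXT, quarter TEXT, year INTEGER);
    CREATE TABLE dim_item   (item_key INTEGER PRIMARY KEY, item_name TEXT,
                             brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                             branch_type TEXT);
    -- Fact table in the middle: foreign keys plus measures
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES dim_time(time_key),
        item_key     INTEGER REFERENCES dim_item(item_key),
        branch_key   INTEGER REFERENCES dim_branch(branch_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );
""")
con.execute("INSERT INTO dim_item VALUES (1, 'Widget', 'Acme', 'gadget', 'retail')")
con.execute("INSERT INTO sales_fact VALUES (NULL, 1, NULL, 5, 49.5)")
row = con.execute("""
    SELECT i.item_name, f.units_sold
    FROM sales_fact f JOIN dim_item i ON f.item_key = i.item_key
""").fetchone()
print(row)  # ('Widget', 5)
```

A snowflake variant would normalize further, e.g. moving supplier attributes out of `dim_item` into their own table keyed by `supplier_key`.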
Example of Star Schema

(Diagram: a central Sales Fact Table with keys time_key, item_key, branch_key, location_key and measures units_sold, dollars_sold, avg_sales, connected to dimension tables time (day, day_of_the_week, month, quarter, year), item (item_name, brand, type, supplier_type), branch (branch_name, branch_type) and location (street, city, province_or_state, country).)
Example of Snowflake Schema

(Diagram: the same Sales Fact Table, but with normalized dimensions: item references a separate supplier table (supplier_key, supplier_type), and location references a separate city table (city_key, city, province_or_state, country).)
Example of Fact Constellation

(Diagram: the Sales Fact Table and a Shipping Fact Table (time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped) share the time, item and location dimension tables; shipping also has its own shipper dimension (shipper_key, shipper_name, location_key, shipper_type).)
Dimensions

 Dimensions determine the contextual background for the facts.
 A dimension is a collection of members or units of the same type of view.
 Dimensions describe the who, what, when, where and why of the facts.
 Dimensions should consist of the following data types:
 1. A surrogate key.
 2. The primary key of the loaded source(s).
 3. Any additional descriptive attributes (columns).
Facts

 A fact is a collection of related data items, consisting of measures and context data.
 Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
 Facts are measured, "continuously valued", rapidly changing information. They can be calculated and/or derived.
Facts (Cont.)

Facts are the key metrics used to measure business results:
 Sales
 Production
 Inventory
Facts can be additive, semi-additive, or non-additive.
A fact table consists of at least two types of data: keys and measures. Keys are usually surrogate keys that link to the dimension tables.
Fact Table

 A table that is used to store business information (measures) that can be used in mathematical equations:
 Quantities
 Percentages
 Prices
Types of Facts

Additive
 Able to add the facts along all the dimensions
 Discrete numerical measures, e.g. retail sales in $
Semi-additive
 Snapshot, taken at a point in time
 Measures of intensity
 Not additive along the time dimension, e.g. account balance, inventory balance
 Added and divided by the number of time periods to get a time-average
Types of Facts (Cont.)

 Non-additive
 Numeric measures that cannot be added across any dimension
 Intensity measures averaged across all dimensions, e.g. room temperature
 Textual facts - AVOID THEM
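The additivity distinction comes down to which aggregations are meaningful. A trivial illustration with hypothetical numbers: monthly sales (additive) may be summed over time, while end-of-month account balances (semi-additive) are averaged over the number of periods instead:

```python
sales    = [100, 150, 120]   # per-month retail sales (additive)
balances = [500, 700, 600]   # end-of-month account balances (semi-additive)

total_sales = sum(sales)                      # 370: a meaningful total
avg_balance = sum(balances) / len(balances)   # 600.0: time-average, NOT a sum
print(total_sales, avg_balance)  # 370 600.0
```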
Summary

 A data warehouse is an information system that contains historical and cumulative data from single or multiple sources.
 A data warehouse is subject-oriented as it offers information regarding a subject instead of the organization's ongoing operations.
 In a data warehouse, integration means the establishment of a common unit of measure for all similar data from the different databases.
 A data warehouse is also non-volatile, meaning previous data is not erased when new data is entered.
 A data warehouse is time-variant as the data in a DW has a long shelf life.
Summary cont…..

 There are 5 main components of a data warehouse: 1) Database 2) ETL tools 3) Metadata 4) Query tools 5) Data marts
 There are four main categories of query tools: 1. Query and reporting tools 2. Application development tools 3. Data mining tools 4. OLAP tools
 The data sourcing, transformation, and migration tools are used for performing all the conversions and summarizations.
 In the data warehouse architecture, metadata plays an important role as it specifies the source, usage, values, and features of data warehouse data.
ETL Overview

 Extraction, Transformation, Loading – ETL
 Gets data out of the source and loads it into the data warehouse – simply a process of copying data from one database to another
 Data is extracted from an OLTP database, transformed to match the data warehouse schema and loaded into the data warehouse database
 Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading
 When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation
ETL Overview

 ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers
 It is not a one-time event, as new data is added to the data warehouse periodically – monthly, daily, hourly
 Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be:
 Automated
 Well documented
 Easily changeable
What is ETL Testing

ETL stands for Extract, Transform, Load; these three database functions are combined into one tool that automates the process of pulling data out of one database and placing it into another.
ETL can consolidate scattered data for any organization working with different departments. It can handle data coming not only from different departments but also from different sources altogether. For example, an organization may run its business on different environments, such as SAP and Oracle Apps. If higher management wants to make decisions about the business, they want the data integrated and usable for their reporting purposes. ETL can take the data from these two source systems, integrate it into a single format, and load it into the warehouse tables.
Why use an ETL Tool?

Use an ETL tool to:
 Simplify the process of migrating data
 Standardize the method of data migration
 Store all data transformation logic/rules as metadata
 Enable users, managers and architects to understand, review, and modify the various interfaces
 Reduce the cost and effort associated with building interfaces
ETL Tool Function

 A typical ETL tool-based data warehouse uses staging area, data integration, and
access layers to perform its functions. It’s normally a 3-layer architecture.
 Staging Layer – The staging layer or staging database is used to store the data
extracted from different source data systems.
 Data Integration Layer – The integration layer transforms the data from the
staging layer and moves the data to a database, where the data is arranged into
hierarchical groups, often called dimensions, and into facts and aggregate facts.
The combination of facts and dimensions tables in a DW system is called a
schema.
 Access Layer – The access layer is used by end-users to retrieve the data for
analytical reporting and information.
The ETL Process

 Extract
 Extract relevant data
 Transform
 Transform data to DW format
 Build keys, etc.
 Cleansing of data
 Load
 Load data into DW
 Build aggregates, etc.
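A toy end-to-end version of these three steps might look like the sketch below. Everything here is hypothetical: the CSV snippet stands in for a source extract, the field names are invented, and a plain list stands in for the warehouse table. The transform step strips whitespace, drops a duplicate row, and cleanses a missing value:

```python
import csv
import io

RAW = "id,name,amount\n1, Alice ,10\n2,Bob,\n1, Alice ,10\n"

def extract(text):
    """Extract: read the relevant rows from the (simulated) source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cleanse values, fix types, and deduplicate."""
    seen, out = set(), []
    for r in rows:
        key = (r["id"], r["name"].strip())
        if key in seen:                      # drop duplicate rows
            continue
        seen.add(key)
        out.append({"id": int(r["id"]),
                    "name": r["name"].strip(),
                    "amount": float(r["amount"] or 0)})  # cleanse missing value
    return out

warehouse = []                               # stand-in for the DW table

def load(rows):
    """Load: append the transformed rows to the warehouse."""
    warehouse.extend(rows)

load(transform(extract(RAW)))
print(warehouse)
```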

(Illustration: how the staging, data integration, and access layers interact.)
Example
For example, a health insurance organization might have information on a customer in
several departments and each department might have that customer's information listed
in a different way. The membership department might list the customer by name,
whereas the claims department might list the customer by number. ETL can bundle all
this data and consolidate it into a uniform presentation, such as for storing in a
database or data warehouse.

Development of traditional Chinese medicine clinical data
warehouse
Data Extraction

• The integration of all the disparate systems across the enterprise is the real challenge in getting the data warehouse built
• Data is extracted from heterogeneous data sources
• The majority of data extraction comes from unstructured data sources and different data formats. Unstructured data can be in any form, such as flat files, while structured data can be tables, indexes, and analytics.
Data Staging
 Often used as an interim step between data extraction and later steps
 Accumulates data from asynchronous sources using native interfaces, flat
files, FTP sessions, or other processes
 At a predefined cutoff time, data in the staging file is transformed and
loaded to the warehouse
 There is usually no end user access to the staging file
 An operational data store may be used for data staging

Data Transformation

 Transforms the data in accordance with the business rules and standards that have been established
 Examples include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates

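Several of the transformation examples above (a field split, a code replacement, a format change, and a derived value) can be shown on a single record. The field names, code map, and values below are all invented for illustration:

```python
CODE_MAP = {"M": "Married", "S": "Single"}          # replacement of codes

def transform_row(row):
    first, last = row["full_name"].split(" ", 1)     # splitting up a field
    return {
        "first_name": first,
        "last_name": last,
        "marital_status": CODE_MAP.get(row["status"], "Unknown"),
        "order_date": row["date"].replace("/", "-"), # format change
        "total": row["qty"] * row["unit_price"],     # derived value
    }

result = transform_row({"full_name": "John Doe", "status": "M",
                        "date": "2024/01/15", "qty": 3, "unit_price": 9.5})
print(result["total"])  # 28.5
```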
Data Loading

 Data is physically moved to the data warehouse
 The loading takes place within a "load window"
 The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications

Appendix
Business Intelligence
 Business Intelligence (BI): in simple terms, it is the accumulation, analysis, reporting, budgeting & presentation of your business data. The goal of utilizing business intelligence is to improve visibility of your organizational operations and financial status, so as to better manage your business.
 Companies need to translate data into information to plan future business strategies. Data can help maximize revenues and reduce costs. A Business Intelligence (BI) solution helps produce accurate reports by extracting data directly from your data source.
Terms and definition used in ETL testing

 Data Warehouse: a database that is designed for query and analysis rather than for transaction processing. The data warehouse is constructed by integrating data from multiple heterogeneous sources.
 Database Schema: a logical description of the entire database.
 Star schema, snowflake schema, fact constellation
 Fact and dimension tables
 Data Marts: a data mart is a simple form of data warehouse that is focused on a single subject (or functional area), such as Sales, Finance or Marketing.
 Dependent and independent data marts
Terms and definition used in ETL testing contd..

 ODS :An operational data store (or "ODS") is a database designed to integrate
data from multiple sources for additional operations on the data.
 Staging: Staging area is an intermediate area that sits between data sources and
data warehouse/data marts systems. Staging areas can be designed to provide
many benefits, but the primary motivations for their use are to increase efficiency
of ETL processes, ensure data integrity, and support data quality operations.
 Data Mining: data mining involves extracting hidden information from data and interpreting it for future predictions.
 OLAP and OLTP: OLTP systems provide source data to data warehouses,
whereas OLAP systems help to analyze it.
Terms and definition used in ETL testing contd..
 Slowly Changing Dimensions
 Type 0 - The passive method
 Type 1 - Overwriting the old value
 Type 2 - Creating a new additional record
 Type 3 - Adding a new column
 Type 4 - Using historical table
 Keys in Database :
 A Surrogate key has sequence-generated numbers with no meaning. It is meant to
identify the rows uniquely.
 A Primary key is used to identify the rows uniquely. It is visible to users and can be
changed as per requirement.
 A Foreign Key references the primary key of another table.
Terms and definition used in ETL testing contd..

 Data Types: A data type defines what kind of value a column can contain.
Char(n), VarChar(n), Integer(p), Decimal(p,s), TimeStamp, Date
 Data Purging: a process of deleting data from a data warehouse. It removes junk data like rows with null values or extra spaces.
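A minimal purge routine might look like the sketch below, assuming rows are dicts and "junk" means a null or whitespace-only value in a required field (the row data and field names are invented):

```python
def purge(rows, required):
    """Drop junk rows (null or whitespace-only required fields); trim the rest."""
    clean = []
    for r in rows:
        vals = [r.get(c) for c in required]
        if any(v is None or (isinstance(v, str) and not v.strip()) for v in vals):
            continue                         # junk row: discard
        clean.append({k: v.strip() if isinstance(v, str) else v
                      for k, v in r.items()})  # remove extra spaces
    return clean

rows = [{"id": 1, "name": " Ann "},
        {"id": 2, "name": None},
        {"id": 3, "name": "   "}]
print(purge(rows, ["id", "name"]))  # [{'id': 1, 'name': 'Ann'}]
```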
 Meta Data: models the organization of data and applications in the different OLAP components. Metadata describes objects such as tables in OLTP databases, cubes in data warehouses and data marts, and also records which applications reference the various pieces of data.
 Questions
