
Unit 1

A Data Warehouse (DWH) is a centralized system for storing and analyzing large volumes of structured data from various sources, designed to enhance reporting and decision-making. It integrates data for improved query performance, historical analysis, and data consistency, making it essential for businesses like retail companies to analyze trends and customer behavior. The architecture of a DWH typically consists of three layers: data source, data storage, and presentation, with various schemas and ETL processes to ensure high-quality data for analysis.

Uploaded by

mmanojm005
Copyright
© All Rights Reserved

Unit -1

What is a Data Warehouse?


A Data Warehouse (DWH) is a centralized system used for storing, managing, and analysing
large volumes of structured data from multiple sources. It is designed for efficient reporting,
analytics, and decision-making rather than day-to-day operations.

Why is a Data Warehouse Needed?

Centralized Data Storage – Integrates data from multiple sources (e.g., sales, marketing,
finance).

Faster Query Performance – Optimized for complex analytical queries.

Historical Data Analysis – Stores past data for trends and forecasting.

Improved Decision-Making – Helps businesses make data-driven decisions.

Data Consistency & Accuracy – Standardizes and cleanses data from various sources.

Scenario: A Retail Company

A retail company has data from:

Sales Database – Daily sales transactions.

Marketing Database – Customer engagement and promotions.

Inventory System – Stock levels and restocking needs.

Without a Data Warehouse, the company must pull reports separately from each system,
which is time-consuming and inconsistent.

With a Data Warehouse, all data is combined into a single repository. The management can
now:

Analyse monthly sales trends

Predict which products will be in demand

Identify customer buying patterns

Thus, the Data Warehouse helps in better planning and decision-making using historical and
integrated data.
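As a rough illustration of the scenario above, the sketch below merges extracts from two of the source systems into one structure and computes a monthly sales trend from it. The record layouts, product names, and figures are invented for illustration; a real warehouse would load this data from the source systems through an ETL process.

```python
from collections import defaultdict

# Hypothetical extracts from two separate source systems.
sales_db = [
    {"date": "2024-01-15", "product": "Shoes", "amount": 120.0},
    {"date": "2024-01-28", "product": "Bags", "amount": 80.0},
    {"date": "2024-02-03", "product": "Shoes", "amount": 200.0},
]
inventory_system = {"Shoes": 15, "Bags": 40}  # product -> units in stock

# Single integrated repository: each sale enriched with the stock level.
warehouse = [
    {**sale, "stock": inventory_system.get(sale["product"], 0)}
    for sale in sales_db
]

# Monthly sales trend computed from the integrated data.
monthly_sales = defaultdict(float)
for row in warehouse:
    month = row["date"][:7]  # "YYYY-MM"
    monthly_sales[month] += row["amount"]

for month in sorted(monthly_sales):
    print(month, monthly_sales[month])
```

Because sales and stock levels now sit in the same rows, a single pass over `warehouse` can answer questions that previously needed a report from each system.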
Data Warehouse Architecture
1. Data Warehouse Architecture Types

There are three main types of data warehouse architecture:

a. Single-tier Architecture

Focuses on minimizing data storage by eliminating redundant data.

Not commonly used in large organizations due to performance limitations.

b. Two-tier Architecture

Separates the data warehouse from the operational database.

Provides direct access to the data warehouse, but may have scalability issues.
c. Three-tier Architecture (Most Commonly Used)

The three-tier architecture consists of three layers:


Bottom Tier (Data Source Layer)

Extracts data from multiple operational sources such as databases, ERP, CRM, flat files, and
external systems.

Uses ETL (Extract, Transform, Load) tools to clean and transform the data.

Middle Tier (Data Storage and Processing Layer)

Stores transformed data in the data warehouse or data marts.

Uses OLAP (Online Analytical Processing) for fast querying and reporting.

Top Tier (Presentation Layer)

Provides access to business intelligence (BI) tools, dashboards, reports, and data
visualization.

Users can query the data using SQL, BI tools, or reporting software.

Components of Data Warehouse Architecture and their tasks :


1. Operational Source –
- An operational source is a data source that consists of operational data and
external data.
- Data can come from relational DBMSs such as Informix and Oracle.
2. Load Manager –
- The load manager performs all operations associated with the extraction and
loading of data into the data warehouse.
- These tasks include simple transformations that prepare the data for entry
into the warehouse.
3. Warehouse Manager –
- The warehouse manager is responsible for the warehouse management
process.
- The operations performed by the warehouse manager are the analysis,
aggregation, backup and collection of data, and de-normalization of the data.
4. Query Manager –
- The query manager performs all the tasks associated with the management of user
queries.
- The complexity of the query manager is determined by the end-user access
tools and the features provided by the database.
5. Detailed Data –
- This component stores all the detailed data in the database schema.
- Detailed data is loaded into the data warehouse to complement the data
collected.
6. Summarized Data –
- Summarized data is the part of the data warehouse that stores predefined
aggregations.
- These aggregations are generated by the warehouse manager.
7. Archive and Backup Data –
- The detailed and summarized data are stored for the purpose of archiving and
backup.
- The data is relocated to storage archives such as magnetic tapes or optical
disks.
8. Metadata –
- Metadata is data about data.
- It is used for the extraction and loading process, the warehouse management
process, and the query management process.
9. End User Access Tools –
- End-user access tools consist of analysis, reporting, and mining tools.
- Users connect to the warehouse through these end-user access tools.
Layers in Data Warehouse Architecture

Source Data Layer (Operational Layer)

This is where raw data originates. It includes transactional databases (OLTP), external
sources (APIs, flat files, IoT, etc.), and other business applications (CRM, ERP).

Data Transformation Layer

Also called the ETL (Extract, Transform, Load) Process, this layer extracts data from the
source, cleans it, transforms it into a consistent format, and loads it into the data
warehouse.

Data Warehouse Layer

This is the central storage for historical and processed data. The data is optimized for
querying and reporting (OLAP - Online Analytical Processing).

Metadata Layer

Stores information about the structure, sources, transformations, and relationships of data.

Reporting Layer

Provides business users with insights using dashboards, reports, and analytics tools.
Working of these layers

Source Data Layer gathers raw data from various sources.

Data Transformation Layer processes, cleans, and converts the data.

Data Warehouse Layer stores processed data in a structured way.

Metadata Layer keeps track of data definitions and relationships.

Reporting Layer enables business users to analyse and visualize data.

Types of DBMS Schemas for Decision Support

There are three main types of schemas used in a Data Warehouse:

Star Schema

Snowflake Schema

Galaxy Schema

1. Star Schema (Most Common for Decision Support)

A central fact table containing numerical values (measurable business data).

Multiple dimension tables connected to the fact table.

Advantages:

Simple and fast query performance.

Efficient for OLAP (Online Analytical Processing) tools.

Use Case:

Sales analysis, financial reporting, and customer segmentation.
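A minimal sketch of a star schema, assuming SQLite (through Python's built-in `sqlite3` module) as a stand-in for a warehouse engine. The table names, column names, and rows are hypothetical and chosen only to show one fact table joined to its dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);
-- Fact table: numeric measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units      INTEGER,
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_store   VALUES (1, 'Delhi'), (2, 'Mumbai');
INSERT INTO fact_sales  VALUES (1, 1, 2, 2000.0), (1, 2, 1, 1000.0), (2, 1, 3, 450.0);
""")

# One join per dimension: the fact table sits at the centre of the "star".
rows = conn.execute("""
    SELECT p.category, s.city, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()
print(rows)
```

Every dimension is exactly one join away from the fact table, which is what keeps star-schema queries simple.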

2. Snowflake Schema (Normalized Version of Star Schema)

Similar to Star Schema but dimension tables are further normalized into sub-tables.
Advantages:

Saves storage space.

Reduces data redundancy.

Disadvantages:

Complex joins make queries slower.

Use Case:

Complex hierarchical relationships (e.g., multi-level product categories).
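The sketch below shows what normalizing a dimension might look like, again assuming SQLite via Python's `sqlite3` module. The tables and rows are hypothetical: the product category is moved out of the product dimension into its own sub-table, so a category-level query needs an extra join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is split: category moves to its own sub-table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (product_id INTEGER REFERENCES dim_product(product_id), revenue REAL);
INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
INSERT INTO dim_product  VALUES (1, 'Laptop', 1), (2, 'Phone', 1), (3, 'Desk', 2);
INSERT INTO fact_sales   VALUES (1, 2000.0), (2, 800.0), (3, 450.0);
""")

# Reaching the category now takes two joins from the fact table.
rows = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(rows)
```

The category name is stored once per category rather than once per product, which is the redundancy saving; the cost is the longer join chain.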

3. Galaxy Schema (Fact Constellation Schema)

Multiple fact tables that share dimension tables.

Advantages:

Supports multiple business processes in the same warehouse.

Use Case:

Companies tracking sales, inventory, and shipments together.
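A small sketch of a fact constellation, assuming SQLite via Python's `sqlite3` module. The two fact tables and their contents are hypothetical; the point is only that both reference the same product dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
-- Two fact tables, one per business process, sharing dim_product.
CREATE TABLE fact_sales     (product_id INTEGER, units_sold INTEGER);
CREATE TABLE fact_shipments (product_id INTEGER, units_shipped INTEGER);
INSERT INTO dim_product    VALUES (1, 'Laptop'), (2, 'Desk');
INSERT INTO fact_sales     VALUES (1, 5), (1, 3), (2, 2);
INSERT INTO fact_shipments VALUES (1, 6), (2, 2);
""")

# Aggregate each fact table separately, then line them up on the shared dimension.
rows = conn.execute("""
    SELECT p.name,
           (SELECT SUM(units_sold)    FROM fact_sales     WHERE product_id = p.product_id),
           (SELECT SUM(units_shipped) FROM fact_shipments WHERE product_id = p.product_id)
    FROM dim_product p
    ORDER BY p.name
""").fetchall()
print(rows)
```

Each fact table is summed on its own before the results are combined; joining the two fact tables directly to each other would multiply rows and inflate the totals.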

Data Extraction and Clean up Transformation in a Data Warehouse


Data Extraction and Clean up are essential steps in the ETL (Extract, Transform, Load)
process. These steps ensure that high-quality, consistent, and structured data is available for
decision-making.

1. What is Data Extraction, Clean up, and Transformation?


✅ Data Extraction: Retrieving raw data from various sources (databases, APIs, cloud storage,
etc.).
✅ Data Clean up: Removing errors, handling missing values, standardizing formats, and
removing duplicates.
✅ Data Transformation: Converting data into a usable format (aggregations, normalizing,
formatting).
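The three steps above can be sketched in plain Python as follows. The input rows, field names, and the choice to replace a missing amount with 0 are assumptions made for illustration; real pipelines choose a missing-value policy per field.

```python
# Extract: raw rows as they might arrive from a source system (made-up data).
raw_rows = [
    {"customer": " Alice ", "city": "delhi", "amount": "100"},
    {"customer": "Bob", "city": "Mumbai", "amount": None},      # missing value
    {"customer": " Alice ", "city": "delhi", "amount": "100"},  # duplicate
    {"customer": "Carol", "city": "DELHI", "amount": "50"},
]

def clean(rows):
    """Standardize formats, fill missing amounts with 0, drop exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        record = (
            row["customer"].strip(),
            row["city"].strip().title(),
            float(row["amount"]) if row["amount"] is not None else 0.0,
        )
        if record not in seen:
            seen.add(record)
            cleaned.append(record)
    return cleaned

def transform(records):
    """Aggregate the cleaned amounts per city."""
    totals = {}
    for _customer, city, amount in records:
        totals[city] = totals.get(city, 0.0) + amount
    return totals

cleaned = clean(raw_rows)
totals_by_city = transform(cleaned)
print(cleaned)
print(totals_by_city)
```

Cleanup runs before the aggregation so that "delhi" and "DELHI" count as one city and the duplicate row is not summed twice.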

🔹 Data Extraction Tools

These tools help extract data from multiple sources.


🔹 Data Clean up Tools

These tools help in data quality improvement by handling missing, duplicate, or inconsistent
data.

🔹 Data Transformation Tools

These tools help format and structure data for storage in a Data Warehouse.
Metadata
Metadata is information that describes other data, helping to organize, find, and access it
more easily. It includes details like content, format, and structure. Metadata can be stored
in formats like text, XML, or RDF and follows standards such as Dublin Core and schema.org
to ensure consistency.

It is used in libraries, museums, archives, and online platforms to improve search rankings
and provide context. Metadata also helps with data management by defining ownership,
access controls, and interoperability between systems. Additionally, it supports data
preservation and visualization by offering details on structure, provenance, and display
options.

[Figure: Types of metadata: Descriptive, Structural, Administrative, Reference, Provenance, Statistical]

Descriptive Metadata – Provides details about data to help with identification and
discovery.

Example: Title, author, keywords, description, creation date.

Structural Metadata – Defines how data is organized and related within a system.

Example: Table relationships in a database, chapters in a book.

Administrative Metadata – Manages data access, preservation, and rights.

Example: File formats, storage details, access permissions.

Provenance Metadata – Tracks the history and origin of data.

Example: Source, modification records, version history.

Reference Metadata – Provides contextual information about how data was collected and
processed.

Example: Survey methodology, data collection techniques.


Statistical Metadata (Data Dictionary) – Describes data fields, formats, and relationships.

Example: Column definitions in a database, data types, value ranges.

Importance of Metadata Reporting


Improves Data Discovery – Helps users find and understand data quickly.

Ensures Data Quality – Identifies inconsistencies, duplicates, and missing values.

Supports Compliance – Tracks data ownership, security, and regulatory requirements.

Enhances Data Governance – Monitors how data is structured, stored, and accessed.

Optimizes Performance – Helps improve database efficiency and data warehouse
management.

Metadata reports typically include the following details:

Descriptive Metadata - Title, author, keywords, descriptions.

Structural Metadata - Relationships between tables, data models.

Administrative Metadata - File formats, storage details, access permissions.

Provenance Metadata - Data source, modification history, version control.

Several tools help generate metadata reports, providing insights into data governance, data quality,
and compliance.
Query Tools for Metadata Reporting

Metadata query tools help users extract, analyze, and report metadata from databases, data
warehouses, and data lakes. These tools are essential for data governance, compliance, and
data quality analysis.
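As one small example of querying metadata, the sketch below reads a database's own catalogue instead of its data. It assumes SQLite through Python's `sqlite3` module; the `customers` table is hypothetical. Other database systems expose similar information through their own catalogue views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, joined DATE)"
)

# Structural metadata: which tables exist in the database.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Data dictionary: column name, declared type, and NOT NULL flag per column.
# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
columns = [(name, col_type, bool(notnull))
           for _cid, name, col_type, notnull, _default, _pk
           in conn.execute("PRAGMA table_info(customers)")]

print(tables)
print(columns)
```

A metadata report is essentially this kind of query run across every table, with the results formatted for governance or data-quality review.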

Applications of Query Tools


- Database Management

Retrieve, update, and manage data stored in relational and non-relational databases.

- Business Intelligence & Reporting

Generate reports and dashboards from large datasets.

- Data Warehousing & ETL (Extract, Transform, Load)

Extract data from multiple sources for reporting and analytics.

- Data Governance & Compliance

Ensure data integrity and accuracy by auditing metadata.

- Data Science & Machine Learning

Extract datasets for training machine learning models.

- Cloud Data Querying

Query cloud-based databases and data lakes.


Key Benefits of Query Tools

- Faster Data Retrieval – Run optimized queries for large datasets.
- Better Decision-Making – Generate business insights from raw data.
- Automation & Scheduling – Automate query execution and reporting.
- Cross-Platform Integration – Work with multiple databases in hybrid environments.

OLAP - On-Line Analytical Processing


OLAP stands for On-Line Analytical Processing. It is a category of software
technology that enables analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access to a wide variety of possible views of
data. That data has been transformed from raw information to reflect the real dimensionality of
the enterprise as understood by its users.
Characteristics of OLAP

[Figure: Characteristics of OLAP: Fast, Analysis, Shared, Multidimensional, Information]

Fast – The system should deliver feedback within five seconds, with basic analysis taking no
more than one second and complex queries rarely exceeding 20 seconds.

Analysis – The method must support business logic and statistical analysis while allowing
users to define new ad hoc calculations without programming.

Shared – The system should ensure secure data access and support concurrent updates when
needed, managing multiple updates efficiently.

Multidimensional – The OLAP system must provide a multidimensional view with
hierarchical support for effective business analysis.

Information – The system should store all necessary application data while handling data
sparsity efficiently.
OLAP in multi-dimensional data analysis
In the multidimensional model, data is organized into dimensions, each with different levels
of detail using concept hierarchies. This allows users to view data from multiple
perspectives.

OLAP operations help analyze data interactively by exploring different views using a data
cube. For example, in a shop’s sales data cube:

Location is grouped by city,

Time is grouped by quarters,

Item is grouped by item type.

This structure makes it easy to perform drill-down, roll-up, slice, dice, and pivot operations
for in-depth analysis.

1. Roll-Up (Drill-Up)

The roll-up operation (also called drill-up) summarizes data by moving up a hierarchy or
removing dimensions from a data cube. It works like zooming out to see higher-level trends.

Example: In a sales data cube with a location hierarchy:

Street → City → State → Country

Rolling up from City to Country aggregates data at the country level instead of showing
details for each city.

If time is removed, sales are grouped only by location, without breaking it down by date.
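Both forms of roll-up described above can be sketched over a tiny cube held in a Python dictionary. The cities, countries, and sales figures are made up for illustration.

```python
from collections import defaultdict

# Cells of a tiny sales cube: (city, quarter) -> sales. Values are made up.
cube = {
    ("Delhi", "Q1"): 100, ("Delhi", "Q2"): 150,
    ("Mumbai", "Q1"): 200, ("Mumbai", "Q2"): 50,
    ("Toronto", "Q1"): 300, ("Toronto", "Q2"): 100,
}
# One step of the location hierarchy: City -> Country.
city_to_country = {"Delhi": "India", "Mumbai": "India", "Toronto": "Canada"}

# Roll-up by climbing the hierarchy: aggregate cities into countries.
by_country = defaultdict(int)
for (city, quarter), sales in cube.items():
    by_country[(city_to_country[city], quarter)] += sales

# Roll-up by removing a dimension: drop time, keep only location.
by_city = defaultdict(int)
for (city, _quarter), sales in cube.items():
    by_city[city] += sales

print(dict(by_country))
print(dict(by_city))
```

In both cases the number of cells shrinks while the grand total stays the same, which is what distinguishes a roll-up from a filter.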
2. Drill-Down (Roll-Down)

The drill-down operation (also called roll-down) is the opposite of roll-up. It works like
zooming in, moving from summary data to detailed data.

Example: In a sales data cube with a time hierarchy:

Year → Quarter → Month → Day

Drilling down from Quarter to Month gives a more detailed view of sales.

A drill-down can also add a new dimension, like introducing Customer Group to analyze
sales by customer type.

This helps in detailed analysis and finding specific patterns in data.
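A minimal sketch of drilling down from Quarter to Month, assuming the warehouse keeps month-level detail. The figures are invented; the function name `drill_down` is a hypothetical helper, not part of any OLAP product.

```python
from collections import defaultdict

# Most detailed level stored in the warehouse: (quarter, month) -> sales.
monthly = {
    ("Q1", "Jan"): 40, ("Q1", "Feb"): 60, ("Q1", "Mar"): 50,
    ("Q2", "Apr"): 70, ("Q2", "May"): 30, ("Q2", "Jun"): 20,
}

# Summary view at the Quarter level.
quarterly = defaultdict(int)
for (quarter, _month), sales in monthly.items():
    quarterly[quarter] += sales

def drill_down(quarter):
    """Return the month-level rows that make up one quarter's total."""
    return {month: sales for (q, month), sales in monthly.items() if q == quarter}

print(dict(quarterly))
print(drill_down("Q1"))
```

Drill-down is only possible because the detailed rows were retained; a warehouse that stored nothing finer than quarterly totals could not answer the month-level question.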

3. Slice

The slice operation extracts a subset of a data cube by selecting a single value from one
dimension, reducing the cube’s dimensions.

Example: In a 3D sales data cube (Location, Time, Product):

Selecting sales data for Q1 creates a 2D subcube with only Location and Product.

This helps in focused analysis on specific data points.
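The Q1 example above can be sketched with a dictionary-based cube. The cell values and the helper name `slice_time` are assumptions for illustration.

```python
# 3D cube: (location, time, product) -> sales. Values are made up.
cube = {
    ("Delhi", "Q1", "Laptop"): 10, ("Delhi", "Q2", "Laptop"): 12,
    ("Delhi", "Q1", "Phone"): 7,   ("Mumbai", "Q1", "Laptop"): 5,
    ("Mumbai", "Q2", "Phone"): 9,
}

def slice_time(cube, quarter):
    """Fix Time to one value; the result is keyed by (location, product) only."""
    return {(loc, prod): sales
            for (loc, time, prod), sales in cube.items() if time == quarter}

q1 = slice_time(cube, "Q1")
print(q1)
```

The keys of `q1` have two components instead of three, mirroring the drop from a 3D cube to a 2D subcube.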


4. Dice
The dice operation defines a subcube by selecting values on two or more dimensions. Unlike
slice, it does not reduce the number of dimensions.
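A short sketch of a dice over two dimensions, using a dictionary-based cube with made-up values. The helper name `dice` and its parameters are hypothetical.

```python
# 3D cube: (location, time, product) -> sales. Values are made up.
cube = {
    ("Delhi", "Q1", "Laptop"): 10, ("Delhi", "Q2", "Laptop"): 12,
    ("Mumbai", "Q1", "Phone"): 7,  ("Mumbai", "Q3", "Phone"): 4,
    ("Toronto", "Q1", "Laptop"): 5,
}

def dice(cube, locations, quarters):
    """Keep cells whose location and time both fall in the chosen sets."""
    return {key: sales for key, sales in cube.items()
            if key[0] in locations and key[1] in quarters}

sub = dice(cube, {"Delhi", "Mumbai"}, {"Q1", "Q2"})
print(sub)
```

The result still has all three dimensions in its keys; only the range of values along Location and Time has been narrowed.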

5. Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates
the data axes in a view to provide an alternative presentation of the data. It may involve
swapping the rows and columns or moving one of the row dimensions into the column
dimensions.
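Swapping rows and columns can be sketched on a small two-dimensional view stored as nested dictionaries. The figures and the helper name `pivot` are assumptions for illustration.

```python
# 2D view: rows are cities, columns are quarters. Values are made up.
view = {
    "Delhi":  {"Q1": 100, "Q2": 150},
    "Mumbai": {"Q1": 200, "Q2": 50},
}

def pivot(table):
    """Swap the row and column axes of a nested-dict table."""
    rotated = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated

rotated = pivot(view)
print(rotated)
```

No value changes and nothing is aggregated; applying `pivot` twice returns the original view.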
Characteristics of a Data Warehouse

- A Data Warehouse helps analyse data on specific topics like sales, finance, and
customer behaviour.
- It collects data from different sources, like databases and spreadsheets, and converts
it into a common format.
- The data is stored with time details to track changes over time.
- Once added, the data is rarely changed or deleted.
- It is built to handle large amounts of data efficiently.
- It supports tools like OLAP, data mining, and visualization dashboards.

Steps to Build a Data Warehouse


1. Requirement Gathering – Understand business needs and what data to store.
2. Data Source Identification – Identify where data comes from (databases, files, APIs,
etc.).
3. Data Modelling – Design the warehouse structure (Star Schema, Snowflake Schema).
4. ETL Process (Extract, Transform, Load) –
- Extract data from sources
- Transform it (clean, filter, format)
- Load into the data warehouse
5. Data Storage – Store data efficiently in a relational or cloud-based data warehouse.
6. OLAP Cube Creation – Organize data for fast reporting and analysis.
7. BI & Reporting – Use tools like Power BI, Tableau, or SQL queries for insights.
8. Testing & Validation – Ensure accuracy, performance, and security.
9. Deployment & Maintenance – Regular updates, backups, and performance tuning.
