Data Warehousing
UNIT-3
Definition
1. Data Sources:
Data warehouses collect data from multiple internal and external sources. These
sources may include databases (like transactional systems), flat files, spreadsheets,
CRM systems, ERP systems, and other third-party data feeds.
Examples: Sales records, customer data, financial data, market data.
2. ETL (Extract, Transform, Load) Process:
Extract: Data is extracted from various source systems (e.g., databases, files) and
collected for processing.
Transform: The data is cleaned, standardized, and transformed into a consistent format
suitable for storage in the data warehouse. This step involves processes like data
filtering, removing duplicates, and resolving inconsistencies.
Load: The transformed data is loaded into the data warehouse for storage and future
analysis.
ETL tools automate these steps and help keep the data in the warehouse accurate, relevant, and timely.
Examples of ETL Tools: Informatica, Talend, Microsoft SQL Server Integration Services
(SSIS).
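The three ETL steps can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library (csv and sqlite3), not how any of the tools above work internally; the sample data, the sales table, and the function names are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system: stray whitespace,
# inconsistent casing, a duplicate row, and a row with a missing amount.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,20.00
2,BOB,20.00
3,carol,
"""

def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: filter incomplete rows, standardize values, drop duplicates."""
    seen = set()
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:              # filtering: skip rows without an amount
            continue
        order_id = int(row["order_id"])
        if order_id in seen:        # removing duplicates by order_id
            continue
        seen.add(order_id)
        customer = row["customer"].strip().title()  # standardize names
        clean.append((order_id, customer, float(amount)))
    return clean

def load(rows, conn):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

etl_conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), etl_conn)
loaded = etl_conn.execute(
    "SELECT order_id, customer, amount FROM sales ORDER BY order_id"
).fetchall()
print(loaded)  # [(1, 'Alice', 10.5), (2, 'Bob', 20.0)]
```

Of the four raw rows, only two reach the warehouse: the duplicate order and the row without an amount are dropped during the transform step.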
4. Data Warehouse Database:
This is the central repository where data is stored. It is optimized for querying and analysis,
often using relational database management systems (RDBMS) or specialized platforms like
columnar databases.
Data is usually organized into fact and dimension tables following a star or snowflake
schema.
Popular Platforms: Amazon Redshift, Google BigQuery, Azure Synapse Analytics, Teradata.
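A star schema can be illustrated with a small in-memory SQLite database. This is a minimal sketch, not a production design; the table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are all assumptions made for the example.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- The fact table holds measures plus a key into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (10, 2023, 'Q1'), (11, 2023, 'Q2');
INSERT INTO fact_sales VALUES (1, 10, 2, 2000.0), (1, 11, 1, 1000.0), (2, 10, 3, 600.0);
""")

# A typical analytical query: join the fact table to a dimension and aggregate.
revenue_by_category = dw.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)  # [('Electronics', 3000.0), ('Furniture', 600.0)]
```

In a snowflake schema the dimension tables would be normalized further, for example by moving category into its own table referenced from dim_product.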
5. Metadata:
Metadata is "data about data." It provides information about the data in the warehouse,
such as the data's source, structure, definitions, and how it has been transformed.
It helps users understand the contents and organization of the data warehouse and
facilitates better data governance.
Types of Metadata:
• Technical metadata: Information about data storage, schemas, tables, and
relationships.
• Business metadata: Descriptions of what the data represents, e.g., sales figures,
customer demographics.
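As a rough illustration, both kinds of metadata can be kept side by side in a small catalog entry. The class names, fields, and sample values below are hypothetical and do not correspond to any particular metadata tool.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnMetadata:
    name: str
    data_type: str      # technical metadata: how the column is stored
    description: str    # business metadata: what the column means

@dataclass
class TableMetadata:
    table: str
    source_system: str      # where the data came from
    transformation: str     # how it was transformed on the way in
    columns: list = field(default_factory=list)

    def describe(self, column_name):
        """Return a one-line description of a column, or raise KeyError."""
        for col in self.columns:
            if col.name == column_name:
                return f"{self.table}.{col.name} ({col.data_type}): {col.description}"
        raise KeyError(column_name)

# Hypothetical catalog entry for a sales fact table.
sales_meta = TableMetadata(
    table="fact_sales",
    source_system="CRM export",
    transformation="duplicates removed, amounts converted to REAL",
    columns=[
        ColumnMetadata("revenue", "REAL", "Sales figure per order line"),
        ColumnMetadata("units", "INTEGER", "Number of items sold"),
    ],
)
print(sales_meta.describe("revenue"))
# fact_sales.revenue (REAL): Sales figure per order line
```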
6. OLAP (Online Analytical Processing) Engine:
OLAP engines allow users to perform complex queries and multidimensional analysis on
the data stored in the warehouse.
OLAP systems organize data into cubes that allow users to explore data from different
perspectives (dimensions) such as time, geography, or product type.
Examples of OLAP Operations:
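• Drill-down: Moving from a summary to finer detail, e.g., from yearly to quarterly sales.
• Roll-up: Aggregating data into higher-level summaries, e.g., from quarterly to yearly sales.
• Slice and dice: A slice fixes one dimension at a single value (e.g., one region); a dice
selects values on two or more dimensions to produce a smaller sub-cube.
• Pivoting: Rotating the cube so that different dimensions appear as rows and columns.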
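The OLAP operations listed above can be imitated on a tiny in-memory data set. This is a plain-Python sketch of the ideas, not how an OLAP engine is implemented; the fact rows, the dimension names, and the helper functions (aggregate, select, pivot) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, product, sales)
FACTS = [
    (2023, "Q1", "East", "Laptop", 100),
    (2023, "Q1", "West", "Laptop", 80),
    (2023, "Q2", "East", "Desk", 50),
    (2023, "Q2", "West", "Laptop", 70),
    (2024, "Q1", "East", "Desk", 60),
]
DIMS = {"year": 0, "quarter": 1, "region": 2, "product": 3}

def aggregate(rows, dims):
    """Sum sales grouped by the given dimension names."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[4]
    return dict(totals)

def select(rows, **conditions):
    """Keep rows matching every condition (one condition: slice; several: dice)."""
    return [
        r for r in rows
        if all(r[DIMS[d]] in allowed for d, allowed in conditions.items())
    ]

def pivot(rows, row_dim, col_dim):
    """Arrange totals with one dimension as rows and another as columns."""
    table = defaultdict(dict)
    for (r, c), total in aggregate(rows, [row_dim, col_dim]).items():
        table[r][c] = total
    return dict(table)

roll_up = aggregate(FACTS, ["year"])                  # coarser: totals per year
drill_down = aggregate(FACTS, ["year", "quarter"])    # finer: per year and quarter
slice_east = aggregate(select(FACTS, region={"East"}), ["product"])
dice = aggregate(select(FACTS, region={"West"}, year={2023}), ["quarter", "product"])
pivoted = pivot(FACTS, "region", "product")

print(roll_up)     # {(2023,): 300, (2024,): 60}
print(drill_down)  # {(2023, 'Q1'): 180, (2023, 'Q2'): 120, (2024, 'Q1'): 60}
print(slice_east)  # {('Laptop',): 100, ('Desk',): 110}
print(dice)        # {('Q1', 'Laptop'): 80, ('Q2', 'Laptop'): 70}
print(pivoted)     # {'East': {'Laptop': 100, 'Desk': 110}, 'West': {'Laptop': 150}}
```

Roll-up and drill-down differ only in how many dimensions are kept in the grouping key, which is why one aggregate function covers both.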
7. Data Warehouse Access Tools (BI Tools):
These tools provide users with interfaces to access and analyze data stored in the
warehouse. They include reporting tools, dashboards, query tools, and data mining tools.
Common BI Tools:
• Tableau
• Power BI
• Looker
• QlikView
8. Data Marts:
Data marts are smaller, focused subsets of the data warehouse, typically
designed for specific departments or business units (e.g., finance,
marketing, sales).
They contain data relevant to a particular function or team, providing
more targeted and faster access to the data needed for specific analyses.
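One simple way to picture a data mart is as a filtered view over the warehouse. The sketch below uses an SQLite view for this; real data marts are often separate physical stores, and the table name, view name, and sample rows here are assumptions made for the example.

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.executescript("""
CREATE TABLE warehouse_sales (department TEXT, region TEXT, amount REAL);
INSERT INTO warehouse_sales VALUES
    ('marketing', 'East', 120.0),
    ('finance', 'East', 300.0),
    ('marketing', 'West', 80.0);
-- The mart exposes only the rows and columns the marketing team needs.
CREATE VIEW marketing_mart AS
    SELECT region, amount FROM warehouse_sales WHERE department = 'marketing';
""")

mart_rows = wh.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
print(mart_rows)  # [('East', 120.0), ('West', 80.0)]
```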
9. Data Governance and Security:
A data warehouse must have governance policies and security controls to
ensure data accuracy, privacy, and regulatory compliance.
This includes user authentication, access control, and encryption, so that only
authorized users can access the data warehouse, together with audit logging and
data lineage tracking, which record who accessed the data and where it came from.
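Access control and audit logging can be sketched together as a role check that records every attempt. This is a simplified illustration only; the roles, users, table names, and the can_read function are hypothetical, and a real warehouse would rely on its platform's own security features.

```python
from datetime import datetime, timezone

# Hypothetical role-based permissions: which tables each role may read.
ROLE_PERMISSIONS = {
    "analyst": {"fact_sales"},
    "finance": {"fact_sales", "fact_payroll"},
}
USER_ROLES = {"ana": "analyst", "raj": "finance"}

audit_log = []  # every access attempt is recorded, allowed or not

def can_read(user, table):
    """Check whether the user's role permits reading the table, and log the attempt."""
    role = USER_ROLES.get(user)
    allowed = table in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "table": table,
        "allowed": allowed,
    })
    return allowed

print(can_read("raj", "fact_payroll"))  # True
print(can_read("ana", "fact_payroll"))  # False
print(len(audit_log))                   # 2
```

Unknown users have no role and are therefore denied by default, while the attempt is still written to the audit log.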