Sem3 Unit1 DW
• A data warehouse is a centralized storage system that allows for the storing, analyzing,
and interpreting of data in order to facilitate better decision-making.
• Data warehouses are primarily designed to facilitate searches and analyses and usually
contain large amounts of historical data.
• In a data warehouse, data from many different sources is brought to a single location
and then translated into a format the data warehouse can process and store.
Evolution of the Data Warehouse:
• Punch cards, first used for storing computer-generated data, became essential in government
and business by the 1950s, carried the famous warning "Do not fold, spindle, or mutilate,"
remained widely used until the mid-1980s, and are still utilized for voting ballots and
standardized tests.
• Magnetic storage began replacing punch cards in the 1960s; disk storage (hard drives, with floppy
disks arriving later) became popular from 1964, enabling direct data access and improving efficiency
over magnetic tape.
• IBM pioneered disk storage, inventing both the hard disk drive (first manufactured in 1956) and the
floppy disk, continually improved the technology, and eventually sold its hard disk business to Hitachi
in 2003.
Database Management Systems:
• Disk storage was quickly followed by software called a database management system (DBMS). In 1966, IBM
introduced its own DBMS, known at the time as the Information Management System (IMS).
• DBMS Functions:
Locate data efficiently.
Resolve conflicts.
Allow deletion and storage optimization.
Improve data retrieval speed.
• In the late 1960s and early ‘70s, commercial online applications came into play, shortly after disk storage
and DBMS software became popular.
• As a result, a large number of commercial applications could be applied to online processing.
Some examples included:
Claims processing.
Bank teller processing.
Automated teller processing (ATMs).
Airline reservation processing.
Retail point-of-sale processing.
Manufacturing control processing.
Data Warehouse Alternatives:
Understanding OLTP, OLAP, and Data Warehousing:
• OLTP (Online Transaction Processing) → Handles real-time transactional data (e.g., banking, retail
purchases).
• OLAP (Online Analytical Processing) → Enables complex queries & analysis for decision-making
(e.g., sales trends, customer segmentation).
• Data Warehouse → Serves as a centralized storage for historical data, combining OLTP data and
external sources.
• Extract, Transform, Load (ETL) Process:
✔ Data from OLTP databases (MySQL, PostgreSQL) is extracted.
✔ It is cleaned, formatted, and optimized for analysis.
✔ It is loaded into a data warehouse (Snowflake, Redshift, BigQuery).
• Multidimensional Analysis (see the sketch after this list):
✔ OLAP systems query pre-aggregated data efficiently from the warehouse.
✔ Cube structures allow fast summaries across dimensions (e.g., time, location, product).
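As a rough illustration of cube-style summarization, the sketch below pivots a small, made-up sales dataset across two dimensions using pandas; the column names and figures are hypothetical and not taken from the notes above.

```python
# Minimal sketch of OLAP-style multidimensional summarization with pandas.
# The dataset, column names, and numbers are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "region":  ["North", "South", "North", "North", "South", "South"],
    "product": ["A", "A", "B", "A", "B", "B"],
    "revenue": [120.0, 80.0, 200.0, 150.0, 90.0, 210.0],
})

# Pivot across two dimensions (year x region), summing the revenue measure.
# Conceptually this is a slice of a cube rolled up along its dimensions.
cube = pd.pivot_table(
    sales,
    values="revenue",
    index="year",
    columns="region",
    aggfunc="sum",
    margins=True,  # row/column totals, i.e. higher-level roll-ups
)
print(cube)
```

The margins=True option adds row and column totals, which plays the role of rolling the cube up to coarser summary levels.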
OLTP vs. OLAP:
OLTP (Online Transaction Processing): normalized data structures in a structured format.
OLAP (Online Analytical Processing): star schema and snowflake schema models to analyze the data.
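To make the star schema idea concrete, here is a minimal, hypothetical sketch using Python's built-in sqlite3 module: a central fact table joined to two dimension tables and queried with an aggregate. The table names, columns, and data are illustrative assumptions, not part of the original material.

```python
# Hypothetical star schema sketch: a central fact table joined to dimension tables.
# Uses only the standard-library sqlite3 module; names and data are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, 2024, 1), (2, 2024, 2)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 999.0), (2, 2, 1, 250.0), (3, 1, 2, 1050.0)])

# Analytical query: roll revenue up by category and month across the star schema.
for row in cur.execute("""
    SELECT p.category, d.year, d.month, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY p.category, d.year, d.month
"""):
    print(row)

conn.close()
```

A snowflake schema would differ only in that the dimension tables themselves are further normalized into sub-dimension tables.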
The ETL Process:
The ETL process involves three distinct stages that streamline raw data from multiple sources into a clean,
structured, and usable form.
Extraction: The Extract phase is the first step in the ETL process, where raw data is collected from various
data sources (a short extraction sketch follows the list of source types below).
Types of data sources can include:
Structured: SQL databases, ERPs, CRMs.
Semi-structured: JSON, XML.
Unstructured: Emails, web pages, flat files.
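The sketch below illustrates the extract step for two of the source types listed above, a SQL database and a JSON export; the file names, table name, and fields are placeholder assumptions.

```python
# Hypothetical extraction sketch: pull raw records from a SQL database and a JSON file.
# The database file, table name, and JSON path are placeholders.
import json
import sqlite3

def extract_from_sql(db_path):
    """Read raw rows from a structured source (here, a SQLite database)."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute("SELECT * FROM orders").fetchall()

def extract_from_json(file_path):
    """Read semi-structured records from a JSON export."""
    with open(file_path, encoding="utf-8") as f:
        return json.load(f)

if __name__ == "__main__":
    sql_rows = extract_from_sql("orders.db")      # placeholder database file
    json_rows = extract_from_json("orders.json")  # placeholder JSON export
    print(len(sql_rows), len(json_rows))
```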
Transformation: Data extracted in the previous phase is often raw and inconsistent. During transformation,
the data is cleaned, aggregated, and formatted according to business rules.
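As a rough example of the transformation step, the snippet below cleans and standardizes hypothetical raw records: trimming and casing names, normalizing dates, and dropping rows with unparseable amounts. The field names and business rules are made up for illustration.

```python
# Hypothetical transformation sketch: clean and standardize raw extracted records.
# Field names and business rules are illustrative only.
from datetime import datetime
from typing import Optional

raw_records = [
    {"customer": "  alice ", "order_date": "2024/01/05", "amount": "120.50"},
    {"customer": "Bob",      "order_date": "2024/01/06", "amount": "not-a-number"},
]

def transform(record) -> Optional[dict]:
    """Apply simple cleaning rules; return None for rows that fail validation."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None  # drop rows whose amount cannot be parsed
    return {
        "customer": record["customer"].strip().title(),
        "order_date": datetime.strptime(record["order_date"], "%Y/%m/%d").date().isoformat(),
        "amount": round(amount, 2),
    }

clean_records = [t for r in raw_records if (t := transform(r)) is not None]
print(clean_records)  # [{'customer': 'Alice', 'order_date': '2024-01-05', 'amount': 120.5}]
```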
Loading: The transformed data is written into the target system. Depending on the use case, there are two
types of loading methods:
Full Load: All data is loaded into the target system, often used during the initial population of the warehouse.
Incremental Load: Only new or updated data is loaded, making this method more efficient for ongoing data
updates.
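The sketch below contrasts the two loading methods in a minimal way, using an in-memory dictionary as a stand-in for the warehouse target table; in a real pipeline the target would be a warehouse bulk-load or merge operation.

```python
# Minimal sketch contrasting full load and incremental load.
# A dictionary keyed by record id stands in for the warehouse target table.

def full_load(target, source_rows):
    """Replace the entire target with the source data (initial population)."""
    target.clear()
    for row in source_rows:
        target[row["id"]] = row

def incremental_load(target, changed_rows):
    """Apply only new or updated rows (ongoing refreshes)."""
    for row in changed_rows:
        target[row["id"]] = row  # insert new ids, overwrite updated ones

warehouse = {}
full_load(warehouse, [{"id": 1, "amount": 100}, {"id": 2, "amount": 200}])
incremental_load(warehouse, [{"id": 2, "amount": 250}, {"id": 3, "amount": 300}])
print(warehouse)  # ids 1-3, with id 2 holding the updated amount
```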