Unit 3-1
ETL stands for Extract, Transform, and Load. It is a critical process in Business Intelligence (BI) used to
integrate data from multiple sources into a data warehouse for reporting, analysis, and decision-making.
1. Data Sources:
o These are the original locations of data such as relational databases (MySQL, Oracle), flat
files (CSV, Excel), cloud storage (AWS S3), or APIs.
2. ETL Layer:
o This is the heart of the architecture. It includes three main operations:
▪ Extraction: Fetching raw data from different sources.
▪ Transformation: Cleaning, structuring, and converting data into a suitable format.
▪ Loading: Pushing the transformed data into the target system (data warehouse).
3. Staging Area:
o A temporary storage where data is held after extraction and before transformation. It helps in
data validation, sorting, and error handling.
4. Data Warehouse:
o The final destination where cleaned and transformed data is stored. It supports
multidimensional analysis and OLAP operations.
5. BI Layer:
o Tools like Power BI, Tableau, or SAP BO connect to the data warehouse to create
dashboards, reports, and visualizations.
1. Extraction:
o This step involves identifying and collecting data from multiple sources.
o Data can be structured (SQL databases) or unstructured (log files, social media).
o Extraction must ensure minimal impact on source systems and support batch or real-time
collection.
2. Transformation:
o This step involves cleaning and converting data into a standardized format.
o Operations include:
▪ Data cleansing (removing duplicates, handling nulls)
▪ Data integration (merging data from various sources)
▪ Data aggregation (summarizing data for analysis)
▪ Business rules application (e.g., converting currency)
3. Loading:
o The final step where transformed data is loaded into the data warehouse.
o It can be done in:
▪ Batch Mode (scheduled, periodic)
▪ Real-Time Mode (continuous updates)
ETL Process Flow:
[ Data Sources ]
↓
[ Extraction ]
↓
[ Staging Area ]
↓
[ Transformation ]
↓
[ Loading ]
↓
[ Data Warehouse ]
↓
[ BI Reporting Tools ]
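To make this flow concrete, here is a minimal Python sketch of the same stages using pandas and SQLite. The sales.csv file, its column names, and the sales_dw.db target are illustrative assumptions, not part of the notes above.
```python
# Minimal ETL sketch (illustrative): flat-file source -> staging -> transform -> load.
# Assumes a sales.csv with columns: order_id, region, order_date, amount.
import sqlite3
import pandas as pd

# Extraction: fetch raw data from the source.
raw = pd.read_csv("sales.csv")

# Staging: hold an untouched copy for validation and error handling.
staging = raw.copy()

# Transformation: cleanse, standardize, and aggregate.
staging = staging.drop_duplicates(subset="order_id")
staging["order_date"] = pd.to_datetime(staging["order_date"], errors="coerce")
staging = staging.dropna(subset=["order_date", "amount"])
staging["month"] = staging["order_date"].dt.to_period("M").astype(str)
summary = (staging.groupby(["region", "month"], as_index=False)["amount"]
                  .sum()
                  .rename(columns={"amount": "total_sales"}))

# Loading: push the transformed data into the target (a local SQLite "warehouse" here).
with sqlite3.connect("sales_dw.db") as conn:
    summary.to_sql("fact_monthly_sales", conn, if_exists="replace", index=False)
```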
Conclusion:
ETL is the backbone of BI systems. It ensures that raw data from diverse sources is converted into
meaningful insights, enabling organizations to make data-driven decisions effectively.
2. Differentiate between Initial and Incremental Loading in ETL Processes. [Nov 2024]
Introduction:
In ETL (Extract, Transform, Load) processes, data loading is the phase where data is inserted into a data
warehouse or data mart. There are two primary types of loading strategies:
• Initial Load
• Incremental Load
Each serves a different purpose based on the data scenario.
Comparison of Initial and Incremental Loading:
• Definition: Initial Load is the process of loading the entire dataset for the first time; Incremental Load is the process of loading only new or changed data since the last load.
• Usage Scenario: Initial Load is used during the first-time data warehouse setup or a full reload; Incremental Load is used for periodic updates (daily, hourly, real-time).
• Performance Impact: Initial Load is time-consuming and resource-intensive; Incremental Load is fast and efficient with less system load.
• Techniques Used: Initial Load uses a full data load via bulk insert or batch process; Incremental Load uses Change Data Capture (CDC), timestamps, or audit columns.
Detailed Explanation:
1. Initial Load:
• It is performed once when a data warehouse or a table is created.
• All existing data is extracted from the source, transformed, and loaded completely.
• Used when:
o A new system is being developed.
o There is a need for full reprocessing due to corruption or design changes.
Example: Loading 5 years of historical sales data into a newly built warehouse.
2. Incremental Load:
• This method loads only the data that has changed since the last ETL run.
• It improves performance, reduces network traffic, and keeps data up to date.
• Implemented using:
o Timestamps (e.g., last_modified_date)
o Log tables
o Change Data Capture (CDC) tools like Oracle GoldenGate, SQL Server CDC, Debezium.
Example: Loading only today's new sales records into the sales fact table each night.
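A minimal sketch of a timestamp-based incremental extract, assuming a source table named sales with a last_modified_date audit column (both names are illustrative):
```python
# Incremental extraction sketch (illustrative): pull only rows changed since the
# previous ETL run, using a last_modified_date audit column.
import sqlite3
import pandas as pd

def extract_changed_rows(conn, last_run_ts):
    # Only rows touched after the previous run are read from the source.
    query = "SELECT * FROM sales WHERE last_modified_date > ?"
    return pd.read_sql_query(query, conn, params=[last_run_ts])

with sqlite3.connect("source.db") as conn:
    changed = extract_changed_rows(conn, "2024-11-01 00:00:00")
    # The changed rows would then be transformed and appended/merged into the
    # warehouse fact table, e.g. with DataFrame.to_sql(..., if_exists="append").
```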
Conclusion:
• Initial Loading is foundational for setting up systems.
• Incremental Loading ensures timely and efficient updates.
• A well-designed ETL system typically uses both: initial load first, followed by incremental updates.
3. Analyse the Role of Lookups in the Transformation Process of ETL. [May 2024]
Introduction:
In ETL (Extract, Transform, Load), the Transformation phase often involves Lookups to enrich or validate
data. Lookups are crucial when data from the source needs to be matched, referenced, or translated using
values from another table or dataset.
Types of Lookups:
• Static Lookup: The reference data does not change during ETL runtime.
• Dynamic Lookup: The reference data can change during ETL; updated data is considered.
• Cached Lookup: Reference data is pre-loaded into memory for faster access.
• Uncached Lookup: Accesses the reference table row by row; slower but always current.
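The cached versus uncached distinction can be sketched as follows; the customer_master table and region column are illustrative assumptions:
```python
# Lookup caching sketch (illustrative): a cached lookup loads the reference table
# once into memory; an uncached lookup queries the reference table row by row.
import sqlite3

def cached_lookup(conn, customer_ids):
    # One query up front, then in-memory dictionary hits (fast, but can go stale).
    cache = dict(conn.execute("SELECT customer_id, region FROM customer_master"))
    return [cache.get(cid) for cid in customer_ids]

def uncached_lookup(conn, customer_ids):
    # One query per incoming row (slower, but always reflects current reference data).
    rows = [conn.execute("SELECT region FROM customer_master WHERE customer_id = ?",
                         (cid,)).fetchone() for cid in customer_ids]
    return [r[0] if r else None for r in rows]
```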
Example Scenario:
• Source Data: Order table with Customer_ID
• Lookup Table: Customer Master with Customer_ID, Customer_Name, Region
• Purpose: Add Customer_Name and Region to the order record before loading into warehouse
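In code, this enrichment is essentially a left join against the reference table. A minimal pandas sketch with invented sample rows:
```python
# Lookup/enrichment sketch (illustrative): add Customer_Name and Region to order
# records from a customer master table. All sample values are made up.
import pandas as pd

orders = pd.DataFrame({
    "Order_ID": [101, 102, 103],
    "Customer_ID": ["C1", "C2", "C1"],
    "Amount": [250.0, 120.5, 90.0],
})
customer_master = pd.DataFrame({
    "Customer_ID": ["C1", "C2", "C3"],
    "Customer_Name": ["Asha", "Ravi", "Meena"],
    "Region": ["South", "West", "North"],
})

# Left join keeps every order even when the lookup finds no match.
enriched = orders.merge(customer_master, on="Customer_ID", how="left")
print(enriched[["Order_ID", "Customer_Name", "Region", "Amount"]])
```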
Conclusion:
Lookups are a powerful mechanism in ETL transformation. They enable data enhancement, maintain
consistency, and support complex data relationships — all of which are essential for accurate and
meaningful business intelligence reporting.
4. Develop a Plan for Data Quality Assessment and Profiling in the Context of a Specific
Business Dataset. [Nov 2024]
Scenario:
Let’s assume a customer dataset from an e-commerce company, which includes:
• Customer_ID
• Name
• Email
• Phone_Number
• Gender
• Date_of_Registration
• Region
• Total_Purchases
The objective is to assess and improve data quality to ensure reliable BI reports and personalized customer
insights.
Data Quality Dimensions to Assess:
• Accuracy: Is the data correct and valid? (e.g., correct phone numbers)
• Completeness: Are all required fields filled in?
• Uniqueness: Are there duplicate records?
• Consistency: Is data uniform across records and systems?
• Timeliness: Is the data up to date?
• Validity: Does the data match predefined formats and rules?
Use profiling tools (e.g., Talend, Informatica, Pandas Profiling in Python) to perform column-level profiling: null counts, distinct values, value distributions, format/pattern checks, and duplicate detection.
Example Checks:
• 7% missing Phone_Number
• 3% duplicate Customer_IDs
• Inconsistent Gender values like “M”, “Male”, “male”, “F”
Tools for Assessment and Cleansing:
• Talend DQ: GUI-based quality profiling and cleansing
• Python (Pandas): Custom profiling, regex validations, and cleaning (see the sketch below)
• SQL Queries: Rule-based validation and duplicate checks
• Great Expectations: Create unit tests for data in pipelines
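The example checks above can be reproduced with plain pandas, one of the tools listed. This is a minimal sketch that assumes a customers.csv file with the columns from the scenario:
```python
# Profiling sketch (illustrative): quantify missing phone numbers, duplicate IDs,
# inconsistent gender codes, and invalid emails in the customer dataset.
import pandas as pd

df = pd.read_csv("customers.csv")

# Completeness: share of missing phone numbers.
missing_phone_pct = df["Phone_Number"].isna().mean() * 100

# Uniqueness: rows sharing a Customer_ID.
duplicate_ids = df[df.duplicated(subset="Customer_ID", keep=False)]

# Consistency: how Gender is actually coded ("M", "Male", "male", "F", ...).
gender_codes = df["Gender"].astype(str).str.strip().str.lower().value_counts(dropna=False)

# Validity: emails failing a rough pattern check (not a full RFC validation).
bad_emails = df[~df["Email"].astype(str).str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")]

print(f"Missing Phone_Number: {missing_phone_pct:.1f}%")
print(f"Duplicate Customer_ID rows: {len(duplicate_ids)}")
print(gender_codes)
print(f"Invalid Email values: {len(bad_emails)}")
```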
7. Document and Report Findings
Conclusion:
A structured Data Quality Assessment Plan ensures clean, consistent, and reliable data, which is crucial
for accurate BI insights, compliance, and operational decisions.
5. Describe the Key Components of Data Provisioning, Including Data Quality, Data
Profiling, and Data Enrichment. [May 2024]
Introduction:
Data Provisioning is the process of making relevant data available to systems, applications, or users in a
structured, secure, and timely manner. It plays a foundational role in Business Intelligence (BI) and
analytics, ensuring that clean, complete, and context-rich data reaches the end user.
1. Data Quality
Definition:
Ensures that the data provided is accurate, complete, consistent, and valid.
Key Activities:
Example:
Fixing email format errors or ensuring all phone numbers have 10 digits before provisioning data to a CRM
system.
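A small sketch of that kind of cleansing rule in pandas; the sample values and the 10-digit rule follow the example above, and everything else is assumed:
```python
# Data quality sketch (illustrative): standardize phone numbers before provisioning
# data to a CRM system, then flag values that still break the 10-digit rule.
import pandas as pd

crm = pd.DataFrame({"Phone_Number": ["98765-43210", "(044) 2345 6789", "12345"]})

# Keep digits only, then mark entries that are not exactly 10 digits long.
crm["Phone_Clean"] = crm["Phone_Number"].str.replace(r"\D", "", regex=True)
crm["Phone_Valid"] = crm["Phone_Clean"].str.len() == 10

print(crm)
```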
2. Data Profiling
Definition:
The process of examining source data to understand its structure, content, relationships, and quality
issues before it is moved or used.
Key Activities:
Example:
Discovering that 25% of “Region” values are missing or that “Customer_ID” has duplicate values before
provisioning the data.
3. Data Enrichment
Definition:
The process of enhancing the dataset by adding new, useful information from internal or external sources.
Key Activities:
Example:
Adding the customer’s income range and age group to a marketing dataset to improve personalization and
segmentation.
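A minimal sketch of this kind of enrichment with pandas; the band boundaries and sample values are illustrative assumptions only:
```python
# Enrichment sketch (illustrative): derive Age_Group and Income_Range bands and
# attach them to a marketing dataset for segmentation.
import pandas as pd

marketing = pd.DataFrame({
    "Customer_ID": ["C1", "C2", "C3"],
    "Age": [23, 41, 67],
    "Annual_Income": [350000, 900000, 1500000],
})

marketing["Age_Group"] = pd.cut(marketing["Age"],
                                bins=[0, 25, 40, 60, 120],
                                labels=["18-25", "26-40", "41-60", "60+"])
marketing["Income_Range"] = pd.cut(marketing["Annual_Income"],
                                   bins=[0, 500000, 1000000, float("inf")],
                                   labels=["Low", "Mid", "High"])
print(marketing)
```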
4. ETL (Extract, Transform, Load)
Definition:
The technical core of provisioning: data is extracted from source systems, transformed into a consistent, usable format, and loaded into the target platform.
Example:
Pulling sales data from SAP, cleaning and aggregating it, then loading it into Power BI for analysis.
5. Metadata Management
Definition:
Managing information about the data (data about data), including source, lineage, format, and
transformations applied.
Importance:
6. Data Security and Access Control
Definition:
Ensuring that data is provisioned securely and only accessible to authorized users/systems.
Key Methods:
Conclusion:
These three components—when supported by strong ETL, metadata, and security practices—ensure that the
right data reaches the right people in the right form.
6. Evaluate the Impact of Data Duplication on the Overall Performance and Efficiency of a
Data Warehouse. [May 2024]
Introduction:
Data duplication refers to the presence of multiple identical or near-identical records in a data warehouse. It
is often a result of inconsistent data entry, lack of validation, or flaws in the ETL process. While it may seem
harmless at first glance, duplication negatively impacts performance, accuracy, and cost in data
warehousing and business intelligence.
1. Increased Storage and Processing Costs
Example:
A sales table with 2 million records where 10% are duplicates leads to 200,000 redundant entries.
2. Slower Query Performance
Example:
A BI dashboard filtering customer orders may take seconds longer to load because of bloated duplicate
entries.
3. Inaccurate Analytics and Reporting
Example:
Duplicate sales records may inflate revenue reports, leading to incorrect forecasting or budget planning.
4. Reduced Trust in Data
• Stakeholders may lose trust in the system if reports contain inconsistencies or incorrect totals.
• Data governance suffers as analysts constantly question data quality.
Example:
Two identical customer records receiving different marketing messages creates confusion and damages
brand trust.
5. Increased ETL Complexity and Overhead
• ETL jobs take longer due to increased volume and complex cleansing logic to detect/remove
duplicates.
• Transformation and validation scripts become more complex, raising maintenance burden.
Mitigation Strategies:
• Data Deduplication in ETL: Use transformation logic to identify and remove duplicates.
• Primary Key Constraints: Enforce uniqueness at the database level.
• Data Validation Rules: Validate incoming records for existing keys or hashes.
• Master Data Management (MDM): Use identity resolution algorithms to unify records.
• Data Profiling Tools: Detect duplicates before loading them into the warehouse.
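As a small illustration of the first strategy, deduplication inside the ETL transformation step might look like this in pandas (key columns and sample rows are assumptions):
```python
# Deduplication sketch (illustrative): drop duplicate sales rows during transformation,
# keeping one row per business key before loading into the warehouse.
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "customer_id": ["C1", "C1", "C2", "C3", "C3"],
    "amount": [100.0, 100.0, 50.0, 75.0, 75.0],
})

exact_duplicates = sales.duplicated().sum()          # rows identical in every column
deduped = sales.drop_duplicates(subset="order_id")   # one row per business key

print(f"Exact duplicates found: {exact_duplicates}")
print(f"Rows removed: {len(sales) - len(deduped)}")
```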
Conclusion:
Data duplication significantly degrades the performance, accuracy, and cost-efficiency of a data
warehouse. Proactively detecting and removing duplicates ensures reliable analytics, faster performance,
and greater business trust in the data.
7. Explain the Concept of Change Data Capture and Its Significance in Data Provisioning.
[May 2024]
Change Data Capture (CDC) is a technique used to identify and capture only the data that has changed
(inserted, updated, or deleted) in a source system and apply those changes to a target system such as a data
warehouse.
Instead of moving entire datasets repeatedly, CDC helps in transferring just the incremental changes,
ensuring efficiency, real-time updates, and data consistency.
Common CDC Techniques:
• Timestamp Columns: Use a last_modified timestamp to filter recent changes.
• Database Triggers: Triggers log every data change in audit or history tables.
• Log-Based CDC: Monitors database transaction logs to identify changes; the most efficient approach.
• Snapshot Comparison: Compare current and previous dataset snapshots to find changes.
Significance in Data Provisioning:
• By loading only the changes, CDC reduces the volume of data to be transferred and processed.
• Helps in real-time or near-real-time updates to data warehouses or BI tools.
• Ensures that the target system is always synchronized with the source.
• Reduces risk of outdated or stale data in reports and dashboards.
• CDC can maintain history of changes, useful for audit trails, GDPR compliance, etc.
• Since CDC processes only incremental data, production systems are not heavily burdened.
• Ideal for high-availability environments where downtime is costly.
Example Scenario:
A retail company updates product prices daily. Instead of reloading the entire products table with
thousands of rows, CDC captures only the 50 updated products and updates them in the data warehouse.
This saves time, reduces system load, and keeps BI dashboards current.
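A minimal sketch of the snapshot-comparison technique from the list above, applied to this pricing scenario; the product rows are invented:
```python
# Snapshot-comparison CDC sketch (illustrative): diff yesterday's and today's product
# snapshots and forward only the changed rows to the warehouse.
import pandas as pd

yesterday = pd.DataFrame({"product_id": [1, 2, 3], "price": [10.0, 20.0, 30.0]})
today = pd.DataFrame({"product_id": [1, 2, 4], "price": [10.0, 25.0, 40.0]})

merged = today.merge(yesterday, on="product_id", how="outer",
                     suffixes=("_new", "_old"), indicator=True)

# Changed rows: inserts/deletes (one-sided matches) or price differences.
changed = merged[(merged["_merge"] != "both") | (merged["price_new"] != merged["price_old"])]
print(changed[["product_id", "price_old", "price_new", "_merge"]])
```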
Conclusion:
Change Data Capture is a crucial element in modern data provisioning, enabling efficient, real-time, and
low-latency data movement. It enhances performance, supports real-time analytics, and ensures data
accuracy across systems—making it a cornerstone for scalable Business Intelligence solutions.
8. Create a Simplified Data Mart and Explain Its Purpose in a BI System. [Both Papers]
A data mart is a subset of a data warehouse that is focused on a specific business area, department, or
subject (like sales, finance, HR, etc.). It contains summarized and relevant data that allows business users
to access and analyze information quickly without needing to scan through the entire data warehouse.
Example: Sales Data Mart
Purpose: To provide insights into sales performance, customer orders, and revenue for decision-making by the Sales Department.
1. Fact Table:
• sale_id: Unique sale identifier
• product_id: Links to the Product dimension
• customer_id: Links to the Customer dimension
• store_id: Links to the Store dimension
• date_id: Links to the Date dimension
• quantity_sold: Number of items sold
• total_amount: Sale amount
2. Dimension Tables:
• Dim_Product
o product_id, product_name, category, price
• Dim_Customer
o customer_id, customer_name, location, loyalty_status
• Dim_Store
o store_id, store_location, store_manager
• Dim_Date
o date_id, date, month, quarter, year
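To show how these tables work together, here is an illustrative pandas sketch of a typical Sales Data Mart query (all sample rows are invented); in practice the same logic would run as a SQL join on the star schema:
```python
# Star-schema query sketch (illustrative): join the sales fact table to two dimensions
# and summarize revenue by product category and quarter.
import pandas as pd

fact_sales = pd.DataFrame({
    "sale_id": [1, 2, 3],
    "product_id": [10, 11, 10],
    "date_id": [20240101, 20240102, 20240405],
    "quantity_sold": [2, 1, 5],
    "total_amount": [200.0, 150.0, 500.0],
})
dim_product = pd.DataFrame({
    "product_id": [10, 11],
    "product_name": ["Keyboard", "Mouse"],
    "category": ["Accessories", "Accessories"],
})
dim_date = pd.DataFrame({
    "date_id": [20240101, 20240102, 20240405],
    "quarter": ["Q1", "Q1", "Q2"],
})

report = (fact_sales.merge(dim_product, on="product_id")
                    .merge(dim_date, on="date_id")
                    .groupby(["category", "quarter"], as_index=False)["total_amount"]
                    .sum())
print(report)
```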
Purpose and Benefits in a BI System:
1. Faster Query Performance
• Because it contains a focused subset of data, queries run faster than on a full-scale data warehouse.
2. Department-Specific Analytics
• Tailored to a specific business function (e.g., Sales), enabling custom dashboards and reports.
3. Self-Service Access
• Business teams can explore and analyze data without technical help or deep SQL knowledge.
4. Reduced Load on the Data Warehouse
• Analysts work on smaller datasets, which lowers query load on the main data warehouse.
Conclusion:
A simplified data mart like a "Sales Data Mart" serves as a powerful, domain-specific data source for
business intelligence applications. It enables faster, more relevant, and more accessible data analysis,
empowering departments to make timely and informed decisions.
9. Compare and Contrast the Data Provisioning Methods of Different BI Tools such as
Tableau, Power BI, and Dundas BI. [Nov 2024]
Introduction:
Data provisioning in BI tools refers to the methods and processes used to connect, extract, transform,
and load data into the BI environment for analysis and visualization. Different BI tools offer varied
capabilities and approaches based on integration, performance, flexibility, and user interface.
Comparison Table:
Key Takeaways:
Tableau:
• Flexible connectivity: supports both live connections and in-memory extracts for data access.
• Visualization-first: heavier data preparation is usually done upstream (e.g., in Tableau Prep or the source database).
Power BI:
• Excellent ETL and modeling capabilities via Power Query and DAX.
• Ideal for organizations already within the Microsoft ecosystem.
• Better for self-service BI where business users handle their own data.
Dundas BI:
• Provides built-in ETL and data preparation capabilities within the platform itself.
• Highly customizable and well suited to embedded, end-to-end BI deployments.
Conclusion:
Each tool provisions data differently: Power BI emphasizes self-service ETL within the Microsoft ecosystem, Tableau emphasizes flexible connectivity and visualization, and Dundas BI emphasizes built-in, end-to-end data preparation. The right choice depends on the existing ecosystem, data volumes, and user skill sets.
The overall data provisioning process can be described as a flow through the following key stages and components in ETL and BI systems:
Explanation of Key Stages:
1. Source Systems: These are operational databases, external files, cloud apps, or any systems holding
business data.
2. Data Extraction: Extract data in either full loads or incremental loads (using CDC or timestamps).
3. Data Staging: Acts as a buffer or holding area where raw data resides temporarily.
4. Data Transformation: Core ETL work occurs here: cleaning inconsistent data, applying business
rules, joining datasets, enriching data, and performing lookups.
5. Data Loading: The cleaned, transformed data is loaded into data warehouses or smaller data marts
optimized for querying.
6. Data Provisioning: Data is provisioned through data marts, cubes, or direct connections, made ready
for consumption by BI tools.
7. BI & Analytics: Final stage where users interact with data via dashboards, reports, and analytics to
support decision-making.