DWDM202
Database System: Primarily designed for transactional processing (OLTP - Online Transaction
Processing). It is optimized for fast and efficient real-time operations like inserting,
updating, deleting, and retrieving small sets of data. Uses normalized data structures to
eliminate redundancy and ensure data integrity. This makes transactional operations
efficient. Typically stores operational data from a single application or system. Processes
real-time operational data that constantly changes. Used in applications like banking
systems, e-commerce, healthcare records, and CRM where real-time transactions are
essential. Example technologies: MySQL, PostgreSQL, Oracle DB.
Data Warehouse- Designed for analytical processing (OLAP - Online Analytical Processing).
It is optimized for complex queries, reporting, and data analysis over large datasets. Uses
denormalized data structures (star schema, snowflake schema) to optimize read-heavy
operations and speed up complex queries. Integrates data from multiple sources
(databases, external files, APIs) to create a unified data repository for analysis. Processes
historical data for long-term storage and analysis, often in batch mode. Used in business
intelligence, reporting, trend analysis, and predictive analytics where historical and large-
scale data analysis is required. Example technologies: Amazon Redshift, Google BigQuery, Snowflake.
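To make the star-schema idea concrete, the sketch below (using pandas, with invented table and column names) joins a small fact table to two dimension tables and aggregates it the way an analytical query would:

```python
import pandas as pd

# Hypothetical star schema: one fact table plus two dimension tables.
fact_sales = pd.DataFrame({
    "date_key":    [1, 1, 2, 2],
    "product_key": [10, 11, 10, 11],
    "amount":      [100.0, 250.0, 80.0, 300.0],
})
dim_date = pd.DataFrame({"date_key": [1, 2], "month": ["Jan", "Feb"]})
dim_product = pd.DataFrame({"product_key": [10, 11], "category": ["Toys", "Books"]})

# OLAP-style query: join the fact table to its dimensions, then aggregate.
cube = (fact_sales
        .merge(dim_date, on="date_key")
        .merge(dim_product, on="product_key")
        .groupby(["month", "category"])["amount"].sum())
print(cube)
```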
Components of data warehousing- 1. Data Sources - Data sources provide raw data to the warehouse. These
sources can include operational databases (OLTP systems), external data sources,
spreadsheets, cloud services, and log files. The data may come from different formats like
SQL databases, APIs, or flat files, requiring transformation before storage. 2. ETL (Extract,
Transform, Load) Process- The ETL process is responsible for extracting data from various
sources, transforming it into a suitable format, and loading it into the data warehouse. The
transformation step includes cleaning, aggregating, and normalizing data to ensure
consistency and accuracy. 3. Data Staging Area - This is a temporary storage area where raw
data is held before transformation. It allows for data validation, cleansing, and
preprocessing to avoid loading erroneous or redundant data into the warehouse. 4. Business
Intelligence (BI) Tools - BI tools provide a user-friendly interface for querying, reporting, data
visualization, and dashboard creation. These tools help business users generate insights
from data warehouse queries without needing technical expertise.
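As a rough illustration of the staging and validation step described above, the sketch below (hypothetical column names, pandas assumed) rejects duplicate and incomplete rows before they would be loaded into the warehouse:

```python
import pandas as pd

# Hypothetical raw extract sitting in the staging area.
staged = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
})

# Basic cleansing: drop exact duplicates and rows missing required fields.
validated = staged.drop_duplicates().dropna(subset=["email"])

# Only validated rows would move on to the load step.
print(f"{len(staged) - len(validated)} rows rejected in staging")
```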
NEED- 1) Business User: Business users require a data warehouse to view summarized data
from the past. Since these users are non-technical, the data may be presented to them in
an elementary form. 2) Store historical data: A data warehouse is required to store
time-variant data from the past, and this data is used for various purposes. 3) Make
strategic decisions: Some strategies may depend on the data in the data warehouse, so the
data warehouse contributes to making strategic decisions. 4) For data consistency and
quality: By bringing data from different sources into a common place, the user can
effectively enforce uniformity and consistency in the data. 5) Quick response time: The
data warehouse has to be ready for somewhat unexpected loads and types of queries, which
demands a significant degree of flexibility and a quick response time.
Data Mart: A subset of a data warehouse focused on a specific business unit, department,
or function (e.g., sales, marketing, finance). Usually derived from a single source or a subset
of the data warehouse. Uses a bottom-up approach—individual data marts are built first,
and they may later contribute to a data warehouse. Data marts focus on a specific subject
area, making it easier for users to access and analyze relevant data without having to
navigate the entire data warehouse. Processes more straightforward queries related to
departmental needs. Example: a sales data mart storing only sales-related data for analysis.
Features- Faster Access, Subject-Oriented, Improved Performance
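A data mart can be pictured as a filtered slice of the warehouse. The sketch below, with made-up columns, derives a sales-only mart from a wider warehouse table:

```python
import pandas as pd

# Hypothetical warehouse table mixing several subject areas.
warehouse = pd.DataFrame({
    "department": ["sales", "hr", "sales", "finance"],
    "metric":     ["revenue", "headcount", "units", "budget"],
    "value":      [1200.0, 45, 300, 9000.0],
})

# The sales data mart keeps only the rows relevant to the sales department.
sales_mart = warehouse[warehouse["department"] == "sales"].reset_index(drop=True)
print(sales_mart)
```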
Bottom Tier (Data Source Layer) – Data Extraction and Storage − The bottom tier of the
architecture is the data warehouse database server, which is typically a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier; they perform
the extract, clean, load, and refresh functions. Examples: relational databases (Oracle, SQL
Server) and data warehouse platforms (Amazon Redshift, Google BigQuery).
Middle Tier (Data Processing Layer) – OLAP & Business Logic − The middle tier contains the
OLAP server, which can be implemented in either of the following ways: as Relational OLAP
(ROLAP), an extended relational database management system that maps operations on
multidimensional data to standard relational operations; or as Multidimensional OLAP
(MOLAP), a model that directly implements multidimensional data and operations. Examples:
application servers, ETL tools (Apache NiFi, Talend).
Top Tier (Presentation Layer) – Reporting & Visualization − This tier is the front-end client
layer. It holds the query tools, reporting tools, analysis tools, and data mining tools
(e.g., Tableau, Power BI, Looker).
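The ROLAP idea of mapping multidimensional operations onto relational ones can be illustrated with a simple pivot: a region-by-quarter "cube" view is produced from a flat relational table (names are illustrative only):

```python
import pandas as pd

# Flat, relational-style table of measures.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [500.0, 700.0, 450.0, 620.0],
})

# A multidimensional "slice" (region x quarter) expressed as an ordinary
# group-by/pivot over the relational table, which is the ROLAP approach.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum")
print(cube)
```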
Metadata is data about the data or documentation about the information which is required
by the users. In data warehousing, metadata is one of the essential aspects. Metadata
includes the following:• The location and descriptions of warehouse systems and
components.• Names, definitions, structures, and content of data-warehouse and end-
users views. •Identification of authoritative data sources. •Integration and transformation
rules used to populate data. It is used for building, maintaining, managing, and using the
data warehouses. Metadata allows users to understand the content and to find the data they
need. Necessary- First, it acts as the glue that links all parts of the data warehouse. Next,
it provides information about the contents and structures to the developers. Finally, it opens
the doors to the end-users and makes the contents recognizable in their own terms. TYPES-
•Operational Metadata: contains all of the information about the operational data sources.
•Extraction and Transformation Metadata: contains information about all the data
transformations that take place in the data staging area. •End-User Metadata: allows
end-users to use their business terminology and look for information in the ways in which
they usually think of the business.
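One metadata entry can be thought of as a structured record about a single warehouse column. The dictionary below is a hypothetical sketch that mixes operational, transformation, and end-user metadata; the field names are invented, not taken from any specific standard:

```python
# Hypothetical metadata record for one warehouse column.
metadata_entry = {
    "warehouse_table": "fact_sales",
    "column": "net_revenue",
    "business_name": "Net Revenue",                          # end-user metadata
    "source_system": "orders_db.order_items",                # operational metadata
    "transformation_rule": "gross_amount - discount - tax",  # ETL metadata
    "refresh_schedule": "daily batch",
}

for field, value in metadata_entry.items():
    print(f"{field}: {value}")
```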
ETL stands for Extract, Transform, Load, and it is a crucial process in data warehousing and
data integration. ETL is used to collect data from various sources, transform it into a suitable
format, and load it into a target database or data warehouse for analysis and reporting. The
ETL process is essential for organizations that need to consolidate data from multiple
sources for analysis and reporting. Extract: The extraction phase involves retrieving data
from various source systems. These sources can be databases, flat files, APIs, cloud services,
or even web scraping. Transform: The transformation phase involves cleaning, enriching,
and converting the extracted data into a format suitable for analysis. This step is crucial for
ensuring data quality and consistency; common transformations include data cleaning, enrichment, and formatting. Load: The
loading phase involves writing the transformed data into the target database or data
warehouse. This is where the data becomes available for analysis and reporting.
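A minimal end-to-end ETL sketch, assuming a CSV extract with order_id, order_date, and amount columns and a local SQLite database as the target (all file, table, and column names are made up):

```python
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: read raw rows from a source file (could equally be an API or DB).
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean and standardize before loading.
    df = df.drop_duplicates().dropna(subset=["order_id"])
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["amount"] = df["amount"].round(2)
    return df

def load(df: pd.DataFrame, db_path: str) -> None:
    # Load: write the cleaned rows into the target warehouse table.
    with sqlite3.connect(db_path) as conn:
        df.to_sql("fact_orders", conn, if_exists="append", index=False)

if __name__ == "__main__":
    load(transform(extract("orders.csv")), "warehouse.db")
```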
Data mining is the process of extracting meaningful patterns, trends, and insights from large
datasets using computational techniques. It transforms raw data into valuable insights by
leveraging algorithms, tools, and domain expertise. It works across structured,
unstructured, and semi-structured data to uncover patterns like associations, classifications,
and anomalies. Data mining automates the discovery of non-obvious patterns that are too
complex for manual detection. For example, retailers use data mining to identify purchasing
trends, while healthcare providers use it to predict disease outbreaks.
Data Reduction: This involves reducing the size of the dataset while preserving the
important information. It can be achieved through techniques such as feature selection and
feature extraction. Feature Selection – Identifying and keeping only the most relevant
features. Sampling – Selecting a representative subset of data to reduce computational
complexity. Data Transformation: This involves converting the data into a suitable format
for analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization – Scaling numerical data to a common
range (e.g., 0-1) to prevent bias. Standardization – Transforming data to have zero mean and
unit variance. Data discretization is the process of converting continuous data into discrete
categories or intervals. This technique is particularly useful in data mining and machine learning, as
many algorithms perform better with categorical data. Discretization helps simplify the analysis,
improve interpretability, and reduce the impact of noise in the data. Converting Continuous Data to
Categorical – Grouping numerical values into bins (e.g., age groups: 0–18, 19–35, 36–60).
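The three transformation techniques above can be sketched in a few lines of pandas; the ages are arbitrary sample values and the bins follow the age groups mentioned above:

```python
import pandas as pd

ages = pd.Series([5, 17, 24, 41, 58, 60])

# Min-max normalization: rescale to the 0-1 range.
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Standardization: zero mean, unit variance (z-scores).
standardized = (ages - ages.mean()) / ages.std()

# Discretization: bin continuous ages into the categories mentioned above.
age_groups = pd.cut(ages, bins=[0, 18, 35, 60],
                    labels=["0-18", "19-35", "36-60"])

print(pd.DataFrame({"age": ages, "minmax": normalized,
                    "zscore": standardized, "group": age_groups}))
```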
Issues- Privacy and Ethics: Data mining often involves analyzing sensitive information, such
as personal health records, financial data, or user behavior. This raises significant privacy
and ethical concerns. Scalability: As the volume of data generated continues to grow
exponentially, data mining faces scalability challenges. Processing massive datasets, such as
those generated by social media platforms, IoT devices, or e-commerce transactions,
requires substantial computational resources. Scalability & Performance - Handling Large
Datasets: Processing petabytes of data requires high computational power. Real-time Data
Mining: Challenges in extracting insights from streaming data efficiently. Overfitting occurs
when a model learns the training data too well, capturing noise and outliers rather than the
underlying patterns. As a result, while the model may perform exceptionally well on the
training dataset, it fails to generalize to new, unseen data, leading to poor performance in
real-world applications. Interpretability: As data mining models become more complex,
particularly with the rise of deep learning, interpretability becomes a significant challenge.
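Overfitting can be demonstrated by comparing training and test accuracy. The sketch below, assuming scikit-learn is available, fits an unconstrained decision tree to noisy synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic, somewhat noisy classification data (20% of labels are flipped).
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training set, including its noise.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, tree.predict(X_train)))  # typically ~1.0
print("test accuracy: ", accuracy_score(y_test, tree.predict(X_test)))    # noticeably lower
```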
Data Mining applications- Market Basket Analysis- Helps businesses understand what
products are frequently bought together. Ex: A supermarket finds that customers who buy
diapers also buy baby wipes (a small support/confidence sketch follows at the end of this section). Disease Prediction and Diagnosis- Machine learning models
analyze patient data to predict diseases like cancer or diabetes. Ex: AI-assisted radiology for
detecting tumors in X-rays. Risk Management - Predicts the likelihood of loan defaults or
stock market crashes. Ex: Banks use credit scoring models to assess loan applicants. Social
Media and Internet - Sentiment Analysis: Analyzes user opinions from social media posts,
reviews, and comments. Ex: Twitter sentiment analysis to gauge public opinion on political
issues. Transportation: A diversified transportation company with a large direct sales force
can apply data mining to identify the best prospects for its services. Similarly, a large
consumer goods organization can apply data mining to improve its sales process to
retailers.
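For the market basket example above, association rules are usually scored by support and confidence. The sketch below computes both for the hypothetical rule {diapers} -> {baby wipes} over made-up transactions:

```python
# Hypothetical transactions, one set of purchased items per basket.
transactions = [
    {"diapers", "baby wipes", "milk"},
    {"diapers", "baby wipes"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

both = sum(1 for t in transactions if {"diapers", "baby wipes"} <= t)
diapers = sum(1 for t in transactions if "diapers" in t)

# Support: fraction of all baskets containing both items.
support = both / len(transactions)
# Confidence: of the baskets with diapers, how many also contain baby wipes.
confidence = both / diapers

print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.50, 0.67
```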