MultiDimensional Data Model
OLAP stands for Online Analytical Processing. It is a software technology that allows users to
analyze information from multiple database systems at the same time. It is based on the multidimensional
data model and allows the user to query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP
databases are divided into one or more cubes, and these cubes are known as hypercubes.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In drill-down operation, the less detailed data is converted into highly detailed
data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in overview section, the drill down operation is performed by moving
down in the concept hierarchy of Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP
cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the number of dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in the concept
hierarchy of the Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the
cube given in the overview section, a sub-cube is selected by selecting the following dimensions
with the criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
4. Slice: It selects a single value along one dimension of the OLAP cube, which results in the
creation of a new sub-cube. In the cube given in the overview section, a slice is performed on
the dimension Time = “Q1”.
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view
of the representation. In the sub-cube obtained after the slice operation, performing a pivot
operation gives a new view of it.
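The five operations above can be sketched on a small, made-up sales table. The snippet below uses pandas purely as an illustration; the column names and figures are assumptions, not taken from the cube in the overview section.

# Minimal sketch of OLAP-style operations on a toy sales table (pandas).
import pandas as pd

data = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata", "Delhi", "Kolkata"],
    "Time":     ["Q1",    "Q2",    "Q1",      "Q2",      "Q1",    "Q1"],
    "Item":     ["Car",   "Car",   "Bus",     "Bus",     "Bus",   "Car"],
    "Sales":    [120,     150,     80,        95,        60,      110],
})

# Roll-up: aggregate upward by dropping the Time and Item dimensions.
roll_up = data.groupby("Location")["Sales"].sum()

# Slice: fix a single value on one dimension (Time = "Q1").
slice_q1 = data[data["Time"] == "Q1"]

# Dice: select a sub-cube using criteria on two or more dimensions.
dice = data[data["Location"].isin(["Delhi", "Kolkata"])
            & data["Time"].isin(["Q1", "Q2"])
            & data["Item"].isin(["Car", "Bus"])]

# Pivot: rotate the view, e.g. Location as rows and Time as columns.
pivot = data.pivot_table(values="Sales", index="Location",
                         columns="Time", aggfunc="sum")

print(roll_up, slice_q1, dice, pivot, sep="\n\n")

Drill-down would go the other way (for example, splitting Time from Quarter into Month), which needs finer-grained data than this toy table stores.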
Comparison of MOLAP, ROLAP, and HOLAP servers:

Query Performance:
MOLAP (Multidimensional): Very fast due to pre-aggregation and indexing
ROLAP (Relational): Slower, as queries are translated into SQL
HOLAP (Hybrid): Moderate; fast for summary data, slower for detail

Data Volume Handling:
MOLAP: Low to medium volumes
ROLAP: Large volumes
HOLAP: Medium to large volumes

Best Use Case:
MOLAP: Fast analysis on static data with limited size
ROLAP: Ad-hoc queries and large data environments
HOLAP: Mixed environments needing flexibility and performance
Cube materialization and computation techniques:

Materialization: Precomputes and stores only selected cuboids based on query needs.
Full Materialization: All cuboids are precomputed and stored; fast access but costly storage.
Partial Materialization: Only the most-used cuboids are precomputed; others are computed on the fly.
Iceberg Cubes: Only cuboids that meet a threshold (e.g., sales > ₹10,000) are materialized.
Shell Fragment Cubes: A minimal subset of cuboids is selected such that the others can be computed from them.
Bottom-Up Computation: Starts with the base cuboid and computes higher-level cuboids progressively.
Top-Down Computation: Starts with the apex cuboid and drills down into detailed levels as needed.
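To make the iceberg-cube idea concrete, the sketch below materializes only the aggregate cells whose total sales meet a minimum threshold. The data, the two dimensions, and the threshold value are assumptions chosen for illustration.

# Minimal iceberg-cube sketch: keep only aggregate cells above a threshold.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "City":  ["Delhi", "Delhi", "Kolkata", "Mumbai", "Mumbai"],
    "Item":  ["Car",   "Bus",   "Car",     "Car",    "Bus"],
    "Sales": [12000,   4000,    9000,      15000,    11000],
})

THRESHOLD = 10000                  # iceberg condition: total sales > 10,000
dimensions = ["City", "Item"]

iceberg_cells = {}
# Enumerate every non-empty cuboid (subset of dimensions), smallest groupings first.
for k in range(1, len(dimensions) + 1):
    for dims in combinations(dimensions, k):
        totals = sales.groupby(list(dims))["Sales"].sum()
        iceberg_cells[dims] = totals[totals > THRESHOLD]   # keep qualifying cells only

for dims, cells in iceberg_cells.items():
    print(dims, cells.to_dict())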
MODULE-II:
1. Data Mining Functionalities
Data mining refers to the process of discovering patterns, trends, and useful information from large sets
of data. The functionalities of data mining can be broadly categorized into descriptive and predictive
tasks.
1.1 Classification
Classification is a predictive data mining task. It assigns data into predefined classes or categories. For
example, in a banking dataset, customers may be classified as “low risk” or “high risk” based on attributes
like income, credit history, and account activity. The process involves two steps: training a model using
historical data and then testing or applying this model to new data to predict the class.
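A minimal sketch of this two-step process, using a decision tree from scikit-learn; the banking attributes, figures, and labels below are invented for illustration.

# Step 1: train a classifier on labelled historical data; step 2: predict for new data.
from sklearn.tree import DecisionTreeClassifier

# Each customer: [income in thousands, years of credit history]
X_train = [[25, 1], [40, 3], [85, 10], [95, 12], [30, 2], [70, 8]]
y_train = ["high risk", "high risk", "low risk", "low risk", "high risk", "low risk"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)                 # training step

new_customer = [[60, 5]]
print(model.predict(new_customer))          # prediction step, e.g. ['low risk']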
1.2 Clustering
Clustering is a descriptive technique where data is grouped into clusters or groups based on similarity, but
unlike classification, the classes are not predefined. For example, a retail store can use clustering to group
customers with similar buying behavior. Common clustering algorithms include K-Means and DBSCAN. It
helps in customer segmentation and market analysis.
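A small K-Means sketch along the same lines; the two features (annual spend, visits per month) and the choice of k = 2 are assumptions.

# Group customers with similar buying behaviour into k clusters.
from sklearn.cluster import KMeans

customers = [[200, 2], [220, 3], [250, 2],          # low spend, few visits
             [1500, 12], [1600, 15], [1450, 11]]    # high spend, frequent visits

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster id assigned to each customer
print(kmeans.cluster_centers_)  # the "average" customer of each cluster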
1.3 Association Rule Mining
This functionality finds interesting relationships or associations between variables in large databases. A
common example is market basket analysis, where rules like {Bread} → {Butter} are found, meaning
customers who buy bread are likely to buy butter too. Association rules are typically evaluated using
support, confidence, and lift measures.
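These measures can be computed directly from transaction data. The sketch below evaluates the rule {Bread} -> {Butter} on a tiny, made-up set of transactions.

# Support, confidence and lift for {Bread} -> {Butter} (pure Python).
transactions = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Jam"},
    {"Milk", "Jam"},
    {"Bread", "Butter", "Jam"},
]

n = len(transactions)
p_bread        = sum(1 for t in transactions if "Bread" in t) / n
p_butter       = sum(1 for t in transactions if "Butter" in t) / n
p_bread_butter = sum(1 for t in transactions if {"Bread", "Butter"} <= t) / n

support    = p_bread_butter              # P(Bread and Butter)
confidence = p_bread_butter / p_bread    # P(Butter | Bread)
lift       = confidence / p_butter       # > 1 indicates a positive association

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")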
1.4 Prediction
Prediction is another form of the predictive task where future values are forecast based on existing data.
For instance, predicting sales figures for the next quarter using historical sales data. It often uses
techniques from regression analysis and time series forecasting.
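A minimal regression-based forecasting sketch; the quarterly sales figures are invented.

# Fit a simple trend line to past quarterly sales and forecast the next quarter.
from sklearn.linear_model import LinearRegression

quarters = [[1], [2], [3], [4], [5], [6]]       # quarter index
sales    = [100, 110, 125, 130, 145, 155]       # past sales figures

model = LinearRegression()
model.fit(quarters, sales)

print(model.predict([[7]]))    # forecast for quarter 7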
1.5 Outlier Detection (Anomaly Detection)
This functionality identifies data points that are significantly different from the majority of data. These
outliers may represent errors, fraud, or rare events. For example, in a credit card transaction dataset, a
sudden large purchase in a foreign country could be flagged as an anomaly.
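A simple z-score rule is one way to flag such points. The transaction amounts and the |z| > 2.5 cut-off below are illustrative assumptions.

# Flag transactions whose amount is far from the mean (z-score rule).
from statistics import mean, stdev

amounts = [40, 55, 38, 60, 47, 52, 45, 58, 43, 5000]   # one suspicious purchase

mu, sigma = mean(amounts), stdev(amounts)
outliers = [a for a in amounts if abs((a - mu) / sigma) > 2.5]
print(outliers)    # [5000]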
1.6 Summarization
Summarization is the process of providing a compact and concise description of a dataset. This includes
computing descriptive statistics like mean, median, mode, or more complex summaries such as generating
summary reports, graphs, and visualization. It is useful in understanding the general properties of the data
before deeper analysis.
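A minimal summarization sketch over one made-up attribute, using only the standard library.

# Descriptive statistics as a compact summary of one attribute.
from statistics import mean, median, mode

ages = [23, 25, 25, 31, 35, 35, 35, 40, 47, 52]

print("count :", len(ages))
print("mean  :", mean(ages))
print("median:", median(ages))
print("mode  :", mode(ages))
print("range :", min(ages), "to", max(ages))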
2. Steps in the Data Mining Process
The data mining process involves multiple sequential steps, from understanding the business problem to
deploying a working model. The major steps are as follows:
2.1 Data Cleaning
This is the first and most critical step where noisy, incomplete, and inconsistent data is corrected or
removed. For example, missing values are filled using mean imputation, and outliers are corrected or
removed. This step ensures the quality and accuracy of the final data used in mining.
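A small cleaning sketch; the column names, the missing value, and the rule for spotting the noisy record are assumptions.

# Mean imputation of a missing value and removal of an inconsistent record.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age":         [34, None, 29, 41],                 # one missing value
    "income":      [52000, 61000, -999999, 58000],     # one impossible value
})

df["age"] = df["age"].fillna(df["age"].mean())   # fill missing age with the mean
df = df[df["income"] > 0]                        # drop the inconsistent record
print(df)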
2.2 Data Integration
In this step, data from multiple heterogeneous sources (like databases, files, and web data) are combined
into a single coherent data store, such as a data warehouse. Integration is challenging due to differences in
data formats, naming conventions, and measurement units. Proper integration is vital to ensure
completeness of the data.
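A minimal integration sketch combining two assumed sources that name their key column differently.

# Reconcile naming differences, then merge the sources into one coherent table.
import pandas as pd

crm   = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
sales = pd.DataFrame({"customer": [1, 2, 4], "amount_inr": [1200, 800, 500]})

sales = sales.rename(columns={"customer": "cust_id"})     # resolve naming difference
integrated = crm.merge(sales, on="cust_id", how="outer")  # combine both sources
print(integrated)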
2.3 Data Selection
Only relevant data is selected from the database based on the mining task. For instance, if the task is
customer churn prediction, only fields like customer ID, usage history, and service complaints may be
selected. This reduces the size and improves the focus of the mining process.
2.4 Data Transformation
Data is transformed into appropriate formats for mining. This includes normalization (scaling values to a
range), aggregation (summarizing), encoding categorical variables, and creating new attributes (feature
engineering). Transformation helps improve the efficiency and accuracy of mining algorithms.
6. Normalization, smoothing, aggregation, generalization.
Data normalization involves scaling the values of a data attribute into a specified range. Techniques that are
commonly used for normalization are:

Min-Max Normalization:
This transforms the original data linearly. Suppose min_A is the minimum and max_A is the maximum value
of an attribute A, v is the original value, and v' is the value obtained after normalizing v. Then:
v' = (v - min_A) / (max_A - min_A)
This maps the values of A into the range [0, 1].

Z-Score Normalization:
In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on
the mean of A and its standard deviation. A value v of attribute A is normalized to v' using the formula:
v' = (v - mean(A)) / (standard_deviation(A))
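The two formulas translate directly into code; the attribute values below are assumed.

# Min-max and z-score normalization of one attribute (standard library only).
from statistics import mean, stdev

values = [200, 300, 400, 600, 1000]

min_a, max_a = min(values), max(values)
min_max = [(v - min_a) / (max_a - min_a) for v in values]   # mapped into [0, 1]

mu, sigma = mean(values), stdev(values)
z_scores = [(v - mu) / sigma for v in values]               # mean 0, std. dev. 1

print(min_max)
print(z_scores)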
2.5 Data Mining
This is the core step where intelligent methods are applied to extract patterns and knowledge from the
prepared data. Techniques such as classification, clustering, association rule mining, or regression are
applied depending on the objective.
2.6 Pattern Evaluation
Not all patterns discovered are useful or interesting. Pattern evaluation involves identifying truly useful and
valid patterns based on measures like accuracy, novelty, utility, and interestingness. For example, a pattern
with high confidence and support in market basket analysis is more valuable.
2.7 Knowledge Presentation
The final patterns are presented to users in a comprehensible form using reports, tables, graphs, or
visualizations. Visualization techniques help users understand the insights easily and support decision-
making processes.
1. Data Sources (WWW, Database, Data Warehouse, Other Repositories)
What happens: Raw data is collected from various sources like the internet, relational databases,
large data warehouses, or other structured/unstructured sources.
Goal: To gather as much relevant data as possible for mining.
3. Database Server
What happens: Acts as a central point that stores the cleaned and integrated data in an organized
format.
Goal: Provides fast access and management of data for mining processes.
7. Front End
What happens: The user interacts with the system here — asking questions, requesting pattern
visualizations, or triggering new mining tasks.
Goal: Provides control and interaction point for end-users like analysts or decision-makers.
8. Knowledge Base
What happens: Stores domain knowledge, constraints, past mining results, and user preferences.
Goal: Assists the mining process by guiding algorithms and evaluation, and helps improve future
data mining tasks.
📘 Bayes' Theorem
Bayes’ Theorem gives the posterior probability of a hypothesis H given observed data (evidence) X:
P(H | X) = [P(X | H) * P(H)] / P(X)
Here P(H) is the prior probability of H, P(X | H) is the likelihood of observing X when H holds, and P(X) is
the overall probability of the evidence.
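A quick numeric sketch with assumed probabilities (a spam-filtering example) shows how the terms combine.

# Probability that a message is spam given that it contains the word "free".
p_spam      = 0.20    # prior P(H)
p_free_spam = 0.50    # likelihood P(X | H)
p_free_ham  = 0.05    # P(X | not H)

p_free = p_free_spam * p_spam + p_free_ham * (1 - p_spam)    # evidence P(X)
p_spam_given_free = p_free_spam * p_spam / p_free            # posterior P(H | X)
print(round(p_spam_given_free, 3))    # about 0.714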
MODULE – III:
1. Definition of Outliers
An outlier is a data object that deviates significantly from the rest of the dataset. It may not conform to the
general behavior or pattern found in the majority of data points. Outliers can arise due to measurement
errors, data entry issues, or genuinely rare but significant events.
Outliers are important because they can influence the results of statistical analyses or machine learning
models. Ignoring outliers might lead to misleading conclusions, while properly analyzing them can uncover
hidden insights or anomalies.
3. Types of Outliers
Outliers can be categorized based on their behavior and context. Common types include:
Global Outliers: These are data points that differ significantly from all other data in the dataset. For
example, in a dataset of student scores between 30 and 90, a score of 2 would be a global outlier.
Contextual (Conditional) Outliers: These data points are considered outliers in a specific context
but not globally. For instance, a temperature of 25°C may be normal in spring but unusual in winter
for a particular location.
Collective Outliers: A group of data instances may be considered outliers when analyzed together,
although each data point might appear normal individually. For example, a sequence of unusual
transactions in a short period might signal coordinated fraud.
4. Causes of Outliers
Outliers can occur due to various reasons:
Human or Instrument Errors: Errors in data collection or entry may lead to inaccurate data values,
such as typing mistakes or faulty sensor readings.
Experimental Variability: In scientific experiments, natural fluctuations in variables may cause data
points to deviate from expected results.
Novel Events or Rare Phenomena: Some outliers represent genuinely rare events that carry
valuable information. For example, a sudden spike in website traffic may indicate a viral event or
marketing success.
2.5 Challenges
Data Volume: Multimedia data requires high storage and processing power.
Semantic Gap: Difference between low-level features (color, pixel) and high-level understanding
(object, scene).
Noise and Quality: Variations in multimedia formats and quality can affect accuracy.
📝 3. Text Mining
Text mining, also called text data mining or text analytics, is the process of deriving high-quality
information from unstructured text data. Unlike structured data stored in tables, text data is found in
emails, books, articles, reviews, and social media posts.
Application Example: Disaster mapping (spatial data mining), Face recognition (multimedia data mining),
Sentiment analysis (text mining)
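A tiny text-mining sketch (made-up reviews, standard library only) that tokenizes raw text and counts term frequencies, a typical first step before tasks such as sentiment analysis.

# Tokenize a few short reviews and count how often each term occurs.
from collections import Counter
import re

reviews = [
    "Great product, really great battery life",
    "Terrible screen, battery life is poor",
    "Great value and great service",
]

tokens = []
for text in reviews:
    tokens.extend(re.findall(r"[a-z]+", text.lower()))   # crude tokenization

term_freq = Counter(tokens)
print(term_freq.most_common(5))   # the most frequent terms across all reviews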