Unit-5 DMDW
Concept Description:
o Handles complex data types and aggregations of attributes.
o Typically, the process is more automated in nature.
o It provides summarized data that helps in understanding trends and
features in the dataset.
OLAP (Online Analytical Processing):
o Primarily deals with a limited number of dimensions and measures (e.g.,
time, region, product type).
o It is a more user-controlled process where users can interactively
explore data, slice and dice data, and perform different types of
aggregation.
o OLAP allows multidimensional analysis, like drilling down into data to
explore more details or rolling up to see summary-level information.
Let's imagine we have a dataset for sales transactions at a retail store. We can apply
concept description techniques as follows:
Characterization: summarize the general features of the target data, for example the typical transaction amount and the most frequently purchased product categories.
Comparison (Discrimination): contrast the features of one class against another, for example weekday sales versus weekend sales, or high-spending versus low-spending customers.
Key Differences:
1. Complexity: Concept description can handle more complex data and generate
insights automatically, while OLAP is more focused on simple, interactive data
exploration.
2. User Control: OLAP gives users control to manipulate data, while concept
description is more about generating summaries and comparisons.
Data Cube is a technique used in OLAP (Online Analytical Processing), which involves
organizing data in a multidimensional array. Data is stored in a cube-like structure
with dimensions and measures.
Key Operations: roll-up (aggregate to a coarser level), drill-down (move to a finer level), slice (fix one dimension), and dice (select a sub-cube over several dimensions).
Example:
Consider a retail company tracking sales across different regions, products, and time
periods (e.g., daily, weekly). The data can be represented as a 3D cube where each axis is one dimension (Time, Region, Product) and each cell stores a measure such as total sales.
By applying roll-up, we can aggregate the data from daily sales to monthly sales,
whereas drill-down would take us from monthly to weekly or daily sales.
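The roll-up and drill-down operations described above can be imitated with a simple group-by aggregation in Python. The pandas sketch below uses a hypothetical DataFrame with date, region, product and sales columns; the column names and values are illustrative, not taken from the example.

import pandas as pd

# Hypothetical daily sales records (columns and values are illustrative).
sales = pd.DataFrame({
    "date":    pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"]),
    "region":  ["North", "South", "North", "South"],
    "product": ["TV", "TV", "Phone", "Phone"],
    "sales":   [1200, 800, 950, 700],
})

# Roll-up: aggregate daily sales to monthly totals per region and product.
monthly = (
    sales.assign(month=sales["date"].dt.to_period("M"))
         .groupby(["month", "region", "product"], as_index=False)["sales"]
         .sum()
)
print(monthly)

# Drill-down: return to the finer daily granularity for one month.
jan_daily = sales[sales["date"].dt.to_period("M") == pd.Period("2024-01", freq="M")]
print(jan_daily)

Roll-up here is just aggregation along the Time dimension to a coarser level; drill-down simply returns to the finer-grained rows.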
Limitations of the data cube approach:
Data Type Restrictions: It can handle only simple non-numeric data types for
dimensions and numeric data for measures.
Lack of Intelligent Analysis: The cube does not automatically decide which
dimensions should be used or at what levels generalization should occur.
Attribute-Oriented Induction
Imagine a database for graduate students at a university, where you want to mine
the general characteristics of science students. The relevant data might include attributes such as name, gender, major, birth_place, birth_date, residence, phone#, and gpa.
The goal is to generalize attributes like GPA (e.g., "High", "Medium", "Low") and
Major (e.g., "Science", "Arts", "Engineering") to identify patterns, such as:
Science students tend to have higher GPAs and are predominantly male.
Basic Algorithm of Attribute-Oriented Induction:
1. InitialRel: collect the task-relevant data (the initial working relation) using a relational query.
2. PreGen: decide, for each attribute, whether it should be removed (e.g., too many distinct values and no concept hierarchy) or generalized by climbing its concept hierarchy.
3. PrimeGen: generalize the attributes to the chosen levels, merging identical generalized tuples and accumulating their counts.
4. Presentation: present the prime generalized relation as a generalized relation, crosstab, or rules, with interactive drill-down and roll-up.
A minimal Python sketch of these steps follows the DMQL/SQL example below.
EXAMPLE:
DMQL: Describe general characteristics of graduate students in the Big-University
database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student where status in {“Msc”, “MBA”, “PhD” }
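The basic AOI steps can be illustrated with a short Python sketch. The student tuples, the concept hierarchy for major, and the GPA discretization below are hypothetical; they only demonstrate attribute removal, generalization, and merging with counts, not the actual Big_University_DB data.

from collections import Counter

# Hypothetical initial working relation (task-relevant tuples).
students = [
    {"name": "A. Rao", "gender": "M", "major": "Physics",   "gpa": 3.7},
    {"name": "B. Sen", "gender": "F", "major": "Chemistry", "gpa": 3.5},
    {"name": "C. Das", "gender": "M", "major": "Biology",   "gpa": 3.9},
]

# Hypothetical concept hierarchies used for generalization.
major_hierarchy = {"Physics": "Science", "Chemistry": "Science", "Biology": "Science"}
def gpa_level(gpa):                      # numeric GPA -> coarse concept
    return "High" if gpa >= 3.5 else ("Medium" if gpa >= 2.5 else "Low")

generalized = Counter()
for t in students:
    # Attribute removal: drop identifiers like name (too many distinct values).
    # Attribute generalization: climb the concept hierarchies for major and gpa.
    key = (t["gender"], major_hierarchy.get(t["major"], "Other"), gpa_level(t["gpa"]))
    generalized[key] += 1                # merge identical generalized tuples, keep count

for (gender, major, gpa), count in generalized.items():
    print(f"gender={gender}, major={major}, gpa={gpa}, count={count}")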
Class Characterization: An Example
Crosstab:
Implementation by cube technology (two options):
Option 1: Construct a data cube on-the-fly for the given data mining query
o Facilitates efficient drill-down analysis
o May increase the response time
o A balanced solution: precomputation of a “subprime” relation
Option 2: Use a predefined and precomputed data cube
o The cube is constructed beforehand
o Facilitates not only attribute-oriented induction but also attribute relevance analysis, dicing, slicing, roll-up and drill-down
o Drawback: the cost of cube computation and the nontrivial storage overhead
Diagram: Data Cube and Generalization
+-----------------------------------------------------+
| Data Cube |
|-----------------------------------------------------|
| Dimensions: | Time | Region |
| -------------------------------------------------- |
| Measures: | Sales | Profit |
|-----------------------------------------------------|
| Data: | Daily | North, South |
|-----------------------------------------------------|
| Operations: | Roll-up, Drill-down, Slice, Dice |
+-----------------------------------------------------+
|
v
+-----------------------------------------------+
| Generalization (Roll-up) |
| e.g., Aggregate data from daily to monthly |
+-----------------------------------------------+
|
v
+-----------------------------------------------+
| Drill-Down Analysis |
| e.g., Examine daily data after aggregation |
+-----------------------------------------------+
Conclusion
Attribute-oriented induction generalizes task-relevant data by climbing concept hierarchies and presents the results as generalized relations, crosstabs, or characteristic rules.
Attribute Relevance Analysis
Attribute relevance analysis identifies which attributes are most useful for describing or predicting a target concept. The general procedure is:
1. Data Collection: The first step is to collect task-relevant data. This could be
from a relational database, data warehouse, or any other structured data
source.
2. Preprocessing: The data may need to be cleaned and transformed,
especially if it contains missing values, outliers, or irrelevant attributes.
3. Feature Selection: Using statistical or machine learning techniques,
irrelevant or redundant features are eliminated.
4. Analysis: Analyze the relationships between the attributes and target
concepts using methods such as correlation, information gain, or other
relevance measures.
5. Presentation: The results of the analysis are presented in a format that
highlights the most important attributes, either using charts, graphs, or
ranked lists.
Example: consider a dataset of student records with the following attributes:
Age
Gender
Study Hours
Attendance
Final Exam Score (Target variable)
The goal is to analyze the relevance of each attribute in predicting the Final Exam
Score. The steps involved would be:
1. Data Collection: Collect data for each student with the given attributes.
2. Data Preprocessing: Clean the data by handling missing values and
ensuring all attributes are in appropriate formats (e.g., numeric,
categorical).
3. Correlation Analysis: Calculate the correlation between each attribute and
the Final Exam Score (a short computational sketch follows this list).
o Example: You may find that Study Hours and Attendance have a high
positive correlation with the Final Exam Score, suggesting these are
relevant attributes for predicting exam performance.
o Age and Gender might have a low or no correlation with the Final
Exam Score, suggesting these attributes might not be relevant for
prediction.
4. Statistical Testing: You could perform a statistical test (e.g., t-test) to check
if the difference in exam scores is statistically significant across different
Gender groups.
5. Feature Selection: Based on the correlation and statistical tests, select the
most relevant features like Study Hours and Attendance, and remove
irrelevant features like Age.
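The correlation and feature-selection steps above can be sketched in a few lines of pandas. The student records and the 0.5 relevance threshold below are hypothetical and only illustrate the idea.

import pandas as pd

# Hypothetical student records (values are illustrative only).
df = pd.DataFrame({
    "Age":        [20, 21, 22, 20, 23, 21],
    "StudyHours": [5, 8, 2, 7, 9, 3],
    "Attendance": [80, 95, 60, 90, 98, 65],
    "FinalScore": [65, 85, 40, 80, 92, 48],
})

# Step 3: correlation of each attribute with the target (Final Exam Score).
correlations = df.corr()["FinalScore"].drop("FinalScore")
print(correlations.sort_values(ascending=False))

# Step 5: keep attributes whose absolute correlation exceeds a chosen threshold.
relevant = correlations[correlations.abs() > 0.5].index.tolist()
print("Selected features:", relevant)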
Common measures of attribute relevance include the following (a computational sketch follows this list):
1. Correlation Coefficient:
o Used to measure the strength and direction of the relationship
between two variables.
o For example, the Pearson correlation coefficient ranges from -1
(perfect negative correlation) to 1 (perfect positive correlation).
2. Information Gain:
o Measures how much "information" an attribute provides about the
target variable. Higher information gain means that the attribute is
more relevant.
o It is often used in decision trees to determine the best attribute to
split the data.
3. Chi-Square Test:
o A statistical test used to determine if there is a significant association
between two categorical variables.
o Example: Test if Gender influences Final Exam Score.
4. Mutual Information:
o A measure from information theory that quantifies the amount of
information obtained about one attribute by observing another
attribute.
o Higher mutual information means a stronger relationship.
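A minimal sketch of these measures on a small hypothetical dataset (the study-hours, score, gender and pass/fail values below are illustrative only), using NumPy and SciPy:

import numpy as np
from scipy.stats import pearsonr, chi2_contingency

# Hypothetical data: study hours (numeric), gender (categorical), pass/fail target.
study_hours = np.array([2, 8, 5, 9, 3, 7])
scores      = np.array([40, 85, 62, 91, 45, 78])
gender      = np.array(["M", "F", "F", "M", "F", "M"])
passed      = np.array([0, 1, 1, 1, 0, 1])

# 1. Correlation coefficient between two continuous variables.
r, p_value = pearsonr(study_hours, scores)
print("Pearson r:", round(r, 3))

# 3. Chi-square test of association between two categorical variables.
contingency = np.array([[np.sum((gender == g) & (passed == c)) for c in (0, 1)]
                        for g in ("M", "F")])
chi2, p, dof, _ = chi2_contingency(contingency)
print("Chi-square p-value:", round(p, 3))

# 2. Information gain of a (discretized) attribute about the target.
def entropy(labels):
    probs = np.bincount(labels) / len(labels)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

hours_bin = (study_hours >= 5).astype(int)        # crude two-bin discretization
cond = sum(np.mean(hours_bin == v) * entropy(passed[hours_bin == v])
           for v in np.unique(hours_bin))
print("Information gain:", round(entropy(passed) - cond, 3))

In this discrete setting the information gain computed above is the same quantity as the mutual information between the binned attribute and the target.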
As another example, consider a customer dataset with attributes such as Income, Browsing History, Location, and the target Purchase Status:
1. Correlation Analysis: You might find that Income and Browsing History are
highly correlated with Purchase Status, suggesting that these are relevant
predictors.
2. Statistical Tests: A Chi-Square test might show that Location has no
significant impact on Purchase Status, making it less relevant.
3. Feature Selection: Based on the analysis, you could select Income and
Browsing History as the most relevant attributes to predict Purchase
Status.
Conclusion
Attribute relevance analysis narrows a dataset down to the attributes that best describe or predict the target concept before further mining is performed.
Example: Mining Class Comparisons (Customer Churn)
Consider a customer dataset with attributes such as Monthly Spend, Contract type, and satisfaction rating. The goal is to discriminate between customers who will churn and those who will not churn. The analysis might show that:
Monthly Spend > $50 and Contract type are the most important features
to discriminate between churn and non-churn.
Customers who spend more than $50 or have a 1-year contract are less
likely to churn, while customers with a low satisfaction rating or month-to-
month contract are more likely to churn.
The resulting discrimination model can be evaluated with standard classification metrics (a computational sketch follows this list):
Confusion Matrix: Measures how many true positives, false positives, true
negatives, and false negatives are produced by the model.
Accuracy: Percentage of correctly classified instances.
Precision and Recall: Used to evaluate the model's performance, especially
for imbalanced datasets.
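A minimal sketch of these metrics, computed from hypothetical churn predictions (the label arrays below are illustrative only):

import numpy as np

# Hypothetical churn labels and model predictions (1 = churn, 0 = no churn).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)    # of predicted churners, how many actually churned
recall    = tp / (tp + fn)    # of actual churners, how many the model caught

print(f"Confusion matrix: TP={tp}, FP={fp}, TN={tn}, FN={fn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}")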
Descriptive statistical measures are used to summarize and describe the main features of a
dataset. These measures help in understanding the structure, trends, and patterns within the data,
especially when working with large datasets. In the context of data mining, these statistical
measures are applied to help identify key insights, detect anomalies, and prepare the data for
further analysis.
Descriptive statistics focus on summarizing data rather than making predictions or inferences.
Common descriptive statistical measures include:
1. Central Tendency Measures:
o Mean: the average of the data values. Formula: \mu = \frac{1}{N} \sum_{i=1}^{N} x_i
o Median: the middle value when the data are sorted.
o Mode: the most frequently occurring value.
2. Dispersion Measures:
o Variance: Measures the spread of data points around the mean. A high variance
indicates that the data points are spread out over a wider range.
Formula: \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
o Standard Deviation: The square root of the variance, which gives a more interpretable
measure of spread. It shows how much data points deviate from the mean.
Formula: \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
3. Shape of Distribution:
o Skewness: Measures the asymmetry of the data distribution. Positive skew
indicates a rightward tail (data is skewed towards higher values), while negative
skew indicates a leftward tail (data is skewed towards lower values).
o Kurtosis: Measures the "tailedness" of the distribution. High kurtosis indicates a
distribution with heavy tails (more outliers), while low kurtosis suggests a
distribution with lighter tails.
4. Correlation Measures:
o Pearson Correlation: Measures the linear relationship between two continuous
variables. The correlation coefficient ranges from -1 (perfect negative correlation)
to 1 (perfect positive correlation).
Formula: r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}
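The measures listed above can be computed directly with NumPy and SciPy; the values in the sketch below are illustrative only.

import numpy as np
from scipy.stats import skew, kurtosis, pearsonr

# Hypothetical transaction amounts and a related numeric attribute.
x = np.array([12.0, 18.5, 22.0, 25.5, 30.0, 95.0])
y = np.array([1, 2, 2, 3, 3, 8])            # e.g., items per transaction

print("Mean:", x.mean())
print("Variance:", x.var())                  # (1/N) * sum((x - mean)^2)
print("Std deviation:", x.std())             # square root of the variance
print("Skewness:", skew(x))                  # > 0 here: long right tail from the 95.0 value
print("Kurtosis:", kurtosis(x))              # excess kurtosis (normal distribution -> 0)
print("Pearson r:", pearsonr(x, y)[0])       # linear relationship between x and y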
In large datasets, traditional descriptive statistics methods often struggle with efficiency and
scalability due to the size and complexity of the data. However, several techniques and tools
have been developed to efficiently mine descriptive statistical measures from large datasets:
1. Sampling:
o Instead of processing the entire dataset, you can take a representative sample of
the data. This helps reduce computational complexity and still provides insights that
are representative of the overall dataset. This is especially useful when dealing with
big data.
2. Parallel and Distributed Processing:
o In large datasets, it’s common to use distributed computing frameworks (e.g.,
Apache Hadoop, Apache Spark) to perform statistical computations in parallel
across multiple machines. This allows faster computation of descriptive statistics by
distributing the workload.
o For example, Spark SQL can be used to calculate summary statistics across large
datasets in parallel, leveraging in-memory computation for faster performance.
3. Data Cube and OLAP (Online Analytical Processing):
o The Data Cube approach is an efficient method used in OLAP systems to compute
descriptive statistics on multi-dimensional data.
o For example, a data cube might store aggregated statistics (mean, sum, count) for
sales data across different dimensions like time, location, and product category.
o Using roll-up (aggregating data to higher levels of granularity) and drill-down
(exploring data in more detail), OLAP can help summarize large datasets with
efficiency.
4. Streaming Data Analytics:
o For real-time data streams, techniques like online algorithms are used to
calculate descriptive statistics incrementally as the data flows in. This is commonly
used in scenarios like monitoring web traffic, sensor data, or financial markets.
o For example, algorithms like Exponential Moving Average (EMA) can be used to
compute running averages and standard deviations in a stream of data without
needing to store the entire dataset (a small sketch follows this list).
5. Dimensionality Reduction:
o In datasets with a large number of features, dimensionality reduction techniques
(e.g., Principal Component Analysis (PCA), t-SNE) can be used to reduce the
number of variables while retaining most of the descriptive statistical properties.
This makes it easier to visualize and interpret large datasets.
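As mentioned in technique 4 above, such statistics can be maintained incrementally. The sketch below combines Welford's online algorithm for the running mean and variance with a simple exponential moving average; the stream values are hypothetical.

class RunningStats:
    """Incrementally maintain mean and variance (Welford's online algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                    # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):                  # population variance of the values seen so far
        return self.m2 / self.n if self.n > 1 else 0.0

def ema(prev, x, alpha=0.1):
    """Exponential moving average: weights recent values more heavily."""
    return alpha * x + (1 - alpha) * prev

stats, running_ema = RunningStats(), None
for value in [42.0, 55.0, 38.0, 61.0, 47.0]:   # pretend this is an incoming data stream
    stats.update(value)
    running_ema = value if running_ema is None else ema(running_ema, value)
print(stats.mean, stats.variance, running_ema)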
Example of Descriptive Statistical Measures in a Large Dataset
Imagine a retail company has a large dataset containing sales transactions from different stores.
The attributes of each transaction include:
Store Location
Product Category
Transaction Amount
Transaction Date
The company wants to use descriptive statistics to understand customer buying patterns and
identify sales trends across different stores.
1. Central Tendency Measures:
Mean Transaction Amount: Calculate the average spending per transaction for each store
and product category.
Median Transaction Amount: Identify the middle value of the transaction amounts,
especially useful if there are a few extreme outliers.
Mode: Find the most frequent product category or transaction amount.
2. Dispersion Measures:
Variance and Standard Deviation: Measure the variability in transaction amounts across
stores. High variance might indicate that some stores have more expensive transactions
than others.
Range: Identify the lowest and highest transaction amounts to assess the overall
distribution.
3. Correlation Measures:
Examine, for example, whether Transaction Amount varies systematically with Store Location or with the time of year (Transaction Date).
4. Quantiles:
Quartiles and IQR: Divide the transaction amounts into quartiles to understand the spread
of transaction values, and use the IQR to identify any potential outliers.
5. Shape of Distribution:
Analyze the skewness of the transaction amount distribution to see if it is skewed towards
high or low values. A high positive skew might indicate a few very large transactions among
mostly small ones.
Check kurtosis to identify if the data has heavy tails, which could point to a few
transactions significantly larger than the rest.
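A minimal pandas sketch of these descriptive measures, assuming a hypothetical transactions DataFrame whose columns follow the attributes listed above:

import pandas as pd

# Hypothetical retail transactions (values are illustrative only).
tx = pd.DataFrame({
    "StoreLocation":     ["North", "North", "South", "South", "East"],
    "ProductCategory":   ["Electronics", "Grocery", "Electronics", "Grocery", "Electronics"],
    "TransactionAmount": [120.0, 35.5, 250.0, 28.0, 75.0],
})

# Central tendency and dispersion per store.
summary = tx.groupby("StoreLocation")["TransactionAmount"].agg(
    ["mean", "median", "std", "min", "max"])
print(summary)

# Mode of the product category, quartiles/IQR and skewness of the amounts.
print("Mode:", tx["ProductCategory"].mode()[0])
q1, q3 = tx["TransactionAmount"].quantile([0.25, 0.75])
print("IQR:", q3 - q1)
print("Skewness:", tx["TransactionAmount"].skew())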
Diagram: Descriptive Statistics on a Retail Dataset
+--------------------------------------------+
| Descriptive Statistics on |
| Retail Transaction Data |
+--------------------------------------------+
| Mean Transaction Amount: $45.50 |
| Median Transaction Amount: $40.00 |
| Mode (Most Frequent Product): Electronics |
| Range: $5.00 - $500.00 |
| Variance: 120.5 |
| Standard Deviation: 10.98 |
| Correlation between Location & Amount: 0.65|
| Interquartile Range: $35.00 - $55.00 |
| Skewness: 1.2 (Positive skew) |
+--------------------------------------------+
Conclusion
Mining descriptive statistical measures in large datasets is essential for understanding the
underlying patterns, trends, and distributions within the data. By utilizing methods like sampling,
distributed computing, data cubes, and online algorithms, organizations can efficiently process
large datasets and extract meaningful insights. These measures help in summarizing data,
detecting anomalies, and setting the stage for further analysis, modeling, and decision-making.