Unit-5 DMDW
Concept Description:
o Handles complex data types and aggregations of attributes.
o Typically, the process is more automated in nature.
o It provides summarized data that helps in understanding trends and
features in the dataset.
OLAP (Online Analytical Processing):
o Primarily deals with a limited number of dimensions and measures (e.g.,
time, region, product type).
o It is a more user-controlled process where users can interactively
explore data, slice and dice data, and perform different types of
aggregation.
o OLAP allows multidimensional analysis, like drilling down into data to
explore more details or rolling up to see summary-level information.
Let's imagine we have a dataset for sales transactions at a retail store. We can apply
concept description techniques as follows:
Characterization: summarize the general features of the target data, for example the typical transaction amount and the most frequently purchased product categories.
Comparison (Discrimination): contrast the features of one class against another, for example weekday sales versus weekend sales, or high-spending versus low-spending customers.
Key Differences:
1. Complexity: Concept description can handle more complex data and generate
insights automatically, while OLAP is more focused on simple, interactive data
exploration.
2. User Control: OLAP gives users control to manipulate data, while concept
description is more about generating summaries and comparisons.
Data Cube is a technique used in OLAP (Online Analytical Processing), which involves
organizing data in a multidimensional array. Data is stored in a cube-like structure
with dimensions and measures.
Key Operations: roll-up (aggregate to a coarser level), drill-down (move to a finer level), slice (fix one dimension), and dice (select a sub-cube over several dimensions).
Example:
Consider a retail company tracking sales across different regions, products, and time
periods (e.g., daily, weekly). The data can be represented as a 3D cube where each axis is one dimension (Time, Region, Product) and each cell stores a measure such as total sales.
By applying roll-up, we can aggregate the data from daily sales to monthly sales,
whereas drill-down would take us from monthly to weekly or daily sales.
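The roll-up and drill-down operations described above can be imitated with a simple group-by aggregation in Python. The pandas sketch below uses a hypothetical DataFrame with date, region, product and sales columns; the column names and values are illustrative, not taken from the example.

import pandas as pd

# Hypothetical daily sales records (columns and values are illustrative).
sales = pd.DataFrame({
    "date":    pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"]),
    "region":  ["North", "South", "North", "South"],
    "product": ["TV", "TV", "Phone", "Phone"],
    "sales":   [1200, 800, 950, 700],
})

# Roll-up: aggregate daily sales to monthly totals per region and product.
monthly = (
    sales.assign(month=sales["date"].dt.to_period("M"))
         .groupby(["month", "region", "product"], as_index=False)["sales"]
         .sum()
)
print(monthly)

# Drill-down: return to the finer daily granularity for one month.
jan_daily = sales[sales["date"].dt.to_period("M") == pd.Period("2024-01", freq="M")]
print(jan_daily)

Roll-up here is just aggregation along the Time dimension to a coarser level; drill-down simply returns to the finer-grained rows.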
Limitations of the data cube approach:
Data Type Restrictions: It can handle only simple non-numeric data types for
dimensions and numeric data for measures.
Lack of Intelligent Analysis: The cube does not automatically decide which
dimensions should be used or at what levels generalization should occur.
Attribute-Oriented Induction
Imagine a database for graduate students at a university, where you want to mine
the general characteristics of science students. The relevant data might include attributes such as name, gender, major, birth_place, birth_date, residence, phone#, and gpa.
The goal is to generalize attributes like GPA (e.g., "High", "Medium", "Low") and
Major (e.g., "Science", "Arts", "Engineering") to identify patterns, such as:
Science students tend to have higher GPAs and are predominantly male.
Basic Algorithm of Attribute-Oriented Induction:
1. InitialRel: collect the task-relevant data (the initial working relation) using a relational query.
2. PreGen: decide, for each attribute, whether it should be removed (e.g., too many distinct values and no concept hierarchy) or generalized by climbing its concept hierarchy.
3. PrimeGen: generalize the attributes to the chosen levels, merging identical generalized tuples and accumulating their counts.
4. Presentation: present the prime generalized relation as a generalized relation, crosstab, or rules, with interactive drill-down and roll-up.
A minimal Python sketch of these steps follows the DMQL/SQL example below.
EXAMPLE:
DMQL: Describe general characteristics of graduate students in the Big-University
database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student where status in {“Msc”, “MBA”, “PhD” }
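The basic AOI steps can be illustrated with a short Python sketch. The student tuples, the concept hierarchy for major, and the GPA discretization below are hypothetical; they only demonstrate attribute removal, generalization, and merging with counts, not the actual Big_University_DB data.

from collections import Counter

# Hypothetical initial working relation (task-relevant tuples).
students = [
    {"name": "A. Rao", "gender": "M", "major": "Physics",   "gpa": 3.7},
    {"name": "B. Sen", "gender": "F", "major": "Chemistry", "gpa": 3.5},
    {"name": "C. Das", "gender": "M", "major": "Biology",   "gpa": 3.9},
]

# Hypothetical concept hierarchies used for generalization.
major_hierarchy = {"Physics": "Science", "Chemistry": "Science", "Biology": "Science"}
def gpa_level(gpa):                      # numeric GPA -> coarse concept
    return "High" if gpa >= 3.5 else ("Medium" if gpa >= 2.5 else "Low")

generalized = Counter()
for t in students:
    # Attribute removal: drop identifiers like name (too many distinct values).
    # Attribute generalization: climb the concept hierarchies for major and gpa.
    key = (t["gender"], major_hierarchy.get(t["major"], "Other"), gpa_level(t["gpa"]))
    generalized[key] += 1                # merge identical generalized tuples, keep count

for (gender, major, gpa), count in generalized.items():
    print(f"gender={gender}, major={major}, gpa={gpa}, count={count}")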
Class Characterization: An Example
Crosstab:
Implementation by cube technology (two options):
Option 1: Construct a data cube on-the-fly for the given data mining query
o Facilitates efficient drill-down analysis
o May increase the response time
o A balanced solution: precomputation of a “subprime” relation
Option 2: Use a predefined and precomputed data cube
o The cube is constructed beforehand
o Facilitates not only attribute-oriented induction but also attribute relevance analysis, dicing, slicing, roll-up and drill-down
o Drawback: the cost of cube computation and the nontrivial storage overhead
Diagram: Data Cube and Generalization
+-----------------------------------------------------+
| Data Cube |
|-----------------------------------------------------|
| Dimensions: | Time | Region |
| -------------------------------------------------- |
| Measures: | Sales | Profit |
|-----------------------------------------------------|
| Data: | Daily | North, South |
|-----------------------------------------------------|
| Operations: | Roll-up, Drill-down, Slice, Dice |
+-----------------------------------------------------+
|
v
+-----------------------------------------------+
| Generalization (Roll-up) |
| e.g., Aggregate data from daily to monthly |
+-----------------------------------------------+
|
v
+-----------------------------------------------+
| Drill-Down Analysis |
| e.g., Examine daily data after aggregation |
+-----------------------------------------------+
Conclusion
Attribute-oriented induction generalizes task-relevant data by climbing concept hierarchies and presents the results as generalized relations, crosstabs, or characteristic rules.
Attribute Relevance Analysis
Attribute relevance analysis identifies which attributes are most useful for describing or predicting a target concept. The general procedure is:
1. Data Collection: The first step is to collect task-relevant data. This could be
from a relational database, data warehouse, or any other structured data
source.
2. Preprocessing: The data may need to be cleaned and transformed,
especially if it contains missing values, outliers, or irrelevant attributes.
3. Feature Selection: Using statistical or machine learning techniques,
irrelevant or redundant features are eliminated.
4. Analysis: Analyze the relationships between the attributes and target
concepts using methods such as correlation, information gain, or other
relevance measures.
5. Presentation: The results of the analysis are presented in a format that
highlights the most important attributes, either using charts, graphs, or
ranked lists.
Example: consider a dataset of student records with the following attributes:
Age
Gender
Study Hours
Attendance
Final Exam Score (Target variable)
The goal is to analyze the relevance of each attribute in predicting the Final Exam
Score. The steps involved would be:
1. Data Collection: Collect data for each student with the given attributes.
2. Data Preprocessing: Clean the data by handling missing values and
ensuring all attributes are in appropriate formats (e.g., numeric,
categorical).
3. Correlation Analysis: Calculate the correlation between each attribute and
the Final Exam Score (a short computational sketch follows this list).
o Example: You may find that Study Hours and Attendance have a high
positive correlation with the Final Exam Score, suggesting these are
relevant attributes for predicting exam performance.
o Age and Gender might have a low or no correlation with the Final
Exam Score, suggesting these attributes might not be relevant for
prediction.
4. Statistical Testing: You could perform a statistical test (e.g., t-test) to check
if the difference in exam scores is statistically significant across different
Gender groups.
5. Feature Selection: Based on the correlation and statistical tests, select the
most relevant features like Study Hours and Attendance, and remove
irrelevant features like Age.
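The correlation and feature-selection steps above can be sketched in a few lines of pandas. The student records and the 0.5 relevance threshold below are hypothetical and only illustrate the idea.

import pandas as pd

# Hypothetical student records (values are illustrative only).
df = pd.DataFrame({
    "Age":        [20, 21, 22, 20, 23, 21],
    "StudyHours": [5, 8, 2, 7, 9, 3],
    "Attendance": [80, 95, 60, 90, 98, 65],
    "FinalScore": [65, 85, 40, 80, 92, 48],
})

# Step 3: correlation of each attribute with the target (Final Exam Score).
correlations = df.corr()["FinalScore"].drop("FinalScore")
print(correlations.sort_values(ascending=False))

# Step 5: keep attributes whose absolute correlation exceeds a chosen threshold.
relevant = correlations[correlations.abs() > 0.5].index.tolist()
print("Selected features:", relevant)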
Common measures of attribute relevance include the following (a computational sketch follows this list):
1. Correlation Coefficient:
o Used to measure the strength and direction of the relationship
between two variables.
o For example, the Pearson correlation coefficient ranges from -1
(perfect negative correlation) to 1 (perfect positive correlation).
2. Information Gain:
o Measures how much "information" an attribute provides about the
target variable. Higher information gain means that the attribute is
more relevant.
o It is often used in decision trees to determine the best attribute to
split the data.
3. Chi-Square Test:
o A statistical test used to determine if there is a significant association
between two categorical variables.
o Example: Test if Gender influences Final Exam Score.
4. Mutual Information:
o A measure from information theory that quantifies the amount of
information obtained about one attribute by observing another
attribute.
o Higher mutual information means a stronger relationship.
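A minimal sketch of these measures on a small hypothetical dataset (the study-hours, score, gender and pass/fail values below are illustrative only), using NumPy and SciPy:

import numpy as np
from scipy.stats import pearsonr, chi2_contingency

# Hypothetical data: study hours (numeric), gender (categorical), pass/fail target.
study_hours = np.array([2, 8, 5, 9, 3, 7])
scores      = np.array([40, 85, 62, 91, 45, 78])
gender      = np.array(["M", "F", "F", "M", "F", "M"])
passed      = np.array([0, 1, 1, 1, 0, 1])

# 1. Correlation coefficient between two continuous variables.
r, p_value = pearsonr(study_hours, scores)
print("Pearson r:", round(r, 3))

# 3. Chi-square test of association between two categorical variables.
contingency = np.array([[np.sum((gender == g) & (passed == c)) for c in (0, 1)]
                        for g in ("M", "F")])
chi2, p, dof, _ = chi2_contingency(contingency)
print("Chi-square p-value:", round(p, 3))

# 2. Information gain of a (discretized) attribute about the target.
def entropy(labels):
    probs = np.bincount(labels) / len(labels)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

hours_bin = (study_hours >= 5).astype(int)        # crude two-bin discretization
cond = sum(np.mean(hours_bin == v) * entropy(passed[hours_bin == v])
           for v in np.unique(hours_bin))
print("Information gain:", round(entropy(passed) - cond, 3))

In this discrete setting the information gain computed above is the same quantity as the mutual information between the binned attribute and the target.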
As another example, consider a customer dataset with attributes such as Income, Browsing History, Location, and the target Purchase Status:
1. Correlation Analysis: You might find that Income and Browsing History are
highly correlated with Purchase Status, suggesting that these are relevant
predictors.
2. Statistical Tests: A Chi-Square test might show that Location has no
significant impact on Purchase Status, making it less relevant.
3. Feature Selection: Based on the analysis, you could select Income and
Browsing History as the most relevant attributes to predict Purchase
Status.
Conclusion
Attribute relevance analysis narrows a dataset down to the attributes that best describe or predict the target concept before further mining is performed.
Example: Mining Class Comparisons (Customer Churn)
Consider a customer dataset with attributes such as Monthly Spend, Contract type, and satisfaction rating. The goal is to discriminate between customers who will churn and those who will not churn. The analysis might show that:
Monthly Spend > $50 and Contract type are the most important features
to discriminate between churn and non-churn.
Customers who spend more than $50 or have a 1-year contract are less
likely to churn, while customers with a low satisfaction rating or month-to-
month contract are more likely to churn.
The resulting discrimination model can be evaluated with standard classification metrics (a computational sketch follows this list):
Confusion Matrix: Measures how many true positives, false positives, true
negatives, and false negatives are produced by the model.
Accuracy: Percentage of correctly classified instances.
Precision and Recall: Used to evaluate the model's performance, especially
for imbalanced datasets.
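A minimal sketch of these metrics, computed from hypothetical churn predictions (the label arrays below are illustrative only):

import numpy as np

# Hypothetical churn labels and model predictions (1 = churn, 0 = no churn).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)    # of predicted churners, how many actually churned
recall    = tp / (tp + fn)    # of actual churners, how many the model caught

print(f"Confusion matrix: TP={tp}, FP={fp}, TN={tn}, FN={fn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}")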
Descriptive statistical measures are used to summarize and describe the main features of a
dataset. These measures help in understanding the structure, trends, and patterns within the data,
especially when working with large datasets. In the context of data mining, these statistical
measures are applied to help identify key insights, detect anomalies, and prepare the data for
further analysis.
Descriptive statistics focus on summarizing data rather than making predictions or inferences.
Common descriptive statistical measures include:
1. Central Tendency Measures:
o Mean: the average of the data values. Formula: \mu = \frac{1}{N} \sum_{i=1}^{N} x_i
o Median: the middle value when the data are sorted.
o Mode: the most frequently occurring value.
2. Dispersion Measures:
o Variance: Measures the spread of data points around the mean. A high variance
indicates that the data points are spread out over a wider range.
Formula: \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
o Standard Deviation: The square root of the variance, which gives a more interpretable
measure of spread. It shows how much data points deviate from the mean.
Formula: \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
3. Shape of Distribution:
o Skewness: Measures the asymmetry of the data distribution. Positive skew
indicates a rightward tail (data is skewed towards higher values), while negative
skew indicates a leftward tail (data is skewed towards lower values).
o Kurtosis: Measures the "tailedness" of the distribution. High kurtosis indicates a
distribution with heavy tails (more outliers), while low kurtosis suggests a
distribution with lighter tails.
4. Correlation Measures:
o Pearson Correlation: Measures the linear relationship between two continuous
variables. The correlation coefficient ranges from -1 (perfect negative correlation)
to 1 (perfect positive correlation).
Formula: r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}
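The measures listed above can be computed directly with NumPy and SciPy; the values in the sketch below are illustrative only.

import numpy as np
from scipy.stats import skew, kurtosis, pearsonr

# Hypothetical transaction amounts and a related numeric attribute.
x = np.array([12.0, 18.5, 22.0, 25.5, 30.0, 95.0])
y = np.array([1, 2, 2, 3, 3, 8])            # e.g., items per transaction

print("Mean:", x.mean())
print("Variance:", x.var())                  # (1/N) * sum((x - mean)^2)
print("Std deviation:", x.std())             # square root of the variance
print("Skewness:", skew(x))                  # > 0 here: long right tail from the 95.0 value
print("Kurtosis:", kurtosis(x))              # excess kurtosis (normal distribution -> 0)
print("Pearson r:", pearsonr(x, y)[0])       # linear relationship between x and y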
In large datasets, traditional descriptive statistics methods often struggle with efficiency and
scalability due to the size and complexity of the data. However, several techniques and tools
have been developed to efficiently mine descriptive statistical measures from large datasets:
1. Sampling:
o Instead of processing the entire dataset, you can take a representative sample of
the data. This helps reduce computational complexity and still provides insights that
are representative of the overall dataset. This is especially useful when dealing with
big data.
2. Parallel and Distributed Processing:
o In large datasets, it’s common to use distributed computing frameworks (e.g.,
Apache Hadoop, Apache Spark) to perform statistical computations in parallel
across multiple machines. This allows faster computation of descriptive statistics by
distributing the workload.
o For example, Spark SQL can be used to calculate summary statistics across large
datasets in parallel, leveraging in-memory computation for faster performance.
3. Data Cube and OLAP (Online Analytical Processing):
o The Data Cube approach is an efficient method used in OLAP systems to compute
descriptive statistics on multi-dimensional data.
o For example, a data cube might store aggregated statistics (mean, sum, count) for
sales data across different dimensions like time, location, and product category.
o Using roll-up (aggregating data to higher levels of granularity) and drill-down
(exploring data in more detail), OLAP can help summarize large datasets with
efficiency.
4. Streaming Data Analytics:
o For real-time data streams, techniques like online algorithms are used to
calculate descriptive statistics incrementally as the data flows in. This is commonly
used in scenarios like monitoring web traffic, sensor data, or financial markets.
o For example, algorithms like Exponential Moving Average (EMA) can be used to
compute running averages and standard deviations in a stream of data without
needing to store the entire dataset (a small sketch follows this list).
5. Dimensionality Reduction:
o In datasets with a large number of features, dimensionality reduction techniques
(e.g., Principal Component Analysis (PCA), t-SNE) can be used to reduce the
number of variables while retaining most of the descriptive statistical properties.
This makes it easier to visualize and interpret large datasets.
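As mentioned in technique 4 above, such statistics can be maintained incrementally. The sketch below combines Welford's online algorithm for the running mean and variance with a simple exponential moving average; the stream values are hypothetical.

class RunningStats:
    """Incrementally maintain mean and variance (Welford's online algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                    # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):                  # population variance of the values seen so far
        return self.m2 / self.n if self.n > 1 else 0.0

def ema(prev, x, alpha=0.1):
    """Exponential moving average: weights recent values more heavily."""
    return alpha * x + (1 - alpha) * prev

stats, running_ema = RunningStats(), None
for value in [42.0, 55.0, 38.0, 61.0, 47.0]:   # pretend this is an incoming data stream
    stats.update(value)
    running_ema = value if running_ema is None else ema(running_ema, value)
print(stats.mean, stats.variance, running_ema)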
Example of Descriptive Statistical Measures in a Large Dataset
Imagine a retail company has a large dataset containing sales transactions from different stores.
The attributes of each transaction include:
Store Location
Product Category
Transaction Amount
Transaction Date
The company wants to use descriptive statistics to understand customer buying patterns and
identify sales trends across different stores.
1. Central Tendency Measures:
Mean Transaction Amount: Calculate the average spending per transaction for each store
and product category.
Median Transaction Amount: Identify the middle value of the transaction amounts,
especially useful if there are a few extreme outliers.
Mode: Find the most frequent product category or transaction amount.
2. Dispersion Measures:
Variance and Standard Deviation: Measure the variability in transaction amounts across
stores. High variance might indicate that some stores have more expensive transactions
than others.
Range: Identify the lowest and highest transaction amounts to assess the overall
distribution.
3. Correlation Measures:
Examine, for example, whether Transaction Amount varies systematically with Store Location or with the time of year (Transaction Date).
4. Quantiles:
Quartiles and IQR: Divide the transaction amounts into quartiles to understand the spread
of transaction values, and use the IQR to identify any potential outliers.
5. Shape of Distribution:
Analyze the skewness of the transaction amount distribution to see if it is skewed towards
high or low values. A high positive skew might indicate a few very large transactions among
mostly small ones.
Check kurtosis to identify if the data has heavy tails, which could point to a few
transactions significantly larger than the rest.
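A minimal pandas sketch of these descriptive measures, assuming a hypothetical transactions DataFrame whose columns follow the attributes listed above:

import pandas as pd

# Hypothetical retail transactions (values are illustrative only).
tx = pd.DataFrame({
    "StoreLocation":     ["North", "North", "South", "South", "East"],
    "ProductCategory":   ["Electronics", "Grocery", "Electronics", "Grocery", "Electronics"],
    "TransactionAmount": [120.0, 35.5, 250.0, 28.0, 75.0],
})

# Central tendency and dispersion per store.
summary = tx.groupby("StoreLocation")["TransactionAmount"].agg(
    ["mean", "median", "std", "min", "max"])
print(summary)

# Mode of the product category, quartiles/IQR and skewness of the amounts.
print("Mode:", tx["ProductCategory"].mode()[0])
q1, q3 = tx["TransactionAmount"].quantile([0.25, 0.75])
print("IQR:", q3 - q1)
print("Skewness:", tx["TransactionAmount"].skew())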
Diagram: Descriptive Statistics on a Retail Dataset
+--------------------------------------------+
| Descriptive Statistics on |
| Retail Transaction Data |
+--------------------------------------------+
| Mean Transaction Amount: $45.50 |
| Median Transaction Amount: $40.00 |
| Mode (Most Frequent Product): Electronics |
| Range: $5.00 - $500.00 |
| Variance: 120.5 |
| Standard Deviation: 10.98 |
| Correlation between Location & Amount: 0.65|
| Interquartile Range: $35.00 - $55.00 |
| Skewness: 1.2 (Positive skew) |
+--------------------------------------------+
Conclusion
Mining descriptive statistical measures in large datasets is essential for understanding the
underlying patterns, trends, and distributions within the data. By utilizing methods like sampling,
distributed computing, data cubes, and online algorithms, organizations can efficiently process
large datasets and extract meaningful insights. These measures help in summarizing data,
detecting anomalies, and setting the stage for further analysis, modeling, and decision-making.