Data Mining Unit 3
• Requirements
• Quality Data
• Major Tasks in Data Pre-processing
• Data cleaning
• Data Integration
• Data Transformation
• Data Reduction
• Data discretization
Requirement of Data Pre-processing
Noise and Outliers
Outlier analysis by box plot
Data (sorted): 10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Lower quartile (Q1) = 14.4, Median (Q2) = 14.6, Upper quartile (Q3) = 14.9
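The quartiles and the usual 1.5 × IQR outlier fences for this example can be sketched in Python; quartiles are computed as medians of the lower and upper halves, which matches the values above:

```python
# IQR-based outlier detection for the box-plot example above.
from statistics import median

data = sorted([10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
               14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4])

n = len(data)
q2 = median(data)
lower_half = data[: n // 2]            # values below the median
upper_half = data[(n + 1) // 2 :]      # values above the median
q1, q3 = median(lower_half), median(upper_half)

iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(q1, q2, q3)   # 14.4 14.6 14.9
print(outliers)     # values outside the 1.5*IQR fences
```

Values outside the fences (here 10.2, 15.9, and 16.4) are the points a box plot would draw as individual outlier marks.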
Chi-square test (gender vs. preferred reading)
The observed counts are laid out in a contingency table with rows for
Gender (Male, Female), columns for preferred reading, and row and
column totals.
Expected count: e = (Count_gender X Count_preferred_reading) / N
Degrees of freedom = (Rows - 1) X (Columns - 1)
                   = (2 - 1) X (2 - 1)
                   = 1
Chi-Square Table (test for independence)
Example 2: A researcher wants to know whether there is a significant
association between the variables gender and soft-drink choice (Coke
and Pepsi were considered). The null hypothesis is:
H0: There is no significant association between gender and soft-drink
choice (gender and preferred soft drink are independent).
Significance level: 5%
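A minimal sketch of this test in Python; the observed counts below are hypothetical (the notes give no data), and 3.841 is the standard chi-square critical value for df = 1 at the 5% level:

```python
# Chi-square test of independence for a 2x2 table.
# The observed counts are hypothetical, for illustration only.
observed = {("Male", "Coke"): 60, ("Male", "Pepsi"): 40,
            ("Female", "Coke"): 35, ("Female", "Pepsi"): 65}

genders = ["Male", "Female"]
drinks = ["Coke", "Pepsi"]
N = sum(observed.values())

row_tot = {g: sum(observed[(g, d)] for d in drinks) for g in genders}
col_tot = {d: sum(observed[(g, d)] for g in genders) for d in drinks}

# Expected count e = (row total x column total) / N
chi2 = 0.0
for g in genders:
    for d in drinks:
        e = row_tot[g] * col_tot[d] / N
        chi2 += (observed[(g, d)] - e) ** 2 / e

critical = 3.841  # chi-square critical value, df = 1, alpha = 0.05
print(round(chi2, 3), chi2 > critical)   # 12.531 True
```

Since the statistic exceeds the critical value here, the null hypothesis of independence would be rejected for these (hypothetical) counts.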
Data Transformation
In data transformation, the data are transformed or
consolidated into appropriate forms for mining.
Strategies used for data transformation are:
1. Smoothing: removes noise from the data.
2. Attribute construction: a new set of attributes is generated from
the given attributes.
3. Aggregation: summarization of data values.
4. Normalization: attribute values are scaled into a small range
(e.g., -1 to +1, or 0 to 1).
5. Discretization: data are divided into discrete intervals (e.g.,
0-100, 101-200, 201-300).
6. Concept hierarchy generation: attribute values are generalized to
higher-level concepts. For example, an address hierarchy is:
street < city < state < country.
Data Transformation by Normalization
Normalization changes the unit of measurement (for example, meters
to kilometres).
For better performance, data should be scaled into a small interval
such as [-1, +1] or [0, 1].
Following methods can be used for data normalization:
1. Min-Max Normalization
2. Z-Score(zero mean) Normalization
3. Normalization By decimal Scaling
Example:
Suppose that the minimum and maximum values for the attribute
income are $12,000 and $98,000, respectively. Map an income of
$73,600 to the range [0.0, 1.0].
Here min_A = 12,000, max_A = 98,000,
new_min_A = 0, new_max_A = 1, and v = 73,600.
v' = [(73,600 - 12,000) / (98,000 - 12,000)] x (1.0 - 0) + 0
   = 0.716
So 73,600 is represented as 0.716 in the new range [0, 1].
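The min-max formula above, as a small Python sketch:

```python
# Min-max normalization:
# v' = (v - min_A) / (max_A - min_A) * (new_max - new_min) + new_min
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

v_new = min_max(73_600, 12_000, 98_000)
print(round(v_new, 3))   # 0.716
```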
Example:
Suppose that the mean and standard deviation of the values
for the attribute income are $54,000 and $16,000,
respectively. With z-score normalization, a value of $73,600 for
income is transformed to:
v' = (73,600 - 54,000) / 16,000
   = 1.225
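The z-score formula as code, plus a sketch of the third listed method, decimal scaling (v' = v / 10^j for the smallest integer j such that max(|v'|) < 1); the values passed to the decimal-scaling call are illustrative:

```python
# Z-score normalization: v' = (v - mean) / std_dev
def z_score(v, mean, std):
    return (v - mean) / std

print(z_score(73_600, 54_000, 16_000))   # 1.225

# Decimal-scaling normalization: divide by the smallest power of 10
# that brings every |value| below 1.
def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([917, -13, 86]))   # [0.917, -0.013, 0.086]
```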
Numerosity reduction
• Numerosity reduction reduces the data volume by choosing
alternative, 'smaller' forms of data representation.
• These techniques may be parametric or nonparametric.
3. Sampling: a subset of the data is selected (randomly or
systematically) to represent the entire dataset.
4. Data cube aggregation: multidimensional data are summarized by
computing aggregates (e.g., sum, average) for different
combinations of dimensions in a data cube.
Histograms
• A histogram for an attribute, A, partitions the data
distribution of A into disjoint subsets, or buckets.
• If each bucket represents only a single attribute-value/
frequency pair, the buckets are called singleton buckets.
• There are several partitioning rules, including the following:
• Equal-width: In an equal-width histogram, the width of each bucket
range is uniform. The range of the attribute A is divided into
buckets of equal width. For example, if A ranges from 0 to 100 and
we use 10 buckets, each bucket has a width of 10 (e.g., [0-10),
[10-20), ..., [90-100]).
• Equal-frequency (or equi-depth): In an equal-frequency histogram,
the buckets are created so that each bucket contains roughly the
same number of contiguous data samples. For example, with 100 data
points and 10 buckets, each bucket holds roughly 10 data points.
Histograms
• Example :The following data are a list of prices of commonly
sold items at AllElectronics (rounded to the nearest dollar):
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
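The two partitioning rules applied to this price list, as a small Python sketch (3 equal-width buckets of width 10, and 4 equal-frequency buckets):

```python
# Equal-width and equal-frequency bucketing of the AllElectronics
# price list from the example above.
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
          20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
          28, 28, 30, 30, 30]

# Equal-width: width 10 -> buckets [1-10], [11-20], [21-30]
width = 10
equal_width = {}
for p in prices:
    lo = ((p - 1) // width) * width + 1      # bucket start: 1, 11, 21
    equal_width[(lo, lo + width - 1)] = equal_width.get((lo, lo + width - 1), 0) + 1
print(equal_width)   # {(1, 10): 13, (11, 20): 25, (21, 30): 14}

# Equal-frequency: 4 buckets, each with ~len(prices)/4 values
k = 4
n = len(prices)
equal_freq = [prices[i * n // k : (i + 1) * n // k] for i in range(k)]
print([len(b) for b in equal_freq])   # [13, 13, 13, 13]
```

Note how the equal-width bucket counts vary (13, 25, 14) while the equal-frequency buckets are forced to hold the same number of values.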
Sampling
Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random sample (or subset) of the data.
1. Simple random sample without replacement (SRSWOR) of size s:
created by drawing s of the N tuples from D (s < N), where the
probability of drawing any tuple in D is 1/N, that is, all tuples
are equally likely to be sampled.
2. Simple random sample with replacement (SRSWR) of size s: similar
to SRSWOR, except that each time a tuple is drawn from D, it is
recorded and then placed back in D, so that it may be drawn again.
This allows the same tuple to be sampled multiple times.
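Both schemes map directly onto the standard library, as a sketch over a toy dataset:

```python
# SRSWOR and SRSWR sketches using Python's random module.
import random

random.seed(42)                  # for reproducible output
D = list(range(1, 101))          # a toy dataset of N = 100 tuples
s = 10

srswor = random.sample(D, s)     # without replacement: no repeats
srswr = random.choices(D, k=s)   # with replacement: repeats possible

print(srswor)
print(srswr)
```

`random.sample` guarantees distinct tuples (SRSWOR), while `random.choices` draws independently each time, so the same tuple may appear more than once (SRSWR).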
• 3. Cluster sampling: a probabilistic sampling technique in which
the population is divided into groups, or clusters, and a random
selection of these clusters is made. All members within the
selected clusters are included in the sample. This method is
particularly effective when the population is large and widely
dispersed geographically.
• 4. Stratified sampling: a probabilistic sampling technique in
which the population is divided into smaller homogeneous groups,
known as strata, based on shared characteristics, and a sample is
drawn from each stratum. This ensures that specific subgroups of
the population are adequately represented in the final sample,
which improves precision and helps researchers make better
estimates.
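A proportional stratified-sampling sketch; the population below (stratified by a gender-like key) is hypothetical example data:

```python
# Stratified sampling sketch: draw proportionally from each stratum.
import random

random.seed(0)
population = [("M", i) for i in range(60)] + [("F", i) for i in range(40)]

def stratified_sample(pop, key, total):
    strata = {}
    for rec in pop:
        strata.setdefault(key(rec), []).append(rec)
    sample = []
    for members in strata.values():
        k = round(total * len(members) / len(pop))  # proportional share
        sample.extend(random.sample(members, k))
    return sample

s = stratified_sample(population, key=lambda r: r[0], total=10)
print(len(s))   # 10: 6 from "M", 4 from "F"
```

Each stratum contributes in proportion to its size, so the 60/40 split in the population is preserved in the 10-record sample.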
Data Cube Aggregation
• It is used to summarize multidimensional data in a data cube.
• Data aggregation is any process in which information is
gathered and expressed in a summary form, for purposes such
as statistical analysis.
• A common aggregation purpose is to get more information
about particular groups based on specific variables such as
age, profession, sales or income.
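A sketch of aggregating along one dimension of a cube; the sales records below are hypothetical:

```python
# Data cube aggregation sketch: summing sales along the 'year'
# dimension of hypothetical (year, item, amount) records.
sales = [
    ("2023", "TV", 500), ("2023", "Phone", 300),
    ("2024", "TV", 700), ("2024", "Phone", 400),
]

by_year = {}
for year, item, amount in sales:
    by_year[year] = by_year.get(year, 0) + amount
print(by_year)   # {'2023': 800, '2024': 1100}
```

The same loop with a different grouping key (e.g., item instead of year) yields the aggregate for another combination of dimensions.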
Data Discretization
Data discretization is the process of converting continuous data
attributes into discrete intervals or bins. This transformation is
particularly useful for simplifying data analysis, visualizing
patterns, and preparing data for algorithms that work better with
discrete values (e.g., certain classification algorithms).
Data discretization can be categorized by how it is performed:
1. Supervised discretization uses class information.
2. Unsupervised discretization does not use prior class information.
3. Other techniques are:
• Binning methods
• Histogram analysis
• Cluster analysis
• Decision tree analysis
• Correlation analysis
Discretization refers to transforming continuous numerical data into
discrete categories or intervals.
Concept hierarchy generation involves replacing raw attribute values
with higher-level categories or concepts (e.g., replacing an age
value such as "25" with "young").
Purpose: 1) These techniques simplify the representation of data for
analysis. 2) They are essential in mining data at multiple
abstraction levels, enabling pattern detection and insights.
Discretization and Concept Hierarchy Generation for Numerical Data
Numerosity reduction: discretization reduces the volume of data by
grouping continuous values into intervals, lowering complexity
without significant information loss.
Example: replace individual age values (1, 2, 3, ...) with intervals
(e.g., "1-10", "11-20").
4. Cluster analysis
5. Discretization by intuitive partitioning
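The age example above as a sketch, with a hypothetical concept hierarchy supplying the higher-level labels:

```python
# Discretization sketch: map raw ages to intervals, and then to
# higher-level concepts (a simple concept hierarchy).
ages = [3, 12, 25, 37, 48, 61, 74]

def age_interval(age, width=10):
    lo = ((age - 1) // width) * width + 1    # intervals 1-10, 11-20, ...
    return f"{lo}-{lo + width - 1}"

def age_concept(age):
    # Hypothetical concept hierarchy, for illustration only.
    if age <= 17:
        return "minor"
    if age <= 40:
        return "young"
    if age <= 65:
        return "middle-aged"
    return "senior"

print([age_interval(a) for a in ages])
print([age_concept(a) for a in ages])
```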
Attributes defined:
Brand: Apple, Samsung, Sony.
Type: Smartphone, Tablet, Laptop.
Model: specific models like iPhone 14, Galaxy S21, etc.
Dynamic hierarchy generation:
Case 1: Group by Brand first
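A sketch of Case 1, building the hierarchy Brand > Type > Model by grouping on Brand first; the product tuples are hypothetical, assembled from the attribute values named above (entries not named in the notes are marked as such):

```python
# Case 1 sketch: group products by Brand first, then Type, then Model.
products = [
    ("Apple", "Smartphone", "iPhone 14"),
    ("Samsung", "Smartphone", "Galaxy S21"),
    ("Apple", "Tablet", "iPad"),       # hypothetical entry
    ("Sony", "Laptop", "VAIO"),        # hypothetical entry
]

hierarchy = {}
for brand, ptype, model in products:
    hierarchy.setdefault(brand, {}).setdefault(ptype, []).append(model)

print(hierarchy["Apple"])
```

Grouping on a different attribute first (e.g., Type) would simply reorder the nesting, which is the point of dynamic hierarchy generation.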