3. Data Preprocessing
• Metadata for each attribute includes its name, meaning, data type,
permitted range of values, and null rules for handling blank, zero, or
null values.
• Such metadata can be used to help avoid errors in schema
integration.
Data Integration: Redundancy and Correlation Analysis
• Redundancy is another important issue in data integration. An
attribute (such as annual revenue) may be redundant if it can be
derived from another attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis. Given
two attributes, such analysis can measure how strongly one attribute
implies the other, based on the available data.
a) χ2 (chi-square) test for Nominal Data
b) Correlation Coefficient for Numeric Data
c) Covariance of Numeric Data
Data Integration: Redundancy and Correlation Analysis
a) χ2 (chi-square) test for Nominal Data
• A correlation relationship between two nominal attributes, A and B,
can be discovered by a χ² (chi-square) test.
• Suppose A has c distinct values, namely a1,a2,…,ac.
• B has r distinct values, namely b1, b2,…,br.
• The data tuples described by A and B can be shown as a contingency
table, with the c values of A making up the columns and the r values
of B making up the rows.
• The χ² value is computed as
  χ² = Σᵢ₌₁..c Σⱼ₌₁..r (oᵢⱼ − eᵢⱼ)² / eᵢⱼ
  where oᵢⱼ is the observed frequency (i.e., actual count) of the joint
  event (Aᵢ, Bⱼ) and eᵢⱼ is the expected frequency of (Aᵢ, Bⱼ), which can
  be computed as
  eᵢⱼ = (count(A = aᵢ) × count(B = bⱼ)) / n
  where n is the number of data tuples.
• Note: since the computed χ² value exceeds the critical value from the χ²
distribution table (for the relevant degrees of freedom and significance
level), we can reject the hypothesis that gender and preferred reading are
independent and conclude that the two attributes are (strongly) correlated
for the given group of people.
Data Integration: Redundancy and Correlation Analysis
b) Correlation Coefficient for Numeric Data
• We can evaluate the correlation between two numeric attributes, A and B,
by computing the correlation coefficient (also known as Pearson's
product-moment coefficient):
  r_A,B = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n·σ_A·σ_B)
  where n is the number of tuples, Ā and B̄ are the respective means of A
  and B, and σ_A and σ_B are their standard deviations. Note that
  −1 ≤ r_A,B ≤ +1.
• If r_A,B > 0, A and B are positively correlated (A's values increase as
B's do). The higher the value, the stronger the correlation.
• If r_A,B = 0, there is no (linear) correlation between A and B.
• If r_A,B < 0, A and B are negatively correlated.
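• For illustration, a minimal numpy sketch (hypothetical values) computing Pearson's coefficient for two numeric attributes:

```python
import numpy as np

# Hypothetical paired observations of two numeric attributes A and B.
A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 7.0, 9.0, 12.0])

# Pearson's product-moment coefficient: the covariance of A and B divided
# by the product of their standard deviations.
r = np.corrcoef(A, B)[0, 1]
print(r)   # close to +1, so A and B are strongly positively correlated
```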
Visually Evaluating Correlation
• Figure: scatter plots of attribute pairs can be used to visually evaluate
the correlation between two attributes.
1) Dimensionality Reduction: PCA
• PCA searches for k n-dimensional orthonormal vectors that can best be
used to represent the data, where k ≤ n. The original data are thus
projected onto a much smaller space, resulting in dimensionality reduction.
➢Normalize input data: Each attribute falls within the same range.
➢Compute k orthonormal (unit) vectors, i.e., principal components.
➢Each input data (vector) is a linear combination of the k principal
component vectors.
➢The principal components are sorted in order of decreasing
“significance” or strength.
➢Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
1) Dimensionality Reduction: PCA
• Steps for PCA:
• Step 1: Standardization (or Data normalization):
➢PCA normalizes the input data, so that each attribute falls within the
same range.
➢Normalization helps ensure that attributes with large domains will not
dominate attributes with smaller domains.
✓ Calculate the mean of each attribute and subtract it from every value
(mean-centering); optionally divide by the standard deviation as well.
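• A minimal numpy sketch of this step, using the data of Example 1 from later in this section (numpy assumed as the tool):

```python
import numpy as np

# Example 1 data: rows are samples E1..E4, columns are attributes X1 and X2.
X = np.array([[4.0, 11.0],
              [8.0, 4.0],
              [13.0, 5.0],
              [7.0, 14.0]])

# Step 1: calculate the mean of each attribute and mean-center the data.
# (The worked example only mean-centers; dividing by the standard deviation
#  as well would give fully standardized data.)
means = X.mean(axis=0)        # [8.0, 8.5]
X_centered = X - means
```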
1) Dimensionality Reduction: PCA
• Steps for PCA:
• Step 2: Covariance matrix (Find Covariance Matrix to identify
correlation):
➢This step calculates the covariance matrix of the standardized data.
➢Covariance matrix shows how each variable is related to every other
variable in the dataset.
▪ If a covariance value is positive, the corresponding variables are
positively correlated.
▪ If a covariance value is negative, the corresponding variables are
inversely correlated.
▪ If a covariance value is zero, the corresponding variables are not
correlated.
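• Continuing the sketch, the covariance matrix of the mean-centered Example 1 data can be computed as follows (the values printed are the ones used later in the worked example):

```python
import numpy as np

X = np.array([[4.0, 11.0], [8.0, 4.0], [13.0, 5.0], [7.0, 14.0]])
X_centered = X - X.mean(axis=0)

# Step 2: sample covariance matrix (divisor n - 1) of the centered data.
# Off-diagonal entries show how pairs of attributes vary together.
C = np.cov(X_centered, rowvar=False)
print(C)   # [[ 14. -11.]
           #  [-11.  23.]]
```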
1) Dimensionality Reduction: PCA
• Step 3: Calculate the eigenvalues of the covariance matrix:
➢The characteristic equation is used to find the eigenvalues:
det(A − λI) = 0
➢where A is the covariance matrix, I is the identity matrix, and det(·)
denotes the determinant.
➢Expanding the determinant gives a polynomial equation in λ (a quadratic
for a 2×2 matrix); the solutions λ of this characteristic equation are
the eigenvalues.
➢An eigenvalue is a number representing the amount of variance present
in the data along a given direction.
➢An N×N matrix has N eigenvalues.
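• A quick check of Step 3 for the covariance matrix of Example 1, treating the characteristic equation as a quadratic (a small sketch; numpy assumed):

```python
import numpy as np

# For C = [[14, -11], [-11, 23]], det(C - lambda*I) = 0 expands to
# lambda^2 - 37*lambda + 201 = 0; its roots are the eigenvalues.
eigenvalues = np.roots([1.0, -37.0, 201.0])
print(eigenvalues)   # approximately [30.3849, 6.6151]
```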
1) Dimensionality Reduction: PCA
• Step 4: Find the eigenvectors of the covariance matrix:
(A − λI)X = 0, or equivalently AX = λX
➢The eigenvectors represent the directions in which the data varies
the most, while the eigenvalues represent the amount of variation
along each eigenvector.
➢Each eigenvector has its corresponding eigenvalue.
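• In practice, Steps 3 and 4 are usually done together with a standard eigendecomposition routine; a sketch for the Example 1 covariance matrix (eigenvector signs may differ from a hand calculation):

```python
import numpy as np

C = np.array([[14.0, -11.0],
              [-11.0, 23.0]])

# eigh is intended for symmetric matrices such as a covariance matrix;
# it returns eigenvalues in ascending order and unit-length eigenvectors
# as the columns of the second array.
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)   # ~ [ 6.6151, 30.3849]
print(eigvecs)   # each column is the eigenvector for the matching eigenvalue
```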
1) Dimensionality Reduction: PCA
• Step 5: Find a unit (or feature) eigenvector:
||X|| = √(x₁² + x₂² + ⋯ + xₙ²)
➢Find the magnitude ||X|| of the eigenvector X.
➢Divide each element of the eigenvector by ||X||.
➢This gives the final (unit) eigenvector E1. Similarly, a final eigenvector
can be found for every λ.
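• A small sketch of Step 5 for the hand-computed eigenvector of Example 1:

```python
import numpy as np

# Unnormalized eigenvector for lambda1 = 30.3849 in Example 1: X = [11, 14 - lambda1].
x = np.array([11.0, 14.0 - 30.3849])

magnitude = np.linalg.norm(x)   # ||X|| = sqrt(11^2 + (14 - lambda1)^2) ~ 19.7348
e1 = x / magnitude              # divide each element by ||X||
print(magnitude, e1)            # ~ 19.7348, [0.5574, -0.8303]
```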
1) Dimensionality Reduction: PCA
• Step 6: Find Principal components:
PCᵢ = Eᵢᵀ · [ x₁ − X̄₁,  x₂ − X̄₂,  …,  xₙ − X̄ₙ ]ᵀ
➢Transpose the final (unit) eigenvector and multiply it by the
mean-centered values of the original data.
➢After multiplying the two matrices, we get the principal component
value corresponding to each data point.
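• A small sketch of Step 6 for the first sample of Example 1:

```python
import numpy as np

e1 = np.array([0.5574, -0.8303])   # unit eigenvector for the largest eigenvalue
sample = np.array([4.0, 11.0])     # sample E1 of Example 1
means = np.array([8.0, 8.5])       # attribute means

# Principal component value for this point: E1^T . (x - x_bar)
pc1 = e1 @ (sample - means)
print(pc1)                         # ~ -4.3053
```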
1) Dimensionality Reduction: PCA
• Step 7: Transform the data (Visualization):
➢The final step is to transform the original data into the lower-
dimensional space defined by the principal components.
➢The transformation does not modify the original data itself but
instead provides a new perspective to better represent the data.
1) Dimensionality Reduction: PCA
• Example 1: consider four samples E1–E4 described by two attributes, X1 and X2:

        X1   X2
  E1     4   11
  E2     8    4
  E3    13    5
  E4     7   14

• Step 1: Standardization (or data normalization): calculate the mean of
each attribute:
  X̄1 = 8,  X̄2 = 8.5
• Step 2: The covariance matrix of the mean-centered data is
  C = |  14  −11 |
      | −11   23 |
1) Dimensionality Reduction: PCA
• Example 1 continued…
• Step 3: Calculate the eigenvalues of the covariance matrix:
  det(C − λI) = 0
  | 14 − λ    −11    |
  |  −11     23 − λ  | = 0
  (14 − λ)(23 − λ) − (−11 × −11) = 0
  λ² − 37λ + 201 = 0
  λ1 = 30.3849,  λ2 = 6.6151
• Steps 4–5: For each eigenvalue, solving (C − λI)X = 0 gives an
(unnormalized) eigenvector X = [11, 14 − λ]ᵀ; dividing by its magnitude
gives the unit eigenvector:
  ||X|| = √(x₁² + x₂² + ⋯ + xₙ²)
  For λ1:  ||X|| = √(11² + (14 − λ1)²) = 19.7348
  E1 = [ 11/19.7348,  (14 − λ1)/19.7348 ]ᵀ = [ 0.5574, −0.8303 ]ᵀ
  For λ2:  ||X|| = √(11² + (14 − λ2)²) = 13.2490
  E2 = [ 11/13.2490,  (14 − λ2)/13.2490 ]ᵀ = [ 0.8303, 0.5574 ]ᵀ
1) Dimensionality Reduction: PCA
• Example 1 continued…
• Step 6: Find the principal components. For each sample, project the
mean-centered values onto the unit eigenvector E1:
  PC1 = E1ᵀ · [ X1 − X̄1,  X2 − X̄2 ]ᵀ = 0.5574(X1 − 8) − 0.8303(X2 − 8.5)
  Sample E1 (4, 11):   0.5574(4 − 8)  − 0.8303(11 − 8.5) = −4.3053
  Sample E2 (8, 4):    0.5574(8 − 8)  − 0.8303(4 − 8.5)  =  3.7363
  Sample E3 (13, 5):   0.5574(13 − 8) − 0.8303(5 − 8.5)  =  5.6930
  Sample E4 (7, 14):   0.5574(7 − 8)  − 0.8303(14 − 8.5) = −5.1240
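• Putting the steps together, a minimal numpy sketch that reproduces the numbers of Example 1 (eigenvector signs, and hence the signs of the projections, may be flipped):

```python
import numpy as np

# Example 1 data: four samples (E1..E4) on attributes X1 and X2.
X = np.array([[4.0, 11.0],
              [8.0, 4.0],
              [13.0, 5.0],
              [7.0, 14.0]])

# Steps 1-2: mean-center and compute the covariance matrix.
X_centered = X - X.mean(axis=0)
C = np.cov(X_centered, rowvar=False)          # [[14, -11], [-11, 23]]

# Steps 3-5: eigenvalues and unit eigenvectors, sorted by decreasing variance.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)                                # ~ [30.3849, 6.6151]

# Step 6: project every sample onto the first principal component.
pc1 = X_centered @ eigvecs[:, 0]
print(pc1)                                    # ~ [-4.3053, 3.7363, 5.6930, -5.1240]
                                              # (possibly with flipped signs)
```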
1) Dimensionality Reduction: Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of
which may be irrelevant to the mining task or redundant.
• Redundant attributes.
➢Duplicate much or all of the information contained in one or more
other attributes
➢E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
➢Contain no information that is useful for the data mining task at
hand
➢E.g., students' ID is often irrelevant to the task of predicting
students' GPA
• The added volume of irrelevant or redundant attributes can slow down
the mining process and can lead to the discovery of patterns of poor
quality.
1) Dimensionality Reduction: Attribute Subset Selection
• Attribute subset selection is also called feature subset selection.
• Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).
• The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained
using all attributes.
• It reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
1) Dimensionality Reduction: Attribute Subset Selection
• For n attributes, there are 2ⁿ possible subsets (each attribute can
either be taken or not taken).
• An exhaustive search for the optimal subset of attributes can be
prohibitively expensive, especially as n and the number of data
classes increase.
• So, heuristic methods that explore a reduced search space are
commonly used for attribute subset selection.
• Heuristic methods are typically greedy in that, while searching
through attribute space, they always make what looks to be the best
choice at the time.
• Their strategy is to make a locally optimal choice in the hope that this
will lead to a globally optimal solution.
• Such greedy methods are effective in practice and may come close to
estimating an optimal solution.
1) Dimensionality Reduction: Attribute Subset Selection
• The “best” (and “worst”) attributes are typically determined using
tests of statistical significance, which assume that the attributes are
independent of one another.
• Many other attribute evaluation measures can be used such as the
information gain measure used in building decision trees for
classification.
1) Dimensionality Reduction: Attribute Subset Selection
Figure: Greedy (heuristic) methods for attribute subset selection.
1) Dimensionality Reduction: Attribute Subset Selection
• 1. Stepwise forward selection: The procedure starts with an empty
set of attributes as the reduced set. The best of the original attributes
is determined and added to the reduced set.
• 2. Stepwise backward elimination: The procedure starts with the full
set of attributes. At each step, it removes the worst attribute
remaining in the set.
• 3. Combination of forward selection and backward elimination: The
procedure selects the best attribute and removes the worst from
among the remaining attributes.
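• As a sketch of the greedy idea (not a library routine), stepwise forward selection can be written as follows, assuming a caller-supplied score function such as information gain or validation accuracy:

```python
def forward_selection(attributes, score, k):
    """Greedy stepwise forward selection.

    attributes: list of candidate attribute names.
    score:      caller-supplied function mapping a subset of attributes to a
                quality measure (e.g., information gain or validation accuracy).
    k:          number of attributes to keep.
    """
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < k:
        # Greedy step: add the attribute that looks best right now.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected
```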
1) Dimensionality Reduction: Attribute Subset Selection
• 4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5,
and CART) were originally intended for classification.
➢Decision tree induction constructs a flowchart-like structure where
each internal (non-leaf) node denotes a test on an attribute.
➢Each branch corresponds to an outcome of the test, and each
external (leaf) node denotes a class prediction.
➢At each node, the algorithm chooses the "best" attribute to partition
the data into individual classes.
➢When decision tree induction is used for attribute subset selection, a
tree is constructed from the given data; all attributes that do not
appear in the tree are assumed to be irrelevant, and the attributes
that do appear form the reduced subset.
2) Numerosity reduction: Regression and log-linear models
• In (simple) linear regression, the data are modeled to fit a straight line:
  y = wx + b
  where y is the response (dependent) variable, x is the predictor
  (independent) variable, and w and b are the regression coefficients
  (slope and y-intercept).
• The coefficients can be computed as:
  w = Cov(x, y) / Var(x) = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
  b = ȳ − w·x̄
• Example: fit a line to the following data.

    x    y    x̄    ȳ   x − x̄   y − ȳ   (x − x̄)(y − ȳ)   (x − x̄)²
    8   10   12   16    −4      −6          24              16
   10   13   12   16    −2      −3           6               4
   12   16   12   16     0       0           0               0
   14   19   12   16     2       3           6               4
   16   22   12   16     4       6          24              16
  Σx = 60,  Σy = 80,  Σ(x − x̄)(y − ȳ) = 60,  Σ(x − x̄)² = 40

  w = 60 / 40 = 1.5
  b = 16 − 1.5 × 12 = −2
  so the fitted model is y = 1.5x − 2.
2) Numerosity reduction: Regression and log-linear models
• Using the fitted model y = 1.5x − 2:
• If the value of x is 18, find the value of y.
  y = 1.5 × 18 − 2 = 25
2) Numerosity reduction: Regression and log-linear models
• If the value of the independent attribute x is 20, then find the value
of the dependent attribute y.
  y = 1.5 × 20 − 2 = 28
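• A minimal numpy sketch that reproduces this least-squares fit and the two predictions:

```python
import numpy as np

x = np.array([8.0, 10.0, 12.0, 14.0, 16.0])
y = np.array([10.0, 13.0, 16.0, 19.0, 22.0])

# Slope and intercept from the closed-form least-squares formulas above.
w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()
print(w, b)          # 1.5 -2.0

# Predictions for the exercise values of the independent attribute.
print(w * 18 + b)    # 25.0
print(w * 20 + b)    # 28.0
```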
2) Numerosity reduction: Regression and log-linear models
• Log-linear models approximate discrete multidimensional probability
distributions.
• Given a set of tuples in n dimensions (e.g., described by n attributes),
we can consider each tuple as a point in an n-dimensional space.
• Log-linear models can be used to estimate the probability of each
point in a multidimensional space for a set of discretized attributes,
based on a smaller subset of dimensional combinations.
• This allows a higher-dimensional data space to be constructed from
lower-dimensional spaces.
• In practice, log-linear models are often built with a backward
elimination procedure, which removes higher-order dimension combinations
step by step, thereby reducing the dimensionality of the model.
2) Numerosity reduction: Histograms
• Histograms use binning to approximate data distributions and are a
popular form of data reduction.
• A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, referred to as buckets or bins.
• If each bucket represents only a single attribute–value/frequency pair,
the buckets are called singleton buckets.
• Often, buckets instead represent continuous ranges for the given
attribute.
2) Numerosity reduction: Histograms
• Example: A list of prices for commonly sold items are given.
• The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10,
12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30,
30.
• Figure shows a histogram for the data
using singleton buckets.
2) Numerosity reduction: Histograms
• Example: A list of prices for commonly sold items are given.
• The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10,
12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30,
30.
• To further reduce the data, it is
common to have each bucket denote a
continuous value range for the given
attribute.
• In the figure, each bucket represents a different price range of
width 10.
2) Numerosity reduction: Histograms
• How are the buckets determined and the attribute values partitioned?
• There are several partitioning rules, including the following:
➢Equal-width: In an equal-width histogram, the width of each bucket
range is uniform (e.g., the width of 10 for the buckets in Figure).
➢Equal-frequency (or equal-depth): In an equal-frequency histogram,
the buckets are created so that, roughly, the frequency of each
bucket is constant (i.e., each bucket contains roughly the same
number of contiguous data samples).
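• For illustration, a small numpy sketch of both partitioning rules on the price data above (three buckets chosen arbitrarily):

```python
import numpy as np

# Sorted list of prices from the example.
prices = np.array([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                   15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
                   20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
                   28, 28, 30, 30, 30])

# Equal-width histogram: 3 buckets, each covering a price range of equal width.
counts, edges = np.histogram(prices, bins=3)
print(edges)    # bucket boundaries (uniform width)
print(counts)   # number of prices falling in each bucket

# Equal-frequency (equal-depth) buckets: boundaries at the data quantiles,
# so each bucket holds roughly the same number of values.
depth_edges = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
print(depth_edges)
```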
2) Numerosity reduction: Histograms
• Histograms are highly effective at approximating both sparse and
dense data, as well as highly skewed and uniform data.
• The histograms described before for single attributes can be extended
for multiple attributes.
• Multidimensional histograms can capture dependencies between
attributes.
• These histograms have been found effective in approximating data
with up to five attributes.
• More studies are needed regarding the effectiveness of
multidimensional histograms for high dimensionalities.
• Singleton buckets are useful for storing high-frequency outliers.
2) Numerosity reduction: Clustering
• Clustering techniques partition the
objects into groups, or clusters, so that
objects within a cluster are similar to
one another and dissimilar to objects in
other clusters.
• Similarity is commonly defined in terms
of how close the objects are in space,
based on a distance function.
• The quality of a cluster may be
represented by its diameter, the
maximum distance between any two
objects in the cluster.
2) Numerosity reduction: Clustering
• Figure shows a 2-D plot of customer
data with respect to customer locations
in a city. Three data clusters are visible.
• Type of clustering models:
i. Centroid models:-K-Means, K-Medoid.
ii. Hierarchical clustering:-Agglomerative
Hierarchical Clustering, BIRCH (Balanced
Iterative Reducing and Clustering using
Hierarchies).
iii.Distribution (probability-based) models: e.g., Gaussian (mixture)
distributions fit with the Expectation-Maximization algorithm.
iv.Density models:-DBSCAN (Density-Based
Spatial Clustering of Applications with
Noise),OPTICS (Ordering Points to Identify
the Clustering Structure)
2) Numerosity reduction: Clustering
• In data reduction, the cluster representations of the data are used to
replace the actual data.
• The effectiveness of this technique depends on the data’s nature.
• It is much more effective for data that can be organized into distinct
clusters than for smeared data.
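• For illustration, a minimal sketch of clustering-based reduction using scikit-learn's KMeans (assumed available) on hypothetical 2-D customer locations, where the three centroids plus their counts replace the original points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D customer-location data (x, y coordinates in a city).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2))
                    for c in ([0, 0], [10, 0], [5, 8])])

# Partition into 3 clusters; keep only the centroids (and counts) as the
# reduced representation of the 150 original points.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(km.cluster_centers_)        # 3 representative points
print(np.bincount(km.labels_))    # how many original points each replaces
```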
2) Numerosity reduction: Sampling
• Sampling can be used as a data reduction technique because it allows
a large data set to be represented by a much smaller random data
sample (or subset).
• Suppose that a large data set, D, contains N tuples.
➢Simple random sample without replacement (SRSWOR) of size s:
This is created by drawing s of the N tuples from D, where the
probability of drawing any tuple in D is 1/N, that is, all tuples are
equally likely to be sampled.
➢Simple random sample with replacement
(SRSWR) of size s: This is similar to SRSWOR,
except that each time a tuple is drawn from
D, it is recorded and then replaced. That is,
after a tuple is drawn, it is placed back in D
so that it may be drawn again.
2) Numerosity reduction: Sampling
➢Cluster sample: If the tuples in D are grouped into M mutually
disjoint clusters, then a simple random sample (SRS) of s clusters can
be obtained, where s < M.
➢For example, tuples in a database are usually retrieved a page at a
time, so that each page can be considered a cluster.
➢A reduced data representation can be obtained by applying, say,
SRSWOR to the pages, resulting in a cluster sample of the tuples.
➢Other clustering criteria conveying rich semantics can also be
explored.
2) Numerosity reduction: Sampling
➢Stratified sample: D is divided into mutually disjoint parts called
strata.
➢A stratified sample of D is generated by obtaining an SRS at each
stratum.
➢Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data).
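• For illustration, a minimal numpy sketch of SRSWOR and SRSWR on a stand-in data set:

```python
import numpy as np

rng = np.random.default_rng(42)
D = np.arange(1000)          # stand-in for a data set of N = 1000 tuples
s = 50

# SRSWOR: each tuple can be drawn at most once.
srswor = rng.choice(D, size=s, replace=False)

# SRSWR: a drawn tuple is "placed back", so it may be drawn again.
srswr = rng.choice(D, size=s, replace=True)

print(len(set(srswor)), len(set(srswr)))   # first is always s; second may be smaller
```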
2) Numerosity reduction: Data Cube Aggregation
• For example, suppose the data consist of sales per quarter for the
years 2008 to 2010.
• However, you are interested in the annual sales (total per year),
rather than the total per quarter.
• Thus, the data can be aggregated so that the resulting data
summarize the total sales per year instead of per quarter.
• The resulting data is smaller in volume, without loss of information
necessary for the analysis task.
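• For illustration, a minimal pandas sketch of this aggregation with hypothetical quarterly amounts:

```python
import pandas as pd

# Hypothetical quarterly sales figures for 2008-2010.
sales = pd.DataFrame({
    "year":    [2008]*4 + [2009]*4 + [2010]*4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "amount":  [224, 408, 350, 586, 230, 420, 360, 590, 240, 430, 370, 600],
})

# Aggregate quarterly figures into annual totals: 12 rows become 3.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```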
4. Data Transformation
• In the preprocessing step, the data are transformed or consolidated
so that the resulting mining process may be more efficient, and the
patterns found may be easier to understand.
• In data transformation, the data are transformed or consolidated into
forms appropriate for mining.
• Strategies or methods for data transformation
➢1. Smoothing: which works to remove noise from the data.
Techniques include binning, regression, and clustering.
➢2. Attribute construction (or feature construction): where new
attributes are constructed and added from the given set of
attributes to help the mining process.
4. Data Transformation
➢ 3. Aggregation: where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total amounts.
This step is typically used in constructing a data cube for data
analysis at multiple abstraction levels.
➢ 4. Normalization: where the attribute data are scaled so as to fall
within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
➢ 5. Discretization: where the raw values of a numeric attribute (e.g.,
age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or
conceptual labels (e.g., youth, adult, senior). The labels, in turn, can
be recursively organized into higher-level concepts, resulting in a
concept hierarchy for the numeric attribute.
4. Data Transformation
➢ 6. Concept hierarchy generation for nominal data: where
attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be
automatically defined at the schema definition level.
• There is much overlap between the major data preprocessing tasks.
The first three of these strategies were discussed earlier.
• Smoothing is a form of data cleaning.
• Attribute construction and aggregation were discussed earlier in the
context of data reduction.
• So, here we concentrate on the remaining three strategies.
4. Data Transformation: Data Transformation by Normalization
• The measurement unit used can affect the data analysis.
• For example, changing measurement units from meters to inches for
height, or from kilograms to pounds for weight, may lead to very
different results.
• To avoid the issues of measurement units, the data must be
normalized or standardized.
• This involves transforming the data to fall within a smaller or common
range such as [−1,1] or [0.0, 1.0].
• Normalizing the data attempts to give all attributes an equal weight.
• There are many methods for data normalization.
• We study min-max normalization, z-score normalization, and
normalization by decimal scaling.
• Let A be a numeric attribute with n observed values, v1, v2, …, vn.
4. Data Transformation: Data Transformation by Normalization
• Min-max normalization performs a linear transformation on the
original data.
• Suppose that minA and maxA are the minimum and maximum values
of an attribute, A.
• Min-max normalization maps a value, vi, of A to v′i in the range
[new_minA, new_maxA] by computing
  v′i = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
• For the attribute income, suppose the minimum and maximum values are
$12,000 and $98,000, respectively, and we would like to map income to
the range [0.0, 1.0].
• A given value of $73,600 for income is then transformed to
  ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
4. Data Transformation: Data Transformation by Normalization
• In z-score normalization (or zero-mean normalization), the values of an
attribute A are normalized based on the mean (i.e., average) Ā and the
standard deviation σA of A.
• A value vi of A is normalized to v′i by computing
  v′i = (vi − Ā) / σA
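• For illustration, a minimal numpy sketch of min-max and z-score normalization on hypothetical income values (including the $12,000, $98,000, and $73,600 figures from the example):

```python
import numpy as np

income = np.array([12_000.0, 45_000.0, 73_600.0, 98_000.0])   # hypothetical values

# Min-max normalization to the range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min
print(minmax)          # 73,600 maps to ~0.716, as in the example above

# z-score (zero-mean) normalization: subtract the mean, divide by the std.
zscore = (income - income.mean()) / income.std()
print(zscore)
```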