L2 Data Preparation
Data Preprocessing
Why do we need to prepare the data?
In real-world applications, data can be inconsistent, incomplete, and/or noisy, due to:
Data entry, data transmission, or data collection problems
Discrepancy in naming conventions
Duplicated records
Incomplete or missing data
Contradictions in data
What happens when the data cannot be trusted?
Can the decisions based on it be trusted? Decision making is jeopardized
Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Cleaning
Real-world application data can be incomplete, noisy, and inconsistent:
No recorded values for some attributes
Not considered at time of entry
Random errors
Irrelevant records or fields
Binning: Example
Original data for "price" (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
Smoothing by Bin Means
The value of every record in each bin is changed to the mean value for that bin. If it is necessary to keep the values as integers, the bin means are rounded to the nearest integer.
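A minimal sketch of this smoothing step in Python, assuming equal-frequency bins of depth 3 for the "price" data above (the bin depth is an assumption, not stated on this slide):

    # Smoothing by bin means: partition sorted data into equal-frequency
    # bins, then replace every value in a bin with the bin's rounded mean.
    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
    bin_depth = 3  # assumed equal-frequency bin depth

    smoothed = []
    for i in range(0, len(prices), bin_depth):
        bin_values = prices[i:i + bin_depth]
        mean = round(sum(bin_values) / len(bin_values))  # rounded to integer
        smoothed.extend([mean] * len(bin_values))

    print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]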
Data Integration
Data analysis may require a combination of data from multiple sources into a coherent data store
Challenges in Data Integration:
Schema integration: CID = C_number = Cust-id = cust#
Semantic heterogeneity
Data value conflicts (different representations or scales, etc.)
Synchronization (especially important in Web usage mining)
Redundant attributes (an attribute is redundant if it can be derived from other attributes) --
may be able to identify redundancies via correlation analysis:
corr(A, B) = Pr(A, B) / (Pr(A) · Pr(B))
= 1: A and B are independent
> 1: positive correlation
< 1: negative correlation
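For illustration, a small sketch computing this measure (often called lift) for two binary attributes; the record counts are hypothetical:

    # corr(A, B) = Pr(A, B) / (Pr(A) * Pr(B)) from co-occurrence counts
    n = 1000     # total number of records (hypothetical)
    n_a = 600    # records where A holds
    n_b = 400    # records where B holds
    n_ab = 300   # records where both A and B hold

    corr = (n_ab / n) / ((n_a / n) * (n_b / n))
    print(corr)  # 1.25 > 1, so A and B are positively correlated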
Normalization: Example
z-score normalization: v’ = (v - Mean) / Stdev
Example: normalizing the “Humidity” attribute:
Normalized values for the "Humidity" attribute (mean = 80.3, stdev = 9.84):

Humidity    Normalized Humidity
85           0.48
90           0.99
78          -0.23
96           1.60
80          -0.03
70          -1.05
65          -1.55
95           1.49
70          -1.05
80          -0.03
70          -1.05
90           0.99
75          -0.54
80          -0.03
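A minimal sketch reproducing this table in Python; the slide's stdev of 9.84 corresponds to the sample standard deviation (dividing by n - 1):

    import statistics

    humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
    mean = statistics.mean(humidity)    # ~80.3
    stdev = statistics.stdev(humidity)  # ~9.84 (sample standard deviation)

    z_scores = [round((v - mean) / stdev, 2) for v in humidity]
    print(z_scores)  # [0.48, 0.99, -0.23, 1.6, -0.03, -1.05, -1.55, ...]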
Normalization: Example II
Min-max normalization on an employee database:
v' = (v - min) / (max - min) × (new_max - new_min) + new_min
Range for salary: 100000 - 19000 = 81000
Range for age: 52 - 27 = 25
New min for age and salary = 0; new max for age and salary = 1
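A small sketch of min-max normalization under these assumptions; the ranges come from the slide, while the sample salary and age values are hypothetical:

    # Min-max: v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
        return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

    # Ranges from the slide: salary in [19000, 100000], age in [27, 52]
    print(min_max(55000, 19000, 100000))  # 0.444..., hypothetical salary
    print(min_max(40, 27, 52))            # 0.52, hypothetical age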
Discretization - Example
Example: discretizing the "Humidity" attribute using 3 bins.
Bin ranges: Low = 60-69, Normal = 70-79, High = 80+

Humidity    Discretized Humidity
85          High
90          High
78          Normal
96          High
80          High
70          Normal
65          Low
95          High
70          Normal
80          High
70          Normal
90          High
75          Normal
80          High
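A sketch of this discretization using the bin ranges above:

    humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]

    def discretize(v):
        # Bin ranges from the slide: Low = 60-69, Normal = 70-79, High = 80+
        if v < 70:
            return "Low"
        elif v < 80:
            return "Normal"
        return "High"

    labels = [discretize(v) for v in humidity]
    print(labels)  # ['High', 'High', 'Normal', 'High', 'High', ...]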
Data Reduction
Data is often too large; reducing data can improve performance
Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) results
Data reduction includes:
Data cube aggregation
Dimensionality reduction
Discretization
Numerosity reduction:
  Regression
  Histograms
  Clustering
  Sampling
Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Principal Component Analysis (see the sketch below)
Attribute subset selection
Attribute or feature generation
[Figure: a 2-D data set in the original x1-x2 space with its principal component axes]
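As an illustration of Principal Component Analysis, a minimal sketch using NumPy (eigendecomposition of the covariance matrix); the toy 2-D data set is hypothetical:

    import numpy as np

    # Toy 2-D data (hypothetical); PCA rotates it onto the directions of
    # greatest variance so that low-variance components can be dropped.
    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                  [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

    X_centered = X - X.mean(axis=0)         # center each attribute
    cov = np.cov(X_centered, rowvar=False)  # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    components = eigvecs[:, np.argsort(eigvals)[::-1]]  # sort descending

    # Keep only the first principal component: 2-D -> 1-D reduction
    X_reduced = X_centered @ components[:, :1]
    print(X_reduced.shape)  # (8, 1)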
Regression Analysis
A collection of techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called a response variable or measurement) and of one or more independent variables (also called explanatory variables or predictors)
The parameters are estimated so as to obtain a "best fit" of the data
Typically the best fit is evaluated using the least squares method, but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: a fitted line y = x + 1; the vertical distance between an observed value Y1 and the fitted value Y1' at X1 is the residual minimized by least squares]
Regression Analysis
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
Using the least squares criterion on known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
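For example, the least-squares estimates of w and b have a closed form; a sketch with hypothetical data:

    import numpy as np

    # Hypothetical known values of X and Y
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

    # Closed-form least-squares coefficients for Y = w X + b
    w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - w * x.mean()
    print(w, b)  # slope ~0.99, intercept ~1.05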
Log-linear models
Approximate discrete multidimensional probability distributions
Estimate the probability of each point in a multi-dimensional space for a
set of discretized attributes, based on a smaller subset of dimensions
Useful for dimensionality reduction and data smoothing
Numerosity Reduction
Reduction via histograms: divide data into buckets and store a representation of each bucket (sum, count, etc.)
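A sketch of an equal-width histogram reduction that stores only per-bucket summaries instead of the raw values; the bucket width and data are assumptions:

    prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
    bucket_width = 10  # assumed equal-width buckets

    # Replace raw values with (bucket range, count, sum) summaries
    buckets = {}
    for p in prices:
        lo = (p // bucket_width) * bucket_width
        count, total = buckets.get(lo, (0, 0))
        buckets[lo] = (count + 1, total + p)

    for lo, (count, total) in sorted(buckets.items()):
        print(f"[{lo}, {lo + bucket_width}): count={count}, sum={total}")
    # [0, 10): count=2, sum=12 ... [30, 40): count=1, sum=34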
Sampling Techniques
[Figure: raw data reduced by simple random sampling and by cluster/stratified sampling]
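A minimal sketch of stratified sampling, drawing proportionally from each stratum; the strata and records are hypothetical:

    import random

    # Hypothetical records grouped into strata by a class label
    strata = {
        "young":  list(range(0, 60)),    # 60 records
        "middle": list(range(60, 90)),   # 30 records
        "senior": list(range(90, 100)),  # 10 records
    }

    fraction = 0.2  # sample 20% of each stratum
    random.seed(42)
    sample = []
    for label, records in strata.items():
        k = max(1, round(len(records) * fraction))
        sample.extend(random.sample(records, k))

    print(len(sample))  # 20 records, every stratum represented proportionally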