Data Preprocessing
Purpose of Pre-processing
 Accuracy: are values correct and accurate, or wrong?
 Completeness: values not recorded, unavailable, …
 Consistency: some records modified but others not, dangling references, …
 Timeliness: is the data updated in a timely manner?
 Believability: how much can the data be trusted to be correct?
 Interpretability: how easily can the data be understood?
Various Steps in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation
Data Cleaning
 Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., City=“ ” (missing data)
 noisy: containing noise, errors, or outliers
 e.g., Wages=“−10” (an error)
 inconsistent: containing discrepancies in codes or names
 e.g., Age=“42” but Birthday=“03/07/2020”
 e.g., grades once recorded as “1, 2, 3”, now recorded as ratings “A, B, C”
 duplicate records
Incomplete (Missing) Data
 Data is not always available
 e.g., a broker’s share, customer income
 Missing data may be due to
 equipment malfunction
 values inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data not considered important at the time of entry
 history or changes of the data not registered
 Missing data may need to be inferred
Handling of Missing Data
 Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant, e.g., “unknown” (a new class?!)
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
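A minimal sketch of the automatic fill-in strategies above, using pandas; the column names and values here are hypothetical, not from the slides.

```python
import pandas as pd

# Hypothetical customer table with missing income values
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [52000, None, 61000, None, 58000],
})

# 1) Fill with a global constant (a sentinel standing in for "unknown")
df["income_const"] = df["income"].fillna(-1)

# 2) Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# 3) Fill with the attribute mean of samples belonging to the same class (smarter)
df["income_class_mean"] = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```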
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistencies in naming conventions
 Other data problems
 duplicate records
 incomplete data
 inconsistent data
Handling of Noisy Data
 Binning
 first sort the data and partition it into (equal-frequency) bins
 then smooth by bin means, bin medians, bin boundaries, etc.
 Regression
 smooth by fitting the data to regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and have a human check them (e.g., deal with possible outliers)
Handling of Noisy Data: Binning
 Binning involves grouping data into bins or intervals and then assigning each data point to a bin.
 Create four age bins: [0-18], [19-39], [40-59], and [60+]

Individual   Age   Sorted Age
1            32    6
2            24    18
3            120   24
4            42    28
5            18    32
6            56    42
7            6     56
8            70    70
9            90    90
10           28    120
Handling of Noisy Data: Binning
 Create four age bins: [0-18], [19-39], [40-59], and [60+]
 Bin [0-18]: midpoint = (0 + 18) / 2 = 9
 Bin [19-39]: midpoint = (19 + 39) / 2 = 29
 Bin [40-59]: midpoint = (40 + 59) / 2 = 49.5
 Bin [60+]: midpoint = (60 + ∞) / 2 = ∞ (infinity)
Handling of Noisy Data: Binning
 Assign each individual's age to the appropriate bin and replace the age with the corresponding midpoint:

Individual   Age   Bin       Midpoint
1            32    [19-39]   29
2            24    [19-39]   29
3            120   [60+]     ∞
4            42    [40-59]   49.5
5            18    [0-18]    9
6            56    [40-59]   49.5
7            6     [0-18]    9
8            70    [60+]     ∞
9            90    [60+]     ∞
10           28    [19-39]   29
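A short sketch of this binning-and-midpoint replacement in Python with pandas. The ages and bin edges follow the slides; the finite upper edge of 200 for the open-ended [60+] bin is an assumption made only so that pd.cut has a closing boundary.

```python
import pandas as pd

ages = pd.Series([32, 24, 120, 42, 18, 56, 6, 70, 90, 28], name="age")

# Bin edges from the slide; 200 is an assumed finite cap for the open-ended [60+] bin
edges = [0, 18, 39, 59, 200]
labels = ["[0-18]", "[19-39]", "[40-59]", "[60+]"]
midpoints = {"[0-18]": 9.0, "[19-39]": 29.0, "[40-59]": 49.5, "[60+]": float("inf")}

# Assign each age to a bin, then replace the age with the bin midpoint (smoothing)
bins = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)
smoothed = pd.DataFrame({
    "age": ages,
    "bin": bins,
    "midpoint": bins.map(midpoints),
})
print(smoothed)
```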
Handling of Noisy Data: Regression
 Cleaning noisy data using regression typically involves identifying and handling outliers or extreme data points that can adversely affect the model's performance.
 You have the following data for 10 houses:

Square Footage (X)   Price (Y)
1000                 200000
1200                 220000
1400                 240000
1600                 260000
1800                 280000
2000                 300000
2200                 320000
2400                 340000
2600                 360000
8000                 900000

 The linear regression model is given by: Y = β0 + β1·X
Handling of Noisy Data: Regression
 Calculate β0 (intercept) and β1 (slope) using the least squares method:
β1 = (n·ΣXY − ΣX·ΣY) / (n·ΣX² − (ΣX)²)
β0 = (ΣY − β1·ΣX) / n
 For this data the least squares fit gives β1 = 100 (slope) and β0 = 100,000 (intercept), i.e., Y = 100,000 + 100·X
 Predicting the price of the 8000-square-foot house: Y = β0 + β1·X = 100,000 + 100 × 8000 = 900,000
 Here the recorded price matches the prediction, so this point follows the overall trend; when a recorded value deviates strongly from the regression prediction, the noisy data point (X, recorded Y) is replaced with the predicted value (X, Ŷ)
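A small Python sketch of this regression-based smoothing: fit the least-squares line, then replace any recorded price that deviates strongly from the prediction. The two-standard-deviation residual rule is an assumption for illustration; on this particular data every point lies on the fitted line, so nothing gets replaced.

```python
import numpy as np

# House data from the slide: square footage (X) and price (Y)
X = np.array([1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 8000], dtype=float)
Y = np.array([200000, 220000, 240000, 260000, 280000, 300000,
              320000, 340000, 360000, 900000], dtype=float)

# Least-squares fit Y = beta0 + beta1 * X (np.polyfit returns highest degree first)
beta1, beta0 = np.polyfit(X, Y, deg=1)
print(f"beta0 = {beta0:.0f}, beta1 = {beta1:.2f}")   # 100000 and 100.00 for this data

# Residuals between recorded and predicted prices
Y_hat = beta0 + beta1 * X
residuals = Y - Y_hat

# Assumed rule: flag points deviating by more than 2 residual standard deviations;
# the $1 floor avoids flagging floating-point noise on perfectly collinear data
threshold = max(2 * residuals.std(), 1.0)
noisy = np.abs(residuals) > threshold

# Smooth: replace flagged prices with the regression prediction
Y_clean = np.where(noisy, Y_hat, Y)
print("noisy points:", list(zip(X[noisy], Y[noisy])))  # empty for this data set
```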
Handling of Noisy Data: Clustering
 Partition the data set into clusters based on similarity, and store only a cluster representation (e.g., centroid and diameter)
 Can be very effective if the data is clustered, but not if the data is “smeared”
 Can use hierarchical clustering, stored in multi-dimensional index tree structures
 There are many choices of clustering definitions and clustering algorithms
Handling of Noisy Data: Clustering
 Detecting and removing outliers using clustering is a common data preprocessing technique.
 You have the following exam scores for 15 students:

Student   Exam Score
1         85
2         88
3         90
4         92
5         95
6         97
7         99
8         100
9         101
10        103
11        105
12        107
13        110
14        112
15        150
Handling of Noisy Data: Clustering
 You decide to use K-means clustering with k = 2 clusters to detect outliers. After clustering, the cluster centers are as follows:
 Cluster 1 center: 92.3
 Cluster 2 center: 108.2
Handling of Noisy Data: Clustering
 Calculate the distance from each exam score to each cluster center:
 Distance to Cluster 1 = |Exam Score − 92.3|
 Distance to Cluster 2 = |Exam Score − 108.2|
 Flag data points whose distance to both cluster centers is greater than a threshold, for example 20, as outliers.

Student   Exam Score   Distance to Cluster 1   Distance to Cluster 2
1         85           7.3                     23.2
2         88           4.3                     20.2
3         90           2.3                     18.2
4         92           0.3                     16.2
5         95           2.7                     13.2
6         97           4.7                     11.2
7         99           6.7                     9.2
8         100          7.7                     8.2
9         101          8.7                     7.2
10        103          10.7                    5.2
11        105          12.7                    3.2
12        107          14.7                    1.2
13        110          17.7                    1.8
14        112          19.7                    3.8
15        150          57.7                    41.8

 Only student 15 (exam score 150) is more than 20 away from both cluster centers, so 150 is flagged as an outlier.
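A brief sketch of this outlier check with scikit-learn's KMeans. The distance threshold of 20 follows the slide; the exact fitted centers depend on initialization and may not match the slide's 92.3 and 108.2 exactly.

```python
import numpy as np
from sklearn.cluster import KMeans

scores = np.array([85, 88, 90, 92, 95, 97, 99, 100, 101,
                   103, 105, 107, 110, 112, 150], dtype=float).reshape(-1, 1)

# Cluster the exam scores into k = 2 groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
centers = km.cluster_centers_.ravel()
print("cluster centers:", centers)

# Distance of each score to each cluster center
dists = np.abs(scores - centers.reshape(1, -1))

# Flag scores farther than 20 from *both* centers as outliers (threshold from the slide)
outliers = scores[(dists > 20).all(axis=1)]
print("outliers:", outliers.ravel())   # only the score of 150 is flagged
```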
Handling Redundancy in Data Integration
 Redundant data occur often when integrating multiple databases
 Object identification: the same attribute or object may have different names in different databases
 Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
 Redundant attributes may be detected by correlation analysis and covariance analysis
 Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies
Correlation Analysis (Nominal Data)
 Χ² (chi-square) test:
Χ² = Σ (Observed − Expected)² / Expected
 The larger the Χ² value, the more likely the variables are related
 The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

 Χ² (chi-square) calculation (numbers in parentheses are the expected counts, calculated from the data distribution in the two categories):
Χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
 This shows that like_science_fiction and play_chess are correlated in the group
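The same contingency-table test can be reproduced with SciPy; this is a minimal sketch, and the only choice not taken from the slide is turning off the continuity correction so the statistic matches the hand calculation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the slide
# rows: like / not like science fiction; columns: play chess / not play chess
observed = np.array([[250,  200],
                     [ 50, 1000]])

# correction=False disables Yates' continuity correction so the statistic
# matches the hand calculation above
chi2, p, dof, expected = chi2_contingency(observed, correction=False)

print("expected counts:\n", expected)            # [[90, 360], [210, 840]]
print(f"chi-square = {chi2:.2f}, p = {p:.3g}")   # chi-square ≈ 507.93
```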
Correlation Analysis (Numeric Data)
 Correlation coefficient (also called Pearson’s coefficient):
r_{A,B} = Σ_{i=1..n} (a_i − Ā)(b_i − B̄) / ((n − 1)·σ_A·σ_B) = (Σ_{i=1..n} a_i·b_i − n·Ā·B̄) / ((n − 1)·σ_A·σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i·b_i is the sum of the AB cross-product.
 If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation.
 r_{A,B} = 0: A and B are uncorrelated (no linear relationship); r_{A,B} < 0: A and B are negatively correlated
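A small NumPy sketch of this formula; the sample values are illustrative and simply reuse the stock prices from the covariance example later in the deck.

```python
import numpy as np

# Illustrative numeric attributes (the stock prices from the later covariance example)
A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(A)
# Pearson correlation from the formula above (sample standard deviations, ddof=1)
r = np.sum((A - A.mean()) * (B - B.mean())) / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(f"r = {r:.3f}")              # ~0.94: strongly positively correlated

# Cross-check against NumPy's built-in correlation coefficient
print(np.corrcoef(A, B)[0, 1])
```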
Visually Evaluating Correlation
 [Figure: scatter plots illustrating correlation coefficients ranging from −1 to 1]
Covariance (Numeric Data)
 Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ_{i=1..n} (a_i − Ā)(b_i − B̄) / n
Correlation coefficient: r_{A,B} = Cov(A, B) / (σ_A·σ_B)
where n is the number of tuples, Ā and B̄ are the respective means (expected values) of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
 Positive covariance: if Cov_{A,B} > 0, then A and B both tend to be larger than their expected values.
 Negative covariance: if Cov_{A,B} < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
 Independence: if A and B are independent, Cov_{A,B} = 0, but the converse is not true:
 Some pairs of random variables may have a covariance of 0 yet not be independent. Only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Covariance: An Example
 The computation can be simplified as: Cov(A, B) = E(A·B) − Ā·B̄
 Suppose two stocks A and B have the following values over one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
 E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
 E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
 Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14) / 5 − 4 × 9.6 = 42.4 − 38.4 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
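The same calculation in NumPy, as a quick check of the simplified formula above; the only extra detail is that np.cov defaults to the sample (n − 1) denominator, so bias=True is passed to match the population definition used here.

```python
import numpy as np

# Weekly prices of the two stocks from the slide
A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

# Cov(A, B) = E(A*B) - E(A)*E(B), the simplified form used above
cov_ab = np.mean(A * B) - A.mean() * B.mean()
print(cov_ab)                          # 4.0, so the stocks tend to rise together

# bias=True makes np.cov use the population (1/n) denominator of the hand calculation
print(np.cov(A, B, bias=True)[0, 1])   # 4.0
```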
Data Reduction Strategies
 Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
 Why data reduction? A database/data warehouse may store terabytes of data; complex data analysis may take a very long time to run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes
 Wavelet transforms
 Principal Components Analysis (PCA); see the sketch after this list
 Feature subset selection, feature creation
 Numerosity reduction (some simply call it: data reduction)
 Regression and log-linear models
 Histograms, clustering, sampling
 Data cube aggregation
 Data compression
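As a concrete illustration of dimensionality reduction via PCA, here is a minimal scikit-learn sketch; the 4-attribute random data set and the choice of keeping 2 components are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 100 tuples with 4 numeric attributes,
# where the last attribute is nearly a copy of the first (redundant)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)

# Reduce from 4 dimensions to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance retained per component
```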
Sampling
 Sampling: obtaining a small sample s to represent the whole data set N
 Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
 Key principle: choose a representative subset of the data
 Simple random sampling may have very poor performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified sampling
 Note: sampling may not reduce database I/Os (a page is read at a time)
Types of Sampling
 Simple random sampling
 There is an equal probability of selecting any particular item
 Sampling without replacement
 Once an object is selected, it is removed from the population
 Sampling with replacement
 A selected object is not removed from the population, so it may be drawn again
 Stratified sampling
 Partition the data set and draw samples from each partition proportionally (i.e., approximately the same percentage from each partition); see the sketch after this list
 Used in conjunction with skewed data
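A short pandas sketch of the three sampling schemes above; the 90/10 class split and the 10% sampling fraction are assumed values chosen to show how stratified sampling preserves skew.

```python
import pandas as pd

# Hypothetical skewed data set: 90 rows of class "a" and 10 rows of class "b"
df = pd.DataFrame({
    "cls":   ["a"] * 90 + ["b"] * 10,
    "value": range(100),
})

# Simple random sampling without replacement (SRSWOR)
srswor = df.sample(n=10, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR): the same row may be drawn more than once
srswr = df.sample(n=10, replace=True, random_state=0)

# Stratified sampling: draw ~10% from each class so the class proportions are preserved
stratified = df.groupby("cls").sample(frac=0.1, random_state=0)

print(stratified["cls"].value_counts())   # about 9 "a" rows and 1 "b" row
```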
Sampling: With or without Replacement
 [Figure: raw data sampled by SRSWOR (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement)]
Data Transformation
 A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
 Methods
 Smoothing: remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: summarization, data cube construction
 Normalization: scale values to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: concept hierarchy climbing
Normalization
 Min-max normalization: to [new_minA, new_maxA]
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
 Ex. Let income range from $12,000 to $98,000 and be normalized to [0.0, 1.0]. Then $73,600 is mapped to
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μA: mean, σA: standard deviation of attribute A):
v' = (v − μA) / σA
 Ex. Let μ = 54,000 and σ = 16,000. Then 73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225
 Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
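A compact sketch of the three normalization methods above; the values 12,000, 73,600, and 98,000 and the mean/standard deviation of 54,000/16,000 come from the slide, while the two middle income values are filler added so the array has a few entries.

```python
import numpy as np

# Income values; 12,000, 73,600, and 98,000 come from the slide,
# the other two are assumed filler values
income = np.array([12000, 30000, 54000, 73600, 98000], dtype=float)

# Min-max normalization to [0.0, 1.0]
min_a, max_a = income.min(), income.max()
minmax = (income - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
print(minmax)                # 73,600 maps to ~0.716

# Z-score normalization with the slide's mean and standard deviation
mu, sigma = 54000.0, 16000.0
zscore = (income - mu) / sigma
print(zscore)                # 73,600 maps to 1.225

# Decimal scaling: divide by the smallest power of 10 that makes every |v'| < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j
print(j, decimal_scaled)     # j = 5, so 98,000 maps to 0.98
```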
Thanks

