Unit - 1 Data Preprocessing
Learning Objectives
mean − mode = 3 × (mean − median).
The degree to which numerical data tend to
spread is called the dispersion, or variance of the
data. The most common measures of data
dispersion are
1) Range, Quartiles, Outliers, and Boxplots
2) Variance and Standard Deviation
The range of the set is the difference between
the largest (max()) and smallest (min()) values.
The most commonly used percentiles other than
the median are quartiles. The first quartile,
denoted by Q1, is the 25th percentile; the third
quartile, denoted by Q3, is the 75th percentile. The
quartiles, including the median, give some
indication of the center, spread, and shape of a
distribution. The distance between the first and third
quartiles is a simple measure of spread that gives
the range covered by the middle half of the data.
• Boxplots are a popular way of visualizing a
distribution. A boxplot incorporates the five-
number summary as follows:
• Typically, the ends of the box are at the
quartiles, so that the box length is the
interquartile range, IQR.
• The median is marked by a line within the
box.
• Two lines (called whiskers) outside the box
extend to the smallest (Minimum) and largest
(Maximum) observations.
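The range, quartiles, and five-number summary described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `five_number_summary` is hypothetical, the sample values are the sorted prices from the binning slide of this unit, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # method="inclusive" treats the data as the whole population, so the
    # quartiles always lie between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
minimum, q1, median, q3, maximum = five_number_summary(prices)

data_range = maximum - minimum  # spread of the whole data set
iqr = q3 - q1                   # spread of the middle half (the box length)
print(minimum, q1, median, q3, maximum, data_range, iqr)
```

In a boxplot of these values, the box would run from Q1 to Q3, the line inside it would sit at the median, and the whiskers would reach the minimum and maximum.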
2.3 Graphic Displays of Basic Descriptive Data Summaries
Aside from the bar charts, pie charts, and line graphs used in most
statistical or graphical data presentation software packages, there are
other popular types of graphs for the display of data summaries and
distributions. These include histograms, quantile plots, q-q plots, scatter
plots, and loess curves. Such graphs are very helpful for the visual
inspection of your data.
3. Data Cleaning
• Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
1) Missing Data
• Data is not always available
a. E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
a. equipment malfunction
b. inconsistent with other recorded data and thus
deleted
c. data not entered due to misunderstanding
d. certain data may not be considered important at the
time of entry
e. history or changes of the data not being registered
• Missing data may need to be inferred.
How to Handle Missing Data?
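2) Noisy Data
• Binning method
- sort the data and partition it into bins, then smooth by bin means, bin medians, or bin boundaries
• Clustering
- detect and remove outliers
• Regression
- smooth by fitting the data to regression functions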
Binning Methods for Data Smoothing
Sorted data for price (in dollars):
4,8,9,15, 21, 21, 24, 25, 26, 28, 29, 34
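As a minimal sketch of smoothing by binning, the Python below partitions the sorted prices into equal-frequency bins and then smooths them by bin means and by bin boundaries. The bin depth of 4, the function names, and the choice to send ties to the lower boundary are assumptions made for illustration.

```python
def equal_frequency_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` items."""
    return [sorted_values[i:i + depth]
            for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so the ends are min and max
        # Assumption: a value exactly halfway between the ends goes to `low`.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, depth=4)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

With a depth of 4, the twelve prices fall into three bins. Smoothing by means replaces each bin with a single repeated value, while smoothing by boundaries keeps only the two extreme values of each bin.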
[Figure: Sampling from raw data. SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]
[Figure: Raw data and the corresponding cluster/stratified sample]
Data discretization techniques can be used to reduce the
number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval
labels can then be used to replace actual data values.
Replacing numerous values of a continuous attribute by a
small number of interval labels thereby reduces and
simplifies the original data. This leads to a concise, easy-
to-use, knowledge-level representation of mining results.
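To make the idea of replacing values by interval labels concrete, here is a small Python sketch of equal-width discretization. The function name `equal_width_labels`, the label format, and the sample ages are hypothetical and chosen only for illustration. The sketch also assumes the values are not all identical, since the interval width would otherwise be zero.

```python
def equal_width_labels(values, num_intervals):
    """Map each value to the label of the equal-width interval containing it."""
    low, high = min(values), max(values)
    width = (high - low) / num_intervals  # assumes high > low
    labels = []
    for v in values:
        # Clamp so the maximum value lands in the last interval, not past it.
        index = min(int((v - low) / width), num_intervals - 1)
        start = low + index * width
        # The last interval is closed on the right so it includes the maximum.
        closing = "]" if index == num_intervals - 1 else ")"
        labels.append(f"[{start:g}, {start + width:g}{closing}")
    return labels

ages = [13, 15, 22, 25, 33, 35, 46, 52, 70]  # hypothetical sample data
print(equal_width_labels(ages, 3))
```

Nine distinct ages are reduced to three interval labels, which is the kind of simplification the paragraph above describes.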
• Discretization techniques can be categorized based
on how the discretization is performed, such as
whether it uses class information or which direction it
proceeds (i.e., top-down vs. bottom-up). If the
discretization process uses class information, then
we say it is supervised discretization. Otherwise, it is
unsupervised. If the process starts by first finding
one or a few points (called split points or cut points)
to split the entire attribute range, and then repeats
this recursively on the resulting intervals, it is called
top-down discretization or splitting. This contrasts
with bottom-up discretization or merging, which
starts by considering all of the continuous values as
potential split points, removes some by merging
neighboring values to form intervals, and then
recursively applies this process to the resulting
intervals. Discretization can be performed recursively
on an attribute to provide a hierarchical or
multiresolution partitioning of the attribute values,
known as a concept hierarchy. Concept hierarchies
are useful for mining at multiple levels of abstraction.
Three types of attributes:
Nominal: values from an unordered set, e.g., color,
profession
Ordinal: values from an ordered set, e.g., military or
academic rank
Continuous: numeric values, e.g., integers or real
numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical
attributes.
Reduce data size by discretization
Prepare for further analysis
Typical methods (all can be applied recursively):
Binning (covered above): top-down split, unsupervised
Histogram analysis (covered above): top-down split, unsupervised
Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split
Interval merging by χ² analysis: supervised, bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the entropy
after partitioning is
$$I(S,T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$
Entropy is calculated based on the class distribution of the
samples in the set. Given m classes, the entropy of S1
is
$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
where p_i is the proportion of samples in S1 that belong to class i.
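The two formulas above can be turned into a short Python sketch that scores a candidate boundary T and picks the one with the lowest I(S,T). The function names, the use of midpoints between adjacent distinct values as candidate boundaries, and the toy samples are assumptions for illustration. The sketch also assumes the attribute has at least two distinct values.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def partition_entropy(samples, boundary):
    """I(S, T): size-weighted entropy of the two intervals split at boundary."""
    left = [c for v, c in samples if v <= boundary]
    right = [c for v, c in samples if v > boundary]
    total = len(samples)
    return sum(len(part) / total * entropy(part)
               for part in (left, right) if part)

def best_boundary(samples):
    """Return the candidate midpoint that minimises I(S, T)."""
    values = sorted({v for v, _ in samples})
    # Assumption: candidates are midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: partition_entropy(samples, t))

# Hypothetical (value, class) samples.
samples = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "b")]
t = best_boundary(samples)
print(t, partition_entropy(samples, t))
```

Applying `best_boundary` again to each resulting interval would give the recursive, top-down splitting described earlier. A stopping criterion would be needed in practice and is not shown here.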