Unit - 1 Data Preprocessing

The document covers data preprocessing, emphasizing its importance in data mining for ensuring quality results. It discusses various techniques for data cleaning, integration, transformation, and reduction, highlighting methods for handling missing and noisy data. Additionally, it introduces descriptive statistics for understanding data distribution and outlines strategies for efficient data representation and attribute selection.

Uploaded by

Potturi S Rao

UNIT – 1

Data Preprocessing
Learning Objectives

• Understand why data must be preprocessed.
• Understand how to clean the data.
• Understand how to integrate and transform the data.

• Why preprocess the data?
• Data cleaning
• Data integration and transformation
Why Data Preprocessing?
1. Data mining aims at discovering relationships and other forms of knowledge from data in the real world.
2. Data map entities in the application domain to a symbolic representation through a measurement function.
3. Data in the real world is dirty:
   • incomplete: missing data, lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   • noisy: containing errors, such as measurement errors, or outliers
   • inconsistent: containing discrepancies in codes or names
   • distorted: sampling distortion (a change for the worse)
4. No quality data, no quality mining results (GIGO: garbage in, garbage out).
5. Quality decisions must be based on quality data.
6. A data warehouse needs consistent integration of quality data.

• Data quality is multidimensional:
  – Accuracy
  – Preciseness (= reliability)
  – Completeness
  – Consistency
  – Timeliness
  – Believability (= validity)
  – Value added
  – Interpretability
  – Accessibility
• Broad categories:
  – intrinsic, contextual, representational, and accessibility

• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies and errors
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a representation that is reduced in volume but produces the same or similar analytical results
• Data discretization
  – Part of data reduction, but of particular importance, especially for numerical data
• For data preprocessing to be successful, it is essential to have an overall picture of your data.
• Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which data values should be treated as noise or outliers.
• Thus, we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of data preprocessing techniques.
• For many data preprocessing tasks, users would like to learn about data characteristics regarding both central tendency and dispersion of the data.
• Measures of central tendency include mean, median, mode, and midrange, while measures of data dispersion include quartiles, interquartile range (IQR), and variance.
• These descriptive statistics are of great help in understanding the distribution of the data.
• Such measures have been studied extensively in the statistical literature.
• From the data mining point of view, we need to examine how they can be computed efficiently in large databases.
• In particular, it is necessary to introduce the notions of distributive measure, algebraic measure, and holistic measure.
• Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it.
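A distributive measure (such as sum or count) can be computed per partition and then merged; an algebraic measure (such as the mean) is derived from a fixed number of distributive measures; a holistic measure (such as the median) needs the data as a whole. The sketch below illustrates the difference in Python. The three-way partitioning and the function names are assumptions made for illustration only.

```python
from statistics import median

def partial_summary(chunk):
    """Distributive pieces for one partition: (sum, count)."""
    return sum(chunk), len(chunk)

def merged_mean(chunks):
    """Algebraic measure: the mean derived from merged sums and counts."""
    total, n = 0, 0
    for chunk in chunks:
        s, c = partial_summary(chunk)
        total += s
        n += c
    return total / n

# Hypothetical partitions of one attribute, e.g. stored on three nodes.
chunks = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

mean_all = merged_mean(chunks)

# Holistic measure: the median cannot be merged from per-partition medians,
# so this sketch simply gathers every value first.
median_all = median([v for chunk in chunks for v in chunk])
```

Only two numbers per partition travel to the merge step for the mean, whereas the median here needs every value.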
In this section, we look at various ways to measure the central tendency of data. The most common and most effective numerical measure of the “center” of a set of data is the (arithmetic) mean.

For unimodal frequency curves that are moderately skewed, the following empirical relation holds:

mean − mode ≈ 3 × (mean − median)

• The degree to which numerical data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are
  1) Range, Quartiles, Outliers, and Boxplots
  2) Variance and Standard Deviation

• The range of the set is the difference between the largest (max()) and smallest (min()) values.

• The most commonly used percentiles other than the median are quartiles. The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile. The quartiles, including the median, give some indication of the center, spread, and shape of a distribution. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range: IQR = Q3 − Q1.
• Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary (Minimum, Q1, Median, Q3, Maximum) as follows:
  – Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range, IQR.
  – The median is marked by a line within the box.
  – Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
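As a rough illustration, the five-number summary behind a boxplot can be computed with Python's standard library. The sketch below reuses the sorted price values from the binning example in these notes. The quartile convention (`method="inclusive"`) and the 1.5 × IQR outlier rule are assumptions; other conventions give slightly different quartiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return the five-number summary plus IQR, range, and midrange."""
    data = sorted(values)
    # Quartile conventions differ between tools; "inclusive" is one choice.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0], "q1": q1, "median": med, "q3": q3, "max": data[-1],
        "iqr": q3 - q1,
        "range": data[-1] - data[0],
        "midrange": (data[0] + data[-1]) / 2,
    }

def suspected_outliers(values):
    """Values more than 1.5 * IQR beyond the quartiles (a common rule of thumb)."""
    s = five_number_summary(values)
    low = s["q1"] - 1.5 * s["iqr"]
    high = s["q3"] + 1.5 * s["iqr"]
    return [v for v in values if v < low or v > high]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
summary = five_number_summary(prices)
```

Many plotting tools draw whiskers only as far as the 1.5 × IQR fences and plot the remaining points individually, which is why the helper above is included.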
2.3 Graphic Displays of Basic Descriptive Data Summaries
Aside from the bar charts, pie charts, and line graphs used in most
statistical or graphical data presentation software packages, there are
other popular types of graphs for the display of data summaries and
distributions. These include histograms, quantile plots, q-q plots, scatter
plots, and loess curves. Such graphs are very helpful for the visual
inspection of your data.
3. Data Cleaning
• Data cleaning tasks
  – Fill in missing values
  – Identify outliers and smooth out noisy data
  – Correct inconsistent data
1) Missing Data
• Data is not always available
  – E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  a. equipment malfunction
  b. data that was inconsistent with other recorded data and thus deleted
  c. data not entered due to misunderstanding
  d. certain data not being considered important at the time of entry
  e. history or changes of the data not being registered
• Missing data may need to be inferred.
How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Use a global constant to fill in the missing value: e.g., “unknown” (which a mining program may mistake for a new class)
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
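The two mean-based strategies above can be sketched as follows. The customer records, the attribute names `income` and `risk`, and both function names are hypothetical; missing values are assumed to be represented by `None`.

```python
from collections import defaultdict
from statistics import mean

def fill_with_attribute_mean(rows, attr):
    """Replace missing values of attr by the mean of all known values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]}) for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr by the mean within the same class.

    Assumes every class has at least one known value for attr.
    """
    known = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            known[r[label]].append(r[attr])
    class_mean = {c: mean(vals) for c, vals in known.items()}
    return [
        dict(r, **{attr: class_mean[r[label]] if r[attr] is None else r[attr]})
        for r in rows
    ]

# Hypothetical sales records; one customer has no recorded income.
customers = [
    {"income": 30000, "risk": "high"},
    {"income": 50000, "risk": "high"},
    {"income": 90000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 110000, "risk": "low"},
]

by_attribute = fill_with_attribute_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "risk")
```

Both helpers return new records and leave the input untouched, so the two strategies can be compared on the same data.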
2) Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  – faulty data collection instruments
  – data entry problems
  – data transmission problems
  – technology limitations
  – inconsistency in naming conventions
• Other data problems that require data cleaning
  – duplicate records
  – inconsistent data
How to Handle Noisy Data?

• Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and have a human check them
• Regression
  - smooth by fitting the data to regression functions
Binning Methods for Data Smoothing
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into (equi-depth) bins:

- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

* Smoothing by bin means (rounded to the nearest dollar):

- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
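The worked example above can be reproduced with a short Python sketch. The function names are illustrative, and the partitioning assumes the number of values divides evenly into the bins.

```python
def equal_depth_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded to a whole dollar."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Ties between the two boundaries go to the lower one here; that choice is an assumption, since the example above contains no ties.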
[Figure: Cluster Analysis]
[Figure: Regression]
• Data integration:
  – combines data from multiple sources into a coherent store
• Schema integration
  – integrate metadata from different sources
  – entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  – for the same real-world entity, attribute values from different sources are different
  – possible reasons: different representations, different scales, e.g., metric vs. British units
• Redundant data often occur when multiple databases are integrated
  – The same attribute may have different names in different databases
  – One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data can often be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
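One common form of correlation analysis for numeric attributes is the Pearson correlation coefficient. The sketch below is a minimal version; the monthly and annual revenue figures are made up to show a derived attribute, and the choice of Pearson's coefficient is an assumption, since the notes do not name a specific measure.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical attributes from two sources: annual revenue is derived
# from average monthly revenue, so one of them is redundant.
monthly_revenue = [10, 12, 9, 15]
annual_revenue = [120, 144, 108, 180]

r = pearson_r(monthly_revenue, annual_revenue)
```

A coefficient close to +1 or −1 suggests that one attribute can be dropped; values near 0 indicate no linear relationship.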
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: values are scaled to fall within a small, specified range
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling
• Attribute/feature construction
  – New attributes constructed from the given ones
• min-max normalization
  \[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
• z-score normalization
  \[ v' = \frac{v - mean_A}{stand\_dev_A} \]
• normalization by decimal scaling
  \[ v' = \frac{v}{10^j} \]
  where j is the smallest integer such that max(|v'|) < 1

• Min-max normalization: to [new_min_A, new_max_A]
  \[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
  – Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
    \[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
• Z-score normalization (μ: mean, σ: standard deviation):
  \[ v' = \frac{v - \mu_A}{\sigma_A} \]
  – Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
    \[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
• Normalization by decimal scaling
  \[ v' = \frac{v}{10^j} \]
  where j is the smallest integer such that max(|v'|) < 1
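The three normalization formulas translate directly into code. The sketch below checks the income examples from the notes; the function names and the sample values passed to `decimal_scaling` are illustrative assumptions.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73_600, 12_000, 98_000)   # about 0.716
income_z = z_score(73_600, 54_000, 16_000)         # about 1.225

# Illustrative values ranging from -986 to 917, so j comes out as 3.
scaled, j = decimal_scaling([-986, 917])
```

Min-max normalization needs the attribute's minimum and maximum up front, so a later value outside that range falls outside the target interval; z-score normalization avoids that dependency.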
• Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.

2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.

3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.

4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need to store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.

5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels.
• Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions).

• The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.

• Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
The “best” (and “worst”) attributes are typically determined using tests of statistical significance, which assume that the attributes are independent of one another. Many other attribute evaluation measures can be used, such as the information gain measure used in building decision trees for classification.
• Basic heuristic methods of attribute subset selection include the following techniques, some of which are illustrated in the figure.

• 1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.

• 2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.

• 3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.

• 4. Decision tree induction: Decision tree algorithms, such as ID3, C4.5, and CART, were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into individual classes. When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree forms the reduced subset of attributes.

• The stopping criteria for the methods may vary. The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process.
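Stepwise forward selection with a threshold-based stopping criterion can be sketched as a greedy loop. The evaluation function `score`, the `min_gain` threshold, and the attribute names A1 to A6 with their weights are all hypothetical; in practice `score` would be a statistical test or a measure such as information gain.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedily add the attribute that most improves score(subset).

    Stops when the best remaining attribute improves the score by no
    more than min_gain (the stopping threshold).
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        candidate, candidate_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda pair: pair[1],
        )
        if candidate_score - best_score <= min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Toy evaluation measure: each attribute contributes a fixed, made-up weight.
weights = {"A1": 0.5, "A2": 0.0, "A3": 0.0, "A4": 0.3, "A5": 0.0, "A6": 0.15}

def toy_score(subset):
    return sum(weights[a] for a in subset)

reduced = forward_selection(weights, toy_score, min_gain=0.05)
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal hurts the score least.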
• In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several well-tuned algorithms for string compression. Although they are typically lossless, they allow only limited manipulation of the data.

• In this section, we instead focus on two popular and effective methods of lossy dimensionality reduction: wavelet transforms and principal components analysis.
• Wavelet transforms can be applied to multidimensional data, such as a data cube.

• This is done by first applying the transform to the first dimension, then to the second, and so on. The computational complexity involved is linear with respect to the number of cells in the cube. Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes. Lossy compression by wavelets is reportedly better than JPEG compression, the current commercial standard. Wavelet transforms have many real-world applications, including the compression of fingerprint images, computer vision, analysis of time-series data, and data cleaning.
• PCA is computationally inexpensive, can be applied to
ordered and unordered attributes, and can handle sparse
data and skewed data. Multidimensional data of more than
two dimensions can be handled by reducing the problem to
two dimensions. Principal components may be used as
inputs to multiple regression and cluster analysis.

• In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
• “Can we reduce the data volume by choosing alternative, ‘smaller’ forms of data representation?”
• Techniques of numerosity reduction can indeed be applied for this purpose. These techniques may be parametric or nonparametric.
• For parametric methods, a model is used to estimate the data, so that typically only the model parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Log-linear models, which estimate discrete multidimensional probability distributions, are an example.
• Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
• Let’s look at each of the numerosity reduction techniques mentioned above.
• Linear regression: data are modeled to fit a straight line
  – Often uses the least-squares method to fit the line
• Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model: approximates discrete multidimensional probability distributions
• Linear regression: Y = wX + b
  – Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
  – They are found by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1 X1 + b2 X2
  – Many nonlinear functions can be transformed into the above
• Log-linear models:
  – The multi-way table of joint probabilities is approximated by a product of lower-order tables
  – Probability: \( p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd} \)
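For simple linear regression, the least squares estimates of w and b have a closed form, sketched below. The data points are made up so that they lie exactly on a line; the function name is an assumption.

```python
def fit_line(xs, ys):
    """Least squares estimates (w, b) for the model Y = w * X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - w * mean_x
    return w, b

# Hypothetical data generated from Y = 2X + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
w, b = fit_line(xs, ys)
```

For data reduction, only w and b (and possibly outliers) need to be stored in place of the original points.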
Histograms: Histograms use binning to approximate data distributions and are a popular form of data reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.

There are several partitioning rules, including the following:

Equal-width: In an equal-width histogram, the width of each bucket range is uniform.

Equal-frequency (or equidepth): In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (that is, each bucket contains roughly the same number of contiguous data samples).

V-Optimal: If we consider all of the possible histograms for a given number of buckets, the V-Optimal histogram is the one with the least variance. Histogram variance is a weighted sum of the original values that each bucket represents, where bucket weight is equal to the number of values in the bucket.

MaxDiff: In a MaxDiff histogram, we consider the difference between each pair of adjacent values. A bucket boundary is placed between each pair for the pairs having the β − 1 largest differences, where β is the user-specified number of buckets.
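The equal-width and equal-frequency rules can be sketched in a few lines of Python, reusing the sorted price values from the binning example in these notes. The function names and the return shapes are assumptions made for illustration.

```python
def equal_width_histogram(values, n_buckets):
    """Return (bucket edges, counts) for buckets of uniform width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum value is placed in the last bucket.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_buckets + 1)]
    return edges, counts

def equal_frequency_histogram(values, n_buckets):
    """Return (low, high, count) per bucket with roughly equal counts.

    Assumes there are at least as many values as buckets.
    """
    data = sorted(values)
    n = len(data)
    buckets = []
    for i in range(n_buckets):
        chunk = data[i * n // n_buckets:(i + 1) * n // n_buckets]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
edges, counts = equal_width_histogram(prices, 3)
depth_buckets = equal_frequency_histogram(prices, 3)
```

With equal-width buckets the counts differ from bucket to bucket, whereas equal-frequency buckets trade uniform width for uniform counts.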
Clustering
Clustering techniques consider data tuples as objects. They
partition the objects into groups or clusters, so that objects within a
cluster are “similar” to one another and “dissimilar” to objects in
other clusters.
In data reduction, the cluster representations of the data are used
to replace the actual data. The effectiveness of this technique
depends on the nature of the data. It is much more effective for data
that can be organized into distinct clusters than for smeared data.
• Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only
• Can be very effective if the data is clustered but not if the data is “smeared”
• Can have hierarchical clustering and be stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms
• Cluster analysis will be studied in depth later
Sampling
Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller random
sample (or subset) of the data. Suppose that a large data set, D,
contains N tuples. Let’s look at the most common ways that we
could sample D for data reduction.
An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, s, rather than to N, the size of the data set. Hence, sampling complexity is potentially sublinear in the size of the data.
Simple Random sample without replacement
• For a fixed sample size, sampling complexity increases only linearly with the number of data dimensions.
• When applied to data reduction, sampling is most commonly used to estimate the answer to an aggregate query.
• Sampling: obtaining a small sample of s tuples to represent the whole data set of N tuples
• Allows a mining algorithm to run in complexity that is potentially sublinear in the size of the data
• Choose a representative subset of the data
  – Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods
  – Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
• Note: Sampling may not reduce database I/Os (page at a time)
Sampling: with or without Replacement
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]
[Figure: raw data and the corresponding cluster/stratified sample]
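The sampling schemes named above can be sketched with Python's `random` module. The function names, the seeded generator, and the skewed customer data are assumptions for illustration; the stratified version samples the same fraction from every stratum so that small groups stay represented.

```python
import random
from collections import defaultdict

def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: no tuple is drawn twice."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn again."""
    return [rng.choice(data) for _ in range(s)]

def stratified_sample(data, key, fraction, rng):
    """Sample the same fraction from every stratum (at least one tuple each)."""
    strata = defaultdict(list)
    for row in data:
        strata[key(row)].append(row)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable

data = list(range(100))
without_replacement = srswor(data, 10, rng)
with_replacement = srswr(data, 10, rng)

# Hypothetical skewed data: 10 "senior" customers and 90 "young" ones.
customers = [(i, "senior" if i < 10 else "young") for i in range(100)]
stratified = stratified_sample(customers, key=lambda r: r[1], fraction=0.1, rng=rng)
```

A plain 10% random sample of the customers could easily miss the small "senior" group, which is the skew problem that stratified sampling addresses.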
• Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data. This leads to a concise, easy-to-use, knowledge-level representation of mining results.
• Discretization techniques can be categorized based
on how the discretization is performed, such as
whether it uses class information or which direction it
proceeds (i.e., top-down vs. bottom-up). If the
discretization process uses class information, then
we say it is supervised discretization. Otherwise, it is
unsupervised. If the process starts by first finding
one or a few points (called split points or cut points)
to split the entire attribute range, and then repeats
this recursively on the resulting intervals, it is called
top-down discretization or splitting. This contrasts
with bottom-up discretization or merging, which
starts by considering all of the continuous values as
potential split-points, removes some by merging
neighborhood values to form intervals, and then
recursively applies this process to the resulting
intervals. Discretization can be performed recursively
on an attribute to provide a hierarchical or
multiresolution partitioning of the attribute values,
known as a concept hierarchy. Concept hierarchies
are useful for mining at multiple levels of abstraction.
• Three types of attributes:
  – Nominal: values from an unordered set, e.g., color, profession
  – Ordinal: values from an ordered set, e.g., military or academic rank
  – Continuous: numeric values, e.g., integer or real numbers
• Discretization:
  – Divide the range of a continuous attribute into intervals
  – Some classification algorithms only accept categorical attributes
  – Reduce data size by discretization
  – Prepare for further analysis
• Typical methods (all of them can be applied recursively):
  – Binning (covered above): top-down split, unsupervised
  – Histogram analysis (covered above): top-down split, unsupervised
  – Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
  – Entropy-based discretization: supervised, top-down split
  – Interval merging by χ² analysis: supervised (it compares class distributions), bottom-up merge
  – Segmentation by natural partitioning: top-down split, unsupervised
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the class information entropy after partitioning
is
\[ I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2) \]
• Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is
\[ \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i) \]
where p_i is the probability of class i in S1

• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
• The process is recursively applied to the partitions obtained until some stopping criterion is met
• Such a boundary may reduce data size and improve classification accuracy
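One binary split of entropy-based discretization can be sketched as follows. The (value, class) samples are made up, and using midpoints between adjacent distinct values as candidate boundaries is an assumption; the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the class distribution in a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_information(samples, boundary):
    """I(S, T): size-weighted entropy of the two intervals created by T."""
    left = [label for value, label in samples if value <= boundary]
    right = [label for value, label in samples if value > boundary]
    n = len(samples)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_boundary(samples):
    """Candidate boundary with the smallest I(S, T)."""
    values = sorted({value for value, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_information(samples, t))

# Hypothetical (age, buys_product) samples.
samples = [(20, "no"), (25, "no"), (30, "no"), (40, "yes"), (45, "yes"), (50, "yes")]
boundary = best_boundary(samples)
```

Applying `best_boundary` again to the samples on each side of the chosen boundary gives the recursive, top-down procedure described above.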
• Merging-based (bottom-up) vs. splitting-based methods
• Merge: find the best neighboring intervals and merge them to form larger intervals recursively
• ChiMerge [Kerber AAAI 1992; see also Liu et al. DMKD 2002]
  – Initially, each distinct value of a numerical attribute A is considered to be one interval
  – χ² tests are performed for every pair of adjacent intervals
  – Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
  – This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)
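The merge loop can be sketched as below, using the max-interval stopping criterion. This is a simplified illustration rather than Kerber's full algorithm: the data layout (each interval as a `(low, high, class_counts)` tuple), the function names, and the sample counts are all assumptions.

```python
def chi_square(counts_a, counts_b):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    classes = set(counts_a) | set(counts_b)
    rows = [counts_a, counts_b]
    row_totals = [sum(r.values()) for r in rows]
    col_totals = {c: counts_a.get(c, 0) + counts_b.get(c, 0) for c in classes}
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(rows, row_totals):
        for c in classes:
            expected = row_total * col_totals[c] / n
            if expected > 0:
                chi2 += (row.get(c, 0) - expected) ** 2 / expected
    return chi2

def chimerge(intervals, max_intervals):
    """Merge the adjacent pair with the lowest chi-square until few enough remain.

    intervals: list of (low, high, class_counts) sorted by low.
    """
    intervals = list(intervals)
    while len(intervals) > max_intervals:
        scores = [chi_square(a[2], b[2]) for a, b in zip(intervals, intervals[1:])]
        i = scores.index(min(scores))
        (low, _, ca), (_, high, cb) = intervals[i], intervals[i + 1]
        merged = {c: ca.get(c, 0) + cb.get(c, 0) for c in set(ca) | set(cb)}
        intervals[i:i + 2] = [(low, high, merged)]
    return intervals

# Hypothetical start: one interval per distinct value, with class counts.
start = [(1, 1, {"a": 3}), (2, 2, {"a": 2}), (3, 3, {"b": 4}), (4, 4, {"b": 1})]
result = chimerge(start, max_intervals=2)
```

Intervals whose class distributions match score 0 and are merged first, so the boundary between the "a" values and the "b" values survives.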
• Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups. Clustering takes the distribution of A into consideration, as well as the closeness of data points, and therefore is able to produce high-quality discretization results.
• Clustering can be used to generate a concept hierarchy for A by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node of the concept hierarchy. In the former, each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy. In the latter, clusters are formed by repeatedly grouping neighboring clusters in order to form higher-level concepts.
• Discretization by Intuitive Partitioning
• Although the above discretization methods are useful
in the generation of numerical hierarchies, many users
would like to see numerical ranges partitioned into
relatively uniform, easy-to-read intervals that appear
intuitive or “natural.”
• If an interval covers 3, 6, 7, or 9 distinct values at the
most significant digit, then partition the range into 3
intervals (3 equal-width intervals for 3, 6, and 9; and 3
intervals in the grouping of 2-3-2 for 7).
• If it covers 2, 4, or 8 distinct values at the most
significant digit, then partition the range into 4 equal-
width intervals.
• If it covers 1, 5, or 10 distinct values at the most
significant digit, then partition the range into 5 equal-
width intervals.
• The rule can be recursively applied to each interval,
creating a concept hierarchy for the given numerical
attribute. Real-world data often contain extremely large
positive and/or negative outlier values, which could
distort any top-down discretization method based on
minimum and maximum data values.
• A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
  – If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 intervals (equal-width for 3, 6, and 9; grouped 2-3-2 for 7)
  – If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equal-width intervals
  – If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equal-width intervals
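One level of the 3-4-5 rule can be sketched as follows. The sketch assumes that the interval endpoints are integers already rounded to the most significant digit of the range, and it leaves out the outlier handling and the recursion mentioned in these notes. The function name and the example ranges are hypothetical.

```python
def partition_345(low, high):
    """Split the range from low to high once using the 3-4-5 rule.

    Assumes integer endpoints, high > low, both already rounded to the
    most significant digit of the range.
    """
    span = high - low
    msd = 1
    while msd * 10 <= span:      # unit of the most significant digit
        msd *= 10
    distinct = span // msd       # distinct values covered at that digit
    if distinct == 7:
        widths = [2 * msd, 3 * msd, 2 * msd]   # 2-3-2 grouping
    elif distinct in (3, 6, 9):
        widths = [span / 3] * 3
    elif distinct in (2, 4, 8):
        widths = [span / 4] * 4
    else:                        # 1 or 5 (a span of 10 units shows up as 1)
        widths = [span / 5] * 5
    edges = [low]
    for w in widths:
        edges.append(edges[-1] + w)
    return list(zip(edges, edges[1:]))

three_way = partition_345(-1000, 2000)   # covers 3 values at the thousands digit
seven_way = partition_345(0, 700)        # covers 7 values at the hundreds digit
```

Calling `partition_345` again on each returned interval would build the next level of the concept hierarchy.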
• Categorical data are discrete data. Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. Examples include geographic location, job category, and item type. There are several methods for the generation of concept hierarchies for categorical data:
  – Specification of a partial ordering of attributes explicitly at the schema level by users or experts
  – Specification of a portion of a hierarchy by explicit data grouping
  – Specification of a set of attributes, but not of their partial ordering
  – Specification of only a partial set of attributes
• Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  – street < city < state < country
• Specification of a hierarchy for a set of values by explicit data grouping
  – {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
  – E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
  – E.g., for a set of attributes: {street, city, state, country}
• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  – The attribute with the most distinct values is placed at the lowest level of the hierarchy
  – Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_ state 365 distinct values

city 3567 distinct values

street 674,339 distinct values
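The distinct-value heuristic amounts to sorting the attributes by their number of distinct values. The sketch below uses the counts listed above; the function name is an assumption.

```python
def hierarchy_from_distinct_counts(distinct_counts):
    """Order attributes from the top level (fewest distinct values)
    to the lowest level (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)

distinct_counts = {
    "street": 674_339,
    "country": 15,
    "city": 3_567,
    "province_or_state": 365,
}

levels = hierarchy_from_distinct_counts(distinct_counts)
```

The result should still be reviewed by hand, because the heuristic fails for cases such as weekday (7 distinct values) versus month (12) or year.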

• Data preparation, or preprocessing, is a big issue for both data warehousing and data mining
• Descriptive data summarization is needed for quality data preprocessing
• Data preparation includes
  – Data cleaning and data integration
  – Data reduction and feature selection
  – Discretization
• Many methods have been developed, but data preprocessing is still an active area of research

No ratings yet
COS10022 - Lecture 03 - Data Preparation PDF
61 pages
Project Report Beng/Bsc/Msc: Delete As Appropriate
No ratings yet
Project Report Beng/Bsc/Msc: Delete As Appropriate
17 pages
Daftar Harga Toko Jeremy Lengkap
No ratings yet
Daftar Harga Toko Jeremy Lengkap
2 pages
Preprocessing Techniques
No ratings yet
Preprocessing Techniques
63 pages
Isscc2022 000016CL
No ratings yet
Isscc2022 000016CL
17 pages
3 Data Preprocessing
No ratings yet
3 Data Preprocessing
33 pages
DWDM 3
No ratings yet
DWDM 3
12 pages
Data Preprocessing
No ratings yet
Data Preprocessing
77 pages
Knowledge Discovery and Data Mining
No ratings yet
Knowledge Discovery and Data Mining
55 pages
Unit 4
No ratings yet
Unit 4
30 pages
Programming With Python I Notes2
No ratings yet
Programming With Python I Notes2
30 pages
Normalization
No ratings yet
Normalization
35 pages
Lecture - 04 - Data Understanding and Preparation
No ratings yet
Lecture - 04 - Data Understanding and Preparation
59 pages
Data Preparation
No ratings yet
Data Preparation
21 pages
Legend Update Guide EN IT v11
No ratings yet
Legend Update Guide EN IT v11
20 pages
Data Mining: Concepts and Techniques: September 16, 2020 1
No ratings yet
Data Mining: Concepts and Techniques: September 16, 2020 1
46 pages
Data Mining: Concepts and Techniques: January 14, 2014 1
0% (1)
Data Mining: Concepts and Techniques: January 14, 2014 1
46 pages
Estimasi Anggaran Biaya Google Adwords Iklan Website
No ratings yet
Estimasi Anggaran Biaya Google Adwords Iklan Website
54 pages
Cromwell Our Chief of Men Antonia Fraser PDF Download
No ratings yet
Cromwell Our Chief of Men Antonia Fraser PDF Download
22 pages
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
No ratings yet
Lecture Source: Books by Tan, Steinbach, Kumar Han, Kamber & Pei Evans Dinesh Kumar + Experiential Knowledge
40 pages
CIA TOCs CH
No ratings yet
CIA TOCs CH
7 pages
Driver Deployment Utility v2.0.0.32468 Release Notes
No ratings yet
Driver Deployment Utility v2.0.0.32468 Release Notes
4 pages
Jurnal Ind Iam
No ratings yet
Jurnal Ind Iam
8 pages
PeopleSoftTraining at Hyderabad
No ratings yet
PeopleSoftTraining at Hyderabad
1 page
Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
No ratings yet
Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
21 pages
Data Pre Processing
No ratings yet
Data Pre Processing
48 pages
Spatial and Temporal Data Mining
No ratings yet
Spatial and Temporal Data Mining
52 pages
GX Hand-Held Barcode Scanner Reconfiguration Worksheet
No ratings yet
GX Hand-Held Barcode Scanner Reconfiguration Worksheet
5 pages
Syllabus
No ratings yet
Syllabus
3 pages
Q.1. Why Is Data Preprocessing Required?
100% (1)
Q.1. Why Is Data Preprocessing Required?
26 pages
Data Cleaning: Missing Values: - For Example in Attribute Income If
No ratings yet
Data Cleaning: Missing Values: - For Example in Attribute Income If
30 pages
Data Pre Processing - NG
No ratings yet
Data Pre Processing - NG
43 pages
Lecture 09 DM
No ratings yet
Lecture 09 DM
14 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.
