3 Prep
No quality data, no quality mining results!
Quality decisions must be based on quality data
June 15, 2024 Data Mining: Concepts and Techniques 2
Multi-Dimensional Measure of Data Quality
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
intrinsic, contextual, representational, and
accessibility.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the
same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially
for numerical data
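Two of the tasks above, filling in missing values (data cleaning) and normalization (data transformation), can be sketched as follows. This is a minimal illustration, not a prescribed method: the function names, the choice of mean imputation, and the sample values are all assumptions made for the example.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Data cleaning sketch: replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation sketch: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

# Hypothetical attribute with one missing value.
raw = [12.0, None, 18.0, 30.0]
cleaned = fill_missing_with_mean(raw)      # None becomes the mean of 12, 18, 30
normalized = min_max_normalize(cleaned)    # smallest value maps to 0, largest to 1
```

Mean imputation is only one of several ways to fill a missing value; it is used here because it keeps the sketch short.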
Forms of data preprocessing
technology limitation
incomplete data
inconsistent data
Regression
smooth by fitting the data to regression functions
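As a rough sketch of smoothing by regression, the code below fits a straight line by ordinary least squares and replaces each observed y with its fitted value. The helper names `fit_line` and `smooth` and the sample points are assumptions for illustration; a single linear function is only the simplest possible choice of regression function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy observations scattered around a line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 2.9, 4.2, 4.8]
smoothed = smooth(xs, ys)
```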
uniform grid
if A and B are the lowest and highest values of the
[Figure: regression line y = x + 1 on axes x and y; the observed value Y1 at X1 and its fitted value Y1']
store
Schema integration
integrate metadata from different sources
Dimensionality reduction
Numerosity reduction
understand
Heuristic methods (due to exponential # of choices):
step-wise forward selection
decision-tree induction
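Step-wise forward selection can be sketched as a greedy loop: start from an empty attribute set, add the single attribute that most improves a score, and stop when no attribute helps. In the sketch below the scoring function is supplied by the caller; `toy_score`, its weights, and the attribute names are purely hypothetical stand-ins for a real evaluation measure.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy step-wise forward selection.

    `score` maps a list of attributes to a number (higher is better).
    """
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Evaluate each candidate added to the current selection.
        scored = [(score(selected + [a]), a) for a in remaining]
        cand_score, cand = max(scored, key=lambda t: t[0])
        if cand_score - best <= min_gain:
            break  # no remaining attribute improves the score
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected

# Toy score (assumption): three attributes carry signal, and every
# selected attribute costs a small penalty.
USEFUL = {"A1": 0.5, "A4": 0.3, "A6": 0.2}

def toy_score(subset):
    return sum(USEFUL.get(a, 0.0) for a in subset) - 0.01 * len(subset)

chosen = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
```

The loop evaluates at most a quadratic number of subsets instead of all 2^d, which is the point of using a heuristic here.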
Example of Decision Tree Induction
[Figure: decision tree with test nodes A1? and A6?]
Typically lossless
expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
[Figure: lossy compression; Original Data and its Approximated version]
[Figure: original axes X1 and X2 with a second pair of axes Y1 and Y2]
Parametric methods
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
Log-linear models: obtain the value at a point in m-dimensional
space as a product over appropriate marginal
subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
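As one illustration of a non-parametric method, the sketch below builds an equal-width histogram: the raw values are replaced by a handful of bucket boundaries and counts. The function name, the bucket count, and the sample values are assumptions chosen for the example.

```python
def equal_width_histogram(values, n_buckets):
    """Return (edges, counts) for n_buckets equal-width buckets over the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        if width == 0:
            idx = 0
        else:
            # The maximum value falls on the last edge; clamp it into the last bucket.
            idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_buckets + 1)]
    return edges, counts

# Hypothetical attribute values.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 18, 20, 21, 25, 30]
edges, counts = equal_width_histogram(prices, 3)
# 15 raw values are now summarized by 4 edges and 3 counts.
```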
above.
Log-linear models:
The multi-way table of joint probabilities is
[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and by SRSWR]
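The two schemes named in the figure, simple random sampling without replacement (SRSWOR) and with replacement (SRSWR), can be sketched with Python's standard `random` module. The function names and the `seed` parameter are assumptions added for reproducibility of the example.

```python
import random

def srswor(data, n, seed=None):
    """Simple random sample WITHOUT replacement: each tuple is drawn at most once."""
    rng = random.Random(seed)
    return rng.sample(data, n)

def srswr(data, n, seed=None):
    """Simple random sample WITH replacement: a tuple may be drawn more than once."""
    rng = random.Random(seed)
    return rng.choices(data, k=n)

# Hypothetical raw data: 100 tuple identifiers.
raw_data = list(range(100))
without_replacement = srswor(raw_data, 10, seed=0)
with_replacement = srswr(raw_data, 10, seed=0)
```

Note that `srswor` requires n to be no larger than the data set, while `srswr` has no such limit.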
Sampling
Discretization:
divide the range of a continuous attribute into
intervals
Some classification algorithms only accept categorical
attributes.
Reduce data size by discretization
Discretization
reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
Concept hierarchies
reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute
age) by higher level concepts (such as young,
middle-aged, or senior).
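Both ideas can be sketched on the age attribute: equal-width discretization replaces each value with an interval index, and a concept hierarchy replaces it with a higher-level label. The cutoff ages (29 and 59), the function names, and the sample ages are assumptions for illustration only; the source does not fix any boundaries.

```python
def equal_width_intervals(values, n_intervals):
    """Discretization sketch: map each value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    if width == 0:
        return [0 for _ in values]
    # Clamp so the maximum value lands in the last interval.
    return [min(int((v - lo) / width), n_intervals - 1) for v in values]

def age_concept(age, young_max=29, middle_max=59):
    """Concept hierarchy sketch: replace a numeric age with a higher-level concept.

    The cutoffs are hypothetical defaults, not values given in the text.
    """
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle-aged"
    return "senior"

ages = [23, 35, 47, 61, 72]
interval_ids = equal_width_intervals(ages, 3)
concepts = [age_concept(a) for a in ages]
```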
Entropy-based discretization
(-$1,000 - $2,000)
Step 3:
(-$4,000 - $5,000)
Step 4: