02 DataPreparation
DATA MINING
WEEK 2: DATA PREPARATION
DR PAUL HANCOCK
CURTIN UNIVERSITY
SEMESTER 2, 2022
DATA PREPARATION

"Remember, if you fail to prepare you are preparing to fail."
- H. K. Williams
Aggarwal Ch 2
Survey data (e.g., census)
Data from multiple sources
[Figure: example records with missing entries marked "?"]
Sensor data
- Time-series data
- Transforms: Fourier, wavelets

Network traffic
- Raw data: traffic packets + routing information
- Example: KDD Cup 99 Intrusion Detection Dataset
- Features: duration, protocol, service, source, destination

Image data
- Raw data: pixels (R, G, B values)
- Low level: corners, edges, lines, colour histograms, texture, ...
- High level: shapes, visual words, ...

Documents
- Basic: term-document matrix (bag of words)
- Advanced: bigrams, trigrams, named entities

Web logs
- Text strings in pre-specified format
- Fields easily extracted

XML documents
- Tree-based representations
- Vector or tree representation
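A term-document matrix can be sketched in a few lines of plain Python; the two document strings below are made up for illustration:

```python
from collections import Counter

# Toy corpus (hypothetical documents)
docs = ["data mining finds patterns in data",
        "data preparation precedes mining"]

# Vocabulary: the union of all terms, in a fixed order
vocab = sorted({term for doc in docs for term in doc.split()})

# Term-document matrix: one row per document, one column per term,
# entries are term counts (bag of words ignores word order)
matrix = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

print(vocab)
print(matrix)
```

Note the matrix is mostly zeros even for two tiny documents; real corpora produce very sparse matrices.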
Discretization (binning): convert continuous attributes to categorical
- Equi-width: bins span equal value ranges
- Equi-depth: bins contain equal numbers of records
https://www.absentdata.com/pandas/pandas-cut-continuous-to-categorical/
https://www.saedsayad.com/unsupervised_binning.htm

Binarization: convert to multiple binary attributes
Vector representation: numeric, high-dimensional, and sparse
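As a sketch, pandas offers both binning styles directly: `cut` gives equi-width bins, `qcut` gives equi-depth bins, and `get_dummies` performs the binarization step (the ten values below are made up):

```python
import pandas as pd

values = pd.Series([3, 7, 8, 12, 15, 18, 21, 30, 45, 60])

# Equi-width: each bin spans an equal range of values
equi_width = pd.cut(values, bins=3)

# Equi-depth: each bin holds (roughly) the same number of records
equi_depth = pd.qcut(values, q=3)

# Binarization: one binary attribute per bin
binary = pd.get_dummies(equi_depth)

print(equi_width.value_counts(sort=False).tolist())  # uneven counts
print(equi_depth.value_counts(sort=False).tolist())  # balanced counts
```

The skewed values land mostly in the first equi-width bin, while equi-depth spreads the records evenly; which behaviour you want depends on the downstream method.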
Data cleaning issues:
- Missing entries
- Incorrect entries
- Scaling and normalization
Handling missing entries:
- Delete records
- Estimate the value
  - Estimating values is a classification/regression problem
  - Contextual data can help with estimation (dependency-oriented data)
  - Beware of relying on large fractions of estimated data
- Remeasure
- Use an analytical app that can handle missing data
- Remove sparse attributes
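Two of the options above (deletion and estimation) can be sketched in pandas; the sensor readings are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps
df = pd.DataFrame({"temp": [21.0, np.nan, 23.0, 22.5],
                   "humidity": [40.0, 42.0, np.nan, 45.0]})

# Option 1: delete records that have any missing entry
dropped = df.dropna()

# Option 2: estimate the value; a column mean is the simplest choice,
# but a regression model on related attributes is the general approach
imputed = df.fillna(df.mean())
```

With a large fraction of gaps, option 2 quietly manufactures data, which is exactly the "beware" point above.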
Handling incorrect entries:
- Inconsistency detection
  - Cross-validation for information from multiple sources
- Domain knowledge
  - A bag of apples shouldn't cost $450
  - Melbourne / South Australia shouldn't be a valid address
  - Sydney / Canada is valid though!
- Data-centric methods
  - Use the data itself to determine what is normal and what is not
  - This is clustering/outlier analysis
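A minimal data-centric check for the apples example might look like this (prices are made up; the median/MAD rule and the factor 5 are one common choice, not the only one):

```python
import numpy as np

# Hypothetical grocery prices; one entry is clearly wrong
prices = np.array([3.5, 4.0, 3.8, 4.2, 3.9, 450.0])

# Use the data itself to define "normal": median and median absolute
# deviation (MAD) are robust, whereas the outlier corrupts mean and std
median = np.median(prices)
mad = np.median(np.abs(prices - median))
outliers = prices[np.abs(prices - median) > 5 * mad]
```

The $450 bag is flagged without any domain knowledge; full clustering/outlier analysis generalizes this idea to many attributes.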
Scaling and normalization:
Attributes with different scales or distributions can cause bias in DM applications (e.g., data similarity).
Solutions:
- Scaling data via linear/log/pareto or min/max functions
- Clipping
https://developers.google.com/machine-learning/data-prep/transform/normalization
COMP5009 – DATA MINING, CURTIN UNIVERSITY
CLIPPING DATA
https://developers.google.com/machine-learning/data-prep/transform/normalization
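Clipping is a one-liner in numpy; the feature values and the [0, 1] bounds below are illustrative:

```python
import numpy as np

# Hypothetical feature with two extreme values
x = np.array([-8.0, 0.2, 0.5, 0.9, 12.0])

# Cap values at chosen bounds so a few outliers don't dominate
# the min/max used by any subsequent scaling step
clipped = np.clip(x, 0.0, 1.0)
```

Without clipping, min/max scaling of `x` would squash the four ordinary values into a tiny range near 0.4.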
NORMALIZING DATA

Min-max normalization: force data to be within [0, 1]
    x -> (x - min) / (max - min)

Standardization: force data to have zero mean and unit variance
    x -> z = (x - μ) / σ

Common feature: distance/density measurement
https://www.kdnuggets.com/2020/04/data-transformation-standardization-normalization.html
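Both formulas map directly onto numpy (the sample values are arbitrary):

```python
import numpy as np

x = np.array([8.0, 6.0, 2.0, 3.0, 4.0, 6.0, 6.0, 5.0])

# Min-max: force data into [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance
z = (x - x.mean()) / x.std()
```

Min-max is sensitive to a single extreme value (it sets the denominator); standardization is the safer default when outliers may be present.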
Whatever scaling and data preparation you do to your existing data has to be replicated on any new data you receive. Keep track of the parameters of your scaling/selection functions.
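One way to keep track of those parameters is to store them explicitly at fit time and reuse them verbatim on new data; the numbers here are made up:

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0, 40.0])

# Record the scaling parameters fitted on the existing data
params = {"min": train.min(), "max": train.max()}

def scale(x, p):
    """Apply the *stored* parameters; never refit on new data."""
    return (x - p["min"]) / (p["max"] - p["min"])

# New data is scaled with the old parameters, so values can land
# outside [0, 1]; clip afterwards if that matters downstream
new = scale(np.array([25.0, 50.0]), params)
```

Refitting min/max on each new batch would silently change what a value like 0.5 means between batches.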
DATA REDUCTION AND TRANSFORMATION

Reduce along either dimension: instances or features.
Goal: an accurate representation of the entire data.
Approach: things that don't change are easy to represent.
PRINCIPAL COMPONENT ANALYSIS (PCA)

Visually:
1. Rotate axes about the origin
...
4. GOTO 2

Mathematically: compute the mean-centred covariance matrix C and decompose it as
    C = P Λ P^T
where Λ is the diagonal matrix of eigenvalues.
The eigenvectors in P are the Principal Components.
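The decomposition can be sketched with numpy's symmetric eigensolver; the six 2-D points are made-up, correlated data:

```python
import numpy as np

# Toy 2-D data with correlated attributes (hypothetical values)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Mean-centre, then form the covariance matrix C
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# Eigendecomposition C = P diag(eigvals) P^T;
# the columns of P are the principal components
eigvals, P = np.linalg.eigh(C)

# Project onto the top component (eigh sorts eigenvalues ascending,
# so the largest-variance direction is the last column)
projected = Xc @ P[:, -1]
```

Keeping only `projected` reduces the two features to one while retaining the direction of greatest variance.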
HAAR WAVELET TRANSFORM

1. Series is V = (8, 6, 2, 3, 4, 6, 6, 5)
2. Each weight is wi = Σ(V · bi) / |bi|
Where:
- bi is a Haar basis vector
- |bi| is the L1 norm
- Σ is summation over the elementwise products

With V = (8, 6, 2, 3, 4, 6, 6, 5) in every row:

basis bi                    |bi|   V · bi (elementwise)        wi
(1,-1, 0, 0, 0, 0, 0, 0)     2    (8,-6, 0, 0, 0, 0, 0, 0)     1
(0, 0, 1,-1, 0, 0, 0, 0)     2    (0, 0, 2,-3, 0, 0, 0, 0)    -0.5
(0, 0, 0, 0, 1,-1, 0, 0)     2    (0, 0, 0, 0, 4,-6, 0, 0)    -1
(0, 0, 0, 0, 0, 0, 1,-1)     2    (0, 0, 0, 0, 0, 0, 6,-5)     0.5
(1, 1,-1,-1, 0, 0, 0, 0)     4    (8, 6,-2,-3, 0, 0, 0, 0)     2.25
(0, 0, 0, 0, 1, 1,-1,-1)     4    (0, 0, 0, 0, 4, 6,-6,-5)    -0.25
(1, 1, 1, 1,-1,-1,-1,-1)     8    (8, 6, 2, 3,-4,-6,-6,-5)    -0.25
(1, 1, 1, 1, 1, 1, 1, 1)     8    (8, 6, 2, 3, 4, 6, 6, 5)     5

The elementwise products in the final row are the original data!
The wavelet coefficients describe differences at various scales.
We can invert the basis matrix H so that we compute the weights directly: W = D H^-1
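The same weights can be computed without building the basis matrix, via the averaging/differencing pyramid (this is a sketch; coefficient order here is overall average, then coarse-to-fine differences):

```python
import numpy as np

def haar_coeffs(v):
    """Haar wavelet weights via the average/difference pyramid.
    Returns [overall average, coarsest diff, ..., finest diffs]."""
    v = np.asarray(v, dtype=float)
    diffs = []
    while len(v) > 1:
        avg = (v[0::2] + v[1::2]) / 2                   # pairwise averages
        diffs = list((v[0::2] - v[1::2]) / 2) + diffs   # pairwise half-differences
        v = avg                                          # recurse on the averages
    return [v[0]] + diffs

print(haar_coeffs([8, 6, 2, 3, 4, 6, 6, 5]))
# [5.0, -0.25, 2.25, -0.25, 1.0, -0.5, -1.0, 0.5]
```

The output matches the weights in the table: the overall average 5, then the differences at progressively finer scales.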
MULTIDIMENSIONAL WAVELETS

For grid/image data:
- Treat as an image
- Transform along 3 axes: Horizontal, Vertical, Diagonal
[Figure: 2x2 coefficient quadrants labelled O, H (Horizontal), V (Vertical), D (Diagonal)]

For multiple attributes, e.g., temp, humidity, irradiance:
- Treat as multiple 1D data
- Transform each separately
Data transformation:
- Reduce complexity of data
- Remove noise