
02 DataPreparation

The document discusses various tasks involved in data preparation for data mining, including feature extraction, data cleaning, and data transformation. It describes techniques for handling different data types and formats, discretizing continuous variables, dealing with missing data, and resolving inconsistent data.


COMP5009

DATA MINING

WEEK 2
DATA
PREPARATION
DR PAUL HANCOCK
CURTIN UNIVERSITY
SEMESTER 2, 2022
DATA PREPARATION
Aggarwal Ch 2

"Remember, if you fail to prepare you are preparing to fail."
- H. K. Williams

Data preparation tasks:


 Feature extraction
 Data cleaning
 Data reduction and transformation

COMP5009 – DATA MINING, CURTIN UNIVERSITY 2


FEATURE EXTRACTION
Aggarwal Ch 2.2

 Raw data is typically not ideal for many data mining algorithms:
 Not all data are relevant
  Measurement = Signal + Noise
 Unsuitable data types
 Missing & corrupted data
 Values have a range of distributions
 Data can come from different sources
 Very large size (e.g., video data)


Data source: Sensor (e.g., voltage meter)
 Potential issue: large volume of data with low information density
 Potential solution: summary statistics or aggregated data; wavelet/Fourier transform; signal processing

Data source: Image (e.g., cat photo)
 Potential issue: high-dimensional data (spatial and spectral); features are not pixels, but groups of pixels
 Potential solution: image segmentation; histogram analysis

Data source: Document (e.g., blog post, report)
 Potential issue: human language has grammar, but it is dynamic
 Potential solution: word frequency or bag-of-words approach; domain-specific statistical models

Data source: Survey data (e.g., census)
 Potential issue: ?
 Potential solution: ?

Data source: Data from multiple sources
 Potential issue: ?
 Potential solution: ?


FEATURE EXTRACTION

 Sensor data
  Time-series data
  Transforms: Fourier, wavelets
 Image data
  Raw data: pixels (R,G,B values)
  Low level: corners, edges, lines, colour histograms, texture, ...
  High level: shapes, visual words, ...
 Web logs
  Text strings in pre-specified format
  Fields easily extracted
 Network traffic
  Raw data: traffic packets + routing information
  Example: KDD Cup 99 Intrusion Detection Dataset
  Features: duration, protocol, service, source, destination
 Documents
  Basic: term-document matrix (bag of words)
  Advanced: bigrams, trigrams, named entities
  XML documents: tree-based representations
  Vector or tree representation


DATA TYPES AND PORTABILITY

 Different sources present data in different formats
 DM applications need homogeneous data
 Converting between data types is crucial
 Numeric data is the easiest to work with


DISCRETIZATION

 Converting numerical data into categorical data

https://www.absentdata.com/pandas/pandas-cut-continuous-to-categorical/


DISCRETIZATION

Process:
 Create bins which cover the range of the data
 Assign each data instance to a bin
 Bin labels can be categories or numeric

Example bins:
Bin edges    Label
(-inf, 1]    N
(1, 16)      A
[16, inf)    O

Common choices of bin size:
 Equi-width (equi-log)
 Equi-depth

https://www.saedsayad.com/unsupervised_binning.htm
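As a concrete illustration, the sketch below implements equi-width binning in plain Python; the function name and data are illustrative, and pandas.cut (linked above) provides the same idea with richer labels and open/closed edge control.

```python
# Equi-width binning: split the value range into k equal-width bins
# and assign each value the index of the bin it falls into.

def equi_width_bins(values, k):
    """Return the bin index (0 .. k-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = []
    for v in values:
        i = int((v - lo) / width)
        bins.append(min(i, k - 1))  # clamp the maximum value into the last bin
    return bins

ages = [2, 15, 21, 34, 47, 58, 63, 79]
print(equi_width_bins(ages, 4))  # [0, 0, 0, 1, 2, 2, 3, 3]
```

Equi-depth binning would instead sort the values and cut so that each bin holds the same number of instances.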


CATEGORICAL AND TEXT CONVERSIONS

Categorical -> Numeric
 Binarization: convert to multiple binary attributes
  Temp = {hot, warm, cold}
  Hot = (1,0,0) = 4
  Warm = (0,1,0) = 2
  Cold = (0,0,1) = 1
  (each pattern can also be read as a binary number)

Text -> Numeric
 Vector representation: numeric, high-dimensional, and sparse (c.f. binarization)
 Document-term matrix: collection of all documents with normalized word frequencies
 Latent semantic analysis
  Application of singular value decomposition (SVD)
  Reduces the dimensions considerably: typically a lexicon of 100,000 dimensions can be embedded in fewer than 300 dimensions
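The binarization step above can be sketched in a few lines; the function name is illustrative, and libraries such as sklearn's OneHotEncoder do the same at scale.

```python
# Binarization (one-hot encoding): one binary attribute per category,
# matching the Temp = {hot, warm, cold} example above.

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

temps = ["hot", "warm", "cold"]
print(one_hot("hot", temps))   # [1, 0, 0]
print(one_hot("warm", temps))  # [0, 1, 0]
```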


TIME SERIES CONVERSION

Time Series -> Numeric
 Common: discrete wavelet transformation
 Removes dependency between original time-series values
 Turns a time series into multi-dimensional data
 Each dimension: a wavelet coefficient
 Keeping only the large coefficients gives dimensionality reduction


FEATURE EXTRACTION

 Feature extraction is the process of turning raw data into a set of features

 What considerations are needed to ensure that the features are useful?
  -
  -
  -

 What features should be excluded, even if they are easy to measure?
  -
  -
  -


HETEROGENEOUS DATA SOURCES

 Fare is in units of some currency
  A standard unit can easily be selected

 Ticket has a range of formats due to different information sources.
  Different sales agents?
  Different historical sources?


DATA CLEANING
Aggarwal Ch 2.3

 Missing entries
 Incorrect entries
 Scaling and normalization



MISSING DATA

 An attribute may not be measurable
 Survey data may contain optional questions
 Data may have been collected but later lost or corrupted
 Attributes may not apply to all records


 How do you interpret Age = NaN?

 How would you record a stowaway in this table?



DEALING WITH MISSING DATA

 Delete records
 Estimate the value
  Estimating values is a classification/regression problem
  Contextual data can help with estimation (dependency-oriented data)
  Beware of relying on large fractions of estimated data
 Remeasure
 Use an analytical app that can handle missing data
 Remove sparse attributes
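The simplest form of "estimate the value" is mean imputation, sketched below with illustrative names; a regression-based imputation would instead condition on the other attributes.

```python
# Mean imputation: replace missing entries (None) with the attribute's
# mean computed from the observed values only.

def impute_mean(column):
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

ages = [22.0, None, 38.0, None, 26.0]
print(impute_mean(ages))  # missing ages replaced with the mean of 22, 38, 26
```

Note that imputing many values this way shrinks the attribute's variance, one reason to beware of relying on large fractions of estimated data.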



INCORRECT OR INCONSISTENT DATA

 Inconsistency detection
  Cross-validation of information from multiple sources
 Domain knowledge
  A bag of apples shouldn't cost $450
  Melbourne / South Australia shouldn't be a valid address
  Sydney / Canada is valid though!
 Data-centric methods
  Use the data itself to determine what is normal and not normal
  This is clustering/outlier analysis


SCALING AND NORMALIZATION

 Attributes with different scales or distributions can cause bias in DM applications (e.g., data similarity)

Solutions:
 Scaling data via linear/log/Pareto or min/max functions
 Clipping
 Binning data -> discretization with numeric labels

https://developers.google.com/machine-learning/data-prep/transform/normalization
CLIPPING DATA

https://developers.google.com/machine-learning/data-prep/transform/normalization



LOGARITHMIC SCALING

 Many physical processes result in a power-law (or log) distribution of outputs (see examples)
 Logarithmic scaling converts such a distribution into a linear one

https://developers.google.com/machine-learning/data-prep/transform/normalization


LINEAR RESCALING

Min/max scaling: force data to be within [0,1]
 x -> (x - min) / (max - min)

Z-score scaling: force data to have zero mean and unit variance
 x -> z = (x - μ) / σ
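Both formulas can be sketched directly; function names are illustrative, and sklearn's MinMaxScaler and StandardScaler implement the same transforms.

```python
# Min/max and z-score scaling, exactly as defined above.

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    mu = sum(xs) / len(xs)
    sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sigma for x in xs]

data = [2.0, 4.0, 6.0, 8.0]
print(min_max(data))  # [0.0, 0.333..., 0.666..., 1.0]
print(z_score(data))  # zero mean, unit variance
```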



ALGORITHMS WHICH NEED FEATURE SCALING

Common feature:
 Distance/density measurement

https://www.kdnuggets.com/2020/04/data-transformation-standardization-normalization.html


SUMMARY

 Data + Algorithm = Results
  For best results we need the best data
  And the best algorithm (later lectures)
 Missing data can be removed or replaced
 Incorrect data can be updated
 Feature scaling helps avoid bias in DM applications


SCALING AND PREPROCESSING FOR UNSEEN DATA

 Whatever scaling and data preparation you do to your existing data has to be replicated on any new
data you receive.
 Keep track of the parameters of your scaling / selection functions.
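One way to follow this advice is to separate fitting from applying. The sketch below (hypothetical function names) stores min/max parameters computed from the existing data and reuses them unchanged on new data; sklearn's scalers follow the same fit/transform split.

```python
# Fit-then-apply: learn the scaling parameters once on existing data,
# keep them, and apply the SAME parameters to any new data.

def fit_min_max(train):
    return {"min": min(train), "max": max(train)}

def apply_min_max(xs, params):
    span = params["max"] - params["min"]
    return [(x - params["min"]) / span for x in xs]

params = fit_min_max([10.0, 20.0, 30.0])      # parameters from existing data
print(apply_min_max([15.0, 35.0], params))    # [0.25, 1.25]
```

Note that new data scaled with stored parameters can fall outside [0,1] (here 35.0 maps to 1.25); that is expected, and usually preferable to silently refitting on the new data.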



DATA REDUCTION AND TRANSFORMATION
Aggarwal Ch 2.4

 Data sampling
 Feature selection
 Dimensionality reduction
  Axis rotation
  Transformation


SAMPLING VIA SUBSET

Instances
 Random sampling
  Easy!
 Biased sampling
  Intentionally emphasize aspects of the data based on some 'relevance' score
 Stratified sampling
  Groups have different sizes
  Sampling draws evenly across groups
  What is the advantage?

Features
 Unsupervised selection
  Use the data to determine which features should be retained/removed
  Useful for clustering/outlier analysis
 Supervised selection
  Useful for classification problems
  Only features that can predict the class are retained
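Stratified sampling as described above can be sketched with the standard library; the record layout and function name are illustrative.

```python
# Stratified sampling: draw the same number of instances from each group,
# so small groups are not swamped by large ones.
import random

def stratified_sample(records, key, n_per_group, seed=0):
    rng = random.Random(seed)
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, n_per_group))
    return sample

data = ([{"cls": "A", "v": i} for i in range(90)]
        + [{"cls": "B", "v": i} for i in range(10)])
s = stratified_sample(data, "cls", 5)
print(len(s))  # 10: five from each class despite the 90/10 imbalance
```

The advantage: a plain random sample of size 10 would, on average, contain only one instance of class B.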



DIMENSIONALITY REDUCTION

Goal
 Accurate representation of the entire data
 Use a smaller amount of data

Approach
 Things that don't change are easy to represent (constants)
 Focus on the most changing (variable) parts of the data
  Retain the most variant aspects
  Excise the least variant aspects


PRINCIPAL COMPONENT ANALYSIS

"Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components."
- Wikipedia


PRINCIPAL COMPONENT ANALYSIS (PCA)

Visually:
1. Rotate axes about the origin
2. Identify the orientation which displays the largest spread -> component 1 (C1)
3. Project all data onto the (hyper)plane perpendicular to C1
4. GOTO 2

Each iteration you are visualizing the data in fewer dimensions


DATA SCALING

 What would happen if our features had different scales?


PCA IN PRACTICE

 Form data into matrix D
 Compute the mean-centred covariance matrix: C = (D - mean)^T (D - mean) / (n - 1)
 Decompose C as: C = P Λ P^T
  Where P is a matrix of orthonormal eigenvectors of C, and Λ is a diagonal matrix of eigenvalues
 Normal practice is to sort Λ and P such that the diagonal entries of Λ are ordered from largest to smallest
 The eigenvectors in P are the principal components
 The transformed data are now D' where: D' = (D - mean) P
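The steps above can be sketched in numpy (assumed available here); this follows the covariance-eigendecomposition route directly, whereas production code would typically call a library implementation such as sklearn's.

```python
# PCA as outlined: mean-centre, form the covariance matrix, eigendecompose,
# sort eigenvectors by descending eigenvalue, project.
import numpy as np

def pca(D, n_components):
    Dc = D - D.mean(axis=0)            # mean-centre each attribute
    C = np.cov(Dc, rowvar=False)       # covariance matrix
    eigvals, P = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    P = P[:, order[:n_components]]
    return Dc @ P                      # D' = (D - mean) P

rng = np.random.default_rng(0)
x = rng.normal(size=100)
D = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])
Dp = pca(D, 1)
print(Dp.shape)  # (100, 1): two correlated attributes reduced to one component
```

Because the two attributes are almost perfectly correlated, the single retained component captures nearly all of the variance.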



SINGULAR VALUE DECOMPOSITION (SVD)

 SVD is a more general version of PCA
 SVD can work with non-centred data
  No need to subtract the mean
 SVD algorithms are more efficient
 Libraries like sklearn.decomposition.PCA use SVD 'under the hood'

Applications:
 Dimensionality reduction (!)
 Noise reduction
 Replacing missing data (imputation)
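The PCA/SVD connection can be checked numerically (numpy assumed available): if the mean-centred matrix factors as Dc = U S V^T, then S^2 / (n - 1) equals the eigenvalues of the covariance matrix, with no covariance matrix ever formed.

```python
# SVD on the mean-centred data reproduces the PCA spectrum.
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(50, 3))
Dc = D - D.mean(axis=0)

U, S, Vt = np.linalg.svd(Dc, full_matrices=False)
svd_eigvals = S**2 / (len(D) - 1)          # singular values -> eigenvalues

cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(Dc, rowvar=False)))[::-1]
print(np.allclose(svd_eigvals, cov_eigvals))  # True
```

Skipping the covariance matrix is one reason SVD-based routines are more numerically efficient and stable.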



TYPE TRANSFORMATION

 Complex data => less complex data
  Wavelet transform
  Fourier transform


HAAR WAVELET TRANSFORM – ORDERED DATA

 Again, primarily interested in differences
 Represent a (1D) data series as a summation of wavelets
 Record the wavelet vectors and coefficients
 Can reconstruct the data


WAVELET DECOMPOSITION EXAMPLE

1. Series is V = (8, 6, 2, 3, 4, 6, 6, 5)
2. Multiply by basis vector b_i and sum
3. Divide by the number of non-zero entries in the basis vector
4. This is the coefficient w_i

w_i = Σ(V·b_i) / |b_i|

Where:
- |b_i| is the L1 norm of b_i (the number of non-zero entries, since entries are ±1 or 0)
- Σ is summation over the entries


SOLUTION

V = (8, 6, 2, 3, 4, 6, 6, 5) in every row.

b_i                          |b_i|   V·b_i (element-wise)         w_i = Σ(V·b_i)/|b_i|
( 1,-1, 0, 0, 0, 0, 0, 0)      2    ( 8,-6, 0, 0, 0, 0, 0, 0)     1
( 0, 0, 1,-1, 0, 0, 0, 0)      2    ( 0, 0, 2,-3, 0, 0, 0, 0)    -0.5
( 0, 0, 0, 0, 1,-1, 0, 0)      2    ( 0, 0, 0, 0, 4,-6, 0, 0)    -1
( 0, 0, 0, 0, 0, 0, 1,-1)      2    ( 0, 0, 0, 0, 0, 0, 6,-5)     0.5
( 1, 1,-1,-1, 0, 0, 0, 0)      4    ( 8, 6,-2,-3, 0, 0, 0, 0)     2.25
( 0, 0, 0, 0, 1, 1,-1,-1)      4    ( 0, 0, 0, 0, 4, 6,-6,-5)    -0.25
( 1, 1, 1, 1,-1,-1,-1,-1)      8    ( 8, 6, 2, 3,-4,-6,-6,-5)    -0.25
( 1, 1, 1, 1, 1, 1, 1, 1)      8    ( 8, 6, 2, 3, 4, 6, 6, 5)     5
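The worked solution can be verified in code; a minimal sketch, with the basis vectors written out exactly as in the table.

```python
# w_i = sum(V * b_i) / |b_i|, where |b_i| counts the non-zero entries.

def haar_coefficient(V, b):
    return sum(v * s for v, s in zip(V, b)) / sum(1 for s in b if s != 0)

V = [8, 6, 2, 3, 4, 6, 6, 5]
basis = [
    [1, -1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, -1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, -1],
    [1, 1, -1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, -1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
w = [haar_coefficient(V, b) for b in basis]
print(w)  # [1.0, -0.5, -1.0, 0.5, 2.25, -0.25, -0.25, 5.0]
```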



DATA RECONSTRUCTION

 Reconstruction recovers the original data: (8, 6, 2, 3, 4, 6, 6, 5)
 Can invert the matrix so that we compute the weights directly: W = D H^-1
 The wavelet coefficients describe differences at various scales


EXERCISE

 Consider the time series: 1, 1, 3, 3, 3, 3, 1, 1

 Perform wavelet decomposition on the series.

 How many coefficients are not zero?



MULTI-DIMENSIONAL DATA

[Figure: 2D wavelet quadrants labelled O, H, V, D]

 Treat as an image
  Transform along 3 axes:
   Horizontal
   Vertical
   Diagonal
 Treat as multiple 1D data
  e.g., temp, humidity, irradiance
  Transform each separately


MULTI-DIMENSIONAL DATA

 Treat as multiple 1D data
  e.g., temp, humidity, irradiance
  Transform each separately

 What if our (non-image) features had different scales?


IMAGE RECONSTRUCTION

 Load data in order of importance
  Example: progressive JPEG
 What if we don't restore all the wavelet components?


SUMMARY

 Data reduction
  Less data means faster processing
  Ideally less data doesn't need to mean less information
 Data transformation
  Reduce complexity of data
  Remove noise


NEXT: DATA SIMILARITY AND DATA DISTANCE
AGGARWAL CHAPTER 3
