
02 DataPreparation

The document discusses various tasks involved in data preparation for data mining, including feature extraction, data cleaning, and data transformation. It describes techniques for handling different data types and formats, discretizing continuous variables, dealing with missing data, and resolving inconsistent data.


COMP5009

DATA MINING

WEEK 2
DATA
PREPARATION
DR PAUL HANCOCK
CURTIN UNIVERSITY
SEMESTER 2, 2022
DATA PREPARATION
Aggarwal Ch 2

"Remember, if you fail to prepare you are preparing to fail."
- H. K. Williams

Data preparation tasks:


 Feature extraction
 Data cleaning
 Data reduction and transformation

COMP5009 – DATA MINING, CURTIN UNIVERSITY 2


FEATURE EXTRACTION
Aggarwal Ch 2.2

 Raw data is typically not ideal for many data mining algorithms:
 Not all data are relevant
  Measurement = Signal + Noise
 Unsuitable data types
 Missing & corrupted data
 Values have a range of distributions
 Data can come from different sources
 Very large size (e.g., video data)


Data source: Sensor (e.g., voltage meter)
 Potential issue: large volume of data with low information density
 Potential solution: summary statistics or aggregated data; wavelet/Fourier transform; signal processing

Data source: Image (e.g., cat photo)
 Potential issue: high-dimensional data (spatial and spectral); features are not pixels, but groups of pixels
 Potential solution: image segmentation; histogram analysis

Data source: Document (e.g., blog post, report)
 Potential issue: human language has grammar, but it is dynamic
 Potential solution: word frequency or bag-of-words approach; domain-specific statistical models

Data source: Survey data (e.g., census)
 Potential issue: ?
 Potential solution: ?

Data source: Data from multiple sources
 Potential issue: ?
 Potential solution: ?


FEATURE EXTRACTION

 Sensor data
  Time-series data
  Transforms: Fourier, wavelets
 Image data
  Raw data: pixels (R,G,B values)
  Low level: corners, edges, lines, colour histograms, texture, ...
  High level: shapes, visual words, ...
 Web logs
  Text strings in pre-specified format
  Fields easily extracted
 Network traffic
  Raw data: traffic packets + routing information
  Example: KDD Cup 99 Intrusion Detection Dataset
  Features: duration, protocol, service, source, destination
 Documents
  Basic: term-document matrix (bag of words)
  Advanced: bigrams, trigrams, named entities
  XML documents: tree-based representations
  Vector or tree representation


DATA TYPES AND PORTABILITY

 Different sources present data in different formats
 DM applications need homogeneous data
 Converting between data types is crucial
 Numeric data is the easiest to work with


DISCRETIZATION

 Converting numerical data into categorical data

https://www.absentdata.com/pandas/pandas-cut-continuous-to-categorical/


DISCRETIZATION

Process:
 Create bins which cover the range of the data
 Assign each data instance to a bin
 Bin labels can be categories or numeric

Example bins:
Bin edges    Label
(-inf, 1]    N
(1, 16)      A
[16, inf)    O

Common choices of bin size:
 Equi-width (equi-log)
 Equi-depth

https://www.saedsayad.com/unsupervised_binning.htm
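As a concrete illustration, the sketch below implements equi-width binning in plain Python; the function name and data are illustrative, and pandas.cut (linked above) provides the same idea with richer labels and open/closed edge control.

```python
# Equi-width binning: split the value range into k equal-width bins
# and assign each value the index of the bin it falls into.

def equi_width_bins(values, k):
    """Return the bin index (0 .. k-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = []
    for v in values:
        i = int((v - lo) / width)
        bins.append(min(i, k - 1))  # clamp the maximum value into the last bin
    return bins

ages = [2, 15, 21, 34, 47, 58, 63, 79]
print(equi_width_bins(ages, 4))  # [0, 0, 0, 1, 2, 2, 3, 3]
```

Equi-depth binning would instead sort the values and cut so that each bin holds the same number of instances.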


CATEGORICAL AND TEXT CONVERSIONS

Categorical -> Numeric
 Binarization: convert to multiple binary attributes
  Temp = {hot, warm, cold}
  Hot = (1,0,0) = 4
  Warm = (0,1,0) = 2
  Cold = (0,0,1) = 1
  (each pattern can also be read as a binary number)

Text -> Numeric
 Vector representation: numeric, high-dimensional, and sparse (c.f. binarization)
 Document-term matrix: collection of all documents with normalized word frequencies
 Latent semantic analysis
  Application of singular value decomposition (SVD)
  Reduces the dimensions considerably: typically a lexicon of 100,000 dimensions can be embedded in fewer than 300 dimensions
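The binarization step above can be sketched in a few lines; the function name is illustrative, and libraries such as sklearn's OneHotEncoder do the same at scale.

```python
# Binarization (one-hot encoding): one binary attribute per category,
# matching the Temp = {hot, warm, cold} example above.

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

temps = ["hot", "warm", "cold"]
print(one_hot("hot", temps))   # [1, 0, 0]
print(one_hot("warm", temps))  # [0, 1, 0]
```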


TIME SERIES CONVERSION

Time Series -> Numeric
 Common: discrete wavelet transformation
 Removes dependency between original time-series values
 Turns a time series into multi-dimensional data
 Each dimension: a wavelet coefficient
 Keeping only the large coefficients gives dimensionality reduction


FEATURE EXTRACTION

 Feature extraction is the process of turning raw data into a set of features

 What considerations are needed to ensure that the features are useful?
  -
  -
  -

 What features should be excluded, even if they are easy to measure?
  -
  -
  -


HETEROGENEOUS DATA SOURCES

 Fare is in units of some currency
  A standard unit can easily be selected

 Ticket has a range of formats due to different information sources.
  Different sales agents?
  Different historical sources?


DATA CLEANING
Aggarwal Ch 2.3

 Missing entries
 Incorrect entries
 Scaling and normalization



MISSING DATA

 An attribute may not be measurable
 Survey data may contain optional questions
 Data may have been collected but later lost or corrupted
 Attributes may not apply to all records


 How do you interpret Age = NaN?

 How would you record a stowaway in this table?



DEALING WITH MISSING DATA

 Delete records
 Estimate the value
  Estimating values is a classification/regression problem
  Contextual data can help with estimation (dependency-oriented data)
  Beware of relying on large fractions of estimated data
 Remeasure
 Use an analytical app that can handle missing data
 Remove sparse attributes
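The simplest form of "estimate the value" is mean imputation, sketched below with illustrative names; a regression-based imputation would instead condition on the other attributes.

```python
# Mean imputation: replace missing entries (None) with the attribute's
# mean computed from the observed values only.

def impute_mean(column):
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

ages = [22.0, None, 38.0, None, 26.0]
print(impute_mean(ages))  # missing ages replaced with the mean of 22, 38, 26
```

Note that imputing many values this way shrinks the attribute's variance, one reason to beware of relying on large fractions of estimated data.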



INCORRECT OR INCONSISTENT DATA

 Inconsistency detection
  Cross-validation of information from multiple sources
 Domain knowledge
  A bag of apples shouldn't cost $450
  Melbourne / South Australia shouldn't be a valid address
  Sydney / Canada is valid though!
 Data-centric methods
  Use the data itself to determine what is normal and not normal
  This is clustering/outlier analysis


SCALING AND NORMALIZATION

 Attributes with different scales or distributions can cause bias in DM applications (e.g., data similarity)

Solutions:
 Scaling data via linear/log/Pareto or min/max functions
 Clipping
 Binning data -> discretization with numeric labels

https://developers.google.com/machine-learning/data-prep/transform/normalization
CLIPPING DATA

https://developers.google.com/machine-learning/data-prep/transform/normalization



LOGARITHMIC SCALING

 Many physical processes result in a power-law (or log) distribution of outputs (see examples)
 Logarithmic scaling converts such a distribution into a linear one

https://developers.google.com/machine-learning/data-prep/transform/normalization


LINEAR RESCALING

Min/max scaling: force data to be within [0,1]
 x -> (x - min) / (max - min)

Z-score scaling: force data to have zero mean and unit variance
 x -> z = (x - μ) / σ
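Both formulas can be sketched directly; function names are illustrative, and sklearn's MinMaxScaler and StandardScaler implement the same transforms.

```python
# Min/max and z-score scaling, exactly as defined above.

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    mu = sum(xs) / len(xs)
    sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sigma for x in xs]

data = [2.0, 4.0, 6.0, 8.0]
print(min_max(data))  # [0.0, 0.333..., 0.666..., 1.0]
print(z_score(data))  # zero mean, unit variance
```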



ALGORITHMS WHICH NEED FEATURE SCALING

Common feature:
 Distance/density measurement

https://www.kdnuggets.com/2020/04/data-transformation-standardization-normalization.html


SUMMARY

 Data + Algorithm = Results
  For best results we need the best data
  And the best algorithm (later lectures)
 Missing data can be removed or replaced
 Incorrect data can be updated
 Feature scaling helps avoid bias in DM applications


SCALING AND PREPROCESSING FOR UNSEEN DATA

 Whatever scaling and data preparation you do to your existing data has to be replicated on any new
data you receive.
 Keep track of the parameters of your scaling / selection functions.
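One way to follow this advice is to separate fitting from applying. The sketch below (hypothetical function names) stores min/max parameters computed from the existing data and reuses them unchanged on new data; sklearn's scalers follow the same fit/transform split.

```python
# Fit-then-apply: learn the scaling parameters once on existing data,
# keep them, and apply the SAME parameters to any new data.

def fit_min_max(train):
    return {"min": min(train), "max": max(train)}

def apply_min_max(xs, params):
    span = params["max"] - params["min"]
    return [(x - params["min"]) / span for x in xs]

params = fit_min_max([10.0, 20.0, 30.0])      # parameters from existing data
print(apply_min_max([15.0, 35.0], params))    # [0.25, 1.25]
```

Note that new data scaled with stored parameters can fall outside [0,1] (here 35.0 maps to 1.25); that is expected, and usually preferable to silently refitting on the new data.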



DATA REDUCTION AND TRANSFORMATION
Aggarwal Ch 2.4

 Data sampling
 Feature selection
 Dimensionality reduction
  Axis rotation
  Transformation


SAMPLING VIA SUBSET

Instances
 Random sampling
  Easy!
 Biased sampling
  Intentionally emphasize aspects of the data based on some 'relevance' score
 Stratified sampling
  Groups have different sizes
  Sampling draws evenly across groups
  What is the advantage?

Features
 Unsupervised selection
  Use the data to determine which features should be retained/removed
  Useful for clustering/outlier analysis
 Supervised selection
  Useful for classification problems
  Only features that can predict the class are retained
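Stratified sampling as described above can be sketched with the standard library; the record layout and function name are illustrative.

```python
# Stratified sampling: draw the same number of instances from each group,
# so small groups are not swamped by large ones.
import random

def stratified_sample(records, key, n_per_group, seed=0):
    rng = random.Random(seed)
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, n_per_group))
    return sample

data = ([{"cls": "A", "v": i} for i in range(90)]
        + [{"cls": "B", "v": i} for i in range(10)])
s = stratified_sample(data, "cls", 5)
print(len(s))  # 10: five from each class despite the 90/10 imbalance
```

The advantage: a plain random sample of size 10 would, on average, contain only one instance of class B.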



DIMENSIONALITY REDUCTION

Goal
 Accurate representation of the entire data
 Use a smaller amount of data

Approach
 Things that don't change are easy to represent (constants)
 Focus on the most changing (variable) parts of the data
  Retain the most variant aspects
  Excise the least variant aspects


PRINCIPAL COMPONENT ANALYSIS

"Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components."
- Wikipedia


PRINCIPAL COMPONENT ANALYSIS (PCA)

Visually:
1. Rotate axes about the origin
2. Identify the orientation which displays the largest spread -> component 1 (C1)
3. Project all data onto the (hyper)plane perpendicular to C1
4. GOTO 2

Each iteration you are visualizing the data in fewer dimensions


DATA SCALING

 What would happen if our features had different scales?


PCA IN PRACTICE

 Form data into matrix D
 Compute the mean-centred covariance matrix: C = (D - mean)^T (D - mean) / (n - 1)
 Decompose C as: C = P Λ P^T
  Where P is a matrix of orthonormal eigenvectors of C, and Λ is a diagonal matrix of eigenvalues
 Normal practice is to sort Λ and P such that the diagonal entries of Λ are ordered from largest to smallest
 The eigenvectors in P are the principal components
 The transformed data are now D' where: D' = (D - mean) P
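The steps above can be sketched in numpy (assumed available here); this follows the covariance-eigendecomposition route directly, whereas production code would typically call a library implementation such as sklearn's.

```python
# PCA as outlined: mean-centre, form the covariance matrix, eigendecompose,
# sort eigenvectors by descending eigenvalue, project.
import numpy as np

def pca(D, n_components):
    Dc = D - D.mean(axis=0)            # mean-centre each attribute
    C = np.cov(Dc, rowvar=False)       # covariance matrix
    eigvals, P = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    P = P[:, order[:n_components]]
    return Dc @ P                      # D' = (D - mean) P

rng = np.random.default_rng(0)
x = rng.normal(size=100)
D = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])
Dp = pca(D, 1)
print(Dp.shape)  # (100, 1): two correlated attributes reduced to one component
```

Because the two attributes are almost perfectly correlated, the single retained component captures nearly all of the variance.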



SINGULAR VALUE DECOMPOSITION (SVD)

 SVD is a more general version of PCA
 SVD can work with non-centred data
  No need to subtract the mean
 SVD algorithms are more efficient
 Libraries like sklearn.decomposition.PCA use SVD 'under the hood'

Applications:
 Dimensionality reduction (!)
 Noise reduction
 Replacing missing data (imputation)
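The PCA/SVD connection can be checked numerically (numpy assumed available): if the mean-centred matrix factors as Dc = U S V^T, then S^2 / (n - 1) equals the eigenvalues of the covariance matrix, with no covariance matrix ever formed.

```python
# SVD on the mean-centred data reproduces the PCA spectrum.
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(50, 3))
Dc = D - D.mean(axis=0)

U, S, Vt = np.linalg.svd(Dc, full_matrices=False)
svd_eigvals = S**2 / (len(D) - 1)          # singular values -> eigenvalues

cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(Dc, rowvar=False)))[::-1]
print(np.allclose(svd_eigvals, cov_eigvals))  # True
```

Skipping the covariance matrix is one reason SVD-based routines are more numerically efficient and stable.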



TYPE TRANSFORMATION

 Complex data => less complex data
  Wavelet transform
  Fourier transform


HAAR WAVELET TRANSFORM – ORDERED DATA

 Again, primarily interested in differences
 Represent a (1D) data series as a summation of wavelets
 Record the wavelet vectors and coefficients
 Can reconstruct the data


WAVELET DECOMPOSITION EXAMPLE

1. Series is V = (8, 6, 2, 3, 4, 6, 6, 5)
2. Multiply by basis vector b_i and sum
3. Divide by the number of non-zero entries in the basis vector
4. This is the coefficient w_i

w_i = Σ(V·b_i) / |b_i|

Where:
- |b_i| is the L1 norm of b_i (the number of non-zero entries, since entries are ±1 or 0)
- Σ is summation over the entries


SOLUTION

V = (8, 6, 2, 3, 4, 6, 6, 5) in every row.

b_i                          |b_i|   V·b_i (element-wise)         w_i = Σ(V·b_i)/|b_i|
( 1,-1, 0, 0, 0, 0, 0, 0)      2    ( 8,-6, 0, 0, 0, 0, 0, 0)     1
( 0, 0, 1,-1, 0, 0, 0, 0)      2    ( 0, 0, 2,-3, 0, 0, 0, 0)    -0.5
( 0, 0, 0, 0, 1,-1, 0, 0)      2    ( 0, 0, 0, 0, 4,-6, 0, 0)    -1
( 0, 0, 0, 0, 0, 0, 1,-1)      2    ( 0, 0, 0, 0, 0, 0, 6,-5)     0.5
( 1, 1,-1,-1, 0, 0, 0, 0)      4    ( 8, 6,-2,-3, 0, 0, 0, 0)     2.25
( 0, 0, 0, 0, 1, 1,-1,-1)      4    ( 0, 0, 0, 0, 4, 6,-6,-5)    -0.25
( 1, 1, 1, 1,-1,-1,-1,-1)      8    ( 8, 6, 2, 3,-4,-6,-6,-5)    -0.25
( 1, 1, 1, 1, 1, 1, 1, 1)      8    ( 8, 6, 2, 3, 4, 6, 6, 5)     5
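The worked solution can be verified in code; a minimal sketch, with the basis vectors written out exactly as in the table.

```python
# w_i = sum(V * b_i) / |b_i|, where |b_i| counts the non-zero entries.

def haar_coefficient(V, b):
    return sum(v * s for v, s in zip(V, b)) / sum(1 for s in b if s != 0)

V = [8, 6, 2, 3, 4, 6, 6, 5]
basis = [
    [1, -1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, -1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, -1],
    [1, 1, -1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, -1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
w = [haar_coefficient(V, b) for b in basis]
print(w)  # [1.0, -0.5, -1.0, 0.5, 2.25, -0.25, -0.25, 5.0]
```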



DATA RECONSTRUCTION

 Reconstruction recovers the original data: (8, 6, 2, 3, 4, 6, 6, 5)
 Can invert the matrix so that we compute the weights directly: W = D H^-1
 The wavelet coefficients describe differences at various scales


EXERCISE

 Consider the time series: 1, 1, 3, 3, 3, 3, 1, 1

 Perform wavelet decomposition on the series.

 How many coefficients are not zero?



MULTI-DIMENSIONAL DATA

[Figure: 2D wavelet quadrants labelled O, H, V, D]

 Treat as an image
  Transform along 3 axes:
   Horizontal
   Vertical
   Diagonal
 Treat as multiple 1D data
  e.g., temp, humidity, irradiance
  Transform each separately


MULTI-DIMENSIONAL DATA

 Treat as multiple 1D data
  e.g., temp, humidity, irradiance
  Transform each separately

 What if our (non-image) features had different scales?


IMAGE RECONSTRUCTION

 Load data in order of importance
  Example: progressive JPEG
 What if we don't restore all the wavelet components?


SUMMARY

 Data reduction
  Less data means faster processing
  Ideally less data doesn't need to mean less information
 Data transformation
  Reduce complexity of data
  Remove noise


NEXT: DATA SIMILARITY AND DATA DISTANCE
AGGARWAL CHAPTER 3
