ML2 1 Data Transformations
WS 18/19
Dr. Benjamin Guthier
Professur für Bildverarbeitung
• Redundant attributes
– Two attributes with highly correlated values
– Corresponds to giving the attribute a higher weight (probably undesirable)
• Irrelevant attributes
– Attribute is uncorrelated with class label
– May erroneously appear “good” far down in a decision tree (few examples remain there, so chance correlations look informative)
• Intuitively:
– 𝐻(𝐴) is the information provided by attribute 𝐴
– 𝐻(𝐵) is the information provided by attribute 𝐵
– 𝐻(𝐴, 𝐵) is the combined information (“𝐴 ∪ 𝐵”)
• Information of 𝐴 + Information of 𝐵 − Redundancy:
𝐻(𝐴, 𝐵) = 𝐻(𝐴) + 𝐻(𝐵) − 𝑅   or   𝑅 = 𝐻(𝐴) + 𝐻(𝐵) − 𝐻(𝐴, 𝐵)
(Figure: Venn diagram of 𝐻(𝐴) and 𝐻(𝐵) with overlap 𝑅)
Two examples:
1. Two attributes 𝐴 and 𝐵 are independent
𝐻(𝐴, 𝐵) = 𝐻(𝐴) + 𝐻(𝐵)
𝑈(𝐴, 𝐵) = 2 ⋅ (𝐻(𝐴) + 𝐻(𝐵) − 𝐻(𝐴, 𝐵)) / (𝐻(𝐴) + 𝐻(𝐵)) = 0 → no redundancy
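A minimal sketch of computing these quantities from data; the normalized measure 𝑈(𝐴, 𝐵) above is often called symmetric uncertainty, and the helper names below are illustrative:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy H of a discrete attribute, in bits."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(a, b):
    """Joint entropy H(A, B): entropy of the value pairs."""
    return entropy(list(zip(a, b)))

def symmetric_uncertainty(a, b):
    """U(A, B) = 2 * (H(A) + H(B) - H(A, B)) / (H(A) + H(B))."""
    h_a, h_b = entropy(a), entropy(b)
    return 2 * (h_a + h_b - joint_entropy(a, b)) / (h_a + h_b)

# Independent attributes -> U close to 0 (no redundancy);
# identical attributes   -> U = 1 (fully redundant)
a = np.random.randint(0, 2, size=10_000)
b = np.random.randint(0, 2, size=10_000)
print(symmetric_uncertainty(a, b))  # approximately 0
print(symmetric_uncertainty(a, a))  # exactly 1
```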
• Challenges:
– Explore the space of attribute subsets efficiently
• For 𝑛 attributes, there are 2ⁿ possible subsets
– Training and evaluating the model over and over takes too long
• At every stage:
– Try adding each remaining attribute
– Train model and evaluate
– Only continue with the subset that performs best (see the sketch below)
(Figure: search tree over attribute subsets A, B, C → A,B, A,C, B,C)
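A minimal sketch of this greedy forward selection, assuming a hypothetical evaluate(subset) function that trains and scores the model on a given attribute subset:

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection over attribute subsets.

    `evaluate(subset)` is assumed to train the model on the given attribute
    subset and return its (e.g. cross-validated) accuracy."""
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        # Try adding each remaining attribute to the current subset
        score, attr = max(((evaluate(selected + [a]), a) for a in remaining),
                          key=lambda t: t[0])
        if score <= best_score:
            break                    # no remaining attribute improves accuracy
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score
```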
• Race search
– At each stage, evaluate all 𝑘 subsets in parallel
– Drop the ones early that fall behind in accuracy
• Pre-select attributes
– Rank all attributes using information gain before adding them
– Only evaluate subsets where high IG attributes have been added
• Two approaches
– Equal-interval binning
– Equal-frequency binning
(Figure: value range of a numeric attribute from min to max, split into half-open discretization intervals; one discrete interval marked as an example)
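A sketch of both binning schemes using NumPy (the data and the number of bins are made up for illustration):

```python
import numpy as np

values = np.array([1.2, 1.5, 2.0, 2.1, 3.7, 4.0, 8.5, 9.0, 9.1, 20.0])
k = 4   # number of bins

# Equal-interval binning: split [min, max] into k intervals of equal width
width = (values.max() - values.min()) / k
interval_edges = values.min() + width * np.arange(1, k)

# Equal-frequency binning: edges chosen so each bin holds ~the same number of examples
frequency_edges = np.quantile(values, [i / k for i in range(1, k)])

# Assign each value to a (half-open) bin
print(np.digitize(values, interval_edges))    # skewed: most values land in the first bin
print(np.digitize(values, frequency_edges))   # balanced: ~N/k values per bin
```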
• General approach
– Try out different split points (thresholds)
– Evaluate usefulness of a split for classification. Choose best split
– Keep splitting sub-intervals recursively until stop criterion is met
(Figure: example data of two classes plotted over attribute 𝑎1 ∈ [0, 1])
Evaluating Splits (2)
• Any threshold for 𝑎1 in [1/3, 2/3] produces very similar error
– Error does not decrease further when splitting 𝑎1 into 3 intervals
– However: ideal discretization requires the shown splits
• In the example, choosing thresholds near 𝑎1 = 1/3 or 𝑎1 = 2/3 gives the “cleanest” separation of classes
– These thresholds minimize entropy / maximize information gain
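A sketch of searching for the entropy-minimizing threshold over one numeric attribute (helper names are illustrative; a full discretizer would apply this recursively to the resulting sub-intervals until a stop criterion is met):

```python
import numpy as np

def entropy(labels):
    """Class entropy in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Try midpoints between consecutive sorted attribute values as thresholds
    and return the one minimizing the weighted entropy of the two sub-intervals."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_h = None, np.inf
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2      # candidate threshold
        left, right = labels[:i], labels[i:]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h

values = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.9])
labels = np.array(["A", "A", "A", "B", "B", "B", "A"])
print(best_split(values, labels))   # threshold 0.375, weighted entropy ~0.46
```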
• Entropy keeps decreasing (𝐼𝐺 > 0) as more intervals are created
⇒ We need a stop criterion
• Entropy of 𝑆 is:
𝐻(𝑆) = −0.5 ⋅ log₂(0.5) − 0.25 ⋅ log₂(0.25) − 0.25 ⋅ log₂(0.25)
= 0.5 ⋅ 1 + 0.25 ⋅ 2 + 0.25 ⋅ 2 = 1.5
• Problem:
– Nodes in decision trees split attribute space parallel to axes
• E.g., 𝑎1 > 0.4
– Best split may be oblique (not aligned to coordinate axes)
• Interactive demo:
http://setosa.io/ev/principal-component-analysis/
• Intuitively:
– Calculate direction of greatest variance
– Use this direction as first coordinate axis
– Second axis must be orthogonal to first
• E.g., on a plane perpendicular to first axis in 3D
– Find direction of greatest variance under this constraint
– And so on…
• Implemented as:
– Calculate sample covariance matrix 𝚺
– New coordinate axes are eigenvectors of 𝚺
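A minimal sketch of these two steps with NumPy (synthetic data for illustration):

```python
import numpy as np

# Synthetic data: 200 examples with 3 correlated attributes
X = np.random.randn(200, 3) @ np.array([[3.0, 0.0, 0.0],
                                        [1.0, 1.0, 0.0],
                                        [0.0, 0.0, 0.1]])
X = X - X.mean(axis=0)                      # center the data

# Sample covariance matrix (rows = examples, columns = attributes)
cov = np.cov(X, rowvar=False)

# Eigenvectors of the covariance matrix are the new coordinate axes.
# eigh returns eigenvalues in ascending order, so reverse to get the
# direction of greatest variance first.
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
print(eigvals)         # variance along each new axis
print(eigvecs[:, 0])   # first axis: direction of greatest variance
```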
• Sample variance: 𝑠 = 1/(𝑁 − 1) ⋅ Σ𝑖 (𝑥𝑖 − 𝑥̄)²
– Using the sample mean underestimates the population variance
– The term 1/(𝑁 − 1) corrects for this
• Observations:
– Can be computed for any two attributes 𝑗 and 𝑘
– 𝑠𝑘𝑘 is the sample variance of 𝑋𝑘
– Covariance is un-normalized correlation
– Independent random variables have 0 covariance
• Note: 0 covariance does not imply independence
• Alternative notation:
– Normalize attribute values first (zero mean): 𝑥′𝑖𝑗 = 𝑥𝑖𝑗 − 𝑥̄𝑗
– Create vector for (normalized) sample 𝑖: 𝒙′𝒊 = (𝑥′𝑖1, …, 𝑥′𝑖𝐾)ᵀ
– 𝚺 = 1/(𝑁 − 1) ⋅ Σ𝑖 𝒙′𝒊 ⋅ 𝒙′𝒊ᵀ
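A sketch showing that this outer-product formulation matches NumPy's sample covariance (random data for illustration):

```python
import numpy as np

X = np.random.randn(100, 4)        # N = 100 examples, K = 4 attributes
X_prime = X - X.mean(axis=0)       # zero-mean samples x'_i (one per row)
N = X_prime.shape[0]

# Sum of outer products x'_i x'_i^T, divided by N - 1
sigma = sum(np.outer(x, x) for x in X_prime) / (N - 1)

# Matches the usual sample covariance matrix
print(np.allclose(sigma, np.cov(X, rowvar=False)))   # True
```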
• Mathematically:
– 𝒗 being an eigenvector of 𝑨 means 𝑨𝒗 = 𝜆𝒗
– 𝜆 is the eigenvalue
– All multiples of 𝒗 are eigenvectors and form an eigenspace
• 𝑨𝒙 is closer to the eigenspace of the largest-magnitude eigenvalue than 𝒙 was, for almost any 𝒙 ∈ ℝ𝐾
• Can be used to calculate eigenvectors: keep transforming points by 𝑨 (power iteration, see the sketch below)
• Interactive demo:
http://setosa.io/ev/eigenvectors-and-eigenvalues/
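A minimal power-iteration sketch of repeatedly transforming a point by 𝑨 (illustrative 2×2 matrix):

```python
import numpy as np

def power_iteration(A, steps=100):
    """Repeatedly transform a vector by A; it converges towards the
    eigenspace of the eigenvalue with the largest magnitude."""
    x = np.random.randn(A.shape[0])
    for _ in range(steps):
        x = A @ x
        x = x / np.linalg.norm(x)   # re-normalize to avoid overflow
    eigval = x @ A @ x              # Rayleigh quotient estimate of lambda
    return eigval, x

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(A)
print(lam, v)             # dominant eigenvalue and eigenvector
print(A @ v, lam * v)     # A v is (approximately) lambda v
```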
• Eigendecomposition: 𝚺 = 𝑸𝚲𝑸𝑇
– 𝑸 ∈ ℝ𝐾×𝐾 has eigenvectors 𝒗1 , … , 𝒗𝐾 as columns
– 𝑸 is orthogonal: 𝑸−1 = 𝑸𝑇
– 𝚲 is a diagonal matrix with eigenvalues 𝜆1 , … , 𝜆𝐾 on the diagonal
• 𝑸 = (𝒗1 𝒗2 … 𝒗𝐾),  𝚲 = diag(𝜆1, …, 𝜆𝐾)
• 𝒗1 is an eigenvector of 𝚺, so 𝚺𝒗1 = 𝒗1 𝜆1
• Do this with all eigenvectors at once: 𝚺𝑸 = 𝑸𝚲
– 𝑸𝚲 = (𝜆1 𝒗1 𝜆2 𝒗2 … 𝜆𝐾 𝒗𝐾 ): Matrix of scaled eigenvectors
• Since 𝑸 is orthogonal: 𝜮𝑸 = 𝑸𝚲 ⟺ 𝚺 = 𝑸𝚲𝑸𝑇
• Reduce dimensionality
– Decide how many attributes 𝐿 < 𝐾 to keep
– Truncate 𝑸 to 𝐾 × 𝐿 matrix 𝑸𝐿 of the first 𝐿 eigenvectors
– 𝑁 × 𝐿 matrix 𝑴𝑸𝐿 contains transformed examples with 𝐿 attributes
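A sketch of the truncation step, assuming 𝑴 is the 𝑁 × 𝐾 matrix of (zero-mean) examples and the eigenvectors are sorted by decreasing eigenvalue:

```python
import numpy as np

# M: N x K matrix of zero-mean examples (synthetic data here)
N, K, L = 500, 10, 3
M = np.random.randn(N, K)
M = M - M.mean(axis=0)

# Eigendecomposition of the sample covariance matrix, largest eigenvalue first
eigvals, Q = np.linalg.eigh(np.cov(M, rowvar=False))
eigvals, Q = eigvals[::-1], Q[:, ::-1]

Q_L = Q[:, :L]          # K x L: keep only the first L eigenvectors
M_reduced = M @ Q_L     # N x L: transformed examples with L attributes
print(M_reduced.shape)  # (500, 3)
```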
• Percentage of variance per principal component:

Component   Variance   Cumulative
    3          4.7%       83.9%
    4          4.0%       87.9%
    5          3.2%       91.1%
    6          2.9%       94.0%
    7          2.0%       96.0%
    8          1.7%       97.7%
    9          1.4%       99.1%
   10          0.9%      100.0%

(Figure: percentage of variance plotted over component number 1–10)
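A sketch of how such percentages can be derived from the eigenvalues (the eigenvalues below are made up for illustration and do not correspond to the table above):

```python
import numpy as np

# Eigenvalues of the covariance matrix, largest first (illustrative values)
eigvals = np.array([9.8, 2.1, 0.9, 0.5, 0.3, 0.2, 0.1, 0.05, 0.03, 0.02])

per_component = eigvals / eigvals.sum() * 100   # percentage of variance per component
cumulative = np.cumsum(per_component)           # cumulative percentage

# Keep enough components to explain e.g. 90% of the variance
L = np.searchsorted(cumulative, 90) + 1
print(per_component.round(1))
print(cumulative.round(1))
print(L)
```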
Synthetic Minority Over-Sampling Technique (SMOTE)
• Simple approach:
– Train multiple different models
– Training examples misclassified by a model indicate an outlier
– Count how often an example is misclassified and discard it if the count exceeds a threshold (sketched below)
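A minimal sketch of this filtering step, assuming a list of already trained models with scikit-learn-style predict():

```python
import numpy as np

def filter_outliers(models, X, y, threshold):
    """Count per training example how many of the (already trained) models
    misclassify it, and discard examples whose count exceeds the threshold.
    Each model is assumed to offer a scikit-learn style predict(X)."""
    miscounts = np.zeros(len(y), dtype=int)
    for model in models:
        miscounts += (model.predict(X) != y)
    keep = miscounts <= threshold
    return X[keep], y[keep]
```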
• Observation now:
– Inliers form dense clusters in attribute space
– Many splits required to isolate them
– Outliers are isolated after only a few splits (further up in the tree)
(Figure: example data with a single outlier; high probability of the outlier being isolated after only a few splits)
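This is the idea behind isolation forests; scikit-learn provides an implementation that can be used as sketched below (synthetic data for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # dense cluster
outliers = rng.uniform(low=-6, high=6, size=(5, 2))       # isolated points
X = np.vstack([inliers, outliers])

# Each tree splits the attribute space at random; points that end up
# isolated after only a few splits receive an anomalous score.
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = forest.predict(X)        # +1 for inliers, -1 for flagged outliers
print(np.where(labels == -1)[0])  # indices of detected outliers
```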
• Idea:
– Examples 𝒙 were drawn from an unknown probability distribution 𝑝(𝒙)
– Estimate the probability distribution 𝑝̂(𝒙) from the data
– If 𝑝̂(𝒙𝑖) is lower than a threshold, 𝒙𝑖 is an outlier
• Kernel density estimator: 𝑝̂(𝒙) = 1/(𝑁𝜎) ⋅ Σ𝑖 𝐾((𝒙 − 𝒙𝑖)/𝜎)
– 𝑁: number of examples
– 𝜎: smoothing parameter (“bandwidth”)
– 𝐾(⋅): non-negative kernel function that integrates to 1
• E.g., Gaussian, box, triangle, Epanechnikov
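A minimal sketch of KDE-based outlier detection in one dimension with a Gaussian kernel (data and threshold are made up for illustration):

```python
import numpy as np

def gaussian_kernel(u):
    """Non-negative kernel that integrates to 1 (here: standard Gaussian)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, samples, sigma):
    """Kernel density estimate p_hat(x) = 1/(N*sigma) * sum_i K((x - x_i) / sigma)."""
    return gaussian_kernel((x - samples) / sigma).sum() / (len(samples) * sigma)

samples = np.concatenate([np.random.normal(0, 1, 200), [8.0]])  # 8.0 is an outlier
sigma = 0.5
densities = np.array([kde(x, samples, sigma) for x in samples])

threshold = 0.01
print(samples[densities < threshold])   # low-density examples flagged as outliers (8.0 among them)
```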