
Machine Learning 2

WS 18/19
Dr. Benjamin Guthier
Professur für Bildverarbeitung



1. DATA TRANSFORMATIONS



Content of this Chapter

Modify input data to improve performance of machine learning:


• Attribute selection
– Remove redundant or irrelevant attributes
• Discretization
– Converting numeric attributes into categories
• Data projection
– Transform input vector space so that most of the information is
contained in the first vector components
• Oversampling
– Balance class distribution in data sets
• Outlier detection
– Remove anomalies from the data set

Machine Learning 2 – Dr. Benjamin Guthier 3 | 1. Data Transformations


Learning Goals
• After this chapter, you will be able to…

• Decide whether to use scheme-specific or scheme-independent attribute selection
• Calculate symmetric uncertainty between attributes
• Describe different methods of searching the space of
attribute subsets and speeding up evaluation

• Give examples for unsupervised discretization


• Explain the steps of entropy-based discretization
• Calculate entropy
• Outline the idea of using the MDL principle as stop criterion

Machine Learning 2 – Dr. Benjamin Guthier 4 | 1. Data Transformations


Learning Goals (2)

• Decide when to apply PCA


• Explain the steps of PCA
• Calculate a covariance matrix
• Use its eigenvectors to transform data

• Decide when to use over- and under-sampling


• Create new samples by linear interpolation
• Handle nominal attributes

• Build isolation forests for outlier detection


• Estimate probability densities with Gaussian kernels

Machine Learning 2 – Dr. Benjamin Guthier 5 | 1. Data Transformations


Recommended Reading

• I. Witten, E. Frank: Data Mining – Practical Machine Learning Tools and Techniques, Chapter 8 “Data Transformations”

Machine Learning 2 – Dr. Benjamin Guthier 6 | 1. Data Transformations


ATTRIBUTE SELECTION

Machine Learning 2 – Dr. Benjamin Guthier 7 | 1. Data Transformations


Motivation

• Data sets with many attributes often contain useless ones

• Redundant attributes
– Two attributes with highly correlated values
– Keeping both corresponds to giving that attribute a higher weight (usually undesirable)

• Irrelevant attributes
– Attribute is uncorrelated with the class label
– May erroneously appear “good” far down in a decision tree, where little data is left

Machine Learning 2 – Dr. Benjamin Guthier 8 | 1. Data Transformations


Motivation (2)

• In theory: many machine learning schemes already select the best attributes (e.g., using gain ratio)

• In practice: many advantages of additional pre-selection
– Reduced generalization error of the trained model
– Easier interpretation of results (e.g., smaller trees)

• Two general approaches:
– Scheme-independent selection (filter)
– Scheme-dependent selection (wrapper)

Machine Learning 2 – Dr. Benjamin Guthier 9 | 1. Data Transformations


Scheme-Independent Selection

• Simple approach: use a different machine learning scheme for the selection

• Build a decision tree over the data
– Prune the tree
– Only keep attributes that are used in the tree
– Assumes the final model to be built is not a decision tree

• Train a support vector machine
– Assigns a coefficient to each attribute
– Use these coefficients as scores for the attributes and filter
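As an illustration of these two filter approaches (not part of the original slides), here is a minimal scikit-learn sketch; the data X, the labels y, the pruning strength ccp_alpha and the number of attributes to keep are assumptions chosen for the example:

```python
# Hedged sketch of scheme-independent (filter) attribute selection.
# X (numeric attributes) and y (class labels) are assumed to exist.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

def select_by_tree(X, y, ccp_alpha=0.01):
    # Build a (cost-complexity pruned) decision tree and keep the attributes it uses.
    tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=0).fit(X, y)
    used = tree.tree_.feature[tree.tree_.feature >= 0]   # internal nodes only
    return sorted(set(used.tolist()))

def select_by_svm(X, y, keep=5):
    # Train a linear SVM and rank attributes by the magnitude of their coefficients.
    svm = LinearSVC(dual=False).fit(X, y)
    scores = np.abs(svm.coef_).sum(axis=0)
    return np.argsort(scores)[::-1][:keep].tolist()
```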

Machine Learning 2 – Dr. Benjamin Guthier 10 | 1. Data Transformations


Symmetric Uncertainty

• Alternative: measure the redundancy between attributes
⇒ Remove redundant attributes

• Symmetric uncertainty measures the redundancy between attributes 𝐴 and 𝐵:

𝑈(𝐴, 𝐵) = 2 ⋅ (𝐻(𝐴) + 𝐻(𝐵) − 𝐻(𝐴, 𝐵)) / (𝐻(𝐴) + 𝐻(𝐵))

• Entropy: 𝐻(𝐴) = − Σ𝑖 𝑃(𝐴 = 𝑎𝑖) ⋅ log 𝑃(𝐴 = 𝑎𝑖)
– 𝑃(𝐴 = 𝑎𝑖): can be counted from the data set
• 𝐻(𝐴, 𝐵): same formula using the joint probabilities 𝑃(𝐴 = 𝑎𝑖, 𝐵 = 𝑏𝑗)
– Also counted from the data set
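A minimal Python sketch (not from the slides) that estimates symmetric uncertainty by counting value frequencies; the toy attribute values are made up for illustration:

```python
# Symmetric uncertainty U(A, B) for two discrete attributes, estimated by counting.
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(a, b):
    h_a, h_b = entropy(a), entropy(b)
    h_ab = entropy(list(zip(a, b)))        # joint entropy H(A, B)
    if h_a + h_b == 0:
        return 0.0                         # both attributes are constant
    return 2 * (h_a + h_b - h_ab) / (h_a + h_b)

# Identical attributes are completely redundant:
print(symmetric_uncertainty(["x", "y", "x", "y"], ["x", "y", "x", "y"]))  # 1.0
```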

Machine Learning 2 – Dr. Benjamin Guthier 11 | 1. Data Transformations


Symmetric Uncertainty (2)

• Intuitively:
– 𝐻(𝐴) is the information provided by attribute 𝐴
– 𝐻(𝐵) is the information provided by attribute 𝐵
– 𝐻(𝐴, 𝐵) is the combined information (“𝐴 ∪ 𝐵”)
• Combined information = information of 𝐴 + information of 𝐵 − redundancy 𝑅:

𝐻(𝐴, 𝐵) = 𝐻(𝐴) + 𝐻(𝐵) − 𝑅,  or equivalently  𝑅 = 𝐻(𝐴) + 𝐻(𝐵) − 𝐻(𝐴, 𝐵)

[Figure: Venn diagram of 𝐻(𝐴) and 𝐻(𝐵) overlapping in the redundancy 𝑅]

Machine Learning 2 – Dr. Benjamin Guthier 12 | 1. Data Transformations


Symmetric Uncertainty (3)

Two examples:
1. The two attributes 𝐴 and 𝐵 are independent
 𝐻(𝐴, 𝐵) = 𝐻(𝐴) + 𝐻(𝐵)
 𝑈(𝐴, 𝐵) = 2 ⋅ (𝐻(𝐴) + 𝐻(𝐵) − 𝐻(𝐴, 𝐵)) / (𝐻(𝐴) + 𝐻(𝐵)) = 0 ⇒ no redundancy

2. The two attributes are identical
 𝐻(𝐴, 𝐵) = 𝐻(𝐴) = 𝐻(𝐵)
 𝑈(𝐴, 𝐵) = 2 ⋅ (𝐻(𝐴) + 𝐻(𝐵) − 𝐻(𝐴, 𝐵)) / (𝐻(𝐴) + 𝐻(𝐵)) = 2𝐻(𝐴) / 2𝐻(𝐴) = 1
 ⇒ completely redundant

Machine Learning 2 – Dr. Benjamin Guthier 13 | 1. Data Transformations


Scheme-Specific Selection

• Same situation as before:


– Data set with a set of attributes is given
– Find a subset of attributes that gives best performance
– Measure performance by training the target model

• Challenges:
– Explore the space of attribute subsets efficiently
• For 𝑛 attributes, there are 2^𝑛 possible subsets
– Training and evaluating the model over and over takes too long

Machine Learning 2 – Dr. Benjamin Guthier 14 | 1. Data Transformations


Searching the Attribute Space

• Forward selection: Start from an empty set and “greedily” add features

• At every stage:
– Try adding each remaining attribute
– Train the model and evaluate
– Only continue with the subset that performs best

• Stop when performance no longer improves

[Diagram: search lattice over attribute subsets {A}, {B}, {C} → {A,B}, {A,C}, {B,C} → {A,B,C}]
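A minimal sketch of greedy forward selection as a wrapper (not from the slides); the estimator, the data X, y and the cross-validation setup are assumptions:

```python
# Greedy forward selection: add the attribute that improves cross-validated
# performance the most, stop when no addition improves it further.
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(estimator, X, y, cv=5):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        scores = [(np.mean(cross_val_score(estimator, X[:, selected + [j]], y, cv=cv)), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best_score:      # performance no longer improves -> stop
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score
```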

Machine Learning 2 – Dr. Benjamin Guthier 15 | 1. Data Transformations


Searching the Attribute Space (2)

• Backward elimination: Same but backwards


– Start with the set of all attributes
– Successively remove one attribute and evaluate subset
– Continue from best subset
– Stop when performance no longer improves

• Other graph search algorithms can be used too


– Best-first search: Keep a sorted queue of best candidate subsets to
evaluate next instead of only keeping the best
– Beam search: Keep a list of 𝑛 best candidates at each stage

Machine Learning 2 – Dr. Benjamin Guthier 16 | 1. Data Transformations


Speeding Up the Evaluation

• Performance is evaluated using cross validation


– For 𝑘 attributes, we need 𝑂(𝑘^2) cross-validations  too many!

• Race search
– At each stage, evaluate all 𝑘 subsets in parallel
– Drop the ones early that fall behind in accuracy

• Pre-select attributes
– Rank all attributes using information gain before adding them
– Only evaluate subsets where high IG attributes have been added

Machine Learning 2 – Dr. Benjamin Guthier 17 | 1. Data Transformations


Attribute Selection – Conclusions

• Removing redundant or irrelevant attributes improves the performance of a machine learning scheme

• The metric “symmetric uncertainty” can be used to identify redundant attributes

• Scheme-specific selection greedily tests different attribute subsets
– Different subset search strategies and evaluation strategies exist

Machine Learning 2 – Dr. Benjamin Guthier 18 | 1. Data Transformations


DISCRETIZATION

• Fayyad, U., and K. Irani. “Multi-interval discretization of continuous-valued attributes for classification learning.” (1993).
– Recommended reading for entropy-based discretization

Machine Learning 2 – Dr. Benjamin Guthier 19 | 1. Data Transformations


Motivation

• Discretization converts numeric (continuous) attributes into


– Nominal categories (no ordering, no averages/median), or
– Ordinal data (categories with ordering and numeric interpretation)

• Many ML models only work with discrete attributes

• Ones that can handle numeric attributes have shortcomings

Machine Learning 2 – Dr. Benjamin Guthier 20 | 1. Data Transformations


Motivation (2)

• Naïve Bayes assumes normal distribution for numeric values


– Assumption may be invalid in realistic scenarios

• Decision trees with numeric attributes


– May run slower due to repeated sorting of examples
– May discretize on tree nodes with little data (susceptible to noise)
– Different local discretization depending on where it is performed

 ML models often benefit from prior discretization

Machine Learning 2 – Dr. Benjamin Guthier 21 | 1. Data Transformations


Unsupervised Discretization

• Unsupervised discretization is agnostic of class labels


– Only choice if class labels are unavailable (e.g., for clustering)
– May choose boundaries that split classes

• Good results if bins are small enough


– E.g., for 𝑁 total examples, use √𝑁 bins with √𝑁 examples each

• Two approaches
– Equal-interval binning
– Equal-frequency binning

Machine Learning 2 – Dr. Benjamin Guthier 22 | 1. Data Transformations


Equal-Interval Binning

• Determine maximum/minimum attribute value


• Decide on number of intervals
• Split into intervals of equal width
 Yields uneven amounts of examples per bin

[Figure: numeric attribute axis from min to max, divided into half-open intervals of equal width; the examples fall unevenly into the discrete intervals]

Machine Learning 2 – Dr. Benjamin Guthier 23 | 1. Data Transformations


Equal-Frequency Binning

• Decide on number of intervals (here 4)


• Calculate number of examples per bin (here 5)
• Split into intervals with equal number of examples
 Considers distribution of values
 Still agnostic to class labels

[Figure: examples along the numeric attribute axis, split into 4 intervals of 5 examples each]
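The two unsupervised binning schemes can be sketched with NumPy as follows (illustrative only; the attribute values and the number of bins are made up):

```python
import numpy as np

values = np.array([1.0, 1.2, 2.5, 3.1, 4.0, 4.2, 4.3, 7.9, 8.5, 9.0,
                   9.1, 9.3, 9.4, 9.6, 9.7, 9.8, 10.0, 10.2, 11.5, 12.0])
n_bins = 4

# Equal-interval binning: bins of equal width between min and max
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
bins_width = np.digitize(values, width_edges[1:-1])      # uneven bin counts

# Equal-frequency binning: bin edges at the quantiles of the data
freq_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
bins_freq = np.digitize(values, freq_edges[1:-1])        # ~5 examples per bin

print(np.bincount(bins_width), np.bincount(bins_freq))
```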

Machine Learning 2 – Dr. Benjamin Guthier 24 | 1. Data Transformations


Supervised Discretization

• Goal: Improve discretization by using known class labels


– Split intervals so that they help classify the examples

• General approach
– Try out different split points (thresholds)
– Evaluate usefulness of a split for classification. Choose best split
– Keep splitting sub-intervals recursively until stop criterion is met

Machine Learning 2 – Dr. Benjamin Guthier 25 | 1. Data Transformations


Possible Split Points

• For 𝑁 examples, there are 𝑁 − 1 possible split points


– Too many to try them all

• Only consider splits where the class label changes
– Ideal thresholds can be proven to lie on class boundaries

[Figure: examples of two classes along the numeric attribute axis; the potential split points are the boundaries where the class label changes]

Machine Learning 2 – Dr. Benjamin Guthier 26 | 1. Data Transformations


Evaluating Splits

• How to judge whether a chosen threshold is suitable?

• Counting the error (or accuracy) and choosing the threshold with the lowest error fails in some cases

[Figure: two classes (stars and circles) in the unit square of attributes 𝑎1 and 𝑎2, together with the optimal discretization boundaries]
Machine Learning 2 – Dr. Benjamin Guthier 27 | 1. Data Transformations
Evaluating Splits (2)

• In the example, choosing threshold for 𝑎2 using error works


– It will find 𝑎2 = 0.5 as split point: 𝑎2 ≤ 0.5 ⇒ star, 𝑎2 > 0.5 ⇒ circle
– Increasing it will misclassify more circles as stars and increase error
– Reducing it also increases error

• Any threshold for 𝑎1 in [1/3, 2/3] produces a very similar error
– Error does not decrease further when splitting 𝑎1 into 3 intervals
– However: ideal discretization requires the shown splits

 Error is not suited for judging the usefulness of a split!

Machine Learning 2 – Dr. Benjamin Guthier 28 | 1. Data Transformations


Entropy-Based Discretization

• To evaluate a split, we use the entropy before and after

• Entropy 𝐻 of a data set 𝑆 before the split:

H(𝑆) = − Σ𝑖 𝑝𝑖 ⋅ log2(𝑝𝑖),  𝑖 = 1, …, 𝑘

– 𝑝𝑖 : fraction of examples belonging to class 𝑖
– 𝑘: number of classes

• After the split, we have data sets 𝑆1 and 𝑆2. The entropy is now the weighted average:

R(𝑆1, 𝑆2) = (|𝑆1| / |𝑆|) ⋅ H(𝑆1) + (|𝑆2| / |𝑆|) ⋅ H(𝑆2)

Machine Learning 2 – Dr. Benjamin Guthier 29 | 1. Data Transformations


Entropy-Based Discretization (2)

• Choose the split with the largest Information Gain:

IG = H(𝑆) − R(𝑆1, 𝑆2)

• IG considers the class distribution


– Instead of just the majority class for error

• In the example, choosing thresholds near 𝑎1 = 1/3 or 𝑎1 = 2/3 gives the “cleanest” separation of classes
– These thresholds minimize entropy / maximize information gain

 Entropy can be used to find the optimal discretization
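A minimal sketch of entropy-based split evaluation for one numeric attribute (not from the slides; the tiny, already sorted data set and class names are made up):

```python
import numpy as np
from collections import Counter
from math import log2

def H(labels):                       # entropy of a set of class labels
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    # IG = H(S) - |S1|/|S| * H(S1) - |S2|/|S| * H(S2)
    n = len(labels)
    return H(labels) - len(left) / n * H(left) - len(right) / n * H(right)

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # assumed sorted
labels = ["star", "star", "star", "circle", "circle", "circle"]

# Candidate thresholds: midpoints where the class label changes
candidates = [(values[i] + values[i + 1]) / 2
              for i in range(len(values) - 1) if labels[i] != labels[i + 1]]
best = max(candidates, key=lambda t: information_gain(
    labels,
    [c for v, c in zip(values, labels) if v <= t],
    [c for v, c in zip(values, labels) if v > t]))
print(best)   # 3.5 -> perfect split, IG = H(S) = 1 bit
```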

Machine Learning 2 – Dr. Benjamin Guthier 30 | 1. Data Transformations


Stop Criterion

• Entropy decreases (𝐼𝐺 > 0), the more intervals are created
 We need a stop criterion

• Minimum Description Length (MDL) principle:


“If several hypotheses explain the same body of data, choose
the simplest one”

• “Simplest”: uses the lowest number of bits to represent the data

• In our context, the two hypotheses are:


a) Split again to obtain one more sub-interval
b) Stop splitting and keep the current discretization

Machine Learning 2 – Dr. Benjamin Guthier 31 | 1. Data Transformations


MDL Principle
• Questions for stop criterion with MDL are:
– How many bits are needed to encode class labels for every example?
– Does the next split decrease this number? If not, stop splitting

• Entropy = average number of bits per example needed

• Example: Data set has two classes with a 50:50 distribution


– We need one bit per example to encode its class (𝐻(𝑆) = 1)
• After perfect split: Two intervals with examples of only one
class in each
– Entropy is now 0
– After encoding the split, no encoding of class labels necessary
 Saved a lot of bits. Good split!

Machine Learning 2 – Dr. Benjamin Guthier 32 | 1. Data Transformations


MDL Principle (2)
• In general, each split...
– Reduces the average entropy = avg. number of bits to encode the examples
– Adds overhead for encoding the split (e.g., the threshold value)
⇒ Stop when the overhead outweighs the gain

• MDL for “no split”
– Need |𝑆| ⋅ 𝐻(𝑆) bits on avg. to encode the class label of every example
• Each label is 𝐻(𝑆) bits long (e.g., using Huffman coding)
– Need to store the code word for each class label: 𝑘 ⋅ 𝐻(𝑆) bits
• 𝑘 classes and each class label takes 𝐻(𝑆) bits
⇒ Encoding the class labels takes |𝑆| ⋅ 𝐻(𝑆) + 𝑘 ⋅ 𝐻(𝑆) bits

Machine Learning 2 – Dr. Benjamin Guthier 33 | 1. Data Transformations


Huffman Code – Example
• Create (optimal) binary encoding for class labels 𝐴, 𝐵 and 𝐶
– Class distribution in data set 𝑆 is 50% 𝐴, 25% 𝐵 and 25% 𝐶

• Entropy of 𝑆 is:
𝐻(𝑆) = −0.5 ⋅ log2(0.5) − 0.25 ⋅ log2(0.25) − 0.25 ⋅ log2(0.25)
= 0.5 ⋅ 1 + 0.25 ⋅ 2 + 0.25 ⋅ 2 = 1.5

• Optimal code words:

Class | Prob. | Code
𝐴     | 0.5   | 1
𝐵     | 0.25  | 00
𝐶     | 0.25  | 01

• Expected code length: 0.5 ⋅ 1 + 0.25 ⋅ 2 + 0.25 ⋅ 2 = 1.5 bits
• Same as the entropy!

Machine Learning 2 – Dr. Benjamin Guthier 34 | 1. Data Transformations


MDL Principle (3)

• MDL for “split data set”
– Encoding the split point: log2(|𝑆| − 1) bits
• The split is right before an example. Encode its index
– Encoding the class labels: |𝑆1| ⋅ 𝐻(𝑆1) + |𝑆2| ⋅ 𝐻(𝑆2) bits
• Each subset according to its entropy
– Code word for each class label: 𝑘1 ⋅ 𝐻(𝑆1) + 𝑘2 ⋅ 𝐻(𝑆2) bits
• After a good split, some classes no longer appear in 𝑆1 or 𝑆2
• 𝑘1 and 𝑘2 are the numbers of remaining class labels
– Encoding which subset of class labels occurs in each interval: log2(3^𝑘 − 2) bits
• Just believe it or look it up in the paper! 

⇒ Encoding the split and the class labels takes
log2(|𝑆| − 1) + |𝑆1| ⋅ 𝐻(𝑆1) + |𝑆2| ⋅ 𝐻(𝑆2) + 𝑘1 ⋅ 𝐻(𝑆1) + 𝑘2 ⋅ 𝐻(𝑆2) + log2(3^𝑘 − 2) bits

Machine Learning 2 – Dr. Benjamin Guthier 35 | 1. Data Transformations


Putting it all Together

• Perform the split if and only if:

log2(|𝑆| − 1) + |𝑆1| ⋅ 𝐻(𝑆1) + |𝑆2| ⋅ 𝐻(𝑆2) + 𝑘1 ⋅ 𝐻(𝑆1) + 𝑘2 ⋅ 𝐻(𝑆2) + log2(3^𝑘 − 2) < |𝑆| ⋅ 𝐻(𝑆) + 𝑘 ⋅ 𝐻(𝑆)

…some math later…

• Criterion for a split:

𝐼𝐺 > [ log2(3^𝑘 − 2) + log2(|𝑆| − 1) + 𝑘1 ⋅ 𝐻(𝑆1) + 𝑘2 ⋅ 𝐻(𝑆2) − 𝑘 ⋅ 𝐻(𝑆) ] / |𝑆|

– Stop otherwise
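A hedged sketch of this stop criterion in Python (not from the slides); the three label lists describe the interval before and after a candidate split:

```python
from collections import Counter
from math import log2

def H(labels):                       # class-label entropy, as defined earlier
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def mdl_accepts_split(labels, left, right):
    n, k = len(labels), len(set(labels))
    k1, k2 = len(set(left)), len(set(right))
    ig = H(labels) - len(left) / n * H(left) - len(right) / n * H(right)
    threshold = (log2(n - 1) + log2(3 ** k - 2)
                 + k1 * H(left) + k2 * H(right) - k * H(labels)) / n
    return ig > threshold   # split only if the gain pays for the encoding overhead
```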

Machine Learning 2 – Dr. Benjamin Guthier 36 | 1. Data Transformations


Putting it all Together (2)

• Lessons to take away here:


– Know how to calculate each term in the stop criterion
• E.g., how to calculate 𝑘1 ⋅ 𝐻(𝑆1) and 𝐼𝐺
– Understand the idea of the MDL principle and how it can be used to
formulate a stop criterion
– Don’t try to memorize the entire equation or its derivation!

Machine Learning 2 – Dr. Benjamin Guthier 37 | 1. Data Transformations


Discretization – Conclusions

• It is often better to discretize than to use numeric attributes directly

• Simple methods: equal-interval and equal-frequency binning

• Use entropy and information gain if class labels are known


– Try all split points, and evaluate their quality
– Split recursively until stop criterion is reached

• Stop at the “simplest” discretization using MDL

Machine Learning 2 – Dr. Benjamin Guthier 38 | 1. Data Transformations


DATA PROJECTION

Machine Learning 2 – Dr. Benjamin Guthier 39 | 1. Data Transformations


Motivation

• Mathematical operations on attributes to make them more suitable for machine learning

• Subtracting two dates (complex data) to get an age (simple)

• Clustering on numeric data to produce discrete cluster labels as attributes

• Reduce the number of dimensions to 2 or 3 for visualization

Machine Learning 2 – Dr. Benjamin Guthier 40 | 1. Data Transformations


Motivation – Oblique Decision Tree

• Problem:
– Nodes in decision trees split attribute space parallel to axes
• E.g., 𝑎1 > 0.4
– Best split may be oblique (not aligned to coordinate axes)

• Rotate the coordinate system
• New attributes 𝑎1′, 𝑎2′
• The threshold 𝑎1′ > 0.5 now splits the data perfectly

[Figure: two classes in the (𝑎1, 𝑎2) plane separated by an oblique boundary; the rotated axes 𝑎1′ and 𝑎2′ align with this boundary]
Machine Learning 2 – Dr. Benjamin Guthier 41 | 1. Data Transformations
Principal Component Analysis (PCA)

• Input data: dozens or hundreds of numeric attributes


– Runtime is cubic in the number of attributes

• Correspond to coordinate axes in high dimensional space

• Goal: Rotate the axes so that the direction of greatest variance of the data aligns with the axes
– Or rotate the data (same thing!)

• Interactive demo:
http://setosa.io/ev/principal-component-analysis/

Machine Learning 2 – Dr. Benjamin Guthier 42 | 1. Data Transformations


PCA – General Approach

• Intuitively:
– Calculate direction of greatest variance
– Use this direction as first coordinate axis
– Second axis must be orthogonal to first
• E.g., on a plane perpendicular to first axis in 3D
– Find direction of greatest variance under this constraint
– And so on…

• Implemented as:
– Calculate sample covariance matrix 𝚺
– New coordinate axes are eigenvectors of 𝚺

Machine Learning 2 – Dr. Benjamin Guthier 43 | 1. Data Transformations


Sample Mean / Variance

• An input attribute is a random variable 𝑋 (here: continuous)

• Sample mean of a uniformly distributed random variable:

x̄ = (1/𝑁) Σ𝑖 𝑥𝑖

– 𝑁: number of samples
– 𝑥𝑖 : value of 𝑋 for sample 𝑖
– May differ from the true (but often unknown) population mean

• Sample variance: 𝑠² = (1/(𝑁−1)) Σ𝑖 (𝑥𝑖 − x̄)²
– Uses the sample mean  underestimates the population variance
– The factor 1/(𝑁−1) corrects for this

Machine Learning 2 – Dr. Benjamin Guthier 44 | 1. Data Transformations


Sample Covariance

• Sample covariance for two attributes 𝑋𝑗 and 𝑋𝑘:

𝑠𝑗𝑘 = (1/(𝑁−1)) Σ𝑖 (𝑥𝑖𝑗 − x̄𝑗) ⋅ (𝑥𝑖𝑘 − x̄𝑘)

– 𝑥𝑖𝑗 and 𝑥𝑖𝑘 : values of 𝑋𝑗 and 𝑋𝑘 for sample 𝑖
– x̄𝑗 and x̄𝑘 : sample means of 𝑋𝑗 and 𝑋𝑘

• Observations:
– Can be computed for any two attributes 𝑗 and 𝑘
– 𝑠𝑘𝑘 is the sample variance of 𝑋𝑘
– Covariance is un-normalized correlation
– Independent random variables have 0 covariance
• Note: 0 covariance does not imply independence

Machine Learning 2 – Dr. Benjamin Guthier 45 | 1. Data Transformations


Sample Covariance Matrix

• Now consider 𝐾 attributes (𝐾 random variables)

• The sample covariance matrix 𝚺 is a 𝐾 × 𝐾 matrix with coefficient 𝑠𝑗𝑘 in row 𝑗 and column 𝑘

• Alternative notation:
– Normalize attribute values first (zero mean): x′𝑖𝑗 = 𝑥𝑖𝑗 − x̄𝑗
– Create a vector for (normalized) sample 𝑖: 𝒙′𝒊 = (x′𝑖1, …, x′𝑖𝐾)^T

𝚺 = (1/(𝑁−1)) Σ𝑖 𝒙′𝒊 ⋅ (𝒙′𝒊)^T

Machine Learning 2 – Dr. Benjamin Guthier 46 | 1. Data Transformations


Sample Covariance Matrix (2)

• Can be written (and computed) in matrix form

• Organize the (normalized) samples as an 𝑁 × 𝐾 matrix 𝑴 whose rows are the transposed sample vectors (𝒙′𝟏)^T, …, (𝒙′𝑵)^T
– Rows of 𝑴 contain all normalized attributes for one sample

• The sample covariance matrix is now defined as:

𝚺 = (1/(𝑁−1)) 𝑴^T 𝑴
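A quick NumPy check of this matrix form (the data matrix is made up; np.cov serves as a reference implementation):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))   # N = 100 samples, K = 3 attributes
M = X - X.mean(axis=0)                               # zero-mean columns
Sigma = M.T @ M / (len(M) - 1)                       # sample covariance matrix
print(np.allclose(Sigma, np.cov(X, rowvar=False)))   # True
```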

Machine Learning 2 – Dr. Benjamin Guthier 47 | 1. Data Transformations


Eigenvectors and Eigenvalues
• Intuitively:
– A square matrix 𝑨 ∈ ℝ^(𝐾×𝐾) is a linear transformation from ℝ^𝐾 to ℝ^𝐾
– Multiplying vector 𝒗 with 𝑨 transforms it into a new vector 𝑨𝒗 = 𝒗′
– If 𝒗 and 𝒗′ point into the same direction, 𝒗 is an eigenvector of 𝑨

• Mathematically:
– 𝒗 being an eigenvector of 𝑨 means 𝑨𝒗 = 𝜆𝒗
– 𝜆 is the eigenvalue
– All multiples of 𝒗 are eigenvectors and form an eigenspace
• For almost any 𝒙 ∈ ℝ^𝐾, 𝑨𝒙 is closer to the eigenspace of the largest eigenvalue than 𝒙 was
• Can be used to calculate eigenvectors (keep transforming points by 𝑨)

• Interactive demo:
http://setosa.io/ev/eigenvectors-and-eigenvalues/

Machine Learning 2 – Dr. Benjamin Guthier 48 | 1. Data Transformations


Eigendecomposition

• “Symmetric positive semi-definite matrices can be decomposed into an orthogonal matrix of eigenvectors and a diagonal matrix of eigenvalues”
– Covariance matrices 𝚺 are symmetric: 𝑠𝑗𝑘 = 𝑠𝑘𝑗
– …and positive semi-definite: 𝒙^T 𝚺 𝒙 ≥ 0 for all 𝒙 ∈ ℝ^𝐾 (proof omitted)

• Eigendecomposition: 𝚺 = 𝑸𝚲𝑸^T
– 𝑸 ∈ ℝ^(𝐾×𝐾) has the eigenvectors 𝒗1, …, 𝒗𝐾 as columns
– 𝑸 is orthogonal: 𝑸^(−1) = 𝑸^T
– 𝚲 is a diagonal matrix with the eigenvalues 𝜆1, …, 𝜆𝐾 on the diagonal

Machine Learning 2 – Dr. Benjamin Guthier 49 | 1. Data Transformations


Eigendecomposition (2)

• 𝑸 = (𝒗1 𝒗2 … 𝒗𝐾),  𝚲 = diag(𝜆1, …, 𝜆𝐾)
• 𝒗1 is an eigenvector of 𝚺, so 𝚺𝒗1 = 𝜆1𝒗1
• Do this with all eigenvectors at once: 𝚺𝑸 = 𝑸𝚲
– 𝑸𝚲 = (𝜆1𝒗1 𝜆2𝒗2 … 𝜆𝐾𝒗𝐾): matrix of scaled eigenvectors
• Since 𝑸 is orthogonal: 𝚺𝑸 = 𝑸𝚲 ⟺ 𝚺 = 𝑸𝚲𝑸^T

• 𝑸 and 𝚲 are computed using the QR algorithm


– Implemented in many linear algebra libraries
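In NumPy, a symmetric covariance matrix can be decomposed with np.linalg.eigh (which relies on LAPACK routines rather than literally the QR algorithm named above); the matrix below is a made-up example:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
eigenvalues, Q = np.linalg.eigh(Sigma)        # columns of Q are eigenvectors
Lambda = np.diag(eigenvalues)
print(np.allclose(Sigma, Q @ Lambda @ Q.T))   # True: Sigma = Q Λ Q^T
```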

Machine Learning 2 – Dr. Benjamin Guthier 50 | 1. Data Transformations


Change of Basis

• Combining the computation of 𝚺 and its eigendecomposition (ignoring the normalization term 1/(𝑁−1) for now):

𝑴^T 𝑴 = 𝚺 = 𝑸𝚲𝑸^T

• Right/left multiplying with 𝑸 and 𝑸^T, respectively:

𝚲 = 𝑸^T 𝑴^T 𝑴 𝑸 = (𝑴𝑸)^T (𝑴𝑸)

– 𝑴𝑸 means transforming the examples with the matrix 𝑸
– Equivalent to a change of basis. Columns of 𝑴𝑸 are the attribute values in the new coordinate system

⇒ In the new system, the covariance matrix is diagonal (𝚲)


– Axes are uncorrelated
– Variance of each axis is the eigenvalue

Machine Learning 2 – Dr. Benjamin Guthier 51 | 1. Data Transformations


Dimensionality Reduction

• Sort columns of 𝑸 and 𝚲 by decreasing eigenvalue


– I.e., decreasing variance

• New attributes (columns) of 𝑴𝑸 are now sorted by variance

• Reduce dimensionality
– Decide how many attributes 𝐿 < 𝐾 to keep
– Truncate 𝑸 to 𝐾 × 𝐿 matrix 𝑸𝐿 of the first 𝐿 eigenvectors
– 𝑁 × 𝐿 matrix 𝑴𝑸𝐿 contains transformed examples with 𝐿 attributes

Machine Learning 2 – Dr. Benjamin Guthier 52 | 1. Data Transformations


Dimensionality Reduction (2)
• How many components to choose?
• Plot the variance versus the component number
• Pick a desired cumulative variance, or choose the “elbow” of the curve

Axis | Variance | Cumulative
1    | 61.2%    | 61.2%
2    | 18.0%    | 79.2%
3    | 4.7%     | 83.9%
4    | 4.0%     | 87.9%
5    | 3.2%     | 91.1%
6    | 2.9%     | 94.0%
7    | 2.0%     | 96.0%
8    | 1.7%     | 97.7%
9    | 1.4%     | 99.1%
10   | 0.9%     | 100.0%

[Figure: percentage of variance (0–70%) plotted against component number 1–10]

Machine Learning 2 – Dr. Benjamin Guthier 53 | 1. Data Transformations


Principal Component Analysis

1. Normalize attributes (columns of 𝑴)


– Subtract the sample mean of each attribute
– Divide by the standard deviation of each attribute
• Unless differences in variance are meaningful
2. Calculate the covariance matrix 𝚺 = (1/(𝑁−1)) 𝑴^T 𝑴
3. Calculate the eigenvectors and eigenvalues of 𝚺 = 𝑸𝚲𝑸^T
4. Sort 𝑸 and 𝚲 by decreasing eigenvalues
5. Choose number 𝐿 of new attributes (principal components)
6. Transform examples 𝑴 into 𝑴𝑸𝐿
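A minimal NumPy sketch of these six steps (not from the slides; the data matrix and the cumulative-variance target are assumptions):

```python
import numpy as np

def pca(X, cumulative_variance=0.9, standardize=True):
    M = X - X.mean(axis=0)                     # 1. subtract the sample mean
    if standardize:
        M = M / M.std(axis=0, ddof=1)          #    divide by the standard deviation
    Sigma = M.T @ M / (len(M) - 1)             # 2. covariance matrix
    eigenvalues, Q = np.linalg.eigh(Sigma)     # 3. eigenvectors and eigenvalues
    order = np.argsort(eigenvalues)[::-1]      # 4. sort by decreasing eigenvalue
    eigenvalues, Q = eigenvalues[order], Q[:, order]
    explained = np.cumsum(eigenvalues) / eigenvalues.sum()
    L = int(np.searchsorted(explained, cumulative_variance)) + 1   # 5. choose L
    return M @ Q[:, :L], eigenvalues           # 6. transform the examples: M Q_L

X = np.random.default_rng(1).normal(size=(200, 10))
X[:, 0] += 3 * X[:, 1]                         # introduce correlated attributes
projected, eigvals = pca(X)
print(projected.shape)                         # (200, L)
```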

Machine Learning 2 – Dr. Benjamin Guthier 54 | 1. Data Transformations


OVERSAMPLING

• Chawla, Nitesh V., et al. “SMOTE: Synthetic minority over-sampling technique.” Journal of Artificial Intelligence Research 16 (2002): 321-357.

Machine Learning 2 – Dr. Benjamin Guthier 55 | 1. Data Transformations


Motivation

• Data sets often contain more “normal” than “irregular” examples

• Irregular ones are often the ones to be predicted (positives)


– E.g., faults in a machine, suspicious activity, rare disease diagnosis

• Imbalance of 100 to 1 or more is possible

• False negatives usually worse than false positives, but…


– Classifier that optimizes accuracy is biased towards predicting the
majority class (boring case)

Machine Learning 2 – Dr. Benjamin Guthier 56 | 1. Data Transformations


Straightforward Solutions

• Under-sample the majority class


– Randomly discard a percentage of the negative examples
 Greatly reduces data set size
 Still only a small number of positive examples

• Increase the weight of minority class examples


– Equivalent to duplicating minority examples
 Classifiers fit these examples more closely
 Overfitting in highly imbalanced data

Machine Learning 2 – Dr. Benjamin Guthier 57 | 1. Data Transformations


Increasing Weight  Overfitting
[Figure: majority- and minority-class examples in the plane of Attribute 1 and Attribute 2; the decision boundary learned with increased minority weight overfits the few minority examples, while the optimal decision boundary also covers unseen minority examples from the test set]
Machine Learning 2 – Dr. Benjamin Guthier 58 | 1. Data Transformations
Synthetic Minority Over-Sampling Technique (SMOTE)

• Under-sample majority and over-sample minority class


– Under-sampling by discarding random examples
– Oversampling by linear interpolation between positive examples

• Find the 𝑘 nearest neighbors of a random minority sample 𝑥0
– E.g., 𝑘 = 3
• Pick a random nearest neighbor 𝑥1
• Pick a random 𝛼 ∈ [0, 1]
• Calculate 𝑥′ = 𝑥0 + 𝛼 ⋅ (𝑥1 − 𝑥0)
• Add 𝑥′ as a new synthetic example of the minority class

[Figure: 𝑥′ lies on the line segment between 𝑥0 and 𝑥1, offset from 𝑥0 by 𝛼 ⋅ (𝑥1 − 𝑥0)]
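A minimal sketch of SMOTE-style interpolation for continuous attributes (not the reference implementation from the paper); X_min holds only minority-class examples, and k and n_new are illustrative choices:

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=np.random.default_rng(0)):
    synthetic = []
    for _ in range(n_new):
        x0 = X_min[rng.integers(len(X_min))]
        # k nearest neighbours of x0 (excluding x0 itself)
        dists = np.linalg.norm(X_min - x0, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        x1 = X_min[rng.choice(neighbours)]
        alpha = rng.uniform(0, 1)
        synthetic.append(x0 + alpha * (x1 - x0))   # x' = x0 + alpha * (x1 - x0)
    return np.array(synthetic)

X_min = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.1]])
print(smote(X_min, n_new=8).shape)   # (8, 2)
```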

Machine Learning 2 – Dr. Benjamin Guthier 59 | 1. Data Transformations


Over- And Under-Sampling – Example

• Calculate the factor of over-sampling, e.g.
– The data set has an imbalance of 12:1 (negative : positive)
– Desired: 1:3 (3 times as many positives as negatives)
– The over-sampling factor is: (12/1) / (1/3) = 36

• Split the factor evenly between over- and under-sampling
– 36 = √36 ⋅ √36 = 6 ⋅ 6

⇒ Over-sample the minority class by a factor of 6
⇒ Randomly discard 5 out of 6 examples of the majority class

Machine Learning 2 – Dr. Benjamin Guthier 60 | 1. Data Transformations


Partly Nominal Attributes

• How to include nominal attributes in distance calculations?

• Calculate the standard deviation of each continuous attribute individually. Pick the median of all standard deviations
– “How different are continuous features typically?”

• During the nearest neighbor calculation:
– If nominal attributes differ, add the median std. deviation as distance
– E.g., the std. deviation of temperature is 10 degrees, so the difference between “rainy” and “sunny” is also 10.

• During interpolation: Assign the nominal value according to the majority among the 𝑘 nearest neighbors

Machine Learning 2 – Dr. Benjamin Guthier 61 | 1. Data Transformations


Oversampling – Conclusions

• Imbalanced data sets may be undesirable

• Balance the data by a mixture of over- and under-sampling
– By linear interpolation and random discarding, respectively

• Use median of standard deviations for nominal attributes


– More distance metrics for nominal attributes in the paper

Machine Learning 2 – Dr. Benjamin Guthier 62 | 1. Data Transformations


OUTLIER DETECTION

Machine Learning 2 – Dr. Benjamin Guthier 63 | 1. Data Transformations


Motivation

• Large data sets often contain noise (outliers)


– Decrease performance of trained models

• Automatic filtering of outliers can help, but…


– …it fixes the effects, not the cause of bad data
– Get the data right in the first place!
– Use automatic filtering as starting point for manual inspection

• Simple approach:
– Train multiple different models
– Training examples misclassified by a model indicate an outlier
– Count how often an example is misclassified and discard it if above a threshold

Machine Learning 2 – Dr. Benjamin Guthier 64 | 1. Data Transformations


Attribute Noise vs. Bad Labels

• Errors in attribute values (noise)


– Inherent property of real data
– Can be beneficial to training if real data contains the same noise
– Model can learn reliability of each attribute

• Errors in class labels


– Create atypical values in all attributes at once
– Try to identify and fix bad class labels before training

Machine Learning 2 – Dr. Benjamin Guthier 65 | 1. Data Transformations


One Class Learning

• Distinguish a target class from unknown examples

• Extreme case of imbalanced data


– No “irregular” data available
– Detecting very rare (potentially catastrophic) faults
– Access patterns to a network where potential attacks are unknown

• Example method: One-Class Support Vector Machine (OSVM)
– Implemented in machine learning toolkits (e.g., scikit-learn)
– Details omitted here, see for example:
Schölkopf, Bernhard, et al. “Support vector method for novelty detection.” Advances in Neural Information Processing Systems. 2000.
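A short usage sketch of scikit-learn's OneClassSVM (the training data and the nu/gamma values are made-up assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(200, 2))        # only "normal" (target class) examples

ocsvm = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_normal)
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])
print(ocsvm.predict(X_test))                # +1 = target class, -1 = outlier
```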

Machine Learning 2 – Dr. Benjamin Guthier 66 | 1. Data Transformations


Isolation Forest
• Random forest for outlier detection

• At each node of each decision tree:


– Choose a random attribute and a random threshold
– Split the set of examples
– Stop when there is only one example left
– Create multiple such trees

• Observation now:
– Inliers form dense clusters in attribute space
– Many splits required to isolate them
– Outliers are isolated after only a few splits (further up in the tree)

 Examples with average path length below a threshold are outliers
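A short usage sketch with scikit-learn's IsolationForest (the data and the contamination parameter are made-up assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)),    # dense cluster of inliers
               [[8.0, 8.0]]])                # one obvious outlier

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X)
print(forest.predict(X)[-1])                 # -1: flagged as an outlier
print(forest.score_samples(X)[-1])           # score derived from the average path length
```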

Machine Learning 2 – Dr. Benjamin Guthier 67 | 1. Data Transformations


Isolation Forest – 1D Example
[Figure: examples on a 1D attribute axis with one outlier far from the dense cluster; a random split between the cluster and the outlier has a high probability of isolating the outlier immediately]

Machine Learning 2 – Dr. Benjamin Guthier 68 | 1. Data Transformations


Kernel Density Estimation

• Idea:
– The examples 𝒙 were drawn from an unknown probability distribution 𝑝(𝒙)
– Estimate a probability distribution p̂(𝒙) from the data
– If p̂(𝒙𝑖) is lower than a threshold ⇒ 𝒙𝑖 is an outlier

• Kernel density estimator: p̂(𝒙) = (1/(𝑁𝜎)) Σ𝑖 𝐾((𝒙 − 𝒙𝑖) / 𝜎)
– 𝑁: number of examples
– 𝜎: smoothing parameter (“bandwidth”)
– 𝐾(⋅): non-negative kernel function that integrates to 1
• E.g., Gaussian, box, triangle, Epanechnikov
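A minimal 1D Gaussian kernel density estimator following the formula above (the data and the bandwidth are made up):

```python
import numpy as np

def kde(x, samples, sigma=0.5):
    # Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi), averaged over all samples
    u = (x - samples[:, None]) / sigma
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return k.sum(axis=0) / (len(samples) * sigma)

samples = np.array([1.0, 1.2, 1.4, 2.0, 2.1, 2.3, 8.0])   # 8.0 is an outlier
print(kde(np.array([1.5, 8.0]), samples))   # density at 8.0 is clearly lower
```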

Machine Learning 2 – Dr. Benjamin Guthier 69 | 1. Data Transformations


Kernel Density Estimation (2)

[Left figure (source: Wikipedia.org)]
• Each example defines a Gaussian kernel (red)
• Averaging yields the estimate p̂(𝒙) (blue)

[Right figure (source: Wikipedia.org)]
• Gray: true distribution 𝑝(𝒙)
• Red: too little smoothing
• Green: too much smoothing
• Black: good amount

Machine Learning 2 – Dr. Benjamin Guthier 70 | 1. Data Transformations


Outlier Detection – Conclusions

• It is always better to fix the data than to filter out outliers afterwards
– Automatic methods can help to identify problems

• One-class classifiers detect outliers when only “normal” examples are available for training

• Isolation forests calculate an anomaly score based on the average path length in their decision trees

• Kernel density estimation estimates the probability of an example being “normal”

Machine Learning 2 – Dr. Benjamin Guthier 71 | 1. Data Transformations
