ML2 1 Data Transformations
WS 18/19
Dr. Benjamin Guthier
Professur für Bildverarbeitung
• Redundant attributes
– Two attributes with highly correlated values
– Corresponds to giving the attribute a higher weight (probably undesirable)
• Irrelevant attributes
– Attribute is uncorrelated with class label
– May erroneously appear “good” far down in a decision tree (few examples remain there, so chance correlations look informative)
• Intuitively:
– 𝐻(𝐴) is the information provided by attribute 𝐴
– 𝐻(𝐵) is the information provided by attribute 𝐵
– 𝐻(𝐴, 𝐵) is the combined information (“𝐴 ∪ 𝐵”)
• Information of 𝐴 + Information of 𝐵 − Redundancy:
𝐻(𝐴, 𝐵) = 𝐻(𝐴) + 𝐻(𝐵) − 𝑅   or   𝑅 = 𝐻(𝐴) + 𝐻(𝐵) − 𝐻(𝐴, 𝐵)
(Figure: Venn diagram of 𝐻(𝐴) and 𝐻(𝐵) with overlap 𝑅)
Two examples:
1. Two attributes 𝐴 and 𝐵 are independent
𝐻(𝐴, 𝐵) = 𝐻(𝐴) + 𝐻(𝐵)
𝑈(𝐴, 𝐵) = 2 ⋅ (𝐻(𝐴) + 𝐻(𝐵) − 𝐻(𝐴, 𝐵)) / (𝐻(𝐴) + 𝐻(𝐵)) = 0 → no redundancy
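A minimal sketch of computing these quantities from data; the normalized measure 𝑈(𝐴, 𝐵) above is often called symmetric uncertainty, and the helper names below are illustrative:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy H of a discrete attribute, in bits."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(a, b):
    """Joint entropy H(A, B): entropy of the value pairs."""
    return entropy(list(zip(a, b)))

def symmetric_uncertainty(a, b):
    """U(A, B) = 2 * (H(A) + H(B) - H(A, B)) / (H(A) + H(B))."""
    h_a, h_b = entropy(a), entropy(b)
    return 2 * (h_a + h_b - joint_entropy(a, b)) / (h_a + h_b)

# Independent attributes -> U close to 0 (no redundancy);
# identical attributes   -> U = 1 (fully redundant)
a = np.random.randint(0, 2, size=10_000)
b = np.random.randint(0, 2, size=10_000)
print(symmetric_uncertainty(a, b))  # approximately 0
print(symmetric_uncertainty(a, a))  # exactly 1
```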
• Challenges:
– Explore the space of attribute subsets efficiently
• For 𝑛 attributes, there are 2ⁿ possible subsets
– Training and evaluating the model over and over takes too long
• At every stage:
– Try adding each remaining attribute
– Train model and evaluate
– Only continue with the subset that performs best (see the sketch below)
(Figure: search tree over attribute subsets A, B, C → A,B, A,C, B,C)
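A minimal sketch of this greedy forward selection, assuming a hypothetical evaluate(subset) function that trains and scores the model on a given attribute subset:

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection over attribute subsets.

    `evaluate(subset)` is assumed to train the model on the given attribute
    subset and return its (e.g. cross-validated) accuracy."""
    selected, best_score = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        # Try adding each remaining attribute to the current subset
        score, attr = max(((evaluate(selected + [a]), a) for a in remaining),
                          key=lambda t: t[0])
        if score <= best_score:
            break                    # no remaining attribute improves accuracy
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score
```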
• Race search
– At each stage, evaluate all 𝑘 subsets in parallel
– Drop the ones early that fall behind in accuracy
• Pre-select attributes
– Rank all attributes using information gain before adding them
– Only evaluate subsets where high IG attributes have been added
• Two approaches
– Equal-interval binning
– Equal-frequency binning
(Figure: value range of a numeric attribute from min to max, split into half-open discretization intervals; one discrete interval marked as an example)
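A sketch of both binning schemes using NumPy (the data and the number of bins are made up for illustration):

```python
import numpy as np

values = np.array([1.2, 1.5, 2.0, 2.1, 3.7, 4.0, 8.5, 9.0, 9.1, 20.0])
k = 4   # number of bins

# Equal-interval binning: split [min, max] into k intervals of equal width
width = (values.max() - values.min()) / k
interval_edges = values.min() + width * np.arange(1, k)

# Equal-frequency binning: edges chosen so each bin holds ~the same number of examples
frequency_edges = np.quantile(values, [i / k for i in range(1, k)])

# Assign each value to a (half-open) bin
print(np.digitize(values, interval_edges))    # skewed: most values land in the first bin
print(np.digitize(values, frequency_edges))   # balanced: ~N/k values per bin
```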
• General approach
– Try out different split points (thresholds)
– Evaluate usefulness of a split for classification. Choose best split
– Keep splitting sub-intervals recursively until stop criterion is met
(Figure: example data of two classes plotted over attribute 𝑎1 ∈ [0, 1])
Evaluating Splits (2)
• Any threshold for 𝑎1 in [1/3, 2/3] produces very similar error
– Error does not decrease further when splitting 𝑎1 into 3 intervals
– However: ideal discretization requires the shown splits
• In the example, choosing thresholds near 𝑎1 = 1/3 or 𝑎1 = 2/3 gives the “cleanest” separation of classes
– These thresholds minimize entropy / maximize information gain
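A sketch of searching for the entropy-minimizing threshold over one numeric attribute (helper names are illustrative; a full discretizer would apply this recursively to the resulting sub-intervals until a stop criterion is met):

```python
import numpy as np

def entropy(labels):
    """Class entropy in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Try midpoints between consecutive sorted attribute values as thresholds
    and return the one minimizing the weighted entropy of the two sub-intervals."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_h = None, np.inf
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2      # candidate threshold
        left, right = labels[:i], labels[i:]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h

values = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.9])
labels = np.array(["A", "A", "A", "B", "B", "B", "A"])
print(best_split(values, labels))   # threshold 0.375, weighted entropy ~0.46
```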
• Entropy keeps decreasing (𝐼𝐺 > 0) as more intervals are created
⇒ We need a stop criterion
• Entropy of 𝑆 is:
𝐻(𝑆) = −0.5 ⋅ log₂(0.5) − 0.25 ⋅ log₂(0.25) − 0.25 ⋅ log₂(0.25)
= 0.5 ⋅ 1 + 0.25 ⋅ 2 + 0.25 ⋅ 2 = 1.5
• Problem:
– Nodes in decision trees split attribute space parallel to axes
• E.g., 𝑎1 > 0.4
– Best split may be oblique (not aligned to coordinate axes)
• Interactive demo:
http://setosa.io/ev/principal-component-analysis/
• Intuitively:
– Calculate direction of greatest variance
– Use this direction as first coordinate axis
– Second axis must be orthogonal to first
• E.g., on a plane perpendicular to first axis in 3D
– Find direction of greatest variance under this constraint
– And so on…
• Implemented as:
– Calculate sample covariance matrix 𝚺
– New coordinate axes are eigenvectors of 𝚺
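A minimal sketch of these two steps with NumPy (synthetic data for illustration):

```python
import numpy as np

# Synthetic data: 200 examples with 3 correlated attributes
X = np.random.randn(200, 3) @ np.array([[3.0, 0.0, 0.0],
                                        [1.0, 1.0, 0.0],
                                        [0.0, 0.0, 0.1]])
X = X - X.mean(axis=0)                      # center the data

# Sample covariance matrix (rows = examples, columns = attributes)
cov = np.cov(X, rowvar=False)

# Eigenvectors of the covariance matrix are the new coordinate axes.
# eigh returns eigenvalues in ascending order, so reverse to get the
# direction of greatest variance first.
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
print(eigvals)         # variance along each new axis
print(eigvecs[:, 0])   # first axis: direction of greatest variance
```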
• Sample variance: 𝑠 = 1/(𝑁 − 1) ⋅ Σ𝑖 (𝑥𝑖 − 𝑥̄)²
– Using the sample mean underestimates the population variance
– The term 1/(𝑁 − 1) corrects for this
• Observations:
– Can be computed for any two attributes 𝑗 and 𝑘
– 𝑠𝑘𝑘 is the sample variance of 𝑋𝑘
– Covariance is un-normalized correlation
– Independent random variables have 0 covariance
• Note: 0 covariance does not imply independence
• Alternative notation:
– Normalize attribute values first (zero mean): 𝑥′𝑖𝑗 = 𝑥𝑖𝑗 − 𝑥̄𝑗
– Create vector for (normalized) sample 𝑖: 𝒙′𝒊 = (𝑥′𝑖1, …, 𝑥′𝑖𝐾)ᵀ
– 𝚺 = 1/(𝑁 − 1) ⋅ Σ𝑖 𝒙′𝒊 ⋅ 𝒙′𝒊ᵀ
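A sketch showing that this outer-product formulation matches NumPy's sample covariance (random data for illustration):

```python
import numpy as np

X = np.random.randn(100, 4)        # N = 100 examples, K = 4 attributes
X_prime = X - X.mean(axis=0)       # zero-mean samples x'_i (one per row)
N = X_prime.shape[0]

# Sum of outer products x'_i x'_i^T, divided by N - 1
sigma = sum(np.outer(x, x) for x in X_prime) / (N - 1)

# Matches the usual sample covariance matrix
print(np.allclose(sigma, np.cov(X, rowvar=False)))   # True
```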
• Mathematically:
– 𝒗 being an eigenvector of 𝑨 means 𝑨𝒗 = 𝜆𝒗
– 𝜆 is the eigenvalue
– All multiples of 𝒗 are eigenvectors and form an eigenspace
• 𝑨𝒙 is closer to the eigenspace of the largest-magnitude eigenvalue than 𝒙 was, for almost any 𝒙 ∈ ℝ𝐾
• Can be used to calculate eigenvectors: keep transforming points by 𝑨 (power iteration, see the sketch below)
• Interactive demo:
http://setosa.io/ev/eigenvectors-and-eigenvalues/
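A minimal power-iteration sketch of repeatedly transforming a point by 𝑨 (illustrative 2×2 matrix):

```python
import numpy as np

def power_iteration(A, steps=100):
    """Repeatedly transform a vector by A; it converges towards the
    eigenspace of the eigenvalue with the largest magnitude."""
    x = np.random.randn(A.shape[0])
    for _ in range(steps):
        x = A @ x
        x = x / np.linalg.norm(x)   # re-normalize to avoid overflow
    eigval = x @ A @ x              # Rayleigh quotient estimate of lambda
    return eigval, x

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(A)
print(lam, v)             # dominant eigenvalue and eigenvector
print(A @ v, lam * v)     # A v is (approximately) lambda v
```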
• Eigendecomposition: 𝚺 = 𝑸𝚲𝑸𝑇
– 𝑸 ∈ ℝ𝐾×𝐾 has eigenvectors 𝒗1 , … , 𝒗𝐾 as columns
– 𝑸 is orthogonal: 𝑸−1 = 𝑸𝑇
– 𝚲 is a diagonal matrix with eigenvalues 𝜆1 , … , 𝜆𝐾 on the diagonal
• 𝑸 = (𝒗1 𝒗2 … 𝒗𝐾),  𝚲 = diag(𝜆1, …, 𝜆𝐾)
• 𝒗1 is an eigenvector of 𝚺, so 𝚺𝒗1 = 𝒗1 𝜆1
• Do this with all eigenvectors at once: 𝚺𝑸 = 𝑸𝚲
– 𝑸𝚲 = (𝜆1 𝒗1 𝜆2 𝒗2 … 𝜆𝐾 𝒗𝐾 ): Matrix of scaled eigenvectors
• Since 𝑸 is orthogonal: 𝜮𝑸 = 𝑸𝚲 ⟺ 𝚺 = 𝑸𝚲𝑸𝑇
• Reduce dimensionality
– Decide how many attributes 𝐿 < 𝐾 to keep
– Truncate 𝑸 to 𝐾 × 𝐿 matrix 𝑸𝐿 of the first 𝐿 eigenvectors
– 𝑁 × 𝐿 matrix 𝑴𝑸𝐿 contains transformed examples with 𝐿 attributes
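A sketch of the truncation step, assuming 𝑴 is the 𝑁 × 𝐾 matrix of (zero-mean) examples and the eigenvectors are sorted by decreasing eigenvalue:

```python
import numpy as np

# M: N x K matrix of zero-mean examples (synthetic data here)
N, K, L = 500, 10, 3
M = np.random.randn(N, K)
M = M - M.mean(axis=0)

# Eigendecomposition of the sample covariance matrix, largest eigenvalue first
eigvals, Q = np.linalg.eigh(np.cov(M, rowvar=False))
eigvals, Q = eigvals[::-1], Q[:, ::-1]

Q_L = Q[:, :L]          # K x L: keep only the first L eigenvectors
M_reduced = M @ Q_L     # N x L: transformed examples with L attributes
print(M_reduced.shape)  # (500, 3)
```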
• Percentage of variance per principal component:

Component   Variance   Cumulative
    3          4.7%       83.9%
    4          4.0%       87.9%
    5          3.2%       91.1%
    6          2.9%       94.0%
    7          2.0%       96.0%
    8          1.7%       97.7%
    9          1.4%       99.1%
   10          0.9%      100.0%

(Figure: percentage of variance plotted over component number 1–10)
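A sketch of how such percentages can be derived from the eigenvalues (the eigenvalues below are made up for illustration and do not correspond to the table above):

```python
import numpy as np

# Eigenvalues of the covariance matrix, largest first (illustrative values)
eigvals = np.array([9.8, 2.1, 0.9, 0.5, 0.3, 0.2, 0.1, 0.05, 0.03, 0.02])

per_component = eigvals / eigvals.sum() * 100   # percentage of variance per component
cumulative = np.cumsum(per_component)           # cumulative percentage

# Keep enough components to explain e.g. 90% of the variance
L = np.searchsorted(cumulative, 90) + 1
print(per_component.round(1))
print(cumulative.round(1))
print(L)
```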
Synthetic Minority Over-Sampling Technique (SMOTE)
• Simple approach:
– Train multiple different models
– Training examples misclassified by a model indicate an outlier
– Count how often an example is misclassified and discard it if the count exceeds a threshold (sketched below)
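A minimal sketch of this filtering step, assuming a list of already trained models with scikit-learn-style predict():

```python
import numpy as np

def filter_outliers(models, X, y, threshold):
    """Count per training example how many of the (already trained) models
    misclassify it, and discard examples whose count exceeds the threshold.
    Each model is assumed to offer a scikit-learn style predict(X)."""
    miscounts = np.zeros(len(y), dtype=int)
    for model in models:
        miscounts += (model.predict(X) != y)
    keep = miscounts <= threshold
    return X[keep], y[keep]
```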
• Observation now:
– Inliers form dense clusters in attribute space
– Many splits required to isolate them
– Outliers are isolated after only a few splits (further up in the tree)
(Figure: example data with a single outlier; high probability of the outlier being isolated after only a few splits)
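This is the idea behind isolation forests; scikit-learn provides an implementation that can be used as sketched below (synthetic data for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # dense cluster
outliers = rng.uniform(low=-6, high=6, size=(5, 2))       # isolated points
X = np.vstack([inliers, outliers])

# Each tree splits the attribute space at random; points that end up
# isolated after only a few splits receive an anomalous score.
forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = forest.predict(X)        # +1 for inliers, -1 for flagged outliers
print(np.where(labels == -1)[0])  # indices of detected outliers
```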
• Idea:
– Examples 𝒙 were drawn from an unknown probability distribution 𝑝(𝒙)
– Estimate the probability distribution 𝑝̂(𝒙) from the data
– If 𝑝̂(𝒙𝑖) is lower than a threshold, 𝒙𝑖 is an outlier
• Kernel density estimator: 𝑝̂(𝒙) = 1/(𝑁𝜎) ⋅ Σ𝑖 𝐾((𝒙 − 𝒙𝑖)/𝜎)
– 𝑁: number of examples
– 𝜎: smoothing parameter (“bandwidth”)
– 𝐾(⋅): non-negative kernel function that integrates to 1
• E.g., Gaussian, box, triangle, Epanechnikov
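A minimal sketch of KDE-based outlier detection in one dimension with a Gaussian kernel (data and threshold are made up for illustration):

```python
import numpy as np

def gaussian_kernel(u):
    """Non-negative kernel that integrates to 1 (here: standard Gaussian)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x, samples, sigma):
    """Kernel density estimate p_hat(x) = 1/(N*sigma) * sum_i K((x - x_i) / sigma)."""
    return gaussian_kernel((x - samples) / sigma).sum() / (len(samples) * sigma)

samples = np.concatenate([np.random.normal(0, 1, 200), [8.0]])  # 8.0 is an outlier
sigma = 0.5
densities = np.array([kde(x, samples, sigma) for x in samples])

threshold = 0.01
print(samples[densities < threshold])   # low-density examples flagged as outliers (8.0 among them)
```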