Fundamentals of Machine Learning
Roozbeh Sanaei
June 30, 2024
Contents
1 Linear Algebra
1.1 Matrix Decomposition Techniques
1.1.1 Eigenvalue Decomposition
1.1.2 Singular Value Decomposition (SVD)
1.1.3 Orthogonality and Orthonormality
3 Dimensionality Reduction
3.1 Independent Component Analysis (ICA)
3.1.1 ICA Overview
3.1.2 Different Algorithms in ICA
3.1.3 Infomax
3.1.4 What is Whitening?
3.1.5 Fast Independent Component Analysis
3.1.6 JADE
3.2 SNE, t-SNE, UMAP
3.2.1 SNE
3.2.2 Comparative Analysis of SNE, t-SNE, and UMAP
3.2.3 SNE, t-SNE and UMAP Comparison
3.3 Linear Discriminant Analysis (LDA)
3.4 Sparse Dictionary Learning Overview
3.5 Non-negative Matrix Factorization (NMF)
3.6 Multidimensional Scaling (MDS)
3.7 Isomap (Isometric Mapping)
4 Clustering
4.1 K-Means
4.1.1 Elbow Method for Determining Optimal Number of Clusters in K-means
4.1.2 Silhouette Analysis for Determining Optimal Clusters in K-means
4.2 The Canopy Method
4.3 Gaussian Mixture Models (GMMs)
4.3.1 Challenges in Gaussian Mixture Models (GMM)
4.3.2 Comparison of GMMs and K-means
4.4 DBSCAN: Density-Based Spatial Clustering of Applications with Noise
4.5 OPTICS (Ordering Points To Identify the Clustering Structure) Algorithm
4.6 Spectral Clustering Algorithms
4.7 Markov Chain Clustering (MCL)
4.8 Agglomerative Clustering Algorithm
© Roozbeh Sanaei 2
5.9.4 Gradient Boosting
6 Evaluation
6.1 Evaluation Approaches
6.1.1 K-fold Validation
6.1.2 The ROC Curve in Binary Classification
6.1.3 Accuracy Metrics
6.1.4 Lift and Drift Charts
8 Information Theory
8.1 Shannon Uncertainty Formula
8.2 Boltzmann’s Entropy Formula
8.3 Jensen’s Inequality
8.4 Fisher’s Score and Fisher’s Information
8.5 Kullback-Leibler Divergence
8.6 Mutual Information
Linear Algebra
Finding Eigenvalues
Eigenvalues of $A$ are found by solving the characteristic equation:
$$\det(A - \lambda I) = 0$$
where $\det$ denotes the determinant of a matrix, and $I$ is the identity matrix. The roots of the characteristic polynomial (a polynomial in $\lambda$) are the eigenvalues of $A$.
Eigenvalue Decomposition
If $A$ has $n$ linearly independent eigenvectors $\{v_1, v_2, \ldots, v_n\}$ corresponding to eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$, then $A$ can be factorized as:
$$A = V D V^{-1}$$
where $V$ is the matrix whose $i$-th column is the eigenvector $v_i$, and $D$ is the diagonal matrix with eigenvalues $\lambda_i$ on the diagonal.
1.1.2 Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) is a critical technique in linear algebra, utilized in various fields such as signal processing, statistics, and machine learning. It decomposes any matrix into three distinct matrices.
Definition of SVD
Given an $m \times n$ matrix $A$, SVD is defined as:
$$A = U \Sigma V^T$$
• $\Sigma$ (singular values): Diagonal entries are the square roots of the non-negative eigenvalues of $A^T A$ or $A A^T$.
Computing SVD
• Compute eigenvalues and eigenvectors of $A A^T$ and $A^T A$.
• Singular values in $\Sigma$ are square roots of non-zero eigenvalues of $A^T A$.
• Columns of $U$ are normalized eigenvectors of $A A^T$.
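
The relationship between singular values and the eigenvalues of $A^T A$ can be checked numerically. The following is a minimal sketch using NumPy; the matrix `A` is a made-up example, not taken from the text.

```python
import numpy as np

# Hypothetical 3x2 example matrix (an assumption for illustration).
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Thin SVD: U is 3x2, s holds the singular values, Vt is V^T (2x2).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of the symmetric matrix A^T A, sorted in descending order.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Singular values should equal the square roots of those eigenvalues.
sigma_from_eig = np.sqrt(eigvals)

# Rebuild A from its three factors to confirm A = U Sigma V^T.
A_rebuilt = U @ np.diag(s) @ Vt
```

In practice `np.linalg.svd` is preferred over forming $A^T A$ explicitly, since squaring the matrix also squares its condition number.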
1.1.3 Orthogonality and Orthonormality
Orthogonality
Definition: Two vectors $u$ and $v$ in a vector space are orthogonal if their dot product is zero:
$$u \cdot v = 0$$
In $\mathbb{R}^n$, this is:
$$u_1 v_1 + u_2 v_2 + \cdots + u_n v_n = 0$$
Importance: Orthogonal vectors minimize errors and dependencies in computations and are used in methods like the Gram-Schmidt process, which constructs orthogonal bases.
Orthonormality
Definition: A set of vectors is orthonormal if all vectors are orthogonal to each other and each vector is of unit length. For vectors $u$ and $v$:
$$u \cdot v = 0 \ (\text{if } u \neq v), \qquad \|u\| = \|v\| = 1$$
Importance: Orthonormal vectors simplify computations and are used in Fourier series, quantum mechanics, and signal processing.
Applications in Linear Transformations
In matrix terms, a matrix $A$ with orthonormal columns satisfies:
$$A^T A = I$$
where $A^T$ is the transpose of $A$, and $I$ is the identity matrix. This property is crucial in preserving lengths and angles in transformations.
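
As an illustration of the Gram-Schmidt process mentioned above, here is a minimal sketch that orthonormalizes a set of vectors. The function name `gram_schmidt` and the input vectors are assumptions chosen for the example.

```python
import numpy as np

def gram_schmidt(vectors):
    """Return an orthonormal basis (as rows) for the span of the input rows."""
    basis = []
    for v in np.asarray(vectors, dtype=float):
        # Remove the components of v along the basis vectors found so far.
        for q in basis:
            v = v - np.dot(q, v) * q
        norm = np.linalg.norm(v)
        if norm > 1e-12:  # skip vectors that are numerically dependent
            basis.append(v / norm)
    return np.array(basis)

# Three linearly independent vectors in R^3 (hypothetical input).
Q = gram_schmidt([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])

# With the basis vectors as columns of A, A^T A should be the identity.
A = Q.T
gram = A.T @ A
```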
Fundamental Concepts of Machine Learning
Semi-Supervised Learning
Combines both labeled and unlabeled data for training.
Application: Useful in scenarios where labeled data are limited.
Example: Language translation models with limited annotated data.
2.0.2 Overfitting and its Mitigation
Overfitting
Definition: Overfitting occurs when a model learns both the underlying patterns and the noise in the training data, leading to poor generalization to new data.
Consequence and Impact: The model exhibits high accuracy on training data but poor performance on unseen data.
Reason: Often due to excessive complexity in the model relative to the amount of training data, leading to the learning of noise.
Methods to Overcome Overfitting
Regularization: Adding a penalty to the model’s loss function to constrain its complexity.
Dropout: Randomly ignoring neurons during training in neural networks to prevent over-reliance on specific patterns.
Early Stopping: Halting training when the model’s performance on validation data starts to worsen.
Data Augmentation: Artificially increasing training data diversity through transformations.
Ensembling: Combining predictions from multiple models to average out errors.
Feature Selection: Choosing the most relevant features for training to reduce model complexity.
Model Simplification: Reducing the number of layers or parameters in the model.
Increasing Dataset Size: Expanding the training dataset to provide more comprehensive examples.
Bayesian Neural Networks: Incorporating probabilistic approaches in neural networks to manage uncertainty.
Bias and Variance in Machine Learning
• Bias:
– Measures the difference between the model’s average prediction and the true values.
– $\mathrm{Bias}^2[\hat{f}(x)] = \left(E[\hat{f}(x)] - f(x)\right)^2$
∗ Where $\hat{f}(x)$ is the model’s prediction, $f(x)$ is the true function, and $E[\hat{f}(x)]$ is the expected value of the model’s predictions.
– High bias can lead to underfitting, indicating a model too simple to capture the data’s complexity.
• Variance:
– Measures the variation of model predictions for a given data point across different training sets.
– $\mathrm{Variance}[\hat{f}(x)] = E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]$
• Bias-Variance Trade-Off:
– Improving the model to reduce bias typically increases its variance, and vice versa.
– Total Error = Bias$^2$ + Variance + Irreducible Error
∗ Irreducible Error represents the error inherent in the problem itself, due to factors like noise.
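
The two quantities above can be estimated by simulation: refit a model on many noisy training sets and look at its predictions at one point. The sketch below is an illustration under stated assumptions (a sine target, Gaussian noise, polynomial models of degree 1 and 5, and the evaluation point `x0` are all made up for the example).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed "true" function for the simulation.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 30)   # fixed design points
x0 = 0.25                             # point at which predictions are compared
noise_std = 0.3
n_trials = 500

def bias2_and_variance(degree):
    """Estimate Bias^2 and Variance of a polynomial fit at x0."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Each trial draws a fresh noisy training set from the same f.
        y = f(x_train) + rng.normal(0.0, noise_std, x_train.size)
        coeffs = np.polyfit(x_train, y, degree)
        preds[t] = np.polyval(coeffs, x0)
    bias2 = (preds.mean() - f(x0)) ** 2   # (E[f_hat] - f)^2
    variance = preds.var()                # E[(f_hat - E[f_hat])^2]
    return bias2, variance

bias2_simple, var_simple = bias2_and_variance(1)      # rigid model
bias2_flexible, var_flexible = bias2_and_variance(5)  # flexible model
```

Under these assumptions the straight-line model shows the larger squared bias and the degree-5 model the larger variance, which is the trade-off described above.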
2.0.3 Feature Selection
Filter Methods
These methods assess features independently of any learning algorithm, using statistical measures.
• Correlation Coefficient: Measures linear relationship between features and the target.
• Chi-squared Test: Assesses independence between categorical features and the target.
• Information Gain: Evaluates reduction in entropy from adding a feature.
Embedded Methods
These methods integrate feature selection within the learning algorithm during model training.
• Lasso Regression: Uses L1 regularization to eliminate irrelevant features.
• Elastic Net: Combines L1 and L2 regularization for balanced feature selection.
• Random Forest Importance: Assesses feature importance based on their contribution to random forest performance.
Advantages of Feature Selection
• Reducing dimensionality
• Improving model performance
• Enhancing interpretability
• Reducing computational complexity
• Improving data quality
• Enhancing model transparency
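
The correlation-coefficient filter from the Filter Methods list can be sketched in a few lines. The synthetic data below is an assumption for illustration: only features 0 and 2 actually drive the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical data: 4 features, of which only columns 0 and 2 matter.
X = rng.normal(size=(n_samples, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n_samples)

# Pearson correlation of each feature with the target.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Filter step: keep the two features with the largest absolute correlation.
selected = np.argsort(-np.abs(corr))[:2]
```

Because the score is computed per feature, this filter cannot detect features that are only useful in combination; that is a limitation of filter methods in general, not of this sketch.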
Dimensionality Reduction
• Algorithm: Adjusts weights to maximize independence of output signals.
• Ambiguity: Possibility of permutation and scaling ambiguities in solutions.
3.1.2 Different Algorithms in ICA
• FastICA: Known for its computational efficiency, FastICA uses a fixed-point iteration scheme to maximize the non-Gaussianity of projected data.
• Infomax and Extended Infomax: These algorithms maximize the information transfer from the input to the output of a neural network. Extended Infomax can handle both sub- and super-Gaussian sources.
• JADE (Joint Approximate Diagonalization of Eigenmatrices): JADE operates by jointly diagonalizing a set of fourth-order cumulant matrices to extract the independent components.
• CUmulative Distributions-based ICA (CUDI): CUDI maximizes non-Gaussianity using cumulative distribution functions.
• Second-order blind identification (SOBI): SOBI utilizes second-order statistics over different time delays for ICA, particularly effective for time-dependent signals.
• Probabilistic ICA: This method incorporates a probabilistic model, often using a likelihood function, to estimate independent components, and is closely related to factor analysis.
• Temporal ICA: Designed specifically for time-series data, it focuses on the temporal structure of the data to separate sources.
• Complex-valued ICA: This variant is used for complex-valued data (with real and imaginary parts), as found in some signal processing applications.
• Nonlinear ICA: Deals with mixtures that are nonlinear combinations of the source signals, unlike traditional linear ICA models.
3.1.3 Infomax
Infomax is a principle used in information theory and neural network training, aiming to maximize the mutual information between the input and output
of a system. This principle is often applied in unsupervised learning, where the goal is to find a representation of the input data that preserves as much
information as possible.
Mutual Information (I)
• Mutual information quantifies the shared information between two random variables, $X$ and $Y$.
• It is defined as:
$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right)$$
Objective of Infomax
• The objective is to find the optimal weights $W$ that maximize $I(X; Y)$.
Optimization
• Typically involves using gradient-based optimization techniques.
• Adjusts weights $W$ by following the gradient of $I(X; Y)$ with respect to $W$.
• Additional constraints, such as weight regularization, are often applied.
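
For discrete variables, the mutual information definition can be evaluated directly from a joint probability table. This is a minimal sketch; the function name `mutual_information` and the two example tables are assumptions for illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                # make sure it sums to 1
    px = joint.sum(axis=1, keepdims=True)      # marginal p(x), column vector
    py = joint.sum(axis=0, keepdims=True)      # marginal p(y), row vector
    nz = joint > 0                             # 0 * log(0) is treated as 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Independent variables: p(x, y) = p(x) p(y), so I(X;Y) should be 0.
independent = np.outer([0.5, 0.5], [0.3, 0.7])

# Y is a copy of a fair binary X, so I(X;Y) should be log(2).
identical = np.array([[0.5, 0.0],
                      [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_identical = mutual_information(identical)
```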
3.1.4 What is Whitening?
• Whitening: Transformation of Data
– Whitening is a process that transforms data so that the covariance matrix of the resulting dataset is the identity matrix.
– This means that the features are uncorrelated and each feature has unit variance.
– This involves decorrelating the features and normalizing their variance.
• Compute the Covariance Matrix
– Given a dataset $X$ with $n$ features and $m$ samples (one sample per row).
– Calculate the covariance matrix $\Sigma$ as:
$$\Sigma = \frac{1}{m}(X - \bar{X})^T (X - \bar{X})$$
– $\bar{X}$ is the mean vector for the dataset.
• Perform Eigenvalue Decomposition
– Apply eigenvalue decomposition to the covariance matrix $\Sigma$:
$$\Sigma = V D V^T$$
– $V$ is the matrix of eigenvectors, $D$ is the diagonal matrix of eigenvalues.
• Whiten the Data
– Transform the data to obtain the whitened data $X_{\text{white}}$:
$$X_{\text{white}} = E D^{-\frac{1}{2}} V^T (X - \bar{X})$$
– Here $(X - \bar{X})$ is read with one centered sample per column. With samples stored as rows, as in the covariance formula, the same transform is $(X - \bar{X}) V D^{-\frac{1}{2}} E^T$.
– $D^{-\frac{1}{2}}$ is obtained by taking the reciprocal square root of each non-zero element in $D$.
– $E$ is an optional scaling matrix, often the identity matrix $I$.
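
The whitening steps can be sketched with NumPy as follows. The data and the `mixing` matrix are made up for illustration, samples are stored as rows, and the optional scaling matrix $E$ is taken to be the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: m = 500 samples (rows), n = 3 features.
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing.T

# Step 1: center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / Xc.shape[0]

# Step 2: eigenvalue decomposition of the symmetric covariance matrix.
eigvals, V = np.linalg.eigh(cov)

# Step 3: whiten (row-sample form of E D^{-1/2} V^T (X - mean), with E = I).
X_white = Xc @ V @ np.diag(eigvals ** -0.5)

# The covariance of the whitened data should be the identity matrix.
cov_white = X_white.T @ X_white / X_white.shape[0]
```

If some eigenvalues are zero or very close to zero, a small constant is usually added before taking the reciprocal square root; that safeguard is left out here for brevity.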
3.1.5 Fast Independent Component Analysis
Fast Independent Component Analysis (Fast ICA) is an algorithm used for the separation of a multivariate signal into additive subcomponents. It’s often used in the context of blind source separation, where the goal is to separate a set of signals that have been mixed together.
Basic Model
The basic model of ICA can be represented as:
$$x = As$$
where
• $x$ is the observed mixed signal.
• $A$ is the mixing matrix.
• $s$ is the vector of independent source signals.
Goal of Fast ICA
The goal is to estimate the matrix $W$, which is the unmixing matrix, such that:
$$s \approx Wx$$
Fast ICA Algorithm
1. Centering and Whitening: First, the observed signals are centered and whitened. Centering involves subtracting the mean, and whitening is done to transform the variables into uncorrelated variables with unit variance.
2. Maximization of Non-Gaussianity: The core of Fast ICA is to find a linear combination of the whitened variables that maximizes non-Gaussianity. Non-Gaussianity can be measured in several ways, such as using kurtosis or negentropy.
3. Iterative Fixed-Point Algorithm: The Fast ICA algorithm finds the independent components by iterating the following steps:
(a) Choose an initial weight vector $w$.
(b) Update $w$ using the formula:
$$w^+ = E\{x\,g(w^T x)\} - E\{g'(w^T x)\}\,w$$
where $g$ is a non-linear function, which is chosen based on the measure of non-Gaussianity (like kurtosis or negentropy) and $g'$ is its derivative.
(c) Normalize the weight vector:
$$w = \frac{w^+}{\|w^+\|}$$
(d) Repeat until convergence.
4. Extraction of Independent Components: Once the algorithm converges, the independent components are given by:
$$s = Wx$$
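
The fixed-point iteration can be sketched as below. Several choices here are assumptions rather than part of the text: $g = \tanh$ as the non-linearity, a Gram-Schmidt deflation step to extract more than one component, and the two synthetic sources and mixing matrix `A`. Note that `W` in this sketch acts on the whitened data `Z`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical sources (one uniform, one Laplace) and mixing matrix.
S = np.vstack([rng.uniform(-1.0, 1.0, n), rng.laplace(size=n)])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S                                    # observed mixtures, x = A s

# Step 1: centering and whitening.
Xc = X - X.mean(axis=1, keepdims=True)
d, V = np.linalg.eigh(Xc @ Xc.T / n)
Z = np.diag(d ** -0.5) @ V.T @ Xc

def fastica(Z, n_components, n_iter=200, tol=1e-8):
    W = np.zeros((n_components, Z.shape[0]))
    for c in range(n_components):
        w = rng.normal(size=Z.shape[0])      # (a) initial weight vector
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            wx = w @ Z
            g = np.tanh(wx)
            g_prime = 1.0 - g ** 2
            # (b) w+ = E{x g(w^T x)} - E{g'(w^T x)} w
            w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
            # Deflation (assumed extension): stay orthogonal to earlier rows.
            w_new -= W[:c].T @ (W[:c] @ w_new)
            w_new /= np.linalg.norm(w_new)   # (c) normalize
            converged = abs(abs(w_new @ w) - 1.0) < tol
            w = w_new
            if converged:                    # (d) stop at convergence
                break
        W[c] = w
    return W

W = fastica(Z, 2)
S_est = W @ Z                                # step 4: estimated sources
```

The recovered components match the true sources only up to order and sign, which is the permutation and scaling ambiguity inherent to ICA.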
3.1.6 JADE
JADE ICA is a statistical technique used to separate a multivariate signal into additive subcomponents that are maximally independent from each other.
It is particularly useful in blind source separation tasks like separating audio signals.
3.2 SNE, t-SNE, UMAP
3.2.1 SNE
Similarity in the Original Space
$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$
Minimization of Kullback-Leibler Divergence
$$KL(P\|Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
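
The conditional probabilities $p_{j|i}$ and the KL objective can be computed directly. This is a minimal sketch: the function names, the random data, and the per-point `sigma` values are assumptions for illustration (in SNE the $\sigma_i$ are normally found by a perplexity search, which is omitted here).

```python
import numpy as np

def sne_conditional_p(X, sigma):
    """Return P with P[i, j] = p_{j|i} for Gaussian kernels of width sigma[i]."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)              # exclude k = i from the sum
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def kl_divergence(P, Q):
    """Sum of p log(p / q) over all entries where p > 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                        # hypothetical data points
P = sne_conditional_p(X, np.linspace(0.5, 1.5, 6)) # point-specific sigma_i
Q = sne_conditional_p(X, np.ones(6))               # a different distribution
cost = kl_divergence(P, Q)
```

Because each row uses its own $\sigma_i$, `P` is generally not symmetric, which is the asymmetry that t-SNE later removes.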
3.2.2 Comparative Analysis of SNE, t-SNE, and UMAP
High Dimensional Probabilities Calculation
• SNE: Utilizes scaled Euclidean distance, leading to non-symmetric dissimilarities due to the variance parameter σi.
• t-SNE Difference: Implements symmetrization to make high-dimensional probabilities symmetric.
• UMAP Difference: Works with similarities instead of probabilities, using a different metric function.
• t-SNE Difference: Adopts the Student t-distribution for the low-dimensional similarities to solve the crowding problem.
• UMAP Difference: Does not normalize low-dimensional similarities, improving computational performance.
Cost Function and Optimization
• SNE: Employs Kullback-Leibler divergence, facing optimization challenges.
• t-SNE Difference: Retains KL divergence but applies it to symmetrized joint probabilities.
• UMAP Difference: Uses cross-entropy and stochastic gradient descent, capturing more global structure.
• t-SNE Difference: Prioritizes local structure conservation, effective in visualizing clusters.
• UMAP Difference: Preserves more global structure due to its methodological approach.
3.2.3 SNE, t-SNE and UMAP Comparison
Similarity
• SNE Similarity: The SNE algorithm uses a Gaussian kernel to model the probability that a point $x_i$ in a high-dimensional space would choose another point $x_j$ as its neighbor:
$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$
Variables: $x_i, x_j$, High-dimensional data points; $\sigma_i^2$, Variance of the Gaussian kernel centered at $x_i$.
• t-SNE Symmetrization: t-SNE modifies the SNE approach by symmetrizing the probabilities to address the original asymmetry:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$
Variables: $p_{j|i}, p_{i|j}$, Conditional probabilities from SNE; $N$, Total number of data points.
• UMAP Similarity Measure: UMAP employs a fuzzy set approach for similarity, focusing on both local and global data structures:
$$\mu_{ij} = \exp(-\max(0, d(x_i, x_j) - \rho_i)/\sigma_i)$$
Variables: $d(x_i, x_j)$, User-defined metric for distance; $\rho_i, \sigma_i$, Parameters for local neighborhood adjustment.
Similarity in Reduced Space
• SNE Similarity in Reduced Space: SNE in the reduced space uses a similar approach to the high-dimensional space but with fixed variance:
$$q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$$
Variables: $y_i, y_j$, Low-dimensional embeddings of the high-dimensional data points $x_i, x_j$. This calculates the probability of the low-dimensional embeddings of points, emphasizing their relative proximities.
• t-SNE Symmetrization in Reduced Space: t-SNE uses the Student t-distribution for probabilities in the reduced space:
$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$$
Variables: $y_i, y_j$, Low-dimensional embeddings. The use of the t-distribution helps in mitigating the crowding problem and allows t-SNE to better model the relationships between points in the reduced space.
• UMAP Similarity Measure in Reduced Space: UMAP uses a different formulation for low-dimensional similarities:
$$\nu_{ij} = \frac{1}{1 + a\|y_i - y_j\|^{2b}}$$
Variables: $y_i, y_j$, Low-dimensional embeddings; $a, b$, Parameters fitted from the minimum-distance hyperparameter before the embedding is optimized. This measure helps UMAP to preserve the topological structure of the data in the reduced space, ensuring a balance between local and global structures.
Loss Functions
• SNE Loss Function: SNE uses the Kullback-Leibler divergence as its loss function:
$$C_{\text{SNE}} = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$
This function measures the mismatch between the high-dimensional and low-dimensional probabilities, aiming to preserve local structures in the reduced space.
• t-SNE Loss Function: t-SNE also uses the Kullback-Leibler divergence, but with symmetrized probabilities:
$$C_{\text{t-SNE}} = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
This function aims to minimize the difference between high-dimensional and low-dimensional representations, focusing on local neighborhood structures.
• UMAP Loss Function: UMAP utilizes a cross-entropy loss function:
$$C_{\text{UMAP}} = -\sum_{ij} \left[ w_{ij} \log(\sigma(d_{ij})) + (1 - w_{ij}) \log(1 - \sigma(d_{ij})) \right]$$
Variables: $\sigma(d_{ij})$, Logistic sigmoid function of the distance between points $i$ and $j$; $w_{ij}$, Weight derived from the high-dimensional graph. UMAP’s loss function balances attractive and repulsive forces, aiming to preserve both local and global data structures.
3.3 Linear Discriminant Analysis (LDA)
• Primary Goal of LDA:
– Identify a linear combination of features.
– Differentiate or segregate multiple classes of objects or events.
• Optimization Approach:
– Optimize the ratio of determinants between the between-class scatter matrix ($S_B$) and the within-class scatter matrix ($S_W$).
– Calculate mean vectors for each class:
$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x,$$
where $m_i$ is the mean vector for class $i$, $n_i$ is the number of samples in class $i$, and $D_i$ is the set of data points in class $i$.
– Between-Class Scatter Matrix:
$$S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T,$$
where $m$ is the overall mean of the data and $c$ is the number of classes.
– Within-Class Scatter Matrix:
$$S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^T.$$
– Fisher’s Criterion:
$$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|},$$
where $W$ is the projection matrix.
• Solution and Eigenvalue Problem:
– Maximize $J(W)$ by solving the eigenvalue problem $S_W^{-1} S_B v = \lambda v$, where $v$ are the eigenvectors, $\lambda$ are the eigenvalues, $S_B$ is the between-class scatter matrix, and $S_W$ is the within-class scatter matrix.
– Focus on the eigenvectors corresponding to the largest eigenvalues for maximum separation between classes.
• Projection of Data:
– Determine $W$ for data projection, where $W$ contains the selected eigenvectors.
– Project data points $x$ into a lower-dimensional space to achieve optimal class separation: $y = W^T x$, where $y$ is the projected data and $x$ is the original data.
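
The LDA steps can be sketched with NumPy as follows. The function name `lda_projection` and the two-class synthetic data are assumptions for illustration; samples are stored as rows, so the projection $y = W^T x$ becomes `X @ W`.

```python
import numpy as np

def lda_projection(X, y, n_components):
    """Return W whose columns are the leading eigenvectors of S_W^{-1} S_B."""
    m = X.mean(axis=0)                          # overall mean
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        m_c = Xc.mean(axis=0)                   # class mean m_i
        S_W += (Xc - m_c).T @ (Xc - m_c)        # within-class scatter
        diff = (m_c - m)[:, None]
        S_B += Xc.shape[0] * (diff @ diff.T)    # between-class scatter
    # Solve S_W^{-1} S_B v = lambda v without forming the inverse explicitly.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)[:n_components]
    return eigvecs[:, order].real

rng = np.random.default_rng(0)
# Hypothetical data: two classes separated along the first feature.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (100, 2)),
               rng.normal([4.0, 0.0], 1.0, (100, 2))])
y = np.repeat([0, 1], 100)

W = lda_projection(X, y, n_components=1)
Y = X @ W                                       # projected data, y = W^T x
```

With $c$ classes, $S_B$ has rank at most $c - 1$, so at most $c - 1$ useful projection directions exist.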
3.4 Sparse Dictionary Learning Overview
1. Objective:
• Goal: Find a dictionary $D$ and a sparse representation $X$ to approximate a given dataset $Y$ as $Y \approx DX$.
• Components:
– $Y$: A matrix where each column represents a data sample.
– $D$: The dictionary matrix with each column as a dictionary atom.
– $X$: A sparse matrix where each column is the sparse representation of the corresponding column in $Y$.
2. Mathematical Formulation:
•
$$\min_{D, X} \frac{1}{2} \|Y - DX\|_F^2 \quad \text{subject to } \|x_i\|_0 \le T \ \ \forall i.$$
• Elements:
– $\|\cdot\|_F$: Frobenius norm, measuring the difference between $Y$ and $DX$.
– $\|x_i\|_0$: $l_0$-norm of the $i$-th column of $X$, counting non-zero entries to enforce sparsity.
– $T$: A threshold dictating the maximum number of non-zero entries in each column of $X$.
3. Optimization Challenge:
• Problem Nature: Generally non-convex and NP-hard due to the $l_0$-norm constraint.
• Practical Approaches:
– Relaxing the $l_0$-norm to an $l_1$-norm, promoting sparsity while being convex.
– Using greedy algorithms like Orthogonal Matching Pursuit (OMP).
4. Alternate Minimization:
• Strategy:
– Fix $D$ and optimize $X$: For each column $y_i$ of $Y$, find the sparse representation $x_i$ using $D$.
– Fix $X$ and optimize $D$: Update $D$ while keeping $X$ fixed.
5. Regularization and Constraints:
• In Practice:
– Adding constraints like normalizing the columns of $D$ to prevent scaling issues.
– Incorporating regularization terms to control overfitting.
3.5 Non-negative Matrix Factorization (NMF)
Non-negative Matrix Factorization (NMF) is a powerful technique in data analysis and linear algebra. It aims to factorize a non-negative matrix $V$ into two non-negative matrices $W$ and $H$ with inner dimension $k$, the desired rank or number of components. This factorization is useful for various applications, including dimensionality reduction, feature extraction, and source separation.
Key Points about NMF
• NMF Objective: NMF aims to factorize a non-negative matrix $V$ ($m \times n$) into two non-negative matrices $W$ ($m \times k$) and $H$ ($k \times n$), where $k$ is the desired rank or number of components.
• Optimization: NMF employs iterative optimization algorithms, like multiplicative update rules, to find optimal values for $W$ and $H$. These rules are applied iteratively until convergence, where $\odot$ and the fraction bar denote elementwise multiplication and division:
$$\text{For } W: \quad W_{\text{new}} = W \odot \frac{V H^T}{W H H^T}$$
$$\text{For } H: \quad H_{\text{new}} = H \odot \frac{W^T V}{W^T W H}$$
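
The multiplicative update rules can be sketched as below, using the standard Lee-Seung form for the Frobenius-norm objective. The function name `nmf`, the random initialization, the small constant `eps` that guards against division by zero, and the test matrix are all assumptions for illustration.

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-10, seed=0):
    """Factorize non-negative V (m x n) as W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Elementwise multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(V, W, H):
    return np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Hypothetical non-negative matrix with an exact rank-2 factorization.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 5))

W, H = nmf(V, k=2)
error_final = relative_error(V, W, H)

# Same initialization, but stopped after a single update, for comparison.
W1, H1 = nmf(V, k=2, n_iter=1)
error_first = relative_error(V, W1, H1)
```

The updates never increase the reconstruction error, but they may converge to a local rather than a global minimum, so results can depend on the initialization.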
3.6 Multidimensional Scaling (MDS)
• Similarity/Dissimilarity Matrix (D):
– The matrix D contains elements dij, each representing the distance or dissimilarity between objects i and j.
– It’s typically symmetric, with zeros on the diagonal (indicating zero dissimilarity of an object with itself).
– The goal is to find a set of points in a low-dimensional space (2D or 3D) that represent the objects.
– The matrix X contains elements xij, each representing the distance between points i and j in the low-dimensional space.
• Optimization:
– Iteratively adjust the coordinates of points in the low-dimensional space to minimize the stress function, which measures the mismatch between dij and xij.
– Often done using numerical optimization techniques, such as gradient descent.
– The configuration of points reflects the relative similarities or dissimilarities among the objects.
– Objects more similar are closer together in the MDS space, while less similar objects are farther apart.
3.7 Isomap (Isometric Mapping)
Isomap (Isometric Mapping) is a nonlinear dimensionality reduction method designed to uncover the underlying manifold structure in a high-dimensional
dataset by approximating the geodesic distances among points.
• Construct a neighborhood graph $G$. Each point $x_i$ in the dataset is connected to its $K$ nearest neighbors, or to all points within a fixed radius $\epsilon$.
• The distance between connected points is typically the Euclidean distance: $d(x_i, x_j) = \|x_i - x_j\|$.
• Compute the shortest path distances between all pairs of points in the graph $G$ using algorithms like Floyd-Warshall or Dijkstra’s. These approximate the geodesic distances and form the matrix $D$.
• Perform double centering on $D$ to create a matrix $B$:
$$B = -\frac{1}{2} H D^2 H$$
where $D^2$ contains the elementwise squared distances, $H$ is the centering matrix $H = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T$, $I$ is the identity matrix, and $\mathbf{1}$ is a vector of ones.
• $B$ is then subjected to eigenvalue decomposition:
$$B = V \Lambda V^T$$
where $\Lambda$ is the diagonal matrix of eigenvalues and $V$ contains the corresponding eigenvectors.
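
The Isomap pipeline can be sketched end to end with NumPy. This is an illustration under assumptions: the function name `isomap`, a K-nearest-neighbor graph, Floyd-Warshall for shortest paths, a connected graph, and a half-circle as test data. The final step, scaling the top eigenvectors by the square roots of their eigenvalues, is the usual classical-MDS embedding.

```python
import numpy as np

def isomap(X, n_neighbors, n_components):
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Neighborhood graph: keep edges to the K nearest neighbors (inf = no edge).
    G = np.full((n, n), np.inf)
    nearest = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]
    rows = np.repeat(np.arange(n), n_neighbors)
    G[rows, nearest.ravel()] = dist[rows, nearest.ravel()]
    G = np.minimum(G, G.T)                  # make the graph undirected
    np.fill_diagonal(G, 0.0)

    # Floyd-Warshall: all-pairs shortest paths approximate geodesic distances.
    for k in range(n):
        G = np.minimum(G, G[:, [k]] + G[[k], :])

    # Double centering: B = -1/2 H D^2 H.
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (G ** 2) @ H

    # Eigenvalue decomposition; keep the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

# Hypothetical 1-D manifold: points along a half circle in 2-D.
t = np.linspace(0.0, np.pi, 40)
X = np.column_stack([np.cos(t), np.sin(t)])
embedding = isomap(X, n_neighbors=2, n_components=1)
```

For this example the one-dimensional embedding should order the points along the arc, which a straight-line projection of the half circle would not do.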
Clustering
4.1 K-Means
The K-means algorithm is a widely used method in unsupervised machine learning for clustering data. It partitions a dataset into K distinct, non-overlapping
subgroups or clusters.
1. Initialization:
• Choose $k$ initial centroids randomly.
• These centroids $C = \{c_1, c_2, \ldots, c_k\}$ are the starting points for each of the clusters.
2. Assignment Step:
• Assign each data point $x_i$ to the nearest centroid.
• The assignment of a data point to a cluster is based on the minimum distance from the centroids, typically calculated using the Euclidean distance.
• The assignment function is represented as:
$$S_i = \{x_p : \|x_p - c_i\| \le \|x_p - c_j\| \ \ \forall j, 1 \le j \le k\}$$
• Here, $S_i$ is the set of points assigned to the $i$-th cluster, $x_p$ is a data point, and $\|x_p - c_i\|$ is the Euclidean distance between $x_p$ and centroid $c_i$.
3. Update Step:
• Recalculate the centroids of the clusters based on the current assignment of data points.
• The new centroid $c_i$ for cluster $i$ is the mean of all points assigned to that cluster:
$$c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$$
• Here, $|S_i|$ is the number of data points in cluster $i$, and $x_j$ are the data points in cluster $i$.
4. Convergence Check:
• Repeat the assignment and update steps until the centroids no longer change significantly, or a maximum number of iterations is reached.
• This convergence is often checked by seeing if the sum of the squared distances between data points and their corresponding centroids has stopped decreasing.
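
The assignment and update steps can be sketched as follows. The function name `kmeans`, the choice of initial centroids as random data points, the handling of empty clusters, and the two-blob test data are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Convergence check: stop when the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

rng = np.random.default_rng(1)
# Hypothetical data: two well-separated blobs of 50 points each.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

Because the result depends on the random initialization, K-means is usually run several times and the run with the lowest within-cluster sum of squares is kept.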
4.1.1 Elbow Method for Determining Optimal Number of Clusters in K-means
Calculate Within-Cluster Sum of Squares (WSS) for Various k
Perform K-means clustering for each value of k (the number of clusters).
Plot the WSS Values
Create a plot with the number of clusters k on the x-axis and the corresponding WSS on the y-axis.
Choose k at the Elbow Point
The optimal number of clusters k is chosen at the elbow point, where the decrease in WSS levels off. This choice represents a balance between maximizing the number of clusters (to reduce WSS) and keeping the model simple and generalizable (by not having too many clusters).
4.1.2 Silhouette Analysis for Determining Optimal Clusters in K-means
1. Calculate the Silhouette Coefficient for Each Data Point:
• Compute $a(i)$:
$$a(i) = \frac{1}{|S_i| - 1} \sum_{x \in S_i,\, x \neq i} \|x - i\|$$
– Average distance from the $i$-th data point to all other points in the same cluster $S_i$.
• Compute $b(i)$:
$$b(i) = \min_{j \neq i} \frac{1}{|S_j|} \sum_{x \in S_j} \|x - i\|$$
– Smallest average distance from the $i$-th data point to all points in any other cluster, excluding the one to which $i$ belongs.
2. Compute the Silhouette Coefficient for Each Point:
• Silhouette Coefficient:
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
• Measures how similar a data point is within its own cluster compared to other clusters. Ranges from −1 to 1.
3. Interpret the Results:
• High Value: Indicates good matching within its own cluster and poor matching to neighboring clusters.
• Low/Negative Value: Suggests incorrect clustering or too many/few clusters.
• Silhouette Coefficient for a Single Data Point:
– The silhouette coefficient for a single data point is a measure of how similar that point is to points in its own cluster compared to points in other clusters.
– This coefficient helps determine the appropriateness of the clustering.
• Mean Silhouette Coefficient as a Quality Measure:
– The mean of the silhouette coefficient for all points is a measure used to evaluate the quality of clustering in a dataset.
– It provides insight into how well each object lies within its cluster.
• Importance:
– Helps in assessing the separation distance between clusters.
– Distinct clusters lead to better definitions and a higher average silhouette score.
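
The silhouette computation can be sketched directly from the definitions of $a(i)$, $b(i)$, and $S(i)$. The function name `silhouette_scores`, the convention of scoring singleton clusters as 0, and the four-point example are assumptions for illustration; at least two clusters are required.

```python
import numpy as np

def silhouette_scores(X, labels):
    """Return the silhouette coefficient S(i) for every data point."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # assumed convention: S(i) = 0 for a singleton cluster
        # a(i): mean distance to the other points of the same cluster
        # (dist[i, i] = 0, so dividing by n_same - 1 excludes i itself).
        a = dist[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical 1-D example: two tight, well-separated pairs of points.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
scores = silhouette_scores(X, labels)
mean_score = scores.mean()
```

For the first point, $a = 1$ and $b = 10.5$, giving $S = 9.5 / 10.5 \approx 0.905$; values this close to 1 indicate a clean clustering.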
Gap Statistics for Determining Optimal Clusters in K-means
1. Cluster the Data and Compute Within-Cluster Dispersion:
$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$$
Where $D_r$ is the sum of pairwise distances for all points in cluster $r$, and $n_r$ is the number of points in cluster $r$.
2. Generate Reference Data Sets:
• Generate $B$ reference datasets with a random uniform distribution.
• Each dataset should match the original in terms of number of observations and features.
3. Compute the Expected Dispersion for Reference Data:
• Apply K-means clustering to each reference dataset for different $k$ and compute the WSS.
• Expected Dispersion:
$$E^*[\log(W_k)] = \frac{1}{B} \sum_{b=1}^{B} \log(W^*_{kb})$$
where $W^*_{kb}$ is the dispersion of the $b$-th reference dataset clustered into $k$ clusters.
4. Calculate the Gap Statistic:
• Gap Statistic:
$$\mathrm{Gap}(k) = E^*[\log(W_k)] - \log(W_k)$$
The Gap Statistic measures the difference between the expected dispersion and the observed dispersion.
5. Choose Optimal k:
• Select the $k$ where the Gap Statistic reaches its maximum.
• Alternatively, choose the smallest $k$ where the Gap Statistic is within one standard deviation of the Gap Statistic at $k + 1$.
• The method compares clustering results against a random uniform distribution to identify significant clustering structures.
4.2 The Canopy Method
The Canopy Method is a pre-clustering method used in data mining for speeding up clustering operations on large data sets. It involves creating ’canopies’
or rough groupings, followed by more precise clustering algorithms like K-means. This method is particularly effective for large datasets as it reduces
computational costs by limiting the number of distance calculations.
4.3 Gaussian Mixture Models (GMMs)
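A Gaussian Mixture Model is a probabilistic model used to represent normally distributed subpopulations within an overall population, often used in clustering. The density of a single k-dimensional Gaussian component is
\[ f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \]
and the mixture density is
\[ p(x) = \sum_{i=1}^{K} \pi_i \, f(x \mid \mu_i, \Sigma_i) \]
where $\pi_i$ are the mixing coefficients, and $f(x \mid \mu_i, \Sigma_i)$ is the PDF of the i-th Gaussian component.
M-step: Update the parameters: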
4.3.1 Challenges in Gaussian Mixture Models (GMM)
Choosing the Number of Components
It is difficult to determine the optimal number of Gaussian components. Techniques like BIC, AIC, or cross-validation may not always provide clear guidance.
Covariance Structure
Choosing the right covariance structure (spherical, diagonal, tied, or full) is challenging and affects both the model's flexibility and computational complexity.
Sensitivity to Initialization
Results can vary significantly based on the initial choice of parameters. Poor initialization can lead to suboptimal clustering solutions.
High Dimensionality
Performance can degrade in high-dimensional spaces due to sparsity, making it difficult to accurately estimate parameters.
Overfitting
There is a risk of overfitting, especially with a large number of components or overly complex models. This requires careful model validation and possibly regularization.
Convergence to Local Optima
The EM algorithm may converge to a local rather than the global optimum, resulting in a suboptimal solution for the GMM.
4.3.2 Comparison of GMMs and K-means
Overlapping Clusters
GMMs are better suited to handling overlapping clusters than K-means.
Non-Spherical Clusters
GMMs are better equipped to handle clusters that are not spherical in shape than K-means.
Number of Clusters
For GMMs, the number of clusters can be chosen by comparing candidate models with model selection techniques such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), while K-means requires the user to specify the number of clusters a priori.
Shape and Variances
GMMs can model clusters with different shapes and variances, while K-means assumes that the variance of the data within each cluster is the same and that the clusters are spherical in shape.
Computational Demand
GMMs are more computationally intensive than K-means.
Assumption Limitations
GMMs presume that each cluster follows a Gaussian distribution, a condition not always met in real datasets; K-means makes no explicit distributional assumption.
4.4 DBSCAN: Density-Based Spatial Clustering of Applications with Noise
DBSCAN is a popular clustering algorithm used in data analysis, particularly effective for identifying clusters of varying shapes in a dataset with noise
(i.e., outliers).
• Epsilon-Neighborhood of a Point: $N_\varepsilon(p) = \{q \in D \mid \mathrm{dist}(p, q) \leq \varepsilon\}$. Here, $N_\varepsilon(p)$ represents the ε-neighborhood of a point p, consisting of all points q within the dataset D that are within a distance ε from p.
• Core Point: A point p is a core point if its ε-neighborhood contains at least MinPts points, i.e., $|N_\varepsilon(p)| \geq \mathrm{MinPts}$.
4. Iterate: Repeat steps 2 and 3 for each point in the dataset until all points are either assigned to a cluster or marked as noise.
5. Result: The output is a set of clusters with core and border points, and a set of noise points.
Key Properties
• DBSCAN does not require specifying the number of clusters in advance.
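As a small illustration of the ε-neighborhood and core-point definitions, the sketch below checks which points of a toy dataset are core points. The helper names, the data and the parameter values are assumptions chosen for the example; note that, following the definition, a point counts as a member of its own neighborhood.

```python
import math

def eps_neighborhood(data, p, eps):
    """N_eps(p): all points q in the dataset with dist(p, q) <= eps."""
    return [q for q in data if math.dist(p, q) <= eps]

def is_core_point(data, p, eps, min_pts):
    """p is a core point if |N_eps(p)| >= MinPts."""
    return len(eps_neighborhood(data, p, eps)) >= min_pts

# Hypothetical data: three nearby points and one isolated point.
data = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (10.0, 10.0)]
core_flags = [is_core_point(data, p, eps=1.0, min_pts=3) for p in data]
print(core_flags)
```

The isolated point has only itself in its neighborhood, so it is not a core point; a full DBSCAN run would mark it as noise.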
4.5 OPTICS (Ordering Points To Identify the Clustering Structure) Algorithm
Core Concepts
1. Core Distance: For a point p in the dataset, the core distance is the smallest distance such that p is a core point with respect to ε and MinPts.
2. Reachability Distance: The reachability distance of a point o with respect to a point p is
    \[ \text{reach-dist}(o, p) = \max\{\text{core-dist}(p), \mathrm{dist}(o, p)\} \]
  for o, p in the dataset, where dist(o, p) is the Euclidean distance between o and p.
The Algorithm
1. Start: Pick an unprocessed point p from the dataset.
2. Retrieve Neighbors: Find the ε-neighborhood of p and calculate the core distance for p.
• It is particularly useful for datasets where clusters of different densities exist, and the noise level is high.
4.6 Spectral Clustering Algorithms
Spectral clustering algorithms can be broken down into the following key steps:
4.7 Markov Chain Clustering (MCL)
Markov Chain Clustering (MCC), specifically the Markov Cluster Algorithm (MCL), is a process for finding clusters (i.e., groups of related items) in
graphs. It’s based on the idea of random walks on the graph, which can be described using Markov chains. Here’s a basic overview of how the algorithm
works, including the key equations involved:
4.8 Agglomerative Clustering Algorithm
Initialization
Given a dataset $X = \{x_1, x_2, \ldots, x_n\}$, where each $x_i$ is a data point, each data point $x_i$ is initially considered as a separate cluster $C_i$, giving n clusters.
• Complete-linkage:
    \[ d_{\text{complete}}(C_i, C_j) = \max\{\lVert a - b \rVert : a \in C_i, b \in C_j\} \]
• Ward's Method:
    \[ d_{\text{Ward}}(A, B) = \sum_{x \in A \cup B} \lVert x - \mu_C \rVert^2 - \left( \sum_{x \in A} \lVert x - \mu_A \rVert^2 + \sum_{x \in B} \lVert x - \mu_B \rVert^2 \right) \]
  where $\mu_C$ is the centroid of the merged cluster $C = A \cup B$.
Repeat steps 2-5 until only a single cluster remains.
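The two linkage criteria can be sketched directly from their definitions. This is an illustrative sketch only: the helper names (`centroid`, `sse`, `complete_linkage`, `ward`) and the toy clusters are assumptions, and Ward's distance is computed as the increase in within-cluster sum of squares caused by merging A and B.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def sse(points):
    """Sum of squared distances of the points to their centroid."""
    mu = centroid(points)
    return sum(math.dist(p, mu) ** 2 for p in points)

def complete_linkage(A, B):
    """Largest pairwise distance between a point of A and a point of B."""
    return max(math.dist(a, b) for a in A for b in B)

def ward(A, B):
    """Increase in within-cluster sum of squares when A and B are merged."""
    return sse(A + B) - (sse(A) + sse(B))

# Hypothetical clusters.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(4.0, 0.0), (4.0, 2.0)]
d_complete = complete_linkage(A, B)
d_ward = ward(A, B)
print(d_complete, d_ward)
```

At each agglomeration step, the pair of clusters with the smallest value of the chosen criterion would be merged.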
Supervised Machine Learning
Independence
Each observation in the dataset is independent of the others, ensuring that the value of one observation doesn't influence or depend on another.
Homoscedasticity
The variance of the error terms (residuals) is constant across all levels of the independent variables, indicating uniform dispersion of residuals.
Normality of Residuals
The residuals (differences between observed and predicted values) are normally distributed, which is especially crucial for small sample sizes.
No Auto-correlation
The residuals show no auto-correlation. This is particularly important in time series data, where one time period's errors shouldn't influence another's.
No Measurement Error
The independent variables are assumed to be measured without significant error, implying that any measurement error is negligible.
5.2 Ordinary Least Squares (OLS)
Least squares is a mathematical method used to minimize the sum of squared differences between observed data points and model predictions.
Ordinary least squares (OLS) is a specific type of least squares method used in linear regression. It finds the best-fitting linear equation by minimiz-
ing the sum of squared errors between the observed values and the values predicted by the linear model.
Assumptions
• Linearity: The relationship between regressors and the dependent variable is linear.
• Conditional Independence: $E(U \mid X) = 0$, the expectation of the error term, given the regressors, is zero.
• No Multi-collinearity: The matrix X has full rank k, indicating no perfect collinearity among regressors.
• Homoskedasticity: $\mathrm{Var}(U \mid X) = \sigma^2 I_n$, the variance of the error term is constant.
Solution
Derive $\hat{\beta}$ by setting the derivative of $Q(\beta) = (Y - X\beta)'(Y - X\beta)$, the sum of squared residuals, with respect to β to zero:
• Differentiate Q: $-2X'Y + 2X'X\beta = 0$
• Solve for β: $X'X\beta = X'Y$
• Obtain $\hat{\beta}$: $\hat{\beta} = (X'X)^{-1} X'Y$
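The normal equations $X'X\beta = X'Y$ can be checked numerically. The sketch below uses NumPy on simulated data; the sample size, true coefficients and noise level are assumptions made for the illustration. It solves the linear system directly instead of forming $(X'X)^{-1}$ explicitly, which is usually the more stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Design matrix with an intercept column and one simulated regressor.
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])            # assumed "true" coefficients
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X'X beta = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The derivative of Q should vanish at beta_hat.
gradient = -2 * X.T @ Y + 2 * X.T @ X @ beta_hat
print(beta_hat, gradient)
```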
5.2.1 OLS as Projection
The Ordinary Least Squares (OLS) method projects the outcome variable y onto the space spanned by the regressors X, analogous to Ax in linear algebra.
5.2.2 Applying SVD to OLS and Ridge Regression
Standard OLS Formula
The standard OLS formula for the fitted values $X\hat{\beta}$ is:
\[ X\hat{\beta} = X(X'X)^{-1}X'y \]
Ridge Regression Formula
Ridge regression modifies the OLS regression by adding a penalty term on the size of the coefficients. The objective is to minimize the penalized sum of squares. The solution to ridge regression is given by:
\[ \hat{\beta}_{\text{ridge}} = (X'X + \lambda I)^{-1}X'y \]
where $\lambda \geq 0$ is the penalty parameter.
5.2.3 Relationship between CEF and Regression
1. The Conditional Expectation Function (CEF) is defined as E[Y |X], representing the expected value of Y given X.
2. In regression, the dependent variable $Y_i$ is decomposed as $Y_i = E[Y_i \mid X_i] + \epsilon_i$, where $\epsilon_i$ is the error term, orthogonal to $X_i$. The primary goal in regression is to find a function of X, say m(X), that minimizes the mean squared error, $\min_m E[(Y_i - m(X_i))^2]$; the optimal choice for m(X) turns out to be the CEF.
3. In OLS regression, the aim is to linearly approximate the CEF. The regression equation $\beta = \arg\min_b E[(E[Y_i \mid X_i] - X_i' b)^2]$ indicates that minimizing the squared differences between $Y_i$ and $X_i' b$ is equivalent to approximating the CEF linearly.
5.3 Method of Moments
Introduction
The Method of Moments is a statistical technique for estimating the parameters of a probability distribution or a model. This approach compares theoretical moments from a probability distribution, like mean and variance, with empirical moments derived from data.
Understanding Moments
• Definition: Moments are quantitative measures of a function's shape.
• Types of Moments:
  – Raw Moments: The nth raw moment of a random variable x, denoted as $\mu_n$, is $E[x^n]$.
  – Central Moments: The nth central moment, denoted as $\mu'_n$, is $E[(x - \mu)^n]$, where µ is the mean.
5.3.0.1 Method of Moments Estimator
• Theoretical Moment: Calculated for the entire population using its
• The principle: With a sufficiently large and representative sample, the sample moments should be good approximations of the theoretical moments.
Example 1: Estimator for Sample Mean
• Population Moment: $\mu = E[X]$
• Objective: Find an estimator for the population mean µ.
• Process:
  – Sample Analogue: Replace the expected value $E[X]$ with the sample mean $\bar{X}$.
  – Estimator for µ: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}$
Example 2: Normal Distribution
• Given: $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$
• Objective: Find estimators for the parameters µ and $\sigma^2$.
• Process:
  – First Moment (Mean): Population Moment: $E[X] = \mu$, Sample Analogue: $\bar{X}$, Estimator for µ: $\hat{\mu} = \bar{X}$.
  – Second Moment (Variance): Population Moment: $E[X^2] = \mu^2 + \sigma^2$, Expand using µ's estimator, Sample Analogue: $\frac{1}{n} \sum_{i=1}^{n} X_i^2$, Estimator for $\sigma^2$: $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$.
Given that we have two moment conditions but only one parameter to estimate, it's necessary to find a method to effectively 'merge' these conditions. Relying on just one of these conditions would result in underutilizing the available information.
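The normal-distribution example can be reproduced on simulated data. In the sketch below the true parameters (µ = 5, σ = 2), the sample size and the seed are assumptions for illustration; matching the first two sample moments to $E[X] = \mu$ and $E[X^2] = \mu^2 + \sigma^2$ gives the two estimators.

```python
import random

random.seed(0)
# Simulated sample from N(mu=5, sigma^2=4); the parameters are assumed.
sample = [random.gauss(5.0, 2.0) for _ in range(10_000)]
n = len(sample)

m1 = sum(sample) / n                  # first sample moment
m2 = sum(x * x for x in sample) / n   # second sample moment

mu_hat = m1                           # from E[X] = mu
sigma2_hat = m2 - m1 ** 2             # from E[X^2] = mu^2 + sigma^2

# Equivalent form: the mean squared deviation from the sample mean.
sigma2_direct = sum((x - mu_hat) ** 2 for x in sample) / n
print(mu_hat, sigma2_hat, sigma2_direct)
```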
5.3.1 General Framework of Moment Conditions
Moment Conditions
• Moment conditions in regression are expressed as a function $g(X_i, \beta)$.
• $X_i$ represents the observed data, including dependent variables $y_i$, independent variables $X_i$, and any instruments $Z_i$.
• β is a vector of parameters to estimate, with a length of k.
Model Identification
• A model is identified if the solution for β is unique.
• Uniqueness is expressed as $E[g(X_i, \beta)] = 0$ and $E[g(X_i, \hat{\beta})] = 0$ implying $\beta = \hat{\beta}$.
• At least as many restrictions (moment conditions) as parameters (k) are needed to identify the model.
Application in OLS Regression
• OLS Moment Condition: $E[X_i U_i] = 0$ or $E[X_i (y_i - X_i'\beta)] = 0$, where $U_i$ is the error term.
• This condition is used to solve for the OLS estimator $\hat{\beta}$.
Extension to More Complex Models (IV Regression)
• Instrumental Variables (IV) Regression: Applies when l instruments are available for k parameters. The formula below requires exact identification (l = k), so that $Z'X$ is square and invertible.
• IV Estimator Formula: Derived by solving the moment condition, $\hat{\beta}_{IV} = (Z'X)^{-1} Z'y$.
• IV regression uses instruments $Z_i$ to resolve issues in the OLS model.
5.3.2 Instrumental Variables in Regression Models
• Instrumental variables are used in statistical analysis to address endogeneity issues, such as omitted variables that affect both X and Y, where
explanatory variables in a regression model are correlated with the error term.
• They provide a way to estimate causal relationships by using a variable (the instrument) that is correlated with the explanatory variable but not
with the error term.
Example: To assess education’s impact on income without bias from individual ability, use an instrumental variable like proximity to a university.
This approach isolates the influence of education on income, separate from individual ability.
Moment Conditions with Instrumental Variables
• The moment conditions involving instrumental variables can be represented as $g(X_i, \beta) = Z_i (y_i - X_i'\beta)$ or $E[Z_i U_i] = 0$.
• Here, $Z_i$ are the instrumental variables, $y_i$ the dependent variables, $X_i$ the independent variables, β the parameters, and $U_i$ the error terms.
• The condition $E[Z_i U_i] = 0$ implies that the instruments are uncorrelated with the error term, a critical requirement for valid instrumental variables.
• Solving this condition yields the IV estimator $\hat{\beta}_{IV} = (Z'X)^{-1} Z'y$.
Role of Instrumental Variables in Addressing Endogeneity
• IV regression substitutes problematic OLS moments with new ones that incorporate instruments, addressing the endogeneity bias.
• Instruments $Z_i$ are selected for their correlation with endogenous independent variables and their lack of correlation with the error term.
• This substitution allows the IV estimator to isolate the variation in the explanatory variable that is unaffected by endogeneity.
Condition of Perfect Identification
• For a model to be perfectly identified, the number of instruments (l) should be equal to the number of parameters (k).
• The IV approach extends beyond simple linear models and can be applied to more complex regression models where standard OLS assumptions do not hold.
• It is particularly useful in models where endogeneity is a concern and the model's identification relies on the validity of the instruments.
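A simulated example can make the contrast between the OLS and IV estimators concrete. Everything in the sketch below (the data-generating process, the confounder named `ability`, the coefficient values and the seed) is an assumption for illustration; the model is exactly identified, with one instrument plus a constant for two parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
z = rng.normal(size=n)                  # instrument: moves x, unrelated to u
ability = rng.normal(size=n)            # unobserved confounder
x = 0.8 * z + ability + rng.normal(size=n)
u = ability + rng.normal(size=n)        # error term, correlated with x
y = 1.0 + 2.0 * x + u                   # assumed true slope is 2

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y, biased here
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)    # (Z'X)^{-1} Z'y
print(beta_ols, beta_iv)
```

Because x and the error share the confounder, the OLS slope is pulled above 2, while the IV slope stays close to the assumed true value.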
5.3.3 Generalized Method of Moments (GMM)
General Concept of GMM
• Overidentified Models:
  – In scenarios where the number of restrictions (l) is greater than the number of parameters to estimate (k), i.e., l > k, the model is said to be overidentified.
  – In such cases, traditional methods like Ordinary Least Squares (OLS) or Instrumental Variable (IV) regression cannot be directly applied to estimate the parameter vector β.
• Combining Restrictions:
  – GMM seeks to find an estimate of β that brings the sample moments as close to zero as possible. This involves combining multiple moment conditions in an optimal way.
  – The moment conditions for all restrictions are still equal to zero, but the sample approximations may not be exactly zero due to finite sample sizes.
GMM Estimation
The GMM estimator, $\hat{\beta}_{GMM}$, is defined as the value of β that minimizes the weighted distance of $\sum_{i=1}^{n} g(X_i, \beta)$, where $g(X_i, \beta)$ is a vector of functions
GMM Formula for Linear Regression Models
In the context of linear regression models that are overidentified, the general GMM formula is given by:
\[ \hat{\beta}_{GMM} = \left( (X'Z) W (Z'X) \right)^{-1} (X'Z) W (Z'y) \]
Here:
• X and Z represent the observed data and instruments, respectively.
• W is again the weighting matrix.
• $X'$ and $Z'$ are transposes of X and Z respectively.
Optimal Choice of Weighting Matrix
• The choice of the weighting matrix W is crucial in GMM. For instance, when $W = (Z'Z)^{-1}$, the GMM estimator becomes equivalent to the Instrumental Variable (IV) estimator.
• The optimal choice of W depends on the specifics of the model and the nature of the data.
5.4 Maximum Likelihood
5.4.1 Maximum Likelihood Estimation (MLE)
MLE is a statistical method used to estimate the parameters of a model, aiming to find the parameter values that make the observed data most probable.
• Maximizing $\ln L(\theta)$ is equivalent to maximizing $L(\theta)$ as the logarithm is monotonic.
• Maximizing this log-likelihood with respect to µ yields the MLE for the mean of a normal distribution.
5.4.2 OLS Estimator using Maximum Likelihood
Model Specification
Start with the linear regression model:
\[ y_i = x_i'\beta + u_i \]
where $y_i$ is the dependent variable, $x_i$ is the vector of independent variables, β is the vector of coefficients, and $u_i$ is the error term.
Log-Likelihood Function
Convert the likelihood to log-likelihood for simplification:
\[ \ln L(\beta, \sigma^2) = \sum_{i=1}^{n} \ln f(y_i, x_i; \beta, \sigma^2) \]
Generalized Linear Models (GLMs)
Generalized Linear Models (GLMs) are an advanced form of linear regression models, characterized by their ability to handle a variety of response variable
distributions and to establish a distinct relationship between response and predictor variables. The essence of GLMs lies in three core components:
1. Random Component: This defines the probability distribution of the response variable, Y . In GLMs, Y is assumed to follow a distribution from
the exponential family, such as Normal, Binomial, or Poisson distributions.
2. Systematic Component: Represents the explanatory (independent) variables, X1 , X2 , . . . , Xn , and their linear combination, often denoted as η.
The equation for this component is:
η = β0 + β1 X1 + β2 X2 + . . . + βn Xn
Here, β0 , β1 , . . . , βn are the model’s parameters (coefficients).
3. Link Function: Denoted as g(), this function connects the systematic component to the expected value of the response variable. It ensures that
the model accommodates the distribution type of the response variable. The relationship is described as:
g(E(Y )) = η
Full GLM
The GLM equation combines these components as:
\[ E(Y) = g^{-1}(\eta) = g^{-1}(\beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n) \]
Here, $g^{-1}()$ is the inverse of the link function, transforming the linear predictor η back to the response variable's scale.
• Normal Distribution: Utilized in standard linear regression where the response variable can take any continuous value. In this simplest form of
GLM, often equivalent to ordinary least squares regression, the model is:
Y = β0 + β1 X + ϵ
• Binomial Distribution: Employed in logistic regression, suitable for binary outcomes such as success/failure or yes/no. The logistic model predicts
the probability of occurrence of an event and is expressed as:
\[ \log\left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X \]
where p represents the probability of one of the binary outcomes.
• Poisson Distribution: Applied in scenarios where the response variable represents count data, typically non-negative integers, such as the number
of occurrences of an event. Poisson regression, used for modeling count data, is formulated as:
log(E(Y )) = β0 + β1 X
• Gamma Distribution: Often used in situations where the response variable is positively skewed and continuous, such as for modeling time-to-event
data (e.g., survival times). The Gamma distribution in GLMs typically uses the inverse or log link function. The model can be expressed as:
g(E(Y )) = β0 + β1 X1 + . . . + βn Xn
where E(Y ) is the expected value of the response variable, Y , which follows a Gamma distribution. The link function g() could be the inverse link
g(E(Y )) = 1/E(Y ) or the logarithmic link g(E(Y )) = log(E(Y )).
• Multinomial Distribution: Used when the response variable can take on more than two categories, as in the case of multinomial logistic regression.
This model is an extension of logistic regression to multiple categories. The Multinomial distribution in GLMs can be represented by a set of equations,
one for each category. For a response variable with k categories, the model can be:
\[ \log\left( \frac{p_i}{p_k} \right) = \beta_{i0} + \beta_{i1} X_1 + \ldots + \beta_{in} X_n \]
for i = 1, 2, . . . , k − 1, where pi is the probability of the ith category. The reference category k serves as the baseline, and the logit link is used.
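The relationship $g(E(Y)) = \eta$ can be illustrated by applying the inverse of each link function to the same linear predictor. The sketch below is a toy illustration; the coefficient values, the inputs and the dictionary `inverse_links` are assumptions, and no model fitting is performed.

```python
import math

def linear_predictor(beta, x):
    """eta = beta_0 + beta_1 x_1 + ... + beta_n x_n"""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

# Inverse link functions g^{-1}, mapping eta back to the scale of E(Y).
inverse_links = {
    "identity": lambda eta: eta,                         # Normal
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),   # Binomial
    "log": lambda eta: math.exp(eta),                    # Poisson
    "inverse": lambda eta: 1.0 / eta,                    # Gamma (inverse link)
}

beta = [0.5, 1.0, -2.0]   # hypothetical coefficients
x = [1.5, 0.5]            # hypothetical predictor values
eta = linear_predictor(beta, x)
means = {name: g_inv(eta) for name, g_inv in inverse_links.items()}
print(eta, means)
```

The same η yields a probability under the logit link and a positive rate under the log link, which is how the link function adapts the linear predictor to the response distribution.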
Bernoulli Trial Estimator using Maximum Likelihood
Problem Setup
Consider a dataset consisting of results from a series of coin flips, where we aim to estimate the probability of the coin landing heads.
Bernoulli Distribution
Assume that each coin flip is an independent Bernoulli trial with probability p of landing heads.
Probability Mass Function (PMF)
The PMF for a single observation $x_i$ (taking the value of 0 or 1) is given by:
\[ f(x_i; p) = p^{x_i} (1 - p)^{1 - x_i} \]
Log-Likelihood Function
The log-likelihood function is:
\[ \ln L(p) = \sum_{i=1}^{n} \ln f(x_i; p) = \sum_{i=1}^{n} \left( x_i \ln p + (1 - x_i) \ln(1 - p) \right) \]
Maximizing Log-Likelihood
To find the maximum likelihood estimator for p, take the derivative of the log-likelihood function with respect to p and set it to zero:
\[ \frac{d \ln L(p)}{dp} = \sum_{i=1}^{n} \left( \frac{x_i}{p} - \frac{1 - x_i}{1 - p} \right) = 0 \quad \Rightarrow \quad \hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
which is the sample mean of the observed values.
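The closed-form result can be cross-checked numerically by maximizing the log-likelihood over a grid of candidate values of p. The coin-flip data and the grid resolution below are assumptions made for the example.

```python
import math

flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical data: 1 = heads

def log_likelihood(p, xs):
    """ln L(p) = sum of x ln p + (1 - x) ln(1 - p)."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

# Closed-form MLE: the sample mean.
p_hat = sum(flips) / len(flips)

# Numerical check: the best p on a grid of 0.01, 0.02, ..., 0.99.
grid = [k / 100 for k in range(1, 100)]
p_grid = max(grid, key=lambda p: log_likelihood(p, flips))
print(p_hat, p_grid)
```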
5.5 Logistic Regression
Linear Regression
• Used for predicting a continuous outcome.
• $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \epsilon$
  – y: Dependent variable (the outcome to predict).
  – $\beta_0, \beta_1, \ldots, \beta_n$: Coefficients (weights) of the model.
  – $x_1, x_2, \ldots, x_n$: Independent variables (predictors).
  – ϵ: Error term.
• Example Usage: Predicting house prices based on features like size, location, number of bedrooms, etc.
Logistic Regression
• Used for binary classification (predicting a categorical outcome with two classes).
• $\ln\left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$
  – p: Probability of the dependent variable being in one of the two classes.
  – Other terms are similar to linear regression.
• Example Usage: Predicting whether an email is spam or not based on characteristics like the frequency of certain words, sender, etc.
Assumption | Linear Regression | Logistic Regression
Nature of Dependent Variable | Assumes the dependent variable is continuous and normally distributed. | Assumes the dependent variable is categorical, typically binary.
Relationship Between Variables | Assumes a linear relationship between the dependent and independent variables. | Assumes a linear relationship between the log-odds (logit) of the dependent variable and the independent variables.
Distribution of Errors/Residuals | Assumes that the residuals (errors) are normally distributed. | Does not assume a normal distribution of residuals.
Homoscedasticity | Assumes homoscedasticity, meaning the variance around the regression line is the same for all values of the predictor variable. | This assumption is not applicable, as it deals with categorical outcomes.
Independence of Errors | Assumes no autocorrelation in the residuals. The errors are independent of each other. | Also assumes independence of errors (no autocorrelation).
Multicollinearity | Assumes little or no multicollinearity among the independent variables. | Similar to linear regression, it assumes little or no multicollinearity.
Sample Size | Can often work with smaller sample sizes, though the exact size depends on the number of predictors. | Typically requires a larger sample size, especially if the outcome is rare.
5.6 Generalized Linear Models (GLMs)
Generalized Linear Models (GLMs) are an advanced form of linear regression models, characterized by their ability to handle a variety of response variable
distributions and to establish a distinct relationship between response and predictor variables. The essence of GLMs lies in three core components:
1. Random Component: This defines the probability distribution of the response variable, Y. In GLMs, Y is assumed to follow a distribution from the exponential family, such as Normal, Binomial, or Poisson distributions.
2. Systematic Component: Represents the explanatory (independent) variables, $X_1, X_2, \ldots, X_n$, and their linear combination, often denoted as η. The equation for this component is:
\[ \eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n \]
• Binomial Distribution: Employed in logistic regression, suitable for binary outcomes such as success/failure or yes/no. The logistic model predicts the probability of occurrence of an event and is expressed as:
\[ \log\left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 X \]
where p represents the probability of one of the binary outcomes.
• Each data point in an SVM is represented as a point in a D-dimensional space (where D is the number of features), with each feature being a particular coordinate. A data point $x_i$ is thus a feature vector $x_i \in \mathbb{R}^D$.
Hyperplane
• The hyperplane equation, the decision boundary, is given by $w^T x + b = 0$, where w is the weight vector, b is the bias term, and x is the input feature vector.
• A point x is classified by the sign of $f(x) = w^T x + b$: if $f(x) \geq 0$, into one class (positive class); if $f(x) < 0$, into the other class (negative class).
• The SVM aims to maximize the margin, the distance between the hyperplane and the nearest data points of each class (support vectors), given by $\frac{2}{\lVert w \rVert}$.
• The optimization problem is to maximize $\frac{2}{\lVert w \rVert}$ subject to $y_i (w^T x_i + b) \geq 1$ for all i, where $y_i$ is the label of $x_i$.
• With slack variables $\xi_i$, the soft-margin problem is to minimize $\frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i$ subject to $y_i (w^T x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for all i, where C is the regularization parameter.
Kernel Trick
• For non-linearly separable data, SVMs use kernel functions to transform the data into a higher dimension for linear separability.
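The decision rule and the margin can be illustrated for a fixed hyperplane. In the sketch below, the weight vector, the bias and the two labelled samples are hypothetical values chosen by hand, not the result of training an SVM.

```python
import math

w = [2.0, -1.0]   # hypothetical weight vector
b = -1.0          # hypothetical bias term

def f(x):
    """Decision function f(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    """Positive class if f(x) >= 0, negative class otherwise."""
    return 1 if f(x) >= 0 else -1

# Margin width 2 / ||w|| for this hyperplane.
margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hard-margin constraints y_i (w^T x_i + b) >= 1 on two labelled points.
samples = [((2.0, 1.0), 1), ((0.0, 1.0), -1)]
constraints_ok = all(y * f(x) >= 1 for x, y in samples)
print(margin, [classify(x) for x, _ in samples], constraints_ok)
```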
5.7.1 Lagrangian SVMs
The Lagrangian in SVMs combines the objective (minimizing $\frac{1}{2} \lVert w \rVert^2$) with the constraints (data points correctly classified):
\[ L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] \]
Here, $\alpha_i$ are Lagrange multipliers for each constraint, ensuring each data point $x_i$ is on the correct side of the margin.
Dual Problem Formulation
• Transformation to Dual Problem: By solving for w and substituting it back into the Lagrangian, we obtain the dual problem, which only involves the Lagrangian multipliers. This dual problem is easier to solve as it typically has fewer dimensions than the original problem. In terms of α:
\[ L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \]
• The dual form is computationally efficient, especially for large feature spaces or kernel methods.
5.7.2 Comparison Between OLS and SVM
Comparing the approaches of Ordinary Least Squares (OLS) and Support Vector Machines (SVM) in various aspects.
• SVM: Minimizes the l2-norm of the coefficient vector. The resulting solution is sparse in the dual: only the support vectors have nonzero Lagrange multipliers.
• SVM: Manages error using constraints with a predetermined margin (ϵ). Absolute errors are constrained to be less than or equal to ϵ, with deviations denoted as ξ.
• SVM: Less sensitive due to the margin concept (ϵ). Outliers have limited impact on the model.
• SVM: Can model non-linear relationships effectively using kernel functions for higher-dimensional mapping.
• SVM: Fewer assumptions; does not require linearity, normality, homoscedasticity, or independence, offering robustness in diverse scenarios.
• SVM: Performs better on small datasets due to the margin (ϵ) and regularization, preventing overfitting.
5.8 Decision Tree Algorithms
5.8.1 ID3
1. Start at the Root Node:
  • Begin with the entire training set as the root.
2. Selecting the Best Attribute:
  • For each attribute A in the dataset, the ID3 algorithm calculates an attribute's effectiveness in classifying the training data.
  • The measure used for this purpose is the Information Gain, which is based on the concept of Entropy.
3. Calculate Entropy:
  • Entropy is a measure of the randomness or uncertainty in the data.
  • The entropy of the entire dataset S is given by:
    \[ H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \]
4. Calculate Information Gain:
    \[ IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) \]
  where Values(A) are the different values of attribute A, $S_v$ is the subset of S for which attribute A has value v, and $H(S_v)$ is the entropy of subset $S_v$.
5. Choose the Attribute with the Maximum Information Gain:
  • The attribute with the highest Information Gain is chosen as the decision node.
6. Split the Dataset:
  • The dataset is then split by the chosen attribute to produce subsets of the dataset.
7. Recursion:
  • The ID3 algorithm is then recursively applied to each subset with the remaining attributes.
8. Termination Conditions:
  • Recursion stops when all instances in a node are of the same class, there are no attributes left, or the subset is too small.
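The entropy and information gain computations at the heart of ID3 can be sketched as follows. The tiny dataset, the attribute names and the helper names are assumptions made for the example; only the attribute-selection step is shown, not the full recursive tree construction.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p_i log2 p_i over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """IG(S, A) = H(S) - sum over v of |S_v|/|S| * H(S_v)."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical training set: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain", "windy": False},
    {"outlook": "rain", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
best = max(gains, key=gains.get)   # attribute chosen as the decision node
print(gains, best)
```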
5.8.2 Comparison of ID3 and C4.5 Algorithms
• Start with the Entire Dataset:
  – ID3 and C4.5: Both start with the full dataset as the root of the tree.
• Choose the Best Attribute:
  – ID3: Selects the attribute with the highest information gain.
  – C4.5: Also considers the highest information gain, but normalizes it.
• Calculate Entropy:
  – ID3 and C4.5: Both use the same entropy formula.
• Calculate and Normalize Information Gain:
  – ID3: $IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$
  – C4.5: Uses the same as ID3, followed by normalization (Gain Ratio):
    \[ GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)} \]
    with
    \[ SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \]
• Handling Continuous Attributes:
  – ID3: Does not handle continuous attributes.
  – C4.5: Handles continuous attributes by finding an optimal threshold.
• Pruning:
  – ID3: No pruning mechanism.
  – C4.5: Prunes trees to avoid overfitting.
• Recursive Splitting:
  – ID3 and C4.5: Apply the process recursively using the remaining attributes.
• Termination Conditions:
  – ID3 and C4.5: Recursion stops when all instances in a node are of the same class, there are no attributes left, or the subset is too small.
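The effect of the Gain Ratio normalization can be seen on a toy example in which an identifier-like attribute splits the data into singletons. The data and function names below are assumptions for illustration, and returning 0 when SplitInfo is zero is a convention chosen for this sketch.

```python
import math
from collections import Counter

def entropy(values):
    """-sum p log2 p over the relative frequencies of the values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(attr_values, labels):
    """GainRatio(S, A) = IG(S, A) / SplitInfo(S, A)."""
    n = len(labels)
    subsets = {}
    for v, label in zip(attr_values, labels):
        subsets.setdefault(v, []).append(label)
    ig = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())
    # SplitInfo is the entropy of the attribute's own value distribution.
    split_info = entropy(attr_values)
    return ig / split_info if split_info > 0 else 0.0

labels = ["no", "no", "yes", "yes"]
binary_attr = ["a", "a", "b", "b"]     # hypothetical two-valued attribute
id_attr = ["r1", "r2", "r3", "r4"]     # hypothetical unique-ID attribute

gr_binary = gain_ratio(binary_attr, labels)
gr_id = gain_ratio(id_attr, labels)
print(gr_binary, gr_id)
```

Both attributes have the same information gain here, but the gain ratio halves the score of the identifier-like attribute, which is the bias correction C4.5 aims for.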
Features of C5.0 Algorithm
• Efficiency Improvements:
  – C5.0 is more memory efficient and faster than C4.5.
  – It can handle larger datasets more effectively.
• Handling Continuous and Categorical Attributes:
  – Similar to C4.5, C5.0 can handle both continuous and categorical attributes.
5.8.3 Decision Tree Pruning Methods
Pre-Pruning (Early Stopping)
• Stop growing the tree earlier, before it perfectly classifies the training data.
• Set a maximum depth, minimum number of samples per leaf, or a minimum improvement in the impurity measure.
Post-Pruning
• Grow the tree fully, then remove nodes that do not provide significant predictive power.
  – Cost-complexity pruning:
    \[ R_\alpha(T) = R(T) + \alpha \times |leaves| \]
  – R(T): Misclassification rate of the tree T.
  – α: Complexity parameter.
  – |leaves|: Number of leaves in the tree.
    ∗ Minimize $R_\alpha(T)$. Increasing α leads to simpler trees.
Pruning reduces the complexity of the final model, improving its generalizability and interpretability.
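The trade-off controlled by α can be illustrated by scoring a few candidate subtrees with $R_\alpha(T)$. The candidate trees, their error rates and their leaf counts below are made-up numbers for the example, not the output of a real pruning procedure.

```python
def cost_complexity(misclassification_rate, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |leaves|."""
    return misclassification_rate + alpha * n_leaves

# Hypothetical candidate subtrees: name -> (R(T), number of leaves).
candidates = {
    "full": (0.02, 20),
    "medium": (0.05, 6),
    "stump": (0.20, 2),
}

def best_subtree(alpha):
    """Return the candidate that minimizes R_alpha(T)."""
    return min(candidates, key=lambda name: cost_complexity(*candidates[name], alpha))

choices = [best_subtree(alpha) for alpha in (0.0, 0.01, 0.05)]
print(choices)
```

As α grows, the penalty on the number of leaves dominates and the preferred tree becomes smaller.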
5.8.4 Decision Tree Splitting Criteria
Gini Impurity
\[ Gini(S) = 1 - \sum_{i=1}^{n} (p_i)^2 \]
• S: Dataset or subset.
• $p_i$: Proportion of instances in class i within S.
• n: Number of classes.
Gini impurity measures the likelihood of incorrect classification if you randomly pick an instance and classify it according to the distribution of classes in the subset. A Gini score of 0 indicates perfect purity.
Entropy:
\[ H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i \]
Information Gain:
\[ IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) \]
• H(S): Entropy of set S.
• $p_i$: Probability of an item in S belonging to class i.
• A: Attribute for splitting.
• Values(A): All possible values for attribute A.
Chi-Squared Statistic
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
• O: Observed frequency.
• E: Expected frequency under independence.
The Chi-squared test assesses the independence between the splitting attribute and the target variable. A high value indicates a significant association, suggesting a beneficial split.
Reduction in Variance
\[ VarianceReduction = TotalVariance - WeightedVariance \]
• Variance is calculated as the average squared deviation from the mean of the target variable.
Used in regression problems to find splits that reduce the variance of the target variable, indicating more homogeneity.
Information Gain Ratio
5.9 Ensemble Models
Bagging (Bootstrap Aggregating)
Bagging, short for Bootstrap Aggregating, is an ensemble machine learning algorithm designed to improve the stability and accuracy of machine learning algorithms. It involves creating multiple versions of a predictor model and using these to get an aggregated predictor. The method works by randomly selecting subsets of the training set with replacement, training a model on each, and then combining their predictions. This approach is particularly effective in reducing variance and avoiding overfitting.
Boosting
Boosting is an ensemble machine learning technique used to enhance the performance of predictive models. It builds a sequence of models in a way that each subsequent model attempts to correct the errors of the previous one. The final prediction is typically a weighted sum of the individual models. Boosting is especially known for increasing accuracy in both classification and regression tasks, often significantly reducing bias and variance compared to single models.
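The bagging procedure described above (bootstrap samples, one model per sample, aggregated prediction) can be sketched in a few lines. The 1-nearest-neighbour base learner, the toy data, and all names here are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter

def fit_1nn(sample):
    """Base learner (assumed for illustration): 1-nearest-neighbour over (x, label) pairs."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging_fit(data, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(fit_1nn(sample))
    return models

def bagging_predict(models, x):
    # Aggregate the individual predictions by majority vote.
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

data = [(0.0, "a"), (0.5, "a"), (1.0, "a"), (4.0, "b"), (4.5, "b"), (5.0, "b")]
models = bagging_fit(data)
print(bagging_predict(models, 0.2), bagging_predict(models, 4.8))
```

For regression the same structure applies, with the majority vote replaced by an average of the individual predictions.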
5.9.1 Multivariate Adaptive Regression Splines (MARS)
Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression technique that extends linear models by incorporating automatic variable
selection and nonlinear relationships. It models relationships by fitting piecewise linear regressions, creating ’splines’ that adjust to different ranges of the
data. MARS is particularly effective for high-dimensional data and can capture complex patterns without requiring a pre-specified functional form.
Base Functions
• Composition: Piecewise linear "basis" or "base" functions.
• Predicted Value: $\hat{y}(x) = \beta_0 + \sum_{j=1}^{J} \beta_j h(x, c_j, s_j)$, where $\beta_0$ is the intercept, $\beta_j$ are coefficients, and $J$ is the number of basis functions.
Forward Pass
• Begins by selecting base function pairs that minimize the residual sum of squares (RSS).
where yi are the actual values, ŷ(xi ) are the predicted values, N is the number of observations, and h is the effective number of parameters.
Stacking
Stacking, short for stacked generalization, is an ensemble machine learning algorithm. It involves combining multiple predictive models to generate a new model, typically resulting in improved prediction accuracy. In stacking, different algorithms are trained on the same dataset and their predictions are used as inputs to a final 'meta-model', which makes the ultimate prediction. This technique leverages the strengths of each individual model, thereby reducing the risk of choosing a suboptimal algorithm.
• Base Model Training:
– Train base models $f_1, f_2, \dots, f_M$ on training data.
– Train meta-model $g$ on the meta-features.
$$f_{\text{stacking}}(x) = g(P_1(x), P_2(x), \dots, P_M(x))$$
where $P_m(x)$ is the prediction of base model $f_m$, used as a meta-feature.
Voting Ensemble
Voting Ensemble is a machine learning technique that combines the predictions from multiple models. It involves creating multiple different models on the same dataset and using a majority vote (for classification) or average (for regression) of their predictions as the final prediction. This approach is beneficial for improving model performance and robustness, as it reduces the likelihood of an unfortunate selection of a poorly performing model. Voting can be 'hard', using a strict majority vote, or 'soft', where probabilities are averaged.
• Model Selection:
– Choose diverse machine learning models.
• Model Training:
– Train each model independently on the same dataset.
• Voting Mechanism:
– Final prediction based on aggregated votes or probabilities.
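To make the difference between hard and soft voting concrete, here is a minimal sketch. The three probability dictionaries stand in for the outputs of hypothetical trained classifiers; no real models are involved.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Majority vote over the class labels predicted by each model."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average the per-class probabilities across models, then take the argmax."""
    classes = list(probabilities[0])
    avg = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(avg, key=avg.get), avg

# Hypothetical class probabilities from three models for one input.
probs = [
    {"spam": 0.90, "ham": 0.10},
    {"spam": 0.40, "ham": 0.60},
    {"spam": 0.45, "ham": 0.55},
]
labels = [max(p, key=p.get) for p in probs]

print(hard_vote(labels))
print(soft_vote(probs))
```

In this toy case two of three models vote "ham", so hard voting returns "ham", while the single confident "spam" prediction dominates the averaged probabilities and soft voting returns "spam".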
Random Subspace Method (RSM)
The Random Subspace Method (RSM) is a machine learning technique for improving model accuracy and robustness. It involves training each model on a different random subset of features of the dataset, rather than on the complete feature set. This approach, also known as feature bagging, helps in reducing the correlation among the models in an ensemble, leading to better generalization and reduced overfitting, especially in cases with high-dimensional data. RSM is particularly effective in combination with decision trees and other algorithms sensitive to feature selection.
• Feature Subsampling:
– From a dataset with $P$ features, randomly select $k$ features ($k < P$) for each model.
• Aggregation:
– Aggregate predictions using averaging or majority voting.
• Model formulation:
– For classification:
$$\text{Prediction}_{\text{RSM}} = \mathrm{mode}\{\text{prediction}_1, \text{prediction}_2, \dots, \text{prediction}_M\}$$
– For regression:
Average predictions from all models.
Mixture of Experts (MoE)
Mixture of Experts (MoE) is an ensemble machine learning technique that divides a complex problem into simpler sub-problems, solved by specialized models called experts. Each expert is trained on a different segment of the data or task, and a gating network determines the weight or influence of each expert's output in the final prediction. MoE effectively combines the outputs of various models, making it well-suited for tasks where different regions of the input space require different types of modeling or expertise.
• Division of Problem Space:
– Divide the problem space into regions for different expert models.
• Training of Experts:
– Train a gating network to select the appropriate expert for each input.
• Aggregation:
– Combine outputs from experts based on gating network weights.
• Model formulation:
– $f_{\text{MoE}}(x) = \sum_{i=1}^{M} g_i(x) \cdot f_i(x)$, where $g_i(x)$ is the gating network's weight for the $i$-th expert.
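The MoE formulation, a gate-weighted sum of expert outputs, can be sketched as follows. The experts and gate scores are hand-written placeholder functions (assumptions, not trained models), and the softmax gate is one common choice rather than something the notes prescribe.

```python
from math import exp

def softmax(scores):
    """Turn raw gate scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_predict(x, experts, gate_scores):
    """f_MoE(x) = sum_i g_i(x) * f_i(x), with g(x) = softmax of the gate scores."""
    weights = softmax([score(x) for score in gate_scores])
    prediction = sum(w * f(x) for w, f in zip(weights, experts))
    return prediction, weights

# Two hypothetical experts: one intended for x < 0, one for x >= 0.
experts = [lambda x: -x, lambda x: 2.0 * x]
# Hypothetical gate scores favouring expert 0 for negative x and expert 1 for positive x.
gate_scores = [lambda x: -5.0 * x, lambda x: 5.0 * x]

y, w = moe_predict(2.0, experts, gate_scores)
print(y, w)
```

At x = 2 the gate puts almost all weight on the second expert, so the output is close to that expert's prediction of 4; at x = 0 both experts are weighted equally.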
5.9.2 Ensemble Models Comparison
5.9.3 AdaBoost
AdaBoost, short for Adaptive Boosting, is an ensemble machine learning algorithm used primarily for classification tasks. It works by combining multiple
weak learners, typically simple decision trees, to create a strong classifier. In AdaBoost, each subsequent model focuses more on the instances that were
incorrectly predicted by previous models, as these receive increased weight. The final prediction is a weighted sum of the predictions from all learners.
AdaBoost is known for its effectiveness in boosting the performance of simple models and its ease of implementation.
5.9.4 Gradient Boosting
Gradient Boosting is an ensemble machine learning technique used for both classification and regression tasks. It builds the model in a stage-wise fashion, with each new model being trained to correct the errors made by the previous ones. The method uses the gradient descent algorithm to minimize the loss when adding new models. Each tree in the ensemble is fit on a modified version of the original dataset. Gradient Boosting is known for its high effectiveness, particularly in situations where data is unbalanced.
For each iteration $m$ from 1 to $M$:
1. Compute the residuals $r_{im}$ for each training instance $i$, which are the negative gradients of the loss function with respect to the prediction.
2. Fit a weak learner $h_m(x)$ to these residuals.
3. Find the multiplier $\gamma_m$ that minimizes the loss when added to the current model:
$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))$$
• The final model is the sum of the initial model and all the weak learners' contributions:
$$F(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x)$$
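The three steps above can be sketched for the special case of squared-error loss, where the negative gradient is simply the residual and the line search for the multiplier has a closed form. The one-split regression stump, the constant initial model, and the toy data are assumptions made for this illustration.

```python
def fit_stump(xs, residuals):
    """Weak learner: one-split regression stump (needs at least two distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def gradient_boost(xs, ys, n_rounds=20):
    f0 = sum(ys) / len(ys)                 # F_0: constant initial model
    learners = []                          # pairs (gamma_m, h_m)
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        # Step 1: residuals = negative gradient of 1/2 * (y - F)^2 with respect to F.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Step 2: fit a weak learner to the residuals.
        h = fit_stump(xs, residuals)
        hx = [h(x) for x in xs]
        # Step 3: line search for gamma; closed form under squared loss.
        denom = sum(v * v for v in hx)
        if denom == 0:
            break
        gamma = sum(r * v for r, v in zip(residuals, hx)) / denom
        learners.append((gamma, h))
        preds = [p + gamma * v for p, v in zip(preds, hx)]
    return lambda x: f0 + sum(g * h(x) for g, h in learners)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
baseline = lambda x: sum(ys) / len(ys)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

For other losses the residual computation and the line search would change, while the overall loop stays the same.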
| Aspect | Bagging | Boosting | Stacking | Random Subspace Method | MoE | Blending |
|---|---|---|---|---|---|---|
| Model Training | Independent, parallel training | Sequential, focuses on errors of previous models | Independent base models, followed by a meta-model | Independent, parallel training on feature subsets | Train multiple experts and a gating model | Train models independently, combine using a hold-out set |
| Data Handling | Bootstrap samples (subsets of data) | Full dataset with adjusted weights for training samples | Full dataset for base models, meta-model on outputs | Full dataset with subsets of features | Full dataset, gating model directs to experts | Full dataset, split into training and validation sets |
| Model Type | Similar types, e.g., decision trees | Typically similar types | Diverse model types | Similar types, different feature subsets | Diverse, specialized models | Typically diverse models |
| Impact on Bias | Reduces model-specific bias | Focuses on reducing bias through error correction | Varies based on model selection | Reduced by feature diversity | Depends on the expertise of individual models | Optimized during the blending process |
| Impact on Variance | Averaging reduces variance | Can increase if overfitting occurs | Depends on the variance of base and meta-models | Mitigated by diversity in feature subsets | Varies, complex models might increase variance | Controlled by validation set |
| Computational Complexity | Generally lower, parallelizable | Higher due to sequential training | Potentially high due to two levels of training | Similar to Bagging, manageable | Potentially high, complex training | Moderate, depending on complexity of models |
| Overfitting Risk | Lower due to averaging/majority voting | Higher if not carefully tuned | Depends on base and meta-model complexity | Reduced due to feature diversity | Varies, requires careful design | Lower, due to use of validation set for final model |
| Applicability | High-variance models, e.g., decision trees | Improving weak models, high-bias situations | Leveraging strengths of different models | High-dimensional feature spaces | Complex problems with diverse data characteristics | Problems where a simpler model can combine predictions |
Evaluation
6.1.2 The ROC Curve in Binary Classification
• The ROC curve graphically evaluates binary classification models.
• Threshold Levels:
6.1.3 Accuracy Metrics
• Prevalence: $\frac{\sum \text{Condition Positive}}{\sum \text{Total Population}}$
• Positive Predictive Value (PPV) or Precision: $\frac{\sum \text{True Positive}}{\sum \text{Predicted Condition Positive}}$
• False Omission Rate (FOR): $\frac{\sum \text{False Negative}}{\sum \text{Predicted Condition Negative}}$
• Accuracy (ACC): $\frac{\sum \text{True Positive} + \sum \text{True Negative}}{\sum \text{Total Population}}$
• False Discovery Rate (FDR): $\frac{\sum \text{False Positive}}{\sum \text{Predicted Condition Positive}}$
• Negative Predictive Value (NPV): $\frac{\sum \text{True Negative}}{\sum \text{Predicted Condition Negative}}$
• True Positive Rate (TPR) or Recall or Sensitivity: $\frac{\sum \text{True Positive}}{\sum \text{Condition Positive}}$
• False Positive Rate (FPR) or Fall-out: $\frac{\sum \text{False Positive}}{\sum \text{Condition Negative}}$
• False Negative Rate (FNR) or Miss Rate: $\frac{\sum \text{False Negative}}{\sum \text{Condition Positive}}$
• Specificity (SPC) or Selectivity or TNR: $\frac{\sum \text{True Negative}}{\sum \text{Condition Negative}}$
• F1 Score: $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
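As a worked example of the definitions above, the sketch below derives each metric from the four confusion-matrix counts. The function name and the counts are illustrative assumptions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the listed metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "prevalence": (tp + fn) / total,
        "accuracy": (tp + tn) / total,
        "precision": precision,           # PPV
        "fdr": fp / (tp + fp),
        "for": fn / (fn + tn),
        "npv": tn / (fn + tn),
        "recall": recall,                 # TPR, sensitivity
        "fpr": fp / (fp + tn),
        "fnr": fn / (tp + fn),
        "specificity": tn / (fp + tn),    # TNR
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts for a 100-sample test set.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Several pairs are complements of each other (precision and FDR, recall and FNR, specificity and FPR, NPV and FOR), which is a quick sanity check on any implementation.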
6.1.4 Lift and Drift Charts
Lift Charts focus on quantifying the effectiveness of a model compared to a baseline, while Drift Charts are concerned with tracking changes in data
distributions over time, which can impact the performance of predictive models.
Anomalies and Outliers
• Anomalies:
– Data points that deviate significantly from expected behavior.
– Represent instances not conforming to general patterns or trends.
– Detection approaches include supervised, semi-supervised, and unsupervised methods.
• Outliers:
– Extreme values, statistically rare, can skew results.
– Identified using statistical measures, such as Z-scores, IQR.
– Types include Global Outliers, Contextual Outliers, Collective Outliers.
Applications
7.2 Isolation Forest Algorithm
The Isolation Forest algorithm isolates each data point by randomly selecting features and split values, creating an ensemble of iTrees. It then calculates
the path length for each point in these trees. Anomalies are identified based on their shorter path lengths, which are used to compute an anomaly score
indicating the likelihood of a point being an outlier in the dataset.
where H(i) is the i-th harmonic number and n is the size of the testing
dataset. γ is the Euler-Mascheroni constant, approximately 0.5772156649.
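The harmonic number and the Euler-Mascheroni constant mentioned above enter through the normalising term of the anomaly score. The sketch below assumes the commonly cited Isolation Forest formulation, $c(n) = 2H(n-1) - 2(n-1)/n$ with $H(i) \approx \ln(i) + \gamma$ and score $s(x, n) = 2^{-E[h(x)]/c(n)}$. These formulas are not reproduced in the text above, so treat them and the function names as assumptions.

```python
from math import log

EULER_GAMMA = 0.5772156649

def harmonic(i):
    """Approximate i-th harmonic number: H(i) ~ ln(i) + gamma."""
    return log(i) + EULER_GAMMA

def average_path_length(n):
    """c(n): expected path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E[h(x)] / c(n)); scores near 1 suggest an anomaly."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256
c = average_path_length(n)
print(c)
print(anomaly_score(2.0, n), anomaly_score(c, n), anomaly_score(12.0, n))
```

Under this formulation a point whose mean path length equals $c(n)$ scores exactly 0.5, and shorter paths push the score toward 1.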
7.3 Cook’s Distance
Cook’s Distance is a measure used in statistics to identify influential observations in a dataset, particularly in the context of linear regression. It estimates
the influence of a data point in a least-squares regression analysis.
$$D_i = \frac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_j^{(i)})^2}{p \cdot MSE}$$
• $\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_j^{(i)})^2$ is the sum of squared differences between the predicted values $\hat{Y}_j$ from the full model and the predicted values $\hat{Y}_j^{(i)}$ from the model without the $i$-th observation.
• $p$ is the number of predictors in the model.
• $MSE$ is the mean squared error of the full model.
• Normalization Factor: The denominator normalizes this sum by the number of predictors and the mean squared error.
An observation with a high Cook's Distance indicates significant influence on the model's parameters. A threshold, such as $4/n$ (where $n$ is the number of observations), is often used to identify observations with a substantially high Cook's Distance as influential.
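A direct, if inefficient, way to see what Cook's Distance measures is to refit the model with each observation left out. The sketch below does this for a one-predictor linear regression. It assumes that $p$ counts the fitted coefficients (intercept and slope, so $p = 2$) and that $MSE = RSS/(n - p)$; conventions differ between texts, and the data are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cooks_distances(xs, ys, p=2):
    """D_i = sum_j (yhat_j - yhat_j^(i))^2 / (p * MSE), using leave-one-out refits."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    fitted = [a + b * x for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - p)  # assumed convention
    distances = []
    for i in range(n):
        a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        shift = sum((f - (a_i + b_i * x)) ** 2 for f, x in zip(fitted, xs))
        distances.append(shift / (p * mse))
    return distances

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]   # the last point lies far off the trend
d = cooks_distances(xs, ys)
threshold = 4 / len(xs)
print([round(v, 3) for v in d], threshold)
```

The last observation combines high leverage with a large residual, so removing it changes every fitted value and its distance exceeds the $4/n$ threshold by a wide margin.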
7.4 Quartiles and Interquartile Range (IQR)
Quartiles
• In a dataset:
– The first quartile, Q1, is the median of the lower half of the data.
– The third quartile, Q3, is the median of the upper half of the data.
– Q2, or the second quartile, is the median of the entire dataset, but it's not used in calculating the IQR.
Calculating the IQR
• The IQR is the difference between the third and first quartiles:
$$IQR = Q3 - Q1$$
• This value represents the spread of the middle 50% of the data.
Identifying Outliers Using IQR
• Outliers are identified using the IQR:
– Lower Bound for Outliers (data points less than this value are considered lower outliers):
$$\text{Lower Bound} = Q1 - 1.5 \times IQR$$
– Upper Bound for Outliers (data points greater than this value are considered upper outliers):
$$\text{Upper Bound} = Q3 + 1.5 \times IQR$$
• The choice of 1.5 as the multiplier is conventional but effective in distinguishing typical data points from those that are significantly different.
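The IQR rule can be sketched with the quartile definition used in this section (medians of the lower and upper halves). Excluding the middle value when the sample size is odd is an assumption; other quartile conventions give slightly different bounds.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 with Q1/Q3 as medians of the lower/upper halves (needs n >= 2)."""
    data = sorted(values)
    half = len(data) // 2          # the middle value is excluded when n is odd
    return median(data[:half]), median(data), median(data[-half:])

def iqr_outliers(values, k=1.5):
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return outliers, (low, high)

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
outliers, bounds = iqr_outliers(data)
print(outliers, bounds)
```

Here Q1 = 4 and Q3 = 8.5, so the bounds are -2.75 and 15.25 and only the value 30 is flagged.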
7.5 Local Outlier Factor
The Local Outlier Factor (LOF) algorithm is a method for detecting outliers in a dataset by examining the local density deviation of each data point
compared to its neighbors.
Step 1: Determining the k-Distance
The k-Distance of a point $P$ is defined as the distance of $P$ to its $k$-th nearest neighbor. This is mathematically represented as:
$$\text{k-distance}(P) = \mathrm{dist}(P, O_k)$$
where $O_k$ is the $k$-th nearest neighbor of $P$ and $\mathrm{dist}(P, O_k)$ is the distance between $P$ and $O_k$.
Step 2: Calculating the Reachability Distance
The Reachability Distance between two points $P$ and $O$ is given by:
$$\text{Reachability-Distance}_k(P, O) = \max\{\text{k-distance}(O), \mathrm{dist}(P, O)\}$$
This represents the maximum of the k-distance of $O$ and the actual distance between $P$ and $O$.
Step 3: Computing the Local Reachability Density (LRD)
The LRD of a point $P$ is calculated as:
$$\mathrm{LRD}_k(P) = \left( \frac{\sum_{O \in N_k(P)} \text{Reachability-Distance}_k(P, O)}{|N_k(P)|} \right)^{-1}$$
where $N_k(P)$ is the set of $k$ nearest neighbors of $P$. The LRD is the inverse of the average reachability distance from $P$ to its neighbors.
Step 4: Determining the Local Outlier Factor (LOF)
The LOF of a point $P$ is determined as:
$$\mathrm{LOF}_k(P) = \frac{\sum_{O \in N_k(P)} \frac{\mathrm{LRD}_k(O)}{\mathrm{LRD}_k(P)}}{|N_k(P)|}$$
The LOF is the average ratio of the LRD of each neighbor of $P$ to the LRD of $P$ itself. A LOF value significantly greater than 1 indicates that $P$ lies in a sparser region than its neighbors and is an outlier.
Overview
The LOF algorithm is particularly effective in identifying outliers due to several key reasons:
• Focus on Local Spatial Properties: It emphasizes the local spatial characteristics of data points, rather than their global distribution in the dataset.
• Useful for Varying Densities: The algorithm is especially useful in datasets where density varies significantly, accommodating clusters that are either more sparse or dense than others.
• Identification of Notably Different Points: The LOF score quantifies the extent to which an object deviates from its neighboring points in terms of density. This helps in identifying points that are significantly different or isolated from their local surroundings.
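The four steps can be followed literally in a brute-force sketch. It assumes Euclidean distance, no duplicate points, and exactly $k$ neighbors per point (ties at the k-distance are broken by index order). The names and the toy points are illustrative.

```python
from math import dist

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding the point itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: dist(points[i], points[j]))[:k]

def lof_scores(points, k=2):
    n = len(points)
    neighbours = [knn(points, i, k) for i in range(n)]
    # Step 1: k-distance = distance to the k-th nearest neighbour.
    k_distance = [dist(points[i], points[neighbours[i][-1]]) for i in range(n)]

    # Step 2: reachability distance of P with respect to O.
    def reach(p, o):
        return max(k_distance[o], dist(points[p], points[o]))

    # Step 3: local reachability density = inverse of the mean reachability distance.
    lrd = [k / sum(reach(i, o) for o in neighbours[i]) for i in range(n)]
    # Step 4: LOF = mean ratio of the neighbours' LRD to the point's own LRD.
    return [sum(lrd[o] for o in neighbours[i]) / (k * lrd[i]) for i in range(n)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

The four corner points of the unit square have equal local densities and a LOF of 1, while the isolated point at (8, 8) scores far above 1.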
7.6 Mahalanobis Distance
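The Mahalanobis distance measures the distance between a point and a distribution, particularly in a multivariate context.
$$D^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$$
where $x$ is the point, $\mu$ is the mean of the distribution, and $\Sigma$ is its covariance matrix.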
• Sensitivity to Outliers: Calculations can be significantly affected by outliers, potentially leading to misleading results.
• Assumption of Gaussian Distribution: Works best when the data distribution is Gaussian, which might not be the case in all datasets.
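A small two-dimensional sketch shows how the covariance rescales distances. The explicit 2×2 inverse, the function name, and the example mean and covariance are assumptions made for illustration.

```python
def mahalanobis_sq(x, mu, cov):
    """D^2 = (x - mu)^T Sigma^{-1} (x - mu) for 2-D points, using an explicit 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

mu = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))   # variance 4 along the first axis, 1 along the second

print(mahalanobis_sq((2.0, 0.0), mu, cov))   # along the high-variance axis
print(mahalanobis_sq((0.0, 2.0), mu, cov))   # along the low-variance axis
```

Both points are at Euclidean distance 2 from the mean, yet the one along the low-variance axis is four times farther in squared Mahalanobis distance.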
7.7 Minimum Covariance Determinant (MCD) Method
The goal of the MCD method is to find a subset of the dataset with the smallest covariance determinant, representing the ”normal” observations and
reducing the influence of outliers.
• Calculating Mean and Covariance for the Subset:
– Mean $\mu_h$ of $X_h$:
$$\mu_h = \frac{1}{h} \sum_{x_i \in X_h} x_i$$
– Covariance matrix $\Sigma_h$ of $X_h$:
$$\Sigma_h = \frac{1}{h - 1} \sum_{x_i \in X_h} (x_i - \mu_h)(x_i - \mu_h)^T$$
• Minimizing the Determinant:
– Select $X_h$ to minimize the determinant of $\Sigma_h$.
– Optimization problem:
$$\min_{X_h \subset X,\ |X_h| = h} \det(\Sigma_h)$$
– The subset $X_h$ minimizing the determinant provides robust MCD estimates of mean and covariance.
• Fast-MCD Algorithm: An iterative algorithm that refines the subset selection to minimize the determinant.
• Iterative Process: Each iteration updates the subset to better approximate the minimum determinant.
• Random Sampling: Initial subset selection involves random sampling to cover a broad range of possibilities.
Significance of the MCD Method
• Outlier Resistance: MCD is robust against outliers, reducing their impact on covariance estimation.
• Improved Analysis: Provides more accurate covariance estimates in datasets with outliers.
• Wide Application: Useful in fields like finance and economics where outliers are common.
• Enables Advanced Techniques: Essential for complex statistical methods in datasets with outliers.
7.8 Single-Class SVM
• Single-class SVM is used for anomaly detection.
• It focuses on a single class, unlike standard SVMs which handle two or more classes.
• The goal is to establish a decision boundary that separates the data points of a single class from the origin in high-dimensional space.
• Enclosing data in a hypersphere in high-dimensional space.
• Center and Radius: Hypersphere characterized by center $a$ and radius $R$.
• Objective: Minimize the radius while keeping all data points inside or on its surface.
• Optimization Problem:
$$\text{Minimize: } R^2 + C \sum_i \xi_i$$
$$\text{Subject to: } \|x_i - a\|^2 \le R^2 + \xi_i \ \ \forall i, \qquad \xi_i \ge 0 \ \ \forall i$$
• $x_i$ are data points, $\xi_i$ are slack variables, and $C$ is a regularization parameter.
2. Schölkopf’s Formulation (Using a Hyperplane)
• Using a hyperplane instead of a hypersphere.
• Objective: Maximize the distance of the hyperplane from the origin.
• Optimization Problem:
• $w$ is the normal to the hyperplane, $b$ is the bias, $\rho$ is the margin.
Information Theory
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$
• $H(X)$ is the entropy of the random variable $X$.
• $\mathcal{X}$ represents the set of all possible outcomes of $X$.
• $p(x)$ is the probability of an outcome $x$.
• The logarithm base is chosen depending on the context (base 2 for bits, base $e$ for natural units, and base 10 for dits).
The entropy $H(X)$ measures the average level of "information", "surprise", or "uncertainty" inherent in a random variable's possible outcomes.
• Non-negativity: Entropy is always non-negative.
• Maximum entropy with uniform distribution: Entropy is maximized when the distribution of the random variable is uniform. This is because uniform distribution indicates the highest level of uncertainty or lack of specific information about the variable.
• Additivity for independent events: For two independent random variables $X$ and $Y$, the entropy of their joint distribution is the sum of their individual entropies:
$$H(X, Y) = H(X) + H(Y)$$
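The listed properties are easy to check numerically. The sketch below uses base-2 logarithms (bits), and the example distributions are arbitrary choices for illustration.

```python
from math import log2

def entropy(probs):
    """H(X) = -sum p(x) log2 p(x), in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy(uniform), entropy(skewed))

# Additivity: for independent X and Y the joint pmf is the product of the marginals.
px = [0.5, 0.5]
py = [0.9, 0.1]
joint = [a * b for a in px for b in py]
print(entropy(joint), entropy(px) + entropy(py))
```

The uniform distribution over four outcomes reaches the maximum of 2 bits, the skewed one stays below it, and the joint entropy of the independent pair matches the sum of the marginal entropies up to rounding.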
8.2 Boltzmann’s Entropy Formula
Boltzmann’s entropy formula is a cornerstone in statistical mechanics and thermodynamics, offering a mathematical expression for the concept of entropy.
• This emphasizes the probabilistic nature of entropy.
This formula finds extensive applications in fields such as physics, chemistry, and information theory.
8.3 Jensen’s Inequality
• Links convexity or concavity of functions to expected values.
• Influential in statistics, optimization, economics, and various other mathematical fields.
For Convex Functions
• Inequality Statement: $f(E[X]) \le E[f(X)]$
• Meaning: When $f$ is a convex function and $X$ is a random variable, the function value at the expected value of $X$ is less than or equal to the expected value of the function of $X$.
• Interpretation: For convex functions, the "average" output is at least as large as the output at the "average" input.
For Concave Functions
• Inequality Statement: $f(E[X]) \ge E[f(X)]$
• Meaning: For a concave function $f$ and a random variable $X$, the function value at the expected value of $X$ is greater than or equal to the expected value of the function of $X$.
• Interpretation: For concave functions, the "average" output is at most as large as the output at the "average" input.
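Both directions of the inequality can be checked on a small discrete distribution. The values, probabilities, and the choice of $x^2$ (convex) and $\ln x$ (concave) are assumptions picked for this example.

```python
from math import log

values = [1.0, 2.0, 6.0]
probs = [0.5, 0.3, 0.2]

def expectation(f, values, probs):
    """E[f(X)] for a discrete random variable."""
    return sum(p * f(v) for v, p in zip(values, probs))

def square(v):
    return v * v

mean = expectation(lambda v: v, values, probs)

# Convex case: f(E[X]) <= E[f(X)].
print(square(mean), expectation(square, values, probs))
# Concave case: f(E[X]) >= E[f(X)].
print(log(mean), expectation(log, values, probs))
```

With these numbers the mean is 2.3, so the convex case compares 5.29 against 8.9 and the concave case compares roughly 0.83 against 0.57.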
8.4 Fisher’s Score and Fisher’s Information
In essence, Fisher's Score provides a mechanism to locate the most probable parameter values in a likelihood function, while Fisher's Information quantifies how much certainty there is in these estimates.
Fisher's Score
Fisher's Score, $s(\theta)$, is the derivative (gradient) of the log-likelihood function with respect to the parameter.
Fisher's Information
Fisher's Information measures the amount of information an observable random variable carries about an unknown parameter of a distribution that models the variable.
• Variance of the Score:
$$I(\theta) = \mathrm{Var}[s(\theta)]$$
where $I(\theta)$ is Fisher's Information.
8.5 Kullback-Leibler Divergence
• The Kullback-Leibler (KL) divergence, denoted as $D_{KL}(P \parallel Q)$, is a statistical measure used to quantify how one probability distribution $P$ differs from a second, reference distribution $Q$.
• It's not a symmetric measure and doesn't satisfy the triangle inequality, setting it apart from a traditional metric.
For Discrete Distributions
• Distributions: $P$ and $Q$.
• $D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$.
• General Interpretation: Sums over all possible events $x$ in the sample space $\mathcal{X}$, each term being the product of the probability of event $x$ under $P$ and the logarithm of the ratio of probabilities under $P$ and $Q$.
• Information Theory Perspective: Quantifies the additional bits required for encoding each event $x$ using a model $Q$ instead of $P$.
For Continuous Distributions
• Density Functions: $p(x)$ and $q(x)$.
• $D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$.
• General Interpretation: Integral over the entire range of the random variable, involving the product of the probability density under $P$ and the logarithm of the ratio of densities under $P$ and $Q$.
• Information Theory Perspective: Measures the continuous "information cost" or extra bits required when outcomes are coded using $Q$ instead of $P$.
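For the discrete case, a short sketch makes the asymmetry visible. It uses base-2 logarithms, assumes $Q(x) > 0$ wherever $P(x) > 0$, and the two distributions are arbitrary examples.

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions given as aligned lists."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))
print(kl_divergence(q, p))
print(kl_divergence(p, p))
```

The two orderings give different values (roughly 0.74 and 0.53 bits here), and the divergence of a distribution from itself is zero.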
8.6 Mutual Information
Mutual Information (MI) is a measure used in statistics to quantify the amount of information obtained about one random variable by observing another. It is intimately linked to the concept of entropy in information theory. The definition and computation of MI depend on whether the random variables involved are discrete or continuous.
For a pair of random variables $X$ and $Y$ with joint distribution $P_{(X,Y)}$ and marginal distributions $P_X$ and $P_Y$, the mutual information is defined as:
$$I(X; Y) = D_{KL}(P_{(X,Y)} \parallel P_X \otimes P_Y)$$
For Discrete Distributions
When $X$ and $Y$ are discrete, MI is calculated as a double sum:
$$I(X; Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} P_{(X,Y)}(x, y) \log\left( \frac{P_{(X,Y)}(x, y)}{P_X(x) P_Y(y)} \right)$$
Here, $P_{(X,Y)}$ is the joint probability mass function, and $P_X$ and $P_Y$ are the marginal probability mass functions of $X$ and $Y$, respectively.
When $X$ and $Y$ are continuous, the double sum becomes a double integral:
$$I(X; Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} P_{(X,Y)}(x, y) \log\left( \frac{P_{(X,Y)}(x, y)}{P_X(x) P_Y(y)} \right) dx \, dy$$
In this case, $P_{(X,Y)}$ is the joint probability density function.
Equivalent Expressions
Mutual information can also be expressed in terms of entropy:
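The discrete double sum can be evaluated directly from a joint probability table. The sketch also checks one standard entropy identity, $I(X; Y) = H(X) + H(Y) - H(X, Y)$; the identity is stated here as an assumption because the entropy expressions are not reproduced in the text, and the joint tables are made-up examples.

```python
from math import log2

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint pmf given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y independent
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y is a copy of X
noisy = [[0.4, 0.1], [0.1, 0.4]]             # Y agrees with X 80% of the time

print(mutual_information(independent), mutual_information(identical))

# Assumed identity: I(X;Y) = H(X) + H(Y) - H(X,Y).
px = [sum(row) for row in noisy]
py = [sum(col) for col in zip(*noisy)]
flat = [p for row in noisy for p in row]
print(mutual_information(noisy), entropy(px) + entropy(py) - entropy(flat))
```

Independent variables share no information, a perfect copy shares the full 1 bit of a fair binary variable, and the noisy case falls in between.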
Time Series Forecasting
9.2 Autocorrelation Function (ACF)
• Autocorrelation measures the correlation of a signal or time series with a delayed version of itself. It is calculated for various time lags.
• The autocorrelation function is used to find repeating patterns or periodic signals within a dataset.
• For instance, it can identify if there's a regular, cyclical behavior in temperature readings over days or in stock market prices over weeks.
Basic Definition
For a time series $X_t$, where $t$ represents time, the autocorrelation function (ACF) assesses the correlation between the series and a copy of itself shifted by a lag $\tau$:
$$R_{XX}(\tau) = E[X_{t+\tau} \overline{X_t}]$$
where:
• $\tau$ is the lag.
• $E[\cdot]$ is the expectation operator.
• $\overline{X_t}$ denotes the complex conjugate of $X_t$.
Continuous and Discrete Signals
Continuous Signal: For a continuous signal $f(t)$, the autocorrelation is defined as an integral:
$$R_{ff}(\tau) = \int_{-\infty}^{\infty} f(t + \tau) f(t) \, dt$$
Discrete Signal: For a discrete-time signal $y(n)$, the autocorrelation at lag $\ell$ is given by the sum:
$$R_{yy}(\ell) = \sum_{n \in \mathbb{Z}} y(n) \, y(n - \ell)$$
Normalization
In statistics and time series analysis, it's common to normalize the auto-covariance function to get a time-dependent Pearson correlation coefficient. The auto-correlation coefficient for a stochastic process is:
$$\rho_{XX}(t_1, t_2) = \frac{K_{XX}(t_1, t_2)}{\sigma_{t_1} \sigma_{t_2}}$$
where $K_{XX}$ is the auto-covariance function and $\sigma_t$ is the standard deviation of the process at time $t$.
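For a finite, real-valued series the normalized autocorrelation is usually estimated from sample moments. The sketch below assumes a stationary series and the common estimator $\hat{\rho}(h) = c(h)/c(0)$ with $c(h) = \frac{1}{n}\sum_t (x_t - \bar{x})(x_{t+h} - \bar{x})$; this estimator and the toy series are assumptions, not taken from the notes.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation rho(h) = c(h) / c(0) for h = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    c0 = sum(v * v for v in centred) / n
    return [
        sum(centred[t] * centred[t + h] for t in range(n - h)) / n / c0
        for h in range(max_lag + 1)
    ]

# A series that alternates every step, so it repeats with period 2.
series = [1.0, -1.0] * 6
acf = sample_acf(series, 4)
print([round(r, 3) for r in acf])
```

The estimate is strongly negative at odd lags and strongly positive at even lags, which is how a periodic pattern shows up in the ACF.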
9.3 Partial Autocorrelation Function (PACF)
Definition and Basic Concept:
• PACF is the partial correlation of a stationary time series with its own lagged values, regressed against the values of the time series at all shorter lags.
• In simpler terms, it tells you the direct relationship between an observation and its lag, removing the influence of intermediate lags.
• The PACF of order $k$ can be defined as the last element in the matrix $R_k$ divided by $r_0$, where $R_k$ is a $k \times k$ matrix and $C_k$ is a $k \times 1$ column vector.
• The 1st order PACF is defined to be equal to the 1st order autocorrelation.
• For higher orders, the 2nd order (lag) PACF is given by the equation:
$$\mathrm{PACF}_2 = \frac{\mathrm{Covariance}(x_t, x_{t-2} \mid x_{t-1})}{\sqrt{\mathrm{Variance}(x_t \mid x_{t-1}) \, \mathrm{Variance}(x_{t-2} \mid x_{t-1})}}$$
This represents the correlation between values two time periods apart, conditional on the knowledge of the value in between.
• Similarly, the 3rd order (lag) PACF is calculated as:
$$\mathrm{PACF}_3 = \frac{\mathrm{Covariance}(x_t, x_{t-3} \mid x_{t-1}, x_{t-2})}{\sqrt{\mathrm{Variance}(x_t \mid x_{t-1}, x_{t-2}) \, \mathrm{Variance}(x_{t-3} \mid x_{t-1}, x_{t-2})}}$$
This continues for higher lags.
• PACF is particularly useful in identifying the order of an autoregressive (AR) model in time series analysis.
• The theoretical ACF and PACF for AR, MA, and ARMA conditional mean models are known and are different for each model, aiding in model selection.
9.4 Autoregressive Integrated Moving Average (ARIMA)
Key Parameters
• p (Autoregressive Part):
– Number of lags of the dependent variable used as predictors.
– Captures the influence of prior values on current values.
• d (Integrated Part):
– Degree of differencing to make the series stationary.
– Transformation to stabilize the mean and variance over time.
• q (Moving Average Part):
– Order of the moving average component.
– Number of lagged forecast errors included in the model.
– Addresses the influence of random shocks from previous points.
General Equation for Non-Seasonal ARIMA
$$\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right)\varepsilon_t$$
• $X_t$: Time series value at time $t$.
• $\varphi_i$: Coefficients for the autoregressive part.
Equation with Drift Component
$$\left(1 - \sum_{i=1}^{p} \varphi_i L^i\right)(1 - L)^d X_t = \delta + \left(1 + \sum_{i=1}^{q} \theta_i L^i\right)\varepsilon_t \qquad (9.1)$$
• Identifying appropriate values for p, d, and q based on time series data characteristics.
• Once parameters are determined, the model is used for forecasting future values.
9.5 SARIMAX Model
ARIMAX Model
• Extends the ARIMA model by including exogenous inputs.
• Incorporates independent variables influencing the time series but not autoregressed on.
• Models the time series using both the series itself and other independent variables.
SARIMA Model
• Stands for Seasonal ARIMA, used for time series with evident seasonality.
• Combines two ARIMA models: one for non-seasonal and one for seasonal parts.
• Includes extra seasonal parameters: P (Seasonal autoregressive order), D (Seasonal differencing order), Q (Seasonal moving average order), s (Length of the seasonal cycle).
SARIMAX Model
• Combines SARIMA and ARIMAX, incorporating both seasonal components and exogenous inputs.
• Enhances ARIMA by adding seasonality and external data for improved forecasting.
• Equation for SARIMAX(p,d,q)(P,D,Q,s):
$$\Theta(L)^p \, \theta(L^s)^P \, \Delta^d \Delta_s^D \, y_t = \Phi(L)^q \, \phi(L^s)^Q \, \Delta^d \Delta_s^D \, \epsilon_t + \sum_{i=1}^{n} \beta_i x_t^i$$
– $\Theta(L)^p$ and $\Phi(L)^q$ are the non-seasonal components.
– $\theta(L^s)^P$ and $\phi(L^s)^Q$ are the seasonal components.
– $\Delta^d$ and $\Delta_s^D$ represent differencing operations.
– $y_t$ is the time series, $\epsilon_t$ is the error term.
– $x_t^i$ with coefficients $\beta_i$ are the exogenous variables.
9.6 Simple Exponential Smoothing (SES)
Simple exponential smoothing (SES) is a method suitable for forecasting time series data that does not display any clear trend or seasonal pattern. It is
part of the family of exponential smoothing methods and works particularly well for data that is essentially random with no evident seasonality or trend.
Core Principles of Simple Exponential Smoothing:
• Weighted Averages: SES forecasts are calculated using weighted averages where the weights decrease exponentially as observations come from further in the past. The smallest weights are associated with the oldest observations. This can be represented by the equation:
$$\hat{y}_{T+1|T} = \alpha y_T + \alpha(1 - \alpha) y_{T-1} + \alpha(1 - \alpha)^2 y_{T-2} + \cdots$$
Here, $\alpha$ ($0 \le \alpha \le 1$) is the smoothing parameter, $y_T$ is the last observed value, and $\hat{y}_{T+1|T}$ is the forecast for the next period.
• Influence of $\alpha$: The value of $\alpha$ determines the weights assigned to observations. A small $\alpha$ gives more weight to observations from the distant past, while a large $\alpha$ gives more weight to recent observations. If $\alpha$ equals 1, the SES forecast is the same as the last observed value, akin to the naïve forecast.
• Forecast: The forecast at time $T + 1$ is a weighted average between the most recent observation $y_T$ and the previous forecast $\hat{y}_{T|T-1}$:
$$\hat{y}_{T+1|T} = \alpha y_T + (1 - \alpha)\hat{y}_{T|T-1}$$
This formula is used iteratively to calculate forecasts for each time period.
• Component Form: In simple exponential smoothing, the only component is the level $\ell_t$. The component form comprises a forecast equation and a smoothing equation:
$$\hat{y}_{t+h|t} = \ell_t$$
$$\ell_t = \alpha y_t + (1 - \alpha)\ell_{t-1}$$
The forecast value at time $t + 1$ is the estimated level at time $t$.
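The recursive and weighted-average forms of SES give the same forecast, which a short sketch can confirm. The starting level `initial_level` is an assumption needed to begin the recursion (the text above does not say how to choose it), and the series is made up.

```python
def ses_forecast(series, alpha, initial_level):
    """Recursive form: level_t = alpha * y_t + (1 - alpha) * level_{t-1}."""
    level = initial_level
    for y in series:
        level = alpha * y + (1 - alpha) * level
    return level          # the one-step-ahead forecast equals the last level

def ses_weighted_form(series, alpha, initial_level):
    """Expanded form: exponentially decaying weights on past observations."""
    n = len(series)
    total = sum(alpha * (1 - alpha) ** j * series[n - 1 - j] for j in range(n))
    return total + (1 - alpha) ** n * initial_level

series = [10.0, 12.0, 11.0, 13.0]
print(ses_forecast(series, 0.5, 10.0), ses_weighted_form(series, 0.5, 10.0))
print(ses_forecast(series, 1.0, 10.0))   # alpha = 1 reproduces the last observation
```

With `alpha = 0.5` both forms return 12.0 for this series, and with `alpha = 1` the forecast collapses to the last observed value, 13.0.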
9.7 Damped Trend Method
The damped trend method in time series forecasting enhances traditional models by incorporating a damping factor, ϕ, to account for the diminishing
impact of trends over time.
• The reduced impact of trends over time makes the damped trend
• It provides more realistic forecasts, especially for data with trends that are likely to decelerate, offering a conservative yet often more accurate long-term outlook compared to models assuming a continuous linear trend.
Trend
$$b_t = \beta^*(\ell_t - \ell_{t-1}) + (1 - \beta^*)\phi b_{t-1}$$
Forecast
$$\hat{y}_{t+h|t} = \ell_t + (\phi + \phi^2 + \dots + \phi^h) b_t$$
This equation projects the future values, taking into account the damping effect on the trend. The term $(\phi + \phi^2 + \dots + \phi^h)$ represents the cumulative damping effect over $h$ periods.
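To see how the cumulative damping term shapes the forecasts, the sketch below evaluates the forecast equation for a given level and trend. The level and trend values are made-up inputs; in practice they would come from fitting the smoothing equations to data.

```python
def damped_forecasts(level, trend, phi, horizon):
    """yhat_{t+h|t} = level + (phi + phi^2 + ... + phi^h) * trend for h = 1..horizon."""
    forecasts = []
    damp = 0.0
    for h in range(1, horizon + 1):
        damp += phi ** h              # cumulative damping factor
        forecasts.append(level + damp * trend)
    return forecasts

print(damped_forecasts(100.0, 2.0, 0.5, 3))
print(damped_forecasts(100.0, 2.0, 1.0, 3))   # phi = 1: undamped linear trend
```

With `phi = 0.5` the increments halve at each step and the forecasts approach `level + phi / (1 - phi) * trend = 102`, whereas `phi = 1` keeps adding the full trend at every step.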
9.8 Holt’s linear trend method
Holt’s linear trend method is an extension of exponential smoothing used for time series forecasting, particularly effective when the data exhibits trends.
This method involves two primary equations: the level equation and the trend equation.
Level
The level equation updates the series' current level, which is a smoothed estimate of the series' value.
$$\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1})$$
• $\ell_t$ is the estimated level at time $t$.
• $\alpha$ is the smoothing parameter for the level, between 0 and 1.
• $y_t$ is the actual observed value at time $t$.
• $\ell_{t-1}$ and $b_{t-1}$ are the estimated level and trend, respectively, from the previous time step.
Trend
The trend equation updates the trend component, reflecting changes in the level.
$$b_t = \beta^*(\ell_t - \ell_{t-1}) + (1 - \beta^*) b_{t-1}$$
• $b_t$ is the estimated trend at time $t$.
• $\beta^*$ is the smoothing parameter for the trend, also between 0 and 1.
• The trend component is the estimated change in the level component from one period to the next.
Forecast
Combines the level and trend components to forecast future values.
$$\hat{y}_{t+h|t} = \ell_t + h b_t$$
• $\hat{y}_{t+h|t}$ is the forecast for $h$ periods ahead.
• The equation suggests that the future value is a function of the current estimated level and the current trend, projected $h$ steps into the future.
Key Characteristics of Holt's Method
• Effectiveness for Linear Trends: Tailored for data exhibiting a linear trend, providing accurate forecasting in such scenarios.
• Dynamic Adjustment: Dynamically updates both the level and trend of the series, offering adaptability to changes in the data.
• Flexibility and Robustness: Adjusts to changing trends, making it a flexible and robust forecasting tool.
• Importance of Smoothing Parameters ($\alpha$ and $\beta^*$): Critical in determining the model's responsiveness to changes in the level and trend of the data. These parameters balance between the historical data's relevance and recent observations.
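The level, trend, and forecast equations of Holt's method translate almost line for line into code. The initialisation (first observation as the level, first difference as the trend) is an assumption, since the text above does not specify one, and the series is a toy example.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's linear trend method; returns forecasts for h = 1..horizon."""
    # Assumed initialisation: level = first value, trend = first difference.
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # Level equation.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # Trend equation.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Forecast equation: project the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]

series = [1.0, 2.0, 3.0, 4.0, 5.0]
print(holt_forecast(series, alpha=0.5, beta=0.5, horizon=3))
```

On a perfectly linear series the level and trend track the data exactly, so the forecasts simply continue the line; on noisy data `alpha` and `beta` control how quickly the two components react.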
9.9 Exponential Smoothing Methods