
Fundamentals of Machine Learning

Roozbeh Sanaei
June 30, 2024
Contents

1 Linear Algebra
  1.1 Matrix Decomposition Techniques
    1.1.1 Eigenvalue Decomposition
    1.1.2 Singular Value Decomposition (SVD)
    1.1.3 Orthogonality and Orthonormality

2 Fundamental Concepts of Machine Learning
    2.0.1 Different Learning Paradigms
    2.0.2 Overfitting and its Mitigation
    2.0.3 Feature Selection

3 Dimensionality Reduction
  3.1 Independent Component Analysis (ICA)
    3.1.1 ICA Overview
    3.1.2 Different Algorithms in ICA
    3.1.3 Infomax
    3.1.4 What is Whitening?
    3.1.5 Fast Independent Component Analysis
    3.1.6 JADE
  3.2 SNE, t-SNE, UMAP
    3.2.1 SNE
    3.2.2 Comparative Analysis of SNE, t-SNE, and UMAP
    3.2.3 SNE, t-SNE and UMAP Comparison
  3.3 Linear Discriminant Analysis (LDA)
  3.4 Sparse Dictionary Learning Overview
  3.5 Non-negative Matrix Factorization (NMF)
  3.6 Multidimensional Scaling (MDS)
  3.7 Isomap (Isometric Mapping)

4 Clustering
  4.1 K-Means
    4.1.1 Elbow Method for Determining Optimal Number of Clusters in K-means
    4.1.2 Silhouette Analysis for Determining Optimal Clusters in K-means
  4.2 The Canopy Method
  4.3 Gaussian Mixture Models (GMMs)
    4.3.1 Challenges in Gaussian Mixture Models (GMM)
    4.3.2 Comparison of GMMs and K-means
  4.4 DBSCAN: Density-Based Spatial Clustering of Applications with Noise
  4.5 OPTICS (Ordering Points To Identify the Clustering Structure) Algorithm
  4.6 Spectral Clustering Algorithms
  4.7 Markov Chain Clustering (MCL)
  4.8 Agglomerative Clustering Algorithm

5 Supervised Machine Learning
  5.1 Linear Regression Model
    5.1.1 Assumptions of Linear Regression
  5.2 Ordinary Least Squares (OLS)
    5.2.1 OLS as Projection
    5.2.2 Applying SVD to OLS and Ridge Regression
    5.2.3 Relationship between CEF and Regression
  5.3 Method of Moments
    5.3.1 General Framework of Moment Conditions
    5.3.2 Instrumental Variables in Regression Models
    5.3.3 Generalized Method of Moments (GMM)
  5.4 Maximum Likelihood
    5.4.1 Maximum Likelihood Estimation (MLE)
    5.4.2 OLS Estimator using Maximum Likelihood
  5.5 Logistic Regression
  5.6 Generalized Linear Models (GLMs)
  5.7 Support Vector Machines (SVMs)
    5.7.1 Lagrangian SVMs
    5.7.2 Comparison Between OLS and SVM
  5.8 Decision Tree Algorithms
    5.8.1 ID3
    5.8.2 Comparison of ID3 and C4.5 Algorithms
    5.8.3 Decision Tree Pruning Methods
    5.8.4 Decision Tree Splitting Criteria
  5.9 Ensemble Models
    5.9.1 Multivariate Adaptive Regression Splines (MARS)
    5.9.2 Ensemble Models Comparison
    5.9.3 AdaBoost
    5.9.4 Gradient Boosting

6 Evaluation
  6.1 Evaluation Approaches
    6.1.1 K-fold Validation
    6.1.2 The ROC Curve in Binary Classification
    6.1.3 Accuracy Metrics
    6.1.4 Lift and Drift Charts

7 Anomalies and Outliers
  7.1 Anomalies and Outliers
  7.2 Isolation Forest Algorithm
  7.3 Cook's Distance
  7.4 Quartiles and Interquartile Range (IQR)
  7.5 Local Outlier Factor
  7.6 Mahalanobis Distance
  7.7 Minimum Covariance Determinant (MCD) Method
  7.8 Single-Class SVM

8 Information Theory
  8.1 Shannon Uncertainty Formula
  8.2 Boltzmann's Entropy Formula
  8.3 Jensen's Inequality
  8.4 Fisher's Score and Fisher's Information
  8.5 Kullback-Leibler Divergence
  8.6 Mutual Information

9 Time Series Forecasting
  9.1 Common Components of a Time Series
  9.2 Autocorrelation Function (ACF)
  9.3 Partial Autocorrelation Function (PACF)
  9.4 Autoregressive Integrated Moving Average (ARIMA)
  9.5 SARIMAX Model
  9.6 Simple Exponential Smoothing (SES)
  9.7 Damped Trend Method
  9.8 Holt's Linear Trend Method
  9.9 Exponential Smoothing Methods
1 Linear Algebra

1.1 Matrix Decomposition Techniques


1.1.1 Eigenvalue Decomposition
Eigenvalue decomposition is a matrix decomposition technique used in linear algebra. It expresses a given square matrix in terms of its eigenvalues and
eigenvectors. The decomposition is essential in various scientific and engineering applications.

Definition of Eigenvalues and Eigenvectors

Given a square matrix A, an eigenvector v and an eigenvalue λ satisfy the equation:

Av = λv

where v is a non-zero vector and λ is a scalar.

Finding Eigenvalues

Eigenvalues of A are found by solving the characteristic equation:

det(A − λI) = 0

where det denotes the determinant of a matrix and I is the identity matrix. The roots of the characteristic polynomial (a polynomial in λ) are the eigenvalues of A.

Finding Eigenvectors

For each eigenvalue λ, eigenvectors are found by solving:

(A − λI)v = 0

This is a system of linear equations; its non-zero solutions for v are the eigenvectors corresponding to λ.

Eigenvalue Decomposition

If A has n linearly independent eigenvectors {v_1, v_2, . . . , v_n} corresponding to eigenvalues {λ_1, λ_2, . . . , λ_n}, then A can be factorized as:

A = V D V⁻¹

where V is the matrix whose i-th column is the eigenvector v_i, and D is the diagonal matrix with the eigenvalues λ_i on the diagonal.
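A minimal NumPy sketch of this factorization (illustrative; np.linalg.eig returns the eigenvalues and the matrix V whose columns are the eigenvectors):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Columns of V solve A v = lambda v; D holds the eigenvalues on its diagonal
eigvals, V = np.linalg.eig(A)
D = np.diag(eigvals)

# Verify the factorization A = V D V^{-1}
print(np.allclose(A, V @ D @ np.linalg.inv(V)))  # True
```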

1.1.2 Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD) is a critical technique in linear algebra, utilized in various fields such as signal processing, statistics, and machine learning. It decomposes any matrix into three distinct matrices.

Definition of SVD

Given an m × n matrix A, SVD is defined as:

A = U Σ Vᵀ

where U is an m × m orthogonal matrix, V is an n × n orthogonal matrix, and Σ is an m × n diagonal matrix.

Matrices in SVD

• U (left-singular vectors): Columns of U are eigenvectors of AAᵀ.
• V (right-singular vectors): Columns of V are eigenvectors of AᵀA.
• Σ (singular values): Diagonal entries are the square roots of the non-negative eigenvalues of AᵀA (equivalently, of AAᵀ).

Computing SVD

• Compute the eigenvalues and eigenvectors of AAᵀ and AᵀA.
• The singular values in Σ are the square roots of the non-zero eigenvalues of AᵀA.
• The columns of U are the normalized eigenvectors of AAᵀ.
• The columns of V are the normalized eigenvectors of AᵀA.

Properties of SVD

• SVD exists for any m × n matrix.
• Singular values are non-negative and conventionally arranged in descending order.
• U and V are orthogonal matrices.
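A minimal NumPy sketch (illustrative; np.linalg.svd returns the singular values as a vector, so the m × n matrix Σ must be rebuilt explicitly):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))

# U is 4x4, s holds the singular values in descending order, Vt is 3x3
U, s, Vt = np.linalg.svd(A)

# Rebuild the 4x3 diagonal matrix Sigma and verify A = U Sigma V^T
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))  # True
```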

1.1.3 Orthogonality and Orthonormality
Orthogonality

Definition: Two vectors u and v in a vector space are orthogonal if their dot product is zero:

u · v = 0

In Rⁿ, this is:

u_1 v_1 + u_2 v_2 + · · · + u_n v_n = 0

Importance: Orthogonal vectors minimize errors and dependencies in computations and are used in methods like the Gram-Schmidt process for constructing orthogonal bases.

Orthonormality

Definition: A set of vectors is orthonormal if all vectors are orthogonal to each other and each vector is of unit length. For vectors u and v:

u · v = 0 (if u ≠ v),  ∥u∥ = ∥v∥ = 1

Importance: Orthonormal vectors simplify computations and are used in Fourier series, quantum mechanics, and signal processing.

Applications in Linear Transformations

In matrix terms, a matrix A with orthonormal columns satisfies:

AᵀA = I

where Aᵀ is the transpose of A and I is the identity matrix. This property is crucial for preserving lengths and angles in transformations.
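A short illustrative check in NumPy (QR factorization is used here only as a convenient way to obtain a matrix with orthonormal columns):

```python
import numpy as np

# Q has orthonormal columns by construction
X = np.random.default_rng(1).normal(size=(5, 3))
Q, _ = np.linalg.qr(X)

# Orthonormal columns satisfy Q^T Q = I
print(np.allclose(Q.T @ Q, np.eye(3)))  # True

# Multiplication by Q preserves vector lengths
v = np.array([1.0, 2.0, 3.0])
print(np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v)))  # True
```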

2 Fundamental Concepts of Machine Learning

2.0.1 Different Learning Paradigms


Supervised Learning
Training on labeled data to learn the mapping from input to output.
Application: Used in regression and classification tasks.
Example: Predicting house prices based on features like size and location.

Unsupervised Learning
Learning patterns from unlabeled data without explicit instruction on what to predict.
Application: Clustering, association, dimensionality reduction.
Example: Identifying customer segments in marketing data.

Semi-Supervised Learning
Combines both labeled and unlabeled data for training.
Application: Useful in scenarios where labeled data are limited.
Example: Language translation models with limited annotated data.

Self-Supervised Learning
Uses unlabeled data and generates its own labels from the data's structure.
Application: Gaining traction in computer vision and natural language processing.
Example: Pretext tasks in deep learning models, such as predicting the next word in a sentence.

Reinforcement Learning
Learning to make decisions by performing actions to achieve a goal.
Application: Robotics, gaming, navigation.
Example: A robot learning to navigate through a maze.

2.0.2 Overfitting and its Mitigation
Overfitting

Definition: Overfitting occurs when a model learns both the underlying patterns and the noise in the training data, leading to poor generalization to new data.
Consequence and Impact: The model exhibits high accuracy on training data but poor performance on unseen data.
Reason: Often due to excessive complexity in the model relative to the amount of training data, leading to the learning of noise.

Methods to Overcome Overfitting

Regularization: Adding a penalty to the model's loss function to constrain its complexity (see the sketch after this list).
Dropout: Randomly ignoring neurons during training in neural networks to prevent over-reliance on specific patterns.
Early Stopping: Halting training when the model's performance on validation data starts to worsen.
Data Augmentation: Artificially increasing training data diversity through transformations.
Ensembling: Combining predictions from multiple models to average out errors.
Feature Selection: Choosing the most relevant features for training to reduce model complexity.
Model Simplification: Reducing the number of layers or parameters in the model.
Increasing Dataset Size: Expanding the training dataset to provide more comprehensive examples.
Bayesian Neural Networks: Incorporating probabilistic approaches in neural networks to manage uncertainty.
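A brief illustrative sketch of the first method: ridge regression adds an L2 penalty (weighted by alpha) to the least-squares loss, shrinking the coefficients of an otherwise overly complex model. The synthetic data and parameter choices below are assumptions of this sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=(30, 1)), axis=0)
y = np.sin(3 * x).ravel() + 0.3 * rng.normal(size=30)  # noisy target

# A degree-15 polynomial fit with no penalty tends to chase the noise;
# the L2 penalty (alpha) shrinks the coefficients and limits complexity.
plain = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(x, y)
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0)).fit(x, y)
print(np.abs(plain[-1].coef_).max(), np.abs(ridge[-1].coef_).max())
```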

Bias and Variance in Machine Learning
• Bias:
  – Measures the difference between the model's average prediction and the true values.
  – Bias²[f̂(x)] = (E[f̂(x)] − f(x))²
    where f̂(x) is the model's prediction, f(x) is the true function, and E[f̂(x)] is the expected value of the model's predictions.
  – High bias can lead to underfitting, indicating a model too simple to capture the data's complexity.

• Variance:
  – Measures the variation of model predictions for a given data point.
  – Variance[f̂(x)] = E[(f̂(x) − E[f̂(x)])²]
    where E[f̂(x)] is the expected value of the model's predictions.
  – High variance can lead to overfitting, where the model captures noise as if it were a significant signal.

• Bias-Variance Trade-Off:
  – Improving the model to reduce bias typically increases its variance, and vice versa.
  – Total Error = Bias² + Variance + Irreducible Error
    where the Irreducible Error represents the error inherent in the problem itself, due to factors like noise.

2.0.3 Feature Selection
Filter Methods

These methods assess features independently of any learning algorithm, using statistical measures.

• Correlation Coefficient: Measures the linear relationship between features and the target.
• Chi-squared Test: Assesses independence between categorical features and the target.
• Information Gain: Evaluates the reduction in entropy from adding a feature.

Wrapper Methods

These methods evaluate feature subsets through a specific machine learning algorithm, focusing on model performance (a brief RFE sketch follows this section).

• Recursive Feature Elimination (RFE): Eliminates the least important features iteratively.
• Forward Selection: Adds the most significant feature iteratively, starting from an empty set.
• Backward Elimination: Removes the least significant feature iteratively, starting from a full set.

Embedded Methods

These methods integrate feature selection within the learning algorithm during model training.

• Lasso Regression: Uses L1 regularization to eliminate irrelevant features.
• Elastic Net: Combines L1 and L2 regularization for balanced feature selection.
• Random Forest Importance: Assesses feature importance based on their contribution to random forest performance.

Advantages of Feature Selection

• Reducing dimensionality
• Improving model performance
• Enhancing interpretability
• Reducing computational complexity
• Improving data quality
• Enhancing model transparency
• Addressing multicollinearity
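A minimal illustrative sketch of Recursive Feature Elimination with scikit-learn (synthetic data; the choice of estimator is an assumption of this sketch):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# 10 features, of which only 3 are informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Recursively drop the least important feature until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)  # boolean mask of the selected features
```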

3 Dimensionality Reduction

3.1 Independent Component Analysis (ICA)


3.1.1 ICA Overview
Basics of ICA

• Objective: Decomposing a multivariate signal into independent non-Gaussian components.
• Use Cases: Applied in blind source separation, image processing, and complex data analysis.

How ICA Works

• Statistical Independence: Components are assumed to be statistically independent.
• Non-Gaussianity: Non-Gaussian sources sum to a more Gaussian-like distribution, so maximizing non-Gaussianity aids separation.

Process

• Input: Linear mixtures of unknown independent components.
• Algorithm: Adjusts weights to maximize the independence of the output signals.
• Output: Independent components for signal reconstruction and source identification.

Applications

• Audio Processing: Separating voices in the 'cocktail party problem'.
• Medical Imaging: Identifying brain activities and artifacts in fMRI.
• Financial Analysis: Analyzing complex financial data to extract underlying factors.

Limitations

• Assumptions: Relies on specific assumptions that may not always hold.
• Model Order Determination: Difficulty in deciding the number of components.
• Ambiguity: Possibility of permutation and scaling ambiguities in solutions.

3.1.2 Different Algorithms in ICA
• FastICA: Known for its computational efficiency, FastICA uses a fixed-point iteration scheme to maximize the non-Gaussianity of projected data.

• Infomax and Extended Infomax: These algorithms maximize the information transfer from the input to the output of a neural network. Extended Infomax can handle both sub- and super-Gaussian sources.

• JADE (Joint Approximate Diagonalization of Eigenmatrices): JADE operates by jointly diagonalizing a set of fourth-order cumulant matrices to extract the independent components.

• CUmulative Distributions-based ICA (CUDI): CUDI maximizes non-Gaussianity using cumulative distribution functions.

• Second-order blind identification (SOBI): SOBI utilizes second-order statistics over different time delays, making it particularly effective for time-dependent signals.

• Probabilistic ICA: This method incorporates a probabilistic model, often using a likelihood function, to estimate independent components, and is closely related to factor analysis.

• Temporal ICA: Designed specifically for time-series data, it focuses on the temporal structure of the data to separate sources.

• Complex-valued ICA: This variant is used for complex-valued data (with real and imaginary parts), as found in some signal processing applications.

• Nonlinear ICA: Deals with mixtures that are nonlinear combinations of the source signals, unlike traditional linear ICA models.

3.1.3 Infomax
Infomax is a principle used in information theory and neural network training, aiming to maximize the mutual information between the input and output of a system. This principle is often applied in unsupervised learning, where the goal is to find a representation of the input data that preserves as much information as possible.

Entropy (H)

• Entropy measures the amount of uncertainty or randomness in a random variable.
• For a random variable X, it is defined as:

  H(X) = − Σ_{x∈X} p(x) log p(x)

Mutual Information (I)

• Mutual information quantifies the shared information between two random variables, X and Y.
• It is defined as:

  I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x, y) / (p(x) p(y)) )

• Alternatively, it can be expressed using entropy:

  I(X; Y) = H(X) + H(Y) − H(X, Y)
  I(X; Y) = H(Y) − H(Y | X)

Objective of Infomax

• The goal is to maximize the mutual information I(X; Y).
• This involves maximizing H(Y) and minimizing H(Y | X).

Application in Neural Networks

• The output Y is a function of the input X and the network weights W:

  Y = f(W X)

• The objective is to find the optimal W that maximizes I(X; Y).

Optimization

• Typically involves gradient-based optimization techniques.
• The weights W are adjusted by:

  W_new = W_old + η ∂I(X; Y)/∂W

Constraints and Regularization

• Additional constraints, such as weight regularization, are often applied.
• These help prevent overfitting and ensure a more robust learning process.

3.1.4 What is Whitening?
• Whitening: Transformation of Data
  – Whitening is a process that transforms data so that the covariance matrix of the resulting dataset is the identity matrix.
  – This means that the features are uncorrelated and each feature has unit variance.
  – It involves decorrelating the features and normalizing their variance.

• Compute the Covariance Matrix
  – Given a dataset X with n features and m samples.
  – Calculate the covariance matrix Σ as:

    Σ = (1/m) (X − X̄)ᵀ (X − X̄)

  – X̄ is the mean vector for the dataset.

• Perform Eigenvalue Decomposition
  – Apply eigenvalue decomposition to the covariance matrix Σ:

    Σ = V D Vᵀ

  – V is the matrix of eigenvectors, and D is the diagonal matrix of eigenvalues.

• Whiten the Data
  – Transform the data to obtain the whitened data X_white:

    X_white = E D^(−1/2) Vᵀ (X − X̄)

  – D^(−1/2) is obtained by taking the reciprocal square root of each non-zero element in D.
  – E is an optional scaling matrix, often the identity matrix I.
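A compact NumPy sketch of these steps (illustrative; the optional scaling matrix E is taken to be the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.7]])  # correlated data

Xc = X - X.mean(axis=0)                    # center
Sigma = Xc.T @ Xc / len(Xc)                # covariance matrix
eigvals, V = np.linalg.eigh(Sigma)         # Sigma = V D V^T
X_white = Xc @ V @ np.diag(eigvals**-0.5)  # apply D^{-1/2} V^T (row-vector form)

# Covariance of the whitened data is the identity
print(np.allclose(X_white.T @ X_white / len(X_white), np.eye(3)))  # True
```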

3.1.5 Fast Independent Component Analysis
Fast Independent Component Analysis (FastICA) is an algorithm used for the separation of a multivariate signal into additive subcomponents. It is often used in the context of blind source separation, where the goal is to separate a set of signals that have been mixed together.

Basic Model

The basic model of ICA can be represented as:

x = As

where
• x is the observed mixed signal.
• A is the mixing matrix.
• s is the vector of independent source signals.

Goal of FastICA

The goal is to estimate the unmixing matrix W such that:

s ≈ Wx

FastICA Algorithm

1. Centering and Whitening: First, the observed signals are centered and whitened. Centering involves subtracting the mean, and whitening transforms the variables into uncorrelated variables with unit variance.

2. Maximization of Non-Gaussianity: The core of FastICA is to find a linear combination of the whitened variables that maximizes non-Gaussianity. Non-Gaussianity can be measured in several ways, such as using kurtosis or negentropy.

3. Iterative Fixed-Point Algorithm: FastICA finds the independent components by iterating the following steps:

   (a) Choose an initial weight vector w.
   (b) Update w using the formula:

       w⁺ = E{x g(wᵀx)} − E{g′(wᵀx)} w

       where g is a non-linear function chosen based on the measure of non-Gaussianity (such as kurtosis or negentropy), and g′ is its derivative.
   (c) Normalize the weight vector:

       w = w⁺ / ∥w⁺∥

   (d) Repeat until convergence.

4. Extraction of Independent Components: Once the algorithm converges, the independent components are given by:

   s = Wx
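A short illustrative sketch using scikit-learn's FastICA on two synthetic mixed signals (the sources, mixing matrix, and parameters are assumptions of this sketch):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # two independent sources s
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                        # mixing matrix
X = S @ A.T                                       # observed mixtures x = As

# Centering/whitening and the fixed-point iteration happen inside fit_transform;
# the sources are recovered up to permutation and scaling.
S_est = FastICA(n_components=2, random_state=0).fit_transform(X)
```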

3.1.6 JADE
JADE ICA is a statistical technique used to separate a multivariate signal into additive subcomponents that are maximally independent from each other. It is particularly useful in blind source separation tasks like separating audio signals.

1. Centering and Whitening

Centering: Removes the mean from the data to ensure zero mean. For a data matrix X with mean E[X]:

X_centered = X − E[X]

Whitening: Transforms the variables into uncorrelated variables with unit variance, reducing ICA to finding an orthogonal matrix. For the whitened matrix Z, diagonal matrix of eigenvalues D, and matrix of eigenvectors E:

Z = D^(−1/2) Eᵀ X_centered

2. Cumulant Calculation

Fourth-order cumulants capture non-Gaussian features like kurtosis. For components z_i, z_j, z_k, z_l of the (zero-mean) whitened data Z:

cum(z_i, z_j, z_k, z_l) = E[z_i z_j z_k z_l] − E[z_i z_j]E[z_k z_l] − E[z_i z_k]E[z_j z_l] − E[z_i z_l]E[z_j z_k]

where cum is the fourth-order cumulant function. The cumulant matrices are built from these values:

K_ijkl = cum(z_i, z_j, z_k, z_l)

where K_ijkl is the cumulant entry for the quadruple (z_i, z_j, z_k, z_l).

3. Diagonalization

Find an orthogonal transformation that makes the cumulant matrices as diagonal as possible, indicating statistical independence. For the orthogonal unmixing matrix W and cumulant matrices K:

W* = arg min_W Σ (off-diagonal elements of W K Wᵀ)²

4. Independent Components

Obtain the independent components from the observed data by transforming the whitened data with the matrix found in the diagonalization step. For the matrix of independent components S and the optimal unmixing matrix W:

S = W Z

3.2 SNE, t-SNE, UMAP
3.2.1 SNE
Similarity in the Original Space

p_{j|i} = exp(−∥x_i − x_j∥² / 2σ_i²) / Σ_{k≠i} exp(−∥x_i − x_k∥² / 2σ_i²)

where p_{j|i} is the probability of picking x_j as a neighbor of x_i; x_i, x_j are data points in the high-dimensional space; ∥x_i − x_j∥ is the Euclidean distance between x_i and x_j; σ_i is the variance of the Gaussian distribution around x_i; and k is an index for summation over all points except i.

Symmetrized Similarities

p_ij = (p_{j|i} + p_{i|j}) / (2N)

where p_ij is the joint probability symmetrizing p_{j|i} and p_{i|j}, and N is the total number of data points.

Similarity in the Reduced Space

q_ij = (1 + ∥y_i − y_j∥²)⁻¹ / Σ_{k≠l} (1 + ∥y_k − y_l∥²)⁻¹

where q_ij is the probability of y_i and y_j being neighbors in the low-dimensional space; y_i, y_j are data points in the low-dimensional space; and k, l are indices for summation over all distinct pairs of points.

Minimization of Kullback-Leibler Divergence

KL(P∥Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)

where KL(P∥Q) is the Kullback-Leibler divergence between the high-dimensional and low-dimensional representations, and p_ij, q_ij are the joint probabilities in the high- and low-dimensional spaces, respectively.

Gradient Descent

∂KL(P∥Q)/∂y_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j)(1 + ∥y_i − y_j∥²)⁻¹

where ∂KL(P∥Q)/∂y_i is the gradient of the Kullback-Leibler divergence with respect to point y_i; y_i, y_j are data points in the low-dimensional space; and p_ij, q_ij are the joint probabilities in the high- and low-dimensional spaces, respectively.

3.2.2 Comparative Analysis of SNE, t-SNE, and UMAP
High-Dimensional Probabilities Calculation

• SNE: Utilizes scaled Euclidean distance, leading to non-symmetric dissimilarities due to the variance parameter σ_i.
• t-SNE Difference: Implements symmetrization to make the high-dimensional probabilities symmetric.
• UMAP Difference: Works with similarities instead of probabilities, using a different metric function.

Low-Dimensional Probabilities Calculation

• SNE: Uses Gaussian neighborhoods with fixed variance for the low-dimensional probabilities.
• t-SNE Difference: Adopts the Student t-distribution to solve the crowding problem.
• UMAP Difference: Does not normalize the low-dimensional similarities, improving performance.

Cost Function and Optimization

• SNE: Employs Kullback-Leibler divergence, facing optimization challenges.
• t-SNE Difference: Retains the KL divergence but modifies the probabilities approach.
• UMAP Difference: Uses cross-entropy and stochastic gradient descent, capturing more global structure.

Focus on Data Structure Preservation

• SNE: Aims to preserve both local and global structures, but faces challenges.
• t-SNE Difference: Prioritizes local structure conservation, effective in visualizing clusters.
• UMAP Difference: Preserves more global structure due to its methodological approach.

3.2.3 SNE, t-SNE and UMAP Comparison
Similarity

• SNE Similarity: The SNE algorithm uses a Gaussian kernel to model the probability that a point x_i in a high-dimensional space would choose another point x_j as its neighbor:

  p_{j|i} = exp(−∥x_i − x_j∥² / 2σ_i²) / Σ_{k≠i} exp(−∥x_i − x_k∥² / 2σ_i²)

  Variables: x_i, x_j, high-dimensional data points; σ_i², variance of the Gaussian kernel centered at x_i.

• t-SNE Symmetrization: t-SNE modifies the SNE approach by symmetrizing the probabilities to address the original asymmetry:

  p_ij = (p_{j|i} + p_{i|j}) / (2N)

  Variables: p_{j|i}, p_{i|j}, conditional probabilities from SNE; N, total number of data points.

• UMAP Similarity Measure: UMAP employs a fuzzy set approach for similarity, focusing on both local and global data structures:

  μ_ij = exp(−max(0, d(x_i, x_j) − ρ_i) / σ_i)

  Variables: d(x_i, x_j), user-defined distance metric; ρ_i, σ_i, parameters for local neighborhood adjustment.

Similarity in Reduced Space

• SNE Similarity in Reduced Space: In the reduced space, SNE uses an approach similar to the high-dimensional one, but with fixed variance:

  q_{j|i} = exp(−∥y_i − y_j∥²) / Σ_{k≠i} exp(−∥y_i − y_k∥²)

  Variables: y_i, y_j, low-dimensional embeddings of the high-dimensional data points x_i, x_j. This calculates the probability of the low-dimensional embeddings of points, emphasizing their relative proximities.

• t-SNE Symmetrization in Reduced Space: t-SNE uses the Student t-distribution for probabilities in the reduced space:

  q_ij = (1 + ∥y_i − y_j∥²)⁻¹ / Σ_{k≠l} (1 + ∥y_k − y_l∥²)⁻¹

  Variables: y_i, y_j, low-dimensional embeddings. The use of the t-distribution helps mitigate the crowding problem and allows t-SNE to better model the relationships between points in the reduced space.

• UMAP Similarity Measure in Reduced Space: UMAP uses a different formulation for the low-dimensional similarities:

  ν_ij = 1 / (1 + a∥y_i − y_j∥^(2b))

  Variables: y_i, y_j, low-dimensional embeddings; a, b, parameters learned during the optimization process. This measure helps UMAP preserve the topological structure of the data in the reduced space, ensuring a balance between local and global structures.

Loss Functions

• SNE Loss Function: SNE uses the Kullback-Leibler divergence as its loss function:

  C_SNE = Σ_i Σ_j p_{j|i} log(p_{j|i} / q_{j|i})

  This function measures the mismatch between the high-dimensional and low-dimensional probabilities, aiming to preserve local structures in the reduced space.

• t-SNE Loss Function: t-SNE also uses the Kullback-Leibler divergence, but with symmetrized probabilities:

  C_t-SNE = Σ_i Σ_j p_ij log(p_ij / q_ij)

  This function aims to minimize the difference between the high-dimensional and low-dimensional representations, focusing on local neighborhood structures.

• UMAP Loss Function: UMAP utilizes a cross-entropy loss function:

  C_UMAP = Σ_ij [ w_ij log(σ(d_ij)) + (1 − w_ij) log(1 − σ(d_ij)) ]

  Variables: σ(d_ij), logistic sigmoid function of the distance between points i and j; w_ij, weight derived from the high-dimensional graph. UMAP's loss function balances attractive and repulsive forces, aiming to preserve both local and global data structures.
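A minimal illustrative t-SNE sketch with scikit-learn (UMAP lives in the separate umap-learn package with a similar fit_transform interface):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional digit images

# Perplexity controls the effective neighborhood size used for p_{j|i}
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (1797, 2)
```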
3.3 Linear Discriminant Analysis (LDA)
• Primary Goal of LDA:
  – Identify a linear combination of features.
  – Differentiate or segregate multiple classes of objects or events.

• Optimization Approach:
  – Optimize the ratio of determinants between the between-class scatter matrix (S_B) and the within-class scatter matrix (S_W).
  – Calculate mean vectors for each class:

    m_i = (1/n_i) Σ_{x∈D_i} x

    where m_i is the mean vector for class i, n_i is the number of samples in class i, and D_i is the set of data points in class i.
  – Between-Class Scatter Matrix:

    S_B = Σ_{i=1}^{c} N_i (m_i − m)(m_i − m)ᵀ

    where m_i is the mean vector for class i, m is the overall mean vector, N_i is the number of samples in class i, and c is the total number of classes.
  – Within-Class Scatter Matrix:

    S_W = Σ_{i=1}^{c} Σ_{x∈D_i} (x − m_i)(x − m_i)ᵀ

  – Fisher's Criterion:

    J(W) = |Wᵀ S_B W| / |Wᵀ S_W W|

    where W is the projection matrix.

• Solution and Eigenvalue Problem:
  – Maximize J(W) by solving the eigenvalue problem S_W⁻¹ S_B v = λv, where v are the eigenvectors, λ are the eigenvalues, S_B is the between-class scatter matrix, and S_W is the within-class scatter matrix.
  – Focus on the eigenvectors corresponding to the largest eigenvalues for maximum variance between classes.

• Projection of Data:
  – Determine W for data projection, where W contains the selected eigenvectors.
  – Project data points x into a lower-dimensional space to achieve optimal class separation: y = Wᵀx, where y is the projected data and x is the original data.
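A short illustrative sketch with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 4 features, 3 classes

# With c = 3 classes, LDA can project onto at most c - 1 = 2 discriminants
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)  # y = W^T x for each sample
print(X_proj.shape)  # (150, 2)
```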

3.4 Sparse Dictionary Learning Overview
1. Objective:
   • Goal: Find a dictionary D and a sparse representation X that approximate a given dataset Y as Y ≈ DX.
   • Components:
     – Y: A matrix where each column represents a data sample.
     – D: The dictionary matrix, with each column a dictionary atom.
     – X: A sparse matrix where each column is the sparse representation of the corresponding column of Y.

2. Mathematical Formulation:

   min_{D,X} (1/2)∥Y − DX∥²_F  subject to  ∥x_i∥₀ ≤ T ∀i

   • Elements:
     – ∥·∥_F: Frobenius norm, measuring the difference between Y and DX.
     – ∥x_i∥₀: l0-norm of the i-th column of X, counting non-zero entries to enforce sparsity.
     – T: A threshold dictating the maximum number of non-zero entries in each column of X.

3. Optimization Challenge:
   • Problem Nature: Generally non-convex and NP-hard due to the l0-norm constraint.
   • Practical Approaches:
     – Relaxing the l0-norm to an l1-norm, which promotes sparsity while being convex.
     – Using greedy algorithms like Orthogonal Matching Pursuit (OMP).

4. Alternate Minimization:
   • Strategy:
     – Fix D and optimize X: for each column y_i of Y, find the sparse representation x_i using D.
     – Fix X and optimize D: update D while keeping X fixed.

5. Regularization and Constraints:
   • In Practice:
     – Adding constraints such as normalizing the columns of D to prevent scaling issues.
     – Incorporating regularization terms to control overfitting.
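An illustrative scikit-learn sketch (note that DictionaryLearning uses the l1 relaxation discussed above rather than the exact l0 constraint, and treats rows as samples):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 20))  # 100 samples, 20 features

# Learn 15 dictionary atoms; alpha weights the l1 sparsity penalty
dl = DictionaryLearning(n_components=15, alpha=1.0, random_state=0)
X_sparse = dl.fit_transform(Y)   # sparse codes
D = dl.components_               # dictionary atoms (rows)
print(np.mean(X_sparse != 0))    # fraction of non-zero coefficients
```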

3.5 Non-negative Matrix Factorization (NMF)
Non-negative Matrix Factorization (NMF) is a powerful technique in data analysis and linear algebra. It aims to factorize a non-negative matrix V into two non-negative matrices W and H of rank k, where k is the desired rank or number of components. This factorization is useful for various applications, including dimensionality reduction, feature extraction, and source separation.

Key Points about NMF

• NMF Objective: NMF aims to factorize a non-negative matrix V (m × n) into two non-negative matrices W (m × k) and H (k × n), where k is the desired rank or number of components.

• Factorization:

  V ≈ WH

  V: original non-negative matrix; W: matrix of non-negative basis vectors; H: matrix of non-negative coefficients.

• Cost Function: The cost function typically used in NMF is the squared Euclidean distance (Frobenius norm) between V and WH:

  Cost = ∥V − WH∥²

  where ∥A∥² denotes the squared Frobenius norm, the sum of the squares of all elements of matrix A.

• Optimization: NMF employs iterative optimization algorithms, such as the multiplicative update rules, to find optimal values for W and H. These rules are applied iteratively until convergence:

  For W:  W_new = W ⊙ (V H′) / (W H H′)
  For H:  H_new = H ⊙ (W′ V) / (W′ W H)

  where ⊙ denotes element-wise multiplication, ′ denotes matrix transpose, and the division is element-wise.

• Non-Negativity Constraint: The non-negativity constraint ensures that all elements of W and H are non-negative, making NMF suitable for data where negative values lack meaningful interpretations, such as images, text, and audio.

• Applications: NMF is applied in various domains, including image processing, text mining, audio source separation, and dimensionality reduction, to extract interpretable components from data.

• Interpretability: NMF provides interpretable factors in the form of non-negative basis vectors (W) and coefficients (H), making it valuable for feature extraction and dimensionality reduction tasks.

• Sparsity: Regularization techniques, like L1 regularization (similar to Lasso), can be added to promote sparsity in the factor matrices, leading to non-negative sparse coding.
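A brief illustrative NumPy sketch of the multiplicative updates (assumes random initialization, a fixed iteration count, and a small constant to guard against division by zero):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((20, 12))                 # non-negative data matrix
k = 4
W, H = rng.random((20, k)), rng.random((k, 12))

eps = 1e-10  # guards against division by zero
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
    W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W

print(np.linalg.norm(V - W @ H))           # reconstruction error after fitting
```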

3.6 Multidimensional Scaling (MDS)
• Similarity/Dissimilarity Matrix (D):
  – The matrix D contains elements d_ij, each representing the distance or dissimilarity between objects i and j.
  – It is typically symmetric, with zeros on the diagonal (indicating zero dissimilarity of an object with itself).

• Distance Matrix in Low-Dimensional Space (X):
  – The goal is to find a set of points in a low-dimensional space (2D or 3D) that represent the objects.
  – The matrix X contains elements x_ij, each representing the distance between points i and j in the low-dimensional space.

• Stress Function:
  – Measures the goodness of fit between the distances x_ij in the low-dimensional space and the original dissimilarities d_ij:

    Stress = sqrt( Σ_{i<j} (d_ij − x_ij)² / Σ_{i<j} d_ij² )

    where Σ_{i<j} sums over all unique pairs of points.

• Optimization:
  – Iteratively adjust the coordinates of points in the low-dimensional space to minimize the stress function.
  – Often done using numerical optimization techniques such as gradient descent.

• Result Interpretation:
  – The configuration of points reflects the relative similarities or dissimilarities among the objects.
  – Objects that are more similar are closer together in the MDS space, while less similar objects are farther apart.
  – Dimension interpretation is not straightforward and often requires domain-specific knowledge.

• Non-Uniqueness:
  – MDS does not provide a unique solution; different starting configurations can lead to different final configurations with similar stress values.
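An illustrative scikit-learn sketch that embeds a precomputed dissimilarity matrix:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 5))
D = pairwise_distances(points)  # symmetric dissimilarity matrix, zero diagonal

# Metric MDS minimizes the stress between D and the embedded distances
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
X_embedded = mds.fit_transform(D)
print(mds.stress_)  # final value of the stress objective
```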

3.7 Isomap (Isometric Mapping)
Isomap (Isometric Mapping) is a nonlinear dimensionality reduction method designed to uncover the underlying manifold structure in a high-dimensional dataset by approximating the geodesic distances among points.

Neighborhood Graph Construction:

• Construct a neighborhood graph G. Each point x_i in the dataset is connected to its K nearest neighbors, or to all points within a fixed radius ϵ.
• The distance between connected points is typically the Euclidean distance: d(x_i, x_j) = ∥x_i − x_j∥.

Shortest Path Calculation:

• Compute the shortest path distances between all pairs of points in the graph G using algorithms like Floyd-Warshall or Dijkstra's.
• Let D_G(x_i, x_j) denote the shortest path distance between x_i and x_j in the graph.

Constructing the Distance Matrix:

• Construct a matrix D where each element D_ij represents the graph distance D_G(x_i, x_j).

Double Centering and Eigenvalue Decomposition:

• Perform double centering on D to create a matrix B:

  B = −(1/2) H D² H

  where H is the centering matrix H = I − (1/n) 1 1ᵀ, I is the identity matrix, and 1 is a vector of ones.

• B is then subjected to eigenvalue decomposition:

  B = V Λ Vᵀ

  where Λ is the diagonal matrix of eigenvalues and V contains the corresponding eigenvectors.

Embedding into Lower-Dimensional Space:

• Select the top d eigenvectors (corresponding to the largest d eigenvalues) to form the matrix V_d.
• The coordinates of the data in the lower-dimensional space are then given by:

  Y = Λ_d^(1/2) V_dᵀ

  where Λ_d contains the top d eigenvalues.
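A short illustrative sketch with scikit-learn, which performs the graph construction, shortest-path computation, and eigendecomposition internally:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)  # 3-D manifold data

# K nearest neighbors for the graph; embed into d = 2 dimensions
iso = Isomap(n_neighbors=10, n_components=2)
Y = iso.fit_transform(X)
print(Y.shape)  # (1000, 2)
```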

4 Clustering

4.1 K-Means
The K-means algorithm is a widely used method in unsupervised machine learning for clustering data. It partitions a dataset into K distinct, non-overlapping subgroups or clusters.

1. Initialization:
   • Choose k initial centroids randomly.
   • These centroids C = {c_1, c_2, ..., c_k} are the starting points for each of the clusters.

2. Assignment Step:
   • Assign each data point x_i to the nearest centroid.
   • The assignment of a data point to a cluster is based on the minimum distance from the centroids, typically calculated using the Euclidean distance.
   • The assignment function is represented as:

     S_i = {x_p : ∥x_p − c_i∥ ≤ ∥x_p − c_j∥ ∀j, 1 ≤ j ≤ k}

   • Here, S_i is the set of points assigned to the i-th cluster, x_p is a data point, and ∥x_p − c_i∥ is the Euclidean distance between x_p and centroid c_i.

3. Update Step:
   • Recalculate the centroids of the clusters based on the current assignment of data points.
   • The new centroid c_i for cluster i is the mean of all points assigned to that cluster:

     c_i = (1/|S_i|) Σ_{x_j∈S_i} x_j

   • Here, |S_i| is the number of data points in cluster i, and x_j are the data points in cluster i.

4. Convergence Check:
   • Repeat the assignment and update steps until the centroids no longer change significantly, or a maximum number of iterations is reached.
   • Convergence is often checked by seeing whether the sum of the squared distances between data points and their corresponding centroids is minimized.
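A compact NumPy sketch of these four steps (illustrative; it uses plain random initialization rather than k-means++ and assumes no cluster becomes empty):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # 1. initialization
    for _ in range(n_iter):
        # 2. assignment: nearest centroid by Euclidean distance
        labels = np.linalg.norm(X[:, None] - centroids, axis=2).argmin(axis=1)
        # 3. update: each centroid becomes the mean of its assigned points
        new = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # 4. convergence check: stop when centroids no longer move
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```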

4.1.1 Elbow Method for Determining Optimal Number of Clusters in K-means
Calculate Within-Cluster Sum of Squares (WSS) for Various k

Perform K-means clustering for each value of k (the number of clusters) and record the WSS.

Plot the WSS Values

Create a plot with the number of clusters k on the x-axis and the corresponding WSS on the y-axis.

Identify the Elbow Point

The "elbow" is the point in the plot where the WSS starts to decrease at a slower rate. It is visually identified as a point of inflection on the curve. As k increases, the average distortion per cluster decreases because the clusters are smaller. However, beyond a certain k (the elbow point), this decrease starts to diminish.

Choose k at the Elbow Point

The optimal number of clusters k is chosen at this elbow point. This choice represents a balance between maximizing the number of clusters (to reduce WSS) and keeping the model simple and generalizable (by not having too many clusters).
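A minimal illustrative sketch of the WSS-versus-k computation (scikit-learn's inertia_ attribute is the WSS of a fitted model; the dataset is synthetic):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 10)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")  # look for the bend ("elbow") in this curve
plt.xlabel("k"); plt.ylabel("WSS")
plt.show()
```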

4.1.2 Silhouette Analysis for Determining Optimal Clusters in K-means
1. Calculate the Silhouette Coefficient for Each Data Point:

   • Compute a(i), the average distance from the i-th data point to all other points in the same cluster S_i:

     a(i) = (1/(|S_i| − 1)) Σ_{x∈S_i, x≠i} ∥x − i∥

   • Compute b(i), the smallest average distance from the i-th data point to all points in any other cluster, excluding the one to which i belongs:

     b(i) = min_{j≠i} (1/|S_j|) Σ_{x∈S_j} ∥x − i∥

2. Compute the Silhouette Coefficient for Each Point:

   • Silhouette Coefficient:

     S(i) = (b(i) − a(i)) / max{a(i), b(i)}

   • It measures how similar a data point is to its own cluster compared to other clusters, and ranges from −1 to 1.

3. Interpret the Results:

   • High Value: Indicates good matching within its own cluster and poor matching to neighboring clusters.
   • Low/Negative Value: Suggests incorrect clustering or too many/few clusters.

• Silhouette Coefficient for a Single Data Point:
  – The silhouette coefficient for a single data point is a measure of how similar that point is to points in its own cluster compared to points in other clusters.
  – This coefficient helps determine the appropriateness of the clustering.

• Mean Silhouette Coefficient as a Quality Measure:
  – The mean of the silhouette coefficients over all points is used to evaluate the quality of clustering in a dataset.
  – It provides insight into how well each object lies within its cluster.

• Importance:
  – Helps in assessing the separation distance between clusters.
  – Distinct clusters lead to better definitions and a higher average silhouette score.
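A short illustrative sketch that scores candidate values of k by the mean silhouette coefficient:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # mean S(i); prefer the highest value
```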

Gap Statistics for Determining Optimal Clusters in K-means
1. Cluster the Data and Compute Within-Cluster Dispersion:

   • For each k, perform K-means clustering and calculate the WSS:

     W_k = Σ_{r=1}^{k} (1/(2n_r)) D_r

     where D_r is the sum of pairwise distances for all points in cluster r, and n_r is the number of points in cluster r.

2. Generate Reference Data Sets:

   • Generate B reference datasets with a random uniform distribution.
   • Each dataset should match the original in terms of number of observations and features.

3. Compute the Expected Dispersion for the Reference Data:

   • Apply K-means clustering to each reference dataset for different k and compute the WSS.
   • Expected Dispersion:

     E*[log(W_k)] = (1/B) Σ_{b=1}^{B} log(W_kb)

     where W_kb is the WSS for the b-th reference dataset.

4. Calculate the Gap Statistic:

   • Gap Statistic:

     Gap(k) = E*[log(W_k)] − log(W_k)

     The Gap Statistic measures the difference between the expected dispersion and the observed dispersion.

5. Choose Optimal k:

   • Select the k where the Gap Statistic reaches its maximum.
   • Alternatively, choose the smallest k where the Gap Statistic is within one standard deviation of the Gap Statistic at k + 1.

Importance of Gap Statistics

• Provides an objective approach to determine the number of clusters.
• Compares clustering results against a random uniform distribution to identify significant clustering structures.
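A minimal illustrative sketch of the gap computation (a simplification: it uses KMeans inertia_, the sum of squared distances to the centroids, as the dispersion measure in place of the pairwise-distance form of W_k):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap(X, k, B=10, seed=0):
    rng = np.random.default_rng(seed)
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                  .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
           for _ in range(B)]                # B uniform reference datasets
    return np.mean(ref) - log_wk             # Gap(k) = E*[log(W_k)] - log(W_k)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
print(max(range(1, 8), key=lambda k: gap(X, k)))  # k with the largest gap
```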

4.2 The Canopy Method
The Canopy Method is a pre-clustering method used in data mining for speeding up clustering operations on large data sets. It involves creating 'canopies', or rough groupings, followed by more precise clustering algorithms like K-means. This method is particularly effective for large datasets because it reduces computational costs by limiting the number of distance calculations.

Steps

1. Select Distance Metrics:
   • Choose distance metrics suitable for your data, such as Euclidean, Manhattan, or Cosine distance.

2. Set Thresholds T1 and T2:
   • Define two distance thresholds, T1 and T2 (with T1 > T2).
   • These thresholds determine how close points must be to form and belong to a canopy.

3. Create Canopies:
   • Randomly select a data point as a canopy center.
   • Canopy Formation Rule:
     – Include any point within distance T1 of the center in the canopy.
     – Remove any point within distance T2 from the dataset to prevent it from being a future canopy center.

4. Repeat the Process:
   • Continue until all points are either in a canopy or removed from the dataset.

5. Use Canopies for Further Clustering:
   • Apply a more precise clustering algorithm, like K-means, to each canopy.

• Distance Calculation:
  – The distance depends on the chosen metric. For the Euclidean distance between two points x and y:

    d(x, y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)² )

    where n is the number of dimensions, and x_i, y_i are the coordinates of x and y in the i-th dimension.

• Threshold Application:
  – For a canopy center c and a data point p:
    ∗ Include p in the canopy if d(c, p) < T1.
    ∗ Remove p from the dataset if d(c, p) < T2.

Importance

• Efficiency: Reduces computational costs for clustering large datasets.
• Scalability: Suitable for datasets too large for algorithms like K-means.
• Flexibility: Compatible with various distance metrics and clustering algorithms.
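A minimal NumPy sketch of canopy formation under the Euclidean metric (illustrative assumptions: T1 > T2 and uniformly random center selection):

```python
import numpy as np

def canopies(X, t1, t2, seed=0):
    """Rough canopy grouping; assumes Euclidean distance and t1 > t2."""
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(X))
    result = []
    while len(remaining):
        center = remaining[rng.integers(len(remaining))]  # random canopy center
        d = np.linalg.norm(X[remaining] - X[center], axis=1)
        result.append(remaining[d < t1])   # points within T1 join the canopy
        remaining = remaining[d >= t2]     # points within T2 can't be future centers
    return result                          # lists of point indices, one per canopy
```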

4.3 Gaussian Mixture Models (GMMs)
Gaussian Mixture Models are a probabilistic model used to represent normally distributed subpopulations within an overall population, often used in clustering.

1. Gaussian (Normal) Distribution

A Gaussian distribution is defined by its mean (μ) and variance (σ²). Its probability density function (PDF) for a single variable is:

f(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))

For a multivariate context, the PDF becomes:

f(x | μ, Σ) = (1/√((2π)^k |Σ|)) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))

where k is the dimensionality of the data.

2. Mixture of Gaussians

A Gaussian Mixture Model is a weighted sum of M Gaussian distributions:

p(x) = Σ_{i=1}^{M} π_i f(x | μ_i, Σ_i)

where π_i are the mixing coefficients, and f(x | μ_i, Σ_i) is the PDF of the i-th Gaussian component.

3. Expectation-Maximization (EM) Algorithm

The EM algorithm for GMMs involves the following steps:

E-step: Compute responsibilities:

γ(z_ik) = π_k f(x_i | μ_k, Σ_k) / Σ_{j=1}^{M} π_j f(x_i | μ_j, Σ_j)

M-step: Update the parameters:

π_k_new = (1/N) Σ_{i=1}^{N} γ(z_ik)    (4.1)

μ_k_new = Σ_{i=1}^{N} γ(z_ik) x_i / Σ_{i=1}^{N} γ(z_ik)    (4.2)

Σ_k_new = Σ_{i=1}^{N} γ(z_ik)(x_i − μ_k_new)(x_i − μ_k_new)ᵀ / Σ_{i=1}^{N} γ(z_ik)    (4.3)
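An illustrative scikit-learn sketch; GaussianMixture runs this EM loop internally:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Fit M = 3 components by EM; covariance_type controls the Sigma_k structure
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

labels = gmm.predict(X)       # hard assignments
resp = gmm.predict_proba(X)   # responsibilities gamma(z_ik)
print(gmm.weights_)           # mixing coefficients pi_k
```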

4.3.1 Challenges in Gaussian Mixture Models (GMM)
Choosing the Number of Components

Difficulty in determining the optimal number of Gaussian components. Techniques like BIC, AIC, or cross-validation may not always provide clear guidance.

Sensitivity to Initialization

Results can vary significantly based on the initial choice of parameters. Poor initialization can lead to suboptimal clustering solutions.

Convergence to Local Optima

The EM algorithm may converge to local rather than global optima, resulting in suboptimal solutions for the GMM.

Model Complexity

Increasing the number of components adds complexity to the model. A high-dimensional parameter space can be difficult to optimize and interpret.

Assumption of Gaussian Components

Assumes each cluster follows a Gaussian distribution, which may not be suitable for data that does not fit this assumption.

Covariance Structure

Choosing the right covariance structure (spherical, diagonal, tied, or full) is challenging and affects both the model's flexibility and computational complexity.

High-Dimensional Data

Performance can degrade in high-dimensional spaces due to sparsity, making it difficult to accurately estimate parameters.

Overfitting

There is a risk of overfitting, especially with a large number of components or overly complex models. This requires careful model validation and possibly regularization.

Computational Complexity

The EM algorithm can be computationally intensive, especially for large datasets and complex models.

Interpretability

Higher model complexity can lead to challenges in interpreting and explaining the model.

4.3.2 Comparison of GMMs and K-means
Overlapping Clusters

GMMs are better suited to handling overlapping clusters than K-means.

Non-Spherical Clusters

GMMs are better equipped than K-means to handle clusters that are not spherical in shape.

Number of Clusters

GMMs can estimate the number of clusters in the data using model selection techniques such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), while K-means requires the user to specify the number of clusters a priori.

Shape and Variances

GMMs can model clusters with different shapes and variances, while K-means assumes that the variance of the data within each cluster is the same and that the clusters are spherical in shape.

Computational Demand

GMMs are more computationally intensive than K-means.

Assumption Limitations

Unlike K-means, GMMs presume a Gaussian distribution within each cluster, a condition not always met in real datasets.

4.4 DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN is a popular clustering algorithm used in data analysis, particularly effective for identifying clusters of varying shapes in a dataset with noise (i.e., outliers).

Core Concepts

• ε (Epsilon): The radius of a neighborhood around a given point.

• MinPts: The minimum number of points required to form a dense region.

• Epsilon-Neighborhood of a Point: Nε(p) = {q ∈ D | dist(p, q) ≤ ε}. Here, Nε(p) represents the ε-neighborhood of a point p, consisting of all points q within the dataset D that are within a distance ε from p.

• Core Point: A point p is a core point if its ε-neighborhood contains at least MinPts points, i.e., |Nε(p)| ≥ MinPts.

• Border Point: A point p is a border point if its ε-neighborhood contains fewer than MinPts points but p is within the ε-neighborhood of a core point.

• Noise Point: A point p is a noise point if it is neither a core point nor a border point.

The Algorithm

1. Start: Pick an arbitrary point p from the dataset.

2. Core Point Check: Check if p is a core point. If yes, create a new cluster and add all points in Nε(p) to this cluster.

3. Expand Cluster: For each point q in the cluster, if q is a core point, add its ε-neighborhood to the cluster.

4. Iterate: Repeat steps 2 and 3 for each point in the dataset until all points are either assigned to a cluster or marked as noise.

5. Result: The output is a set of clusters with core and border points, and a set of noise points.

Key Properties

• DBSCAN does not require specifying the number of clusters in advance.

• It can discover clusters of arbitrary shape.

• Points not belonging to any cluster are treated as noise, making DBSCAN robust to outliers.

• DBSCAN is an effective clustering method for spatial data analysis and is widely used in various fields like astronomy, geospatial analysis, and bioinformatics.
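A short scikit-learn sketch of DBSCAN, assuming sklearn is available; eps and min_samples play the roles of ε and MinPts above, and the data is simulated for illustration.

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.vstack([np.random.randn(100, 2),                 # dense blob
                   np.random.randn(100, 2) + 5,             # second blob
                   np.random.uniform(-10, 15, (20, 2))])    # scattered noise
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
    print(set(labels))   # cluster ids; label -1 marks noise points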
4.5 OPTICS (Ordering Points To Identify the Clustering Structure) Algorithm

Core Concepts

1. Core Distance: For a point p in the dataset, the core distance is the smallest distance such that p is a core point with respect to ε and MinPts:

    \text{Core-Distance}_{\varepsilon,\text{MinPts}}(p) =
    \begin{cases}
    \text{UNDEFINED} & \text{if } |N_\varepsilon(p)| < \text{MinPts}, \\
    \text{dist}(p, q) & \text{otherwise},
    \end{cases}

where q is the MinPts-th nearest neighbor of p within ε.

2. Reachability Distance: The reachability distance of a point p from a point o is the maximum of the core distance of o and the Euclidean distance between o and p:

    \text{Reachability-Distance}_{\varepsilon,\text{MinPts}}(o, p) = \max(\text{Core-Distance}_{\varepsilon,\text{MinPts}}(o), \text{dist}(o, p))

for o, p in the dataset, where dist(o, p) is the Euclidean distance between o and p.

The Algorithm

1. Start: Pick an unprocessed point p from the dataset.

2. Retrieve Neighbors: Find the ε-neighborhood of p and calculate the core distance for p.

3. Ordering Points: If p is a core point, update the reachability distances of its neighbors and process them in increasing order of their reachability distance, recursively.

4. Create Reachability Plot: The reachability distances are plotted for all points, creating a reachability plot. This plot represents the clustering structure of the data.

5. Extract Clusters: Clusters can be extracted from the reachability plot based on a threshold value for reachability distance, which can be user-defined or automatically determined.

Key Properties

• OPTICS does not produce explicit clusters; instead, it creates an ordering of the data points representing the density-based clustering structure.

• It handles varying densities better than DBSCAN.

• It is particularly useful for datasets where clusters of different densities exist and the noise level is high.

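A minimal OPTICS sketch with scikit-learn, again an assumed setup rather than a prescribed one; reachability_[ordering_] gives the reachability plot described in step 4.

    import numpy as np
    from sklearn.cluster import OPTICS

    X = np.vstack([np.random.randn(100, 2),            # dense cluster
                   np.random.randn(50, 2) * 3 + 8])    # sparser cluster
    opt = OPTICS(min_samples=5).fit(X)
    reachability = opt.reachability_[opt.ordering_]    # values of the reachability plot
    print(reachability[:10], opt.labels_[:10])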
4.6 Spectral Clustering Algorithms

Spectral clustering algorithms can be broken down into the following key steps:

1. Similarity Graph Construction

• Create a similarity graph where each node represents a data point.

• Define edges based on the similarity, often using a Gaussian similarity function:

    S(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)

where x_i, x_j are data points and σ is a scaling parameter.

2. Graph Laplacian

• Construct a weighted adjacency matrix W, where W_{ij} indicates the similarity between nodes i and j.

• Create a degree matrix D, a diagonal matrix with D_{ii} being the sum of the weights of the edges connected to node i:

    D_{ii} = \sum_j W_{ij}

• Define the graph Laplacian L as:

    L = D - W

3. Eigenvalue Decomposition

• Perform eigenvalue decomposition on the Laplacian matrix L.

• Extract eigenvalues and their corresponding eigenvectors.

4. Forming the Feature Vector

• Construct a matrix U using the k eigenvectors corresponding to the k smallest eigenvalues.

• U becomes an n × k matrix, where n is the number of data points.

5. Clustering

• Treat each row of U as a point in R^k.

• Use a standard clustering algorithm like k-means to cluster these points.

• The resulting clusters correspond to the clusters in the original dataset.

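The five steps can be sketched directly with NumPy plus k-means from scikit-learn; sigma and k here are assumed parameters, and the unnormalized Laplacian L = D − W from step 2 is used.

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(X, k=2, sigma=1.0):
        # Steps 1-2: Gaussian similarity graph, degree matrix, Laplacian L = D - W
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq / (2 * sigma ** 2))
        D = np.diag(W.sum(axis=1))
        L = D - W
        # Steps 3-4: eigenvectors for the k smallest eigenvalues of L
        vals, vecs = np.linalg.eigh(L)
        U = vecs[:, :k]                       # n x k feature matrix
        # Step 5: cluster the rows of U
        return KMeans(n_clusters=k, n_init=10).fit_predict(U)

    X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6])
    print(spectral_clustering(X, k=2))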
4.7 Markov Chain Clustering (MCL)

Markov Chain Clustering (MCC), specifically the Markov Cluster Algorithm (MCL), is a process for finding clusters (i.e., groups of related items) in graphs. It is based on the idea of random walks on the graph, which can be described using Markov chains. Here is a basic overview of how the algorithm works, including the key equations involved:

1. Representation of Graph

A graph is represented by its adjacency matrix A. In this matrix, the element A_{ij} represents the weight of the edge from node i to node j. If there is no edge, A_{ij} = 0.

2. Normalization

The adjacency matrix is converted into a stochastic matrix M by normalizing each row to sum to 1. This is done by dividing each element by the sum of its row:

    M_{ij} = \frac{A_{ij}}{\sum_k A_{ik}}

This normalization ensures that each row of M represents a probability distribution, consistent with the idea of a Markov chain, where the transition from one node to another is based on probabilities.

3. Random Walk Simulation

The core idea of MCL is to simulate random walks on the graph. This is done by repeatedly multiplying the matrix M by itself:

    M^{(2)} = M \times M, \quad M^{(3)} = M^{(2)} \times M, \quad \ldots

Repeated multiplication corresponds to taking longer and longer random walks on the graph.

4. Expansion and Inflation

Two key operations are performed during each iteration:

• Expansion (E): This involves raising the matrix to a power (matrix multiplication), which corresponds to the random walk process. The expansion step tends to spread out the probability mass, which can lead to flow between different regions of the graph.

    M^{(\text{expanded})} = M^n

Here, n is usually 2, but can be adjusted.

• Inflation (I): After expansion, the inflation step is applied. This involves raising each element of the matrix to a power r (inflation factor) and then re-normalizing the rows to sum to 1. Inflation strengthens intra-cluster probabilities and weakens inter-cluster probabilities.

    M^{(\text{inflated})}_{ij} = \frac{\left(M^{(\text{expanded})}_{ij}\right)^r}{\sum_k \left(M^{(\text{expanded})}_{ik}\right)^r}

The inflation parameter r is crucial; a higher value strengthens the contrast between strong and weak connections, leading to more distinct clusters.

5. Convergence

The process of expansion and inflation is repeated until the matrix M converges, i.e., it no longer changes significantly with further iterations. The final matrix reveals the clusters: nodes that end up in the same row with high probabilities are considered to be in the same cluster.

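A bare-bones NumPy sketch of the expansion/inflation loop; the row-stochastic convention follows the normalization above, while the tolerance and iteration cap are implementation assumptions.

    import numpy as np

    def mcl(A, r=2.0, n=2, iters=50, tol=1e-6):
        M = A / A.sum(axis=1, keepdims=True)        # normalize each row to sum to 1
        for _ in range(iters):
            M_new = np.linalg.matrix_power(M, n)    # expansion: M^n
            M_new = M_new ** r                      # inflation: elementwise power r
            M_new /= M_new.sum(axis=1, keepdims=True)
            if np.abs(M_new - M).max() < tol:       # convergence check
                break
            M = M_new
        return M

    A = np.array([[1, 1, 1, 0],
                  [1, 1, 1, 0],
                  [1, 1, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
    print(np.round(mcl(A), 2))   # rows placing their mass on the same nodes share a cluster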
4.8 Agglomerative Clustering Algorithm

Initialization
Given a dataset X = {x₁, x₂, ..., xₙ}, where each xᵢ is a data point. Initially, each data point xᵢ is considered as a separate cluster Cᵢ, thus having n clusters.

Finding the Closest Pair
Calculate the distances between every pair of clusters and identify the pair with the minimum distance.

Distance Computation
Compute the distance between two clusters, Cᵢ and Cⱼ, using one of the following distance metrics:

• Single-linkage:

    d_{\text{single}}(C_i, C_j) = \min\{\|a - b\| : a \in C_i, b \in C_j\}

• Average-linkage:

    d_{\text{average}}(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{a \in C_i} \sum_{b \in C_j} \|a - b\|

• Complete-linkage:

    d_{\text{complete}}(C_i, C_j) = \max\{\|a - b\| : a \in C_i, b \in C_j\}

• Centroid-linkage:

    d_{\text{centroid}}(C_i, C_j) = \|\mu_{C_i} - \mu_{C_j}\|

where µ_{Cᵢ} and µ_{Cⱼ} are the centroids of clusters Cᵢ and Cⱼ, respectively.

• Ward's Method:

    d_{\text{Ward}}(A, B) = \sum_{x \in A \cup B} \|x - \mu_C\|^2 - \left( \sum_{x \in A} \|x - \mu_A\|^2 + \sum_{x \in B} \|x - \mu_B\|^2 \right)

where µ_A, µ_B, and µ_C are the centroids of clusters A, B, and the merged cluster C, respectively.

Merging
Identify and merge the two clusters Cᵢ and Cⱼ that have the smallest distance between them as calculated by the chosen metric.

Update Distance Matrix
After merging, the distance matrix is updated. If a new cluster Cₖ is formed from Cᵢ and Cⱼ, update its distances to all other clusters.

Repeat
Repeat steps 2-5 until only one cluster remains or a desired number of clusters is reached.

Cut the Dendrogram
The dendrogram, a tree-like diagram representing the series of merges, can be cut at different levels to obtain a specific number of clusters.

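A short SciPy sketch of the merge-and-cut procedure; 'ward' is one of the linkages defined above, and the two-cluster cut is an assumed choice.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 5])
    Z = linkage(X, method='ward')                      # the series of merges (dendrogram)
    labels = fcluster(Z, t=2, criterion='maxclust')    # cut the dendrogram into 2 clusters
    print(labels)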
Supervised Machine Learning

5.1 Linear Regression Model


Linear regression is a statistical technique used to model and analyze the relationship between a dependent variable and one or more independent variables
by fitting a straight line to the data.

5.1.1 Assumptions Of Linear Regression

Linearity
The relationship between the independent and dependent variables is linear, meaning changes in the independent variables result in proportional changes in the dependent variable.

Independence
Each observation in the dataset is independent of the others, ensuring that the value of one observation doesn't influence or depend on another.

Homoscedasticity
The variance of the error terms (residuals) is constant across all levels of the independent variables, indicating uniform dispersion of residuals.

Normal Distribution of Residuals
The residuals (differences between observed and predicted values) are normally distributed, which is especially crucial for small sample sizes.

No or Little Multicollinearity
There is minimal or no multicollinearity, meaning that the independent variables are not highly correlated with each other, to avoid inflated variances in coefficient estimates.

No Auto-correlation
In the residuals, there is an absence of auto-correlation, particularly important in time series data, where one time period's errors shouldn't influence another's.

Fixed Independent Variables
The independent variables are assumed to be measured without significant error, implying that any measurement error is negligible.

5.2 Ordinary Least Squares (OLS)

Least squares is a mathematical method used to minimize the sum of squared differences between observed data points and model predictions.

Ordinary least squares (OLS) is a specific type of least squares method used in linear regression. It finds the best-fitting linear equation by minimizing the sum of squared errors between the observed values and the values predicted by the linear model.

Basic Model
The linear regression model is expressed as:

    y_i = x_i'\beta + u_i

where y_i is the outcome variable, x_i is a vector of regressor variables, β is the coefficient vector, and u_i is the error term.

Assumptions

• Linearity: The relationship between regressors and the dependent variable is linear.

• Conditional Independence: E(U|X) = 0, the expectation of the error term, given the regressors, is zero.

• No Multi-collinearity: The matrix X has full rank k, indicating no perfect collinearity among regressors.

• Homoskedasticity: Var(U|X) = σ²Iₙ, the variance of the error term is constant.

Objective
Minimize the sum of squared errors:

    Q = \sum_{i=1}^{n} (y_i - x_i'\beta)^2

Solution
Derive β̂ by setting the derivative of Q with respect to β to zero:

• Differentiate Q: −2X'Y + 2X'Xβ = 0

• Solve for β: X'Xβ = X'Y

• Obtain β̂: β̂ = (X'X)⁻¹X'Y

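A NumPy sketch of the closed-form solution β̂ = (X'X)⁻¹X'Y; the simulated data and true coefficients are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # intercept + 2 regressors
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(scale=0.3, size=200)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve (X'X) beta = X'y
    print(beta_hat)                                # close to beta_true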
5.2.1 OLS as Projection

The Ordinary Least Squares (OLS) method projects the outcome variable y onto the space spanned by the regressors X, analogous to Ax in linear algebra.

Orthogonality Principle
The residual (y − Xβ) is orthogonal to the span of X, and lies in the left nullspace of X.

Deriving the OLS Estimator β
From orthogonality, we have:

    X'(y - X\beta) = 0

which leads to:

    X'X\beta = X'y

Solving for β gives:

    \beta = (X'X)^{-1}X'y

Projection Matrices
The orthogonal projection matrix is defined as:

    P_x = X(X'X)^{-1}X'

The residuals are expressed as:

    \hat{u} = y - X\beta = y - P_x y = (I_n - P_x)y = M_x y

where M_x projects onto the space orthogonal to the span of X.

5.2.2 Applying SVD to OLS and Ridge Regression

Standard OLS Formula
The standard OLS formula to estimate the fitted values is:

    X\hat{\beta} = X(X'X)^{-1}X'y

Applying SVD to OLS
By substituting the data input matrix X with its SVD components UDV', the formula becomes:

    X\hat{\beta} = UDV'(VD^2V')^{-1}VDU'y

Simplifying further:

    X\hat{\beta} = UD(D^2)^{-1}DU'y = UU'y

In this revised formula, D is a diagonal matrix with the square root of the eigenvalues, and U and V are orthonormal matrices. This transformation shows that the fitted values in OLS regression can be computed with respect to the orthonormal basis U.

Ridge Regression Formula
Ridge regression modifies the OLS regression by adding a penalty term to the size of the coefficients. The objective is to minimize the penalized sum of squares. The solution to ridge regression is given by:

    \hat{\beta}_{\text{ridge}} = (X'X + \lambda I_k)^{-1}X'Y

Applying SVD to Ridge Regression
Substituting the SVD formula into the ridge regression formula, the fitted values become:

    X\hat{\beta}_{\text{ridge}} = UD(D^2 + \lambda I)^{-1}DU'y

This results in a clearer understanding of regularization. The predicted values are shrunk by the factor d_j^2/(d_j^2 + \lambda). Greater shrinkage is applied to variables explaining a lower fraction of the variance, in line with the principles of Principal Component Analysis (PCA) and ridge regression. Ridge regression applies a weighted shrinkage method, as opposed to PCA, which truncates variables below a certain threshold.

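A small NumPy check of the SVD form of the ridge fitted values, Xβ̂_ridge = UD(D² + λI)⁻¹DU'y; lam and the simulated data are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    y = rng.normal(size=100)
    lam = 2.0

    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d ** 2 / (d ** 2 + lam)            # per-direction shrinkage d_j^2 / (d_j^2 + lam)
    fitted_svd = U @ (shrink * (U.T @ y))
    fitted_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
    print(np.allclose(fitted_svd, fitted_direct))   # True: the two forms agree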
5.2.3 Relationship between CEF and Regression

1. The Conditional Expectation Function (CEF) is defined as E[Y|X], representing the expected value of Y given X.

2. In regression, the dependent variable Y_i is decomposed as Y_i = E[Y_i|X_i] + ε_i, where ε_i is the error term, orthogonal to X_i. The primary goal in regression is to find a function of X, say m(X), that minimizes the mean squared error, \min_m E[(Y_i - m(X_i))^2], where the optimal choice for m(X) turns out to be the CEF.

3. In OLS regression, the aim is to linearly approximate the CEF. The regression problem \beta = \arg\min_b E[(E[Y_i|X_i] - X_i'b)^2] indicates that minimizing the squared differences between Y_i and X_i'b is equivalent to approximating the CEF linearly.

5.3 Method of Moments

Introduction
The Method of Moments is a statistical technique for estimating the parameters of a probability distribution or a model. This approach compares theoretical moments from a probability distribution, like mean and variance, with empirical moments derived from data.

Understanding Moments

• Definition: Moments are quantitative measures of a function's shape.

• Types of Moments:

  – Raw Moments: The nth raw moment of a random variable x, denoted as µₙ, is E[xⁿ].

  – Central Moments: The nth central moment, denoted as µ′ₙ, is E[(x − µ)ⁿ], where µ is the mean.

5.3.0.1 Method of Moments Estimator

• Theoretical Moment: Calculated for the entire population using its probability distribution, it represents population measures like mean or variance. An example is the theoretical mean E[X], derived using the expectation operator.

• Sample Moment: An empirical measure calculated from a data sample, used to approximate theoretical moments. The sample mean X̄ = \frac{1}{n} \sum_{i=1}^{n} X_i is a common example, where n is the sample size and Xᵢ are the sample values.

• Replacing theoretical moments with sample moments: This substitution is necessary because theoretical moments often depend on unknown population parameters, while sample moments can be directly computed from the available data.

• The principle: With a sufficiently large and representative sample, the sample moments should be good approximations of the theoretical moments.

Example 1: Estimator for Sample Mean

• Population Moment: µ = E[X]

• Objective: Find an estimator for the sample mean.

• Process:

  – Sample Analogue: Replace the expected value E[X] with a sample mean X̄.

  – Estimator for µ: \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}

Example 2: Normal Distribution

• Given: X₁, X₂, ..., Xₙ ∼ N(µ, σ²)

• Objective: Find estimators for the parameters µ and σ².

• Process:

  – First Moment (Mean): Population Moment: E[X] = µ; Sample Analogue: X̄; Estimator for µ: µ̂ = X̄.

  – Second Moment (Variance): Population Moment: E[X²] = µ² + σ²; expand using µ's estimator; Sample Analogue: \frac{1}{n} \sum_{i=1}^{n} X_i^2; Estimator for σ²: \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2.

Example 3: Poisson Distribution

• Given: X₁, X₂, ..., Xₙ ∼ Poisson(λ)

• Objective: Find estimators for the parameter λ.

• Process:

  – Equality in Poisson Distribution: Population Moment: E[X] = var(X) = λ.

  – Estimators for λ:

    ∗ Estimator 1: \hat{\lambda}_1 = \bar{X}.

    ∗ Estimator 2: \hat{\lambda}_2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2.

  Given that we have two moment conditions but only one parameter to estimate, it is necessary to find a method to effectively 'merge' these conditions. Relying on just one of these conditions would result in underutilizing the available information.
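A tiny NumPy sketch of the sample analogues in Example 2; the simulated normal data is an assumption.

    import numpy as np

    x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=10_000)
    mu_hat = x.mean()                             # sample analogue of E[X]
    sigma2_hat = (x ** 2).mean() - mu_hat ** 2    # sample analogue of E[X^2] - mu^2
    print(mu_hat, sigma2_hat)                     # close to mu = 3 and sigma^2 = 4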
5.3.1 General Framework of Moment Conditions

Moment Conditions

• Moment conditions in regression are expressed as a function g(X_i, β).

• X_i represents the observed data, including dependent variables y_i, independent variables X_i, and any instruments Z_i.

• β is a vector of parameters to estimate, with a length of k.

Model Identification

• A model is identified if the solution for β is unique.

• Uniqueness is expressed as E[g(X_i, β)] = 0 and E[g(X_i, β̂)] = 0 implying β = β̂.

• At least as many restrictions (moment conditions) as parameters (k) are needed to identify the model.

Application in OLS Regression

• OLS Moment Condition: E[X_i U_i] = 0 or E[X_i(y_i − X_i'β)] = 0, where U_i is the error term.

• This condition is used to solve for the OLS estimator β̂.

Extension to More Complex Models (IV Regression)

• Instrumental Variables (IV) Regression: Useful when the model is overidentified (l instruments for k parameters).

• IV Estimator Formula: Derived by solving the moment condition, β̂_IV = (Z'X)⁻¹Z'y.

• IV regression uses instruments Z_i to resolve issues in the OLS model.

5.3.2 Instrumental Variables in Regression Models

• Instrumental variables are used in statistical analysis to address endogeneity issues, such as omitted variables that affect both X and Y, where explanatory variables in a regression model are correlated with the error term.

• They provide a way to estimate causal relationships by using a variable (the instrument) that is correlated with the explanatory variable but not with the error term.

Example: To assess education's impact on income without bias from individual ability, use an instrumental variable like proximity to a university. This approach isolates the influence of education on income, separate from individual ability.

Moment Conditions with Instrumental Variables

• The moment conditions involving instrumental variables can be represented as g(X_i, β) = Z_i(y_i − X_i'β) or E[Z_i U_i] = 0.

• Here, Z_i are the instrumental variables, y_i the dependent variables, X_i the independent variables, β the parameters, and U_i the error terms.

• The condition E[Z_i U_i] = 0 implies that the instruments are uncorrelated with the error term, a critical requirement for valid instrumental variables.

Solving the Moment Condition

• The moment condition for IV regression is given by

    0 = \sum_{i=1}^{n} Z_i (y_i - X_i'\hat{\beta}_{IV})

• Solving this condition yields the IV estimator \hat{\beta}_{IV} = (Z'X)^{-1}Z'y.

Role of Instrumental Variables in Addressing Endogeneity

• IV regression substitutes problematic OLS moments with new ones that incorporate instruments, addressing the endogeneity bias.

• Instruments Z_i are selected for their correlation with endogenous independent variables and their lack of correlation with the error term.

• This substitution allows the IV estimator to isolate the variation in the explanatory variable that is unaffected by endogeneity.

Condition of Perfect Identification

• For a model to be perfectly identified, the number of instruments (l) should be equal to the number of parameters (k).

• Perfect identification ensures that there is just enough information to uniquely identify the model parameters.

Application in Complex Models

• The IV approach extends beyond simple linear models and can be applied to more complex regression models where standard OLS assumptions do not hold.

• It is particularly useful in models where endogeneity is a concern and the model's identification relies on the validity of the instruments.

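A NumPy sketch of β̂_IV = (Z'X)⁻¹Z'y on simulated data with an endogenous regressor; the data-generating choices are assumptions made to exhibit the OLS bias.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    z = rng.normal(size=n)                        # instrument: correlated with x, not with u
    u = rng.normal(size=n)                        # structural error
    x = 0.8 * z + 0.5 * u + rng.normal(size=n)    # endogenous regressor
    y = 2.0 * x + u

    Z = np.column_stack([np.ones(n), z])
    X = np.column_stack([np.ones(n), x])
    beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_iv[1], beta_ols[1])   # IV slope near 2.0; OLS slope biased upward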
5.3.3 Generalized Method of Moments (GMM)

General Concept of GMM

• Overidentified Models:

  – In scenarios where the number of restrictions (l) is greater than the number of parameters to estimate (k), i.e., l > k, the model is said to be overidentified.

  – In such cases, traditional methods like Ordinary Least Squares (OLS) or Instrumental Variable (IV) regression cannot be directly applied to estimate the parameter vector β.

• Combining Restrictions:

  – GMM seeks to find an estimate of β that brings the sample moments as close to zero as possible. This involves combining multiple moment conditions in an optimal way.

  – The moment conditions for all restrictions are still equal to zero, but the sample approximations may not be exactly zero due to finite sample sizes.

GMM Estimation

The GMM estimator, β̂_GMM, is defined as the value of β that minimizes the weighted distance of \sum_{i=1}^{n} g(X_i, \beta), where g(X_i, β) is a vector of functions representing the moment conditions. The GMM estimation equation can be expressed as:

    \hat{\beta}_{GMM} = \arg\min_{\beta \in B} \left( \sum_{i=1}^{n} g(X_i, \beta) \right)' W \left( \sum_{i=1}^{n} g(X_i, \beta) \right)

Where:

• W is an l × l matrix of weights used to select the ideal linear combination of instruments.

• The function g(X_i, β) represents the moment conditions, which may involve observed data, endogenous variables, and instruments.

GMM Formula for Linear Regression Models

In the context of linear regression models that are overidentified, the general GMM formula is given by:

    \hat{\beta}_{GMM} = \left( (X'Z) W (Z'X) \right)^{-1} (X'Z) W (Z'y)

Here:

• X and Z represent the observed data and instruments, respectively.

• W is again the weighting matrix.

• X' and Z' are transposes of X and Z respectively.

Optimal Choice of Weighting Matrix

• The choice of the weighting matrix W is crucial in GMM. For instance, when W = (Z'Z)⁻¹, the GMM estimator becomes equivalent to the Instrumental Variable (IV) estimator.

• The optimal choice of W depends on the specifics of the model and the nature of the data.

5.4 Maximum Likelihood

5.4.1 Maximum Likelihood Estimation (MLE)

MLE is a statistical method used to estimate the parameters of a model, aiming to find the parameter values that make the observed data most probable.

Basic Idea

• Given data and a statistical model with parameters, MLE seeks to find the parameter values that maximize the likelihood of observing the data.

Likelihood Function

• The likelihood function L(θ) is a function of the parameters θ and represents the probability of the observed data given these parameters.

• For independent and identically distributed observations X = {x₁, x₂, ..., xₙ}, the likelihood function is the product of the individual observations' PDFs or PMFs:

    L(\theta) = f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)

Log-Likelihood

• The log-likelihood, ln L(θ), transforms the product into a sum, facilitating differentiation:

    \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)

• Maximizing ln L(θ) is equivalent to maximizing L(θ) as the logarithm is monotonic.

Maximizing the Log-Likelihood

• Find the parameter values that maximize the log-likelihood by differentiating ln L(θ) with respect to θ and setting it to zero. Numerical methods may be required for complex cases:

    \frac{d}{d\theta} \ln L(\theta) = 0

Example: Normal Distribution

• For observations assumed to be normally distributed with unknown mean µ and known variance σ², the log-likelihood is:

    \ln L(\mu) = \sum_{i=1}^{n} \ln\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \right]

• Maximizing this log-likelihood with respect to µ yields the MLE for the mean of a normal distribution.

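A short SciPy sketch of maximizing the normal log-likelihood above numerically (known σ²); it confirms the closed-form answer µ̂ = x̄.

    import numpy as np
    from scipy.optimize import minimize_scalar

    x = np.random.default_rng(0).normal(loc=1.5, scale=1.0, size=1000)
    sigma2 = 1.0

    def neg_log_lik(mu):
        # negative log-likelihood up to constants that do not depend on mu
        return 0.5 * np.sum((x - mu) ** 2) / sigma2

    res = minimize_scalar(neg_log_lik)
    print(res.x, x.mean())   # the numerical optimum matches the sample mean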
5.4.2 OLS Estimator using Maximum Likelihood

Model Specification
Start with the linear regression model:

    y_i = x_i'\beta + u_i

where y_i is the dependent variable, x_i is the vector of independent variables, β is the vector of coefficients, and u_i is the error term.

Assumption about Error Term
Assume that the error terms u_i, conditional on x_i, are normally distributed with mean 0 and unknown variance σ²:

    u_i \mid x_i \sim N(0, \sigma^2)

Probability Density Function (PDF)
The PDF of a single observation is given by:

    f(y_i, x_i; \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - x_i'\beta)^2}{2\sigma^2} \right)

Likelihood Function
The likelihood (joint PDF) of observing all data is:

    L(\beta, \sigma^2) = \prod_{i=1}^{n} f(y_i, x_i; \beta, \sigma^2)

Log-Likelihood Function
Convert the likelihood to log-likelihood for simplification:

    \ln L(\beta, \sigma^2) = \sum_{i=1}^{n} \ln f(y_i, x_i; \beta, \sigma^2)

Maximizing Log-Likelihood
To find the maximum likelihood estimators, take the derivative of the log-likelihood function with respect to β and σ², and set them to zero.

Deriving the Estimators
Solving these equations gives the maximum likelihood estimators for β and σ²:

    \hat{\beta}_{ML} = \left( \sum_{i=1}^{n} x_i x_i' \right)^{-1} \left( \sum_{i=1}^{n} x_i y_i \right)

    \hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i'\hat{\beta}_{ML})^2

Bernoulli Trial Estimator using Maximum Likelihood

Problem Setup
Consider a dataset consisting of results from a series of coin flips, where we aim to estimate the probability of the coin landing heads.

Bernoulli Distribution
Assume that each coin flip is an independent Bernoulli trial with probability p of landing heads.

Probability Mass Function (PMF)
The PMF for a single observation x_i (taking the value of 0 or 1) is given by:

    f(x_i; p) = p^{x_i}(1 - p)^{1 - x_i}

Likelihood Function
The likelihood function for observing the entire dataset is the product of individual PMFs:

    L(p) = \prod_{i=1}^{n} f(x_i; p)

Log-Likelihood Function
The log-likelihood function is:

    \ln L(p) = \sum_{i=1}^{n} \ln f(x_i; p) = \sum_{i=1}^{n} \left( x_i \ln p + (1 - x_i) \ln(1 - p) \right)

Maximizing Log-Likelihood
To find the maximum likelihood estimator for p, take the derivative of the log-likelihood function with respect to p and set it to zero.

Deriving the Estimator
Solving this equation gives the maximum likelihood estimator for p:

    \hat{p}_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i

which is the sample mean of the observed values.

5.5 Logistic Regression

Linear Regression

• Used for predicting a continuous outcome.

• y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε

  – y: Dependent variable (the outcome to predict).
  – β₀, β₁, ..., βₙ: Coefficients (weights) of the model.
  – x₁, x₂, ..., xₙ: Independent variables (predictors).
  – ε: Error term.

• Example Usage: Predicting house prices based on features like size, location, number of bedrooms, etc.

Logistic Regression

• Used for binary classification (predicting a categorical outcome with two classes).

• \ln\left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n

  – p: Probability of the dependent variable being in one of the two classes.
  – Other terms are similar to linear regression.

• Example Usage: Predicting whether an email is spam or not based on characteristics like the frequency of certain words, sender, etc.

Assumptions Compared

• Nature of Dependent Variable: Linear regression assumes the dependent variable is continuous and normally distributed; logistic regression assumes it is categorical, typically binary.

• Relationship Between Variables: Linear regression assumes a linear relationship between the dependent and independent variables; logistic regression assumes a linear relationship between the log-odds (logit) of the dependent variable and the independent variables.

• Distribution of Errors/Residuals: Linear regression assumes that the residuals (errors) are normally distributed; logistic regression does not assume a normal distribution of residuals.

• Homoscedasticity: Linear regression assumes homoscedasticity, meaning the variance around the regression line is the same for all values of the predictor variable; this assumption is not applicable to logistic regression, as it deals with categorical outcomes.

• Independence of Errors: Linear regression assumes no autocorrelation in the residuals, with errors independent of each other; logistic regression also assumes independence of errors (no autocorrelation).

• Multicollinearity: Linear regression assumes little or no multicollinearity among the independent variables; logistic regression similarly assumes little or no multicollinearity.

• Sample Size: Linear regression can often work with smaller sample sizes, though the exact size depends on the number of predictors; logistic regression typically requires a larger sample size, especially if the outcome is rare.

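A minimal scikit-learn sketch contrasting the two models; the synthetic targets are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y_cont = 3 * X[:, 0] - X[:, 1] + rng.normal(size=500)                 # continuous outcome
    y_bin = (X[:, 0] + X[:, 1] + rng.normal(size=500) > 0).astype(int)    # binary outcome

    print(LinearRegression().fit(X, y_cont).coef_)    # effects on the outcome itself
    print(LogisticRegression().fit(X, y_bin).coef_)   # effects on the log-odds scale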
5.6 Generalized Linear Models (GLMs)

Generalized Linear Models (GLMs) are an advanced form of linear regression models, characterized by their ability to handle a variety of response variable distributions and to establish a distinct relationship between response and predictor variables. The essence of GLMs lies in three core components:

1. Random Component: This defines the probability distribution of the response variable, Y. In GLMs, Y is assumed to follow a distribution from the exponential family, such as Normal, Binomial, or Poisson distributions.

2. Systematic Component: Represents the explanatory (independent) variables, X₁, X₂, ..., Xₙ, and their linear combination, often denoted as η. The equation for this component is:

    \eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n

Here, β₀, β₁, ..., βₙ are the model's parameters (coefficients).

3. Link Function: Denoted as g(), this function connects the systematic component to the expected value of the response variable. It ensures that the model accommodates the distribution type of the response variable. The relationship is described as:

    g(E(Y)) = \eta

where E(Y) is the expected value of Y.

Full GLM
The GLM equation combines these components as:

    E(Y) = g^{-1}(\eta) = g^{-1}(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n)

Here, g⁻¹() is the inverse of the link function, transforming the linear predictor η back to the response variable's scale.

Flexibility in Response Variable Distribution
GLMs are highly versatile, allowing for response variables from any member of the exponential family of distributions:

• Normal Distribution: Utilized in standard linear regression where the response variable can take any continuous value. In this simplest form of GLM, often equivalent to ordinary least squares regression, the model is:

    Y = \beta_0 + \beta_1 X + \epsilon

where Y is continuous and normally distributed, and ε is the error term.

• Binomial Distribution: Employed in logistic regression, suitable for binary outcomes such as success/failure or yes/no. The logistic model predicts the probability of occurrence of an event and is expressed as:

    \log\left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 X

where p represents the probability of one of the binary outcomes.

• Poisson Distribution: Applied in scenarios where the response variable represents count data, typically non-negative integers, such as the number of occurrences of an event. Poisson regression, used for modeling count data, is formulated as:

    \log(E(Y)) = \beta_0 + \beta_1 X

• Gamma Distribution: Often used in situations where the response variable is positively skewed and continuous, such as for modeling time-to-event data (e.g., survival times). The Gamma distribution in GLMs typically uses the inverse or log link function. The model can be expressed as:

    g(E(Y)) = \beta_0 + \beta_1 X_1 + \ldots + \beta_n X_n

where E(Y) is the expected value of the response variable, Y, which follows a Gamma distribution. The link function g() could be the inverse link g(E(Y)) = 1/E(Y) or the logarithmic link g(E(Y)) = log(E(Y)).

• Multinomial Distribution: Used when the response variable can take on more than two categories, as in the case of multinomial logistic regression. This model is an extension of logistic regression to multiple categories. The Multinomial distribution in GLMs can be represented by a set of equations, one for each category. For a response variable with k categories, the model can be:

    \log\left( \frac{p_i}{p_k} \right) = \beta_{i0} + \beta_{i1} X_1 + \ldots + \beta_{in} X_n

for i = 1, 2, ..., k − 1, where pᵢ is the probability of the ith category. The reference category k serves as the baseline, and the logit link is used.
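As one worked case, a Poisson GLM with the log link can be fit with statsmodels; this is a sketch assuming statsmodels is installed, with simulated count data.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(500, 1)))
    y = rng.poisson(lam=np.exp(0.5 + 1.2 * X[:, 1]))   # log(E(Y)) = 0.5 + 1.2 X

    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    print(model.params)   # estimates of beta_0 and beta_1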
5.7 Support Vector Machines (SVMs)

Support Vector Machines (SVMs) are a class of supervised learning models that find a hyperplane in an N-dimensional space (N — number of features) that distinctly classifies data points. They aim to maximize the margin between data classes, with support vectors being the key data points nearest to the hyperplane. SVMs can efficiently perform a non-linear classification using the kernel trick, implicitly mapping inputs into high-dimensional feature spaces.

Representation of Data Points

• Each data point in an SVM is represented as a point in an n-dimensional space (where n is the number of features), with each feature being a particular coordinate. For a data point x_i, it can be represented in a D-dimensional space as a feature vector x ∈ R^D.

Hyperplane

• The hyperplane equation, the decision boundary, is given by w^T x + b = 0, where w is the weight vector, b is the bias term, and x is the input feature vector.

Classification Decision Rule

• The decision function for classifying data points is based on the sign of f(x) = w^T x + b.

• If f(x) > 0, the data point is classified into one class (positive class); if f(x) < 0, into the other class (negative class).

Margin Maximization

• The SVM aims to maximize the margin, the distance between the hyperplane and the nearest data points of each class (support vectors), given by 2/‖w‖.

• The optimization problem is to maximize 2/‖w‖ subject to y_i(w^T x_i + b) ≥ 1 for all i, where y_i is the label of x_i.

Kernel Trick

• For non-linearly separable data, SVMs use kernel functions to transform the data into a higher dimension for linear separability.

• The kernel function K(x_i, x_j) = φ(x_i)^T φ(x_j), where φ is the transformation function. Common kernels include Polynomial, RBF, and Sigmoid.

Handling Misclassifications (Soft Margin)

• Slack variables ξ_i allow misclassifications, modifying the optimization problem to minimize

    \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i

subject to y_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i, where C is the regularization parameter.

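A short soft-margin, kernelized SVM sketch with scikit-learn; C and gamma correspond to the regularization parameter and the RBF kernel width, and the circular labels are an assumed non-linearly-separable example.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # labels not linearly separable

    clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
    print(clf.n_support_, clf.score(X, y))   # support vectors per class, training accuracy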
5.7.1 Lagrangian SVMs

The Lagrangian in SVMs combines the objective (minimizing ½‖w‖²) with the constraints (data points correctly classified):

    L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]

Here, α_i are Lagrange multipliers for each constraint, ensuring each data point x_i is on the correct side of the margin.

Partial Derivatives

• Finding optimal values of w, b, and α involves taking partial derivatives of L and setting them to zero.

• Derivative with Respect to w: This derivative leads to:

    w = \sum_{i=1}^{n} \alpha_i y_i x_i

indicating the optimal weight vector w as a combination of the support vectors.

• Derivative with Respect to b: Setting the derivative with respect to b to zero results in:

    \sum_{i=1}^{n} \alpha_i y_i = 0

maintaining balance between the classes.

Dual Problem Formulation

• Transformation to Dual Problem: By solving for w and substituting it back into the Lagrangian, we obtain the dual problem, which only involves the Lagrangian multipliers α. This dual problem is easier to solve as it typically has fewer dimensions than the original problem:

    L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j

• The dual form is computationally efficient, especially for large feature spaces or kernel methods.

Incorporating Basis Functions

• Basis Function Transformation: Basis functions φ(x) transform input data into a higher-dimensional space, facilitating separation of non-linearly separable data.

• Kernel Function: The kernel function K(x_i, x_j) = φ(x_i)^T φ(x_j) computes inner products in the transformed space.

• Modified Dual Lagrangian: The Lagrangian with kernel function:

    L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

Optimization and Decision Function

• Optimization: Find α that maximizes the dual Lagrangian under constraints.

• SVM Decision Function: The decision function for new inputs x:

    f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b

where b is determined using support vectors.

This comprehensive process allows SVMs to efficiently classify both linearly and non-linearly separable data.

5.7.2 Comparison Between OLS and SVM

Comparing the approaches of Ordinary Least Squares (OLS) and Support Vector Machines (SVM) in various aspects.

Minimizing Coefficients

• OLS: Minimizes squared error to find coefficients that minimize the sum of squared residuals.

• SVM: Minimizes the l2-norm of the coefficient vector, seeking a sparse solution with many coefficients set to zero.

Error Term and Constraints

• OLS: Does not incorporate a margin for error, aiming to minimize squared residuals without a strict margin.

• SVM: Manages error using constraints with a predetermined margin (ε). Absolute errors are constrained to be less than or equal to ε, with deviations denoted as ξ.

Sensitivity to Outliers

• OLS: Sensitive to outliers as it gives equal weight to all data points.

• SVM: Less sensitive due to the margin concept (ε). Outliers have limited impact on the model.

Modeling Non-linear Relationships

• OLS: Assumes linear relationships; may struggle with non-linear relationships without transformations.

• SVM: Can model non-linear relationships effectively using kernel functions for higher-dimensional mapping.

Assumptions

• OLS: Relies on linearity, normality of errors, homoscedasticity, and independence of errors.

• SVM: Fewer assumptions; does not require linearity, normality, homoscedasticity, or independence, offering robustness in diverse scenarios.

Performance on Small Datasets

• OLS: May overfit on small datasets, potentially violating assumptions.

• SVM: Performs better on small datasets due to the margin (ε) and regularization, preventing overfitting.

5.8 Decision Tree Algorithms

5.8.1 ID3

1. Start at the Root Node:

   • Begin with the entire training set as the root.

2. Selecting the Best Attribute:

   • For each attribute A in the dataset, the ID3 algorithm calculates the attribute's effectiveness in classifying the training data.

   • The measure used for this purpose is the Information Gain, which is based on the concept of Entropy.

3. Calculate Entropy:

   • Entropy is a measure of the randomness or uncertainty in the data.

   • The entropy of the entire dataset S is given by:

        H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i

     where p_i is the proportion of the number of elements in class i to the number of elements in set S, and n is the number of classes.

4. Calculate Information Gain for Each Attribute:

   • Information Gain is calculated for each attribute. It is the difference in entropy before and after the dataset is split on that attribute.

   • The Information Gain (IG) for an attribute A is given by:

        IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

     where Values(A) are the different values of attribute A, S_v is the subset of S for which attribute A has value v, and H(S_v) is the entropy of subset S_v.

5. Choose the Attribute with the Maximum Information Gain:

   • The attribute with the highest Information Gain is chosen as the decision node.

6. Split the Dataset:

   • The dataset is then split by the chosen attribute to produce subsets of the dataset.

7. Recursion:

   • The ID3 algorithm is then recursively applied to each subset with the remaining attributes.

8. Termination Conditions:

   • This process is repeated until one of the termination conditions is met, such as all samples belonging to the same class, no attributes being left, or no further information gain being possible.

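A small sketch of the entropy and information-gain computations at the heart of ID3; the toy weather attribute is an assumed example.

    import numpy as np
    from collections import Counter

    def entropy(labels):
        counts = np.array(list(Counter(labels).values()), dtype=float)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def information_gain(values, labels):
        # IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
        gain, n = entropy(labels), len(labels)
        for v in set(values):
            subset = [l for x, l in zip(values, labels) if x == v]
            gain -= len(subset) / n * entropy(subset)
        return gain

    outlook = ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'overcast']
    play = ['no', 'no', 'yes', 'yes', 'no', 'yes']
    print(information_gain(outlook, play))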
5.8.2 Comparison of ID3 and C4.5 Algorithms

• Start with the Entire Dataset:

  – ID3 and C4.5: Both start with the full dataset as the root of the tree.

• Choose the Best Attribute:

  – ID3: Selects the attribute with the highest information gain.

  – C4.5: Also considers the highest information gain, but normalizes it.

• Calculate Entropy:

  – ID3 and C4.5: Both use the same entropy formula.

• Calculate and Normalize Information Gain:

  – ID3: IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

  – C4.5: Uses the same as ID3, followed by normalization (Gain Ratio):

        GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}

    with

        SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

• Handling Continuous Attributes:

  – ID3: Does not handle continuous attributes.

  – C4.5: Handles continuous attributes by finding an optimal threshold.

• Pruning:

  – ID3: No pruning mechanism.

  – C4.5: Prunes trees to avoid overfitting.

• Recursive Splitting:

  – ID3 and C4.5: Apply the process recursively using the remaining attributes.

• Termination Conditions:

  – ID3 and C4.5: Recursion stops when all instances in a node are of the same class, there are no attributes left, or the subset is too small.
Features of C5.0 Algorithm

• Efficiency Improvements:

  – C5.0 is more memory efficient and faster than C4.5.

  – It can handle larger datasets more effectively.

• Boosting:

  – C5.0 introduces boosting, building multiple models (trees) sequentially.

  – Each new model focuses on correctly classifying instances misclassified by previous models.

• Winnowing:

  – C5.0 can perform a feature selection step, known as winnowing, before building trees.

  – This step helps in removing irrelevant attributes.

• Handling Continuous and Categorical Attributes:

  – Similar to C4.5, C5.0 can handle both continuous and categorical attributes.

  – Uses a similar mechanism for handling continuous attributes by finding thresholds.

• Tree Pruning and Rule Generation:

  – C5.0 uses tree pruning and can generate rules from decision trees.

  – This aspect is similar to what C4.5 offers.

• Error-Based Pruning:

  – Employs a more sophisticated error-based pruning method.

  – This approach can result in smaller and more accurate trees.

5.8.3 Decision Tree Pruning Methods
Pre-Pruning (Early Stopping)
• Stop growing the tree earlier, before it perfectly classifies the training data.

• Set a maximum depth, minimum number of samples per leaf, or a minimum improvement in the impurity measure.

Post-Pruning
• Grow the tree fully, then remove nodes that do not provide significant predictive power.

• Cost Complexity Pruning:

    R_\alpha(T) = R(T) + \alpha \times |\text{leaves}|

  – R(T): Misclassification rate of the tree T.
  – α: Complexity parameter.
  – |leaves|: Number of leaves in the tree.
  – Minimize R_α(T). Increasing α leads to simpler trees.

Pruning reduces the complexity of the final model, improving its generalizability and interpretability.

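Cost complexity pruning is available in scikit-learn through the ccp_alpha parameter, which implements the R_α(T) criterion above; this sketch assumes sklearn and uses the iris data for illustration.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    full = DecisionTreeClassifier(random_state=0).fit(X, y)
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
    print(full.get_n_leaves(), pruned.get_n_leaves())   # larger alpha gives fewer leaves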
5.8.4 Decision Tree Splitting Criteria

Gini Impurity

    Gini(S) = 1 - \sum_{i=1}^{n} (p_i)^2

• S: Dataset or subset.
• p_i: Proportion of instances in class i within S.
• n: Number of classes.

Gini impurity measures the likelihood of incorrect classification if you randomly pick an instance and classify it according to the distribution of classes in the subset. A Gini score of 0 indicates perfect purity.

Entropy:

    H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i

Information Gain:

    IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)

• H(S): Entropy of set S.
• p_i: Probability of an item in S belonging to class i.
• A: Attribute for splitting.
• Values(A): All possible values for attribute A.
• S_v: Subset of S for attribute A with value v.

Entropy measures the level of disorder in the data. Information gain is the reduction in entropy from splitting the dataset, aiming to decrease uncertainty.

Classification Error

    ClassificationError(S) = 1 - \max(p_i)

• max(p_i): Highest proportion of any class in dataset S.

Classification error calculates the error rate of classifying an instance if it were randomly classified according to the distribution of classes in the subset. It focuses on the most frequent class.

Chi-Squared Statistic

    \chi^2 = \sum \frac{(O - E)^2}{E}

• O: Observed frequency.
• E: Expected frequency under independence.

The Chi-squared test assesses the independence between the splitting attribute and the target variable. A high value indicates a significant association, suggesting a beneficial split.

Reduction in Variance

    VarianceReduction = TotalVariance - WeightedVariance

• Variance is calculated as the average squared deviation from the mean of the target variable.

Used in regression problems to find splits that reduce the variance of the target variable, indicating more homogeneity.

Information Gain Ratio

    GainRatio(S, A) = \frac{IG(S, A)}{SplitInfo(S, A)}

with

    SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

• SplitInfo(S, A): Intrinsic information of a split on attribute A.

The information gain ratio is a normalization of information gain, reducing bias towards attributes with more distinct values. It balances a good split with not favoring attributes with more levels.

5.9 Ensemble Models

Bagging (Bootstrap Aggregating)
Bagging, short for Bootstrap Aggregating, is an ensemble machine learning algorithm designed to improve the stability and accuracy of machine learning algorithms. It involves creating multiple versions of a predictor model and using these to get an aggregated predictor. The method works by randomly selecting subsets of the training set with replacement, training a model on each, and then combining their predictions. This approach is particularly effective in reducing variance and avoiding overfitting.

• Data Sampling:

  – Given a training dataset D with N samples, create M new training sets D₁, D₂, ..., D_M, each of size N′ (usually N′ = N), by bootstrap sampling from D.

• Model Training:

  – Train a model f_m on each of the M datasets.

• Aggregation:

  – For regression, average all models: f_{\text{bagging}}(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x).

  – For classification, use the mode: f_{\text{bagging}}(x) = \text{mode}\{f_1(x), f_2(x), \ldots, f_M(x)\}.

Boosting
Boosting is an ensemble machine learning technique used to enhance the performance of predictive models. It builds a sequence of models in a way that each subsequent model attempts to correct the errors of the previous one. The final prediction is typically a weighted sum of the individual models. Boosting is especially known for increasing accuracy in both classification and regression tasks, often significantly reducing bias and variance compared to single models.

• Initial Model:

  – Train a weak model f₁ and compute errors e₁.

• Sequential Training:

  – For each subsequent model f_i, adjust training data weights based on e_{i−1}, focusing on mistakes.

• Final Prediction:

  – Output is a weighted sum: f_{\text{boosting}}(x) = \sum_{i=1}^{M} \alpha_i f_i(x), with α_i based on accuracy.
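A brief scikit-learn sketch comparing the two families; the base estimator and hyperparameters are illustrative choices.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
    boost = AdaBoostClassifier(n_estimators=50, random_state=0)
    print(bag.fit(X, y).score(X, y), boost.fit(X, y).score(X, y))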
5.9.1 Multivariate Adaptive Regression Splines (MARS)

Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression technique that extends linear models by incorporating automatic variable selection and nonlinear relationships. It models relationships by fitting piecewise linear regressions, creating 'splines' that adjust to different ranges of the data. MARS is particularly effective for high-dimensional data and can capture complex patterns without requiring a pre-specified functional form.

Base Functions

• Composition: Piecewise linear "basis" or "base" functions.

• Knot: Point where the function changes slope.

• Basis Function: Defined as h(x, c, s) = \max\left(0, \frac{x - c}{s}\right), where x is the independent variable, c is the knot, and s is a scaling factor.

• Predicted Value: \hat{y}(x) = \beta_0 + \sum_{j=1}^{J} \beta_j h(x, c_j, s_j), where β₀ is the intercept, β_j are coefficients, and J is the number of basis functions.

Forward Pass

• Begins by selecting base function pairs that minimize the residual sum of squares (RSS).

• Greedy addition of base functions.

• Stops when RSS reduction is minimal.

Backward Pass (Pruning)

• Addresses overfitting.

• Uses Generalized Cross Validation (GCV).

• GCV formula:

    GCV = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}(x_i)}{1 - \frac{h}{N}} \right)^2

where y_i are the actual values, ŷ(x_i) are the predicted values, N is the number of observations, and h is the effective number of parameters.

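A tiny NumPy sketch of the hinge basis functions that MARS composes; the knots and coefficients are hand-picked assumptions rather than fitted values.

    import numpy as np

    def h(x, c, s=1.0):
        # basis function h(x, c, s) = max(0, (x - c) / s)
        return np.maximum(0.0, (x - c) / s)

    x = np.linspace(0, 10, 11)
    y_hat = 2.0 + 0.5 * h(x, 3.0) - 1.5 * h(x, 7.0)   # piecewise linear, knots at 3 and 7
    print(y_hat)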
Stacking
Stacking, short for stacked generalization, is an ensemble machine learning algorithm. It involves combining multiple predictive models to generate a new model, typically resulting in improved prediction accuracy. In stacking, different algorithms are trained on the same dataset and their predictions are used as inputs to a final 'meta-model', which makes the ultimate prediction. This technique leverages the strengths of each individual model, thereby reducing the risk of choosing a suboptimal algorithm.

• Model Selection:

  – Choose diverse machine learning models.

• Base Model Training:

  – Train base models f₁, f₂, ..., f_M on training data.

• Meta-Data Creation:

  – Use base models to generate predictions P₁, P₂, ..., P_M as meta-features.

• Meta-Model Training:

  – Train meta-model g on the meta-features.

• Final Prediction:

  – Final prediction by meta-model:

        f_{\text{stacking}}(x) = g(P_1(x), P_2(x), \ldots, P_M(x))

Voting Ensemble
Voting Ensemble is a machine learning technique that combines the predictions from multiple models. It involves creating multiple different models on the same dataset and using a majority vote (for classification) or average (for regression) of their predictions as the final prediction. This approach is beneficial for improving model performance and robustness, as it reduces the likelihood of an unfortunate selection of a poorly performing model. Voting can be 'hard', using a strict majority vote, or 'soft', where probabilities are averaged.

• Model Training:

  – Train each model independently on the same dataset.

• Voting Mechanism:

  – Hard Voting: Majority vote from each model.

        \text{Prediction}_{\text{hard}} = \text{mode}\{\text{prediction}_1, \text{prediction}_2, \ldots, \text{prediction}_M\}

  – Soft Voting: Average of predicted probabilities.

        \text{Prediction}_{\text{soft}} = \frac{1}{M} \sum_{m=1}^{M} \text{probability}_m

• Aggregation of Predictions:

  – Combine predictions using hard or soft voting.

• Final Prediction:

  – Final prediction based on aggregated votes or probabilities.

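A compact scikit-learn sketch of hard voting and stacking over the same base models; all model choices are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import StackingClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    base = [('lr', LogisticRegression(max_iter=1000)),
            ('dt', DecisionTreeClassifier()),
            ('nb', GaussianNB())]
    vote = VotingClassifier(estimators=base, voting='hard').fit(X, y)
    stack = StackingClassifier(estimators=base,
                               final_estimator=LogisticRegression(max_iter=1000)).fit(X, y)
    print(vote.score(X, y), stack.score(X, y))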
Random Subspace Method (RSM)
The Random Subspace Method (RSM) is a machine learning technique for improving model accuracy and robustness. It involves training each model on a different random subset of features of the dataset, rather than on the complete feature set. This approach, also known as feature bagging, helps in reducing the correlation among the models in an ensemble, leading to better generalization and reduced overfitting, especially in cases with high-dimensional data. RSM is particularly effective in combination with decision trees and other algorithms sensitive to feature selection.

• Feature Subsampling:

  – From a dataset with P features, randomly select k features (k < P) for each model.

• Model Training:

  – Train separate models on these feature subsets.

• Aggregation:

  – Aggregate predictions using averaging or majority voting.

• Model formulation:

  – For classification: \text{Prediction}_{\text{RSM}} = \text{mode}\{\text{prediction}_1, \text{prediction}_2, \ldots, \text{prediction}_M\}

  – For regression: average the predictions from all models.

Mixture of Experts (MoE)
Mixture of Experts (MoE) is an ensemble machine learning technique that divides a complex problem into simpler sub-problems, solved by specialized models called experts. Each expert is trained on a different segment of the data or task, and a gating network determines the weight or influence of each expert's output in the final prediction. MoE effectively combines the outputs of various models, making it well-suited for tasks where different regions of the input space require different types of modeling or expertise.

• Division of Problem Space:

  – Divide the problem space into regions for different expert models.

• Training of Experts:

  – Train each expert on data corresponding to its region.

• Gating Network:

  – Train a gating network to select the appropriate expert for each input.

• Aggregation:

  – Combine outputs from experts based on gating network weights.

• Model formulation:

  – f_{\text{MoE}}(x) = \sum_{i=1}^{M} g_i(x) \cdot f_i(x), where g_i(x) is the gating network's weight for the i-th expert.
5.9.2 Ensemble Models Comparison

5.9.3 AdaBoost
AdaBoost, short for Adaptive Boosting, is an ensemble machine learning algorithm used primarily for classification tasks. It works by combining multiple
weak learners, typically simple decision trees, to create a strong classifier. In AdaBoost, each subsequent model focuses more on the instances that were
incorrectly predicted by previous models, as these receive increased weight. The final prediction is a weighted sum of the predictions from all learners.
AdaBoost is known for its effectiveness in boosting the performance of simple models and its ease of implementation.

AdaBoost Classification

• Initialization:
  – Dataset: (x_1, y_1), (x_2, y_2), . . . , (x_N, y_N), where x_i represents the features and y_i the class label of the i-th observation.
  – Assign equal weights: w_i = 1/N for i = 1, 2, . . . , N.
• Repeat until a specified number of iterations or until the classification error is minimal:
  – Fitting Weak Learners:
    ∗ In iteration t, train weak learner L_t on the dataset, weighted by w_i.
    ∗ Use L_t to make predictions ŷ_it on the training data.
  – Calculating Error and Learner Weight:
    ∗ Calculate the error: e_t = Σ_{i=1}^{N} w_i 1(ŷ_it ≠ y_i), where the indicator function 1(condition) returns 1 if the specified condition is true, and 0 if it is false.
    ∗ Determine the learner weight: α_t = (1/2) ln((1 − e_t)/e_t).
  – Updating Weights:
    ∗ Update the weights: w_i ← w_i · exp(α_t · 1(ŷ_it ≠ y_i)).
    ∗ Normalize the weights to sum to 1.
• Final Model:
  – The final prediction model is a weighted vote of the weak learners:
    ŷ(x) = sign( Σ_{t=1}^{T} α_t ŷ_t(x) )
    where T is the total number of iterations and the sign function returns the class label based on the sign of the summation.

AdaBoost Regression

• Initialization:
  – Dataset: (x_1, y_1), (x_2, y_2), . . . , (x_N, y_N), where x_i represents the features and y_i the target variable of the i-th observation.
  – Assign equal weights: w_i = 1/N for i = 1, 2, . . . , N.
• Repeat until a specified number of iterations or until the error is minimal:
  – Fitting Weak Learners:
    ∗ In iteration t, train weak learner L_t on the dataset, weighted by w_i.
    ∗ Make predictions ŷ_it on the training data.
  – Calculating Error and Learner Weight:
    ∗ Calculate the error: e_t = Σ_{i=1}^{N} w_i |y_i − ŷ_it|.
    ∗ Determine the learner weight: α_t = η · log((1 − e_t)/e_t), where η is the learning rate.
  – Updating Weights:
    ∗ Update the weights: w_i ← w_i · exp(α_t · |y_i − ŷ_it|).
    ∗ Normalize the weights to sum to 1.
• Final Model:
  – The final model: ŷ(x) = Σ_{t=1}^{T} α_t ŷ_t(x), where T is the total number of learners.
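A minimal sketch of the classification variant above, with decision stumps as weak learners and labels coded in {−1, +1} (the stump choice and dataset handling are illustrative assumptions):

# Sketch of AdaBoost classification with decision stumps (labels in {-1, +1}).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    N = len(y)
    w = np.full(N, 1.0 / N)                  # equal initial weights w_i = 1/N
    learners, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # weak learner trained with weights w_i
        pred = stump.predict(X)
        err = np.sum(w * (pred != y))        # weighted misclassification error e_t
        if err == 0 or err >= 0.5:           # stop if the learner is perfect or too weak
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(alpha * (pred != y))     # up-weight misclassified points
        w /= w.sum()                         # normalize weights to sum to 1
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    votes = sum(a * l.predict(X) for l, a in zip(learners, alphas))
    return np.sign(votes)                    # weighted vote of the weak learners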
5.9.4 Gradient Boosting
Gradient Boosting is an ensemble machine learning technique used for both classification and regression tasks. It builds the model in a stage-wise fashion, with each new model being trained to correct the errors made by the previous ones. The method uses the gradient descent algorithm to minimize the loss when adding new models. Each tree in the ensemble is fit on a modified version of the original dataset. Gradient Boosting is known for its high effectiveness, particularly in situations where data is unbalanced and in predictive tasks involving complex datasets.

Initialization:

• Start with an initial model, F_0(x), often a constant value such as the mean of the target values.

For each iteration m from 1 to M:

1. Compute the residuals r_im for each training instance i, which are the negative gradients of the loss function with respect to the prediction.
2. Fit a weak learner h_m(x) to these residuals.
3. Find the multiplier γ_m that minimizes the loss when added to the current model:
   γ_m = arg min_γ Σ_{i=1}^{n} L(y_i, F_{m−1}(x_i) + γ h_m(x_i))
4. Update the model:
   F_m(x) = F_{m−1}(x) + γ_m h_m(x)

Output the final model:

• The final model is the sum of the initial model and all the weak learners' contributions:
  F(x) = F_0(x) + Σ_{m=1}^{M} γ_m h_m(x)
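A minimal sketch for squared-loss regression, where the negative gradient is simply the residual y − F(x); a fixed learning rate lr stands in for the line-search multiplier γ_m (an illustrative simplification, since the residual-fitting tree already minimizes the loss leaf-wise):

# Sketch of gradient boosting for regression with squared loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, lr=0.1):
    F0 = y.mean()                          # initial model F_0: mean of the targets
    F = np.full_like(y, F0, dtype=float)
    trees = []
    for m in range(M):
        residuals = y - F                  # negative gradient of the squared loss
        h = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        F += lr * h.predict(X)             # update: F_m = F_{m-1} + lr * h_m
        trees.append(h)
    return F0, trees

def gradient_boost_predict(X, F0, trees, lr=0.1):
    return F0 + lr * sum(t.predict(X) for t in trees)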
Updated Comparison of Ensemble Techniques in Machine Learning

• Model Training:
  – Bagging: independent, parallel training.
  – Boosting: sequential, focuses on errors of previous models.
  – Stacking: independent base models, followed by a meta-model.
  – Random Subspace Method: independent, parallel training on feature subsets.
  – MoE: train multiple experts and a gating model.
  – Blending: train models independently, combine using a hold-out set.
• Data Handling:
  – Bagging: bootstrap samples (subsets of data).
  – Boosting: full dataset with adjusted weights for training samples.
  – Stacking: full dataset for base models, meta-model on their outputs.
  – Random Subspace Method: full dataset with subsets of features.
  – MoE: full dataset, gating model directs inputs to experts.
  – Blending: full dataset, split into training and validation sets.
• Model Type:
  – Bagging: similar types, e.g., decision trees.
  – Boosting: typically similar types.
  – Stacking: diverse model types.
  – Random Subspace Method: similar types, different feature subsets.
  – MoE: diverse, specialized models.
  – Blending: typically diverse models.
• Impact on Bias:
  – Bagging: reduces model-specific bias.
  – Boosting: focuses on reducing bias through error correction.
  – Stacking: varies based on model selection.
  – Random Subspace Method: reduced by feature diversity.
  – MoE: depends on the expertise of the individual models.
  – Blending: optimized during the blending process.
• Impact on Variance:
  – Bagging: averaging reduces variance.
  – Boosting: can increase if overfitting occurs.
  – Stacking: depends on the variance of the base and meta-models.
  – Random Subspace Method: mitigated by diversity in feature subsets.
  – MoE: varies; complex models might increase variance.
  – Blending: controlled by the validation set.
• Computational Complexity:
  – Bagging: generally lower, parallelizable.
  – Boosting: higher due to sequential training.
  – Stacking: potentially high due to two levels of training.
  – Random Subspace Method: similar to bagging, manageable.
  – MoE: potentially high; complex training of models.
  – Blending: moderate, depending on complexity.
• Overfitting Risk:
  – Bagging: lower due to averaging/majority voting.
  – Boosting: higher if not carefully tuned.
  – Stacking: depends on base and meta-model complexity.
  – Random Subspace Method: reduced due to feature diversity.
  – MoE: varies; requires careful design.
  – Blending: lower, due to use of a validation set for the final model.
• Applicability:
  – Bagging: high-variance models, e.g., decision trees.
  – Boosting: improving weak models, high-bias situations.
  – Stacking: leveraging the strengths of different models.
  – Random Subspace Method: high-dimensional feature spaces.
  – MoE: complex problems with diverse data characteristics.
  – Blending: problems where a simpler model can combine predictions.
Evaluation

6.1 Evaluation Approaches


6.1.1 K-fold Validation
Data Splitting

• Shuffle the dataset randomly; this ensures that the splitting of the data into folds is as unbiased as possible.
• Dataset D with N samples.
• Choose K for the number of folds.
• Split D into K subsets (D_1, D_2, . . . , D_K), each with approximately N/K samples.

Cross-Validation Cycle

• For each fold i (where i = 1, 2, . . . , K):
  – Validation set: D_i.
  – Training set: D \ D_i (all data in D except D_i).
  – Train the model on D \ D_i, test on D_i.
  – Record the score: S_i.

Calculating Performance Metrics

• Aggregate the scores: S_1, S_2, . . . , S_K.
• Mean score: Mean = (Σ_{i=1}^{K} S_i) / K.
• Standard deviation (optional): SD = sqrt( Σ_{i=1}^{K} (S_i − Mean)² / K ).

Interpretation

• The mean score represents the average performance across all folds.
• The standard deviation provides insight into the consistency of the model's performance across different subsets of data.
• The beauty of K-fold cross-validation lies in its ability to use every data point for both training and validation. This makes it a robust method for model evaluation, especially in cases where the dataset isn't large enough to afford a separate hold-out test set.
• By averaging the performance across different subsets, it offers a balanced view of how well the model might perform on unseen data.
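A minimal sketch of the cycle above, assuming scikit-learn's KFold (the estimator and dataset are illustrative):

# Sketch of K-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)    # shuffle before splitting

scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])               # train on D \ D_i
    scores.append(model.score(X[val_idx], y[val_idx]))  # test on D_i

print(np.mean(scores), np.std(scores))                  # mean and SD across folds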
6.1.2 The ROC Curve in Binary Classification
• The ROC curve graphically evaluates binary classification models.
• True Positive Rate (TPR):
  – Proportion of actual positives correctly identified.
  – Example: in a medical test, the TPR indicates how many sick people are correctly diagnosed.
• False Positive Rate (FPR):
  – Proportion of actual negatives incorrectly identified as positives.
  – Example: in a medical test, the FPR shows how many healthy people are wrongly diagnosed.
• Threshold Levels:
  – The ROC curve is plotted by changing the threshold, the cut-off point for class decisions.
  – Lowering the threshold increases both true and false positives.
  – Raising the threshold reduces false positives but may miss true positives.
• Shape of the ROC Curve:
  – A curve closer to the top left corner indicates good performance (high TPR, low FPR).
  – A curve near the diagonal line indicates less effective performance.
• Area Under the Curve (AUC):
  – A single number summarizing model performance.
  – A larger AUC indicates a better model. An AUC of 1 is perfect, while 0.5 suggests no discriminative ability.
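A minimal sketch, assuming scikit-learn's roc_curve and roc_auc_score (data and classifier are illustrative):

# Sketch: ROC curve points and AUC from predicted scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]

fpr, tpr, thresholds = roc_curve(yte, scores)   # one (FPR, TPR) point per threshold
print("AUC =", roc_auc_score(yte, scores))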
6.1.3 Accuracy Metrics
Confusion Matrix

                             Predicted Positive      Predicted Negative
  Actual Condition Positive  True Positive (TP)      False Negative (FN)
  Actual Condition Negative  False Positive (FP)     True Negative (TN)

• Prevalence: Σ Condition Positive / Σ Total Population
• Accuracy (ACC): (Σ True Positive + Σ True Negative) / Σ Total Population
• Positive Predictive Value (PPV) or Precision: Σ True Positive / Σ Predicted Condition Positive
• False Discovery Rate (FDR): Σ False Positive / Σ Predicted Condition Positive
• Negative Predictive Value (NPV): Σ True Negative / Σ Predicted Condition Negative
• False Omission Rate (FOR): Σ False Negative / Σ Predicted Condition Negative
• True Positive Rate (TPR) or Recall or Sensitivity: Σ True Positive / Σ Condition Positive
• False Negative Rate (FNR) or Miss Rate: Σ False Negative / Σ Condition Positive
• False Positive Rate (FPR) or Fall-out: Σ False Positive / Σ Condition Negative
• Specificity (SPC) or Selectivity or TNR: Σ True Negative / Σ Condition Negative
• Positive Likelihood Ratio (LR+): TPR / FPR
• Negative Likelihood Ratio (LR−): FNR / TNR
• Diagnostic Odds Ratio (DOR): LR+ / LR−
• F1 Score: 2 · (Precision · Recall) / (Precision + Recall)
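A minimal sketch deriving several of the metrics above from the four confusion-matrix counts (the counts themselves are illustrative):

# Sketch: metrics computed from TP, FN, FP, TN.
def confusion_metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    tpr = tp / (tp + fn)                  # recall / sensitivity
    fpr = fp / (fp + tn)                  # fall-out
    tnr = tn / (fp + tn)                  # specificity
    fnr = fn / (tp + fn)                  # miss rate
    precision = tp / (tp + fp)            # PPV
    return {
        "prevalence": (tp + fn) / total,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": tpr,
        "specificity": tnr,
        "F1": 2 * precision * tpr / (precision + tpr),
        "LR+": tpr / fpr,
        "LR-": fnr / tnr,
    }

print(confusion_metrics(tp=40, fn=10, fp=5, tn=45))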
6.1.4 Lift and Drift Charts
Lift Charts focus on quantifying the effectiveness of a model compared to a baseline, while Drift Charts are concerned with tracking changes in data
distributions over time, which can impact the performance of predictive models.

Lift Chart

• Evaluates the performance of a predictive model against a baseline (often random selection).
• Lift Calculation:
  Lift = Results with Model / Results without Model
• Example: if targeting 10% of the dataset by the model yields 30% of the positives, while random targeting gives 10%, then Lift = 30% / 10% = 3. This indicates the model is 3 times as effective as random selection.

Drift Chart

• Identifies changes in data distribution over time, essential in machine learning.
• Types of Drift:
  – Data Drift: changes in the input data distribution.
  – Model Drift: degradation in model performance due to changing data relationships.
• Common Method for Detection:
  – Statistical Measures: Kullback-Leibler divergence (KL divergence).
• KL Divergence Equation:
  D_KL(P ∥ Q) = Σ_{x∈X} P(x) log( P(x) / Q(x) )

KL Divergence for Data Drift Detection

• Compares the distribution of input features between training data and new, incoming data:
  D_KL(P_train ∥ P_new) = Σ_{x∈X} P_train(x) log( P_train(x) / P_new(x) )
• Identifies significant changes in input data, crucial for maintaining model accuracy as input patterns evolve.

KL Divergence for Model Drift Detection

• Compares predicted probability distributions from a historically trained model with those from a recently trained model:
  D_KL(P_historic ∥ P_recent) = Σ_{x∈X} P_historic(x) log( P_historic(x) / P_recent(x) )
• Helps understand shifts in model predictions over time, indicating changes in data patterns or relationships, crucial for adapting the model to current data trends.
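A minimal sketch of detecting data drift on a single feature with a histogram-based KL divergence (the distributions, bin count, and smoothing constant are illustrative choices):

# Sketch: data drift via KL divergence between binned feature distributions.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p, q = p + eps, q + eps               # avoid log(0) and division by zero
    p, q = p / p.sum(), q / q.sum()       # normalize counts to probabilities
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)      # training-time feature distribution
new = rng.normal(0.5, 1.2, 10_000)        # incoming data has drifted

bins = np.histogram_bin_edges(np.concatenate([train, new]), bins=30)
p_train, _ = np.histogram(train, bins=bins)
p_new, _ = np.histogram(new, bins=bins)
print("D_KL(train || new) =", kl_divergence(p_train.astype(float), p_new.astype(float)))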
Anomalies and Outliers

7.1 Anomalies and Outliers


Concept and Definition

• Anomalies:
  – Data points that deviate significantly from expected behavior.
  – Represent instances not conforming to general patterns or trends.
• Outliers:
  – A specific type of anomaly, significantly different and located far from the majority.
  – Affect statistical analysis and may indicate different mechanisms.

Characteristics

• Anomalies: unexpected, infrequent, significant deviations.
• Outliers: extreme values, statistically rare, can skew results.

Detection Techniques

• Anomalies: methods include supervised, semi-supervised, and unsupervised approaches.
• Outliers: identified using statistical measures, such as Z-scores and the IQR.

Applications

• Anomalies: used in cybersecurity, medicine, machine vision, financial fraud.
• Outliers: significant in statistical analyses, quality control, market analysis.

Types

• Anomalies: categorized based on domain or detection method.
• Outliers: types include global outliers, contextual outliers, collective outliers.
7.2 Isolation Forest Algorithm
The Isolation Forest algorithm isolates each data point by randomly selecting features and split values, creating an ensemble of iTrees. It then calculates
the path length for each point in these trees. Anomalies are identified based on their shorter path lengths, which are used to compute an anomaly score
indicating the likelihood of a point being an outlier in the dataset.

1. Construction of Isolation Trees (iTrees)

Isolation Forest constructs iTrees for a given dataset. An iTree is similar to a binary search tree, but it is constructed by randomly selecting an attribute and a split value for that attribute to partition the data. The process continues recursively until a termination condition is met.

2. Path Length h(x)

The path length h(x) is a measure of the number of edges a point x traverses in the iTree from the root node to an external node. Anomalous points typically have shorter path lengths because they are easier to isolate.

3. Anomaly Score Calculation

The anomaly score for a data point is calculated using the path length. The average path length E(h(x)) across all iTrees for a point x is used in the calculation. The formula for the anomaly score s(x, m) of a data point x in a sample of size m is:

  s(x, m) = 2^( −E(h(x)) / c(m) )

Here, c(m) is a normalization factor given by:

  c(m) = 2H(m − 1) − 2(m − 1)/n    for m > 2
  c(m) = 1                         for m = 2
  c(m) = 0                         otherwise

where H(i) is the i-th harmonic number (which can be estimated by ln(i) + γ) and n is the size of the testing dataset. γ is the Euler-Mascheroni constant, approximately 0.5772156649.

4. Interpreting the Anomaly Score

The value of s(x, m) indicates the likelihood of a point being an anomaly:

• If s(x, m) is close to 1, the point x is highly likely to be an anomaly.
• If s(x, m) is much smaller than 0.5, the point x is likely to be a normal point.
• A score around 0.5 indicates uncertainty.
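A minimal sketch, assuming scikit-learn's IsolationForest (the data and contamination rate are illustrative):

# Sketch: Isolation Forest anomaly scoring with scikit-learn.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)),      # normal points
               rng.uniform(-6, 6, (10, 2))])    # a few scattered outliers

iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0).fit(X)
labels = iso.predict(X)                # -1 for anomalies, +1 for normal points
scores = iso.score_samples(X)          # lower score = shorter paths = more anomalous
print((labels == -1).sum(), "points flagged as anomalies")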
7.3 Cook’s Distance
Cook’s Distance is a measure used in statistics to identify influential observations in a dataset, particularly in the context of linear regression. It estimates
the influence of a data point in a least-squares regression analysis.

Mathematical Expression

The formula for Cook's Distance is:

  D_i = ( Σ_{j=1}^{n} (Ŷ_j − Ŷ_j^(i))² ) / ( p · MSE )

Where:

• D_i is Cook's Distance for the i-th observation.
• Σ_{j=1}^{n} (Ŷ_j − Ŷ_j^(i))² is the sum of squared differences between the predicted values Ŷ_j from the full model and the predicted values Ŷ_j^(i) from the model without the i-th observation.
• p is the number of predictors in the model.
• MSE is the mean squared error of the full model.

Key Components

• Residuals and Leverage: considers the contribution of an observation's residual to the overall prediction error, and the observation's leverage on the regression line.
• Sum of Squared Differences: the numerator calculates the sum of squared differences in predicted values with and without the particular observation.
• Normalization Factor: the denominator normalizes this sum by the number of predictors and the mean squared error.

An observation with a high Cook's Distance indicates significant influence on the model's parameters. A threshold, such as 4/n (where n is the number of observations), is often used to identify observations with a substantially high Cook's Distance as influential.
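A minimal sketch using statsmodels' influence diagnostics, which I take to expose Cook's Distance via get_influence() (the data and the planted influential point are illustrative):

# Sketch: Cook's Distance for a simple OLS fit via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)
y[0] += 8                                 # plant one influential observation

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = model.get_influence().cooks_distance
threshold = 4 / len(y)                    # common rule of thumb
print(np.where(cooks_d > threshold)[0])   # indices of influential observations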
7.4 Quartiles and Interquartile Range (IQR)
Quartiles

• In a dataset:
  – The first quartile, Q1, is the median of the lower half of the data.
  – The third quartile, Q3, is the median of the upper half of the data.
  – Q2, the second quartile, is the median of the entire dataset, but it is not used in calculating the IQR.

Calculating the IQR

• The IQR is the difference between the third and first quartiles:
  IQR = Q3 − Q1
• This value represents the spread of the middle 50% of the data.

Identifying Outliers Using the IQR

• Outliers are identified using the IQR:
  – Lower Bound for Outliers (data points less than this value are considered lower outliers):
    Lower Bound = Q1 − 1.5 × IQR
  – Upper Bound for Outliers (data points greater than this value are considered upper outliers):
    Upper Bound = Q3 + 1.5 × IQR
• The choice of 1.5 as the multiplier is conventional but effective in distinguishing typical data points from those that are significantly different.
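A minimal sketch with NumPy (the small dataset is illustrative):

# Sketch: IQR-based outlier bounds.
import numpy as np

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 30])   # 30 is a clear outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])        # -> [30]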
7.5 Local Outlier Factor
The Local Outlier Factor (LOF) algorithm is a method for detecting outliers in a dataset by examining the local density deviation of each data point
compared to its neighbors.

Step 1: Determining the k-Distance

The k-distance of a point P is defined as the distance of P to its k-th nearest neighbor. This is mathematically represented as:

  k-distance(P) = dist(P, O_k)

where O_k is the k-th nearest neighbor of P and dist(P, O_k) is the distance between P and O_k.

Step 2: Calculating the Reachability Distance

The reachability distance between two points P and O is given by:

  Reachability-Distance_k(P, O) = max{ k-distance(O), dist(P, O) }

This represents the maximum of the k-distance of O and the actual distance between P and O.

Step 3: Computing the Local Reachability Density (LRD)

The LRD of a point P is calculated as:

  LRD_k(P) = ( Σ_{O∈N_k(P)} Reachability-Distance_k(P, O) / |N_k(P)| )^(−1)

where N_k(P) is the set of k nearest neighbors of P. The LRD is the inverse of the average reachability distance from P to its neighbors.

Step 4: Determining the Local Outlier Factor (LOF)

The LOF of a point P is determined as:

  LOF_k(P) = ( Σ_{O∈N_k(P)} LRD_k(O) / LRD_k(P) ) / |N_k(P)|

The LOF is the average of the ratios of the LRD of P's neighbors to the LRD of P itself. A high LOF value, significantly greater than 1, indicates that P is an outlier.

Overview

The LOF algorithm is particularly effective in identifying outliers for several key reasons:

• Focus on Local Spatial Properties: it emphasizes the local spatial characteristics of data points, rather than their global distribution in the dataset.
• Useful for Varying Densities: the algorithm is especially useful in datasets where density varies significantly, accommodating clusters that are either more sparse or dense than others.
• Identification of Notably Different Points: the LOF score quantifies the extent to which an object deviates from its neighboring points in terms of density. This helps in identifying points that are significantly different or isolated from their local surroundings.
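A minimal sketch, assuming scikit-learn's LocalOutlierFactor (the cluster and isolated point are illustrative):

# Sketch: Local Outlier Factor with scikit-learn.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),   # one dense cluster
               [[4.0, 4.0]]])                  # one isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # -1 for outliers, +1 for inliers
# negative_outlier_factor_ stores -LOF; values well below -1 indicate outliers
print(labels[-1], lof.negative_outlier_factor_[-1])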
7.6 Mahalanobis Distance
Measures the distance between a point and a distribution, particularly in a multivariate context.

  D² = (x − µ)^T Σ⁻¹ (x − µ)

• D²: square of the Mahalanobis distance.
• x: vector of the observation.
• µ: mean vector of the independent variables.
• Σ⁻¹: inverse covariance matrix of the independent variables.

Effective in multivariate anomaly detection and classification. Limitations:

• Sensitivity to Outliers: calculations can be significantly affected by outliers, potentially leading to misleading results.
• Assumption of Gaussian Distribution: works best when the data distribution is Gaussian, which might not be the case in all datasets.
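A minimal sketch computing D for every observation against the sample mean and covariance (the simulated data are illustrative):

# Sketch: Mahalanobis distance of each observation from the sample distribution.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # D^2 = (x-mu)^T S^-1 (x-mu)
print(np.sqrt(d2).max())    # largest distance; compare against a chi-square cutoff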
7.7 Minimum Covariance Determinant (MCD) Method
The goal of the MCD method is to find a subset of the dataset with the smallest covariance determinant, representing the ”normal” observations and
reducing the influence of outliers.

Process

• Dataset and Subset Selection:
  – Dataset X = {x_1, x_2, . . . , x_n}, where each x_i is in a p-dimensional space.
  – Find a subset X_h ⊂ X with h observations (n/2 < h ≤ n).
• Calculating Mean and Covariance for the Subset:
  – Mean µ_h of X_h:
    µ_h = (1/h) Σ_{i∈X_h} x_i
  – Covariance matrix Σ_h of X_h:
    Σ_h = (1/(h − 1)) Σ_{i∈X_h} (x_i − µ_h)(x_i − µ_h)^T
• Minimizing the Determinant:
  – Select X_h to minimize the determinant of Σ_h.
  – Optimization problem:
    min_{X_h ⊂ X, |X_h| = h} det(Σ_h)
• Final Estimation:
  – The subset X_h minimizing the determinant provides robust MCD estimates of the mean and covariance.

Solving the MCD Method

• Computational Challenge: evaluating many subsets for the minimum determinant covariance matrix is intensive, especially for large datasets.
• Heuristic Approaches: approximations, like the Fast-MCD algorithm, are used for efficiency.
• Fast-MCD Algorithm: an iterative algorithm that refines the subset selection to minimize the determinant.
• Iterative Process: each iteration updates the subset to better approximate the minimum determinant.
• Random Sampling: initial subset selection involves random sampling to cover a broad range of possibilities.

Significance of the MCD Method

• Outlier Resistance: MCD is robust against outliers, reducing their impact on covariance estimation.
• Improved Analysis: provides more accurate covariance estimates in datasets with outliers.
• Wide Application: useful in fields like finance and economics where outliers are common.
• Better Decision-Making: leads to more reliable decision-making in statistics-based models.
• Enables Advanced Techniques: essential for complex statistical methods in datasets with outliers.
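A minimal sketch, assuming scikit-learn's MinCovDet (which implements Fast-MCD); the contaminated data and the 95th-percentile cutoff are illustrative:

# Sketch: robust location/covariance via Fast-MCD in scikit-learn.
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 200),
               rng.uniform(5, 8, (15, 2))])          # contaminating outliers

mcd = MinCovDet(random_state=0).fit(X)               # Fast-MCD under the hood
print(mcd.location_)                                 # robust mean estimate
d2 = mcd.mahalanobis(X)                              # robust squared distances
print((d2 > np.percentile(d2, 95)).sum(), "points flagged")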
7.8 Single-Class SVM
• Single-class SVM is used for anomaly detection.
• It focuses on a single class, unlike standard SVMs which handle two or more classes.
• The goal is to establish a decision boundary that separates the data points of a single class from the origin in high-dimensional space.
• The technique involves mapping data points into high-dimensional space for separation.
• Useful in scenarios with significant data from one class, to detect anomalies or outliers.

1. Tax and Duin's Formulation (Using a Hypersphere)

• Encloses the data in a hypersphere in high-dimensional space.
• Center and Radius: the hypersphere is characterized by a center a and a radius R.
• Objective: minimize the radius while keeping all data points inside or on its surface.
• Optimization Problem:
    Minimize:   R² + C Σ_i ξ_i
    Subject to: |x_i − a|² ≤ R² + ξ_i, ∀i
                ξ_i ≥ 0, ∀i
• x_i are the data points, ξ_i are slack variables, and C is a regularization parameter.

2. Schölkopf's Formulation (Using a Hyperplane)

• Uses a hyperplane instead of a hypersphere.
• Objective: maximize the distance of the hyperplane from the origin.
• Optimization Problem:
    Maximize:   1 / |w|
    Subject to: (w · x_i) + b ≥ ρ − ξ_i, ∀i
                ξ_i ≥ 0, ∀i
                Σ_i α_i = 1
• w is the normal to the hyperplane, b is the bias, ρ is the margin.
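A minimal sketch, assuming scikit-learn's OneClassSVM (which follows Schölkopf's formulation; the training data and the test points are illustrative):

# Sketch: one-class SVM for novelty detection.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (200, 2))        # data from the single "normal" class

ocsvm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(X_train)
X_new = np.array([[0.1, -0.2], [4.0, 4.0]])
print(ocsvm.predict(X_new))                 # +1 = inlier, -1 = anomaly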
Information Theory

8.1 Shannon Uncertainty Formula


The Shannon Uncertainty Formula, also known as Shannon's entropy, is a fundamental concept in information theory, introduced by Claude Shannon. It quantifies the uncertainty or the amount of information contained in a random variable or a system. The formula is expressed as:

  H(X) = − Σ_{x∈X} p(x) log p(x)

Where:

• H(X) is the entropy of the random variable X.
• X represents the set of all possible outcomes of X.
• p(x) is the probability of an outcome x.
• The logarithm base is chosen depending on the context (base 2 for bits, base e for natural units, and base 10 for dits).

The entropy H(X) measures the average level of "information", "surprise", or "uncertainty" inherent in a random variable's possible outcomes. It represents the average amount of information conveyed by identifying the outcome of a random trial.

Key Properties of the Shannon Uncertainty Formula

• Non-negativity: entropy is always non-negative, implying that the average amount of information or uncertainty in a system cannot be a negative value.
• Maximum entropy with uniform distribution: entropy is maximized when the distribution of the random variable is uniform, because a uniform distribution indicates the highest level of uncertainty or lack of specific information about the variable.
• Additivity for independent events: for two independent random variables X and Y, the entropy of their joint distribution is the sum of their individual entropies:
  H(X, Y) = H(X) + H(Y)
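A minimal sketch computing entropy in bits for a discrete distribution (the two coin distributions are illustrative):

# Sketch: Shannon entropy (in bits) of a discrete distribution.
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # 0 * log 0 is taken as 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))         # fair coin: 1.0 bit (maximum for 2 outcomes)
print(entropy_bits([0.9, 0.1]))         # biased coin: ~0.469 bits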
8.2 Boltzmann’s Entropy Formula
Boltzmann’s entropy formula is a cornerstone in statistical mechanics and thermodynamics, offering a mathematical expression for the concept of entropy.

Basic Formula

The basic formula is given as:

  S = k_B ln(Ω)

where:

• S is the entropy of the system.
• k_B is the Boltzmann constant, valued at 1.38 × 10⁻²³ J/K.
• Ω represents the number of possible microscopic configurations (microstates) of the system.

Interpretation

• Entropy, S, measures the degree of disorder or randomness within a system.
• A higher number of microstates, Ω, implies greater entropy.
• This formula is applicable in systems where all microstates are equally probable.

Relation with Probability

• The original Boltzmann formula connects entropy with the probability, W, of the system being in a particular microstate:
  S = k_B ln(W)
• This emphasizes the probabilistic nature of entropy.

Entropy in Systems with Multiple Particles

• Microstates for N Particles:
  – Consider a system with N identical particles, each capable of being in one of K states.
  – The total number of microstates, W, is calculated as:
    W = (N + K − 1)! / ( N! (K − 1)! )
• Entropy Calculation:
  – Using Stirling's approximation, the logarithm of W is approximated as:
    ln(W) ≈ N ln(K) − N ln(N) − (K − N) ln(K − N)
• Boltzmann Entropy for the System:
  – The entropy for a system with N particles is given by:
    S = k_B ( N ln(K) − N ln(N) − (K − N) ln(K − N) )
  – Here, N signifies the number of particles in a particular state, and K − N the number in other states.

Applications

This formula finds extensive applications in fields such as physics, chemistry, and information theory.
8.3 Jensen's Inequality

• Links the convexity or concavity of functions to expected values.
• Established by Johan Jensen in 1906.
• Influential in statistics, optimization, economics, and various other mathematical fields.

For Convex Functions

• Inequality Statement: f(E[X]) ≤ E[f(X)]
• Meaning: when f is a convex function and X is a random variable, the function value at the expected value of X is less than or equal to the expected value of the function of X.
• Interpretation: for convex functions, the "average" output is at least as large as the output at the "average" input.

For Concave Functions

• Inequality Statement: f(E[X]) ≥ E[f(X)]
• Meaning: for a concave function f and a random variable X, the function value at the expected value of X is greater than or equal to the expected value of the function of X.
• Interpretation: for concave functions, the "average" output is at most as large as the output at the "average" input.
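A minimal numeric check of the convex case with f(x) = x² (the exponential distribution is an illustrative choice; for it, f(E[X]) = 4 while E[f(X)] = 8):

# Sketch: Jensen's inequality for the convex function f(x) = x^2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=100_000)
print(X.mean() ** 2, (X ** 2).mean())    # f(E[X]) <= E[f(X)]: ~4 vs ~8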
8.4 Fisher’s Score and Fisher’s Information
In essence, Fisher's Score provides a mechanism to locate the most probable parameter values in a likelihood function, while Fisher's Information quantifies how much certainty there is in these estimates.

Fisher's Score

Fisher's Score is the derivative (gradient) of the log-likelihood function with respect to the parameter:

  s(θ) = ∂/∂θ log L(θ)

where:

• s(θ) is the Fisher's Score,
• θ is the parameter,
• L(θ) is the likelihood function.

The score function is used to find the point where the likelihood function reaches its maximum with respect to the parameter. Setting the score to zero leads to the maximum likelihood estimate.

Fisher's Information

Fisher's Information measures the amount of information an observable random variable carries about an unknown parameter of a distribution that models the variable.

• Variance of the Score:
  I(θ) = Var[s(θ)]
  where I(θ) is the Fisher's Information.
• Expected Value of the Second Derivative of the Log-Likelihood:
  I(θ) = −E[ ∂²/∂θ² log L(θ) ]
• As the Expected Squared Score:
  I(θ) = E[ ( ∂/∂θ log f(X; θ) )² ]
  where f(X; θ) is the probability density or mass function of the random variable X.

Higher Fisher Information suggests less variance in the parameter estimate, indicating more certainty in the estimate.
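A minimal numeric check using the Bernoulli(p) case, where the per-observation score is x/p − (1 − x)/(1 − p) and the Fisher Information has the known closed form I(p) = 1 / (p(1 − p)):

# Sketch: Var[score] matches the closed-form Fisher Information for Bernoulli(p).
import numpy as np

p = 0.3
rng = np.random.default_rng(0)
x = rng.binomial(1, p, size=200_000)
score = x / p - (1 - x) / (1 - p)         # d/dp log L(p), per observation
print(score.var(), 1 / (p * (1 - p)))     # both ~= 4.76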
8.5 Kullback-Leibler Divergence
• The Kullback-Leibler (KL) divergence, denoted as D_KL(P ∥ Q), is a statistical measure used to quantify how one probability distribution P differs from a second, reference distribution Q.
• It is not a symmetric measure and does not satisfy the triangle inequality, setting it apart from a traditional metric.

For Discrete Distributions

• Distributions: P and Q.
• D_KL(P ∥ Q) = Σ_{x∈X} P(x) log( P(x) / Q(x) ).
• General Interpretation: sums over all possible events x in the sample space X, each term being the product of the probability of event x under P and the logarithm of the ratio of probabilities under P and Q.
• Information Theory Perspective: quantifies the additional bits required for encoding each event x using a model Q instead of P.

For Continuous Distributions

• Density Functions: p(x) and q(x).
• D_KL(P ∥ Q) = ∫_{−∞}^{∞} p(x) log( p(x) / q(x) ) dx.
• General Interpretation: an integral over the entire range of the random variable, involving the product of the probability density under P and the logarithm of the ratio of densities under P and Q.
• Information Theory Perspective: measures the continuous "information cost" or extra bits required when outcomes are coded using Q instead of P.
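A minimal sketch for the discrete case, assuming SciPy's rel_entr helper (which computes the elementwise terms p · log(p/q), in nats):

# Sketch: discrete KL divergence and its asymmetry.
import numpy as np
from scipy.special import rel_entr       # elementwise p * log(p / q)

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
print(rel_entr(P, Q).sum())              # D_KL(P || Q)
print(rel_entr(Q, P).sum())              # D_KL(Q || P) differs: not symmetric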
8.6 Mutual Information
Mutual Information (MI) is a measure used in statistics to quantify the amount of information obtained about one random variable by observing another. It is intimately linked to the concept of entropy in information theory. The definition and computation of MI depend on whether the random variables involved are discrete or continuous.

General Definition

For a pair of random variables X and Y with joint distribution P_(X,Y) and marginal distributions P_X and P_Y, the mutual information is defined as:

  I(X; Y) = D_KL( P_(X,Y) ∥ P_X ⊗ P_Y )

Here, D_KL represents the Kullback-Leibler divergence, and P_X ⊗ P_Y is the outer product distribution assigning probability P_X(x) · P_Y(y) to each pair (x, y).

For Discrete Distributions

When X and Y are discrete, MI is calculated as a double sum:

  I(X; Y) = Σ_{y∈Y} Σ_{x∈X} P_(X,Y)(x, y) log( P_(X,Y)(x, y) / ( P_X(x) P_Y(y) ) )

Here, P_(X,Y) is the joint probability mass function, and P_X and P_Y are the marginal probability mass functions of X and Y, respectively.

For Continuous Distributions

For continuous variables, the double sum is replaced by a double integral:

  I(X; Y) = ∫_Y ∫_X P_(X,Y)(x, y) log( P_(X,Y)(x, y) / ( P_X(x) P_Y(y) ) ) dx dy

In this case, P_(X,Y) is the joint probability density function.

Equivalent Expressions

Mutual information can also be expressed in terms of entropy:

  I(X; Y) ≡ H(X) − H(X|Y) ≡ H(Y) − H(Y|X)
          ≡ H(X) + H(Y) − H(X, Y) ≡ H(X, Y) − H(X|Y) − H(Y|X)

where H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X, Y) is the joint entropy of X and Y.
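A minimal sketch of the discrete double sum, computed from a joint contingency table (the counts are illustrative):

# Sketch: mutual information (in nats) of two discrete variables.
import numpy as np

def mutual_information(joint):
    pxy = joint / joint.sum()                  # joint pmf from co-occurrence counts
    px = pxy.sum(axis=1, keepdims=True)        # marginal P_X
    py = pxy.sum(axis=0, keepdims=True)        # marginal P_Y
    nz = pxy > 0                               # 0 * log 0 terms are dropped
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

joint = np.array([[40, 10],                    # counts of (x, y) co-occurrences
                  [10, 40]])
print(mutual_information(joint))               # > 0: X and Y are dependent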
Time Series Forecasting

9.1 Common Components of a Time Series


• Trend Component:
  – Indicates persistent increases or decreases over time.
  – Represents the long-term progression of the series.
  – Can be linear or non-linear.
  – Example: steady increase in average global temperatures over decades.
• Seasonal Component:
  – Repeats over a fixed period.
  – Usually of short duration.
  – Reflects seasonality or periodic fluctuations.
  – Example: higher hotel bookings during summer vacation seasons.
• Cyclical Component:
  – Fluctuations in a non-fixed, irregular pattern.
  – Usually tied to economic conditions, like boom or recession.
  – Can span several years.
  – Example: business cycle phases impacting employment rates.
• Irregular (or Random) Component:
  – Consists of random fluctuations not attributed to the other components.
  – Also known as the "residual" or "error" component.
  – Caused by unpredictable or random events.
  – Example: sudden stock market fluctuations due to an unexpected political event.
9.2 Autocorrelation Function (ACF)
• Autocorrelation measures the correlation of a signal or time series with a delayed version of itself; it is calculated for various time lags.
• The autocorrelation function is used to find repeating patterns or periodic signals within a dataset.
• For instance, it can identify whether there is a regular, cyclical behavior in temperature readings over days or in stock market prices over weeks.

Basic Definition

For a time series X_t, where t represents time, the autocorrelation function (ACF) assesses the correlation between X_t and X_{t−h} for various values of h (the lag). The formula is:

  ACF(h) = Cov(X_t, X_{t−h}) / Var(X_t)

where:

• Cov(X_t, X_{t−h}) is the covariance between X_t and X_{t−h}.
• Var(X_t) is the variance of X_t.

Wide-Sense Stationary Process

In a wide-sense stationary (WSS) process, where the mean and variance are constant over time, the autocovariance and autocorrelation depend only on the lag, not on the specific time t. For such processes, the ACF is defined as:

  R_XX(τ) = E[ X_{t+τ} X̄_t ]

where:

• τ is the lag.
• E[·] is the expectation operator.
• X̄_t denotes the complex conjugate of X_t.

Continuous and Discrete Signals

Continuous Signal: for a continuous signal f(t), the autocorrelation is defined as an integral:

  R_ff(τ) = ∫_{−∞}^{∞} f(t + τ) f(t) dt

Discrete Signal: for a discrete-time signal y(n), the autocorrelation at lag ℓ is given by the sum:

  R_yy(ℓ) = Σ_{n∈Z} y(n) y(n − ℓ)

Estimating the Autocorrelation Coefficient

For a discrete process with known mean µ and variance σ², and n observations, the estimated autocorrelation coefficient at lag k is:

  R̂(k) = ( 1 / ((n − k) σ²) ) Σ_{t=1}^{n−k} (X_t − µ)(X_{t+k} − µ)

Practical Significance

In time series analysis, the ACF is used to detect non-randomness, identify seasonality or periodicity, and inform the choice of model (e.g., ARIMA models in forecasting). In signal processing, it helps in identifying signal properties, such as the presence of a periodic signal.

Normalization

In statistics and time series analysis, it is common to normalize the autocovariance function to get a time-dependent Pearson correlation coefficient. The autocorrelation coefficient for a stochastic process is:

  ρ_XX(t_1, t_2) = K_XX(t_1, t_2) / (σ_{t_1} σ_{t_2})
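A minimal sketch of the sample ACF, here using the common biased variant that divides by n rather than n − k (the period-12 signal is illustrative):

# Sketch: estimating autocorrelation coefficients at lags 0..max_lag.
import numpy as np

def acf(x, max_lag):
    x = np.asarray(x, dtype=float)
    n, mu, var = len(x), x.mean(), x.var()
    return np.array([np.sum((x[:n - k] - mu) * (x[k:] - mu)) / (n * var)
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
t = np.arange(200)
x = np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, 200)   # period-12 signal + noise
print(acf(x, 12).round(2))    # peaks again near lag 12, revealing the periodicity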
9.3 Partial Autocorrelation Function (PACF)
Definition and Basic Concept

• The PACF is the partial correlation of a stationary time series with its own lagged values, regressed against the values of the time series at all shorter lags.
• In simpler terms, it tells you the direct relationship between an observation and its lag, removing the influence of the intermediate lags.
• The PACF of order k can be defined as the last element of R_k⁻¹ C_k divided by r_0, where R_k is a k × k matrix and C_k is a k × 1 column vector.

Calculation

• The 1st order PACF is defined to be equal to the 1st order autocorrelation.
• For higher orders, the 2nd order (lag) PACF is given by the equation:

  PACF_2 = Covariance(x_t, x_{t−2} | x_{t−1}) / sqrt( Variance(x_t | x_{t−1}) · Variance(x_{t−2} | x_{t−1}) )

  This represents the correlation between values two time periods apart, conditional on the knowledge of the value in between.
• Similarly, the 3rd order (lag) PACF is calculated as:

  PACF_3 = Covariance(x_t, x_{t−3} | x_{t−1}, x_{t−2}) / sqrt( Variance(x_t | x_{t−1}, x_{t−2}) · Variance(x_{t−3} | x_{t−1}, x_{t−2}) )

  This continues for higher lags.

Application in Time Series Analysis

• The PACF is particularly useful in identifying the order of an autoregressive (AR) model in time series analysis.
• The theoretical ACF and PACF for AR, MA, and ARMA conditional mean models are known and are different for each model, aiding in model selection.
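A minimal sketch of using the sample PACF to pick an AR order, assuming statsmodels' pacf function (the simulated AR(2) process is illustrative):

# Sketch: sample PACF on a simulated AR(2) series.
import numpy as np
from statsmodels.tsa.stattools import pacf

rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(2, 500):                        # x_t = 0.6 x_{t-1} - 0.3 x_{t-2} + e_t
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

print(pacf(x, nlags=6).round(2))   # clearly nonzero at lags 1-2, ~0 beyond: suggests AR(2)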
9.4 Autoregressive Integrated Moving Average (ARIMA)
Key Parameters

• p (Autoregressive Part):
  – Order of the autoregressive component.
  – Number of lags of the dependent variable used as predictors.
  – Captures the influence of prior values on current values.
• d (Differencing):
  – Degree of differencing applied to make the series stationary.
  – A transformation to stabilize the mean and variance over time.
  – Removes trends and seasonal effects.
• q (Moving Average Part):
  – Order of the moving average component.
  – Number of lagged forecast errors included in the model.
  – Addresses the influence of random shocks from previous points.

General Equation for Non-Seasonal ARIMA

  (1 − Σ_{i=1}^{p} φ_i L^i)(1 − L)^d X_t = (1 + Σ_{i=1}^{q} θ_i L^i) ε_t

• L: lag operator.
• X_t: time series value at time t.
• φ_i: coefficients for the autoregressive part.
• θ_i: coefficients for the moving average part.
• ε_t: error terms.

Equation with a Drift Component

  (1 − Σ_{i=1}^{p} φ_i L^i)(1 − L)^d X_t = δ + (1 + Σ_{i=1}^{q} θ_i L^i) ε_t

• δ: represents the drift component, indicating a linear trend.

Application and Fitting

• Identify appropriate values for p, d, and q based on the characteristics of the time series data.
• Once the parameters are determined, the model is used for forecasting future values.
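A minimal sketch of fitting and forecasting with statsmodels' ARIMA class (the random-walk data and the (1, 1, 1) order are illustrative):

# Sketch: fitting a non-seasonal ARIMA(p, d, q) with statsmodels.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.2, 1.0, 200))   # random walk with drift: a d = 1 case

res = ARIMA(y, order=(1, 1, 1)).fit()      # p = 1, d = 1, q = 1
print(res.forecast(steps=5))               # forecasts of the next 5 values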
9.5 SARIMAX Model
ARIMAX Model

• Extends the ARIMA model by including exogenous inputs.
• Incorporates independent variables that influence the time series but are not autoregressed on.
• Models the time series using both the series itself and other independent variables.

SARIMA Model

• Stands for Seasonal ARIMA, used for time series with evident seasonality.
• Combines two ARIMA models: one for the non-seasonal and one for the seasonal part.
• Includes extra seasonal parameters: P (seasonal autoregressive order), D (seasonal differencing order), Q (seasonal moving average order), S (length of the seasonal cycle).

SARIMAX Model

• Combines SARIMA and ARIMAX, incorporating both seasonal components and exogenous inputs.
• Enhances ARIMA by adding seasonality and external data for improved forecasting.
• Equation for SARIMAX(p, d, q)(P, D, Q, s):

  Θ(L)^p θ(L^s)^P Δ^d Δ_s^D y_t = Φ(L)^q φ(L^s)^Q Δ^d Δ_s^D ε_t + Σ_{i=1}^{n} β_i x_it

  – Θ(L)^p and Φ(L)^q are the non-seasonal components.
  – θ(L^s)^P and φ(L^s)^Q are the seasonal components.
  – Δ^d and Δ_s^D represent the differencing operations.
  – y_t is the time series, ε_t is the error term.
  – x_it with coefficients β_i are the exogenous variables.
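A minimal sketch, assuming statsmodels' SARIMAX class; the simulated series, orders, and the single exogenous regressor are illustrative:

# Sketch: SARIMAX with a seasonal component and an exogenous regressor.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
n = 144
exog = rng.normal(size=(n, 1))                        # one exogenous variable
y = (10 + 2 * np.sin(2 * np.pi * np.arange(n) / 12)   # seasonality with period 12
     + 0.5 * exog[:, 0] + rng.normal(0, 0.3, n))

res = SARIMAX(y, exog=exog, order=(1, 0, 1),
              seasonal_order=(1, 0, 1, 12)).fit(disp=False)
print(res.forecast(steps=3, exog=rng.normal(size=(3, 1))))  # needs future exog values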
9.6 Simple Exponential Smoothing (SES)
Simple exponential smoothing (SES) is a method suitable for forecasting time series data that does not display any clear trend or seasonal pattern. It is
part of the family of exponential smoothing methods and works particularly well for data that is essentially random with no evident seasonality or trend.

Core Principles of Simple Exponential Smoothing

• Weighted Averages: SES forecasts are calculated using weighted averages where the weights decrease exponentially as observations come from further in the past; the smallest weights are associated with the oldest observations. This can be represented by the equation:

  ŷ_{T+1|T} = α y_T + α(1 − α) y_{T−1} + α(1 − α)² y_{T−2} + · · ·

  Here, 0 < α < 1 is the smoothing parameter, y_T is the last observed value, and ŷ_{T+1|T} is the forecast for the next period.

• Influence of α: the value of α determines the weights assigned to observations. A small α gives more weight to observations from the distant past, while a large α gives more weight to recent observations. If α equals 1, the SES forecast is the same as the last observed value, akin to the naïve forecast.

• Forecast: the forecast at time T + 1 is a weighted average between the most recent observation y_T and the previous forecast ŷ_{T|T−1}:

  ŷ_{T+1|T} = α y_T + (1 − α) ŷ_{T|T−1}

  This formula is used iteratively to calculate the forecast for each time period.

• Component Form: in simple exponential smoothing, the only component is the level ℓ_t. The component form comprises a forecast equation and a smoothing equation:

  ŷ_{t+h|t} = ℓ_t
  ℓ_t = α y_t + (1 − α) ℓ_{t−1}

  The forecast value at time t + 1 is the estimated level at time t.
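A minimal sketch of the component form (initializing the level at the first observation is an illustrative choice):

# Sketch: simple exponential smoothing in component form.
def ses_forecast(y, alpha):
    level = y[0]                           # initialize the level at the first value
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level   # l_t = a*y_t + (1-a)*l_{t-1}
    return level                           # flat forecast: y_{T+h|T} = l_T

series = [3.0, 10.0, 12.0, 13.0, 12.0, 10.0, 12.0]
print(ses_forecast(series, alpha=0.8))     # weights recent observations heavily
print(ses_forecast(series, alpha=0.2))     # weights the distant past more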
9.7 Damped Trend Method
The damped trend method in time series forecasting enhances traditional models by incorporating a damping factor, ϕ, to account for the diminishing
impact of trends over time.

Level

  ℓ_t = α y_t + (1 − α)(ℓ_{t−1} + φ b_{t−1})

Similar to Holt's model, this updates the level of the series. The presence of φ in the equation dampens the influence of past trends on the current level.

Trend

  b_t = β* (ℓ_t − ℓ_{t−1}) + (1 − β*) φ b_{t−1}

The trend component is adjusted for damping. The factor φ dampens the trend, effectively reducing the influence of past trends over time.

Forecast

  ŷ_{t+h|t} = ℓ_t + (φ + φ² + . . . + φ^h) b_t

This equation projects the future values, taking into account the dampening effect on the trend. The term (φ + φ² + . . . + φ^h) represents the cumulative dampening effect over h periods.

Key Advantages

• By incorporating the damping factor φ, this method modifies the trend component to account for a diminishing impact over time.
• The reduced impact of trends over time makes the damped trend method particularly suitable for long-term forecasting where trends are expected to level off.
• It provides more realistic forecasts, especially for data with trends that are likely to decelerate, offering a conservative yet often more accurate long-term outlook compared to models assuming a continuous linear trend.
9.8 Holt's Linear Trend Method
Holt’s linear trend method is an extension of exponential smoothing used for time series forecasting, particularly effective when the data exhibits trends.
This method involves two primary equations: the level equation and the trend equation.

Level

The level equation updates the series' current level, which is a smoothed estimate of the series' value.

  ℓ_t = α y_t + (1 − α)(ℓ_{t−1} + b_{t−1})

• ℓ_t is the estimated level at time t.
• α is the smoothing parameter for the level, between 0 and 1.
• y_t is the actual observed value at time t.
• ℓ_{t−1} and b_{t−1} are the estimated level and trend, respectively, from the previous time step.

Trend

The trend equation updates the trend component, reflecting changes in the level.

  b_t = β* (ℓ_t − ℓ_{t−1}) + (1 − β*) b_{t−1}

• b_t is the estimated trend at time t.
• β* is the smoothing parameter for the trend, also between 0 and 1.
• The trend component is the estimated change in the level component from one period to the next.

Forecast

Combines the level and trend components to forecast future values.

  ŷ_{t+h|t} = ℓ_t + h b_t

• ŷ_{t+h|t} is the forecast for h periods ahead.
• The equation suggests that the future value is a function of the current estimated level and the current trend, projected h steps into the future.

Key Characteristics of Holt's Method

• Effectiveness for Linear Trends: tailored for data exhibiting a linear trend, providing accurate forecasting in such scenarios.
• Dynamic Adjustment: dynamically updates both the level and trend of the series, offering adaptability to changes in the data.
• Flexibility and Robustness: adjusts to changing trends, making it a flexible and robust forecasting tool.
• Importance of Smoothing Parameters (α and β*): critical in determining the model's responsiveness to changes in the level and trend of the data; these parameters balance the relevance of historical data against recent observations.
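A minimal sketch of Holt's method with an optional damping factor φ: φ = 1 recovers the standard linear trend method, while φ < 1 gives the damped trend method of Section 9.7 (the simple level/trend initialization is an illustrative choice):

# Sketch: Holt's linear trend method with optional damping.
def holt_forecast(y, alpha, beta, h, phi=1.0):
    level, trend = y[0], y[1] - y[0]       # simple initialization
    for obs in y[2:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    damp = sum(phi ** k for k in range(1, h + 1))   # phi + phi^2 + ... + phi^h
    return level + damp * trend            # equals level + h*trend when phi = 1

series = [10, 12, 13, 15, 16, 18, 19, 21]
print(holt_forecast(series, alpha=0.8, beta=0.2, h=3))            # linear trend
print(holt_forecast(series, alpha=0.8, beta=0.2, h=3, phi=0.8))   # damped trend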
9.9 Exponential Smoothing Methods

Exponential Smoothing Methods

(Trend: N = none, A = additive, Ad = additive damped. Seasonal: N = none, A = additive, M = multiplicative. Here m is the seasonal period, k is the integer part of (h − 1)/m, and φ_h = φ + φ² + · · · + φ^h.)

Trend | Seasonal | Forecast Equation | Level Equation | Seasonality Equation
N  | N | ŷ_{t+h|t} = l_t | l_t = α y_t + (1 − α) l_{t−1} | N/A
N  | A | ŷ_{t+h|t} = l_t + s_{t+h−m(k+1)} | l_t = α (y_t − s_{t−m}) + (1 − α) l_{t−1} | s_t = γ (y_t − l_{t−1}) + (1 − γ) s_{t−m}
N  | M | ŷ_{t+h|t} = l_t s_{t+h−m(k+1)} | l_t = α (y_t / s_{t−m}) + (1 − α) l_{t−1} | s_t = γ (y_t / l_{t−1}) + (1 − γ) s_{t−m}
A  | N | ŷ_{t+h|t} = l_t + h b_t | l_t = α y_t + (1 − α)(l_{t−1} + b_{t−1}) | N/A
A  | A | ŷ_{t+h|t} = (l_t + h b_t) + s_{t+h−m(k+1)} | l_t = α (y_t − s_{t−m}) + (1 − α)(l_{t−1} + b_{t−1}) | s_t = γ (y_t − l_{t−1} − b_{t−1}) + (1 − γ) s_{t−m}
A  | M | ŷ_{t+h|t} = (l_t + h b_t) s_{t+h−m(k+1)} | l_t = α (y_t / s_{t−m}) + (1 − α)(l_{t−1} + b_{t−1}) | s_t = γ (y_t / (l_{t−1} + b_{t−1})) + (1 − γ) s_{t−m}
Ad | N | ŷ_{t+h|t} = l_t + φ_h b_t | l_t = α y_t + (1 − α)(l_{t−1} + φ b_{t−1}) | N/A
Ad | A | ŷ_{t+h|t} = (l_t + φ_h b_t) + s_{t+h−m(k+1)} | l_t = α (y_t − s_{t−m}) + (1 − α)(l_{t−1} + φ b_{t−1}) | s_t = γ (y_t − l_{t−1} − φ b_{t−1}) + (1 − γ) s_{t−m}
Ad | M | ŷ_{t+h|t} = (l_t + φ_h b_t) s_{t+h−m(k+1)} | l_t = α (y_t / s_{t−m}) + (1 − α)(l_{t−1} + φ b_{t−1}) | s_t = γ (y_t / (l_{t−1} + φ b_{t−1})) + (1 − γ) s_{t−m}
