
PR module 4 QB

April 2025

1 Why Do We Go for Spectral Clustering?


Spectral clustering is a powerful, flexible, unsupervised learning technique that excels in scenarios where tradi-
tional methods like K-means fall short. It is especially effective for complex-shaped data, non-linearly separable
patterns, and graph-based data structures.

1.1 Handles Non-Convex and Non-Linear Data Structures


Traditional algorithms like K-means assume clusters are convex and isotropic (e.g., circular or spherical), which
limits their effectiveness when clusters have irregular shapes. Spectral clustering, on the other hand, can handle
non-convex and non-linearly separable clusters due to its graph-based approach.
Example: Consider two “moon-shaped” clusters. K-means may incorrectly split one moon due to Euclidean
proximity, whereas spectral clustering captures the structural integrity and keeps the moons intact.
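The moons example can be reproduced in a few lines; this is an illustrative sketch assuming scikit-learn is available (the dataset size, noise level, neighbor count, and seeds are arbitrary choices):

```python
# Sketch: K-means vs. spectral clustering on two "moon" clusters.
# Dataset parameters and seeds are illustrative choices.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)

def agreement(pred, truth):
    # Cluster labels are arbitrary, so score the best of the two matchings.
    return max(np.mean(pred == truth), np.mean(pred == 1 - truth))

print("K-means agreement with true moons: ", agreement(km_labels, y_true))
print("Spectral agreement with true moons:", agreement(sc_labels, y_true))
```

On this data K-means typically splits one moon across the gap, while the k-NN similarity graph lets spectral clustering recover both moons intact.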

1.2 Utilizes Graph Theory and Similarity Matrices


Spectral clustering constructs a graph from data using a similarity matrix (e.g., Gaussian similarity or k-nearest
neighbors). It does not work directly on the raw data but rather on the similarity graph, allowing it to capture
both local and global relationships among data points. This is particularly advantageous in domains like image
segmentation and social network analysis.

1.3 Dimension Reduction via Eigenvalue Decomposition


By computing the eigenvectors of the graph Laplacian matrix, spectral clustering reduces the dimensionality of
the problem. The first few eigenvectors embed the data into a lower-dimensional space where simple clustering
methods like K-means can then be applied more effectively.

1.4 No Strong Assumptions on Cluster Shape


Unlike K-means (which assumes convex clusters) or DBSCAN (which assumes density-based structures), spectral
clustering can detect arbitrarily shaped clusters as long as they are well-connected in the similarity graph.

1.5 Robust to Initialization and Noise


Spectral clustering is less sensitive to initialization because the structure is derived from the Laplacian’s eigen-
vectors. Additionally, when using a k-nearest neighbor graph, it can ignore distant noise, making it more robust
under the right conditions.

1.6 Effective for Small to Medium Datasets


It performs especially well for small to medium-sized datasets where the number of clusters is low but boundaries are complex. However, for very large datasets, the computational cost of eigenvalue decomposition (typically O(n³)) can become prohibitive.

1.7 Suitable for High-Dimensional and Sparse Data


Spectral clustering works well in applications such as image segmentation, bioinformatics, and social network
analysis, where the data might not lie in a Euclidean space and can be sparse or high-dimensional.

1.8 When to Avoid Spectral Clustering
• Large Datasets: Computationally expensive due to eigenvalue decomposition (O(n³) complexity).
• Explicit Cluster Centroids Required: Does not return centroids like K-means.

• Parameter Sensitivity: Performance depends on the similarity metric, graph construction, and number
of clusters (k).

2 Define a Laplacian Matrix


2.1 Definition
The Laplacian Matrix is a matrix representation of a graph that captures the structure and connectivity between
nodes. It plays a crucial role in spectral graph theory and is widely used in spectral clustering.
Given an undirected graph G = (V, E) with n vertices:

• Adjacency Matrix (A or W): A ∈ R^{n×n} such that Aij = 1 if there is an edge between node i and node j, otherwise 0. In a weighted graph, Wij denotes the weight of the edge.
• Degree Matrix (D): A diagonal matrix where Dii = Σj Aij (or Dii = Σj Wij in the weighted case).

Figure 1: The Laplacian matrix of a network. Panel A presents a small undirected network.

2.2 Key Properties


• Symmetric for undirected graphs.
• Positive semi-definite: All eigenvalues ≥ 0.
• The smallest eigenvalue is 0. The number of zero eigenvalues equals the number of connected components.

• The eigenvectors of L or Lsym help embed the graph into a low-dimensional space for clustering.

2.3 Intuition Behind Laplacian


• D indicates how connected a node is; W shows direct connections.
• The Laplacian L = D − W measures how much a node differs from its neighbors.

• For a function f on nodes:

(Lf)i = Σj Wij (fi − fj)

• It acts like a discrete Laplace operator and captures smoothness over the graph.
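The identity above can be checked directly on a small graph; a minimal numpy sketch (the path graph and the function values f are arbitrary choices):

```python
# Verify (Lf)_i = sum_j W_ij (f_i - f_j) on a 3-node path graph.
import numpy as np

W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D = np.diag(W.sum(axis=1))        # degree matrix
L = D - W                         # unnormalized Laplacian
f = np.array([2.0, -1.0, 5.0])    # arbitrary function on the nodes

lhs = L @ f
rhs = np.array([sum(W[i, j] * (f[i] - f[j]) for j in range(3))
                for i in range(3)])
print(lhs, rhs)   # the two sides agree
```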

3 Give a detailed explanation of the Spectral Clustering Algorithm.


Spectral Clustering is a graph-based clustering algorithm that uses the eigenvalues and eigenvectors of a graph
Laplacian derived from a similarity graph. It is highly effective for clustering non-convex and non-linearly
separable data.

3.0.1 Core Idea


Instead of clustering in the original input space, Spectral Clustering transforms the data using eigenvectors of
the Laplacian matrix and clusters them in a low-dimensional space.

3.0.2 Algorithm Overview


Given:
X = {x1, x2, ..., xn}, with each xi ∈ R^d,
we aim to divide the dataset into k clusters.

1. Construct the Similarity Graph


• Nodes: Data points
• Edges: Similarities represented by weights Wij
• Common similarity metrics:
  – Gaussian RBF Kernel: Wij = exp(−∥xi − xj∥² / (2σ²))
  – k-Nearest Neighbors (k-NN)
  – ϵ-Neighborhood Graph
2. Compute the Graph Laplacian
• Degree Matrix: Dii = Σ_{j=1}^{n} Wij
• Laplacian Variants:

L = D − W,  Lsym = I − D^{-1/2} W D^{-1/2},  Lrw = I − D^{-1} W

3. Compute the Eigenvectors


• Find the first k eigenvectors corresponding to the smallest eigenvalues
• Form matrix U ∈ Rn×k
4. Normalize Rows (if using Lsym )
Ûij = Uij / √(Σj Uij²)

5. Apply K-Means Clustering


• Use rows of U or Û as k-dimensional vectors
• Cluster using K-means
6. Assign Cluster Labels
• Use K-means results to label original data points

3.0.3 Graph Cut Interpretation
Spectral Clustering minimizes the normalized cut:

Ncut(A1, ..., Ak) = Σ_{i=1}^{k} cut(Ai, Āi) / vol(Ai)

where cut(A, B) = Σ_{i∈A, j∈B} Wij and vol(A) = Σ_{i∈A} Dii.

3.0.4 Summary
Step Description
1 Build similarity graph W
2 Compute Laplacian matrix L, Lsym , or Lrw
3 Compute first k eigenvectors of Laplacian
4 Form matrix U ∈ Rn×k
5 Run K-means on rows of U
6 Assign original data points to clusters
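The six steps above can be sketched end-to-end in numpy. This is a minimal illustration, not a production implementation: it uses the unnormalized Laplacian, a Gaussian similarity graph, and a tiny inline K-means with farthest-point initialization (all assumed choices), demonstrated on two assumed well-separated blobs:

```python
# From-scratch spectral clustering: similarity graph -> Laplacian ->
# eigenvectors -> K-means on the embedded rows.
import numpy as np

def kmeans(U, k, iters=50):
    # Tiny Lloyd's algorithm with farthest-point initialization.
    centers = [U[0]]
    for _ in range(k - 1):
        d = ((U[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([U[labels == c].mean(0) for c in range(k)])
    return labels

def spectral_clustering(X, k, sigma=1.0):
    # Step 1: Gaussian similarity graph.
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: unnormalized Laplacian L = D - W.
    L = np.diag(W.sum(1)) - W
    # Steps 3-4: first k eigenvectors (smallest eigenvalues) form U.
    _, vecs = np.linalg.eigh(L)     # eigh sorts eigenvalues ascending
    U = vecs[:, :k]
    # Steps 5-6: K-means on the rows of U gives the cluster labels.
    return kmeans(U, k)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)),    # blob 1
               rng.normal(3, 0.2, (20, 2))])   # blob 2
labels = spectral_clustering(X, k=2)
print(labels)
```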

4 How can we find connected components of a graph using Spectral clustering?
Spectral clustering provides an elegant, graph-theoretic approach to identify connected components in an undi-
rected graph. This method leverages the spectral properties (eigenvalues and eigenvectors) of the graph Laplacian
matrix. According to spectral graph theory, the number of connected components in a graph equals the
multiplicity of the eigenvalue 0 in its Laplacian matrix.

4.1 Key Concept


A graph’s connected components are subsets of nodes where:
• Each pair of nodes in a component is connected by a path.

• No edges exist between different components.


In spectral terms, the Laplacian matrix L of the graph reflects this structure:
• The number of eigenvalues equal to zero corresponds to the number of connected components.

• The eigenvectors corresponding to those zero eigenvalues indicate the membership of nodes in each com-
ponent.

4.2 Step-by-Step Method


Let G = (V, E) be an undirected graph with n nodes.
Step 1: Create the Adjacency Matrix A

A ∈ R^{n×n}, where Aij = 1 if there is an edge between node i and j, and 0 otherwise.

Step 2: Compute the Degree Matrix D


Dii = Σj Aij

This represents the degree of node i.


Step 3: Form the Unnormalized Laplacian

L=D−A

Step 4: Compute Eigenvalues and Eigenvectors Solve the eigenvalue problem:

Lv = λv

Compute eigenvalues λ1 , λ2 , . . . , λn and eigenvectors v1 , v2 , . . . , vn .


Step 5: Identify Number of Connected Components The number of eigenvalues equal to zero gives
the number of connected components.
Step 6: Use Eigenvectors to Group Nodes Let k be the number of connected components. Take the
first k eigenvectors (corresponding to zero eigenvalues), stack them column-wise into a matrix U ∈ Rn×k . Each
row of U corresponds to a node and will be identical for nodes in the same component.
Clustering algorithms like K-means can be applied on rows of U , although they may be unnecessary for ideal
Laplacians.

4.3 Example
Consider a graph with 2 connected components.
Adjacency Matrix W:

W = [ 0 1 0 0
      1 0 0 0
      0 0 0 1
      0 0 1 0 ]

Degree Matrix D:

D = [ 1 0 0 0
      0 1 0 0
      0 0 1 0
      0 0 0 1 ]

Laplacian Matrix L = D − W:

L = [  1 −1  0  0
      −1  1  0  0
       0  0  1 −1
       0  0 −1  1 ]

Eigenvalues: λ = [0, 0, 2, 2]
Since there are two zero eigenvalues, the graph has two connected components.
Eigenvectors corresponding to λ = 0:

v1 = [1, 1, 0, 0]ᵀ,  v2 = [0, 0, 1, 1]ᵀ

Here, v1 identifies Component 1 (nodes 1 & 2) and v2 identifies Component 2 (nodes 3 & 4).
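This example can be verified numerically; a short numpy check of the eigenvalue count:

```python
# Count zero eigenvalues of L = D - W for the two-component graph above.
import numpy as np

W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
D = np.diag(W.sum(axis=1))
L = D - W

eigvals = np.linalg.eigvalsh(L)               # ascending order
n_components = int(np.isclose(eigvals, 0.0).sum())
print(eigvals)         # [0, 0, 2, 2]
print(n_components)    # 2 connected components
```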

5 How can we find a 2-way cut in a graph using Spectral clustering?


5.1 What is a 2-Way Cut?
A 2-way cut of a graph is a partition of the vertex set V into two disjoint subsets A and B such that:
• A∪B =V
• The number (or total weight) of edges between A and B is minimized.

This is also referred to as a graph bisection or minimum cut.

5.2 Intuition Behind Spectral 2-Way Cut


Instead of checking all partitions (which is computationally expensive), spectral clustering uses eigenvectors to
approximate the optimal solution. The key idea is that the Fiedler vector (the eigenvector corresponding to
the second smallest eigenvalue of the Laplacian matrix L) contains critical information for partitioning.

5.3 Step-by-Step: 2-Way Cut Using Spectral Clustering
5.3.1 Step 1: Construct the Similarity Graph
• Use the Gaussian kernel: Wij = exp(−||xi − xj||² / (2σ²))

• Or use a k-NN graph to connect each node to its k nearest neighbors.

5.3.2 Step 2: Construct the Graph Laplacian


• Degree matrix D: Dii = Σj Wij
• Unnormalized Laplacian: L = D − W
• Normalized Laplacian: Lsym = D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2}

5.3.3 Step 3: Compute Eigenvectors


Solve:
Lsym v = λv
• λ1 = 0; for the unnormalized L the corresponding eigenvector is v1 = 1 (for Lsym it is D^{1/2}·1)
• λ2 gives the Fiedler vector v2

5.3.4 Step 4: Partition Using the Fiedler Vector


• Sign-Based: Group A: v2 (i) > 0, Group B: v2 (i) ≤ 0
• Threshold-Based: Use threshold t (e.g., median of v2 )
• Optimal Threshold Search: Minimize:

Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B)

where cut(A, B) = Σ_{i∈A, j∈B} Wij and vol(A) = Σ_{i∈A} Dii.

5.4 Why Does This Work?


• The Fiedler vector minimizes a relaxed version of RatioCut or NormalizedCut.
• Nodes with similar values in v2 tend to be more connected.
• The sign split reveals the natural partition in the graph.
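A minimal sketch of the sign-based Fiedler cut, using an assumed toy graph of two triangles joined by a single bridge edge:

```python
# Sign-based 2-way cut via the Fiedler vector on a 6-node graph:
# triangle {0,1,2} and triangle {3,4,5} joined by bridge edge (2,3).
import numpy as np

edges = [(0, 1), (1, 2), (0, 2),     # triangle 1
         (3, 4), (4, 5), (3, 5),     # triangle 2
         (2, 3)]                     # bridge
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]              # eigenvector of 2nd-smallest eigenvalue

A = sorted(np.where(fiedler > 0)[0].tolist())
B = sorted(np.where(fiedler <= 0)[0].tolist())
print(A, B)   # the sign split recovers the two triangles
```

Cutting where the Fiedler entries change sign removes only the single bridge edge, which is exactly the minimum 2-way cut here.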

5.5 Visualization Insight


• Visualize node i’s position as v2 (i) on a line.
• A cut at zero (or any threshold) splits the graph meaningfully.

5.6 Summary

Step Description
1 Construct similarity graph and Laplacian matrix L = D − W or Lsym
2 Compute Fiedler vector v2
3 Partition nodes using sign/threshold/optimal Ncut
4 Approximate 2-way cut achieved

Table 1: Steps for Spectral 2-Way Cut

5.7 Applications
• Image segmentation
• Graph partitioning
• Social network clustering
• Clustering in high-dimensional data

6 Prove the four properties of a Laplacian Matrix L.


Let G = (V, E) be an undirected graph with |V | = n. Define:

• Adjacency matrix A ∈ Rn×n , where Aij = 1 if edge (i, j) ∈ E, else 0


• Degree matrix D ∈ R^{n×n}, where Dii = deg(vi) = Σj Aij

Then, the Laplacian matrix is:


L=D−A

6.1 Property 1: Symmetric and Positive Semi-Definite


Symmetry:
• A is symmetric ⇒ A = Aᵀ
• D is diagonal (hence symmetric)
• So, Lᵀ = (D − A)ᵀ = Dᵀ − Aᵀ = D − A = L

Positive Semi-Definiteness: For any x ∈ Rⁿ,

xᵀLx = xᵀ(D − A)x = Σi di xi² − Σ_{i,j} Aij xi xj

Rewriting:

xᵀLx = (1/2) Σ_{i,j} Aij (xi − xj)² ≥ 0

6.2 Property 2: Eigenvalue Zero with All-Ones Eigenvector


Let 1 = [1, 1, . . . , 1]T . Then:
L · 1 = (D − A) · 1 = D · 1 − A · 1 = 0
Hence, 1 is an eigenvector of L with eigenvalue 0.

6.3 Property 3: Multiplicity of Eigenvalue 0


If the graph has k connected components C1, C2, . . . , Ck, define the indicator vectors

fi(j) = 1 if j ∈ Ci, and 0 otherwise.

Then Lfi = 0 for each i, and the fi are linearly independent.

So, the nullity of L is k, i.e., the eigenvalue 0 has multiplicity k.

6.4 Property 4: Zero Row and Column Sums


Σ_{j=1}^{n} Lij = Dii − Σ_{j=1}^{n} Aij = deg(vi) − deg(vi) = 0

Since L is symmetric, column sums are also 0.
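The four properties can be sanity-checked numerically on a random undirected graph (a check of the proofs above, not a substitute for them; graph size and seed are arbitrary):

```python
# Numerical check of the four Laplacian properties on a random graph.
import numpy as np

rng = np.random.default_rng(42)
A = (rng.random((8, 8)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                          # symmetric adjacency, zero diagonal
L = np.diag(A.sum(axis=1)) - A

eigvals = np.linalg.eigvalsh(L)
ones = np.ones(8)

print(np.allclose(L, L.T))               # Property 1: symmetric
print(np.all(eigvals >= -1e-10))         # Property 1: positive semi-definite
print(np.allclose(L @ ones, 0.0))        # Property 2: L . 1 = 0
print(int(np.isclose(eigvals, 0.0).sum()))  # Property 3: # of components
print(np.allclose(L.sum(axis=0), 0.0))   # Property 4: zero column sums
```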

7 Explain about Hidden Markov Models (HMM)
A Hidden Markov Model (HMM) is a statistical model that describes systems assumed to follow a Markov
process with hidden states. It is widely used in areas like speech recognition, weather prediction, bioinformatics,
finance, and natural language processing (NLP).

7.1 What Does an HMM Do?


In simple terms, an HMM models situations where you observe some data (outputs), but the internal mechanism
or state that generates those observations is hidden or not directly visible.

7.2 Real-World Examples


• Speech Recognition: Words (hidden states) produce acoustic signals (observations).
• Weather Prediction: Actual weather (hidden) causes sensor readings (observations).
• Bioinformatics: DNA sequences (observed) are generated by hidden biological processes.

• Finance: Market regimes (hidden) result in observed price movements.

7.3 Components of an HMM


An HMM is defined by five key components:

1. Hidden States
Denoted as S = {s1 , s2 , . . . , sN } or Q = {q1 , q2 , . . . , qN }. These are unobservable or latent variables.
Example: In speech recognition, the hidden states might be phonemes.

2. Observations
Denoted as O = {o1 , o2 , . . . , oT } or V = {v1 , v2 , . . . , vM }. These are the visible/measurable outputs.
Example: Acoustic signals or sensor readings.
3. Transition Probabilities
Matrix A = [aij], where aij = P(sj | si). This defines the probability of transitioning from state si to sj.
Example: Transition from "sunny" to "rainy".
4. Emission Probabilities
Matrix B = [bj(k)], where bj(k) = P(ok | sj). This defines the probability of observing ok given that the system is in state sj.
Example: Observing "umbrella" when the state is "rainy".
5. Initial Probabilities
Vector π = [πi], where πi = P(si at time t = 1). This defines the probability of the initial state.
Example: The probability that the first day is "sunny".

We denote the full HMM model as λ = (A, B, π).
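The model λ = (A, B, π) can be written down concretely; a sketch using the weather/umbrella values that appear in the examples later in this document (state order Sunny, Rainy; observation order Umbrella, No Umbrella):

```python
# lambda = (A, B, pi) for the weather HMM as numpy arrays.
import numpy as np

A = np.array([[0.7, 0.3],     # P(next | Sunny)
              [0.4, 0.6]])    # P(next | Rainy)
B = np.array([[0.1, 0.9],     # P(Umbrella | Sunny), P(No Umbrella | Sunny)
              [0.8, 0.2]])    # P(Umbrella | Rainy), P(No Umbrella | Rainy)
pi = np.array([0.6, 0.4])     # initial state distribution

# Every row of A and B, and pi itself, must be a probability distribution.
print(np.allclose(A.sum(axis=1), 1.0))
print(np.allclose(B.sum(axis=1), 1.0))
print(np.isclose(pi.sum(), 1.0))
```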

7.4 Assumptions of HMMs


• Markov Property:
P (qt | qt−1 , qt−2 , . . . , q1 ) = P (qt | qt−1 )
The future state depends only on the current state.
• Output Independence:
P (ot | qt , ot−1 , . . . , o1 ) = P (ot | qt )
The current observation depends only on the current state.

7.5 Core Problems Solved by HMMs
1. Evaluation Problem
Goal : Compute P (O | λ), the probability of the observation sequence given the model.
Solution: Forward Algorithm or Forward-Backward Algorithm.
2. Decoding Problem
Goal : Identify the most likely sequence of hidden states that could have led to the observed sequence.
Solution: Viterbi Algorithm.
3. Learning Problem
Goal : Estimate the model parameters (A, B, π) that best explain the observed data.
Solution: Baum-Welch Algorithm (a type of Expectation-Maximization).
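The evaluation problem admits a short implementation; a minimal sketch of the Forward Algorithm using the weather HMM values from the examples in this document:

```python
# Forward algorithm: compute P(O | lambda) for an observation sequence.
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transitions (Sunny, Rainy)
B = np.array([[0.1, 0.9], [0.8, 0.2]])   # emissions (Umbrella, No Umbrella)
pi = np.array([0.6, 0.4])                # initial distribution

def forward(obs):
    alpha = pi * B[:, obs[0]]            # initialization: alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction: alpha_{t+1}(j)
    return float(alpha.sum())            # termination: sum_i alpha_T(i)

obs = [1, 0, 0]   # No Umbrella, Umbrella, Umbrella
print(forward(obs))   # ~0.1001
```

Each induction step costs O(N²), so the whole pass is O(T·N²) rather than the O(Nᵀ) cost of enumerating all state paths.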

8 Explain the diagram (state space diagram) used to represent an HMM using an example.
A Hidden Markov Model (HMM) is a statistical model used to represent systems that are assumed to be a
Markov process with unobservable (hidden) states. A State Space Diagram graphically represents the hidden
states, transition probabilities, and emission (observation) probabilities.

8.1 Components of a State Space Diagram


• Hidden States (Circles): Represent the unobservable internal states.
• Observations (Squares/Diamonds): Represent the observable outputs.
• Transition Probabilities (Arrows between States): Indicate likelihood of moving from one hidden
state to another.
• Emission Probabilities (Arrows from States to Observations): Indicate likelihood of observing
output from a state.

8.2 Example 1: Weather and Umbrella


Hidden States: Sunny (S), Rainy (R)
Observations: Umbrella (U), No Umbrella (N)
Transition Probabilities (A):
P (S → S) = 0.7, P (S → R) = 0.3, P (R → S) = 0.4, P (R → R) = 0.6
Emission Probabilities (B):
P(U|S) = 0.1, P(N|S) = 0.9, P(U|R) = 0.8, P(N|R) = 0.2
Initial State Probabilities:
P(S0) = 0.6, P(R0) = 0.4
Observed Sequence Example:
Sunny → Rainy → Rainy → Sunny ⇒ N, U, U, N
Diagram (arrows labeled with probabilities):

  Start --0.6--> Sunny        Start --0.4--> Rainy
  Sunny --0.7--> Sunny        Sunny --0.3--> Rainy
  Rainy --0.4--> Sunny        Rainy --0.6--> Rainy
  Sunny emits U (0.1), N (0.9)
  Rainy emits U (0.8), N (0.2)

8.3 Example 2: Weather and Activities
Hidden States: Rainy, Sunny
Observations: Walk, Shop, Clean
Transition Probabilities:

P (R → R) = 0.7, P (R → S) = 0.3, P (S → R) = 0.4, P (S → S) = 0.6

Emission Probabilities:
• Rainy: P(Walk) = 0.1, P(Shop) = 0.4, P(Clean) = 0.5
• Sunny: P(Walk) = 0.6, P(Shop) = 0.3, P(Clean) = 0.1
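This example can be decoded with the Viterbi algorithm mentioned earlier; a sketch in which the initial distribution π = [0.6, 0.4] over (Rainy, Sunny) is an assumed value, since the example above does not state one:

```python
# Viterbi decoding for the Rainy/Sunny activity HMM above.
# State order: Rainy, Sunny; observation order: Walk, Shop, Clean.
import numpy as np

A = np.array([[0.7, 0.3],                 # P(next | Rainy)
              [0.4, 0.6]])                # P(next | Sunny)
B = np.array([[0.1, 0.4, 0.5],            # P(obs | Rainy)
              [0.6, 0.3, 0.1]])           # P(obs | Sunny)
pi = np.array([0.6, 0.4])                 # assumed initial distribution

def viterbi(obs):
    delta = pi * B[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        scores = delta[:, None] * A       # scores[i, j] = delta_i * a_ij
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * B[:, o]
    path = [int(delta.argmax())]          # most likely final state
    for ptr in reversed(backptr):         # backtrack through the pointers
        path.append(int(ptr[path[-1]]))
    return path[::-1]

states = ["Rainy", "Sunny"]
print([states[s] for s in viterbi([0, 1, 2])])   # obs: Walk, Shop, Clean
```

Under these assumed values, the most likely explanation of (Walk, Shop, Clean) is Sunny on the first day followed by two Rainy days.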

8.4 Alternative Representations


Transition Matrix:
Sunny Rainy
Sunny 0.7 0.3
Rainy 0.4 0.6
Emission Matrix:
U N
Sunny 0.1 0.9
Rainy 0.8 0.2

8.5 Why Use a State Space Diagram?


• Visualizes how states change and generate outputs.
• Essential for HMM algorithms: Forward-Backward, Viterbi.
• Widely used in speech recognition, bioinformatics, activity monitoring.

8.6 Summary
• Hidden states model internal dynamics.
• Observations reflect visible behavior.
• Transition and emission probabilities define system behavior.
• State space diagram helps reason about and visualize the HMM.

8.7 Hidden Markov Model (HMM): Diagram Interpretation


A Hidden Markov Model (HMM) is a statistical model representing systems with hidden (unobservable)
states that evolve over time according to transition probabilities, and each hidden state emits observable outcomes
with certain probabilities. The figure illustrates a typical state space diagram of an HMM.

8.8 Components of the State Space Diagram


• Hidden States (Circles): X1 and X2 represent the hidden (internal) states of the system.
• Observations (Squares): Y1 , Y2 , Y3 are the observable outputs.
• Transition Probabilities:
– a11 : probability of staying in X1
– a12 : probability of transitioning from X1 to X2
– a21 : probability of transitioning from X2 to X1
– a22 : probability of staying in X2
• Emission Probabilities:
– From X1 : emits Y1 (b11 ), Y2 (b12 ), Y3 (b13 )
– From X2 : emits Y1 (b21 ), Y2 (b22 ), Y3 (b23 )

8.9 Transition Matrix (A)

A = [[a11, a12], [a21, a22]]

8.10 Emission Matrix (B)

B = [[b11, b12, b13], [b21, b22, b23]]

Rows correspond to states X1 and X2, and columns to outputs Y1, Y2, Y3.

8.11 Initial State Probabilities


Let π = [π1 , π2 ] represent the initial probabilities of starting in states X1 and X2 respectively.

8.12 Interpretation
This HMM can be used in tasks such as:
• Predicting the next observation.
• Inferring the most likely sequence of hidden states given a sequence of observations.
• Estimating parameters (aij and bij ) from data.

9 What is the assumption used in an HMM?


Hidden Markov Models (HMMs) are built upon fundamental assumptions that simplify sequential data modeling
while maintaining computational tractability. Below are the core assumptions with detailed explanations:

9.1 Markov Property (First-Order Markov Assumption)


Definition: The future state depends only on the current state, not on the sequence of preceding states. This
is also called the memoryless property.

Mathematical Formulation:
P (St |St−1 , St−2 , . . . , S1 ) = P (St |St−1 )

Implications:

• Simplifies computations by ignoring long-term history.


• Reduces complexity from O(T · N^T) to O(T · N²), where N is the number of states and T is the number of time steps.

Example: In weather prediction, tomorrow’s weather depends only on today’s weather, not on yesterday’s or
earlier days. For instance:
• If today is Sunny, the probability of tomorrow being Rainy might be 20%.

• The model doesn’t consider whether it was Rainy or Cloudy two days ago.

9.2 Output Independence (Observation Independence)


Definition: The current observation depends only on the current hidden state, not on previous observations
or states.

Mathematical Formulation:

P (Ot |St , St−1 , . . . , S1 , Ot−1 , . . . , O1 ) = P (Ot |St )

Implications:
• Observations are conditionally independent given the hidden states.
• This allows the model to focus solely on the state-observation relationship without tracking historical
dependencies.

Example: In speech recognition:


• The acoustic features of a phoneme (e.g., ”ah”) depend only on the current phoneme being spoken.
• The sound of ”ah” does not depend on the previous phoneme ”b” or its acoustic features.

9.3 Stationarity (Time-Invariance)


Definition: Transition and emission probabilities do not change over time. The model’s behavior remains
consistent across all time steps.

Mathematical Formulation:

P (St |St−1 ) and P (Ot |St ) are constant ∀t.

Implications:
• The same transition/emission matrices apply at every time step.
• No need to recompute probabilities as time progresses.

Example: In DNA sequencing:


• The probability of a nucleotide transition (e.g., Adenine → Thymine) remains fixed throughout the se-
quence.
• It doesn’t matter if the transition happens at position 10 or position 1000.

9.4 Discrete States and Observations (Standard HMMs)


Definition:
• Hidden states are discrete (e.g., Sunny, Rainy, Cloudy).
• Observations are discrete (e.g., Umbrella, No Umbrella).

Extension: Continuous observations can be handled using Gaussian HMMs, where emission probabilities are
modeled as probability density functions (PDFs).

Example:
• Discrete: A weather HMM with states {Sunny, Rainy} and observations {Umbrella, No Umbrella}.
• Continuous: In speech recognition, acoustic features are continuous and modeled using Gaussian distribu-
tions.

9.5 Fixed Topology


Definition: The structure of possible transitions (e.g., which states can follow others) is preset and does not
change during inference or learning.

Example:
• In a left-to-right HMM (common in speech recognition), states can only transition forward or repeat (e.g.,
S1 → S2 , S2 → S3 , but not S3 → S1 ).
• In a fully connected HMM, any state can transition to any other state (e.g., Sunny → Rainy → Cloudy →
Sunny).

9.6 Why These Assumptions Matter
• Computational Efficiency: Reduces complexity from O(T · N^T) to O(T · N²), making inference feasible.
• Tractability: Enables efficient algorithms like:
– Forward-Backward (for computing state probabilities).
– Viterbi (for finding the most likely state sequence).
– Baum-Welch (for parameter estimation).
• Limitations: May oversimplify real-world scenarios (e.g., long-term dependencies in language or climate).

9.7 Illustrative Example


Scenario: Predicting weather (hidden states) based on umbrella sightings (observations).

• Markov Property: Tomorrow’s weather depends only on today’s weather.


• Output Independence: Seeing an umbrella today doesn’t affect the probability of seeing one tomorrow.
• Stationarity: The chance of rain following sunshine is always 30%, regardless of the day.
• Discrete States/Observations: Weather is {Sunny, Rainy}; observations are {Umbrella, No Umbrella}.
• Fixed Topology: Transitions allowed: Sunny ↔ Rainy.

Figure 2: State transition diagram for the weather HMM. Numbers represent transition probabilities (Sunny→Sunny 0.7, Sunny→Rainy 0.3, Rainy→Sunny 0.4, Rainy→Rainy 0.6).

10 The template of an HMM is characterized by 6 different parts. State and explain each of them.
Hidden Markov Models (HMMs) are defined by six essential components that work together to model systems
where hidden states generate observable outputs. Below we describe each component in detail with mathematical
formulations and practical examples.

10.1 Set of Hidden States (S)


Definition: A finite set of N unobservable states representing the system’s internal conditions:

S = {s1 , s2 , ..., sN }

Key Characteristics:
• Not directly measurable (must be inferred from observations)
• Represent the true underlying situation
• Finite and discrete in standard HMMs

Examples:
• Weather system: {Sunny, Rainy, Cloudy}
• Speech recognition: {Phoneme1 , ..., PhonemeN }
• DNA sequencing: {Adenine, Thymine, Cytosine, Guanine}

10.2 Observation Symbols (V)
Definition: A set of M possible observable outputs:

V = {v1 , v2 , ..., vM }

Key Characteristics:
• Directly measurable evidence
• May have probabilistic relationship to hidden states
• Can be discrete or continuous (in extended models)

Examples:
• Weather observer’s activities: {Umbrella, No Umbrella}
• Speech signals: {Acoustic feature vectors}
• Genomic data: {Measured base pairs}

10.3 Transition Probability Matrix (A)


Definition: An N × N matrix where:

aij = P (qt+1 = sj |qt = si )

Constraints:
Σ_{j=1}^{N} aij = 1,  ∀i ∈ {1, ..., N}

Example (Weather System):


Sunny Rainy
Sunny 0.7 0.3
Rainy 0.4 0.6

Interpretation:
• 70% chance sunny day follows another sunny day
• 30% chance sunny day transitions to rainy day

10.4 Emission Probability Matrix (B)


Definition: An N × M matrix where:

bj (k) = P (ot = vk |qt = sj )

Constraints:
Σ_{k=1}^{M} bj(k) = 1,  ∀j ∈ {1, ..., N}

Example (Weather-Observation System):


Umbrella No Umbrella
Sunny 0.1 0.9
Rainy 0.8 0.2

Interpretation:
• 90% chance no umbrella is seen on sunny days
• 80% chance umbrella is seen on rainy days

10.5 Initial State Distribution (π)
Definition: A vector of starting probabilities:

πi = P(q1 = si)

Constraints:

Σ_{i=1}^{N} πi = 1

Example:
• P (Start Sunny) = 0.6
• P (Start Rainy) = 0.4

10.6 Observation Sequence (O) and Time (T)


Definition: A sequence of T observed symbols:

O = (o1 , o2 , ..., oT ), ot ∈ V

Key Points:
• T can be fixed or variable
• Represents actual measured data
• Used to infer hidden state sequence

Example (5-Day Weather Observation):

Day          1            2         3         4            5
Observation  No Umbrella  Umbrella  Umbrella  No Umbrella  Umbrella

10.7 Complete System Example


Weather Prediction HMM:
• States: Sunny, Rainy
• Observations: Umbrella, No Umbrella
• Transition Matrix: As shown in Section 10.3
• Emission Matrix: As shown in Section 10.4
• Initial Distribution: π = [0.6, 0.4]
• Time: T = 5 days

10.8 Why These Components Matter


Complete System Specification: These six elements fully define an HMM’s behavior:
• Possible internal states (S)
• Observable outputs (V)
• State evolution rules (A)
• State-observation relationships (B)
• Starting conditions (π)
• Time dimension (T)

Figure 3: Visualization of the weather HMM showing states, transitions, and emissions (start: π = [0.6, 0.4]; transitions as in Figure 2; emissions: Sunny→No Umbrella 0.9, Rainy→Umbrella 0.8).

Enables Solving Fundamental Problems:

• Evaluation: Compute P (O|λ) (Forward-Backward algorithm)


• Decoding: Find most likely state sequence (Viterbi algorithm)
• Learning: Estimate λ = (A, B, π) from data (Baum-Welch)

11 Give 5 applications of HMM.


Hidden Markov Models are widely used across diverse domains to analyze sequential data where the underlying
system states are not directly observable. Below we present five major application areas with detailed examples
and illustrations.

11.1 Speech Recognition


Core Application:
• Converting spoken language to text (e.g., virtual assistants like Siri, Alexa)

• Speaker identification and verification systems

HMM Implementation:
• Hidden States: Phonemes (basic speech sounds like /k/, /æ/, /t/ for ”cat”)
• Observations: Acoustic features (frequency bands, MFCC coefficients)
• Transition Matrix: Probability of phoneme sequences (e.g., /h/→/ɛ/→/l/→/oʊ/ for "hello")

• Emission Matrix: Probability of acoustic features given each phoneme

Example Workflow:
1. User speaks the word "weather" (/w/ /ɛ/ /ð/ /ər/)
2. Microphone captures acoustic signals

3. HMM decodes most likely phoneme sequence


4. Maps to text output ”weather”

11.2 DNA Sequence Analysis


Core Application:
• Gene finding in genomic sequences
• Family classification
• Sequence alignment

HMM Implementation:
• Hidden States: Biological features (coding exons, introns, regulatory regions)
• Observations: Nucleotide bases (A,T,C,G)
• Transition Matrix: Probability of genomic region changes (e.g., exon→intron)

• Emission Matrix: Base probabilities in each region type

Example:
• Identifying gene structure: 5’UTR → Exon → Intron → Exon → 3’UTR
• Start codon (ATG) emission probability: 0.95 in exons, 0.01 elsewhere

11.3 Natural Language Processing (NLP)


Core Application:
• Part-of-speech (POS) tagging

• Named entity recognition


• Machine translation

HMM Implementation:
• Hidden States: Grammatical tags (Noun, Verb, Adjective)
• Observations: Words in sentences

• Transition Matrix: Tag sequence probabilities


• Emission Matrix: Word generation probabilities per tag

Example Sentence: “The quick brown fox jumps”


Word The quick brown fox jumps
POS Det Adj Adj Noun Verb

Key Probabilities:
• P(Verb | Noun) = 0.4 (common after subjects)
• P("fox" | Noun) = 0.01 (specific animal)

11.4 Financial Market Analysis


Core Application:
• Market regime detection (bull/bear markets)

• Algorithmic trading signals


• Risk assessment models

HMM Implementation:
• Hidden States: Market conditions (high/low volatility, trending/mean-reverting)
• Observations: Price changes, trading volumes, volatility indices

• Transition Matrix: Probability of market state changes


• Emission Matrix: Observation distributions per market state

11.5 Human Activity Recognition
Core Application:
• Wearable fitness tracking
• Medical rehabilitation monitoring
• Gesture-based interfaces

HMM Implementation:
• Hidden States: Activities (walking, running, sitting)
• Observations: Sensor data (accelerometer, gyroscope)
• Transition Matrix: Activity sequence probabilities
• Emission Matrix: Sensor readings per activity

Example:
• Walking state: Characteristic 2Hz acceleration patterns
• Sitting state: Near-zero acceleration variance
• Transition P(Walking→Running) = 0.3 for fitness scenarios

12 Explain the principle behind calculating the importance of webpages in Google's PageRank (PR) algorithm.
PageRank revolutionized web search by quantifying page importance through link analysis. Below we detail its
core principles, mathematical formulation, and practical implications.

12.1 Core Principle: Links as Votes of Importance


Web as a Voting System:
• Each webpage is a node in a directed graph
• Each hyperlink is a directed edge representing a ”vote”
• Links from authoritative pages carry more weight (like academic citations)

Key Insight:
• A page linked by many important pages becomes important itself
• Importance is recursive and self-reinforcing

Figure 4: Link graph showing page importance propagation. Page A gains higher PageRank from multiple in-links.

12.2 Mathematical Formulation
PageRank Equation:

PR(P) = (1 − d)/N + d × Σ_i [ PR(P_i) / L(P_i) ]

Terms:
• P R(P ): PageRank of page P
• P R(Pi ): PageRank of linking pages Pi
• L(Pi ): Number of outbound links from Pi

• d: Damping factor (typically 0.85)


• N : Total pages in the index

Interpretation:
• (1 − d)/N : Probability of a random jump (teleportation)
• d × Σ_i [ PR(P_i) / L(P_i) ]: Weighted sum of incoming link values

12.3 The Random Surfer Model


Behavior Simulation: A hypothetical user who:
• Follows links (with probability d = 0.85)

• Jumps randomly (with probability 1 − d = 0.15)

Purpose:
• Prevents getting stuck in dead-ends or cycles
• Ensures all pages have non-zero probability
• Models real user browsing behavior

Figure 5: Random surfer’s decision process at each step: follow a link chosen from the current page’s outlinks (probability d = 0.85), or jump to any page in the index (probability 1 − d = 0.15).

12.4 Computation and Convergence
Iterative Process:
1. Initialize all pages with P R = 1/N

2. Repeatedly apply PageRank formula


3. Values converge to stable ranks, typically within about 50 iterations

Example Calculation: For three pages in a cycle A → B → C → A:

PR(A) = 0.15/3 + 0.85 × (PR(C)/1)
PR(B) = 0.15/3 + 0.85 × (PR(A)/1)
PR(C) = 0.15/3 + 0.85 × (PR(B)/1)
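These three coupled equations can be iterated to a fixed point in a few lines. A quick sketch (the cycle structure A → B → C → A is assumed from the equations above):

```python
d, N = 0.85, 3
links_into = {"A": "C", "B": "A", "C": "B"}   # each page has exactly one out-link

pr = {"A": 1.0, "B": 0.0, "C": 0.0}           # deliberately skewed start
for _ in range(50):
    # Apply PR(p) = (1 - d)/N + d * PR(q)/1 simultaneously to all pages
    pr = {p: (1 - d) / N + d * pr[q] for p, q in links_into.items()}
```

By symmetry the cycle converges to PR = 1/3 for every page, and the total PageRank stays 1 at every iteration.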

13 Clearly explain all steps, using an example, in Google’s PR algorithm.
13.1 Overview of PageRank
PageRank quantifies webpage importance by analyzing the web’s link structure, where links serve as votes of
confidence. The algorithm models:

• Web as a Directed Graph:


– Nodes represent webpages
– Directed edges represent hyperlinks

• Random Surfer Model:


– Users follow links with probability d = 0.85
– Random jumps occur with probability 1 − d = 0.15

13.2 Mathematical Formulation


The PageRank of a page P is calculated as:

PR(P) = (1 − d)/N + d × Σ_i [ PR(P_i) / L(P_i) ]
Where:

• P R(Pi ): PageRank of pages linking to P


• L(Pi ): Number of outbound links from Pi
• N : Total number of pages
• d: Damping factor (typically 0.85)

13.3 Example Web Structure


Consider four pages with these links:

13.4 Step-by-Step Calculation


Initialization: All pages start with equal PageRank:

P R(A) = P R(B) = P R(C) = P R(D) = 0.25

Figure 6: Link structure for our example: A→B→C→A with D linking to both A and B

First Iteration:
• Page A:

PR(A) = (1 − 0.85)/4 + 0.85 × (PR(C)/1 + PR(D)/2)
      = 0.0375 + 0.85 × (0.25 + 0.125)
      = 0.0375 + 0.31875 = 0.35625

• Page B:

PR(B) = 0.0375 + 0.85 × (PR(A)/1 + PR(D)/2)
      = 0.0375 + 0.85 × (0.25 + 0.125)
      = 0.35625

• Page C:

PR(C) = 0.0375 + 0.85 × (PR(B)/1)
      = 0.0375 + 0.2125 = 0.25

• Page D:

PR(D) = 0.0375 + 0 (no incoming links)
      = 0.0375

Subsequent Iterations: After 20-30 iterations, values converge to approximately:

Page    Stable PageRank
A       0.32
B       0.33
C       0.31
D       0.04

13.5 Key Observations


• Pages A and B rank highest: A receives C’s full vote plus half of D’s, while B receives A’s full vote plus half of D’s
• Page D ranks lowest (no incoming links, so it keeps only the teleportation share)
• Link Equity Division:
– Page D’s single vote is split equally between A and B
– A, B, and C each pass their full vote along their single out-link
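These observations can be checked by replaying the iteration directly. A small sketch (the out-links are read off Figure 6):

```python
d, N = 0.85, 4
out = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A", "B"]}  # Figure 6 links
pr = {p: 1 / N for p in out}                                 # uniform start

for _ in range(100):
    new = {p: (1 - d) / N for p in out}       # teleportation share for everyone
    for p, targets in out.items():
        for t in targets:
            new[t] += d * pr[p] / len(targets)   # each page splits its vote
    pr = new

# D, with no in-links, keeps only the teleportation share (1 - d)/N = 0.0375.
```

Running this confirms that D stays pinned at the teleportation minimum while the other three pages share the rest.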

Figure 7: Final PageRank distribution among pages

13.6 Algorithm Properties


• Convergence: Guaranteed due to damping factor

• Dead Ends: Handled through random jumps


• Spam Resistance: Difficult to artificially inflate PageRank

13.7 Practical Implications


• SEO Strategy:
– Acquire links from high-PageRank pages
– Prefer links from pages with few outbound links
• Modern Search:

– PageRank remains one of over 200 ranking factors


– Combines with content quality, user signals, etc.

Component Role
Link Graph Models web connectivity
Damping Factor Accounts for random navigation
Iterative Computation Ensures stable importance scores
Normalization Adjusts for link quantity/quality

Table 2: Core components of PageRank calculation

14 What are dangling nodes and irreducible graphs? How are such
cases handled in the PR algorithm?
Two critical challenges in PageRank computation are dangling nodes and irreducible graphs. This section
explains these concepts and their solutions in detail.

14.1 Dangling Nodes


Definition: Pages with no outbound links (e.g., PDFs, images, or dead-end pages). These disrupt PageRank
flow because they don’t distribute their PageRank to other pages.

Figure 8: Example graph A → B → C with dangling node C (no outbound links)

Problems:
• The random surfer gets ”trapped” with no links to follow
• Causes PageRank to ”leak” out of the system

• Breaks the stochastic property of the transition matrix

Solutions:
• Teleportation: Treat dangling nodes as linking to all pages:
M_ij = 1/N for all j if page i is dangling; otherwise M_ij keeps its original value

• Damping Factor: The (1 − d)/N term ensures some PR flows to all pages
• Redistribution: During computation, evenly distribute a dangling node’s PR to all pages
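The teleportation fix amounts to a one-line patch on the transition matrix. A sketch on an assumed 3-page toy graph whose last page is dangling:

```python
N = 3
# Rows = pages A, B, C; entry M[i][j] = probability of moving from page i to j.
# Page C (last row) is dangling: no out-links, so its row is all zeros.
M = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]

for i, row in enumerate(M):
    if sum(row) == 0:                 # dangling row detected
        M[i] = [1.0 / N] * N          # teleport: link to every page uniformly

# Every row now sums to 1, restoring the stochastic property.
```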

14.2 Irreducible Graphs


Definition: A graph is irreducible when every page can reach every other page, i.e., it is strongly connected. The real web violates this: it contains disconnected components with no links between them, which makes the raw link graph reducible (not strongly connected).

Figure 9: Reducible graph with two disconnected components (A–B and C–D)

Problems:
• PageRank cannot flow between disconnected components
• Leads to rank ”sinks” where PR accumulates in isolated groups
• No unique stationary distribution exists

Solutions:
• Damping Factor: Forces connectivity via random jumps:

PR(P) = (1 − d)/N + d × Σ_i [ PR(P_i) / L(P_i) ]

where (1 − d)/N ensures all pages are reachable


• Artificial Links: Add small transition probabilities between all pages

14.3 Combined Approach in PageRank


The standard PageRank algorithm integrates solutions for both issues:

• For Dangling Nodes:


– Add virtual links to all pages
– Implemented via transition matrix adjustments
• For Irreducibility:
– Use damping factor (d = 0.85)
– Guarantees strongly connected graph

Figure 10: Combined solutions applied to a sample graph (dangling-node fixes plus irreducibility fixes)

14.4 Practical Example


Consider a web graph with:
• Component 1: A → B → C (C is dangling)
• Component 2: D → E (disconnected)

Solutions Applied:
• For C (dangling): Add virtual links to A, B, D, E
• For disconnected components: Damping factor ensures 15% PR flows randomly

Result:
• All pages receive some PageRank
• No PR accumulation in isolated components
• Convergence to a unique stationary distribution

15 Prove the convergence of Google’s PageRank(PR) algorithm.


15.1 Introduction to PageRank
PageRank models the web as a directed graph where:
• Nodes represent web pages.
• Edges represent hyperlinks between pages.
The goal is to compute a rank vector π where each entry πi represents the importance of page i.

15.2 The Google Matrix G


The web’s link structure is encoded in the Google matrix G, defined as:
G = dT + (1 − d) × (ee^T)/N,
where:
• T : Transition matrix derived from link structure.
– If page i has L(i) outgoing links, Tij = 1/L(i) if i links to j, else 0.
– Dangling nodes (pages with no links) are treated as linking to all pages uniformly: Tij = 1/N .
• d: Damping factor (typically 0.85), the probability of following a link.
• ee^T : the N × N matrix of all 1’s, where e is the all-ones column vector (ensures uniform jumps when the user doesn’t follow links).
• N : Total number of pages.
Example: For a mini-web with 3 pages where Page 1 links to Page 2, and Page 2 links to Page 3 (Page 3 is dangling, so its row is made uniform):

T = ( 0    1    0
      0    0    1
      1/3  1/3  1/3 ),        G = 0.85 T + 0.15 × (ee^T)/3.

15.3 Key Properties Ensuring Convergence
For G to converge to a unique π, it must satisfy:
1. Stochasticity: Each row of G sums to 1 (valid probability distribution).
2. Irreducibility: Any page can reach any other page (ensured by damping, since (1 − d) ee^T/N adds non-zero transitions).
3. Aperiodicity: No cyclic paths trap the surfer indefinitely (damping breaks cycles).
Perron-Frobenius Theorem: A stochastic, irreducible, and aperiodic matrix has:
• A unique largest eigenvalue λ1 = 1.

• A corresponding eigenvector π with all positive entries (the PageRank vector).

15.4 Power Iteration Method


The PageRank vector π is computed iteratively:

1. Initialize: π (0) = [1/N, . . . , 1/N ].


2. Iterate: π (k+1) = π (k) G.
3. Convergence: Stop when ∥π (k+1) − π (k) ∥ < ϵ.

Why It Works:
• G’s second-largest eigenvalue λ2 satisfies |λ2 | ≤ d, so the error decays at rate O(d^k).
• Example: After 50 iterations with d = 0.85, the error scales as 0.85^50 ≈ 3 × 10^−4.

15.5 Practical Example


Consider a 3-page web:
• Page A links to Page B.
• Page B links to Page C.

• Page C has no links (dangling node).


Transition Matrix T :

T = ( 0    1    0
      0    0    1
      1/3  1/3  1/3 ).

Google Matrix G (with d = 0.85):

G = 0.85 T + 0.05 × ( 1  1  1
                      1  1  1
                      1  1  1 ).
Power Iteration: Starting from π (0) = [1/3, 1/3, 1/3], after 10 iterations, π stabilizes to the ranks of A, B, C.
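The construction above can be run directly. A sketch of the power iteration (same T, d = 0.85, pure Python for clarity):

```python
d, N = 0.85, 3
T = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1/3, 1/3, 1/3]]              # row 3: dangling page C, made uniform

# Google matrix: G = d*T + (1 - d)/N * (all-ones matrix), row-stochastic
G = [[d * T[i][j] + (1 - d) / N for j in range(N)] for i in range(N)]

pi = [1.0 / N] * N                 # pi(0) = [1/3, 1/3, 1/3]
for _ in range(100):               # power iteration: pi(k+1) = pi(k) G
    pi = [sum(pi[i] * G[i][j] for i in range(N)) for j in range(N)]
```

The resulting vector is positive, sums to 1, and ranks C highest here, since both the A→B→C chain and the teleportation mass funnel into it.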

15.6 Conclusion
PageRank converges because:
1. G is stochastic, irreducible, and aperiodic (due to damping).

2. The Perron-Frobenius theorem guarantees a unique π.


3. Power iteration converges geometrically with rate d.

16 Explain the Perceptron Approach used in pattern classifiers using
the perceptron node and threshold logic unit.
The perceptron is a fundamental algorithm in machine learning for binary classification, inspired by biological
neurons. It serves as the building block for artificial neural networks, using a simple computational model to
classify input patterns.

16.1 Perceptron Node and Threshold Logic Unit (TLU)


Perceptron Node:

• Basic computational unit modeled after a biological neuron


• Receives multiple input signals and produces a binary output
Threshold Logic Unit (TLU):

• Decision-making component that applies a step function


• Computes weighted sum of inputs and compares to threshold
Mathematical Model:
f (x) = h(w · x + b)
where:
• x: Input vector

• w: Weight vector
• b: Bias term
• h: Heaviside step function: h(z) = 1 if z > 0, and 0 otherwise

Geometric Interpretation:

• Σ_i w_i x_i + b = 0 defines a hyperplane decision boundary
• Points are classified according to which side of the boundary they lie on

16.2 Perceptron Learning Algorithm


Training Process:
1. Initialization:
• Set weights wi and bias b to small random values or zero
2. Iterative Training: For each training sample (xj , dj ):

• Compute output:
yj = h(w · xj + b)
• Update weights if misclassified:
wi ← wi + η(dj − yj )xj,i
b ← b + η(dj − yj )
where η is learning rate (0 < η ≤ 1)
3. Convergence:
• Repeat until no errors or maximum iterations reached
• Guaranteed to converge for linearly separable data (Perceptron Convergence Theorem)

16.3 Example: Logical AND Function
Truth Table:
x1 x2 d
0 0 0
0 1 0
1 0 0
1 1 1
Learned Decision Boundary:
• Example solution: w1 = 1, w2 = 1, b = −1.5
• Implements the function: x1 ∧ x2
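The training process of 16.2 can be verified on this AND example. A minimal sketch using the {0, 1} convention; η = 1 and the epoch cap are arbitrary choices, and the learned weights are one of many valid solutions (w1 = 1, w2 = 1, b = −1.5 is another):

```python
def step(z):
    return 1 if z > 0 else 0      # threshold logic unit

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
D = [0, 0, 0, 1]                  # AND truth table
w, b, eta = [0, 0], 0, 1

for _ in range(100):              # AND is separable, so PTA converges quickly
    for x, d in zip(X, D):
        y = step(w[0] * x[0] + w[1] * x[1] + b)
        w[0] += eta * (d - y) * x[0]   # update fires only when d != y
        w[1] += eta * (d - y) * x[1]
        b += eta * (d - y)

preds = [step(w[0] * x[0] + w[1] * x[1] + b) for x in X]
```

After convergence the predictions reproduce the truth table exactly.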

16.4 Limitations
1. Linear Separability:
• Cannot solve non-linearly separable problems (e.g., XOR)
2. Single-Layer Limitation:
• Only capable of linear decision boundaries
3. Binary Output:
• No probabilistic interpretation of outputs

17 Draw the figure depicting a basic perceptron model.


17.1 Figure: Basic Perceptron Model
x1 --w1--\
x2 --w2---->  ( Σ )  --z-->  [ step ]  --> y
 ...      /
xn --wn--/

z = Σ_i w_i x_i + b,        y = 1 if z ≥ 0, and 0 if z < 0

17.2 Components Description


• Inputs (x1 , x2 , ..., xn ): Feature vector (e.g., pixel values in an image)
• Weights (w1 , w2 , ..., wn ): Learned parameters (strength of each input connection)
• Bias (b): Shifts the decision boundary away from the origin
• Summing Junction (Σ): Computes the weighted sum z = Σ_{i=1}^{n} w_i x_i + b
• Activation Function (Step Function):
– Outputs y = 1 if z ≥ 0
– Outputs y = 0 (or −1) if z < 0

18 Define the generic objective function for a basic perceptron model, explaining all its variables/parameters.
The Perceptron is a fundamental machine learning algorithm for binary classification. It is a linear classifier
that attempts to find a hyperplane that separates data from two different classes.

18.1 Objective (Loss) Function
The Perceptron loss function penalizes only misclassified points and is defined as:
L(w, b) = Σ_{i=1}^{N} max(0, −y_i (w · x_i + b))

• xi : Feature vector of the ith example


• yi : Label of the ith example (±1)
• w: Weight vector
• b: Bias term

• N : Number of training samples

If a sample is correctly classified, the term inside max is negative, resulting in zero loss. For misclassified
samples, the loss becomes positive, prompting weight updates.

18.2 Learning Rule


The Perceptron updates its weights and bias as follows:

w ← w + η · yi · xi
b ← b + η · yi

• η: Learning rate

This update occurs only when yi (w · xi + b) ≤ 0.

18.3 Example
Assume:
• x1 = [2, 3], y1 = +1

• x2 = [−1, −2], y2 = −1
• Initial w = [0, 0], b = 0, η = 1
Step 1: For x1 , the prediction is incorrect. Update:

w = [2, 3], b=1

Step 2: For x2 , prediction becomes correct. No update needed.
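Both steps can be checked numerically. A small sketch of the loss from 18.1 on this two-point example:

```python
def perceptron_loss(w, b, data):
    # L(w, b) = sum_i max(0, -y_i * (w . x_i + b))
    return sum(max(0.0, -y * (w[0] * x[0] + w[1] * x[1] + b))
               for x, y in data)

data = [((2, 3), +1), ((-1, -2), -1)]

# Before training: both points sit exactly ON the boundary, so the loss is 0
# even though y*(w.x + b) <= 0 still triggers an update (the boundary case).
loss_before = perceptron_loss([0, 0], 0, data)

# After Step 1's update (w = [2, 3], b = 1) both points are classified with
# positive margin, so the loss is again 0 and no further updates occur.
loss_after = perceptron_loss([2, 3], 1, data)
```

This illustrates a subtlety of the Perceptron loss: zero loss does not by itself mean training is done, since boundary points (y·z = 0) still count as misclassified under the update rule.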

19 Explain the steps involved in the Perceptron Training Algorithm (PTA).
The Perceptron Training Algorithm (PTA) is one of the simplest supervised learning algorithms used for
binary classification tasks. It learns a linear decision boundary to separate two classes. The algorithm
updates the weights and bias based on classification errors using a simple rule.

19.1 What is the Perceptron?


A Perceptron is a basic building block of neural networks. It takes multiple input features, applies a weight to
each, adds a bias, and passes the result through an activation function to predict a binary output (like 0 or
1, or -1 and +1).

19.2 Initialization
• Weights (w): Initialize to zeros or small random values.
• Bias (b): Set to zero or a small value.

• Learning Rate (η): A small value like 0.1 or 0.01.


• Maximum Epochs: Define to control convergence.

19.3 The Training Loop (Epochs)


Step 1: For each training sample (x_i, y_i):
Step 2: Compute the activation:

z = w · x_i + b = Σ_{j=1}^{n} w_j x_j^(i) + b

Step 3: Apply the activation function:

ŷ_i = 1 if z ≥ 0, and 0 (or −1) otherwise

Step 4: Update (if misclassified):

• For labels in {0, 1}:

w_j ← w_j + η (y_i − ŷ_i) x_j^(i),        b ← b + η (y_i − ŷ_i)

• For labels in {−1, +1}:

w ← w + η y_i x_i,        b ← b + η y_i

19.4 Convergence Criteria


• All training samples are correctly classified
• Maximum number of epochs reached

• No improvement in error

20 Given a dataset, demonstrate how the Perceptron Training Algorithm (PTA) works starting with a zero-valued weight vector ω, i.e., initially ω = (0, 0, 0)^T. Calculate the ω vector at each iteration until PTA converges, showing all intermediate steps.
20.1 Dataset
We use the following training dataset with four examples:

Example    x1    x2    x3 (bias)    Label (y)
1          1     0     1            +1
2          0     1     1            +1
3          1     1     1            −1
4          0     0     1            −1

20.2 Algorithm Initialization
• Initial weight vector: ω (0) = (0, 0, 0)T
• Learning rate: η = 1
• Activation function: f (z) = sign(z)

20.3 Training Process


20.3.1 Epoch 1
• Example 1: x = (1, 0, 1), y = +1
z =ω·x=0
ŷ = sign(0) = +1 (Correct)
ω remains (0, 0, 0)^T

• Example 2: x = (0, 1, 1), y = +1


z=0
ŷ = +1 (Correct)
ω remains (0, 0, 0)T

• Example 3: x = (1, 1, 1), y = −1


z=0
ŷ = +1 (Misclassified)
ω ← ω + η · y · x = (−1, −1, −1)T

• Example 4: x = (0, 0, 1), y = −1


z = −1
ŷ = −1 (Correct)
ω remains (−1, −1, −1)T

20.3.2 Epoch 2
• Example 1: x = (1, 0, 1), y = +1
z = −2
ŷ = −1 (Misclassified)
ω ← (0, −1, 0)T

• Example 2: x = (0, 1, 1), y = +1


z = −1
ŷ = −1 (Misclassified)
ω ← (0, 0, 1)T

• Example 3: x = (1, 1, 1), y = −1


z=1
ŷ = +1 (Misclassified)
ω ← (−1, −1, 0)T

• Example 4: x = (0, 0, 1), y = −1


z=0
ŷ = +1 (Misclassified)
ω ← (−1, −1, −1)T

20.4 Observations
• The weights return to (−1, −1, −1)^T at the end of every epoch, so the updates of epoch 2 repeat indefinitely
• This cycling indicates the data is not linearly separable (the labels encode XOR on x1, x2)

• The perceptron cannot converge for this dataset

20.5 Key Concepts


• Update Rule: ω ← ω + η · y · x (applied only for misclassified samples)

• Convergence: Guaranteed only for linearly separable data (Perceptron Convergence Theorem)
• Bias Handling: Incorporated through x3 = 1 in the input vector
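The whole trace above can be replayed mechanically, using the same sign(0) = +1 convention as the worked steps:

```python
def sign(z):
    return 1 if z >= 0 else -1    # the worked example's convention: sign(0) = +1

data = [((1, 0, 1), +1), ((0, 1, 1), +1), ((1, 1, 1), -1), ((0, 0, 1), -1)]
w = [0, 0, 0]

history = []
for epoch in range(4):
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        if sign(z) != y:                              # update on mistakes only
            w = [wi + y * xi for wi, xi in zip(w, x)]
    history.append(tuple(w))

# Every epoch ends back at (-1, -1, -1): epoch 2's updates repeat forever
# because this dataset (XOR on x1, x2) is not linearly separable.
```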

20.6 Alternative Example (Linearly Separable Case)


For contrast, consider the AND function dataset which is linearly separable:

x1 x2 y
0 0 -1
0 1 -1
1 0 -1
1 1 +1

PTA would converge to a solution like ω = (1, 1, −1.5)T after a few epochs.

20.7 Conclusion
The demonstration shows:
• How PTA iteratively updates weights
• The importance of linear separability for convergence
• Practical behavior with both separable and non-separable data

21 Describe the intuition behind updating weights in the Perceptron Training Algorithm.
21.1 Core Principle
The Perceptron Training Algorithm (PTA) operates on error-driven learning, adjusting weights only when
misclassifications occur. This simple yet powerful mechanism enables the perceptron to gradually improve its
decision boundary.

21.2 Mathematical Foundation


For each input x with true label y ∈ {−1, +1}:

• Compute activation: z = w · x + b
• Predict: ŷ = sign(z)
• Update rule when ŷ ̸= y:

w ←w+η·y·x
b←b+η·y

Figure 11: Iterative adjustment of decision boundary through weight updates

21.3 Visualization of Learning Process


21.3.1 Case 1: False Negative (y = +1, ŷ = −1)
• Problem: z was too negative

• Solution: Add ηx to w
⇒ Increases z for future similar inputs

21.3.2 Case 2: False Positive (y = −1, ŷ = +1)


• Problem: z was too positive
• Solution: Subtract ηx from w
⇒ Decreases z for future similar inputs

21.4 Learning Rate Dynamics


The learning rate η controls update magnitudes:
Value Effect
Small η (e.g., 0.01) Slow but stable convergence
η = 1 (default) Standard correction per misclassification
Large η (e.g., 10) Fast but risky (may overshoot)

21.5 Bias Term Adjustment


The bias b (equivalent to w0 ) receives special treatment:

• Input is always +1 (the ”dummy feature”)


• Update rule: b ← b + η · y · 1
• Effect: Shifts boundary parallel to itself

21.6 Concrete Example
Consider 2D data with:
• Current weights: w = (1, 1), b = −1.5
• Misclassified point: x = (2, 1), y = −1 (here z = 2 + 1 − 1.5 = 1.5 > 0, so the model wrongly predicts ŷ = +1)
• Update (η = 1):

w ← (1, 1) + 1 · (−1) · (2, 1) = (−1, 0)
b ← −1.5 + 1 · (−1) = −2.5

21.7 Convergence Properties


• Guaranteed for linearly separable data (Perceptron Convergence Theorem)
• Cycling indicates non-separable data (e.g., XOR problem)
• No margin guarantee: Finds any separating boundary, not necessarily optimal

21.8 Limitations and Workarounds


• Non-separable data: Use pocket algorithm (keep best weights)
• No probabilities: Consider logistic regression for confidence scores
• Linear constraints: Multilayer networks for nonlinear boundaries

22 What is/are the assumptions in the Perceptron Training Algorithm?
22.1 Linear Separability
The fundamental assumption is that the data must be linearly separable - there exists a hyperplane that can
perfectly separate the classes.

22.2 Binary Classification Framework


The standard PTA operates strictly in a binary classification setting:
• Label Convention: yi ∈ {−1, +1} (or {0, 1})
• Extension: Multiclass requires one-vs-all or other strategies
• Limitation: No native probabilistic outputs

22.3 Learning Rate Dynamics


The fixed learning rate (η) impacts training:

w ← w + η · yi · x i
Learning Rate Effect
η too large Overshooting, oscillations
η too small Slow convergence
Optimal η Faster convergence

22.4 Data Characteristics


• Noise-Free: Assumes perfect labeling
– Workaround : Pocket Algorithm retains best weights
• Bounded Norm: ∥x∥ ≤ R for some R > 0
• Finite Samples: Practical implementations require finite data

22.5 Architectural Constraints
• Single Layer:
– Limited to linear decision boundaries
– Cannot solve XOR without hidden layers
• Activation Function: f(z) = +1 if z ≥ 0, and −1 otherwise
Non-differentiable nature prevents gradient flow

22.6 Theoretical Guarantees


• Convergence Proof : Finite steps to a solution if the data is separable; the number of updates is bounded by

k ≤ ( R ∥w∗ ∥ / γ )²

where γ is the margin and R bounds the input norm (∥x∥ ≤ R)
• Initialization Invariance: Any initial w works (theoretically)

• When to Use:
– Simple binary classification
– Linearly separable data
– Baseline model before trying complex methods
• When to Avoid:
– Noisy datasets
– Non-linear problems
– Probabilistic outputs needed

23 Explain and differentiate between the roles of the following three activation functions: Sigmoid, ReLU, and tanh.
23.1 Introduction
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. This
section compares three fundamental activation functions: Sigmoid, Hyperbolic Tangent (tanh), and Rectified
Linear Unit (ReLU).

23.2 Sigmoid (Logistic) Activation Function


23.2.1 Mathematical Definition
σ(x) = 1 / (1 + e^(−x))

23.2.2 Characteristics
• Range: (0, 1)
• Shape: Smooth S-curve
• Common Use: Output layer for binary classification

23.2.3 Advantages
• Probabilistic interpretation (output as probability)
• Smooth gradients for backpropagation

23.2.4 Limitations
• Vanishing gradients for extreme inputs
• Not zero-centered (outputs always positive)

• Computationally expensive due to exponentiation

Figure 12: Sigmoid activation function and its derivative

23.3 Hyperbolic Tangent (tanh) Function


23.3.1 Mathematical Definition
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

23.3.2 Characteristics
• Range: (-1, 1)
• Shape: Zero-centered S-curve

• Common Use: Hidden layers in RNNs/LSTMs

23.3.3 Advantages
• Zero-centered outputs improve gradient flow
• Stronger gradients than sigmoid near zero

23.3.4 Limitations
• Still suffers from vanishing gradients

• Computationally expensive

23.4 Rectified Linear Unit (ReLU)


23.4.1 Mathematical Definition
ReLU(x) = max(0, x)

23.4.2 Characteristics
• Range: [0, ∞)
• Shape: Linear for positive inputs, zero otherwise
• Common Use: Default choice for hidden layers

Figure 13: tanh activation function and its derivative

23.4.3 Advantages
• Computationally efficient

• Avoids vanishing gradients for positive inputs


• Promotes sparsity in activations

23.4.4 Limitations
• ”Dying ReLU” problem (neurons can get stuck)

• Not zero-centered

Figure 14: ReLU activation function and its derivative

23.5 Comparative Analysis

Table 3: Feature Comparison of Activation Functions


Feature               Sigmoid               tanh                  ReLU
Output Range          (0, 1)                (−1, 1)               [0, ∞)
Zero-Centered         No                    Yes                   No
Gradient Behavior     Vanishing             Vanishing             Non-vanishing (x > 0)
Computational Cost    High                  High                  Low
Common Use Cases      Output layer          RNNs/LSTMs            Hidden layers
Key Limitation        Vanishing gradients   Vanishing gradients   Dying neurons
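The rows of Table 3 are easy to verify numerically. A small sketch using only the standard library:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

# Ranges: sigmoid stays inside (0, 1), tanh inside (-1, 1), ReLU in [0, inf)
assert 0.0 < sigmoid(-30) and sigmoid(30) < 1.0
assert -1.0 < math.tanh(-5) and math.tanh(5) < 1.0
assert relu(-3.0) == 0.0 and relu(2.5) == 2.5

# Zero-centering: tanh(0) = 0, while sigmoid(0) = 0.5
assert math.tanh(0.0) == 0.0 and sigmoid(0.0) == 0.5

# Vanishing gradient: sigma'(x) = sigma(x) * (1 - sigma(x)) saturates for large |x|
grad_at_10 = sigmoid(10) * (1 - sigmoid(10))
assert grad_at_10 < 1e-4
```

The last check makes the vanishing-gradient row concrete: at x = 10 the sigmoid's derivative is already below 10^−4, while ReLU's derivative there is exactly 1.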

23.6 Selection Guidelines


23.6.1 When to Use Each Function
• Sigmoid: Output layer for binary classification

• tanh: When zero-centered outputs are crucial (e.g., some RNNs)
• ReLU: Default choice for hidden layers in most architectures

23.6.2 Advanced Variants


• Leaky ReLU: LReLU(x) = max(αx, x) (fixes dying ReLU)
• Parametric ReLU: Learns α during training

• Swish: x · σ(βx) (self-gating property)

23.7 Conclusion
• ReLU is generally preferred for hidden layers due to computational efficiency
• Sigmoid remains relevant for probabilistic outputs
• tanh is useful when zero-centered outputs are beneficial
• Modern architectures often use advanced variants to address limitations

