
PR module 4 QB

April 2025

1 Why Do We Go for Spectral Clustering?


Spectral clustering is a powerful, flexible, unsupervised learning technique that excels in scenarios where tradi-
tional methods like K-means fall short. It is especially effective for complex-shaped data, non-linearly separable
patterns, and graph-based data structures.

1.1 Handles Non-Convex and Non-Linear Data Structures


Traditional algorithms like K-means assume clusters are convex and isotropic (e.g., circular or spherical), which
limits their effectiveness when clusters have irregular shapes. Spectral clustering, on the other hand, can handle
non-convex and non-linearly separable clusters due to its graph-based approach.
Example: Consider two “moon-shaped” clusters. K-means may incorrectly split one moon due to Euclidean
proximity, whereas spectral clustering captures the structural integrity and keeps the moons intact.
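The moons example can be reproduced in a few lines; this is an illustrative sketch assuming scikit-learn is available (the dataset size, noise level, neighbor count, and seeds are arbitrary choices):

```python
# Sketch: K-means vs. spectral clustering on two "moon" clusters.
# Dataset parameters and seeds are illustrative choices.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)

def agreement(pred, truth):
    # Cluster labels are arbitrary, so score the best of the two matchings.
    return max(np.mean(pred == truth), np.mean(pred == 1 - truth))

print("K-means agreement with true moons: ", agreement(km_labels, y_true))
print("Spectral agreement with true moons:", agreement(sc_labels, y_true))
```

On this data K-means typically splits one moon across the gap, while the k-NN similarity graph lets spectral clustering recover both moons intact.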

1.2 Utilizes Graph Theory and Similarity Matrices


Spectral clustering constructs a graph from data using a similarity matrix (e.g., Gaussian similarity or k-nearest
neighbors). It does not work directly on the raw data but rather on the similarity graph, allowing it to capture
both local and global relationships among data points. This is particularly advantageous in domains like image
segmentation and social network analysis.

1.3 Dimension Reduction via Eigenvalue Decomposition


By computing the eigenvectors of the graph Laplacian matrix, spectral clustering reduces the dimensionality of
the problem. The first few eigenvectors embed the data into a lower-dimensional space where simple clustering
methods like K-means can then be applied more effectively.

1.4 No Strong Assumptions on Cluster Shape


Unlike K-means (which assumes convex clusters) or DBSCAN (which assumes density-based structures), spectral
clustering can detect arbitrarily shaped clusters as long as they are well-connected in the similarity graph.

1.5 Robust to Initialization and Noise


Spectral clustering is less sensitive to initialization because the structure is derived from the Laplacian’s eigen-
vectors. Additionally, when using a k-nearest neighbor graph, it can ignore distant noise, making it more robust
under the right conditions.

1.6 Effective for Small to Medium Datasets


It performs especially well for small to medium-sized datasets where the number of clusters is low but boundaries are complex. However, for very large datasets, the computational cost of eigenvalue decomposition (typically O(n³)) can become prohibitive.

1.7 Suitable for High-Dimensional and Sparse Data


Spectral clustering works well in applications such as image segmentation, bioinformatics, and social network
analysis, where the data might not lie in a Euclidean space and can be sparse or high-dimensional.

1.8 When to Avoid Spectral Clustering
• Large Datasets: Computationally expensive due to eigenvalue decomposition (O(n³) complexity).
• Explicit Cluster Centroids Required: Does not return centroids like K-means.

• Parameter Sensitivity: Performance depends on the similarity metric, graph construction, and number
of clusters (k).

2 Define a Laplacian Matrix


2.1 Definition
The Laplacian Matrix is a matrix representation of a graph that captures the structure and connectivity between
nodes. It plays a crucial role in spectral graph theory and is widely used in spectral clustering.
Given an undirected graph G = (V, E) with n vertices:

• Adjacency Matrix (A or W): A ∈ R^{n×n} such that Aij = 1 if there is an edge between node i and node j, otherwise 0. In a weighted graph, Wij denotes the weight of the edge.
• Degree Matrix (D): A diagonal matrix where Dii = Σj Aij (or Dii = Σj Wij in the weighted case).

Figure 1: The Laplacian matrix of a network. Panel A presents a small undirected network.

2.2 Key Properties


• Symmetric for undirected graphs.
• Positive semi-definite: All eigenvalues ≥ 0.
• The smallest eigenvalue is 0. The number of zero eigenvalues equals the number of connected components.

• The eigenvectors of L or Lsym help embed the graph into a low-dimensional space for clustering.

2.3 Intuition Behind Laplacian


• D indicates how connected a node is; W shows direct connections.
• The Laplacian L = D − W measures how much a node differs from its neighbors.

• For a function f on nodes:

(Lf)i = Σj Wij (fi − fj)

• It acts like a discrete Laplace operator and captures smoothness over the graph.
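The identity above can be checked directly on a small graph; a minimal numpy sketch (the path graph and the function values f are arbitrary choices):

```python
# Verify (Lf)_i = sum_j W_ij (f_i - f_j) on a 3-node path graph.
import numpy as np

W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D = np.diag(W.sum(axis=1))        # degree matrix
L = D - W                         # unnormalized Laplacian
f = np.array([2.0, -1.0, 5.0])    # arbitrary function on the nodes

lhs = L @ f
rhs = np.array([sum(W[i, j] * (f[i] - f[j]) for j in range(3))
                for i in range(3)])
print(lhs, rhs)   # the two sides agree
```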

3 Give a detailed explanation of the Spectral Clustering Algorithm.


Spectral Clustering is a graph-based clustering algorithm that uses the eigenvalues and eigenvectors of a graph
Laplacian derived from a similarity graph. It is highly effective for clustering non-convex and non-linearly
separable data.

3.0.1 Core Idea


Instead of clustering in the original input space, Spectral Clustering transforms the data using eigenvectors of
the Laplacian matrix and clusters them in a low-dimensional space.

3.0.2 Algorithm Overview


Given:
X = {x1, x2, ..., xn}, with each xi ∈ R^d,
we aim to divide the dataset into k clusters.

1. Construct the Similarity Graph


• Nodes: Data points
• Edges: Similarities represented by weights Wij
• Common similarity metrics:
  – Gaussian RBF Kernel: Wij = exp(−∥xi − xj∥² / (2σ²))
  – k-Nearest Neighbors (k-NN)
  – ϵ-Neighborhood Graph
2. Compute the Graph Laplacian
• Degree Matrix: Dii = Σ_{j=1}^{n} Wij
• Laplacian Variants:

L = D − W,  Lsym = I − D^{-1/2} W D^{-1/2},  Lrw = I − D^{-1} W

3. Compute the Eigenvectors


• Find the first k eigenvectors corresponding to the smallest eigenvalues
• Form matrix U ∈ Rn×k
4. Normalize Rows (if using Lsym )
Ûij = Uij / √(Σj Uij²)

5. Apply K-Means Clustering


• Use rows of U or Û as k-dimensional vectors
• Cluster using K-means
6. Assign Cluster Labels
• Use K-means results to label original data points

3.0.3 Graph Cut Interpretation
Spectral Clustering minimizes the normalized cut:

Ncut(A1, ..., Ak) = Σ_{i=1}^{k} cut(Ai, Āi) / vol(Ai)

where cut(A, B) = Σ_{i∈A, j∈B} Wij and vol(A) = Σ_{i∈A} Dii.

3.0.4 Summary
Step Description
1 Build similarity graph W
2 Compute Laplacian matrix L, Lsym , or Lrw
3 Compute first k eigenvectors of Laplacian
4 Form matrix U ∈ Rn×k
5 Run K-means on rows of U
6 Assign original data points to clusters
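The six steps above can be sketched end-to-end in numpy. This is a minimal illustration, not a production implementation: it uses the unnormalized Laplacian, a Gaussian similarity graph, and a tiny inline K-means with farthest-point initialization (all assumed choices), demonstrated on two assumed well-separated blobs:

```python
# From-scratch spectral clustering: similarity graph -> Laplacian ->
# eigenvectors -> K-means on the embedded rows.
import numpy as np

def kmeans(U, k, iters=50):
    # Tiny Lloyd's algorithm with farthest-point initialization.
    centers = [U[0]]
    for _ in range(k - 1):
        d = ((U[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([U[labels == c].mean(0) for c in range(k)])
    return labels

def spectral_clustering(X, k, sigma=1.0):
    # Step 1: Gaussian similarity graph.
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: unnormalized Laplacian L = D - W.
    L = np.diag(W.sum(1)) - W
    # Steps 3-4: first k eigenvectors (smallest eigenvalues) form U.
    _, vecs = np.linalg.eigh(L)     # eigh sorts eigenvalues ascending
    U = vecs[:, :k]
    # Steps 5-6: K-means on the rows of U gives the cluster labels.
    return kmeans(U, k)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)),    # blob 1
               rng.normal(3, 0.2, (20, 2))])   # blob 2
labels = spectral_clustering(X, k=2)
print(labels)
```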

4 How can we find connected components of a graph using Spectral clustering?
Spectral clustering provides an elegant, graph-theoretic approach to identify connected components in an undi-
rected graph. This method leverages the spectral properties (eigenvalues and eigenvectors) of the graph Laplacian
matrix. According to spectral graph theory, the number of connected components in a graph equals the
multiplicity of the eigenvalue 0 in its Laplacian matrix.

4.1 Key Concept


A graph’s connected components are subsets of nodes where:
• Each pair of nodes in a component is connected by a path.

• No edges exist between different components.


In spectral terms, the Laplacian matrix L of the graph reflects this structure:
• The number of eigenvalues equal to zero corresponds to the number of connected components.

• The eigenvectors corresponding to those zero eigenvalues indicate the membership of nodes in each com-
ponent.

4.2 Step-by-Step Method


Let G = (V, E) be an undirected graph with n nodes.
Step 1: Create the Adjacency Matrix A

A ∈ R^{n×n}, where Aij = 1 if there is an edge between node i and j, and 0 otherwise.

Step 2: Compute the Degree Matrix D


Dii = Σj Aij

This represents the degree of node i.


Step 3: Form the Unnormalized Laplacian

L=D−A

Step 4: Compute Eigenvalues and Eigenvectors Solve the eigenvalue problem:

Lv = λv

Compute eigenvalues λ1 , λ2 , . . . , λn and eigenvectors v1 , v2 , . . . , vn .


Step 5: Identify Number of Connected Components The number of eigenvalues equal to zero gives
the number of connected components.
Step 6: Use Eigenvectors to Group Nodes Let k be the number of connected components. Take the
first k eigenvectors (corresponding to zero eigenvalues), stack them column-wise into a matrix U ∈ Rn×k . Each
row of U corresponds to a node and will be identical for nodes in the same component.
Clustering algorithms like K-means can be applied on rows of U , although they may be unnecessary for ideal
Laplacians.

4.3 Example
Consider a graph with 2 connected components.
Adjacency Matrix W:

W = [ 0 1 0 0
      1 0 0 0
      0 0 0 1
      0 0 1 0 ]

Degree Matrix D:

D = [ 1 0 0 0
      0 1 0 0
      0 0 1 0
      0 0 0 1 ]

Laplacian Matrix L = D − W:

L = [  1 −1  0  0
      −1  1  0  0
       0  0  1 −1
       0  0 −1  1 ]

Eigenvalues: λ = [0, 0, 2, 2]
Since there are two zero eigenvalues, the graph has two connected components.
Eigenvectors corresponding to λ = 0:

v1 = [1, 1, 0, 0]ᵀ,  v2 = [0, 0, 1, 1]ᵀ

Here, v1 identifies Component 1 (nodes 1 & 2) and v2 identifies Component 2 (nodes 3 & 4).
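This example can be verified numerically; a short numpy check of the eigenvalue count:

```python
# Count zero eigenvalues of L = D - W for the two-component graph above.
import numpy as np

W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
D = np.diag(W.sum(axis=1))
L = D - W

eigvals = np.linalg.eigvalsh(L)               # ascending order
n_components = int(np.isclose(eigvals, 0.0).sum())
print(eigvals)         # [0, 0, 2, 2]
print(n_components)    # 2 connected components
```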

5 How can we find a 2-way cut in a graph using Spectral clustering?


5.1 What is a 2-Way Cut?
A 2-way cut of a graph is a partition of the vertex set V into two disjoint subsets A and B such that:
• A∪B =V
• The number (or total weight) of edges between A and B is minimized.

This is also referred to as a graph bisection or minimum cut.

5.2 Intuition Behind Spectral 2-Way Cut


Instead of checking all partitions (which is computationally expensive), spectral clustering uses eigenvectors to
approximate the optimal solution. The key idea is that the Fiedler vector (the eigenvector corresponding to
the second smallest eigenvalue of the Laplacian matrix L) contains critical information for partitioning.

5.3 Step-by-Step: 2-Way Cut Using Spectral Clustering
5.3.1 Step 1: Construct the Similarity Graph
• Use the Gaussian kernel: Wij = exp(−||xi − xj||² / (2σ²))

• Or use a k-NN graph to connect each node to its k nearest neighbors.

5.3.2 Step 2: Construct the Graph Laplacian


• Degree matrix D: Dii = Σj Wij
• Unnormalized Laplacian: L = D − W
• Normalized Laplacian: Lsym = D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2}

5.3.3 Step 3: Compute Eigenvectors


Solve:
Lsym v = λv
• λ1 = 0; for the unnormalized L the corresponding eigenvector is v1 = 1 (for Lsym it is D^{1/2}·1)
• λ2 gives the Fiedler vector v2

5.3.4 Step 4: Partition Using the Fiedler Vector


• Sign-Based: Group A: v2 (i) > 0, Group B: v2 (i) ≤ 0
• Threshold-Based: Use threshold t (e.g., median of v2 )
• Optimal Threshold Search: Minimize:

Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B)

where cut(A, B) = Σ_{i∈A, j∈B} Wij and vol(A) = Σ_{i∈A} Dii.

5.4 Why Does This Work?


• The Fiedler vector minimizes a relaxed version of RatioCut or NormalizedCut.
• Nodes with similar values in v2 tend to be more connected.
• The sign split reveals the natural partition in the graph.
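A minimal sketch of the sign-based Fiedler cut, using an assumed toy graph of two triangles joined by a single bridge edge:

```python
# Sign-based 2-way cut via the Fiedler vector on a 6-node graph:
# triangle {0,1,2} and triangle {3,4,5} joined by bridge edge (2,3).
import numpy as np

edges = [(0, 1), (1, 2), (0, 2),     # triangle 1
         (3, 4), (4, 5), (3, 5),     # triangle 2
         (2, 3)]                     # bridge
W = np.zeros((6, 6))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]              # eigenvector of 2nd-smallest eigenvalue

A = sorted(np.where(fiedler > 0)[0].tolist())
B = sorted(np.where(fiedler <= 0)[0].tolist())
print(A, B)   # the sign split recovers the two triangles
```

Cutting where the Fiedler entries change sign removes only the single bridge edge, which is exactly the minimum 2-way cut here.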

5.5 Visualization Insight


• Visualize node i’s position as v2 (i) on a line.
• A cut at zero (or any threshold) splits the graph meaningfully.

5.6 Summary

Step Description
1 Construct similarity graph and Laplacian matrix L = D − W or Lsym
2 Compute Fiedler vector v2
3 Partition nodes using sign/threshold/optimal Ncut
4 Approximate 2-way cut achieved

Table 1: Steps for Spectral 2-Way Cut

5.7 Applications
• Image segmentation
• Graph partitioning
• Social network clustering
• Clustering in high-dimensional data

6 Prove the four properties of a Laplacian Matrix L.


Let G = (V, E) be an undirected graph with |V | = n. Define:

• Adjacency matrix A ∈ Rn×n , where Aij = 1 if edge (i, j) ∈ E, else 0


• Degree matrix D ∈ R^{n×n}, where Dii = deg(vi) = Σj Aij

Then, the Laplacian matrix is:


L=D−A

6.1 Property 1: Symmetric and Positive Semi-Definite


Symmetry:
• A is symmetric ⇒ A = Aᵀ
• D is diagonal (hence symmetric)
• So, Lᵀ = (D − A)ᵀ = Dᵀ − Aᵀ = D − A = L

Positive Semi-Definiteness: For any x ∈ Rⁿ,

xᵀLx = xᵀ(D − A)x = Σi di xi² − Σ_{i,j} Aij xi xj

Rewriting:

xᵀLx = (1/2) Σ_{i,j} Aij (xi − xj)² ≥ 0

6.2 Property 2: Eigenvalue Zero with All-Ones Eigenvector


Let 1 = [1, 1, . . . , 1]T . Then:
L · 1 = (D − A) · 1 = D · 1 − A · 1 = 0
Hence, 1 is an eigenvector of L with eigenvalue 0.

6.3 Property 3: Multiplicity of Eigenvalue 0


If the graph has k connected components C1, C2, . . . , Ck, define the indicator vectors

fi(j) = 1 if j ∈ Ci, and 0 otherwise.

Then Lfi = 0 for each i, and the fi are linearly independent.

So, the nullity of L is k, i.e., the eigenvalue 0 has multiplicity k.

6.4 Property 4: Zero Row and Column Sums


Σ_{j=1}^{n} Lij = Dii − Σ_{j=1}^{n} Aij = deg(vi) − deg(vi) = 0

Since L is symmetric, column sums are also 0.
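The four properties can be sanity-checked numerically on a random undirected graph (a check of the proofs above, not a substitute for them; graph size and seed are arbitrary):

```python
# Numerical check of the four Laplacian properties on a random graph.
import numpy as np

rng = np.random.default_rng(42)
A = (rng.random((8, 8)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                          # symmetric adjacency, zero diagonal
L = np.diag(A.sum(axis=1)) - A

eigvals = np.linalg.eigvalsh(L)
ones = np.ones(8)

print(np.allclose(L, L.T))               # Property 1: symmetric
print(np.all(eigvals >= -1e-10))         # Property 1: positive semi-definite
print(np.allclose(L @ ones, 0.0))        # Property 2: L . 1 = 0
print(int(np.isclose(eigvals, 0.0).sum()))  # Property 3: # of components
print(np.allclose(L.sum(axis=0), 0.0))   # Property 4: zero column sums
```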

7 Explain about Hidden Markov Models (HMM)
A Hidden Markov Model (HMM) is a statistical model that describes systems assumed to follow a Markov
process with hidden states. It is widely used in areas like speech recognition, weather prediction, bioinformatics,
finance, and natural language processing (NLP).

7.1 What Does an HMM Do?


In simple terms, an HMM models situations where you observe some data (outputs), but the internal mechanism
or state that generates those observations is hidden or not directly visible.

7.2 Real-World Examples


• Speech Recognition: Words (hidden states) produce acoustic signals (observations).
• Weather Prediction: Actual weather (hidden) causes sensor readings (observations).
• Bioinformatics: DNA sequences (observed) are generated by hidden biological processes.

• Finance: Market regimes (hidden) result in observed price movements.

7.3 Components of an HMM


An HMM is defined by five key components:

1. Hidden States
Denoted as S = {s1 , s2 , . . . , sN } or Q = {q1 , q2 , . . . , qN }. These are unobservable or latent variables.
Example: In speech recognition, the hidden states might be phonemes.

2. Observations
Denoted as O = {o1 , o2 , . . . , oT } or V = {v1 , v2 , . . . , vM }. These are the visible/measurable outputs.
Example: Acoustic signals or sensor readings.
3. Transition Probabilities
Matrix A = [aij], where aij = P(sj | si). This defines the probability of transitioning from state si to sj.
Example: Transition from "sunny" to "rainy".
4. Emission Probabilities
Matrix B = [bj(k)], where bj(k) = P(ok | sj). This defines the probability of observing ok given that the system is in state sj.
Example: Observing "umbrella" when the state is "rainy".
5. Initial Probabilities
Vector π = [πi], where πi = P(si at time t = 1). This defines the probability of the initial state.
Example: The probability that the first day is "sunny".

We denote the full HMM model as λ = (A, B, π).
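The model λ = (A, B, π) can be written down concretely; a sketch using the weather/umbrella values that appear in the examples later in this document (state order Sunny, Rainy; observation order Umbrella, No Umbrella):

```python
# lambda = (A, B, pi) for the weather HMM as numpy arrays.
import numpy as np

A = np.array([[0.7, 0.3],     # P(next | Sunny)
              [0.4, 0.6]])    # P(next | Rainy)
B = np.array([[0.1, 0.9],     # P(Umbrella | Sunny), P(No Umbrella | Sunny)
              [0.8, 0.2]])    # P(Umbrella | Rainy), P(No Umbrella | Rainy)
pi = np.array([0.6, 0.4])     # initial state distribution

# Every row of A and B, and pi itself, must be a probability distribution.
print(np.allclose(A.sum(axis=1), 1.0))
print(np.allclose(B.sum(axis=1), 1.0))
print(np.isclose(pi.sum(), 1.0))
```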

7.4 Assumptions of HMMs


• Markov Property:
P (qt | qt−1 , qt−2 , . . . , q1 ) = P (qt | qt−1 )
The future state depends only on the current state.
• Output Independence:
P (ot | qt , ot−1 , . . . , o1 ) = P (ot | qt )
The current observation depends only on the current state.

7.5 Core Problems Solved by HMMs
1. Evaluation Problem
Goal : Compute P (O | λ), the probability of the observation sequence given the model.
Solution: Forward Algorithm or Forward-Backward Algorithm.
2. Decoding Problem
Goal : Identify the most likely sequence of hidden states that could have led to the observed sequence.
Solution: Viterbi Algorithm.
3. Learning Problem
Goal : Estimate the model parameters (A, B, π) that best explain the observed data.
Solution: Baum-Welch Algorithm (a type of Expectation-Maximization).
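The evaluation problem admits a short implementation; a minimal sketch of the Forward Algorithm using the weather HMM values from the examples in this document:

```python
# Forward algorithm: compute P(O | lambda) for an observation sequence.
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transitions (Sunny, Rainy)
B = np.array([[0.1, 0.9], [0.8, 0.2]])   # emissions (Umbrella, No Umbrella)
pi = np.array([0.6, 0.4])                # initial distribution

def forward(obs):
    alpha = pi * B[:, obs[0]]            # initialization: alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction: alpha_{t+1}(j)
    return float(alpha.sum())            # termination: sum_i alpha_T(i)

obs = [1, 0, 0]   # No Umbrella, Umbrella, Umbrella
print(forward(obs))   # ~0.1001
```

Each induction step costs O(N²), so the whole pass is O(T·N²) rather than the O(Nᵀ) cost of enumerating all state paths.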

8 Explain the diagram (state space diagram) used to represent an HMM using an example.
A Hidden Markov Model (HMM) is a statistical model used to represent systems that are assumed to be a
Markov process with unobservable (hidden) states. A State Space Diagram graphically represents the hidden
states, transition probabilities, and emission (observation) probabilities.

8.1 Components of a State Space Diagram


• Hidden States (Circles): Represent the unobservable internal states.
• Observations (Squares/Diamonds): Represent the observable outputs.
• Transition Probabilities (Arrows between States): Indicate likelihood of moving from one hidden
state to another.
• Emission Probabilities (Arrows from States to Observations): Indicate likelihood of observing
output from a state.

8.2 Example 1: Weather and Umbrella


Hidden States: Sunny (S), Rainy (R)
Observations: Umbrella (U), No Umbrella (N)
Transition Probabilities (A):
P (S → S) = 0.7, P (S → R) = 0.3, P (R → S) = 0.4, P (R → R) = 0.6
Emission Probabilities (B):
P(U|S) = 0.1, P(N|S) = 0.9, P(U|R) = 0.8, P(N|R) = 0.2
Initial State Probabilities:
P(S0) = 0.6, P(R0) = 0.4
Observed Sequence Example:
Sunny → Rainy → Rainy → Sunny ⇒ N, U, U, N
Diagram (arrows labeled with probabilities):

  Start --0.6--> Sunny        Start --0.4--> Rainy
  Sunny --0.7--> Sunny        Sunny --0.3--> Rainy
  Rainy --0.4--> Sunny        Rainy --0.6--> Rainy
  Sunny emits U (0.1), N (0.9)
  Rainy emits U (0.8), N (0.2)

8.3 Example 2: Weather and Activities
Hidden States: Rainy, Sunny
Observations: Walk, Shop, Clean
Transition Probabilities:

P (R → R) = 0.7, P (R → S) = 0.3, P (S → R) = 0.4, P (S → S) = 0.6

Emission Probabilities:
• Rainy: P(Walk) = 0.1, P(Shop) = 0.4, P(Clean) = 0.5
• Sunny: P(Walk) = 0.6, P(Shop) = 0.3, P(Clean) = 0.1
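This example can be decoded with the Viterbi algorithm mentioned earlier; a sketch in which the initial distribution π = [0.6, 0.4] over (Rainy, Sunny) is an assumed value, since the example above does not state one:

```python
# Viterbi decoding for the Rainy/Sunny activity HMM above.
# State order: Rainy, Sunny; observation order: Walk, Shop, Clean.
import numpy as np

A = np.array([[0.7, 0.3],                 # P(next | Rainy)
              [0.4, 0.6]])                # P(next | Sunny)
B = np.array([[0.1, 0.4, 0.5],            # P(obs | Rainy)
              [0.6, 0.3, 0.1]])           # P(obs | Sunny)
pi = np.array([0.6, 0.4])                 # assumed initial distribution

def viterbi(obs):
    delta = pi * B[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        scores = delta[:, None] * A       # scores[i, j] = delta_i * a_ij
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * B[:, o]
    path = [int(delta.argmax())]          # most likely final state
    for ptr in reversed(backptr):         # backtrack through the pointers
        path.append(int(ptr[path[-1]]))
    return path[::-1]

states = ["Rainy", "Sunny"]
print([states[s] for s in viterbi([0, 1, 2])])   # obs: Walk, Shop, Clean
```

Under these assumed values, the most likely explanation of (Walk, Shop, Clean) is Sunny on the first day followed by two Rainy days.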

8.4 Alternative Representations


Transition Matrix:
Sunny Rainy
Sunny 0.7 0.3
Rainy 0.4 0.6
Emission Matrix:
U N
Sunny 0.1 0.9
Rainy 0.8 0.2

8.5 Why Use a State Space Diagram?


• Visualizes how states change and generate outputs.
• Essential for HMM algorithms: Forward-Backward, Viterbi.
• Widely used in speech recognition, bioinformatics, activity monitoring.

8.6 Summary
• Hidden states model internal dynamics.
• Observations reflect visible behavior.
• Transition and emission probabilities define system behavior.
• State space diagram helps reason about and visualize the HMM.

8.7 Hidden Markov Model (HMM): Diagram Interpretation


A Hidden Markov Model (HMM) is a statistical model representing systems with hidden (unobservable)
states that evolve over time according to transition probabilities, and each hidden state emits observable outcomes
with certain probabilities. The figure illustrates a typical state space diagram of an HMM.

8.8 Components of the State Space Diagram


• Hidden States (Circles): X1 and X2 represent the hidden (internal) states of the system.
• Observations (Squares): Y1 , Y2 , Y3 are the observable outputs.
• Transition Probabilities:
– a11 : probability of staying in X1
– a12 : probability of transitioning from X1 to X2
– a21 : probability of transitioning from X2 to X1
– a22 : probability of staying in X2
• Emission Probabilities:
– From X1 : emits Y1 (b11 ), Y2 (b12 ), Y3 (b13 )
– From X2 : emits Y1 (b21 ), Y2 (b22 ), Y3 (b23 )

8.9 Transition Matrix (A)

A = [[a11, a12], [a21, a22]]

8.10 Emission Matrix (B)

B = [[b11, b12, b13], [b21, b22, b23]]

Rows correspond to states X1 and X2, and columns to outputs Y1, Y2, Y3.

8.11 Initial State Probabilities


Let π = [π1 , π2 ] represent the initial probabilities of starting in states X1 and X2 respectively.

8.12 Interpretation
This HMM can be used in tasks such as:
• Predicting the next observation.
• Inferring the most likely sequence of hidden states given a sequence of observations.
• Estimating parameters (aij and bij ) from data.

9 What is the assumption used in an HMM?


Hidden Markov Models (HMMs) are built upon fundamental assumptions that simplify sequential data modeling
while maintaining computational tractability. Below are the core assumptions with detailed explanations:

9.1 Markov Property (First-Order Markov Assumption)


Definition: The future state depends only on the current state, not on the sequence of preceding states. This
is also called the memoryless property.

Mathematical Formulation:
P (St |St−1 , St−2 , . . . , S1 ) = P (St |St−1 )

Implications:

• Simplifies computations by ignoring long-term history.


• Reduces complexity from O(T · N^T) to O(T · N²), where N is the number of states and T is the number of time steps.

Example: In weather prediction, tomorrow’s weather depends only on today’s weather, not on yesterday’s or
earlier days. For instance:
• If today is Sunny, the probability of tomorrow being Rainy might be 20%.

• The model doesn’t consider whether it was Rainy or Cloudy two days ago.

9.2 Output Independence (Observation Independence)


Definition: The current observation depends only on the current hidden state, not on previous observations
or states.

Mathematical Formulation:

P (Ot |St , St−1 , . . . , S1 , Ot−1 , . . . , O1 ) = P (Ot |St )

Implications:
• Observations are conditionally independent given the hidden states.
• This allows the model to focus solely on the state-observation relationship without tracking historical
dependencies.

Example: In speech recognition:


• The acoustic features of a phoneme (e.g., ”ah”) depend only on the current phoneme being spoken.
• The sound of ”ah” does not depend on the previous phoneme ”b” or its acoustic features.

9.3 Stationarity (Time-Invariance)


Definition: Transition and emission probabilities do not change over time. The model’s behavior remains
consistent across all time steps.

Mathematical Formulation:

P (St |St−1 ) and P (Ot |St ) are constant ∀t.

Implications:
• The same transition/emission matrices apply at every time step.
• No need to recompute probabilities as time progresses.

Example: In DNA sequencing:


• The probability of a nucleotide transition (e.g., Adenine → Thymine) remains fixed throughout the se-
quence.
• It doesn’t matter if the transition happens at position 10 or position 1000.

9.4 Discrete States and Observations (Standard HMMs)


Definition:
• Hidden states are discrete (e.g., Sunny, Rainy, Cloudy).
• Observations are discrete (e.g., Umbrella, No Umbrella).

Extension: Continuous observations can be handled using Gaussian HMMs, where emission probabilities are
modeled as probability density functions (PDFs).

Example:
• Discrete: A weather HMM with states {Sunny, Rainy} and observations {Umbrella, No Umbrella}.
• Continuous: In speech recognition, acoustic features are continuous and modeled using Gaussian distribu-
tions.

9.5 Fixed Topology


Definition: The structure of possible transitions (e.g., which states can follow others) is preset and does not
change during inference or learning.

Example:
• In a left-to-right HMM (common in speech recognition), states can only transition forward or repeat (e.g.,
S1 → S2 , S2 → S3 , but not S3 → S1 ).
• In a fully connected HMM, any state can transition to any other state (e.g., Sunny → Rainy → Cloudy →
Sunny).

9.6 Why These Assumptions Matter
• Computational Efficiency: Reduces complexity from O(T · N^T) to O(T · N²), making inference feasible.
• Tractability: Enables efficient algorithms like:
– Forward-Backward (for computing state probabilities).
– Viterbi (for finding the most likely state sequence).
– Baum-Welch (for parameter estimation).
• Limitations: May oversimplify real-world scenarios (e.g., long-term dependencies in language or climate).

9.7 Illustrative Example


Scenario: Predicting weather (hidden states) based on umbrella sightings (observations).

• Markov Property: Tomorrow’s weather depends only on today’s weather.


• Output Independence: Seeing an umbrella today doesn’t affect the probability of seeing one tomorrow.
• Stationarity: The chance of rain following sunshine is always 30%, regardless of the day.
• Discrete States/Observations: Weather is {Sunny, Rainy}; observations are {Umbrella, No Umbrella}.
• Fixed Topology: Transitions allowed: Sunny ↔ Rainy.

Figure 2: State transition diagram for the weather HMM. Numbers represent transition probabilities (Sunny→Sunny 0.7, Sunny→Rainy 0.3, Rainy→Sunny 0.4, Rainy→Rainy 0.6).

10 The template of an HMM is characterized by 6 different parts. State and explain each of them.
Hidden Markov Models (HMMs) are defined by six essential components that work together to model systems
where hidden states generate observable outputs. Below we describe each component in detail with mathematical
formulations and practical examples.

10.1 Set of Hidden States (S)


Definition: A finite set of N unobservable states representing the system’s internal conditions:

S = {s1 , s2 , ..., sN }

Key Characteristics:
• Not directly measurable (must be inferred from observations)
• Represent the true underlying situation
• Finite and discrete in standard HMMs

Examples:
• Weather system: {Sunny, Rainy, Cloudy}
• Speech recognition: {Phoneme1 , ..., PhonemeN }
• DNA sequencing: {Adenine, Thymine, Cytosine, Guanine}

10.2 Observation Symbols (V)
Definition: A set of M possible observable outputs:

V = {v1 , v2 , ..., vM }

Key Characteristics:
• Directly measurable evidence
• May have probabilistic relationship to hidden states
• Can be discrete or continuous (in extended models)

Examples:
• Weather observer’s activities: {Umbrella, No Umbrella}
• Speech signals: {Acoustic feature vectors}
• Genomic data: {Measured base pairs}

10.3 Transition Probability Matrix (A)


Definition: An N × N matrix where:

aij = P (qt+1 = sj |qt = si )

Constraints:
Σ_{j=1}^{N} aij = 1,  ∀i ∈ {1, ..., N}

Example (Weather System):


Sunny Rainy
Sunny 0.7 0.3
Rainy 0.4 0.6

Interpretation:
• 70% chance sunny day follows another sunny day
• 30% chance sunny day transitions to rainy day

10.4 Emission Probability Matrix (B)


Definition: An N × M matrix where:

bj (k) = P (ot = vk |qt = sj )

Constraints:
Σ_{k=1}^{M} bj(k) = 1,  ∀j ∈ {1, ..., N}

Example (Weather-Observation System):


Umbrella No Umbrella
Sunny 0.1 0.9
Rainy 0.8 0.2

Interpretation:
• 90% chance no umbrella is seen on sunny days
• 80% chance umbrella is seen on rainy days

10.5 Initial State Distribution (π)
Definition: A vector of starting probabilities:

πi = P(q1 = si)

Constraints:

Σ_{i=1}^{N} πi = 1

Example:
• P (Start Sunny) = 0.6
• P (Start Rainy) = 0.4

10.6 Observation Sequence (O) and Time (T)


Definition: A sequence of T observed symbols:

O = (o1 , o2 , ..., oT ), ot ∈ V

Key Points:
• T can be fixed or variable
• Represents actual measured data
• Used to infer hidden state sequence

Example (5-Day Weather Observation):

Day          1            2         3         4            5
Observation  No Umbrella  Umbrella  Umbrella  No Umbrella  Umbrella

10.7 Complete System Example


Weather Prediction HMM:
• States: Sunny, Rainy
• Observations: Umbrella, No Umbrella
• Transition Matrix: As shown in Section 10.3
• Emission Matrix: As shown in Section 10.4
• Initial Distribution: π = [0.6, 0.4]
• Time: T = 5 days

10.8 Why These Components Matter


Complete System Specification: These six elements fully define an HMM’s behavior:
• Possible internal states (S)
• Observable outputs (V)
• State evolution rules (A)
• State-observation relationships (B)
• Starting conditions (π)
• Time dimension (T)

Figure 3: Visualization of the weather HMM showing states, transitions, and emissions (start: π = [0.6, 0.4]; transitions as in Figure 2; emissions: Sunny→No Umbrella 0.9, Rainy→Umbrella 0.8).

Enables Solving Fundamental Problems:

• Evaluation: Compute P (O|λ) (Forward-Backward algorithm)


• Decoding: Find most likely state sequence (Viterbi algorithm)
• Learning: Estimate λ = (A, B, π) from data (Baum-Welch)

11 Give 5 applications of HMM.


Hidden Markov Models are widely used across diverse domains to analyze sequential data where the underlying
system states are not directly observable. Below we present five major application areas with detailed examples
and illustrations.

11.1 Speech Recognition


Core Application:
• Converting spoken language to text (e.g., virtual assistants like Siri, Alexa)

• Speaker identification and verification systems

HMM Implementation:
• Hidden States: Phonemes (basic speech sounds like /k/, /æ/, /t/ for ”cat”)
• Observations: Acoustic features (frequency bands, MFCC coefficients)
• Transition Matrix: Probability of phoneme sequences (e.g., /h/→/ɛ/→/l/→/oʊ/ for "hello")

• Emission Matrix: Probability of acoustic features given each phoneme

Example Workflow:
1. User speaks the word "weather" (/w/ /ɛ/ /ð/ /ər/)
2. Microphone captures acoustic signals

3. HMM decodes most likely phoneme sequence


4. Maps to text output ”weather”

11.2 DNA Sequence Analysis


Core Application:
• Gene finding in genomic sequences
• Family classification
• Sequence alignment

HMM Implementation:
• Hidden States: Biological features (coding exons, introns, regulatory regions)
• Observations: Nucleotide bases (A,T,C,G)
• Transition Matrix: Probability of genomic region changes (e.g., exon→intron)

• Emission Matrix: Base probabilities in each region type

Example:
• Identifying gene structure: 5’UTR → Exon → Intron → Exon → 3’UTR
• Start codon (ATG) emission probability: 0.95 in exons, 0.01 elsewhere

11.3 Natural Language Processing (NLP)


Core Application:
• Part-of-speech (POS) tagging

• Named entity recognition


• Machine translation

HMM Implementation:
• Hidden States: Grammatical tags (Noun, Verb, Adjective)
• Observations: Words in sentences

• Transition Matrix: Tag sequence probabilities


• Emission Matrix: Word generation probabilities per tag

Example Sentence: “The quick brown fox jumps”


Word The quick brown fox jumps
POS Det Adj Adj Noun Verb

Key Probabilities:
• P(Verb | Noun) = 0.4 (common after subjects)
• P("fox" | Noun) = 0.01 (specific animal)

11.4 Financial Market Analysis


Core Application:
• Market regime detection (bull/bear markets)

• Algorithmic trading signals


• Risk assessment models

HMM Implementation:
• Hidden States: Market conditions (high/low volatility, trending/mean-reverting)
• Observations: Price changes, trading volumes, volatility indices

• Transition Matrix: Probability of market state changes


• Emission Matrix: Observation distributions per market state

11.5 Human Activity Recognition
Core Application:
• Wearable fitness tracking
• Medical rehabilitation monitoring
• Gesture-based interfaces

HMM Implementation:
• Hidden States: Activities (walking, running, sitting)
• Observations: Sensor data (accelerometer, gyroscope)
• Transition Matrix: Activity sequence probabilities
• Emission Matrix: Sensor readings per activity

Example:
• Walking state: Characteristic 2Hz acceleration patterns
• Sitting state: Near-zero acceleration variance
• Transition P(Walking→Running) = 0.3 for fitness scenarios

12 Explain the principle behind calculating the importance of webpages in Google's PageRank (PR) algorithm.
PageRank revolutionized web search by quantifying page importance through link analysis. Below we detail its
core principles, mathematical formulation, and practical implications.

12.1 Core Principle: Links as Votes of Importance


Web as a Voting System:
• Each webpage is a node in a directed graph
• Each hyperlink is a directed edge representing a ”vote”
• Links from authoritative pages carry more weight (like academic citations)

Key Insight:
• A page linked by many important pages becomes important itself
• Importance is recursive and self-reinforcing

Figure 4: Link graph showing page importance propagation. Page A gains higher PageRank from multiple in-links.

12.2 Mathematical Formulation
PageRank Equation:

PR(P) = (1 − d)/N + d × Σ_i [ PR(P_i) / L(P_i) ]

Terms:
• P R(P ): PageRank of page P
• P R(Pi ): PageRank of linking pages Pi
• L(Pi ): Number of outbound links from Pi

• d: Damping factor (typically 0.85)


• N : Total pages in the index

Interpretation:
• (1 − d)/N : Probability of a random jump (teleportation)
• d × Σ_i [ PR(P_i) / L(P_i) ]: Weighted sum of incoming link values

12.3 The Random Surfer Model


Behavior Simulation: A hypothetical user who:
• Follows links (with probability d = 0.85)

• Jumps randomly (with probability 1 − d = 0.15)

Purpose:
• Prevents getting stuck in dead-ends or cycles
• Ensures all pages have non-zero probability
• Models real user browsing behavior

Figure 5: Random surfer’s decision process at each step: follow a link chosen from the current page’s outlinks (probability d = 0.85), or jump to any page in the index (probability 1 − d = 0.15).

12.4 Computation and Convergence
Iterative Process:
1. Initialize all pages with P R = 1/N

2. Repeatedly apply PageRank formula


3. Values converge to stable ranks, typically within about 50 iterations

Example Calculation: For three pages in a cycle A → B → C → A:

PR(A) = 0.15/3 + 0.85 × (PR(C)/1)
PR(B) = 0.15/3 + 0.85 × (PR(A)/1)
PR(C) = 0.15/3 + 0.85 × (PR(B)/1)
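These three coupled equations can be iterated to a fixed point in a few lines. A quick sketch (the cycle structure A → B → C → A is assumed from the equations above):

```python
d, N = 0.85, 3
links_into = {"A": "C", "B": "A", "C": "B"}   # each page has exactly one out-link

pr = {"A": 1.0, "B": 0.0, "C": 0.0}           # deliberately skewed start
for _ in range(50):
    # Apply PR(p) = (1 - d)/N + d * PR(q)/1 simultaneously to all pages
    pr = {p: (1 - d) / N + d * pr[q] for p, q in links_into.items()}
```

By symmetry the cycle converges to PR = 1/3 for every page, and the total PageRank stays 1 at every iteration.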

13 Clearly explain all steps, using an example, in Google’s PR algorithm.
13.1 Overview of PageRank
PageRank quantifies webpage importance by analyzing the web’s link structure, where links serve as votes of
confidence. The algorithm models:

• Web as a Directed Graph:


– Nodes represent webpages
– Directed edges represent hyperlinks

• Random Surfer Model:


– Users follow links with probability d = 0.85
– Random jumps occur with probability 1 − d = 0.15

13.2 Mathematical Formulation


The PageRank of a page P is calculated as:

PR(P) = (1 − d)/N + d × Σ_i [ PR(P_i) / L(P_i) ]
Where:

• P R(Pi ): PageRank of pages linking to P


• L(Pi ): Number of outbound links from Pi
• N : Total number of pages
• d: Damping factor (typically 0.85)

13.3 Example Web Structure


Consider four pages with these links:

13.4 Step-by-Step Calculation


Initialization: All pages start with equal PageRank:

P R(A) = P R(B) = P R(C) = P R(D) = 0.25

Figure 6: Link structure for our example: A→B→C→A with D linking to both A and B

First Iteration:
• Page A:

PR(A) = (1 − 0.85)/4 + 0.85 × (PR(C)/1 + PR(D)/2)
      = 0.0375 + 0.85 × (0.25 + 0.125)
      = 0.0375 + 0.31875 = 0.35625

• Page B:

PR(B) = 0.0375 + 0.85 × (PR(A)/1 + PR(D)/2)
      = 0.0375 + 0.85 × (0.25 + 0.125)
      = 0.35625

• Page C:

PR(C) = 0.0375 + 0.85 × (PR(B)/1)
      = 0.0375 + 0.2125 = 0.25

• Page D:

PR(D) = 0.0375 + 0 (no incoming links)
      = 0.0375

Subsequent Iterations: After 20-30 iterations, values converge to approximately:

Page    Stable PageRank
A       0.32
B       0.33
C       0.31
D       0.04

13.5 Key Observations


• Pages A and B rank highest: A receives C’s full vote plus half of D’s, while B receives A’s full vote plus half of D’s
• Page D ranks lowest (no incoming links, so it keeps only the teleportation share)
• Link Equity Division:
– Page D’s single vote is split equally between A and B
– A, B, and C each pass their full vote along their single out-link
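These observations can be checked by replaying the iteration directly. A small sketch (the out-links are read off Figure 6):

```python
d, N = 0.85, 4
out = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A", "B"]}  # Figure 6 links
pr = {p: 1 / N for p in out}                                 # uniform start

for _ in range(100):
    new = {p: (1 - d) / N for p in out}       # teleportation share for everyone
    for p, targets in out.items():
        for t in targets:
            new[t] += d * pr[p] / len(targets)   # each page splits its vote
    pr = new

# D, with no in-links, keeps only the teleportation share (1 - d)/N = 0.0375.
```

Running this confirms that D stays pinned at the teleportation minimum while the other three pages share the rest.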

Figure 7: Final PageRank distribution among pages

13.6 Algorithm Properties


• Convergence: Guaranteed due to damping factor

• Dead Ends: Handled through random jumps


• Spam Resistance: Difficult to artificially inflate PageRank

13.7 Practical Implications


• SEO Strategy:
– Acquire links from high-PageRank pages
– Prefer links from pages with few outbound links
• Modern Search:

– PageRank remains one of over 200 ranking factors


– Combines with content quality, user signals, etc.

Component Role
Link Graph Models web connectivity
Damping Factor Accounts for random navigation
Iterative Computation Ensures stable importance scores
Normalization Adjusts for link quantity/quality

Table 2: Core components of PageRank calculation

14 What are dangling nodes and irreducible graphs? How are such
cases handled in the PR algorithm?
Two critical challenges in PageRank computation are dangling nodes and irreducible graphs. This section
explains these concepts and their solutions in detail.

14.1 Dangling Nodes


Definition: Pages with no outbound links (e.g., PDFs, images, or dead-end pages). These disrupt PageRank
flow because they don’t distribute their PageRank to other pages.

Figure 8: Example graph A → B → C with dangling node C (no outbound links)

Problems:
• The random surfer gets ”trapped” with no links to follow
• Causes PageRank to ”leak” out of the system

• Breaks the stochastic property of the transition matrix

Solutions:
• Teleportation: Treat dangling nodes as linking to all pages:
M_ij = 1/N for all j if page i is dangling; otherwise M_ij keeps its original value

• Damping Factor: The (1 − d)/N term ensures some PR flows to all pages
• Redistribution: During computation, evenly distribute a dangling node’s PR to all pages
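The teleportation fix amounts to a one-line patch on the transition matrix. A sketch on an assumed 3-page toy graph whose last page is dangling:

```python
N = 3
# Rows = pages A, B, C; entry M[i][j] = probability of moving from page i to j.
# Page C (last row) is dangling: no out-links, so its row is all zeros.
M = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]

for i, row in enumerate(M):
    if sum(row) == 0:                 # dangling row detected
        M[i] = [1.0 / N] * N          # teleport: link to every page uniformly

# Every row now sums to 1, restoring the stochastic property.
```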

14.2 Irreducible Graphs


Definition: A graph is irreducible when every page can reach every other page, i.e., it is strongly connected. The real web violates this: it contains disconnected components with no links between them, which makes the raw link graph reducible (not strongly connected).

Figure 9: Reducible graph with two disconnected components (A–B and C–D)

Problems:
• PageRank cannot flow between disconnected components
• Leads to rank ”sinks” where PR accumulates in isolated groups
• No unique stationary distribution exists

Solutions:
• Damping Factor: Forces connectivity via random jumps:

PR(P) = (1 − d)/N + d × Σ_i [ PR(P_i) / L(P_i) ]

where (1 − d)/N ensures all pages are reachable


• Artificial Links: Add small transition probabilities between all pages

14.3 Combined Approach in PageRank


The standard PageRank algorithm integrates solutions for both issues:

• For Dangling Nodes:


– Add virtual links to all pages
– Implemented via transition matrix adjustments
• For Irreducibility:
– Use damping factor (d = 0.85)
– Guarantees strongly connected graph

Figure 10: Combined solutions applied to a sample graph (dangling-node fixes plus irreducibility fixes)

14.4 Practical Example


Consider a web graph with:
• Component 1: A → B → C (C is dangling)
• Component 2: D → E (disconnected)

Solutions Applied:
• For C (dangling): Add virtual links to A, B, D, E
• For disconnected components: Damping factor ensures 15% PR flows randomly

Result:
• All pages receive some PageRank
• No PR accumulation in isolated components
• Convergence to a unique stationary distribution

15 Prove the convergence of Google’s PageRank(PR) algorithm.


15.1 Introduction to PageRank
PageRank models the web as a directed graph where:
• Nodes represent web pages.
• Edges represent hyperlinks between pages.
The goal is to compute a rank vector π where each entry πi represents the importance of page i.

15.2 The Google Matrix G


The web’s link structure is encoded in the Google matrix G, defined as:
G = dT + (1 − d) × (ee^T)/N,
where:
• T : Transition matrix derived from link structure.
– If page i has L(i) outgoing links, Tij = 1/L(i) if i links to j, else 0.
– Dangling nodes (pages with no links) are treated as linking to all pages uniformly: Tij = 1/N .
• d: Damping factor (typically 0.85), the probability of following a link.
• ee^T : the N × N matrix of all 1’s, where e is the all-ones column vector (ensures uniform jumps when the user doesn’t follow links).
• N : Total number of pages.
Example: For a mini-web with 3 pages where Page 1 links to Page 2, and Page 2 links to Page 3 (Page 3 is dangling, so its row is made uniform):

T = ( 0    1    0
      0    0    1
      1/3  1/3  1/3 ),        G = 0.85 T + 0.15 × (ee^T)/3.

15.3 Key Properties Ensuring Convergence
For G to converge to a unique π, it must satisfy:
1. Stochasticity: Each row of G sums to 1 (valid probability distribution).
2. Irreducibility: Any page can reach any other page (ensured by damping, since (1 − d) ee^T/N adds non-zero transitions).
3. Aperiodicity: No cyclic paths trap the surfer indefinitely (damping breaks cycles).
Perron-Frobenius Theorem: A stochastic, irreducible, and aperiodic matrix has:
• A unique largest eigenvalue λ1 = 1.

• A corresponding eigenvector π with all positive entries (the PageRank vector).

15.4 Power Iteration Method


The PageRank vector π is computed iteratively:

1. Initialize: π (0) = [1/N, . . . , 1/N ].


2. Iterate: π (k+1) = π (k) G.
3. Convergence: Stop when ∥π (k+1) − π (k) ∥ < ϵ.

Why It Works:
• G’s second-largest eigenvalue λ2 satisfies |λ2 | ≤ d, so the error decays at rate O(d^k).
• Example: After 50 iterations with d = 0.85, the error scales as 0.85^50 ≈ 3 × 10^−4.

15.5 Practical Example


Consider a 3-page web:
• Page A links to Page B.
• Page B links to Page C.

• Page C has no links (dangling node).


Transition Matrix T :

T = ( 0    1    0
      0    0    1
      1/3  1/3  1/3 ).

Google Matrix G (with d = 0.85):

G = 0.85 T + 0.05 × ( 1  1  1
                      1  1  1
                      1  1  1 ).
Power Iteration: Starting from π (0) = [1/3, 1/3, 1/3], after 10 iterations, π stabilizes to the ranks of A, B, C.
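The construction above can be run directly. A sketch of the power iteration (same T, d = 0.85, pure Python for clarity):

```python
d, N = 0.85, 3
T = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1/3, 1/3, 1/3]]              # row 3: dangling page C, made uniform

# Google matrix: G = d*T + (1 - d)/N * (all-ones matrix), row-stochastic
G = [[d * T[i][j] + (1 - d) / N for j in range(N)] for i in range(N)]

pi = [1.0 / N] * N                 # pi(0) = [1/3, 1/3, 1/3]
for _ in range(100):               # power iteration: pi(k+1) = pi(k) G
    pi = [sum(pi[i] * G[i][j] for i in range(N)) for j in range(N)]
```

The resulting vector is positive, sums to 1, and ranks C highest here, since both the A→B→C chain and the teleportation mass funnel into it.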

15.6 Conclusion
PageRank converges because:
1. G is stochastic, irreducible, and aperiodic (due to damping).

2. The Perron-Frobenius theorem guarantees a unique π.


3. Power iteration converges geometrically with rate d.

16 Explain the Perceptron Approach used in pattern classifiers using
the perceptron node and threshold logic unit.
The perceptron is a fundamental algorithm in machine learning for binary classification, inspired by biological
neurons. It serves as the building block for artificial neural networks, using a simple computational model to
classify input patterns.

16.1 Perceptron Node and Threshold Logic Unit (TLU)


Perceptron Node:

• Basic computational unit modeled after a biological neuron


• Receives multiple input signals and produces a binary output
Threshold Logic Unit (TLU):

• Decision-making component that applies a step function


• Computes weighted sum of inputs and compares to threshold
Mathematical Model:
f (x) = h(w · x + b)
where:
• x: Input vector

• w: Weight vector
• b: Bias term
• h: Heaviside step function: h(z) = 1 if z > 0, and 0 otherwise

Geometric Interpretation:

• Σ_i w_i x_i + b = 0 defines a hyperplane decision boundary
• Points are classified according to which side of the boundary they lie on

16.2 Perceptron Learning Algorithm


Training Process:
1. Initialization:
• Set weights wi and bias b to small random values or zero
2. Iterative Training: For each training sample (xj , dj ):

• Compute output:
yj = h(w · xj + b)
• Update weights if misclassified:
wi ← wi + η(dj − yj )xj,i
b ← b + η(dj − yj )
where η is learning rate (0 < η ≤ 1)
3. Convergence:
• Repeat until no errors or maximum iterations reached
• Guaranteed to converge for linearly separable data (Perceptron Convergence Theorem)

16.3 Example: Logical AND Function
Truth Table:
x1 x2 d
0 0 0
0 1 0
1 0 0
1 1 1
Learned Decision Boundary:
• Example solution: w1 = 1, w2 = 1, b = −1.5
• Implements the function: x1 ∧ x2
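The training process of 16.2 can be verified on this AND example. A minimal sketch using the {0, 1} convention; η = 1 and the epoch cap are arbitrary choices, and the learned weights are one of many valid solutions (w1 = 1, w2 = 1, b = −1.5 is another):

```python
def step(z):
    return 1 if z > 0 else 0      # threshold logic unit

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
D = [0, 0, 0, 1]                  # AND truth table
w, b, eta = [0, 0], 0, 1

for _ in range(100):              # AND is separable, so PTA converges quickly
    for x, d in zip(X, D):
        y = step(w[0] * x[0] + w[1] * x[1] + b)
        w[0] += eta * (d - y) * x[0]   # update fires only when d != y
        w[1] += eta * (d - y) * x[1]
        b += eta * (d - y)

preds = [step(w[0] * x[0] + w[1] * x[1] + b) for x in X]
```

After convergence the predictions reproduce the truth table exactly.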

16.4 Limitations
1. Linear Separability:
• Cannot solve non-linearly separable problems (e.g., XOR)
2. Single-Layer Limitation:
• Only capable of linear decision boundaries
3. Binary Output:
• No probabilistic interpretation of outputs

17 Draw the figure depicting a basic perceptron model.


17.1 Figure: Basic Perceptron Model
x1 --w1--\
x2 --w2---->  ( Σ )  --z-->  [ step ]  --> y
 ...      /
xn --wn--/

z = Σ_i w_i x_i + b,        y = 1 if z ≥ 0, and 0 if z < 0

17.2 Components Description


• Inputs (x1 , x2 , ..., xn ): Feature vector (e.g., pixel values in an image)
• Weights (w1 , w2 , ..., wn ): Learned parameters (strength of each input connection)
• Bias (b): Shifts the decision boundary away from the origin
• Summing Junction (Σ): Computes the weighted sum z = Σ_{i=1}^{n} w_i x_i + b
• Activation Function (Step Function):
– Outputs y = 1 if z ≥ 0
– Outputs y = 0 (or −1) if z < 0

18 Define the generic objective function for a basic perceptron model, explaining all its variables/parameters.
The Perceptron is a fundamental machine learning algorithm for binary classification. It is a linear classifier
that attempts to find a hyperplane that separates data from two different classes.

18.1 Objective (Loss) Function
The Perceptron loss function penalizes only misclassified points and is defined as:
L(w, b) = Σ_{i=1}^{N} max(0, −y_i (w · x_i + b))

• xi : Feature vector of the ith example


• yi : Label of the ith example (±1)
• w: Weight vector
• b: Bias term

• N : Number of training samples

If a sample is correctly classified, the term inside max is negative, resulting in zero loss. For misclassified
samples, the loss becomes positive, prompting weight updates.

18.2 Learning Rule


The Perceptron updates its weights and bias as follows:

w ← w + η · yi · xi
b ← b + η · yi

• η: Learning rate

This update occurs only when yi (w · xi + b) ≤ 0.

18.3 Example
Assume:
• x1 = [2, 3], y1 = +1

• x2 = [−1, −2], y2 = −1
• Initial w = [0, 0], b = 0, η = 1
Step 1: For x1 , the prediction is incorrect. Update:

w = [2, 3], b=1

Step 2: For x2 , prediction becomes correct. No update needed.
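Both steps can be checked numerically. A small sketch of the loss from 18.1 on this two-point example:

```python
def perceptron_loss(w, b, data):
    # L(w, b) = sum_i max(0, -y_i * (w . x_i + b))
    return sum(max(0.0, -y * (w[0] * x[0] + w[1] * x[1] + b))
               for x, y in data)

data = [((2, 3), +1), ((-1, -2), -1)]

# Before training: both points sit exactly ON the boundary, so the loss is 0
# even though y*(w.x + b) <= 0 still triggers an update (the boundary case).
loss_before = perceptron_loss([0, 0], 0, data)

# After Step 1's update (w = [2, 3], b = 1) both points are classified with
# positive margin, so the loss is again 0 and no further updates occur.
loss_after = perceptron_loss([2, 3], 1, data)
```

This illustrates a subtlety of the Perceptron loss: zero loss does not by itself mean training is done, since boundary points (y·z = 0) still count as misclassified under the update rule.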

19 Explain the steps involved in the Perceptron Training Algorithm (PTA).
The Perceptron Training Algorithm (PTA) is one of the simplest supervised learning algorithms used for
binary classification tasks. It learns a linear decision boundary to separate two classes. The algorithm
updates the weights and bias based on classification errors using a simple rule.

19.1 What is the Perceptron?


A Perceptron is a basic building block of neural networks. It takes multiple input features, applies a weight to
each, adds a bias, and passes the result through an activation function to predict a binary output (like 0 or
1, or -1 and +1).

19.2 Initialization
• Weights (w): Initialize to zeros or small random values.
• Bias (b): Set to zero or a small value.

• Learning Rate (η): A small value like 0.1 or 0.01.


• Maximum Epochs: Define to control convergence.

19.3 The Training Loop (Epochs)


Step 1: For each training sample (x_i, y_i):
Step 2: Compute the activation:

z = w · x_i + b = Σ_{j=1}^{n} w_j x_j^(i) + b

Step 3: Apply the activation function:

ŷ_i = 1 if z ≥ 0, and 0 (or −1) otherwise

Step 4: Update (if misclassified):

• For labels in {0, 1}:

w_j ← w_j + η (y_i − ŷ_i) x_j^(i),        b ← b + η (y_i − ŷ_i)

• For labels in {−1, +1}:

w ← w + η y_i x_i,        b ← b + η y_i

19.4 Convergence Criteria


• All training samples are correctly classified
• Maximum number of epochs reached

• No improvement in error

20 Given a dataset, demonstrate how the Perceptron Training Algorithm (PTA) works starting with a zero-valued weight vector ω, i.e., initially ω = (0, 0, 0)^T. Calculate the ω vector at each iteration until PTA converges, showing all intermediate steps.
20.1 Dataset
We use the following training dataset with four examples:

Example    x1    x2    x3 (bias)    Label (y)
1          1     0     1            +1
2          0     1     1            +1
3          1     1     1            −1
4          0     0     1            −1

20.2 Algorithm Initialization
• Initial weight vector: ω (0) = (0, 0, 0)T
• Learning rate: η = 1
• Activation function: f (z) = sign(z)

20.3 Training Process


20.3.1 Epoch 1
• Example 1: x = (1, 0, 1), y = +1
z =ω·x=0
ŷ = sign(0) = +1 (Correct)
ω remains (0, 0, 0)^T

• Example 2: x = (0, 1, 1), y = +1


z=0
ŷ = +1 (Correct)
ω remains (0, 0, 0)T

• Example 3: x = (1, 1, 1), y = −1


z=0
ŷ = +1 (Misclassified)
ω ← ω + η · y · x = (−1, −1, −1)T

• Example 4: x = (0, 0, 1), y = −1


z = −1
ŷ = −1 (Correct)
ω remains (−1, −1, −1)T

20.3.2 Epoch 2
• Example 1: x = (1, 0, 1), y = +1
z = −2
ŷ = −1 (Misclassified)
ω ← (0, −1, 0)T

• Example 2: x = (0, 1, 1), y = +1


z = −1
ŷ = −1 (Misclassified)
ω ← (0, 0, 1)T

• Example 3: x = (1, 1, 1), y = −1


z=1
ŷ = +1 (Misclassified)
ω ← (−1, −1, 0)T

• Example 4: x = (0, 0, 1), y = −1


z=0
ŷ = +1 (Misclassified)
ω ← (−1, −1, −1)T

20.4 Observations
• The weights return to (−1, −1, −1)^T at the end of every epoch, so the updates of epoch 2 repeat indefinitely
• This cycling indicates the data is not linearly separable (the labels encode XOR on x1, x2)

• The perceptron cannot converge for this dataset

20.5 Key Concepts


• Update Rule: ω ← ω + η · y · x (applied only for misclassified samples)

• Convergence: Guaranteed only for linearly separable data (Perceptron Convergence Theorem)
• Bias Handling: Incorporated through x3 = 1 in the input vector
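The whole trace above can be replayed mechanically, using the same sign(0) = +1 convention as the worked steps:

```python
def sign(z):
    return 1 if z >= 0 else -1    # the worked example's convention: sign(0) = +1

data = [((1, 0, 1), +1), ((0, 1, 1), +1), ((1, 1, 1), -1), ((0, 0, 1), -1)]
w = [0, 0, 0]

history = []
for epoch in range(4):
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x))
        if sign(z) != y:                              # update on mistakes only
            w = [wi + y * xi for wi, xi in zip(w, x)]
    history.append(tuple(w))

# Every epoch ends back at (-1, -1, -1): epoch 2's updates repeat forever
# because this dataset (XOR on x1, x2) is not linearly separable.
```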

20.6 Alternative Example (Linearly Separable Case)


For contrast, consider the AND function dataset which is linearly separable:

x1 x2 y
0 0 -1
0 1 -1
1 0 -1
1 1 +1

PTA would converge to a solution like ω = (1, 1, −1.5)T after a few epochs.

20.7 Conclusion
The demonstration shows:
• How PTA iteratively updates weights
• The importance of linear separability for convergence
• Practical behavior with both separable and non-separable data

21 Describe the intuition behind updating weights in the Perceptron Training Algorithm.
21.1 Core Principle
The Perceptron Training Algorithm (PTA) operates on error-driven learning, adjusting weights only when
misclassifications occur. This simple yet powerful mechanism enables the perceptron to gradually improve its
decision boundary.

21.2 Mathematical Foundation


For each input x with true label y ∈ {−1, +1}:

• Compute activation: z = w · x + b
• Predict: ŷ = sign(z)
• Update rule when ŷ ̸= y:

w ←w+η·y·x
b←b+η·y

Figure 11: Iterative adjustment of decision boundary through weight updates

21.3 Visualization of Learning Process


21.3.1 Case 1: False Negative (y = +1, ŷ = −1)
• Problem: z was too negative

• Solution: Add ηx to w
⇒ Increases z for future similar inputs

21.3.2 Case 2: False Positive (y = −1, ŷ = +1)


• Problem: z was too positive
• Solution: Subtract ηx from w
⇒ Decreases z for future similar inputs

21.4 Learning Rate Dynamics


The learning rate η controls update magnitudes:
Value Effect
Small η (e.g., 0.01) Slow but stable convergence
η = 1 (default) Standard correction per misclassification
Large η (e.g., 10) Fast but risky (may overshoot)

21.5 Bias Term Adjustment


The bias b (equivalent to w0 ) receives special treatment:

• Input is always +1 (the ”dummy feature”)


• Update rule: b ← b + η · y · 1
• Effect: Shifts boundary parallel to itself

21.6 Concrete Example
Consider 2D data with:
• Current weights: w = (1, 1), b = −1.5
• Misclassified point: x = (2, 1), y = −1 (here z = 2 + 1 − 1.5 = 1.5 > 0, so the model wrongly predicts ŷ = +1)
• Update (η = 1):

w ← (1, 1) + 1 · (−1) · (2, 1) = (−1, 0)
b ← −1.5 + 1 · (−1) = −2.5

21.7 Convergence Properties


• Guaranteed for linearly separable data (Perceptron Convergence Theorem)
• Cycling indicates non-separable data (e.g., XOR problem)
• No margin guarantee: Finds any separating boundary, not necessarily optimal

21.8 Limitations and Workarounds


• Non-separable data: Use pocket algorithm (keep best weights)
• No probabilities: Consider logistic regression for confidence scores
• Linear constraints: Multilayer networks for nonlinear boundaries

22 What is/are the assumptions in the Perceptron Training Algorithm?
22.1 Linear Separability
The fundamental assumption is that the data must be linearly separable - there exists a hyperplane that can
perfectly separate the classes.

22.2 Binary Classification Framework


The standard PTA operates strictly in a binary classification setting:
• Label Convention: yi ∈ {−1, +1} (or {0, 1})
• Extension: Multiclass requires one-vs-all or other strategies
• Limitation: No native probabilistic outputs

22.3 Learning Rate Dynamics


The fixed learning rate (η) impacts training:

w ← w + η · yi · x i
Learning Rate Effect
η too large Overshooting, oscillations
η too small Slow convergence
Optimal η Faster convergence

22.4 Data Characteristics


• Noise-Free: Assumes perfect labeling
– Workaround : Pocket Algorithm retains best weights
• Bounded Norm: ∥x∥ ≤ R for some R > 0
• Finite Samples: Practical implementations require finite data

22.5 Architectural Constraints
• Single Layer:
– Limited to linear decision boundaries
– Cannot solve XOR without hidden layers
• Activation Function: f(z) = +1 if z ≥ 0, and −1 otherwise
Non-differentiable nature prevents gradient flow

22.6 Theoretical Guarantees


• Convergence Proof : Finite steps to a solution if the data is separable; the number of updates is bounded by

k ≤ ( R ∥w∗ ∥ / γ )²

where γ is the margin and R bounds the input norm (∥x∥ ≤ R)
• Initialization Invariance: Any initial w works (theoretically)

• When to Use:
– Simple binary classification
– Linearly separable data
– Baseline model before trying complex methods
• When to Avoid:
– Noisy datasets
– Non-linear problems
– Probabilistic outputs needed

23 Explain and differentiate between the roles of the following three activation functions: Sigmoid, ReLU, and tanh.
23.1 Introduction
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. This
section compares three fundamental activation functions: Sigmoid, Hyperbolic Tangent (tanh), and Rectified
Linear Unit (ReLU).

23.2 Sigmoid (Logistic) Activation Function


23.2.1 Mathematical Definition
σ(x) = 1 / (1 + e^(−x))

23.2.2 Characteristics
• Range: (0, 1)
• Shape: Smooth S-curve
• Common Use: Output layer for binary classification

23.2.3 Advantages
• Probabilistic interpretation (output as probability)
• Smooth gradients for backpropagation

23.2.4 Limitations
• Vanishing gradients for extreme inputs
• Not zero-centered (outputs always positive)

• Computationally expensive due to exponentiation

Figure 12: Sigmoid activation function and its derivative

23.3 Hyperbolic Tangent (tanh) Function


23.3.1 Mathematical Definition
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

23.3.2 Characteristics
• Range: (-1, 1)
• Shape: Zero-centered S-curve

• Common Use: Hidden layers in RNNs/LSTMs

23.3.3 Advantages
• Zero-centered outputs improve gradient flow
• Stronger gradients than sigmoid near zero

23.3.4 Limitations
• Still suffers from vanishing gradients

• Computationally expensive

23.4 Rectified Linear Unit (ReLU)


23.4.1 Mathematical Definition
ReLU(x) = max(0, x)

23.4.2 Characteristics
• Range: [0, ∞)
• Shape: Linear for positive inputs, zero otherwise
• Common Use: Default choice for hidden layers

Figure 13: tanh activation function and its derivative

23.4.3 Advantages
• Computationally efficient

• Avoids vanishing gradients for positive inputs


• Promotes sparsity in activations

23.4.4 Limitations
• ”Dying ReLU” problem (neurons can get stuck)

• Not zero-centered

Figure 14: ReLU activation function and its derivative

23.5 Comparative Analysis

Table 3: Feature Comparison of Activation Functions


Feature               Sigmoid               tanh                  ReLU
Output Range          (0, 1)                (−1, 1)               [0, ∞)
Zero-Centered         No                    Yes                   No
Gradient Behavior     Vanishing             Vanishing             Non-vanishing (x > 0)
Computational Cost    High                  High                  Low
Common Use Cases      Output layer          RNNs/LSTMs            Hidden layers
Key Limitation        Vanishing gradients   Vanishing gradients   Dying neurons
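The rows of Table 3 are easy to verify numerically. A small sketch using only the standard library:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

# Ranges: sigmoid stays inside (0, 1), tanh inside (-1, 1), ReLU in [0, inf)
assert 0.0 < sigmoid(-30) and sigmoid(30) < 1.0
assert -1.0 < math.tanh(-5) and math.tanh(5) < 1.0
assert relu(-3.0) == 0.0 and relu(2.5) == 2.5

# Zero-centering: tanh(0) = 0, while sigmoid(0) = 0.5
assert math.tanh(0.0) == 0.0 and sigmoid(0.0) == 0.5

# Vanishing gradient: sigma'(x) = sigma(x) * (1 - sigma(x)) saturates for large |x|
grad_at_10 = sigmoid(10) * (1 - sigmoid(10))
assert grad_at_10 < 1e-4
```

The last check makes the vanishing-gradient row concrete: at x = 10 the sigmoid's derivative is already below 10^−4, while ReLU's derivative there is exactly 1.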

23.6 Selection Guidelines


23.6.1 When to Use Each Function
• Sigmoid: Output layer for binary classification

• tanh: When zero-centered outputs are crucial (e.g., some RNNs)
• ReLU: Default choice for hidden layers in most architectures

23.6.2 Advanced Variants


• Leaky ReLU: LReLU(x) = max(αx, x) (fixes dying ReLU)
• Parametric ReLU: Learns α during training

• Swish: x · σ(βx) (self-gating property)

23.7 Conclusion
• ReLU is generally preferred for hidden layers due to computational efficiency
• Sigmoid remains relevant for probabilistic outputs
• tanh is useful when zero-centered outputs are beneficial
• Modern architectures often use advanced variants to address limitations

