PR Module 4 QB
April 2025
1.8 When to Avoid Spectral Clustering
• Large Datasets: Computationally expensive due to eigenvalue decomposition (O(n^3) complexity).
• Explicit Cluster Centroids Required: Does not return centroids like K-means.
• Parameter Sensitivity: Performance depends on the similarity metric, graph construction, and number
of clusters (k).
• Adjacency Matrix (A or W): A ∈ Rn×n such that Aij = 1 if there is an edge between node i and node
j, otherwise 0. In a weighted graph, Wij denotes the weight of the edge.
• Degree Matrix (D): A diagonal matrix where Dii = Σ_j Aij or Dii = Σ_j Wij.
Figure 1: The Laplacian matrix of a network. Panel A presents a small undirected network.
• The eigenvectors of L or Lsym help embed the graph into a low-dimensional space for clustering.
• For a function f on the nodes:
(Lf)i = Σ_j Wij (fi − fj)
• It acts like a discrete Laplace operator and captures smoothness over the graph.
• Laplacian Variants: the unnormalized Laplacian L = D − W, the symmetric normalized Laplacian Lsym = D^{-1/2} L D^{-1/2}, and the random-walk Laplacian Lrw = D^{-1} L.
3.0.3 Graph Cut Interpretation
Spectral Clustering minimizes the normalized cut:
Ncut(A1, ..., Ak) = Σ_{i=1}^{k} cut(Ai, Āi) / vol(Ai)
where:
cut(A, B) = Σ_{i∈A, j∈B} Wij,   vol(A) = Σ_{i∈A} Dii
3.0.4 Summary
Step 1: Build the similarity graph W.
Step 2: Compute the Laplacian matrix L, Lsym, or Lrw.
Step 3: Compute the first k eigenvectors of the Laplacian.
Step 4: Form the matrix U ∈ R^{n×k}.
Step 5: Run K-means on the rows of U.
Step 6: Assign the original data points to the resulting clusters.
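A minimal sketch of this pipeline in Python (numpy plus scikit-learn's KMeans); the fully connected Gaussian-kernel graph and the value of σ are illustrative assumptions, not prescribed by the notes above:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Step 1: build a fully connected similarity graph with a Gaussian kernel
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)

    # Step 2: unnormalized Laplacian L = D - W
    D = np.diag(W.sum(axis=1))
    L = D - W

    # Step 3: eigenvectors of the k smallest eigenvalues (eigh sorts ascending)
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                      # Step 4: U in R^{n x k}

    # Steps 5-6: K-means on the rows of U; labels map back to the original points
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```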
• The eigenvectors corresponding to those zero eigenvalues indicate the membership of nodes in each com-
ponent.
L=D−A
Step 4: Compute Eigenvalues and Eigenvectors. Solve the eigenvalue problem:
Lv = λv
4.3 Example
Consider a graph with 2 connected components.
Adjacency Matrix W:
W =
[ 0  1  0  0 ]
[ 1  0  0  0 ]
[ 0  0  0  1 ]
[ 0  0  1  0 ]
Degree Matrix D:
D =
[ 1  0  0  0 ]
[ 0  1  0  0 ]
[ 0  0  1  0 ]
[ 0  0  0  1 ]
Laplacian Matrix L = D − W:
L =
[  1  −1   0   0 ]
[ −1   1   0   0 ]
[  0   0   1  −1 ]
[  0   0  −1   1 ]
Eigenvalues: λ = [0, 0, 2, 2]
Since there are two zero eigenvalues, the graph has two connected components.
Eigenvectors corresponding to λ = 0:
v1 = [1, 1, 0, 0]^T,   v2 = [0, 0, 1, 1]^T
Here, v1 identifies Component 1 (nodes 1 & 2) and v2 identifies Component 2 (nodes 3 & 4).
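This example is easy to check numerically; a short numpy sketch using exactly the matrices above:

```python
import numpy as np

W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))      # degree matrix (here the identity)
L = D - W                       # graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
print(eigvals)                  # approximately [0, 0, 2, 2]
# Any basis of the zero-eigenvalue eigenspace is constant within each
# connected component, i.e. it is spanned by the indicators of {1, 2} and {3, 4}.
print(eigvecs[:, :2])
```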
5.3 Step-by-Step: 2-Way Cut Using Spectral Clustering
5.3.1 Step 1: Construct the Similarity Graph
• Use the Gaussian kernel:
Wij = exp(−||xi − xj||^2 / (2σ^2))
• Unnormalized Laplacian: L = D − W
• Normalized Laplacian: Lsym = D^{-1/2} L D^{-1/2} = I − D^{-1/2} W D^{-1/2}
The 2-way normalized cut to be minimized is:
Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B)
where:
cut(A, B) = Σ_{i∈A, j∈B} Wij,   vol(A) = Σ_{i∈A} Dii
5.6 Summary
5.7 Applications
• Image segmentation
• Graph partitioning
• Social network clustering
• Clustering in high-dimensional data
Step 1: Construct the similarity graph and the Laplacian matrix L = D − W or Lsym.
Step 2: Compute the Fiedler vector v2.
Step 3: Partition the nodes using the sign, a threshold, or the optimal Ncut value.
Step 4: An approximate 2-way cut is obtained.
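A minimal sketch of this 2-way cut (sign-based split of the Fiedler vector; the function name and the symmetric similarity matrix W passed in are assumptions):

```python
import numpy as np

def two_way_cut(W):
    # Unnormalized Laplacian L = D - W
    D = np.diag(W.sum(axis=1))
    L = D - W

    # eigh returns eigenvalues in ascending order; the Fiedler vector is the
    # eigenvector of the second-smallest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]

    # Partition by sign (a simple threshold; optimal-Ncut thresholding is also possible)
    return fiedler >= 0          # boolean cluster labels for the two sides of the cut
```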
• Symmetry: L^T = (D − A)^T = D^T − A^T = D − A = L, so L is symmetric.
Positive Semi-Definiteness: For any x ∈ R^n,
x^T L x = x^T (D − A) x = Σ_{i=1}^{n} di xi^2 − Σ_{i,j} Aij xi xj
Rewriting:
x^T L x = (1/2) Σ_{i,j} Aij (xi − xj)^2 ≥ 0
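The identity can be spot-checked numerically; a quick sketch with a random symmetric 0/1 adjacency matrix and a random vector x:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(6, 6))
A = np.triu(A, 1); A = A + A.T                 # random symmetric 0/1 adjacency, zero diagonal
L = np.diag(A.sum(axis=1)) - A                 # Laplacian L = D - A

x = rng.normal(size=6)
quad = x @ L @ x                               # x^T L x
sum_form = 0.5 * np.sum(A * (x[:, None] - x[None, :]) ** 2)
print(np.isclose(quad, sum_form), quad >= 0)   # True True
```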
7 Explain about Hidden Markov Models (HMM)
A Hidden Markov Model (HMM) is a statistical model that describes systems assumed to follow a Markov
process with hidden states. It is widely used in areas like speech recognition, weather prediction, bioinformatics,
finance, and natural language processing (NLP).
1. Hidden States
Denoted as S = {s1 , s2 , . . . , sN } or Q = {q1 , q2 , . . . , qN }. These are unobservable or latent variables.
Example: In speech recognition, the hidden states might be phonemes.
2. Observations
Denoted as O = {o1 , o2 , . . . , oT } or V = {v1 , v2 , . . . , vM }. These are the visible/measurable outputs.
Example: Acoustic signals or sensor readings.
3. Transition Probabilities
Matrix A = [aij ], where aij = P (sj | si ). This defines the probability of transitioning from state si to sj .
Example: Transition from "sunny" to "rainy".
4. Emission Probabilities
Matrix B = [bj (k)], where bj (k) = P (ok | sj ). This defines the probability of observing ok given that the
system is in state sj .
Example: Observing "umbrella" when the state is "rainy".
5. Initial Probabilities
Vector π = [πi ], where πi = P (si at time t = 1). This defines the probability of the initial state.
Example: The probability that the first day is "sunny".
7.5 Core Problems Solved by HMMs
1. Evaluation Problem
Goal : Compute P (O | λ), the probability of the observation sequence given the model.
Solution: Forward Algorithm or Forward-Backward Algorithm.
2. Decoding Problem
Goal : Identify the most likely sequence of hidden states that could have led to the observed sequence.
Solution: Viterbi Algorithm.
3. Learning Problem
Goal : Estimate the model parameters (A, B, π) that best explain the observed data.
Solution: Baum-Welch Algorithm (a type of Expectation-Maximization).
8.3 Example 2: Weather and Activities
Hidden States: Rainy, Sunny
Observations: Walk, Shop, Clean
Transition Probabilities:
• Rainy: P (Rainy) = 0.7, P (Sunny) = 0.3
• Sunny: P (Rainy) = 0.4, P (Sunny) = 0.6
Emission Probabilities:
• Rainy: P (W alk) = 0.1, P (Shop) = 0.4, P (Clean) = 0.5
• Sunny: P (W alk) = 0.6, P (Shop) = 0.3, P (Clean) = 0.1
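A sketch of the Viterbi decoder for this example. The emission values are those listed above; the start and transition probabilities follow the standard version of this example (the original table is not preserved here), so treat them as assumptions:

```python
import numpy as np

states = ["Rainy", "Sunny"]
obs_symbols = {"Walk": 0, "Shop": 1, "Clean": 2}

pi = np.array([0.6, 0.4])                    # assumed start: P(Rainy), P(Sunny)
A = np.array([[0.7, 0.3],                    # Rainy -> Rainy / Sunny (assumed)
              [0.4, 0.6]])                   # Sunny -> Rainy / Sunny (assumed)
B = np.array([[0.1, 0.4, 0.5],               # Rainy: Walk, Shop, Clean
              [0.6, 0.3, 0.1]])              # Sunny: Walk, Shop, Clean

def viterbi(obs):
    """Most likely hidden-state sequence for the observed activities."""
    o = [obs_symbols[x] for x in obs]
    T, N = len(o), len(states)
    delta = np.zeros((T, N))                 # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)        # back-pointers
    delta[0] = pi * B[:, o[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, o[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(["Walk", "Shop", "Clean"]))    # ['Sunny', 'Rainy', 'Rainy'] with these values
```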
8.6 Summary
• Hidden states model internal dynamics.
• Observations reflect visible behavior.
• Transition and emission probabilities define system behavior.
• State space diagram helps reason about and visualize the HMM.
8.9 Transition Matrix (A)
A =
[ a11  a12 ]
[ a21  a22 ]
8.12 Interpretation
This HMM can be used in tasks such as:
• Predicting the next observation.
• Inferring the most likely sequence of hidden states given a sequence of observations.
• Estimating parameters (aij and bj(k)) from data.
Mathematical Formulation:
P (St |St−1 , St−2 , . . . , S1 ) = P (St |St−1 )
Implications:
Example: In weather prediction, tomorrow’s weather depends only on today’s weather, not on yesterday’s or
earlier days. For instance:
• If today is Sunny, the probability of tomorrow being Rainy might be 20%.
• The model doesn’t consider whether it was Rainy or Cloudy two days ago.
Mathematical Formulation:
P (Ot | S1 , . . . , ST , O1 , . . . , Ot−1 ) = P (Ot | St )
Implications:
• Observations are conditionally independent given the hidden states.
• This allows the model to focus solely on the state-observation relationship without tracking historical
dependencies.
Mathematical Formulation:
P (St+1 = sj | St = si ) = aij for all t (the probabilities do not depend on t)
Implications:
• The same transition/emission matrices apply at every time step.
• No need to recompute probabilities as time progresses.
Extension: Continuous observations can be handled using Gaussian HMMs, where emission probabilities are
modeled as probability density functions (PDFs).
Example:
• Discrete: A weather HMM with states {Sunny, Rainy} and observations {Umbrella, No Umbrella}.
• Continuous: In speech recognition, acoustic features are continuous and modeled using Gaussian distribu-
tions.
Example:
• In a left-to-right HMM (common in speech recognition), states can only transition forward or repeat (e.g.,
S1 → S2 , S2 → S3 , but not S3 → S1 ).
• In a fully connected HMM, any state can transition to any other state (e.g., Sunny → Rainy → Cloudy →
Sunny).
9.6 Why These Assumptions Matter
• Computational Efficiency: Reduces complexity from O(T · N^T) to O(T · N^2), making inference feasible.
• Tractability: Enables efficient algorithms like:
– Forward-Backward (for computing state probabilities).
– Viterbi (for finding the most likely state sequence).
– Baum-Welch (for parameter estimation).
• Limitations: May oversimplify real-world scenarios (e.g., long-term dependencies in language or climate).
Figure 2: State transition diagram for the weather HMM. Numbers represent transition probabilities.
10.1 Hidden States (S)
Definition: A set of N possible hidden states:
S = {s1 , s2 , ..., sN }
Key Characteristics:
• Not directly measurable (must be inferred from observations)
• Represent the true underlying situation
• Finite and discrete in standard HMMs
Examples:
• Weather system: {Sunny, Rainy, Cloudy}
• Speech recognition: {Phoneme1 , ..., PhonemeN }
• DNA sequencing: {Adenine, Thymine, Cytosine, Guanine}
10.2 Observation Symbols (V)
Definition: A set of M possible observable outputs:
V = {v1 , v2 , ..., vM }
Key Characteristics:
• Directly measurable evidence
• May have probabilistic relationship to hidden states
• Can be discrete or continuous (in extended models)
Examples:
• Weather observer’s activities: {Umbrella, No Umbrella}
• Speech signals: {Acoustic feature vectors}
• Genomic data: {Measured base pairs}
Constraints:
Σ_{j=1}^{N} aij = 1   for all i ∈ {1, ..., N}
Interpretation:
• 70% chance sunny day follows another sunny day
• 30% chance sunny day transitions to rainy day
Constraints:
Σ_{k=1}^{M} bj(k) = 1   for all j ∈ {1, ..., N}
Interpretation:
• 90% chance no umbrella is seen on sunny days
• 80% chance umbrella is seen on rainy days
10.5 Initial State Distribution (π)
Definition: A vector of starting probabilities:
πi = P (q1 = si )
Constraints:
Σ_{i=1}^{N} πi = 1
Example:
• P (Start Sunny) = 0.6
• P (Start Rainy) = 0.4
10.6 Observation Sequence (O)
Definition: A sequence of T observations drawn from V:
O = (o1 , o2 , ..., oT ),   ot ∈ V
Key Points:
• T can be fixed or variable
• Represents actual measured data
• Used to infer hidden state sequence
Example observation sequence over five days: Day 1: No Umbrella, Day 2: Umbrella, Day 3: Umbrella, Day 4: No Umbrella, Day 5: Umbrella.
Figure 3: Visualization of the weather HMM showing states, transitions, and emissions.
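A sketch of the Forward Algorithm evaluating P(O | λ) for the five-day umbrella sequence above. The Sunny-row transitions (0.7/0.3), the emissions (0.9/0.1 and 0.2/0.8) and the initial distribution (0.6/0.4) are the values given in this section; the Rainy-row transitions (0.4 to Sunny, 0.6 to Rainy) are assumed from the figure:

```python
import numpy as np

states = ["Sunny", "Rainy"]
obs_symbols = {"No Umbrella": 0, "Umbrella": 1}

pi = np.array([0.6, 0.4])                 # P(start Sunny), P(start Rainy)
A = np.array([[0.7, 0.3],                 # Sunny -> Sunny / Rainy
              [0.4, 0.6]])                # Rainy -> Sunny / Rainy (assumed from the figure)
B = np.array([[0.9, 0.1],                 # Sunny: No Umbrella, Umbrella
              [0.2, 0.8]])                # Rainy: No Umbrella, Umbrella

def forward(obs):
    """Return P(O | lambda) via the forward recursion alpha_t(j)."""
    o = [obs_symbols[x] for x in obs]
    alpha = pi * B[:, o[0]]
    for t in range(1, len(o)):
        alpha = (alpha @ A) * B[:, o[t]]
    return alpha.sum()

O = ["No Umbrella", "Umbrella", "Umbrella", "No Umbrella", "Umbrella"]
print(forward(O))                         # P(O | lambda) is roughly 0.018 with these numbers
```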
HMM Implementation:
• Hidden States: Phonemes (basic speech sounds like /k/, /æ/, /t/ for "cat")
• Observations: Acoustic features (frequency bands, MFCC coefficients)
• Transition Matrix: Probability of phoneme sequences (e.g., /h/→/ɛ/→/l/→/oʊ/ for "hello")
Example Workflow:
1. User speaks the word "weather" (/w/ /ɛ/ /ð/ /ər/)
2. Microphone captures acoustic signals
HMM Implementation:
• Hidden States: Biological features (coding exons, introns, regulatory regions)
• Observations: Nucleotide bases (A,T,C,G)
• Transition Matrix: Probability of genomic region changes (e.g., exon→intron)
Example:
• Identifying gene structure: 5’UTR → Exon → Intron → Exon → 3’UTR
• Start codon (ATG) emission probability: 0.95 in exons, 0.01 elsewhere
HMM Implementation:
• Hidden States: Grammatical tags (Noun, Verb, Adjective)
• Observations: Words in sentences
Key Probabilities:
• P(Verb | Noun) = 0.4 (common after subjects)
• P("fox" | Noun) = 0.01 (specific animal)
HMM Implementation:
• Hidden States: Market conditions (high/low volatility, trending/mean-reverting)
• Observations: Price changes, trading volumes, volatility indices
11.5 5. Human Activity Recognition
Core Application:
• Wearable fitness tracking
• Medical rehabilitation monitoring
• Gesture-based interfaces
HMM Implementation:
• Hidden States: Activities (walking, running, sitting)
• Observations: Sensor data (accelerometer, gyroscope)
• Transition Matrix: Activity sequence probabilities
• Emission Matrix: Sensor readings per activity
Example:
• Walking state: Characteristic 2Hz acceleration patterns
• Sitting state: Near-zero acceleration variance
• Transition P(Walking→Running) = 0.3 for fitness scenarios
Key Insight:
• A page linked by many important pages becomes important itself
• Importance is recursive and self-reinforcing
Figure 4: Link graph showing page importance propagation. Page A gains higher PageRank from multiple in-links.
12.2 Mathematical Formulation
PageRank Equation:
PR(P) = (1 − d)/N + d × Σ_i PR(Pi) / L(Pi)
Terms:
• P R(P ): PageRank of page P
• P R(Pi ): PageRank of linking pages Pi
• L(Pi ): Number of outbound links from Pi
Interpretation:
• (1 − d)/N : Probability of random jump (teleportation)
• d × Σ(·): Weighted sum of incoming link values
Purpose:
• Prevents getting stuck in dead-ends or cycles
• Ensures all pages have non-zero probability
• Models real user browsing behavior
Figure 5: The random surfer model. From the current page, with probability d = 0.85 the surfer follows a link chosen from the page's outlinks; with probability 1 − d = 0.15 it makes a random jump.
12.4 Computation and Convergence
Iterative Process:
1. Initialize all pages with P R = 1/N
2. Iteratively update each page:
PR(P) = (1 − d)/N + d × Σ_i PR(Pi) / L(Pi)
3. Repeat until the PageRank values stop changing significantly.
where PR(Pi) is the PageRank of each page linking to P and L(Pi) is its number of outbound links, as defined above.
Figure 6: Link structure for our example: A→B→C→A, with D linking to both A and B.
First Iteration:
• Page A:
PR(A) = (1 − 0.85)/4 + 0.85 × (PR(C)/1 + PR(D)/2)
      = 0.0375 + 0.85 × (0.25 + 0.125)
      = 0.0375 + 0.31875 = 0.35625
• Page B:
PR(B) = 0.0375 + 0.85 × (PR(A)/1 + PR(D)/2)
      = 0.0375 + 0.85 × (0.25 + 0.125)
      = 0.35625
• Page C:
PR(C) = 0.0375 + 0.85 × (PR(B)/1)
      = 0.0375 + 0.2125 = 0.25
• Page D:
PR(D) = 0.0375 + 0 (no incoming links)
      = 0.0375
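The same numbers can be reproduced, and iterated to convergence, with a short sketch over the link structure of Figure 6 (A→B, B→C, C→A, D→A and D→B):

```python
# Links: page -> pages it links to (the graph of Figure 6)
links = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A", "B"]}
pages = list(links)
d, N = 0.85, len(pages)

pr = {p: 1 / N for p in pages}            # initialise PR = 1/N = 0.25
for it in range(50):
    new_pr = {}
    for p in pages:
        incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
        new_pr[p] = (1 - d) / N + d * incoming
    pr = new_pr
    if it == 0:
        print(pr)  # {'A': 0.35625, 'B': 0.35625, 'C': 0.25, 'D': 0.0375}, as computed above
print(pr)          # values after many iterations; D stays at 0.0375 since nothing links to it
```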
Figure 7: Final PageRank distribution among pages
Component and Role:
• Link Graph: Models web connectivity
• Damping Factor: Accounts for random navigation
• Iterative Computation: Ensures stable importance scores
• Normalization: Adjusts for link quantity/quality
14 What are dangling nodes and irreducible graphs? How are such
cases handled in the PR algorithm?
Two critical challenges in PageRank computation are dangling nodes and irreducible graphs. This section
explains these concepts and their solutions in detail.
(Diagram: pages A, B, and C, where one page is a dangling node with no outgoing links.)
Problems:
• The random surfer gets "trapped" with no links to follow
• Causes PageRank to "leak" out of the system
Solutions:
• Teleportation: Treat dangling nodes as linking to all pages:
Mij = 1/N if page i is dangling; otherwise Mij keeps its original value
• Damping Factor: The (1 − d)/N term ensures some PR flows to all pages
• Redistribution: During computation, evenly distribute a dangling node’s PR to all pages
(Diagram: two disconnected components, Component 1 containing pages A and B, Component 2 containing pages C and D.)
Problems:
• PageRank cannot flow between disconnected components
• Leads to rank "sinks" where PR accumulates in isolated groups
• No unique stationary distribution exists
Solutions:
• Damping Factor: Forces connectivity via random jumps:
PR(P) = (1 − d)/N + d × Σ PR(Pi) / L(Pi)
(Example graph with pages A, B, C, D, and E, where C is a dangling node.)
Solutions Applied:
• For C (dangling): Add virtual links to A, B, D, E
• For disconnected components: Damping factor ensures 15% PR flows randomly
Result:
• All pages receive some PageRank
• No PR accumulation in isolated components
• Convergence to a unique stationary distribution
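A sketch of how both fixes are typically folded into a single "Google matrix" G (row-stochastic convention; the concrete five-page link structure below is only an illustration, not the exact graph of the figure):

```python
import numpy as np

d, N = 0.85, 5                       # pages A..E (illustrative structure)
# Row-stochastic link matrix M: M[i, j] = 1/outdegree(i) if page i links to page j
M = np.zeros((N, N))
M[0, 1] = 1.0                        # A -> B
M[1, 2] = 1.0                        # B -> C
                                     # C has no out-links: its row stays all zero (dangling)
M[3, [0, 1]] = 0.5                   # D -> A, B
M[4, 3] = 1.0                        # E -> D

# Dangling fix: treat a page with no out-links as linking to every page
dangling = M.sum(axis=1) == 0
M[dangling] = 1.0 / N

# Teleportation: G = d*M + (1-d)/N * (all-ones matrix)
G = d * M + (1 - d) / N * np.ones((N, N))

# Power iteration converges to the unique stationary distribution
pr = np.full(N, 1.0 / N)
for _ in range(100):
    pr = pr @ G
print(pr, pr.sum())                  # PageRank vector; the entries sum to 1
```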
15.3 Key Properties Ensuring Convergence
For G to converge to a unique π, it must satisfy:
1. Stochasticity: Each row of G sums to 1 (valid probability distribution).
2. Irreducibility: Any page can reach any other page (ensured by damping, since the (1 − d)/N · ee^T term, with e the all-ones vector, adds non-zero transitions).
3. Aperiodicity: No cyclic paths trap the surfer indefinitely (damping breaks cycles).
Perron-Frobenius Theorem: A stochastic, irreducible, and aperiodic matrix has:
• A unique largest eigenvalue λ1 = 1.
Why It Works:
• G's second-largest eigenvalue λ2 satisfies |λ2| ≤ d, so convergence is at rate O(d^k).
• Example: After 50 iterations with d = 0.85, the error scales as 0.85^50 ≈ 3 × 10^−4.
15.6 Conclusion
PageRank converges because:
1. G is stochastic, irreducible, and aperiodic (due to damping).
16 Explain the Perceptron Approach used in pattern classifiers using
the perceptron node and threshold logic unit.
The perceptron is a fundamental algorithm in machine learning for binary classification, inspired by biological
neurons. It serves as the building block for artificial neural networks, using a simple computational model to
classify input patterns.
• w: Weight vector
• b: Bias term
• h: Heaviside step function: h(z) = 1 if z > 0, and 0 otherwise
Geometric Interpretation:
• Σ_i wi xi + b = 0 defines a hyperplane decision boundary
• Compute output:
yj = h(w · xj + b)
• Update weights if misclassified:
wi ← wi + η(dj − yj )xj,i
b ← b + η(dj − yj )
where η is the learning rate (0 < η ≤ 1)
3. Convergence:
• Repeat until no errors or maximum iterations reached
• Guaranteed to converge for linearly separable data (Perceptron Convergence Theorem)
16.3 Example: Logical AND Function
Truth Table:
x1 x2 d
0 0 0
0 1 0
1 0 0
1 1 1
Learned Decision Boundary:
• Example solution: w1 = 1, w2 = 1, b = −1.5
• Implements the function: x1 ∧ x2
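A sketch of the perceptron learning rule applied to this AND data (Heaviside activation as defined above; η = 1 and zero initialisation are assumptions):

```python
import numpy as np

# AND truth table: inputs and desired outputs d
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
D = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
eta = 1.0

def h(z):                              # Heaviside step function
    return 1 if z > 0 else 0

for epoch in range(20):
    errors = 0
    for x, d_j in zip(X, D):
        y = h(w @ x + b)
        if y != d_j:                   # update only on misclassification
            w += eta * (d_j - y) * x
            b += eta * (d_j - y)
            errors += 1
    if errors == 0:                    # converged: all points classified correctly
        break

print(w, b)                            # converges to w = [2. 1.], b = -2.0 with these settings
print([h(w @ x + b) for x in X])       # [0, 0, 0, 1]
```

Any separating solution works here; the learned (2, 1, −2) and the example (1, 1, −1.5) both implement x1 ∧ x2.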
16.4 Limitations
1. Linear Separability:
• Cannot solve non-linearly separable problems (e.g., XOR)
2. Single-Layer Limitation:
• Only capable of linear decision boundaries
3. Binary Output:
• No probabilistic interpretation of outputs
18.1 Objective (Loss) Function
The Perceptron loss function penalizes only misclassified points and is defined as:
L(w, b) = Σ_{i=1}^{N} max(0, −yi (w · xi + b))
If a sample is correctly classified, the term inside max is negative, resulting in zero loss. For misclassified
samples, the loss becomes positive, prompting weight updates.
w ← w + η · yi · xi
b ← b + η · yi
• η: Learning rate
18.3 Example
Assume:
• x1 = [2, 3], y1 = +1
• x2 = [−1, −2], y2 = −1
• Initial w = [0, 0], b = 0, η = 1
Step 1: For x1, the prediction is incorrect. Update:
w ← [0, 0] + 1 · (+1) · [2, 3] = [2, 3]
b ← 0 + 1 · (+1) = 1
Step 2: For x2 with the updated parameters, w · x2 + b = −2 − 6 + 1 = −7 < 0, so the prediction matches y2 = −1 and no update is needed.
19.2 Initialization
• Weights (w): Initialize to zeros or small random values.
• Bias (b): Set to zero or a small value.
• No improvement in error
Example   x1   x2   x3 (bias)   Label (y)
1         1    0    1           +1
2         0    1    1           +1
3         1    1    1           −1
4         0    0    1           −1
20.2 Algorithm Initialization
• Initial weight vector: ω (0) = (0, 0, 0)T
• Learning rate: η = 1
• Activation function: f (z) = sign(z)
20.3.2 Epoch 2
• Example 1: x = (1, 0, 1), y = +1
z = −2
ŷ = −1 (Misclassified)
ω ← (0, −1, 0)T
20.4 Observations
• The weights enter a cycle between (−1, −1, −1)T and (0, 0, 0)T
• This indicates the data is not linearly separable
• Convergence: Guaranteed only for linearly separable data (Perceptron Convergence Theorem)
• Bias Handling: Incorporated through x3 = 1 in the input vector
x1 x2 y
0 0 -1
0 1 -1
1 0 -1
1 1 +1
PTA would converge to a solution like ω = (1, 1, −1.5)T after a few epochs.
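A short sketch of PTA on this AND data (bias folded in as x3 = 1, η = 1, ω(0) = 0, and sign(0) taken as +1) confirms that it converges to a separating weight vector after a few epochs:

```python
import numpy as np

# AND data with the bias folded in as x3 = 1 and targets in {-1, +1}
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
Y = np.array([-1, -1, -1, +1])

w = np.zeros(3)                       # omega^(0) = (0, 0, 0)^T
eta = 1.0

def predict(z):                       # sign activation, with sign(0) taken as +1
    return 1 if z >= 0 else -1

for epoch in range(50):
    mistakes = 0
    for x, y in zip(X, Y):
        if predict(w @ x) != y:       # misclassified: omega <- omega + eta * y * x
            w += eta * y * x
            mistakes += 1
    if mistakes == 0:                 # a full error-free pass: converged
        break

print(epoch, w)                       # converges after a few epochs, to w = [2, 1, -3] here
```

The learned vector is not exactly (1, 1, −1.5), but like it, it separates the four points correctly.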
20.7 Conclusion
The demonstration shows:
• How PTA iteratively updates weights
• The importance of linear separability for convergence
• Practical behavior with both separable and non-separable data
• Compute activation: z = w · x + b
• Predict: ŷ = sign(z)
• Update rule when ŷ ̸= y:
w ←w+η·y·x
b←b+η·y
Figure 11: Iterative adjustment of decision boundary through weight updates
• Solution: Add ηx to w
⇒ Increases z for future similar inputs
21.6 Concrete Example
Consider 2D data with:
• Current weights: w = (1, 1), b = −1.5
• Misclassified point: x = (2, 1), y = +1
• Update (η = 1):
w ← (1, 1) + 1 · (+1) · (2, 1) = (3, 2)
b ← −1.5 + 1 · (+1) = −0.5
w ← w + η · yi · xi
Effect of the learning rate:
• η too large: Overshooting, oscillations
• η too small: Slow convergence
• Optimal η: Faster convergence
22.5 Architectural Constraints
• Single Layer:
– Limited to linear decision boundaries
– Cannot solve XOR without hidden layers
• Activation Function: f(z) = +1 if z ≥ 0, and −1 otherwise
Non-differentiable nature prevents gradient flow
• When to Use:
– Simple binary classification
– Linearly separable data
– Baseline model before trying complex methods
• When to Avoid:
– Noisy datasets
– Non-linear problems
– Probabilistic outputs needed
23.2.2 Characteristics
• Range: (0, 1)
• Shape: Smooth S-curve
• Common Use: Output layer for binary classification
23.2.3 Advantages
• Probabilistic interpretation (output as probability)
• Smooth gradients for backpropagation
23.2.4 Limitations
• Vanishing gradients for extreme inputs
• Not zero-centered (outputs always positive)
23.3.2 Characteristics
• Range: (-1, 1)
• Shape: Zero-centered S-curve
23.3.3 Advantages
• Zero-centered outputs improve gradient flow
• Stronger gradients than sigmoid near zero
23.3.4 Limitations
• Still suffers from vanishing gradients
• Computationally expensive
23.4.2 Characteristics
• Range: [0, ∞)
• Shape: Linear for positive inputs, zero otherwise
• Common Use: Default choice for hidden layers
Figure 13: tanh activation function and its derivative
23.4.3 Advantages
• Computationally efficient
23.4.4 Limitations
• "Dying ReLU" problem (neurons can get stuck)
• Not zero-centered
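For reference, a small numpy sketch of the three activations and their derivatives (taking the ReLU derivative at 0 to be 0 by convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # range (0, 1)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)                        # vanishes for large |z|

def tanh(z):
    return np.tanh(z)                         # range (-1, 1), zero-centered

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2                # also vanishes for large |z|

def relu(z):
    return np.maximum(0, z)                   # range [0, inf)

def relu_grad(z):
    return (z > 0).astype(float)              # 0 for z <= 0 (the "dying ReLU" region)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```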
• tanh: When zero-centered outputs are crucial (e.g., some RNNs)
• ReLU: Default choice for hidden layers in most architectures
23.7 Conclusion
• ReLU is generally preferred for hidden layers due to computational efficiency
• Sigmoid remains relevant for probabilistic outputs
• tanh is useful when zero-centered outputs are beneficial
• Modern architectures often use advanced variants to address limitations