DL Module 2
Deep feedforward networks, often called feedforward neural networks (FNNs) or multilayer
perceptrons (MLPs), are a fundamental class of neural networks in deep learning. Let’s break
down the key aspects of these networks, using the XOR example provided, to illustrate their
significance and functionality.
● Objective: The primary goal of a feedforward network is to approximate some target function f∗. For instance, in classification, the network defines a mapping y = f(x; θ) from an input x to an output y, where θ represents the network's parameters; training adjusts θ so that f(x; θ) best approximates f∗(x).
● Flow of Information: Information flows in one direction: from the input x, through the intermediate layers, to the output y. There are no feedback loops in FNNs, which distinguishes them from recurrent neural networks (RNNs).
● Layers:
○ Input Layer: Receives the input features.
○ Hidden Layers: Layers where intermediate computations are performed.
○ Output Layer: Produces the final output.
● Composition of Functions: FNNs are built by chaining multiple functions together. For example, chaining three functions as f(x) = f(3)(f(2)(f(1)(x))) gives a three-layer network, where f(1) is the first layer, f(2) is the second layer, and f(3) is the output layer; a sketch follows below.
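A minimal sketch of this chaining, assuming NumPy; the layer sizes and random weights here are illustrative placeholders, not trained values:

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)   # layer 1: 2 inputs -> 4 units
    W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)   # layer 2: 4 -> 3 units
    W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)   # layer 3: 3 -> 1 output

    def f1(x): return np.maximum(0, W1 @ x + b1)    # first hidden layer (ReLU)
    def f2(h): return np.maximum(0, W2 @ h + b2)    # second hidden layer (ReLU)
    def f3(h): return W3 @ h + b3                   # output layer (linear)

    def f(x):                                       # f(x) = f3(f2(f1(x)))
        return f3(f2(f1(x)))

    print(f(np.array([1.0, 0.0])))                  # one forward pass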
The XOR Problem
The XOR (exclusive or) function operates on two binary inputs x1 and x2, producing the output 1 when exactly one of the inputs is 1, and 0 otherwise: [0, 0] → 0, [0, 1] → 1, [1, 0] → 1, [1, 1] → 0.
The XOR function is not linearly separable, meaning it cannot be solved using a linear model: no single line in the (x1, x2) plane separates the inputs mapping to 1 from those mapping to 0.
Learning XOR
A simple feedforward neural network is proposed, consisting of two layers: a hidden layer and an
output layer. This allows the network to represent the XOR function by transforming the input space
in a way that makes it linearly separable.
● Input layer: Takes two binary values (like [0, 1], [1, 0], etc.).
● Hidden layer: Contains two units (neurons) that process the input using a non-linear activation function.
● Output layer: The result from the hidden layer is passed to a final output layer, which
predicts the value of the XOR function for that specific input.
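Below is a minimal runnable sketch of this network, assuming NumPy and ReLU hidden units. The weights are one known hand-picked solution; a trained network may find different ones:

    import numpy as np

    W = np.array([[1.0, 1.0],
                  [1.0, 1.0]])       # input -> hidden weights
    c = np.array([0.0, -1.0])        # hidden biases
    w = np.array([1.0, -2.0])        # hidden -> output weights
    b = 0.0                          # output bias

    def xor_net(x):
        h = np.maximum(0, W @ x + c) # hidden layer: affine transform + ReLU
        return w @ h + b             # output layer: linear

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, '->', xor_net(np.array(x, dtype=float)))

This prints 0, 1, 1, 0: the non-linear hidden layer remaps the four inputs so that the two classes become linearly separable for the output layer.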
What is regularization? How does regularization help in reducing overfitting?
When training a machine learning model, we aim to minimize a loss function that measures how
well the model fits the data. Regularization modifies this loss function by adding a penalty term.
This penalty increases as the model's parameters (weights) grow larger or become more complex.
By penalizing large or unnecessary parameters, regularization forces the model to simplify, thereby
reducing overfitting.
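As a minimal sketch of this idea, assuming NumPy, a mean-squared-error loss, and a hypothetical strength parameter lam, an L2 penalty can be added to the loss like so:

    import numpy as np

    def mse_loss(y_pred, y_true):
        return np.mean((y_pred - y_true) ** 2)

    def regularized_loss(y_pred, y_true, weights, lam=0.01):
        # The penalty grows with the squared magnitude of the weights,
        # so the optimizer is pushed toward smaller, simpler parameters.
        penalty = lam * np.sum(weights ** 2)
        return mse_loss(y_pred, y_true) + penalty

Larger values of lam trade training fit for simplicity; lam = 0 recovers the unregularized loss.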
Types of Regularization
Common types include L1 regularization (lasso), which penalizes the sum of absolute weight values; L2 regularization (ridge, or weight decay), which penalizes the sum of squared weights; dropout, which randomly disables units during training; and early stopping, which halts training before the model starts to overfit.
Gradient Descent is an optimization algorithm used in machine learning to minimize the loss
function of a model by iteratively adjusting its parameters (weights and biases). The goal is to find
the values of the parameters that result in the smallest possible loss.
How Gradient Descent Works
1. Initialization:
○ Start with random values for the model's parameters.
2. Compute the Loss:
○ Use the current parameters to calculate the loss (how far the model's predictions are
from the actual values).
3. Calculate the Gradient:
○ Compute the gradient of the loss function with respect to the parameters.
○ The gradient indicates the direction and magnitude of the steepest increase in loss.
4. Update the Parameters:
○ Move the parameters in the opposite direction of the gradient to reduce the loss.
○ The update rule:
θ = θ - α * (∂Loss/∂θ)
■ θ: Model parameters
■ α: Learning rate (controls the step size of the update)
5. Repeat:
○ Iterate through steps 2–4 until the loss converges to a minimum or a stopping
condition is met.
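A minimal sketch of this loop, assuming a single-parameter loss Loss(θ) = (θ - 3)² whose gradient 2(θ - 3) is known in closed form:

    theta = 0.0                       # 1. initialization
    alpha = 0.1                       # learning rate

    for step in range(1000):
        grad = 2 * (theta - 3)        # 3. gradient of the loss at theta
        theta = theta - alpha * grad  # 4. step opposite the gradient
        if abs(grad) < 1e-6:          # 5. stopping condition
            break

    print(theta)                      # converges to ~3, the minimizer

If α is too large, the updates can overshoot and diverge; if too small, convergence is slow.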
Backpropagation (short for backward propagation of errors) is a widely used algorithm in training
neural networks. Its goal is to adjust the weights of the network to minimize the error in predictions.
It achieves this by propagating the error backward through the network, layer by layer, using the
gradient descent optimization technique.
Steps in Backpropagation
1. Initialization:
○ Initialize all the weights and biases in the network randomly or using a specific
method (e.g., Xavier initialization).
2. Forward Pass:
○ Feed the input through the network, layer by layer, to produce a prediction.
3. Compute the Loss:
○ Measure how far the prediction is from the target value using the loss function.
4. Backward Pass:
○ Step 4.1: Compute the Error at the Output Layer:
■ δ = ∂Loss/∂z
Here:
■ δ: Error term of the output layer (depends on the loss function and activation).
■ z: Input to the output layer from the previous layer (the pre-activation value).
○ Step 4.2: Backpropagate to Hidden Layers:
■ For each hidden layer, the error is propagated backward using the chain rule.
■ The gradient of the loss with respect to the weights and biases of each hidden
layer is calculated in a similar way as the output layer, considering the gradient
flow from the next layer.
■ This process is repeated until the gradients for all layers are computed.
5. Update the Weights and Biases:
○ Apply the gradient descent rule to every layer:
■ w = w - α * (∂Loss/∂w)
Here:
■ w: A weight (or bias) of the network.
■ α: Learning rate.
6. Repeat:
○ Repeat the forward and backward passes until the loss converges or a stopping condition is met.
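A minimal sketch of these steps for a 2-2-1 network with sigmoid activations and squared-error loss, assuming NumPy and training on XOR to tie the module together (with so few units, convergence depends on the random initialization):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([[0], [1], [1], [0]], dtype=float)

    W1, b1 = rng.normal(size=(2, 2)), np.zeros((1, 2))  # input -> hidden
    W2, b2 = rng.normal(size=(2, 1)), np.zeros((1, 1))  # hidden -> output
    alpha = 0.5

    for epoch in range(10000):
        # Forward pass
        h = sigmoid(X @ W1 + b1)
        y_hat = sigmoid(h @ W2 + b2)
        # Step 4.1: output-layer error term (squared-error loss, sigmoid)
        delta2 = (y_hat - Y) * y_hat * (1 - y_hat)
        # Step 4.2: propagate the error to the hidden layer via the chain rule
        delta1 = (delta2 @ W2.T) * h * (1 - h)
        # Step 5: gradient descent updates
        W2 -= alpha * (h.T @ delta2); b2 -= alpha * delta2.sum(axis=0, keepdims=True)
        W1 -= alpha * (X.T @ delta1); b1 -= alpha * delta1.sum(axis=0, keepdims=True)

    print(y_hat.round(2))  # should approach [0, 1, 1, 0]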