Unit I
o The neuron receives a set of binary inputs. Each input is associated with a
weight, which determines the strength of that input. The neuron computes a
weighted sum of these inputs.
o The McCulloch-Pitts neuron is a very basic model and does not incorporate
complex features like learning, dynamic weight adjustments, or multiple
layers (as in modern neural networks). It only considers linear combinations
of inputs and applies a step function to determine the output.
f(S) = \begin{cases} 1 & \text{if } S \geq \theta \\ 0 & \text{if } S < \theta \end{cases}
Where θ is the threshold. If the sum of the weighted inputs reaches or exceeds the
threshold, the neuron fires (output = 1); otherwise, it remains inactive (output = 0).
o Binary Outputs: The binary output also limits the applicability of the
McCulloch-Pitts neuron to problems where binary classification is sufficient.
Logic Gates Using McCulloch-Pitts Neurons
1. AND Operation:
o For an AND gate, the weights and threshold are set so that the weighted sum
reaches the threshold only when both inputs are 1 (see the code sketch after
this list).
2. OR Operation:
o For an OR gate, the threshold is set so that the weighted sum reaches it
whenever at least one of the inputs is 1.
o Example: If x_1 = 1 or x_2 = 1, the output is 1.
3. NOT Operation:
o A NOT gate can be modeled by setting the threshold such that the output is the
opposite of the input.
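The three gates above can each be reproduced with a single McCulloch-Pitts neuron. The
following is a minimal NumPy sketch; the particular weights and thresholds shown are one
possible choice, not the only one.

import numpy as np

def mcp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: fires (returns 1) if the weighted sum
    of binary inputs reaches the threshold, otherwise returns 0."""
    s = np.dot(inputs, weights)
    return 1 if s >= threshold else 0

# AND gate: both inputs must be 1 for the sum to reach the threshold of 2.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print("AND", (x1, x2), mcp_neuron([x1, x2], weights=[1, 1], threshold=2))

# OR gate: a single active input is enough to reach the threshold of 1.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print("OR", (x1, x2), mcp_neuron([x1, x2], weights=[1, 1], threshold=1))

# NOT gate: an inhibitory weight of -1 with threshold 0 inverts the input.
for x in [0, 1]:
    print("NOT", x, mcp_neuron([x], weights=[-1], threshold=0))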
Multilayer Perceptrons (MLPs)
1. Input Layer: This layer takes in the raw data as input. The input layer is typically a
vector, where each element of the vector represents a feature of the data.
2. Hidden Layers: These are the intermediate layers between the input and output
layers. There can be one or more hidden layers in an MLP, and they are responsible
for learning representations of the input data. Each hidden layer contains multiple
neurons that apply a non-linear activation function to the weighted sum of their
inputs.
3. Output Layer: The output layer produces the final prediction or classification result.
For a classification task, this might be the probabilities of belonging to each class,
while for a regression task, it might be a continuous value.
Each layer of an MLP is made up of neurons (also called units), and each neuron is connected
to neurons in the subsequent layer by weighted connections.
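As a concrete illustration, the three-layer structure described above can be written compactly
in a framework such as PyTorch. This is a minimal sketch with arbitrarily chosen layer sizes,
not a prescribed architecture.

import torch
import torch.nn as nn

# A small MLP: 4 input features -> 16 hidden units -> 3 output classes.
model = nn.Sequential(
    nn.Linear(4, 16),   # input layer -> hidden layer (weighted connections)
    nn.ReLU(),          # non-linear activation applied in the hidden layer
    nn.Linear(16, 3),   # hidden layer -> output layer
)

x = torch.randn(8, 4)   # a batch of 8 samples with 4 features each
logits = model(x)       # forward pass; output has shape (8, 3)
print(logits.shape)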
Training an MLP
The training of an MLP involves the following steps:
1. Forward Pass: Compute the output of the network for a given input.
2. Loss Calculation: Compute the difference between the predicted output and the
actual output (using a loss function).
3. Backpropagation: Calculate the gradients of the loss with respect to each weight by
applying the chain rule.
4. Weight Update: Use an optimization algorithm (such as gradient descent) to update
the weights in the direction that minimizes the loss.
5. Iteration: Repeat the process for many iterations (epochs) to progressively reduce the
loss and improve the model's predictions.
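These five steps correspond directly to a standard training loop. The following is a minimal
PyTorch sketch using randomly generated placeholder data; the layer sizes, learning rate, and
epoch count are arbitrary example values.

import torch
import torch.nn as nn

# Hypothetical data: 100 samples, 4 features, 3 classes.
X = torch.randn(100, 4)
y = torch.randint(0, 3, (100,))

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(50):            # 5. Iteration over epochs
    logits = model(X)              # 1. Forward pass
    loss = loss_fn(logits, y)      # 2. Loss calculation
    optimizer.zero_grad()
    loss.backward()                # 3. Backpropagation (gradients via the chain rule)
    optimizer.step()               # 4. Weight update (gradient descent)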
Activation Functions in MLPs
Sigmoid: Often used in binary classification tasks. It squashes the input to a range
between 0 and 1.
Tanh: Like sigmoid but with a range of [-1, 1], often used in hidden layers.
ReLU: A very popular activation function due to its simplicity and efficiency. It
outputs the input if it’s positive, and zero otherwise.
Leaky ReLU: A variant of ReLU that allows small negative values for inputs less
than zero, helping to avoid dead neurons.
Softmax: Typically used in the output layer for multi-class classification tasks. It
converts the raw output into probabilities that sum to 1.
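The activation functions listed above can be written in a few lines of NumPy; a minimal
sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                      # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # passes positives, zeros out negatives

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope for negative inputs

def softmax(x):
    e = np.exp(x - np.max(x))              # shift for numerical stability
    return e / e.sum()                     # probabilities that sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))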
Advantages of MLPs
1. Non-linear Mapping:
o MLPs can model highly non-linear relationships between inputs and outputs,
which makes them more powerful than simple linear models.
2. Flexibility:
o MLPs can be applied to a wide range of tasks, including classification,
regression, function approximation, and more.
3. Learning Complex Patterns:
o With multiple hidden layers and non-linear activations, MLPs can learn
complex, hierarchical patterns in the data.
Limitations of MLPs
1. Overfitting:
o MLPs have a large number of parameters, which can lead to overfitting if the
training data is not sufficiently large or diverse. Regularization techniques like
dropout, L2 regularization, and data augmentation are used to prevent
overfitting.
2. Training Complexity:
o Training deep MLPs can be computationally expensive, and convergence can
be sensitive to hyperparameters such as the learning rate and network size.
3. Vanishing and Exploding Gradients:
o In deep networks, gradients can either become too small (vanishing gradient)
or too large (exploding gradient), which can slow down training or lead to
instability. Techniques like batch normalization and careful weight
initialization can help mitigate these problems.
4. Need for Large Datasets:
o MLPs generally require large amounts of data to effectively train the network
and avoid overfitting. Insufficient data can lead to poor generalization.
Applications of MLPs
1. Classification:
o MLPs are widely used for classification tasks, such as image recognition,
sentiment analysis, spam detection, and more.
2. Regression:
o MLPs can predict continuous values, such as house prices, stock prices, and
other real-valued outputs.
3. Function Approximation:
o MLPs can approximate any continuous function, making them suitable for
solving problems in control systems, robotic motion, and more.
4. Time Series Prediction:
o With appropriate modifications (like recurrent layers), MLPs can be applied to
time series forecasting and sequential data.
Representation Power of MLPs
According to the universal approximation theorem, an MLP with at least one hidden layer
and a non-linear activation function can approximate any continuous function on a compact
input domain to arbitrary accuracy, given enough hidden neurons. This is what gives MLPs
their broad applicability as function approximators.
Sigmoid Neurons
A Sigmoid Neuron is a type of artificial neuron in a neural network that uses the sigmoid
activation function to introduce non-linearity into the network. Sigmoid neurons are
commonly used in traditional neural networks, especially in the earlier stages of deep
learning research.
Advantages of Sigmoid Neurons
1. Non-linearity:
o The sigmoid function introduces non-linearity, allowing the network to learn
relationships that a purely linear model cannot.
2. Probabilistic Interpretation:
o The output of a sigmoid neuron is always between 0 and 1, making it ideal for
tasks where the output needs to be interpreted as a probability (e.g., binary
classification).
3. Smooth Gradient:
o The sigmoid function has a smooth gradient, which helps with the
optimization of weights during training (e.g., gradient descent).
4. Differentiability:
o The sigmoid function is differentiable everywhere, which makes it compatible
with gradient-based training methods such as backpropagation.
Disadvantages of Sigmoid Neurons
1. Vanishing Gradients:
o The sigmoid function saturates at both ends of its range, meaning that for very
large or very small values of x, the gradient (derivative) approaches zero.
This leads to a problem known as vanishing gradients, which can hinder the
learning process in deep networks. In the regions where the gradient is very
small, weight updates become very small, slowing down or even halting
learning (illustrated in the code sketch after this list).
2. Non-zero Centered Output:
o The output of the sigmoid function is not zero-centered. This means that for
large positive inputs, the output is close to 1, and for large negative inputs, the
output is close to 0. This can lead to inefficient gradient updates during
training, as the gradients are not balanced around zero.
3. Slow Convergence:
o Due to the squashing nature of the sigmoid function, learning can be slower
compared to other activation functions like ReLU (Rectified Linear Unit),
especially when deep networks are used.
4. Limited Range:
o The sigmoid function’s output range is between 0 and 1, which can sometimes
be too restrictive for some tasks, especially when a larger output range is
needed.
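The saturation behind the vanishing-gradient problem can be seen directly from the
derivative of the sigmoid, σ'(x) = σ(x)(1 - σ(x)). A small NumPy illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # maximum value is 0.25, reached at x = 0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid = {sigmoid(x):.4f}  gradient = {sigmoid_grad(x):.6f}")
# The gradient shrinks toward zero for large |x|, which is the vanishing
# gradient effect described above.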
Applications of Sigmoid Neurons
o Sigmoid neurons are often used in the output layer of neural networks for
binary classification tasks, where the network outputs a value between 0 and 1,
representing the probability of belonging to one of the two classes.
Feedforward Neural Networks
A feedforward neural network passes information in one direction, from the input layer
through the hidden layers to the output layer, with no cycles. Its structure is as follows:
1. Input Layer:
o This layer receives the raw input features and passes them on to the rest of the
network.
2. Hidden Layers:
o These layers consist of neurons that process the inputs received from the
previous layer. A network can have one or more hidden layers, each
containing multiple neurons. The more hidden layers, the deeper the network,
making it capable of learning more complex representations of the data.
3. Output Layer:
o This is the final layer that produces the network’s output. The output layer’s
size depends on the task (e.g., one neuron for binary classification, multiple
neurons for multi-class classification or regression).
4. Activation Function:
o Each neuron in the network (except the input neurons) has an activation
function, which introduces non-linearity into the network, enabling it to learn
complex patterns.
o Common activation functions include:
ReLU (Rectified Linear Unit): f(x) = \max(0, x)
Sigmoid: f(x) = \frac{1}{1 + e^{-x}}
Tanh: f(x) = \frac{2}{1 + e^{-2x}} - 1
Applications of Feedforward Neural Networks
Time Series Prediction:
o Using past data to predict future values, although more complex models like
RNNs or LSTMs are typically used for sequential data.
Advantages and Limitations of Feedforward Neural Networks
Universal approximator: MLPs can approximate any continuous function, as per the
universal approximation theorem.
Flexible: Can handle a variety of tasks, from regression to classification.
Requires Large Datasets: They often require large datasets to achieve high
performance.
Feed forward neural networks are the foundation for many deep learning models and serve as
the basis for more complex neural network architectures, such as Convolutional Neural
Networks (CNNs) and Recurrent Neural Networks (RNNs).
Backpropagation
Backpropagation (short for backward propagation of errors) is the cornerstone of learning
in neural networks. It is the process by which a neural network adjusts its weights to
minimize the error (or loss) in its predictions. Backpropagation uses the chain rule of calculus
to compute the gradient of the loss function with respect to each weight by propagating the
error backward through the network.
Key Steps in Backpropagation:
1. Forward Pass:
o Input is processed through the network to generate output.
2. Loss Function:
o Compute the error between the predicted and actual output.
3. Backward Pass:
o Compute the gradient of the loss function with respect to each weight in the
network using the chain rule.
4. Gradient Descent:
o Update the weights using the computed gradients to minimize the loss
function.
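To make the chain rule concrete, the following is a minimal NumPy sketch of
backpropagation for a tiny one-hidden-layer network with a sigmoid hidden activation and
mean squared error loss; the sizes, data, and learning rate are arbitrary example values.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))            # 10 samples, 3 features
y = rng.normal(size=(10, 1))            # 10 target values

W1 = rng.normal(size=(3, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(100):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    y_hat = a1 @ W2 + b2                 # linear output layer
    loss = np.mean((y_hat - y) ** 2)     # MSE loss

    # Backward pass (chain rule)
    d_yhat = 2 * (y_hat - y) / len(X)    # dL/dy_hat
    dW2 = a1.T @ d_yhat                  # dL/dW2
    db2 = d_yhat.sum(axis=0, keepdims=True)
    d_a1 = d_yhat @ W2.T                 # propagate the error to the hidden layer
    d_z1 = d_a1 * a1 * (1 - a1)          # through the sigmoid derivative
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0, keepdims=True)

    # Gradient descent weight update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if step % 20 == 0:
        print(f"step {step}: loss {loss:.4f}")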
Optimizers in Backpropagation
Backpropagation is often combined with gradient-based optimization algorithms to
improve training. Some of the commonly used optimizers are:
1. Stochastic Gradient Descent (SGD): Updates the weights after every batch or
sample, making the updates more noisy but potentially faster.
2. Momentum: A variation of SGD that uses a moving average of past gradients to
smooth out updates and help escape local minima.
3. Adam (Adaptive Moment Estimation): Combines momentum and adaptive learning
rates, adjusting the learning rate for each parameter individually.
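The three update rules can be sketched as plain NumPy functions operating on a parameter
vector w and its gradient; the hyperparameter values shown are common defaults, not
requirements.

import numpy as np

def sgd_update(w, grad, lr=0.01):
    return w - lr * grad

def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + grad          # accumulate past gradients
    return w - lr * velocity, velocity

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad         # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment (adaptive scaling)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v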
Challenges in Backpropagation
1. Vanishing Gradients:
o When gradients become very small during the backward pass (especially in
deep networks), the weights stop updating properly, and learning slows down
or stagnates. This is particularly a problem with activation functions like
sigmoid or tanh.
2. Exploding Gradients:
o When gradients become excessively large, the weights can grow very large,
leading to instability in training.
3. Overfitting:
o Backpropagation might overfit the model to the training data if the model is
too complex or the training process is not regularized.
Batch Normalization
Batch Normalization (BatchNorm) is a technique used to improve the training of deep
neural networks by normalizing the inputs to each layer. It helps to stabilize and accelerate
training, reduce the sensitivity to hyperparameters like the learning rate, and mitigate issues
like vanishing and exploding gradients.
Benefits of Batch Normalization
1. Stabilizes and Speeds Up Training:
o Normalizing the input to each layer stabilizes the training process, allowing
the network to train faster.
2. Reduces Internal Covariate Shift:
o By normalizing layer inputs, BatchNorm reduces the change in the distribution
of each layer's inputs that occurs as the parameters of earlier layers are
updated during training.
3. Regularization Effect:
o Adds noise to the training process (due to mini-batch statistics), which helps to
prevent overfitting, reducing the need for other regularization methods.
4. Better Performance:
o BatchNorm can often lead to better model performance by allowing the use of
deeper networks and faster convergence.
Where Batch Normalization Is Applied
After Fully Connected Layers:
o In most cases, BatchNorm is applied after fully connected layers (before the
activation function).
After Convolutional Layers:
o BatchNorm is also applied after convolutional layers, where the mean and
variance are computed per channel over the batch and spatial dimensions.
Limitations of Batch Normalization
1. Dependence on Mini-batch Size:
o Because the normalization statistics are computed per mini-batch, very small
batches give noisy estimates of the mean and variance, which can hurt
performance.
2. Limited Benefit for Simple Settings:
o For small networks or tasks with relatively simple data, BatchNorm might not
have a significant impact.
3. Not Suitable for Recurrent Networks:
o Applying BatchNorm across time steps in recurrent networks is difficult
because sequence lengths vary; alternatives such as layer normalization are
often used instead.
Key Operation: Normalizes activations based on the mean and variance of each
mini-batch, then scales and shifts them using learnable parameters.
Benefits: Faster training, improved stability, and slight regularization.
Limitations: Sensitive to mini-batch size, may not be suitable for very small
networks or sequential models.
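The key operation above can be sketched in a few lines of NumPy. This shows only the
training-time forward pass; a full implementation would also track running statistics for use
at inference time.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: mini-batch of activations, shape (batch_size, num_features)."""
    mean = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift

x = np.random.randn(32, 4) * 5 + 3            # a mini-batch with shifted, scaled features
gamma = np.ones(4)
beta = np.zeros(4)
out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 mean, ~1 std per feature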
Representation Learning
Representation Learning is a key concept in machine learning where a model learns
meaningful features or representations of data that can be used for various tasks. Unlike
traditional methods where features are hand-engineered, representation learning enables the
model to automatically extract and organize features directly from raw data.
For example, features learned for one task can often be reused for related tasks (transfer
learning).
GPU implementation
Modern deep learning frameworks heavily rely on Graphics Processing Units (GPUs) to
accelerate computations. GPUs are designed for parallel processing, making them highly
efficient for the matrix and tensor operations central to deep learning.
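As a minimal illustration, in a framework such as PyTorch, moving computation to a GPU
amounts to placing the model and its input tensors on the CUDA device. This sketch assumes
PyTorch is installed and falls back to the CPU when no GPU is available.

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(1024, 1024).to(device)      # copy the model's weights to the GPU
x = torch.randn(4096, 1024, device=device)    # allocate the input directly on the GPU

y = model(x)          # the matrix multiplication runs in parallel on the GPU
print(y.device)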
1) PCA
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that
projects data onto a small set of uncorrelated directions (the principal components) chosen to
capture the maximum variance in the data.
Applications of PCA
1. Dimensionality Reduction:
o PCA reduces the number of features while retaining most of the variance in
the data, which simplifies models and speeds up training.
Advantages of PCA
1. Reduces computational complexity by lowering dimensions.
2. Helps eliminate multicollinearity by creating uncorrelated features.
3. Enhances model performance by removing redundant or noisy features.
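A minimal scikit-learn sketch of PCA used for dimensionality reduction; the data here is
random placeholder data and the number of components is an arbitrary example.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(200, 10)            # 200 samples with 10 features

pca = PCA(n_components=3)               # keep the 3 directions of highest variance
X_reduced = pca.fit_transform(X)        # project the data onto those components

print(X_reduced.shape)                  # (200, 3)
print(pca.explained_variance_ratio_)    # fraction of variance captured by each component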
2) SVD
Singular Value Decomposition (SVD) is a matrix factorization technique used in linear
algebra. It decomposes a matrix into three distinct matrices that capture its intrinsic
properties. SVD has a wide range of applications, from dimensionality reduction to solving
linear equations, and is a foundational method in machine learning and data science.
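Concretely, SVD factors a matrix A into A = U Σ V^T, where U and V have orthonormal
columns and Σ is diagonal with non-negative singular values. A minimal NumPy sketch of
the decomposition and a low-rank reconstruction:

import numpy as np

A = np.random.randn(6, 4)                         # any m x n matrix

U, S, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(S) @ Vt
print(U.shape, S.shape, Vt.shape)                 # (6, 4) (4,) (4, 4)

# Rank-2 approximation: keep only the 2 largest singular values.
k = 2
A_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_approx))               # reconstruction error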
Applications of SVD
1. Dimensionality Reduction:
o Keeping only the largest singular values and their corresponding vectors
(truncated SVD) reduces the number of dimensions while preserving most of
the information in the data.
2. Image Compression:
o SVD can compress images by storing only the largest singular values and
vectors.
3. Latent Semantic Analysis (LSA):
o In text analysis, SVD is applied to term-document matrices to uncover latent
relationships between words and documents.
4. Solving Linear Equations:
o SVD can solve systems of linear equations, especially when the system is ill-
conditioned or has no unique solution.
5. Noise Filtering:
o SVD isolates noise in the lower singular values, allowing the data to be
reconstructed without it.
Advantages of SVD
1. Handles any m × n matrix.
2. Effective for noise reduction and dimensionality reduction.
3. Robust for ill-conditioned matrices.
Limitations of SVD
1. Computationally expensive for large matrices.
2. Not scalable for very high-dimensional data.