
Bahir Dar University

College of Science
Data Science Department
Course Title: Machine Learning

By

Adane Kasie Chekole (MSc)


Chapter 5: Neural Network: Representation
 Neurons and the Brain
 Biological vs Artificial Neural Network
 Model Representation
 Multi-class classification
 Non-linear hypothesis
 Activation functions in NN

Brain & Neurons

 The brain functions as a highly organized network of neurons, where billions of connections form pathways responsible for everything from breathing to complex thinking.
 Neurons are specialized cells that act as the primary communication network of the brain and the nervous system, allowing signals to travel across the body.
 Neurons work together in synchronized patterns, creating neural circuits that underlie our thoughts, memories, and behaviors.
Biological neurons vs artificial neural networks
 Structure: Biological neurons have a complex and organic structure, consisting of dendrites, soma, axon, and synapses. Artificial neurons have a simple and mathematical structure, consisting of inputs, weights, a bias, and an activation function.
 Function: Biological neurons process and transmit electrical and chemical signals, using action potentials and neurotransmitters. Artificial neurons process and transmit numerical values, using weighted sums and activation functions.
 Learning: Biological neurons learn and adapt through synaptic plasticity,
changing the strength and number of synapses based on experience and
stimuli. Artificial neurons learn and adapt through weight adjustment,
changing the value and number of weights based on error and feedback.
 Efficiency: Biological neurons are highly efficient and parallel, processing
and transmitting signals at high speed and low energy consumption. Artificial
neurons are less efficient and sequential, requiring more time and power to
perform computations and communications.
What is a Neural Network?
 Neural networks are machine learning models that mimic the complex
functions of the human brain.
 These models consist of interconnected nodes or neurons that process data,
learn patterns & enable tasks such as pattern recognition and decision-making.
 Neurons: The basic units that receive inputs, each neuron is governed by a
threshold and an activation function.
 Connections: Links between neurons that carry information, regulated by
weights and biases.
 Weights and Biases: These parameters determine the strength and influence
of connections.
 Propagation Functions: Mechanisms that help process and transfer data
across layers of neurons.
 Learning Rule: The method that adjusts weights and biases over time to
improve accuracy.
Layers in Neural Network Architecture
 Input Layer: This is where the network receives its input data. Each input
neuron in the layer corresponds to a feature in the input data.
 Hidden Layers: These layers perform most of the computational heavy
lifting. A neural network can have one or multiple hidden layers.
 Each layer consists of units (neurons) that transform the inputs into something
that the output layer can use.
 Output Layer: The final layer produces the output of the model. The format of these outputs varies depending on the specific task (classification, regression).
 The activated outputs from the hidden layer are passed to the output neuron.
 The output neuron receives the values from the hidden layer neurons and
computes the final prediction using weights:
 Suppose the output weights from the hidden layer to the output neuron are [0.7, 0.2].
 Final Activation: The output is passed through a sigmoid activation function to obtain a probability between 0 and 1.
Working of Neural Networks
Forward Propagation
 Data is input into the network, it passes through the network in the forward
direction, from the input layer through the hidden layers to the output layer.
 Linear Transformation: Each neuron in a layer receives inputs, which are
multiplied by the weights associated with the connections. These products are
summed together, and a bias is added to the sum. This can be represented
mathematically as: z = w1·x1 + w2·x2 + … + wn·xn + b, where w represents the weights, x represents the inputs, and b is the bias.
 Activation: The result of the linear transformation (denoted as z) is then
passed through an activation function. The activation function is crucial
because it introduces non-linearity into the system, enabling the network to
learn more complex patterns. Popular activation functions include ReLU,
sigmoid & tanh.
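To make these two steps concrete, here is a minimal NumPy sketch of forward propagation through a single hidden layer. The layer sizes, random weights, and the choice of sigmoid are illustrative assumptions rather than values taken from the slides.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: 3 input features, 4 hidden units, 1 output unit
rng = np.random.default_rng(0)
x  = rng.normal(size=(3,))       # input features
W1 = rng.normal(size=(4, 3))     # hidden-layer weights
b1 = np.zeros(4)                 # hidden-layer biases
W2 = rng.normal(size=(1, 4))     # output-layer weights
b2 = np.zeros(1)                 # output-layer bias

# Linear transformation followed by activation, layer by layer
z1 = W1 @ x + b1                 # z = w1*x1 + ... + wn*xn + b for each hidden neuron
a1 = sigmoid(z1)                 # non-linear activation
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)              # final prediction in (0, 1)
print(y_hat)
```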
Working of Neural Networks
Backpropagation
 After forward propagation, the network evaluates its performance using a loss
function, which measures the difference between the actual output and the
predicted output. The goal of training is to minimize this loss. This is where
backpropagation comes into play:
 Loss Calculation: The network calculates the loss, which provides a measure
of error in the predictions. The loss function could vary; common choices are
mean squared error for regression tasks or cross-entropy loss for classification.
 Gradient Calculation: The network computes the gradients of the loss function
with respect to each weight and bias in the network. This involves applying the
chain rule of calculus to find out how much each part of the output error can be
attributed to each weight and bias.
 Weight Update: Once the gradients are calculated, the weights and biases are updated using an optimization algorithm such as stochastic gradient descent (SGD).
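A minimal sketch of that update step, assuming the gradients have already been produced by backpropagation; the parameter values, gradient values, and learning rate below are made up for illustration.

```python
import numpy as np

def sgd_step(params, grads, lr=0.1):
    # One stochastic gradient descent update: theta <- theta - lr * dLoss/dtheta
    return [p - lr * g for p, g in zip(params, grads)]

# Toy usage with assumed parameter and gradient values
params = [np.array([0.5, -0.3]), np.array([0.1])]
grads  = [np.array([0.2,  0.4]), np.array([-0.05])]
print(sgd_step(params, grads))
```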
Working of Neural Networks
Iteration
 This process of forward propagation, loss calculation, backpropagation, and
weight update is repeated for many iterations over the dataset.
 Over time, this iterative process reduces the loss, and the network’s predictions
become more accurate.
 Through these steps, neural networks can adapt their parameters to better approximate the relationships in the data, thereby improving their performance on tasks such as classification, regression, or any other predictive modeling.
Multi-class Classification in NN
 Multi-class classification refers to predictive modeling tasks where each input
is assigned to one class out of three or more possible categories, unlike binary.
 This type of classification is commonly used in applications such as
handwritten digit recognition, sentiment analysis with multiple emotions, and
object detection with diverse classes.
 NN used for multi-class classification typically employ a softmax activation
function in the output layer, which converts raw output scores into
probabilities for each class.
 The model is trained using a categorical cross-entropy loss function, which
penalizes incorrect predictions based on the divergence between predicted
probabilities and the actual class.
 Techniques like one-vs-rest or one-hot encoding are often used to structure
multi-class problems for training and prediction, depending on the framework
and model architecture.
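As a sketch of the output stage described above, the snippet below applies softmax to a vector of raw scores and computes the categorical cross-entropy loss against a one-hot target. The three-class logits and the target are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the outputs sum to 1
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(probs, one_hot_target, eps=1e-12):
    # Penalizes divergence between predicted probabilities and the true class
    return -np.sum(one_hot_target * np.log(probs + eps))

logits = np.array([2.0, 0.5, -1.0])   # raw output scores for 3 classes (illustrative)
target = np.array([1.0, 0.0, 0.0])    # one-hot encoding of the true class
probs = softmax(logits)
print(probs, categorical_cross_entropy(probs, target))
```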
Types of Neural Networks
 Feedforward Networks: A feedforward NN is a simple ANN architecture in
which data moves from input to output in a single direction.
 Single-layer Perceptron: A single-layer perceptron consists of only one layer of neurons. It takes inputs, applies weights, sums them up, and uses an activation function to produce an output.
 Multilayer Perceptron (MLP): MLP is a type of feedforward neural network
with three or more layers, including an input layer, one or more hidden layers,
and an output layer. It uses nonlinear activation functions.
 Convolutional Neural Network (CNN): a specialized ANN designed for image
processing. It employs convolutional layers to automatically learn hierarchical
features from input images, enabling effective recognition & classifications.
 Recurrent Neural Network (RNN): An ANN type intended for sequential data
processing. It is appropriate for applications where contextual dependencies
are critical, such as time series prediction & NLP.
Types of Neural Networks
 Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed
to overcome the vanishing gradient problem in training RNNs. It uses memory
cells and gates to selectively read, write, and erase information.
 GRU(Gated Recurrent Unit) is a type of RNN designed to capture long-range
dependencies in sequential data while solving the vanishing gradient problem.
GRU simplifies the LSTM architecture by eliminating the cell state, making it
easier to implement and requiring less memory for training.
 Transformer: a NN architecture used for ML tasks, particularly in NLP and computer vision. The Transformer uses self-attention to process an entire input sequence at once rather than step by step, e.g. BERT, RoBERTa.
Advantages of Neural Networks
 Adaptability: Neural networks are useful for activities where the link between
inputs and outputs is complex or not well defined because they can adapt to
new situations and learn from data.
 Pattern Recognition: Their proficiency in pattern recognition makes them effective in tasks such as audio and image identification, natural language processing, and other intricate data patterns.
 Parallel Processing: Because neural networks are capable of parallel
processing by nature, they can process numerous jobs at once, which speeds
up and improves the efficiency of computations.
 Non-Linearity: Neural networks are able to model and comprehend complicated relationships in data by virtue of the non-linear activation functions found in neurons, which overcome the drawbacks of linear models.
Disadvantages of Neural Networks
 Computational Intensity: Large neural network training can be a laborious
and computationally demanding process that demands a lot of computing
power.
 Black box Nature: As “black box” models, NN pose a problem in important
applications since it is difficult to understand how they make decisions.
 Overfitting: Overfitting is a phenomenon in which a neural network memorizes the training data rather than learning general patterns. Although regularization approaches help to alleviate this, the problem still exists.
 Need for Large datasets: For efficient training, neural networks frequently
need sizable, labeled datasets; otherwise, their performance may suffer from
incomplete or skewed data.
Applications of Neural Networks
 Image and Video Recognition: CNNs are extensively used in applications
such as facial recognition, autonomous driving, and medical image analysis.
 Natural Language Processing (NLP): RNNs and transformers power language
translation, chatbots, and sentiment analysis.
 Finance: Predicting stock prices, fraud detection, and risk management.
 Healthcare: Neural networks assist in diagnosing diseases, analyzing medical
images, and personalizing treatment plans.
 Gaming and Autonomous Systems: Neural networks enable real-time
decision-making, enhancing user experience in video games and enabling
autonomous systems like self-driving cars.
What is Backpropagation?
 Backpropagation (short for "backward propagation of errors") is an algorithm
that helps neural networks learn from their mistakes.
 It is the mechanism through which neural networks adjust their weights based
on the error made during prediction.
 Fine-tuning of weights enables the model to become more accurate over time.
 It works iteratively to adjust weights and bias to minimize the cost function.
 In each epoch the model adapts these parameters reducing loss by following
the error gradient.
 Backpropagation often uses optimization algorithms like gradient descent or
stochastic gradient descent.
 The algorithm computes the gradient using the chain rule from calculus
allowing it to effectively navigate complex layers in the neural network to
minimize the cost function.
Working of Backpropagation Algorithm
 Forward Pass: The input data is passed through the network, layer by layer,
and predictions are made. During this phase, the network computes the
predicted output and compares it with the actual result (label) using a loss
function.
 Backward Pass: This is where backpropagation comes into play. Once the loss
(error) is calculated, the network uses the backpropagation algorithm to trace
back through the layers and compute the gradients of the loss with respect to each weight in the network.
 Weight Update: After calculating the gradients, the weights of the neural
network are updated using an optimization algorithm like Gradient Descent.
This update reduces the error, making the model better at predicting outcomes
in future iterations.
Why is Backpropagation Important?
 Efficient Learning: Instead of randomly adjusting weights, backpropagation
ensures that adjustments are made in a way that minimizes the error as
efficiently as possible. This leads to faster learning & more optimized model.
 Simplicity and Flexibility: Its straightforward design suits a range of tasks
from basic feedforward to complex convolutional or recurrent networks.
 Automation of Feature Engineering: One of the strengths of deep learning is that the network learns useful features from raw data automatically, and backpropagation is the mechanism that makes this possible.
 Scalability: The algorithm scales efficiently with larger datasets and more
complex networks making it ideal for large-scale tasks.
 Generalization: It helps models generalize well to new data improving
prediction accuracy on unseen examples.
 Without backpropagation, training a NN would be computationally infeasible
for real-world tasks due to the vast number of parameters involved.
Challenges in Backpropagation
 Vanishing Gradient Problem: In deep networks the gradients can become very
small during backpropagation making it difficult for the network to learn.
This is common when using activation functions like sigmoid or tanh.
 Exploding Gradients: The gradients can also become excessively large
causing the network to diverge during training.
 Overfitting: If the model becomes too good at minimizing the training error, it
might fail to generalize to new data. Techniques like dropout, regularization,
and careful design of the network architecture help mitigate this.
 Computational Intensity: Training deep networks with large datasets using backpropagation can be computationally expensive. Modern advancements like GPU/TPU acceleration and distributed computing have made this less of an issue, but it remains a concern for resource-constrained environments.
Example of Backpropagation in Machine Learning
 Assume the neurons use the sigmoid activation function for the forward and
backward pass. The target output is 0.5, and the learning rate is 1.
Formulas
 Initial Calculation
The weighted sum at each node is calculated using: a_j = Σ_i (w_i,j · x_i), where
1. a_j is the weighted sum of all the inputs and weights at each node
2. w_i,j represents the weight between the i-th input and the j-th neuron
3. x_i represents the value of the i-th input
 Sigmoid Function
The sigmoid function returns a value between 0 and 1, introducing non-linearity into the model. After applying the activation function to a_j, we get the output of the neuron:
o_j = sigmoid(a_j) = 1 / (1 + e^(−a_j))
 Calculating Gradients: the change in each weight is Δw_i,j = η · δ_j · x_i (the delta rule), where η is the learning rate and δ_j is the error term of neuron j obtained by propagating the output error backward.
Backpropagation

• After updating the weights, the forward pass is repeated, yielding: y3 = 0.57, y4 = 0.56, y5 = 0.61, Error = y_target − y5 = 0.5 − 0.61 = −0.11
• Since y5 = 0.61 is still not the target output, the process of calculating the error and backpropagating continues until the desired output is reached.
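Below is a minimal sketch mirroring the structure of this worked example (two inputs, two sigmoid hidden neurons y3 and y4, one sigmoid output y5, learning rate 1, target 0.5). The inputs and initial weights are assumed, so the intermediate numbers will not match the slides exactly; the point is the repeated forward-pass / backpropagation loop.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.35, 0.9])         # assumed inputs
W1 = np.array([[0.1, 0.8],        # assumed weights: inputs -> hidden neurons (y3, y4)
               [0.4, 0.6]])
W2 = np.array([0.3, 0.9])         # assumed weights: hidden -> output neuron (y5)
y_target, lr = 0.5, 1.0

for epoch in range(1000):
    # Forward pass
    h = sigmoid(W1 @ x)           # hidden activations
    y = sigmoid(W2 @ h)           # output activation
    error = y_target - y
    if abs(error) < 1e-3:         # stop once the output is close enough to the target
        break
    # Backward pass: delta terms use the sigmoid derivative y * (1 - y)
    delta_out = error * y * (1 - y)
    delta_hidden = delta_out * W2 * h * (1 - h)
    # Weight updates (delta rule, minimizing the squared error)
    W2 += lr * delta_out * h
    W1 += lr * np.outer(delta_hidden, x)

print(epoch, float(y))
```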
Activation functions in Neural Networks
What is an Activation Function?
 The activation function of a node in an ANN is a function that calculates the
output of the node based on its individual inputs and their weights.
 An activation function is a mathematical function applied to the output of a neuron.
 It introduces non-linearity into the model, allowing the network to learn and
represent complex patterns in the data.
 Without this non-linearity feature, a neural network would behave like a linear
regression model, no matter how many layers it has.
 Activation function decides whether a neuron should be activated by
calculating the weighted sum of inputs and adding a bias term.
 This helps the model make complex decisions and predictions by introducing
non-linearities to the output of each neuron.
 The choice of activation function has a large impact on the capability and
performance of the NN, & some may be used in different parts of the model.
Introducing Non-Linearity in Neural Network
 Non-linearity means that the relationship between input and output is not a straight line.
 Its use greatly increases the expressive power of the network, allowing it to approximate any continuous function given sufficient neurons and layers.
 Imagine you want to classify apples and bananas based on their shape and color.
 If we use a linear function, it can only separate them using a straight line.
 But real-world data is often more complex (e.g. overlapping colors, different lighting). By adding a non-linear activation function (like ReLU, Sigmoid, or Tanh), the network can create curved decision boundaries to separate them correctly.
 Non-linear models require more sophisticated training algorithms and often need larger amounts of data to prevent overfitting and ensure generalization.
 Non-linearity enables NN to learn more complex patterns that are not possible
with a purely linear model, such as: Modeling functions not linearly separable.
 Increasing the capacity of the network to form multiple decision boundaries
based on the combination of weights and biases.
Why is Non-Linearity Important in Neural Networks?
 Non-linear activation functions enable neural networks to capture and model complex, non-linear relationships, allowing for more expressive and powerful representations.
 Neural networks consist of neurons that operate using weights, biases, and
activation functions.
 In the learning process, these weights and biases are updated based on the error produced at the output, a process known as backpropagation.
 Activation functions enable backpropagation by providing gradients that are
essential for updating the weights and biases.
 Without non-linearity, even deep networks would be limited to solving only
simple, linearly separable problems.
 Activation functions empower neural networks to model highly complex data
distributions and solve advanced deep learning tasks.
 Adding non-linear activation functions introduces flexibility and enables the network to learn more complex and abstract patterns from data.
Types of Activation Functions in Machine Learning
 Linear Activation Function
 Non-Linear Activation Functions
 Sigmoid Activation Function

 Tanh Activation Function

 ReLU (Rectified Linear Unit) Function


 Exponential Linear Units
 Softmax Function

 SoftPlus Function

Linear Activation Function
 The linear activation function resembles a straight line, defined by y = x.
 A linear activation function is a function where the output is directly
proportional to the input; mathematically expressed as f(x) = x or f(x) = ax + b.
 No matter how many layers the neural network contains, if they all use linear
activation functions, the output is a linear combination of the input.
 The range of the output spans (−∞, +∞).
 The linear activation function is typically used in just one place, i.e. the output layer.
 Using linear activation across all layers makes the network’s ability to learn
complex patterns limited.
 Linear activation functions are useful for specific tasks but must be combined
with non-linear functions to enhance the NN learning & predictive capabilities.
 Limitations: If used in hidden layers, it causes the entire neural network to
behave like a single-layer perceptron, regardless of depth, reducing its ability
to learn complex functions.
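This limitation can be checked directly: stacking two layers with linear activations collapses to a single linear map. A small sketch with assumed random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1 (linear activation)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2 (linear activation)

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear layer: W = W2 @ W1, b = W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
print(np.allclose(two_layers, W @ x + b))  # True: the extra layer adds no expressive power
```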
Non-Linear Activation Function
 Non-linear activation functions are mathematical functions applied to the
output of neurons that introduce non-linearity to a neural network.
 This non-linearity allows the network to learn and approximate complex, real-
world functions.
 These functions are differentiable, meaning their gradients can be calculated.
 This is essential for training the network using algorithms like gradient
descent and backpropagation.
 Non-linear activations help in shaping the range of the neuron outputs, preventing issues like exploding or vanishing gradients, and improving the overall ability of the model to generalize across different types of data.
1. Sigmoid Function
 It is used as an activation function in ML for modeling binary classification problems, smoothing outputs, and introducing non-linearity into models.
 Sigmoid is a mathematical function that maps any real-valued number into a
value between 0 and 1.
 Its characteristic "S"-shaped curve makes it particularly useful in scenarios where outputs must be converted into probabilities; for this reason it is also called the logistic function.
 Mathematically, the sigmoid is represented as: σ(x) = 1 / (1 + e^(−x)).
 It allows neural networks to handle and model complex patterns that linear
equations cannot.
 The output ranges between 0 and 1, hence useful for binary classification.
Properties of the Sigmoid Function
 Domain: The domain of the sigmoid function is all real numbers. This means
that you can input any real number into the sigmoid function, and it will
produce a valid output.
 Asymptotes: As x approaches positive infinity, σ(x) approaches 1. Conversely,
as x approaches negative infinity, σ(x) approaches 0. This property ensures that
the function never actually reaches 0 or 1, but gets arbitrarily close.
 Monotonicity: The sigmoid function is monotonically increasing, meaning that
as the input increases, the output also increases.
 Differentiability: The sigmoid function is differentiable, which allows for the
calculation of gradients during the training of machine learning models.
Sigmoid Function in Backpropagation
 If we use a linear activation function in a NN, the model will only be able to separate data linearly, which results in poor performance on non-linear datasets.
 However, by adding a hidden layer with a sigmoid activation function, the
model gains the ability to handle non-linearity, by improving performance.
 During the backpropagation, the model calculates and updates weights and
biases by computing the derivative of the activation function.
 An issue with the sigmoid function is the vanishing gradient problem: when updating weights and biases using gradient descent, if the gradients are too small, the updates become insignificant, slowing down or even stopping learning.
The sigmoid function is useful because:
 Its derivative can be expressed in terms of the function itself, σ′(x) = σ(x)(1 − σ(x)), which makes gradients cheap to compute.
 It is differentiable at every point, which helps in the effective computation of gradients during backpropagation.
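A short sketch of the sigmoid and its derivative identity; the sample inputs are arbitrary. Note how tiny the gradient becomes far from zero, which is the vanishing-gradient issue mentioned above.

```python
import numpy as np

def sigmoid(x):
    # Maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)): the derivative reuses the forward output
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))             # roughly [0.007, 0.5, 0.993]
print(sigmoid_derivative(x))  # peaks at 0.25 for x = 0, nearly 0 in the tails
```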
2. Tanh Activation Function
 The tanh function outputs values in the range of −1 to +1. It can deal with negative values more effectively than the sigmoid, which has a range of 0 to 1.
 Unlike the sigmoid function, tanh is zero-centered, which means that its output
is symmetric around the origin of the coordinate system.
 The tanh function is frequently used in the hidden layers of a neural network.
 Because of its zero-centered nature, when the data is also normalized to have
mean zero, it can result in more efficient training.
 Stronger gradients often result in faster learning and convergence during
training because they tend to be more resilient against the problem of
vanishing gradients when compared to the gradients of the sigmoid function.
Behavior
 As the input becomes more positive, the output approaches 1.
 As the input becomes more negative, the output approaches -1.
 At x=0, the output is 0, which is the center of the function.
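A quick numerical check of the behavior listed above, using NumPy's built-in tanh; the sample inputs are arbitrary.

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.tanh(x))           # approaches -1 and +1 at the tails, exactly 0 at x = 0
print(1 - np.tanh(x) ** 2)  # derivative: reaches 1.0 at x = 0, larger than sigmoid's peak of 0.25
```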
Why Use Tanh in Neural Networks?
 Non-linearity: Tanh introduces non-linearity to the model, which allows
neural networks to learn complex patterns and relationships in the data.
Without non-linear activation functions, a neural network would essentially
behave as a linear model, no matter how many layers it has.
 Centered Around Zero: The output of the tanh function is centered around 0,
unlike the sigmoid function, which outputs values between 0 and 1. This
makes the tanh activation function more useful for many types of tasks, as the
mean of the output is closer to zero, leading to more efficient training and
faster convergence.
 Gradient Behavior: Tanh helps mitigate the vanishing gradient problem (to
some extent), especially when compared to sigmoid activation. This is because
the gradient of the tanh function is generally higher than that of the sigmoid,
enabling better weight updates during backpropagation.
3. ReLU Activation Function
 Rectified Linear Unit (ReLU) is one of the most popular activation functions used in neural networks, especially in deep learning models: f(x) = max(0, x).
 It is default choice in many architectures due to its simplicity and efficiency.
 The ReLU function is a piecewise linear function that outputs the input
directly if it is positive; otherwise, it outputs zero.
 ReLU allows positive values to pass through unchanged while setting all
negative values to zero.
 This helps the neural network maintain the necessary complexity to learn
patterns while avoiding some of the pitfalls associated with other activation
functions, like the vanishing gradient problem.
 For inputs greater than 0, ReLU acts as a linear function with a gradient of 1.
 This means that it does not alter the scale of positive inputs and allows the
gradient to pass through unchanged during backpropagation.
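A minimal sketch of ReLU and its piecewise-constant derivative; the sample inputs are arbitrary.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): passes positives unchanged, zeroes out negatives
    return np.maximum(0, x)

def relu_derivative(x):
    # Gradient is 1 for positive inputs and 0 for negative inputs
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```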
Why is ReLU Popular?
 Simplicity: ReLU is computationally efficient, as it involves only a thresholding operation. This simplicity makes it easy to implement and compute, which is important when training deep neural networks with millions of parameters.
 Non-Linearity: Although it seems like a piecewise linear function, ReLU is
still a non-linear function. This allows the model to learn more complex data
patterns and model intricate relationships between features.
 Sparse Activation: ReLU's ability to output zero for negative inputs introduces sparsity in the network, meaning that only a fraction of neurons activate at any given time. This can lead to more efficient and faster computation.
 Gradient Computation: ReLU offers computational advantages in terms of backpropagation, as its derivative is simple: either 0 (when the input is negative) or 1 (when the input is positive). This helps to avoid the vanishing gradient problem, which is a common issue with sigmoid or tanh activation functions.
4. Exponential Linear Unit (ELU)
 ELU is a type of non-linear activation function that aims to improve learning
speed and performance in deep neural networks.
 It combines the advantages of ReLU with a smoother gradient for negative inputs.
 Smooth curve: Unlike ReLU, which has a sharp corner at x=0, ELU is smooth
and differentiable everywhere.
 Negative values allowed: This helps ELU maintain zero-mean activations,
improving convergence.
 Linear for positive inputs: Like ReLU, it passes positive inputs unchanged,
ensuring efficiency and simplicity.
 Reduces the risk of vanishing gradients by allowing small negative outputs.
 Accelerates training by pushing mean activations closer to zero.
Limitations: 1. It is more computationally expensive than ReLU because of the exponential calculation. 2. The choice of α affects performance and may require tuning.
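The slide does not give the formula, so the sketch below assumes the standard ELU definition, f(x) = x for x > 0 and α(e^x − 1) otherwise, with α = 1.

```python
import numpy as np

def elu(x, alpha=1.0):
    # Standard ELU: x for positive inputs, alpha * (exp(x) - 1) for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(elu(x))  # smooth near 0; negative outputs saturate at -alpha instead of being cut to 0
```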
5. Softmax Function
 Softmax is an activation function commonly used in neural networks for multi-class classification problems.
 It converts the network's raw output scores into probabilities that are distributed across the different classes such that their sum equals 1.
 Essentially, Softmax helps in transforming output values into a format that can
be interpreted as probabilities, which makes it suitable for classification tasks.
 In a multi-class classification neural network, the final layer outputs a set of
values, each corresponding to a different class.
 These values, before Softmax is applied, can be any real numbers, and may not
provide meaningful information directly.
 The Softmax function processes these values into probabilities, which indicate
the likelihood of each class being the correct one.
 Softmax gained prominence with the rise of deep learning, particularly in
models such as multilayer perceptrons (MLPs) and CNN.
Why Use Softmax?
 Multi-Class Classification: Softmax is ideal for problems involving more than
two classes, where the goal is to predict a single class out of many. The
function's ability to generate a probability distribution over classes makes it
particularly useful in classification models.
 Probabilistic Interpretation: Since Softmax converts logits into probabilities,
the output is easily interpretable. You can not only determine the most likely
class but also gauge the confidence of the model in that prediction.
 Handling Multiple Classes Simultaneously: The sum-to-one property of
Softmax ensures that all possible classes are considered together. This holistic
approach ensures that each prediction takes into account all classes rather than
just focusing on one or two.
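Continuing the earlier softmax sketch, the snippet below shows the probabilistic interpretation: the outputs sum to one, and the largest probability doubles as the model's confidence in its predicted class. The four logits are illustrative.

```python
import numpy as np

logits = np.array([1.2, 0.3, 3.1, -0.5])  # assumed raw scores for 4 classes
e = np.exp(logits - logits.max())         # subtract the max for numerical stability
probs = e / e.sum()

print(probs.sum())                  # 1.0: every class is considered together
print(probs.argmax(), probs.max())  # most likely class and the model's confidence in it
```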
6. Softplus Function
 The Softplus function is a smooth, differentiable approximation of the ReLU
(Rectified Linear Unit) activation function.
 It maps real-valued inputs to positive outputs and is often used when a non-
linear and smooth activation is preferred.
 Smooth ReLU: Unlike ReLU, which is not differentiable at 0, Softplus
provides a smooth curve throughout its domain.
 Outputs always positive: This makes it suitable when negative activations are
undesirable.
 Log-exp behavior: For large positive values, it behaves like ReLU (i.e.,
f(x)≈x), and for large negative values, it approaches zero slowly.
 Prevents the issue of "dead neurons" seen in ReLU (where outputs become
permanently zero).
 Maintains gradient flow during training due to smooth differentiability.
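The slides describe Softplus qualitatively; the sketch below assumes the standard form f(x) = ln(1 + e^x), computed with NumPy's numerically stable log-add-exp.

```python
import numpy as np

def softplus(x):
    # f(x) = ln(1 + e^x); np.logaddexp(0, x) = log(exp(0) + exp(x)) avoids overflow
    return np.logaddexp(0.0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(softplus(x))               # always positive: ~0 for very negative x, ~x for large x
print(1.0 / (1.0 + np.exp(-x)))  # its derivative is exactly the sigmoid, so gradients never die
```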
Why Use Softplus in Neural Networks?
 Smooth Approximation of ReLU: The Softplus function is often seen as a
smoother version of the ReLU function.
 While ReLU is simple and effective, it can have issues, such as causing
neurons to "die" if they always output zero for negative inputs.
 Softplus avoids this issue by providing a smooth, continuous output for both
positive and negative inputs.
 Differentiability: Softplus is a differentiable function, unlike ReLU, which
has a discontinuity at zero.
 The continuous and differentiable nature of Softplus makes it easier for
gradient-based optimization algorithms to work effectively, ensuring smooth
learning during training.
Why Use Softplus in Neural Networks?
 Preventing Dying Neurons: In the case of ReLU, when the input is negative,
the output is exactly zero, which can lead to dead neurons that do not
contribute to learning.
 Softplus gradually approaches zero for negative values, ensuring that neurons
always produce some non-zero output and continue contributing to the
learning process.
 Numerical Stability: The Softplus function has better numerical stability
than some other activation functions because it avoids the issues that arise
from very large or very small values.
 It has a smooth output, and for very large or very small inputs, the function
behaves predictably, reducing the risk of overflow or underflow in
computation.
Key Takeaways:
 Softplus → Used for continuous and positive-valued outputs; acts as a smooth ReLU.
 Softmax → Used for multi-class classification; converts outputs into a probability distribution.
How to Choose an Activation Function
 Linear Activation Function → For Regression problems
 Used in: Predicting values like house prices, student grades, stock prices.
 Why: Suitable when the output is a continuous number (regression tasks).
 Not ideal for hidden layers because it can't model complex relationships.
 Sigmoid (Logistic) Function → For binary classification
 Used in: spam detection, loan approval (yes/no), medical results (+ve/-ve)
 Why: Gives output as a probability between 0 and 1.
 Limitation: slows down learning or causes vanishing gradients in deep networks.
 Tanh Activation Function → For time series forecasting / Signal processing
 Used in: Weather forecasting, speech recognition, credit scoring.
 Why: Better than sigmoid because it's balanced (negative or positive).
 Limitation: Still suffers from vanishing gradient in very deep networks.
How to Choose an Activation Function
 ReLU (Rectified Linear Unit) Function → For hidden layers (e.g. computer vision tasks)
 Used in: Face detection, object recognition, self-driving cars.
 Why: Fast and efficient—most commonly used in hidden layers today.
 Risk of “dead neurons” (never activate) during training and stop learning.
 Softmax Function → For multi-class classification
 Used in: Handwritten digit recognition, sentiment analysis (positive,
neutral, negative), language identification.
 Why: Great for choosing one correct answer from many (multi-class).
 Typically used in the final layer of models, not hidden layers.
 SoftPlus Activation Function → For Theoretical research
 Used in: advanced AI models, financial risk modeling.
 Why: Smooth alternative to ReLU, good for stability in sensitive systems.
 Not used as commonly as ReLU in practice.
How to Choose an Activation Function
Activation Function | Real-World Example Use | Layer Location
Linear | Predict house price | Output layer (regression)
Sigmoid | Spam detection, loan approval | Output layer (binary classification)
Tanh | Credit scoring, signal analysis | Hidden layers
ReLU | Face or object detection | Hidden layers
ELU | NLP, robotics | Hidden layers
Softmax | Handwritten recognition, sentiment classification | Output layer (multi-class)
SoftPlus | Risk modeling, deep theory models | Hidden layers (rare)
Impact of Activation Functions on Model Performance
 Convergence Speed: Functions like ReLU allow faster training by avoiding
the vanishing gradient problem, while Sigmoid and Tanh can slow down
convergence in deep networks.
 Gradient Flow: Activation functions like ReLU ensure better gradient flow,
helping deeper layers learn effectively. In contrast, Sigmoid can lead to small
gradients, hindering learning in deep layers.
 Model Complexity: Activation functions like Softmax allow the model to
handle complex multi-class problems, whereas simpler functions like ReLU
or Leaky ReLU are used for basic layers.
 Activation functions are the backbone of neural networks, enabling them to
capture non-linear relationships in data.
 From classic functions like Sigmoid and Tanh to modern variants like ReLU
and Swish, each has its place in different types of neural networks.
Enkutatash Tech

THANK YOU
Wisdom is the essence of our
uniqueness!
Adane Kasie Chekole
+251-938427723
adanekasie26@gmail.com
www.enkutatashplc.com

Lemlemitu, Bahir Dar, Ethiopia
