Chapter 5
College of Science
Data Science Department
Course Title: Machine Learning
By Adane Kasie Chekole
Brain & Neurons
• After updating the weights, the forward pass is repeated, yielding y3 = 0.57, y4 = 0.56, y5 = 0.61, and Error = y_target − y5 = 0.5 − 0.61 = −0.11.
• Since y5 = 0.61 still does not match the target output, the process of computing the error and backpropagating it continues until the desired output is reached.
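To make this loop concrete, here is a minimal Python sketch of the idea, using a single sigmoid neuron with hypothetical starting values as a simplified stand-in for the slides' y3/y4/y5 network (the actual weights from the slides are not reproduced here):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_target, lr = 1.0, 0.5, 0.5
w, b = 0.8, 0.1                     # arbitrary starting weight and bias

for step in range(1000):
    y = sigmoid(w * x + b)          # forward pass
    error = y_target - y            # difference between target and current output
    if abs(error) < 1e-3:           # stop once the output is close enough to the target
        break
    grad = -error * y * (1 - y)     # dLoss/dz for squared error with a sigmoid output
    w -= lr * grad * x              # backpropagate: adjust the weight and bias
    b -= lr * grad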
Activation Functions in Neural Networks
What is an Activation Function?
The activation function of a node in an ANN calculates the node's output from its individual inputs and their weights.
It is a mathematical function applied to the output of a neuron.
It introduces non-linearity into the model, allowing the network to learn and
represent complex patterns in the data.
Without this non-linearity feature, a neural network would behave like a linear
regression model, no matter how many layers it has.
An activation function decides whether a neuron should be activated, based on the weighted sum of its inputs plus a bias term.
This helps the model make complex decisions and predictions by introducing
non-linearities to the output of each neuron.
The choice of activation function has a large impact on the capability and performance of the neural network, and different functions may be used in different parts of the model.
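For example, a single node can be sketched in a few lines of Python/NumPy (illustrative values only, not taken from the slides): the node forms the weighted sum of its inputs plus a bias and then applies the activation function to that sum.

import numpy as np

def sigmoid(z):
    # Example activation: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

inputs  = np.array([0.5, 0.3, 0.2])   # x1, x2, x3
weights = np.array([0.4, 0.7, 0.2])   # w1, w2, w3
bias    = 0.1

z = np.dot(weights, inputs) + bias    # weighted sum of inputs plus the bias term
output = sigmoid(z)                   # node output after the activation function
print(round(z, 3), round(output, 3))  # 0.55 -> about 0.634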
Introducing Non-Linearity in Neural Network
Non-linearity means that the relationship between input and output is not a straight line.
Its use greatly increases the expressive power of the network, allowing it to approximate any continuous function given sufficient neurons and layers.
Imagine you want to classify apples and bananas based on their shape and color.
A linear function can only separate them with a straight line, but real-world data is often more complex (e.g., overlapping colors, different lighting).
By adding a non-linear activation function (such as ReLU, Sigmoid, or Tanh), the network can create curved decision boundaries that separate the classes correctly.
Non-linear networks require more sophisticated training algorithms and often need larger amounts of data to prevent overfitting and ensure generalization.
Non-linearity enables neural networks to learn complex patterns that are not possible with a purely linear model, such as:
• Modeling functions whose classes are not linearly separable.
• Increasing the capacity of the network to form multiple decision boundaries based on combinations of weights and biases.
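The difference can be checked numerically. In the Python sketch below (arbitrary example weights, not a trained model), a two-layer network with ReLU between its layers violates the superposition property f(a + b) = f(a) + f(b) that every purely linear model must satisfy, which is exactly the extra expressive power non-linearity provides.

import numpy as np

relu = lambda z: np.maximum(0.0, z)

W1 = np.array([[ 1.0, -1.0],
               [-1.0,  1.0]])
W2 = np.array([[1.0, 1.0]])

linear_net    = lambda x: W2 @ (W1 @ x)         # no activation: stays linear
nonlinear_net = lambda x: W2 @ relu(W1 @ x)     # ReLU between the two layers

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

print(linear_net(a + b), linear_net(a) + linear_net(b))          # equal: [0.] and [0.]
print(nonlinear_net(a + b), nonlinear_net(a) + nonlinear_net(b)) # differ: [0.] vs [2.]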
Why is Non-Linearity Important in Neural Networks?
Non-linear activation functions enable neural networks to capture and model these complex, non-linear relationships, allowing for more expressive and powerful representations.
Neural networks consist of neurons that operate using weights, biases, and
activation functions.
In the learning process, these weights and biases are updated based on the error produced at the output, a process known as backpropagation.
Activation functions enable backpropagation by providing gradients that are
essential for updating the weights and biases.
Without non-linearity, even deep networks would be limited to solving only
simple, linearly separable problems.
Activation functions empower neural networks to model highly complex data
distributions and solve advanced deep learning tasks.
Adding non-linear activation functions introduces flexibility and enables the network to learn more complex and abstract patterns from data.
Types of Activation Functions in Machine Learning
• Linear Activation Function
• Non-Linear Activation Functions: Sigmoid, Tanh, ReLU, ELU, Softmax, SoftPlus
Linear Activation Function
The linear activation function resembles a straight line, defined by y = x.
It is a function where the output is directly proportional to the input, mathematically expressed as f(x) = x or f(x) = ax + b.
No matter how many layers the neural network contains, if they all use linear
activation functions, the output is a linear combination of the input.
The range of the output spans (−∞, +∞).
In practice, the linear activation function is used in just one place: the output layer.
Using linear activations across all layers limits the network's ability to learn complex patterns.
Linear activation functions are useful for specific tasks but must be combined with non-linear functions to enhance the neural network's learning and predictive capabilities.
Limitations: If used in hidden layers, it causes the entire neural network to
behave like a single-layer perceptron, regardless of depth, reducing its ability
to learn complex functions.
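This limitation can be verified directly. The short Python sketch below (random example weights, not a real model) shows that two stacked linear layers compute exactly the same mapping as a single linear layer with merged weights:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)   # first linear layer
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)   # second linear layer

x = rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2            # no non-linearity between layers
W_merged, b_merged = W2 @ W1, W2 @ b1 + b2      # equivalent single linear layer
one_layer = W_merged @ x + b_merged

print(np.allclose(two_layers, one_layer))       # True: depth adds no expressive power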
Non-Linear Activation Function
Non-linear activation functions are mathematical functions applied to the
output of neurons that introduce non-linearity to a neural network.
This non-linearity allows the network to learn and approximate complex, real-
world functions.
These functions are differentiable, meaning their gradients can be calculated.
This is essential for training the network using algorithms like gradient
descent and backpropagation.
Non-linear activations help in shaping the range of the neuron outputs,
preventing issues like exploding or vanishing gradients, and improving the
overall ability of the model to generalize across different types of data.
1. Sigmoid Function
It is used as an activation function in ML for modeling binary classification problems, smoothing outputs, and introducing non-linearity into models.
Sigmoid is a mathematical function that maps any real-valued number into a
value between 0 and 1.
Its characteristic "S"-shaped curve makes it particularly useful when outputs need to be interpreted as probabilities; for this reason it is also called the logistic function.
Mathematically, the sigmoid is represented as σ(x) = 1 / (1 + e^(−x)).
It allows neural networks to handle and model complex patterns that linear
equations cannot.
The output ranges between 0 and 1, hence useful for binary classification.
Properties of the Sigmoid Function
Domain: The domain of the sigmoid function is all real numbers. This means
that you can input any real number into the sigmoid function, and it will
produce a valid output.
Asymptotes: As x approaches positive infinity, σ(x) approaches 1. Conversely,
as x approaches negative infinity, σ(x) approaches 0. This property ensures that
the function never actually reaches 0 or 1, but gets arbitrarily close.
Monotonicity: The sigmoid function is monotonically increasing, meaning that
as the input increases, the output also increases.
Differentiability: The sigmoid function is differentiable, which allows for the
calculation of gradients during the training of machine learning models.
Sigmoid Function in Backpropagation
If we use a linear activation function in a neural network, the model will only be able to separate data linearly, which results in poor performance on non-linear datasets.
However, by adding a hidden layer with a sigmoid activation function, the model gains the ability to handle non-linearity, improving performance.
During backpropagation, the model updates weights and biases by computing the derivative of the activation function.
A key issue with the sigmoid function is the vanishing gradient problem: when updating weights and biases using gradient descent, if the gradients are too small, the updates become insignificant, slowing down or even stopping learning.
The sigmoid function is useful because:
• Its derivative can be written in terms of the function itself, σ'(x) = σ(x)(1 − σ(x)), which makes gradients cheap to compute.
• It is differentiable at every point, which helps in the effective computation of gradients during backpropagation.
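A minimal Python sketch of the sigmoid and its derivative (standard definitions, not specific to this course's network) shows both properties: the gradient is computed from the function itself, and it shrinks toward zero for large positive or negative inputs, which is the vanishing gradient problem described above.

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)), maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # The derivative is written in terms of the function itself:
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(np.round(sigmoid(z), 4))            # outputs squashed into (0, 1)
print(np.round(sigmoid_derivative(z), 4)) # near zero for large |z|: vanishing gradients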
2. Tanh Activation Function
The tanh function outputs values in the range of −1 to +1, so it can handle negative values more effectively than the sigmoid, whose range is 0 to 1.
Unlike the sigmoid function, tanh is zero-centered, which means that its output
is symmetric around the origin of the coordinate system.
The tanh function is frequently used in the hidden layers of a neural network.
Because of its zero-centered nature, when the data is also normalized to have
mean zero, it can result in more efficient training.
Stronger gradients often result in faster learning and convergence during
training because they tend to be more resilient against the problem of
vanishing gradients when compared to the gradients of the sigmoid function.
Behavior
As the input becomes more positive, the output approaches 1.
As the input becomes more negative, the output approaches -1.
At x=0, the output is 0, which is the center of the function.
Why Use Tanh in Neural Networks?
Non-linearity: Tanh introduces non-linearity to the model, which allows
neural networks to learn complex patterns and relationships in the data.
Without non-linear activation functions, a neural network would essentially
behave as a linear model, no matter how many layers it has.
Centered Around Zero: The output of the tanh function is centered around 0,
unlike the sigmoid function, which outputs values between 0 and 1. This
makes the tanh activation function more useful for many types of tasks, as the
mean of the output is closer to zero, leading to more efficient training and
faster convergence.
Gradient Behavior: Tanh helps mitigate the vanishing gradient problem (to
some extent), especially when compared to sigmoid activation. This is because
the gradient of the tanh function is generally higher than that of the sigmoid,
enabling better weight updates during backpropagation.
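The zero-centered output and the stronger gradient can both be seen in a short Python sketch (standard definitions only, illustrative inputs):

import numpy as np

z = np.linspace(-3, 3, 7)                 # symmetric sample of inputs

tanh_out    = np.tanh(z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

# Zero-centered: tanh outputs average near 0, sigmoid outputs do not.
print(round(float(tanh_out.mean()), 3), round(float(sigmoid_out.mean()), 3))

# Stronger gradient at the origin: tanh'(0) = 1 - tanh(0)^2 = 1.0,
# while sigmoid'(0) = 0.5 * (1 - 0.5) = 0.25.
print(1 - np.tanh(0.0) ** 2, 0.5 * (1 - 0.5))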
3. ReLU Activation Function
Rectified Linear Unit (ReLU) is a popular activation function used in neural networks, especially in deep learning models. It is defined as f(x) = max(0, x).
It is the default choice in many architectures due to its simplicity and efficiency.
The ReLU function is a piecewise linear function that outputs the input
directly if it is positive; otherwise, it outputs zero.
ReLU allows positive values to pass through unchanged while setting all
negative values to zero.
This helps the neural network maintain the necessary complexity to learn
patterns while avoiding some of the pitfalls associated with other activation
functions, like the vanishing gradient problem.
For inputs greater than 0, ReLU acts as a linear function with a gradient of 1.
This means that it does not alter the scale of positive inputs and allows the
gradient to pass through unchanged during backpropagation.
Why is ReLU Popular?
Simplicity: ReLU is computationally efficient, as it involves only a thresholding
operation. This simplicity makes it easy to implement and compute, which is
important when training deep neural networks with millions of parameters.
Non-Linearity: Although it seems like a piecewise linear function, ReLU is
still a non-linear function. This allows the model to learn more complex data
patterns and model intricate relationships between features.
Sparse Activation: ReLU's ability to output zero for negative inputs introduces
sparsity in the network, meaning that only a fraction of neurons activate at any
given time. This can lead to more efficient and faster computation.
Gradient Computation: ReLU offers computational advantages in terms of
backpropagation, as its derivative is simple: either 0 (when the input is negative)
or 1 (when the input is positive). This helps to avoid the vanishing gradient
problem, which is a common issue with sigmoid or tanh activation functions.
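These points are easy to see in a short Python sketch of ReLU and its derivative (standard definitions, illustrative inputs):

import numpy as np

def relu(z):
    # Passes positive values through unchanged and sets negatives to zero.
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Gradient is 1 for positive inputs and 0 for negative inputs.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))             # [0.  0.  0.  0.5 2. ] -> sparse activation
print(relu_derivative(z))  # [0. 0. 0. 1. 1.]      -> gradient passes through at 1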
4. Exponential Linear Unit (ELU)
ELU is a non-linear activation function that aims to improve learning speed and performance in deep neural networks.
It combines the advantages of ReLU with a smoother behavior for negative inputs.
Smooth curve: Unlike ReLU, which has a sharp corner at x=0, ELU is smooth
and differentiable everywhere.
Negative values allowed: This helps ELU maintain zero-mean activations,
improving convergence.
Linear for positive inputs: Like ReLU, it passes positive inputs unchanged,
ensuring efficiency and simplicity.
Reduces the risk of vanishing gradients by allowing small negative outputs.
Accelerates training by pushing mean activations closer to zero.
Limitations: 1. It is more computationally expensive than ReLU because of the exponential calculation.
2. The choice of α affects performance and may require tuning.
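A minimal Python sketch of the standard ELU definition (with the tunable α mentioned above) illustrates the smooth negative branch:

import numpy as np

def elu(z, alpha=1.0):
    # Standard ELU: f(z) = z for z > 0, and alpha * (e^z - 1) for z <= 0.
    # The negative branch is smooth and approaches -alpha for very negative inputs.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.round(elu(z), 4))   # small, smooth negative outputs instead of a hard zero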
5. Softmax Function
Softmax is an activation function commonly used in neural networks for multi-class classification problems.
It converts the raw output scores into probabilities distributed across the different classes such that their sum equals 1.
Essentially, Softmax helps in transforming output values into a format that can
be interpreted as probabilities, which makes it suitable for classification tasks.
In a multi-class classification neural network, the final layer outputs a set of
values, each corresponding to a different class.
These values, before Softmax is applied, can be any real numbers, and may not
provide meaningful information directly.
The Softmax function processes these values into probabilities, which indicate
the likelihood of each class being the correct one.
Softmax gained prominence with the rise of deep learning, particularly in
models such as multilayer perceptrons (MLPs) and convolutional neural networks (CNNs).
Why Use Softmax?
Multi-Class Classification: Softmax is ideal for problems involving more than
two classes, where the goal is to predict a single class out of many. The
function's ability to generate a probability distribution over classes makes it
particularly useful in classification models.
Probabilistic Interpretation: Since Softmax converts logits into probabilities,
the output is easily interpretable. You can not only determine the most likely
class but also gauge the confidence of the model in that prediction.
Handling Multiple Classes Simultaneously: The sum-to-one property of
Softmax ensures that all possible classes are considered together. This holistic
approach ensures that each prediction takes into account all classes rather than
just focusing on one or two.
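A minimal Python sketch of the softmax (standard definition, with the usual max-subtraction for numerical stability) shows the sum-to-one property:

import numpy as np

def softmax(logits):
    # Subtracting the maximum logit improves numerical stability
    # without changing the resulting probabilities.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(logits)
print(np.round(probs, 3))            # approximately [0.659 0.242 0.099]
print(probs.sum())                   # sums to 1 (up to floating-point rounding)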
6. Softplus Function
The Softplus function, f(x) = ln(1 + e^x), is a smooth, differentiable approximation of the ReLU (Rectified Linear Unit) activation function.
It maps real-valued inputs to positive outputs and is often used when a non-
linear and smooth activation is preferred.
Smooth ReLU: Unlike ReLU, which is not differentiable at 0, Softplus
provides a smooth curve throughout its domain.
Outputs always positive: This makes it suitable when negative activations are
undesirable.
Log-exp behavior: For large positive values, it behaves like ReLU (i.e.,
f(x)≈x), and for large negative values, it approaches zero slowly.
Prevents the issue of "dead neurons" seen in ReLU (where outputs become
permanently zero).
Maintains gradient flow during training due to smooth differentiability.
Why Use Softplus in Neural Networks?
Smooth Approximation of ReLU: The Softplus function is often seen as a
smoother version of the ReLU function.
While ReLU is simple and effective, it can have issues, such as causing
neurons to "die" if they always output zero for negative inputs.
Softplus avoids this issue by providing a smooth, continuous output for both
positive and negative inputs.
Differentiability: Softplus is a differentiable function, unlike ReLU, which
has a discontinuity at zero.
The continuous and differentiable nature of Softplus makes it easier for
gradient-based optimization algorithms to work effectively, ensuring smooth
learning during training.
Preventing Dying Neurons: In the case of ReLU, when the input is negative,
the output is exactly zero, which can lead to dead neurons that do not
contribute to learning.
Softplus gradually approaches zero for negative values, ensuring that neurons
always produce some non-zero output and continue contributing to the
learning process.
Numerical Stability: The Softplus function has better numerical stability
than some other activation functions because it avoids the issues that arise
from very large or very small values.
It has a smooth output, and for very large or very small inputs, the function
behaves predictably, reducing the risk of overflow or underflow in
computation.
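A minimal Python sketch of the softplus next to ReLU makes the smooth, always-positive behavior visible (standard definitions, illustrative inputs):

import numpy as np

def softplus(z):
    # Smooth approximation of ReLU: f(z) = ln(1 + e^z).
    return np.log1p(np.exp(z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.round(softplus(z), 4))   # always positive; approaches 0 slowly for negatives
print(relu(z))                    # hard cutoff at zero, for comparison
# The gradient of softplus is the sigmoid, so it never becomes exactly zero:
print(np.round(1.0 / (1.0 + np.exp(-z)), 4))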
Key Takeaways:
Softplus → Used for continuous and positive-valued outputs, acts as a smooth
ReLU.
Softmax → Used for multi-class classification; converts outputs into a probability distribution.
How to Choose an Activation Function
Linear Activation Function → For Regression problems
Used in: Predicting values like house prices, student grades, stock prices.
Why: Suitable when the output is a continuous number (regression tasks).
Not ideal for hidden layers because it can't model complex relationships.
Sigmoid (Logistic) Function → For binary classification
Used in: spam detection, loan approval (yes/no), medical results (+ve/-ve)
Why: Gives output as a probability between 0 and 1.
Slows down learning or causes vanishing gradients in deep networks.
Tanh Activation Function → For time series forecasting / Signal processing
Used in: Weather forecasting, speech recognition, credit scoring.
Why: Better than sigmoid because it's balanced (negative or positive).
Limitation: Still suffers from vanishing gradient in very deep networks.
How to Choose an Activation Function
ReLU (Rectified Linear Unit) Function → For hidden layers in deep networks
Used in: Face detection, object recognition, self-driving cars.
Why: Fast and efficient—most commonly used in hidden layers today.
Risk of "dead neurons" that never activate during training and stop learning.
Softmax Function → For multi-class classification
Used in: Handwritten digit recognition, sentiment analysis (positive,
neutral, negative), language identification.
Why: Great for choosing one correct answer from many (multi-class).
Typically used in the final layer of models, not hidden layers.
SoftPlus Activation Function → For Theoretical research
Used in: advanced AI models, financial risk modeling.
Why: Smooth alternative to ReLU, good for stability in sensitive systems.
Not used as commonly as ReLU in practice.
How to Choose an Activation Function
Activation Function | Real-World Example Use                            | Layer Location
Linear              | Predict house price                               | Output layer (regression)
Sigmoid             | Spam detection, loan approval                     | Output layer (binary classification)
Tanh                | Credit scoring, signal analysis                   | Hidden layers
ReLU                | Face or object detection                          | Hidden layers
ELU                 | NLP, robotics                                     | Hidden layers
Softmax             | Handwriting recognition, sentiment classification | Output layer (multi-class)
SoftPlus            | Risk modeling, deep theory models                 | Hidden layers (rare)
Impact of Activation Functions on Model Performance
Convergence Speed: Functions like ReLU allow faster training by avoiding
the vanishing gradient problem, while Sigmoid and Tanh can slow down
convergence in deep networks.
Gradient Flow: Activation functions like ReLU ensure better gradient flow,
helping deeper layers learn effectively. In contrast, Sigmoid can lead to small
gradients, hindering learning in deep layers.
Model Complexity: Activation functions like Softmax allow the model to
handle complex multi-class problems, whereas simpler functions like ReLU
or Leaky ReLU are used in the hidden layers.
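The gradient-flow point can be made concrete with a toy Python calculation (illustrative numbers only, not a trained model): the gradient reaching early layers is roughly a product of per-layer activation derivatives, so the choice of activation compounds with depth.

depth = 10
sigmoid_factor = 0.25 ** depth   # sigmoid's derivative never exceeds 0.25
relu_factor = 1.0 ** depth       # active ReLU units pass the gradient through at 1.0

print(f"sigmoid gradient factor after {depth} layers: {sigmoid_factor:.2e}")  # ~9.54e-07
print(f"ReLU gradient factor after {depth} layers:    {relu_factor:.1f}")     # 1.0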
Activation functions are the backbone of neural networks, enabling them to
capture non-linear relationships in data.
From classic functions like Sigmoid and Tanh to modern variants like ReLU
and Swish, each has its place in different types of neural networks.
Enkutatash Tech
THANK YOU
Wisdom is the essence of our
uniqueness!
Adane Kasie Chekole
+251-938427723
adanekasie26@gmail.com
www.enkutatashplc.com