Lecture04-NeuralNetwork

The document provides an overview of neural networks, explaining their structure, types, applications, and benefits. It details the training process, including forward propagation, loss calculation, backpropagation, and weight updates using gradient descent. Additionally, it discusses activation functions, loss functions, and the importance of choosing an appropriate learning rate.

Neural Networks

Thien Huynh-The
Department of Computer and Communications Engineering
HCMC University of Technology and Education

February 10, 2025

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 1 / 77


What are Neural Networks?
• Inspired by the biological neural networks in our brains.
• Powerful machine learning models capable of learning complex patterns from data.
• Consist of interconnected nodes (neurons) organized in layers.
• Used for various tasks: classification, regression, image recognition, natural language processing,
etc.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 2 / 77


Types of Neural Networks
• Perceptron: The basic building block.
• Multilayer Perceptron (MLP): Multiple layers, capable of learning non-linear functions.
• Convolutional Neural Networks (CNNs): Specialized for image and video processing.
• Recurrent Neural Networks (RNNs): Designed for sequential data like text and time series.
• Transformers: Attention-based models, excelling in NLP tasks.

Example CNN and RNN architectures
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 3 / 77
Applications and Benefits

Applications:
• Image Recognition: Object detection, image classification.
• Natural Language Processing: Machine translation, sentiment analysis.
• Healthcare: Disease diagnosis, drug discovery.
• Finance: Fraud detection, stock market prediction.
Benefits:
• High accuracy.
• Ability to learn complex patterns.
• Adaptability to different data types.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 4 / 77


Neural Networks vs. Traditional ML

Feature | Neural Networks | Traditional ML
Data | Large datasets; performance scales with data. | Smaller datasets; less data-dependent.
Complexity | High; computationally intensive; GPUs often needed. | Lower; faster training; less resource-intensive.
Feature Eng. | Automatic feature learning; less manual effort. | Requires manual feature engineering.
Interpretability | "Black box"; difficult to interpret. | More interpretable (e.g., decision trees).
Generalization | Good with diverse data; prone to overfitting. | Can generalize; sensitive to feature/model choice.
Use Cases | Complex tasks (image/text/time series). | Broader range; good for limited data/interpretability.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 5 / 77


Processing Principles: Layers and Calculations
• Input Layer: Receives the input data.
• Hidden Layers: Perform computations.
• Output Layer: Produces the final result.
• Activation Function: Introduces non-linearity (e.g., sigmoid, ReLU).
$\text{Output} = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$

where:
• $x_i$: Input values
• $w_i$: Weights
• $b$: Bias
• $f$: Activation function

A single neuron calculation
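As a quick illustration (not from the original slides), a minimal NumPy sketch of this single-neuron calculation, with arbitrary example values, could look like this:

import numpy as np

def neuron_output(x, w, b, f=lambda z: max(0.0, z)):
    # f(sum_i w_i * x_i + b); ReLU is used here as the example activation
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_i (arbitrary example values)
w = np.array([0.1, 0.4, -0.2])   # weights w_i
b = 0.05                         # bias b
print(neuron_output(x, w, b))    # prints 0.0: the weighted sum is negative, so ReLU clips it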


Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 6 / 77
Simple Neural Network Architecture
• Neural networks are composed of layers of interconnected neurons.
• The basic structure includes:
• Input Layer: Receives the initial data.
• Hidden Layer(s): Perform computations on the input.
• Output Layer: Produces the final result.

A simple neural network with layers

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 7 / 77


The Input Layer

• The first layer of the network.


• Receives raw input data as a vector.
• Each neuron in the input layer corresponds to a feature of the input data.
• No computation occurs in the input layer; it simply passes the data to the next layer.
Example: If the input is an image of size 28x28 pixels, the input layer would have 784 neurons
(28 * 28).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 8 / 77


Hidden Layers
• Layers between the input and output layers.
• Perform the core computations of the network.
• Each neuron in a hidden layer receives input from all neurons in the previous layer.
• Computations involve:
• Weighted sum of inputs.
• Application of an activation function.
$\text{Output}_j = f\left(\sum_{i=1}^{n} w_{ij} x_i + b_j\right)$

where:
• $x_i$: Input from the i-th neuron in the previous layer.
• $w_{ij}$: Weight connecting the i-th neuron in the previous layer to the j-th neuron in the current layer.
• $b_j$: Bias for the j-th neuron.
• $f$: Activation function (e.g., ReLU, sigmoid).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 9 / 77


Activation Functions
• Introduce non-linearity into the network.
• Without activation functions, the network would only be able to learn linear relationships.
• Common activation functions:
• ReLU (Rectified Linear Unit): $f(x) = \max(0, x)$
• Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$
• Tanh: $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
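For reference, a minimal NumPy sketch of the three activation functions above (illustrative only):

import numpy as np

def relu(x):
    # max(0, x), applied elementwise
    return np.maximum(0, x)

def sigmoid(x):
    # 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # (e^x - e^{-x}) / (e^x + e^{-x}); np.tanh computes exactly this
    return np.tanh(x)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), sigmoid(z), tanh(z))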

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 10 / 77


The Output Layer

• The final layer of the network.


• Produces the output of the network.
• The number of neurons in the output layer depends on the task:
• Classification: One neuron per class (or a single neuron with sigmoid for binary
classification).
• Regression: One neuron with a linear activation function.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 11 / 77


Output Layer: Classification
• Goal: Assign an input to one of several predefined categories or classes.
• Number of neurons: Typically one neuron per class.
• Activation function:
• Softmax: Used for multi-class classification (more than two classes). It converts the output of each neuron into a probability distribution over the classes:
$\text{Softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$
where z is the vector of outputs from the previous layer and K is the number of classes.
• Sigmoid: Used for binary classification (two classes). The output is interpreted as the probability of belonging to the positive class:
$\sigma(x) = \frac{1}{1 + e^{-x}}$
• Output: A vector of probabilities, where each element represents the probability of the
input belonging to a specific class. The class with the highest probability is the predicted
class.
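To make the two activation choices concrete, here is a small illustrative NumPy sketch (the logit values are arbitrary) of how an output layer turns raw scores into class probabilities:

import numpy as np

def softmax(z):
    # Subtracting max(z) is a common numerical-stability trick; it does not change the result.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([2.0, 1.0, 0.1])              # raw outputs (logits) for 3 classes
probs = softmax(z)                         # probability distribution over the classes
print(probs, probs.sum())                  # the probabilities sum to 1
print("predicted class:", np.argmax(probs))

p_positive = sigmoid(0.8)                  # binary case: probability of the positive class
print("class 1" if p_positive >= 0.5 else "class 0")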
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 12 / 77
Output Layer: Regression

• Goal: Predict a continuous numerical value.


• Number of neurons: Typically one neuron.
• Activation function:
• Linear (Identity): f (x) = x. Often used when the output can take any real value.
• Sometimes no activation function is used at all.
• Output: A single numerical value representing the prediction.
Example: Predicting house prices, stock prices, or temperature.
• Suppose the output of the last hidden layer is 2.5.
• With a linear activation function, the output of the output neuron is also 2.5.
• This 2.5 represents the predicted value (e.g., $250,000 if predicting house prices in
thousands of dollars).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 13 / 77


Summary

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 14 / 77


Exercises

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 15 / 77


The Training Process: Overview

• Training a neural network involves adjusting its weights and biases to minimize the difference between its predictions and the actual target values.
• This process is typically iterative and involves
the following steps:
1. Forward Propagation: Input data is passed
through the network to generate predictions.
2. Loss Calculation: A loss function measures the
error between predictions and true values.
3. Backpropagation: The error is propagated back through the network to calculate gradients (how much each weight contributes to the error).
4. Weight Update: Weights and biases are adjusted based on the gradients to reduce the loss.

The training loop

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 16 / 77


Forward Propagation

• Input data is fed into the input layer.


• Data is passed through each layer, with each neuron performing a weighted sum of its
inputs and applying an activation function.
• This process continues until the output layer produces a prediction.
$\text{Output}_j = f\left(\sum_{i=1}^{n} w_{ij} x_i + b_j\right)$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 17 / 77


Loss Functions: Measuring Error
• A loss function quantifies the difference between the network’s predictions and the actual
target values.
• Common loss functions:
• Mean Squared Error (MSE): Used for regression.

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

where $y_i$ are the true values and $\hat{y}_i$ are the predictions.
• Cross-Entropy Loss: Used for classification.

$\text{Cross-Entropy} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})$

where $y_{ic}$ is 1 if the i-th sample belongs to class c and 0 otherwise, and $\hat{y}_{ic}$ is the predicted probability of the i-th sample belonging to class c.
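As an illustrative sketch, assuming one-hot labels and predicted probabilities stored as NumPy arrays, the two losses could be computed as follows:

import numpy as np

def mse(y_true, y_pred):
    # (1/n) * sum_i (y_i - yhat_i)^2
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # -(1/n) * sum_i sum_c y_ic * log(yhat_ic); eps avoids log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

# Regression example
print(mse(np.array([3.0, -0.5, 2.0]), np.array([2.5, 0.0, 2.0])))

# Classification example: 2 samples, 3 classes, one-hot targets
y = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y, y_hat))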
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 18 / 77
Backpropagation: Calculating Gradients
• Backpropagation calculates the gradients of the loss function with respect to each weight
and bias in the network.
• These gradients indicate how much each weight contributes to the error.
• The chain rule of calculus is used to efficiently calculate these gradients through the
network’s layers.

Backpropagation
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 19 / 77
Weight Update: Gradient Descent
• Gradient descent is used to update the weights and biases based on the calculated
gradients.
• The weights are adjusted in the direction that reduces the loss.
• The update rule is:
$w_{ij} = w_{ij} - \eta \frac{\partial \text{Loss}}{\partial w_{ij}}$
where η is the learning rate, which controls the step size of the update.

The goal is to find the argmin of the loss function, i.e., the weights that minimize the loss.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 20 / 77


The Idea Behind Gradient Descent
• Goal: Find the minimum of a function (the loss function in our case).
• Analogy: Imagine you are on top of a mountain in dense fog and want to reach the valley.
• You can only see the ground immediately around you.
• Gradient descent: You take small steps in the direction of the steepest descent (the
direction where the ground slopes downwards most steeply).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 21 / 77


Mathematical Formulation

• The gradient of a function points in the direction of the steepest ascent.


• To find the minimum, we move in the opposite direction of the gradient.
• The update rule for a weight w is:

$w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial \text{Loss}}{\partial w}$

where:
• $\eta$ (eta) is the learning rate, a small positive value that controls the step size.
• $\frac{\partial \text{Loss}}{\partial w}$ is the partial derivative of the loss function with respect to $w$, which represents the gradient.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 22 / 77


Example Calculation
Let's consider a simple loss function: $\text{Loss}(w) = w^2 - 4w + 5$
• We want to find the minimum value of w using gradient descent.
• The derivative (gradient) of the loss function is: $\frac{\partial \text{Loss}}{\partial w} = 2w - 4$
• Let’s start with an initial guess w = 0 and a learning rate η = 0.1.
• Iteration 1:
• Gradient: 2(0) − 4 = −4
• Update: wnew = 0 − 0.1(−4) = 0.4
• Iteration 2:
• Gradient: 2(0.4) − 4 = −3.2
• Update: wnew = 0.4 − 0.1(−3.2) = 0.72
• We continue this process until the change in w becomes very small or a maximum number of
iterations is reached.
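The worked example above can be reproduced with a few lines of Python; this sketch uses the same loss, starting point, and learning rate:

def grad(w):
    return 2 * w - 4                      # dLoss/dw for Loss(w) = w^2 - 4w + 5

w, eta = 0.0, 0.1
for it in range(1, 51):
    g = grad(w)
    w = w - eta * g
    if it <= 3:
        print(f"iteration {it}: gradient = {g:.4f}, new w = {w:.4f}")
print("w after 50 iterations:", round(w, 5))  # approaches the analytical minimum w = 2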

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 23 / 77


Example Calculation (Cont)
• Let's continue the example with $\text{Loss}(w) = w^2 - 4w + 5$, $\frac{\partial \text{Loss}}{\partial w} = 2w - 4$, $w_{\text{initial}} = 0$, and $\eta = 0.1$.
• Here's a table showing several iterations:

Iteration | w_old | Gradient | w_new
1 | 0 | -4 | 0.4
2 | 0.4 | -3.2 | 0.72
3 | 0.72 | -2.56 | 0.976
4 | 0.976 | -2.048 | 1.1808
5 | 1.1808 | -1.6384 | 1.34464
... | ... | ... | ...
10 | 1.758 | -0.484 | 1.8064
20 | 1.990 | -0.020 | 1.992
50 | 1.99999 | -0.00002 | 2.000

• As we can see, w converges towards 2, which is the actual minimum of the function (found by setting the derivative to 0: 2w − 4 = 0 ⇒ w = 2).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 24 / 77


Choosing the Learning Rate

• The learning rate η is a crucial hyperparameter. It controls the step size in gradient
descent.
• Too small: Convergence is very slow, requiring many iterations to reach the minimum.
• Too large: The algorithm might overshoot the minimum and oscillate or even diverge,
never finding a stable solution.
• Optimal learning rate: Finding the optimal learning rate often requires experimentation.
Techniques like learning rate scheduling (adjusting the learning rate during training) can
be helpful.
• Learning Rate Scheduling: Adjusting the learning rate during training. Common
approaches:
• Step Decay: Reduce the learning rate by a fixed factor after a certain number of epochs.
• Exponential Decay: Reduce the learning rate exponentially over time.
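A hedged sketch of the two scheduling approaches mentioned above (the decay constants are arbitrary choices, not values from the lecture):

import math

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    # Reduce the learning rate by a fixed factor every `epochs_per_drop` epochs.
    return initial_lr * (drop ** (epoch // epochs_per_drop))

def exponential_decay(initial_lr, epoch, k=0.05):
    # Reduce the learning rate exponentially over time.
    return initial_lr * math.exp(-k * epoch)

for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(0.1, epoch), round(exponential_decay(0.1, epoch), 5))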

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 25 / 77


Simple Neural Network Setup
• Consider a simple neural network with 3 inputs (x1 , x2 , x3 ), one hidden layer with 2 neurons, and one
output neuron.
• We have a dataset of 10 data points {x1 , x2 , x3 }.
• The output of the hidden neurons is calculated as:
$y_1 = x_1 w_{1,1} + x_2 w_{1,2} + x_3 w_{1,3} + b_1$
$y_2 = x_1 w_{2,1} + x_2 w_{2,2} + x_3 w_{2,3} + b_2$
where $w_{i,j}$ are the weights connecting input j to hidden neuron i, and $b_i$ are the biases.
• The output of the network $z_1$ is calculated using a sigmoid activation function:
$z_1 = \sigma(y_1) = \frac{1}{1 + e^{-y_1}}$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 26 / 77


Loss Function

• We use the Mean Squared Error (MSE) as our loss function:

$L = \frac{1}{10} \sum_{k=1}^{10} (o_1^k - z_1^k)^2$

where $o_1^k$ is the desired output (ground truth) for the k-th data point, and $z_1^k$ is the network's output for the k-th data point.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 27 / 77


Backpropagation: Calculating the Gradient
• We want to update the weights using gradient descent:

$w_{j,i} \leftarrow w_{j,i} - \eta \frac{\partial L}{\partial w_{j,i}}$

where $\eta$ is the learning rate.

• Using the chain rule, we can calculate the partial derivative of the loss with respect to a weight $w_{j,i}$:

$\frac{\partial L}{\partial w_{j,i}} = \frac{\partial L}{\partial z_1} \frac{\partial z_1}{\partial y_1} \frac{\partial y_1}{\partial w_{j,i}} = \frac{2}{10} \sum_{k=1}^{10} (z_1^k - o_1^k) \cdot z_1^k (1 - z_1^k) \cdot x_i^k$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 28 / 77


Weight Update

• Finally, we update the weights:


$w_{j,i} \leftarrow w_{j,i} - \eta \left( \frac{2}{10} \sum_{k=1}^{10} (z_1^k - o_1^k) \cdot z_1^k (1 - z_1^k) \cdot x_i^k \right)$

• This process is repeated for all weights in the network for each training example (or batch
of examples) until the loss converges to a minimum.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 29 / 77


Python Code - Handwritten Digit Classification
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

print('load data from MNIST')

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Use all digits 0-9
x_train = x_train.astype(np.float32)
x_test = x_test.astype(np.float32)

y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
y_test = tf.keras.utils.to_categorical(y_test, num_classes=10)

# Reshape and normalize
x_train = np.reshape(x_train, (-1, 784)) / 255.0
x_test = np.reshape(x_test, (-1, 784)) / 255.0
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 30 / 77
Python Code - Handwritten Digit Classification

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Initialize weights
W = np.random.uniform(-0.1, 0.1, (784, 10))

# Training process
n = 0.05          # Learning rate
num_epoch = 10    # Number of epochs
batch_size = 32
losses = []
train_accuracies = []
test_accuracies = []

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 31 / 77


Python Code - Handwritten Digit Classification
for epoch in range(num_epoch):
    for batch_start in range(0, x_train.shape[0], batch_size):
        batch_end = min(batch_start + batch_size, x_train.shape[0])
        x_batch = x_train[batch_start:batch_end]
        y_batch = y_train[batch_start:batch_end]

        o = sigmoid(np.matmul(x_batch, W))
        dW = (x_batch.T @ ((y_batch - o) * o * (1 - o))) / batch_size
        W = W + n * dW

    # Calculate and store loss and accuracy for the full training set
    train_o = sigmoid(np.matmul(x_train, W))
    loss = np.mean(np.power(train_o - y_train, 2))
    losses.append(loss)

    train_predictions = np.argmax(train_o, axis=1)
    train_true_labels = np.argmax(y_train, axis=1)
    train_accuracy = np.mean(train_predictions == train_true_labels)
    train_accuracies.append(train_accuracy)

    # Calculate test accuracy
    test_o = sigmoid(np.matmul(x_test, W))
    test_predictions = np.argmax(test_o, axis=1)
    test_true_labels = np.argmax(y_test, axis=1)
    test_accuracy = np.mean(test_predictions == test_true_labels)
    test_accuracies.append(test_accuracy)

    print(f"Epoch {epoch+1}/{num_epoch}, Loss: {loss}, Train Accuracy: {train_accuracy}, Test Accuracy: {test_accuracy}")

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 32 / 77


Python Code - Handwritten Digit Classification
# Plotting the results
epochs = range(1, num_epoch + 1)
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.plot(epochs, losses)
plt.title("Loss vs. Epoch")
plt.xlabel("Epoch")
plt.ylabel("Loss")

plt.subplot(1, 3, 2)
plt.plot(epochs, train_accuracies, label='Training Accuracy')
plt.title("Training Accuracy vs. Epoch")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")

plt.subplot(1, 3, 3)
plt.plot(epochs, test_accuracies, label='Test Accuracy', color='orange')
plt.title("Test Accuracy vs. Epoch")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.tight_layout()
plt.show()

# Example of getting predictions (on test data)
test_o = sigmoid(np.matmul(x_test, W))
predictions = np.argmax(test_o, axis=1)
print("Example Predictions (first 10 test samples):", predictions[:10])

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 33 / 77


Student Assignment: Code Understanding & Analysis

Objective: To understand the practical implementation of a simple neural network.


• Code Dissection: Students will meticulously analyze the provided Python code, explaining the purpose of
each line and its role within the overall machine learning framework. This includes:
• Architectural Exploration: Students are challenged to explore architectural variations of the simple neural
network by experimenting with additional hidden layers and varying neuron counts.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 34 / 77


Introduction to Classification
• Classification: A supervised learning task where the goal is to assign input data to
predefined categories or classes.
• Two main types:
• Binary Classification: Two classes (e.g., spam/not spam, cat/dog).
• Multi-class Classification: More than two classes (e.g., handwritten digits 0-9, types of
flowers).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 35 / 77


Binary Classification

• Definition: A classification task with two possible outcomes or classes.


• Representation:
• Often encoded as 0 and 1 (or True/False).
• Sometimes represented as -1 and 1, especially in certain algorithms like Support Vector
Machines (SVMs).
• Decision Boundary: The goal is to find a decision boundary that separates the two
classes.
• Examples:
• Medical diagnosis (disease present/absent).
• Fraud detection (fraudulent/not fraudulent).
• Sentiment analysis (positive/negative).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 36 / 77


Binary Classification

• Probabilistic Interpretation: Many binary classifiers output a probability score between 0 and 1, representing the likelihood of belonging to one of the classes.
• Thresholding: A threshold (typically 0.5) is used to make a final classification decision:
• Probability ≥ threshold: Class 1
• Probability < threshold: Class 0
• Examples:
• Medical diagnosis (probability of disease > 0.7: disease present).
• Fraud detection (probability of fraud > 0.9: fraudulent).
• Sentiment analysis (probability of positive sentiment > 0.6: positive).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 37 / 77


Binary Classification

• Loss Functions: Used to train binary classifiers by measuring the difference between
predicted and actual labels.
• Common Loss Functions:
• Binary Cross-Entropy (Log Loss)
• Hinge Loss (used in SVMs)
• Examples:
• Medical diagnosis (minimizing false negatives).
• Fraud detection (minimizing false positives).
• Sentiment analysis (maximizing accuracy).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 38 / 77


Binary Classification

• Two Outcomes: The goal is to classify data into one of two distinct categories.
• Probabilities and Thresholds: Classifiers often output a probability P(y = 1|x), representing the
likelihood of belonging to class 1. A threshold (e.g., 0.5) determines the final classification.
• Decision Boundary: A line or surface that separates the two classes in the feature space.

• Examples:
• Medical diagnosis (disease present/absent).
• Fraud detection (fraudulent/not fraudulent).
• Sentiment analysis (positive/negative).

Example of a Decision Boundary

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 39 / 77


Multi-class Classification

• Definition: A classification task where the goal is to assign input data to one of more
than two predefined categories or classes.
• Output Representation:
• Often represented using one-hot encoding: a vector where each element corresponds to a class, and
only one element is 1 (representing the predicted class), while the others are 0.
• Example: For 3 classes (cat, dog, bird), a prediction of “dog” could be represented as [0, 1, 0].
• Challenges:
• Increased complexity compared to binary classification.
• Need for specialized algorithms and techniques (e.g., softmax regression, one-vs-rest, one-vs-one).
• Examples:
• Image classification (cat, dog, bird, fish, etc.).
• Handwritten digit recognition (0-9).
• News categorization (sports, politics, technology, business, etc.).
• Object recognition in images (car, pedestrian, traffic light, building, etc.).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 40 / 77


Strategies for Multi-class Classification

• One-vs-Rest (OvR) / One-vs-All (OvA): Train one binary classifier for each class,
treating that class as “positive” and all other classes as “negative.” For n classes, you
need n classifiers.
• One-vs-One (OvO): Train a binary classifier for each pair of classes. For n classes, you train n(n−1)/2 classifiers.

One-vs-one and one-vs-rest

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 41 / 77


Logistic Regression

• Used for binary classification.


• Combines linear regression with a sigmoid
activation function.
• Linear Regression: $y = w^T x + b$
• Sigmoid Function: $\sigma(y) = \frac{1}{1 + e^{-y}}$
• Logistic Regression: $P(y = 1|x) = \sigma(w^T x + b)$ (probability of belonging to class 1)

Logistic regression
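A minimal sketch of the logistic regression computation above, assuming the weight vector w and bias b are already given (the values are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    # P(y = 1 | x) = sigma(w^T x + b)
    return sigmoid(np.dot(w, x) + b)

w = np.array([1.5, -2.0])             # illustrative weights
b = 0.3                               # illustrative bias
x = np.array([0.4, 0.1])
p = predict_proba(x, w, b)
print(p, "-> class", int(p >= 0.5))   # threshold at 0.5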

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 42 / 77


Softmax Regression

• Generalization of logistic regression to multi-class classification.


• Calculates the probability of each class.
• Softmax function:
$P(y = i|x) = \frac{e^{w_i^T x + b_i}}{\sum_{j=1}^{K} e^{w_j^T x + b_j}}$

where K is the number of classes.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 43 / 77


Softmax in Neural Networks
• In neural networks, the softmax function is often used in the output layer for multi-class
classification.
• The output of the softmax layer is a probability distribution over the classes.
• The class with the highest probability is the predicted class.

Softmax in neural networks for multi-class classification


Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 44 / 77
Network Architecture
Network configuration:
• 3 input nodes (x0 , x1 , x2 ).
• 4 hidden neurons (h0 , h1 , h2 , h3 ) with sigmoid activation.
• 3 output neurons (y0′ , y1′ , y2′ ) with softmax activation.
• Cross-entropy loss function.
Architecture of the three-layer neural network.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 45 / 77


Forward Propagation and Loss

• Hidden layer activations:


$h_j = \sigma\left(\sum_{i=0}^{2} w_{ji}^{(1)} x_i + b_j^{(1)}\right), \quad j = 0, 1, 2, 3$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ (sigmoid).
• Output layer activations (softmax):

$y_k' = \frac{e^{\sum_{j=0}^{3} w_{kj}^{(2)} h_j + b_k^{(2)}}}{\sum_{l=0}^{2} e^{\sum_{j=0}^{3} w_{lj}^{(2)} h_j + b_l^{(2)}}}, \quad k = 0, 1, 2$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 46 / 77


Setup and Definitions
• We consider a neural network with a softmax output layer and cross-entropy loss.
• Cross-Entropy Loss (single example):

$L = -\sum_{k=0}^{2} y_k \log(y_k')$

where $y_k$ is the true label (one-hot encoded) and $y_k'$ denotes the softmax output (predicted probability for class k).
• Softmax Function:
$y_k' = \frac{e^{z_k}}{\sum_{l=0}^{2} e^{z_l}}$
where $z_k = \sum_{j=0}^{3} w_{kj}^{(2)} h_j + b_k^{(2)}$ (weighted sum of inputs to output neuron k).

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 47 / 77


Applying the Chain Rule
• We want to find the derivative of the loss with respect to the output layer weights, $\frac{\partial L}{\partial w_{kj}^{(2)}}$.
• Using the chain rule:
$\frac{\partial L}{\partial w_{kj}^{(2)}} = \frac{\partial L}{\partial y_k'} \cdot \frac{\partial y_k'}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_{kj}^{(2)}}$
• Derivative of Loss with respect to Softmax Output:
$\frac{\partial L}{\partial y_k'} = -\frac{y_k}{y_k'}$
• Derivative of Softmax Output with respect to $z_l$:
• If $k = l$: $\frac{\partial y_k'}{\partial z_k} = y_k'(1 - y_k')$
• If $k \neq l$: $\frac{\partial y_k'}{\partial z_l} = -y_k' y_l'$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 48 / 77


Calculating the Partial Derivatives
• The derivative of the Loss with respect to zk is:
$\frac{\partial L}{\partial z_k} = \sum_{l=0}^{2} \frac{\partial L}{\partial y_l'} \frac{\partial y_l'}{\partial z_k}$

$= \frac{\partial L}{\partial y_k'} \frac{\partial y_k'}{\partial z_k} + \sum_{l \neq k} \frac{\partial L}{\partial y_l'} \frac{\partial y_l'}{\partial z_k}$

$= \left(-\frac{y_k}{y_k'}\right) y_k'(1 - y_k') + \sum_{l \neq k} \left(-\frac{y_l}{y_l'}\right)(-y_k' y_l')$

$= -y_k(1 - y_k') + y_k' \sum_{l \neq k} y_l$

$= -y_k + y_k y_k' + y_k'(1 - y_k) = y_k' - y_k$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 49 / 77


Final Result
• Derivative of $z_k$ with respect to $w_{kj}^{(2)}$:

$\frac{\partial z_k}{\partial w_{kj}^{(2)}} = h_j$

• Combining all parts of the chain rule:

$\frac{\partial L}{\partial w_{kj}^{(2)}} = \frac{\partial L}{\partial y_k'} \cdot \frac{\partial y_k'}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_{kj}^{(2)}} = (y_k' - y_k) h_j$

• This is the gradient we use to update the weights in the output layer during backpropagation:

$w_{kj}^{(2)} \leftarrow w_{kj}^{(2)} - \eta (y_k' - y_k) h_j$

where $\eta$ is the learning rate.
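To connect the result $(y_k' - y_k) h_j$ to code, here is a small illustrative NumPy sketch of the output-layer gradient and update for a single training example (all values are made up for the example):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

h = np.array([0.2, 0.7, 0.1, 0.9])          # hidden activations h_j (4 hidden neurons)
W2 = np.random.uniform(-0.5, 0.5, (3, 4))   # output weights w_kj^(2), 3 classes
b2 = np.zeros(3)
y = np.array([0.0, 1.0, 0.0])               # one-hot true label

z = W2 @ h + b2
y_pred = softmax(z)                          # y'_k
grad_W2 = np.outer(y_pred - y, h)            # dL/dw_kj^(2) = (y'_k - y_k) * h_j
eta = 0.1
W2 -= eta * grad_W2                          # gradient descent update
print(grad_W2.shape)                         # (3, 4)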
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 50 / 77
Backpropagation: Hidden Layer Weights

• Derivative of loss with respect to hidden layer weights $w_{ji}^{(1)}$:

$\frac{\partial L}{\partial w_{ji}^{(1)}} = \left(\sum_{k=0}^{2} (y_k' - y_k) w_{kj}^{(2)}\right) h_j (1 - h_j) x_i$

• Weight update rule:

$w_{ji}^{(1)} \leftarrow w_{ji}^{(1)} - \eta \left(\sum_{k=0}^{2} (y_k' - y_k) w_{kj}^{(2)}\right) h_j (1 - h_j) x_i$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 51 / 77


Chain Rule for Hidden Layer Weights

• We use the chain rule again, but this time it’s more complex because the hidden neurons
affect *all* output neurons:
$\frac{\partial L}{\partial w_{ji}^{(1)}} = \frac{\partial L}{\partial h_j} \cdot \frac{\partial h_j}{\partial w_{ji}^{(1)}}$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 52 / 77


Calculating the Partial Derivatives

• The activation of the hidden neuron $h_j$ is given by:

$h_j = \sigma(z_j) = \sigma\left(\sum_{i=0}^{2} w_{ji}^{(1)} x_i + b_j^{(1)}\right)$

• Therefore:

$\frac{\partial h_j}{\partial w_{ji}^{(1)}} = \frac{\partial \sigma(z_j)}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ji}^{(1)}} = h_j (1 - h_j) x_i$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 53 / 77


Derivative of the Sigmoid Function
• Let’s first find the derivative of the sigmoid function σ(z) with respect to z:
 
$\frac{\partial \sigma(z)}{\partial z} = \frac{\partial}{\partial z}\left(\frac{1}{1 + e^{-z}}\right)$
$= \frac{0 - (-1)e^{-z}}{(1 + e^{-z})^2} = \frac{e^{-z}}{(1 + e^{-z})^2}$
$= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}$
$= \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}}\right)$
$= \sigma(z)(1 - \sigma(z))$
• Therefore:
$\frac{\partial \sigma(z_j)}{\partial z_j} = h_j (1 - h_j)$
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 54 / 77
Calculating the Partial Derivatives
• This is the crucial part. The hidden neuron hj influences *all* output neurons. Therefore,
we need to sum over the contributions from each output neuron:
$\frac{\partial L}{\partial h_j} = \sum_{k=0}^{2} \frac{\partial L}{\partial y_k'} \cdot \frac{\partial y_k'}{\partial h_j}$

$= \sum_{k=0}^{2} \frac{\partial L}{\partial y_k'} \cdot \frac{\partial y_k'}{\partial z_k} \cdot \frac{\partial z_k}{\partial h_j}$

$= \sum_{k=0}^{2} (y_k' - y_k) w_{kj}^{(2)}$

• Here, $\frac{\partial z_k}{\partial h_j} = w_{kj}^{(2)}$

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 55 / 77


Putting It All Together

• Combining the results:

$\frac{\partial L}{\partial w_{ji}^{(1)}} = \frac{\partial L}{\partial h_j} \cdot \frac{\partial h_j}{\partial w_{ji}^{(1)}} = \left(\sum_{k=0}^{2} (y_k' - y_k) w_{kj}^{(2)}\right) h_j (1 - h_j) x_i$

• This is the gradient used to update the weights between the input and hidden layers.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 56 / 77


Batch vs. Mini-Batch Gradient Descent
Two main variants differing in how much training data is used to compute the gradient:
• Batch Gradient Descent:
• Uses the entire training dataset to compute the gradient of the loss function in each iteration
(epoch).
• Computationally expensive for large datasets.
• Provides a more stable but potentially slower convergence.
• Update rule (for a weight w): $w_{\text{new}} = w_{\text{old}} - \eta \nabla L(w; \text{all data})$
• Mini-Batch Gradient Descent:
• Uses a small random subset (a “mini-batch”) of the training data to compute the gradient in each
iteration.
• More computationally efficient, especially for large datasets.
• Introduces some noise in the gradient estimation, which can help escape local minima and
potentially lead to faster convergence.
• Update rule (for a weight w): $w_{\text{new}} = w_{\text{old}} - \eta \nabla L(w; \text{mini-batch})$

Stochastic Gradient Descent (SGD): A special case of mini-batch gradient descent where
the mini-batch size is 1 (i.e., each iteration uses only one training example).
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 57 / 77
Python Code

Students will work with a simple neural network implemented from scratch (without using
high-level APIs like TensorFlow or PyTorch for training). The code provided demonstrates key
steps of training a neural network, including forward propagation, loss computation,
backpropagation, and weight updates. The task involves analyzing, modifying, and extending
the code to deepen understanding of neural network mechanics.

Learning Objectives
• Understand the role of forward and backward propagation in neural networks.
• Learn how to compute gradients and update weights manually.
• Explore the effect of hyperparameters such as learning rate and number of hidden units.
• Analyze the training process by monitoring loss and accuracy.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 58 / 77


Python Code - Batch GD

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# load dataset
print("Load MNIST Database")
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = np.reshape(x_train, (60000, 784)) / 255.0
x_test = np.reshape(x_test, (10000, 784)) / 255.0
y_train = np.matrix(np.eye(10)[y_train])
y_test = np.matrix(np.eye(10)[y_test])
print("----------------------------------")
print(x_train.shape)
print(y_train.shape)

def sigmoid(x):
    return 1. / (1. + np.exp(-x))

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 59 / 77


Python Code - Batch GD
def softmax(x):
    return np.divide(np.matrix(np.exp(x)), np.mat(np.sum(np.exp(x), axis=1)))

def Forwardpass(X, Wh, bh, Wo, bo):
    zh = X @ Wh.T + bh
    a = sigmoid(zh)
    z = a @ Wo.T + bo
    o = softmax(z)
    return o

def AccTest(label, prediction):  # calculate the matching score
    OutMaxArg = np.argmax(prediction, axis=1)
    LabelMaxArg = np.argmax(label, axis=1)
    Accuracy = np.mean(OutMaxArg == LabelMaxArg)
    return Accuracy

learningRate = 0.5
Epoch = 50
NumTrainSamples = 60000
NumTestSamples = 10000

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 60 / 77


Python Code - Batch GD

NumInputs = 784
NumHiddenUnits = 512
NumClasses = 10
# initial weights
# hidden layer
Wh = np.matrix(np.random.uniform(-0.5, 0.5, (NumHiddenUnits, NumInputs)))
bh = np.random.uniform(0, 0.5, (1, NumHiddenUnits))
dWh = np.zeros((NumHiddenUnits, NumInputs))
dbh = np.zeros((1, NumHiddenUnits))
# Output layer
Wo = np.random.uniform(-0.5, 0.5, (NumClasses, NumHiddenUnits))
bo = np.random.uniform(0, 0.5, (1, NumClasses))
dWo = np.zeros((NumClasses, NumHiddenUnits))
dbo = np.zeros((1, NumClasses))

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 61 / 77


Python Code - Batch GD

from IPython.display import clear_output

loss = []
Acc = []
for ep in range(Epoch):
    # feed forward propagation
    x = x_train
    y = y_train
    zh = x @ Wh.T + bh
    a = sigmoid(zh)
    z = a @ Wo.T + bo
    o = softmax(z)
    loss.append(-np.sum(np.multiply(y, np.log10(o))))
    d = o - y
    dh = d @ Wo
    dhs = np.multiply(np.multiply(dh, a), (1 - a))

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 62 / 77


Python Code - Batch GD
for ep in range(Epoch):
    ...
    # update weights
    dWo = np.matmul(np.transpose(d), a)
    dbo = np.mean(d)    # treat the bias input as 1
    dWh = np.matmul(np.transpose(dhs), x)
    dbh = np.mean(dhs)  # treat the bias input as 1
    Wo = Wo - learningRate * dWo / NumTrainSamples
    bo = bo - learningRate * dbo
    Wh = Wh - learningRate * dWh / NumTrainSamples
    bh = bh - learningRate * dbh
    # Test accuracy on the test set after this epoch's update
    prediction = Forwardpass(x_test, Wh, bh, Wo, bo)
    Acc.append(AccTest(y_test, prediction))
    clear_output(wait=True)
    plt.plot([i for i, _ in enumerate(Acc)], Acc, 'o')
    plt.show()
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 63 / 77
Python Code - Mini-Batch GD
from IPython.display import clear_output

loss = []
Acc = []
Batch_size = 200
Stochastic_samples = np.arange(NumTrainSamples)
for ep in range(Epoch):
    np.random.shuffle(Stochastic_samples)
    for ite in range(0, NumTrainSamples, Batch_size):
        Batch_samples = Stochastic_samples[ite:ite + Batch_size]
        x = x_train[Batch_samples, :]
        y = y_train[Batch_samples, :]
        zh = x @ Wh.T + bh
        a = sigmoid(zh)
        z = a @ Wo.T + bo
        o = softmax(z)
        loss.append(-np.sum(np.multiply(y, np.log10(o))))
        d = o - y
        dh = d @ Wo
        dhs = np.multiply(np.multiply(dh, a), (1 - a))
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 64 / 77
Python Code - Mini-Batch GD
for ep in range(Epoch):
    np.random.shuffle(Stochastic_samples)
    for ite in range(0, NumTrainSamples, Batch_size):
        ...
        # update weights
        dWo = np.matmul(np.transpose(d), a)
        dbo = np.mean(d)    # treat the bias input as 1
        dWh = np.matmul(np.transpose(dhs), x)
        dbh = np.mean(dhs)  # treat the bias input as 1
        Wo = Wo - learningRate * dWo / Batch_size
        bo = bo - learningRate * dbo
        Wh = Wh - learningRate * dWh / Batch_size
        bh = bh - learningRate * dbh
    # Test accuracy on the test set after this epoch's update
    prediction = Forwardpass(x_test, Wh, bh, Wo, bo)
    Acc.append(AccTest(y_test, prediction))
    clear_output(wait=True)
    plt.plot([i for i, _ in enumerate(Acc)], Acc, 'o')
    plt.show()
    print('Epoch:', ep)
    print('Accuracy:', AccTest(y_test, prediction))
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 65 / 77
Batch vs Mini-Batch GD

Training progress: batch vs. mini-batch gradient descent strategies.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 66 / 77


Assignment: Handwritten Digit Classification with Neural
Networks
• Objective: Develop a Python program to classify handwritten digits from the MNIST
dataset.
• Model: A four-layer neural network (Multi-Layer Perceptron):
• Input Layer (784 neurons)
• Hidden Layer 1 (ReLU activation)
• Hidden Layer 2 (Sigmoid activation)
• Output Layer (Softmax activation)
• Training: Implement the following:
• Feedforward propagation
• Backpropagation
• Mini-batch gradient descent (batch size: 1000)
• Submission: Submit your Python code on Google Classroom by the deadline.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 67 / 77


Local Minima
• Challenge: The loss function in deep learning is often non-convex, meaning it has multiple
minima.
• Local Minima: Points where the loss is lower than in the surrounding region, but not the
absolute lowest (global minimum).
• Impact: Gradient descent can get stuck in local minima, preventing the model from achieving
optimal performance.
• Mitigation: Careful initialization, using larger batch sizes, and advanced optimization algorithms
(like momentum) can help escape local minima.

A function with local minima


Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 68 / 77
Vanishing Gradients
• Challenge: During backpropagation, gradients are multiplied as they are passed through layers.
• Vanishing Gradients: In deep networks, these repeated multiplications can cause gradients to
become extremely small, effectively stopping the training of earlier layers.
• Cause: Primarily due to activation functions with derivatives close to zero in certain regions (e.g.,
sigmoid saturation).
• Mitigation: Using ReLU or its variants (Leaky ReLU, ELU), careful initialization, and batch
normalization can help.

Illustrating vanishing gradients


Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 69 / 77
Exploding Gradients
• Challenge: Opposite of vanishing gradients.
• Exploding Gradients: Gradients become extremely large during backpropagation, causing
unstable training.
• Cause: Large weights, especially in recurrent neural networks (RNNs).
• Mitigation: Gradient clipping (limiting the magnitude of gradients) and weight regularization can
help.

Comparing vanishing gradient and exploding gradient problems
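As a hedged illustration of gradient clipping by global norm (one common variant; the threshold here is arbitrary):

import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    # Rescale the gradient if its L2 norm exceeds max_norm; otherwise leave it unchanged.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])          # an "exploding" gradient with norm 50
print(clip_by_norm(g))               # rescaled to norm 5: [ 3. -4.]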


Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 70 / 77
Overfitting vs. Underfitting
• Underfitting: Model is too simple to capture the underlying patterns in the data. High bias.
• Overfitting: Model learns the training data too well, including noise, and generalizes poorly to
unseen data. High variance.
• Mitigation:
• Underfitting: Increase model complexity, train for longer.
• Overfitting: Use more data, regularization (L1, L2), dropout, early stopping.

Overfitting and underfitting


Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 71 / 77
Gradient Descent (Review)

• Basic Idea: Iteratively adjust weights to minimize the loss function.


• Update Rule:
$w_{t+1} = w_t - \eta \nabla L(w_t)$
where:
• $w_t$: Weights at time step t.
• $\eta$: Learning rate (controls step size).
• $\nabla L(w_t)$: Gradient of the loss function at $w_t$ (direction of steepest ascent).
• Limitation: Can be slow, especially with complex loss landscapes or large datasets.

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 72 / 77


Momentum

• Idea: Add a “velocity” term to accelerate learning in the right direction and dampen
oscillations.
• Update Rules:

$v_{t+1} = \mu v_t + \eta \nabla L(w_t)$
$w_{t+1} = w_t - v_{t+1}$

where:
• $v_t$: Velocity at time step t (accumulated gradient).
• $\mu$: Momentum term (typically 0.9). Controls the contribution of previous gradients.
• Benefit: Helps overcome small local minima and speeds up convergence.
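A minimal sketch of the momentum update applied to the earlier toy loss Loss(w) = w² − 4w + 5 (hyperparameters are illustrative):

def grad(w):
    return 2 * w - 4              # dLoss/dw for Loss(w) = w^2 - 4w + 5

w, v = 0.0, 0.0
eta, mu = 0.1, 0.9
for _ in range(200):
    v = mu * v + eta * grad(w)    # accumulate velocity from past gradients
    w = w - v                     # step using the velocity rather than the raw gradient
print(round(w, 3))                # close to the minimum at w = 2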

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 73 / 77


Adagrad (Adaptive Gradient Algorithm)
• Idea: Adapt the learning rate for each parameter individually based on the historical sum
of squared gradients.
• Update Rules:

$G_{t+1} = G_t + (\nabla L(w_t))^2$
$w_{t+1} = w_t - \frac{\eta}{\sqrt{G_{t+1} + \epsilon}} \nabla L(w_t)$

where:
• $G_t$: Sum of squared gradients up to time step t.
• $\epsilon$: Small constant (e.g., $10^{-8}$) for numerical stability (prevents division by zero).
• Benefit: Well-suited for sparse data. Adapts learning rates for different features.
• Limitation: Accumulating gradients can cause the learning rate to become very small,
stopping learning.
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 74 / 77
RMSProp (Root Mean Square Propagation)

• Idea: Improves Adagrad by using an exponentially decaying average of squared gradients.


• Update Rules:

$S_{t+1} = \rho S_t + (1 - \rho)(\nabla L(w_t))^2$
$w_{t+1} = w_t - \frac{\eta}{\sqrt{S_{t+1} + \epsilon}} \nabla L(w_t)$

where:
• St : Exponential moving average of squared gradients.
• ρ: Decay rate (typically 0.9). Controls the weighting of recent vs. past gradients.
• ϵ: Small constant for numerical stability.
• Benefit: Addresses Adagrad’s diminishing learning rate problem.
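A similar sketch of the RMSProp update on the same toy loss (hyperparameters are illustrative):

def grad(w):
    return 2 * w - 4                        # dLoss/dw for Loss(w) = w^2 - 4w + 5

w, S = 0.0, 0.0
eta, rho, eps = 0.01, 0.9, 1e-8
for _ in range(2000):
    g = grad(w)
    S = rho * S + (1 - rho) * g ** 2        # decaying average of squared gradients
    w = w - eta / (S + eps) ** 0.5 * g      # adaptive step size per parameter
print(round(w, 3))                          # close to the minimum at w = 2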

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 75 / 77


Adam (Adaptive Moment Estimation)
• Idea: Combines momentum and RMSProp with gradients’ first and second moments.
• Update Rules:
$m_{t+1} = \beta_1 m_t + (1 - \beta_1) \nabla L(w_t)$
$v_{t+1} = \beta_2 v_t + (1 - \beta_2)(\nabla L(w_t))^2$
$\hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}}$ (bias correction)
$\hat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}}$ (bias correction)
$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_{t+1}} + \epsilon} \hat{m}_{t+1}$
where:
• $m_t$: Exponential moving average of gradients (first moment).
• $v_t$: Exponential moving average of squared gradients (second moment).
• $\beta_1$, $\beta_2$: Exponential decay rates (typically 0.9 and 0.999, respectively).
• $\epsilon$: Small constant for numerical stability.
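A hedged sketch of the Adam update rules above on the same toy loss, using the typical default values mentioned on the slide:

def grad(w):
    return 2 * w - 4                           # dLoss/dw for Loss(w) = w^2 - 4w + 5

w, m, v = 0.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g            # first moment (moving average of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment (moving average of squared gradients)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)               # bias correction
    w = w - eta * m_hat / (v_hat ** 0.5 + eps)
print(round(w, 3))                             # close to the minimum at w = 2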
Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 76 / 77
Dropout Technique
• Dropout: Randomly “drops out” (sets to zero) some neurons during training.
• Effect: Prevents complex co-adaptations between neurons, reducing overfitting.
• During Testing: Neurons are not dropped out, but their activations are scaled by the dropout
rate.

Illustrating dropout scheme.
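A small NumPy sketch of dropout applied to a layer's activations. This uses the inverted-dropout variant, which scales the surviving activations during training so that nothing needs to be rescaled at test time (a common alternative to the test-time scaling described above):

import numpy as np

def dropout(a, rate=0.5, training=True):
    # Randomly zero out a fraction `rate` of activations during training.
    # Inverted dropout: surviving activations are scaled by 1/(1-rate),
    # so the layer can be used unchanged at test time.
    if not training or rate == 0.0:
        return a
    mask = (np.random.rand(*a.shape) >= rate).astype(a.dtype)
    return a * mask / (1.0 - rate)

a = np.ones((2, 8))
print(dropout(a, rate=0.5))                   # roughly half the entries are 0, the rest are 2.0
print(dropout(a, rate=0.5, training=False))   # unchanged at test time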

Thien Huynh-The - HCMUTE Neural Networks February 10, 2025 77 / 77
