Lecture04-NeuralNetwork
Thien Huynh-The
Department of Computer and Communications Engineering
HCMC University of Technology and Education
Figure: examples of neural network architectures, e.g., CNN and RNN.
Applications and Benefits
Applications:
• Image Recognition: Object detection, image classification.
• Natural Language Processing: Machine translation, sentiment analysis.
• Healthcare: Disease diagnosis, drug discovery.
• Finance: Fraud detection, stock market prediction.
Benefits:
• High accuracy.
• Ability to learn complex patterns.
• Adaptability to different data types.
• The output of a single neuron is computed as:
$$y = f\left(\sum_{i} w_i x_i + b\right)$$
where:
• $x_i$: Input values
• $w_i$: Weights
• $b$: Bias
• $f$: Activation function
• For a fully connected layer, the output of the j-th neuron in the current layer is:
$$h_j = f\left(\sum_{i} w_{ij} x_i + b_j\right)$$
where:
• $x_i$: Input from the i-th neuron in the previous layer.
• $w_{ij}$: Weight connecting the i-th neuron in the previous layer to the j-th neuron in the current layer.
• $b_j$: Bias for the j-th neuron.
• $f$: Activation function (e.g., ReLU, sigmoid).
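To make this concrete, here is a minimal numpy sketch (illustrative only, not part of the lecture code; the shapes of W and b are assumptions) that computes the outputs of one fully connected layer:
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def layer_forward(x, W, b, f=relu):
    # x: (n_in,) inputs, W: (n_in, n_out) weights, b: (n_out,) biases
    return f(x @ W + b)        # h_j = f(sum_i w_ij * x_i + b_j)

x = np.array([0.5, -1.0, 2.0])
W = np.random.uniform(-0.5, 0.5, (3, 4))
b = np.zeros(4)
h = layer_forward(x, W, b)     # outputs of the 4 neurons in this layer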
• Mean Squared Error (MSE): Used for regression.
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ are the true values and $\hat{y}_i$ are the predictions.
• Cross-Entropy Loss: Used for classification.
$$\text{Cross-Entropy} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic}\log\left(\hat{y}_{ic}\right)$$
where $y_{ic}$ is 1 if the i-th sample belongs to class c and 0 otherwise, and $\hat{y}_{ic}$ is the predicted probability of the i-th sample belonging to class c.
Backpropagation: Calculating Gradients
• Backpropagation calculates the gradients of the loss function with respect to each weight
and bias in the network.
• These gradients indicate how much each weight contributes to the error.
• The chain rule of calculus is used to efficiently calculate these gradients through the
network’s layers.
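As an illustrative sketch (not from the lecture code), the chain rule applied to a single sigmoid neuron with squared-error loss gives the gradient below; a finite-difference estimate is included only to verify the analytic value:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, w, b, y = 0.7, 0.3, 0.1, 1.0        # single input, weight, bias, and target
o = sigmoid(w * x + b)                 # neuron output
loss = (o - y) ** 2

# Chain rule: dL/dw = dL/do * do/dz * dz/dw
grad_w = 2 * (o - y) * o * (1 - o) * x

# Finite-difference check of the same gradient
eps = 1e-6
loss_plus = (sigmoid((w + eps) * x + b) - y) ** 2
print(grad_w, (loss_plus - loss) / eps)   # the two values should be very close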
Figure: backpropagation of the error through the network's layers.
Weight Update: Gradient Descent
• Gradient descent is used to update the weights and biases based on the calculated
gradients.
• The weights are adjusted in the direction that reduces the loss.
• The update rule is:
$$w_{ij} = w_{ij} - \eta \frac{\partial \text{Loss}}{\partial w_{ij}}$$
where $\eta$ is the learning rate, which controls the step size of the update.
$$w_{\text{new}} = w_{\text{old}} - \eta \frac{\partial \text{Loss}}{\partial w}$$
where:
• η (eta) is the learning rate, a small positive value that controls the step size.
• $\frac{\partial \text{Loss}}{\partial w}$ is the partial derivative of the loss function with respect to $w$, which represents the gradient.
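A minimal sketch of this update rule (illustrative only, not lecture code) applied to the toy loss L(w) = (w - 3)^2, whose gradient is 2(w - 3):
def grad(w):
    return 2.0 * (w - 3.0)      # dLoss/dw for the toy loss

eta = 0.1                       # learning rate
w = 0.0                         # initial weight
for step in range(50):
    w = w - eta * grad(w)       # w_new = w_old - eta * dLoss/dw
print(w)                        # approaches the minimizer w = 3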
• The learning rate η is a crucial hyperparameter. It controls the step size in gradient
descent.
• Too small: Convergence is very slow, requiring many iterations to reach the minimum.
• Too large: The algorithm might overshoot the minimum and oscillate or even diverge,
never finding a stable solution.
• Optimal learning rate: Finding the optimal learning rate often requires experimentation.
Techniques like learning rate scheduling (adjusting the learning rate during training) can
be helpful.
• Learning Rate Scheduling: Adjusting the learning rate during training. Common
approaches:
• Step Decay: Reduce the learning rate by a fixed factor after a certain number of epochs.
• Exponential Decay: Reduce the learning rate exponentially over time.
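The two schedules can be written as simple functions of the epoch index (a minimal sketch; eta0, drop_factor, drop_every, and decay_rate are assumed hyperparameter names, not from the lecture code):
import math

def step_decay(eta0, epoch, drop_factor=0.5, drop_every=10):
    # Reduce the learning rate by a fixed factor after every 'drop_every' epochs
    return eta0 * (drop_factor ** (epoch // drop_every))

def exponential_decay(eta0, epoch, decay_rate=0.05):
    # Reduce the learning rate exponentially over time
    return eta0 * math.exp(-decay_rate * epoch)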
$$L = \frac{1}{10}\sum_{k=1}^{10}\left(o_1^k - z_1^k\right)^2$$
where $o_1^k$ is the desired output (ground truth) for the k-th data point, and $z_1^k$ is the network's output for the k-th data point.
$$w_{j,i} \leftarrow w_{j,i} - \eta \frac{\partial L}{\partial w_{j,i}}$$
$$\frac{\partial L}{\partial w_{j,i}} = \frac{\partial L}{\partial z_1}\,\frac{\partial z_1}{\partial y_1}\,\frac{\partial y_1}{\partial w_{j,i}} = \frac{2}{10}\sum_{k=1}^{10}\left(z_1^k - o_1^k\right) z_1^k\left(1 - z_1^k\right) x_i^k$$
• This process is repeated for all weights in the network for each training example (or batch
of examples) until the loss converges to a minimum.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Initialize weights (assumes x_train, y_train, x_test, y_test are already loaded
# as flattened 784-dimensional inputs and one-hot 10-dimensional targets)
W = np.random.uniform(-0.1, 0.1, (784, 10))

# Training process
n = 0.05          # Learning rate
num_epoch = 10    # Number of epochs
batch_size = 32
losses = []
train_accuracies = []
test_accuracies = []

for epoch in range(num_epoch):
    for start in range(0, x_train.shape[0], batch_size):
        x_batch = x_train[start:start + batch_size]
        y_batch = y_train[start:start + batch_size]
        o = sigmoid(np.matmul(x_batch, W))
        # Gradient step on the MSE loss (sigmoid derivative included)
        dW = (x_batch.T @ ((y_batch - o) * o * (1 - o))) / batch_size
        W = W + n * dW
    # Calculate and store loss and accuracy for the full training and test sets
    train_o = sigmoid(np.matmul(x_train, W))
    loss = np.mean(np.power(train_o - y_train, 2))
    losses.append(loss)
    train_accuracy = np.mean(np.argmax(train_o, axis=1) == np.argmax(y_train, axis=1))
    train_accuracies.append(train_accuracy)
    test_o = sigmoid(np.matmul(x_test, W))
    test_accuracy = np.mean(np.argmax(test_o, axis=1) == np.argmax(y_test, axis=1))
    test_accuracies.append(test_accuracy)
    print(f"Epoch {epoch+1}/{num_epoch}, Loss: {loss}, Train Accuracy: {train_accuracy}, Test Accuracy: {test_accuracy}")
import matplotlib.pyplot as plt

epochs = range(1, num_epoch + 1)
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.plot(epochs, losses)
plt.title("Loss vs. Epoch")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.subplot(1, 3, 2)
plt.plot(epochs, train_accuracies, label='Training Accuracy')
plt.title("Training Accuracy vs. Epoch")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.subplot(1, 3, 3)
plt.plot(epochs, test_accuracies, label='Test Accuracy', color='orange')
plt.title("Test Accuracy vs. Epoch")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.tight_layout()
plt.show()
• Loss Functions: Used to train binary classifiers by measuring the difference between
predicted and actual labels.
• Common Loss Functions (see the sketch after this list):
• Binary Cross-Entropy (Log Loss)
• Hinge Loss (used in SVMs)
• Examples:
• Medical diagnosis (minimizing false negatives).
• Fraud detection (minimizing false positives).
• Sentiment analysis (maximizing accuracy).
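As a concrete reference, here is a minimal numpy sketch (not part of the lecture code) of the two loss functions listed above; y holds the labels, p the predicted probabilities, and s the raw classifier scores:
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # y in {0, 1}, p = predicted probability of class 1
    p = np.clip(p, eps, 1.0 - eps)              # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def hinge_loss(y, s):
    # y in {-1, +1}, s = raw classifier score (as used in SVMs)
    return np.mean(np.maximum(0.0, 1.0 - y * s))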
• Two Outcomes: The goal is to classify data into one of two distinct categories.
• Probabilities and Thresholds: Classifiers often output a probability P(y = 1|x), representing the likelihood of belonging to class 1. A threshold (e.g., 0.5) determines the final classification (see the thresholding sketch after this list).
• Decision Boundary: A line or surface that separates the two classes in the feature space.
• Examples:
• Medical diagnosis (disease present/absent).
• Fraud detection (fraudulent/not fraudulent).
• Sentiment analysis (positive/negative).
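The probability-and-threshold step referenced above can be sketched in a few lines of numpy (illustrative values only):
import numpy as np

probs = np.array([0.92, 0.35, 0.51, 0.08])    # predicted P(y = 1 | x) for four samples
threshold = 0.5
labels = (probs >= threshold).astype(int)     # class 1 if the probability exceeds the threshold
print(labels)                                 # [1 0 1 0]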
• Definition: A classification task where the goal is to assign input data to one of more
than two predefined categories or classes.
• Output Representation:
• Often represented using one-hot encoding: a vector where each element corresponds to a class, and
only one element is 1 (representing the predicted class), while the others are 0.
• Example: For 3 classes (cat, dog, bird), a prediction of “dog” could be represented as [0, 1, 0] (see the sketch after this list).
• Challenges:
• Increased complexity compared to binary classification.
• Need for specialized algorithms and techniques (e.g., softmax regression, one-vs-rest, one-vs-one).
• Examples:
• Image classification (cat, dog, bird, fish, etc.).
• Handwritten digit recognition (0-9).
• News categorization (sports, politics, technology, business, etc.).
• Object recognition in images (car, pedestrian, traffic light, building, etc.).
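The one-hot representation described above can be produced with a short numpy sketch (illustrative, not lecture code):
import numpy as np

classes = ["cat", "dog", "bird"]
label = 1                          # index of "dog"
one_hot = np.eye(len(classes))[label]
print(one_hot)                     # [0. 1. 0.]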
• One-vs-Rest (OvR) / One-vs-All (OvA): Train one binary classifier for each class,
treating that class as “positive” and all other classes as “negative.” For n classes, you
need n classifiers (see the OvR sketch after this list).
• One-vs-One (OvO): Train a binary classifier for each pair of classes. For n classes, you train $\frac{n(n-1)}{2}$ classifiers.
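A minimal sketch of the OvR strategy referenced above (illustrative; train_binary_classifier is a hypothetical helper that returns a scoring function for "class c vs. the rest" and is not part of the lecture code):
import numpy as np

def ovr_train(x_train, y_train, num_classes, train_binary_classifier):
    classifiers = []
    for c in range(num_classes):
        binary_labels = (y_train == c).astype(int)        # class c -> 1, all other classes -> 0
        classifiers.append(train_binary_classifier(x_train, binary_labels))
    return classifiers

def ovr_predict(x, classifiers):
    scores = np.array([clf(x) for clf in classifiers])    # one score per class
    return int(np.argmax(scores))                         # pick the most confident classifier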
Logistic regression
Figure: architecture of the three-layer neural network (inputs $x_1, x_2$; hidden units $h_1, h_2, h_3$; outputs $y'_1, y'_2$).
• Hidden layer activations: $h_j = \sigma\!\left(\sum_{i} w_{ji}^{(1)} x_i + b_j^{(1)}\right)$, where $\sigma(z) = \frac{1}{1+e^{-z}}$ (sigmoid).
• Output layer activations (softmax):
$$y'_k = \frac{e^{\sum_{j=0}^{3} w_{kj}^{(2)} h_j + b_k^{(2)}}}{\sum_{l=0}^{2} e^{\sum_{j=0}^{3} w_{lj}^{(2)} h_j + b_l^{(2)}}}, \qquad k = 0, 1, 2$$
• Cross-entropy loss:
$$L = -\sum_{k=0}^{2} y_k \log(y'_k)$$
where $y_k$ is the true label (one-hot encoded) and $y'_k$ denotes the softmax output (predicted probability for class k).
• Softmax Function:
$$y'_k = \frac{e^{z_k}}{\sum_{l=0}^{2} e^{z_l}}$$
where $z_k = \sum_{j=0}^{3} w_{kj}^{(2)} h_j + b_k^{(2)}$ (weighted sum of inputs to output neuron k).
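These two formulas translate directly into numpy (a minimal sketch; the max-subtraction for numerical stability is an addition that is not in the slides):
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract the max for numerical stability
    return e / np.sum(e)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot vector, y_pred: softmax output
    return -np.sum(y_true * np.log(y_pred + eps))

z = np.array([1.2, -0.3, 0.5])         # weighted sums z_k for the three output neurons
y_prime = softmax(z)                   # predicted class probabilities y'_k
y = np.array([0.0, 1.0, 0.0])          # one-hot true label
L = cross_entropy(y, y_prime)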
• Derivative of the loss with respect to the softmax output:
$$\frac{\partial L}{\partial y'_k} = -\frac{y_k}{y'_k}$$
• Derivative of Softmax Output with respect to $z_k$:
• If $k = l$: $\frac{\partial y'_k}{\partial z_k} = y'_k(1 - y'_k)$
• If $k \neq l$: $\frac{\partial y'_k}{\partial z_l} = -y'_k y'_l$
• Combining these with the chain rule:
$$\frac{\partial L}{\partial z_k} = -y_k + y_k y'_k + y'_k(1 - y_k) = y'_k - y_k$$
• Derivative of $z_k$ with respect to the output-layer weights:
$$\frac{\partial z_k}{\partial w_{kj}^{(2)}} = h_j$$
$$\frac{\partial L}{\partial w_{kj}^{(2)}} = \frac{\partial L}{\partial y'_k} \cdot \frac{\partial y'_k}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_{kj}^{(2)}} = (y'_k - y_k)\, h_j$$
• This is the gradient we use to update the weights in the output layer during
backpropagation.
$$w_{kj}^{(2)} \leftarrow w_{kj}^{(2)} - \eta\,(y'_k - y_k)\, h_j$$
where $\eta$ is the learning rate.
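In numpy, this output-layer gradient is an outer product of the error and the hidden activations (a minimal sketch with illustrative values, not the lecture code):
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])         # one-hot label y_k
y_pred = np.array([0.2, 0.7, 0.1])         # softmax output y'_k
h = np.array([0.6, 0.1, 0.9, 0.4])         # hidden activations h_j
eta = 0.1

dWo = np.outer(y_pred - y_true, h)         # dL/dw_kj^(2) = (y'_k - y_k) * h_j
Wo = np.random.uniform(-0.5, 0.5, (3, 4))  # output-layer weights w_kj^(2)
Wo = Wo - eta * dWo                        # gradient-descent update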
Backpropagation: Hidden Layer Weights
$$\frac{\partial L}{\partial w_{ji}^{(1)}} = \left(\sum_{k=0}^{2} (y'_k - y_k)\, w_{kj}^{(2)}\right) h_j (1 - h_j)\, x_i$$
• We use the chain rule again, but this time it’s more complex because the hidden neurons
affect *all* output neurons:
$$\frac{\partial L}{\partial w_{ji}^{(1)}} = \frac{\partial L}{\partial h_j} \cdot \frac{\partial h_j}{\partial w_{ji}^{(1)}}$$
$$h_j = \sigma(z_j) = \sigma\!\left(\sum_{i=0}^{2} w_{ji}^{(1)} x_i + b_j^{(1)}\right)$$
• Therefore:
$$\frac{\partial h_j}{\partial w_{ji}^{(1)}} = \frac{\partial \sigma(z_j)}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ji}^{(1)}} = h_j (1 - h_j)\, x_i$$
• Here, $\frac{\partial z_k}{\partial h_j} = w_{kj}^{(2)}$, so $\frac{\partial L}{\partial h_j} = \sum_{k=0}^{2} (y'_k - y_k)\, w_{kj}^{(2)}$.
$$\frac{\partial L}{\partial w_{ji}^{(1)}} = \frac{\partial L}{\partial h_j} \cdot \frac{\partial h_j}{\partial w_{ji}^{(1)}} = \left(\sum_{k=0}^{2} (y'_k - y_k)\, w_{kj}^{(2)}\right) h_j (1 - h_j)\, x_i$$
• This is the gradient used to update the weights between the input and hidden layers.
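Continuing the small numpy sketch from the output layer above (illustrative values; the first input is treated as the bias term x_0 = 1):
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.2, 0.7, 0.1])
h = np.array([0.6, 0.1, 0.9, 0.4])          # hidden activations h_j
x = np.array([1.0, 0.5, -0.2])              # inputs x_i (x_0 = 1 acts as the bias input)
Wo = np.random.uniform(-0.5, 0.5, (3, 4))   # output-layer weights w_kj^(2)
eta = 0.1

dL_dh = (y_pred - y_true) @ Wo              # sum_k (y'_k - y_k) * w_kj^(2), one value per h_j
dWh = np.outer(dL_dh * h * (1 - h), x)      # dL/dw_ji^(1) = dL/dh_j * h_j(1 - h_j) * x_i
Wh = np.random.uniform(-0.5, 0.5, (4, 3))   # hidden-layer weights w_ji^(1)
Wh = Wh - eta * dWh                         # gradient-descent update for the hidden layer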
Stochastic Gradient Descent (SGD): A special case of mini-batch gradient descent where
the mini-batch size is 1 (i.e., each iteration uses only one training example).
Python Code
Students will work with a simple neural network implemented from scratch (without using
high-level APIs like TensorFlow or PyTorch for training). The code provided demonstrates key
steps of training a neural network, including forward propagation, loss computation,
backpropagation, and weight updates. The task involves analyzing, modifying, and extending
the code to deepen understanding of neural network mechanics.
Learning Objectives
• Understand the role of forward and backward propagation in neural networks.
• Learn how to compute gradients and update weights manually.
• Explore the effect of hyperparameters such as learning rate and number of hidden units.
• Analyze the training process by monitoring loss and accuracy.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Load MNIST dataset
print("Load MNIST Database")
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = np.reshape(x_train, (60000, 784)) / 255.0
x_test = np.reshape(x_test, (10000, 784)) / 255.0
y_train = np.matrix(np.eye(10)[y_train])   # one-hot encode the labels
y_test = np.matrix(np.eye(10)[y_test])
print("----------------------------------")
print(x_train.shape)
print(y_train.shape)
def sigmoid(x):
return 1./(1.+np.exp(-x))
NumInputs=784
NumHiddenUnits=512
NumClasses=10
# Initialize weights
# Hidden layer
Wh = np.matrix(np.random.uniform(-0.5, 0.5, (NumHiddenUnits, NumInputs)))
bh = np.random.uniform(0, 0.5, (1, NumHiddenUnits))
dWh = np.zeros((NumHiddenUnits, NumInputs))
dbh = np.zeros((1, NumHiddenUnits))
# Output layer
Wo = np.random.uniform(-0.5, 0.5, (NumClasses, NumHiddenUnits))
bo = np.random.uniform(0, 0.5, (1, NumClasses))
dWo = np.zeros((NumClasses, NumHiddenUnits))
dbo = np.zeros((1, NumClasses))
# Update weights
dWo = np.matmul(np.transpose(d), a)
dbo = np.mean(d)     # treat the bias input as 1
dWh = np.matmul(np.transpose(dhs), x)
dbh = np.mean(dhs)   # treat the bias input as 1
Wo = Wo - learningRate * dWo / NumTrainSamples
bo = bo - learningRate * dbo
Wh = Wh - learningRate * dWh / NumTrainSamples
bh = bh - learningRate * dbh
# Track test accuracy after each update
prediction = Forwardpass(x_test, Wh, bh, Wo, bo)
Acc.append(AccTest(y_test, prediction))
clear_output(wait=True)
plt.plot([i for i, _ in enumerate(Acc)], Acc, 'o')
plt.show()
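The excerpts above and below call softmax, Forwardpass, and AccTest, which are defined elsewhere in the lab code; a minimal sketch consistent with how they are used here (an assumption, not the original definitions) could be:
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))   # stabilized row-wise softmax
    return e / np.sum(e, axis=1, keepdims=True)

def Forwardpass(x, Wh, bh, Wo, bo):
    # Hidden layer (sigmoid, defined earlier) followed by an output layer (softmax)
    a = sigmoid(x @ Wh.T + bh)
    return softmax(a @ Wo.T + bo)

def AccTest(label, prediction):
    # Fraction of samples whose predicted class matches the one-hot label
    return np.mean(np.argmax(prediction, axis=1) == np.argmax(label, axis=1))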
Python Code - Mini-Batch GD
from IPython.display import clear_output
loss = []
Acc = []
Batch_size = 200
Stochastic_samples = np.arange(NumTrainSamples)
for ep in range(Epoch):
    np.random.shuffle(Stochastic_samples)
    for ite in range(0, NumTrainSamples, Batch_size):
        Batch_samples = Stochastic_samples[ite:ite + Batch_size]
        x = x_train[Batch_samples, :]
        y = y_train[Batch_samples, :]
        # Forward pass
        zh = x @ Wh.T + bh
        a = sigmoid(zh)
        z = a @ Wo.T + bo
        o = softmax(z)
        loss.append(-np.sum(np.multiply(y, np.log10(o))))
        # Backward pass: output error and hidden-layer delta
        d = o - y
        dh = d @ Wo
        dhs = np.multiply(np.multiply(dh, a), (1 - a))
        # Update weights (mini-batch averages)
        dWo = np.matmul(np.transpose(d), a)
        dbo = np.mean(d)     # treat the bias input as 1
        dWh = np.matmul(np.transpose(dhs), x)
        dbh = np.mean(dhs)   # treat the bias input as 1
        Wo = Wo - learningRate * dWo / Batch_size
        bo = bo - learningRate * dbo
        Wh = Wh - learningRate * dWh / Batch_size
        bh = bh - learningRate * dbh
        # Track test accuracy after each mini-batch update
        prediction = Forwardpass(x_test, Wh, bh, Wo, bo)
        Acc.append(AccTest(y_test, prediction))
        clear_output(wait=True)
        plt.plot([i for i, _ in enumerate(Acc)], Acc, 'o')
        plt.show()
    print('Epoch:', ep)
    print('Accuracy:', AccTest(y_test, prediction))
Batch vs Mini-Batch GD
Momentum
• Idea: Add a “velocity” term to accelerate learning in the right direction and dampen
oscillations.
• Update Rules:
$$v_t = \mu v_{t-1} - \eta \frac{\partial L}{\partial w}, \qquad w \leftarrow w + v_t$$
where:
• vt : Velocity at time step t (accumulated gradient).
• µ: Momentum term (typically 0.9). Controls the contribution of previous gradients.
• Benefit: Helps overcome small local minima and speeds up convergence.
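A minimal numpy sketch of the momentum update on a toy quadratic loss (illustrative only; the loss and its gradient are assumptions, not from the lecture):
import numpy as np

def grad(w, w_star):
    return w - w_star                # gradient of the toy loss 0.5 * ||w - w_star||^2

w_star = np.ones(10)                 # minimizer of the toy loss
w = np.zeros(10)                     # parameters
v = np.zeros_like(w)                 # velocity
mu, eta = 0.9, 0.01
for t in range(200):
    g = grad(w, w_star)
    v = mu * v - eta * g             # v_t = mu * v_{t-1} - eta * dL/dw
    w = w + v                        # w <- w + v_t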
AdaGrad (Adaptive Gradient Algorithm)
• Update Rules:
$$G_t = G_{t-1} + g_t^2, \qquad w \leftarrow w - \frac{\eta}{\sqrt{G_t + \epsilon}}\, g_t, \qquad g_t = \frac{\partial L}{\partial w}$$
where:
• Gt : Sum of squared gradients up to time step t.
• ϵ: Small constant (e.g., 10−8 ) for numerical stability (prevents division by zero).
• Benefit: Well-suited for sparse data. Adapts learning rates for different features.
• Limitation: Accumulating gradients can cause the learning rate to become very small,
stopping learning.
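A minimal numpy sketch of the AdaGrad update on the same kind of toy loss (illustrative only):
import numpy as np

def grad(w, w_star):
    return w - w_star                # gradient of the toy loss 0.5 * ||w - w_star||^2

w_star = np.ones(10)
w = np.zeros(10)
G = np.zeros_like(w)                 # accumulated squared gradients G_t
eta, eps = 0.1, 1e-8
for t in range(200):
    g = grad(w, w_star)
    G = G + g ** 2                   # G_t keeps growing, so the effective step size shrinks
    w = w - eta * g / np.sqrt(G + eps)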
RMSProp (Root Mean Square Propagation)
• Update Rules:
$$S_t = \rho S_{t-1} + (1-\rho)\, g_t^2, \qquad w \leftarrow w - \frac{\eta}{\sqrt{S_t + \epsilon}}\, g_t, \qquad g_t = \frac{\partial L}{\partial w}$$
where:
• St : Exponential moving average of squared gradients.
• ρ: Decay rate (typically 0.9). Controls the weighting of recent vs. past gradients.
• ϵ: Small constant for numerical stability.
• Benefit: Addresses Adagrad’s diminishing learning rate problem.
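A minimal numpy sketch of the RMSProp update on the same kind of toy loss (illustrative only); the decaying average S keeps the effective learning rate from shrinking toward zero:
import numpy as np

def grad(w, w_star):
    return w - w_star                      # gradient of the toy loss 0.5 * ||w - w_star||^2

w_star = np.ones(10)
w = np.zeros(10)
S = np.zeros_like(w)                       # exponential moving average of squared gradients
eta, rho, eps = 0.01, 0.9, 1e-8
for t in range(200):
    g = grad(w, w_star)
    S = rho * S + (1 - rho) * g ** 2       # decaying average instead of a growing sum
    w = w - eta * g / np.sqrt(S + eps)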