AI2025 Lecture06 Recording Slide
[Diagram] Input $\boldsymbol{x}$ (features) $\rightarrow$ Model $\rightarrow$ Output $\hat{y}$ (prediction), where $\hat{y} \approx y$
Training?
$D = \{(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(m)}, y^{(m)})\}$
Training means building a model so that it can predict the labels $y$ from the training data $D$.
Each row of data is called an observation or a tuple.
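As a concrete illustration (not from the slides), a tiny labeled training set might look like the following sketch; the feature values and labels are invented purely so the code runs.

import numpy as np

# A toy training set D = {(x^(i), y^(i))}_{i=1..m}: each observation pairs
# a feature vector x with its label y (values invented for illustration).
X = np.array([[0.5, 1.2],   # x^(1)
              [1.0, 0.3],   # x^(2)
              [0.2, 0.8]])  # x^(3)
y = np.array([1, 0, 1])     # y^(1), y^(2), y^(3)

m, n = X.shape              # m = 3 observations, n = 2 features each
print(m, n)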
Review: Classification vs Regression
Q1. Classification or Regression? (the model outputs a predicted rating on a numeric scale)
Q2. Classification or Regression? (the model outputs a class label such as "Cat")
[Diagram] Logistic regression as a computation graph:
Features $\boldsymbol{x}$ and parameters $\boldsymbol{w}, b$ $\rightarrow$ $z = \boldsymbol{w}^{T}\boldsymbol{x} + b$ $\rightarrow$ $\hat{y} = a = \sigma(z)$ $\rightarrow$ $L(\hat{y}, y)$
Compute the loss, then update the parameters to minimize the loss.
▪ Example
▪ $\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$, $\boldsymbol{w} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$
▪ $z = \boldsymbol{w}^{T}\boldsymbol{x} + b = w_1 x_1 + w_2 x_2 + b$, where $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
▪ $\hat{y} = a = \sigma(z) = \sigma(w_1 x_1 + w_2 x_2 + b)$
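A minimal sketch of this forward computation in NumPy; the weight, bias, and feature values below are arbitrary, chosen only to make the example runnable:

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values for w, b, and a two-feature input x
w = np.array([0.4, -0.7])
b = 0.1
x = np.array([1.0, 2.0])

z = w @ x + b          # z = w^T x + b = w1*x1 + w2*x2 + b
y_hat = sigmoid(z)     # y_hat = a = sigma(z), a probability in (0, 1)
print(z, y_hat)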
[Diagram] The logistic regression computation graph from above, with the parameters $\theta = (\boldsymbol{w}, b)$ trained by gradient descent.
Repeatedly update
$\theta^{*} = \theta - \eta \cdot \nabla_{\theta} J(\theta)$
[Figure: gradient descent steps on the loss surface, converging toward the optimum $\theta^{*}$]
Computing the Parameters with Gradient Descent
$\hat{y} = \sigma(\boldsymbol{w}^{\top}\boldsymbol{x} + b)$

$J(\boldsymbol{w}, b) = \dfrac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

$\theta^{*} = \theta - \eta \cdot \nabla_{\theta} J(\theta)$

$\theta = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \\ b \end{bmatrix}, \qquad \nabla_{\theta} J(\theta) = \begin{bmatrix} \partial J(\theta)/\partial w_1 \\ \partial J(\theta)/\partial w_2 \\ \vdots \\ \partial J(\theta)/\partial w_n \\ \partial J(\theta)/\partial b \end{bmatrix}$
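A minimal sketch of this training loop for logistic regression. The slides do not spell out $L$ here, so the binary cross-entropy loss (whose gradients are used below) and the hyperparameters eta and steps are assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, eta=0.1, steps=1000):
    """Gradient descent for logistic regression.
    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    Assumes binary cross-entropy loss, for which
    dJ/dw = (1/m) X^T (y_hat - y) and dJ/db = mean(y_hat - y)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(steps):
        y_hat = sigmoid(X @ w + b)    # forward pass for all m examples
        dw = X.T @ (y_hat - y) / m    # gradient of J w.r.t. w
        db = np.mean(y_hat - y)       # gradient of J w.r.t. b
        w -= eta * dw                 # theta <- theta - eta * grad
        b -= eta * db
    return w, b

# Example usage on a tiny made-up dataset:
X = np.array([[0.5, 1.2], [1.0, 0.3], [0.2, 0.8], [1.5, 1.1]])
y = np.array([1, 0, 1, 0])
w, b = gradient_descent(X, y)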
▪ In our previous assignment, we saw that logistic regression does not do a good job on XOR-type datasets.
$\hat{y} = \sigma(\boldsymbol{w}^{\top}\boldsymbol{x} + b)$ where $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

The decision boundary is the line $\boldsymbol{w}^{\top}\boldsymbol{x} + b = 0$, i.e., where $\sigma(\boldsymbol{w}^{\top}\boldsymbol{x} + b) = 0.5$.

Predict $y = 1$ when $P(y = 1 \mid \boldsymbol{x}) = \hat{y} = \sigma(\boldsymbol{w}^{\top}\boldsymbol{x} + b) > 0.5 \Leftrightarrow \boldsymbol{w}^{\top}\boldsymbol{x} + b > 0$, and $y = 0$ otherwise.

[Figure: a linear decision boundary in the $(x_1, x_2)$ plane separating the points labeled $y = 0$ from those labeled $y = 1$]
[Figure: the OR, AND, and XOR functions plotted on the $(x_1, x_2)$ plane]

 x1  x2 |  OR  AND  XOR
  0   0 |   0    0    0
  0   1 |   1    0    1
  1   0 |   1    0    1
  1   1 |   1    1    0

OR and AND are linearly separable, but no single line can separate the two classes of XOR.
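A small brute-force sanity check of that claim (not from the slides): search a grid of candidate $(w_1, w_2, b)$ values and see whether any linear threshold $\boldsymbol{w}^{\top}\boldsymbol{x} + b > 0$ reproduces each truth table. The grid range and resolution are arbitrary choices for illustration.

import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "or":  np.array([0, 1, 1, 1]),
    "and": np.array([0, 0, 0, 1]),
    "xor": np.array([0, 1, 1, 0]),
}

def linearly_separable(y, grid=np.linspace(-2, 2, 21)):
    # Try every (w1, w2, b) on a coarse grid and check whether the
    # linear rule w^T x + b > 0 reproduces the labels y exactly.
    for w1, w2, b in itertools.product(grid, repeat=3):
        pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(pred, y):
            return True
    return False

for name, y in targets.items():
    print(name, "separable:", linearly_separable(y))
# OR and AND have linear separators on this grid; XOR has none for any linear rule.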
[Neuron image source: https://en.wikipedia.org/wiki/Neuron]
What is a Neural Network?
[Diagram] A single neuron: inputs $x_1, x_2, x_3$ feed into one neuron (with bias $\boldsymbol{b}^{[1]}$) that produces the output $\hat{y}$.
What is a Neural Network?
[Diagram] A two-layer network: inputs $x_1, x_2, x_3$ feed into a 1st layer of neurons, whose outputs feed a 2nd layer that produces $\hat{y}$, compared against the label $y$. The 1st layer has parameters $\boldsymbol{W}^{[1]}, \boldsymbol{b}^{[1]}$ and the 2nd layer has parameters $\boldsymbol{w}^{[2]}, b^{[2]}$.
Sample Calculation
[Diagram] The two-layer network (1st layer, 2nd layer) with inputs $x_1, x_2, x_3$, output $\hat{y}$, and label $y$. The computation proceeds as:

$\boldsymbol{z}^{[1]} = \boldsymbol{W}^{[1]}\boldsymbol{x} + \boldsymbol{b}^{[1]} \quad \boldsymbol{a}^{[1]} = \sigma(\boldsymbol{z}^{[1]}) \quad z^{[2]} = \boldsymbol{w}^{[2]\top}\boldsymbol{a}^{[1]} + b^{[2]} \quad a^{[2]} = \sigma(z^{[2]}) \quad L(a^{[2]}, y)$
[Diagram] Input layer, hidden layer, output layer:
Input layer: $\boldsymbol{a}^{[0]} = \boldsymbol{x} = (x_1, x_2, x_3)$
Hidden layer: $\boldsymbol{a}^{[1]} = (a^{[1]}_1, a^{[1]}_2, a^{[1]}_3, a^{[1]}_4) \in \mathbb{R}^{4}$, with parameters $\boldsymbol{W}^{[1]}, \boldsymbol{b}^{[1]}$
Output layer: $\hat{y} = a^{[2]}$, with parameters $\boldsymbol{W}^{[2]}, \boldsymbol{b}^{[2]}$
It can do a more complex job than logistic regression!
Shallow NN vs. Deep NN (DNN)
[Diagram] A shallow NN with three inputs and four hidden neurons. Each hidden neuron $i$ computes its own pre-activation and activation (see the vectorized sketch after these equations):

$z^{[1]}_1 = \boldsymbol{w}^{[1]\top}_1 \boldsymbol{x} + b^{[1]}_1, \quad a^{[1]}_1 = \sigma(z^{[1]}_1)$
$z^{[1]}_2 = \boldsymbol{w}^{[1]\top}_2 \boldsymbol{x} + b^{[1]}_2, \quad a^{[1]}_2 = \sigma(z^{[1]}_2)$
$z^{[1]}_3 = \boldsymbol{w}^{[1]\top}_3 \boldsymbol{x} + b^{[1]}_3, \quad a^{[1]}_3 = \sigma(z^{[1]}_3)$
$z^{[1]}_4 = \boldsymbol{w}^{[1]\top}_4 \boldsymbol{x} + b^{[1]}_4, \quad a^{[1]}_4 = \sigma(z^{[1]}_4)$

where $\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$.
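A short check (not from the slides) that stacking the four per-neuron dot products $\boldsymbol{w}^{[1]\top}_i \boldsymbol{x} + b^{[1]}_i$ gives exactly the matrix form $\boldsymbol{W}^{[1]}\boldsymbol{x} + \boldsymbol{b}^{[1]}$ used next; the sizes (3 inputs, 4 hidden units) match this example and the parameter values are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # x in R^3
W1 = rng.normal(size=(4, 3))    # row i of W1 is w_i^[1]^T
b1 = rng.normal(size=4)         # b^[1] in R^4

# Per-neuron computation: z_i^[1] = w_i^[1]^T x + b_i^[1]
z_per_neuron = np.array([W1[i] @ x + b1[i] for i in range(4)])

# Vectorized computation: z^[1] = W^[1] x + b^[1]
z_vectorized = W1 @ x + b1

assert np.allclose(z_per_neuron, z_vectorized)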
Input: $\boldsymbol{x} \in \mathbb{R}^{n}$
Parameters: $\boldsymbol{W}^{[1]} \in \mathbb{R}^{h \times n}$, $\boldsymbol{b}^{[1]} \in \mathbb{R}^{h}$, $\boldsymbol{w}^{[2]} \in \mathbb{R}^{h}$, $b^{[2]} \in \mathbb{R}$

Forward path:
Hidden layer: $\boldsymbol{z}^{[1]} = \boldsymbol{W}^{[1]}\boldsymbol{x} + \boldsymbol{b}^{[1]}$, $\quad \boldsymbol{a}^{[1]} = \sigma(\boldsymbol{z}^{[1]})$
Output layer: $z^{[2]} = \boldsymbol{w}^{[2]\top}\boldsymbol{a}^{[1]} + b^{[2]}$, $\quad \hat{y} = a^{[2]} = \sigma(z^{[2]})$

In this example, $n = 3$ and $h = 4$ (the number of hidden neurons).
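A minimal NumPy sketch of this forward path for $n = 3$, $h = 4$; the parameters are randomly initialized here purely so the code runs:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, h = 3, 4                          # input size and hidden-layer width
rng = np.random.default_rng(1)
W1 = rng.normal(size=(h, n))         # W^[1] in R^{h x n}
b1 = rng.normal(size=h)              # b^[1] in R^h
w2 = rng.normal(size=h)              # w^[2] in R^h
b2 = rng.normal()                    # b^[2] in R

def forward(x):
    z1 = W1 @ x + b1                 # z^[1] = W^[1] x + b^[1]
    a1 = sigmoid(z1)                 # a^[1] = sigma(z^[1])   (hidden layer)
    z2 = w2 @ a1 + b2                # z^[2] = w^[2]^T a^[1] + b^[2]
    y_hat = sigmoid(z2)              # y_hat = a^[2] = sigma(z^[2])  (output layer)
    return y_hat

print(forward(np.array([1.0, 0.0, -1.0])))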
Input: $\boldsymbol{x} \in \mathbb{R}^{n}$
Parameters: $\boldsymbol{W}^{[1]} \in \mathbb{R}^{h \times n}$, $\boldsymbol{b}^{[1]} \in \mathbb{R}^{h}$, $\boldsymbol{w}^{[2]} \in \mathbb{R}^{h}$, $b^{[2]} \in \mathbb{R}$

Forward path (hidden units use activation $g$, the output unit uses $\sigma$):
$\boldsymbol{z}^{[1]} = \boldsymbol{W}^{[1]}\boldsymbol{x} + \boldsymbol{b}^{[1]}$
$\boldsymbol{a}^{[1]} = g(\boldsymbol{z}^{[1]})$ where $g(\cdot)$ is an activation function
$z^{[2]} = \boldsymbol{w}^{[2]\top}\boldsymbol{a}^{[1]} + b^{[2]}$
$\hat{y} = a^{[2]} = \sigma(z^{[2]})$
Why do we need a non-linear activation function? If $g$ is removed (or is linear), the two layers collapse into a single linear model:

$\boldsymbol{z}^{[2]} = \boldsymbol{W}^{[2]}(\boldsymbol{W}^{[1]}\boldsymbol{x} + \boldsymbol{b}^{[1]}) + \boldsymbol{b}^{[2]} = \boldsymbol{W}^{[2]}\boldsymbol{W}^{[1]}\boldsymbol{x} + (\boldsymbol{W}^{[2]}\boldsymbol{b}^{[1]} + \boldsymbol{b}^{[2]}) = \boldsymbol{W}'\boldsymbol{x} + \boldsymbol{b}'$

where $\boldsymbol{W}' = \boldsymbol{W}^{[2]}\boldsymbol{W}^{[1]}$ and $\boldsymbol{b}' = \boldsymbol{W}^{[2]}\boldsymbol{b}^{[1]} + \boldsymbol{b}^{[2]}$.
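A quick numerical check (not from the slides) that two linear layers with no activation in between are equivalent to one linear layer; the shapes and random values are arbitrary, chosen just for the demo:

import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two linear layers applied in sequence (no activation in between)
z2 = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2

assert np.allclose(z2, W_prime @ x + b_prime)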
Common activation functions: Sigmoid, tanh, ReLU, Leaky ReLU
[Figure: plots of each activation function]
Sigmoid
▪ $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
▪ Range: $(0, 1)$
▪ Derivative: $\dfrac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$
▪ $0 < \dfrac{d\sigma(x)}{dx} \le 0.25$
tanh
▪ $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1$
▪ Range: $(-1, 1)$
▪ Derivative: $\dfrac{d\tanh(x)}{dx} = 1 - \tanh^{2}(x)$
▪ $0 < \dfrac{d\tanh(x)}{dx} \le 1$
Leaky ReLU
▪ $f(x) = \max(ax, x)$ with $a \ll 1$ (e.g., $a = 0.01$)
▪ Range: $(-\infty, \infty)$
▪ Derivative: $\dfrac{df(x)}{dx} = \begin{cases} a & \text{if } x < 0 \\ 1 & \text{if } x > 0 \\ \text{undefined} & \text{if } x = 0 \end{cases}$
Summary of activation functions:
▪ Sigmoid: $a = \dfrac{1}{1 + e^{-x}}$
▪ tanh: $a = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
▪ ReLU: $a = \max(0, x)$. The most popular one; it suffers less from the vanishing-gradient problem and is very simple to implement in hardware.
▪ Leaky ReLU: $a = \max(ax, x)$. Sometimes better than ReLU at avoiding the dying-ReLU issue, but it takes more calculation than ReLU and is less popular.
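A compact NumPy sketch (not from the slides) of these four activations and their derivatives, matching the formulas above; the Leaky ReLU slope a = 0.01 follows the example value given earlier, and the convention used at x = 0 is an implementation choice:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)                  # lies in (0, 0.25]

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2          # lies in (0, 1]

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)      # undefined at 0; 0 used here by convention

def leaky_relu(x, a=0.01):
    return np.maximum(a * x, x)

def d_leaky_relu(x, a=0.01):
    return np.where(x > 0, 1.0, a)        # undefined at 0; a used here by convention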
[Diagram] The network with activation $g$ in the hidden layer:
$\boldsymbol{z}^{[1]} = \boldsymbol{W}^{[1]}\boldsymbol{x} + \boldsymbol{b}^{[1]}$
$\boldsymbol{a}^{[1]} = g(\boldsymbol{z}^{[1]})$ where $g(\cdot)$ is an activation function