AI2025 Lecture06 Recording Slide

The document discusses shallow neural networks and their comparison to logistic regression, highlighting the importance of model complexity in tasks like classification and regression. It reviews key concepts such as supervised learning, gradient descent, and the structure of neural networks, including layers and neurons. The presentation emphasizes that shallow neural networks can perform more complex tasks than logistic regression, particularly in cases like XOR problems.


AI System Semiconductor Design


Lecture 6: Shallow Neural Networks
Lecturer: Taewook Kang

Acknowledgments
Lecture material adapted from
Prof. Woowhan Jung, DSLAB, Hanyang Univ.
Andrew Ng, DeepLearning.AI



Review: Supervised Learning

Goal: generalize the input–output relationship

Input 𝒙 (features) → Model → Output ŷ (prediction), with ŷ ≈ 𝑦

Dataset: 𝐷 = {(𝒙^(1), 𝑦^(1)), (𝒙^(2), 𝑦^(2)), …, (𝒙^(m), 𝑦^(m))}
Training: building a model so that it can predict the labels using the training data.
Each row of data is called an observation or a tuple.

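To make the notation concrete, here is a minimal NumPy sketch (with made-up numbers of my own) of how a dataset of m observations with n features each is commonly stored:

```python
import numpy as np

# m = 4 observations, n = 2 features each; numbers are made up for illustration.
X = np.array([[1.0, 0.5],
              [0.2, 1.3],
              [2.1, 0.7],
              [0.9, 1.8]])        # shape (m, n): one row per observation
y = np.array([0, 1, 0, 1])        # shape (m,): one label per observation

print(X.shape, y.shape)           # (4, 2) (4,)
```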
Review: Classification vs Regression
Q1. Classification? Regression?
[Plot: predicted rating vs. life span (years), axis from 60 to 100]

Q2. Classification? Regression?
[Images labeled "Cat" and "Dog"]

                Classification               Regression
Output type     Categorical value (class)    Numeric value


Review: Logistic Regression (Classification Problem)

[Diagram] Input 𝒙 → Model (architecture + parameters; many vector/matrix operations) → Output ŷ = 𝑃(Cat) = 0.9, Label 𝑦 = 1
Compute the loss, then update the parameters to minimize the loss.


Review: Logistic Regression
▪ Output: ŷ = 𝜎(𝒘ᵀ𝒙 + 𝑏), where 𝜎(𝑧) = 1/(1 + e^(−𝑧))
  (sigmoid function → good for probability output)
▪ Loss: 𝐿(ŷ, 𝑦) = −𝑦 log ŷ − (1 − 𝑦) log(1 − ŷ)
▪ Cost: 𝐽(𝒘, 𝑏) = (1/𝑚) Σᵢ₌₁ᵐ 𝐿(ŷ^(i), 𝑦^(i))

[Diagram] Features 𝒙 and parameters 𝒘, 𝑏 → 𝑧 = 𝒘ᵀ𝒙 + 𝑏 → ŷ = 𝑎 = 𝜎(𝑧) → 𝐿(ŷ, 𝑦), with label 𝑦
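A minimal NumPy sketch of these three formulas (the variable names and toy numbers are mine, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b):
    """y_hat = sigma(w^T x + b) for every row x of X."""
    return sigmoid(X @ w + b)

def loss(y_hat, y):
    """L(y_hat, y) = -y log(y_hat) - (1 - y) log(1 - y_hat)."""
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def cost(X, y, w, b):
    """J(w, b) = average loss over the m training examples."""
    return loss(predict(X, w, b), y).mean()

# Tiny made-up example: m = 3 examples with n = 2 features.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
y = np.array([1, 0, 1])
w, b = np.zeros(2), 0.0
print(cost(X, y, w, b))   # ~0.693 = log 2, since all predictions start at 0.5
```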


Review: Logistic Regression

▪ Example
  ▪ 𝒙 = (𝑥₁, 𝑥₂)ᵀ, 𝒘 = (𝑤₁, 𝑤₂)ᵀ
  ▪ 𝑧 = 𝒘ᵀ𝒙 + 𝑏 = 𝑤₁𝑥₁ + 𝑤₂𝑥₂ + 𝑏
  ▪ ŷ = 𝑎 = 𝜎(𝑧) = 𝜎(𝑤₁𝑥₁ + 𝑤₂𝑥₂ + 𝑏), where 𝜎(𝑧) = 1/(1 + e^(−𝑧))

[Diagram] Features 𝒙 and parameters 𝒘, 𝑏 → 𝑧 = 𝒘ᵀ𝒙 + 𝑏 → ŷ = 𝑎 = 𝜎(𝑧) → 𝐿(ŷ, 𝑦)


Review: Gradient Descent
▪ Algorithm to minimize a cost function 𝐽(𝜃), where 𝜃 = {𝒘, 𝑏}
  ▪ 𝐽(𝜃): cost function
  ▪ 𝜃: model parameters
  ▪ 𝜂: learning rate

Repeatedly update until reaching the minimum 𝜃*:
𝜃 ← 𝜃 − 𝜂 ⋅ ∇𝜃 𝐽(𝜃)

[Figure: gradient descent steps down a cost curve toward 𝜃*; image by Hakky St]
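As a toy illustration (my own example, not from the slides), the same update rule applied to a simple one-parameter quadratic cost:

```python
def J(theta):
    """Toy cost: J(theta) = (theta - 3)^2, minimized at theta = 3."""
    return (theta - 3.0) ** 2

def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0        # initial parameter
eta = 0.1          # learning rate

for step in range(100):
    theta = theta - eta * grad_J(theta)   # theta <- theta - eta * grad J(theta)

print(theta, J(theta))   # theta close to 3, cost close to 0
```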
Computing the Parameters with Gradient Descent

ŷ = 𝜎(𝒘ᵀ𝒙 + 𝑏)
𝐽(𝒘, 𝑏) = (1/𝑚) Σᵢ₌₁ᵐ 𝐿(ŷ^(i), 𝑦^(i))

Update: 𝜃 ← 𝜃 − 𝜂 ⋅ ∇𝜃 𝐽(𝜃)

𝜃 = (𝑤₁, 𝑤₂, …, 𝑤ₙ, 𝑏)ᵀ,   ∇𝜃 𝐽(𝜃) = (∂𝐽/∂𝑤₁, ∂𝐽/∂𝑤₂, …, ∂𝐽/∂𝑤ₙ, ∂𝐽/∂𝑏)ᵀ
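For this cost the gradient works out to the standard closed form ∂𝐽/∂𝒘 = (1/𝑚) Σᵢ (ŷ^(i) − 𝑦^(i)) 𝒙^(i) and ∂𝐽/∂𝑏 = (1/𝑚) Σᵢ (ŷ^(i) − 𝑦^(i)) (not derived on these slides). A minimal NumPy sketch of the resulting update loop, with names of my own choosing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, eta=0.1, steps=1000):
    """Fit logistic regression parameters (w, b) by gradient descent."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        y_hat = sigmoid(X @ w + b)
        dw = X.T @ (y_hat - y) / m    # dJ/dw
        db = (y_hat - y).mean()       # dJ/db
        w -= eta * dw                 # w <- w - eta * dJ/dw
        b -= eta * db                 # b <- b - eta * dJ/db
    return w, b

# Tiny linearly separable toy data (made up for illustration): AND-like labels.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])
w, b = gradient_descent(X, y, eta=0.5, steps=5000)
print((sigmoid(X @ w + b) > 0.5).astype(int))   # should recover y
```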


XOR Operator: Logistic Regression?

▪ In our previous assignment, we found that logistic regression does not do a good job on an XOR-type dataset.



Decision Boundary of Logistic Regression

[Figure: data points in the (𝑥₁, 𝑥₂) plane, classes 𝑦 = 0 and 𝑦 = 1, separated by the line 𝒘ᵀ𝒙 + 𝑏 = 0, i.e. 𝜎(𝒘ᵀ𝒙 + 𝑏) = 0.5]

ŷ = 𝜎(𝒘ᵀ𝒙 + 𝑏), where 𝜎(𝑧) = 1/(1 + e^(−𝑧))

Predict 𝑦 = 1 when 𝑃(𝑦 = 1 | 𝒙) = ŷ = 𝜎(𝒘ᵀ𝒙 + 𝑏) > 0.5 ⇔ 𝒘ᵀ𝒙 + 𝑏 > 0,
so the decision boundary 𝒘ᵀ𝒙 + 𝑏 = 0 is a straight line.


Simple XOR problem: linearly separable?

OR, AND, and XOR on binary inputs 𝑥₁, 𝑥₂:

𝑥₁ 𝑥₂ | or  and  xor
 0  0 |  0   0    0
 0  1 |  1   0    1
 1  0 |  1   0    1
 1  1 |  1   1    0

[Plots in the (𝑥₁, 𝑥₂) plane] OR and AND are linearly separable; XOR is not.

Solution: make a more complicated model!
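To see this concretely, here is a minimal sketch (assuming scikit-learn is available) that fits plain logistic regression to the four XOR points; a linear decision boundary can classify at most three of the four points correctly, so training accuracy stays below 100%:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The four XOR observations: inputs (x1, x2) and labels y = x1 XOR x2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

clf = LogisticRegression()
clf.fit(X, y)

print("predictions:   ", clf.predict(X))     # cannot match y on all four points
print("train accuracy:", clf.score(X, y))    # at most 0.75 for any linear boundary
```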


NEURAL NETWORKS MODEL



Neuron



Why Neuron?

https://en.wikipedia.org/wiki/Neuron
What is a Neural Network?
[Diagram: inputs 𝑥₁, 𝑥₂, 𝑥₃ feed a single neuron, which outputs ŷ]


What is a Neural Network?
[Diagram: inputs 𝑥₁, 𝑥₂, 𝑥₃ → 1st layer of neurons → 2nd layer → ŷ]

1st layer: 𝒛[1] = 𝑾[1]𝒙 + 𝒃[1],  𝒂[1] = 𝜎(𝒛[1])   (parameters 𝑾[1], 𝒃[1])
What is a Neural Network?
[Diagram: inputs 𝑥₁, 𝑥₂, 𝑥₃ → 1st layer → 2nd layer → ŷ, compared against the label 𝑦]

1st layer: 𝒛[1] = 𝑾[1]𝒙 + 𝒃[1],  𝒂[1] = 𝜎(𝒛[1])   (parameters 𝑾[1], 𝒃[1])
2nd layer: 𝑧[2] = 𝒘[2]ᵀ𝒂[1] + 𝑏[2],  𝑎[2] = 𝜎(𝑧[2])   (parameters 𝒘[2], 𝑏[2])
Loss: 𝐿(𝑎[2], 𝑦)
Sample Calculation
[Diagram: the same 2-layer network, stepping through the forward computation]
𝒛[1] = 𝑾[1]𝒙 + 𝒃[1] → 𝒂[1] = 𝜎(𝒛[1]) → 𝑧[2] = 𝒘[2]ᵀ𝒂[1] + 𝑏[2] → 𝑎[2] = 𝜎(𝑧[2]) → 𝐿(𝑎[2], 𝑦)




Neural Network Representation
▪ 2-layer fully-connected neural network

[Diagram]
Input layer: 𝒂[0] = 𝒙 = (𝑥₁, 𝑥₂, 𝑥₃)
Hidden layer: 𝒂[1] = (𝑎₁[1], 𝑎₂[1], 𝑎₃[1], 𝑎₄[1]) ∈ ℝ⁴, with parameters 𝑾[1], 𝒃[1]; each node is a neuron
Output layer: 𝑎[2] = ŷ, with parameters 𝑾[2], 𝒃[2]

*In our lecture, we will keep using 𝒙 (rather than 𝒂[0]) for the input.


Neural Network Representation

[The same 2-layer fully-connected network as above]
→ It can do a more complex job than logistic regression!
Shallow NN vs. Deep NN (DNN)

▪ DNN: more than 1 hidden layer


▪ It can do more complicated tasks!

Source: https://www.go-rbcs.com/columns/deep-learning-to-the-rescue


Neural Network Representation
[Diagram: inputs 𝑥₁, 𝑥₂, 𝑥₃ feed each hidden neuron]

First hidden neuron:  𝑧₁[1] = 𝒘₁[1]ᵀ𝒙 + 𝑏₁[1],  𝑎₁[1] = 𝜎(𝑧₁[1])
Second hidden neuron: 𝑧₂[1] = 𝒘₂[1]ᵀ𝒙 + 𝑏₂[1],  𝑎₂[1] = 𝜎(𝑧₂[1])


Neural Network Representation
[Diagram: inputs 𝒙 = (𝑥₁, 𝑥₂, 𝑥₃), hidden activations 𝑎₁[1] … 𝑎₄[1], output 𝑎[2] = ŷ]

𝑧₁[1] = 𝒘₁[1]ᵀ𝒙 + 𝑏₁[1],  𝑎₁[1] = 𝜎(𝑧₁[1])
𝑧₂[1] = 𝒘₂[1]ᵀ𝒙 + 𝑏₂[1],  𝑎₂[1] = 𝜎(𝑧₂[1])
𝑧₃[1] = 𝒘₃[1]ᵀ𝒙 + 𝑏₃[1],  𝑎₃[1] = 𝜎(𝑧₃[1])
𝑧₄[1] = 𝒘₄[1]ᵀ𝒙 + 𝑏₄[1],  𝑎₄[1] = 𝜎(𝑧₄[1])

𝑧[2] = 𝒘[2]ᵀ𝒂[1] + 𝑏[2],  𝑎[2] = 𝜎(𝑧[2])


Neural Network with a Hidden Layer

[Diagram: inputs 𝑥₁, 𝑥₂, 𝑥₃ → hidden layer 𝑎₁[1] … 𝑎₄[1] → output 𝑎[2] = ŷ]

Input: 𝒙 ∈ ℝⁿ
Parameters: 𝑾[1] ∈ ℝ^(h×n), 𝒃[1] ∈ ℝ^h, 𝒘[2] ∈ ℝ^h, 𝑏[2] ∈ ℝ
(In this example, n = 3 and h = 4, the number of hidden neurons.)

Forward path:
Hidden layer:  𝒛[1] = 𝑾[1]𝒙 + 𝒃[1],  𝒂[1] = 𝜎(𝒛[1])
Output layer:  𝑧[2] = 𝒘[2]ᵀ𝒂[1] + 𝑏[2],  ŷ = 𝑎[2] = 𝜎(𝑧[2])
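A minimal NumPy sketch of this forward path, using the shapes from the slide (n = 3, h = 4); the variable names and random parameter values are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, h = 3, 4                       # input size and number of hidden neurons
rng = np.random.default_rng(0)

# Parameters with the shapes from the slide.
W1 = rng.standard_normal((h, n))  # W[1] in R^{h x n}
b1 = rng.standard_normal(h)       # b[1] in R^h
w2 = rng.standard_normal(h)       # w[2] in R^h
b2 = rng.standard_normal()        # b[2] in R

def forward(x):
    """Forward path of the 2-layer (one-hidden-layer) network."""
    z1 = W1 @ x + b1              # z[1] = W[1] x + b[1]
    a1 = sigmoid(z1)              # a[1] = sigma(z[1])
    z2 = w2 @ a1 + b2             # z[2] = w[2]^T a[1] + b[2]
    return sigmoid(z2)            # y_hat = a[2] = sigma(z[2])

x = np.array([0.5, -1.0, 2.0])    # an arbitrary 3-dimensional input
print(forward(x))                 # a probability between 0 and 1
```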


ACTIVATION FUNCTIONS



Neural Network with a Hidden Layer

[Diagram: the hidden neurons use activation 𝑔; the output neuron uses 𝜎]

Input: 𝒙 ∈ ℝⁿ
Parameters: 𝑾[1] ∈ ℝ^(h×n), 𝒃[1] ∈ ℝ^h, 𝒘[2] ∈ ℝ^h, 𝑏[2] ∈ ℝ

Forward path:
𝒛[1] = 𝑾[1]𝒙 + 𝒃[1]
𝒂[1] = 𝑔(𝒛[1]), where 𝑔(·) is an activation function
𝑧[2] = 𝒘[2]ᵀ𝒂[1] + 𝑏[2]
ŷ = 𝑎[2] = 𝜎(𝑧[2])


Why Non-Linear Activation Functions?

𝒙 → 𝒛[1] = 𝑾[1]𝒙 + 𝒃[1] → 𝒛[2] = 𝑾[2]𝒛[1] + 𝒃[2] → 𝒛[2]

𝒛[2] = 𝑾[2](𝑾[1]𝒙 + 𝒃[1]) + 𝒃[2]
     = 𝑾[2]𝑾[1]𝒙 + (𝑾[2]𝒃[1] + 𝒃[2])
     = 𝑾′𝒙 + 𝒃′,  where 𝑾′ = 𝑾[2]𝑾[1] and 𝒃′ = 𝑾[2]𝒃[1] + 𝒃[2]

A series of linear functions without an activation function
→ just a single linear function again
→ not useful!
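A quick NumPy check (my own toy sizes) that two linear layers without an activation in between collapse into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)

# Two linear layers with no activation in between ...
z2 = W2 @ (W1 @ x + b1) + b2

# ... equal a single linear layer with W' = W2 W1 and b' = W2 b1 + b2.
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
print(np.allclose(z2, W_prime @ x + b_prime))  # True
```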
Activation Functions
There are many activation functions. We'll cover some important and currently widely used ones:
▪ Sigmoid
▪ tanh
▪ ReLU
▪ Leaky ReLU


Sigmoid

▪ 𝜎(𝑥) = 1/(1 + e^(−𝑥))
▪ Range: (0, 1)
▪ Derivative: d𝜎(𝑥)/d𝑥 = 𝜎(𝑥)(1 − 𝜎(𝑥))
  ▪ 0 < d𝜎(𝑥)/d𝑥 ≤ 0.25

[Plot: sigmoid curve]


Hyperbolic Tangent: tanh

▪ tanh(𝑥) = (e^𝑥 − e^(−𝑥)) / (e^𝑥 + e^(−𝑥)) = 2𝜎(2𝑥) − 1
▪ Range: (−1, 1)
▪ Derivative: d tanh(𝑥)/d𝑥 = 1 − tanh²(𝑥)
  ▪ 0 < d tanh(𝑥)/d𝑥 ≤ 1

[Plot: tanh curve]


Vanishing Gradient

“The term vanishing gradient refers to the fact that in a feedforward network (FFN) the backpropagated error signal typically decreases (or increases) exponentially as a function of the distance from the final layer.”
— Jason Brownlee, https://machinelearningmastery.com/how-to-fix-vanishing-gradients-using-the-rectified-linear-activation-function/

▪ Chain rule (for a deep NN):
∂𝐿/∂𝑤 = ∂𝐿/∂𝑧[n] ∗ ∂𝑧[n]/∂𝑧[n−1] ∗ ⋯ ∗ ∂𝑧[3]/∂𝑧[2] ∗ ∂𝑧[2]/∂𝑧[1] ∗ ∂𝑧[1]/∂𝑤
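Because each sigmoid-derivative factor in this product is at most 0.25, the product shrinks roughly exponentially with depth. A toy NumPy sketch (my own simplification that ignores the weight terms) just to illustrate the scale of the effect:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # always in (0, 0.25]

rng = np.random.default_rng(0)

# Pretend pre-activations z[1] ... z[L] of one unit per layer; the chain-rule
# product of d(sigma)/dz factors bounds how much gradient survives.
for depth in (2, 5, 10, 20):
    z = rng.standard_normal(depth)
    factor = np.prod(sigmoid_deriv(z))
    print(f"depth {depth:2d}: product of sigmoid derivatives ~ {factor:.2e}")
```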
Rectified Linear Unit: ReLU
▪ 𝑓(𝑥) = max(0, 𝑥)
▪ Range: [0, ∞)
▪ Derivative:
  d𝑓(𝑥)/d𝑥 = 0 if 𝑥 < 0;  1 if 𝑥 > 0;  undefined if 𝑥 = 0
▪ Most popular activation function
  ▪ Better than sigmoid or tanh with respect to the vanishing gradient problem
  ▪ Simple to implement

[Plot: ReLU]


Leaky ReLU
▪ 𝑓(𝑥) = 𝑎𝑥 if 𝑥 < 0;  𝑥 if 𝑥 ≥ 0
▪ 𝑎 ≪ 1 (e.g., 𝑎 = 0.01)
▪ Range: (−∞, ∞)
▪ Derivative:
  d𝑓(𝑥)/d𝑥 = 𝑎 if 𝑥 < 0;  1 if 𝑥 > 0;  undefined if 𝑥 = 0
▪ Sometimes better than ReLU at avoiding the dying-ReLU issue
  ▪ But slower to compute
  ▪ ReLU is more popular

[Plots: ReLU vs. Leaky ReLU]
Source: https://www.linkedin.com/pulse/better-activation-functions-prince-kumawat
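A minimal NumPy sketch of the four activation functions and their derivatives as defined on these slides (at 𝑥 = 0, where the ReLU and Leaky ReLU derivatives are undefined, the code simply picks one branch, a common implementation choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # in (0, 0.25]

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2    # in (0, 1]

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)    # 0 for x < 0, 1 for x > 0

def leaky_relu(x, a=0.01):
    return np.where(x < 0, a * x, x)

def d_leaky_relu(x, a=0.01):
    return np.where(x < 0, a, 1.0)

x = np.linspace(-3, 3, 7)
print(relu(x))
print(d_sigmoid(x).max())           # never exceeds 0.25
```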
Pros and Cons of Activation Functions

Sigmoid: 𝑎 = 1/(1 + e^(−𝑧))        tanh: 𝑎 = (e^𝑧 − e^(−𝑧)) / (e^𝑧 + e^(−𝑧))
ReLU: 𝑎 = max{0, 𝑧}                Leaky ReLU: 𝑎 = max{𝑐𝑧, 𝑧}

[Plots of the four activation functions]


Pros and Cons of Activation Functions
▪ Sigmoid: rarely used in hidden layers; only used in the output layer for binary classification, since its output is 0–1 (a probability).
▪ tanh: superior to sigmoid.
▪ ReLU: the most popular one; better for the vanishing gradient problem, and very simple to implement in hardware.
▪ Leaky ReLU: sometimes better than ReLU for the dying-ReLU issue, but it takes more calculation than ReLU; less popular than ReLU.

Tip: If the performance is similar, use the simpler one (use ReLU!)


Neural Network with a Hidden Layer

[Diagram: 𝒙 → hidden layer with activation 𝑔 → 𝒂[1] → output neuron with 𝜎 → 𝑎[2] = ŷ]

𝒛[1] = 𝑾[1]𝒙 + 𝒃[1]
𝒂[1] = 𝑔(𝒛[1]), where 𝑔(·) is an activation function
𝑧[2] = 𝒘[2]ᵀ𝒂[1] + 𝑏[2]
ŷ = 𝑎[2] = 𝜎(𝑧[2])

You may use tanh, ReLU, or Leaky ReLU as 𝑔.
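Putting the lecture together, here is a minimal NumPy sketch (my own toy implementation, not course code) of a shallow network with one tanh hidden layer trained by gradient descent on the XOR dataset; the gradients come from the chain rule (backpropagation; the details are beyond these slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR dataset: 4 observations, 2 features each.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])
m = X.shape[0]

# Shallow network: n = 2 inputs, h = 4 hidden units (g = tanh), sigmoid output.
rng = np.random.default_rng(0)
n, h = 2, 4
W1 = rng.standard_normal((h, n)) * 0.5
b1 = np.zeros(h)
w2 = rng.standard_normal(h) * 0.5
b2 = 0.0
eta = 1.0                          # learning rate

for step in range(5000):
    # Forward path (as on the slides, vectorized over the m observations).
    Z1 = X @ W1.T + b1             # (m, h)
    A1 = np.tanh(Z1)               # hidden activations
    z2 = A1 @ w2 + b2              # (m,)
    a2 = sigmoid(z2)               # y_hat

    # Gradients of the cross-entropy cost via the chain rule.
    dz2 = (a2 - y) / m             # (m,)
    dw2 = A1.T @ dz2               # (h,)
    db2 = dz2.sum()
    dZ1 = np.outer(dz2, w2) * (1.0 - A1 ** 2)   # (m, h), tanh' = 1 - tanh^2
    dW1 = dZ1.T @ X                # (h, n)
    db1 = dZ1.sum(axis=0)

    # Gradient descent update: theta <- theta - eta * grad.
    W1 -= eta * dW1; b1 -= eta * db1
    w2 -= eta * dw2; b2 -= eta * db2

pred = (sigmoid(np.tanh(X @ W1.T + b1) @ w2 + b2) > 0.5).astype(int)
print("predictions:", pred)        # typically matches y = [0, 1, 1, 0]
```

With a purely linear model this would be impossible, which is exactly the point of adding the hidden layer.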
