
WiSe 2023/24

Deep Learning 1

Lecture 4 Optimization (Part 2)


Outline
Recap Lecture 3
Post-Hoc Mitigation of Poor Conditioning
▶ Momentum
▶ Adam algorithm

Efficient Optimization
▶ Redundancies in the error function
▶ The stochastic gradient descent procedure
▶ Efficient networks
▶ Local connectivity

Implementation Aspects
▶ Computations as matrix-vector operations
▶ Training on specific hardware
▶ Distributed training schemes

1/31
Recap Lecture 3

2/31
Recap: Hessian-Based Analysis
[Figure: a well-conditioned error function vs. a poorly conditioned error function, with the minima of the function marked.]
Using the framework of Taylor expansions, any error function can be expanded
around its minimum θ⋆ as the quadratic function:

E(θ) = E(θ⋆) + 0 + (1/2) (θ − θ⋆)⊤ H (θ − θ⋆) + higher-order terms
where H is the Hessian. The condition number is then the ratio of largest
and smallest eigenvalues of the Hessian (the lower the better):

Condition number = λmax /λmin
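As a small numerical illustration (not from the slides; the Hessian values below are made up), the condition number can be read off from the eigenvalues of H, e.g. with NumPy:

    import numpy as np

    # Hypothetical Hessian of E(θ) at the minimum (symmetric, positive definite).
    H = np.array([[4.0, 0.5],
                  [0.5, 0.1]])

    eigenvalues = np.linalg.eigvalsh(H)                       # eigenvalues of a symmetric matrix
    condition_number = eigenvalues.max() / eigenvalues.min()  # λmax / λmin, ≈ 110 here: poorly conditioned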

3/31
Better Conditioning E(θ)

Various techniques:
▶ Centering/whitening the data
▶ Centering the activations
▶ Properly scaling the weights
▶ Designing the architecture appropriately (limited depth, shortcut
connections, batch-normalization layers, no bottlenecks, ...)
Note:
▶ Despite all measures to reduce the condition number
of E(θ), the latter may still be poorly conditioned.
▶ We also need to adapt the optimization procedure
so that it deals better with poor conditioning.

4/31
Part 1 Post-Hoc Mitigations

5/31
Momentum

Idea:
▶ Descent along directions of low
curvature can be accelerated by
imitating a physical momentum.

Algorithm:
▶ Compute the direction of descent as an accumulation of previous
gradients:
∆ ← µ · ∆ + (−∇E(θ))
where µ ∈ [0, 1[. The higher µ, the stronger the momentum.
▶ Update the parameters θ by performing a step along the obtained
direction of descent.
θ ← θ + γ · ∆
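A minimal sketch of these two update rules in Python (not from the slides; grad_E, theta0, and the hyperparameter values are placeholders):

    import numpy as np

    def gradient_descent_with_momentum(grad_E, theta0, gamma=0.01, mu=0.9, steps=1000):
        # delta accumulates previous gradients; mu controls how strongly they persist.
        theta = np.asarray(theta0, dtype=float)
        delta = np.zeros_like(theta)
        for _ in range(steps):
            delta = mu * delta - grad_E(theta)   # ∆ ← µ · ∆ + (−∇E(θ))
            theta = theta + gamma * delta        # θ ← θ + γ · ∆
        return theta

With mu = 0, this reduces to plain gradient descent.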

6/31
Momentum
Recall:
▶ Gradient descent with momentum proceeds as

∆ ← µ · ∆ + (−∇E(θ))
θ ← θ + γ · ∆

Property:
If all gradient estimates ∇E(θ) coincide along a particular direction, then
the effective learning rate along that direction becomes:

    γ′ = γ · 1/(1 − µ)
This can be derived as the closed form of a geometric series.
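As a quick check of this claim (assuming, for illustration, that the gradient is a constant vector g at every step): the accumulated direction converges to ∆ = −g · (1 + µ + µ² + ...) = −g/(1 − µ), so each parameter update becomes γ · ∆ = −γ/(1 − µ) · g, i.e. a plain gradient step with effective learning rate γ′ = γ/(1 − µ).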

Heuristic:
▶ When the error function is believed to be poorly conditioned, choose a
momentum of 0.9 or 0.99.

7/31
The Adam Algorithm

from Kingma'15
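The algorithm box from the paper is not reproduced here; the following is a minimal sketch of one Adam update step following Kingma & Ba (2015), with the paper's default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸):

    import numpy as np

    def adam_step(theta, grad, m, v, t, gamma=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Exponential moving averages of the gradient and of its elementwise square.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction (t is the step counter, starting at 1).
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Per-parameter step: a large second moment (steep or noisy direction) gives a smaller step.
        theta = theta - gamma * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

The per-parameter rescaling by the square root of the second moment is what makes Adam less sensitive to poor conditioning than plain gradient descent with a single global learning rate.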

8/31
Part 2 Avoiding Redundancies

9/31
Data Redundancies
Observation:
▶ The error function can usually be decomposed as the sum of errors on
individual data points, i.e. E(θ) = ∑_{i=1}^N Ei(θ).

▶ Error terms associated with different data points have similar shapes, e.g.
for a linear model y = w · x + b with θ = (w, b), the overall and
individual error functions typically look like this:

10/31
Data Redundancies

Conclusion:
▶ It is redundant (and computationally inefficient) to compute the error
function for every data point at each step of gradient descent.
Question:
▶ Can we perform gradient descent only on a subset of the data, or
alternatively, pick at each iteration a random subset of data?

11/31
Stochastic Gradient Descent
Gradient descent:

for t = 1 … T do
    θ ← θ − γ · ∇[ (1/N) ∑_{i=1}^N Ei(θ) ]        where the gradient term is ∇E(θ)
end for

Stochastic gradient descent:

for t = 1 … T do
    I = choose({1, 2, …, N}, K)
    θ ← θ − γ · ∇[ (1/K) ∑_{i∈I} Ei(θ) ]          where the gradient term is the estimate ∇̂E(θ)
end for

▶ Gradient descent costs O(N) at each iteration whereas stochastic
gradient descent costs O(K), where K ≪ N.
▶ The estimate ∇̂E(θ) is an unbiased estimator of ∇E(θ).
▶ SGD may never stabilize to a fixed solution due to the random
sampling.
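A minimal sketch of this procedure in Python (not from the slides; grad_Ei(theta, i) is a placeholder that returns the gradient of the error on data point i):

    import numpy as np

    def sgd(grad_Ei, theta0, N, K=32, gamma=0.01, T=1000, seed=0):
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        for t in range(T):
            I = rng.choice(N, size=K, replace=False)                    # I = choose({1, ..., N}, K)
            grad_hat = np.mean([grad_Ei(theta, i) for i in I], axis=0)  # unbiased estimate of ∇E(θ)
            theta = theta - gamma * grad_hat                            # O(K) per iteration instead of O(N)
        return theta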
12/31
Stochastic Gradient Descent
Idea:
▶ Make the learning rate decrease over time, i.e., replace the fixed
learning rate γ by a time-dependent learning rate γ(t).
Stochastic gradient descent (improved):
for t = 1 … T do
    I = choose({1, 2, …, N}, K)
    θ ← θ − γ(t) · ∇[ (1/K) ∑_{i∈I} Ei(θ) ]       where the gradient term is the estimate ∇̂E(θ)
end for

▶ SGD is guaranteed to converge if the learning rate satisfies the
following two conditions:

    lim_{t→∞} γ(t) = 0          (i)

    ∑_{t=1}^∞ γ(t) = ∞          (ii)
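For illustration (γ₀ = 0.1 is a placeholder value), a schedule such as γ(t) = γ₀/t satisfies both conditions and can replace the fixed γ in the loop above:

    def gamma(t, gamma0=0.1):
        # gamma(t) -> 0 as t -> infinity (condition i), while the partial sums
        # gamma0 * (1 + 1/2 + 1/3 + ...) diverge (condition ii, harmonic series).
        return gamma0 / t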

13/31
Choosing the Learning Rate Schedule of SGD

                         γ(t) = 1     γ(t) = t⁻¹     γ(t) = e⁻ᵗ

  lim_{t→∞} γ(t) = 0         ✗             ✓              ✓

  ∑_{t=1}^∞ γ(t) = ∞         ✓             ✓              ✗

Observation:
▶ The learning rate should decay but not too quickly.
▶ Because of this required slow decay, one also gets a slow convergence
rate, e.g. t⁻¹. (Compare with the exponential convergence of GD near
the optimum.)

Question:
▶ Is SGD useful at all?

14/31
GD vs. SGD Convergence
[Figure: log E(θ) vs. training time. Phase 1: most ∇Ei(θ) have the same direction → SGD moves faster. Phase 2: most ∇Ei(θ) point in different directions → SGD is slower due to noise. Asymptotically, SGD converges as E ∝ t⁻¹ while GD converges as E ∝ e⁻ᵗ.]

Observations:
▶ In Phase 1, constants (K vs. N) matter. SGD moves much faster
initially.
▶ Phase 2 is often irrelevant, because the model already starts overfitting
before reaching it.
▶ K can be increased over the course of training in order to perform
efficiently in both phases.
15/31
Further advantages of SGD vs. GD

▶ May escape local minima due to noise.


▶ May arrive at a better generalizing solution (cf. Regularization in lectures
5 and 6).

16/31
Part 3 Model Efficiency

17/31
Model Efficiency
Observation:
▶ Another factor that can have a strong effect on training efficiency is
how much time/resources it takes to compute one forward pass.

General guidelines:
▶ The number of neurons in the network should not be chosen larger
than needed for the task.

[Figure: three candidate networks of increasing size]

  Solves the task       ✗    ✓    ✓

  Cheap to evaluate     ✓    ✓    ✗

▶ The network should be organized in a way that only relevant


computations are performed.

18/31
Model Efficiency
Global connectivity vs. local connectivity
▶ Keeping only local connections can substantially reduce the number of
computations for each neuron.
▶ Only works if the representation computed at a given layer does not
require long-range interactions.

Adapted from B. Sick, O. Durr, Deep Learning Lecture, ETHZ.

19/31
Model Efficiency

Global connectivity vs. local connectivity


a = W⊤x = W_A⊤ x_A + W_B⊤ x_B

    (W⊤x: 8 × 4 = 32 computations;  W_A⊤ x_A + W_B⊤ x_B: 2 × (2 × 4) = 16 computations)
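A toy NumPy illustration of the computation counts above (not from the slides; splitting the 8 inputs into two halves and the 4 outputs into two pairs is one possible way to realize the local connectivity):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(8)                           # 8 input activations

    # Fully connected: every output sees every input -> 8 x 4 = 32 multiplications.
    W = rng.standard_normal((8, 4))
    a_full = W.T @ x

    # Locally connected: each pair of outputs only sees one half of the input
    # -> 2 blocks of shape (4, 2), i.e. 2 x (2 x 4) = 16 multiplications.
    W_A, W_B = rng.standard_normal((4, 2)), rng.standard_normal((4, 2))
    x_A, x_B = x[:4], x[4:]
    a_local = np.concatenate([W_A.T @ x_A, W_B.T @ x_B])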

20/31
Avoiding Computational Bottlenecks
The CNN Architecture:
▶ Lower layers detect
simple features at
exact locations.
▶ Higher layers detect
complex features at
approximate
locations.

Key Computational Benefit of the CNN:


▶ Spatial information is progressively replaced with semantic
information as we move from the input layer to the top layer.
▶ The dimensionality and the number of connections are never too high at any
layer.

21/31
Avoiding Computational Bottlenecks
Example:
The Inception-v1 (GoogLeNet)
architecture

Observation:
▶ No specific layer strongly dominates in terms of the number of operations.

22/31
Part 4 Systemize / Parallelize Computations

23/31
Systemize Computations
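The slide's illustration is not reproduced here. As a minimal sketch of the idea from the outline (computations expressed as matrix operations; layer and batch sizes below are made up), a whole batch can be pushed through a layer with one matrix-matrix product instead of a Python loop over matrix-vector products:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 128))      # weights of one layer (256 inputs, 128 outputs)
    b = rng.standard_normal(128)             # biases
    X = rng.standard_normal((64, 256))       # batch of 64 input vectors

    # Per-example matrix-vector products: a slow Python loop of many small operations.
    A_loop = np.stack([W.T @ x + b for x in X])

    # The same result as a single matrix-matrix operation (one large, well-optimized BLAS call).
    A_batch = X @ W + b

    assert np.allclose(A_loop, A_batch)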

24/31
Choosing Batch Size in SGD

Two factors enter into the choice of the batch size:

▶ Whether the gradients of individual data points are redundant, i.e. typically,
whether we are in phase 1 or phase 2 of training.
▶ Whether the machine used for training the model is sufficiently big so
that the batch operation can be performed in O(1) on that machine.

                    Phase 1 of training        Phase 2 of training
                    (correlated gradients)     (decorrelated gradients)

  small machine     small batch                medium batch

  big machine       medium batch               large batch

25/31
Map Neural Network to Hardware

Image from Ciresan et al. 2010:


Deep Big Simple Neural Nets
Excel on Handwritten Digit
Recognition

▶ In order for the training procedure to match the hardware specifications
(e.g. CPU cache, GPU block size) optimally, neural network
computations (e.g. batch computations) must be decomposed into
blocks of appropriate size.
▶ These hardware-specific optimizations are already built into most fast
neural network libraries (e.g. PyTorch, TensorFlow, cuDNN, ...).

26/31
Part 5 Distributed Training

27/31
Distributed Training
Example: Google's DistBelief Architecture [Dean'12]

Each model replica trains on its own data, and synchronizes the model
parameters it has learned with the other replicas via a dedicated parameter server.

28/31
Distributed Training
Combining data-parallelism and model-parallelism

see also Krizhevsky'14: One weird trick for parallelizing convolutional neural networks

29/31
Summary

30/31
Summary
▶ Even with the best practices for shaping the error function E(θ) such
as data centering, designing a good architecture, etc., the optimization
of E(θ) remains computationally demanding.
▶ A poorly conditioned error function can be addressed by enhancing the
simple gradient descent procedure with momentum.
▶ The contributions of different data points to the error function are
initially highly correlated → it is beneficial to approximate the error
gradient from only a random subset of points at each iteration
(stochastic gradient descent).
▶ The model can be shaped in a way that avoids unnecessary
computations (e.g. weights connecting features known to be
unrelated), and in a way that avoids computational bottlenecks.
▶ For the most efficient neural network training, it is important to consider
what the hardware can achieve (e.g. which operations the hardware
can perform in O(1)).
▶ Very large models and very large datasets do not fit on a single
machine. In that case, we need to design distributed schemes, with
appropriate use of data/model parallelism.

31/31
