
WiSe 2023/24

Deep Learning 1

Lecture 4 Optimization (Part 2)


Outline
Recap Lecture 3
Post-Hoc Mitigation of Poor Conditioning
▶ Momentum
▶ Adam algorithm

Efficient Optimization
▶ Redundancies in the error function
▶ The stochastic gradient descent procedure
▶ Efficient networks
▶ Local connectivity

Implementation Aspects
▶ Computations as matrix-vector operations
▶ Training on specific hardware
▶ Distributed training schemes

1/31
Recap Lecture 3

2/31
Recap: Hessian-Based Analysis
[Figure: a well-conditioned error function vs. a poorly conditioned error function, with the minima of the function marked.]
Using the framework of Taylor expansions, any error function can be expanded
around its minimum θ⋆ as the quadratic function:

E(θ) = E(θ⋆) + 0 + (1/2) (θ − θ⋆)⊤ H (θ − θ⋆) + higher-order terms
where H is the Hessian. The condition number is then the ratio of largest
and smallest eigenvalues of the Hessian (the lower the better):

Condition number = λmax /λmin
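As a small numerical illustration (not from the slides; the Hessian values below are made up), the condition number can be read off from the eigenvalues of H, e.g. with NumPy:

    import numpy as np

    # Hypothetical Hessian of E(θ) at the minimum (symmetric, positive definite).
    H = np.array([[4.0, 0.5],
                  [0.5, 0.1]])

    eigenvalues = np.linalg.eigvalsh(H)                       # eigenvalues of a symmetric matrix
    condition_number = eigenvalues.max() / eigenvalues.min()  # λmax / λmin, ≈ 110 here: poorly conditioned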

3/31
Better Conditioning E(θ)

Various techniques:
▶ Centering/whitening the data
▶ Centering the activations
▶ Properly scaling the weights
▶ Designing the architecture appropriately (limited depth, shortcut
connections, batch-normalization layers, no bottlenecks, ...)
Note:
▶ Despite all measures to reduce the condition number
of E(θ), the latter may still be poorly conditioned.
▶ We also need to adapt the optimization procedure
so that it deals better with poor conditioning.

4/31
Part 1 Post-Hoc Mitigations

5/31
Momentum

Idea:
▶ Descent along directions of low
curvature can be accelerated by
imitating a physical momentum.

Algorithm:
▶ Compute the direction of descent as an accumulation of previous
gradients:
∆ ← µ · ∆ + (−∇E(θ))
where µ ∈ [0, 1[. The higher µ, the stronger the momentum.
▶ Update the parameters θ by performing a step along the obtained
direction of descent.
θ ← θ + γ · ∆
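A minimal sketch of these two update rules in Python (not from the slides; grad_E, theta0, and the hyperparameter values are placeholders):

    import numpy as np

    def gradient_descent_with_momentum(grad_E, theta0, gamma=0.01, mu=0.9, steps=1000):
        # delta accumulates previous gradients; mu controls how strongly they persist.
        theta = np.asarray(theta0, dtype=float)
        delta = np.zeros_like(theta)
        for _ in range(steps):
            delta = mu * delta - grad_E(theta)   # ∆ ← µ · ∆ + (−∇E(θ))
            theta = theta + gamma * delta        # θ ← θ + γ · ∆
        return theta

With mu = 0, this reduces to plain gradient descent.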

6/31
Momentum
Recall:
▶ Gradient descent with momentum proceeds as

∆ ← µ · ∆ + (−∇E(θ))
θ ← θ + γ · ∆

Property:
If all gradient estimates ∇E(θ) coincide along a particular direction, then
the effective learning rate along that direction becomes:

    γ′ = γ · 1/(1 − µ)
This can be derived as the closed form of a geometric series.
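As a quick check of this claim (assuming, for illustration, that the gradient is a constant vector g at every step): the accumulated direction converges to ∆ = −g · (1 + µ + µ² + ...) = −g/(1 − µ), so each parameter update becomes γ · ∆ = −γ/(1 − µ) · g, i.e. a plain gradient step with effective learning rate γ′ = γ/(1 − µ).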

Heuristic:
▶ When the error function is believed to be poorly conditioned, choose a
momentum of 0.9 or 0.99.

7/31
The Adam Algorithm

from Kingma'15
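The algorithm box from the paper is not reproduced here; the following is a minimal sketch of one Adam update step following Kingma & Ba (2015), with the paper's default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸):

    import numpy as np

    def adam_step(theta, grad, m, v, t, gamma=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # Exponential moving averages of the gradient and of its elementwise square.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction (t is the step counter, starting at 1).
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Per-parameter step: a large second moment (steep or noisy direction) gives a smaller step.
        theta = theta - gamma * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

The per-parameter rescaling by the square root of the second moment is what makes Adam less sensitive to poor conditioning than plain gradient descent with a single global learning rate.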

8/31
Part 2 Avoiding Redundancies

9/31
Data Redundancies
Observation:
▶ The error function can usually be decomposed as the sum of errors on
individual data points, i.e. E(θ) = ∑_{i=1}^N Ei(θ).

▶ Error terms associated with different data points have similar shapes, e.g.
for a linear model y = w · x + b with θ = (w, b), the overall and
individual error functions typically look like this:

10/31
Data Redundancies

Conclusion:
▶ It is redundant (and computationally inefficient) to compute the error
function for every data point at each step of gradient descent.
Question:
▶ Can we perform gradient descent only on a subset of the data, or
alternatively, pick at each iteration a random subset of data?

11/31
Stochastic Gradient Descent
Gradient descent:

for t = 1 … T do
    θ ← θ − γ · ∇[ (1/N) ∑_{i=1}^N Ei(θ) ]        where the gradient term is ∇E(θ)
end for

Stochastic gradient descent:

for t = 1 … T do
    I = choose({1, 2, …, N}, K)
    θ ← θ − γ · ∇[ (1/K) ∑_{i∈I} Ei(θ) ]          where the gradient term is the estimate ∇̂E(θ)
end for

▶ Gradient descent costs O(N) at each iteration whereas stochastic
gradient descent costs O(K), where K ≪ N.
▶ The estimate ∇̂E(θ) is an unbiased estimator of ∇E(θ).
▶ SGD may never stabilize to a fixed solution due to the random
sampling.
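A minimal sketch of this procedure in Python (not from the slides; grad_Ei(theta, i) is a placeholder that returns the gradient of the error on data point i):

    import numpy as np

    def sgd(grad_Ei, theta0, N, K=32, gamma=0.01, T=1000, seed=0):
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        for t in range(T):
            I = rng.choice(N, size=K, replace=False)                    # I = choose({1, ..., N}, K)
            grad_hat = np.mean([grad_Ei(theta, i) for i in I], axis=0)  # unbiased estimate of ∇E(θ)
            theta = theta - gamma * grad_hat                            # O(K) per iteration instead of O(N)
        return theta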
12/31
Stochastic Gradient Descent
Idea:
▶ Make the learning rate decrease over time, i.e., replace the fixed
learning rate γ by a time-dependent learning rate γ(t).
Stochastic gradient descent (improved):
for t = 1 … T do
    I = choose({1, 2, …, N}, K)
    θ ← θ − γ(t) · ∇[ (1/K) ∑_{i∈I} Ei(θ) ]       where the gradient term is the estimate ∇̂E(θ)
end for

▶ SGD is guaranteed to converge if the learning rate satisfies the
following two conditions:

    lim_{t→∞} γ(t) = 0          (i)

    ∑_{t=1}^∞ γ(t) = ∞          (ii)
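For illustration (γ₀ = 0.1 is a placeholder value), a schedule such as γ(t) = γ₀/t satisfies both conditions and can replace the fixed γ in the loop above:

    def gamma(t, gamma0=0.1):
        # gamma(t) -> 0 as t -> infinity (condition i), while the partial sums
        # gamma0 * (1 + 1/2 + 1/3 + ...) diverge (condition ii, harmonic series).
        return gamma0 / t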

13/31
Choosing the Learning Rate Schedule of SGD

                         γ(t) = 1     γ(t) = t⁻¹     γ(t) = e⁻ᵗ

  lim_{t→∞} γ(t) = 0         ✗             ✓              ✓

  ∑_{t=1}^∞ γ(t) = ∞         ✓             ✓              ✗

Observation:
▶ The learning rate should decay but not too quickly.
▶ Because of this required slow decay, one also gets a slow convergence
rate, e.g. t⁻¹. (Compare with the exponential convergence of GD near
the optimum.)

Question:
▶ Is SGD useful at all?

14/31
GD vs. SGD Convergence
[Figure: log E(θ) vs. training time. Phase 1: most ∇Ei(θ) have the same direction → SGD moves faster. Phase 2: most ∇Ei(θ) point in different directions → SGD is slower due to noise. Asymptotically, SGD converges as E ∝ t⁻¹ while GD converges as E ∝ e⁻ᵗ.]

Observations:
▶ In Phase 1, constants (K vs. N) matter. SGD moves much faster
initially.
▶ Phase 2 is often irrelevant, because the model already starts overfitting
before reaching it.
▶ K can be increased over the course of training in order to perform
efficiently in both phases.
15/31
Further advantages of SGD vs. GD

▶ May escape local minima due to noise.


▶ May arrive at a better generalizing solution (cf. Regularization in lectures
5 and 6).

16/31
Part 3 Model Efficiency

17/31
Model Efficiency
Observation:
▶ Another factor that can have a strong effect on training efficiency is
how much time/resources it takes to compute one forward pass.

General guidelines:
▶ The number of neurons in the network should not be chosen larger
than needed for the task.

[Figure: three candidate networks of increasing size]

  Solves the task       ✗    ✓    ✓

  Cheap to evaluate     ✓    ✓    ✗

▶ The network should be organized in a way that only relevant


computations are performed.

18/31
Model Efficiency
Global connectivity vs. local connectivity
▶ Keeping only local connections can substantially reduce the number of
computations for each neuron.
▶ Only works if the representation computed at a given layer does not
require long-range interactions.

Adapted from B. Sick, O. Durr, Deep Learning Lecture, ETHZ.

19/31
Model Efficiency

Global connectivity vs. local connectivity


a = W⊤x = W_A⊤ x_A + W_B⊤ x_B

    (W⊤x: 8 × 4 = 32 computations;  W_A⊤ x_A + W_B⊤ x_B: 2 × (2 × 4) = 16 computations)
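A toy NumPy illustration of the computation counts above (not from the slides; splitting the 8 inputs into two halves and the 4 outputs into two pairs is one possible way to realize the local connectivity):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(8)                           # 8 input activations

    # Fully connected: every output sees every input -> 8 x 4 = 32 multiplications.
    W = rng.standard_normal((8, 4))
    a_full = W.T @ x

    # Locally connected: each pair of outputs only sees one half of the input
    # -> 2 blocks of shape (4, 2), i.e. 2 x (2 x 4) = 16 multiplications.
    W_A, W_B = rng.standard_normal((4, 2)), rng.standard_normal((4, 2))
    x_A, x_B = x[:4], x[4:]
    a_local = np.concatenate([W_A.T @ x_A, W_B.T @ x_B])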

20/31
Avoiding Computational Bottlenecks
The CNN Architecture:
▶ Lower layers detect
simple features at
exact locations.
▶ Higher layers detect
complex features at
approximate
locations.

Key Computational Benefit of the CNN:


▶ Spatial information is progressively replaced with semantic
information as we move from the input layer to the top layer.
▶ The dimensionality and the number of connections are never too high at any
layer.

21/31
Avoiding Computational Bottlenecks
Example:
The Inception-v1 (GoogLeNet)
architecture

Observation:
▶ No specific layer strongly dominates in terms of the number of operations.

22/31
Part 4 Systemize / Parallelize Computations

23/31
Systemize Computations
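The slide's illustration is not reproduced here. As a minimal sketch of the idea from the outline (computations expressed as matrix operations; layer and batch sizes below are made up), a whole batch can be pushed through a layer with one matrix-matrix product instead of a Python loop over matrix-vector products:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 128))      # weights of one layer (256 inputs, 128 outputs)
    b = rng.standard_normal(128)             # biases
    X = rng.standard_normal((64, 256))       # batch of 64 input vectors

    # Per-example matrix-vector products: a slow Python loop of many small operations.
    A_loop = np.stack([W.T @ x + b for x in X])

    # The same result as a single matrix-matrix operation (one large, well-optimized BLAS call).
    A_batch = X @ W + b

    assert np.allclose(A_loop, A_batch)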

24/31
Choosing Batch Size in SGD

Two factors enter into the choice of the batch size:

▶ Whether the gradients of individual data points are redundant, i.e. typically,
whether we are in phase 1 or phase 2 of training.
▶ Whether the machine used for training the model is sufficiently big so
that the batch operation can be performed in O(1) on that machine.

                    Phase 1 of training        Phase 2 of training
                    (correlated gradients)     (decorrelated gradients)

  small machine     small batch                medium batch

  big machine       medium batch               large batch

25/31
Map Neural Network to Hardware

Image from Ciresan et al. 2010:


Deep Big Simple Neural Nets
Excel on Handwritten Digit
Recognition

▶ In order for the training procedure to match the hardware specifications
(e.g. CPU cache, GPU block size) optimally, neural network
computations (e.g. batch computations) must be decomposed into
blocks of appropriate size.
▶ These hardware-specific optimizations are already built into most fast
neural network libraries (e.g. PyTorch, TensorFlow, cuDNN, ...).

26/31
Part 5 Distributed Training

27/31
Distributed Training
Example: Google's DistBelief Architecture [Dean'12]

Each model replica trains on its own data, and synchronizes the model
parameters it has learned with the other replicas via a dedicated parameter server.

28/31
Distributed Training
Combining data-parallelism and model-parallelism

see also Krizhevsky'14: One weird trick for parallelizing convolutional neural networks

29/31
Summary

30/31
Summary
▶ Even with the best practices for shaping the error function E(θ) such
as data centering, designing a good architecture, etc., the optimization
of E(θ) remains computationally demanding.
▶ A poorly conditioned error function can be addressed by enhancing the
simple gradient descent procedure with momentum.
▶ The contributions of different data points to the error function are
initially highly correlated → it is beneficial to approximate the error
gradient from only a random subset of points at each iteration
(stochastic gradient descent).
▶ The model can be shaped in a way that avoids unnecessary
computations (e.g. weights connecting features known to be
unrelated), and in a way that avoids computational bottlenecks.
▶ For the most efficient neural network training, it is important to consider
what the hardware can achieve (e.g. which operations the hardware
can perform in O(1)).
▶ Very large models and very large datasets do not fit on a single
machine. In that case, we need to design distributed schemes, with
appropriate use of data/model parallelism.

31/31
