Lecture 04
Deep Learning 1
Efficient Optimization
▶ Redundancies in the error function
▶ The stochastic gradient descent procedure
▶ Efficient networks
▶ Local connectivity
Implementation Aspects
▶ Computations as matrix-vector operations
▶ Training on specific hardware
▶ Distributed training schemes
1/31
Recap Lecture 3
2/31
Recap: Hessian-Based Analysis
(Figure: two error function landscapes, well-conditioned vs. poorly conditioned, with the minima of the function marked.)
Using the framework of Taylor expansions, any error function can be expanded
around its minimum θ⋆ as the quadratic function:

    E(θ) = E(θ⋆) + 0 + (1/2) (θ − θ⋆)⊤ H (θ − θ⋆) + higher-order terms

where H is the Hessian and the zero reflects the vanishing gradient at the
minimum. The condition number is then the ratio of the largest and smallest
eigenvalues of the Hessian (the lower the better): κ(H) = λmax(H) / λmin(H).
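As a small illustration (a minimal NumPy sketch with made-up Hessians, not taken from the lecture), the condition number can be read off the eigenvalues of H:

    import numpy as np

    # Quadratic error E(θ) = E(θ*) + 1/2 (θ−θ*)ᵀ H (θ−θ*), with two illustrative Hessians.
    H_good = np.array([[1.0, 0.0], [0.0, 0.9]])   # similar curvature in all directions
    H_bad  = np.array([[1.0, 0.0], [0.0, 0.01]])  # one nearly flat direction

    def condition_number(H):
        eigvals = np.linalg.eigvalsh(H)           # eigenvalues of the symmetric Hessian
        return eigvals.max() / eigvals.min()

    print(condition_number(H_good))   # ≈ 1.1  → well-conditioned
    print(condition_number(H_bad))    # 100.0  → poorly conditioned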
3/31
Better Conditioning of E(θ)
Various techniques:
▶ Centering/whitening the data (sketched below)
▶ Centering the activations
▶ Properly scaling the weights
▶ Designing the architecture appropriately (limited depth, shortcut
connections, batch-normalization layers, no bottlenecks, ...)
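As a rough sketch of the first point (PCA whitening of the input data; an illustrative recipe assuming a data matrix X of shape (N, d), not necessarily the lecture's exact procedure):

    import numpy as np

    def center_and_whiten(X, eps=1e-5):
        """Center the data and rescale/decorrelate its dimensions (PCA whitening)."""
        X = X - X.mean(axis=0)                      # centering: zero-mean features
        cov = np.cov(X, rowvar=False)               # d × d covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition of the covariance
        W = eigvecs / np.sqrt(eigvals + eps)        # whitening transform
        return X @ W                                # whitened data: ~identity covariance

    X = np.random.randn(100, 3) * np.array([10.0, 1.0, 0.1])    # badly scaled toy data
    print(np.cov(center_and_whiten(X), rowvar=False).round(2))  # ≈ identity matrix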
Note:
▶ Despite all measures to reduce the condition number
of E(θ), the latter may still be poorly conditioned.
▶ We also need to adapt the optimization procedure
so that it deals better with poor conditioning.
4/31
Part 1 Post-Hoc Mitigations
5/31
Momentum
Idea:
▶ Descent along directions of low
curvature can be accelerated by
imitating a physical momentum.
Algorithm:
▶ Compute the direction of descent as an accumulation of previous
gradients:
∆ ← µ · ∆ + (−∇E(θ))
where µ ∈ [0, 1[. The higher µ, the stronger the momentum.
▶ Update the parameters θ by performing a step along the obtained
direction of descent.
θ ← θ + γ · ∆
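A minimal NumPy sketch of this procedure (illustrative; grad_E, gamma and mu are placeholders to be supplied by the user):

    import numpy as np

    def momentum_descent(grad_E, theta, gamma=0.01, mu=0.9, n_steps=1000):
        """Gradient descent with momentum, following the two update rules above."""
        delta = np.zeros_like(theta)             # accumulated direction of descent
        for _ in range(n_steps):
            delta = mu * delta - grad_E(theta)   # ∆ ← µ·∆ + (−∇E(θ))
            theta = theta + gamma * delta        # θ ← θ + γ·∆
        return theta

    # Toy poorly conditioned quadratic: E(θ) = 0.5·(θ₀² + 100·θ₁²)
    grad_E = lambda th: np.array([th[0], 100.0 * th[1]])
    print(momentum_descent(grad_E, np.array([1.0, 1.0]), gamma=0.005, mu=0.9))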
6/31
Momentum
Recall:
▶ Gradient descent with momentum proceeds as
∆ ← µ · ∆ + (−∇E(θ))
θ ← θ + γ · ∆
Property:
If all gradient estimates ∇E(θ) coincide along a particular direction, then
the effective learning rate along that direction becomes:

    γ′ = γ · 1 / (1 − µ)
This can be derived as the closed form of a geometric series.
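Sketch of that derivation (not spelled out on the slide): assuming a constant gradient g = ∇E(θ) across steps,

    \Delta_t = -(1 + \mu + \mu^2 + \cdots + \mu^{t-1})\, g
    \;\longrightarrow\; -\frac{g}{1-\mu} \quad (t \to \infty),
    \qquad\text{so each step approaches}\qquad
    \theta \leftarrow \theta - \underbrace{\tfrac{\gamma}{1-\mu}}_{\gamma'}\, g .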
Heuristic:
▶ When the error function is believed to be poorly conditioned, choose a
momentum of 0.9 or 0.99.
7/31
The Adam Algorithm
from Kingma'15
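A compact NumPy sketch of the standard Adam update from that paper (with the usual default hyperparameters β₁ = 0.9, β₂ = 0.999, ε = 1e−8; a sketch, not the slide's exact listing):

    import numpy as np

    def adam(grad_E, theta, gamma=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
        """Adam: momentum on the gradient and on its elementwise square, with bias correction."""
        m = np.zeros_like(theta)   # first moment (exponentially averaged gradient)
        v = np.zeros_like(theta)   # second moment (exponentially averaged squared gradient)
        for t in range(1, n_steps + 1):
            g = grad_E(theta)
            m = beta1 * m + (1 - beta1) * g         # update biased first moment estimate
            v = beta2 * v + (1 - beta2) * g**2      # update biased second moment estimate
            m_hat = m / (1 - beta1**t)              # bias-corrected first moment
            v_hat = v / (1 - beta2**t)              # bias-corrected second moment
            theta = theta - gamma * m_hat / (np.sqrt(v_hat) + eps)
        return theta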
8/31
Part 2 Avoiding Redundancies
9/31
Data Redundancies
Observation:
▶ The error function can usually be decomposed as the sum of errors on
individual data points, i.e. E(θ) = ∑_{i=1}^N E_i(θ).
▶ Error terms associated with different data points have similar shapes, e.g.
for a linear model y = w · x + b with θ = (w, b), the overall and
individual error functions typically look like this:
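As a complementary numerical illustration of the decomposition above (toy data and a squared-error loss, both assumptions made for the example):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])        # toy inputs
    y = np.array([1.0, 2.9, 5.1, 7.0])        # toy targets
    w, b = 2.0, 1.0                           # candidate parameters θ = (w, b)

    E_i = 0.5 * (w * x + b - y) ** 2          # per-example errors E_i(θ)
    print(E_i, E_i.sum())                     # the overall error E(θ) is their sum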
10/31
Data Redundancies
Conclusion:
▶ It is redundant (and computationally inefficient) to compute the error
function for every data point at each step of gradient descent.
Question:
▶ Can we perform gradient descent only on a subset of the data, or
alternatively, pick at each iteration a random subset of data?
11/31
Stochastic Gradient Descent
Gradient descent:
for t = 1 . . . T do
    θ ← θ − γ · ∇( (1/N) ∑_{i=1}^N E_i(θ) )        (the gradient term is ∇E(θ))
end for
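A minimal sketch of the mini-batch stochastic variant suggested by the previous question (illustrative NumPy code; grad_Ei computes the gradient of a single E_i, and the batch size K is an assumed hyperparameter):

    import numpy as np

    def sgd(grad_Ei, theta, N, K=32, gamma=0.1, n_steps=1000, seed=0):
        """Mini-batch SGD: estimate ∇E(θ) from K randomly drawn data points per step."""
        rng = np.random.default_rng(seed)
        for t in range(1, n_steps + 1):
            batch = rng.choice(N, size=K, replace=False)              # random subset of the data
            g = np.mean([grad_Ei(theta, i) for i in batch], axis=0)   # stochastic gradient estimate
            theta = theta - (gamma / t) * g                           # decaying learning rate (see next slide)
        return theta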
13/31
Choosing the Learning Rate Schedule of SGD
(Figure: three example learning rate schedules γ(t), each marked ✓/✗ against the two conditions below.)

    lim_{t→∞} γ(t) = 0        ✗  ✓  ✓

    ∑_{t=1}^∞ γ(t) = ∞        ✓  ✓  ✗
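For instance (an illustrative choice, not taken from the slide), a 1/t-style decay satisfies both conditions:

    def lr_schedule(t, gamma0=0.1, tau=100.0):
        """γ(t) = γ0 / (1 + t/τ): tends to 0 as t → ∞, while ∑_t γ(t) still diverges."""
        return gamma0 / (1.0 + t / tau)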
Observation:
▶ The learning rate should decay but not too quickly.
▶ Because of this required slow decay, one also gets a slow convergence
rate, e.g. t⁻¹. (Compare with the exponential convergence of GD near
the optimum.)
Question:
▶ Is SGD useful at all?
14/31
GD vs. SGD Convergence
(Figure: log E(θ) over the course of training for GD and SGD, divided into
two phases. Phase 2: most ∇E_i(θ) point in different directions → SGD
slower due to noise.)
Observations:
▶ In Phase 1, constants (K vs. N) matter: SGD moves much faster
initially.
▶ Phase 2 is often irrelevant, because the model already starts overfitting
before reaching it.
▶ K can be increased over the course of training in order to perform
efficiently in both phases.
15/31
Further advantages of SGD vs. GD
16/31
Part 3 Model Efficiency
17/31
Model Efficiency
Observation:
▶ Another factor that can have a strong effect on training efficiency is
how much time/resources it takes to compute one forward pass.
General guidelines:
▶ The number of neurons in the network should not be chosen larger
than needed for the task.
(Figure: three example models, marked for whether they are cheap to evaluate: ✓ ✓ ✗.)
18/31
Model Efficiency
Global connectivity vs. local connectivity
▶ Keeping only local connections can substantially reduce the number of
computations for each neuron.
▶ This only works if the representation computed at that layer does not
require long-range interactions.
19/31
Model Efficiency
    a = W⊤x = W_A⊤ x_A + W_B⊤ x_B

where the full product W⊤x requires 8×4 = 32 computations, while the
locally connected form W_A⊤ x_A + W_B⊤ x_B requires only 2×(2×4) = 16
computations.
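One way to read this comparison in code (a NumPy sketch; the block split of W and x into groups A and B is an assumption about the slide's figure):

    import numpy as np

    x = np.random.randn(8)                    # input, split into two groups of 4
    x_A, x_B = x[:4], x[4:]

    W = np.random.randn(8, 4)                 # fully connected: 8 × 4 = 32 multiplications
    a_dense = W.T @ x

    W_A = np.random.randn(4, 2)               # locally connected: each group of outputs
    W_B = np.random.randn(4, 2)               # only sees its own group of inputs
    a_local = np.concatenate([W_A.T @ x_A, W_B.T @ x_B])   # 2 × (2 × 4) = 16 multiplications

    print(a_dense.shape, a_local.shape)       # both produce 4 activations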
20/31
Avoiding Computational Bottlenecks
The CNN Architecture:
▶ Lower layers detect
simple features at
exact locations.
▶ Higher layers detect
complex features at
approximate
locations.
21/31
Avoiding Computational Bottlenecks
Example:
The Inception-v1 (GoogLeNet)
architecture
Observation:
▶ No specific layer strongly dominates in terms of the number of operations.
22/31
Part 4 Systemize / Parallelize Computations
23/31
Systemize Computations
24/31
Choosing Batch Size in SGD
(Table: recommended batch size)

    small machine:   small batch   /  medium batch
    big machine:     medium batch  /  large batch
25/31
Map Neural Network to Hardware
26/31
Part 5 Distributed Training
27/31
Distributed Training
Example: Google's DistBelief Architecture [Dean'12]
Each model replica trains on its own data, and synchronizes the model
parameters it has learned with the other replicas via a dedicated parameter server.
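A toy, synchronous simulation of this idea (illustrative only; DistBelief's Downpour SGD is asynchronous and runs across many machines, whereas this sketch runs sequentially in one process, with grad_E_shard standing in for a replica's gradient on its own data shard):

    import numpy as np

    def parameter_server_sgd(grad_E_shard, theta, n_replicas=4, gamma=0.1, n_rounds=100):
        """Each replica computes a gradient on its data shard; the parameter server
        averages the gradients and broadcasts the updated parameters."""
        for _ in range(n_rounds):
            grads = [grad_E_shard(theta, r) for r in range(n_replicas)]   # replicas (simulated in a loop)
            theta = theta - gamma * np.mean(grads, axis=0)                # parameter server update
        return theta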
28/31
Distributed Training
Combining data-parallelism and model-parallelism
see also Krizhevsky'14: One weird trick for parallelizing convolutional neural networks
29/31
Summary
30/31
Summary
▶ Even with the best practices for shaping the error function E(θ) such
as data centering, designing a good architecture, etc., the optimization
of E(θ) remains computationally demanding.
▶ A poorly conditioned error function can be addressed by enhancing the
simple gradient descent procedure with momentum.
▶ The contributions of different data points to the error function are
initially highly correlated → it is beneficial to approximate the error
gradient from only a random subset of points at each iteration
(stochastic gradient descent).
▶ The model can be shaped in a way that avoids unnecessary
computations (e.g. by removing weights connecting features known to be
unrelated), and in a way that avoids computational bottlenecks.
▶ For the most efficient neural network training, it is important to consider
what the hardware can achieve (e.g. which operations the hardware
performs in O(1)).
▶ Very large models and very large datasets do not fit on a single
machine. In that case, we need to design distributed schemes, with
appropriate use of data/model parallelism.
31/31