TensorFlow Deep Learning with Keras
September 2020
Regarding the title of this course
Deep neural network

The number of all possible models for a network with a single hidden layer is

$$\frac{a^{\#\text{parameters}}}{\#\text{hidden units}!}$$

A more formal result for the capacity of a deep network (per parameter):

$$\left(\frac{w}{f}\right)^{(d-1)f} \frac{w^{f-2}}{d}$$
Deep neural network

[Figure: network diagram with input, output, target, and error signal]
Training a neural network

▶ gradient-based methods

[Figure: network diagram with input, output, target, and error signal]
The way to tell if overfitting happened is to measure the error on unseen data.

▶ Test error

[Figure: log(error) during training]
Batch normalization
Data augmentation
Cross-validation method
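A minimal Keras sketch tying the pieces above together: the held-out error from validation_split is the error on unseen data that reveals overfitting, and a BatchNormalization layer is included for illustration. The dataset, layer sizes, and training settings below are placeholder assumptions.

import numpy as np
import tensorflow as tf

# Placeholder data: 1000 samples, 20 features, 10 classes.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 10, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# validation_split holds out 20% of the data, so the error on unseen
# data (val_loss) can be compared with the training error (loss).
history = model.fit(x_train, y_train, epochs=10, validation_split=0.2)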
Deep learning
Official definition: Deep learning is the study of machine learning
models composed of multiple layers of functions that progressively
extract higher level features from the raw input.
[Photo: Intermittent Lake Cerknica, Notranjski regijski park (Notranjska Regional Park)]
import tensorflow as tf

# D and C are float32 matrices defined previously (not shown in this excerpt).
cos_D = tf.cos(D)
# <tf.Tensor: id=30, shape=(2, 2), dtype=float32,
#  numpy=array([[ 0.96945935, -0.36729133],
#               [-0.89988   ,  0.9873345 ]], dtype=float32)>

max_D = tf.reduce_max(D)
# <tf.Tensor: id=26, shape=(), dtype=float32, numpy=-94.0>

C.numpy()
# array([[19., 22.], [43., 50.]], dtype=float32)

svd_D = tf.linalg.svd(D)
# (<tf.Tensor: id=27, shape=(2,), dtype=float32,
#   numpy=array([520.9103, 2.9102921], dtype=float32)>,
#
#  <tf.Tensor: id=28, shape=(2, 2), dtype=float32,
#   numpy=array([[-0.30792360,  0.95141107],
#                [-0.95141107, -0.30792360]], dtype=float32)>,
#
#  <tf.Tensor: id=29, shape=(2, 2), dtype=float32,
#   numpy=array([[0.59984480,  0.80011636],
#                [0.80011636, -0.59984480]], dtype=float32)>)
What can TensorFlow do?
1. It can perform numerical operations on data (in a parallel way –
multi-core, GPU).
2. It can calculate derivatives using automatic differentiation.
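A minimal sketch of both points; the tensors and the function below are illustrative choices rather than examples from the course material.

import tensorflow as tf

# 1. Numerical operations on tensors (multi-core CPU or GPU).
A = tf.constant([[1.0, 2.0], [3.0, 4.0]])
B = tf.matmul(A, A)      # matrix product
s = tf.reduce_sum(B)     # sum of all entries

# 2. Derivatives via automatic differentiation.
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = x ** 3 + tf.sin(x)
dy_dx = tape.gradient(y, x)   # equals 3*x**2 + cos(x) at x = 2.0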
[Figure: computational graph of the FLASSO objective, built from operations such as multiply, transpose, invert, add, subtract, trace, logdet, sum, and absolute value applied to the quantities δ, S, Ψ, Β0, Σ, Λ, λ, and Θ]
Why do we need derivatives?

"... task that is hard in theory, but sometimes easy in practice. Despite the NP-hardness of training general neural loss functions [2], simple gradient methods often find global minimizers (parameter configurations with zero or near-zero training loss), even when data and labels are randomized before training [42]. However, this good behavior is not universal; the trainability of neural nets is highly dependent on network architecture design choices, the choice of optimizer, variable initialization, and a variety of other considerations. Unfortunately, the effect of each of these choices on the structure of the underlying loss surface is unclear. Because of the prohibitive cost of loss function evaluations (which requires looping over all the data points in the training set), studies in this field have remained predominantly theoretical."

Knowing in which direction "down" is can help us when solving optimization problems.
▶ So, various competitions show that evolutionary algorithms outperform gradient-based optimization algorithms.
▶ However, all those competitions use functions of "low" dimension (≤ 100), and gradient-based optimization excels in high dimensions.

How many local minima are there with respect to dimension?

$$\begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\
\dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}$$

If the eigenvalues of the Hessian matrix are randomly distributed, then the probability that a stationary point is a local minimum is $2^{-n}$.

▶ Saddle points are exponentially more common than local minima.
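A tiny simulation of the claim above, under the stated assumption that the n eigenvalues at a stationary point are independent and equally likely to be positive or negative; the sampling distribution is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)

def fraction_local_minima(n, trials=100000):
    # Independent random eigenvalues; all positive means a local minimum.
    eigs = rng.standard_normal((trials, n))
    return np.mean(np.all(eigs > 0, axis=1))

for n in (1, 2, 5, 10):
    print(n, fraction_local_minima(n), 2.0 ** (-n))
# The simulated fractions match 2**(-n), so in high dimensions almost
# every stationary point has at least one negative eigenvalue (a saddle).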
A transformation used in such black-box benchmark functions, whose rounding produces plateaus:

• $\hat{z} = \Lambda^{10} R(x - x^{\text{opt}})$
• $\tilde{z}_i = \begin{cases} \lfloor 0.5 + \hat{z}_i \rfloor & \text{if } |\hat{z}_i| > 0.5, \\ \lfloor 0.5 + 10\,\hat{z}_i \rfloor / 10 & \text{otherwise,} \end{cases}$ for $i = 1, \ldots, D$, denotes the rounding procedure in order to produce the plateaus.
• $z = Q\tilde{z}$

A partial derivative can be approximated numerically with central differences:

$$\frac{\partial}{\partial x_1} f(x_1, x_2, \ldots) \approx \frac{f(x_1 + h, x_2, \ldots) - f(x_1 - h, x_2, \ldots)}{2h}$$
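A minimal central-difference sketch of the approximation above; the test function and the step size h are arbitrary assumptions.

import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Approximate each partial derivative with (f(x + h e_i) - f(x - h e_i)) / 2h.
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# Example: f(x) = sum(x**2) has gradient 2x.
print(numerical_gradient(lambda x: np.sum(x ** 2), [1.0, -2.0, 3.0]))
# -> approximately [ 2. -4.  6.]

Note that this costs two evaluations of f per coordinate, which is one reason automatic differentiation is preferred in high dimensions.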
$$\frac{\partial}{\partial x} \log\bigl(1 + \exp(ax + b)\bigr) = \frac{a \exp(ax + b)}{1 + \exp(ax + b)}$$
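A quick check of this derivative with TensorFlow's automatic differentiation; the values of a, b, and x are arbitrary.

import tensorflow as tf

a, b = tf.constant(2.0), tf.constant(-1.0)
x = tf.Variable(0.5)

with tf.GradientTape() as tape:
    y = tf.math.log(1.0 + tf.exp(a * x + b))   # log(1 + exp(ax + b))

autodiff = tape.gradient(y, x)                  # derivative computed by TensorFlow
analytic = a * tf.exp(a * x + b) / (1.0 + tf.exp(a * x + b))
# autodiff and analytic agree.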
Example: differentiate $y = (x_1 + x_2)\exp(x_2)$ by the chain rule on its computational graph, with intermediate values $a = x_1 + x_2$ and $b = \exp(x_2)$, so that $y = a \cdot b$.

[Figure: computational graph, where x1 and x2 feed the "+" node producing a, x2 feeds the "exp" node producing b, and a and b feed the "∗" node producing y]

Derivatives of the output with respect to the intermediate values:

$$\frac{\partial y}{\partial a} = b, \qquad \frac{\partial y}{\partial b} = a$$

Propagating backwards through the graph:

$$\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial a} \overbrace{\frac{\partial a}{\partial x_1}}^{1} = b$$

$$\frac{\partial y}{\partial x_2} = \frac{\partial y}{\partial a} \underbrace{\frac{\partial a}{\partial x_2}}_{1} + \frac{\partial y}{\partial b} \underbrace{\frac{\partial b}{\partial x_2}}_{\exp(x_2) = b} = b + ab$$
The same thing in TensorFlow

import tensorflow as tf

x1 = tf.Variable(3.1)
x2 = tf.Variable(-1.4)

with tf.GradientTape() as tape:
    f = (x1 + x2) * tf.exp(x2)   # y = (x1 + x2) exp(x2), recorded on the tape

df = tape.gradient(f, [x1, x2])
# [<tf.Tensor: id=22, shape=(), dtype=float32, numpy=0.24659698>,
#  <tf.Tensor: id=25, shape=(), dtype=float32, numpy=0.66581184>]
Optimizers

Vanilla update

x += - learning_rate * dx

Momentum update

v = mu * v - learning_rate * dx   # integrate velocity
x += v                            # integrate position

Adam (simplified, without bias correction)

m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx ** 2)
x += - learning_rate * m / (np.sqrt(v) + eps)
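The same update rules are available as optimizer objects in tf.keras; below is a minimal sketch on an arbitrary quadratic objective (the loss, learning rate, and starting point are illustrative assumptions).

import tensorflow as tf

x = tf.Variable([5.0, -3.0])
opt = tf.keras.optimizers.Adam(learning_rate=0.1)   # or SGD(momentum=0.9), ...

for step in range(100):
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(x ** 2)       # simple quadratic objective
    grads = tape.gradient(loss, [x])
    opt.apply_gradients(zip(grads, [x]))   # applies the Adam update to x
# x ends up close to the minimizer [0, 0].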
[Figure: sparse rating matrix with users as rows and movies as columns]

user id   movie id   rating
   4160      14501        5
    182      14502        2
   6649      14502        3
  17240      14502        1
    115      14503        4
    ...        ...      ...
Matrix factorization example

[Figure: the users × movies rating matrix R approximated by the product of a users × genres preference matrix G and a genres × movies matrix H]

R ≈ G · H
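A minimal TensorFlow sketch of fitting R ≈ G · H to observed (user id, movie id, rating) triples such as those in the table above; the matrix sizes, the number of genres, and the training settings are illustrative assumptions, not the course's reference implementation.

import tensorflow as tf

n_users, n_movies, n_genres = 20000, 15000, 10     # illustrative sizes

# Observed ratings as (user id, movie id, rating) triples.
user_ids = tf.constant([4160, 182, 6649, 17240, 115])
movie_ids = tf.constant([14501, 14502, 14502, 14502, 14503])
ratings = tf.constant([5.0, 2.0, 3.0, 1.0, 4.0])

G = tf.Variable(tf.random.normal((n_users, n_genres), stddev=0.1))    # user-genre preferences
H = tf.Variable(tf.random.normal((n_genres, n_movies), stddev=0.1))   # genre-movie weights

opt = tf.keras.optimizers.Adam(learning_rate=0.05)

for step in range(500):
    with tf.GradientTape() as tape:
        # Predicted rating for each observed pair: dot product of the
        # user's row of G with the movie's column of H.
        pred = tf.reduce_sum(tf.gather(G, user_ids) *
                             tf.transpose(tf.gather(H, movie_ids, axis=1)), axis=1)
        loss = tf.reduce_mean((ratings - pred) ** 2)
    grads = tape.gradient(loss, [G, H])
    opt.apply_gradients(zip(grads, [G, H]))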
[Figure: another application sketch: from a geometric model, solve the generalized eigenvalue problem K u = λ M u to obtain resonance spectra]