
TensorFlow: Deep learning with Keras

Primož Godec1 and Rok Hribar2


1 University of Ljubljana, Faculty of Computer and Information Science
2 Jožef Stefan Institute, Computer Systems department

September 2020
Regarding the title of this course

TensorFlow: Deep learning with Keras

- Deep learning is a set of methods for using artificial neural networks.
- Keras is probably the most popular library that implements deep learning methods.
- TensorFlow is a library that includes Keras as a submodule.
Deep learning
Artificial intelligence

- Machines (or computers) that mimic cognitive functions that we associate with the human mind:
  - translate text (like a book)
  - recognize objects in images (faces, handwriting)
  - recognize speech
  - creativity (poetry, music, paintings)
  - expert diagnosis (physician, mechanic)
- Tesler: AI is whatever hasn't been done yet
  - optical character recognition
  - playing chess

What kind of murderer has moral fiber? – A cereal killer.

Machine learning

Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so.

Traditional algorithm:
- A human programmer designs an algorithm telling the machine how to execute all steps required to solve the problem at hand.

For some tasks, it can be challenging for a human to manually create the needed algorithm.

Machine learning algorithm:
- A human programmer designs an algorithm that helps the computer develop its own algorithm, rather than having the human programmer specify every needed step.
- Do not let the word "learning" mislead you.
Machine learning example: Spam filtering

Text                                            Category
secret prize! claim secret prize now            spam
could you send me that image we talked about    ham
account compromised reset password              spam
free entry for 2 week tournament                spam
are you coming to a secret party for Mark       ham
you have a virus please download                spam
I'm in Ljubljana on Thursday, have time?        ham
$50 gift card for Amazon                        spam

[Diagram: data + ML algorithm → model (parameters)]

Basic machine learning algorithm:
- Count the words that appear in spam/ham messages.
- Calculate the probability that a word is present in a message belonging to a given class.

The result is a model that can calculate the probability that a message is spam.
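A minimal sketch of this counting idea, using only the eight labelled messages from the table above (the smoothing constants and the decision to count each word once per message are illustrative assumptions, not from the slides):

from collections import Counter

# Labelled training messages from the table above.
messages = [
    ("secret prize! claim secret prize now", "spam"),
    ("could you send me that image we talked about", "ham"),
    ("account compromised reset password", "spam"),
    ("free entry for 2 week tournament", "spam"),
    ("are you coming to a secret party for Mark", "ham"),
    ("you have a virus please download", "spam"),
    ("I'm in Ljubljana on Thursday, have time?", "ham"),
    ("$50 gift card for Amazon", "spam"),
]

# Count in how many messages of each class every word appears.
counts = {"spam": Counter(), "ham": Counter()}
totals = {"spam": 0, "ham": 0}
for text, label in messages:
    counts[label].update(set(text.lower().split()))
    totals[label] += 1

def spam_probability(text):
    """Score P(spam | words) from the word counts, with add-one smoothing."""
    scores = {}
    for label in ("spam", "ham"):
        # Prior: fraction of training messages in this class.
        score = totals[label] / len(messages)
        for word in text.lower().split():
            # Probability that a message of this class contains the word.
            score *= (counts[label][word] + 1) / (totals[label] + 2)
        scores[label] = score
    return scores["spam"] / (scores["spam"] + scores["ham"])

print(spam_probability("claim your secret prize"))     # above 0.5, leans spam
print(spam_probability("are you coming on Thursday"))  # close to 0, ham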
Machine learning

Artificial intelligence that is not machine learning:
- rule-based systems (natural language processing, theorem proving)
- early computer vision
Artificial neural network

Despite its name it doesn't have much to do with the biological brain. It is a simple mathematical model that:

- is fast – can be easily parallelized
  - matrix multiplication is highly parallelizable and optimized
  - a composition of linear functions is linear – we need nonlinearity
  - the fastest nonlinear functions are those of a single variable
- can capture a wide range of functions
  - L-NL and NL-L are not universal approximators
  - NL-L-NL and L-NL-L are, and of those L-NL-L is faster

Traditional neural network (L-NL-L):

    a = W1 x + b1
    h = σ(a)
    y = W2 h + b2
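A minimal NumPy sketch of this L-NL-L forward pass; the layer sizes and the choice of σ = tanh are illustrative assumptions, not from the slides:

import numpy as np

rng = np.random.default_rng(0)

# Layer sizes (illustrative): 4 inputs, 8 hidden units, 2 outputs.
n_in, n_hidden, n_out = 4, 8, 2

# Parameters of the two affine layers.
W1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_out, n_hidden))
b2 = rng.normal(size=n_out)

def forward(x, sigma=np.tanh):
    """L-NL-L network: linear map, elementwise nonlinearity, linear map."""
    a = W1 @ x + b1   # a = W1 x + b1
    h = sigma(a)      # h = σ(a)
    y = W2 @ h + b2   # y = W2 h + b2
    return y

print(forward(rng.normal(size=n_in)))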
Deep neural network

The number of all possible models for a network with a single hidden layer is

    a^(#parameters) / (#hidden units)!

A more formal result for the capacity of a deep network (per parameter):

    (w/f)^((d−1)·f) · w^(f−2) / d
Deep neural network

More philosophical reasons for why depth is good:

- belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output
- belief that the nature of knowledge is hierarchical, where more abstract concepts build on simpler ones
- belief that the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation
Training a neural network

There have been many procedures to train neural networks throughout history:
- learning rules (Hebbian, correlation)
- perceptron learning (linear least squares)
- neuroevolution
- gradient based methods

Gradient based methods (input → output, compared with the target to obtain the error):
- The derivatives of the error with respect to all parameters of the network are calculated using the backpropagation algorithm.
- The parameters of the network are changed in the direction that minimizes the error.
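A minimal sketch of gradient based training on a one-parameter model (the synthetic data, learning rate, and number of steps are illustrative assumptions, not from the slides):

import numpy as np

# Toy data generated from y = 3x plus noise (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w = 0.0              # single parameter of the model y_hat = w * x
learning_rate = 0.1

for step in range(50):
    y_hat = w * x                         # forward pass: input -> output
    error = np.mean((y_hat - y) ** 2)     # compare output with target
    grad = np.mean(2 * (y_hat - y) * x)   # derivative of the error w.r.t. w
    w -= learning_rate * grad             # move in the direction that minimizes the error

print(w)  # close to 3.0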
Overfitting and underfitting

Overfitting is a modeling error that occurs when a model has learned too much.
- model capacity is so high that noise is being modeled
- the model doesn't generalize well from our training data to unseen data
- this can usually be avoided by #data instances ≫ #parameters

However, overfitting is a complicated phenomenon. It depends on:
- model capacity
- data set distribution
- complexity of the underlying problem

The most bulletproof way to know if overfitting happened is to measure the error on unseen data:
- test error

[Plot: log(error) for the training and test sets versus the number of epochs (×10^4).]
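Since overfitting shows up as a rising error on unseen data, a common practice is to monitor a held-out validation set during training. A minimal Keras sketch (the synthetic data, layer sizes, and patience value are illustrative assumptions, not from the slides):

import numpy as np
import tensorflow as tf

# Synthetic regression data (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8)).astype("float32")
y = x @ rng.normal(size=(8, 1)).astype("float32") + 0.1 * rng.normal(size=(1000, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Early stopping: stop when the validation loss has not improved for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])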
Regularization

AI problems normally require high capacity models:
- depth due to problem complexity
- width to ensure information flow

To reduce overfitting we handicap the network without reducing its size:
- constraints on the structure of the network
- disruptions in the training phase

Techniques (a few of them appear in the Keras sketch below):
- weight decay: error + λ‖parameters‖_q
- parameter sharing
- semi-supervised learning
- dropout
- early stopping
- sparse representations
- data augmentation
- batch/layer normalization
- cross-validation method
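A minimal Keras sketch showing how three of these techniques typically appear in code: weight decay via an L2 kernel regularizer, dropout, and batch normalization. The layer sizes, regularization strength, and dropout rate are illustrative assumptions, not from the slides:

import tensorflow as tf

model = tf.keras.Sequential([
    # Weight decay: adds λ‖W‖² of this layer's weights to the loss.
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    # Batch normalization: normalizes activations over each batch.
    tf.keras.layers.BatchNormalization(),
    # Dropout: randomly zeroes 50% of activations during training.
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])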
Deep learning

Official definition: Deep learning is the study of machine learning models composed of multiple layers of functions that progressively extract higher level features from the raw input.

However, this idea also existed in 1950–2010, when the success of deep learning was very limited. What made the difference:
- gradient based training (on GPUs)
- availability of large quantities of data
- appropriate cost functions
- new regularizations
- new representation mappings (e.g. embeddings)
- new network architectures
Keras

Neural networks with Keras
- Introduction to neural networks through classification
- Neural network for regression
- Image classification

Convolutional neural networks
- Image classification with convolutional neural networks
- Exercise: Classification of images from the CIFAR10 dataset
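As a preview of the hands-on part, a minimal sketch of a Keras classification model; the MNIST dataset bundled with Keras, the layer sizes, and the number of epochs are illustrative choices, not the exercise itself:

import tensorflow as tf

# MNIST digits, bundled with Keras.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test))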
Intermittent Lake Cerknica (Notranjski regijski park)

- Lake Cerknica is the largest lake in Slovenia (~28 km²) when it is full – it is an intermittent lake.
- The karst landscape involves an underground drainage system with sinkholes and caves.

[Map: Projekt LIFE 06 NAT/SLO/000069 – Presihajoče Cerkniško jezero / Cerknica Lake – project area]
Recurrent neural network

- Part of the output from a layer is fed as additional input along with the next instance.
- This gives the network a short term memory.
- The internal state doesn't depend only on the current data instance but also on all previous ones.

Advantages:
- no need to choose a time window
- weight sharing
- partially observable modeling

Disadvantages:
- less parallelizable
- difficult to train
- vanishing and exploding gradients

Applications:
- time series prediction
- robot control
- text generation
- music composition
- video processing
- machine translation
- handwriting recognition
- genetics and protein related ML
- speech recognition
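A minimal Keras sketch of a recurrent model for sequence prediction; the sequence length, feature count, and the use of an LSTM layer (a gated variant that mitigates vanishing gradients) are illustrative assumptions, not from the slides:

import tensorflow as tf

# Sequences of 50 time steps with 8 features each, one predicted value per sequence.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(50, 8)),  # internal state carries the short term memory
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()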
TensorFlow
What can TensorFlow do?

1. It can perform numerical operations on data (in a parallel way – multi-core, GPU).

import tensorflow as tf

A = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
B = tf.Variable([[5.0, 6.0], [7.0, 8.0]])

C = tf.matmul(A, B)       # matrix multiplication
D = A - B*C               # elementwise operations
cos_D = tf.cos(D)         # elementwise math functions
sum_D = tf.reduce_sum(D)  # sum of all D components
max_D = tf.reduce_max(D)  # max component of D
svd_D = tf.linalg.svd(D)  # singular value decomposition
C = tf.matmul(A, B)
# <tf.Tensor: id=17, shape=(2, 2), dtype=float32,
#  numpy=array([[19., 22.], [43., 50.]], dtype=float32)>

cos_D = tf.cos(D)
# <tf.Tensor: id=30, shape=(2, 2), dtype=float32,
#  numpy=array([[ 0.96945935, -0.36729133],
#               [-0.89988   ,  0.9873345 ]], dtype=float32)>

max_D = tf.reduce_max(D)
# <tf.Tensor: id=26, shape=(), dtype=float32, numpy=-94.0>

C.numpy()
# array([[19., 22.], [43., 50.]], dtype=float32)

svd_D = tf.linalg.svd(D)
# (<tf.Tensor: id=27, shape=(2,), dtype=float32,
#   numpy=array([520.9103, 2.9102921], dtype=float32)>,
#  <tf.Tensor: id=28, shape=(2, 2), dtype=float32,
#   numpy=array([[-0.30792360,  0.95141107],
#                [-0.95141107, -0.30792360]], dtype=float32)>,
#  <tf.Tensor: id=29, shape=(2, 2), dtype=float32,
#   numpy=array([[0.59984480,  0.80011636],
#                [0.80011636, -0.59984480]], dtype=float32)>)
What can TensorFlow do?

1. It can perform numerical operations on data (in a parallel way – multi-core, GPU).
2. It can calculate derivatives using automatic differentiation.

[Figure: computation graph of a FLASSO loss, combining parameter matrices (δ, S, Ψ, Β0, Σ, Λ, Θ, λ) through operations such as multiply, transpose, invert, add, subtract, trace, logdet, sum, and absolute.]
Why do we need derivatives?

Knowing in which direction "down" is can help us when solving optimization problems.

Quoted excerpt: "... task that is hard in theory, but sometimes easy in practice. Despite the NP-hardness of training general neural loss functions [2], simple gradient methods often find global minimizers (parameter configurations with zero or near-zero training loss), even when data and labels are randomized before training [42]. However, this good behavior is not universal; the trainability of neural nets is highly dependent on network architecture design choices, the choice of optimizer, variable initialization, and a variety of other considerations. Unfortunately, the effect of each of these choices on the structure of the underlying loss surface is unclear. Because of the prohibitive cost of loss function evaluations (which requires looping over all the data points in the training set), studies in this field have remained predominantly theoretical."
- Various competitions show that evolutionary algorithms outperform gradient based optimization algorithms.
- However, all those competitions use functions of "low" dimension (≤ 100), and gradient based optimization excels in high dimensions.

How many local minima are there with respect to dimension? Consider the Hessian matrix:

    ⎡ ∂²f/∂x1²      ∂²f/∂x1∂x2   ···   ∂²f/∂x1∂xn ⎤
    ⎢ ∂²f/∂x2∂x1    ∂²f/∂x2²     ···   ∂²f/∂x2∂xn ⎥
    ⎢      ⋮              ⋮        ⋱         ⋮     ⎥
    ⎣ ∂²f/∂xn∂x1    ∂²f/∂xn∂x2   ···   ∂²f/∂xn²   ⎦

If the eigenvalues of the Hessian matrix are randomly distributed, then the probability that a stationary point is a local minimum is 2^(−n).
- Saddle points are exponentially more common than local minima.
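A small sketch of the 2^(−n) heuristic: assuming the n eigenvalues at a stationary point are drawn independently and symmetrically around zero (the assumption stated above; the Gaussian draw and sample size are illustrative), all of them are positive with probability 2^(−n):

import numpy as np

rng = np.random.default_rng(0)

for n in (2, 5, 10):
    # Draw eigenvalues independently, symmetric around zero (illustrative assumption).
    eigenvalues = rng.normal(size=(100_000, n))
    # A stationary point is a local minimum when all eigenvalues are positive.
    fraction_minima = np.mean(np.all(eigenvalues > 0, axis=1))
    print(n, fraction_minima, 2.0 ** -n)
# The empirical fraction matches 2^(-n); everything else is a saddle point (or a maximum).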
Ways to calculate derivatives of a program

1. Numerical differentiation

    ∂f/∂x1 (x1, x2, ...) ≈ ( f(x1 + h, x2, ...) − f(x1 − h, x2, ...) ) / (2h)

Problematic for:
- noisy functions (e.g. the step ellipsoid with Gaussian noise, BBOB f113)
- locally flat functions (e.g. the step ellipsoid, which consists of many plateaus of different sizes; apart from a small area close to the global optimum, its gradient is zero almost everywhere)

Algorithms that use this idea:
- Nelder-Mead algorithm (Ozaki et al., IPSJ Transactions on Computer Vision and Applications)
- OpenAI evolution strategy
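A small sketch of the central difference formula above, checked against a known analytic derivative (the test function and step size are illustrative assumptions):

import numpy as np

def central_difference(f, x, i, h=1e-5):
    """Estimate ∂f/∂x_i at x using the central difference formula."""
    x_plus, x_minus = np.array(x, dtype=float), np.array(x, dtype=float)
    x_plus[i] += h
    x_minus[i] -= h
    return (f(x_plus) - f(x_minus)) / (2 * h)

f = lambda x: np.sin(x[0]) * x[1]   # f(x1, x2) = sin(x1) * x2
x = np.array([1.0, 2.0])

print(central_difference(f, x, 0))  # ≈ cos(1.0) * 2.0
print(np.cos(1.0) * 2.0)            # exact value for comparison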
Ways to calculate derivatives of a program

2. Symbolic differentiation

    ∂/∂x log(1 + exp(ax + b)) = a exp(ax + b) / (1 + exp(ax + b))

Very efficient in case the function has a large number of outputs (the slide sketches a function f with inputs x1, x2 and outputs y1, ..., y5).

It is hard to apply to programs with control flow, such as:

if f(x, data) > 0:
    g(x, data)
else:
    h(x, data)
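The derivative above can be reproduced with a computer algebra system; a small sketch using SymPy (the choice of SymPy is illustrative, not from the slides):

import sympy

a, b, x = sympy.symbols("a b x")
expr = sympy.log(1 + sympy.exp(a * x + b))

derivative = sympy.diff(expr, x)
print(derivative)  # a*exp(a*x + b)/(exp(a*x + b) + 1)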
Ways to calculate derivatives of a program

3. Automatic differentiation

- A sort of hybrid between symbolic and numerical differentiation.
- There exist forward and reverse automatic differentiation – TensorFlow uses reverse automatic differentiation.
- Very efficient in case the function has a large number of inputs (the slide sketches a function f with inputs x1, ..., x5 and outputs y1, y2).

Example:

    f(x1, x2, ..., xn) = x1 · x2 · ... · xn

    ∇f = ( x2 · x3 · ... · xn,
           x1 · x3 · ... · xn,
           ...,
           x1 · x2 · ... · xn−1 )
y = (x1 + x2) exp(x2)

The computation graph has intermediate nodes a = x1 + x2 and b = exp(x2), with y = a · b.

Local derivatives at the output node:

    ∂y/∂a = b        ∂y/∂b = a

Propagating them back to the inputs with the chain rule:

    ∂y/∂x1 = ∂y/∂a · ∂a/∂x1 = b · 1 = b

    ∂y/∂x2 = ∂y/∂a · ∂a/∂x2 + ∂y/∂b · ∂b/∂x2 = b · 1 + a · b = b + a·b
The same thing in TensorFlow

import tensorflow as tf

x1 = tf.Variable(3.1)
x2 = tf.Variable(-1.4)

with tf.GradientTape() as tape:   # Save graph to tape
    tape.watch([x1, x2])          # Watch x1 and x2
    f = (x1 + x2) * tf.exp(x2)    # Calculate f(x1, x2)

df = tape.gradient(f, [x1, x2])
# [<tf.Tensor: id=22, shape=(), dtype=float32, numpy=0.24659698>,
#  <tf.Tensor: id=25, shape=(), dtype=float32, numpy=0.66581184>]
Optimizers

Vanilla update
x += -learning_rate * dx

Momentum update
v = mu * v - learning_rate * dx   # integrate velocity
x += v                            # integrate position

Adam
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * (dx**2)
x += -learning_rate * m / (np.sqrt(v) + eps)
Optimizers

Optimizers are available in the tf.keras.optimizers module:
- Vanilla update (tf.keras.optimizers.SGD)
- Adagrad (tf.keras.optimizers.Adagrad)
- RMSprop (tf.keras.optimizers.RMSprop)
- Adam (tf.keras.optimizers.Adam)
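A small sketch of using one of these optimizers together with GradientTape; the objective function and learning rate are illustrative assumptions, not from the slides:

import tensorflow as tf

x = tf.Variable(5.0)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

for step in range(200):
    with tf.GradientTape() as tape:
        loss = (x - 2.0) ** 2            # minimum at x = 2
    dx = tape.gradient(loss, [x])
    optimizer.apply_gradients(zip(dx, [x]))

print(x.numpy())  # close to 2.0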
Matrix factorization example

Suppose we have movie ratings from various people for a set of movies they have watched. The ratings form a sparse matrix R with one axis for users and one for movies.

user id   movie id   rating
4160      14501      5
182       14502      2
6649      14502      3
17240     14502      1
115       14503      4
...       ...        ...
Matrix factorization example

The rating matrix is approximated by a product of two smaller matrices: one relating movies to latent genres and one containing the users' genre preferences.

    R ≈ G · H

We look for nonnegative factors that minimize the reconstruction error (with a weighting W):

    error = ‖W (R − G · H)‖ = min,    G, H ≥ 0

import tensorflow as tf

# Initialize variables.
G = tf.Variable(...)
H = tf.Variable(...)

# Choose a gradient based optimizer.
optimizer = tf.keras.optimizers.Adam()

# Perform gradient descent.
for i in range(num_steps):
    with tf.GradientTape() as tape:
        tape.watch([G, H])
        absG = tf.abs(G)   # enforce G >= 0
        absH = tf.abs(H)   # enforce H >= 0
        dR = R - tf.matmul(absG, absH)
        loss = tf.reduce_sum(tf.square(dR))
    dG, dH = tape.gradient(loss, [G, H])
    optimizer.apply_gradients([(dG, G), (dH, H)])
Example: Finite element method

geometric stuff → solve Ku = λMu → resonance spectra
