
Training of Neural Networks

Q.J. Zhang, Carleton University


Notation:
x: input of the original modeling problem or the neural network

y: output of the original modeling problem or the neural network

w: internal weights/parameters of the neural network

m: number of outputs of the model

y = f(x , w) : neural network model

d: data for y (e.g., training data)



Define Model Input-Output

Define model input-output (x, y), for example,

x: physical/geometrical parameters of the component


y: S-parameters of the component



Data Generation:

(a) Generate (x, y) samples (xk, yk), k = 1, 2, …, P, such that
the finished NN accurately represents the original x-y
problem

(b) Data generator
• Measurement: for each given xk, measure the values of yk,
k = 1, 2, …, P
• Simulation: for each given xk, use a simulator to
calculate yk, k = 1, 2, …, P



Comparison of Neural Network Based Microwave Model Development
Using Data from Two Types of Data Generators
Basis of Comparison: Availability of Problem Theory-Equations
• Measurement data: model can be developed even if the theory-equations are not known, or are difficult to implement in CAD.
• Simulation data: model can be developed only for problems whose theory is implemented in a simulator.

Basis of Comparison: Assumptions
• Measurement data: no assumptions are involved, and the model can include all the effects, e.g., 3D full-wave effects, fringing effects, etc.
• Simulation data: often involves assumptions, and the model will be limited by the assumptions made by the simulator, e.g., 2.5D EM.

Basis of Comparison: Input Parameter Sweep
• Measurement data: data generation can be expensive or infeasible if a geometrical parameter, e.g., transistor gate length, needs to be sampled/changed.
• Simulation data: relatively easy to sweep any parameter in the simulator, because the changes are numerical and not physical/manual.



Comparison of Neural Network Based Microwave Model Development
Using Data from Two Types of Data Generators (continued)

Basis of Comparison: Sources of Small and Large/Gross Errors
• Measurement data: equipment limitations and tolerances.
• Simulation data: accuracy limitations and non-convergence of simulations.

Basis of Comparison: Feasibility of Getting Desired Output
• Measurement data: development of models is possible for measurable responses only. For example, the drain charge of an FET may not be easy to measure.
• Simulation data: any response can be modeled as long as it can be computed by the simulator.



Data Generation:
(c) Range of x to be sampled

• For testing data and validation data:
xmin ~ xmax should represent the user-intended range
in which the NN is to be used.

• For training data:
the default range of x samples should equal the user-intended
range, or, if feasible, extend slightly beyond it.



Data Generation
- where data should be sampled

[Figure: x samples distributed in a three-dimensional (x1, x2, x3) input space.]

(d) Distribution of x samples

• Uniform grid distribution
• Non-uniform grid distribution
• Design of Experiments (DOE) methodology:
central-composite design, 2^n factorial design
• Star distribution
• Random distribution
Data Generation (continued):

(e) Number of samples P -- theoretical factors:

• For the grid distribution case: Shannon's sampling theorem
• For the random distribution case: statistical confidence



Input / Output Scaling
The orders of magnitude of the various x and d values in
microwave applications can be very different from one
another.

Scaling of the training data is desirable for efficient neural
network training.

The data can be scaled such that the various x (or d) values have
a similar order of magnitude.



Input / Output Scaling:
Notation:
x and y -- original x and y
x̃ and ỹ -- scaled x and y
xmin, xmax -- obtained from data
x̃min, x̃max -- dictated by the NN trainer

• Linear scale
Scale formula:    x̃ = x̃min + (x − xmin)/(xmax − xmin) · (x̃max − x̃min)
De-scale formula: x = xmin + (x̃ − x̃min)/(x̃max − x̃min) · (xmax − xmin)

• Log scale
Scale formula:    x̃ = ln(x − xmin)
De-scale formula: x = xmin + e^x̃
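The linear scale and de-scale formulas can be written as a pair of small functions. This is an illustrative sketch: the function names and the default trainer-dictated range [-1, 1] are assumptions, not from the slides.

```python
def linear_scale(x, x_min, x_max, xt_min=-1.0, xt_max=1.0):
    """Map x from the data range [x_min, x_max] to the
    trainer-dictated range [xt_min, xt_max]."""
    return xt_min + (x - x_min) / (x_max - x_min) * (xt_max - xt_min)

def linear_descale(xt, x_min, x_max, xt_min=-1.0, xt_max=1.0):
    """Invert linear_scale back to the original data range."""
    return x_min + (xt - xt_min) / (xt_max - xt_min) * (x_max - x_min)

# Round trip on a sample value, e.g., a frequency between 1 and 10 (GHz assumed)
xt = linear_scale(5.5, 1.0, 10.0)
x_back = linear_descale(xt, 1.0, 10.0)
```

The de-scale function is what the finished model uses at its output, so the user never sees the scaled quantities.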



Illustration of Data Scaling
[Figure: flow from data generation, to data scaling, to neural network training, to the finished model for the user. The training data (x, d) are scaled before training; after training, the neural network is wrapped with input scaling and output de-scaling to form the finished model.]



Divide Data into Training Set,
Validation Set and Testing Set
Notation:
P – total number of data samples generated
D – Set for all data, D = {1, 2, …, P}
Tr – Training data set
V -- Validation data set
Te -- Test data set

Ideally: each data set (Tr, V, Te) should be an adequate
representation of the original y = f(x) problem over the
entire xmin ~ xmax range. The three sets have no overlap.



Divide Data into Training Set,
Validation Set and Testing Set

Case 1: When the original data is quite sufficient, split D into
three non-overlapping sets.

Case 2: When the data is very limited, duplicate D, such that
Tr = V = Te = D.

Case 3: Otherwise, split the data D into two sets.
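Case 1 can be sketched as a random split of the sample indices. The 60/20/20 fractions and the fixed seed are illustrative assumptions; the slides do not prescribe specific fractions.

```python
import random

def split_data(P, f_tr=0.6, f_v=0.2, seed=0):
    """Split D = {0, ..., P-1} into non-overlapping Tr, V, Te (Case 1).
    Shuffling first helps each set cover the whole x range."""
    D = list(range(P))
    random.Random(seed).shuffle(D)
    n_tr, n_v = int(f_tr * P), int(f_v * P)
    return D[:n_tr], D[n_tr:n_tr + n_v], D[n_tr + n_v:]

Tr, V, Te = split_data(100)
```

The returned index sets are disjoint by construction and together cover all P samples.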



Training / Validation and Testing
Training error:
  ETr(w) = [ (1/size(Tr)) (1/m) Σ_{k∈Tr} Σ_{j=1..m} | (yj(xk, w) − djk) / (ymax,j − ymin,j) |^q ]^{1/q}

Validation error:
  EV(w) = [ (1/size(V)) (1/m) Σ_{k∈V} Σ_{j=1..m} | (yj(xk, w) − djk) / (ymax,j − ymin,j) |^q ]^{1/q}

Test error:
  ETe(w) = [ (1/size(Te)) (1/m) Σ_{k∈Te} Σ_{j=1..m} | (yj(xk, w) − djk) / (ymax,j − ymin,j) |^q ]^{1/q}

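The same normalized q-norm error applies to each of the three sets. A minimal sketch, assuming the model outputs y have already been evaluated for every sample in the set (the function name and list-of-vectors layout are illustrative):

```python
def set_error(y, d, y_min, y_max, q=2):
    """Normalized q-norm error over one data set (Tr, V, or Te).
    y, d : lists of m-element output/data vectors, one pair per sample k.
    y_min, y_max : per-output normalization bounds (ymin,j and ymax,j)."""
    m = len(y_min)
    total = 0.0
    for yk, dk in zip(y, d):
        for j in range(m):
            total += abs((yk[j] - dk[j]) / (y_max[j] - y_min[j])) ** q
    return (total / (len(y) * m)) ** (1.0 / q)
```

With q = 2 this is a normalized root-mean-square error; a perfect fit gives exactly 0.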


Where to Use Each Error Criterion

Training error: the training error ETr(w) and its derivative ∂ETr/∂w
are used to determine how to update w during training.

Validation error: the validation error EV(w) is used as a stopping
criterion during training, i.e., to determine whether training
is sufficient.

Test error: the test error ETe(w) is used after training has finished
to provide a final assessment of the quality of the trained
neural network. The test error is not involved during training.
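The role of the validation error as a stopping criterion can be sketched as an early-stopping loop. The `update` routine, the patience threshold, and the toy scalar example are illustrative assumptions, not from the slides.

```python
def train_with_early_stopping(update, E_val, w0, patience=5, max_epoch=100):
    """Stop when the validation error E_val has not improved for
    `patience` consecutive epochs; return the best weights seen."""
    w, best_w, best_ev, wait = w0, w0, float("inf"), 0
    for epoch in range(max_epoch):
        w = update(w)               # one epoch of training-error-driven updates
        ev = E_val(w)
        if ev < best_ev:
            best_ev, best_w, wait = ev, w, 0
        else:
            wait += 1
            if wait >= patience:    # validation error stopped improving
                break
    return best_w

# Toy scalar example: updates pull w toward 2; validation error is (w - 2)^2
best = train_with_early_stopping(lambda w: w + 0.5 * (2.0 - w),
                                 lambda w: (w - 2.0) ** 2, 0.0)
```

Keeping the best-so-far weights, rather than the last ones, is what guards against overlearning past the validation minimum.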



Flow-chart Showing Neural Network Training, Validation and Testing

The flow-chart can be summarized as the following loop:

START
1. Select a neural network structure, e.g., MLP.
2. Assign random initial values to all the weight parameters.
3. Perform feedforward computation for all samples in the training set; evaluate the training error.
4. Compute derivatives of the training error w.r.t. the ANN internal weights (e.g., by backpropagation).
5. Update the neural network weight parameters using a gradient-based training algorithm (e.g., BP, quasi-Newton).
6. Perform feedforward computation for all samples in the validation set; evaluate the validation error.
7. If the desired accuracy is not achieved, go to step 3; otherwise, STOP training.
8. Perform feedforward computation for all samples in the test set; evaluate the test error as an independent quality measure of the trained neural network.



Initial Value of NN Weights Before Training
MLP: small random values

RBF/Wavelet: estimate the centers and widths of the RBFs,
or the translations and dilations of the wavelets

Knowledge-Based NN: use physical/electrical experience



Overlearning
Definition (strict):
Math: ETr ≈ 0, but EV >> ETr

Observation: the NN has memorized the training data but cannot
generalize well.

Possible reasons:
a) Too many hidden neurons
b) Not enough training data

Actions:
a) Add training data
b) Delete hidden neurons
c) Back up / retrieve a previous solution
Neural Network Over-Learning

[Figure: output y versus input x over the range −5 to 15. The neural network curve passes through the training data but oscillates between them, missing the validation data.]


Underlearning
Definition (strict):
Math: ETr >> 0

Observation: the NN cannot even represent the problem at the
training points.

Possible reasons:
a) Not enough hidden neurons
b) Training stuck at a local minimum
c) Not enough training

Actions:
a) Add hidden neurons
b) Train more
c) Perturb the solution, then continue training
Neural Network Under-Learning

[Figure: output y versus input x over the range −5 to 15. The neural network curve is too smooth and misses both the training data and the validation data.]



Perfect Learning:

Definition (strict):
Math: ETr ≈ EV, with both ≈ 0

Observation: the NN generalizes well.



Perfect Learning of Neural Networks

[Figure: output y versus input x over the range −5 to 15. The neural network curve passes smoothly through both the training data and the validation data.]



Types of Training

• Sample-by-sample (or online) training: ANN weights are updated


every time a training sample is presented to the network, i.e., weight
update is based on training error from that sample

• Batch-mode (or offline) training: ANN weights are updated after each
epoch, i.e., weight update is based on training error from all the
samples in training data set

where an epoch is defined as a stage of ANN training that


involves presentation of all the samples in the training data set to
the neural network once for the purpose of learning

• Supervised training: uses both x and y data for training

• Unsupervised training: uses only x data for training



Neural Network Training

The error between the training data and the neural network outputs
is fed back to the neural network to guide the update of the
network's internal weights.

[Figure: training data (x, d) feed the neural network; the training error d − y is fed back to update the internal weights w.]
Training Problem Statement:
Given training data (xk, dk), k ∈ Tr,
validation data (xk, dk), k ∈ V,
and the NN model y(x, w),
find values of w such that the validation error is minimized:

  min over epoch of EV(epoch)

where
  EV(epoch) = [ (1/PV) (1/m) Σ_{k∈V} Σ_{j=1..m} | (yj(xk, w(epoch)) − djk) / (ymax,j − ymin,j) |^q ]^{1/q}
  PV = size(V)
  w(epoch) = w(epoch − 1) + Δw(epoch − 1)
  w at epoch = 0 is the user's / software's initial guess

Δw(epoch − 1) is the update determined by the optimization
algorithm (training algorithm), which minimizes the training
error.
Steps of Gradient-Based Training Algorithms:

Step 1: w = initial guess; epoch = 0

Step 2: If EV(epoch) ≤ ε (given accuracy criterion)
or epoch > max_epoch (maximum number of epochs),
stop.

Step 3: Calculate ETr(w) and ∂ETr(w)/∂w using part or all
of the training data.

Step 4: Use the optimization algorithm to find Δw;
w ← w + Δw

Step 5: If all training data have been used, then
epoch = epoch + 1 and go to Step 2; else go to Step 3.
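The five steps above can be sketched on a toy two-weight problem. This is a minimal illustration, assuming full-batch steepest-descent updates with a fixed step size for Step 4 (the slides allow any optimization algorithm there) and using the training error itself in the Step 2 stopping test.

```python
def train(E, grad_E, w0, eta=0.1, eps=1e-6, max_epoch=1000):
    w, epoch = list(w0), 0                           # Step 1: initial guess
    while E(w) > eps and epoch < max_epoch:          # Step 2: stopping test
        g = grad_E(w)                                # Step 3: error gradient
        w = [wi - eta * gi for wi, gi in zip(w, g)]  # Step 4: w <- w + delta_w
        epoch += 1                                   # Step 5: next epoch
    return w, epoch

# Toy "training error" with its minimum at w = (1, 2)
E = lambda w: (w[0] - 1.0) ** 2 + (w[1] - 2.0) ** 2
grad_E = lambda w: [2.0 * (w[0] - 1.0), 2.0 * (w[1] - 2.0)]
w_star, n_epochs = train(E, grad_E, [0.0, 0.0])
```

On this quadratic the loop converges long before max_epoch; real NN training replaces E and grad_E with the feedforward and backpropagation computations.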



Update w in Gradient-Based Methods

  Δw = η h

where h is the direction of the update of w,
and η is the step size of the update of w.

Gradient-based methods use the information in ETr(w) and ∂ETr(w)/∂w
to determine the update direction of w.

The step size η is determined by:

• a small fixed constant set by the user
• an adaptive constant during training
• a line minimization method to find the best value of η



Line Minimization Problem Statement
Let a scalar function of one variable be defined as
  f(η) = ETr(w + η h)
Given the present value of w and direction h,
find η such that f(η) is minimized.

Solution methods (1-dimensional optimization):

Sectioning methods: golden section method, Fibonacci method, bisection method
Interpolation methods: quadratic method, cubic method
 

Back-Propagation (BP) (Rumelhart, Hinton & Williams, 1986)

We use the negative gradient direction h = −∂ETr(w)/∂w
for Δw = η h.

The neural network weights are updated during training as:

  w ← w − η ∂ETr(w)/∂w

or, with momentum:

  w ← w − η ∂ETr(w)/∂w + α Δw|epoch−1

where η is called the learning rate
and α is called the momentum factor.
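The BP update with momentum can be written as a one-line weight step. A minimal sketch, assuming the gradient is already available as a list and the η, α values shown are typical defaults, not prescribed by the slides:

```python
def bp_update(w, grad, prev_dw, eta=0.05, alpha=0.9):
    """One BP step: delta_w = -eta * grad + alpha * delta_w(previous epoch).
    Returns the updated weights and the delta_w to carry to the next step."""
    dw = [-eta * g + alpha * p for g, p in zip(grad, prev_dw)]
    return [wi + di for wi, di in zip(w, dw)], dw

# First step: no previous update, so momentum contributes nothing
w1, dw1 = bp_update([0.0, 0.0], [1.0, -1.0], [0.0, 0.0])
```

On the next call, dw1 is passed as prev_dw, so the momentum term smooths successive updates.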
Determining η and α for BP:
• Set η and α as fixed constants.
• η and α can be adaptive, e.g., η = c / epoch, where c is a constant.

• Search-then-converge (STC) schedule (Darken, 1992):

  η = η0 · [1 + (c/η0)(epoch/τ)] / [1 + (c/η0)(epoch/τ) + τ·(epoch/τ)²]

where η0, τ, and c are user-defined.

• Delta-bar-delta (Jacobs, 1988):
(a) a separate ηi for each weight wi in w of the NN: wi ← wi − ηi ∂ETr(w)/∂wi
(b) each ηi is adjusted during training using present and previous
information of ∂ETr(w)/∂wi


Concept of Contour Plots
To illustrate the process of how the w vector changes, we can use
contour plots.
Simple examples of contour plots with 2 variables, w = [w1, w2]:

  ETr(w) = (w1 − 1)² + (w2 − 2)²
  ETr(w) = 4(w1 − 1)² + (w2 − 2)²
  ETr(w) = (1.73(w1 − 1) − (w2 − 2))² + 0.25((w1 − 1) + 1.73(w2 − 2))²

[Figure: contours of each function in the (w1, w2) plane; arrows show the direction of the gradient vector ∂ETr/∂w.]

The gradient vector is always perpendicular to the contour.
For BP, w moves along the negative direction of the gradient.
Conjugate Gradient Method

  Δw = η h

Let ∇E = ∂ETr(w)/∂w. Then

  h(epoch) = −∇E + γ h(epoch − 1),  with h(0) = −∇E

  γ = ‖∇E(epoch)‖² / ‖∇E(epoch − 1)‖²  (Fletcher-Reeves)

  γ = (∇E(epoch) − ∇E(epoch − 1))ᵀ ∇E(epoch) / ‖∇E(epoch − 1)‖²  (Polak-Ribiere)

η is determined by a line minimization method or a trust region method.

Speed: generally faster than BP.

Memory: a few vectors of length NW, where NW is the total number of NN
weights/parameters in w.
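The Fletcher-Reeves direction update can be sketched in a few lines; the function names and list-based gradients are illustrative assumptions.

```python
def fletcher_reeves_gamma(g_new, g_old):
    """gamma = ||grad(epoch)||^2 / ||grad(epoch-1)||^2 (Fletcher-Reeves)."""
    return sum(g * g for g in g_new) / sum(g * g for g in g_old)

def cg_direction(g_new, h_old, g_old):
    """h(epoch) = -grad(epoch) + gamma * h(epoch-1)."""
    gamma = fletcher_reeves_gamma(g_new, g_old)
    return [-gn + gamma * ho for gn, ho in zip(g_new, h_old)]

# One direction update from gradient [1, 0] (previous h = [-1, 0]) to [2, 0]
h = cg_direction([2.0, 0.0], [-1.0, 0.0], [1.0, 0.0])
```

Only the previous gradient and direction need to be stored, which is where the "few vectors of length NW" memory figure comes from.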
Illustration of Conjugate Direction
Simple examples of contour plots with 2 variables, w = [w1, w2]:

[Figure: contour plot in the (w1, w2) plane showing, at the current location of w, the gradient direction, the negative gradient direction, and the conjugate gradient direction.]



Quasi-Newton Method
Let H be the Hessian matrix of ETr w.r.t. w,
and B be the inverse of H.
Weight update: Δw = η h, where h = −B ∇E.

Use the information in Δw and Δg to approximate B:

  B(0) = I
  B(epoch) = B(epoch−1) + Δw Δwᵀ / (Δwᵀ Δg)
             − B(epoch−1) Δg Δgᵀ B(epoch−1) / (Δgᵀ B(epoch−1) Δg)
  (DFP formula)

where Δw = w(epoch) − w(epoch−1)
and Δg = ∇E(epoch) − ∇E(epoch−1).

Speed: fast.
Main memory needed: on the order of NW² (large).
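The DFP rank-two update can be sketched with plain nested lists (a toy sketch; real implementations use a linear algebra library, and B is assumed symmetric here):

```python
def dfp_update(B, dw, dg):
    """DFP update of the approximate inverse Hessian B:
    B + dw dw^T/(dw^T dg) - (B dg)(B dg)^T/(dg^T B dg)."""
    n = len(dw)
    Bg = [sum(B[i][j] * dg[j] for j in range(n)) for i in range(n)]  # B dg
    wTg = sum(dw[i] * dg[i] for i in range(n))                       # dw^T dg
    gBg = sum(dg[i] * Bg[i] for i in range(n))                       # dg^T B dg
    return [[B[i][j] + dw[i] * dw[j] / wTg - Bg[i] * Bg[j] / gBg
             for j in range(n)] for i in range(n)]

# One update from B = I on a quadratic whose Hessian is 2I
B1 = dfp_update([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], [2.0, 0.0])
```

The updated B satisfies the secant condition B Δg = Δw, which is what lets the method mimic Newton steps without ever forming the Hessian.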



Levenberg-Marquardt Method
Obtain Δw by solving the linear equations

  (Jᵀ J + μ I) Δw = −Jᵀ e

where e = [e1, e2, …, eNe]ᵀ, with each
  ei = (yj(xk, w) − djk) / (ymax,j − ymin,j)
for an (j, k) output/sample pair, and J is the Jacobian, J = (∂eᵀ/∂w)ᵀ.

  μ > 0: typical Levenberg-Marquardt
  μ = 0: Gauss-Newton

This method is good if e can be very small, i.e., small-residue
problems.
The computation needs an LU decomposition.
Main memory needed: on the order of NW² (large).
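For a single weight the linear system collapses to a scalar equation, which makes the role of μ easy to see. This one-parameter sketch is an illustration, not the general matrix solve:

```python
def lm_step_1d(J, e, mu):
    """Levenberg-Marquardt step for one weight: (J^T J + mu) dw = -J^T e.
    J, e : lists with one entry per residual. mu = 0 gives Gauss-Newton."""
    JTJ = sum(j * j for j in J)
    JTe = sum(j * ei for j, ei in zip(J, e))
    return -JTe / (JTJ + mu)

dw_gn = lm_step_1d([1.0, 1.0], [1.0, 1.0], 0.0)   # Gauss-Newton step
dw_lm = lm_step_1d([1.0, 1.0], [1.0, 1.0], 2.0)   # damped LM step
```

Increasing μ shrinks the step and rotates it toward steepest descent, which is how LM stays stable far from the solution.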
Other Training Methods
Huber-Quasi-Newton

Similar to quasi-Newton, except that the error function
for training is based on the Huber function rather than the
conventional least-squares error function.

The Huber formulation allows the training algorithm
to robustly handle both small random errors
and accidental large errors in the training data.
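The standard Huber function is quadratic for small residuals and linear for large ones; the threshold value δ = 1 below is an assumed default.

```python
def huber(r, delta=1.0):
    """Huber function of a residual r: 0.5*r^2 for |r| <= delta,
    delta*(|r| - 0.5*delta) beyond, so gross errors grow only linearly."""
    return 0.5 * r * r if abs(r) <= delta else delta * (abs(r) - 0.5 * delta)
```

Summing huber(ei) over all residuals in place of Σ ei² gives the robust training error that this method minimizes: a single gross outlier contributes linearly rather than quadratically, so it cannot dominate the fit.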



Other Training Methods (continued)
Simplex Method: uses information from ETr(w) only.

The method starts with several initial guesses of w. These
initial points form a simplex in the w space.

The method then iteratively updates the simplex using basic
moves such as reflection, expansion, and contraction, according
to the values of ETr(w) at the vertices of the simplex.

The error ETr(w) generally decreases as the simplex is updated.



Other Training Methods (continued)
Genetic Algorithm: uses information from ETr(w) only; searches
for a global minimum.

The algorithm starts with several initial points of w,
called a population of w.

A fitness value is defined for each w such that a w with
lower error ETr(w) has higher fitness.

w points with high fitness values are more likely to be selected
as parents, from whom new points of w, called offspring,
are produced.

This process continues, and the fitness of the population
improves.
Other Training Methods (continued)
Particle Swarm Optimization (PSO): uses information from ETr(w) only;
searches for a global minimum.

The algorithm starts with several initial points of w. Each point
of w is called a particle, and all the particles together are
called a swarm of w.

Let pb represent the historical best of a particle w, and gb
the historical best of all particles w in the swarm.

Let v be defined as the velocity of a particle w, computed as:

  v = c0·vold + c1·r1·(pb − w) + c2·r2·(gb − w)

where c0, c1, and c2 are constant weight parameters, r1 and r2
are random values between 0 and 1, and vold represents the v of the
particle in the previous iteration.

Each particle is then updated by w = w + v.
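The velocity and position update for a single particle can be sketched as below; the constants c0, c1, c2 are typical assumed values, not prescribed by the slides.

```python
import random

_rng = random.Random(0)  # fixed seed for a reproducible sketch

def pso_step(w, v, pb, gb, c0=0.7, c1=1.5, c2=1.5):
    """One PSO update for one particle:
    v = c0*v_old + c1*r1*(pb - w) + c2*r2*(gb - w);  w = w + v."""
    r1, r2 = _rng.random(), _rng.random()
    v_new = [c0 * vi + c1 * r1 * (p - wi) + c2 * r2 * (g - wi)
             for wi, vi, p, g in zip(w, v, pb, gb)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new
```

A particle sitting at both its own best and the swarm best, with zero velocity, stays put; otherwise it is pulled toward a random blend of the two bests.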


Qualitative Comparison of Different Algorithms

Convergence rate (fastest to slowest), which also orders the algorithms
from more to less memory need and implementation effort:

• Levenberg-Marquardt (for small-residue problems)
• Quasi-Newton
• Conjugate gradient
• BP


Comparison of Training Algorithms for 3-Conductor Microstrip Line
Example (5 input neurons, 28 hidden neurons, 5 output neurons)

Training Algorithm          No. of Epochs   Training Error (%)   Avg. Test Error (%)   CPU (s)
Adaptive Backpropagation        10755             0.224                 0.252            13724
Conjugate Gradient               2169             0.415                 0.473             5511
Quasi-Newton                     1007             0.227                 0.242             2034
Levenberg-Marquardt                20             0.276                 0.294             1453



Comparison of Training Algorithms for MESFET Example
(4 input neurons, 60 hidden neurons, 8 output neurons)

Training Algorithm          No. of Epochs   Training Error (%)   Avg. Test Error (%)   CPU (s)
Adaptive Backpropagation        15319             0.98                  1.04             11245
Conjugate Gradient               1605             0.99                  1.04              4391
Quasi-Newton                      570             0.88                  0.89              1574
Levenberg-Marquardt                12             0.97                  1.03              4322

