Counter Propagation Network
Since the input pattern looks more like a ‘T’, when the
network classifies it, it sees the input closely
resembling ‘T’ and outputs the pattern that represents
a ‘T’.
Pattern Recognition (cont.)
• Feed-forward networks
– Feed-forward NNs allow signals to travel one way
only; from input to output. There is no feedback
(loops) i.e. the output of any layer does not affect
that same layer.
– Feed-forward NNs tend to be straightforward
networks that associate inputs with outputs. They
are extensively used in pattern recognition.
– This type of organization is also referred to as
bottom-up or top-down.
Contd….
• Feedback networks
– Feedback networks can have signals traveling in
both directions by introducing loops in the
network.
– Feedback networks are dynamic; their 'state' is
changing continuously until they reach an
equilibrium point. They remain at the equilibrium
point until the input changes and a new
equilibrium needs to be found.
– Feedback architectures are also referred to as
interactive or recurrent, although the latter term is
often used to denote feedback connections in
single-layer organizations.
Diagram of an NN
• What is a bias?
• How can we learn the bias value?
• A node’s output:
– 1 if w1x1 + w2x2 + … + wnxn >= bias
– 0 otherwise
• Rewrite
– w1x1 + w2x2 + … + wnxn - bias >= 0
– w1x1 + w2x2 + … + wnxn + bias·(-1) >= 0
(so the bias can be treated as one more weight attached to a constant input of -1, and learned like any other weight)
• Consider a perceptron
• Its output is
– 1, if W1X1 + W2X2 > threshold
– 0, otherwise
• In terms of feature space
– hence, it can only classify examples if a line
(hyperplane more generally) can separate the
positive examples from the negative examples
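As a concrete illustration of this decision rule, here is a minimal Python sketch; the weight and threshold values are made-up illustrations, not values from the slides. With these particular numbers the unit computes logical AND, whose positive and negative examples are separated by the line x1 + x2 = 1.5.

# Minimal sketch of a two-input perceptron decision rule.
# The weight and threshold values are illustrative, not taken from the slides.

def perceptron_output(x1, x2, w1=1.0, w2=1.0, threshold=1.5):
    """Return 1 if the weighted sum exceeds the threshold, else 0."""
    return 1 if w1 * x1 + w2 * x2 > threshold else 0

# With w1 = w2 = 1 and threshold = 1.5 this computes logical AND:
# the line x1 + x2 = 1.5 separates the positive example (1,1)
# from the negative ones, so the function is linearly separable.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron_output(x1, x2))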
What can Perceptrons Represent ?
For example:
• Majority function - will be covered in this lecture.
• XOR function
Input1 Input2 Output
0 0 0
0 1 1
1 0 1
1 1 0
XOR is not linearly separable: no single line can separate its 1-outputs from its 0-outputs in feature space, so a perceptron cannot represent it.
Bad news:
- There are not many linearly separable functions.
Good news:
- There is a perceptron algorithm that will learn
any linearly separable function, given enough
training examples.
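The slides do not spell out the update rule itself, so the following is only a minimal sketch of the classic perceptron learning algorithm (which converges on any linearly separable training set); the function name, learning rate, and the AND data set are illustrative choices.

# Sketch of the perceptron learning rule.
# train_perceptron, lr, and the AND training data are illustrative choices.

def train_perceptron(samples, lr=0.1, epochs=50):
    """samples: list of ((x1, ..., xn), target) pairs with 0/1 targets."""
    n = len(samples[0][0])
    w = [0.0] * n            # weights
    b = 0.0                  # bias, learned as a weight on a constant input of -1
    for _ in range(epochs):
        for x, t in samples:
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else 0
            err = t - y      # zero when the example is already classified correctly
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b = b + lr * err * (-1)
    return w, b

# Learn the linearly separable AND function.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(data))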
Learning Linearly Separable Functions (2)
• Rule of thumb:
– the number of training examples should be at
least five to ten times the number of weights
of the network.
• Other rule:
N >= |W| / (1 - a)
where |W| = number of weights of the network and
a = expected accuracy on the test set.
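For example, under this rule a network with 80 weights trained to an expected test-set accuracy of a = 0.9 would need at least N = 80 / (1 - 0.9) = 800 training examples (an illustrative calculation, not a figure from the slides).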
Training Basics
Backward step: error propagation
Backprop adjusts the weights of the NN in order to minimize the network's total mean squared error.
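As a hedged illustration of that idea, the sketch below trains a tiny one-hidden-layer network by plain gradient descent on the mean squared error; the layer sizes, learning rate, and the toy target function y = x^2 are illustrative choices, not the slides' own example.

# Minimal backpropagation sketch: one hidden tanh layer, linear output, MSE loss.
# Layer sizes, learning rate, and the target function are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 41).reshape(-1, 1)
T = X ** 2                                    # target function to approximate

W1 = rng.normal(scale=0.5, size=(1, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)
lr = 0.05

for epoch in range(20000):
    # forward pass
    H = np.tanh(X @ W1 + b1)
    Y = H @ W2 + b2
    # backward pass: gradient of the (halved) mean squared error
    dY = (Y - T) / len(X)
    dH = (dY @ W2.T) * (1 - H ** 2)
    W2 -= lr * H.T @ dY;  b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)

print("final MSE:", float(np.mean((Y - T) ** 2)))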
Applications of Back Propagation Network
1. Classification
In this classification problem, the goal is to identify
whether a certain "data point" belongs to Class 1, 2,
or 3 (see above). Random points are assigned to a
certain class, and the neural network is trained to find
the pattern. When training is complete, it will use
what it has learned to accurately classify new points.
2. Function Approximation
In this problem, the network tries to approximate the
value of a certain function.
It is fed with noisy data, and the goal is to find the true
pattern. After training, the network successfully
estimates the value of the Gaussian function (below).
3. Time-series Prediction
In this problem, the goal is to design a neural
network to predict a value based on a given time-series
data (e.g. stock market prediction based on past trends).
To approach this problem, the inputs to the neural
network have to be reorganized into chunks (sliding
windows), and the resulting output is the next data item
directly following that chunk.
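A hedged sketch of that input refactoring is shown below; the window length and the toy price series are illustrative choices.

# Build (input chunk, next value) training pairs from a 1-D time series.
# The window size and the example series are arbitrary illustrations.

def make_windows(series, window=3):
    """Return (chunk, target) pairs for next-step prediction."""
    pairs = []
    for i in range(len(series) - window):
        chunk = series[i:i + window]      # network input
        target = series[i + window]       # value directly following the chunk
        pairs.append((chunk, target))
    return pairs

prices = [10.0, 10.4, 10.1, 10.8, 11.2, 11.0, 11.5]
for chunk, target in make_windows(prices):
    print(chunk, "->", target)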
The Learning Process
• Classification
In marketing: consumer spending pattern classification
In defence: radar and sonar image classification
In agriculture & fishing: fruit and catch grading
In medicine: ultrasound and electrocardiogram image
classification, EEGs, medical diagnosis
• Recognition and identification
In general computing and telecommunications:
speech, vision and handwriting recognition
In finance: signature verification and banknote verification
In engineering: product inspection monitoring and control
In defence: target tracking
In security: motion detection, surveillance image
analysis and fingerprint matching
• Forecasting and prediction
In finance: foreign exchange rate and stock
market forecasting
In agriculture: crop yield forecasting
In marketing: sales forecasting
UNIT-II
STATISTICAL METHODS
• Hopfield nets
• Cauchy training
• Simulated annealing
• The Boltzmann machine
• Associative memory
• Bidirectional associative memory -applications.
Hopfield Nets
• A Hopfield net is composed of binary threshold units
with recurrent connections between them. Recurrent
networks of non-linear units are generally very hard
to analyze. They can behave in many different ways:
– Settle to a stable state, oscillate, or follow chaotic
trajectories that cannot be predicted far into the
future.
• But Hopfield realized that if the connections are
symmetric, there is a global energy function
– Each “configuration” of the network has an energy.
– The binary threshold decision rule causes the
network to settle to an energy minimum.
The energy function
E = - Σ_i s_i b_i - Σ_{i<j} s_i s_j w_ij
The energy gap for unit i is
ΔE_i = E(s_i = 0) - E(s_i = 1) = b_i + Σ_j s_j w_ij
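A hedged sketch of this settling behaviour, using the binary threshold rule and the energy function above on a small network; the weights, biases, and starting state are illustrative values.

# Hopfield-style settling sketch: binary (0/1) threshold units, symmetric weights.
# The weights, biases, and starting state are illustrative values.
import numpy as np

W = np.array([[ 0.0,  3.0, -1.0],
              [ 3.0,  0.0,  2.0],
              [-1.0,  2.0,  0.0]])     # symmetric, zero diagonal
b = np.array([-1.0, -2.0, -1.0])       # biases
s = np.array([1, 0, 1])                # initial configuration

def energy(state):
    return -float(state @ b) - 0.5 * float(state @ W @ state)

for sweep in range(5):                  # asynchronous binary threshold updates
    for i in range(len(s)):
        gap = b[i] + W[i] @ s           # energy gap E(s_i=0) - E(s_i=1)
        s[i] = 1 if gap >= 0 else 0
    print("energy:", energy(s))         # never increases; the net settles to a minimum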
Settling to an energy minimum
(Figure: a small network with weights -100, +5 and +5, used to illustrate how the binary threshold rule settles into an energy minimum.)
How to make use of this type of computation
• We can improve efficiency by using sparse vectors and
only allowing one bit per weight.
– Turn on a synapse when its input and output units
are both active.
• For retrieval, set the output threshold equal to the
number of active input units.
– This makes false positives improbable.
(Figure: binary input units connected to output units with dynamic thresholds.)
An iterative storage method
Stochastic units
p(s_i = 1) = 1 / (1 + e^(-ΔE_i / T)) = 1 / (1 + e^(-Σ_j s_j w_ij / T))
where the energy gap is ΔE_i = E(s_i = 0) - E(s_i = 1) and T is the temperature.
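A hedged sketch of such a stochastic update (logistic acceptance controlled by the temperature T); the weights and starting state are illustrative.

# Stochastic binary unit: turns on with probability 1 / (1 + exp(-dE/T)).
# At high T decisions are nearly random; at low T they approach the
# deterministic threshold rule. The weights below are illustrative.
import math, random

def stochastic_update(i, s, W, T):
    gap = sum(W[i][j] * s[j] for j in range(len(s)))    # energy gap dE_i
    p_on = 1.0 / (1.0 + math.exp(-gap / T))
    s[i] = 1 if random.random() < p_on else 0

W = [[0, 2, -1], [2, 0, 1], [-1, 1, 0]]
s = [1, 0, 1]
for T in (10.0, 1.0, 0.1):              # gradually lower the temperature
    for sweep in range(20):
        for i in range(len(s)):
            stochastic_update(i, s, W, T)
    print("T =", T, "state:", s)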
The annealing trade-off
p(pick the higher energy state) = 1 / (1 + e^(ΔE / T)), where ΔE is the energy increase.
• At low temperature the equilibrium probabilities of
good states are much better than the equilibrium
probabilities of bad ones.
P_A / P_B = e^(-(E_A - E_B) / T)
How temperature affects transition probabilities
High temperature transition probabilities between two states A and B: p(A→B) = 0.2, p(B→A) = 0.1.
Low temperature transition probabilities: p(A→B) = 0.001, p(B→A) = 0.000001.
(Figure: state B lies at a lower energy than state A; lowering the temperature makes the ratio of the two transition probabilities favour B far more strongly, but individual transitions become much rarer.)
Thermal equilibrium
• Thermal equilibrium is a difficult concept!
– It does not mean that the system has settled down
into the lowest energy configuration.
– The thing that settles down is the probability
distribution over configurations.
• The best way to think about it is to imagine a huge
ensemble of systems that all have exactly the same
energy function.
– After running the systems stochastically in the
right way, we eventually reach a situation where
the number of systems in each configuration
remains constant even though any given system
keeps moving between configurations.
Simulated Annealing
Algorithm SIMULATED-ANNEALING
Begin
temp = INIT-TEMP;
place = INIT-PLACEMENT;
while (temp > FINAL-TEMP) do
while (inner_loop_criterion = FALSE) do
new_place = PERTURB(place);
ΔC = COST(new_place) - COST(place);
if (ΔC < 0) then
place = new_place;
else if (RANDOM(0,1) < e^(-ΔC/temp)) then
place = new_place;
temp = SCHEDULE(temp);
End.
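A hedged, runnable Python rendering of the same loop is given below; the cost function, perturbation, temperatures, and cooling factor are toy stand-ins chosen only for illustration, not the placement-specific routines named above.

# Simulated annealing sketch: minimize a toy 1-D cost function.
# INIT_TEMP, FINAL_TEMP, cost(), perturb(), and the cooling factor are illustrative.
import math, random

def cost(x):
    return x * x + 10 * math.sin(3 * x)       # toy cost with several local minima

def perturb(x):
    return x + random.uniform(-0.5, 0.5)      # small random move

temp, final_temp = 10.0, 0.01
x = random.uniform(-5, 5)                     # random initial "placement"
while temp > final_temp:
    for _ in range(100):                      # inner loop criterion: fixed trials per temperature
        x_new = perturb(x)
        dC = cost(x_new) - cost(x)
        if dC < 0 or random.random() < math.exp(-dC / temp):
            x = x_new                         # accept improving moves, and uphill moves with prob e^(-dC/temp)
    temp *= 0.9                               # cooling schedule
print("best found:", round(x, 3), round(cost(x), 3))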
Parameters
• INIT-TEMP = 4000000;
• INIT-PLACEMENT = Random;
• PERTURB(place)
1. Displacement of a block to a new position.
2. Interchange blocks.
3. Orientation change for a block.
• SCHEDULE.
Cooling schedule
Convergence of simulated annealing
(Figure: cost function C versus number of iterations. At INIT_TEMP moves are accepted unconditionally; as the temperature falls, uphill moves are accepted with probability e^(-ΔC/temp), producing occasional hill climbing; by FINAL_TEMP the search behaves like pure hill climbing.)
Algorithm for partitioning
Algorithm SA
Begin
t = t0;
cur_part = ini_part;
cur_score = SCORE(cur_part);
repeat
repeat
comp1 = SELECT(part1);
comp2 = SELECT(part2);
trial_part = EXCHANGE(comp1, comp2, cur_part);
trial_score = SCORE(trial_part);
δs = trial_score – cur_score;
if (δs < 0) then
cur_score = trial_score;
cur_part = MOVE(comp1, comp2);
else
r = RANDOM(0,1);
if (r < e^(-δs/t)) then
cur_score = trial_score;
cur_part = MOVE(comp1, comp2);
until (equilibrium at t is reached)
t = αt (0 < α < 1)
until (freezing point is reached)
End.
Qualitative Analysis
(Figure: a ball on a hilly cost surface. From its initial position, a greedy (hill-climbing) algorithm gets stuck at a locally optimum solution, whereas simulated annealing explores more, accepting the uphill move with a small probability and so can escape to a better minimum.)
Heteroassociative Memory
(Figure: heteroassociative network with input units x1, ..., xi, ..., xn, output units y1, ..., yj, ..., ym, and weights wij connecting input unit Xi to output unit Yj.)
(Figure: bidirectional associative memory architecture with an n-unit X layer and an m-unit Y layer. (a) Forward direction: activations x1(p), ..., xn(p) in the X layer drive y1(p), ..., ym(p) in the Y layer. (b) Backward direction: the Y layer drives the updated activations x1(p+1), ..., xn(p+1) in the X layer.)
The basic idea behind the BAM is to store pattern pairs
so that when n-dimensional vector X from set A is
presented as input, the BAM recalls m-dimensional vector
Y from set B, but when Y is presented as input, the BAM
recalls X.
To develop the BAM, we need to create a correlation
matrix for each pattern pair we want to store. The
correlation matrix is the matrix product of the input
vector X and the transpose of the output vector, Y^T.
The BAM weight matrix is the sum of all correlation
matrices, that is,
W = Σ_{m=1}^{M} X_m (Y_m)^T
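A hedged sketch of building that weight matrix and recalling stored pairs in both directions, using bipolar (+1/-1) vectors; the two stored pairs are illustrative.

# BAM sketch: W = sum over pairs of X_m Y_m^T, with bidirectional recall.
# The stored pairs are illustrative bipolar vectors.
import numpy as np

pairs = [(np.array([ 1, -1,  1, -1]), np.array([ 1, -1])),
         (np.array([-1, -1,  1,  1]), np.array([-1,  1]))]

W = sum(np.outer(x, y) for x, y in pairs)        # n x m weight matrix

def threshold(net, previous):
    # bipolar threshold; a unit keeps its previous value when its net input is zero
    return np.where(net > 0, 1, np.where(net < 0, -1, previous))

def recall(x, steps=5):
    y = np.ones(W.shape[1], dtype=int)
    for _ in range(steps):                       # alternate X -> Y and Y -> X until stable
        y = threshold(x @ W, y)
        x = threshold(W @ y, x)
    return x, y

x_probe = np.array([1, -1, 1, -1])               # (possibly noisy) version of a stored X
print(recall(x_probe))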
The trained net can be used with:
– input of an x vector only
– input of a y vector only
– input of an x:y pair, possibly with some distorted or
missing elements in either or both vectors.
Full Counterpropagation (cont.)
• Phase 1
• The units in the cluster layer compete. The
learning rule for weight updates on the winning
cluster unit is (only the winning unit is allowed
to learn)
w_iJ(new) = w_iJ(old) + α [x_i - w_iJ(old)],   i = 1, 2, ..., n
u_kJ(new) = u_kJ(old) + β [y_k - u_kJ(old)],   k = 1, 2, ..., m
• Phase 2
– The weights from the winning cluster unit J to the output units are
adjusted so that the vector of activations of the units in the Y output
layer, y*, is an approximation to the input vector y, and x* is an
approximation to the input vector x. The weight updates for the units
in the Y output and X output layers are
v_Jk(new) = v_Jk(old) + a [y_k - v_Jk(old)],   k = 1, 2, ..., m
t_Ji(new) = t_Ji(old) + b [x_i - t_Ji(old)],   i = 1, 2, ..., n
(Figure: full counterpropagation architecture. The X input units X1 ... Xn and Y input units Y1 ... Ym feed the cluster (hidden) layer Z1 ... Zp through weights w and u; the cluster layer feeds the Y* output units Y1* ... Ym* through weights v and the X* output units X1* ... Xn* through weights t.)
Full Counterpropagation Algorithm
x : input training vector, x = (x1, ..., xi, ..., xn)
y : target output corresponding to input x, y = (y1, ..., yk, ..., ym)
zj : activation of cluster layer unit Zj
x* : computed approximation to vector x
y* : computed approximation to vector y
wij : weight from X input layer unit Xi to cluster layer unit Zj
ukj : weight from Y input layer unit Yk to cluster layer unit Zj
vjk : weight from cluster layer unit Zj to Y output layer unit Yk*
tji : weight from cluster layer unit Zj to X output layer unit Xi*
α, β : learning rates for weights into the cluster layer (Kohonen learning)
a, b : learning rates for weights out from the cluster layer (Grossberg learning)
Full Counterpropagation Algorithm (phase 1)
• dot product (find the cluster with the largest net input)
net_j = Σ_i x_i w_ij + Σ_k y_k u_kj
• Euclidean distance (find the cluster with the smallest
squared distance from the input)
D_j = Σ_i (x_i - w_ij)² + Σ_k (y_k - u_kj)²
Full Counterpropagation Application
• x*_i = t_Ji
• y*_k = v_Jk
(where J is the index of the winning cluster unit)
Full counterpropagation example
• Function approximation of y = 1/x
• After the training phase we have the following cluster
units (w ≈ stored x-value, v ≈ stored y-value):
• Cluster unit   v     w
• z1 0.11 9.0
• z2 0.14 7.0
• z3 0.20 5.0
• z4 0.30 3.3
• z5 0.60 1.6
• z6 1.60 0.6
• z7 3.30 0.3
• z8 5.00 0.2
• z9 7.00 0.14
• z10 9.00 0.11
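In application mode the trained net behaves like a nearest-cluster lookup table; a hedged Python sketch using the trained values above (forward direction only):

# Full CPN application sketch for y = 1/x: find the cluster whose w weight is
# closest to the input x and output that cluster's v weight as y*.
# The (w, v) pairs are the trained values listed in the table above.
clusters = [(9.0, 0.11), (7.0, 0.14), (5.0, 0.20), (3.3, 0.30), (1.6, 0.60),
            (0.6, 1.60), (0.3, 3.30), (0.2, 5.00), (0.14, 7.00), (0.11, 9.00)]

def approximate(x):
    w, v = min(clusters, key=lambda wv: (x - wv[0]) ** 2)   # winning cluster unit
    return v                                                # y* = v_J

for x in (0.5, 2.0, 6.0):
    print(x, "->", approximate(x), "(true 1/x =", round(1 / x, 3), ")")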
Full counterpropagation example (cont.)
(Figure: the trained network for y = 1/x. The input value is presented at X1 (and the target at Y1); the cluster units Z1 ... Z10 hold the stored values, and the approximations are produced at Y1* and X1*.)
Full counterpropagation example (cont.)
(Figure: network with input layer X1 ... Xi ... Xn, cluster layer Z1 ... Zj ... Zp, and output layer Y1 ... Yk ... Ym; weights w connect the input layer to the cluster layer and weights u connect the cluster layer to the output layer.)
Forward Only Counterpropagation Algorithm
• Step 1. Initialize weights, learning rates, etc.
• Step 2. While stopping condition for Phase 1 is false,
do Step 3-8
• Step 3. For each training input x, do Step 4-6
• Step 4. Set X input layer activations to vector x
• Step 5. Find winning cluster unit; call its index J
• Step 6. Update weights for unit ZJ:
w_iJ(new) = w_iJ(old) + α [x_i - w_iJ(old)],   i = 1, 2, ..., n
• Step 7. Reduce learning rate
• Step 8. Test stopping condition for Phase 1 training.
• Step 9. While stopping condition for Phase 2 is false, do
Step 10-16
• Step 10. For each training input pair x:y, do Step 11-14
• Step 11. Set X input layer activations to vector x ;
• set Y input layer activations to vector y.
• Step 12. Find winning cluster unit; call its index J
• Step 13. Update weights into unit ZJ (α is now small):
w_iJ(new) = w_iJ(old) + α [x_i - w_iJ(old)],   i = 1, 2, ..., n
• Step 14. Update weights from unit ZJ to the output
units:
u_Jk(new) = u_Jk(old) + a [y_k - u_Jk(old)],   k = 1, 2, ..., m
• Step 15. Reduce learning rate a.
• Step 16. Test stopping condition for Phase 2 training.
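A hedged sketch of this two-phase training (Kohonen clustering followed by Grossberg output learning) for a 1-D function-approximation task; the cluster count, learning rates, decay, and training data are illustrative choices.

# Forward-only counterpropagation sketch for approximating y = 1/x.
# Number of clusters, learning rates, and training data are illustrative.
import random

xs = [random.uniform(0.1, 10.0) for _ in range(2000)]
w = [random.uniform(0.1, 10.0) for _ in range(10)]    # input -> cluster weights
u = [0.0] * 10                                         # cluster -> output weights

alpha, a = 0.5, 0.5
for phase in (1, 2):
    for x in xs:
        J = min(range(10), key=lambda j: (x - w[j]) ** 2)   # winning cluster unit
        if phase == 1:
            w[J] += alpha * (x - w[J])                      # Phase 1: Kohonen learning
        else:
            w[J] += 0.01 * (x - w[J])                       # Phase 2: alpha kept small
            u[J] += a * (1.0 / x - u[J])                    # Grossberg learning of the output
    alpha *= 0.5
    a *= 0.5

for x in (0.5, 2.0, 5.0):
    J = min(range(10), key=lambda j: (x - w[j]) ** 2)
    print(x, "-> y* =", round(u[J], 2))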
Forward Only Counterpropagation Application
• Cluster unit w u
• z1 0.5 5.5
• z2 1.5 0.75
• z3 2.5 0.4
• z4 . .
• z5 . .
• z6 . .
• z7 . .
• z8 . .
• z9 . .
• z10 9.5 0.1
Function Approximation
Introduction of SOM contd…
• Sample data
• Weights
• Output nodes
Structure of the map
• Initialize Map
• For t from 0 to 1
– Select a sample
– Get best matching
unit
– Scale neighbors
– Increase t a small
amount
End for
m_i(t+1) = m_i(t) + α(t) [x(t) - m_i(t)],   for i ∈ N_c(t)
Initializing the weights
• Determining Neighbors
– Neighborhood size: α(t) · exp[-(2/3) · ||r_i - r_m||]
where α(t) is the learning coefficient and r_i, r_m are the
position vectors of node i and of the winning node.
• Decreases over time
– Effect on neighbors
• Learning
m_i(t+1) = m_i(t) + α(t) [x(t) - m_i(t)]   if i ∈ N_c(t),
m_i(t+1) = m_i(t)   otherwise.
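A hedged sketch of that update loop for a small one-dimensional map; the grid size, decay schedules, and the random two-dimensional input data are illustrative.

# Self-organizing map sketch: a 1-D grid of units with a shrinking neighborhood.
# Grid size, learning-rate decay, and the 2-D input data are illustrative.
import random

units = [[random.random(), random.random()] for _ in range(10)]   # weight vectors m_i

def train(data, epochs=30):
    for t in range(epochs):
        alpha = 0.5 * (1 - t / epochs)               # learning coefficient, decays over time
        radius = max(1, int(5 * (1 - t / epochs)))   # neighborhood size, shrinks over time
        for x in data:
            # best matching unit: smallest squared distance to the sample
            c = min(range(len(units)),
                    key=lambda i: sum((units[i][d] - x[d]) ** 2 for d in range(2)))
            for i in range(len(units)):
                if abs(i - c) <= radius:             # i is in the neighborhood N_c(t)
                    for d in range(2):
                        units[i][d] += alpha * (x[d] - units[i][d])

data = [[random.random(), random.random()] for _ in range(200)]
train(data)
print(units[:3])                                     # inspect a few trained weight vectors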
Necessary conditions
• Architecture
– Word category Map
– Document category Map
• Modes of Operation
– Supervised
• (some information about the class is given; for
example, in a collection of newsgroup articles the
name of the newsgroup may be supplied)
– Unsupervised
• (no information provided)
Word Category Map
• Preprocessing
– Remove unimportant data (like images, signatures)
– Remove articles, prepositions, etc.
– Words occurring fewer than some fixed number of times
are treated as "don't care"
– Replace synonymous words
Averaging Method
• Word code vector
–Each word represented by a unique vector (with
dimension n ~ 100)
–Values may be random
• Context Vector
– For the word at position i, with word code vector x(i), the
context vector combines the averaged predecessor, the word
itself scaled by ε, and the averaged successor:
X(i) = [ E{x(i-1) | x(i)} ;  ε x(i) ;  E{x(i+1) | x(i)} ]
where:
– E{} = estimate of the expected value of x over the text corpus
– ε = small scalar number
(contd.)
Example
Document Category Map
• Phonetic Typewriter
Current Applications (contd…)
F2 – Association ART
Learning such associations
consists of four steps:
1. Choosing most
relevant association.
2. Selecting association.
3. Determining if vectors
are within vigilance.
4. Learning.
Internet Identity's Phishing Trends report for the
second quarter of 2009 said that Avalanche "have
detailed knowledge of commercial banking
platforms, particularly treasury management
systems and the Automated Clearing House (ACH)
system. They are also performing successful real-
time man-in-the-middle attacks that defeat two-
factor security tokens."
Avalanche had many similarities to the previous
group Rock Phish - the first phishing group to use
automated techniques - but was greater in scale and
volume. Avalanche hosted its domains on
compromised computers (a botnet). There was no
single hosting provider, making it difficult to take
down the domains and requiring the involvement of
the responsible domain registrar.
The Formal Avalanche
• Speech Recognition
• Radar Analysis
• Sonar echo classification
UNIT-V
NEOCOGNITRON