08 Neural Networks
Simple Neural Networks
This is in your brain
[Figure: a single neural unit — input layer x1, x2, x3, +1; weights w1, w2, w3 and bias b; weighted sum ∑ producing z; non-linear transform σ producing a; output value y]
Neural unit

At its heart, a neural unit takes a weighted sum of its inputs, with one additional term in the sum called a bias term. Given a set of inputs x1 … xn, a unit has a set of corresponding weights w1 … wn and a bias b, so the weighted sum z can be expressed as:

z = b + ∑i wi xi

It's often more convenient to express this weighted sum using vector notation; recall from linear algebra that a vector is, at heart, just a list or array of numbers. We'll therefore talk about z in terms of a weight vector w, a scalar bias b, and an input vector x, and we'll replace the sum with the convenient dot product:

z = w · x + b      (7.2)

As defined in Eq. 7.2, z is just a real valued number.

Instead of using z, a linear function of x, as the output, neural units apply a non-linear activation function f to z. We will refer to the output of this function as the activation value for the unit, a. Since we are just modeling a single unit here, the activation for the node is in fact the final output of the network, which we'll generally call y. So the value y is defined as:

y = a = f(z)
Non-Linear Activation Functions
y = a = f (z)
The sigmoid for logistic regression:
We'll discuss non-linear functions f() below (the sigmoid, the tanh, and the rectified linear unit or ReLU), but it's pedagogically convenient to start with the sigmoid, since we saw it in Chapter 5:
Sigmoid
y = σ(z) = 1 / (1 + e^(−z))      (7.3)
The sigmoid (shown in Fig. 7.1) has a number of advantages; it maps the output into the range (0, 1), which is useful in squashing outliers toward 0 or 1. And it's differentiable, which as we saw in Section ?? will be handy for learning.
Final function the unit is computing
Substituting Eq. 7.2 into Eq. 7.3 gives us the output of a neural unit:

y = σ(w·x + b) = 1 / (1 + exp(−(w·x + b)))
Figure 7.2 shows a final schematic of a basic neural unit. In this example the unit takes 3 input values x1, x2, and x3, and computes a weighted sum, multiplying each value by a weight (w1, w2, and w3, respectively), adds them to a bias term b, and then passes the resulting sum through a sigmoid function to result in a number between 0 and 1.
Final unit again
[Figure: the final unit — input layer x1, x2, x3, +1; weights w1, w2, w3 and bias b; output value y = a]
An example

Let's walk through an example just to get an intuition. Suppose a unit has the following weight vector and bias:

w = [0.2, 0.3, 0.9]
b = 0.5

What would this unit do with the following input vector?

x = [0.5, 0.6, 0.1]

The resulting output y would be:

y = σ(w·x + b) = 1 / (1 + e^(−(w·x + b))) = 1 / (1 + e^(−(.5×.2 + .6×.3 + .1×.9 + .5))) = 1 / (1 + e^(−0.87)) = .70
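To double-check the arithmetic, here is a minimal NumPy sketch of this unit; the weights, bias, and input are the ones from the example above, and the helper name `sigmoid` is just my own label.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1 / (1 + np.exp(-z))

w = np.array([0.2, 0.3, 0.9])   # weights from the example
b = 0.5                          # bias from the example
x = np.array([0.5, 0.6, 0.1])   # input vector from the example

z = np.dot(w, x) + b             # weighted sum: 0.87
y = sigmoid(z)                   # activation: about 0.70
print(z, y)
```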
Non-Linear Activation Functions besides sigmoid

In practice, the sigmoid is not commonly used as an activation function. A function that is almost always better is the tanh function shown in Fig. 7.3a, a variant of the sigmoid that ranges from -1 to +1:

y = (e^z − e^(−z)) / (e^z + e^(−z))      (7.5)

The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU, shown in Fig. 7.3b. It's just the same as z when z is positive, and 0 otherwise:

y = max(z, 0)      (7.6)

[Figure: plots of a) tanh and b) ReLU (Rectified Linear Unit)]
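For concreteness, a small NumPy sketch of these three activation functions; the test values in `z` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))            # squashes z into (0, 1)

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))   # ranges over (-1, 1)

def relu(z):
    return np.maximum(z, 0)                # z when positive, 0 otherwise

z = np.array([-2.0, 0.0, 0.87, 3.0])       # illustrative values
print(sigmoid(z))   # approx [0.12, 0.50, 0.70, 0.95]
print(tanh(z))      # approx [-0.96, 0.00, 0.70, 1.00]
print(relu(z))      # [0.   0.   0.87 3.  ]
```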
Simple Neural Networks

The XOR problem
Minsky and Papert (1969): Can neural units compute simple functions of input?

One of the most clever demonstrations of the need for multi-layer networks was the proof by Minsky and Papert (1969) that a single neural unit cannot compute some very simple functions of its input. Consider the task of computing elementary logical functions of two inputs, like AND, OR, and XOR. As a reminder, here are the truth tables for those functions:
AND OR XOR
x1 x2 y x1 x2 y x1 x2 y
0 0 0 0 0 0 0 0 0
0 1 0 0 1 1 0 1 1
1 0 0 1 0 1 1 0 1
1 1 1 1 1 1 1 1 0
This example was first shown for the perceptron, which is a very simple neural unit that has a binary output and does not have a non-linear activation function.
Perceptrons
A very simple neural unit
• Binary output (0 or 1)
• No non-linear activation function
Easy to build AND or OR with perceptrons

Early in the history of neural networks it was realized that the power of networks, as with the real neurons that inspired them, comes from combining these units into larger networks.
Why? Perceptrons are linear classifiers
The perceptron equation, given inputs x1 and x2, is the equation of a line:
w1·x1 + w2·x2 + b = 0
[Figure: decision boundaries in the (x1, x2) plane — a) x1 AND x2 and b) x1 OR x2 can each be separated by a single line; c) x1 XOR x2 has no such line]
[Figure: a two-layer network computing XOR — inputs x1, x2, +1 feed hidden units h1, h2 with weights 1, 1, 1, 1 and biases 0, -1; in the hidden-unit space (h1, h2), the XOR cases become linearly separable]
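A sketch of the point in code, assuming NumPy: a single linear-threshold unit can compute AND (the weights below are one illustrative choice, not from the slides), but no choice of w1, w2, b draws a line separating the XOR cases; a two-layer ReLU network does compute XOR. The hidden-layer weights 1, 1, 1, 1 and biases 0, -1 match the figure above; the output weights 1, -2 are the usual choice in this standard solution.

```python
import numpy as np

def perceptron(x, w, b):
    # binary output, no non-linear activation beyond the threshold
    return 1 if np.dot(w, x) + b > 0 else 0

# AND with a single unit (illustrative weights; many choices work)
w_and, b_and = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w_and, b_and))   # 0, 0, 0, 1

# XOR with a two-layer ReLU network
W = np.array([[1.0, 1.0],      # weights into hidden unit h1
              [1.0, 1.0]])     # weights into hidden unit h2
b = np.array([0.0, -1.0])      # hidden-layer biases
u = np.array([1.0, -2.0])      # output weights (usual choice)

def xor_net(x):
    h = np.maximum(W @ x + b, 0)   # ReLU hidden layer
    return u @ h                    # weighted sum at the output

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x)))  # 0.0, 1.0, 1.0, 0.0
```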
[Figure: a single unit as a 1-layer network — input layer x1 … xn, +1 (vector x); weights w1 … wn (vector w) and bias b (scalar)]
Multinomial Logistic Regression as a 1-layer Network

Fully connected single layer network:

y = softmax(Wx + b)

[Figure: input layer x1 … xn, +1 (scalars); W is a matrix, b is a vector; output layer y1 … yn (softmax nodes); y is a vector]
Reminder: softmax: a generalization of sigmoid
Example:
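A minimal sketch of softmax in NumPy; the scores in `z` are made-up illustrative values.

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability, exponentiate, normalize
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])   # made-up scores
p = softmax(z)
print(p)          # probabilities, largest for the largest score
print(p.sum())    # 1.0 (up to floating point)
```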
Two-Layer Network with scalar output

Output layer (σ node): y = σ(z), where z = Uh; y is a scalar
U
Hidden units (σ nodes) — could be ReLU or tanh
W, b   (W_ji is the weight connecting input unit i to hidden unit j; b is a vector)
Input layer: x1 … xn, +1 (vector)
Two-Layer Network with softmax output

Output layer (softmax nodes): y = softmax(z), where z = Uh; y is a vector
U
Hidden units (σ nodes) — could be ReLU or tanh
W, b
Input layer: x1 … xn, +1 (vector)
Multi-layer Notation

y = a[2]
a[2] = g[2](z[2])        g[2]: sigmoid or softmax
z[2] = W[2] a[1] + b[2]
a[1] = g[1](z[1])        g[1]: ReLU
z[1] = W[1] a[0] + b[1]
Input layer: x1 … xn, +1 is a[0]
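Putting the notation together, a minimal NumPy sketch of the 2-layer forward pass; the layer sizes and random parameter values are arbitrary placeholders, not from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
n0, n1, n2 = 4, 3, 2                      # placeholder layer sizes

a0 = rng.normal(size=n0)                  # a[0] = x, the input vector
W1, b1 = rng.normal(size=(n1, n0)), rng.normal(size=n1)
W2, b2 = rng.normal(size=(n2, n1)), rng.normal(size=n2)

z1 = W1 @ a0 + b1                         # z[1] = W[1] a[0] + b[1]
a1 = relu(z1)                             # a[1] = g[1](z[1]),  g[1] = ReLU
z2 = W2 @ a1 + b2                         # z[2] = W[2] a[1] + b[2]
y = softmax(z2)                           # y = a[2] = g[2](z[2]),  g[2] = softmax
print(y, y.sum())                         # a distribution over the 2 outputs
```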
Replacing the bias unit
Let's switch to a notation without the bias unit
Just a notational change
1. Add a dummy node a0=1 to each layer
2. Its weight w0 will be the bias
3. So in the input layer a[0]_0 = 1,
◦ and likewise a[1]_0 = 1, a[2]_0 = 1, …
Replacing the bias unit

Instead of:        x = x1, x2, …, x_n0
We'll do this:     x = x0, x1, x2, …, x_n0
Replacing the bias unit

[Figure: the same 2-layer network drawn two ways — "Instead of" (left): input layer x1 … x_n0 plus an explicit +1 bias node with bias b, hidden layer h1 … h_n1 (weights W), output layer y1 … y_n2 (weights U); "We'll do this" (right): a dummy input x0 = 1 replaces the +1 node and the bias is folded into W]
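A small sketch of why this is just a notational change, assuming NumPy: prepend a dummy 1 to the input and fold b in as an extra column of W, and the result of Wx + b is unchanged (all values below are random placeholders).

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))     # placeholder weights: 3 hidden units, 4 inputs
b = rng.normal(size=3)          # placeholder bias vector
x = rng.normal(size=4)          # placeholder input vector

h_with_bias = W @ x + b                     # with an explicit bias term

x_aug = np.concatenate(([1.0], x))          # x = x0, x1, ..., xn with x0 = 1
W_aug = np.hstack((b[:, None], W))          # bias folded in as the weights on x0
h_folded = W_aug @ x_aug

print(np.allclose(h_with_bias, h_folded))   # True: just a notational change
```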
Simple Neural Networks and Neural Language Models

Applying feedforward networks to real tasks
Use cases for feedforward networks
Let's consider 2 (simplified) sample tasks:
1. Text classification
2. Language modeling
Classification: Sentiment Analysis
Feedforward nets for simple classification

[Figure: logistic regression (input features f1, f2, … fn as x1 … xn, weights W, σ output) compared with a 2-layer feedforward network (the same features, weights W into a hidden layer, weights U, σ output)]
Neural Net Classification with embeddings as input features!
Issue: texts come in different sizes
This assumes a fixed input length (3)!
Kind of unrealistic.
Some simple solutions (more sophisticated solutions later)
1. Make the input the length of the longest review
• If shorter then pad with zero embeddings
• Truncate if you get longer reviews at test time
2. Create a single "sentence embedding" (the same dimensionality as a word) to represent all the words (see the sketch after this list)
• Take the mean of all the word embeddings
• Take the element-wise max of all the word embeddings
  ◦ For each dimension, pick the max value from all words
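A sketch of the two pooling options for building a single "sentence embedding"; the word embeddings here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
# one d-dimensional embedding per word in the text (here 5 words, d = 4)
word_embeddings = rng.normal(size=(5, 4))

mean_pooled = word_embeddings.mean(axis=0)   # mean of all the word embeddings
max_pooled = word_embeddings.max(axis=0)     # element-wise max: for each dimension,
                                             # pick the max value over all the words

print(mean_pooled.shape, max_pooled.shape)   # (4,) (4,): same size as a single word
```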
Reminder: Multiclass Outputs
What if you have more than two output classes?
◦ Add more output units (one for each class)
◦ And use a “softmax layer”
Neural Language Models (LMs)
Language Modeling: Calculating the probability of the
next word in a sequence given some history.
• We've seen N-gram based LMs
• But neural network LMs far outperform n-gram
language models
State-of-the-art neural LMs are based on more
powerful neural network technology like Transformers
But simple feedforward LMs can do almost as well!
Simple feedforward Neural Language Models
Neural Language Model
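As a sketch of what a simple feedforward LM computes, assuming the common formulation (concatenate the embeddings of the previous N context words, pass them through a hidden layer, then a softmax over the vocabulary); every size, index, and parameter value below is a placeholder.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(3)
V, d, dh, N = 10000, 50, 100, 3          # vocab size, embedding dim, hidden dim, context size

E = rng.normal(size=(V, d))              # embedding matrix, one row per vocabulary word
W = rng.normal(size=(dh, N * d))         # hidden-layer weights
b = rng.normal(size=dh)                  # hidden-layer bias
U = rng.normal(size=(V, dh))             # output weights, one row per vocabulary word

context = [42, 7, 1337]                  # placeholder indices of the previous N words
e = np.concatenate([E[i] for i in context])   # concatenated context embeddings
h = relu(W @ e + b)                      # hidden layer
p = softmax(U @ h)                       # probability of each possible next word
print(p.shape, p.argmax())               # (10000,) and the most probable next-word index
```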
Why Neural LMs work better than N-gram LMs
Training data:
We've seen: I have to make sure that the cat gets fed.
Never seen: dog gets fed
Test data:
I forgot to make sure that the dog gets ___
N-gram LM can't predict "fed"!
Neural LM can use similarity of "cat" and "dog"
embeddings to generalize and predict “fed” after dog
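A toy illustration of that similarity intuition: the embedding vectors below are made up (real embeddings are learned), but similar vectors for "cat" and "dog" mean contexts containing one look, to the network, much like contexts containing the other.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up 4-dimensional embeddings, just to illustrate the idea
emb = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "the": np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine(emb["cat"], emb["dog"]))   # high (about 0.99): similar words
print(cosine(emb["cat"], emb["the"]))   # low (about 0.12): dissimilar words
```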