CISC 867 Deep Learning: 12. Recurrent Neural Networks

- Recurrent neural networks are useful for processing sequential data like time series.
- The Jena Climate dataset contains over 8 years of weather observations recorded every 10 minutes, consisting of 14 measured features like temperature, pressure, and humidity.
- To train a model to forecast temperature 24 hours ahead, the input would be a sequence of 720 past observations (5 days' worth) and the target output would be the temperature 24 hours after the last observation in the input sequence.


CISC 867 Deep Learning

12. Recurrent Neural Networks

Credits: Vassilis Athitsos, Yu Li

1
Sequential Data

• Sequential data, as the name implies, are sequences.


• What is the difference between a sequence and a set?
• A set is a collection of elements, with no inherent order.
– {1,2,3} = {3,1,2} = {2,1,3}, the order in which we write the elements
does not matter.
• A sequence 𝑿 is a set of elements, together with a total
order imposed on those elements.
– A total order describes, for any two elements 𝒙𝟏 , 𝒙𝟐 , which of them
comes before and which comes after.
– Sequences (1,2,3), (3,1,2), (2,1,3) are all different from each other,
because the order of elements matters.

2
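
• A quick illustration (not from the original slides): in Python, sets compare equal regardless of order, while tuples (sequences) do not.

print({1, 2, 3} == {3, 1, 2})   # True: the same set
print((1, 2, 3) == (3, 1, 2))   # False: different sequences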
Time Series

• A time series is a sequence of vectors.


• Each vector can be thought of as an observation, or
measurement, that corresponds to one specific moment
in time.
• Examples:
– Stock market prices (for a single stock, or for multiple stocks).
– Heart rate of a patient over time.
– Position of one or multiple people/cars/airplanes over time.
– Speech: represented as a sequence of audio measurements at
discrete time steps.
– A musical melody: represented as a sequence of pairs (note,
duration).

3
Dimensionality of Vectors

• In the simplest case, a time series is just a sequence of numbers.
• An example is the sequence of daily high temperatures (in
Fahrenheit) in Arlington, from January 1 to January 4, 2022.
– We get the time series (74, 40, 54, 54).
– This time series has length 4.
• In general, a time series is a sequence of vectors.
– All vectors in the time series must have the same dimensionality.
• For example, we can take the previous sequence of daily high
temperatures, and include the daily low temperature as well.
– We get the time series ((74,24), (40,19), (55, 25), (64,33)).
– It has length 4, and every element is a 2D vector.

4
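
• As a small illustration (not from the original slides), the length-4 time series above can be stored as a NumPy array with one row per day:

import numpy as np

# Each row is one day's (high, low) temperature in Fahrenheit.
series = np.array([[74, 24], [40, 19], [55, 25], [64, 33]])
print(series.shape)   # (4, 2): length 4, each element a 2D vector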
Time Series Terminology

• We can use the term “sequence” to refer to a time series.


– This is correct usage. Any time series is a sequence. The reverse is not true: there are sequences (for example, strings) that are NOT time series.
• The length of a sequence is the number of elements in the
sequence.
– For example, sequence ((74,24), (40,19), (55, 25), (64,33)) has length 4.
• A feature refers to a specific dimension of the vectors that the
time series contains.
– For example, in an earlier slide we said that the first feature in sequence
((74,24), (40,19), (55, 25), (64,33)) is the daily high temperature. The
second feature is the daily low temperature.
• We can refer to an element of a time series as a “feature
vector”.
5
Strings and Time Series

• Strings are an example of sequential data: a string is a sequence of characters from some alphabet.
• Strings are sequential: the order of the characters matters.
– Strings “mile” and “lime” are not equal.
– Compare to sets {‘m’, ‘i’, ‘l’, ‘e’} and {‘l’, ‘i’, ‘m’, ‘e’}, which are equal.
• Strings are not time series, because their elements are
characters (symbols from a finite and discrete alphabet)
and not vectors.
• However, we can easily convert a string dataset to a time
series dataset.
– We map each character to a one-hot vector.
– The dimensionality of these one-hot vectors is equal to the number of
letters in the alphabet.
6
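
• A minimal sketch of this conversion (not from the original slides), using a small made-up alphabet:

import numpy as np

alphabet = "eilm"                      # hypothetical alphabet for this example
index = {ch: i for i, ch in enumerate(alphabet)}

def to_one_hot(s):
    # Map each character of s to a one-hot vector of dimension len(alphabet).
    x = np.zeros((len(s), len(alphabet)))
    for t, ch in enumerate(s):
        x[t, index[ch]] = 1.0
    return x

print(to_one_hot("mile").shape)   # (4, 4): a time series of length 4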
Text and Time Series

• Text is another example of sequential data: a piece of text data can be seen as a sequence of letters, or as a sequence of words.
• Using one-hot vectors we can convert a text dataset to a
time series dataset.
– We can map each letter to a one-hot vector, or each word to a one-hot
vector. Mapping words is more common.
– There are other methods as well for converting a piece of text to a
time series.
– We will cover this topic in detail in a few weeks.

7
Example: The Jena Climate Dataset

• The Jena Climate dataset is a weather time series dataset.


– Publicly available at: https://www.kaggle.com/mnassrib/jena-climate
• The data was recorded at the Weather Station of the Max
Planck Institute for Biogeochemistry in Jena, Germany.
• 8 years of data: January 1, 2009 to December 31, 2016.
• A feature vector was recorded every 10 minutes during those
eight years.
• Each feature vector is 14-dimensional.
• These are some of the recorded features:
– Air temperature.
– Atmospheric pressure.
– Humidity.
– Wind direction…

8
Jena Climate Dataset: A Closer Look

• You can download the dataset from: https://www.kaggle.com/mnassrib/jena-climate
• The dataset is saved in a CSV file with 15 columns:
– 0: Date and Time
– 1: Atmospheric pressure, in millibars.
– 2: Temperature in Celsius.
– 3: Temperature in Kelvin.
– 4: Temperature in Celsius relative to humidity. According to the
dataset web page, “Dew Point is a measure of the absolute amount of
water in the air, the DP is the temperature at which the air cannot hold
all the moisture in it and water condenses.”
– 5: Relative humidity.
– 6: Saturation vapor pressure.
– 7: Vapor pressure.

9
Jena Climate Dataset: A Closer Look

• The dataset is saved in a CSV file with 15 columns:


– 8: Vapor pressure deficit.
– 9: Specific humidity.
– 10: Water vapor concentration.
– 11: Airtight.
– 12: Wind speed.
– 13: Maximum wind speed.
– 14: Wind direction in degrees.

• As you see, some of these features (like “saturation vapor pressure”, “airtight”) are pretty esoteric to non-specialists, whereas others (like temperature, wind speed) have a meaning that we can all understand.

10
Reading the Data

fname = "jena_climate_2009_2016.csv"
with open(fname) as f:
data = f.read()
lines = data.split("\n")
lines = lines[1:] # The first line in the file is header information

temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))
for i, line in enumerate(lines):
values = [float(x) for x in line.split(",")[1:]]
temperature[i] = values[1]
raw_data[i] = values

• This code creates two time series: temperature, and raw_data.


11
Reading the Data

for i, line in enumerate(lines):
    values = [float(x) for x in line.split(",")[1:]]
    temperature[i] = values[1]
    raw_data[i] = values

• Variable temperature is a 1D time series.


– temperature[i] is the i-th temperature observation, in Celsius.
• Variable raw_data is a 14-dimensional time series.
– raw_data[i] is a 14-dimensional vector.
– From the original 15 columns of the dataset, we exclude column 0,
which was the date and time.

12
Visualizing the Data

plt.plot(range(0, len(temperature)), temperature)

plots all 420,451 values in the temperature array (one observation every ten minutes, for eight years).

plt.plot(range(0, 1440), temperature[:1440])

plots the first 1440 values, corresponding to the first 10 days.

13
Inputs and Target Outputs

• Let’s look at our forecasting task description again: given data from the last five days, the goal is to predict the temperature exactly 24 hours from now.
• What will be the input to this “forecasting” system?
• What will be the output of the system?

14
Inputs and Target Outputs

• Let’s look at our forecasting task description again: given data from the last five days, the goal is to predict the temperature exactly 24 hours from now.
• What will be the input to this “forecasting” system?
• What will be the output of the system?
• The input will be a sequence of feature vectors containing
all values from a period of five days.
– Number of columns: 14, since we have 14 features in our data.
– Number of rows: 5 days * 24 hours * 6 observations per hour = 720.
– Shape of the input: 720x14, which gives 10080 numbers.
• The output will be a single number: the temperature (in
Celsius) 24 hours after the last observation in the input.
15
Creating a Training Set

• Our training data is a 14-dimensional time series of length 210,225.
• We want to extract a random training example of length
720.
• So, we pick a random start point in the time series, and
we get the next 720 elements.
• What is the smallest and largest legal value for the start
point?
• Smallest: 0
• Largest: timeseries length – 720 – 24*6.
– Why? We need enough room to choose 720 elements, plus enough
room to look 24 hours past the last element, to get the target value
that we aim to forecast.

16
Creating a Training Set

• What is the smallest and largest legal value for the start
point?
• Smallest: 0
• Largest: timeseries length – 720 – 24*6.
– Why? We need enough room to choose 720 elements, plus enough
room to look 24 hours past the last element, to get the target value that
we aim to forecast.
[Figure: a random start point selects an input window of length 720; the target value is 24 hours (24*6 time steps) after the last element of the input.]
17
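
• A minimal sketch of this sampling procedure (not the course’s exact data pipeline), assuming the raw_data and temperature arrays built earlier:

import numpy as np

input_length = 720   # 5 days * 24 hours * 6 observations per hour
delay = 24 * 6       # the target is 24 hours after the last input observation

def random_example(raw_data, temperature, rng=np.random.default_rng()):
    # Largest legal start point: leave room for 720 inputs plus 24 hours.
    max_start = len(raw_data) - input_length - delay
    start = rng.integers(0, max_start + 1)
    x = raw_data[start : start + input_length]           # shape (720, 14)
    y = temperature[start + input_length - 1 + delay]    # temperature 24 hours later
    return x, y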
A Simple RNN Model

• This is an example model that is small enough to draw


easily:
– The input to the model is a time series 𝒙 of length 3.
– Each element of 𝒙 has two dimensions.

• So, 𝒙 = ((𝑥1,1, 𝑥1,2), (𝑥2,1, 𝑥2,2), (𝑥3,1, 𝑥3,2)).


[Figure: unrolled RNN. Inputs 𝑥1,1, 𝑥1,2, 𝑥2,1, 𝑥2,2, 𝑥3,1, 𝑥3,2 feed second-layer units 𝑈2,1, 𝑈2,2, 𝑈2,3, which share the input weights 𝑤2,1,1, 𝑤2,1,2 and the recurrent weight 𝑢2,1, and produce outputs 𝑧2,1, 𝑧2,2, 𝑧2,3.]
18
A Simple RNN Model

• Previously, we used to draw input layers on the left, and output layers on the right.
• Here, it is easier to draw input layers at the bottom, and output layers at the top.

[Figure: the same unrolled RNN as on slide 18.]
19
A Simple RNN Model

• There is nothing different about the input layer; it is as usual.
• 𝒙 = ((𝑥1,1, 𝑥1,2), (𝑥2,1, 𝑥2,2), (𝑥3,1, 𝑥3,2)), so we need six input units to represent the input.

[Figure: the same unrolled RNN as on slide 18.]
20
A Simple RNN Model

• The second layer is a recurrent layer.


– Because of this layer, this network is not a feedforward neural
network.
– In a feedforward neural network, the inputs to a layer come from the
outputs of the previous layer.
– Here, some inputs to the second layer come from the second layer itself.

[Figure: the same unrolled RNN as on slide 18.]
21
A Simple RNN Model

• Notice that unit 𝑈2,2 receives inputs not only from input units, but also
from second-layer unit 𝑈2,1 .
• Similarly, unit 𝑈2,3 receives inputs not only from input units, but also
from second-layer unit 𝑈2,2 .
• These connections between units of the same layer are called recurrent
connections.
[Figure: the same unrolled RNN as on slide 18.]
22
A Simple RNN Model

• Notice that all units in the second layer share the same weights.
• The weights connecting two input units to a second-layer unit are
denoted with the same two symbols.
• There is also a new symbol, the recurrent weight 𝑢2,1 for the recurrent
connections between 𝑈2,1 and 𝑈2,2 , and between 𝑈2,2 and 𝑈2,3 .

[Figure: the same unrolled RNN as on slide 18.]
23
Computing the Output

• The outputs of the second layer play two roles:


– They are used as inputs to other units in the second layer.
– They are also the outputs of the entire network.
• In more complicated models, we could have more layers on
top.
[Figure: the same unrolled RNN as on slide 18.]
24
Computing the Output

• Computing the output of each unit needs to follow the order of the time steps.
• First we compute, from bottom to top, the output of all units that correspond to time step 1.

[Figure: the same unrolled RNN as on slide 18.]
25
Computing the Output

• Computing the output of each unit needs to follow the order of the time steps.
• Second we compute, from bottom to top, the output of all units that correspond to time step 2.
– This way we can use 𝑧2,1 from time step 1.

[Figure: the same unrolled RNN as on slide 18.]
26
Computing the Output

• Computing the output of each unit needs to follow the order of the time steps.
• Third we compute, from bottom to top, the output of all units that correspond to time step 3.
– This way we can use 𝑧2,2 from time step 2.

[Figure: the same unrolled RNN as on slide 18.]
27
Order of Output Computations

• In a feedforward neural network, we simply followed the order of layers, from input to output.
• In an RNN, we first follow the order of time steps.
– Within a single time step, we follow the order of layers, from input to output.

[Figure: the same unrolled RNN as on slide 18.]
28
Translating to Keras

model = keras.Sequential([keras.Input(shape=(3,2)),
keras.layers.SimpleRNN(1)])

• This piece of code implements our network.


– Parameter 1 for the SimpleRNN layer specifies one unit per time step.

[Figure: the same unrolled RNN as on slide 18.]
29
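
• As a quick check (not from the original slides), we can build this toy model in Keras and count its parameters; the SimpleRNN(1) layer has 4: the two input weights, the recurrent weight, and a bias.

from tensorflow import keras

model = keras.Sequential([keras.Input(shape=(3, 2)),
                          keras.layers.SimpleRNN(1)])
model.summary()   # the SimpleRNN layer reports 4 parameters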
Translating to Keras

model = keras.Sequential([keras.Input(shape=(3,2)),
keras.layers.SimpleRNN(3)])

• Here the SimpleRNN layer has three units per time step.
• We do not show the connections and weights anymore.
– Within a time step, all three 2nd layer units are connected to both inputs.
– All 2nd layer units from the previous step are inputs to all 2nd layer units
in the next step.
[Figure: unrolled RNN with three second-layer units per time step (𝑈2,𝑡,1, 𝑈2,𝑡,2, 𝑈2,𝑡,3) on top of two input units per time step.]
30
Simplifying the Drawings

• When we draw an RNN, we typically do not show each individual unit.
• Instead, we group units into blocks, such that all units in a block belong to the same layer and the same time step.

[Figure: the same network as on slide 30.]
31
Simplifying the Drawings

• When we draw an RNN, we typically do not show each individual unit.
• Instead, we group units into blocks, such that all units in a block belong to the same layer and the same time step.
• In this simplified drawing:
– 𝐿1,1 groups together units 𝑈1,1,1 and 𝑈1,1,2 .
– 𝐿1,2 groups together units 𝑈2,1,1 , 𝑈2,1,2 , and 𝑈2,1,3 .

[Figure: the same network drawn with blocks: input blocks 𝐿1,1, 𝐿2,1, 𝐿3,1 below second-layer blocks 𝐿1,2, 𝐿2,2, 𝐿3,2.]
32
Simplifying the Drawings

• Now that we have simplified the drawing, we can draw connections again.
• An arrow means that all units of one group are connected
to all units of the other group.
• Of course, now it is not clear how many units are in each
layer.
– When we simplify, some details are inevitably lost.

[Figure: the block diagram from slide 32, now with arrows between the blocks.]
33
Simplifying the Drawings

• We can always make up conventions to provide more information.
• For example, here we show for each block:
– The type of layer that it belongs to.
– The number of units.

[Figure: three time steps, each drawn as an Input(2) block feeding a Recurrent(3) block.]
34
Simplifying the Drawings

• This is a common way to draw RNNs.
• Since the structure at each time step looks the same, we just show three steps:
– “Previous”, “current”, and “next”.
– Oftentimes more detail is shown in the current step.

[Figure: Input(2) feeding SimpleRNN(3), shown at time steps 𝑡−1, 𝑡, 𝑡+1.]
35
An RNN Network for Jena Climate

model = keras.Sequential([keras.Input(shape=input_shape),
                          keras.layers.SimpleRNN(16),
                          keras.layers.Dense(1),])

model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])

callbacks = [keras.callbacks.ModelCheckpoint("jena_LSTM1_16.keras",
                                             save_best_only=True)]

history_lstm = model.fit(training_inputs, training_targets, epochs=20,
                         validation_data=(validation_inputs, validation_targets),
                         callbacks=callbacks)

• This code trains a network with an RNN layer.

36
An RNN Network for Jena Climate

model = keras.Sequential([keras.Input(shape=input_shape),
keras.layers.SimpleRNN(16),
keras.layers.Dense(1),])

• The highlighted line creates the recurrent layer.


– It has 16 units for each time step.
– There are 120 time steps (not shown in this drawing).

Recurrent(16) Recurrent(16) Recurrent(16)

Input(14) Input(14) Input(14) 37


An RNN Network for Jena Climate

model = keras.Sequential([keras.Input(shape=input_shape),
keras.layers.SimpleRNN(16),
keras.layers.Dense(1),])

• We have a fully connected output layer.
• This layer only connects to the recurrent units of the LAST TIME STEP.

Dense(1)
Recurrent(16) Recurrent(16) Recurrent(16)
Input(14) Input(14) Input(14)
38


Detour: The return_sequences option

model = keras.Sequential([keras.Input(shape=input_shape),
keras.layers.SimpleRNN(16, return_sequences=False),
keras.layers.Dense(1),])

• SimpleRNN layers have an option called return_sequences.
– The default value is False.
– This specifies that the output of the layer is just the output of the last time step.

Dense(1)
Recurrent(16) Recurrent(16) Recurrent(16)
Input(14) Input(14) Input(14)
39


Detour: The return_sequences option

model = keras.Sequential([keras.Input(shape=input_shape),
keras.layers.SimpleRNN(16, return_sequences=True),
keras.layers.Flatten(),
keras.layers.Dense(1),])

• If return_sequences is True, the output of the layer is the output of all time steps.

Dense(1)
Recurrent(16) Recurrent(16) Recurrent(16)
Input(14) Input(14) Input(14)
40


Detour: The return_sequences option

model = keras.Sequential([keras.Input(shape=input_shape),
keras.layers.SimpleRNN(16, return_sequences=True),
keras.layers.Flatten(),
keras.layers.Dense(1),])

• Note the flattening step in this case, between the SimpleRNN layer and the Dense layer.

Dense(1)
Recurrent(16) Recurrent(16) Recurrent(16)
Input(14) Input(14) Input(14)
41


Detour: The return_sequences option

model = keras.Sequential([keras.Input(shape=input_shape),
keras.layers.SimpleRNN(16, return_sequences=True),
keras.layers.Flatten(),
keras.layers.Dense(1),])

• We will revisit the return_sequences=True option later.
– For temperature forecasting, it gives worse accuracy.

Dense(1)
Recurrent(16) Recurrent(16) Recurrent(16)
Input(14) Input(14) Input(14)
42
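
• A small sketch (not from the original slides) of how the two settings change the output shape, assuming inputs with 120 time steps of 14 features as in the Jena model:

from tensorflow import keras

inputs = keras.Input(shape=(120, 14))
last_only = keras.layers.SimpleRNN(16, return_sequences=False)(inputs)
all_steps = keras.layers.SimpleRNN(16, return_sequences=True)(inputs)
print(last_only.shape)   # (None, 16): only the last time step
print(all_steps.shape)   # (None, 120, 16): one 16-dimensional output per time step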


LSTM Layer

• LSTM stands for Long Short-Term Memory.


• Like SimpleRNN, an LSTM layer is a recurrent layer.
• LSTM layers are used widely in practice.
• However, the description of an LSTM layer is more
complicated than that of a SimpleRNN layer.
– An LSTM layer produces a secondary output, shown in red. This is
called the carry, and it is computed using some special rules.

LSTM(M)

Input(N)
𝑡−1 𝑡 𝑡+1 43
SimpleRNN Computations at Time t

• The output of a simple RNN layer at time 𝑡 depends on:


– An N-dimensional input vector 𝑥𝑡 from the input layer.
– An M-dimensional vector 𝑧𝑡−1 , produced by the RNN layer at time 𝑡 −
1.
– 𝑊𝑧 , which is an 𝑀 × 𝑁 matrix of weights applied to 𝑥𝑡 .
– 𝑈𝑧 , which is an 𝑀 × 𝑀 matrix of weights applied to 𝑧𝑡−1 .
– 𝐵𝑧 , which is an 𝑀-dimensional vector of bias weights.
• The output 𝑧𝑡 is computed as:

𝑧𝑡 = tanh(𝑊𝑧 𝑥𝑡 + 𝑈𝑧 𝑧𝑡−1 + 𝐵𝑧)

– Note that tanh is the default activation function for a SimpleRNN layer, but another function could be substituted.

[Figure: at time 𝑡, the SimpleRNN(M) block receives 𝑥𝑡 from the Input(N) block and 𝑧𝑡−1 from the previous time step, and produces 𝑧𝑡.]
44
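
• A minimal NumPy sketch of this computation (not from the original slides), unrolling the formula above over an input sequence:

import numpy as np

def simple_rnn_step(x_t, z_prev, Wz, Uz, Bz):
    # One time step: z_t = tanh(Wz x_t + Uz z_{t-1} + Bz)
    return np.tanh(Wz @ x_t + Uz @ z_prev + Bz)

def simple_rnn(x, Wz, Uz, Bz):
    # x has shape (T, N); Wz is (M, N), Uz is (M, M), Bz is (M,).
    z = np.zeros(Wz.shape[0])        # initial state z_0 = 0
    for x_t in x:
        z = simple_rnn_step(x_t, z, Wz, Uz, Bz)
    return z                          # output of the last time step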
LSTM Computations at Time t

• In addition to the inputs and weights of a simple RNN layer, an LSTM layer also has:
– An 𝑀-dimensional carry vector 𝑐𝑡−1 , produced at time 𝑡 − 1.
– 𝑉𝑧 , which is an 𝑀 × 𝑀 matrix of weights applied to 𝑐𝑡−1 .
• The output 𝑧𝑡 is now computed with a new formula:

𝑧𝑡 = tanh(𝑊𝑧 𝑥𝑡 + 𝑈𝑧 𝑧𝑡−1 + 𝑉𝑧 𝑐𝑡−1 + 𝐵𝑧)

[Figure: at time 𝑡, the LSTM(M) block receives 𝑥𝑡 from the Input(N) block, plus 𝑧𝑡−1 and the carry 𝑐𝑡−1 from the previous time step, and produces 𝑧𝑡 and a new carry 𝑐𝑡.]
45
LSTM Computations at Time t

• To complete the description of the LSTM layer, we must specify how to compute the carry vector 𝑐𝑡 at time 𝑡.
• To do that, we use some additional weight matrices:
– 𝑊𝑖 , 𝑊𝑓 , 𝑊𝑘 are three 𝑀 × 𝑁 weight matrices applied to 𝑥𝑡 .
– 𝑈𝑖, 𝑈𝑓, 𝑈𝑘 are three 𝑀 × 𝑀 weight matrices applied to 𝑧𝑡−1.
– 𝐵𝑖 , 𝐵𝑓 , 𝐵𝑘 are three 𝑀-dimensional vectors of bias weights.
• Then, we compute:

𝑖𝑡 = 𝜎(𝑊𝑖 𝑥𝑡 + 𝑈𝑖 𝑧𝑡−1 + 𝐵𝑖)
𝑘𝑡 = 𝜎(𝑊𝑘 𝑥𝑡 + 𝑈𝑘 𝑧𝑡−1 + 𝐵𝑘)
𝑓𝑡 = 𝜎(𝑊𝑓 𝑥𝑡 + 𝑈𝑓 𝑧𝑡−1 + 𝐵𝑓)
𝑐𝑡 = 𝑖𝑡 ∗ 𝑘𝑡 + 𝑐𝑡−1 ∗ 𝑓𝑡

Symbol ∗ means “pointwise multiplication”.

[Figure: the LSTM(M) block at time 𝑡, as on slide 45.]
46
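
• A minimal NumPy sketch following this slide’s simplified description of the LSTM computations; the standard Keras LSTM layer is organized somewhat differently, so treat this only as an illustration of the formulas above:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, z_prev, c_prev, Wz, Uz, Vz, Bz, Wi, Ui, Bi, Wk, Uk, Bk, Wf, Uf, Bf):
    z_t = np.tanh(Wz @ x_t + Uz @ z_prev + Vz @ c_prev + Bz)   # layer output
    i_t = sigmoid(Wi @ x_t + Ui @ z_prev + Bi)                  # new information
    k_t = sigmoid(Wk @ x_t + Uk @ z_prev + Bk)                  # importance of new information
    f_t = sigmoid(Wf @ x_t + Uf @ z_prev + Bf)                  # importance of old information
    c_t = i_t * k_t + c_prev * f_t                              # new carry
    return z_t, c_t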
An Intuitive Interpretation

𝑖𝑡 = 𝜎(𝑊𝑖 𝑥𝑡 + 𝑈𝑖 𝑧𝑡−1 + 𝐵𝑖)
𝑘𝑡 = 𝜎(𝑊𝑘 𝑥𝑡 + 𝑈𝑘 𝑧𝑡−1 + 𝐵𝑘)
𝑓𝑡 = 𝜎(𝑊𝑓 𝑥𝑡 + 𝑈𝑓 𝑧𝑡−1 + 𝐵𝑓)
𝑐𝑡 = 𝑖𝑡 ∗ 𝑘𝑡 + 𝑐𝑡−1 ∗ 𝑓𝑡

• There is a somewhat intuitive interpretation that motivated these formulas. According to this interpretation:
– 𝑖𝑡 represents new information, computed at time 𝑡.
– 𝑘𝑡 represents the importance of each dimension in 𝑖𝑡 .
– 𝑐𝑡−1 represents old information, computed from previous time steps.
– 𝑓𝑡 represents the importance of each dimension in 𝑐𝑡−1 .
• If all values of 𝑘𝑡 are 1, and all values of 𝑓𝑡 are 0, then 𝑐𝑡 = 𝑖𝑡 .
– New information 𝑖𝑡 replaces old information 𝑐𝑡−1, which is “forgotten”.

47
An Intuitive Interpretation

𝑖𝑡 = 𝜎(𝑊𝑖 𝑥𝑡 + 𝑈𝑖 𝑧𝑡−1 + 𝐵𝑖)
𝑘𝑡 = 𝜎(𝑊𝑘 𝑥𝑡 + 𝑈𝑘 𝑧𝑡−1 + 𝐵𝑘)
𝑓𝑡 = 𝜎(𝑊𝑓 𝑥𝑡 + 𝑈𝑓 𝑧𝑡−1 + 𝐵𝑓)
𝑐𝑡 = 𝑖𝑡 ∗ 𝑘𝑡 + 𝑐𝑡−1 ∗ 𝑓𝑡

• There is a somewhat intuitive interpretation that motivated these formulas. According to this interpretation:
– 𝑖𝑡 represents new information, computed at time 𝑡.
– 𝑘𝑡 represents the importance of each dimension in 𝑖𝑡 .
– 𝑐𝑡−1 represents old information, computed from previous time steps.
– 𝑓𝑡 represents the importance of each dimension in 𝑐𝑡−1 .
• If all values of 𝑘𝑡 are 0, and all values of 𝑓𝑡 are 1, then 𝑐𝑡 = 𝑐𝑡−1 .
– New information 𝑖𝑡 is ignored, old information 𝑐𝑡−1 is retained in 𝑐𝑡.

48
An Intuitive Interpretation

𝑖𝑡 = 𝜎(𝑊𝑖 𝑥𝑡 + 𝑈𝑖 𝑧𝑡−1 + 𝐵𝑖)
𝑘𝑡 = 𝜎(𝑊𝑘 𝑥𝑡 + 𝑈𝑘 𝑧𝑡−1 + 𝐵𝑘)
𝑓𝑡 = 𝜎(𝑊𝑓 𝑥𝑡 + 𝑈𝑓 𝑧𝑡−1 + 𝐵𝑓)
𝑐𝑡 = 𝑖𝑡 ∗ 𝑘𝑡 + 𝑐𝑡−1 ∗ 𝑓𝑡

• In the typical case, individual values of 𝑘𝑡 and 𝑓𝑡 will range between 0 and 1.
• Then, each dimension of vector 𝑐𝑡 will be a weighted sum of:
– new information from the corresponding dimension of 𝑖𝑡 , with weight
specified by the corresponding dimension of 𝑘𝑡 .
– old information from the corresponding dimension of 𝑐𝑡−1 , with weight
specified by the corresponding dimension of 𝑓𝑡 .

49
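
• A tiny worked example (not from the original slides) of the carry update 𝑐𝑡 = 𝑖𝑡 ∗ 𝑘𝑡 + 𝑐𝑡−1 ∗ 𝑓𝑡, with M = 2:

import numpy as np

i_t = np.array([0.5, 0.5])     # new information
k_t = np.array([1.0, 0.0])     # keep the new value in dimension 1, ignore it in dimension 2
c_prev = np.array([1.0, 2.0])  # old information
f_t = np.array([0.0, 1.0])     # forget the old value in dimension 1, keep it in dimension 2
print(i_t * k_t + c_prev * f_t)   # [0.5 2. ]: dimension 1 is new, dimension 2 keeps the old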
The Vanishing Gradient Problem

• An additional justification for the LSTM architecture is the “vanishing gradient” problem.
– Under some neural network architectures, some weights do not get sufficiently updated during backpropagation, due to very small gradients.
– Consequently, backpropagation does not learn good values for those weights.

[Figure: the same unrolled RNN as on slide 18.]
50


The Vanishing Gradient Problem

• To understand the problem, consider the toy RNN model below.
– Let’s assume that the only output of the model is 𝑧2,3 , so that the model
estimates a single number.
– Consider how weight 𝑤2,1,1 (as an example) gets updated during
training.
[Figure: the same unrolled RNN as on slide 18.]
51


The Vanishing Gradient Problem

• Weight 𝑤2,1,1 influences the output in multiple ways.


– It gets multiplied by input 𝑥1,1 during the first time step.
– It gets multiplied by input 𝑥2,1 during the second time step.
– It gets multiplied by input 𝑥3,1 during the third time step.

[Figure: the same unrolled RNN as on slide 18.]
52


The Vanishing Gradient Problem

• During backpropagation, we update 𝑤2,1,1 based on the three different ways in which it influenced the output.
• However, the third time step will often influence
disproportionately how 𝑤2,1,1 is updated.
• Why?
[Figure: the same unrolled RNN as on slide 18.]
53


The Vanishing Gradient Problem

𝜕𝐸/𝜕𝑤2,1,1 = (𝜕𝐸/𝜕𝑧2,3) (𝜕𝑧2,3/𝜕𝑎2,3) (𝜕𝑎2,3/𝜕𝑤2,1,1)

• We start by applying the chain rule to compute the partial derivative of the loss 𝐸 with respect to 𝑤2,1,1.
– Here 𝑎2,3 denotes the weighted sum of inputs (the pre-activation) of unit 𝑈2,3, and 𝑧2,3 is its output after the activation function.
[Figure: the same unrolled RNN as on slide 18.]
54


The Vanishing Gradient Problem

𝜕𝑎2,3/𝜕𝑤2,1,1 = 𝑥3,1 + (𝜕𝑎2,3/𝜕𝑧2,2) (𝜕𝑧2,2/𝜕𝑎2,2) (𝜕𝑎2,2/𝜕𝑤2,1,1)

𝜕𝑎2,3/𝜕𝑤2,1,1 = 𝑥3,1 + (𝜕𝑎2,3/𝜕𝑧2,2) (𝜕𝑧2,2/𝜕𝑎2,2) (𝑥2,1 + (𝜕𝑎2,2/𝜕𝑧2,1) (𝜕𝑧2,1/𝜕𝑎2,1) (𝜕𝑎2,1/𝜕𝑤2,1,1))
[Figure: the same unrolled RNN as on slide 18.]
55


The Vanishing Gradient Problem

• Combining the calculations from the previous slides we get:

𝜕𝐸/𝜕𝑤2,1,1 =
(𝜕𝐸/𝜕𝑧2,3) (𝜕𝑧2,3/𝜕𝑎2,3) 𝑥3,1   (influence of 3rd time step: product of 3 numbers)
+ (𝜕𝐸/𝜕𝑧2,3) (𝜕𝑧2,3/𝜕𝑎2,3) (𝜕𝑎2,3/𝜕𝑧2,2) (𝜕𝑧2,2/𝜕𝑎2,2) 𝑥2,1   (influence of 2nd time step: product of 5 numbers)
+ (𝜕𝐸/𝜕𝑧2,3) (𝜕𝑧2,3/𝜕𝑎2,3) (𝜕𝑎2,3/𝜕𝑧2,2) (𝜕𝑧2,2/𝜕𝑎2,2) (𝜕𝑎2,2/𝜕𝑧2,1) (𝜕𝑧2,1/𝜕𝑎2,1) 𝑥1,1   (influence of 1st time step: product of 7 numbers)

• The influence of each step is the product of many terms, which typically are between 0 and 1.
56
The Vanishing Gradient Problem

• We can extrapolate the previous formula to our temperature forecasting RNN.
• There, the input length is 120 steps.
• For any weight 𝑤𝑖 connecting the SimpleRNN layer to the input, the partial derivative 𝜕𝐸/𝜕𝑤𝑖 will be a sum of 120 terms.
• Each term of 𝜕𝐸/𝜕𝑤𝑖 will correspond to the influence of a single time step.
– The influence of time step 120 will be a product of 3 numbers.
– The influence of time step 119 will be a product of 5 numbers.
– The influence of time step 118 will be a product of 7 numbers.

– The influence of time step 1 will be a product of 241 numbers.

57
The Vanishing Gradient Problem

• Each term of 𝜕𝐸/𝜕𝑤𝑖 will correspond to the influence of a single time step.
– The influence of time step 120 will be a product of 3 numbers.
– The influence of time step 119 will be a product of 5 numbers.
– The influence of time step 118 will be a product of 7 numbers.

– The influence of time step 1 will be a product of 241 numbers.
• So, the influence of time step 1 will be a product of 241
numbers, which will usually be between 0 and 1.
• This will be a very small quantity.
• Overall, the influence of a time step drops exponentially as we
move from the end towards the beginning of the input time
series.

58
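
• A quick numerical illustration (not from the original slides), assuming each factor in the product is around 0.9:

print(0.9 ** 3)     # about 0.73  (influence of time step 120)
print(0.9 ** 241)   # about 9e-12 (influence of time step 1): effectively vanished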
LSTMs and Vanishing Gradients

• The carry output can (potentially) remember information from earlier time steps.
• This allows calculations from earlier time steps to influence the
output more heavily than in a SimpleRNN layer.
– Influencing the output more heavily means higher contributions to the
partial derivatives of weights.
– That way, the model can learn to give more importance to earlier time
steps.

LSTM(M)

Input(N)
𝑡−1 𝑡 𝑡+1 59
Detour: ResNet

• The vanishing gradient problem is not particular to RNNs.


• Any deep network involves a sequence of calculations,
mapping inputs to outputs.
– Calculations earlier in the sequence end up making smaller
contributions to partial derivatives of weights.
• For convolutional neural networks (CNNs), a popular method
for resolving the vanishing gradient problem is ResNet.
• We will not discuss ResNet in this class.
– The method is somewhat similar to LSTM, by providing a way for
earlier calculations to be “remembered” in later layers.
• If you are interested in learning more about ResNet, a good
starting point is the Wikipedia article:
https://en.wikipedia.org/wiki/Residual_neural_network

60
GRU

• GRU stands for Gated Recurrent Unit.


• GRU layers are yet another type of recurrent layer.
• GRU layers can be used instead of SimpleRNN or LSTM.
• You can think of a GRU layer as an approach that is more complicated than a SimpleRNN layer and simpler than an LSTM layer.
• We will not discuss GRU layers any further in this class.
• As usual, the Wikipedia article is a good starting point
for more info:

https://en.wikipedia.org/wiki/Gated_recurrent_unit

61
Recurrent Dropout

• Dropout can be used with recurrent layers (such as SimpleRNN, LSTM, GRU).
• However, the picture is more complicated, because the
same weights are used in multiple time steps.
• In practice, better results are usually obtained if the same
weights are “dropped” at each time step.
• A normal Keras dropout layer does not know how to do
that.
• To use dropout properly with recurrent layers, you
should use the optional parameters dropout and
recurrent_dropout.

62
Recurrent Dropout

model = keras.Sequential([keras.Input(shape=input_shape),
keras.layers.LSTM(32, dropout=0.3,
recurrent_dropout=0.25),
keras.layers.Dropout(0.5),
keras.layers.Dense(1),])

• This piece of code shows an example of how to combine different dropouts.
• The LSTM layer specifies a dropout value of 0.3.
– This means that 30% of the weights between the input layer and the
LSTM layer will be dropped for each training object.
• The LSTM layer specifies a recurrent_dropout value of 0.25.
– This means that 25% of the weights applied to outputs and carry
values from the previous time step will be dropped for each training
object.

63
Recurrent Dropout

model = keras.Sequential([keras.Input(shape=input_shape),
keras.layers.LSTM(32, dropout=0.3,
recurrent_dropout=0.25),
keras.layers.Dropout(0.5),
keras.layers.Dense(1),])

• Notice that we still use a regular Keras dropout layer between the LSTM layer and the fully connected output layer.
– Here we specify that, for each training object, 50% of the weights
connecting the LSTM outputs to the Dense layer will be dropped.
• Optional parameters dropout and recurrent_dropout specify
how to do dropout of weights incoming to the recurrent layer.
• For weights outgoing from the layer and incoming to a fully
connected layer, a regular dropout layer should be used.

64
Bidirectional Layers

• A recurrent layer processes information from time step to time step, in chronological order.
• Would it make a difference if information was processed
in reverse chronological order?
– It might.
• How can we know which order is better?
– We usually don’t.

Recurrent

Input
𝑡−1 𝑡 𝑡+1 65
Bidirectional Layers

• A bidirectional layer processes information in both chronological and anti-chronological order.
• Essentially, a bidirectional layer consists of two recurrent
layers, each processing information in different order.
• The output of the bidirectional layer is simply the
merged output of both layers.
[Figure: two recurrent layers (SimpleRNN, LSTM, or GRU) drawn side by side. The left layer processes inputs 𝒙1, 𝒙2, 𝒙3 in chronological order (𝑡 = 1, 2, 3) and produces outputs 𝒛1, 𝒛2, 𝒛3; the right layer processes the same inputs in REVERSE chronological order and produces outputs 𝒛4, 𝒛5, 𝒛6.]
66

[Same figure.]
67

Both layers receive the exact same inputs.
68

These two recurrent layers, combined, form what we call a bidirectional layer.
69

The output of the bidirectional layer is the concatenation of the outputs of the two recurrent layers.
70
Bidirectional Layers in Keras

model = keras.Sequential([keras.Input(shape=input_shape),
keras.layers.Bidirectional(keras.layers.LSTM(32)),
keras.layers.Dense(1),])

• The Bidirectional layer takes as argument a recurrent layer.


• The Bidirectional layer creates two replicas of the recurrent
layer.
– The first replica processes information in chronological order.
– The second replica processes information in antichronological order.
• The output of the Bidirectional layer is the concatenated
output of the two replicas.

71
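
• A small sketch (not from the original slides) confirming that the outputs are concatenated, assuming inputs with 120 time steps of 14 features:

from tensorflow import keras

inputs = keras.Input(shape=(120, 14))
outputs = keras.layers.Bidirectional(keras.layers.LSTM(32))(inputs)
print(outputs.shape)   # (None, 64): 32 units from each direction, concatenated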
RNN Summary

• Recurrent layers process information one time step at a time.


• The simplest recurrent layer is SimpleRNN.
• The LSTM layer is a more complicated recurrent layer that allows the model to learn when to remember old information and when to replace that “memory” with new information.
• Recurrent dropout must be handled differently than regular
dropout, using optional parameters when creating the recurrent
layers.
– dropout for weights connecting the previous layer to the recurrent layer.
– recurrent_dropout for weights connecting outputs of the recurrent layer from the previous time step to the current time step.
• Bidirectional layers allow processing information both in
chronological and in antichronological order.

72
