Master Spilak Bruno
Abstract

Price prediction is one of the main challenges of quantitative finance. This paper presents a
neural network framework to provide a deep machine learning solution to the price prediction
problem. The framework is instantiated in three variants: a multilayer perceptron (MLP), a
simple recurrent neural network (RNN) and a long short-term memory (LSTM) network, which can
learn long-term dependencies. We describe the theory of neural networks and deep learning in order
to build a reproducible method for our applications on the cryptocurrency market.
Since price prediction is used to make financial decisions such as trade signals, we compare
different approaches to the prediction problem by exploring supervised learning methods
in classification tasks. We study these models to predict out-of-sample price directions of eight
major cryptocurrencies with a rolling-window regression method. For that goal, we build a
classification problem that predicts whether the price of each cryptocurrency will increase or decrease
considerably, as a basis for three-month trading strategies. We build different trading strategies,
based on long or long/short positions built on our predictions, and compare their performance
with a passive index investment on the cryptocurrency market that follows CRIX (Trimborn and
Härdle; 2016). Cryptocurrencies, Bitcoin being the most famous, are electronic money based
on blockchain technology that can be used as a decentralized alternative to fiat currencies.
Thanks to their numerous applications, the cryptocurrency market experienced an exponential
growth during 2017. We compare differently weighted portfolios to test how an investor
can benefit from fundamental indicators such as market capitalization. We find that LSTM
has the best accuracy for predicting directional movements of the most important cryptocurrencies
of CRIX and that an equally weighted portfolio beats CRIX over the first quarters of 2017.
Keywords: Deep learning, multilayer perceptron, recurrent neural network, long short-term
memory network, cryptocurrencies, CRIX
Acknowledgments
I would like to express my gratitude to my supervisor Professor Härdle for his continuous help
with my thesis. His support helped me to understand the cryptocurrency market, which was
essential for my research. I would also like to thank my second supervisor Professor Lessmann
for introducing me to machine learning and neural networks and helping me with forecasting
theory. Besides my advisors, I would like to thank Simon Trimborn for providing the CRIX data
that was crucial for my thesis, Victor Cluzel who helped me with LaTeX, Alla Petukhina who
gave me advice on the Quantlet format, Faruk Mustafic and Camil Humajick for their support while
I was working with DynamicaSoft, and all the researchers from the IRTG 1792 for their comments
and remarks at the Privatissimum Seminar of the Ladislaus von Bortkiewicz Chair of Statistics
of Humboldt-Universität zu Berlin.
Introduction
On December 17th 2017, the Bitcoin price reached almost 20,000 dollars, while at the beginning
of the year it was selling for about 1,000 dollars. Was this rapid growth predictable?
Stock price forecasting is one of the most important tasks of quantitative finance. Indeed,
profits are the guiding force behind most investment choices. Stock market investors need to
know the appropriate time to buy or sell stocks in order to maximize their investment return.
However, stock market prices do not behave as simple time series. The theory of price prediction
is a major discussion topic in finance. The Efficient Market Hypothesis, which states that price
prediction is useless for profit maximization, has attempted to give a definite answer. However,
with the appearance of behavioral finance, many financial economists believe that stock prices
are at least partially predictable on the basis of historical stock price patterns, which reinvigorates
fundamental and particularly technical analysis as tools for price prediction.
Deep learning models, particularly deep feedforward neural networks, have already found
numerous applications in quantitative finance, such as volatility forecasting. In a supervised
learning scheme, neural networks are a useful tool for price prediction since no strong assumption is
needed for their application, which contrasts with traditional time series models such as ARIMA
and its extensions. Moreover, deep learning architectures capture patterns with strong
generalization power, and the more recent LSTM networks seem particularly appropriate for sequential
data such as time series. Nevertheless, deep learning is frequently criticized for lacking a
fundamental theory that could crack open its black box.
In this thesis, we explain the theory of neural networks and investigate how LSTM networks
can outperform, in terms of prediction accuracy, earlier deep neural network architectures,
such as MLP and RNN, on the cryptocurrency market, which has recently sparked the interest
of new investors on the financial market. In order to reflect trading decisions, we show how,
in supervised learning, a deep neural network successfully predicts price movements within a
classification task. Then, we build a simple investment strategy based on a long-term portfolio
comprised of the most important cryptocurrencies from CRIX, the cryptocurrency index from
Trimborn and Härdle (2016). These results may be used as a basis for a trading strategy.
We first overview deep MLP, RNN and LSTM models by explaining the basic concept of
neural networks, their elements and architectures. Then, we explain the deep learning method
for neural networks. Finally, we present our application of these models on the cryptocurrency
market by building a simple trading strategy.
Chapter 1

Deep Neural Networks
In this chapter, we discuss different architectures of deep neural networks. We explain how they
correspond to different representations of nonlinear functions, from a static relation, thanks to
deep feedforward networks (Section 1.1), to a dynamic process, thanks to recurrent neural networks
which, as we will see in Sections 1.2 and 1.3, are necessarily very deep architectures. We also explain
how these models are built, defining their elements and discussing essentially the notions of
neurons and activation functions.
1.1 Deep Feedforward Networks

A deep feedforward network approximates a target function f∗ by chaining simpler functions, for example

f(x) = f^(2)(f^(1)(x))

where f^(1) is the hidden layer and f^(2) is the output layer. The number of hidden layers, or the
length of the chain, gives the depth of the network.
Finally, because the role of each node is analogous to a brain neuron, we call the network
neural and its nodes neurons. The number of neurons in the hidden layers gives us the width of the
model. We will first explain how neurons in neural networks work.
In general, we do not know the right transformation φ of the inputs in advance, thus we need to learn it. We can then reformulate the model as:

ŷ = f(x; θ, w) = w^⊤ φ(x; θ)

where we use the parameter θ to learn φ and the parameter w to map φ(x) to the desired output
(Goodfellow et al.; 2016).
A famous primary example is the XOR ("exclusive or") function, which takes the
binary variable x = (x1, x2)^⊤ ∈ {0, 1}² as input and produces the output y = f∗(x) as follows:

y = 1, if either x1 = 1 or x2 = 1, but not both,
y = 0, if x1 = x2 = 0 or x1 = x2 = 1.
To approximate f∗, we need to learn the parameter θ of the function ŷ = f(x; θ). It is easy
to show that a linear model, such as Rosenblatt's perceptron (Rosenblatt; 1958), cannot learn
the XOR function. The perceptron uses a linear combiner, g : x ↦ w^⊤x + b, followed by a
threshold function x ↦ 1{x≥0}. Denoting by w = (w1, w2)^⊤ the weight vector of the perceptron and
by b a bias term, the parameters of the model are θ = (w, b). The perceptron maps the input x
to the output ŷ = 1{w^⊤x+b ≥ 0}, while the XOR truth table requires:
x1 x2 ŷ
0 0 0
0 1 1
1 0 1
1 1 0
0 ∗ w1 + 0 ∗ w2 + b ≤ 0 ⇔ b≤0 (1.2)
0 ∗ w1 + 1 ∗ w2 + b > 0 ⇔ b > −w2 (1.3)
1 ∗ w1 + 0 ∗ w2 + b > 0 ⇔ b > −w1 (1.4)
1 ∗ w1 + 1 ∗ w2 + b ≤ 0 ⇔ b ≤ −w1 − w2 (1.5)
We can see that Equation (1.5) contradicts Equations (1.3) and (1.4). Indeed,
because the inputs are not linearly separable, the XOR problem cannot be learned by a
linear perceptron. We can solve this problem by using a different function φ that maps the
inputs into another feature space in which they become linearly separable.
Let us create a simple feedforward network with one hidden layer with two units, see Figure
1.1. The vector of hidden units h = (h1, h2) is computed by the function f^(1) : x ↦ f^(1)(x; W, c);
the output layer ŷ is then computed by the function f^(2) : x ↦ f^(2)(x; w, b) using h as input.
Thus, the network is represented by the chain f : x ↦ ŷ = f(x; W, c, w, b) = f^(2)(f^(1)(x)).
We must choose a function for f^(1) and for f^(2), but we cannot use linear functions for both,
otherwise the whole network would remain linear. Neural networks use nonlinear perceptrons as the
standard function for hidden units. The nonlinear perceptron consists of a fixed nonlinear function
g, called an activation function, or squashing function, applied to Rosenblatt's linear perceptron:

h = g(w^⊤x + b)

where w = (w1, . . . , wp)^⊤ is the weight vector of the hidden layer, b its bias and g the chosen
activation function.
Figure 1.1: Simple neural network with two hidden units h1 and h2; for clarity, the biases are not represented
Neural networks use different activation functions; one of the most common, for
a classification task, is the logistic function, often called the sigmoid function:

σ(x) = 1 / (1 + exp(−x)),   x ∈ ℝ   (1.6)
Using the sigmoid function as activation function both for the hidden and output neurons,
the complete network is then:

ŷ = f(x; W, c, w, b) = 1 / (1 + exp(−w^⊤h − b)),   with   h = 1 / (1 + exp(−W^⊤x − c))
We can now give a solution to the XOR problem. Let

W = ( 20  −20
      20  −20 ),   c = (−10, 30)^⊤,   w = (2, 2)^⊤   and   b = −3.
Let us explain how the model processes a batch of inputs. Let x(1) , x(2) , x(3) , x(4) ∈ R2 be
the set of inputs, called the training set and which we can represent in matrix form with one
example per column:
X = ( 0 0 1 1
      0 1 0 1 )
During the first step in the neural network, we multiply the input matrix by the first layer’s
weight matrix:
W^⊤X = ( 0  20  20  40
         0 −20 −20 −40 )
Then we add the bias vector c, to obtain:
( −10 10 10  30
   30 10 10 −10 )
To get the value of h for each input example, we apply the sigmoid function:
h ≈ ( 0 1 1 1
      1 1 1 0 )
After this transformation, the inputs no longer lie on a single line, so we can find a separating
hyperplane. Let us process these inputs through the final output layer: we multiply by the weight
vector w, add the bias b and apply the sigmoid function to get, after rounding,

ŷ ≈ ( 0 1 1 0 )

which reproduces the XOR truth table.
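To make this computation concrete, here is a minimal NumPy sketch (ours, not from the original text) that reproduces the forward pass above with the hand-picked parameters W, c, w and b:

```python
# Minimal NumPy check of the XOR forward pass with the parameters above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]])          # one training example per column
W = np.array([[20, -20],
              [20, -20]])             # input-to-hidden weights
c = np.array([[-10], [30]])           # hidden biases
w = np.array([[2], [2]])              # hidden-to-output weights
b = -3                                # output bias

h = sigmoid(W.T @ X + c)              # hidden activations, approx. [[0,1,1,1],[1,1,1,0]]
y_hat = sigmoid(w.T @ h + b)          # output probabilities
print(np.round(y_hat))                # [[0. 1. 1. 0.]], the XOR truth table
```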
We can solve the XOR problem with different neural networks, using other activation functions.
In modern applications, the default choice is the rectified linear unit, or ReLU (Jarrett et al.;
2009), defined by the activation function g(z) = max(0, z), z ∈ ℝ. A neural network with only
one hidden ReLU unit and a linear perceptron as output layer solves the XOR problem.
We call this architecture a multilayer perceptron (MLP) or multilayer neural network. Figure
1.2 represents such a multilayer perceptron.
The question of the depth of the network, when determining the architecture of a neural
network, is very important. Hornik et al. (1989) showed that a feedforward network with a
single hidden layer containing sufficiently many hidden units, with an arbitrary bounded and
nonconstant activation function such as the sigmoid, is a universal approximator for finite
input. Thus, a multilayer perceptron can approximate any continuous function mapping from
one finite-dimensional space to another.
Figure 1.2: Multilayer perceptron with four inputs x1, . . . , x4 and output y
The universal approximation theorem has also been proved for a wider class of activation
functions, such as ReLU (Leshno et al.; 1993). While this theorem means that a sufficiently
large MLP can represent any function, it does not mean that we can necessarily learn it. As
we will see in the next chapter, very wide networks are very slow to train, which is why we often
prefer deep rather than wide networks.
1.1.3 Back-propagation
Introduction to statistical learning
We know that we can implement a neural network to represent any function from input to
output, but we need to define how to find the network parameters based on a training set and
output targets.
In the XOR example, we gave a specific solution to the classification problem that
has no error. In real applications, the training set can have billions of observations that we want
to classify with a model that uses many parameters. In that case, we need to learn the parameters
of the model so that they produce the smallest possible error, to get a "good" if not
a "perfect" classification (Franke et al.; 2015) of the training set. We define that error as a
loss function of the parameters, Q(θ), that we want to minimize with respect to the parameter
θ. Each machine learning problem is different; for example, we can use neural networks for
regression and classification problems. That is why we define different loss functions, each one
corresponding to a specific problem, see Section 2.1.2.
Let us consider the training set (x1, y1), . . . , (xn, yn) where xi and yi are input vectors and
target vectors respectively. We estimate the target y by the neural network function ŷ = f(x; θ)
and we estimate the parameter θ so that Q(θ) is minimal, thanks to an optimization algorithm
such as gradient descent, see Section 2.2. In learning algorithms, the gradient of the cost function
with respect to the parameters, ∇Q(θ), called the error gradient, is required at each layer, in order
to understand how changes of the weights at the input layer impact the loss function,
computed at the output layer. This computation is very complicated for deep neural networks
and we need a particular numerical method.
Back-propagation algorithm
• Forward pass: In feedforward networks, when we feed an input x, information propagates
forward through the hidden layers in the network to produce an output ŷ. This step is
called the forward pass (Graves; 2012). This forward propagation of information continues
while we train the network over the training set until it produces the scalar loss Q(θ). To
be able to update the weights of the network correctly, we need to be able to propagate
the information from the loss, backward in the network, during a second step called the
backward pass.
Figure 1.3: Computational graph of a perceptron: the dot operation combines x and w into u^(1), the + operation adds the bias b to give u^(2), and the sigmoid σ produces ŷ
Backpropagation is an algorithm that expresses the error gradient with respect to the quantities
of a given neuron as a function of its outgoing neurons. This is possible thanks to the chain rule
of calculus, which computes the derivatives of a composition of several functions. Let x be a
real number, and f and g real functions. We take the composition of f and g to get the output
z = f(y) = f(g(x)); the chain rule states that:

∂z/∂x = (∂z/∂y)(∂y/∂x)
Using the chain rule, it is easy to write the expression for the error gradient of a scalar
with respect to any node in the computational graph that produced that scalar. Figure 1.4
represents a part of a multilayer perceptron computational graph, with zi the activation of unit
i, ui the input of unit i, wij the weight between units i and j, b the bias of the perceptron and
Q(θ) the loss function.
From this graph, we can write the error back-propagation explicitly using the partial derivatives:

∂Q(θ)/∂ui = (∂zi/∂ui) Σ_j (∂Q(θ)/∂uj)(∂uj/∂zi),   with ∂zi/∂ui = g′(ui) and ∂uj/∂zi = wij

which gives:

∂Q(θ)/∂wij = zi ∂Q(θ)/∂uj   and   ∂Q(θ)/∂bj = ∂Q(θ)/∂uj
We can generalize the implementation of the back-propagation algorithm to any multilayer
perceptron, using its matrix form.
Figure 1.4: Part of an MLP computational graph: unit i receives the input ui, applies the activation function gi(·) to produce zi, which is sent through the weight wij into the input uj of unit j; the partial derivatives ∂Q(θ)/∂ui, ∂Q(θ)/∂zi and ∂Q(θ)/∂uj flow backwards
• Forward equations:

u_{l+1} = W_{l,l+1}^⊤ z_l + b_{l+1}   (linear perceptron)   (1.12)
z_{l+1} = g(u_{l+1})   (1.13)
• Backward equations:

∂Q(θ)/∂z_l = W_{l,l+1} ∂Q(θ)/∂u_{l+1}   (1.14)
∂Q(θ)/∂u_l = g′(u_l) ⊙ ∂Q(θ)/∂z_l,   where ⊙ is the element-wise product   (1.15)
• Weights updates:
1.2 Recurrent Neural Networks

1.2.1 Architectures
Jordan (1986) presented a first architecture as a superset of feedforward artificial neural networks
that has one or more cycles. Each cycle makes it possible to follow a path from a neuron back to
itself, allowing feedback of information. These cycles, or recurrent edges, allow the network's
hidden units to see their own previous output, so they give the network memory (Elman; 1990)
and introduce the notion of time into the model. The recurrent neurons are sometimes referred
to as context neurons or state neurons. The structure of a simple recurrent neural network is
shown in Figure 1.5.
Figure 1.5: Simple recurrent neural network with one hidden layer: input x, hidden state h, output o
Jordan (1986) introduced the notion of time into the model thanks to state units that represent
the "temporal context". The output information is produced from the input and state units
through the hidden units. Finally, the recurrent connections from the state units to themselves
and from the output units to the state units allow the output information to be fed back into the
network at the following time step. These recurrent connections constitute the Jordan network's
memory, as in Figure 1.6.
Figure 1.6: Jordan network: the state units feed the previous outputs back into the network
1.2.2 Forward and backward passes

• Forward pass: The forward pass of an RNN is the same as that of an MLP with a single
hidden layer, except that the activations at the hidden layer come from both the current input
and the hidden layer activations at the previous time step (Graves; 2012).
Let us consider an input sequence x of length T presented to an RNN with I input units,
one hidden layer with H hidden units and K output units. Let us also denote by I the index
set for the input-to-hidden connections, by H the index set for the hidden-to-hidden connections
(recurrent connections) and by K the index set for the hidden-to-output connections.

Figure 1.7: RNN unfolded through time: at each time step t, the input x_t feeds the hidden unit h_t, which produces the output u_t; the recurrent weight matrix W connects h_{t−1} to h_t
Consider the following notation: u_j^t is the network input to unit j at time t, z_j^t the
activation of unit j at time t and w_ij the weight from unit i to unit j. This notation, inspired
by Williams and Zipser (1995) and Graves (2012), will make the equations of the algorithm simpler.
As in a regular MLP, see Section 1.1.3, we have for the hidden units:
u_h^t = Σ_{i=1}^{I} w_{ih} x_i^t + Σ_{h′=1}^{H} w_{h′h} z_{h′}^{t−1},   h = 1, . . . , H   (1.18)
Activation functions are then applied exactly as for MLPs to get the output of hidden unit
h, z_h^t, and the inputs of the output units of the network, u_k^t, at time t:

z_h^t = g(u_h^t),   u_k^t = Σ_{h=1}^{H} w_{hk} z_h^t   (1.19)
To ease the notation, we did not consider bias terms. By applying Equations
(1.18) and (1.19) for every t = 1, . . . , T, we get all the activations for the hidden layer.
Nevertheless, we must choose an initializer for z_h^0, corresponding to the network's state
before it receives any information from the input dataset. We often set it to 0, which works
well for sequence-to-sequence learning, but in some cases we can use non-zero or noisy
initial states, as in Zimmermann et al. (2006).
• Backward pass: To compute the backward pass, we need the derivatives of the loss with
respect to the weights. We use the back-propagation algorithm, which applies standard
back-propagation to the unfolded RNN of Figure 1.7 thanks to the chain rule (see Section 1.1.3).
But in the case of an RNN, both the output layer at the current time step and the hidden layer
at the next time step are influenced by the activation of the hidden layer at the current time
step (Graves; 2012), which gives the following back-propagated error:
∂Q(θ)/∂u_h^t = g′(u_h^t) ( Σ_{k=1}^{K} (∂Q(θ)/∂u_k^t) w_{hk} + Σ_{h′=1}^{H} (∂Q(θ)/∂u_{h′}^{t+1}) w_{hh′} ),   t = 1, . . . , T − 1   (1.21)
Since the same weights are reused at every time step, we sum over the whole sequence to get the derivatives with respect to the network weights:

∂Q(θ)/∂w_ij = Σ_{t=1}^{T} (∂Q(θ)/∂u_j^t)(∂u_j^t/∂w_ij) =
  Σ_{t=1}^{T} (∂Q(θ)/∂u_h^t) x_i^t,         if i ∈ I and j ∈ H
  Σ_{t=1}^{T} (∂Q(θ)/∂u_h^t) z_{h′}^{t−1},   if (i, j) ∈ H²               (1.22)
  Σ_{t=1}^{T} (∂Q(θ)/∂u_k^t) z_h^t,         if i ∈ H and j ∈ K
To see why learning long-term dependencies is hard, let us write the gradient of the loss with respect to the recurrent weight matrix W, with h^t the hidden state at time t:

∂Q(θ)/∂W = Σ_{t=1}^{T} (∂Q(θ)/∂u^t)(∂u^t/∂h^t)(∂h^t/∂h^k)(∂h^k/∂W),   1 ≤ k ≤ T

and applying the chain rule again to ∂h^t/∂h^k, we get:

∂Q(θ)/∂W = Σ_{t=1}^{T} (∂Q(θ)/∂u^t)(∂u^t/∂h^t) ( ∏_{i=k+1}^{t} ∂h^i/∂h^{i−1} ) (∂h^k/∂W),   1 ≤ k ≤ T   (1.23)
Each factor ∂h^i/∂h^{i−1} = W^⊤ diag(σ′(h^{i−1})), where σ is the sigmoid activation function, is a
Jacobian matrix with norm smaller than 1. Indeed, the sigmoid activation function
squashes its output values into [0, 1] and its derivatives into [0, 1/4]; see Pascanu et al. (2013)
for a complete demonstration. Thus, sigmoid layers can easily squash their input into a smaller
output region. If this happens repeatedly across multiple stacked sigmoid layers, even a large change
in the parameters of the first layer can finally have a very small impact on the output.
Indeed, according to Equation (1.23), long-term contributions, for which t − k is large, can
go to 0 exponentially fast with the (t − k)-order matrix product, if the absolute value
of the largest eigenvalue of the recurrent weight matrix W is smaller than some bound γ.
Pascanu et al. (2013) showed that for the tanh and sigmoid activation functions, γ = 1 and γ = 1/4
respectively.
Moreover, the long-term components can explode instead of vanishing. We can invert the
latter condition to get a necessary condition for exploding gradients. Several solutions have been
proposed to address these problems:
• Using an L1 or L2 weight penalty on the recurrent weights can be a solution. Pascanu
et al. (2013) address the vanishing gradients problem using a regularization term such that
back-propagated gradients neither increase nor decrease too much in magnitude.
• Pascanu et al. (2013) proposed gradient clipping to deal with exploding gradients. They
rescale the gradients g whenever their norm ‖g‖ goes over a threshold v (see the sketch after this list):

if ‖g‖ > v then g ← g v / ‖g‖
• Glorot et al. (2011) proposed to use the rectifier unit, ReLU, whose activation function is
rectifier(x) = max(0, x), x ∈ ℝ. The function computed by each neuron is then piecewise
linear and, because of this linearity, there is no vanishing gradient due to nonlinear
activations like tanh or sigmoid. Indeed, ReLU does not squash the input space into a smaller region.
• We can also change the structure of the model to cope with the vanishing gradients problem.
By introducing a new set of units called Long Short-Term Memory units (LSTM
networks), Hochreiter and Schmidhuber (1997) enforce a constant error flow through "constant
error carousels", thanks to input and output gates.
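As an illustration of the clipping rule referenced above, here is a hedged NumPy sketch; in Keras, a similar effect can be obtained with the clipnorm argument of an optimizer:

```python
# A sketch of norm-based gradient clipping, assuming g is a NumPy array.
import numpy as np

def clip_gradient(g, v):
    """Rescale gradient g if its L2 norm exceeds the threshold v."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)
    return g

print(clip_gradient(np.array([3.0, 4.0]), v=1.0))  # rescaled to norm 1
```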
1.3 Long Short-Term Memory Networks

The LSTM memory block consists of the following units:

• One (or more) memory cell s_c, called the cell state, is the central feature. It is referred to
as the constant error carousel (CEC) in Hochreiter and Schmidhuber (1997). The cell state
undergoes only minor linear transformations, achieving a constant error flow through
the memory block. It is a linear unit with a fixed recurrent self-connection. To control
the cell state, gate units are added to the memory cell(s).
• One multiplicative input gate unit is introduced to protect the memory content
stored in s_c from being perturbed by irrelevant inputs.
• One multiplicative output gate unit is also introduced to protect other units from being
perturbed by currently irrelevant memory content.
These gates give the LSTM the ability to control the information flow in the cell state. They
have a sigmoid activation function, which gives the amount of information to let through:
they are closed when the activation is close to 0 and open when the activation is close to 1.
Thus, the input gate decides when to keep or override information in the memory cell.
With this architecture, the cell state s_c is updated based on its current state and three
sources of inputs: net_c, the input to the cell itself through the recurrent connection, and net_in and
net_out, the inputs to the input and output gates (Gers et al.; 2000). At each time step, all units
are updated during the forward pass and the error signals for all weights are computed during
the backward pass.
While LSTMs have found great success and numerous applications, Gers et al. (2000) identified
a weakness when they process continual input streams without explicitly resetting the network
state. In some cases, the cell state tends to grow linearly during learning, which can make the
LSTM cell degenerate into an ordinary recurrent network where the gradient vanishes, see Gers
et al. (2000). Gers et al. (2000) proposed to add to the memory block a forget gate that allows
the memory block to reset itself, thanks to a sigmoid activation function. From now on, we will
only consider the extended LSTM with forget gates from Gers et al. (2000). Figure 1.8 illustrates
such an LSTM cell.
Figure 1.8: LSTM cell with a forget gate, source: Graves (2013)
Let xt be one observation at time t of the input vector, the LSTM cell from Figure 1.8 is
implemented by the following equations (Graves; 2013):
i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1} + b_f)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g(W_xc x_t + W_hc h_{t−1} + b_c)
o_t = σ(W_xo x_t + W_ho h_{t−1} + W_co c_t + b_o)
h_t = o_t ⊙ h(c_t)

where σ is the logistic sigmoid function, and i, f, o, c are respectively the input gate,
forget gate, output gate and memory cell activations. W_xi, W_hi, W_xf, W_hf, W_cf, W_xc, W_hc, W_xo,
W_ho, W_co, whose indices are self-explanatory, are the input-input gate weight matrix, hidden-input gate
weight matrix, and so on.
• The subscripts i, f and o refer to the input gate, forget gate and output gate respectively.
• g and h are respectively the cell input and output activation functions.
The forward and backward passes are calculated as in Section 1.2.2. For more details, the
reader can refer to Graves (2012) from which we took the equations.
Forward pass
Let us introduce:
δ_j^t = ∂Q(θ)/∂u_j^t   (1.29)
Input gates

u_i^t = Σ_{i′=1}^{I} w_{i′i} x_{i′}^t + Σ_{h=1}^{H} w_{hi} z_h^{t−1} + Σ_{c=1}^{C} w_{ci} s_c^{t−1}   (1.30)
z_i^t = f(u_i^t)   (1.31)
Forget gates

u_f^t = Σ_{i=1}^{I} w_{if} x_i^t + Σ_{h=1}^{H} w_{hf} z_h^{t−1} + Σ_{c=1}^{C} w_{cf} s_c^{t−1}   (1.33)
z_f^t = f(u_f^t)   (1.34)
Cells

u_c^t = Σ_{i=1}^{I} w_{ic} x_i^t + Σ_{h=1}^{H} w_{hc} z_h^{t−1}   (1.35)
s_c^t = z_f^t s_c^{t−1} + z_i^t g(u_c^t)   (1.36)
Output gates

u_o^t = Σ_{i=1}^{I} w_{io} x_i^t + Σ_{h=1}^{H} w_{ho} z_h^{t−1} + Σ_{c=1}^{C} w_{co} s_c^t   (1.37)
z_o^t = f(u_o^t)   (1.38)
Cell outputs

z_c^t = z_o^t h(s_c^t)   (1.39)
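To make the forward pass concrete, the following NumPy sketch (ours) implements one time step of Equations (1.30)-(1.39); the weight names and shapes are illustrative, biases are omitted as in the text, and the peephole weights w_ci, w_cf, w_co act element-wise:

```python
# A minimal LSTM-cell forward step with forget gate and peephole weights.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, z_prev, s_prev, W):
    """x_t: input; z_prev: previous cell output; s_prev: previous cell state."""
    i = sigmoid(W['xi'] @ x_t + W['hi'] @ z_prev + W['ci'] * s_prev)  # (1.30)-(1.31)
    f = sigmoid(W['xf'] @ x_t + W['hf'] @ z_prev + W['cf'] * s_prev)  # (1.33)-(1.34)
    s = f * s_prev + i * np.tanh(W['xc'] @ x_t + W['hc'] @ z_prev)    # (1.35)-(1.36)
    o = sigmoid(W['xo'] @ x_t + W['ho'] @ z_prev + W['co'] * s)       # (1.37)-(1.38)
    z = o * np.tanh(s)                                                # (1.39)
    return z, s

rng = np.random.default_rng(1)
I, C = 4, 3                               # input size, number of cells
W = {k: rng.normal(scale=0.1, size=(C, I)) for k in ('xi', 'xf', 'xc', 'xo')}
W.update({k: rng.normal(scale=0.1, size=(C, C)) for k in ('hi', 'hf', 'hc', 'ho')})
W.update({k: rng.normal(scale=0.1, size=C) for k in ('ci', 'cf', 'co')})
z, s = np.zeros(C), np.zeros(C)
for x_t in rng.normal(size=(5, I)):       # feed a toy sequence of length 5
    z, s = lstm_step(x_t, z, s, W)
print(z, s)
```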
Backward pass
Let us introduce some notation:

ε_c^t = ∂Q(θ)/∂z_c^t,   ε_s^t = ∂Q(θ)/∂s_c^t   (1.40)
Cell outputs

ε_c^t = Σ_{k=1}^{K} w_{ck} δ_k^t + Σ_{h=1}^{H} w_{ch} δ_h^{t+1}   (1.41)
Output gates

δ_o^t = f′(u_o^t) Σ_{c=1}^{C} h(s_c^t) ε_c^t   (1.42)
States

ε_s^t = z_o^t h′(s_c^t) ε_c^t + z_f^{t+1} ε_s^{t+1} + w_ci δ_i^{t+1} + w_cf δ_f^{t+1} + w_co δ_o^t   (1.43)
Cells

δ_c^t = z_i^t g′(u_c^t) ε_s^t   (1.44)
Forget gates

δ_f^t = f′(u_f^t) Σ_{c=1}^{C} s_c^{t−1} ε_s^t   (1.45)
Input gates

δ_i^t = f′(u_i^t) Σ_{c=1}^{C} g(u_c^t) ε_s^t   (1.46)
Thanks to this gated architecture, an LSTM is easier to train over long sequences than a simple RNN.
Nevertheless, we still need to define how training occurs in neural networks.
Chapter 2

Learning a Neural Network
In the previous chapter, we discussed the architecture of neural networks and how they can
represent a mapping from an input x to an output ŷ = f(x; θ) that approximates the targeted
output y = f∗(x), where f∗ is unknown. To do that, neural networks need to learn the parameter
θ. In this chapter, we will explain how to learn θ to get the best approximation of y. The deep
learning procedure consists of choosing a model specification f(x; θ) for the dataset, a loss function
that corresponds to the task, and an optimization algorithm for the given model (Goodfellow
et al.; 2016). We will explain these three steps.
• Supervised learning, where each input of a dataset is associated with a label or target. We
have a teacher that provides a cost for each input that we want to reduce. Most tasks
are classification or regression problems. In this thesis, we only discuss supervised
learning algorithms.
• Unsupervised learning, where the dataset carries no labels and the algorithm must discover
structure in the inputs on its own.
• Reinforcement learning, where the algorithm interacts with an environment; it lies in between
supervised and unsupervised learning. It is similar to supervised learning with feedback: a critic
derived from the true target category tells the learner how good its prediction was, which is used
to improve the classifier.
Definition 2.1 (Maximum likelihood estimator) Consider the data generating distribution
p(x), which is the true probability. Consider the training set x = x1, . . . , xn independently drawn
from p(x). Let p(x; θ) be a parametric family of probability distributions over the same space,
estimating the true probability. The maximum likelihood estimator for θ is then defined as:

θ_MV = arg max_θ ∏_{i=1}^{n} p(x_i; θ)

and if we take the logarithm, which does not change the arg max, we transform the
product into a sum, which is much easier to deal with:

θ_MV = arg max_θ Σ_{i=1}^{n} ln p(x_i; θ)   (2.1)

The usual approach to find θ_MV is then to minimize the negative log-likelihood, −Σ_{i=1}^{n} ln p(x_i; θ).
We can interpret maximum likelihood estimation as minimizing the Kullback-Leibler distance
(KL distance) between the model probability, p(x; θ) and the empirical distribution, p̂(x).
Definition 2.2 (Kullback-Leibler distance) The KL divergence between the two probability
distributions p̂(x) and p(x; θ) is given by:

D_KL(p̂ ‖ p_θ) = E_{x∼p̂}[ln p̂(x) − ln p(x; θ)]

As ln p̂(x) does not depend on the model, minimizing this KL divergence corresponds
exactly to minimizing the cross-entropy between the training data generated by p̂(x) and the model p(x; θ).
Definition 2.3 (Cross-entropy) The cross-entropy between the two probability distributions
p̂(x) and p(x; θ) is given by:
Q(θ) = −Ep̂ [ln p(x; θ)]
where p̂ is the empirical distribution of the true probability on the training set x.
For some problems, such as simple linear regression, computing the maximum likelihood
estimator is easy using the normal equations. However, in practice, it is much more
challenging for neural networks. Indeed, there is no analytical solution for their optimal weights
and we must search for them numerically, guided by the log-likelihood.
The per-example loss function L measures the distance between each network output ŷi and
the targeted output yi, but we need a measure of performance over the whole training
set. That is why we average the elementary loss functions to get the loss function on the whole
training set, also called the risk function:

Q(θ) = E_{x,y∼p̂} L(x, y, θ) = (1/n) Σ_{i=1}^{n} L(x_i, y_i, θ)
Linear regression

For linear regression, we can simply use a linear perceptron as output. Given the hidden
intermediate activations h = f(x; θ), the output of a linear perceptron is ŷ = W^⊤h + b. In
that case, maximizing the log-likelihood is equivalent to minimizing the Root Mean Squared Error
(RMSE), defined as follows:

RMSE = √( (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)² )   (2.3)
Classification

If we want to represent a K-class classifier, the standard approach is to use K output units,
each having a softmax activation function, to obtain estimates of the class probabilities
p(y = j | x) = ŷ_j. Because ŷ_j is a probability, it needs to be between 0 and 1 and the vector
ŷ = (ŷ_1, . . . , ŷ_K) needs to sum to 1. The outputs are then:

ŷ_j = p(y = j | x; θ) = softmax(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k)   (2.4)

where z = W^⊤h + b.
It is convenient to use a one-hot encoded vector for the target class y, which is a binary vector
with all elements equal to zero except for element k, corresponding to the correct class, which
equals one. The target probabilities are then:

p(y | x; θ) = ∏_{k=1}^{K} ŷ_k^{y_k}
In the particular case of a binary classifier, the network only needs to predict one class
probability, ŷ = P(y = 1 | x). In that case, we can use a single output unit with a sigmoid
activation function and the output is:

ŷ = σ(w^⊤h + b)

where σ is the sigmoid function defined in Equation (1.6). The loss function is then the binary
cross-entropy:

Q(θ) = −(1/n) Σ_{i=1}^{n} [ y_i ln p(y_i | x_i; θ) + (1 − y_i) ln(1 − p(y_i | x_i; θ)) ]
     = −(1/n) Σ_{i=1}^{n} [ y_i ln ŷ_i + (1 − y_i) ln(1 − ŷ_i) ]   (2.5)
The simplest optimization algorithm is gradient descent, which iteratively updates the parameters in the direction of steepest descent of the loss:

θ ← θ − η∇Q(θ)

where η is the learning rate and ∇Q(θ) is the gradient of the cost function with respect to the
parameters.
Gradient descent is difficult to use in practice because it easily gets stuck in a local minimum.
Moreover, it can be very slow to converge because it follows the gradient of the entire training
set, a technique called batch learning. That is why most deep learning algorithms use
Stochastic Gradient Descent (SGD).
Indeed, instead of following the gradient on the whole training set, SGD computes the
gradient on minibatches, that is, small samples of the training set that can be randomly chosen.
The size of the minibatches is determined by the batch size. As we can see, the batch size
controls when the network weights are updated during training; it is one of the most important
hyper-parameters of the model. We then estimate the whole gradient with the average of the
minibatch gradients, which is an unbiased estimator (Goodfellow et al.; 2016, Chapter 8). We
get the following update for a given iteration over a minibatch of size m:
θ ← θ − (η/m) Σ_{i=1}^{m} ∇_θ Q(θ; x_i, y_i)
• Momentum SGD accumulates a velocity vector v in directions of persistent gradient:

v = γv + η∇Q(θ)
θ = θ − v

• RMSProp adapts the learning rate of each parameter by dividing it by a running average E[g²] of recent squared gradients g (with a small constant ε for numerical stability):

θ = θ − η g / √(E[g²] + ε)
• Adam (Kingma and Ba; 2015) is another adaptive algorithm and is nowadays one of the
most used optimization algorithms. It is a combination of the RMSProp and momentum SGD
algorithms. For given parameter values β1, β2, ε, we have:

m = β1 m + (1 − β1)g
v = β2 v + (1 − β2)g²

where m and v are estimates of the first moment (mean) and the second moment vectors
of the gradient g. These estimates are biased, so the authors compute a bias correction
at time step t:

m̂ = m / (1 − β1^t)
v̂ = v / (1 − β2^t)

Finally, we can formulate the update rule for a given iteration:

θ = θ − η m̂ / (√v̂ + ε)
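As an illustration, here is a hedged NumPy sketch of one Adam iteration as written above, with the default parameter values proposed by Kingma and Ba (2015):

```python
# One Adam update step; eta, beta1, beta2 and eps follow the paper's defaults.
import numpy as np

def adam_update(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameters theta with gradient g at iteration t >= 1."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.ones(3), np.zeros(3), np.zeros(3)
theta, m, v = adam_update(theta, g=np.array([0.1, -0.2, 0.05]), m=m, v=v, t=1)
print(theta)
```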
Now that we know how to train neural networks, we need to explain how we can measure
whether the learning is indeed effective.
• Underfitting occurs when the model cannot learn the underlying structure of the data,
which implies a large training error. Underfitting can appear when the complexity
of the model, defined in the next paragraph, is too small.
• Overfitting occurs when the model learns the patterns in the training data too exactly
and cannot generalize these patterns for prediction. In other terms, overfitting happens
when the trained model also learns the noise in the sample. Thus, the model will have a
good performance on the train set, but not on the test set. Overfitting implies a large gap
between training and test error and can appear when the complexity of the model is too
large.
Model selection
To select the best model, we use the generalization error on the validation set and we select
the one with the lowest error with respect to the first Occam’s Razor from Domingos (1999):
”Given two models with the same generalization error, the simpler one should be preferred
because simplicity is desirable in itself.”.
We have different possible measures of complexity, such as the number of parameters of
the model, the choice of variables the model receives as input, the properties of the function
(continuity, slope), the VC-dimension, which provides a bound on the generalization error, or the
number of training iterations. The overfitting phenomenon is essentially related to the complexity
of the model.
Metric

For model evaluation, we need to look at the generalization power of the model, which is
given by the scalar loss, but we also need an evaluation metric, which is a performance measure
for a given goal. The metric used will depend on the task. For a classification task, the usual
metric is the accuracy of the model, which computes the fraction of correct predictions. The
accuracy is given by the formula:

Accuracy(y, ŷ) = (1/n) Σ_{i=0}^{n−1} 1{ŷ_i = y_i}   (2.6)
However, the accuracy does not reflect class imbalance, which is a problem in our application
in the next chapter. Indeed, let us consider a binary classifier on 100 data points where 2 points
belong to class 0 and 98 points belong to class 1. If the classifier correctly predicts 2 points in class
0 and 98 points in class 1, it is a perfect classifier which achieves optimal performance with an
accuracy of 100%. However, if the classifier predicts all 100 points in class 1, the accuracy is still 98%
and the classifier looks as if it achieves near-optimal performance, while in fact it fails to predict
any point in class 0: it is a naive classifier. Thus, we need to use an alternative performance
measure with higher robustness to class skew, for example the F-measure or F1-score.
For that, we introduce the notion of precision, a measure of exactness, which in a binary
classification task is the fraction of correctly classified positives among all examples classified
as positive. Intuitively, it is the ability of a classifier not to label as positive a sample that is
negative. We also introduce the notion of recall, a measure of completeness, which in a binary
classification task is the fraction of correctly classified positives among all positives. Intuitively,
it is the ability of a classifier to find all the positive samples.
Let us introduce some notation. TP_k (true positives) is the number of points assigned
correctly to class k; FP_k (false positives) is the number of points that do not belong to class k
but are assigned to class k incorrectly by the classifier; and FN_k (false negatives) is the number
of points that are not assigned to class k by the classifier but actually belong to class k.
The recall, ρ_k, and precision, π_k, measures for class k are then (Özgür et al.; 2005):

π_k = TP_k / (TP_k + FP_k),   ρ_k = TP_k / (TP_k + FN_k)   (2.7)
We take the macro-averaged F-measure, which is computed locally over each category first;
the average over all categories is then taken:

F_k = 2 π_k ρ_k / (π_k + ρ_k),   F_macro = (1/K) Σ_{k=1}^{K} F_k
The F_macro does not take class imbalance into account, so it is strongly influenced by the classifier's
performance on rare categories. That is why we also use a weighted F-measure, which calculates
the metric for each class and averages the results weighted by support (the number of true instances
of each label). This alternative F_weighted takes label imbalance into account:

F_weighted = (1/K) Σ_{k=1}^{K} w_k F_k
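In practice, these averaged F-measures can be computed with scikit-learn, whose 'weighted' average weighs each per-class F1 score by its support:

```python
# Macro- and support-weighted F1 scores on a toy label set.
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 1, 2, 2, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 2, 1, 1, 1, 1, 1]
print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average='weighted'))  # weighted by class support
```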
All the metrics introduced above give performance measures with which to assess the
generalization error of a model. A model with a low generalization error should have a low loss
and good metric values on both the train and test sets.
Dropout
The most common way to avoid overfitting in deep neural networks is to use a dropout layer,
as proposed by Srivastava et al. (2014). The key idea is to randomly drop units (along with their
connections) from the neural network during training, by forcing the weights of the dropped units
to 0, which reduces the effective number of parameters in the model.
Dropout can also be thought of as an effective bagging method for many large neural networks
(Goodfellow et al.; 2016). Bagging involves training multiple models and evaluating them all
on each test example. With the dropout method, we train an ensemble of all the sub-networks
that can be constructed by removing some units from an underlying base network.
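As an illustration, here is a hedged Keras sketch of a network regularized with dropout layers; the layer sizes are ours and only indicative:

```python
# A small classifier with Dropout layers that zero 20% of activations
# at each training step.
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(10, activation='relu', input_shape=(24,)),
    Dropout(0.2),                      # drop 20% of the hidden units
    Dense(10, activation='relu'),
    Dropout(0.2),
    Dense(3, activation='softmax'),    # e.g. three trend classes, as in Chapter 4
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```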
Now that we have explained the different architectures of neural networks and have a
reproducible and objective method to train and select the best models for a specific task,
let us give examples of applications of deep neural networks to financial data.
Chapter 3

Neural Networks for Time Series Forecasting
ARIMA models can describe the mechanism behind many time series; nevertheless, finding time series
that can actually be modeled by them is challenging. Indeed, ARIMA models have strong assumptions
that in practice are very hard to meet, such as constant parameters, white noise residuals or
homoscedasticity. For example, stock prices often present heteroscedastic residuals. To face these
problems, econometricians built a wide variety of different models such as ARCH or GARCH models.
However, even if we can find a model that explains the structure of stock prices, it is questionable
whether it can actually predict future values, especially over long horizons. In fact, the random
walk theory for stock prices implies that any predictive model for stock prices is useless.
Definition 3.1 (Non-Linear Autoregressive with exogenous variables model) Let {S_t}_{1≤t≤T}
be the real values of a time series, {X_t = (X_{1,t}, . . . , X_{d,t})}_{1≤t≤T} the d-vectors of exogenous
variables and ε_t independent real random variables. An NLARX(p, x, q) process, (p, x, q) ∈ ℕ³,
represents a mapping f∗ : ℝ^{p+xd} → ℝ^q defined as follows:

(S_{t+1}, . . . , S_{t+q}) = f∗(S_t, . . . , S_{t−p+1}, X_t, . . . , X_{t−x+1}) + ε_t

The structure of the network depends on its nature. For example, an MLP network would
have p + xd input neurons and q output neurons. The architecture of such a network can be
challenging to represent because of the different time lags in the autoregressive part of the model
(p) and the exogenous part (x).
As we can see from its definition, an NLARX process leaves great liberty in the forecasting
procedure. Indeed, we distinguish two main procedures to predict a time window of length q
of a time series {S_t}_{1≤t≤T}:

• Recursive strategy: estimate a one-step-ahead model and iterate it q times, feeding each
prediction back as an input, as sketched after this list. This strategy has the advantage of
being parsimonious in the number of parameters to estimate, thus the network has minimal
complexity. However, its crucial weakness is that it accumulates the estimation error at each
forecasting step. Thus, if the forecast window q is large, the performance can quickly degrade
and the estimate of the price q steps ahead can be really unreliable. Moreover, while this model
represents the relations between the past values (S_t, . . . , S_{t−p+1}), it does not learn the structure
between the output values (S_{t+1}, . . . , S_{t+q}). In sequence learning theory, it is a many-to-one
model that takes a sequence as input and produces a scalar.
Nevertheless, the recursive strategy can be applied for small forecast windows. Indeed, a
smaller forecast window might need a smaller look-back window (p), which leads to fewer
parameters to estimate and a simpler model, which is easier and faster to train.
• Multi-output strategy: estimate a single model that predicts the entire forecast window in one
operation, preventing the forecast error from increasing as the forecast window grows. This model
is more complex as it needs more parameters, which allows us to learn the dependence between
inputs and outputs as well as between outputs: it is a many-to-many sequence model. Nevertheless,
this complexity implies costs in the form of a longer training time and a larger amount of data
needed to avoid overfitting.
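To contrast the two strategies, here is a hedged Python sketch; model_1step and model_multi are hypothetical fitted models following the usual predict convention:

```python
# Recursive (iterated) vs. direct multi-output forecasting of a window of length q.
import numpy as np

def recursive_forecast(model_1step, window, q):
    """Iterate a one-step-ahead model q times, feeding predictions back."""
    window = list(window)               # the last p observed values
    preds = []
    for _ in range(q):
        y_next = float(model_1step.predict(np.array(window)[None, :]))
        preds.append(y_next)
        window = window[1:] + [y_next]  # slide the look-back window forward
    return preds

def direct_forecast(model_multi, window):
    """Predict the whole length-q window in one operation."""
    return model_multi.predict(np.array(window)[None, :])[0]
```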
To focus on the most important cryptocurrencies, we selected the 8 cryptocurrencies with the
largest market capitalization over the period and without any missing values, to avoid missing-data
imputation problems. At the time of the start of our study (February 2017), these cryptocurrencies
were Bitcoin (btc), Dash (dash), Ripple (xrp), Monero (xmr), Litecoin (ltc), Dogecoin (doge), Nxt
(nxt) and Namecoin (nmc), for which we have data from 2014-07-31 to 2017-10-25.
Table 3.1: Market capitalization statistics (in millions of dollars) of the 8 most important
cryptocurrencies from 2014-07-31 to 2017-10-25
From Table 3.1, we can observe that btc dominates the market, with a market capitalization
on average 30 times bigger than that of xrp, the second most important cryptocurrency, and 1000
times larger than that of nmc, the smallest cryptocurrency considered.
The structure of the cryptocurrency market is changing every day and, at the time of writing
this paper (January 2018), the 8 cryptocurrencies with the largest market capitalization are
different, but the ones we selected are still important.
Indeed, if we only consider the cryptocurrencies for which we have daily prices at our disposal
since 2014-07-31, our selection is still in the top 13 cryptocurrencies with the largest average
market capitalization over the period considered. We can say that these 8 cryptocurrencies
dominate the market of old cryptocurrencies in terms of market capitalization.
If we also consider recent cryptocurrencies, the market structure is different and this result
does not hold, as we can see in Table 3.2.
Table 3.2: Market capitalization importance among the 631 cryptocurrencies, averaged over the period
Nevertheless, btc still dominates the market over the period, and xrp, ltc and dash are
major cryptocurrencies in terms of market capitalization. Over the last months of the sample, from
March to October 2017, eth, bch, neo, xem, miota and etc are the major new cryptocurrencies,
ranking in the top 10 by average market capitalization. It would be interesting to add them
in a future analysis.
• suitable transformations of the daily returns, such as the 14-day, 30-day and 90-day
moving averages and their respective Bollinger bands, which are defined in Section 4.1.2
• the CRIX daily returns at time t
• Euribor interest rates at different horizons (one year, 6 months and 3 months)
• different exchange rates (EURO/UK, EURO/USD, US/JPY)
• because cryptocurrency prices are available every day, we replace the missing values of the
traditional market variables on weekends with the last value observed at the close of the
exchange on Friday
Before training the neural network, we need to preprocess the data. Indeed, input representation
is very important for neural networks. Because the price is not stationary, as we saw in
the previous section, we consider the differenced logarithm of the prices, also called log-returns.

• We take the differenced logarithm of the price series to eliminate trend and seasonality.
We get the logarithmic return of holding btc for each period considered, here one day:

R_t = ln(P_{t+1} / P_t)

• We standardize the input variables to avoid any scaling problem: J_t = (I_t − µ)/σ, where µ and
σ are respectively the mean and the standard deviation of an input variable I_t on the
training set. We will get standardized predictions on the test set, so we will use the mean
and the standard deviation of the training set to bring the predictions back to their original scale.
We then estimate the NLARX(5, 0, 2) process with the following MLP function:

(R^scaled_{t+2}, R^scaled_{t+1}) = f(R^scaled_t, R^scaled_{t−1}, . . . , R^scaled_{t−4}, X^scaled_t; θ) + ε_t

where R^scaled_t and X^scaled_t represent the scaled daily returns of btc and the scaled exogenous
variables respectively. This network will give us predictions of btc returns. However, we want to
predict btc returns in order to make simple trading decisions (buy btc if the price is expected to
grow and sell btc if the price is expected to fall), so the actual value of the price is irrelevant:
we are only interested in the future price movement. We can then reformulate the problem as
a binary classification of the future trend, that is:
T_{t,k} = 0, if ln(P_{t+k}/P_t) ≤ 0
T_{t,k} = 1, if ln(P_{t+k}/P_t) > 0
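As an illustration, a hedged pandas sketch (function and variable names are ours) of these binary trend labels for a horizon of k days:

```python
# Binary trend labels: 1 if the k-day log-return is positive, 0 otherwise.
import numpy as np
import pandas as pd

def trend_labels(prices: pd.Series, k: int) -> pd.Series:
    log_ret_k = np.log(prices.shift(-k) / prices)  # ln(P_{t+k} / P_t)
    return (log_ret_k > 0).astype(int)[:-k]        # the last k labels are undefined

prices = pd.Series([100., 101., 99., 103., 102., 104.])  # toy price series
print(trend_labels(prices, k=2).tolist())                # [0, 1, 1, 1]
```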
We can now build the MLP network corresponding to our classification task with the function:

(T̂_{t,1}, T̂_{t,2}) = f(R^scaled_t, R^scaled_{t−1}, . . . , R^scaled_{t−4}, X^scaled_t; θ)
We first perform this tuning with a maximum number of 50 epochs and then with 500 epochs. In
the following table, we present the accuracy of each final model.
From this table, we can see that all architectures have an equivalent performance. In particular,
adding layers does not necessarily improve the accuracy of the model: very deep
MLP architectures (models 4 and 8) have a lower performance than the baseline model.
Nevertheless, architectures with three or four hidden layers perform at least as well as model 1,
which has only two hidden layers, indicating that adding a few layers to the baseline architecture
can improve the generalization power. Finally, model 6 is the best model, with a cross-validation
score of 58%. The final measure of generalization error is obtained by computing the accuracy on
the test set, which is unseen by the model. We obtain a final accuracy score of 51%.
Chapter 4

Application: Cryptocurrency Portfolio
This application is inspired by the pilot study that was carried out in cooperation with
Commerzbank AG, Franke (1999). The goal was to develop a trading strategy for a portfolio made
up of 28 of the most important stocks from the Dutch CBS-Index, based on predicted returns
only. We transposed this strategy to a portfolio made up of the 8 cryptocurrencies that
dominate CRIX, Trimborn and Härdle (2016), from 2014 to early 2017. As in Franke's study, we
restrict ourselves to the buy-and-hold strategy with a time horizon of a quarter of a year (90
trading days, because cryptocurrency exchanges are open on weekends). The portfolio is created
at the beginning of a quarter and then held for three months without any alteration. We include
the cryptocurrencies whose prices are predicted to rise significantly, hold them up to the end of
the quarter and sell them afterwards. At the end of the three months, the value of
the portfolio should be as large as possible. As a basis for the trading strategy, a three-month
forecast of the cryptocurrency prices is used.
To model the time series S_{i,t}, we use an NLARX process (see Definition 3.1), where S_{i,t}
represents the price of cryptocurrency i (see Table 4.1). We have to build one model for each
cryptocurrency in order to predict the price three months ahead. Let us first present the input
data we use for each model.
The three components of the Bollinger bands correspond to µ, the central band, µ + δσ, the
upper band, and µ − δσ, the lower band, where δ is a fixed parameter historically equal to 2.
The width of the band is a direct indicator of a stock's volatility. For example, in Figure 4.2
we can observe that Bitcoin price volatility increased as the price grew during the first months
of 2017.
Figure 4.2: Bitcoin prices (red), its 20-day moving average (green) and its lower and upper
Bollinger bands (light blue) CRIXbtcBB
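As an illustration, a hedged pandas sketch of the Bollinger band computation described above, with δ = 2:

```python
# Bollinger bands: rolling mean plus/minus delta rolling standard deviations.
import pandas as pd

def bollinger_bands(prices: pd.Series, window: int = 20, delta: float = 2.0) -> pd.DataFrame:
    mu = prices.rolling(window).mean()       # central band
    sigma = prices.rolling(window).std()     # rolling volatility
    return pd.DataFrame({'middle': mu,
                         'upper': mu + delta * sigma,
                         'lower': mu - delta * sigma})

# bands = bollinger_bands(btc_prices, window=20)  # as in Figure 4.2 (btc_prices: a price Series)
```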
For each cryptocurrency model, we consider the Bollinger bands of the cryptocurrency price on
three different horizons to reflect the long-term trend directions of the price and its volatility.
Let us consider the price S_{i,t} of cryptocurrency i at time t and let us assume that S_{i,t} is an
NLARX(p, x, q) process. We estimate the function f∗ from Definition 3.1 with different neural
network architectures f. We work with the k-day log-returns:

R_t(k) = ln(S_{t+k} / S_t)
• Sequence learning: Because we are dealing with time series, we are going to use a sequence
learning algorithm, which learns the inner relations of sequences as well as the relations between
them. That is, we want to be able to predict whether the price of a cryptocurrency will rise within
a sequence, but also whether it will rise after a sequence has been fed to the network; thus we use
a many-to-many sequence model.
where R_{i,t}(k) is the k-day log-return of cryptocurrency i at time t and X_t is the vector of
exogenous variables (technical indicators of the cryptocurrency and fundamental indicators) at
time t. In each cryptocurrency model, we only include the Bollinger bands of the cryptocurrency
considered, in order to reduce the complexity of the model and avoid overfitting problems. Thus,
the eight models share all input variables except for the Bollinger bands.
Output variables: The output variables depend both on the objective we fixed and on the
activation function at the output layer. As outputs, we use the sequence of trend classifications
in the future. We need to code the k-day log-returns as three-dimensional variables that reflect
the price movements according to our objective.
• Scaling: Finally, because neural networks use activation functions that squash the input
variables into their output interval, we need to scale the input data so the neurons learn
faster, see LeCun et al. (1998). The scaler used depends on the activation function.
Because we use the default LSTM activation function, tanh, we scale the input variables
to [−1, 1].
Loss function

Since our problem is a classification task, we use the cross-entropy as loss function. But, as we
saw in Figure 4.1, the market is quite stable at the beginning of the period and experienced
a rapid growth at the end, so we can expect to have unbalanced classes in the dataset. During
training, we will have to balance the loss function with the class weights. From Table 4.3, we can
clearly see that, except for ltc, all cryptocurrencies have unbalanced classes.
Table 4.3: Class frequencies for each cryptocurrency

Cryptocurrency   0     1     2
btc              0.26  0.48  0.26
dash             0.18  0.51  0.31
xrp              0.19  0.33  0.48
xmr              0.13  0.53  0.35
ltc              0.32  0.34  0.34
doge             0.29  0.31  0.40
nxt              0.16  0.29  0.56
nmc              0.22  0.24  0.54
Thus, we need to use a modified version of the cross-entropy defined in Equation (2.5) and use
the weighted cross-entropy:

Q(θ) = −(1/n) Σ_{i=1}^{n} Σ_{k=0}^{2} w_k y_{i,k} ln ŷ_{i,k}

where w_k = n / (K · n_k), with n_k the number of observations of class k, are the class weights.
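As an illustration, a hedged sketch of these class weights; Keras can consume such a dictionary through the class_weight argument of model.fit:

```python
# Class weights w_k = n / (K * n_k) from an array of integer labels.
import numpy as np

def class_weights(y):
    classes, counts = np.unique(y, return_counts=True)
    n, K = len(y), len(classes)
    return {int(k): n / (K * c) for k, c in zip(classes, counts)}

print(class_weights(np.array([0, 1, 1, 1, 2, 2])))  # rarer classes get larger weights
# model.fit(X_train, y_train, class_weight=class_weights(labels), ...)
```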
Metric

As performance metrics for model selection, we use the accuracy and the weighted F-measure
introduced in Section 2.3.1.
Let us first present three different baseline models, each with two hidden layers of ten neurons,
corresponding to three types of neurons: LSTM, simple RNN and perceptron. To reduce the
training time, we used the well-known method of early stopping defined in Section 2.3.2. We
apply early stopping on the validation accuracy and the validation loss with a patience of 10
epochs. That is, we stop the model training if the validation accuracy does not increase for
10 epochs or if the validation loss does not decrease for 10 epochs. We repeat the training
of each model ten times in order to obtain an average performance, which is more robust. We
present the results in Table 4.4.
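As an illustration, a hedged Keras sketch of this early-stopping setup; the exact metric name depends on the Keras version:

```python
# Stop training when neither validation loss nor validation accuracy
# improves over 10 epochs.
from keras.callbacks import EarlyStopping

callbacks = [
    EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
    EarlyStopping(monitor='val_acc', patience=10, mode='max'),  # 'val_accuracy' in recent Keras
]
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=500, batch_size=256, callbacks=callbacks)
```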
For btc, dash, ltc and xmr, we can see that the baseline models already have a good prediction
power on the validation set, with a minimum F-measure of 68% for the LSTM model on dash and a
maximum F-measure of 97% for the MLP model on xmr.
For the other cryptocurrencies, the results are mixed. The MLP and LSTM models have a good
F-measure for the returns of xrp and doge respectively, since their generalization power is better
than that of a pure random forecast, which would have an F-measure of 33%.
For the other models, it seems that each network experienced real difficulties during the
training process. For example, for nmc, the best performance of the three models occurred
after the first iterations, since neither the validation loss nor the accuracy improved after the
eleventh epoch. They have a large loss, with a minimum of 1.23, and really poor prediction
performance, with a maximum F-measure of 21%. These networks are incapable of extracting
general information from the training set. We can see that the same phenomenon occurred with
the other poorly performing cryptocurrencies.
Figure 4.3: Training (blue curve) and validation (orange curve) accuracy
Figure 4.4: Training (blue curve) and validation (orange curve) loss
From the loss and accuracy curves, we can see that the model performs well on the train set.
Nevertheless, the model completely overfits the train set. Indeed, the loss curve on the validation
set increases during training, and the model has no generalization power as the validation accuracy
does not improve.
The general architectures of the models are presented in Appendix B in Figures B.1, B.2 and
B.3. The shape of the different layers is the same as in the baseline models, except for the last
dimension, which depends on the number of neurons in the layer. We explain the tuning of this
number and of the number of hidden layers in the next paragraph.
Model tuning

As we saw in Section 3.3.2, model tuning is an important step of the model construction. We
tune the width and depth of each model as in the previous chapter, but we also tune the
training parameters such as the learning rate, batch size and number of epochs.
We first tuned the batch size with (32, 64, 128, 256) as the grid, to make computation
easier for Keras. We chose 256 as batch size to reduce the variance of the evaluation metrics,
which gives us more robust results. A large batch size also allows us to reduce the training time
considerably. Finally, we tested (0.01, 0.001, 0.0001, 0.00001) as the grid for the learning
rate and realized that very small learning rates do not improve training and increase the complexity
of the model, since more epochs are needed. Thus, we selected 0.01, 0.001 and 0.0001 as
potential learning rates.
To avoid a too large number of trainable parameters in the model, we prefer to use fewer
neurons and more layers. For example, an MLP with one hidden layer of 10 neurons and 10
input variables has (10 + 1) × 10 = 110 trainable parameters, but an MLP with two hidden layers
of 5 neurons each has (10 + 1) × 5 + (5 + 1) × 5 = 85 trainable parameters. Here, we prefer
the second architecture. Nevertheless, we found that adding a fourth layer did not improve the
F-measure of the model.
Thus, we establish a grid of parameters for the width (2, 5, 10, 15 and 24 neurons,
24 corresponding to the number of features in our model) and for the depth (2 or 3 layers), as
sketched below. We select the parameters corresponding to the model with the highest F-measure.
We could have tested many more values for the hyper-parameters of our models, but
our computational resources were limited for this study.
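As an illustration of this grid search, a hedged Python sketch; build_and_score is a placeholder for model construction, training and evaluation:

```python
# Exhaustive grid search over width, depth and learning rate,
# keeping the configuration with the highest weighted F-measure.
import itertools

def build_and_score(width, depth, learning_rate):
    """Placeholder: build, train and evaluate a model; returns weighted F1.
    In the study this would train a Keras network with the given shape."""
    return 0.0  # replace with actual training and evaluation

widths = [2, 5, 10, 15, 24]
depths = [2, 3]
learning_rates = [0.01, 0.001, 0.0001]

best = None
for width, depth, lr in itertools.product(widths, depths, learning_rates):
    f1 = build_and_score(width=width, depth=depth, learning_rate=lr)
    if best is None or f1 > best[0]:
        best = (f1, width, depth, lr)
print('best (F1, width, depth, lr):', best)
```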
In Table 4.6, we present the architectures and the metrics of the final models selected.
On average, the LSTM, RNN and MLP networks achieve an F-measure of 78%, 43% and 68%
respectively, which shows that LSTM is the best model in terms of prediction accuracy.
From Table 4.6, it is worth noticing that MLP networks with two hidden layers achieve
a better performance than MLP networks with three hidden layers. RNN and LSTM networks
tend to be deeper, preferring three hidden layers for four and five cryptocurrencies out of eight
respectively. The number of hidden units does not seem to have a major impact on the
performance. Finally, LSTM networks prefer a small learning rate of 0.001, or 0.0001 for xrp, whereas
a high learning rate of 0.01 was selected for five RNN networks. Considering the low performance
of RNN networks and the fact that small learning rates for RNN networks achieve a lower
performance than high ones, we could have adopted the technique of reducing the
learning rate during training, which consists of beginning the training of the network with a high
learning rate and reducing it on plateaus, monitoring the evolution of the metric considered.
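Modern Keras ships this technique as a built-in callback; a minimal sketch, assuming the model and data splits from the grid-search sketch above:

```python
# Reduce the learning rate by a factor of 10 when the validation loss
# stops improving for 10 consecutive epochs.
from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1,
                              patience=10, min_lr=1e-5)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          batch_size=256, epochs=100, callbacks=[reduce_lr])
```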
Multi-training
The split into train/test sets and the amount of data in the train set may have an influence
on the results. Indeed, the train set may include some special patterns in the data that are not
useful for generalization purposes, or may exclude general information that could improve the
performance on the test set. That is why we finally test the stability of the forecasting method
when new information becomes available, by retraining the models after each quarter.
We cut the test set into three sets Q1, Q2 and Q3 corresponding to three consecutive quarters.
Cryptocurrency  Model  First layer  Second layer  Third layer  Learning rate  F1 score
btc             LSTM   15           5             15           0.001          99
btc             RNN    10           15            15           0.01           75
btc             MLP    5            15            0            0.001          79
dash            LSTM   15           5             15           0.001          100
dash            RNN    15           5             5            0.01           89
dash            MLP    5            15            0            0.001          79
xrp             LSTM   5            5             5            0.0001         54
xrp             RNN    5            15            0            0.01           23
xrp             MLP    24           15            0            0.001          67
xmr             LSTM   10           10            0            0.001          95
xmr             RNN    24           24            24           0.01           100
xmr             MLP    5            10            0            0.01           100
ltc             LSTM   5            5             5            0.001          71
ltc             RNN    5            5             0            0.001          37
ltc             MLP    5            15            0            0.001          22
doge            LSTM   10           10            0            0.001          56
doge            RNN    5            5             0            0.001          28
doge            MLP    5            24            0            0.001          73
nxt             LSTM   15           5             0            0.001          87
nxt             RNN    10           5             5            0.001          6
nxt             MLP    5            24            0            0.001          36
nmc             LSTM   2            2             2            0.001          75
nmc             RNN    10           24            0            0.01           14
nmc             MLP    5            15            0            0.001          86
Table 4.6: Architectures and metrics of the final models selected
From Table 4.7, we see that for the LSTM network, the multi-training experiment improved
the performance of the model for each cryptocurrency, which underlines the ability of the LSTM
network to capture new patterns in the data and the necessity to retrain the models at each
quarter for nmc, nxt and ltc. We obtain mixed results for the MLP and RNN networks, which
may indicate a tendency to overfit new data when the F-measure declines between the one-shot
and three-step predictions. Nevertheless, on average we improve the generalization power by
11, 14 and 6 points for the MLP, RNN and LSTM networks respectively.
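A hedged sketch of this multi-training loop, where quarters is a hypothetical list of (X, y) arrays for Q1, Q2 and Q3 and model, X_train and y_train come from the initial training:

```python
import numpy as np

X_fit, y_fit = X_train, y_train
for X_q, y_q in quarters:                      # Q1, Q2, Q3 of the test set
    preds = model.predict(X_q).argmax(axis=1)  # forecast the coming quarter
    # ... compute the F-measure of preds against y_q here ...
    X_fit = np.concatenate([X_fit, X_q])       # absorb the observed quarter
    y_fit = np.concatenate([y_fit, y_q])
    model.fit(X_fit, y_fit, batch_size=256, epochs=50, verbose=0)
```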
Definition 4.1 (Portfolio return) Let us consider a portfolio with N assets, $P_t^i$ the price of
asset i at time t. The T-day return of asset i is:

$$r_{t+T}^i = \frac{P_{t+T}^i - P_t^i}{P_t^i}$$
We add a constraint on the weights, so that the capital invested in the portfolio is divided between
the assets included:

$$\sum_{i=1}^{N} w_i = 1$$
As we can see, $w_i$ can be negative, allowing for short positions in the portfolio. Indeed, when we
take a short position on cryptocurrency i, we sell cryptocurrency i on margin, which
implies that we borrow an amount M from a broker at the risk-free interest rate $r_{free} = 0.001$.
The return of a long-short portfolio can be written:

$$R_{t+T} = \sum_{l \in L} w_l r_{t+T}^l + \sum_{s \in S} w_s r_{t+T}^s + \sum_{s \in S} 2 |w_s| r_{free}$$

where L and S contain the long and the short positions respectively.
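This return is straightforward to compute; the sketch below is a direct transcription of the formula, where weights and returns are hypothetical dictionaries mapping each asset to its weight $w_i$ and its return $r_{t+T}^i$:

```python
R_FREE = 0.001  # risk-free rate used for the margin, as in the text

def long_short_return(weights, returns, r_free=R_FREE):
    # Negative weights form the short positions (set S), the rest are long (set L).
    r_long = sum(w * returns[i] for i, w in weights.items() if w >= 0)
    r_short = sum(w * returns[i] for i, w in weights.items() if w < 0)
    margin = sum(2 * abs(w) * r_free for w in weights.values() if w < 0)
    return r_long + r_short + margin
```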
Moreover, we need a benchmark to compare our strategies on the test set. We use three
benchmarks: a portfolio replicating CRIX, that is, CRIX quarterly returns; a portfolio based on
the predictions of the baseline models for each type of neural network; and a portfolio based on
perfect predictions or "true signals", that is, the observed trading signals on the test set. This
last benchmark reflects the accuracy of our predictive models.
Finally, we do not apply our strategy only at the beginning of each quarter; we apply it
every day, creating a new portfolio each day, as if a new investor entered the market every day.
In this way, we can measure not only the performance of a portfolio opened at the beginning of
the test set, but the overall performance of our strategies on the whole test set, via the cumulative
quarterly returns, regardless of the date of investment. This gives us two ways to evaluate our
strategies:
• We first use the quarterly returns computed every day. If at time t the quarterly return is higher than
the CRIX return at that date, we say that the portfolio beats the cryptocurrency market on
that quarter. We can then say that on that quarter it is more profitable to invest in our
strategy rather than in CRIX. As indicator, we use the number of investment days
which give higher quarterly returns than CRIX's.
• If the cumulative quarterly returns of our portfolio are higher than the cumulative returns
of CRIX at the end of the test set, we say that our portfolio beats the cryptocurrency
market on the whole period considered. We can then conclude that it is better to
invest in our strategy rather than in CRIX on the test period. As indicator, we use the
cumulative returns at the end of the test set (both indicators are sketched below).
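A minimal sketch of both indicators, assuming portfolio and crix are hypothetical NumPy arrays of quarterly returns indexed by investment day over the test set:

```python
import numpy as np

# Indicator 1: percentage of investment days on which the strategy's
# quarterly return exceeds CRIX's quarterly return on the same day.
days_beating_crix = 100 * np.mean(portfolio > crix)

# Indicator 2: cumulative quarterly returns at the end of the test set,
# taken here as a running sum over investment days (our assumption).
beats_market_overall = portfolio.sum() > crix.sum()
```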
Since in our case we buy only one coin of each cryptocurrency included in our
portfolio, we have:

$$R_{t+90} = \frac{\sum_{i \in I} P_{t+90}^i - \sum_{i \in I} P_t^i}{\sum_{i \in I} P_t^i}$$
This portfolio corresponds to a price weighted portfolio, whose returns are influenced by the
most expensive cryptocurrency. Indeed, cryptocurrencies with a higher price are given more
weight and have a greater influence on the performance of the portfolio.
$$\begin{aligned}
R_{t+90} &= \frac{\sum_{i \in I} P_{t+90}^i - \sum_{i \in I} P_t^i}{\sum_{i \in I} P_t^i} \\
&= \frac{\sum_{i \in I} (P_{t+90}^i - P_t^i)}{\sum_{i \in I} P_t^i} \\
&= \sum_{i \in I} \frac{P_t^i}{\sum_{j \in I} P_t^j} \cdot \frac{P_{t+90}^i - P_t^i}{P_t^i} \\
&= \sum_{i \in I} w_i \, r_{t+90}^i, \quad \text{where } w_i = \frac{P_t^i}{\sum_{j \in I} P_t^j}
\end{aligned}$$
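In code, the weights of this price weighted portfolio follow directly from the last line of the derivation; prices_t below is a hypothetical snapshot of prices at the investment date:

```python
# w_i = P_t^i / sum_j P_t^j, so expensive coins dominate the portfolio.
prices_t = {"btc": 1200.0, "dash": 55.0, "ltc": 11.0}  # illustrative values
total = sum(prices_t.values())
weights = {coin: price / total for coin, price in prices_t.items()}
```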
Cryptocurrency Weight
btc 0.93
dash 0.042
doge 4.8e-07
ltc 8.4e-03
nmc 5.5e-04
nxt 2.0e-05
xmr 1.4e-02
xrp 4.1e-05
Table 4.8: Weights of a price weighted portfolio, average on the test set
We build three price weighted portfolios based on the LSTM, RNN and MLP network
predictions respectively. The results are presented in Figure 4.5.
In Figure 4.5, we can see that the LSTM portfolio clearly outperforms the MLP portfolio from
March 2017, but they are equivalent before this date. Indeed, we can see that the MLP portfolio
successively predicted wrong trading signals in March 2017 and May 2017, which also implies
that the RNN portfolio outperforms the MLP portfolio from April 2017.
Nevertheless, the three strategies and the perfect prediction portfolio do not beat CRIX
before mid-May, as we can see from the quarterly returns. Indeed, we know from Trimborn
and Härdle (2016) that CRIX is based on market capitalization indexing, which gives higher
weights to larger cryptocurrencies in terms of market capitalization (see the next section for details
on marketcap indexing). Thus, the structure of CRIX returns should be strongly influenced by the
performance of the cryptocurrencies we consider. Nevertheless, we can see from Figure 4.5 that
CRIX returns are twice as large as our portfolios' at the end of March 2017.
From this result, we can say that cryptocurrencies with a lower marketcap than the eight
cryptocurrencies we consider had very large returns between February and June 2017, even though
they have a smaller weight in CRIX than our cryptocurrencies. We must also remind the reader
that Ethereum, whose value was multiplied by a factor of 100 during that period, is not included
in our strategies but enters the computation of CRIX returns on the test set. Yet, Ethereum is the
second cryptocurrency in terms of market capitalization, and thus has the second most important
influence on CRIX returns on the test set. This explains, as we can see from the
cumulative returns, how the CRIX portfolio outperforms all our strategies and the perfect prediction
portfolio over the whole period.
Nevertheless, the LSTM portfolio consistently beats CRIX from mid-May, which indicates that the
most expensive cryptocurrencies were the best investments during the last quarter.
Similarly, the MLP portfolio beats CRIX from mid-June.
Since the returns of a price weighted portfolio are driven by the most expensive
cryptocurrency, the latter strategies are highly influenced by btc movements. If a
model is wrong at predicting btc returns, the strategy can have a very low performance, but if it is wrong
at predicting cheap cryptocurrencies, it can still perform well.
From this, we can clearly see that a price weighted portfolio does not profit from high returns
in cheap cryptocurrencies. Yet xrp, nxt and doge, which are the cheapest cryptocurrencies we
study (see Table 4.1), experienced the highest returns on the test set, as we can see in Figure A.2
in the Appendix. That is why we also build a portfolio based on market capitalization weighting.
Figure 4.5: Price weighted quarterly returns (top) and cumulative quarterly returns (bottom)
of CRIX (blue), perfect prediction (orange), LSTM (green), RNN (red) and MLP portfolios
(purple) CRIXportfolio
For example, if we build such a portfolio with the eight cryptocurrencies we consider in this thesis,
we get on average on the test set the weights in Table 4.9. As expected, the portfolio is
more diversified than a price weighted portfolio.
We present the performance of our strategies in Figure 4.6. The RNN portfolio has the best
performance and beats CRIX on the whole test set thanks to very large returns during the first
quarter.
Nevertheless, the overall performance of the other strategies is similar to the price weighted
portfolio's, and CRIX beats the MLP and LSTM strategies. Indeed, since the market is growing,
a marketcap weighted portfolio based on accurate predictions corresponds approximately to
a basket portfolio containing the eight largest constituents of CRIX, since CRIX weights are
defined by market capitalization (Trimborn and Härdle; 2016).
Cryptocurrency Weight
btc 0.86
dash 0.017
doge 0.081
ltc 0.011
nmc 0.023
nxt 2.8e-03
xmr 1.0e-03
xrp 4.4e-04
Table 4.9: Weights of a marketcap weighted portfolio, average on the test set
Finally, a marketcap strategy cannot profit from high returns in cryptocurrencies with a lower
marketcap. Yet nxt and doge, which have a low marketcap (see Table 3.1), experienced very
high returns, as we can see in Figure A.2 in the Appendix. In order to benefit from high
returns in cryptocurrencies with both low price and low market capitalization, we build a last trading
strategy with an equally weighted portfolio, the simplest Beta strategy.
An equally weighted portfolio weights each asset in the basket equally. As a result, the
portfolio is highly diversified. As opposed to a marketcap weighted portfolio, it does not overweight
overpriced cryptocurrencies and underweight underpriced ones, which implies that
it can overweight cryptocurrencies with a larger risk. Indeed, smaller cryptocurrencies have a
higher risk of failure.
Nevertheless, an equally weighted portfolio is really easy to construct, since its weights are
simply the inverse of the number of cryptocurrencies in the basket. The quarterly return of
such a portfolio is defined as follows, where $N_t$ is the number of assets in the basket at time t:
$$R_{t+90} = \frac{1}{N_t} \sum_{i \in I} r_{t+90}^i$$
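As a sketch, with returns a hypothetical mapping from each cryptocurrency in the basket to its 90-day return:

```python
# Equal weighting: every asset contributes with weight 1 / N_t.
returns = {"btc": 0.42, "xrp": 3.10, "doge": 1.25}  # illustrative values
R_equal = sum(returns.values()) / len(returns)
```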
We present the performance of the different strategies based on such a portfolio in Figure
4.7. As we can see, the strategies based on our three predictive models and the perfect prediction
portfolio beat the market on the whole test set, since their cumulative returns are much higher
than the CRIX portfolio's at the end of the test period. Indeed, these portfolios benefit from very
high returns in cheap cryptocurrencies at the beginning of the test set, as opposed to CRIX.
However, at the end of the test period, the different strategies have a similar performance.
Indeed, cheap currencies, for example xrp, nxt or doge, experienced very low returns, close to zero
or even negative, from the end of May 2017. Since these cryptocurrencies are overweighted
compared to their CRIX weights, CRIX beats our equally weighted portfolios on some days in the last
quarter.
Nevertheless, the overall performance of the strategies based on our predictive models is
much higher than CRIX's (almost 4 times higher for the MLP portfolio).
Finally, from Table 4.10, we can see that the strategies based on the predictions of the tuned
models always beat the strategies based on the predictions of the baseline models, for each type
of neural network. The best strategy is an equally weighted portfolio based on the predictions
of the MLP model.
Figure 4.6: Marketcap weighted quarterly returns (top) and cumulative quarterly returns (bot-
tom) of CRIX (blue), perfect prediction (orange), LSTM (green), RNN (red) and MLP portfolios
(purple) CRIXportfolio
Figure 4.7: Equally weighted quarterly returns and cumulative quarterly returns of CRIX
(blue), perfect prediction (orange), LSTM (green), RNN (red) and MLP portfolios (pur-
ple) CRIXportfolio
As we can see from Table 4.11, the strategies based on the final models
have a larger cumulative return at the end of the test period than the strategies based on the
baseline models, except for the marketcap portfolio with MLP predictions
and the equally weighted portfolio with RNN predictions.
Again, the price weighted and marketcap final portfolios do not beat CRIX and have quite
similar performances on the whole test set (see Figure C.1 and Figure C.2 in the Appendix). Only
the equally weighted portfolios based on MLP and LSTM predictions beat CRIX (see Figure
C.3 in the Appendix), the LSTM equally weighted portfolio performing more than twice as well as
CRIX, with a final return of 627.
The LSTM portfolio always beats the RNN and MLP portfolios in the long run, which means that
wrong predictions by the RNN or MLP networks cost the trading strategies a lot in terms of
returns. Indeed, if we look at the returns of the equally weighted portfolio based on the RNN
final predictions, we can see that the strategy has a negative performance. This is caused
by wrong predictions of long instead of short trading signals, and conversely. Indeed, the LSTM
strategy has a better performance because the LSTM model has a better F-measure.
Table 4.10: Long strategy cumulative returns at the end of the test period
Table 4.11: Long/short strategies cumulative returns at the end of the test period
Portfolio        Long                          Long/short
                 Price  Marketcap  Equal       Price  Marketcap  Equal
CRIX             291    291        291         291    291        291
LSTM             200    199        743         198    195        627
RNN              150    343        533         129    133        -353
MLP              135    280        878         130    175        458
Maximum          CRIX   CRIX       MLP         CRIX   CRIX       LSTM
Best NN          LSTM   RNN        MLP         LSTM   LSTM       LSTM
Table 4.12: Long and long/short strategies cumulative returns at the end of the test period
In general, from these results, we prefer to apply a buy-and-hold strategy that does not allow
short positions, because our models predict too many short trading signals instead of long
positions, and conversely. However, we know that it is hard to beat an index when the market is
growing and, during the year 2017, the cryptocurrency market experienced exponential growth.
From Table 4.13, we can see that, on average over all strategies, our portfolios beat CRIX on 38%
of the test set, which is quite remarkable. Moreover, the equally weighted portfolios always
beat CRIX, except for the RNN long/short portfolio, with a maximum cumulative return of 878
for the MLP long portfolio, that is, 3 times higher than CRIX's.
Portfolio        Long                        Long/short                Average  Best strategy
                 Price  Marketcap  Equal     Price  Marketcap  Equal
MLP              24     38         79        24     32         44      40       Equal
LSTM             32     26         72        32     25         49      39       Equal
RNN              22     45         73        22     24         15      34       Equal
Average          26     36         75        26     27         36      38       Equal
Best strategy    LSTM   RNN        MLP       LSTM   MLP        LSTM    MLP      MLP Equal
Table 4.13: Number of days (in percentage) with higher returns than the CRIX portfolio
Chapter 5
Generalization
The cryptocurrency market is very dynamic. Its structure has changed in comparison
with January 2017, when we started our study. Some of the cryptocurrencies we considered are
no longer among the largest in market capitalization one year later, at the time of writing. Major
cryptocurrencies have emerged, such as Ethereum, BitcoinCash, Iota and many more. That is
why it would be necessary to regularly update the initial basket to reflect the evolution of
the market structure.
Towards a real portfolio management strategy
The highly bullish market conditions may have helped our models to perform very well, since it
is much easier to predict long positions in a highly growing market. Thus, even if the deep learning
methodology always tries to avoid overfitting, it would be interesting to see how our models
perform in a bearish market.
Conclusion
In this thesis, we presented a general introduction to deep neural network theory. We
applied it to financial time series, which is why we focused our analysis on recurrent neural
networks as a nonlinear method for sequence learning. We explained the basics of the MLP, RNN and
LSTM architectures and the deep learning methodology, in order to open what some researchers
and practitioners call the "black box" of neural networks. The reader should now have the basic
theoretical knowledge to choose the neural network model corresponding to a particular
problem and to train it in order to build predictions on unseen data.
We also showed with two examples how this methodology can be applied to a real practical
problem: price prediction in order to make financial decisions. With this thesis, we add to the literature
by providing a first cryptocurrency portfolio based on deep learning asset selection strategies.
By tuning the different hyper-parameters of the model with the trial-and-error methodology that
we presented, the reader should be able to find a model with an acceptable generalization power.
We showed how this methodology is necessary, since hyper-parameter tuning consistently improved
the prediction accuracy of our models.
While our performance results on the cryptocurrency market should be taken with great care,
since the market was experiencing abnormal positive returns typical of a bubble, we
showed that neural networks, especially LSTM, are useful tools for trend prediction, achieving
high prediction accuracy. Our strategy succeeded in beating the CRIX index in terms of financial returns,
showing how index investing can be outperformed with an AI-based buy-and-hold strategy, even
in a highly growing market.
Nevertheless, predicting the market is very risky, and a realistic investment system should
be implemented by taking into account the active environment in which it evolves. A neural
network for trend prediction is not able to understand the financial cost of misclassifications.
That is why it would be interesting to study how such a strategy would perform when a
financial policy and a risk measure are added to the learning process, in a Reinforcement Learning manner.
Appendix A
Cryptocurrencies
Appendix B
Models architectures
Appendix C
Long short portfolio
Figure C.1: Long short price weighted quarterly returns and cumulative quarterly returns of
CRIX (blue), perfect prediction (orange), LSTM (green), RNN (red) and MLP portfolios (pur-
ple) CRIXportfolio
Figure C.2: Long short marketcap weighted quarterly returns and cumulative quarterly returns
of CRIX (blue), perfect prediction (orange), LSTM (green), RNN (red) and MLP portfolios
(purple) CRIXportfolio
Figure C.3: Long short equally weighted quarterly returns and cumulative quarterly returns
of CRIX (blue), perfect prediction (orange), LSTM (green), RNN (red) and MLP portfolios
(purple) CRIXportfolio
Bibliography
Back, A. and Tsoi, A. (1991). FIR and IIR synapses, a new neural network architecture for time
series modeling, Neural Computation pp. 375–385.
Bottou, L. (1998). On-line learning and stochastic approximations, in D. Saad (ed.), On-line
Learning in Neural Networks, Cambridge University Press, New York, NY, USA, pp. 9–42.
Domingos, P. (1999). The role of Occam's razor in knowledge discovery, Data Mining and
Knowledge Discovery 3(4): 409–425.
Duda, R. O., Hart, P. E. and Stork, D. G. (2000). Pattern Classification (2nd Edition), Wiley-
Interscience.
Eisl, A., Gasser, S. M. and Weinmayer, K. (2015). Caveat emptor: Does bitcoin improve portfolio
diversification?, SSRN Scholarly Paper ID 2408997, Rochester, NY: Social Science Research
Network.
Elendner, H., Trimborn, S., Ong, B. and Lee, T. M. (2017). The cross-section of cryptocurrencies
as financial assets, in D. Lee Kuo Chen and R. Deng (eds), Handbook of Digital Finance and
Financial Inclusion: Cryptocurrency, FinTech, InsurTech, Regulation, ChinaTech, Mobile
Security, and Distributed Ledger. 1st Edition.
Franke, J. (1999). Nonlinear and nonparametric methods for analyzing financial time series,
in P. Kall and H.-J. Lüthi (eds), Operations Research Proceedings 1998: Selected Papers of
the International Conference on Operations Research Zurich, August 31 – September 3, 1998,
Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 271–282.
Franke, J., Härdle, W. K. and Hafner, C. M. (2015). Statistics of Financial Markets, Springer.
Gers, F. A., Schmidhuber, J. and Cummins, F. (2000). Learning to forget: Continual prediction
with LSTM, Neural Computation 12(10): 2451–2471.
Glorot, X., Bordes, A. and Bengio, Y. (2011). Deep sparse rectifier neural networks, in G. Gordon,
D. Dunson and M. Dudík (eds), Proceedings of the 14th International Conference on
Artificial Intelligence and Statistics, Vol. 15 of Proceedings of Machine Learning Research,
PMLR, Fort Lauderdale, FL, USA, pp. 315–323.
Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, MIT Press, Cambridge,
MA, USA. http://www.deeplearningbook.org.
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks, Springer.
Hornik, K., Stinchcombe, M. and White, H. (1989). Multilayer feedforward networks are uni-
versal approximators, Neural Networks 2: 359–366.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by
reducing internal covariate shift, CoRR abs/1502.03167.
Jarrett, K., Kavukcuoglu, K., Ranzato, M. and LeCun, Y. (2009). What is the best multi-stage
architecture for object recognition?, 2009 IEEE 12th International Conference on Computer
Vision pp. 2146–2153.
Jiang, Z., Xu, D. and Lian, J. (2017). A deep reinforcement learning framework for the financial
portfolio management problem, ArXiv e-prints.
Jordan, M. I. (1986). Serial order: A parallel distributed processing approach, Technical report,
Institute for Cognitive Science, University of California, San Diego.
Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization, International
Conference on Learning Representations.
LeCun, Y., Bottou, L., Orr, G. B. and Müller, K.-R. (1998). Efficient backprop, in G. B. Orr
and K.-R. Müller (eds), Neural Networks: Tricks of the Trade, Springer Berlin Heidelberg,
Berlin, Heidelberg, pp. 9–50.
Leshno, M., Ya. Lin, V., Pinkus, A. and Schocken, S. (1993). Multilayer feedforward networks
with a nonpolynomial activation function can approximate any function, Neural Networks
6: 861–867.
Özgür, A., Özgür, L. and Güngör, T. (2005). Text categorization with class-based and corpus-
based keyword selection, in P. Yolum, T. Güngör, F. Gürgen and C. Özturan (eds), Computer
and Information Sciences - ISCIS 2005: 20th International Symposium, Istanbul, Turkey,
October 26-28, 2005. Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 606–
615.
Pascanu, R., Mikolov, T. and Bengio, Y. (2013). On the difficulty of training recurrent neural
networks, in S. Dasgupta and D. McAllester (eds), Proceedings of the 30th International
Conference on Machine Learning, Vol. 28 (3) of Proceedings of Machine Learning Research,
PMLR, Atlanta, Georgia, USA, pp. 1310–1318.
Qian, N. (1999). On the momentum term in gradient descent learning algorithms, Neural
Networks.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and orga-
nization in the brain, Psychological review 65(6): 386–408.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout:
A simple way to prevent neural networks from overfitting, Journal of Machine Learning Re-
search 15(1): 1929–1958.
Trimborn, S. and Härdle, W. K. (2016). CRIX an index for blockchain based currencies, SFB 649
Economic Risk, revise and resubmit, Journal of Empirical Finance 2016-021.
Trimborn, S., Li, M. and Härdle, W. (2017). Investing with cryptocurrencies - a liquidity
constrained investment approach, SFB 649 Discussion Paper (2017-014).
Williams, R. J. and Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks
and their computational complexity, in D. E. Rumelhart and J. L. McClelland (eds), Back-
propagation: Theory, Architectures and Applications, Lawrence Erlbaum Publishers, Hillsdale,
N.J., chapter 13, pp. 433–486.
Yao, Y., Rosasco, L. and Caponnetto, A. (2007). On early stopping in gradient descent learning,
Constructive Approximation 26(2): 289–315.
Zimmermann, M., Chappelier, J.-C. and Bunke, H. (2006). Offline grammar-based recognition
of handwritten sentences, IEEE Transactions on Pattern Analysis and Machine Intelligence
pp. 818–821.