Machine Learning for Factor Investing
CHAPMAN & HALL/CRC
Financial Mathematics Series
Series Editors
M.A.H. Dempster
Centre for Financial Research
Department of Pure Mathematics and Mathematical Statistics
University of Cambridge
Dilip B. Madan
Robert H. Smith School of Business
University of Maryland
Rama Cont
Department of Mathematics
Imperial College
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Preface xiii
I Introduction 1
1 Notations and data 3
1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Introduction 9
2.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Portfolio construction: the workflow . . . . . . . . . . . . . . . . . . . . . 10
2.3 Machine learning is no magic wand . . . . . . . . . . . . . . . . . . . . . . 11
4 Data preprocessing 35
4.1 Know your data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Outlier detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 Feature engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4.1 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4.2 Scaling the predictors . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Labelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5.1 Simple labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5.2 Categorical labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6 Tree-based methods 69
6.1 Simple trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.1 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.2 Further details on classification . . . . . . . . . . . . . . . . . . . . 71
6.1.3 Pruning criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1.4 Code and interpretation . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2.1 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.2.2 Code and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3 Boosted trees: Adaboost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.2 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4 Boosted trees: extreme gradient boosting . . . . . . . . . . . . . . . . . . 82
6.4.1 Managing loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4.2 Penalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.4.3 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.4.4 Tree structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.4.5 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4.6 Code and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4.7 Instance weighting . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.6 Coding exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7 Neural networks 91
7.1 The original perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2 Multilayer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.2.1 Introduction and notations . . . . . . . . . . . . . . . . . . . . . . . 93
7.2.2 Universal approximation . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2.3 Learning via back-propagation . . . . . . . . . . . . . . . . . . . . . 97
7.2.4 Further details on classification . . . . . . . . . . . . . . . . . . . . 100
7.3 How deep we should go and other practical issues . . . . . . . . . . . . . . 101
7.3.1 Architectural choices . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.3.2 Frequency of weight updates and learning duration . . . . . . . . . 102
7.3.3 Penalizations and dropout . . . . . . . . . . . . . . . . . . . . . . . 103
7.4 Code samples and comments for vanilla MLP . . . . . . . . . . . . . . . . 104
7.4.1 Regression example . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.4.2 Classification example . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.4.3 Custom losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.5 Recurrent networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.5.1 Presentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.5.2 Code and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.6 Other common architectures . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.6.1 Generative adversarial networks . . . . . . . . . . . . . . . . . . . . 117
7.6.2 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.6.3 A word on convolutional networks . . . . . . . . . . . . . . . . . . . 119
7.6.4 Advanced architectures . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.7 Coding exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
V Appendix 261
17 Data description 263
Bibliography 289
Index 319
Preface
This book is intended to cover some advanced modelling techniques applied to equity investment strategies that are built on firm characteristics. The content is threefold. First, we try to explain simply the ideas behind most mainstream machine learning algorithms that are used in equity asset allocation. Second, we mention a wide range of academic references for the readers who wish to push a little further. Finally, we provide hands-on R code samples that show how to apply the concepts and tools on a realistic dataset, which we share to encourage reproducibility.
• Use cases of alternative datasets that show how to leverage textual data from social media, satellite imagery, or credit card logs to predict sales, earnings reports, and, ultimately, future returns. The literature on this topic is still emerging (see, e.g., Blank et al. (2019), Jha (2019) and Ke et al. (2019)) but will likely blossom in the near future.
(2009), Cornuejols et al. (2018) (written in French), James et al. (2013) (coded in R!) and Mohri et al. (2018) for a general treatment of the subject. Moreover, Du and Swamy (2013) and Goodfellow et al. (2016) are solid monographs on neural networks in particular, and Sutton and Barto (2018) provide a self-contained and comprehensive tour of reinforcement learning.
• Finally, the book does not cover methods of natural language processing (NLP) that can be used to evaluate sentiment, which can in turn be translated into investment decisions. This topic has nonetheless been trending lately and we refer to Loughran and McDonald (2016), Cong et al. (2019a), Cong et al. (2019b) and Gentzkow et al. (2019) for recent advances on the matter.
Chapter 3 reviews the foundations (theoretical and empirical) of factor investing and briefly sums up the dedicated recent literature. Chapter 4 deals with data preparation. It briefly recalls basic tips and warns about some major issues.
Part II of the book is dedicated to predictive algorithms in supervised learning. These are the most common tools used to forecast financial quantities (returns, volatilities, Sharpe ratios, etc.). They range from penalized regressions (Chapter 5) to tree methods (Chapter 6), and encompass neural networks (Chapter 7), support vector machines (Chapter 8) and Bayesian approaches (Chapter 9).
The next portion of the book bridges the gap between these tools and their applications in finance. Chapter 10 details how to assess and improve the ML engines defined beforehand. Chapter 11 explains how models can be combined, and why that may often not be a good idea. Finally, one of the most important chapters (Chapter 12) reviews the critical steps of portfolio backtesting and mentions the mistakes that are frequently encountered at this stage.
The end of the book covers a range of advanced topics connected more specifically to machine learning. The first one is interpretability. ML models are often considered to be black boxes, and this raises trust issues: how and why should one trust ML-based predictions? Chapter 13 presents methods that help understand what is happening under the hood. Chapter 14 focuses on causality, which is a much more powerful concept than correlation and lies at the heart of many recent discussions in Artificial Intelligence (AI). Most ML tools rely on correlation-like patterns, and it is important to underline the benefits of techniques related to causality. Finally, Chapters 15 and 16 are dedicated to unsupervised methods. The latter can be useful, but their financial applications should be wisely and cautiously motivated.
Companion website
This book is entirely available at http://www.mlfactor.com. It is important that not only
the content of the book be accessible, but also the data and code that are used throughout the
chapters. They can be found at https://github.com/shokru/mlfactor.github.io/tree/
master/material. The online version of the book will be updated beyond the publication
of the printed version.
Why R?
The supremacy of Python as the dominant ML programming language is a widespread belief. This is because almost all applications of deep learning (which is, as of 2020, one of the most fashionable branches of ML) are coded in Python via TensorFlow or PyTorch. The fact is that R has a lot to offer as well. First of all, let us not forget that one of the most influential textbooks in ML (Hastie et al. (2009)) was written by statisticians who code in R. Moreover, many statistics-orientated algorithms (e.g., BARTs in Section 9.5) are primarily coded in R and not always in Python. The R offering in Bayesian packages in general (https://cran.r-project.org/web/views/Bayesian.html) and in Bayesian learning in particular is probably unmatched.
There are currently several ML frameworks available in R.
• caret: https://topepo.github.io/caret/index.html, a compilation of more than
200 ML models;
Coding instructions
A list of the packages we use can be found in Table 1 below. Packages with a star (*) need to be installed via Bioconductor.2 Packages with a plus (+) need to be installed manually.3
Of all of these packages (or collections thereof), the tidyverse and lubridate are compulsory
in almost all sections of the book. To install a new package in R, just type
install.packages("name_of_the_package")
in the console. Sometimes, because of function name conflicts (especially with the select() function), we use the syntax package::function() to make sure the function call comes from the right source. The exact versions of the packages used to compile the book are listed in the “renv.lock” file available on the book’s GitHub web page https://github.com/shokru/mlfactor.github.io. One minor comment is the following: while the functions gather() and spread() from the tidyr package have been superseded by pivot_longer() and pivot_wider(), we still use them because of their much more compact syntax.

2. One example: https://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html
3. By copy-pasting the content of the package into the library folder. To get the address of the folder, execute the command .libPaths() in the R console.
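For instance, here is a minimal illustration of such a conflict (our own example, not taken from the book; MASS also exports a select() function):

library(MASS)                              # Attaches MASS::select()
library(dplyr)                             # dplyr::select() now masks MASS::select()
iris %>% dplyr::select(Sepal.Length) %>%   # The prefix removes any ambiguity
    head(3)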
As much as we could, we created short code chunks and commented each line whenever we felt it was useful. Comments are displayed at the end of a row and preceded by a single hashtag #.

The book is constructed as a very big notebook, thus results are often presented below code chunks. They can be graphs or tables. Sometimes, they are simple numbers and are preceded by two hashtags ##. The example below illustrates this formatting.
1+2 # Example
## [1] 3
The book can be viewed as a very big tutorial. Therefore, most of the chunks depend on
previously defined variables. When replicating parts of the code (via online code), please
make sure that the environment includes all relevant variables. One best practice is
to always start by running all code chunks from Chapter 1. For the exercises, we often resort
to variables created in the corresponding chapters.
Acknowledgments
The core of the book was prepared for a series of lectures given by one of the authors to students of master’s degrees in finance at EMLYON Business School and at the Imperial College Business School in the Spring of 2019. We are grateful to those students who asked fruitful questions and thereby contributed to improving the content of the book.

We are grateful to Bertrand Tavin and Gautier Marti for their thorough screening of the book. We also thank Eric André, Aurélie Brossard, Alban Cousin, Frédérique Girod, Philippe Huber, Jean-Michel Maeso and Javier Nogales for friendly reviews; Christophe Dervieux for his help with bookdown; Mislav Sagovac and Vu Tran for their early feedback; John Kimmel for making this happen; and Jonathan Regenstein for his availability, no matter the topic. Lastly, we are grateful for the anonymous reviews collected by John.
Future developments
Machine learning and factor investing are two immense research domains and the overlap between the two is quite substantial and developing at a fast pace. The content of this book will always constitute a solid background, but it is naturally destined to obsolescence. Moreover, by construction, some subtopics and many references will have escaped our scrutiny. Our intent is to progressively improve the content of the book and update it with the latest ongoing research. We will be grateful for any comment that helps correct or update the monograph. Thank you for sending your feedback directly (via pull requests) to the book’s website, which is hosted at https://github.com/shokru/mlfactor.github.io.
Part I
Introduction
1
Notations and data
1.1 Notations
This section aims at providing the formal mathematical conventions that will be used
throughout the book.
Bold notations indicate vectors and matrices. We use capital letters for matrices and lowercase letters for vectors. $\mathbf{v}'$ and $\mathbf{M}'$ denote the transposes of $\mathbf{v}$ and $\mathbf{M}$. $\mathbf{M} = [m]_{i,j}$, where $i$ is the row index and $j$ the column index.
We will work with two notations in parallel. The first one is the pure machine learning notation in which the labels (also called output, dependent variables or predicted variables) $y_i$ are approximated by functions of features $\mathbf{x}_i = (x_{i,1}, \dots, x_{i,K})$. The dimension of the feature matrix $\mathbf{X}$ is $I \times K$: there are $I$ instances, records, or observations, and each one of them has $K$ attributes, features, inputs, or predictors which will serve as independent and explanatory variables (all these terms will be used interchangeably). Sometimes, to ease notations, we will write $\mathbf{x}_i$ for one instance (one row) of $\mathbf{X}$ or $\mathbf{x}_k$ for one (feature) column vector of $\mathbf{X}$.
The second notation type pertains to finance and will directly relate to the first. We will often work with discrete returns $r_{t,n} = p_{t,n}/p_{t-1,n} - 1$ computed from price data. Here $t$ is the time index and $n$ the asset index. Unless specified otherwise, the return is always computed over one period, though this period can sometimes be one month or one year. Whenever confusion might occur, we will specify other notations for returns.

In line with our previous conventions, the number of return dates will be $T$ and the number of assets, $N$. The features or characteristics of assets will be denoted by $x_{t,n}^{(k)}$: it is the time-$t$ value of the $k$-th attribute of firm or asset $n$. In stacked notation, $\mathbf{x}_{t,n}$ will stand for the vector of characteristics of asset $n$ at time $t$. Moreover, $\mathbf{r}_t$ stands for all returns at time $t$ while $\mathbf{r}_n$ stands for all returns of asset $n$. Often, returns will play the role of the dependent variable, or label (in ML terms). For the riskless asset, we will use the notation $r_{t,f}$.
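As a quick illustration of this return convention in R (the price vector p below is hypothetical, for illustration only):

p <- c(100, 102, 99, 105)          # Hypothetical price series of one asset
r <- p[-1] / p[-length(p)] - 1     # Discrete returns: r_t = p_t / p_{t-1} - 1
r
## [1]  0.02000000 -0.02941176  0.06060606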
The link between the two notations will most of the time be the following. One instance (or
observation) i will consist of one couple (t, n) of one particular date and one particular
firm (if the data is perfectly rectangular with no missing field, I = T × N ). The label will
usually be some performance measure of the firm computed over some future period, while
the features will consist of the firm attributes at time t. Hence, the purpose of the machine
learning engine in factor investing will be to determine the model that maps the time-t
characteristics of firms to their future performance.
In terms of canonical matrices: $\mathbf{I}_N$ will denote the $(N \times N)$ identity matrix.
From the probabilistic literature, we employ the expectation operator $E[\cdot]$ and the conditional expectation $E_t[\cdot]$, where the corresponding filtration $\mathcal{F}_t$ corresponds to all information available at time $t$. More precisely, $E_t[\cdot] = E[\cdot \,|\, \mathcal{F}_t]$. $V[\cdot]$ will denote the variance operator. Depending on the context, probabilities will be written simply $P$, but sometimes we will use the heavier notation $\mathbb{P}$. Probability density functions (pdfs) will be denoted with lowercase letters ($f$) and cumulative distribution functions (cdfs) with uppercase letters ($F$). We will write equality in distribution as $X \overset{d}{=} Y$, which is equivalent to $F_X(z) = F_Y(z)$ for all $z$ on the support of the variables. For a random process $X_t$, we say that it is stationary if the law of $X_t$ is constant through time, i.e., $X_t \overset{d}{=} X_s$, where $\overset{d}{=}$ means equality in distribution.
Sometimes, asymptotic behaviors will be characterized with the usual Landau notation $o(\cdot)$ and $O(\cdot)$. The symbol $\propto$ refers to proportionality: $x \propto y$ means that $x$ is proportional to $y$. With respect to derivatives, we use the standard notation $\frac{\partial}{\partial x}$ when differentiating with respect to $x$. We resort to the compact symbol $\nabla$ when all derivatives are computed (gradient vector).
In equations, the left-hand side and right-hand side can be written more compactly: l.h.s.
and r.h.s., respectively.
Finally, we turn to functions. We list a few below:
- $1_{\{x\}}$: the indicator function of the condition $x$, which is equal to one if $x$ is true and to zero otherwise.
- $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard Gaussian pdf and cdf.
- card(·) = #(·) are two notations for the cardinal function, which evaluates the number of elements in a given set (provided as argument of the function).
- $\lfloor \cdot \rfloor$ is the integer part (floor) function.
- for a real number $x$, $[x]^+$ is the positive part of $x$, that is $\max(0, x)$.
- $\tanh(\cdot)$ is the hyperbolic tangent: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.
- ReLU(·) is the rectified linear unit: $\text{ReLU}(x) = \max(0, x)$.
- $s(\cdot)$ will be the softmax function: $s(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}}$, where the subscript $i$ refers to the $i$-th element of the vector.
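To make the last two functions concrete, here is a minimal R sketch (the function names relu and softmax are ours; base R already provides tanh()):

relu <- function(x) pmax(0, x)                # Rectified linear unit, element-wise
softmax <- function(x) exp(x) / sum(exp(x))   # Softmax of a numeric vector
softmax(c(1, 2, 3))                           # The output sums to one
## [1] 0.09003057 0.24472847 0.66524096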
1.2 Dataset
Throughout the book, and for the sake of reproducibility, we will illustrate the concepts we present with examples of implementation based on a single financial dataset available at https://github.com/shokru/mlfactor.github.io/tree/master/material.
This dataset comprises information on 1,207 stocks listed in the US (possibly originating
from Canada or Mexico). The time range starts in November 1998 and ends in March 2019.
For each point in time, 93 characteristics describe the firms in the sample. These attributes
cover a wide range of topics:
• valuation (earning yields, accounting ratios);
• risk (volatilities);
• estimates (earnings-per-share);
## # A tibble: 6 x 6
## stock_id date Advt_12M_Usd Advt_3M_Usd Advt_6M_Usd Asset_Turnover
## <int> <date> <dbl> <dbl> <dbl> <dbl>
## 1 1 2000-01-31 0.41 0.39 0.42 0.19
## 2 1 2000-02-29 0.41 0.39 0.4 0.19
## 3 1 2000-03-31 0.4 0.37 0.37 0.2
## 4 1 2000-04-30 0.39 0.36 0.37 0.2
## 5 1 2000-05-31 0.4 0.42 0.4 0.2
## 6 1 2000-06-30 0.41 0.47 0.42 0.21
The data has 99 columns and 268,336 rows. The first two columns indicate the stock identifier and the date. The next 93 columns are the features (see Table 17.1 in the Appendix for details). The last four columns are the labels. The points are sampled at the monthly frequency. As is always the case in practice, the number of assets changes with time, as shown in Figure 1.1.
data_ml %>%
group_by(date) %>% # Group by date
summarize(nb_assets = stock_id %>% # Count nb assets
as.factor() %>% nlevels()) %>%
ggplot(aes(x = date, y = nb_assets)) + geom_col() + # Plot
coord_fixed(3)
FIGURE 1.1: Number of assets through time.
There are four immediate labels in the dataset: R1M_Usd, R3M_Usd, R6M_Usd and R12M_Usd, which correspond to the 1-month, 3-month, 6-month and 12-month future/forward returns of the stocks. The returns are total returns, that is, they incorporate potential dividend payments over the considered periods. This is a better proxy of financial gain than price returns alone. We refer to the analysis of Hartzmark and Solomon (2019) for a study on the impact of decoupling price returns and dividends. These labels are located in the last four columns of the dataset. We provide their descriptive statistics below.
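The generating chunk is not reproduced in this preview; a sketch that produces this type of table (assuming the tidyverse is loaded and data_ml is in memory) could be:

data_ml %>%
    dplyr::select(R1M_Usd, R3M_Usd, R6M_Usd, R12M_Usd) %>%   # The four return labels
    pivot_longer(everything(), names_to = "Label") %>%       # One row per (label, value) pair
    group_by(Label) %>%                                      # One group per label
    summarize(mean = mean(value), sd = sd(value),            # Descriptive statistics
              min = min(value), max = max(value))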
## # A tibble: 4 x 5
## Label mean sd min max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 R12M_Usd 0.137 0.738 -0.991 96.0
## 2 R1M_Usd 0.0127 0.176 -0.922 30.2
## 3 R3M_Usd 0.0369 0.328 -0.929 39.4
## 4 R6M_Usd 0.0723 0.527 -0.98 107.
In anticipation of future models, we keep the names of the predictors in memory. In addition, we also keep a much shorter list of predictors.
features <- colnames(data_ml[3:95]) # Keep the feature's column names (hard-coded, beware!)
features_short <- c("Div_Yld", "Eps", "Mkt_Cap_12M_Usd", "Mom_11M_Usd",
"Ocf", "Pb", "Vol1Y_Usd")
The predictors have been uniformized, that is, for any given feature and time point, the
distribution is uniform. Given 1,207 stocks, the graph below cannot display a perfect
rectangle.
data_ml %>%
filter(date == "2000-02-29") %>%
ggplot(aes(x = Div_Yld)) + geom_histogram(bins = 100) + coord_fixed(0.03)
FIGURE 1.2: Distribution of the dividend yield feature on date 2000-02-29.
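Incidentally, such a uniformization can be replicated with a cross-sectional rank transform; the sketch below is our own illustration, not necessarily the exact preprocessing used to build the dataset:

data_ml %>%
    group_by(date) %>%                                # One cross-section per date
    mutate(Div_Yld_u = ecdf(Div_Yld)(Div_Yld)) %>%    # Empirical cdf maps values to (0,1]
    ungroup()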
The original labels (future returns) are numerical and will be used for regression exercises,
that is, when the objective is to predict a scalar real number. Sometimes, the exercises can
be different and the purpose may be to forecast categories (also called classes), like “buy”,
“hold” or “sell”. In order to be able to perform this type of classification analysis, we create
additional labels that are categorical.
data_ml <- data_ml %>%
    group_by(date) %>%                                 # Cross-section at each date
    mutate(R1M_Usd_C = R1M_Usd > median(R1M_Usd)) %>%  # Binary label vs. the median (reconstructed line: the start of this chunk is cut in this preview)
    ungroup() %>%
    mutate_if(is.logical, as.factor)                   # Logical columns become factors
The new labels are binary: they are equal to 1 (true) if the original return is above that of
the median return over the considered period and to 0 (false) if not. Hence, at each point in
time, half of the sample has a label equal to zero and the other half to one: some stocks
overperform and others underperform.
In machine learning, models are estimated on one portion of data (training set) and then
tested on another portion of the data (testing set) to assess their quality. We split our
sample accordingly.
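The splitting chunk itself is not reproduced in this preview; a minimal sketch (the cutoff date below is our choice, purely for illustration) would be:

separation_date <- as.Date("2014-01-15")                       # Assumed cutoff date
training_sample <- filter(data_ml, date < separation_date)     # Estimation sample
testing_sample  <- filter(data_ml, date >= separation_date)    # Out-of-sample evaluation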
We also keep in memory a few key variables, like the list of asset identifiers and a rectangular
version of returns. For simplicity, in the computation of the latter, we shrink the investment
universe to keep only the stocks for which we have the maximum number of points.
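Here is a hedged sketch of these two steps (the variable names and the use of pivot_wider() are our assumptions, not necessarily the book's exact chunk):

stock_ids <- levels(as.factor(data_ml$stock_id))                  # All asset identifiers
stock_days <- data_ml %>%                                         # Number of obs per stock
    group_by(stock_id) %>% summarize(nb = n())
stock_ids_short <- stock_ids[stock_days$nb == max(stock_days$nb)] # Full-history stocks only
returns <- data_ml %>%
    filter(stock_id %in% stock_ids_short) %>%                     # Shrink the universe
    dplyr::select(date, stock_id, R1M_Usd) %>%                    # Keep 1-month returns
    pivot_wider(names_from = stock_id, values_from = R1M_Usd)     # Rectangular format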
2
Introduction

Conclusions often echo introductions. This chapter was completed at the very end of the writing of the book. It outlines principles and ideas that are probably more relevant than the sum of technical details covered subsequently. When stuck with disappointing results, we advise the reader to take a step away from the algorithm and come back to this section to get a broader perspective on some of the issues in predictive modelling.
2.1 Context
The blossoming of machine learning in factor investing has its source at the confluence of three favorable developments: data availability, computational capacity, and economic groundings.
First, the data. Nowadays, classical providers, such as Bloomberg and Reuters, have seen their playing field invaded by niche players and aggregation platforms.1 In addition, high-frequency data and derivative quotes have become mainstream. Hence, firm-specific attributes are easy and often cheap to compile. This means that the size of X in (2.1) is now sufficiently large to be plugged into ML algorithms. The order of magnitude (in 2019) that can be reached is the following: a few hundred monthly observations over several thousand stocks (US listed at least) covering a few hundred attributes. This makes a dataset of dozens of millions of points (for instance, 200 dates × 2,000 stocks × 100 attributes already amounts to 40 million values). While it is a reasonably high figure, we highlight that the chronological depth is probably the weak point and will remain so for decades to come because accounting figures are only released on a quarterly basis. Needless to say, this drawback does not hold for high-frequency strategies.
Second, computational power, both through hardware and software. Storage and processing speed are no longer technical hurdles, and models can even be run on the cloud thanks to services hosted by major actors (Amazon, Microsoft, IBM and Google) and by smaller players (Rackspace, Techila). On the software side, open source has become the norm, funded by corporations (TensorFlow & Keras by Google, PyTorch by Facebook, h2o, etc.), universities (Scikit-Learn by INRIA, CoreNLP by Stanford, NLTK by UPenn) and small groups of researchers (caret, xgboost, tidymodels, to list but a few frameworks). Consequently, ML is no longer the private turf of a handful of expert computer scientists, but is on the contrary accessible to anyone willing to learn and code.
Finally, economic framing. Machine learning applications in finance were initially introduced by computer scientists and information system experts (e.g., Braun and Chandler (1987), White (1988)) and exploited shortly after by academics in financial economics (Bansal and Viswanathan (1993)), and hedge funds (see, e.g., Zuckerman (2019)). Nonlinear relationships then became more mainstream in asset pricing (Freeman and Tse (1992), Bansal et al. (1993)). These contributions started to pave the way for the more brute-force approaches that have blossomed since the 2010s and which are mentioned throughout the book.

1. We refer to https://alternativedata.org/data-providers/ for a list of alternative data providers. Moreover, we recall that Quandl, an alt-data hub, was acquired by Nasdaq in December 2018. As large players acquire newcomers, the field may consolidate.
In the synthetic proposal of Arnott et al. (2019b), the first piece of advice is to rely on a model
that makes sense economically. We agree with this stance, and the only assumption that we
make in this book is that future returns depend on firm characteristics. The relationship
between these features and performance is largely unknown and probably time-varying.
This is why ML can be useful: to detect some hidden patterns beyond the documented
asset pricing anomalies. Moreover, dynamic training makes it possible to adapt to changing market conditions.
$$y = f(\mathbf{X}) + \epsilon, \qquad (2.1)$$
to conceive it as integrated. All steps are intertwined and each part should not be dealt
with independently from the others.3 The global framing of the problem is essential, from
the choice of predictors, to the family of algorithms, not to mention the portfolio weighting
schemes (see Chapter 12 for the latter).
• Thus, researchers most of the time have to make do with simple correlation patterns, which are far less informative and robust.
3. Other approaches are nonetheless possible, as is advocated in de Prado and Fabozzi (2020).
• The no-free-lunch theorem of Wolpert (1992a) imposes that the analyst formulate views on the model. This is why economic or econometric framing is key. The assumptions and choices that are made regarding both the dependent variables and the explanatory features are decisive. As a corollary, data is key. The inputs given to the models are probably much more important than the choice of the model itself.

• Everybody makes mistakes. Errors in loops or variable indexing are part of the journey. What matters is to learn from those lapses.
To conclude, we remind the reader of this obvious truth: nothing will ever replace practice. Gathering and cleaning data, coding backtests, tuning ML models, testing weighting schemes, debugging, starting all over again: these are all absolutely indispensable steps and tasks that must be repeated indefinitely. There is no substitute for experience.
3
Factor investing and asset pricing anomalies
Asset pricing anomalies are the foundations of factor investing. In this chapter our aim is
twofold:
• present simple ideas and concepts: basic factor models and common empirical facts
(time-varying nature of returns and risk premia);
• provide the reader with lists of articles that go much deeper to stimulate and satisfy
curiosity.
The purpose of this chapter is not to provide a full treatment of the many topics related
to factor investing. Rather, it is intended to give a broad overview and cover the essential
themes so that the reader is guided towards the relevant references. As such, it can serve as
a short, non-exhaustive, review of the literature. The subject of factor modelling in finance
is incredibly vast and the number of papers dedicated to it is substantial and still rapidly
increasing.
The universe of peer-reviewed financial journals can be split in two. The first kind is the
academic journals. Their articles are mostly written by professors, and the audience consists
mostly of scholars. The articles are long and often technical. Prominent examples are the
Journal of Finance, the Review of Financial Studies and the Journal of Financial Economics.
The second type is more practitioner-orientated. The papers are shorter, easier to read,
and target finance professionals predominantly. Two emblematic examples are the Journal
of Portfolio Management and the Financial Analysts Journal. This chapter reviews and
mentions articles published essentially in the first family of journals.
Beyond academic articles, several monographs are already dedicated to the topic of style
allocation (a synonym of factor investing used for instance in theoretical articles (Barberis
and Shleifer (2003)) or practitioner papers (Asness et al. (2015))). To cite but a few, we
mention:
• Ilmanen (2011): an exhaustive excursion into risk premia, across many asset classes,
with a large spectrum of descriptive statistics (across factors and periods),
• Ang (2014): covers factor investing with a strong focus on the money management
industry,
• Bali et al. (2016): very complete book on the cross-section of signals with statistical
analyses (univariate metrics, correlations, persistence, etc.),
• Jurczenko (2017): a tour of various topics given by field experts (factor purity, predictability, selection versus weighting, factor timing, etc.).
Finally, we mention a few wide-scope papers on this topic: Goyal (2012), Cazalet and Roncalli
(2014) and Baz et al. (2015).
3.1 Introduction
The topic of factor investing, though a decades-old academic theme, has gained traction concurrently with the rise of exchange traded funds (ETFs) as vectors of investment. Both have gathered momentum in the 2010s. Not so surprisingly, the feedback loop between
practical financial engineering and academic research has stimulated both sides in a mutually
beneficial manner. Practitioners rely on key scholarly findings (e.g., asset pricing anomalies)
while researchers dig deeper into pragmatic topics (e.g., factor exposure or transaction costs).
Recently, researchers have also tried to quantify and qualify the impact of factor indices on
financial markets. For instance, Krkoska and Schenk-Hoppé (2019) analyze herding behaviors
while Cong and Xu (2019) show that the introduction of composite securities increases
volatility and cross-asset correlations.
The core aim of factor models is to understand the drivers of asset prices. Broadly speaking,
the rationale behind factor investing is that the financial performance of firms depends on
factors, whether they be latent and unobservable, or related to intrinsic characteristics (like
accounting ratios for instance). Indeed, as Cochrane (2011) frames it, the first essential question is: which characteristics really provide independent information about average returns? Answering this question helps understand the cross-section of returns and may open the door to their prediction.
Theoretically, linear factor models can be viewed as special cases of the arbitrage pricing theory (APT) of Ross (1976), which assumes that the return of an asset $n$ can be modelled as a linear combination of underlying factors $f_{t,k}$:

$$r_{t,n} = \alpha_n + \sum_{k=1}^{K} \beta_{n,k} f_{t,k} + \epsilon_{t,n}, \qquad (3.1)$$

where the usual econometric constraints on linear models hold: $E[\epsilon_{t,n}] = 0$, $\text{cov}(\epsilon_{t,n}, \epsilon_{t,m}) = 0$ for $n \neq m$, and $\text{cov}(f_{t,k}, \epsilon_{t,n}) = 0$. If such factors do exist, then they are in contradiction with
the cornerstone model in asset pricing: the capital asset pricing model (CAPM) of Sharpe
(1964), Lintner (1965) and Mossin (1966). Indeed, according to the CAPM, the only driver
of returns is the market portfolio. This explains why factors are also called ‘anomalies’.
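To make (3.1) concrete, here is a minimal R sketch on simulated data (all names and values below are ours, purely for illustration); the regression recovers the intercept $\alpha_n$ and the loadings $\beta_{n,k}$:

set.seed(42)                                          # Reproducibility
n_obs <- 120                                          # Ten years of monthly data
f <- matrix(rnorm(n_obs * 3), nrow = n_obs)           # Three simulated factor return series
beta <- c(0.8, -0.3, 0.5)                             # True loadings
r <- drop(0.01 + f %*% beta) + rnorm(n_obs, 0, 0.05)  # Asset returns generated as in (3.1)
summary(lm(r ~ f))                                    # Estimates close to 0.01 and (0.8, -0.3, 0.5)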
Empirical evidence of asset pricing anomalies has accumulated since the dual publication of
Fama and French (1992) and Fama and French (1993). This seminal work has paved the
way for a blossoming stream of literature that has its meta-studies (e.g., Green et al. (2013),
Harvey et al. (2016) and McLean and Pontiff (2016)). The regression (3.1) can be evaluated
once (unconditionally) or sequentially over different time frames. In the latter case, the
parameters (coefficient estimates) change and the models are thus called conditional (we
refer to Ang and Kristensen (2012) and to Cooper and Maio (2019) for recent results on
this topic as well as for a detailed review on the related research). Conditional models are
more flexible because they acknowledge that the drivers of asset prices may not be constant,
which seems like a reasonable postulate.
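Continuing the simulated example above, a hedged sketch of the conditional (rolling window) estimation of (3.1):

window <- 60                                          # Five-year estimation window
rolling_coefs <- t(sapply(1:(n_obs - window + 1),     # Slide the window through time
    function(s) {
        idx <- s:(s + window - 1)                     # Dates inside the current window
        coef(lm(r[idx] ~ f[idx, ]))                   # Window-specific alpha and betas
    }))
head(rolling_coefs)                                   # Coefficient estimates vary through time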