The Probability of Backtest Overfitting
David H. Bailey∗   Jonathan M. Borwein†   Marcos López de Prado‡   Qiji Jim Zhu§

∗ Lawrence Berkeley National Laboratory (retired), 1 Cyclotron Road, Berkeley, CA 94720, USA, and Research Fellow at the University of California, Davis, Department of Computer Science. E-mail: david@davidhbailey.com; URL: http://www.davidhbailey.com

† Laureate Professor of Mathematics at University of Newcastle, Callaghan NSW 2308, Australia, and a Fellow of the Royal Society of Canada, the Australian Academy of Science, the American Mathematical Society and the AAAS. E-mail: jonathan.borwein@newcastle.edu.au; URL: http://www.carma.newcastle.edu.au/jon

‡ Senior Managing Director at Guggenheim Partners, New York, NY 10017, and Research Affiliate at Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA. E-mail: lopezdeprado@lbl.gov; URL: http://www.QuantResearch.info

§ Professor, Department of Mathematics, Western Michigan University, Kalamazoo, MI 49008, USA. E-mail: zhu@wmich.edu; URL: http://homepages.wmich.edu/~zhu/
Abstract
Many investment firms and portfolio managers rely on backtests
(i.e., simulations of performance based on historical market data) to
select investment strategies and allocate capital. Standard statistical
techniques designed to prevent regression overfitting, such as hold-
out, tend to be unreliable and inaccurate in the context of investment
backtests. We propose a general framework to assess the probability of backtest overfitting (PBO). We illustrate this framework with a specific, generic, model-free and nonparametric implementation in the context of investment simulations, which we call combinatorially symmetric cross-validation (CSCV). We show that CSCV produces reasonable estimates of PBO for several useful examples.
1 Introduction
Modern investment strategies rely on the discovery of patterns that can be
quantified and monetized in a systematic way. For example, algorithms can
be designed to profit from phenomena such as “momentum,” i.e., the ten-
dency of many securities to exhibit long runs of profits or losses, beyond
what could be expected from securities following a martingale. One advan-
tage of this systematization of investment strategies is that those algorithms
are amenable to “backtesting.” A backtest is a historical simulation of how
an algorithmic strategy would have performed in the past. Backtests are
valuable tools because they allow researchers to evaluate the risk/reward
profile of an investment strategy before committing funds.
Recent advances in algorithmic research and high-performance computing have made it nearly trivial to test millions, or even billions, of alternative investment strategies on a finite dataset of financial time series. While these advances are undoubtedly useful, they also have a negative and often silenced side effect: an alarming rise of false positives in related academic publications (The Economist [32]). This paper introduces a computational procedure for detecting false positives in the context of investment strategy research.
To motivate our study, consider a researcher who is investigating an algorithm to profit from momentum. Perhaps the most popular technique among Commodity Trading Advisors (CTAs) is to use so-called crossing moving averages to detect a change of trend in a security.¹ Even for the simplest case, there are at least five parameters that the researcher can fit: two sample lengths for the moving averages, an entry threshold, an exit threshold and a stop-loss. The number of combinations that can be tested over thousands of securities is in the billions. For each of those billions of backtests, we could estimate its Sharpe ratio (or any other performance statistic) and determine whether that Sharpe ratio is statistically significant at a confidence level of 95%. Although this approach is consistent with the Neyman-Pearson framework of hypothesis testing, it is highly likely that false positives will emerge with a probability greater than 5%. The reason
¹ Several technical tools are based on this principle, such as the Moving Average Convergence Divergence (MACD) indicator.
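To get a sense of the combinatorial explosion described above, the following Python sketch counts the backtests implied by a five-parameter grid. The grid sizes and the universe of 2,000 securities are illustrative assumptions of ours, not figures from the text.

```python
# Hypothetical parameter grids for a crossing-moving-averages strategy.
# Grid sizes and the security universe are assumptions chosen only to
# illustrate how quickly the number of candidate backtests grows.
fast_ma_lengths  = range(5, 105, 5)                   # 20 fast moving-average lengths
slow_ma_lengths  = range(50, 550, 10)                 # 50 slow moving-average lengths
entry_thresholds = [i / 100 for i in range(1, 21)]    # 20 entry thresholds
exit_thresholds  = [i / 100 for i in range(1, 21)]    # 20 exit thresholds
stop_losses      = [i / 100 for i in range(1, 11)]    # 10 stop-loss levels
n_securities     = 2_000                              # hypothetical universe size

per_security = (len(fast_ma_lengths) * len(slow_ma_lengths) * len(entry_thresholds)
                * len(exit_thresholds) * len(stop_losses))
print(f"{per_security:,} configurations per security")                 # 4,000,000
print(f"{per_security * n_securities:,} backtests over the universe")  # 8,000,000,000
```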
i) M is a true matrix, i.e. with the same number of rows for each column,
where observations are synchronous for every row across the N trials,
and
ii) the performance evaluation metric used to choose the “optimal” strat-
egy can be estimated on subsamples of each column.
For example, if that metric were the Sharpe ratio, we would expect that the
IID Normal distribution assumption could be maintained on various slices
of the reported performance. If different model configurations trade with
different frequencies, observations should be aggregated to match a common
index t = 1, . . . , T .
Second, we partition M across rows, into an even number S of disjoint
submatrices of equal dimensions. Each of these submatrices Ms , with s =
1, . . . , S, is of order (T /S × N ).
Third, we form all combinations $C_S$ of the $M_s$, taken in groups of size S/2. This gives a total number of combinations

$$\binom{S}{S/2} = \binom{S-1}{S/2-1}\,\frac{S}{S/2} = \dots = \prod_{i=0}^{S/2-1}\frac{S-i}{S/2-i}. \qquad (2.3)$$
For instance, if S = 16, we will form 12,870 combinations. Each combination $c \in C_S$ is composed of S/2 submatrices $M_s$.
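As a minimal sketch of the second and third steps, assuming the performance matrix M is held as a NumPy array (the function names below are ours, not the paper's):

```python
import itertools
from math import comb

import numpy as np

def partition_rows(M: np.ndarray, S: int) -> list:
    """Second step: split M (T x N) into S disjoint row-blocks M_1, ..., M_S."""
    T = M.shape[0]
    assert S % 2 == 0 and T % S == 0, "S must be even and divide T"
    return np.split(M, S, axis=0)

def half_combinations(S: int):
    """Third step: all elements of C_S, i.e. index sets of S/2 row-blocks."""
    return itertools.combinations(range(S), S // 2)

print(comb(16, 8))   # 12,870 combinations when S = 16, as in the text
```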
Fourth, for each combination $c \in C_S$, we:

a) Form the training set J by joining the S/2 submatrices $M_s$ that constitute c, in their original order. J is a matrix of order $((T/S)(S/2) \times N) = (T/2 \times N)$.
f) Define the relative rank of $\bar{r}^{c}_{n^*}$ by $\bar{\omega}_c := \bar{r}^{c}_{n^*}/(N+1) \in (0,1)$. This is the relative rank of the OOS performance associated with the strategy chosen IS. If the strategy optimization procedure is not overfitting, we should observe that $\bar{r}^{c}_{n^*}$ systematically outperforms OOS, just as $r^{c}_{n^*}$ outperformed IS.

g) We define the logit $\lambda_c = \ln\frac{\bar{\omega}_c}{1-\bar{\omega}_c}$. High logit values imply a consistency between IS and OOS performances, which indicates a low level of backtest overfitting.
Fifth, we compute the distribution of ranks OOS by collecting all the $\lambda_c$, for $c \in C_S$. Define the relative frequency at which $\lambda$ occurred across all $C_S$ by

$$f(\lambda) = \sum_{c \in C_S} \frac{\chi_{\{\lambda\}}(\lambda_c)}{\#(C_S)}, \qquad (2.4)$$
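The sketch below assembles the procedure end to end. Steps (b)-(e), which fall in a part of the text not reproduced here, are assumed to compute each strategy's performance IS and OOS, select the IS-optimal strategy n*, and record that strategy's OOS rank; we use the Sharpe ratio as the performance metric and estimate PBO as the relative frequency of non-positive logits, which is our reading of the framework rather than a verbatim definition.

```python
import itertools
import numpy as np

def sharpe(returns: np.ndarray, periods_per_year: int = 252) -> np.ndarray:
    """Annualized Sharpe ratio of each column of a (t x N) block of returns."""
    return np.sqrt(periods_per_year) * returns.mean(axis=0) / returns.std(axis=0, ddof=1)

def cscv_logits(M: np.ndarray, S: int = 16) -> np.ndarray:
    """Logit lambda_c for every combination c in C_S (sketch of Section 2)."""
    T, N = M.shape
    blocks = np.split(M, S, axis=0)                        # second step: S row-blocks
    logits = []
    for c in itertools.combinations(range(S), S // 2):     # third step: C_S
        train = np.vstack([blocks[i] for i in c])          # step (a): training set J
        test = np.vstack([blocks[i] for i in range(S) if i not in c])  # complement of J
        is_perf, oos_perf = sharpe(train), sharpe(test)    # assumed steps (b)-(d)
        n_star = int(np.argmax(is_perf))                   # assumed step (e): IS optimum
        rank = int((oos_perf <= oos_perf[n_star]).sum())   # OOS rank of the IS optimum
        omega = rank / (N + 1)                             # step (f): relative rank
        logits.append(np.log(omega / (1.0 - omega)))       # step (g): logit
    return np.asarray(logits)

def pbo_estimate(logits: np.ndarray) -> float:
    """Relative frequency of non-positive logits, our estimate of PBO."""
    return float((logits <= 0).mean())
```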
3 Overfit statistics
The framework introduced in Section 2 allows us to characterize the reliability of a strategy's backtest in terms of four complementary analyses:
of size T/k. Then it sequentially tests, on each of the k samples, the model trained on the remaining T − T/k observations. Although this is a valid approach in many situations, we believe that our procedure is more satisfactory than K-FCV in the context of strategy selection. In particular, we would like to compute the Sharpe ratio (or any other performance measure) on each of the k testing sets of size T/k. This means that k must be sufficiently small so that the Sharpe ratio estimate is reliable (see Bailey and López de Prado [2] for a discussion of Sharpe ratio confidence bands). But if k is small, K-FCV essentially reduces to a "hold-out" method, which we have argued is unreliable. Also, LOOCV is a K-FCV where k = T, and we are not aware of any reliable performance metric computed on a single OOS observation.
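To make the reliability argument concrete, recall (Lo [20]) that under IID returns the standard error of an estimated Sharpe ratio is approximately sqrt((1 + SR²/2)/n). The sketch below, with illustrative numbers of ours, shows how the precision of a test-set Sharpe ratio degrades as k grows and the test sets of size T/k shrink.

```python
from math import sqrt

def sharpe_se(sr: float, n_obs: int) -> float:
    """Approximate standard error of an estimated (per-period) Sharpe ratio
    under IID returns, following Lo [20]: sqrt((1 + SR^2/2) / n_obs)."""
    return sqrt((1.0 + 0.5 * sr ** 2) / n_obs)

# Illustrative: 5 years of daily data and a true annualized Sharpe ratio of 1.0.
T = 5 * 252
sr_daily = 1.0 / sqrt(252)
for k in (2, 5, 10, 50):
    test_size = T // k
    print(f"k = {k:>2}: test set of {test_size} days, "
          f"SE of Sharpe ~ {sharpe_se(sr_daily, test_size):.3f}")
# At k = T (LOOCV) each test set is a single observation, on which no
# Sharpe ratio can be estimated at all.
```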
The combinatorially symmetric cross-validation (CSCV) method we have proposed in Section 2.2 differs from both K-FCV and LOOCV. The key idea is to generate $\binom{S}{S/2}$ testing sets of size T/2 by recombining the S slices of the overall sample of size T. This procedure presents a number of advantages. First, CSCV ensures that the training and testing sets are of equal size, thus providing comparable accuracy to the IS and OOS Sharpe ratios (or any other performance metric that is susceptible to sample size).
6 A practical application
Bailey et al. [1] present an example of an investment strategy that attempts
to profit from a seasonal effect. For the reader’s convenience, we reiterate
here how the strategy works. Suppose that we would like to identify the
optimal monthly trading rule, given four customary parameters: Entry day,
Holding period, Stop loss and Side.
Side defines whether we will hold long or short positions on a monthly
basis. Entry day determines the business day of the month when we enter
a position. Holding period gives the number of days that the position is
held. Stop loss determines the size of the loss as a multiple of the series’
volatility that triggers an exit for that month’s position. For example, we
could explore all nodes that span the interval [1, . . . , 22] for Entry day, the
interval [1, . . . , 20] for Holding period, the interval [0, . . . , 10] for Stop loss,
and {−1, 1} for Side. The parameter combinations involved form a four-dimensional mesh of 8,800 elements. The optimal parameter combination can be discovered by computing the performance derived from each node.
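A sketch of this exhaustive search follows. The monthly trading rule below is a simplified stand-in of ours for the strategy described in [1], and the grid endpoints are written so that the mesh has roughly the 8,800 nodes mentioned above; it is not the authors' implementation.

```python
import itertools
import numpy as np

def monthly_rule_sharpe(prices: np.ndarray, entry_day: int, holding: int,
                        stop_loss: int, side: int, days_per_month: int = 22) -> float:
    """Simplified stand-in for the monthly rule: enter on a given business day
    of each 'month', hold for a fixed number of days, and exit early if the
    cumulative loss exceeds stop_loss times the series' daily volatility."""
    changes = np.diff(prices)
    vol = changes.std(ddof=1)
    pnl = np.zeros_like(changes)
    for month_start in range(0, len(changes), days_per_month):
        i = month_start + entry_day - 1                    # entry day (1-indexed)
        cum = 0.0
        for j in range(i, min(i + holding, len(changes))):
            pnl[j] = side * changes[j]
            cum += pnl[j]
            if cum < -stop_loss * vol:                     # stop-loss exit
                break
    return np.sqrt(252) * pnl.mean() / pnl.std(ddof=1)

def best_combination(prices: np.ndarray):
    """Exhaustive search over a four-dimensional mesh of ~8,800 nodes."""
    mesh = itertools.product(range(1, 23),                 # Entry day: 1..22
                             range(1, 21),                 # Holding period: 1..20
                             range(0, 10),                 # Stop-loss levels
                             (-1, 1))                      # Side
    return max(mesh, key=lambda p: monthly_rule_sharpe(prices, *p))
```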
First, as discussed in the above-cited paper, a time series of 1,000 daily prices (about 4 years) was generated by drawing from a random walk. Parameters were optimized (Entry day = 11, Holding period = 4, Stop loss = -1 and Side = 1), resulting in an annualized Sharpe ratio of 1.27. Given the elevated Sharpe ratio, we may conclude that this strategy's performance is significantly greater than zero at standard confidence levels. Indeed, the PSR-Stat is 2.83, which implies a less than 1% probability that the true Sharpe ratio is below 0 (see Bailey and López de Prado [2] for details). Figure 6 gives a graphical illustration of this example.
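For reference, a minimal sketch of the PSR statistic along the lines of Bailey and López de Prado [2]: the z-score of the estimated Sharpe ratio against a benchmark (here zero), adjusted for the skewness and kurtosis of the returns. The variable names are ours.

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def psr_stat(returns: np.ndarray, sr_benchmark: float = 0.0) -> float:
    """z-score of the estimated (per-period) Sharpe ratio against a benchmark,
    adjusted for skewness and kurtosis, following Bailey and López de Prado [2]."""
    n = len(returns)
    sr = returns.mean() / returns.std(ddof=1)
    g3 = skew(returns)                       # sample skewness
    g4 = kurtosis(returns, fisher=False)     # raw kurtosis (3 for a normal)
    return (sr - sr_benchmark) * np.sqrt(n - 1) / np.sqrt(
        1 - g3 * sr + (g4 - 1) / 4 * sr ** 2)

# A PSR-Stat of 2.83 maps to a probability of roughly
print(norm.cdf(2.83))   # ~0.998 that the true Sharpe ratio exceeds the benchmark
```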
We have estimated the PBO using our CSCV procedure, and obtained
the results illustrated below. Figure 7 shows that approx. 53% of the SR
OOS are negative, despite all SR IS being positive and ranging between
1 and 2.2. Figure 8 plots the distribution of logits, which implies that,
despite the elevated SR IS, the PBO is as high as 55%. Consequently,
Figure 9 shows that the distribution of optimized OOS SR does not dominate
the overall distribution of OOS SR. This is consistent with the fact that
the underlying series follows a random walk, thus the serial independence
among observations makes any seasonal patterns coincidental. The CSCV
framework has succeeded in diagnosing that the backtest was overfit.
Second, we generated a time series of 1,000 daily prices (about 4 years), following a random walk. But unlike the first case, we shifted the returns of the first 5 random observations of each month to be centered at a quarter of a standard deviation. This simulates a monthly seasonal effect, which the strategy selection procedure should discover. Figure 10 plots the random series, as well as the performance associated with the optimal parameter combination: Entry day = 1, Holding period = 4, Stop loss = -10 and Side = 1. The annualized Sharpe ratio, at 1.54, is similar to the previous (overfit) case (1.54 vs. 1.27).
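A sketch of how such a seasonal series can be generated, assuming Gaussian daily returns and 22 business days per month; the exact generation code used in [1] may differ.

```python
import numpy as np

def seasonal_random_walk(n_days: int = 1000, days_per_month: int = 22,
                         shift_days: int = 5, shift_sd: float = 0.25,
                         seed: int = 0) -> np.ndarray:
    """Price series from a Gaussian random walk whose first `shift_days`
    returns of each 'month' are re-centered at `shift_sd` standard deviations,
    creating a genuine monthly seasonal effect."""
    rng = np.random.default_rng(seed)
    returns = rng.standard_normal(n_days)
    for month_start in range(0, n_days, days_per_month):
        returns[month_start:month_start + shift_days] += shift_sd   # seasonal shift
    return 100.0 + np.cumsum(returns)                               # price path

prices = seasonal_random_walk()   # 1,000 daily prices, about 4 years
```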
The next three graphs report the results of the CSCV analysis, which
confirm the validity of this backtest in the sense that performance inflation
from overfitting is minimal. Figure 11 shows only 13% of the OOS SR to
be negative. Because there is a real monthly effect in the data, the PBO for
this second case should be substantially lower than the PBO of the first case.
Figure 12 shows a distribution of logits with a PBO of only 13%. Figure
13 evidences that the distribution of OOS SR from IS optimal combinations
clearly dominates the overall distribution of OOS SR. The CSCV analysis
has this time correctly recognized the validity of this backtest, in the sense
that performance inflation from overfitting is small.
In this practical application we have illustrated how simple it is to produce overfit backtests when answering common investment questions, such as the presence of seasonal effects. We refer the reader to [1, Appendix 4] for the implementation of this experiment in Python. Similar experiments can be designed to demonstrate overfitting in the context of other effects, such as trend-following, momentum, mean-reversion, event-driven effects, and the like. Given the facility with which elevated Sharpe ratios can be manufactured IS, the reader would be well advised to remain critical of backtests and researchers that fail to report the PBO results.
7 Conclusions
In [2] Bailey and López de Prado developed methodologies to evaluate the
probability that a Sharpe ratio is inflated (PSR), and to determine the
minimum track record length (MinTRL) required for a Sharpe ratio to be
statistically significant. These statistics were developed to assess Sharpe
ratios based on live investment performance and backtest track records. This
paper has extended this approach to present formulas and approximation
techniques for finding the probability of backtest overfitting.
To that end, we have proposed a general framework for modeling the IS and OOS performance using probability. We define the probability of backtest overfitting (PBO) as the probability that the strategy selected as optimal IS underperforms the median OOS. To facilitate the evaluation of PBO for particular applications, we have proposed a combinatorially symmetric cross-validation (CSCV) implementation framework for estimating this probability. This estimate is generic, symmetric, model-free and nonparametric.
We have assessed the accuracy of CSCV as an approximation of PBO in
two different ways, on a wide variety of test cases. Monte Carlo simula-
tions show that CSCV applied on a single dataset provides similar results to
computing PBO on a large number of independent samples. We have also
directly computed PBO by deriving the Extreme Value distributions that
model the performance of IS optimal strategies. These results indicate that
CSCV provides reasonable estimates of PBO, with relatively small errors.
Besides estimating PBO, our general framework and its CSCV imple-
mentation scheme can also be used to deal with other issues related to
overfitting, such as performance degeneration, probability of loss and pos-
sible stochastic dominance of a strategy. On the other hand, the CSCV
implementation also has some limitations. This suggests that other imple-
mentation frameworks may well be more suitable, particularly for problems
with structural information.
Nevertheless, we believe that CSCV provides both a new and powerful
tool in the arsenal of an investment and financial researcher, and that it also
References
[1] Bailey, D., J. Borwein, M. López de Prado and J. Zhu, "Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance," Notices of the AMS, 61 (May 2014), 458–471. Online at http://www.ams.org/notices/201405/rnoti-p458.pdf.
[2] Bailey, D. and M. López de Prado, “The Sharpe Ratio Efficient Frontier,” Journal
of Risk, 15(2012), 3–44. Available at http://ssrn.com/abstract=1821643.
[3] Bailey, D. and M. López de Prado, “The Deflated Sharpe Ratio: Correcting for
Selection Bias, Backtest Overfitting and Non-Normality”, Journal of Portfolio Man-
agement, 40 (5) (2014), 94-107.
[4] Calkin, N. and M. López de Prado, “Stochastic Flow Diagrams”, Algorithmic Fi-
nance, 3(1-2) (2014) Available at http://ssrn.com/abstract=2379314.
[5] Calkin, N. and M. López de Prado, “The Topology of Macro Financial Flows: An Ap-
plication of Stochastic Flow Diagrams”, Algorithmic Finance, 3(1-2) (2014). Avail-
able at http://ssrn.com/abstract=2379319.
[6] Carr, P. and M. López de Prado, “Determining Optimal Trading Rules without
Backtesting”, (2014) Available at http://arxiv.org/abs/1408.1159.
[7] Doyle, J. and C. Chen, “The wandering weekday effect in major stock markets,”
Journal of Banking and Finance, 33 (2009), 1388–1399.
[9] Feynman, R., The Character of Physical Law, 1964, The MIT Press.
[10] Gelman, A. and J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical
Models, 2006, Cambridge University Press, First Edition.
[11] Hadar, J. and W. Russell, “Rules for Ordering Uncertain Prospects,” American
Economic Review, 59 (1969), 25–34.
[12] Harris, L., Trading and Exchanges: Market Microstructure for Practitioners, Oxford University Press, 2003.
[13] Harvey, C. and Y. Liu, “Backtesting”, SSRN, working paper, 2013. Available at
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2345489.
[14] Harvey, C., Y. Liu and H. Zhu, “...and the Cross-Section of Expected Returns,”
SSRN, 2013. Available at http://papers.ssrn.com/sol3/papers.cfm?abstract_
id=2249314.
[15] Hawkins, D., “The problem of overfitting,” Journal of Chemical Information and
Computer Science, 44 (2004), 10–12.
[16] Hirsch, Y., Don’t Sell Stocks on Monday, Penguin Books, 1st Edition, 1987.
[17] Ioannidis, J.P.A., "Why most published research findings are false," PLoS Medicine, 2(8) (2005), 696–701.
[18] Leinweber, D. and K. Sisk,“Event Driven Trading and the ‘New News’,” Journal of
Portfolio Management, 38(2011), 110–124.
[20] Lo, A., “The Statistics of Sharpe Ratios,” Financial Analysts Journal, 58 (2002),
July/August.
[21] López de Prado, M. and A. Peijan, “Measuring the Loss Potential of Hedge Fund
Strategies,” Journal of Alternative Investments, 7 (2004), 7–31. Available at http:
//ssrn.com/abstract=641702.
[23] MacKay, D.J.C. “Information Theory, Inference and Learning Algorithms”, Cam-
bridge University Press, First Edition, 2003.
[24] Mayer, J., K. Khairy and J. Howard, “Drawing an Elephant with Four Complex
Parameters,” American Journal of Physics, 78 (2010), 648–649.
[25] Miller, R.G., Simultaneous Statistical Inference, 2nd Ed. Springer Verlag, New York,
1981. ISBN 0-387-90548-0.
[26] Resnick, S., Extreme Values, Regular Variation and Point Processes, Springer, 1987.
[27] Romano, J. and M. Wolf, “Stepwise multiple testing as formalized data snooping”,
Econometrica, 73 (2005), 1273–1282.
[29] Schorfheide, F. and K. Wolpin, “On the Use of Holdout Samples for Model Selec-
tion,” American Economic Review, 102 (2012), 477–481.
[30] Stodden, V., Bailey, D., Borwein, J., LeVeque, R., Rider, W. and Stein, W., "Setting the default to reproducible: Reproducibility in computational and experimental mathematics," February, 2013. Available at http://www.davidhbailey.com/dhbpapers/icerm-report.pdf.
[31] Strathern, M., “Improving Ratings: Audit in the British University System,” Euro-
pean Review, 5, (1997) pp. 305-308.
[33] Van Belle, G. and K. Kerr, Design and Analysis of Experiments in the Health Sci-
ences, John Wiley and Sons, 2012.
[34] Weiss, S. and C. Kulikowski, Computer Systems That Learn: Classification and Pre-
diction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems,
Morgan Kaufman, 1st Edition, 1990.
[35] White, H., “A Reality Check for Data Snooping,” Econometrica, 68 (2000), 1097–
1126.