What are we going to learn today? (1/2)
• Entropy Features
– Shannon entropy
– The plug-in estimator
– Lempel-Ziv estimators
– Encoding schemes
– Entropy of a Gaussian process
– Entropy and the generalized mean
What are we going to learn today? (2/2)
• Microstructural Features
– First generation: Price sequences
• The tick rule
• The roll model
• The high-low volatility estimator
• The Corwin-Schulz bid-ask spread model
– Second generation: Strategic trade models
• Kyle’s lambda
• Amihud’s lambda
• Hasbrouck’s lambda
– Third generation: Sequential trade models
• Probability of information-based trading
• Volume-synchronized probability of informed trading
SECTION I
Structural Breaks
CUSUM TEST: Chu-Stinchcombe-White Test on Levels
• The statistic measures the standardized departure of the level $y_t$ from a reference level $y_n$, $n < t$:
$$S_{n,t} = \frac{y_t - y_n}{\hat{\sigma}_t \sqrt{t - n}}, \qquad \hat{\sigma}_t^2 = (t - 1)^{-1} \sum_{i=2}^{t} (\Delta y_i)^2$$
EXPLOSIVENESS: Chow-Type Dickey-Fuller (1/2)
• Consider the first-order autoregressive process with white noise $\varepsilon_t$:
$$y_t = \rho y_{t-1} + \varepsilon_t$$
• The null hypothesis is that $y_t$ follows a random walk, $H_0\!: \rho = 1$, and the alternative hypothesis is that $y_t$ starts as a random walk but changes at time $\tau^* T$, where $\tau^* \in (0,1)$, into an explosive process:
$$H_1\!: y_t = \begin{cases} y_{t-1} + \varepsilon_t & \text{for } t = 1, \ldots, \tau^* T \\ \rho y_{t-1} + \varepsilon_t & \text{for } t = \tau^* T + 1, \ldots, T, \text{ with } \rho > 1 \end{cases}$$
• At time $T$ we can test whether a switch (from random walk to explosive process) took place at time $\tau^* T$ (the break date). To test this hypothesis, we fit the following specification,
$$\Delta y_t = \delta y_{t-1} D_t[\tau^*] + \varepsilon_t$$
where $D_t[\tau^*]$ is a dummy variable that takes the value zero if $t < \tau^* T$, and the value one if $t \geq \tau^* T$.
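As a rough illustration, the specification above can be fit by OLS on the dummied regressor. A minimal numpy sketch, assuming a 1-D array of levels `y` and a candidate break fraction `tau_star` (the helper name and degrees-of-freedom convention are illustrative, not the authors' code):

```python
import numpy as np

def chow_type_dfc(y, tau_star):
    """t-statistic of delta in: dy_t = delta * y_{t-1} * D_t[tau*] + eps_t."""
    T = y.shape[0]
    dy = np.diff(y)                      # dy_t for t = 2, ..., T
    x = y[:-1].astype(float).copy()      # regressor y_{t-1}
    t = np.arange(2, T + 1)              # calendar index of each dy_t
    x[t < tau_star * T] = 0.0            # dummy: zero out observations before the break
    delta = (x @ dy) / (x @ x)           # OLS slope (no intercept)
    resid = dy - delta * x
    sigma2 = (resid @ resid) / (dy.shape[0] - 1)
    return delta / np.sqrt(sigma2 / (x @ x))   # the DFC statistic for this tau*
```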
EXPLOSIVENESS: Chow-Type Dickey-Fuller (2/2)
• Then, the null hypothesis $H_0\!: \delta = 0$ is tested against the (one-sided) explosive alternative $H_1\!: \delta > 0$:
$$DFC_{\tau^*} = \frac{\hat{\delta}}{\hat{\sigma}_{\hat{\delta}}}$$
• The main drawback of this method is that 𝜏 ∗ is unknown.
• To address this issue, Andrews [1993] proposed a new test where all possible $\tau^*$ are tried, within some interval $\tau^* \in [\tau_0, 1 - \tau_0]$.
• The test statistic for an unknown $\tau^*$ is the maximum of all $T(1 - 2\tau_0)$ values of $DFC_{\tau^*}$:
$$SDFC = \sup_{\tau^* \in [\tau_0, 1 - \tau_0]} DFC_{\tau^*}$$
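Continuing the sketch above, SDFC scans a grid of admissible break fractions and keeps the largest statistic; the grid granularity and the default $\tau_0 = 0.15$ below are illustrative choices, and in practice the statistic is compared against simulated critical values (not shown):

```python
def sdfc(y, tau0=0.15):
    """Supremum of DFC over break fractions tau* in [tau0, 1 - tau0]."""
    T = y.shape[0]
    grid = [t / T for t in range(int(tau0 * T), int((1 - tau0) * T) + 1)]
    return max(chow_type_dfc(y, tau) for tau in grid)
```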
• Another drawback of Chow's approach is that it assumes there is only one break date $\tau^* T$, and that the bubble runs up to the end of the sample (there is no switch back to a random walk). For situations where three or more regimes exist (random walk → bubble → random walk → ...), this is problematic.
EXPLOSIVENESS: SADF (1/2)
• In the words of Phillips, Wu and Yu [2011]: “[S]tandard unit root and cointegration tests are
inappropriate tools for detecting bubble behavior because they cannot effectively distinguish
between a stationary process and a periodically collapsing bubble model. Patterns of
periodically collapsing bubbles in the data look more like data generated from a unit root or
stationary autoregression than a potentially explosive process.”
• To address this flaw, these authors propose fitting the (Augmented Dickey-Fuller) regression specification
$$\Delta y_t = \alpha + \beta y_{t-1} + \sum_{l=1}^{L} \gamma_l \Delta y_{t-l} + \varepsilon_t$$
where we test for $H_0\!: \beta \leq 0$ against $H_1\!: \beta > 0$. Inspired by Andrews [1993], Phillips and Yu [2011] and Phillips, Wu and Yu [2011] proposed the Supremum Augmented Dickey-Fuller test (SADF).
EXPLOSIVENESS: SADF (2/2)
• SADF fits the above regression at each end point $t$ with backward-expanding start points, then computes
$$SADF_t = \sup_{t_0 \in [1, t-\tau]} ADF_{t_0,t} = \sup_{t_0 \in [1, t-\tau]} \left\{ \frac{\hat{\beta}_{t_0,t}}{\hat{\sigma}_{\hat{\beta}_{t_0,t}}} \right\}$$
where $\hat{\beta}_{t_0,t}$ is estimated on a sample that starts at $t_0$ and ends at $t$, $\tau$ is the minimum sample length used in the analysis, $t_0$ is the left bound of the backward-expanding window, and $t = \tau, \ldots, T$. For the estimation of $SADF_t$, the right side of the window is fixed at $t$. The standard ADF test is a special case of $SADF_t$, where $\tau = t$.
• There are two critical differences between $SADF_t$ and SDFC: First, $SADF_t$ is computed at each $t \in [\tau, T]$, whereas SDFC is computed only at $T$. Second, instead of introducing a dummy variable, SADF recursively expands the beginning of the sample ($t_0 \in [1, t - \tau]$). By trying all combinations of a nested double loop on $(t_0, t)$, SADF does not assume a known number of regime switches or break dates.
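A minimal sketch of the double loop, using a plain-OLS ADF regression; `adf_tstat`, the lag convention, and `min_len` are illustrative assumptions, and significance would again be judged against simulated critical values:

```python
import numpy as np

def adf_tstat(y, lags=1):
    """t-stat of beta in: dy_t = alpha + beta*y_{t-1} + sum_l gamma_l*dy_{t-l} + eps_t."""
    dy = np.diff(y)
    n = dy.shape[0]
    X = np.column_stack([np.ones(n - lags), y[lags:-1]] +
                        [dy[lags - l:n - l] for l in range(1, lags + 1)])
    z = dy[lags:]
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    sigma2 = (resid @ resid) / (z.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1] / se

def sadf(y, min_len=20, lags=1):
    """SADF_t for the sample ending at t = len(y): sup over start points t0."""
    return max(adf_tstat(y[t0:], lags)
               for t0 in range(0, y.shape[0] - min_len + 1))
```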
EXPLOSIVENESS: Sub/Super-Martingale (1/3)
• Consider a process that is either a sub- or super-martingale. Given some observations 𝑦𝑡 ,
we would like to test for the existence of an explosive time trend, 𝐻0 : 𝛽 = 0, 𝐻1 : 𝛽 ≠ 0,
under alternative specifications.
• Polynomial trend (SM-Poly1):
$$y_t = \alpha + \gamma t + \beta t^2 + \varepsilon_t$$
• Polynomial trend (SM-Poly2):
$$\log y_t = \alpha + \gamma t + \beta t^2 + \varepsilon_t$$
• Exponential trend (SM-Exp):
$$y_t = \alpha e^{\beta t} + \varepsilon_t \implies \log y_t = \log \alpha + \beta t + \xi_t$$
• Power trend (SM-Power):
$$y_t = \alpha t^{\beta} + \varepsilon_t \implies \log y_t = \log \alpha + \beta \log t + \xi_t$$
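All four specifications reduce to OLS after a suitable transform. A sketch returning $\hat{\beta}$ and its standard error; the helper name and spec labels mirror the slide but are otherwise illustrative:

```python
import numpy as np

def fit_trend(y, spec):
    """OLS fit of one sub/super-martingale specification.
    Returns (beta_hat, se_beta); y must be positive for the log specs."""
    t = np.arange(1.0, y.shape[0] + 1)
    design = {'poly1': (np.column_stack([t**0, t, t**2]), y),
              'poly2': (np.column_stack([t**0, t, t**2]), np.log(y)),
              'exp':   (np.column_stack([t**0, t]), np.log(y)),
              'power': (np.column_stack([t**0, np.log(t)]), np.log(y))}
    X, z = design[spec]
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    sigma2 = (resid @ resid) / (z.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])
    return beta[-1], se      # beta is the coefficient on the last regressor
```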
EXPLOSIVENESS: Sub/Super-Martingale (2/3)
• Similar to SADF, we fit any of these specifications to each end point $t = \tau, \ldots, T$, with backward-expanding start points, then compute
$$SMT_t = \sup_{t_0 \in [1, t-\tau]} \left\{ \frac{\left| \hat{\beta}_{t_0,t} \right|}{\hat{\sigma}_{\hat{\beta}_{t_0,t}}} \right\}$$
• The reason for the absolute value is that we are equally interested in explosive growth and collapse. In the simple regression case (Greene [2008], p. 48), the variance of $\hat{\beta}$ is $\hat{\sigma}_{\hat{\beta}}^2 = \frac{\hat{\sigma}_{\varepsilon}^2}{\hat{\sigma}_{xx}^2 (t - t_0)}$, hence $\lim_{t \to \infty} \hat{\sigma}_{\hat{\beta}_{t_0,t}} = 0$. The same result is generalizable to the multivariate linear regression case (Greene [2008], pp. 51–52).
• Problem: The $\hat{\sigma}_{\hat{\beta}}^2$ of a weak long-run bubble may be smaller than the $\hat{\sigma}_{\hat{\beta}}^2$ of a strong short-run bubble, hence biasing the method towards long-run bubbles.
12
Electronic copy available at: https://ssrn.com/abstract=3270269
EXPLOSIVENESS: Sub/Super-Martingale (3/3)
• Solution: We can penalize large sample lengths by determining the coefficient $\varphi \in [0, 1]$ that yields the best explosiveness signals:
$$SMT_t = \sup_{t_0 \in [1, t-\tau]} \left\{ \frac{\left| \hat{\beta}_{t_0,t} \right|}{\hat{\sigma}_{\hat{\beta}_{t_0,t}} (t - t_0)^{\varphi}} \right\}$$
• For instance,
– when $\varphi = 0.5$, we compensate for the lower $\hat{\sigma}_{\hat{\beta}_{t_0,t}}$ associated with longer sample lengths, in the simple regression case.
– For $\varphi \to 0$, $SMT_t$ will exhibit longer trends, as that compensation wanes and long-run bubbles mask short-run bubbles.
– For $\varphi \to 1$, $SMT_t$ becomes noisier, because more short-run bubbles are selected over long-run bubbles.
• Consequently, this is a natural way to adjust the explosiveness signal, so that it filters
opportunities targeting a particular holding period.
• The features used by the ML algorithm may include 𝑆𝑀𝑇𝑡 estimated from a wide range of 𝜑
values.
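Putting the pieces together, $SMT_t$ with the $(t - t_0)^{\varphi}$ penalty reuses `fit_trend` from the sketch above; the window and $\varphi$ defaults are illustrative:

```python
def smt(y, spec='exp', min_len=20, phi=0.5):
    """SMT_t for the sample ending at t = len(y), penalizing long windows by (t - t0)**phi."""
    stats = []
    for t0 in range(0, y.shape[0] - min_len + 1):
        beta, se = fit_trend(y[t0:], spec)
        stats.append(abs(beta) / (se * (y.shape[0] - t0) ** phi))
    return max(stats)

# Features from a range of phi values, e.g.:
# features = {phi: smt(y, 'exp', 20, phi) for phi in (0.0, 0.25, 0.5, 0.75, 1.0)}
```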
SECTION II
Entropy Features
Shannon's Entropy
• Let $X$ be a discrete random variable that takes values $x$ from the set $S_X$ with probability $p[x]$. The entropy of $X$ is defined as
$$H[X] = -\sum_{x \in S_X} p[x] \log p[x]$$
• A few observations:
– The value $\frac{1}{p[x]}$ measures how surprising an observation is, because surprising observations are characterized by their low probability.
– Entropy is the expected value of those surprises, where the $\log[.]$ function prevents that $p[x]$ cancels $\frac{1}{p[x]}$, and endows entropy with desirable mathematical properties.
– Accordingly, entropy can be interpreted as the amount of uncertainty associated with $X$. Entropy is zero when all probability is concentrated in a single element of $S_X$. Entropy reaches a maximum at $\log |S_X|$ when $X$ is distributed uniformly, $p[x] = \frac{1}{|S_X|}, \forall x \in S_X$.
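The boundary cases are easy to verify numerically (zero entropy for a degenerate distribution, $\log_2 |S_X|$ for a uniform one); the helper below is an illustrative sketch:

```python
import numpy as np

def shannon_entropy(p):
    """H[X] = -sum p[x] log2 p[x], for a probability vector p (zeros are skipped)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

shannon_entropy(np.array([1.0, 0.0]))   # 0.0: all mass on one element
shannon_entropy(np.ones(8) / 8)         # 3.0 = log2(8): uniform maximum
```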
Time Series Entropy: The Plug-In Estimator
• Given a data sequence $x_1^n$, comprising the string of values starting in position 1 and ending in position $n$, we can form a dictionary of all words of length $w < n$ in that sequence, $A^w$.
• Consider an arbitrary word $y_1^w \in A^w$ of length $w$. We denote $\hat{p}_w[y_1^w]$ the empirical probability of the word $y_1^w$ in $x_1^n$, which means that $\hat{p}_w[y_1^w]$ is the frequency with which $y_1^w$ appears in $x_1^n$.
• Assuming that the data is generated by a stationary and ergodic process, the law of large numbers guarantees that, for a fixed $w$ and large $n$, the empirical distribution $\hat{p}_w$ will be close to the true distribution $p_w$.
• Under these circumstances, a natural estimator for the entropy rate (i.e., average entropy per bit) is
$$\hat{H}_{n,w} = -\frac{1}{w} \sum_{y_1^w \in A^w} \hat{p}_w[y_1^w] \log_2 \hat{p}_w[y_1^w]$$
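A sketch of the plug-in estimator, assuming the message is a string over a finite alphabet (the helper name is illustrative):

```python
import numpy as np

def plug_in_entropy_rate(msg, w):
    """Plug-in entropy-rate estimate from the empirical pmf of words of length w."""
    words = [msg[i:i + w] for i in range(len(msg) - w + 1)]
    _, counts = np.unique(words, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum() / w

plug_in_entropy_rate('01100110011001100110', w=2)
# ~ 1 bit/symbol with w=2; a larger w (and sample) is needed to capture the periodicity
```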
Time Series Entropy: Lempel-Ziv Estimators
• Plug-in estimators require large samples.
• Kontoyiannis [1998] attempts to make a more efficient use of the information available in a
message.
• Let us define $L_i^n$ as 1 plus the length of the longest match found in the $n$ bits prior to $i$:
$$L_i^n = 1 + \max \left\{ l : x_i^{i+l} = x_j^{j+l} \text{ for some } i - n \leq j \leq i - 1, \; l \in [0, n] \right\}$$
• The general intuition is, as we increase the available history, we expect that messages with
high entropy will produce relatively shorter non-redundant substrings. In contrast, messages
with low entropy will produce relatively longer non-redundant substrings as we parse
through the message.
• The sliding-window LZ estimator $\hat{H}_{n,k} = \hat{H}_{n,k}[x_{-n+1}^{n+k-1}]$ is defined by
$$\hat{H}_{n,k} = \left[ \frac{1}{k} \sum_{i=1}^{k} \frac{L_i^n}{\log_2 n} \right]^{-1}$$
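A sketch of the window match length and the resulting estimator; the brute-force $O(n^2)$ scan is for illustration only, and overlapping matches into the future of $j$ are allowed, consistent with the definition above:

```python
import numpy as np

def match_length(msg, i, n):
    """1 + length of the longest prefix of msg[i:] found in the window msg[i-n:i]."""
    longest = 0
    for l in range(1, n + 1):
        if i + l > len(msg):
            break
        sub = msg[i:i + l]
        if any(msg[j:j + l] == sub for j in range(i - n, i)):
            longest = l
        else:
            break
    return 1 + longest

def lz_entropy_rate(msg, n):
    """Sliding-window LZ (Kontoyiannis) estimate: reciprocal mean of L_i^n / log2(n)."""
    pts = range(n, len(msg))
    ratios = [match_length(msg, i, n) / np.log2(n) for i in pts]
    return 1.0 / np.mean(ratios)
```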
Time Series Entropy: Encoding Schemes (1/2)
• Entropy rate estimation requires the discretization of a continuous variable, so that each
value can be assigned a code from a finite alphabet.
• Binary Encoding:
– A stream of returns 𝑟𝑡 can be encoded according to the sign, 1 for 𝑟𝑡 > 0, 0 for 𝑟𝑡 < 0, removing cases where 𝑟𝑡 = 0.
– Binary encoding arises naturally in the case of returns series sampled from price bars (i.e., bars that contain prices fluctuating between two symmetric horizontal barriers, centered around the start price), because $|r_t|$ is approximately constant.
• Quantile Encoding:
– Unless price bars are used, it is likely that more than two codes will be needed.
– One approach consists in assigning a code to each 𝑟𝑡 according to the quantile it belongs to.
– The quantile boundaries are determined using an in-sample period (training set).
– There will be the same number of observations assigned to each letter for the overall in-sample, and close to the
same number of observations per letter out-of-sample.
– This uniform (in-sample) or close to uniform (out-of-sample) distribution of codes tends to increase entropy readings
on average.
Time Series Entropy: Encoding Schemes (2/2)
• Sigma Encoding:
– As an alternative approach, rather than fixing the number of codes, we could let the price stream determine the
actual dictionary.
– Suppose we fix a discretization step, $\sigma$. Then, we assign the value 0 to $r_t \in [\min r, \min r + \sigma)$, the value 1 to $r_t \in [\min r + \sigma, \min r + 2\sigma)$, and so on, until every observation has been encoded with a total of $ceil\left[\frac{\max r - \min r}{\sigma}\right]$ codes, where $ceil[.]$ is the ceiling function.
– Unlike quantile encoding, now each code covers the same fraction of 𝑟𝑡 ’s range.
– Because codes are not uniformly distributed, entropy readings will tend to be smaller than in quantile encoding on
average; however, the appearance of a “rare” code will cause spikes in entropy readings.
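Both schemes map the returns onto a small alphabet that can then be fed to the entropy estimators above; a sketch, where the function names and the 10-bin default are illustrative assumptions:

```python
import numpy as np

def quantile_encode(r, r_train, n_codes=10):
    """Assign each return the in-sample quantile bin it falls into (0 .. n_codes-1)."""
    edges = np.quantile(r_train, np.linspace(0, 1, n_codes + 1)[1:-1])
    return np.digitize(r, edges)

def sigma_encode(r, step):
    """Assign each return a fixed-width bin of size `step`, starting at min(r)."""
    return np.floor((r - r.min()) / step).astype(int)

# e.g., msg = ''.join(map(str, quantile_encode(r, r_train))) for n_codes <= 10
```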
Entropy and the Generalized Mean (1/4)
• Here is an interesting way of thinking about entropy.
• Consider a set of real numbers $x = \{x_i\}_{i=1,\ldots,n}$ and weights $p = \{p_i\}_{i=1,\ldots,n}$, such that $0 \leq p_i \leq 1\ \forall i$ and $\sum_{i=1}^{n} p_i = 1$.
• The generalized weighted mean of $x$ with weights $p$ on a power $q \neq 0$ is defined as
$$M_q(x, p) = \left( \sum_{i=1}^{n} p_i x_i^q \right)^{\frac{1}{q}}$$
Entropy and the Generalized Mean (2/4)
• The reason this is a generalized mean is that other means can be obtained as special cases:
– Minimum: $\lim_{q \to -\infty} M_q(x, p) = \min_i x_i$
– Harmonic mean: $M_{-1}(x, p) = \left( \sum_{i=1}^{n} p_i x_i^{-1} \right)^{-1}$
– Geometric mean: $\lim_{q \to 0} M_q(x, p) = e^{\sum_{i=1}^{n} p_i \log x_i} = \prod_{i=1}^{n} x_i^{p_i}$
– Arithmetic mean: $M_1(x, \{n^{-1}\}_{i=1,\ldots,n}) = n^{-1} \sum_{i=1}^{n} x_i$
– Weighted mean: $M_1(x, p) = \sum_{i=1}^{n} p_i x_i$
– Quadratic mean: $M_2(x, p) = \left( \sum_{i=1}^{n} p_i x_i^2 \right)^{1/2}$
– Maximum: $\lim_{q \to +\infty} M_q(x, p) = \max_i x_i$
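The special cases are easy to check numerically; a sketch of $M_q$ with a few limiting values (the example vectors and the small/large $q$ stand-ins for the limits are illustrative):

```python
import numpy as np

def gen_mean(x, p, q):
    """Generalized weighted mean M_q(x, p) = (sum_i p_i x_i^q)^(1/q), for q != 0."""
    return (p @ x.astype(float) ** q) ** (1.0 / q)

x, p = np.array([1.0, 2.0, 4.0]), np.ones(3) / 3
gen_mean(x, p, 1)        # 2.333...: arithmetic mean
gen_mean(x, p, -1)       # 1.714... = 12/7: harmonic mean
gen_mean(x, p, 1e-9)     # -> 2.0, the geometric mean, as q -> 0
gen_mean(x, p, 200)      # -> 4.0, the maximum, as q -> +inf
```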
Entropy and the Generalized Mean (3/4)
• Let us define the quantity $N_q(p) = \frac{1}{M_{q-1}(p, p)}$, for some $q \neq 1$.
Entropy and the Generalized Mean (4/4)
• Shannon's entropy is
$$H(p) = \sum_{i=1}^{n} -p_i \log p_i = -\log \left[ \lim_{q \to 0} M_q(p, p) \right] = \log \left[ \lim_{q \to 1} N_q(p) \right]$$
• This shows that entropy can be interpreted as the logarithm of the effective number of items
in a list p, where 𝑞 → 1.
• Intuitively, entropy measures information as the level of diversity contained in a random
variable. This intuition is formalized through the notion of generalized mean.
• The implication is that Shannon’s entropy is a special case of a diversity measure (hence its
connection with volatility).
• We can now define and compute alternative measures of diversity, other than entropy,
where 𝑞 ≠ 1.
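Building on `gen_mean` above, $N_q(p)$ gives the effective number of items in the list $p$; near $q = 1$ it matches $\exp(H)$, as the identity states (the offset $10^{-6}$ is a numerical stand-in for the limit):

```python
def effective_number(p, q):
    """N_q(p) = 1 / M_{q-1}(p, p), the effective number of items in the list p."""
    return 1.0 / gen_mean(p, p, q - 1.0)

p = np.array([0.5, 0.25, 0.25])
effective_number(p, 1 + 1e-6)        # 2.828...: effective number as q -> 1
np.exp(-(p * np.log(p)).sum())       # 2.828... = exp(H), matching the identity
```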
SECTION III
Microstructural Features
[Figures: microstructural features estimated on a rolling basis; the covariance is evaluated on a rolling basis for different window sizes. Recoverable definitions:]
• Kyle's lambda: $\Delta p_t = \lambda b_t V_t$
• Amihud's lambda: $\frac{Return_t}{MovingAverage[DollarVolume_t, window]}$
• VPIN: $\frac{\sum_t |V_t^S - V_t^B|}{nV}$, with bulk volume classification $V_t^B = V_t Z\!\left[\frac{\Delta p_t}{\sigma_{\Delta p_t}}\right]$, $V_t^S = V_t - V_t^B$
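The slides only show the formulas; below is a minimal sketch of the VPIN feature with bulk volume classification, assuming arrays of per-bar price changes `dp` and volumes `v`, with scipy's normal CDF playing the role of $Z$ and bucket construction simplified to bars:

```python
import numpy as np
from scipy.stats import norm

def bvc_buy_volume(dp, v):
    """Bulk volume classification: V^B_t = V_t * Z[dp_t / sigma_dp]."""
    return v * norm.cdf(dp / dp.std())

def vpin(dp, v, n=50):
    """VPIN over the last n bars: sum of |V^S - V^B| over total volume."""
    v_b = bvc_buy_volume(dp, v)
    v_s = v - v_b
    return np.abs(v_s - v_b)[-n:].sum() / v[-n:].sum()
```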
Financial problems require very distinct machine learning solutions. Dr. López
de Prado’s book is the first one to characterize what makes standard machine
learning tools fail when applied to the field of finance, and the first one to
provide practical solutions to unique challenges faced by asset managers.
Everyone who wants to understand the future of finance should read this
book.
— Prof. Frank Fabozzi, EDHEC Business School. Editor of The Journal of
Portfolio Management.
Disclaimer
• The views expressed in this document are the author's and do not necessarily reflect those of the organizations he is affiliated with.
• No investment decision or particular course of action is recommended by this
presentation.
• All Rights Reserved. © 2017-2020 by True Positive Technologies, LP
www.QuantResearch.org
Market Microstructure
in the Age of Machine Learning
Marcos López de Prado
Lawrence Berkeley National Laboratory
Computational Research Division
• Dollar bars are formed by accumulating trades until the cumulative dollar value traded reaches a threshold: a bar closes at time $t$ once $\sum_{\tau} p_\tau V_\tau \geq L$, where the sum runs over the trades since the previous bar.
• The threshold $L$ is chosen to yield roughly 50 bars per day in the year 2017. Each bar contains Open, High, Low, Close and Volume.
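A sketch of dollar-bar formation from a trade tape, assuming arrays of trade prices and sizes; the helper and its bar-indexing convention are illustrative:

```python
import numpy as np

def dollar_bar_ids(price, volume, L):
    """Assign each trade a bar id; a bar closes once the accumulated p*V reaches L."""
    ids = np.empty(price.shape[0], dtype=int)
    bar, acc = 0, 0.0
    for i, dv in enumerate(price * volume):
        ids[i] = bar
        acc += dv
        if acc >= L:              # threshold hit: next trade opens a new bar
            bar, acc = bar + 1, 0.0
    return ids

# OHLCV per bar follows by grouping trades on the returned ids.
```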
[Figures: rolling estimates of microstructural features; covariances are evaluated on a rolling basis for different window sizes (all plots use window = 50 bars). Recoverable panel definitions:]
• Roll impact
• Kyle's lambda: $\Delta p_t = \lambda b_t V_t$
• VPIN: $\frac{\sum_t |V_t^S - V_t^B|}{nV}$, with $V_t^B = V_t Z\!\left[\frac{\Delta p_t}{\sigma_{\Delta p_t}}\right]$, $V_t^S = V_t - V_t^B$
• In contrast to MDI, MDA tests the actual importance for out-of-sample performance.
[Figures: return skewness prediction, MDI | 250 bars and MDA | 250 bars]
Prediction labels
Binary Label generation: at time 𝑡, we generate a binary label by
computing a measure 𝑀𝑡 that is constructed with a backward window
and comparing it to its value at 𝑡 + ℎ, with ℎ = 250 bars (around 5
trading days) for all labels hereafter in this study.
[Diagram: the measure $M$ evaluated at $t$ and at $t + h$ on the time axis.]
Label at $t$: $\text{sign}[M_{t+h} - M_t]$
• Corwin-Schultz bid-ask spread: $\text{sign}[S_{t+h} - S_t]$, with
$$S_t = \frac{2(e^{\alpha_t} - 1)}{1 + e^{\alpha_t}}, \quad \alpha_t = \frac{\sqrt{2\beta_t} - \sqrt{\beta_t}}{3 - 2\sqrt{2}} - \sqrt{\frac{\gamma_t}{3 - 2\sqrt{2}}}, \quad \beta_t = E\left[ \sum_{j=0}^{1} \left( \log \frac{H_{t-j}}{L_{t-j}} \right)^2 \right], \quad \gamma_t = \left( \log \frac{H_{t-1,t}}{L_{t-1,t}} \right)^2$$
• Realized volatility: $\text{sign}[\sigma_{t+h} - \sigma_t]$
• Jarque-Bera test: $\text{sign}[JB(r_{t+h}) - JB(r_t)]$, where
$$JB(r) = \frac{n}{6} \left( S^2 + \frac{1}{4}(C - 3)^2 \right)$$
$S$ is the skewness and $C$ is the kurtosis of the realized return $r$ in the past window
• Serial correlation: $sc_t = \text{corr}[r_t, r_{t-1}]$, the correlation between returns of two consecutive bars
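Label generation itself is one line once a measure series is computed; a sketch for the realized-volatility label, assuming a bar-return array `r` and the slide's 250-bar horizon:

```python
import numpy as np

def binary_labels(m, h=250):
    """Label at t: sign[M_{t+h} - M_t]; the last h observations have no label."""
    return np.sign(m[h:] - m[:-h])

# Example: realized-volatility measure from bar returns r, then its labels
# vol = np.array([r[i - 250:i].std() for i in range(250, len(r))])
# labels = binary_labels(vol, h=250)
```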
• For the prediction of realized volatility, the Amihud and Roll measures remain stable, while VPIN's importance increases and Kyle's lambda decreases as the window size expands, indicating that VPIN's predictive power grows with a larger look-back window.
• For the prediction of the JB test, the result is similar to realized volatility: VPIN's importance increases while the Amihud measure decreases as the window size expands, and the others remain almost the same, again indicating that VPIN's predictive power grows with a larger look-back window.
THANKS FOR YOUR ATTENTION!