Ch05-Bekes Kezdi Data Analysis Slides v2
Gábor Békés
2020
I gabors-data-analysis.com
I Download all data and code: gabors-data-analysis.com/data-and-code/
Motivation
Generalization
I Sometimes we analyze a dataset with the goal of learning about patterns in that
dataset alone.
I In such cases there is no need to generalize our findings to other datasets.
I Example: we search for a good deal among hotel offers; all we care about are
the observations in our dataset.
I Often we analyze a dataset in order to learn about patterns that may be true in
other situations.
I We are interested in the relationship between
I Our dataset
I The situation we care about
Generalization
Statistical inference
I The general pattern is an abstract thing that may or may not exist.
I If we can assume that the general pattern exists, the tools of statistical inference
can be very helpful.
External validity
I Assessing whether our data represents the same general pattern that would be
relevant for the situation we truly care about.
I Externally valid case: the situation we care about and the data we have represent
the same general pattern
I With external validity, our data can tell us what to expect.
I Without external validity, whatever we learn from our data may turn out not to
be relevant at all.
I Inference problem: how can we generalize this finding? In our data, 0.5 percent
of days saw losses of 5 percent or more. What can we infer from this 0.5 percent
chance for the next calendar year?
Repeated samples
I Easier concept: when our data is a sample from a well-defined population, many
other samples could have turned out instead of the one we have.
I Harder concept: there is no clear definition of the population. Instead, we think
of a general pattern we care about.
Repeated samples
The sampling distribution of a statistic is the distribution of this statistic across repeated
samples.
The sampling distribution has three important properties
1. Unbiasedness: The average of the values in repeated samples is equal to its true
value (=the value in the entire population / general pattern).
2. Asymptotic normality: The sampling distribution is approximately normal. With a
large sample size, it is very close to normal.
3. Root-n convergence: The standard error (the standard deviation of the sampling
distribution) is smaller in larger samples; it shrinks in proportion to the square
root of the sample size (SE ∝ 1/√n).
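As a worked illustration of root-n convergence (the standard textbook formula for the sample mean, not specific to this case study):

    SE(x̄) = σ / √n

With σ = 1, a sample of n = 900 gives SE ≈ 1/30 ≈ 0.033, while n = 3,600 gives SE ≈ 1/60 ≈ 0.017: quadrupling the sample size halves the standard error.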
I Do simple random sampling: days are considered one after the other and are
selected or not in an independent random fashion.
I This sampling destroys the time series nature of the data.
I This is OK because daily returns are (almost) independent across days in the original
dataset.
I We do this 10,000 times (a minimal sketch of this simulation follows the figure below).
Figure: Histogram of the proportion of days with losses of 5 percent or more, across
10,000 repeated samples of size n=900. Source: sandp-stocks data, S&P 500 market index.
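A minimal sketch of this simulation in Python. The n = 900 sample size and 10,000 repetitions come from the case study; the file and column names are assumptions for illustration:

    import numpy as np
    import pandas as pd

    # Daily S&P 500 percentage returns (sandp-stocks data);
    # file and column names are assumed for this sketch.
    returns = pd.read_csv("SP500_2006_16_daily.csv")["pct_return"].to_numpy()

    rng = np.random.default_rng(seed=42)
    n_samples, sample_size = 10_000, 900

    # For each repeated sample: draw 900 days by simple random sampling
    # (without replacement) and record the share of days with a 5%+ loss.
    proportions = np.empty(n_samples)
    for i in range(n_samples):
        sample = rng.choice(returns, size=sample_size, replace=False)
        proportions[i] = np.mean(sample <= -5.0)

    # The histogram of `proportions` approximates the sampling distribution:
    # roughly normal, centered on the full-dataset proportion.
    print(proportions.mean(), proportions.std())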
I The “95 percent CI” gives the range of values where we think the true value falls
with 95 percent likelihood.
I Viewed from the perspective of a single sample, the chance (probability) that the
truth is within the CI measured around the value estimated from that single sample
is 95 percent.
I Also: we think that with 5 percent likelihood, the true value will fall outside the
confidence interval.
I Confidence interval - symmetric range around the estimated value of the statistic
in our dataset.
I Get the estimated value.
I Define the probability (the confidence level).
I Calculate the CI with the use of the SE.
I The 95 percent CI is the ±1.96 SE interval around the estimate from the data
(in practice we use ±2 SE).
I The 90 percent CI is the ±1.6 SE interval; the 99 percent CI is the ±2.6 SE interval.
I This means that in the general pattern represented by the 11-year history of
returns in our data, we can be 95 percent confident that daily losses of more than
5 percent occur with a 0.2 to 0.8 percent chance (see the sketch below).
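A sketch of this calculation with the textbook SE formula for a proportion, SE = √(p̂(1−p̂)/n). The sample size below (roughly 2,500 trading days over 11 years) is an assumption for illustration:

    import math

    p_hat = 0.005  # observed share of days with a 5%+ loss (0.5 percent)
    n = 2519       # approximate number of trading days in 2006-2016 (assumed)

    # Standard error of a proportion: SE = sqrt(p*(1-p)/n)
    se = math.sqrt(p_hat * (1 - p_hat) / n)

    # 95 percent CI: estimate +/- 2*SE (the rounded version of +/- 1.96*SE)
    ci_low, ci_high = p_hat - 2 * se, p_hat + 2 * se
    print(f"SE = {se:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
    # roughly [0.002, 0.008], i.e. 0.2 to 0.8 percent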
The bootstrap
I Bootstrap is a method to create synthetic samples that are similar to, but
different from, the original.
I A method that is very useful in general.
I It is essential for many advanced statistical applications such as machine learning.
I More in Chapter 05
The bootstrap
I The bootstrap method takes the original dataset and draws many repeated
samples of the size of that dataset.
I The trick is that the samples are drawn with replacement.
I The observations are drawn randomly one by one from the original dataset; once
an observation is drawn it is “replaced” into the pool so that it can be drawn again,
with the same probability as any other observation.
I The drawing stops when it reaches the size of the original dataset.
I The result is a sample of the same size as the original dataset, yielding a single
bootstrap sample (a minimal illustration follows).
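A minimal, self-contained illustration of drawing one bootstrap sample; the toy data array is made up for this sketch:

    import numpy as np

    rng = np.random.default_rng(seed=1)
    data = np.array([1.2, -0.4, 0.8, -5.3, 0.1])  # toy "original dataset"

    # One bootstrap sample: same size as the original, drawn WITH replacement,
    # so some observations can appear several times and others not at all.
    boot_sample = rng.choice(data, size=len(data), replace=True)
    print(boot_sample)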
The bootstrap
I We have a dataset (the sample); we can compute a statistic (e.g., the mean).
I Create many bootstrap samples, and get a mean value for each sample.
I Bootstrap estimate of SE = standard deviation of the statistic across the
bootstrap samples’ estimates.
The bootstrap SE
I The bootstrap method creates many repeated samples that are different from each
other, but each has the same size as the original dataset.
I Bootstrap gives a good approximation of the standard error, too.
I The bootstrap estimate (or the estimate from the bootstrap method) of the
standard error is simply the standard deviation of the statistic across the bootstrap
samples.
I This means that in the general pattern represented by the 11-year history of
returns in our data, we can be 95 percent confident that daily losses of more than
5 percent occur with a 0.22 to 0.78 percent chance.
I The SE formula and the bootstrap gave essentially the same answer (see the
sketch after this list).
I Under some conditions, this is what we expect:
I Large enough sample size
I Observations independent
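A minimal sketch of the bootstrap SE computation, assuming the same (hypothetical) file and column names as in the earlier simulation sketch:

    import numpy as np
    import pandas as pd

    def bootstrap_se(data, statistic, n_boot=10_000, seed=42):
        """Bootstrap SE: the standard deviation of the statistic across
        full-size samples drawn with replacement from the data."""
        rng = np.random.default_rng(seed)
        estimates = np.empty(n_boot)
        for i in range(n_boot):
            boot = rng.choice(data, size=len(data), replace=True)
            estimates[i] = statistic(boot)
        return estimates.std()

    # Assumed file and column names, as in the earlier sketch.
    returns = pd.read_csv("SP500_2006_16_daily.csv")["pct_return"].to_numpy()
    se = bootstrap_se(returns, lambda x: np.mean(x <= -5.0))
    print(f"bootstrap SE = {se:.4f}")  # close to the formula-based SE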
External validity
I Time: we have data on the past, but we care about the future.
I Space: our data is on one country, but we are interested in how the pattern would
hold elsewhere in the world.
I Sub-groups: our data is on 25-30 year old people. Would the pattern hold for
younger or older people?
External validity
I Daily 5%+ loss probability: 95 percent CI [0.2, 0.8] in our sample. This captures
uncertainty for samples like ours.
I It applies if the next year will be like the past 11 years in terms of the general
pattern that determines returns on our investment portfolio.
I However, external validity may not be high: we cannot be sure what the future holds.
I Our data: the 2006-2016 dataset includes the financial crisis and Great Recession of
2008-2009. It does not include the dotcom boom and bust of 2000-2001. We have
no way to know which crisis is representative of future crises to come.
I Hence, the real CI is likely to be substantially wider.
Generalization - Summary