LECTURE 2 - Introduction to Nonparametric Methods
Administrative Notes
1) SPSS can have glitches. If you experience difficulty reading in data: (1) make sure your original file is closed; (2) restart SPSS; (3) restart your computer.
1) Prediction:
Predicted CancerMortality = 114.72 + 9.23 × Radiation
Predicted CancerMortality = 114.72 + 9.23 × 5
Predicted CancerMortality = 160.87
2) Interpretation:
Estimates β̂i's
General
β̂0 = intercept = average value of Y when all the X's = 0
β̂j = "slopes" associated with predictor Xj; a 1-unit change (usually an increase) in Xj is associated with a β̂j change in Y
In the context of our radiation example.
β̂0 or b0 (intercept): 114.72; the average cancer mortality rate is 114.72 deaths per 100,000 people when there is no radiation exposure.
β̂1 or b1 (slope): 9.23; for every one-point increase in radiation exposure, the cancer mortality rate increases by 9.23 deaths per 100,000 people.
Confidence Intervals for the β̂i's
β̂0: 95% CI [95.69, 133.74]
On average, in areas where there is no radiation exposure, the cancer mortality rate is between 96 and 134 deaths per 100,000 people. Since this entire interval is above 0, we are (95%) confident that there is cancer mortality even without radiation exposure.
β̂1: 95% CI [5.88, 12.59]
On average, every extra point of radiation exposure is associated with an increase in mortality rate of between 5.88 and 12.59 deaths per 100,000. Again, the whole CI is above 0 => positive relationship between radiation and mortality rate.
SPSS Output
Parameter Estimates (Dependent Variable: CancerMortality)

Parameter    B         Std. Error   t        Sig.     95% CI Lower   95% CI Upper
Intercept    114.716   8.046        14.258   0.000    95.691         133.741
Radiation    9.231     1.419        6.507    0.000    5.877          12.586
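For illustration, here is a minimal Python sketch of the same simple regression fit; the file name radiation.csv and the column names are hypothetical stand-ins for whatever the actual data set contains.

    # Minimal sketch: the same simple linear regression in Python (statsmodels).
    # The file name and column names below are hypothetical placeholders.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("radiation.csv")                  # hypothetical data file
    fit = smf.ols("CancerMortality ~ Radiation", data=df).fit()

    print(fit.params)          # b0 (intercept) and b1 (slope for Radiation)
    print(fit.conf_int(0.05))  # 95% confidence intervals for the coefficients
    print(fit.summary())       # table analogous to SPSS "Parameter Estimates"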
Assumptions:
- Y is continuous
- Average relationship between Y and X’s is linear
- X’s are on the correct scale
- Errors (ε's) are:
o Mean zero
o Independent
o Normally distributed
o Have constant variance (same degree of certainty in the measurements for each subject / set of X values)
(Like having a random sample from a common distribution.)
This class is about what we do when one or more of the OLS assumptions is violated.
Thus far, we have dealt with independent observations. But what happens when the observations are NOT independent of each other?
Example: Effect of COVID quarantine and Exercise Program
X=weight before COVID quarantine
Y=weight after COVID quarantine
Z=X-Y= amount of weight lost (note: Z<0 indicates weight gained)
Paired t-test setting
Non-parametric Methods
Nonparametric/semi-parametric methods: use these techniques when you do not know or don't wish to make assumptions about the distribution of
- The data: i.e. the X or Y variables or the errors
- The parameter estimates (e.g. the β̂'s)
There are many such techniques:
- Classical methods: principally based on ranks; examples include Spearman rank
correlation, Wilcoxon rank-sum, Wilcoxon signed-rank tests, the Kruskal-Wallis test, etc.
o These generalize the Pearson correlation, two-sample t-test, paired t-test, and ANOVA, respectively
- Modern methods: smoothing, permutation tests, the bootstrap and other simulation or
resampling methods.
Idea is to let the data tell you about the distribution or nature of the relationships. These
techniques make fewer (but not NO) assumptions.
- Want to do standard tests or fit models making as few assumptions as possible about
distribution of the estimator for parameter of interest.
Classical Approaches: Mostly based on ranks. The basic procedure is as follows:
1. Order (Rank) the values in your data set from smallest to largest
2. Replace the original values by their ranks
3. Ties get the average of the ranks
4. Run the analysis you wanted to do on the ranks
Key: Distribution of ranks is “known”
- it’s uniform on 1,2,….,n
- so we can easily get the distribution of our test statistic
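To make steps 1-3 concrete, here is a minimal Python sketch (assuming scipy is available; the data values are made up):

    # Rank the values from smallest to largest; ties get the average of their ranks.
    import numpy as np
    from scipy.stats import rankdata

    x = np.array([3.1, 7.4, 7.4, 2.0, 9.8])       # made-up data
    ranks = rankdata(x, method="average")
    print(ranks)                                  # [2.  3.5 3.5 1.  5. ]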
Advantages of Non-parametric Methods:
• Easy: can be used with most standard methods
• Know that you are using the “right” distribution
• Works no matter what underlying distribution
• Robust: even if you have an extreme outlier, at worst its rank is 1 or n
Disadvantages of Non-parametric Methods:
• Lose lots of information when we throw away original values
• Non-parametric methods tend to be less powerful than the corresponding parametric
method would be if its assumptions were correct
Non-parametric version: the Spearman rank correlation
- Rank X’s and Y’s from 1 to n (preserve the pairs)
- Calculate Pearson correlation of the ranks
- Answer is still between -1 to 1 and the interpretation is largely the same BUT the
Spearman correlation doesn’t measure linear relationships
- Spearman correlation measures whether X and Y have the same ordering
(monotonicity; monotonic relationships)
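A minimal Python sketch (made-up data, assuming scipy) showing that the Spearman correlation is just the Pearson correlation computed on the ranks:

    # Spearman correlation = Pearson correlation of the ranks (pairs preserved).
    import numpy as np
    from scipy.stats import pearsonr, spearmanr, rankdata

    x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])      # made-up data with one extreme value
    y = np.array([2.1, 3.9, 9.0, 15.5, 400.0])    # increases monotonically with x

    r_pearson = pearsonr(x, y)[0]                         # pulled around by the extreme pair
    r_spearman = spearmanr(x, y)[0]                       # measures monotonic association
    r_by_hand = pearsonr(rankdata(x), rankdata(y))[0]     # identical to r_spearman

    print(r_pearson, r_spearman, r_by_hand)               # r_spearman = r_by_hand = 1.0 here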
When would we use Spearman correlation?
- Distributions of X and Y are very non-normal
- Sample size is small or there are extreme outliers
- If you want the strength of a non-linear (but monotonic) relationship
- If data are ordinal, but not interval scaled
How do we choose between Pearson and Spearman correlations?
- The two methods can produce quite similar answers (e.g. when the data match the
parametric assumptions of normality); in this case, you feel very confident about the
answer and it doesn’t matter which you use
- They can produce quite different answers (e.g. relationship is non-linear; there are
outliers); how big the difference depends on the exact data pattern. In this case, you
need to try to identify what caused the difference and decide which is more
appropriate based on context/analytic goals. This is an art rather than a rigid rule.
Bottom line: if in doubt, calculate both and compare the results.
Figure A: The Spearman correlation captures a perfect non-linear (monotonic) relationship without having to figure out a transformation.
Possible shuffles: the ranks of X are fixed at 1, 2, 3. Under H0, all 6 possible orders of the Y ranks are equally likely, and the 6 resulting correlations of the X and Y ranks give the null distribution of the Spearman correlation:

Y ranks   Spearman correlation
1 2 3     1 (observed value)
1 3 2     0.5
2 1 3     0.5
2 3 1     -0.5
3 1 2     -0.5
3 2 1     -1
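A minimal Python sketch that reproduces this null distribution by brute force (for larger n the same idea applies, just with n! possible orders):

    # Exact null distribution of the Spearman correlation for n = 3:
    # enumerate all 3! = 6 equally likely orderings of the Y ranks.
    from itertools import permutations
    import numpy as np
    from scipy.stats import pearsonr

    x_ranks = np.array([1, 2, 3])
    null_values = [round(pearsonr(x_ranks, np.array(p))[0], 2)
                   for p in permutations([1, 2, 3])]

    print(sorted(null_values))     # [-1.0, -0.5, -0.5, 0.5, 0.5, 1.0]
    # One-sided p-value for the observed value of 1: P(correlation >= 1 under H0) = 1/6
    print(sum(v >= 1.0 for v in null_values) / len(null_values))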
Rank Analogues for t-tests and ANOVA
Paired t-test setting
Example: Effect of Diet and Exercise Program
X=weight before diet/exercise program
Y=weight after the program
Z=X-Y= amount of weight lost (note: Z<0 indicates weight gained)
What I want to know is whether the program helps people lose weight. How do we define
“helping”?
Scenario 2: We could ask instead whether people on the program are more likely to lose weight
than not to lose weight.
Let P= P(Z>0)= probability of losing weight
Then we can write our hypotheses
H0: P ≤ ½ -people are no more likely to have lost (than gained wt.); half or fewer lose weight
HA: P > ½ -more than half of people lose weight
You can think of this as asking whether median weight loss is above zero.
How do we actually perform the hypothesis test?
Sign Test: Focuses on H0: P ≤ ½ versus HA: P > ½
- Throws away all info except whether or not the person lost weight. Test statistic is
the number or fraction of people in data set who lost weight.
In general, suppose we have a sample of size n and let n* be the number of subjects who have a difference (either + or -) between their two values (i.e. a non-zero weight change). We then calculate the number/fraction of these with a positive value.
Why n*? The convention is that pairs with no difference tell you nothing about the direction of the effect. People with ties (the same value before and after) tell you nothing about the direction of change and are usually dropped.
- However, we can argue in our context that weight losses of zero are "failures" and count them as "non-losses". The clinician/researcher needs to decide!
- Under H0 (using P = ½ as the boundary condition), s+, the number of pairs with a positive difference, has a binomial distribution with parameters n* and p = ½ (like flipping a fair coin)
The p-value is the probability of seeing data as or more extreme than what you observed (more favorable to HA), assuming H0 is true.
In weight loss example, p-value is the probability that “this” many people or more in our
sample would have lost weight if the program didn’t work.
Example: n= 9 people
n* = 8 people had weight change
s+ = 7 of these 8 lost weight
H0: P ≤ ½ versus HA: P > ½ recall p=probability of losing weight
p-value=probability that at least 7 out of 8 people would lose weight if program did not work
p-value = P(s+ ≥ 7 | p = 1/2, n* = 8) = P(s+ = 7) + P(s+ = 8)

In general, P(s+ = k) = (n* choose k) p^k (1 - p)^(n* - k), where (n choose k) = n!/(k!(n - k)!) and n! = n ∙ (n-1) ∙ … ∙ 3 ∙ 2 ∙ 1

In our example, n* = 8 and p = ½ under H0, so
p-value = P(s+ = 7) + P(s+ = 8) = (8 choose 7)(½)^7(½)^1 + (8 choose 8)(½)^8(½)^0 = .035
In general, let the computer do this. The conclusion for the 1-sided test using α = .05 is that we reject H0 (p-value = 0.035) and conclude that people on the diet program do "lose weight", i.e. they are more likely to have a lower weight after the program than before.
Two-sided test would be:
p-value= P (s+ = 0) + P (s+ = 1) + P (s+= 7) + P (s+= 8) = .07 (by symmetry) and we’d fail to
reject at .05
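A minimal Python sketch of the exact binomial calculation for this example (assuming scipy):

    # Exact sign-test p-values for the weight-loss example: n* = 8, s+ = 7, P = 1/2 under H0.
    from scipy.stats import binom

    n_star = 8
    p_one_sided = binom.pmf(7, n_star, 0.5) + binom.pmf(8, n_star, 0.5)   # P(s+ >= 7)
    p_two_sided = 2 * p_one_sided          # add the equally extreme low tail, P(0) + P(1)

    print(round(p_one_sided, 3), round(p_two_sided, 3))                   # 0.035 0.07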
A problem with the sign test is that it has rather low power because it throws away all info about
the actual values (even ranks) and we really care about magnitude of the changes in our example.
Wilcoxon signed-rank test:
- same set up as the sign test but technically instead of testing whether the median
difference is zero (i.e. P=1/2), Wilcoxon signed-rank tests whether the difference scores
are symmetric about zero. It takes into account the magnitude as well as the signs but
only using ranks, not the original values to avoid distributional assumptions.
Procedure:
1. take differences: Z=X-Y
2. Take |Z|, absolute value of the differences (i.e. size but not direction) and rank them from
smallest to largest
3. Our test statistic W+ = sum of the ranks for the subjects with positive changes (of course
W- works just as well)
High W+ means either
(i) lots of positive differences (most people lost weight)
(ii) All the people with the highest ranks had positive changes (biggest weight changes
were losses)
a. High W- or low W+ means opposite
b. W- ≈ W+ suggests there was no change in either direction
We need to get a p-value associated with our test statistic W+ (or W-).
To get the distribution of W+ under H0, you (or better, the computer!) write down all the possible assignments of signs to the ranks (i.e. all possible values of W+) for your sample size, order them, and look at where your observed value falls on the list.
- Why are we doing this? Under H0, each rank (subject) is equally likely to be a loss or a gain, so all sign/rank combinations are equally likely. We calculate the rank sum for each combination, and the proportion of combinations giving each result is the probability associated with that value of W+.
Note: Need n ≥ 5 or there is no hope of significance
- May have ties: use average ranks, but this makes calculating the W+ distribution even harder
- It turns out there's also a normal approximation for the distribution of W+ if n is large enough (n > 20).
Example: Weight data W+ = 30 (7 + 2 + 4.5 + 8 + 1 + 4.5 + 3); W- = 6
This is obviously tilted in favor of weight loss, but because the one person who gained weight had a fairly high rank and the sample is small, the p-value = .0508 (1-sided; look up via table or computer; p = 0.10 for the two-sided test), so we just fail to reject H0.
ID   Weight Before (X)   Weight After (Y)   Change (X-Y)   Sign   |X-Y|   Rank of |X-Y|   Signed Rank
1    125                 110                15             +      15      7               7
2    115                 112                3              +      3       2               2
3    130                 125                5              +      5       4.5             4.5
4    140                 140                0              (no change; dropped)
5    115                 124                -9             -      9       6               -6
6    140                 123                17             +      17      8               8
7    125                 123                2              +      2       1               1
8    140                 135                5              +      5       4.5             4.5
9    135                 131                4              +      4       3               3
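A minimal Python sketch of the signed-rank calculation for the weight data above; the W+ computation mirrors the table, and the scipy call at the end is included for comparison (its exact p-value behavior depends on the version and arguments).

    # Wilcoxon signed-rank statistic for the weight data, computed from ranks.
    import numpy as np
    from scipy.stats import rankdata, wilcoxon

    before = np.array([125, 115, 130, 140, 115, 140, 125, 140, 135])
    after  = np.array([110, 112, 125, 140, 124, 123, 123, 135, 131])
    z = before - after                       # weight lost; subject 4 has a zero change

    z_nz = z[z != 0]                         # drop the zero difference, leaving n* = 8
    ranks = rankdata(np.abs(z_nz))           # rank |differences|; ties get average ranks
    w_plus = ranks[z_nz > 0].sum()
    print(w_plus)                            # 30.0

    print(wilcoxon(z, alternative="greater"))   # scipy drops zeros by default and reports a p-value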
Wilcoxon Rank Sum Test (two independent groups)
Under H0 we're assuming the x's and y's are exchangeable (i.e. the group labels don't really matter; there is no systematic difference between the groups)
Most general way to write this mathematically: if X is a randomly selected member of group 1
and Y is a randomly selected member of group 2 then
H0: P (X ≥ Y) = P (Y ≥ X) = ½
HA: P (X ≥ Y) = P (Y ≥ X) ≠ ½ for two-sided test
This is really asking whether X is consistently/systematically bigger than Y, or vice versa; that is, whether one of the groups tends to have higher values than the other group.
If the distribution of the Y’s (group 2) has the same shape as the distribution of the X’s (Figure A
and B) but is just shifted higher or lower by a fixed amount (location shift) then the above is
equivalent to a comparison of group medians (or means). Hence, the Wilcoxon test is often
stated as a test of medians.
If you further assume that the distributions are symmetric, then it’s the usual test of means since
mean=median.
[Figure: two identically shaped, symmetric distributions with centers µ1 and µ2 (mean = median in each), separated by a location shift]
Figure A: There is a location shift, and distributions are symmetrical. Testing means, medians or
distributions are equivalent hypotheses.
Note: if the distributions are symmetric but not identical shape then mean/median test is NOT the
same as testing of distributions.
Figure B: There is a location shift, but the distributions are not symmetric. Within each group the mean is NOT the same as the median, but testing the means is the same as testing the medians.
Figure C: The shapes of the distributions are different (no symmetry or location shift); all the tests are different; Y is systematically higher than X, but the means or medians could still be equal.
Wilcoxon Rank Sum Test focuses on:
- Put all the observations (both groups: X and Y) together and rank them from smallest to
largest; Ties get average rank
- Calculate the sum of the ranks for group 1 (R1) or group 2 (R2); it doesn't matter which one; this is the test statistic.
- Idea is that under H0 the sums of ranks for 2 groups should be about the same (Any value
should be equally likely to come from either group.) If the groups are different sizes, it's the
average ranks that should be the same. The way this is presented on computer output is the
“expected” rank sum for each group under H0 adjusting for sample sizes.
- To figure out the p-value associated with R1 or R2, we need to write down all the possible ways the ranks for the n1 + n2 total subjects could have been split into groups of size n1 and n2, calculate the associated R1 and R2 for each split, and see where our observed values fall.
Example: 4 people in a Phase II clinical trial (evaluating safety and a little efficacy); 2 get assigned the new treatment and 2 get control/placebo.
The outcome measures how well people respond; assume high values = good.
Original values:
Control: 60, 62
New Treatment: 71, 80
Let's focus on TX group: How many ways are there for the ranks to be allocated?
(4 choose 2) = 6 ways
Ranks of Treatment   Sum of treatment ranks (Rtx)
1,2                   3
1,3                   4
1,4                   5
2,3                   5
2,4                   6
3,4                   7 (observed)
If there’s no treatment effect, the various rank values (1,2,3,4) are equally likely to be in either
group
P(Rtx = 3) = 1/6
P(Rtx = 4) = 1/6
P(Rtx = 5) = 2/6 = 1/3
P(Rtx = 6) = 1/6
P(Rtx = 7) = 1/6
p-value = P(the tx group would have done at least this well in our sample if tx didn't work)
= P(Rtx ≥ 7) = P(Rtx = 7)
= 1/6
= 0.167
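A minimal Python sketch of this exact calculation, enumerating all (4 choose 2) = 6 possible rank allocations:

    # Exact null distribution of the treatment rank sum for the 4-person trial.
    from itertools import combinations

    control, treatment = [60, 62], [71, 80]
    pooled = sorted(control + treatment)
    rank_of = {v: i + 1 for i, v in enumerate(pooled)}       # no ties here

    observed = sum(rank_of[v] for v in treatment)            # 3 + 4 = 7

    null_sums = [sum(c) for c in combinations([1, 2, 3, 4], 2)]
    print(sorted(null_sums))                                 # [3, 4, 5, 5, 6, 7]

    p_value = sum(s >= observed for s in null_sums) / len(null_sums)
    print(p_value)                                           # 0.1667 (1/6)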
Kruskal-Wallis Test
- non-parametric analogue of ANOVA; it allows you to compare whether values of an outcome are systematically different (larger versus smaller) across 3 or more groups
- it's a straightforward extension of the Wilcoxon rank sum test.
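A minimal usage sketch (made-up data, assuming scipy):

    # Kruskal-Wallis test comparing one outcome across 3 groups (made-up data).
    from scipy.stats import kruskal

    group_a = [12, 15, 14, 10]
    group_b = [22, 25, 19, 30]
    group_c = [11, 13, 16, 12]

    stat, p = kruskal(group_a, group_b, group_c)
    print(stat, p)      # a small p suggests at least one group is systematically shifted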
Simulation Based Non-Parametric Tests
- Classical rank tests depend on the idea that the ranks are uniformly distributed and under
the null hypothesis, H0, are “randomly distributed” (equally likely to occur in any
particular combination)
- This means that you can get the null distribution of the test statistic for a rank based
method by simply writing down all the possible combinations of ranks and calculating
the corresponding test statistic values. This is a special case of something called a
permutation test.
Permutation Test
- A permutation test tries to approximate the null distribution of your test statistic without
making assumptions about the underlying data distribution or invoking a theoretical
argument.
- In essence, we let the data tell us what the null distribution looks like;
- Instead of using a Z, t, F, or χ2 table, we “simulate” a data-set specific table.
- In a permutation test, H0 is generally that there is “no relationship” between two
constructs of interest and HA is that there is a relationship.
- The key idea in creating the null distribution of the test statistic is to "break" the relationship that may (or may not) be present in the data by shuffling or permuting the values of one of the variables.
Procedure
Step 1: Pick your hypotheses. Usually H0 is that there is "no relationship" between two key variables and HA is that there is a relationship, but you need to be very clear about which variables you are relating!
Step 2: Pick a test statistic. This should be an intuitive measure of how well your data match H0 vs HA. It can be a traditional statistic (e.g. sample correlation, t-score, etc.) or something else (e.g. difference in medians, rate ratio, etc.). Calculate the test statistic for the observed data.
Step 3: Get the distribution of your test statistic under H0 by picking an appropriate way to reshuffle your data to break the relationship, and calculate the test statistic for each of the reshuffled (simulated) data sets. Order the resulting test statistic values.
Step 4: To get the p-value, find where your observed test statistic from Step 2 falls among the ordered values from Step 3 (i.e. its percentile).
Example: Permutation Test for Group Difference
Outcome: # of papers published last year by assistant professors in
(1) Math Dept: 1, 2, 6      x̄1 = 3, med1 = 2
(2) Biostat Dept: 4, 9, 11   x̄2 = 8, med2 = 9
Goal: We want to see who publishes “more” which could mean on average (means), in terms of
the center of the distribution (median), whether a randomly selected biostatistician usually has a
higher value than a randomly selected mathematician (distribution), etc.
- We want to compare groups say by means or medians.
- We will look at permutation tests based on both the difference in means and the
difference in medians.
o For the means, we could use a t-test, but the n's are small and there appears to be one unusual point in each group, so the usual t-statistic may not have a t distribution and we don't know the distribution of the sample mean.
o The distribution of the difference in medians is not so simple (sadly…) and in any case depends on the underlying data distribution, which we don't want to assume.
Test statistics
(a) Means: the traditional choice is the t-statistic
t = (8 - 3) / (3.16 × sqrt(1/3 + 1/3)) = 1.93
(3.16 is the pooled standard deviation of the two groups)
What do we shuffle? Here the prospective relationship is between # of papers and department
membership so we permute the group labels to break the relationship between number of
publications and field of study.
This amounts to choosing which 3 people get the “biostat” label. How many are there?
n = 6 profs; need k = 3 in each group
(6 choose 3) = 6!/(3!·3!) = (6∙5∙4∙3∙2∙1)/((3∙2∙1)∙(3∙2∙1)) = 20
[Table: the 20 possible reassignments of the Math/Biostat labels to the six publication counts (1, 2, 6, 4, 9, 11), each with its resulting test statistics]
The figure shows histograms of all possible permutation values for the test statistics.
Our t-score was the second highest (1-sided p-value=2/20 = 0.1) and the median difference was
tied for the highest (1-sided p-value=2/20 = 0.1). Not good enough to establish a significant
difference in publication rates between biostatistics and math (bummer!)
- In this case, it was possible to list out all the possible permutations of the data but if the
sample sizes are larger, this will not be feasible even with a computer. Ordinarily we
would just do a random subset of permutations, say P=1000 and get an estimate.
- Aside: Why use the t-statistic in the first part of this example even though the t-test assumptions are unmet?
o The t-statistic still gives us an intuitively reasonable measure of whether the average publication levels differ, standardized by the variability; we just don't want to assume we can use a t-table to calculate the corresponding p-value.
- The smaller the α you want to use, the larger the # of permutations you need to get a good approximation of the tail probabilities.
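Here is a minimal Python sketch of the random-shuffle version of this permutation test, applied to the publications example with the difference in group means as the test statistic (with only 20 possible label assignments we could enumerate them all instead):

    # Permutation test for the publications example via random shuffles of the labels.
    import numpy as np

    rng = np.random.default_rng(0)
    papers = np.array([1, 2, 6, 4, 9, 11])       # all six professors
    labels = np.array([0, 0, 0, 1, 1, 1])        # 0 = Math, 1 = Biostat

    def mean_diff(values, labs):
        return values[labs == 1].mean() - values[labs == 0].mean()

    observed = mean_diff(papers, labels)         # 8 - 3 = 5

    n_perm = 10_000
    null = np.array([mean_diff(papers, rng.permutation(labels)) for _ in range(n_perm)])

    p_value = np.mean(null >= observed)          # one-sided p-value
    print(observed, p_value)                     # p should be close to the exact 2/20 = 0.1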
The Bootstrap
-Permutation procedures try to generate the null distribution of a statistic without making many
assumptions
-There are other inferential situations when you might want the "actual" distribution under the "true" parameter value rather than the null distribution; e.g. to get a confidence interval, we're really interested in the distribution under HA
-The bootstrap is a simulation procedure for looking at the "actual" distribution of your parameter estimate or test statistic without making assumptions about either the distribution of the data or of the estimator/test statistic, other than that we have a representative sample.
When do we use the bootstrap?
-If the distribution of estimator/statistic is unknown/hard to derive analytically
-Use it even for standard statistics (x̄, s, β̂) where the usual assumptions may be dubious (non-normality, outliers, small n)
In general, suppose we're interested in a population parameter θ, and we estimate it using a statistic, θ̂, based on a sample of size n: {x1, x2, …, xn}
Question: How can we estimate the distribution (i.e. probabilities, standard error, etc.) of θ̂? We need this to do inference.
-Key idea is to mimic the relationship between the sample we observed and the population.
-Ideally, to get the distribution of θ̂ based on a sample of size n, we would draw lots of samples of size n from the population, calculate θ̂ for each of them, and look at the resulting distribution (e.g. a histogram)
Problem: We can’t afford to do this and even if we could, we’d just want to combine points into
one big sample to get better estimates!
Solution: Bootstrap approach relies on the idea that my sample is “like” or is representative of
the population (standard basis for statistical inference)
- so…. taking samples from my original sample should be “like” sampling from the
original population. But I can take as many “resamples” as I want from my original data
set!
- The bootstrap samples need to be the same size as the original sample so that the parameter estimates θ̂* behave the same way as the original θ̂.
- This means that we need to sample with replacement or we won’t get any variation. Some
of the original values will appear more than once in a given bootstrap sample and others
will not appear at all.
- Because the original sample is smaller than the full population, resampling from it gives a coarser estimate of the distribution of θ̂; the smaller the sample, the worse this problem is.
- Bootstrap does NOT save you from small sample sizes or “create” new data – it just
makes maximum use of the data you have
- We can use the bootstrap parameter values, the θ̂*'s, to estimate anything we want about the distribution of θ̂.
Conceptual Picture
Procedure:
1. Take the original sample of size n and calculate the estimator/test statistic of interest, θ̂.
2. Obtain B bootstrap samples of size n with replacement from the original sample (some values occur multiple times while others are left out of any given resample.)
3. For each bootstrap sample, compute the statistic of interest to get θ̂*1, …, θ̂*B
4. Use the θ̂*'s to learn anything you want about the distribution of θ̂
(a) Make a histogram of the θ̂*'s to get the shape of the distribution
(b) You can order the θ̂*'s to get percentiles/probabilities associated with the distribution of θ̂
(c) You can calculate the standard deviation of the θ̂*'s to get a standard error estimate for θ̂, i.e. to estimate the uncertainty of θ̂ and calculate confidence intervals. This is especially useful if you do not have a formula for the standard error of θ̂; you can simply estimate it by the standard deviation of the θ̂*'s
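A minimal generic sketch of steps 2-4 and use (c); here the statistic is the sample median, but any function of the data could be plugged in (the data values are made up):

    # Generic bootstrap loop: resample WITH replacement B times, same size n each time,
    # recompute the statistic, and use the spread of the resulting theta-hat-star values.
    import numpy as np

    rng = np.random.default_rng(1)

    def bootstrap_distribution(data, statistic, B=1000):
        data = np.asarray(data)
        n = len(data)
        return np.array([statistic(rng.choice(data, size=n, replace=True))
                         for _ in range(B)])

    sample = np.array([3.2, 1.0, 4.8, 2.2, 7.5, 0.4, 5.1, 2.9])    # made-up data
    boot_medians = bootstrap_distribution(sample, np.median, B=1000)

    print(np.median(sample))           # theta-hat from the original sample
    print(boot_medians.std(ddof=1))    # bootstrap standard error estimate (use (c) above)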
The bootstrap helps you do inference about θ (which is fixed but unknown) by describing the behavior of θ̂, which is supposed to be a good estimate of θ.
It works in most (though not all) situations. It is totally non-parametric but it doesn’t solve all
problems. It does not solve having a small sample, no new data are created and it does not help
you if original sample was bad.
One of the major uses of the bootstrap is to get uncertainty estimates/CIs for "difficult" parameter estimates (i.e. ones whose distribution we don't know or don't want to assume).
You could take the bootstrap estimate of the standard error (i.e. the standard deviation of θ̂*1, …, θ̂*B from the bootstrap samples) and create a standard confidence interval of the form θ̂ ± 2 × (bootstrap standard error).
But….. this "2" implicitly assumes that θ̂ is normally distributed! (so 95% of values lie w/in 2 SDs).
If our situation is that θ̂ IS normal and we just didn't have a formula for the standard error, this would be ok. If θ̂ is not normal we can do the following instead:
Bootstrap Percentile Confidence Interval
(1) Take your bootstrap estimates, θ̂*1, …, θ̂*B, and order them from smallest to largest
(2) Identify the desired confidence level 1 - α
α ↔ 100(1 - α)% CI
.05 ↔ 95% CI
and get the α/2 and 1 - α/2 percentiles of your set of θ̂*'s and use these as the lower and upper bounds of your CI. For a 95% CI, we take the 2.5% and 97.5% values, e.g. if we have B = 200 bootstrap samples these are the 5th and 195th values of the ordered θ̂*'s
Notes: You can do this for any α and any B. If you use a small α you need a bigger number of bootstrap samples, B, to get good estimates of the edges of the interval.
How many bootstrap samples do we need? Rough rules of thumb:
-B = 50-200 is pretty good for getting a standard error estimate
-B = 500-2000 is usually good for a CI
-You can always check by running it a few times and seeing if the estimate/CI changes
-The more complicated the distribution of θ̂, or the smaller α is, the more samples you need.
-Extra bootstrap samples are almost free; it never hurts to do too many.
-In theory you can get θ̂* for all possible bootstrap samples and get the "ideal" bootstrap distribution. However, there are n^n possible samples, which makes even a computer choke.
-In practice you pick a random selection from the possible samples, which gives an unbiased approximation to the ideal value.
-The bootstrap can be used for things besides s.e.'s/CIs. In particular, you can estimate the bias (systematic error) in your parameter estimate and correct for it. Along with this you can get "bias-corrected bootstrap CIs". (The formula is a bit messy/technical so we're not going to worry about the hard calculations.)
Empirical Bootstrap Example
Empirical just means we resample from our original data.
Variable x; sample of size n=4
values: 0,2,4,10
Suppose we’re interested in the mean and median of x.
The sample mean x̄ might not be normal (n is small, and there is one "unusual" value of 10; we can't really tell if this is an outlier given the sample size). For the median, we don't know the distribution in general.
The sample values are x̄ = 4, m = 3
Let’s get bootstrap CIs:
There are 4^4 = 256 possible bootstrap samples (counting order)
Bootstrap sample   mean   median
0,0,0,0            0      0
0,0,0,2            .5     0
0,0,2,0            .5     0
… (and so on for all 256 samples)
I generated them all. The bootstrap means look fairly normal even with n = 4, but the distribution of the sample median is skewed.
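A minimal Python sketch of this enumeration for x = (0, 2, 4, 10), including 95% percentile intervals for the mean and the median:

    # "Ideal" empirical bootstrap for x = (0, 2, 4, 10): enumerate all 4^4 = 256 resamples.
    from itertools import product
    import numpy as np

    x = np.array([0, 2, 4, 10])
    samples = np.array(list(product(x, repeat=len(x))))      # all 256 resamples, counting order

    boot_means = samples.mean(axis=1)
    boot_medians = np.median(samples, axis=1)

    print(np.percentile(boot_means, [2.5, 97.5]))             # 95% percentile CI for the mean
    print(np.percentile(boot_medians, [2.5, 97.5]))           # 95% percentile CI for the median
    # With larger n we would draw B random resamples instead of enumerating all n^n of them.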