Stats 101 - Class 02
Bryon Aragam
Chicago Booth
booth41000@gmail.com
https://canvas.uchicago.edu/courses/43775/
Suggested Readings:
Naked Statistics, Chapters 7, 8, 9 and 10
OpenIntro Statistics, Chapters 5, 6, and 7
1 / 109
Example
2 / 109
Flashback
3 / 109
The statistical process
4 / 109
The statistical process
Let’s try to make this more formal. Simplifying a bit, let’s pretend
that the winner of the election is decided by the popular vote.
▶ The problem: Who will win the election?
▶ The answer: Who has the highest share of the vote?
▶ The target parameter: True proportion of voters for each
candidate.
▶ The data: A random sample of the population.
▶ The statistic: The sample proportion.
Problem → statistics.
5 / 109
Two key concepts
6 / 109
What is a parameter?
Intuitively, a parameter is any quantity whose value we do not
know, and the target parameter is any quantity whose value we
do not know but we want to know. (Why do we want to know?
Think about the statistical process.)
Examples:
▶ True proportion of voters for each candidate
▶ Future value of an investment
▶ Average price of competitors’ products
▶ Proportion of Americans with COVID-19 (or vaccinated, or
tested, etc.)
A statistic, by contrast, is any quantity computed from the data.
Examples:
▶ Sample mean
▶ Sample proportion
▶ Maximum or minimum
▶ Range (max − min)
▶ Lots of others...
Even though you only ever see one dataset, the process that
generated it (i.e. random sampling) was random, so your data is
always random.
The data:
▶ Firm A: Biden 51%, Trump 49% (n = 1000)
▶ Firm B: Biden 49%, Trump 51% (n = 1000)
11 / 109
Moving forward
The ideas apply to any statistic, any dataset, and any assumptions
on the data (i.e. normality is NOT necessary).
12 / 109
Estimating proportions and probabilities
13 / 109
A First Modeling Exercise: Polling
So, Tay Tay is running against The Rock for president. Out of a
random sample of n = 1000 voters, 511 voters intend to vote for
Tay Tay, and 489 intend to vote for The Rock.
14 / 109
Recall: The statistical process
15 / 109
Polling - Model
16 / 109
Polling - Sample size
We will keep this notation throughout the rest of the class, and
often use it without mentioning it specifically.
How big is big enough? We’ll discuss a little more later in this
section. It turns out even simple questions like this have
complicated answers. (There is a short discussion of this in the
book: See Sections 5.1.3-5.1.5, 5.3.6, 6.1.5.)
17 / 109
Polling - Statistic
p̂ = (# of 1s) / n.
18 / 109
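A minimal Python sketch of this computation (the 0/1 vector recreates the
Tay Tay vs. The Rock poll from the previous slides):

```python
import numpy as np

# 0/1 responses: 1 = a vote for Tay Tay, 0 = a vote for The Rock
# (recreating the poll: 511 ones out of n = 1000)
votes = np.array([1] * 511 + [0] * 489)

n = len(votes)
p_hat = votes.sum() / n   # p̂ = (# of 1s) / n
print(p_hat)              # 0.511
```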
Estimating means and variances
19 / 109
Another Modeling Exercise
20 / 109
Recall: The statistical process
Purchase behaviour
[Figure: histogram of purchase behaviour; density (0.00 to 0.08) on the vertical axis, values 75 to 105 on the horizontal axis]
22 / 109
Purchase Behaviour - Model
X ∼ N(µ, σ²).

X̄ = (1/n) ∑_{i=1}^{n} xᵢ ,    s² = (1/(n−1)) ∑_{i=1}^{n} (xᵢ − X̄)².
24 / 109
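A minimal sketch of these two estimators in Python; the purchase data here is
made up for illustration:

```python
import numpy as np

x = np.array([82.5, 91.3, 88.0, 95.2, 79.8, 90.1])  # hypothetical purchase data

n = len(x)
x_bar = x.sum() / n                      # sample mean X̄
s2 = ((x - x_bar) ** 2).sum() / (n - 1)  # sample variance s² (note the n − 1)

# np.mean(x) and np.var(x, ddof=1) compute the same quantities
assert np.isclose(x_bar, np.mean(x)) and np.isclose(s2, np.var(x, ddof=1))
```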
Sample vs population quantities
Sample                                  Population
p̂ = (# of 1s) / n                       p = P(X = 1)
X̄ = (1/n) ∑_{i=1}^{n} xᵢ                E(X) = ∑_{i=1}^{n} P(X = xᵢ) × xᵢ
s² = (1/(n−1)) ∑_{i=1}^{n} (xᵢ − X̄)²    var(X) = ∑_{i=1}^{n} P(X = xᵢ) × [xᵢ − E(X)]²
[Figure: the purchase-behaviour histogram again, values 75 to 105]
26 / 109
Notation
27 / 109
Knowns vs unknowns
In real life:
▶ We do not know the actual distribution of X
▶ We do not know the actual mean and variance of X
▶ Different observations of X (the daily returns) are dependent
Seems hopeless!
28 / 109
Knowns vs unknowns
29 / 109
Building Portfolios
Assume I invest some money in the U.S. stock market. What are
some questions I might be interested in?
▶ What is my expected one-year return? Expected value
▶ How volatile is my portfolio? Risk / variance
30 / 109
The statistical process, again
31 / 109
Building Portfolios
[Figure: histogram of IBM returns, roughly −20 to 40]
▶ Remember that:
34 / 109
Building Portfolios
▶ µ̂P and σ̂P² refer to the estimated mean and variance of our
portfolio
35 / 109
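A minimal sketch of how µ̂P and σ̂P² might be computed from a series of
portfolio returns (the return numbers are made up for illustration):

```python
import numpy as np

# hypothetical monthly portfolio returns, in percent
returns = np.array([1.2, -0.8, 2.5, 0.3, -1.1, 1.9, 0.7, -0.4])

mu_hat = returns.mean()            # µ̂P: estimated mean return
sigma2_hat = returns.var(ddof=1)   # σ̂P²: estimated variance (risk)

print(mu_hat, sigma2_hat)
```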
Sampling distributions and variation
37 / 109
Models, Parameters, Estimates...
38 / 109
Oracle vs SAP Example (understanding variation)
X ∼ N(µ, σ²)

            X̄        s²
SAP firms   0.1263   0.065

▶ Well, 0.12/0.15 ≈ 0.8! I guess the ad is correct, right?
▶ Not so fast...
41 / 109
Oracle vs. SAP
42 / 109
Oracle vs. SAP
data sample
    ↓
bootstrap samples
43 / 109
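A minimal sketch of the bootstrap idea in Python. The data array is simulated
as a stand-in for the observed sample; the N(0.126, 0.255²) draw and the
sample size of 81 are illustrative assumptions, chosen only to roughly match
X̄ = 0.1263 and s² = 0.065:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.126, scale=0.255, size=81)  # stand-in for the data sample

# bootstrap: resample the data WITH replacement, recompute X̄ each time
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]

# the spread of boot_means approximates the sampling variation of X̄
print(np.mean(boot_means), np.std(boot_means))
```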
Oracle vs. SAP
[Figure: histogram of the bootstrap sample means, centered near X̄ = 0.1263]
44 / 109
Sampling distributions
45 / 109
Sampling Distribution of Sample Mean
If X ∼ N(µ, σ²), then

X̄ ∼ N(µ, σ²/n).
46 / 109
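A quick simulation sketch to check this fact; µ, σ, and n are arbitrary
choices:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 5.0, 2.0, 25   # arbitrary values for illustration

# draw many samples of size n and record each sample's mean
means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print(means.mean())  # ≈ mu
print(means.std())   # ≈ sigma / sqrt(n) = 0.4
```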
Sampling Distribution of Sample Mean
In practice σ² is unknown, so we plug in the sample variance s²:

X̄ ∼ N(µ, s²/n)

or equivalently

X̄ ∼ N(µ, sX̄²),  where sX̄ = s/√n is the standard error of X̄.
48 / 109
You can’t handle the truth
Key takeaway:
▶ Data is complicated
▶ Some data is so complicated that you really need to worry
about the little details (research)
▶ But, in many cases, simplifying assumptions are OK and we
can still make good decisions
50 / 109
Activity: Polling
http://rocknpoll.graphics/
51 / 109
Back to the Oracle vs. SAP example
Back to our simulation...
[Figure: histogram of the sample mean with the sampling distribution overlaid, centered near 0.1263]
52 / 109
Confidence intervals
54 / 109
Confidence Intervals
X̄ ∼ N(µ, sX̄²)

so...

(X̄ − µ) ∼ N(0, sX̄²)

right?
▶ What is a good prediction of µ? What is our best guess? X̄
▶ How do we make mistakes? How far from µ can we be?
95% of the time, ±2 × sX̄
▶ [X̄ ± 2 × sX̄] gives a 95% range of plausible values for µ... this
is called the 95% Confidence Interval (CI) for µ.
55 / 109
Confidence Intervals
More generally:
▶ If we want to estimate the parameter θ using θ̂,
▶ Then we first need to find the standard error sθ̂,
▶ And the 95% CI is given by

θ̂ ± 2 × sθ̂.
56 / 109
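A minimal sketch of this recipe for a mean, with hypothetical data:

```python
import numpy as np

x = np.array([0.11, 0.15, 0.09, 0.18, 0.13, 0.10])  # hypothetical returns

theta_hat = x.mean()                  # the estimate (here, X̄)
se = x.std(ddof=1) / np.sqrt(len(x))  # its standard error, s / √n

ci = (theta_hat - 2 * se, theta_hat + 2 * se)  # estimate ± 2 × s.e.
print(ci)
```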
Interpreting CIs
How do you interpret a confidence interval?
With 95% confidence, you can safely rule out any value that falls
outside the 95% CI.
The main point of CIs is that they give an entire range of plausible
values rather than a single point. This is FUNDAMENTAL.
57 / 109
Oracle vs. SAP example... one more time
X̄ ± 2 × sX̄ = [0.069; 0.183]
58 / 109
Back to the Oracle vs. SAP example
Back to our simulation...
[Figure: histogram of the sample mean with the sampling distribution overlaid]
60 / 109
Estimating Proportions...
p̂ ± 2 × sp̂ ,  where sp̂ = √(p̂(1 − p̂)/n).
61 / 109
Estimating Proportions... another modeling example
62 / 109
Estimating Proportions...
Defects:
This gives

sp̂ = √((.007)(.993)/1000) ≈ 0.00264.
63 / 109
Polls: yet another example...
Then,

sp̂ = √((.5)(.5)/1000) = .0158.
64 / 109
Difference in means
65 / 109
Example: Portfolio comparison
66 / 109
Difference in Means
When comparing groups to detect differences, we can use a clever
hack.
(µA − µB)^   vs.   µ̂A − µ̂B

That is: treat the difference µA − µB as a single parameter and estimate it
directly, rather than comparing the two separate estimates.
67 / 109
Standard Error for the Difference in Means
s(X̄a−X̄b) = √( sXa²/na + sXb²/nb )

s(p̂a−p̂b) = √( p̂a(1 − p̂a)/na + p̂b(1 − p̂b)/nb )
68 / 109
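A minimal sketch of these two standard errors as Python functions (the
function names are mine, not from the slides):

```python
import math

def se_diff_means(s2_a, n_a, s2_b, n_b):
    """Standard error of X̄a − X̄b."""
    return math.sqrt(s2_a / n_a + s2_b / n_b)

def se_diff_props(p_a, n_a, p_b, n_b):
    """Standard error of p̂a − p̂b."""
    return math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
```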
Confidence Interval for the Difference in Means
(X̄a − X̄b) ± 2 × s(X̄a−X̄b)

or, the confidence interval for the difference in proportions:

(p̂a − p̂b) ± 2 × s(p̂a−p̂b)
69 / 109
Back to the example
70 / 109
Example: Google Search Algorithm
Let’s look at the difference between the current algorithm and the
modifications:
s(p̂current−p̂A) = √( (0.702 × 0.298)/2500 + (0.74 × 0.26)/2500 ) = 0.0127

s(p̂current−p̂B) = √( (0.702 × 0.298)/2500 + (0.704 × 0.296)/2500 ) = 0.0129
73 / 109
Google Search Algorithm: The WRONG way
(WARNING This slide is to highlight what goes wrong if you do it the WRONG way!)
mod (A):

( .740 − 2 × √(.740 × (1 − .740)/2500) ;  .740 + 2 × √(.740 × (1 − .740)/2500) ) = (0.723; 0.758)

mod (B):

( .704 − 2 × √(.704 × (1 − .704)/2500) ;  .704 + 2 × √(.704 × (1 − .704)/2500) ) = (0.686; 0.722)
74 / 109
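For contrast, a minimal sketch of the RIGHT way: build a CI for the
difference itself, using the numbers from these slides:

```python
import math

n = 2500
p_curr, p_a = 0.702, 0.740

# standard error of p̂current − p̂A
se = math.sqrt(p_curr * (1 - p_curr) / n + p_a * (1 - p_a) / n)

diff = p_curr - p_a                  # −0.038
print(diff - 2 * se, diff + 2 * se)  # ≈ (−0.0634, −0.0126): excludes 0
```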
What’s the difference?
75 / 109
What’s the difference?
Example: You can tell that a cat is not a dog without accurately
estimating its height/weight/breed/etc.
76 / 109
What’s the difference?
77 / 109
What’s the difference?
78 / 109
The Bottom Line...
estimate ± 2 × s.e.
X̄ ± 2 × sX̄
p̂ ± 2 × sp̂
80 / 109
Hypothesis testing
81 / 109
Testing
Suppose we want to evaluate a claim about an unknown parameter θ.
▶ Is the average difference in returns larger than 0?
▶ Is the proportion of voters for candidate A larger than the
proportion for candidate B?
▶ Will my return on investment (ROI) be positive?
▶ Is a bank lending money fairly?
θ could be µ, p, σ², ...
82 / 109
Examples of tests
Other examples:
H0 : p = p₀ vs. H1 : p ≠ p₀
H0 : σ² = 0 vs. H1 : σ² > 0
H0 : p₁ = p₂ vs. H1 : p₁ > p₂
83 / 109
Testing (means)
Let’s start with means:
H0 : µ = µ₀ vs. H1 : µ ≠ µ₀

t = (X̄ − µ₀) / sX̄

and we reject H0 when |t| is large (by convention, when |t| > 2).
85 / 109
Testing (Proportions)
t = (p̂ − p₀) / sp̂

t = ((X̄a − X̄b) − d₀) / s(X̄a−X̄b)

t = ((p̂a − p̂b) − d₀) / s(p̂a−p̂b)
Why 2? Why not |t| > 3 or |t| > 1? (Hint: It’s arbitrary.)
88 / 109
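All of these t-statistics follow the same recipe, sketched below (the helper
name and the poll numbers are mine, for illustration):

```python
import math

def t_stat(estimate, null_value, se):
    """t = (estimate − null value) / (standard error)."""
    return (estimate - null_value) / se

# e.g., testing H0: p = 0.5 vs. H1: p ≠ 0.5 with a hypothetical poll
p_hat, n = 0.53, 1000
se = math.sqrt(p_hat * (1 - p_hat) / n)
t = t_stat(p_hat, 0.5, se)
print(t, abs(t) > 2)  # reject H0 when |t| > 2
```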
Example: Google modifications
Back to the Google example: Although Mod A significantly outperformed
the current algorithm (why?), execs will not sign off on the mod unless it
is at least 1% better than the current algorithm. Can you convince them
to switch?
H0 : pcurrent − pA = 0 [ ⇐⇒ pA = pcurrent ]
H1 : pcurrent − pA ≤ −0.01 [ ⇐⇒ pA ≥ pcurrent + 0.01].
The 95% CI was (−0.0634, −0.0126). Since this interval only contains
values smaller than −0.01, there is significant evidence that Mod A will
yield a > 1% improvement. (Be careful with the signs here!)
Another example: let pW denote the true proportion voting for Wu. We want to test:
H0 : pW = 0.5
H1 : pW > 0.5
What do we conclude?
90 / 109
The hard part
92 / 109
Intuition behind testing
93 / 109
Intuition behind testing
94 / 109
Example: Psychic powers
How about this: I’ll try to prove to you that I can read your mind.
Think of a number between 1 and 10 and write it down on a piece of
paper (or your hand).
OK, I’ll admit one thing: My psychic skills aren’t perfect. But I
swear I’m a psychic.
96 / 109
Example: Psychic powers
(NOTE: You can replace 0.1 with any baseline you find convincing. As an exercise, try
this!)
97 / 109
Example: Psychic powers
98 / 109
Example: Psychic powers
This is also a hypothesis testing problem!
▶ Let p = P(guess correctly) be the true parameter (we don’t
know this!)
▶ We are interested in whether or not p > 0.1
The null hypothesis is always the baseline. It’s the status quo,
boring hypothesis. ⟹ H0 : p = 0.1
So:
H0 : p = 0.1
H1 : p > 0.1
99 / 109
How many guesses?
[Figure: three panels plotting the 95% CI lower bound (vertical axis, 0.0 to 1.0) against the number of correct guesses, for n = 10, 100, and 1000. Panel titles: “I need at least 5/10 correct guesses”, “I need at least 18/100 correct guesses”, “I need at least 121/1000 correct guesses”.]
101 / 109
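These thresholds can be reproduced with a short sketch, assuming the rule
behind the plots is that the 95% CI lower bound, p̂ − 2 × sp̂, must exceed the
0.1 baseline:

```python
import math

def min_correct(n, baseline=0.1):
    """Smallest k such that p̂ − 2·sp̂ > baseline, with p̂ = k/n."""
    for k in range(n + 1):
        p_hat = k / n
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - 2 * se > baseline:
            return k

for n in (10, 100, 1000):
    print(n, min_correct(n))  # 5, 18, 121
```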
More examples
These are just a few examples, and the baselines are also just
examples. Can you think of any other appropriate baselines for
these examples?
102 / 109
Back to an example: Google modifications
103 / 109
Is this even useful?
We’ve made a lot of assumptions so far...
▶ The data is perfectly normally distributed
▶ The data is perfectly independent
▶ The sampling is perfectly random and unbiased
▶ The sample size is sufficiently large
In the real world, we have to deal with “perfect world” problems like
sampling error and approximations in addition to “real world”
problems like dependence, limited samples, etc.
105 / 109
The Importance of Considering and Reporting
Uncertainty
It was predicted that the rain and the spring melt would lead to a
49-foot crest of the river. The levees were 51 feet high.
107 / 109
The Importance of Considering and Reporting
Uncertainty
It turns out the prediction interval for the flood was 49ft ± 9ft,
meaning the 51ft levees could overflow!
108 / 109
The Importance of Considering and Reporting
Uncertainty
Don’t make this mistake! Intervals are your friend and will lead to
better decisions!
(Remember the weather apps and rain probabilities from Section 1?)
109 / 109