
Section 2: Estimation, Confidence Intervals and Hypothesis Testing

Bryon Aragam
Chicago Booth
booth41000@gmail.com
https://canvas.uchicago.edu/courses/43775/

Suggested Readings:
Naked Statistics, Chapters 7, 8, 9 and 10
OpenIntro Statistics, Chapters 5, 6, and 7

Last Updated: October 9, 2022

1 / 109
Example

Two different polling firms ran two different polls to determine the
winner of the 2020 US presidential election.

Both firms are objective, fair, and willing to spare no expense to
ensure their samples are perfectly random and unbiased.

- Firm A: Biden 51%, Trump 49% (n = 1000)
- Firm B: Biden 49%, Trump 51% (n = 1000)

Could Firm B have possibly reported Biden 12%, Trump 88%?

Who is right? Is either one wrong?

Why didn’t they get the same answer?

2 / 109
Flashback

Even in a perfect world where we wouldn’t have to worry about
bias, manipulation, privacy, security, etc., we must still deal with
random sampling.

This section is all about conceptualizing and managing sampling
error.

The key concept is that of a sampling distribution.

3 / 109
The statistical process

4 / 109
The statistical process

Let’s try to make this more formal. Simplifying a bit, let’s pretend
like the winner of the election goes to the popular vote.

- The problem: Who will win the election?
- The answer: Who has the highest share of the vote?
- The target parameter: True proportion of voters for each candidate.
- The data: A random sample of the population.
- The statistic: The sample proportion.

Problem −→ statistics.

5 / 109
Two key concepts

In going from problem −→ statistics, we encounter two key concepts:

- The target parameter
- The statistic

These are crucial abstractions that make talking about statistical
analyses easier.

Every statistical problem consists of a target parameter and a
statistic.

6 / 109
What is a parameter?
Intuitively, a parameter is any quantity whose value we do not
know, and the target parameter is any quantity whose value we do
not know but we want to know. (Why do we want to know?
Think about the statistical process.)

Examples:
- True proportion of voters for each candidate
- Future value of an investment
- Average price of competitors’ products
- Proportion of Americans with COVID-19 (or vaccinated, or tested, etc.)

If you know the value of the target parameter, there is nothing
more to be done. Game over.
7 / 109
What is a statistic?
A statistic is a summary (formally: a function) of a
dataset—typically used to approximate the value of a target
parameter.

Examples:
- Sample mean
- Sample proportion
- Maximum or minimum
- Range (max − min)
- Lots of others...

Emphasis: A statistic is any summary of a dataset. Some statistics
are useful, some are not so useful, and some are downright
misleading!
8 / 109
Statistics are random variables

Random sampling =⇒ data is random.

Even though you only ever see one dataset, the process that
generated it (i.e. random sampling) was random, so your data is
always random.

So, statistics are random “things” (aka random variables)!

- Data is random
- A statistic is a summary of (random) data =⇒ your statistic is a
  random variable
9 / 109
Back to the example

The data:
- Firm A: Biden 51%, Trump 49% (n = 1000)
- Firm B: Biden 49%, Trump 51% (n = 1000)

Why didn’t they get the same answer?

Random sampling =⇒ different, but equally valid, samples

Who is right? Is either one wrong?

Both firms are objective and each sample is perfectly random, so
both answers are equally “correct.”

Could Firm B have possibly reported Trump 12%, Biden 88%?

Yes. It is possible, though unlikely, that a random sample of 1000
voters was skewed like this.
10 / 109
tl;dr

This is the most important lesson from this section.

Your data is random, so every step of a data analysis is random
and involves uncertainty that must be quantified.

Since a statistic is a random variable, it has a distribution.

The distribution of a statistic is called its sampling distribution,
and the sampling distribution is how we’ll quantify this uncertainty.

11 / 109
Moving forward

Understanding and computing sampling distributions is hard.

The rest of this section is about computing these distributions and
using them to quantify uncertainty.

We will focus on two canonical examples:

- Estimating a sample proportion (e.g. what proportion of voters
  will vote for Candidate A?)
- Estimating a sample mean (e.g. what is the average height of
  this class?)

The ideas apply to any statistic, any dataset, and any assumptions
on the data (i.e. normality is NOT necessary).

12 / 109
Estimating proportions and probabilities

13 / 109
A First Modeling Exercise: Polling

In a perfect world, Taylor “Tay Tay” Swift would run against
Dwayne “The Rock” Johnson for president. Let’s pretend like
we’re in this perfect world.

So, Tay Tay is running against The Rock for president. Out of a
random sample of n = 1000 voters, 511 voters intend to vote for
Tay Tay, and 489 intend to vote for The Rock.

Simplifying a bit, assume the winner of the election will go to the
popular vote. Who is going to win the election?

14 / 109
Recall: The statistical process

- The problem: Who will win the election?
- The answer: Who has the highest share of the vote?
- The target parameter: True proportion of voters for each candidate.
- The data: A random sample of the population.
- The statistic: The sample proportion.

How do we model this scenario? How much uncertainty is there in
our statistic?

15 / 109
Polling - Model

- Each sample corresponds to a randomly selected voter, and the
  observation is who they intend to vote for
- As a first modeling decision, let’s call the random variable
  associated with a voter’s decision X, which takes one of two
  values: X = 0 (The Rock) or X = 1 (Tay Tay)
- We say that the votes X are independent and identically
  distributed as

      X ∼ Ber(p).

- Question: What is a reasonable value for p?
- We can estimate this from the data (n = 1000 observations)...

16 / 109
Polling - Sample size

Once and for all, let n denote the sample size.

We will keep this notation throughout the rest of the class, and
often use it without mentioning it specifically.

n will ALWAYS refer to the sample size.

How big is big enough? We’ll discuss this a little more later in this
section. It turns out even simple questions like this have
complicated answers. (There is a short discussion of this in the
book: See Sections 5.1.3-5.1.5, 5.3.6, 6.1.5.)

17 / 109
Polling - Statistic

A reasonable statistic in this scenario is the sample proportion:

      p̂ = (# of 1s) / n.

In this case, p̂ corresponds to the proportion of voters in favour of
Tay Tay and p̂ = 511/1000 = 0.511.

(The assignment to 0 and 1 here is irrelevant: We could have
defined X = 1 to mean “voting for The Rock”, and everything
would still work with p̂ = 0.489. Just be careful about interpreting
things!)

18 / 109
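
As an aside, here is a minimal sketch in Python of this polling setup. The “true” p and the seed are made up purely to simulate data; in real polling we only ever observe the sample, never p.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

p_true = 0.52  # hypothetical true proportion, used only for simulation
n = 1000

# One random sample of n voters: 1 = Tay Tay, 0 = The Rock
x = rng.binomial(1, p_true, size=n)

# The statistic: the sample proportion, (# of 1s) / n
p_hat = x.mean()
print(p_hat)  # varies from sample to sample, e.g. 0.511
```

Re-running the last three lines gives a different p̂ each time: the statistic really is a random variable.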
Estimating means and variances

19 / 109
Another Modeling Exercise

- I have been tracking the purchase behaviour of my customers,
  so I know how much they spend at my business.
- In order to keep up with inventory, consider new product
  offerings, and generally to evaluate the health of my business,
  I’d like to predict how much my customers will spend in the
  future.
- What should I do?

20 / 109
Recall: The statistical process

- The problem: How can I predict my customers’ purchase behaviour?
- The answer: What is the distribution of purchase amounts?
- The target parameters: True mean and variance of purchase amounts.
- The data: A database of consumer purchases.
- The statistics: The sample mean and variance.

In this example, the target parameters are not a perfect substitute
for the answer we want, but they will get us pretty far. Let’s start
here.

Where is the randomness? Can’t predict the future.


21 / 109
Purchase Behaviour - Data

[Figure: histogram of purchase amounts ($), roughly 75–105, with density on the vertical axis]

22 / 109
Purchase Behaviour - Model

- Each observation corresponds to a single purchase
- As a modeling decision, let’s call the random variable
  associated with each purchase X and assume that purchases
  can be modeled as a normal distribution.
- As before, let’s assume the purchases X are independent and
  identically distributed, except this time a discrete distribution
  is not appropriate (Why?)
- Let’s model X as a continuous distribution, so

      X ∼ N(µ, σ²).

- Question: What are reasonable values of µ and σ²?
- We can estimate them from the data (n = 113 observations)...
23 / 109
Purchase behaviour - Model

- Let’s assume that each observation in the random sample
  {x1, x2, x3, ..., xn} is independent and distributed according to
  the model above, i.e., xi ∼ N(µ, σ²)
- To estimate µ and σ², reasonable choices are the sample
  mean (X̄) and the sample variance (s²)... (their sample
  counterparts)

      X̄ = (1/n) Σᵢ xᵢ,    s² = (1/(n−1)) Σᵢ (xᵢ − X̄)²    (sums over i = 1, ..., n)

- Sometimes written µ̂ and σ̂² (ˆ = “estimate of”)

24 / 109
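
A quick sketch of these two estimators in Python (the purchase amounts here are made up; note the n − 1 in the sample variance):

```python
import numpy as np

x = np.array([88.1, 92.3, 85.7, 90.4, 87.9])  # made-up purchase amounts ($)
n = len(x)

x_bar = x.sum() / n                      # sample mean
s2 = ((x - x_bar) ** 2).sum() / (n - 1)  # sample variance, n-1 denominator

# numpy's built-ins agree (ddof=1 gives the n-1 denominator):
assert np.isclose(x_bar, x.mean())
assert np.isclose(s2, x.var(ddof=1))
```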
Sample vs population quantities
    Sample                           Population
    p̂ = (# of 1s)/n                 p = P(X = 1)
    X̄ = (1/n) Σᵢ xᵢ                 E(X) = Σᵢ P(X = xᵢ) × xᵢ
    s² = (1/(n−1)) Σᵢ (xᵢ − X̄)²     var(X) = Σᵢ P(X = xᵢ) × [xᵢ − E(X)]²

- p, E(X), var(X) depend on P(X = xᵢ), which are unknown
- p̂, X̄, s² are approximations to P(X = 1), E(X), var(X) that
  depend only on known quantities (i.e. the data)
- (Beyond the scope of 41000): These approximations can be
  shown to be optimal*—they cannot be improved in terms of
  prediction
- *Under certain assumptions
- The catch: Outliers, fairness, privacy, RAM, etc.
25 / 109
Purchase behaviour - Model checking

For the purchasing data, X̄ = 88.64 and s² = 25.43.

[Figure: histogram of the purchase data with the fitted normal density overlaid in red]

- The red line represents our “model”, i.e., the normal
  distribution with mean and variance given by the estimated
  quantities X̄ and s².

26 / 109
Notation

If we want to wax poetic about general parameters and statistics
(i.e. without specifying exactly which ones we care about), it is
common to use another Greek letter: θ
- Parameter: θ
- Statistic: θ̂

The “hat” ˆ is universal stats lingo for “statistic” or “estimate”.

This is just a placeholder: Examples of “θ” we’ve considered
already are p, µ, σ², and examples of “θ̂” are p̂, X̄, s², etc.

27 / 109
Knowns vs unknowns

In real life:
- We do not know the actual distribution of X
- We do not know the actual mean and variance of X
- Different observations of X are dependent

Seems hopeless!

Models to the rescue:
- Replace the actual distribution with a tractable distribution
  (the normal distribution)
- Estimate the “best” possible values for µ and σ²
- Hope that observations are “close” to independent

28 / 109
Knowns vs unknowns

Sometimes this works reasonably well:
- Financial data
- Demographic data
- Scientific data

Sometimes it fails catastrophically:
- Computer vision (images)
- Natural language (text)
- Speech recognition (audio)

Always be asking: Is your model a good model?

29 / 109
Building Portfolios

Assume I invest some money in the U.S. stock market. What are
some questions I might be interested in?
- What is my expected one year return? Expected value
- How volatile is my portfolio? Risk / variance

As an example, consider 3 investment opportunities:
1. IBM stocks
2. ALCOA stocks
3. Treasury Bonds (T-bill)

How should we start thinking about these questions?

30 / 109
The statistical process, again

You know the drill by now:
- The problem: What is my expected one year return?
- The answer: What is the average return historically?
- The target parameter: The expected value of the investment.
- The data: Historical returns data.
- The statistic: The sample mean.

Can you do the same for volatility?

31 / 109
Building Portfolios

To understand returns and volatility, we need to estimate expected
values and variances. Let’s write some things down:
- µI = expected value of IBM, σI = standard deviation of IBM
- Similarly, we have µA and σA for ALCOA, and µT and σT for T-bills

What about the T-bill?
- The return on the T-bill is fixed at 3%
- =⇒ µT = 3, σT = 0 (Why?)

After observing some return data we can come up with estimates
for the means and variances describing the behavior of these stocks.
32 / 109
Building Portfolios
            IBM (%)       ALCOA (%)     T-bill (%)
    mean    µ̂I = 12.5     µ̂A = 14.9    µT = 3
    s.d.    σ̂I = 10.5     σ̂A = 14.0    σT = 0

    corr(IBM, ALCOA) = 0.33

[Figure: scatter plot of ALCOA vs. IBM returns (%), roughly −20 to 40 on each axis]

- Why µ̂I and µ̂A, but µT (i.e. no ˆ on the T-bill)?
- What is cov(IBM, Tbill)? How about cov(ALCOA, Tbill)?
- What are the units here? Could have been %, or $, or
  decimals... ALWAYS STATE YOUR UNITS.
33 / 109
Building Portfolios

- What if we combine these options? Is that a good idea?
- I don’t like “conventional wisdom”: Let’s actually find out!
- What happens if I place half of my money in ALCOA and the
  other half in T-bills?
- Remember that:

      E(aX + bY) = aE(X) + bE(Y)
      var(aX + bY) = a²var(X) + b²var(Y) + 2ab·cov(X, Y)

34 / 109
Building Portfolios

- So, by using what we know about the means and variances we
  get to:

      µ̂P = 0.5µ̂A + 0.5µT = 0.5(14.9) + 0.5(3) = 8.95
      σ̂P² = 0.5²·σ̂A² + 0.5²·σT² + 2·0.5·0.5·cov(ALCOA, Tbill)
          = 0.5²(14)² + 0.5²·0 + 2·0.5·0.5·0 = 49

- µ̂P and σ̂P² refer to the estimated mean and variance of our
  portfolio
- What are we assuming here?

35 / 109
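
The same calculation as a short Python sketch, using the estimates from the table (ALCOA and T-bill, in %):

```python
mu_A, sigma_A = 14.9, 14.0  # ALCOA estimates (%)
mu_T, sigma_T = 3.0, 0.0    # T-bill: fixed return, zero variance
w = 0.5                     # half in ALCOA, half in T-bills
cov_AT = 0.0                # T-bill is constant => zero covariance

# E(aX + bY) and var(aX + bY) with a = b = 0.5
mu_P = w * mu_A + (1 - w) * mu_T
var_P = w**2 * sigma_A**2 + (1 - w)**2 * sigma_T**2 + 2 * w * (1 - w) * cov_AT

print(mu_P)          # 8.95 (%)
print(var_P**0.5)    # 7.0 (%): half of ALCOA's volatility
```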
Sampling distributions and variation

37 / 109
Models, Parameters, Estimates...

In general we talk about unknown quantities using the language of
probability... and the following steps:
- Define the random variables of interest
- Define a model (or probability distribution) that describes the
  behavior of the RV of interest
- Based on the data available, we estimate the parameters
  defining the model
- We are now ready to describe possible scenarios, generate
  predictions, make decisions, evaluate risk, etc...

“All models are wrong, but some are useful.”

38 / 109
Oracle vs SAP Example (understanding variation)

Do we “buy” the claim from this ad?


39 / 109
Recall: The statistical process

- The problem: Is it true that SAP customers are 20% less
  profitable than their peers?
- The answer: What are the profit margins of SAP customers,
  and their peers?
- The target parameters: True mean profit margins for SAP
  customers and their peers.
- The data: A database of company profits.
- The statistics: The sample means.

To simplify, let’s assume that the industry ROE is known to be
15% (this is also an estimate but let’s assume it is accurate).

So we only need to worry about estimating ROE for SAP firms.

40 / 109
Oracle vs. SAP

- Do we “buy” the claim from this ad?
- We have a dataset of 81 firms that use SAP...
- The industry ROE is 15%
- We assume that the random variable X represents ROE of
  SAP firms and can be described by

      X ∼ N(µ, σ²)

                  X̄         s²
    SAP firms     0.1263     0.065

- Well, 0.12/0.15 ≈ 0.8! I guess the ad is correct, right?
- Not so fast...
41 / 109
Oracle vs. SAP

- Let’s assume the sample we have is a good representation of
  the “population” of firms that use SAP...
- What if we had observed a different sample of size 81?

42 / 109
Oracle vs. SAP

- Selecting a random sample, with replacement, from the original 81
  observations, I get a new X̄ = 0.09... I do it again, and I get
  X̄ = 0.155... and again X̄ = 0.132...

[Diagram: “The Bootstrap: why it works” — one data sample branching into many bootstrap samples]
43 / 109
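
A minimal sketch of this bootstrap in Python. The original 81 ROEs are not reproduced on the slides, so a stand-in sample with a similar mean and variance is simulated here just to illustrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for the 81 observed SAP ROEs (the real data is not shown here)
roe = rng.normal(0.1263, np.sqrt(0.065), size=81)

B = 1000
boot_means = np.empty(B)
for b in range(B):
    # resample 81 observations WITH replacement from the original sample
    boot_means[b] = rng.choice(roe, size=81, replace=True).mean()

# The spread of boot_means approximates the sampling variation of X-bar
print(boot_means.mean(), boot_means.std())
```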
Oracle vs. SAP

- After doing this 1000 times... here’s the histogram of X̄...
  Now, what do you think about the ad?

[Figure: histogram of the 1000 bootstrap sample means (density), centered near 0.1263, range ~0.05–0.20]
44 / 109
Sampling distributions

Things are about to get a little bit meta.

- If we sample X 100 times, we get a list of random numbers
  {x1, ..., x100} and their mean x̄1 = (1/n) Σᵢ xᵢ
- If we do this a second time, we get a second list of random
  numbers and a second mean x̄2
- x̄1 is different from x̄2!
- If we do this N times, we get N different lists of numbers, and
  N corresponding means x̄1, ..., x̄N
- All the sample means x̄i are different!
- The sample mean is itself a random variable!
- So it has a distribution, called the sampling distribution

45 / 109
Sampling Distribution of Sample Mean

Consider the sample mean for an iid sample of n observations
{x1, ..., xn}.

If X ∼ N(µ, σ²), then

      X̄ ∼ N(µ, σ²/n).

This is the sampling distribution of the sample mean.

46 / 109
Sampling Distribution of Sample Mean

- The sampling distribution of X̄ describes how our estimate would
  vary over different datasets of the same size n
- It provides us with a vehicle to evaluate the uncertainty associated
  with our estimate of the mean...
- It turns out that s² is a good proxy for σ², so we can
  approximate the sampling distribution by

      X̄ ∼ N(µ, s²/n)

- We call sd(X̄) the standard error of X̄... it is a measure of its
  variability... I like the notation

      sX̄ = sd(X̄) = √(s²/n)
47 / 109
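
For the SAP data this standard error is easy to check (a sketch, using s² = 0.065 and n = 81 from before):

```python
import numpy as np

s2, n = 0.065, 81
se = np.sqrt(s2 / n)  # standard error of the sample mean
print(se)             # about 0.028; compare the spread of the bootstrap histogram
```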
Sampling Distribution of Sample Mean

      X̄ ∼ N(µ, sX̄²)

- X̄ is unbiased... E(X̄) = µ. On average, X̄ is right!
- X̄ is consistent... as n grows, sX̄² → 0 =⇒ X̄ → µ
- With enough information (i.e. samples / data), eventually X̄
  correctly estimates µ

If your data is non-normal, this still remains approximately true:
- X̄ is still unbiased and consistent: E(X̄) = µ and sX̄² → 0
- The sampling distribution is no longer exactly normal but:

  With enough samples, X̄ is approximately normally distributed.

48 / 109
You can’t handle the truth

- Strictly speaking, it is not true that X̄ ∼ N(µ, sX̄²)—the truth
  is X̄ ∼ N(µ, σ²/n) (notice the difference?)
- Since sX̄² ≈ σ²/n, the distribution is “close enough” to N(µ, sX̄²)
  for practical purposes, when n is large
49 / 109
You can’t handle the truth

Key takeaway:

How and when is it appropriate to simplify?
Independence, normality, approximations, etc.

- Data is complicated
- Some data is so complicated that you really need to worry
  about the little details (research)
- But, in many cases, simplifying assumptions are OK and we
  can still make good decisions

50 / 109
Activity: Polling

http://rocknpoll.graphics/

51 / 109
Back to the Oracle vs. SAP example
Back to our simulation...

[Figure: the bootstrap histogram of sample means again, now with the (normal) sampling distribution overlaid, centered near 0.1263]
52 / 109
Confidence intervals

54 / 109
Confidence Intervals

      X̄ ∼ N(µ, sX̄²)

so...

      (X̄ − µ) ∼ N(0, sX̄²)

right?
- What is a good prediction of µ? What is our best guess?
- How do we make mistakes? How far from µ can we be?
  95% of the time, within ±2 × sX̄
- [X̄ ± 2 × sX̄] gives a 95% range of plausible values for µ... this
  is called the 95% Confidence Interval (CI) for µ.
55 / 109
Confidence Intervals

More generally:
- If we want to estimate the parameter θ using θ̂,
- Then we first need to find the standard error sθ̂,
- And the 95% CI is given by

      [θ̂ − 2 × sθ̂, θ̂ + 2 × sθ̂]

For a 99% CI, replace 2 with 3: [θ̂ ± 3 × sθ̂].
For a 68% CI, replace 2 with 1: [θ̂ ± 1 × sθ̂].

56 / 109
Interpreting CIs
How do you interpret a confidence interval?

With 95% confidence, you can safely rule out any value that falls
outside the 95% CI.

Suppose the CI of µ is the interval (a, b).
- It is NOT TRUE that there is a “95% probability that µ is in (a, b)”.
- This common misconception leads to a misunderstanding of
  what a CI is and where the randomness comes in.
- Safest bet: Use the “ruling out” interpretation above.

The main point of CIs is that there is an entire range of values vs.
a single point. This is FUNDAMENTAL.
57 / 109
Oracle vs. SAP example... one more time

In this example, X̄ = 0.1263, s² = 0.065 and n = 81... therefore,
sX̄² = 0.065/81, so the 95% confidence interval for the ROE of SAP
firms is

      [X̄ − 2 × sX̄; X̄ + 2 × sX̄]
      = [0.1263 − 2 × √(0.065/81); 0.1263 + 2 × √(0.065/81)]
      = [0.069; 0.183]

- Is 0.15 a plausible value? What does that mean?

58 / 109
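
The same interval in Python (a sketch of the calculation above):

```python
import numpy as np

x_bar, s2, n = 0.1263, 0.065, 81
se = np.sqrt(s2 / n)

ci = (x_bar - 2 * se, x_bar + 2 * se)
print(ci)  # approximately (0.069, 0.183)
# 0.15 lies inside the interval, so it cannot be ruled out
```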
Back to the Oracle vs. SAP example
Back to our simulation...

[Figure: the histogram of sample means with the sampling distribution overlaid and the 95% CI endpoints marked at 0.069 and 0.183, centered at 0.1263]
59 / 109
Estimating Proportions...

We used the proportion of defects in our sample to estimate p, the
true, long-run, proportion of defects.

Could this estimate be wrong?

Let p̂ denote the sample proportion.

The standard error associated with the sample proportion as an
estimate of the true proportion is:

      sp̂ = √(p̂(1 − p̂)/n)

60 / 109
Estimating Proportions...

We estimate the true p by the observed sample proportion of 1’s, p̂.

The (approximate) 95% confidence interval for the true proportion is:

      p̂ ± 2 × sp̂.

61 / 109
Estimating Proportions... another modeling example

The Consumer Product Safety Commission (CPSC) mandates that
at most 1% of a particular kind of product is allowed to be
defective.

Your job is to test these products for a particular company. Each
time you test, it is defective or not. After testing 1000 products,
you find 7 defective products.

Can you provide convincing evidence to the CPSC that no more than
1% of your products are defective?

(Why isn’t 7/1000 = 0.7% < 1% already convincing evidence to
the CPSC?)

62 / 109
Estimating Proportions...

Defects:

In our defect example we had p̂ = .007 and n = 1000.

This gives

      sp̂ = √((.007)(.993)/1000) = 0.00264.

The confidence interval is 0.007 ± 0.00527 = (0.0017, 0.0123). Since
0.01 is in this interval, NO, you cannot make this claim to the
CPSC.

63 / 109
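
A sketch of the defect calculation in Python:

```python
import numpy as np

p_hat, n = 7 / 1000, 1000
se = np.sqrt(p_hat * (1 - p_hat) / n)  # about 0.0026

ci = (p_hat - 2 * se, p_hat + 2 * se)
print(ci)  # roughly (0.002, 0.012): 0.01 is inside, so the claim fails
```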
Polls: yet another example...

Suppose we take a relatively small random sample from a large
population and ask each respondent a yes/no question, coding yes
as Yi = 1 and no as Yi = 0, where p is the true population
proportion of “yes”.

Suppose, as is common, n = 1000, and p̂ ≈ .5.

Then,

      sp̂ = √((.5)(.5)/1000) = .0158.

The standard error is .0158, so the ± is .0316, or about ±3%.

(Does this sound familiar?!)

64 / 109
Difference in means

65 / 109
Example: Portfolio comparison

Say we are comparing the returns of two different funds based on
an initial investment of $125,000. To study this, we sample the
returns for 100 accounts from Fund A and 150 accounts from Fund
B from different banks around Chicago.

Here is a summary of the data:

              average value    std. deviation
    Fund A    150k             30k
    Fund B    143k             15k

What can we conclude?

66 / 109
Difference in Means
When comparing groups to detect differences, we can use a clever
hack.

Suppose there are two groups A and B with means µA and µB.

If we do not care about the particular values of µA and µB, and
instead only care about whether they are different, then build a
confidence interval for the difference µA − µB.

Estimating µA − µB (a single number) is easier than estimating µA
and µB (two numbers) separately.

      (µA − µB)ˆ  vs.  µ̂A − µ̂B

(estimate the difference directly vs. take the difference of two estimates)

67 / 109
Standard Error for the Difference in Means

µA − µB is called the difference in means.

We can compute the standard error for the difference in means:

      s(X̄a−X̄b) = √(sXa²/na + sXb²/nb)

or, for the difference in proportions:

      s(p̂a−p̂b) = √(p̂a(1 − p̂a)/na + p̂b(1 − p̂b)/nb)

68 / 109
Confidence Interval for the Difference in Means

We can then compute the confidence interval for the difference in
means:

      (X̄a − X̄b) ± 2 × s(X̄a−X̄b)

or, the confidence interval for the difference in proportions:

      (p̂a − p̂b) ± 2 × s(p̂a−p̂b)

69 / 109
Back to the example

              average value    std. deviation
    Fund A    150k             30k
    Fund B    143k             15k

In the fund comparison example:

      s(X̄A−X̄B) = √(30²/100 + 15²/150) = 3.24

so that the confidence interval for the difference in means is:

      (150 − 143) ± 2 × 3.24 = (0.519; 13.48)

What is the conclusion now?

70 / 109
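
A sketch of the fund comparison in Python (values in thousands of $):

```python
import numpy as np

xbar_a, s_a, n_a = 150, 30, 100  # Fund A
xbar_b, s_b, n_b = 143, 15, 150  # Fund B

se_diff = np.sqrt(s_a**2 / n_a + s_b**2 / n_b)  # about 3.24
diff = xbar_a - xbar_b

ci = (diff - 2 * se_diff, diff + 2 * se_diff)
print(ci)  # roughly (0.52, 13.48): 0 is excluded, so the funds differ
```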
Example: Google Search Algorithm

Google is testing a couple of modifications to its search
algorithms... they experiment with 2,500 searches per algorithm and
check how often the result was defined as a “success”. Here’s the
data from this experiment:

    Algorithm    Current    Mod (A)    Mod (B)
    success      1755       1850       1760
    failure      745        650        740

The probability of success is estimated to be p̂ = 0.702 for the
current algorithm, p̂A = 0.74 for modification (A), and p̂B = 0.704
for modification (B).

Are the modifications better?


71 / 109
Example: Google Search Algorithm

Let’s look at the difference between the current algorithm and the
modifications:

      s(p̂current−p̂A) = √(0.702·0.298/2500 + 0.74·0.26/2500) = 0.0127
      s(p̂current−p̂B) = √(0.702·0.298/2500 + 0.704·0.296/2500) = 0.0129

so that the CIs for the differences in proportions are:

      pcurrent − pA: (0.702 − 0.74) ± 2 × 0.0127 = (−0.0634, −0.0126)
      pcurrent − pB: (0.702 − 0.704) ± 2 × 0.0129 = (−0.0278, 0.0238)

What do you conclude?


72 / 109
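
The same two intervals via a small helper function (a sketch; diff_ci is just a name chosen here):

```python
import numpy as np

n = 2500
p_cur, p_a, p_b = 1755 / n, 1850 / n, 1760 / n

def diff_ci(p1, p2, n1, n2):
    """95% CI for the difference in proportions p1 - p2."""
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return (d - 2 * se, d + 2 * se)

print(diff_ci(p_cur, p_a, n, n))  # about (-0.063, -0.013): Mod A is better
print(diff_ci(p_cur, p_b, n, n))  # about (-0.028, 0.024): inconclusive
```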
Portfolio comparison: The WRONG way

(WARNING: This slide is to highlight what goes wrong if you do it the WRONG way!)

Let’s compute the confidence intervals for both portfolios
separately:

      Fund A: (150 − 2 × √(30²/100); 150 + 2 × √(30²/100)) = (144; 156)
      Fund B: (143 − 2 × √(15²/150); 143 + 2 × √(15²/150)) = (140.55; 145.45)

What would we conclude based on these CIs?

73 / 109
Google Search Algorithm: The WRONG way
(WARNING: This slide is to highlight what goes wrong if you do it the WRONG way!)

Let’s compute all three CIs for Google’s algorithms separately:

current:
      (.702 − 2 × √(.702(1 − .702)/2500); .702 + 2 × √(.702(1 − .702)/2500)) = (0.683; 0.720)

mod (A):
      (.740 − 2 × √(.740(1 − .740)/2500); .740 + 2 × √(.740(1 − .740)/2500)) = (0.723; 0.758)

mod (B):
      (.704 − 2 × √(.704(1 − .704)/2500); .704 + 2 × √(.704(1 − .704)/2500)) = (0.686; 0.722)
74 / 109
What’s the difference?

Now let’s take a close look:
- In the fund comparison:
  - Fund A: (144; 156) / Fund B: (140.55; 145.45) → not enough evidence
  - Difference: (0.519; 13.48) → difference!
- In the Google example (Current vs Mod A):
  - Current: (0.683; 0.720) / Mod A: (0.723; 0.758) → difference!
  - Difference: (−0.0634; −0.0126) → difference!
- In the Google example (Current vs Mod B):
  - Current: (0.683; 0.720) / Mod B: (0.686; 0.722) → not enough evidence
  - Difference: (−0.0278; 0.0238) → not enough evidence

75 / 109
What’s the difference?

Why is there a difference?

We are estimating / comparing different things!
- In the first approach, we are making a comparison by directly
  estimating two quantities, µA and µB
- In the second approach, we are making a comparison by
  directly estimating just one quantity, µA − µB
- Estimating one thing is easier than estimating two things

Example: You can tell that a cat is not a dog without accurately
estimating their height/weight/breed/etc.

76 / 109
What’s the difference?

Are these pictures the same, or different?

77 / 109
What’s the difference?

Are these pictures the same, or different?

78 / 109
The Bottom Line...

- Estimates are based on random samples and are therefore random
  (uncertain) themselves
- We need to account for this uncertainty!
- The “Standard Error” measures the uncertainty of an estimate
- We define the “95% Confidence Interval” as

      estimate ± 2 × s.e.

- This provides us with a plausible range for the quantity we are
  trying to estimate.
- There is less uncertainty in estimating one thing vs.
  estimating multiple things
79 / 109
The Bottom Line...

- When estimating a mean, the 95% C.I. is

      X̄ ± 2 × sX̄

- When estimating a proportion, the 95% C.I. is

      p̂ ± 2 × sp̂

- The same idea applies when comparing means or proportions

80 / 109
Hypothesis testing

81 / 109
Testing
Suppose we want to evaluate some unknown property of a
parameter.
- Is the average difference in returns larger than 0?
- Is the proportion of voters for candidate A larger than the
  proportion for candidate B?
- Will my return on investment (ROI) be positive?
- Is a bank lending money fairly?

All of these problems are examples of the following general problem:

    Given some unknown parameter θ, test whether or not θ
    is equal to (larger than, smaller than) some value.

θ could be µ, p, σ², ...
82 / 109
Examples of tests

This is called hypothesis testing. Formally, we test the null
hypothesis:

      H0: µ = µ0

vs. the alternative

      H1: µ ≠ µ0.

Other examples:

      H0: p = p0    vs.  H1: p ≠ p0
      H0: σ² = 0    vs.  H1: σ² > 0
      H0: p1 = p2   vs.  H1: p1 > p2
83 / 109
Testing (means)
Let’s start with means:

      H0: µ = µ0   vs.  H1: µ ≠ µ0

2 ways to think about testing:

1. Building a test statistic... the t-statistic,

      t = (X̄ − µ0) / sX̄

This quantity measures how many standard deviations the
estimate (X̄) is from the proposed value (µ0).

If the absolute value of t is greater than 2, we need to worry
(why?)... we reject the hypothesis.
84 / 109
Testing (means)

2. Looking at the confidence interval. If the proposed value is
outside the confidence interval, you reject the hypothesis.

Notice that this is equivalent to the t-stat: an absolute value
of t greater than 2 implies that the proposed value is outside
the confidence interval... therefore, reject.

This is my preferred approach to the testing problem. You
can’t go wrong by using the confidence interval!

85 / 109
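
A sketch of both approaches on the SAP data from earlier, testing H0: µ = 0.15 (the numbers are the ones used above):

```python
import numpy as np

x_bar, s2, n = 0.1263, 0.065, 81  # SAP example
mu_0 = 0.15                       # proposed value under H0

se = np.sqrt(s2 / n)

# Approach 1: the t-statistic
t = (x_bar - mu_0) / se
print(t)  # about -0.84: |t| < 2, so we do not reject H0

# Approach 2: the confidence interval
ci = (x_bar - 2 * se, x_bar + 2 * se)
print(mu_0, ci)  # 0.15 falls inside (0.069, 0.183): same conclusion
```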
Testing (Proportions)

- The same idea applies to proportions... we can compute the
  t-stat for testing the hypothesis that the true proportion equals p0:

      t = (p̂ − p0) / sp̂

  Again, if the absolute value of t is greater than 2,
  we reject the hypothesis.

- As always, the confidence interval provides you with the same
  (and more!) information.

(Note: In the proportion case, this test is sometimes called a z-test)
86 / 109
Testing (Differences)

- For testing the difference in means:

      t = ((X̄a − X̄b) − d0) / s(X̄a−X̄b)

- For testing a difference in proportions:

      t = ((p̂a − p̂b) − d0) / s(p̂a−p̂b)

In both cases, d0 is the proposed value for the difference (we
often think of zero here... why?)

Again, if the absolute value of t is greater than 2,
we reject the hypothesis.
87 / 109
Significance

When |t| > 2, it is common to say that the result is statistically
significant.

Why 2? Why not |t| > 3 or |t| > 1? (Hint: It’s arbitrary.)

We’ll answer this in Section 3.

88 / 109
Example: Google modifications
Back to the Google example: Although Mod A significantly outperformed
the current algorithm (why?), execs will not sign off on the mod unless it
is at least 1% better than the current algorithm. Can you convince them
to switch?

      H0: pcurrent − pA = 0        [⇐⇒ pA = pcurrent]
      H1: pcurrent − pA ≤ −0.01    [⇐⇒ pA ≥ pcurrent + 0.01]

The 95% CI was (−0.0634, −0.0126). Since this interval only contains
values smaller than −0.01, there is significant evidence that Mod A will
yield a > 1% improvement. (Be careful with the signs here!)

(By the way, the t-stat is:

      t = ((p̂current − p̂A) − d0) / s(p̂current−p̂A)
        = ((.702 − .74) − (−0.01)) / 0.0127 = −2.205.)
89 / 109
Example: Political polling
During the 2021 Boston mayoral race, a poll showed Michelle Wu leading
challenger Annissa Essaibi George 53.5%–46.5% based on a random
sample of n = 645 voters. Based on this data, can you conclude with
confidence that Wu is in fact leading Essaibi George? Test this
hypothesis with a 95% CI.

Let pW denote the true proportion voting for Wu. We want to test:

      H0: pW = 0.5
      H1: pW > 0.5

The 95% CI for the proportion pW is:

      p̂ ± 2√(p̂(1 − p̂)/n) = 0.535 ± 2√(0.535(1 − 0.535)/645)
                          = 0.535 ± 2(0.0196) = (0.496, 0.574).

What do we conclude?
90 / 109
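
A sketch of this CI in Python:

```python
import numpy as np

p_hat, n = 0.535, 645
se = np.sqrt(p_hat * (1 - p_hat) / n)  # about 0.0196

ci = (p_hat - 2 * se, p_hat + 2 * se)
print(ci)  # roughly (0.496, 0.574)
# 0.5 is inside the interval: a tie cannot be ruled out,
# so this poll alone does not establish that Wu is leading
```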
The hard part

How do we set up a hypothesis testing problem?
- What is the property we want to test?
- What is the parameter?
- What is the hypothesis?
- How to decide null vs. alternative hypothesis?
- What is the right statistic?

As usual, the hardest part is not the calculations: It’s knowing
what to do in the first place!

92 / 109
Intuition behind testing

93 / 109
Intuition behind testing

Let’s go back to some intuition.

What hypothesis testing is NOT about: Proving or disproving with
certainty the truth of something.

What hypothesis testing IS about: Comparing to some baseline and
asking, are we outperforming this baseline?

94 / 109
Example: Psychic powers

How about this: I’ll try to prove to you that I can read your mind.
Think of a number between 1-10 and write it down on a piece of
paper (or your hand).

OK, I’ll admit one thing: My psychic skills aren’t perfect. But I
swear I’m a psychic.

I won’t get 100% right. What is the smallest number of right


answers I can give without looking like a complete fool?

96 / 109
Example: Psychic powers

At the very, very least, I should be able to beat random guessing.
Is that fair?
- This does not prove that I’m psychic, but it would give some
  supporting evidence to my claim
- But if I can’t beat random guessing, why would you pay me $100?

If I guess randomly, then P(guess correctly) = 0.1 (why?). So, to
beat random guessing, I need to show that
P(guess correctly) > 0.1 (why?).

(NOTE: You can replace 0.1 with any baseline you find convincing. As an
exercise, try this!)

97 / 109
Example: Psychic powers

This is a classic statistics problem.
- The problem: Am I really psychic?
- The answer: Can I beat random guessing?
- The target parameter: The probability that I guess your
  number correctly.
- The data: A random sample of guesses.
- The statistic: The sample proportion.

This is also a hypothesis testing problem!

98 / 109
Example: Psychic powers
This is also a hypothesis testing problem!
- Let p = P(guess correctly) be the true parameter (we don’t
  know this!)
- We are interested in whether or not p > 0.1

The null hypothesis is always the baseline. It’s the status quo,
boring hypothesis. =⇒ H0: p = 0.1

The alternative hypothesis is beating the baseline. It’s where I
make money. =⇒ H1: p > 0.1

So:
      H0: p = 0.1
      H1: p > 0.1
99 / 109
How many guesses?

[Figure: three plots of the 95% CI lower bound vs. the number of correct
guesses, for n = 10, n = 100, and n = 1000. To push the lower bound above
the 0.1 baseline, I need at least 5/10, 18/100, or 121/1000 correct guesses.]
101 / 109
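
These thresholds can be reproduced with a short sketch: scan over the number of correct guesses k and find the smallest k whose 95% CI lower bound clears the 0.1 baseline (min_correct_guesses is just a name chosen here):

```python
import numpy as np

def min_correct_guesses(n, baseline=0.1):
    """Smallest k such that the 95% CI lower bound for p exceeds the baseline."""
    for k in range(n + 1):
        p_hat = k / n
        lower = p_hat - 2 * np.sqrt(p_hat * (1 - p_hat) / n)
        if lower > baseline:
            return k
    return None

for n in (10, 100, 1000):
    print(n, min_correct_guesses(n))  # 5/10, 18/100, 121/1000
```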
More examples

- Should Netflix have abandoned their Cinematch algorithm and
  used the Netflix prize solution? (Baseline: current
  performance of Cinematch)
- Should Facebook have added reactions? (Baseline: current
  metrics on user engagement)
- Should a political campaign increase spending on political
  ads? (Baseline: current performance in polls)
- Should the FDA approve a new drug? (Baseline: placebo)

These are just a few examples, and the baselines are also just
examples. Can you think of any other appropriate baselines for
these examples?
102 / 109
Back to an example: Google modifications

Although Mod A significantly outperformed the current algorithm
(why?), execs will not sign off on the mod unless it is at least 1% better
than the current algorithm. Can you convince them to switch?

- What is the property we want to test? Whether or not the
  modification outperforms the current algorithm by at least 1%
- What is the parameter? pcurrent − pA, the difference in performance
  between the algorithms
- How to decide null vs. alternative hypothesis? Baseline / status quo
  is a tie, i.e. pcurrent − pA = 0. So, H0: pcurrent − pA = 0.
- What is the right statistic? The sample difference in proportions
  p̂current − p̂A.

103 / 109
Is this even useful?
We’ve made a lot of assumptions so far...
- The data is perfectly normally distributed
- The data is perfectly independent
- The sampling is perfectly random and unbiased
- The sample size is sufficiently large

These assumptions allow us to derive explicit formulas for the
sampling distributions, standard errors, and test statistics.

Even under these ideal assumptions, the problem is hard!
- Sampling error
- Can’t say much FOR SURE
- Need to use approximations
104 / 109
Is this even useful?
Unfortunately, the real world is not so kind...

Thought experiment: Imagine the real world
- The true distribution is unknown and very complicated
- The data is dependent (e.g. network effects)
- The sampling is not random and heavily biased (e.g.
  convenience samples)
- The sample size is limited

Does this make the problem easier or harder?

In the real world, we have to deal with “perfect world” problems like
sampling error and approximations in addition to “real world”
problems like dependence, limited samples, etc.
105 / 109
The Importance of Considering and Reporting
Uncertainty

In 1997 the Red River flooded Grand Forks, ND, overflowing its
levees with a 54-foot crest. 75% of the homes in the city were
damaged or destroyed!

It was predicted that the rain and the spring melt would lead to a
49-foot crest of the river. The levees were 51 feet high.

The Water Services of North Dakota had explicitly avoided
communicating the uncertainty in their forecasts, as they were
afraid the public would lose confidence in their abilities to predict
such events.

107 / 109
The Importance of Considering and Reporting
Uncertainty

It turns out the prediction interval for the flood was 49ft ± 9ft,
meaning the 51ft levees could overflow!

Should we take the point prediction (49ft) or the interval as an
input for a decision problem?

In general, the distribution of potential outcomes is very relevant
to helping us make a decision.

108 / 109
The Importance of Considering and Reporting
Uncertainty

The answer seems obvious in this example (and it is!)... however,
you see these things happening all the time, as people tend to
underplay uncertainty in many situations!

    “Why do people not give intervals? Because they are embarrassed!”
    —Jan Hatzius, Goldman Sachs economist, talking about economic forecasts

Don’t make this mistake! Intervals are your friend and will lead to
better decisions!

(Remember the weather apps and rain probabilities from Section 1?)

109 / 109
