
Section 2: Estimation, Confidence Intervals and Hypothesis Testing

Bryon Aragam
Chicago Booth
booth41000@gmail.com
https://canvas.uchicago.edu/courses/43775/

Suggested Readings:
Naked Statistics, Chapters 7, 8, 9 and 10
OpenIntro Statistics, Chapters 5, 6, and 7

Last Updated: October 9, 2022

1 / 109
Example

Two different polling firms ran two different polls to determine the
winner of the 2020 US presidential election.

Both firms are objective, fair, and willing to spare no expense to
ensure their samples are perfectly random and unbiased.

- Firm A: Biden 51%, Trump 49% (n = 1000)
- Firm B: Biden 49%, Trump 51% (n = 1000)

Could Firm B have possibly reported Biden 12%, Trump 88%?

Who is right? Is either one wrong?

Why didn’t they get the same answer?

2 / 109
Flashback

Even in a perfect world where we wouldn’t have to worry about
bias, manipulation, privacy, security, etc., we must still deal with
random sampling.

This section is all about conceptualizing and managing sampling
error.

The key concept is that of a sampling distribution.

3 / 109
The statistical process

4 / 109
The statistical process

Let’s try to make this more formal. Simplifying a bit, let’s pretend
like the winner of the election goes to the popular vote.

- The problem: Who will win the election?
- The answer: Who has the highest share of the vote?
- The target parameter: True proportion of voters for each candidate.
- The data: A random sample of the population.
- The statistic: The sample proportion.

Problem −→ statistics.

5 / 109
Two key concepts

In going from problem −→ statistics, we encounter two key concepts:

- The target parameter
- The statistic

These are crucial abstractions that make talking about statistical
analyses easier.

Every statistical problem consists of a target parameter and a
statistic.

6 / 109
What is a parameter?
Intuitively, a parameter is any quantity whose value we do not
know, and the target parameter is any quantity whose value we do
not know but we want to know. (Why do we want to know?
Think about the statistical process.)

Examples:
- True proportion of voters for each candidate
- Future value of an investment
- Average price of competitors’ products
- Proportion of Americans with COVID-19 (or vaccinated, or tested, etc.)

If you know the value of the target parameter, there is nothing
more to be done. Game over.
7 / 109
What is a statistic?
A statistic is a summary (formally: a function) of a
dataset—typically used to approximate the value of a target
parameter.

Examples:
- Sample mean
- Sample proportion
- Maximum or minimum
- Range (max − min)
- Lots of others...

Emphasis: A statistic is any summary of a dataset. Some statistics
are useful, some are not so useful, and some are downright
misleading!
8 / 109
Statistics are random variables

Random sampling =⇒ data is random.

Even though you only ever see one dataset, the process that
generated it (i.e. random sampling) was random, so your data is
always random.

So, statistics are random “things” (aka random variables)!

- Data is random
- A statistic is a summary of (random) data =⇒ your statistic is a
  random variable
9 / 109
Back to the example

The data:
- Firm A: Biden 51%, Trump 49% (n = 1000)
- Firm B: Biden 49%, Trump 51% (n = 1000)

Why didn’t they get the same answer?

Random sampling =⇒ different, but equally valid, samples

Who is right? Is either one wrong?

Both firms are objective and each sample is perfectly random, so
both answers are equally “correct.”

Could Firm B have possibly reported Trump 12%, Biden 88%?

Yes. It is possible, though unlikely, that a random sample of 1000
voters was skewed like this.
10 / 109
tl;dr

This is the most important lesson from this section.

Your data is random, so every step of a data analysis is random
and involves uncertainty that must be quantified.

Since a statistic is a random variable, it has a distribution.

The distribution of a statistic is called its sampling distribution,
and the sampling distribution is how we’ll quantify this uncertainty.

11 / 109
Moving forward

Understanding and computing sampling distributions is hard.

The rest of this section is about computing these distributions and
using them to quantify uncertainty.

We will focus on two canonical examples:

- Estimating a sample proportion (e.g. what proportion of voters
  will vote for Candidate A?)
- Estimating a sample mean (e.g. what is the average height of
  this class?)

The ideas apply to any statistic, any dataset, and any assumptions
on the data (i.e. normality is NOT necessary).

12 / 109
Estimating proportions and probabilities

13 / 109
A First Modeling Exercise: Polling

In a perfect world, Taylor “Tay Tay” Swift would run against
Dwayne “The Rock” Johnson for president. Let’s pretend like
we’re in this perfect world.

So, Tay Tay is running against The Rock for president. Out of a
random sample of n = 1000 voters, 511 voters intend to vote for
Tay Tay, and 489 intend to vote for The Rock.

Simplifying a bit, assume the winner of the election will go to the
popular vote. Who is going to win the election?

14 / 109
Recall: The statistical process

- The problem: Who will win the election?
- The answer: Who has the highest share of the vote?
- The target parameter: True proportion of voters for each candidate.
- The data: A random sample of the population.
- The statistic: The sample proportion.

How do we model this scenario? How much uncertainty is there in
our statistic?

15 / 109
Polling - Model

- Each sample corresponds to a randomly selected voter, and the
  observation is who they intend to vote for
- As a first modeling decision, let’s call the random variable
  associated with a voter’s decision X, which takes one of two
  values: X = 0 (The Rock) or X = 1 (Tay Tay)
- We say that the votes X are independent and identically
  distributed as

      X ∼ Ber(p).

- Question: What is a reasonable value for p?
- We can estimate this from the data (n = 1000 observations)...

16 / 109
Polling - Sample size

Once and for all, let n denote the sample size.

We will keep this notation throughout the rest of the class, and
often use it without mentioning it specifically.

n will ALWAYS refer to the sample size.

How big is big enough? We’ll discuss this a little more later in this
section. It turns out even simple questions like this have
complicated answers. (There is a short discussion of this in the
book: See Sections 5.1.3-5.1.5, 5.3.6, 6.1.5.)

17 / 109
Polling - Statistic

A reasonable statistic in this scenario is the sample proportion:

      p̂ = (# of 1s) / n.

In this case, p̂ corresponds to the proportion of voters in favour of
Tay Tay and p̂ = 511/1000 = 0.511.

(The assignment to 0 and 1 here is irrelevant: We could have
defined X = 1 to mean “voting for The Rock”, and everything
would still work with p̂ = 0.489. Just be careful about interpreting
things!)

18 / 109
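
As an aside, here is a minimal sketch in Python of this polling setup. The “true” p and the seed are made up purely to simulate data; in real polling we only ever observe the sample, never p.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

p_true = 0.52  # hypothetical true proportion, used only for simulation
n = 1000

# One random sample of n voters: 1 = Tay Tay, 0 = The Rock
x = rng.binomial(1, p_true, size=n)

# The statistic: the sample proportion, (# of 1s) / n
p_hat = x.mean()
print(p_hat)  # varies from sample to sample, e.g. 0.511
```

Re-running the last three lines gives a different p̂ each time: the statistic really is a random variable.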
Estimating means and variances

19 / 109
Another Modeling Exercise

- I have been tracking the purchase behaviour of my customers,
  so I know how much they spend at my business.
- In order to keep up with inventory, consider new product
  offerings, and generally to evaluate the health of my business,
  I’d like to predict how much my customers will spend in the
  future.
- What should I do?

20 / 109
Recall: The statistical process

- The problem: How can I predict my customers’ purchase behaviour?
- The answer: What is the distribution of purchase amounts?
- The target parameters: True mean and variance of purchase amounts.
- The data: A database of consumer purchases.
- The statistics: The sample mean and variance.

In this example, the target parameters are not a perfect substitute
for the answer we want, but they will get us pretty far. Let’s start
here.

Where is the randomness? Can’t predict the future.


21 / 109
Purchase Behaviour - Data

[Figure: histogram of purchase amounts ($), roughly 75–105, with density on the vertical axis]

22 / 109
Purchase Behaviour - Model

- Each observation corresponds to a single purchase
- As a modeling decision, let’s call the random variable
  associated with each purchase X and assume that purchases
  can be modeled as a normal distribution.
- As before, let’s assume the purchases X are independent and
  identically distributed, except this time a discrete distribution
  is not appropriate (Why?)
- Let’s model X as a continuous distribution, so

      X ∼ N(µ, σ²).

- Question: What are reasonable values of µ and σ²?
- We can estimate them from the data (n = 113 observations)...
23 / 109
Purchase behaviour - Model

- Let’s assume that each observation in the random sample
  {x1, x2, x3, ..., xn} is independent and distributed according to
  the model above, i.e., xi ∼ N(µ, σ²)
- To estimate µ and σ², reasonable choices are the sample
  mean (X̄) and the sample variance (s²)... (their sample
  counterparts)

      X̄ = (1/n) Σᵢ xᵢ,    s² = (1/(n−1)) Σᵢ (xᵢ − X̄)²    (sums over i = 1, ..., n)

- Sometimes written µ̂ and σ̂² (ˆ = “estimate of”)

24 / 109
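
A quick sketch of these two estimators in Python (the purchase amounts here are made up; note the n − 1 in the sample variance):

```python
import numpy as np

x = np.array([88.1, 92.3, 85.7, 90.4, 87.9])  # made-up purchase amounts ($)
n = len(x)

x_bar = x.sum() / n                      # sample mean
s2 = ((x - x_bar) ** 2).sum() / (n - 1)  # sample variance, n-1 denominator

# numpy's built-ins agree (ddof=1 gives the n-1 denominator):
assert np.isclose(x_bar, x.mean())
assert np.isclose(s2, x.var(ddof=1))
```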
Sample vs population quantities
    Sample                           Population
    p̂ = (# of 1s)/n                 p = P(X = 1)
    X̄ = (1/n) Σᵢ xᵢ                 E(X) = Σᵢ P(X = xᵢ) × xᵢ
    s² = (1/(n−1)) Σᵢ (xᵢ − X̄)²     var(X) = Σᵢ P(X = xᵢ) × [xᵢ − E(X)]²

- p, E(X), var(X) depend on P(X = xᵢ), which are unknown
- p̂, X̄, s² are approximations to P(X = 1), E(X), var(X) that
  depend only on known quantities (i.e. the data)
- (Beyond the scope of 41000): These approximations can be
  shown to be optimal*—they cannot be improved in terms of
  prediction
- *Under certain assumptions
- The catch: Outliers, fairness, privacy, RAM, etc.
25 / 109
Purchase behaviour - Model checking

For the purchasing data, X̄ = 88.64 and s² = 25.43.

[Figure: histogram of the purchase data with the fitted normal density overlaid in red]

- The red line represents our “model”, i.e., the normal
  distribution with mean and variance given by the estimated
  quantities X̄ and s².

26 / 109
Notation

If we want to wax poetic about general parameters and statistics
(i.e. without specifying exactly which ones we care about), it is
common to use another Greek letter: θ
- Parameter: θ
- Statistic: θ̂

The “hat” ˆ is universal stats lingo for “statistic” or “estimate”.

This is just a placeholder: Examples of “θ” we’ve considered
already are p, µ, σ², and examples of “θ̂” are p̂, X̄, s², etc.

27 / 109
Knowns vs unknowns

In real life:
- We do not know the actual distribution of X
- We do not know the actual mean and variance of X
- Different observations of X are dependent

Seems hopeless!

Models to the rescue:
- Replace the actual distribution with a tractable distribution
  (the normal distribution)
- Estimate the “best” possible values for µ and σ²
- Hope that observations are “close” to independent

28 / 109
Knowns vs unknowns

Sometimes this works reasonably well:
- Financial data
- Demographic data
- Scientific data

Sometimes it fails catastrophically:
- Computer vision (images)
- Natural language (text)
- Speech recognition (audio)

Always be asking: Is your model a good model?

29 / 109
Building Portfolios

Assume I invest some money in the U.S. stock market. What are
some questions I might be interested in?
- What is my expected one year return? Expected value
- How volatile is my portfolio? Risk / variance

As an example, consider 3 investment opportunities:
1. IBM stocks
2. ALCOA stocks
3. Treasury Bonds (T-bill)

How should we start thinking about these questions?

30 / 109
The statistical process, again

You know the drill by now:
- The problem: What is my expected one year return?
- The answer: What is the average return historically?
- The target parameter: The expected value of the investment.
- The data: Historical returns data.
- The statistic: The sample mean.

Can you do the same for volatility?

31 / 109
Building Portfolios

To understand returns and volatility, we need to estimate expected
values and variances. Let’s write some things down:
- µI = expected value of IBM, σI = standard deviation of IBM
- Similarly, we have µA and σA for ALCOA, and µT and σT for T-bills

What about the T-bill?
- The return on the T-bill is fixed at 3%
- =⇒ µT = 3, σT = 0 (Why?)

After observing some return data we can come up with estimates
for the means and variances describing the behavior of these stocks.
32 / 109
Building Portfolios
            IBM (%)       ALCOA (%)     T-bill (%)
    mean    µ̂I = 12.5     µ̂A = 14.9    µT = 3
    s.d.    σ̂I = 10.5     σ̂A = 14.0    σT = 0

    corr(IBM, ALCOA) = 0.33

[Figure: scatter plot of ALCOA vs. IBM returns (%), roughly −20 to 40 on each axis]

- Why µ̂I and µ̂A, but µT (i.e. no ˆ on the T-bill)?
- What is cov(IBM, Tbill)? How about cov(ALCOA, Tbill)?
- What are the units here? Could have been %, or $, or
  decimals... ALWAYS STATE YOUR UNITS.
33 / 109
Building Portfolios

- What if we combine these options? Is that a good idea?
- I don’t like “conventional wisdom”: Let’s actually find out!
- What happens if I place half of my money in ALCOA and the
  other half in T-bills?
- Remember that:

      E(aX + bY) = aE(X) + bE(Y)
      var(aX + bY) = a²var(X) + b²var(Y) + 2ab·cov(X, Y)

34 / 109
Building Portfolios

- So, by using what we know about the means and variances we
  get to:

      µ̂P = 0.5µ̂A + 0.5µT = 0.5(14.9) + 0.5(3) = 8.95
      σ̂P² = 0.5²·σ̂A² + 0.5²·σT² + 2·0.5·0.5·cov(ALCOA, Tbill)
          = 0.5²(14)² + 0.5²·0 + 2·0.5·0.5·0 = 49

- µ̂P and σ̂P² refer to the estimated mean and variance of our
  portfolio
- What are we assuming here?

35 / 109
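
The same calculation as a short Python sketch, using the estimates from the table (ALCOA and T-bill, in %):

```python
mu_A, sigma_A = 14.9, 14.0  # ALCOA estimates (%)
mu_T, sigma_T = 3.0, 0.0    # T-bill: fixed return, zero variance
w = 0.5                     # half in ALCOA, half in T-bills
cov_AT = 0.0                # T-bill is constant => zero covariance

# E(aX + bY) and var(aX + bY) with a = b = 0.5
mu_P = w * mu_A + (1 - w) * mu_T
var_P = w**2 * sigma_A**2 + (1 - w)**2 * sigma_T**2 + 2 * w * (1 - w) * cov_AT

print(mu_P)          # 8.95 (%)
print(var_P**0.5)    # 7.0 (%): half of ALCOA's volatility
```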
Sampling distributions and variation

37 / 109
Models, Parameters, Estimates...

In general we talk about unknown quantities using the language of
probability... and the following steps:
- Define the random variables of interest
- Define a model (or probability distribution) that describes the
  behavior of the RV of interest
- Based on the data available, we estimate the parameters
  defining the model
- We are now ready to describe possible scenarios, generate
  predictions, make decisions, evaluate risk, etc...

“All models are wrong, but some are useful.”

38 / 109
Oracle vs SAP Example (understanding variation)

Do we “buy” the claim from this ad?


39 / 109
Recall: The statistical process

- The problem: Is it true that SAP customers are 20% less
  profitable than their peers?
- The answer: What are the profit margins of SAP customers,
  and their peers?
- The target parameters: True mean profit margins for SAP
  customers and their peers.
- The data: A database of company profits.
- The statistics: The sample means.

To simplify, let’s assume that the industry ROE is known to be
15% (this is also an estimate but let’s assume it is accurate).

So we only need to worry about estimating ROE for SAP firms.

40 / 109
Oracle vs. SAP

- Do we “buy” the claim from this ad?
- We have a dataset of 81 firms that use SAP...
- The industry ROE is 15%
- We assume that the random variable X represents ROE of
  SAP firms and can be described by

      X ∼ N(µ, σ²)

                  X̄         s²
    SAP firms     0.1263     0.065

- Well, 0.12/0.15 ≈ 0.8! I guess the ad is correct, right?
- Not so fast...
41 / 109
Oracle vs. SAP

- Let’s assume the sample we have is a good representation of
  the “population” of firms that use SAP...
- What if we had observed a different sample of size 81?

42 / 109
Oracle vs. SAP

- Selecting a random sample, with replacement, from the original 81
  observations, I get a new X̄ = 0.09... I do it again, and I get
  X̄ = 0.155... and again X̄ = 0.132...

[Diagram: “The Bootstrap: why it works” — one data sample branching into many bootstrap samples]
43 / 109
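
A minimal sketch of this bootstrap in Python. The original 81 ROEs are not reproduced on the slides, so a stand-in sample with a similar mean and variance is simulated here just to illustrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for the 81 observed SAP ROEs (the real data is not shown here)
roe = rng.normal(0.1263, np.sqrt(0.065), size=81)

B = 1000
boot_means = np.empty(B)
for b in range(B):
    # resample 81 observations WITH replacement from the original sample
    boot_means[b] = rng.choice(roe, size=81, replace=True).mean()

# The spread of boot_means approximates the sampling variation of X-bar
print(boot_means.mean(), boot_means.std())
```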
Oracle vs. SAP

- After doing this 1000 times... here’s the histogram of X̄...
  Now, what do you think about the ad?

[Figure: histogram of the 1000 bootstrap sample means (density), centered near 0.1263, range ~0.05–0.20]
44 / 109
Sampling distributions

Things are about to get a little bit meta.

- If we sample X 100 times, we get a list of random numbers
  {x1, ..., x100} and their mean x̄1 = (1/n) Σᵢ xᵢ
- If we do this a second time, we get a second list of random
  numbers and a second mean x̄2
- x̄1 is different from x̄2!
- If we do this N times, we get N different lists of numbers, and
  N corresponding means x̄1, ..., x̄N
- All the sample means x̄i are different!
- The sample mean is itself a random variable!
- So it has a distribution, called the sampling distribution

45 / 109
Sampling Distribution of Sample Mean

Consider the sample mean for an iid sample of n observations
{x1, ..., xn}.

If X ∼ N(µ, σ²), then

      X̄ ∼ N(µ, σ²/n).

This is the sampling distribution of the sample mean.

46 / 109
Sampling Distribution of Sample Mean

- The sampling distribution of X̄ describes how our estimate would
  vary over different datasets of the same size n
- It provides us with a vehicle to evaluate the uncertainty associated
  with our estimate of the mean...
- It turns out that s² is a good proxy for σ², so we can
  approximate the sampling distribution by

      X̄ ∼ N(µ, s²/n)

- We call sd(X̄) the standard error of X̄... it is a measure of its
  variability... I like the notation

      sX̄ = sd(X̄) = √(s²/n)
47 / 109
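
For the SAP data this standard error is easy to check (a sketch, using s² = 0.065 and n = 81 from before):

```python
import numpy as np

s2, n = 0.065, 81
se = np.sqrt(s2 / n)  # standard error of the sample mean
print(se)             # about 0.028; compare the spread of the bootstrap histogram
```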
Sampling Distribution of Sample Mean

      X̄ ∼ N(µ, sX̄²)

- X̄ is unbiased... E(X̄) = µ. On average, X̄ is right!
- X̄ is consistent... as n grows, sX̄² → 0 =⇒ X̄ → µ
- With enough information (i.e. samples / data), eventually X̄
  correctly estimates µ

If your data is non-normal, this still remains approximately true:
- X̄ is still unbiased and consistent: E(X̄) = µ and sX̄² → 0
- The sampling distribution is no longer exactly normal but:

  With enough samples, X̄ is approximately normally distributed.

48 / 109
You can’t handle the truth

- Strictly speaking, it is not true that X̄ ∼ N(µ, sX̄²)—the truth
  is X̄ ∼ N(µ, σ²/n) (notice the difference?)
- Since sX̄² ≈ σ²/n, the distribution is “close enough” to N(µ, sX̄²)
  for practical purposes, when n is large
49 / 109
You can’t handle the truth

Key takeaway:

How and when is it appropriate to simplify?
Independence, normality, approximations, etc.

- Data is complicated
- Some data is so complicated that you really need to worry
  about the little details (research)
- But, in many cases, simplifying assumptions are OK and we
  can still make good decisions

50 / 109
Activity: Polling

http://rocknpoll.graphics/

51 / 109
Back to the Oracle vs. SAP example
Back to our simulation...

[Figure: the bootstrap histogram of sample means again, now with the (normal) sampling distribution overlaid, centered near 0.1263]
52 / 109
Confidence intervals

54 / 109
Confidence Intervals

      X̄ ∼ N(µ, sX̄²)

so...

      (X̄ − µ) ∼ N(0, sX̄²)

right?
- What is a good prediction of µ? What is our best guess?
- How do we make mistakes? How far from µ can we be?
  95% of the time, within ±2 × sX̄
- [X̄ ± 2 × sX̄] gives a 95% range of plausible values for µ... this
  is called the 95% Confidence Interval (CI) for µ.
55 / 109
Confidence Intervals

More generally:
- If we want to estimate the parameter θ using θ̂,
- Then we first need to find the standard error sθ̂,
- And the 95% CI is given by

      [θ̂ − 2 × sθ̂, θ̂ + 2 × sθ̂]

For a 99% CI, replace 2 with 3: [θ̂ ± 3 × sθ̂].
For a 68% CI, replace 2 with 1: [θ̂ ± 1 × sθ̂].

56 / 109
Interpreting CIs
How do you interpret a confidence interval?

With 95% confidence, you can safely rule out any value that falls
outside the 95% CI.

Suppose the CI of µ is the interval (a, b).
- It is NOT TRUE that there is a “95% probability that µ is in (a, b)”.
- This common misconception leads to a misunderstanding of
  what a CI is and where the randomness comes in.
- Safest bet: Use the “ruling out” interpretation above.

The main point of CIs is that there is an entire range of values vs.
a single point. This is FUNDAMENTAL.
57 / 109
Oracle vs. SAP example... one more time

In this example, X̄ = 0.1263, s² = 0.065 and n = 81... therefore,
sX̄² = 0.065/81, so the 95% confidence interval for the ROE of SAP
firms is

      [X̄ − 2 × sX̄; X̄ + 2 × sX̄]
      = [0.1263 − 2 × √(0.065/81); 0.1263 + 2 × √(0.065/81)]
      = [0.069; 0.183]

- Is 0.15 a plausible value? What does that mean?

58 / 109
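
The same interval in Python (a sketch of the calculation above):

```python
import numpy as np

x_bar, s2, n = 0.1263, 0.065, 81
se = np.sqrt(s2 / n)

ci = (x_bar - 2 * se, x_bar + 2 * se)
print(ci)  # approximately (0.069, 0.183)
# 0.15 lies inside the interval, so it cannot be ruled out
```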
Back to the Oracle vs. SAP example
Back to our simulation...

[Figure: the histogram of sample means with the sampling distribution overlaid and the 95% CI endpoints marked at 0.069 and 0.183, centered at 0.1263]
59 / 109
Estimating Proportions...

We used the proportion of defects in our sample to estimate p, the
true, long-run, proportion of defects.

Could this estimate be wrong?

Let p̂ denote the sample proportion.

The standard error associated with the sample proportion as an
estimate of the true proportion is:

      sp̂ = √(p̂(1 − p̂)/n)

60 / 109
Estimating Proportions...

We estimate the true p by the observed sample proportion of 1’s, p̂.

The (approximate) 95% confidence interval for the true proportion is:

      p̂ ± 2 × sp̂.

61 / 109
Estimating Proportions... another modeling example

The Consumer Product Safety Commission (CPSC) mandates that
at most 1% of a particular kind of product is allowed to be
defective.

Your job is to test these products for a particular company. Each
time you test, it is defective or not. After testing 1000 products,
you find 7 defective products.

Can you provide convincing evidence to the CPSC that no more than
1% of your products are defective?

(Why isn’t 7/1000 = 0.7% < 1% already convincing evidence to
the CPSC?)

62 / 109
Estimating Proportions...

Defects:

In our defect example we had p̂ = .007 and n = 1000.

This gives

      sp̂ = √((.007)(.993)/1000) = 0.00264.

The confidence interval is 0.007 ± 0.00527 = (0.0017, 0.0123). Since
0.01 is in this interval, NO, you cannot make this claim to the
CPSC.

63 / 109
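
A sketch of the defect calculation in Python:

```python
import numpy as np

p_hat, n = 7 / 1000, 1000
se = np.sqrt(p_hat * (1 - p_hat) / n)  # about 0.0026

ci = (p_hat - 2 * se, p_hat + 2 * se)
print(ci)  # roughly (0.002, 0.012): 0.01 is inside, so the claim fails
```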
Polls: yet another example...

Suppose we take a relatively small random sample from a large
population and ask each respondent a yes/no question, coding yes
as Yi = 1 and no as Yi = 0, where p is the true population
proportion of “yes”.

Suppose, as is common, n = 1000, and p̂ ≈ .5.

Then,

      sp̂ = √((.5)(.5)/1000) = .0158.

The standard error is .0158, so the ± is .0316, or about ±3%.

(Does this sound familiar?!)

64 / 109
Difference in means

65 / 109
Example: Portfolio comparison

Say we are comparing the returns of two different funds based on
an initial investment of $125,000. To study this, we sample the
returns for 100 accounts from Fund A and 150 accounts from Fund
B from different banks around Chicago.

Here is a summary of the data:

              average value    std. deviation
    Fund A    150k             30k
    Fund B    143k             15k

What can we conclude?

66 / 109
Difference in Means
When comparing groups to detect differences, we can use a clever
hack.

Suppose there are two groups A and B with means µA and µB.

If we do not care about the particular values of µA and µB, and
instead only care about whether they are different, then build a
confidence interval for the difference µA − µB.

Estimating µA − µB (a single number) is easier than estimating µA
and µB (two numbers) separately.

      (µA − µB)ˆ  vs.  µ̂A − µ̂B

(estimate the difference directly vs. take the difference of two estimates)

67 / 109
Standard Error for the Difference in Means

µA − µB is called the difference in means.

We can compute the standard error for the difference in means:

      s(X̄a−X̄b) = √(sXa²/na + sXb²/nb)

or, for the difference in proportions:

      s(p̂a−p̂b) = √(p̂a(1 − p̂a)/na + p̂b(1 − p̂b)/nb)

68 / 109
Confidence Interval for the Difference in Means

We can then compute the confidence interval for the difference in
means:

      (X̄a − X̄b) ± 2 × s(X̄a−X̄b)

or, the confidence interval for the difference in proportions:

      (p̂a − p̂b) ± 2 × s(p̂a−p̂b)

69 / 109
Back to the example

              average value    std. deviation
    Fund A    150k             30k
    Fund B    143k             15k

In the fund comparison example:

      s(X̄A−X̄B) = √(30²/100 + 15²/150) = 3.24

so that the confidence interval for the difference in means is:

      (150 − 143) ± 2 × 3.24 = (0.519; 13.48)

What is the conclusion now?

70 / 109
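
A sketch of the fund comparison in Python (values in thousands of $):

```python
import numpy as np

xbar_a, s_a, n_a = 150, 30, 100  # Fund A
xbar_b, s_b, n_b = 143, 15, 150  # Fund B

se_diff = np.sqrt(s_a**2 / n_a + s_b**2 / n_b)  # about 3.24
diff = xbar_a - xbar_b

ci = (diff - 2 * se_diff, diff + 2 * se_diff)
print(ci)  # roughly (0.52, 13.48): 0 is excluded, so the funds differ
```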
Example: Google Search Algorithm

Google is testing a couple of modifications to its search
algorithms... they experiment with 2,500 searches per algorithm and
check how often the result was defined as a “success”. Here’s the
data from this experiment:

    Algorithm    Current    Mod (A)    Mod (B)
    success      1755       1850       1760
    failure      745        650        740

The probability of success is estimated to be p̂ = 0.702 for the
current algorithm, p̂A = 0.74 for modification (A), and p̂B = 0.704
for modification (B).

Are the modifications better?


71 / 109
Example: Google Search Algorithm

Let’s look at the difference between the current algorithm and the
modifications:

      s(p̂current−p̂A) = √(0.702·0.298/2500 + 0.74·0.26/2500) = 0.0127
      s(p̂current−p̂B) = √(0.702·0.298/2500 + 0.704·0.296/2500) = 0.0129

so that the CIs for the differences in proportions are:

      pcurrent − pA: (0.702 − 0.74) ± 2 × 0.0127 = (−0.0634, −0.0126)
      pcurrent − pB: (0.702 − 0.704) ± 2 × 0.0129 = (−0.0278, 0.0238)

What do you conclude?


72 / 109
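
The same two intervals via a small helper function (a sketch; diff_ci is just a name chosen here):

```python
import numpy as np

n = 2500
p_cur, p_a, p_b = 1755 / n, 1850 / n, 1760 / n

def diff_ci(p1, p2, n1, n2):
    """95% CI for the difference in proportions p1 - p2."""
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return (d - 2 * se, d + 2 * se)

print(diff_ci(p_cur, p_a, n, n))  # about (-0.063, -0.013): Mod A is better
print(diff_ci(p_cur, p_b, n, n))  # about (-0.028, 0.024): inconclusive
```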
Portfolio comparison: The WRONG way

(WARNING: This slide is to highlight what goes wrong if you do it the WRONG way!)

Let’s compute the confidence intervals for both portfolios
separately:

      Fund A: (150 − 2 × √(30²/100); 150 + 2 × √(30²/100)) = (144; 156)
      Fund B: (143 − 2 × √(15²/150); 143 + 2 × √(15²/150)) = (140.55; 145.45)

What would we conclude based on these CIs?

73 / 109
Google Search Algorithm: The WRONG way
(WARNING: This slide is to highlight what goes wrong if you do it the WRONG way!)

Let’s compute all three CIs for Google’s algorithms separately:

current:
      (.702 − 2 × √(.702(1 − .702)/2500); .702 + 2 × √(.702(1 − .702)/2500)) = (0.683; 0.720)

mod (A):
      (.740 − 2 × √(.740(1 − .740)/2500); .740 + 2 × √(.740(1 − .740)/2500)) = (0.723; 0.758)

mod (B):
      (.704 − 2 × √(.704(1 − .704)/2500); .704 + 2 × √(.704(1 − .704)/2500)) = (0.686; 0.722)
74 / 109
What’s the difference?

Now let’s take a close look:
- In the fund comparison:
  - Fund A: (144; 156) / Fund B: (140.55; 145.45) → not enough evidence
  - Difference: (0.519; 13.48) → difference!
- In the Google example (Current vs Mod A):
  - Current: (0.683; 0.720) / Mod A: (0.723; 0.758) → difference!
  - Difference: (−0.0634; −0.0126) → difference!
- In the Google example (Current vs Mod B):
  - Current: (0.683; 0.720) / Mod B: (0.686; 0.722) → not enough evidence
  - Difference: (−0.0278; 0.0238) → not enough evidence

75 / 109
What’s the difference?

Why is there a difference?

We are estimating / comparing different things!
- In the first approach, we are making a comparison by directly
  estimating two quantities, µA and µB
- In the second approach, we are making a comparison by
  directly estimating just one quantity, µA − µB
- Estimating one thing is easier than estimating two things

Example: You can tell that a cat is not a dog without accurately
estimating their height/weight/breed/etc.

76 / 109
What’s the difference?

Are these pictures the same, or different?

77 / 109
What’s the difference?

Are these pictures the same, or different?

78 / 109
The Bottom Line...

- Estimates are based on random samples and are therefore random
  (uncertain) themselves
- We need to account for this uncertainty!
- The “Standard Error” measures the uncertainty of an estimate
- We define the “95% Confidence Interval” as

      estimate ± 2 × s.e.

- This provides us with a plausible range for the quantity we are
  trying to estimate.
- There is less uncertainty in estimating one thing vs.
  estimating multiple things
79 / 109
The Bottom Line...

- When estimating a mean, the 95% C.I. is

      X̄ ± 2 × sX̄

- When estimating a proportion, the 95% C.I. is

      p̂ ± 2 × sp̂

- The same idea applies when comparing means or proportions

80 / 109
Hypothesis testing

81 / 109
Testing
Suppose we want to evaluate some unknown property of a
parameter.
- Is the average difference in returns larger than 0?
- Is the proportion of voters for candidate A larger than the
  proportion for candidate B?
- Will my return on investment (ROI) be positive?
- Is a bank lending money fairly?

All of these problems are examples of the following general problem:

    Given some unknown parameter θ, test whether or not θ
    is equal to (larger than, smaller than) some value.

θ could be µ, p, σ², ...
82 / 109
Examples of tests

This is called hypothesis testing. Formally, we test the null
hypothesis:

      H0: µ = µ0

vs. the alternative

      H1: µ ≠ µ0.

Other examples:

      H0: p = p0    vs.  H1: p ≠ p0
      H0: σ² = 0    vs.  H1: σ² > 0
      H0: p1 = p2   vs.  H1: p1 > p2
83 / 109
Testing (means)
Let’s start with means:

      H0: µ = µ0   vs.  H1: µ ≠ µ0

2 ways to think about testing:

1. Building a test statistic... the t-statistic,

      t = (X̄ − µ0) / sX̄

This quantity measures how many standard deviations the
estimate (X̄) is from the proposed value (µ0).

If the absolute value of t is greater than 2, we need to worry
(why?)... we reject the hypothesis.
84 / 109
Testing (means)

2. Looking at the confidence interval. If the proposed value is
outside the confidence interval, you reject the hypothesis.

Notice that this is equivalent to the t-stat: an absolute value
of t greater than 2 implies that the proposed value is outside
the confidence interval... therefore, reject.

This is my preferred approach to the testing problem. You
can’t go wrong by using the confidence interval!

85 / 109
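
A sketch of both approaches on the SAP data from earlier, testing H0: µ = 0.15 (the numbers are the ones used above):

```python
import numpy as np

x_bar, s2, n = 0.1263, 0.065, 81  # SAP example
mu_0 = 0.15                       # proposed value under H0

se = np.sqrt(s2 / n)

# Approach 1: the t-statistic
t = (x_bar - mu_0) / se
print(t)  # about -0.84: |t| < 2, so we do not reject H0

# Approach 2: the confidence interval
ci = (x_bar - 2 * se, x_bar + 2 * se)
print(mu_0, ci)  # 0.15 falls inside (0.069, 0.183): same conclusion
```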
Testing (Proportions)

- The same idea applies to proportions... we can compute the
  t-stat for testing the hypothesis that the true proportion equals p0:

      t = (p̂ − p0) / sp̂

  Again, if the absolute value of t is greater than 2,
  we reject the hypothesis.

- As always, the confidence interval provides you with the same
  (and more!) information.

(Note: In the proportion case, this test is sometimes called a z-test)
86 / 109
Testing (Differences)

- For testing the difference in means:

      t = ((X̄a − X̄b) − d0) / s(X̄a−X̄b)

- For testing a difference in proportions:

      t = ((p̂a − p̂b) − d0) / s(p̂a−p̂b)

In both cases, d0 is the proposed value for the difference (we
often think of zero here... why?)

Again, if the absolute value of t is greater than 2,
we reject the hypothesis.
87 / 109
Significance

When |t| > 2, it is common to say that the result is statistically
significant.

Why 2? Why not |t| > 3 or |t| > 1? (Hint: It’s arbitrary.)

We’ll answer this in Section 3.

88 / 109
Example: Google modifications
Back to the Google example: Although Mod A significantly outperformed
the current algorithm (why?), execs will not sign off on the mod unless it
is at least 1% better than the current algorithm. Can you convince them
to switch?

      H0: pcurrent − pA = 0        [⇐⇒ pA = pcurrent]
      H1: pcurrent − pA ≤ −0.01    [⇐⇒ pA ≥ pcurrent + 0.01]

The 95% CI was (−0.0634, −0.0126). Since this interval only contains
values smaller than −0.01, there is significant evidence that Mod A will
yield a > 1% improvement. (Be careful with the signs here!)

(By the way, the t-stat is:

      t = ((p̂current − p̂A) − d0) / s(p̂current−p̂A)
        = ((.702 − .74) − (−0.01)) / 0.0127 = −2.205.)
89 / 109
Example: Political polling
During the 2021 Boston mayoral race, a poll showed Michelle Wu leading
challenger Annissa Essaibi George 53.5%–46.5% based on a random
sample of n = 645 voters. Based on this data, can you conclude with
confidence that Wu is in fact leading Essaibi George? Test this
hypothesis with a 95% CI.

Let pW denote the true proportion voting for Wu. We want to test:

      H0: pW = 0.5
      H1: pW > 0.5

The 95% CI for the proportion pW is:

      p̂ ± 2√(p̂(1 − p̂)/n) = 0.535 ± 2√(0.535(1 − 0.535)/645)
                          = 0.535 ± 2(0.0196) = (0.496, 0.574).

What do we conclude?
90 / 109
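
A sketch of this CI in Python:

```python
import numpy as np

p_hat, n = 0.535, 645
se = np.sqrt(p_hat * (1 - p_hat) / n)  # about 0.0196

ci = (p_hat - 2 * se, p_hat + 2 * se)
print(ci)  # roughly (0.496, 0.574)
# 0.5 is inside the interval: a tie cannot be ruled out,
# so this poll alone does not establish that Wu is leading
```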
The hard part

How do we set up a hypothesis testing problem?
- What is the property we want to test?
- What is the parameter?
- What is the hypothesis?
- How to decide null vs. alternative hypothesis?
- What is the right statistic?

As usual, the hardest part is not the calculations: It’s knowing
what to do in the first place!

92 / 109
Intuition behind testing

93 / 109
Intuition behind testing

Let’s go back to some intuition.

What hypothesis testing is NOT about: Proving or disproving with
certainty the truth of something.

What hypothesis testing IS about: Comparing to some baseline and
asking, are we outperforming this baseline?

94 / 109
Example: Psychic powers

How about this: I’ll try to prove to you that I can read your mind.
Think of a number between 1-10 and write it down on a piece of
paper (or your hand).

OK, I’ll admit one thing: My psychic skills aren’t perfect. But I
swear I’m a psychic.

I won’t get 100% right. What is the smallest number of right


answers I can give without looking like a complete fool?

96 / 109
Example: Psychic powers

At the very, very least, I should be able to beat random guessing.
Is that fair?
- This does not prove that I’m psychic, but it would give some
  supporting evidence to my claim
- But if I can’t beat random guessing, why would you pay me $100?

If I guess randomly, then P(guess correctly) = 0.1 (why?). So, to
beat random guessing, I need to show that
P(guess correctly) > 0.1 (why?).

(NOTE: You can replace 0.1 with any baseline you find convincing. As an
exercise, try this!)

97 / 109
Example: Psychic powers

This is a classic statistics problem.
- The problem: Am I really psychic?
- The answer: Can I beat random guessing?
- The target parameter: The probability that I guess your
  number correctly.
- The data: A random sample of guesses.
- The statistic: The sample proportion.

This is also a hypothesis testing problem!

98 / 109
Example: Psychic powers
This is also a hypothesis testing problem!
- Let p = P(guess correctly) be the true parameter (we don’t
  know this!)
- We are interested in whether or not p > 0.1

The null hypothesis is always the baseline. It’s the status quo,
boring hypothesis. =⇒ H0: p = 0.1

The alternative hypothesis is beating the baseline. It’s where I
make money. =⇒ H1: p > 0.1

So:
      H0: p = 0.1
      H1: p > 0.1
99 / 109
How many guesses?

[Figure: three plots of the 95% CI lower bound vs. the number of correct
guesses, for n = 10, n = 100, and n = 1000. To push the lower bound above
the 0.1 baseline, I need at least 5/10, 18/100, or 121/1000 correct guesses.]
101 / 109
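
These thresholds can be reproduced with a short sketch: scan over the number of correct guesses k and find the smallest k whose 95% CI lower bound clears the 0.1 baseline (min_correct_guesses is just a name chosen here):

```python
import numpy as np

def min_correct_guesses(n, baseline=0.1):
    """Smallest k such that the 95% CI lower bound for p exceeds the baseline."""
    for k in range(n + 1):
        p_hat = k / n
        lower = p_hat - 2 * np.sqrt(p_hat * (1 - p_hat) / n)
        if lower > baseline:
            return k
    return None

for n in (10, 100, 1000):
    print(n, min_correct_guesses(n))  # 5/10, 18/100, 121/1000
```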
More examples

- Should Netflix have abandoned their Cinematch algorithm and
  used the Netflix prize solution? (Baseline: current
  performance of Cinematch)
- Should Facebook have added reactions? (Baseline: current
  metrics on user engagement)
- Should a political campaign increase spending on political
  ads? (Baseline: current performance in polls)
- Should the FDA approve a new drug? (Baseline: placebo)

These are just a few examples, and the baselines are also just
examples. Can you think of any other appropriate baselines for
these examples?
102 / 109
Back to an example: Google modifications

Although Mod A significantly outperformed the current algorithm
(why?), execs will not sign off on the mod unless it is at least 1% better
than the current algorithm. Can you convince them to switch?

- What is the property we want to test? Whether or not the
  modification outperforms the current algorithm by at least 1%
- What is the parameter? pcurrent − pA, the difference in performance
  between the algorithms
- How to decide null vs. alternative hypothesis? Baseline / status quo
  is a tie, i.e. pcurrent − pA = 0. So, H0: pcurrent − pA = 0.
- What is the right statistic? The sample difference in proportions
  p̂current − p̂A.

103 / 109
Is this even useful?
We’ve made a lot of assumptions so far...
- The data is perfectly normally distributed
- The data is perfectly independent
- The sampling is perfectly random and unbiased
- The sample size is sufficiently large

These assumptions allow us to derive explicit formulas for the
sampling distributions, standard errors, and test statistics.

Even under these ideal assumptions, the problem is hard!
- Sampling error
- Can’t say much FOR SURE
- Need to use approximations
104 / 109
Is this even useful?
Unfortunately, the real world is not so kind...

Thought experiment: Imagine the real world
- The true distribution is unknown and very complicated
- The data is dependent (e.g. network effects)
- The sampling is not random and heavily biased (e.g.
  convenience samples)
- The sample size is limited

Does this make the problem easier or harder?

In the real world, we have to deal with “perfect world” problems like
sampling error and approximations in addition to “real world”
problems like dependence, limited samples, etc.
105 / 109
The Importance of Considering and Reporting
Uncertainty

In 1997 the Red River flooded Grand Forks, ND, overflowing its
levees with a 54-foot crest. 75% of the homes in the city were
damaged or destroyed!

It was predicted that the rain and the spring melt would lead to a
49-foot crest of the river. The levees were 51 feet high.

The Water Services of North Dakota had explicitly avoided
communicating the uncertainty in their forecasts, as they were
afraid the public would lose confidence in their abilities to predict
such events.

107 / 109
The Importance of Considering and Reporting
Uncertainty

It turns out the prediction interval for the flood was 49ft ± 9ft,
meaning the 51ft levees could overflow!

Should we take the point prediction (49ft) or the interval as an
input for a decision problem?

In general, the distribution of potential outcomes is very relevant
to helping us make a decision.

108 / 109
The Importance of Considering and Reporting
Uncertainty

The answer seems obvious in this example (and it is!)... however,
you see these things happening all the time, as people tend to
underplay uncertainty in many situations!

    “Why do people not give intervals? Because they are embarrassed!”
    —Jan Hatzius, Goldman Sachs economist, talking about economic forecasts

Don’t make this mistake! Intervals are your friend and will lead to
better decisions!

(Remember the weather apps and rain probabilities from Section 1?)

109 / 109
