Statistics MIT — 18.05 Introduction to Probability and Statistics, Spring 2014 (collected slides)

This document collects slide decks, with solutions, from 18.05 Introduction to Probability and Statistics (MIT, Spring 2014). Topics include counting and sets, probability basics, conditional probability and Bayes' theorem, discrete and continuous random variables, expectation and variance, the law of large numbers and central limit theorem, joint distributions, Bayesian updating, probability intervals, frequentist methods and NHST, confidence intervals, bootstrap methods, and linear regression.


Class 5 Slides with Solutions: Gallery of continuous variables, histograms 2

Class 1 Slides with Solutions: Introduction, Counting and Sets 14


Class 2 Slides with Solutions: Probability basics 37
Class 3 Slides with Solutions: Conditional Probability, Bayes' Theorem 66
Class 4 Slides with Solutions: Discrete Random Variables, Expectation 94
Class 5 Slides with Solutions: Variance, Continuous Random Variables 120
Class 6 Slides with Solutions: Expectation, Variance, Law of Large Numbers and Central Limit Theorem 146
Class 7 Slides with Solutions: Joint Distributions: Independence, Covariance and Correlation 177
Class 8 Slides with Solutions: Review for Exam 1 213
Class 10 Slides with Solutions: Introduction to Statistics; Maximum Likelihood Estimates 231
Class 11 Slides with Solutions: Bayesian Updating with Known Discrete Priors 254
Class 12 Slides with Solutions: Bayesian Updating: Probabilistic Prediction; Odds 276
Class 13 Slides with Solutions: Bayesian Updating: Continuous Prior, Discrete Data 302
Class 14 Slides with Solutions: Beta Distributions: Continuous Data 326
Class 15 Slides with Solutions: Conjugate Priors; Choosing Priors 352
Class 16 Slides with Solutions: Probability Intervals 372
Class 17 Slides with Solutions: Frequentist Methods; NHST 405
Class 18 Slides with Solutions: NHST II: Significance Level, Power, T-tests 430
Class 19 Slides with Solutions: NHST III: Gallery of Tests 457
Class 20 Slides with Solutions: Comparison of Bayesian and Frequentist Inference 479
Class 22 Slides with Solutions: Confidence Intervals for Normal Data 489
Class 23 Slides with Solutions: Confidence Intervals II 508
Class 24 Slides with Solutions: Bootstrap Confidence Intervals 526
Class 25 Slides with Solutions: Linear Regression 542
Studio 3

18.05 Spring 2014

Jeremy Orloff and Jonathan Bloom

[Figure: frequency histogram (bar heights 1–4) and density histogram (bar heights 0.1–0.4), both over bins from 0.5 to 4.5 on the x-axis]

Class 5 Slides with Solutions: Gallery of continuous variables, histograms 2


Concept questions

Suppose X is a continuous random variable.

a) What is P(a ≤ X ≤ a)?

b) What is P(X = 0)?

c) Does P(X = 2) = 0 mean X never equals 2?


answer: a) 0
b) 0

c) No. For a continuous distribution any single value has probability 0.

Only a range of values has non-zero probability.

MIT18_05S14_cl5cont_slides 3
July 13, 2014 2 / 11
Concept question

Which of the following are graphs of valid cumulative distribution


functions?

Add the numbers of the valid cdf’s and click that number.
answer: Test 2 and Test 3.

MIT18_05S14_cl5cont_slides 4
July 13, 2014 3 / 11
Solution

Test 1 is not a cdf: it takes negative values, but probabilities are positive.

Test 2 is a cdf: it increases from 0 to 1.

Test 3 is a cdf: it increases from 0 to 1.

Test 4 is not a cdf: it decreases. A cdf must be non-decreasing since it

represents accumulated probability.

MIT18_05S14_cl5cont_slides 5
July 13, 2014 4 / 11
Exponential Random Variables

Parameter: λ (called the rate parameter).


Range: [0, ∞).
Notation: exponential(λ) or exp(λ).
Density: f (x) = λe−λx for 0 ≤ x.
Models: Waiting time
[Figure: pdf f(x) = λe^{−λx} with the region P(3 < X < 7) shaded (left); cdf F(x) = 1 − e^{−x/10} (right); x from 0 to 16]

Continuous analogue of geometric distribution –memoryless!

MIT18_05S14_cl5cont_slides 6
July 13, 2014 5 / 11

Uniform and Normal Random Variables

Uniform: U(a, b) or uniform(a, b)


Range: [a, b]
PDF: f(x) = 1/(b − a)
Normal: N(µ, σ²)
Range: (−∞, ∞)
PDF: f(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}

http://mathlets.org/mathlets/probability-distributions/

MIT18_05S14_cl5cont_slides 7
July 13, 2014 6 / 11
Table questions

Open the applet


http://mathlets.org/mathlets/probability-distributions/

1. For the standard normal distribution N(0, 1) how much


probability is within 1 of the mean? Within 2? Within 3?

2. For N(0, 3²) how much probability is within σ of the mean?


Within 2σ? Within 3σ?

3. Does changing µ change your answer to problem 2?

MIT18_05S14_cl5cont_slides 8
July 13, 2014 7 / 11
Normal probabilities

within 1σ ≈ 68%
within 2σ ≈ 95%
within 3σ ≈ 99%

[Figure: normal pdf with the central 68%, 95%, and 99% regions marked between ±σ, ±2σ, and ±3σ on the z-axis]

Rules of thumb:
P(−1 ≤ Z ≤ 1) ≈ .68,

P(−2 ≤ Z ≤ 2) ≈ .95,

P(−3 ≤ Z ≤ 3) ≈ .997
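These rules of thumb are quick to check numerically in R, where pnorm is the standard normal cdf:

    pnorm(1) - pnorm(-1)   # P(-1 <= Z <= 1) = 0.6827
    pnorm(2) - pnorm(-2)   # P(-2 <= Z <= 2) = 0.9545
    pnorm(3) - pnorm(-3)   # P(-3 <= Z <= 3) = 0.9973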

MIT18_05S14_cl5cont_slides 9
July 13, 2014 8 / 11

Download R script

Download studio3.zip and unzip it into your 18.05 working directory.

Open studio3.r in RStudio.

MIT18_05S14_cl5cont_slides 10
July 13, 2014 9 / 11
Histograms

Will discuss in more detail in class 6.

Made by ‘binning’ data.

Frequency: height of bar over bin = # of data points in bin.

Density: area of bar over bin is proportional to # of data points in

bin. Total area of a density histogram is 1.

[Figure: frequency histogram (bar heights 1–4) and density histogram (bar heights 0.1–0.4), both over bins from 0.5 to 4.5 on the x-axis]
MIT18_05S14_cl5cont_slides 11
July 13, 2014 10 / 11
Histograms of averages of exp(1)

1. Generate a frequency histogram of 1000 samples from an exp(1)

random variable.

2. Generate a density histogram for the average of 2 independent

exp(1) random variable.

3. Using rexp(), matrix() and colMeans() generate a density

histogram for the average of 50 independent exp(1) random variables.

Make 10000 sample averages and use a binwidth of .1 for this.

Look at the spread of the histogram.

4. Superimpose a graph of the pdf of N(1, 1/50) on your plot in

problem 3. (Remember the second parameter in N is σ².)
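A minimal R sketch along these lines for problems 3 and 4 (the official solutions are in studio3.r; the variable names here are just illustrative):

    nsamples <- 10000
    n <- 50
    avgs <- colMeans(matrix(rexp(nsamples * n, rate = 1), nrow = n))  # 10000 sample averages
    hist(avgs, breaks = seq(0, ceiling(max(avgs) * 10) / 10, by = 0.1),
         freq = FALSE, main = "Averages of 50 exp(1) samples", xlab = "x")
    # Problem 4: superimpose the pdf of N(1, 1/50); dnorm takes the sd, not the variance
    xs <- seq(min(avgs), max(avgs), length.out = 200)
    lines(xs, dnorm(xs, mean = 1, sd = sqrt(1/50)), col = "red")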

Code for the solutions is at
https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/studio-resources/studio3.zip

MIT18_05S14_cl5cont_slides 12

July 13, 2014 11 / 11


MIT OpenCourseWare
http://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MIT18_05S14_cl5cont_slides 13
Welcome to 18.05

Introduction to Probability and Statistics

Spring 2014

Class 1 Slides with Solutions: Introduction, Counting and Sets 14


http://xkcd.com/904/

January 1, 2017 1 / 23
R

Free open source package.


Very easy to use and install.
Instructions and a link for this are on MITx/18.05r.

MIT18_05S14_class1_slides 15
January 1, 2017 2 / 23
Platonic Dice

MIT18_05S14_class1_slides 16
January 1, 2017 3 / 23
Probability vs. Statistics
Different subjects: both about random processes

Probability
Logically self-contained
A few rules for computing probabilities
One correct answer
Statistics
Messier and more of an art
Get experimental data and try to draw probabilistic
conclusions
No single correct answer
MIT18_05S14_class1_slides 17
January 1, 2017 4 / 23
Counting: Motivating Examples

What is the probability of getting exactly 1 heads in 3

tosses of a fair coin?

MIT18_05S14_class1_slides 18
January 1, 2017 5 / 23
Poker Hands
Deck of 52 cards
13 ranks: 2, 3, . . . , 9, 10, J, Q, K, A
4 suits: ♥, ♠, ♦, ♣,
Poker hands
Consists of 5 cards
A one-pair hand consists of two cards having one rank and the
remaining three cards having three other ranks
Example: {2♥, 2♠, 5♥, 8♣, K♦}
The probability of a one-pair hand is:
(1) less than 5%
(2) between 5% and 10%
(3) between 10% and 20%
(4) between 20% and 40%
(5) greater than 40%
MIT18_05S14_class1_slides 19
January 1, 2017 6 / 23
Sets in Words
Old New England rule: don’t eat clams (or any shellfish) in months
without an ’r’ in their name.
S = all months

L = the month has 31 days

R = the month has an ‘r’ in its name

S = {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec}

L = {Jan, Mar, May, Jul, Aug, Oct, Dec}

R = {Jan, Feb, Mar, Apr, Sep, Oct, Nov, Dec}

L ∩ R = {Jan, Mar, Oct, Dec} = months with 31 days and an ‘r’

MIT18_05S14_class1_slides 20
January 1, 2017 7 / 23
Visualize Set Operations with Venn Diagrams

S L R

L∪R L∩R Lc L−R

MIT18_05S14_class1_slides 21
January 1, 2017 8 / 23
Product of Sets

S × T = {(s, t) | s ∈ S, t ∈ T }

MIT18_05S14_class1_slides 22
January 1, 2017 9 / 23
Inclusion-Exclusion Principle

[Venn diagram: overlapping sets A and B with intersection A ∩ B]

MIT18_05S14_class1_slides 23
January 1, 2017 10 / 23
Board Question

A band consists of singers and guitar players.


7 people sing
4 play guitar
2 do both

How many people are in the band?

MIT18_05S14_class1_slides 24
January 1, 2017 11 / 23
Rule of Product

3 shirts, 4 pants = 12 outfits

(More powerful than it seems.)

MIT18_05S14_class1_slides 25
January 1, 2017 12 / 23
Concept Question: DNA
DNA is made of sequences of nucleotides: A, C, G, T.

How many DNA sequences of length 3 are there?

(i) 12 (ii) 24 (iii) 64 (iv) 81

answer: (iii) 4 × 4 × 4 = 64

How many DNA sequences of length 3 are there with no


repeats?

(i) 12 (ii) 24 (iii) 64 (iv) 81

answer: (ii) 4 × 3 × 2 = 24
MIT18_05S14_class1_slides 26
January 1, 2017 13 / 23
Board Question 1

There are 5 competitors in the 100m final.

How many ways can gold, silver, and bronze be awarded?

answer: 5 × 4 × 3.

There are 5 ways to pick the winner. Once the winner is chosen there are

4 ways to pick second place and then 3 ways to pick third place.

MIT18_05S14_class1_slides 27
January 1, 2017 14 / 23
Board Question 2

I won’t wear green and red together; I think black or


denim goes with anything; Here is my wardrobe.

Shirts: 3B, 3R, 2G; sweaters 1B, 2R, 1G; pants 2D,2B.

How many different outfits can I wear?

MIT18_05S14_class1_slides 28
January 1, 2017 15 / 23
Solution
answer: Suppose we choose shirts first. Depending on whether we choose
red compatible or green compatible shirts there are different numbers of
sweaters we can choose next. So we split the problem up before using the
rule of product. A multiplication tree is an easy way to present the answer.

Shirts:    R (3)      B (3)        G (2)
Sweaters:  R,B (3)    R,B,G (4)    B,G (2)
Pants:     B,D (4)    B,D (4)      B,D (4)

Multiplying down the paths of the tree:

Number of outfits = (3 × 3 × 4) + (3 × 4 × 4) + (2 × 2 × 4) = 100

MIT18_05S14_class1_slides 29
January 1, 2017 16 / 23
Permutations

Lining things up. How many ways can you do it?

‘abc’ and ‘cab’ are different permutations of {a, b, c}

MIT18_05S14_class1_slides 30
January 1, 2017 17 / 23
Permutations of k from a set of n

Give all permutations of 3 things out of {a, b, c, d}

abc abd acb acd adb adc


bac bad bca bcd bda bdc
cab cad cba cbd cda cdb
dab dac dba dbc dca dcb

Would you want to do this for 7 from a set of 10?

MIT18_05S14_class1_slides 31
January 1, 2017 18 / 23
Combinations

Choosing subsets – order doesn’t matter.


How many ways can you do it?

MIT18_05S14_class1_slides 32
January 1, 2017 19 / 23
Combinations of k from a set of n

Give all combinations of 3 things out of {a, b, c, d}

Answer: {a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}

MIT18_05S14_class1_slides 33
January 1, 2017 20 / 23
Permutations and Combinations

abc acb bac bca cab cba {a, b, c}


abd adb bad bda dab dba {a, b, d}
acd adc cad cda dac dca {a, c, d}
bcd bdc cbd cdb dbc dcb {b, c, d}
Permutations: 4P3          Combinations: (4 choose 3) = 4C3

(4 choose 3) = 4C3 = 4P3 / 3!
MIT18_05S14_class1_slides 34
January 1, 2017 21 / 23
Board Question

(a) Count the number of ways to get exactly 3 heads in


10 flips of a coin.

(b) For a fair coin, what is the probability of exactly 3


heads in 10 flips?
answer: (a) We have to ’choose’ 3 out of 10 flips for heads: (10 choose 3).

(b) There are 2^10 possible outcomes from 10 flips (this is the rule of
product). For a fair coin each outcome is equally probable so the
probability of exactly 3 heads is

(10 choose 3) / 2^10 = 120/1024 ≈ 0.117
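A quick check in R, either directly or with the binomial pmf:

    choose(10, 3) / 2^10               # 0.1171875
    dbinom(3, size = 10, prob = 0.5)   # same value from the binomial pmf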
MIT18_05S14_class1_slides 35
January 1, 2017 22 / 23
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class1_slides 36
Probability: Terminology and Examples
18.05 Spring 2014

Class 2 Slides with Solutions: Probability basics 37


January 1, 2017 1 / 29
Board Question
Deck of 52 cards
13 ranks: 2, 3, . . . , 9, 10, J, Q, K, A
4 suits: ♥, ♠, ♦, ♣,
Poker hands
Consists of 5 cards
A one-pair hand consists of two cards having one rank and the
remaining three cards having three other ranks

Example: {2♥, 2♠, 5♥, 8♣, K♦}

Question
(a) How many different 5 card hands have exactly one pair?
Hint: practice with how many 2 card hands have exactly one pair.
Hint for hint: use the rule of product.
(b) What is the probability of getting a one pair poker hand?
MIT18_05S14_class2_slides 38
January 1, 2017 2 / 29
Answer to board question
We can do this two ways as combinations or permutations. The keys are:
1. be consistent
2. break the problem into a sequence of actions and use the rule of

product.

Note, there are many ways to organize this. We will break it into very

small steps in order to make the process clear.

Combinations approach
a) Count the number of one-pair hands, where the order they are dealt
doesn’t matter.
Action 1. Choose the rank of the pair: 13 different ranks, choosing 1, so
(13 choose 1) ways to do this.
Action 2. Choose 2 cards from this rank: 4 cards in a rank, choosing 2, so
(4 choose 2) ways to do this.
Action 3. Choose the 3 cards of different ranks: 12 remaining ranks, so
(12 choose 3) ways to do this.
MIT18_05S14_class2_slides 39
(Continued on next slide.)

January 1, 2017 3 / 29
Combination solution continued

Action 4. Choose 1 card from each of these ranks: 4 cards in each rank, so
(4 choose 1)³ = 4³ ways to do this.
answer: (Using the rule of product.)

(13 choose 1) · (4 choose 2) · (12 choose 3) · 4³ = 1098240

b) To compute the probability we have to stay consistent and count
combinations. To make a 5 card hand we choose 5 cards out of 52, so
there are

(52 choose 5) = 2598960

possible hands. Since each hand is equally likely the probability of a
one-pair hand is
1098240/2598960 = 0.42257.
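The same count in R using choose():

    one_pair <- choose(13, 1) * choose(4, 2) * choose(12, 3) * 4^3   # 1098240
    one_pair / choose(52, 5)                                         # 0.4225690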
MIT18_05S14_class2_slides 40
On the next slide we redo this problem using permutations.

January 1, 2017 4 / 29
Permutation approach
This approach is a little trickier. We include it to show that there is

usually more than one way to count something.

a) Count the number of one-pair hands, where we keep track of the order

they are dealt.

Action 1. (This one is tricky.) Choose the positions in the hand that will

hold the pair: 5 different positions, so (5 choose 2) ways to do this.


Action 2. Put a card in the first position of the pair: 52 cards, so 52 ways
to do this.
Action 3. Put a card in the second position of the pair: since this has to
match the first card, there are only 3 ways to do this.
Action 4. Put a card in the first open slot: this can’t match the pair so
there are 48 ways to do this.
Action 5. Put a card in the next open slot: this can’t match the pair or
the previous card, so there 44 ways to do this.
Action 6. Put a card in the last open slot: there are 40 ways to do this.
MIT18_05S14_class2_slides
(Continued on next slide.) 41
January 1, 2017 5 / 29
Permutation approach continued
answer: (Using the rule of product.)
 
(5 choose 2) · 52 · 3 · 48 · 44 · 40 = 131788800

ways to deal a one-pair hand where we keep track of order.

b) There are

52 P5 = 52 · 51 · 50 · 49 · 48 = 311875200

five card hands where order is important.


Thus, the probability of a one-pair hand is

131788800/311875200 = 0.42257.

(Both approaches give the same answer.)

MIT18_05S14_class2_slides 42
January 1, 2017 6 / 29
Clicker Test

Set your clicker channel to 41.

Do you have your clicker with you?

No = 0
Yes = 1

MIT18_05S14_class2_slides 43
January 1, 2017 7 / 29
Probability Cast

Introduced so far
Experiment: a repeatable procedure
Sample space: set of all possible outcomes S (or Ω).
Event: a subset of the sample space.
Probability function, P(ω): gives the probability for
each outcome ω ∈ S
1. Probability is between 0 and 1
2. Total probability of all possible outcomes is 1.

MIT18_05S14_class2_slides 44
January 1, 2017 8 / 29
Example (from the reading)

Experiment: toss a fair coin, report heads or tails.

Sample space: Ω = {H, T }.

Probability function: P(H) = .5, P(T ) = .5.

Use tables:
Outcomes:      H     T

Probability:  1/2   1/2

(Tables can really help in complicated examples)

MIT18_05S14_class2_slides 45
January 1, 2017 9 / 29
Discrete sample space

Discrete = listable

Examples:

{a, b, c, d} (finite)

{0, 1, 2, . . . } (infinite)

MIT18_05S14_class2_slides 46
January 1, 2017 10 / 29
Events

Events are sets:


Can describe in words
Can describe in notation
Can describe with Venn diagrams

Experiment: toss a coin 3 times.

Event:

You get 2 or more heads = { HHH, HHT, HTH, THH}

MIT18_05S14_class2_slides 47
January 1, 2017 11 / 29
CQ: Events, sets and words

Experiment: toss a coin 3 times.

Which of following equals the event “exactly two heads”?

A = {THH, HTH, HHT , HHH}

B = {THH, HTH, HHT }

C = {HTH, THH}

(1) A (2) B (3) C (4) A or B


answer: (2) B.

The event “exactly two heads” determines a unique subset, containing all

outcomes that have exactly two heads.

MIT18_05S14_class2_slides 48
January 1, 2017 12 / 29
CQ: Events, sets and words

Experiment: toss a coin 3 times.

Which of the following describes the event


{THH, HTH, HHT }?

(1) “exactly one head”


(2) “exactly one tail”
(3) “at most one tail”
(4) none of the above
answer: (2) “exactly one tail”

Notice that the same event E ⊂ Ω may be described in words in multiple

ways (“exactly 2 heads” and “exactly 1 tail”).

MIT18_05S14_class2_slides 49
January 1, 2017 13 / 29
CQ: Events, sets and words

Experiment: toss a coin 3 times.

The events “exactly 2 heads” and “exactly 2 tails” are


disjoint.

(1) True (2) False


answer: True: {THH, HTH, HHT } ∩ {TTH, THT , HTT } = ∅.

MIT18_05S14_class2_slides 50
January 1, 2017 14 / 29
CQ: Events, sets and words

Experiment: toss a coin 3 times.

The event “at least 2 heads” implies the event “exactly


two heads”.

(1) True (2) False


False. It’s the other way around:
{THH, HTH, HHT } ⊂ {THH, HTH, HHT , HHH}.

MIT18_05S14_class2_slides 51
January 1, 2017 15 / 29
Probability rules in mathematical notation

Sample space: S = {ω1 , ω2 , . . . , ωn }


Outcome: ω ∈ S
Probability between 0 and 1: 0 ≤ P(ω) ≤ 1

Total probability is 1: ∑_{j=1}^{n} P(ωj) = 1, i.e. ∑_{ω∈S} P(ω) = 1

Event A: P(A) = ∑_{ω∈A} P(ω)

MIT18_05S14_class2_slides 52
January 1, 2017 16 / 29
Probability and set operations on events

Events A, L, R
Rule 1. Complements: P(Ac ) = 1 − P(A).
Rule 2. Disjoint events: If L and R are disjoint then
P(L ∪ R) = P(L) + P(R).
Rule 3. Inclusion-exclusion principle: For any L and R:
P(L ∪ R) = P(L) + P(R) − P(L ∩ R).

[Venn diagrams: Ω = A ∪ Ac with no overlap; disjoint L and R with no overlap; general L and R with overlap L ∩ R]


MIT18_05S14_class2_slides 53
January 1, 2017 17 / 29
Table question

Class has 50 students


20 male (M), 25 brown-eyed (B)

For a randomly chosen student what is the range of


possible values for p = P(M ∪ B)?
(a) p ≤ .4
(b) .4 ≤ p ≤ .5
(c) .4 ≤ p ≤ .9
(d) .5 ≤ p ≤ .9
(e) .5 ≤ p
answer: (d) .5 ≤ p ≤ .9
Explanation on next slide.

MIT18_05S14_class2_slides 54
January 1, 2017 18 / 29
Solution to CQ

The easy way to answer this is that M ∪ B has a minimum of 25 members

(when all males are brown-eyed) and a maximum of 45 members (when no
males have brown eyes). So, the probability ranges from .5 to .9.
Thinking about it in terms of the inclusion-exclusion principle we have

P(M ∪ B) = P(M) + P(B) − P(M ∩ B) = .9 − P(M ∩ B).

So the maximum possible value of P(M ∪ B) happens if M and B are


disjoint, so P(M ∩ B) = 0. The minimum happens when M ⊂ B, so
P(M ∩ B) = P(M) = .4.

MIT18_05S14_class2_slides 55
January 1, 2017 19 / 29
Table Question
Experiment:
1. Your table should make 9 rolls of a 20-sided die (one
each if the table is full).
2. Check if all rolls at your table are distinct.

Repeat the experiment five times and record the results.

For this experiment, how would you define the sample


space, probability function, and event?
Compute the true probability that all rolls (in one trial)
are distinct and compare with your experimental result.
answer: 1 − (20 · 19 · · · 13 · 12)/20^9 = 0.881. (The explanation is on
the next frame.)

MIT18_05S14_class2_slides 56
January 1, 2017 20 / 29
Board Question Solution
For the sample space S we will take all sequences of 9 numbers between 1
and 20. (We are assuming there are 9 people at table.) We find the size of
S using the rule of product. There are 20 ways to choose the first number
in the sequence, followed by 20 ways to choose the second, etc. Thus,
|S| = 209 .
It is sometimes easier to calculate the probability of an event indirectly by
calculating the probability of the complement and using the formula
P(A) = 1 − P(Ac ).
In our case, A is the event ‘there is a match’, so Ac is the event ‘there is
no match’. We can use the rule of product to compute |Ac | as follows.
There are 20 ways to choose the first number, then 19 ways to choose the
second, etc. down to 12 ways to choose the ninth number. Thus, we have
|Ac | = 20 · 19 · 18 · 17 · 16 · 15 · 14 · 13 · 12
That is |Ac | = 20 P9 .
Putting this all together

P(A) = 1 − P(Ac) = 1 − (20 · 19 · 18 · 17 · 16 · 15 · 14 · 13 · 12)/20^9 = 0.881

MIT18_05S14_class2_slides 57
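The exact value and a quick simulation check in R:

    1 - prod(20:12) / 20^9   # 0.881: P(at least one match in 9 rolls)

    set.seed(1)
    match_seen <- replicate(1e5, any(duplicated(sample(20, 9, replace = TRUE))))
    mean(match_seen)         # should be close to 0.881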
January 1, 2017 21 / 29
Jon’s dice

Jon has three six-sided dice with unusual numbering.

A game consists of two players each choosing a die. They


roll once and the highest number wins.

Which die would you choose?


MIT18_05S14_class2_slides 58
January 1, 2017 22 / 29
Board Question

1. Make probability tables for the red and white dice.


2. Make a probability table for the product sample space of red and
white.
3. Compute the probability that red beats white.

4. Pair up with another group. Have one group compare red vs.
green and the other compare green vs. red. Based on the three
comparisons rank the dice from best to worst.

MIT18_05S14_class2_slides 59
January 1, 2017 23 / 29
Computations for solution

Red die White die Green die


Outcomes 3 6 2 5 1 4
Probability 5/6 1/6 3/6 3/6 1/6 5/6

The 2 × 2 tables show pairs of dice.


Each entry is the probability of seeing the pair of numbers
corresponding to that entry.
The color gives the winning die for that pair of numbers. (We
use black instead of white when the white die wins.)
White Green
2 5 1 4
Red 3 15/36 15/36 5/36 25/36
6 3/36 3/36 1/36 5/36
Green 1 3/36 3/36
4 15/36 15/36
MIT18_05S14_class2_slides 60
January 1, 2017 24 / 29
Answer to board question continued

White Green
2 5 1 4
Red 3 15/36 15/36 5/36 25/36
6 3/36 3/36 1/36 5/36
Green 1 3/36 3/36
4 15/36 15/36
The three comparisons are:
P(red beats white) = 21/36 = 7/12
P(white beats green) = 21/36 = 7/12
P(green beats red) = 25/36

Thus: red is better than white is better than green is better than red.

There is no best die: the property of being ‘better than’ is


non-transitive.
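A simulation sketch in R, with the face values and probabilities taken from the tables above (the helper roll() is just for illustration):

    set.seed(2)
    roll <- function(faces, probs, n) sample(faces, n, replace = TRUE, prob = probs)
    n <- 1e5
    red   <- roll(c(3, 6), c(5/6, 1/6), n)
    white <- roll(c(2, 5), c(3/6, 3/6), n)
    green <- roll(c(1, 4), c(1/6, 5/6), n)
    mean(red > white)     # ≈ 21/36
    mean(white > green)   # ≈ 21/36
    mean(green > red)     # ≈ 25/36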
MIT18_05S14_class2_slides 61
January 1, 2017 25 / 29
Concept Question

Lucky Larry has a coin that you’re quite sure is not fair.

He will flip the coin twice


It’s your job to bet whether the outcomes will be the
same (HH, TT) or different (HT, TH).

Which should you choose?


1. Same
2. Different
3. It doesn’t matter, same and different are equally likely

MIT18_05S14_class2_slides 62
January 1, 2017 26 / 29
Board Question
Lucky Larry has a coin that you’re quite sure is not fair.

He will flip the coin twice


It’s your job to bet whether the outcomes will be the
same (HH, TT) or different (HT, TH).

Which should you choose?


1. Same 2. Different 3. Doesn’t matter

Question: Let p be the probability of heads and use


probability to answer the question.

(If you don’t see the symbolic algebra try p = .2, p=.5)

MIT18_05S14_class2_slides 63
January 1, 2017 27 / 29
Solution
answer: 1. Same (same is more likely than different)
The key bit of arithmetic is that if a ≠ b then

(a − b)² > 0 ⇔ a² + b² > 2ab.

To keep the notation cleaner, let’s use P(T ) = (1 − p) = q.


Since the flips are independent (we’ll discuss this next week) the
probabilities multiply. This gives the following 2 × 2 table.
                  second flip
                   H      T
first flip    H    p²     pq
              T    qp     q²

So, P(same) = p² + q² and P(diff) = 2pq. Since the coin is unfair we


know p ≠ q. Now we use our key bit of arithmetic to say

p² + q² > 2pq ⇒ P(same) > P(different).


MIT18_05S14_class2_slides 64
QED

January 1, 2017 28 / 29
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class2_slides 65
Conditional Probability, Independence, Bayes’ Theorem
18.05 Spring 2014

Class 3 Slides with Solutions: Conditional Probability, Bayes' Theorem 66


January 1, 2017 1 / 28
Sample Space Confusions

1. Sample space = set of all possible outcomes of an experiment.


2. The size of the set is NOT the sample space.
3. Outcomes can be sequences of numbers.
Examples.
1. Roll 5 dice: Ω = set of all sequences of 5 numbers between 1 and

6, e.g. (1, 2, 1, 3, 1, 5) ∈ Ω.

The size |Ω| = 6^5 is not a set.

2. Ω = set of all sequences of 10 birthdays,


e.g. (111, 231, 3, 44, 55, 129, 345, 14, 24, 14) ∈ Ω.
|Ω| = 365^10
3. n some number, Ω = set of all sequences of n birthdays.
|Ω| = 365^n.
MIT18_05S14_class3_slides 67
January 1, 2017 2 / 28
Conditional Probability
‘the probability of A given B’.
P(A|B) = P(A ∩ B) / P(B),   provided P(B) ≠ 0.


[Venn diagram: overlapping events A and B with the intersection A ∩ B shaded]

Conditional probability: Abstractly and for coin example

MIT18_05S14_class3_slides 68
January 1, 2017 3 / 28
Table/Concept Question
(Work with your tablemates, then everyone click in the answer.)
Toss a coin 4 times. Let
A = ‘at least three heads’
B = ‘first toss is tails’.
1. What is P(A|B)?
(a) 1/16 (b) 1/8 (c) 1/4 (d) 1/5

2. What is P(B|A)?
(a) 1/16 (b) 1/8 (c) 1/4 (d) 1/5
answer: 1. (b) 1/8. 2. (d) 1/5.
Counting we find |A| = 5, |B| = 8 and |A ∩ B| = 1. Since all sequences
are equally likely
P(A|B) = P(A ∩ B)/P(B) = |A ∩ B|/|B| = 1/8.    P(B|A) = |B ∩ A|/|A| = 1/5.

MIT18_05S14_class3_slides 69
January 1, 2017 4 / 28
Table Question

“Steve is very shy and withdrawn, invariably


helpful, but with little interest in people, or in the
world of reality. A meek and tidy soul, he has a
need for order and structure and a passion for
detail.”∗
What is the probability that Steve is a librarian?
What is the probability that Steve is a farmer?
Discussion on next slide.
∗ From
Judgment under uncertainty: heuristics and biases by Tversky and
Kahneman.
MIT18_05S14_class3_slides 70
January 1, 2017 5 / 28
Discussion of Shy Steve

Discussion: Most people say that it is more likely that Steve is a librarian
than a farmer. Almost all people fail to consider that for every male
librarian in the United States, there are more than fifty male farmers.
When this is explained, most people who chose librarian switch their
solution to farmer.
This illustrates how people often substitute representativeness for
likelihood. The fact that a librarian may be likely to have the above
personality traits does not mean that someone with these traits is likely to
be a librarian.

MIT18_05S14_class3_slides 71
January 1, 2017 6 / 28
Multiplication Rule, Law of Total Probability
Multiplication rule: P(A ∩ B) = P(A|B) · P(B).

Law of total probability: If B1 , B2 , B3 partition Ω then

P(A) = P(A ∩ B1 ) + P(A ∩ B2 ) + P(A ∩ B3 )

= P(A|B1 )P(B1 ) + P(A|B2 )P(B2 ) + P(A|B3 )P(B3 )


[Figure: sample space Ω partitioned into B1, B2, B3, with event A split into A ∩ B1, A ∩ B2, A ∩ B3]

MIT18_05S14_class3_slides 72
January 1, 2017 7 / 28
Trees
Organize computations
Compute total probability
Compute Bayes’ formula
Example. : Game: 5 red and 2 green balls in an urn. A random ball
is selected and replaced by a ball of the other color; then a second
ball is drawn.
1. What is the probability the second ball is red?
2. What is the probability the first ball was red given the second ball
was red?
First draw:    R1 (5/7)                 G1 (2/7)
Second draw:   R2 (4/7)   G2 (3/7)      R2 (6/7)   G2 (1/7)
MIT18_05S14_class3_slides 73
January 1, 2017 8 / 28
Solution

1. The law of total probability gives P(R2) = (5/7)·(4/7) + (2/7)·(6/7) = 32/49.

2. Bayes’ rule gives P(R1|R2) = P(R1 ∩ R2)/P(R2) = (20/49)/(32/49) = 20/32.

MIT18_05S14_class3_slides 74
January 1, 2017 9 / 28
Concept Question: Trees 1
[Probability tree: level 1 splits into A1, A2 (x labels the edge to A1); level 2 into B1, B2 (y labels the edge A1→B2); level 3 into C1, C2 (z labels the edge to C1 below A1→B2)]

1. The probability x represents

(a) P(A1 )
(b) P(A1 |B2 )
(c) P(B2 |A1 )
(d) P(C1 |B2 ∩ A1 ).
answer: (a) P(A1 ).
MIT18_05S14_class3_slides 75
January 1, 2017 10 / 28
Concept Question: Trees 2
[Probability tree: level 1 splits into A1, A2 (x labels the edge to A1); level 2 into B1, B2 (y labels the edge A1→B2); level 3 into C1, C2 (z labels the edge to C1 below A1→B2)]

2. The probability y represents

(a) P(B2 )
(b) P(A1 |B2 )
(c) P(B2 |A1 )
(d) P(C1 |B2 ∩ A1 ).
answer: (c) P(B2 |A1 ).
MIT18_05S14_class3_slides 76
January 1, 2017 11 / 28
Concept Question: Trees 3
[Probability tree: level 1 splits into A1, A2 (x labels the edge to A1); level 2 into B1, B2 (y labels the edge A1→B2); level 3 into C1, C2 (z labels the edge to C1 below A1→B2)]

3. The probability z represents

(a) P(C1 )
(b) P(B2 |C1 )
(c) P(C1 |B2 )
(d) P(C1 |B2 ∩ A1 ).
answer: (d) P(C1 |B2 ∩ A1 ).
MIT18_05S14_class3_slides 77
January 1, 2017 12 / 28
Concept Question: Trees 4
[Probability tree: level 1 splits into A1, A2 (x labels the edge to A1); level 2 into B1, B2 (y labels the edge A1→B2); level 3 into C1, C2 (z labels the edge to C1 below A1→B2)]

4. The circled node represents the event

(a) C1
(b) B2 ∩ C1
(c) A1 ∩ B2 ∩ C1
(d) C1 |B2 ∩ A1 .
answer: (c) A1 ∩ B2 ∩ C1 .
MIT18_05S14_class3_slides 78
January 1, 2017 13 / 28
Let’s Make a Deal with Monty Hall
One door hides a car, two hide goats.
The contestant chooses any door.
Monty always opens a different door with a goat. (He
can do this because he knows where the car is.)
The contestant is then allowed to switch doors if she
wants.
What is the best strategy for winning a car?
(a) Switch (b) Don’t switch (c) It doesn’t matter

MIT18_05S14_class3_slides 79
January 1, 2017 14 / 28
Board question: Monty Hall
Organize the Monty Hall problem into a tree and compute
the probability of winning if you always switch.
Hint first break the game into a sequence of actions.
answer: Switch. P(C |switch) = 2/3

It’s easiest to show this with a tree representing the switching strategy:

First the contestant chooses a door, (then Monty shows a goat), then the

contestant switches doors.

Probability Switching Wins the Car

Chooses:     C (1/3)              G (2/3)
Switches:    C (0)    G (1)       C (1)    G (0)

The (total) probability of C is P(C|switch) = (1/3) · 0 + (2/3) · 1 = 2/3.

MIT18_05S14_class3_slides 80
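A simulation sketch of the switching strategy in R (one_game() is an illustrative helper, not code from the course):

    set.seed(3)
    one_game <- function() {
      car   <- sample(3, 1)                  # door hiding the car
      pick  <- sample(3, 1)                  # contestant's first choice
      goats <- setdiff(1:3, c(pick, car))    # doors Monty may open
      monty <- if (length(goats) == 1) goats else sample(goats, 1)
      final <- setdiff(1:3, c(pick, monty))  # the door switched to
      final == car
    }
    mean(replicate(1e5, one_game()))         # ≈ 2/3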
January 1, 2017 15 / 28
Independence
Events A and B are independent if the probability that
one occurred is not affected by knowledge that the other
occurred.

Independence ⇔ P(A|B) = P(A) (provided P(B) ≠ 0)


⇔ P(B|A) = P(B) (provided P(A) ≠ 0)

(For any A and B)

⇔ P(A ∩ B) = P(A)P(B)

MIT18_05S14_class3_slides 81
January 1, 2017 16 / 28
Table/Concept Question: Independence
(Work with your tablemates, then everyone click in the answer.)

Roll two dice and consider the following events


A = ‘first die is 3’
B = ‘sum is 6’
C = ‘sum is 7’
A is independent of
(a) B and C (b) B alone
(c) C alone (d) Neither B or C .
answer: (c). (Explanation on next slide)
MIT18_05S14_class3_slides 82
January 1, 2017 17 / 28
Solution
P(A) = 1/6, P(A|B) = 1/5. Not equal, so not independent.
P(A) = 1/6, P(A|C ) = 1/6. Equal, so independent.

Notice that knowing B, removes 6 as a possibility for the first die and
makes A more probable. So, knowing B occurred changes the probability
of A.

But, knowing C does not change the probabilities for the possible values of
the first roll; they are still 1/6 for each value. In particular, knowing C
occurred does not change the probability of A.

Could also have done this problem by showing

P(B|A) ≠ P(B) or P(A ∩ B) ≠ P(A)P(B).


MIT18_05S14_class3_slides 83
January 1, 2017 18 / 28
Bayes’ Theorem

Also called Bayes’ Rule and Bayes’ Formula.


Allows you to find P(A|B) from P(B|A), i.e. to ‘invert’
conditional probabilities.

P(A|B) = P(B|A) · P(A) / P(B)
Often compute the denominator P(B) using the law of
total probability.

MIT18_05S14_class3_slides 84
January 1, 2017 19 / 28
Board Question: Evil Squirrels

Of the one million squirrels on MIT’s campus most are


good-natured. But one hundred of them are pure evil! An
enterprising student in Course 6 develops an “Evil Squirrel
Alarm” which she offers to sell to MIT for a passing
grade. MIT decides to test the reliability of the alarm by
conducting trials.

MIT18_05S14_class3_slides 85
January 1, 2017 20 / 28
Evil Squirrels Continued

When presented with an evil squirrel, the alarm goes


off 99% of the time.
When presented with a good-natured squirrel, the
alarm goes off 1% of the time.
(a) If a squirrel sets off the alarm, what is the probability
that it is evil?
(b) Should MIT co-opt the patent rights and employ the
system?
Solution on next slides.
MIT18_05S14_class3_slides 86
January 1, 2017 21 / 28
One solution
(This is a base rate fallacy problem)
We are given:
P(nice) = 0.9999, P(evil) = 0.0001 (base rate)

P(alarm | nice) = 0.01, P(alarm | evil) = 0.99

P(evil | alarm) = P(alarm | evil)P(evil) / P(alarm)

                = P(alarm | evil)P(evil) / [P(alarm | evil)P(evil) + P(alarm | nice)P(nice)]

                = (0.99)(0.0001) / [(0.99)(0.0001) + (0.01)(0.9999)]

                ≈ 0.01

MIT18_05S14_class3_slides 87
January 1, 2017 22 / 28
Squirrels continued
Summary:
Probability a random test is correct = 0.99

Probability a positive test is correct ≈ 0.01

These probabilities are not the same!

Alternative method of calculation:


            Evil     Nice      Total
Alarm         99     9999      10098
No alarm       1     989901    989902
Total        100     999900    1000000
MIT18_05S14_class3_slides 88
January 1, 2017 23 / 28
Evil Squirrels Solution

answer: (a) This is the same solution as in the slides above, but in a more
compact notation. Let E be the event that a squirrel is evil. Let A be the
event that the alarm goes off. By Bayes’ Theorem, we have:

P(E | A) = P(A | E)P(E) / [P(A | E)P(E) + P(A | E^c)P(E^c)]

         = (0.99 · 100/1000000) / (0.99 · 100/1000000 + 0.01 · 999900/1000000)

         ≈ .01.

(b) No. The alarm would be more trouble than it’s worth, since for every
true positive there are about 99 false positives.
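The same computation in R:

    p_evil  <- 100 / 1000000
    p_alarm <- 0.99 * p_evil + 0.01 * (1 - p_evil)   # law of total probability
    0.99 * p_evil / p_alarm                          # P(evil | alarm) ≈ 0.0098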

MIT18_05S14_class3_slides 89
January 1, 2017 24 / 28
Washington Post, hot off the press

Annual physical exam is probably unnecessary if you’re


generally healthy
For patients, the negatives include time away
from work and possibly unnecessary tests.
“Getting a simple urinalysis could lead to a false
positive, which could trigger a cascade of even
more tests, only to discover in the end that you
had nothing wrong with you.” Mehrotra says.

http://www.washingtonpost.com/national/health-science/
annual-physical-exam-is-probably-unnecessary-if-youre-general
2013/02/08/2c1e326a-5f2b-11e2-a389-ee565c81c565_story.html
MIT18_05S14_class3_slides 90
January 1, 2017 25 / 28
Table Question: Dice Game
1 The Randomizer holds the 6-sided die in one fist and
the 8-sided die in the other.

2 The Roller selects one of the Randomizer’s fists and

covertly takes the die.

3 The Roller rolls the die in secret and reports the result

to the table.

Given the reported number, what is the probability that


the 6-sided die was chosen? (Find the probability for each
possible reported number.)
answer: If the number rolled is 1-6 then P(six-sided) = 4/7.
If the number rolled is 7 or 8 then P(six-sided) = 0.
MIT18_05S14_class3_slides 91
Explanation on next page
January 1, 2017 26 / 28
Dice Solution
This is a Bayes’ formula problem. For concreteness let’s suppose the roll
was a 4. What we want to compute is P(6-sided|roll 4). But, what is
easy to compute is P(roll 4|6-sided).
Bayes’ formula says

P(6-sided|roll 4) = P(roll 4|6-sided)P(6-sided) / P(4)
                  = (1/6)(1/2) / [(1/6)(1/2) + (1/8)(1/2)] = 4/7.

The denominator is computed using the law of total probability:

P(4) = P(4|6-sided)P(6-sided) + P(4|8-sided)P(8-sided) = (1/6)·(1/2) + (1/8)·(1/2).
Note that any roll of 1, 2, . . . , 6 would give the same result. A roll of 7 (or
8) would clearly give probability 0. This is seen in Bayes’ formula
because the term P(roll 7|6-sided) = 0.
MIT18_05S14_class3_slides 92
January 1, 2017 27 / 28
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class3_slides 93
Discrete Random Variables; Expectation
18.05 Spring 2014

https://en.wikipedia.org/wiki/Bean_machine#/media/File:
Quincunx_(Galton_Box)_-_Galton_1889_diagram.png
Class 4 Slides with Solutions: Discrete Random Variables, Expectation 94
http://www.youtube.com/watch?v=9xUBhhM4vbM
January 1, 2017 1 / 26
Reading Review
Random variable X assigns a number to each outcome:

X :Ω→R

“X = a” denotes the event {ω | X (ω) = a}.

Probability mass function (pmf) of X is given by

p(a) = P(X = a).

Cumulative distribution function (cdf) of X is given by

F (a) = P(X ≤ a).


MIT18_05S14_class4_slides 95
January 1, 2017 2 / 26
Example from class

Suppose X is a random variable with the following table.


values of X : -2 -1 0 4
pmf p(a): 1/4 1/4 1/4 1/4
cdf F (a): 1/4 2/4 3/4 4/4

The cdf is the probability ‘accumulated’ from the left.

Examples. F (−1) = 2/4, F (0) = 3/4, F (0.5) = 3/4, F (−5) = 0,


F (5) = 1.

Properties of F (a):
1. Nondecreasing
2. Way to the left, i.e. as a → −∞, F is 0
3. Way to the right, i.e. as a → ∞, F is 1.
MIT18_05S14_class4_slides 96
January 1, 2017 3 / 26
CDF and PMF

[Figure: cdf F(a) stepping up through .5, .75, .9, 1 at a = 1, 3, 5, 7 (top); pmf p(a) with spikes of height .5, .25, .15, .1 at a = 1, 3, 5, 7 (bottom)]

MIT18_05S14_class4_slides 97
January 1, 2017 4 / 26
Concept Question: cdf and pmf
X a random variable.
values of X : 1 3 5 7
cdf F (a): 0.5 0.75 0.9 1

1. What is P(X ≤ 3)?

(a) 0.15 (b) 0.25 (c) 0.5 (d) 0.75

2. What is P(X = 3)
(a) 0.15 (b) 0.25 (c) 0.5 (d) 0.75
1. answer: (d) 0.75. P(X ≤ 3) = F (3) = 0.75.
2. answer: (b) P(X = 3) = F (3) − F (1) = 0.75 − 0.5 = 0.25.
MIT18_05S14_class4_slides 98
January 1, 2017 5 / 26
Deluge of discrete distributions
Bernoulli(p) = 1 (success) with probability p,

0 (failure) with probability 1 − p.

In more neutral language:


Bernoulli(p) = 1 (heads) with probability p,
0 (tails) with probability 1 − p.

Binomial(n,p) = # of successes in n independent


Bernoulli(p) trials.

Geometric(p) = # of tails before first heads in a


sequence of indep. Bernoulli(p) trials.
(Neutral language avoids confusing whether we want the number of
successes before the first failure or vice versa.)
MIT18_05S14_class4_slides 99
January 1, 2017 6 / 26
Concept Question
1. Let X ∼ binom(n, p) and Y ∼ binom(m, p) be
independent. Then X + Y follows:
(a) binom(n + m, p) (b) binom(nm, p)
(c) binom(n + m, 2p) (d) other
2. Let X ∼ binom(n, p) and Z ∼ binom(n, q) be
independent. Then X + Z follows:
(a) binom(n, p + q) (b) binom(n, pq)
(c) binom(2n, p + q) (d) other
1. answer: (a). Each binomial random variable is a sum of independent
Bernoulli(p) random variables, so their sum is also a sum of Bernoulli(p)
r.v.’s.
2. answer: (d) This is different from problem 1 because we are combining
Bernoulli(p) r.v.’s with Bernoulli(q) r.v.’s. This is not one of the named
random variables we know about.

MIT18_05S14_class4_slides 100
January 1, 2017 7 / 26
Board Question: Find the pmf

X = # of successes before the second failure of a


sequence of independent Bernoulli(p) trials.

Describe the pmf of X .


Hint: this requires some counting.
Answer is on the next slide.

MIT18_05S14_class4_slides 101
January 1, 2017 8 / 26
Solution

X takes values 0, 1, 2, . . . . The pmf is p(n) = (n + 1) p^n (1 − p)².


For concreteness, we’ll derive this formula for n = 3. Let’s list the
outcomes with three successes before the second failure. Each must have
the form
_ _ _ _ F
with three S and one F in the first four slots. So we just have to choose
which of these four slots contains the F:

{FSSSF , SFSSF , SSFSF , SSSFF }

In other words, there are (4 choose 1) = 4 = 3 + 1 such outcomes. Each of these

outcomes has three S and two F, so probability p³(1 − p)². Therefore

p(3) = P(X = 3) = (3 + 1) p³ (1 − p)².

The same reasoning works for general n.

MIT18_05S14_class4_slides 102
January 1, 2017 9 / 26
Dice simulation: geometric(1/4)

Roll the 4-sided die repeatedly until you roll a 1.

Click in X = # of rolls BEFORE the 1.

(If X is 9 or more click 9.)

Example: If you roll (3, 4, 2, 3, 1) then click in 4.

Example: If you roll (1) then click 0.

MIT18_05S14_class4_slides 103
January 1, 2017 10 / 26
Fiction

Gambler’s fallacy: [roulette] if black comes up several


times in a row then the next spin is more likely to be red.

Hot hand: NBA players get ‘hot’.

MIT18_05S14_class4_slides 104
January 1, 2017 11 / 26
Fact

P(red) remains the same.


The roulette wheel has no memory. (Monte Carlo, 1913).

The data show that a player who has made 5 shots in a row
is no more likely than usual to make the next shot.
(Currently, there seems to be some disagreement about
this.)

MIT18_05S14_class4_slides 105
January 1, 2017 12 / 26
Gambler’s fallacy

“On August 18, 1913, at the casino in Monte Carlo, black came up a
record twenty-six times in succession [in roulette]. [There] was a
near-panicky rush to bet on red, beginning about the time black had
come up a phenomenal fifteen times. In application of the maturity
[of the chances] doctrine, players doubled and tripled their stakes, this
doctrine leading them to believe after black came up the twentieth
time that there was not a chance in a million of another repeat. In the
end the unusual run enriched the Casino by some millions of francs.”

MIT18_05S14_class4_slides 106
January 1, 2017 13 / 26
Hot hand fallacy

An NBA player who made his last few shots is more likely
than his usual shooting percentage to make the next one?
See The Hot Hand in Basketball: On the Misperception of Random
Sequences by Gilovish, Vallone and Tversky. (A link that worked when
these slides were written is
http://www.cs.colorado.edu/~mozer/Teaching/syllabi/7782/
readings/gilovich%20vallone%20tversky.pdf)
(There seems to be some controversy about this. Some statisticians feel
that the authors of the above paper erred in their analysis of the data and
the data do support the notion of a hot hand in basketball.)

MIT18_05S14_class4_slides 107
January 1, 2017 14 / 26
Amnesia

Show that Geometric(p) is memoryless, i.e.

P(X = n + k | X ≥ n) = P(X = k)

Explain why we call this memoryless.


Explanation given on next slide.

MIT18_05S14_class4_slides 108
January 1, 2017 15 / 26
Proof that geometric(p) is memoryless
One method is to look at the tree for this distribution. Here we’ll just use

the formula that defines conditional probability. To do this we need to find

probabilities for the events used in the formula.

Let A be ‘X = n + k’ and let B be ‘X ≥ n’.

We have the following:

A ∩ B = A. This is because X = n + k guarantees X ≥ n. Thus,


P(A ∩ B) = P(A) = p^(n+k) (1 − p)
P(B) = p^n. This is because B consists of all sequences that start
with n successes.
We can now compute the conditional probability

P(A|B) = P(A ∩ B)/P(B) = p^(n+k) (1 − p) / p^n = p^k (1 − p) = P(X = k).

This is what we wanted to show!
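A quick simulation check in R. Note that rgeom counts failures before the first success, so with success probability 1/4 we have P(X = k) = (3/4)^k (1/4):

    set.seed(4)
    x <- rgeom(1e6, prob = 1/4)
    n <- 3; k <- 2
    mean(x[x >= n] == n + k)   # estimate of P(X = n+k | X >= n)
    mean(x == k)               # estimate of P(X = k); both ≈ (3/4)^2 * (1/4) ≈ 0.141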

MIT18_05S14_class4_slides 109
January 1, 2017 16 / 26
Expected Value
X is a random variable takes values x1 , x2 , . . . , xn :
The expected value of X is defined by
E(X) = p(x1) x1 + p(x2) x2 + . . . + p(xn) xn = ∑_{i=1}^{n} p(xi) xi

It is a weighted average.
It is a measure of central tendency.
Properties of E(X)
E(X + Y) = E(X) + E(Y)   (linearity I)
E(aX + b) = a E(X) + b   (linearity II)
E(h(X)) = ∑_i h(xi) p(xi)

MIT18_05S14_class4_slides 110
January 1, 2017 17 / 26
Meaning of expected value
What is the expected average of one roll of a die?
answer: Suppose we roll it 5 times and get (3, 1, 6, 1, 2). To find the
average we add up these numbers and divide by 5: ave = 2.6. With so few
rolls we don’t expect this to be representative of what would usually
happen. So let’s think about what we’d expect from a large number of
rolls. To be specific, let’s (pretend to) roll the die 600 times.
We expect that each number will come up roughly 1/6 of the time. Let’s
suppose this is exactly what happens and compute the average.
value: 1 2 3 4 5 6
expected counts: 100 100 100 100 100 100
The average of these 600 values (100 ones, 100 twos, etc.) is then
average = (100 · 1 + 100 · 2 + 100 · 3 + 100 · 4 + 100 · 5 + 100 · 6)/600
        = (1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6 = 3.5.

This is the ‘expected average’. We will call it the expected value.

MIT18_05S14_class4_slides 111
January 1, 2017 18 / 26
Examples
Example 1. Find E (X )
1. X: 3 4 5 6
2. pmf: 1/4 1/2 1/8 1/8
3. E (X ) = 3/4 + 4/2 + 5/8 + 6/8 = 33/8

Example 2. Suppose X ∼ Bernoulli(p). Find E (X ).


1. X: 0 1
2. pmf: 1 − p p
3. E (X ) = (1 − p) · 0 + p · 1 = p.

Example 3. Suppose X ∼ Binomial(12, .25). Find E (X ).


X = X1 + X2 + . . . + X12 , where Xi ∼ Bernoulli(.25). Therefore

E (X ) = E (X1 ) + E (X2 ) + . . . E (X12 ) = 12 · (.25) = 3

In general if X ∼ Binomial(n, p) then E (X ) = np.

MIT18_05S14_class4_slides 112
January 1, 2017 19 / 26
Class example
We looked at the random variable X with the following table top 2 lines.
1. X : -2 -1 0 1 2
2. pmf: 1/5 1/5 1/5 1/5 1/5
3. E (X ) = -2/5 - 1/5 + 0/5 + 1/5 + 2/5 = 0
4. X 2: 4 1 0 1 4
5. E (X 2 ) = 4/5 + 1/5 + 0/5 + 1/5 + 4/5 = 2

Line 3 computes E (X ) by multiplying the probabilities in line 2 by the


values in line 1 and summing.
Line 4 gives the values of X 2 .
Line 5 computes E (X 2 ) by multiplying the probabilities in line 2 by the
values in line 4 and summing. This illustrates the use of the formula
E(h(X)) = ∑_i h(xi) p(xi).
MIT18_05S14_class4_slides 113
Continued on the next slide.

January 1, 2017 20 / 26
Class example continued

Notice that in the table on the previous slide, some values for X 2 are
repeated. For example the value 4 appears twice. Summing all the
probabilities where X 2 = 4 gives P(X 2 = 4) = 2/5. Here’s the full table
for X 2

1. X 2: 4 1 0
2. pmf: 2/5 2/5 1/5
3. E (X 2 ) = 8/5 + 2/5 + 0/5 = 2

Here we used the definition of expected value to compute E (X 2 ). Of


course, we got the same expected value E (X 2 ) = 2 as we did earlier.

MIT18_05S14_class4_slides 114
January 1, 2017 21 / 26
Board Question: Interpreting Expectation

(a) Would you accept a gamble that offers a 10% chance


to win $95 and a 90% chance of losing $5?

(b) Would you pay $5 to participate in a lottery that


offers a 10% percent chance to win $100 and a 90%
chance to win nothing?

• Find the expected value of your change in assets in each


case?
Discussion on next slide.

MIT18_05S14_class4_slides 115
January 1, 2017 22 / 26
Discussion

Framing bias / cost versus loss. The two situations are identical, with an
expected value of gaining $5. In a study, 132 undergrads were given these
questions (in different orders) separated by a short filler problem. 55 gave
different preferences to the two events. Of these, 42 rejected (a) but
accepted (b). One interpretation is that we are far more willing to pay a
cost up front than risk a loss. (See Judgment under uncertainty: heuristics
and biases by Tversky and Kahneman.)
Loss aversion and cost versus loss sustain the insurance industry: people
pay more in premiums than they get back in claims on average (otherwise
the industry wouldn’t be sustainable), but they buy insurance anyway to
protect themselves against substantial losses. Think of it as paying $1
each year to protect yourself against a 1 in 1000 chance of losing $100
that year. By buying insurance, the expected value of the change in your
assets in one year (ignoring other income and spending) goes from
negative 10 cents to negative 1 dollar. But whereas without insurance you
might lose $100, with insurance you always lose exactly $1.
MIT18_05S14_class4_slides 116
January 1, 2017 23 / 26
Board Question
Suppose (hypothetically!) that everyone at your table got up, ran
around the room, and sat back down randomly (i.e., all seating
arrangements are equally likely).
What is the expected value of the number of people sitting in their
original seat?
(We will explore this with simulations in Friday Studio.)
Neat fact: A permutation in which nobody returns to their original seat is
called a derangement. The number of derangements turns out to be the
nearest integer to n!/e. Since there are n! total permutations, we have:
P(everyone in a different seat) ≈ (n!/e) / n! = 1/e ≈ 0.3679.
It’s surprising that the probability is about 37% regardless of n, and that it
converges to 1/e as n goes to infinity.
http://en.wikipedia.org/wiki/Derangement

MIT18_05S14_class4_slides 117
January 1, 2017 24 / 26
Solution
Number the people from 1 to n. Let Xi be the Bernoulli random variable
with value 1 if person i returns to their original seat and value 0 otherwise.
Since person i is equally likely to sit back down in any of the n seats, the
probability that person i returns to their original seat is 1/n. Therefore
Xi ∼ Bernoulli(1/n) and E (Xi ) = 1/n. Let X be the number of people
sitting in their original seat following the rearrangement. Then
X = X1 + X2 + · · · + Xn .
By linearity of expected values, we have
E(X) = ∑_{i=1}^{n} E(Xi) = ∑_{i=1}^{n} 1/n = 1.

• It’s neat that the expected value is 1 for any n.


• If n = 2, then both people either retain their seats or exchange seats. So
P(X = 0) = 1/2 and P(X = 2) = 1/2. In this case, X never equals E (X ).
• The Xi are not independent (e.g. for n = 2, X1 = 1 implies X2 = 1).
• Expectation behaves linearly even when the variables are dependent.

MIT18_05S14_class4_slides 118
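A simulation sketch in R (the Friday studio explores this; the table size n here is an arbitrary choice):

    set.seed(5)
    n <- 9
    fixed_points <- replicate(1e5, sum(sample(n) == 1:n))   # people back in their own seat
    mean(fixed_points)                                      # close to 1 for any n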
January 1, 2017 25 / 26
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class4_slides 119
Variance; Continuous Random Variables
18.05 Spring 2014

Class 5 Slides with Solutions: Variance, Continuous Random Va 120


January 1, 2017 1 / 26
Variance and standard deviation

X a discrete random variable with mean E (X ) = µ.

Meaning: spread of probability mass about the mean.


Definition as expectation (weighted sum):

Var(X ) = E ((X − µ)2 ).

Computation as sum:
Var(X) = ∑_{i=1}^{n} p(xi)(xi − µ)².

Standard deviation σ = √Var(X).


Units for standard deviation = units of X .
MIT18_05S14_class5_slides 121
January 1, 2017 2 / 26
Concept question
The graphs below give the pmf for 3 random variables. Order them
by size of standard deviation from biggest to smallest. (Assume x has
the same units in all 3.)
[Figure: three pmfs, (A), (B), and (C), each on the values x = 1, 2, 3, 4, 5 but with different spreads]

1. ABC 2. ACB 3. BAC 4. BCA 5. CAB 6. CBA


MIT18_05S14_class5_slides 122
Answer on next slide
January 1, 2017 3 / 26
Solution

answer: 5. CAB

All 3 variables have the same range from 1-5 and all of them are
symmetric so their mean is right in the middle at 3. (C) has most of
its weight at the extremes, so it has the biggest spread. (B) has the
most weight in the middle so it has the smallest spread.
From biggest to smallest standard deviation we have (C), (A), (B).

MIT18_05S14_class5_slides 123
January 1, 2017 4 / 26
Computation from tables

Example. Compute the variance and standard deviation

of X .

values x:    1      2      3      4      5

pmf p(x):   1/10   2/10   4/10   2/10   1/10

Answer on next slide

MIT18_05S14_class5_slides 124
January 1, 2017 5 / 26
Computation from tables

From the table we compute the mean:


µ = 1/10 + 4/10 + 12/10 + 8/10 + 5/10 = 3.
Then we add a line to the table for (X − µ)2 .
values X 1 2 3 4 5
pmf p(x) 1/10 2/10 4/10 2/10 1/10
(X − µ)2 4 1 0 1 4
Using the table we compute variance E ((X − µ)2 ):

(1/10)·4 + (2/10)·1 + (4/10)·0 + (2/10)·1 + (1/10)·4 = 1.2

The standard deviation is then σ = √1.2 ≈ 1.1.
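The same computation in R:

    x  <- 1:5
    p  <- c(1, 2, 4, 2, 1) / 10
    mu <- sum(p * x)            # 3
    v  <- sum(p * (x - mu)^2)   # 1.2
    sqrt(v)                     # ≈ 1.095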
MIT18_05S14_class5_slides 125
January 1, 2017 6 / 26
Concept question
Which pmf has the bigger standard deviation? (Assume w
and y have the same units.)
1. Y 2. W
[Figure: pmf of Y, with p(−3) = p(3) = 1/2; pmf of W, with p(10) = .1, p(20) = .2, p(30) = .4, p(40) = .2, p(50) = .1]

Table question: make probability tables for Y and W


and compute their standard deviations.
MIT18_05S14_class5_slides 126
Solution on next slide
January 1, 2017 7 / 26
Solution
answer: We get the table for Y from the figure. After computing E (Y )
we add a line for (Y − µ)2 .
Y -3 3
p(y ) 0.5 0.5
(Y − µ)2 9 9
E(Y) = 0.5(−3) + 0.5(3) = 0.   E((Y − µ)²) = 0.5(9) + 0.5(9) = 9,
therefore Var(Y) = 9 ⇒ σY = 3.
W 10 20 30 40 50
p(w ) 0.1 0.2 0.4 0.2 0.1
(W − µ)2 400 100 0 100 400
We compute E (W ) = 1 + 4 + 12 + 8 + 5 = 30 and add a line to the table
for (W − µ)2 . Then

Var(W) = E((W − µ)²) = .1(400) + .2(100) + .4(0) + .2(100) + .1(400) = 120


σW = √120 = 10√1.2.
MIT18_05S14_class5_slides 127
Note: Comparing Y and W , we see that scale matters for variance.

January 1, 2017 8 / 26
Concept question

True or false: If Var(X ) = 0 then X is constant.

1. True 2. False
answer: True. If X can take more than one value with positive probability,
then Var(X) will be a sum of positive terms. So X is constant if and only
if Var(X) = 0.

MIT18_05S14_class5_slides 128
January 1, 2017 9 / 26
Algebra with variances

If a and b are constants then

Var(aX + b) = a2 Var(X ), σaX +b = |a| σX .

If X and Y are independent random variables then

Var(X + Y ) = Var(X ) + Var(Y ).

MIT18_05S14_class5_slides 129
January 1, 2017 10 / 26
Board questions

1. Prove: if X ∼ Bernoulli(p) then Var(X ) = p(1 − p).

2. Prove: if X ∼ bin(n, p) then Var(X ) = n p(1 − p).

3. Suppose X1 , X2 , . . . , Xn are independent and all have


the same standard deviation σ = 2. Let X be the average
of X1 , . . . , Xn .

What is the standard deviation of X ?

Solution on next slide

MIT18_05S14_class5_slides 130
January 1, 2017 11 / 26
Solution

1. For X ∼ Bernoulli(p) we use a table. (We know E (X ) = p.)

X           0        1
p(x)       1 − p     p
(X − µ)²    p²      (1 − p)²

Var(X) = E((X − µ)²) = (1 − p)p² + p(1 − p)² = p(1 − p)

2. X ∼ bin(n, p) means X is the sum of n independent Bernoulli(p)


random variables X1 , X2 , . . . , Xn . For independent variables, the variances
add. Since Var(Xj ) = p(1 − p) we have

Var(X) = Var(X1) + Var(X2) + . . . + Var(Xn) = np(1 − p).

continued on next slide


MIT18_05S14_class5_slides 131
January 1, 2017 12 / 26
Solution continued

3. Since the variables are independent, we have

Var(X1 + . . . + Xn ) = 4n.

X is the sum scaled by 1/n and the rule for scaling is


Var(aX ) = a2 Var(X ), so

Var(X) = Var((X1 + · · · + Xn)/n) = (1/n²) Var(X1 + . . . + Xn) = 4/n.

This implies σX = 2/√n.
Note: this says that the average of n independent measurements varies
less than the individual measurements.
MIT18_05S14_class5_slides 132
January 1, 2017 13 / 26
Continuous random variables

Continuous range of values:

[0, 1], [a, b], [0, ∞), (−∞, ∞).

Probability density function (pdf)


f(x) ≥ 0;   P(c ≤ X ≤ d) = ∫_c^d f(x) dx.

Units for the pdf are prob. / (unit of x).
Cumulative distribution function (cdf)

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt.

MIT18_05S14_class5_slides 133
January 1, 2017 14 / 26
Visualization

[Figure: pdf f(x) with the area between c and d shaded, representing P(c ≤ X ≤ d); pdf f(x) with the area to the left of x shaded, representing the cdf F(x) = P(X ≤ x)]

MIT18_05S14_class5_slides 134
January 1, 2017 15 / 26
Properties of the cdf

(Same as for discrete distributions)

(Definition) F (x) = P(X ≤ x).


0 ≤ F (x) ≤ 1.
non-decreasing.
0 to the left: lim_{x→−∞} F(x) = 0.
1 to the right: lim_{x→∞} F(x) = 1.
P(c < X ≤ d) = F (d) − F (c).
F ' (x) = f (x).
MIT18_05S14_class5_slides 135
January 1, 2017 16 / 26
Board questions

1. Suppose X has range [0, 2] and pdf f (x) = cx 2 .


(a) What is the value of c.
(b) Compute the cdf F (x).
(c) Compute P(1 ≤ X ≤ 2).

2. Suppose Y has range [0, b] and cdf F (y ) = y 2 /9.


(a) What is b?
(b) Find the pdf of Y .
Solution on next slide

MIT18_05S14_class5_slides 136
January 1, 2017 17 / 26
Solution

1a. Total probability must be 1. So

∫_0^2 f(x) dx = ∫_0^2 cx² dx = c · 8/3 = 1 ⇒ c = 3/8.

1b. The pdf f (x) is 0 outside of [0, 2] so for 0 ≤ x ≤ 2 we have


F(x) = ∫_0^x cu² du = (c/3) x³ = x³/8.

F(x) is 0 for x < 0 and 1 for x > 2.


Z 2
1c. We could compute the probability as f (x) dx, but rather than redo
1
the integral let’s use the cdf:

1 7
P(1 ≤ X ≤ 2) = F (2) − F (1) = 1 − = .
8 8
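As a sanity check, both computations can be confirmed numerically in R with integrate (a quick sketch, not needed for the solution):

integrate(function(x) (3/8) * x^2, 0, 2)   # total probability: should be 1
integrate(function(x) (3/8) * x^2, 1, 2)   # P(1 <= X <= 2): should be 7/8 = 0.875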
MIT18_05S14_class5_slides 137
Continued on next slide

January 1, 2017 18 / 26
Solution continued

2a. Since the total probability is 1, we have

F(b) = 1  ⇒  b²/9 = 1  ⇒  b = 3.

2b. f(y) = F′(y) = 2y/9.

MIT18_05S14_class5_slides 138
January 1, 2017 19 / 26
Concept questions

Suppose X is a continuous random variable.

(a) What is P(a ≤ X ≤ a)?

(b) What is P(X = 0)?

(c) Does P(X = a) = 0 mean X never equals a?


answer: (a) 0
(b) 0
(c) No. For a continuous distribution any single value has probability 0.
Only a range of values has non-zero probability.
MIT18_05S14_class5_slides 139
January 1, 2017 20 / 26
Concept question

Which of the following are graphs of valid cumulative distribution


functions?

Add the numbers of the valid cdf’s and click that number.
answer: Test 2 and Test 3.

MIT18_05S14_class5_slides 140
January 1, 2017 21 / 26
Solution

Test 1 is not a cdf: it takes negative values, but probabilities are positive.

Test 2 is a cdf: it increases from 0 to 1.

Test 3 is a cdf: it increases from 0 to 1.

Test 4 is not a cdf because it decreases. A cdf must be non-decreasing

since it represents accumulated probability.

MIT18_05S14_class5_slides 141
January 1, 2017 22 / 26
Exponential Random Variables

Parameter: λ (called the rate parameter).


Range: [0, ∞).
Notation: exponential(λ) or exp(λ).
Density: f (x) = λe−λx for 0 ≤ x.
Models: Waiting time
[Figure: exponential pdf f(x) = λe^{−λx} with the region P(3 < X < 7) shaded, and cdf F(x) = 1 − e^{−x/10}]

Continuous analogue of geometric distribution –memoryless!

MIT18_05S14_class5_slides 142
January 1, 2017 23 / 26
Board question
I’ve noticed that taxis drive past 77 Mass. Ave. on the average of
once every 10 minutes.
Suppose time spent waiting for a taxi is modeled by an exponential
random variable
X ∼ Exponential(1/10);  f(x) = (1/10) e^{−x/10}
(a) Sketch the pdf of this distribution
(b) Shade the region which represents the probability of waiting
between 3 and 7 minutes
(c) Compute the probability of waiting between between 3 and 7
minutes for a taxi
(d) Compute and sketch the cdf.
MIT18_05S14_class5_slides 143
January 1, 2017 24 / 26
Solution

Sketches for (a), (b), (d)


[Figure: pdf f(x) = λe^{−λx} with the region P(3 < X < 7) shaded (parts (a), (b)), and cdf F(x) = 1 − e^{−x/10} (part (d))]

(c)

P(3 < X < 7) = ∫_3^7 (1/10) e^{−x/10} dx = [−e^{−x/10}]_3^7 = e^{−3/10} − e^{−7/10} ≈ 0.244
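The same value can be checked against R's built-in exponential cdf (pexp uses the rate parameterization, here rate = 1/10):

pexp(7, rate = 1/10) - pexp(3, rate = 1/10)   # P(3 < X < 7), approximately 0.244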

MIT18_05S14_class5_slides 144
January 1, 2017 25 / 26
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class5_slides 145
Continuous Expectation and Variance,

the Law of Large Numbers,

and the Central Limit Theorem

18.05 Spring 2014

[Figure: density histogram of draws from a standard normal distribution with the N(0,1) pdf overlaid]

Class 6 Slides with Solutions: Expectation, Variance, Law of Large Numbers and Central Limit Theorem 146
January 1, 2017 1 / 31
Expected value

Expected value: measure of location, central tendency


X continuous with range [a, b] and pdf f(x):

E(X) = ∫_a^b x f(x) dx.

X discrete with values x1, . . . , xn and pmf p(xi):

E(X) = Σ_{i=1}^n xi p(xi).

View these as essentially the same formulas.


MIT18_05S14_class6_slides 147
January 1, 2017 2 / 31
Variance and standard deviation
Standard deviation: measure of spread, scale
For any random variable X with mean µ
Var(X) = E((X − µ)²),  σ = √Var(X)

X continuous with range [a, b] and pdf f(x):

Var(X) = ∫_a^b (x − µ)² f(x) dx.

X discrete with values x1, . . . , xn and pmf p(xi):

Var(X) = Σ_{i=1}^n (xi − µ)² p(xi).
View these as essentially the same formulas.

MIT18_05S14_class6_slides 148
January 1, 2017 3 / 31
Properties

Properties: (the same for discrete and continuous)

1. E (X + Y ) = E (X ) + E (Y ).
2. E (aX + b) = aE (X ) + b.
3. If X and Y are independent then
Var(X + Y ) = Var(X ) + Var(Y ).
4. Var(aX + b) = a2 Var(X ).
5. Var(X ) = E (X 2 ) − E (X )2 .

MIT18_05S14_class6_slides 149
January 1, 2017 4 / 31
Board question

The random variable X has range [0,1] and pdf cx 2 .


(a) Find c.
(b) Find the mean, variance and standard deviation of X .
(c) Find the median value of X .
(d) Suppose X1 , . . . X16 are independent
identically-distributed copies of X . Let X be their
average. What is the standard deviation of X ?
(e) Suppose Y = X 4 . Find the pdf of Y .
answer: See next slides.
MIT18_05S14_class6_slides 150
January 1, 2017 5 / 31
Solution

(a) Total probability is 1: ∫_0^1 cx² dx = 1  ⇒  c = 3.

(b) µ = ∫_0^1 3x³ dx = 3/4.
σ² = ∫_0^1 (x − 3/4)² 3x² dx = 3/5 − 9/8 + 9/16 = 3/80.
σ = √(3/80) = (1/4)√(3/5) ≈ 0.194

(c) Set F(q0.5) = 0.5 and solve for q0.5: F(x) = ∫_0^x 3u² du = x³. Therefore,
F(q0.5) = q0.5³ = 0.5. We get q0.5 = (0.5)^{1/3}.

(d) Because they are independent,

Var(X1 + . . . + X16) = Var(X1) + Var(X2) + . . . + Var(X16) = 16 Var(X).

Thus, Var(X̄) = 16 Var(X)/16² = Var(X)/16. Finally, σ_X̄ = σ_X/4 = 0.194/4.
MIT18_05S14_class6_slides 151
January 1, 2017 6 / 31
Solution continued

(e) Method 1 use the cdf:


F_Y(y) = P(X⁴ < y) = P(X < y^{1/4}) = F_X(y^{1/4}) = y^{3/4}.

Now differentiate: f_Y(y) = F_Y′(y) = (3/4) y^{−1/4}.

Method 2 use the pdf: We have

y = x⁴  ⇒  dy = 4x³ dx  ⇒  dx = dy/(4y^{3/4})

This implies f_X(x) dx = f_X(y^{1/4}) dy/(4y^{3/4}) = 3y^{2/4} dy/(4y^{3/4}) = 3/(4y^{1/4}) dy

Therefore f_Y(y) = 3/(4y^{1/4})

MIT18_05S14_class6_slides 152
January 1, 2017 7 / 31
Quantiles
Quantiles give a measure of location.
[Figure: standard normal pdf φ(z) with left tail area = prob. = 0.6 shaded, and cdf Φ(z) showing F(q0.6) = 0.6; in both, q0.6 = 0.253]

MIT18_05S14_class6_slides
q0.6 : left tail area = 0.6 ⇔ F (q0.6 ) = 0.6 153
January 1, 2017 8 / 31
Concept question
Each of the curves is the density for a given random variable. The
median of the black plot is always at q. Which density has the
greatest median?

1. Black 2. Red 3. Blue


4. All the same 5. Impossible to tell
[Figures (A) and (B): in each, the three density curves (black, red, blue) coincide up to the point q]
answer: See next frame.

MIT18_05S14_class6_slides 154
January 1, 2017 9 / 31
Solution
[Figure (A): the three curves coincide up to q; the area to the left of the median (= 0.5) is shaded]

Plot A: 4. All three medians are the same. Remember that probability is
computed as the area under the curve. By definition the median q is the
point where the shaded area in Plot A is 0.5. Since all three curves coincide
up to q, the shaded area in the figure represents a probability of
0.5 for all three densities.

Continued on next slide.


MIT18_05S14_class6_slides 155
January 1, 2017 10 / 31
Solution continued

(B)

Plot B: 2. The red density has the greatest median. Since q is the
median for the black density, the shaded area in Plot B is .5. Therefore
the area under the blue curve (up to q) is greater than .5 and that under
the red curve is less than .5. This means the median of the blue density is
to the left of q (you need less area) and the median of the red density is to
the right of q (you need more area).
MIT18_05S14_class6_slides 156
January 1, 2017 11 / 31
Law of Large Numbers (LoLN)
Informally: An average of many measurements is more accurate
than a single measurement.
Formally: Let X1 , X2 , . . . be i.i.d. random variables all with mean
µ and standard deviation σ.
Let

X̄n = (X1 + X2 + . . . + Xn)/n = (1/n) Σ_{i=1}^n Xi.
Then for any (small number) a, we have

lim_{n→∞} P(|X̄n − µ| < a) = 1.

No guarantees but: By choosing n large enough we can make
X̄n as close as we want to µ with probability close to 1.
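A short R sketch of the LoLN in action, using rolls of a fair six-sided die (µ = 3.5); the choice of 10000 rolls is arbitrary:

# running average of fair die rolls settles near the mean 3.5
set.seed(1)
rolls <- sample(1:6, 10000, replace = TRUE)
running_avg <- cumsum(rolls) / seq_along(rolls)
running_avg[c(10, 100, 1000, 10000)]   # closer and closer to 3.5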
MIT18_05S14_class6_slides 157
January 1, 2017 12 / 31
Concept Question: Desperation

You have $100. You need $1000 by tomorrow morning.


Your only way to get it is to gamble.
If you bet $k, you either win $k with probability p or lose $k with
probability 1 − p.

Maximal strategy: Bet as much as you can, up to what you need,

each time.

Minimal strategy: Make a small bet, say $5, each time.

1. If p = 0.45, which is the better strategy?


(a) Maximal (b) Minimal (c) They are the same

2. If p = 0.8, which is the better strategy?


(a) Maximal (b) Minimal (c) They are the same
answer: On next slide
MIT18_05S14_class6_slides 158
January 1, 2017 13 / 31
Solution to previous two problems
answer: If p = 0.45 use maximal strategy; If p = 0.8 use minimal strategy.
If you use the minimal strategy the law of large numbers says your average
winnings per bet will almost certainly be the expected winnings of one bet.
The two tables represent p = 0.45 and p = 0.8 respectively.

Win  −5    5        Win  −5    5
p   0.55  0.45      p   0.2   0.8

The expected value of a $5 bet when p = 0.45 is −$0.50. Since on average
you will lose $0.50 per bet, you want to avoid making a lot of bets. You go
for broke and hope to win big a few times in a row. It's not very likely, but
the maximal strategy is your best bet.
The expected value when p = 0.8 is $3. Since this is positive you'd like to
make a lot of bets and let the law of large numbers (practically) guarantee
you will win an average of $3 per bet. So you use the minimal strategy.
MIT18_05S14_class6_slides 159
January 1, 2017 14 / 31
Histograms
Made by ‘binning’ data.

Frequency: height of bar over bin = number of data points in bin.

Density: area of bar is the fraction of all data points that lie in the

bin. So, total area is 1.

[Figure: a frequency histogram (bar heights 1 to 4) and the corresponding density histogram (bar heights 0.2 to 0.8), over bins centered at 0.25, 0.75, 1.25, 1.75, 2.25]

Check that the total area of the histogram on the right is 1.


160
MIT18_05S14_class6_slides
January 1, 2017 15 / 31
Board question

1. Make both a frequency and density histogram from the data below.
Use bins of width 0.5 starting at 0. The bins should be right closed.

1 1.2 1.3 1.6 1.6


2.1 2.2 2.6 2.7 3.1
3.2 3.4 3.8 3.9 3.9

2. Same question using unequal width bins with edges 0, 1, 3, 4.

3. For question 2, why does the density histogram give a more


reasonable representation of the data.

MIT18_05S14_class6_slides 161
January 1, 2017 16 / 31
Solution

[Figure: frequency and density histograms of the data using equal width bins of 0.5]

[Figure: frequency and density histograms of the data using unequal width bins with edges 0, 1, 3, 4]
MIT18_05S14_class6_slides 162
January 1, 2017 17 / 31
LoLN and histograms
LoLN implies density histogram converges to pdf:

Histogram with bin width 0.1 showing 100000 draws from


a standard normal distribution. Standard normal pdf is
overlaid in red.
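A sketch of R code that produces a picture like this one (breaks = 100 is just a convenient number of bins):

set.seed(2)
z <- rnorm(100000)                          # 100000 standard normal draws
hist(z, breaks = 100, freq = FALSE)         # density histogram
curve(dnorm(x), add = TRUE, col = "red")    # overlay the N(0,1) pdf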
MIT18_05S14_class6_slides 163
January 1, 2017 18 / 31
Standardization

Random variable X with mean µ and standard deviation σ.

Standardization: Y = (X − µ)/σ.
Y has mean 0 and standard deviation 1.
Standardizing any normal random variable produces
the standard normal.
If X ≈ normal then standardized X ≈ stand. normal.
We reserve Z to mean a standard normal random
variable.
MIT18_05S14_class6_slides 164
January 1, 2017 19 / 31
Concept Question: Standard Normal
[Figure: normal pdf with regions marked: within 1·σ ≈ 68%, within 2·σ ≈ 95%, within 3·σ ≈ 99%]

1. P(−1 < Z < 1) is


(a) 0.025 (b) 0.16 (c) 0.68 (d) 0.84 (e) 0.95

2. P(Z > 2)
(a) 0.025 (b) 0.16 (c) 0.68 (d) 0.84 (e) 0.95
answer: 1c, 2a
MIT18_05S14_class6_slides 165
January 1, 2017 20 / 31
Central Limit Theorem
Setting: X1 , X2 , . . . i.i.d. with mean µ and standard dev. σ.
For each n:
X̄n = (X1 + X2 + . . . + Xn)/n   (average)
Sn = X1 + X2 + . . . + Xn   (sum)

Conclusion: For large n:

X̄n ≈ N(µ, σ²/n)

Sn ≈ N(nµ, nσ²)

Standardized Sn or X̄n ≈ N(0, 1)

That is, (Sn − nµ)/(σ√n) = (X̄n − µ)/(σ/√n) ≈ N(0, 1).
MIT18_05S14_class6_slides 166
January 1, 2017 21 / 31
CLT: pictures
Standardized average of n i.i.d. uniform random variables
with n = 1, 2, 4, 12.
[Figure: four plots, one for each n, with the densities approaching the standard normal curve]
MIT18_05S14_class6_slides 167
January 1, 2017 22 / 31
CLT: pictures 2
The standardized average of n i.i.d. exponential random

variables with n = 1, 2, 8, 64.


[Figure: four plots, one for each n, with the densities approaching the standard normal curve]
MIT18_05S14_class6_slides 168
January 1, 2017 23 / 31
CLT: pictures 3
The standardized average of n i.i.d. Bernoulli(0.5)

random variables with n = 1, 2, 12, 64.


[Figure: four plots, one for each n, with the distributions approaching the standard normal curve]
MIT18_05S14_class6_slides 169
January 1, 2017 24 / 31
CLT: pictures 4
The (non-standardized) average of n Bernoulli(0.5)
random variables, with n = 4, 12, 64. (Spikier.)
[Figure: three plots; the distributions concentrate more and more tightly around 0.5 as n grows]
MIT18_05S14_class6_slides 170
January 1, 2017 25 / 31
Table Question: Sampling from the standard normal
distribution
As a table, produce a single random sample from (an approximate)

standard normal distribution.

The table is allowed nine rolls of the 10-sided die.

Note: µ = 5.5 and σ 2 = 8.25 for a single 10-sided die.

Hint: CLT is about averages.


answer: The average of 9 rolls is a sample from the average of 9
independent random variables. The CLT says this average is approximately
normal with µ = 5.5 and σ = √(8.25/9) ≈ 0.96.
If x̄ is the average of 9 rolls then standardizing we get

z = (x̄ − 5.5)/0.96

is (approximately) a sample from N(0, 1).
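The same recipe written as a short R sketch (here the die rolls are simulated rather than rolled at a table):

set.seed(3)
rolls <- sample(1:10, 9, replace = TRUE)     # nine rolls of a 10-sided die
z <- (mean(rolls) - 5.5) / sqrt(8.25 / 9)    # standardize the average
z                                            # one (approximate) N(0,1) sample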
MIT18_05S14_class6_slides 171
January 1, 2017 26 / 31
Board Question: CLT

1. Carefully write the statement of the central limit theorem.

2. To head the newly formed US Dept. of Statistics, suppose that


50% of the population supports Ani, 25% supports Ruthi, and the
remaining 25% is split evenly between Efrat, Elan, David and Jerry.
A poll asks 400 random people who they support. What is the
probability that at least 55% of those polled prefer Ani?

3. What is the probability that less than 20% of those polled prefer
Ruthi?
answer: On next slide.

MIT18_05S14_class6_slides 172
January 1, 2017 27 / 31
Solution
answer: 2. Let A be the fraction polled who support Ani. So A is the
average of 400 Bernoulli(0.5) random variables. That is, let Xi = 1 if the
ith person polled prefers Ani and 0 if not, so A = average of the Xi .
The question asks for the probability A > 0.55.
Each Xi has µ = 0.5 and σ² = 0.25. So, E(A) = 0.5 and
σ_A² = 0.25/400, or σ_A = 1/40 = 0.025.

Because A is the average of 400 Bernoulli(0.5) variables, the CLT says it is
approximately normal and standardizing gives

(A − 0.5)/0.025 ≈ Z
So
P(A > 0.55) ≈ P(Z > 2) ≈ 0.025
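For comparison, the normal approximation and the exact binomial probability can both be computed in R:

1 - pnorm(0.55, mean = 0.5, sd = 0.025)   # normal approximation, about 0.023
1 - pbinom(220, 400, 0.5)                 # exact P(more than 220 of 400 prefer Ani)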
Continued on next slide
MIT18_05S14_class6_slides 173
January 1, 2017 28 / 31
Solution continued

3. Let R be the fraction polled who support Ruthi.

The question asks for the probability the R < 0.2.

Similar to problem 2, R is the average of 400 Bernoulli(0.25) random

variables. So

E(R) = 0.25 and σ_R² = (0.25)(0.75)/400  ⇒  σ_R = √3/80.

So (R − 0.25)/(√3/80) ≈ Z. So,

P(R < 0.2) ≈ P(Z < −4/√3) ≈ 0.0105

MIT18_05S14_class6_slides 174
January 1, 2017 29 / 31
Bonus problem

Not for class. Solution will be posted with the slides.


An accountant rounds to the nearest dollar. We’ll assume
the error in rounding is uniform on [-0.5, 0.5]. Estimate
the probability that the total error in 300 entries is more
than $5.
answer: Let Xj be the error in the j th entry, so, Xj ∼ U(−0.5, 0.5).
We have E (Xj ) = 0 and Var(Xj ) = 1/12.
The total error S = X1 + . . . + X300 has E (S) = 0,
Var(S) = 300/12 = 25, and σS = 5.
Standardizing we get, by the CLT, S/5 is approximately standard normal.
That is, S/5 ≈ Z .
So P(S < −5 or S > 5) ≈ P(Z < −1 or Z > 1) ≈ 0.32 .
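A Monte Carlo sketch of this estimate in R (10000 simulated ledgers of 300 rounding errors each):

set.seed(4)
total_error <- replicate(10000, sum(runif(300, -0.5, 0.5)))
mean(abs(total_error) > 5)   # roughly 0.32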
MIT18_05S14_class6_slides 175
January 1, 2017 30 / 31
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class6_slides 176
Joint Distributions, Independence

Covariance and Correlation

18.05 Spring 2014

X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36

Class 7 Slides with Solutions: Joint Distributions: Independence, 177


January 1, 2017 1 / 36
Joint Distributions
X and Y are jointly distributed random variables.
Discrete: Probability mass function (pmf):

p(xi , yj )

Continuous: probability density function (pdf):

f (x, y )

Both: cumulative distribution function (cdf):

F (x, y ) = P(X ≤ x, Y ≤ y )

MIT18_05S14_class7_slides 178
January 1, 2017 2 / 36
Discrete joint pmf: example 1

Roll two dice: X = # on first die, Y = # on second die

X takes values in 1, 2, . . . , 6, Y takes values in 1, 2, . . . , 6

Joint probability table:


X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36

pmf: p(i, j) = 1/36 for any i and j between 1 and 6.

MIT18_05S14_class7_slides 179
January 1, 2017 3 / 36
Discrete joint pmf: example 2

Roll two dice: X = # on first die, T = total on both dice

X\T 2 3 4 5 6 7 8 9 10 11 12
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36

MIT18_05S14_class7_slides 180
January 1, 2017 4 / 36
Continuous joint distributions
X takes values in [a, b], Y takes values in [c, d]
(X , Y ) takes values in [a, b] × [c, d].
Joint probability density function (pdf) f (x, y )
f (x, y ) dx dy is the probability of being in the small square.
[Figure: the rectangle [a, b] × [c, d] with a small box of width dx and height dy inside it; Prob. = f(x, y) dx dy]
MIT18_05S14_class7_slides 181
January 1, 2017 5 / 36
Properties of the joint pmf and pdf
Discrete case: probability mass function (pmf)
1. 0 ≤ p(xi , yj ) ≤ 1
2. Total probability is 1.
Σ_{i=1}^n Σ_{j=1}^m p(xi, yj) = 1

Continuous case: probability density function (pdf)


1. 0 ≤ f (x, y )
2. Total probability is 1.

∫_c^d ∫_a^b f(x, y) dx dy = 1

Note: f(x, y) can be greater than 1: it is a density not a probability.
MIT18_05S14_class7_slides 182

January 1, 2017 6 / 36
Example: discrete events
Roll two dice: X = # on first die, Y = # on second die.

Consider the event: A = ‘Y − X ≥ 2’

Describe the event A and find its probability.

answer: We can describe A as a set of (X , Y ) pairs:

A = {(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 5), (3, 6), (4, 6)}.

Or we can visualize it by shading the table:


X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36

MIT18_05S14_class7_slides
P(A) = sum of probabilities in shaded cells = 10/36.
183
January 1, 2017 7 / 36
Example: continuous events
Suppose (X , Y ) takes values in [0, 1] × [0, 1].

Uniform density f (x, y ) = 1.

Visualize the event ‘X > Y ’ and find its probability.

answer:
[Figure: the unit square with the region below the diagonal y = x shaded, i.e. where X > Y]

The event takes up half the square. Since the density is uniform this
is half the probability. That is, P(X > Y ) = 0.5
MIT18_05S14_class7_slides 184
January 1, 2017 8 / 36
Cumulative distribution function


F(x, y) = P(X ≤ x, Y ≤ y) = ∫_c^y ∫_a^x f(u, v) du dv.

f(x, y) = ∂²F/∂x∂y.
Properties
1. F (x, y ) is non-decreasing. That is, as x or y increases F (x, y )
increases or remains constant.
2. F (x, y ) = 0 at the lower left of its range.
If the lower left is (−∞, −∞) then this means

lim F (x, y ) = 0.
(x,y )→(−∞,−∞)

3. F (x, y ) = 1 at the upper right of its range.

MIT18_05S14_class7_slides 185
January 1, 2017 9 / 36
Marginal pmf and pdf
Roll two dice: X = # on first die, T = total on both dice.
The marginal pmf of X is found by summing the rows. The marginal
pmf of T is found by summing the columns

X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(tj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1

For continuous distributions the marginal pdf fX (x) is found by


integrating out the y . Likewise for fY (y ).
MIT18_05S14_class7_slides 186
January 1, 2017 10 / 36
Board question

Suppose X and Y are random variables and


(X , Y ) takes values in [0, 1] × [0, 1].
the pdf is f(x, y) = (3/2)(x² + y²).
1 Show f (x, y ) is a valid pdf.
2 Visualize the event A = ‘X > 0.3 and Y > 0.5’. Find its
probability.
3 Find the cdf F (x, y ).
4 Find the marginal pdf fX (x). Use this to find P(X < 0.5).
5 Use the cdf F (x, y ) to find the marginal cdf FX (x) and
P(X < 0.5).
6 See next slide
MIT18_05S14_class7_slides 187
January 1, 2017 11 / 36
Board question continued

6. (New scenario) From the following table compute F (3.5, 4).

X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36

answer: See next slide

MIT18_05S14_class7_slides 188
January 1, 2017 12 / 36
Solution

answer: 1. Validity: Clearly f (x, y ) is positive. Next we must show that


total probability = 1:

∫_0^1 ∫_0^1 (3/2)(x² + y²) dx dy = ∫_0^1 [x³/2 + (3/2)xy²]_0^1 dy = ∫_0^1 (1/2 + (3/2)y²) dy = 1.

2. Here’s the visualization


[Figure: the unit square with the region x > 0.3, y > 0.5 shaded]

The pdf is not constant so we must compute an integral

P(A) = ∫_{0.3}^1 ∫_{0.5}^1 (3/2)(x² + y²) dy dx = ∫_{0.3}^1 [(3/2)x²y + y³/2]_{0.5}^1 dx
MIT18_05S14_class7_slides
(continued)
189
January 1, 2017 13 / 36
Solutions 2, 3, 4, 5

2. (continued) = ∫_{0.3}^1 (3x²/4 + 7/16) dx = 0.5495

3. F(x, y) = ∫_0^y ∫_0^x (3/2)(u² + v²) du dv = (x³y)/2 + (xy³)/2.

4.
f_X(x) = ∫_0^1 (3/2)(x² + y²) dy = [(3/2)x²y + y³/2]_0^1 = (3/2)x² + 1/2

P(X < 0.5) = ∫_0^{0.5} f_X(x) dx = ∫_0^{0.5} ((3/2)x² + 1/2) dx = [x³/2 + x/2]_0^{0.5} = 5/16.

5. To find the marginal cdf F_X(x) we simply take y to be the top of the
y-range and evaluate F: F_X(x) = F(x, 1) = (x³ + x)/2.

Therefore P(X < 0.5) = F_X(0.5) = (1/8 + 1/2)/2 = 5/16.
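A Monte Carlo sketch in R that checks the value of P(A) from part 2: uniform points on the unit square are weighted by the density f:

set.seed(5)
n <- 10^6
x <- runif(n); y <- runif(n)
w <- 1.5 * (x^2 + y^2)              # density values at the sampled points
sum(w[x > 0.3 & y > 0.5]) / n       # estimate of P(A), close to 0.5495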
MIT18_05S14_class7_slides 190
6. On next slide
January 1, 2017 14 / 36
Solution 6

6. F (3.5, 4) = P(X ≤ 3.5, Y ≤ 4).

X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36

Add the probability in the shaded squares: F (3.5, 4) = 12/36 = 1/3.

MIT18_05S14_class7_slides 191
January 1, 2017 15 / 36
Independence

Events A and B are independent if

P(A ∩ B) = P(A)P(B).

Random variables X and Y are independent if

F (x, y ) = FX (x)FY (y ).

Discrete random variables X and Y are independent if

p(xi , yj ) = pX (xi )pY (yj ).

Continuous random variables X and Y are independent if

f (x, y ) = fX (x)fY (y ).

MIT18_05S14_class7_slides 192
January 1, 2017 16 / 36
Concept question: independence I
Roll two dice: X = value on first, Y = value on second

X\Y 1 2 3 4 5 6 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 1/6
2 1/36 1/36 1/36 1/36 1/36 1/36 1/6
3 1/36 1/36 1/36 1/36 1/36 1/36 1/6
4 1/36 1/36 1/36 1/36 1/36 1/36 1/6
5 1/36 1/36 1/36 1/36 1/36 1/36 1/6
6 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/6 1/6 1/6 1/6 1/6 1/6 1

Are X and Y independent? 1. Yes 2. No


answer: 1. Yes. Every cell probability is the product of the marginal
probabilities.
MIT18_05S14_class7_slides 193
January 1, 2017 17 / 36
Concept question: independence II

Roll two dice: X = value on first, T = sum

X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1

Are X and Y independent? 1. Yes 2. No


answer: 2. No. The cells with probability zero are clearly not the product
of the marginal probabilities.
MIT18_05S14_class7_slides 194
January 1, 2017 18 / 36
Concept Question
Among the following pdf’s which are independent? (Each of the
ranges is a rectangle chosen so that ∫∫ f(x, y) dx dy = 1.)

(i) f(x, y) = 4x²y³.
(ii) f(x, y) = (1/2)(x³y + xy³).
(iii) f(x, y) = 6e^{−3x−2y}

Put a 1 for independent and a 0 for not-independent.

(a) 111 (b) 110 (c) 101 (d) 100

(e) 011 (f) 010 (g) 001 (h) 000

answer: (c). Explanation on next slide.


MIT18_05S14_class7_slides 195
January 1, 2017 19 / 36
Solution

(i) Independent. The variables can be separated: the marginal densities


are fX (x) = ax 2 and fY (y ) = by 3 for some constants a and b with ab = 4.

(ii) Not independent. X and Y are not independent because there is no


way to factor f (x, y ) into a product fX (x)fY (y ).

(iii) Independent. The variables can be separated: the marginal densities


are fX (x) = ae−3x and fY (y ) = be−2y for some constants a and b with
ab = 6.

MIT18_05S14_class7_slides 196
January 1, 2017 20 / 36
Covariance

Measures the degree to which two random variables vary together,


e.g. height and weight of people.

X , Y random variables with means µX and µY

Cov(X , Y ) = E ((X − µX )(Y − µY )).

MIT18_05S14_class7_slides 197
January 1, 2017 21 / 36
Properties of covariance

Properties
1. Cov(aX + b, cY + d) = acCov(X , Y ) for constants a, b, c, d.
2. Cov(X1 + X2 , Y ) = Cov(X1 , Y ) + Cov(X2 , Y ).
3. Cov(X , X ) = Var(X )
4. Cov(X , Y ) = E (XY ) − µX µY .
5. If X and Y are independent then Cov(X , Y ) = 0.
6. Warning: The converse is not true, if covariance is 0 the variables
might not be independent.

MIT18_05S14_class7_slides 198
January 1, 2017 22 / 36
Concept question

Suppose we have the following joint probability table.

Y \X -1 0 1 p(yj )
0 0 1/2 0 1/2
1 1/4 0 1/4 1/2
p(xi ) 1/4 1/2 1/4 1

At your table work out the covariance Cov(X , Y ).

Because the covariance is 0 we know that X and Y are independent

1. True 2. False

Key point: covariance measures the linear relationship between X and


Y . It can completely miss a quadratic or higher order relationship.
MIT18_05S14_class7_slides 199
January 1, 2017 23 / 36
Board question: computing covariance

Flip a fair coin 12 times.

Let X = number of heads in the first 7 flips

Let Y = number of heads on the last 7 flips.

Compute Cov(X , Y ).

MIT18_05S14_class7_slides 200
January 1, 2017 24 / 36
Solution

Use the properties of covariance.

Xi = the number of heads on the i th flip. (So Xi ∼ Bernoulli(.5).)

X = X1 + X2 + . . . + X7 and Y = X6 + X7 + . . . + X12 .
We know Var(Xi ) = 1/4. Therefore using Property 2 (linearity) of
covariance
Cov(X , Y ) = Cov(X1 + X2 + . . . + X7 , X6 + X7 + . . . + X12 )
= Cov(X1 , X6 ) + Cov(X1 , X7 ) + Cov(X1 , X8 ) + . . . + Cov(X7 , X12 )
Since the different tosses are independent we know
Cov(X1 , X6 ) = 0, Cov(X1 , X7 ) = 0, Cov(X1 , X8 ) = 0, etc.
Looking at the expression for Cov(X , Y ) there are only two non-zero terms

Cov(X, Y) = Cov(X6, X6) + Cov(X7, X7) = Var(X6) + Var(X7) = 1/2.
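A simulation sketch in R of the same covariance (100000 trials is an arbitrary choice):

set.seed(6)
sims <- replicate(100000, {
  flips <- rbinom(12, 1, 0.5)              # 12 fair coin flips
  c(sum(flips[1:7]), sum(flips[6:12]))     # heads in first 7, heads in last 7
})
cov(sims[1, ], sims[2, ])                  # close to 1/2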
MIT18_05S14_class7_slides 201
January 1, 2017 25 / 36
Correlation

Like covariance, but removes scale.

The correlation coefficient between X and Y is defined by

Cov(X , Y )
Cor(X , Y ) = ρ = .
σ X σY
Properties:
1. ρ is the covariance of the standardized versions of X
and Y .
2. ρ is dimensionless (it’s a ratio).
3. −1 ≤ ρ ≤ 1. ρ = 1 if and only if Y = aX + b with
a > 0 and ρ = −1 if and only if Y = aX + b with a < 0.
MIT18_05S14_class7_slides 202
January 1, 2017 26 / 36
Real-life correlations

Over time, ice cream consumption is correlated with the number of pool drownings.
In 1685 (and today) being a student is the most
dangerous profession.
In 90% of bar fights ending in a death the person who
started the fight died.
Hormone replacement therapy (HRT) is correlated
with a lower rate of coronary heart disease (CHD).

Discussion is on the next slides.

MIT18_05S14_class7_slides 203
January 1, 2017 27 / 36
Real-life correlations discussion

Ice cream does not cause drownings. Both are correlated with
summer weather.

In a study in 1685 of the ages and professions of deceased men, it was


found that the profession with the lowest average age of death was
“student.” But, being a student does not cause you to die at an early
age. Being a student means you are young. This is what makes the
average of those that die so low.

A study of fights in bars in which someone was killed found that, in


90% of the cases, the person who started the fight was the one who
died.
Of course, it’s the person who survived telling the story.

Continued on next slide


MIT18_05S14_class7_slides 204
January 1, 2017 28 / 36
(continued)

In a widely studied example, numerous epidemiological studies showed


that women who were taking combined hormone replacement therapy
(HRT) also had a lower-than-average incidence of coronary heart
disease (CHD), leading doctors to propose that HRT was protective
against CHD. But randomized controlled trials showed that HRT
caused a small but statistically significant increase in risk of CHD.
Re-analysis of the data from the epidemiological studies showed that
women undertaking HRT were more likely to be from higher
socio-economic groups (ABC1), with better-than-average diet and
exercise regimens. The use of HRT and decreased incidence of
coronary heart disease were coincident effects of a common cause (i.e.
the benefits associated with a higher socioeconomic status), rather
than cause and effect, as had been supposed.

MIT18_05S14_class7_slides 205
January 1, 2017 29 / 36
Correlation is not causation

Edward Tufte: ”Empirically observed covariation is a


necessary but not sufficient condition for causality.”

MIT18_05S14_class7_slides 206
January 1, 2017 30 / 36
Overlapping sums of uniform random variables

We made two random variables X and Y from overlapping sums of


uniform random variables

For example:

X = X1 + X2 + X3 + X4 + X5
Y = X3 + X4 + X5 + X6 + X7

These are sums of 5 of the Xi with 3 in common.

If we sum r of the Xi with s in common we name it (r , s).

Below are a series of scatterplots produced using R.
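For reference, a sketch of the kind of R code that could generate one such plot, here for the (5, 3) case (sums of 5 of the Xi with 3 in common), which is not one of the four panels shown:

set.seed(7)
n <- 1000
u <- matrix(runif(7 * n), nrow = 7)   # each column holds X1, ..., X7 for one sample
x <- colSums(u[1:5, ])                # X = X1 + ... + X5
y <- colSums(u[3:7, ])                # Y = X3 + ... + X7
plot(x, y, main = paste("(5, 3) sample_cor =", round(cor(x, y), 2)))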

MIT18_05S14_class7_slides 207
January 1, 2017 31 / 36
Scatter plots
[Figure: four scatter plots of (X, Y) samples:
(1, 0) cor=0.00, sample_cor=−0.07
(2, 1) cor=0.50, sample_cor=0.48
(5, 1) cor=0.20, sample_cor=0.21
(10, 8) cor=0.80, sample_cor=0.81]

MIT18_05S14_class7_slides 208
January 1, 2017 32 / 36
Concept question

Toss a fair coin 2n + 1 times. Let X be the number of


heads on the first n + 1 tosses and Y the number on the
last n + 1 tosses.

If n = 1000 then Cov(X , Y ) is:


(a) 0 (b) 1/4 (c) 1/2 (d) 1

(e) More than 1 (f) tiny but not 0


answer: (b) 1/4. This is computed in the answer to the next board
question.

MIT18_05S14_class7_slides 209
January 1, 2017 33 / 36
Board question
Toss a fair coin 2n + 1 times. Let X be the number of
heads on the first n + 1 tosses and Y the number on the
last n + 1 tosses.
Compute Cov(X , Y ) and Cor(X , Y ).
As usual let Xi = the number of heads on the i th flip, i.e. 0 or 1. Then
X = Σ_{i=1}^{n+1} Xi,   Y = Σ_{i=n+1}^{2n+1} Xi

X is the sum of n + 1 independent Bernoulli(1/2) random variables, so

µ_X = E(X) = (n + 1)/2, and Var(X) = (n + 1)/4.

Likewise, µ_Y = E(Y) = (n + 1)/2, and Var(Y) = (n + 1)/4.
MIT18_05S14_class7_slides 210
Continued on next slide.

January 1, 2017 34 / 36
Solution continued
Now,

Cov(X, Y) = Cov(Σ_{i=1}^{n+1} Xi, Σ_{j=n+1}^{2n+1} Xj) = Σ_{i=1}^{n+1} Σ_{j=n+1}^{2n+1} Cov(Xi, Xj).

Because the Xi are independent, the only non-zero term in the above sum
is Cov(X_{n+1}, X_{n+1}) = Var(X_{n+1}) = 1/4. Therefore,

Cov(X, Y) = 1/4.

We get the correlation by dividing by the standard deviations.

Cor(X, Y) = Cov(X, Y)/(σ_X σ_Y) = (1/4)/((n + 1)/4) = 1/(n + 1).

This makes sense: as n increases the correlation should decrease since the
contribution of the one flip they have in common becomes less important.
MIT18_05S14_class7_slides 211
January 1, 2017 35 / 36
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class7_slides 212
Review for Exam 1
18.05 Spring 2014

Class 8 Slides with Solutions: Review for Exam 1 213


January 1, 2017 1 / 18
Normal Table
Standard normal table of left tail probabilities.
z      Φ(z)      z      Φ(z)      z      Φ(z)      z      Φ(z)

-4.00 0.0000 -2.00 0.0228 0.00 0.5000 2.00 0.9772


-3.95 0.0000 -1.95 0.0256 0.05 0.5199 2.05 0.9798
-3.90 0.0000 -1.90 0.0287 0.10 0.5398 2.10 0.9821
-3.85 0.0001 -1.85 0.0322 0.15 0.5596 2.15 0.9842
-3.80 0.0001 -1.80 0.0359 0.20 0.5793 2.20 0.9861
-3.75 0.0001 -1.75 0.0401 0.25 0.5987 2.25 0.9878
-3.70 0.0001 -1.70 0.0446 0.30 0.6179 2.30 0.9893
-3.65 0.0001 -1.65 0.0495 0.35 0.6368 2.35 0.9906
-3.60 0.0002 -1.60 0.0548 0.40 0.6554 2.40 0.9918
-3.55 0.0002 -1.55 0.0606 0.45 0.6736 2.45 0.9929
-3.50 0.0002 -1.50 0.0668 0.50 0.6915 2.50 0.9938
-3.45 0.0003 -1.45 0.0735 0.55 0.7088 2.55 0.9946
-3.40 0.0003 -1.40 0.0808 0.60 0.7257 2.60 0.9953
-3.35 0.0004 -1.35 0.0885 0.65 0.7422 2.65 0.9960
-3.30 0.0005 -1.30 0.0968 0.70 0.7580 2.70 0.9965
MIT18_05S14_class8_slides 214
-3.25 0.0006 -1.25 0.1056 0.75 0.7734 2.75 0.9970
January 1, 2017 2 / 18
Topics

1. Sets.
2. Counting.
3. Sample space, outcome, event, probability function.
4. Probability: conditional probability, independence, Bayes’ theorem.
5. Discrete random variables: events, pmf, cdf.
6. Bernoulli(p), binomial(n, p), geometric(p), uniform(n)
7. E (X ), Var(X ), σ
8. Continuous random variables: pdf, cdf.
9. uniform(a,b), exponential(λ), normal(µ,σ 2 )
10. Transforming random variables.
11. Quantiles.
12. Central limit theorem, law of large numbers, histograms.
13. Joint distributions: pmf, pdf, cdf, covariance and correlation.
MIT18_05S14_class8_slides 215
January 1, 2017 3 / 18
Sets and counting

Sets:

∅, union, intersection, complement Venn diagrams,

products

Counting:

inclusion-exclusion, rule of product,

permutations nPk , combinations nCk = (n choose k)

MIT18_05S14_class8_slides 216
January 1, 2017 4 / 18
Probability

Sample space, outcome, event, probability function.

Rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Special case: P(Ac ) = 1 − P(A)

(A and B disjoint ⇒ P(A ∪ B) = P(A) + P(B).)

Conditional probability, multiplication rule, trees, law

of total probability, independence

Bayes’ theorem, base rate fallacy

MIT18_05S14_class8_slides 217
January 1, 2017 5 / 18
Random variables, expectation and variance

Discrete random variables: events, pmf, cdf


Bernoulli(p), binomial(n, p), geometric(p), uniform(n)
E (X ), meaning, algebraic properties, E (h(X ))
Var(X ), meaning, algebraic properties
Continuous random variables: pdf, cdf
uniform(a,b), exponential(λ), normal(µ,σ)
Transforming random variables
Quantiles
MIT18_05S14_class8_slides 218
January 1, 2017 6 / 18
Central limit theorem

Law of large numbers averages and histograms

Central limit theorem

MIT18_05S14_class8_slides 219
January 1, 2017 7 / 18
Joint distributions

Joint pmf, pdf, cdf.

Marginal pmf, pdf, cdf

Covariance and correlation.

MIT18_05S14_class8_slides 220
January 1, 2017 8 / 18
Hospitals (binomial, CLT, etc)
A certain town is served by two hospitals.
Larger hospital: about 45 babies born each day.
Smaller hospital about 15 babies born each day.
For a period of 1 year, each hospital recorded the days on which
more than 60% of the babies born were boys.
(a) Which hospital do you think recorded more such days?
(i) The larger hospital. (ii) The smaller hospital.
(iii) About the same (that is, within 5% of each other).

(b) Assume exactly 45 and 15 babies are born at the hospitals each
day. Let Li (resp., Si ) be the Bernoulli random variable which takes
the value 1 if more than 60% of the babies born in the larger (resp.,
smaller) hospital on the i th day were boys. Determine the distribution
of Li and of Si .
MIT18_05S14_class8_slides 221
Continued on next slide
January 1, 2017 9 / 18
Hospital continued

(c) Let L (resp., S) be the number of days on which more than 60%
of the babies born in the larger (resp., smaller) hospital were boys.
What type of distribution do L and S have? Compute the expected
value and variance in each case.

(d) Via the CLT, approximate the 0.84 quantile of L (resp., S).
Would you like to revise your answer to part (a)?

(e) What is the correlation of L and S? What is the joint pmf of L


and S? Visualize the region corresponding to the event L > S.
Express P(L > S) as a double sum.
Solution on next slide.

MIT18_05S14_class8_slides 222
January 1, 2017 10 / 18
Solution

answer: (a) When this question was asked in a study, the number of
undergraduates who chose each option was 21, 21, and 55, respectively.
This shows a lack of intuition for the relevance of sample size on deviation
from the true mean (i.e., variance).
(b) The random variable XL , giving the number of boys born in the larger
hospital on day i, is governed by a Bin(45, .5) distribution. So Li has a
Ber(pL ) distribution with
p_L = P(X_L > 27) = Σ_{k=28}^{45} C(45, k) (0.5)^45 ≈ 0.068.
Similarly, the random variable XS , giving the number of boys born in the
smaller hospital on day i, is governed by a Bin(15, .5) distribution. So Si
has a Ber(pS ) distribution with
p_S = P(X_S > 9) = Σ_{k=10}^{15} C(15, k) (0.5)^15 ≈ 0.151.
MIT18_05S14_class8_slides 223
We see that pS is indeed greater than pL , consistent with (ii).
January 1, 2017 11 / 18
Solution continued

(c) Note that L = Σ_{i=1}^{365} Li and S = Σ_{i=1}^{365} Si. So L has a Bin(365, pL)
distribution and S has a Bin(365, pS ) distribution. Thus

E (L) = 365pL ≈ 25
E (S) = 365pS ≈ 55
Var(L) = 365pL (1 − pL ) ≈ 23
Var(S) = 365pS (1 − pS ) ≈ 47

(d) By the CLT, the 0.84 quantile is approximately the mean + one sd in
each case:
For L, q0.84 ≈ 25 + √23.
For S, q0.84 ≈ 55 + √47.

Continued on next slide.


MIT18_05S14_class8_slides 224
January 1, 2017 12 / 18
Solution continued

(e) Since L and S are independent, their correlation is 0 and their joint
distribution is determined by multiplying their individual distributions.
Both L and S are binomial with n = 365 and pL and pS computed above.
Thus

P(L = i and S = j) = p(i, j) = C(365, i) pL^i (1 − pL)^{365−i} C(365, j) pS^j (1 − pS)^{365−j}

Thus

P(L > S) = Σ_{i=1}^{365} Σ_{j=0}^{i−1} p(i, j) ≈ 0.0000916

We used the R code on the next slide to do the computations.

MIT18_05S14_class8_slides 225
January 1, 2017 13 / 18
R code

pL = 1 - pbinom(.6*45,45,.5)

pS = 1 - pbinom(.6*15,15,.5)

print(pL)

print(pS)

pLGreaterS = 0

# sum P(L = i, S = j) over all pairs with i > j
for(i in 1:365)
{
for(j in 0:(i-1))
{
pLGreaterS = pLGreaterS + dbinom(i,365,pL)*dbinom(j,365,pS)
}
}
print(pLGreaterS)

MIT18_05S14_class8_slides 226
January 1, 2017 14 / 18
Problem correlation

1. Flip a coin 3 times. Use a joint pmf table to compute the


covariance and correlation between the number of heads on the first 2
and the number of heads on the last 2 flips.

2. Flip a coin 5 times. Use properties of covariance to compute the


covariance and correlation between the number of heads on the first 3
and last 3 flips.
answer: 1. Let X = the number of heads on the first 2 flips and Y the
number in the last 2. Considering all 8 possibe tosses: HHH, HHT etc we
get the following joint pmf for X and Y
Y /X 0 1 2
0 1/8 1/8 0 1/4
1 1/8 1/4 1/8 1/2
2 0 1/8 1/8 1/4
1/4 1/2 1/4 1
Solution continued on next slide
MIT18_05S14_class8_slides 227
January 1, 2017 15 / 18
Solution 1 continued

Using the table we find


E(XY) = 1/4 + 2(1/8) + 2(1/8) + 4(1/8) = 5/4.

We know E(X) = 1 = E(Y) so

Cov(X, Y) = E(XY) − E(X)E(Y) = 5/4 − 1 = 1/4.

Since X is the sum of 2 independent Bernoulli(0.5) we have σ_X = σ_Y = √(1/2), so σ_X σ_Y = 2/4.

Cor(X, Y) = Cov(X, Y)/(σ_X σ_Y) = (1/4)/(2/4) = 1/2.

Solution to 2 on next slide


MIT18_05S14_class8_slides 228
January 1, 2017 16 / 18
Solution 2
2. As usual let Xi = the number of heads on the i th flip, i.e. 0 or 1.
Let X = X1 + X2 + X3 the sum of the first 3 flips and Y = X3 + X4 + X5
the sum of the last 3. Using the algebraic properties of covariance we have

Cov(X , Y ) = Cov(X1 + X2 + X3 , X3 + X4 + X5 )
= Cov(X1 , X3 ) + Cov(X1 , X4 ) + Cov(X1 , X5 )
+ Cov(X2 , X3 ) + Cov(X2 , X4 ) + Cov(X2 , X5 )
+ Cov(X3 , X3 ) + Cov(X3 , X4 ) + Cov(X3 , X5 )

Because the Xi are independent the only non-zero term in the above sum
is Cov(X3, X3) = Var(X3) = 1/4. Therefore, Cov(X, Y) = 1/4.

We get the correlation by dividing by the standard deviations. Since X is
the sum of 3 independent Bernoulli(0.5) we have σ_X = √(3/4).

Cor(X, Y) = Cov(X, Y)/(σ_X σ_Y) = (1/4)/(3/4) = 1/3.
MIT18_05S14_class8_slides 229
January 1, 2017 17 / 18
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class8_slides 230
Introduction to Statistics
18.05 Spring 2014

T T T H H T H H H T
H T H T H T H T H T
H T T T H T T T T H
H T T H H T H H T H
T T H H H H T H T H
T T T H T H H H H T
T T T H H H T T T H
H H H H H H H T T T
H T H H T T T H H T
H T H H H T T T H H

Class 10 Slides with Solutions: Introduction to Statistics; Maximu231


January 1, 2017 1 / 23
Three ‘phases’

Data Collection:
Informal Investigation / Observational Study / Formal
Experiment
Descriptive statistics
Inferential statistics (the focus in 18.05)

To consult a statistician after an experiment is finished is


often merely to ask him to conduct a post-mortem
examination. He can perhaps say what the experiment died
of.
R.A. Fisher

MIT18_05S14_class10_slides 232
January 1, 2017 2 / 23
Is it fair?

T T T H H T H H H T
H T H T H T H T H T
H T T T H T T T T H
H T T H H T H H T H
T T H H H H T H T H
T T T H T H H H H T
T T T H H H T T T H
H H H H H H H T T T
H T H H T T T H H T
H T H H H T T T H H

MIT18_05S14_class10_slides 233
January 1, 2017 3 / 23
Is it normal?

Does it have µ = 0? Is it normal? Is it standard normal?

[Figure: density histogram of the data over roughly −4 to 4]

Sample mean = 0.38; sample standard deviation = 1.59

MIT18_05S14_class10_slides 234
January 1, 2017 4 / 23
What is a statistic?

Definition. A statistic is anything that can be computed from the


collected data. That is, a statistic must be observable.

Point statistic: a single value computed from data, e.g. sample
average x̄n or sample standard deviation sn.

Interval or range statistics: an interval [a, b] computed from the


data. (Just a pair of point statistics.) Often written as x ± s.

Important: A statistic is itself a random variable since a new


experiment will produce new data to compute it.

MIT18_05S14_class10_slides 235
January 1, 2017 5 / 23
Concept question

You believe that the lifetimes of a certain type of lightbulb follow an


exponential distribution with parameter λ. To test this hypothesis you
measure the lifetime of 5 bulbs and get data x1 , . . . x5 .

Which of the following are statistics?


(a) The sample average x̄ = (x1 + x2 + x3 + x4 + x5)/5.
(b) The expected value of a sample, namely 1/λ.
(c) The difference between x and 1/λ.

1. (a) 2. (b) 3. (c)


4. (a) and (b) 5. (a) and (c) 6. (b) and (c)
7. all three 8. none of them
answer: 1. (a). λ is a parameter of the distribution; it cannot be
computed from the data. It can only be estimated.
January 1, 2017 6 / 23
Notation

Big letters X , Y , Xi are random variables.

Little letters x, y , xi are data (values) generated by the random

variables.

Example. Experiment: 10 flips of a coin:

Xi is the random variable for the i th flip: either 0 or 1.

xi is the actual result (data) from the i th flip.

e.g. x1 , . . . , x10 = 1, 1, 1, 0, 0, 0, 0, 0, 1, 0.

MIT18_05S14_class10_slides 237
January 1, 2017 7 / 23
Reminder of Bayes’ theorem

Bayes’s theorem is the key to our view of statistics.


(Much more next week!)

P(H|D) = P(D|H)P(H)/P(D).

P(hypothesis|data) = P(data|hypothesis)P(hypothesis)/P(data)

MIT18_05S14_class10_slides 238
January 1, 2017 8 / 23
Estimating a parameter

Example. Suppose we want to know the percentage p of people for


whom cilantro tastes like soap.

Experiment: Ask n random people to taste cilantro.

Model:

Xi ∼ Bernoulli(p) is whether the i th person says it tastes like soap.

Data: x1 , . . . , xn are the results of the experiment

Inference: Estimate p from the data.

MIT18_05S14_class10_slides 239
January 1, 2017 9 / 23
Parameters of interest

Example. You ask 100 people to taste cilantro and 55 say it tastes
like soap. Use this data to estimate p the fraction of all people for
whom it tastes like soap.

So, p is the parameter of interest.

MIT18_05S14_class10_slides 240
January 1, 2017 10 / 23
Likelihood

For a given value of p the probability of getting 55 ‘successes’ is the


binomial probability

P(55 soap | p) = C(100, 55) p^55 (1 − p)^45.

Definition:

The likelihood P(data | p) = C(100, 55) p^55 (1 − p)^45.

NOTICE: The likelihood takes the data as fixed and computes the
probability of the data for a given p.

MIT18_05S14_class10_slides 241
January 1, 2017 11 / 23
Maximum likelihood estimate (MLE)

The maximum likelihood estimate (MLE) is a way to estimate the


value of a parameter of interest.

The MLE is the value of p that maximizes the likelihood.

Different problems call for different methods of finding the maximum.

Here are two –there are others:


1. Calculus: To find the MLE, solve (d/dp) P(data | p) = 0 for p. (We
should also check that the critical point is a maximum.)

2. Sometimes the derivative is never 0 and the MLE is at an endpoint


of the allowable range.
MIT18_05S14_class10_slides 242
January 1, 2017 12 / 23
Cilantro tasting MLE

The MLE for the cilantro tasting experiment is found by calculus.

dP(data | p)/dp = C(100, 55)(55p^54 (1 − p)^45 − 45p^55 (1 − p)^44) = 0

A sequence of algebraic steps gives:

55p 54 (1 − p)45 = 45p 55 (1 − p)44


55(1 − p) = 45p
55 = 100p
Therefore the MLE is p̂ = 55/100.
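The same estimate can be found numerically in R by maximizing the log likelihood (the interval endpoints below are arbitrary interior values):

loglik <- function(p) dbinom(55, 100, p, log = TRUE)
optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE)$maximum   # close to 0.55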

MIT18_05S14_class10_slides 243
January 1, 2017 13 / 23
Log likelihood

Because the log function turns multiplication into addition it is often


convenient to use the log of the likelihood function

log likelihood = ln(likelihood) = ln(P(data | p)).

Example.
100 55
 
Likelihood P(data|p) = p (1 − p)45
55
 
100
Log likelihood = ln + 55 ln(p) + 45 ln(1 − p).
55
(Note first term is just a constant.)
MIT18_05S14_class10_slides 244
January 1, 2017 14 / 23
Board Question: Coins

A coin is taken from a box containing three coins, which give heads
with probability p = 1/3, 1/2, and 2/3. The mystery coin is tossed
80 times, resulting in 49 heads and 31 tails.

(a) What is the likelihood of this data for each type on coin? Which
coin gives the maximum likelihood?

(b) Now suppose that we have a single coin with unknown probability
p of landing heads. Find the likelihood and log likelihood functions
given the same data. What is the maximum likelihood estimate for p?

See next slide.

MIT18_05S14_class10_slides 245
January 1, 2017 15 / 23
Solution

answer: (a) The data D is 49 heads in 80 tosses.

We have three hypotheses: the coin has probability

p = 1/3, p = 1/2, p = 2/3. So the likelihood function P(D|p) takes 3

values:

P(D | p = 1/3) = C(80, 49) (1/3)^49 (2/3)^31 = 6.24 · 10^−7

P(D | p = 1/2) = C(80, 49) (1/2)^49 (1/2)^31 = 0.024

P(D | p = 2/3) = C(80, 49) (2/3)^49 (1/3)^31 = 0.082

The maximum likelihood is when p = 2/3, so our maximum likelihood
estimate is p = 2/3.
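The three likelihoods can also be computed directly with R's dbinom:

dbinom(49, 80, 1/3)   # 6.24e-07
dbinom(49, 80, 1/2)   # about 0.024
dbinom(49, 80, 2/3)   # about 0.082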
Answer to part (b) is on the next slide
MIT18_05S14_class10_slides 246
January 1, 2017 16 / 23
Solution to part (b)
(b) Our hypotheses now allow p to be any value between 0 and 1. So our
likelihood function is

P(D | p) = C(80, 49) p^49 (1 − p)^31

To compute the maximum likelihood over all p, we set the derivative of
the log likelihood to 0 and solve for p:

d/dp ln(P(D|p)) = d/dp [ln(C(80, 49)) + 49 ln(p) + 31 ln(1 − p)] = 0

⇒ 49/p − 31/(1 − p) = 0

⇒ p = 49/80
So our MLE is p̂ = 49/80.

MIT18_05S14_class10_slides 247
January 1, 2017 17 / 23
Continuous likelihood

Use the pdf instead of the pmf

Example. Light bulbs


Lifetime of each bulb ∼ exp(λ).

Test 5 bulbs and find lifetimes of x1 , . . . , x5 .

(i) Find the likelihood and log likelihood functions.


(ii) Then find the maximum likelihood estimate (MLE) for λ.

answer: See next slide.

MIT18_05S14_class10_slides 248
January 1, 2017 18 / 23
Solution

(i) Let Xi ∼ exp(λ) = the lifetime of the i th bulb.


Likelihood = joint pdf (assuming independence):

f (x1 , x2 , x3 , x4 , x5 |λ) = λ5 e−λ(x1 +x2 +x3 +x4 +x5 ) .

Log likelihood

ln(f (x1 , x2 , x3 , x4 , x5 |λ)) = 5 ln(λ) − λ(x1 + x2 + x3 + x4 + x5 ).

(ii) Using calculus to find the MLE:

d ln(f(x1, x2, x3, x4, x5 | λ))/dλ = 5/λ − Σ xi = 0  ⇒  λ̂ = 5/Σ xi.

MIT18_05S14_class10_slides 249
January 1, 2017 19 / 23
Board Question

Suppose the 5 bulbs are tested and have lifetimes of 2, 3, 1, 3, 4 years


respectively. What is the maximum likelihood estimate (MLE) for λ?

Work from scratch. Do not simply use the formula just given.

Set the problem up carefully by defining random variables and


densities.
Solution on next slide.

MIT18_05S14_class10_slides 250
January 1, 2017 20 / 23
Solution

answer: We need to be careful with our notation. With five different


values it is best to use subscripts. So, let Xi be the lifetime of the i th bulb
and let xi be the value it takes. Then Xi has density λe−λxi . We assume
each of the lifetimes is independent, so we get a joint density

f (x1 , x2 , x3 , x4 , x5 |λ) = λ5 e−λ(x1 +x2 +x3 +x4 +x5 ) .

Note, we write this as a conditional density, since it depends on λ. This


density is our likelihood function. Our data had values

x1 = 2, x2 = 3, x3 = 1, x4 = 3, x5 = 4.

So our likelihood and log likelihood functions with this data are

f (2, 3, 1, 3, 4 | λ) = λ5 e−13λ , ln(f (2, 3, 1, 3, 4 | λ)) = 5 ln(λ) − 13λ

Continued on next slide


MIT18_05S14_class10_slides 251
January 1, 2017 21 / 23
Solution continued

Using calculus to find the MLE we take the derivative of the log likelihood

5/λ − 13 = 0  ⇒  λ̂ = 5/13.
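A numerical check in R, maximizing the log likelihood for this data (the search interval is an arbitrary choice):

x <- c(2, 3, 1, 3, 4)
loglik <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))
optimize(loglik, interval = c(0.01, 5), maximum = TRUE)$maximum   # close to 5/13 ≈ 0.385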

MIT18_05S14_class10_slides 252
January 1, 2017 22 / 23
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class10_slides 253
Bayesian Updating: Discrete Priors: 18.05 Spring 2014

Class 11 Slides with Solutions: Bayesian Updating with Known Discrete Priors 254


http://xkcd.com/1236/

January 1, 2017 1 / 22
Learning from experience
Which treatment would you choose?

1. Treatment 1: cured 100% of patients in a trial.


2. Treatment 2: cured 95% of patients in a trial.
3. Treatment 3: cured 90% of patients in a trial.

Which treatment would you choose?

1. Treatment 1: cured 3 out of 3 patients in a trial.


2. Treatment 2: cured 19 out of 20 patients treated in a trial.
3. Standard treatment: cured 90000 out of 100000 patients in clinical
practice.
MIT18_05S14_class11_slides 255
January 1, 2017 2 / 22
Which die is it?

I have a bag containing dice of two types: 4-sided and 10-sided.

Suppose I pick a die at random and roll it.

Based on what I rolled which type would you guess I picked?

• Suppose you find out that the bag contained one 4-sided die and
one 10-sided die. Does this change your guess?

• Suppose you find out that the bag contained one 4-sided die and
100 10-sided dice. Does this change your guess?

MIT18_05S14_class11_slides 256
January 1, 2017 3 / 22
Board Question: learning from data
• A certain disease has a prevalence of 0.005.
• A screening test has 2% false positives and 1% false negatives.

Suppose a patient is screened and has a positive test.


1 Represent this information with a tree and use Bayes’ theorem to
compute the probabilities the patient does and doesn’t have the
disease.
2 Identify the data, hypotheses, likelihoods, prior probabilities and
posterior probabilities.
3 Make a full likelihood table containing all hypotheses and
possible test data.
4 Redo the computation using a Bayesian update table. Match the
terms in your table to the terms in your previous calculation.
Solution on next slides.
MIT18_05S14_class11_slides 257
January 1, 2017 4 / 22
Solution

1. Tree based Bayes computation


Let H+ mean the patient has the disease and H− they don’t.
Let T+ : they test positive and T− they test negative.

We can organize this in a tree:


0.005 0.995
H+ H−
0.99 0.01 0.02 0.98

T+ T− T+ T−

P(T+ | H+ )P(H+ )
Bayes’ theorem says P(H+ | T+ ) = .
P(T+ )
Using the tree, the total probability
P(T+ ) = P(T+ | H+ )P(H+ ) + P(T+ | H− )P(H− )
= 0.99 · 0.005 + 0.02 · 0.995 = 0.02485

MIT18_05S14_class11_slides 258
Solution continued on next slide.

January 1, 2017 5 / 22
Solution continued

So,
P(T+ | H+ )P(H+ ) 0.99 · 0.005
P(H+ | T+ ) = = = 0.199
P(T+ ) 0.02485
P(T+ | H− )P(H− ) 0.02 · 0.995
P(H− | T+ ) = = = 0.801
P(T+ ) 0.02485

The positive test greatly increases the probability of H+ , but it is still


much less probable than H− .

Solution continued on next slide.

MIT18_05S14_class11_slides 259
January 1, 2017 6 / 22
Solution continued

2. Terminology
Data: The data are the results of the experiment. In this case, the
positive test.
Hypotheses: The hypotheses are the possible answers to the question
being asked. In this case they are H+ the patient has the disease; H−
they don’t.
Likelihoods: The likelihood given a hypothesis is the probability of the
data given that hypothesis. In this case there are two likelihoods, one for
each hypothesis

P(T+ | H+ ) = 0.99 and P(T+ | H− ) = 0.02.

We repeat: the likelihood is a probability given the hypothesis, not a


probability of the hypothesis.

MIT18_05S14_class11_slides 260
Continued on next slide.
January 1, 2017 7 / 22
Solution continued

Prior probabilities of the hypotheses: The priors are the probabilities of the
hypotheses prior to collecting data. In this case,

P(H+ ) = 0.005 and P(H− ) = 0.995

Posterior probabilities of the hypotheses: The posteriors are the


probabilities of the hypotheses given the data. In this case

P(H+ | T+ ) = 0.199 and P(H− | T+ ) = 0.801.

Labeling the terms in Bayes' theorem:

P(H+ | T+) = P(T+ | H+) · P(H+) / P(T+)

posterior = likelihood × prior / total probability of the data


MIT18_05S14_class11_slides 261
January 1, 2017 8 / 22
Solution continued

3. Full likelihood table

The table holds likelihoods P(D|H) for every possible hypothesis and data
combination.

hypothesis H likelihood P(D|H)


disease state P(T+ |H) P(T− |H)
H+ 0.99 0.01
H− 0.02 0.98

Notice in the next slide that the P(T+ | H) column is exactly the likelihood
column in the Bayesian update table.

MIT18_05S14_class11_slides 262
January 1, 2017 9 / 22
Solution continued
4. Calculation using a Bayesian update table
H = hypothesis: H+ (patient has disease); H− (they don’t).

Data: T+ (positive screening test).

Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(T+ |H) P(T+ |H)P(H) P(H|T+ )
H+ 0.005 0.99 0.00495 0.199
H− 0.995 0.02 0.0199 0.801
total 1 NO SUM P(T+ ) = 0.02485 1

Data D = T+

Total probability: P(T+ ) = sum of Bayes numerator column = 0.02485

Bayes' theorem: P(H|T+) = P(T+|H)P(H) / P(T+) = likelihood × prior / total prob. of data
MIT18_05S14_class11_slides 263
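The same update table can be computed numerically. A minimal R sketch (added; not in the slides), using the numbers above:

    # Screening test: prior, likelihood P(T+ | H), Bayes numerator, posterior.
    prior      <- c(Hplus = 0.005, Hminus = 0.995)
    likelihood <- c(Hplus = 0.99,  Hminus = 0.02)
    bayes_num  <- likelihood * prior           # sums to P(T+) = 0.02485
    round(bayes_num / sum(bayes_num), 3)       # Hplus 0.199, Hminus 0.801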
January 1, 2017 10 / 22
Board Question: Dice
Five dice: 4-sided, 6-sided, 8-sided, 12-sided, 20-sided.
Suppose I picked one at random and, without showing it to you,
rolled it and reported a 13.

1. Make the full likelihood table (be smart about identical columns).
2. Make a Bayesian update table and compute the posterior
probabilities that the chosen die is each of the five dice.
3. Same question if I rolled a 5.
4. Same question if I rolled a 9.

(Keep the tables for 5 and 9 handy! Do not erase!)

MIT18_05S14_class11_slides 264
January 1, 2017 11 / 22
Tabular solution

D = ‘rolled a 13’

Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
H4 1/5 0 0 0
H6 1/5 0 0 0
H8 1/5 0 0 0
H12 1/5 0 0 0
H20 1/5 1/20 1/100 1
total 1 1/100 1

MIT18_05S14_class11_slides 265
January 1, 2017 12 / 22
Tabular solution

D = ‘rolled a 5’

Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
H4 1/5 0 0 0
H6 1/5 1/6 1/30 0.392
H8 1/5 1/8 1/40 0.294
H12 1/5 1/12 1/60 0.196
H20 1/5 1/20 1/100 0.118
total 1 0.085 1
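A short R sketch that reproduces this table (added for reference; not in the slides):

    # Update for the five dice after observing a roll of 5.
    sides <- c(4, 6, 8, 12, 20)
    prior <- rep(1/5, 5)
    lik   <- ifelse(sides >= 5, 1/sides, 0)      # P(roll = 5 | die)
    round(lik * prior / sum(lik * prior), 3)     # 0.000 0.392 0.294 0.196 0.118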

MIT18_05S14_class11_slides 266
January 1, 2017 13 / 22
Tabular solution

D = ‘rolled a 9’

Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
H4 1/5 0 0 0
H6 1/5 0 0 0
H8 1/5 0 0 0
H12 1/5 1/12 1/60 0.625
H20 1/5 1/20 1/100 0.375
total 1 .0267 1

MIT18_05S14_class11_slides 267
January 1, 2017 14 / 22
Iterated Updates

Suppose I rolled a 5 and then a 9.

Update in two steps:


First for the 5
Then update the update for the 9.

MIT18_05S14_class11_slides 268
January 1, 2017 15 / 22
Tabular solution

D1 = ‘rolled a 5’

D2 = ‘rolled a 9’

Bayes numerator1 = likelihood1 × prior.

Bayes numerator2 = likelihood2 × Bayes numerator1

Bayes Bayes
hyp. prior likel. 1 num. 1 likel. 2 num. 2 posterior
H P(H) P(D1 |H) ∗∗∗ P(D2 |H) ∗∗∗ P(H|D1 , D2 )
H4 1/5 0 0 0 0 0
H6 1/5 1/6 1/30 0 0 0
H8 1/5 1/8 1/40 0 0 0
H12 1/5 1/12 1/60 1/12 1/720 0.735
H20 1/5 1/20 1/100 1/20 1/2000 0.265
total 1 0.0019 1

MIT18_05S14_class11_slides 269
January 1, 2017 16 / 22
Board Question

Suppose I rolled a 9 and then a 5.

1. Do the Bayesian update in two steps:


First update for the 9.
Then update the update for the 5.

2. Do the Bayesian update in one step


The data is D = ‘9 followed by 5’

MIT18_05S14_class11_slides 270
January 1, 2017 17 / 22
Tabular solution: two steps

D1 = ‘rolled a 9’

D2 = ‘rolled a 5’

Bayes numerator1 = likelihood1 × prior.

Bayes numerator2 = likelihood2 × Bayes numerator1

Bayes Bayes
hyp. prior likel. 1 num. 1 likel. 2 num. 2 posterior
H P(H) P(D1 |H) ∗∗∗ P(D2 |H) ∗∗∗ P(H|D1 , D2 )
H4 1/5 0 0 0 0 0
H6 1/5 0 0 1/6 0 0
H8 1/5 0 0 1/8 0 0
H12 1/5 1/12 1/60 1/12 1/720 0.735
H20 1/5 1/20 1/100 1/20 1/2000 0.265
total 1 0.0019 1

MIT18_05S14_class11_slides 271
January 1, 2017 18 / 22
Tabular solution: one step

D = ‘rolled a 9 then a 5’

Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
H4 1/5 0 0 0
H6 1/5 0 0 0
H8 1/5 0 0 0
H12 1/5 1/144 1/720 0.735
H20 1/5 1/400 1/2000 0.265
total 1 0.0019 1

MIT18_05S14_class11_slides 272
January 1, 2017 19 / 22
Board Question: probabilistic prediction

Along with finding posterior probabilities of hypotheses, we might
want to make posterior predictions about the next roll.

With the same setup as before let:


D1 = result of first roll
D2 = result of second roll

(a) Find P(D1 = 5).


(b) Find P(D2 = 4|D1 = 5).

MIT18_05S14_class11_slides 273
January 1, 2017 20 / 22
Solution
D1 = ‘rolled a 5’

D2 = ‘rolled a 4’
Bayes
hyp. prior likel. 1 num. 1 post. 1 likel. 2 post. 1 × likel. 2
H P(H) P(D1 |H) ∗ ∗ ∗ P(H|D1 ) P(D2 |H, D1 ) P(D2 |H, D1 )P(H|D1 )
H4 1/5 0 0 0 ∗ 0
H6 1/5 1/6 1/30 0.392 1/6 0.392 · 1/6
H8 1/5 1/8 1/40 0.294 1/8 0.294 · 1/8
H12 1/5 1/12 1/60 0.196 1/12 0.196 · 1/12
H20 1/5 1/20 1/100 0.118 1/20 0.118 · 1/20
total 1 0.085 1 0.124

The law of total probability tells us P(D1 ) is the sum of the Bayes
numerator 1 column in the table: P(D1 ) = 0.085 .
The law of total probability tells us P(D2 |D1 ) is the sum of the last
column in the table: P(D2 |D1 ) = 0.124
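The same predictive computation in R (a sketch, not from the slides):

    # Posterior predictive P(next roll = 4 | first roll = 5) for the five dice.
    sides <- c(4, 6, 8, 12, 20)
    post1 <- c(0, 0.392, 0.294, 0.196, 0.118)    # posterior after rolling a 5
    lik2  <- ifelse(sides >= 4, 1/sides, 0)      # P(roll = 4 | die)
    sum(lik2 * post1)                            # ≈ 0.124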
MIT18_05S14_class11_slides 274
January 1, 2017 21 / 22
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class11_slides 275
Prediction and Odds
18.05 Spring 2014

Class 12 Slides with Solutions: Bayesian Updating: Probabilistic Prediction; Odds 276


January 1, 2017 1 / 26
Probabilistic Prediction

Also called probabilistic forecasting.


Assign a probability to each outcome of a future experiment.

Prediction: “It will rain tomorrow.”

Probabilistic prediction: “Tomorrow it will rain with probability


60% (and not rain with probability 40%).”

Examples: medical treatment outcomes, weather forecasting, climate


change, sports betting, elections, ...

MIT18_05S14_class12_slides 277
January 1, 2017 2 / 26
Words of estimative probability (WEP)
WEP Prediction: “It is likely to rain tomorrow.”

Memo: Bin Laden Determined to Strike in US


See http://en.wikipedia.org/wiki/Words_of_Estimative_Probability

“The language used in the [Bin Laden] memo lacks words of


estimative probability (WEP) that reduce uncertainty, thus preventing
the President and his decision makers from implementing measures
directed at stopping al Qaeda’s actions.”

“Intelligence analysts would rather use words than numbers to describe


how confident we are in our analysis,” a senior CIA officer who’s served for
more than 20 years told me. Moreover, “most consumers of intelligence
aren’t particularly sophisticated when it comes to probabilistic analysis.
They like words and pictures, too. My experience is that [they] prefer
briefings that don’t center on numerical calculation.”
MIT18_05S14_class12_slides 278
January 1, 2017 3 / 26
WEP versus Probabilities: medical consent

No common standard for converting WEP to numbers.

Suggestion for potential risks of a medical procedure:

Word Probability
Likely Will happen to more than 50% of patients
Frequent Will happen to 10-50% of patients
Occasional Will happen to 1-10% of patients
Rare Will happen to less than 1% of patients

From same Wikipedia article

MIT18_05S14_class12_slides 279
January 1, 2017 4 / 26
Example: Three types of coins

Type A coins are fair, with probability 0.5 of heads


Type B coins have probability 0.6 of heads
Type C coins have probability 0.9 of heads

A drawer contains one coin of each type. You pick one at random.
Prior predictive probability: Before taking data, what is the
probability a toss will land heads? Tails?

Take data: say the first toss lands heads.

Posterior predictive probability: After taking data. What is the


probability the next toss lands heads? Tails?

MIT18_05S14_class12_slides 280
January 1, 2017 5 / 26
Solution 1

1. Use the law of total probability: (A probability tree is an excellent way


to visualize this. You should draw one before reading on.)
Let D1,H = ‘toss 1 is heads’, D1,T = ‘toss 1 is tails’.

P(D1,H ) = P(D1,H |A)P(A) + P(D1,H |B)P(B) + P(D1,H |C )P(C )


= 0.5 · 0.3333 + 0.6 · 0.3333 + 0.9 · 0.3333
= 0.6667

P(D1,T ) = 1 − P(D1,H ) = 0.3333

MIT18_05S14_class12_slides 281
January 1, 2017 6 / 26
Solution 2
2. We are given the data D1,H . First update the probabilities for the type
of coin.
Let D2,H = ‘toss 2 is heads’, D2,T = ‘toss 2 is tails’.
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D1,H |H) P(D1,H |H)P(H) P(H|D1,H )
A 1/3 0.5 0.1667 0.25
B 1/3 0.6 0.2 0.3
C 1/3 0.9 0.3 0.45
total 1 0.6667 1
Next use the law of total probability:
P(D2,H |D1,H ) = P(D2,H |A)P(A|D1,H ) + P(D2,H |B)P(B|D1,H )
+P(D2,H |C )P(C |D1,H )
= 0.71
P(D2,T |D1,H ) = 0.29.
MIT18_05S14_class12_slides 282
January 1, 2017 7 / 26
Three coins, continued.

As before: 3 coins with probabilities 0.5, 0.6, and 0.9 of heads.


Pick one; toss 5 times; Suppose you get 1 head out of 5 tosses.

Concept question: What’s your best guess for the probability of


heads on the next toss?
(a) 0.1 (b) 0.2 (c) 0.3 (d) 0.4

(e) 0.5 (f) 0.6 (g) 0.7 (h) 0.8

(i) 0.9 (j) 1.0

MIT18_05S14_class12_slides 283
January 1, 2017 8 / 26
Board question: three coins
Same setup:
3 coins with probabilities 0.5, 0.6, and 0.9 of heads.
Pick one; toss 5 times.
Suppose you get 1 head out of 5 tosses.

Compute the posterior probabilities for the type of coin and the
posterior predictive probabilities for the results of the next toss.

1. Specify clearly the set of hypotheses and the prior probabilities.

2. Compute the prior and posterior predictive distributions, i.e. give


the probabilities of all possible outcomes.

answer: See next slide.


MIT18_05S14_class12_slides 284
January 1, 2017 9 / 26
Solution

Data = ‘1 head and 4 tails’

Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D )
A 1/3 (5 choose 1)(0.5)^1 (0.5)^4 = 0.156 0.0521 0.669
B 1/3 (5 choose 1)(0.6)^1 (0.4)^4 = 0.0768 0.0256 0.329
C 1/3 (5 choose 1)(0.9)^1 (0.1)^4 = 0.00045 0.00015 0.002
total 1 0.0778 1

So,

P(heads|D) = 0.669 · 0.5 + 0.329 · 0.6 + 0.002 · 0.9 = 0.53366


P(tails|D) = 1 − P(heads|D) = 0.46634.
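In R (a sketch added for reference; not part of the slides):

    # Three coins: posterior after 1 head in 5 tosses, then predictive probability of heads.
    p     <- c(0.5, 0.6, 0.9)
    prior <- rep(1/3, 3)
    lik   <- dbinom(1, size = 5, prob = p)
    post  <- lik * prior / sum(lik * prior)
    round(post, 3)        # 0.669 0.329 0.002
    sum(post * p)         # ≈ 0.534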

MIT18_05S14_class12_slides 285
January 1, 2017 10 / 26
Concept Question

Does the order of the 1 head and 4 tails affect the posterior
distribution of the coin type?

1. Yes 2. No

Does the order of the 1 head and 4 tails affect the posterior predictive
distribution of the next flip?

1. Yes 2. No

answer: No for both questions.

MIT18_05S14_class12_slides 286
January 1, 2017 11 / 26
Odds

Definition The odds of an event are


O(E) = P(E) / P(E^c).

Usually for two choices: E and not E .


Can split multiple outcomes into two groups.
Can do odds of A vs. B = P(A)/P(B).
Our Bayesian focus:
Updating the odds of a hypothesis H given data D.

MIT18_05S14_class12_slides 287
January 1, 2017 12 / 26
Examples

A fair coin has O(heads) = 0.5/0.5 = 1.
We say ‘1 to 1’ or ‘fifty-fifty’.

The odds of rolling a 4 with a six-sided die are (1/6)/(5/6) = 1/5.
We say ‘1 to 5 for’ or ‘5 to 1 against’.

For event E, if P(E) = p then O(E) = p/(1 − p).

If an event is rare, then P(E ) ≈ O(E ).

MIT18_05S14_class12_slides 288
January 1, 2017 13 / 26
Bayesian framework: Marfan’s Syndrome

Marfan’s syndrome (M) is a genetic disease of connective tissue. The


main ocular features (F) of Marfan syndrome include bilateral ectopia
lentis (lens dislocation), myopia and retinal detachment.

P(M) = 1/15000, P(F |M) = 0.7, P(F |M c ) = 0.07

If a person has the main ocular features F what is the probability


they have Marfan’s syndrome.
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(F |H) P(F |H)P(H) P(H|F )
M 0.000067 0.7 0.0000467 0.00066
Mc 0.999933 0.07 0.069995 0.99933
total 1 0.07004 1
MIT18_05S14_class12_slides 289
January 1, 2017 14 / 26
Odds form
P(M) = 1/15000, P(F|M) = 0.7, P(F|M^c) = 0.07
Prior odds:

O(M) = P(M)/P(M^c) = (1/15000)/(14999/15000) = 1/14999 = 0.000067.

Note: O(M) ≈ P(M) since P(M) is small.

Posterior odds: can use the Bayes numerator!

O(M|F) = P(M|F)/P(M^c|F) = P(F|M)P(M) / (P(F|M^c)P(M^c)) = 0.000667.

The posterior odds is a product of factors:

O(M|F) = [P(F|M)/P(F|M^c)] · [P(M)/P(M^c)] = (0.7/0.07) · O(M)
MIT18_05S14_class12_slides 290
January 1, 2017 15 / 26
Bayes factors

O(M|F) = [P(F|M)/P(F|M^c)] · [P(M)/P(M^c)]

       = [P(F|M)/P(F|M^c)] · O(M)

posterior odds = Bayes factor · prior odds

The Bayes factor is the ratio of the likelihoods.


The Bayes factor gives the strength of the ‘evidence’ provided by
the data.
A large Bayes factor times small prior odds can be small (or large
or in between).
The Bayes factor for ocular features is 0.7/0.07 = 10.
MIT18_05S14_class12_slides 291
January 1, 2017 16 / 26
Board Question: screening tests
A disease is present in 0.005 of the population.

A screening test has a 0.05 false positive rate and a 0.02 false
negative rate.

1. Give the prior odds a patient has the disease

Assume the patient tests positive

2. What is the Bayes factor for this data?

3. What are the posterior odds they have the disease?

4. Based on your answers to (1) and (2) would you say a positive test
(the data) provides strong or weak evidence for the presence of the
disease.

MIT18_05S14_class12_slides 292
answer: See next slide
January 1, 2017 17 / 26
Solution
Let H+ = ‘has disease’ and H− = ‘doesn’t’
Let T+ = positive test
1. O(H+) = P(H+)/P(H−) = 0.005/0.995 = 0.00503
Likelihood table:
Possible data
T+ T−
Hypotheses H+ 0.98 0.02
H− 0.05 0.95
2. Bayes factor = ratio of likelihoods = P(T+|H+)/P(T+|H−) = 0.98/0.05 = 19.6
3. Posterior odds = Bayes factor × prior odds = 19.6 × 0.00503 = 0.0985
4. Yes, a Bayes factor of 19.6 indicates a positive test is strong evidence
the patient has the disease. The posterior odds are still small because the
prior odds are extremely small.
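A minimal R sketch of this odds computation (added; not from the slides):

    # Prior odds, Bayes factor, posterior odds, and the matching posterior probability.
    prior_odds   <- 0.005 / 0.995               # ≈ 0.00503
    bayes_factor <- 0.98 / 0.05                 # = 19.6
    post_odds    <- bayes_factor * prior_odds   # ≈ 0.0985
    post_odds / (1 + post_odds)                 # ≈ 0.0897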
MIT18_05S14_class12_slides 293
More on next slide.
January 1, 2017 18 / 26
Solution continued

Of course we can compute the posterior odds by computing the posterior


probabilities using a Bayesian update table.
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(T+ |H) P(T+ |H)P(H) P(H|T+ )
H+ 0.005 0.98 0.00490 0.0897
H− 0.995 0.05 0.04975 0.9103
total 1 0.05474 1
Posterior odds: O(H+|T+) = 0.0897/0.9103 = 0.0985

MIT18_05S14_class12_slides 294
January 1, 2017 19 / 26
Board Question: CSI Blood Types*
Crime scene: the two perpetrators left blood: one of type O and
one of type AB
In population 60% are type O and 1% are type AB
1 Suspect Oliver is tested and has type O blood.
Compute the Bayes factor and posterior odds that Oliver was one
of the perpetrators.
Is the data evidence for or against the hypothesis that Oliver is
guilty?
2 Same question for suspect Alberto who has type AB blood.

Show helpful hint on next slide.

*From ‘Information Theory, Inference, and Learning Algorithms’ by David J. C. Mackay.
MIT18_05S14_class12_slides 295
January 1, 2017 20 / 26
Helpful hint

Population: 60% type O; 1% type AB

For the question about Oliver we have

Hypotheses:
S = ‘Oliver and another unknown person were at the scene’
S c = ‘two unknown people were at the scene’

Data:
D = ‘type ‘O’ and ‘AB’ blood were found; Oliver is type O’

MIT18_05S14_class12_slides 296
January 1, 2017 21 / 26
Solution to CSI Blood Types
For Oliver:
Bayes factor = P(D|S)/P(D|S^c) = 0.01/(2 · 0.6 · 0.01) = 0.83.

Therefore the posterior odds = 0.83 × prior odds (O(S|D) = 0.83 · O(S))
Since the odds of his presence decreased this is (weak) evidence of his
innocence.

For Alberto:
Bayes factor = P(D|S)/P(D|S^c) = 0.6/(2 · 0.6 · 0.01) = 50.

Therefore the posterior odds = 50 × prior odds (O(S|D) = 50 · O(S))


Since the odds of his presence increased this is (strong) evidence of his
presence at the scene.
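In R (a sketch of the two Bayes factors above; not from the slides):

    # Bayes factors for Oliver (type O) and Alberto (type AB).
    p_other <- 2 * 0.6 * 0.01    # P(data | S^c): two unknown people, one type O and one type AB
    0.01 / p_other               # Oliver:  ≈ 0.83
    0.60 / p_other               # Alberto: = 50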
MIT18_05S14_class12_slides 297
January 1, 2017 22 / 26
Legal Thoughts

David Mackay:

“In my view, a jury’s task should generally be to multiply together


carefully evaluated likelihood ratios from each independent piece of
admissible evidence with an equally carefully reasoned prior
probability. This view is shared by many statisticians but learned
British appeal judges recently disagreed and actually overturned the
verdict of a trial because the jurors had been taught to use Bayes’
theorem to handle complicated DNA evidence.”

MIT18_05S14_class12_slides 298
January 1, 2017 23 / 26
Updating again and again

Collect data: D1 , D2 , . . .
Posterior odds to D1 become prior odds to D2 . So,
O(H|D1, D2) = O(H) · [P(D1|H)/P(D1|H^c)] · [P(D2|H)/P(D2|H^c)]

            = O(H) · BF1 · BF2.

Independence assumption:
D1 and D2 are conditionally independent.

P(D1 , D2 |H) = P(D1 |H)P(D2 |H).

MIT18_05S14_class12_slides 299
January 1, 2017 24 / 26
Marfan’s Symptoms
The Bayes factor for ocular features (F) is
BF_F = P(F|M)/P(F|M^c) = 0.7/0.07 = 10
The wrist sign (W) is the ability to wrap one hand around your other
wrist to cover your pinky nail with your thumb. Assume 10% of the
population have the wrist sign, while 90% of people with Marfan’s
have it. So,
BF_W = P(W|M)/P(W|M^c) = 0.9/0.1 = 9.

O(M|F, W) = O(M) · BF_F · BF_W = (1/14999) · 10 · 9 ≈ 6/1000.
We can convert posterior odds back to probability, but since the odds
are so small the result is nearly the same:
P(M|F, W) ≈ 6/(1000 + 6) ≈ 0.596%
MIT18_05S14_class12_slides 300
January 1, 2017 25 / 26
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class12_slides 301
Bayesian Updating: Continuous Priors
18.05 Spring 2014

[Figure: a number line from 0 to 1 with marks at 0.1, 0.3, 0.5, 0.7, 0.9.]


Class 13 Slides with Solutions: Bayesian Updating: Continuous Prior, Discrete Data 302
January 1, 2017 1 /24
Continuous range of hypotheses

Example. Bernoulli with unknown probability of success p.

Can hypothesize that p takes any value in [0, 1].

Model: ‘bent coin’ with probability p of heads.

Example. Waiting time X ∼ exp(λ) with unknown λ.

Can hypothesize that λ takes any value greater than 0.

Example. We have a normal random variable with unknown µ and σ. We can
hypothesize that (µ, σ) is anywhere in (−∞, ∞) × [0, ∞).

MIT18_05S14_class13_slides 303
January 1, 2017 2 /24
Example of Bayesian updating so far

Three types of coins with probabilities 0.25, 0.5, 0.75 of heads.

Assume the numbers of each type are in the ratio 1 to 2 to 1.

Assume we pick a coin at random, toss it twice and get TT .

Compute the posterior probability the coin has probability 0.25 of


heads.

MIT18_05S14_class13_slides 304
January 1, 2017 3 /24
Solution (2 times)
Let C0.25 stand for the hypothesis (event) that the chosen coin has
probability 0.25 of heads. We want to compute P(C0.25 |data).
Method 1: Using Bayes’ formula and the law of total probability:
P(C.25 |data) = P(data|C.25 )P(C.25 ) / P(data)

             = P(data|C.25 )P(C.25 ) / [P(data|C.25 )P(C.25 ) + P(data|C.5 )P(C.5 ) + P(data|C.75 )P(C.75 )]

             = (0.75)^2 (1/4) / [(0.75)^2 (1/4) + (0.5)^2 (1/2) + (0.25)^2 (1/4)]

             = 0.5
Method 2: Using a Bayesian update table:
hypotheses prior likelihood Bayes numerator posterior
H P(H) P(data|H) P(data|H)P(H) P(H|data)
C0.25 1/4 (0.75)2 0.141 0.500
C0.5 1/2 (0.5)2 0.125 0.444
C0.75 1/4 (0.25)2 0.016 0.056
Total 1 P(data) = 0.281 1
MIT18_05S14_class13_slides 305
January 1, 2017 4 /24
Solution continued

Please be sure you understand how each of the pieces in method 1


correspond to the entries in the Bayesian update table in method 2.

Note. The total probability P(data) is also called the prior predictive
probability of the data.

MIT18_05S14_class13_slides 306
January 1, 2017 5 /24
Notation with lots of hypotheses I.
Now there are 5 types of coins with probabilities 0.1, 0.3, 0.5,
0.7, 0.9 of heads.
Assume the numbers of each type are in the ratio 1:2:3:2:1 (so
fairer coins are more common).
Again we pick a coin at random, toss it twice and get TT .
Construct the Bayesian update table for the posterior probabilities of
each type of coin.
hypotheses prior likelihood Bayes numerator posterior
H P(H) P(data|H) P(data|H)P(H) P(H|data)
C0.1 1/9 (0.9)2 0.090 0.297
C0.3 2/9 (0.7)2 0.109 0.359
C0.5 3/9 (0.5)2 0.083 0.275
C0.7 2/9 (0.3)2 0.020 0.066
C0.9 1/9 (0.1)2 0.001 0.004
Total 1 P(data) = 0.303 1
MIT18_05S14_class13_slides 307
January 1, 2017 6 /24
Notation with lots of hypotheses II.

What about 9 coins with probabilities 0.1, 0.2, 0.3, . . . , 0.9?

Assume fairer coins are more common with the number of coins of

probability θ of heads proportional to θ(1 − θ)

Again the data is TT .

We can do this!

MIT18_05S14_class13_slides 308
January 1, 2017 7 /24
Table with 9 hypotheses

hypotheses prior likelihood Bayes numerator posterior


H P(H) P(data|H) P(data|H)P(H) P(H|data)
C0.1 k(0.1 · 0.9) (0.9)2 0.0442 0.1483
C0.2 k(0.2 · 0.8) (0.8)2 0.0621 0.2083
C0.3 k(0.3 · 0.7) (0.7)2 0.0624 0.2093
C0.4 k(0.4 · 0.6) (0.6)2 0.0524 0.1757
C0.5 k(0.5 · 0.5) (0.5)2 0.0379 0.1271
C0.6 k(0.6 · 0.4) (0.4)2 0.0233 0.0781
C0.7 k(0.7 · 0.3) (0.3)2 0.0115 0.0384
C0.8 k(0.8 · 0.2) (0.2)2 0.0039 0.0130
C0.9 k(0.9 · 0.1) (0.1)2 0.0005 0.0018
Total 1 P(data) = 0.298 1

k = 0.606 was computed so that the total prior probability is 1.

MIT18_05S14_class13_slides 309
January 1, 2017 8 /24
Notation with lots of hypotheses III.

What about 99 coins with probabilities 0.01, 0.02, 0.03, . . . , 0.99?

Assume fairer coins are more common with the number of coins of

probability θ of heads proportional to θ(1 − θ)

Again the data is TT .

We could do this . . .

MIT18_05S14_class13_slides 310
January 1, 2017 9 /24
Table with 99 coins

Hypothesis H prior P (H) likelihood P (data | hyp.) Bayes numerator Posterior


H P (H) (k = 0.04) P (data | hyp.) P (data | hyp.)P (H) P (data | hyp.)P (H)/P (data)
C0.01 k(0.01)(1-0.01) (1 − 0.01)2 k(0.01)(1 − 0.01)2 0.001940921
C0.02 k(0.02)(1-0.02) (1 − 0.02)2 k(0.02)(1 − 0.02)2 0.003765396
C0.03 k(0.03)(1-0.03) (1 − 0.03)2 k(0.03)(1 − 0.03)2 0.005476951
C0.04 k(0.04)(1-0.04) (1 − 0.04)2 k(0.04)(1 − 0.04)2 0.007079068
C0.05 k(0.05)(1-0.05) (1 − 0.05)2 k(0.05)(1 − 0.05)2 0.008575179
C0.06 k(0.06)(1-0.06) (1 − 0.06)2 k(0.06)(1 − 0.06)2 0.009968669
C0.07 k(0.07)(1-0.07) (1 − 0.07)2 k(0.07)(1 − 0.07)2 0.01126288
C0.08 k(0.08)(1-0.08) (1 − 0.08)2 k(0.08)(1 − 0.08)2 0.01246108
C0.09 k(0.09)(1-0.09) (1 − 0.09)2 k(0.09)(1 − 0.09)2 0.01356654
C0.1 k(0.1)(1-0.1) (1 − 0.1)2 k(0.1)(1 − 0.1)2 0.01458243
C0.11 k(0.11)(1-0.11) (1 − 0.11)2 k(0.11)(1 − 0.11)2 0.0155119
C0.12 k(0.12)(1-0.12) (1 − 0.12)2 k(0.12)(1 − 0.12)2 0.01635805
C0.13 k(0.13)(1-0.13) (1 − 0.13)2 k(0.13)(1 − 0.13)2 0.01712393
C0.14 k(0.14)(1-0.14) (1 − 0.14)2 k(0.14)(1 − 0.14)2 0.01781254
C0.15 k(0.15)(1-0.15) (1 − 0.15)2 k(0.15)(1 − 0.15)2 0.01842682
C0.16 k(0.16)(1-0.16) (1 − 0.16)2 k(0.16)(1 − 0.16)2 0.01896969
C0.17 k(0.17)(1-0.17) (1 − 0.17)2 k(0.17)(1 − 0.17)2 0.019444
C0.18 k(0.18)(1-0.18) (1 − 0.18)2 k(0.18)(1 − 0.18)2 0.01985256
C0.19 k(0.19)(1-0.19) (1 − 0.19)2 k(0.19)(1 − 0.19)2 0.02019812
C0.2 k(0.2)(1-0.2) (1 − 0.2)2 k(0.2)(1 − 0.2)2 0.02048341
C0.21 k(0.21)(1-0.21) (1 − 0.21)2 k(0.21)(1 − 0.21)2 0.02071109
C0.22 k(0.22)(1-0.22) (1 − 0.22)2 k(0.22)(1 − 0.22)2 0.02088377
C0.23 k(0.23)(1-0.23) (1 − 0.23)2 k(0.23)(1 − 0.23)2 0.02100402
C0.24 k(0.24)(1-0.24) (1 − 0.24)2 k(0.24)(1 − 0.24)2 0.02107436
C0.25 k(0.25)(1-0.25) (1 − 0.25)2 k(0.25)(1 − 0.25)2 0.02109727
C0.26 k(0.26)(1-0.26) (1 − 0.26)2 k(0.26)(1 − 0.26)2 0.02107516
C0.27 k(0.27)(1-0.27) (1 − 0.27)2 k(0.27)(1 − 0.27)2 0.02101042
C0.28 k(0.28)(1-0.28) (1 − 0.28)2 k(0.28)(1 − 0.28)2 0.02090537
C0.29 k(0.29)(1-0.29) (1 − 0.29)2 k(0.29)(1 − 0.29)2 0.0207623
C0.3 k(0.3)(1-0.3) (1 − 0.3)2 k(0.3)(1 − 0.3)2 0.02058343
C0.31 k(0.31)(1-0.31) (1 − 0.31)2 k(0.31)(1 − 0.31)2 0.02037095
C0.32 k(0.32)(1-0.32) (1 − 0.32)2 k(0.32)(1 − 0.32)2 0.020127
C0.33 k(0.33)(1-0.33) (1 − 0.33)2 k(0.33)(1 − 0.33)2 0.01985367
C0.34 k(0.34)(1-0.34) (1 − 0.34)2 k(0.34)(1 − 0.34)2 0.01955299
C0.35 k(0.35)(1-0.35) (1 − 0.35)2 k(0.35)(1 − 0.35)2 0.01922695
C0.36 k(0.36)(1-0.36) (1 − 0.36)2 k(0.36)(1 − 0.36)2 0.01887751
C0.37 k(0.37)(1-0.37) (1 − 0.37)2 k(0.37)(1 − 0.37)2 0.01850656
C0.38 k(0.38)(1-0.38) (1 − 0.38)2 k(0.38)(1 − 0.38)2 0.01811595
C0.39 k(0.39)(1-0.39) (1 − 0.39)2 k(0.39)(1 − 0.39)2 0.01770747
C0.4 k(0.4)(1-0.4) (1 − 0.4)2 k(0.4)(1 − 0.4)2 0.01728288
C0.41 k(0.41)(1-0.41) (1 − 0.41)2 k(0.41)(1 − 0.41)2 0.01684389
C0.42 k(0.42)(1-0.42) (1 − 0.42)2 k(0.42)(1 − 0.42)2 0.01639214
C0.43 k(0.43)(1-0.43) (1 − 0.43)2 k(0.43)(1 − 0.43)2 0.01592925
C0.44 k(0.44)(1-0.44) (1 − 0.44)2 k(0.44)(1 − 0.44)2 0.01545678
C0.45 k(0.45)(1-0.45) (1 − 0.45)2 k(0.45)(1 − 0.45)2 0.01497625
C0.46 k(0.46)(1-0.46) (1 − 0.46)2 k(0.46)(1 − 0.46)2 0.0144891
C0.47 k(0.47)(1-0.47) (1 − 0.47)2 k(0.47)(1 − 0.47)2 0.01399677
C0.48 k(0.48)(1-0.48) (1 − 0.48)2 k(0.48)(1 − 0.48)2 0.01350062
C0.49 k(0.49)(1-0.49) (1 − 0.49)2 k(0.49)(1 − 0.49)2 0.01300196
C0.5 k(0.5)(1-0.5) (1 − 0.5)2 k(0.5)(1 − 0.5)2 0.01250208
C0.51 k(0.51)(1-0.51) (1 − 0.51)2 k(0.51)(1 − 0.51)2 0.0120022
C0.52 k(0.52)(1-0.52) (1 − 0.52)2 k(0.52)(1 − 0.52)2 0.01150349
C0.53 k(0.53)(1-0.53) (1 − 0.53)2 k(0.53)(1 − 0.53)2 0.01100707
C0.54 k(0.54)(1-0.54) (1 − 0.54)2 k(0.54)(1 − 0.54)2 0.01051404
C0.55 k(0.55)(1-0.55) (1 − 0.55)2 k(0.55)(1 − 0.55)2 0.01002542
C0.56 k(0.56)(1-0.56) (1 − 0.56)2 k(0.56)(1 − 0.56)2 0.009542198
C0.57 k(0.57)(1-0.57) (1 − 0.57)2 k(0.57)(1 − 0.57)2 0.009065309
C0.58 k(0.58)(1-0.58) (1 − 0.58)2 k(0.58)(1 − 0.58)2 0.008595641
C0.59 k(0.59)(1-0.59) (1 − 0.59)2 k(0.59)(1 − 0.59)2 0.008134034
C0.6 k(0.6)(1-0.6) (1 − 0.6)2 k(0.6)(1 − 0.6)2 0.00768128
C0.61 k(0.61)(1-0.61) (1 − 0.61)2 k(0.61)(1 − 0.61)2 0.007238124

C0.62 k(0.62)(1-0.62) (1 − 0.62)^2 k(0.62)(1 − 0.62)^2 0.006805262
C0.63 k(0.63)(1-0.63) (1 − 0.63)^2 k(0.63)(1 − 0.63)^2 0.006383342
C0.64 k(0.64)(1-0.64) (1 − 0.64)^2 k(0.64)(1 − 0.64)^2 0.005972963
C0.65 k(0.65)(1-0.65) (1 − 0.65)^2 k(0.65)(1 − 0.65)^2 0.005574679
C0.66 k(0.66)(1-0.66) (1 − 0.66)^2 k(0.66)(1 − 0.66)^2 0.005188993
C0.67 k(0.67)(1-0.67) (1 − 0.67)^2 k(0.67)(1 − 0.67)^2 0.004816361
C0.68 k(0.68)(1-0.68) (1 − 0.68)^2 k(0.68)(1 − 0.68)^2 0.004457191
... (the table continues in the same pattern up to C0.99; it runs off the slide)
MIT18_05S14_class13_slides 311

Maybe there’s a better way


Use some symbolic notation!

Let θ be the probability of heads: θ = 0.01, 0.02, . . . , 0.99.


Use θ to also stand for the hypothesis that the coin is of type

with probability of heads = θ.

We are given a formula for the prior: p(θ) = kθ(1 − θ)

The likelihood P(data|θ) = P(TT |θ) = (1 − θ)2 .

Our 99 row table becomes:


hyp. prior likelihood Bayes numerator posterior
H P(H) P(data|H) P(data|H)P(H) P(H|data)
θ kθ(1 − θ) (1 − θ)^2 kθ(1 − θ)^3 kθ(1 − θ)^3/P(data) ≈ 0.20 · θ(1 − θ)^3
Total 1 P(data) = 0.300 1

(We used R to compute k ≈ 0.060 so that the total prior probability is 1. Then we
used it again to compute P(data) = 0.300, giving k/P(data) ≈ 0.20.)
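Here is what that R computation looks like (a sketch reproducing the big table; added, not part of the original slides):

    # 99 hypotheses theta = 0.01, ..., 0.99, prior proportional to theta*(1 - theta).
    theta <- seq(0.01, 0.99, by = 0.01)
    prior <- theta * (1 - theta); prior <- prior / sum(prior)
    lik   <- (1 - theta)^2                   # likelihood of data TT
    p_data <- sum(lik * prior)               # ≈ 0.300
    posterior <- lik * prior / p_data
    posterior[25]                            # theta = 0.25: ≈ 0.0211, matching the table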
January 1, 2017 11 /24
Notation: big and little letters
1. (Big letters) Event A, probability function P(A).

2. (Little letters) Value x, pmf p(x) or pdf f (x).


‘X = x’ is an event: P(X = x) = p(x).

Bayesian updating
3. (Big letters) For hypotheses H and data D:

P(H), P(D), P(H|D), P(D|H).

4. (Small letters) Hypothesis values θ and data values x:


p(θ) p(x) p(θ|x) p(x|θ)
f (θ) dθ f (x) dx f (θ|x) dθ f (x|θ) dx

Example. Coin example in reading


MIT18_05S14_class13_slides 313
January 1, 2017 12 /24
Review of pdf and probability
X random variable with pdf f (x).
f (x) is a density with units: probability/units of x.
[Two figures: the pdf f(x) with the area between c and d shaded, representing P(c ≤ X ≤ d); and the same pdf with an infinitesimal strip of width dx at x, representing the probability f(x) dx.]

P(c ≤ X ≤ d) = ∫_c^d f(x) dx.

The probability that X is in an infinitesimal range dx around x is

f (x) dx

MIT18_05S14_class13_slides 314
January 1, 2017 13 /24
Example of continuous hypotheses

Example. Suppose that we have a coin with probability of heads θ,


where θ is unknown. We can hypothesize that θ takes any value in
[0, 1].
Since θ is continuous we need a prior pdf f (θ), e.g.

f (θ) = kθ(1 − θ).

Use f (θ) dθ to work with probabilities instead of densities, e.g.


For example, the prior probability that θ is in an infinitesimal
range dθ around 0.5 is f (0.5) dθ.

To avoid cumbersome language we will simply say

‘The hypothesis θ has prior probability f (θ) dθ.’

MIT18_05S14_class13_slides 315
January 1, 2017 14 /24
Law of total probability for continuous distributions
Discrete set of hypotheses H1 , H2 , . . . Hn ; data D:
P(D) = Σ_{i=1}^n P(D|Hi) P(Hi).

In little letters: Hypotheses θ1, θ2, . . . , θn; data x:

p(x) = Σ_{i=1}^n p(x|θi) p(θi).

Continuous range of hypotheses θ on [a, b]; discrete data x:

p(x) = ∫_a^b p(x|θ) f(θ) dθ
Also called prior predictive probability of the outcome x.
MIT18_05S14_class13_slides 316
January 1, 2017 15 /24
Board question: total probability
1. A coin has unknown probability of heads θ with prior pdf
f (θ) = 3θ2 . Find the probability of throwing tails on the first toss.

2. Describe a setup with success and failure that this models.

answer: 1. Take x = 1 for heads and x = 0 for tails. The likelihood


p(x = 0|θ) = 1 − θ.
The law of total probability says
p(x = 0) = ∫_0^1 p(x = 0|θ) f(θ) dθ = ∫_0^1 (1 − θ) · 3θ^2 dθ = 1/4.

2. There are many possible examples. Here’s one:


A medical treatment has unknown probability θ of success. A priori we
think it's a good treatment, so we use a prior of f(θ) = 3θ^2, which is biased
towards success. The first use of it is successful, so after updating we are
even more biased for the treatment.
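A one-line numerical check of part 1 in R (added; not from the slides):

    # Prior predictive probability of tails with prior f(theta) = 3*theta^2.
    integrate(function(th) (1 - th) * 3 * th^2, lower = 0, upper = 1)$value   # 0.25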
January 1, 2017 16 /24
Bayes’ theorem for continuous distributions

θ: continuous parameter with pdf f (θ) and range [a, b];


x: random discrete data;
likelihood: p(x|θ).

Bayes’ Theorem.
f(θ|x) dθ = p(x|θ)f(θ) dθ / p(x) = p(x|θ)f(θ) dθ / ∫_a^b p(x|θ)f(θ) dθ.

Not everyone uses dθ (but they should):

f(θ|x) = p(x|θ)f(θ) / p(x) = p(x|θ)f(θ) / ∫_a^b p(x|θ)f(θ) dθ.
MIT18_05S14_class13_slides 318
January 1, 2017 17 /24
Concept question

Suppose X ∼ Bernoulli(θ) where the value of θ is unknown. If we use


Bayesian methods to make probabilistic statements about θ then
which of the following is true?

1. The random variable is discrete, the space of hypotheses is


discrete.
2. The random variable is discrete, the space of hypotheses is
continuous.
3. The random variable is continuous, the space of hypotheses is
discrete.
4. The random variable is continuous, the space of hypotheses is
continuous.
answer: 2. A Bernoulli random variable takes values 0 or 1. So X is
discrete. The parameter θ can be anywhere in the continuous range [0,1].
Therefore the space of hypotheses is continuous.
MIT18_05S14_class13_slides 319
January 1, 2017 18 /24
Bayesian update tables: continuous priors
X ∼ Bernoulli(θ). Unknown θ

Continuous hypotheses θ in [0,1].

Data x.

Prior pdf f (θ).

Likelihood p(x|θ).

hypothesis prior likelihood Bayes numerator posterior
θ f(θ) dθ p(x|θ) p(x|θ)f(θ) dθ p(x|θ)f(θ) dθ / p(x)
Total 1 p(x) = ∫_0^1 p(x|θ)f(θ) dθ 1

Note: p(x) = the prior predictive probability of x.
MIT18_05S14_class13_slides 320
January 1, 2017 19 /24
Board question
‘Bent’ coin: unknown probability θ of heads.

Prior: f (θ) = 2θ on [0, 1].

Data: toss and get heads.

1. Find the posterior pdf to this new data.

2. Suppose you toss again and get tails. Update your posterior from
problem 1 using this data.

3. On one set of axes graph the prior and the posteriors from
problems 1 and 2.

See next slide for solution.
MIT18_05S14_class13_slides 321
January 1, 2017 20 /24
Solution
Problem 1
Bayes
hypoth. prior likelihood numerator posterior
θ 2θ dθ θ 2θ^2 dθ 3θ^2 dθ
Total 1 T = ∫_0^1 2θ^2 dθ = 2/3 1

Posterior pdf: f(θ|x) = 3θ^2. (Should graph this.)
Note: We don’t really need to compute T . Once we know the posterior
density is of the form cθ2 we only have to find the value of c makes it
have total probability 1.

Problem 2
Bayes
hypoth. prior likelihood numerator posterior
θ 3θ^2 dθ 1 − θ 3θ^2(1 − θ) dθ 12θ^2(1 − θ) dθ
Total 1 ∫_0^1 3θ^2(1 − θ) dθ = 1/4 1

Posterior pdf: f(θ|x) = 12θ^2(1 − θ).
MIT18_05S14_class13_slides 322
January 1, 2017 21 /24
Board Question

Same scenario: bent coin ∼ Bernoulli(θ).

Flat prior: f (θ) = 1 on [0, 1]

Data: toss 27 times and get 15 heads and 12 tails.

1. Use this data to find the posterior pdf.

Give the integral for the normalizing factor, but do not compute it
out. Call its value T and give the posterior pdf in terms of T .
answer: f(θ|x) = (1/T) θ^15 (1 − θ)^12. (Called a Beta distribution.)
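For reference (added; not in the slides): the normalized version of this posterior is the beta(16, 13) distribution, so in R it can be plotted directly:

    # Posterior pdf for theta after 15 heads and 12 tails with a flat prior.
    curve(dbeta(x, 16, 13), from = 0, to = 1)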

MIT18_05S14_class13_slides 323
January 1, 2017 22 /24
Beta distribution
Beta(a, b) has density

f(θ) = [(a + b − 1)! / ((a − 1)!(b − 1)!)] θ^(a−1) (1 − θ)^(b−1)

http://mathlets.org/mathlets/beta-distribution/

Observation: The coefficient is a normalizing factor so if

f(θ) = c θ^(a−1) (1 − θ)^(b−1)

is a pdf, then

c = (a + b − 1)! / ((a − 1)!(b − 1)!)
and f (θ) is the pdf of a Beta(a, b) distribution.
MIT18_05S14_class13_slides 324
January 1, 2017 23 /24
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class13_slides 325
Bayesian Updating: Continuous Priors
18.05 Spring 2014

Compute ∫_a^b f(x|θ)f(θ) dθ

Class 14 Slides with Solutions: Beta Distributions: Continuous Data 326


January 1, 2017 1 /26
Beta distribution

Beta(a, b) has density

f(θ) = [(a + b − 1)! / ((a − 1)!(b − 1)!)] θ^(a−1) (1 − θ)^(b−1)

http://mathlets.org/mathlets/beta-distribution/

Observation:
The coefficient is a normalizing factor, so if we have a pdf
f(θ) = c θ^(a−1) (1 − θ)^(b−1)
then
θ ∼ beta(a, b)
and
c = (a + b − 1)! / ((a − 1)!(b − 1)!)
MIT18_05S14_class14_slides 327
January 1, 2017 2 /26
Board question preamble: beta priors
Suppose you are testing a new medical treatment with unknown
probability of success θ. You don’t know that θ, but your prior belief
is that it’s probably not too far from 0.5. You capture this intuition
with a beta(5,5) prior on θ.
[Plot: the beta(5, 5) prior pdf for θ on [0, 1].]

To sharpen this distribution you take data and update the prior.
MIT18_05S14_class14_slides 328
Question on next slide.
January 1, 2017 3 /26
Board question: beta priors
Beta(a, b): f(θ) = [(a + b − 1)! / ((a − 1)!(b − 1)!)] θ^(a−1) (1 − θ)^(b−1)
Treatment has prior f (θ) ∼ beta(5, 5)

1. Suppose you test it on 10 patients and have 6 successes. Find the


posterior distribution on θ. Identify the type of the posterior
distribution.

2. Suppose you recorded the order of the results and got


S S S F F S S S F F. Find the posterior based on this data.

3. Using your answer to (2) give an integral for the posterior


predictive probability of success with the next patient.

4. Use what you know about pdf’s to evaluate the integral without
computing it directly
MIT18_05S14_class14_slides 329
January 1, 2017 4 /26
Solution

1. Prior pdf is f(θ) = (9!/(4! 4!)) θ^4 (1 − θ)^4 = c1 θ^4 (1 − θ)^4.

hypoth. prior likelihood Bayes numer. posterior
θ c1 θ^4 (1 − θ)^4 dθ (10 choose 6) θ^6 (1 − θ)^4 c3 θ^10 (1 − θ)^8 dθ beta(11, 9)

We know the normalized posterior is a beta distribution because it has the
form of a beta distribution (c θ^(a−1) (1 − θ)^(b−1) on [0,1]), so by our earlier
observation it must be a beta distribution.
2. The answer is the same. The only change is that the likelihood has a
coefficient of 1 instead of a binomial coefficent.
3. The posterior on θ is beta(11, 9) which has density

f(θ | data) = (19!/(10! 8!)) θ^10 (1 − θ)^8.

Solution to (3) continued on next slide

MIT18_05S14_class14_slides 330
January 1, 2017 5 /26
Solution continued
The law of total probability says that the posterior predictive probability of
success is
P(success | data) = ∫_0^1 f(success | θ) · f(θ | data) dθ

                  = ∫_0^1 θ · (19!/(10! 8!)) θ^10 (1 − θ)^8 dθ = ∫_0^1 (19!/(10! 8!)) θ^11 (1 − θ)^8 dθ

4. We compute the integral in (3) by relating it to the pdf of beta(12, 9):
(20!/(11! 8!)) θ^11 (1 − θ)^8. Since the pdf of beta(12, 9) integrates to 1 we have

∫_0^1 (20!/(11! 8!)) θ^11 (1 − θ)^8 dθ = 1   ⇒   ∫_0^1 θ^11 (1 − θ)^8 dθ = 11! 8!/20!.

Thus

∫_0^1 (19!/(10! 8!)) θ^11 (1 − θ)^8 dθ = (19!/(10! 8!)) · (11! 8!/20!) = 11/20.
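A numerical check in R (added; not from the slides):

    # Posterior predictive P(success | data): the mean of the beta(11, 9) posterior.
    integrate(function(th) th * dbeta(th, 11, 9), lower = 0, upper = 1)$value   # 0.55 = 11/20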
January 1, 2017 6 /26
Conjugate priors

We had
Prior f (θ) dθ: beta distribution

Likelihood p(x|θ): binomial distribution

Posterior f (θ|x) dθ: beta distribution

The beta distribution is called a conjugate prior for the binomial


likelihood.

That is, the beta prior becomes a beta posterior and repeated
updating is easy!

MIT18_05S14_class14_slides 332
January 1, 2017 7 /26
Concept Question

Suppose your prior f (θ) in the bent coin example is Beta(6, 8). You
flip the coin 7 times, getting 2 heads and 5 tails. What is the
posterior pdf f (θ|x)?

1. Beta(2,5)
2. Beta(3,6)
3. Beta(6,8)
4. Beta(8,13)

We saw in the previous board question that 2 heads and 5 tails will update
a beta(a, b) prior to a beta(a + 2, b + 5) posterior.

answer: (4) beta(8, 13).


MIT18_05S14_class14_slides 333
January 1, 2017 8 /26
Reminder: predictive probabilities
Continuous hypotheses θ, discrete data x1 , x2 , . . .
(Assume trials are independent given the hypothesis θ.)

Prior predictive probability


p(x1) = ∫ p(x1 | θ) f(θ) dθ

Posterior predictive probability

p(x2 | x1) = ∫ p(x2 | θ) f(θ | x1) dθ

Analogous to discrete hypotheses: H1, H2, . . .

p(x1) = Σ_{i=1}^n p(x1 | Hi) P(Hi),    p(x2 | x1) = Σ_{i=1}^n p(x2 | Hi) p(Hi | x1).
MIT18_05S14_class14_slides 334
January 1, 2017 9 /26
Continuous priors, continuous data

Bayesian update tables:

hypoth. prior likelihood Bayes numerator posterior f(θ|x) dθ
θ f(θ) dθ f(x | θ) f(x | θ)f(θ) dθ f(x | θ)f(θ) dθ / f(x)
total 1 f(x) 1

f(x) = ∫ f(x | θ)f(θ) dθ

MIT18_05S14_class14_slides 335
January 1, 2017 10 /26
Normal prior, normal data
N(µ, σ 2 ) has density
f(y) = (1/(σ√(2π))) e^(−(y−µ)^2 / 2σ^2).

Observation:
The coefficient is a normalizing factor, so if we have a pdf

f(y) = c e^(−(y−µ)^2 / 2σ^2)

then
y ∼ N(µ, σ^2)
and
c = 1/(σ√(2π))
MIT18_05S14_class14_slides 336
January 1, 2017 11 /26
Board question: normal prior, normal data

N(µ, σ^2) has pdf: f(y) = (1/(σ√(2π))) e^(−(y−µ)^2 / 2σ^2).
Suppose our data follows a N(θ, 4) distribution with unknown
mean θ and variance 4. That is

f (x | θ) = pdf of N(θ, 4)

Suppose our prior on θ is N(3, 1).

Suppose we obtain data x1 = 5.

1. Use the data to find the posterior pdf for θ.

Write out your tables clearly. Use (and understand) infinitesimals.


You will have to remember how to complete the square to do the
MIT18_05S14_class14_slides
updating! 337
January 1, 2017 12 /26
Solution

We have:
Prior: θ ∼ N(3, 1): f(θ) = c1 e^(−(θ−3)^2/2)
Likelihood x ∼ N(θ, 4): f(x | θ) = c2 e^(−(x−θ)^2/8)
For x = 5 the likelihood is c2 e^(−(5−θ)^2/8)

hypoth. prior likelihood Bayes numer.
θ c1 e^(−(θ−3)^2/2) dθ c2 e^(−(5−θ)^2/8) dx c3 e^(−(θ−3)^2/2) e^(−(5−θ)^2/8) dθ dx

A bit of algebraic manipulation of the Bayes numerator gives

c3 e^(−(θ−3)^2/2) e^(−(5−θ)^2/8) dθ dx = c3 e^(−(5/8)[θ^2 − (34/5)θ + 61/5]) = c3 e^(−(5/8)[(θ−17/5)^2 + 61/5 − (17/5)^2])

= c3 e^(−(5/8)(61/5 − (17/5)^2)) e^(−(5/8)(θ−17/5)^2)

= c4 e^(−(5/8)(θ−17/5)^2) = c4 e^(−(θ−17/5)^2 / (2 · 4/5))

The last expression shows the posterior is N(17/5, 4/5).
MIT18_05S14_class14_slides 338
January 1, 2017 13 /26
Solution graphs

prior = blue; posterior = purple; data = red

Data: x1 = 5
Prior is normal: µprior = 3; σprior = 1
Likelihood is normal: µ = θ; σ=2
Posterior is normal µposterior = 3.4; σposterior = 0.894
• Will see simple formulas for doing this update next time.
MIT18_05S14_class14_slides 339
January 1, 2017 14 /26
Board question: Romeo and Juliet

Romeo is always late. How late follows a uniform distribution


uniform(0, θ) with unknown parameter θ in hours.

Juliet knows that θ ≤ 1 hour and she assumes a flat prior for θ on
[0, 1].

On their first date Romeo is 15 minutes late. Use this data to update
the prior distribution for θ.

(a) Find and graph the prior and posterior pdfs for θ.
(b) Find the prior predictive pdf for how late Romeo will be on the
first date and the posterior predictive pdf of how late he’ll be on the
second date (if he gets one!). Graph these pdfs.
See next slides for solution
MIT18_05S14_class14_slides 340
January 1, 2017 15 /26
Solution

Parameter of interest: θ = upper bound on R’s lateness.

Data: x1 = 0.25.

Goals: (a) Posterior pdf for θ

(b) Predictive pdf’s –requires pdf’s for θ


In the update table we split the hypotheses into the two different cases
θ < 0.25 and θ ≥ 0.25 :
hyp. prior f(θ) likelihood f(x1|θ) Bayes numerator posterior f(θ|x1)
θ < 0.25 dθ 0 0 0
θ ≥ 0.25 dθ 1/θ dθ/θ (c/θ) dθ
Tot. 1 T 1

The normalizing constant c must make the total posterior probability 1, so

c ∫_{0.25}^1 dθ/θ = 1  ⇒  c = 1/ln(4).

Continued on next slide.

MIT18_05S14_class14_slides 341
January 1, 2017 16 /26
Solution graphs

Prior and posterior pdf's for θ.
MIT18_05S14_class14_slides 342
January 1, 2017 17 /26
Solution graphs continued

(b) Prior prediction: The likelihood function falls into cases:



f(x1|θ) = 1/θ if θ ≥ x1,   0 if θ < x1

Therefore the prior predictive pdf of x1 is

f(x1) = ∫_{x1}^1 f(x1|θ)f(θ) dθ = ∫_{x1}^1 (1/θ) dθ = − ln(x1).

continued on next slide

MIT18_05S14_class14_slides 343
January 1, 2017 18 /26
Solution continued

Posterior prediction:

The likelihood function is the same as before:


f(x2|θ) = 1/θ if θ ≥ x2,   0 if θ < x2.

The posterior predictive pdf is f(x2|x1) = ∫ f(x2|θ)f(θ|x1) dθ. The
integrand is 0 unless θ > x2 and θ > 0.25. There are two cases:

If x2 < 0.25:  f(x2|x1) = ∫_{0.25}^1 (c/θ^2) dθ = 3c = 3/ln(4).

If x2 ≥ 0.25:  f(x2|x1) = ∫_{x2}^1 (c/θ^2) dθ = (1/x2 − 1)/ln(4).
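As a numerical sanity check of the second formula (added; not from the slides), e.g. at x2 = 0.5:

    # Posterior predictive density at x2 = 0.5; should equal (1/0.5 - 1)/log(4).
    c_norm <- 1 / log(4)
    integrate(function(th) c_norm / th^2, lower = 0.5, upper = 1)$value   # ≈ 0.721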

Plots of the predictive pdf’s are on the next slide.

MIT18_05S14_class14_slides 344
January 1, 2017 19 /26
Solution continued

Prior (red) and posterior (blue) predictive pdf's for x2.
MIT18_05S14_class14_slides 345
January 1, 2017 20 /26
From discrete to continuous Bayesian updating

Bent coin with unknown probability of heads θ.

Data x1 : heads on one toss.

Start with a flat prior and update:

hyp. prior likelihood Bayes numerator posterior
θ dθ θ θ dθ 2θ dθ
Total 1 ∫_0^1 θ dθ = 1/2 1

Posterior pdf: f (θ | x1 ) = 2θ.

MIT18_05S14_class14_slides 346
January 1, 2017 21 /26
Approximate continuous by discrete

approximate the continuous range of hypotheses by a finite


number of hypotheses.
create the discrete updating table for the finite number of
hypotheses.
consider how the table changes as the number of hypotheses
goes to infinity.

MIT18_05S14_class14_slides 347
January 1, 2017 22 /26
Chop [0, 1] into 4 intervals

hypothesis prior likelihood Bayes num. posterior

θ = 1/8 1/4 1/8 (1/4) × (1/8) 1/16

θ = 3/8 1/4 3/8 (1/4) × (3/8) 3/16

θ = 5/8 1/4 5/8 (1/4) × (5/8) 5/16

θ = 7/8 1/4 7/8 (1/4) × (7/8) 7/16


Total 1 – Σ_{i=1}^n θi ∆θ 1

MIT18_05S14_class14_slides 348
January 1, 2017 23 /26
Chop [0, 1] into 12 intervals

hypothesis prior likelihood Bayes num. posterior


θ = 1/24 1/12 1/24 (1/12) × (1/24) 1/144
θ = 3/24 1/12 3/24 (1/12) × (3/24) 3/144
θ = 5/24 1/12 5/24 (1/12) × (5/24) 5/144
θ = 7/24 1/12 7/24 (1/12) × (7/24) 7/144
θ = 9/24 1/12 9/24 (1/12) × (9/24) 9/144
θ = 11/24 1/12 11/24 (1/12) × (11/24) 11/144
θ = 13/24 1/12 13/24 (1/12) × (13/24) 13/144
θ = 15/24 1/12 15/24 (1/12) × (15/24) 15/144
θ = 17/24 1/12 17/24 (1/12) × (17/24) 17/144
θ = 19/24 1/12 19/24 (1/12) × (19/24) 19/144
θ = 21/24 1/12 21/24 (1/12) × (21/24) 21/144
θ = 23/24 1/12 23/24 (1/12) × (23/24) 23/144
Total 1 – Σ_{i=1}^n θi ∆θ 1

MIT18_05S14_class14_slides 349
January 1, 2017 24 /26
Density histogram

Density histogram for posterior pmf with 4 and 20 slices.

[Two density histograms (vertical axis: density, 0 to 2; horizontal axis: x, with ticks at 1/8, 3/8, 5/8, 7/8 in the 4-slice case).]

The original posterior pdf is shown in red.

MIT18_05S14_class14_slides 350
January 1, 2017 25 /26
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class14_slides 351
Conjugate Priors: Beta and Normal
18.05 Spring 2014

Class 15 Slides with Solutions: Conjugate Priors; Choosing Priors 352


January 1, 2017 1 /20
Review: Continuous priors, discrete data

‘Bent’ coin: unknown probability θ of heads.

Prior f (θ) = 2θ on [0,1].

Data: heads on one toss.

Question: Find the posterior pdf to this data.

Bayes
hypoth. prior likelihood numerator posterior
θ 2θ dθ θ 2θ^2 dθ 3θ^2 dθ
Total 1 T = ∫_0^1 2θ^2 dθ = 2/3 1

Posterior pdf: f (θ|x) = 3θ2 .

MIT18_05S14_class15_slides 353
January 1, 2017 2 /20
Review: Continuous priors, continuous data

Bayesian update table

hypoth. prior likeli. Bayes numerator posterior
θ f(θ) dθ f(x | θ) f(x | θ)f(θ) dθ f(θ | x) dθ = f(x | θ)f(θ) dθ / f(x)
total 1 f(x) 1

f(x) = ∫ f(x | θ)f(θ) dθ

Notice that we overuse the letter f . It is a generic symbol meaning


‘whatever function is appropriate here’.

MIT18_05S14_class15_slides 354
January 1, 2017 3 /20
Romeo and Juliet

See class 14 slides

MIT18_05S14_class15_slides 355
January 1, 2017 4 /20
Updating with normal prior and normal likelihood
A normal prior is conjugate to a normal likelihood with known σ.
Data: x1 , x2 , . . . , xn
Normal likelihood. x1 , x2 , . . . , xn ∼ N(θ, σ 2 )

Assume θ is our unknown parameter of interest, σ is known.

2
Normal prior. θ ∼ N(µprior , σprior ).
2
Normal Posterior. θ ∼ N(µpost , σpost ).

We have simple updating formulas that allow us to avoid


complicated algebra or integrals (see next slide).

hypoth. prior likelihood posterior
θ f(θ) ∼ N(µprior, σprior^2) f(x|θ) ∼ N(θ, σ^2) f(θ|x) ∼ N(µpost, σpost^2)
θ c1 exp(−(θ−µprior)^2 / 2σprior^2) c2 exp(−(x−θ)^2 / 2σ^2) c3 exp(−(θ−µpost)^2 / 2σpost^2)
2σpost
MIT18_05S14_class15_slides 356
January 1, 2017 5 /20
Board question: Normal-normal updating formulas

a = 1/σprior^2,   b = n/σ^2,   µpost = (a µprior + b x̄)/(a + b),   σpost^2 = 1/(a + b).

Suppose we have one data point x = 2 drawn from N(θ, 32 )


Suppose θ is our parameter of interest with prior θ ∼ N(4, 22 ).

0. Identify µprior , σprior , σ, n, and x̄.


1. Make a Bayesian update table, but leave the posterior as an
unsimplified product.
2. Use the updating formulas to find the posterior.

3. By doing enough of the algebra, understand that the updating


formulas come by using the updating table and doing a lot of algebra.
MIT18_05S14_class15_slides 357
January 1, 2017 6 /20
Solution

0. µprior = 4, σprior = 2, σ = 3, n = 1, x̄ = 2.
1.
hypoth. prior likelihood posterior
θ f(θ) ∼ N(4, 2^2) f(x|θ) ∼ N(θ, 3^2) f(θ|x) ∼ N(µpost, σpost^2)
θ c1 exp(−(θ−4)^2/8) c2 exp(−(2−θ)^2/18) c3 exp(−(θ−4)^2/8) · exp(−(2−θ)^2/18)

2. We have a = 1/4, b = 1/9, a + b = 13/36. Therefore

µpost = (1 + 2/9)/(13/36) = 44/13 = 3.3846


σpost^2 = 36/13 = 2.7692

The posterior pdf is f (θ|x = 2) ∼ N(3.3846, 2.7692).


3. See the reading class15-prep-a.pdf example 2.
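In R the updating formulas give the same numbers (a sketch added for reference; not in the slides):

    # Normal-normal update: prior N(4, 2^2), one observation x = 2 from N(theta, 3^2).
    a <- 1 / 2^2;  b <- 1 / 3^2
    (a * 4 + b * 2) / (a + b)    # mu_post  = 44/13 ≈ 3.3846
    1 / (a + b)                  # var_post = 36/13 ≈ 2.7692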
MIT18_05S14_class15_slides 358
January 1, 2017 7 /20
Concept question: normal priors, normal likelihood

[Figure: the prior (blue) together with five candidate posterior pdfs labeled Plot 1–Plot 5, plotted over the range 0 to 14.]

Blue graph = prior Red lines = data in order: 3, 9, 12

(a) Which plot is the posterior to just the first data value?
(Click on the plot number.) (Solution in 2 slides)
MIT18_05S14_class15_slides 359
January 1, 2017 8 /20
Concept question: normal priors, normal likelihood

[Figure: the prior (blue) together with five candidate posterior pdfs labeled Plot 1–Plot 5, plotted over the range 0 to 14.]

Blue graph = prior Red lines = data in order: 3, 9, 12

(b) Which graph is posterior to all 3 data values?


(Click on the plot number.) (Solution on next slide)
MIT18_05S14_class15_slides 360
January 1, 2017 9 /20
Solution to concept question

(a) Plot 2: The first data value is 3. Therefore the posterior must have its
mean between 3 and the mean of the blue prior. The only possibilities for
this are plots 1 and 2. We also know that the variance of the posterior is
less than that of the prior. Of plots 1 and 2, only plot 2 has smaller
variance than the prior.
(b) Plot 3: The average of the 3 data values is 8. Therefore the posterior
must have mean between the mean of the blue prior and 8, so the only
possibilities are plots 3 and 4. Because this posterior is posterior to the
magenta graph (plot 2) it must have smaller variance. This leaves only plot 3.

MIT18_05S14_class15_slides 361
January 1, 2017 10 /20
Board question: normal/normal

x1 +...+xn
For data x1 , . . . , xn with data mean x̄ = n

a = 1/σprior^2,   b = n/σ^2,   µpost = (a µprior + b x̄)/(a + b),   σpost^2 = 1/(a + b).

Question. On a basketball team the average free throw percentage
over all players follows a N(75, 36) distribution. In a given year an individual
player's free throw percentage is N(θ, 16), where θ is their career
average.

This season Sophie Lie made 85 percent of her free throws. What is
the posterior expected value of her career percentage θ?
answer: Solution on next frame

MIT18_05S14_class15_slides 362
January 1, 2017 11 /20
Solution

This is a normal/normal conjugate prior pair, so we use the update

formulas.

Parameter of interest: θ = career average.

Data: x = 85 = this year’s percentage.

Prior: θ ∼ N(75, 36)

Likelihood x ∼ N(θ, 16). So f(x|θ) = c1 e^(−(x−θ)^2/(2·16)).
The updating weights are

a = 1/36, b = 1/16, a + b = 52/576 = 13/144.

Therefore

µpost = (75/36 + 85/16)/(52/576) = 81.9,   σpost^2 = 144/13 = 11.1.

The posterior pdf is f (θ|x = 85) ∼ N(81.9, 11.1).
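The same computation in R (a sketch; not in the slides):

    # Free-throw update: prior N(75, 36), one observation x = 85 from N(theta, 16).
    a <- 1 / 36;  b <- 1 / 16
    (a * 75 + b * 85) / (a + b)   # mu_post  ≈ 81.9
    1 / (a + b)                   # var_post = 144/13 ≈ 11.1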


MIT18_05S14_class15_slides 363
January 1, 2017 12 /20
Conjugate priors
A prior is conjugate to a likelihood if the posterior is the same type of

distribution as the prior.

Updating becomes algebra instead of calculus.

hypothesis data prior likelihood posterior


Bernoulli/Beta θ ∈ [0, 1] x beta(a, b) Bernoulli(θ) beta(a + 1, b) or beta(a, b + 1)

θ x=1 c1 θa−1 (1 − θ)b−1 θ c3 θa (1 − θ)b−1

θ x=0 c1 θa−1 (1 − θ)b−1 1−θ c3 θa−1 (1 − θ)b


Binomial/Beta θ ∈ [0, 1] x beta(a, b) binomial(N, θ) beta(a + x, b + N − x)

(fixed N ) θ x c1 θa−1 (1 − θ)b−1 c2 θx (1 − θ)N −x c3 θa+x−1 (1 − θ)b+N −x−1


Geometric/Beta θ ∈ [0, 1] x beta(a, b) geometric(θ) beta(a + x, b + 1)

θ x c1 θa−1 (1 − θ)b−1 θx (1 − θ) c3 θa+x−1 (1 − θ)b


Normal/Normal θ ∈ (−∞, ∞) x N(µprior, σprior^2) N(θ, σ^2) N(µpost, σpost^2)
(fixed σ^2) θ x c1 exp(−(θ−µprior)^2/2σprior^2) c2 exp(−(x−θ)^2/2σ^2) c3 exp(−(θ−µpost)^2/2σpost^2)

There are many other likelihood/conjugate prior pairs.

MIT18_05S14_class15_slides 364
January 1, 2017 13 /20
Concept question: conjugate priors
Which are conjugate priors?

hypothesis data prior likelihood

a) Exponential/Normal θ ∈ [0, ∞) x N(µprior, σprior^2) exp(θ)
   θ x c1 exp(−(θ−µprior)^2/2σprior^2) θ e^(−θx)

b) Exponential/Gamma θ ∈ [0, ∞) x Gamma(a, b) exp(θ)
   θ x c1 θ^(a−1) e^(−bθ) θ e^(−θx)

c) Binomial/Normal θ ∈ [0, 1] x N(µprior, σprior^2) binomial(N, θ)
   (fixed N) θ x c1 exp(−(θ−µprior)^2/2σprior^2) c2 θ^x (1 − θ)^(N−x)

1. none 2. a 3. b 4. c
5.MIT18_05S14_class15_slides
a,b 6. a,c 7. b,c 8. a,b,c 365
January 1, 2017 14 /20
Answer: 3. b

We have a conjugate prior if the posterior as a function of θ has the same


form as the prior.

Exponential/Normal posterior:
f(θ|x) = c1 θ exp(−(θ−µprior)^2/2σprior^2 − θx)

The factor of θ before the exponential means this is not the pdf of a
normal distribution. Therefore it is not a conjugate prior.

Exponential/Gamma posterior: Note, we have never learned about Gamma


distributions, but it doesn’t matter. We only have to check if the posterior
has the same form:
f (θ|x) = c1 θa e−(b+x)θ
The posterior has the form Gamma(a + 1, b + x). This is a conjugate prior.

Binomial/Normal: It is clear that the posterior does not have the form of a normal distribution.
MIT18_05S14_class15_slides 366
January 1, 2017 15 /20
Variance can increase
Normal-normal: variance always decreases with data.
Beta-binomial: variance usually decreases with data.
[Plot: the pdfs of beta(2,12), beta(12,12), beta(21,12), and beta(21,19) on [0, 1].]
0.0 0.2 0.4 0.6 0.8 1.0

Variance of beta(2,12) (blue) is bigger than that of beta(12,12)


MIT18_05S14_class15_slides
(magenta), but beta(12,12) can be a posterior to beta(2,12) 367
January 1, 2017 16 /20
Table discussion: likelihood principle

Suppose the prior has been set. Let x1 and x2 be two sets of data.
Which of the following are true.
(a) If the likelihoods f (x1 |θ) and f (x2 |θ) are the same then they
result in the same posterior.
(b) If x1 and x2 result in the same posterior then their likelihood
functions are the same.
(c) If the likelihoods f (x1 |θ) and f (x2 |θ) are proportional then they
result in the same posterior.
(d) If two likelihood functions are proportional then they are equal.
answer: a: true; b: false, the likelihoods are only proportional;
c: true, scale factors don't matter; d: false.

MIT18_05S14_class15_slides 368
January 1, 2017 17 /20
Concept question: strong priors
Say we have a bent coin with unknown probability of heads θ.

We are convinced that θ ≤ 0.7.

Our prior is uniform on [0, 0.7] and 0 from 0.7 to 1.

We flip the coin 65 times and get 60 heads.

Which of the graphs below is the posterior pdf for θ?

[Plot: six candidate posterior pdfs labeled A–F on [0, 1].]
MIT18_05S14_class15_slides 369
January 1, 2017 18 /20
Solution to concept question

answer: Graph C, the blue graph spiking near 0.7.

Sixty heads in 65 tosses indicates the true value of θ is close to 1. Our


prior was 0 for θ > 0.7. So no amount of data will make the posterior
non-zero in that range. That is, we have foreclosed on the possibility of
deciding that θ is close to 1. The Bayesian updating puts θ near the top of
the allowed range.

MIT18_05S14_class15_slides 370
January 1, 2017 19 /20
Choosing Priors
Probability Intervals
18.05 Spring 2014

Class 16 Slides with Solutions: Probability Intervals 372


Conjugate priors
A prior is conjugate to a likelihood if the posterior is the same type of
distribution as the prior.
Updating becomes algebra instead of calculus.
hypothesis data prior likelihood posterior
Bernoulli/Beta θ ∈ [0, 1] x beta(a, b) Bernoulli(θ) beta(a + 1, b) or beta(a, b + 1)

θ x=1 c1 θa−1 (1 − θ)b−1 θ c3 θa (1 − θ)b−1

θ x=0 c1 θa−1 (1 − θ)b−1 1−θ c3 θa−1 (1 − θ)b


Binomial/Beta θ ∈ [0, 1] x beta(a, b) binomial(N, θ) beta(a + x, b + N − x)

(fixed N ) θ x c1 θa−1 (1 − θ)b−1 c2 θx (1 − θ)N −x c3 θa+x−1 (1 − θ)b+N −x−1


Geometric/Beta θ ∈ [0, 1] x beta(a, b) geometric(θ) beta(a + x, b + 1)

θ x c1 θa−1 (1 − θ)b−1 θx (1 − θ) c3 θa+x−1 (1 − θ)b


Normal/Normal   θ ∈ (−∞, ∞)   x   N(µ_prior, σ²_prior)   N(θ, σ²)   N(µ_post, σ²_post)

(fixed σ²)   θ   x   c1 exp(−(θ − µ_prior)²/(2σ²_prior))   c2 exp(−(x − θ)²/(2σ²))   c3 exp(−(θ − µ_post)²/(2σ²_post))

There are many other likelihood/conjugate prior pairs.


MIT18_05S14_class16_slides 373
April 18, 2017 2 / 33
Concept question: conjugate priors
Which are conjugate priors?

hypothesis data prior likelihood


a) Exponential/Normal   θ ∈ [0, ∞)   x   N(µ_prior, σ²_prior)   exp(θ)

   θ   x   c1 exp(−(θ − µ_prior)²/(2σ²_prior))   θ e^(−θx)

b) Exponential/Gamma θ ∈ [0, ∞) x Gamma(a, b) exp(θ)

θ x c1 θa−1 e−bθ θe−θx


c) Binomial/Normal   θ ∈ [0, 1]   x   N(µ_prior, σ²_prior)   binomial(N, θ)

(fixed N )   θ   x   c1 exp(−(θ − µ_prior)²/(2σ²_prior))   c2 θ^x (1 − θ)^(N−x)

1. none   2. a   3. b   4. c
5. a,b   6. a,c   7. b,c   8. a,b,c
April 18, 2017 3 / 33
Answer: 3. b
We have a conjugate prior if the posterior as a function of θ has the same
form as the prior.

Exponential/Normal posterior:
f (θ|x) = c1 θ exp(−(θ − µ_prior)²/(2σ²_prior) − θx)

The factor of θ before the exponential means this is not the pdf of a
normal distribution. Therefore it is not a conjugate prior.

Exponential/Gamma posterior: Note, we have never learned about Gamma


distributions, but it doesn’t matter. We only have to check if the posterior
has the same form:
f (θ|x) = c1 θa e−(b+x)θ
The posterior has the form Gamma(a + 1, b + x). This is a conjugate prior.

Binomial/Normal: It is clear that the posterior does not have the form of a
normal distribution.
April 18, 2017 4 / 33
Concept question: strong priors
Say we have a bent coin with unknown probability of heads θ.
We are convinced that θ ≤ 0.7.
Our prior is uniform on [0, 0.7] and 0 from 0.7 to 1.
We flip the coin 65 times and get 60 heads.
Which of the graphs below is the posterior pdf for θ?
[Plot of six candidate posterior pdfs, labeled A through F, on [0, 1].]
April 18, 2017 5 / 33
Solution to concept question

answer: Graph C, the blue graph spiking near 0.7.

Sixty heads in 65 tosses indicates the true value of θ is close to 1. Our


prior was 0 for θ > 0.7. So no amount of data will make the posterior
non-zero in that range. That is, we have foreclosed on the possibility of
deciding that θ is close to 1. The Bayesian updating puts θ near the top of
the allowed range.
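For readers who want to check this numerically, here is a short R sketch (an addition; the grid step of 0.001 is arbitrary) of the truncated-prior posterior:

# Posterior for a prior uniform on [0, 0.7] with data 60 heads in 65 tosses.
theta <- seq(0.001, 0.999, by = 0.001)
prior <- ifelse(theta <= 0.7, 1/0.7, 0)       # uniform on [0, 0.7], zero above 0.7
lik   <- theta^60 * (1 - theta)^5
post  <- prior * lik
post  <- post / (sum(post) * 0.001)           # normalize numerically to a density

theta[which.max(post)]                        # 0.7: the posterior spikes at the cutoff (graph C)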

MIT18_05S14_class16_slides 377
April 18, 2017 6 / 33
Two parameter tables: Malaria

In the 1950’s scientists injected 30 African “volunteers” with malaria.


S = carrier of sickle-cell gene
N = non-carrier of sickle-cell gene
D+ = developed malaria
D− = did not develop malaria

D+ D−
S 2 13 15
N 14 1 15
16 14 30

MIT18_05S14_class16_slides 378
April 18, 2017 7 / 33
Model

θS = probability an injected S develops malaria.


θN = probability an injected N develops malaria.
Assume conditional independence between all the experimental
subjects.
Likelihood is a function of both θS and θN :

P(data|θS , θN ) = c θS^2 (1 − θS )^13 θN^14 (1 − θN ).

Hypotheses: pairs (θS , θN ).


Finite number of hypotheses. θS and θN are each one of
0, .2, .4, .6, .8, 1.

MIT18_05S14_class16_slides 379
April 18, 2017 8 / 33
Color-coded two-dimensional tables
Hypotheses
θN \θS 0 0.2 0.4 0.6 0.8 1
1 (0,1) (.2,1) (.4,1) (.6,1) (.8,1) (1,1)
0.8 (0,.8) (.2,.8) (.4,.8) (.6,.8) (.8,.8) (1,.8)
0.6 (0,.6) (.2,.6) (.4,.6) (.6,.6) (.8,.6) (1,.6)
0.4 (0,.4) (.2,.4) (.4,.4) (.6,.4) (.8,.4) (1,.4)
0.2 (0,.2) (.2,.2) (.4,.2) (.6,.2) (.8,.2) (1,.2)
0 (0,0) (.2,0) (.4,0) (.6,0) (.8,0) (1,0)

Table of hypotheses for (θS , θN )

Corresponding level of protection due to S:


red = strong, pink = some, orange = none,
white = negative.
MIT18_05S14_class16_slides 380
April 18, 2017 9 / 33
Color-coded two-dimensional tables

Likelihoods (scaled to make the table readable)


θN \θS 0 0.2 0.4 0.6 0.8 1
1 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.8 0.00000 1.93428 0.18381 0.00213 0.00000 0.00000
0.6 0.00000 0.06893 0.00655 0.00008 0.00000 0.00000
0.4 0.00000 0.00035 0.00003 0.00000 0.00000 0.00000
0.2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000

Likelihoods scaled by 100000/c

p(data|θS , θN ) = c θS^2 (1 − θS )^13 θN^14 (1 − θN ).
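A short R sketch (an addition, not from the slides) that reproduces the scaled likelihood table above:

# Likelihood table for the malaria data, scaled by 100000 as in the slide.
theta <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
# rows = thetaN, columns = thetaS, matching the table's orientation
lik <- outer(theta, theta, function(tN, tS) tS^2 * (1 - tS)^13 * tN^14 * (1 - tN))
rownames(lik) <- colnames(lik) <- theta
round(100000 * lik, 5)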

MIT18_05S14_class16_slides 381
April 18, 2017 10 / 33
Color-coded two-dimensional tables

Flat prior
θN \θS 0 0.2 0.4 0.6 0.8 1 p(θN )
1 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.8 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.6 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.4 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.2 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(θS ) 1/6 1/6 1/6 1/6 1/6 1/6 1

Flat prior p(θS , θN ): each hypothesis (square) has equal probability

MIT18_05S14_class16_slides 382
April 18, 2017 11 / 33
Color-coded two-dimensional tables
Posterior to the flat prior
θN \θS 0 0.2 0.4 0.6 0.8 1 p(θN |data)
1 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.8 0.00000 0.88075 0.08370 0.00097 0.00000 0.00000 0.96542
0.6 0.00000 0.03139 0.00298 0.00003 0.00000 0.00000 0.03440
0.4 0.00000 0.00016 0.00002 0.00000 0.00000 0.00000 0.00018
0.2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
p(θS |data) 0.00000 0.91230 0.08670 0.00100 0.00000 0.00000 1.00000

Normalized posterior to the flat prior: p(θS , θN |data)

Strong protection: P(θN − θS > .5 | data) = sum of red = .88075


Some protection: P(θN > θS | data) = sum pink and red = .99995
MIT18_05S14_class16_slides 383
April 18, 2017 12 / 33
Continuous two-parameter distributions
Sometimes continuous parameters are more natural.

Malaria example (from class notes):


discrete prior table from the class notes.
Similarly colored version for the continuous parameters (θS , θN )
over range [0, 1] × [0, 1].
θN \θS 0 0.2 0.4 0.6 0.8 1
1 (0,1) (.2,1) (.4,1) (.6,1) (.8,1) (1,1)
0.8 (0,.8) (.2,.8) (.4,.8) (.6,.8) (.8,.8) (1,.8)
0.6 (0,.6) (.2,.6) (.4,.6) (.6,.6) (.8,.6) (1,.6)
0.4 (0,.4) (.2,.4) (.4,.4) (.6,.4) (.8,.4) (1,.4)
0.2 (0,.2) (.2,.2) (.4,.2) (.6,.2) (.8,.2) (1,.2)
0 (0,0) (.2,0) (.4,0) (.6,0) (.8,0) (1,0)

[Diagram: the unit square of continuous hypotheses (θS , θN ), divided into the regions θN < θS , θS < θN , and θN − θS > 0.6.]

The probabilities are given by double integrals over regions.


MIT18_05S14_class16_slides 384
April 18, 2017 13 / 33
Treating severe respiratory failure*

*Adapted from Statistics a Bayesian Perspective by Donald Berry

Two treatments for newborns with severe respiratory failure.


1. CVT: conventional therapy (hyperventilation and drugs)
2. ECMO: extracorporeal membrane oxygenation (invasive procedure)

In 1983 in Michigan:
19/19 ECMO babies survived and 0/3 CVT babies survived.
Later Harvard ran a randomized study:
28/29 ECMO babies survived and 6/10 CVT babies survived.

MIT18_05S14_class16_slides 385
April 18, 2017 14 / 33
Board question: updating two parameter priors
Michigan: 19/19 ECMO babies and 0/3 CVT babies survived.
Harvard: 28/29 ECMO babies and 6/10 CVT babies survived.
θE = probability that an ECMO baby survives
θC = probability that a CVT baby survives
Consider the values 0.125, 0.375, 0.625, 0.875 for θE and θC
1. Make the 4 × 4 prior table for a flat prior.
2. Based on the Michigan results, create a reasonable informed prior
table for analyzing the Harvard results (unnormalized is fine).
3. Make the likelihood table for the Harvard results.
4. Find the posterior table for the informed prior.
5. Using the informed posterior, compute the probability that ECMO
is better than CVT.
6. Also compute the posterior probability that θE − θC ≥ 0.6.
(The posted solutions will also show 4-6 for the flat prior.)
posted solutions will also show 4-6 for the flat prior.) 386
April 18, 2017 15 / 33
Solution
Flat prior
θE
0.125 0.375 0.625 0.875
0.125 0.0625 0.0625 0.0625 0.0625
θC 0.375 0.0625 0.0625 0.0625 0.0625
0.625 0.0625 0.0625 0.0625 0.0625
0.875 0.0625 0.0625 0.0625 0.0625

Informed prior (this is unnormalized)


θE
0.125 0.375 0.625 0.875
0.125 18 18 32 32
θC 0.375 18 18 32 32
0.625 18 18 32 32
0.875 18 18 32 32
(Rationale for the informed prior is on the next slide.)
MIT18_05S14_class16_slides 387
April 18, 2017 16 / 33
Solution continued
Since 19/19 ECMO babies survived we believe θE is probably near 1.0
That 0/3 CVT babies survived is not enough data to move from a uniform
distribution. (Or we might distribute a little more probability to larger θC .)
So for θE we split 64% of probability in the two higher values and 36% for
the lower two. Our prior is the same for each value of θC .

Likelihood
Entries in the likelihood table are θE^28 (1 − θE ) θC^6 (1 − θC )^4 . We don’t bother
including the binomial coefficients since they are the same for every entry.
θE
0.125 0.375 0.625 0.875
0.125 1.012e-31 1.653e-18 1.615e-12 6.647e-09
θC 0.375 1.920e-29 3.137e-16 3.065e-10 1.261e-06
0.625 5.332e-29 8.713e-16 8.513e-10 3.504e-06
0.875 4.95e-30 8.099e-17 7.913e-11 3.257e-07
(Posteriors are on the next slides.)
April 18, 2017 17 / 33
Solution continued
Flat posterior
The posterior table is found by multiplying the prior and likelihood tables
and normalizing so that the sum of the entries is 1. We call the posterior
derived from the flat prior the flat posterior. (Of course the flat posterior
is not itself flat.)
θE
0.125 0.375 0.625 0.875
0.125 .984e-26 3.242e-13 3.167e-07 0.001
θc 0.375 .765e-24 6.152e-11 6.011e-05 0.247
0.625 1.046e-23 1.709e-10 1.670e-04 0.687
0.875 9.721e-25 1.588e-11 1.552e-05 0.0639
The boxed entries represent most of the probability where θE > θC .

All our computations were done in R. For the flat posterior:


Probability ECMO is better than CVT is
P(θE > θC | Harvard data) = 0.936
P(θE − θC ≥ 0.6 | Harvard data) = 0.001
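The following R sketch (added for illustration; it mirrors, but is not, the course's posted R code) reproduces the flat-prior numbers:

# Flat-prior posterior for the Harvard data: 28/29 ECMO and 6/10 CVT babies survived.
theta <- c(0.125, 0.375, 0.625, 0.875)
# rows = thetaC, columns = thetaE
lik  <- outer(theta, theta, function(tC, tE) tE^28 * (1 - tE) * tC^6 * (1 - tC)^4)
post <- lik / sum(lik)                     # flat prior, so just normalize the likelihood

sum(post[outer(theta, theta, function(tC, tE) tE > tC)])         # P(thetaE > thetaC), about 0.936
sum(post[outer(theta, theta, function(tC, tE) tE - tC >= 0.6)])  # P(thetaE - thetaC >= 0.6), about 0.001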
April 18, 2017 18 / 33
Solution continued

Informed posterior
θE
0.125 0.375 0.625 0.875
0.125 1.116e-26 1.823e-13 3.167e-07 0.001
θC 0.375 2.117e-24 3.460e-11 6.010e-05 0.2473
0.625 5.882e-24 9.612e-11 1.669e-04 0.6871
0.875 5.468e-25 8.935e-12 1.552e-05 0.0638

For the informed posterior:


P(θE > θC | Harvard data) = 0.936
P(θE − θC ≥ 0.6 | Harvard data) = 0.001
Note: Since both flat and informed prior gave the same answers we gain
confidence that these calculations are robust. That is, they are not too
sensitive to our exact choice of prior.
MIT18_05S14_class16_slides 390
April 18, 2017 19 / 33
Probability intervals

Example. If P(a ≤ θ ≤ b) = 0.7 then [a, b] is a 0.7 probability


interval for θ. We also call it a 70% probability interval.
Example. Between the 0.05 and 0.55 quantiles is a 0.5
probability interval. Another 50% probability interval goes from
the 0.25 to the 0.75 quantiles.
Symmetric probability intervals. A symmetric 90% probability
interval goes from the 0.05 to the 0.95 quantile.
Q-notation. Writing qp for the p quantile we have 0.5
probability intervals [q0.25 , q0.75 ] and [q0.05 , q0.55 ].
Uses. To summarize a distribution; To help build a subjective
prior.
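For a named distribution these quantile computations are one-liners in R; the distributions below are chosen just for illustration (an added sketch):

qnorm(c(0.05, 0.95))           # symmetric 90% probability interval for N(0, 1)
qbeta(c(0.25, 0.75), 2, 12)    # one 50% interval for beta(2, 12)
qbeta(c(0.05, 0.55), 2, 12)    # a different 50% interval for the same distribution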
MIT18_05S14_class16_slides 391
April 18, 2017 20 / 33
Probability intervals in Bayesian updating

We have p-probability intervals for the prior f (θ).

We have p-probability intervals for the posterior f (θ|x).

The latter tends to be smaller than the former. Thanks data!

Probability intervals are good, concise statements about our


current belief/understanding of the parameter of interest.

We can use them to help choose a good prior.

MIT18_05S14_class16_slides 392
April 18, 2017 21 / 33
Probability intervals for normal distributions

Red = 0.68, magenta = 0.9, green = 0.5


MIT18_05S14_class16_slides 393
April 18, 2017 22 / 33
Probability intervals for beta distributions

Red = 0.68, magenta = 0.9, green = 0.5


MIT18_05S14_class16_slides 394
April 18, 2017 23 / 33
Concept question

To convert an 80% probability interval to a 90% interval should you


shrink it or stretch it?

1. Shrink 2. Stretch.

answer: 2. Stretch. A bigger probability requires a bigger interval.

MIT18_05S14_class16_slides 395
April 18, 2017 24 / 33
Reading questions
The following slides contain bar graphs of last year’s responses to the
reading questions. Each bar represents one student’s estimate of their own
50% probability interval (from the 0.25 quantile to the 0.75 quantile).
Here is what we found for answers to the questions:

1. Airline deaths in 100 years: We extracted this data from a government


census table at https://www2.census.gov/library/publications/
2011/compendia/statab/131ed/2012-statab.pdf, page 676. There
were 13116 airline fatalities in the 18 years from 1990 to 2008. In the 80
years before that there were fewer people flying, but it was probably more
dangerous. Let’s assume they balance out and estimate the total number
of fatalities in 100 years as 5 × 13116 ≈ 66000.

2. Number of girls born in the world each year: I had trouble finding a
reliable source. Wiki.answers.com gave the number of 130 million births in
2005. If we take what seems to be the accepted ratio of 1.07 boys born for
every girl then 130/2.07 = 62.8 million baby girls.
girl then 130/2.07 = 62.8 million baby girls. 396
April 18, 2017 25 / 33
Reading questions continued

3. Percentage of Black or African-Americans in the U.S as of 2015.:


13.3% (https://www.census.gov/quickfacts/)

4. Number of French speakers world-wide: 72-79 million native speakers,


265 million primary + secondary speakers
(http://www2.ignatius.edu/faculty/turner/languages.htm)

5. Number of abortions in the U.S. each year: 1.2 million (http:


//www.guttmacher.org/in-the-know/characteristics.html)

MIT18_05S14_class16_slides 397
April 18, 2017 26 / 33
Subjective probability 1 (50% probability interval)

Airline deaths in 100 years


[Bar graph of students’ 50% probability intervals; the true value 66000 is marked.]

MIT18_05S14_class16_slides 398
April 18, 2017 27 / 33
Subjective probability 2 (50% probability interval)

Number of girls born in world each year


[Bar graph of students’ 50% probability intervals; the true value 63000000 is marked.]

MIT18_05S14_class16_slides 399
April 18, 2017 28 / 33
Subjective probability 3 (50% probability interval)

Percentage of African-Americans in US
[Bar graph of students’ 50% probability intervals; the true value 13% is marked; axis from 0 to 100.]

MIT18_05S14_class16_slides 400
April 18, 2017 29 / 33
Subjective probability 3 censored (50% probability interval)

Censored by changing numbers less than 1 to percentages and


ignoring numbers bigger than 100.

Percentage of African-Americans in US (censored data)


[Bar graph of the censored 50% probability intervals; the true value 13% is marked; axis from 5 to 100.]

MIT18_05S14_class16_slides 401
April 18, 2017 30 / 33
Subjective probability 4 (50% probability interval)

Number of French speakers world-wide


[Bar graph of students’ 50% probability intervals; the true values are marked: about 75000000 native speakers and 265000000 able to speak French.]

MIT18_05S14_class16_slides 402
April 18, 2017 31 / 33
Subjective probability 5 (50% probability interval)

Number of abortions in the U.S. each year


[Bar graph of students’ 50% probability intervals; the true value 1200000 is marked.]

MIT18_05S14_class16_slides 403
April 18, 2017 32 / 33
Frequentist Statistics and Hypothesis Testing
18.05 Spring 2014

http://xkcd.com/539/

Class 17 Slides with Solutions: Frequentist Methods; NHST 405


January 2, 2017 1 /25
Agenda

Introduction to the frequentist way of life.


What is a statistic?

NHST ingredients; rejection regions


Simple and composite hypotheses
z-tests, p-values

MIT18_05S14_class17_slides 406
January 2, 2017 2 /25
Frequentist school of statistics

Dominant school of statistics in the 20th century.


p-values, t-tests, χ2 -tests, confidence intervals.
Defines probability as long-term frequency in a repeatable

random experiment.

– Yes: probability a coin lands heads.
– Yes: probability a given treatment cures a certain disease.
– Yes: probability distribution for the error of a measurement.

Rejects the use of probability to quantify incomplete knowledge,


measure degree of belief in hypotheses.
– No: prior probability for the probability an unknown coin lands heads.
– No: prior probability on the efficacy of a treatment for a disease.
– No: prior probability distribution for the unknown mean of a normal
distribution.
MIT18_05S14_class17_slides 407
January 2, 2017 3 /25
The fork in the road

Probability (mathematics): everyone uses Bayes’ formula

P (H|D) = P (D|H)P (H)/P (D)

when the prior P (H) is known.

Statistics (art): the road forks.

Bayesian path: P_posterior (H|D) = P (D|H) P_prior (H)/P (D).
Bayesians require a prior, so they develop one from the best information they have.

Frequentist path: likelihood L(H; D) = P (D|H).
Without a known prior, frequentists draw inferences from just the likelihood function.
MIT18_05S14_class17_slides 408
January 2, 2017 4 /25
Disease screening redux: probability

The test is positive. Are you sick?

P (H):        P (sick) = 0.001                P (healthy) = 0.999
P (D | H):    P (pos. test | sick) = 0.99     P (pos. test | healthy) = 0.01

The prior is known so we can use Bayes’ Theorem.


P(sick | pos. test) = (0.001 · 0.99) / (0.001 · 0.99 + 0.999 · 0.01) ≈ 0.1
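The same computation in R (an added sketch):

prior <- c(sick = 0.001, healthy = 0.999)
p.pos <- c(sick = 0.99,  healthy = 0.01)     # P(positive test | hypothesis)
post  <- prior * p.pos / sum(prior * p.pos)  # Bayes' theorem
post["sick"]                                 # about 0.09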

MIT18_05S14_class17_slides 409
January 2, 2017 5 /25
Disease screening redux: statistics

The test is positive. Are you sick?

P (H):        P (sick) = ?                    P (healthy) = ?
P (D | H):    P (pos. test | sick) = 0.99     P (pos. test | healthy) = 0.01

The prior is not known.

Bayesian: use a subjective prior P(H) and Bayes’ Theorem.

Frequentist: the likelihood is all we can use: P(D | H)

MIT18_05S14_class17_slides 410
January 2, 2017 6 /25
Concept question

Each day Jane arrives X hours late to class, with X ∼ uniform(0, θ),
where θ is unknown. Jon models his initial belief about θ by a prior
pdf f (θ). After Jane arrives x hours late to the next class, Jon
computes the likelihood function f (x|θ) and the posterior pdf f (θ|x).

Which of these probability computations would the frequentist


consider valid?
1. none 5. prior and posterior
2. prior 6. prior and likelihood
3. likelihood 7. likelihood and posterior
4. posterior 8. prior, likelihood and posterior.

MIT18_05S14_class17_slides 411
January 2, 2017 7 /25
Concept answer

answer: 3. likelihood

Both the prior and posterior are probability distributions on the possible
values of the unknown parameter θ, i.e. a distribution on hypothetical
values. The frequentist does not consider them valid.

The likelihood f (x|θ) is perfectly acceptable to the frequentist. It


represents the probability of data from a repeatable experiment, i.e.
measuring how late Jane is each day. Conditioning on θ is fine. This just
fixes a model parameter θ. It doesn’t require computing probabilities of
values of θ.

MIT18_05S14_class17_slides 412
January 2, 2017 8 /25
Statistics are computed from data

Working definition. A statistic is anything that can be computed

from random data.

A statistic cannot depend on the true value of an unknown parameter.

A statistic can depend on a hypothesized value of a parameter.

Examples of point statistics


Data mean
Data maximum (or minimum)
Maximum likelihood estimate (MLE)

A statistic is random since it is computed from random data.

We can also get more complicated statistics like interval statistics.

MIT18_05S14_class17_slides 413
January 2, 2017 9 /25
Concept questions

Suppose x1 , . . . , xn is a sample from N(µ, σ 2 ), where µ and σ are


unknown.

Is each of the following a statistic?

1. Yes 2. No

1. The median of x1 , . . . , xn .

2. The interval from the 0.25 quantile to the 0.75 quantile


of N(µ, σ 2 ).
3. The standardized mean (x̄ − µ)/(σ/√n).

4. The set of sample values less than 1 unit from x̄.


MIT18_05S14_class17_slides 414
January 2, 2017 10 /25
Concept answers

1. Yes. The median only depends on the data x1 , . . . , xn .

2. No. This interval depends only on the distribution parameters µ and σ.


It does not consider the data at all.

3. No. this depends on the values of the unknown parameters µ and σ.

4. Yes. x̄ depends only on the data, so the set of values within 1 of x̄ can
all be found by working with the data.

MIT18_05S14_class17_slides 415
January 2, 2017 11 /25
Cards and NHST

MIT18_05S14_class17_slides 416
January 2, 2017 12 /25
NHST ingredients

Null hypothesis: H0
Alternative hypothesis: HA
Test statistic: x
Rejection region: reject H0 in favor of HA if x is in this region
f (x|H0 )

x
x2 -3 0 x1 3
reject H0 don’t reject H0 reject H0

p(x|H0 ) or f (x|H0 ): null distribution


MIT18_05S14_class17_slides 417
January 2, 2017 13 /25
Choosing rejection regions

Coin with probability of heads θ.

Test statistic x = the number of heads in 10 tosses.

H0 : ‘the coin is fair’, i.e. θ = 0.5

HA : ‘the coin is biased’, i.e. θ ≠ 0.5

Two strategies:
1. Choose rejection region then compute significance level.

2. Choose significance level then determine rejection region.

***** Everything is computed assuming H0 *****


MIT18_05S14_class17_slides 418
January 2, 2017 14 /25
Table question
Suppose we have the coin from the previous slide.

1. The rejection region is bordered in red, what’s the significance


level?
[Bar plot of the null distribution p(x | H0 ) for x = 0, 1, . . . , 10.]

x 0 1 2 3 4 5 6 7 8 9 10
p(x|H0 ) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001

2. Given significance level α = .05 find a two-sided rejection region.

January 2, 2017 15 /25


Solution

1. α = 0.11. The boxed rejection region is the two-sided region {0, 1, 2, 8, 9, 10};
its total probability under H0 is 2(.001 + .010 + .044) = 0.11.

x 0 1 2 3 4 5 6 7 8 9 10
p(x|H0 ) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001

2. α = 0.05. Take the two-sided rejection region {0, 1, 9, 10}; its probability under H0
is 2(.001 + .010) = 0.022 ≤ 0.05. (Including 2 and 8 would raise the significance to 0.11, above 0.05.)

x 0 1 2 3 4 5 6 7 8 9 10
p(x|H0 ) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001

MIT18_05S14_class17_slides 420
January 2, 2017 16 /25
Concept question
The null and alternate pdfs are shown on the following plot
f (x|HA ) f (x|H0 )
R3
R2
R1 R4

x
reject H0 region . non-reject H0 region

The significance level of the test is given by the area of which region?

1. R1 2. R2 3. R3 4. R4
5. R1 + R2 6. R2 + R3 7. R2 + R3 + R4 .

answer: 6. R2 + R3 . This is the area under the pdf for H0 above the
rejection region.
MIT18_05S14_class17_slides 421
January 2, 2017 17 /25
z-tests, p-values

Suppose we have independent normal Data: x1 , . . . , xn ; with


unknown mean µ, known σ

Hypotheses: H0 : xi ∼ N(µ0 , σ 2 )
HA : Two-sided: µ ≠ µ0 , or one-sided: µ > µ0
z-value: standardized x: z = (x − µ0 )/(σ/√n)
Test statistic: z
Null distribution: Assuming H0 : z ∼ N(0, 1).
p-values: Right-sided p-value: p = P(Z > z | H0 )
(Two-sided p-value: p = P(|Z | > z | H0 ))
Significance level: For p ≤ α we reject H0 in favor of HA .
Note: Could have used x as test statistic and N(µ0 , σ 2 ) as the null
distribution.
January 2, 2017 18 /25
Visualization

Data follows a normal distribution N(µ, 15²) where µ is unknown.

H0 : µ = 100

HA : µ > 100 (one-sided)

Collect 9 data points: x̄ = 112. So, z = (112 − 100)/(15/3) = 2.4.
Can we reject H0 at significance level 0.05?
f (z|H0 ) ∼ N(0, 1)

z0.05 = 1.64
α = pink + red = 0.05
p = red = 0.008

z
z0.05 2.4
non-reject H0 reject H0
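In R (an added sketch; the numbers are the ones from this slide):

xbar <- 112; mu0 <- 100; sigma <- 15; n <- 9
z <- (xbar - mu0) / (sigma / sqrt(n))   # 2.4
1 - pnorm(z)                            # one-sided p-value, about 0.008
qnorm(0.95)                             # critical value z_0.05 = 1.64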
MIT18_05S14_class17_slides 423
January 2, 2017 19 /25
Board question

H0 : data follows a N(5, 10²)
HA : data follows a N(µ, 10²) where µ ≠ 5.
Test statistic: z = standardized x.
Data: 64 data points with x = 6.25.
Significance level set to α = 0.05.

(i) Find the rejection region; draw a picture.


(ii) Find the z-value; add it to your picture.
(iii) Decide whether or not to reject H0 in favor of HA .
(iv) Find the p-value for this data; add to your picture.
(v) What’s the connection between the answers to (ii), (iii) and (iv).
MIT18_05S14_class17_slides 424
January 2, 2017 20 /25
Solution

The null distribution f (z | H0 ) ∼ N(0, 1)


(i) The rejection region is |z| > 1.96, i.e. 1.96 or more standard deviations
from the mean.
(ii) Standardizing: z = (x̄ − 5)/(5/4) = 1.25/1.25 = 1.
(iii) Do not reject since z is not in the rejection region
(iv) Use a two-sided p-value p = P(|Z | > 1) = .32

f (z|H0 ) ∼ N(0, 1)
z0.025 = 1.96
z0.975 = −1.96
α = red = 0.05

z
−1.96 0 z=1 1.96
reject H0 non-reject H0 reject H0
MIT18_05S14_class17_slides 425
January 2, 2017 21 /25
Solution continued

(v) The z-value is not in the rejection region tells us exactly the same
thing as the p-value being greater than the significance, i.e. don’t
reject the null hypothesis H0 .

MIT18_05S14_class17_slides 426
January 2, 2017 22 /25
Board question
Two coins: probability of heads is 0.5 for C1 ; and 0.6 for C2 .
We pick one at random, flip it 8 times and get 6 heads.

1. H0 = ’The coin is C1 ’ HA = ’The coin is C2 ’


Do you reject H0 at the significance level α = 0.05?

2. H0 = ’The coin is C2 ’ HA = ’The coin is C1 ’


Do you reject H0 at the significance level α = 0.05?

3. Do your answers to (1) and (2) seem paradoxical?


Here are binomial(8,θ) tables for θ = 0.5 and 0.6.
k 0 1 2 3 4 5 6 7 8
p(k|θ = 0.5) .004 .031 .109 .219 .273 .219 .109 .031 .004
p(k|θ = 0.6) .001 .008 .041 .124 .232 .279 .209 .090 .017
MIT18_05S14_class17_slides 427
January 2, 2017 23 /25
Solution
1. Since 0.6 > 0.5 we use a right-sided rejection region.

Under H0 the probability of heads is 0.5. Using the table we find a one

sided rejection region {7, 8}. That is we will reject H0 in favor of HA only

if we get 7 or 8 heads in 8 tosses.

Since the value of our data x = 6 is not in our rejection region we do not

reject H0 .

2. Since 0.6 > 0.5 we use a left-sided rejection region.

Now under H0 the probability of heads is 0.6. Using the table we find a

one sided rejection region {0, 1, 2}. That is we will reject H0 in favor of

HA only if we get 0, 1 or 2 heads in 8 tosses.

Since the value of our data x = 6 is not in our rejection region we do not

reject H0 .

3. The fact that we don’t reject C1 in favor of C2 or C2 in favor of C1

reflects the asymmetry in NHST. The null hypothesis is the cautious

choice. That is, we only reject H0 if the data is extremely unlikely when

we assume H0 . This is not the case for either C1 or C2 .
January 2, 2017 24 /25
Null Hypothesis Significance Testing
p-values, significance level, power, t-tests
18.05 Spring 2014

Class 18 Slides with Solutions: NHST II: Significance Level, Power, T-tests 430


January 1, 2017 1 /28
Understand this figure

f (x|H0 )

x
reject H0 don’t reject H0 reject H0

x = test statistic

f (x|H0 ) = pdf of null distribution = green curve

Rejection region is a portion of the x-axis.

Significance = probability over the rejection region = red 431


MIT18_05S14_class18_slides area.

January 1, 2017 2 /28


Simple and composite hypotheses

Simple hypothesis: the sampling distribution is fully specified.


Usually the parameter of interest has a specific value.

Composite hypotheses: the sampling distribution is not fully


specified. Usually the parameter of interest has a range of values.

Example. A coin has probability θ of heads. Toss it 30 times and let


x be the number of heads.
(i) H: θ = 0.4 is simple. x ∼ binomial(30, 0.4).
(ii) H: θ > 0.4 is composite. x ∼ binomial(30, θ) depends on which
value of θ is chosen.

MIT18_05S14_class18_slides 432
January 1, 2017 3 /28
Extreme data and p-values
Hypotheses: H0 , HA .

Test statistic: value: x, random variable X .

Null distribution: f (x|H0 ) (assumes the null hypothesis is true)

Sides: HA determines if the rejection region is one or two-sided.

Rejection region/Significance: P(x in rejection region | H0 ) = α.

The p-value is a computational tool to check if the test statistic is in


the rejection region. It is also a measure of the evidence for rejecting
H0 .
p-value: P(data at least as extreme as x | H0 )

Data at least as extreme: Determined by the sided-ness of the


rejection region.
January 1, 2017 4 /28
Extreme data and p-values
Example. Suppose we have the right-sided rejection region shown
below. Also suppose we see data with test statistic x = 4.2. Should
we reject H0 ?
f (x|H0 )

x
cα 4.2
don’t reject H0 reject H0

answer: The test statistic is in the rejection region, so reject H0 .

Alternatively: blue area < red area

Significance: α = P(x in rejection region | H0 ) = red area.

p-value: p = P(data at least as extreme as x | H0 ) = blue area.

Since, p < α we reject H0 .

MIT18_05S14_class18_slides 434
January 1, 2017 5 /28
Extreme data and p-values

Example. Now suppose x = 2.1 as shown. Should we reject H0 ?


f (x|H0 )

x
2.1 cα
don’t reject H0 reject H0

answer: The test statistic is not in the rejection region, so don’t


reject H0 .

Alternatively: blue area > red area

Significance: α = P(x in rejection region | H0 ) = red area.

p-value: p = P(data at least as extreme as x | H0 ) = blue area.

Since, p > α we don’t reject H0 .

MIT18_05S14_class18_slides 435
January 1, 2017 6 /28
Critical values

Critical values:

The boundary of the rejection region are called critical values.

Critical values are labeled by the probability to their right.

They are complementary to quantiles: c0.1 = q0.9

Example: for a standard normal c0.025 = 1.96 and c0.975 = −1.96.

In R, for a standard normal c0.025 = qnorm(0.975).

MIT18_05S14_class18_slides 436
January 1, 2017 7 /28
Two-sided p-values
These are trickier: what does ‘at least as extreme’ mean in this case?
Remember the p-value is a trick for deciding if the test statistic is in
the region.
If the significance (rejection) probability is split evenly between the
left and right tails then

p = 2min(left tail prob. of x, right tail prob. of x)

f (x|H0 )

x
c1−α/2 x cα/2
reject H0 don’t reject H0 reject H0

x is outside the rejection region, so p > α: do not reject H0 .
January 1, 2017 8 /28
Concept question
1. You collect data from an experiment and do a left-sided z-test
with significance 0.1. You find the z-value is 1.8
(i) Which of the following computes the critical value for the
rejection region.
(a) pnorm(0.1, 0, 1) (b) pnorm(0.9, 0, 1)
(c) pnorm(0.95, 0, 1) (d) pnorm(1.8, 0, 1)
(e) 1 - pnorm(1.8, 0, 1) (f) qnorm(0.05, 0, 1)
(g) qnorm(0.1, 0, 1) (h) qnorm(0.9, 0, 1)
(i) qnorm(0.95, 0, 1)

(ii) Which of the above computes the p-value for this experiment.

(iii) Should you reject the null hypothesis.


(a) Yes (b) No
answer: (i) g. (ii) d. (iii) No. (Draw a picture!)
MIT18_05S14_class18_slides 438
January 1, 2017 9 /28
Error, significance level and power

True state of nature


H0 HA
Our Reject H0 Type I error correct decision
decision Don’t reject H0 correct decision Type II error

Significance level = P(type I error)

= probability we incorrectly reject H0

= P(test statistic in rejection region | H0 )

= P(false positive)

Power = probability we correctly reject H0

= P(test statistic in rejection region | HA )

= 1 − P(type II error)

= P(true positive)

• HA determines the power of the test.


• Significance and power are both probabilities of the rejection region.
• Want significance level near 0 and power near 1.
January 1, 2017 10 /28
Table question: significance level and power
The rejection region is boxed in red. The corresponding probabilities
for different hypotheses are shaded below it.
x 0 1 2 3 4 5 6 7 8 9 10
H0 : p(x|θ = 0.5) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001
HA : p(x|θ = 0.6) .000 .002 .011 .042 .111 .201 .251 .215 .121 .040 .006
HA : p(x|θ = 0.7) .000 .0001 .001 .009 .037 .103 .200 .267 .233 .121 .028

1. Find the significance level of the test.


2. Find the power of the test for each of the two alternative
hypotheses.

answer: (The boxed rejection region is {0, 1, 2, 8, 9, 10}.)
1. Significance level = P(x in rejection region | H0 ) = 0.11
2. θ = 0.6: power = P(x in rejection region | HA ) = 0.18
θ = 0.7: power = P(x in rejection region | HA ) = 0.383
January 1, 2017 11 /28
Concept question

1. The power of the test in the graph is given by the area of

f (x|HA ) f (x|H0 )
R3
R2
R1 R4

x
reject H0 region . non-reject H0 region

(a) R1 (b) R2 (c) R1 + R2 (d) R1 + R2 + R3

answer: (c) R1 + R2 . Power = P(rejection region | HA ) = area R1 + R2 .

MIT18_05S14_class18_slides 441
January 1, 2017 12 /28
Concept question

2. Which test has higher power?

f (x|HA ) f (x|H0 )

x
reject H0 region . do not reject H0 region

f (x|HA ) f (x|H0 )

x
reject H0 region . do not reject H0 region

(a) Top graph (b) Bottom graph
January 1, 2017 13 /28
Solution

answer: (a) The top graph.

Power = P(x in rejection region | HA ). In the top graph almost all the
probability of HA is in the rejection region, so the power is close to 1.

MIT18_05S14_class18_slides 443
January 1, 2017 14 /28
Discussion question

The null distribution for test statistic x is N(4, 82 ). The rejection


region is {x ≥ 20}.

What is the significance level and power of this test?

answer: 20 is two standard deviations above the mean of 4. Thus,

P(x ≥ 20|H0 ) ≈ 0.025

This was a trick question: we can’t compute the power without an


alternative distribution.

MIT18_05S14_class18_slides 444
January 1, 2017 15 /28
One-sample t-test

Data: we assume normal data with both µ and σ unknown:


x1 , x2 , . . . , xn ∼ N(µ, σ 2 ).
Null hypothesis: µ = µ0 for some specific value µ0 .
Test statistic:
t = (x − µ0 ) / (s/√n)
where
s² = (1/(n − 1)) Σ_{i=1}^n (xi − x)².
Here t is the Studentized mean and s 2 is the sample variance.

Null distribution: f (t | H0 ) is the pdf of T ∼ t(n − 1),

the t distribution with n − 1 degrees of freedom.

Two-sided p-value: p = P(|T | > |t|).

R command: pt(x,n-1) is the cdf of t(n − 1).

http://mathlets.org/mathlets/t-distribution/
January 1, 2017 16 /28
Board question: z and one-sample t-test
For both problems use significance level α = 0.05.

Assume the data 2, 4, 4, 10 is drawn from a N(µ, σ 2 ).

Suppose H0 : µ = 0; HA : µ ≠ 0.

1. Is the test one or two-sided? If one-sided, which side?

2. Assume σ 2 = 16 is known and test H0 against HA .

3. Now assume σ 2 is unknown and test H0 against HA .

Answer on next slide.


MIT18_05S14_class18_slides 446
January 1, 2017 17 /28
Solution
We have x̄ = 5, s² = (9 + 1 + 1 + 25)/3 = 12

1. Two-sided. A standardized sample mean above or below 0 is consistent


with HA .
2. We’ll use the standardized mean z for the test statistic (we could also
use x̄). The null distribution for z is N(0, 1). This is a two-sided test so
the rejection region is

(z ≤ z0.975 or z ≥ z0.025 ) = (−∞, −1.96] ∪ [1.96, ∞)

Since z = (x̄ − 0)/(4/2) = 2.5 is in the rejection region we reject H0 in

favor of HA .

Repeating the test using a p-value:

p = P(|z| ≥ 2.5 | H0 ) = 0.012

Since p < α we reject H0 in favor of HA .


Continued on next slide.
January 1, 2017 18 /28
Solution continued

3. We’ll use the Studentized mean t = (x̄ − µ)/(s/√n) for the test statistic. The null
distribution for t is t(3). For the data we have t = 5/√3. This is a
two-sided test so the p-value is

p = P(|t| ≥ 5/√3 | H0 ) = 0.06318

Since p > α we do not reject H0 .
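An R check of this computation (an added sketch; t.test is the built-in shortcut):

x <- c(2, 4, 4, 10)
tstat <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))   # 5/sqrt(3), about 2.89
2 * (1 - pt(tstat, df = length(x) - 1))              # two-sided p-value, about 0.063
# t.test(x, mu = 0) gives the same numbers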

MIT18_05S14_class18_slides 448
January 1, 2017 19 /28
Two-sample t-test: equal variances
Data: we assume normal data with µx , µy and (same) σ unknown:
x1 , . . . , xn ∼ N(µx , σ 2 ), y1 , . . . , ym ∼ N(µy , σ 2 )

Null hypothesis H0 : µx = µy .
Pooled variance: sp² = [((n − 1)sx² + (m − 1)sy²)/(n + m − 2)] · (1/n + 1/m).

Test statistic: t = (x̄ − ȳ)/sp
Null distribution: f (t | H0 ) is the pdf of T ∼ t(n + m − 2)

In general (so we can compute power) we have


[(x̄ − ȳ) − (µx − µy )] / sp ∼ t(n + m − 2)
Note: there are more general formulas for unequal variances.
January 1, 2017 20 /28
Board question: two-sample t-test

Real data from 1408 women admitted to a maternity hospital for (i)
medical reasons or through (ii) unbooked emergency admission. The
duration of pregnancy is measured in complete weeks from the
beginning of the last menstrual period.
Medical: 775 obs. with x̄ = 39.08 and s 2 = 7.77.
Emergency: 633 obs. with x̄ = 39.60 and s 2 = 4.95

1. Set up and run a two-sample t-test to investigate whether the


duration differs for the two groups.
2. What assumptions did you make?

MIT18_05S14_class18_slides 450
January 1, 2017 21 /28
Solution

The pooled variance for this data is

 
sp² = [(774(7.77) + 632(4.95))/1406] · (1/775 + 1/633) = 0.0187

The t statistic for the null distribution is

(x̄ − ȳ)/sp = −3.8064
Rather than compute the two-sided p-value using 2*tcdf(-3.8064,1406)
we simply note that with 1406 degrees of freedom the t distribution is
essentially standard normal and 3.8064 is almost 4 standard deviations. So

P(|t| ≥ 3.8064) = P(|z| ≥ 3.8064)

which is very small, much smaller than α = .05 or α = .01. Therefore we


reject the null hypothesis in favor of the alternative that there is a
difference in the mean durations.
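An R sketch of the computation (an addition; it assumes the equal-variance formula from the previous slide):

n <- 775; xbar <- 39.08; sx2 <- 7.77    # medical admissions
m <- 633; ybar <- 39.60; sy2 <- 4.95    # emergency admissions

sp2   <- ((n - 1) * sx2 + (m - 1) * sy2) / (n + m - 2) * (1/n + 1/m)   # about 0.0187
tstat <- (xbar - ybar) / sqrt(sp2)                                     # about -3.81
2 * pt(-abs(tstat), df = n + m - 2)                                    # two-sided p-value, about 0.00015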
Continued on next slide.
January 1, 2017 22 /28
Solution continued

2. We assumed the data was normal and that the two groups had equal
variances. Given the big difference in the sample variances this assumption
might not be warranted.

Note: there are significance tests to see if the data is normal and to see if
the two groups have the same variance.

MIT18_05S14_class18_slides 452
January 1, 2017 23 /28
Table discussion: Type I errors Q1

1. Suppose a journal will only publish results that are statistically


significant at the 0.05 level. What percentage of the papers it
publishes contain type I errors?

answer: With the information given we can’t know this. The


percentage could be anywhere from 0 to 100! –See the next
two questions.

MIT18_05S14_class18_slides 453
January 1, 2017 25 /28
Table discussion: Type I errors Q2
2. Jerry desperately wants to cure diseases but he is terrible at
designing effective treatments. He is however a careful scientist and
statistician, so he randomly divides his patients into control and
treatment groups. The control group gets a placebo and the
treatment group gets the experimental treatment. His null hypothesis
H0 is that the treatment is no better than the placebo. He uses a
significance level of α = 0.05. If his p-value is less than α he publishes
a paper claiming the treatment is significantly better than a placebo.
(a) Since his treatments are never, in fact, effective what percentage
of his experiments result in published papers?
(b) What percentage of his published papers contain type I errors,
i.e. describe treatments that are no better than placebo?
answer: (a) Since in all of his experiments H0 is true, roughly 5%, i.e. the
significance level, of his experiments will have p < 0.05 and be published.
(b) Since he’s always wrong, all of his published papers contain type I
errors.
January 1, 2017 26 /28
Table discussions: Type I errors: Q3
3. Efrat is a genius at designing treatments, so all of her proposed
treatments are effective. She’s also a careful scientist and statistician
so she too runs double-blind, placebo controlled, randomized studies.
Her null hypothesis is always that the new treatment is no better than
the placebo. She also uses a significance level of α = 0.05 and
publishes a paper if p < α.
(a) How could you determine what percentage of her experiments
result in publications?
(b) What percentage of her published papers contain type I errors,
i.e. describe treatments that are no better than placebo?
answer: 3. (a)The percentage that get published depends on the power
of her treatments. If they are only a tiny bit more effective than placebo
then roughly 5% of her experiments will yield a publication. If they are a
lot more effective than placebo then as many as 100% could be published.
(b) None of her published papers contain type I errors.
January 1, 2017 27 /28
Null Hypothesis Significance Testing

Gallery of Tests

18.05 Spring 2014

Class 19 Slides with Solutions: NHST III: Gallery of Tests 457


January 1, 2017 1 /22
Discussion of Studio 8 and simulation

What is a simulation?
– Run an experiment with pseudo-random data instead of

real-world real random data.

– By doing this many times we can estimate the statistics for the
experiment.
Why do a simulation?
– In the real world we are not omniscient.
– In the real world we don’t have infinite resources.
What was the point of Studio 8?
– To simulate some simple significance tests and compare various
frequencies.
– Simulated P(reject|H0 ) ≈ α
– Simulated P(reject|HA ) ≈ power
– P(H0 | reject) can be anything depending on the (usually)
unknown prior
January 1, 2017 2 /22
Concept question

We run a two-sample t-test for equal means, with α = 0.05, and


obtain a p-value of 0.04. What are the odds that the two samples are
drawn from distributions with the same mean?

(a) 19/1 (b) 1/19 (c) 1/20 (d) 1/24 (e) unknown

answer: (e) unknown. Frequentist methods only give probabilities for data
under an assumed hypothesis. They do not give probabilities or odds for
hypotheses. So we don’t know the odds for distribution means

MIT18_05S14_class19_slides 459
January 1, 2017 3 /22
General pattern of NHST
You are interested in whether to reject H0 in favor of HA .

Design:
Design experiment to collect data relevant to hypotheses.

Choose test statistic x with known null distribution f (x | H0 ).

Choose the significance level α and find the rejection region.

For a simple alternative HA , use f (x | HA ) to compute the power.

Alternatively, you can choose both the significance level and the

power, and then compute the necessary sample size.

Implementation:
Run the experiment to collect data.
Compute the statistic x and the corresponding p-value.
If p < α, reject H0 .
MIT18_05S14_class19_slides 460
January 1, 2017 4 /22
Chi-square test for homogeneity

In this setting homogeneity means that the data sets are all drawn
from the same distribution.

Three treatments for a disease are compared in a clinical trial,


yielding the following data:

Treatment 1 Treatment 2 Treatment 3


Cured 50 30 12
Not cured 100 80 18

Use a chi-square test to compare the cure rates for the three
treatments, i.e. to test if all three cure rates are the same.

MIT18_05S14_class19_slides 461
January 1, 2017 5 /22
Solution

H0 = all three treatments have the same cure rate.


HA = the three treatments have different cure rates.
Expected counts
Under H0 the MLE for the cure rate is

(total cured)/(total treated) = 92/290 = 0.317 .

Assuming H0 , the expected number cured for each treatment is

the number treated times 0.317.

This gives the following table of observed and expected counts

(observed in black, expected in blue).

We include the marginal values (in red). These are all needed to

compute the expected counts.

Treatment 1 Treatment 2 Treatment 3


Cured 50, 47.6 30, 34.9 12, 9.5 92
Not cured 100, 102.4 80, 75.1 18, 20.5 198
150 110 30 290
continued
January 1, 2017 6 /22
Solution continued

Likelihood ratio statistic: G = 2 Σ Oi ln(Oi /Ei ) = 2.12

Pearson’s chi-square statistic: X² = Σ (Oi − Ei )²/Ei = 2.13
Degrees of freedom
Formula: Test for homogeneity df = (2 − 1)(3 − 1) = 2.
Counting: The marginal totals are fixed because they are needed
to compute the expected counts. So we can freely put values in 2
of the cells and then all the others are determined: degrees of
freedom = 2.
p-value
p = 1 - pchisq(2.12, 2) = 0.346

The data does not support rejecting H0 . We do not conclude that the
treatments have differing efficacy.
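An R sketch of the computation (an addition; chisq.test(O) would give the Pearson statistic directly):

O <- rbind(cured = c(50, 30, 12), not.cured = c(100, 80, 18))
E <- outer(rowSums(O), colSums(O)) / sum(O)     # expected counts under H0

G  <- 2 * sum(O * log(O / E))                   # about 2.12
X2 <- sum((O - E)^2 / E)                        # about 2.13
1 - pchisq(G, df = (2 - 1) * (3 - 1))           # p-value, about 0.35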
MIT18_05S14_class19_slides 463
January 1, 2017 7 /22
Board question: Khan’s restaurant

Sal is thinking of buying a restaurant and asks about the distribution


of lunch customers. The owner provides row 1 below. Sal records the
data in row 2 himself one week.
M T W R F S
Owner’s distribution .1 .1 .15 .2 .3 .15
Observed # of cust. 30 14 34 45 57 20

Run a chi-square goodness-of-fit test on the null hypotheses:

H0 : the owner’s distribution is correct.

HA : the owner’s distribution is not correct.

Compute both G and X 2

MIT18_05S14_class19_slides 464
January 1, 2017 8 /22
Solution

The total number of observed customers is 200.

The expected counts (under H0 ) are 20 20 30 40 60 30

G = 2 Σ Oi log(Oi /Ei ) = 11.39

X² = Σ (Oi − Ei )²/Ei = 11.44
df = 6 − 1 = 5 (6 cells, compute 1 value –the total count– from the data)

p = 1-pchisq(11.39,5) = 0.044.
So, at a significance level of 0.05 we reject the null hypothesis in favor of
the alternative that the owner’s distribution is wrong.
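An R sketch (an addition) of the same computation:

O  <- c(30, 14, 34, 45, 57, 20)
p0 <- c(0.1, 0.1, 0.15, 0.2, 0.3, 0.15)
E  <- sum(O) * p0                        # expected counts: 20 20 30 40 60 30

G  <- 2 * sum(O * log(O / E))            # about 11.39
X2 <- sum((O - E)^2 / E)                 # about 11.44
1 - pchisq(G, df = length(O) - 1)        # p-value, about 0.044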

MIT18_05S14_class19_slides 465
January 1, 2017 9 /22
Board question: genetic linkage
In 1905, William Bateson, Edith Saunders, and Reginald Punnett

were examining flower color and pollen shape in sweet pea plants by

performing crosses similar to those carried out by Gregor Mendel.

Purple flowers (P) is dominant over red flowers (p).

Long seeds (L) is dominant over round seeds (l).

F0: PPLL x ppll (initial cross)

F1: PpLl x PpLl (all second generation plants were PpLl)

F2: 2132 plants (third generation)

H0 = independent assortment: color and shape are independent.


purple, long purple, round red, long red, round
Expected ? ? ? ?
Observed 1528 106 117 381
Determine the expected counts for F2 under H0 and find the p-value
for a Pearson Chi-square test. Explain your findings biologically.
January 1, 2017 10 /22
Solution

Since every F1 generation flower has genotype Pp we’d expect F2 to split


1/4, 1/2, 1/4 between PP, Pp, pp. For phenotype we expect F2 to have
3/4 purple and 1/4 red flowers. Similarly for LL, Ll, ll. Assuming H0 that
color and shape are independent we’d expect the following probabilities for
F2.
Genotype:
        LL     Ll     ll
PP      1/16   1/8    1/16    1/4
Pp      1/8    1/4    1/8     1/2
pp      1/16   1/8    1/16    1/4
        1/4    1/2    1/4     1

Phenotype:
          Long   Round
Purple    9/16   3/16    3/4
Red       3/16   1/16    1/4
          3/4    1/4     1
Using the total of 2132 plants in F2, the expected counts come from the
phenotype table:
purple, long purple, round red, long red, round
Expected 1199 400 400 133
Observed 1528 106 117 381
January 1, 2017 11 /22
Continued

Using R we compute: G = 972.0, X 2 = 966.6.


The degrees of freedom is 3 (4 cells - 1 cell needed to make the total work
out). The p-values for both statistics is effectively 0. With such a small
p-value we reject H0 in favor of the alternative that the genes are not
independent.

MIT18_05S14_class19_slides 468
January 1, 2017 12 /22
F -distribution

Notation: Fa,b , a and b degrees of freedom


Derived from normal data
Range: [0, ∞)

Plot of F distributions
[Plot of the pdfs of F(3, 4), F(10, 15), and F(30, 15) for x in [0, 10].]
January 1, 2017 13 /22
F -test = one-way ANOVA
Like t-test but for n groups of data with m data points each.
yi,j ∼ N(µi , σ 2 ), yi,j = j th point in ith group

Null-hypothesis is that means are all equal: µ1 = · · · = µn


Test statistic is MSB/MSW, where:

MSB = between group variance = (m/(n − 1)) Σ (ȳi − ȳ)²
MSW = within group variance = sample mean of s1², . . . , sn²

Idea: If the µi are equal, this ratio should be near 1.
Null distribution: MSB/MSW ∼ F(n − 1, n(m − 1)), the F distribution with n − 1 and n(m − 1) d.o.f.
Note: the formulas generalize easily to unequal group sizes:
http://en.wikipedia.org/wiki/F-test
January 1, 2017 14 /22
Board question

The table shows recovery time in days for three medical treatments.
1. Set up and run an F-test testing if the average recovery time is the
same for all three treatments.
2. Based on the test, what might you conclude about the treatments?

T1 T2 T3
6 8 13
8 12 9
4 9 11
5 11 8
3 6 7
4 8 12

For α = 0.05, the critical value of F2,15 is 3.68.


MIT18_05S14_class19_slides 471
January 1, 2017 15 /22
Solution

H0 is that the means of the 3 treatments are the same. HA is that they
are not.
Our test statistic w is computed following the procedure from a previous
slide. We get that the test statistic w is approximately 9.25. The p-value
is approximately 0.0024. We reject H0 in favor of the hypothesis that the
means of three treatments are not the same.
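An R sketch (an addition) following the MSB/MSW formulas from the earlier slide; aov() is the built-in alternative:

T1 <- c(6, 8, 4, 5, 3, 4); T2 <- c(8, 12, 9, 11, 6, 8); T3 <- c(13, 9, 11, 8, 7, 12)
m <- 6; n <- 3                                    # m points in each of n groups

MSB <- m * var(c(mean(T1), mean(T2), mean(T3)))   # between-group variance, 42
MSW <- mean(c(var(T1), var(T2), var(T3)))         # within-group variance, about 4.53
Fstat <- MSB / MSW                                # about 9.26
1 - pf(Fstat, df1 = n - 1, df2 = n * (m - 1))     # p-value, about 0.0024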

MIT18_05S14_class19_slides 472
January 1, 2017 16 /22
Concept question: multiple-testing

1. Suppose we have 6 treatments and want to know if the average


recovery time is the same for all of them. If we compare two at a
time, how many two-sample t-tests do we need to run.
(a) 1 (b) 2 (c) 6 (d) 15 (e) 30

2. Suppose we use the significance level 0.05 for each of the 15 tests.
Assuming the null hypothesis, what is the probability that we reject at
least one of the 15 null hypotheses?
(a) Less than 0.05 (b) 0.05 (c) 0.10 (d) Greater than 0.25

Discussion: Recall that there is an F -test that tests if all the means
are the same. What are the trade-offs of using the F -test rather than
many two-sample t-tests?
answer: Solution on next slide.
MIT18_05S14_class19_slides 473
January 1, 2017 17 /22
Solution

answer: 1. 6 choose 2 = 15.


2. answer: (d) Greater than 0.25.
Under H0 the probability of rejecting for any given pair is 0.05. Because
the tests aren’t independent, i.e. if the group1-group2 and group2-group3
comparisons fail to reject H0 , then the probability increases that the
group1-group3 comparison will also fail to reject.
We can say that the following 3 comparisons: group1-group2,
group3-group4, group5-group6 are independent. The number of rejections
among these three follows a binom(3, 0.05) distribution. The probablity
the number is greater than 0 is 1 − (0.95)3 ≈ 0.14.
Even though the other pairwise tests are not independent they do increase
the probability of rejection. In simulations of this with normal data the
false rejection rate was about 0.36.
MIT18_05S14_class19_slides 474
January 1, 2017 18 /22
Board question: chi-square for independence

(From Rice, Mathematical Statistics and Data Analysis, 2nd ed. p.489)

Consider the following contingency table of counts

Education Married once Married multiple times Total


College 550 61 611
No college 681 144 825
Total 1231 205 1436

Use a chi-square test with significance level 0.01 to test the hypothesis
that the number of marriages and education level are independent.

MIT18_05S14_class19_slides 475
January 1, 2017 19 /22
Solution

The null hypothesis is that the cell probabilities are the product of the
marginal probabilities. Assuming the null hypothesis we estimate the
marginal probabilities in red and multiply them to get the cell probabilities
in blue.
Education Married once Married multiple times Total
College 0.365 0.061 611/1436
No college 0.492 0.082 825/1436
Total 1231/1436 205/1436 1
We then get expected counts by multiplying the cell probabilities by the
total number of women surveyed (1436). The table shows the observed,
expected counts:
Education Married once Married multiple times
College 550, 523.8 61, 87.2
No college 681, 707.2 144, 117.8
MIT18_05S14_class19_slides 476
January 1, 2017 20 /22
Solution continued

We then have

G = 16.55 and X 2 = 16.01

The number of degrees of freedom is (2 − 1)(2 − 1) = 1. We could count


this: we needed the marginal probabilities to compute the expected
counts. Now setting any one of the cell counts determines all the rest
because they need to be consistent with the marginal probabilities. We get

p = 1-pchisq(16.55,1) = 0.000047

Therefore we reject the null hypothesis in favor of the alternate hypothesis


that number of marriages and education level are not independent

MIT18_05S14_class19_slides 477
January 1, 2017 21 /22
Comparison of Bayesian and Frequentist Inference
18.05 Spring 2014

• First discuss last class 19 board question,

Class 20 Slides with Solutions: Comparison of Bayesian and Frequentist Inference 479


January 1, 2017 1 /10
Compare
Bayesian inference
Uses priors
Logically impeccable
Probabilities can be interpreted
Prior is subjective
Frequentist inference
No prior
Objective –everyone gets the same answer
Logically complex
Conditional probability of error is often misinterpreted as total
probability of error
Requires complete description of experimental protocol and data
analysis protocol before starting the experiment. (This is both
good and bad)
MIT18_05S14_class20_slides 480
January 1, 2017 2 /10
Concept question
Three different tests are run all with significance level α = 0.05.

1. Experiment 1: finds p = 0.03 and rejects its null hypothesis H0 .

2. Experiment 2: finds p = 0.049 and rejects its null hypothesis.

3. Experiment 3: finds p = 0.15 and fails to reject its null


hypothesis.

Which result has the highest probability of being correct?

(Click 4 if you don’t know.)

answer: 4. You can’t know probabilities of hypotheses based just on


p values.
MIT18_05S14_class20_slides 481
January 1, 2017 3 /10
Board question: Stop!

Experiments are run to test a coin that is suspected of being biased


towards heads. The significance level is set to α = 0.1
Experiment 1: Toss a coin 5 times. Report the sequence of tosses.
Experiment 2: Toss a coin until the first tails. Report the sequence
of tosses.

1. Give the test statistic, null distribution and rejection region for
each experiment. List all sequences of tosses that produce a test
statistic in the rejection region for each experiment.

2. Suppose the data is HHHHT .


(a) Do the significance test for both types of experiment.
(b) Do a Bayesian update starting from a flat prior: Beta(1,1).
Draw some conclusions about the fairness of coin from your posterior.
(Use R: pbeta for computation in part (b).)
MIT18_05S14_class20_slides 482
January 1, 2017 4 /10
Solution

1. Experiment 1: The test statistic is the number of heads x out of 5


tosses. The null distribution is binomial(5,0.5). The rejection region is
{x = 5}. The sequence of tosses HHHHH. is the only one that leads to
rejection.
Experiment 2: The test statistic is the number of heads x until the first
tails. The null distribution is geom(0.5). The rejection region {x ≥ 4}.
The sequences of tosses that lead to rejection are
{HHHHT , HHHHH ∗ ∗T }, where ’∗∗’ means an arbitrary length string of
heads.
2a. For experiment 1 and the given data, ‘as or more extreme’ means 4 or
5 heads. So for experiment 1 the p-value is P(4 or 5 heads | fair coin) =
6/32 ≈ 0.20.
For experiment 2 and the given data ‘as or more extreme’ means at least 4
heads at the start. So p = 1 - pgeom(3,0.5) = 0.0625.
(Solution continued.)
MIT18_05S14_class20_slides 483
January 1, 2017 5 /10
Solution continued

2b. Let θ be the probability of heads, Four heads and a tail updates the
prior on θ, Beta(1,1) to the posterior Beta(5,2). Using R we can compute

P(Coin is biased to heads) = P(θ > 0.5) = 1 -pbeta(0.5,5,2) = 0.89.

If the prior is good then the probability the coin is biased towards heads is
0.89.

MIT18_05S14_class20_slides 484
January 1, 2017 6 /10
Board question: Stop II

For each of the following experiments (all done with α = 0.05):
(a) Comment on the validity of the claims.
(b) Find the true probability of a type I error in each experimental setup.
1. By design Ruthi did 50 trials and computed p = 0.04.
She reports p = 0.04 with n = 50 and declares it significant.
2. Ani did 50 trials and computed p = 0.06.
Since this was not significant, she then did 50 more trials and computed p = 0.04 based on all 100 trials.
She reports p = 0.04 with n = 100 and declares it significant.
3. Efrat did 50 trials and computed p = 0.06.
Since this was not significant, she started over and computed p = 0.04 based on the next 50 trials.
She reports p = 0.04 with n = 50 and declares it statistically significant.

MIT18_05S14_class20_slides 485
January 1, 2017 7 /10
Solution

1. (a) This is a reasonable NHST experiment.


(b) The probability of a type I error is 0.05.
2. (a) The actual experiment run:
(i) Do 50 trials.
(ii) If p < 0.05 then stop.
(iii) If not run another 50 trials.
(iv) Compute p again, pretending that all 100 trials were run without any
possibility of stopping.
This is not a reasonable NHST experimental setup because the second p-value is computed using the wrong null distribution.
(b) If H0 is true then the probability of rejecting is already 0.05 by step
(ii). It can only increase by allowing steps (iii) and (iv). So the probability
of rejecting given H0 is more than 0.05. We can’t say how much more
without more details.
MIT18_05S14_class20_slides 486
January 1, 2017 8 /10
Solution continued

3. (a) See answer to (2a).


(b) The total probability of a type I error is more than 0.05. We can
compute it using a probability tree. Since we are looking at type I errors
all probabilities are computed assuming H0 is true.

First 50 trials: Reject (probability 0.05) or Continue (probability 0.95).
If we continue, second 50 trials: Reject (probability 0.05) or Don’t reject.

The total probability of falsely rejecting H0 is 0.05 + 0.05 × 0.95 = 0.0975
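A small simulation sketch (not on the slides) to check the 0.0975. It assumes a one-sided z-test of H0: µ = 0 on 50 standard normal observations, so each test rejects with probability exactly 0.05 under H0.

# Simulate Efrat's protocol under H0: test once; if it fails to reject, start over once.
set.seed(1)
n = 50; alpha = 0.05; nsim = 100000
one_test = function() {
  x = rnorm(n)                         # data generated under H0
  p = 1 - pnorm(sqrt(n) * mean(x))     # one-sided p-value
  p < alpha
}
reject = replicate(nsim, one_test() || one_test())
mean(reject)    # about 0.0975, not the nominal 0.05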

MIT18_05S14_class20_slides 487
January 1, 2017 9 /10
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .

MIT18_05S14_class20_slides 488
Confidence Intervals for Normal Data
18.05 Spring 2014

Class 22 Slides with Solutions: Confidence Intervals for Normal 489


Agenda

Today
Review of critical values and quantiles.
Computing z, t, χ2 confidence intervals for normal data.
Conceptual view of confidence intervals.
Confidence intervals for polling (Bernoulli distributions).

MIT18_05S14_class22-slde-a 490
January 1, 2017 2 / 19
Review of critical values and quantiles

Quantile: left tail P(X < qα ) = α


Critical value: right tail P(X > cα ) = α
Letters for critical values:
zα for N(0, 1)
tα for t(n)
cα , xα all purpose

(Figure: standard normal densities with left-tail area α below qα and right-tail area α above zα.)

qα and zα for the standard normal distribution.


MIT18_05S14_class22-slde-a 491
January 1, 2017 3 / 19
Concept question


1. z.025 =
(a) -1.96 (b) -0.95 (c) 0.95 (d) 1.96 (e) 2.87

2. −z.16 =
(a) -1.33 (b) -0.99 (c) 0.99 (d) 1.33 (e) 3.52

Solution on next slide.

MIT18_05S14_class22-slde-a 492
January 1, 2017 4 / 19
Solution

1. answer: z.025 = 1.96. By definition P(Z > z.025 ) = 0.025. This is the
same as P(Z ≤ z.025 ) = 0.975. Either from memory, a table or using the
R function qnorm(.975) we get the result.

2. answer: −z.16 = −0.99. We recall that P(|Z| < 1) ≈ 0.68. Since half the
leftover probability is in the right tail we have P(Z > 1) ≈ 0.16. Thus
z.16 ≈ 1, so −z.16 ≈ −1.
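These can also be checked with R’s qnorm (a side note, not on the slides):

qnorm(0.975)   # 1.96, which is z.025
qnorm(0.84)    # about 0.99, which is z.16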

MIT18_05S14_class22-slde-a 493
January 1, 2017 5 / 19
Computing confidence intervals from normal data
Suppose the data x1 , . . . , xn is drawn from N(µ, σ²).
Confidence level = 1 − α.
z confidence interval for the mean (σ known):
  [ x̄ − zα/2 · σ/√n , x̄ + zα/2 · σ/√n ]
t confidence interval for the mean (σ unknown):
  [ x̄ − tα/2 · s/√n , x̄ + tα/2 · s/√n ]
χ² confidence interval for σ²:
  [ (n − 1)s²/cα/2 , (n − 1)s²/c1−α/2 ]
t and χ² have n − 1 degrees of freedom.
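A short R sketch (not part of the slides) computing all three intervals. It uses the data 1, 2, 3, 4 from the board question below, with σ = 2 treated as known and confidence level 90%.

x = c(1, 2, 3, 4); sigma0 = 2; alpha = 0.10
n = length(x); xbar = mean(x); s = sd(x)
zcrit = qnorm(1 - alpha/2)                    # z_{alpha/2}
z_ci = xbar + c(-1, 1) * zcrit * sigma0 / sqrt(n)
tcrit = qt(1 - alpha/2, df = n - 1)           # t_{alpha/2}, n - 1 degrees of freedom
t_ci = xbar + c(-1, 1) * tcrit * s / sqrt(n)
# chi-square interval for sigma^2: the larger critical value gives the lower endpoint
chi_ci = (n - 1) * s^2 / qchisq(c(1 - alpha/2, alpha/2), df = n - 1)
z_ci; t_ci; chi_ci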
MIT18_05S14_class22-slde-a 494
January 1, 2017 6 / 19
z rule of thumb

Suppose x1 , . . . , xn ∼ N(µ, σ 2 ) with σ known.

The rule-of-thumb 95% confidence interval for µ is:
  [ x̄ − 2σ/√n , x̄ + 2σ/√n ]

A more precise 95% confidence interval for µ is:
  [ x̄ − 1.96·σ/√n , x̄ + 1.96·σ/√n ]

MIT18_05S14_class22-slde-a 495
January 1, 2017 7 / 19
Board question: computing confidence intervals

The data 1, 2, 3, 4 is drawn from N(µ, σ 2 ) with µ unknown.

1 Find a 90% z confidence interval for µ, given that σ = 2.

For the remaining parts, suppose σ is unknown.


2 Find a 90% t confidence interval for µ.
3 Find a 90% χ2 confidence interval for σ 2 .
4 Find a 90% χ2 confidence interval for σ.
5 Given a normal sample with n = 100, x = 12, and s = 5,
find the rule-of-thumb 95% confidence interval for µ.

MIT18_05S14_class22-slde-a 496
January 1, 2017 8 / 19
Solution
x̄ = 2.5, s² = 1.667, s = 1.29
σ/√n = 1, s/√n = 0.645.
1. z.05 = 1.644: z confidence interval is
2.5 ± 1.644 · 1 = [0.856, 4.144]
2. t.05 = 2.353 (3 degrees of freedom): t confidence interval is
2.5 ± 2.353 · 0.645 = [0.982, 4.018]
3. c0.05 = 7.815, c0.95 = 0.352 (3 degrees of freedom): χ² confidence interval is
[ 3 · 1.667/7.815 , 3 · 1.667/0.352 ] = [0.640, 14.207].
4. Take the square root of the interval in 3: [0.800, 3.769].
5. The rule of thumb is written for z, but with n = 100 the t(99) and standard normal distributions are very close, so we can assume that
t.025 ≈ 2. Thus the 95% confidence interval is 12 ± 2 · 5/10 = [11, 13].
MIT18_05S14_class22-slde-a 497
January 1, 2017 9 / 19
Conceptual view of confidence intervals
Computed from data ⇒ interval statistic
‘Estimates’ a parameter of interest ⇒ interval estimate
Width = measure of precision
Confidence level = measure of performance
Confidence intervals are a frequentist method.
I No need for a prior, only uses likelihood.

I Frequentists never assign probabilities to unknown

parameters:
I A 95% confidence interval of [1.2, 3.4] for µ does not

mean that P(1.2 ≤ µ ≤ 3.4) = 0.95.


We will compare with Bayesian probability intervals later.

Applet:
MIT18_05S14_class22-slde-a
http://mathlets.org/mathlets/confidence-intervals/498
January 1, 2017 10 / 19
Table discussion

The quantities n, c, µ, σ all play a role in the confidence interval for


the mean.

How does the width of a confidence interval for the mean change if:

1. we increase n and leave the others unchanged?


2. we increase c and leave the others unchanged?
3. we increase µ and leave the others unchanged?
4. we increase σ and leave the others unchanged?

(A) it gets wider (B) it gets narrower (C) it stays the same.
MIT18_05S14_class22-slde-a 499
January 1, 2017 11 / 19
Answers

1. Narrower. More data decreases the variance of x̄


2. Wider. Greater confidence requires a bigger interval.
3. No change. Changing µ will tend to shift the location of the intervals.
4. Wider. Increasing σ will increase the uncertainty about µ.

MIT18_05S14_class22-slde-a 500
January 1, 2017 12 / 19
Intervals and pivoting
x: sample mean (statistic)
µ0 : hypothesized mean (not known)
Pivoting: x is in the interval µ0 ± 2.3 ⇔ µ0 is in the interval x ± 2.3.

(Figure: number line with µ0 and x marked, and the four intervals listed below drawn around them.)
µ0 ± 1 this interval does not contain x
x±1 this interval does not contain µ0
µ0 ± 2.3 this interval contains x
x ± 2.3 this interval contains µ0

Algebra of pivoting:

µ0 − 2.3 < x < µ0 + 2.3 ⇔ x + 2.3 > µ0 > x − 2.3.


MIT18_05S14_class22-slde-a 501
January 1, 2017 13 / 19
Board question: confidence intervals, non-rejection regions

Suppose x1 , . . . , xn ∼ N(µ, σ 2 ) with σ known.

Consider two intervals:


1. The z confidence interval around x at confidence level 1 − α.
2. The z non-rejection region for H0 : µ = µ0 at significance level α.

Compute and sketch these intervals to show that:

µ0 is in the first interval ⇔ x is in the second interval.

MIT18_05S14_class22-slde-a 502
January 1, 2017 14 / 19
Solution
Confidence interval: x̄ ± zα/2 · σ/√n
Non-rejection region: µ0 ± zα/2 · σ/√n

Since the intervals are the same width they either both contain the
other’s center or neither one does.

(Figure: the N(µ0, σ²/n) density with the non-rejection region µ0 ± zα/2 · σ/√n marked on the x-axis; a point x1 inside the region and a point x2 outside it.)

MIT18_05S14_class22-slde-a 503
January 1, 2017 15 / 19
Polling: a binomial proportion confidence interval
Data x1 , . . . , xn from a Bernoulli(θ) distribution with θ unknown.
A conservative normal† (1 − α) confidence interval for θ is given by
  [ x̄ − zα/2/(2√n) , x̄ + zα/2/(2√n) ].
Proof uses the CLT and the observation σ = √(θ(1 − θ)) ≤ 1/2.
Political polls often give a margin-of-error of ±1/√n. This rule-of-thumb corresponds to a 95% confidence interval:
  [ x̄ − 1/√n , x̄ + 1/√n ].
(The proof is in the class 22 notes.)
Conversely, a margin of error of ±0.05 means 400 people were polled.

There are many types of binomial proportion confidence intervals.
MIT18_05S14_class22-slde-a 504
http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
January 1, 2017 16 / 19
Board question

For a poll to find the proportion θ of people supporting X we know


that a (1 − α) confidence interval for θ is given by
  [ x̄ − zα/2/(2√n) , x̄ + zα/2/(2√n) ].

1. How many people would you have to poll to have a margin of error
of .01 with 95% confidence? (You can do this in your head.)

2. How many people would you have to poll to have a margin of error
of .01 with 80% confidence. (You’ll want R or other calculator here.)

3. If n = 900, compute the 95% and 80% confidence intervals for θ.


answer: See next slide.
MIT18_05S14_class22-slde-a 505
January 1, 2017 17 / 19

answer: 1. Need 1/√n = 0.01, so n = 10000.
2. α = 0.2, so zα/2 = qnorm(0.9) = 1.2816. So we need zα/2/(2√n) = 0.01. This gives n = 4106.
3. 95% interval: x̄ ± 1/√n = x̄ ± 1/30 = x̄ ± 0.0333
80% interval: x̄ ± z.1/(2√n) = x̄ ± 1.2816/60 = x̄ ± 0.021.
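A small R sketch of these computations (not on the slides), using the conservative interval above; the exact z value differs slightly from the rule-of-thumb z ≈ 2.

n_for = function(target, conf) ceiling((qnorm(1 - (1 - conf)/2) / (2 * target))^2)
n_for(0.01, 0.95)    # 9604 with z = 1.96; the answer 10000 uses the rule-of-thumb z ~ 2
n_for(0.01, 0.80)    # 4107 (the slide rounds 4106.3 down)
moe = function(n, conf) qnorm(1 - (1 - conf)/2) / (2 * sqrt(n))
moe(900, 0.95)       # about 0.033
moe(900, 0.80)       # about 0.021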

MIT18_05S14_class22-slde-a 506
January 1, 2017 18 / 19
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.

MIT18_05S14_class22-slde-a 507
Confidence Intervals II
18.05 Spring 2014

Class 23 Slides with Solutions: Confidence Intervals II 508


Agenda

Polling: estimating θ in Bernoulli(θ).


CLT ⇒ large sample confidence intervals for the mean.
Three views of confidence intervals.
Constructing a confidence interval without normality:
the exact binomial confidence interval for θ

MIT18_05S14_class23-slde-a 509
January 1, 2017 2 / 18
Polling confidence interval
Also called a binomial proportion confidence interval
Polling means sampling from a Bernoulli(θ) distribution,
i.e. data x1 , . . . , xn ∼ Bernoulli(θ).
Conservative normal confidence interval for θ:
  x̄ ± zα/2 · 1/(2√n)
Proof uses the CLT and the observation σ = √(θ(1 − θ)) ≤ 1/2.
Rule-of-thumb 95% confidence interval for θ:
  x̄ ± 1/√n
(Reason: z0.025 ≈ 2.)
MIT18_05S14_class23-slde-a 510
January 1, 2017 3 / 18
Binomial proportion confidence intervals


Political polls often give a margin-of-error of ±1/√n, i.e. they use
the rule-of-thumb 95% confidence interval.

There are many types of binomial proportion confidence intervals:


http://en.wikipedia.org/wiki/Binomial_proportion_
confidence_interval

MIT18_05S14_class23-slde-a 511
January 1, 2017 4 / 18
Board question

For a poll to find the proportion θ of people supporting X we know


that a (1 − α) confidence interval for θ is given by
  [ x̄ − zα/2/(2√n) , x̄ + zα/2/(2√n) ].

1. How many people would you have to poll to have a margin of error
of 0.01 with 95% confidence? (You can do this in your head.)

2. How many people would you have to poll to have a margin of error
of 0.01 with 80% confidence. (You’ll want R or other calculator here.)

3. If n = 900, compute the 95% and 80% confidence intervals for θ.


answer: See next slide.
MIT18_05S14_class23-slde-a 512
January 1, 2017 5 / 18

answer: 1. Need 1/√n = 0.01, so n = 10000.
2. α = 0.2, so zα/2 = qnorm(0.9) = 1.2816. So we need zα/2/(2√n) = 0.01. This gives n = 4106.
3. 95% interval: x̄ ± 1/√n = x̄ ± 1/30 = x̄ ± 0.0333
80% interval: x̄ ± z0.1/(2√n) = x̄ ± 1.2816/60 = x̄ ± 0.021.

MIT18_05S14_class23-slde-a 513
January 1, 2017 6 / 18
Concept question: overnight polling

During the presidential election season, pollsters often do ‘overnight


polls’ and report a ‘margin of error’ of about ±5%.

The number of people polled is in which of the following ranges?


(a) 0 – 50
(b) 50 – 100
(c) 100 – 300
(d) 300 – 600
(e) 600 – 1000

Answer: 5% = 1/20. So √n = 20 ⇒ n = 400.

MIT18_05S14_class23-slde-a 514
January 1, 2017 7 / 18
National Council on Public Polls: Press Release, Sept 1992
“The National Council on Public Polls expressed concern today about the
current spate of overnight Presidential polls. [...] Overnight polls do a
disservice to both the media and the research industry because of the
considerable potential for the results to be misleading. The overnight
interviewing period may well mean some methodological compromises, the
most serious of which is..”

...what?

“...the inability to make callbacks, resulting in samples that do not


adequately represent such groups as single member households, younger
people, and others who are apt to be out on any given night. As overnight
polls often result in findings that are less reliable than those from more
carefully conducted polls, if the media reports them, it should be with
great caution.”
http://www.ncpp.org/?q=node/42
MIT18_05S14_class23-slde-a 515
January 1, 2017 8 / 18
Large sample confidence interval
Data x1 , . . . , xn independently drawn from a distribution that may not
be normal but has finite mean and variance.

A version of the central limit theorem says that for large n,
  (x̄ − µ)/(s/√n) ≈ N(0, 1),
i.e. the sampling distribution of the studentized mean is approximately standard normal.
So for large n the (1 − α) confidence interval for µ is approximately
  [ x̄ − zα/2 · s/√n , x̄ + zα/2 · s/√n ]

This is called the large sample confidence interval.
MIT18_05S14_class23-slde-a 516
January 1, 2017 9 / 18
Review: confidence intervals for normal data
Suppose the data x1 , . . . , xn is drawn from N(µ, σ²).
Confidence level = 1 − α.
z confidence interval for the mean (σ known):
  [ x̄ − zα/2 · σ/√n , x̄ + zα/2 · σ/√n ], i.e. x̄ ± zα/2 · σ/√n
t confidence interval for the mean (σ unknown):
  [ x̄ − tα/2 · s/√n , x̄ + tα/2 · s/√n ], i.e. x̄ ± tα/2 · s/√n
χ² confidence interval for σ²:
  [ (n − 1)s²/cα/2 , (n − 1)s²/c1−α/2 ]
t and χ² have n − 1 degrees of freedom.
MIT18_05S14_class23-slde-a 517
January 1, 2017 10 / 18
Three views of confidence intervals

View 1: Define/construct CI using a standardized point statistic.

View 2: Define/construct CI based on hypothesis tests.

View 3: Define CI as any interval statistic satisfying a formal


mathematical property.

MIT18_05S14_class23-slde-a 518
January 1, 2017 11 / 18
View 1: Using a standardized point statistic
Example. x1 , . . . , xn ∼ N(µ, σ²), where σ is known.
The standardized sample mean follows a standard normal distribution:
  z = (x̄ − µ)/(σ/√n) ∼ N(0, 1)
Therefore:
  P(−zα/2 < (x̄ − µ)/(σ/√n) < zα/2 | µ) = 1 − α
Pivot to:
  P(x̄ − zα/2 · σ/√n < µ < x̄ + zα/2 · σ/√n | µ) = 1 − α
This is the (1 − α) confidence interval:
  x̄ ± zα/2 · σ/√n
Think of it as x̄ ± error.
MIT18_05S14_class23-slde-a 519
January 1, 2017 12 / 18
View 1: Other standardized statistics

The t and χ² statistics fit this paradigm as well:
  t = (x̄ − µ)/(s/√n) ∼ t(n − 1)
  X² = (n − 1)s²/σ² ∼ χ²(n − 1)

MIT18_05S14_class23-slde-a 520
January 1, 2017 13 / 18
View 2: Using hypothesis tests

Set up: Unknown parameter θ. Test statistic x.

For any value θ0 , we can run an NHST with null hypothesis

H0 : θ = θ 0

at significance level α.
Definition. Given x, the (1 − α) confidence interval contains all θ0
which are not rejected when they are the null hypothesis.

Definition. A type 1 CI error occurs when the confidence interval


does not contain the true value of θ.
For a 1 − α confidence interval, the type 1 CI error rate is α.

MIT18_05S14_class23-slde-a 521
January 1, 2017 14 / 18
Board question: exact binomial confidence interval

Use this table of binomial(8,θ) probabilities to:


1 find the (two-sided) rejection region with significance level 0.10
for each value of θ.
2 Given x = 7, find the 90% confidence interval for θ.
3 Repeat for x = 4.
θ/x 0 1 2 3 4 5 6 7 8
.1 0.430 0.383 0.149 0.033 0.005 0.000 0.000 0.000 0.000
.3 0.058 0.198 0.296 0.254 0.136 0.047 0.010 0.001 0.000
.5 0.004 0.031 0.109 0.219 0.273 0.219 0.109 0.031 0.004
.7 0.000 0.001 0.010 0.047 0.136 0.254 0.296 0.198 0.058
.9 0.000 0.000 0.000 0.000 0.005 0.033 0.149 0.383 0.430

MIT18_05S14_class23-slde-a 522
January 1, 2017 15 / 18
Solution
For each θ, the non-rejection region is blue, the rejection region is red.
In each row, the rejection region has probability at most α = 0.10.

θ/x 0 1 2 3 4 5 6 7 8
.1 0.430 0.383 0.149 0.033 0.005 0.000 0.000 0.000 0.000
.3 0.058 0.198 0.296 0.254 0.136 0.047 0.010 0.001 0.000
.5 0.004 0.031 0.109 0.219 0.273 0.219 0.109 0.031 0.004
.7 0.000 0.001 0.010 0.047 0.136 0.254 0.296 0.198 0.058
.9 0.000 0.000 0.000 0.000 0.005 0.033 0.149 0.383 0.430

For x = 7 the 90% confidence interval for p is [0.7, 0.9].


These are the values of θ we wouldn’t reject as null hypotheses. They
are the blue entries in the x = 7 column.

For x = 4 the 90% confidence interval for p is [0.3, 0.7].
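A sketch in R (not on the slides) that builds these intervals by inverting the test over the same grid of θ values. The rejection rule used here, dropping the least likely outcomes while their total probability stays at most α, is one reasonable choice; it reproduces the intervals above.

thetas = c(0.1, 0.3, 0.5, 0.7, 0.9); alpha = 0.10
exact_ci = function(x, n = 8) {
  keep = sapply(thetas, function(th) {
    probs = dbinom(0:n, n, th)
    ord = order(probs)                               # outcomes from least to most likely
    rejected = ord[cumsum(probs[ord]) <= alpha] - 1  # 0-based outcomes in the rejection region
    !(x %in% rejected)
  })
  thetas[keep]
}
exact_ci(7)   # 0.7 0.9
exact_ci(4)   # 0.3 0.5 0.7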


MIT18_05S14_class23-slde-a 523
January 1, 2017 16 / 18
View 3: Formal
Recall: An interval statistic is an interval Ix computed from data x.
This is a random interval because x is random.
Suppose x is drawn from f (x|θ) with unknown parameter θ.

Definition:
A (1 − α) confidence interval for θ is an interval statistic Ix such that

P(Ix contains θ | θ) = 1 − α

for all possible values of θ (and hence for the true value of θ).

Note: equality in this equation is often relaxed to ≥ or ≈.


= : z, t, χ2
≥ : rule-of-thumb and exact binomial (polling)
≈ : large sample confidence interval
MIT18_05S14_class23-slde-a 524
January 1, 2017 17 / 18
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.

MIT18_05S14_class23-slde-a 525
Bootstrapping
18.05 Spring 2014

Class 24 Slides with Solutions: Bootstrap Confidence Intervals 526


Agenda

Bootstrap terminology

Bootstrap principle

Empirical bootstrap

Parametric bootstrap

MIT18_05S14_class24-slde-a 527
January 1, 2017 2 / 16
Empirical distribution of data
Data: x1 , x2 , . . . , xn (independent)
Example 1. Data: 1, 2, 2, 3, 8, 8, 8.
x∗ 1 2 3 8
p ∗ (x ∗ ) 1/7 2/7 1/7 3/7
Example 2. (Figure: density histogram of a larger sample, compared with the true distribution.)

The true and empirical distribution are approximately equal.


MIT18_05S14_class24-slde-a 528
January 1, 2017 3 / 16
Resampling

Sample (size 6): 1 2 1 5 1 12

Resample (size m): Randomly choose m samples with


replacement from the original sample.

Resample probabilities = empirical distribution:


P(1) = 1/2, P(2) = 1/6 etc.

E.g. resample (size 10): 5 1 1 1 12 1 2 1 1 5

A bootstrap (re)sample is always the same size as the original


sample:

Bootstrap sample (size 6): 5 1 1 1 12 1


MIT18_05S14_class24-slde-a 529
January 1, 2017 4 / 16
Bootstrap principle for the mean
• Data x1 , x2 , . . . , xn ∼ F with true mean µ.
• F∗ = empirical distribution (resampling distribution).
• x1∗ , x2∗ , . . . , xn∗ = a resample of the same size as the data.
Bootstrap Principle: (really holds for any statistic)
1. F∗ ≈ F
2. δ∗ = x̄∗ − x̄ (computed from the resample) ≈ x̄ − µ = the variation of x̄
Critical values: if δ∗1−α/2 ≤ x̄∗ − x̄ ≤ δ∗α/2,
then δ∗1−α/2 ≤ x̄ − µ ≤ δ∗α/2, so
  x̄ − δ∗α/2 ≤ µ ≤ x̄ − δ∗1−α/2

MIT18_05S14_class24-slde-a 530
January 1, 2017 5 / 16
Empirical bootstrap confidence intervals
Use the data to estimate the variation of estimates based on the data!

Data: x1 , . . . , xn drawn from a distribution F .


Estimate a feature θ of F by a statistic θ̂.
Generate many bootstrap samples x1∗ , . . . , xn∗ .
Compute the statistic θ∗ for each bootstrap sample.
Compute the bootstrap difference

δ ∗ = θ∗ − θ̂.

Use the quantiles of δ ∗ to approximate quantiles of

δ = θ̂ − θ

Set a confidence interval [θ̂ − δ∗1−α/2 , θ̂ − δ∗α/2]
(By δα/2 we mean the α/2 quantile.)
MIT18_05S14_class24-slde-a 531
January 1, 2017 6 / 16
Concept question

Consider finding bootstrap confidence intervals for

I. the mean II. the median III. 47th percentile.

Which is easiest to find?


A. I B. II C. III D. I and II

E. II and III F. I and III G. I and II and III

answer: G. The program is essentially the same for all three statistics. All
that needs to change is the code for computing the specific statistic.

MIT18_05S14_class24-slde-a 532
January 1, 2017 7 / 16
Board question

Data: 3 8 1 8 3 3

Bootstrap samples (each column is one bootstrap trial):


8 8 1 8 3 8 3 1
1 3 3 1 3 8 3 3
3 1 1 8 1 3 3 8
8 1 3 1 3 3 8 8
3 3 1 8 8 3 8 3
3 8 8 3 8 3 1 1

Compute a bootstrap 80% confidence interval for the mean.

Compute a bootstrap 80% confidence interval for the median.

MIT18_05S14_class24-slde-a 533
January 1, 2017 8 / 16
Solution: mean

x̄ = 4.33

x̄ ∗ : 4.33, 4.00, 2.83, 4.83, 4.33, 4.67, 4.33, 4.00

δ∗: 0.00, -0.33, -1.50, 0.50, 0.00, 0.33, 0.00, -0.33

Sorted
δ ∗ : -1.50, -0.33, -0.33, 0.00, 0.00, 0.00, 0.33, 0.50

So, δ0∗.9 = −1.50, δ0∗.1 = 0.37.


(For δ0∗.1 we interpolated between the top two values –there are other
reasonable choices. In R see the quantile() function.)

80% bootstrap CI for mean: [x̄ − 0.37, x̄ + 1.50] = [3.97, 5.83]


MIT18_05S14_class24-slde-a 534
January 1, 2017 9 / 16
Solution: median

x0.5 = median(x) = 3

x0∗.5 : 3.0, 3.0, 2.0, 5.5, 3.0, 3.0, 3.0, 3.0

δ∗: 0.0, 0.0, -1.0, 2.5, 0.0, 0.0, 0.0, 0.0

Sorted
δ ∗ : -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.5

So, δ0∗.9 = −1.0, δ0∗.1 = 0.5.


(For δ0∗.1 we interpolated between the top two values –there are other
reasonable choices. In R see the quantile() function.)

80% bootstrap CI for median: [x0.5 − 0.5, x0.5 + 1.0] = [2.5, 4.0]


MIT18_05S14_class24-slde-a 535
January 1, 2017 10 / 16
Empirical bootstrapping in R
x = c(30,37,36,43,42,43,43,46,41,42) # original sample
n = length(x) # sample size
xbar = mean(x) # sample mean
nboot = 5000 # number of bootstrap samples to use

# Generate nboot empirical samples of size n


# and organize in a matrix
tmpdata = sample(x,n*nboot, replace=TRUE)
bootstrapsample = matrix(tmpdata, nrow=n, ncol=nboot)

# Compute bootstrap means xbar* and differences delta*


xbarstar = colMeans(bootstrapsample)
deltastar = xbarstar - xbar

# Find the .1 and .9 quantiles and make


# the bootstrap 80% confidence interval
d = quantile(deltastar, c(.1,.9))
ci = xbar - c(d[2], d[1])
MIT18_05S14_class24-slde-a 536
January 1, 2017 11 / 16
Parametric bootstrapping
Use the estimated parameter to estimate the variation of estimates of
the parameter!

Data: x1 , . . . , xn drawn from a parametric distribution F (θ).


Estimate θ by a statistic θ̂.
Generate many bootstrap samples from F (θ̂).
Compute the statistic θ∗ for each bootstrap sample.
Compute the bootstrap difference

δ ∗ = θ∗ − θ̂.

Use the quantiles of δ ∗ to approximate quantiles of

δ = θ̂ − θ

Set a confidence interval [θ̂ − δ∗1−α/2 , θ̂ − δ∗α/2]
MIT18_05S14_class24-slde-a 537
January 1, 2017 12 / 16
Parametric sampling in R
# Data from binomial(15, θ) for an unknown θ
x = c(3, 5, 7, 9, 11, 13)
binomSize = 15 # known size of binomial
n = length(x) # sample size
thetahat = mean(x)/binomSize # MLE for θ
nboot = 5000 # number of bootstrap samples to use

# nboot parametric samples of size n; organize in a matrix


tmpdata = rbinom(n*nboot, binomSize, thetahat)
bootstrapsample = matrix(tmpdata, nrow=n, ncol=nboot)

# Compute bootstrap means thetahat* and differences delta*


thetahatstar = colMeans(bootstrapsample)/binomSize
deltastar = thetahatstar - thetahat
# Find quantiles and make the bootstrap confidence interval
d = quantile(deltastar, c(.1,.9))
ci = thetahat - c(d[2], d[1])
MIT18_05S14_class24-slde-a 538
January 1, 2017 13 / 16
Board question

Data: 6 5 5 5 7 4 ∼ binomial(8,θ)

1. Estimate θ.

2. Write out the R code to generate data of 100 parametric


bootstrap samples and compute an 80% confidence interval for θ.

(Try this without looking at your notes. We’ll show the previous slide
at the end)
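One possible solution sketch (not shown on the slides); it simply adapts the parametric-sampling code from the previous slide to this question’s data, with θ̂ = mean(x)/8 = 2/3.

x = c(6, 5, 5, 5, 7, 4); binomSize = 8
n = length(x)
thetahat = mean(x) / binomSize            # MLE for theta
nboot = 100
tmpdata = rbinom(n * nboot, binomSize, thetahat)
bootstrapsample = matrix(tmpdata, nrow = n, ncol = nboot)
thetahatstar = colMeans(bootstrapsample) / binomSize
deltastar = thetahatstar - thetahat
d = quantile(deltastar, c(0.1, 0.9))
ci = thetahat - c(d[2], d[1])             # 80% bootstrap confidence interval for theta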

MIT18_05S14_class24-slde-a 539
January 1, 2017 14 / 16
Preview of linear regression

Fit lines or polynomials to bivariate data


Model: y = f (x) + E
f (x) function, E random error.
Example: y = ax + b + E
Example: y = ax 2 + bx + c + E
Example: y = eax+b+E (Compute with ln(y ) = ax + b + E .)

MIT18_05S14_class24-slde-a 540
January 1, 2017 15 / 16
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.

MIT18_05S14_class24-slde-a 541
Linear Regression
18.05 Spring 2014

Class 25 Slides with Solutions: Linear Regression 542


Agenda

Fitting curves to bivariate data

Measuring the goodness of fit

The fit vs. complexity tradeoff

Regression to the mean

Multiple linear regression

MIT18_05S14_class25-slds-a 543
January 1, 2017 2 / 25
Modeling bivariate data as a function + noise

Ingredients
Bivariate data (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ).
Model: yi = f (xi ) + Ei
where f (x) is some function, Ei random error.
Total squared error: Σᵢ₌₁ⁿ Ei² = Σᵢ₌₁ⁿ (yi − f(xi))²

Model allows us to predict the value of y for any given value of x.


• x is called the independent or predictor variable.
• y is the dependent or response variable.

MIT18_05S14_class25-slds-a 544
January 1, 2017 3 / 25
Examples of f (x)

lines: y = ax + b + E

polynomials: y = ax 2 + bx + c + E

other: y = a/x + b + E

other: y = a sin(x) + b + E

MIT18_05S14_class25-slds-a 545
January 1, 2017 4 / 25
Simple linear regression: finding the best fitting line

Bivariate data (x1 , y1 ), . . . , (xn , yn ).


Simple linear regression: fit a line to the data

yi = axi + b + Ei , where Ei ∼ N(0, σ 2 )

and where σ is a fixed value, the same for all data points.
Total squared error: Σᵢ₌₁ⁿ Ei² = Σᵢ₌₁ⁿ (yi − axi − b)²

Goal: Find the values of a and b that give the ‘best fitting line’.
Best fit: (least squares)
The values of a and b that minimize the total squared error.

MIT18_05S14_class25-slds-a 546
January 1, 2017 5 / 25
Linear Regression: finding the best fitting polynomial
Bivariate data: (x1 , y1 ), . . . , (xn , yn ).

Linear regression: fit a parabola to the data


yi = axi2 + bxi + c + Ei , where Ei ∼ N(0, σ 2 )
and where σ is a fixed value, the same for all data points.
Total squared error: Σᵢ₌₁ⁿ Ei² = Σᵢ₌₁ⁿ (yi − axi² − bxi − c)².

Goal:
Find the values of a, b, c that give the ‘best fitting parabola’.
Best fit: (least squares)
The values of a, b, c that minimize the total squared error.

Can also fit higher order polynomials.


MIT18_05S14_class25-slds-a 547
January 1, 2017 6 / 25
Stamps


Stamp cost (cents) vs. time (years since 1960)


(Red dot = 49 cents is predicted cost in 2016.)
(Actual cost of a stamp dropped from 49 to 47 cents on 4/8/16.)
MIT18_05S14_class25-slds-a 548
January 1, 2017 7 / 25
Parabolic fit

(Figure: data points with a fitted parabola.)
MIT18_05S14_class25-slds-a 549
January 1, 2017 8 / 25
Board question: make it fit
Bivariate data:
(1, 3), (2, 1), (4, 4)

1. Do (simple) linear regression to find the best fitting line.


Hint: minimize the total squared error by taking partial derivatives
with respect to a and b.

2. Do linear regression to find the best fitting parabola.

3. Set up the linear regression to find the best fitting cubic, but don’t take derivatives.

4. Find the best fitting exponential y = eax+b .


Hint: take ln(y ) and do simple linear regression.
MIT18_05S14_class25-slds-a 550
January 1, 2017 9 / 25
Solutions
1. Model: ŷi = axi + b.
Total squared error:
  T = Σ (yi − ŷi)² = Σ (yi − axi − b)²
    = (3 − a − b)² + (1 − 2a − b)² + (4 − 4a − b)²
Take the partial derivatives and set to 0:
  ∂T/∂a = −2(3 − a − b) − 4(1 − 2a − b) − 8(4 − 4a − b) = 0
  ∂T/∂b = −2(3 − a − b) − 2(1 − 2a − b) − 2(4 − 4a − b) = 0
A little arithmetic gives the system of simultaneous linear equations and solution:
  42a + 14b = 42
  14a + 6b = 16   ⇒ a = 1/2, b = 3/2.
The least squares best fitting line is y = x/2 + 3/2.
MIT18_05S14_class25-slds-a 551
January 1, 2017 10 / 25
Solutions continued
2. Model: ŷi = axi² + bxi + c.
Total squared error:
  T = Σ (yi − ŷi)² = Σ (yi − axi² − bxi − c)²
    = (3 − a − b − c)² + (1 − 4a − 2b − c)² + (4 − 16a − 4b − c)²
We didn’t really expect people to carry this all the way out by hand. If you did, you would have found that taking the partial derivatives and setting to 0 gives the following system of simultaneous linear equations:
  273a + 73b + 21c = 71
  73a + 21b + 7c = 21   ⇒ a = 1.1667, b = −5.5, c = 7.3333.
  21a + 7b + 3c = 8
The least squares best fitting parabola is y = 1.1667x² − 5.5x + 7.3333.
MIT18_05S14_class25-slds-a 552
January 1, 2017 11 / 25
Solutions continued
3. Model: ŷi = axi³ + bxi² + cxi + d.
Total squared error:
  T = Σ (yi − ŷi)² = Σ (yi − axi³ − bxi² − cxi − d)²
    = (3 − a − b − c − d)² + (1 − 8a − 4b − 2c − d)² + (4 − 64a − 16b − 4c − d)²
In this case, with only 3 points, there are actually many cubics that go through all the points exactly. We are probably overfitting our data.
4. Model: ŷi = e^(axi+b) ⇔ ln(yi) = axi + b.
Total squared error:
  T = Σ (ln(yi) − ln(ŷi))² = Σ (ln(yi) − axi − b)²
    = (ln(3) − a − b)² + (ln(1) − 2a − b)² + (ln(4) − 4a − b)²
Now we can find a and b as before. (Using R: a = 0.18, b = 0.41)
MIT18_05S14_class25-slds-a 553
January 1, 2017 12 / 25
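As a quick check (not on the slides), R’s lm() reproduces these fits:

x = c(1, 2, 4); y = c(3, 1, 4)
coef(lm(y ~ x))              # line: intercept 1.5, slope 0.5
coef(lm(y ~ x + I(x^2)))     # parabola: 7.3333, -5.5, 1.1667
coef(lm(log(y) ~ x))         # exponential fit via ln(y): intercept 0.41, slope 0.18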
What is linear about linear regression?

Linear in the parameters a, b, . . ..

y = ax + b.
y = ax 2 + bx + c.

It is not because the curve being fit has to be a straight line


–although this is the simplest and most common case.

Notice: in the board question you had to solve a system of


simultaneous linear equations.

Fitting a line is called simple linear regression.

MIT18_05S14_class25-slds-a 554
January 1, 2017 13 / 25
Homoscedastic

BIG ASSUMPTIONS: the Ei are independent with the same variance σ².
(Figure: homoscedastic data with fitted regression line, and the corresponding residuals plotted against x.)
Regression line (left) and residuals (right).


Homoscedasticity = uniform spread of errors around regression line.
MIT18_05S14_class25-slds-a 555
January 1, 2017 14 / 25
Heteroscedastic

(Figure: heteroscedastic data, where the spread about the regression line is not uniform.)
Heteroscedastic Data
MIT18_05S14_class25-slds-a 556
January 1, 2017 15 / 25
Formulas for simple linear regression

Model:
yi = axi + b + Ei where Ei ∼ N(0, σ 2 ).
Using calculus or algebra:
â = sxy/sxx and b̂ = ȳ − â·x̄,
where
  x̄ = (1/n) Σ xi,   ȳ = (1/n) Σ yi,
  sxx = (1/(n−1)) Σ (xi − x̄)²,   sxy = (1/(n−1)) Σ (xi − x̄)(yi − ȳ).
WARNING: This is just for simple linear regression. For polynomials
and other functions you need other formulas.
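A sketch (not on the slides) of these formulas in R, checked against lm(), using the data from the board question below:

x = c(1, 2, 4); y = c(3, 1, 4)
sxx = sum((x - mean(x))^2) / (length(x) - 1)
sxy = sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
a_hat = sxy / sxx                   # 0.5
b_hat = mean(y) - a_hat * mean(x)   # 1.5
coef(lm(y ~ x))                     # same line: intercept 1.5, slope 0.5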
MIT18_05S14_class25-slds-a 557
January 1, 2017 16 / 25
Board Question: using the formulas plus some theory
Bivariate data: (1, 3), (2, 1), (4, 4)

1.(a) Calculate the sample means for x and y .


1.(b) Use the formulas to find a best-fit line in the xy -plane.
â = sxy/sxx,   b̂ = ȳ − â·x̄,
where sxy = (1/(n−1)) Σ (xi − x̄)(yi − ȳ) and sxx = (1/(n−1)) Σ (xi − x̄)².
2. Show the point (x, y ) is always on the fitted line.

3. Under the assumption Ei ∼ N(0, σ 2 ) show that the least squares


method is equivalent to finding the MLE for the parameters (a, b).
Hint: f (yi | xi , a, b) ∼ N(axi + b, σ 2 ).
MIT18_05S14_class25-slds-a 558
January 1, 2017 17 / 25
Solution

answer: 1. (a) x̄ = 7/3, ȳ = 8/3.


(b)

sxx = (1 + 4 + 16)/3 − 49/9 = 14/9, sxy = (3 + 2 + 16)/3 − 56/9 = 7/9.
(These use 1/n in place of 1/(n − 1); the common factor cancels in the ratio sxy/sxx.)
So
â = sxy/sxx = 7/14 = 1/2,   b̂ = ȳ − â·x̄ = 9/6 = 3/2.
(The same answer as the previous board question.)
2. The formula b̂ = ȳ − âx̄ is exactly the same as ȳ = âx̄ + b̂. That is,
the point (x̄, ȳ ) is on the line y = âx + b̂
Solution to 3 is on the next slide.

MIT18_05S14_class25-slds-a 559
January 1, 2017 18 / 25
3. Our model is yi = axi + b + Ei, where the Ei are independent. Since Ei ∼ N(0, σ²) this becomes
  yi ∼ N(axi + b, σ²)
Therefore the likelihood of yi given xi, a and b is
  f(yi | xi, a, b) = (1/√(2πσ²)) · exp(−(yi − axi − b)²/(2σ²))
Since the data yi are independent, the likelihood function is just the product of the expressions above, so the exponents add:
  likelihood = f(y1, . . . , yn | x1, . . . , xn, a, b) = (2πσ²)^(−n/2) · exp(−Σᵢ₌₁ⁿ (yi − axi − b)²/(2σ²))

Since the exponent is negative, the maximum likelihood will happen when
the exponent is as close to 0 as possible. That is, when the sum
  Σᵢ₌₁ⁿ (yi − axi − b)²

is as small as possible. This is exactly what we were asked to show.
MIT18_05S14_class25-slds-a 560
January 1, 2017 19 / 25
Measuring the fit
y = (y1 , · · · , yn ) = data values of the response variable.
ŷ = (ŷ1 , · · · , ŷn ) = ‘fitted values’ of the response variable.
TSS = Σ (yi − ȳ)² = total sum of squares = total variation.
RSS = Σ (yi − ŷi)² = residual sum of squares.
RSS = the squared error left unexplained by the model (due to random fluctuation).
RSS/TSS = unexplained fraction of the total error.
R² = 1 − RSS/TSS is a measure of goodness of fit.
R² is the fraction of the variance of y explained by the model.
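A short R sketch (not on the slides) computing R² by hand for the simple linear fit to the running example, and comparing with the value lm() reports:

x = c(1, 2, 4); y = c(3, 1, 4)
fit = lm(y ~ x)
TSS = sum((y - mean(y))^2)
RSS = sum((y - fitted(fit))^2)
1 - RSS / TSS               # R^2 (0.25 here)
summary(fit)$r.squared      # same value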


MIT18_05S14_class25-slds-a 561
January 1, 2017 20 / 25
Overfitting a polynomial

Increasing the degree of the polynomial increases R 2

Increasing the degree of the polynomial increases the complexity


of the model.

The optimal degree is a tradeoff between goodness of fit and


complexity.
If all data points lie on the fitted curve, then y = ŷ and R 2 = 1.

R demonstration!

MIT18_05S14_class25-slds-a 562
January 1, 2017 21 / 25
Outliers and other troubles

Question: Can one point change the regression line significantly?

Use mathlet
http://mathlets.org/mathlets/linear-regression/

MIT18_05S14_class25-slds-a 563
January 1, 2017 22 / 25
Regression to the mean
Suppose a group of children is given an IQ test at age 4.
One year later the same children are given another IQ test.
Children’s IQ scores at age 4 and age 5 should be positively
correlated.
Those who did poorly on the first test (e.g., bottom 10%) will
tend to show improvement (i.e. regress to the mean) on the
second test.
A completely useless intervention with the poor-performing
children might be misinterpreted as causing an increase in their
scores.
Conversely, a reward for the top-performing children might be
misinterpreted as causing a decrease in their scores.

This example is from Rice, Mathematical Statistics and Data Analysis.
MIT18_05S14_class25-slds-a 564
January 1, 2017 23 / 25
A brief discussion of multiple linear regression

Multivariate data: (xi,1 , xi,2 , . . . , xi,m , yi ) (n data points:


i = 1, . . . , n)

Model ŷi = a1 xi,1 + a2 xi,2 + . . . + am xi,m

xi,j are the explanatory (or predictor) variables.

yi is the response variable.

The total squared error is
  Σᵢ₌₁ⁿ (yi − ŷi)² = Σᵢ₌₁ⁿ (yi − a1·xi,1 − a2·xi,2 − . . . − am·xi,m)²
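A minimal sketch of multiple linear regression in R (not on the slides); the variable names and coefficients are made up purely for illustration:

set.seed(2)
n = 50
x1 = rnorm(n); x2 = rnorm(n)
y = 2 * x1 - 3 * x2 + rnorm(n)   # simulated data: true coefficients 2 and -3 plus noise
fit = lm(y ~ x1 + x2)            # add '- 1' to drop the intercept, matching the model above
coef(fit)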

MIT18_05S14_class25-slds-a 565
January 1, 2017 24 / 25
MIT OpenCourseWare
https://ocw.mit.edu

18.05 Introduction to Probability and Statistics


Spring 2014

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.

MIT18_05S14_class25-slds-a 566
