Statistics
[Figure: frequency histogram (left, counts 1-4) and density histogram (right, heights 0.1-0.4) of the same data, bin edges at .5, 1.5, 2.5, 3.5, 4.5.]
Concept question
Add the numbers of the valid cdf’s and click that number.
answer: Test 2 and Test 3.
Solution
Test 1 is not a cdf: it takes negative values, but probabilities are nonnegative.
Exponential Random Variables
[Figure: two exponential pdfs plotted for x from 0 to 16.]
http://mathlets.org/mathlets/probability-distributions/
Table questions
Normal probabilities
within 1σ ≈ 68%, within 2σ ≈ 95%, within 3σ ≈ 99%
[Figure: standard normal pdf with bands at ±σ, ±2σ, ±3σ marked 68%, 95%, 99%.]
Rules of thumb:
P(−1 ≤ Z ≤ 1) ≈ .68,
P(−2 ≤ Z ≤ 2) ≈ .95,
P(−3 ≤ Z ≤ 3) ≈ .997
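These rules of thumb are easy to check numerically in R (a quick sketch using the built-in normal cdf pnorm):

pnorm(1) - pnorm(-1)   # ~0.6827
pnorm(2) - pnorm(-2)   # ~0.9545
pnorm(3) - pnorm(-3)   # ~0.9973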
Download R script
Histograms
[Figure: frequency histogram (left, counts 1-4) and density histogram (right, heights 0.1-0.4) of the same data, bin edges at .5, 1.5, 2.5, 3.5, 4.5.]
Histograms of averages of exp(1) random variables.
Welcome to 18.05
Spring 2014
R
Platonic Dice
Probability vs. Statistics
Different subjects: both about random processes
Probability
Logically self-contained
A few rules for computing probabilities
One correct answer
Statistics
Messier and more of an art
Get experimental data and try to draw probabilistic
conclusions
No single correct answer
Counting: Motivating Examples
Poker Hands
Deck of 52 cards
13 ranks: 2, 3, . . . , 9, 10, J, Q, K, A
4 suits: ♥, ♠, ♦, ♣
Poker hands
Consists of 5 cards.
A one-pair hand consists of two cards having one rank and the remaining three cards having three other ranks.
Example: {2♥, 2♠, 5♥, 8♣, K♦}
The probability of a one-pair hand is:
(1) less than 5%
(2) between 5% and 10%
(3) between 10% and 20%
(4) between 20% and 40%
(5) greater than 40%
Sets in Words
Old New England rule: don’t eat clams (or any shellfish) in months
without an ’r’ in their name.
S = all months
S = {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec}
Visualize Set Operations with Venn Diagrams
[Venn diagram: sets L and R inside the sample space S.]
Product of Sets
S × T = {(s, t) | s ∈ S, t ∈ T}
Inclusion-Exclusion Principle
[Venn diagram: overlapping sets A and B with intersection A ∩ B.]
Board Question
Rule of Product
Concept Question: DNA
DNA is made of sequences of nucleotides: A, C, G, T.
answer: (iii) 4 × 4 × 4 = 64
answer: (ii) 4 × 3 × 2 = 24
Board Question 1
answer: 5 × 4 × 3.
There are 5 ways to pick the winner. Once the winner is chosen there are
4 ways to pick second place and then 3 ways to pick third place.
Board Question 2
Shirts: 3B, 3R, 2G; sweaters 1B, 2R, 1G; pants 2D,2B.
Solution
answer: Suppose we choose shirts first. Depending on whether we choose
red compatible or green compatible shirts there are different numbers of
sweaters we can choose next. So we split the problem up before using the
rule of product. A multiplication tree is an easy way to present the answer.
Shirts:   R (3)      B (3)        G (2)
Sweaters: R,B (3)    R,B,G (4)    B,G (2)
Pants:    B,D (4)    B,D (4)      B,D (4)
Permutations
Permutations of k from a set of n
Combinations
Combinations of k from a set of n
Permutations and Combinations
(4 choose 3) = 4C3 = 4P3 / 3!
Board Question
(b) There are 2^10 possible outcomes from 10 flips (this is the rule of product). For a fair coin each outcome is equally probable, so the probability of exactly 3 heads is
(10 choose 3) / 2^10 = 120/1024 ≈ 0.117
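The arithmetic can be checked in one line of R:

choose(10, 3) / 2^10   # 120/1024 = 0.1171875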
Probability: Terminology and Examples
18.05 Spring 2014
Question
(a) How many different 5 card hands have exactly one pair?
Hint: practice with how many 2 card hands have exactly one pair.
Hint for hint: use the rule of product.
(b) What is the probability of getting a one pair poker hand?
Answer to board question
We can do this two ways as combinations or permutations. The keys are:
1. be consistent
2. break the problem into a sequence of actions and use the rule of
product.
Note, there are many ways to organize this. We will break it into very small actions.
Combinations approach
a) Count the number of one-pair hands, where the order they are dealt
doesn’t matter.
Action 1. Choose the rank of the pair: 13 different ranks, choosing 1, so (13 choose 1) ways to do this.
Action 2. Choose 2 cards from this rank: 4 cards in a rank, choosing 2, so (4 choose 2) ways to do this.
Action 3. Choose the 3 cards of different ranks: 12 remaining ranks, so (12 choose 3) ways to do this.
(Continued on next slide.)
Combination solution continued
Action 4. Choose 1 card from each of these ranks: 4 cards in each rank, so (4 choose 1)^3 = 4^3 ways to do this.
answer: (Using the rule of product.)
(13 choose 1) · (4 choose 2) · (12 choose 3) · 4^3 = 1,098,240
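As a quick check in R:

choose(13, 1) * choose(4, 2) * choose(12, 3) * 4^3                   # 1098240 one-pair hands
choose(13, 1) * choose(4, 2) * choose(12, 3) * 4^3 / choose(52, 5)   # ~0.4226, the one-pair probability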
Permutation approach
This approach is a little trickier. We include it to show that there is more than one way to solve the problem.
a) Count the number of one-pair hands, where we keep track of the order in which the cards are dealt.
Action 1. (This one is tricky.) Choose the positions in the hand that will hold the pair.
b) There are
52P5 = 52 · 51 · 50 · 49 · 48 = 311,875,200
ordered hands in all, so the probability is 131788800/311875200 = 0.42257.
Clicker Test
No = 0
Yes = 1
Probability Cast
Introduced so far
Experiment: a repeatable procedure
Sample space: set of all possible outcomes S (or Ω).
Event: a subset of the sample space.
Probability function, P(ω): gives the probability for
each outcome ω ∈ S
1. Probability is between 0 and 1
2. Total probability of all possible outcomes is 1.
Example (from the reading)
Use tables:
Outcomes H T
Probability
1/2 1/2
Discrete sample space
Discrete = listable
Examples:
{a, b, c, d} (finite)
{0, 1, 2, . . . } (infinite)
Events
Event: a subset of the sample space.
CQ: Events, sets and words
C = {HTH, THH}
The event “exactly two heads” determines a unique subset, containing all outcomes with exactly two heads.
Probability rules in mathematical notation
Probability and set operations on events
Events A, L, R
Rule 1. Complements: P(Ac ) = 1 − P(A).
Rule 2. Disjoint events: If L and R are disjoint then
P(L ∪ R) = P(L) + P(R).
Rule 3. Inclusion-exclusion principle: For any L and R:
P(L ∪ R) = P(L) + P(R) − P(L ∩ R).
[Venn diagrams: Ac as the complement of A; disjoint L and R; overlapping L and R.]
Table Question
Experiment:
1. Your table should make 9 rolls of a 20-sided die (one
each if the table is full).
2. Check if all rolls at your table are distinct.
4. Pair up with another group. Have one group compare red vs. green and the other compare green vs. red. Based on the three comparisons, rank the dice from best to worst.
Computations for solution
           White 2   White 5   Green 1   Green 4
Red 3       15/36     15/36     5/36      25/36
Red 6        3/36      3/36     1/36       5/36
Green 1      3/36      3/36
Green 4     15/36     15/36
The three comparisons are:
P(red beats white) = 21/36 = 7/12
P(white beats green) = 21/36 = 7/12
P(green beats red) = 25/36
Thus: red is better than white is better than green is better than red.
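The table is consistent with the dice Red = (3,3,3,3,3,6), White = (2,2,2,5,5,5), Green = (1,4,4,4,4,4); assuming those faces (an inference from the table, not stated on the slide), a short R enumeration reproduces the three comparisons:

red   <- c(3,3,3,3,3,6)
white <- c(2,2,2,5,5,5)
green <- c(1,4,4,4,4,4)
beats <- function(a, b) mean(outer(a, b, ">"))   # P(die a beats die b)
beats(red, white)    # 21/36 = 7/12
beats(white, green)  # 21/36 = 7/12
beats(green, red)    # 25/36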
Board Question
Lucky Larry has a coin that you’re quite sure is not fair.
(If you don’t see the symbolic algebra, try p = 0.2 and p = 0.5.)
Solution
answer: 1. Same (same is more likely than different).
The key bit of arithmetic is that if a ≠ b then a² + b² > 2ab. With a = p and b = 1 − p this gives P(same) = p² + (1 − p)² > 2p(1 − p) = P(different) whenever p ≠ 1/2.
Conditional Probability, Independence, Bayes’ Theorem
18.05 Spring 2014
Sample space for six rolls of a die: sequences of length 6, e.g. (1, 2, 1, 3, 1, 5) ∈ Ω.
[Venn diagram: sets A and B with intersection A ∩ B.]
Table/Concept Question
(Work with your tablemates, then everyone click in the answer.)
Toss a coin 4 times. Let
A = ‘at least three heads’
B = ‘first toss is tails’.
1. What is P(A|B)?
(a) 1/16 (b) 1/8 (c) 1/4 (d) 1/5
2. What is P(B|A)?
(a) 1/16 (b) 1/8 (c) 1/4 (d) 1/5
answer: 1. (b) 1/8. 2. (d) 1/5.
Counting we find |A| = 5, |B| = 8 and |A ∩ B| = 1. Since all sequences are equally likely,
P(A|B) = P(A ∩ B)/P(B) = |A ∩ B|/|B| = 1/8.   P(B|A) = |B ∩ A|/|A| = 1/5.
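A brute-force check by enumerating all 16 equally likely sequences in R:

tosses <- expand.grid(rep(list(c("H","T")), 4))   # all 2^4 outcomes
A <- rowSums(tosses == "H") >= 3                  # at least three heads
B <- tosses[, 1] == "T"                           # first toss is tails
sum(A & B) / sum(B)   # P(A|B) = 1/8
sum(A & B) / sum(A)   # P(B|A) = 1/5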
Table Question
Discussion: Most people say that it is more likely that Steve is a librarian
than a farmer. Almost all people fail to consider that for every male
librarian in the United States, there are more than fifty male farmers.
When this is explained, most people who chose librarian switch their
solution to farmer.
This illustrates how people often substitute representativeness for
likelihood. The fact that a librarian may be likely to have the above
personality traits does not mean that someone with these traits is likely to
be a librarian.
Multiplication Rule, Law of Total Probability
Multiplication rule: P(A ∩ B) = P(A|B) · P(B).
[Venn diagram: Ω partitioned into B1, B2, B3, with A ∩ B1, A ∩ B2, A ∩ B3 shaded.]
Trees
Organize computations
Compute total probability
Compute Bayes’ formula
Example. Game: 5 red and 2 green balls in an urn. A random ball
is selected and replaced by a ball of the other color; then a second
ball is drawn.
1. What is the probability the second ball is red?
2. What is the probability the first ball was red given the second ball
was red?
First draw:  R1 (5/7)             G1 (2/7)
Second draw: R2 (4/7), G2 (3/7)   R2 (6/7), G2 (1/7)
Solution
1. P(R2) = (5/7)(4/7) + (2/7)(6/7) = 32/49.
2. P(R1 | R2) = P(R1 ∩ R2)/P(R2) = (20/49)/(32/49) = 20/32 = 5/8.
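The same tree computation written out in R as a check:

pR2 <- (5/7)*(4/7) + (2/7)*(6/7)   # total probability the second ball is red: 32/49
((5/7)*(4/7)) / pR2                # Bayes: P(R1 | R2) = 5/8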
Concept Question: Trees 1
[Tree diagram: root splits to A1, A2 (edge probabilities x, y); each splits to B1, B2 (edge probability z marked); each of those splits to C1, C2.]
(a) P(A1 )
(b) P(A1 |B2 )
(c) P(B2 |A1 )
(d) P(C1 |B2 ∩ A1 ).
answer: (a) P(A1 ).
Concept Question: Trees 2
[Same tree diagram as in Trees 1.]
(a) P(B2 )
(b) P(A1 |B2 )
(c) P(B2 |A1 )
(d) P(C1 |B2 ∩ A1 ).
answer: (c) P(B2 |A1 ).
Concept Question: Trees 3
[Same tree diagram as in Trees 1.]
(a) P(C1 )
(b) P(B2 |C1 )
(c) P(C1 |B2 )
(d) P(C1 |B2 ∩ A1 ).
answer: (d) P(C1 |B2 ∩ A1 ).
Concept Question: Trees 4
[Same tree diagram as in Trees 1.]
(a) C1
(b) B2 ∩ C1
(c) A1 ∩ B2 ∩ C1
(d) C1 |B2 ∩ A1 .
answer: (c) A1 ∩ B2 ∩ C1 .
Let’s Make a Deal with Monty Hall
One door hides a car, two hide goats.
The contestant chooses any door.
Monty always opens a different door with a goat. (He
can do this because he knows where the car is.)
The contestant is then allowed to switch doors if she
wants.
What is the best strategy for winning a car?
(a) Switch (b) Don’t switch (c) It doesn’t matter
Board question: Monty Hall
Organize the Monty Hall problem into a tree and compute
the probability of winning if you always switch.
Hint first break the game into a sequence of actions.
answer: Switch. P(C |switch) = 2/3
It’s easiest to show this with a tree representing the switching strategy:
first the contestant chooses a door (then Monty shows a goat), then the contestant switches.

Chooses:  C (1/3)      G (2/3)
Switches: C 0, G 1     C 1, G 0

The (total) probability of C is P(C | switch) = (1/3) · 0 + (2/3) · 1 = 2/3.
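A quick simulation of the switching strategy in R (a sketch, 10,000 plays):

set.seed(1)
car  <- sample(1:3, 10000, replace = TRUE)   # door hiding the car
pick <- sample(1:3, 10000, replace = TRUE)   # contestant's first choice
mean(pick != car)   # switching wins exactly when the first pick was a goat: ~2/3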
Independence
Events A and B are independent if the probability that
one occurred is not affected by knowledge that the other
occurred.
Formally: A and B are independent ⇔ P(A ∩ B) = P(A)P(B)
Table/Concept Question: Independence
(Work with your tablemates, then everyone click in the answer.)
Notice that knowing B, removes 6 as a possibility for the first die and
makes A more probable. So, knowing B occurred changes the probability
of A.
But, knowing C does not change the probabilities for the possible values of the first roll; they are still 1/6 for each value. In particular, knowing C occurred does not change the probability of A.
P(A|B) = P(B|A) · P(A) / P(B)
Often compute the denominator P(B) using the law of
total probability.
Board Question: Evil Squirrels
Evil Squirrels Continued
P(evil | alarm) = P(alarm | evil)P(evil) / P(alarm)
= P(alarm | evil)P(evil) / [P(alarm | evil)P(evil) + P(alarm | nice)P(nice)]
= (0.99)(0.0001) / [(0.99)(0.0001) + (0.01)(0.9999)]
≈ 0.01
Squirrels continued
Summary:
Probability a random test is correct = 0.99
answer: (a) This is the same solution as in the slides above, but in a more
compact notation. Let E be the event that a squirrel is evil. Let A be the
event that the alarm goes off. By Bayes’ Theorem, we have:
P(E | A) = P(A | E)P(E) / [P(A | E)P(E) + P(A | Ec)P(Ec)]
= (.99 · 100/1000000) / (.99 · 100/1000000 + .01 · 999900/1000000)
≈ .01.
(b) No. The alarm would be more trouble than it’s worth, since for every true positive there are about 99 false positives.
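The posterior computed directly in R:

p_evil <- 100/1000000               # prior probability a squirrel is evil
num <- 0.99 * p_evil                # P(alarm | evil) P(evil)
num / (num + 0.01 * (1 - p_evil))   # ~0.0098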
Washington Post, hot off the press
http://www.washingtonpost.com/national/health-science/
annual-physical-exam-is-probably-unnecessary-if-youre-general
2013/02/08/2c1e326a-5f2b-11e2-a389-ee565c81c565_story.html
Table Question: Dice Game
1 The Randomizer holds the 6-sided die in one fist and
the 8-sided die in the other.
3 The Roller rolls the die in secret and reports the result
to the table.
P(6-sided | roll 4) = P(roll 4 | 6-sided)P(6-sided) / P(roll 4)
= (1/6)(1/2) / [(1/6)(1/2) + (1/8)(1/2)] = 4/7.
Discrete Random Variables; Expectation
18.05 Spring 2014
https://en.wikipedia.org/wiki/Bean_machine#/media/File:Quincunx_(Galton_Box)_-_Galton_1889_diagram.png
http://www.youtube.com/watch?v=9xUBhhM4vbM
Reading Review
Random variable X assigns a number to each outcome:
X :Ω→R
Properties of F (a):
1. Nondecreasing
2. Way to the left, i.e. as a → −∞, F is 0.
3. Way to the right, i.e. as a → ∞, F is 1.
CDF and PMF
[Figure: cdf F(a), a step function with jumps at a = 1, 3, 5, 7 up to heights 0.5, 0.75, 0.9, 1; below it the pmf p(a) with spikes 0.5, 0.25, 0.15, 0.1 at a = 1, 3, 5, 7.]
Concept Question: cdf and pmf
X a random variable.
values of X:  1    3    5    7
cdf F(a):     0.5  0.75 0.9  1
1. What is P(X ≤ 3)?
(a) 0.15 (b) 0.25 (c) 0.5 (d) 0.75
2. What is P(X = 3)?
(a) 0.15 (b) 0.25 (c) 0.5 (d) 0.75
1. answer: (d) 0.75. P(X ≤ 3) = F (3) = 0.75.
2. answer: (b) P(X = 3) = F (3) − F (1) = 0.75 − 0.5 = 0.25.
Deluge of discrete distributions
Bernoulli(p) = 1 (success) with probability p, 0 (failure) with probability 1 − p.
Solution
The same reasoning works for general n.
Dice simulation: geometric(1/4)
Fiction
Fact
The data show that a player who has made 5 shots in a row is no more likely than usual to make the next shot.
(Currently, there seems to be some disagreement about
this.)
Gambler’s fallacy
“On August 18, 1913, at the casino in Monte Carlo, black came up a
record twenty-six times in succession [in roulette]. [There] was a
near-panicky rush to bet on red, beginning about the time black had
come up a phenomenal fifteen times. In application of the maturity
[of the chances] doctrine, players doubled and tripled their stakes, this
doctrine leading them to believe after black came up the twentieth
time that there was not a chance in a million of another repeat. In the
end the unusual run enriched the Casino by some millions of francs.”
Hot hand fallacy
An NBA player who made his last few shots is more likely
than his usual shooting percentage to make the next one?
See The Hot Hand in Basketball: On the Misperception of Random
Sequences by Gilovich, Vallone and Tversky. (A link that worked when
these slides were written is
http://www.cs.colorado.edu/~mozer/Teaching/syllabi/7782/
readings/gilovich%20vallone%20tversky.pdf)
(There seems to be some controversy about this. Some statisticians feel
that the authors of the above paper erred in their analysis of the data and
the data do support the notion of a hot hand in basketball.)
Amnesia
P(X = n + k | X ≥ n) = P(X = k)
Proof that geometric(p) is memoryless
One method is to look at the tree for this distribution. Here we’ll just use the formula for conditional probability. With A = ‘X = n + k’ and B = ‘X ≥ n’:
P(A|B) = P(A ∩ B)/P(B) = p^(n+k)(1 − p) / p^n = p^k(1 − p) = P(X = k).
Expected Value
X is a random variable that takes values x1, x2, . . . , xn.
The expected value of X is defined by
E(X) = p(x1)x1 + p(x2)x2 + . . . + p(xn)xn = Σ_{i=1}^{n} p(xi) xi
It is a weighted average.
It is a measure of central tendency.
Properties of E (X )
E (X + Y ) = E (X ) + E (Y ) (linearity I)
E(aX + b) = aE(X) + b (linearity II)
E(h(X)) = Σ_i h(xi) p(xi)
Meaning of expected value
What is the expected average of one roll of a die?
answer: Suppose we roll it 5 times and get (3, 1, 6, 1, 2). To find the
average we add up these numbers and divide by 5: ave = 2.6. With so few
rolls we don’t expect this to be representative of what would usually
happen. So let’s think about what we’d expect from a large number of
rolls. To be specific, let’s (pretend to) roll the die 600 times.
We expect that each number will come up roughly 1/6 of the time. Let’s
suppose this is exactly what happens and compute the average.
value:            1    2    3    4    5    6
expected counts: 100  100  100  100  100  100
The average of these 600 values (100 ones, 100 twos, etc.) is then
average = (100 · 1 + 100 · 2 + 100 · 3 + 100 · 4 + 100 · 5 + 100 · 6)/600
        = (1/6)·1 + (1/6)·2 + (1/6)·3 + (1/6)·4 + (1/6)·5 + (1/6)·6 = 3.5.
This is the ‘expected average’. We will call it the expected value.
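The weighted average and a simulation agree in R:

sum((1:6) / 6)                           # exact expected value: 3.5
mean(sample(1:6, 600, replace = TRUE))   # average of 600 simulated rolls, close to 3.5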
Examples
Example 1. Find E (X )
1. X: 3 4 5 6
2. pmf: 1/4 1/2 1/8 1/8
3. E (X ) = 3/4 + 4/2 + 5/8 + 6/8 = 33/8
Class example
We looked at the random variable X with the following table top 2 lines.
1. X : -2 -1 0 1 2
2. pmf: 1/5 1/5 1/5 1/5 1/5
3. E (X ) = -2/5 - 1/5 + 0/5 + 1/5 + 2/5 = 0
4. X 2: 4 1 0 1 4
5. E (X 2 ) = 4/5 + 1/5 + 0/5 + 1/5 + 4/5 = 2
Class example continued
Notice that in the table on the previous slide, some values for X 2 are
repeated. For example the value 4 appears twice. Summing all the
probabilities where X 2 = 4 gives P(X 2 = 4) = 2/5. Here’s the full table
for X 2
1. X 2: 4 1 0
2. pmf: 2/5 2/5 1/5
3. E (X 2 ) = 8/5 + 2/5 + 0/5 = 2
Board Question: Interpreting Expectation
Discussion
Framing bias / cost versus loss. The two situations are identical, with an
expected value of gaining $5. In a study, 132 undergrads were given these
questions (in different orders) separated by a short filler problem. 55 gave
different preferences to the two events. Of these, 42 rejected (a) but
accepted (b). One interpretation is that we are far more willing to pay a
cost up front than risk a loss. (See Judgment under uncertainty: heuristics
and biases by Tversky and Kahneman.)
Loss aversion and cost versus loss sustain the insurance industry: people
pay more in premiums than they get back in claims on average (otherwise
the industry wouldn’t be sustainable), but they buy insurance anyway to
protect themselves against substantial losses. Think of it as paying $1
each year to protect yourself against a 1 in 1000 chance of losing $100
that year. By buying insurance, the expected value of the change in your
assets in one year (ignoring other income and spending) goes from
negative 10 cents to negative 1 dollar. But whereas without insurance you
might lose $100, with insurance you always lose exactly $1.
Board Question
Suppose (hypothetically!) that everyone at your table got up, ran
around the room, and sat back down randomly (i.e., all seating
arrangements are equally likely).
What is the expected value of the number of people sitting in their
original seat?
(We will explore this with simulations in Friday Studio.)
Neat fact: A permutation in which nobody returns to their original seat is called a derangement. The number of derangements turns out to be the nearest integer to n!/e. Since there are n! total permutations, we have:
P(everyone in a different seat) ≈ (n!/e)/n! = 1/e ≈ 0.3679.
It’s surprising that the probability is about 37% regardless of n, and that it
converges to 1/e as n goes to infinity.
http://en.wikipedia.org/wiki/Derangement
Solution
Number the people from 1 to n. Let Xi be the Bernoulli random variable
with value 1 if person i returns to their original seat and value 0 otherwise.
Since person i is equally likely to sit back down in any of the n seats, the
probability that person i returns to their original seat is 1/n. Therefore
Xi ∼ Bernoulli(1/n) and E (Xi ) = 1/n. Let X be the number of people
sitting in their original seat following the rearrangement. Then
X = X1 + X2 + · · · + Xn .
By linearity of expected values, we have
E(X) = Σ_{i=1}^{n} E(Xi) = Σ_{i=1}^{n} 1/n = 1.
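A short simulation of the reseating in R (a sketch with n = 9, one full table):

n <- 9
fixed <- replicate(10000, sum(sample(n) == 1:n))   # people back in their own seat
mean(fixed)        # ~1, matching E(X) = 1
mean(fixed == 0)   # ~1/e ~ 0.37, the derangement probability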
Variance; Continuous Random Variables
18.05 Spring 2014
Computation as sum:
Var(X) = Σ_{i=1}^{n} p(xi)(xi − µ)².
[Figure: three pmfs (A), (B), (C), each on the values 1-5.]
answer: 5. CAB
All 3 variables have the same range from 1-5 and all of them are
symmetric so their mean is right in the middle at 3. (C) has most of
its weight at the extremes, so it has the biggest spread. (B) has the
most weight in the middle so it has the smallest spread.
From biggest to smallest standard deviation we have (C), (A), (B).
Computation from tables
Compute the variance and standard deviation of X.
values x:  1     2     3     4     5
pmf p(x):  1/10  2/10  4/10  2/10  1/10
Computation from tables
The mean is µ = 3, so
Var(X) = (1/10)·4 + (2/10)·1 + (4/10)·0 + (2/10)·1 + (1/10)·4 = 1.2
The standard deviation is then σ = √1.2 ≈ 1.1.
Concept question
Which pmf has the bigger standard deviation? (Assume w
and y have the same units.)
1. Y 2. W
[Figure: pmf for Y on the values −3, 0, 3 and pmf for W on the values 10, 20, 30, 40, 50; spike heights among 0.1, 0.2, 0.4, 1/2.]
Concept question
1. True 2. False
answer: True. If X can take more than one value with positive probability, then Var(X) will be a sum of positive terms. So X is constant if and only if Var(X) = 0.
Algebra with variances
Board questions
Solution
X:         0       1
p(x):      1 − p   p
(X − µ)²:  p²      (1 − p)²
Var(X) = E((X − µ)²) = (1 − p)p² + p(1 − p)² = p(1 − p)
Var(X1 + . . . + Xn) = 4n.
Var(X̄) = Var((X1 + · · · + Xn)/n) = (1/n²) Var(X1 + . . . + Xn) = 4/n.
This implies σ_X̄ = 2/√n.
Note: this says that the average of n independent measurements varies
less than the individual measurements.
Continuous random variables
Units for the pdf are probability / (unit of x).
Cumulative distribution function (cdf):
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt.
Visualization
[Figure (pdf and probability): pdf f(x) with the area between c and d shaded, representing P(c ≤ X ≤ d).]
[Figure (pdf and cdf): pdf f(x) with the area to the left of x shaded, representing F(x) = P(X ≤ x).]
Properties of the cdf
Solution
∫₀² f(x) dx = ∫₀² cx² dx = (8/3)c = 1 ⇒ c = 3/8.
P(1 ≤ X ≤ 2) = F(2) − F(1) = 1 − 1/8 = 7/8.
Continued on next slide
Solution continued
2a. F(b) = 1 ⇒ b²/9 = 1 ⇒ b = 3.
2b. f(y) = F′(y) = 2y/9.
Concept questions
Add the numbers of the valid cdf’s and click that number.
answer: Test 2 and Test 3.
Solution
Test 1 is not a cdf: it takes negative values, but probabilities are nonnegative.
Exponential Random Variables
[Figure: two exponential pdfs plotted for x from 0 to 16.]
Board question
I’ve noticed that taxis drive past 77 Mass. Ave. on the average of
once every 10 minutes.
Suppose time spent waiting for a taxi is modeled by an exponential
random variable
X ∼ Exponential(1/10);   f(x) = (1/10) e^(−x/10)
(a) Sketch the pdf of this distribution
(b) Shade the region which represents the probability of waiting
between 3 and 7 minutes
(c) Compute the probability of waiting between 3 and 7 minutes for a taxi.
(d) Compute and sketch the cdf.
Solution
[Figure: the pdf with the region 3 ≤ x ≤ 7 shaded (left); the cdf F(x) = 1 − e^(−x/10) (right).]
(c) P(3 ≤ X ≤ 7) = ∫₃⁷ (1/10) e^(−x/10) dx = e^(−3/10) − e^(−7/10) ≈ 0.244.
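The same number from R's built-in exponential cdf:

pexp(7, rate = 1/10) - pexp(3, rate = 1/10)   # ~0.2442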
Continuous Expectation and Variance,
Properties
1. E (X + Y ) = E (X ) + E (Y ).
2. E (aX + b) = aE (X ) + b.
3. If X and Y are independent then
Var(X + Y ) = Var(X ) + Var(Y ).
4. Var(aX + b) = a2 Var(X ).
5. Var(X ) = E (X 2 ) − E (X )2 .
Board question
(b) µ = ∫₀¹ x · 3x² dx = ∫₀¹ 3x³ dx = 3/4.
σ² = ∫₀¹ (x − 3/4)² 3x² dx = 3/5 − 9/8 + 9/16 = 3/80.
σ = √(3/80) ≈ 0.194.
(c) Set F(q₀.₅) = 0.5, solve for q₀.₅: F(x) = ∫₀ˣ 3u² du = x³. Therefore F(q₀.₅) = q₀.₅³ = 0.5. We get q₀.₅ = (0.5)^(1/3).
For the transformation Y = X⁴, this implies fX(x) dx = fX(y^(1/4)) dy/(4y^(3/4)) = (3y^(2/4)/(4y^(3/4))) dy = (3/(4y^(1/4))) dy.
Therefore fY(y) = 3/(4y^(1/4)).
Quantiles
Quantiles give a measure of location.
[Figure: standard normal pdf φ(z) with left-tail area 0.6 shaded up to q₀.₆ = 0.253; below it the cdf Φ(z) with F(q₀.₆) = 0.6 marked.]
q₀.₆: left tail area = 0.6 ⇔ F(q₀.₆) = 0.6
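In R the quantile function qnorm inverts the cdf:

qnorm(0.6)          # ~0.2533, the 0.6 quantile of N(0,1)
pnorm(qnorm(0.6))   # 0.6, as a check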
Concept question
Each of the curves is the density for a given random variable. The
median of the black plot is always at q. Which density has the
greatest median?
[Figure: several overlaid densities, with q marked on the horizontal axis.]
answer: See next frame.
Solution
(A) [Plot A: the curves coincide up to q; the area to the left of the median is 0.5.]
Plot A: 4. All three medians are the same. Remember that probability is computed as the area under the curve. By definition the median q is the point where the shaded area in Plot A is 0.5. Since all three curves coincide up to q, the shaded area in the figure represents a probability of 0.5 for all three densities.
(B)
Plot B: 2. The red density has the greatest median. Since q is the
median for the black density, the shaded area in Plot B is .5. Therefore
the area under the blue curve (up to q) is greater than .5 and that under
the red curve is less than .5. This means the median of the blue density is
to the left of q (you need less area) and the median of the red density is to
the right of q (you need more area).
Law of Large Numbers (LoLN)
Informally: An average of many measurements is more accurate
than a single measurement.
Formally: Let X1 , X2 , . . . be i.i.d. random variables all with mean
µ and standard deviation σ.
Let X̄n = (X1 + X2 + . . . + Xn)/n = (1/n) Σ_{i=1}^{n} Xi.
Then for any (small number) a > 0, we have
lim_{n→∞} P(|X̄n − µ| < a) = 1.
Density: area of bar is the fraction of all data points that lie in the bin.
[Figure: frequency histogram (left, counts 1-4) and density histogram (right, heights 0.2-0.8), bins centered at 0.25, 0.75, 1.25, 1.75, 2.25.]
1. Make both a frequency and density histogram from the data below.
Use bins of width 0.5 starting at 0. The bins should be right closed.
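A sketch of the R commands for such histograms (the data vector here is hypothetical, standing in for the data on the slide):

x <- c(0.3, 0.7, 0.8, 1.2, 1.4, 1.6, 1.9, 2.3)        # hypothetical data
breaks <- seq(0, 4, by = 0.5)                          # bins of width 0.5 starting at 0
hist(x, breaks = breaks, right = TRUE)                 # frequency histogram, right-closed bins
hist(x, breaks = breaks, right = TRUE, freq = FALSE)   # density histogram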
Solution
[Figure: frequency (left) and density (right) histograms of the data, first with equal-width bins, then with unequal-width bins.]
LoLN and histograms
LoLN implies density histogram converges to pdf:
[Figure: density histogram of many samples overlaid on the N(0,1) pdf, x from −4 to 4.]
[Figure: standard normal pdf with bands at ±σ, ±2σ, ±3σ marked 68%, 95%, 99%.]
1. P(−1 ≤ Z ≤ 1)
(a) 0.025 (b) 0.16 (c) 0.68 (d) 0.84 (e) 0.95
2. P(Z > 2)
(a) 0.025 (b) 0.16 (c) 0.68 (d) 0.84 (e) 0.95
answer: 1c, 2a
Central Limit Theorem
Setting: X1 , X2 , . . . i.i.d. with mean µ and standard dev. σ.
For each n:
X̄n = (X1 + X2 + . . . + Xn)/n   (average)
Sn = X1 + X2 + . . . + Xn   (sum)
Conclusion: For large n:
X̄n ≈ N(µ, σ²/n)
Sn ≈ N(nµ, nσ²)
Standardized Sn or X̄n ≈ N(0, 1):
(Sn − nµ)/(σ√n) = (X̄n − µ)/(σ/√n)
[Figure: standardized sampling distributions plotted against the N(0,1) curve, z from −3 to 3.]
CLT: pictures 2
The standardized average of n i.i.d. exponential random variables.
[Figure: densities for increasing n, approaching the N(0,1) curve.]
CLT: pictures 3
The standardized average of n i.i.d. Bernoulli(0.5) random variables.
[Figure: densities approaching the N(0,1) curve.]
CLT: pictures 4
The (non-standardized) average of n Bernoulli(0.5)
random variables, with n = 4, 12, 64. (Spikier.)
[Figure: three spiky pmf plots, increasingly concentrated around 0.5 as n grows.]
Table Question: Sampling from the standard normal
distribution
As a table, produce a single random sample from (an approximate) standard normal distribution.
3. What is the probability that less than 20% of those polled prefer
Ruthi?
answer: On next slide.
Solution
answer: 2. Let A be the fraction polled who support Ani. So A is the
average of 400 Bernoulli(0.5) random variables. That is, let Xi = 1 if the
ith person polled prefers Ani and 0 if not, so A = average of the Xi .
The question asks for the probability A > 0.55.
Each Xi has µ = 0.5 and σ² = 0.25. So, E(A) = 0.5 and σ_A² = 0.25/400, i.e. σ_A = 1/40 = 0.025.
3. Let R be the fraction polled who prefer Ruthi, the average of 400 Bernoulli(0.25) random variables. So E(R) = 0.25 and σ_R² = (0.25)(0.75)/400 ⇒ σ_R = √3/80.
So (R − 0.25)/(√3/80) ≈ Z. So,
P(R < 0.2) ≈ P(Z < −4/√3) ≈ 0.0105
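Checking the tail probability in R:

pnorm(-4 / sqrt(3))   # ~0.0105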
Bonus problem
Joint Distributions, Independence
X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
Joint pmf: p(xi, yj). Joint pdf: f(x, y). Joint cdf: F(x, y) = P(X ≤ x, Y ≤ y).
Discrete joint pmf: example 1
Discrete joint pmf: example 2
X\T 2 3 4 5 6 7 8 9 10 11 12
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36
Continuous joint distributions
X takes values in [a, b], Y takes values in [c, d]
(X , Y ) takes values in [a, b] × [c, d].
Joint probability density function (pdf) f (x, y )
f (x, y ) dx dy is the probability of being in the small square.
[Figure: rectangle [a, b] × [c, d] with a small dx × dy box at (x, y); Prob. = f(x, y) dx dy.]
Properties of the joint pmf and pdf
Discrete case: probability mass function (pmf)
1. 0 ≤ p(xi , yj ) ≤ 1
2. Total probability is 1.
Σ_{i=1}^{n} Σ_{j=1}^{m} p(xi, yj) = 1
Note: f(x, y) can be greater than 1: it is a density, not a probability.
Example: discrete events
Roll two dice: X = # on first die, Y = # on second die.
A = {(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 5), (3, 6), (4, 6)}.
P(A) = sum of probabilities in shaded cells = 10/36.
Example: continuous events
Suppose (X , Y ) takes values in [0, 1] × [0, 1].
answer:
[Figure: unit square with the region ‘X > Y’ (below the diagonal) shaded.]
The event takes up half the square. Since the density is uniform this is half the probability. That is, P(X > Y) = 0.5.
Cumulative distribution function
F(x, y) = P(X ≤ x, Y ≤ y) = ∫_c^y ∫_a^x f(u, v) du dv.
f(x, y) = ∂²F/∂x∂y (x, y).
Properties
1. F (x, y ) is non-decreasing. That is, as x or y increases F (x, y )
increases or remains constant.
2. F (x, y ) = 0 at the lower left of its range.
If the lower left is (−∞, −∞) then this means
lim F (x, y ) = 0.
(x,y )→(−∞,−∞)
Marginal pmf and pdf
Roll two dice: X = # on first die, T = total on both dice.
The marginal pmf of X is found by summing the rows. The marginal
pmf of T is found by summing the columns
X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(tj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1
X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
Solution
1. The joint pdf is f(x, y) = (3/2)(x² + y²) on [0, 1] × [0, 1]. Check that the total probability is 1:
∫₀¹ ∫₀¹ (3/2)(x² + y²) dx dy = ∫₀¹ [ (1/2)x³ + (3/2)xy² ]₀¹ dy = ∫₀¹ ( 1/2 + (3/2)y² ) dy = 1.
2. [Figure: unit square with the region A = {x > 0.3, y > 0.5} shaded.]
P(A) = ∫_{0.3}^{1} ∫_{0.5}^{1} (3/2)(x² + y²) dy dx = ∫_{0.3}^{1} [ (3/2)x²y + (1/2)y³ ]_{0.5}^{1} dx
(continued)
Solutions 2, 3, 4, 5
2. (continued) = ∫_{0.3}^{1} ( (3/4)x² + 7/16 ) dx = 0.5495
3. F(x, y) = ∫₀^y ∫₀^x (3/2)(u² + v²) du dv = (x³y + xy³)/2.
4. fX(x) = ∫₀¹ (3/2)(x² + y²) dy = [ (3/2)x²y + (1/2)y³ ]₀¹ = (3/2)x² + 1/2
P(X < .5) = ∫₀^{.5} fX(x) dx = ∫₀^{.5} ( (3/2)x² + 1/2 ) dx = [ (1/2)x³ + (1/2)x ]₀^{.5} = 5/16.
5. To find the marginal cdf FX(x) we simply take y to be the top of the y-range and evaluate F: FX(x) = F(x, 1) = (x³ + x)/2.
Therefore P(X < .5) = FX(.5) = (1/2)(1/8 + 1/2) = 5/16.
6. On next slide
Solution 6
X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
Independence
P(A ∩ B) = P(A)P(B).
F (x, y ) = FX (x)FY (y ).
f (x, y ) = fX (x)fY (y ).
Concept question: independence I
Roll two dice: X = value on first, Y = value on second
X\Y 1 2 3 4 5 6 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 1/6
2 1/36 1/36 1/36 1/36 1/36 1/36 1/6
3 1/36 1/36 1/36 1/36 1/36 1/36 1/6
4 1/36 1/36 1/36 1/36 1/36 1/36 1/6
5 1/36 1/36 1/36 1/36 1/36 1/36 1/6
6 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/6 1/6 1/6 1/6 1/6 1/6 1
X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1
(i) f(x, y) = 4x²y³
(ii) f(x, y) = (1/2)(x³y + xy³)
(iii) f(x, y) = 6e^(−3x−2y)
Covariance
Properties of covariance
Properties
1. Cov(aX + b, cY + d) = acCov(X , Y ) for constants a, b, c, d.
2. Cov(X1 + X2 , Y ) = Cov(X1 , Y ) + Cov(X2 , Y ).
3. Cov(X , X ) = Var(X )
4. Cov(X , Y ) = E (XY ) − µX µY .
5. If X and Y are independent then Cov(X , Y ) = 0.
6. Warning: The converse is not true, if covariance is 0 the variables
might not be independent.
Concept question
Y \X -1 0 1 p(yj )
0 0 1/2 0 1/2
1 1/4 0 1/4 1/2
p(xi ) 1/4 1/2 1/4 1
1. True 2. False
Compute Cov(X , Y ),
Solution
X = X1 + X2 + . . . + X7 and Y = X6 + X7 + . . . + X12 .
We know Var(Xi ) = 1/4. Therefore using Property 2 (linearity) of
covariance
Cov(X , Y ) = Cov(X1 + X2 + . . . + X7 , X6 + X7 + . . . + X12 )
= Cov(X1 , X6 ) + Cov(X1 , X7 ) + Cov(X1 , X8 ) + . . . + Cov(X7 , X12 )
Since the different tosses are independent we know
Cov(X1 , X6 ) = 0, Cov(X1 , X7 ) = 0, Cov(X1 , X8 ) = 0, etc.
Looking at the expression for Cov(X , Y ) there are only two non-zero terms
Cov(X, Y) = Cov(X6, X6) + Cov(X7, X7) = Var(X6) + Var(X7) = 1/2.
Correlation
Cor(X, Y) = ρ = Cov(X, Y) / (σX σY).
Properties:
1. ρ is the covariance of the standardized versions of X
and Y .
2. ρ is dimensionless (it’s a ratio).
3. −1 ≤ ρ ≤ 1. ρ = 1 if and only if Y = aX + b with
a > 0 and ρ = −1 if and only if Y = aX + b with a < 0.
Real-life correlations
Real-life correlations discussion
Ice cream does not cause drownings. Both are correlated with
summer weather.
Correlation is not causation
Overlapping sums of uniform random variables
For example:
X = X1 + X2 + X3 + X4 + X5
Y = X3 + X4 + X5 + X6 + X7
Scatter plots
[Figure: four scatter plots of (x, y) samples from overlapping sums of uniform random variables; visible panel titles: (1, 0) cor=0.00, sample_cor=−0.07 and (2, 1) cor=0.50, sample_cor=0.48.]
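A sketch of how such plots can be generated in R; here (m, k) is read as ‘X and Y are each sums of m i.i.d. uniforms, sharing k of them’ (my reading of the panel labels, and overlap_plot is my own name):

overlap_plot <- function(m, k, n = 1000) {
  u <- matrix(runif(n * (2*m - k)), nrow = n)              # i.i.d. uniforms, one row per sample
  x <- rowSums(u[, 1:m, drop = FALSE])                     # X = sum of the first m
  y <- rowSums(u[, (m - k + 1):(2*m - k), drop = FALSE])   # Y shares k terms with X
  plot(x, y, main = paste0("(", m, ", ", k, ") sample_cor=", round(cor(x, y), 2)))
}
overlap_plot(2, 1)   # theoretical cor = k/m = 0.50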
Concept question
Board question
Toss a fair coin 2n + 1 times. Let X be the number of
heads on the first n + 1 tosses and Y the number on the
last n + 1 tosses.
Compute Cov(X , Y ) and Cor(X , Y ).
As usual let Xi = the number of heads on the i-th flip, i.e. 0 or 1. Then
X = Σ_{i=1}^{n+1} Xi,   Y = Σ_{i=n+1}^{2n+1} Xi.
Solution continued
Now,
Cov(X, Y) = Cov( Σ_{i=1}^{n+1} Xi , Σ_{j=n+1}^{2n+1} Xj ) = Σ_{i=1}^{n+1} Σ_{j=n+1}^{2n+1} Cov(Xi, Xj).
Because the Xi are independent, the only non-zero term in the above sum is Cov(X_{n+1}, X_{n+1}) = Var(X_{n+1}) = 1/4. Therefore,
Cov(X, Y) = 1/4.
We get the correlation by dividing by the standard deviations.
Cor(X, Y) = Cov(X, Y)/(σX σY) = (1/4)/((n + 1)/4) = 1/(n + 1).
This makes sense: as n increases the correlation should decrease, since the contribution of the one flip they have in common becomes less important.
Review for Exam 1
18.05 Spring 2014
1. Sets.
2. Counting.
3. Sample space, outcome, event, probability function.
4. Probability: conditional probability, independence, Bayes’ theorem.
5. Discrete random variables: events, pmf, cdf.
6. Bernoulli(p), binomial(n, p), geometric(p), uniform(n)
7. E (X ), Var(X ), σ
8. Continuous random variables: pdf, cdf.
9. uniform(a,b), exponential(λ), normal(µ,σ 2 )
10. Transforming random variables.
11. Quantiles.
12. Central limit theorem, law of large numbers, histograms.
13. Joint distributions: pmf, pdf, cdf, covariance and correlation.
Sets and counting
Sets: . . . , products
Counting: permutations nPk, combinations nCk = (n choose k)
Probability
Random variables, expectation and variance
Joint distributions
Hospitals (binomial, CLT, etc)
A certain town is served by two hospitals.
Larger hospital: about 45 babies born each day.
Smaller hospital about 15 babies born each day.
For a period of 1 year, each hospital recorded the days on which
more than 60% of the babies born were boys.
(a) Which hospital do you think recorded more such days?
(i) The larger hospital. (ii) The smaller hospital.
(iii) About the same (that is, within 5% of each other).
(b) Assume exactly 45 and 15 babies are born at the hospitals each
day. Let Li (resp., Si ) be the Bernoulli random variable which takes
the value 1 if more than 60% of the babies born in the larger (resp.,
smaller) hospital on the i th day were boys. Determine the distribution
of Li and of Si .
Continued on next slide
Hospital continued
(c) Let L (resp., S) be the number of days on which more than 60%
of the babies born in the larger (resp., smaller) hospital were boys.
What type of distribution do L and S have? Compute the expected
value and variance in each case.
(d) Via the CLT, approximate the 0.84 quantile of L (resp., S).
Would you like to revise your answer to part (a)?
Solution
answer: (a) When this question was asked in a study, the number of
undergraduates who chose each option was 21, 21, and 55, respectively.
This shows a lack of intuition for the relevance of sample size on deviation
from the true mean (i.e., variance).
(b) The random variable XL , giving the number of boys born in the larger
hospital on day i, is governed by a Bin(45, .5) distribution. So Li has a
Ber(pL ) distribution with
pL = P(XL > 27) = Σ_{k=28}^{45} (45 choose k) (0.5)^45 ≈ 0.068.
Similarly, the random variable XS , giving the number of boys born in the
smaller hospital on day i, is governed by a Bin(15, .5) distribution. So Si
has a Ber(pS ) distribution with
pS = P(XS > 9) = Σ_{k=10}^{15} (15 choose k) (0.5)^15 ≈ 0.151.
We see that pS is indeed greater than pL , consistent with (ii).
Solution continued
E (L) = 365pL ≈ 25
E (S) = 365pS ≈ 55
Var(L) = 365pL (1 − pL ) ≈ 23
Var(S) = 365pS (1 − pS ) ≈ 47
(d) By the CLT, the 0.84 quantile is approximately the mean + one sd in each case:
For L, q0.84 ≈ 25 + √23.
For S, q0.84 ≈ 55 + √47.
Thus
P(L > S) = Σ_{i=0}^{364} Σ_{j=i+1}^{365} p(i, j) ≈ 0.0000916,
where p(i, j) = P(S = i) P(L = j).
R code
pL = 1 - pbinom(.6*45,45,.5)
pS = 1 - pbinom(.6*15,15,.5)
print(pL)
print(pS)
pLGreaterS = 0
# accumulate P(L = i) * P(S = j) over all j < i; start i at 1 so the inner range is valid
for(i in 1:365)
{
  for(j in 0:(i-1))
  {
    pLGreaterS = pLGreaterS + dbinom(i,365,pL)*dbinom(j,365,pS)
  }
}
print(pLGreaterS)
Problem correlation
Cov(X, Y) = E(XY) − E(X)E(Y) = 5/4 − 1 = 1/4.
Cor(X, Y) = Cov(X, Y)/(σX σY) = (1/4)/(2/4) = 1/2.
Cov(X , Y ) = Cov(X1 + X2 + X3 , X3 + X4 + X5 )
= Cov(X1 , X3 ) + Cov(X1 , X4 ) + Cov(X1 , X5 )
+ Cov(X2 , X3 ) + Cov(X2 , X4 ) + Cov(X2 , X5 )
+ Cov(X3 , X3 ) + Cov(X3 , X4 ) + Cov(X3 , X5 )
Because the Xi are independent, the only non-zero term in the above sum is Cov(X3, X3) = Var(X3) = 1/4. Therefore
Cor(X, Y) = Cov(X, Y)/(σX σY) = (1/4)/(3/4) = 1/3.
Introduction to Statistics
18.05 Spring 2014
T T T H H T H H H T
H T H T H T H T H T
H T T T H T T T T H
H T T H H T H H T H
T T H H H H T H T H
T T T H T H H H H T
T T T H H H T T T H
H H H H H H H T T T
H T H H T T T H H T
H T H H H T T T H H
Data Collection:
Informal Investigation / Observational Study / Formal
Experiment
Descriptive statistics
Inferential statistics (the focus in 18.05)
Is it fair?
(The same 100 tosses as on the previous slide.)
Is it normal?
[Figure: density histogram of the data (density up to 0.20) with a normal pdf overlaid, x from −4 to 4.]
What is a statistic?
Concept question
Data from i.i.d. Bernoulli random variables, e.g. x1, . . . , x10 = 1, 1, 1, 0, 0, 0, 0, 0, 1, 0.
Reminder of Bayes’ theorem
P(H|D) = P(D|H)P(H) / P(D).
P(hypothesis|data) = P(data|hypothesis)P(hypothesis) / P(data)
Estimating a parameter
Model:
Parameters of interest
Example. You ask 100 people to taste cilantro and 55 say it tastes
like soap. Use this data to estimate p, the fraction of all people for
whom it tastes like soap.
Likelihood
P(55 soap | p) = (100 choose 55) p^55 (1 − p)^45.
Definition:
The likelihood P(data | p) = (100 choose 55) p^55 (1 − p)^45.
NOTICE: The likelihood takes the data as fixed and computes the probability of the data for a given p.
Maximum likelihood estimate (MLE)
Log likelihood
Example.
Likelihood P(data | p) = (100 choose 55) p^55 (1 − p)^45
Log likelihood = ln( (100 choose 55) ) + 55 ln(p) + 45 ln(1 − p).
(Note the first term is just a constant.)
Board Question: Coins
A coin is taken from a box containing three coins, which give heads
with probability p = 1/3, 1/2, and 2/3. The mystery coin is tossed
80 times, resulting in 49 heads and 31 tails.
(a) What is the likelihood of this data for each type of coin? Which
coin gives the maximum likelihood?
(b) Now suppose that we have a single coin with unknown probability
p of landing heads. Find the likelihood and log likelihood functions
given the same data. What is the maximum likelihood estimate for p?
Solution
values:
P(D | p = 1/3) = (80 choose 49) (1/3)^49 (2/3)^31 = 6.24 · 10^(−7)
P(D | p = 1/2) = (80 choose 49) (1/2)^49 (1/2)^31 = 0.024
P(D | p = 2/3) = (80 choose 49) (2/3)^49 (1/3)^31 = 0.082
The p = 2/3 coin gives the maximum likelihood.
(b) Setting the derivative of the log likelihood to 0:
49/p − 31/(1 − p) = 0 ⇒ p = 49/80
So our MLE is p̂ = 49/80.
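These numbers can be reproduced in R:

p <- c(1/3, 1/2, 2/3)
choose(80, 49) * p^49 * (1 - p)^31   # likelihoods: ~6.2e-07, 0.024, 0.082
49 / 80                              # MLE for part (b): 0.6125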
Continuous likelihood
Solution
Log likelihood
Board Question
Work from scratch. Do not simply use the formula just given.
Solution
x1 = 2, x2 = 3, x3 = 1, x4 = 3, x5 = 4.
So, for exponential(λ) data (the xi sum to 13), our likelihood and log likelihood functions with this data are
likelihood: λ⁵ e^(−13λ);   log likelihood: 5 ln(λ) − 13λ.
Using calculus to find the MLE we take the derivative of the log likelihood and set it to 0:
5/λ − 13 = 0 ⇒ λ̂ = 5/13.
Bayesian Updating: Discrete Priors: 18.05 Spring 2014
Learning from experience
Which treatment would you choose?
• Suppose you find out that the bag contained one 4-sided die and
one 10-sided die. Does this change your guess?
• Suppose you find out that the bag contained one 4-sided die and
100 10-sided dice. Does this change your guess?
Board Question: learning from data
• A certain disease has a prevalence of 0.005.
• A screening test has 2% false positives and 1% false negatives.
[Probability tree: hypotheses H+ , H− each branching to test results T+ , T− .]
P(T+ | H+ )P(H+ )
Bayes’ theorem says P(H+ | T+ ) = .
P(T+ )
Using the tree, the total probability
P(T+ ) = P(T+ | H+ )P(H+ ) + P(T+ | H− )P(H− )
= 0.99 · 0.005 + 0.02 · 0.995 = 0.02485
MIT18_05S14_class11_slides 258
Solution continued on next slide.
January 1, 2017 5 / 22
Solution continued
So,

P(H+ | T+ ) = P(T+ | H+ )P(H+ )/P(T+ ) = (0.99 · 0.005)/0.02485 = 0.199

P(H− | T+ ) = P(T+ | H− )P(H− )/P(T+ ) = (0.02 · 0.995)/0.02485 = 0.801
MIT18_05S14_class11_slides 259
January 1, 2017 6 / 22
Solution continued
2. Terminology
Data: The data are the results of the experiment. In this case, the
positive test.
Hypotheses: The hypotheses are the possible answers to the question
being asked. In this case they are H+ the patient has the disease; H−
they don’t.
Likelihoods: The likelihood given a hypothesis is the probability of the
data given that hypothesis. In this case there are two likelihoods, one for
each hypothesis
MIT18_05S14_class11_slides 260
Continued on next slide.
January 1, 2017 7 / 22
Solution continued
Prior probabilities of the hypotheses: The priors are the probabilities of the
hypotheses prior to collecting data. In this case,
P(H+ | T+ ) = P(T+ | H+ ) · P(H+ ) / P(T+ )
The table holds likelihoods P(D|H) for every possible hypothesis and data
combination.
Notice in the next slide that the P(T+ | H) column is exactly the likelihood
column in the Bayesian update table.
MIT18_05S14_class11_slides 262
January 1, 2017 9 / 22
Solution continued
4. Calculation using a Bayesian update table
H = hypothesis: H+ (patient has disease); H− (they don’t).
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(T+ |H) P(T+ |H)P(H) P(H|T+ )
H+ 0.005 0.99 0.00495 0.199
H− 0.995 0.02 0.0199 0.801
total 1 NO SUM P(T+ ) = 0.02485 1
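The update table translates directly into R (a sketch using the numbers above):

prior      <- c(0.005, 0.995)      # P(H+), P(H-)
likelihood <- c(0.99, 0.02)        # P(T+ | H+), P(T+ | H-)
bayes_num  <- prior * likelihood   # Bayes numerator column
sum(bayes_num)                     # P(T+) = 0.02485
bayes_num / sum(bayes_num)         # posteriors 0.199, 0.801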
Data D = T+
1. Make the full likelihood table (be smart about identical columns).
2. Make a Bayesian update table and compute the posterior
probabilities that the chosen die is each of the five dice.
3. Same question if I rolled a 5.
4. Same question if I rolled a 9.
MIT18_05S14_class11_slides 264
January 1, 2017 11 / 22
Tabular solution
D = ‘rolled a 13’
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
H4 1/5 0 0 0
H6 1/5 0 0 0
H8 1/5 0 0 0
H12 1/5 0 0 0
H20 1/5 1/20 1/100 1
total 1 1/100 1
MIT18_05S14_class11_slides 265
January 1, 2017 12 / 22
Tabular solution
D = ‘rolled a 5’
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
H4 1/5 0 0 0
H6 1/5 1/6 1/30 0.392
H8 1/5 1/8 1/40 0.294
H12 1/5 1/12 1/60 0.196
H20 1/5 1/20 1/100 0.118
total 1 0.085 1
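The same computation in R for D = ‘rolled a 5’ (a sketch with base R):

sides      <- c(4, 6, 8, 12, 20)
prior      <- rep(1/5, 5)
likelihood <- ifelse(sides >= 5, 1/sides, 0)   # P(rolled a 5 | die)
bayes_num  <- prior * likelihood
bayes_num / sum(bayes_num)   # posteriors 0, 0.392, 0.294, 0.196, 0.118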
MIT18_05S14_class11_slides 266
January 1, 2017 13 / 22
Tabular solution
D = ‘rolled a 9’
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
H4 1/5 0 0 0
H6 1/5 0 0 0
H8 1/5 0 0 0
H12 1/5 1/12 1/60 0.625
H20 1/5 1/20 1/100 0.375
total 1 .0267 1
MIT18_05S14_class11_slides 267
January 1, 2017 14 / 22
Iterated Updates
MIT18_05S14_class11_slides 268
January 1, 2017 15 / 22
Tabular solution
D1 = ‘rolled a 5’
D2 = ‘rolled a 9’
Bayes Bayes
hyp. prior likel. 1 num. 1 likel. 2 num. 2 posterior
H P(H) P(D1 |H) ∗∗∗ P(D2 |H) ∗∗∗ P(H|D1 , D2 )
H4 1/5 0 0 0 0 0
H6 1/5 1/6 1/30 0 0 0
H8 1/5 1/8 1/40 0 0 0
H12 1/5 1/12 1/60 1/12 1/720 0.735
H20 1/5 1/20 1/100 1/20 1/2000 0.265
total 1 0.0019 1
MIT18_05S14_class11_slides 269
January 1, 2017 16 / 22
Board Question
MIT18_05S14_class11_slides 270
January 1, 2017 17 / 22
Tabular solution: two steps
D1 = ‘rolled a 9’
D2 = ‘rolled a 5’
Bayes Bayes
hyp. prior likel. 1 num. 1 likel. 2 num. 2 posterior
H P(H) P(D1 |H) ∗∗∗ P(D2 |H) ∗∗∗ P(H|D1 , D2 )
H4 1/5 0 0 0 0 0
H6 1/5 0 0 1/6 0 0
H8 1/5 0 0 1/8 0 0
H12 1/5 1/12 1/60 1/12 1/720 0.735
H20 1/5 1/20 1/100 1/20 1/2000 0.265
total 1 0.0019 1
MIT18_05S14_class11_slides 271
January 1, 2017 18 / 22
Tabular solution: one step
D = ‘rolled a 9 then a 5’
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
H4 1/5 0 0 0
H6 1/5 0 0 0
H8 1/5 0 0 0
H12 1/5 1/144 1/720 0.735
H20 1/5 1/400 1/2000 0.265
total 1 0.0019 1
MIT18_05S14_class11_slides 272
January 1, 2017 19 / 22
Board Question: probabilistic prediction
MIT18_05S14_class11_slides 273
January 1, 2017 20 / 22
Solution
D1 = ‘rolled a 5’
D2 = ‘rolled a 4’
Bayes
hyp. prior likel. 1 num. 1 post. 1 likel. 2 post. 1 × likel. 2
H P(H) P(D1 |H) ∗ ∗ ∗ P(H|D1 ) P(D2 |H, D1 ) P(D2 |H, D1 )P(H|D1 )
H4 1/5 0 0 0 ∗ 0
H6 1/5 1/6 1/30 0.392 1/6 0.392 · 1/6
H8 1/5 1/8 1/40 0.294 1/8 0.294 · 1/8
H12 1/5 1/12 1/60 0.196 1/12 0.196 · 1/12
H20 1/5 1/20 1/100 0.118 1/20 0.118 · 1/20
total 1 0.085 1 0.124
The law of total probability tells us P(D1 ) is the sum of the Bayes
numerator 1 column in the table: P(D1 ) = 0.085 .
The law of total probability tells us P(D2 |D1 ) is the sum of the last
column in the table: P(D2 |D1 ) = 0.124
MIT18_05S14_class11_slides 274
January 1, 2017 21 / 22
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .
MIT18_05S14_class11_slides 275
Prediction and Odds
18.05 Spring 2014
MIT18_05S14_class12_slides 277
January 1, 2017 2 / 26
Words of estimative probability (WEP)
WEP Prediction: “It is likely to rain tomorrow.”
Word Probability
Likely Will happen to more than 50% of patients
Frequent Will happen to 10-50% of patients
Occasional Will happen to 1-10% of patients
Rare Will happen to less than 1% of patients
MIT18_05S14_class12_slides 279
January 1, 2017 4 / 26
Example: Three types of coins
A drawer contains one coin of each type. You pick one at random.
Prior predictive probability: Before taking data, what is the
probability a toss will land heads? Tails?
MIT18_05S14_class12_slides 280
January 1, 2017 5 / 26
Solution 1
MIT18_05S14_class12_slides 281
January 1, 2017 6 / 26
Solution 2
2. We are given the data D1,H . First update the probabilities for the type
of coin.
Let D2,H = ‘toss 2 is heads’, D2,T = ‘toss 2 is tails’.
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D1,H |H) P(D1,H |H)P(H) P(H|D1,H )
A 1/3 0.5 0.1667 0.25
B 1/3 0.6 0.2 0.3
C 1/3 0.9 0.3 0.45
total 1 0.6667 1
Next use the law of total probability:
P(D2,H |D1,H ) = P(D2,H |A)P(A|D1,H ) + P(D2,H |B)P(B|D1,H )
+P(D2,H |C )P(C |D1,H )
= 0.71
P(D2,T |D1,H ) = 0.29.
MIT18_05S14_class12_slides 282
January 1, 2017 7 / 26
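In R this posterior predictive probability is one line (a sketch; coins A, B, C have P(heads) = 0.5, 0.6, 0.9):

p_heads   <- c(0.5, 0.6, 0.9)      # coins A, B, C
posterior <- c(0.25, 0.30, 0.45)   # P(type | first toss heads), from the table
sum(p_heads * posterior)           # P(second toss heads | first heads) = 0.71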
Three coins, continued.
MIT18_05S14_class12_slides 283
January 1, 2017 8 / 26
Board question: three coins
Same setup:
3 coins with probabilities 0.5, 0.6, and 0.9 of heads.
Pick one; toss 5 times.
Suppose you get 1 head out of 5 tosses.
Compute the posterior probabilities for the type of coin and the
posterior predictive probabilities for the results of the next toss.
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
A 1/3 C(5,1)(0.5)(0.5)^4 0.0521 0.669
B 1/3 C(5,1)(0.6)(0.4)^4 0.0256 0.329
C 1/3 C(5,1)(0.9)(0.1)^4 0.00015 0.002
total 1 0.0778 1
So,
MIT18_05S14_class12_slides 285
January 1, 2017 10 / 26
Concept Question
Does the order of the 1 head and 4 tails affect the posterior
distribution of the coin type?
1. Yes 2. No
Does the order of the 1 head and 4 tails affect the posterior predictive
distribution of the next flip?
1. Yes 2. No
MIT18_05S14_class12_slides 286
January 1, 2017 11 / 26
Odds
MIT18_05S14_class12_slides 287
January 1, 2017 12 / 26
Examples
A fair coin has O(heads) = 0.5/0.5 = 1.
We say ‘1 to 1’ or ‘fifty-fifty’.

The odds of rolling a 4 with a six-sided die are (1/6)/(5/6) = 1/5.
We say ‘1 to 5 for’ or ‘5 to 1 against’.

For event E , if P(E ) = p then O(E ) = p/(1 − p).
MIT18_05S14_class12_slides 288
January 1, 2017 13 / 26
Bayesian framework: Marfan’s Syndrome

O(M|F ) = [P(F |M)/P(F |M^c )] · O(M)
A screening test has a 0.05 false positive rate and a 0.02 false
negative rate.
4. Based on your answers to (1) and (2), would you say a positive test
(the data) provides strong or weak evidence for the presence of the
disease?
MIT18_05S14_class12_slides 292
answer: See next slide
January 1, 2017 17 / 26
Solution
Let H+ = ‘has disease’ and H− = ‘doesn’t’
Let T+ = positive test
1. O(H+ ) = P(H+ )/P(H− ) = 0.005/0.995 = 0.00503
Likelihood table:
Possible data
T+ T−
Hypotheses H+ 0.98 0.02
H− 0.05 0.95
2. Bayes factor = ratio of likelihoods = P(T+ |H+ )/P(T+ |H− ) = 0.98/0.05 = 19.6

3. Posterior odds = Bayes factor × prior odds = 19.6 × 0.00503 = 0.0985
4. Yes, a Bayes factor of 19.6 indicates a positive test is strong evidence
the patient has the disease. The posterior odds are still small because the
prior odds are extremely small.
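A compact R version of steps 1-3 (a sketch using the numbers above):

prior_odds   <- 0.005 / 0.995               # O(H+) = 0.00503
bayes_factor <- 0.98 / 0.05                 # P(T+|H+)/P(T+|H-) = 19.6
post_odds    <- bayes_factor * prior_odds   # posterior odds = 0.0985
post_odds / (1 + post_odds)                 # back to a probability: about 0.09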
MIT18_05S14_class12_slides 293
More on next slide.
January 1, 2017 18 / 26
Solution continued
MIT18_05S14_class12_slides 294
January 1, 2017 19 / 26
Board Question: CSI Blood Types*
Crime scene: the two perpetrators left blood: one of type O and
one of type AB
In population 60% are type O and 1% are type AB
1 Suspect Oliver is tested and has type O blood.
Compute the Bayes factor and posterior odds that Oliver was one
of the perpetrators.
Is the data evidence for or against the hypothesis that Oliver is
guilty?
2 Same question for suspect Alberto who has type AB blood.
Hypotheses:
S = ‘Oliver and another unknown person were at the scene’
S c = ‘two unknown people were at the scene’
Data:
D = ‘type ‘O’ and ‘AB’ blood were found; Oliver is type O’
MIT18_05S14_class12_slides 296
January 1, 2017 21 / 26
Solution to CSI Blood Types
For Oliver:

Bayes factor = P(D|S)/P(D|S^c ) = 0.01/(2 · 0.6 · 0.01) = 0.83.
Therefore the posterior odds = 0.83 × prior odds (O(S|D) = 0.83 · O(S))
Since the odds of his presence decreased this is (weak) evidence of his
innocence.
For Alberto:

Bayes factor = P(D|S)/P(D|S^c ) = 0.6/(2 · 0.6 · 0.01) = 50.
David Mackay:
MIT18_05S14_class12_slides 298
January 1, 2017 23 / 26
Updating again and again
Collect data: D1 , D2 , . . .
Posterior odds to D1 become prior odds to D2 . So,
O(H|D1 , D2 ) = O(H) · [P(D1 |H)/P(D1 |H^c )] · [P(D2 |H)/P(D2 |H^c )]
Independence assumption:
D1 and D2 are conditionally independent.
MIT18_05S14_class12_slides 299
January 1, 2017 24 / 26
Marfan’s Symptoms
The Bayes factor for ocular features (F) is

BFF = P(F |M)/P(F |M^c ) = 0.7/0.07 = 10
The wrist sign (W) is the ability to wrap one hand around your other
wrist to cover your pinky nail with your thumb. Assume 10% of the
population have the wrist sign, while 90% of people with Marfan’s
have it. So,
BFW = P(W |M)/P(W |M^c ) = 0.9/0.1 = 9.
O(M|F , W ) = O(M) · BFF · BFW = (1/14999) · 10 · 9 ≈ 6/1000.
We can convert posterior odds back to probability, but since the odds
are so small the result is nearly the same:
P(M|F , W ) ≈ 6/(1000 + 6) ≈ 0.596%
MIT18_05S14_class12_slides 300
January 1, 2017 25 / 26
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .
MIT18_05S14_class12_slides 301
Bayesian Updating: Continuous Priors
18.05 Spring 2014
MIT18_05S14_class13_slides 303
January 1, 2017 2 /24
Example of Bayesian updating so far
MIT18_05S14_class13_slides 304
January 1, 2017 3 /24
Solution (2 times)
Let C0.25 stand for the hypothesis (event) that the chosen coin has
probability 0.25 of heads. We want to compute P(C0.25 |data).
Method 1: Using Bayes’ formula and the law of total probability:
P(C.25 |data) = P(data|C.25 )P(C.25 )/P(data)
= P(data|C.25 )P(C.25 ) / [P(data|C.25 )P(C.25 ) + P(data|C.5 )P(C.5 ) + P(data|C.75 )P(C.75 )]
= (0.75)^2 (1/4) / [(0.75)^2 (1/4) + (0.5)^2 (1/2) + (0.25)^2 (1/4)]
= 0.5
Method 2: Using a Bayesian update table:
hypotheses prior likelihood Bayes numerator posterior
H P(H) P(data|H) P(data|H)P(H) P(H|data)
C0.25 1/4 (0.75)2 0.141 0.500
C0.5 1/2 (0.5)2 0.125 0.444
C0.75 1/4 (0.25)2 0.016 0.056
Total 1 P(data) = 0.281 1
MIT18_05S14_class13_slides 305
January 1, 2017 4 /24
Solution continued
Note. The total probability P(data) is also called the prior predictive
probability of the data.
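Method 2 is easy to reproduce in R (a sketch):

prior      <- c(1/4, 1/2, 1/4)       # C0.25, C0.5, C0.75
likelihood <- c(0.75, 0.5, 0.25)^2   # P(TT | coin)
bayes_num  <- prior * likelihood
sum(bayes_num)                       # P(data) = 0.281
bayes_num / sum(bayes_num)           # posteriors 0.500, 0.444, 0.056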
MIT18_05S14_class13_slides 306
January 1, 2017 5 /24
Notation with lots of hypotheses I.
Now there are 5 types of coins with probabilities 0.1, 0.3, 0.5,
0.7, 0.9 of heads.
Assume the numbers of each type are in the ratio 1:2:3:2:1 (so
fairer coins are more common).
Again we pick a coin at random, toss it twice and get TT .
Construct the Bayesian update table for the posterior probabilities of
each type of coin.
hypotheses prior likelihood Bayes numerator posterior
H P(H) P(data|H) P(data|H)P(H) P(H|data)
C0.1 1/9 (0.9)2 0.090 0.297
C0.3 2/9 (0.7)2 0.109 0.359
C0.5 3/9 (0.5)2 0.083 0.275
C0.7 2/9 (0.3)2 0.020 0.066
C0.9 1/9 (0.1)2 0.001 0.004
Total 1 P(data) = 0.303 1
MIT18_05S14_class13_slides 307
January 1, 2017 6 /24
Notation with lots of hypotheses II.
Assume fairer coins are more common with the number of coins of
We can do this!
MIT18_05S14_class13_slides 308
January 1, 2017 7 /24
Table with 9 hypotheses
MIT18_05S14_class13_slides 309
January 1, 2017 8 /24
Notation with lots of hypotheses III.
Assume fairer coins are more common with the number of coins of
We could do this . . .
MIT18_05S14_class13_slides 310
January 1, 2017 9 /24
Table with 99 coins
hypothesis prior likelihood Bayes numerator posterior
C0.62 k(0.62)(1 − 0.62) (1 − 0.62)^2 k(0.62)(1 − 0.62)^3 0.006805262
C0.63 k(0.63)(1 − 0.63) (1 − 0.63)^2 k(0.63)(1 − 0.63)^3 0.006383342
C0.64 k(0.64)(1 − 0.64) (1 − 0.64)^2 k(0.64)(1 − 0.64)^3 0.005972963
C0.65 k(0.65)(1 − 0.65) (1 − 0.65)^2 k(0.65)(1 − 0.65)^3 0.005574679
C0.66 k(0.66)(1 − 0.66) (1 − 0.66)^2 k(0.66)(1 − 0.66)^3 0.005188993
C0.67 k(0.67)(1 − 0.67) (1 − 0.67)^2 k(0.67)(1 − 0.67)^3 0.004816361
C0.68 k(0.68)(1 − 0.68) (1 − 0.68)^2 k(0.68)(1 − 0.68)^3 0.004457191
MIT18_05S14_class13_slides 311
January 1, 2017 10 / 23
Bayesian updating
3. (Big letters) For hypotheses H and data D:
For a continuous random variable X with pdf f (x), the probability of an
infinitesimal interval [x, x + dx] is f (x) dx, and

P(c ≤ X ≤ d) = ∫_c^d f (x) dx.
MIT18_05S14_class13_slides 314
January 1, 2017 13 /24
Example of continuous hypotheses
MIT18_05S14_class13_slides 315
January 1, 2017 14 /24
Law of total probability for continuous distributions
Discrete set of hypotheses H1 , H2 , . . . , Hn ; data D:

P(D) = Σ_{i=1}^n P(D|Hi )P(Hi ).

Bayes’ Theorem.

f (θ|x) dθ = p(x|θ)f (θ) dθ / p(x) = p(x|θ)f (θ) dθ / ∫_a^b p(x|θ)f (θ) dθ
Data x.
Likelihood p(x|θ).
2. Suppose you toss again and get tails. Update your posterior from
problem 1 using this data.
3. On one set of axes graph the prior and the posteriors from
problems 1 and 2.
See next slide for solution.
MIT18_05S14_class13_slides 321
January 1, 2017 20 /24
Solution
Problem 1
Bayes
hypoth. prior likelihood numerator posterior
θ 2θ dθ θ 2θ^2 dθ 3θ^2 dθ
Total 1 T = ∫_0^1 2θ^2 dθ = 2/3 1

Posterior pdf: f (θ|x) = 3θ^2 . (Should graph this.)
Note: We don’t really need to compute T . Once we know the posterior
density is of the form cθ^2 we only have to find the value of c that makes it
have total probability 1.
Problem 2
Bayes
hypoth. prior likelihood numerator posterior
θ 3θ^2 dθ 1 − θ 3θ^2 (1 − θ) dθ 12θ^2 (1 − θ) dθ
Total 1 ∫_0^1 3θ^2 (1 − θ) dθ = 1/4 1

Posterior pdf: f (θ|x) = 12θ^2 (1 − θ).
MIT18_05S14_class13_slides 322
January 1, 2017 21 /24
Board Question
Give the integral for the normalizing factor, but do not compute it
out. Call its value T and give the posterior pdf in terms of T .
answer: f (θ|x) = (1/T ) θ^15 (1 − θ)^12 . (Called a Beta distribution.)
MIT18_05S14_class13_slides 323
January 1, 2017 22 /24
Beta distribution
Beta(a, b) has density

f (θ) = [(a + b − 1)!/((a − 1)! (b − 1)!)] θ^(a−1) (1 − θ)^(b−1)

http://mathlets.org/mathlets/beta-distribution/

Observation: the coefficient is a normalizing factor, so if
f (θ) = cθ^(a−1) (1 − θ)^(b−1)
is a pdf, then
c = (a + b − 1)!/((a − 1)! (b − 1)!)
and f (θ) is the pdf of a Beta(a, b) distribution.
MIT18_05S14_class13_slides 324
January 1, 2017 23 /24
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .
MIT18_05S14_class13_slides 325
Bayesian Updating: Continuous Priors
18.05 Spring 2014
Compute ∫_a^b f (x|θ)f (θ) dθ

Beta(a, b) has density

f (θ) = [(a + b − 1)!/((a − 1)! (b − 1)!)] θ^(a−1) (1 − θ)^(b−1)

http://mathlets.org/mathlets/beta-distribution/

Observation:
The coefficient is a normalizing factor, so if we have a pdf
f (θ) = cθ^(a−1) (1 − θ)^(b−1)
then
θ ∼ beta(a, b)
and
c = (a + b − 1)!/((a − 1)! (b − 1)!)
MIT18_05S14_class14_slides 327
January 1, 2017 2 /26
Board question preamble: beta priors
Suppose you are testing a new medical treatment with unknown
probability of success θ. You don’t know θ, but your prior belief
is that it’s probably not too far from 0.5. You capture this intuition
with a beta(5,5) prior on θ.
[Plot: the Beta(5,5) prior pdf for θ, peaked at 0.5; density scale 0 to 2.]
To sharpen this distribution you take data and update the prior.
MIT18_05S14_class14_slides 328
Question on next slide.
January 1, 2017 3 /26
Board question: beta priors
Beta(a, b): f (θ) = [(a + b − 1)!/((a − 1)! (b − 1)!)] θ^(a−1) (1 − θ)^(b−1)
Treatment has prior f (θ) ∼ beta(5, 5)
4. Use what you know about pdf’s to evaluate the integral without
computing it directly
MIT18_05S14_class14_slides 329
January 1, 2017 4 /26
Solution
1. Prior pdf is f (θ) = (9!/(4! 4!)) θ^4 (1 − θ)^4 = c1 θ^4 (1 − θ)^4 .

hypoth. prior likelihood Bayes numer. posterior
θ c1 θ^4 (1 − θ)^4 dθ C(10, 6) θ^6 (1 − θ)^4 c3 θ^10 (1 − θ)^8 dθ beta(11, 9)

We know the normalized posterior is a beta distribution because it has the
form of a beta distribution (cθ^(a−1) (1 − θ)^(b−1) on [0,1]), so by our earlier
observation it must be a beta distribution.
2. The answer is the same. The only change is that the likelihood has a
coefficient of 1 instead of a binomial coefficient.
3. The posterior on θ is beta(11, 9) which has density

f (θ | data) = (19!/(10! 8!)) θ^10 (1 − θ)^8 .
MIT18_05S14_class14_slides 330
January 1, 2017 5 /26
Solution continued
The law of total probability says that the posterior predictive probability of
success is
P(success | data) = ∫_0^1 f (success | θ) · f (θ | data) dθ

= ∫_0^1 θ · (19!/(10! 8!)) θ^10 (1 − θ)^8 dθ = ∫_0^1 (19!/(10! 8!)) θ^11 (1 − θ)^8 dθ

Thus

P(success | data) = (19!/(10! 8!)) · (11! 8!/20!) = 11/20.
MIT18_05S14_class14_slides 331
January 1, 2017 6 /26
Conjugate priors
We had
Prior f (θ) dθ: beta distribution
That is, the beta prior becomes a beta posterior and repeated
updating is easy!
MIT18_05S14_class14_slides 332
January 1, 2017 7 /26
Concept Question
Suppose your prior f (θ) in the bent coin example is Beta(6, 8). You
flip the coin 7 times, getting 2 heads and 5 tails. What is the
posterior pdf f (θ|x)?
1. Beta(2,5)
2. Beta(3,6)
3. Beta(6,8)
4. Beta(8,13)
We saw in the previous board question that 2 heads and 5 tails will update
a beta(a, b) prior to a beta(a + 2, b + 5) posterior.
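A quick R illustration of the conjugate update (a sketch; dbeta is base R):

theta <- seq(0, 1, by = 0.01)
plot(theta, dbeta(theta, 6, 8), type = "l", ylab = "density")  # Beta(6,8) prior
lines(theta, dbeta(theta, 8, 13), lty = 2)  # Beta(8,13) posterior after 2 heads, 5 tails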
hypoth. prior likelihood Bayes numerator posterior
θ f (θ) dθ f (x | θ) f (x | θ)f (θ) dθ f (θ|x) dθ = f (x | θ)f (θ) dθ / f (x)
total 1 f (x) 1

f (x) = ∫_0^1 f (x | θ)f (θ) dθ
MIT18_05S14_class14_slides 335
January 1, 2017 10 /26
Normal prior, normal data
N(µ, σ^2 ) has density

f (y ) = (1/(σ√2π)) e^(−(y−µ)^2 /2σ^2 ) .
Observation:
The coefficient is a normalizing factor, so if we have a pdf
f (y ) = c e^(−(y−µ)^2 /2σ^2 )
then
y ∼ N(µ, σ 2 )
and
c = 1/(σ√2π)
MIT18_05S14_class14_slides 336
January 1, 2017 11 /26
Board question: normal prior, normal data
N(µ, σ^2 ) has pdf: f (y ) = (1/(σ√2π)) e^(−(y−µ)^2 /2σ^2 ) .
Suppose our data follows a N(θ, 4) distribution with unknown
mean θ and variance 4. That is
f (x | θ) = pdf of N(θ, 4)
We have:

Prior: θ ∼ N(3, 1): f (θ) = c1 e^(−(θ−3)^2 /2)

Likelihood x ∼ N(θ, 4): f (x | θ) = c2 e^(−(x−θ)^2 /8)

For x = 5 the likelihood is c2 e^(−(5−θ)^2 /8)

The last expression shows the posterior is N(17/5, 4/5).
MIT18_05S14_class14_slides 338
January 1, 2017 13 /26
Solution graphs
Data: x1 = 5
Prior is normal: µprior = 3; σprior = 1
Likelihood is normal: µ = θ; σ=2
Posterior is normal µposterior = 3.4; σposterior = 0.894
• Will see simple formulas for doing this update next time.
MIT18_05S14_class14_slides 339
January 1, 2017 14 /26
Board question: Romeo and Juliet
Juliet knows that θ ≤ 1 hour and she assumes a flat prior for θ on
[0, 1].
On their first date Romeo is 15 minutes late. Use this data to update
the prior distribution for θ.
(a) Find and graph the prior and posterior pdfs for θ.
(b) Find the prior predictive pdf for how late Romeo will be on the
first date and the posterior predictive pdf of how late he’ll be on the
second date (if he gets one!). Graph these pdfs.
See next slides for solution
MIT18_05S14_class14_slides 340
January 1, 2017 15 /26
Solution
Data: x1 = 0.25.
MIT18_05S14_class14_slides 341
January 1, 2017 16 /26
Solution graphs
Prior and posterior pdf’s for θ.
MIT18_05S14_class14_slides 342
January 1, 2017 17 /26
Solution graphs continued
MIT18_05S14_class14_slides 343
January 1, 2017 18 /26
Solution continued
Posterior prediction:
f (x2 |θ) = 1/θ if θ ≥ x2 , and 0 if θ < x2 .

The posterior predictive pdf is f (x2 |x1 ) = ∫ f (x2 |θ)f (θ|x1 ) dθ. The
integrand is 0 unless θ > x2 and θ > 0.25. There are two cases:

If x2 < 0.25 : f (x2 |x1 ) = ∫_0.25^1 (c/θ^2 ) dθ = 3c = 3/ln(4).

If x2 ≥ 0.25 : f (x2 |x1 ) = ∫_x2^1 (c/θ^2 ) dθ = (1/x2 − 1)/ln(4)
MIT18_05S14_class14_slides 344
January 1, 2017 19 /26
Solution continued
Prior (red) and posterior (blue) predictive pdf’s for x2
MIT18_05S14_class14_slides 345
January 1, 2017 20 /26
From discrete to continuous Bayesian updating
Bayes
hyp. prior likelihood numerator posterior
θ dθ θ θ dθ 2θ dθ
Total 1 ∫_0^1 θ dθ = 1/2 1
MIT18_05S14_class14_slides 346
January 1, 2017 21 /26
Approximate continuous by discrete
MIT18_05S14_class14_slides 347
January 1, 2017 22 /26
Chop [0, 1] into 4 intervals
MIT18_05S14_class14_slides 348
January 1, 2017 23 /26
Chop [0, 1] into 12 intervals
MIT18_05S14_class14_slides 349
January 1, 2017 24 /26
Density histogram

[Two density histograms approximating the pdf: one with 4 bins centered at
1/8, 3/8, 5/8, 7/8 and one with finer bins; density scale 0 to 2.]
MIT18_05S14_class14_slides 350
January 1, 2017 25 /26
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .
MIT18_05S14_class14_slides 351
Conjugate Priors: Beta and Normal
18.05 Spring 2014
Bayes
hypoth. prior likelihood numerator posterior
θ 2θ dθ θ 2θ^2 dθ 3θ^2 dθ
Total 1 T = ∫_0^1 2θ^2 dθ = 2/3 1
MIT18_05S14_class15_slides 353
January 1, 2017 2 /20
Review: Continuous priors, continuous data
Bayes
hypoth. prior likeli. numerator posterior
θ f (θ) dθ f (x | θ) f (x | θ)f (θ) dθ f (θ | x) dθ = f (x | θ)f (θ) dθ / f (x)
total 1 f (x) 1

f (x) = ∫ f (x | θ)f (θ) dθ
MIT18_05S14_class15_slides 354
January 1, 2017 3 /20
Romeo and Juliet
MIT18_05S14_class15_slides 355
January 1, 2017 4 /20
Updating with normal prior and normal likelihood
A normal prior is conjugate to a normal likelihood with known σ.
Data: x1 , x2 , . . . , xn
Normal likelihood. x1 , x2 , . . . , xn ∼ N(θ, σ 2 )
Normal prior. θ ∼ N(µprior , σ^2_prior ).

Normal Posterior. θ ∼ N(µpost , σ^2_post ).

a = 1/σ^2_prior , b = n/σ^2 , µpost = (a µprior + b x̄)/(a + b), σ^2_post = 1/(a + b).
0. µprior = 4, σprior = 2, σ = 3, n = 1, x̄ = 2.
1.
hypoth. prior likelihood posterior
θ f (θ) ∼ N(4, 2^2 ) f (x|θ) ∼ N(θ, 3^2 ) f (θ|x) ∼ N(µpost , σ^2_post )
θ c1 exp(−(θ−4)^2 /8) c2 exp(−(2−θ)^2 /18) c3 exp(−(θ−4)^2 /8) exp(−(2−θ)^2 /18)
[Plot: the prior (blue) together with five candidate pdfs labeled Plot 1
through Plot 5; x from 0 to 14, density scale 0 to 0.8.]
(a) Which plot is the posterior to just the first data value?
(Click on the plot number.) (Solution in 2 slides)
MIT18_05S14_class15_slides 359
January 1, 2017 8 /20
Concept question: normal priors, normal likelihood
[Same plot as the previous slide: the prior plus candidate Plots 1-5.]
(a) Plot 2: The first data value is 3. Therefore the posterior must have its
mean between 3 and the mean of the blue prior. The only possibilities for
this are plots 1 and 2. We also know that the variance of the posterior is
less than that of the prior, and of plots 1 and 2 only plot 2 has smaller
variance than the prior.
(b) Plot 3: The average of the 3 data values is 8. Therefore the posterior
must have mean between the mean of the blue prior and 8, so the only
possibilities are plots 3 and 4. Because this posterior comes after the
plot 2 posterior, it must have the smaller variance. This leaves only
Plot 3.
MIT18_05S14_class15_slides 361
January 1, 2017 10 /20
Board question: normal/normal
For data x1 , . . . , xn with data mean x̄ = (x1 + . . . + xn )/n:

a = 1/σ^2_prior , b = n/σ^2 , µpost = (a µprior + b x̄)/(a + b), σ^2_post = 1/(a + b).
This season Sophie Lie made 85 percent of her free throws. What is
the posterior expected value of her career percentage θ?
answer: Solution on next frame
MIT18_05S14_class15_slides 362
January 1, 2017 11 /20
Solution
formulas.
2
Likelihood x ∼ N(θ, 16). So f (x|θ) = c1 e−(x−θ) /2·16 .
The updating weights are
Therefore
2
µpost = (75/36 + 85/16)/(52/576) = 81.9, σpost = 36/13 = 11.1.
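A small R helper implementing these update formulas (a sketch; the prior N(75, 36) and the single data value 85 with σ^2 = 16 are read off from the weights above):

normal_update <- function(mu_prior, var_prior, xbar, var_data, n = 1) {
  a <- 1 / var_prior
  b <- n / var_data
  c(mu_post = (a * mu_prior + b * xbar) / (a + b),
    var_post = 1 / (a + b))
}
normal_update(75, 36, 85, 16)   # mu_post = 81.9, var_post = 11.1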
MIT18_05S14_class15_slides 364
January 1, 2017 13 /20
Concept question: conjugate priors
Which are conjugate priors?
1. none 2. a 3. b 4. c
5. a,b 6. a,c 7. b,c 8. a,b,c
MIT18_05S14_class15_slides 365
January 1, 2017 14 /20
Answer: 3. b
Exponential/Normal posterior:

f (θ|x) = c1 θ e^(−(θ−µprior )^2 /(2σ^2_prior ) − θx)
The factor of θ before the exponential means this is not the pdf of a
normal distribution. Therefore it is not a conjugate prior.
Binomial/Normal: It is clear that the posterior does not have the form of a
normal distribution.
MIT18_05S14_class15_slides 366
January 1, 2017 15 /20
Variance can increase
Normal-normal: variance always decreases with data.
Beta-binomial: variance usually decreases with data.
[Plot: pdfs of beta(2,12), beta(12,12), beta(21,12) and beta(21,19);
density scale 0 to 6.]
Suppose the prior has been set. Let x1 and x2 be two sets of data.
Which of the following are true.
(a) If the likelihoods f (x1 |θ) and f (x2 |θ) are the same then they
result in the same posterior.
(b) If x1 and x2 result in the same posterior then their likelihood
functions are the same.
(c) If the likelihoods f (x1 |θ) and f (x2 |θ) are proportional then they
result in the same posterior.
(d) If two likelihood functions are proportional then they are equal.
answer: (4): a: true; b: false, the likelihoods are only proportional;
c: true, scale factors don’t matter; d: false.
MIT18_05S14_class15_slides 368
January 1, 2017 17 /20
Concept question: strong priors
Say we have a bent coin with unknown probability of heads θ.
[Plot: six candidate posterior pdfs labeled A-F on θ ∈ [0, 1]; vertical scale 0 to 80.]
MIT18_05S14_class15_slides 369
January 1, 2017 18 /20
Solution to concept question
MIT18_05S14_class15_slides 370
January 1, 2017 19 /20
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .
MIT18_05S14_class15_slides 371
Choosing Priors
Probability Intervals
18.05 Spring 2014
1. none 2. a 3. b 4. c
5. a,b 6. a,c 7. b,c 8. a,b,c
MIT18_05S14_class16_slides 374
April 18, 2017 3 / 33
Answer: 3. b
We have a conjugate prior if the posterior as a function of θ has the same
form as the prior.
Exponential/Normal posterior:

f (θ|x) = c1 θ e^(−(θ−µprior )^2 /(2σ^2_prior ) − θx)
The factor of θ before the exponential means this is not the pdf of a
normal distribution. Therefore it is not a conjugate prior.
Binomial/Normal: It is clear that the posterior does not have the form of a
normal distribution.
MIT18_05S14_class16_slides 375
April 18, 2017 4 / 33
Concept question: strong priors
Say we have a bent coin with unknown probability of heads θ.
We are convinced that θ ≤ 0.7.
Our prior is uniform on [0, 0.7] and 0 from 0.7 to 1.
We flip the coin 65 times and get 60 heads.
Which of the graphs below is the posterior pdf for θ?
[Plot: six candidate posterior pdfs labeled A-F on θ ∈ [0, 1]; vertical scale 0 to 80.]
MIT18_05S14_class16_slides 376
April 18, 2017 5 / 33
Solution to concept question
MIT18_05S14_class16_slides 377
April 18, 2017 6 / 33
Two parameter tables: Malaria
D+ D−
S 2 13 15
N 14 1 15
16 14 30
MIT18_05S14_class16_slides 378
April 18, 2017 7 / 33
Model
MIT18_05S14_class16_slides 379
April 18, 2017 8 / 33
Color-coded two-dimensional tables
Hypotheses
θN \θS 0 0.2 0.4 0.6 0.8 1
1 (0,1) (.2,1) (.4,1) (.6,1) (.8,1) (1,1)
0.8 (0,.8) (.2,.8) (.4,.8) (.6,.8) (.8,.8) (1,.8)
0.6 (0,.6) (.2,.6) (.4,.6) (.6,.6) (.8,.6) (1,.6)
0.4 (0,.4) (.2,.4) (.4,.4) (.6,.4) (.8,.4) (1,.4)
0.2 (0,.2) (.2,.2) (.4,.2) (.6,.2) (.8,.2) (1,.2)
0 (0,0) (.2,0) (.4,0) (.6,0) (.8,0) (1,0)
MIT18_05S14_class16_slides 381
April 18, 2017 10 / 33
Color-coded two-dimensional tables
Flat prior
θN \θS 0 0.2 0.4 0.6 0.8 1 p(θN )
1 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.8 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.6 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.4 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0.2 1/36 1/36 1/36 1/36 1/36 1/36 1/6
0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(θS ) 1/6 1/6 1/6 1/6 1/6 1/6 1
MIT18_05S14_class16_slides 382
April 18, 2017 11 / 33
Color-coded two-dimensional tables
Posterior to the flat prior
θN \θS 0 0.2 0.4 0.6 0.8 1 p(θN |data)
1 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0.8 0.00000 0.88075 0.08370 0.00097 0.00000 0.00000 0.96542
0.6 0.00000 0.03139 0.00298 0.00003 0.00000 0.00000 0.03440
0.4 0.00000 0.00016 0.00002 0.00000 0.00000 0.00000 0.00018
0.2 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
0 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
p(θS |data) 0.00000 0.91230 0.08670 0.00100 0.00000 0.00000 1.00000
In 1983 in Michigan:
19/19 ECMO babies survived and 0/3 CVT babies survived.
Later Harvard ran a randomized study:
28/29 ECMO babies survived and 6/10 CVT babies survived.
MIT18_05S14_class16_slides 385
April 18, 2017 14 / 33
Board question: updating two parameter priors
Michigan: 19/19 ECMO babies and 0/3 CVT babies survived.
Harvard: 28/29 ECMO babies and 6/10 CVT babies survived.
θE = probability that an ECMO baby survives
θC = probability that a CVT baby survives
Consider the values 0.125, 0.375, 0.625, 0.875 for θE and θC .
1. Make the 4 × 4 prior table for a flat prior.
2. Based on the Michigan results, create a reasonable informed prior
table for analyzing the Harvard results (unnormalized is fine).
3. Make the likelihood table for the Harvard results.
4. Find the posterior table for the informed prior.
5. Using the informed posterior, compute the probability that ECMO
is better than CVT.
6. Also compute the posterior probability that θE − θC ≥ 0.6.
(The posted solutions will also show 4-6 for the flat prior.)
MIT18_05S14_class16_slides 386
April 18, 2017 15 / 33
Solution
Flat prior
θE
0.125 0.375 0.625 0.875
0.125 0.0625 0.0625 0.0625 0.0625
θC 0.375 0.0625 0.0625 0.0625 0.0625
0.625 0.0625 0.0625 0.0625 0.0625
0.875 0.0625 0.0625 0.0625 0.0625
Likelihood
Entries in the likelihood table are θE^28 (1 − θE ) θC^6 (1 − θC )^4 . We don’t bother
including the binomial coefficients since they are the same for every entry.
θE
0.125 0.375 0.625 0.875
0.125 1.012e-31 1.653e-18 1.615e-12 6.647e-09
θC 0.375 1.920e-29 3.137e-16 3.065e-10 1.261e-06
0.625 5.332e-29 8.713e-16 8.513e-10 3.504e-06
0.875 4.95e-30 8.099e-17 7.913e-11 3.257e-07
(Posteriors are on the next slides.)
MIT18_05S14_class16_slides 388
April 18, 2017 17 / 33
Solution continued
Flat posterior
The posterior table is found by multiplying the prior and likelihood tables
and normalizing so that the sum of the entries is 1. We call the posterior
derived from the flat prior the flat posterior. (Of course the flat posterior
is not itself flat.)
θE
0.125 0.375 0.625 0.875
0.125 1.984e-26 3.242e-13 3.167e-07 0.001
θC 0.375 3.765e-24 6.152e-11 6.011e-05 0.247
0.625 1.046e-23 1.709e-10 1.670e-04 0.687
0.875 9.721e-25 1.588e-11 1.552e-05 0.0639
The boxed entries represent most of the probability where θE > θC .
Informed posterior
θE
0.125 0.375 0.625 0.875
0.125 1.116e-26 1.823e-13 3.167e-07 0.001
θC 0.375 2.117e-24 3.460e-11 6.010e-05 0.2473
0.625 5.882e-24 9.612e-11 1.669e-04 0.6871
0.875 5.468e-25 8.935e-12 1.552e-05 0.0638
MIT18_05S14_class16_slides 392
April 18, 2017 21 / 33
Probability intervals for normal distributions
1. Shrink 2. Stretch.
MIT18_05S14_class16_slides 395
April 18, 2017 24 / 33
Reading questions
The following slides contain bar graphs of last year’s responses to the
reading questions. Each bar represents one student’s estimate of their own
50% probability interval (from the 0.25 quantile to the 0.75 quantile).
Here is what we found for answers to the questions:
2. Number of girls born in the world each year: I had trouble finding a
reliable source. Wiki.answers.com gave the number of 130 million births in
2005. If we take what seems to be the accepted ratio of 1.07 boys born for
every girl then 130/2.07 = 62.8 million baby girls.
MIT18_05S14_class16_slides 396
April 18, 2017 25 / 33
Reading questions continued
MIT18_05S14_class16_slides 397
April 18, 2017 26 / 33
Subjective probability 1 (50% probability interval)
[Bar graph of student 50% intervals; axis from 10 to 50000.]
MIT18_05S14_class16_slides 398
April 18, 2017 27 / 33
Subjective probability 2 (50% probability interval)
[Bar graph of student 50% intervals; axis from 100 to 500000000.]
MIT18_05S14_class16_slides 399
April 18, 2017 28 / 33
Subjective probability 3 (50% probability interval)
Percentage of African-Americans in US
True value: 13
[Bar graph of student 50% intervals; axis from 0 to 100.]
MIT18_05S14_class16_slides 400
April 18, 2017 29 / 33
Subjective probability 3 censored (50% probability interval)
[Bar graph of student 50% intervals; axis from 5 to 100.]
MIT18_05S14_class16_slides 401
April 18, 2017 30 / 33
Subjective probability 4 (50% probability interval)
MIT18_05S14_class16_slides 402
April 18, 2017 31 / 33
Subjective probability 5 (50% probability interval)
[Bar graph of student 50% intervals; axis from 100 to 1500000.]
MIT18_05S14_class16_slides 403
April 18, 2017 32 / 33
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
MIT18_05S14_class16_slides 404
Frequentist Statistics and Hypothesis Testing
18.05 Spring 2014
http://xkcd.com/539/
MIT18_05S14_class17_slides 406
January 2, 2017 2 /25
Frequentist school of statistics
Probabilities are interpreted as long-run frequencies in a repeatable
random experiment.
Statistics (art)

Bayesian: P_posterior (H|D) = P(D|H) P_prior (H)/P(D)
Frequentist: likelihood L(H; D) = P(D|H)

Bayesians require a prior, so they develop one from the best information
they have. Without a known prior, frequentists draw inferences from just
the likelihood function.
MIT18_05S14_class17_slides 408
January 2, 2017 4 /25
Disease screening redux: probability
[Tree: hypotheses H = sick / H = healthy with likelihoods P(D | H) = 0.99, 0.01;
each branch splits into D = positive test / negative test.]
MIT18_05S14_class17_slides 409
January 2, 2017 5 /25
Disease screening redux: statistics
[Same tree, but the priors P(H) on the hypotheses are marked ‘?’;
likelihoods P(D | H) = 0.99, 0.01 as before.]
MIT18_05S14_class17_slides 410
January 2, 2017 6 /25
Concept question
Each day Jane arrives X hours late to class, with X ∼ uniform(0, θ),
where θ is unknown. Jon models his initial belief about θ by a prior
pdf f (θ). After Jane arrives x hours late to the next class, Jon
computes the likelihood function f (x|θ) and the posterior pdf f (θ|x).
MIT18_05S14_class17_slides 411
January 2, 2017 7 /25
Concept answer
answer: 3. likelihood
Both the prior and posterior are probability distributions on the possible
values of the unknown parameter θ, i.e. a distribution on hypothetical
values. The frequentist does not consider them valid.
MIT18_05S14_class17_slides 412
January 2, 2017 8 /25
Statistics are computed from data
MIT18_05S14_class17_slides 413
January 2, 2017 9 /25
Concept questions
1. Yes 2. No
1. The median of x1 , . . . , xn .
4. Yes. x̄ depends only on the data, so the set of values within 1 of x̄ can
all be found by working with the data.
MIT18_05S14_class17_slides 415
January 2, 2017 11 /25
Cards and NHST
MIT18_05S14_class17_slides 416
January 2, 2017 12 /25
NHST ingredients
Null hypothesis: H0
Alternative hypothesis: HA
Test statistic: x
Rejection region: reject H0 in favor of HA if x is in this region
f (x|H0 )
x
x2 -3 0 x1 3
reject H0 don’t reject H0 reject H0
Two strategies:
1. Choose rejection region then compute significance level.
[Bar plot of p(x|H0 ) for x = 0, . . . , 10, with reference lines at .05 and .15.]
x 0 1 2 3 4 5 6 7 8 9 10
p(x|H0 ) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001
MIT18_05S14_class17_slides 419

2. Given significance level α = .05 find a two-sided rejection region.
1. α = 0.11: the two-sided rejection region {0, 1, 2, 8, 9, 10} has probability
2(.001 + .010 + .044) = 0.11 under H0 .
x 0 1 2 3 4 5 6 7 8 9 10
p(x|H0 ) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001
2. α = 0.05: the region {0, 1, 9, 10} has probability 2(.001 + .010) = 0.022 ≤ 0.05;
adding x = 2 and x = 8 would push the probability over 0.05.
x 0 1 2 3 4 5 6 7 8 9 10
p(x|H0 ) .001 .010 .044 .117 .205 .246 .205 .117 .044 .010 .001
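These tail sums are easy to check in R (a sketch; the p(x|H0 ) row matches the binomial(10, 0.5) pmf):

p <- dbinom(0:10, 10, 0.5)    # null pmf, matches the table
sum(p[c(1:3, 9:11)])          # region {0,1,2,8,9,10}: 0.109, about 0.11
sum(p[c(1:2, 10:11)])         # region {0,1,9,10}: 0.021 <= 0.05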
MIT18_05S14_class17_slides 420
January 2, 2017 16 /25
Concept question
The null and alternate pdfs are shown on the following plot
f (x|HA ) f (x|H0 )
R3
R2
R1 R4
x
reject H0 region . non-reject H0 region
The significance level of the test is given by the area of which region?
1. R1 2. R2 3. R3 4. R4
5. R1 + R2 6. R2 + R3 7. R2 + R3 + R4 .
answer: 6. R2 + R3 . This is the area under the pdf for H0 above the
rejection region.
MIT18_05S14_class17_slides 421
January 2, 2017 17 /25
z-tests, p-values
Hypotheses: H0 : xi ∼ N(µ0 , σ^2 )
HA : two-sided: µ ≠ µ0 , or one-sided: µ > µ0

z-value: standardized x̄: z = (x̄ − µ0 )/(σ/√n)
Test statistic: z
Null distribution: Assuming H0 : z ∼ N(0, 1).
p-values: Right-sided p-value: p = P(Z > z | H0 )
(Two-sided p-value: p = P(|Z | > z | H0 ))
Significance level: For p ≤ α we reject H0 in favor of HA .
Note: Could have used x as test statistic and N(µ0 , σ^2 ) as the null
distribution.
MIT18_05S14_class17_slides 422
January 2, 2017 18 /25
Visualization
H0 : µ = 100
Collect 9 data points: x̄ = 112. So, z = (112 − 100)/(15/3) = 2.4.
Can we reject H0 at significance level 0.05?
f (z|H0 ) ∼ N(0, 1)
z0.05 = 1.64
α = pink + red = 0.05
p = red = 0.008
z
z0.05 2.4
non-reject H0 reject H0
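The same computation in R (a sketch; σ = 15 and n = 9 as in the example):

z <- (112 - 100) / (15 / sqrt(9))   # z = 2.4
1 - pnorm(z)                        # one-sided p-value: about 0.008
qnorm(0.95)                         # critical value z_0.05 = 1.64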
MIT18_05S14_class17_slides 423
January 2, 2017 19 /25
Board question
f (z|H0 ) ∼ N(0, 1)
z0.025 = 1.96
z0.975 = −1.96
α = red = 0.05
z
−1.96 0 z=1 1.96
reject H0 non-reject H0 reject H0
MIT18_05S14_class17_slides 425
January 2, 2017 21 /25
Solution continued
(v) The z-value not being in the rejection region tells us exactly the same
thing as the p-value being greater than the significance level, i.e. don’t
reject the null hypothesis H0 .
MIT18_05S14_class17_slides 426
January 2, 2017 22 /25
Board question
Two coins: probability of heads is 0.5 for C1 ; and 0.6 for C2 .
We pick one at random, flip it 8 times and get 6 heads.
Under H0 the probability of heads is 0.5. Using the table we find a one-
sided rejection region {7, 8}. That is, we will reject H0 in favor of HA only
if the number of heads is 7 or 8.
Since the value of our data x = 6 is not in our rejection region we do not
reject H0 .
Now under H0 the probability of heads is 0.6. Using the table we find a
one-sided rejection region {0, 1, 2}. That is, we will reject H0 in favor of
HA only if the number of heads is 0, 1 or 2.
Since the value of our data x = 6 is not in our rejection region we do not
reject H0 .
In NHST the null hypothesis is the cautious default choice. That is, we
only reject H0 if the data is extremely unlikely when we assume H0 . This
is not the case for either C1 or C2 .
MIT18_05S14_class17_slides 428
January 2, 2017 24 /25
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
MIT18_05S14_class17_slides 429
Null Hypothesis Significance Testing
p-values, significance level, power, t-tests
18.05 Spring 2014
f (x|H0 )
x
reject H0 don’t reject H0 reject H0
x = test statistic
MIT18_05S14_class18_slides 432
January 1, 2017 3 /28
Extreme data and p-values
Hypotheses: H0 , HA .
x
cα 4.2
don’t reject H0 reject H0
MIT18_05S14_class18_slides 434
January 1, 2017 5 /28
Extreme data and p-values
x
2.1 cα
don’t reject H0 reject H0
MIT18_05S14_class18_slides 435
January 1, 2017 6 /28
Critical values
Critical values:
MIT18_05S14_class18_slides 436
January 1, 2017 7 /28
Two-sided p-values
These are trickier: what does ‘at least as extreme’ mean in this case?
Remember the p-value is a trick for deciding if the test statistic is in
the rejection region.
If the significance (rejection) probability is split evenly between the
left and right tails then
f (x|H0 )
x
c1−α/2 x cα/2
reject H0 don’t reject H0 reject H0
x is outside the rejection region, so p > α: do not reject H0
MIT18_05S14_class18_slides 437
January 1, 2017 8 /28
Concept question
1. You collect data from an experiment and do a left-sided z-test
with significance 0.1. You find the z-value is 1.8
(i) Which of the following computes the critical value for the
rejection region.
(a) pnorm(0.1, 0, 1) (b) pnorm(0.9, 0, 1)
(c) pnorm(0.95, 0, 1) (d) pnorm(1.8, 0, 1)
(e) 1 - pnorm(1.8, 0, 1) (f) qnorm(0.05, 0, 1)
(g) qnorm(0.1, 0, 1) (h) qnorm(0.9, 0, 1)
(i) qnorm(0.95, 0, 1)
(ii) Which of the above computes the p-value for this experiment.
Significance level = P(reject H0 | H0 ) = P(false positive)
Power = P(reject H0 | HA ) = 1 − P(type II error) = P(true positive)
answer:
1. Significance level = P(x in rejection region | H0 ) = 0.11
2. θ = 0.6: power = P(x in rejection region | HA ) = 0.18
θ = 0.7: power = P(x in rejection region | HA ) = 0.383
MIT18_05S14_class18_slides 440
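These power values can be reproduced in R (a sketch, assuming the binomial(10, θ) test with rejection region {0, 1, 2, 8, 9, 10}, which matches the 0.11 significance level above):

region <- c(0:2, 8:10)
power  <- function(theta) sum(dbinom(region, 10, theta))
power(0.5)   # significance level: 0.11
power(0.6)   # power against theta = 0.6: 0.18
power(0.7)   # power against theta = 0.7: 0.383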
January 1, 2017 11 /28
Concept question
f (x|HA ) f (x|H0 )
R3
R2
R1 R4
x
reject H0 region . non-reject H0 region
MIT18_05S14_class18_slides 441
January 1, 2017 12 /28
Concept question
f (x|HA ) f (x|H0 )
x
reject H0 region . do not reject H0 region
f (x|HA ) f (x|H0 )
x
reject H0 region . do not reject H0 region
(a) Top graph (b) Bottom graph
MIT18_05S14_class18_slides 442
January 1, 2017 13 /28
Solution
Power = P(x in rejection region | HA ). In the top graph almost all the
probability of HA is in the rejection region, so the power is close to 1.
MIT18_05S14_class18_slides 443
January 1, 2017 14 /28
Discussion question
MIT18_05S14_class18_slides 444
January 1, 2017 15 /28
One-sample t-test
http://mathlets.org/mathlets/t-distribution/
MIT18_05S14_class18_slides 445
January 1, 2017 16 /28
Board question: z and one-sample t-test
For both problems use significance level α = 0.05.
Suppose H0 : µ = 0; HA : µ ≠ 0. If the test statistic is in the rejection
region we reject H0 in favor of HA .
3. We’ll use the Studentized t = (x̄ − µ)/(s/√n) for the test statistic. The
null distribution for t is t3 . For the data we have t = 5/√3. This is a
two-sided test so the p-value is p = 2 P(T > 5/√3), where T ∼ t3 .
MIT18_05S14_class18_slides 448
January 1, 2017 19 /28
Two-sample t-test: equal variances
Data: we assume normal data with µx , µy and (same) σ unknown:
x1 , . . . , xn ∼ N(µx , σ 2 ), y1 , . . . , ym ∼ N(µy , σ 2 )
Null hypothesis H0 : µx = µy .
Pooled variance: sp^2 = [(n − 1)sx^2 + (m − 1)sy^2 ]/(n + m − 2) · (1/n + 1/m).

Test statistic: t = (x̄ − ȳ)/sp
Null distribution: f (t | H0 ) is the pdf of T ∼ t(n + m − 2)
Real data from 1408 women admitted to a maternity hospital for (i)
medical reasons or through (ii) unbooked emergency admission. The
duration of pregnancy is measured in complete weeks from the
beginning of the last menstrual period.
Medical: 775 obs. with x̄ = 39.08 and s 2 = 7.77.
Emergency: 633 obs. with x̄ = 39.60 and s 2 = 4.95
MIT18_05S14_class18_slides 450
January 1, 2017 21 /28
Solution
sp^2 = [774(7.77) + 632(4.95)]/1406 · (1/775 + 1/633) = 0.0187

The t statistic for the null distribution is

t = (x̄ − ȳ)/sp = −3.8064

Rather than compute the two-sided p-value using 2*pt(-3.8064, 1406)
we simply note that with 1406 degrees of freedom the t distribution is
essentially standard normal and 3.8064 is almost 4 standard deviations. So
the p-value is tiny and we reject H0 : the mean durations differ.
2. We assumed the data was normal and that the two groups had equal
variances. Given the big difference in the sample variances this assumption
might not be warranted.
Note: there are significance tests to see if the data is normal and to see if
the two groups have the same variance.
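In R, a sketch that reproduces these numbers from the summary statistics:

n <- 775; m <- 633
sp2 <- (774 * 7.77 + 632 * 4.95) / (n + m - 2) * (1/n + 1/m)  # 0.0187
t_stat <- (39.08 - 39.60) / sqrt(sp2)                         # -3.81
2 * pt(t_stat, df = n + m - 2)                                # two-sided p-value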
MIT18_05S14_class18_slides 452
January 1, 2017 23 /28
Table discussion: Type I errors Q1
MIT18_05S14_class18_slides 453
January 1, 2017 25 /28
Table discussion: Type I errors Q2
2. Jerry desperately wants to cure diseases but he is terrible at
designing effective treatments. He is however a careful scientist and
statistician, so he randomly divides his patients into control and
treatment groups. The control group gets a placebo and the
treatment group gets the experimental treatment. His null hypothesis
H0 is that the treatment is no better than the placebo. He uses a
significance level of α = 0.05. If his p-value is less than α he publishes
a paper claiming the treatment is significantly better than a placebo.
(a) Since his treatments are never, in fact, effective what percentage
of his experiments result in published papers?
(b) What percentage of his published papers contain type I errors,
i.e. describe treatments that are no better than placebo?
answer: (a) Since in all of his experiments H0 is true, roughly 5%, i.e. the
significance level, of his experiments will have p < 0.05 and be published.
(b) Since he’s always wrong, all of his published papers contain type I
errors.
MIT18_05S14_class18_slides 454
January 1, 2017 26 /28
Table discussions: Type I errors: Q3
3. Efrat is a genius at designing treatments, so all of her proposed
treatments are effective. She’s also a careful scientist and statistician
so she too runs double-blind, placebo controlled, randomized studies.
Her null hypothesis is always that the new treatment is no better than
the placebo. She also uses a significance level of α = 0.05 and
publishes a paper if p < α.
(a) How could you determine what percentage of her experiments
result in publications?
(b) What percentage of her published papers contain type I errors,
i.e. describe treatments that are no better than placebo?
answer: 3. (a)The percentage that get published depends on the power
of her treatments. If they are only a tiny bit more effective than placebo
then roughly 5% of her experiments will yield a publication. If they are a
lot more effective than placebo then as many as 100% could be published.
(b) None of her published papers contain type I errors.
MIT18_05S14_class18_slides 455
January 1, 2017 27 /28
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .
MIT18_05S14_class18_slides 456
Null Hypothesis Significance Testing
Gallery of Tests
What is a simulation?
– Run an experiment with pseudo-random data instead of real data.
– By doing this many times we can estimate the statistics for the
experiment.
Why do a simulation?
– In the real world we are not omniscient.
– In the real world we don’t have infinite resources.
What was the point of Studio 8?
– To simulate some simple significance tests and compare various
frequencies.
– Simulated P(reject|H0 ) ≈ α
– Simulated P(reject|HA ) ≈ power
– P(H0 |reject) can be anything depending on the (usually)
unknown prior
MIT18_05S14_class19_slides 458
January 1, 2017 2 /22
Concept question
(a) 19/1 (b) 1/19 (c) 1/20 (d) 1/24 (e) unknown
answer: (e) unknown. Frequentist methods only give probabilities for data
under an assumed hypothesis. They do not give probabilities or odds for
hypotheses. So we don’t know the odds for distribution means
MIT18_05S14_class19_slides 459
January 1, 2017 3 /22
General pattern of NHST
You are interested in whether to reject H0 in favor of HA .
Design:
Design experiment to collect data relevant to hypotheses.
Alternatively, you can choose both the significance level and the
Implementation:
Run the experiment to collect data.
Compute the statistic x and the corresponding p-value.
If p < α, reject H0 .
MIT18_05S14_class19_slides 460
January 1, 2017 4 /22
Chi-square test for homogeneity
In this setting homogeneity means that the data sets are all drawn
from the same distribution.
Use a chi-square test to compare the cure rates for the three
treatments, i.e. to test if all three cure rates are the same.
MIT18_05S14_class19_slides 461
January 1, 2017 5 /22
Solution
We include the marginal values (in red). These are all needed to
compute the expected counts.
treatments have differing efficacy.
MIT18_05S14_class19_slides 463
January 1, 2017 7 /22
Board question: Khan’s restaurant
MIT18_05S14_class19_slides 464
January 1, 2017 8 /22
Solution
G = 2 Σ Oi log(Oi /Ei ) = 11.39

X^2 = Σ (Oi − Ei )^2 /Ei = 11.44
df = 6 − 1 = 5 (6 cells, compute 1 value –the total count– from the data)
p = 1-pchisq(11.39,5) = 0.044.
So, at a significance level of 0.05 we reject the null hypothesis in favor of
the alternative that the owner’s distribution is wrong.
MIT18_05S14_class19_slides 465
January 1, 2017 9 /22
Board question: genetic linkage
In 1905, William Bateson, Edith Saunders, and Reginald Punnett
were examining flower color and pollen shape in sweet pea plants by
MIT18_05S14_class19_slides 468
January 1, 2017 12 /22
F -distribution
[Plot: pdfs of the F(3, 4), F(10, 15) and F(30, 15) distributions for x from
0 to 10; density scale 0 to 1.]
MIT18_05S14_class19_slides 469
January 1, 2017 13 /22
F -test = one-way ANOVA
Like t-test but for n groups of data with m data points each.
yi,j ∼ N(µi , σ 2 ), yi,j = j th point in ith group
The table shows recovery time in days for three medical treatments.
1. Set up and run an F-test testing if the average recovery time is the
same for all three treatments.
2. Based on the test, what might you conclude about the treatments?
T1 T2 T3
6 8 13
8 12 9
4 9 11
5 11 8
3 6 7
4 8 12
H0 is that the means of the 3 treatments are the same. HA is that they
are not.
Our test statistic w is computed following the procedure from a previous
slide. We get that the test statistic w is approximately 9.25. The p-value
is approximately 0.0024. We reject H0 in favor of the hypothesis that the
means of three treatments are not the same.
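A sketch of this F-test in R using the recovery-time table above:

time  <- c(6, 8, 4, 5, 3, 4,  8, 12, 9, 11, 6, 8,  13, 9, 11, 8, 7, 12)
group <- factor(rep(c("T1", "T2", "T3"), each = 6))
summary(aov(time ~ group))   # F = 9.26 on (2, 15) df, p = 0.0024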
MIT18_05S14_class19_slides 472
January 1, 2017 16 /22
Concept question: multiple-testing
2. Suppose we use the significance level 0.05 for each of the 15 tests.
Assuming the null hypothesis, what is the probability that we reject at
least one of the 15 null hypotheses?
(a) Less than 0.05 (b) 0.05 (c) 0.10 (d) Greater than 0.25
Discussion: Recall that there is an F -test that tests if all the means
are the same. What are the trade-offs of using the F -test rather than
many two-sample t-tests?
answer: Solution on next slide.
MIT18_05S14_class19_slides 473
January 1, 2017 17 /22
Solution
(From Rice, Mathematical Statistics and Data Analysis, 2nd ed. p.489)
Use a chi-square test with significance level 0.01 to test the hypothesis
that the number of marriages and education level are independent.
MIT18_05S14_class19_slides 475
January 1, 2017 19 /22
Solution
The null hypothesis is that the cell probabilities are the product of the
marginal probabilities. Assuming the null hypothesis we estimate the
marginal probabilities in red and multiply them to get the cell probabilities
in blue.
Education Married once Married multiple times Total
College 0.365 0.061 611/1436
No college 0.492 0.082 825/1436
Total 1231/1436 205/1436 1
We then get expected counts by multiplying the cell probabilities by the
total number of women surveyed (1436). The table shows the observed,
expected counts:
Education Married once Married multiple times
College 550, 523.8 61, 87.2
No college 681, 707.2 144, 117.8
MIT18_05S14_class19_slides 476
January 1, 2017 20 /22
Solution continued
We then have a chi-square statistic of 16.55 with df = (2 − 1)(2 − 1) = 1, so

p = 1-pchisq(16.55,1) = 0.000047
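In R, a sketch from the observed counts (chisq.test gives the Pearson statistic, which is close to the value above; the conclusion is the same):

obs <- matrix(c(550, 61, 681, 144), nrow = 2, byrow = TRUE,
              dimnames = list(c("College", "No college"),
                              c("Once", "Multiple")))
chisq.test(obs, correct = FALSE)   # X^2 on 1 df, p-value near 0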
MIT18_05S14_class19_slides 477
January 1, 2017 21 /22
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .
MIT18_05S14_class19_slides 478
Comparison of Bayesian and Frequentist Inference
18.05 Spring 2014
1. Give the test statistic, null distribution and rejection region for
each experiment. List all sequences of tosses that produce a test
statistic in the rejection region for each experiment.
2b. Let θ be the probability of heads. Four heads and a tail updates the
prior on θ, Beta(1,1), to the posterior Beta(5,2). Using R we can compute
P(θ > 0.5 | data) = 1 - pbeta(0.5, 5, 2) ≈ 0.89.
If the prior is good then the probability the coin is biased towards heads is
0.89.
MIT18_05S14_class20_slides 484
January 1, 2017 6 /10
Board question: Stop II
Since this was not significant, she then did 50 more trials and
Since this was not significant, she started over and computed
significant.
MIT18_05S14_class20_slides 485
January 1, 2017 7 /10
Solution
Reject Continue
MIT18_05S14_class20_slides 487
January 1, 2017 9 /10
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms .
MIT18_05S14_class20_slides 488
Confidence Intervals for Normal Data
18.05 Spring 2014
Today
Review of critical values and quantiles.
Computing z, t, χ2 confidence intervals for normal data.
Conceptual view of confidence intervals.
Confidence intervals for polling (Bernoulli distributions).
MIT18_05S14_class22-slde-a 490
January 1, 2017 2 / 19
Review of critical values and quantiles
P(Z ≤ qα ) = α and P(Z > zα ) = α

[Plot: standard normal pdf with left-tail area α at the quantile qα and
right-tail area α at the critical value zα .]
1. z.025 =
(a) -1.96 (b) -0.95 (c) 0.95 (d) 1.96 (e) 2.87
2. −z.16 =
(a) -1.33 (b) -0.99 (c) 0.99 (d) 1.33 (e) 3.52
MIT18_05S14_class22-slde-a 492
January 1, 2017 4 / 19
Solution
1. answer: z.025 = 1.96. By definition P(Z > z.025 ) = 0.025. This is the
same as P(Z ≤ z.025 ) = 0.975. Either from memory, a table or using the
R function qnorm(.975) we get the result.
2. answer: −z.16 = −0.99. We recall that P(|Z | < 1) ≈ .68. Since half the
leftover probability is in the right tail we have P(Z > 1) ≈ 0.16. Thus
z.16 ≈ 1, so −z.16 ≈ −1.
MIT18_05S14_class22-slde-a 493
January 1, 2017 5 / 19
Computing confidence intervals from normal data
Suppose the data x1 , . . . , xn is drawn from N(µ, σ 2 )
Confidence level = 1 − α
z confidence interval for the mean (σ known)

[x̄ − zα/2 · σ/√n, x̄ + zα/2 · σ/√n]

t confidence interval for the mean (σ unknown)

[x̄ − tα/2 · s/√n, x̄ + tα/2 · s/√n]

χ2 confidence interval for σ^2

[(n − 1)s^2 /cα/2 , (n − 1)s^2 /c1−α/2 ]

t and χ2 have n − 1 degrees of freedom.
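A sketch of the z and t intervals in R; the data values here are hypothetical, chosen only for illustration:

x <- c(1, 2, 3, 4); n <- length(x); alpha <- 0.05
sigma <- 2   # assumed known for the z interval
mean(x) + c(-1, 1) * qnorm(1 - alpha/2) * sigma / sqrt(n)      # z interval
mean(x) + c(-1, 1) * qt(1 - alpha/2, n - 1) * sd(x) / sqrt(n)  # t interval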
MIT18_05S14_class22-slde-a 494
January 1, 2017 6 / 19
z rule of thumb
[x̄ − 2σ/√n, x̄ + 2σ/√n]
MIT18_05S14_class22-slde-a 495
January 1, 2017 7 / 19
Board question: computing confidence intervals
MIT18_05S14_class22-slde-a 496
January 1, 2017 8 / 19
Solution
x̄ = 2.5, s^2 = 1.667, s = 1.29
σ/√n = 1, s/√n = 0.645.
1. z.05 = 1.644: the 90% z confidence interval is 2.5 ± 1.644 · 1 = [0.856, 4.144].

Frequentists treat parameters as fixed, not random: a 95% confidence
interval of [1.2, 3.4] for µ does not mean that µ lies in [1.2, 3.4] with
probability 0.95.
Applet: http://mathlets.org/mathlets/confidence-intervals/
MIT18_05S14_class22-slde-a 498
January 1, 2017 10 / 19
Table discussion
How does the width of a confidence interval for the mean change if:
(A) it gets wider (B) it gets narrower (C) it stays the same.
MIT18_05S14_class22-slde-a 499
January 1, 2017 11 / 19
Answers
MIT18_05S14_class22-slde-a 500
January 1, 2017 12 / 19
Intervals and pivoting
x: sample mean (statistic)
µ0 : hypothesized mean (not known)
Pivoting: x is in the interval µ0 ± 2.3 ⇔ µ0 is in the interval x ± 2.3.
−2 −1 0 1 2 3 4
µ0 x
µ0 ± 1 this interval does not contain x
x±1 this interval does not contain µ0
µ0 ± 2.3 this interval contains x
x ± 2.3 this interval contains µ0
Algebra of pivoting:
MIT18_05S14_class22-slde-a 502
January 1, 2017 14 / 19
Solution
Confidence interval: x̄ ± zα/2 · σ/√n
Non-rejection region: µ0 ± zα/2 · σ/√n
Since the intervals are the same width they either both contain the
other’s center or neither one does.

[Plot: the N(µ0 , σ^2 /n) pdf with non-rejection region µ0 ± zα/2 · σ/√n;
a sample mean x1 falls inside the region, x2 falls outside it.]
MIT18_05S14_class22-slde-a 503
MIT18_05S14_class22-slde-a 503
January 1, 2017 15 / 19
Polling: a binomial proportion confidence interval
Data x1 , . . . , xn from a Bernoulli(θ) distribution with θ unknown.
A conservative normal† (1 − α) confidence interval for θ is given by

[x̄ − zα/2 /(2√n), x̄ + zα/2 /(2√n)].

The proof uses the CLT and the observation σ = √(θ(1 − θ)) ≤ 1/2.

Political polls often give a margin-of-error of ±1/√n. This
rule-of-thumb corresponds to a 95% confidence interval:

[x̄ − 1/√n, x̄ + 1/√n].

(The proof is in the class 22 notes.)
(The proof is in the class 22 notes.)
Conversely, a margin of error of ±0.05 means 400 people were polled.
† There are many types of binomial proportion confidence intervals. See
http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
MIT18_05S14_class22-slde-a 504
January 1, 2017 16 / 19
Board question
1. How many people would you have to poll to have a margin of error
of .01 with 95% confidence? (You can do this in your head.)
2. How many people would you have to poll to have a margin of error
of .01 with 80% confidence. (You’ll want R or other calculator here.)
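A sketch of the computation in R, using the conservative interval above (margin of error = zα/2 /(2√n), so n = (zα/2 /(2 · MOE))^2 ):

moe <- 0.01
(qnorm(0.975) / (2 * moe))^2   # 95%: about 9604 (rule of thumb: 1/moe^2 = 10000)
(qnorm(0.90)  / (2 * moe))^2   # 80%: about 4106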
MIT18_05S14_class22-slde-a 506
January 1, 2017 18 / 19
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
MIT18_05S14_class22-slde-a 507
Confidence Intervals II
18.05 Spring 2014
MIT18_05S14_class23-slde-a 509
January 1, 2017 2 / 18
Polling confidence interval
Also called a binomial proportion confidence interval
Polling means sampling from a Bernoulli(θ) distribution,
i.e. data x1 , . . . , xn ∼ Bernoulli(θ).
√
Political polls often give a margin-of-error of ±1/ n, i.e. they use
the rule-of-thumb 95% confidence interval.
MIT18_05S14_class23-slde-a 511
January 1, 2017 4 / 18
Board question
1. How many people would you have to poll to have a margin of error
of 0.01 with 95% confidence? (You can do this in your head.)
2. How many people would you have to poll to have a margin of error
of 0.01 with 80% confidence. (You’ll want R or other calculator here.)
MIT18_05S14_class23-slde-a 513
January 1, 2017 6 / 18
Concept question: overnight polling
MIT18_05S14_class23-slde-a 514
January 1, 2017 7 / 18
National Council on Public Polls: Press Release, Sept 1992
“The National Council on Public Polls expressed concern today about the
current spate of overnight Presidential polls. [...] Overnight polls do a
disservice to both the media and the research industry because of the
considerable potential for the results to be misleading. The overnight
interviewing period may well mean some methodological compromises, the
most serious of which is...”
...what?
This is called the large sample confidence interval.
Review: confidence intervals for normal data
Suppose the data x1 , . . . , xn is drawn from N(µ, σ²).
Confidence level = 1 − α
z confidence interval for the mean (σ known)
[ x̄ − zα/2 · σ/√n , x̄ + zα/2 · σ/√n ]   or   x̄ ± zα/2 · σ/√n
t confidence interval for the mean (σ unknown)
[ x̄ − tα/2 · s/√n , x̄ + tα/2 · s/√n ]   or   x̄ ± tα/2 · s/√n
χ² confidence interval for σ²
[ (n − 1)s²/cα/2 , (n − 1)s²/c1−α/2 ]
t and χ2 have n − 1 degrees of freedom.
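These three formulas map directly onto R's quantile functions. A minimal sketch with made-up data x and a hypothetical known σ:

# z, t, and chi-square confidence intervals for normal data.
x <- c(2.1, 3.4, 2.8, 3.0, 2.5); n <- length(x); alpha <- 0.05
sigma <- 0.5                                               # pretend sigma is known
mean(x) + c(-1, 1) * qnorm(1 - alpha/2) * sigma / sqrt(n)  # z interval
s <- sd(x)
mean(x) + c(-1, 1) * qt(1 - alpha/2, n - 1) * s / sqrt(n)  # t interval
(n - 1) * s^2 / qchisq(c(1 - alpha/2, alpha/2), n - 1)     # chi^2 interval for sigma^2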
Three views of confidence intervals
View 1: Using a standardized point statistic
Example. x1 , . . . , xn ∼ N(µ, σ²), where σ is known.
The standardized sample mean follows a standard normal distribution.
z = (x̄ − µ)/(σ/√n) ∼ N(0, 1)
Therefore:
P(−zα/2 < (x̄ − µ)/(σ/√n) < zα/2 | µ) = 1 − α
Pivot to:
P(x̄ − zα/2 · σ/√n < µ < x̄ + zα/2 · σ/√n | µ) = 1 − α
This is the (1 − α) confidence interval:
x̄ ± zα/2 · σ/√n
Think of it as x̄ ± error.
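A quick way to make this "confidence" concrete is simulation. A minimal R sketch (µ, σ, n are made-up values, not from the slides): in repeated samples the random interval x̄ ± zα/2 · σ/√n should contain µ about 95% of the time.

# Simulate repeated samples and count how often the z interval covers mu.
set.seed(1)
mu <- 10; sigma <- 3; n <- 25; alpha <- 0.05
z <- qnorm(1 - alpha/2)
covers <- replicate(10000, {
  xbar <- mean(rnorm(n, mu, sigma))
  abs(xbar - mu) < z * sigma / sqrt(n)
})
mean(covers)  # approximately 0.95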
View 1: Other standardized statistics
X² = (n − 1)s²/σ² ∼ χ²(n − 1)
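Pivoting this statistic works the same way. A one-line sketch, using the cα/2 notation from the review slide (so P(X² > cα/2) = α/2):

P( c1−α/2 < (n − 1)s²/σ² < cα/2 ) = 1 − α  ⇔  P( (n − 1)s²/cα/2 < σ² < (n − 1)s²/c1−α/2 ) = 1 − α,

which is exactly the χ² confidence interval for σ² given earlier.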
View 2: Using hypothesis tests
Consider a hypothesis test for H0 : θ = θ0 at significance level α.
Definition. Given x, the (1 − α) confidence interval contains all θ0
which are not rejected when they are the null hypothesis.
Board question: exact binomial confidence interval
Solution
For each θ, the non-rejection region is blue, the rejection region is red.
In each row, the rejection region has probability at most α = 0.10.
θ/x 0 1 2 3 4 5 6 7 8
.1 0.430 0.383 0.149 0.033 0.005 0.000 0.000 0.000 0.000
.3 0.058 0.198 0.296 0.254 0.136 0.047 0.010 0.001 0.000
.5 0.004 0.031 0.109 0.219 0.273 0.219 0.109 0.031 0.004
.7 0.000 0.001 0.010 0.047 0.136 0.254 0.296 0.198 0.058
.9 0.000 0.000 0.000 0.000 0.005 0.033 0.149 0.383 0.430
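Each row of this table is a binomial(8, θ) pmf; a hedged R sketch that reproduces the numbers (standard dbinom, rounded as in the table):

# Each row of the table is dbinom(0:8, 8, theta).
theta <- c(0.1, 0.3, 0.5, 0.7, 0.9)
probs <- t(sapply(theta, function(th) dbinom(0:8, 8, th)))
dimnames(probs) <- list(theta = theta, x = 0:8)
round(probs, 3)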
Definition:
A (1 − α) confidence interval for θ is an interval statistic Ix such that
P(Ix contains θ | θ) = 1 − α
for all possible values of θ (and hence for the true value of θ).
Bootstrapping
18.05 Spring 2014
Bootstrap terminology
Bootstrap principle
Empirical bootstrap
Parametric bootstrap
Empirical distribution of data
Data: x1 , x2 , . . . , xn (independent)
Example 1. Data: 1, 2, 2, 3, 8, 8, 8.
x*:       1    2    3    8
p*(x*):  1/7  2/7  1/7  3/7
Example 2. [Histogram of a larger data set: density axis from 0.00 to 0.20, x from 0 to 15.]
x̄ − δ*α/2 ≤ µ ≤ x̄ − δ*1−α/2
Empirical bootstrap confidence intervals
Use the data to estimate the variation of estimates based on the data!
δ* = θ* − θ̂ (θ* computed from a resample of the data)
δ = θ̂ − θ (θ the true value)
Bootstrap principle: the distribution of δ* approximates the distribution of δ.
answer: G. The program is essentially the same for all three statistics. All
that needs to change is the code for computing the specific statistic.
Board question
Data: 3 8 1 8 3 3
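As noted above, the bootstrap program is essentially the same for any statistic. A hedged R sketch for this data (function and variable names are mine, not the class script's; mean and median just plug in):

# Empirical bootstrap CI: resample, form delta* = theta* - theta_hat, pivot.
boot_ci <- function(x, stat, alpha = 0.1, nboot = 1000) {
  theta_hat <- stat(x)
  theta_star <- replicate(nboot, stat(sample(x, length(x), replace = TRUE)))
  delta_star <- theta_star - theta_hat
  q <- quantile(delta_star, c(alpha / 2, 1 - alpha / 2))
  c(theta_hat - q[2], theta_hat - q[1])  # [th - delta*_{alpha/2}, th - delta*_{1-alpha/2}]
}
x <- c(3, 8, 1, 8, 3, 3)
boot_ci(x, mean)
boot_ci(x, median)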
Solution: mean
x̄ = 4.33
Sorted δ*: -1.50, -0.33, -0.33, 0.00, 0.00, 0.00, 0.33, 0.50
Solution: median
x0.5 = median(x) = 3
Sorted δ*: -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.5
δ ∗ = θ∗ − θ̂.
δ = θ̂ − θ
Data: 6 5 5 5 7 4 ∼ binomial(8,θ)
1. Estimate θ.
(Try this without looking at your notes. We’ll show the previous slide
at the end)
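A hedged parametric-bootstrap sketch for this question (my code, not the official solution; θ̂ = x̄/8 is the MLE for binomial(8, θ) data):

# Parametric bootstrap: estimate theta, then resample from binomial(8, theta_hat).
x <- c(6, 5, 5, 5, 7, 4)
theta_hat <- mean(x) / 8  # about 0.667
theta_star <- replicate(1000, mean(rbinom(length(x), 8, theta_hat)) / 8)
delta_star <- theta_star - theta_hat
q <- quantile(delta_star, c(0.05, 0.95))
c(theta_hat - q[2], theta_hat - q[1])  # 90% bootstrap CI for theta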
Preview of linear regression
Linear Regression
18.05 Spring 2014
Modeling bivariate data as a function + noise
Ingredients
Bivariate data (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ).
Model: yi = f (xi ) + Ei
where f(x) is some function and Ei is random error.
Total squared error: Σ Ei² = Σ (yi − f(xi))²  (sums over i = 1, . . . , n)
Examples of f (x)
lines: y = ax + b + E
polynomials: y = ax² + bx + c + E
other: y = a/x + b + E
other: y = a sin(x) + b + E
Simple linear regression: finding the best fitting line
Model: yi = axi + b + Ei , where Ei ∼ N(0, σ²)
and where σ is a fixed value, the same for all data points.
Total squared error: Σ Ei² = Σ (yi − axi − b)²  (sums over i = 1, . . . , n)
Goal: Find the values of a and b that give the ‘best fitting line’.
Best fit: (least squares)
The values of a and b that minimize the total squared error.
Linear Regression: finding the best fitting polynomial
Bivariate data: (x1 , y1 ), . . . , (xn , yn ).
Goal:
Find the values of a, b, c that give the ‘best fitting parabola’.
Best fit: (least squares)
The values of a, b, c that minimize the total squared error.
[Two scatter plots of bivariate data: one with x from 0 to 60 and y from
0 to 50; one with x from −1 to 6 and y from 0 to 15.]
Board question: make it fit
Bivariate data:
(1, 3), (2, 1), (4, 4)
1. Find the best fitting line. 2. Find the best fitting parabola.
3. Set up the linear regression to find the best fitting cubic, but
don't take derivatives.
We didn’t really expect people to carry this all the way out by hand. If you
did you would have found that taking the partial derivatives and setting to
0 gives the following system of simultaneous linear equations.
y = ax + b.
y = ax² + bx + c.
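Rather than solving those systems by hand, R's lm minimizes the squared error directly. A hedged check (with only three points the cubic is degenerate, so just the line and the parabola are fit here):

x <- c(1, 2, 4); y <- c(3, 1, 4)
coef(lm(y ~ x))           # best fitting line: intercept 3/2, slope 1/2
coef(lm(y ~ x + I(x^2)))  # parabola: passes through all three points exactly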
Homoscedastic
[Scatter plots of homoscedastic data: y vs. x and residuals e vs. x;
the vertical spread stays roughly constant across x.]
Heteroscedastic Data
[Scatter plot of heteroscedastic data: y vs. x.]
Formulas for simple linear regression
Model:
yi = axi + b + Ei where Ei ∼ N(0, σ²).
Using calculus or algebra:
â = sxy/sxx  and  b̂ = ȳ − â x̄,
where
x̄ = (1/n) Σ xi ,    sxx = (1/(n − 1)) Σ (xi − x̄)²,
ȳ = (1/n) Σ yi ,    sxy = (1/(n − 1)) Σ (xi − x̄)(yi − ȳ).
WARNING: This is just for simple linear regression. For polynomials
and other functions you need other formulas.
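A hedged numerical check of these formulas in R, previewing the data of the next board question; R's cov and var use the same 1/(n − 1) normalization as sxy and sxx:

x <- c(1, 2, 4); y <- c(3, 1, 4)
a_hat <- cov(x, y) / var(x)          # sxy / sxx = 1/2
b_hat <- mean(y) - a_hat * mean(x)   # ybar - a_hat * xbar = 3/2
c(a_hat, b_hat)
coef(lm(y ~ x))                      # lm recovers the same line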
Board Question: using the formulas plus some theory
Bivariate data: (1, 3), (2, 1), (4, 4)
So â = sxy/sxx = 7/14 = 1/2 and b̂ = ȳ − â x̄ = 9/6 = 3/2.
(The same answer as the previous board question.)
2. The formula b̂ = ȳ − â x̄ is exactly the same as ȳ = â x̄ + b̂. That is,
the point (x̄, ȳ) is on the line y = âx + b̂.
Solution to 3 is on the next slide.
3. Our model is yi = axi + b + Ei , where the Ei are independent. Since
Ei ∼ N(0, σ²) this becomes
yi ∼ N(axi + b, σ²)
Therefore the likelihood of yi given xi , a and b is
f(yi | xi , a, b) = (1/(√(2π) σ)) e^(−(yi − axi − b)²/(2σ²))
Since the data yi are independent, the likelihood function is the
product of these expressions, i.e. we sum the exponents:
likelihood = f(y1 , . . . , yn | x1 , . . . , xn , a, b) = (2πσ²)^(−n/2) e^(−Σ (yi − axi − b)²/(2σ²))
Since the exponent is negative, the likelihood is maximized when the
exponent is as close to 0 as possible, that is, when the sum
Σ (yi − axi − b)²
is as small as possible. Maximizing the likelihood is therefore exactly
the least squares problem.
R demonstration!
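The demo itself isn't reproduced in the slides; a minimal R stand-in (made-up a, b, σ) that simulates the model and fits by least squares:

# Simulate y = a*x + b + E with E ~ N(0, sigma^2), then fit with lm.
set.seed(2)
a <- 2; b <- 1; sigma <- 3
x <- seq(0, 10, by = 0.1)
y <- a * x + b + rnorm(length(x), 0, sigma)
coef(lm(y ~ x))                        # close to (intercept, slope) = (1, 2)
plot(x, y); abline(lm(y ~ x), col = "red")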
Outliers and other troubles
Use mathlet
http://mathlets.org/mathlets/linear-regression/
Regression to the mean
Suppose a group of children is given an IQ test at age 4.
One year later the same children are given another IQ test.
Children’s IQ scores at age 4 and age 5 should be positively
correlated.
Those who did poorly on the first test (e.g., bottom 10%) will
tend to show improvement (i.e. regress to the mean) on the
second test.
A completely useless intervention with the poor-performing
children might be misinterpreted as causing an increase in their
scores.
Conversely, a reward for the top-performing children might be
misinterpreted as causing a decrease in their scores.
This example is from Rice, Mathematical Statistics and Data Analysis.
A brief discussion of multiple linear regression
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.