CS 246 – Review of Proof Techniques and Probability (01/17/20)
Note: This document has been adapted from a similar review session for CS224W (Autumn
2018). It was originally compiled by Jessica Su, with minor edits by Jayadev Bhaskaran and
Albert Zheng.
1 Proof techniques
Here we will learn to prove universal mathematical statements, like “the square of any odd
number is odd”. It’s easy enough to show that this is true in specific cases – for example,
3² = 9, which is an odd number, and 5² = 25, which is another odd number. However, to
prove the statement, we must show that it works for all odd numbers, which is hard because
you can’t try every single one of them.
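As a preview of what such a proof looks like (a sketch, using a direct argument): any odd number can be written as 2k + 1 for some integer k, and its square is (2k + 1)² = 4k² + 4k + 1 = 2(2k² + 2k) + 1, which again has the form “2 × (an integer) + 1” and is therefore odd.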
Note that if we want to disprove a universal statement, we only need to find one counterex-
ample. For instance, if we want to disprove the statement “the square of any odd number is
even”, it suffices to provide a specific example of an odd number whose square is not even.
(For instance, 3² = 9, which is not an even number.)
Rule of thumb:
• To prove a universal statement, you must show it works in all cases.
• To disprove a universal statement, it suffices to find one counterexample.
(For “existence” statements, this is reversed. For example, if your statement is “there exists
at least one odd number whose square is odd”, then proving the statement just requires saying
3² = 9, while disproving the statement would require showing that none of the odd numbers
have squares that are odd.)
3 Probability
3.1 Fundamentals
The sample space Ω represents the set of all possible things that can happen. For example,
if you are rolling a die, your sample space is {1, 2, 3, 4, 5, 6}.
An event is a subset of the sample space. For example, the event “I roll a number less than
4” can be represented by the subset {1, 2, 3}. The event “I roll a 6” can be represented by
the subset {6}.
A probability function is a mapping from events to real numbers between 0 and 1. It must
have the following properties:
• P (Ω) = 1
• P (A ∪ B) = P (A) + P (B) for disjoint events A and B (i.e. when A ∩ B = ∅)
Example: For the die-rolling example above, we can define the probability function by saying P ({i}) =
1/6 for i = 1, . . . , 6. (That is, we say that each number has an equal probability of being
rolled.) All events in the probability space can be represented as unions of these six disjoint
events.
Using this definition, we can compute the probability of more complicated events, like
P (we roll an odd number) = 1/6 + 1/6 + 1/6 = 1/2.
(Note that we can add probabilities here because the events {1}, {3}, and {5} are disjoint.)
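To make the correspondence between events and probabilities concrete, here is a minimal sketch in Python (the names sample_space, p, and prob are illustrative, not from the notes):

    # A minimal sketch of the die example: an event is a subset of the
    # sample space, and its probability is the sum of the probabilities
    # of its disjoint individual outcomes.
    sample_space = {1, 2, 3, 4, 5, 6}
    p = {outcome: 1 / 6 for outcome in sample_space}  # uniform probability function

    def prob(event):
        """Return P(event) for an event given as a subset of the sample space."""
        return sum(p[outcome] for outcome in event)

    print(prob({1, 3, 5}))  # P(roll an odd number) = 0.5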
Theorem (Inclusion-Exclusion): For any two events A and B,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Proof: You can derive this theorem from the probability axioms. A ∪ B can be split into
three disjoint events: A \ B, A ∩ B, and B \ A. Furthermore, A can be split into A \ B and
A ∩ B, and B can be split into B \ A and A ∩ B. So

P(A ∪ B) = P(A \ B) + P(A ∩ B) + P(B \ A)
         = [P(A \ B) + P(A ∩ B)] + [P(B \ A) + P(A ∩ B)] − P(A ∩ B)
         = P(A) + P(B) − P(A ∩ B)
Example: Suppose k is chosen uniformly at random from the integers 1, 2, . . . , 100. (This
means the probability of getting each integer is 1/100.) Find the probability that k is
divisible by 2 or 5.
By the Principle of Inclusion-Exclusion, P (k is divisible by 2 or 5) = P (k is divisible by 2)+
P (k is divisible by 5) − P (k is divisible by both 2 and 5).
There are 50 numbers divisible by 2, 20 numbers divisible by 5, and 10 numbers divisible by
10 (i.e., divisible by both 2 and 5). Therefore, the probability is 50/100 + 20/100 − 10/100 =
60/100 = 0.6.
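A quick brute-force check of this computation in Python, just counting directly over the integers 1 to 100:

    # Count the integers in 1..100 divisible by 2 or 5 and compare with
    # the inclusion-exclusion answer of 0.6.
    count = sum(1 for k in range(1, 101) if k % 2 == 0 or k % 5 == 0)
    print(count / 100)  # 0.6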
Union bound: For any events A_1, . . . , A_n, P(⋃_{i=1}^{n} A_i) ≤ Σ_{i=1}^{n} P(A_i). In the
inductive step of the proof, one splits off the last event: P(⋃_{i=1}^{k+1} A_i) ≤ P(⋃_{i=1}^{k} A_i) + P(A_{k+1}).
By the induction hypothesis, the first term is less than or equal to Σ_{i=1}^{k} P(A_i). So

P(⋃_{i=1}^{k+1} A_i) ≤ Σ_{i=1}^{k+1} P(A_i)
If we want to actually compute this probability, we would take the number of engineering
majors that receive a perfect score, and divide it by the total number of engineering majors.
This is equivalent to computing the formula

P(perfect score | engineering major) = P(perfect score ∩ engineering major) / P(engineering major)

In general, we can replace “perfect score” and “engineering major” with any two events, and
we get the formal definition of conditional probability:

P(A|B) = P(A ∩ B) / P(B)
Example: Suppose you toss a fair coin three times. What is the probability that all three
tosses come up heads, given that the first toss came up heads?
Answer: This probability is

P(all three tosses come up heads and the first toss came up heads) / P(the first toss came up heads) = (1/8) / (1/2) = 1/4
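Since the sample space here has only eight equally likely outcomes, we can verify the answer by direct enumeration; a small Python sketch:

    # Enumerate all 8 equally likely outcomes of three fair coin tosses and
    # compute P(all heads | first toss heads) by counting.
    from itertools import product

    outcomes = list(product("HT", repeat=3))
    first_heads = [o for o in outcomes if o[0] == "H"]
    all_heads = [o for o in first_heads if o == ("H", "H", "H")]
    print(len(all_heads) / len(first_heads))  # 0.25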
3.4.1 Independence
Two events are independent if the fact that one event happened does not affect the probability
that the other event happens. In other words,

P(A|B) = P(A)

Equivalently, P(A ∩ B) = P(A)P(B).
Example: Suppose 1% of women who enter your clinic have breast cancer, and a woman
with breast cancer has a 90% chance of getting a positive result, while a woman without
breast cancer has a 10% chance of getting a false positive result. What is the probability of
a woman having breast cancer, given that she just had a positive test?
Answer: By Bayes’ Rule,

P(cancer | positive) = P(positive | cancer) P(cancer) / P(positive)
                     = (0.9 × 0.01) / (0.9 × 0.01 + 0.1 × 0.99)
                     = 0.009 / 0.108
                     ≈ 0.083

where the denominator expands P(positive) using the law of total probability. So even after
a positive test, the probability of having breast cancer is only about 8.3%.
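The same computation as a Python sketch, with the problem’s numbers (variable names are illustrative):

    # Bayes' Rule with the numbers from the example above.
    p_cancer = 0.01
    p_pos_given_cancer = 0.90
    p_pos_given_no_cancer = 0.10

    # Law of total probability: P(positive) over the two cases.
    p_pos = (p_pos_given_cancer * p_cancer
             + p_pos_given_no_cancer * (1 - p_cancer))
    print(p_pos_given_cancer * p_cancer / p_pos)  # ~0.0833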
Example: Let X be the number shown on a fair six-sided die. Then the probability mass
function for X is P(X = i) = 1/6 for i = 1, . . . , 6.
Formally, we define p(x) to be the probability mass function (PMF) of a discrete variable X
such that p(x) = P (X = x).
The PMF must have the following properties:
• p(x) ≥ 0 ∀x
• Σ_x p(x) = 1
If the random variable takes a continuous range of values, the equivalent of the probability
mass function is called the probability density function. The tricky thing about probability
density functions is that the probability of getting a specific number (say X = 3.258) is zero.
So we can only talk about the probability of getting a number that lies within a certain
range.
We define f(x) to be the probability density function of a continuous random variable X if

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

Here the probability is just the area under the curve of the PDF.
The PDF must have the following properties:
• f(x) ≥ 0 ∀x
• ∫_{−∞}^{∞} f(x) dx = 1
• ∫_{x∈A} f(x) dx = P(X ∈ A)
The cumulative distribution function (or CDF) of a random variable X expresses the prob-
ability that the random variable is less than or equal to the argument. It is given by
F (x) = P (X ≤ x).
For discrete random variables, the CDF can be expressed as a sum of the PMF:

F(x) = Σ_{y≤x} p(y)

For continuous random variables, the CDF can be expressed as an integral of the PDF:

F(x) = ∫_{−∞}^{x} f(t) dt
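As a small illustration of the discrete case, here is a Python sketch computing the CDF of a fair die roll as a running sum of its PMF (the names pmf and cdf are illustrative):

    # The CDF of a discrete random variable is the running sum of its PMF.
    pmf = {i: 1 / 6 for i in range(1, 7)}  # fair six-sided die

    def cdf(x):
        """Return F(x) = P(X <= x)."""
        return sum(p for outcome, p in pmf.items() if outcome <= x)

    print(cdf(3))  # 0.5
    print(cdf(6))  # 1.0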
Expectation is linear: for random variables X, Y and a constant a,

E[X + Y] = E[X] + E[Y]

and

E[aX] = aE[X]

The first identity is true even if X and Y are not independent.
3.6.3 Variance
The variance of a random variable is a measure of how far away the values are, on average,
from the mean. It is defined as

Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

For a random variable X and a constant a, we have Var(X + a) = Var(X) and Var(aX) = a² Var(X).

We do not have Var(X + Y) = Var(X) + Var(Y) unless X and Y are uncorrelated (which
means they have covariance 0). In particular, independent random variables are always
uncorrelated, although the reverse doesn’t hold.
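A quick empirical illustration of the scaling property Var(aX) = a² Var(X), as a Python sketch (assuming numpy is available):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.integers(1, 7, size=100_000)  # simulated fair die rolls
    a = 3
    print(np.var(a * x))     # empirical Var(aX)
    print(a**2 * np.var(x))  # a^2 * Var(X); the two should match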
For a geometric random variable X with parameter p (the number of independent trials,
each succeeding with probability p, up to and including the first success), the PMF is

P(X = k) = p(1 − p)^{k−1}

for k = 1, 2, . . .
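A small simulation sketch of this PMF (using the standard random module; the helper geometric_sample is illustrative):

    import random

    def geometric_sample(p):
        """Number of trials up to and including the first success."""
        k = 1
        while random.random() >= p:
            k += 1
        return k

    p = 0.3
    samples = [geometric_sample(p) for _ in range(100_000)]
    print(samples.count(2) / len(samples))  # empirical P(X = 2)
    print(p * (1 - p) ** (2 - 1))           # exact: p(1-p)^(k-1) = 0.21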
The expectation of an indicator random variable is just the probability of the event occurring:
if I_A is 1 when event A occurs and 0 otherwise, then

E[I_A] = 1 · P(A) + 0 · P(A does not occur) = P(A)
Indicator random variables are very useful for computing expectations of complicated random
variables, especially when combined with the property that the expectation of a sum of
random variables is the sum of the expectations.
Example: Suppose we are flipping n coins, and each comes up heads with probability p.
What is the expected number of coins that come up heads?
Answer: Let X_i be the indicator random variable that is 1 if the ith coin comes up heads,
and 0 otherwise. Then

E[Σ_{i=1}^{n} X_i] = Σ_{i=1}^{n} E[X_i] = Σ_{i=1}^{n} p = np
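A quick simulation check of this in Python:

    import random

    # Flip n coins with success probability p, many times; the average
    # number of heads per trial should approach n * p.
    n, p, trials = 20, 0.3, 50_000
    total = sum(sum(1 for _ in range(n) if random.random() < p)
                for _ in range(trials))
    print(total / trials)  # should be close to n * p = 6.0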
3.9 Inequalities
3.9.1 Markov’s inequality
For any random variable X that takes only non-negative values, we have

P(X ≥ a) ≤ E[X]/a

for a > 0.
You can derive this as follows. Let I_{X≥a} be the indicator random variable that is 1 if X ≥ a,
and 0 otherwise. Then a · I_{X≥a} ≤ X (convince yourself of this!). Taking expectations on both
sides, we get a · E[I_{X≥a}] ≤ E[X], so P(X ≥ a) ≤ E[X]/a.
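An empirical illustration of Markov’s inequality as a Python sketch (assuming numpy; the exponential distribution here is just a convenient non-negative example):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.exponential(scale=2.0, size=100_000)  # non-negative, E[X] = 2
    a = 5.0
    print((x >= a).mean())  # empirical P(X >= a), about 0.08
    print(x.mean() / a)     # Markov bound E[X]/a, about 0.4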
3.9.2 Chebyshev’s inequality

Applying Markov’s inequality to the non-negative random variable (X − E[X])² gives

P((X − E[X])² ≥ a²) ≤ E[(X − E[X])²] / a²

or, equivalently,

P(|X − E[X]| ≥ a) ≤ Var(X) / a²

This gives a bound on how far a random variable can be from its mean.
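And a matching empirical check for Chebyshev’s inequality (again a sketch, assuming numpy):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.integers(1, 7, size=100_000).astype(float)  # fair die rolls
    a = 2.0
    print((np.abs(x - x.mean()) >= a).mean())  # empirical tail, ~1/3
    print(x.var() / a**2)                      # Chebyshev bound, ~0.73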
For independent 0/1 random variables X_1, . . . , X_n with µ = E[Σ_{i=1}^{n} X_i], the Chernoff
bound says

P(Σ_{i=1}^{n} X_i ≥ (1 + δ)µ) ≤ (e^δ / (1 + δ)^{1+δ})^µ

for any δ > 0.
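A numerical sanity check of the bound for coin flips, as a Python sketch (assuming numpy; the parameters n, p, and δ are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, delta = 100, 0.5, 0.2
    mu = n * p
    flips = rng.random((50_000, n)) < p  # 50k runs of n coin flips each
    tail = (flips.sum(axis=1) >= (1 + delta) * mu).mean()  # empirical tail
    bound = (np.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
    print(tail, bound)  # empirical tail (~0.03) vs. Chernoff bound (~0.39)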