Lecture 6: September 9
Lecturer: Siva Balakrishnan
Today we will start off by deriving some of the implications between the different modes of
convergence. Then we will “prove” the CLT.
Convergence in quadratic mean implies convergence in probability: suppose $X_n$ converges in quadratic mean to $X$ and fix $\epsilon > 0$. By Markov's inequality,
$$P(|X_n - X| \geq \epsilon) = P(|X_n - X|^2 \geq \epsilon^2) \leq \frac{E(X_n - X)^2}{\epsilon^2} \to 0,$$
showing convergence in probability.
At a high level, the convergence-in-qm requirement penalizes $X_n$ for large deviations from $X$ both by how frequently the deviations occur and by their magnitude. On the other hand, convergence in probability only penalizes you for how frequently the deviations occur, and hence is a weaker notion of convergence.
Counterexample to the reverse: Suppose we take $U \sim U[0,1]$ and define $X_n = \sqrt{n}\, I_{[0,1/n]}(U)$; then $X_n$ converges in probability to 0 but does not converge in quadratic mean to 0.

To see this, for any $\epsilon > 0$ (and $n$ large enough that $\sqrt{n} \geq \epsilon$),
$$P(|X_n| \geq \epsilon) = P(\sqrt{n}\, I_{[0,1/n]}(U) \geq \epsilon) = P(U \in [0, 1/n]) = \frac{1}{n} \to 0.$$
On the other hand,
$$E[X_n^2] = n\, P(U \in [0, 1/n]) = 1 \not\to 0.$$
Observe that most of the time the RV $X_n$ takes the value 0, but when it does not it takes a huge value.
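As a quick numerical sanity check (an illustration, not part of the notes), we can simulate this counterexample and watch $P(|X_n| \geq \epsilon)$ shrink to 0 while $E[X_n^2]$ stays at 1:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
for n in [10, 100, 1000]:
    u = rng.uniform(size=200_000)
    x_n = np.sqrt(n) * (u <= 1.0 / n)   # X_n = sqrt(n) * I_[0,1/n](U)
    # frequency of a deviation shrinks like 1/n, but the second moment stays 1
    print(n, np.mean(np.abs(x_n) >= eps), np.mean(x_n**2))
```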
Convergence in probability implies convergence in distribution. This one is a little more involved but perhaps also useful to know. The idea, roughly, is to trap the CDF of $X_n$ by the CDF of $X$ evaluated on an interval whose length converges to 0.
We fix a point $x$ where the CDF $F_X$ is continuous. Choose an arbitrary $\epsilon > 0$. We have that,
\begin{align*}
F_{X_n}(x) = P(X_n \leq x) &= P(X_n \leq x, X \leq x + \epsilon) + P(X_n \leq x, X > x + \epsilon) \\
&\leq P(X \leq x + \epsilon) + P(|X_n - X| \geq \epsilon) \\
&= F_X(x + \epsilon) + P(|X_n - X| \geq \epsilon).
\end{align*}
Now,
\begin{align*}
F_X(x - \epsilon) = P(X \leq x - \epsilon) &= P(X \leq x - \epsilon, X_n \leq x) + P(X \leq x - \epsilon, X_n > x) \\
&\leq F_{X_n}(x) + P(|X_n - X| \geq \epsilon).
\end{align*}
Letting $n \to \infty$, the $P(|X_n - X| \geq \epsilon)$ terms vanish by convergence in probability, so
$$F_X(x - \epsilon) \leq \liminf_n F_{X_n}(x) \leq \limsup_n F_{X_n}(x) \leq F_X(x + \epsilon).$$
Now since $\epsilon > 0$ was arbitrary, we can take the limit as $\epsilon \to 0$ and use continuity of $F_X$ at $x$ to conclude the desired convergence in distribution.
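A small simulation (an illustration, not part of the proof) makes the trapping idea concrete: if $X_n = X + \text{noise}$ with $|X_n - X| \leq 1/n$, the CDF of $X_n$ at any continuity point is squeezed toward the CDF of $X$:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=200_000)    # X ~ N(0, 1)
f_x = np.mean(x <= 0.0)         # empirical F_X(0), about 1/2
for n in [2, 10, 100]:
    x_n = x + rng.uniform(-1.0, 1.0, size=x.size) / n   # |X_n - X| <= 1/n
    # empirical F_{X_n}(0) approaches F_X(0) as the perturbation shrinks
    print(n, np.mean(x_n <= 0.0), f_x)
```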
Counterexample to reverse: This of course is almost trivial since two random variables
having the same distribution does not in any sense mean that they are close (see Lecture 4
notes for an example).
An important caveat: one exception is that when $X$ is deterministic, convergence in distribution does imply convergence in probability. Concretely, fix $\epsilon > 0$ and consider the case when $X = c$. Then
\begin{align*}
P(|X_n - c| > \epsilon) &= P(X_n > c + \epsilon) + P(X_n < c - \epsilon) \\
&\leq F_{X_n}(c - \epsilon) + 1 - F_{X_n}(c + \epsilon) \\
&\to F_X(c - \epsilon) + 1 - F_X(c + \epsilon) = 0,
\end{align*}
using convergence in distribution and the fact that the distribution function $F_X$ is continuous at both $c + \epsilon$ and $c - \epsilon$ (since $X = c$, $F_X$ is continuous everywhere except at $c$).
There are some important consequences of the fact that convergence in distribution is weaker
than convergence in probability.
Concretely, for convergence in probability (and stronger forms of convergence) it is the case that if $X_n$ converges in probability to $X$ and $Y_n$ converges in probability to $Y$, then $X_n + Y_n$ converges in probability to $X + Y$; the same is true of products, i.e. $X_n Y_n$ converges in probability to $XY$.
These statements are not true for convergence in distribution, i.e. if Xn converges in dis-
tribution to X and Yn converges in distribution to Y then Xn + Yn does not necessarily
converge in distribution to X + Y .
The one exception to this is known as Slutsky's theorem. It says that if $Y_n$ converges in distribution to a constant $c$, and $X_n$ converges in distribution to $X$, then $X_n + Y_n$ converges in distribution to $X + c$ and $X_n Y_n$ converges in distribution to $cX$.
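Slutsky's theorem is easy to see in simulation (a sketch, not from the notes): here $X_n$ is a standardized average converging in distribution to $N(0,1)$, and $Y_n$ converges to the constant $c = 2$:

```python
import numpy as np

rng = np.random.default_rng(1)
reps, k = 50_000, 200
# X_n: standardized averages of Exp(1) RVs, converging in distribution to N(0, 1)
x_n = np.sqrt(k) * (rng.exponential(size=(reps, k)).mean(axis=1) - 1.0)
# Y_n: converges in probability (hence in distribution) to the constant c = 2
y_n = 2.0 + rng.normal(scale=1.0 / np.sqrt(k), size=reps)
# Slutsky: X_n + Y_n behaves like N(0,1) + 2, and X_n * Y_n like 2 * N(0,1)
print(np.mean(x_n + y_n), np.var(x_n + y_n))   # about 2 and 1
print(np.mean(x_n * y_n), np.var(x_n * y_n))   # about 0 and 4
```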
If we have that $X_n$ converges in probability to some constant $c$, it is not necessarily the case that $E[X_n]$ converges to $c$.

Here is an example of this non-convergence. Let $X_n$ be 0 with probability $1 - 1/n$ and $n^2$ with probability $1/n$. Then $X_n$ converges to 0 in probability, but $E[X_n] = n \to \infty$.
This is a manifestation of the same phenomenon as we saw in the counterexample to qm convergence. On the events when $|X_n| \geq \epsilon$ it takes a huge value, and this affects the moments but does not affect the convergence in probability (which only cares about how frequently the violation occurs).
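Simulating this example (a numerical illustration, not part of the notes) shows both effects at once: the deviation frequency shrinks like $1/n$ while the sample mean of $X_n$ tracks $n$:

```python
import numpy as np

rng = np.random.default_rng(2)
for n in [10, 100, 1000]:
    # X_n = n^2 with probability 1/n, else 0
    x_n = np.where(rng.uniform(size=500_000) < 1.0 / n, float(n) ** 2, 0.0)
    # P(X_n != 0) = 1/n -> 0, but E[X_n] = n -> infinity
    print(n, np.mean(x_n != 0), np.mean(x_n))
```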
Recall the statement of the CLT: if $X_1, \ldots, X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$, $\widehat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i$, and $S_n = \sqrt{n}(\widehat{\mu} - \mu)/\sigma$, then $S_n$ converges in distribution to $Z \sim N(0,1)$. Two remarks:

1. The central limit theorem is incredibly general. It does not matter what the distribution of the $X_i$ is: the standardized average $S_n$ converges in distribution to a Gaussian (under fairly mild assumptions).
2. The most general version of the CLT does not require any assumption about the mgf; it just requires that the mean and variance are finite. In lecture we will instead prove a weaker version that assumes the mgf exists in a neighborhood of 0.
We should try to understand why the CLT might be useful. Roughly, the CLT allows us to make approximate probability statements about averages using corresponding statements about standard normals. At a high level, instead of using a different tail bound for each type of average (sub-Gaussian, sub-exponential, bounded, etc.) we can now just use the Gaussian CDF, although our results will only be approximate.

I will introduce a simple use case: we will discuss this idea again later on in more detail when we discuss confidence intervals.
Suppose for now that we are averaging i.i.d. RVs with known variance $\sigma^2$ (and unknown mean $\mu$). Typically one would also estimate the variance, but this will not change much. We would like to construct a confidence interval for the unknown mean. For some parameter $\alpha$, this is an interval $C_\alpha$ such that,
$$P(\mu \in C_\alpha) \geq 1 - \alpha.$$
One might guess that we would center such an interval around the sample average $\widehat{\mu}$, but the main difficulty is that we do not know the distribution of $\widehat{\mu}$. We can see that,
$$P(\mu \in [\widehat{\mu} - t, \widehat{\mu} + t]) = P(|\widehat{\mu} - \mu| \leq t).$$
So we would like to choose $t$ to make this probability at least $1 - \alpha$. One can construct such intervals using tail bounds (see HW3) but we will instead construct an approximate interval using the CLT. Using the CLT we know that the distribution of $\widehat{\mu} - \mu$ converges to a normal with mean 0 and variance $\sigma^2/n$, i.e.
$$P(|\widehat{\mu} - \mu| \leq t) \approx P\left( |Z| \leq \frac{\sqrt{n}\, t}{\sigma} \right).$$
So if we choose $t = \sigma z_{\alpha/2}/\sqrt{n}$, where $z_{\alpha/2}$ is the point satisfying $P(|Z| \leq z_{\alpha/2}) = 1 - \alpha$, then with $C_\alpha = [\widehat{\mu} - t, \widehat{\mu} + t]$ we obtain
$$P(\mu \in C_\alpha) \approx 1 - \alpha.$$
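This recipe can be checked by simulation (a sketch assuming i.i.d. exponential data with known variance, not part of the notes): even though the data are far from Gaussian, the CLT interval covers $\mu$ at close to the nominal rate.

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 100, 0.05
mu, sigma = 2.0, 2.0            # Exp(scale=2) has mean 2 and sd 2
z = 1.96                        # z_{alpha/2} for alpha = 0.05
t = z * sigma / np.sqrt(n)      # half-width of the CLT interval
trials = 20_000
mu_hat = rng.exponential(scale=mu, size=(trials, n)).mean(axis=1)
coverage = np.mean(np.abs(mu_hat - mu) <= t)
print(coverage)                 # close to 1 - alpha = 0.95
```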
6.3.2 Preliminaries
Sanity Checks: Before we prove the theorem there are two very simple sanity checks that one might consider. The random variable $S_n$ has mean 0, and variance:
$$E[S_n^2] = \frac{n}{\sigma^2}\, E(\widehat{\mu} - \mu)^2 = 1.$$
So in some sense the normalizations (of subtracting $\mu$ and dividing by $\sigma/\sqrt{n}$) make sense. You should convince yourself that if you did not multiply by $\sqrt{n}$ this would have a degenerate limit (i.e. would converge in distribution to a point mass at 0). Multiplying by $\sqrt{n}$ enlarges the fluctuations of the average around the expectation at just the right rate.
The other sanity check is to just notice that if $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, then $S_n$ would have distribution exactly equal to that of $Z$. Roughly, if there is going to be a "universal limit", i.e. if the average is going to converge to a single distribution (irrespective of the distribution of $X$), then it has to be the Gaussian distribution (just because we know that the average of Gaussians is Gaussian).
Calculus with mgfs: We need a few simple facts about mgfs that we will quickly prove.
Fact 1: If $X$ and $Y$ are independent with mgfs $M_X$ and $M_Y$ then $Z = X + Y$ has mgf $M_Z(t) = M_X(t) M_Y(t)$.

Proof: We note that,
$$M_Z(t) = E[e^{t(X+Y)}] = E[e^{tX}]\, E[e^{tY}] = M_X(t) M_Y(t),$$
using independence.
Fact 2: If $X$ has mgf $M_X$ then $Y = a + bX$ has mgf $M_Y(t) = \exp(at) M_X(bt)$.

Proof: We just use the definition,
$$M_Y(t) = E[e^{t(a + bX)}] = e^{at}\, E[e^{(bt)X}] = \exp(at) M_X(bt).$$
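Facts 1 and 2 are easy to check by Monte Carlo (an illustration, not part of the notes); here $X$ is standard normal (mgf $e^{t^2/2}$) and $Y$ an independent Exp(1) (mgf $1/(1-t)$ for $t < 1$):

```python
import numpy as np

rng = np.random.default_rng(4)
t = 0.3
x = rng.normal(size=1_000_000)        # X ~ N(0, 1)
y = rng.exponential(size=1_000_000)   # Y ~ Exp(1), independent of X
# Fact 1: M_{X+Y}(t) = M_X(t) M_Y(t) for independent X and Y
lhs = np.mean(np.exp(t * (x + y)))
rhs = np.mean(np.exp(t * x)) * np.mean(np.exp(t * y))
print(lhs, rhs, np.exp(t**2 / 2) / (1 - t))   # all close to the exact value
# Fact 2: M_{a+bX}(t) = exp(at) M_X(bt), here with a = 1, b = 2
a, b = 1.0, 2.0
print(np.mean(np.exp(t * (a + b * x))), np.exp(a * t) * np.mean(np.exp(b * t * x)))
```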
Fact 3: We will not prove this one (strictly speaking one needs to invoke the dominated convergence theorem) but it should be familiar to you. The $r$-th derivative of the mgf at 0 gives us moments, i.e.
$$M_X^{(r)}(0) = E[X^r].$$
Fact 4: The most important result that we also will not prove is that we can show
convergence in distribution by showing convergence of the mgfs.
Formally, let $X_1, X_2, \ldots$ be a sequence of RVs with mgfs $M_{X_1}, M_{X_2}, \ldots$. If for all $t$ in an open interval around 0 we have that $M_{X_n}(t) \to M_X(t)$, then $X_n$ converges in distribution to $X$.
6.3.3 Proof
We will follow the proof from John Rice's textbook (Mathematical Statistics and Data Analysis). Larry's notes have a nearly identical proof. First we recall that the mgf of a standard normal is simply $M_Z(t) = \exp(t^2/2)$.
Note that,
$$M_{S_n}(t) = \left[ M_{(X - \mu)}\left( \frac{t}{\sigma \sqrt{n}} \right) \right]^n,$$
using Facts 1 and 2. Now, one should imagine $t$ as small and fixed, so $t/(\sigma \sqrt{n})$ is quite close to 0. Taylor expanding the mgf around 0, and using Fact 3, we obtain
$$M_{S_n}(t) = \left[ 1 + \frac{t}{\sigma \sqrt{n}}\, E(X - \mu) + \frac{t^2}{2 n \sigma^2}\, E(X - \mu)^2 + \frac{t^3}{6 n^{3/2} \sigma^3}\, E(X - \mu)^3 + \ldots \right]^n.$$
Since $E(X - \mu) = 0$ and $E(X - \mu)^2 = \sigma^2$, while the remaining terms are of order $n^{-3/2}$ or smaller,
$$M_{S_n}(t) \approx \left( 1 + \frac{t^2}{2n} \right)^n \to \exp(t^2/2),$$
which is the mgf of a standard normal, so by Fact 4 we are done.
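Both the final limit and the CLT itself can be checked numerically (a quick illustration, not part of the notes):

```python
import numpy as np

# The limit in the last step: (1 + t^2/(2n))^n -> exp(t^2/2), the mgf of N(0,1)
t = 1.0
for n in [10, 100, 10_000]:
    print(n, (1 + t**2 / (2 * n)) ** n, np.exp(t**2 / 2))

# The CLT itself: standardized averages of Exp(1) RVs look standard normal
rng = np.random.default_rng(5)
k = 500  # sample size per average
s_n = np.sqrt(k) * (rng.exponential(size=(20_000, k)).mean(axis=1) - 1.0)
# compare P(S_n <= 1) with the standard normal CDF at 1 (about 0.8413)
print(np.mean(s_n <= 1.0))
```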