
36-705: Intermediate Statistics Fall 2019

Lecture 6: September 9
Lecturer: Siva Balakrishnan

Today we will start off by deriving some of the implications between the different modes of
convergence. Then we will “prove” the CLT.

6.1 Quadratic mean =⇒ convergence in probability


Suppose that Xn converges in quadratic mean to X. Fix an ε > 0; then by Markov's inequality,

P(|Xn − X| ≥ ε) = P(|Xn − X|² ≥ ε²) ≤ E(Xn − X)²/ε² → 0,

showing convergence in probability.
At a high level, the quadratic-mean requirement penalizes Xn for large deviations from X through both how frequent the deviations are and how large they are. Convergence in probability, on the other hand, only penalizes how frequent the deviations are, and is hence a weaker notion of convergence.

Counterexample to the reverse: Suppose we take U ∼ U[0, 1] and define Xn = √n I[0,1/n](U); then Xn converges in probability to 0 but does not converge in quadratic mean to 0.
To see this, for any ε ∈ (0, 1),

P(|Xn| ≥ ε) = P(√n I[0,1/n](U) ≥ ε) = P(U ∈ [0, 1/n]) = 1/n → 0.

On the other hand,

E(Xn − 0)² = E[Xn²] = n P(U ∈ [0, 1/n]) = 1.

Observe that most of the time the RV Xn takes the value 0, but when it does not it takes a
huge value.
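Both claims are easy to check numerically. The sketch below (a Monte Carlo with my own choices of ε, n, and draw counts) estimates the deviation probability and the second moment of Xn:

```python
# Monte Carlo check of the counterexample: X_n = sqrt(n) * I{U <= 1/n},
# U ~ Uniform[0, 1]. P(|X_n| >= eps) = 1/n -> 0 while E[X_n^2] = 1 always.
import random

random.seed(0)

def deviation_and_second_moment(n, eps=0.5, num_draws=200_000):
    """Estimate P(|X_n| >= eps) and E[X_n^2] by simulation."""
    hits, second_moment = 0, 0.0
    for _ in range(num_draws):
        xn = n ** 0.5 if random.random() <= 1.0 / n else 0.0
        hits += abs(xn) >= eps
        second_moment += xn * xn
    return hits / num_draws, second_moment / num_draws

p10, m10 = deviation_and_second_moment(n=10)
p1000, m1000 = deviation_and_second_moment(n=1000)
print(p10, p1000)   # deviation frequency shrinks roughly like 1/n
print(m10, m1000)   # second moment stays near 1: no q.m. convergence
```

The rare but enormous value n (taken with probability 1/n) is exactly what keeps the second moment pinned at 1.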

6.1.1 Convergence in probability =⇒ convergence in distribution

This one is a little bit involved but perhaps also useful to know. The idea roughly is to trap
the CDF of Xn by the CDF of X with an interval whose length converges to 0.


We fix a point x where the CDF FX is continuous, and choose an arbitrary ε > 0. We have that,

FXn(x) = P(Xn ≤ x) = P(Xn ≤ x, X ≤ x + ε) + P(Xn ≤ x, X > x + ε)
       ≤ P(X ≤ x + ε) + P(|Xn − X| ≥ ε)
       = FX(x + ε) + P(|Xn − X| ≥ ε).

Now,

FX(x − ε) = P(X ≤ x − ε) = P(X ≤ x − ε, Xn ≤ x) + P(X ≤ x − ε, Xn > x)
          ≤ FXn(x) + P(|Xn − X| ≥ ε).

Putting these two together we have,

FX(x − ε) − P(|Xn − X| ≥ ε) ≤ FXn(x) ≤ FX(x + ε) + P(|Xn − X| ≥ ε).

Intuitively, as n gets large the two probabilities P(|Xn − X| ≥ ε) converge to 0, and since ε was chosen arbitrarily we can let ε → 0 and use the continuity of FX at x to conclude that FXn(x) → FX(x).
Slightly more rigorously, we cannot assume that the limit of FXn(x) exists, so we instead need to use lim infs and lim sups (do not worry about this if you have not seen them before). Formally, we would take the lim sup of the first inequality to obtain that,

lim sup_{n→∞} FXn(x) ≤ FX(x + ε),

and similarly that,

lim inf_{n→∞} FXn(x) ≥ FX(x − ε),

and conclude that,

FX(x − ε) ≤ lim inf_{n→∞} FXn(x) ≤ lim sup_{n→∞} FXn(x) ≤ FX(x + ε).

Now since ε > 0 was arbitrary, we can take the limit as ε → 0 and use continuity to conclude the desired convergence in distribution.
Counterexample to the reverse: This is almost trivial, since two random variables having the same distribution does not in any sense mean that they are close (see the Lecture 4 notes for an example).
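A tiny simulation makes the failure concrete (the construction below is my own, in the spirit of the Lecture 4 example): with X ∼ N(0, 1) and Xn = −X for every n, the two have identical distributions, yet |Xn − X| = 2|X| never shrinks.

```python
# X_n = -X has exactly the same N(0, 1) distribution as X for every n,
# so X_n -> X in distribution trivially, but X_n is never close to X.
import random

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
xns = [-x for x in xs]

mean_x = sum(xs) / len(xs)
mean_xn = sum(xns) / len(xns)  # identical distribution: mean_xn == -mean_x

# P(|X_n - X| >= 1) = P(2|X| >= 1) = P(|X| >= 1/2), about 0.617 -- not 0.
frac_far = sum(abs(a - b) >= 1.0 for a, b in zip(xns, xs)) / len(xs)
print(mean_x, mean_xn, frac_far)
```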
An important caveat: One exception is that when X is deterministic, convergence in distribution does imply convergence in probability. Concretely, fix ε > 0 and consider the case X = c. Then

P(|Xn − c| > ε) = P(Xn > c + ε) + P(Xn < c − ε)
               ≤ FXn(c − ε) + 1 − FXn(c + ε)
               → FX(c − ε) + 1 − FX(c + ε) = 0,

using convergence in distribution and the fact that the distribution function FX is continuous at both c + ε and c − ε.

6.2 Other things that are very useful to know

6.2.1 Continuous mapping theorem

If a sequence Xn converges in probability to X, then for any continuous function h, h(Xn) converges in probability to h(X). The same is true for convergence in distribution.
This is useful because often we will have a consistent estimator for some parameter, and this theorem allows us to construct estimators for functions of that parameter in a straightforward way.
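As a quick illustration (the distribution and estimator here are my own choices): the sample mean of Exponential draws is consistent for the mean µ, and since h(x) = 1/x is continuous at µ > 0, continuous mapping says 1/µ̂ is consistent for the rate 1/µ.

```python
# Continuous mapping in action: mu_hat is consistent for mu = 0.5, so
# 1/mu_hat is consistent for the exponential rate 1/mu = 2.
import random

random.seed(2)

def rate_estimate(n):
    draws = [random.expovariate(2.0) for _ in range(n)]  # mean 0.5
    mu_hat = sum(draws) / n
    return 1.0 / mu_hat  # h(x) = 1/x, continuous at mu = 0.5

print(rate_estimate(100))      # rough estimate of the rate 2
print(rate_estimate(100_000))  # much closer to 2
```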

6.2.2 Slutsky’s theorem

There are some important consequences of the fact that convergence in distribution is weaker
than convergence in probability.
Concretely, for convergence in probability (and stronger forms of convergence) it is the case
that, if Xn converges in probability to X and Yn converges in probability to Y then Xn + Yn
converges in probability to X + Y , and the same is true of products, i.e. Xn Yn converges in
probability to XY .
These statements are not true for convergence in distribution, i.e. if Xn converges in dis-
tribution to X and Yn converges in distribution to Y then Xn + Yn does not necessarily
converge in distribution to X + Y .
The one exception to this is known as Slutsky's theorem. It says that if Yn converges in distribution to a constant c, and Xn converges in distribution to X, then Xn + Yn converges in distribution to X + c and Xn Yn converges in distribution to cX.
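The classic use of Slutsky's theorem is justifying a plug-in variance estimate. A sketch (the setup is my own): if √n(µ̂ − µ)/σ converges in distribution to N(0, 1) and σ/σ̂ converges in probability to 1, their product √n(µ̂ − µ)/σ̂ is also asymptotically standard normal.

```python
# Studentized mean of Exponential(1) draws: Slutsky says the estimated
# standard deviation can replace the true one without changing the limit.
import random

random.seed(3)

def studentized_mean(n, mu=1.0):
    draws = [random.expovariate(1.0 / mu) for _ in range(n)]  # mean mu, sd mu
    mu_hat = sum(draws) / n
    var_hat = sum((d - mu_hat) ** 2 for d in draws) / (n - 1)
    return (n ** 0.5) * (mu_hat - mu) / var_hat ** 0.5

samples = [studentized_mean(500) for _ in range(5_000)]
# If the limit really is N(0, 1), about 95% should land in [-1.96, 1.96].
coverage = sum(-1.96 <= s <= 1.96 for s in samples) / len(samples)
print(coverage)
```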

6.2.3 Convergence of moments is not implied by convergence in probability

Convergence in probability is actually quite weak as a form of convergence. We have seen previously that it does not imply quadratic mean convergence. Now we will see that it does not even imply something much simpler.

Even if Xn converges in probability to some constant c, it need not be the case that E[Xn] converges to c.
Here is an example of this non-convergence. Let Xn be 0 with probability 1 − 1/n and n² with probability 1/n. Then Xn converges to 0 in probability, but E[Xn] = n → ∞.
This is a manifestation of the same phenomenon we saw in the counterexample to q.m. convergence: on the events where |Xn| ≥ ε the RV takes a huge value, which affects the moments but not the convergence in probability (which only cares about how frequent these deviations are).
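The example above is easy to verify by simulation (the draw counts below are my own choices):

```python
# X_n = 0 w.p. 1 - 1/n and X_n = n^2 w.p. 1/n, so P(|X_n| >= eps) = 1/n -> 0
# while E[X_n] = n blows up.
import random

random.seed(4)

def empirical(n, num_draws=300_000):
    hits, total = 0, 0.0
    for _ in range(num_draws):
        xn = n ** 2 if random.random() < 1.0 / n else 0.0
        hits += xn >= 1.0
        total += xn
    return hits / num_draws, total / num_draws  # (deviation freq, sample mean)

p100, m100 = empirical(100)
print(p100)  # about 0.01: the deviation frequency is heading to 0
print(m100)  # about 100: the mean is tracking n, not 0
```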

6.3 The central limit theorem


We will now state and prove a form of the central limit theorem, which is one of the most
famous examples of convergence in distribution.
Let X1, X2, . . . , Xn be a sequence of i.i.d. random variables with mean µ and variance σ². Assume that the mgf E[exp(tXi)] is finite for t in a neighborhood of zero. Let µ̂ = (1/n) Σ_{i=1}^n Xi and define

Sn = √n (µ̂ − µ) / σ,

then Sn converges in distribution to Z ∼ N(0, 1).
Comments:

1. The central limit theorem is incredibly general. It does not matter what the distribution of the Xi is; the standardized average Sn converges in distribution to a Gaussian (under fairly mild assumptions).
2. The most general version of the CLT does not require any assumption on the mgf; it just requires that the mean and variance are finite. We will only prove the weaker, mgf-based version in lecture.
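A quick empirical look (the distribution and sizes are my own choices): standardized Bernoulli averages already behave like a standard normal at moderate n.

```python
# Standardize averages of Bernoulli(0.2) draws and check a normal tail:
# in the limit, P(S_n > 1) should match 1 - Phi(1) ~ 0.159.
import math
import random

random.seed(5)
p, n = 0.2, 400
mu, sigma = p, math.sqrt(p * (1 - p))

def s_n():
    mu_hat = sum(random.random() < p for _ in range(n)) / n
    return math.sqrt(n) * (mu_hat - mu) / sigma

samples = [s_n() for _ in range(20_000)]
frac_above = sum(s > 1.0 for s in samples) / len(samples)
print(frac_above)  # roughly 0.15 here (discreteness at n = 400 bites a little)
```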

6.3.1 Use Case

We should try to understand why the CLT might be useful. Roughly, the CLT allows us to make approximate probability statements about averages using corresponding statements about standard normals. At a high level, instead of using a different tail bound for each type of average (sub-Gaussian, sub-exponential, bounded, etc.) we can now just use the Gaussian CDF, although our results will only be approximate.
I will introduce a simple use case: we will discuss this idea again later on in more detail
when we discuss confidence intervals.

Suppose for now that we are averaging i.i.d. RVs with known variance σ² (and unknown mean µ). Typically one would also estimate the variance, but this does not change much. We would like to construct a confidence interval for the unknown mean: for a level parameter α, this is an interval Cα such that,

P(µ ∈ Cα) ≥ 1 − α.

One might guess that we would center such an interval around the sample average µ̂, but the main difficulty is that we do not know the distribution of µ̂. We can see that,

P(µ ∈ [µ̂ − t, µ̂ + t]) = P(|µ̂ − µ| ≤ t).

So we would like to choose t to make this probability at least 1 − α. One can construct such intervals using tail bounds (see HW3), but we will instead construct an approximate interval using the CLT. The CLT tells us that the distribution of µ̂ − µ is approximately normal with mean 0 and variance σ²/n, i.e.

P(|µ̂ − µ| ≤ t) ≈ P(|Z| ≤ √n t / σ).

Now, if we let Φ(x) = P(Z ≤ x) denote the standard normal CDF, then we can see that we need to choose t = σ Φ⁻¹(1 − α/2)/√n := σ z_{α/2}/√n.
To summarize, the interval:

Cα = [ µ̂ − σ z_{α/2}/√n , µ̂ + σ z_{α/2}/√n ],

has the property that,

P(µ ∈ Cα) ≈ 1 − α,

where we appealed to the CLT to justify this construction.
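To check the approximation empirically, the sketch below (Exponential(1) data, known σ = 1, and α = 0.05 are my own choices) estimates the coverage of the interval above:

```python
# Coverage of C_alpha = [mu_hat - z*sigma/sqrt(n), mu_hat + z*sigma/sqrt(n)]
# with Exponential(1) data: mu = 1, sigma = 1 (known), alpha = 0.05.
import math
import random

random.seed(6)
mu, sigma, n, z = 1.0, 1.0, 200, 1.959964

def interval_covers():
    draws = [random.expovariate(1.0) for _ in range(n)]
    mu_hat = sum(draws) / n
    half = z * sigma / math.sqrt(n)
    return mu_hat - half <= mu <= mu_hat + half

coverage = sum(interval_covers() for _ in range(10_000)) / 10_000
print(coverage)  # approximately 1 - alpha = 0.95
```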

6.3.2 Preliminaries

Sanity checks: Before we prove the theorem there are two very simple sanity checks one might consider. The random variable Sn has mean 0, and variance:

E[Sn²] = (n/σ²) E(µ̂ − µ)² = 1.

So in some sense the normalizations (subtracting µ and dividing by σ/√n) make sense. You should convince yourself that if you did not multiply by √n you would get a degenerate limit (i.e. convergence in distribution to a point mass at 0). Multiplying by √n enlarges the fluctuations of the average around its expectation at just the right rate.
The other sanity check is to notice that if X1, . . . , Xn ∼ N(µ, σ²), then Sn has distribution exactly equal to that of Z. Roughly, if there is going to be a "universal limit", i.e. if the average is going to converge to a single distribution irrespective of the distribution of the Xi, then it has to be a Gaussian distribution (just because we know that an average of Gaussians is Gaussian).
Calculus with mgfs: We need a few simple facts about mgfs that we will quickly prove.
Fact 1: If X and Y are independent with mgfs MX and MY then Z = X + Y has mgf
MZ (t) = MX (t)MY (t).
Proof: We note that,

MZ(t) = E[exp(t(X + Y))] = E[exp(tX)] E[exp(tY)],

using independence.
Fact 2: If X has mgf MX then Y = a + bX has mgf MY(t) = exp(at) MX(bt).
Proof: We just use the definition,

MY(t) = E[exp(at + btX)] = exp(at) E[exp(btX)].

Fact 3: We will not prove this one (strictly speaking one needs to invoke the dominated convergence theorem) but it should be familiar to you. The rth derivative of the mgf at 0 gives us the moments, i.e.

M_X^{(r)}(0) = E[X^r].

Fact 4: The most important result, which we also will not prove, is that we can show convergence in distribution by showing convergence of the mgfs.
Formally, let X1, X2, . . . be a sequence of RVs with mgfs MX1, MX2, . . .. If for all t in an open interval around 0 we have that MXn(t) → MX(t), then Xn converges in distribution to X.
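Facts 1 and 2 are easy to spot-check by Monte Carlo (the Gaussian choices and the value of t below are mine), since the N(µ, σ²) mgf exp(µt + σ²t²/2) is known in closed form:

```python
# Spot-check Facts 1 and 2 with standard normals, where M_X(t) = exp(t^2/2).
import math
import random

random.seed(7)
t, num_draws = 0.5, 400_000
xs = [random.gauss(0.0, 1.0) for _ in range(num_draws)]
ys = [random.gauss(0.0, 1.0) for _ in range(num_draws)]

# Fact 1: M_{X+Y}(t) = M_X(t) M_Y(t) = exp(t^2) for independent N(0, 1)'s.
m_sum = sum(math.exp(t * (x + y)) for x, y in zip(xs, ys)) / num_draws
print(m_sum, math.exp(t * t))

# Fact 2: Y = a + bX has mgf exp(at) M_X(bt); here a = 1, b = 2.
a, b = 1.0, 2.0
m_affine = sum(math.exp(t * (a + b * x)) for x in xs) / num_draws
print(m_affine, math.exp(a * t) * math.exp((b * t) ** 2 / 2))
```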

6.3.3 Proof

We will follow the proof from John Rice's textbook (Mathematical Statistics and Data Analysis). Larry's notes have a nearly identical proof. First we recall that the mgf of a standard normal is simply MZ(t) = exp(t²/2).

Note that,

M_{Sn}(t) = [ M_{X−µ}( t/(σ√n) ) ]^n,

using Facts 1 and 2. Now, one should imagine t as fixed, so that for large n the argument t/(σ√n) is quite close to 0. Taylor expanding the mgf around 0 and using Fact 3, we obtain

M_{Sn}(t) = [ 1 + (t/(σ√n)) E(X − µ) + (t²/(2nσ²)) E(X − µ)² + (t³/(6 n^{3/2} σ³)) E(X − µ)³ + · · · ]^n
          ≈ ( 1 + t²/(2n) )^n → exp(t²/2),

where we used E(X − µ) = 0 and E(X − µ)² = σ², together with the fact that,

lim_{n→∞} (1 + x/n)^n = exp(x).
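Two numeric checks of the final step (a sketch of my own): the elementary limit (1 + t²/(2n))ⁿ → exp(t²/2), and a Monte Carlo estimate of M_{Sn}(t) for Uniform[0, 1] data, which should approach the standard normal mgf.

```python
# (1 + t^2/(2n))^n approaches exp(t^2/2) ...
import math
import random

random.seed(8)
t = 1.0
for n in (10, 100, 10_000):
    print(n, (1 + t * t / (2 * n)) ** n, math.exp(t * t / 2))

# ... and so does the Monte Carlo mgf of S_n = sqrt(n)(mu_hat - mu)/sigma
# for Uniform[0, 1] draws (mu = 1/2, sigma^2 = 1/12).
n, reps = 200, 20_000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
acc = 0.0
for _ in range(reps):
    mu_hat = sum(random.random() for _ in range(n)) / n
    acc += math.exp(t * math.sqrt(n) * (mu_hat - mu) / sigma)
mgf_estimate = acc / reps
print(mgf_estimate, math.exp(t * t / 2))  # both near exp(1/2)
```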
