Maximum Likelihood Estimation by K. Kashin
Spring 2014
∗ This note was initially prepared to supplement course material for Gov 2001 for Spring 2013. It is in part based upon my notes from Stat 111 at Harvard.
1 Overview: Maximum Likelihood vs. Bayesian Estimation
This note is about the mechanics of maximum likelihood estimation (MLE). However, before delving into the mechanics of
finding the MLE, let’s step back and lay out maximum likelihood as a theory of inference. Specifically, it will prove useful to
compare maximum likelihood to Bayesian theory of inference.
In general, what is statistical inference? It’s the process of making a statement about how data is generated in the world. We
can think of the data that we observe in the world as a product of some data generation process (DGP) that is fundamentally
unknown and likely highly convoluted. However, it is our goal as social scientists or applied statisticians to use observed data to
learn something about the DGP. In parametric inference, we are going to assume that we can represent the DGP by a statistical
model. Remember that whatever model we choose, it will essentially never be “right”. Thus, the question is whether or not the
model is useful. In the context of this note, we will limit ourselves to very common probability distributions as our models.
Once we have selected a probability distribution to represent the data generation process, we aren’t done, for in fact we have
not specified a unique distribution, but just a family of distributions. This is because we leave one (or more) parameters as
unknown. The goal of statistical inference is then to use observed data to make a statement about parameters that govern our
model.
To introduce some notation, we are going to call the model, or probability distribution, that we choose f(⋅). This probability distribution is going to depend on a parameter θ (or vector of parameters, θ = (θ_1, θ_2, ..., θ_k)) that characterizes the distribution.
The set Ω of all possible values of a parameter or the vector of parameters is called the parameter space. We then observe some
data, drawn from this distribution:
X ∼ f (x∣θ)
The random variables X 1 , ... ,X n are independent and identically distributed (iid) because they are drawn independently from
the same DGP. The goal is to use the observed data x to learn about θ. Knowing this parameter specifies a particular distribu-
tion from the family of distributions we have selected to represent the data generation process. In the end, we hope that θ is a
substantively meaningful quantity that teaches us something about the world (or at least can be used to derive a substantively
interesting quantity of interest).
So far, we’ve just set up the general inferential goal. Now, we can introduce different theories of actually achieving said goal.
Specifically, we focus on two general approaches to inference and estimation: frequentist / maximum likelihood and Bayesian.
The two are distinguished by their sources of variability, the mathematical objects involved, and estimation and inference. It is
important to keep track of the sources of randomness in each of these paradigms since different estimators are used for random
variables as opposed to constants.
Let’s formalize the notion of inference using Bayes’ Rule. First, let’s restate the goal of inference: it’s to estimate the proba-
bility that the parameter governing our assumed distribution is θ conditional on the sample we observe, denoted as x. We
denote this probability as ξ(θ∣x). Using Bayes’ Rule, we can equate this probability to:
ξ(θ∣x) = f_n(x∣θ)ξ(θ) / g_n(x) = f_n(x∣θ)ξ(θ) / ∫_Ω f_n(x∣θ)ξ(θ) dθ, for θ ∈ Ω
Now, since the denominator in the expression above is a constant with respect to θ (g_n(x) is simply a function of the observed data), the expression can be rewritten as:

ξ(θ∣x) ∝ f_n(x∣θ)ξ(θ)
The two theories of inference – frequentist and Bayesian – diverge at this step. Under the frequentist paradigm, the parameter
θ is a constant, albeit an unknown constant. Thus, the prior is meaningless and we can absorb ξ(θ) into the proportionality
sign (along with the normalization constant / denominator from Bayes’ Rule above). The result is what R.A. Fisher termed
likelihood:
L(θ∣x) ∝ f n (x∣θ)
Since the constant of proportionality, k(x), is never known, the likelihood is not a probability density. Instead, the likelihood is some positive multiple of f_n(x∣θ).
To summarize, the parameters in the frequentist setting (likelihood theory of inference) are unknown constants. Therefore,
we can ignore ξ(θ) and just focus on the likelihood since everything we know about the parameter based on the data is sum-
marized in the likelihood function. The likelihood function is a function of θ: it conveys the relative likelihood of drawing the
sample observations you observe given some value of θ.
In contrast to frequentist inference, in the Bayesian setting, the parameters are latent random variables, which means that
there is some variability attached to the parameters. This variability is captured through one’s prior beliefs about the value of
θ and is incorporated through the prior, ξ(θ). The focus of Bayesian inference is estimating the posterior distribution of the
parameter, ξ(θ∣x).
The posterior distribution of θ, ξ(θ∣x), is the distribution of the parameter conditional upon the observed data and provides
some sense of (relative) uncertainty regarding our estimate for θ. Note that we cannot obtain an absolute measure of uncer-
tainty since we do not truly know ξ(θ). However, even before the data is observed, the researcher may know where θ may
lie in the parameter space Ω. This information can thus be incorporated through the prior, ξ(θ) in Bayesian inference. Fi-
nally, the data is conceptualized as a joint density function conditional on the parameters of the hypothesized model. That
is, f_n(x_1, x_2, ..., x_n∣θ) = f_n(x∣θ). For an iid sample, we get f(x_1∣θ) ⋅ f(x_2∣θ)⋯f(x_n∣θ). The term f_n(x∣θ) is known as the
likelihood.
To recap, where does variability in the data we observe come from? In both frameworks, the sample is a source of variabil-
ity. That is, X 1 , ..., X n form a random sample drawn from some distribution. In the Bayesian mindset, however, there is some
additional variability ascribed to the prior distribution on the parameter θ (or vector of parameters). This variability from the
prior may or may not overlap with the variability from the sample. Frequentists, by contrast, treat the parameters as unknown
constants.
As a result of the differences in philosophies, the estimation procedure and the approach to inference differ between frequentists
and Bayesians. Specifically, under the frequentist framework, we use the likelihood theory of inference where the maximum
likelihood estimator (MLE) is the single point summary of the likelihood curve. It is the point which maximizes the likelihood
function. In contrast, the Bayesian approach tends to focus on the posterior distribution of θ and various estimators, such as
the posterior mean (PM) or maximum a posteriori estimator (MAP), which summarize the posterior distribution.
To summarize the distinction between the two approaches to inference, it helps to examine a typology of mathematical ob-
jects. It classifies objects based on whether they are random or not, and whether they are observed or not. When confronted
with inference, one must always ask whether there is a density on any given object. The presence of a density implies variability. Furthermore, one must ask whether the quantity is observed or not.
2 Introduction to Maximum Likelihood Estimation
The likelihood, L(θ∣x), is a function that assigns a value to each point in parameter space Ω which indicates how likely each
value of the parameter is to have generated the data. This is proportional to the joint probability distribution of the data as a
function of the unknown coefficients. According to the likelihood theory of inference, the likelihood function summarizes all
the information we have about the parameters given the data we observe. The method of maximum likelihood obtains values
of model parameters that define a distribution that is most likely to have resulted in the observed data. For many statistical
models, the MLE estimator is just a function of the observed data. Furthermore, we often work with the log of the likelihood
function, denoted as log L(θ∣x) = ℓ(θ∣x). Note that since the log is a monotonic function, this does not change any information
we have about the parameter.
As has been alluded to, it is important to distinguish the likelihood of the parameter θ from the probability distribution of
θ conditional upon the data, which is obtained via Bayes’ Theorem. The likelihood is not a probability. Instead, the likelihood
is a measure of relative uncertainty about the plausible values of θ, given by Ω. This relativity is exactly what allows us to work
with the log of the likelihood and to scale the likelihood using monotonic transformations. As a result, we can only compare
likelihoods within, not across, data sets.
Let us formally define the likelihood as proportional to the joint probability of the data conditional on the parameter:
L(θ∣x) ∝ f_n(x∣θ) = ∏_{i=1}^{n} f(x_i∣θ)
The maximum likelihood estimate of θ, which we denote as θ̂ M LE , is the value of θ in parameter space Ω that maximizes the
likelihood (or log-likelihood) function. It is the value of θ that is most likely to have generated the data.
Alternatively, we could work with the log-likelihood function because maximizing the logarithm of the likelihood is the same as maximizing the likelihood (due to monotonicity):

θ̂_MLE = arg max_{θ∈Ω} log L(θ∣x) = arg max_{θ∈Ω} ℓ(θ∣x) = arg max_{θ∈Ω} ∑_{i=1}^{n} log f(x_i∣θ)
How do we actually find the MLE? There are two alternatives: analytic and numeric. We won’t focus on numeric optimization
methods in this note, but analytically, finding the MLE involves taking the first derivative of the log-likelihood (or likelihood
function), setting it to 0, and solving for the parameter θ. We then need to check that we have indeed obtained a maximum by
calculating the second derivative at the critical value and checking that it is negative.
To further introduce some terminology, let us define the score as the first derivative of the log-likelihood function with re-
spect to each of the parameters (gradient). For a single parameter:
S(θ) = ∂ℓ(θ)/∂θ
The first order condition thus involves setting the score to zero and solving for θ.
In the case of multiple parameters (a vector θ of length k), the score is defined as:
S(θ) = ∇ℓ(θ) = (∂ℓ(θ)/∂θ_1, ∂ℓ(θ)/∂θ_2, ..., ∂ℓ(θ)/∂θ_k)^T
We can visualize the log-likelihood curve quite easily using R (at least for the most common distributions). Let’s visual-
ize a log-likelihood curve for µ in a normal distribution with unknown µ and a known σ = 1. The observed data is x =
{7, 6, 5, 5, 7, 5, 6, 3, 4, 6}.
# Observed data and log-likelihood for mu in a Normal(mu, 1) model
my.data <- c(7, 6, 5, 5, 7, 5, 6, 3, 4, 6)
norm.ll <- function(x) return(sum(dnorm(my.data, mean = x, sd = 1, log = TRUE)))
norm.ll <- Vectorize(norm.ll)   # vectorize so curve() can evaluate it over a grid of mu values
curve(norm.ll, from = 0, to = 10, lwd = 2, xlab = expression(mu), ylab = "Log-Likelihood")
Similarly, we can visualize a log-likelihood curve for the parameter λ in a Poisson distribution governed by that parameter. The
observed data is x = {2, 1, 1, 4, 4, 2, 1, 2, 1, 2}.
# Observed data and log-likelihood for lambda in a Poisson(lambda) model
my.data <- c(2, 1, 1, 4, 4, 2, 1, 2, 1, 2)
pois.ll <- function(x) return(sum(dpois(my.data, lambda = x, log = TRUE)))
pois.ll <- Vectorize(pois.ll)
curve(pois.ll, from = 0, to = 10, lwd = 2, xlab = expression(lambda), ylab = "Log-Likelihood")
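As a quick check on these plots, we can also locate the maxima numerically. The following sketch is not part of the original example: the object names norm.data, norm.ll2, pois.data, and pois.ll2 are introduced here for clarity, and R's optimize() is used on the same two datasets. The maximizers agree with the sample means.

# Numerically maximize the two log-likelihoods above with optimize()
norm.data <- c(7, 6, 5, 5, 7, 5, 6, 3, 4, 6)
norm.ll2  <- function(mu) sum(dnorm(norm.data, mean = mu, sd = 1, log = TRUE))
optimize(norm.ll2, interval = c(0, 10), maximum = TRUE)$maximum     # about 5.4 = mean(norm.data)

pois.data <- c(2, 1, 1, 4, 4, 2, 1, 2, 1, 2)
pois.ll2  <- function(lambda) sum(dpois(pois.data, lambda = lambda, log = TRUE))
optimize(pois.ll2, interval = c(0.01, 10), maximum = TRUE)$maximum  # about 2 = mean(pois.data)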
The results of these two plots, along with the MLE estimates of the respective parameters, are presented in Figure 1. The log-
likelihood surface for a 2-parameter example (a normal distribution with unknown mean and variance) is presented in Figure
2.
Figure 1: Examples of Log-Likelihood Functions. (a) Log-likelihood curve for µ in a normal distribution based on the data {7, 6, 5, 5, 7, 5, 6, 3, 4, 6}, with σ = 1. (b) Log-likelihood curve for λ in a Poisson distribution based on the data {2, 1, 1, 4, 4, 2, 1, 2, 1, 2}.
Figure 2: Example of Log-Likelihood Function for Two Parameters: µ and σ in a normal distribution. (a) Log-likelihood surface for both µ and σ. (b) Marginal log-likelihood curve for µ. (c) Marginal log-likelihood curve for σ.
2.2.1 MLE Estimation for Sampling from a Bernoulli Distribution

Suppose X_1, ..., X_n form a random sample from a Bernoulli distribution with unknown parameter θ. The likelihood is:

L(θ ∣ x_1, ..., x_n) = f_n(x∣θ) = ∏_{i=1}^{n} θ^{x_i} (1 − θ)^{1−x_i}
Optimizing ℓ(θ∣x) by taking its derivative and finding its roots yields the estimator θ̂ = X̄.
2.2.2 MLE Estimation of Mean and Variance for Sampling from Normal Distribution
X_1, ..., X_n form a random sample from a Normal distribution with unknown parameters θ = (µ, σ²). We need to find the MLE for θ.
The likelihood function needs to be optimized with respect to parameters µ and σ 2 , where −∞ < µ < ∞ and σ 2 > 0.
First, treat σ 2 as known and find µ̂(σ 2 ). Now, we can take the partial derivative of the log likelihood with respect to the mean
parameter and set it equal to zero:
∂ℓ(θ)/∂µ = (1/σ²) ∑_{i=1}^{n} (x_i − µ) = 0

µ̂ = (∑_{i=1}^{n} x_i)/n = X̄_n
It is good practice to check that the obtained estimator is indeed the maximum using the second order condition:
∂²ℓ(θ)/∂µ² = −n/σ² < 0 for all n and all σ² > 0
Now for the variance. Plugging µ̂ = x̄ n in for µ in the log-likelihood function and taking the derivative with respect to σ 2 :
∂ℓ(θ)/∂σ² = −n/(2σ²) + (1/(2(σ²)²)) ∑_{i=1}^{n} (x_i − x̄_n)² = 0

The MLEs for µ and σ² are thus: µ̂ = X̄_n and σ̂² = (1/n) ∑_{i=1}^{n} (X_i − X̄_n)²
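If helpful, the analytic result can be verified numerically. The sketch below is purely illustrative: the data are simulated (the true mean of 2 and standard deviation of 3 are assumed values, not from the text), and optim() maximizes the joint log-likelihood over (µ, σ) for comparison with the closed-form MLEs.

set.seed(1)
x <- rnorm(200, mean = 2, sd = 3)              # simulated data (assumed example values)

# Joint negative log-likelihood of (mu, sigma); optim() minimizes by default
neg.ll <- function(par) -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))

fit <- optim(par = c(0, 1), fn = neg.ll, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6))            # keep sigma positive

fit$par                                        # numerical MLEs of (mu, sigma)
c(mean(x), sqrt(mean((x - mean(x))^2)))        # analytic MLEs: x-bar and sqrt of (1/n) * sum (x_i - x-bar)^2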
Suppose X_1, ..., X_n form a random sample from a Gamma distribution with unknown shape parameter α and scale parameter θ. Taking the derivative of the log likelihood with respect to θ and setting it equal to 0:

∂ℓ(α, θ)/∂θ = −nα/θ + (1/θ²) ∑_{i=1}^{n} x_i = 0

which yields θ̂(α) = x̄_n/α.
Plugging this back into the log likelihood function, taking its derivative with respect to α, and setting the result equal to 0:

dℓ(α, θ̂(α))/dα = −n ⋅ Γ′(α)/Γ(α) − n ⋅ log(x̄_n/α) + ∑_{i=1}^{n} log(x_i) = 0

dℓ(α, θ̂(α))/dα = −n ⋅ Γ′(α)/Γ(α) + n ⋅ log(α) − n ⋅ log(x̄_n) + ∑_{i=1}^{n} log(x_i) = 0
Solving for α as far as we can (the answer remains in terms of the digamma function), we obtain the following condition for
the MLE of α:
log(α) − Γ′(α)/Γ(α) = log(x̄_n) − (1/n) ∑_{i=1}^{n} log(x_i)
Therefore, the MLE values of α and θ must satisfy the following two equations (there is no closed-form solution for α̂):

log(α̂) − Γ′(α̂)/Γ(α̂) = log(x̄_n) − (1/n) ∑_{i=1}^{n} log(x_i)

θ̂ = x̄_n/α̂
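Because the condition involves the digamma function Γ′(α)/Γ(α), α̂ has no closed form, but it is easy to solve numerically. The sketch below is illustrative only (the data are simulated with an assumed shape of 3 and scale of 2); it uses R's digamma() and uniroot() to find α̂ and then recovers θ̂ = x̄_n/α̂.

set.seed(2)
x <- rgamma(500, shape = 3, scale = 2)         # simulated data (assumed shape and scale)

# MLE condition for alpha: log(alpha) - digamma(alpha) = log(mean(x)) - mean(log(x))
rhs <- log(mean(x)) - mean(log(x))
alpha.hat <- uniroot(function(a) log(a) - digamma(a) - rhs,
                     interval = c(1e-3, 1e3))$root
theta.hat <- mean(x) / alpha.hat               # theta-hat = x-bar / alpha-hat

c(alpha.hat, theta.hat)                        # should be close to (3, 2)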
Suppose we observe counts n_1, n_2, ..., n_k of each of k types among n total draws from a multinomial distribution with probabilities θ_1, θ_2, ..., θ_k. The likelihood is:

L(θ∣n) = f_n(n∣θ_1, θ_2, ..., θ_k) = (n!/(n_1! n_2! ⋯ n_k!)) θ_1^{n_1} θ_2^{n_2} ⋯ θ_k^{n_k}, for ∑_{i=1}^{k} n_i = n
Before we can maximize the log of the likelihood function, we have to remember that we are maximizing it subject to the constraint that ∑_{i=1}^{k} θ_i = 1 (a property of the multinomial distribution). Therefore, we will proceed by maximizing the following equation using Lagrange multipliers:

Λ(θ_1, ..., θ_k, λ) = ℓ(θ_1, ..., θ_k) + λ ⋅ (1 − ∑_{i=1}^{k} θ_i)
Now, we solve ∇_{θ_1,...,θ_k,λ} Λ(θ_1, ..., θ_k, λ) = ∇_{θ_1,...,θ_k,λ} (ln(n!) − ∑_{i=1}^{k} ln(n_i!) + ∑_{i=1}^{k} n_i ⋅ ln(θ_i) + λ ⋅ (1 − ∑_{i=1}^{k} θ_i)) = 0 and get k + 1 first-order conditions:

θ_i = n_i/λ for all i = 1, 2, ..., k, and ∑_{i=1}^{k} θ_i = 1

Since ∑_{i=1}^{k} θ_i = ∑_{i=1}^{k} n_i/λ = n/λ = 1, we have λ = n.

∴ θ̂_{i,MLE} = n_i/n for all i = 1, 2, ..., k
Note: Alternatively, the same solution is obtained for θ̂ i ,MLE if we model being the ith type of individual as a success in a
binomial distribution of n draws (where the failures are belonging to all the k − 1 remaining types).
Suppose X_1, ..., X_n form a random sample from a Uniform(0, θ) distribution. The joint density, and hence the likelihood, is:

f_n(x∣θ) = L(θ) = 1/θ^n for 0 ≤ x_i ≤ θ (i = 1, ..., n), and 0 otherwise
From the equation for L(θ) above, one can see that the MLE of θ must be a value of θ for which 0 ≤ x i ≤ θ for all i = 1, ..., n. Since L(θ)
is a monotonically decreasing function of θ, we need the smallest value of θ such that θ ≥ x i for all i = 1, ..., n in order to maximize the
log likelihood. This value is θ = max(x 1 , ..., x n ).
3 Properties of MLE: The Basics
Why do we use maximum likelihood estimation? It turns out that subject to regularity conditions, the following properties
hold for the MLE (see proofs in Section 5):
1. Consistency: As sample size (n) increases, the MLE (θ̂_MLE) converges to the true parameter, θ_0:

θ̂_MLE →_p θ_0
2. Normality: As sample size (n) increases, the MLE is normally distributed with a mean equal to the true parameter (θ 0 )
and the variance equal to the inverse of the expected sample Fisher information at the true parameter (denoted as In (θ 0 )):
θ̂_MLE ∼ N(θ_0, I_n(θ_0)^{−1}), where I_n(θ_0) = −E[∂²ℓ(θ∣x)/∂θ² ∣_{θ=θ_0}]
However, using the consistency property of the MLE and observed sample Fisher information, we can use the inverse
of the observed sample Fisher information evaluated at the MLE, denoted as Jn (θ̂ M LE ) to approximate the variance.
Note that the observed sample Fisher information, which will be defined in detail below, is the negation of the second
derivative of the log-likelihood curve.
θ̂_MLE ∼ N(θ_0, J_n(θ̂_MLE)^{−1}), where J_n(θ̂_MLE) = −∂²ℓ(θ∣x)/∂θ² ∣_{θ=θ̂_MLE}
3. Efficiency: As sample size (n) increases, MLE is the estimation procedure that generally provides the lowest variance.
4. In a finite sample, MLE gives the minimum variance unbiased estimator, or MVUE, if it exists.
Additionally, the MLE is invariant to reparameterization: if λ = g(θ), then λ̂_MLE = g(θ̂_MLE).

Proof: By definition of the MLE, θ̂_MLE ∈ Ω and ℓ(θ̂_MLE∣x) ≥ ℓ(θ∣x) for all θ ∈ Ω. Thus, setting λ̂ = g(θ̂_MLE), the induced likelihood for λ attains its maximum at λ̂, so no other value of λ can achieve a higher likelihood.
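The consistency and asymptotic normality properties are easy to see in a small simulation. The sketch below is illustrative rather than part of the original notes: it assumes an Exponential model with true rate λ_0 = 2 (whose MLE for the rate is 1/X̄) and shows the MLEs concentrating around λ_0, with a spread that shrinks at the rate implied by the inverse Fisher information as n grows.

set.seed(3)
lambda0 <- 2                                   # true rate (assumed value)

# Draw 'reps' datasets of size n from Exponential(lambda0); the MLE of the rate is 1 / x-bar
sim.mle <- function(n, reps = 5000) {
  replicate(reps, 1 / mean(rexp(n, rate = lambda0)))
}

for (n in c(10, 100, 1000)) {
  mles <- sim.mle(n)
  cat("n =", n, "; mean of MLEs =", round(mean(mles), 3),
      "; sd of MLEs =", round(sd(mles), 4), "\n")
}

# As n grows, the MLEs concentrate around lambda0 (consistency) and their spread
# shrinks roughly like sqrt(lambda0^2 / n), the asymptotic standard deviation
hist(sim.mle(1000), breaks = 50, main = "MLEs at n = 1000", xlab = expression(hat(lambda)))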
4 Fisher Information and Uncertainty of MLE
In addition to providing the MLE as a single point summary of the likelihood curve, we would like to quantify how certain we
are in our estimate. We saw in the previous section that the variance of the MLE is asymptotically given by the inverse of the
expected Fisher information. However, we usually approximate expected Fisher information with observed Fisher information.
What are these two quantities? In this section, we’ll define the observed and expected Fisher information, as well as state the
intuition behind why these are useful quantities. The more theoretical derivations and proofs follow in subsequent sections.
First, let’s develop some basic intuition regarding uncertainty of the MLE. Note that the curvature of the likelihood curve
around the MLE contains information about how certain we are as to our estimate. Before we wade into the weeds of specific
calculations, let’s understand the intuition behind this. Looking at Figure 3, we can see that intuitively, we are more certain in
our MLE if the curve has a steeper slope around the MLE than if it has a slope closer to 0. That is, we have more certainty in
an MLE if the second derivative at the MLE is a more negative number. Recall that at the MLE, the slope of the log-likelihood
function is 0, so the second derivative of the log-likelihood curve evaluated at the MLE captures how quickly the slope changes
from 0 as you move away from the MLE in either direction. This comes from the interpretation of the second derivative as the
rate of change of the slope.
Figure 3: Two log-likelihood curves with different curvature at the maximum (second derivatives of −1 and −2 at the MLE); the more sharply curved log-likelihood conveys greater certainty about the MLE.
How do we formalize this intuition? Let’s define the observed Fisher information, termed J , to be the negation of the second
derivative of the log-likelihood function. Specifically, we will evaluate it at the MLE1 :
J(θ̂_MLE) = −∂²ℓ(θ∣x)/∂θ² ∣_{θ=θ̂_MLE} = −∂² log f(x∣θ)/∂θ² ∣_{θ=θ̂_MLE}
1 Technically, we can calculate the observed Fisher information at any value of θ, but we will always talk about it as evaluated at the MLE.
Note that we negate the second derivative (which is always negative at the MLE) so that we always have a positive observed
Fisher information. Now, the intuition we developed about the steepness of the curve follows through: the steeper the curve
around the MLE, the larger the observed Fisher information.
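In practice, the second derivative need not be computed by hand: when the log-likelihood is maximized numerically, the Hessian of the negative log-likelihood at the optimum is exactly the observed Fisher information. The sketch below is illustrative only, with an assumed Poisson model and simulated counts; it uses optim() and optimHess() from base R.

set.seed(4)
x <- rpois(50, lambda = 4)                     # simulated counts (assumed example values)

# Negative log-likelihood for a Poisson rate lambda
neg.ll <- function(lambda) -sum(dpois(x, lambda = lambda, log = TRUE))

fit <- optim(par = 1, fn = neg.ll, method = "Brent", lower = 0.01, upper = 50)
obs.info <- optimHess(fit$par, neg.ll)         # Hessian of the negative log-likelihood = J(lambda-hat)

fit$par                                        # numerical MLE (equals mean(x))
obs.info
sum(x) / fit$par^2                             # analytic observed information, (sum x_i) / lambda-hat^2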
Note that in the case of multiple parameters (we have a vector of k parameters θ), the observed Fisher information is the
negation of the hessian, the matrix of second derivatives. Again, here we evaluate it at the MLE.
J(θ̂_MLE) = −∇∇^T ℓ(θ∣x) ∣_{θ̂_MLE}, where ∇∇^T ℓ(θ∣x) is the k × k Hessian matrix whose (i, j) entry is ∂²ℓ(θ∣x)/∂θ_i ∂θ_j.
What is the theoretical rationale for how the Fisher information is linked to the variance of the maximum likelihood estimate?
It turns out that we can prove, using the Central Limit Theorem, that the MLE is asymptotically normal with a mean equal to
the true parameter value and variance equal to the inverse of the expected Fisher information evaluated at the true parameter
value (see proof). To understand this, we need to define the expected Fisher information. Having defined the observed Fisher
information, we can define the expected Fisher information as the expectation of the observed Fisher information:
I(θ) = −E[∂²ℓ(θ∣x)/∂θ²]
The expected Fisher information is thus a function of θ – the parameter we are trying to estimate – that gives us the expected
information across the samples we could draw from our distribution of interest. That is, imagine drawing 1,000,000 different
samples (or even better, infinite samples!) from the distribution of interest (even though in reality we observe only one). Each
sample will have a slightly different MLE and also a slightly different observed Fisher information (since observed Fisher infor-
mation is a sample-specific quantity). The observed Fisher information we expect, on average, across all possible samples is the
expected Fisher information. Moreover, note that since the MLEs are different from sample to sample, we also have a variance
across MLEs – this is in fact the variance we are after (recall that in the frequentist inferential framework, all randomness comes
from sampling and the parameters are fixed)! The inverse of the expected Fisher information captures this variance. For an
example that will elucidate these concepts, see Figure 4.
We can also be more specific as to what kind of Fisher information we are dealing with. Specifically, we can define the sample
expected Fisher information across all the random variables in our sample and the unit expected Fisher information for just
one random variable.
The unit expected Fisher information is defined for one random variable from our distribution:
I(θ) = −E[∂²/∂θ² log f(X∣θ)]

The sample expected Fisher information is defined analogously for the joint log-likelihood of the entire sample:

I_n(θ) = −E[∂²ℓ(θ∣x)/∂θ²]
Note that for distinction, we have added an n subscript to explicitly differentiate the sample Fisher information from the unit
Fisher information.
Since our sample is a set of n iid random variables, we can relate the sample Fisher information to the unit Fisher information using the linearity of expectation and the fact that we can bring the derivative operator inside the summation:

I_n(θ) = −E[∂²ℓ(θ∣x)/∂θ²] = −E[∂²/∂θ² ∑_{i=1}^{n} log f(x_i∣θ)] = ∑_{i=1}^{n} (−E[∂²/∂θ² log f(x_i∣θ)]) = n ⋅ (−E[∂²/∂θ² log f(X∣θ)]) = n ⋅ I(θ)

∴ I_n(θ) = nI(θ)
This tells us that for a sample of iid random variables, the expected sample information is just a sum of the individual expected
informations across the n observations. This relationship becomes important in the proofs of the asymptotic distribution of the
MLE.
Finally, you may ask how we can get away with using the observed Fisher information even though we have just stated that
the MLE is asymptotically distributed normally with a variance equal to the inverse of the expected Fisher information? It in
fact turns out that the observed Fisher information is consistent for the expected Fisher information. That is, we can prove
using the law of large numbers that the observed information converges to the expected Fisher information as the sample size
increases. Moreover, we can evaluate the observed Fisher information at the MLE instead of at the true (unknown) value of
the parameter because of the consistency of the MLE (proven below). In most applications, we thus use the observed Fisher
information.
Let’s write down the log-likelihood and solve for the MLE:
n
ℓ(p) = log ∏ p x i (1 − p)1−x i
i=1
∂ℓ(p) ∑ x i n − ∑ x i
= − =0
∂p p 1− p
p̂ MLE = X̄
What is the second derivative of the log-likelihood?
∂²ℓ(p)/∂p² = −(∑ x_i)/p² − (n − ∑ x_i)/(1 − p)²
We now compute the expected Fisher information as the expected value of the negation of the second derivative:

I_n(p) = E[(∑ x_i)/p² + (n − ∑ x_i)/(1 − p)²]
Since E[X_i] = p, so that E[∑ x_i] = np:

I_n(p) = np/p² + (n − np)/(1 − p)²
Simplifying:
I_n(p) = n/p + n/(1 − p) = n/(p(1 − p))

We have found the expected sample Fisher information for X_1, ..., X_n ∼ Bern(p).
The reason that this quantity is not particularly tractable for calculation of uncertainty is that it depends on p, the unknown
parameter! Instead, we can use the observed Fisher information, evaluated at the MLE:
J(p̂_MLE) = n/(p̂_MLE(1 − p̂_MLE)) = n/(X̄(1 − X̄))
Asymptotically, we thus know that p̂_MLE has the following approximate distribution:

p̂_MLE ∼ N(p, X̄(1 − X̄)/n)

We can easily use this to calculate confidence intervals and test statistics.
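For instance, an approximate 95% (Wald) confidence interval for p follows directly from this distribution, with a standard error equal to the square root of the inverse observed Fisher information. The sketch below is illustrative only (the data are simulated under an assumed p = 0.3):

set.seed(5)
x <- rbinom(100, size = 1, prob = 0.3)         # simulated Bernoulli data (assumed p = 0.3)

p.hat    <- mean(x)                            # MLE of p
obs.info <- length(x) / (p.hat * (1 - p.hat))  # J(p-hat) = n / (x-bar * (1 - x-bar))
se.hat   <- sqrt(1 / obs.info)                 # approximate standard error

p.hat + c(-1, 1) * qnorm(0.975) * se.hat       # 95% Wald confidence interval for p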
Figure 4: Illustration of consistency properties of MLE and observed Fisher information. For this illustration, the sample size
(n) was varied and at each n, 10,000 datasets were drawn from a Bern(0.3) distribution. For each dataset, the MLE and the
observed Fisher information were calculated.
(a) Simulation of MLEs from the Bern(0.3) model by sample size (n). For each n, 10,000 MLEs are plotted with black dots. The dotted red line denotes the true parameter (p = 0.3), while the dotted blue line represents the mean MLE across the 10,000 samples at each value of n.
(b) Simulation of observed Fisher information from the Bern(0.3) model by sample size (n). For each n, 10,000 observed Fisher informations are plotted with black dots. The dotted red line denotes the expected Fisher information, derived in the example above. The gold line represents the simulated variance across the 10,000 MLEs at each sample size. Finally, the solid blue line represents the mean of the observed Fisher informations at each value of n.
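A sketch along the lines of the simulation described in this caption is given below. It is an approximation of the setup, not the original code: the particular sample sizes are assumed, and degenerate samples (all 0s or all 1s), for which the observed information is infinite, are simply dropped from the averages.

set.seed(6)
p0 <- 0.3
n.values <- c(10, 25, 50, 100, 250)
reps <- 10000

sim <- lapply(n.values, function(n) {
  mle  <- replicate(reps, mean(rbinom(n, size = 1, prob = p0)))  # MLE of p in each dataset
  jinf <- n / (mle * (1 - mle))                                  # observed Fisher information at the MLE
  list(mle = mle, jinf = jinf)
})

# Mean MLE approaches p0 (consistency); the mean observed information tracks
# the expected information n / (p0 * (1 - p0))
data.frame(n            = n.values,
           mean.mle     = sapply(sim, function(s) mean(s$mle)),
           mean.obs.inf = sapply(sim, function(s) mean(s$jinf[is.finite(s$jinf)])),
           expected.inf = n.values / (p0 * (1 - p0)))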
4.1 The Theory of Fisher Information
4.1.1 Derivation of Unit Fisher Information
Let’s begin by deriving the expected Fisher information for one random variable, X. To do this derivation, we need to impose
some regularity conditions on the distribution of the random variable X:
• Assume that f (x∣θ) > 0 for each value x ∈ S and each value θ ∈ Ω
Recall that by definition of a pdf, the integral of a continuous density across the sample space S is 1:
∫_S f(x∣θ) dx = 1
Let’s assume that we are able to distribute a derivative operator within the integration operator, such that:
∂ ∂
∫ f (x∣θ)∂x = ∫ f (x∣θ)∂x = ∫ f ′ (x∣θ)∂x
∂x S S ∂x S
and
∂2 ∂2
∫ f (x∣θ)∂x = ∫ f (x∣θ)∂x = ∫ f ′′ (x∣θ)∂x
∂x 2 S S ∂x 2 S
Recall that the score of the log-likelihood is defined to be the first derivative of the log-likelihood:
S = ∂/∂θ log L(θ∣x) = ∂/∂θ log f(x∣θ)
We also know the following about the score from the properties of derivatives, specifically the chain rule:
ℓ′(θ∣x) = ∂/∂θ log f(x∣θ) = (1/f(x∣θ)) ⋅ f′(x∣θ)

ℓ′(θ∣x) = f′(x∣θ)/f(x∣θ)
We can find the expected value of the score using the definition of expectation:
E_θ[∂ℓ(θ∣x)/∂θ] = ∫_S ℓ′(θ∣x) f(x∣θ) dx = ∫_S (f′(x∣θ)/f(x∣θ)) f(x∣θ) dx = ∫_S f′(x∣θ) dx
Using our ability to exchange the order of integration and differentiation:

∫_S f′(x∣θ) dx = ∂/∂θ ∫_S f(x∣θ) dx = ∂/∂θ (1) = 0
∴ Eθ [ℓ′ (θ∣x)] = 0
We have just shown that in expectation (across samples), the score will be 0.
Suppose that we define the expected unit Fisher information for random variable X as the expectation of the squared score
(you’ll see shortly how it relates to our previous expression for the expected information).
I(θ) = E_θ[(∂ℓ(θ∣x)/∂θ)²] = E_θ[(ℓ′(θ∣x))²]
However, using the fact that Eθ [ℓ′ (θ∣x)] = 0 and the definition of variance (Var(Y) = E[Y 2 ]−(E[Y])2 for any random variable
Y), the Fisher information can also be written as:
I(θ) = E_θ[(ℓ′(θ∣x))²] = E_θ[(ℓ′(θ∣x))²] − (E_θ[ℓ′(θ∣x)])² = Var[ℓ′(θ∣x)], using the fact that E_θ[ℓ′(θ∣x)] = 0
We have just shown that the expected unit Fisher information is equivalent to the variance of the score: I(θ) = Var[ℓ′(θ∣x)]. Since the first moment of the score (the expected value) is zero, the Fisher information is also the second moment of the score.
We can also derive the (more familiar) expression for the expected information in terms of the second derivative of the log-likelihood. First, let us write the second derivative of the log-likelihood (using the quotient rule) as:

ℓ″(θ∣x) = ∂/∂θ [f′(x∣θ)/f(x∣θ)] = f″(x∣θ)/f(x∣θ) − (f′(x∣θ)/f(x∣θ))² = f″(x∣θ)/f(x∣θ) − (ℓ′(θ∣x))²

The final step above follows from the fact that ℓ′(θ∣x) = f′(x∣θ)/f(x∣θ). Taking expectations of both sides:

E[ℓ″(θ∣x)] = E[f″(x∣θ)/f(x∣θ)] − E[(ℓ′(θ∣x))²]

where E[f″(x∣θ)/f(x∣θ)] = ∫_S f″(x∣θ) dx = 0 and E[(ℓ′(θ∣x))²] = I(θ).
The result is a more familiar version of the expected unit Fisher information:
I(θ) = −E[ℓ′′ (θ∣x)]
Similarly, the score and second derivative of the log-likelihood for the sample can be expressed in terms of sums of unit scores
and second-derivatives:
S = ∂/∂θ log L(θ∣x) = ℓ′(θ∣x) = ∑_{i=1}^{n} ℓ′(θ∣x_i)

∂²/∂θ² log L(θ∣x) = ℓ″(θ∣x) = ∑_{i=1}^{n} ℓ″(θ∣x_i)
Using these expressions and the linearity of expectation, the expected value of the negative of the second derivative of the log-
likelihood function is:
E[−ℓ″(θ∣x)] = ∑_{i=1}^{n} E[−ℓ″(θ∣x_i)]

The left-hand side is the expected sample Fisher information, I_n(θ), and each term in the sum is the unit Fisher information, I(θ). We have just proved that the following relation holds true for an iid sample of n random variables:

I_n(θ) = n ⋅ I(θ)
Now consider the observed sample Fisher information:

J_n(θ) = −∂²ℓ(θ∣x)/∂θ² = −∂²/∂θ² log f(x∣θ)
We can rewrite the observed Fisher information as the sum of second derivatives:
J_n(θ) = −∑_{i=1}^{n} ∂²/∂θ² log f(x_i∣θ)
X_1, X_2, ..., X_n are iid, hence the second derivatives on the right-hand side of the expression above are also iid. Thus, by the law of large numbers, their average converges to the (negated) expectation of a single term:

(1/n) J_n(θ) →_p −E[∂²/∂θ² log f(x_i∣θ)] = I(θ)
So by the consistency we have just shown, we can use J_n(θ) in place of I_n(θ). However, we still do not know θ_0, the true value of the parameter (remember that the variance of the MLE is asymptotically I_n(θ_0)^{−1}). But it turns out that θ̂_MLE is a consistent estimator for θ_0, and as a result we use J_n(θ̂_MLE)^{−1} as an estimator for the variance of the MLE.
In the case of multiple parameters (a vector θ of length k), these definitions generalize as follows. The score is the gradient of the log-likelihood:

S(θ) = ∇ℓ(θ∣x) = (∂ℓ(θ∣x)/∂θ_1, ∂ℓ(θ∣x)/∂θ_2, ..., ∂ℓ(θ∣x)/∂θ_k)^T
I(θ) = −E[∇²ℓ(θ∣x)], where ∇²ℓ(θ∣x) is the k × k Hessian matrix whose (i, j) entry is ∂²ℓ(θ∣x)/∂θ_i ∂θ_j.
We can show, analogously to the case with a single parameter, that the expected Fisher information is equal to the variance of the score:
I(θ) = −E[∇2 ℓ(θ∣x)] = Var[∇ℓ(θ∣x)] = Var[S(θ)]
5 Proofs of Asymptotic Properties of MLE
In this section, we want to prove the asymptotic properties of the MLE: consistency, normality, and efficiency.
We first introduce some notation. Let X n = X 1 , ..., X n be our random sample, where the X i ’s are iid. Let θ̂ M LE be the maximum
likelihood estimator for the parameter θ. θ 0 is the true underlying value of the parameter. Note that θ̂ M LE , θ 0 , θ ∈ Ω (the
parameter space).
Furthermore, to prove the asymptotic properties of the maximum likelihood estimator, we will introduce several regularity
conditions:
• The parameter space must be a bounded and closed set. That is, Ω must be a compact subset.
• The true value of the parameter, θ 0 , must be an interior point of the parameter set: θ 0 ∈ int(Ω). Phrased differently, θ 0
cannot be on the boundary of the set.
• Integration and differentiation are interchangeable (as we assumed above for the derivation of expected Fisher information).
Consistency. We first show that the MLE converges in probability to the true value of the parameter:

θ̂_MLE →_p θ_0
Since the observations in our sample are iid, we can write the log-likelihood as the sum of log-likelihoods for each observation x i :
ℓ(θ∣x) = ∑_{i=1}^{n} ℓ(θ∣x_i)
Let’s divide by n, which we can do since it doesn’t affect the maximization of the log-likelihood. Now, we have an expression
that looks like the average of log-likelihoods across all the Xs. We can then show by the strong law of large numbers that that
converges to the expected value of a log-likelihood of a single X:
1 n a.s.
∑ ℓ(θ∣x i ) → Eθ 0 ℓ(θ∣x) = Eθ 0 log f (x∣θ)
n i=1
In this expression, Eθ 0 represents the expectation of the density with respect to the true unknown parameter and thus we define
a new function L(θ), which is the expected log-likelihood function:
L(θ) = E_{θ_0} log f(x∣θ) = ∫_{−∞}^{∞} log f(x∣θ) ⋅ f(x∣θ_0) dx
As a result, the normalized log-likelihood converges to the expected log-likelihood function L(θ) for any value of θ. This ex-
pression depends solely on θ and not on x since we integrate it out.
Now, let’s look at the divergence between L(θ) and L(θ 0 ) (the expected log-likelihood function evaluated at an arbitrary
parameter θ and the true parameter θ 0 ):
L(θ) − L(θ_0) = E_{θ_0}[log f(x∣θ) − log f(x∣θ_0)] = E_{θ_0}[log(f(x∣θ)/f(x∣θ_0))]
By Jensen’s inequality:
E_{θ_0}[log(f(x∣θ)/f(x∣θ_0))] ≤ log E_{θ_0}[f(x∣θ)/f(x∣θ_0)] = log ∫_{−∞}^{∞} (f(x∣θ)/f(x∣θ_0)) ⋅ f(x∣θ_0) dx = log ∫_{−∞}^{∞} f(x∣θ) dx = log(1) = 0, since the last integral equals 1 by the definition of a pdf.
Thus:
L(θ) − L(θ 0 ) ≤ 0
L(θ) ≤ L(θ 0 )
This inequality suggests that the expected log-likelihood when assuming that an arbitrary parameter θ is governing the data
generation process is no greater than the expected log-likelihood when you correctly identify the true parameter θ 0 governing
the data generation process. Note that this is closely related to the concept of the Kullback-Leibler divergence. In fact, we know
from Gibbs’ inequality that the Kullback-Leibler divergence between f (x∣θ 0 ) and f (x∣θ) must be non-negative:
D_KL(f(x∣θ_0) ∣∣ f(x∣θ)) = E_{θ_0}[log(f(x∣θ_0)/f(x∣θ))] ≥ 0
In more practical terms, this inequality suggests that no distribution describes the data as well as the true distribution that gen-
erated it. Therefore, on average, the greatest log-likelihood will be the one that is a function of the true parameter θ 0 . Phrased
differently, θ 0 is the maximizer of the expected log-likelihood, L(θ).
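To make this concrete, the short sketch below (an illustration under an assumed Bernoulli truth of θ_0 = 0.3) plots the expected log-likelihood L(θ) = θ_0 log θ + (1 − θ_0) log(1 − θ) and confirms that its maximizer is θ_0.

theta0 <- 0.3                                  # assumed true parameter of a Bernoulli model

# Expected log-likelihood L(theta) = theta0 * log(theta) + (1 - theta0) * log(1 - theta)
exp.ll <- function(theta) theta0 * log(theta) + (1 - theta0) * log(1 - theta)

curve(exp.ll, from = 0.01, to = 0.99, lwd = 2,
      xlab = expression(theta), ylab = "Expected log-likelihood")
abline(v = theta0, lty = 2)                    # the maximizer is the true value theta0
optimize(exp.ll, interval = c(0.01, 0.99), maximum = TRUE)$maximum  # about 0.3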
Now, let’s put the different pieces together. Recall that by the strong law of large numbers:
(1/n) ∑_{i=1}^{n} ℓ(θ∣x_i) →_a.s. E_{θ_0} log f(x∣θ)
For a finite parameter space Ω, the following holds for the MLE (convergence is uniform from the uniform strong law of large
numbers):
θ̂_MLE = arg max_{θ∈Ω} (1/n) ∑_{i=1}^{n} ℓ(θ∣x_i) →_a.s. arg max_{θ∈Ω} E_{θ_0} log f(x∣θ) = θ_0

∴ θ̂_MLE →_a.s. θ_0
In sum, by the strong law of large numbers, the MLE is actually maximizing the expected log-likelihood as n increases asymp-
totically. But we showed that the expected log-likelihood is maximized at the true value of the parameter. Therefore, as n → ∞,
the normalized log-likelihood of the data should approach the expected value of the log-likelihood of the random variable X.
Another way of stating the consistency of the MLE is that it minimizes the Kullback-Leibler divergence between an arbitrary
log-likelihood function and the true log-likelihood. If θ̂ M LE = θ 0 , the Kullback-Leibler divergence goes to 0 (by LLN).
Note also that the consistency of the MLE can also be proven for an infinite parameter space Ω as well as for non-compact
parameter spaces.
Asymptotic normality. We now show that, appropriately scaled, the MLE is asymptotically normal:

√n(θ̂_MLE − θ_0) →_d N(0, I_1(θ_0)^{−1})

where I_1(θ_0) denotes the unit expected Fisher information evaluated at the true parameter.
Ultimately, we want to rely on the fact that the score (first derivative of the log-likelihood) is a sum of iid terms, and as a result
we know that its asymptotic distribution can be approximated using the law of large numbers and the central limit theorem.
We can start the proof by noting that since the MLE maximizes the log-likelihood, the score is equal to zero at the MLE:

ℓ′(θ̂_MLE∣x) = 0

When the log-likelihood is twice differentiable, we can expand the score around the true parameter value θ_0 using a Taylor series approximation of the 1st order:

ℓ′(θ∣x) = ℓ′(θ_0∣x) + (θ − θ_0) ℓ″(θ̃∣x)

Now we can plug the MLE in for θ and, since we know that the score evaluated at the MLE is 0, the expression simplifies to:

0 = ℓ′(θ_0∣x) + (θ̂_MLE − θ_0) ℓ″(θ̃∣x)
Note that in this case, θ̃ is a point located between the MLE (θ̂_MLE) and the true parameter (θ_0). One could also obtain this equation by utilizing the mean value theorem, which states that for a function f(x) continuous on [a, b] and differentiable on (a, b), there is a point c in (a, b) such that:
f′(c) = (f(b) − f(a))/(b − a)
Applying the mean value theorem to the function ℓ′ (θ) and letting a = θ̂ M LE and b = θ 0 :
ℓ″(θ̃) = (ℓ′(θ̂_MLE) − ℓ′(θ_0))/(θ̂_MLE − θ_0), where θ̃ ∈ [θ̂_MLE, θ_0]

Since ℓ′(θ̂_MLE) = 0, rearranging gives:

θ̂_MLE − θ_0 = −ℓ′(θ_0)/ℓ″(θ̃)
Multiplying both sides by √n:

√n(θ̂_MLE − θ_0) = −√n ℓ′(θ_0)/ℓ″(θ̃)
Let’s consider the asymptotic distribution of the numerator and denominator in turn, starting with the asymptotic distribution
of the numerator. We can first express the numerator as a sum of iid scores (first derivatives of the log-likelihood of iid random
variables):
ℓ′(θ_0) = ∑_{i=1}^{n} ℓ′(θ_0∣x_i)
Using the Central Limit Theorem, we will be able to make a statement regarding the asymptotic distribution of the average of
the score, n1 ℓ′ (θ 0 ). First, though, note that in the section on Fisher information above, we proved that the first moment of the
score is zero and the second moment of the score is the expected sample Fisher information:
E_θ[ℓ′(θ∣x)] = 0 and E_θ[(ℓ′(θ∣x))²] = Var[ℓ′(θ∣x)] = I_n(θ)
For a single observation, the first moment is 0 and the second moment is I1 (θ). Invoking the Central Limit Theorem, we know
that the mean of the score is distributed as a normal with a mean of 0 and a variance of I1 (θ)/n.
(1/n) ℓ′(θ) →_d N(0, I_1(θ)/n)
This implies that ℓ′ (θ) is distributed as a normal with a mean of 0 and a variance of In (θ):
ℓ′(θ) →_d N(0, I_n(θ))
Multiplying this by √n to get the expression in the numerator:

√n ℓ′(θ) →_d N(0, n ⋅ I_n(θ))
Now, we can find the asymptotic behavior of the denominator – ℓ′′ (θ̃∣x) – in the Taylor series expansion of the score. We
can use the law of large numbers to derive the following convergence in probability for any θ:
(1/n) ℓ″(θ̃∣x) = (1/n) ∑_{i=1}^{n} ℓ″(θ̃∣x_i) →_p E[ℓ″(θ̃∣x_i)] = −I_1(θ̃)
Moreover, since we know from the consistency of the MLE that θ̂_MLE →_a.s. θ_0 and θ̃ ∈ [θ̂_MLE, θ_0], it follows that θ̃ →_a.s. θ_0. Therefore:

ℓ″(θ̃) →_a.s. −I_n(θ_0)

Combining the numerator and the denominator using Slutsky's theorem:

√n(θ̂_MLE − θ_0) = −√n ℓ′(θ_0)/ℓ″(θ̃) →_d N(0, n ⋅ I_n(θ_0)/I_n(θ_0)²) = N(0, I_1(θ_0)^{−1})

which establishes the asymptotic normality of the MLE.
Efficiency. Since, asymptotically, Var(θ̂_MLE) = I(θ)^{−1} = −1/E[∂²ℓ(θ∣x)/∂θ²], the efficiency of the MLE is:

e(θ̂_MLE) = I(θ)^{−1}/Var(θ̂_MLE) = 1
Figure 5: Illustration of Central Limit Theorem for the MLE and the score evaluated at the true parameter. For this illustration,
we simulated 1000 datasets of sample size n ∈ {10, 25, 100} from the Bern(0.3) distribution. For each dataset, we plotted the
score function (as a function of θ) and the MLE. We also evaluated the score function at the true value of the parameter, θ 0 = 0.3.
(a) The top plot portrays the 1000 score functions for simulated data with sample size n = 10. The MLEs are indicated with orange dots. The lower plot depicts the density function of the MLEs across the 1000 datasets. Note that the density of the MLEs does not yet look particularly normal.
(b) A density plot of the score function evaluated at the true parameter, θ_0, across 1000 simulated datasets of sample size n = 10 (in black). For comparison in red, we have plotted the theoretical asymptotic density of the score, N(0, I_n(θ_0)), derived using the CLT. Clearly the simulated density has not converged to the normal density.
(c) The top plot portrays the 1000 score functions for simulated data with sample size n = 25. The MLEs are indicated with orange dots. The lower plot depicts the density function of the MLEs across the 1000 datasets.
(d) A density plot of the score function evaluated at the true parameter, θ_0, across 1000 simulated datasets of sample size n = 25 (in black). For comparison in red, we have plotted the theoretical asymptotic density of the score, N(0, I_n(θ_0)), derived using the CLT. The simulated density is starting to converge to the expected normal density.
(e) The top plot portrays the 1000 score functions for simulated data with sample size n = 100. The MLEs are indicated with orange dots. The lower plot depicts the density function of the MLEs across the 1000 datasets. The MLE has essentially converged to the expected distribution.
(f) A density plot of the score function evaluated at the true parameter, θ_0, across 1000 simulated datasets of sample size n = 100 (in black). For comparison in red, we have plotted the theoretical asymptotic density of the score, N(0, I_n(θ_0)). As predicted by the CLT, the distribution of the scores has essentially converged to the expected density.
6 References
Casella, George, and Roger L. Berger. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury Press, 2002.
DeGroot, Morris H., and Mark J. Schervish. Probability and Statistics. 3rd ed. Boston, MA: Addison-Wesley, 2002.
Newey, Whitney K., and Daniel McFadden. "Large Sample Estimation and Hypothesis Testing." In Robert F. Engle and Daniel L. McFadden, editors, Handbook of Econometrics, vol. 4, 1994.
Convergence of Random Variables
What does it mean to say that a sequence converges? There are several notions of convergence for random variables. The two
main ones are convergence in probability and convergence in distribution.
Convergence in Probability
Suppose we have a sequence of random variables denoted by {X_n} = X_1, X_2, ..., X_n. The sequence converges in probability to X if the probability distribution of the sequence {X_n} is increasingly concentrated around X, that is, if for every є > 0:

lim_{n→∞} P(∣X_n − X∣ > є) = 0

Convergence in probability is denoted as X_n →_p X.
Note that convergence in probability implies convergence in distribution, but convergence in distribution implies convergence
in probability only when the limiting variable X is a constant.
We can extend convergence in probability to the multivariate case. If X_n →_p X and Y_n →_p Y, then (X_n, Y_n) →_p (X, Y).
Note that almost sure convergence implies convergence in probability, but not vice versa.
Convergence in Distribution
Suppose we have a sequence of random variables denoted by {X_n} = X_1, X_2, ..., X_n. Let F_n denote the CDF of random variable X_n and F* denote the CDF of random variable X*.

Intuitively, convergence in distribution means that if n is sufficiently large, the probability for X_n to be in a given range is approximately equal to the probability that X* is in the same range. Formally, X_n converges in distribution to X* if lim_{n→∞} F_n(x) = F*(x) at every point x at which F* is continuous.
Convergence in distribution is denoted as X_n →_d X*. X* is the asymptotic distribution of X_n.
We can extend convergence in distribution to the multivariate case. If X_n →_d X and Y_n →_d c (where c is a constant), then (X_n, Y_n) →_d (X, c).
Let X_1, X_2, ..., X_n be a sequence of random variables. Let F_n be the CDF of X_n and let ξ_n be the characteristic function of X_n, given by ξ_n(t) = E(e^{itX_n}) for all t ∈ ℝ. Let F* denote the CDF and ξ* denote the characteristic function of random variable X*. By the Lévy continuity theorem, X_n converges in distribution to X* if and only if ξ_n(t) → ξ*(t) for every t ∈ ℝ.
Let X_1, X_2, ..., X_n be a sequence of random variables. Let F_n be the CDF of X_n and let ψ_n be the moment generating function (MGF) of X_n, given by ψ_n(t) = E(e^{tX_n}) for all t ∈ ℝ. As before, let F* denote the CDF and ψ* denote the MGF of random variable X*. Assume that both MGFs exist. If ψ_n(t) → ψ*(t) for all t in an open interval containing zero, then X_n converges in distribution to X*.
Law of Large Numbers

Weak law of large numbers: let X_1, X_2, ..., X_n be iid random variables with mean µ and finite variance σ². Then the sample mean converges in probability to µ:

X̄_n →_p µ
Recall that Chebyshev’s inequality indicates that for any random variable X:
σ2
P(∣X − µ∣ > є) ≤ 2
є
First, find the expected value and the variance of the mean of the sequence:

E[X̄_n] = µ and Var[X̄_n] = σ²/n

Applying Chebyshev's inequality to X̄_n:

P(∣X̄_n − µ∣ > є) ≤ σ²/(nє²)
Note that as n → ∞, the right-hand side goes to 0, so P(∣X̄_n − µ∣ > є) → 0, or equivalently P(∣X̄_n − µ∣ < є) → 1, for every є > 0.

∴ X̄_n →_p µ
The strong law of large numbers strengthens this conclusion: under the same iid assumptions, the sample mean converges almost surely:

X̄_n →_a.s. µ
Moreover, we can generalize the strong law of large numbers to any function of x. Let X_1, X_2, ..., X_n be iid random variables, and assume that f(x, θ) is a continuous function of x defined for all θ ∈ Ω. The strong law of large numbers states:
(1/n) ∑_{i=1}^{n} f(X_i, θ) →_a.s. E[f(X, θ)]
The conditions that need to hold for the uniform strong law of large numbers to apply to a random variable X and a func-
tion f (x, θ) are:
• f(x, θ) should be semi-continuous in θ ∈ Ω for all x
• There must be a function K(x) with E[K(X)] < ∞ and ∣f(x, θ)∣ ≤ K(x) for all x and θ
Central Limit Theorem

Stated rigorously: let X_1, ..., X_n form a random sample of size n from a distribution with mean µ and variance σ². Then for each fixed number x:

lim_{n→∞} P[√n(X̄_n − µ)/σ ≤ x] = Φ(x)
In a sense, the CLT governs the shape of the convergence of X̄_n to µ, which we proved using the law of large numbers. The CLT indicates that the limit as n goes to infinity of n^{1/2}(X̄_n − µ) is non-degenerate (that is, the exponent 1/2 on n is exactly the right scaling) and that the limiting distribution is normal.
We begin with the random variable Z_n = √n(X̄_n − µ)/σ and show that it converges in distribution to the standard normal. Define the standardized variables Y_i = (X_i − µ)/σ, so that Z_n = (1/√n) ∑_{i=1}^{n} Y_i.

Let ψ(t) denote the MGF of the random variable Y_i. Since the MGF of a sum of independent random variables is the product of their MGFs, the MGF of ∑_{i=1}^{n} Y_i is (ψ(t))^n. Scaling the sum by 1/√n replaces t with t/√n, so the MGF of the standardized sum Z_n is:

ψ_n(t) = (ψ(t/√n))^n
We can express the MGF of the standardized sum of random variables using a Taylor series expansion around the point t = 0.
This enables us to incorporate the information that the first moment of the standardized RV is 0 and the second moment is 1:
First moment: E(Y_i) = ψ′(0) = 0. Second moment: E(Y_i²) = ψ″(0) = 1.
The Taylor series expansion of ψ(t) around t = 0 (hence a Maclaurin series) is:

ψ(t) = ψ(0) + tψ′(0) + (t²/2!)ψ″(0) + (t³/3!)ψ‴(0) + ... = ψ(0) + tψ′(0) + (t²/2!)ψ″(t*) = 1 + (t²/2)ψ″(t*) for some 0 < t* < t
Note that instead of writing an infinite series, I wrote the first two terms of the series and then used a Lagrange remainder.
ψ_n(t) = [1 + (t²/(2n))ψ″(t*)]^n for some 0 < t* < t/√n
This now looks like a case of a notable limit from calculus, given in its general form by:
lim_{x→∞} [1 + k/x]^x = e^k

As n → ∞, t* → 0 and hence ψ″(t*) → ψ″(0) = 1, so:

∴ lim_{n→∞} ψ_n(t) = lim_{n→∞} [1 + (t²/2)/n]^n = e^{t²/2} = ψ*(t)
We have just proven that the MGF of the standardized sum of iid random variables converges to the MGF of the standard normal distribution. Therefore, the standardized sum converges in distribution to the standard normal distribution:

∴ Z_n →_d Z ∼ N(0, 1)
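A small simulation makes the statement concrete. The sketch below is illustrative (an Exponential(1) parent distribution and particular sample sizes are assumed): it shows the standardized sample mean Z_n settling into the standard normal shape as n grows.

set.seed(7)
mu <- 1; sigma <- 1                            # mean and sd of an Exponential(1) parent (assumed)

# Standardized sample mean Z_n for 'reps' simulated datasets of size n
zn <- function(n, reps = 10000) {
  replicate(reps, sqrt(n) * (mean(rexp(n, rate = 1)) - mu) / sigma)
}

par(mfrow = c(1, 3))
for (n in c(2, 10, 100)) {
  hist(zn(n), breaks = 50, freq = FALSE, main = paste("n =", n), xlab = expression(Z[n]))
  curve(dnorm(x), add = TRUE, lwd = 2)         # standard normal reference density
}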
Continuous Mapping Theorem

More technically: define X to be a random variable on a metric space S, and let X_n denote a sequence of random variables on S. The continuous function g(⋅) then maps S → S′.
The continuous mapping theorem indicates that the following hold (assuming g(⋅) is continuous at X):
• X_n →_d X ⇒ g(X_n) →_d g(X)
• X_n →_p X ⇒ g(X_n) →_p g(X)
• X_n →_a.s. X ⇒ g(X_n) →_a.s. g(X)
Furthermore, one can show that if X_n →_p a and Y_n →_p b, then g(X_n, Y_n) →_p g(a, b) (assuming that g(⋅) is continuous at (a, b)).
Slutsky’s Theorem
Suppose that we have two sequences, {X_n} and {Y_n}. Furthermore, suppose that X_n →_d X and Y_n →_p c, where c is a constant.
Slutsky’s theorem indicates that the following relationships hold between convergent sequences:
• X_n + Y_n →_d X + c
• Y_n ⋅ X_n →_d cX
• X_n/Y_n →_d X/c (for c ≠ 0)
Proof: The proof of Slutsky’s theorem is quite simple - in effect, the theorem is just a particular application of the continuous
d p d
mapping theorem. We know that since X n → X and Yn → c, the joint vector (X n , Yn ) → (X, c). Now, using the multivariate
version of the continuous mapping theorem, we respectively let g(x, y) = x + y, g(x, y) = x y, and g(x, y) = xy .