Maximum Likelihood Estimation by K. Kashin
Spring 2014
∗ This note was initially prepared to supplement course material for Gov 2001 for Spring 2013. It is in part based upon my notes from Stat 111 at Harvard.
1 Overview: Maximum Likelihood vs. Bayesian Estimation
This note is about the mechanics of maximum likelihood estimation (MLE). However, before delving into the mechanics of
finding the MLE, let’s step back and lay out maximum likelihood as a theory of inference. Specifically, it will prove useful to
compare maximum likelihood to Bayesian theory of inference.
In general, what is statistical inference? It’s the process of making a statement about how data is generated in the world. We
can think of the data that we observe in the world as a product of some data generation process (DGP) that is fundamentally
unknown and likely highly convoluted. However, it is our goal as social scientists or applied statisticians to use observed data to
learn something about the DGP. In parametric inference, we are going to assume that we can represent the DGP by a statistical
model. Remember that whatever model we choose, it will essentially never be “right”. Thus, the question is whether or not the
model is useful. In the context of this note, we will limit ourselves to very common probability distributions as our models.
Once we have selected a probability distribution to represent the data generation process, we aren’t done, for in fact we have
not specified a unique distribution, but just a family of distributions. This is because we leave one (or more) parameters as
unknown. The goal of statistical inference is then to use observed data to make a statement about parameters that govern our
model.
To introduce some notation, we are going to call the model, or probability distribution, that we choose f(⋅). This probability distribution is going to depend on a parameter θ (or vector of parameters, θ = (θ_1, θ_2, ..., θ_k)) that characterizes the distribution.
The set Ω of all possible values of a parameter or the vector of parameters is called the parameter space. We then observe some
data, drawn from this distribution:
X ∼ f (x∣θ)
The random variables X 1 , ... ,X n are independent and identically distributed (iid) because they are drawn independently from
the same DGP. The goal is to use the observed data x to learn about θ. Knowing this parameter specifies a particular distribu-
tion from the family of distributions we have selected to represent the data generation process. In the end, we hope that θ is a
substantively meaningful quantity that teaches us something about the world (or at least can be used to derive a substantively
interesting quantity of interest).
So far, we’ve just set up the general inferential goal. Now, we can introduce different theories of actually achieving said goal.
Specifically, we focus on two general approaches to inference and estimation: frequentist / maximum likelihood and Bayesian.
The two are distinguished by their sources of variability, the mathematical objects involved, and estimation and inference. It is
important to keep track of the sources of randomness in each of these paradigms since different estimators are used for random
variables as opposed to constants.
Let’s formalize the notion of inference using Bayes’ Rule. First, let’s restate the goal of inference: it’s to estimate the proba-
bility that the parameter governing our assumed distribution is θ conditional on the sample we observe, denoted as x. We
denote this probability as ξ(θ∣x). Using Bayes’ Rule, we can equate this probability to:
ξ(θ∣x) = f_n(x∣θ)ξ(θ) / g_n(x) = f_n(x∣θ)ξ(θ) / ∫_Ω f_n(x∣θ)ξ(θ) dθ, for θ ∈ Ω
Now, since the denominator in the expression above is a constant with respect to θ (g_n(x) is simply a function of the observed data), the expression can be rewritten as:

ξ(θ∣x) ∝ f_n(x∣θ)ξ(θ)
The two theories of inference – frequentist and Bayesian – diverge at this step. Under the frequentist paradigm, the parameter
θ is a constant, albeit an unknown constant. Thus, the prior is meaningless and we can absorb ξ(θ) into the proportionality
sign (along with the normalization constant / denominator from Bayes’ Rule above). The result is what R.A. Fisher termed
likelihood:
L(θ∣x) ∝ f n (x∣θ)
Since the constant of proportionality, k(x), is never known, the likelihood is not a probability density. Instead, the likelihood is some positive multiple of f_n(x∣θ).
To summarize, the parameters in the frequentist setting (likelihood theory of inference) are unknown constants. Therefore,
we can ignore ξ(θ) and just focus on the likelihood since everything we know about the parameter based on the data is sum-
marized in the likelihood function. The likelihood function is a function of θ: it conveys the relative likelihood of drawing the
sample observations you observe given some value of θ.
In contrast to frequentist inference, in the Bayesian setting, the parameters are latent random variables, which means that
there is some variability attached to the parameters. This variability is captured through one’s prior beliefs about the value of
θ and is incorporated through the prior, ξ(θ). The focus of Bayesian inference is estimating the posterior distribution of the
parameter, ξ(θ∣x).
The posterior distribution of θ, ξ(θ∣x), is the distribution of the parameter conditional upon the observed data and provides
some sense of (relative) uncertainty regarding our estimate for θ. Note that we cannot obtain an absolute measure of uncer-
tainty since we do not truly know ξ(θ). However, even before the data is observed, the researcher may know where θ may
lie in the parameter space Ω. This information can thus be incorporated through the prior, ξ(θ) in Bayesian inference. Fi-
nally, the data is conceptualized as a joint density function conditional on the parameters of the hypothesized model. That
is, f_n(x_1, x_2, ..., x_n∣θ) = f_n(x∣θ). For an iid sample, we get f(x_1∣θ) ⋅ f(x_2∣θ)⋯f(x_n∣θ). The term f_n(x∣θ) is known as the
likelihood.
To recap, where does variability in the data we observe come from? In both frameworks, the sample is a source of variabil-
ity. That is, X 1 , ..., X n form a random sample drawn from some distribution. In the Bayesian mindset, however, there is some
additional variability ascribed to the prior distribution on the parameter θ (or vector of parameters). This variability from the
prior may or may not overlap with the variability from the sample. Frequentists, by contrast, treat the parameters as unknown
constants.
As a result of the differences in philosophies, the estimation procedure and the approach to inference differ between frequentists
and Bayesians. Specifically, under the frequentist framework, we use the likelihood theory of inference where the maximum
likelihood estimator (MLE) is the single point summary of the likelihood curve. It is the point which maximizes the likelihood
function. In contrast, the Bayesian approach tends to focus on the posterior distribution of θ and various estimators, such as
the posterior mean (PM) or maximum a posteriori estimator (MAP), which summarize the posterior distribution.
To summarize the distinction between the two approaches to inference, it helps to examine a typology of mathematical ob-
jects. It classifies objects based on whether they are random or not, and whether they are observed or not. When confronted
with inference, one must always ask whether there is a density on any given object. The presence of a density implies variability. Furthermore, one must ask whether the quantity is observed or not.
2 Introduction to Maximum Likelihood Estimation
The likelihood, L(θ∣x), is a function that assigns a value to each point in parameter space Ω which indicates how likely each
value of the parameter is to have generated the data. This is proportional to the joint probability distribution of the data as a
function of the unknown coefficients. According to the likelihood theory of inference, the likelihood function summarizes all
the information we have about the parameters given the data we observe. The method of maximum likelihood obtains values
of model parameters that define a distribution that is most likely to have resulted in the observed data. For many statistical
models, the MLE estimator is just a function of the observed data. Furthermore, we often work with the log of the likelihood
function, denoted as log L(θ∣x) = ℓ(θ∣x). Note that since the log is a monotonic function, this does not change any information
we have about the parameter.
As has been alluded to, it is important to distinguish the likelihood of the parameter θ from the probability distribution of
θ conditional upon the data, which is obtained via Bayes’ Theorem. The likelihood is not a probability. Instead, the likelihood
is a measure of relative uncertainty about the plausible values of θ, given by Ω. This relativity is exactly what allows us to work
with the log of the likelihood and to scale the likelihood using monotonic transformations. As a result, we can only compare
likelihoods within, not across, data sets.
Let us formally define the likelihood as proportional to the joint probability of the data conditional on the parameter:
L(θ∣x) ∝ f_n(x∣θ) = ∏_{i=1}^{n} f(x_i∣θ)
The maximum likelihood estimate of θ, which we denote as θ̂ M LE , is the value of θ in parameter space Ω that maximizes the
likelihood (or log-likelihood) function. It is the value of θ that is most likely to have generated the data.
Alternatively, we could work with the log-likelihood function because maximizing the logarithm of the likelihood is the same as maximizing the likelihood (due to monotonicity):

θ̂_MLE = arg max_{θ∈Ω} log L(θ∣x) = arg max_{θ∈Ω} ℓ(θ∣x) = arg max_{θ∈Ω} ∑_{i=1}^{n} log f(x_i∣θ)
How do we actually find the MLE? There are two alternatives: analytic and numeric. We won’t focus on numeric optimization
methods in this note, but analytically, finding the MLE involves taking the first derivative of the log-likelihood (or likelihood
function), setting it to 0, and solving for the parameter θ. We then need to check that we have indeed obtained a maximum by
calculating the second derivative at the critical value and checking that it is negative.
To further introduce some terminology, let us define the score as the first derivative of the log-likelihood function with re-
spect to each of the parameters (gradient). For a single parameter:
S(θ) = ∂ℓ(θ)/∂θ
The first order condition thus involves setting the score to zero and solving for θ.
In the case of multiple parameters (a vector θ of length k), the score is defined as:
S(θ) = ∇ℓ(θ) = (∂ℓ(θ)/∂θ_1, ∂ℓ(θ)/∂θ_2, ..., ∂ℓ(θ)/∂θ_k)^T
We can visualize the log-likelihood curve quite easily using R (at least for the most common distributions). Let’s visual-
ize a log-likelihood curve for µ in a normal distribution with unknown µ and a known σ = 1. The observed data is x =
{7, 6, 5, 5, 7, 5, 6, 3, 4, 6}.
# Observed data and log-likelihood for mu in a Normal(mu, 1) model
my.data <- c(7, 6, 5, 5, 7, 5, 6, 3, 4, 6)
norm.ll <- function(x) return(sum(dnorm(my.data, mean = x, sd = 1, log = TRUE)))
norm.ll <- Vectorize(norm.ll)   # vectorize so curve() can evaluate it over a grid of mu values
curve(norm.ll, from = 0, to = 10, lwd = 2, xlab = expression(mu), ylab = "Log-Likelihood")
Similarly, we can visualize a log-likelihood curve for the parameter λ in a Poisson distribution governed by that parameter. The
observed data is x = {2, 1, 1, 4, 4, 2, 1, 2, 1, 2}.
# Observed data and log-likelihood for lambda in a Poisson(lambda) model
my.data <- c(2, 1, 1, 4, 4, 2, 1, 2, 1, 2)
pois.ll <- function(x) return(sum(dpois(my.data, lambda = x, log = TRUE)))
pois.ll <- Vectorize(pois.ll)
curve(pois.ll, from = 0, to = 10, lwd = 2, xlab = expression(lambda), ylab = "Log-Likelihood")
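As a quick check on these plots, we can also locate the maxima numerically. The following sketch is not part of the original example: the object names norm.data, norm.ll2, pois.data, and pois.ll2 are introduced here for clarity, and R's optimize() is used on the same two datasets. The maximizers agree with the sample means.

# Numerically maximize the two log-likelihoods above with optimize()
norm.data <- c(7, 6, 5, 5, 7, 5, 6, 3, 4, 6)
norm.ll2  <- function(mu) sum(dnorm(norm.data, mean = mu, sd = 1, log = TRUE))
optimize(norm.ll2, interval = c(0, 10), maximum = TRUE)$maximum     # about 5.4 = mean(norm.data)

pois.data <- c(2, 1, 1, 4, 4, 2, 1, 2, 1, 2)
pois.ll2  <- function(lambda) sum(dpois(pois.data, lambda = lambda, log = TRUE))
optimize(pois.ll2, interval = c(0.01, 10), maximum = TRUE)$maximum  # about 2 = mean(pois.data)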
The results of these two plots, along with the MLE estimates of the respective parameters, are presented in Figure 1. The log-
likelihood surface for a 2-parameter example (a normal distribution with unknown mean and variance) is presented in Figure
2.
Figure 1: Examples of Log-Likelihood Functions. (a) Log-likelihood curve for µ in a normal distribution based on the data {7, 6, 5, 5, 7, 5, 6, 3, 4, 6}, with σ = 1. (b) Log-likelihood curve for λ in a Poisson distribution based on the data {2, 1, 1, 4, 4, 2, 1, 2, 1, 2}.
Figure 2: Example of Log-Likelihood Function for Two Parameters: µ and σ in a normal distribution. (a) Log-likelihood surface for both µ and σ. (b) Marginal log-likelihood curve for µ. (c) Marginal log-likelihood curve for σ.
2.2.1 MLE Estimation for Sampling from a Bernoulli Distribution

Suppose X_1, ..., X_n form a random sample from a Bernoulli distribution with unknown parameter θ. The likelihood is:

L(θ ∣ x_1, ..., x_n) = f_n(x∣θ) = ∏_{i=1}^{n} θ^{x_i} (1 − θ)^{1−x_i}
Optimizing ℓ(θ∣x) by taking its derivative and finding its roots yields the estimator θ̂ = X̄.
2.2.2 MLE Estimation of Mean and Variance for Sampling from Normal Distribution
X_1, ..., X_n form a random sample from a Normal distribution with unknown parameters θ = (µ, σ²). We need to find the MLE for θ.
The likelihood function needs to be optimized with respect to parameters µ and σ 2 , where −∞ < µ < ∞ and σ 2 > 0.
First, treat σ 2 as known and find µ̂(σ 2 ). Now, we can take the partial derivative of the log likelihood with respect to the mean
parameter and set it equal to zero:
∂ℓ(θ)/∂µ = (1/σ²) ∑_{i=1}^{n} (x_i − µ) = 0

µ̂ = (∑_{i=1}^{n} x_i)/n = X̄_n
It is good practice to check that the obtained estimator is indeed the maximum using the second order condition:
∂²ℓ(θ)/∂µ² = −n/σ² < 0 for all n and all σ² > 0
Now for the variance. Plugging µ̂ = x̄ n in for µ in the log-likelihood function and taking the derivative with respect to σ 2 :
∂ℓ(θ)/∂σ² = −n/(2σ²) + (1/(2(σ²)²)) ∑_{i=1}^{n} (x_i − x̄_n)² = 0

The MLEs for µ and σ² are thus: µ̂ = X̄_n and σ̂² = (1/n) ∑_{i=1}^{n} (X_i − X̄_n)²
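If helpful, the analytic result can be verified numerically. The sketch below is purely illustrative: the data are simulated (the true mean of 2 and standard deviation of 3 are assumed values, not from the text), and optim() maximizes the joint log-likelihood over (µ, σ) for comparison with the closed-form MLEs.

set.seed(1)
x <- rnorm(200, mean = 2, sd = 3)              # simulated data (assumed example values)

# Joint negative log-likelihood of (mu, sigma); optim() minimizes by default
neg.ll <- function(par) -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))

fit <- optim(par = c(0, 1), fn = neg.ll, method = "L-BFGS-B",
             lower = c(-Inf, 1e-6))            # keep sigma positive

fit$par                                        # numerical MLEs of (mu, sigma)
c(mean(x), sqrt(mean((x - mean(x))^2)))        # analytic MLEs: x-bar and sqrt of (1/n) * sum (x_i - x-bar)^2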
Suppose X_1, ..., X_n form a random sample from a Gamma distribution with unknown shape parameter α and scale parameter θ. Taking the derivative of the log likelihood with respect to θ and setting it equal to 0:

∂ℓ(α, θ)/∂θ = −nα/θ + (1/θ²) ∑_{i=1}^{n} x_i = 0

which yields θ̂(α) = x̄_n/α.
Plugging this back into the log likelihood function, taking its derivative with respect to α, and setting the result equal to 0:

dℓ(α, θ̂(α))/dα = −n ⋅ Γ′(α)/Γ(α) − n ⋅ log(x̄_n/α) + ∑_{i=1}^{n} log(x_i) = 0

dℓ(α, θ̂(α))/dα = −n ⋅ Γ′(α)/Γ(α) + n ⋅ log(α) − n ⋅ log(x̄_n) + ∑_{i=1}^{n} log(x_i) = 0
Solving for α as far as we can (the answer remains in terms of the digamma function), we obtain the following condition for
the MLE of α:
log(α) − Γ′(α)/Γ(α) = log(x̄_n) − (1/n) ∑_{i=1}^{n} log(x_i)
Therefore, the MLE values of α and θ must satisfy the following two equations (there is no closed-form solution for α̂):

log(α̂) − Γ′(α̂)/Γ(α̂) = log(x̄_n) − (1/n) ∑_{i=1}^{n} log(x_i)

θ̂ = x̄_n/α̂
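Because the condition involves the digamma function Γ′(α)/Γ(α), α̂ has no closed form, but it is easy to solve numerically. The sketch below is illustrative only (the data are simulated with an assumed shape of 3 and scale of 2); it uses R's digamma() and uniroot() to find α̂ and then recovers θ̂ = x̄_n/α̂.

set.seed(2)
x <- rgamma(500, shape = 3, scale = 2)         # simulated data (assumed shape and scale)

# MLE condition for alpha: log(alpha) - digamma(alpha) = log(mean(x)) - mean(log(x))
rhs <- log(mean(x)) - mean(log(x))
alpha.hat <- uniroot(function(a) log(a) - digamma(a) - rhs,
                     interval = c(1e-3, 1e3))$root
theta.hat <- mean(x) / alpha.hat               # theta-hat = x-bar / alpha-hat

c(alpha.hat, theta.hat)                        # should be close to (3, 2)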
Suppose we observe counts n_1, n_2, ..., n_k of each of k types among n total draws from a multinomial distribution with probabilities θ_1, θ_2, ..., θ_k. The likelihood is:

L(θ∣n) = f_n(n∣θ_1, θ_2, ..., θ_k) = (n!/(n_1! n_2! ⋯ n_k!)) θ_1^{n_1} θ_2^{n_2} ⋯ θ_k^{n_k}, for ∑_{i=1}^{k} n_i = n
Before we can maximize the log of the likelihood function, we have to remember that we are maximizing it subject to the constraint that ∑_{i=1}^{k} θ_i = 1 (a property of the multinomial distribution). Therefore, we will proceed by maximizing the following equation using Lagrange multipliers:

Λ(θ_1, ..., θ_k, λ) = ℓ(θ_1, ..., θ_k) + λ ⋅ (1 − ∑_{i=1}^{k} θ_i)
Now, we solve ∇_{θ_1,...,θ_k,λ} Λ(θ_1, ..., θ_k, λ) = ∇_{θ_1,...,θ_k,λ} (ln(n!) − ∑_{i=1}^{k} ln(n_i!) + ∑_{i=1}^{k} n_i ⋅ ln(θ_i) + λ ⋅ (1 − ∑_{i=1}^{k} θ_i)) = 0 and get k + 1 first-order conditions:

θ_i = n_i/λ for all i = 1, 2, ..., k, and ∑_{i=1}^{k} θ_i = 1

Since ∑_{i=1}^{k} θ_i = ∑_{i=1}^{k} n_i/λ = n/λ = 1, we have λ = n.

∴ θ̂_{i,MLE} = n_i/n for all i = 1, 2, ..., k
Note: Alternatively, the same solution is obtained for θ̂ i ,MLE if we model being the ith type of individual as a success in a
binomial distribution of n draws (where the failures are belonging to all the k − 1 remaining types).
Suppose X_1, ..., X_n form a random sample from a Uniform(0, θ) distribution. The joint density, and hence the likelihood, is:

f_n(x∣θ) = L(θ) = 1/θ^n for 0 ≤ x_i ≤ θ (i = 1, ..., n), and 0 otherwise
From the equation for L(θ) above, one can see that the MLE of θ must be a value of θ for which 0 ≤ x i ≤ θ for all i = 1, ..., n. Since L(θ)
is a monotonically decreasing function of θ, we need the smallest value of θ such that θ ≥ x i for all i = 1, ..., n in order to maximize the
log likelihood. This value is θ = max(x 1 , ..., x n ).
3 Properties of MLE: The Basics
Why do we use maximum likelihood estimation? It turns out that subject to regularity conditions, the following properties
hold for the MLE (see proofs in Section 5):
1. Consistency: As sample size (n) increases, the MLE (θ̂_MLE) converges to the true parameter, θ_0:

θ̂_MLE →_p θ_0
2. Normality: As sample size (n) increases, the MLE is normally distributed with a mean equal to the true parameter (θ 0 )
and the variance equal to the inverse of the expected sample Fisher information at the true parameter (denoted as In (θ 0 )):
θ̂_MLE ∼ N(θ_0, I_n(θ_0)^{−1}), where I_n(θ_0) = −E[∂²ℓ(θ∣x)/∂θ² ∣_{θ=θ_0}]
However, using the consistency property of the MLE and observed sample Fisher information, we can use the inverse
of the observed sample Fisher information evaluated at the MLE, denoted as Jn (θ̂ M LE ) to approximate the variance.
Note that the observed sample Fisher information, which will be defined in detail below, is the negation of the second
derivative of the log-likelihood curve.
θ̂_MLE ∼ N(θ_0, J_n(θ̂_MLE)^{−1}), where J_n(θ̂_MLE) = −∂²ℓ(θ∣x)/∂θ² ∣_{θ=θ̂_MLE}
3. Efficiency: As sample size (n) increases, MLE is the estimation procedure that generally provides the lowest variance.
4. In a finite sample, MLE gives the minimum variance unbiased estimator, or MVUE, if it exists.
Additionally, the MLE is invariant to reparameterization: if λ = g(θ), then λ̂_MLE = g(θ̂_MLE).

Proof: By definition of the MLE, θ̂_MLE ∈ Ω and ℓ(θ̂_MLE∣x) ≥ ℓ(θ∣x) for all θ ∈ Ω. Thus, setting λ̂ = g(θ̂_MLE), the induced likelihood for λ attains its maximum at λ̂, so no other value of λ can achieve a higher likelihood.
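The consistency and asymptotic normality properties are easy to see in a small simulation. The sketch below is illustrative rather than part of the original notes: it assumes an Exponential model with true rate λ_0 = 2 (whose MLE for the rate is 1/X̄) and shows the MLEs concentrating around λ_0, with a spread that shrinks at the rate implied by the inverse Fisher information as n grows.

set.seed(3)
lambda0 <- 2                                   # true rate (assumed value)

# Draw 'reps' datasets of size n from Exponential(lambda0); the MLE of the rate is 1 / x-bar
sim.mle <- function(n, reps = 5000) {
  replicate(reps, 1 / mean(rexp(n, rate = lambda0)))
}

for (n in c(10, 100, 1000)) {
  mles <- sim.mle(n)
  cat("n =", n, "; mean of MLEs =", round(mean(mles), 3),
      "; sd of MLEs =", round(sd(mles), 4), "\n")
}

# As n grows, the MLEs concentrate around lambda0 (consistency) and their spread
# shrinks roughly like sqrt(lambda0^2 / n), the asymptotic standard deviation
hist(sim.mle(1000), breaks = 50, main = "MLEs at n = 1000", xlab = expression(hat(lambda)))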
4 Fisher Information and Uncertainty of MLE
In addition to providing the MLE as a single point summary of the likelihood curve, we would like to quantify how certain we
are in our estimate. We saw in the previous section that the variance of the MLE is asymptotically given by the inverse of the
expected Fisher information. However, we usually approximate expected Fisher information with observed Fisher information.
What are these two quantities? In this section, we’ll define the observed and expected Fisher information, as well as state the
intuition behind why these are useful quantities. The more theoretical derivations and proofs follow in subsequent sections.
First, let’s develop some basic intuition regarding uncertainty of the MLE. Note that the curvature of the likelihood curve
around the MLE contains information about how certain we are as to our estimate. Before we wade into the weeds of specific
calculations, let’s understand the intuition behind this. Looking at Figure 3, we can see that intuitively, we are more certain in
our MLE if the curve has a steeper slope around the MLE than if it has a slope closer to 0. That is, we have more certainty in
an MLE if the second derivative at the MLE is a more negative number. Recall that at the MLE, the slope of the log-likelihood
function is 0, so the second derivative of the log-likelihood curve evaluated at the MLE captures how quickly the slope changes
from 0 as you move away from the MLE in either direction. This comes from the interpretation of the second derivative as the
rate of change of the slope.
Figure 3: Two log-likelihood curves with different curvature at the maximum (second derivatives of −1 and −2 at the MLE); the more sharply curved log-likelihood conveys greater certainty about the MLE.
How do we formalize this intuition? Let’s define the observed Fisher information, termed J , to be the negation of the second
derivative of the log-likelihood function. Specifically, we will evaluate it at the MLE1 :
J(θ̂_MLE) = −∂²ℓ(θ∣x)/∂θ² ∣_{θ=θ̂_MLE} = −∂² log f(x∣θ)/∂θ² ∣_{θ=θ̂_MLE}
1 Technically, we can calculate the observed Fisher information at any value of θ, but we will always talk about it as evaluated at the MLE.
Note that we negate the second derivative (which is always negative at the MLE) so that we always have a positive observed
Fisher information. Now, the intuition we developed about the steepness of the curve follows through: the steeper the curve
around the MLE, the larger the observed Fisher information.
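In practice, the second derivative need not be computed by hand: when the log-likelihood is maximized numerically, the Hessian of the negative log-likelihood at the optimum is exactly the observed Fisher information. The sketch below is illustrative only, with an assumed Poisson model and simulated counts; it uses optim() and optimHess() from base R.

set.seed(4)
x <- rpois(50, lambda = 4)                     # simulated counts (assumed example values)

# Negative log-likelihood for a Poisson rate lambda
neg.ll <- function(lambda) -sum(dpois(x, lambda = lambda, log = TRUE))

fit <- optim(par = 1, fn = neg.ll, method = "Brent", lower = 0.01, upper = 50)
obs.info <- optimHess(fit$par, neg.ll)         # Hessian of the negative log-likelihood = J(lambda-hat)

fit$par                                        # numerical MLE (equals mean(x))
obs.info
sum(x) / fit$par^2                             # analytic observed information, (sum x_i) / lambda-hat^2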
Note that in the case of multiple parameters (we have a vector of k parameters θ), the observed Fisher information is the
negation of the hessian, the matrix of second derivatives. Again, here we evaluate it at the MLE.
J(θ̂_MLE) = −∇∇^T ℓ(θ∣x) ∣_{θ̂_MLE}, where ∇∇^T ℓ(θ∣x) is the k × k Hessian matrix whose (i, j) entry is ∂²ℓ(θ∣x)/∂θ_i ∂θ_j.
What is the theoretical rationale for how the Fisher information is linked to the variance of the maximum likelihood estimate?
It turns out that we can prove, using the Central Limit Theorem, that the MLE is asymptotically normal with a mean equal to
the true parameter value and variance equal to the inverse of the expected Fisher information evaluated at the true parameter
value (see proof). To understand this, we need to define the expected Fisher information. Having defined the observed Fisher
information, we can define the expected Fisher information as the expectation of the observed Fisher information:
I(θ) = −E[∂²ℓ(θ∣x)/∂θ²]
The expected Fisher information is thus a function of θ – the parameter we are trying to estimate – that gives us the expected
information across the samples we could draw from our distribution of interest. That is, imagine drawing 1,000,000 different
samples (or even better, infinite samples!) from the distribution of interest (even though in reality we observe only one). Each
sample will have a slightly different MLE and also a slightly different observed Fisher information (since observed Fisher infor-
mation is a sample-specific quantity). The observed Fisher information we expect, on average, across all possible samples is the
expected Fisher information. Moreover, note that since the MLEs are different from sample to sample, we also have a variance
across MLEs – this is in fact the variance we are after (recall that in the frequentist inferential framework, all randomness comes
from sampling and the parameters are fixed)! The inverse of the expected Fisher information captures this variance. For an
example that will elucidate these concepts, see Figure 4.
We can also be more specific as to what kind of Fisher information we are dealing with. Specifically, we can define the sample
expected Fisher information across all the random variables in our sample and the unit expected Fisher information for just
one random variable.
The unit expected Fisher information is defined for one random variable from our distribution:
I(θ) = −E[∂²/∂θ² log f(X∣θ)]

The sample expected Fisher information is defined analogously for the joint log-likelihood of the entire sample:

I_n(θ) = −E[∂²ℓ(θ∣x)/∂θ²]
Note that for distinction, we have added an n subscript to explicitly differentiate the sample Fisher information from the unit
Fisher information.
Since our sample is a set of n iid random variables, we can relate the sample Fisher information to the unit Fisher information using the linearity of expectation and the fact that we can bring the derivative operator inside the summation:

I_n(θ) = −E[∂²ℓ(θ∣x)/∂θ²] = −E[∂²/∂θ² ∑_{i=1}^{n} log f(x_i∣θ)] = ∑_{i=1}^{n} (−E[∂²/∂θ² log f(x_i∣θ)]) = n ⋅ (−E[∂²/∂θ² log f(X∣θ)]) = n ⋅ I(θ)

∴ I_n(θ) = nI(θ)
This tells us that for a sample of iid random variables, the expected sample information is just a sum of the individual expected
informations across the n observations. This relationship becomes important in the proofs of the asymptotic distribution of the
MLE.
Finally, you may ask how we can get away with using the observed Fisher information even though we have just stated that
the MLE is asymptotically distributed normally with a variance equal to the inverse of the expected Fisher information? It in
fact turns out that the observed Fisher information is consistent for the expected Fisher information. That is, we can prove
using the law of large numbers that the observed information converges to the expected Fisher information as the sample size
increases. Moreover, we can evaluate the observed Fisher information at the MLE instead of at the true (unknown) value of
the parameter because of the consistency of the MLE (proven below). In most applications, we thus use the observed Fisher
information.
Let’s write down the log-likelihood and solve for the MLE:
n
ℓ(p) = log ∏ p x i (1 − p)1−x i
i=1
∂ℓ(p) ∑ x i n − ∑ x i
= − =0
∂p p 1− p
p̂ MLE = X̄
What is the second derivative of the log-likelihood?
∂²ℓ(p)/∂p² = −(∑ x_i)/p² − (n − ∑ x_i)/(1 − p)²
We now compute the expected Fisher information as the expected value of the negation of the second derivative:

I_n(p) = E[(∑ x_i)/p² + (n − ∑ x_i)/(1 − p)²]
Since E[X_i] = p, so that E[∑ x_i] = np:

I_n(p) = np/p² + (n − np)/(1 − p)²
Simplifying:
I_n(p) = n/p + n/(1 − p) = n/(p(1 − p))

We have found the expected sample Fisher information for X_1, ..., X_n ∼ Bern(p).
The reason that this quantity is not particularly tractable for calculation of uncertainty is that it depends on p, the unknown
parameter! Instead, we can use the observed Fisher information, evaluated at the MLE:
J(p̂_MLE) = n/(p̂_MLE(1 − p̂_MLE)) = n/(X̄(1 − X̄))
Asymptotically, we thus know that p̂_MLE has the following approximate distribution:

p̂_MLE ∼ N(p, X̄(1 − X̄)/n)

We can easily use this to calculate confidence intervals and test statistics.
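For instance, an approximate 95% (Wald) confidence interval for p follows directly from this distribution, with a standard error equal to the square root of the inverse observed Fisher information. The sketch below is illustrative only (the data are simulated under an assumed p = 0.3):

set.seed(5)
x <- rbinom(100, size = 1, prob = 0.3)         # simulated Bernoulli data (assumed p = 0.3)

p.hat    <- mean(x)                            # MLE of p
obs.info <- length(x) / (p.hat * (1 - p.hat))  # J(p-hat) = n / (x-bar * (1 - x-bar))
se.hat   <- sqrt(1 / obs.info)                 # approximate standard error

p.hat + c(-1, 1) * qnorm(0.975) * se.hat       # 95% Wald confidence interval for p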
Figure 4: Illustration of consistency properties of MLE and observed Fisher information. For this illustration, the sample size
(n) was varied and at each n, 10,000 datasets were drawn from a Bern(0.3) distribution. For each dataset, the MLE and the
observed Fisher information were calculated.
(a) Simulation of MLEs from the Bern(0.3) model by sample size (n). For each n, 10,000 MLEs are plotted with black dots. The dotted red line denotes the true parameter (p = 0.3), while the dotted blue line represents the mean MLE across the 10,000 samples at each value of n.
(b) Simulation of observed Fisher information from the Bern(0.3) model by sample size (n). For each n, 10,000 observed Fisher informations are plotted with black dots. The dotted red line denotes the expected Fisher information, derived in the example above. The gold line represents the simulated variance across the 10,000 MLEs at each sample size. Finally, the solid blue line represents the mean of the observed Fisher informations at each value of n.
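A sketch along the lines of the simulation described in this caption is given below. It is an approximation of the setup, not the original code: the particular sample sizes are assumed, and degenerate samples (all 0s or all 1s), for which the observed information is infinite, are simply dropped from the averages.

set.seed(6)
p0 <- 0.3
n.values <- c(10, 25, 50, 100, 250)
reps <- 10000

sim <- lapply(n.values, function(n) {
  mle  <- replicate(reps, mean(rbinom(n, size = 1, prob = p0)))  # MLE of p in each dataset
  jinf <- n / (mle * (1 - mle))                                  # observed Fisher information at the MLE
  list(mle = mle, jinf = jinf)
})

# Mean MLE approaches p0 (consistency); the mean observed information tracks
# the expected information n / (p0 * (1 - p0))
data.frame(n            = n.values,
           mean.mle     = sapply(sim, function(s) mean(s$mle)),
           mean.obs.inf = sapply(sim, function(s) mean(s$jinf[is.finite(s$jinf)])),
           expected.inf = n.values / (p0 * (1 - p0)))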
4.1 The Theory of Fisher Information
4.1.1 Derivation of Unit Fisher Information
Let’s begin by deriving the expected Fisher information for one random variable, X. To do this derivation, we need to impose
some regularity conditions on the distribution of the random variable X:
• Assume that f (x∣θ) > 0 for each value x ∈ S and each value θ ∈ Ω
Recall that by definition of a pdf, the integral of a continuous density across the sample space S is 1:
∫_S f(x∣θ) dx = 1
Let’s assume that we are able to distribute a derivative operator within the integration operator, such that:
∂ ∂
∫ f (x∣θ)∂x = ∫ f (x∣θ)∂x = ∫ f ′ (x∣θ)∂x
∂x S S ∂x S
and
∂2 ∂2
∫ f (x∣θ)∂x = ∫ f (x∣θ)∂x = ∫ f ′′ (x∣θ)∂x
∂x 2 S S ∂x 2 S
Recall that the score of the log-likelihood is defined to be the first derivative of the log-likelihood:
S = ∂/∂θ log L(θ∣x) = ∂/∂θ log f(x∣θ)
We also know the following about the score from the properties of derivatives, specifically the chain rule:
ℓ′(θ∣x) = ∂/∂θ log f(x∣θ) = (1/f(x∣θ)) ⋅ f′(x∣θ)

ℓ′(θ∣x) = f′(x∣θ)/f(x∣θ)
We can find the expected value of the score using the definition of expectation:
E_θ[∂ℓ(θ∣x)/∂θ] = ∫_S ℓ′(θ∣x) f(x∣θ) dx = ∫_S (f′(x∣θ)/f(x∣θ)) f(x∣θ) dx = ∫_S f′(x∣θ) dx
Using our ability to exchange the order of integration and differentiation:

∫_S f′(x∣θ) dx = ∂/∂θ ∫_S f(x∣θ) dx = ∂/∂θ (1) = 0
∴ Eθ [ℓ′ (θ∣x)] = 0
We have just shown that in expectation (across samples), the score will be 0.
Suppose that we define the expected unit Fisher information for random variable X as the expectation of the squared score
(you’ll see shortly how it relates to our previous expression for the expected information).
I(θ) = E_θ[(∂ℓ(θ∣x)/∂θ)²] = E_θ[(ℓ′(θ∣x))²]
However, using the fact that Eθ [ℓ′ (θ∣x)] = 0 and the definition of variance (Var(Y) = E[Y 2 ]−(E[Y])2 for any random variable
Y), the Fisher information can also be written as:
I(θ) = E_θ[(ℓ′(θ∣x))²] = E_θ[(ℓ′(θ∣x))²] − (E_θ[ℓ′(θ∣x)])² = Var[ℓ′(θ∣x)], using the fact that E_θ[ℓ′(θ∣x)] = 0
We have just shown that the expected unit Fisher information is equivalent to the variance of the score: I(θ) = Var[ℓ′(θ∣x)]. Since the first moment of the score (the expected value) is zero, the Fisher information is also the second moment of the score.
We can also derive the (more familiar) expression for the expected information in terms of the second derivative of the log-likelihood. First, let us write the second derivative of the log-likelihood (using the quotient rule) as:

ℓ″(θ∣x) = ∂/∂θ [f′(x∣θ)/f(x∣θ)] = f″(x∣θ)/f(x∣θ) − (f′(x∣θ)/f(x∣θ))² = f″(x∣θ)/f(x∣θ) − (ℓ′(θ∣x))²

The final step above follows from the fact that ℓ′(θ∣x) = f′(x∣θ)/f(x∣θ). Taking expectations of both sides:

E[ℓ″(θ∣x)] = E[f″(x∣θ)/f(x∣θ)] − E[(ℓ′(θ∣x))²]

where E[f″(x∣θ)/f(x∣θ)] = ∫_S f″(x∣θ) dx = 0 and E[(ℓ′(θ∣x))²] = I(θ).
The result is a more familiar version of the expected unit Fisher information:
I(θ) = −E[ℓ′′ (θ∣x)]
Similarly, the score and second derivative of the log-likelihood for the sample can be expressed in terms of sums of unit scores
and second-derivatives:
S = ∂/∂θ log L(θ∣x) = ℓ′(θ∣x) = ∑_{i=1}^{n} ℓ′(θ∣x_i)

∂²/∂θ² log L(θ∣x) = ℓ″(θ∣x) = ∑_{i=1}^{n} ℓ″(θ∣x_i)
Using these expressions and the linearity of expectation, the expected value of the negative of the second derivative of the log-
likelihood function is:
E[−ℓ″(θ∣x)] = ∑_{i=1}^{n} E[−ℓ″(θ∣x_i)]

The left-hand side is the expected sample Fisher information, I_n(θ), and each term in the sum is the unit Fisher information, I(θ). We have just proved that the following relation holds true for an iid sample of n random variables:

I_n(θ) = n ⋅ I(θ)
Now consider the observed sample Fisher information:

J_n(θ) = −∂²ℓ(θ∣x)/∂θ² = −∂²/∂θ² log f(x∣θ)
We can rewrite the observed Fisher information as the sum of second derivatives:
J_n(θ) = −∑_{i=1}^{n} ∂²/∂θ² log f(x_i∣θ)
X_1, X_2, ..., X_n are iid, hence the second derivatives on the right-hand side of the expression above are also iid. Thus, by the law of large numbers, their average converges to the (negated) expectation of a single term:

(1/n) J_n(θ) →_p −E[∂²/∂θ² log f(x_i∣θ)] = I(θ)
So by the consistency we have just shown, we can use J_n(θ) in place of I_n(θ). However, we still do not know θ_0, the true value of the parameter (remember that the variance of the MLE is asymptotically I_n(θ_0)^{−1}). But it turns out that θ̂_MLE is a consistent estimator for θ_0, and as a result we use J_n(θ̂_MLE)^{−1} as an estimator for the variance of the MLE.
In the case of multiple parameters (a vector θ of length k), these definitions generalize as follows. The score is the gradient of the log-likelihood:

S(θ) = ∇ℓ(θ∣x) = (∂ℓ(θ∣x)/∂θ_1, ∂ℓ(θ∣x)/∂θ_2, ..., ∂ℓ(θ∣x)/∂θ_k)^T
I(θ) = −E[∇²ℓ(θ∣x)], where ∇²ℓ(θ∣x) is the k × k Hessian matrix whose (i, j) entry is ∂²ℓ(θ∣x)/∂θ_i ∂θ_j.
We can show, analogously to the case with a single parameter, that the expected Fisher information is equal to the variance of the score:
I(θ) = −E[∇2 ℓ(θ∣x)] = Var[∇ℓ(θ∣x)] = Var[S(θ)]
5 Proofs of Asymptotic Properties of MLE
In this section, we want to prove the asymptotic properties of the MLE: consistency, normality, and efficiency.
We first introduce some notation. Let X n = X 1 , ..., X n be our random sample, where the X i ’s are iid. Let θ̂ M LE be the maximum
likelihood estimator for the parameter θ. θ 0 is the true underlying value of the parameter. Note that θ̂ M LE , θ 0 , θ ∈ Ω (the
parameter space).
Furthermore, to prove the asymptotic properties of the maximum likelihood estimator, we will introduce several regularity
conditions:
• The parameter space must be a bounded and closed set. That is, Ω must be a compact subset.
• The true value of the parameter, θ 0 , must be an interior point of the parameter set: θ 0 ∈ int(Ω). Phrased differently, θ 0
cannot be on the boundary of the set.
• Integration and differentiation are interchangeable (as we assumed above for the derivation of expected Fisher information).
Consistency. We first show that the MLE converges in probability to the true value of the parameter:

θ̂_MLE →_p θ_0
Since the observations in our sample are iid, we can write the log-likelihood as the sum of log-likelihoods for each observation x i :
ℓ(θ∣x) = ∑_{i=1}^{n} ℓ(θ∣x_i)
Let’s divide by n, which we can do since it doesn’t affect the maximization of the log-likelihood. Now, we have an expression
that looks like the average of log-likelihoods across all the Xs. We can then show by the strong law of large numbers that that
converges to the expected value of a log-likelihood of a single X:
1 n a.s.
∑ ℓ(θ∣x i ) → Eθ 0 ℓ(θ∣x) = Eθ 0 log f (x∣θ)
n i=1
In this expression, Eθ 0 represents the expectation of the density with respect to the true unknown parameter and thus we define
a new function L(θ), which is the expected log-likelihood function:
L(θ) = E_{θ_0} log f(x∣θ) = ∫_{−∞}^{∞} log f(x∣θ) ⋅ f(x∣θ_0) dx
As a result, the normalized log-likelihood converges to the expected log-likelihood function L(θ) for any value of θ. This ex-
pression depends solely on θ and not on x since we integrate it out.
Now, let’s look at the divergence between L(θ) and L(θ 0 ) (the expected log-likelihood function evaluated at an arbitrary
parameter θ and the true parameter θ 0 ):
L(θ) − L(θ_0) = E_{θ_0}[log f(x∣θ) − log f(x∣θ_0)] = E_{θ_0}[log(f(x∣θ)/f(x∣θ_0))]
By Jensen’s inequality:
E_{θ_0}[log(f(x∣θ)/f(x∣θ_0))] ≤ log E_{θ_0}[f(x∣θ)/f(x∣θ_0)] = log ∫_{−∞}^{∞} (f(x∣θ)/f(x∣θ_0)) ⋅ f(x∣θ_0) dx = log ∫_{−∞}^{∞} f(x∣θ) dx = log(1) = 0, since the last integral equals 1 by the definition of a pdf.
Thus:
L(θ) − L(θ 0 ) ≤ 0
L(θ) ≤ L(θ 0 )
This inequality suggests that the expected log-likelihood when assuming that an arbitrary parameter θ is governing the data
generation process is no greater than the expected log-likelihood when you correctly identify the true parameter θ 0 governing
the data generation process. Note that this is closely related to the concept of the Kullback-Leibler divergence. In fact, we know
from Gibbs’ inequality that the Kullback-Leibler divergence between f (x∣θ 0 ) and f (x∣θ) must be non-negative:
D_KL(f(x∣θ_0) ∣∣ f(x∣θ)) = E_{θ_0}[log(f(x∣θ_0)/f(x∣θ))] ≥ 0
In more practical terms, this inequality suggests that no distribution describes the data as well as the true distribution that gen-
erated it. Therefore, on average, the greatest log-likelihood will be the one that is a function of the true parameter θ 0 . Phrased
differently, θ 0 is the maximizer of the expected log-likelihood, L(θ).
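To make this concrete, the short sketch below (an illustration under an assumed Bernoulli truth of θ_0 = 0.3) plots the expected log-likelihood L(θ) = θ_0 log θ + (1 − θ_0) log(1 − θ) and confirms that its maximizer is θ_0.

theta0 <- 0.3                                  # assumed true parameter of a Bernoulli model

# Expected log-likelihood L(theta) = theta0 * log(theta) + (1 - theta0) * log(1 - theta)
exp.ll <- function(theta) theta0 * log(theta) + (1 - theta0) * log(1 - theta)

curve(exp.ll, from = 0.01, to = 0.99, lwd = 2,
      xlab = expression(theta), ylab = "Expected log-likelihood")
abline(v = theta0, lty = 2)                    # the maximizer is the true value theta0
optimize(exp.ll, interval = c(0.01, 0.99), maximum = TRUE)$maximum  # about 0.3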
Now, let’s put the different pieces together. Recall that by the strong law of large numbers:
(1/n) ∑_{i=1}^{n} ℓ(θ∣x_i) →_a.s. E_{θ_0} log f(x∣θ)
For a finite parameter space Ω, the following holds for the MLE (convergence is uniform from the uniform strong law of large
numbers):
θ̂_MLE = arg max_{θ∈Ω} (1/n) ∑_{i=1}^{n} ℓ(θ∣x_i) →_a.s. arg max_{θ∈Ω} E_{θ_0} log f(x∣θ) = θ_0

∴ θ̂_MLE →_a.s. θ_0
In sum, by the strong law of large numbers, the MLE is actually maximizing the expected log-likelihood as n increases asymp-
totically. But we showed that the expected log-likelihood is maximized at the true value of the parameter. Therefore, as n → ∞,
the normalized log-likelihood of the data should approach the expected value of the log-likelihood of the random variable X.
Another way of stating the consistency of the MLE is that it minimizes the Kullback-Leibler divergence between an arbitrary
log-likelihood function and the true log-likelihood. If θ̂ M LE = θ 0 , the Kullback-Leibler divergence goes to 0 (by LLN).
Note also that the consistency of the MLE can also be proven for an infinite parameter space Ω as well as for non-compact
parameter spaces.
Asymptotic normality. We now show that, appropriately scaled, the MLE is asymptotically normal:

√n(θ̂_MLE − θ_0) →_d N(0, I_1(θ_0)^{−1})

where I_1(θ_0) denotes the unit expected Fisher information evaluated at the true parameter.
Ultimately, we want to rely on the fact that the score (first derivative of the log-likelihood) is a sum of iid terms, and as a result
we know that its asymptotic distribution can be approximated using the law of large numbers and the central limit theorem.
We can start the proof by noting that since the MLE maximizes the log-likelihood, the score is equal to zero at the MLE:

ℓ′(θ̂_MLE∣x) = 0

When the log-likelihood is twice differentiable, we can expand the score around the true parameter value θ_0 using a Taylor series approximation of the 1st order:

ℓ′(θ∣x) = ℓ′(θ_0∣x) + (θ − θ_0) ℓ″(θ̃∣x)

Now we can plug the MLE in for θ and, since we know that the score evaluated at the MLE is 0, the expression simplifies to:

0 = ℓ′(θ_0∣x) + (θ̂_MLE − θ_0) ℓ″(θ̃∣x)
Note that in this case, θ̃ is a point located between the MLE (θ̂_MLE) and the true parameter (θ_0). One could also obtain this equation by utilizing the mean value theorem, which states that for a function f(x) continuous on [a, b] and differentiable on (a, b), there is a point c in (a, b) such that:
f′(c) = (f(b) − f(a))/(b − a)
Applying the mean value theorem to the function ℓ′ (θ) and letting a = θ̂ M LE and b = θ 0 :
ℓ″(θ̃) = (ℓ′(θ̂_MLE) − ℓ′(θ_0))/(θ̂_MLE − θ_0), where θ̃ ∈ [θ̂_MLE, θ_0]

Since ℓ′(θ̂_MLE) = 0, rearranging gives:

θ̂_MLE − θ_0 = −ℓ′(θ_0)/ℓ″(θ̃)
Multiplying both sides by √n:

√n(θ̂_MLE − θ_0) = −√n ℓ′(θ_0)/ℓ″(θ̃)
Let’s consider the asymptotic distribution of the numerator and denominator in turn, starting with the asymptotic distribution
of the numerator. We can first express the numerator as a sum of iid scores (first derivatives of the log-likelihood of iid random
variables):
ℓ′(θ_0) = ∑_{i=1}^{n} ℓ′(θ_0∣x_i)
Using the Central Limit Theorem, we will be able to make a statement regarding the asymptotic distribution of the average of
the score, n1 ℓ′ (θ 0 ). First, though, note that in the section on Fisher information above, we proved that the first moment of the
score is zero and the second moment of the score is the expected sample Fisher information:
E_θ[ℓ′(θ∣x)] = 0 and E_θ[(ℓ′(θ∣x))²] = Var[ℓ′(θ∣x)] = I_n(θ)
For a single observation, the first moment is 0 and the second moment is I1 (θ). Invoking the Central Limit Theorem, we know
that the mean of the score is distributed as a normal with a mean of 0 and a variance of I1 (θ)/n.
(1/n) ℓ′(θ) →_d N(0, I_1(θ)/n)
This implies that ℓ′ (θ) is distributed as a normal with a mean of 0 and a variance of In (θ):
ℓ′(θ) →_d N(0, I_n(θ))
Multiplying this by √n to get the expression in the numerator:

√n ℓ′(θ) →_d N(0, n ⋅ I_n(θ))
Now, we can find the asymptotic behavior of the denominator – ℓ′′ (θ̃∣x) – in the Taylor series expansion of the score. We
can use the law of large numbers to derive the following convergence in probability for any θ:
(1/n) ℓ″(θ̃∣x) = (1/n) ∑_{i=1}^{n} ℓ″(θ̃∣x_i) →_p E[ℓ″(θ̃∣x_i)] = −I_1(θ̃)
Moreover, since we know from the consistency of the MLE that θ̂_MLE →_a.s. θ_0 and θ̃ ∈ [θ̂_MLE, θ_0], it follows that θ̃ →_a.s. θ_0. Therefore:

ℓ″(θ̃) →_a.s. −I_n(θ_0)

Combining the numerator and the denominator using Slutsky's theorem:

√n(θ̂_MLE − θ_0) = −√n ℓ′(θ_0)/ℓ″(θ̃) →_d N(0, n ⋅ I_n(θ_0)/I_n(θ_0)²) = N(0, I_1(θ_0)^{−1})

which establishes the asymptotic normality of the MLE.
Efficiency. Since, asymptotically, Var(θ̂_MLE) = I(θ)^{−1} = −1/E[∂²ℓ(θ∣x)/∂θ²], the efficiency of the MLE is:

e(θ̂_MLE) = I(θ)^{−1}/Var(θ̂_MLE) = 1
Figure 5: Illustration of Central Limit Theorem for the MLE and the score evaluated at the true parameter. For this illustration,
we simulated 1000 datasets of sample size n ∈ {10, 25, 100} from the Bern(0.3) distribution. For each dataset, we plotted the
score function (as a function of θ) and the MLE. We also evaluated the score function at the true value of the parameter, θ 0 = 0.3.
(a) The top plot portrays the 1000 score functions for simulated data with sample size n = 10. The MLEs are indicated with orange dots. The lower plot depicts the density function of the MLEs across the 1000 datasets. Note that the density of the MLEs does not yet look particularly normal.
(b) A density plot of the score function evaluated at the true parameter, θ_0, across 1000 simulated datasets of sample size n = 10 (in black). For comparison in red, we have plotted the theoretical asymptotic density of the score, N(0, I_n(θ_0)), derived using the CLT. Clearly the simulated density has not converged to the normal density.
(c) The top plot portrays the 1000 score functions for simulated data with sample size n = 25. The MLEs are indicated with orange dots. The lower plot depicts the density function of the MLEs across the 1000 datasets.
(d) A density plot of the score function evaluated at the true parameter, θ_0, across 1000 simulated datasets of sample size n = 25 (in black). For comparison in red, we have plotted the theoretical asymptotic density of the score, N(0, I_n(θ_0)), derived using the CLT. The simulated density is starting to converge to the expected normal density.
(e) The top plot portrays the 1000 score functions for simulated data with sample size n = 100. The MLEs are indicated with orange dots. The lower plot depicts the density function of the MLEs across the 1000 datasets. The MLE has essentially converged to the expected distribution.
(f) A density plot of the score function evaluated at the true parameter, θ_0, across 1000 simulated datasets of sample size n = 100 (in black). For comparison in red, we have plotted the theoretical asymptotic density of the score, N(0, I_n(θ_0)). As predicted by the CLT, the distribution of the scores has essentially converged to the expected density.
6 References
Casella, George, and Roger L. Berger. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury Press, 2002.
DeGroot, Morris H., and Mark J. Schervish. Probability and Statistics. 3rd ed. Boston, MA: Addison-Wesley, 2002.
Newey, Whitney K., and Daniel McFadden. "Large Sample Estimation and Hypothesis Testing." In Robert F. Engle and Daniel L. McFadden, editors, Handbook of Econometrics, vol. 4, 1994.
Convergence of Random Variables
What does it mean to say that a sequence converges? There are several notions of convergence for random variables. The two
main ones are convergence in probability and convergence in distribution.
Convergence in Probability
Suppose we have a sequence of random variables denoted by {X_n} = X_1, X_2, ..., X_n. The sequence converges in probability to X if the probability distribution of the sequence {X_n} is increasingly concentrated around X, that is, if for every є > 0:

lim_{n→∞} P(∣X_n − X∣ > є) = 0

Convergence in probability is denoted as X_n →_p X.
Note that convergence in probability implies convergence in distribution, but convergence in distribution implies convergence
in probability only when the limiting variable X is a constant.
We can extend convergence in probability to the multivariate case. If X_n →_p X and Y_n →_p Y, then (X_n, Y_n) →_p (X, Y).
Note that almost sure convergence implies convergence in probability, but not vice versa.
Convergence in Distribution
Suppose we have a sequence of random variables denoted by {X_n} = X_1, X_2, ..., X_n. Let F_n denote the CDF of random variable X_n and F* denote the CDF of random variable X*.

Intuitively, convergence in distribution means that if n is sufficiently large, the probability for X_n to be in a given range is approximately equal to the probability that X* is in the same range. Formally, X_n converges in distribution to X* if lim_{n→∞} F_n(x) = F*(x) at every point x at which F* is continuous.
Convergence in distribution is denoted as X_n →_d X*. X* is the asymptotic distribution of X_n.
We can extend convergence in distribution to the multivariate case. If X_n →_d X and Y_n →_d c (where c is a constant), then (X_n, Y_n) →_d (X, c).
Let X_1, X_2, ..., X_n be a sequence of random variables. Let F_n be the CDF of X_n and let ξ_n be the characteristic function of X_n, given by ξ_n(t) = E(e^{itX_n}) for all t ∈ ℝ. Let F* denote the CDF and ξ* denote the characteristic function of random variable X*. By the Lévy continuity theorem, X_n converges in distribution to X* if and only if ξ_n(t) → ξ*(t) for every t ∈ ℝ.
Let X_1, X_2, ..., X_n be a sequence of random variables. Let F_n be the CDF of X_n and let ψ_n be the moment generating function (MGF) of X_n, given by ψ_n(t) = E(e^{tX_n}) for all t ∈ ℝ. As before, let F* denote the CDF and ψ* denote the MGF of random variable X*. Assume that both MGFs exist. If ψ_n(t) → ψ*(t) for all t in an open interval containing zero, then X_n converges in distribution to X*.
Law of Large Numbers

Weak law of large numbers: let X_1, X_2, ..., X_n be iid random variables with mean µ and finite variance σ². Then the sample mean converges in probability to µ:

X̄_n →_p µ
Recall that Chebyshev’s inequality indicates that for any random variable X:
σ2
P(∣X − µ∣ > є) ≤ 2
є
First, find the expected value and the variance of the mean of the sequence:

E[X̄_n] = µ and Var[X̄_n] = σ²/n

Applying Chebyshev's inequality to X̄_n:

P(∣X̄_n − µ∣ > є) ≤ σ²/(nє²)
Note that as n → ∞, the right-hand side goes to 0, so P(∣X̄_n − µ∣ > є) → 0, or equivalently P(∣X̄_n − µ∣ < є) → 1, for every є > 0.

∴ X̄_n →_p µ
The strong law of large numbers strengthens this conclusion: under the same iid assumptions, the sample mean converges almost surely:

X̄_n →_a.s. µ
Moreover, we can generalize the strong law of large numbers to any function of x. Let X_1, X_2, ..., X_n be iid random variables, and assume that f(x, θ) is a continuous function of x defined for all θ ∈ Ω. The strong law of large numbers states:
(1/n) ∑_{i=1}^{n} f(X_i, θ) →_a.s. E[f(X, θ)]
The conditions that need to hold for the uniform strong law of large numbers to apply to a random variable X and a func-
tion f (x, θ) are:
• f(x, θ) should be semi-continuous in θ ∈ Ω for all x
• There must be a function K(x) with E[K(X)] < ∞ and ∣f(x, θ)∣ ≤ K(x) for all x and θ
Central Limit Theorem

Stated rigorously: let X_1, ..., X_n form a random sample of size n from a distribution with mean µ and variance σ². Then for each fixed number x:

lim_{n→∞} P[√n(X̄_n − µ)/σ ≤ x] = Φ(x)
In a sense, the CLT governs the shape of the convergence of X̄_n to µ, which we proved using the law of large numbers. The CLT indicates that the limit as n goes to infinity of n^{1/2}(X̄_n − µ) is non-degenerate (that is, the exponent 1/2 on n is exactly the right scaling) and that the limiting distribution is normal.
We begin with the random variable Z_n = √n(X̄_n − µ)/σ and show that it converges in distribution to the standard normal. Define the standardized variables Y_i = (X_i − µ)/σ, so that Z_n = (1/√n) ∑_{i=1}^{n} Y_i.

Let ψ(t) denote the MGF of the random variable Y_i. Since the MGF of a sum of independent random variables is the product of their MGFs, the MGF of ∑_{i=1}^{n} Y_i is (ψ(t))^n. Scaling the sum by 1/√n replaces t with t/√n, so the MGF of the standardized sum Z_n is:

ψ_n(t) = (ψ(t/√n))^n
We can express the MGF of the standardized sum of random variables using a Taylor series expansion around the point t = 0.
This enables us to incorporate the information that the first moment of the standardized RV is 0 and the second moment is 1:
First moment: E(Y_i) = ψ′(0) = 0. Second moment: E(Y_i²) = ψ″(0) = 1.
The Taylor series expansion of ψ(t) around t = 0 (hence a Maclaurin series) is:

ψ(t) = ψ(0) + tψ′(0) + (t²/2!)ψ″(0) + (t³/3!)ψ‴(0) + ... = ψ(0) + tψ′(0) + (t²/2!)ψ″(t*) = 1 + (t²/2)ψ″(t*) for some 0 < t* < t
Note that instead of writing an infinite series, I wrote the first two terms of the series and then used a Lagrange remainder.
ψ_n(t) = [1 + (t²/(2n))ψ″(t*)]^n for some 0 < t* < t/√n
This now looks like a case of a notable limit from calculus, given in its general form by:
lim_{x→∞} [1 + k/x]^x = e^k

As n → ∞, t* → 0 and hence ψ″(t*) → ψ″(0) = 1, so:

∴ lim_{n→∞} ψ_n(t) = lim_{n→∞} [1 + (t²/2)/n]^n = e^{t²/2} = ψ*(t)
We have just proven that the MGF of the standardized sum of iid random variables converges to the MGF of the standard normal distribution. Therefore, the standardized sum converges in distribution to the standard normal distribution:

∴ Z_n →_d Z ∼ N(0, 1)
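A small simulation makes the statement concrete. The sketch below is illustrative (an Exponential(1) parent distribution and particular sample sizes are assumed): it shows the standardized sample mean Z_n settling into the standard normal shape as n grows.

set.seed(7)
mu <- 1; sigma <- 1                            # mean and sd of an Exponential(1) parent (assumed)

# Standardized sample mean Z_n for 'reps' simulated datasets of size n
zn <- function(n, reps = 10000) {
  replicate(reps, sqrt(n) * (mean(rexp(n, rate = 1)) - mu) / sigma)
}

par(mfrow = c(1, 3))
for (n in c(2, 10, 100)) {
  hist(zn(n), breaks = 50, freq = FALSE, main = paste("n =", n), xlab = expression(Z[n]))
  curve(dnorm(x), add = TRUE, lwd = 2)         # standard normal reference density
}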
Continuous Mapping Theorem

More technically: define X to be a random variable on a metric space S, and let X_n denote a sequence of random variables on S. The continuous function g(⋅) then maps S → S′.
The continuous mapping theorem indicates that the following hold (assuming g(⋅) is continuous at X):
• X_n →_d X ⇒ g(X_n) →_d g(X)
• X_n →_p X ⇒ g(X_n) →_p g(X)
• X_n →_a.s. X ⇒ g(X_n) →_a.s. g(X)
Furthermore, one can show that if X_n →_p a and Y_n →_p b, then g(X_n, Y_n) →_p g(a, b) (assuming that g(⋅) is continuous at (a, b)).
Slutsky’s Theorem
Suppose that we have two sequences, {X_n} and {Y_n}. Furthermore, suppose that X_n →_d X and Y_n →_p c, where c is a constant.
Slutsky’s theorem indicates that the following relationships hold between convergent sequences:
• X_n + Y_n →_d X + c
• Y_n ⋅ X_n →_d cX
• X_n/Y_n →_d X/c (for c ≠ 0)
Proof: The proof of Slutsky’s theorem is quite simple - in effect, the theorem is just a particular application of the continuous
d p d
mapping theorem. We know that since X n → X and Yn → c, the joint vector (X n , Yn ) → (X, c). Now, using the multivariate
version of the continuous mapping theorem, we respectively let g(x, y) = x + y, g(x, y) = x y, and g(x, y) = xy .