Section 5
1. Definition
The idea of Maximum Likelihood Estimation (MLE) is due to Sir Ronald Fisher. Let 𝜽 denote the parameter(s) of a distribution; in the case of the normal distribution, 𝜽 = {𝜇, 𝜎}. When we say a random variable 𝑋 has a pdf 𝑓(𝑥|𝜽), this is a function of 𝑥, and the values of 𝜽 are fixed or given. Now if we look at 𝑓(𝑥|𝜽) as a function of 𝜽 with the observed value 𝑥 given, then, in this perspective, we have a likelihood function, and we shall use the notation 𝑙(𝜽|𝑥) to emphasize this perspective (a function of 𝜽 with 𝑥 fixed). For example, suppose 𝑋 has a normal distribution with 𝜇 = 0.80 and 𝜎² = 0.0016; then we can simulate many values {𝑥1, 𝑥2, …, 𝑥𝑛} of 𝑋 from this distribution. Below are three samples with 2500 observations each; the sample means¹ and sample standard deviations are different, but they are from the same distribution:
[Figure: three histograms, one for each simulated sample of 2500 observations, with the x-axis running from 0.6 to 1 and the fitted normal pdf 𝑓(𝑥|𝜽) drawn as the red curve.]

¹ Note that in the figure, the y-axis is the number of observations for the histogram.
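For reference, the three samples behind the figure can be generated along the following lines. This is a minimal sketch, not taken from the notes; it assumes NumPy is available and the seed is arbitrary.

```python
# Simulate three samples of 2500 observations from N(0.80, 0.0016)
# and report their sample means and sample standard deviations.
import numpy as np

rng = np.random.default_rng(seed=1)   # arbitrary seed, for reproducibility
mu, sigma = 0.80, np.sqrt(0.0016)     # sigma = 0.04

for k in range(3):
    x = rng.normal(mu, sigma, size=2500)
    # The summary statistics differ slightly across samples,
    # even though all three samples come from the same distribution.
    print(f"sample {k+1}: mean = {x.mean():.4f}, std = {x.std(ddof=1):.4f}")
```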
In this example, the parameters are given by 𝜇 = 0.80 and 𝜎² = 0.0016. The pdf 𝑓(𝑥|𝜽) is the red curve, and it is a function of 𝑥, which is what we have learned in Section 1 and Section 2. Recall that on p.9 of Section 2 we have 𝑓(𝑥1) = 𝑥1 + 1/2 and 𝑓(𝑥2) = 1/2 + 𝑥2, and on p.13 we have 𝑓(𝑥1) = 2(1 − 𝑥1) and 𝑓(𝑥2) = 2𝑥2; these are all pdfs viewed as functions of 𝑥. The likelihood function 𝑙(𝜽|𝑥), in contrast, is a function of 𝜽 with the observed value 𝑥 fixed. We can understand this concept as follows: for the data we have at hand, like TSMC returns over the past 10 years, we can make an assumption about the distribution of the data; for example, we may assume TSMC returns are normally distributed. Then, for the given data, we can find the set of parameters that leads to the highest likelihood of seeing the data at hand. This approach is sensible since the data {𝑥1, 𝑥2, …, 𝑥𝑛}, of course, are already realized, and therefore the parameter values which maximize the likelihood function are good estimates of the true parameters. Such estimators are called Maximum Likelihood Estimators, denoted by $\hat{\boldsymbol{\theta}}_{MLE}$.
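To make the likelihood perspective concrete, here is a minimal sketch (not from the notes) that fixes a simulated data set and evaluates the normal log-likelihood over a grid of candidate values of 𝜇, with 𝜎 held at 0.04. NumPy and SciPy are assumed to be available, and the grid range is arbitrary.

```python
# The data x are fixed; the (log-)likelihood is evaluated as a function of mu.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.80, 0.04, size=2500)          # "observed" data, now held fixed

mus = np.linspace(0.70, 0.90, 201)             # candidate values of mu (sigma fixed at 0.04)
loglik = [stats.norm.logpdf(x, mu, 0.04).sum() for mu in mus]

mu_hat = mus[np.argmax(loglik)]                # grid value with the highest likelihood
print(f"grid-search MLE of mu: {mu_hat:.4f}, sample mean: {x.mean():.4f}")
```

The grid value with the highest likelihood essentially coincides with the sample mean, which previews the result derived below for the normal distribution.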
MLE is thus a parametric way of making inference about the population using a sample, because we will need to assume a distribution. Other ways of making inference, without making an assumption on the distribution of the data, are called non-parametric methods.
Bernoulli Distribution
Let 𝑋1, 𝑋2, …, 𝑋𝑛 be a random sample which can be assumed to have a Bernoulli distribution with parameter 𝑝. A random sample means that 𝑋1, 𝑋2, …, 𝑋𝑛 are independent. Thus, the probability that we see the observed data {𝑥1, 𝑥2, …, 𝑥𝑛} (again, like the small 𝑥 in Section 1 and Section 2, here {𝑥1, 𝑥2, …, 𝑥𝑛} is a set of observed values) is:

$$p^{\sum x_i}(1-p)^{\,n-\sum x_i} \tag{3}$$

Now when we view (3) as a function of the parameter 𝑝, given the set of data {𝑥1, 𝑥2, …, 𝑥𝑛}, we have the likelihood function:

$$l(p) = p^{\sum x_i}(1-p)^{\,n-\sum x_i}, \qquad \text{for } 0 \le p \le 1 \tag{4}$$
The maximum likelihood estimate of 𝑝 is the value that maximizes (4). However, this maximization is usually done with the log of the likelihood function, because after taking the log the product becomes a sum, which is much easier to differentiate:

$$L(p) = \left(\sum_{i=1}^{n} x_i\right)\ln p + \left(n-\sum_{i=1}^{n} x_i\right)\ln(1-p) \tag{5}$$

In addition, taking the log does not affect the optimization task². Thus, to find the maximum of (5), we take the derivative with respect to 𝑝 and set it equal to zero:

$$\frac{\partial L(p)}{\partial p} = \frac{\sum_{i=1}^{n} x_i}{p} - \frac{n-\sum_{i=1}^{n} x_i}{1-p} = 0 \tag{6}$$
Thus,

$$\frac{\sum_{i=1}^{n} x_i}{p} = \frac{n-\sum_{i=1}^{n} x_i}{1-p}$$

$$\Longrightarrow (1-p)\sum_{i=1}^{n} x_i = p\left(n-\sum_{i=1}^{n} x_i\right)$$

$$\Longrightarrow \sum_{i=1}^{n} x_i = pn$$

$$\Longrightarrow p = \frac{\sum_{i=1}^{n} x_i}{n} \tag{7}$$
² This is because log is a monotone function.
As a result, the maximum likelihood estimator $\hat{p}_{MLE}$ for a random sample which can be assumed to have a Bernoulli distribution is:

$$\hat{p}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X} \tag{8}$$
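As a quick check of (8), the following minimal sketch (not from the notes) simulates Bernoulli data with an arbitrarily chosen 𝑝 = 0.3 and compares a direct numerical maximization of the log-likelihood (5) with the sample mean; NumPy and SciPy are assumed.

```python
# Numerical maximization of the Bernoulli log-likelihood vs. the closed-form MLE.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)            # Bernoulli(p = 0.3) sample, p chosen for illustration

def neg_log_lik(p):
    # Negative of L(p) in (5); minimized because scipy minimizes by default.
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"numerical MLE: {res.x:.4f}, sample mean: {x.mean():.4f}")   # the two agree
```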
Poisson Distribution
Let 𝑋1, 𝑋2, …, 𝑋𝑛 be a random sample which can be assumed to have a Poisson distribution with pmf:

$$p(x) = \frac{\lambda^{x} e^{-\lambda}}{x!} \tag{9}$$

Thus, the probability that we see the observed data 𝑥1, 𝑥2, …, 𝑥𝑛 is:

$$\prod_{i=1}^{n}\frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{1}{x_1!\,x_2!\cdots x_n!}\,\lambda^{\sum x_i}\, e^{-n\lambda} \tag{10}$$
The log-likelihood function is then:

$$L(\lambda) = -\ln(x_1!\,x_2!\cdots x_n!) + \left(\sum_{i=1}^{n} x_i\right)\ln\lambda - n\lambda \tag{11}$$

Taking the derivative with respect to 𝜆 and setting it equal to zero:

$$\frac{\partial L(\lambda)}{\partial \lambda} = \frac{\sum_{i=1}^{n} x_i}{\lambda} - n = 0$$
$$\Longrightarrow \lambda = \frac{\sum_{i=1}^{n} x_i}{n} \tag{12}$$
Again, as a result, the maximum likelihood estimator $\hat{\lambda}_{MLE}$ for a random sample which can be assumed to have a Poisson distribution is:

$$\hat{\lambda}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X} \tag{13}$$
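Similarly, a minimal sketch (not from the notes) of (13) on simulated Poisson data; the value 𝜆 = 2.5 is chosen only for illustration.

```python
# The MLE of lambda for Poisson data is simply the sample mean.
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=2.5, size=1000)            # simulated Poisson(2.5) sample

lam_hat = x.mean()                             # MLE from (13)
print(f"MLE of lambda: {lam_hat:.4f} (true value used in the simulation: 2.5)")
```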
Exponential Distribution
Let 𝑋1, 𝑋2, …, 𝑋𝑛 be a random sample which can be assumed to have an Exponential distribution with pdf $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. The likelihood function is:

$$l(\lambda) = \prod_{i=1}^{n}\lambda e^{-\lambda x_i} = \lambda^{n} e^{-\lambda \sum x_i} \tag{14}$$

and the log-likelihood function is:

$$L(\lambda) = n\ln\lambda - \lambda\sum_{i=1}^{n} x_i \tag{15}$$
Taking the derivative with respect to 𝜆 and setting it equal to zero:

$$\frac{\partial L(\lambda)}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0$$

$$\Longrightarrow \frac{1}{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Thus, the maximum likelihood estimator is $\hat{\lambda}_{MLE} = n / \sum_{i=1}^{n} X_i = 1/\bar{X}$.
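A minimal sketch (not from the notes) of the exponential case on simulated data; the rate 𝜆 = 0.5 is chosen only for illustration, and note that NumPy parameterizes the exponential by the scale 1/𝜆.

```python
# The MLE of the exponential rate lambda is 1 / sample mean.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1/0.5, size=1000)    # rate lambda = 0.5, so scale = 1/lambda = 2

lam_hat = 1 / x.mean()                         # MLE: lambda_hat = n / sum(x_i)
print(f"MLE of lambda: {lam_hat:.4f} (true rate used in the simulation: 0.5)")
```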
Normal Distribution
Let 𝑋1, 𝑋2, …, 𝑋𝑛 be a random sample which can be assumed to have a Normal distribution with pdf:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right) \tag{16}$$

The likelihood function is:

$$l(\mu,\sigma) = (2\pi\sigma^{2})^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)^{2}\right) \tag{17}$$

and the log-likelihood function is:

$$L(\mu,\sigma) = -\frac{n}{2}\ln 2\pi - n\ln\sigma - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu)^{2} \tag{18}$$
Taking partial derivatives and setting them equal to zero gives the first-order conditions:

$$\frac{\partial L(\mu,\sigma)}{\partial \mu} = \frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i-\mu) = 0 \tag{19}$$

$$\frac{\partial L(\mu,\sigma)}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^{3}}\sum_{i=1}^{n}(x_i-\mu)^{2} = 0 \tag{20}$$

Solving (19) gives $\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$. Substituting this into (20) gives

$$n\sigma^{2} = \sum_{i=1}^{n}(x_i-\mu)^{2}$$

$$\Longrightarrow \hat{\sigma}_{MLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2}}$$
Or, equivalently³,

$$\hat{\sigma}^{2}_{MLE} = \frac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2} \tag{22}$$
³ Let $\hat{\theta}_{MLE}$ be the maximum likelihood estimator of 𝜃 and let 𝑔 be a nice function; then the maximum likelihood estimator for 𝑔(𝜃) is given by $g(\hat{\theta}_{MLE})$.
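A minimal sketch (not from the notes) of the normal case on simulated data, checking that the MLE of 𝜇 is the sample mean and that the MLE of 𝜎² in (22) uses the divisor 𝑛 rather than 𝑛 − 1.

```python
# Closed-form normal MLEs computed from simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.80, 0.04, size=2500)

mu_hat = x.mean()                        # MLE of mu
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE of sigma^2, divisor n, as in (22)
s2 = x.var(ddof=1)                       # divisor n - 1, shown for comparison

print(f"mu_hat = {mu_hat:.4f}, sigma2_hat (MLE) = {sigma2_hat:.6f}, divisor-(n-1) version = {s2:.6f}")
```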
3. The Properties of Maximum Likelihood Estimators
From the above examples, what can we say about the properties of maximum likelihood estimators $\hat{\boldsymbol{\theta}}_{MLE}$? Well, from the above results on the Bernoulli, Poisson, Exponential and Normal distributions, we see that the MLE estimators are all given by the sample mean (or a function of it), so their behavior in large samples can be studied with tools such as the Central Limit Theorem. Recall that in the previous subsection we obtained these solutions to $\hat{\boldsymbol{\theta}}_{MLE}$ by solving the FOC, but there are many situations where we cannot directly solve the FOC and have to maximize the likelihood numerically; the nice properties discussed below still hold, provided certain regularity conditions hold. These conditions are, however, too technical (and very much beyond the scope of this course), so we do not have to state them here.
$\hat{\boldsymbol{\theta}}_{MLE}$ is a random variable. Therefore, we can talk about its expectation as well as its variance. Recall the idea that 𝑋 is a random variable which may have pdf 𝑓(𝑥), with 𝜽 the parameter(s). In Section 4, when we solve the exercises we usually know the value of 𝜽. In real situations, we are given a sample or data and do not know the true value of 𝜽, and we want to estimate 𝜽 using the sample
{𝑋1, 𝑋2, …, 𝑋𝑛}
We often assume that this is a random sample, which means that the random variables 𝑋1, 𝑋2, …, 𝑋𝑛 are independent and have the same distribution as 𝑋. In other words, we say 𝑋1, 𝑋2, …, 𝑋𝑛 are i.i.d. (independent and identically distributed). A statistic is a function of the random sample. For example, suppose 𝑋1, 𝑋2, …, 𝑋𝑛 is a random sample from a distribution (which need not be normal) with mean 𝜇 and variance 𝜎²; then the sample mean $\bar{X}$ is used to estimate 𝜇, and the sample variance 𝑆² is used to estimate 𝜎².
In real applications, we are usually okay with the "identical distribution" part of the i.i.d. assumption. However, the independence part may not hold for {𝑋1, 𝑋2, …, 𝑋𝑛}; for example, financial time-series data such as daily returns often show dependence over time⁴.
⁴ Everyone is welcome to take FM323 Financial Risk Management (財務風險管理) next semester!
Now we can briefly talk about the three properties of an estimator. An estimator is unbiased if its expectation equals the parameter it aims to estimate. For example, the sample mean $\bar{X}$ is an unbiased estimator of 𝜇 because:

$$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}(n\mu) = \mu \tag{23}$$

Similarly, consider the sample variance 𝑆² (note that the divisor is 𝑛 − 1, not 𝑛):

$$S^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2} \tag{24}$$
To check whether 𝑆² is unbiased, first expand:

$$S^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i^{2} - 2X_i\bar{X} + \bar{X}^{2}\right)$$
$$= \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^{2} - 2\sum_{i=1}^{n} X_i\bar{X} + \sum_{i=1}^{n}\bar{X}^{2}\right)$$
$$= \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^{2} - 2n\bar{X}^{2} + n\bar{X}^{2}\right)$$
$$= \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^{2} - n\bar{X}^{2}\right) \tag{25}$$

Then, taking expectations and using $E[X_i^{2}] = \sigma^{2} + \mu^{2}$ and $E[\bar{X}^{2}] = \frac{\sigma^{2}}{n} + \mu^{2}$:

$$E[S^{2}] = \frac{1}{n-1}\left(\sum_{i=1}^{n} E[X_i^{2}] - nE[\bar{X}^{2}]\right) = \frac{1}{n-1}\left(n\sigma^{2} + n\mu^{2} - n\left(\frac{\sigma^{2}}{n} + \mu^{2}\right)\right) = \sigma^{2} \tag{26}$$
Thus, we see the sample variance 𝑆² in (24) is indeed an unbiased estimator of 𝜎². However, if we define:

$$V = \frac{1}{n}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^{2} \tag{27}$$

then

$$E[V] = \left(\frac{n-1}{n}\right)\sigma^{2} \neq \sigma^{2}$$

so 𝑉, which is exactly the MLE $\hat{\sigma}^{2}_{MLE}$ in (22), is a biased estimator of 𝜎².
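A minimal sketch (not from the notes) that illustrates this by simulation: averaging 𝑆² from (24) over many samples gets close to 𝜎², while the divisor-𝑛 version 𝑉 from (27) is biased downward by the factor (𝑛 − 1)/𝑛. The sample size and number of repetitions are arbitrary.

```python
# Compare the average of S^2 (divisor n - 1) and V (divisor n) over many samples.
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, n_rep = 10, 4.0, 100_000          # small n makes the bias visible

x = rng.normal(0.0, np.sqrt(sigma2), size=(n_rep, n))
s2_vals = x.var(axis=1, ddof=1)              # divisor n - 1, as in (24)
v_vals = x.var(axis=1, ddof=0)               # divisor n, as in (27)

print(f"mean of S^2: {s2_vals.mean():.3f}  (target sigma^2 = {sigma2})")
print(f"mean of V  : {v_vals.mean():.3f}  (approx (n-1)/n * sigma^2 = {(n-1)/n*sigma2:.3f})")
```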
An estimator is consistent if, as the sample size 𝑛 → ∞, the probability that the distance between the estimator and the target parameter becomes larger than any fixed small number goes to zero; in other words, the estimator converges in probability to the target parameter. Both the sample mean $\bar{X}$ and the sample variance 𝑆² are consistent estimators of 𝜇 and 𝜎², respectively:

$$\bar{X} \xrightarrow{\,p\,} \mu \qquad \text{and} \qquad S^{2} \xrightarrow{\,p\,} \sigma^{2}$$

The first result about 𝜇 can be proved by using Chebyshev's Inequality and the Weak Law of Large Numbers. Finally, an estimator is efficient if its variance is the smallest possible, in the sense that it reaches a lower bound called the Rao-Cramér lower bound. The idea is that an estimator is more efficient if it has a smaller variance, so we can compare two estimators of the same parameter by comparing their variances.
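A minimal sketch (not from the notes) that illustrates consistency by simulation: as the sample size grows, the MLEs of 𝜇 and 𝜎² for simulated normal data settle down around the true values.

```python
# Consistency: the estimates approach the true parameters as n increases.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 0.80, 0.04

for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.normal(mu_true, sigma_true, size=n)
    mu_hat = x.mean()                    # MLE of mu
    sigma2_hat = x.var(ddof=0)           # MLE of sigma^2 (divisor n)
    print(f"n = {n:>6}: mu_hat = {mu_hat:.4f}, sigma2_hat = {sigma2_hat:.6f}")
```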
In conclusion, maximum likelihood estimators $\hat{\boldsymbol{\theta}}_{MLE}$ (1) may or may not be unbiased, (2) are consistent, and (3) are efficient estimators asymptotically, under the regularity conditions and if the assumption about the distribution is correct. The word "asymptotically" stands for "as the sample size 𝑛 → ∞." Another nice and important property of $\hat{\boldsymbol{\theta}}_{MLE}$ is that its distribution is approximately normal in large samples, which is what lets us attach standard errors to the estimates in the next sub-section.
4. Maximum Likelihood Estimation of Linear Regression
In this sub-section, we will solve the linear regression problem using the maximum likelihood method. Previously, in Section 3, we showed that linear regression can be estimated using the method of least squares; here we will show that for simple linear regression the least-squares estimates and $\hat{\boldsymbol{\theta}}_{MLE}$ are the same, that is, the two methods give us the same solutions for the parameters (𝑎, 𝑏) in the model. In addition, with the maximum likelihood method we can obtain the expected values, as well as the standard errors, of (𝑎, 𝑏).
Given a scatter plot of the data, we can agree that 𝑋 and 𝑌 display a linear relationship, but the relation is not perfect. That is, there will be some residuals when we fit a straight line.
In the method of least squares, to determine the straight line we minimize the sum of squared residuals. With the maximum likelihood method, we instead look at the residuals and consider their properties. First, since they are residuals, by definition their expected value should be zero. This is like what we have mentioned before: some of the residuals are positive and some are negative, and so on average the expected value is 0. Second, there will be large residuals and small residuals; it is then sensible to assume the residuals follow a normal distribution 𝑁(0, 𝜎²) with 𝜇 = 0 and variance 𝜎². Thus, we write the model as:

$$Y_i = a + bx_i + e_i$$

where the residuals {𝑒1, 𝑒2, …, 𝑒𝑛} are random variables assumed to be i.i.d. with normal distribution 𝑁(0, 𝜎²). Note that here 𝑌𝑖 is a random variable constructed from the given 𝑥𝑖 and the random residual 𝑒𝑖.
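A minimal sketch (not from the notes) of the claim in this sub-section: on simulated data, maximizing the normal log-likelihood of the model gives essentially the same (𝑎, 𝑏) as a least-squares fit. The data-generating values and the use of scipy.optimize are assumptions for illustration only.

```python
# MLE of simple linear regression under normal residuals vs. a least-squares fit.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, a_true, b_true, sigma_true = 200, 1.0, 2.0, 0.5
x = rng.uniform(0, 1, size=n)
y = a_true + b_true * x + rng.normal(0, sigma_true, size=n)

def neg_log_lik(params):
    a, b, log_sigma = params             # optimize log(sigma) to keep sigma positive
    sigma = np.exp(log_sigma)
    resid = y - (a + b * x)
    # Negative normal log-likelihood of the residuals.
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

res = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0], method="BFGS")
a_mle, b_mle = res.x[0], res.x[1]

# Least-squares fit for comparison: same intercept and slope up to numerical error.
b_ls, a_ls = np.polyfit(x, y, deg=1)
print(f"MLE:           a = {a_mle:.4f}, b = {b_mle:.4f}")
print(f"Least squares: a = {a_ls:.4f}, b = {b_ls:.4f}")
```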