
Statistical Signal Processing & Inference

BLUE and Maximum Likelihood Est.


Danilo Mandic
room 813, ext: 46271

Department of Electrical and Electronic Engineering


Imperial College London, UK
d.mandic@imperial.ac.uk, URL: www.commsp.ee.ic.ac.uk/∼mandic

c D. P. Mandic Statistical Signal Processing & Inference 1


Motivation for Best Linear Unbiased Estimator (BLUE)
and Maximum Likelihood Estimation (MLE)
◦ In many applications, signals exhibit sharp spikes
◦ This results in heavy-tailed distributions (e.g. α-stable distributions)
◦ There may not be a general form of pdf for such distributions

◦ If an efficient estimator does not exist, it is still of interest to be able to find an MVU estimator (assuming, of course, that it exists), as in BLUE
◦ To achieve this, we need the concept of sufficient statistics and the
Rao–Blackwell–Lehmann–Scheffe theorem
The BLUE assumptions are also called the Gauss–Markov assumptions
◦ It is possible in many cases to determine an approximate MVU estimator
(MVUE) by inspection of the PDF, using Maximum Likelihood Estimation
c D. P. Mandic Statistical Signal Processing & Inference 2
Motivation for Maximum Likelihood Estimation (MLE)
What do you think these applications have in common?
HDD microcontroller     Gesture (Swype) keyboard
P(S) = p(I|begin) p(would|I) p(like|would) × p(to|like) p(fly|to) p(London|fly) · · ·

Generative AI (density estimation) Global Positioning System


c D. P. Mandic Statistical Signal Processing & Inference 3
Overview
◦ It frequently occurs that the MVU estimator, even if it exists, cannot be
found (mathematical tractability, violation of regularity conditions, ...)
◦ For instance, one typical case is that we may not know the pdf of the
data, but we do know the 1st and 2nd moment (mean, variance,
power). In such cases pdf based methods cannot be applied
◦ We therefore have to resort to suboptimal solutions # by imposing
some constraints (domain knowledge) on the estimator and data model
◦ If the variance of a suboptimal estimator meets our system
specifications, the use of such estimators is fully justified
◦ The best linear unbiased estimator (BLUE) # restricts the estimator to
be linear in the data # finds a linear estimator that is unbiased and
has the minimum variance among such unbiased estimators
◦ Alternatively, if the MVU estimator does not exist or BLUE is not
applicable, we may resort to Maximum Likelihood Estimation (MLE)
◦ We first need to look at which data samples are pertinent to the estimation problem at hand # the so-called sufficient statistics
c D. P. Mandic Statistical Signal Processing & Inference 4
The notion of a statistic

Def: Any real valued function, T(x) = f( x[0], x[1], . . . , x[N − 1] ), of the observations in the sample space, {x}, is called a statistic. Importantly, there should not be any unknown parameter, θ, in a statistic.
The mean x̄ = (1/N) Σ_{n=0}^{N-1} x[n], the median, and max{x[0], x[1], . . . , x[N − 1]} are all statistics. However, x[0] + θ is not a statistic if θ is unknown.
Let us now reflect on the estimators we have considered so far:

x̄ = (1/N) Σ_{n=0}^{N-1} x[n]     σ̂² = (1/N) Σ_{n=0}^{N-1} (x[n] − µ)²     max{x[0], . . . , x[N − 1]}
These estimators are a function of the random observations, x, and not of the unknown parameter, θ # each of these estimators is a valid statistic.
◦ Observe that the above estimators “compress” the available information,
e.g. the sample mean takes N datapoints in x and produces one sample, x̄.
◦ In the best case, such compression is “loss-less”, as it contains the same amount of information as that contained in the N original observations, x.
We call such a statistic a sufficient statistic, as it summarises (absorbs) all information about an unknown parameter, θ, and reduces the “data footprint”.

c D. P. Mandic Statistical Signal Processing & Inference 5


An insight into the ‘sufficiency’ of the data statistics
Which data samples are pertinent to the est. problem? Q: ∃ a sufficient dataset?
Consider two unbiased estimators of DC level in WGN:
Â = (1/N) Σ_{n=0}^{N-1} x[n],  var(Â) = σ²/N     &     Ã = x[0],  var(Ã) = σ²
Although Ã is unbiased, its variance is much larger than that of Â. This is due to discarding the samples x[1], . . . , x[N−1] that carry information about A.
Consider now the following datasets:
S1 = {x[0], x[1], . . . , x[N − 1]}     S2 = {x[0] + x[1], x[2], . . . , x[N − 1]}     S3 = { Σ_{n=0}^{N-1} x[n] }
The original dataset, S1, is always sufficient for finding Â, while S2 and
S3 are also sufficient. In addition, S3 is the minimal sufficient statistic!
In a nutshell, sufficient statistics answer the questions:
Q1: Can we find a transformation T (x) of lower dimension that contains
all information about θ? (the data, x ∈ RN ×1, can be very long)
Q2: What is the lowest possible dimension of T (x) which still contains all
information about θ? # (minimal sufficient statistic)
For example, for DC level in WGN, T(x) = Σ_{n=0}^{N-1} x[n] (one-dimensional)
c D. P. Mandic Statistical Signal Processing & Inference 6
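A minimal numerical sketch (Python/NumPy) of the comparison above; the values A = 1, σ² = 1, N = 100, the random seed and the number of trials are assumed for illustration, not taken from the slides:

# Minimal sketch: compare var(A_hat) = sigma^2/N with var(A_tilde) = sigma^2
# for a DC level A in WGN. A, sigma, N and n_trials are assumed values.
import numpy as np

rng = np.random.default_rng(0)
A, sigma, N, n_trials = 1.0, 1.0, 100, 10000

x = A + sigma * rng.standard_normal((n_trials, N))   # n_trials independent realisations
A_hat = x.mean(axis=1)      # sample mean, uses all N samples
A_tilde = x[:, 0]           # uses only x[0]

print(A_hat.var(), sigma**2 / N)    # ~0.01 (sigma^2 / N)
print(A_tilde.var(), sigma**2)      # ~1.0  (sigma^2)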
Sufficient statistics: Putting it all together
In layman’s terms, sufficiency is saying that T (x) is informative enough
Aim: A statistic T (x) is sufficient if it allows us to estimate the unknown par. θ
as well as when based on the entire data set x. So, we no longer need to
consider data, x, after using it to calculate T (x), it becomes redundant.

Def: A statistic T(x) = f( x[0], x[1], . . . , x[N − 1] ) is a sufficient statistic, if for
any value of T0 the conditional distribution of x[0], x[1], . . . , x[N − 1]
given T = T0, that is, p(x = x0|T (x) = T0; θ), does not depend on the
unknown parameter θ. In other words, after observing the statistic T0, the
data x will not give us any new information about θ. (see Appendix 1)
(Figure: p( x = x0 | Σ_{n=0}^{N-1} x[n] = T0 ; A ) for two cases. Left panel: p(x|T0; A) depends on A, thus T0 is not sufficient # information about A is still present in the observations after T(x) is observed. Right panel: p(x|T0; A) does not depend on A, thus T0 is sufficient # no information about A is left in the observations after T(x) is observed.)
◦ Knowledge of T0 changes the PDF to the conditional one, p( x | Σ_{n=0}^{N-1} x[n] = T0; A ).
◦ If a statistic is sufficient for estimating A, this conditional PDF should not depend on A (as in the right hand panel).

c D. P. Mandic Statistical Signal Processing & Inference 7


Sufficient statistics: Neyman-Fisher factorisation
Recall: The Gaussian p(x; A) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] − A)² }
Finding the conditional distribution p( x | Σ_{n=0}^{N-1} x[n] = T0; A ) can be extremely difficult. An intuitive way to deal with this is through the factorisation of p(x; A).
Th: Neyman-Fisher factorisation. Consider a set of random samples, x, with
a PDF p(x; θ) which depends on the unknown parameter θ. Then, the
statistic T (x) is sufficient for θ iff the PDF can be factored as

p(x; θ) = g( T(x), θ ) h(x) = g(parameters & data) × h(data only)
where g(·) depends on x only through T(x), and h is a function of only x.


For a DC level in WGN
p(x; A) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} x²[n] } · exp{ −(1/(2σ²)) [ N A² − 2A Σ_{n=0}^{N-1} x[n] ] }
         = h(x) × g( T(x), A ),   with T(x) = Σ_{n=0}^{N-1} x[n] appearing only in g(·).
Therefore, the sufficient statistic is T(x) = Σ_{n=0}^{N-1} x[n] (minimal & linear).
Any one-to-one mapping of T(x) is also a sufficient statistic, e.g. T1(x) = (1/N) Σ_{n=0}^{N-1} x[n] and T2(x) = (x̄)³ are sufficient, but T3(x) = (x̄)² is not, as x̄ = ±√T3(x).

c D. P. Mandic Statistical Signal Processing & Inference 8


More examples of sufficient statistics # Estimating the
power of white Gaussian noise
Consider the parametrised PDF for DC level estimation in WGN, given by
p(x; A) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] − A)² }
where A = 0 and the noise power, σ², is the unknown parameter.


To find the sufficient statistic for the estimation of σ², we factorise
p(x; σ²) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} x²[n] } × 1 = g( T(x), σ² ) × h(x),   with h(x) = 1.
This gives the sufficient statistic for the estimation of the unknown σ² as T(x) = Σ_{n=0}^{N-1} x²[n], which, of course, makes perfect physical sense.
Homework: Prove that Σ_{n=0}^{N-1} x²[n] is a sufficient statistic for σ² by using the definition that p( x | Σ_{n=0}^{N-1} x²[n] = T0; σ² ) does not depend on σ² (no information is left in the observations after T0 is observed). (see Slide 8)

c D. P. Mandic Statistical Signal Processing & Inference 9


How to find the MVU from a sufficient statistic?
Raw data x = [x[0], . . . , x[N − 1]]^T ∈ R^{N×1} # an N-dim. sufficient statistic
Neyman-Fisher Th.: For T(x) to be a sufficient statistic, we need to be able to factor p(x; θ) as p(x; θ) = g( T(x), θ ) h(x)
Finding the MVU from sufficiency. We can do this in two ways:
1. Find any unbiased estimator, say θ̃, of θ and determine θ̂ = E[θ̃|T(x)]  (often mathematically intractable)
2. Find a g(·), s.t. θ̂ = g(T(x)) is an unbiased estimator of θ  (preferable in practice)
Then: If g(·) is unique, we have a complete statistic and the MVU estimator. If g(·) is not unique → no MVUE.
Rao-Blackwell-Lehmann-Scheffe Th: Assume that θ̃ is an unbiased estimator of θ and T(x) is a sufficient statistic for θ. Then, the estimator θ̂ = E[θ̃|T(x)] is:
◦ valid (not dependent on θ)
◦ unbiased
◦ of ≤ variance than that of θ̃
In addition, if the sufficient statistic is complete, then θ̂ is the MVU estimator.
Def: Complete stat. There is only one function of the statistic that is unbiased.
c D. P. Mandic Statistical Signal Processing & Inference 10
Best Linear Unbiased Estimator: BLUE
Motivation: When the PDF of the data is unknown, or it cannot be
assessed, the MVU estimator, even if it exists, cannot be found!
◦ In this case methods which rely on the PDF cannot be applied
Remedy: Resort to a sub-optimal estimator # check its variance and
ascertain whether it meets the required specifications (and/or CRLB)
Common sense approach: Assume an estimator to be:
◦ Linear in the data, that is, θ̂_BLUE = Σ_{n=0}^{N-1} a_n x[n], with a_n as parameters,
◦ Among all such linear estimators, seek an unbiased one,
◦ Then, minimise the variance of this unbiased estimator.
Such an estimator is termed the Best Linear Unbiased Estimator (BLUE)
which requires only knowledge of the first two moments of the PDF.

We will see that if the data are Gaussian, the BLUE and MVUE are equivalent

c D. P. Mandic Statistical Signal Processing & Inference 11


The form and optimality of BLUE
Consider the data x = [ x[0], x[1], . . . , x[N − 1] ]^T, for which the PDF p(x; θ) depends on the unknown parameter θ.
The form of BLUE: The BLUE is restricted to have the form (a = {a_n})
θ̂ = Σ_{n=0}^{N-1} a_n x[n] = a^T x     (the a_n are constants to be determined)
◦ We choose a = {a_n} which yield an unbiased estimate, E{θ̂} = θ
◦ Then, we perform min(var) # the BLUE estimator is that which is unbiased and has the minimum variance.
Optimality of BLUE: Note, the BLUE will be optimal only when the actual MVU estimator is linear!
This is the case, for instance, when estimating the DC level in WGN,
θ̂ = x̄ = (1/N) Σ_{n=0}^{N-1} x[n],   i.e. a_n = 1/N,
which is clearly linear in the data. Then, BLUE is an optimal MVU estimator giving a_n = 1/N.

c D. P. Mandic Statistical Signal Processing & Inference 12


The place of BLUE amongst other estimators
We illustrate this on the estimation of a DC level in noise of different distributions
Consider the space of all unbiased estimators of DC level in noise (figure: variance over the space of linear and nonlinear unbiased estimators; Gaussian noise: θ̂ = x̄ = MVU = BLUE; uniform noise: θ̂ = x̄ = BLUE, while θ̂ = ((N+1)/(2N)) max x[n] = MVU).
◦ For white Gaussian noise, the MVU is linear in the data and is given by the sample mean x̄.
◦ The MVU estimator for the mean θ = β/2 of uniform noise, x[n] ∼ U(0, β), is nonlinear in the data, and is given by
mean:  θ̂ = ((N + 1)/(2N)) max{x[n]},     variance:  var(θ̂) = β² / ( 4N(N + 2) )
The sample mean estimator of uniform noise gives var(θ̂) = β²/(12N). So, the sample mean is not an MVU estimator for uniform noise! (see Problem 4.1 in your P&A sets)
The difference in performance between the BLUE and MVU estimators
can, in general, be substantial, and can only be rigorously quantified
through the underlying data generating pdf.
c D. P. Mandic Statistical Signal Processing & Inference 13
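A minimal numerical sketch (Python/NumPy) of the uniform-noise comparison above; β = 2, N = 50, the random seed and the number of trials are assumed illustrative values:

# Minimal sketch: mean (theta = beta/2) of uniform noise x[n] ~ U(0, beta).
# Compare the sample mean (BLUE) with the nonlinear MVU estimator ((N+1)/(2N)) max{x[n]}.
import numpy as np

rng = np.random.default_rng(1)
beta, N, n_trials = 2.0, 50, 200000

x = rng.uniform(0.0, beta, size=(n_trials, N))
theta_mean = x.mean(axis=1)                      # BLUE (sample mean)
theta_mvu = (N + 1) / (2 * N) * x.max(axis=1)    # MVU for uniform noise

print(theta_mean.var(), beta**2 / (12 * N))          # ~ beta^2 / (12 N)
print(theta_mvu.var(), beta**2 / (4 * N * (N + 2)))  # ~ beta^2 / (4 N (N + 2)), much smaller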
Example 1: How useful is an estimator of DC level in
noise?
In fact, very useful. It is up to us to provide a correct data representation.
(Figure, left: sinusoidal frequency estimation # x(t) in time, X(f) in frequency, and the time–frequency spectrogram (T–F representation of a sinewave). Figure, right: real-world speech # time–frequency representation of the utterance “m—aaaaa—ttt–lll–aaaaa—bb”; horizontal axis: time, vertical axis: frequency.)
◦ Sinusoid in time # DC level in time–frequency
◦ Chirp in time # ramp in T–F
This is a speech waveform of the utterance of the word “matlab”. Observe the DC-like harmonics for “a”.

c D. P. Mandic Statistical Signal Processing & Inference 14


Example 2: Problems with BLUE (are often surmountable!)
Its direct form is inappropriate for nonlinear prob. # population dynamics example

Owing to the linearity assumptions, the BLUE estimator can be totally inappropriate for some estimation problems.
Power of WGN estimation: The MVU estimator σ̂² = (1/N) Σ_{n=0}^{N-1} x²[n] is nonlinear in the data. Forcing the estimator to be linear, e.g. by
σ̂² = (1/N) Σ_{n=0}^{N-1} a_n x[n],
yields E{σ̂²} = 0, which is guaranteed to be biased! A non-linear transformation of the data, i.e. y[n] = x²[n], could overcome this problem. (next Slide)
Example: Rabbit population (figure: parent rabbits and baby bunnies # rabbit population vs. rabbit pairs). The time evolution of the rabbit population is nonlinear (exponential). However, the number of parent pairs is linear in time!

c D. P. Mandic Statistical Signal Processing & Inference 15


Example 3: Nonlinear transformation of data often helps
(Figure. Left: original data (nonlinearly separable), Class 1 vs. Class 2. Right: linearly separable after a nonlinear transformation.)

c D. P. Mandic Statistical Signal Processing & Inference 16


How to find BLUE?
Recall: BLUE is linear in the data, θ̂ = Σ_{n=0}^{N-1} a_n x[n] = a^T x
Consider a scalar linear observation x[n] = θ s[n] + w[n] ⇒ E{x[n]} = θ s[n],
and notice that E{θ̂} = θ Σ_{n=0}^{N-1} a_n s[n], with s[n] the scaled mean.
1. Unbiased constraint
E{θ̂} = Σ_{n=0}^{N-1} a_n E{x[n]} = a^T s θ = θ   (using E{x[n]} = θ s[n])
a^T s θ = θ  ⇒  a^T s = 1   (the unbiased constraint)
where the scaled data vector s = [s[0], s[1], . . . , s[N − 1]]^T.
In other words, to satisfy the unbiased constraint for the estimate θ̂, E{x[n]} must be linear in θ, or E{x[n]} = s[n] θ.
2. Variance minimisation
θ̂ = a^T x,   var(θ̂) = E{ ( a^T x − a^T E{x} )² } = a^T C a
BLUE optimisation task
Minimise  var(θ̂) = a^T E{ (x − E{x})(x − E{x})^T } a = a^T C a,  subject to the unbiased constraint a^T s = 1.
This is a constrained minimisation problem.

c D. P. Mandic Statistical Signal Processing & Inference 17


Some remarks on variance calculation
A closer look at the variance yields
var(θ̂) = E{ ( Σ_{n=0}^{N-1} a_n x[n] − E{ Σ_{n=0}^{N-1} a_n x[n] } )² } = E{ ( a^T x − a^T E{x} )² }
With a ≡ [a_0, a_1, . . . , a_{N−1}]^T, y² = y y^T for a scalar y, and (a^T x)^T = x^T a, we have
E{ a^T ( x − E{x} ) ( x − E{x} )^T a } = a^T C a     (like var(aX) = a² var(X))
Also assume E{x[n]} = s[n]θ, easy to show from x[n] = E{x[n]} + ( x[n] − E{x[n]} );
by viewing w[n] = x[n] − E{x[n]}, we have x[n] = θ s[n] + w[n].
BLUE is linear in the unknown parameter θ, which corresponds to the amplitude estimation of known signals in noise (to generalise this, a nonlinear transformation of the data is required). (see Slide 16)

c D. P. Mandic Statistical Signal Processing & Inference 18


BLUE as a constrained optimisation paradigm
For Lagrange optimisation see Lecture 1 and Appendix 11

Task: minimize the variance subject to the unbiased constraint
min_a  a^T C a   (optimisation task)     subject to   a^T s = 1   (equality constraint)
Method of Lagrange multipliers:
1. J = a^T C a − λ( a^T s − 1 )
2. Calculate  ∂J/∂a = 2Ca − λs
3. Equate to zero and solve for a:  a = (λ/2) C^{-1} s
Solve for the Lagrange multiplier λ:
4. From the constraint equation:  a^T s = (λ/2) s^T C^{-1} s = 1  ⇒  λ/2 = 1 / ( s^T C^{-1} s )
5. Replace into Step 3, with the constraint satisfied, for
a_opt = C^{-1} s / ( s^T C^{-1} s )
These are the coefficients of BLUE.

c D. P. Mandic Statistical Signal Processing & Inference 19


Summary: BLUE
Recall that θ̂ = a^T x and var(θ̂) = a^T C a.
BLUE of an unknown parameter:
θ̂ = a_opt^T x = ( s^T C^{-1} x ) / ( s^T C^{-1} s ),   where  a_opt = C^{-1} s / ( s^T C^{-1} s )
BLUE variance:
var(θ̂) = a_opt^T C a_opt = 1 / ( s^T C^{-1} s )
To determine the BLUE we only require knowledge of
s # the scaled mean
C # the covariance matrix (C−1 is called the “precision
matrix”, see also Slide 42 in Lecture 4)
That is, for BLUE we only need to know the first two moments of the PDF
Notice that we do not need to know the functional relation of PDF

c D. P. Mandic Statistical Signal Processing & Inference 20
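A minimal sketch (Python/NumPy) of the BLUE formulas summarised above, applied to an assumed toy example (DC level in white noise of power 2, with N = 4); the helper name blue() and the toy values are illustrative only:

# Minimal sketch of a_opt = C^{-1}s / (s^T C^{-1} s), theta_hat = a_opt^T x, var = 1/(s^T C^{-1} s)
import numpy as np

def blue(x, s, C):
    """Return (theta_hat, a_opt, var) for the best linear unbiased estimator."""
    Cinv_s = np.linalg.solve(C, s)      # C^{-1} s without forming an explicit inverse
    denom = s @ Cinv_s                  # s^T C^{-1} s
    a_opt = Cinv_s / denom
    return a_opt @ x, a_opt, 1.0 / denom

# Toy example: DC level (s = 1) in white noise of power sigma^2 = 2 with N = 4 samples
x = np.array([1.1, 0.9, 1.3, 0.7])
s = np.ones(4)
C = 2.0 * np.eye(4)
theta_hat, a_opt, var = blue(x, s, C)
print(theta_hat, a_opt, var)   # sample mean 1.0, weights 1/4, variance sigma^2/N = 0.5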


Example 4: Estimation of a DC level in unknown noise
Notice that the PDF is unspecified and does not need to be known

Consider the estimation of a DC level in white noise, which is of an


unspecified PDF and with variance σ 2.
We know that
x[n] = A + w[n], n = 0, 1, . . . , N − 1

where {w[n]} is any white noise with known variance σ 2 (power).


In other words, {w[n]} is not necessarily Gaussian # there may be some
statistical dependence between samples (although they are uncorrelated)

Task: Estimate the DC level, A.

Solution: From the assumptions of BLUE, we have
E{x[n]} = s[n]A = A and therefore s[n] = 1, so that
s = 1 = [1, 1, . . . , 1]^T = 1_{N×1}   (N elements)
This follows from E{x[n]} being linear in θ ⇒ E{x[n]} = s[n]θ.

c D. P. Mandic Statistical Signal Processing & Inference 21


Example 4: DC level in white noise of unknown PDF, contd.
Recall that a_opt = C^{-1} s / ( s^T C^{-1} s ),  var(θ̂) = 1 / ( s^T C^{-1} s ),  and θ̂ = a_opt^T x
For any uncorrelated white noise {w} with power σ²
C = diag(σ², . . . , σ²) = σ² I   ⇒   C^{-1} = diag(1/σ², . . . , 1/σ²) = (1/σ²) I
The BLUE for the estimation of a DC level in noise then becomes (see Slide 20)
Â = ( 1^T (1/σ²) I x ) / ( 1^T (1/σ²) I 1 ) = (1/N) Σ_{n=0}^{N-1} x[n] = x̄
and has minimum variance (CRLB for a linear estimator) of
var(Â) = 1 / ( 1^T (1/σ²) I 1 ) = σ²/N
◦ The sample mean is the BLUE, independent of the PDF of the data
◦ BLUE is also the MVU estimator if the noise {w} is Gaussian
If the noise is not Gaussian (e.g. uniform) the CRLB and MVU estimator may not exist, but BLUE still exists! (P&A sets and Slide 13)
c D. P. Mandic Statistical Signal Processing & Inference 22
Some help with the quadratic forms of the type aT Aa
We shall analyse the expressions 1^T I 1 and 1^T I x.
(Diagram:  [1 1 . . . 1]_{1×N} · I_{N×N} · [1 1 . . . 1]^T_{N×1} = N,   and   [1 1 . . . 1]_{1×N} · I_{N×N} · [x[0] x[1] . . . x[N−1]]^T_{N×1} = Σ x[n].)
It is useful to visualise any type of vector–matrix expression.
It is now obvious that e.g. the scalar a^T A a is ‘quadratic’ in a. This is easily proven by considering x^T I x in the diagrams above.

c D. P. Mandic Statistical Signal Processing & Inference 23


Example 5: DC Level in non-iid but uncorrelated zero mean noise with var(w[n]) = σ_n² (de-emphasising bad samples)
Notice that now the noise variance depends on the sample number! As before, s = 1.
The covariance matrix of the noise is C = diag(σ_0², σ_1², . . . , σ_{N−1}²), and thus C^{-1} = diag(σ_0^{-2}, σ_1^{-2}, . . . , σ_{N−1}^{-2}).
The BLUE solution:
Â = ( 1^T C^{-1} x ) / ( 1^T C^{-1} 1 ) = ( Σ_{n=0}^{N-1} x[n]/σ_n² ) / ( Σ_{n=0}^{N-1} 1/σ_n² )
◦ The term Σ_{n=0}^{N-1} 1/σ_n² ensures that the estimator is unbiased
◦ BLUE assigns greater weights to samples with smaller variances
◦ Notice that var(Â) = 1 / ( Σ_{n=0}^{N-1} 1/σ_n² )

c D. P. Mandic Statistical Signal Processing & Inference 24
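A minimal sketch (Python/NumPy) of the inverse-variance weighting above; the DC level A, the per-sample variances σ_n² and N are assumed illustrative values:

# Minimal sketch: A_hat = (sum x[n]/sigma_n^2)/(sum 1/sigma_n^2), an inverse-variance
# weighted mean that de-emphasises noisy samples. All numerical values are assumed.
import numpy as np

rng = np.random.default_rng(2)
A, N = 1.0, 1000
sigma2 = rng.uniform(0.1, 5.0, size=N)             # sigma_n^2 varies with n
x = A + np.sqrt(sigma2) * rng.standard_normal(N)

w = 1.0 / sigma2
A_blue = np.sum(w * x) / np.sum(w)                 # BLUE
A_mean = x.mean()                                  # ordinary sample mean

print(A_blue, A_mean)        # BLUE is typically closer to A
print(1.0 / np.sum(w))       # var(A_blue) = 1 / sum(1/sigma_n^2)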


BLUE: Extension to the vector parameter (see Appendix 6)
System model: θ̂_i = Σ_{n=0}^{N-1} a_in x[n], i = 1, . . . , p  ⇒  θ̂ = Ax
For every θ_i ∈ θ = [θ_1, . . . , θ_p]^T we have (a_in # weighting coefficients)
Scalar BLUE: θ̂_i = Σ_{n=0}^{N-1} a_in x[n], i = 1, 2, . . . , p   → (vector BLUE)   θ̂ = Ax
Now: Unbiased constraint
Scalar BLUE: E{θ̂_i} = Σ_{n=0}^{N-1} a_in E{x[n]} = θ_i   → (vector BLUE)   E{θ̂} = A E{x} = θ
Scalar BLUE: E{x[n]} = s[n]θ   → (vector BLUE)   E{x} = Hθ → E{θ̂} = AHθ
where the coefficients A = [a_in]_{p×N} and H is a vector/matrix form of the {s[n]} terms.
Unbiased constraint: AH = I, and we wish to minimise: var(θ̂_i) = a_i^T C a_i
The vector BLUE becomes:
θ̂ = ( H^T C^{-1} H )^{-1} H^T C^{-1} x   with the covariance   C_θ̂ = ( H^T C^{-1} H )^{-1}
If the data are truly Gaussian, as in
x = Hθ + w with w ∼ N(0, C)
then the vector BLUE is also the Minimum Variance Unbiased estimator.
c D. P. Mandic Statistical Signal Processing & Inference 25
The Gauss – Markov Theorem
Consider the observed data in the form of a general linear model

x = Hθ + w

with w having zero mean and covariance C, otherwise an arbitrary PDF.

Then, the vector BLUE of θ can be found as
θ̂ = ( H^T C^{-1} H )^{-1} H^T C^{-1} x
and for every θ̂_i ∈ θ̂, the minimum variance of θ̂_i is
var(θ̂_i) = [ ( H^T C^{-1} H )^{-1} ]_ii
with the covariance matrix of θ̂ given by
C_θ̂ = ( H^T C^{-1} H )^{-1}

c D. P. Mandic Statistical Signal Processing & Inference 26
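A minimal sketch (Python/NumPy) of the Gauss–Markov estimator above, for an assumed straight-line model x[n] = θ_0 + θ_1 n + w[n] with a known diagonal C; all numerical values are illustrative:

# Minimal sketch of theta_hat = (H^T C^{-1} H)^{-1} H^T C^{-1} x for x = H theta + w.
import numpy as np

rng = np.random.default_rng(3)
N = 100
n = np.arange(N)
H = np.column_stack([np.ones(N), n])        # observation matrix for a line fit
theta_true = np.array([2.0, 0.05])
sigma2 = 0.5 * np.ones(N)                   # here white noise; any known C would do
C = np.diag(sigma2)
x = H @ theta_true + np.sqrt(sigma2) * rng.standard_normal(N)

Cinv_H = np.linalg.solve(C, H)              # C^{-1} H
C_theta = np.linalg.inv(H.T @ Cinv_H)       # (H^T C^{-1} H)^{-1} = covariance of theta_hat
theta_hat = C_theta @ (Cinv_H.T @ x)        # vector BLUE

print(theta_hat)            # close to [2.0, 0.05]
print(np.diag(C_theta))     # minimum variances var(theta_hat_i)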


Example 6: Sinusoidal phase estim. (DSB, PSK, QAM)
Motivation for Maximum Likelihood Estimation (MLE)
Signal model: x[n] = A cos(2πf0n + Φ) + w[n] w ∼ N (0, σ 2)
Signal to noise ratio (SNR):  SNR = P_signal / P_noise = A² / (2σ²)
Parametrised pdf:  p(x; Φ) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} ( x[n] − A cos(2πf0 n + Φ) )² }
Regularity condition within the CRLB:  ∂ln p(x; θ)/∂θ = I(θ)( g(x) − θ )
In our case: (see Example 8, slide 36)
∂ln p(x; Φ)/∂Φ = −(A/σ²) Σ_{n=0}^{N-1} [ x[n] sin(2πf0 n + Φ) − (A/2) sin(4πf0 n + 2Φ) ]
We cannot arrive at the above regularity condition, and an efficient estimator for sinusoidal phase estimation does not exist.
Remedy: Using MLE, we can still obtain an ≈ CRLB for frequencies far from 0 and 1/2
Approximate CRLB:  var(Φ̂) ≥ 1 / ( N × SNR )   (see Example 8)

c D. P. Mandic Statistical Signal Processing & Inference 27


Maximum Likelihood Estimation (MLE): A familiar
example from HDDs in our computers

◦ The spindle spins the HDD platter


◦ The actuator arm moves across the magnetic medium
◦ R/W head changes the polarity to either the North or South pole
◦ This is very much like our usual binary coding of 0’s and 1’s
One method for recovering digital data from a weak analog magnetic signal is called partial-response maximum-likelihood (PRML)


A more advanced, and currently used method, is the correlation-sensitive
maximum likelihood sequence detector (CS-MLSD) (Kavcic and Moura)
c D. P. Mandic Statistical Signal Processing & Inference 28
Towards Maximum Likelihood Estimation (MLE)
Effects of parametrisation on the shape of a PDF: An example from finance

Recall that p_X(x; µ, σ²) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}   (parametrised by µ and σ²)
Effects of parametrisation:
p_blue(x; 1, 1) = (1/(1·√(2π))) e^{−(x−1)²/(2·1)}     p_red(x; 4, 9) = (1/(3·√(2π))) e^{−(x−4)²/(2·9)}
(Figure: “Simple Returns: Unequal Return Levels” # frequency density of simple returns, r_t (%), for the two parametrisations (µ, σ²) = (1, 1) and (4, 9).)
Virtue of imposing a distribution: return(t) = price(t)/price(t − 1)
(Figures: “BA Prices: First Order Moments” and “BA Simple Returns: First Order Moments” # pdf, mean, median and mode overlaid on the frequency densities of asset prices, p_t ($), and asset simple returns, r_t (%).)

c D. P. Mandic Statistical Signal Processing & Inference 29


MLE: Which distribution to assume for a given data?
Let us consider an example from financial modelling
Consider the S&P 500 stock market index, which tracks the performance of the 500 largest companies listed on stock exchanges in the US. It serves as a benchmark in quantitative finance, as it indicates “market movement”.
Observe the log-returns of the S&P 500:  log return = log( price(t)/price(t − 1) )
(Figure: histograms of S&P 500 log-returns with Normal, Exponential, and Gamma fits.)
Q: Which of the three distributions is most natural to impose on the data?


So, it is all about Maximising the Likelihood of obtaining the observed data!

c D. P. Mandic Statistical Signal Processing & Inference 30


Intuition: Finding the mean and variance of the assumed
distribution # parametrising the MLE
Finding the mean (location) of the distribution     Finding the variance (spread) of the distribution
(Figures: the likelihood of observing the data as a function of the location (left panel) and of the spread (right panel), overlaid on the S&P 500 log-return histogram; we pick the location/spread which maximizes the likelihood of observing the data.)
Q: Which of these functions best approximates the mean?  A: “Read-out” the value of the likelihood function for the data mean.
Q: Which of these functions best approximates the variance?  A: Employ a similar procedure as for fitting the mean.
In other words, we desire to find θ̂_mle so that p(x; θ̂_mle) is largest!
c D. P. Mandic Statistical Signal Processing & Inference 31
Putting it all together: MLE as an alternative to MVU
Effectively, we treat θ as a variable, not as a parameter, θ̂ = arg maxθ p(x; θ)
Rationale for Maximum Likelihood Estimation (MLE):
◦ The MVU estimator often does not exist or it cannot be found

◦ BLUE may not be applicable, that is, x ≠ Hθ + w


However, if we assume a likely pdf of the data, MLE can always be used!
◦ This yields an estimator which is generally a function of x, while
maximisation is performed over the allowable range of θ.
Def: The probability of observing the data, x, given the model parameters
θ = [θ1, θ2, . . . , θp]T , is called the likelihood function, L, and is given by
L(θ_1, θ_2, . . . , θ_p; x = observed) = L(θ; x = fixed)     (unknown parameters θ; known, observed data x)
While the pdf p(x; θ) gives the probability of occurrence of different possible values of x, the likelihood function, L, is a function of the parameters θ only, with the observed (known) data x held as a fixed constant!
Mathematically, L(θ; x) = p(x; θ), so a more intuitive form of MLE is
θ̂_mle = arg max_θ L(θ; x) = arg max_θ p_model(x; θ) = arg max_θ p(x; θ)

c D. P. Mandic Statistical Signal Processing & Inference 32


Principle of Maximum Likelihood Estimation (MLE)
The unknown parameters, θ , may be deterministic or random variables.
Principle of ML estimation: We aim to determine a set of parameters,
θ, from a set of data, x, such that their values would yield the highest probability of obtaining the observed data, x.


NB: Data are “probable” and parameters are “likely” # two equivalent
statements: “likelihood of the parameters” and “probability of the data”.
No a priori distribution assumed # MLE A priori distribution assumed # Bayesian

MLE assumptions (the i.i.d. assumption): With L(θ; x) = p(x; θ), it is


often more convenient to consider the log–likelihood, l(θ; x), given by
l(θ; x) = log L(θ; x) = log L(θ_1, . . . , θ_p; x[0], . . . , x[N − 1]) = (i.i.d.) = log ∏_{n=0}^{N-1} p_data(x[n]; θ) = Σ_{n=0}^{N-1} log p_data(x[n]; θ)
Conditional MLE: Supervised Machine Learning employs conditioning between data labels y and input data x, with p_model dictated by a chosen architecture. Such a conditional maximum likelihood function is L(θ; y|x), and
θ̂_mle = arg max_θ Σ_{n=0}^{N-1} log p_model(y_n|x_n; θ)   (Lecture 7, Appendix 9, P&A sets)

c D. P. Mandic Statistical Signal Processing & Inference 33


Example 7: MLE of a DC level in noise
Consider a DC level in WGN, where w[n] ∼ N (0, σ 2)
x [n] = A + w [n] n=0,1,...,N -1

A is to be estimated
Step 1: Start from the PDF
p(x; A) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] − A)² }
Step 2: Take the derivative of the log-likelihood function
∂ln p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N-1} (x[n] − A)
Step 3: Set the result to zero to yield the MLE (in general, no optimality)
Â = (1/N) Σ_{n=0}^{N-1} x[n]
Clearly this is an MVU estimator which yields the CRLB (efficient)
c D. P. Mandic Statistical Signal Processing & Inference 34
Maximum Likelihood Estimation: Observations so far
MVU vs MLE: MLE is a particular estimator which comes with a “recipe”
for its calculation. MVU property relates to the properties of any estimator
(unbiased, minimum variance). So, MLE could be an MVU estimator,
depending on the chosen model and problem in hand.
◦ If an efficient estimator does exist (which satisfies the CRLB), the
maximum likelihood procedure will produce it (see Example 7)
◦ When an efficient estimator does not exist, the MLE has the desirable
property that it yields “an asymptotically efficient” estimator (Example 8)
If θ is the parameter to be estimated from a random observation x, then
the MLE θ̂mle = arg maxθ p(x; θ) is the value of θ that maximises p(x; θ)
◦ The function L(θ; data) = p(data; θ) does integrate to 1 when
integrated over data (property of PDF), but does not integrate to 1
when integrated over the parameters, θ (property of likelihood fn.)

◦ So, p(x; θ) is a probability over the data, x, and a likelihood function
(not probability) over the parameters, θ.
MLE is a “turn-the-crank” method which is optimal for large enough data.
It can be computationally complex and may require numerical methods.
c D. P. Mandic Statistical Signal Processing & Inference 35
Example 8: MLE sinusoidal phase estimator (cf. Ex. 6)

Recall the Neyman-Fisher factorisation: p(x; θ) = g( T(x), θ ) h(x)

MLE of sinusoidal phase. No single sufficient statistic exists for this


case. The sufficient statistics are: (see Slides 6, 7 and 8, and Appendix 1)
T1(x) = Σ_{n=0}^{N-1} x[n] cos(2πf0 n)     T2(x) = Σ_{n=0}^{N-1} x[n] sin(2πf0 n)
The observed data:
x[n] = A cos(2πf0n + Φ) + w[n] n = 0, 1, ..., N − 1 w[n] ∼ N (0, σ 2)

Task: Find the MLE estimator of Φ by maximising


p(x; Φ) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} ( x[n] − A cos(2πf0 n + Φ) )² }
or, equivalently, minimise
J(Φ) = Σ_{n=0}^{N-1} ( x[n] − A cos(2πf0 n + Φ) )²

c D. P. Mandic Statistical Signal Processing & Inference 36


Example 8: MLE sinusoidal phase estimator (cf. Ex. 6)

Recall the Neyman-Fisher factorisation: p(x; θ) = g( T(x), θ ) h(x)
For the minimum, differentiate w.r.t. the unknown parameter Φ to yield
∂J(Φ)/∂Φ = 2 Σ_{n=0}^{N-1} ( x[n] − A cos(2πf0 n + Φ) ) A sin(2πf0 n + Φ)
and set the result to zero, to give
(SP1)   Σ_{n=0}^{N-1} x[n] sin(2πf0 n + Φ̂) = A Σ_{n=0}^{N-1} sin(2πf0 n + Φ̂) cos(2πf0 n + Φ̂)     (RHS: inner product of sine and cosine)
Recall that (use sin(2a) = 2 sin(a) cos(a), see also Example 9 in Lecture 4)
(SP2)   (1/N) Σ_{n=0}^{N-1} sin(2πf0 n + Φ̂) cos(2πf0 n + Φ̂) = (1/(2N)) Σ_{n=0}^{N-1} sin(4πf0 n + 2Φ̂) ≈ 0
that is, it vanishes provided f0 is not near 0 or 1/2, and for a large enough N.

c D. P. Mandic Statistical Signal Processing & Inference 37


Example 8: MLE sinusoidal phase estimator (cf. Ex. 6)

Recall the Neyman-Fisher factorisation: p(x; θ) = g( T(x), θ ) h(x)
Thus, the LHS of (SP1), when divided by N and set equal to zero, will yield an approximation (see Appendix 2)
∂J(Φ)/∂Φ = 0   ⇒   Σ_{n=0}^{N-1} x[n] sin(2πf0 n + Φ̂) ≈ 0
Upon expanding sin(2πf0 n + Φ̂), we have (sin(a+b) = sin a cos b + cos a sin b)
Σ_{n=0}^{N-1} x[n] sin(2πf0 n) cos Φ̂ = − Σ_{n=0}^{N-1} x[n] cos(2πf0 n) sin Φ̂
so that the ML Estimator is
Φ̂ = − arctan[ ( Σ_{n=0}^{N-1} x[n] sin(2πf0 n) ) / ( Σ_{n=0}^{N-1} x[n] cos(2πf0 n) ) ]
The MLE Φ̂ is clearly a function of the sufficient statistics, which are
T1(x) = Σ_{n=0}^{N-1} x[n] cos(2πf0 n)     T2(x) = Σ_{n=0}^{N-1} x[n] sin(2πf0 n)
c D. P. Mandic Statistical Signal Processing & Inference 38
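A minimal sketch (Python/NumPy) of the phase MLE above; the parameters are chosen to mirror the simulation setting of the next slide (A = 1, f0 = 0.08, Φ = π/4, σ² = 0.05), with N = 80 and the random seed assumed:

# Minimal sketch: Phi_hat = -arctan( sum x[n] sin(2 pi f0 n) / sum x[n] cos(2 pi f0 n) )
import numpy as np

rng = np.random.default_rng(4)
A, f0, Phi, sigma2, N = 1.0, 0.08, np.pi / 4, 0.05, 80
n = np.arange(N)
x = A * np.cos(2 * np.pi * f0 * n + Phi) + np.sqrt(sigma2) * rng.standard_normal(N)

T1 = np.sum(x * np.cos(2 * np.pi * f0 * n))   # sufficient statistic T1(x)
T2 = np.sum(x * np.sin(2 * np.pi * f0 * n))   # sufficient statistic T2(x)
Phi_hat = -np.arctan2(T2, T1)                 # equals -arctan(T2/T1) here since T1 > 0

print(Phi, Phi_hat)   # Phi_hat close to pi/4 ~ 0.785 at this SNR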
Example 8: Sinusoidal phase # numerical results
The expected asymptotic PDF of the phase estimator:  Φ̂_asy ∼ N( Φ, I^{-1}(Φ) )
# so that the asymptotic variance is
var(Φ̂) = 1 / ( N A²/(2σ²) ) = 1/(ηN)
where η = P_signal/P_noise = (A²/2)/σ² is the “signal-to-noise-ratio” (SNR)
Below: Simulation results with A = 1, f0 = 0.08, Φ = π/4 and σ² = 0.05
Data record length N    Mean, E(Φ̂)    N × variance, N var(Φ̂)
10    0.732    0.0978
40    0.746    0.108
60    0.774    0.110
80    0.789    0.0990
Theoretical asymptotic values:    Φ = 0.785    1/η = 0.1
For shorter data records the ML estimator is considerably biased. Part of
this bias is due to the assumption (SP2) on Slide 37.
c D. P. Mandic Statistical Signal Processing & Inference 39
Example 8: MLE of sinusoidal phase # asymptotic mean
and variance (performance vs SNR for a fixed N )
For a fixed data length of N = 80, SNR was varied from -15 dB to +10 dB
◦ The asymptotic variance (or CRLB) then becomes
10 log10 var(Φ̂) = 10 log10 ( 1/(Nη) ) = −10 log10 N − 10 log10 η

◦ Mean and variance are also functions of SNR
◦ Asymptotic mean is attained for SNRs > -10dB
(Figures: actual vs. asymptotic mean (left) and actual vs. asymptotic variance, 10 log10 variance (right), of the phase estimator as functions of SNR (dB), for SNR from −15 dB to +10 dB.)
Observe that the minimum data length to attain CRLB also depends on SNR
c D. P. Mandic Statistical Signal Processing & Inference 40
Asymptotic properties of MLE
We can now formalise the asymptotic properties of θ̂_ML (see the previous slide).
Theorem (asymptotic properties of MLE): If p(x; θ) satisfies some “regularity” conditions, then the MLE is asymptotically distributed as
θ̂ ∼ N( θ, I^{-1}(θ) )   (asymptotically)
where “regularity” refers to the existence of the derivative of the log–likelihood function (as well as the Fisher information being non–zero), and I is the Fisher Information evaluated at the true value of the unknown parameter θ.
The Maximum Likelihood Estimator is therefore asymptotically:
◦ unbiased
◦ efficient (that is, it achieves the CRLB)

For a small N , there is no guarantee how the MLE behaves


We can use Monte Carlo simulations to answer “how large an N do we
need for an appropriate estimate ?’’ (see Appendix 9 for more detail)
c D. P. Mandic Statistical Signal Processing & Inference 41
MLE: Extension to vector parameter
A distinct advantage of the MLE is that we can always find it for a
given dataset numerically, as the MLE is a maximum of a known function.
◦ For instance, a grid search of p(x; θ) can be performed over a finite
interval [a, b].
◦ If the grid search cannot be performed (e.g. infinite range of θ) then we
may resort to iterative maximisation, such as the Newton-Raphson
method, the scoring approach, and the expectation-maximisation (EM)
approach. MLE depends on a good initial guess of the underlying PDF.
◦ Since the likelihood function to be maximised is not known a priori and

R it changes for each dataset, we effectively maximise a random function.


Extension to the vector parameter is straightforward: The MLE for a
vector parameter θ is the value that maximises the likelihood function
L(θ; x) = p(x; θ) over the allowable domain of θ.
Asymptotic properties: If ∂ln p(x; θ)/∂θ = 0, then θ̂ ∼ N( θ, I^{-1}(θ) ) asymptotically.

c D. P. Mandic Statistical Signal Processing & Inference 42
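A minimal sketch (Python/NumPy) of the grid-search maximisation mentioned above, applied to the familiar DC-level-in-WGN example; the grid limits, N, the noise values and the seed are assumed for illustration:

# Minimal sketch: MLE of the DC level A by a grid search over the log-likelihood.
import numpy as np

rng = np.random.default_rng(5)
A_true, sigma2, N = 1.0, 1.0, 200
x = A_true + np.sqrt(sigma2) * rng.standard_normal(N)

def log_likelihood(A, x, sigma2):
    """log p(x; A) for x[n] = A + w[n], w[n] ~ N(0, sigma2), up to an additive constant."""
    return -np.sum((x - A) ** 2) / (2 * sigma2)

A_grid = np.linspace(-3.0, 3.0, 10001)                         # finite interval [a, b]
ll = np.array([log_likelihood(A, x, sigma2) for A in A_grid])
A_mle = A_grid[np.argmax(ll)]

print(A_mle, x.mean())   # agrees (to grid resolution) with the closed form x_bar of Example 7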


Example 9: MLE of a DC level in WGN. Both the DC
level A and the noise variance (power) σ 2 are unknown
Consider the data x[n] = A + w[n], n = 0, 1, . . . , N − 1, where w[n] is zero–mean WGN.
The vector parameter θ = [A, σ²]^T is to be estimated (var(w) is unknown too).
Solution: Assume p(x; θ) = p(x; A, σ²) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} (x[n] − A)² }
Now:   ∂ln p(x; θ)/∂A = (1/σ²) Σ_{n=0}^{N-1} (x[n] − A)
and:   ∂ln p(x; θ)/∂σ² = −N/(2σ²) + (1/(2σ⁴)) Σ_{n=0}^{N-1} (x[n] − A)²
From the first equation solve for A, from the second equation solve for σ², to obtain
θ̂ = [ x̄ , (1/N) Σ_{n=0}^{N-1} (x[n] − x̄)² ]^T   → (N→∞)   [A, σ²]^T     (asymptotic CRLB)
where x̄ = (1/N) Σ_{n=0}^{N-1} x[n].
Amazing, we only assumed a type of PDF, but not the mean or variance!
c D. P. Mandic Statistical Signal Processing & Inference 43
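A minimal sketch (Python/NumPy) of the two MLE formulas in Example 9; A, σ², N and the seed are assumed illustrative values:

# Minimal sketch: MLE of theta = [A, sigma^2]^T for a DC level in WGN.
import numpy as np

rng = np.random.default_rng(6)
A_true, sigma2_true, N = 1.0, 2.0, 5000
x = A_true + np.sqrt(sigma2_true) * rng.standard_normal(N)

A_mle = x.mean()                         # x_bar
sigma2_mle = np.mean((x - A_mle) ** 2)   # (1/N) sum (x[n] - x_bar)^2

print(A_mle, sigma2_mle)                 # approach [A, sigma^2] as N grows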
Example 10: Sinusoidal parameter estimation with three
unknown parameters # A, f0, and Φ (Example 7 in Lecture 4)
Now, θ = [A, f0, Φ]^T, and
p(x; θ) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} ( x[n] − A cos(2πf0 n + Φ) )² }     (we need this as (x − Hθ)^T (x − Hθ))
For A > 0, 0 < f0 < 1/2, the MLE of θ = [A, f0, Φ]^T is found by minimising
J(A, f0, Φ) = Σ_{n=0}^{N-1} ( x[n] − A cos(2πf0 n + Φ) )²
            = Σ_{n=0}^{N-1} ( x[n] − (A cos Φ) cos 2πf0 n + (A sin Φ) sin 2πf0 n )²
The function J(A, f0, Φ) is “coupled” in A and Φ, and thus hard to minimise. To this end, we may transform the multiplicative terms involving A and Φ to new “linear terms” α1 = A cos Φ, α2 = −A sin Φ,
with the inverse mapping A = √(α1² + α2²)  &  Φ = tan^{-1}( −α2 / α1 ).

c D. P. Mandic Statistical Signal Processing & Inference 44


Example 10: Sinusoidal parameter estimation of three
unknown parameters, cont. (see Linear Models in Lecture 4)

For convenience of notation, we shall now introduce the vectors of sampled cos and sin terms (containing the unknown frequency f0) in the form
c = [1, cos 2πf0, . . . , cos 2πf0(N − 1)]^T     s = [0, sin 2πf0, . . . , sin 2πf0(N − 1)]^T
to yield the function J′(α1, α2, f0) which is quadratic in α = [α1, α2]^T
J′(α1, α2, f0) = ( x − α1 c − α2 s )^T ( x − α1 c − α2 s ) = ( x − Hα )^T ( x − Hα )     (∗)
We arrive at a linear estimator of the vector parameter α = [α1, α2]^T, where H = [c | s] (see Example 9 in Lecture 4).
This function can be minimised over α, exactly as in the linear model (with C = I), to give (Slide 33, Lecture 4)
α̂ = ( H^T H )^{-1} H^T x   → insert into (∗)
to yield  J′(α̂1, α̂2, f0) = ( x − Hα̂ )^T ( x − Hα̂ ) = x^T [ I − H(H^T H)^{-1} H^T ] x
so that minimising J′ over f0 amounts to maximising x^T H(H^T H)^{-1} H^T x.

c D. P. Mandic Statistical Signal Processing & Inference 45


Example 10: Sinusoidal parameter estimation of three
unknown parameters, cont. cont.
Hence, to find f̂0 we need to minimise J′ over f0 or, equivalently, maximise x^T H(H^T H)^{-1} H^T x.
Using the definition of H, the MLE of the frequency f̂0 is the value that maximises the power spectrum estimate (see your P&A sets)
[ c^T x   s^T x ] [ c^T c   c^T s ; s^T c   s^T s ]^{-1} [ c^T x ; s^T x ]  ∝  (1/N) | Σ_{n=0}^{N-1} x[n] e^{−j2πf0 n} |²   ← periodogram
(the three factors above being x^T H, (H^T H)^{-1} and H^T x).
Use this expression to find f̂0, and proceed to find α̂ (Example 9, Lect. 4):
α̂ = [ α̂1 ; α̂2 ] ≈ (2/N) [ c^T x ; s^T x ] = [ (2/N) Σ_{n=0}^{N-1} x[n] cos(2πf̂0 n) ; (2/N) Σ_{n=0}^{N-1} x[n] sin(2πf̂0 n) ]
Φ̂ = − arctan[ ( Σ_{n=0}^{N-1} x[n] sin(2πf̂0 n) ) / ( Σ_{n=0}^{N-1} x[n] cos(2πf̂0 n) ) ]
and   Â = √(α̂1² + α̂2²) = (2/N) | Σ_{n=0}^{N-1} x[n] exp(−j2πf̂0 n) |

c D. P. Mandic Statistical Signal Processing & Inference 46
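A minimal sketch (Python/NumPy) of the three-step procedure above (periodogram peak for f̂0, then Â and Φ̂); the signal parameters, the frequency grid and the seed are assumed for illustration:

# Minimal sketch: periodogram-based MLE of frequency, then amplitude and phase.
import numpy as np

rng = np.random.default_rng(7)
A, f0, Phi, sigma2, N = 1.0, 0.12, 0.6, 0.1, 256
n = np.arange(N)
x = A * np.cos(2 * np.pi * f0 * n + Phi) + np.sqrt(sigma2) * rng.standard_normal(N)

freqs = np.linspace(0.001, 0.499, 4096)                 # 0 < f0 < 1/2
E = np.exp(-2j * np.pi * np.outer(freqs, n))            # e^{-j 2 pi f n} for each grid frequency
periodogram = np.abs(E @ x) ** 2 / N                    # (1/N) |sum x[n] e^{-j 2 pi f n}|^2
f0_hat = freqs[np.argmax(periodogram)]                  # frequency MLE

c = np.cos(2 * np.pi * f0_hat * n)
s = np.sin(2 * np.pi * f0_hat * n)
Phi_hat = -np.arctan2(np.sum(x * s), np.sum(x * c))     # phase estimate
A_hat = (2.0 / N) * np.abs(np.sum(x * np.exp(-2j * np.pi * f0_hat * n)))   # amplitude estimate

print(f0_hat, A_hat, Phi_hat)   # close to (0.12, 1.0, 0.6)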


MLE for transformed parameters (invariance property)
This invariance property of MLE is another big advantage of MLE

Following the above example, we can now state the invariance property
of MLE (also valid for the scalar case).
Theorem (invariance property of MLE): The MLE of a vector parameter
α = f (θ), where the pdf p(x; θ) is parametrised by θ, is given by
α̂ = f (θ̂)

where θ̂ is the MLE of θ.


Since MLE of θ̂ is obtained by maximising p(x; θ), if f is a one-to-one
function this is obvious, and the MLE of the transformed parameter is
found by substituting the MLE of the original parameter into the
transformation.
For example, let x[n] = A + w[n] with w ∼ N(0, σ²), and suppose we wish to find the MLE of α = exp(A).
◦ The resulting log–likelihood is still parametrised by A, and by using
ln α = A as a transform, the resulting MLE is obtained as
α̂ = exp(Â) (see also your P & A sets)
c D. P. Mandic Statistical Signal Processing & Inference 47
Optimality of MLE for a linear model
We can now summarise the observations so far in the form of the
optimality theorem for MLE.
Theorem: Assume that the observed data can be described by the general
linear model
x = Hθ + w

where H is a known N × p matrix with N > p and of rank p (tall matrix),


θ is a p × 1 parameter vector to be estimated, and w is a noise vector with
PDF N(0, C). Then, the MLE of θ takes the form
θ̂ = ( H^T C^{-1} H )^{-1} H^T C^{-1} x
In addition, θ̂ is also an efficient estimator in that it attains the CRLB. It is hence the MVU estimator, and the PDF of θ̂ is Gaussian and is given by
θ̂ ∼ N( θ, ( H^T C^{-1} H )^{-1} )
c D. P. Mandic Statistical Signal Processing & Inference 48
Example 12: MLE in Generative Artificial Intelligence
We often have a limited amount of samples of the dataset of interest, e.g.
we do not know the true distribution of all male and female face images.
◦ Generative models aim to
generate “new” data based
on the available samples of a
dataset of interest.
◦ Generated data should
approximate the “true
distribution” of unseen data,
pdata, as best as possible in
some statistical sense, e.g.
min distance(pdata, pmodel ).

R
We examine the likelihood of the model, given the dataset (≡ MLE).
have a similar distribution as the true data of interest.
c D. P. Mandic Statistical Signal Processing & Inference 49
Example 12b: Goal of generative models
We desire to learn a probab. distribution, pdata(x), over data, x, such that:
Generation: If pdata(x) is a distrib. of handwritten digit images, and we
sample xnew ∼ pmodel, then xnew should look like a digit (aka sampling)
Density estimation: The probability pdata(x) should be high if a training
sample, x, looks like a digit, and low otherwise (maximising likelihood)

Sufficient sample space for p_data: A full ground truth space of all 28 × 28–pixel BW images has 28² = 784 binary variables (BW pixels). This gives a total of 2^784 ≈ 10^236 possible BW images.
◦ Even a sample space of 10^9 training data would give an extremely scarce coverage of the ground truth.
◦ For more realistic 1000 × 1000 = 10^6–pixel frames, we would have 2^1,000,000 possible images.
(Figures: some 28 × 28-pixel images; L: Original, R: Generated)

So we wish to learn pmodel = pθ ≈ pdata, to construct the “best” fit to the


available distribution pdata from incomplete ground truth.
c D. P. Mandic Statistical Signal Processing & Inference 50
Example 12c: Density estimation as MLE
The aim of learning is for pmodel (x; θ) to become as close to pdata(x) as possible

Our context is density estimation # we desire to capture the data


distribution pdata(x), so as to enable either unconditional or conditional
generation of new data from ≈ same distribution pmodel(x; θ).
◦ MLE aims to pick a “good” model which incorporates domain knowledge
(structure of the data), that is, a model with a good inductive bias.
◦ To measure “closeness” between the training data distribution pdata(x)
and model distribution pmodel(x; θ) we use the KL divergence (Appendix 11)
D_KL(p_data || p_model) = E_{x∼p_data}[ log( p_data(x) / p_model(x; θ) ) ] = E_{x∼p_data}[ log p_data(x) − log p_model(x; θ) ]
Here, E_{x∼p_data} is the expectation over all possible training data, which is a weighted average of all possible outcomes, with p_data(·) as “weights”, i.e.
D_KL(p_data || p_model) = Σ_x p_data(x) [ log p_data(x) − log p_model(x; θ) ]     (maximising the log p_model term = minimising D_KL)
arg min_{p_θ} D_KL(p_data || p_model)  ≡  Maximum Likelihood Estimation:  arg max_{p_θ} log p_model(x; θ)

c D. P. Mandic Statistical Signal Processing & Inference 51


Example 12d: Big picture of learning data distributions
Most important general cases:
◦ p(y|x; θ) # classification (discriminative model), e.g. logistic regression
◦ p(y|x; θ) # regression, prediction
◦ p(x; θ) # generative model (e.g. VAE, GAN)
◦ p(x|y; θ) # conditional generative model
(Diagram: a Machine Learning Model splits into a Discriminative Model p(y|x) # a Regression Model y = f(x) + ε for continuous y, or a Classification Model for discrete y # and a Generative Model p(x; θ), with the Conditional Generative Model p(x|y; θ); a generative model p_model(x; θ) supports density estimation and sampling x_new ∼ p_model(x; θ) from a latent space.)
The difference between classification and prediction is that in classification y takes discrete values (typically 0 or 1), while in prediction y is continuous.
Generative models learn a joint distribution over the entire dataset. They are typically used for sampling applications or density estimation.

c D. P. Mandic Statistical Signal Processing & Inference 52



Summary: Maximum Likelihood Estimation (MLE)
MLE: 1) Assume a model, also known as a data generating process, for your
observed data
2) For the assumed model, produce the likelihood funct. L(θ; x) = p(x; θ)

3) Now, MLE becomes an optimisation problem


Differences between p(x|θ) and p(x; θ): the semicolon ‘;’ in p(x; θ) indicates that θ are not random variables (as the conditioning in p(x|θ) would imply), but unknown parameters which “parametrise” the pdf.


Differences between the likelihood function, L(θ; x), and the
probability density function, p(x; θ) are nuanced but important:
◦ A PDF gives the probability of observing your data, given the underlying
parameters of the distribution, i.e. it maps samples to their probabilities.
◦ The likelihood function expresses the likelihood of parameter values,
given your observed data. It assumes that the parameters are unknown.
MLE is grounded in probability theory; it provides a rigorous theoretical
framework and underpins many probabilistic models in machine learning,
such as generative AI, logistic regression, and Gaussian mixture models.

c D. P. Mandic Statistical Signal Processing & Inference 53


Summary: BLUE vs MLE
NB: The optimal MVU est. and CRLB may not exist or are impossible to find

Best Linear Unbiased Estimator
◦ It operates even when the pdf of the data is unknown
◦ Restricts the estimates to be linear in the data (e.g. DC level in noise)
◦ Produces unbiased estimates
◦ Minimises the variance of such unbiased estimates
◦ Requires knowledge of only the mean and variance of the data, and not of the full pdf
◦ BLUE may be used more generally if the data model is linearised in an adequate way, for example, through a T-F representation
Maximum Likelihood Estimator
◦ Basic idea: In the likelihood function, θ is regarded as a variable and not as a parameter!
◦ Can always be applied once the PDF is assumed; no restriction on the data model (cf. BLUE)
◦ Can be computationally complex
Properties of MLE, as N → ∞:
◦ Efficient # attains the CRLB
◦ Consistent: unbiased & var → 0, i.e. converges to θ in probability
◦ Optimal for the General Linear Model, invariant to any transformation of θ, asymptotic normality of θ̂_MLE
c D. P. Mandic Statistical Signal Processing & Inference 54
Appendix 1: Intuition about a sufficient statistic
Denote by x the video recording of your SSPI Lecture 4 (a dataset, x),
and by y the notes you have taken about Lecture 4 (a statistic, T (x)).
The information needed to answer Assignment #3 in your Coursework is
the unknown parameter θ.
◦ Now, y depends entirely on x # video contains all info in your notes.
◦ If you took sufficiently good notes, T (x) will give you same information
about θ as x does # conditional distrib. of Lecture 4, given your notes,
p(x|y; θ), is independent of θ (the information related to Assignment #3).
◦ Here, conditional distribution simply means the probability distribution of the information in your notes, given the lecture # if the information is in the lecture, it is also in your notes.
Once you have checked your notes, going back and listening to the lecture will not help you solve Assignment #3 (no additional information).


A consideration of both the data set x and the statistic T (x) does not give
any more information about the distribution of θ, than what is available
based only on the statistic T (x) # so, we can keep T (x) and “throw
away” x without losing any information. (see Slide 7)
c D. P. Mandic Statistical Signal Processing & Inference 55
Appendix 2: Sufficient statistic for the uniform distribution
Consider the random data x = [x_0, x_1, . . . , x_{N−1}]^T, which are uniformly distributed on the interval [0, θ], with the parameter θ unknown.
To find a sufficient statistic T(x), we employ the i.i.d. assumption to yield
p( x_0, x_1, . . . , x_{N−1}; θ ) = p(x; θ) = θ^{−N} 1( x_n ≤ θ, n = 0, 1, . . . , N − 1 ) = θ^{−N} 1(E)
where the indicator function is  1(E) = 1 if the event E holds, and 0 if the event E does not hold.
The data satisfy x_n ≤ θ, n = 0, . . . , N − 1 iff max{x_0, . . . , x_{N−1}} ≤ θ, so that
p(x; θ) = θ^{−N} 1( max{x_0, x_1, . . . , x_{N−1}} ≤ θ ) × 1,   with g(T(x), θ) the first factor and h(x) = 1.
By the Neyman-Fisher factorisation theorem, the sufficient statistic is
T(x) = max{ x[0], x[1], . . . , x[N − 1] } = max{x}
The sample mean, x̄, is not a sufficient statistic for a uniform random var.,
as 1(max{x} ≤ θ) cannot be expressed as a function of just x̄ and θ.
c D. P. Mandic Statistical Signal Processing & Inference 56
Appendix 3: Sufficient statistics for the estimation of the
phase of a sinusoid

Problem: Estimate the phase of a sinusoid


x[n] = A cos(2πf0n + Φ) + w[n] w ∼ N (0, σ 2)
Parametrised pdf:  p(x; Φ) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} ( x[n] − A cos(2πf0 n + Φ) )² }
The exponent may be expanded as
Σ_{n=0}^{N-1} x²[n] − 2A Σ_{n=0}^{N-1} x[n] cos(2πf0 n + Φ) + A² Σ_{n=0}^{N-1} cos²(2πf0 n + Φ)
= Σ_{n=0}^{N-1} x²[n] − 2A ( Σ_{n=0}^{N-1} x[n] cos 2πf0 n ) cos Φ + 2A ( Σ_{n=0}^{N-1} x[n] sin 2πf0 n ) sin Φ + A² Σ_{n=0}^{N-1} cos²(2πf0 n + Φ)

This pdf is not factorable as required by the Neyman-Fisher theorem.


Hence, no single sufficient statistic exists. However, it can still be
factorised as

c D. P. Mandic Statistical Signal Processing & Inference 57


Appendix 3: Sufficient statistics for the estimation of the
phase of a sinusoid
p(x; Φ) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) [ A² Σ_{n=0}^{N-1} cos²(2πf0 n + Φ) − 2A T1(x) cos Φ + 2A T2(x) sin Φ ] }   ← g(T1(x), T2(x), Φ)
           × exp{ −(1/(2σ²)) Σ_{n=0}^{N-1} x²[n] }   ← h(x)
where
T1(x) = Σ_{n=0}^{N-1} x[n] cos 2πf0 n     T2(x) = Σ_{n=0}^{N-1} x[n] sin 2πf0 n

T1(x) and T2(x) are jointly sufficient statistics for the estimation of Φ.
However, no single sufficient statistic exists (we really desire a single
sufficient statistic).
c D. P. Mandic Statistical Signal Processing & Inference 58
Appendix 4: Motivation and Pro’s and Con’s of BLUE
Motivation for BLUE: Except for the Linear Model (Lecture 4), the
optimal MVU estimator might:
◦ Not even exist,

◦ Be difficult or even impossible to find.
BLUE is one such sub–optimal estimator.
Idea behind BLUE:
◦ Restrict the estimator to be linear in data x,
◦ Restrict the estimate to be unbiased,
◦ Find the best among such unbiased estimates, that is, the one with
the minimum variance.
Advantages of BLUE: It needs only the 1st and 2nd statistical moments
(mean and variance).
Disadvantages of BLUE: 1) In general it is sub–optimal, and 2) It may
be totally inappropriate for some problems (see the next slide).

c D. P. Mandic Statistical Signal Processing & Inference 59


Appendix 5: More on heavy tailed distributions
◦ The α-stable distribution generalises the normal distribution.
◦ It was proposed as a distribution for asset returns and commodity prices
by Mandelbrot in the early 1960s.

Illustration of α-stable distributions based on synthetic data


c D. P. Mandic Statistical Signal Processing & Inference 60
Appendix 6: Some observations about BLUE
◦ BLUE is applicable to amplitude estimation of known signals in noise,
where to satisfy the unbiased constraint, E{x[n]} must be linear in the
unknown parameter θ, or in other words, E{x[n]} = s[n] θ.
◦ Counter-example: If E{x[n]} = cos θ, which is not linear in θ, then
from the unbiased assumption we have Σ_{n=0}^{N-1} a_n cos θ = θ. Clearly,
there are no {an} that satisfy this condition.
◦ For the vector parameter BLUE, the unbiased constraint generalises
from the scalar case as
E{x[n]} = s[n]θ → aT s = 1 ⇒ E{x} = Hθ → AH = I

Since the unbiased constraint yields:


E{θ̂_i} = Σ_{n=0}^{N-1} a_in E{x[n]} = θ_i   ⇒   E{θ̂} = A E{x} = θ
this is equivalent to a_i^T h_j = δ_ij   (= 0 for i ≠ j, = 1 for i = j)

c D. P. Mandic Statistical Signal Processing & Inference 61


Appendix 7: Some BLUE–like “estimates”
Composite faces # people face averages
Can we estimate a “typical looking” person from a certain region, by
taking a statistical average of a large ensemble of random faces
photographed on the street?
Does the so-generated estimated average face exist in real life?

Participants in Sydney, Australia, ranging from 0.83–93 years

c D. P. Mandic Statistical Signal Processing & Inference 62


Appendix 7: Some BLUE–like “estimates”, contd.

Composite faces of Sydney Composite faces of Hong Kong

Composite faces of London Composite faces of Argentina


c D. P. Mandic Statistical Signal Processing & Inference 63
Appendix 8: Some notions from probabilistic modelling
The data, x, are connected to all possible models, θ, by a probability
P (x; θ) or a probability density function p(x; θ). In other words, a pdf
gives the probabilities of occurrence of different possible values.
Sample space: The sample space of a random variable represents all
values that the random variable can take. For example, for the coin tossing
experiment the sample space is: Heads and Tails.
Parametric modelling: Parametric models represent a set of density
functions with one or more parameters. For different values of the
parameters there will be different density functions. All of these density
functions are referred to as parametric models.
Probability density function: Given a sample space, the PDF maps the
random samples to their probabilities.
Likelihood function: It represents the probability of observing the values
in the sample space, if the true generating distribution was the model
which uses the particular density function parameterised by θ.
Maximum Likelihood: MLE aims to find the parameter values of a model
which makes the observed data most probable
c D. P. Mandic Statistical Signal Processing & Inference 64
Appendix 9: Monte Carlo (MC) simulations
Use computer simulations to evaluate performance of any estimation method
The MC simulations are illustrated here for a determin. sig. s[n, θ] in AWGN
1. Data collection
◦ Select a true parameter value, θ_true (usually performed over a range of values of θ)
◦ Generate a signal having θtrue as a parameter
◦ Generate WGN with unit variance and form the measurement x = s + w
◦ Choose σ to obtain the desired SNR value and perform one MC
simulation for one SNR value (usually you run many simulations over a
range of SNR values)
2. Statistical evaluation
◦ Compute bias,  B = (1/M) Σ_{m=1}^{M} ( θ̂_m − θ_true )
◦ Compute RMS error,  RMS = sqrt( (1/M) Σ_{m=1}^{M} ( θ̂_m − θ_true )² )
◦ Compute error variance,  var = (1/M) Σ_{m=1}^{M} ( θ̂_m − (1/M) Σ_{m=1}^{M} θ̂_m )²
◦ Plot histogram or scatter plot (if needed)


3. Explore (via plots)
How bias, RMS, variance vary with the value of θ, SNR, number of data
points, N , etc. Q: Is bias =0, is RMS = CRLB1/2, etc.
c D. P. Mandic Statistical Signal Processing & Inference 65
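A minimal sketch (Python/NumPy) of the Monte Carlo procedure above, applied to the sample-mean estimator of a DC level in WGN; θ_true, σ, N, M and the seed are assumed illustrative values:

# Minimal sketch: MC evaluation of bias, RMS and variance of an estimator.
import numpy as np

rng = np.random.default_rng(8)
theta_true, sigma, N, M = 1.0, 1.0, 50, 5000

theta_hat = np.empty(M)
for m in range(M):                                    # M independent MC runs
    x = theta_true + sigma * rng.standard_normal(N)   # x = s + w, with s = theta_true
    theta_hat[m] = x.mean()                           # estimator under test

bias = np.mean(theta_hat - theta_true)
rms = np.sqrt(np.mean((theta_hat - theta_true) ** 2))
var = np.mean((theta_hat - theta_hat.mean()) ** 2)

print(bias, rms, var, sigma**2 / N)   # bias ~ 0, var ~ CRLB = sigma^2/N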
Appendix 10: Generative AI for speech # a diffusion
model
Consider an 80-band mel-spectrogram of a sample of speech of a female
speaker saying: “in being comparatively modern”.

Original speech Generated speech


The speech samples are generated with a score-based model, of which the
training stage is maximising the data likelihood pmodel(x)
Observe that the generated sample is very close to the original one
◦ Diffusion models are built upon Stochastic Differential Equation (SDE)
functions
◦ They are solving an SDE process, which is usually either a variance
preserving or exploding SDE function
◦ The performance of each SDE is different for a different task
c D. P. Mandic Statistical Signal Processing & Inference 66
Appendix 11: Minimising KL–divergence and
cross–entropy is equivalent to maximising the likelihood
The KL-divergence is a common loss function for training neural network
(NN) based generative models, such as VAE, GAN, and diffusion models.
D_KL(p_θ || q_θ̂) = E_{p_θ}[ log( p_θ(x) / q_θ̂(x) ) ]
where pθ is the probability distribution of the true labels and qθ̂ is the
probability distribution of e.g. deep neural network (DNN) predictors. In
other words, p = plabels and q = qmodel.
◦ KL divergence measures the dissimilarity between pθ and qθ̂ .
◦ To bring qθ̂ closer to the true pθ , we minimize KL-divergence w.r.t. θ̂.
Goal: Minimise  D_KL(p_θ || q_θ̂) = arg min_θ̂ Σ_x p_θ(x_i) log( p_θ(x_i) / q_θ̂(x_i) )
= arg min_θ̂ Σ_x p_θ(x_i) [ log p_θ(x_i) − log q_θ̂(x_i) ] = arg min_θ̂ ( − Σ_x p_θ(x_i) log q_θ̂(x_i) )

c D. P. Mandic Statistical Signal Processing & Inference 67


Appendix 11, contd.: Minimising KL–divergence and
cross–entropy is equivalent to maximising the likelihood
Cross-entropy, H(pθ , qθ̂ ), is another important objective function for
training NNs, especially for classification purposes, and is given by
H(p_θ, q_θ̂) = −E_{p_θ}[ log q_θ̂(x) ] = H(p_θ) + D_KL(p_θ || q_θ̂)
Goal: minimise  H(p_θ, q_θ̂) = arg min_θ̂ ( − Σ_x p_θ(x_i) log q_θ̂(x_i) )
◦ In classification problems, the true distribution, p_θ(y|x_i), is one-hot encoded as
p_θ(y|x_i) = 1 if y = y_i, and 0 otherwise.
The goal becomes arg min_θ̂ ( − Σ_x log q_θ̂(y_i|x_i) ), which is precisely the objective of MLE (min of the negative log-likelihood = max likelihood).
(Figure: p_θ(x) = N(1, 1), q_θ̂(x) = N(3, 1), and the region illustrating D_KL(p_θ || q_θ̂).)
c D. P. Mandic Statistical Signal Processing & Inference 68
Appendix 12: Constrained optimisation using Lagrange
multipliers
Consider a two-dimensional problem:
maximize  f(x, y)   (function to max/min)
subject to  g(x, y) = c   (constraint)
We look for point(s) where the curves f & g touch (but do not cross).
In those points, the tangent lines for f and g are parallel ⇒ so too are the gradients, ∇_{x,y} f ∥ λ∇_{x,y} g, where λ is a scaling constant.
Although the two gradient vectors are parallel they can have different magnitudes!
Therefore, we are looking for max or min points (x, y) of f(x, y) for which
∇_{x,y} f(x, y) = −λ ∇_{x,y} g(x, y),   where ∇_{x,y} f = ( ∂f/∂x, ∂f/∂y ) and ∇_{x,y} g = ( ∂g/∂x, ∂g/∂y )
We can now combine these conditions into one equation as:
F(x, y, λ) = f(x, y) − λ( g(x, y) − c )   and solve   ∇_{x,y,λ} F(x, y, λ) = 0
Obviously, ∇λF (x, y, λ) = 0 ⇔ g(x, y) = c

c D. P. Mandic Statistical Signal Processing & Inference 69


App. 12: Method of Lagrange multipliers in a nutshell
max/min of a function f (x, y, z) where x, y, z are coupled
Since x, y, z are not independent there exists a constraint g(x, y, z) = c
Solution: Form a new function
F(x, y, z, λ) = f(x, y, z) − λ( g(x, y, z) − c )   and calculate F′_x, F′_y, F′_z, F′_λ
Set F′_x, F′_y, F′_z, F′_λ = 0 and solve for the unknown x, y, z, λ.
Example 13: Economics
Two factories, A and B, make TVs, at a cost
f(x, y) = 6x² + 12y²,   where x = #TV in A and y = #TV in B
Task: Minimise the cost of producing 90 TVs, by finding the optimal numbers of TVs, x and y, produced respectively at factories A and B.
Solution: The constraint g(x, y) is given by x + y = 90, so that
F(x, y, λ) = 6x² + 12y² − λ(x + y − 90)
Then: F′_x = 12x − λ, F′_y = 24y − λ, F′_λ = −(x + y − 90), and we need to set ∇F = 0 in order to find the min/max.
Upon setting [F′_x, F′_y, F′_λ] = 0 we find x = 60, y = 30, λ = 720

c D. P. Mandic Statistical Signal Processing & Inference 70
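A minimal check (Python/SymPy) of Example 13; this simply re-solves the stationarity conditions of the Lagrangian stated above, with the symbol names assumed:

# Minimal sketch: solve grad F = 0 for the Lagrangian of Example 13.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
F = 6*x**2 + 12*y**2 - lam*(x + y - 90)            # Lagrangian F(x, y, lambda)
stationary = sp.solve([sp.diff(F, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(stationary)                                  # [{x: 60, y: 30, lam: 720}]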


Notes:

c D. P. Mandic Statistical Signal Processing & Inference 71


Notes:

c D. P. Mandic Statistical Signal Processing & Inference 72
