SSPI Lecture 5: BLUE and MLE (2025)
◦ To achieve this, we need the concept of sufficient statistics and the Rao–Blackwell–Lehmann–Scheffé theorem
The BLUE assumptions are also called the Gauss–Markov assumptions
◦ It is possible in many cases to determine an approximate MVU estimator
(MVUE) by inspection of the PDF, using Maximum Likelihood Estimation
Motivation for Maximum Likelihood Estimation (MLE)
What do you think these applications have in common?
[Figure: an HDD microcontroller and a gesture (Swype) keyboard.]
P(S) = p(I|begin) p(would|I) p(like|would) p(to|like) p(fly|to) p(London|fly) · · ·
x̄ = (1/N) Σ_{n=0}^{N−1} x[n]   σ̂² = (1/N) Σ_{n=0}^{N−1} (x[n] − µ)²   max{x[0], . . . , x[N−1]}
These estimators are a function of the random observations, x, and not of the
unknown parameter, θ # each of these estimators is a valid statistic.
◦ Observe that the above estimators “compress” the available information,
e.g. the sample mean takes N datapoints in x and produces one sample, x̄.
◦ In the best case, such compression is “loss-less”, as the statistic contains the same information about the unknown parameter as the original data set. For example, for the DC level A in WGN:
Â = (1/N) Σ_{n=0}^{N−1} x[n],  var(Â) = σ²/N   and   Ã = x[0],  var(Ã) = σ²
Although à is unbiased, its variance is much larger than that of Â. This is
is due to discarding samples x[1],...,x[N-1] that carry information about A.
Consider now the following datasets:
S1 = {x[0], x[1], . . . , x[N−1]}   S2 = {x[0] + x[1], x[2], . . . , x[N−1]}   S3 = { Σ_{n=0}^{N−1} x[n] }
The original dataset, S1, is always sufficient for finding Â, while S2 and
S3 are also sufficient. In addition, S3 is the minimal sufficient statistic!
In a nutshell, sufficient statistics answer the questions:
Q1: Can we find a transformation T (x) of lower dimension that contains
all information about θ? (the data, x ∈ R^{N×1}, can be very long)
Q2: What is the lowest possible dimension of T (x) which still contains all
information about θ? # (minimal sufficient statistic)
For example, for DC level in WGN, T(x) = Σ_{n=0}^{N−1} x[n] (one-dimensional)
Sufficient statistics: Putting it all together
In layman’s terms, sufficiency says that T(x) alone is informative enough.
Aim: A statistic T(x) is sufficient if it allows us to estimate the unknown parameter θ
as well as when the estimate is based on the entire data set x. So, we no longer need to
consider the data, x, after using it to calculate T(x); the data then become redundant.
Def: A statistic T(x) = f(x[0], x[1], . . . , x[N−1]) is a sufficient statistic if, for
any value of T0, the conditional distribution of x[0], x[1], . . . , x[N−1]
given T = T0, that is, p(x = x0 | T(x) = T0; θ), does not depend on the
unknown parameter θ. In other words, after observing the statistic T0, the
data x will not give us any new information about θ. (see Appendix 1)
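A minimal simulation sketch of this definition, using a Bernoulli toy example that is not from the slides (N, T0 and the θ values below are assumptions): conditioned on T(x) = Σ x[n] = T0, the empirical distribution of the observed sequences is the same for any θ, so T is sufficient.

import numpy as np
from collections import Counter

def conditional_sequence_freqs(theta, N=4, T0=2, trials=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = (rng.random((trials, N)) < theta).astype(int)   # iid Bernoulli(theta) samples
    kept = x[x.sum(axis=1) == T0]                        # condition on T(x) = T0
    return Counter(map(tuple, kept.tolist()))            # empirical p(x | T = T0; theta)

# The conditional frequencies are (approximately) uniform over the C(4,2) = 6
# sequences containing two ones, for *any* theta: they carry no information about it.
for theta in (0.3, 0.7):
    c = conditional_sequence_freqs(theta)
    total = sum(c.values())
    print(theta, {k: round(v / total, 3) for k, v in sorted(c.items())})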
[Figure: two panels of p(x = x0 | Σ_{n=0}^{N−1} x[n] = T0; A). Left: p(x|T0; A) depends on A, thus T0 is not sufficient (information about A is still present in the observations after T(x) is observed). Right: p(x|T0; A) does not depend on A, thus T0 is sufficient (no information about A remains in the observations after T(x) is observed).]
◦ Knowledge of T0 changes the PDF to the conditional one, p(x | Σ_{n=0}^{N−1} x[n] = T0; A).
◦ If a statistic is sufficient for estimating A, this conditional PDF should not depend on A (as in the right hand panel).
Therefore, the sufficient statistic is T(x) = Σ_{n=0}^{N−1} x[n] (minimal & linear)
Any one-to-one mapping of T(x) is also a sufficient statistic, e.g. T1(x) = (1/N) Σ_{n=0}^{N−1} x[n]
and T2(x) = (x̄)³ are sufficient, but T3(x) = (x̄)² is not, as x̄ = ±√(T3(x)).
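A minimal sketch of the point above (the DC level and data length are assumptions for illustration): the sign of x̄ survives the invertible mapping T2(x) = (x̄)³ but is lost in T3(x) = (x̄)².

import numpy as np

rng = np.random.default_rng(0)
A, N = -1.0, 50                          # a negative DC level (assumed)
x = A + rng.normal(size=N)               # DC level in WGN

x_bar = np.mean(x)
T2 = x_bar**3
T3 = x_bar**2
print(x_bar, np.cbrt(T2))                # identical: x_bar is recoverable from T2
print(np.sqrt(T3), -np.sqrt(T3))         # from T3 alone we only know x_bar = +/- sqrt(T3)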
◦ First, restrict the estimator to be linear in the data and unbiased.
◦ Then, minimise the variance of this unbiased estimator.
Such an estimator is termed the Best Linear Unbiased Estimator (BLUE),
which requires only knowledge of the first two moments of the PDF.
We will see that if the data are Gaussian, the BLUE and MVUE are equivalent.
The BLUE is restricted to have the form (a = {a_n})
θ̂ = Σ_{n=0}^{N−1} a_n x[n] = a^T x
where the a_n are constants to be determined.
◦ We choose a = {a_n} which yields an unbiased estimate, E{θ̂} = θ.
◦ Then, we perform min(var) # the BLUE estimator is that which is unbiased and has the minimum variance.
Note: the BLUE will be optimal only when the actual MVU estimator is linear!
This is the case, for instance, when estimating the DC level in WGN,
θ̂ = x̄ = (1/N) Σ_{n=0}^{N−1} x[n],   a_n = 1/N,
which is clearly linear in the data. Then, the BLUE is an optimal MVU estimator, giving a_n = 1/N, that is, the sample mean x̄. (see Problem 4.1 in your P&A sets)
The difference in performance between the BLUE and MVU estimators
can, in general, be substantial, and can only be rigorously quantified
through the underlying data generating pdf.
Example 1: How useful is an estimator of DC level in
noise?
In fact, very useful. It is up to us to provide a correct data representation.
[Figure: application examples: a sinusoid x(t) and its spectrum X(f) (sinusoidal frequency estimation); a real-world speech time-frequency representation (spectrogram); rabbit population growth over time; and a two-class (Class 1 vs Class 2) scatter plot. A text fragment on this slide reads: "... estimator to be linear, e.g. by σ̂² = (1/N) Σ_{n=0}^{N−1} a_n x[n]".]
BLUE optimisation task
Unbiased constraint: E{θ̂} = a^T E{x} = a^T s θ = θ ⇒ a^T s = 1,
where the scaled data vector is s = [s[0], s[1], . . . , s[N−1]]^T.
In other words, to satisfy the unbiased constraint for the estimate θ̂, E{x[n]} must be linear in θ; we therefore assume E{x[n]} = s[n]θ.
Minimise: var(θ̂) = E{ a^T (x − E{x}) (x − E{x})^T a } = a^T C a   (like var(aX) = a² var(X)),
subject to the unbiased constraint a^T s = 1.
By viewing w[n] = x[n] − E{x[n]} (easy to show from x[n] = E{x[n]} + (x[n] − E{x[n]})), we have x[n] = θ s[n] + w[n].
θ̂ = a_opt^T x = (s^T C^{-1} x)/(s^T C^{-1} s),   where a_opt = C^{-1} s / (s^T C^{-1} s)
BLUE variance:
var(θ̂) = a_opt^T C a_opt = 1/(s^T C^{-1} s)
To determine the BLUE we only require knowledge of
s # the scaled mean
C # the covariance matrix (C−1 is called the “precision
matrix”, see also Slide 42 in Lecture 4)
That is, for BLUE we only need to know the first two moments of the PDF
Notice that we do not need to know the functional form of the PDF.
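A minimal numerical sketch of the BLUE formulas above (the noise variances and the DC level used below are assumptions for illustration):

import numpy as np

def blue(x, s, C):
    Cinv_s = np.linalg.solve(C, s)          # C^{-1} s, without forming the inverse
    denom = s @ Cinv_s                      # s^T C^{-1} s
    a_opt = Cinv_s / denom                  # a_opt = C^{-1} s / (s^T C^{-1} s)
    return a_opt @ x, 1.0 / denom           # estimate and its variance

# Usage: DC level in uncorrelated noise with unequal variances (assumed values)
rng = np.random.default_rng(0)
N, A = 100, 2.0
sigma2 = rng.uniform(0.5, 2.0, size=N)      # per-sample noise variances
x = A + rng.normal(scale=np.sqrt(sigma2))
s = np.ones(N)                              # E{x[n]} = s[n]*theta with s[n] = 1
A_hat, var_A_hat = blue(x, s, np.diag(sigma2))
print(A_hat, var_A_hat)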
Example (DC level in white noise): E{x[n]} = θ for every n, so that
s = 1 = [1, 1, . . . , 1]^T = 1_{N×1}   (N elements)
Follows from E{x[n]} being linear in θ ⇒ E{x[n]} = s[n]θ.
The BLUE for the estimation of DC level in noise then becomes (see Slide 20)
Â = (1^T (1/σ²) I x) / (1^T (1/σ²) I 1) = (1/N) Σ_{n=0}^{N−1} x[n] = x̄
and has minimum variance (CRLB for a linear estimator) of
var(Â) = 1 / (1^T (1/σ²) I 1) = σ²/N
◦ The sample mean is the BLUE, independent of the PDF of the data (here C^{-1} = (1/σ²) I, with I the N×N identity).
For a vector parameter θ ∈ R^{p×1}, the linear estimator is θ̂ = A x, where the coefficients A = [a_in]_{p×N} and H is a vector/matrix form of the {s[n]} terms, so that
x = Hθ + w
and the covariance of the resulting estimator is
C_θ̂ = (H^T C^{-1} H)^{-1}
Regularity condition within the CRLB: ∂ ln p(x; θ)/∂θ = I(θ) (g(x) − θ)
In our case (see Example 8, Slide 36):
∂ ln p(x; Φ)/∂Φ = −(A/σ²) Σ_{n=0}^{N−1} [ x[n] sin(2πf0n + Φ) − (A/2) sin(4πf0n + 2Φ) ]
We cannot arrive at the above regularity condition, and an efficient
estimator for sinusoidal phase estimation does not exist.
Remedy: Using the MLE, we can still approximately attain the CRLB for frequencies far from 0 and 1/2.
Approximate CRLB: var(Φ̂) ≥ 1/(N × SNR)   (see Example 8)
Recall that p_X(x; µ, σ²) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} (parametrised by µ and σ²), for example
p_blue(x; 1, 1) = (1/(1·√(2π))) e^{−(x−1)²/(2·1)}   and   p_red(x; 4, 9) = (1/(3·√(2π))) e^{−(x−4)²/(2·9)}
[Figure: the two Gaussian PDFs, N(1, 1) and N(4, 9), plotted against x; and S&P 500 simple returns, r_t (%), and log-returns, with their location and spread indicated.]
l(θ; x) = log L(θ; x) = log L(θ1, . . . , θp; x[0], . . . , x[N−1]) = Σ_{n=0}^{N−1} log p_data(x[n]; θ)
Conditional MLE: Supervised Machine Learning employs conditioning
between data labels y and input data x, with p_model dictated by a chosen
architecture. Such a conditional maximum likelihood function is L(θ; y|x), and
θ̂_mle = arg max_θ Σ_{n=0}^{N−1} log p_model(y_n|x_n; θ)   (Lecture 7, Appendix 9, P&A sets)
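A minimal sketch of conditional MLE (the logistic model and the synthetic labels/inputs below are illustrative assumptions, not from the slides): maximise Σ_n log p_model(y_n|x_n; θ) by minimising the negative log-likelihood.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)
w_true, b_true = 2.0, -0.5                        # assumed "true" parameters
y = (rng.random(200) < 1 / (1 + np.exp(-(w_true * x + b_true)))).astype(float)

def neg_log_likelihood(theta):
    w, b = theta
    p1 = np.clip(1 / (1 + np.exp(-(w * x + b))), 1e-12, 1 - 1e-12)  # p_model(y=1|x; theta)
    return -np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print(theta_mle)                                  # close to [2.0, -0.5]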
For the DC level in WGN, the MLE is
Â = (1/N) Σ_{n=0}^{N−1} x[n]
Clearly this is an MVU estimator which yields the CRLB (efficient)
Maximum Likelihood Estimation: Observations so far
MVU vs MLE: MLE is a particular estimator which comes with a “recipe”
for its calculation. MVU property relates to the properties of any estimator
(unbiased, minimum variance). So, MLE could be an MVU estimator,
depending on the chosen model and the problem at hand.
◦ If an efficient estimator does exist (which satisfies the CRLB), the
maximum likelihood procedure will produce it (see Example 7)
◦ When an efficient estimator does not exist, the MLE has the desirable
property that it yields “an asymptotically efficient” estimator (Example 8)
If θ is the parameter to be estimated from a random observation x, then
the MLE θ̂mle = arg maxθ p(x; θ) is the value of θ that maximises p(x; θ)
◦ The function L(θ; data) = p(data; θ) does integrate to 1 when
integrated over data (property of PDF), but does not integrate to 1
when integrated over the parameters, θ (property of likelihood fn.)
◦ So, p(x; θ) is a probability over the data, x, and a likelihood function
(not probability) over the parameters, θ.
MLE is a “turn-the-crank” method which is optimal for large enough data.
It can be computationally complex and may require numerical methods.
Example 8: MLE sinusoidal phase estimator (cf. Ex. 6)
Recall the Neyman–Fisher factorisation: p(x; θ) = g(T(x), θ) h(x)
J(Φ) = Σ_{n=0}^{N−1} (x[n] − A cos(2πf0n + Φ))²
that is, the double-frequency sum (1/N) Σ_{n=0}^{N−1} sin(4πf0n + 2Φ) vanishes provided f0 is not near 0 or 1/2, and for a large enough N.
Thus, the LHS of (SP1) when divided by N and set equal to zero will yield
an approximation (see Appendix 2)
∂J(Φ)/∂Φ = 0   ⇒   Σ_{n=0}^{N−1} x[n] sin(2πf0n + Φ̂) ≈ 0
Upon expanding sin(2πf0n + Φ̂), we have (sin(a+b) = sin a cos b + cos a sin b)
Σ_{n=0}^{N−1} x[n] sin(2πf0n) cos Φ̂ = − Σ_{n=0}^{N−1} x[n] cos(2πf0n) sin Φ̂
so that the ML estimator is
Φ̂ = − arctan( Σ_{n=0}^{N−1} x[n] sin(2πf0n) / Σ_{n=0}^{N−1} x[n] cos(2πf0n) )
The MLE Φ̂ is clearly a function of the sufficient statistics, which are
T1(x) = Σ_{n=0}^{N−1} x[n] cos(2πf0n)   and   T2(x) = Σ_{n=0}^{N−1} x[n] sin(2πf0n)
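A minimal sketch of the phase MLE above, with a quick Monte Carlo check against the asymptotic variance 1/(Nη) given on the next slide (all signal and noise values are assumptions for illustration):

import numpy as np

def phase_mle(x, f0):
    """MLE of the phase of A*cos(2*pi*f0*n + Phi) in WGN (f0 assumed known)."""
    n = np.arange(len(x))
    T1 = np.sum(x * np.cos(2 * np.pi * f0 * n))   # sufficient statistic T1(x)
    T2 = np.sum(x * np.sin(2 * np.pi * f0 * n))   # sufficient statistic T2(x)
    return -np.arctan2(T2, T1)                    # equals -arctan(T2/T1) when T1 > 0

rng = np.random.default_rng(0)
N, A, f0, Phi, sigma2 = 80, 1.0, 0.08, 0.785, 0.05
eta = (A**2 / 2) / sigma2                         # SNR
n = np.arange(N)
est = [phase_mle(A * np.cos(2 * np.pi * f0 * n + Phi)
                 + rng.normal(scale=np.sqrt(sigma2), size=N), f0)
       for _ in range(2000)]
print(np.mean(est), np.var(est), 1 / (N * eta))   # mean ~ Phi, variance ~ 1/(N*eta)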
Example 8: Sinusoidal phase # numerical results
The expected asymptotic PDF of the phase estimator: Φ̂_asy ∼ N(Φ, I^{-1}(Φ))
# so that the asymptotic variance is
var(Φ̂) = 1 / (N A²/(2σ²)) = 1/(ηN)
where η = P_signal/P_noise = (A²/2)/σ² (SNR) is the signal-to-noise ratio.
[Numerical results: theoretical asymptotic values for Φ = 0.785 and 1/η = 0.1.]
For shorter data records the ML estimator is considerably biased. Part of
this bias is due to the assumption (SP2) on Slide 37.
Example 8: MLE of sinusoidal phase # asymptotic mean
and variance (performance vs SNR for a fixed N )
For a fixed data length of N = 80, SNR was varied from -15 dB to +10 dB
◦ The asymptotic variance (or CRLB) then becomes
10 log10 var(Φ̂) = 10 log10 (1/(Nη)) = −10 log10 N − 10 log10 η
◦ Mean and variance are also functions of SNR
◦ Asymptotic mean is attained for SNRs > -10dB
[Figure: actual vs asymptotic mean (left panel) and actual vs asymptotic variance in dB (right panel) of the phase estimator, plotted against SNR (dB) from −15 dB to +10 dB.]
Observe that the minimum data length to attain CRLB also depends on SNR
Asymptotic properties of MLE
We can now formalise the asymptotic properties of θ̂_ML (see the previous slide).
Under the regularity conditions, the MLE is asymptotically distributed as θ̂_ML ∼ N(θ, I^{-1}(θ)), where θ is the true parameter.
The Maximum Likelihood Estimator is therefore asymptotically:
◦ unbiased
◦ efficient (that is, it achieves the CRLB)
Example: for x[n] ∼ N(µ, σ²) with both µ and σ² unknown, the MLE gives µ̂ = x̄ and σ̂² = (1/N) Σ_{n=0}^{N−1} (x[n] − x̄)², where x̄ = (1/N) Σ_{n=0}^{N−1} x[n].
Amazing: we only assumed the type of the PDF, but not the mean or variance!
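A minimal sketch of the Gaussian MLE above (the true mean and variance used to generate the data are assumptions):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=1000)    # N(mu=3, sigma^2=4), values assumed

mu_hat = np.mean(x)                              # MLE of the mean
sigma2_hat = np.mean((x - mu_hat)**2)            # MLE of the variance (1/N, not 1/(N-1))
print(mu_hat, sigma2_hat)                        # close to 3 and 4 for large N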
Example 10: Sinusoidal parameter estimation with three
unknown parameters # A, f0, and Φ (Example 7 in Lecture 4)
Now, θ = [A, f0, Φ]^T, and
p(x; θ) = (1/(2πσ²)^{N/2}) exp{ −(1/(2σ²)) Σ_{n=0}^{N−1} (x[n] − A cos(2πf0n + Φ))² }
where we need the sum in the exponent in the form (x − Hθ)^T (x − Hθ).
For A > 0 and 0 < f0 < 1/2, the MLE of θ = [A, f0, Φ]^T is found by minimising
J(A, f0, Φ) = Σ_{n=0}^{N−1} (x[n] − A cos(2πf0n + Φ))²
            = Σ_{n=0}^{N−1} (x[n] − A cos Φ cos(2πf0n) + A sin Φ sin(2πf0n))²
Substituting α1 = A cos Φ and α2 = −A sin Φ, with the inverse mapping A = √(α1² + α2²) and Φ = tan^{-1}(−α2/α1),
yields the function J′(α1, α2, f0), which is quadratic in α = [α1, α2]^T
J′(α1, α2, f0) = (x − α1 c − α2 s)^T (x − α1 c − α2 s) = (x − Hα)^T (x − Hα)   (∗)
Minimising over α (with α̂ = (H^T H)^{-1} H^T x) yields
J′(α̂1, α̂2, f0) = (x − Hα̂)^T (x − Hα̂) = x^T ( I − H(H^T H)^{-1} H^T ) x
so to minimise J′ we maximise x^T H (H^T H)^{-1} H^T x.
Using the definition of H = [c  s], the MLE for the frequency, f̂0, is the value that
maximises the power spectrum estimate (see your P&A sets)
x^T H (H^T H)^{-1} H^T x = [c^T x   s^T x] [ c^T c  c^T s ; s^T c  s^T s ]^{-1} [ c^T x ; s^T x ] ∝ (1/N) | Σ_{n=0}^{N−1} x[n] e^{−j2πf0 n} |²   ← the periodogram
Use this expression to find fˆ0, and proceed to find α̂ (Example 9, Lect. 4)
α̂ = [α̂1 ; α̂2] ≈ (2/N) [c^T x ; s^T x],   Φ̂ = − arctan( Σ_{n=0}^{N−1} x[n] sin(2πf̂0 n) / Σ_{n=0}^{N−1} x[n] cos(2πf̂0 n) )
and Â = √(α̂1² + α̂2²) = (2/N) | Σ_{n=0}^{N−1} x[n] exp(−j2πf̂0 n) |
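A minimal sketch of the three-parameter MLE above: locate the periodogram peak over a frequency grid, then form Â and Φ̂ (the test-signal values and the grid are assumptions for illustration):

import numpy as np

def sinusoid_mle(x, f_grid):
    N = len(x)
    n = np.arange(N)
    # Periodogram over a grid of candidate frequencies
    P = np.array([np.abs(np.sum(x * np.exp(-2j * np.pi * f * n)))**2 / N
                  for f in f_grid])
    f0_hat = f_grid[np.argmax(P)]                 # frequency MLE: periodogram peak
    c = np.cos(2 * np.pi * f0_hat * n)
    s = np.sin(2 * np.pi * f0_hat * n)
    # A_hat = sqrt(alpha1^2 + alpha2^2) = (2/N)|sum x[n] exp(-j 2 pi f0_hat n)|
    A_hat = (2 / N) * np.abs(np.sum(x * np.exp(-2j * np.pi * f0_hat * n)))
    Phi_hat = -np.arctan2(np.sum(x * s), np.sum(x * c))
    return A_hat, f0_hat, Phi_hat

# Usage with an assumed test signal
rng = np.random.default_rng(0)
N, A, f0, Phi = 256, 1.5, 0.1, 0.6
n = np.arange(N)
x = A * np.cos(2 * np.pi * f0 * n + Phi) + rng.normal(scale=0.5, size=N)
print(sinusoid_mle(x, np.linspace(0.01, 0.49, 2000)))   # ~ (1.5, 0.1, 0.6)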
Following the above example, we can now state the invariance property
of MLE (also valid for the scalar case).
Theorem (invariance property of MLE): The MLE of a vector parameter
α = f (θ), where the pdf p(x; θ) is parametrised by θ, is given by
α̂ = f (θ̂)
For the linear model x = Hθ + w with noise covariance C (cf. Lecture 4):
θ̂ = (H^T C^{-1} H)^{-1} H^T C^{-1} x
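A minimal sketch of the linear-model estimator above (H, C and θ below are randomly generated assumptions):

import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 3
H = rng.normal(size=(N, p))                    # known observation matrix
theta = np.array([1.0, -2.0, 0.5])             # true parameter (assumed)
C = np.diag(rng.uniform(0.1, 1.0, size=N))     # noise covariance (diagonal here)
w = rng.multivariate_normal(np.zeros(N), C)
x = H @ theta + w

Cinv = np.linalg.inv(C)
theta_hat = np.linalg.solve(H.T @ Cinv @ H, H.T @ Cinv @ x)
cov_theta_hat = np.linalg.inv(H.T @ Cinv @ H)  # C_theta_hat = (H^T C^{-1} H)^{-1}
print(theta_hat)                               # close to [1, -2, 0.5]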
We examine the likelihood of the model, given the dataset (≡ MLE).
This boils down to maximising the likelihood that the generated data will
have a similar distribution as the true data of interest.
Example 12b: Goal of generative models
We desire to learn a probab. distribution, pdata(x), over data, x, such that:
Generation: If pdata(x) is a distrib. of handwritten digit images, and we
sample xnew ∼ pmodel, then xnew should look like a digit (aka sampling)
Density estimation: The probability pdata(x) should be high if a training
sample, x, looks like a digit, and low otherwise (maximising likelihood)
[Figure: L: Original, R: Generated digit images.]
arg min_{pθ} D_KL(p_data ∥ p_model) ≡ Maximum Likelihood Estimation: arg max_{pθ} log p_model(x; θ)
(the maximum of the log-likelihood corresponds to the minimum of D_KL)
[Diagram: a generative model, p_model(x; θ), maps the latent space to the data space. Estimation fits θ from data; Sampling then draws x_new ∼ p_model(x; θ).]
Think of the information in your notes, given the lecture # if the information is in
the lecture, it is also in your notes.
Once you have checked your notes, going back and listening to the lecture again gives you no new information.
For a uniform random variable on [0, θ], the PDF factorises as p(x; θ) = g(T(x), θ) h(x), with g(T(x), θ) = (1/θ^N) 1(max{x} ≤ θ), so that
T(x) = max{x[0], x[1], . . . , x[N−1]} = max{x}
The sample mean, x̄, is not a sufficient statistic for a uniform random var.,
as 1(max{x} ≤ θ) cannot be expressed as a function of just x̄ and θ.
Appendix 3: Sufficient statistics for the estimation of the
phase of a sinusoid
p(x; Φ) = g(T1(x), T2(x), Φ) × exp{ −(1/(2σ²)) Σ_{n=0}^{N−1} x²[n] }   (the second factor is h(x))
where
T1(x) = Σ_{n=0}^{N−1} x[n] cos(2πf0n)   and   T2(x) = Σ_{n=0}^{N−1} x[n] sin(2πf0n)
T1(x) and T2(x) are jointly sufficient statistics for the estimation of Φ.
However, no single sufficient statistic exists (we really desire a single
sufficient statistic).
Appendix 4: Motivation and Pro’s and Con’s of BLUE
Motivation for BLUE: Except for the Linear Model (Lecture 4), the
optimal MVU estimator might:
◦ Not even exist,
◦ Be difficult or even impossible to find.
BLUE is one such sub–optimal estimator.
Idea behind BLUE:
◦ Restrict the estimator to be linear in data x,
◦ Restrict the estimate to be unbiased,
◦ Find the best among such unbiased estimates, that is, the one with
the minimum variance.
Advantages of BLUE: It needs only the 1st and 2nd statistical moments
(mean and variance).
Disadvantages of BLUE: 1) In general it is sub–optimal, and 2) It may
be totally inappropriate for some problems (see the next slide).
arg min_θ̂ D_KL(p_θ ∥ q_θ̂) = arg min_θ̂ Σ_x p_θ(x_i) [ log p_θ(x_i) − log q_θ̂(x_i) ] = arg min_θ̂ − Σ_x p_θ(x_i) log q_θ̂(x_i)
For labelled data, take the empirical conditional distribution
p_θ(y|x_i) = 1 if y = y_i, and 0 otherwise.
The goal then becomes arg min_θ̂ − Σ_x log q_θ̂(y_i|x_i), which is precisely the
objective of MLE (minimum of the negative log-likelihood = maximum likelihood).
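A minimal sketch of the equivalence above, in the unconditional (categorical) case, which is an illustrative assumption rather than the slide's example: minimising the empirical negative log-likelihood (cross-entropy) recovers the empirical class frequencies, i.e. the MLE of a categorical model.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
K = 3
x = rng.choice(K, size=500, p=[0.2, 0.5, 0.3])    # samples from the "true" p

def neg_log_likelihood(z):
    q = np.exp(z) / np.sum(np.exp(z))             # softmax parametrisation of q
    return -np.sum(np.log(q[x]))                  # empirical negative log-likelihood

z_hat = minimize(neg_log_likelihood, np.zeros(K)).x
q_hat = np.exp(z_hat) / np.sum(np.exp(z_hat))
print(q_hat, np.bincount(x, minlength=K) / len(x))  # both ~ empirical frequencies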
Optimise f(x, y) subject to g(x, y) = c   ← constraint
We look for point(s) where a level curve of f and the constraint curve g = c touch (but do not cross).
At those points, the tangent lines of f and g are parallel ⇒ so too are the
gradients, ∇_{x,y} f ∥ λ ∇_{x,y} g, where λ is a scaling constant.
Although the two gradient vectors are parallel, they can have different magnitudes!
Therefore, we are looking for max or min points (x, y) of f(x, y) for which
∇_{x,y} f(x, y) = −λ ∇_{x,y} g(x, y),  where ∇_{x,y} f = (∂f/∂x, ∂f/∂y) and ∇_{x,y} g = (∂g/∂x, ∂g/∂y)
We can now combine these conditions into one equation:
F(x, y, λ) = f(x, y) − λ (g(x, y) − c),  and solve ∇_{x,y,λ} F(x, y, λ) = 0
Obviously, ∇_λ F(x, y, λ) = 0 ⇔ g(x, y) = c
Set F′_x, F′_y, F′_z, F′_λ = 0 and solve for the unknowns x, y, z, λ (F′_z and z appear when f depends on a third variable, z).
Example 13: Economics
Two factories, A and B, make TVs at a cost
f(x, y) = 6x² + 12y²,  where x = #TVs made in A and y = #TVs made in B
Task: Minimise the cost of producing 90 TVs by finding the optimal numbers
of TVs, x and y, produced respectively at factories A and B.
Solution: The constraint g(x, y) is given by x + y = 90, so that
F(x, y, λ) = 6x² + 12y² − λ(x + y − 90)
Then: F′_x = 12x − λ, F′_y = 24y − λ, F′_λ = −(x + y − 90), and we need to set ∇F = 0 in order to find the min/max.
Upon setting [F′_x, F′_y, F′_λ] = 0 we find x = 60, y = 30, λ = 720.
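A minimal sketch verifying the solution above by solving the linear system ∇F = 0 numerically:

import numpy as np

# grad F = 0 for F(x, y, lam) = 6x^2 + 12y^2 - lam*(x + y - 90) gives
#   12x - lam = 0,  24y - lam = 0,  x + y = 90.
M = np.array([[12.0,  0.0, -1.0],    # 12x      - lam = 0
              [ 0.0, 24.0, -1.0],    #      24y - lam = 0
              [ 1.0,  1.0,  0.0]])   # x + y           = 90
b = np.array([0.0, 0.0, 90.0])
x, y, lam = np.linalg.solve(M, b)
print(x, y, lam)                     # 60.0, 30.0, 720.0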