Basic Stats Estimation

The document discusses classical statistics and maximum likelihood estimation (MLE), detailing terminologies related to estimators, their properties, and examples of unbiased and consistent estimators. It explains the bias-variance trade-off in estimators and introduces concepts such as confidence intervals and minimum variance unbiased estimation (MVUE). Additionally, it provides examples of estimating parameters from various distributions and the statistical properties of these estimators.


Classical statistics

We list some terminology regarding estimators as follows.

1. We denote an estimator of an unknown parameter θ by Θ̂n (a random variable), which is a function of n observations X1, . . . , Xn whose distribution depends on θ.

2. The estimation error is Θ̃n := Θ̂n − θ. The bias b(Θ̂n) of an estimator is the expectation of the estimation error,

   b(Θ̂n) := E[Θ̂n] − θ. (1)

3. The mean value, the variance and the bias of Θ̂n depend on θ, while the estimation error depends in
addition on X1 , . . . , Xn .

4. Θ̂n is unbiased if E[Θ̂n ] = θ for all values of θ.

5. Θ̂n is asymptotically unbiased if limn→∞ E[Θ̂n ] = θ, for all values of θ.

6. Θ̂n is consistent if the sequence Θ̂n converges to the true value of the parameter θ in probability for
all values of θ.

7. The mean squared error (MSE) E[Θ̃n²] = E[(Θ̂n − θ)²] is related to the bias and the variance of Θ̂n as follows:

   E[Θ̃n²] = E[(Θ̂n − E[Θ̂n] + E[Θ̂n] − θ)²]
           = E[(Θ̂n − E[Θ̂n])²] + 2(E[Θ̂n] − E[Θ̂n])(E[Θ̂n] − θ) + (E[Θ̂n] − θ)²
           = var(Θ̂n) + 0 + b²(Θ̂n)
           = b²(Θ̂n) + var(Θ̂n). (2)

This decomposition represents the trade-off between the bias and the variance in minimizing the mean squared error.

In general, θ can be a scalar or a vector of unknowns. The unknowns are treated as constants and not random
variables unlike in Bayesian statistics.
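To make the decomposition in (2) concrete, here is a minimal Python sketch (not from the notes; the sample size, true parameter, and the two estimators are illustrative assumptions) that estimates the bias, variance, and MSE of two estimators of a normal mean by Monte Carlo and checks that MSE = b² + var.

import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, trials = 2.0, 1.0, 20, 200_000

# Draw `trials` independent data sets of size n.
X = rng.normal(theta, sigma, size=(trials, n))

# Two estimators of theta: the sample mean and a shrunken mean sum(X)/(n+1).
estimators = {
    "sample mean": X.mean(axis=1),
    "sum/(n+1)": X.sum(axis=1) / (n + 1),
}

for name, est in estimators.items():
    bias = est.mean() - theta
    var = est.var()
    mse = ((est - theta) ** 2).mean()
    # MSE matches bias^2 + variance up to Monte Carlo error.
    print(f"{name:12s} bias={bias:+.4f} var={var:.4f} mse={mse:.4f} "
          f"bias^2+var={bias**2 + var:.4f}")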

Maximum likelihood estimation


The objective of MLE is to estimate the unknown parameter θ from observations X1 , . . . , Xn which are
assumed to be iid random variables with pdf pXi (xi ; θ).
Consider the vector of observations X = (X1 , . . . , Xn ) described by a joint PMF pX (x; θ). Suppose
we observe x = (x1 , . . . , xn ). Then, a maximum likelihood estimate (MLE) is a value of the parameter that
maximizes pX (x1 , . . . , xn ; θ) over all θ:

θ̂n = argmaxθ pX (x1 , . . . , xn ; θ). (3)

If X is continuous, then
θ̂n = argmaxθ fX (x1 , . . . , xn ; θ). (4)
We refer to pX(x; θ) or fX(x; θ) as the likelihood function. If the Xi are independent, then

pX(x1, . . . , xn; θ) = ∏_{i=1}^n pXi(xi; θ).

The log-likelihood function for discrete Xi is

log pX(x1, . . . , xn; θ) = ∑_{i=1}^n log pXi(xi; θ)

and for continuous Xi it is

log fX(x1, . . . , xn; θ) = ∑_{i=1}^n log fXi(xi; θ).

θ represents either a scalar or a vector of unknowns.

Example 0.1 (Example 9.2, [1]). Consider the problem of estimating the probability of heads θ of a biased coin, based on n independent tosses X1, . . . , Xn where Xi = 1 for heads and Xi = 0 for tails. Let k be the number of heads. Then the PMF is pX(x; θ) = C(n, k) θ^k (1 − θ)^{n−k}, where C(n, k) is the binomial coefficient. Setting

d/dθ [ C(n, k) θ^k (1 − θ)^{n−k} ] = C(n, k) ( k θ^{k−1} (1 − θ)^{n−k} − (n − k) θ^k (1 − θ)^{n−k−1} ) = 0,

we obtain

k(1 − θ) − (n − k)θ = 0 ⇒ θ̂ = k/n.

Therefore, the MLE is

Θ̂n = (X1 + . . . + Xn)/n.

This estimator is unbiased and consistent, as Θ̂n → θ in probability by the weak law of large numbers.
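As a quick numerical check (a sketch, not from the notes; the sample size and true θ are illustrative), the closed-form MLE k/n coincides with the maximizer of the log-likelihood over a fine grid:

import numpy as np

rng = np.random.default_rng(1)
theta_true, n = 0.3, 500
x = rng.binomial(1, theta_true, size=n)   # n independent coin tosses
k = x.sum()                               # number of heads

# Closed-form MLE.
theta_hat = k / n

# Brute-force check: maximize the log-likelihood over a fine grid.
grid = np.linspace(1e-3, 1 - 1e-3, 9999)
loglik = k * np.log(grid) + (n - k) * np.log(1 - grid)
theta_grid = grid[np.argmax(loglik)]

print(theta_hat, theta_grid)  # both close to theta_true = 0.3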

Example 0.2. Let X1, . . . , Xn be iid random variables with exponential(θ) distribution. Therefore,

pX(x1, . . . , xn; θ) = ∏_{i=1}^n θ e^{−θxi} = θ^n ∏_{i=1}^n e^{−θxi},

and

argmax_θ log pX(x1, . . . , xn; θ) = argmax_θ ( n log θ − θ ∑_{i=1}^n xi ).

Taking the derivative w.r.t. θ and setting it to zero,

n/θ = ∑_{i=1}^n xi ⇒ θ̂_ML = n / ∑_{i=1}^n xi.

Therefore, Θ̂n = n/(X1 + · · · + Xn). We want to study the statistical properties of Θ̂n and want Θ̂n to be close to the true value of θ with high probability.
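A minimal Python sketch (assumed setup, not from the notes) that draws exponential(θ) samples and evaluates Θ̂n = n/∑Xi for increasing n, illustrating that the estimate concentrates around the true rate:

import numpy as np

rng = np.random.default_rng(2)
theta_true = 1.5  # rate parameter of the exponential distribution

for n in (10, 100, 1000, 10000):
    x = rng.exponential(scale=1 / theta_true, size=n)  # numpy uses scale = 1/rate
    theta_hat = n / x.sum()                            # MLE derived above
    print(n, round(theta_hat, 4))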

Example 0.3 (Example 7.4, [2]). Let

x(n) = A + w(n), n = 0, 1, . . . , N − 1

where A is the unknown DC level in white Gaussian noise w(n) with known variance σ². Then the joint PDF is

p(x; A) = (1/(2πσ²)^{N/2}) exp( −(1/(2σ²)) ∑_{n=0}^{N−1} (x(n) − A)² ).

Taking the derivative of the log-likelihood function and setting it to zero,

∂ log p(x; A)/∂A = (1/σ²) ∑_{n=0}^{N−1} (x(n) − A) = 0 ⇒ Â = (1/N) ∑_{n=0}^{N−1} x(n).

Properties of MLE:

• The MLE of a one-to-one function h(θ) of θ is h(θ̂), where θ̂ is the MLE of θ. This is called the invariance principle (illustrated in the sketch after this list).

• When the Xi are iid, under some mild extra assumptions, each component of the MLE is consistent and asymptotically normal.
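To illustrate the invariance principle, here is a sketch under assumed values reusing the exponential example above: the MLE of the mean 1/θ of an exponential(θ) distribution is 1/θ̂_ML, which is exactly the sample mean.

import numpy as np

rng = np.random.default_rng(3)
theta_true, n = 2.0, 5000
x = rng.exponential(scale=1 / theta_true, size=n)

theta_ml = n / x.sum()        # MLE of the rate theta
mean_ml = 1 / theta_ml        # invariance: MLE of h(theta) = 1/theta is h(theta_ml)

print(theta_ml, mean_ml, x.mean())  # mean_ml equals the sample mean exactly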

Estimation of the mean and the variance of a random variable: We consider the sample mean and the
sample variance where no knowledge of the distributions pX (x; θ) or fX (x; θ) is required. The sample
mean of observations X1 , . . . , Xn (iid, with unknown mean θ) is given as
Mn = (X1 + · · · + Xn)/n.
This is unbiased since E[Mn] = E[X] = θ. The sample mean is consistent since it converges to θ in probability by the weak law of large numbers.
Its mean squared error is

E[(Mn − θ)²] = var(Mn) = n var(X)/n² = var(X)/n,
where var(X) is the common variance of Xi . The mean squared error does not depend on θ. The sample
mean is not necessarily the estimator with the smallest variance as the zero estimator θ̂n = 0 has zero
variance. However, the bias of the zero estimator is bθ (θ̂n ) = −θ which implies that the MSE is θ2 .

Example 0.4 (Example 9.5, [1]). Suppose X1, . . . , Xn are normal iid with unknown mean θ and unknown variance σ². Consider the estimator θ̂n = (X1 + · · · + Xn)/(n + 1). This estimator is biased because E[θ̂n] = nθ/(n + 1) and b(θ̂n) = −θ/(n + 1). However, lim_{n→∞} b(θ̂n) = 0, so θ̂n is asymptotically unbiased. The variance is var(θ̂n) = nσ²/(n + 1)², which is smaller than σ²/n of the sample mean. Notice that var(θ̂n) is independent of θ. The mean squared error is

E[θ̃n²] = b²(θ̂n) + var(θ̂n) = θ²/(n + 1)² + nσ²/(n + 1)².

The sample variance estimator is defined as

S̄n² = (1/n) ∑_{i=1}^n (Xi − Mn)². (5)

Note that from the definitions, E[Mn] = θ, E[Xi²] = θ² + σ² and E[Mn²] = θ² + σ²/n. (Note that

E[Mn²] = E[( (1/n) ∑_{i=1}^n Xi )²] = (1/n²) E[ ∑_{i=1}^n Xi² + 2 ∑_{1≤i<j≤n} Xi Xj ] = (1/n²)( n(θ² + σ²) + n(n − 1)θ² ) = θ² + σ²/n.)

Furthermore,

E[S̄n²] = (1/n) E[ ∑_{i=1}^n Xi² − 2Mn ∑_{i=1}^n Xi + nMn² ]
        = (1/n) ∑_{i=1}^n E[Xi²] − 2E[Mn²] + E[Mn²]      (using ∑_{i=1}^n Xi = nMn)
        = (1/n) ∑_{i=1}^n E[Xi²] − E[Mn²]
        = θ² + σ² − (θ² + σ²/n)
        = ((n − 1)/n) σ². (6)

Therefore, S̄n² is not an unbiased estimator of σ² but it is asymptotically unbiased. It coincides with the ML estimator of the variance if the Xi are normal.

Another variance estimator is

Ŝn² = (1/(n − 1)) ∑_{i=1}^n (Xi − Mn)² = (n/(n − 1)) S̄n². (7)

Therefore,

E[Ŝn²] = (n/(n − 1)) · ((n − 1)/n) σ² = σ², (8)

i.e., Ŝn² is an unbiased estimator of σ². For large n, the two estimators coincide.
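A short Python sketch (illustrative values, not from the notes) comparing the two variance estimators over many repeated samples: the 1/n version in (5) underestimates σ² by the factor (n − 1)/n, while the 1/(n − 1) version in (7) is unbiased.

import numpy as np

rng = np.random.default_rng(4)
theta, sigma, n, trials = 0.0, 2.0, 5, 200_000

X = rng.normal(theta, sigma, size=(trials, n))
Mn = X.mean(axis=1, keepdims=True)

S_bar = ((X - Mn) ** 2).sum(axis=1) / n        # biased sample variance, eq. (5)
S_hat = ((X - Mn) ** 2).sum(axis=1) / (n - 1)  # unbiased sample variance, eq. (7)

print(S_bar.mean(), (n - 1) / n * sigma**2)  # both ~ 3.2 for sigma^2 = 4, n = 5
print(S_hat.mean(), sigma**2)                # both ~ 4.0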
Confidence intervals: A confidence interval is an interval which contains θ with high probability. Let α be a small number. Then, the confidence level is denoted by 1 − α. The point estimator Θ̂n is replaced by a lower estimator Θ̂n^− and an upper estimator Θ̂n^+ such that Θ̂n^− ≤ Θ̂n^+ and

Pθ(Θ̂n^− ≤ θ ≤ Θ̂n^+) ≥ 1 − α, (9)

for every possible value of θ. The interval [Θ̂n^−, Θ̂n^+] is called a 1 − α confidence interval. The estimators Θ̂n^− and Θ̂n^+ are random variables that depend on the observations X1, . . . , Xn. Typical values of α are 0.05, 0.025 or 0.01.
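As a concrete illustration (a sketch with assumed values, using the normal approximation of the sample mean rather than anything prescribed in the notes), a 1 − α confidence interval for an unknown mean can be built around Mn:

import numpy as np

rng = np.random.default_rng(5)
theta_true, sigma, n, alpha = 5.0, 2.0, 50, 0.05

x = rng.normal(theta_true, sigma, size=n)
Mn = x.mean()
s = x.std(ddof=1)              # unbiased estimate of the standard deviation
z = 1.96                       # standard normal quantile for 1 - alpha/2 with alpha = 0.05

lower = Mn - z * s / np.sqrt(n)   # lower estimator Theta_n^-
upper = Mn + z * s / np.sqrt(n)   # upper estimator Theta_n^+
print(f"{1 - alpha:.0%} CI: [{lower:.3f}, {upper:.3f}]")  # contains theta_true ~95% of the time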
Example 0.5. [3] Consider a resistor estimation problem (with true value R0) using (noisy) measurements v(k) and i(k) of the true voltage v0 and the true current i0 respectively. Let v(k) = v0 + nv(k), i(k) = i0 + ni(k) where nv and ni denote voltage and current noise during measurements. Let nv, ni be iid, zero mean with finite variances σv², σi². Furthermore, cov(nv(k), ni(j)) = 0 for all k, j. Notice that from the noise models and the properties of the sample mean and sample variance estimators,

lim_{N→∞} (1/N) ∑_{k=1}^N ni(k) = 0,   lim_{N→∞} (1/N) ∑_{k=1}^N nv(k) = 0,

lim_{N→∞} (1/N) ∑_{k=1}^N ni(k)nv(k) = cov(ni, nv) = 0,   lim_{N→∞} (1/N) ∑_{k=1}^N ni(k)² = σi².

A possible estimator is R̂_EV := ( (1/N) ∑_{k=1}^N v(k) ) / ( (1/N) ∑_{k=1}^N i(k) ). Clearly, this is asymptotically unbiased. Furthermore, since v(k), i(k) are independent,

E[ ( (1/N) ∑_{k=1}^N v(k) ) / ( (1/N) ∑_{k=1}^N i(k) ) ] = E[ (1/N) ∑_{k=1}^N v(k) ] / E[ (1/N) ∑_{k=1}^N i(k) ] = v0/i0 = R0.

Thus, R̂_EV is an unbiased estimator.


Next, we propose a least squares estimator. The LS problem is

min_R (1/N) ∑_{k=1}^N (v(k) − R i(k))².

Let vN, iN be the vectors of voltages and currents (ΦN = iN in this case). Notice that the regressor iN is noisy in this case, unlike the previous cases where the regressor matrix Φ was assumed to be noise free. The LS solution is

R̂(N) = ( (1/N) iN^T vN ) / ( (1/N) iN^T iN ) = ( (1/N) ∑_{k=1}^N (v0 + nv(k))(i0 + ni(k)) ) / ( (1/N) ∑_{k=1}^N (i0 + ni(k))² ).

As N → ∞, the numerator evaluates to

lim_{N→∞} ( v0 i0 + (1/N) ∑_{k=1}^N nv(k)ni(k) ) = v0 i0

and the denominator evaluates to

lim_{N→∞} ( i0² + (1/N) ∑_{k=1}^N ni(k)² ) = i0² + σi²,

where we have used that the sample mean and the sample variance converge to the true mean and the true variance asymptotically. Thus, lim_{N→∞} R̂(N) = R0 / (1 + σi²/i0²).

Another possible estimator is R̂_SA = (1/N) ∑_{k=1}^N v(k)/i(k). Note that this estimator may not converge as i(k) may take the value 0 for some k due to noise.
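A Monte Carlo sketch of this example (with assumed numerical values for R0, i0, and the noise levels) showing that R̂_EV concentrates around R0 while the LS estimate settles near R0/(1 + σi²/i0²):

import numpy as np

rng = np.random.default_rng(6)
R0, i0, sigma_v, sigma_i, N = 10.0, 1.0, 0.5, 0.3, 200_000
v0 = R0 * i0

i_meas = i0 + rng.normal(0, sigma_i, size=N)   # noisy current measurements
v_meas = v0 + rng.normal(0, sigma_v, size=N)   # noisy voltage measurements

R_ev = v_meas.mean() / i_meas.mean()           # ratio-of-means estimator R_EV
R_ls = (i_meas @ v_meas) / (i_meas @ i_meas)   # least squares estimator R(N)

print(R_ev)                                    # ~ R0 = 10
print(R_ls, R0 / (1 + sigma_i**2 / i0**2))     # both ~ 9.17: the LS estimate is biased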

Minimum variance unbiased estimation


Recall the bias-variance trade-off. It turns out that there is a lower bound on the variance of unbiased estimators. Minimum variance unbiased estimators (MVUE) are the unbiased estimators which minimize the variance term in the MSE.
Let p(y; θ) be the PDF of obtaining measurements y for a given θ. Suppose p(y; θ0) for the true θ0 is known. Then, the covariance matrix Σθ̂ := cov(θ̂(y)) of any unbiased estimator θ̂(y) satisfies

Σθ̂ ≥ I(θ0)^{−1}, (10)
where I(θ0 ) is the Fisher information matrix and this lower bound is called the Cramer-Rao lower bound.
Note that

I(θ0) = −E[ ∂² log p(y; θ)/∂θ² |_{θ=θ0} ]. (11)
For linear models y = ΦN θ + w with additive Gaussian noise w ∼ N(0, Σ), I(θ) = ΦN^T Σ^{−1} ΦN. This can be derived as follows. Clearly, y ∼ N(ΦN θ, Σ). Therefore,

log p(y; θ) = C − (1/2)(y − ΦN θ)^T Σ^{−1} (y − ΦN θ)

⇒ ∂² log p(y; θ)/∂θ² = −ΦN^T Σ^{−1} ΦN

⇒ −E[ ∂² log p(y; θ)/∂θ² ] = ΦN^T Σ^{−1} ΦN.
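A small sketch (assumed dimensions and noise covariance) computing the Fisher information ΦN^T Σ^{−1} ΦN for a linear model and the corresponding Cramer-Rao bound on the covariance of any unbiased estimator of θ:

import numpy as np

rng = np.random.default_rng(7)
N, d = 50, 2                                  # 50 measurements, 2 unknown parameters
Phi = rng.normal(size=(N, d))                 # regressor matrix Phi_N
Sigma = 0.25 * np.eye(N)                      # noise covariance (white noise, variance 0.25)

I_theta = Phi.T @ np.linalg.inv(Sigma) @ Phi  # Fisher information matrix
crlb = np.linalg.inv(I_theta)                 # Cramer-Rao lower bound on cov(theta_hat)

print(I_theta)
print(crlb)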

Another way to obtain the Fisher information matrix is as follows. Note that one can replace ∂/∂θ with ∇θ to denote the gradient when θ is vector valued and ∂²/∂θ² with ∇θ² to denote the Hessian. Notice that

∇θ log p(y; θ) = (1/p(y; θ)) ∇θ p(y; θ). (12)

Therefore,

−∇θ² log p(y; θ) = −(1/p(y; θ)) ∇θ² p(y; θ) + (1/p(y; θ))² ∇θ^T p(y; θ) ∇θ p(y; θ)
                 = −(1/p(y; θ)) ∇θ² p(y; θ) + (∇θ^T log p(y; θ))(∇θ log p(y; θ))

⇒ I(θ) = E[−∇θ² log p(y; θ)] = −∫ ∇θ² p(y; θ) dy + ∫ (∇θ^T log p(y; θ))(∇θ log p(y; θ)) p(y; θ) dy
       = −∇θ² ∫ p(y; θ) dy + E[(∇θ^T log p(y; θ))(∇θ log p(y; θ))]
       = E[(∇θ^T log p(y; θ))(∇θ log p(y; θ))], (13)

where the first term vanishes since ∫ p(y; θ) dy = 1 (assuming the order of differentiation and integration can be interchanged).

Example 0.6. Consider X ∼ Bin(n, θ) with PMF pX(x; θ) = C(n, x) θ^x (1 − θ)^{n−x}, where C(n, x) is the binomial coefficient and θ is an unknown parameter. Then, the scalar Fisher information for the random variable X is computed as follows. Note that log p(x; θ) = log C(n, x) + x log θ + (n − x) log(1 − θ). Therefore,

∂ log p(x; θ)/∂θ = x/θ − (n − x)/(1 − θ) ⇒ −∂² log p(x; θ)/∂θ² = x/θ² + (n − x)/(1 − θ)².

Therefore,

−E[ ∂² log p(x; θ)/∂θ² ] = E[x]/θ² + (n − E[x])/(1 − θ)² = nθ/θ² + (n − nθ)/(1 − θ)² = n/θ + n/(1 − θ) = n/(θ(1 − θ)).

Therefore, if θ̂ is an unbiased estimator of θ, then var(θ̂) ≥ θ(1 − θ)/n.
Example 0.7. [2] Let x(n) = A + w(n) where w(n) ∼ N(0, σ²). Then, p(x; A) = ∏_{n=0}^{N−1} (1/√(2πσ²)) e^{−(1/(2σ²))(x(n)−A)²} (x being the vector of N observations). Therefore,

log p(x; A) = C − (1/(2σ²)) ∑_{n=0}^{N−1} (x(n) − A)² ⇒ ∂² log p(x; A)/∂A² = −N/σ² ⇒ −E[ ∂² log p(x; A)/∂A² ] = N/σ².

Therefore, if Â is an unbiased estimator of A, then var(Â) ≥ σ²/N.
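A quick Monte Carlo sketch (assumed values) checking that the sample-mean estimator of A from Example 0.3 attains the Cramer-Rao bound σ²/N of Example 0.7:

import numpy as np

rng = np.random.default_rng(8)
A, sigma, N, trials = 1.0, 0.8, 25, 100_000

x = A + rng.normal(0, sigma, size=(trials, N))  # `trials` independent records of length N
A_hat = x.mean(axis=1)                          # sample-mean estimator of A

print(A_hat.var())        # empirical variance of the estimator
print(sigma**2 / N)       # Cramer-Rao lower bound: both ~ 0.0256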

Unbiased estimators attaining the Cramer-Rao lower bound are called minimum variance unbiased estimators (MVUE). For ease of computation, we look for the best linear unbiased estimators (BLUE), as an MVUE cannot always be found in practice. Linear estimators are linear in the data.

Linear estimators and BLUE


Consider an observation vector y from which we want to estimate an unknown θ. For linear models (i.e., a linear relation between θ and the prediction ŷ) with Gaussian noise, the MVUE turns out to be linear by least squares theory. For nonlinear models, this is not true. One can still look for linear estimators for nonlinear models, i.e., estimators of the form θ̂ = Ky where K is a matrix of appropriate size. Best linear unbiased estimators (BLUE) are linear estimators of the form θ̂ = Ky which have minimum covariance among all linear unbiased estimators.

Bayesian estimation
• Bayesian statistics treats unknown parameters as random variables with known prior distributions.
These prior beliefs are then updated to posterior beliefs (using Bayes’ rule) after observations/measured
data.

• Principal Bayesian inference methods are: maximum a posteriori probability (MAP) and minimum mean squared error (MMSE).

Bayesian inference and posterior distribution


The unknown quantity of interest is Θ which is modeled as a random variable or a finite collection of
random variables. We observe a collection of random variables X = (X1 , . . . , Xn ) called observation
vector/measurements and aim to extract information about Θ. We assume the following.

1. A prior distribution pΘ or fΘ , depending on whether Θ is discrete or continuous.

2. A conditional distribution pX|Θ or fX|Θ depending on whether X is discrete or continuous.

Using the prior and the conditional distribution, we calculate the posterior pΘ|X or fΘ|X using Bayes' rule. The four versions of Bayes' rule are as follows.

1. Θ discrete, X discrete

   pΘ|X(θ|x) = pΘ(θ) pX|Θ(x|θ) / ∑_{θ'} pΘ(θ') pX|Θ(x|θ'). (14)

   This is useful in hypothesis testing and classification problems with discrete data.

2. Θ discrete, X continuous

   pΘ|X(θ|x) = pΘ(θ) fX|Θ(x|θ) / ∑_{θ'} pΘ(θ') fX|Θ(x|θ'). (15)

   This is useful in hypothesis testing problems with continuous data. For example, binary signal detection in the presence of Gaussian noise.

3. Θ continuous, X discrete

   fΘ|X(θ|x) = fΘ(θ) pX|Θ(x|θ) / ∫ fΘ(θ') pX|Θ(x|θ') dθ'. (16)

   This is useful in estimation problems with discrete data, for example, a coin with an unknown parameter θ where the observation is the number of heads in n tosses, or estimating parameters of a model from discrete measurements. A real-world example would be estimating parameters of the laws of motion from discrete measurements of the time and position of an object (for example, curve fitting). A numerical sketch of the coin case follows after this list.

4. Θ continuous, X continuous

   fΘ|X(θ|x) = fΘ(θ) fX|Θ(x|θ) / ∫ fΘ(θ') fX|Θ(x|θ') dθ'. (17)

   This is useful in estimation problems with continuous data. For example, estimating parameters of a model from observed continuous signals (for example, system identification).
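As promised above, here is a minimal sketch of case 3 (continuous Θ, discrete X): the posterior of a coin's heads probability after observing k heads in n tosses, evaluated on a grid. The uniform prior and the sample values are assumptions for illustration, not part of the notes.

import numpy as np

rng = np.random.default_rng(9)
theta_true, n = 0.7, 40
k = rng.binomial(n, theta_true)              # observed number of heads

theta = np.linspace(0.001, 0.999, 999)       # grid over the continuous parameter
prior = np.ones_like(theta)                  # uniform prior f_Theta(theta)
likelihood = theta**k * (1 - theta)**(n - k) # p_{X|Theta}(k | theta), up to a constant

posterior = prior * likelihood
posterior /= posterior.sum() * (theta[1] - theta[0])  # normalize on the grid, cf. the denominator in (16)

print(k / n)                             # MLE for comparison
print(theta[np.argmax(posterior)])       # posterior mode (MAP under the uniform prior)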

MAP estimation
The posterior distribution is either a PMF pΘ|X(·|x) or a PDF fΘ|X(·|x). To find the estimate of Θ given X, we use the MAP rule:

pΘ|X(θ*|x) := max_θ pΘ|X(θ|x) (18)
⇒ θ* = argmax_θ pΘ|X(θ|x) (19)

for discrete Θ and

fΘ|X(θ*|x) := max_θ fΘ|X(θ|x) (20)
⇒ θ* = argmax_θ fΘ|X(θ|x) (21)

for continuous Θ. For a continuous random variable Θ, the conditional expectation

E[Θ|X = x] = ∫ θ fΘ|X(θ|x) dθ (22)

can be a better estimate than the MAP estimator. In general, a priori, there is no reason for choosing one estimator over the other unless objectives are precisely stated.
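A self-contained sketch (assumed data: 3 heads in 10 tosses with a uniform prior, not from the notes) contrasting the MAP estimate with the conditional expectation (22); the two differ when the posterior is skewed.

import numpy as np

# Posterior of the heads probability after 3 heads in 10 tosses, uniform prior.
theta = np.linspace(0.001, 0.999, 999)
posterior = theta**3 * (1 - theta)**7
posterior /= posterior.sum() * (theta[1] - theta[0])            # normalize on the grid

theta_map = theta[np.argmax(posterior)]                         # MAP rule (20)-(21): ~0.30
theta_mean = np.sum(theta * posterior) * (theta[1] - theta[0])  # conditional expectation (22): ~0.33

print(theta_map, theta_mean)                                    # the two estimates differ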

MMSE estimator
The Bayesian MSE is defined as

Bmse(Θ̂) := E[(Θ − Θ̂)²]. (23)

Using the joint PDF p(x, Θ),

Bmse(Θ̂) = ∫∫ (Θ − Θ̂)² p(x, Θ) dx dΘ. (24)

Using Bayes' theorem, p(x, Θ) = p(Θ|x)p(x),

Bmse(Θ̂) = ∫ ( ∫ (Θ − Θ̂)² p(Θ|x) dΘ ) p(x) dx.

Note that

p(Θ|x) = p(x|Θ)p(Θ)/p(x) = p(x|Θ)p(Θ) / ∫ p(x|Θ)p(Θ) dΘ,

where p(Θ) is the prior PDF on Θ. It turns out that the MMSE estimator is given by

Θ̂ = E[Θ|x]. (25)

For linear models of the observation x as a function of Θ and Gaussian noise assumptions, MMSE estimators can be computed analytically. However, in general, computing the MMSE estimator is computationally intensive. Therefore, one looks for linear estimators which minimize the MSE rather than computing E[Θ|x]. LMMSE estimators (for scalar estimation) are of the form

Θ̂_linear = ∑_{n=0}^{N−1} a_n x(n) + a_N. (26)

This can be generalized to vector valued estimators.
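A minimal sketch of the idea behind (26) (all numerical values assumed): fit the coefficients a_0, . . . , a_{N−1}, a_N by minimizing the empirical squared error over simulated (Θ, x) pairs, which is a linear least squares problem, and compare the resulting linear estimator's MSE with the prior variance of Θ.

import numpy as np

rng = np.random.default_rng(10)
N, trials = 4, 100_000

Theta = rng.normal(1.0, 1.0, size=trials)                  # prior draws of the unknown
x = Theta[:, None] + rng.normal(0, 2.0, size=(trials, N))  # N noisy observations per draw

# Design matrix [x(0), ..., x(N-1), 1]: solving least squares for the weights and offset
# minimizes the empirical Bayesian MSE over all linear estimators of the form (26).
D = np.hstack([x, np.ones((trials, 1))])
coeffs, *_ = np.linalg.lstsq(D, Theta, rcond=None)

Theta_hat = D @ coeffs
print(coeffs)                              # weights a_0..a_{N-1} and offset a_N
print(np.mean((Theta - Theta_hat) ** 2))   # empirical Bmse of the fitted linear estimator
print(np.var(Theta))                       # prior variance of Theta, for comparison (larger)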


An overview of the classical and Bayesian approaches:

• One can use either the classical or the Bayesian approach for modeling. For parametrized models, the classical approach treats the parameters as deterministic unknowns, whereas Bayesian modeling assumes some prior knowledge on the parameters which is updated after measurements.

• We will use the classical approach to modeling and system identification, but the Bayesian modeling approach can also be used in appropriate cases.

• In estimation theory, both approaches are used. For example, least squares estimation belongs to classical estimation theory and is commonly used in many applications, e.g., estimating a true signal from its noisy measurements. The Kalman filter uses the Bayesian approach to estimate the state of a dynamical system.

References
[1] D. Bertsekas and J. Tsitsiklis, Introduction to Probability, 2nd edition, 2008.

[2] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, 1993.

[3] M. Diehl, Lecture Notes on Modeling and System Identification, lecture notes and video lectures, 2020.
