Basic Stats Estimation
2. The estimation error Θ̃n := Θ̂n − θ. The bias of an estimator, b(Θ̂n), is the expectation of the estimation error:
b(Θ̂n) := E[Θ̂n] − θ.  (1)
3. The mean value, the variance and the bias of Θ̂n depend on θ, while the estimation error depends in
addition on X1 , . . . , Xn .
6. Θ̂n is consistent if the sequence Θ̂n converges to the true value of the parameter θ in probability for
all values of θ.
7. The mean squared error (MSE) E[Θ̃n²] = E[(Θ̂n − θ)²] is related to the bias and the variance of Θ̂n as follows:
E[Θ̃n²] = b(Θ̂n)² + var(Θ̂n).  (2)
This decomposition represents the trade-off between the bias and the variance when minimizing the mean squared error, as the numerical sketch below illustrates.
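As a quick numerical illustration, the following Python sketch (assuming NumPy; the Uniform(0, θ) model and the sample maximum as estimator are arbitrary choices, not taken from these notes) verifies the decomposition by Monte Carlo:

import numpy as np

# Monte Carlo check of MSE = bias^2 + variance for a deliberately biased estimator:
# estimate theta of Uniform(0, theta) by the sample maximum (arbitrary illustration).
rng = np.random.default_rng(0)
theta, n, trials = 2.0, 10, 200_000

estimates = rng.uniform(0.0, theta, size=(trials, n)).max(axis=1)

bias = estimates.mean() - theta
variance = estimates.var()
mse = ((estimates - theta) ** 2).mean()
print(mse, bias**2 + variance)   # MSE equals bias^2 + variance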
In general, θ can be a scalar or a vector of unknowns. The unknowns are treated as constants and not as random variables, unlike in Bayesian statistics.
Maximum likelihood estimation: Given observations x1, . . . , xn, the ML estimate is
θ̂n = argmaxθ pX(x1, . . . , xn; θ)  (3)
if X is discrete. If X is continuous, then
θ̂n = argmaxθ fX(x1, . . . , xn; θ).  (4)
We refer to pX(x; θ) or fX(x; θ) as the likelihood function. If the Xi are independent, then for discrete Xi
log pX(x1, . . . , xn; θ) = Σ_{i=1}^{n} log pXi(xi; θ),
and for continuous Xi it is
log fX(x1, . . . , xn; θ) = Σ_{i=1}^{n} log fXi(xi; θ).
Example 0.1 (Example 9.2, [1]). Consider the problem of estimating the probability of heads θ of a biased coin, based on n independent tosses X1, . . . , Xn, where Xi = 1 for heads and Xi = 0 for tails. Let k be the number of heads. Then the PMF is pX(x; θ) = C(n, k) θ^k (1 − θ)^{n−k}, where C(n, k) denotes the binomial coefficient. Setting
d/dθ [C(n, k) θ^k (1 − θ)^{n−k}] = C(n, k) (k θ^{k−1} (1 − θ)^{n−k} − (n − k) θ^k (1 − θ)^{n−k−1}) = 0,
we obtain
k(1 − θ) − (n − k)θ = 0 ⇒ θ̂ = k/n.
Therefore, the MLE is
Θ̂n = (X1 + · · · + Xn)/n.
This estimator is unbiased and consistent as Θ̂n → θ in probability by the weak law of large numbers.
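A small simulation (a Python sketch assuming NumPy; θ = 0.3 and the sample sizes are arbitrary illustration values) shows both properties: the average of Θ̂n over many experiments is close to θ, and Θ̂n concentrates around θ as n grows.

import numpy as np

rng = np.random.default_rng(0)
theta = 0.3   # true probability of heads (arbitrary choice for illustration)

for n in (10, 100, 1000):
    # Each row is one experiment of n tosses; the MLE is the fraction of heads.
    tosses = rng.binomial(1, theta, size=(10_000, n))
    mle = tosses.mean(axis=1)
    print(n, mle.mean(), mle.std())   # mean ≈ theta (unbiased), std shrinks (consistent)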
Example 0.2. Let X1, . . . , Xn be iid random variables with exponential(θ) distribution. Therefore, fX(x1, . . . , xn; θ) = Π_{i=1}^{n} θ e^{−θxi} = θ^n Π_{i=1}^{n} e^{−θxi}, and
argmaxθ log fX(x1, . . . , xn; θ) = argmaxθ (n log θ − θ Σ_{i=1}^{n} xi).
Setting the derivative with respect to θ to zero gives n/θ − Σ_{i=1}^{n} xi = 0. Therefore, Θ̂n = n/(X1 + · · · + Xn). We want to study the statistical properties of Θ̂n and want Θ̂n to be close to the true value of θ with high probability.
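A minimal sketch (assuming NumPy; the true rate θ = 1.5 and the sample size are arbitrary) of computing Θ̂n = n/(X1 + · · · + Xn) from simulated exponential data:

import numpy as np

rng = np.random.default_rng(0)
theta = 1.5                                         # true rate parameter (arbitrary)
x = rng.exponential(scale=1.0 / theta, size=1000)   # Exponential(theta) samples

theta_hat = len(x) / x.sum()    # MLE: n / (x_1 + ... + x_n)
print(theta_hat)                # close to theta for large n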
Consider N observations
x(n) = A + w(n), n = 0, 1, . . . , N − 1,
where A is the unknown DC level and w(n) is white Gaussian noise with known variance σ². Then the joint PDF is
p(x; A) = (1/(2πσ²)^{N/2}) e^{−(1/(2σ²)) Σ_{n=0}^{N−1} (x(n) − A)²}.
Taking the derivative of the log likelihood function and setting it to zero,
∂ log p(x; A)/∂A = (1/σ²) Σ_{n=0}^{N−1} (x(n) − A) = 0 ⇒ Â = (1/N) Σ_{n=0}^{N−1} x(n),
i.e., the ML estimate of the DC level is the sample mean of the observations.
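The following sketch (assuming NumPy; A = 1.0, σ = 0.5 and N = 500 are arbitrary illustration values) generates x(n) = A + w(n) and computes  as the sample mean:

import numpy as np

rng = np.random.default_rng(0)
A, sigma, N = 1.0, 0.5, 500              # arbitrary illustration values
x = A + sigma * rng.standard_normal(N)   # x(n) = A + w(n), w(n) ~ N(0, sigma^2)

A_hat = x.mean()   # ML estimate: sample mean
print(A_hat)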
Properties of MLE:
• The MLE of a one-to-one function h(θ) of θ is h(θ̂), where θ̂ is the MLE of θ. This is called the invariance principle.
• When the Xi are iid, under some mild extra assumptions, each component of the MLE is consistent and asymptotically normal.
Estimation of the mean and the variance of a random variable: We consider the sample mean and the sample variance, for which no knowledge of the distributions pX(x; θ) or fX(x; θ) is required. The sample mean of observations X1, . . . , Xn (iid, with unknown mean θ) is given as
Mn = (X1 + · · · + Xn)/n.
This is unbiased since E[Mn] = E[X] = θ. The sample mean is consistent since it converges to θ in probability by the weak law of large numbers.
Its mean squared error is
E[(Mn − θ)²] = var(Mn) = n var(X)/n² = var(X)/n,
where var(X) is the common variance of the Xi. The mean squared error does not depend on θ. The sample mean is not necessarily the estimator with the smallest variance; for example, the zero estimator θ̂n = 0 has zero variance. However, the bias of the zero estimator is bθ(θ̂n) = −θ, which implies that its MSE is θ².
Example 0.4 (Example 9.5, [1]). Suppose X1 , . . . , Xn are normal iid with unknown mean θ and unknown
variance σ². Consider the estimator θ̂n = (X1 + · · · + Xn)/(n + 1). This estimator is biased because E[θ̂n] = nθ/(n + 1) and b(θ̂n) = −θ/(n + 1). However, limn→∞ b(θ̂n) = 0, so θ̂n is asymptotically unbiased. The variance is var(θ̂n) = nσ²/(n + 1)², which is smaller than the variance σ²/n of the sample mean. Notice that var(θ̂n) is independent of θ. The mean squared error is
E[θ̃n²] = b(θ̂n)² + var(θ̂n) = θ²/(n + 1)² + nσ²/(n + 1)².
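A quick Monte Carlo comparison (a sketch assuming NumPy; θ = 1, σ = 2 and n = 5 are arbitrary) of the MSE of this biased estimator with that of the sample mean; for these values the formulas above give σ²/n = 0.8 and (θ² + nσ²)/(n + 1)² ≈ 0.58, so the bias can pay off:

import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, trials = 1.0, 2.0, 5, 200_000   # arbitrary illustration values
X = rng.normal(theta, sigma, size=(trials, n))

mean_est = X.mean(axis=1)              # sample mean M_n
shrunk_est = X.sum(axis=1) / (n + 1)   # biased estimator (X_1 + ... + X_n)/(n + 1)

print(((mean_est - theta) ** 2).mean())     # ≈ sigma^2 / n = 0.8
print(((shrunk_est - theta) ** 2).mean())   # ≈ (theta^2 + n*sigma^2)/(n+1)^2 ≈ 0.58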
The sample variance of X1, . . . , Xn (iid with mean θ and variance σ²) is
S̄n² = (1/n) Σ_{i=1}^{n} (Xi − Mn)².
Note that, from the definitions, E[Mn] = θ, E[Xi²] = θ² + σ² and E[Mn²] = θ² + σ²/n. (Indeed,
E[Mn²] = E[((1/n) Σ_{i=1}^{n} Xi)²] = (1/n²) E[Σ_{i=1}^{n} Xi² + 2 Σ_{1≤i<j≤n} Xi Xj] = (1/n²)(n(θ² + σ²) + n(n − 1)θ²) = θ² + σ²/n.)
Furthermore,
E[S̄n²] = E[(1/n)(Σ_{i=1}^{n} Xi² − 2 Mn Σ_{i=1}^{n} Xi + n Mn²)]
= (1/n) Σ_{i=1}^{n} E[Xi²] − 2 E[Mn²] + E[Mn²]   (using Σ_{i=1}^{n} Xi = n Mn)
= (1/n) Σ_{i=1}^{n} E[Xi²] − E[Mn²]
= θ² + σ² − (θ² + σ²/n)
= ((n − 1)/n) σ².  (6)
Therefore, S̄n² is not an unbiased estimator of σ², but it is asymptotically unbiased. It coincides with the ML estimator if the Xi are normal.
Another variance estimator is
Ŝn² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − Mn)² = (n/(n − 1)) S̄n².  (7)
Therefore,
E[Ŝn²] = (n/(n − 1)) · ((n − 1)/n) σ² = σ²,  (8)
i.e., Ŝn² is an unbiased estimator of σ². For large n, the two estimators coincide.
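A short empirical check (a sketch assuming NumPy; σ = 2 and n = 10 are arbitrary) that E[S̄n²] ≈ ((n − 1)/n) σ² while E[Ŝn²] ≈ σ²:

import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 2.0, 10, 200_000
X = rng.normal(0.0, sigma, size=(trials, n))

S_bar = X.var(axis=1, ddof=0)   # divides by n     -> E[S_bar] = (n-1)/n * sigma^2
S_hat = X.var(axis=1, ddof=1)   # divides by n - 1 -> E[S_hat] = sigma^2

print(S_bar.mean(), (n - 1) / n * sigma**2)
print(S_hat.mean(), sigma**2)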
Confidence intervals: A confidence interval is an interval which contains θ with high probability. Let α be a small number. Then the confidence level is denoted by 1 − α. The point estimator Θ̂n is replaced by a lower estimator Θ̂n− and an upper estimator Θ̂n+ such that Θ̂n− ≤ Θ̂n+ and
Pθ(Θ̂n− ≤ θ ≤ Θ̂n+) ≥ 1 − α,  (9)
for every possible value of θ.
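As a concrete illustration (a sketch assuming NumPy; this uses the standard normal-based interval Mn ± 1.96 σ/√n for the mean with known variance, a construction not spelled out above), the interval covers θ in roughly a fraction 1 − α of repeated experiments:

import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, alpha, trials = 1.0, 2.0, 50, 0.05, 100_000   # arbitrary values
z = 1.96   # approximately the 1 - alpha/2 quantile of N(0, 1) for alpha = 0.05

X = rng.normal(theta, sigma, size=(trials, n))
Mn = X.mean(axis=1)
lower = Mn - z * sigma / np.sqrt(n)
upper = Mn + z * sigma / np.sqrt(n)

print(np.mean((lower <= theta) & (theta <= upper)))   # empirical coverage ≈ 0.95 = 1 - alpha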
Consider next the problem of estimating a resistance R0 = v0/i0 from N noisy measurements of the voltage and the current, v(k) = v0 + nv(k) and i(k) = i0 + ni(k), k = 1, . . . , N, where the measurement noises nv(k), ni(k) are zero mean and independent of each other, so that
lim_{N→∞} (1/N) Σ_{k=1}^{N} ni(k) nv(k) = cov(ni, nv) = 0,   lim_{N→∞} (1/N) Σ_{k=1}^{N} ni(k)² = σi².
A possible estimator is
R̂EV := ((1/N) Σ_{k=1}^{N} v(k)) / ((1/N) Σ_{k=1}^{N} i(k)).
Clearly, this is asymptotically unbiased. Furthermore, since v(k), i(k) are independent,
E[((1/N) Σ_{k=1}^{N} v(k)) / ((1/N) Σ_{k=1}^{N} i(k))] = E[(1/N) Σ_{k=1}^{N} v(k)] / E[(1/N) Σ_{k=1}^{N} i(k)] = v0/i0 = R0.
Let vN , iN be the vectors of voltages and currents (ΦN = iN in this case). Notice that the regressor iN is
noisy in this case unlike the previous cases where the regressor matrix Φ was assumed to be noise free.
The LS solution is
R̂(N) = ((1/N) iN^T vN) / ((1/N) iN^T iN) = ((1/N) Σ_{k=1}^{N} (v0 + nv(k))(i0 + ni(k))) / ((1/N) Σ_{k=1}^{N} (i0 + ni(k))²).
As N → ∞, the numerator evaluates to
lim_{N→∞} (1/N) Σ_{k=1}^{N} (v0 + nv(k))(i0 + ni(k)) = v0 i0 + lim_{N→∞} (1/N) Σ_{k=1}^{N} nv(k) ni(k) = v0 i0
(the cross terms i0 nv(k) and v0 ni(k) average to zero since the noises are zero mean), and the denominator evaluates to
lim_{N→∞} (1/N) Σ_{k=1}^{N} (i0 + ni(k))² = i0² + lim_{N→∞} (1/N) Σ_{k=1}^{N} ni(k)² = i0² + σi²,
where we have used that the sample mean and the sample variance converge to the true mean and the true variance asymptotically. Thus,
lim_{N→∞} R̂(N) = v0 i0 / (i0² + σi²) = R0 / (1 + σi²/i0²),
i.e., the LS estimator is asymptotically biased when the regressor is noisy.
Another possible estimator is
R̂SA := (1/N) Σ_{k=1}^{N} v(k)/i(k).
Note that this estimator may not converge, as i(k) may take the value 0 for some k due to noise.
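The following Monte Carlo sketch (assuming NumPy; v0 = 10, i0 = 2 and the noise levels are arbitrary) compares the three estimators: R̂EV approaches R0, R̂LS is pulled below R0 by the factor 1 + σi²/i0², and R̂SA is sensitive to small values of i(k):

import numpy as np

rng = np.random.default_rng(0)
v0, i0 = 10.0, 2.0                            # true voltage and current (arbitrary), R0 = 5
sigma_v, sigma_i, N = 1.0, 0.5, 100_000

v = v0 + sigma_v * rng.standard_normal(N)     # v(k) = v0 + n_v(k)
i = i0 + sigma_i * rng.standard_normal(N)     # i(k) = i0 + n_i(k)

R_EV = v.mean() / i.mean()   # ratio of sample means: consistent for R0
R_LS = (i @ v) / (i @ i)     # least squares with a noisy regressor
R_SA = (v / i).mean()        # average of ratios; does not converge to R0 in general

print(R_EV)                                          # ≈ 5
print(R_LS, (v0 / i0) / (1 + sigma_i**2 / i0**2))    # both ≈ 5 / 1.0625 ≈ 4.71
print(R_SA)                                          # unreliable if some i(k) ≈ 0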
Cramér-Rao lower bound: Let p(y; θ) denote the likelihood of the data y. For any unbiased estimator θ̂ of θ, the Cramér-Rao lower bound states that var(θ̂) ≥ I(θ)⁻¹ (in the matrix sense for vector θ), where I(θ) := E[−∇θ² log p(y; θ)] is the Fisher information matrix. Note that
−∇θ log p(y; θ) = −(1/p(y; θ)) ∇θ p(y; θ).  (12)
Therefore,
−∇θ² log p(y; θ) = −(1/p(y; θ)) ∇θ² p(y; θ) + (1/p(y; θ))² ∇θᵀ p(y; θ) ∇θ p(y; θ)
= −(1/p(y; θ)) ∇θ² p(y; θ) + (∇θᵀ log p(y; θ))(∇θ log p(y; θ))
⇒ I(θ) = E[−∇θ² log p(y; θ)] = −Σ_y ∇θ² p(y; θ) + Σ_y (∇θᵀ log p(y; θ))(∇θ log p(y; θ)) p(y; θ)
= −∇θ² Σ_y p(y; θ) + E[(∇θᵀ log p(y; θ))(∇θ log p(y; θ))]
= E[(∇θᵀ log p(y; θ))(∇θ log p(y; θ))],  (13)
where the first term vanishes since Σ_y p(y; θ) = 1.
Example 0.6. Consider X ∼ Bin(n, θ) with pX(x; θ) = C(n, x) θ^x (1 − θ)^{n−x}, where θ is an unknown parameter. Then the scalar Fisher information for the random variable X is computed as follows. Note that
log p(x; θ) = log C(n, x) + x log θ + (n − x) log(1 − θ).
Therefore,
∂ log p(x; θ)/∂θ = x/θ − (n − x)/(1 − θ),   −∂² log p(x; θ)/∂θ² = x/θ² + (n − x)/(1 − θ)²,
and taking expectations with E[X] = nθ gives
I(θ) = nθ/θ² + n(1 − θ)/(1 − θ)² = n/(θ(1 − θ)),
so any unbiased estimator θ̂ of θ satisfies var(θ̂) ≥ θ(1 − θ)/n. Similarly, for the DC level example above, ∂² log p(x; A)/∂A² = −N/σ², so I(A) = N/σ². Therefore, if  is an unbiased estimator of A, then var(Â) ≥ σ²/N.
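A numerical check (a sketch assuming NumPy; θ = 0.3 and n = 50 are arbitrary) that the variance of the MLE X/n from Example 0.1 matches the Cramér-Rao bound θ(1 − θ)/n = 1/I(θ):

import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 0.3, 50, 200_000   # arbitrary illustration values

X = rng.binomial(n, theta, size=trials)
theta_hat = X / n                     # MLE for the binomial parameter

crlb = theta * (1 - theta) / n        # 1 / I(theta) with I(theta) = n / (theta (1 - theta))
print(theta_hat.var(), crlb)          # the MLE attains the bound here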
Unbiased estimators attaining the Cramér-Rao lower bound are called minimum variance unbiased estimators (MVUE). For ease of computation, we look for the best linear unbiased estimator (BLUE), as an MVUE cannot always be found in practice. Linear estimators are linear in the data.
Bayesian estimation
• Bayesian statistics treats unknown parameters as random variables with known prior distributions.
These prior beliefs are then updated to posterior beliefs (using Bayes’ rule) after observations/measured
data.
• The principal Bayesian inference methods are maximum a posteriori probability (MAP) estimation and minimum mean squared error (MMSE) estimation.
Using the prior and the conditional distribution, we calculate pΘ|X or fΘ|X, called the posterior, using Bayes' rule. The four versions of Bayes' rule are as follows.
1. Θ discrete, X discrete
pΘ|X(θ|x) = pΘ(θ) pX|Θ(x|θ) / Σ_{θ′} pΘ(θ′) pX|Θ(x|θ′).  (14)
This is useful in hypothesis testing and classification problems with discrete data.
2. Θ discrete, X continuous
pΘ|X(θ|x) = pΘ(θ) fX|Θ(x|θ) / Σ_{θ′} pΘ(θ′) fX|Θ(x|θ′).  (15)
This is useful in hypothesis testing problems with continuous data, for example, binary signal detection in the presence of Gaussian noise (see the sketch after this list).
3. Θ continuous, X discrete
fΘ|X(θ|x) = fΘ(θ) pX|Θ(x|θ) / ∫ fΘ(θ′) pX|Θ(x|θ′) dθ′.  (16)
This is useful in estimation problems with discrete data, for example, a coin with an unknown parameter θ where the observation is the number of heads in n tosses, or estimating the parameters of a model from discrete measurements. A real-world example is estimating the parameters of the laws of motion from discrete measurements of the time and position of an object (for example, curve fitting).
4. Θ continuous, X continuous
fΘ|X(θ|x) = fΘ(θ) fX|Θ(x|θ) / ∫ fΘ(θ′) fX|Θ(x|θ′) dθ′.  (17)
This is useful in estimation problems with continuous data. For example, estimating parameters of a
model from observed continuous signals (for example, system identification).
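A numerical sketch of case 2 (assuming NumPy; the uniform prior, unit noise variance and the observed value are arbitrary illustration values, and the Gaussian PDF is written out by hand):

import numpy as np

# Case 2: Theta is a binary symbol (0 or 1) with a uniform prior, and
# X = Theta + Gaussian noise is observed. Posterior computed via (15).
p0, p1, sigma, x = 0.5, 0.5, 1.0, 0.8

def gauss_pdf(x, mean, sigma):
    return np.exp(-(x - mean) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

num0 = p0 * gauss_pdf(x, 0.0, sigma)   # p_Theta(0) * f_{X|Theta}(x | 0)
num1 = p1 * gauss_pdf(x, 1.0, sigma)   # p_Theta(1) * f_{X|Theta}(x | 1)
print(num1 / (num0 + num1))            # P(Theta = 1 | X = x), here ≈ 0.57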
MAP estimation
The posterior distribution is either a PMF pΘ|X(·|X) or a PDF fΘ|X(·|X). To find the estimate of Θ given X, we use the MAP rule: the estimate is
θ̂ = argmaxθ pΘ|X(θ|x)
for discrete Θ, and
θ̂ = argmaxθ fΘ|X(θ|x)
for continuous Θ. For a continuous random variable Θ, the conditional expectation
E[Θ|X = x] = ∫ θ fΘ|X(θ|x) dθ  (22)
can be a better estimate than the MAP estimator. In general, a priori, there is no reason for choosing one estimator over the other unless the objectives are precisely stated.
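The following sketch (assuming NumPy; the uniform prior on θ and the data k = 7 heads in n = 10 tosses are arbitrary choices) computes both estimates for the coin example of case 3 and shows that they differ:

import numpy as np

# Coin bias Theta with a uniform prior on [0, 1]; observe k heads in n tosses.
# Compare the MAP estimate (posterior mode) with the conditional expectation.
n, k = 10, 7
theta = np.linspace(0.001, 0.999, 999)

posterior = theta**k * (1 - theta) ** (n - k)   # uniform prior: posterior ∝ likelihood
posterior /= np.trapz(posterior, theta)         # normalize as in (16)

theta_map = theta[np.argmax(posterior)]         # MAP estimate, here k/n = 0.7
theta_ce = np.trapz(theta * posterior, theta)   # E[Theta | X = x] ≈ (k+1)/(n+2) ≈ 0.667
print(theta_map, theta_ce)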
MMSE estimator
Bayesian MSE is defined as
Bmse(Θ̂) := E[(Θ − Θ̂)²].  (23)
Using the joint PDF p(x, Θ),
Bmse(Θ̂) = ∫∫ (Θ − Θ̂)² p(x, Θ) dx dΘ.  (24)
Note that
p(Θ|x) = p(x|Θ) p(Θ) / p(x) = p(x|Θ) p(Θ) / ∫ p(x|Θ) p(Θ) dΘ,
where p(Θ) is the prior PDF on Θ. It turns out that the MMSE estimator is given by
Θ̂ = E[Θ|x]. (25)
For linear models of the observation x as a function of Θ and Gaussian noise assumptions, MMSE estima-
tors can be computed analytically. However, in general cases, computation of MMSE is computationally
intensive. Therefore, one looks for linear estimators which minimize MSE rather than computing E[Θ|x].
LMMSE estimators (for scalar estimation) are of the form
Θ̂linear = Σ_{n=0}^{N−1} an x(n) + aN.  (26)
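As a small sketch of the linear form (26) (assuming NumPy; the scalar Gaussian model Θ ~ N(0, 1), x(n) = Θ + w(n) with unit noise variance and N = 3 observations are arbitrary choices), the coefficients a0, . . . , aN can be fitted by least squares over simulated (x, Θ) pairs rather than derived analytically:

import numpy as np

rng = np.random.default_rng(0)
N, trials = 3, 100_000
sigma_theta, sigma_w = 1.0, 1.0   # prior and noise standard deviations (arbitrary)

Theta = sigma_theta * rng.standard_normal(trials)
X = Theta[:, None] + sigma_w * rng.standard_normal((trials, N))   # x(n) = Theta + w(n)

# Fit Theta ≈ a_0 x(0) + ... + a_{N-1} x(N-1) + a_N over the simulated samples.
A = np.hstack([X, np.ones((trials, 1))])
a, *_ = np.linalg.lstsq(A, Theta, rcond=None)

print(a)   # ≈ 0.25 for the first N coefficients and ≈ 0 for a_N in this setting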
• One can use the classical statistics approach or the Bayesian approach for modeling. For parametrized models, in the classical approach one treats the parameters as deterministic unknowns, whereas in Bayesian modeling some prior knowledge of the parameters is assumed and is updated after measurements.
• We will use the classical approach to modeling and system identification, but the Bayesian modeling approach can also be used in appropriate cases.
• In estimation theory, both approaches are used. For example, least squares estimation belongs to classical estimation theory and is commonly used in many applications, e.g., estimating a true signal from its noisy measurements. The Kalman filter uses the Bayesian approach for estimating the state of a dynamical system.
References
[1] D. Bertsekas, J. Tsitsiklis, Introduction to Probability, 2nd edition, 2008.
[3] M. Diehl, Lecture notes on Modeling and System Identification, Lecture notes and video lectures, 2020.