
1

Introduction

In this chapter we briefly describe the types of problems with which we will
be concerned. Then we define some notation and review some basic concepts
from probability theory and statistical inference.

1.1 What Is Nonparametric Inference?


The basic idea of nonparametric inference is to use data to infer an unknown
quantity while making as few assumptions as possible. Usually, this means
using statistical models that are infinite-dimensional. Indeed, a better name
for nonparametric inference might be infinite-dimensional inference. But it is
difficult to give a precise definition of nonparametric inference, and if I did
venture to give one, no doubt I would be barraged with dissenting opinions.
For the purposes of this book, we will use the phrase nonparametric inference
to refer to a set of modern statistical methods that aim to keep the
number of underlying assumptions as weak as possible. Specifically, we will
consider the following problems:

1. (Estimating the distribution function). Given an iid sample X1, . . ., Xn ∼ F,
estimate the cdf F(x) = P(X ≤ x). (Chapter 2.)

2. (Estimating functionals). Given an iid sample X1, . . ., Xn ∼ F, estimate
a functional T(F) such as the mean T(F) = ∫ x dF(x). (Chapters 2
and 3.)

3. (Density estimation). Given an iid sample X1, . . ., Xn ∼ F, estimate the
density f(x) = F′(x). (Chapters 4, 6 and 8.)

4. (Nonparametric regression or curve estimation). Given (X1, Y1), . . ., (Xn, Yn),
estimate the regression function r(x) = E(Y |X = x). (Chapters 4, 5, 8
and 9.)

5. (Normal means). Given Yi ∼ N(θi, σ²), i = 1, . . ., n, estimate θ =
(θ1, . . ., θn). This apparently simple problem turns out to be very complex
and provides a unifying basis for much of nonparametric inference.
(Chapter 7.)

In addition, we will discuss some unifying theoretical principles in Chapter
7. We consider a few miscellaneous problems in Chapter 10, such as measurement
error, inverse problems and testing.
Typically, we will assume that the distribution F (or density f or regression
function r) lies in some large set F called a statistical model. For example,
when estimating a density f, we might assume that

    f ∈ F = { g : ∫ (g′′(x))² dx ≤ c² }

which is the set of densities that are not “too wiggly.”

1.2 Notation and Background


Here is a summary of some useful notation and background. See also
Table 1.1.
Let a(x) be a function of x and let F be a cumulative distribution function.
If F is absolutely continuous, let f denote its density. If F is discrete, let f
denote instead its probability mass function. The mean of a is

    E(a(X)) = ∫ a(x) dF(x) ≡ { ∫ a(x) f(x) dx       continuous case
                               Σ_j a(x_j) f(x_j)    discrete case.

Let V(X) = E(X − E(X))² denote the variance of a random variable. If
X1, . . ., Xn are n observations, then ∫ a(x) dFn(x) = n⁻¹ Σ_i a(Xi), where Fn
is the empirical distribution that puts mass 1/n at each observation Xi.
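
To make the plug-in idea concrete, here is a minimal sketch in Python (my own
illustration; NumPy and the particular simulated sample are assumptions, not
part of the text) that builds the empirical distribution Fn from data and
evaluates the plug-in mean ∫ x dFn(x):

    import numpy as np

    def ecdf(data):
        # Empirical cdf: Fn(t) = (1/n) * #{i : X_i <= t}.
        x = np.sort(np.asarray(data, dtype=float))
        n = x.size
        def Fn(t):
            return np.searchsorted(x, t, side="right") / n
        return Fn

    rng = np.random.default_rng(0)
    sample = rng.exponential(scale=2.0, size=500)   # stand-in for an unknown F

    Fn = ecdf(sample)
    print(Fn(2.0))          # plug-in estimate of F(2) = P(X <= 2)
    print(sample.mean())    # plug-in estimate of the mean: integral of x dFn(x)

Returning Fn as a function mirrors the definition above: all the mass sits at
the observations, so integrals against dFn reduce to averages over the sample.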

    Symbol            Definition
    xn = o(an)        limn→∞ xn/an = 0
    xn = O(an)        |xn/an| is bounded for all large n
    an ∼ bn           an/bn → 1 as n → ∞
    an ≍ bn           an/bn and bn/an are bounded for all large n
    Xn ⇝ X            convergence in distribution
    Xn →P X           convergence in probability
    Xn →a.s. X        almost sure convergence
    θ̂n                estimator of parameter θ
    bias              E(θ̂n) − θ
    se                √V(θ̂n) (standard error)
    ŝe                estimated standard error
    mse               E(θ̂n − θ)² (mean squared error)
    Φ                 cdf of a standard Normal random variable
    zα                Φ⁻¹(1 − α)

TABLE 1.1. Some useful notation.

Brief Review of Probability. The sample space Ω is the set of possible
outcomes of an experiment. Subsets of Ω are called events. A class of events
A is called a σ-field if (i) ∅ ∈ A, (ii) A ∈ A implies that Ac ∈ A and (iii)
A1, A2, . . . ∈ A implies that ∪_{i=1}^∞ Ai ∈ A. A probability measure is a
function P defined on a σ-field A such that P(A) ≥ 0 for all A ∈ A, P(Ω) = 1
and if A1, A2, . . . ∈ A are disjoint then

    P( ∪_{i=1}^∞ Ai ) = Σ_{i=1}^∞ P(Ai).

The triple (Ω, A, P) is called a probability space. A random variable is a
map X : Ω → R such that, for every real x, {ω ∈ Ω : X(ω) ≤ x} ∈ A.
A sequence of random variables Xn converges in distribution (or converges
weakly) to a random variable X, written Xn ⇝ X, if

    P(Xn ≤ x) → P(X ≤ x)                                         (1.1)

as n → ∞, at all points x at which the cdf

    F(x) = P(X ≤ x)                                              (1.2)

is continuous. A sequence of random variables Xn converges in probability
to a random variable X, written Xn →P X, if,

    for every ε > 0, P(|Xn − X| > ε) → 0 as n → ∞.               (1.3)

A sequence of random variables Xn converges almost surely to a random
variable X, written Xn →a.s. X, if

    P( lim_{n→∞} |Xn − X| = 0 ) = 1.                             (1.4)

The following implications hold:

    Xn →a.s. X   implies that   Xn →P X   implies that   Xn ⇝ X.     (1.5)

Let g be a continuous function. Then, according to the continuous mapping
theorem,

    Xn ⇝ X       implies that   g(Xn) ⇝ g(X)
    Xn →P X      implies that   g(Xn) →P g(X)
    Xn →a.s. X   implies that   g(Xn) →a.s. g(X).

According to Slutsky's theorem, if Xn ⇝ X and Yn ⇝ c for some constant
c, then Xn + Yn ⇝ X + c and Xn Yn ⇝ cX.
Let X1, . . ., Xn ∼ F be iid. The weak law of large numbers says that if
E|g(X1)| < ∞, then n⁻¹ Σ_{i=1}^n g(Xi) →P E(g(X1)). The strong law of large
numbers says that if E|g(X1)| < ∞, then n⁻¹ Σ_{i=1}^n g(Xi) →a.s. E(g(X1)).
The random variable Z has a standard Normal distribution if it has density
φ(z) = (2π)^{−1/2} e^{−z²/2} and we write Z ∼ N(0, 1). The cdf is denoted by
Φ(z). The α upper quantile is denoted by zα. Thus, if Z ∼ N(0, 1), then
P(Z > zα) = α.
If E(g²(X1)) < ∞, the central limit theorem says that

    √n (Ȳn − µ) ⇝ N(0, σ²)                                       (1.6)

where Yi = g(Xi), µ = E(Y1), Ȳn = n⁻¹ Σ_{i=1}^n Yi and σ² = V(Y1). In general,
if

    (Xn − µ)/σn ⇝ N(0, 1)

then we will write

    Xn ≈ N(µ, σn²).                                              (1.7)

According to the delta method, if g is differentiable at µ and g′(µ) ≠ 0
then

    √n (Xn − µ) ⇝ N(0, σ²)  =⇒  √n (g(Xn) − g(µ)) ⇝ N(0, (g′(µ))² σ²).   (1.8)

A similar result holds in the vector case. Suppose that Xn is a sequence of
random vectors such that √n (Xn − µ) ⇝ N(0, Σ), a multivariate, mean 0

normal with covariance matrix Σ. Let g be differentiable with gradient ∇g


such that ∇µ = 0 where ∇µ is ∇g evaluated at µ. Then

n(g(Xn ) − g(µ))  N 0, ∇Tµ Σ∇µ . (1.9)
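
As a quick sanity check on the scalar delta method (1.8), the following Monte
Carlo sketch (my own illustration in Python; the choice g(x) = x², the
Exponential(1) population and the sample sizes are assumptions, not from the
text) compares the simulated variance of √n(g(X̄n) − g(µ)) with the
delta-method value (g′(µ))²σ²:

    import numpy as np

    rng = np.random.default_rng(1)
    n, reps = 200, 5000
    mu, sigma2 = 1.0, 1.0                    # Exponential(1): mean 1, variance 1
    g = lambda x: x ** 2                     # smooth g with g'(mu) = 2*mu != 0
    gprime = lambda x: 2 * x

    samples = rng.exponential(scale=1.0, size=(reps, n))
    xbar = samples.mean(axis=1)
    stat = np.sqrt(n) * (g(xbar) - g(mu))    # sqrt(n) (g(Xbar_n) - g(mu))

    print("simulated variance:   ", stat.var())
    print("delta-method variance:", gprime(mu) ** 2 * sigma2)

For moderate n the two numbers should be close; the delta method is exactly the
statement that they agree in the limit.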

Statistical Concepts. Let F = {f(x; θ) : θ ∈ Θ} be a parametric model
satisfying appropriate regularity conditions. The likelihood function based
on iid observations X1, . . ., Xn is

    Ln(θ) = ∏_{i=1}^n f(Xi; θ)

and the log-likelihood function is ℓn(θ) = log Ln(θ). The maximum likelihood
estimator, or mle θ̂n, is the value of θ that maximizes the likelihood. The
score function is s(X; θ) = ∂ log f(X; θ)/∂θ. Under appropriate regularity
conditions, the score function satisfies Eθ(s(X; θ)) = ∫ s(x; θ) f(x; θ) dx = 0.
Also,

    √n (θ̂n − θ) ⇝ N(0, τ²(θ))

where τ²(θ) = 1/I(θ) and

    I(θ) = Vθ(s(X; θ)) = Eθ(s²(X; θ)) = −Eθ( ∂² log f(X; θ)/∂θ² )

is the Fisher information. Also,

    (θ̂n − θ)/ŝe ⇝ N(0, 1)

where ŝe² = 1/(n I(θ̂n)). The Fisher information In from n observations
satisfies In(θ) = nI(θ); hence we may also write ŝe² = 1/In(θ̂n).
The bias of an estimator θ̂n is E(θ̂n) − θ and the mean squared error mse
is mse = E(θ̂n − θ)². The bias–variance decomposition for the mse of an
estimator θ̂n is

    mse = bias²(θ̂n) + V(θ̂n).                                    (1.10)
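
To see these formulas in action, here is a small sketch (my own Python
illustration; the Bernoulli(θ) model, for which I(θ) = 1/(θ(1 − θ)), and the
simulated data are assumptions) that computes the mle and the standard error
ŝe = 1/√(n I(θ̂n)):

    import numpy as np

    rng = np.random.default_rng(2)
    theta_true = 0.3
    x = rng.binomial(1, theta_true, size=400)              # iid Bernoulli(theta) sample
    n = x.size

    theta_hat = x.mean()                                   # mle for the Bernoulli model
    fisher_info = 1.0 / (theta_hat * (1.0 - theta_hat))    # I(theta_hat)
    se_hat = 1.0 / np.sqrt(n * fisher_info)                # se_hat^2 = 1 / (n I(theta_hat))

    print("mle:", theta_hat, "  estimated standard error:", se_hat)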

1.3 Confidence Sets


Much of nonparametric inference is devoted to finding an estimator θ̂n of
some quantity of interest θ. Here, for example, θ could be a mean, a density
or a regression function. But we also want to provide confidence sets for these
quantities. There are different types of confidence sets, as we now explain.

Let F be a class of distribution functions F and let θ be some quantity of
interest. Thus, θ might be F itself, or F′ or the mean of F, and so on. Let
Cn be a set of possible values of θ which depends on the data X1, . . ., Xn. To
emphasize that probability statements depend on the underlying F we will
sometimes write PF.

1.11 Definition. Cn is a finite sample 1 − α confidence set if

    inf_{F∈F} PF(θ ∈ Cn) ≥ 1 − α    for all n.                   (1.12)

Cn is a uniform asymptotic 1 − α confidence set if

    lim inf_{n→∞} inf_{F∈F} PF(θ ∈ Cn) ≥ 1 − α.                  (1.13)

Cn is a pointwise asymptotic 1 − α confidence set if,

    for every F ∈ F,  lim inf_{n→∞} PF(θ ∈ Cn) ≥ 1 − α.          (1.14)

If || · || denotes some norm and f̂n is an estimate of f, then a confidence
ball for f is a confidence set of the form

    Cn = { f ∈ F : ||f − f̂n|| ≤ sn }                             (1.15)

where sn may depend on the data. Suppose that f is defined on a set X. A
pair of functions (ℓ, u) is a 1 − α confidence band or confidence envelope
if

    inf_{f∈F} P( ℓ(x) ≤ f(x) ≤ u(x) for all x ∈ X ) ≥ 1 − α.     (1.16)

Confidence balls and bands can be finite sample, pointwise asymptotic and
uniform asymptotic as above. When estimating a real-valued quantity instead
of a function, Cn is just an interval and we call Cn a confidence interval.
Ideally, we would like to find finite sample confidence sets. When this is
not possible, we try to construct uniform asymptotic confidence sets. The
last resort is a pointwise asymptotic confidence interval. If Cn is a uniform
asymptotic confidence set, then the following is true: for any δ > 0 there exists
an n(δ) such that the coverage of Cn is at least 1 − α − δ for all n > n(δ).
With a pointwise asymptotic confidence set, there may not exist a finite n(δ).
In this case, the sample size at which the confidence set has coverage close to
1 − α will depend on f (which we don’t know).

1.17 Example. Let X1, . . ., Xn ∼ Bernoulli(p). A pointwise asymptotic 1 − α
confidence interval for p is

    p̂n ± zα/2 √( p̂n(1 − p̂n)/n )                                  (1.18)

where p̂n = n⁻¹ Σ_{i=1}^n Xi. It follows from Hoeffding's inequality (1.24) that a
finite sample confidence interval is

    p̂n ± √( (1/(2n)) log(2/α) ).  ∎                              (1.19)
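
For concreteness, the sketch below (my own Python illustration; SciPy is
assumed only for the Normal quantile zα/2, and the values of n, p and α are
arbitrary) computes both the pointwise asymptotic interval (1.18) and the
finite sample Hoeffding interval (1.19) from the same simulated Bernoulli
sample:

    import numpy as np
    from scipy.stats import norm          # assumed available for the Normal quantile

    rng = np.random.default_rng(3)
    alpha, p, n = 0.05, 0.2, 100
    x = rng.binomial(1, p, size=n)
    p_hat = x.mean()
    z = norm.ppf(1 - alpha / 2)           # z_{alpha/2}

    half_wald = z * np.sqrt(p_hat * (1 - p_hat) / n)    # (1.18)
    half_hoeff = np.sqrt(np.log(2 / alpha) / (2 * n))   # (1.19)

    print("asymptotic (1.18):", (p_hat - half_wald, p_hat + half_wald))
    print("Hoeffding  (1.19):", (p_hat - half_hoeff, p_hat + half_hoeff))

The Hoeffding interval is noticeably wider; that is the price paid for a
guarantee that holds at every sample size.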

1.20 Example (Parametric models). Let

    F = {f(x; θ) : θ ∈ Θ}

be a parametric model with scalar parameter θ and let θ̂n be the maximum
likelihood estimator, the value of θ that maximizes the likelihood function

    Ln(θ) = ∏_{i=1}^n f(Xi; θ).

Recall that under suitable regularity assumptions,

    θ̂n ≈ N(θ, ŝe²)

where

    ŝe = (In(θ̂n))^{−1/2}

is the estimated standard error of θ̂n and In(θ) is the Fisher information.
Then

    θ̂n ± zα/2 ŝe

is a pointwise asymptotic confidence interval. If τ = g(θ) we can get an
asymptotic confidence interval for τ using the delta method. The mle for
τ is τ̂n = g(θ̂n). The estimated standard error for τ̂n is ŝe(τ̂n) = ŝe(θ̂n)|g′(θ̂n)|.
The confidence interval for τ is

    τ̂n ± zα/2 ŝe(τ̂n) = τ̂n ± zα/2 ŝe(θ̂n)|g′(θ̂n)|.

Again, this is typically a pointwise asymptotic confidence interval. ∎
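
As one worked instance of this construction (my own example, not from the
text): for an Exponential(θ) model with density f(x; θ) = θ e^{−θx}, the mle is
θ̂n = 1/X̄n and I(θ) = 1/θ², and for τ = g(θ) = 1/θ (the mean) the sketch below
assembles τ̂n ± zα/2 ŝe(θ̂n)|g′(θ̂n)|:

    import numpy as np
    from scipy.stats import norm          # assumed available for the Normal quantile

    rng = np.random.default_rng(4)
    alpha, n = 0.05, 200
    x = rng.exponential(scale=2.0, size=n)     # rate theta = 0.5, mean tau = 2

    theta_hat = 1.0 / x.mean()                 # mle of the rate
    se_theta = theta_hat / np.sqrt(n)          # se_hat = 1 / sqrt(n I(theta_hat)), I(theta) = 1/theta^2

    tau_hat = 1.0 / theta_hat                  # tau_hat = g(theta_hat)
    se_tau = se_theta / theta_hat ** 2         # se_hat(tau_hat) = se_hat(theta_hat) |g'(theta_hat)|

    z = norm.ppf(1 - alpha / 2)
    print("delta-method CI for tau:", (tau_hat - z * se_tau, tau_hat + z * se_tau))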



1.4 Useful Inequalities


At various times in this book we will need to use certain inequalities. For
reference purposes, a number of these inequalities are recorded here.

Markov’s Inequality. Let X be a non-negative random variable and suppose


that E(X) exists. For any t > 0,

    P(X > t) ≤ E(X)/t.                                           (1.21)

Chebyshev's Inequality. Let µ = E(X) and σ² = V(X). Then,

    P(|X − µ| ≥ t) ≤ σ²/t².                                      (1.22)

Hoeffding's Inequality. Let Y1, . . ., Yn be independent observations such that
E(Yi) = 0 and ai ≤ Yi ≤ bi. Let ε > 0. Then, for any t > 0,

    P( Σ_{i=1}^n Yi ≥ ε ) ≤ e^{−tε} ∏_{i=1}^n e^{t²(bi − ai)²/8}.    (1.23)

Hoeffding's Inequality for Bernoulli Random Variables. Let X1, . . ., Xn ∼ Bernoulli(p).
Then, for any ε > 0,

    P( |X̄n − p| > ε ) ≤ 2e^{−2nε²}                               (1.24)

where X̄n = n⁻¹ Σ_{i=1}^n Xi.
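
As a quick numerical illustration (my own; the values of p, n and ε are
arbitrary choices), the sketch below estimates P(|X̄n − p| > ε) by simulation
and compares it with the bound 2e^{−2nε²} from (1.24):

    import numpy as np

    rng = np.random.default_rng(5)
    p, n, eps, reps = 0.2, 100, 0.1, 20000

    xbar = rng.binomial(n, p, size=reps) / n        # reps independent values of Xbar_n
    empirical = np.mean(np.abs(xbar - p) > eps)     # Monte Carlo estimate of P(|Xbar_n - p| > eps)
    bound = 2 * np.exp(-2 * n * eps ** 2)           # Hoeffding bound (1.24)

    print("simulated probability:", empirical)
    print("Hoeffding bound:      ", bound)

The bound is loose, but unlike the simulated value it requires no knowledge of
p and holds for every n.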

Mill's Inequality. If Z ∼ N(0, 1) then, for any t > 0,

    P(|Z| > t) ≤ 2φ(t)/t                                         (1.25)

where φ is the standard Normal density. In fact, for any t > 0,

    (1/t − 1/t³) φ(t) < P(Z > t) < (1/t) φ(t)                    (1.26)

and

    P(Z > t) < (1/2) e^{−t²/2}.                                  (1.27)

Berry–Esséen Bound. Let X1, . . ., Xn be iid with finite mean µ = E(X1),
variance σ² = V(X1) and third moment, E|X1|³ < ∞. Let Zn = √n (X̄n − µ)/σ.
Then

    supz |P(Zn ≤ z) − Φ(z)| ≤ (33/4) E|X1 − µ|³ / (√n σ³).       (1.28)

Bernstein's Inequality. Let X1, . . ., Xn be independent, zero mean random
variables such that −M ≤ Xi ≤ M. Then

    P( |Σ_{i=1}^n Xi| > t ) ≤ 2 exp{ −(1/2) t²/(v + Mt/3) }      (1.29)

where v ≥ Σ_{i=1}^n V(Xi).

Bernstein's Inequality (Moment version). Let X1, . . ., Xn be independent, zero
mean random variables such that

    E|Xi|^m ≤ m! M^{m−2} vi / 2

for all m ≥ 2 and some constants M and vi. Then,

    P( |Σ_{i=1}^n Xi| > t ) ≤ 2 exp{ −(1/2) t²/(v + Mt) }        (1.30)

where v = Σ_{i=1}^n vi.

Cauchy–Schwarz Inequality. If X and Y have finite variances then

    E|XY| ≤ √( E(X²) E(Y²) ).                                    (1.31)

Recall that a function g is convex if for each x, y and each α ∈ [0, 1],

    g(αx + (1 − α)y) ≤ α g(x) + (1 − α) g(y).

If g is twice differentiable, then convexity reduces to checking that g′′(x) ≥ 0
for all x. It can be shown that if g is convex then it lies above any line that
touches g at some point, called a tangent line. A function g is concave if
−g is convex. Examples of convex functions are g(x) = x² and g(x) = e^x.
Examples of concave functions are g(x) = −x² and g(x) = log x.

Jensen’s inequality. If g is convex then

Eg(X) ≥ g(EX). (1.32)



If g is concave then
Eg(X) ≤ g(EX). (1.33)

1.5 Bibliographic Remarks


References on probability inequalities and their use in statistics and pattern
recognition include Devroye et al. (1996) and van der Vaart and Wellner
(1996). To review basic probability and mathematical statistics, I recommend
Casella and Berger (2002), van der Vaart (1998) and Wasserman (2004).

1.6 Exercises
1. Consider Example 1.17. Prove that (1.18) is a pointwise asymptotic
confidence interval. Prove that (1.19) is a uniform confidence interval.

2. (Computer experiment). Compare the coverage and length of (1.18) and
(1.19) by simulation. Take p = 0.2 and use α = 0.05. Try various sample
sizes n. How large must n be before the pointwise interval has accurate
coverage? How do the lengths of the two intervals compare when this
sample size is reached?

3. Let X1, . . ., Xn ∼ N(µ, 1). Let Cn = X̄n ± zα/2/√n. Is Cn a finite
sample, pointwise asymptotic, or uniform asymptotic confidence set
for µ?

4. Let X1, . . ., Xn ∼ N(µ, σ²). Let Cn = X̄n ± zα/2 Sn/√n where Sn² =
Σ_{i=1}^n (Xi − X̄n)²/(n − 1). Is Cn a finite sample, pointwise asymptotic,
or uniform asymptotic confidence set for µ?

5. Let X1, . . ., Xn ∼ F and let µ = ∫ x dF(x) be the mean. Let

    Cn = ( X̄n − zα/2 ŝe, X̄n + zα/2 ŝe )

where ŝe² = Sn²/n and

    Sn² = (1/n) Σ_{i=1}^n (Xi − X̄n)².

(a) Assuming that the mean exists, show that Cn is a 1 − α pointwise
asymptotic confidence interval.

(b) Show that Cn is not a uniform asymptotic confidence interval. Hint:
Let an → ∞ and εn → 0 and let Gn = (1 − εn)F + εn δn where δn is
a point mass at an. Argue that, with very high probability, for an large
and εn small, ∫ x dGn(x) is large but X̄n + zα/2 ŝe is not large.

(c) Suppose that P(|Xi| ≤ B) = 1 where B is a known constant. Use
Bernstein's inequality (1.29) to construct a finite sample confidence
interval for µ.
