
Adapting to Unknown Smoothness

R. M. Castro
May 20, 2011
1 Introduction
In this set of notes we see how to make the most of the oracle bounds we
proved in class. We begin by re-deriving our bounds for regression of Lipschitz
smooth functions, and then see how to generalize the result to regression of
functions of unknown smoothness.
Suppose we have a function f^* : [0,1] → [−R, R] satisfying the Lipschitz
smoothness assumption
\[
\forall\, s, t \in [0,1] \qquad |f^*(s) - f^*(t)| \le L |s - t| \ .
\]
We have seen these functions can be well approximated by piecewise constant
functions of the form
\[
g(x) = \sum_{j=1}^{m} c_j \,\mathbf{1}\{x \in I_j\} \ , \qquad \text{where } I_j = \left( \frac{j-1}{m}, \frac{j}{m} \right] \ .
\]
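As a small, purely illustrative sketch (not part of the original notes), the following Python snippet evaluates such a piecewise constant model on the partition {I_j}; the function name and the example coefficients c_j are placeholders.

import numpy as np

def piecewise_constant(x, c):
    """Evaluate g(x) = sum_j c_j 1{x in I_j} with I_j = ((j-1)/m, j/m]."""
    m = len(c)
    # bin index j-1 for each x; x = 0 is assigned to the first piece
    j = np.clip(np.ceil(np.asarray(x) * m).astype(int) - 1, 0, m - 1)
    return np.asarray(c)[j]

# example: m = 4 pieces with arbitrary coefficients
g = piecewise_constant(np.linspace(0, 1, 9), c=[0.1, -0.3, 0.2, 0.5])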
Let's use maximum likelihood to pick the best such model. Suppose we have
the following regression model
\[
Y_i = f^*(x_i) + \epsilon_i \ , \quad i = 1, \ldots, n \ ,
\]
where x_i = i/n and the ε_i are i.i.d. N(0, σ²). To be able to use the corollary we derived
we need a countable or finite class of models. The easiest way to do so is to
discretize/quantize the possible values of each constant piece in the candidate
models. Define
\[
\mathcal{F}_m = \left\{ \sum_{j=1}^{m} c_j \,\mathbf{1}\{x \in I_j\} \ : \ c_j \in Q \right\} \ ,
\]
where
\[
Q = \left\{ -R, \ -R\,\frac{n-1}{n}, \ \ldots, \ R \right\} = \left\{ \frac{(k-n)R}{n} \ : \ k = 0, \ldots, 2n \right\} \ .
\]
Therefore F_m has exactly (2n+1)^m elements in total. This means that, by
taking c(f) = log_2((2n+1)^m) = m log_2(2n+1) for all f ∈ F_m, we satisfy the
Kraft inequality
\[
\sum_{f \in \mathcal{F}_m} 2^{-c(f)} = \sum_{f \in \mathcal{F}_m} \frac{1}{|\mathcal{F}_m|} = 1 \ .
\]
So we are ready to apply our oracle bound. Since c(f) is just a constant (not
really a function of f) the estimator is simply the MLE
\[
\hat{f}_n = \arg\min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(x_i))^2 + \frac{4\sigma^2 c(f) \log 2}{n} \right\} = \arg\min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(x_i))^2 \right\} \ .
\]
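Because the models in F_m are piecewise constant with values restricted to the grid Q, this least-squares problem decouples over the bins: within each I_j the minimizing c_j is simply the point of Q closest to the average of the Y_i with x_i ∈ I_j. The snippet below is a minimal sketch of this computation (the function name is illustrative; it assumes m ≤ n so that every bin contains at least one design point).

import numpy as np

def mle_piecewise_constant(Y, m, R):
    """Quantized MLE over F_m: per-bin averages rounded to the grid Q."""
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    x = np.arange(1, n + 1) / n                      # design points x_i = i/n
    Q = (np.arange(2 * n + 1) - n) * R / n           # grid Q = {(k-n)R/n}
    bins = np.clip(np.ceil(x * m).astype(int) - 1, 0, m - 1)
    c = np.empty(m)
    for j in range(m):                               # assumes m <= n, so no bin is empty
        ybar = Y[bins == j].mean()                   # unquantized per-bin MLE
        c[j] = Q[np.argmin(np.abs(Q - ybar))]        # nearest grid value
    return c                                         # coefficients of the fitted f in F_m

Rounding the bin mean to the nearest grid point is exact here because the per-bin criterion is a quadratic in c_j, minimized over a finite set.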
The corollary then says that
\[
\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{f}_n(x_i) - f^*(x_i))^2 \right] \le 2 \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n} \sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 + \frac{4\sigma^2 c(f) \log 2}{n} \right\} \ .
\]
So far the result is extremely general, as we have not made use of the Lipschitz
assumption. We have seen earlier that there is a piecewise constant function
\[
\bar{f}_m(x) = \sum_{j=1}^{m} c_j \,\mathbf{1}\{x \in I_j\}
\]
such that for all x ∈ [0,1] we have |f^*(x) − \bar{f}_m(x)| ≤ L/m. The problem is that, generally, \bar{f}_m ∉ F_m since the c_j ∉ Q. Take instead the element of F_m that is closest to \bar{f}_m, namely
\[
\tilde{f}_m = \arg\min_{f \in \mathcal{F}_m} \ \sup_{x \in [0,1]} |f(x) - \bar{f}_m(x)| \ .
\]
It is clear that |\tilde{f}_m(x) − \bar{f}_m(x)| ≤ R/n for all x ∈ [0,1]; therefore, by the triangle inequality, we have
\[
|\tilde{f}_m(x) - f^*(x)| \le |f^*(x) - \bar{f}_m(x)| + |\bar{f}_m(x) - \tilde{f}_m(x)| \le \frac{L}{m} + \frac{R}{n} \ .
\]
Now, we can just use this in our bound
\[
\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{f}_n(x_i) - f^*(x_i))^2 \right] \le 2 \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n} \sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 + \frac{4\sigma^2 c(f) \log 2}{n} \right\}
\]
\[
\le \frac{2}{n} \sum_{i=1}^{n} \left( \frac{L}{m} + \frac{R}{n} \right)^2 + \frac{8\sigma^2 m \log_2(2n+1) \log 2}{n} = 2\left( \frac{L}{m} + \frac{R}{n} \right)^2 + \frac{8\sigma^2 m \log_2(2n+1) \log 2}{n} \ .
\]
So, to ensure the best bound possible we should choose m minimizing the right-
hand side. This yields m ≍ (n/\log n)^{1/3} and
\[
\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{f}_n(x_i) - f^*(x_i))^2 \right] = O\left( (n/\log n)^{-2/3} \right) \ ,
\]
which, apart from the logarithmic factor, is the best we can ever hope for (this
logarithmic factor is due to the discretization of the model classes, and is an
artifact of this approach). If we want the truly best possible bound we need
to minimize the above expression with respect to m, and for that we need to
know L. Can we do better? Can we automagically choose m using the data?
The answer is yes, and for this we will start taking full advantage of our oracle
bound.
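For completeness, here is the balancing computation behind the choice m ≍ (n/log n)^{1/3} above (a sketch added for these notes, ignoring the constants L, R and σ²): equate the squared approximation term and the penalty term and solve for m,
\[
\frac{1}{m^2} \asymp \frac{m \log n}{n} \quad \Longrightarrow \quad m^3 \asymp \frac{n}{\log n} \quad \Longrightarrow \quad m \asymp \left( \frac{n}{\log n} \right)^{1/3} \ ,
\]
at which point both terms are of order (n/\log n)^{-2/3}.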
Since we want to choose the best possible m we must consider the following
class of models
\[
\mathcal{F} = \bigcup_{m=1}^{\infty} \mathcal{F}_m \ .
\]
This is clearly a countable class of models (but not finite), so we need to be
a bit more careful in constructing the map c(·). Let's use a coding argument:
begin by defining
\[
m(f) = \min \{ m \in \mathbb{N} \ : \ f \in \mathcal{F}_m \} \ .
\]
Encode f ∈ F using first the bits 00…01 (m(f) bits in total) to encode m(f),
and then log_2 |F_{m(f)}| bits to encode which model inside F_{m(f)} is f. This is clearly
a prefix code and therefore satisfies the Kraft inequality. More formally,
\[
c(f) = m(f) + \log_2 |\mathcal{F}_{m(f)}| = m(f) + \log_2\left( (2n+1)^{m(f)} \right) = m(f)\left( 1 + \log_2(2n+1) \right) \ .
\]
Although we know, from the coding argument, that the map c(·) satisfies
the Kraft inequality for sure, we can do a little sanity check and ensure this is
indeed true (grouping the functions in F according to the value of m(f)):
\[
\sum_{f \in \mathcal{F}} 2^{-c(f)} = \sum_{m=1}^{\infty} \sum_{f \in \mathcal{F}_m : m(f) = m} 2^{-c(f)} = \sum_{m=1}^{\infty} \sum_{f \in \mathcal{F}_m : m(f) = m} 2^{-m - \log_2 |\mathcal{F}_m|}
\]
\[
\le \sum_{m=1}^{\infty} \sum_{f \in \mathcal{F}_m} 2^{-m - \log_2 |\mathcal{F}_m|} = \sum_{m=1}^{\infty} 2^{-m} \sum_{f \in \mathcal{F}_m} \frac{1}{|\mathcal{F}_m|} = \sum_{m=1}^{\infty} 2^{-m} = 1 \ .
\]
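The same sanity check can also be carried out numerically. The snippet below (an illustrative sketch; the values of n and the truncation level M are arbitrary) computes the truncated double sum over m = 1, …, M of |F_m| · 2^{−c(f)} with c(f) = m(1 + log_2(2n+1)), which equals 1 − 2^{−M} and therefore stays below 1.

import math

n, M = 100, 30                      # sample size and truncation level (arbitrary)
total = 0.0
for m in range(1, M + 1):
    size_Fm = (2 * n + 1) ** m      # |F_m| = (2n+1)^m
    c = m * (1 + math.log2(2 * n + 1))
    total += size_Fm * 2.0 ** (-c)  # contribution of all f in F_m
print(total, 1 - 2.0 ** (-M))       # both are (essentially) 1 - 2^{-M} <= 1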
Now, similarly to what we had before,
\[
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(x_i))^2 + \frac{4\sigma^2 c(f) \log 2}{n} \right\} = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(x_i))^2 + \frac{4\sigma^2 m(f)\left(1 + \log_2(2n+1)\right) \log 2}{n} \right\} \ ,
\]
which is no longer the MLE, but rather a penalized maximum likelihood estimator. Then
\[
\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{f}_n(x_i) - f^*(x_i))^2 \right] \le 2 \min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 + \frac{4\sigma^2 m(f)\left(1 + \log_2(2n+1)\right) \log 2}{n} \right\}
\]
\[
\le 2 \min_{m \in \mathbb{N}} \left\{ \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n} \sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 \right\} + \frac{4\sigma^2 m \left(1 + \log_2(2n+1)\right) \log 2}{n} \right\}
\]
\[
\le \min_{m \in \mathbb{N}} \left\{ 2 \left( \frac{L}{m} + \frac{R}{n} \right)^2 + \frac{8\sigma^2 m \left(1 + \log_2(2n+1)\right) \log 2}{n} \right\} \ .
\]
Therefore this estimator automatically chooses the best possible number of
parameters m. Note that the price we pay is very modest: the only change
from what we had before is that the term log_2(2n+1) is replaced by
1 + log_2(2n+1), which is a very minute change since log_2(2n+1) ≫ 1 for
reasonable sample sizes.
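To make the procedure concrete, here is a minimal sketch of this penalized estimator for the piecewise constant classes, reusing the per-bin quantized fit from before. It assumes σ² and R are known and simply caps m at n; function and variable names are illustrative, not part of the notes.

import numpy as np

def adaptive_piecewise_constant(Y, R, sigma2):
    """Penalized (maximum likelihood) selection of m and of f in F_m."""
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    x = np.arange(1, n + 1) / n
    Q = (np.arange(2 * n + 1) - n) * R / n              # quantization grid
    best = None
    for m in range(1, n + 1):                           # candidate model sizes
        bins = np.clip(np.ceil(x * m).astype(int) - 1, 0, m - 1)
        c = np.array([Q[np.argmin(np.abs(Q - Y[bins == j].mean()))]
                      for j in range(m)])
        risk = np.mean((Y - c[bins]) ** 2)              # empirical risk
        pen = 4 * sigma2 * m * (1 + np.log2(2 * n + 1)) * np.log(2) / n
        if best is None or risk + pen < best[0]:
            best = (risk + pen, m, c)
    return best[1], best[2]                             # selected m and coefficients

# usage sketch: m_hat, c_hat = adaptive_piecewise_constant(Y, R=1.0, sigma2=0.25)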
Although this is remarkable, we can probably do much better, and adapt to
unknown smoothness. We'll see how to do it in the next section.
2 Hölder smooth functions
For 0 < α ≤ 1, define the space of functions
\[
H^{\alpha}(C) = \left\{ f \ : \ \sup_{x,h} \frac{|f(x+h) - f(x)|}{|h|^{\alpha}} \le C \right\} \ ,
\]
for some constant 0 < C < ∞. This class contains functions that are bounded,
but less smooth than Lipschitz functions. Indeed, the space of Lipschitz functions
corresponds to α = 1. Functions in H^1 are Lipschitz, but functions in H^α with
α < 1 are generally not Lipschitz, although they are still (uniformly) continuous;
for example, f(x) = |x − 1/2|^{1/2} belongs to H^{1/2}(1) but is not Lipschitz.
Therefore a larger α corresponds to smoother functions. Since
we want functions that are smoother than Lipschitz it makes sense to look at
derivatives. If 1 < α ≤ 2, define
\[
H^{\alpha}(C) = \left\{ f \ : \ \frac{\partial f}{\partial x} \in H^{\alpha-1}(C) \right\} \ .
\]
In other words, H^α, 1 < α ≤ 2, contains differentiable functions whose first
derivative is Hölder smooth with smoothness α − 1.
If f ∈ H^α(C), 0 < α ≤ 2, then we say that f is Hölder smooth with
Hölder constant C. The notion of Hölder smoothness can also be extended to
α > 2 in a straightforward way. There are other equivalent ways of defining
Hölder smooth functions, in particular characterizing their local approximation
by polynomials. In particular, a function f is Hölder-α smooth if it has ⌊α⌋
derivatives and
\[
|f(x) - T_y^{\lfloor \alpha \rfloor}(x)| \le C |x - y|^{\alpha} \qquad \forall x, y \ ,
\]
where ⌊α⌋ is the largest integer such that ⌊α⌋ < α, and T_y^{⌊α⌋} is the Taylor
polynomial of degree ⌊α⌋ around the point y. In words, a Hölder-α smooth
function is locally well approximated by a polynomial of degree ⌊α⌋. In this
lecture we will work with the first definition (this will also give you an indication
why the two definitions are equivalent). Note: if a function is Hölder-α_2 smooth
and α_1 < α_2, then the function is also Hölder-α_1 smooth.
Note that, since Hölder smoothness essentially measures how differentiable
functions are, the Taylor polynomial is the natural way to approximate Hölder
smooth functions. We will focus on Hölder smooth function classes with
0 < α ≤ 2. Thus, we will work with piecewise linear approximations, i.e., Taylor
polynomials of degree 1. If we were to consider smoother functions, α > 2, we
would need to consider higher-degree Taylor polynomial approximations,
i.e. quadratic, cubic, etc.
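As a small numerical illustration (not part of the original notes) of how the piecewise Taylor approximation error depends on α, the sketch below takes the Hölder-α function f(x) = |x − 1/2|^α for a few values of α ∈ (0, 2] and measures the sup-norm error of the piecewise approximation on m bins (degree 0 for α ≤ 1, degree 1 otherwise); the measured errors should shrink roughly like m^{−α}.

import numpy as np

def taylor_piecewise_error(alpha, m, grid=10_000):
    """Sup-norm error of the piecewise Taylor approximation of f(x)=|x-1/2|**alpha."""
    f = lambda x: np.abs(x - 0.5) ** alpha
    x = np.linspace(0.0, 1.0, grid)
    left = np.minimum(np.floor(x * m), m - 1) / m         # left endpoint (j-1)/m of each bin
    if alpha <= 1:
        approx = f(left)                                  # piecewise constant (degree 0)
    else:
        df = lambda x: alpha * np.abs(x - 0.5) ** (alpha - 1) * np.sign(x - 0.5)
        approx = f(left) + df(left) * (x - left)          # piecewise linear (degree 1)
    return np.max(np.abs(f(x) - approx))

for alpha in (0.5, 1.0, 2.0):
    errs = [taylor_piecewise_error(alpha, m) for m in (4, 16, 64)]
    print(alpha, errs)                                    # errors shrink roughly like m**(-alpha)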
3 Regression of Hölder smooth functions
Consider the usual regression model
\[
Y_i = f^*(x_i) + \epsilon_i \ , \quad i = 1, \ldots, n \ ,
\]
where x_i = i/n and the ε_i are i.i.d. N(0, σ²). Let's assume f^* : [0,1] → [−R, R] is a
smooth function, in the sense that f^* ∈ H^α(C) for some unknown 0 < α ≤ 2.
Let's see how well we can estimate f^*. Intuitively, the smoother f^* is, the better
we should be able to estimate it: the smoother f^* is, the more averaging we
can perform to reduce noise. In other words, for smoother f^* we should average
over larger bins. Also, we will need to exploit the extra smoothness in our
approximation of f^*. To that end, we will consider candidate functions that are
piecewise linear, i.e., functions of the form
\[
\sum_{j=1}^{m} (a_j + b_j x) \,\mathbf{1}\{x \in I_j\} \ , \qquad \text{where } I_j = \left( \frac{j-1}{m}, \frac{j}{m} \right] \ .
\]
As before, we want to consider countable/finite classes of models to be able to
apply our corollary, so we will consider a slight modification of the above. Each
linear piece can be described by its values at the beginning and end points of the
corresponding interval, so we are going to restrict those values to lie on a grid
(see Figure 1).
Define the class
\[
\mathcal{F}_m = \left\{ f(x) = \sum_{j=1}^{m} \theta_j(x) \,\mathbf{1}\{x \in I_j\} \right\} \ ,
\]
Figure 1: Example of the discretization of f on the interval ((j-1)/m, j/m] (the endpoint values are restricted to a grid of levels).
where
\[
\theta_j(x) = \frac{x - (j-1)/m}{1/m} \, b_j + \frac{j/m - x}{1/m} \, a_j = (mx - j + 1)\, b_j + (j - mx)\, a_j \ ,
\]
and a_j, b_j ∈ \left\{ \frac{(k-\sqrt{n})R}{\sqrt{n}} \ : \ k = 0, \ldots, 2\sqrt{n} \right\}, a grid of 2√n + 1 equispaced levels spanning [−R, R]. Clearly |F_m| = (2√n + 1)^{2m}.
Since we don't know the smoothness α a priori, we must choose m using the
data. Therefore, in the same fashion as before, we take the class
\[
\mathcal{F} = \bigcup_{m=1}^{\infty} \mathcal{F}_m \ ,
\]
with m(f) = \min\{ m \in \mathbb{N} : f \in \mathcal{F}_m \}, and
\[
c(f) = m(f) + \log_2 |\mathcal{F}_{m(f)}| = m(f) \left( 1 + 2 \log_2 (2\sqrt{n} + 1) \right) \ .
\]
Exactly as before, define the estimator
\[
\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(x_i))^2 + \frac{4\sigma^2 m(f)\left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\} \ .
\]
Then
\[
\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{f}_n(x_i) - f^*(x_i))^2 \right] \le 2 \min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 + \frac{4\sigma^2 m(f)\left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\}
\]
\[
\le 2 \min_{m \in \mathbb{N}} \left\{ \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n} \sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 \right\} + \frac{4\sigma^2 m \left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\} \ .
\]
In the above, the first term is essentially our familiar approximation error, and
the second term is, in a sense, bounding the estimation error. Therefore this
estimator automatically seeks the best balance between the two. To say something
more concrete about the performance of the estimator we need to bring in the
assumptions we have on f^* (so far we haven't used any).
First, suppose f^* ∈ H^α(C) for 1 < α ≤ 2. We need to find a good model
in the class F_m that makes the approximation error small. Take x ∈ I_j where
j is arbitrary (so x can be any number in [0, 1]). From Taylor's theorem with
remainder we have
\[
f^*(x) = f^*\!\left( \frac{j-1}{m} \right) + \frac{\partial f^*}{\partial x}(\bar{x}) \left( x - \frac{j-1}{m} \right)
\]
for some \bar{x} \in \left[ \frac{j-1}{m}, x \right]. This suggests using a non-discretized piecewise linear
approximation of the form
\[
\bar{f}_m(x) = \sum_{j=1}^{m} \left( f^*\!\left( \frac{j-1}{m} \right) + \frac{\partial f^*}{\partial x}\!\left( \frac{j-1}{m} \right) \left( x - \frac{j-1}{m} \right) \right) \mathbf{1}\{x \in I_j\} \ .
\]
Note that this is not necessarily the best piecewise linear approximation to f^*,
but it is good enough for our purposes. What can we say about |\bar{f}_m(x) - f^*(x)|?
Take again x ∈ I_j. Now
\[
\left| \bar{f}_m(x) - f^*(x) \right| = \left| f^*\!\left( \frac{j-1}{m} \right) + \frac{\partial f^*}{\partial x}\!\left( \frac{j-1}{m} \right)\left( x - \frac{j-1}{m} \right) - f^*(x) \right|
\]
\[
= \left| f^*\!\left( \frac{j-1}{m} \right) + \frac{\partial f^*}{\partial x}\!\left( \frac{j-1}{m} \right)\left( x - \frac{j-1}{m} \right) - f^*\!\left( \frac{j-1}{m} \right) - \frac{\partial f^*}{\partial x}(\bar{x}) \left( x - \frac{j-1}{m} \right) \right|
\]
\[
= \left| \frac{\partial f^*}{\partial x}\!\left( \frac{j-1}{m} \right) - \frac{\partial f^*}{\partial x}(\bar{x}) \right| \left| x - \frac{j-1}{m} \right| \le C \left( \frac{1}{m} \right)^{\alpha-1} \frac{1}{m} = C m^{-\alpha} \ ,
\]
where the last line follows simply from the use of the smoothness assumption,
together with the fact that x − (j−1)/m ≤ 1/m and \bar{x} − (j−1)/m ≤ 1/m.
So, we have just shown that, for the piecewise linear function of the form considered,
\[
\forall x \in [0,1] \quad |\bar{f}_m(x) - f^*(x)| \le C m^{-\alpha} \ .
\]
Now, clearly \bar{f}_m is not necessarily in F_m, so we still need a bit of work. Let
f_m be the piecewise linear function (on the partition {I_j}) closest to f^*, in the sense that
\[
f_m = \arg\min_{f \ \text{piecewise linear}} \ \sup_{x \in [0,1]} |f(x) - f^*(x)| \ .
\]
Now take the function in F_m that is closest to that function,
\[
\tilde{f}_m = \arg\min_{f \in \mathcal{F}_m} \ \sup_{x \in [0,1]} |f(x) - f_m(x)| \ ;
\]
because of the way we discretized, we know that \sup_{x \in [0,1]} |\tilde{f}_m(x) - f_m(x)| \le R/\sqrt{n}.
Using this, together with the triangle inequality, yields
\[
\forall x \in [0,1] \quad |\tilde{f}_m(x) - f^*(x)| \le C m^{-\alpha} + \frac{R}{\sqrt{n}} \ .
\]
If f^* ∈ H^α(C) for 0 < α ≤ 1 we can proceed in a similar fashion, but simply
have to note that such functions are well approximated by piecewise constant
functions. Furthermore, these are a subset of F_m, so the reasoning we used for
Lipschitz functions applies directly here. Let
\[
\bar{f}_m(x) = \sum_{j=1}^{m} f^*\!\left( \frac{j-1}{m} \right) \mathbf{1}\{x \in I_j\} \ .
\]
Then, for x ∈ I_j,
\[
\left| \bar{f}_m(x) - f^*(x) \right| = \left| f^*\!\left( \frac{j-1}{m} \right) - f^*(x) \right| \le C \left| \frac{j-1}{m} - x \right|^{\alpha} \le C m^{-\alpha} \ ,
\]
and similarly, for the \tilde{f}_m ∈ F_m closest to \bar{f}_m,
\[
\forall x \in [0,1] \quad |\tilde{f}_m(x) - f^*(x)| \le C m^{-\alpha} + \frac{R}{\sqrt{n}} \ .
\]
So, we can just plug these results into the bound of the corollary:
\[
\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{f}_n(x_i) - f^*(x_i))^2 \right] \le 2 \min_{m \in \mathbb{N}} \left\{ \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n} \sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 \right\} + \frac{4\sigma^2 m \left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\}
\]
\[
\le 2 \min_{m \in \mathbb{N}} \left\{ \left( C m^{-\alpha} + \frac{R}{\sqrt{n}} \right)^2 + \frac{4\sigma^2 m \left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\}
\]
\[
\le 2 \min_{m \in \mathbb{N}} O\left( \max\left\{ m^{-2\alpha}, \ \frac{m^{-\alpha}}{\sqrt{n}}, \ \frac{1}{n}, \ \frac{m \log n}{n} \right\} \right) \ .
\]
It is not hard to see that the first and last terms dominate the bound, and so
we attain the minimum by taking (in the bound)
\[
m \asymp \left( \frac{n}{\log n} \right)^{1/(2\alpha+1)} \ ,
\]
which yields
\[
\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} (\hat{f}_n(x_i) - f^*(x_i))^2 \right] = O\left( \left( \frac{n}{\log n} \right)^{-\frac{2\alpha}{2\alpha+1}} \right) \ .
\]
Note that the estimator does not know α! So we are indeed adapting to
unknown smoothness. If the regression function f^* is Lipschitz, this estimator
has error rate O((n/\log n)^{-2/3}). However, if the function is smoother (say
α = 2), the estimator has error rate O((n/\log n)^{-4/5}), which decays much
more quickly to zero. More remarkably, apart from the logarithmic factor, it can
be shown that this is the best one can hope for! So the logarithmic factor is the
very small price we need to pay for adaptivity.
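To tie the section together, here is a minimal sketch of the adaptive piecewise linear estimator, in the same spirit as the piecewise constant sketch from Section 1. It assumes σ² and R are known and, for simplicity, computes each per-bin line by ordinary least squares and then rounds its endpoint values to the grid; this rounding is a convenient proxy for, not exactly the same as, the exact minimizer over the discretized class. All names are illustrative.

import numpy as np

def adaptive_piecewise_linear(Y, R, sigma2):
    """Penalized selection of m and of a piecewise linear fit with gridded endpoint values."""
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    x = np.arange(1, n + 1) / n
    K = int(np.sqrt(n))
    grid = (np.arange(2 * K + 1) - K) * R / K            # roughly 2*sqrt(n)+1 levels in [-R, R]
    snap = lambda v: grid[np.argmin(np.abs(grid - v))]
    best = None
    for m in range(1, n // 2 + 1):                       # keep at least two points per bin
        bins = np.clip(np.ceil(x * m).astype(int) - 1, 0, m - 1)
        fit = np.empty(n)
        for j in range(m):
            idx = bins == j
            slope, intercept = np.polyfit(x[idx], Y[idx], 1)    # per-bin least-squares line
            a = snap(intercept + slope * j / m)                 # value at left endpoint, on grid
            b = snap(intercept + slope * (j + 1) / m)           # value at right endpoint, on grid
            t = (x[idx] - j / m) * m                            # position within the bin, in (0, 1]
            fit[idx] = (1 - t) * a + t * b
        risk = np.mean((Y - fit) ** 2)
        pen = 4 * sigma2 * m * (1 + 2 * np.log2(2 * np.sqrt(n) + 1)) * np.log(2) / n
        if best is None or risk + pen < best[0]:
            best = (risk + pen, m, fit)
    return best[1], best[2]                              # selected m and fitted values at the x_i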