Adapting To Unknown Smoothness
R. M. Castro
May 20, 2011
1 Introduction
In this set of notes we see how to make the most of the oracle bounds we
proved in class. We begin by re-deriving our bounds for regression of Lipschitz
smooth functions, and then see how to generalize the result to regression of
functions of unknown smoothness.
Suppose we have a function $f^* : [0,1] \to [-R,R]$ that is Lipschitz smooth, i.e.,
$$|f^*(s) - f^*(t)| \le L|s - t| \qquad \text{for all } s,t \in [0,1] .$$
We have seen these functions can be well approximated by piecewise constant
functions of the form
$$g(x) = \sum_{j=1}^{m} c_j \mathbf{1}\{x \in I_j\} , \qquad \text{where } I_j = \left(\frac{j-1}{m}, \frac{j}{m}\right] .$$
Let's use maximum likelihood to pick the best such model. Suppose we have the
following regression model
$$Y_i = f^*(x_i) + \epsilon_i , \qquad i = 1, \ldots, n ,$$
where $x_i = i/n$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. To be able to use the corollary we derived
we need a countable or finite class of models. The easiest way to do so is to
discretize/quantize the possible values of each constant piece in the candidate
models. Define
$$\mathcal{F}_m = \left\{ \sum_{j=1}^{m} c_j \mathbf{1}\{x \in I_j\} \,:\, c_j \in Q \right\} ,$$
where
$$Q = \left\{ -R,\ -R\,\frac{n-1}{n},\ \ldots,\ R \right\} = \left\{ -R + \frac{k}{n}R ,\ k = 0, \ldots, 2n \right\} .$$
Therefore $\mathcal{F}_m$ has exactly $(2n+1)^m$ elements in total. This means that, by
taking $c(f) = \log_2\!\left((2n+1)^m\right) = m \log_2(2n+1)$ for all $f \in \mathcal{F}_m$, we satisfy the
Kraft inequality:
$$\sum_{f \in \mathcal{F}_m} 2^{-c(f)} = \sum_{f \in \mathcal{F}_m} \frac{1}{|\mathcal{F}_m|} = 1 .$$
So we are ready to apply our oracle bound. Since $c(f)$ is just a constant (not
really a function of $f$) the estimator is simply the MLE
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(x_i))^2 + \frac{4\sigma^2 c(f) \log 2}{n} \right\} = \arg\min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(x_i))^2 \right\} .$$
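Since $c(f)$ is constant over $\mathcal{F}_m$, the least-squares criterion decouples across the bins: on each $I_j$ the minimizer is the average of the corresponding $Y_i$, rounded to the nearest value in $Q$. The following sketch illustrates this; the function name `fit_piecewise_constant` and its interface are ours, not from the notes.

```python
import numpy as np

def fit_piecewise_constant(y, m, R):
    """Least-squares fit over the discretized class F_m (a sketch).

    y : array of n observations Y_i taken at x_i = i/n
    m : number of constant pieces
    R : range parameter; each piece takes values in Q = {-R + k*R/n, k = 0, ..., 2n}
    Returns the fitted grid value for each of the m pieces.
    """
    n = len(y)
    grid = -R + (R / n) * np.arange(2 * n + 1)                  # the set Q
    x = np.arange(1, n + 1) / n                                 # design points x_i = i/n
    bins = np.minimum(np.ceil(x * m).astype(int) - 1, m - 1)    # index j-1 of the bin I_j containing x_i
    c_hat = np.empty(m)
    for j in range(m):
        yj = y[bins == j]
        mean_j = yj.mean() if yj.size else 0.0
        # a quadratic in c_j restricted to a grid is minimized at the grid
        # point closest to its unconstrained minimizer (the bin average)
        c_hat[j] = grid[np.argmin(np.abs(grid - mean_j))]
    return c_hat
```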
The corollary then says that
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \big(\hat{f}_n(x_i) - f^*(x_i)\big)^2\right] \le 2 \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n}\sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 + \frac{4\sigma^2 c(f) \log 2}{n} \right\} .$$
So far the result is extremely general, as we have not made use of the Lipschitz
assumption. We have seen earlier that there is a piecewise constant function
$$\bar{f}_m(x) = \sum_{j=1}^{m} \bar{c}_j \mathbf{1}\{x \in I_j\}$$
such that for all $x \in [0,1]$ we have $|f^*(x) - \bar{f}_m(x)| \le L/m$. The problem is that, in general, $\bar{f}_m \notin \mathcal{F}_m$ since $\bar{c}_j \notin Q$. Take instead the element of $\mathcal{F}_m$ that is closest to $\bar{f}_m$, namely
$$\tilde{f}_m = \arg\min_{f \in \mathcal{F}_m} \sup_{x \in [0,1]} |f(x) - \bar{f}_m(x)| .$$
It is clear that $|\tilde{f}_m(x) - \bar{f}_m(x)| \le R/n$ for all $x \in [0,1]$; therefore, by the triangle
inequality, we have
$$|\tilde{f}_m(x) - f^*(x)| \le |\bar{f}_m(x) - f^*(x)| + |\tilde{f}_m(x) - \bar{f}_m(x)| \le \frac{L}{m} + \frac{R}{n} .$$
Now, we can just use this in our bound:
$$\begin{aligned}
\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \big(\hat{f}_n(x_i) - f^*(x_i)\big)^2\right]
&\le 2 \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n}\sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 + \frac{4\sigma^2 c(f) \log 2}{n} \right\} \\
&\le \frac{2}{n}\sum_{i=1}^{n} \left(\frac{L}{m} + \frac{R}{n}\right)^2 + \frac{8\sigma^2 m \log_2(2n+1) \log 2}{n} \\
&= 2\left(\frac{L}{m} + \frac{R}{n}\right)^2 + \frac{8\sigma^2 m \log_2(2n+1) \log 2}{n} .
\end{aligned}$$
So, to ensure the best bound possible we should choose $m$ minimizing the right-hand
side. This yields $m \asymp (n/\log n)^{1/3}$ and
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \big(\hat{f}_n(x_i) - f^*(x_i)\big)^2\right] = O\!\left( (n/\log n)^{-2/3} \right) ,$$
which, apart from the logarithmic factor, is the best we can ever hope for (this
logarithmic factor is due to the discretization of the model classes, and is an
artifact of this approach). If we want the truly best possible bound we need
to minimize the above expression with respect to m, and for that we need to
know L. Can we do better? Can we automagically choose m using the data?
The answer is yes, and for this we will start taking full advantage of our oracle
bound.
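As a quick numerical illustration (not part of the notes, and with made-up constants $L = R = \sigma = 1$), one can minimize the right-hand side of the bound over $m$ and check that the minimizer indeed grows like $(n/\log n)^{1/3}$:

```python
import numpy as np

def lipschitz_bound(m, n, L=1.0, R=1.0, sigma=1.0):
    # right-hand side of the oracle bound for the Lipschitz class:
    # 2 (L/m + R/n)^2 + 8 sigma^2 m log2(2n+1) log(2) / n
    return 2 * (L / m + R / n) ** 2 + \
           8 * sigma ** 2 * m * np.log2(2 * n + 1) * np.log(2) / n

for n in (10 ** 3, 10 ** 4, 10 ** 5):
    ms = np.arange(1, n + 1)
    m_star = ms[np.argmin(lipschitz_bound(ms, n))]
    print(n, m_star, (n / np.log(n)) ** (1 / 3))   # m_star tracks (n/log n)^(1/3) up to constants
```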
Since we want to choose the best possible m we must consider the following
class of models
$$\mathcal{F} = \bigcup_{m=1}^{\infty} \mathcal{F}_m .$$
This is clearly a countable class of models (but not finite). So we need to be
a bit more careful in constructing the map $c(\cdot)$. Let's use a coding argument:
begin by defining
$$m(f) = \min\{ m \in \mathbb{N} : f \in \mathcal{F}_m \} .$$
Encode $f \in \mathcal{F}$ using first the bits $00\ldots01$ ($m(f)$ bits in total) to encode $m(f)$,
and then $\log_2 |\mathcal{F}_{m(f)}|$ bits to encode which model inside $\mathcal{F}_{m(f)}$ is $f$. This is clearly
a prefix code and therefore satisfies the Kraft inequality. More formally,
$$c(f) = m(f) + \log_2 |\mathcal{F}_{m(f)}| = m(f) + \log_2\!\left((2n+1)^{m(f)}\right) = m(f)\left(1 + \log_2(2n+1)\right) .$$
Although we know, from the coding argument, that the map $c(\cdot)$ satisfies
the Kraft inequality for sure, we can do a little sanity check, and ensure this is
indeed true:
$$\begin{aligned}
\sum_{f \in \mathcal{F}} 2^{-c(f)}
&= \sum_{m=1}^{\infty} \sum_{f \in \mathcal{F} : m(f) = m} 2^{-c(f)}
= \sum_{m=1}^{\infty} \sum_{f \in \mathcal{F} : m(f) = m} 2^{-m - \log_2 |\mathcal{F}_m|} \\
&\le \sum_{m=1}^{\infty} \sum_{f \in \mathcal{F}_m} 2^{-m - \log_2 |\mathcal{F}_m|}
= \sum_{m=1}^{\infty} 2^{-m} \sum_{f \in \mathcal{F}_m} \frac{1}{|\mathcal{F}_m|}
= \sum_{m=1}^{\infty} 2^{-m} = 1 .
\end{aligned}$$
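The upper bound in the last display can also be checked numerically: each $\mathcal{F}_m$ contributes $|\mathcal{F}_m|\, 2^{-m - \log_2|\mathcal{F}_m|} = 2^{-m}$, so truncating the sum at a moderate $m$ already gives a value close to 1. A minimal sketch (our own, with $n = 100$):

```python
import math

n = 100
total = 0.0
for m in range(1, 60):
    c = m * (1 + math.log2(2 * n + 1))      # code length of any f with m(f) = m
    total += (2 * n + 1) ** m * 2 ** (-c)   # |F_m| * 2^{-c} = 2^{-m}
print(total)                                # ~1 (the sum is truncated at m = 59)
```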
Now, similarly to what we had before,
$$\begin{aligned}
\hat{f}_n &= \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(x_i))^2 + \frac{4\sigma^2 c(f) \log 2}{n} \right\} \\
&= \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(x_i))^2 + \frac{4\sigma^2 m(f)\left(1 + \log_2(2n+1)\right) \log 2}{n} \right\} ,
\end{aligned}$$
which is no longer the MLE, but rather a maximum penalized likelihood estimator. Then
$$\begin{aligned}
\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \big(\hat{f}_n(x_i) - f^*(x_i)\big)^2\right]
&\le 2 \min_{f \in \mathcal{F}} \left\{ \frac{1}{n}\sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 + \frac{4\sigma^2 m(f)\left(1 + \log_2(2n+1)\right) \log 2}{n} \right\} \\
&= 2 \min_{m \in \mathbb{N}} \left\{ \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n}\sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 \right\} + \frac{4\sigma^2 m\left(1 + \log_2(2n+1)\right) \log 2}{n} \right\} \\
&\le \min_{m \in \mathbb{N}} \left\{ 2\left(\frac{L}{m} + \frac{R}{n}\right)^2 + \frac{8\sigma^2 m\left(1 + \log_2(2n+1)\right) \log 2}{n} \right\} .
\end{aligned}$$
Therefore this estimator automatically chooses the best possible number of
parameters $m$. Note that the price we pay is very modest: the only change from
what we had before is that the term $\log_2(2n+1)$ is replaced by $1 + \log_2(2n+1)$,
which is a very minor change, as $\log_2(2n+1) \gg 1$ for reasonable sample sizes.
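For concreteness, here is a sketch of the resulting adaptive procedure for the piecewise constant class. It reuses the hypothetical `fit_piecewise_constant` from the earlier sketch and assumes the noise level $\sigma$ is known; for each $m$ it computes the penalized empirical risk and keeps the minimizer.

```python
import numpy as np

def adaptive_fit(y, R, sigma, m_max=None):
    """Penalized least squares over F = union of the F_m (a sketch).

    Assumes fit_piecewise_constant() from the earlier sketch; m_max caps
    the search over m, which is harmless since the penalty grows linearly in m.
    Returns the selected m and the fitted values at the design points.
    """
    n = len(y)
    if m_max is None:
        m_max = n
    x = np.arange(1, n + 1) / n
    best = (np.inf, None, None)
    for m in range(1, m_max + 1):
        c_hat = fit_piecewise_constant(y, m, R)
        bins = np.minimum(np.ceil(x * m).astype(int) - 1, m - 1)
        fitted = c_hat[bins]
        # penalty = 4 sigma^2 c(f) log(2) / n with c(f) = m (1 + log2(2n+1))
        penalty = 4 * sigma ** 2 * m * (1 + np.log2(2 * n + 1)) * np.log(2) / n
        crit = np.mean((y - fitted) ** 2) + penalty
        if crit < best[0]:
            best = (crit, m, fitted)
    return best[1], best[2]
```

Note that the procedure never uses the Lipschitz constant $L$; only the penalty, which depends on $m$, $n$, and $\sigma$, enters the selection.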
Although this is remarkable, we can probably do much better, and adapt to
unknown smoothness. We'll see how to do it in the next section.
2 Hölder smooth functions
For $0 < \alpha \le 1$, define the space of functions
$$H^{\alpha}(C) = \left\{ f : \sup_{x,h} \frac{|f(x+h) - f(x)|}{|h|^{\alpha}} \le C \right\} ,$$
for some constant $0 < C < \infty$. This class contains functions that are bounded,
but less smooth than Lipschitz functions. Indeed, the space of Lipschitz functions
corresponds to $\alpha = 1$. Functions in $H^{\alpha}$ with $\alpha \le 1$ are uniformly continuous, but
not necessarily differentiable. For $\alpha > 1$ the definition is extended recursively:
$$H^{\alpha}(C) = \left\{ f : \frac{\partial f}{\partial x} \in H^{\alpha - 1}(C) \right\} .$$
In other words, $f \in H^{\alpha}(C)$ essentially means that
$$|f(x) - T_y(x)| \le C |x - y|^{\alpha} \qquad \forall x, y ,$$
where $\lfloor \alpha \rfloor$ is the largest integer such that $\lfloor \alpha \rfloor < \alpha$, and $T_y$ is the Taylor
polynomial of $f$ of degree $\lfloor \alpha \rfloor$ around the point $y$. In words, a Hölder-$\alpha$ smooth
function is locally well approximated by a polynomial of degree $\lfloor \alpha \rfloor$. In this
lecture we will work with the first definition (this will also give you an indication
of why the two definitions are equivalent). Note: if a function is Hölder-$\alpha_2$ smooth
and $\alpha_1 < \alpha_2$, then the function is also Hölder-$\alpha_1$ smooth.
Note that since Hölder smoothness essentially measures how differentiable
functions are, the Taylor polynomial is the natural way to approximate Hölder
smooth functions. We will focus on Hölder smooth function classes with $0 < \alpha \le 2$.
Thus, we will work with piecewise linear approximations, that is, Taylor
polynomials of degree 1. If we were to consider smoother functions, $\alpha > 2$, we
would need to consider higher-degree Taylor polynomial approximations,
i.e. quadratic, cubic, etc.
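As a quick numerical illustration of the definition (our own example, not from the notes): $f(x) = \sqrt{x}$ on $[0,1]$ is Hölder smooth with $\alpha = 1/2$ but is not Lipschitz, and the ratio in the definition makes this visible near $x = 0$.

```python
import numpy as np

# f(x) = sqrt(x): the ratio |f(x+h) - f(x)| / h^alpha stays bounded for
# alpha = 0.5 but blows up near x = 0 for alpha = 1 (not Lipschitz).
f = np.sqrt
x = np.linspace(0, 1, 1001)[:-1]      # points in [0, 1)
h = 1e-4
for alpha in (0.5, 1.0):
    ratio = np.abs(f(x + h) - f(x)) / h ** alpha
    print(alpha, ratio.max())         # ~1 for alpha = 0.5, ~100 for alpha = 1
```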
3 Regression of Hölder smooth functions
Consider the usual regression model
$$Y_i = f^*(x_i) + \epsilon_i , \qquad i = 1, \ldots, n ,$$
where $x_i = i/n$ and $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. Let's assume $f^* : [0,1] \to [-R, R]$ is a
smooth function, in the sense that $f^* \in H^{\alpha}(C)$ for some unknown $0 < \alpha \le 2$. The
smoother $f^*$ is, the better we should be able to estimate it, and the smoother it is,
the larger the bins over which we should average. Also, we will need to exploit the
extra smoothness in our approximation of $f^*$, so we consider piecewise linear
candidate functions of the form
$$\sum_{j=1}^{m} (a_j + b_j x) \mathbf{1}\{x \in I_j\} , \qquad \text{where } I_j = \left(\frac{j-1}{m}, \frac{j}{m}\right] .$$
As before, we want to consider countable/finite classes of models to be able to
apply our corollary, so we will consider a slight modification of the above. Each
linear piece can be described by its values at the beginning and end points of the
corresponding interval, so we are going to restrict those values to lie on a grid
(see Figure 1).
Define the class
$$\mathcal{F}_m = \left\{ f(x) = \sum_{j=1}^{m} \ell_j(x) \mathbf{1}\{x \in I_j\} \right\} ,$$
Figure 1: Example of the discretization of $f$ on the interval $\left(\frac{j-1}{m}, \frac{j}{m}\right]$.
where
$$\ell_j(x) = \frac{x - (j-1)/m}{1/m}\, b_j + \frac{j/m - x}{1/m}\, a_j = (mx - j + 1)\, b_j + (j - mx)\, a_j ,$$
and $a_j, b_j \in \left\{ \frac{k}{\sqrt{n}} R - R \,:\, k \in \{0, \ldots, 2\sqrt{n}\} \right\}$. Clearly $|\mathcal{F}_m| = (2\sqrt{n} + 1)^{2m}$.
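A small sketch of how such a candidate can be evaluated (the name `eval_piecewise_linear` and the vectorized indexing are ours; $a_j$ and $b_j$ are assumed to already lie on the grid above):

```python
import numpy as np

def eval_piecewise_linear(a, b, x, m):
    """Evaluate sum_j l_j(x) 1{x in I_j}, where on I_j = ((j-1)/m, j/m] the
    line l_j takes the value a_j at the left endpoint and b_j at the right
    endpoint: l_j(x) = (m x - j + 1) b_j + (j - m x) a_j.
    a, b : arrays of length m with the (gridded) endpoint values
    x    : array of evaluation points in [0, 1]
    """
    j = np.clip(np.ceil(x * m), 1, m).astype(int)   # 1-based bin index
    t = x * m - (j - 1)                             # position within the bin, in (0, 1]
    return t * b[j - 1] + (1 - t) * a[j - 1]
```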
Since we don't know the smoothness $\alpha$ a priori, we must choose $m$ using the
data. Therefore, in the same fashion as before, we take the class
$$\mathcal{F} = \bigcup_{m=1}^{\infty} \mathcal{F}_m ,$$
with $m(f) = \min\{ m \in \mathbb{N} : f \in \mathcal{F}_m \}$, and
$$c(f) = m(f) + \log_2 |\mathcal{F}_{m(f)}| = m(f)\left(1 + 2\log_2(2\sqrt{n} + 1)\right) .$$
Exactly as before, define the estimator
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(x_i))^2 + \frac{4\sigma^2 m(f)\left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\} .$$
Then
$$\begin{aligned}
\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \big(\hat{f}_n(x_i) - f^*(x_i)\big)^2\right]
&\le 2 \min_{f \in \mathcal{F}} \left\{ \frac{1}{n}\sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 + \frac{4\sigma^2 m(f)\left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\} \\
&= 2 \min_{m \in \mathbb{N}} \left\{ \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n}\sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 \right\} + \frac{4\sigma^2 m\left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\} .
\end{aligned}$$
In the above, the first term is essentially our familiar approximation error, and
the second term is in a sense bounding the estimation error. Therefore this
estimator automatically seeks the best balance between the two. To say something
more concrete about the performance of the estimator we need to bring in the
assumptions we have on $f^*$. Suppose first that $1 < \alpha \le 2$, so that $f^*$ is
differentiable. Take $x \in I_j$; by Taylor's theorem,
$$f^*(x) = f^*\!\left(\frac{j-1}{m}\right) + \frac{\partial f^*}{\partial x}(\bar{x})\left(x - \frac{j-1}{m}\right)$$
for some $\bar{x} \in \left(\frac{j-1}{m}, x\right)$. Consider the piecewise linear approximation
$$\bar{f}_m(x) = \sum_{j=1}^{m} \left[ f^*\!\left(\frac{j-1}{m}\right) + \frac{\partial f^*}{\partial x}\!\left(\frac{j-1}{m}\right)\left(x - \frac{j-1}{m}\right) \right] \mathbf{1}\{x \in I_j\} .$$
Note that this is not necessarily the best piecewise linear approximation to $f^*$,
but it is good enough for our purposes. What can we say about $|\bar{f}_m(x) - f^*(x)|$?
Take again $x \in I_j$. Now
$$\begin{aligned}
\left|\bar{f}_m(x) - f^*(x)\right|
&= \left| f^*\!\left(\tfrac{j-1}{m}\right) + \frac{\partial f^*}{\partial x}\!\left(\tfrac{j-1}{m}\right)\left(x - \tfrac{j-1}{m}\right) - f^*(x) \right| \\
&= \left| f^*\!\left(\tfrac{j-1}{m}\right) + \frac{\partial f^*}{\partial x}\!\left(\tfrac{j-1}{m}\right)\left(x - \tfrac{j-1}{m}\right) - f^*\!\left(\tfrac{j-1}{m}\right) - \frac{\partial f^*}{\partial x}(\bar{x})\left(x - \tfrac{j-1}{m}\right) \right| \\
&= \left| \frac{\partial f^*}{\partial x}\!\left(\tfrac{j-1}{m}\right) - \frac{\partial f^*}{\partial x}(\bar{x}) \right| \left(x - \tfrac{j-1}{m}\right) \\
&\le C \left|\bar{x} - \tfrac{j-1}{m}\right|^{\alpha-1} \left(x - \tfrac{j-1}{m}\right) \\
&\le C \left(\frac{1}{m}\right)^{\alpha-1} \frac{1}{m} = C m^{-\alpha} ,
\end{aligned}$$
where the last steps follow simply from the use of the smoothness assumption
(namely $\frac{\partial f^*}{\partial x} \in H^{\alpha-1}(C)$), together with the fact that $x - (j-1)/m \le 1/m$ and
$\bar{x} - (j-1)/m \le 1/m$.
So, we just showed that, for the piecewise linear function of the form considered,
$$\forall x \in [0,1] \quad \left|\bar{f}_m(x) - f^*(x)\right| \le C m^{-\alpha} .$$
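This rate is easy to check numerically (our own example, not from the notes): for $f(x) = |x - 1/2|^{3/2}$, which is Hölder smooth with $\alpha = 3/2$, the sup-norm error of the Taylor-based piecewise linear approximation decays like $m^{-3/2}$.

```python
import numpy as np

alpha = 1.5
f = lambda x: np.abs(x - 0.5) ** alpha
df = lambda x: alpha * np.sign(x - 0.5) * np.abs(x - 0.5) ** (alpha - 1)

xs = np.linspace(0, 1, 100001)
for m in (10, 20, 40, 80):
    left = (np.clip(np.ceil(xs * m), 1, m) - 1) / m   # left endpoint of the bin containing each x
    approx = f(left) + df(left) * (xs - left)         # degree-1 Taylor polynomial at the left endpoint
    err = np.abs(approx - f(xs)).max()
    print(m, err, err * m ** alpha)                   # last column is roughly constant
```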
Now, clearly $\bar{f}_m$ is not necessarily in $\mathcal{F}_m$, so we still need a bit of work. Take the
element of $\mathcal{F}_m$ that is closest to $\bar{f}_m$,
$$\tilde{f}_m = \arg\min_{f \in \mathcal{F}_m} \sup_{x \in [0,1]} |f(x) - \bar{f}_m(x)| .$$
Because of the way we discretized the endpoint values we know that
$\sup_{x \in [0,1]} |\tilde{f}_m(x) - \bar{f}_m(x)| \le R/\sqrt{n}$, and therefore, by the triangle inequality,
$$\forall x \in [0,1] \quad \left|\tilde{f}_m(x) - f^*(x)\right| \le C m^{-\alpha} + \frac{R}{\sqrt{n}} .$$
If $f^* \in H^{\alpha}(C)$ with $0 < \alpha \le 1$, take instead the piecewise constant approximation
$$\bar{f}_m(x) = \sum_{j=1}^{m} f^*\!\left(\frac{j-1}{m}\right) \mathbf{1}\{x \in I_j\} .$$
Then, for $x \in I_j$,
$$\left|\bar{f}_m(x) - f^*(x)\right| = \left| f^*\!\left(\frac{j-1}{m}\right) - f^*(x) \right| \le C \left|\frac{j-1}{m} - x\right|^{\alpha} \le C m^{-\alpha} ,$$
and similarly, for the closest element $\tilde{f}_m \in \mathcal{F}_m$,
$$\forall x \in [0,1] \quad \left|\tilde{f}_m(x) - f^*(x)\right| \le C m^{-\alpha} + \frac{R}{\sqrt{n}} .$$
So, we can just plug these results into the bound of the corollary:
$$\begin{aligned}
\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \big(\hat{f}_n(x_i) - f^*(x_i)\big)^2\right]
&\le 2 \min_{m \in \mathbb{N}} \left\{ \min_{f \in \mathcal{F}_m} \left\{ \frac{1}{n}\sum_{i=1}^{n} (f^*(x_i) - f(x_i))^2 \right\} + \frac{4\sigma^2 m\left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\} \\
&\le 2 \min_{m \in \mathbb{N}} \left\{ \left(C m^{-\alpha} + \frac{R}{\sqrt{n}}\right)^2 + \frac{4\sigma^2 m\left(1 + 2\log_2(2\sqrt{n}+1)\right) \log 2}{n} \right\} \\
&\le 2 \min_{m \in \mathbb{N}} O\!\left( \max\left\{ m^{-2\alpha},\ \frac{m^{-\alpha}}{\sqrt{n}},\ \frac{1}{n},\ \frac{m \log n}{n} \right\} \right) .
\end{aligned}$$
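As a rough numerical check (our own, with made-up constants $C = R = \sigma = 1$), minimizing the second line of the display over $m$ gives a minimizer that tracks $(n/\log n)^{1/(2\alpha+1)}$ up to constants:

```python
import numpy as np

def holder_bound(m, n, alpha, C=1.0, R=1.0, sigma=1.0):
    # 2 [ (C m^{-alpha} + R/sqrt(n))^2 + 4 sigma^2 m (1 + 2 log2(2 sqrt(n)+1)) log(2) / n ]
    approx = (C * m ** (-alpha) + R / np.sqrt(n)) ** 2
    penalty = 4 * sigma ** 2 * m * (1 + 2 * np.log2(2 * np.sqrt(n) + 1)) * np.log(2) / n
    return 2 * (approx + penalty)

n = 10 ** 5
for alpha in (0.5, 1.0, 2.0):
    ms = np.arange(1, n + 1)
    m_star = ms[np.argmin(holder_bound(ms, n, alpha))]
    print(alpha, m_star, (n / np.log(n)) ** (1 / (2 * alpha + 1)))
```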
It is not hard to see that the first and last terms dominate the bound, and so
we attain the minimum by taking (in the bound)
$$m \asymp \left(\frac{n}{\log n}\right)^{1/(2\alpha+1)} ,$$
which yields
$$\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \big(\hat{f}_n(x_i) - f^*(x_i)\big)^2\right] = O\!\left( \left(\frac{n}{\log n}\right)^{-\frac{2\alpha}{2\alpha+1}} \right) .$$
Note that the estimator does not know $\alpha$! So we are indeed adapting to
unknown smoothness. If the regression function $f^*$