
Analysis of Boolean Functions (CMU 18-859S, Spring 2007)

Lecture 9: Learning Decision Trees and DNFs


Feb. 18, 2007
Lecturer: Ryan O'Donnell    Scribe: Suresh Purini
1 Two Important Learning Algorithms
We recall the following definition and two important learning algorithms discussed in the previous lecture.
Definition 1.1 Given a collection $\mathcal{S}$ of subsets of $[n]$, we say $f : \{-1,1\}^n \to \mathbb{R}$ has $\epsilon$-concentration on $\mathcal{S}$ if
$$\sum_{S \notin \mathcal{S}} \hat{f}(S)^2 \le \epsilon.$$
Theorem 1.2 Let $C$ be a class of $n$-bit functions such that every $f \in C$ is $\epsilon$-concentrated on $\mathcal{S} = \{S \subseteq [n] : |S| \le d\}$. Then the class $C$ is learnable under the uniform distribution to accuracy $O(\epsilon)$, with probability at least $1 - \delta$, in time $\mathrm{poly}(|\mathcal{S}|, 1/\epsilon) \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$, using random examples only.
This algorithm is called the Low-Degree algorithm and was proposed by Linial, Mansour, and Nisan in [3]. Refer to Theorem 5.4 in Lecture Notes 8.
Theorem 1.3 Let $C$ be a class of $n$-bit functions such that every $f \in C$ is $\epsilon$-concentrated on some collection $\mathcal{S}$. Then the class $C$ is learnable using membership queries (via the Goldreich-Levin algorithm) in time $\mathrm{poly}(|\mathcal{S}|, 1/\epsilon) \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$.
This algorithm is called the Kushilevitz-Mansour algorithm [2]. Refer to Corollary 5.5 in Lecture Notes 8.
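A minimal Python sketch of the Low-Degree algorithm of Theorem 1.2 is given below; the names (low_degree_learn, chi) and the representation of examples as $\pm 1$ NumPy vectors are illustrative assumptions rather than anything fixed by the lecture. It estimates every Fourier coefficient of degree at most $d$ by an empirical average over random examples and outputs the sign of the resulting low-degree polynomial.

    import itertools
    import numpy as np

    def chi(S, x):
        # chi_S(x) = product of the coordinates x_i for i in S (empty product = 1).
        return float(np.prod(x[list(S)]))

    def low_degree_learn(examples, n, d):
        # examples: list of (x, y) pairs with x a +/-1 NumPy array of length n and y = f(x).
        # Estimate hat{f}(S) = E[f(x) chi_S(x)] for every |S| <= d by empirical averages.
        coeffs = {}
        for k in range(d + 1):
            for S in itertools.combinations(range(n), k):
                coeffs[S] = np.mean([y * chi(S, x) for x, y in examples])
        # Hypothesis: the sign of the estimated degree-d part of f.
        def hypothesis(x):
            return 1 if sum(c * chi(S, x) for S, c in coeffs.items()) >= 0 else -1
        return hypothesis

The running time is dominated by the $|\mathcal{S}| = \binom{n}{\le d}$ coefficients being estimated, matching the $\mathrm{poly}(|\mathcal{S}|)$ factor in Theorem 1.2.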
2 Learning Decision Trees
A decision tree is a binary tree in which the internal nodes are labeled with variables and the leaves are labeled with either $-1$ or $+1$; the left and right edges out of any internal node are labeled $-1$ and $+1$ respectively. We can think of the decision tree as defining a boolean function in the natural way. For example, the decision tree in Figure 1 defines a boolean function whose DNF formula is $x_1 x_2 x_3 + x_1 x_2 x_4 + x_1 x_2$.
Note that, given any boolean function we can come up with a corresponding decision tree.
Let $P$ be a path in the decision tree. An example of a path in Figure 1 is $P = (x_1 = -1,\ x_2 = +1,\ x_4 = -1)$.
Figure 1: An example decision tree on the variables $x_1, x_2, x_3, x_4$.
Let $1_P : \{-1,1\}^n \to \{0,1\}$ be the indicator function for the path $P$. For example,
$$1_P(x) = \begin{cases} 1 & \text{if } x_1 = -1,\ x_2 = +1,\ x_4 = -1, \\ 0 & \text{otherwise.} \end{cases}$$
Observation 2.1 A boolean function $f$ can be expressed in terms of the path functions $1_P$ corresponding to the various paths in the decision tree of $f$ as follows:
$$f(x) = \sum_{\text{paths } P} 1_P(x)\, f(P),$$
where $f(P)$ is the label on the leaf reached when the function $f$ takes the path $P$ in its decision tree.
Observation 2.2 Let $V$ be the set of variables occurring in a path function $1_P$ and let $d$ be the cardinality of $V$. Then the Fourier expansion of $1_P$ looks like
$$\sum_{S \subseteq V} \pm 2^{-d}\, X_S.$$
It is easy to see the proof of the above observation by noting that the Fourier expansion for the path function $1_P$, when $P = (x_1 = -1,\ x_2 = +1,\ x_4 = -1)$, is
$$1_P = 1_{\{x_1 = -1\}}\, 1_{\{x_2 = +1\}}\, 1_{\{x_4 = -1\}} = \left(\tfrac{1}{2} - \tfrac{1}{2}x_1\right)\left(\tfrac{1}{2} + \tfrac{1}{2}x_2\right)\left(\tfrac{1}{2} - \tfrac{1}{2}x_4\right).$$
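Concretely, multiplying out this product gives all $2^3$ Fourier coefficients of $1_P$ explicitly:
$$1_P = \tfrac{1}{8}\bigl(1 - x_1 + x_2 - x_4 - x_1 x_2 + x_1 x_4 - x_2 x_4 + x_1 x_2 x_4\bigr),$$
so every $S \subseteq \{x_1, x_2, x_4\}$ appears with coefficient $\pm 2^{-d}$, here with $d = 3$, as Observation 2.2 asserts.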
Proposition 2.3 If $f : \{-1,1\}^n \to \{-1,1\}$ is computable by a depth-$d$ decision tree, then
1. the Fourier expansion of $f$ has degree at most $d$, i.e., $\sum_{|S| > d} \hat{f}(S)^2 = 0$;
2. all Fourier coefficients are integer multiples of $2^{-d}$;
3. the number of nonzero Fourier coefficients is at most $4^d$.
Proof: (1) follows from Observation 2.1. For (2), observe that every Fourier coefficient is a sum of terms of the form $k\,2^{-d'}$ for some $d' \le d$, and each such term can be written as $k\,2^{-d'} = \bigl(k\,2^{\,d-d'}\bigr)\,2^{-d}$, an integer multiple of $2^{-d}$. This proves (2). A depth-$d$ decision tree has at most $2^d$ leaves, and hence we have at most $2^d \cdot 2^d = 4^d$ nonzero Fourier coefficients, which proves (3). $\Box$
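The three properties can be checked numerically by brute force on a small tree; the depth-2 tree below is a hypothetical example chosen only for illustration.

    import itertools
    import numpy as np

    n, d = 3, 2
    cube = list(itertools.product([-1, 1], repeat=n))

    def tree(x):
        # A depth-2 decision tree: query x1; on -1 output x2, on +1 output x3.
        return x[1] if x[0] == -1 else x[2]

    # Brute-force Fourier coefficients: hat{f}(S) = E_x[f(x) * prod_{i in S} x_i].
    coeffs = {}
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            coeffs[S] = np.mean([tree(x) * np.prod(np.array(x)[list(S)]) for x in cube])

    nonzero = {S: c for S, c in coeffs.items() if abs(c) > 1e-9}
    assert all(len(S) <= d for S in nonzero)                                      # property (1)
    assert all(abs(c * 2**d - round(c * 2**d)) < 1e-9 for c in nonzero.values())  # property (2)
    assert len(nonzero) <= 4**d                                                   # property (3)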
Corollary 2.4 Depth-$d$ decision trees are exactly learnable with random examples in time $\mathrm{poly}(4^d) \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$.
Proof: Use the Kushilevitz-Mansour algorithm with $\epsilon = \frac{2^{-d}}{4}$ and round each Fourier coefficient estimate to the nearest multiple of $2^{-d}$. $\Box$
Remark 2.5 $\log(n)$-depth decision trees are exactly learnable in polynomial time. This algorithm can be derandomized.
Observation 2.6 Size-$s$ decision trees are $\epsilon$-close to depth-$\log(s/\epsilon)$ decision trees.
Proof: Let $T$ be a decision tree of size $s$ computing the boolean function $f$. Consider the decision tree $T'$ obtained from $T$ by chopping every path whose depth is greater than $\log(s/\epsilon)$ down to depth $\log(s/\epsilon)$. The tree $T'$ gives an incorrect value for $f(x)$ only when $x$ takes a path of length greater than $\log(s/\epsilon)$ in $T$. When we pick $x$ at random, it follows any particular such path with probability at most $2^{-\log(s/\epsilon)} = \epsilon/s$. Therefore, by a union bound over the at most $s$ such paths, we get $\Pr_{x \in \{-1,1\}^n}[T(x) \neq T'(x)] \le \epsilon$. $\Box$
Corollary 2.7 Size-$s$ decision trees are $O(\epsilon)$-concentrated on a collection of size $4^{\log(s/\epsilon)} = (s/\epsilon)^2$.
Definition 2.8 Given a function $f : \{-1,1\}^n \to \mathbb{R}$, the spectral norm or $L_1$-Fourier norm of $f$ is
$$\|\hat{f}\|_1 = \sum_{S \subseteq [n]} |\hat{f}(S)|.$$
Observation 2.9 If a function $f$ is an AND of literals, then $\|\hat{f}\|_1 = 1$. Refer to Observation 2.2 for the proof idea.
The following observation follows from the facts that for all $a, b \in \mathbb{R}$, $|a + b| \le |a| + |b|$ and $|ab| = |a|\,|b|$.
Observation 2.10
1. $\|\widehat{f + g}\|_1 \le \|\hat{f}\|_1 + \|\hat{g}\|_1$
2. $\|\widehat{c f}\|_1 = |c|\,\|\hat{f}\|_1$
Proposition 2.11 If $f$ has a decision tree of size $s$, then $\|\hat{f}\|_1 \le s$.
Proof: Writing $f = \sum_{\text{paths } P} 1_P\, f(P)$ as in Observation 2.1 and using Observations 2.9 and 2.10,
$$\|\hat{f}\|_1 \;\le\; \sum_{\text{paths } P} \big\|\widehat{1_P\, f(P)}\big\|_1 \;=\; \sum_{\text{paths } P} \big\|\widehat{1_P}\big\|_1 \;\le\; s. \qquad \Box$$
Proposition 2.12 Given any function $f$ with $\|f\|_2^2 \le 1$ and $\epsilon > 0$, let $\mathcal{S} = \bigl\{S \subseteq [n] : |\hat{f}(S)| \ge \epsilon / \|\hat{f}\|_1\bigr\}$. Then $f$ is $\epsilon$-concentrated on $\mathcal{S}$. Note that $|\mathcal{S}| \le \bigl(\|\hat{f}\|_1 / \epsilon\bigr)^2$.
Proof:
$$\sum_{S \notin \mathcal{S}} \hat{f}(S)^2 \;\le\; \Bigl(\max_{S \notin \mathcal{S}} |\hat{f}(S)|\Bigr)\sum_{S \notin \mathcal{S}} |\hat{f}(S)| \;\le\; \Bigl(\max_{S \notin \mathcal{S}} |\hat{f}(S)|\Bigr)\Bigl(\sum_{S \notin \mathcal{S}} |\hat{f}(S)| + \sum_{S \in \mathcal{S}} |\hat{f}(S)|\Bigr) \;\le\; \frac{\epsilon}{\|\hat{f}\|_1}\,\|\hat{f}\|_1 \;=\; \epsilon. \qquad \Box$$
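The size bound in Proposition 2.12 follows from Parseval's identity together with the assumption $\|f\|_2^2 \le 1$:
$$|\mathcal{S}| \cdot \Bigl(\frac{\epsilon}{\|\hat{f}\|_1}\Bigr)^2 \;\le\; \sum_{S \in \mathcal{S}} \hat{f}(S)^2 \;\le\; \|f\|_2^2 \;\le\; 1, \qquad\text{so}\qquad |\mathcal{S}| \le \Bigl(\frac{\|\hat{f}\|_1}{\epsilon}\Bigr)^2.$$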
Corollary 2.13 Any class of functions $C = \bigl\{f : \|f\|_2^2 \le 1 \text{ and } \|\hat{f}\|_1 \le s\bigr\}$ is learnable with random examples in time $\mathrm{poly}(s, \frac{1}{\epsilon})$.
Let us now consider functions which are computable by decision trees whose nodes branch on arbitrary parities of variables. Figure 2 contains an example of a function computable by a decision tree that branches on parities of various subsets of the variables. Another example is the parity function itself, which is computable by a depth-1 parity decision tree.
Proposition 2.14 If a function $f : \{-1,1\}^n \to \{-1,1\}$ is expressible as a size-$s$ decision tree on parities, then $\|\hat{f}\|_1 \le s$.
Figure 2: An example decision tree whose nodes branch on parities of subsets of the variables.
Proof: Let $1_P$ be the $\{0,1\}$-indicator function for a path $P$ in the decision tree. Let the path be $P = (X_{S_1} = b_1, \ldots, X_{S_d} = b_d)$, i.e., we get the path $P$ by taking the edges labeled $b_1, \ldots, b_d \in \{-1,1\}$ starting from the root node. We have
$$1_P = \left(\tfrac{1}{2} + \tfrac{1}{2} b_1 X_{S_1}\right) \cdots \left(\tfrac{1}{2} + \tfrac{1}{2} b_d X_{S_d}\right).$$
It can be seen that $\|\widehat{1_P}\|_1 = 1$. Since $f(x) = \sum_{\text{paths } P} 1_P(x)\, f(P)$, we have $\|\hat{f}\|_1 \le s$. $\Box$
Definition 2.15 An AND of parities is called a coset.
Remark 2.16 If a function $f : \{-1,1\}^n \to \{-1,1\}$ is expressible as $\sum_{i=1}^{s} \pm 1_{P_i}$, where the $P_i$'s are cosets, then $\|\hat{f}\|_1 \le s$.
Remark 2.17 Proposition 2.14 implies that we can learn all parity functions in $\mathrm{poly}(\frac{1}{\epsilon})$ time. Observe that we could not have obtained this result straightforwardly from ordinary decision trees for parity functions.
Theorem 2.18 [1] If a function $f : \{-1,1\}^n \to \{-1,1\}$ satisfies $\|\hat{f}\|_1 \le s$, then
$$f = \sum_{i=1}^{2^{2^{O(s^4)}}} \pm\, 1_{P_i},$$
where the $P_i$'s are cosets.
3 Learning DNFs
Proposition 3.1 If $f$ has a size-$s$ DNF formula, then it is $\epsilon$-close to a width-$\log(s/\epsilon)$ DNF.
Proof: Let the function $f : \{-1,1\}^n \to \{-1,1\}$ have a size-$s$ DNF. Drop from the DNF of $f$ all terms whose width is larger than $\log(s/\epsilon)$, and let the new DNF represent the function $f'$. If we look at a particular term of the DNF of $f$ whose width is greater than $\log(s/\epsilon)$, then the probability that a randomly chosen $x \in \{-1,1\}^n$ satisfies it (i.e., sets it to $1$ if we look at $f$ as a boolean function from $\{0,1\}^n$ to $\{0,1\}$) is at most $2^{-\log(s/\epsilon)} = \epsilon/s$. Since there are at most $s$ terms in the DNF, we have $\Pr_x[f(x) \neq f'(x)] \le \epsilon$ by the union bound. $\Box$
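As a sketch of Proposition 3.1 in code, a DNF can be modeled (hypothetically) as a list of terms, each term a set of literals; dropping the wide terms is then a one-liner. The names and representation below are illustrative, not from the lecture.

    import math

    def drop_wide_terms(dnf_terms, s, eps):
        # Keep only the terms of width at most log2(s/eps); by Proposition 3.1 the
        # resulting DNF computes a function that is eps-close to the original one.
        width_cap = math.log2(s / eps)
        return [term for term in dnf_terms if len(term) <= width_cap]

    # Example: s = 4 terms, eps = 0.5, so terms wider than log2(8) = 3 are dropped.
    wide_dnf = [{"x1"}, {"x1", "!x2"}, {"x2", "x3", "!x4", "x5"}, {"x3", "x4"}]
    narrow_dnf = drop_wide_terms(wide_dnf, s=len(wide_dnf), eps=0.5)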
Proposition 3.2 If a function $f : \{-1,1\}^n \to \{-1,1\}$ has a width-$w$ DNF, then the total influence satisfies $I(f) \le 2w$.
Proof: Left as an exercise. $\Box$
Corollary 3.3 If a function $f : \{-1,1\}^n \to \{-1,1\}$ has a width-$w$ DNF, then $f$ is $\epsilon$-concentrated on $\mathcal{S} = \{S : |S| \le \frac{2w}{\epsilon}\}$. Thus the function $f$ is learnable in time $n^{O(w/\epsilon)}$.
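The concentration claim follows from Proposition 3.2 by a Markov-type argument, assuming the formula $I(f) = \sum_S |S|\,\hat{f}(S)^2$ for total influence from the earlier lectures:
$$\sum_{|S| > 2w/\epsilon} \hat{f}(S)^2 \;\le\; \frac{\epsilon}{2w} \sum_{S} |S|\,\hat{f}(S)^2 \;=\; \frac{\epsilon}{2w}\, I(f) \;\le\; \frac{\epsilon}{2w}\cdot 2w \;=\; \epsilon.$$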
In the rest of the class, we shall prove the following theorem, making use of Håstad's switching lemma.
Theorem 3.4 DNFs of width $w$ are $\epsilon$-concentrated on degrees up to $O\bigl(w \log(\frac{1}{\epsilon})\bigr)$.
Remark 3.5 Observe that we are replacing the $\frac{1}{\epsilon}$ factor of Corollary 3.3 with a $\log(\frac{1}{\epsilon})$ factor in the bound on the maximum degree of the Fourier coefficients.
Definition 3.6 A random restriction with $\delta$-probability on $[n]$ is a random pair $(I, X)$ where $I$ is a random subset of $[n]$ chosen by including each coordinate independently with probability $\delta$, and $X$ is a random string from $\{-1,1\}^{|\overline{I}|}$.
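A minimal Python sketch of sampling such a $\delta$-random restriction, with hypothetical names and with the restricted function represented as a closure over the fixed coordinates:

    import random

    def random_restriction(f, n, delta):
        # Sample (I, X): each coordinate is "live" (in I) independently with probability delta;
        # the remaining coordinates are fixed to independent uniform +/-1 values X.
        I = [i for i in range(n) if random.random() < delta]
        X = {i: random.choice([-1, 1]) for i in range(n) if i not in I}

        def restricted(y):
            # y maps each live coordinate in I to a value in {-1, +1}.
            return f([y[i] if i in y else X[i] for i in range(n)])

        return I, restricted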
Given a function $f : \{-1,1\}^n \to \{-1,1\}$, we shall write $f^{X \to \overline{I}} : \{-1,1\}^{|I|} \to \mathbb{R}$ for the restriction of $f$ in which the coordinates outside $I$ are fixed according to $X$. If the function $f$ is computable by a width-$w$ DNF, then after a random restriction with $\delta$-probability $\delta = \frac{1}{10w}$, with very high probability $f^{X \to \overline{I}}$ has an $O(1)$-depth decision tree. Intuitively, the reason is that in each term of the DNF only $\frac{1}{10}$ of a variable survives the random restriction on average, resulting in a constant-depth decision tree. This intuition is formalized in the following lemma due to Håstad.
Theorem 3.7 (Håstad's Switching Lemma) Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a width-$w$ DNF. When we apply a random restriction with $\delta$-probability to the function $f$, then
$$\Pr_{(I,X)}\bigl[\mathrm{DT\text{-}depth}(f^{X \to \overline{I}}) > d\bigr] \le (5\delta w)^d.$$
Theorem 3.8 Let $f$ be computable by a width-$w$ DNF. Then for all $d \ge 5$,
$$\sum_{|U| \ge 20dw} \hat{f}(U)^2 \le 2^{-d+1}.$$
Proof: Let $(I, X)$ be a random restriction with $\delta = \frac{1}{10w}$. We know from Håstad's switching lemma that $f^{X \to \overline{I}}$ has decision-tree depth greater than $d$ with probability less than $(5\delta w)^d = 2^{-d}$. Hence the following sum is nonzero (and at most $1$) with probability less than $2^{-d}$, since it vanishes whenever $f^{X \to \overline{I}}$ has a decision tree of depth at most $d$ (Proposition 2.3):
$$\sum_{S \subseteq I,\, |S| > d} \widehat{f^{X \to \overline{I}}}(S)^2.$$
Therefore, we have
$$2^{-d} \;\ge\; \mathbf{E}_{(I,X)}\Biggl[\sum_{\substack{S \subseteq I \\ |S| > d}} \widehat{f^{X \to \overline{I}}}(S)^2\Biggr] = \mathbf{E}_{I}\Biggl[\mathbf{E}_{X \in \{-1,1\}^{|\overline{I}|}}\Biggl[\sum_{\substack{S \subseteq I \\ |S| > d}} \widehat{f^{X \to \overline{I}}}(S)^2\Biggr]\Biggr] = \mathbf{E}_{I}\Biggl[\sum_{\substack{S \subseteq I \\ |S| > d}} \mathbf{E}_{X \in \{-1,1\}^{|\overline{I}|}}\bigl[F_{S|I}(X)^2\bigr]\Biggr] \qquad (\text{recall } F_{S|I}(X) = \widehat{f^{X \to \overline{I}}}(S))$$
$$= \mathbf{E}_{I}\Biggl[\sum_{\substack{S \subseteq I \\ |S| > d}} \sum_{T \subseteq \overline{I}} \widehat{F_{S|I}}(T)^2\Biggr] = \mathbf{E}_{I}\Biggl[\sum_{\substack{S \subseteq I \\ |S| > d}} \sum_{T \subseteq \overline{I}} \hat{f}(S \cup T)^2\Biggr] = \sum_{U} \hat{f}(U)^2\, \Pr_{I}\bigl[|U \cap I| > d\bigr].$$
Suppose $|U| \ge 20dw$. Then $|U \cap I|$ is binomially distributed with mean at least $20dw \cdot \delta = 2d$. Using a Chernoff bound, we get that $\Pr_{I}[|U \cap I| > d] \ge \frac{1}{2}$ when $d \ge 5$. Therefore we have
$$\sum_{U} \hat{f}(U)^2\, \Pr_{I}\bigl[|U \cap I| > d\bigr] \le 2^{-d} \;\;\Longrightarrow\;\; \sum_{|U| \ge 20dw} \hat{f}(U)^2 \cdot \frac{1}{2} \le 2^{-d} \;\;\Longrightarrow\;\; \sum_{|U| \ge 20dw} \hat{f}(U)^2 \le 2^{-d+1}. \qquad \Box$$
Remark 3.9 By putting $d = \log(\frac{1}{\epsilon})$, so that $dw = w \log(\frac{1}{\epsilon})$, we get Theorem 3.4.
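Explicitly, with $d = \log(\frac{1}{\epsilon})$ (and $d \ge 5$), Theorem 3.8 gives
$$\sum_{|U| \ge 20 w \log(1/\epsilon)} \hat{f}(U)^2 \;\le\; 2^{-\log(1/\epsilon)+1} \;=\; 2\epsilon,$$
i.e., width-$w$ DNFs are $O(\epsilon)$-concentrated on degrees up to $O\bigl(w \log(\frac{1}{\epsilon})\bigr)$.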
Further References: Yishay Mansour's survey paper [4] also contains some of the ideas in these lecture notes.
References
[1] B. Green and T. Sanders. A quantitative version of the idempotent theorem in harmonic analysis. ArXiv Mathematics e-prints, Nov. 2006.
[2] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. In STOC '91: Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing, pages 455-464, New York, NY, USA, 1991. ACM Press.
[3] N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform, and learnability. J. ACM, 40(3):607-620, 1993.
[4] Y. Mansour. Learning Boolean functions via the Fourier transform. In V. Roychowdhury, K.-Y. Siu, and A. Orlitsky, editors, Theoretical Advances in Neural Computation and Learning. Kluwer, 1994.