Mlfmodule 1
Introduction
This module is about Machine Learning (ML). That is, we wish to program computers so that they can “learn” from input
available to them. Roughly speaking, learning is the process of converting experience into expertise or
knowledge. The input to a learning algorithm is training data, representing experience, and the output
is some expertise, which usually takes the form of another computer program that can perform some
task. Seeking a formal-mathematical understanding of this concept, we’ll have to be more explicit
about what we mean by each of the involved terms: What is the training data our programs will
access? How can the process of learning be automated? How can we evaluate the success of such a
process (namely, the quality of the output of a learning program)?
What is Learning?
Some of the most fundamental issues in ML arise already in the context of naturally occurring animal learning, which we are all familiar with.
Bait Shyness – Rats Learning to Avoid Poisonous Baits: When rats encounter food items with novel
look or smell, they will first eat very small amounts, and subsequent feeding will depend on the flavor
of the food and its physiological effect. If the food produces an ill effect, the novel food will often be
associated with the illness, and subsequently, the rats will not eat it. Clearly, there is a learning
mechanism in play here – the animal used past experience with some food to acquire expertise in
detecting the safety of this food. If past experience with the food was negatively labeled, the animal
predicts that it will also have a negative effect when encountered in the future. Inspired by the
preceding example of successful learning, let us demonstrate a typical machine learning task. Suppose
we would like to program a machine that learns how to filter spam e-mails. A naive solution would be
seemingly similar to the way rats learn how to avoid poisonous baits. The machine will simply
memorize all previous e-mails that had been labeled as spam e-mails by the human user. When a new
e-mail arrives, the machine will search for it in the set of previous spam e-mails. If it matches one of
them, it will be trashed. Otherwise, it will be moved to the user’s inbox folder.
• Tasks beyond Human Capabilities: Another wide family of tasks that benefit from machine learning
techniques are related to the analysis of very large and complex data sets: astronomical data, turning
medical archives into medical knowledge, weather prediction, analysis of genomic data, Web search
engines, and electronic commerce. With more and more available digitally recorded data, it becomes
obvious that there are treasures of meaningful information buried in data archives that are way too
large and too complex for humans to make sense of. Learning to detect meaningful patterns in large
and complex data sets is a promising domain in which the combination of programs that learn with the
almost unlimited memory capacity and ever increasing processing speed of computers opens up new
horizons.
Adaptivity:
One limiting feature of programmed tools is their rigidity – once the program has been written
down and installed, it stays unchanged. However, many tasks change over time or from one user to
another. Machine learning tools – programs whose behavior adapts to their input data – offer a solution
to such issues; they are, by nature, adaptive to changes in the environment they interact with. Typical
successful applications of machine learning to such problems include programs that decode
handwritten text, where a fixed program can adapt to variations between the handwriting of different
users; spam detection programs, adapting automatically to changes in the nature of spam e-mails; and
speech recognition programs.
Types of Learning:
Learning is, of course, a very wide domain. Consequently, the field of machine learning has branched
into several subfields dealing with different types of learning tasks. We give a rough taxonomy of
learning paradigms, aiming to provide some perspective of where the content of this book sits within
the wide field of machine learning. We describe four parameters along which learning paradigms can
be classified.
When one thinks about human learning, of a baby at home or a student at school, the process often
involves a helpful teacher, who is trying to feed the learner with the information most useful for
achieving the learning goal. In contrast, when a scientist learns about nature, the environment, playing
the role of the teacher, can be best thought of as passive – apples drop, stars shine, and the rain falls
without regard to the needs of the learner. We model such learning scenarios by postulating that the
training data (or the learner’s experience) is generated by some random process. This is the basic
building block in the branch of “statistical learning.” Finally, learning also occurs when the learner’s
input is generated by an adversarial “teacher.” This may be the case in the spam filtering example (if
the spammer makes an effort to mislead the spam filtering designer) or in learning to detect fraud. One
also uses an adversarial teacher model as a worst-case scenario, when no milder setup can be safely
assumed. If you can learn against an adversarial teacher, you are guaranteed to succeed when interacting with any other teacher.
The last parameter we mention is the distinction between situations in which the learner has to respond
online, throughout the learning process, and settings in which the learner has to engage the acquired
expertise only after having a chance to process large amounts of data. For example, a stockbroker has
to make daily decisions, based on the experience collected so far. He may become an expert over time,
but might have made costly mistakes in the process. In contrast, in many data mining settings, the
learner – the data miner – has large amounts of training data to play with before having to output
conclusions.
A Formal Model
The Statistical Learning Framework
• The learner’s input:
In the basic statistical learning setting, the learner has access to the following:
– Domain set: An arbitrary set, X. This is the set of objects that we may wish to label. For example,
in the papaya learning problem mentioned before, the domain set will be the set of all papayas.
Usually, these domain points will be represented by a vector of features (like the papaya’s color and
softness). We also refer to domain points as instances and to X as instance space.
– Label set: For our current discussion, we will restrict the label set to be a two-element set, usually
{0, 1} or {−1, +1}. Let Y denote our set of possible labels. For our papayas example, let Y be {0, 1},
where 1 represents being tasty and 0 stands for being not-tasty.
– Training data: S = ((x1, y1), . . . , (xm, ym)) is a finite sequence of pairs in X × Y: that is, a
sequence of labeled domain points. This is the input that the learner has access to (like a set of papayas
that have been tasted and their color, softness, and tastiness). Such labeled examples are often called
training examples. We sometimes also refer to S as a training set.¹
• The learner’s output: The learner is requested to output a prediction rule, h : X → Y. This
function is also called a predictor, a hypothesis, or a classifier. The predictor can be used to predict the
label of new domain points. In our papayas example, it is a rule that our learner will employ to predict
whether future papayas he examines in the farmers’ market are going to be tasty or not. We use the
notation A(S) to denote the hypothesis that a learning algorithm, A, returns upon receiving the training
sequence S.
• A simple data-generation model: We now explain how the training data is generated. First,
we assume that the instances (the papayas we encounter) are generated by some probability
distribution (in this case, representing the environment). Let us denote that probability distribution
over X by D. It is important to note that we do not assume that the learner knows anything about this
distribution. For the type of learning tasks we discuss, this could be any arbitrary probability
distribution. As to the labels, in the current discussion we assume that there is some “correct” labeling
function, f : X → Y, and that yi = f(xi) for all i. This assumption will be relaxed in the next chapter.
The labeling function is unknown to the learner. In fact, this is just what the learner is trying to figure
out. In summary, each pair in the training data S is generated by first sampling a point xi according to
D and then labeling it by f.
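To make this setup concrete, here is a minimal Python sketch of such a data-generation process. The two features, the uniform distribution playing the role of D, and the labeling function f are all invented for illustration; none of this is assumed to be known to the learner.

```python
import random

def sample_instance():
    """Draw one instance x ~ D. Here D is assumed uniform over a
    color-softness unit square; the environment distribution itself
    is unknown to the learner."""
    color = random.random()      # hypothetical 'color' feature in [0, 1]
    softness = random.random()   # hypothetical 'softness' feature in [0, 1]
    return (color, softness)

def true_label(x):
    """A hypothetical 'correct' labeling function f : X -> {0, 1}.
    The learner never sees this function; it only sees its values on S."""
    color, softness = x
    return 1 if 0.3 <= color <= 0.7 and 0.2 <= softness <= 0.8 else 0

def generate_training_data(m):
    """Generate S = ((x1, y1), ..., (xm, ym)) by sampling xi ~ D i.i.d.
    and setting yi = f(xi)."""
    return [(x, true_label(x)) for x in (sample_instance() for _ in range(m))]

S = generate_training_data(50)
print(S[:3])
```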
• Measures of success:
We define the error of a classifier to be the probability that it does not predict the correct label on a random data point generated by the aforementioned underlying distribution. That is, the error of h is the probability of drawing a random instance x, according to the distribution D, such that h(x) does not equal f(x). Formally, given a domain subset,² A ⊂ X, the probability distribution D assigns a number, D(A), which determines how likely it is to observe a point x ∈ A. In many cases, we refer to A as an event and express it using a function π : X → {0, 1}, namely, A = {x ∈ X : π(x) = 1}. In that case, we also use the notation Px∼D[π(x)] to express D(A). We define the error of a prediction rule, h : X → Y, to be

LD,f(h) = Px∼D[h(x) ≠ f(x)] = D({x : h(x) ≠ f(x)}).    (2.1)

That is, the error of such h is the probability of randomly choosing an example x for which h(x) ≠ f(x). The subscript (D, f) indicates that the error is measured with respect to the probability distribution D and the correct labeling function f. We omit this subscript when it is clear from the context. L(D,f)(h) has several synonymous names, such as the generalization error, the risk, or the true error of h, and we will use these names interchangeably. We use the letter L for the error, since we view this error as the loss of the learner. We will later also discuss other possible formulations of such loss.

¹ Despite the “set” notation, S is a sequence. In particular, the same example may appear twice in S, and some algorithms can take into account the order of examples in S.
² Strictly speaking, we should require that A is a member of some σ-algebra of subsets of X, over which D is defined. We will formally define our measurability assumptions in the next chapter.
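Returning to the definition (2.1): because the true error is a probability under D, it can be approximated by sampling whenever D and f are available, as in a simulation. The sketch below (with the same kind of invented D and f as in the previous snippet) estimates LD,f(h) by Monte Carlo; it only illustrates the definition, since a real learner has no access to D or f.

```python
import random

def sample_instance():
    # Hypothetical distribution D: uniform over the unit square.
    return (random.random(), random.random())

def f(x):
    # Hypothetical correct labeling function (unknown to the learner).
    color, softness = x
    return 1 if 0.3 <= color <= 0.7 and 0.2 <= softness <= 0.8 else 0

def true_error(h, n_samples=100_000):
    """Monte Carlo estimate of L_{D,f}(h) = P_{x~D}[h(x) != f(x)]."""
    mistakes = sum(h(x) != f(x) for x in (sample_instance() for _ in range(n_samples)))
    return mistakes / n_samples

def always_zero(x):
    # Example predictor: always says "not tasty".
    return 0

print(true_error(always_zero))  # roughly the probability mass of the positive region
```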
• A note about the information available to the learner: The learner is blind to the
underlying distribution D over the world and to the labeling function f. In our papayas example, we
have just arrived in a new island and we have no clue as to how papayas are distributed and how to
predict their tastiness. The only way the learner can interact with the environment is through observing
the training set. In the next section we describe a simple learning paradigm for the preceding setup and
analyze its performance.
As mentioned earlier, a learning algorithm receives as input a training set S, sampled from an
unknown distribution D and labeled by some target function f, and should output a predictor hS : X →
Y (the subscript S emphasizes the fact that the output predictor depends on S). The goal of the
algorithm is to find hS that minimizes the error with respect to the unknown D and f. Since the learner
does not know what D and f are, the true error is not directly available to the learner. A useful notion
of error that can be calculated by the learner is the training error – the error the classifier incurs over the training sample:

LS(h) = |{i ∈ [m] : h(xi) ≠ yi}| / m,    where [m] = {1, . . . , m}.
Since the training sample is the snapshot of the world that is available to the learner, it makes sense to
search for a solution that works well on that data. This learning paradigm – coming up with a predictor
h that minimizes LS(h) – is called Empirical Risk Minimization or ERM for short.
Although the ERM rule seems very natural, without being careful, this approach may fail
miserably. To demonstrate such a failure, let us go back to the problem of learning to predict the taste
of a papaya on the basis of its softness and color. Consider the following setting. Assume that the probability distribution D is such that instances are distributed uniformly within a (gray) square of area 2, and the labeling function, f, determines the label to be 1 if the instance is within an inner (blue) square of area 1, and 0 otherwise. Consider the following predictor:

hS(x) = yi   if there exists i ∈ [m] such that xi = x,
hS(x) = 0    otherwise.
While this predictor might seem rather artificial, in Exercise 1 we show a natural
representation of it using polynomials. Clearly, no matter what the sample is, LS(hS) = 0, and
therefore this predictor may be chosen by an ERM algorithm (it is one of the empirical-minimum-cost
hypotheses; no classifier can have smaller error). On the other hand, the true error of any classifier that
predicts the label 1 only on a finite number of instances is, in this case, 1/2. Thus, LD(hS) = 1/2. We
have found a predictor whose performance on the training set is excellent, yet its performance on the
true “world” is very poor. This phenomenon is called overfitting. Intuitively, overfitting occurs when
our hypothesis fits the training data “too well” (perhaps like the everyday experience that a person who
provides a perfect detailed explanation for each of his single actions may raise suspicion).
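A minimal sketch of this failure mode, under the assumptions of the example (instances uniform over a square of area 2, label 1 inside an inner square of area 1), is given below; the geometry, sample sizes, and helper names are all illustrative.

```python
import random

SIDE = 2 ** 0.5          # square of area 2
INNER = 1.0              # inner square of area 1
LO = (SIDE - INNER) / 2
HI = LO + INNER

def sample_instance():
    # Hypothetical D: uniform over the square of area 2.
    return (random.uniform(0, SIDE), random.uniform(0, SIDE))

def f(x):
    # Label 1 inside the centered inner square of area 1, 0 otherwise.
    return 1 if LO <= x[0] <= HI and LO <= x[1] <= HI else 0

def memorizing_predictor(S):
    """h_S(x) = y_i if x equals some training point x_i, and 0 otherwise."""
    memory = {x: y for x, y in S}
    return lambda x: memory.get(x, 0)

S = [(x, f(x)) for x in (sample_instance() for _ in range(100))]
h_S = memorizing_predictor(S)

train_error = sum(h_S(x) != y for x, y in S) / len(S)
test = [sample_instance() for _ in range(100_000)]
true_error = sum(h_S(x) != f(x) for x in test) / len(test)
print(train_error, true_error)  # expect 0.0 and roughly 0.5
```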
A common solution is to apply the ERM learning rule over a restricted search space. Formally, the learner should choose in advance (before seeing the data) a set of predictors. This set is called a hypothesis class and is denoted by H. Each h ∈ H is a function mapping from X to Y. For a given class H, and a training sample, S, the ERMH learner uses the ERM rule to choose a predictor h ∈ H with the lowest possible error over S.
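As an illustration, the following sketch implements ERMH for a small, invented finite class of threshold predictors over a single feature; the class, the grid of thresholds, and the toy sample are all assumptions made for the example.

```python
def empirical_risk(h, S):
    """L_S(h): fraction of training examples that h labels incorrectly."""
    return sum(h(x) != y for x, y in S) / len(S)

def erm(hypothesis_class, S):
    """ERM_H: return a hypothesis in H minimizing the empirical risk over S."""
    return min(hypothesis_class, key=lambda h: empirical_risk(h, S))

# A hypothetical finite class H of threshold predictors on a single feature:
# h_t(x) = 1 if x >= t, else 0, for a grid of thresholds t.
H = [(lambda x, t=t: 1 if x >= t else 0) for t in [i / 10 for i in range(11)]]

# Invented training data: (feature value, label) pairs.
S = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1)]

h_S = erm(H, S)
print(empirical_risk(h_S, S))  # 0.0 for this separable toy sample
```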
Definition (PAC Learnability): A hypothesis class H is PAC learnable if there exist a function mH : (0, 1)² → N and a learning algorithm with the following property: For every ε, δ ∈ (0, 1), for every distribution D over X, and for every labeling function f : X → {0, 1}, if the realizability assumption holds with respect to H, D, f, then when running the learning algorithm on m ≥ mH(ε, δ) i.i.d. examples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ (over the choice of the examples), L(D,f)(h) ≤ ε.
Sample Complexity: The function mH : (0, 1)² → N determines the sample complexity of learning H: that is, how many examples are required to guarantee a probably approximately correct solution. The sample complexity is a function of the accuracy (ε) and confidence (δ) parameters. It also depends on properties of the hypothesis class H – for example, for a finite class we showed that the sample complexity depends on the log of the size of H. Note that if H is PAC learnable, there are many functions mH that satisfy the requirements given in the definition of PAC learnability. Therefore, to be precise, we will define the sample complexity of learning H to be the “minimal function,” in the sense that for any ε, δ, mH(ε, δ) is the minimal integer that satisfies the requirements of PAC learning with accuracy ε and confidence δ. Let us now recall the conclusion of the analysis of finite hypothesis classes from the previous chapter. It can be rephrased as follows.

Corollary 3.2: Every finite hypothesis class is PAC learnable with sample complexity

mH(ε, δ) ≤ ⌈log(|H|/δ)/ε⌉.
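As a quick numerical illustration of this bound, the sketch below evaluates ⌈log(|H|/δ)/ε⌉ for a few class sizes; the particular ε, δ, and sizes are arbitrary.

```python
import math

def finite_class_sample_complexity(class_size, epsilon, delta):
    """Upper bound on the sample complexity of a finite hypothesis class
    in the realizable PAC setting: ceil(log(|H|/delta) / epsilon)."""
    return math.ceil(math.log(class_size / delta) / epsilon)

for size in (10, 1_000, 1_000_000):
    print(size, finite_class_sample_complexity(size, epsilon=0.05, delta=0.01))
# The required sample size grows only logarithmically with |H|.
```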
There are infinite classes that are learnable as well (see, for example, Exercise 3). Later on we
will show that what determines the PAC learnability of a class is not its finiteness but rather a
combinatorial measure called the VC dimension.
A More Realistic Model for the Data-Generating Distribution
Recall that the realizability assumption requires that there exists h* ∈ H such that Px∼D[h*(x) = f(x)] = 1. In many practical problems this assumption does not hold. Furthermore, it is maybe more realistic not to assume that the
labels are fully determined by the features we measure on input elements (in the case of the papayas, it
is plausible that two papayas of the same color and softness will have different taste). In the following,
we relax the realizability assumption by replacing the “target labeling function” with a more flexible
notion, a data-labels generating distribution. Formally, from now on, let D be a probability distribution
over X × Y, where, as before, X is our domain set and Y is a set of labels (usually we will consider Y
= {0, 1}). That is, D is a joint distribution over domain points and labels. One can view such a
distribution as being composed of two parts: a distribution Dx over unlabeled domain points
(sometimes called the marginal distribution) and a conditional probability over labels for each domain
point, D((x, y)|x). In the papaya example, Dx determines the probability of encountering a papaya
whose color and hardness fall in some color-hardness values domain, and the conditional probability is
the probability that a papaya with color and hardness represented by x is tasty. Indeed, such modeling
allows for two papayas that share the same color and hardness to belong to different taste categories.
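To illustrate this data model, the sketch below draws labeled examples by first sampling x from a hypothetical marginal Dx and then sampling y from an invented conditional probability P[y = 1 | x]; two identical instances may now receive different labels.

```python
import random

def sample_x():
    # Hypothetical marginal distribution D_x over a single 'hardness' feature.
    return random.random()

def prob_tasty_given_x(x):
    # Invented conditional probability D(y = 1 | x): harder papayas are less likely tasty.
    return 1.0 - x

def sample_labeled_example():
    """Draw (x, y) ~ D: first x ~ D_x, then y ~ Bernoulli(P[y = 1 | x])."""
    x = sample_x()
    y = 1 if random.random() < prob_tasty_given_x(x) else 0
    return (x, y)

S = [sample_labeled_example() for _ in range(10)]
print(S)
```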
The Empirical and the True Error Revised
For a probability distribution, D, over X × Y, one can measure how likely h is to make an error when labeled points are randomly drawn according to D. We redefine the true error (or risk) of a prediction rule h to be

LD(h) = P(x,y)∼D[h(x) ≠ y] = D({(x, y) : h(x) ≠ y}).
The Bayes Optimal Predictor: Given any probability distribution D over X × {0, 1}, the best label predicting function from X to {0, 1} is

fD(x) = 1   if P[y = 1 | x] ≥ 1/2,
fD(x) = 0   otherwise.
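A minimal sketch of the Bayes optimal rule, assuming we were handed the conditional probability η(x) = P[y = 1 | x] (which in practice is unknown), could look as follows; the particular η is invented.

```python
def bayes_optimal_predictor(eta):
    """Given eta(x) = P[y = 1 | x], return the Bayes optimal classifier
    f_D(x) = 1 if eta(x) >= 1/2, and 0 otherwise."""
    return lambda x: 1 if eta(x) >= 0.5 else 0

def eta(x):
    # Hypothetical conditional probability for the papaya example:
    # tastiness probability rises with a single 'softness' feature in [0, 1].
    return min(1.0, max(0.0, x))

f_D = bayes_optimal_predictor(eta)
print(f_D(0.3), f_D(0.8))  # 0, 1
```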
• Multiclass Classification: Our classification does not have to be binary. Take, for example,
the task of document classification: We wish to design a program that will be able to classify given
documents according to topics (e.g., news, sports, biology, medicine). A learning algorithm for such a
task will have access to examples of correctly classified documents and, on the basis of these
examples, should output a program that can take as input a new document and output a topic
classification for that document. Here, the domain set is the set of all potential documents. Once again,
we would usually represent documents by a set of features that could include counts of different key
words in the document, as well as other possibly relevant features like the size of the document or its
origin. The label set in this task will be the set of possible document topics (so Y will be some large
finite set). Once we determine our domain and label sets, the other components of our framework look
exactly the same as in the papaya tasting example; Our training sample will be a finite sequence of
(feature vector, label) pairs, the learner’s output will be a function from the domain set to the label set,
and, finally, for our measure of success, we can use the probability, over (document, topic) pairs, of
the event that our predictor suggests a wrong label.
• Regression: In this task, one wishes to find some simple pattern in the data – a functional
relationship between the X and Y components of the data. For example, one wishes to find a linear
function that best predicts a baby’s birth weight on the basis of ultrasound measures of his head
circumference, abdominal circumference, and femur length. Here, our domain set X is some subset of
R³ (the three ultrasound measurements), and the set of “labels,” Y, is the set of real numbers (the
weight in grams). In this context, it is more adequate to call Y the target set. Our training data as well
as the learner’s output are as before (a finite sequence of (x, y) pairs, and a function from X to Y
respectively). However, our measure of success is different. We may evaluate the quality of a
hypothesis function, h : X → Y, by the expected square difference between the true labels and their
predicted values, namely,

LD(h) = E(x,y)∼D (h(x) − y)².
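On a finite sample, this risk is estimated by the mean squared error. The sketch below uses invented measurements and an arbitrary, unfitted linear predictor purely to illustrate the computation.

```python
def mean_squared_error(h, S):
    """Empirical estimate of E[(h(x) - y)^2] over a sample S of (x, y) pairs."""
    return sum((h(x) - y) ** 2 for x, y in S) / len(S)

# Hypothetical data: (head circumference, abdominal circumference, femur length) -> weight in grams.
S = [((33.0, 30.0, 6.5), 3100.0), ((35.0, 33.0, 7.2), 3600.0), ((31.0, 28.0, 6.0), 2800.0)]

def h(x):
    # An invented linear predictor; its coefficients are illustrative, not fitted.
    return 100 * x[0] + 20 * x[1] - 300

print(mean_squared_error(h, S))
```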
A training set S is called ε-representative (w.r.t. domain Z, hypothesis class H, loss function ℓ, and distribution D) if for all h ∈ H, |LS(h) − LD(h)| ≤ ε.

Lemma: Assume that a training set S is (ε/2)-representative (w.r.t. domain Z, hypothesis class H, loss function ℓ, and distribution D). Then, any output of ERMH(S), namely, any hS ∈ argminh∈H LS(h), satisfies

LD(hS) ≤ minh∈H LD(h) + ε.
To show that uniform convergence holds, we follow a two-step argument, similar to the derivation in
Chapter 2. The first step applies the union bound while the second step employs a measure
concentration inequality. We now explain these two steps in detail.
Fix some ε, δ. We need to find a sample size m that guarantees that for any D, with probability of at least 1 − δ over the choice of S = (z1, . . . , zm) sampled i.i.d. from D, we have that for all h ∈ H, |LS(h) − LD(h)| ≤ ε. That is,

Dm({S : ∀h ∈ H, |LS(h) − LD(h)| ≤ ε}) ≥ 1 − δ.

Equivalently, we need to show that

Dm({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) < δ.

Writing

{S : ∃h ∈ H, |LS(h) − LD(h)| > ε} = ∪h∈H {S : |LS(h) − LD(h)| > ε},

and applying the union bound (Lemma 2.2) we obtain

Dm({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) ≤ Σh∈H Dm({S : |LS(h) − LD(h)| > ε}).    (4.1)

Our second step will be to argue that each summand of the right-hand side of this inequality is small enough (for a sufficiently large m). That is, we will show that for any fixed hypothesis, h (which is chosen in advance, prior to the sampling of the training set), the gap between the true and empirical risks, |LS(h) − LD(h)|, is likely to be small. Recall that LD(h) = Ez∼D[ℓ(h, z)] and that LS(h) = (1/m) Σi=1..m ℓ(h, zi). Since each zi is sampled i.i.d. from D, the expected value of the random variable ℓ(h, zi) is LD(h). By the linearity of expectation, it follows that LD(h) is also the expected value of LS(h). Hence, the quantity |LD(h) − LS(h)| is the deviation of the random variable LS(h) from its expectation. We therefore need to show that the measure of LS(h) is concentrated around its expected value. A basic statistical fact, the law of large numbers, states that when m goes to infinity, empirical averages converge to their true expectation. This is true for LS(h), since it is the empirical average of m i.i.d. random variables. However, since the law of large numbers is only an asymptotic result, it provides no information about the gap between the empirically estimated error and its true value for any given, finite, sample size. Instead, we will use a measure concentration inequality due to Hoeffding, which quantifies the gap between empirical averages and their expected value.

Lemma 4.5 (Hoeffding’s Inequality): Let θ1, . . . , θm be a sequence of i.i.d. random variables and assume that for all i, E[θi] = µ and P[a ≤ θi ≤ b] = 1. Then, for any ε > 0,

P[ |(1/m) Σi=1..m θi − µ| > ε ] ≤ 2 exp(−2mε²/(b − a)²).
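A small simulation makes the inequality concrete: for bounded i.i.d. variables, the observed frequency of large deviations of the empirical average stays below the bound 2 exp(−2mε²/(b − a)²). The Bernoulli distribution, m, and ε below are arbitrary choices for illustration.

```python
import math
import random

def deviation_probability(m, epsilon, p=0.3, trials=20_000):
    """Estimate P[|mean of m i.i.d. Bernoulli(p) draws - p| > epsilon]."""
    count = 0
    for _ in range(trials):
        avg = sum(random.random() < p for _ in range(m)) / m
        if abs(avg - p) > epsilon:
            count += 1
    return count / trials

m, epsilon = 100, 0.1
empirical = deviation_probability(m, epsilon)
bound = 2 * math.exp(-2 * m * epsilon ** 2)   # here a = 0, b = 1
print(empirical, bound)  # the empirical frequency should not exceed the bound
```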
The “No Free Lunch” Theorem
According to the “No Free Lunch” (NFL) theorem, all optimization methods perform equally well when averaged over all possible optimization tasks (assuming candidate points are not re-sampled). This fundamental theoretical result has had a profound impact on optimization, search, and supervised learning. Once the first NFL theorem was formulated, it gave rise to a series of research works that defined a whole field of study, with meaningful outcomes across the disciplines of science where the effective exploration of a search region is a vital activity.
In general, knowledge about the structure of the problem is as important as the algorithm itself; an effective solution comes from matching the algorithm to the objective function. If no useful properties of the objective function are known,
and one is just working with a black box, no guarantee can be made that this or that method
outperforms a (pseudo)random search.
A framework has been developed to investigate the relationship between optimization algorithms and the problems they solve. A series of “no free lunch” (NFL) theorems are provided, establishing that any improved performance over one class of problems is offset by degraded performance over another class. These theorems also provide a geometric interpretation of what it means for an algorithm to be well matched to an optimization problem.
The NFL theorems are also applied to information-theoretic elements of optimization and
benchmark measurements of performance.
There is no such thing as a free lunch in project design either: adding alternatives to a project incurs both direct and opportunity costs. As a result, incorporating such alternatives may increase the original
development cost. Direct costs are the expenses of additional development effort required to include
certain flexibilities into the project’s architecture. Opportunity costs are the expenses of not being
able to do anything else (for example, add a feature) as a result of the time and effort spent on
generating that flexibility.
Error Decomposition:
To understand how well an ERMH predictor can perform, we decompose its error into two components as follows.
Let hS be an ERMH hypothesis.
Then, we can write
LD(hS) = εapp + εest,   where   εapp = minh∈H LD(h)   and   εest = LD(hS) − εapp.
• The Approximation Error – the minimum risk achievable by a predictor in the hypothesis class.
This term measures how much risk we have because we restrict ourselves to a specific class, namely,
how much inductive bias we have. The approximation error does not depend on the sample size and is
determined by the hypothesis class chosen. Enlarging the hypothesis class can decrease the
approximation error. Under the realizability assumption, the approximation error is zero. In the
agnostic case, however, the approximation error can be large.¹
¹ In fact, it always includes the error of the Bayes optimal predictor (see Chapter 3), the minimal yet
inevitable error, because of the possible nondeterminism of the world in this model. Sometimes in the
literature the term approximation error refers not to minh∈H LD(h), but rather to the excess error over
that of the Bayes optimal predictor, namely, minh∈H LD(h) − εBayes.
• The Estimation Error – the difference between the approximation error and the error achieved by
the ERM predictor. The estimation error results because the empirical risk (i.e., training error) is only
an estimate of the true risk, and so the predictor minimizing the empirical risk is only an estimate of
the predictor minimizing the true risk. The quality of this estimation depends on the training set size
and on the size, or complexity, of the hypothesis class. As we have shown, for a finite hypothesis class,
εest increases (logarithmically) with |H| and decreases with m. We can think of the size of H as a measure of its complexity. In future chapters we will define other complexity measures of hypothesis classes.
Since our goal is to minimize the total risk, we face a tradeoff, called the bias-complexity
tradeoff. On one hand, choosing H to be a very rich class decreases the approximation error but at the
same time might increase the estimation error, as a rich H might lead to overfitting. On the other hand,
choosing H to be a very small set reduces the estimation error but might increase the approximation
error or, in other words, might lead to underfitting. Of course, a great choice for H is the class that
contains only one classifier – the Bayes optimal classifier. But the Bayes optimal classifier depends on
the underlying distribution D, which we do not know (indeed, learning would have been unnecessary
had we known D). Learning theory studies how rich we can make H while still maintaining reasonable
estimation error. In many cases, empirical research focuses on designing good hypothesis classes for a
certain domain. Here, “good” means classes for which the approximation error would not be
excessively high. The idea is that although we are not experts and do not know how to construct the
optimal classifier, we still have some prior knowledge of the specific problem at hand, which enables
us to design hypothesis classes for which both the approximation error and the estimation error are not
too large. Getting back to our papayas example, we do not know how exactly the color and hardness of
a papaya predict its taste, but we do know that papaya is a fruit and on the basis of previous experience
with other fruit we conjecture that a rectangle in the color-hardness space may be a good predictor.
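The tradeoff can be observed empirically by running ERM over nested hypothesis classes of increasing complexity on one fixed, small training set: as the class grows, the training error keeps shrinking while the true error typically improves and then degrades. The sketch below uses an invented one-dimensional problem and “histogram” classes Hk of predictors that are constant on k equal-width cells; it is purely illustrative.

```python
import random

def f(x):
    # Hypothetical true labeling function on [0, 1]: label 1 on the middle third.
    return 1 if 1/3 <= x <= 2/3 else 0

def sample(m):
    # i.i.d. sample from a uniform marginal, labeled by f.
    return [(x, f(x)) for x in (random.random() for _ in range(m))]

def erm_histogram(S, k):
    """ERM over the class H_k of predictors that are constant on each of k
    equal-width cells of [0, 1]: each cell gets the majority training label
    (empty cells default to 0)."""
    votes = [[0, 0] for _ in range(k)]
    for x, y in S:
        votes[min(int(x * k), k - 1)][y] += 1
    cell_labels = [1 if ones > zeros else 0 for zeros, ones in votes]
    return lambda x: cell_labels[min(int(x * k), k - 1)]

def error(h, data):
    return sum(h(x) != y for x, y in data) / len(data)

random.seed(0)
S = sample(30)            # one small training set
test = sample(50_000)     # large held-out sample standing in for the true error
for k in (1, 3, 12, 100, 1000):
    h = erm_histogram(S, k)
    print(k, round(error(h, S), 3), round(error(h, test), 3))
# Training error shrinks as k grows; test error typically improves and then degrades.
```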