
MODULE 1: FOUNDATION

Introduction
Machine Learning (ML) is concerned with programming computers so that they can “learn” from input
available to them. Roughly speaking, learning is the process of converting experience into expertise or
knowledge. The input to a learning algorithm is training data, representing experience, and the output
is some expertise, which usually takes the form of another computer program that can perform some
task. Seeking a formal-mathematical understanding of this concept, we’ll have to be more explicit
about what we mean by each of the involved terms: What is the training data our programs will
access? How can the process of learning be automated? How can we evaluate the success of such a
process (namely, the quality of the output of a learning program)?

What is Learning?

Some of the most fundamental issues in ML already arise in a context we are all familiar with: natural (animal) learning, as in the following example.

Bait Shyness – Rats Learning to Avoid Poisonous Baits: When rats encounter food items with novel
look or smell, they will first eat very small amounts, and subsequent feeding will depend on the flavor
of the food and its physiological effect. If the food produces an ill effect, the novel food will often be
associated with the illness, and subsequently, the rats will not eat it. Clearly, there is a learning
mechanism in play here – the animal used past experience with some food to acquire expertise in
detecting the safety of this food. If past experience with the food was negatively labeled, the animal
predicts that it will also have a negative effect when encountered in the future. Inspired by the
preceding example of successful learning, let us demonstrate a typical machine learning task. Suppose
we would like to program a machine that learns how to filter spam e-mails. A naive solution would be
seemingly similar to the way rats learn how to avoid poisonous baits. The machine will simply
memorize all previous e-mails that had been labeled as spam e-mails by the human user. When a new
e-mail arrives, the machine will search for it in the set of previous spam e-mails. If it matches one of
them, it will be trashed. Otherwise, it will be moved to the user’s inbox folder.
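As a concrete illustration, here is a minimal sketch of this “learning by memorization” approach in Python (the helper names and the exact-match criterion are our own choices for illustration, not part of the original description):

# Memorization-based "learning": remember every e-mail the user labeled as spam,
# and trash a new e-mail only if it exactly matches one of the memorized ones.
spam_memory = set()

def train(labeled_emails):
    # labeled_emails: iterable of (text, is_spam) pairs provided by the human user.
    for text, is_spam in labeled_emails:
        if is_spam:
            spam_memory.add(text)

def route(email_text):
    # Exact-match lookup: previously seen spam goes to trash, everything else to the inbox.
    return "trash" if email_text in spam_memory else "inbox"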

While the preceding “learning by memorization” approach is sometimes useful, it lacks an important aspect of learning systems – the ability to label unseen e-mail messages. A successful
learner should be able to progress from individual examples to broader generalization. This is also
referred to as inductive reasoning or inductive inference. In the bait shyness example presented
previously, after the rats encounter an example of a certain type of food, they apply their attitude
toward it on new, unseen examples of food of similar smell and taste. To achieve generalization in the
spam filtering task, the learner can scan the previously seen e-mails, and extract a set of words whose
appearance in an e-mail message is indicative of spam. Then, when a new e-mail arrives, the machine
can check whether one of the suspicious words appears in it, and predict its label accordingly. Such a
system would potentially be able correctly to predict the label of unseen e-mails.
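A sketch of such a generalizing filter follows; the particular rule used to pick “suspicious” words (words that occurred only in spam during training) is one simple choice made here for illustration:

def suspicious_words(labeled_emails):
    # Collect words that appeared in spam e-mails but never in legitimate ones.
    spam_words, ham_words = set(), set()
    for text, is_spam in labeled_emails:
        (spam_words if is_spam else ham_words).update(text.lower().split())
    return spam_words - ham_words

def predict(email_text, words):
    # Predict "spam" if the new e-mail contains any suspicious word.
    return "spam" if set(email_text.lower().split()) & words else "not-spam"

words = suspicious_words([("win money now", True), ("meeting at noon", False)])
print(predict("free money inside", words))  # -> spam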

When Do We Need Machine Learning?


When do we need machine learning rather than directly program our computers to carry out the task at
hand? Two aspects of a given problem may call for the use of programs that learn and improve on the
basis of their “experience”: the problem’s complexity and the need for adaptivity.

Tasks That Are Too Complex to Program:


• Tasks Performed by Animals/Humans: There are numerous tasks that we human beings
perform routinely, yet our introspection concerning how we do them is not sufficiently elaborate to
extract a well defined program. Examples of such tasks include driving, speech recognition, and image
understanding. In all of these tasks, state of the art machine learning programs, programs that “learn
from their experience,” achieve quite satisfactory results, once exposed to sufficiently many training
examples.

• Tasks beyond Human Capabilities: Another wide family of tasks that benefit from machine learning
techniques are related to the analysis of very large and complex data sets: astronomical data, turning
medical archives into medical knowledge, weather prediction, analysis of genomic data, Web search
engines, and electronic commerce. With more and more available digitally recorded data, it becomes
obvious that there are treasures of meaningful information buried in data archives that are way too
large and too complex for humans to make sense of. Learning to detect meaningful patterns in large
and complex data sets is a promising domain in which the combination of programs that learn with the
almost unlimited memory capacity and ever increasing processing speed of computers opens up new
horizons.

Adaptivity:
One limiting feature of programmed tools is their rigidity – once the program has been written
down and installed, it stays unchanged. However, many tasks change over time or from one user to
another. Machine learning tools – programs whose behavior adapts to their input data – offer a solution
to such issues; they are, by nature, adaptive to changes in the environment they interact with. Typical
successful applications of machine learning to such problems include programs that decode
handwritten text, where a fixed program can adapt to variations between the handwriting of different
users; spam detection programs, adapting automatically to changes in the nature of spam e-mails; and
speech recognition programs.

Types of Learning:
Learning is, of course, a very wide domain. Consequently, the field of machine learning has branched
into several subfields dealing with different types of learning tasks. We give a rough taxonomy of
learning paradigms, aiming to provide some perspective of where the content of this book sits within
the wide field of machine learning. We describe four parameters along which learning paradigms can
be classified.

Supervised versus Unsupervised:


Since learning involves an interaction between the learner and the environment, one can divide
learning tasks according to the nature of that interaction. The first distinction to note is the difference
between supervised and unsupervised learning. As an illustrative example, consider the task of
learning to detect spam e-mail versus the task of anomaly detection. For the spam detection task, we
consider a setting in which the learner receives training e-mails for which the label spam/not-spam is
provided. On the basis of such training the learner should figure out a rule for labeling a newly arriving
e-mail message. In contrast, for the task of anomaly detection, all the learner gets as training is a large
body of e-mail messages (with no labels) and the learner’s task is to detect “unusual” messages.

More abstractly, viewing learning as a process of “using experience to gain expertise,”


supervised learning describes a scenario in which the “experience,” a training example, contains
significant information (say, the spam/not-spam labels) that is missing in the unseen “test examples” to
which the learned expertise is to be applied. In this setting, the acquired expertise is aimed to predict
that missing information for the test data. In such cases, we can think of the environment as a teacher
that “supervises” the learner by providing the extra information (labels). In unsupervised learning,
however, there is no distinction between training and test data. The learner processes input data with
the goal of coming up with some summary, or compressed version of that data. Clustering a data set
into subsets of similar objects is a typical example of such a task.

Active versus Passive Learner:


Learning paradigms can vary by the role played by the learner. We distinguish between “active” and
“passive” learners. An active learner interacts with the environment at training time, say, by posing
queries or performing experiments, while a passive learner only observes the information provided by
the environment (or the teacher) without influencing or directing it. Note that the learner of a spam
filter is usually passive – waiting for users to mark the e-mails coming to them. In an active setting,
one could imagine asking users to label specific e-mails chosen by the learner, or even composed by
the learner, to enhance its understanding of what spam is.

Helpfulness of the Teacher:

When one thinks about human learning, of a baby at home or a student at school, the process often
involves a helpful teacher, who is trying to feed the learner with the information most useful for
achieving the learning goal. In contrast, when a scientist learns about nature, the environment, playing
the role of the teacher, can be best thought of as passive – apples drop, stars shine, and the rain falls
without regard to the needs of the learner. We model such learning scenarios by postulating that the
training data (or the learner’s experience) is generated by some random process. This is the basic
building block in the branch of “statistical learning.” Finally, learning also occurs when the learner’s
input is generated by an adversarial “teacher.” This may be the case in the spam filtering example (if
the spammer makes an effort to mislead the spam filtering designer) or in learning to detect fraud. One
also uses an adversarial teacher model as a worst-case scenario, when no milder setup can be safely
assumed. If you can learn against an adversarial teacher, you are guaranteed to succeed when interacting with any other teacher.

Online versus Batch Learning Protocol:

The last parameter we mention is the distinction between situations in which the learner has to respond
online, throughout the learning process, and settings in which the learner has to engage the acquired
expertise only after having a chance to process large amounts of data. For example, a stockbroker has
to make daily decisions, based on the experience collected so far. He may become an expert over time,
but might have made costly mistakes in the process. In contrast, in many data mining settings, the
learner – the data miner – has large amounts of training data to play with before having to output
conclusions.

A Formal Model
The Statistical Learning Framework
• The learner’s input:

In the basic statistical learning setting, the learner has access to the following: –

Domain set: An arbitrary set, X. This is the set of objects that we may wish to label. For example, in the papaya learning problem (deciding whether a papaya will be tasty), the domain set will be the set of all papayas. Usually, these domain points will be represented by a vector of features (like the papaya’s color and softness). We also refer to domain points as instances and to X as instance space.

Label set: For our current discussion, we will restrict the label set to be a two-element set, usually
{0, 1} or {−1, +1}. Let Y denote our set of possible labels. For our papayas example, let Y be {0, 1},
where 1 represents being tasty and 0 stands for being not-tasty.

Training data: S = ((x1, y1). . .(xm, ym)) is a finite sequence of pairs in X × Y: that is, a
sequence of labeled domain points. This is the input that the learner has access to (like a set of papayas
that have been tasted and their color, softness, and tastiness). Such labeled examples are often called
training examples. We sometimes also refer to S as a training set. 1

• The learner’s output: The learner is requested to output a prediction rule, h : X → Y. This
function is also called a predictor, a hypothesis, or a classifier. The predictor can be used to predict the
label of new domain points. In our papayas example, it is a rule that our learner will employ to predict
whether future papayas he examines in the farmers’ market are going to be tasty or not. We use the
notation A(S) to denote the hypothesis that a learning algorithm, A, returns upon receiving the training
sequence S.

• A simple data-generation model :We now explain how the training data is generated. First,
we assume that the instances (the papayas we encounter) are generated by some probability
distribution (in this case, representing the environment). Let us denote that probability distribution
over X by D. It is important to note that we do not assume that the learner knows anything about this
distribution. For the type of learning tasks we discuss, this could be any arbitrary probability
distribution. As to the labels, in the current discussion we assume that there is some “correct” labeling
function, f : X → Y, and that yi = f(xi) for all i. This assumption will be relaxed in the next chapter.
The labeling function is unknown to the learner. In fact, this is just what the learner is trying to figure
out. In summary, each pair in the training data S is generated by first sampling a point xi according to
D and then labeling it by f.
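A small simulation of this data-generation model (the uniform distribution D over [0, 1]² and the particular labeling function f below are made-up choices for illustration; in the model both are unknown to the learner):

import random

def sample_instance():
    # x ~ D: here D is taken to be uniform over [0, 1]^2, representing (color, softness).
    return (random.random(), random.random())

def f(x):
    # A hypothetical "correct" labeling function: tasty (1) iff both features are mid-range.
    color, softness = x
    return 1 if 0.3 <= color <= 0.7 and 0.3 <= softness <= 0.7 else 0

m = 20
S = [(x, f(x)) for x in (sample_instance() for _ in range(m))]  # training sequence S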
• Measures of success:

We define the error of a classifier to be the probability that it does not predict the correct label on a random data point generated by the aforementioned underlying distribution. That is, the error of h is the probability of drawing a random instance x, according to the distribution D, such that h(x) does not equal f(x). Formally, given a domain subset,2 A ⊂ X, the probability distribution D assigns a number, D(A), which determines how likely it is to observe a point x ∈ A. In many cases, we refer to A as an event and express it using a function π : X → {0, 1}, namely, A = {x ∈ X : π(x) = 1}. In that case, we also use the notation P(x∼D)[π(x)] to express D(A). We define the error of a prediction rule, h : X → Y, to be

L(D,f)(h) := P(x∼D)[h(x) ≠ f(x)] = D({x : h(x) ≠ f(x)}).

That is, the error of such an h is the probability of randomly choosing an example x for which h(x) ≠ f(x). The subscript (D, f) indicates that the error is measured with respect to the probability distribution D and the correct labeling function f. We omit this subscript when it is clear from the context. L(D,f)(h) has several synonymous names, such as the generalization error, the risk, or the true error of h, and we will use these names interchangeably. We use the letter L for the error, since we view this error as the loss of the learner. We will later also discuss other possible formulations of such loss.

1 Despite the “set” notation, S is a sequence. In particular, the same example may appear twice in S, and some algorithms can take into account the order of examples in S.

2 Strictly speaking, we should be more careful and require that A is a member of some σ-algebra of subsets of X, over which D is defined. We will formally define our measurability assumptions in the next chapter.
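Because the true error L(D,f)(h) is a probability under the unknown distribution D, the learner cannot compute it; but in a simulation where we do control D and f, it can be estimated by sampling. A sketch, with a made-up one-dimensional D and labeling function:

import random

def estimate_true_error(h, sample_instance, f, n=100_000):
    # Monte Carlo estimate of L_{D,f}(h) = P_{x~D}[h(x) != f(x)]:
    # draw n fresh instances from D and count how often h disagrees with f.
    return sum(h(x) != f(x) for x in (sample_instance() for _ in range(n))) / n

f = lambda x: int(x >= 0.5)          # assumed "correct" labeling function
h = lambda x: int(x >= 0.6)          # a slightly wrong predictor
print(estimate_true_error(h, random.random, f))  # approximately 0.1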

• A note about the information available to the learner : The learner is blind to the
underlying distribution D over the world and to the labeling function f. In our papayas example, we
have just arrived in a new island and we have no clue as to how papayas are distributed and how to
predict their tastiness. The only way the learner can interact with the environment is through observing
the training set. In the next section we describe a simple learning paradigm for the preceding setup and
analyze its performance.

Empirical Risk Minimization:

As mentioned earlier, a learning algorithm receives as input a training set S, sampled from an
unknown distribution D and labeled by some target function f, and should output a predictor hS : X →
Y (the subscript S emphasizes the fact that the output predictor depends on S). The goal of the
algorithm is to find hS that minimizes the error with respect to the unknown D and f. Since the learner
does not know what D and f are, the true error is not directly available to the learner. A useful notion
of error that can be calculated by the learner is the training error – the error the classifier incurs over
the training sample:

LS(h) := |{i ∈ [m] : h(xi) ≠ yi}| / m,

where [m] = {1, . . . , m}.
The terms empirical error and empirical risk are often used interchangeably for this error.

Since the training sample is the snapshot of the world that is available to the learner, it makes sense to
search for a solution that works well on that data. This learning paradigm – coming up with a predictor
h that minimizes LS(h) – is called Empirical Risk Minimization or ERM for short.
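A sketch of the ERM rule over a small, explicitly enumerated set of candidate predictors (the threshold classifiers below are just an illustrative choice):

def empirical_risk(h, S):
    # L_S(h): the fraction of training examples on which h is wrong.
    return sum(h(x) != y for x, y in S) / len(S)

def erm(predictors, S):
    # Return a predictor minimizing the empirical risk over the training sample S.
    return min(predictors, key=lambda h: empirical_risk(h, S))

S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]                      # toy training sample
predictors = [lambda x, t=t: int(x >= t) for t in (0.25, 0.5, 0.75)]
h_S = erm(predictors, S)
print(empirical_risk(h_S, S))  # 0.0 (the threshold 0.5 fits this sample perfectly)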

Something May Go Wrong – Overfitting

Although the ERM rule seems very natural, without being careful, this approach may fail
miserably. To demonstrate such a failure, let us go back to the problem of learning to predict the taste
of a papaya on the basis of its softness and color. Consider the following setting: assume that the probability distribution D is such that instances are distributed uniformly within an outer (gray) square of area 2, and the labeling function, f, determines the label to be 1 if the instance falls within an inner (blue) square of area 1, and 0 otherwise. Consider the following predictor:

hS(x) = yi   if ∃i ∈ [m] s.t. xi = x,
hS(x) = 0    otherwise.

While this predictor might seem rather artificial, in Exercise 1 we show a natural
representation of it using polynomials. Clearly, no matter what the sample is, LS(hS) = 0, and
therefore this predictor may be chosen by an ERM algorithm (it is one of the empirical-minimum-cost
hypotheses; no classifier can have smaller error). On the other hand, the true error of any classifier that
predicts the label 1 only on a finite number of instances is, in this case, 1/2. Thus, LD(hS) = 1/2. We
have found a predictor whose performance on the training set is excellent, yet its performance on the
true “world” is very poor. This phenomenon is called overfitting. Intuitively, overfitting occurs when
our hypothesis fits the training data “too well” (perhaps like the everyday experience that a person who
provides a perfect detailed explanation for each of his single actions may raise suspicion).
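The failure can be reproduced in a short simulation that follows the assumptions of the example (instances uniform over an outer square of area 2, label 1 inside an inner square of area 1): the memorizing predictor attains zero empirical risk while its true risk is about 1/2.

import random

def sample_instance():
    # D: uniform over an outer square of area 2 (side length sqrt(2)).
    s = 2 ** 0.5
    return (random.uniform(0, s), random.uniform(0, s))

def f(x):
    # Label 1 inside a centered inner square of area 1, and 0 outside it.
    lo, hi = (2 ** 0.5 - 1) / 2, (2 ** 0.5 + 1) / 2
    return 1 if lo <= x[0] <= hi and lo <= x[1] <= hi else 0

S = [(x, f(x)) for x in (sample_instance() for _ in range(50))]
memory = dict(S)
h_S = lambda x: memory.get(x, 0)      # memorize the sample, predict 0 everywhere else

L_S = sum(h_S(x) != y for x, y in S) / len(S)
L_D = sum(h_S(x) != f(x) for x in (sample_instance() for _ in range(100_000))) / 100_000
print(L_S, L_D)  # 0.0 and roughly 0.5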

Empirical Risk Minimization with Inductive Bias


We have just demonstrated that the ERM rule might lead to overfitting. Rather than giving up
on the ERM paradigm, we will look for ways to rectify it. We will search for conditions under which
there is a guarantee that ERM does not overfit, namely, conditions under which when the ERM
predictor has good performance with respect to the training data, it is also highly likely to perform well
over the underlying data distribution. A common solution is to apply the ERM learning rule over a restricted search space. Formally, the learner should choose in advance (before seeing the data) a set of predictors. This set is called a hypothesis class and is denoted by H. Each h ∈ H is a function mapping from X to Y. For a given class H, and a training sample, S, the ERMH learner uses the ERM rule to choose a predictor h ∈ H with the lowest possible error over S. Formally,

ERMH(S) ∈ argmin(h∈H) LS(h),


where argmin stands for the set of hypotheses in H that achieve the minimum value of LS(h)
over H. By restricting the learner to choosing a predictor from H, we bias it toward a particular set of
predictors. Such restrictions are often called an inductive bias. Since the choice of such a restriction is
determined before the learner sees the training data, it should ideally be based on some prior
knowledge about the problem to be learned. For example, for the papaya taste prediction problem we
may choose the class H to be the set of predictors that are determined by axis aligned rectangles (in the
space determined by the color and softness coordinates). We will later show that ERMH over this class
is guaranteed not to overfit. On the other hand, the example of overfitting that we have seen
previously, demonstrates that choosing H to be a class of predictors that includes all functions that
assign the value 1 to a finite set of domain points does not suffice to guarantee that ERMH will not
overfit. A fundamental question in learning theory is, over which hypothesis classes ERMH learning
will not result in overfitting. We will study this question later in the book. Intuitively, choosing a more
restricted hypothesis class better protects us against overfitting but at the same time might cause us a
stronger inductive bias. We will get back to this fundamental tradeoff later.
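A sketch of ERMH with H taken to be axis-aligned rectangles in the color-softness plane; restricting the rectangle corners to a small grid is an assumption made here only so that the class can be enumerated exhaustively:

import itertools

def rect_predictor(x1, x2, y1, y2):
    # h(p) = 1 iff the point p lies inside the axis-aligned rectangle [x1, x2] x [y1, y2].
    return lambda p: int(x1 <= p[0] <= x2 and y1 <= p[1] <= y2)

def erm_rectangles(S, grid):
    # Enumerate all grid-aligned rectangles and return one with minimal empirical risk.
    best, best_risk = None, float("inf")
    for x1, x2 in itertools.combinations(grid, 2):
        for y1, y2 in itertools.combinations(grid, 2):
            h = rect_predictor(x1, x2, y1, y2)
            risk = sum(h(x) != y for x, y in S) / len(S)
            if risk < best_risk:
                best, best_risk = h, risk
    return best, best_risk

S = [((0.4, 0.5), 1), ((0.6, 0.6), 1), ((0.1, 0.9), 0), ((0.9, 0.2), 0)]   # toy papayas
h, risk = erm_rectangles(S, grid=[0.0, 0.25, 0.5, 0.75, 1.0])
print(risk)  # 0.0 on this sample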

A Formal Learning Model


PAC Learning:

A hypothesis class H is PAC learnable if there exist a function mH : (0, 1)² → N and a learning algorithm with the following property: For every ε, δ ∈ (0, 1), for every distribution D over X, and for every labeling function f : X → {0, 1}, if the realizability assumption holds with respect to H, D, f, then when running the learning algorithm on m ≥ mH(ε, δ) i.i.d. examples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ (over the choice of the examples), L(D,f)(h) ≤ ε.

The definition of Probably Approximately Correct learnability contains two approximation parameters. The accuracy parameter ε determines how far the output classifier can be from the optimal one (this corresponds to the “approximately correct” part), and a confidence parameter δ indicates how likely the classifier is to meet that accuracy requirement (this corresponds to the “probably” part of “PAC”). Under the data access model that we are investigating, these approximations are inevitable. Since the training set is randomly generated, there may always be a small chance that it will happen to be noninformative (for example, there is always some chance that the training set will contain only one domain point, sampled over and over again). Furthermore, even when we are lucky enough to get a training sample that does faithfully represent D, because it is just a finite sample, there may always be some fine details of D that it fails to reflect. Our accuracy parameter, ε, allows “forgiving” the learner’s classifier for making minor errors.

Sample Complexity: The function mH : (0, 1)² → N determines the sample complexity of learning H: that is, how many examples are required to guarantee a probably approximately correct solution. The sample complexity is a function of the accuracy (ε) and confidence (δ) parameters. It also depends on properties of the hypothesis class H – for example, for a finite class the sample complexity depends on the log of the size of H. Note that if H is PAC learnable, there are many functions mH that satisfy the requirements given in the definition of PAC learnability. Therefore, to be precise, we define the sample complexity of learning H to be the “minimal function,” in the sense that for any ε, δ, mH(ε, δ) is the minimal integer that satisfies the requirements of PAC learning with accuracy ε and confidence δ. The analysis of finite hypothesis classes can be rephrased as stating that every finite hypothesis class is PAC learnable with sample complexity

mH(ε, δ) ≤ ⌈log(|H|/δ) / ε⌉.
There are infinite classes that are learnable as well (see, for example, Exercise 3). Later on we
will show that what determines the PAC learnability of a class is not its finiteness but rather a
combinatorial measure called the VC dimension.
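For a feel of the numbers, the bound above can be evaluated directly (the logarithm is taken to be the natural one; |H| = 1000, ε = 0.1 and δ = 0.01 are made-up values for illustration):

import math

H_size, eps, delta = 1000, 0.1, 0.01
m = math.ceil(math.log(H_size / delta) / eps)   # ceil(log(|H|/delta) / eps)
print(m)  # 116 examples suffice according to this bound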

A More General Learning Model:


The model we have just described can be readily generalized, so that it can be made relevant
to a wider scope of learning tasks. We consider generalizations in two aspects:

Releasing the Realizability Assumption – Agnostic PAC Learning:

A More Realistic Model for the Data-Generating Distribution: Recall that the realizability assumption requires that there exists h⋆ ∈ H such that P(x∼D)[h⋆(x) = f(x)] = 1. In many practical
problems this assumption does not hold. Furthermore, it is maybe more realistic not to assume that the
labels are fully determined by the features we measure on input elements (in the case of the papayas, it
is plausible that two papayas of the same color and softness will have different taste). In the following,
we relax the realizability assumption by replacing the “target labeling function” with a more flexible
notion, a data-labels generating distribution. Formally, from now on, let D be a probability distribution
over X × Y, where, as before, X is our domain set and Y is a set of labels (usually we will consider Y
= {0, 1}). That is, D is a joint distribution over domain points and labels. One can view such a
distribution as being composed of two parts: a distribution Dx over unlabeled domain points
(sometimes called the marginal distribution) and a conditional probability over labels for each domain
point, D((x, y)|x). In the papaya example, Dx determines the probability of encountering a papaya
whose color and hardness fall in some color-hardness values domain, and the conditional probability is
the probability that a papaya with color and hardness represented by x is tasty. Indeed, such modeling
allows for two papayas that share the same color and hardness to belong to different taste categories.
The Empirical and the True Error Revised: For a probability distribution, D, over X × Y, one can
measure how likely h is to make an error when labeled points are randomly drawn according to D. We
redefine the true error (or risk) of a prediction rule h to be

LD(h) := P((x,y)∼D)[h(x) ≠ y] = D({(x, y) : h(x) ≠ y}).


We would like to find a predictor, h, for which that error will be minimized. However, the
learner does not know the data generating D. What the learner does have access to is the training data,
S. The definition of the empirical risk remains the same as before, namely,

LS(h) := |{i ∈ [m] : h(xi) ≠ yi}| / m.


Given S, a learner can compute LS(h) for any function h : X → {0, 1}. Note that LS(h) =
LD(uniform over S)(h).

The Goal We wish to find some hypothesis, h : X → Y, that (probably approximately)


minimizes the true risk, LD(h).

The Bayes Optimal Predictor: Given any probability distribution D over X × {0, 1}, the best label-predicting function from X to {0, 1} is

fD(x) = 1 if P[y = 1 | x] ≥ 1/2, and fD(x) = 0 otherwise.
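A tiny worked example of the Bayes optimal predictor on a two-point domain (the joint probabilities below are made up for illustration):

# Hypothetical joint distribution D over (x, y) pairs; the four numbers sum to 1.
joint = {("soft", 1): 0.35, ("soft", 0): 0.15, ("firm", 1): 0.10, ("firm", 0): 0.40}

def bayes_predictor(x):
    # f_D(x) = 1 iff P[y = 1 | x] >= 1/2, computed from the joint distribution.
    p1, p0 = joint[(x, 1)], joint[(x, 0)]
    return 1 if p1 / (p1 + p0) >= 0.5 else 0

print(bayes_predictor("soft"), bayes_predictor("firm"))  # 1 0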

The Scope of Learning Problems Modeled:


We next extend our model so that it can be applied to a wide variety of learning tasks. Let us consider
some examples of different learning tasks.

• Multiclass Classification Our classification does not have to be binary. Take, for example,
the task of document classification: We wish to design a program that will be able to classify given
documents according to topics (e.g., news, sports, biology, medicine). A learning algorithm for such a
task will have access to examples of correctly classified documents and, on the basis of these
examples, should output a program that can take as input a new document and output a topic
classification for that document. Here, the domain set is the set of all potential documents. Once again,
we would usually represent documents by a set of features that could include counts of different key
words in the document, as well as other possibly relevant features like the size of the document or its
origin. The label set in this task will be the set of possible document topics (so Y will be some large
finite set). Once we determine our domain and label sets, the other components of our framework look
exactly the same as in the papaya tasting example; Our training sample will be a finite sequence of
(feature vector, label) pairs, the learner’s output will be a function from the domain set to the label set,
and, finally, for our measure of success, we can use the probability, over (document, topic) pairs, of
the event that our predictor suggests a wrong label.

• Regression In this task, one wishes to find some simple pattern in the data – a functional
relationship between the X and Y components of the data. For example, one wishes to find a linear
function that best predicts a baby’s birth weight on the basis of ultrasound measures of his head
circumference, abdominal circumference, and femur length. Here, our domain set X is some subset of
R³ (the three ultrasound measurements), and the set of “labels,” Y, is the set of real numbers (the
weight in grams). In this context, it is more adequate to call Y the target set. Our training data as well
as the learner’s output are as before (a finite sequence of (x, y) pairs, and a function from X to Y
respectively). However, our measure of success is different. We may evaluate the quality of a
hypothesis function, h : X → Y, by the expected square difference between the true labels and their
predicted values, namely,

LD(h) := E((x,y)∼D)[(h(x) − y)²].
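A sketch of this regression setting, fitting a linear predictor by least squares to a few made-up (head circumference, abdominal circumference, femur length) → birth-weight examples and reporting the empirical squared loss on the sample; the numbers are invented for illustration:

import numpy as np

# Made-up ultrasound measurements (cm) and birth weights (g).
X = np.array([[33.0, 31.0, 7.0], [34.5, 33.0, 7.4], [32.0, 30.0, 6.8],
              [35.0, 34.0, 7.6], [33.5, 32.0, 7.1]])
y = np.array([3200.0, 3550.0, 3000.0, 3700.0, 3350.0])

# Least-squares fit of a linear predictor h(x) = w . (x, 1).
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Empirical squared loss (1/m) * sum_i (h(x_i) - y_i)^2 on the training sample.
print(np.mean((A @ w - y) ** 2))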


Uniform Convergence Is Sufficient for Learnability:
Upon receiving a training sample, S, the learner evaluates the risk (or error) of each h in H on the
given sample and outputs a member of H that minimizes this empirical risk. The hope is that an h that
minimizes the empirical risk with respect to S is a risk minimizer (or has risk close to the minimum)
with respect to the true data probability distribution as well. For that, it suffices to ensure that the
empirical risks of all members of H are good approximations of their true risk. Put another way, we
need that uniformly over all hypotheses in the hypothesis class, the empirical risk will be close to the
true risk, as formalized in the following.

A training set S is called ε-representative (w.r.t. domain Z, hypothesis class H, loss function ℓ, and distribution D) if

∀h ∈ H, |LS(h) − LD(h)| ≤ ε.

The next simple lemma states that whenever the sample is (ε/2)-representative, the ERM learning rule is guaranteed to return a good hypothesis.

Assume that a training set S is (ε/2)-representative (w.r.t. domain Z, hypothesis class H, loss function ℓ, and distribution D). Then, any output of ERMH(S), namely, any hS ∈ argmin(h∈H) LS(h), satisfies

LD(hS) ≤ min(h∈H) LD(h) + ε.
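The lemma follows from a short chain of inequalities, using only the (ε/2)-representativeness of S and the fact that hS minimizes LS over H: for every h ∈ H,

LD(hS) ≤ LS(hS) + ε/2 ≤ LS(h) + ε/2 ≤ LD(h) + ε/2 + ε/2 = LD(h) + ε,

and taking the minimum over h ∈ H on the right-hand side gives the claim.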

Finite Classes Are Agnostic PAC Learnable:


In view of the preceding lemma, the claim that every finite hypothesis class is agnostic PAC learnable will follow once we establish that uniform convergence holds for any finite hypothesis class.

To show that uniform convergence holds we follow a two-step argument. The first step applies the union bound while the second step employs a measure concentration inequality. We now explain these two steps in detail.

Fix some ε, δ. We need to find a sample size m that guarantees that for any D, with probability of at least 1 − δ over the choice of S = (z1, . . . , zm) sampled i.i.d. from D, we have that for all h ∈ H, |LS(h) − LD(h)| ≤ ε. That is,

D^m({S : ∀h ∈ H, |LS(h) − LD(h)| ≤ ε}) ≥ 1 − δ.

Equivalently, we need to show that

D^m({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) < δ.

Writing

{S : ∃h ∈ H, |LS(h) − LD(h)| > ε} = ∪(h∈H) {S : |LS(h) − LD(h)| > ε},

and applying the union bound, we obtain

D^m({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) ≤ Σ(h∈H) D^m({S : |LS(h) − LD(h)| > ε}).

Our second step is to argue that each summand on the right-hand side of this inequality is small enough (for a sufficiently large m). That is, we show that for any fixed hypothesis h (chosen in advance, prior to the sampling of the training set), the gap between the true and empirical risks, |LS(h) − LD(h)|, is likely to be small. Recall that LD(h) = E(z∼D)[ℓ(h, z)] and that LS(h) = (1/m) Σ(i=1..m) ℓ(h, zi). Since each zi is sampled i.i.d. from D, the expected value of the random variable ℓ(h, zi) is LD(h). By the linearity of expectation, it follows that LD(h) is also the expected value of LS(h). Hence, the quantity |LD(h) − LS(h)| is the deviation of the random variable LS(h) from its expectation. We therefore need to show that the distribution of LS(h) is concentrated around its expected value. A basic statistical fact, the law of large numbers, states that when m goes to infinity, empirical averages converge to their true expectation. This is true for LS(h), since it is the empirical average of m i.i.d. random variables. However, since the law of large numbers is only an asymptotic result, it provides no information about the gap between the empirically estimated error and its true value for any given, finite, sample size. Instead, we use a measure concentration inequality due to Hoeffding, which quantifies the gap between empirical averages and their expected value.

Lemma (Hoeffding’s Inequality): Let θ1, . . . , θm be a sequence of i.i.d. random variables and assume that for all i, E[θi] = µ and P[a ≤ θi ≤ b] = 1. Then, for any ε > 0,

P[ |(1/m) Σ(i=1..m) θi − µ| > ε ] ≤ 2 exp(−2mε² / (b − a)²).
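To complete the argument sketched above: assuming the loss ℓ takes values in [0, 1] (as the 0–1 loss does), applying Hoeffding’s inequality with θi = ℓ(h, zi), a = 0, b = 1 gives, for any fixed h,

D^m({S : |LS(h) − LD(h)| > ε}) ≤ 2 exp(−2mε²),

so by the union bound above,

D^m({S : ∃h ∈ H, |LS(h) − LD(h)| > ε}) ≤ 2|H| exp(−2mε²),

which is at most δ whenever m ≥ log(2|H|/δ) / (2ε²). Hence every finite hypothesis class has the uniform convergence property, and is therefore agnostic PAC learnable, with a sample complexity of roughly this order.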

The Bias-Complexity Tradeoff:


Bias is the difference between the values predicted by the ML model and the correct values. High bias produces large error on both the training and the test data, so an algorithm should not be too strongly biased if underfitting is to be avoided. With high bias, the predicted relationship is overly simple (for example, a straight line) and does not fit the data set well; such a fit is known as underfitting. This happens when the hypothesis is too simple or linear in nature.

This discussion has two parts:

1. The No-Free-Lunch Theorem

2. Error Decomposition

The No-Free-Lunch Theorem:


The No Free Lunch Theorem is often used in optimization and machine learning, with little
comprehension of what it means or implies.
The theory asserts that when the performance of all optimization methods is averaged across all
conceivable problems, they all perform equally well. It indicates that no one optimum optimization
algorithm exists. Because of the strong link between optimization, search, and machine learning,
there is no one optimum machine learning method for predictive modelling tasks like classification
and regression.
They all agree on one point: there is no single “best” algorithm, since all algorithms perform similarly on average. Mathematically, the computational cost of finding a solution is the same for any solution technique when averaged across all problems in the class. As a result, no solution technique provides a shortcut.
There are two No Free Lunch (NFL) theorems in general: one for machine learning and one for
search and optimization. These two theorems are connected and are frequently combined into a
single general postulate (the folklore theorem).
Although many other scholars have contributed to the collective writings on the No Free Lunch
theorems, David Wolpert is the most well-known name connected with these studies.
Surprisingly, the concept that may have inspired the NFL theorem was first offered by an eighteenth-century philosopher – not a mathematician or a statistician.

According to the “No Free Lunch” theorem, all optimization methods perform equally well when
averaged over all optimization tasks without re-sampling. This fundamental theoretical notion has
had the greatest impact on optimization, search, and supervised learning. The first theorem, No Free
Lunch, was rapidly formulated, resulting in a series of research works, which defined a whole field
of study with meaningful outcomes across different disciplines of science where the effective
exploration of a search region is a vital and crucial activity.
In general, the objective (utility) function is as important as the algorithm; an effective solution is created by matching the two. If no favorable conditions on the objective function are known, and one is just working with a black box, no guarantee can be made that any particular method outperforms a (pseudo)random search.
A framework has been developed to investigate the relationship between effective optimization algorithms and the problems they solve. A series of “no free lunch” (NFL) theorems establish that any improved performance over one class of problems is offset by degraded performance over another. These theorems also provide a geometric interpretation of what it means for an algorithm to be well matched to an optimization problem.
The NFL theorems are also applied to information-theoretic elements of optimization and
benchmark measurements of performance.
There is no such thing as a free lunch, since adding alternatives to a project incurs both direct and
opportunity expenses. As a result, incorporating actual alternatives may increase the original
development cost. Direct costs are the expenses of additional development effort required to include
certain flexibilities into the project’s architecture. Opportunity costs are the expenses of not being
able to do anything else (for example, add a feature) as a result of the time and effort spent on
generating that flexibility.

Error Decomposition:
To answer this question we decompose the error of an ERMH predictor into two components
as follows.
Let hS be an ERMH hypothesis.
Then, we can write
LD(hS) = εapp + εest, where εapp = min(h∈H) LD(h) and εest = LD(hS) − εapp.
• The Approximation Error – the minimum risk achievable by a predictor in the hypothesis class.
This term measures how much risk we have because we restrict ourselves to a specific class, namely,
how much inductive bias we have. The approximation error does not depend on the sample size and is
determined by the hypothesis class chosen. Enlarging the hypothesis class can decrease the
approximation error. Under the realizability assumption, the approximation error is zero. In the
agnostic case, however, the approximation error can be large.1

1 In fact, the approximation error always includes the error of the Bayes optimal predictor (defined earlier), the minimal yet inevitable error due to the possible nondeterminism of the world in this model. Sometimes in the literature the term approximation error refers not to min(h∈H) LD(h), but rather to the excess error over that of the Bayes optimal predictor, namely, min(h∈H) LD(h) − εBayes.

• The Estimation Error – the difference between the approximation error and the error achieved by
the ERM predictor. The estimation error results because the empirical risk (i.e., training error) is only
an estimate of the true risk, and so the predictor minimizing the empirical risk is only an estimate of
the predictor minimizing the true risk. The quality of this estimation depends on the training set size
and on the size, or complexity, of the hypothesis class. As shown earlier, for a finite hypothesis class, εest increases (logarithmically) with |H| and decreases with m. We can think of the size of H as a measure of its complexity. In future chapters we will define other complexity measures of hypothesis classes.

Since our goal is to minimize the total risk, we face a tradeoff, called the bias-complexity
tradeoff. On one hand, choosing H to be a very rich class decreases the approximation error but at the
same time might increase the estimation error, as a rich H might lead to overfitting. On the other hand,
choosing H to be a very small set reduces the estimation error but might increase the approximation
error or, in other words, might lead to underfitting. Of course, a great choice for H is the class that
contains only one classifier – the Bayes optimal classifier. But the Bayes optimal classifier depends on
the underlying distribution D, which we do not know (indeed, learning would have been unnecessary
had we known D). Learning theory studies how rich we can make H while still maintaining reasonable
estimation error. In many cases, empirical research focuses on designing good hypothesis classes for a
certain domain. Here, “good” means classes for which the approximation error would not be
excessively high. The idea is that although we are not experts and do not know how to construct the
optimal classifier, we still have some prior knowledge of the specific problem at hand, which enables
us to design hypothesis classes for which both the approximation error and the estimation error are not
too large. Getting back to our papayas example, we do not know how exactly the color and hardness of
a papaya predict its taste, but we do know that papaya is a fruit and on the basis of previous experience
with other fruit we conjecture that a rectangle in the color-hardness space may be a good predictor.
