ML Unit3
NOTES MATERIAL
UNIT 3
For
B. TECH (CSE)
3rd YEAR – 2nd SEM (R18)
RAVIKRISHNA B
DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI
IV B. Tech (CSE) MACHINE LEARNING
Bayesian Learning:
Probabilistic learning: Calculate explicit probabilities for hypotheses; this is among the most practical approaches to certain
types of learning problems.
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct.
Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal
decision making against which other methods can be measured
By Bayes’ theorem, the posterior probability of a class c given a predictor x is

P(c|x) = P(x|c) · P(c) / P(x)

Above,
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of the class.
P(x|c) is the likelihood, which is the probability of the predictor given the class.
P(x) is the prior probability of the predictor.
Let’s understand it using an example. Below is a training data set of weather and the corresponding target variable
‘Play’ (suggesting the possibility of playing). Now, we need to classify whether players will play or not based on the
weather condition.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using the above-discussed method of posterior probability.
After building a frequency table and a likelihood table from the training data, use the Naive Bayes equation to calculate
the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
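As a quick sketch, the posterior for the weather/‘Play’ example can be computed directly from counts. The counts below (14 days, 9 ‘Yes’, 5 ‘Sunny’, of which 3 are ‘Yes’) are the usual illustrative values and are an assumption, since the data table itself is not reproduced here:

```python
# Illustrative counts (assumed): 14 days total, 9 "Yes" days,
# Sunny on 5 days, 3 of which are "Yes" days.
p_yes = 9 / 14             # P(Yes): prior probability of playing
p_sunny_given_yes = 3 / 9  # P(Sunny | Yes): likelihood
p_sunny = 5 / 14           # P(Sunny): prior probability of the predictor

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6
```

Since P(Yes | Sunny) = 0.6 > 0.5, the class with the highest posterior is ‘Yes’, so the statement that players will play when it is sunny is supported by these counts.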
Naive Bayes is fast and often performs well compared to other algorithms. As a result, it is widely used in spam filtering
(identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer
sentiments).
Recommendation System: The Naive Bayes classifier and collaborative filtering together build a recommendation
system that uses machine learning and data mining techniques to filter unseen information and predict whether
a user would like a given resource or not.
Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the
probability that h will misclassify an instance drawn at random according to D:

error_D(h) ≡ Pr_{x ∈ D} [ c(x) ≠ h(x) ]

Here the notation Pr_{x ∈ D} indicates that the probability is taken over the instance distribution D.
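Because D is unknown in practice, the true error can only be estimated. The following sketch estimates error_D(h) by Monte Carlo sampling; the concept c, hypothesis h, and distribution D used here are hypothetical stand-ins for illustration:

```python
import random

def c(x):
    # Target concept (assumed): positive iff x >= 0.5
    return x >= 0.5

def h(x):
    # Learned hypothesis (assumed, imperfect): positive iff x >= 0.6
    return x >= 0.6

def estimate_true_error(c, h, draw, n=100_000):
    """Estimate Pr_{x~D}[c(x) != h(x)] by sampling n instances from D."""
    return sum(c(x) != h(x) for x in (draw() for _ in range(n))) / n

random.seed(0)
err = estimate_true_error(c, h, random.random)  # D = uniform on [0, 1)
print(err)  # close to 0.1, since h and c disagree exactly on [0.5, 0.6)
```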
Note there are many specific settings in which we could pursue such questions. For example, there are various ways
to specify what it means for the learner to be "successful." We might specify that to succeed, the learner must output
a hypothesis identical to the target concept. Alternatively, we might simply require
that it output a hypothesis that agrees with the target concept most of the time, or that it usually output such a
hypothesis. Similarly, we must specify how training examples are to be obtained by the learner. We might specify that
training examples are presented by a helpful teacher, or obtained by the learner performing experiments, or simply
generated at random according to some process outside the learner's control. As we might expect, the answers to the
above questions depend on the particular setting, or learning model, we have in mind.
We assume instances are generated at random from X according to some probability distribution D. For example, D
might be the distribution of instances generated by observing people who walk out of the largest sports store in
Switzerland.
In general, D may be any distribution, and it will not generally be known to the learner. All that we require of D is that
it be stationary; that is, that the distribution not change over time. Training examples are generated by drawing an
instance x at random according to D, then presenting x along with its target value, c(x), to the learner.
The learner L considers some set H of possible hypotheses when attempting to learn the target concept. For example,
H might be the set of all hypotheses describable by conjunctions of the attributes age and height. After observing a
sequence of training examples of the target concept c, L must output some hypothesis
h from H, which is its estimate of c. To be fair, we evaluate the success of L by the performance of h over new instances
drawn randomly from X according to D, the same probability distribution used to generate the training data.
Within this setting, we are interested in characterizing the performance of various learners L using various hypothesis
spaces H, when learning individual target concepts drawn from various classes C. Because we demand that L be general
enough to learn any target concept from C regardless of the distribution of training examples, we will often be interested
in worst-case analyses over all possible target concepts from C and all possible instance distributions D.
The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability
that h will misclassify an instance drawn at random according to D.
Here the notation Pr_{x ∈ D} indicates that the probability is taken over the instance distribution D.
Figure shows this definition of error in graphical form. The concepts c and h are depicted by the sets of instances within
X that they label as positive. The error of h with respect to c is the probability that a randomly drawn instance will fall
into the region where h and c disagree (i.e., their set difference). Note we have chosen to define error over the entire
distribution of instances-not simply over the training examples-because this is the true error we expect to encounter
when actually using the learned hypothesis h on subsequent instances drawn from D. Note also that the error depends
strongly on the unknown probability distribution D.
Eager Learning:
The main advantage gained in employing an eager learning method, such as an artificial neural network, is
that the target function will be approximated globally during training, thus requiring much less space than
using a lazy learning system.
Eager learning systems also deal much better with noise in the training data.
Eager learning is an example of offline learning, in which post-training queries to the system have no effect on
the system itself, and thus the same query to the system will always produce the same result.
Instance based learning includes nearest Neighbour and locally weighted regression methods that assume
instances can be represented as points in a Euclidean space. It also includes case-based reasoning methods
that use more complex, symbolic representations for instances.
Instance-based methods are also referred to as "lazy" learning methods because they delay processing
until a new instance must be classified. A lazy learner can create many local approximations.
A key advantage of this kind of delayed, or lazy, learning is that instead of estimating the target function once
for the entire instance space, these methods can estimate it locally and differently for each new instance to be
classified.
The KNN algorithm fares well across all parameters of consideration. It is commonly used for its ease of interpretation and low
calculation time.
You intend to find out the class of the blue star (BS). BS can either be RC
or GS and nothing else. The “K” in the KNN algorithm is the number of nearest neighbours
we wish to take a vote from. Let’s say K = 3. Hence, we will now draw a
circle with BS as its centre, just big enough to enclose only three data points on
the plane. Refer to the following diagram for more details:
The most basic instance-based method is the k-NEAREST NEIGHBOR algorithm. The main idea behind k-NN
learning is so-called majority voting. This algorithm assumes all instances correspond to points in the n-dimensional
space ℝⁿ. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. More precisely,
let an arbitrary instance x be described by the feature vector

⟨a1(x), a2(x), a3(x), ..., an(x)⟩

where ar(x) denotes the value of the rth attribute of instance x. Then the distance between two instances xi and xj is
defined to be d(xi, xj), where

d(xi, xj) = sqrt( Σ_{r=1}^{n} (ar(xi) − ar(xj))² )
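The Euclidean distance d(xi, xj) between two feature vectors can be sketched as a minimal helper function, not tied to any particular library:

```python
import math

def euclidean_distance(xi, xj):
    """d(xi, xj) = sqrt(sum over r of (ar(xi) - ar(xj))^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0 (the classic 3-4-5 triangle)
```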
In nearest-neighbour learning the target function may be either discrete-valued or real-valued.
Let us first consider learning discrete-valued target functions of the form f : ℝⁿ → V, where V is the finite set
V = {v1, v2, ..., vm}.
The k-NEAREST NEIGHBOUR algorithm for approximation of a discrete-valued target function is given below.
As shown there, the value f̂(xq) returned by this algorithm as its estimate of f(xq) is just the most common value
of f among the k training examples nearest to xq. If we choose k = 1, then the 1-NEAREST NEIGHBOUR algorithm
assigns to f̂(xq) the value f(xi), where xi is the training instance nearest to xq. For larger values of k, the
algorithm assigns the most common value among the k nearest training examples.
Training algorithm:
For each training example ⟨x, f(x)⟩, add the example to the list training_examples.
Classification algorithm:
Given a query instance xq to be classified,
Let x1, x2, x3, ..., xk denote the k instances from training_examples that are nearest to xq
Return

f̂(xq) ← argmax_{v ∈ V} Σ_{i=1}^{k} δ(v, f(xi))

where δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.
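The training and classification steps above can be sketched as follows. The labelled points are illustrative assumptions (the labels ‘RC’ and ‘GS’ echo the earlier blue-star example):

```python
import math
from collections import Counter

# "Training" in lazy learning is just storing the labelled examples.
training_examples = [((1.0, 1.0), 'RC'), ((1.2, 0.8), 'RC'),
                     ((3.0, 3.0), 'GS'), ((3.2, 2.9), 'GS'),
                     ((2.9, 3.1), 'GS')]  # illustrative (x, f(x)) pairs

def knn_classify(xq, examples, k=3):
    """Return the most common label among the k examples nearest to xq."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    nearest = sorted(examples, key=lambda ex: dist(ex[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((2.8, 2.8), training_examples))  # 'GS'
```

With k = 3, a query near the GS cluster is outvoted 3–0, while a query near the RC pair is classified RC by a 2–1 vote.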
The k-NEAREST NEIGHBOR algorithm is easily adapted to approximating continuous-valued target functions. To
accomplish this, we have the algorithm calculate the mean value of the k nearest training examples rather than
their most common value. More precisely, to approximate a real-valued target function f : ℝⁿ → ℝ, we replace the
final line of the above algorithm by the line

f̂(xq) ← ( Σ_{i=1}^{k} f(xi) ) / k
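A minimal sketch of this continuous-valued variant, using the same nearest-neighbour selection but averaging the k target values (the training pairs below are assumed for illustration):

```python
import math

def knn_regress(xq, examples, k=3):
    """Return the mean f-value of the k examples nearest to xq."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    nearest = sorted(examples, key=lambda ex: dist(ex[0], xq))[:k]
    return sum(fx for _, fx in nearest) / k

# Illustrative (x, f(x)) pairs sampled from f(x) = 2x
examples = [((0.0,), 0.0), ((1.0,), 2.0), ((2.0,), 4.0), ((10.0,), 20.0)]
print(knn_regress((1.5,), examples))  # mean of f at x = 1, 2, 0 -> 2.0
```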
VORONOI DIAGRAM:
The diagram on the right side shows the shape of this decision surface
induced by 1-NEAREST NEIGHBOR over the entire instance space.
The decision surface is a combination of convex polyhedra
surrounding each of the training examples.
For every training example, the polyhedron indicates the set of query
points whose classification will be completely determined by that
training example. Query points outside the polyhedron are closer to
some other training example. This kind of diagram is often called the
Voronoi diagram of the set of training examples.
The basis for classifying new query points is easily understood based on the algorithm specified above.
The inductive bias corresponds to an assumption that the classification of an instance xq will be most similar
to the classification of other instances that are nearby in Euclidean distance.
One practical issue in applying k-NEAREST NEIGHBOR Algorithm is that the distance between instances is
calculated based on all attributes of the instance (i.e., on all axes in the Euclidean space containing the
instances).
This lies in contrast to methods such as rule and decision tree learning systems that select only a subset of the
instance attributes when forming the hypothesis.
Case-Based reasoning (CBR), broadly construed, is the process of solving new problems based on the solutions
of similar past problems.
It is an approach to model the way humans think to build intelligent systems.
Case-based reasoning is a prominent kind of analogy making.
CBR: Uses a database of problem solutions to solve new problems.
Store symbolic description (tuples or cases)—not points in a Euclidean space
Applications: Customer-service (product-related diagnosis), legal ruling.
Case-Based Reasoning is a well-established research field that involves the investigation of theoretical
foundations, system development, and the building of practical applications for experience-based problem
solving, with the baseline of remembering past experience.
It can be classified as a sub-discipline of Artificial Intelligence.
The learning process is based on analogy, not on deduction or induction.
It is best classified as supervised learning (recall the distinction between supervised, unsupervised, and
reinforcement learning methods typically made in Machine Learning).
Learning happens in a memory-based manner.
Case – a previously made and stored experience item.
Case-Base – the core of every case-based problem solver: a collection of cases.
Everyday examples of CBR:
An auto mechanic who fixes an engine by recalling another car that exhibited similar symptoms
A lawyer who advocates a particular outcome in a trial based on legal precedents or a judge who creates case
law.
An engineer copying working elements of nature (practicing biomimicry), is treating nature as a database of
solutions to problems.
CBR is among the few commercially/industrially really successful AI methods:
Customer support, help-desk systems: diagnosis and therapy of customer‘s problems, medical diagnosis
Product recommendation and configuration: e-commerce
Textual CBR: text classification, judicial applications (in particular in countries where common law, not civil
law, is applied, such as the USA, UK, India, Australia, and many others)
It is also applicable in ill-structured and poorly understood application domains.
There are three main types of CBR that differ significantly from one another concerning case representation and
reasoning:
1. Structural
A case is represented according to a common structured vocabulary, e.g. attribute-value pairs; knowledge is
contained in the case structure.
2. Textual
A case is represented as free text; knowledge is contained in the documents.
3. Conversational
A case is represented through a list of questions that varies from one case to another; knowledge is
contained in customer/agent conversations.
CBR Cycle:
Despite the many different appearances of CBR systems, the essentials of CBR are captured in a surprisingly simple
and uniform process model.
• The CBR cycle was proposed by Aamodt and Plaza.
• The CBR cycle consists of 4 sequential steps around the knowledge of the CBR system.
• RETRIEVE
• REUSE
• REVISE
• RETAIN
RETRIEVE:
• One or several cases from the case base are selected, based on the modeled similarity.
• The retrieval task is defined as finding a small number of cases from the case-base with the highest similarity
to the query.
• This is a k-nearest-neighbor retrieval task considering a specific similarity function.
• When the case base grows, the efficiency of retrieval decreases; this motivates methods that improve retrieval
efficiency, e.g. specific index structures such as kd-trees, case-retrieval nets, or discrimination networks.
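The retrieval step can be sketched as a nearest-neighbour search under a similarity function. The case base, its attributes, and the simple exact-match similarity below are all hypothetical illustrations:

```python
# Hypothetical help-desk case base: each case pairs a problem description
# with a stored solution.
case_base = [
    {'symptom': 'no_boot',  'device': 'laptop',  'solution': 'reseat RAM'},
    {'symptom': 'no_boot',  'device': 'desktop', 'solution': 'check PSU'},
    {'symptom': 'overheat', 'device': 'laptop',  'solution': 'clean fan'},
]

def similarity(query, case):
    """Assumed similarity: fraction of query attributes matching exactly."""
    return sum(query[a] == case[a] for a in query) / len(query)

def retrieve(query, cases, k=1):
    """RETRIEVE: the k cases from the case base most similar to the query."""
    return sorted(cases, key=lambda c: similarity(query, c), reverse=True)[:k]

best = retrieve({'symptom': 'no_boot', 'device': 'laptop'}, case_base)[0]
print(best['solution'])  # 'reseat RAM'
```

A real system would use graded (not exact-match) local similarities per attribute and index structures for efficiency, as noted above.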
REUSE:
• Reusing a retrieved solution can be quite simple if the solution is returned unchanged as the proposed
solution for the new problem.
• Adaptation (if required, e.g. for synthetic tasks).
• Several techniques for adaptation in CBR
- Transformational adaptation
- Generative adaptation
• Most practical CBR applications today try to avoid extensive adaptation for pragmatic reasons.
REVISE:
• In this phase, feedback related to the solution constructed so far is obtained.
• This feedback can be given in the form of a correctness rating of the result or in the form of a manually
corrected revised case.
• The revised case or any other form of feedback enters the CBR system for its use in the subsequent retain
phase.
RETAIN:
• The retain phase is the learning phase of a CBR system (adding a revised case to the case base).
• Explicit competence models have been developed that enable the selective retention of cases (because of the
continuous increase of the case-base).
• The revised case or any other form of feedback enters the CBR system for its use in the subsequent retain
phase.