Unit 2 ML
Supervised Learning
Syllabus
• Supervised Learning:
• Learning a Class from Examples, Learning Multiple Classes, Regression, Model Selection and Generalization,
Dimensions of Supervised Machine Learning Algorithm,
• Non-Parametric Methods: Histogram Estimator, Kernel Estimator, K-Nearest Neighbor Estimator
Learning a class from examples
• Class learning is finding a description that is shared by all the positive examples (and none of the negative examples).
• For example, given a car that we have not seen before, we can check it against the learned description and say whether or not it is a family car.
Learning a Class from Examples
• Class C of a “family car”
– Prediction: Is car x a family car?
– Knowledge extraction: What do people expect from a family car?
• Output:
Positive (+) and negative (–) examples
• Input representation:
x1: price, x2 : engine power
Training set X

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}$$

$$r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is a positive example} \\ 0 & \text{if } \mathbf{x} \text{ is a negative example} \end{cases}$$

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$
Class C
Hypothesis class H
$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ classifies } \mathbf{x} \text{ as positive} \\ 0 & \text{if } h \text{ classifies } \mathbf{x} \text{ as negative} \end{cases}$$
Error of h on X

$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\!\left(h(\mathbf{x}^t) \neq r^t\right)$$
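As a quick illustration, here is a minimal Python sketch of this empirical error; the data and the rectangle-shaped hypothesis are hypothetical, not from the slides:

```python
import numpy as np

# Hypothetical training set: columns are (price, engine power); r^t labels
X = np.array([[12.0, 80.0], [15.0, 95.0], [30.0, 200.0], [8.0, 60.0]])
r = np.array([1, 1, 0, 0])

def h(x, p1=10, p2=20, e1=70, e2=120):
    """A rectangle hypothesis: positive iff price and power fall in the box."""
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

# E(h | X): count the training examples that h misclassifies
E = sum(h(x) != rt for x, rt in zip(X, r))
print(E)  # 0 on this toy data
```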
Learning Multiple Classes
• In the example discussed earlier, the positive examples belong to the family car class and the negative examples belong to all other cars.
• This is a two-class problem.
• In the general case, we have K classes denoted by Ci, i = 1, ..., K, and an input instance belongs to one and exactly one of them.
• The next slide shows an example with instances from 3 classes: family car, sports car, and luxury sedan.
Multiple Classes, $C_i$, $i = 1, \ldots, K$

With K classes and N training instances:

$$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N}$$

$$r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\; j \neq i \end{cases}$$

Train K hypotheses $h_i(\mathbf{x})$, $i = 1, \ldots, K$:

$$h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\; j \neq i \end{cases}$$
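A minimal sketch of this one-per-class (one-vs-rest) label encoding, with hypothetical class indices:

```python
import numpy as np

# Hypothetical labels for N=5 instances, K=3 classes
# (0: family car, 1: sports car, 2: luxury sedan)
labels = np.array([0, 2, 1, 0, 2])
K = 3

# r[t, i] = 1 if instance t belongs to class C_i, else 0 (one-hot rows)
r = np.eye(K, dtype=int)[labels]
print(r)
```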
Regression
• Regression models are used to predict a continuous value.
• Model Selection is finding the optimal model which minimizes both bias and
variance.
• Bias is the error during training and Variance is the error during testing
Inductive bias
• The set of assumptions we make in order to make learning possible is called the inductive bias of the learning algorithm.
• Model selection is also about choosing the right inductive bias.
Generalization
• Generalization is how well a model trained on the training set predicts the right output for new instances.
• The model selected should have the best generalization. It should avoid two problems:
– Underfitting
– Overfitting
Underfitting, overfitting, and the just-right model
• In underfitting, too simple a model is selected, and hence the bias (training error) is very high.
• In overfitting, the model follows the training samples too closely, and the variance (testing error) is very high.
• A just-right model stays close to all the data/samples, so the bias is minimal and the variance is also low: this is called generalization.
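A minimal sketch of this trade-off, using hypothetical one-dimensional data and polynomial models of increasing complexity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)  # noisy training samples
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)                     # underlying function

for degree in (1, 3, 9):  # underfit, just right, overfit
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```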
Dimensions of Supervised Machine Learning
Algorithm
Consider that we have N samples and K classes.
Non-Parametric Estimation
• Statistical Estimation:
– The population is the entire data, and a smaller set of data drawn from the population is called a sample (e.g., election result analysis via exit polls).
– Since it is usually very difficult to estimate over the entire population, estimation is performed on a sample.
There are many ways to measure the distance between two instances.
Distance or similarity measures are essential in solving many pattern recognition problems such as classification and clustering. Various distance/similarity measures are available in the literature to compare two data distributions.
As the name suggests, a similarity measure quantifies how close two distributions are.
For algorithms like the k-nearest neighbor and k-means, it is essential to measure the
distance between the data points.
• In KNN we calculate the distance between points to find the nearest neighbor.
• In K-Means we find the distance between points to group data points into clusters based
on similarity.
• It is vital to choose the right distance measure as it impacts the results of our algorithm.
Euclidean Distance
• We are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating-point or integer values.
• If columns have values with differing scales, it is common to normalize or standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

• where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.
• Euclidean distance is also known as the L2 norm (of the difference vector).
Compute the Euclidean distance between the following data sets (an application of the Pythagorean theorem):
• D1 = [10, 20, 15, 10, 5]
• D2 = [12, 24, 18, 8, 7]

dist = √((10−12)² + (20−24)² + (15−18)² + (10−8)² + (5−7)²) = √(4 + 16 + 9 + 4 + 4) = √37 ≈ 6.08
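A minimal Python check of this computation:

```python
import math

# Euclidean distance between the two example vectors
D1 = [10, 20, 15, 10, 5]
D2 = [12, 24, 18, 8, 7]

dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(D1, D2)))
print(dist)  # sqrt(37) ≈ 6.0828
```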
Manhattan distance:
Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, it is the total of the absolute differences between the x-coordinates and the y-coordinates.
Formula: in a plane with p1 at (x1, y1) and p2 at (x2, y2),
d(p1, p2) = |x1 − x2| + |y1 − y2|
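Reusing the vectors from the Euclidean example, a minimal sketch:

```python
# Manhattan (L1) distance between the same two example vectors
D1 = [10, 20, 15, 10, 5]
D2 = [12, 24, 18, 8, 7]

dist = sum(abs(p - q) for p, q in zip(D1, D2))
print(dist)  # 2 + 4 + 3 + 2 + 2 = 13
```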
The formula for calculating the cosine similarity is: cos(x, y) = (x · y) / (‖x‖ ‖y‖)
Example, with document vectors d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0) and d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2):
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
‖d1‖ = √(9 + 4 + 25 + 4) = 6.481, ‖d2‖ = √(1 + 1 + 4) = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
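A minimal Python check of this cosine computation:

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))
norm1 = math.sqrt(sum(a * a for a in d1))
norm2 = math.sqrt(sum(b * b for b in d2))
print(dot / (norm1 * norm2))  # ≈ 0.3150
```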
• Let's say you are in an e-commerce setting and you want to compare users for product recommendations:
• User 1 bought 1x eggs, 1x flour and 1x sugar.
• User 2 bought 100x eggs, 100x flour and 100x sugar.
• User 3 bought 1x eggs, 1x Vodka and 1x Red Bull.
• Cosine similarity looks only at the direction of the purchase vectors, so Users 1 and 2 come out as perfectly similar (similarity 1), while User 3, who bought different items, scores much lower.
A simple example using set notation: How similar are these two sets?
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
J(A,B) = |A ∩ B| / |A ∪ B| = |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}| = 3/9 ≈ 0.33
Jaccard similarity is given by the ratio of overlapping items to total items: J(A,B) = |A ∩ B| / |A ∪ B|.
• The Jaccard similarity value ranges from 0 to 1
• 1 indicates highest similarity
• 0 indicates no similarity
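A minimal check of the set example above:

```python
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

# Jaccard similarity: overlap over union
jaccard = len(A & B) / len(A | B)
print(jaccard)  # 3/9 ≈ 0.33
```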
Application of Jaccard Similarity
• Language processing is one example where Jaccard similarity is used.
Euclidean Distance

Example points:

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix (L2):

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
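A minimal sketch that recomputes this matrix:

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# Pairwise Euclidean distance matrix for the example points
for name, p in points.items():
    row = [round(math.dist(p, q), 3) for q in points.values()]
    print(name, row)
```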
Minkowski Distance

The Minkowski distance generalizes the previous metrics:

$$\mathrm{dist}(p, q) = \left(\sum_{k=1}^{n} |p_k - q_k|^r\right)^{1/r}$$

where r = 1 gives the Manhattan (L1) distance, r = 2 the Euclidean (L2) distance, and r → ∞ the supremum (L∞) distance. For the same example points (p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1)):

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0
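A minimal general Minkowski sketch that reproduces all three matrices:

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def minkowski(p, q, r):
    """Minkowski distance: r=1 Manhattan, r=2 Euclidean, r=inf supremum."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    return max(diffs) if math.isinf(r) else sum(d ** r for d in diffs) ** (1 / r)

for r in (1, 2, math.inf):
    print(f"L{r} matrix:")
    for name, p in points.items():
        print(name, [round(minkowski(p, q, r), 3) for q in points.values()])
```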
Summary of Distance Metrics
K-Nearest Neighbor (KNN) Classifier
To classify a test record, compute its distance to every training record.
• The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it makes no assumption about the underlying data.
• K-NN stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily assigned to a well-suited category using the K-NN algorithm.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
• At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
Illustrative Example for KNN
Data collected over the past few years (training data).
With K = 1, the single nearest neighbor decides the test data's class: it belongs to the class Africa.
With K = 3, two of the three nearest neighbors are close to North/South America, and hence the new data, or data under testing, belongs to that class.
In this case K = 3 may still not be the correct value for classification, hence a new value of K should be selected.
Algorithm
• Step 1: Select the number K of neighbors.
• Step 2: Calculate the Euclidean distance to all the data points in the training set.
• Step 3: Take the K nearest neighbors as per the calculated Euclidean distances.
• Step 4: Among these K neighbors, apply a voting scheme (count the neighbors in each category).
• Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
• Step 6: Our model is ready (a sketch of these steps follows below).
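A minimal Python sketch of these steps, using Euclidean distance and majority voting; the training data here anticipates the worked example that follows:

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((7, 7), "BAD"), ((7, 4), "BAD"), ((3, 4), "GOOD"), ((1, 4), "GOOD")]
print(knn_classify(train, (3, 7), k=3))  # GOOD (nearest: P3, P4, P1)
```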
Consider the following data set of a pharmaceutical company with assigned class labels. Using the K-nearest-neighbour method, classify the new unknown sample P5(3, 7) using k = 3 and k = 2.

Point  x  y  Class
P1     7  7  BAD
P2     7  4  BAD
P3     3  4  GOOD
P4     1  4  GOOD

Euclidean distance of P5(3, 7) from:
P1: √((7−3)² + (7−7)²) = √16 = 4
P2: √((7−3)² + (4−7)²) = √25 = 5
P3: √((3−3)² + (4−7)²) = √9 = 3
P4: √((1−3)² + (4−7)²) = √13 ≈ 3.60

With k = 3, the nearest neighbours are P3 (GOOD), P4 (GOOD) and P1 (BAD), so P5 is classified as GOOD. With k = 2, the nearest neighbours are P3 and P4, both GOOD, so P5 is again classified as GOOD.
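A self-contained Python check of these distances and rankings:

```python
import math

train = [("P1", (7, 7), "BAD"), ("P2", (7, 4), "BAD"),
         ("P3", (3, 4), "GOOD"), ("P4", (1, 4), "GOOD")]
p5 = (3, 7)

# Rank the training points by Euclidean distance to P5
for name, pt, label in sorted(train, key=lambda t: math.dist(t[1], p5)):
    print(name, label, round(math.dist(pt, p5), 2))
# P3 GOOD 3.0, P4 GOOD 3.61, P1 BAD 4.0, P2 BAD 5.0
```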