Unit II
SUPERVISED LEARNING: Introduction, Linear methods for classification, Linear methods for
regression, Support Vector Machine, SVM- the dual formulation, SVM- the maximum margin
with noise, Decision trees, overfitting
Supervised learning is a type of machine learning in which machines are trained using well-labelled training data and, on the basis of that data, predict the output. Labelled data means that the input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the machine
learning model. The aim of a supervised learning algorithm is to find a mapping function to map the
input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification, Fraud
Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on test data (a held-out portion of the dataset, kept separate from the training examples), and then it predicts the output.
The working of supervised learning can be easily understood from the above Figure 1.
Suppose we have a dataset of different types of shapes, which includes square, rectangle, triangle, and
Polygon. Now the first step is that we need to train the model for each shape.
If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
If the given shape has three sides, then it will be labelled as a triangle.
If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the
shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.
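As a small illustrative sketch of this workflow in Python (the feature encoding, the training rows and the choice of classifier here are all assumptions made for illustration, not part of the original example):

from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding of each shape: [number_of_sides, all_sides_equal (1 = yes, 0 = no)]
X_train = [[4, 1], [4, 0], [3, 1], [3, 0], [6, 1]]
y_train = ["Square", "Rectangle", "Triangle", "Triangle", "Hexagon"]

model = DecisionTreeClassifier().fit(X_train, y_train)   # training phase
print(model.predict([[4, 1]]))                           # testing phase, expected ['Square']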
1.3.1 Classification
Classification algorithms are used when the output variable is categorical, meaning the output falls into discrete classes such as Yes-No, Male-Female, True-False, etc. Below are some popular classification algorithms which come under supervised learning:
➢ Logistic Regression
➢ Decision Trees
➢ Support vector Machines
➢ Random Forest
1.3.2 Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised learning:
➢ Linear Regression
➢ Bayesian Linear Regression
➢ Regression Trees
➢ Non-Linear Regression
➢ Polynomial Regression
Advantages of Supervised Learning:
➢ With the help of supervised learning, the model can predict the output on the basis of prior experience.
➢ In supervised learning, we can have an exact idea about the classes of objects.
➢ Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
Disadvantages of Supervised Learning:
➢ Supervised learning models are not suitable for handling complex tasks.
➢ Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
➢ Training requires a lot of computation time.
➢ In supervised learning, we need enough knowledge about the classes of objects.
1.3.5 Classification Terminologies in Machine Learning
Classifier: It is an algorithm that is used to map the input data to a specific category.
Classification Model: The model draws conclusions from the input data given for training; it predicts the class or category for new data.
Feature: A feature is an individual measurable property of the phenomenon being observed.
Binary Classification: It is a type of classification with two outcomes, for eg – either true or false.
Multi-Class Classification: The classification with more than two classes, in multi-class classification
each sample is assigned to one and only one label or target.
Multi-label Classification: This is a type of classification where each sample is assigned to a set of
labels or targets.
Train the Classifier: Each classifier in scikit-learn uses the fit(X, y) method to fit the model to the training data X and the training labels y.
Predict the Target: For an unlabelled observation X, the predict(X) method returns the predicted label y.
Evaluate: This basically means the evaluation of the model i.e., classification report, accuracy score,
etc.
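These terminologies map directly onto code. A minimal sketch using scikit-learn (the iris dataset and logistic regression are chosen here purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X, y = load_iris(return_X_y=True)                       # feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)                 # the classifier
clf.fit(X_train, y_train)                               # train the classifier
y_pred = clf.predict(X_test)                            # predict the target
print(accuracy_score(y_test, y_pred))                   # evaluate: accuracy score
print(classification_report(y_test, y_pred))            # evaluate: classification report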
Classification algorithms can be further divided mainly into two categories:
Linear Models
➢ Logistic Regression
➢ Support Vector Machines
Non-linear Models
➢ K-Nearest Neighbours
➢ Kernel SVM
➢ Naïve Bayes
➢ Decision Tree Classification
➢ Random Forest Classification
Classification is the process of finding or discovering a model or function which helps in separating the data into multiple categorical classes, i.e. discrete values. In classification, data is categorized under different labels according to some input parameters, and then the labels are predicted for new data.
In machine learning, classification is a supervised learning concept which basically categorizes a set of
data into classes. The most common classification problems are – speech recognition, face detection,
handwriting recognition, document classification, etc. It can be either a binary classification problem
or a multi-class problem too. There are a bunch of machine learning algorithms for classification in
machine learning. Let us take a look at those classification algorithms in machine learning.
Figure 3: Binary Classification and Multiclass Classification
Types of Learners in Classification:
Lazy Learners: Lazy learners simply store the training data and wait until test data appears. The classification is done using the most related data in the stored training data. They take more prediction time compared to eager learners. Examples: k-nearest neighbour, case-based reasoning.
Eager Learners: Eager learners construct a classification model based on the given training data before
getting data for predictions. It must be able to commit to a single hypothesis that will work for the entire
space. Due to this, they take a lot of time in training and less time for a prediction. Ex – Decision Tree,
Naive Bayes, Artificial Neural Networks.
1.5 Linear Regression
In simple linear regression, the model fits a straight line, y = mx + b, to the data. By achieving the best-fit regression line, the model aims to predict the y value such that the error difference between the predicted value and the true value is minimum. So, it is very important to update the 'b' (intercept) and 'm' (slope) values to reach the values that minimize the error between the predicted y value (pred) and the true y value (y).
The cost function (J) of linear regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the true y value (y):
J = sqrt( (1/n) * Σ (predᵢ - yᵢ)² )
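A minimal gradient-descent sketch of these updates in Python (the toy data and learning rate below are assumptions for illustration only):

import numpy as np

# Toy data assumed for illustration: y is roughly 2x + 1 plus noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

m, b = 0.0, 0.0        # slope and intercept, initialised to zero
lr = 0.01              # learning rate (an assumption)
for _ in range(5000):
    pred = m * X + b
    error = pred - y
    # gradients of the mean squared error with respect to m and b
    m -= lr * (2 / len(X)) * np.dot(error, X)
    b -= lr * (2 / len(X)) * error.sum()

J = np.sqrt(np.mean((m * X + b - y) ** 2))   # RMSE cost after training
print(m, b, J)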
1.5.1 Multiple Linear Regression:
This same concept can be extended to cases where there are more than two variables. This is called
Multiple Linear Regression. For instance, consider a scenario where you have to predict the price of
the house based upon its area, number of bedrooms, the average income of the people in the area, the
age of the house, and so on. In this case, the dependent variable (target variable) is dependent upon
several independent variables. A regression model involving multiple variables can be represented as:
y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn. This is the equation of a hyperplane.
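A brief sketch of the house-price scenario using scikit-learn's LinearRegression (the feature values and prices below are invented purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical houses: [area (sq. ft), bedrooms, average income (in thousands), age of house (years)]
X = np.array([[1100, 2, 55, 15],
              [1200, 2, 60, 10],
              [1500, 3, 65, 5],
              [1700, 3, 70, 8],
              [2000, 4, 80, 2]])
y = np.array([180000, 200000, 260000, 290000, 350000])   # invented prices

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)             # b0 and the coefficients b1..bn
print(model.predict([[1600, 3, 68, 6]]))         # price estimate for a new house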
1.5.2 Logistic Regression:
Logistic regression is a linear method, but the predictions are transformed using the logistic (sigmoid) function. The impact of this is that we can no longer interpret the predictions as a linear combination of the inputs as we can with linear regression. For example, continuing on from above, the model can be stated as:
p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))
or, equivalently, in terms of the log-odds:
ln( p(X) / (1 - p(X)) ) = b0 + b1*X
Logistic regression is an estimation of the logit function and the logit function is simply a log of odds
in favor of the event.
Example for Logistic Regression:
Let’s say we have a model that can predict whether a person is male or female based on their height.
Given a height of 150cm is the person male or female.
If we are modeling people’s sex as male or female from their height, then the first class could be male
and the logistic regression model could be written as the probability of male given a person’s height,
or more formally:
P(sex=male|height)
Written another way, we are modeling the probability that an input (X) belongs to the default class
(Y=1), we can write this formally as:
P(X) = P(Y=1|X)
Note that the probability prediction must be transformed into a binary value (0 or 1) in order to actually make a class prediction.
We have learned the coefficients of b0 = -100 and b1 = 0.6. Using the equation above we can calculate
the probability of male given a height of 150cm or more formally P(male|height=150). We will use
exp() for e, because that is what you can use if you type this example into your spreadsheet: y = e^(b0
+ b1*X) / (1 + e^(b0 + b1*X))
y = exp(-100 + 0.6*150) / (1 + exp(-100 + 0.6*150)) = 0.0000453978687
or a probability of near zero that the person is male.
In practice we can use the probabilities directly. Because this is classification and we want a crisp
answer, we can snap the probabilities to a binary class value, for example:
predict 0 (female) if p(male) < 0.5
predict 1 (male) if p(male) >= 0.5
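Putting the worked example together as a short Python sketch (only the coefficients b0 = -100 and b1 = 0.6 come from the text; the rest is illustrative):

import math

b0, b1 = -100, 0.6        # coefficients from the worked example
height = 150

p_male = math.exp(b0 + b1 * height) / (1 + math.exp(b0 + b1 * height))
label = 1 if p_male >= 0.5 else 0      # snap the probability to a class value
print(p_male, label)                   # ~0.0000454, class 0 (female)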
1.5.3 Naïve Bayes:
Example: Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each tuple classifies the conditions as fit ("Yes") or unfit ("No") for playing golf.
Outlook Temperature Humidity Windy Play Golf
Rainy Hot High False No
Rainy Hot High True No
Overcast Hot High False Yes
Sunny Mild High False Yes
Sunny Cool Normal False Yes
Sunny Cool Normal True No
Overcast Cool Normal True Yes
Rainy Mild High False No
Rainy Cool Normal False Yes
The dataset is divided into two parts, namely, the feature matrix and the response vector.
The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values of the independent features. In the above dataset, the features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix. In the above dataset, the class variable name is ‘Play golf’.
With relation to our dataset, this concept can be understood as:
We assume that no pair of features are dependent. For example, the temperature being ‘Hot’ has nothing
to do with the humidity or the outlook being ‘Rainy’ has no effect on the winds. Hence, the features
are assumed to be independent.
Secondly, each feature is given the same weight (or importance). For example, knowing the temperature and humidity alone cannot predict the outcome accurately. None of the attributes is irrelevant; each is assumed to contribute equally to the outcome.
So, using frequency tables, we calculate P(xi | yj) for each xi in X and yj in y manually. For example, the probability that the temperature is cool given that golf is played, i.e. P(temp. = cool | play golf = Yes), is 3/9.
Also, we need to find class probabilities (P(y)) which has been calculated in the table For example,
P(play golf = Yes) = 9/14.
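The same frequency-based reasoning can be sketched with scikit-learn's CategoricalNB (the ordinal encoding and the query row below are assumptions for illustration; only the nine data rows come from the table above):

from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# The nine rows shown above: (Outlook, Temperature, Humidity, Windy) -> Play Golf
X = [["Rainy", "Hot", "High", "False"], ["Rainy", "Hot", "High", "True"],
     ["Overcast", "Hot", "High", "False"], ["Sunny", "Mild", "High", "False"],
     ["Sunny", "Cool", "Normal", "False"], ["Sunny", "Cool", "Normal", "True"],
     ["Overcast", "Cool", "Normal", "True"], ["Rainy", "Mild", "High", "False"],
     ["Rainy", "Cool", "Normal", "False"]]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes"]

enc = OrdinalEncoder()
X_enc = enc.fit_transform(X).astype(int)      # categorical strings -> integer codes
clf = CategoricalNB().fit(X_enc, y)

query = enc.transform([["Sunny", "Cool", "High", "True"]]).astype(int)   # an assumed query day
print(clf.classes_, clf.predict_proba(query), clf.predict(query))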
1.5.4 K-Nearest Neighbor
It is a lazy learning algorithm that stores all the instances corresponding to the training data in n-dimensional space. It is called lazy because it does not focus on constructing a general internal model; instead, it simply stores the instances of the training data.
K nearest neighbors is a simple algorithm used for both classification and regression problems. It
basically stores all available cases and classifies a new case by a majority vote of its k nearest neighbours. The new case is assigned to the class that is most common amongst its K nearest neighbours, measured by a distance function. These distance functions can be the Euclidean, Manhattan, Minkowski or Hamming distance; the first three are used for continuous variables and the fourth one (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbour. At times, choosing K turns out to be a challenge while performing k-NN modelling.
As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to
category A.
Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is
supervised and takes a bunch of labelled points and uses them to label other points. To label a new
point, it looks at the labelled points closest to that new point also known as its nearest neighbors. It has
those neighbors vote, so whichever label the most of the neighbors have is the label for the new point.
The “k” is the number of neighbors it checks.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data point that are most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
Here all ‘cats’ belong to category A and all ‘dogs’ belong to category B; the ‘input value’ is the ‘New Data Point’.
Advantages of KNN Algorithm:
➢ It is simple to implement.
➢ It is robust to the noisy training data
➢ It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
➢ The value of K always needs to be determined, which may be complex at times.
➢ The computation cost is high because of calculating the distance between the data points for
all the training samples.
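A minimal KNN sketch in scikit-learn (the 2-D points and the value k = 3 below are assumptions chosen to mirror the category A / category B illustration):

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D points: category A ("cats") vs category B ("dogs")
X = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)
print(knn.predict([[2, 2]]))    # its 3 nearest neighbours are all from A -> ['A']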
1.6 Support Vector Machine (SVM)
A Support Vector Machine is a supervised learning algorithm that can be used for both classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that differentiates the two classes very well.
Identify the right hyper-plane:
The hyperplane that separates two classes in training data is
wx+b=0
It is also called the decision boundary (surface). There are many possible separating hyperplanes; which one should we choose? The distance between the hyperplane and the nearest data points of either class is called the margin. In the usual illustration with candidate hyperplanes A, B and C, the margin for hyper-plane C is high as compared to both A and B, so C is selected. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, then there is a high chance of mis-classification. Sometimes a single point, such as one star at the other end, behaves like an outlier for the star class. SVM has a feature to ignore such outliers and find the hyper-plane that has the maximum margin. Hence, we can say SVM is robust to outliers.
SVM looks for the separating hyperplane with the largest margin. SVM is basically a binary classifier,
although it can be modified for multi-class classification as well as regression. Unlike logistic regression and neural network models, SVMs explicitly try to maximize the separation between the two classes of points.
A support-vector machine constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space; this hyperplane separates the two classes in the best way possible.
Now to construct the OPTIMAL HYPERPLANE it takes the support of two other hyperplanes that are
parallel & equidistant from it on either side!
These two support hyperplanes lie on the most extreme points between the classes and are called
support-vectors.
Therefore, what we just need to do is to find the support hyperplanes (taken, for simplicity, to be w.x+b = +1 and w.x+b = -1) that have the maximum distance between them! From this, we can easily get the OPTIMAL HYPERPLANE.
This is simply called the Maximum Margin Hyperplane. The distance between the support hyperplanes
is called the Margin.
Hence, our goal is simply to find the Maximum Margin M. Using vector operations, we can find that (given the OPTIMAL HYPERPLANE (w.x+b=0)) the Margin is equal to:
M = 2 / ||w||
Observe here that our goal has been deduced to finding the optimal w from the equation w.x+b=0!
However, this also comes with a constraint: the points of a class must not lie between the two support hyperplanes! That can be mathematically represented as:
yᵢ(w.xᵢ + b) ≥ 1 for every training point (xᵢ, yᵢ), where yᵢ ∈ {+1, -1}
It means that there can be no points between the support hyperplanes, as shown in the figure, and this is known as the hard-margin formulation.
This is the big drawback of hard-margin SVMs: the two classes need to be fully separable, which is rarely the case in real-world datasets! This is where Soft Margin SVMs, which allow a few points to violate the margin in exchange for a penalty, come into play.
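A brief scikit-learn sketch of the soft-margin idea (the synthetic blobs and the two C values are assumptions for illustration; C is the penalty that trades margin width against violations):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so the classes are not perfectly separable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

# Small C tolerates more margin violations (softer margin);
# large C penalizes violations heavily and approaches hard-margin behaviour
soft = SVC(kernel="linear", C=0.1).fit(X, y)
hard_ish = SVC(kernel="linear", C=100).fit(X, y)
print(len(soft.support_), len(hard_ish.support_))   # number of support vectors in each case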
1.6.1 Deriving the Maximum Margin Equation
For deriving the objective function, we assume that the dataset is linearly separable. We know that the two support hyperplanes lie on the support vectors and there are no points between them. Hence, we can first consider the mid-hyperplane between these two support hyperplanes to be:
w.x + b = 0
And considering that the distance between this and the support hyperplanes is 1 (after scaling w and b), we get:
w.x + b = +1 and w.x + b = -1
Now let x₀ be a point on the support hyperplane w.x + b = -1 and z₀ the corresponding point on w.x + b = +1, where the vector k joining them is perpendicular to both hyperplanes (i.e. parallel to w) and has magnitude M. Writing z₀ = x₀ + M(w/||w||) and substituting into w.z₀ + b = +1 gives (w.x₀ + b) + M||w|| = +1, i.e. -1 + M||w|| = +1. Therefore,
M = 2 / ||w||
Maximizing the margin M is thus equivalent to minimizing
(1/2)||w||²   … (3)
subject to the constraint
yᵢ(w.xᵢ + b) ≥ 1 for all i   … (4)
To handle this constrained optimization, we form the Lagrangian by subtracting each constraint, weighted by a multiplier αᵢ ≥ 0, from the objective:
L(w, b, Λ) = (1/2)||w||² - Σᵢ αᵢ [ yᵢ(w.xᵢ + b) - 1 ]   … (5)
If you observe equation (5), it is just the objective function minus the weighted inequality constraints! The simple criterion αᵢ > 0 would suffice if the constraints were equality constraints. However, the above constraints are inequality constraints, so an additional set of conditions, called the Karush-Kuhn-Tucker (KKT) conditions, needs to be satisfied as well. We will go through those conditions later.
To find the optimum, just like before, we take the first-order derivatives and equate them to 0:
∂L/∂w = 0  ⟹  w = Σᵢ αᵢ yᵢ xᵢ
∂L/∂b = 0  ⟹  Σᵢ αᵢ yᵢ = 0
To find the optimal values, we can simply substitute these results back into (5), which gives:
W(Λ) = Σᵢ αᵢ - (1/2) Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ.xⱼ)
Here, W is called the objective functional, and it is a function of all the multipliers (α₁ … αₙ), represented collectively as Λ (capital Lambda).
This derivation was necessary to account for the inequality constraint [Eq. (4)]. Since it is now accounted for, the objective functional W is the new function that needs to be optimized instead of Eq. (3). This is called the DUAL FORMULATION because the initial objective function has been transformed: while Eq. (3) was minimized over w and b, W has to be maximized over the multipliers αᵢ (subject to αᵢ ≥ 0 and Σᵢ αᵢ yᵢ = 0).
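The dual view can be checked numerically with scikit-learn's SVC, whose dual_coef_ attribute stores the products αᵢyᵢ for the support vectors (the synthetic data below is an assumption for illustration):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=0)   # assumed toy data
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so the primal weight
# vector w = sum_i alpha_i * y_i * x_i can be recovered from the dual solution
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(w_from_dual)
print(clf.coef_)    # should match w_from_dual up to numerical precision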
1.7 Decision Trees
A decision tree is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
The tree is constructed in a top-down recursive divide and conquer approach. A decision node will have
two or more branches and a leaf represents a classification or decision. The topmost node in the decision
tree that corresponds to the best predictor is called the root node, and the best thing about a decision
tree is that it can handle both categorical and numerical data.
1.7.1 Decision Tree Terminologies:
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into 2 or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after getting
a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
Branch/Sub-Tree: A sub-tree formed by splitting a node of the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called its child nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, selected by an attribute selection measure, ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
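A short sketch of this example with scikit-learn's DecisionTreeClassifier (the feature encoding, the rows and the labels below are invented for illustration; only the three attributes come from the example):

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding: [salary (lakhs per annum), distance from office (km), cab facility (1/0)]
X = [[12, 5, 1], [12, 25, 0], [6, 5, 1], [15, 30, 1], [7, 20, 0], [14, 8, 0]]
y = ["Accept", "Decline", "Decline", "Accept", "Decline", "Accept"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["salary", "distance", "cab"]))   # the learned rules
print(tree.predict([[13, 10, 1]]))                                      # decision for a new offer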
1.8.2 Overfitting:
A statistical model is said to be overfitted when, instead of learning only the underlying pattern, it starts learning from the noise and inaccurate data entries in the training data set. The model then does not categorize new data correctly, because of too much detail and noise. Overfitting is more likely with non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to use parameters such as the maximal depth if we are using decision trees.
1.8.3 Techniques to Reduce Overfitting:
Cross-Validation: A standard way to estimate the out-of-sample prediction error is to use 5-fold cross-validation.
Early Stopping: Its rules provide guidance as to how many iterations can be run before the learner begins to over-fit.
Pruning: Pruning is used extensively while building tree-based models. It simply removes the nodes which add little predictive power for the problem at hand.
Regularization: It introduces a penalty term into the objective function for bringing in more features. Hence, it tries to push the coefficients of many variables towards zero and thus reduces model complexity.
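Two of these ideas, pruning (via a depth limit) and 5-fold cross-validation, can be sketched together in scikit-learn (the iris dataset and the depth values below are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree can memorise the training data; limiting max_depth acts like
# pruning, and 5-fold cross-validation estimates the out-of-sample error for each setting
for depth in [None, 3, 2]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)
    print(depth, scores.mean())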
1.8.4 Good Fit in a Statistical Model:
Ideally, the case when the model makes predictions with 0 error is said to have a good fit on the data. This situation is achievable at a spot between overfitting and underfitting. In order to understand it, we have to look at the performance of our model over time, while it is learning from the training dataset.
With the passage of time, our model will keep on learning, and thus the error for the model on the training and testing data will keep on decreasing. If it learns for too long, the model will become more prone to overfitting due to the presence of noise and less useful details, and the performance of our model on the testing data will decrease. In order to get a good fit, we stop at a point just before the testing error starts increasing. At this point, the model is said to have good skill on the training dataset as well as on our unseen testing dataset.
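This monitoring idea (watching training and test error as learning proceeds, and stopping before the test error turns upward) can be sketched as follows; the synthetic dataset, the incremental SGD classifier and the 30-epoch budget are all assumptions for illustration:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

# A noisy synthetic problem so that overfitting is possible
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SGDClassifier(random_state=0)
classes = np.unique(y_train)
for epoch in range(1, 31):
    clf.partial_fit(X_train, y_train, classes=classes)   # one more pass over the training data
    # watch both curves: stop at the point just before the test score starts to fall
    print(epoch, clf.score(X_train, y_train), clf.score(X_test, y_test))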