ML Module 2

The document outlines concepts related to machine learning, specifically focusing on regression techniques such as linear regression, multivariate regression, and logistic regression. It explains the relationships between independent and dependent variables, the use of performance metrics, and the application of these techniques in various fields like healthcare and demand forecasting. Additionally, it discusses methods for estimating coefficients, including Ordinary Least Squares and Gradient Descent.


Machine Learning (ML)

CSC 701

St. Francis Institute of Technology


Department of Computer Engineering
Unit 2 - LEARNING WITH REGRESSION AND TREES
2.1 Learning with Regression: Linear Regression, Multivariate Linear Regression, Logistic Regression.
2.2 Learning with Trees: Decision Trees, Constructing Decision Trees using Gini Index (Regression), Classification and Regression Trees (CART).
2.3 Performance Metrics: Confusion Matrix, [Kappa Statistics], Sensitivity, Specificity, Precision, Recall, F-measure, ROC curve.

Regression
• Is regression supervised or unsupervised?
• What is the basic requirement of supervised learning?
• What is regression?
• Regression is a supervised machine learning technique which is used to predict continuous values.

In regression we attempt to establish a relationship between one variable (the independent variable) and another variable (the dependent variable), then use that relationship to predict other values.
For example: Petrol required (X) = 5 litres; Kilometres driven (Y) = 50.
Relationship: Y = 10 * X
Prediction: For 10 litres of petrol, you could travel around 10 * 10 = 100 km.
Definition: The variables that influence the outcome are called input, independent, explanatory, or predictor variables.
In the previous example, the number of litres of petrol (X) is the variable that influences or controls the number of kilometres you may drive. Here, X is an independent variable.
Definition: The variable whose outcome (or value) depends on
other variables is called response, outcome, or dependent
variable.
In the previous example, the number of kilometres (Y) is a
dependent variable.
In practice you will not get exactly the same mileage every time: it also depends on traffic jams, driving speed, petrol quality, engine condition, time of day, weather conditions, road conditions, etc.
This is precisely why regression analysis aims to identify the complex relationship between variables and model it as closely as possible.

Which of the following is a regression task?
1. Predicting the age of a person
2. Predicting the nationality of a person
3. Predicting whether the stock price of a company will increase tomorrow
4. Predicting whether a document is related to sightings of UFOs (Unidentified Flying Objects)
(Only the first is a regression task: age is a continuous value. The other three predict categories, so they are classification tasks.)

Regression analysis techniques
1. Linear Regression
Definition: The process of finding a straight line that best fits a set of points on a graph is called linear regression.
The term "linear" indicates that the relationship you're trying to establish between the variables tends to be a straight line. Linear regression is one of the simplest and most popular techniques of regression analysis.

There are two types of linear regression:
• Simple Linear Regression (SLR): This involves only one independent (or input) variable. For example, the number of litres of petrol and kilometres driven.
• Multiple Linear Regression (MLR): This involves more than one independent (or input) variable. For example, the number of litres of petrol, age of the vehicle, speed, and kilometres driven.

Linear Regression Model Representation
• In a simple regression problem (a single x and a single y), the form of the model is:
• Y = B0 + B1*X, where
• B0 represents the intercept (the value of Y when X is 0)
• B1 represents the coefficient/slope (how much Y changes for a unit change in X)
• X represents the independent variable
• Y represents the output, or the dependent variable

• To calculate B0 and B1 for Y = B0 + B1*X, you can use the following formulas:

B1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• (sum of products of deviations divided by the sum of squared deviations of X)

B0 = ȳ − B1·x̄
• (mean of Y minus B1 times the mean of X)

• where x̄ is the mean of the x values and ȳ is the mean of the y values.
Question
Predict the value of Y for the given value of X = 20.

     Diameter (X) in inches    Price (Y) in dollars
1             8                        10
2            10                        13
3            12                        16
4            20                         ?

Solution

x̄ = (8 + 10 + 12) / 3 = 10, ȳ = (10 + 13 + 16) / 3 = 13
B1 = m = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 12 / 8 = 1.5
B0 = b = ȳ − B1·x̄ = 13 − 15 = −2
Y = 1.5 * 20 − 2 = 28
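To make the arithmetic concrete, here is a minimal Python sketch (not from the slides) that reproduces the calculation above using the deviation formulas:

```python
# Verifies: B1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²,  B0 = ȳ - B1·x̄
x = [8, 10, 12]     # diameter in inches
y = [10, 13, 16]    # price in dollars

x_mean = sum(x) / len(x)   # 10.0
y_mean = sum(y) / len(y)   # 13.0
b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
     / sum((xi - x_mean) ** 2 for xi in x)   # 12 / 8 = 1.5
b0 = y_mean - b1 * x_mean                    # 13 - 15 = -2.0

print(b0 + b1 * 20)   # predicted price for a 20-inch diameter: 28.0
```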
Linear Regression

The goal of linear regression is to find the regression line that best approximates (or fits) the relationship between the input variable(s) and the output variable. This involves calculating the values of b0 (the intercept) and b1 (the regression coefficient).
Linear Regression

SUBJECT    AGE (X)    GLUCOSE LEVEL (Y)
   1         43              99
   2         21              65
   3         25              79
   4         42              75
   5         57              87
   6         59              81
   7         55               ?
Linear Regression - Step I
Tabulate XY, X², and Y² and find the column totals:

SUBJECT    AGE (X)    GLUCOSE LEVEL (Y)      XY       X²       Y²
   1         43              99             4257     1849     9801
   2         21              65             1365      441     4225
   3         25              79             1975      625     6241
   4         42              75             3150     1764     5625
   5         57              87             4959     3249     7569
   6         59              81             4779     3481     6561
   Σ        247             486            20485    11409    40022


Linear Regression - Step II
Find b0 (the intercept) from the column totals in Step I:
b0 = (Σy · Σx² − Σx · Σxy) / (n · Σx² − (Σx)²)
   = (486 × 11409 − 247 × 20485) / (6 × 11409 − 247²)
   = 484979 / 7445 ≈ 65.14

Linear Regression - Step III
Find b1 (the slope) from the same totals:
b1 = (n · Σxy − Σx · Σy) / (n · Σx² − (Σx)²)
   = (6 × 20485 − 247 × 486) / 7445
   = 2868 / 7445 ≈ 0.385225

Linear Regression - Step IV
Insert the values into the equation:
y' = b0 + b1 * x
y' = 65.14 + (0.385225 * x)


Linear Regression - Step V

Prediction: the value of y for the given value of x = 55.
y' = 65.14 + (0.385225 * x)
y' = 65.14 + (0.385225 × 55)
y' = 86.327
So the predicted glucose level for subject 7 (age 55) is approximately 86.3.
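A small sketch (again not from the slides) that applies the summary-statistic formulas from Steps II and III to the glucose data and reproduces the prediction:

```python
# Summary statistics for the age/glucose table above.
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

n = len(ages)
sx, sy = sum(ages), sum(glucose)                  # 247, 486
sxy = sum(x * y for x, y in zip(ages, glucose))   # 20485
sx2 = sum(x * x for x in ages)                    # 11409

b0 = (sy * sx2 - sx * sxy) / (n * sx2 - sx ** 2)  # ~65.14
b1 = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)    # ~0.385225

print(b0 + b1 * 55)   # predicted glucose level at age 55: ~86.33
```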
Linear Regression Model Representation
● The general equation for a multiple linear regression with p independent variables looks like this:
Y = B0 + B1*X1 + B2*X2 + ... + Bp*Xp
Ordinary Least Squares

• When we have more than one input, we can use Ordinary Least Squares to estimate the values of the coefficients.
• The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals.

Ordinary Least Squares

• This means that, given a regression line through the data, we calculate the distance from each data point to the regression line, square it, and sum all of the squared errors together.
• This is the quantity that ordinary least squares seeks to minimize.

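As an illustration, here is a hedged sketch of OLS with more than one input, using NumPy's least-squares solver; the data below is invented purely for the example:

```python
import numpy as np

# Design matrix with a leading column of ones for the intercept B0.
X = np.array([[1, 2.0, 3.0],
              [1, 4.0, 1.0],
              [1, 6.0, 2.0],
              [1, 8.0, 5.0]])
y = np.array([10.0, 14.0, 20.0, 30.0])

# lstsq minimizes the sum of squared residuals ||y - X @ beta||^2.
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(beta)       # estimated coefficients [B0, B1, B2]
print(X @ beta)   # fitted values
```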
Gradient Descent

• When there are one or more inputs, you can optimize the values of the coefficients by iteratively minimizing the error of the model on your training data.
• This process is called Gradient Descent.
• It works by starting with random values for each coefficient, then repeatedly updating them in the direction that reduces the error.

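A minimal gradient-descent sketch for simple linear regression; the learning rate, iteration count, and data are illustrative assumptions, not values from the slides:

```python
import random

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]   # true relationship: y = 1 + 2x

b0, b1 = random.random(), random.random()   # start with random coefficients
lr = 0.05                                   # learning rate (assumed)

for _ in range(2000):
    # Gradients of MSE = mean((b0 + b1*x - y)^2) w.r.t. b0 and b1.
    errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
    g0 = 2 * sum(errors) / len(x)
    g1 = 2 * sum(e * xi for e, xi in zip(errors, x)) / len(x)
    b0 -= lr * g0   # step against the gradient
    b1 -= lr * g1

print(round(b0, 3), round(b1, 3))   # approaches (1.0, 2.0)
```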
Use Cases (or Applications) of Linear Regression
As you've learned from various examples, linear regression is
highly effective in establishing relationships between
independent and dependent variables to predict outcomes.
Therefore, linear regression is extensively used in various
business, medical, government, and everyday scenarios. Some
common applications of linear regression include:
Healthcare: Linear regression can establish relationships
between treatments and their effects, or uncover complex
operations within the human body to derive meaningful
insights. For instance, studying the impact of a drug or chemical
substance on reducing blood infection levels—where 1 mg of
substance reduces infection by 20%, and 3 mg reduces it by
50%.

Demand Forecasting: Businesses constantly aim to maximize
sales and optimize inventory. Sales often depend on multiple
factors, and understanding these relationships through linear
regression helps businesses adjust strategies such as promotional
schemes to forecast sales accurately.

Other Predictions: Linear regression can also be applied to predict outcomes in various domains such as sports results, crop yields, machinery performance, fitness levels, and similar areas.

Multiple Regression:
Multiple regression refers to a statistical technique where a
single dependent variable (outcome) is predicted based on
two or more independent variables (predictors). In essence, it
extends simple linear regression, which uses only one
independent variable, to multiple predictors.

Example: Predicting house prices based on variables like square footage, number of bedrooms, and location. Here, the house price (dependent variable) is predicted using multiple independent variables (square footage, bedrooms, location).

1.2 Multivariate Regression - Model Representation
Multiple regression involves one output variable with multiple input variables, whereas multivariate regression can have multiple output variables along with multiple input variables. It's important to distinguish between these terms correctly.

Multivariate Regression:
Multivariate regression deals with situations where there are multiple dependent variables, each predicted by the same set of independent variables. In other words, it simultaneously predicts multiple outcomes based on the same set of predictors.

Example: A doctor has collected data on cholesterol, blood
pressure, and weight.
She also collected data on the eating habits of the subjects
(e.g., how many ounces of red meat, fish, dairy products, and
chocolate consumed per week).
She wants to investigate the relationship between the three
measures of health and eating habits.

Dependent Variables: Cholesterol levels, blood pressure, weight (these are measures of health).

Independent Variables: Eating habits (e.g., ounces of red meat, fish, dairy products, chocolate consumed per week).

Multivariate regression allows you to examine these relationships simultaneously. It considers the interdependencies among the dependent variables (cholesterol, blood pressure, weight) while predicting their values based on the same set of independent variables (eating habits).

Logistic Regression
In linear regression, the independent and dependent variables are continuous (they can take any value) and exhibit a linear relationship that can be plotted as a straight line.

However, in some cases the outcome (dependent variable) is a binary event, such as yes/no, pass/fail, or approved/rejected.

For example, given the humidity level, it may either rain or not rain. Similarly, when applying for a loan or college admission, the application is either approved or rejected.

Definition: Logistic regression is a regression technique used to model and estimate the probability of an event occurring based on the values of independent variables.
Logistic Regression
• Logistic regression is a supervised learning technique.
• Logistic regression is a classification algorithm.
• It is used for predicting a categorical dependent variable from a given set of independent variables.
• It is intended for datasets that have numerical input variables and a categorical target variable with two values or classes.
• Logistic regression predicts the output of a categorical dependent variable; therefore the outcome must be a categorical or discrete value, such as Yes or No, 0 or 1, True or False.
• Instead of giving an exact value of 0 or 1, it gives probabilistic values which lie between 0 and 1.
Sigmoid Function

sigmoid(x) = 1 / (1 + e^(−x))

where 'e' is Euler's number (approximately 2.71828) and 'x' is the input variable.

Logistic Function (Sigmoid Function)

• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within the range 0 to 1. The value of the logistic regression must be between 0 and 1, and since it cannot go beyond this limit it forms an "S"-shaped curve. The S-form curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides between the classes 0 and 1: values above the threshold tend to 1, and values below the threshold tend to 0.

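A short sketch (assuming the standard sigmoid formula above and a common 0.5 threshold) showing how probabilities map to classes:

```python
import math

def sigmoid(x):
    """Map any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for z in (-3.0, 0.0, 3.0):
    p = sigmoid(z)
    label = 1 if p >= 0.5 else 0   # threshold rule described above
    print(f"z={z:+.1f}  p={p:.3f}  class={label}")
```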
Logistic Regression
Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
In logistic regression, instead of fitting a straight regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not.

Application of Logistic Regression
1.Finance:
In the finance sector, logistic regression can be used to predict
whether a loan application will be approved or rejected. Based
on input parameters like credit score, potential income, age,
and employment history, the model estimates the probability
of approval, aiding in decision-making processes.
2. Sports:
In sports analytics, logistic regression can help predict the
likelihood of a team winning or losing a match. By analyzing
various factors such as the team’s performance metrics,
weather conditions, pitch conditions, and players' past
records, you can estimate the probability of different
outcomes.

3. Maintenance:
Logistic Regression can be applied to build proactive
maintenance schedules. By considering factors such as
machine age, operational hours, and working conditions, you
can predict the likelihood of machinery breakdowns. This helps
in planning maintenance activities to minimize downtimes and
ensure continuous operations.
4.Classification or Categorization:
Logistic Regression can be used for classification tasks to
categorize entities or objects based on input variables. For
instance, you could classify mangoes as "export quality" or
"not export quality" based on their size, color, and weight.

What are the different types of logistic regression?

• The three types of logistic regression are:

• Binary logistic regression

• Multinomial logistic regression

• Ordinal logistic regression

Binary logistic regression

• Binary logistic regression is the statistical technique used to predict the relationship between the dependent variable (Y) and the independent variable (X), where the dependent variable is binary in nature. For example, the output can be Success/Failure, 0/1, True/False, or Yes/No.

Multinomial logistic regression

• Multinomial logistic regression is used when you have one categorical dependent variable with two or more unordered levels (i.e., two or more discrete outcomes). It is very similar to binary logistic regression, except that here you can have more than two possible outcomes.

• For example, let’s imagine that you want to predict what will be the most-used transportation type in the year 2030.
• The transport type will be the dependent variable, with possible outputs of train, bus, tram, and bike (for example).

• Ordinal logistic regression is used when the dependent variable is categorical with ordered levels. Examples of such variables might be t-shirt size (XS/S/M/L/XL), answers on an opinion poll (Agree/Disagree/Neutral), or scores on a test (Poor/Average/Good).

Advantages of Logistic Regression

1. Logistic regression is very easy to understand.
2. It requires less training.
3. It performs well for simple datasets, as well as when the dataset is linearly separable.
4. It doesn't make any assumptions about the distributions of classes in feature space.
5. It is easy to implement, easy to interpret, and very efficient to train.

Disadvantages of Logistic Regression

1. If the independent features are correlated with each other, this may affect the performance of the classifier.
2. It is quite sensitive to noise and overfitting.
3. Logistic regression should not be used if the number of observations is less than the number of features; otherwise, it may lead to overfitting.
4. Non-linear problems can't be solved with logistic regression because it has a linear decision surface, and in real-world scenarios linearly separable data is rarely found.
5. It is tough to capture complex relationships with logistic regression. More powerful and compact algorithms, such as neural networks, can easily outperform it.

Differences between Linear and Logistic Regression
• Linear regression is used to predict a continuous dependent variable from a given set of independent variables; logistic regression is used to predict a categorical dependent variable.
• Linear regression is used for solving regression problems; logistic regression is used for solving classification problems.
• In linear regression, we predict the value of continuous variables; in logistic regression, we predict the values of categorical variables.
• In linear regression, we find the best-fit line, by which we can easily predict the output; in logistic regression, we find the S-curve, by which we can classify the samples.
• Linear regression uses the least squares estimation method; logistic regression uses the maximum likelihood estimation method.
• The output of linear regression must be a continuous value, such as price or age; the output of logistic regression must be a categorical value such as 0 or 1, Yes or No.
• Linear regression requires that the relationship between the dependent and independent variables be linear; logistic regression does not require a linear relationship.
What is a Decision Tree?
• A decision tree is a flowchart-like structure used to
make decisions or predictions.
• It consists of nodes representing decisions or tests on
attributes, branches representing the outcome of
these decisions, and leaf nodes representing final
outcomes or predictions.
• Each internal node corresponds to a test on an
attribute, each branch corresponds to the result of the
test, and each leaf node corresponds to a class label or
a continuous value.

Structure of a Decision Tree

• Root Node: Represents the entire dataset and the initial decision to be made.
• Internal Nodes: Represent decisions or tests on attributes. Each internal node has one or more branches.
• Branches: Represent the outcome of a decision or test, leading to another node.
• Leaf Nodes: Represent the final decision or prediction. No further splits occur at these nodes.

How Decision Trees Work?

• The process of creating a decision tree involves:
• Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or information gain, the best attribute to split the data on is selected.
• Splitting the Dataset: The dataset is split into subsets based on the selected attribute.
• Repeating the Process: The process is repeated recursively for each subset, creating a new internal node or leaf node until a stopping criterion is met (e.g., all instances in a node belong to the same class, or a predefined depth is reached).

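This recursive procedure can be made concrete with a small hand-rolled sketch; this is illustrative, not the slides' code. Rows are (features, label) pairs, the weighted Gini impurity picks the split, and purity or a depth limit stops the recursion; the sample rows echo the placement example later in this unit:

```python
from collections import Counter

def gini(labels):
    """Gini impurity = 1 - sum(p_i^2) over class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, attributes):
    """Pick the attribute whose split gives the lowest weighted Gini."""
    best_attr, best_score = None, float("inf")
    for attr in attributes:
        score = 0.0
        for value in {feats[attr] for feats, _ in rows}:
            subset = [label for feats, label in rows if feats[attr] == value]
            score += len(subset) / len(rows) * gini(subset)
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr

def build_tree(rows, attributes, depth=0, max_depth=3):
    labels = [label for _, label in rows]
    # Stopping criteria: pure node, no attributes left, or depth limit.
    if len(set(labels)) == 1 or not attributes or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    attr = best_attribute(rows, attributes)
    branches = {}
    for value in {feats[attr] for feats, _ in rows}:
        subset = [row for row in rows if row[0][attr] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != attr],
                                     depth + 1, max_depth)
    return (attr, branches)                           # internal node

rows = [({"degree": "Masters", "exp": 1}, "not placed"),
        ({"degree": "Bachelors", "exp": 0}, "not placed"),
        ({"degree": "Masters", "exp": 3}, "placed"),
        ({"degree": "Bachelors", "exp": 3}, "placed")]
print(build_tree(rows, ["degree", "exp"]))
```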
Metrics for Splitting
• Gini Impurity: The chance that the model predicts a class incorrectly; lower Gini means purer subsets. Before any split, the root node has the highest Gini impurity in the tree.
• Entropy: Measures the uncertainty in the dataset; lower entropy means more certainty. Before any split, the root node has the highest entropy in the tree.
• Information Gain: Shows how much uncertainty is reduced after splitting on an attribute; higher gain means a better attribute for splitting.
Entropy
• Definition: Entropy measures the randomness or impurity of a set of datapoints in a decision tree (how mixed the classes are). For class proportions p_i it is computed as:
Entropy = −Σ p_i · log2(p_i)
• For a two-class problem, entropy lies between 0 and 1. A lower value of entropy signifies a more homogeneous dataset with less randomness (hence better predictions); a high entropy indicates high disorder. With more than two classes, entropy can exceed 1, indicating that the dataset is very mixed and hard to build a good prediction or classification model on.

• A class probability of 0.5 makes entropy equal to 1, which means equally divided samples (not good for predicting).
• A class probability of 1 makes entropy 0, which means prediction with 100% accuracy.
• The goal is to have low entropy to ensure better predictions.

Information Gain
• Definition: Information gain measures the purity produced by an attribute. It represents how well the given attribute separates the dataset with respect to the desired target classification. It equals the entropy of the parent node minus the weighted entropy of the child subsets after the split.
• Information gain is high if the chosen attribute produces nearly pure subsets of datapoints (each containing a majority of one class) after the split. It is low if the chosen attribute does not produce near-pure subsets.

Gini Index

• Definition: The Gini index is another purity measurement used for splitting the datapoints in a decision tree. For class proportions p_i it is computed as Gini = 1 − Σ p_i².
• It measures the likelihood that a randomly chosen element from the dataset would be incorrectly classified if it were randomly labeled according to the distribution of labels in the dataset.
• The Gini impurity of a split is found by taking the Gini index of each resulting subset and multiplying it by the weighted size of that subset relative to the size of the datapoints in the parent node before the split.

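A small illustrative sketch (not from the slides) of the three metrics for a list of class labels; the example labels mirror the placement table below, and the split shown is hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini = 1 - sum(p_i^2) over class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Parent entropy minus the weighted entropy of the child subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

labels = ["placed"] * 5 + ["not placed"] * 3   # 5 placed, 3 not placed
print(entropy(labels), gini(labels))
split = [labels[:4], labels[4:]]               # a hypothetical split
print(information_gain(labels, split))
```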
Degree       Experience in years    Placed / Not placed
Masters              1                  Not placed
Bachelors            0                  Not placed
Masters              3                  Placed
Bachelors            3                  Placed
Masters              3                  Placed
Bachelors            1                  Placed
Masters              4                  Placed
Bachelors            1                  Not placed

Decision Tree

A poor split leaves high entropy, low information gain, and high Gini impurity; a good split produces low entropy, high information gain, and low Gini impurity.
Example

Consider sports data in which age and gender take part in deciding 'what kind of person would play a ground game?'.
We divide the data into binary groups: gender (F or M) and age (< 25 or >= 25).

Solution

Gender     Sportive-yes    Sportive-no    Total
Female          5               2           7
Male            2               4           6

Age        Sportive-yes    Sportive-no    Total
< 25            5               1           6
>= 25           2               5           7

Find the Gini index of the attributes (a sketch follows below).
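A sketch (hand-rolled, not from the slides) that computes the weighted Gini index for the Gender and Age splits from the table counts above; the attribute with the lower weighted Gini is preferred as the split:

```python
def gini(yes, no):
    """Gini impurity of a group with the given class counts."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def weighted_gini(groups):
    """Weight each group's Gini by its share of the datapoints."""
    n = sum(yes + no for yes, no in groups)
    return sum((yes + no) / n * gini(yes, no) for yes, no in groups)

# (sportive-yes, sportive-no) counts per group, taken from the tables above.
gender = [(5, 2), (2, 4)]   # Female, Male
age = [(5, 1), (2, 5)]      # < 25, >= 25

print(round(weighted_gini(gender), 3))   # ~0.425
print(round(weighted_gini(age), 3))      # ~0.348 -> Age is the better split
```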
CART Algorithm
• CART is an abbreviation of Classification And Regression Trees.
• Rather than general trees that can have multiple branches, CART uses a binary tree, which has only two branches from each node.
• CART uses Gini impurity as the measure to split a node, not information gain.
• Unlike the ID3 and C4.5 algorithms, which rely on information gain as the criterion to split nodes, the CART algorithm uses the Gini criterion to split the nodes.
CART Algorithm
• Just as ID3 and C4.5 use information gain to select the node with more uncertainty, the Gini coefficient guides the CART algorithm to find the node with larger uncertainty (i.e., impurity) and then split it.
• The Gini index is a metric that measures how often a randomly chosen element would be incorrectly identified.
• This means an attribute with a lower Gini index should be preferred.
Question

Hours of Study    Participated in Sports    Result
      1                    No                Fail
      2                    No                Fail
      2                    Yes               Pass
      3                    Yes               Pass
      3                    No                Pass
      4                    Yes               Pass
      4                    No                Pass

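As a cross-check on the hand calculation, a hedged sketch using scikit-learn's DecisionTreeClassifier (which implements CART-style binary splits on Gini impurity) on this question's data; the 0/1 encoding of the Sports column is an assumption for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [hours_of_study, participated_in_sports] with Yes=1, No=0.
X = [[1, 0], [2, 0], [2, 1], [3, 1], [3, 0], [4, 1], [4, 0]]
y = ["Fail", "Fail", "Pass", "Pass", "Pass", "Pass", "Pass"]

clf = DecisionTreeClassifier(criterion="gini")   # CART-style splits
clf.fit(X, y)

print(export_text(clf, feature_names=["hours", "sports"]))
print(clf.predict([[2, 0]]))   # e.g. predict for 2 study hours, no sports
```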
Decision tree using regression

• A decision tree for regression is used to predict continuous values (like temperature, price, etc.). Here's how you can build one.
• Steps to Construct a Regression Tree
• Choose the Best Split:
  • Look at each attribute (like age, salary, etc.) and find the best way to split the data.
  • Evaluate each possible split to see which one best reduces the variance (spread) in the target variable.
• Calculate the Standard Deviation Reduction (SDR):
  • For each potential split, calculate the standard deviation (a measure of spread) of the target variable before and after the split.
Standard Deviation Reduction

• Create Nodes: Split the data into two subsets based on the best split. Create a decision node that tests the chosen attribute.
• Recursively Split: Repeat the above steps for each child node until you reach a stopping point (like a maximum depth of the tree or a minimum number of samples per leaf).
• Prediction: To predict the value for a new instance, follow the path down the tree using the decision nodes and return the average value of the target variable in the final leaf node.

• Standard deviation plays the same role here that entropy plays in classification.
• Entropy is minimum when all data points belong to the same class, and maximum when there is an equal number of data points of each class.
• Likewise, if all target values are the same (say, 50), the standard deviation is 0.
• That means the split with the minimum weighted standard deviation (WSD) gives the best separation.
• SDR = SD(y) − WSD, so maximizing SDR is the same as minimizing WSD.

Question

Hours of Study    Participated in Sports    Result
      2                    Yes               50
      3                    No                60
      4                    Yes               65
      5                    No                70
      6                    Yes               80
      7                    No                85
      8                    Yes               90

Example of Decision tree using regression
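A hedged sketch that computes the Standard Deviation Reduction for the "Participated in Sports" split of the table above; it uses the population standard deviation (an assumption, since the slides do not specify sample vs. population SD):

```python
import statistics

results = [50, 60, 65, 70, 80, 85, 90]
sports = ["Yes", "No", "Yes", "No", "Yes", "No", "Yes"]

sd_all = statistics.pstdev(results)   # SD of the target before the split

# Group the target values by the attribute's value.
groups = {}
for s, r in zip(sports, results):
    groups.setdefault(s, []).append(r)

# Weighted standard deviation after splitting on the attribute.
wsd = sum(len(g) / len(results) * statistics.pstdev(g) for g in groups.values())

print(round(sd_all - wsd, 3))   # SDR = SD(y) - WSD
```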

Performance Metrics: Confusion Matrix

• Machine learning models are increasingly used in various


applications to classify data into different categories. However,
evaluating the performance of these models is crucial to ensure
their accuracy and reliability. One essential tool in this evaluation
process is the confusion matrix.
• A confusion matrix is a matrix that summarizes the performance of
a machine learning model on a set of test data. It is a means of
displaying the number of accurate and inaccurate instances based
on the model’s predictions. It is often used to measure the
performance of classification models, which aim to predict a
categorical label for each input instance.
• The matrix displays the number of instances produced by the model
on the test data.

Confusion Matrix

(N = 165)        Predicted NO    Predicted YES    Total
Actual NO          TN = 50          FP = 10         60
Actual YES         FN = 5           TP = 100       105
Total                  55              110

Precision = TP / (TP + FP) = 100/110 ≈ 0.91
Sensitivity = TP / (TP + FN) = 100/105 ≈ 0.95
Accuracy = (TP + TN) / N = 150/165 ≈ 0.91
Error rate = 1 − Accuracy ≈ 0.09
True Positive (TP) — the model correctly predicts the positive class (prediction and actual are both positive). In the above example, 100 people who have tumors are predicted positive by the model.
True Negative (TN) — the model correctly predicts the negative class (prediction and actual are both negative). In the above example, 50 people who don't have tumors are predicted negative by the model.
False Positive (FP) — the model gives a wrong prediction of the negative class (predicted positive, actually negative). In the above example, 10 people are predicted as positive for having a tumor, although they don't have one. FP is also called a TYPE I error.
False Negative (FN) — the model wrongly predicts the positive class (predicted negative, actually positive). In the above example, 5 people who have tumors are predicted as negative. FN is also called a TYPE II error.
With the help of these four values, we can calculate the True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and False Negative Rate (FNR).

Even if the data is imbalanced, we can figure out whether our model is working well or not. For that, the values of TPR and TNR should be high, and FPR and FNR should be as low as possible.
With the help of TP, TN, FN, and FP, other performance metrics can be calculated.

• 1. Accuracy
• Accuracy is used to measure the overall performance of the model. It is the ratio of total correct instances to the total instances:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

• 2. Precision
• Precision is a measure of how accurate a model's positive predictions are. It is defined as the ratio of true positive predictions to the total number of positive predictions made by the model:
Precision = TP / (TP + FP)

• 3. Recall
• Recall measures the effectiveness of a classification model in identifying all relevant instances from a dataset. It is the ratio of the number of true positive (TP) instances to the sum of true positive and false negative (FN) instances:
Recall = TP / (TP + FN)

• 4. F1-Score
• The F1-score is used to evaluate the overall performance of a classification model. It is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• We balance precision and recall with the F1-score when a trade-off between minimizing false positives and false negatives is necessary, such as in information retrieval systems.

• 5. Specificity
• Specificity is another important metric in the evaluation of classification models, particularly in binary classification. It measures the ability of a model to correctly identify negative instances. Specificity is also known as the True Negative Rate:
Specificity = TN / (TN + FP)

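A minimal sketch (not from the slides) that computes all of the metrics above from the confusion-matrix counts used earlier (TN = 50, FP = 10, FN = 5, TP = 100):

```python
TN, FP, FN, TP = 50, 10, 5, 100
N = TP + TN + FP + FN

accuracy    = (TP + TN) / N        # 150/165 ~ 0.91
precision   = TP / (TP + FP)       # 100/110 ~ 0.91
recall      = TP / (TP + FN)       # 100/105 ~ 0.95 (sensitivity)
specificity = TN / (TN + FP)       # 50/60   ~ 0.83
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f} f1={f1:.2f}")
```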
Advantages of Decision Trees

• Simplicity and Interpretability: Decision trees are easy


to understand and interpret. The visual representation
closely mirrors human decision-making processes.
• Versatility: Can be used for both classification and
regression tasks.
• No Need for Feature Scaling: Decision trees do not
require normalization or scaling of the data.
• Handles Non-linear Relationships: Capable of capturing
non-linear relationships between features and target
variables.

Disadvantages of Decision Trees

• Overfitting: Decision trees can easily overfit the training data, especially if they are deep with many nodes.
• Instability: Small variations in the data can result in a completely different tree being generated.
• Bias towards Features with More Levels: Features with more levels can dominate the tree structure.
ROC curve
• Receiver operating characteristic (ROC) curves are graphs showing a classifier's performance by plotting the true positive rate against the false positive rate. The area under the ROC curve (AUC) measures the performance of machine learning algorithms.
• Since the 1980s, ROC curves have gained popularity in medical diagnostic testing and, more recently, in analyzing the performance of machine learning algorithms.
How does a ROC curve work?

• An ROC curve works by plotting the true positive rate (TPR) on the y-axis and the false positive rate (FPR) on the x-axis of a graph. How does this connect to classification metrics in machine learning?
• Once a classification model has analyzed training data, a confusion matrix displays the results of the predicted data against the labeled data, and you can use this data (the TPR and the FPR) to produce the ROC curve, which can help you determine the efficacy of your machine learning model. Now, let's examine what makes up a TPR and FPR in machine learning:

• True positive rate: the ratio of true positive predictions to true positives plus false negatives: TPR = TP / (TP + FN)
• False positive rate: the ratio of false positive predictions to false positives plus true negatives: FPR = FP / (FP + TN)
• The true positive and false positive rates at each point on the curve depict the rates at each decision classification threshold. The scale of the curve goes from zero to one, with an ideal classifier reaching a true positive rate of one at a false positive rate of zero. The ROC curve has no bias towards classifiers and remains independent of the conditions it works under, making it useful for predictions with both balanced and imbalanced problems.
Area under the ROC curve
• To compare the ROC curves of multiple classifiers, a score is computed from the area under the ROC curve, also known as AUC or ROC AUC. This score ranges from 0.0 to 1.0, with 1.0 being a perfect classifier.
• What is the ROC curve used for?
• In machine learning, ROC curves measure the classification performance of various machine learning algorithms. A ROC curve focuses on the errors and benefits classifiers trade off when separating classes, making ROC graphs a useful analysis when comparing two classes in something like a diagnostic test that checks whether a condition is present or not present in an individual.
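A hedged sketch of producing ROC data with scikit-learn's roc_curve and roc_auc_score; the labels and scores below are invented purely to illustrate the API:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                      # labeled data
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55]    # model probabilities

# One (FPR, TPR) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))              # points on the ROC curve
print(roc_auc_score(y_true, y_score))   # area under the curve
```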
Kappa Statistics

• What is Cohen’s Kappa Score or Coefficient?


• Cohen’s Kappa Score, also known as the Kappa
Coefficient, is a statistical measure of inter-rater
agreement for categorical data. Cohen’s Kappa
Coefficient is named after statistician Jacob Cohen, who
developed the metric in 1960.
• Cohen’s Kappa takes into account both the number of
agreements (True positives & true negatives) and the
number of disagreements between the raters (False
positives & false negatives), and it can be used to
calculate both overall agreement and agreement after
chance has been taken into account.
• Taking that into consideration, Cohen’s Kappa score
can be defined as the metric used to measure the
performance of machine learning classification models
based on assessing the perfect agreement and
agreement by chance between the two raters
(real-world observer and the classification model).
• The main use of the Cohen's Kappa metric is to evaluate the consistency of the classifications, rather than their accuracy. This is particularly useful in scenarios where accuracy is not the only important factor, such as with imbalanced classes.

Cohen’s Kappa Score Value Range (-1 to 1)
• The Cohen Kappa Score is used to compare the
predicted labels from a model with the actual labels in
the data. The score ranges from -1 (worst possible
performance) to 1 (best possible performance).
• A Kappa value of 1 implies perfect agreement between
the raters or predictions.
• A Kappa value of 0 implies no agreement other than
what would be expected by chance. In other words,
the model is no better than the random guessing.
• Negative values indicate agreement less than chance
(which is a rare but concerning scenario)

How to Calculate Cohen’s Kappa Score?

• Cohen’s Kappa can be calculated using either raw data or


confusion matrix values.
• When Cohen’s Kappa is calculated using raw data, each
row in the data represents a single observation, and each
column represents a rater’s classification of that
observation.
• Cohen’s Kappa can also be calculated using a confusion
matrix, which contains the counts of true positives, false
positives, true negatives, and false negatives for each
class. We will look into the details of how Kappa score
can be calculated using confusion matrix.

Consider the following confusion matrix representing a binary classification model with two classes/labels. The actual values represent rater 1: an observer of real-world events who records what actually happened. The predicted values represent rater 2: the classification model which makes the predictions. (The counts shown are inferred from the Po and Pe values in the calculation below.)

                       Rater 2: Predicted Yes    Rater 2: Predicted No    Total
Rater 1: Actual Yes          TP = 45                   FN = 25             70
Rater 1: Actual No           FP = 15                   TN = 15             30
Total                            60                        40             100

• The Cohen's Kappa score assesses the model's performance as a function of the probability that rater 1 and rater 2 are in perfect agreement (TP + TN), denoted Po (observed probability), and the expected probability that the two raters agree by chance or randomly, denoted Pe (probability of random agreement), in the formula Kappa = (Po − Pe) / (1 − Pe).

Now that we have defined the terms, let's calculate the Cohen's Kappa score. Let the total number of observations (TP + FP + FN + TN) be N, i.e., N = TP + FP + FN + TN.
• The first step is to calculate the probability that both raters are in perfect agreement:
• Observed agreement, Po = (TP + TN) / N
• In our example, this would be: Po = (45 + 15) / 100 = 0.6
• Next, we need to calculate the expected probability that both raters are in agreement by chance. This is obtained by multiplying each rater's marginal probability of saying "Yes", multiplying each rater's marginal probability of saying "No", and adding the two products.

• Pe = (count of rater 1 "Yes" / N) × (count of rater 2 "Yes" / N) + (count of rater 1 "No" / N) × (count of rater 2 "No" / N)
• So in our case this would be calculated as:
• Pe = 0.7 × 0.6 + 0.3 × 0.4 = 0.42 + 0.12 = 0.54
• Now that we have both the observed and expected agreement, we can calculate Cohen's Kappa:
• Kappa score = (Po − Pe) / (1 − Pe)
• In our example, this would be:
• K = (0.6 − 0.54) / (1 − 0.54) = 0.06 / 0.46 = 0.1304, or a little over 13%
• As noted above, Kappa can range from −1 to 1. A value of 0 means that there is no agreement between the raters (real-world observer vs. classification model) beyond chance, and a value of 1 means that there is perfect agreement between the raters. In most cases, anything over 0.7 is considered to be very good agreement.
• The Cohen's Kappa score can also be used to assess the performance of multi-class classification models.
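A small sketch reproducing the worked calculation; the counts are the ones inferred for the example matrix above. (scikit-learn also provides sklearn.metrics.cohen_kappa_score for computing this directly from raw labels.)

```python
TP, FN, FP, TN = 45, 25, 15, 15
N = TP + FN + FP + TN                      # 100

po = (TP + TN) / N                         # observed agreement = 0.60
pe = ((TP + FN) / N) * ((TP + FP) / N) \
   + ((FP + TN) / N) * ((FN + TN) / N)     # chance agreement = 0.54
kappa = (po - pe) / (1 - pe)               # ~0.1304

print(round(kappa, 4))
```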

