
Evaluation Metrics:

Evaluation metrics are used to measure the quality of statistical or machine learning models. Evaluating machine learning models or algorithms is essential for any project. There are many different types of evaluation metrics available to test a model.

Confusion Matrix:

 It is an N x N matrix, where N is the number of classes being predicted.

 It gives us a matrix as output and describes the complete performance of the model.

 Accuracy: the proportion of the total number of predictions that were correct.
 True Positive: a positive class instance correctly predicted as positive.
 False Negative: a positive class instance incorrectly predicted as negative.
 False Positive: a negative class instance incorrectly predicted as positive.
 True Negative: a negative class instance correctly predicted as negative.

 Sensitivity: also known as the True Positive Rate or Recall. It is a measure of the positive examples labeled as positive by the classifier (the proportion of actual positive cases that are correctly identified). Higher is better.

 Specificity: also known as the True Negative Rate. It is a measure of the negative examples labeled as negative by the classifier (the proportion of actual negative cases that are correctly identified). Higher is better.

 Positive Predictive Value or Precision: the proportion of cases predicted as positive that are actually positive.

 Negative Predictive Value: the proportion of cases predicted as negative that are actually negative.
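As an illustration (not part of the original notes), here is a minimal Python sketch using scikit-learn's confusion_matrix; the label arrays y_true and y_pred are made-up examples for a binary problem.

from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
sensitivity = tp / (tp + fn)                    # recall / true positive rate
specificity = tn / (tn + fp)                    # true negative rate
precision   = tp / (tp + fp)                    # positive predictive value
npv         = tn / (tn + fn)                    # negative predictive value

print(accuracy, sensitivity, specificity, precision, npv)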

Type I and Type II Error:


Type I error occurs when the null hypothesis is true but is rejected. [A Type I error, or false positive, is asserting something as true when it is actually false.]

Type II error occurs when the null hypothesis is false but erroneously fails to be rejected. [A Type II error occurs when the null hypothesis is actually false but is accepted as true by the testing.]

Scenario/Problem Statement: medical trials for a drug which is a cure for cancer.

Type I error: predicting that a cure is found when that is not the case.

Type II error: predicting that a cure is not found when in fact it is the case.

In this scenario, a Type I error is not a major issue; it could be corrected later with more trials. A Type II error is more serious, because the drug could be discarded as offering no cure when in fact the cure could save millions of lives. The risk of committing a Type I error is represented by your alpha level (the p-value below which you reject the null hypothesis).

To control these types of errors, a chosen alpha level is used. Increasing the sample size can also reduce the risk and change the balance between these types of errors.

P-Value:

When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. The p-value is a number between 0 and 1; its value denotes the strength of the evidence. The claim which is on trial is called the Null Hypothesis.

* A low p-value (< 0.05) indicates strong evidence against the null hypothesis, which means we can reject the null hypothesis.

* A high p-value (> 0.05) indicates weak evidence against the null hypothesis, which means we fail to reject the null hypothesis.

* A p-value close to 0.05 is marginal and could go either way.

To put it another way,

High p-values: your data are likely with a true null.

Low p-values: your data are unlikely with a true null.
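As a small illustrative sketch (not from the original notes), a one-sample t-test with scipy.stats; the sample values and the hypothesised mean of 50 are made up.

from scipy import stats

# Hypothetical measurements; null hypothesis: the true mean is 50
sample = [51.2, 49.8, 52.5, 50.9, 53.1, 48.7, 52.0, 51.5]

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

# Compare the p-value with the chosen alpha level
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
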
F1 Score:

The F1 score combines recall (sensitivity) and precision into a single number. It is a good choice when you seek a balance between precision and recall, and it is commonly used to measure a test's accuracy.

The F1 score is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

The range of the F1 score is [0, 1]. It tells you how precise your classifier is (how many of its positive predictions are correct), as well as how robust it is (whether it misses a significant number of positive instances).

High precision with low recall gives you an extremely accurate classifier, but one that misses a large number of instances that are difficult to classify. The greater the F1 score, the better the performance of the model.
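A minimal sketch (with assumed labels, not from the notes) showing the harmonic-mean formula alongside scikit-learn's f1_score:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels for a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))  # both values agree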

Classification Accuracy:
It is the ratio of the number of correct predictions to the total number of input samples.

It works well only if there are roughly equal numbers of samples belonging to each class (a balanced dataset).
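For completeness, the same idea with scikit-learn's accuracy_score (illustrative labels only):

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# correct predictions / total predictions; most meaningful on balanced classes
print(accuracy_score(y_true, y_pred))  # 0.75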

Logarithmic Loss:

Logarithmic Loss, or Log Loss, works by penalising false classifications. It works well for multi-class classification. When working with Log Loss, the classifier must assign a probability to each class for every sample. Suppose there are N samples belonging to M classes; then the Log Loss is calculated as below:

Log Loss = -(1/N) * sum over i = 1..N, j = 1..M of [ y_ij * log(p_ij) ]

where,

y_ij indicates whether sample i belongs to class j or not

p_ij indicates the probability of sample i belonging to class j

Log Loss has no upper bound and it exists on the range [0, ∞). Log Loss nearer to 0 indicates higher accuracy, whereas Log Loss further from 0 indicates lower accuracy. In general, minimising Log Loss gives greater accuracy for the classifier.
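A minimal sketch (made-up class probabilities) that evaluates the formula above directly and checks it against scikit-learn's log_loss:

import numpy as np
from sklearn.metrics import log_loss

# Hypothetical 3-class problem: true class index per sample and predicted probabilities per class
y_true = [0, 2, 1, 2]
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# One-hot encode y_true so that y[i, j] = 1 when sample i belongs to class j
y = np.eye(3)[y_true]

manual = -np.mean(np.sum(y * np.log(p), axis=1))
print(manual, log_loss(y_true, p))  # both values agree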

 Gain and Lift Charts:


Gain and Lift charts are used to evaluate the performance of a classification model. They measure how much better one can expect to do with the predictive model compared with using no model at all. They are very popular metrics in marketing analytics, but they are not restricted to marketing analysis and can be used in other domains as well, such as risk modeling and supply chain analytics. They also help to find the best predictive model among multiple challenger models.

Steps to build a lift/gain chart:

Step 1: Calculate the predicted probability for each observation.

Step 2: Rank these probabilities in decreasing order.

Step 3: Build deciles, with each group having roughly 10% of the observations.

Step 4: Calculate the response rate at each decile for Good (responders), Bad (non-responders) and total.

Gain at a given decile level is the ratio of the cumulative number of targets (events) up to that decile to the total number of targets (events) in the entire data set.

Lift measures how much better one can expect to do with the predictive model compared with no model. It is the ratio of the gain % to the random expectation % at a given decile level. The random expectation at the xth decile is x%.
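An illustrative sketch of these steps with pandas; the scores and responses are randomly generated, not real data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Step 1: hypothetical predicted probabilities and actual responses
df = pd.DataFrame({"prob": rng.random(1000)})
df["response"] = (rng.random(1000) < df["prob"]).astype(int)  # responders concentrate at high scores

# Steps 2-3: rank by probability (decreasing) and cut into 10 deciles
df = df.sort_values("prob", ascending=False).reset_index(drop=True)
df["decile"] = pd.qcut(df.index, 10, labels=range(1, 11))

# Step 4: responders per decile, then cumulative gain and lift
responders = df.groupby("decile", observed=True)["response"].sum()
cum_gain = responders.cumsum() / df["response"].sum() * 100   # % of all responders captured so far
lift = cum_gain / (np.arange(1, 11) * 10)                     # gain % / random expectation %

print(pd.DataFrame({"cum_gain_pct": cum_gain.round(1), "lift": lift.round(2)}))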

Area Under Curve [AUC-ROC Curve]:

Area Under Curve (AUC) is one of the most widely used metrics for evaluation. It is used for binary classification problems. The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example.

AUC has a range of [0, 1]. The greater the value, the better the performance of the model.

The ROC curve is the plot of sensitivity against (1 - specificity). (1 - specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate. This is a commonly used graph that summarizes the performance of a classifier over all possible thresholds.

The biggest advantage of using the ROC curve is that it is independent of changes in the proportion of responders.
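A minimal sketch with scikit-learn (assumed labels and scores, not from the notes):

from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]

auc = roc_auc_score(y_true, y_score)

# fpr = 1 - specificity, tpr = sensitivity; one point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print("AUC:", auc)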

Mean Absolute Error:

Mean Absolute Error is the average of the absolute differences between the original values and the predicted values. It gives us a measure of how far the predictions were from the actual output.

Mean Squared Error:

Mean Squared Error (MSE) is quite similar to Mean Absolute Error, the only difference being that MSE takes the average of the squares of the differences between the original values and the predicted values.

The advantage of MSE is that its gradient is easier to compute, whereas Mean Absolute Error requires more complicated linear programming tools to compute the gradient.

Since we take the square of the error, the effect of larger errors becomes more pronounced than that of smaller errors, so the model can focus more on the larger errors.
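A short numpy sketch (made-up values) computing both errors side by side:

import numpy as np

# Hypothetical actual and predicted values from a regression model
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # average absolute difference
mse = np.mean((y_true - y_pred) ** 2)    # average squared difference; large errors dominate

print(mae, mse)  # 0.5 0.375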

K-S Chart:

The Kolmogorov-Smirnov (K-S) chart measures the performance of classification models. More accurately, K-S is a measure of the degree of separation between the positive and negative distributions. In most classification models the K-S will fall between 0 and 100; the higher the value, the better the model is at separating the positive from the negative cases.
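One common way to compute it (an assumption here, since the notes do not give a formula) is the maximum gap between the cumulative true positive rate and false positive rate over all thresholds; a small sketch with assumed data:

import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]

fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr) * 100   # expressed on the 0-100 scale used in the notes

print("K-S statistic:", ks)
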
Gini Coefficient:

The Gini coefficient also relates to assessing classifier models. It is directly related to the area under the ROC curve mentioned above: the Gini coefficient is calculated from the area under the curve (AUC) as 2*AUC - 1. A 74% area under the curve becomes a Gini coefficient of 48%, which is fair. The Gini coefficient hence effectively ranges between 0% and 100%; a Gini above 60% indicates a good model.
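Continuing the AUC sketch above, the Gini coefficient follows directly (assumed data):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1   # e.g. an AUC of 0.74 corresponds to a Gini of 0.48

print(auc, gini)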

R-Squared:

R Squared is used to determine the strength of the relationship between the predictors and the target. In simple terms, it lets us know how good a regression model is when compared to simply predicting the average. R Squared is computed from the residual sum of squares and the total sum of squares:

R Squared = 1 - (SSR / SST)

Where,

 SSR (Sum of Squares of Residuals) is the sum of the squares of the differences between the actual observed value (y) and the predicted value (ŷ).
 SST (Total Sum of Squares) is the sum of the squares of the differences between the actual observed value (y) and the average of the observed y values (y_avg).

SSR is the fitting criterion for a regression line: the regression algorithm chooses the best regression line for a given set of observations by comparing the SSR of candidate lines. The line with the least value of SSR is the best-fitting line.
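A minimal numpy sketch (made-up observations) of the SSR/SST calculation:

import numpy as np

# Hypothetical observed and predicted values
y = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 4.7, 3.0, 6.8])

ssr = np.sum((y - y_hat) ** 2)        # residual sum of squares
sst = np.sum((y - np.mean(y)) ** 2)   # total sum of squares

r_squared = 1 - ssr / sst
print(r_squared)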

DisadvantageThe value of R Squared never decreases. Adding new independent variables


will result in an increased value of R Squared.

Adjusted R Squared:

Adjusted R Squared has the capability to decrease with the addition of less significant variables, thus resulting in a more reliable and accurate evaluation.

Degrees of Freedom: we can define this as the minimum number of data points or observations required to generate a valid regression model.

Adjusted R Squared = 1 - [ (1 - R Squared) * (n - 1) / (n - k - 1) ]

Where,

 k is the number of independent variables
 n is the number of observations

Adjusted R Squared makes use of the degrees of freedom to compensate for, and penalize, the inclusion of a bad variable.

For a given R Squared, the value of Adjusted R Squared decreases as k increases, so the adjustment acts as a penalizing factor for a bad variable and a rewarding factor for a good or significant variable. Adjusted R Squared is thus a better model evaluator and relates the variables to the target more reliably than R Squared.
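Continuing the R Squared sketch above, with hypothetical counts for n and k (assumed values for illustration only):

n = 40          # number of observations
k = 3           # number of independent variables
r_squared = 0.80

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(adj_r_squared)  # 0.7833...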
