Diagnosing Bias Vs Variance
We will continue to look at some methodologies to diagnose our hypothesis function. This time,
it will be about bias and variance.
Errors due to bias mean that the model is too simple to represent the situation at hand; it cannot capture the details. High bias causes underfitting: our model is missing something important.
Errors due to variance mean that the model does not generalize to situations or datasets other than our training set. If the model has high variance, it is overfitted.
[Figure: three fits of the same data: high bias & low variance (underfitting), a good compromise, and low bias & high variance (overfitting).]
As our model gets more complex, the bias gets lower and the variance gets higher. So we must find the point where the bias and the variance are well balanced against each other; we cannot simply lower both at once.
To understand this clearly, let's see what bias and variance mean.
What is Bias?
1. Low Bias: Suggests fewer assumptions about the form of the target function.
2. High Bias: Suggests more assumptions about the form of the target function.
Refer to this article to understand what over-fitting and under-fitting models mean.
What is Variance?
1. Low Variance: Suggests small changes to the estimate of the target function
with changes to the training dataset.
2. High Variance: Suggests large changes to the estimate of the target
function with changes to the training dataset.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias. However, this only holds up to a point: as you continue to make your model more complex, you end up over-fitting it, and your model starts to suffer from high variance.
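To make this concrete, here is a minimal sketch, assuming an illustrative noisy sine-wave dataset and an arbitrary range of polynomial degrees (NumPy and scikit-learn), that fits increasingly complex models and prints training and testing error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Illustrative data: a noisy sine wave.
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 9, 15):
    # Higher degree = more complex model.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

At low degrees both errors are high (high bias, underfitting); at very high degrees the training error keeps shrinking while the testing error grows again (high variance, overfitting).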
The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance. However, you cannot reduce both at once.
1. The k-nearest neighbors algorithm has low bias and high variance, but this can be changed by increasing the value of k: using more neighbors in each prediction increases the bias of the model and decreases its variance.
2. The support vector machine algorithm has low bias and high variance, but this can be changed by increasing the C parameter that controls the number of margin violations allowed in the training data: permitting more violations increases the bias but decreases the variance.
Both knobs are illustrated in the sketch below.
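Here is a hedged sketch of both knobs on a synthetic dataset. One caveat: in scikit-learn, the C of SVC is the inverse of the violation budget described above, so a smaller scikit-learn C allows more margin violations (more bias, less variance):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Illustrative synthetic classification dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-NN: small k -> low bias / high variance; larger k averages over
# more neighbors, which increases bias and decreases variance.
for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  train acc={knn.score(X_train, y_train):.2f}  "
          f"test acc={knn.score(X_test, y_test):.2f}")

# SVM: in scikit-learn, small C tolerates more margin violations
# (higher bias, lower variance); large C tolerates fewer (lower bias,
# higher variance).
for C in (0.01, 1.0, 100.0):
    svm = SVC(C=C).fit(X_train, y_train)
    print(f"C={C:6.2f}  train acc={svm.score(X_train, y_train):.2f}  "
          f"test acc={svm.score(X_test, y_test):.2f}")
```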
So, what is over-fitting? And what is under-fitting? As described above, a model with high variance over-fits the training data, while a model with high bias under-fits it. A good model sits between the two:
● Ideally, a model that makes predictions with zero error is said to have a good fit on the data.
● This situation is achievable at a spot between overfitting and underfitting. To find it, we have to watch the performance of our model over time as it learns from the training dataset.
As training goes on, our model keeps learning, and its error on the training and testing data keeps decreasing. If it learns for too long, however, the model becomes prone to overfitting, because it starts to fit noise and less useful details from the training data. Its performance on unseen data will then decrease.
● In order to get a good fit, we stop at the point just before the error on the testing data starts increasing. At this point the model is said to have good skill on the training dataset as well as on our unseen testing dataset, as the sketch after this bullet illustrates.
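Here is a minimal sketch of that stopping rule, assuming a simple SGDRegressor on synthetic data and a hypothetical patience of 5 epochs; the point is the procedure, not the particular model:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Illustrative noisy linear data, split into training and validation sets.
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)

best_val, best_epoch, patience, bad_epochs = np.inf, 0, 5, 0
for epoch in range(1, 201):
    model.partial_fit(X_train, y_train)  # one more pass over the training data
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val:
        best_val, best_epoch, bad_epochs = val_mse, epoch, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation error has stopped improving: stop training here

print(f"stopped at epoch {epoch}; best validation MSE {best_val:.4f} "
      f"was at epoch {best_epoch}")
```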
This trade-off can be made precise with the mean squared error (MSE) of an estimator. For an estimator θ̂ of a true value θ, the standard decomposition is

MSE(θ̂) = E[(θ̂ − θ)²] = Bias(θ̂)² + Var(θ̂)

The formula calculates the average squared difference between the estimator's predictions and the true values it is trying to predict. The difference is squared to emphasize larger errors more than smaller ones.
The goal is to minimize the MSE, which means finding an estimator that balances bias
and variance. A good estimator makes predictions that are close to the true values on
average (low bias), and also makes predictions that are consistent across different
training datasets (low variance).
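A small simulation can check this decomposition empirically: refit the same model on many freshly drawn training sets and measure the bias² and variance of its predictions against the noiseless true function. All dataset details below are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
true_f = np.sin                      # the noiseless target function
x_test = np.linspace(0, 6, 50)
n_runs, n_train, noise = 300, 30, 0.3

for degree in (1, 4, 12):
    preds = np.empty((n_runs, x_test.size))
    for r in range(n_runs):
        # A fresh training set each run: same true function, new noise.
        x_tr = rng.uniform(0, 6, n_train)
        y_tr = true_f(x_tr) + rng.normal(scale=noise, size=n_train)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x_tr[:, None], y_tr)
        preds[r] = model.predict(x_test[:, None])

    avg_pred = preds.mean(axis=0)                      # average prediction
    bias_sq = np.mean((avg_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    mse = np.mean((preds - true_f(x_test)) ** 2)
    print(f"degree={degree:2d}  bias^2={bias_sq:.4f}  var={variance:.4f}  "
          f"bias^2+var={bias_sq + variance:.4f}  MSE={mse:.4f}")
```

The low-degree model shows large bias² and small variance, the high-degree model the opposite, and in every case bias² + variance matches the measured MSE.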