Cross Validation
Chandan B K
By
Mrs. Sridevi S
Asst. Professor,
Department of Computer Science Engineering
Contents
Cross validation
Types of cross validation
Leave-one-out cross validation
Hold-out method
K-fold cross validation
Stratified K-fold cross validation
Time series cross validation
Cross Validation
Cross-validation is a resampling technique used to assess how well a model will perform, in terms of efficiency and accuracy, on unseen data.
It evaluates a Machine Learning model by training several models on different subsets of the available input data and evaluating each of them on the complementary subset.
It involves reserving a particular sample of the dataset on which the model is not trained; the model is later tested on this sample before being finalized.
Steps Involved In Cross Validation
Shuffle the dataset randomly.
Split the dataset into k groups.
For each unique group:
Take the group as a hold-out (test) data set.
Take the remaining groups as a training data set.
Fit a model on the training set and evaluate it on the test set.
Retain the evaluation score and discard the model.
Summarize the skill of the model using the sample of evaluation scores.
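A minimal sketch of these steps, assuming NumPy arrays and a scikit-learn-style estimator with fit/score methods; the names k_fold_cv and model_factory are illustrative and not part of the slides:

import numpy as np

def k_fold_cv(model_factory, X, y, k=5, seed=42):
    # Step 1: shuffle the dataset randomly (via a permutation of the row indices).
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    # Step 2: split the shuffled indices into k roughly equal groups.
    folds = np.array_split(indices, k)
    scores = []
    # Step 3: each group takes a turn as the hold-out (test) set,
    # with the remaining groups forming the training set.
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_factory()                    # fresh model for every fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))  # retain the score, discard the model
    # Step 4: summarize the model's skill using the sample of evaluation scores.
    return np.mean(scores), np.std(scores)

# Example usage with a placeholder dataset and estimator:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
print(k_fold_cv(lambda: LogisticRegression(max_iter=1000), X, y, k=5))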
Advantages and Disadvantages of Cross Validation
Pros
Reduces overfitting.
Supports hyperparameter tuning.
Cons
Increases training time.
Requires more computation.
Cross Validation Techniques
Exhaustive Methods
Leave-one-out cross validation
Leave-p-out cross validation
Non-exhaustive Methods
Hold-out method
K-fold cross validation
Stratified K-fold cross validation
Time series cross validation
Leave One Out Cross Validation
Leave-one-out cross-validation is a special case of cross-
validation where the number of folds equals the number
of instances in the data set.
In the general leave-p-out scheme, if there are n data points, n − p points are used for training in each iteration and the remaining p points are used for validation.
In leave-one-out, only a single data point is used as the test set, i.e. p = 1.
Leave One Out Cross Validation
Pros and Cons:
A large number of iterations when the data set is large.
Low-bias approach.
Requires more computational power.
No randomness in the train/test splits.
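As a small illustration, scikit-learn's LeaveOneOut splitter can be plugged into cross_val_score; the iris dataset and logistic regression below are placeholders, not part of the slides:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()                          # number of folds == number of instances (p = 1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(len(scores))                           # one score per data point (150 for iris)
print(scores.mean())                         # average accuracy over all n iterations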
Hold Out Method
In this approach we divide the entire dataset into two parts, viz. training data and testing data.
The training data is usually more than twice the size of the testing data, so the data is commonly split in a 70:30 or 80:20 ratio.
The data is shuffled randomly before splitting, so the model is trained on a different combination of data points each time the split is made.
Hold Out Method
Pros and Cons:
The model can give different results every time we train it, and this can be a cause of instability.
We can never be sure that the train set we picked is representative of the whole dataset.
When the dataset is not very large, there is a high chance that the testing data contains important information that the model never sees, since we do not train on the testing set.
The hold-out method is a good choice when you have a very large dataset or when you are building an initial model in your data science project.
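A minimal sketch of the hold-out method with scikit-learn's train_test_split, assuming a 70:30 split; the dataset and estimator are placeholders:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Shuffle (on by default) and split once in a 70:30 ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))           # accuracy on the reserved 30% of the data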
K -Fold Cross Validation
In K-fold cross-validation, the data is divided into k subsets.
Each time, one of the k subsets is used as the validation set and the other k − 1 subsets together form the training set.
The evaluation metric is averaged over all k trials to estimate the overall performance of the model.
K -Fold Cross Validation
A large value of k gives an estimate with less bias but higher variance; it also means more data samples are used for training in each fold, which tends to give a better, more precise estimate.
The true error is estimated as the average error rate on the test folds.
K -Fold Cross Validation
Pros and Cons:
Computation time is reduced compared to leave-one-out, as the process is repeated only k times (e.g. 10 times when k = 10).
Reduced bias.
The variance of the estimate is reduced as k increases.
Training is computationally intensive, as the algorithm has to be rerun from scratch k times.
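For reference, a small sketch using scikit-learn's KFold splitter with k = 10; the iris data and logistic regression are stand-ins:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=42)       # k = 10 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores)                                # one accuracy score per fold
print(scores.mean(), scores.std())           # averaged over all k trials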
Stratified K-Fold Cross Validation
Stratified sampling is a sampling technique where samples are selected in the same proportions as they appear in the population.
Stratified K-fold is used when random shuffling and splitting alone are not sufficient and we want the correct distribution of data in each fold.
For regression problems, folds are selected so that the mean response value is approximately equal in all folds.
For classification problems, folds are selected to have the same proportion of class labels.
Stratified K-Fold Cross Validation
Pros and Cons:
It can improve different models through hyperparameter tuning.
Helps us compare models.
It helps in reducing both bias and variance.
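A short sketch with scikit-learn's StratifiedKFold, which preserves the class proportions in every fold; the dataset and estimator are placeholders:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# The splitter uses the labels y so that each fold keeps the same class proportions.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print(scores.mean())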
Time Series Cross Validation
Cross-validating a time-series model means cross-validating on a rolling basis.
We start with a small subset of the data for training, forecast the later data points, and then check the accuracy of the forecasts.
The forecasted data points are then included in the next training dataset, and the subsequent data points are forecasted, and so on.
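A rolling-origin sketch using scikit-learn's TimeSeriesSplit on a toy trending series; the data and linear model are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Toy series with a linear trend plus noise; earlier points always precede later ones.
X = np.arange(100).reshape(-1, 1)
y = 0.5 * np.arange(100) + np.random.default_rng(0).standard_normal(100)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])
    # Each split's training window grows to include the points forecasted in the previous split.
    print(f"train size={len(train_idx):3d}  test size={len(test_idx):2d}  R^2={score:.3f}")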
Thank You