Keras1 - 1.3 Improving Your Model Performance
Loss curve
Loss tends to decrease as epochs go by. This is expected, since our model is essentially
learning to minimize the loss function. Epochs are shown on the X-axis and loss on the
Y-axis. As epochs go by, our loss value decreases. After a certain number of epochs,
the value converges, meaning it no longer gets much lower: we've arrived at a
minimum.
Accuracy curve
Accuracy curves are similar but opposite in tendency: if the Y-axis shows accuracy, it
tends to increase as epochs go by. This shows that our model makes fewer
mistakes as it learns.
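As a minimal sketch (assuming a compiled Keras model named model and training arrays X_train and y_train, which are placeholder names here), both curves can be plotted directly from the History object returned by fit:

    import matplotlib.pyplot as plt

    # Train and keep the History object; history.history stores one value
    # per epoch for each tracked metric.
    history = model.fit(X_train, y_train, epochs=100, verbose=0)

    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['accuracy'], label='accuracy')  # key is 'acc' in older Keras
    plt.xlabel('Epochs')
    plt.legend()
    plt.show()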
Overfitting
If we plot training versus validation data we can identify overfitting. We will see the
training and validation curves start to diverge. Overfitting is when our model starts
learning particularities of our training data that don't generalize well to unseen data.
The early stopping callback is useful to stop our model before it starts overfitting.
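A hedged sketch of how that callback can be wired in, reusing the model and data names from before and monitoring validation loss:

    from tensorflow.keras.callbacks import EarlyStopping

    # Stop training once validation loss has not improved for `patience` epochs.
    early_stop = EarlyStopping(monitor='val_loss', patience=5)

    history = model.fit(X_train, y_train,
                        epochs=1000,
                        validation_data=(X_test, y_test),  # held-out data (names assumed)
                        callbacks=[early_stop])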
Unstable curves
But not all curves are smooth and pretty; many times we will find unstable curves. There
are many reasons that can lead to unstable learning curves: the chosen optimizer,
learning rate, batch size, network architecture, weight initialization, etc. All of these
can affect how stable training is.
To plot learning curves against training set size, we loop over a predefined list of train
sizes, and for each training size we get the
corresponding training data fraction. Before any training, we make sure our model starts
with the same set of weights by setting them to the initial_weights using the set_weights
function. After that, we can fit our model on the training fraction. We use an
EarlyStopping callback which monitors loss, but it's important to note that this is not
validation loss, since we haven't provided the fit method with validation data. After
training is done, we can get the accuracy on the training set fraction and the accuracy
on the test set and append them to our lists of accuracies. Observe that the same set
of test observations is used for evaluation in every iteration.
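The loop described above might look roughly like this sketch; train_sizes, the compiled model (with metrics=['accuracy']) and the data splits are assumed to exist already:

    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor='loss', patience=3)  # training loss, no validation data

    # Save the freshly initialized weights so every run starts from the same point.
    initial_weights = model.get_weights()

    train_accs, test_accs = [], []

    for size in train_sizes:
        # Take the corresponding fraction of the training data.
        X_frac, y_frac = X_train[:size], y_train[:size]

        # Reset the model to its initial weights before each fit.
        model.set_weights(initial_weights)
        model.fit(X_frac, y_frac, epochs=100, callbacks=[early_stop], verbose=0)

        # Accuracy on the training fraction and on the same, fixed test set each time.
        train_accs.append(model.evaluate(X_frac, y_frac, verbose=0)[1])
        test_accs.append(model.evaluate(X_test, y_test, verbose=0)[1])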
You want to distinguish between each of the 10 possible digits given an image, so you
are dealing with multi-class classification.
The dataset has already been partitioned into X_train, y_train, X_test, and y_test, using
30% of the data as testing data. The labels are already one-hot encoded vectors, so
you don't need to use the Keras to_categorical() function.
Let's build this new model!
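A sketch of one way to build it; the input size of 64 (flattened 8x8 digit images) and the hidden-layer width are assumptions, while the 10-neuron softmax output and categorical_crossentropy follow from the one-hot encoded labels:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential()
    model.add(Dense(16, input_shape=(64,), activation='relu'))  # 64 input features assumed
    model.add(Dense(10, activation='softmax'))                  # one neuron per digit

    # categorical_crossentropy pairs with one-hot encoded labels.
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=60, validation_data=(X_test, y_test))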
Inside a neuron, the inputs reaching it are multiplied by the weights of their
connections and summed, and the bias weight is added. This operation results in a
number, a, which can be anything; it is not bounded.
We pass this number into an activation function, which essentially takes it as input and
decides how the neuron fires and which output it produces. Activation functions impact
learning time, making our model converge faster or slower and achieve lower or
higher accuracy. They also allow us to learn more complex functions.
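Expressed as a tiny NumPy sketch (all names are illustrative):

    import numpy as np

    def neuron_output(inputs, weights, bias, activation):
        # Weighted sum of the inputs plus the bias: the resulting value `a` is unbounded.
        a = np.dot(inputs, weights) + bias
        # The activation function decides how the neuron fires given `a`.
        return activation(a)

    sigmoid = lambda a: 1 / (1 + np.exp(-a))
    print(neuron_output(np.array([0.5, -1.2]), np.array([0.8, 0.3]), 0.1, sigmoid))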
Activation zoo
Four very well-known activation functions are: the sigmoid, which varies between 0 and
1 for all possible input values; the tanh or hyperbolic tangent, which is similar to the
sigmoid in shape but varies between -1 and 1; the ReLU (rectified linear unit), which
varies between 0 and infinity; and the leaky ReLU, a variant of ReLU that doesn't sit
flat at 0 for negative inputs, instead allowing small negative values.
We can see that the previous model can not completely separate red crosses from blue
circles if we use a sigmoid activation function in the hidden layer. Some blue circles are
misclassified as red crosses along the diagonal. However, when we use the tanh we
completely separate red crosses from blue circles, the separation region for the blue
and red classification is smooth.
Using a ReLU activation function we obtain sharper boundaries; the leaky ReLU shows
similar behavior for this dataset. It's important to note that these boundaries will be
different for every run of the same model because of the random initialization of weights
and other random variables that aren't fixed.
We can then use this function as we loop over several activation functions, training
different models and saving their history callback. We store all these callbacks in a
dictionary.
With this dictionary of histories, we can extract the metrics we want to plot, build a
pandas dataframe and plot it.
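A sketch of that loop, where get_model is a hypothetical helper that returns a fresh, compiled model using the given activation in its hidden layer:

    import pandas as pd

    activations = ['sigmoid', 'tanh', 'relu']  # leaky ReLU usually needs the LeakyReLU layer
    activation_results = {}

    for act in activations:
        model = get_model(act)  # hypothetical helper: new compiled model per activation
        history = model.fit(X_train, y_train,
                            validation_data=(X_test, y_test),
                            epochs=100, verbose=0)
        activation_results[act] = history

    # Extract one metric per history, build a DataFrame, and plot it.
    val_loss = {act: hist.history['val_loss'] for act, hist in activation_results.items()}
    pd.DataFrame(val_loss).plot(title='Validation loss per activation')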
Mini-batch
Remember that during an epoch we feed our network, calculate the errors and update
the network weights. It's not very practical to update our network weights only once per
epoch after looking at the error produced by all training samples. In practice, we take a
mini-batch of training samples. And that way, if our training set has 9 images and we
choose a batch_size of 3, we will perform 3 weight updates per epoch, one per mini-
batch.
Networks tend to train faster with mini-batches since weights are updated more often.
Sometimes datasets are so huge that they would struggle to fit in RAM if we
didn't use mini-batches. Also, the noise produced by a small batch size can help escape
local minima. A couple of disadvantages are the need for more iterations and having to
find a good batch size.
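For instance, using the numbers from the example above (the model and data names are assumed):

    # With 9 training samples and batch_size=3, each of the 5 epochs performs
    # 3 weight updates, one per mini-batch.
    model.fit(X_train, y_train, epochs=5, batch_size=3)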
Great work! You can see that accuracy is lower when using a batch size equal to the
training set size. This is not because the network had more trouble learning the
underlying function: even though the same number of epochs was used for both
batch sizes, the number of resulting weight updates was very different! With a
batch size equal to the training set size and 5 epochs, we only get 5 updates in total,
and each update computes an averaged gradient over all the training set observations.
To obtain similar results with this batch size we should increase the number of epochs
so that more weight updates take place.
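The difference in update counts is easy to check; the sample count used here is illustrative, not taken from the exercise:

    import math

    n_samples, epochs = 700, 5  # illustrative numbers

    for batch_size in (n_samples, 32):
        updates = math.ceil(n_samples / batch_size) * epochs
        print(f'batch_size={batch_size}: {updates} weight updates')
    # batch_size=700: 5 weight updates; batch_size=32: 110 weight updates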
This is the multi-class classification problem you solved earlier using softmax and 10
neurons in your output layer.
You will now build a new, deeper model consisting of 3 hidden layers of 50 neurons
each, using batch normalization in between layers. The kernel_initializer parameter is
used so that the weights are initialized in a similar way to the previous model.
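A hedged sketch of such a model; the input size of 64 is again an assumption, and 'random_normal' stands in for whatever kernel_initializer the baseline model used:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, BatchNormalization

    model = Sequential()
    model.add(Dense(50, input_shape=(64,), activation='relu',
                    kernel_initializer='random_normal'))
    model.add(BatchNormalization())
    model.add(Dense(50, activation='relu', kernel_initializer='random_normal'))
    model.add(BatchNormalization())
    model.add(Dense(50, activation='relu', kernel_initializer='random_normal'))
    model.add(BatchNormalization())
    model.add(Dense(10, activation='softmax'))  # 10-class softmax output as before

    model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])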
Sklearn recap
In sklearn we can perform hyperparameter search by using methods like
RandomizedSearchCV. We import RandomizedSearchCV from sklearn's
model_selection module. We instantiate a model, define a dictionary with a series of
model parameters to try, and finally instantiate a RandomizedSearchCV object, passing
our model, the parameters, and a number of cross-validation folds. We fit it on our data
and print the best resulting combination of parameters. For this example, a
min_samples_leaf of 1, 3 max_features, and a max_depth of 3 gave us the best results.
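A sketch of that recap using a decision tree (the parameters named above are tree parameters, so that choice seems consistent; X and y are placeholder data):

    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier()

    # Parameters to sample from, matching those mentioned above.
    params = {'max_depth': [3, None],
              'max_features': range(1, 4),
              'min_samples_leaf': range(1, 4)}

    tree_cv = RandomizedSearchCV(tree, params, cv=5)
    tree_cv.fit(X, y)
    print(tree_cv.best_params_)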
This is very cool! Our model is now just like any other sklearn estimator, so we can, for
instance, perform cross-validation on it to see the stability of its predictions across folds.
We import cross_val_score and call it, passing in our recently converted Keras model,
the predictors, the labels, and the number of folds. We can then check the mean
accuracy per fold or the standard deviation. Note that 6 epochs and a batch_size of 16
were used, since we specified them before.
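A sketch of that workflow; create_model is assumed to be a function returning a compiled Keras model, and the wrapper import path depends on the Keras version (newer setups use the separate scikeras package):

    from sklearn.model_selection import cross_val_score
    from keras.wrappers.scikit_learn import KerasClassifier  # scikeras.wrappers in newer setups

    # Wrap the Keras model so sklearn can treat it like any other estimator.
    model = KerasClassifier(build_fn=create_model, epochs=6, batch_size=16, verbose=0)

    # 3-fold cross-validation on predictors X and labels y (names assumed).
    kfold_scores = cross_val_score(model, X, y, cv=3)
    print('Mean accuracy:', kfold_scores.mean())
    print('Accuracy std:', kfold_scores.std())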
Then we just need to use the exact same names in the parameter dictionary as we have
in our function and repeat the process. The best result is 87% accuracy with 2 hidden
layers of 128 neurons each.
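A sketch of tuning the number of hidden layers (nl) and neurons per layer (nn); the creation function, input size, output layer, and parameter ranges are illustrative assumptions:

    from sklearn.model_selection import RandomizedSearchCV
    from keras.wrappers.scikit_learn import KerasClassifier
    from keras.models import Sequential
    from keras.layers import Dense

    def create_model(nl=1, nn=256):
        model = Sequential()
        model.add(Dense(nn, input_shape=(20,), activation='relu'))  # input size assumed
        for _ in range(nl - 1):
            model.add(Dense(nn, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))  # adjust the output layer to your problem
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    # Keys must match the argument names of create_model exactly.
    params = {'nl': [1, 2, 9], 'nn': [128, 256, 1000]}

    model = KerasClassifier(build_fn=create_model, epochs=6, batch_size=16, verbose=0)
    random_search = RandomizedSearchCV(model, params, cv=3, n_iter=5)
    random_search.fit(X, y)
    print(random_search.best_params_)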