Material For Student CAIEC™ (V062021A) EN
Who is CertiProf®?
CertiProf® is an examination institute founded in the United States in 2015 and located in Sunrise, Florida.
Our philosophy is based on creating knowledge in community, and for this purpose our collaborative network is made up of:
• CKAs (CertiProf Knowledge Ambassadors): influential people in their fields of expertise, such as coaches, trainers, consultants, bloggers, community builders, organizers, and evangelists, who are willing to contribute to improving content
• CLLs (CertiProf Lifelong Learners): certification candidates identified as continuous learners for their unwavering commitment to lifelong learning, which is vitally important in today's ever-changing and expanding digital world, regardless of whether they pass or fail the exam
• ATPs (Accredited Trainer Partners): universities, training centers, and facilitators around the world that make up the partner network
• Authors (co-creators): industry experts and practitioners who develop content for the creation of new certifications that respond to the needs of the industry
• Internal Staff: our distributed team, with operations in India, Brazil, Colombia, and the United States, that supports the day-to-day execution of CertiProf®'s purpose
Presentation
Welcome!
Please introduce yourself using the following format:
• Name
• Company
• Job title and experience
• Expectations of this course
Badge
https://www.credly.com/org/certiprof/badge/artificial-intelligence-expert-certificate-caiec
Lifelong Learning
Earning criteria:
• Be a candidate for CertiProf® certification
• Be a continuous and focused learner
• Identify with the concept of lifelong learning
• Believe and genuinely identify with the concept that knowledge and education can and should change the world
• Want to enhance your professional growth
Agenda
I. Deep Learning Fundamentals 7
I.1 Representing Neural Networks 8
I.2 Nonlinear Activation Functions 19
I.3 Hidden Layers 25
I.4 Guided Project: Building A Handwritten Digits Classifier 37
II. Machine Learning Project 41
II.1 Machine Learning Project Walkthrough: Data Cleaning 42
II.2 Machine Learning Project Walkthrough: Preparing the Features 50
II.3 Machine Learning Project Walkthrough: Making Predictions 57
Key Points 70
III. Kaggle Fundamentals 72
Kaggle Fundamentals 73
III.1 Getting Started with Kaggle 73
III.2 Feature Preparation, Selection and Engineering 89
III.3 Model Selection and Tuning 106
III.4 Guided Project: Creating a Kaggle Workflow 119
IV. TensorFlow Concepts 129
IV.1 Presentation of TensorFlow 130
IV.2 TensorFlow Basics 138
IV.3 Classification of Neural Network in TensorFlow 155
IV.4 Linear Regression in TensorFlow 160
V. Keras Basics 165
V. Keras Basics 166
V.1 Keras Layers 166
V.2 Deep Learning with Keras Implementation and Example 178
V.3 Keras vs TensorFlow – Difference Between Keras and TensorFlow 183
VI. References 186
I. Deep Learning Fundamentals
I.1 Representing Neural Networks
I.1.1 Nonlinear models
The inspiration for artificial neural networks (or "neural networks" for short) comes partially from biological
neural networks. The cells in most brains (including ours) connect and work together. We call each of these
cells in a neural network a neuron. Neurons in human brains communicate by exchanging electrical signals.
Neural network models draw inspiration from the structure of neurons in our brains — and the way they pass
messages. However, the similarities between biological neural networks and artificial neural networks end
here.
A deep neural network is a specific type of neural network that excels at capturing nonlinear relationships in
data. Deep neural networks have surpassed many benchmarks in audio and image classification. Previously,
linear models were often used with nonlinear transformations discovered meticulously by hand.
In this lesson, we'll explore deep neural networks. Here are a few takeaways you can expect by the end of
this lesson:
• How to represent neural networks visually
• How to implement linear and logistic regression as neural networks
• The differences between the nonlinear activation functions
To get the most out of this lesson, you'll need to know the NumPy, sklearn, and pandas libraries. You'll also need to be comfortable programming in Python. We'll rely on statistics, calculus, and linear algebra, and we'll assume you understand the traditional machine learning workflow as well as linear and logistic regression models.
I.1.2 Graphs
We usually represent neural networks as graphs. A graph is a data structure consisting of nodes connected by edges.
We commonly use graphs to represent relations or links between components of a system. For example,
the Facebook Social Graph describes the interconnection of all the users on Facebook (and this graph
changes constantly as users add and remove friends). Google Maps uses graphs to represent locations
in the physical world as nodes and roads as edges.
Graphs are a highly flexible data structure; you can even represent a list of values as a graph. We often
categorize graphs by their properties, which act as constraints. You can read about the many different
ways to categorize graphs on Wikipedia.
I.1.3 Computational Graphs
Graphs provide a mental model for thinking about a specific class of models — those that consist of
a series of functions executed in a specific order. In the context of neural networks, graphs help us
express the execution of a pipeline of functions in succession.
The second stage can't happen without the first stage because L1 is an input to the second stage. The
successive computation of functions is at the heart of neural network models. This is a computational
graph. A computational graph uses nodes to describe variables and edges to describe the combination
of variables. Here's a simple example:
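As a minimal illustration (our own sketch, not taken from the original material), here is a two-stage pipeline in Python in which the output of the first stage, L1, is the input to the second stage; the function names and values are purely illustrative:

```python
import numpy as np

# Stage 1: a linear combination of inputs and weights.
def stage_one(x, weights):
    return np.dot(x, weights)

# Stage 2: a nonlinear transformation applied to the first stage's output.
def stage_two(L1):
    return np.maximum(0, L1)

x = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -0.2, 0.1])

L1 = stage_one(x, weights)      # the second stage cannot run until L1 exists
output = stage_two(L1)
print(L1, output)
```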
The computational graph is a powerful representation because it allows us to represent models with
many layers of nesting. In fact, a decision tree is really a specific type of computational graph. There's
no compact way to express a decision tree model using only equations and standard algebraic notation.
To better understand this representation, we'll represent a linear regression model using neural network
notation. This will help you learn this unique representation, and it will allow us to explore some of the
neural network terminology.
I.1.4 Neural network vs linear regression
The first step is to rewrite this model using linear algebra notation, as a product of two vectors.
The neurons and arrows represent the weighted sum, which is the combination of the feature columns
and weights.
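Written out, the weighted sum this network computes can be sketched as follows, using x_1 through x_n for the feature columns, a_1 through a_n for their weights, and a_0 for the bias weight (the symbols are ours, chosen to match the description above):

$$\hat{y} = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_n x_n = \vec{a}^{\,T}\vec{x}$$

Here \(\vec{x}\) is extended with a constant 1 so that the bias weight a_0 folds into the single vector product.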
Inspired by biological neural networks, an activation function determines if the neuron fires or not. In
a neural network model, the activation function transforms the weighted sum of the input values. For
this network, the activation function is the identity function. The identity function returns the same value that was passed in: f(x) = x
While the activation function isn't interesting for a network that performs linear regression, it's useful
for logistic regression and more complex networks. Here's a comparison of both representations of
the same linear regression model:
We'll begin working with data that we'll generate ourselves instead of an external dataset. Generating data ourselves gives us more control over the properties of the dataset (e.g., the number of features, the number of observations, and the amount of noise in the features). The datasets we create will contain the kind of nonlinearity found in real-world datasets, which is where neural networks excel, so we can apply what we learn here.
• sklearn.datasets.make_classification()
• sklearn.datasets.make_moons()
The following code generates a regression data set with 3 features, 1000 observations, and a random
seed of 1:
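A sketch of that generation step, using scikit-learn's make_regression() function (the variable names are ours):

```python
from sklearn.datasets import make_regression

# Generate 1000 observations with 3 features and a fixed random seed.
data = make_regression(n_samples=1000, n_features=3, random_state=1)
features, labels = data[0], data[1]
print(features.shape, labels.shape)   # (1000, 3) (1000,)
```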
The function make_regression() returns a tuple of two NumPy objects.
The features are in the first NumPy array, and the labels are in the second NumPy array:
Instructions
Solutions
Because the inputs from one layer of neurons feed into the next layer of the single output neuron,
we call this network a feedforward network. In the language of graphs, a feedforward network is
a directed, acyclic graph.
Fitting A Network
Gradient descent is the most common technique for fitting neural network models. We'll rely on the
scikit-learn implementation of gradient descent in this lesson.
This implementation is in the SGDRegressor class. We use it the same way we do with
the LinearRegression class:
We now have everything we need to implement this network. Because we're focusing on building
intuition, we'll be training and testing on the same data set. In real-life scenarios, you always want to
use a cross validation technique of some kind.
Instructions
1. Add a column named bias containing the value 1 for each row to the features DataFrame
2. Import SGDRegressor from sklearn.linear_model
3. Define two functions:
• train(features, labels): takes in the features DataFrame and labels series and performs model
fitting
• Use the SGDRegressor class from scikit-learn to handle model fitting
• This function should return only a NumPy 1D array of weights for the linear regression model
• feedforward(features, weights): takes in the features DataFrame and the weights NumPy array
• Perform matrix multiplication between features (100 rows by 4 columns) and weights (4 rows by
1 column) and assign the result to predictions
• Return predictions. We'll skip implementing the identity function since it simply returns the
same value that was passed in
4. Uncomment the code we have added for you and run the train() and feedforward() functions. The
final predictions will be in linear_predictions
Solutions
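A minimal sketch of one possible solution, assuming a 100-row, 3-feature dataset as described in the instructions; the data generation, column names, and SGDRegressor settings are our assumptions rather than the original solution code:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Hypothetical setup: generated regression data wrapped in pandas objects.
data = make_regression(n_samples=100, n_features=3, random_state=1)
features = pd.DataFrame(data[0], columns=["x1", "x2", "x3"])
labels = pd.Series(data[1])

# Step 1: add a bias column of 1s so the intercept is learned as an ordinary weight.
features["bias"] = 1

def train(features, labels):
    # Fit a linear model with stochastic gradient descent; fit_intercept=False
    # because the bias column already plays that role.
    sgd = SGDRegressor(fit_intercept=False)
    sgd.fit(features, labels)
    return sgd.coef_                       # 1D NumPy array of weights

def feedforward(features, weights):
    # Weighted sum of inputs; the identity activation is implicit.
    predictions = np.dot(features, weights)
    return predictions

train_weights = train(features, labels)
linear_predictions = feedforward(features, train_weights)
```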
To generate a dataset friendly for classification, we can use the make_classification() function from
scikit-learn.
The following code generates a classification data set with 4 features, 1000 observations, and a
random seed of 1:
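A sketch of that generation step, using scikit-learn's make_classification() function (the variable names are ours):

```python
from sklearn.datasets import make_classification

# Generate 1000 observations with 4 features and a fixed random seed.
class_data = make_classification(n_samples=1000, n_features=4, random_state=1)
class_features, class_labels = class_data[0], class_data[1]
print(class_features.shape, class_labels.shape)   # (1000, 4) (1000,)
```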
Let's generate some classification data for the network we're building.
Instructions
Solutions
On the previous few screens, we replicated linear regression as a feedforward neural network model
and learned about nonlinear activation functions. We now have a better idea of what defines a neural
network. So far, we know that neural networks need the following:
• A network structure (How do the nodes connect? In which direction does the data and computation
flow?)
• A feedforward function (How are the node weights and observation values combined?)
• An activation function (What transformations are performed on the data?)
• A model fitting function (How does the model fit?)
Now, we'll explore how to build a neural network that replicates a logistic regression model. We'll start
with a quick recap.
In binary classification, we want to find a model that can differentiate between two categorical values
(usually 0 and 1). The values 0 and 1 don't have any numerical weight and instead act as numerical
placeholders for the two categories. We can try to learn the probability that a given observation
belongs in either category.
In the language of conditional probability, we're interested in the probability that a given
observation x belongs to each category:
P(y=0|x) = 0.3, P(y=1|x) = 0.7
Because the universe of possibilities only consists of these two categories, the probabilities for both
must add up to 1. This lets us simplify what we want a binary classification model to learn:
P(y=1|x)=?
If P(y=1|x)>0.5, we want the model to assign it to category 1.
If P(y=1|x)<0.5, we want the model to assign it to category 0.
Combining these two steps yields the following definition of a logistic regression model:
Neural networks literature usually refers to this function as the sigmoid function:
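As a sketch, the sigmoid function and the resulting logistic regression model can be written as follows (z stands for the weighted sum of the features and weights):

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y=1 \mid x) = \sigma\left(\vec{a}^{\,T}\vec{x}\right)$$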
Instructions
1. Add a column named bias containing the value 1 for each row to the class_features DataFrame.
2. Define three functions:
• log_train(class_features, class_labels): takes in the class_features DataFrame and class_labels series
and performs model fitting
• Use the SGDClassifier class from scikit-learn to handle model fitting
• This function should return a NumPy 2D array of weights for the logistic regression model
• sigmoid(linear_combination): takes in a NumPy 2D array and applies the sigmoid function for every value: 1 / (1 + e^(-x))
• log_feedforward(class_features, log_train_weights): takes in the class_features DataFrame and
the log_train_weights NumPy array
• Perform matrix multiplication between class_features (100 rows by 5 columns) and log_train_
weights (1 row by 5 columns) transposed, and assign to linear_combination
• Use the sigmoid() function to transform linear_combinations and assign the result to log_predictions
• Convert each value in log_predictions to a class label:
• If the value is greater than or equal to 0.5, overwrite the value to 1
• If the value is less than 0.5, overwrite the value to 0
• Return log_predictions
3. Uncomment the code we have added for you and run the log_train() and log_feedforward() functions.
The final predictions will be in log_predictions
Solutions
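A minimal sketch of one possible solution, assuming a 100-row, 4-feature dataset as described in the instructions; the data generation, column names, and SGDClassifier settings are our assumptions rather than the original solution code:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Hypothetical setup: generated classification data wrapped in pandas objects.
data = make_classification(n_samples=100, n_features=4, random_state=1)
class_features = pd.DataFrame(data[0], columns=["x1", "x2", "x3", "x4"])
class_labels = pd.Series(data[1])
class_features["bias"] = 1

def log_train(class_features, class_labels):
    # SGDClassifier with log loss behaves like logistic regression fit by
    # gradient descent (older scikit-learn versions call this loss "log").
    sgd = SGDClassifier(loss="log_loss", fit_intercept=False)
    sgd.fit(class_features, class_labels)
    return sgd.coef_                           # 2D array, shape (1, n_columns)

def sigmoid(linear_combination):
    return 1 / (1 + np.exp(-linear_combination))

def log_feedforward(class_features, log_train_weights):
    linear_combination = np.dot(class_features, log_train_weights.T)
    log_predictions = sigmoid(linear_combination)
    # Threshold the probabilities at 0.5 to produce class labels.
    log_predictions[log_predictions >= 0.5] = 1
    log_predictions[log_predictions < 0.5] = 0
    return log_predictions

log_train_weights = log_train(class_features, class_labels)
log_predictions = log_feedforward(class_features, log_train_weights)
```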
I.2 Nonlinear Activation Functions
In the last mission, we became familiar with computational graphs and how neural network models are
represented. We also became familiar with neural network terminology like:
• forward pass
• input neurons
• output neurons
In this mission, we'll dive deeper into the role nonlinear activation functions play. To help motivate our
exploration, let's start by reflecting on the purpose of a machine learning model.
The purpose of a machine learning model is to transform the training inputs (the features) in order to approximate the training output values. Let's recall how the models we already know accomplish this:
Linear Regression
We use linear regression when we think that the output values can be best approximated by a linear
combination of the features and the learned weights. This model is a linear system, because any change
in the output value is proportional to the changes in the input values.
When the target values y can be approximated by a linear combination of the features x1 to xn,
linear regression is the ideal choice. Here's a GIF that visualizes the potential expressivity of a linear regression model (by conceptually mimicking what gradient descent does):
Let's now look at a situation where the output values can't be approximated effectively using a linear
combination of the input values.
Logistic Regression
In a binary classification problem, the target values are 0 and 1 and the relationship between the
features and the target values is nonlinear. This means we need a function that can perform a nonlinear
transformation of the input features.
The sigmoid function is a good choice since it squashes any input value into the range between 0 and 1.
Adding the sigmoid transformation helps the model approximate this nonlinear relationship underlying
common binary classification tasks. The following GIF shows how the shape of the logistic regression
model changes as we increase the single weight (by conceptually mimicking what gradient descent
does):
Neural Networks
Logistic regression models learn a set of weights used in the linear combination phase, and the result is then fed through a single nonlinear function (the sigmoid function). In this mission, we'll dive into the most commonly used activation functions. The three most commonly used activation functions in neural networks are:
• the sigmoid function
• the ReLU function
• the tanh function
Since we've covered the sigmoid function already, we'll focus on the latter two functions.
I.2.1 ReLU Activation Function
We'll start by introducing the ReLU activation function, which is a commonly used activation function
in neural networks for solving regression problems. ReLU stands for rectified linear unit and is defined
as follows:
ReLU(x)=max(0,x)
The max(0,x) function call returns the maximum value between 0 and x. This means that:
• When x is less than 0, the value 0 is returned
• When x is greater than 0, the value x is returned
The ReLU function returns the positive component of the input value. Let's visualize the expressivity
of a model that performs a linear combination of the features and weights followed by the ReLU
transformation:
There are a few different ways we can implement the ReLU function in code. We'll leave it as an
exercise for you to implement.
Instructions
Solutions
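One straightforward sketch of a ReLU implementation with NumPy (other implementations are equally valid):

```python
import numpy as np

# ReLU(x) = max(0, x): negative inputs become 0, positive inputs pass through.
def relu(values):
    return np.maximum(0, values)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))   # [0.  0.  0.  0.5 2. ]
```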
The last commonly used activation function in neural networks we'll discuss is the tanh function (also
known as the hyperbolic tangent function). We'll start by reviewing some trigonometry by discussing
the tan (short for tangent) function and then work our way up to the tanh function (in the next screen).
While we won't provide the depth here needed to learn trigonometry from scratch, we do recommend
the Trigonometry Series on Khan Academy if you're new to trigonometry.
What is trigonometry?
Trigonometry is short for triangle geometry and provides formulas, frameworks, and mental models for
reasoning about triangles. Triangles are used extensively in theoretical and applied mathematics, and
build on mathematical work done over many centuries. Let's start by clearly defining what a triangle is.
A triangle is a polygon that has the following properties:
• 3 edges
• 3 vertices
• interior angles that add up to 180 degrees
There are two main ways to classify triangles: by their internal angles or by their edge lengths. The following diagram outlines the three different types of triangles by their edge length properties:
An important triangle classified by its internal angles is the right triangle. In a right triangle, one of the angles is 90 degrees (also known as the right angle). The edge opposite the right angle is called the hypotenuse.
Here's an example of the tangent function.
Instructions
Solutions
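A minimal sketch of how such a plot could be produced with NumPy and matplotlib (the range and number of samples are arbitrary choices of ours):

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the tangent function over a few periods and plot it.
x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)
plt.plot(x, np.tan(x))
plt.ylim(-10, 10)   # clip the vertical asymptotes so the overall shape is visible
plt.show()
```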
The tangent function from the last screen generated the following plot:
The periodic sharp spikes that you see in the plot are known as vertical asymptotes. At those points,
the value isn't defined but the limit approaches either negative or positive infinity (depending on
which direction you're approaching the x value from).
The key takeaway from the plot is how the tangent function is a repeating, periodic function. A periodic
function is one that returns the same value at regular intervals. Let's look at a table of some values
from the tangent function:
The tangent function repeats itself every π, which is known as the period. The tangent function isn't
known to be used as an activation function in neural networks (or any machine learning model really)
because the periodic nature isn't a pattern that's found in real datasets.
While there have been some experiments with periodic functions as the activation function for neural
networks, the general conclusion has been that periodic functions like tangent don't offer any unique
benefits for modeling.
Generally speaking, the activation functions that are used in neural networks are increasing functions.
An increasing function f is a function where f(x) always stays the same or increases as x increases.
All of the activation functions we've looked at (and will look at) in this mission meet this criterion.
I.3 Hidden Layers
We included both of the functions that are used to
compute each hidden neuron and output neuron
to help clear up any confusion. You'll notice that
the number of neurons in the second layer was
more than those in the input layer. Choosing the
number of neurons in this layer is a bit of an art
form and not quite a science yet in neural network
literature. We can actually add more intermediate
layers, and this often leads to improved model
accuracy (because of an increased capability in
learning nonlinearity).
The intermediate layers are known as hidden layers, because they aren't directly represented in the input data or the output predictions. Instead, we can think of each hidden layer as intermediate features that are learned during the training process.

Comparison with Decision Tree Models
This is actually very similar to how decision trees are structured. The branches and splits represent
some intermediate features that are useful for making predictions and are analogous to the hidden
layers in a neural network:
Each of these hidden layers has its own set of weights and biases, which are discovered during the
training process. In decision tree models, the intermediate features in the model represented something
more concrete we can understand (feature ranges).
Decision tree models are referred to as white box models because the entire model can be observed and understood. After we train a decision tree model, we can visualize the tree, interpret it, and
have new ideas for tweaking the model. Neural networks, on the other hand, are much closer to being
a black box. In a black box model, we can understand the inputs and the outputs but the intermediate
features are actually difficult to interpret and understand. Even harder and perhaps more importantly,
it's difficult to understand how to tweak a neural network based on these intermediate features.
In this mission, we'll learn how adding more layers to a network and adding more neurons in the
hidden layers can improve the model's ability to learn more complex relationships.
To generate data with nonlinearity in the features (both between the features and between the features
and the target column), we can use the make_moons() function from scikit-learn:
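A sketch of that call (the random seed is an arbitrary choice of ours):

```python
from sklearn.datasets import make_moons

# make_moons() returns a tuple: an array of observations and an array of labels.
data = make_moons(random_state=3)
print(data[0].shape, data[1].shape)   # (100, 2) (100,)
```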
By default, make_moons() will generate 100 rows of data with 2 features. Here's a plot that visualizes
one feature against the other:
To make things interesting, let's add some Gaussian noise to the data. Gaussian noise is a kind of statistical noise that follows the Gaussian distribution, and it's a common way to try to recreate the noise that's often found in real world data.

Just like in a previous mission, we can separate the resulting NumPy object into 2 pandas dataframes:
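A sketch of both steps, building on the make_moons() call above; the noise level, random seed, and column names are our assumptions:

```python
import pandas as pd
from sklearn.datasets import make_moons

# The noise parameter adds Gaussian noise to the generated points.
data = make_moons(n_samples=100, noise=0.1, random_state=3)

# Separate the returned NumPy arrays into two pandas objects.
features = pd.DataFrame(data[0], columns=["x1", "x2"])
labels = pd.Series(data[1])
```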
Instructions
• Generate a 3d scatter plot with the first column from features on the x-axis, the second column
from features on the y-axis and labels on the z-axis
• Set the labels 'x1', 'x2' and 'y', respectively
Solutions
I.3.2 Hidden Layer with single neuron
In the last mission, we learned how adding a nonlinear activation function expanded the range of
patterns that a model could try to learn. The following GIF demonstrates how adding the sigmoid
function enables a logistic regression model to capture nonlinearity more effectively:
We can think of a logistic regression model as a neural network with an activation function but no
hidden layers. To make predictions, a linear combination of the features and weights is performed
followed by a single sigmoid transformation.
To improve the expressive power, we can add a hidden layer of neurons in between the input layer and
the output layer. Here's an example where we've added a single hidden layer with a single neuron in
between the input layer and the output layer:
This network contains two sets of weights that are learned during the training phase:
In the next screen, we'll learn how to train a neural network with a hidden layer using scikit-learn. We'll
compare this model with a logistic regression model. Scikit-learn provides two classes for working with neural network models:
• MLPClassifier
• MLPRegressor
Let's focus on the MLPClassifier class. As with all of the model classes in scikit-learn, MLPClassifier follows
the standard model.fit() and model.predict() pattern:
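A minimal sketch of that pattern, reusing the noisy moons data from above; the train/test split is our addition for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

data = make_moons(n_samples=100, noise=0.1, random_state=3)
features_train, features_test, labels_train, labels_test = train_test_split(
    data[0], data[1], test_size=0.2, random_state=1
)

# The standard scikit-learn pattern: instantiate, fit, predict.
mlp = MLPClassifier()
mlp.fit(features_train, labels_train)
predictions = mlp.predict(features_test)
```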
We can specify the number of hidden neurons we want to use in each layer using the hidden_layer_sizes parameter. This parameter accepts a tuple where the index value corresponds to the number of
neurons in that hidden layer. The parameter is set to the tuple (100,) by default, which corresponds to
a hundred neurons in a single hidden layer. The following code specifies a hidden layer of six neurons:
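A sketch of that call; apart from hidden_layer_sizes, everything is left at its default:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer containing six neurons (one tuple entry per hidden layer).
mlp = MLPClassifier(hidden_layer_sizes=(6,))
```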
We can specify the activation function we want used in all layers using the activation parameter. This parameter accepts only the following string values: 'identity', 'logistic', 'tanh', and 'relu'.
While scikit-learn is friendly to use when learning new concepts, it has a few limitations when it comes
to working with neural networks in production.
• At the time of writing, scikit-learn only supports using the same activation function for all layers
• Scikit-learn also struggles to scale to larger datasets
• Libraries like Theano and TensorFlow support offloading some computation to the GPU to
overcome bottlenecks
Instructions
Solutions
In the last screen, we trained a logistic regression model and a neural network model with a hidden
layer containing a single neuron. While we don't recommend using the accuracy scores to benchmark
classification models in a production setting, they can be helpful when we're learning and experimenting
because they are easy to understand.
The logistic regression model performed much better (accuracy of 88%) compared to the neural
network model with one hidden layer and one neuron (48%). This network architecture doesn't
give the model much ability to capture nonlinearity in the data unfortunately, which is why logistic
regression performed much better.
Let's take a look at a network with a single hidden layer of multiple neurons:
This network has 3 input neurons, 6 neurons in the single hidden layer, and 1 output neuron. You'll
notice that there's an arrow between every input neuron and every hidden neuron (3 x 6 = 18
connections), representing a weight that needs to be learned during the training process. You'll notice
that there's also a weight that needs to be learned between every hidden neuron and the final output
neuron (6 x 1 = 6 connections).
Because every neuron has a connection between itself and all of the neurons in the next layer, this
is known as a fully connected network. Lastly, because the computation flows from left (input layer)
to right (hidden layer then to output layer), we can call this network a fully connected, feedforward
network.
There are two weight matrices (a1 and a2) that need to be learned during the training process, one for
each stage of the computation. Let's look at the linear algebra representation of this network.
While we've discussed different architectures in this course, a deep neural network boils down to a
series of matrix multiplications paired with nonlinear transformations! These are the key ideas that
underlie all neural network architectures. Take a look at this conceptual diagram from the Asimov
Institute that demonstrates a variety of neural network architectures:
Instructions
• Create the following list of neuron counts and assign to neurons: [1, 5, 10, 15, 20, 25]
• Create an empty list named accuracies
• For each value in neurons:
• Train a neural network:
• With the number of neurons in the hidden layer set to the current value
• Using the sigmoid activation function on the training set
• Make predictions on the test set and compute the accuracy value
• Append the accuracy value to accuracies
• Print accuracies
Solutions
Next, we can observe the effect of increasing the number of hidden layers on the overall accuracy of
the network. Here's a diagram representing a neural network with six neurons in the first hidden layer
and four neurons in the second hidden layer:
To determine the number of weights between two layers, multiply the number of neurons in one layer by the number of neurons in the next layer. Remember that these weights will be represented as weight matrices.
To specify the number of hidden layers and the number of neurons in each hidden layer, we change
the tuple we pass in to the hidden_layer_sizes parameter:
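A sketch of that change, matching the diagram above (all other settings left at their defaults):

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers: six neurons in the first, four in the second.
mlp = MLPClassifier(hidden_layer_sizes=(6, 4))
```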
The number of hidden layers and number of neurons in each hidden layer are hyperparameters that act
as knobs for the model behavior. Hyperparameter optimization for neural networks is unfortunately
outside the scope of this course, as it requires a stronger mathematical foundation which we plan to
provide in future courses.
Let's also switch the activation function used in the hidden layers to the ReLU function.
Neural networks often take a long time to converge during the training process, and many libraries have default values for the number of iterations of gradient descent to run. We can increase the number of iterations of gradient descent performed during training by modifying the max_iter parameter, which is set to 200 by default.
Instructions
• Create the following list of neuron counts and assign to neurons: [1, 5, 10, 15, 20, 25]
• Create an empty list named nn_accuracies
• For each value in neurons:
• Train a neural network:
• With two hidden layers, each containing the same number of neurons (the current value
in neurons)
• Using the relu activation function
• Using 1000 iterations of gradient descent
• On the training set
• Make predictions on the test set and compute the accuracy value
• Append the accuracy value to nn_accuracies
• Print nn_accuracies
Solutions
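A minimal sketch of one possible solution, reusing the noisy moons data and train/test split assumed earlier; the noise level, seed, and split are our assumptions rather than the original solution code:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

data = make_moons(n_samples=100, noise=0.1, random_state=3)
features_train, features_test, labels_train, labels_test = train_test_split(
    data[0], data[1], test_size=0.2, random_state=1
)

neurons = [1, 5, 10, 15, 20, 25]
nn_accuracies = []
for n in neurons:
    # Two hidden layers of n neurons each, ReLU activation, 1000 iterations.
    mlp = MLPClassifier(hidden_layer_sizes=(n, n), activation="relu", max_iter=1000)
    mlp.fit(features_train, labels_train)
    nn_predictions = mlp.predict(features_test)
    nn_accuracies.append(accuracy_score(labels_test, nn_predictions))
print(nn_accuracies)
```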
As we mentioned in the first mission in this course, deep neural networks have been used to reach state-
of-the-art performance on image classification tasks in the last decade. For some image classification
tasks, deep neural networks actually perform as well as or slightly better than the human benchmark.
You can read about the history of deep neural networks here.
To end this course, we'll build models that can classify handwritten digits. Before the year 2000,
institutions like the United States Post Office used handwriting recognition software to read addresses,
zip codes, and more. One of their approaches, which consists of pre-processing handwritten images and then feeding them to a neural network model, is detailed in this paper.
Within the field of machine learning and pattern recognition, image classification (especially for
handwritten text) is towards the difficult end of the spectrum. There are a few reasons for this.
First, each image in a training set is high dimensional. Each pixel in an image is a feature and a separate
column. This means that a 128 x 128 image has 16384 features.
Second, images are often downsampled to lower resolutions and transformed to grayscale (no color).
This is a limitation of compute power, unfortunately. An 8-megapixel photo has a resolution of 3264 by 2448 pixels, for a total of 7,990,272 features (or about 8 million). Images of this resolution are usually
scaled down to between 128 and 512 pixels in either direction for significantly faster processing. This
often results in a loss of detail that's available for training and pattern matching.
Third, the features in an image don't have an obvious linear or nonlinear relationship that can be
learned with a model like linear or logistic regression. In grayscale, each pixel is represented as a brightness value ranging from 0 to 255.
Here's an example of how an image is represented across the different abstractions we care about:
Why is deep learning effective in image classification?
In this Guided Project, we'll explore the effectiveness of deep, feedforward neural networks at
classifying images.
Scikit-learn contains a number of datasets pre-loaded with the library, within the sklearn.datasets namespace. The load_digits() function returns a copy of the hand-written digits dataset from UCI.
Because dataframes are a tabular representation of data, each image is represented as a row of pixel values. To visualize an image from the dataframe, we need to reshape these pixel values back into the image's original dimensions (8 x 8 pixels). To do this, we convert a training example to a NumPy array (excluding the label column) and pass the result into the numpy.reshape() function.
Now that the data is in the right shape, we can visualize it using the pyplot.imshow() function:
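A minimal sketch of these steps using scikit-learn's load_digits() and matplotlib (the colormap is an arbitrary choice of ours):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of digits.data is a flat vector of 64 pixel brightness values.
first_image = digits.data[0]
image_2d = np.reshape(first_image, (8, 8))   # back to the original 8 x 8 shape

plt.imshow(image_2d, cmap="gray_r")
plt.show()
```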
To display multiple images in one matplotlib figure, we can use the equivalent axes.imshow() method. Let's use what we've learned to display images of several different digits.
Instructions
II. Machine Learning Project
II.1 Machine Learning Project Walkthrough: Data Cleaning
In this course, we will go through the full data science life cycle, from data cleaning and feature
selection to machine learning. We will focus on credit modelling, a well-known data science problem
that focuses on modeling a borrower's credit risk. Credit has played a key role in the economy for
centuries and some form of credit has existed since the beginning of commerce. We'll be working
with financial lending data from Lending Club. Lending Club is a marketplace for personal loans that
matches borrowers who are seeking a loan with investors looking to lend money and make a return.
You can read more about their marketplace here.
Each borrower completes a comprehensive application, providing their past financial history, the
reason for the loan, and more. Lending Club evaluates each borrower's credit score using past historical
data and their own data science process to assign an interest rate to the borrower. The interest rate
is the percent in addition to the requested loan amount the borrower has to pay back. You can read
more about the interest rate that Lending Club assigns here. Lending Club also tries to verify all the
information the borrower provides but it can't verify all of the information (usually for regulation
reasons).
A higher interest rate means that the borrower is riskier and less likely to pay back the loan, while a lower interest rate means that the borrower has a good credit history and is more likely to pay it back. The interest rates range from 5.32% all the way to 30.99%, and each borrower is given
a grade according to the interest rate they were assigned. If the borrower accepts the interest rate,
then the loan is listed on the Lending Club marketplace.
Investors are primarily interested in receiving a return on their investments. Approved loans are
listed on the Lending Club website, where qualified investors can browse recently approved loans,
the borrower's credit score, the purpose for the loan, and other information from the application.
Once they're ready to back a loan, they select the amount of money they want to fund. Once a loan's requested amount is fully funded, the borrower receives the money they requested, minus the fees that Lending Club charges.
While Lending Club has to be extremely savvy and rigorous with their credit modelling, investors
on Lending Club need to be equally as savvy about determining which loans are more likely to be
paid off. At first, you may wonder why investors put money into anything but low interest loans. The
incentive investors have to back higher interest loans is, well, the higher interest! If investors believe
the borrower can pay back the loan, even if he or she has a weak financial history, then investors can
make more money through the larger additional amount the borrower has to pay.
Most investors use a portfolio strategy to invest small amounts in many loans, with healthy mixes of low, medium, and high interest loans. In this course, we'll focus on the mindset of a conservative investor
who only wants to invest in the loans that have a good chance of being paid off on time. To do that,
we'll need to first understand the features in the dataset and then experiment with building machine
learning models that reliably predict if a loan will be paid off or not.
You'll also find a data dictionary (in XLS format), which contains information on the different column names, towards the bottom of the page. We recommend downloading the data dictionary so you can refer to it whenever you want to learn more about what a column represents in the datasets.
Here's a link to the data dictionary file hosted on Google Drive.
Before diving into the datasets, let's get familiar with the data dictionary. The LoanStats sheet describes
the approved loans datasets and the RejectStats describes the rejected loans datasets. Since rejected
applications don't appear on the Lending Club marketplace and aren't available for investment, we'll
be focusing on approved loans.
Before we can start doing machine learning, we need to define what features we want to use and
which column represents the target column we want to predict. Let's start by reading and exploring
the dataset.
In this lesson, we'll focus on approved loans data from 2007 to 2011, since a good number of the loans
have already finished. In the datasets for later years, many of the loans are current and still being paid
off.
We can reduce the size of the dataset and make it easier to work with by doing the following:
• Remove the desc column, which contains a long text explanation for each loan
• Remove the url column, which contains a link to each loan on Lending Club that can only be accessed with an investor account
• Remove all columns containing more than 50% missing values, which allows us to move faster since we can spend less time trying to fill those values
First, let's read the dataset into a Dataframe so we can start to explore the data and remaining features.
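A minimal sketch of that step, assuming loans_2007.csv sits in the working directory:

```python
import pandas as pd

loans_2007 = pd.read_csv("loans_2007.csv")
print(loans_2007.iloc[0])       # first row of the DataFrame
print(loans_2007.shape[1])      # number of columns
```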
Instructions
• Read loans_2007.csv into a DataFrame named loans_2007
• Use the print function to:
• Display the first row of loans_2007
• Display the number of columns in loans_2007
The Dataframe contains many columns and can be cumbersome to try to explore all at once. Let's
separate the columns into 3 groups of 18 columns and use the data dictionary to become familiar with
what each column represents. As you understand each feature, look for any features that:
• Disclose information from the future (after the loan has already been funded)
• Don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending
Club)
• Need to be cleaned up and are formatted poorly
We need to especially pay attention to data leakage, since it can cause our model to overfit. This is
because the model uses data about the target column that wouldn't be available when we're using the
model on future loans. We encourage you to take your time to understand each column, because a
poor understanding could cause you to make mistakes in the data analysis and modeling process. As
you go through the dictionary, keep in mind that we need to select one of the columns as the target
column we want to use for the machine learning phase.
In this screen and the next few screens, let's focus on just columns that we need to remove from
consideration. Then, we can circle back and further dissect the columns we decided to keep.
To make this process easier, we created a table that contains the name, data type, first row's value, and description from the data dictionary for the first 18 columns.
Name dtype First Value Description
id object 1077501 A unique LC assigned ID for the loan listing
member_id float64 1.2966e+06 A unique LC assigned ID for the borrower member
loan_amnt float64 5000 The listed amount of the loan applied for by the borrower
funded_amnt float64 5000 The total amount committed to that loan at that point in time
funded_amnt_inv float64 4975.0 The total amount committed by investors for that loan at that point in time
term object 36 months The number of payments on the loan. Values are in months and can be either 36 or 60
int_rate object 10.65% Interest rate on the loan
installment float64 162.87 The monthly payment owed by the borrower if the loan originates
grade object B LC assigned loan grade
sub_grade object B2 LC assigned loan subgrade
emp_title object NaN The job title supplied by the borrower when applying for the loan
emp_length object 10+ years Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years
home_ownership object RENT The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER
annual_inc float64 24000 The self-reported annual income provided by the borrower during registration
verification_status object Verified Indicates if income was verified by LC, not verified, or if the income source was verified
issue_d object Dec-2011 The month in which the loan was funded
loan_status object Charged Off Current status of the loan
pymnt_plan object n Indicates if a payment plan has been put in place for the loan
purpose object car A category provided by the borrower for the loan request
After analyzing each column, we can conclude that the following features need to be removed:
• id: randomly generated field by Lending Club for unique identification purposes only
• member_id: also a randomly generated field by Lending Club for unique identification purposes only
• funded_amnt: leaks data from the future (after the loan is already started to be funded)
• funded_amnt_inv: also leaks data from the future (after the loan is already started to be funded)
• grade: contains information redundant with the interest rate column (int_rate)
• sub_grade: also contains information redundant with the interest rate column (int_rate)
• emp_title: requires other data and a lot of processing to potentially be useful
• issue_d: leaks data from the future (after the loan is already completely funded)
Let's now drop these columns from the Dataframe before moving onto the next group of columns.
Use the Dataframe method drop to remove the following columns from the loans_2007 Dataframe:
• id
• member_id
• funded_amnt
• funded_amnt_inv
• grade
• sub_grade
• emp_title
• issue_d
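A sketch of that drop, continuing with the loans_2007 DataFrame read in earlier:

```python
# Columns that are identifiers, leak future information, or duplicate int_rate.
cols_to_drop = ["id", "member_id", "funded_amnt", "funded_amnt_inv",
                "grade", "sub_grade", "emp_title", "issue_d"]
loans_2007 = loans_2007.drop(cols_to_drop, axis=1)
```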
II.2.4 Second group of columns
Let's now look at the next 18 columns:
Name dtype First Value Description
title object Computer The loan title provided by the borrower
zip_code object 860xx The first 3 numbers of the zip code provided by the borrower in the loan application
addr_state object AZ The state provided by the borrower in the loan application
dti float64 27.65 A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income
delinq_2yrs float64 0 The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
earliest_cr_line object Jan-1985 The month the borrower's earliest reported credit line was opened
inq_last_6mths float64 1 The number of inquiries in the past 6 months (excluding auto and mortgage inquiries)
open_acc float64 3 The number of open credit lines in the borrower's credit file
pub_rec float64 0 Number of derogatory public records
revol_bal float64 13648 Total credit revolving balance
revol_util object 83.7% Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit
total_acc float64 9 The total number of credit lines currently in the borrower's credit file
initial_list_status object f The initial listing status of the loan. Possible values are W, F
out_prncp float64 0 Remaining outstanding principal for total amount funded
out_prncp_inv float64 0 Remaining outstanding principal for portion of total amount funded by investors
total_pymnt float64 5863.16 Payments received to date for total amount funded
total_pymnt_inv float64 5833.84 Payments received to date for portion of total amount funded by investors
total_rec_prncp float64 5000 Principal received to date
Within this group, we need to remove the following columns:
• zip_code: redundant with the addr_state column, since only the first 3 digits of the 5-digit zip code are visible (which can only be used to identify the state the borrower lives in)
• out_prncp: leaks data from the future (after the loan has already started to be paid off)
• out_prncp_inv: also leaks data from the future (after the loan has already started to be paid off)
• total_pymnt: also leaks data from the future (after the loan has already started to be paid off)
• total_pymnt_inv: also leaks data from the future (after the loan has already started to be paid off)
• total_rec_prncp: also leaks data from the future (after the loan has already started to be paid off)
The out_prncp and out_prncp_inv both describe the outstanding principal amount for a loan, which
is the remaining amount the borrower still owes. These 2 columns as well as the total_pymnt column
describe properties of the loan after it's fully funded and started to be paid off. This information isn't
available to an investor before the loan is fully funded and we don't want to include it in our model.
Use the Dataframe method drop to remove the following columns from the loans_2007 Dataframe:
• zip_code
• out_prncp
• out_prncp_inv
• total_pymnt
• total_pymnt_inv
• total_rec_prncp
II.2.5 Third group of columns
Let's now move on to the last group of features:
Name dtype First Value Description
total_rec_int float64 863.16 Interest received to date
total_rec_late_fee float64 0 Late fees received to date
recoveries float64 0 Post-charge-off gross recovery
collection_recovery_fee float64 0 Post-charge-off collection fee
last_pymnt_d object Jan-2015 Last month payment was received
last_pymnt_amnt float64 171.62 Last total payment amount received
last_credit_pull_d object Jun-2016 The most recent month LC pulled credit for this loan
• total_rec_int: leaks data from the future (after the loan has started to be paid off)
• total_rec_late_fee: leaks data from the future (after the loan has started to be paid off)
• recoveries: leaks data from the future (after the loan has started to be paid off)
• collection_recovery_fee: leaks data from the future (after the loan has started to be paid off)
• last_pymnt_d: leaks data from the future (after the loan has started to be paid off)
• last_pymnt_amnt: leaks data from the future (after the loan has started to be paid off)
Instructions
Use the Dataframe method drop to remove the following columns from the loans_2007 Dataframe:
• total_rec_int
• total_rec_late_fee
• recoveries
• collection_recovery_fee
• last_pymnt_d
• last_pymnt_amnt
Use the print function to display the first row of loans_2007 and the number of columns in loans_2007.
By becoming familiar with the columns in the dataset, we were able to reduce the number of columns
from 52 to 32 columns. We now need to decide on a target column that we want to use for modeling.
We should use the loan_status column, since it's the only column that directly describes whether a loan was paid off on time, had delayed payments, or was defaulted on by the borrower. Currently, this column contains text values, and we need to convert it to a numerical column for training a model. Let's explore the different values in this column and come up with a strategy for converting them.
Instructions
• Use the Series method value_counts to return the frequency of the unique values in the loan_status column
• Display the frequency of each unique value using the print function
There are 8 different possible values for the loan_status column. You can read about most of the
different loan statuses on the Lending Club website. The two values that start with "Does not meet
the credit policy" aren't explained unfortunately. A quick Google search takes us to explanations from
the lending community here.
We've compiled the explanation for each status, as well as its count in the Dataframe, in the following table:
Status Count Meaning
Charged Off 5634 Loan for which there is no longer a reasonable expectation of further payments
Default 3 Loan is defaulted on and no payment has been made for more than 121 days
From the investor's perspective, we're interested in trying to predict whether loans will be paid off on
time. Only the Fully Paid and Charged Off values describe the final outcome of the loan. The other
values describe loans that are still ongoing and where the jury is still out on if the borrower will pay
back the loan on time or not. While the Default status resembles the Charged Off status, in Lending
Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones
have a small chance.
Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat
the problem as a binary classification one. Let's remove all the loans that don't contain either Fully Paid or Charged Off as the loan's status. After removing those rows, we'll transform the Fully Paid values to 1 for the positive case and the Charged Off values to 0 for the negative case.
While there are a few different ways to transform all of the values in a column, we'll use the Dataframe
method replace. According to the documentation, we can pass the replace method a nested mapping
dictionary in the following format:
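A sketch of that format, continuing with the loans_2007 DataFrame from the previous steps; the row filter mirrors the step described in the surrounding text:

```python
# Keep only the rows whose final outcome is known.
loans_2007 = loans_2007[(loans_2007["loan_status"] == "Fully Paid") |
                        (loans_2007["loan_status"] == "Charged Off")]

# Nested mapping: {column name: {old value: new value}}.
status_mapping = {
    "loan_status": {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}
loans_2007 = loans_2007.replace(status_mapping)
```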
Lastly, one thing we need to keep in mind is the class imbalance between the positive and negative
cases. While there are 33,136 loans that have been fully paid off, there are only 5,634 that were
charged off. This class imbalance is a common problem in binary classification and during training,
the model ends up having a strong bias towards predicting the class with more observations in the training set and will rarely predict the class with fewer observations. The stronger the imbalance, the
more biased the model becomes. There are a few different ways to tackle this class imbalance, which
we'll explore later.
• Remove all rows from loans_2007 that contain values other than Fully Paid or Charged Off for
the loan_status column
• Use the Dataframe method replace to replace:
• Fully Paid with 1
• Charged Off with 0
To wrap up this lesson, let's look for any columns that contain only one unique value and remove
them. These columns won't be useful for the model since they don't add any information to each loan
application. In addition, removing these columns will reduce the number of columns we'll need to
explore in the future.
We'll need to compute the number of unique values in each column and drop the columns that contain
only one unique value. While the Series method unique returns the unique values in a column, it also
counts the Pandas missing value object nan as a value:
Since we're trying to find columns that contain one true unique value, we should first drop the null
values then compute the number of unique values:
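A sketch of that approach, continuing with the loans_2007 DataFrame from the previous steps:

```python
# Collect the columns whose non-null values contain only a single unique value.
drop_columns = []
for col in loans_2007.columns:
    non_null_unique = loans_2007[col].dropna().unique()
    if len(non_null_unique) == 1:
        drop_columns.append(col)

loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
```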
Instructions
• Remove any columns from loans_2007 that contain only one unique value:
• Create an empty list, drop_columns to keep track of which columns you want to drop
• For each column:
• Use the Series method dropna to remove any null values and then use the Series
method unique to return the set of non-null unique values
• Use the len() function to return the number of values in that set
• Append the column to drop_columns if it contains only 1 unique value
• Use the Dataframe method drop to remove the columns in drop_columns from loans_2007
• Use the print function to display drop_columns so we know which ones were removed
II.2 Machine Learning Project Walkthrough: Preparing the Features
In the last lesson, we removed all of the columns that contained redundant information, weren't useful for modeling, required too much processing to make useful, or leaked information from the future. We then exported the Dataframe to a CSV file named filtered_loans_2007.csv to differentiate it from the original loans_2007.csv. In this lesson, we'll prepare the data for machine learning by focusing on handling missing values, converting categorical columns to numeric columns, and removing any other extraneous columns we encounter throughout this process.
The mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To enforce this requirement, scikit-learn will return an error if you try to train a model using data that contains missing values or non-numeric values when working with models like linear regression and logistic regression.
Let's start by computing the number of missing values and come up with a strategy for handling them.
Then, we'll focus on the categorical columns.
We can return the number of missing values across the Dataframe by:
• First using the Pandas Dataframe method isnull to return a Dataframe containing Boolean values:
• True if the original value is null
• False if the original value isn't null
• Then using the Pandas Dataframe method sum to calculate the number of null values in each column
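A minimal sketch of these two steps, assuming the filtered_loans_2007.csv file exported at the end of the previous lesson:

```python
import pandas as pd

loans = pd.read_csv("filtered_loans_2007.csv")

# Count the null values in each column and show only the columns that have any.
null_counts = loans.isnull().sum()
print(null_counts[null_counts > 0])
```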
Instructions
Solutions
In the previous screen we got a series displaying how many missing values each column with missing
values has:
Domain knowledge tells us that employment length is frequently used in assessing how risky a potential
borrower is, so we'll keep this column despite its relatively large number of missing values.
Let's inspect the values of the pub_rec_bankruptcies column.
We see that this column offers very little variability: nearly 94% of the values are in the same category, so it probably won't have much predictive value. Let's drop it. In addition, we'll remove the remaining rows containing null values.
This means that we'll keep the following columns and just remove rows containing missing values for
them:
• emp_length
• title
• revol_util
• last_credit_pull_d
Let's use the strategy of removing the pub_rec_bankruptcies column first, then removing all rows containing any missing values. This way, we only remove the rows containing missing values for the emp_length, title, and revol_util columns, but we don't lose the rows that are only missing a pub_rec_bankruptcies value.
Instructions
• Use the drop method to remove the pub_rec_bankruptcies column from loans
• Use the dropna method to remove all rows from loans containing any missing values
• Use the dtypes attribute followed by the value_counts() method to return the counts for each
column data type. Use the print function to display these counts
Solutions
II.2.3 Text columns
While the numerical columns can be used natively with scikit-learn, the object columns that contain
text need to be converted to numerical data types. Let's return a new dataframe containing just the
object columns so we can explore them in more depth. You can use the Dataframe method select_dtypes to select only the columns of a certain data type:
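A sketch of that call, continuing with the loans DataFrame from the previous steps:

```python
# Keep only the text (object) columns so we can inspect them separately.
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])
```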
Let's select just the object columns then display a sample row for a better sense of how the values in
each column are formatted.
Instructions
• Use the dataframe method select_dtypes to select only the columns of object type from loans and
assign the resulting Dataframe to object_columns_df
• Display the first row in object_columns_df using the print function
Solutions
Some of the columns seem like they represent categorical values, but we should confirm by checking
the number of unique values in those columns:
• home_ownership: home ownership status, can only be 1 of 4 categorical values according to the
data dictionary
• verification_status: indicates if income was verified by Lending Club
There are also two columns that represent numeric values and need to be converted:
• int_rate: interest rate of the loan in %
• revol_util: revolving line utilization rate or the amount of credit the borrower is using relative to all
available credit, read more here
Based on the first row's values for purpose and title, it seems like these columns could reflect the
same information. Let's explore the unique value counts separately to confirm if this is true.
Lastly, some of the columns contain date values that would require a good amount of feature
engineering for them to be potentially useful:
• earliest_cr_line: The month the borrower's earliest reported credit line was opened
• last_credit_pull_d: The most recent month Lending Club pulled credit for this loan
Since these date features require some feature engineering for modeling purposes, let's remove these
date columns from the dataframe.
Let's explore the unique value counts of the columns that seem like they contain categorical values.
Instructions
• Display the unique value counts for the following columns: home_ownership, verification_status,
emp_length, term, and addr_state columns:
• Store these column names in a list named cols
• Use a for loop to iterate over cols:
• Use the print function combined with the Series method value_counts to display each column's
unique value counts
Solutions
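One possible version of this loop (a sketch, assuming loans is the dataframe from the previous steps):

cols = ["home_ownership", "verification_status", "emp_length", "term", "addr_state"]
for col in cols:
    # Display the unique value counts for each candidate categorical column
    print(loans[col].value_counts())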
The home_ownership, verification_status, emp_length, term, and addr_state columns all contain
multiple discrete values. We should clean the emp_length column and treat it as a numerical one since
the values have ordering (2 years of employment is less than 8 years).
First, let's look at the unique value counts for the purpose and title columns to understand which
column we want to keep.
Instructions
Use the value_counts method and the print function to display the unique values in the following
columns:
• title
• purpose
54
Solutions
The home_ownership, verification_status, emp_length, and term columns each contain a few discrete
categorical values. We should encode these columns as dummy variables and keep them.
It seems like the purpose and title columns do contain overlapping information, but we'll keep
the purpose column since it contains a few discrete values. In addition, the title column has data quality
issues since many of the values are repeated with slight modifications (e.g. Debt Consolidation and Debt
Consolidation Loan and debt consolidation).
Since emp_length has a natural ordering, we'll convert its values to numbers using the following mapping:
• "10+ years": 10
• "9 years": 9
• "8 years": 8
• "7 years": 7
• "6 years": 6
• "5 years": 5
• "4 years": 4
• "3 years": 3
• "2 years": 2
• "1 year": 1
• "< 1 year": 0
• "n/a": 0
Lastly, the addr_state column contains many discrete values, and we'd need to add 49 dummy variable
columns to use it for classification. This would make our dataframe much larger and could slow down
how quickly the code runs. Let's remove this column from consideration.
55
Instructions
• Remove the last_credit_pull_d, addr_state, title, and earliest_cr_line columns from loans
• Convert the int_rate and revol_util columns to float columns by:
• Using the str accessor followed by the rstrip string method to strip the right trailing percent sign
(%):
• loans['int_rate'].str.rstrip('%') returns a new Series with % stripped from the right side of each
value
• On the resulting Series object, use the astype method to convert to the float type
• Assign the new Series of float values back to the respective columns in the Dataframe
• Use the replace method to clean the emp_length column
Solutions
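A sketch of these steps, using the emp_length mapping listed above (the column names are the ones used throughout this walkthrough):

mapping_dict = {
    "emp_length": {
        "10+ years": 10, "9 years": 9, "8 years": 8, "7 years": 7,
        "6 years": 6, "5 years": 5, "4 years": 4, "3 years": 3,
        "2 years": 2, "1 year": 1, "< 1 year": 0, "n/a": 0
    }
}
# Drop the date columns and the high-cardinality text columns
loans = loans.drop(["last_credit_pull_d", "addr_state", "title", "earliest_cr_line"], axis=1)
# Strip the trailing % sign and convert the columns to floats
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
# Map the employment length strings to numbers
loans = loans.replace(mapping_dict)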
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
56
We can then use the concat method to add these dummy columns back to the original Dataframe:
And then drop the original columns entirely using the drop method:
Instructions
• Encode the home_ownership, verification_status, purpose, and term columns as integer values:
• Use the get_dummies function to return a Dataframe containing the dummy columns
• Use the concat method to add these dummy columns back to loans
• Remove the original, non-dummy columns (home_ownership, verification_status, purpose,
and term) from loans
Solutions
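A sketch of the dummy encoding described above:

import pandas as pd

cat_columns = ["home_ownership", "verification_status", "purpose", "term"]
# Create one 0/1 column per category, add them to loans, then drop the originals
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)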
As we prepared the data, we removed columns that had data leakage issues, contained redundant
information, or required additional processing to turn into useful features. We cleaned features that
had formatting issues and converted categorical columns to dummy variables.
In the last lesson, we noticed that there's a class imbalance in our target column, loan_status. There
are about 6 times as many loans that were paid off on time (positive case, label of 1) as those that
weren't (negative case, label of 0). Imbalances can cause issues with many machine learning algorithms,
where they appear to have high accuracy, but actually aren't learning from the training data. Due to its
potential to cause issues, we need to keep the class imbalance in mind as we build machine learning
models.
After all of our data cleaning, we ended up with the csv file called clean_loans_2007.csv. Let's read this
file into a dataframe and view a summary of the work we did.
57
Instructions
Solutions
Before we dive into predicting loan_status with machine learning, let's go back to our first steps when
we started cleaning the Lending Club dataset. You may recall the original question we wanted to
answer:
• Can we build a machine learning model that can accurately predict if a borrower will pay off their
loan on time or not?
We established that this is a binary classification problem and we converted the loan_status column
to 0s and 1s as a result. Before diving in and selecting an algorithm to apply to the data, we should
select an error metric.
An error metric will help us figure out when our model is performing well, and when it's performing
poorly. To tie error metrics all the way back to the original question we wanted to answer, let's say
we're using a machine learning model to predict whether or not we should fund a loan on the Lending
Club platform. Our objective in this is to make money -- we want to fund enough loans that are paid
off on time to offset our losses from loans that aren't paid off. An error metric will help us determine if
our algorithm will make us money or lose us money.
In this case, we're primarily concerned with false positives and false negatives. Both of these are
different types of misclassifications. With a false positive, we predict that a loan will be paid off on
time, but it actually isn't. This costs us money, since we fund loans that lose us money. With a false
negative, we predict that a loan won't be paid off on time, but it actually would be paid off on time.
This loses us potential money, since we didn't fund a loan that actually would have been paid off.
58
Here's a diagram to simplify the concepts:
Let's calculate false positives and true positives in Python. We can use multiple conditionals, separated
by a & to select items in a NumPy array that meet certain conditions. For instance, if we had an array
called predictions, we could select items in predictions that equal 1 and where items in loans["loan_
status"] in the same position also equal 1 using this:
The above code will give us all the items in predictions that are true positives -- where we predicted
that the loan would be paid off on time, and it was actually paid off on time. By using the len function
to find the number of items, we can find the number of true positives.
We've generated some predictions automatically and they are stored in a NumPy array called predictions.
59
Instructions
Solutions
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
60
In the above diagram, our predictions are 85.7%
accurate -- we've correctly identified loan_status
in 85.7% of cases. However, we've done this by
predicting 1 for every row. What this means is
that we'll actually lose money. Let's say we loan
out 1000 dollars on average to each borrower.
Each borrower pays us 10% interest back. We
will make a projected profit of 100 dollars on
each loan. In the above diagram, we'd actually
lose money:
As you can see, we made 600 dollars in interest from the borrowers that paid us back, but we
lost 1000 dollars on the one borrower who never paid us back, so we actually ended up losing 400 dollars
overall, even though our model is technically accurate.
This is why it's important to always be aware of imbalanced classes in machine learning models, and
to adjust your error metric accordingly. In this case, we don't want to use accuracy and should instead
use metrics that tell us the number of false positives and false negatives.
We can calculate false positive rate and true positive rate, using the numbers of true positives, true
negatives, false negatives, and false positives.
True positive rate is the number of true positives divided by the number of true positives plus the
number of false negatives. This divides the loans we correctly predicted would be paid off on time by
all the loans that actually were paid off on time:
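In code, the two rates could be computed like this (a sketch, assuming we have already counted the true positives tp, false positives fp, true negatives tn and false negatives fn):

# True positive rate: loans we'd fund that actually were paid off, out of all loans that were paid off
tpr = tp / (tp + fn)
# False positive rate: loans we'd fund that weren't paid off, out of all loans that weren't paid off
fpr = fp / (fp + tn)
print(tpr, fpr)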
61
Simple English ways to think of each term are:
• False Positive Rate: "the percentage of the loans that shouldn't be funded that I would fund".
• True Positive Rate: "the percentage of loans that should be funded that I would fund".
Generally, if we reduce false positive rate, true positive rate will also go down. This is because if we
want to reduce the risk of false positives, we wouldn't think about funding riskier loans in the first
place.
Instructions
62
II.3.4 Logistic Regression
In the last screen, you may have noticed that both fpr and tpr were 1. This is because we predicted 1 for
each row. This means that we correctly identified all of the good loans (true positive rate), but we also
incorrectly identified all of the bad loans (false positive rate). Now that we've set up error metrics, we
can move on to making predictions using a machine learning algorithm.
As we saw in the first screen of the mission, our cleaned dataset contains 41 columns, all of which are
either the int64 or the float64 data type. There aren't any null values in any of the columns. This means
that we can now apply any machine learning algorithm to our dataset. Most algorithms can't deal with
non-numeric or missing values, which is why we had to do so much data cleaning.
In order to fit the machine learning models, we'll use the Scikit-learn library. Although we've built our
own implementations of algorithms in earlier missions, it's easier and faster to use algorithms that
someone else has already written and tuned for high performance.
A good first algorithm to apply to binary classification problems is logistic regression, for the following
reasons:
Instructions
• Create a dataframe named features that contains just the feature columns
• Remove the loan_status column
• Create a Series named target that contains just the target column (loan_status)
• Use the fit method of lr to fit a logistic regression to features and target
Solutions
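A sketch of what the solution might look like (assuming loans is the cleaned dataframe and scikit-learn is installed):

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
# Features: every column except the target
features = loans.drop("loan_status", axis=1)
target = loans["loan_status"]
lr.fit(features, target)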
63
II.3.6 Cross Validation
While we generated predictions in the last screen, those predictions were overfit. They were overfit
because we generated predictions using the same data that we trained our model on. When we use
this to evaluate an error, we get an unrealistically high depiction of how accurate the algorithm is,
because it already "knows" the correct answers. This is like asking someone to memorize a bunch of
physics equations, then asking them to plug numbers into the equations. They can tell you the right
answer, but they can't explain a concept that they haven't already memorized an equation for.
In order to get a realistic depiction of the accuracy of the model, let's perform k-fold cross validation.
We can use the cross_val_predict() function from the sklearn.model_selection package. Here's what
the workflow looks like:
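A sketch of that workflow, reusing the features and target objects from the previous step (the number of folds here is illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

lr = LogisticRegression()
# Generate a cross validated prediction for every row
predictions = cross_val_predict(lr, features, target, cv=3)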
Once we have cross validated predictions, we can compute true positive rate and false positive rate.
Instructions
64
Solutions
As you can see from the last screen, our fpr and tpr are around what we'd expect if the model was
predicting all ones. We can look at the first few rows of predictions to confirm:
Unfortunately, even though we're not using accuracy as an error metric, the classifier is, and it isn't
accounting for the imbalance in the classes. There are a few ways to get a classifier to correct for
imbalanced classes. The two main ways are:
65
• Use oversampling and undersampling to ensure that the classifier gets input that has a balanced
number of each class
• Tell the classifier to penalize misclassifications of the less prevalent class more than the other class
We'll look into oversampling and undersampling first. They involve taking a sample that contains equal
numbers of rows where loan_status is 0, and where loan_status is 1. This way, the classifier is forced
to make actual predictions, since predicting all 1s or all 0s will only result in 50% accuracy at most.
The downside of this technique is that since it has to preserve an equal ratio, you have to either:
• Throw out many rows of data. If we wanted equal numbers of rows where loan_status is 0 and
where loan_status is 1, one way we could do that is to delete rows where loan_status is 1
• Copy rows multiple times. One way to equalize the 0s and 1s is to copy rows where loan_status is 0
• Generate fake data. One way to equalize the 0s and 1s is to generate new rows where loan_status is 0
Unfortunately, none of these techniques are easy. The second method we mentioned earlier, telling
the classifier to penalize certain rows more, is much easier to implement using scikit-learn.
We can do this by setting the class_weight parameter to balanced when creating the LogisticRegression
instance. This tells scikit-learn to penalize the misclassification of the minority class during the training
process. The penalty means that the logistic regression classifier pays more attention to correctly
classifying rows where loan_status is 0. This lowers accuracy when loan_status is 1, but increases
accuracy when loan_status is 0.
By setting the class_weight parameter to balanced, the penalty is set to be inversely proportional
to the class frequencies. You can read more about the parameter here. This would mean that for the
classifier, correctly classifying a row where loan_status is 0 is 6 times more important than correctly
classifying a row where loan_status is 1.
We can repeat the cross validation procedure we performed in the last screen, but with the class_weight parameter set to "balanced".
66
Instructions
Solutions
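A sketch of the balanced version (same workflow as before, only the class_weight argument changes; the fold count is again illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Penalize mistakes on the minority class (loan_status of 0) more heavily
lr = LogisticRegression(class_weight="balanced")
predictions = cross_val_predict(lr, features, target, cv=3)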
67
II.3.8 Manual Penalties
We significantly improved the false positive rate in the last screen by balancing the classes, which
reduced true positive rate. Our true positive rate is now around 66%, and our false positive rate is
around 39%. From a conservative investor's standpoint, it's reassuring that the false positive rate is
lower, because it means that we'll be able to do a better job at avoiding bad loans than if we funded
everything. However, we'd only decide to fund 66% of the total loans (true positive rate), so we'd
immediately reject a good amount of loans.
We can try to lower the false positive rate further by assigning a harsher penalty for misclassifying the
negative class. While setting class_weight to balanced will automatically set a penalty based on the
number of 1s and 0s in the column, we can also set a manual penalty. In the last screen, the penalty
scikit-learn imposed for misclassifying a 0 would have been around 5.89 (since there are 5.89 times as
many 1s as 0s).
We can also specify a penalty manually if we want to adjust the rates more. To do this, we need to pass
in a dictionary of penalty values to the class_weight parameter:
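A sketch of such a dictionary, using the penalty values described in the next sentence (with LogisticRegression imported as before):

# Penalize misclassifying a 0 ten times more than misclassifying a 1
penalty = {
    0: 10,
    1: 1
}
lr = LogisticRegression(class_weight=penalty)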
The above dictionary will impose a penalty of 10 for misclassifying a 0 and a penalty of 1 for
misclassifying a 1.
Instructions
Modify the code from the last screen to change the class_weight parameter from the string "balanced" to
the dictionary:
Remember to print out the fpr and tpr values at the end!
68
II.3.9 Random Forests
It looks like assigning manual penalties lowered the false positive rate to 9%, and thus lowered our
risk. Note that this comes at the expense of true positive rate. While we have fewer false positives,
we're also missing opportunities to fund more loans and potentially make more money. Given that
we're approaching this as a conservative investor, this strategy makes sense, but it's worth keeping in
mind the tradeoffs.
While we could tweak the penalties further, it's better to try a different model now, since it may offer
larger gains in false positive rate. We can always loop back and iterate on the penalties more later.
Let's try a more complex algorithm, random forest. We learned about random forests in a previous
mission and constructed our own model. Random forests are able to work with nonlinear data and learn
complex conditionals. Logistic regressions are only able to work with linear data. Training a random
forest algorithm may enable more accuracy due to columns that correlate nonlinearly with loan_status.
Instructions
• Modify the code from the last screen, and swap out the LogisticRegression for
a RandomForestClassifier model.
• Set the value of the keyword argument random_state to 1, so the predictions don't vary due to
random chance.
• Set the value of the keyword argument class_weight to balanced, so we avoid issues with imbalanced
classes.
• Remember to print out the fpr and tpr values at the end!
69
Solutions
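A sketch of the random forest version described in the instructions (fold count illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rf = RandomForestClassifier(class_weight="balanced", random_state=1)
predictions = cross_val_predict(rf, features, target, cv=3)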
Key Points
Unfortunately, using a random forest classifier didn't improve our false positive rate. The model is
likely too heavy on the 1 class, and still mostly predicting 1s. We could fix this by applying a harsher
penalty for misclassifications of 0s.
Ultimately, our best model had a false positive rate of nearly 9%, and a true positive rate of nearly 24%.
For a conservative investor, this means that they make money as long as the interest rate is high enough
to offset the losses from 9% of borrowers defaulting. In addition, the pool of 24% of borrowers must
be large enough to make enough interest money to offset the losses.
If we had randomly picked loans to fund, borrowers would have defaulted on 14.5% of them, and our
model is better than that, although we're excluding more loans than a random strategy would. Given
this, there's still quite a bit of room to improve:
70
• We can tweak the penalties further
• We can try models other than a random forest and logistic regression
• We can use some of the columns we discarded to generate better features
• We can ensemble multiple models to get more accurate predictions
• We can tune the parameters of the algorithm to achieve higher performance
71
III. Kaggle
Fundamentals
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
72
Kaggle Fundamentals
Learn how to get started and participate in Kaggle competitions with our Kaggle Fundamentals course.
Kaggle is a data science competition site where you can sign up to compete with other data scientists
and data science teams to produce the most accurate analysis of a particular data set. Competition
in Kaggle is strong, and placing among the top finishers in a competition will give you bragging rights.
In this course, you will compete in Kaggle's 'Titanic' competition to build a simple machine learning
model and make your first Kaggle submission. You will also learn how to select the best algorithm and
tune your model for the best performance. You'll be working with multiple algorithms such as logistic
regression, k-nearest neighbors, and random forests in attempts to find the model that scores the best
and awards you the best rank.
Throughout this course, you'll learn several tips and tricks for competing in Kaggle competitions that
will help you place highly. You’ll also learn more about effective machine learning workflows, and
about how to use a Jupyter Notebook for Kaggle competitions.
At the end of the course, you’ll have a completed machine learning project and the knowledge you
need to dive into other Kaggle competitions and prove your skills to the world.
• Build a simple machine learning model and make your first Kaggle submission
• Create new features and select the best-performing features to improve your score
• Work with multiple algorithms including logistic regression, k nearest neighbors, and random forest
• How to select the best algorithm and tune your model for the best performance
III.1 Getting Started with Kaggle
III.1.1 Introduction to Kaggle
In this mission and the ones that follow, we're going to learn how to compete in Kaggle competitions.
In this introductory mission we'll learn how to:
73
This course presumes you have an understanding of Python and the pandas library. If you need to
learn about these, we recommend our Python and pandas courses.
Kaggle has created a number of competitions designed for beginners. The most popular of these
competitions, and the one we'll be looking at, is about predicting which passengers survived the sinking
of the Titanic.
In this competition, we have a data set of different information about passengers onboard the Titanic,
and we want to see if we can use that information to predict whether those people survived or not.
Before we start looking at this specific competition, let's take a moment to understand how Kaggle
competitions work.
Each Kaggle competition has two key data
files that you will work with - a training set and
a testing set.
In this competition, the two files are named test.csv and train.csv. We'll start by using the pandas
library to read both files and inspect their size.
Instructions
74
Solutions
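A sketch of what the solution might look like (assuming both files sit in the working directory):

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
# .shape gives (number of rows, number of columns)
print("train dimensions:", train.shape)
print("test dimensions:", test.shape)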
The files we read in the previous screen are available on the data page for the Titanic competition on
Kaggle. That page also has a data dictionary, which explains the various columns that make up the data
set. Below are the descriptions contained in that data dictionary:
• PassengerID - A column added by Kaggle to identify each row and make submissions easier
• Survived - Whether the passenger survived or not and the value we are predicting (0=No, 1=Yes)
• Pclass - The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd)
• Sex - The passenger's sex
• Age - The passenger's age in years
• SibSp - The number of siblings or spouses the passenger had aboard the Titanic
• Parch - The number of parents or children the passenger had aboard the Titanic
• Ticket - The passenger's ticket number
• Fare - The fare the passenger paid
• Cabin - The passenger's cabin number
• Embarked - The port where the passenger embarked (C=Cherbourg, Q=Queenstown,
S=Southampton)
The data page on Kaggle has some additional notes about some of the columns. It's always worth
reviewing these notes before working with the data.
75
The type of machine learning we will be doing is called classification, because when we make predictions
we are classifying each passenger as a survivor or not. More specifically, we are performing binary
classification, which means that there are only two different states we are classifying.
In any machine learning exercise, thinking about the topic you are predicting is very important. We call
this step acquiring domain knowledge, and it's one of the most important determinants for success in
machine learning.
In this case, understanding the Titanic disaster and specifically what variables might affect the outcome
of survival is important. Anyone who has watched the movie Titanic would remember that women and
children were given preference to lifeboats (as they were in real life). You would also remember the
vast class disparity of the passengers.
This indicates that Age, Sex, and Pclass may be good predictors of survival. We'll start by exploring
Sex and Pclass by visualizing the data. The resultant plot will look like this:
We can immediately see that females survived in much higher proportions than males did. Let's do the
same for the Pclass column.
Instructions
76
III.1.3 Exploring and converting the age column
The Sex and Pclass columns are what we call categorical features. That means that the values
represent a few separate options (for instance, whether the passenger was male or female).
The Age column contains numbers ranging from 0.42 to 80.0 (if you look at Kaggle's data page, it
informs us that Age is fractional if the passenger is less than one). The other thing to note here is that
there are 714 values in this column, fewer than the 891 rows we discovered the train data set had
earlier in this mission, which indicates we have some missing values.
All of this means that the Age column needs to be treated slightly differently, as this is a continuous
numerical column. One way to look at distribution of values in a continuous numerical set is to use
histograms. We can create two histograms to compare visually those that survived vs those who died
across different age ranges:
77
The pandas.cut() function has two required parameters - the column we wish to cut, and a list of numbers
which define the boundaries of our cuts. We are also going to use the optional parameter labels, which
takes a list of labels for the resultant bins. This will make it easier for us to understand our results.
Before we modify this column, we have to be aware of two things. Firstly, any change we make to
the train data, we also need to apply to the test data, otherwise we will be unable to use our model to
make predictions for our submissions. Secondly, we need to remember to handle the missing values
we observed above.
We then use that function on both the train and test dataframes. The diagram below shows how the
function converts the data:
Note that the cut_points list has one more element than the label_names list, since it needs to define
the upper boundary for the last segment.
Instructions
Create the cut_points and label_names lists to split the Age column into the following categories:
• Missing, from -1 to 0
• Infant, from 0 to 5
• Child, from 5 to 12
• Teenager, from 12 to 18
• Young Adult, from 18 to 35
• Adult, from 35 to 60
• Senior, from 60 to 100
• Apply the process_age() function on the train dataframe, assigning the result to train
• Apply the process_age() function on the test dataframe, assigning the result to test
• Use DataFrame.pivot_table() to pivot the train dataframe by the Age_categories column
• Use DataFrame.plot.bar() to plot the pivot table
78
Solutions
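A sketch of the process_age() function and the steps above (filling missing ages with -0.5 so they fall into the Missing bin is one implementation choice):

import pandas as pd

def process_age(df, cut_points, label_names):
    # Missing ages are set to -0.5 so they land in the "Missing" bin
    df["Age"] = df["Age"].fillna(-0.5)
    df["Age_categories"] = pd.cut(df["Age"], cut_points, labels=label_names)
    return df

cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager", "Young Adult", "Adult", "Senior"]

train = process_age(train, cut_points, label_names)
test = process_age(test, cut_points, label_names)

# Survival rate per age category, plotted as a bar chart
pivot = train.pivot_table(index="Age_categories", values="Survived")
pivot.plot.bar()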
So far we have identified three columns that may be useful for predicting survival:
• Sex
• Pclass
• Age, or more specifically our newly created Age_categories
Before we build our model, we need to prepare these columns for machine learning. Most machine
learning algorithms can't understand text labels, so we have to convert our values into numbers.
Additionally, we need to be careful that we don't imply any numeric relationship where there isn't one.
If we think of the values in the Pclass column, we know they are 1, 2, and 3. You can confirm this by
running the following code:
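For example, a quick check on the train dataframe:

print(train["Pclass"].unique())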
79
Rather than doing this manually, we can use the pandas.get_dummies() function, which will generate
columns shown in the diagram above.
The following code creates a function to create the dummy columns for the Pclass column and add it
back to the original dataframe. It then applies that function to the train and test dataframes.
Let's use that function to create dummy columns for both the Sex and Age_categories columns.
Instructions
• Use the create_dummies() function to create dummy variables for the Sex column:
• In the train dataframe
• In the test dataframe
• Use the create_dummies() function to create dummy variables for the Age_categories column:
• In the train dataframe
• In the test dataframe
Solutions
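A sketch of the create_dummies() helper and its use (the Pclass call mirrors the example described above):

import pandas as pd

def create_dummies(df, column_name):
    # One new 0/1 column per unique value, e.g. Sex_male and Sex_female
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies], axis=1)
    return df

for column in ["Pclass", "Sex", "Age_categories"]:
    train = create_dummies(train, column)
    test = create_dummies(test, column)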
Now that our data has been prepared, we are ready to train our first model. The first model we will use is
called Logistic Regression, which is often the first model you will train when performing classification.
We will be using the scikit-learn library as it has many tools that make performing machine learning
easier. The scikit-learn workflow consists of four main steps:
• Instantiate (or create) the specific machine learning model you want to use
• Fit the model to the training data
• Use the model to make predictions
• Evaluate the accuracy of the predictions
80
Each model in scikit-learn is implemented as a separate class and the first step is to identify the class
we want to create an instance of. In our case, we want to use the LogisticRegression class.
We'll start by looking at the first two steps. First, we need to import the class:
Lastly, we use the LogisticRegression.fit() method to train our model. The .fit() method accepts two
arguments: X and Y. X must be a two dimensional array (like a dataframe) of the features that we wish
to train our model on, and Y must be a one-dimensional array (like a series) of our target, or the column
we wish to predict.
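A sketch of the import, the instantiation and the fit, using the three columns named in the next sentence:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
columns = ["Pclass_2", "Pclass_3", "Sex_male"]
# X is the two dimensional array of features, y is the column we want to predict
lr.fit(train[columns], train["Survived"])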
The code above fits (or trains) our LogisticRegression model using three columns: Pclass_2, Pclass_3,
and Sex_male.
Let's train our model using all of the columns we created in the previous screen.
Instructions
Solutions
81
III.1.6 Split data
Congratulations, you've trained your first machine learning model! Our next step is to find out how
accurate our model is, and to do that, we'll have to make some predictions.
If you recall from earlier, we do have a test dataframe that we could use to make predictions. We could
make predictions on that data set, but because it doesn't have the Survived column we would have to
submit it to Kaggle to find out our accuracy. This would quickly become a pain if we had to submit to
find out the accuracy every time we optimized our model.
We could also fit and predict on our train dataframe, however if we do this there is a high likelihood
that our model will overfit, which means it will perform well because we're testing on the same data
we've trained on, but then perform much worse on new, unseen data.
The convention in machine learning is to call these two parts train and test. This can become confusing,
since we already have our test dataframe that we will eventually use to make predictions to submit to
Kaggle. To avoid confusion, from here on, we're going to call this Kaggle 'test' data holdout data, which
is the technical name given to this type of data used for final predictions.
The scikit-learn library has a handy model_selection.train_test_split() function that we can use to split
our data. train_test_split() accepts two parameters, X and y, which contain all the data we want to
train and test on, and returns four objects: train_X, test_X, train_y, test_y:
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
82
Here's what the syntax for creating these four objects looks like:
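A sketch of that call (assuming all_X holds the feature columns and all_y the Survived column):

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(
    all_X, all_y, test_size=0.2, random_state=0)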
You'll notice that there are two other parameters we used: test_size, which lets us control what
proportions our data are split into, and random_state. The train_test_split() function randomizes
observations before dividing them, and setting a random seed means that our results will be
reproducible, which is important if you are collaborating, or need to produce consistent results each
time (which our answer checker requires).
Instructions
• Use the model_selection.train_test_split() function to split the train dataframe using the following
parameters:
• test_size of 0.2
• random_state of 0
• Assign the four returned objects to train_X, test_X, train_y, and test_y
Solutions
Now that we have our data split into train and test sets, we can fit our model again on our training set,
and then use that model to make predictions on our test set.
Once we have fit our model, we can use the LogisticRegression.predict() method to make predictions
83
The predict() method takes a single parameter X, a two dimensional array of features for the observations
we wish to predict. X must have the exact same features as the array we used to fit our model. The
method returns a single-dimensional array of predictions.
Again, scikit-learn has a handy function we can use to calculate accuracy: metrics.accuracy_score(). The
function accepts two parameters, y_true and y_pred, which are the actual values and our predicted
values respectively, and returns our accuracy score.
Instructions
Solutions
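A sketch of what the solution might look like, reusing the split from the previous screen:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(train_X, train_y)
predictions = lr.predict(test_X)
# Compare the predicted values with the actual values
accuracy = accuracy_score(test_y, predictions)
print(accuracy)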
84
III.1.8 Using Cross Validation for More
Accurate Error Measurement
The scikit-learn model_selection.cross_val_score() function handles splitting the data into folds and
training and scoring a model on each fold. It accepts the following parameters:
• estimator is a scikit-learn estimator object, like the LogisticRegression() objects we have been
creating
• X is all features from our data set
• y is the target variables
• cv specifies the number of folds
It's worth noting, the cross_val_score() function can use a variety of cross validation techniques and
scoring types, but it defaults to k-fold validation and accuracy scores for our input types.
Instructions
85
Solutions
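A sketch of scoring the model with cross validation (the choice of 10 folds here is an assumption):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr = LogisticRegression()
scores = cross_val_score(lr, all_X, all_y, cv=10)
# One accuracy score per fold; the mean gives a single estimate
accuracy = np.mean(scores)
print(scores)
print(accuracy)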
From the results of our k-fold validation, you can see that the accuracy number varies with each fold -
ranging between 76.4% and 87.6%. This demonstrates why cross validation is important.
As it happens, our average accuracy score was 80.2%, which is not far from the 81.0% we got from
our simple train/test split, however this will not always be the case, and you should always use cross-
validation to make sure the error metrics you are getting from your model are accurate.
We are now ready to use the model we have built to train our final model and then make predictions
on our unseen holdout data, or what Kaggle calls the 'test' data set.
Instructions
Solutions
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
86
III.1.10 Creation submission file
The last thing we need to do is create a submission file. Each Kaggle competition can have slightly
different requirements for the submission file. Here's what is specified on the Titanic competition
evaluation page:
You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if
you have extra columns (beyond PassengerId and Survived) or rows.
The table below shows this in a slightly easier to understand format, so we can visualize what we
are aiming for. We will need to create a new dataframe that contains the holdout_predictions we
created in the previous screen and the PassengerId column from the holdout dataframe. We don't
need to worry about matching the data up, as both of these remain in their original order.
Instructions
Solutions
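A sketch of building and saving the submission file (holdout and holdout_predictions are the objects referred to above):

import pandas as pd

holdout_ids = holdout["PassengerId"]
submission_df = {"PassengerId": holdout_ids,
                 "Survived": holdout_predictions}
submission = pd.DataFrame(submission_df)
# Kaggle expects a header row but no index column
submission.to_csv("submission.csv", index=False)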
87
III.1.11 Making Our First Submission to Kaggle
You can download the submission file you just created (when working locally, it will be in the same
directory as your notebook).
Now that we have our submission file, we can start our submission to Kaggle by clicking the blue
'Submit Predictions' button on the competition page.
You will then be prompted to upload your CSV file, and add a brief description of your submission.
When you make your submission, Kaggle will process your predictions and give you your accuracy for
the holdout data and your ranking.
When it is finished processing, you will see that our first submission gets an accuracy score of 0.75598,
or 75.6%.
The fact that our accuracy on the holdout data is 75.6% compared with the 80.2% accuracy we got
with cross-validation indicates that our model is overfitting slightly to our training data.
At the time of writing, accuracy of 75.6% gives a rank of 6,663 out of 7,954. It's easy to look at
Kaggle leaderboards after your first submission and get discouraged, but keep in mind that this is just
a starting point.
It's also very common to see a small number of scores of 100% at the top of the Titanic leaderboard
and think that you have a long way to go. In reality, anyone scoring above about 90% on this competition is
likely cheating (it's easy to look up the names of the passengers in the holdout set online and see if
they survived).
88
There is a great analysis on Kaggle, How am I doing with my score, which uses a few different strategies
and suggests a minimum score for this competition is 62.7% (achieved by presuming that every
passenger died) and a maximum of around 82%. We are a little over halfway between the minimum
and maximum, which is a great starting point.
There are many things we can do to improve the accuracy of our model. Here are some that we will
cover in the next two missions of this course:
In this mission, we're going to focus on working with the features used in our model.
We'll start by looking at feature selection. Feature selection is important because it helps to exclude
features which are not good predictors, or features that are closely related to each other. Both of these
will cause our model to be less accurate, particularly on previously unseen data.
89
The model on the left is overfitting, which means the model represents the training data too closely,
and is unlikely to predict well on unseen data, like the holdout data for our Kaggle competition.
The model on the right is well-fit. It captures the underlying pattern in the data without the detailed
noise found just in the training set. A well fit model is likely to make accurate predictions on previously
unseen data. The key to creating a well-fit model is to select the right balance of features, and to create
new features to train your model.
In the previous mission, we trained our model using data about the age, sex and class of the passengers
on the Titanic. Let's start by using the functions we created in that mission to add the columns we had
at the end of the first mission.
Remember that any modifications we make to our training data (train.csv) we also have to make to our
holdout data (test.csv).
Instructions
Solutions
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
90
III.2.1 Preparing more Features
Our model in the previous mission was based on three columns from the original data: Age, Sex, and
Pclass. As you saw when you printed the column names in the previous screen, there are a number of
other columns that we haven't yet used. To make it easier to reference, the output from the previous
screen is copied below:
The last nine rows of the output are dummy columns we created, but in the first three rows we can
see there are a number of features we haven't yet utilized. We can ignore PassengerId, since this
is just a column Kaggle have added to identify each passenger and calculate scores. We can also
ignore Survived, as this is what we're predicting, as well as the three columns we've already used.
Here is a list of the remaining columns (with a brief description), followed by a sample of randomly
selected passengers and their data from those columns, so we can refamiliarize ourselves with the data.
• SibSp - The number of siblings or spouses the passenger had aboard the Titanic
• Parch - The number of parents or children the passenger had aboard the Titanic
• Ticket - The passenger's ticket number
• Fare - The fare the passenger paid
• Cabin - The passenger's cabin number
• Embarked - The port where the passenger embarked
91
     Name                                       SibSp  Parch  Ticket              Fare      Cabin    Embarked
680  Peters, Miss. Katie                        0      0      330935              8.1375    NaN      Q
333  Vander Planke, Mr. Leo Edmondus            2      0      345764              18.0000   NaN      S
310  Hays, Miss. Margaret Bechstein             0      0      11767               83.1583   C54      C
291  Bishop, Mrs. Dickinson H (Helen Walton)    1      0      11967               91.0792   B49      C
33   Wheadon, Mr. Edward H                      0      0      C.A. 24579          10.5000   NaN      S
761  Nirva, Mr. Iisakki Antino Aijo             0      0      SOTON/O2 3101272    7.1250    NaN      S
305  Allison, Master. Hudson Trevor             1      2      113781              151.5500  C22 C26  S
210  Ali, Mr. Ahmed                             0      0      SOTON/O.Q. 3101311  7.0500    NaN      S
272  Mellinger, Mrs. (Elizabeth Anne Maidment)  0      1      250644              19.5000   NaN      S
At first glance, both the Name and Ticket columns look to be unique to each passenger. We will come
back to these columns later, but for now we'll focus on the other columns.
We can use the Dataframe.describe() method to give us some more information on the values within
each remaining column.
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
92
Of these, SibSp, Parch and Fare look to be standard numeric columns with no missing values. Cabin has
values for only 204 of the 891 rows, and even then most of the values are unique, so for now we will
leave this column also. Embarked looks to be a standard categorical column with 3 unique values,
much like PClass was, except that there are two missing values. We can easily fill these two missing
values with the most common value, "S" which occurs 644 times.
Looking at our numeric columns, we can see a big difference between the range of each. SibSp has
values between 0-8, Parch between 0-6, and Fare is on a dramatically different scale, with values
ranging from 0-512. In order to make sure these values are equally weighted within our model, we'll
need to rescale the data.
Rescaling simply stretches or shrinks the data as needed to be on the same scale, in our case between
0 and 1.
In the diagram above, the three columns have
different minimum and maximum values before
rescaling.
Instructions
93
Solutions
In order to select the best-performing features, we need a way to measure which of our features are
relevant to our outcome - in this case, the survival of each passenger. One effective way is by training a
logistic regression model using all of our features, and then looking at the coefficients of each feature.
The scikit-learn LogisticRegression class has an attribute in which coefficients are stored after the
model is fit, LogisticRegression.coef_. We first need to train our model, after which we can access this
attribute.
The coef_ attribute holds a NumPy array of coefficients, in the same order as the features that were
used to fit the model. To make these easier to interpret, we can convert the coefficients to a pandas
series, adding the column names as the index:
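A sketch of fitting the model and labelling the coefficients (columns is the list of feature names used to fit):

import pandas as pd
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(train[columns], train["Survived"])
# coef_ is a 2D array with a single row for binary classification
coefficients = lr.coef_
feature_importance = pd.Series(coefficients[0], index=columns)
feature_importance.plot.barh()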
We'll now fit a model and plot the coefficients for each feature.
94
Instructions
Solutions
To make things easier to interpret, we'll alter the plot to show all positive values and sort the
bars in order of size:
95
We'll train a new model with the top 8 scores and check our accuracy using cross validation.
Instructions
Solutions
96
III.2.4 Submitting our Improved Model to Kaggle
The cross validation score of 81.48% is marginally higher than the cross validation score for the model
we created in the previous mission, which had a score of 80.2%.
Hopefully, this improvement will translate to previously unseen data. Let's train a model using the
columns from the previous step, make some predictions on the holdout data and submit it to Kaggle
for scoring.
Instructions
Solutions
You can download the CSV from the previous step here. When you submit it to Kaggle, you'll see
that the score is 77.0%, which at the time of writing equates to jumping about 1,500 places up the
leaderboard (this will vary as the leaderboard is always changing). It's only a small improvement, but
we're moving in the right direction.
A lot of the gains in accuracy in machine learning come from Feature Engineering. Feature engineering
is the practice of creating new features from your existing data.
97
One common way to engineer a feature is using a technique called binning. Binning is when you take
a continuous feature, like the fare a passenger paid for their ticket, and separate it out into several
ranges (or 'bins'), turning it into a categorical variable.
This can be useful when there are patterns in the
data that are non-linear and you're using a linear
model (like logistic regression). We actually used
binning in the previous mission when we dealt
with the Age column, although we didn't use the
term.
Instructions
• Using the process_age() function as a model, create a function process_fare() that uses the
pandas cut() method to create bins for the Fare column and assign the results to a new column
called Fare_categories
• We have already dealt with missing values in the Fare column, so you won't need the line that
uses fillna()
• Use the process_fare() function on both the train and holdout dataframes, creating the four 'bins':
• 0-12, for values between 0 and 12
• 12-50, for values between 12 and 50
• 50-100, for values between 50 and 100
• 100+, for values between 100 and 1000
• Use the create_dummies() function we created earlier in the mission on both
the train and holdout dataframes to create dummy columns based on our new fare bins
98
Solutions
While in isolation the cabin number of each passenger will be reasonably unique to each, we can see
99
Looking at the Name column, there is a title like 'Mr' or 'Mrs' within each, as well as some less
common titles, like the 'Countess' from the final row of our table above. By spending some time
researching the different titles, we can categorize these into six types:
• Mr
• Mrs
• Master
• Miss
• Officer
• Royalty
We can use the Series.str.extract method and a regular expression to extract the title from each
name, and then use the Series.map() method and a predefined dictionary to simplify the titles.
Instructions
• Use extract(), map() and the dictionary titles to categorize the titles for the holdout dataframe and
assign the results to a new column Title
• For both the train and holdout dataframes:
• Use the str() accessor to extract the first letter from the Cabin column and assign the result to a
new column Cabin_type
• Use the fillna() method to fill any missing values in Cabin_type with "Unknown"
• For the newly created columns Title and Cabin_type, use create_dummies() to create dummy
columns for both the train and holdout dataframes
Solutions
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
100
III.2.7 Finding Correlated Features
We now have 34 possible feature columns we can use to train our model. One thing to be aware of
as you start to add more features is a concept called collinearity. Collinearity occurs where more than
one feature contains data that are similar.
The effect of collinearity is that your model will overfit - you may get great results on your test data set,
but then the model performs worse on unseen data (like the holdout set).
One easy way to understand collinearity is with a simple binary variable like the Sex column in our
dataset. Every passenger in our data is categorized as either male or female, so 'not male' is exactly the
same as 'female'.
The darker squares, whether the darker red or darker blue, indicate pairs of columns that have higher
correlation and may lead to collinearity. The easiest way to produce this plot is to use the DataFrame.
corr() method to produce a correlation matrix, and then plot it with the Seaborn library's seaborn.heatmap() function.
The example plot above was produced using a code example from seaborn's documentation, which
produces a correlation heatmap that is easier to interpret than the default output of heatmap(). We've
created a function containing that code to make it easier for you to plot the correlations between the
features in our data.
101
Instructions
Use the plot_correlation_heatmap() function to produce a heatmap for the train dataframe, using only
the features in the list columns.
Solutions
102
We can see that there is a high correlation between Sex_female/Sex_male and Title_Mr/Title_Mrs.
We will remove the columns Sex_female and Sex_male since the title data may be more nuanced.
Apart from that, we should remove one of each of our dummy variables to reduce the collinearity in
each. We'll remove:
• Pclass_2
• Age_categories_Teenager
• Fare_categories_12-50
• Title_Master
• Cabin_type_A
In an earlier step, we manually used the logit coefficients to select the most relevant features. An
alternate method is to use one of scikit-learn's inbuilt feature selection classes. We will be using
the feature_selection.RFECV class which performs recursive feature elimination with cross-validation.
The RFECV class starts by training a model using all of your features and scores it using cross validation.
It then uses the logit coefficients to eliminate the least important feature, and trains and scores a new
model. At the end, the class looks at all the scores, and selects the set of features which scored highest.
Like the LogisticRegression class, RFECV must first be instantiated and then fit. The first parameter
when creating the RFECV object must be an estimator, and we need to use the cv parameter to
specify the number of folds for cross-validation.
Once the RFECV object has been fit, we can use the RFECV.support_ attribute to access a boolean
mask of the columns that were selected.
Instructions
Because of the computation involved in this exercise, code running may take longer than other screens.
103
Solutions
Let's train a model using cross validation using these columns and check the score.
Instructions
Solutions
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
104
Instructions
Solutions
You can download the submission file we just created and submit it to Kaggle. The score this submission
gets is 78.0%, which is equivalent to a jump of roughly 1,000 spots (again, this will vary as submissions
are constantly being made to the leaderboard).
By preparing, engineering and selecting features, we have increased our accuracy by 2.4%. When
working in Kaggle competitions, you should spend a lot of time experimenting with features, particularly
feature engineering. Here are some ideas that you can use to work with features for this competition:
In the next mission in this course, we'll look at selecting and optimizing different models to improve
our score.
105
III.3 Model Selection and Tuning
III.3.1 Model selection
In the previous mission, we worked to optimize our predictions by creating and selecting the features
used to train our model. The other half of the optimization puzzle is to optimize the model itself— or
more specifically, the algorithm used to train our model.
So far, we've been using the logistic regression algorithm to train our models, however there are
hundreds of different machine learning algorithms from which we can choose. Each algorithm has
different strengths and weaknesses, and so we need to select the algorithm that works best with our
specific data— in this case our Kaggle competition.
The process of selecting the algorithm which gives the best predictions for your data is called model
selection.
In this mission, we're going to work with two new algorithms: k-nearest neighbors and random forests.
Before we begin, we'll need to import in the data. To save time, we have saved the features we created
in the previous mission as CSV files, train_modified.csv and holdout_modified.csv.
Instructions
• Import train_modified.csv into a pandas dataframe and assign the result to train
• Import holdout_modified.csv into a pandas dataframe and assign the result to holdout
Solutions
We're going to train our models using all the columns in the train dataframe. This will cause a small
amount of overfitting due to collinearity (as we discussed in the previous mission), but having more
features will allow us to more thoroughly compare algorithms.
So we have something to compare to, we're going to train a logistic regression model like in the
previous two missions. We'll use cross validation to get a baseline score.
106
Instructions
Solutions
The logistic regression baseline model from the previous screen scored 82.5%
The k-nearest neighbors algorithm finds the observations in our training set most similar to the
observation in our test set, and uses the average outcome of those 'neighbor' observations to make a
prediction. The 'k' is the number of neighbor observations used to make the prediction.
107
The plots below show three simple k-nearest neighbors models, where two features are shown on
the axes and there are two outcomes, red and green:
• In the first plot, the value of k is 1. The green dot is therefore the closest neighbor to the gray dot,
making the prediction green
• In the second plot, the value of k is 3. The closest 3 neighbors to our gray dot are used (2 red vs 1
green), making the prediction red
• In the third plot, the value of k is 5. The closest 5 neighbors to our gray dot are used (3 red vs 2
green), making the prediction red
If you'd like to learn more about the k-nearest neighbors algorithm, you might like to check out our
free Introduction to K-Nearest Neighbors mission.
Just like it does for logistic regression, scikit-learn has a class that makes it easy to use k-nearest
neighbors to make predictions, neighbors.KNeighborsClassifier.
Scikit-learn's use of object-oriented design makes it easy to substitute one model for another. The
syntax to instantiate a KNeighborsClassifier is very similar to the syntax we use for logistic regression.
The optional n_neighbors argument sets the value of k when predictions are made. The default value
of n_neighbors is 5, but we're going to start by building a simple model that uses the closest neighbor
to make our predictions.
Instructions
108
Solutions
The k-nearest neighbors model we trained in the previous screen had an accuracy score of 78.3%,
worse than our baseline score of 82.5%.
Besides pure model selection, we can vary the settings of each model— for instance the value of k in
our k-nearest neighbors model. This is called hyperparameter optimization.
We can use a loop and Python's inbuilt range() class to iterate through different values for k and
calculate the accuracy score for each different value. We will only want to test odd values for k to
avoid ties, where both 'survived' and 'died' outcomes would have the same number of neighbors.
This is the syntax we would use to get odd values between 1-7 from range():
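A quick illustration of that call:

for k in range(1, 8, 2):
    print(k)   # prints 1, 3, 5, 7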
Note that we use the arguments (1,8,2) to get values between 1 and 7, since the created range() object
contains numbers up to but not including the 8.
Let's use this technique to calculate the accuracy of our model for values of k from 1-49, storing the
results in a dictionary.
To make the results easier to understand, we'll finish by plotting the scores. We have provided a helper
function, plot_dict() which you can use to easily plot the dictionary.
109
Instructions
• Use a for loop and the range class to iterate over odd values of k from 1-49, and in each iteration:
• Instantiate a KNeighborsClassifier object with the value of k for the n_neighbors argument
• Use cross_val_score to create a list of scores using the newly created KNeighborsClassifier object,
using all_X, all_y, and cv=10 as the arguments
• Calculate the mean of the list of scores
• Add the mean of the scores to the dictionary knn_scores, using k for the key
• Use the plot_dict() helper function to plot the knn_scores dictionary
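A sketch of that loop (all_X and all_y as before; plot_dict() is the helper function mentioned above):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn_scores = dict()
for k in range(1, 50, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, all_X, all_y, cv=10)
    # Store the mean accuracy for this value of k
    knn_scores[k] = np.mean(scores)

plot_dict(knn_scores)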
Looking at our plot from the previous screen we can see that a k value of 19 gave us our best score,
and checking the knn_scores dictionary we can see that the score was 82.4%, identical to our baseline
(if we didn't round the numbers you would see that it's actually 0.01% less accurate).
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
110
The technique we just used is called grid search - we train a number of models across a 'grid' of values
and then search for the model that gives us the highest accuracy.
Scikit-learn has a class to perform grid search, model_selection.GridSearchCV(). The 'CV' in the name
indicates that we're performing both grid search and cross validation at the same time.
By creating a dictionary of parameters and possible values and passing it to the GridSearchCV object
you can automate the process. Here's what the code from the previous screen would look like, when
implemented using the GridSearchCV class.
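A sketch of the grid search equivalent (only n_neighbors is varied here; the grid of values is illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

hyperparameters = {"n_neighbors": list(range(1, 50, 2))}
knn = KNeighborsClassifier()
grid = GridSearchCV(knn, param_grid=hyperparameters, cv=10)
grid.fit(all_X, all_y)
# Best combination found and its cross validation score
print(grid.best_params_)
print(grid.best_score_)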
111
Let's use GridSearchCV to turbo-charge our search for the best performing parameters for our model,
by testing 40 combinations of three different hyperparameters.
Instructions
Solutions
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
112
III.3.6 Submitting K-Nearest Neighbors Predictions to Kaggle
The cross-validation score for the best performing model was 82.9%, better than our baseline model.
We can use the GridSearchCV.best_estimator_ attribute to retrieve a trained model with the best-
performing hyperparameters. This code:
Is equivalent to this code where we manually specify the hyperparameters and train the model:
Instructions
• Make predictions on the data from holdout_no_id using the best_knn model, and assign the result
to holdout_predictions
• Create a dataframe submission with two columns:
• PassengerId, with the values from the PassengerId column of the holdout dataframe
• Survived, with the values from holdout_predictions
• Use the DataFrame.to_csv method to save the submission dataframe to the filename submission_1.csv
113
Solutions
You can download the submission file from the previous screen here.
When you submit this to Kaggle, you'll see it scores 75.6%, less than our best submission of 78.0%.
While our model could be overfitting due to including all columns, it also seems like k-nearest neighbors
may not be the best algorithm choice.
ARTIFICIAL INTELLIGENCE EXPERT CERTIFICATE CAIEC™
114
Let's try another algorithm called random forests.
Random forests is a specific type of decision
tree algorithm. You have likely seen decision trees
before as part of flow charts or infographics.
Say we wanted to build a decision tree to help
us categorize an object as either being 'hotdog' or
'not hotdog', we could construct a decision tree
like the below:
Decision tree algorithms attempt to build the most efficient decision tree based on the training data,
and then use that tree to make future predictions. If you'd like to learn about decision trees and
random forests in detail, you should check out our decision trees course.
Scikit-learn contains a class for classification using the random forest algorithm, ensemble.
RandomForestClassifier. Here's how to fit a model and make predictions using
the RandomForestClassifier class:
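A sketch of that pattern (random_state fixed so the results are reproducible):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1)
clf.fit(train_X, train_y)
predictions = clf.predict(test_X)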
Let's use a RandomForestClassifier object with cross_val_score() as we did earlier to see how the
algorithm performs with the default hyperparameters.
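A minimal sketch, again assuming the all_X, all_y and holdout_no_id data from the earlier screens:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Fit a random forest and make predictions.
clf = RandomForestClassifier(random_state=1)
clf.fit(all_X, all_y)
predictions = clf.predict(holdout_no_id)

# Cross validation with the default hyperparameters.
scores = cross_val_score(RandomForestClassifier(random_state=1), all_X, all_y, cv=10)
print(scores.mean())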
Instructions
115
Solutions
Using the default settings, our random forests model obtained a cross validation score of 82.0%.
Just like we did with the k-nearest neighbors model, we can use GridSearchCV to test a variety of hyperparameter combinations. The best way to see a list of available hyperparameters is to check the documentation for the classifier, in this case the documentation for RandomForestClassifier. Let's use grid search to test out
combinations of the following hyperparameters:
116
Instructions
Solutions
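A sketch of one possible grid search (the exact grid used by the course is not shown, so the values below are plausible assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

hyperparameters = {
    "criterion": ["entropy", "gini"],
    "max_depth": [5, 10],
    "max_features": ["log2", "sqrt"],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [3, 5],
    "n_estimators": [6, 9],
}
clf = RandomForestClassifier(random_state=1)
grid = GridSearchCV(clf, param_grid=hyperparameters, cv=10)
grid.fit(all_X, all_y)

print(grid.best_params_)
print(grid.best_score_)
best_rf = grid.best_estimator_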
117
Let's use the best performing model to make predictions on the holdout data and create a submission file to see how it performs on the Kaggle leaderboard!
Instructions
• Assign the best performing model from the GridSearchCV object grid to best_rf
• Make predictions on the data from holdout_no_id using the best_rf model, and assign the result
to holdout_predictions
• Create a dataframe submission with two columns:
• PassengerId, with the values from the PassengerId column of the holdout dataframe
• Survived, with the values from holdout_predictions
• Use the DataFrame.to_csv method to save the submission dataframe to the filename submission_2.csv
Solutions
If you submit this to Kaggle, it achieves a score of 77.1%, considerably better than our k-nearest
neighbors score of 75.6% and very close (2 incorrect predictions) to our best score from the previous
mission of 78.0%.
118
By combining our strategies for feature selection, feature engineering, model selection and model
tuning, we'll be able to continue to improve our score.
The next and final mission in this course is a guided project, where we'll teach you how to combine
everything you've learned into a real-life Kaggle workflow, and continue to improve your score.
So far in this course, you've been learning about Kaggle competitions using Dataquest missions.
Missions are highly structured, and your work is answer-checked every step of the way.
Guided projects, on the other hand, are less structured and focus more on exploration. Guided projects
help you synthesize concepts learned during missions and practice what you have learned.
Working with guided projects is a great opportunity to practice some of the extra skills you'll need to do data science by yourself, including debugging using all the tools at your disposal: googling for answers, visiting Stack Overflow and consulting the documentation for the modules you are using.
This guided project uses Jupyter Notebook, a web application that lets you combine text and code within a single file and is one of the most popular ways to explore and iterate when working with data. Jupyter Notebook also makes it easy to share your work.
If you're not familiar with Jupyter notebook, we recommend completing our guided project on Using
Jupyter notebook to familiarize yourself.
119
In this guided project, we're going to put together all that we've learned in this course and create
a data science workflow.
Data science, and particularly machine learning, contains many dimensions of complexity when compared with standard software development. In standard software development, code not working as you expect can be caused by a number of factors along two dimensions:
• Bugs in implementation
• Algorithm design
In machine learning, on the other hand, things can go wrong along four dimensions:
• Bugs in implementation
• Algorithm design
• Model issues
• Data quality
The result of this is that there are exponentially more places that machine learning can go wrong.
This concept is illustrated in the excellent post Why is machine learning 'hard'?: the green dot is a 'correct' solution, while the red dots are incorrect solutions. In that illustration there are only a small number of incorrect combinations for software engineering, but in machine learning this becomes exponentially greater!
By defining a workflow for yourself, you can give yourself a framework with which to make iterating
on ideas quicker and easier, allowing yourself to work more efficiently.
In this mission, we're going to explore a workflow to make competing in the Kaggle Titanic competition
easier, using a pipeline of functions to reduce the number of dimensions you need to focus on.
To get started, we'll read in the original train.csv and test.csv files from Kaggle.
120
Instructions
One of the many benefits of using Jupyter is that (by default) it uses the IPython kernel to run code.
This gives you all the benefits of IPython, including code completion and 'magic' commands. (If you'd
like to read more about the internals of Jupyter and how it can help you work more efficiently, you
might like to check out our blog post Jupyter Notebook Tips, Tricks and Shortcuts.)
We can use one of those magic commands, the %load command, to load an external file.
The %load command will copy the contents of the file into the current notebook cell. The syntax is simply %load followed by the file name, for example %load test.py.
To illustrate, say we had a file called test.py with the following line of code:
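The exact file contents from the course are not shown, so the single print statement below is a hypothetical stand-in:

print(1 + 1)

Running %load test.py in a notebook cell would then replace the contents of that cell with a commented marker followed by the code from the file:

# %load test.py
print(1 + 1)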
We have created a file, functions.py, which contains versions of the functions we created in the earlier missions of this course, which will save you from building those functions again from scratch.
121
Instructions
• Use the %load magic command to load the contents of functions.py into a notebook cell and read
through the functions you have imported
• Create a new function, which:
• Accepts a dataframe parameter
• Applies the process_missing(), process_age(), process_fare(), process_titles(), and process_
cabin() functions to the dataframe
• Applies the create_dummies() function to the “Age_categories”, “Fare_categories”, “Title”, and
“Sex” columns
• Returns the processed dataframe
• Apply the newly created function to the train and holdout dataframes (a minimal sketch follows this list)
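A minimal sketch of such a function, assuming the functions loaded from functions.py take a dataframe (and, for create_dummies(), a column name) and return the transformed dataframe:

def pre_process(df):
    # Apply the cleaning functions loaded from functions.py.
    df = process_missing(df)
    df = process_age(df)
    df = process_fare(df)
    df = process_titles(df)
    df = process_cabin(df)
    # Create dummy columns for the categorical features.
    for col in ["Age_categories", "Fare_categories", "Title", "Sex"]:
        df = create_dummies(df, col)
    return df

train = pre_process(train)
holdout = pre_process(holdout)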
Our workflow then cycles through four broad stages:
• Data exploration, to find patterns in the data
• Feature engineering, to create new features from those patterns
• Feature selection, to select the best subset of our current set of features
• Model selection/tuning, training a number of models with different hyperparameters to find the best performer
We can continue to repeat this cycle as we work to optimize our predictions. At the end of any cycle, we can also use our model to make predictions on the holdout set and then submit to Kaggle to get a leaderboard score.
While the first two steps of our workflow are relatively freeform, later in this project we'll create some
functions that will help automate the complexity of the latter two steps so we can move faster.
For now, let's practice the first stage, exploring the data. We're going to examine the two columns that
contain information about the family members each passenger had onboard: SibSp and Parch.
If you need some help with techniques for exploring and visualizing data, you might like to check out
our Data Analysis with Pandas and Exploratory Data Visualization courses.
122
Instructions
• Review the data dictionary and variable notes for the Titanic competition on Kaggle's website to
familiarize yourself with the SibSp and Parch columns
• Use pandas and matplotlib to explore those two columns. You might like to try:
• Inspecting the type of the columns
• Using histograms to view the distribution of values in the columns
• Use pivot tables to look at the survival rate for different values of the columns
• Find a way to combine the columns and look at the resulting distribution of values and survival
rate
• Write a markdown cell explaining your findings
If you didn't reach this conclusion, you can use a code segment like the one sketched below to verify it for yourself. Based on this, we can come up with an idea for a new feature: whether the passenger was alone. This will be a binary column containing the value 1 if the passenger had no siblings, spouses, parents or children aboard (SibSp + Parch equals zero), and 0 otherwise.
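A minimal sketch of that exploration and of the new feature, assuming the train and holdout dataframes from the earlier steps:

import matplotlib.pyplot as plt

# Combine the two family columns and look at the survival rate by family size.
train["family_size"] = train["SibSp"] + train["Parch"]
train.pivot_table(index="family_size", values="Survived").plot.bar()
plt.show()
train = train.drop("family_size", axis=1)

# New binary feature: 1 if the passenger had no family aboard, 0 otherwise.
def process_isalone(df):
    df["isalone"] = ((df["SibSp"] + df["Parch"]) == 0).astype(int)
    return df

train = process_isalone(train)
holdout = process_isalone(holdout)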
Instructions
123
III.4.5 Selecting the Best-Performing Features
The next step in our workflow is feature selection. In the Feature Preparation, Selection and
Engineering mission, we used scikit-learn's feature_selection.RFECV class to automate selecting the
best-performing features using recursive feature elimination.
To speed up our Kaggle workflow, we can create a function that performs this step for us, which will
mean we can perform feature selection by calling a self-contained function and focus our efforts on
the more creative part - exploring the data and engineering new features.
You may remember that the first parameter when you instantiate a RFECV() object is an estimator. At
the time we used a Logistic Regression estimator, but we've since discovered in the Model Selection
and Tuning mission that Random Forests seems to be a better algorithm for this Kaggle competition.
Instructions
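A minimal sketch of such a function, assuming the train dataframe from earlier; the dropna() call is a simplification to keep only complete numeric columns:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

def select_features(df):
    # Keep only numeric columns and drop the target and identifier from the features.
    df = df.select_dtypes([np.number]).dropna(axis=1)
    all_X = df.drop(["Survived", "PassengerId"], axis=1)
    all_y = df["Survived"]

    selector = RFECV(RandomForestClassifier(random_state=1), cv=10)
    selector.fit(all_X, all_y)

    best_columns = list(all_X.columns[selector.support_])
    print(best_columns)
    return best_columns

best_cols = select_features(train)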
Just like we did with feature selection, we can write a function to do the heavy lifting of model selection
and tuning. The function we'll create will use three different algorithms and use grid search to train
using different combinations of hyperparameters to find the best performing model.
124
We can achieve this by creating a list of dictionaries, that is, a list where each element of the list is a dictionary. Each dictionary should contain the name of the model, the estimator object, and a dictionary of hyperparameters to use for grid search.
We can then use a for loop to iterate over the list of dictionaries, and for each one we can use scikit-
learn's model_selection.GridSearchCV class to find the best set of performing parameters, and add
values for both the parameter set and the score to the dictionary.
Finally, we can return the list of dictionaries, which will have our trained GridSearchCV objects as well
as the results so we can see which was the most accurate.
Instructions
125
• "criterion": ["entropy", "gini"]
• "max_depth": [2, 5, 10]
• "max_features": ["log2", "sqrt"]
• "min_samples_leaf": [1, 5, 8]
• "min_samples_split": [2, 3, 5]
• Iterate over that list of dictionaries, and for each dictionary:
• Print the name of the model.
• Instantiate a GridSearchCV() object using the model, the dictionary of hyperparameters and
specify 10 fold cross validation
• Fit the GridSearchCV() object using all_X and all_y
• Assign the parameters and score for the best model to the dictionary
• Assign the best estimator for the best model to the dictionary
• Print the parameters and score for the best model
• Return the list of dictionaries
• Run the newly created function using the train dataframe and the output of select_features() as
inputs and assign the result to a variable
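A minimal sketch of such a function; the estimators and grids below are plausible assumptions rather than the course's exact choices (the random forest grid follows the hyperparameter values listed above):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def select_model(df, features):
    all_X = df[features]
    all_y = df["Survived"]

    models = [
        {
            "name": "LogisticRegression",
            "estimator": LogisticRegression(max_iter=1000),
            "hyperparameters": {"solver": ["newton-cg", "lbfgs", "liblinear"]},
        },
        {
            "name": "KNeighborsClassifier",
            "estimator": KNeighborsClassifier(),
            "hyperparameters": {
                "n_neighbors": list(range(1, 20, 2)),
                "weights": ["distance", "uniform"],
                "p": [1, 2],
            },
        },
        {
            "name": "RandomForestClassifier",
            "estimator": RandomForestClassifier(random_state=1),
            "hyperparameters": {
                "n_estimators": [4, 6, 9],
                "criterion": ["entropy", "gini"],
                "max_depth": [2, 5, 10],
                "max_features": ["log2", "sqrt"],
                "min_samples_leaf": [1, 5, 8],
                "min_samples_split": [2, 3, 5],
            },
        },
    ]

    for model in models:
        print(model["name"])
        grid = GridSearchCV(model["estimator"], model["hyperparameters"], cv=10)
        grid.fit(all_X, all_y)
        model["best_params"] = grid.best_params_
        model["best_score"] = grid.best_score_
        model["best_model"] = grid.best_estimator_
        print(model["best_params"], model["best_score"])

    return models

results = select_model(train, best_cols)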
After running your function, you will have three scores from three different models. At this point in the workflow you have a decision to make: do you want to train your best model on the holdout set and make a Kaggle submission, or do you want to go back to engineering features?
You may find that adding a feature to your model doesn't improve your accuracy. In that case you
should go back to data exploration and repeat the cycle again.
If you're going to be continually submitting to Kaggle, a function will help make this easier. Let's create
a function to automate this.
Note that in our Jupyter Notebook environment, the DataFrame.to_csv() method will save the CSV in
the same directory as your notebook, just as it would if you are running Jupyter locally. To download
the CSV from our environment, you can either click the 'download' button to download all of your
project files as a tar file, or click the Jupyter logo at the top of the interface, and navigate to the CSV
itself to download just that file.
126
Instructions
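A minimal sketch of such a submission function, assuming the holdout dataframe and the results and best_cols objects from the sketches above; the filename is arbitrary:

import pandas as pd

def save_submission_file(model, cols, filename="submission.csv"):
    predictions = model.predict(holdout[cols])
    submission = pd.DataFrame({
        "PassengerId": holdout["PassengerId"],
        "Survived": predictions,
    })
    submission.to_csv(filename, index=False)

best_rf_model = results[2]["best_model"]   # the random forest entry in the sketch above
save_submission_file(best_rf_model, best_cols, "submission_3.csv")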
• Continue to explore the data and create new features, following the workflow and using the
functions we created
• Read more about the Titanic and this Kaggle competition to get ideas for new features
• Use some different algorithms in the select_model() function, like support vector machines, stochastic gradient descent or perceptron linear models
• Experiment with RandomizedSearchCV instead of GridSearchCV to speed up your select_model() function
127
You can continue to work on this Kaggle competition within this guided project environment and
save out files for submission if you like, although we would encourage you to set up your own Python
environment so that you can work on your own computer. We have a Python Installation Guide that
walks you through how to do this.
Lastly, while the Titanic competition is great for learning about how to approach your first Kaggle
competition, we recommend against spending many hours focused on trying to get to the top of the
leaderboard. With such a small data set, there is a limit to how good your predictions can be, and your
time would be better spent moving onto more complex competitions.
Once you feel like you have a good understanding of the Kaggle workflow, you should look at some
other competitions - a great next competition is the House Prices Competition. We have a great tutorial
for getting started with this competition on our blog.
Curious to see what other students have done on this project? Head over to our Community to check
them out. While you are there, please remember to show some love and give your own feedback!
And of course, we welcome you to share your own project and show off your hard work. Head over to
our Community to share your finished Guided Project!
128
IV. TensorFlow Concepts
129
IV.1 Presentation of TensorFlow
Many years ago, deep learning started to exceed all other machine learning algorithms when given extensive data. Google saw that it could use these deep neural networks to upgrade its services, so it built a framework called TensorFlow to permit researchers and developers to work together on AI models; once a model is approved and scaled, it allows many people to use it.
TensorFlow was first released in 2015, and the first stable version arrived in 2017. It is an open-source platform under the Apache open-source license: we can use it, modify it, and redistribute the revised version for free, without paying anything to Google.
130
IV.1.2 Components of TensorFlow
IV.1.3 Advantages
• It was designed to run on multiple CPUs or GPUs and on mobile operating systems
• The portability of the graph allows the computations to be saved for current or later use; a saved graph can be executed in the future
• All the computation in the graph is done by connecting tensors together
Consider the following expression: a = (b + c) * (c + 2). We can break the expression into the following components:
d=b+c
e=c+2
a=d*e
131
A session can execute the operations from the graph. To feed the graph with the value of a tensor, we need to open a session. Inside a session, we must run an operator to create an output.
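A minimal sketch of this graph-and-session style, written against the TensorFlow 1.x compatibility API (TensorFlow 2.x runs eagerly by default):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

b = tf.compat.v1.placeholder(tf.float32, name="b")
c = tf.compat.v1.placeholder(tf.float32, name="c")

d = b + c          # d = b + c
e = c + 2.0        # e = c + 2
a = d * e          # a = (b + c) * (c + 2)

with tf.compat.v1.Session() as sess:
    result = sess.run(a, feed_dict={b: 3.0, c: 1.0})
    print(result)  # (3 + 1) * (1 + 2) = 12.0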
TensorFlow is a popular library because it is accessible to everyone, and it integrates different APIs to create deep learning architectures at scale, such as CNNs (Convolutional Neural Networks) or RNNs (Recurrent Neural Networks).
TensorFlow is based on graph computation; it allows the developer to visualize the construction of the neural network with TensorBoard, a tool that helps debug the program. TensorFlow runs on the CPU (Central Processing Unit) and the GPU (Graphics Processing Unit).
132
It is mainly used for deep learning or machine learning problems such as classification, perception, understanding, discovering, prediction and creation.
Voice and sound recognition applications are among the best-known use cases of deep learning. Given a proper input data feed, neural networks are capable of understanding audio signals.
For example:
• Voice recognition is used in the Internet of Things, automotive, security, and UX/UI
• Sentiment Analysis is mostly used in customer relationship management (CRM)
• Flaw Detection (engine noise) is mostly used in automotive and Aviation
• Voice search is mostly used in customer relationship management (CRM)
133
Image Recognition
Image recognition is the first application that made deep learning and machine learning popular.
Telecom, Social Media, and handset manufacturers mostly use image recognition. It is also used for
face recognition, image search, motion detection, machine vision, and photo clustering.
For example, image recognition is used to recognize and identify people and objects in images, and to understand the context and content of any image.
For object recognition, TensorFlow helps to classify and identify arbitrary objects within larger images. This is also used in engineering applications to identify shapes for modeling purposes (3D reconstruction from 2D images) and by Facebook for photo tagging.
For example, a deep learning model built with TensorFlow can analyze thousands of photos of cats and learn to identify a cat, because the algorithm learns the general features of objects, animals, or people.
Time Series
Deep learning uses time series algorithms to examine time series data and extract meaningful statistics; for example, it has been used to predict the stock market.
Recommendation is the most common use case for time series. Amazon, Google, Facebook, and Netflix use deep learning for suggestions: the algorithm analyzes customer activity and compares it to that of millions of other users to determine what the customer may like to purchase or watch.
For example, it can be used to recommend TV shows or movies based on the TV shows or movies we have already watched.
Video Detection
The deep learning algorithm is used for video detection. It is used for motion detection, real-time
threat detection in gaming, security, airports, and UI/UX field.
For example, NASA is developing a deep learning network for object clustering of asteroids and orbit
classification. So, it can classify and predict NEOs (Near Earth Objects).
134
Text-Based Applications
Text-based application is also a popular deep learning algorithm. Sentimental analysis, social media,
threat detection, and fraud detection, are the example of Text-based applications.
Some companies who are currently using TensorFlow are Google, AirBnb, eBay, Intel, DropBox, Deep
Mind, Airbus, CEVA, Snapchat, SAP, Uber, Twitter, Coca-Cola, and IBM.
TensorFlow has an interactive multiplatform programming interface which is scalable and reliable
compared to other deep learning libraries which are available.
135
Responsive Construct
We can visualize each part of the graph, which is not an option while using NumPy or scikit-learn. To develop a deep learning application, only two or three components are required, along with a programming language.
Flexible
Flexibility is one of the essential TensorFlow features: the library is modular, and the parts of it that we want to use can be made standalone.
Easily Trainable
TensorFlow is easily trainable on CPU as well as GPU for distributed computing.
Large Community
Google developed TensorFlow, and a large team of software engineers already works continuously on stability improvements.
Open Source
The best thing about this machine learning library is that it is open source, so anyone can use it as long as they have internet connectivity. People can manipulate the library and come up with a fantastic variety of useful products, and it has become a DIY community with a massive forum for people getting started with it and for those who find it hard to use.
Feature Columns
TensorFlow has feature columns which could be thought of as intermediates between raw data and
estimators; accordingly, bridging input data with our model.
136
Availability of Statistical Distributions
This library provides distribution functions including Bernoulli, Beta, Chi2, Uniform and Gamma, which are essential, especially when considering probabilistic approaches such as Bayesian models.
Layered Components
TensorFlow produces layered operations of weights and biases from functions such as tf.contrib.layers, and also provides batch normalization, convolution layers, and dropout layers. tf.contrib.layers.optimizers includes optimizers such as Adagrad, SGD and Momentum, which are often used to solve optimization problems in numerical analysis.
We can inspect different representations of a model and make the necessary changes while debugging it with the help of TensorBoard. This is just like UNIX, where we use tail -f to monitor the output of tasks at the command line: TensorBoard checks logging events and summaries from the graph and its execution.
137
IV.2 TensorFlow Basics
IV.2.1 Single Layer Perceptron in TensorFlow
The perceptron uses a step function that returns +1 if the weighted sum of its inputs is greater than or equal to 0, and -1 otherwise.
The activation function is used to map the input to a required range of values such as (0, 1) or (-1, 1).
138
The perceptron consists of 4 parts.
1. Input value or One input layer: The input layer of the perceptron is made of artificial input neurons
and takes the initial data into the system for further processing
2. Weights and Bias
• Weight: It represents the dimension or strength of the connection between units. If the weight from node 1 to node 2 is larger, then neuron 1 has a more considerable influence on neuron 2
• Bias: It is the same as the intercept added in a linear equation. It is an additional parameter whose task is to modify the output along with the weighted sum of the inputs to the next neuron
3. Net sum: It calculates the total sum
4. Activation Function: Whether a neuron is activated or not is determined by an activation function. The activation function takes the weighted sum, adds the bias to it, and produces the result
139
How does it work?
The perceptron works on these simple steps, which are given below:
A. In the first step, all the inputs x are multiplied with their weights w
B. In this step, add all the multiplied values and call the result the weighted sum
C. In our last step, apply the weighted sum to the correct activation function. For example:
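A small NumPy sketch of these three steps, with hypothetical inputs and weights:

import numpy as np

def step(z):
    # Step activation: +1 if the weighted sum is greater than or equal to 0, otherwise -1.
    return 1 if z >= 0 else -1

def perceptron_output(x, w, b):
    weighted_sum = np.dot(x, w) + b  # steps A and B: multiply inputs by weights and sum
    return step(weighted_sum)        # step C: apply the activation function

x = np.array([1.0, 0.5])     # hypothetical inputs
w = np.array([0.4, -0.2])    # hypothetical weights
print(perceptron_output(x, w, b=-0.1))   # prints 1, since 0.4 - 0.1 - 0.1 >= 0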
There are two types of architecture. These types focus on the functionality of artificial neural networks
as follows:
The single-layer perceptron was the first neural network model, proposed in 1958 by Frank Rosenblatt.
It is one of the earliest models for learning. Our goal is to find a linear decision function measured by
the weight vector w and the bias parameter b.
To understand the perceptron layer, it is necessary to comprehend artificial neural networks (ANNs).
140
The artificial neural network (ANN) is an information processing system, whose mechanism is inspired
by the functionality of biological neural circuits. An artificial neural network consists of several
processing units that are interconnected.
The single-layer perceptron was the first proposed neural model. The content of the neuron's local memory consists of a vector of weights. The output of the perceptron is calculated by summing the input vector, with each component multiplied by the corresponding element of the weight vector; the value obtained is then passed to an activation function, whose output is the final result.
Let us focus on the implementation of a single-layer perceptron for an image classification problem
using TensorFlow. The best example of drawing a single-layer perceptron is through the representation
of "logistic regression."
• The weights are initialized with random values at the start of training
• For each element of the training set, the error is calculated as the difference between the desired output and the actual output; the calculated error is used to adjust the weights
• The process is repeated until the error made on the entire training set is less than the specified limit or until the maximum number of iterations has been reached
141
142
Complete code of Single layer perceptron
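The course's screenshot of the complete code is not reproduced here; the following is a minimal sketch of a single-layer (softmax/logistic regression) classifier on MNIST, written against the TensorFlow 1.x compatibility API:

import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Load MNIST via Keras, flatten the 28x28 images and one-hot encode the labels.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
y_train = np.eye(10)[y_train]

x = tf.compat.v1.placeholder(tf.float32, [None, 784])
y = tf.compat.v1.placeholder(tf.float32, [None, 10])

# A single layer of weights and biases followed by softmax: logistic regression.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
logits = tf.matmul(x, W) + b

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_step = tf.compat.v1.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    for _ in range(1000):
        batch = np.random.choice(len(x_train), 100)
        sess.run(train_step, feed_dict={x: x_train[batch], y: y_train[batch]})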
143
The output of the Code:
Logistic regression is considered a form of predictive analysis. It is mainly used to describe data and to explain the relationship between a dependent binary variable and one or more nominal or independent variables.
IV.2.2 Hidden Layers of Perceptron in TensorFlow
A hidden layer is a layer of an artificial neural network that sits between the input layer and the output layer, where the artificial neurons take in a set of weighted inputs and produce an output through an activation function. It is a part of nearly every neural network, in which engineers simulate the types of activity that go on in the human brain.
Hidden layers are set up in several ways. In many cases, the weighted inputs are randomly assigned; they are then fine-tuned and calibrated through a process called backpropagation.
144
The artificial neuron in the hidden layer of a perceptron works like a biological neuron in the brain: it takes in its probabilistic input signals, works on them, and converts them into an output corresponding to the biological neuron's axon.
Layers after the input layer are called hidden because they are not directly exposed to the input. The simplest network structure is to have a single neuron in the hidden layer that directly outputs the value.
Deep learning can refer to having many hidden layers in our neural network. They are deep because historically they would have been unimaginably slow to train, but they may take only seconds or minutes to train using modern techniques and hardware.
The code for the hidden layers of the perceptron is shown below:
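The course's screenshot is not reproduced here; the sketch below adds one hidden layer to the single-layer model above (feeding the training data works exactly as in the previous sketch):

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

x = tf.compat.v1.placeholder(tf.float32, [None, 784])
y = tf.compat.v1.placeholder(tf.float32, [None, 10])

n_hidden = 256
W1 = tf.Variable(tf.random.normal([784, n_hidden], stddev=0.1))
b1 = tf.Variable(tf.zeros([n_hidden]))
W2 = tf.Variable(tf.random.normal([n_hidden, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))

hidden = tf.nn.relu(tf.matmul(x, W1) + b1)   # hidden layer with ReLU activation
logits = tf.matmul(hidden, W2) + b2          # output layer

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_step = tf.compat.v1.train.AdamOptimizer(0.001).minimize(cross_entropy)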
145
146
147
IV.2.3 Artificial Neural Network in TensorFlow
A neural network, or artificial neural network (ANN), is modeled on the human brain. The human brain has a mind to think and to analyze any task in a particular situation. But how can a machine think like that? For this purpose, an artificial brain was designed, which is known as a neural network. A neural network is made up of many perceptrons.
A perceptron is a single-layer neural network. It is a binary classifier and part of supervised learning; a simple model of the biological neuron in an artificial neural network is known as the perceptron. An ANN is a computing system made of several simple, highly interconnected processing elements, which process information through their dynamic state response to external inputs.
A neural network can be made with multiple perceptrons, arranged in three kinds of layers:
• Input layer: the input layer receives the real values from the data
• Hidden layer: hidden layers sit between the input and output layers; a network with three or more hidden layers is considered a deep network
• Output layer: the final estimate of the output
148
IV.2.4 Types of Artificial Neural Network
149
IV.2.5 Feedforward Neural Network (FNN)
A feedforward neural network has a forward-propagating wave that is achieved by using a classifying activation function. Unlike most other types of neural networks, FNN does not use backpropagation. In FNN, the sum of the products of the inputs and weights is calculated and then fed to the output. Technologies such as face recognition and computer vision use FNN.
IV.2.6 Radial Basis Function Neural Network (RBFNN)
RBFNN finds the distance of a point to the centre and considers it so that it works smoothly. There are two layers in the RBF neural network. In the inner layer, the features are combined with the radial basis function, and the output of the features is used in the computation. Measures other than Euclidean distance can also be used.
• We define a receptor t
• Confronted maps are drawn around the receptor
• For RBF, Gaussian functions are generally used, so we can define the radial distance r = ||X - t||
This neural network is used in power restoration systems. In the present era, power systems have increased in size and complexity, and both factors increase the risk of major power outages. Power needs to be restored as quickly and reliably as possible after a blackout.
150
IV.2.7 Multilayer Perceptron
A multilayer perceptron has three or more layers. It is used to classify data that cannot be separated linearly. It is a fully connected network, which means every single node is connected to all the nodes in the next layer. A nonlinear activation function is used in the multilayer perceptron, and its input and output layer nodes are connected as a directed graph. It is a deep learning method, so it uses backpropagation to train the network. It is extensively applied in speech recognition and machine translation technologies.
151
IV.2.8 Convolutional Neural Network
In image classification and image recognition, the Convolutional Neural Network plays a vital role; we can say it is the main category for those tasks. Face recognition, object detection, etc., are some areas where CNNs are widely used. A CNN is similar to an FNN in that learnable weights and biases are available in its neurons.
A CNN takes an image as input, which is classified and processed under a certain category such as dog, cat, lion or tiger. The computer sees an image as pixels, depending on the resolution of the picture. Based on the image resolution, it will see h * w * d, where h = height, w = width and d = depth (number of channels). For example, an RGB image is a 6 * 6 * 3 array of values, while a grayscale image is a 4 * 4 * 1 array.
In a CNN, each input image passes through a sequence of convolution layers along with pooling layers, fully connected layers and filters (also known as kernels). A softmax function is then applied to classify the object with probabilistic values between 0 and 1.
IV.2.9 Recurrent Neural Network
The Recurrent Neural Network is based on prediction. In this neural network, the output of a particular layer is saved and fed back to the input, which helps to predict the outcome of the layer. In a Recurrent Neural Network, the first layer is formed in the same way as an FNN's layer, and in the subsequent layers the recurrent process begins.
In most networks, inputs and outputs are independent of each other, but in some cases we need to predict the next word of a sentence, which depends on the previous words of the sentence. RNN is famous for its primary and most important feature, the hidden state, which remembers information about a sequence.
152
RNN has a memory to store the result after calculation. RNN uses the same parameters on each input to perform the same task on all the hidden layers or data to produce the output. Unlike other neural networks, RNN parameter complexity is lower.
IV.2.10 Modular Neural Network
In a Modular Neural Network, several different networks are functionally independent. In an MNN, the task is divided into sub-tasks that are performed by several networks. During the computational process, the networks do not communicate directly with each other; all the interfaces work independently towards achieving the output. Combined networks are more powerful than flat and unrestricted ones. An intermediary takes the output of each network and processes it to produce the final output.
153
IV.2.11 Sequence to Sequence Network
It consists of two recurrent neural networks: an encoder processes the input and a decoder processes the output. The encoder and decoder can use either the same or different parameters.
Sequence-to-sequence models are applied in chatbots, machine translation, and question answering
systems.
Neurons
Artificial neurons are similar to biological neurons; in practice, a neuron is nothing but an activation function. An artificial neuron, or activation function, has a "switch on" characteristic when it performs a classification task: when the input is higher than a specific value, the output changes state, e.g., from 0 to 1 or from -1 to 1. The sigmoid function is a commonly used activation function in artificial neural networks.
In a neural network, we predict the output (y) based on the given input (x). We create a model, i.e., y = mx + c, which helps us predict the output. When we train the model, it finds the appropriate values of the constants m and c itself.
The constant c is the bias. Bias helps a model fit best for the given data; we can say bias gives the model the freedom to fit the data well.
Algorithm
Algorithms are required to train a neural network. Biological neurons have self-understanding and working capability, but how will an artificial neuron work in the same way? For this, it is necessary to train our artificial neural network, and for this purpose many algorithms are used. Each algorithm has a different way of working.
154
IV.3 Classification of Neural Network in TensorFlow
Artificial neural networks are computational models inspired by biological neural networks, and they are composed of a large number of highly interconnected processing elements called neurons.
An ANN (Artificial Neural network) is configured for a specific application, such as pattern recognition
or data classification.
It extracts patterns and detects trends that are too complex to be noticed by either humans or other
computer techniques.
The behavior of ANN (Artificial Neural Network) depends on both the weights and the input-output
function, which is specified for the unit. This function falls into one of these three categories:
Linear units: The output activity is proportional to the total weighted output in linear units.
155
Threshold: The output is set at one of two levels, depending on whether the total input is greater than
or less than some threshold value.
Sigmoid units: The output varies continuously but not linearly as the input changes. Sigmoid units bear
a more considerable resemblance to real neurons than do linear or threshold units, but all three must
be considered rough approximations.
First, we create an activation function so that we can plot it: the sigmoid function, an effortless activation function that takes in z and returns sigmoid(z).
156
Then we make the operation that applies the sigmoid. Let's look at a classification example: scikit-learn has a helpful function, make_blobs(), to create a dataset for us. We set our data equal to the result of make_blobs(), which creates a couple of blobs that we can classify. We create 50 samples with two features, producing two blobs, so this is just a binary classification problem.
157
Now we create a scatterplot of the features (all the rows of one column against the other). The scatterplot shows two distinct blobs, and we should be able to classify these two highly separable classes.
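A minimal sketch of this example using scikit-learn and NumPy (the weight matrix of ones matches the description that follows):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# 50 samples, 2 features, 2 blobs: a simple binary classification problem.
features, labels = make_blobs(n_samples=50, n_features=2, centers=2, random_state=75)

plt.scatter(features[:, 0], features[:, 1], c=labels, cmap="coolwarm")
plt.show()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.ones((1, 2))                       # a 1x2 weight matrix of ones
b = 0.0
z = features @ w.T + b                    # weighted sum for every sample
predictions = (sigmoid(z) > 0.5).astype(int).ravel()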
158
159
Here, we build a weight matrix of ones with shape one by two. We then pass the result into our sigmoid function, sigmoid(z), because its output lets us classify each sample as 0 or 1, based on whether the input is positive or negative.
The more positive the input, the more sure our model is that the sample belongs to the one class.
IV.4 Linear Regression in TensorFlow
Linear regression is a machine learning algorithm that is based on supervised learning. It performs a regression task: the regression model predicts a target value based on independent variables. It is mostly used for finding the relationship between variables and for forecasting.
Linear regression is a linear model; for example, a model that assumes a linear relationship between
an input variable (x) and a single output variable (y). In particular, y can be calculated by a linear
combination of input variables (x).
Linear regression is a prevalent statistical method that allows us to learn a function or relation from a set of continuous data. For example, we are given some data points of x and the corresponding y values, and we need to learn the relationship between them, which is called the hypothesis.
In the case of linear regression, the hypothesis is a straight line, that is,
h (x) = wx + b
160
Where w is a vector called weight, and b is a scalar called Bias. Weight and bias are called parameters
of the model.
We need to estimate the values of w and b from the data such that the resulting hypothesis produces the least cost J, which is defined by a cost function such as the mean squared error.
To optimize the parameters so that the value of J is minimal, we will use a commonly used optimizer algorithm called gradient descent. The following is pseudocode for gradient descent:
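The course's pseudocode is not reproduced here; the following NumPy sketch shows the idea, using the mean-squared-error cost and hypothetical noisy data:

import numpy as np

def gradient_descent(x, y, learning_rate=0.01, epochs=1000):
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        error = (w * x + b) - y                      # h(x) - y
        cost = np.sum(error ** 2) / (2 * n)          # J, shown only to mirror the cost above
        w -= learning_rate * np.sum(error * x) / n   # step along dJ/dw
        b -= learning_rate * np.sum(error) / n       # step along dJ/db
    return w, b

x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + np.random.randn(50)              # noisy line (hypothetical data)
print(gradient_descent(x, y))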
We will start to import the necessary libraries in Tensorflow. We will use Numpy with Tensorflow for
computation and Matplotlib for plotting purposes.
161
Now, we generate some random data for training the linear regression model.
Next, we start building our model by defining placeholders x and y, so that we can feed the training examples into the optimizer during the training process.
We then declare two trainable TensorFlow variables for the weight and the bias, initializing them randomly.
162
Now we define the hyperparameters of the model: the learning rate and the number of epochs.
Next, we build the hypothesis, the cost function, and the optimizer. We will not implement the gradient descent optimizer manually because it is built into TensorFlow. After that, we initialize the variables.
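The course's code screenshots are not reproduced here; the following is a minimal end-to-end sketch of those steps, written against the TensorFlow 1.x compatibility API (the data, learning rate and epoch count are hypothetical):

import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Random training data.
x_data = np.linspace(0, 50, 50)
y_data = np.linspace(0, 50, 50) + np.random.uniform(-4, 4, 50)
n = len(x_data)

# Placeholders for the training examples.
X = tf.compat.v1.placeholder(tf.float32)
Y = tf.compat.v1.placeholder(tf.float32)

# Trainable weight and bias, randomly initialized.
W = tf.Variable(np.random.randn(), dtype=tf.float32, name="weight")
b = tf.Variable(np.random.randn(), dtype=tf.float32, name="bias")

# Hyperparameters.
learning_rate = 0.01
epochs = 1000

# Hypothesis, cost function and the built-in gradient descent optimizer.
y_pred = W * X + b
cost = tf.reduce_sum(tf.square(y_pred - Y)) / (2 * n)
optimizer = tf.compat.v1.train.GradientDescentOptimizer(learning_rate).minimize(cost)

init = tf.compat.v1.global_variables_initializer()
with tf.compat.v1.Session() as sess:
    sess.run(init)
    for epoch in range(epochs):
        for xi, yi in zip(x_data, y_data):
            sess.run(optimizer, feed_dict={X: xi, Y: yi})
        if (epoch + 1) % 50 == 0:
            c = sess.run(cost, feed_dict={X: x_data, Y: y_data})
            print("Epoch:", epoch + 1, "cost =", c, "W =", sess.run(W), "b =", sess.run(b))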
163
Output is given below:
Epoch: 50 cost = 5.8868037 W = 0.9951241 b = 1.2381057
Epoch: 100 cost = 5.7912708 W = 0.9981236 b = 1.0914398
Epoch: 150 cost = 5.7119676 W = 1.0008028 b = 0.96044315
Epoch: 200 cost = 5.6459414 W = 1.0031956 b = 0.8434396
Epoch: 250 cost = 5.590798 W = 1.0053328 b = 0.7389358
Epoch: 300 cost = 5.544609 W = 1.007242 b = 0.6455922
Epoch: 350 cost = 5.5057884 W = 1.008947 b = 0.56223
Epoch: 400 cost = 5.473068 W = 1.01047 b = 0.46775345
Epoch: 450 cost = 5.453845 W = 1.0118302 b = 0.42124168
Epoch: 500 cost = 5.421907 W = 1.0130452 b = 0.36183489
Epoch: 550 cost = 5.4019218 W = 1.0141305 b = 0.30877414
Epoch: 600 cost = 5.3848578 W = 1.0150996 b = 0.26138115
Epoch: 650 cost = 5.370247 W = 1.0159653 b = 0.21905092
Epoch: 700 cost = 5.3576995 W = 1.0167387 b = 0.18124212
Epoch: 750 cost = 5.3468934 W = 1.0174294 b = 0.14747245
Epoch: 800 cost = 5.3375574 W = 1.0180461 b = 0.11730932
Epoch: 850 cost = 5.3294765 W = 1.0185971 b = 0.090368526
Epoch: 900 cost = 5.322459 W = 1.0190894 b = 0.0663058
Epoch: 950 cost = 5.3163588 W = 1.0195289 b = 0.044813324
Epoch: 1000 cost = 5.3110332 W = 1.0199218 b = 0.02561669
Output
Training cost= 5.3110332 Weight= 1.0199214 bias=0.02561663
Note that in this case both the weight and the bias are scalars. This is because we have used only one independent variable (feature) in our training data. If there are m independent variables in our training dataset, the weight will be a one-dimensional vector of length m while the bias will remain a scalar.
164
V. Keras Basis
165
Keras is an open-source, high-level neural network library written in Python that is capable of running on top of Theano, TensorFlow, or CNTK. It was developed by a Google engineer, Francois Chollet. It is user-friendly, extensible, and modular, facilitating faster experimentation with deep neural networks. It supports not only Convolutional Networks and Recurrent Networks individually but also their combination.
Keras does not handle low-level computations itself; instead, it makes use of a backend library. The backend library acts as a high-level API wrapper for the low-level API, which lets Keras run on TensorFlow, CNTK, or Theano.
V.1 Keras Layers
Focus on user experience has always been a major part of Keras.
Keras, being a model-level library, helps in developing deep learning models by offering high-level building blocks. All the low-level computations, such as tensor products and convolutions, are not handled by Keras itself; rather, they depend on a specialized tensor manipulation library that is well optimized to serve as a backend engine. Keras has managed this so well that, instead of incorporating one single tensor library and performing operations specific to that library, it offers the ability to plug different backend engines into Keras.
166
TensorFlow
Theano
CNTK
167
V.1.2 Keras Convolution Neural Network Layers and Working
A CNN can learn the characteristics of an image and perform classification. An input image has many spatial and temporal dependencies, and a CNN captures these characteristics using relevant filters (kernels). A kernel, or filter, is the element in a CNN that performs the convolution over the image. The kernel moves to the right and shifts according to the stride value, and every time it moves, a matrix multiplication operation is performed.
After convolution, we obtain another image with a different height, width, and depth: we obtain more channels than just RGB, but less width and height. We slide each filter throughout the image step by step; this step size in the forward pass is called the stride.
V.1.3 Keras Convolution Layer
It is the first layer, used to extract features from the input image. Here we define the kernel as a layer parameter, and we perform matrix multiplication operations on the input image using the kernel.
Example:
Suppose a 3*3 image pixel and a 2*2 filter as shown:
pixel : [[1,0,1],
[0,1,0],
[1,0,1]]
filter : [[1,0],
[0,1]]
The resultant matrix after convolution with the filter would be:
[[2,0],
[0,2]]
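A small NumPy sketch that reproduces this arithmetic (slide the 2x2 kernel with stride 1 and sum the element-wise products):

import numpy as np

pixel = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])
kernel = np.array([[1, 0],
                   [0, 1]])

out = np.zeros((2, 2), dtype=int)
for i in range(2):
    for j in range(2):
        # Element-wise product of the kernel with the 2x2 patch, then sum.
        out[i, j] = np.sum(pixel[i:i + 2, j:j + 2] * kernel)

print(out)  # [[2 0]
            #  [0 2]]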
168
V.1.4 Keras Pooling Layer
After convolution, we perform pooling to reduce the number of parameters and computations. There are different types of pooling operations; the most common ones are max pooling and average pooling.
Example:
Take a sample case of max pooling with 2*2 filter and stride 2.
Image pixels:
[[1,2,3,4],
[5,6,7,8],
[3,4,5,6],
[6,7,8,9]]
The resultant matrix after max-pooling would be:
[[6,8],
[7,9]]
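A small NumPy sketch that reproduces this max-pooling arithmetic:

import numpy as np

image = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [3, 4, 5, 6],
                  [6, 7, 8, 9]])

# 2x2 max pooling with stride 2: take the maximum of each non-overlapping 2x2 block.
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 8]
               #  [7 9]]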
169
V.1.5 Keras Dropout Layer
The dropout layer randomly sets a fraction of its input units to zero during training, which helps reduce overfitting.
V.1.6 Keras Flatten Layer
It is used to convert the data into 1D arrays to create a single feature vector. After flattening, we forward the data to a fully connected layer for the final classification.
170
V.1.7 Keras Dense Layer
It is a fully connected layer: each node in this layer is connected to every node in the previous layer, i.e., it is densely connected. This layer is used at the final stage of a CNN to perform classification.
The CIFAR-10 dataset consists of 10 image classes. The available image classes are:
• Car
• Airplane
• Bird
• Cat
• Deer
• Dog
• Frog
• Horse
• Ship
• Truck
This is one of the most popular datasets that allow researchers to practice different algorithms for
object recognition. Convolution Neural Networks have shown the best results in solving the CIFAR-10
problem.
171
Let’s build our Convolution model to recognize CIFAR-10 classes.
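The course's step-by-step screenshots are not reproduced here; the following is a minimal sketch of one such model using the layers described above (the architecture and epoch count are illustrative choices, not the course's exact ones):

from tensorflow import keras

# Load and normalize CIFAR-10, and one-hot encode the labels.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# Conv2D + MaxPooling2D blocks, then Dropout, Flatten and Dense layers.
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Dropout(0.25),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
model.fit(x_train, y_train, epochs=10, batch_size=64, validation_split=0.1)
print(model.evaluate(x_test, y_test))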
172
4. Normalizing inputs
173
8. Analyzing Model Summary
174
10. Evaluate the model
The Fashion MNIST dataset consists of a training set of 60000 images and a testing set of 10000
images. There are 10 image classes in this dataset and each class has a mapping corresponding to the
following labels:
• T-shirt/top
• Trouser
• Pullover
• Dress
• Coat
• Sandals
• Shirt
• Sneaker
• Bag
• Ankle boot
175
3. Reshaping and one hot encoding
5. Normalizing data
176
6. Build the model
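A minimal sketch of these Fashion MNIST steps (reshape, one-hot encode, normalize, build and train); the architecture is an illustrative choice:

from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Reshape to add the single grayscale channel, normalize, and one-hot encode the labels.
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_data=(x_test, y_test))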
177
V.2 Deep Learning with Keras Implementation and Example
In this Keras section, we will walk through deep learning with Keras and an important deep learning algorithm used in Keras. We will study the applications of this algorithm and also its implementation in Keras.
Deep Learning is a subset of machine learning which concerns the algorithms inspired by the
architecture of the brain. In the last decade, there have been many major developments to support
deep learning research. Keras is the result of one of these recent developments which allow us to
define and create neural network models in a few lines of code.
There has been a boom in the research of Deep Learning algorithms. Keras ensures the ease of users
to create these algorithms.
But before we begin with this TensorFlow Keras deep learning section, let us install Keras (typically via pip). Keras provides support for a range of deep learning algorithms, including:
• Auto-Encoders
• Convolution Neural Nets
• Recurrent Neural Nets
• Long Short Term Memory Nets
• Deep Boltzmann Machine(DBM)
• Deep Belief Nets(DBN)
Keras provides ready implementations of convolutional neural nets, recurrent neural nets, and LSTMs.
Auto-Encoders
These types of neural networks are able to compress the input data and reconstruct it again; they are among the older deep learning algorithms. An Auto-Encoder encodes the input down to a bottleneck layer and then decodes it to get the input back. At the bottleneck layer, we get a compressed form of the input.
Anomaly detection and image denoising are a few of the major applications of Auto-Encoders.
178
Types of Auto-Encoders
There are seven types of deep learning auto encoders as mentioned below:
• Denoising autoencoders
• Deep autoencoders
• Sparse autoencoders
• Contractive autoencoders
• Convolutional autoencoders
• Variational autoencoders
• Undercomplete autoencoders
For the purpose of its implementation in Keras, we will work on the MNIST handwritten digit dataset. First, we will introduce some noise into the MNIST images. Then we will create an Auto-Encoder for removing the noise from the images and reconstructing the original images.
179
4. Introducing noise in MNIST images using Gaussian distribution
180
6. Specify input layer and create model
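The course's screenshots are not reproduced here; the following is a minimal sketch of a dense denoising Auto-Encoder on MNIST (noise level, bottleneck size and epoch count are illustrative choices):

import numpy as np
from tensorflow import keras

# Load MNIST, flatten and scale to [0, 1].
(x_train, _), (x_test, _) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Introduce Gaussian noise and clip back into [0, 1].
noise = 0.5
x_train_noisy = np.clip(x_train + noise * np.random.normal(size=x_train.shape), 0.0, 1.0)
x_test_noisy = np.clip(x_test + noise * np.random.normal(size=x_test.shape), 0.0, 1.0)

# Input layer, a bottleneck encoder, and a decoder that reconstructs the 784 pixels.
inputs = keras.Input(shape=(784,))
encoded = keras.layers.Dense(64, activation="relu")(inputs)
decoded = keras.layers.Dense(784, activation="sigmoid")(encoded)
autoencoder = keras.Model(inputs, decoded)

autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
autoencoder.fit(x_train_noisy, x_train, epochs=10, batch_size=256,
                validation_data=(x_test_noisy, x_test))

reconstructed = autoencoder.predict(x_test_noisy)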
181
10. Again visualize the reconstructed images
You can see that our Auto-Encoder is able to reconstruct the images and remove their noise. We would get better quality if we increased the number of training epochs.
To conclude, we have seen deep learning with Keras, with implementation and examples. This section concerns the Keras library and its support for deploying major deep learning algorithms. It also introduces you to Auto-Encoders, their different types, their applications, and their implementation, and explains how to build a neural network for removing noise from our data.
182
V.3 Keras Vs Tensorflow – Difference Between Keras and Tensorflow
Keras and Tensorflow are two very popular deep learning frameworks. Deep Learning practitioners
most widely use Keras and Tensorflow. Both of these frameworks have large community support.
Both of these frameworks capture a major fraction of deep learning production.
There are some differences between Keras and Tensorflow, which will help you choose between the
two. We will provide you better insights on both these frameworks.
The following points compare TensorFlow and Keras to help you find which one is more suitable for you.
Complexity
Keras is a high-level API; it uses either TensorFlow, Theano, or CNTK as its backend engine. This makes Keras more comfortable to use and less complex than TensorFlow, which provides both high- and low-level APIs. TensorFlow is a math library that uses dataflow programming for a wide variety of tasks.
If you are looking for a neural network tool that is easy to use and has simple syntax, then you will find Keras more favorable.
183
Fast Development
If you want to quickly deploy and test your deep learning models, choose Keras. Using Keras, you can create your models with very few lines of code and within a few minutes. Keras provides two APIs to write your neural network. These are:
• Model (functional API)
• Sequential
With these APIs, you can easily create any complex neural network.
Performance
Since Keras is not directly responsible for the backend computation, Keras is slower. Keras depends upon its backend engines for computation tasks and provides an abstraction over its backend: to perform the underlying computations and training, Keras calls its backend.
On the other hand, Tensorflow is a symbolic math library. Its complex architecture focuses on reducing
cognitive load for computation. Hence, Tensorflow is fast and provides high performance.
TensorFlow gives you more flexibility, more control, and advanced features for the creation of complex topologies; it provides more control over your network. Therefore, if you want to define your own cost function, metric, or layer, or if you want to perform operations on input weights or gradients, choose TensorFlow.
Dataset
We prefer Keras if the dataset is relatively small or of medium size, while if the dataset is large we prefer TensorFlow because of its lower overhead. TensorFlow also provides more control, so we have more options for handling large datasets.
TensorFlow provides a greater number of built-in datasets than Keras: it contains all the datasets that are available in Keras, and the tf.datasets module of TensorFlow contains a wide range of datasets classified under the following headings:
Audio, Image, Image classification, object detection, question answering, structured, summarization,
text, translate, and video.
184
Debug
Debugging TensorFlow code is relatively difficult. In general, we perform debugging with the TensorFlow debugger through the command line: we start by wrapping the TensorFlow session with tf_debug.LocalCLIDebugWrapperSession(session), and then we execute the file with the necessary debug flags.
Keras is high level and does not deal with backend computation, therefore debugging is easy. We can
also check the output from each layer in Keras using keras.backend.function().
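A small sketch of that idea, using a hypothetical model (behaviour can vary between Keras/TensorFlow versions, so treat this as illustrative):

import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(4,), name="hidden"),
    keras.layers.Dense(2, activation="softmax", name="output"),
])

# Build a callable from the model input to an intermediate layer's output.
get_hidden = keras.backend.function([model.input],
                                    [model.get_layer("hidden").output])

x = np.random.rand(3, 4).astype("float32")
print(get_hidden([x])[0].shape)  # (3, 8)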
Popularity
Keras has 48.7k stars and 18.4k forks on GitHub, whereas TensorFlow has 146k stars and 81.7k forks on GitHub. Since both Keras and TensorFlow were released in 2015, it's clear that TensorFlow has the larger developer community.
Other than the above factors, you should be aware that TensorFlow also provides support for Keras. TensorFlow provides the tf.keras sub-module, which allows you to drop TensorFlow code directly into Keras models. You can obtain the features of both Keras and TensorFlow using tf.keras, i.e., you can get the best of both worlds.
The code below describes how tf.keras can be used to create your models:
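A minimal sketch of a model built through tf.keras (the layer sizes are illustrative):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()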
185
VI. References
187
Books
• https://www.pdfdrive.com/introduction-to-deep-learning-using-r-a-step-by-step-guide-to-learning-and-implementing-deep-learning-models-using-r-e158252417.html
• https://www.pdfdrive.com/learn-keras-for-deep-neural-networks-a-fast-track-approach-to-modern-deep-learning-with-python-e185770502.html
• https://www.pdfdrive.com/applied-deep-learning-a-case-based-approach-to-understanding-deep-neural-networks-e176380114.html
• https://www.pdfdrive.com/deep-learning-adaptive-computation-and-machine-learning-e176370174.html
• https://www.pdfdrive.com/deep-learning-in-python-master-data-science-and-machine-learning-with-modern-neural-networks-written-in-python-theano-and-tensorflow-e196480537.html
• https://www.pdfdrive.com/deep-learning-with-python-e54511249.html
• https://www.pdfdrive.com/learning-tensorflow-a-guide-to-building-deep-learning-systems-e158557113.html
• https://www.pdfdrive.com/deep-learning-with-applications-using-python-chatbots-and-face-object-and-speech-recognition-with-tensorflow-and-keras-e184016771.html
• https://www.pdfdrive.com/mastering-machine-learning-with-python-in-six-steps-a-practical-implementation-guide-to-predictive-data-analytics-using-python-e168776616.html
• https://hackr.io/blog/artificial-intelligence-books
Tutorials
• https://www.fast.ai
• https://www.coursera.org/learn/machine-learning
• https://www.coursera.org/specializations/deep-learning
• https://www.udemy.com/course/machinelearning/
• https://www.edx.org/professional-certificate/harvardx-data-science
• https://www.udacity.com/course/intro-to-machine-learning-nanodegree--nd229
• https://online.stanford.edu/courses/cs229-machine-learning
• https://www.edx.org/learn/machine-learning
• https://learn.datacamp.com/courses/introduction-to-machine-learning-with-r
• https://www.brighttalk.com/topic/deep-learning/
• https://www.dataiku.com/webinars/
188
www.certiprof.com
189