Applications of Deep Neural Networks

Jeff Heaton

Fall 2022
The text and illustrations of Applications of Deep Neural Networks by Jeff Heaton are licensed under
CC BY-NC-SA 4.0. To view a copy of this license, visit CC BY-NC-SA 4.0.
All of the book’s source code is licensed under the GNU Lesser General Public License (LGPL) as published by the
Free Software Foundation; either version 2.1 of the license or (at your option) any later version.
Heaton Research, Encog, the Encog Logo, and the Heaton Research logo are all trademarks of Jeff
Heaton in the United States and/or other countries.
TRADEMARKS: Heaton Research has attempted throughout this book to distinguish proprietary
trademarks from descriptive terms by following the capitalization style used by the manufacturer.
The author and publisher have done their best to prepare this book, so the content is based upon the
final release of software whenever possible. Portions of the manuscript may be based upon pre-release
versions supplied by software manufacturer(s). The author and the publisher make no representation or
warranties of any kind about the completeness or accuracy of the contents herein and accept no liability
of any kind, including but not limited to performance, merchantability, fitness for any particular purpose,
or any losses or damages of any kind caused or alleged to be caused directly or indirectly from this book.
DISCLAIMER
The author, Jeffrey Heaton, makes no warranty or representation, either expressed or implied, concerning
the Software or its contents, quality, performance, merchantability, or fitness for a particular purpose.
In no event will Jeffrey Heaton, his distributors, or dealers be liable to you or any other party for direct,
indirect, special, incidental, consequential, or other damages arising out of the use of or inability to use the
Software or its contents even if advised of the possibility of such damage. In the event that the Software
includes an online update feature, Heaton Research, Inc. further disclaims any obligation to provide this
feature for any specific duration other than the initial posting.
The exclusion of implied warranties is not permitted by some states. Therefore, the above exclusion
may not apply to you. This warranty provides you with specific legal rights; there may be other rights
that you may have that vary from state to state. The pricing of the book with the Software by Heaton
Research, Inc. reflects the allocation of risk and limitations on liability contained in this agreement of
Terms and Conditions.
Contents

Introduction

1 Python Preliminaries
  1.1 Part 1.1: Overview
    1.1.1 Origins of Deep Learning
    1.1.2 What is Deep Learning
    1.1.3 Regression, Classification and Beyond
    1.1.4 Why Deep Learning?
    1.1.5 Python for Deep Learning
    1.1.6 Check your Python Installation
    1.1.7 Module 1 Assignment
  1.2 Part 1.2: Introduction to Python
  1.3 Part 1.3: Python Lists, Dictionaries, Sets, and JSON
    1.3.1 Lists and Tuples
    1.3.2 Sets
    1.3.3 Maps/Dictionaries/Hash Tables
    1.3.4 More Advanced Lists
    1.3.5 An Introduction to JSON
  1.4 Part 1.4: File Handling
    1.4.1 Read a CSV File
    1.4.2 Read (stream) a Large CSV File
    1.4.3 Read a Text File
    1.4.4 Read an Image
  1.5 Part 1.5: Functions, Lambdas, and Map/Reduce
    1.5.1 Map
    1.5.2 Filter
    1.5.3 Lambda
    1.5.4 Reduce

3 Introduction to TensorFlow
  3.1 Part 3.1: Deep Learning and Neural Network Introduction
    3.1.1 Classification or Regression
    3.1.2 Neurons and Layers
    3.1.3 Types of Neurons
    3.1.4 Input and Output Neurons
    3.1.5 Hidden Neurons
    3.1.6 Bias Neurons
    3.1.7 Other Neuron Types
    3.1.8 Why are Bias Neurons Needed?
    3.1.9 Modern Activation Functions
    3.1.10 Linear Activation Function
    3.1.11 Rectified Linear Units (ReLU)
    3.1.12 Softmax Activation Function
    3.1.13 Step Activation Function
    3.1.14 Sigmoid Activation Function

  8.2 Part 8.2: Building Ensembles with Scikit-Learn and Keras
    8.2.1 Evaluating Feature Importance
    8.2.2 Classification and Input Perturbation Ranking
    8.2.3 Regression and Input Perturbation Ranking
    8.2.4 Biological Response with Neural Network
    8.2.5 What Features/Columns are Important
    8.2.6 Neural Network Ensemble
  8.3 Part 8.3: Architecting Network: Hyperparameters
    8.3.1 Number of Hidden Layers and Neuron Counts
    8.3.2 Activation Functions
    8.3.3 Advanced Activation Functions
    8.3.4 Regularization: L1, L2, Dropout
    8.3.5 Batch Normalization
    8.3.6 Training Parameters
  8.4 Part 8.4: Bayesian Hyperparameter Optimization for Keras
  8.5 Part 8.5: Current Semester's Kaggle
    8.5.1 Iris as a Kaggle Competition
    8.5.2 MPG as a Kaggle Competition (Regression)

Introduction
Starting in the spring semester of 2016, I began teaching the T81-558 Applications of Deep Learning course
for Washington University in St. Louis. I never liked Microsoft PowerPoint for technical classes, so I placed
my course material, examples, and assignments on GitHub. This material started with code and grew to
include enough description that this information evolved into the book you see before you.
I license the book’s text under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-
NC-SA 4.0) license. Similarly, I offer the book’s code under the LGPL license. Though I provide this book
both as a relatively inexpensive paperback and Amazon Kindle, you can obtain the book’s PDF here:
• https://arxiv.org/abs/2009.05673
• https://github.com/jeffheaton/t81_558_deep_learning
If you purchased this book from me, you have my sincere thanks for supporting my ongoing projects. I sell
the book as a relatively low-cost paperback and Kindle ebook for those who prefer that format or wish to
support my projects. I suggest that you look at the above GitHub site, as all of the code for this book is
presented there as Jupyter notebooks that are entirely Google CoLab compatible.
This book focuses on the application of deep neural networks. There is some theory; however, I do not
focus on recreating neural network fundamentals that tech companies already provide in popular frame-
works. The book begins with a quick review of the Python fundamentals needed to learn the subsequent
chapters. With Python preliminaries covered, we start with classification and regression neural networks
in Keras.
In my opinion, PyTorch, JAX, and Keras are the top three deep learning frameworks. When I first
created this course, neither PyTorch nor JAX existed. I began the course based on TensorFlow and
migrated to Keras the following semester. I believe TensorFlow remains a good choice for a course focusing
on the application of deep learning. Some of the third-party libraries used for this course use PyTorch; as
a result, you will see a blend of both technologies. StyleGAN and TabGAN both make use of PyTorch.
The technologies that this course is based on change rapidly. I update the Kindle and paperback books
according to this schedule. Formal updates to this book typically occur just before each academic year’s
fall and spring semesters.
The source documents for this book are Jupyter notebooks. I wrote a Python utility that transforms my
course Jupyter notebooks into this book. It is entirely custom, and I may release it as a project someday.
However, because this book is based on code and updated twice a year, you may find the occasional typo. I
try to minimize errors as much as possible, but please let me know if you see something. I use Grammarly
to find textual issues, but due to the frequently updated nature of this book, I do not run it through a
formal editing cycle for each release. I also double-check the code with each release to ensure CoLab, Keras,
or another third-party library did not make a breaking change.
The book and course continue to be a work in progress. Many have contributed code, suggestions,
fixes, and clarifications to the GitHub repository. If you find an error, please submit a GitHub issue or a
pull request with a solution.
Chapter 1
Python Preliminaries
• Yann LeCun, Facebook and New York University - Optical character recognition and computer vision
using convolutional neural networks (CNN). The founding father of convolutional nets.
• Geoffrey Hinton, Google and University of Toronto. Extensive work on neural networks. Creator of
deep learning and early adopter/creator of backpropagation for neural networks.
• Yoshua Bengio, University of Montreal and Botler AI. Extensive research into deep learning, neural
networks, and machine learning.
• Andrew Ng, Baidu and Stanford University. Extensive research into deep learning, neural networks,
and application to robotics.
Geoffrey Hinton, Yann LeCun, and Yoshua Bengio won the Turing Award for their contributions to deep
learning.
• Traditional Software Development - Programmers create programs that specify how to transform
input into the desired output.
• Machine Learning - Programmers create models that can learn to produce the desired output for
given input. This learning fills the traditional role of the computer program.
Researchers have applied machine learning to many different areas. This class explores three specific
domains for the application of deep neural networks, as illustrated in Figure 1.3.
• Computer Vision - The use of machine learning to detect patterns in visual data. For example, is
an image a picture of a cat or a dog?
• Tabular Data - Several named input values allow the neural network to predict another named value
that becomes the output. For example, we are using four measurements of iris flowers to predict the
species. This type of data is often called tabular data.
• Natural Language Processing (NLP) - Deep learning transformers have revolutionized NLP,
allowing text sequences to generate more text, images, or classifications.
• Reinforcement Learning - Reinforcement learning trains a neural network to choose ongoing
actions so that the algorithm rewards the neural network for optimally completing a task.
• Time Series - The use of machine learning to detect patterns in time. Typical time series applications
are financial applications, speech recognition, and even natural language processing (NLP).
• Generative Models - Neural networks can learn to produce new original synthetic data from input.
We will examine StyleGAN, which learns to create new images similar to those it saw during training.
A neural network can accept input or produce output in any of the following forms:

• An image
• A series of numbers that could represent text, audio, or another time series
• A regression number
• A classification class
Like these other models, neural networks can perform both classification and regression. When applied
to relatively low-dimensional tabular data tasks, deep neural networks do not necessarily add significant
accuracy over other model types. However, most state-of-the-art solutions depend on deep neural networks
for images, video, text, and audio data.
The deep learning frameworks used in this book are:

• TensorFlow/Keras (Google)
• PyTorch (Facebook)
Overall, this book focuses on the application of deep neural networks, primarily using Keras, with some
applications in PyTorch. For many tasks, we will utilize Keras directly. We will utilize
third-party libraries for higher-level tasks, such as reinforcement learning, generative adversarial neural
networks, and others. These third-party libraries may internally make use of either PyTorch or Keras. I
chose these libraries based on popularity and application, not whether they used PyTorch or Keras.
To successfully use this book, you must be able to compile and execute Python code that makes use of
TensorFlow for deep learning. There are two options for you to accomplish this:
• Install Python, TensorFlow, and an IDE (such as Jupyter).
• Use Google CoLab in the cloud, with free GPU access.
If you look at this notebook on GitHub, near the top of the document, there are links to videos that describe
how to use Google CoLab. There are also videos explaining how to install Python on your local computer.
The following sections take you through the process of installing Python on your local computer. This
process is essentially the same on Windows, Linux, or Mac. For specific OS instructions, refer to one of
the tutorial YouTube videos earlier in this document.
To install Python on your computer, complete the following instructions:
• Installing Python and TensorFlow - Windows/Linux
• Installing Python and TensorFlow - Mac Intel
• Installing Python and TensorFlow - Mac M1
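Once everything is installed, you can verify your environment by printing the versions of the key libraries. The exact notebook listing is not reproduced here; the following is a minimal sketch that produces output like that shown below.

Code

# What versions of Python and its libraries are installed?
import sys

import pandas as pd
import sklearn as sk
import tensorflow as tf
import tensorflow.keras

print(f"Tensor Flow Version: {tf.__version__}")
print(f"Keras Version: {tensorflow.keras.__version__}")
print(f"Python {sys.version}")
print(f"Pandas {pd.__version__}")
print(f"Scikit-Learn {sk.__version__}")
gpu = len(tf.config.list_physical_devices('GPU')) > 0
print("GPU is", "available" if gpu else "NOT AVAILABLE")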
Output
Tensor Flow Version: 2.8.0
Keras Version: 2.8.0
Python 3.7.13 (default, Mar 16 2022, 17:37:17)
[GCC 7.5.0]
Pandas 1.3.5
Scikit-Learn 1.0.2
GPU is available
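The classic first program simply prints a short message. A one-line sketch consistent with the output below:

Code

print("Hello World")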
Output
Hello World
The above code passes a constant string containing the text "Hello World" to a function named print.
You can also leave comments in your code to explain what you are doing. Comments can begin anywhere
in a line.
Code
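# A reconstruction consistent with the output below: comments can
# begin anywhere in a line.
print("Hello World")  # This is a comment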
Output
Hello World
Strings are very versatile and allow your program to process textual information. Constant strings,
enclosed in quotes, define literal string values inside your program. Sometimes you may wish to define a
larger amount of literal text inside your program. This text might consist of multiple lines. The triple
quote allows for multiple lines of text.
Code
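# A sketch of the omitted listing, consistent with the output below.
print("""Print
Multiple
Lines""")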
Output
Print
Multiple
Lines
Like many languages, Python uses single (') and double (") quotes interchangeably to denote literal
string constants. The general convention is that double quotes should enclose actual text, such as words
or sentences. Single quotes should enclose symbolic text, such as error codes. An example of an error code
might be ’HTTP404’.
However, there is no difference between single and double quotes in Python, and you may use whichever
you like. The following code makes use of a single quote.
8 CHAPTER 1. PYTHON PRELIMINARIES
Code
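# A sketch of the omitted listing: the same message in single quotes.
print('Hello World')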
Output
Hello World
In addition to strings, Python allows numbers as literal constants in programs. Python includes support
for floating-point, integer, complex, and other types of numbers. This course will not make use of complex
numbers. Unlike strings, quotes do not enclose numbers.
The presence of a decimal point differentiates floating-point and integer numbers. For example, the
value 42 is an integer. Similarly, 42.5 is a floating-point number. If you wish to have a floating-point
number, without a fraction part, you should specify a zero fraction. The value 42.0 is a floating-point
number, although it has no fractional part. As an example, the following code prints two numbers.
Code
print(42)
print(42.5)
Output
42
42.5
So far, we have only seen how to define literal numeric and string values. These literal values are
constant and do not change as your program runs. Variables allow your program to hold values that can
change as the program runs. Variables have names that allow you to reference their values. The following
code assigns an integer value to a variable named "a" and a string value to a variable named "b."
Code
a = 10
b = "ten"
print(a)
print(b)
Output
10
ten
The key feature of variables is that they can change. The following code demonstrates how to change
the values held by variables.
Code
a = 10
print(a)
a = a + 1
print(a)
Output
10
11
You can mix strings and variables for printing. This technique is called a formatted or interpolated
string. The variables must be inside of the curly braces. In Python, this type of string is generally called
an f-string. The f-string is denoted by placing an "f" just in front of the opening single or double quote
that begins the string. The following code demonstrates the use of an f-string to mix several variables with
a literal string.
Code
a = 10
print(f'The value of a is {a}')
Output
The value of a is 10
You can also use f-strings with math (called an expression). Curly braces can enclose any valid Python
expression for printing. The following code demonstrates the use of an expression inside of the curly braces
of an f-string.
Code
a = 10
print(f'The value of a plus 5 is {a+5}')
Output
The value of a plus 5 is 15
Python has many ways to print numbers; these are all correct. However, for this course, we will use
f-strings. The following code demonstrates some of the varied methods of printing numbers in Python.
Code
a = 5
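# Four common ways to print a number (a reconstruction consistent
# with the output below):
print(f'a is {a}')
print('a is {}'.format(a))
print('a is ' + str(a))
print('a is %d' % (a))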
Output
a is 5
a is 5
a is 5
a is 5
You can use if-statements to perform logic. Notice the indents? These if-statements are how Python
defines blocks of code to execute together. A block usually begins after a colon and includes any lines at
the same level of indent. Unlike many other programming languages, Python uses whitespace to define
blocks of code. The fact that whitespace is significant to the meaning of program code is a frequent source
of annoyance for new programmers of Python. Tabs and spaces are both used to define the scope in a
Python program. Mixing both spaces and tabs in the same program is not recommended.
Code
a = 5

if a > 5:
    print('The variable a is greater than 5.')
else:
    print('The variable a is not greater than 5')
Output

The variable a is not greater than 5
The following if-statement has multiple levels. It can be easy to indent these levels improperly, so be
careful. This code contains a nested if-statement under the first "a==5" if-statement. Only if a is equal to
5 will the nested "b==6" if-statement be executed. Also, note that the "elif" command means "else if."
1.2. PART 1.2: INTRODUCTION TO PYTHON 11
Code
a = 5
b = 6

if a == 5:
    print('The variable a is 5')
    if b == 6:
        print('The variable b is also 6')
elif a == 6:
    print('The variable a is 6')
Output
The variable a is 5
The variable b is also 6
It is also important to note that the double equal ("==") operator is used to test the equality of two
expressions. The single equal ("=") operator is only used to assign values to variables in Python. The
greater than (">"), less than ("<"), greater than or equal (">="), and less than or equal ("<=") operators all
perform as you would expect. Testing for inequality is performed with the not equal ("!=") operator.
It is common in programming languages to loop over a range of numbers. Python accomplishes this
through the use of the range operation. Here you can see a for loop and a range operation that causes
the program to loop between 1 and 3.
Code
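# A sketch of the omitted listing: range(1, 3) yields 1 and 2.
for x in range(1, 3):
    print(x)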
Output
1
2
This code illustrates some incompatibilities between Python 2 and Python 3. Before Python 3, it was
acceptable to leave the parentheses off of a print function call. This method of invoking the print command
is no longer allowed in Python 3. Similarly, it used to be a performance improvement to use the xrange
command in place of range command at times. Python 3 incorporated all of the functionality of the xrange
Python 2 command into the normal range command. As a result, the programmer should not use the
xrange command in Python 3. If you see either of these constructs used in example code, then you are
looking at code written for an older version of Python. The following loop keeps a running sum as it iterates:
Code

acc = 0
for x in range(1, 3):
    acc += x
    print(f"Adding {x}, sum so far is {acc}")

print(f"Final sum: {acc}")
Output
Adding 1, sum so far is 1
Adding 2, sum so far is 3
Final sum: 3
• Dictionary - A dictionary is a mutable unordered collection that Python indexes with name and
value pairs.
• List - A list is a mutable ordered collection that allows duplicate elements.
• Set - A set is a mutable unordered collection with no duplicate elements.
• Tuple - A tuple is an immutable ordered collection that allows duplicate elements.
Most Python collections are mutable, meaning the program can add and remove elements after definition.
An immutable collection cannot add or remove items after definition. It is also essential to understand
that an ordered collection means that items maintain their order as the program adds them to a collection.
This order might not be any specific ordering, such as alphabetic or numeric.
Lists and tuples are very similar in Python and are often confused. The significant difference is that a
list is mutable, but a tuple isn’t. So, we include a list when we want to contain similar items and a tuple
when we know what information goes into it ahead of time.
Many programming languages contain a data collection called an array. The array type is noticeably
absent in Python. Generally, the programmer will use a list in place of an array in Python. Arrays in
most programming languages were fixed-length, requiring the program to know the maximum number of
elements needed ahead of time. This restriction leads to the infamous array-overrun bugs and security
issues. The Python list is much more flexible in that the program can dynamically change the size of a list.
The next sections will look at each collection type in more detail.
Code

l = ['a', 'b', 'c', 'd']
t = ('a', 'b', 'c', 'd')

print(l)
print(t)

Output

['a', 'b', 'c', 'd']
('a', 'b', 'c', 'd')
The primary difference you will see programmatically is that a list is mutable, which means the program
can change it. A tuple is immutable, which means the program cannot change it. The following code
demonstrates that the program can change a list. This code also illustrates that Python indexes lists
starting at element 0. Accessing element one modifies the second element in the collection. One advantage
of tuples over lists is that tuples are generally slightly faster to iterate over than lists.
Code

# Change the second element of the list; attempting the same on the
# tuple would raise an error.
l[1] = 'changed'
# t[1] = 'changed'  # This would result in an error

print(l)

Output

['a', 'changed', 'c', 'd']
Like many languages, Python has a for-each statement. This statement allows you to loop over every
element in a collection, such as a list or a tuple.
Code
# Iterate over a collection.
for s in l:
    print(s)
Output
a
changed
c
d
The enumerate function is useful for enumerating over a collection and having access to the index of
the element that we are currently on.
Code
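# A sketch of the omitted listing: enumerate provides both the index
# and the value.
for i, s in enumerate(l):
    print(f"{i}: {s}")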
Output
0: a
1: changed
2: c
3: d
A list can have multiple objects added, such as strings. Duplicate values are allowed. Tuples do not
allow the program to add additional objects after definition.
Code
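# A sketch of the omitted listing: append adds items; duplicates are
# allowed.
c = []
c.append('a')
c.append('b')
c.append('c')
c.append('c')
print(c)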
Output

['a', 'b', 'c', 'c']
Ordered collections, such as lists and tuples, allow you to access an element by its index number, as
done in the following code. Unordered collections, such as dictionaries and sets, do not allow the program
to access them in this way.
Code
print(c[1])

Output

b
The insert and remove functions modify a list in place. The insert function requires an index; the
program inserts the new element before that index. These operations are not allowed for tuples because
they would result in a change.
Code
# Insert
c = ['a', 'b', 'c']
c.insert(0, 'a0')
print(c)

# Remove
c.remove('b')
print(c)

# Remove at index
del c[0]
print(c)
Output

['a0', 'a', 'b', 'c']
['a0', 'a', 'c']
['a', 'c']
1.3.2 Sets
A Python set holds an unordered collection of objects, but sets do not allow duplicates. If a program adds
a duplicate item to a set, only one copy of each item remains in the collection. Adding a duplicate item to
a set does not result in an error. Any of the following techniques will define a set.
Code
s = set()
s = {'a', 'b', 'c'}
s = set(['a', 'b', 'c'])
print(s)
Output
A list is always enclosed in square brackets [], a tuple in parentheses (), and similarly a set is enclosed
in curly braces. Programs can dynamically add items to a set as they run with the add function. It is
important to note that the append function adds items to lists, whereas the add function adds items
to a set.
Code
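# A sketch of the omitted listing: add items; duplicates are ignored.
s = set()
s.add('a')
s.add('b')
s.add('a')  # duplicate, has no effect
print(s)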
Output
Many programming languages include the concept of a map, dictionary, or hash table. These are all
very related concepts. Python provides a dictionary that is essentially a collection of name-value pairs.
Programs define dictionaries using curly braces, as seen here.
Code
d = {'name': "Jeff", 'address': "123 Main"}
print(d)
print(d['name'])
Output
{'name': 'Jeff', 'address': '123 Main'}
Jeff
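A sketch of the omitted listing: the in operator tests whether a dictionary defines a key, producing the output below.

Code

if 'name' in d:
    print("Name is defined")
else:
    print("Name undefined")

if 'age' in d:
    print("age is defined")
else:
    print("age undefined")

Output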
Name is defined
age undefined
Be careful that you do not attempt to access an undefined key, as this will result in an error. You can
check to see if a key is defined, as demonstrated above. You can also access the dictionary and provide a
default value, as the following code demonstrates.
Code
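# A sketch of the omitted listing: get returns a default value when a
# key is missing.
d = {'name': "Jeff", 'address': "123 Main"}
print(d.get('unknown_key', 'default'))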
Output

default
You can also access the individual keys and values of a dictionary.
Code
d = {'name': "Jeff", 'address': "123 Main"}

# All of the keys
print(f"Key: {d.keys()}")

# All of the values
print(f"Values: {d.values()}")

Output

Key: dict_keys(['name', 'address'])
Values: dict_values(['Jeff', '123 Main'])
Dictionaries and lists can be combined. This syntax is closely related to JSON. Dictionaries and lists
together are a good way to build very complex data structures. While Python allows quotes (") and
apostrophe (’) for strings, JSON only allows double-quotes ("). We will cover JSON in much greater detail
later in this module.
The following code shows a hybrid usage of dictionaries and lists.
Code
# The definition of customers was reconstructed from the output below.
customers = [
    {'name': 'Jeff & Tracy Heaton', 'pets': ['Wynton', 'Cricket', 'Hickory']},
    {'name': 'John Smith', 'pets': ['rover']},
    {'name': 'Jane Doe'}
]

print(customers)

for customer in customers:
    print(f"{customer['name']}: {customer.get('pets', 'no pets')}")
Output
[{'name': 'Jeff & Tracy Heaton', 'pets': ['Wynton', 'Cricket', 'Hickory']}, {'name': 'John Smith', 'pets': ['rover']}, {'name': 'Jane Doe'}]
Jeff & Tracy Heaton: ['Wynton', 'Cricket', 'Hickory']
John Smith: ['rover']
Jane Doe: no pets
The variable customers is a list that holds three dictionaries that represent customers. You can
think of these dictionaries as records in a table. The fields in these individual records are the keys of the
dictionary. Here the keys name and pets are fields. However, the field pets holds a list of pet names.
There is no limit to how deep you might choose to nest lists and maps. It is also possible to nest a map
inside of a map or a list inside of another list.
The zip function pairs up the elements of two collections. The following code shows that zip returns an object rather than a printable list:

Code

a = [1, 2, 3, 4, 5]
b = [5, 4, 3, 2, 1]

print(zip(a, b))
Output
<zip object at 0x000001802A7A2E08>
To see the results of the zip function, we convert the returned zip object into a list. As you can see,
the zip function returns a list of tuples. Each tuple represents a pair of items that the function zipped
together. The order in the two lists was maintained.
Code
a = [1, 2, 3, 4, 5]
b = [5, 4, 3, 2, 1]

print(list(zip(a, b)))
Output
[(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]
The usual method for using the zip command is inside of a for-loop. The following code shows how a
for-loop can assign a variable to each collection that the program is iterating.
Code
a = [1, 2, 3, 4, 5]
b = [5, 4, 3, 2, 1]

for x, y in zip(a, b):
    print(f'{x} - {y}')
Output
1 - 5
2 - 4
3 - 3
4 - 2
5 - 1
Usually, both collections will be of the same length when passed to the zip command. It is not an
error to have collections of different lengths. As the following code illustrates, the zip command will only
process elements up to the length of the smaller collection.
Code
a = [1, 2, 3, 4, 5]
b = [5, 4, 3]

print(list(zip(a, b)))
Output
[(1, 5), (2, 4), (3, 3)]
Sometimes you may wish to know the current numeric index when a for-loop is iterating through an
ordered collection. Use the enumerate command to track the index location for a collection element.
Because the enumerate command deals with numeric indexes of the collection, the zip command will
assign arbitrary indexes to elements from unordered collections.
Consider how you might construct a Python program to change every element greater than 5 to the
value of 5. The following program performs this transformation. The enumerate command allows the loop
to know which element index it is currently on, thus allowing the program to be able to change the value
of the current element of the collection.
Code
a = [2, 10, 3, 11, 10, 3, 2, 1]

for i, x in enumerate(a):
    if x > 5:
        a[i] = 5

print(a)
Output
[2, 5, 3, 5, 5, 3, 2, 1]
The comprehension command can dynamically build up a list. The comprehension below counts from
0 to 9 and adds each value (multiplied by 10) to a list.
Code
lst = [x * 10 for x in range(10)]
print(lst)
Output
[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
A dictionary can also be a comprehension. The general format for this is:
dict_variable = {key: value for (key, value) in dictionary.items()}

The following code uses a comprehension to build a dictionary that looks up the index of a column by its name.

Code

text = ['col-zero', 'col-one', 'col-two', 'col-three']
lookup = {key: value for (value, key) in enumerate(text)}
print(lookup)
Output
{'col-zero': 0, 'col-one': 1, 'col-two': 2, 'col-three': 3}
Code
print(f'The index of "col-two" is {lookup["col-two"]}')

Output

The index of "col-two" is 2
The following shows a sample JSON document (the tail of this listing was lost in extraction and is completed from the Python version given below):

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}
The above file may look somewhat like Python code. You can see curly braces that define dictionaries
and square brackets that define lists. JSON does require there to be a single root element. A list or
dictionary can fulfill this role. JSON requires double-quotes to enclose strings and names. Single quotes
are not allowed in JSON.
JSON files are always legal JavaScript syntax. JSON is also generally valid as Python code, as demon-
strated by the following Python program.
Code
jsonHardCoded = {
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": True,
    "age": 27,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
    },
    "phoneNumbers": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "office",
            "number": "646 555-4567"
        },
        {
            "type": "mobile",
            "number": "123 456-7890"
        }
    ],
    "children": [],
    "spouse": None
}
Generally, it is better to read JSON from files, strings, or the Internet than hard coding, as demonstrated
here. However, for internal data structures, sometimes such hard-coding can be useful.
Python contains support for JSON. When a Python program loads a JSON document, the root list or
dictionary is returned, as demonstrated by the following code.
Code
import json

json_string = '{"first": "Jeff", "last": "Heaton"}'
obj = json.loads(json_string)
print(f"First name: {obj['first']}")
print(f"Last name: {obj['last']}")
Output
First name: Jeff
Last name: Heaton
Code

import requests
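# The rest of this listing was lost in extraction; a sketch with a
# placeholder URL (substitute any endpoint that serves the JSON above):
r = requests.get("https://example.com/person.json")
print(r.json())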
Output
{'firstName': 'John', 'lastName': 'Smith', 'isAlive': True, 'age': 27, 'address': {'streetAddress': '21 2nd Street', 'city': 'New York', 'state': 'NY', 'postalCode': '10021-3100'}, 'phoneNumbers': [{'type': 'home', 'number': '212 555-1234'}, {'type': 'office', 'number': '646 555-4567'}, {'type': 'mobile', 'number': '123 456-7890'}], 'children': [], 'spouse': None}
Python programs can easily generate JSON strings from Python objects of dictionaries and lists.
Code
import json

python_obj = {"first": "Jeff", "last": "Heaton"}
print(json.dumps(python_obj))
Output

{"first": "Jeff", "last": "Heaton"}
A data scientist will generally encounter JSON when they access web services to get their data. A data
scientist might use the techniques presented in this section to convert the semi-structured JSON data into
tabular data for the program to use with a model such as a neural network.
• CSV files (generally have the .csv extension) hold tabular data that resembles spreadsheet data.
• Image files (generally with the .png or .jpg extension) hold images for computer vision.
• Text files (often have the .txt extension) hold unstructured text and are essential for natural language
processing.
• JSON (often have the .json extension) contain semi-structured textual data in a human-readable
text-based format.
• H5 (can have a wide array of extensions) is a binary format that stores large, hierarchical collections
of data. Keras and TensorFlow store neural networks as H5 files.
• Audio Files (often have an extension such as .au or .wav) contain recorded sound.
Data can come from a variety of sources. In this class, we obtain data from three primary locations:
• Your Hard Drive - This type of data is stored locally, and Python accesses it from a path that
looks something like: c:\data\myfile.csv or /Users/jheaton/data/myfile.csv.
• The Internet - This type of data resides in the cloud, and Python accesses it from a URL that
looks something like:
https://data.heatonresearch.com/data/t81-558/iris.csv.
• Google Drive (cloud) - If your code runs in Google CoLab, you can use Google Drive to save and load
some data files. CoLab mounts your Google Drive into a path similar to the following:
/content/drive/My Drive/myfile.csv.
Code

import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/iris.csv")
The above command loads Fisher’s Iris data set from the Internet. It might take a few seconds to load,
so it is good to keep the loading code in a separate Jupyter notebook cell so that you do not have to reload
it as you test your program. You can load Internet data, local hard drive, and Google Drive data this way.
Now that the data is loaded, you can display the first five rows with this command.
Code
display(df[0:5])
Output
import csv
import urllib.request
import codecs
import numpy as np

# Open a stream to the CSV file (these setup lines are a
# reconstruction; the original listing was partially lost)
url = "https://data.heatonresearch.com/data/t81-558/iris.csv"
urlstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(urlstream, 'utf-8'))
next(csvfile)  # Skip the header line

sum = np.zeros(4)
count = 0

for line in csvfile:
    # Convert each row to Numpy array
    line2 = np.array(line)[0:4].astype(float)

    # If the line is of the right length (skip empty lines), then add
    if len(line2) == 4:
        sum += line2
        count += 1

# Calculate the average, and print the average of the 4 iris
# measurements (features)
print(sum / count)
Output
Code

import codecs
import urllib.request

# Placeholder URL; substitute the location of any text file
url = "https://data.heatonresearch.com/data/t81-558/sonnet_18.txt"
with urllib.request.urlopen(url) as urlstream:
    for line in codecs.iterdecode(urlstream, 'utf-8'):
        print(line.rstrip())
Output
Sonnet 18 original text
William Shakespeare

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee.
%matplotlib inline
from io import BytesIO

import requests
from PIL import Image

# Placeholder URL; substitute the location of any image
url = "https://example.com/photo.jpg"
response = requests.get(url)
img = Image.open(BytesIO(response.content))

img
Output
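The original listing for this section was lost in extraction; a sketch consistent with the output below defines a function with a default argument and calls it three different ways.

Code

def say_hello(speaker, person_to_greet, greeting="Hello"):
    print(f'{greeting} {person_to_greet}, this is {speaker}.')

say_hello('Jeff', "John")
say_hello('Jeff', "John", "Goodbye")
say_hello(speaker='Jeff', person_to_greet="John", greeting="Goodbye")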
Output
Hello John, this is Jeff.
Goodbye John, this is Jeff.
Goodbye John, this is Jeff.
A function is a way to capture code that is commonly executed. Consider the following function that
can be used to trim white space from a string and capitalize the first letter.
Code
def process_string(str):
    t = str.strip()
    return t[0].upper() + t[1:]
Output
Python’s map is a very useful function that is provided in many different programming languages. The
map function takes a list and applies a function to each member of the list and returns a second list
that is the same size as the first.
Code
l = ['   apple  ', 'pear ', 'orange', 'pine apple  ']
list(map(process_string, l))
Output
['Apple', 'Pear', 'Orange', 'Pine apple']
1.5.1 Map
The map function is very similar to the Python comprehension that we previously explored. The
following comprehension accomplishes the same task as the previous call to map.
Code
l = ['   apple  ', 'pear ', 'orange', 'pine apple  ']
l2 = [process_string(x) for x in l]
print(l2)
Output
['Apple', 'Pear', 'Orange', 'Pine apple']
The choice of using a map function or comprehension is up to the programmer. I tend to prefer
map since it is so common in other programming languages.
1.5.2 Filter
While a map function always creates a new list of the same size as the original, the filter function
creates a potentially smaller list.
Code
def greater_than_five(x):
    return x > 5

l = [1, 10, 20, 3, -2, 0]
l2 = list(filter(greater_than_five, l))
print(l2)
Output
[10, 20]
1.5.3 Lambda
It might seem somewhat tedious to have to create an entire function just to check to see if a value is greater
than 5. A lambda saves you this effort. A lambda is essentially an unnamed function.
Code
l = [1, 10, 20, 3, -2, 0]
l2 = list(filter(lambda x: x > 5, l))
print(l2)
Output
[10, 20]
1.5.4 Reduce
Finally, we will make use of reduce. Like filter and map the reduce function also works on a list.
However, the result of the reduce is a single value. Consider if you wanted to sum the values of a list.
The sum is implemented by a lambda.
Code
from functools import reduce

l = [1, 10, 20, 3, -2, 0]
result = reduce(lambda x, y: x + y, l)
print(result)
Output
32
Chapter 2

Python for Machine Learning

2.1 Part 2.1: Introduction to Pandas
Code

# Simple dataframe (read and display reconstructed; the original
# listing was partially lost)
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

display(df[0:5])
Output
The display function provides a cleaner display than merely printing the data frame. Specifying the
maximum rows and columns allows you to achieve greater control over the display.
Code
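# A sketch of the omitted listing, following the display-option
# pattern used later in this chapter:
pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 5)
display(df)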
Output
It is possible to generate a second data frame to display statistical information about the first data
frame.
Code
# Strip non-numerics
df = df.select_dtypes(include=['int', 'float'])

headers = list(df.columns.values)
fields = []

for field in headers:
    fields.append({
        'name': field,
        'mean': df[field].mean(),
        'var': df[field].var(),
        'sdev': df[field].std()
    })

for field in fields:
    print(field)
Output
{'name': 'mpg', 'mean': 23.514572864321607, 'var': 61.089610774274405, 'sdev': 7.815984312565782}
{'name': 'cylinders', 'mean': 5.454773869346734, 'var': 2.893415439920003, 'sdev': 1.7010042445332119}
{'name': 'displacement', 'mean': 193.42587939698493, 'var': 10872.199152247384, 'sdev': 104.26983817119591}
{'name': 'weight', 'mean': 2970.424623115578, 'var': 717140.9905256763, 'sdev': 846.8417741973268}
{'name': 'acceleration', 'mean': 15.568090452261307, 'var': 7.604848233611383, 'sdev': 2.757688929812676}
{'name': 'year', 'mean': 76.01005025125629, 'var': 13.672442818627143, 'sdev': 3.697626646732623}
{'name': 'origin', 'mean': 1.5728643216080402, 'var': 0.6432920268850549, 'sdev': 0.8020548777266148}
This code outputs a list of dictionaries that hold this statistical information. This information looks
similar to the JSON code seen in Module 1. If proper JSON is needed, the program should add these
records to a list and call the Python JSON library’s dumps command.
The Python program can convert this JSON-like information to a data frame for better display.
Code
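# A sketch of the omitted listing: convert the list of dictionaries
# into a second dataframe for display.
df2 = pd.DataFrame(fields)
display(df2)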
Output
Code
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

print(f"horsepower has na? {pd.isnull(df['horsepower']).values.any()}")
Output

horsepower has na? True
The code below will drop every row from the Auto MPG dataset where the mpg is two standard
deviations or more above or below the mean.

Code

import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore

# Remove all rows where the named column is +/- sd standard deviations
# from the mean (function body reconstructed; lost in extraction)
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(zscore(df[name])) >= sd)]
    df.drop(drop_rows, axis=0, inplace=True)

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Drop outliers in mpg
print("Length before MPG outliers dropped: {}".format(len(df)))
remove_outliers(df, 'mpg', 2)
print("Length after MPG outliers dropped: {}".format(len(df)))
Output
It is possible to remove individual columns from a data frame. The following code removes the name column:

Code

import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# The drop itself was reconstructed from the output below
print(f"Before drop: {list(df.columns)}")
df.drop('name', axis=1, inplace=True)
print(f"After drop: {list(df.columns)}")

Output

Before drop: ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']
After drop: ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin']
The concat function can concatenate columns together. The following code concatenates the name and horsepower columns:

Code

import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# The concatenation was reconstructed from the output below
col_horsepower = df['horsepower']
col_name = df['name']
result = pd.concat([col_name, col_horsepower], axis=1)
display(result)
Output
name horsepower
0 chevrolet chevelle malibu 130.0
1 buick skylark 320 165.0
... ... ...
396 ford ranger 79.0
397 chevy s-10 82.0
The concat function can also concatenate rows together. This code concatenates the first two rows
and the last two rows of the Auto MPG dataset.
Code
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

result = pd.concat([df[0:2], df[-2:]], axis=0)
display(result)
Output
• Training Data - In Sample Data - The data that the neural network used to train.
• Validation Data - Out of Sample Data - The data that the machine learning model is evaluated
upon after it is fit to the training data.
There are two effective means of dealing with training and validation data:
• Training/Validation Split - The program splits the data according to some ratio between a training
and validation (hold-out) set. Typical rates are 80% training and 20% validation.
• K-Fold Cross Validation - The program splits the data into several folds and models. Because
the program creates the same number of models as folds, the program can generate out-of-sample
predictions for the entire dataset.
The code below splits the MPG data into a training and validation set. The training set uses 80% of the
data, and the validation set uses 20%. Figure 2.1 shows how we train a model on 80% of the data and
then validated against the remaining 20%.
Code
import os
import pandas as pd
import numpy as np

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Usually a good idea to shuffle
df = df.reindex(np.random.permutation(df.index))

# Split into an 80% training set and a 20% validation set
# (reconstructed from the output below)
mask = np.random.rand(len(df)) < 0.8
trainDF = pd.DataFrame(df[mask])
validationDF = pd.DataFrame(df[~mask])

print(f"Training DF: {len(trainDF)}")
print(f"Validation DF: {len(validationDF)}")
Output
Training DF: 333
Validation DF: 65
The values property converts a data frame to a NumPy matrix.

Code

df.values
Output
You might wish only to convert some of the columns, to leave out the name column, use the following
code.
Code
df[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
    'acceleration', 'year', 'origin']].values
Output
The following code shuffles the data set and saves it to a new CSV file:

Code

import os
import pandas as pd
import numpy as np

path = "."

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# The output filename is an assumption
filename_write = os.path.join(path, "auto-mpg-shuffle.csv")
df = df.reindex(np.random.permutation(df.index))
# Specify index=False to not write row numbers
df.to_csv(filename_write, index=False)
print("Done")
Output
Done
Pandas can also save a data frame in the binary pickle format, which preserves the exact state of the data frame, including its index values.

Code

import os
import pandas as pd
import numpy as np
import pickle

path = "."

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Shuffle, then save in pickle format (filename is an assumption)
filename_write = os.path.join(path, "auto-mpg-shuffle.pkl")
df = df.reindex(np.random.permutation(df.index))

with open(filename_write, "wb") as fp:
    pickle.dump(df, fp)
Loading the pickle file back into memory is accomplished by the following lines of code. Notice that
the index numbers are still jumbled from the previous shuffle? Reloading the CSV file written in the
previous step would not have preserved these values.
Code
import os
import pandas as pd
import numpy as np
import pickle

path = "."

# Load the pickled dataframe saved above (filename assumed to match)
filename_read = os.path.join(path, "auto-mpg-shuffle.pkl")

with open(filename_read, "rb") as fp:
    df = pickle.load(fp)

display(df[0:5])
Output
A z-score measures how many standard deviations a value is from the mean:

$$ z = \frac{x - \mu}{\sigma} $$
To calculate the z-score, you also need to calculate the mean (µ or x̄) and the standard deviation (σ).
You can calculate the mean with this equation:
$$ \mu = \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} $$
You can calculate the standard deviation with this equation:

$$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} $$
The following Python code replaces the mpg with a z-score. Cars with average MPG will be near zero,
above zero is above average, and below zero is below average. Z-scores more than 3 above or below are
very rare; these are outliers.
Code
import os
import pandas as pd
from scipy.stats import zscore

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Replace mpg with its z-score (this line reconstructed; the listing
# was partially lost in extraction)
df['mpg'] = zscore(df['mpg'])
display(df[0:5])
Output
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

display(df)
Output
The area column is not numeric, so you must encode it with one-hot encoding. We display the number
of areas and individual values. There are just four values in the area categorical variable in this case.
Code
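# A sketch of the omitted listing, consistent with the output below:
areas = list(df['area'].unique())
print(f'Number of areas: {len(areas)}')
print(f'Areas: {areas}')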
Output
Number of areas: 4
Areas: ['c', 'd', 'a', 'b']
There are four unique values in the area column. To encode these dummy variables, we would use four
columns, each representing one of the areas. For each row, one column would have a value of one, the rest
zeros. For this reason, this type of encoding is sometimes called one-hot encoding. The following code
shows how you might encode the values "a" through "d." The value A becomes [1,0,0,0] and the value B
becomes [0,1,0,0].
Code
dummies = pd.get_dummies(['a', 'b', 'c', 'd'], prefix='area')
print(dummies)
Output

   area_a  area_b  area_c  area_d
0       1       0       0       0
1       0       1       0       0
2       0       0       1       0
3       0       0       0       1
Code
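# A sketch of the omitted listing: encode the dataframe's actual
# area column.
dummies = pd.get_dummies(df['area'], prefix='area')
display(dummies[0:10])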
Output
For the new dummy/one hot encoded values to be of any use, they must be merged back into the data
set.
Code
df = pd.concat([df, dummies], axis=1)
To encode the area column, we use the following code. Note that it is necessary to merge these dummies
back into the data frame.
Code
display(df[['id', 'job', 'area', 'income', 'area_a',
            'area_b', 'area_c', 'area_d']])
Output
Usually, you will remove the original column area because the goal is to get the data frame to be entirely
numeric for the neural network.
Code
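# A sketch of the omitted listing: drop the original area column now
# that it is encoded.
df.drop('area', axis=1, inplace=True)
display(df[['id', 'job', 'income', 'area_a',
            'area_b', 'area_c', 'area_d']])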
Output
Code
import pandas as pd

dummies = pd.get_dummies(['a', 'b', 'c', 'd'], prefix='area', drop_first=True)
print(dummies)
print ( dummies )
Output

   area_b  area_c  area_d
0       0       0       0
1       1       0       0
2       0       1       0
3       0       0       1
As you can see from the above data, the area_a column is missing, as get_dummies replaced it
with the encoding [0,0,0]. The following code shows how to apply this technique to a dataframe.
Code
import pandas as pd

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Encode the area column, dropping the first dummy (the middle of
# this listing was reconstructed; lost in extraction)
dummies = pd.get_dummies(df['area'], prefix='area', drop_first=True)
df = pd.concat([df, dummies], axis=1)
df.drop('area', axis=1, inplace=True)

# display the encoded dataframe
pd.set_option('display.max_columns', 0)
pd.set_option('display.max_rows', 10)
display(df)
Output
# Create a small sample dataset
import pandas as pd
import numpy as np

np.random.seed(43)
df = pd.DataFrame({
    'cont_9': np.random.rand(10) * 100,
    'cat_0': ['dog'] * 5 + ['cat'] * 5,
    'cat_1': ['wolf'] * 9 + ['tiger'] * 1,
    'y': [1, 0, 1, 1, 1, 1, 0, 0, 0, 0]
})

display(df)
Output
Rather than creating dummy variables for "dog" and "cat," we would like to change them to a number.
We could use 0 for a cat and 1 for a dog, but we can encode more information than just that. The simple
0 or 1 would also only work for one animal. Consider what the mean target value is for cat and dog.
Code
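# A sketch of the omitted listing: the mean of y for each category.
means0 = df.groupby('cat_0')['y'].mean().to_dict()
means0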
Output

{'cat': 0.2, 'dog': 0.8}
The danger is that we are now using the target value (y) for training. This technique will potentially
lead to overfitting. The possibility of overfitting is even greater if there are a small number of values in a
particular category. To prevent this from happening, we use a weighting factor. The stronger the weight,
the more categories with fewer values will tend towards the overall average of y. You can perform this
calculation as follows.
Code
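# The overall average of y (5 ones out of 10 rows).
df['y'].mean()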
Output
0.5
You can implement target encoding as follows. For more information on target encoding, refer to the
article "Target Encoding Done the Right Way", upon which I based this code.
Code
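# A sketch of the smoothed-mean target encoder, following the
# approach of the article cited above:
def calc_smooth_mean(df1, df2, cat_name, target, weight):
    # Compute the global mean
    mean = df1[target].mean()

    # Compute the number of values and the mean of each group
    agg = df1.groupby(cat_name)[target].agg(['count', 'mean'])
    counts = agg['count']
    means = agg['mean']

    # Compute the "smoothed" means
    smooth = (counts * means + weight * mean) / (counts + weight)

    # Replace each value by the corresponding smoothed mean
    if df2 is None:
        return df1[cat_name].map(smooth)
    else:
        return df1[cat_name].map(smooth), df2[cat_name].map(smooth.to_dict())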
Code
WEIGHT = 5
df['cat_0_enc'] = calc_smooth_mean(df1=df, df2=None,
                                   cat_name='cat_0', target='y', weight=WEIGHT)
df['cat_1_enc'] = calc_smooth_mean(df1=df, df2=None,
                                   cat_name='cat_1', target='y', weight=WEIGHT)

display(df)
Output
Some categorical values have a natural order. Consider the following list of education levels:

• Kindergarten (0)
• First Grade (1)
• Second Grade (2)
• Third Grade (3)
• Fourth Grade (4)
• Fifth Grade (5)
• Sixth Grade (6)
• Seventh Grade (7)
• Eighth Grade (8)
• High School Freshman (9)
• High School Sophomore (10)
• High School Junior (11)
• High School Senior (12)
• College Freshman (13)
• College Sophomore (14)
• College Junior (15)
• College Senior (16)
• Graduate Student (17)
• PhD Candidate (18)
• Doctorate (19)
• Post Doctorate (20)
The above list has 21 levels and would take 21 dummy variables to encode. However, simply encoding this
to dummies would lose the order information. Perhaps the most straightforward approach would be to
simply number them and assign the category a single number equal to the value in the parentheses above.
However, we might be able to do even better; a level such as graduate student typically spans more than
a single year, so you might increase the spacing of some values. A sketch of such an ordinal encoding
follows this paragraph.
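A minimal sketch of an ordinal encoding; the education column and dataframe here are hypothetical:

Code

# Map each level to its ordinal value (abbreviated; extend through
# all 21 levels as listed above)
education_map = {
    'Kindergarten': 0,
    'First Grade': 1,
    'Second Grade': 2,
    # ...
    'Post Doctorate': 20,
}

# df['education'] is a hypothetical column of level names
# df['education_ord'] = df['education'].map(education_map)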
The following code shuffles the rows of the Auto MPG dataset:

Code

import os
import pandas as pd
import numpy as np

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

df = df.reindex(np.random.permutation(df.index))

display(df[0:10])
Output
The following code demonstrates a reindex. Notice how the reindex orders the row indexes.
Code
df.reset_index(inplace=True, drop=True)
display(df)
Output
Code
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Sort by name (a reconstruction; the operation in the original
# listing was lost in extraction)
df = df.sort_values(by='name', ascending=True)
display(df[0:5])
Output
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])
Output
You can use the above dataset with groupby to perform summaries. For example, the following code will group by cylinders and take the average (mean) of mpg. In addition to mean, you can use other aggregating functions, such as sum or count.
Code
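# Sketch of the grouping call (the original listing did not survive
# conversion); it is consistent with the mpg-by-cylinders output below
# and with the variable g used afterward.
g = df.groupby('cylinders')['mpg'].mean()
g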
Output
cylinders
3    20.550000
4    29.286765
5    27.366667
6    19.985714
8    14.963107
Name: mpg, dtype: float64
d = g.to_dict()
d
Output
{3: 20.55,
 4: 29.28676470588236,
 5: 27.366666666666664,
 6: 19.985714285714284,
 8: 14.963106796116508}
A dictionary allows you to access an individual element quickly. For example, you could quickly look up the mean for six-cylinder cars. You will see that target encoding, introduced earlier in this module, uses this technique.
Code
d[6]
Output
19.985714285714284
The code below shows how to count the number of rows that match each cylinder count.
Code
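# Sketch of the count version (the original listing did not survive
# conversion), consistent with the dictionary output below.
df.groupby('cylinders')['mpg'].count().to_dict()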
Output
{3: 4, 4: 204, 5: 3, 6: 84, 8: 103}
The Auto MPG dataset contains an origin column with a numeric value between one and three that indicates the geographic origin of each car. We can see how to use the map function to transform this numeric origin into the textual name of each origin.
We will begin by loading the Auto MPG data set.
Code
import os
import pandas as pd
import numpy as np

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])
display(df)
Output
The map method in Pandas operates on a single column. You provide map with a dictionary: the dictionary keys specify which values in the target column should be replaced, and the dictionary values give the replacements. The following code shows how the map function can transform the numeric values of 1, 2, and 3 into the string values of North America, Europe, and Asia.
Code
# Apply the map
df['origin_name'] = df['origin'].map(
    {1: 'North America', 2: 'Europe', 3: 'Asia'})

# Shuffle the data, so that we hopefully see
# more regions.
df = df.reindex(np.random.permutation(df.index))

# Display
display(df)
Output
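The values shown below come from running a function across the data frame with apply. A minimal sketch that would produce a ratio series like this (assuming, as the efficiency column name used afterward suggests, displacement divided by horsepower):

Code

efficiency = df.apply(lambda row: row['displacement'] / row['horsepower'],
                      axis=1)
display(efficiency[0:10])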
Output
45 2.345455
290 2.471831
313 1.677778
82 1.237113
33 2.320000
249 2.363636
27 1.514286
7 2.046512
302 1.500000
179 1.234694
dtype: float64
You can now insert this series into the data frame, either as a new column or to replace an existing
column. The following code inserts this new series into the data frame.
Code
df['efficiency'] = efficiency
• https://www.irs.gov/pub/irs-soi/16zpallagi.csv
This URL contains US Government public data for "SOI Tax Stats - Individual Income Tax Statistics."
The entry point to the website is here:
• https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2016-zip-code-data-soi
Note that the file will have six rows for each zip code, one for each of the agi_stub brackets. You can skip zip codes with 0 or 99999.
We will create an output CSV with these columns, but with only one row per zip code, calculated as a weighted average of the income brackets. For example, the following six rows are present for 63017:
zipcode  agi_stub  N1
-------  --------  ----
63017    1         4710
63017    2         2780
63017    3         2130
63017    4         2010
63017    5         5240
63017    6         3510
We must combine these six rows into one. For privacy reasons, AGIs are broken out into six buckets. We need to combine the buckets and estimate the actual AGI of a zip code. To do this, consider the range that each agi_stub value represents:
• 1 = 1 to 25,000
• 2 = 25,000 to 50,000
• 3 = 50,000 to 75,000
• 4 = 75,000 to 100,000
• 5 = 100,000 to 200,000
• 6 = 200,000 or more
We assign each bracket its midpoint value:
• 1 = 12,500
• 2 = 37,500
• 3 = 62,500
• 4 = 87,500
• 5 = 112,500
• 6 = 212,500
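Using these midpoints, you can estimate the AGI for 63017 as a weighted average, weighting each midpoint by its N1 count. A minimal sketch of this arithmetic (not from the original text):

Code

n1 = [4710, 2780, 2130, 2010, 5240, 3510]
midpoints = [12500, 37500, 62500, 87500, 112500, 212500]

# Weighted average of the bracket midpoints
print(sum(n * m for n, m in zip(n1, midpoints)) / sum(n1))

Output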
88689.89205103042
import pandas as pd
First, we trim all zip codes that are either 0 or 99999. We also select the three fields that we need.
Code
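# Sketch of the steps described above (the original listing did not
# survive conversion): read the IRS file named earlier, trim zip codes
# 0 and 99999, and keep the three needed fields.
df = pd.read_csv('https://www.irs.gov/pub/irs-soi/16zpallagi.csv')
df = df[(df['zipcode'] != 0) & (df['zipcode'] != 99999)]
df = df[['zipcode', 'agi_stub', 'N1']]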
display ( df )
Output
We replace all of the agi_stub values with the corresponding median values using the map function.
Code
medians = {1: 12500, 2: 37500, 3: 62500, 4: 87500, 5: 112500, 6: 212500}
df['agi_stub'] = df.agi_stub.map(medians)
Output
Code
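# Sketch of the grouping step (the original listing did not survive
# conversion); the groups variable is used by the apply call below.
groups = df.groupby(by='zipcode')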
The program applies a lambda across the groups and calculates the AGI estimate.
Code
df = pd.DataFrame(groups.apply(
    lambda x: sum(x['N1'] * x['agi_stub']) / sum(x['N1']))) \
    .reset_index()
display(df)
Output
zipcode 0
0 1001 52895.322940
1 1002 64528.451001
2 1003 15441.176471
3 1005 54694.092827
4 1007 63654.353562
... ... ...
29867 99921 48042.168675
29868 99922 32954.545455
29869 99925 45639.534884
29870 99926 41136.363636
29871 99929 45911.214953
Code
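# The aggregate column produced by apply above is unnamed (shown as 0);
# rename it, as the output below suggests (a sketch, the original
# listing did not survive conversion).
df.columns = ['zipcode', 'agi_estimate']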
display(df)
Output
zipcode agi_estimate
0 1001 52895.322940
1 1002 64528.451001
2 1003 15441.176471
3 1005 54694.092827
4 1007 63654.353562
... ... ...
29867 99921 48042.168675
29868 99922 32954.545455
29869 99925 45639.534884
29870 99926 41136.363636
29871 99929 45911.214953
Finally, we check to see that our zip code of 63017 got the correct value.
Code
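# Sketch of the check described above (the original listing did not
# survive conversion).
display(df[df['zipcode'] == 63017])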
Output
zipcode agi_estimate
19909 63017 88689.892051
import os
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])
Output
if 'GOOGLE_API_KEY' in os.environ:
    # If the API key is defined in an environmental variable,
    # then use the env variable.
    GOOGLE_KEY = os.environ['GOOGLE_API_KEY']
else:
    # If you have a Google API key of your own, you can also just
    # put it here:
    GOOGLE_KEY = 'REPLACE WITH YOUR GOOGLE API KEY'
import requests

# The request itself did not survive conversion; the following is a
# sketch consistent with the URL template used later in this section.
# The address is a hypothetical placeholder.
URL = ('https://maps.googleapis.com/maps/api/geocode/json'
       '?key={}&address={}')
address = 'Washington University, St. Louis, MO'
response = requests.get(URL.format(GOOGLE_KEY, address))
resp_json_payload = response.json()
Output
Latitude and longitude might not be overly helpful if you feed them into the neural network directly as two features. These two values would allow your neural network to cluster locations on a map, and sometimes clustering locations on a map can be useful. Figure 2.2 shows the percentage of the population that smokes in the USA by state.
The above map shows that certain behaviors, like smoking, can be clustered by global region.
However, often you will want to transform the coordinates into distances. It is reasonably easy to
estimate the distance between any two points on Earth by using the great circle distance between any two
points on a sphere:
$$\Delta\sigma = \arccos\left(\sin\phi_1 \cdot \sin\phi_2 + \cos\phi_1 \cdot \cos\phi_2 \cdot \cos(\Delta\lambda)\right)$$

$$d = r \, \Delta\sigma$$

The following code implements this formula:
Code
# Distance function
from math import sin, cos, sqrt, atan2, radians

def distance_lat_lng(lat1, lng1, lat2, lng2):
    # approximate radius of earth in km
    R = 6373.0

    # The intermediate steps did not survive conversion; this is the
    # standard haversine form consistent with the return below.
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2)**2 \
        + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

# Find lat lon for address
def lookup_lat_lng(address):
    response = requests.get(
        URL.format(GOOGLE_KEY, address))
    json = response.json()
    if len(json['results']) == 0:
        raise ValueError("Google API error on: {}".format(address))
    location = json['results'][0]['geometry']['location']
    return location['lat'], location['lng']
# Distance between two locations
import requests

# The address definitions and the print did not survive conversion;
# the addresses below are hypothetical placeholders matching the
# output that follows.
address1 = "St. Louis, MO"
address2 = "Ft. Lauderdale, FL"

lat1, lng1 = lookup_lat_lng(address1)
lat2, lng2 = lookup_lat_lng(address2)

print("Distance, St. Louis, MO to Ft. Lauderdale, FL: {} km".format(
    distance_lat_lng(lat1, lng1, lat2, lng2)))
Output
Distance, St. Louis, MO to Ft. Lauderdale, FL: 1685.3019808607426 km
Distances can be a useful means to encode addresses. You should consider which distances might be helpful for your dataset. Consider:
• Distance to a major metropolitan area
• Distance to a competitor
• Distance to a distribution center
• Distance to a retail outlet
The following code calculates the distance between 10 universities and Washington University in St. Louis:
Code
# Encoding other universities by their distance to Washington University
schools = [
    ["Princeton University, Princeton, NJ 08544", 'Princeton'],
    ["Massachusetts Hall, Cambridge, MA 02138", 'Harvard'],
    ["5801 S Ellis Ave, Chicago, IL 60637", 'University of Chicago'],
    ["Yale, New Haven, CT 06520", 'Yale'],
    ["116th St & Broadway, New York, NY 10027", 'Columbia University'],
    ["450 Serra Mall, Stanford, CA 94305", 'Stanford'],
    ["77 Massachusetts Ave, Cambridge, MA 02139", 'MIT'],
    ["Duke University, Durham, NC 27708", 'Duke University'],
    ["University of Pennsylvania, Philadelphia, PA 19104",
     'University of Pennsylvania'],
    ["Johns Hopkins University, Baltimore, MD 21218", 'Johns Hopkins']
]

for address, name in schools:
    lat2, lng2 = lookup_lat_lng(address)
    dist = distance_lat_lng(lat1, lng1, lat2, lng2)
    print("School '{}', distance to wustl is: {}".format(name, dist))
Output
Chapter 3

Introduction to TensorFlow
Before CNNs, programs either encoded images to an intermediate form or sent the image input to a neural
network by merely squashing the image matrix into a long array by placing the image’s rows side-by-side.
CNNs are different as the matrix passes through the neural network layers.
Initially, this book will focus on 1D input to neural networks. However, later modules will focus more
heavily on higher dimension input.
The term dimension can be confusing in neural networks. In the sense of a 1D input vector, dimension refers to how many elements are in that 1D array. For example, a neural network with ten input neurons has ten dimensions. However, now that we have CNNs, the input can have multiple dimensions. The input to the neural network will usually have 1, 2, or 3 dimensions. Four or more dimensions are unusual. You might have a 2D input to a neural network with 64x64 pixels. This configuration would result in 4,096 input neurons. This network is either 2D or 4,096D, depending on which dimensions you reference.
Notice that the output of the regression neural network is numeric, and the classification output is a
class. Regression, or two-class classification, networks always have a single output. Classification neural
networks have an output neuron for each category.
Figure 3.2 shows the abstract structure of a single artificial neuron.
The artificial neuron receives input from one or more sources that may be other neurons or data fed
into the network from a computer program. This input is usually floating-point or binary. Often binary
input is encoded to floating-point by representing true or false as 1 or 0. Sometimes the program also
depicts the binary information using a bipolar system with true as one and false as -1.
An artificial neuron multiplies each of these inputs by a weight. Then it sums these products and passes this sum to an activation function. Some neural networks do not use an activation function. The following equation summarizes the calculated output of a neuron:
$$f(x, \theta) = \phi\left(\sum_i \theta_i \cdot x_i\right)$$
In the above equation, the variables x and θ represent the input and weights of the neuron. The variable i indexes the weights and inputs. You must always have the same number of weights as inputs. The neural network multiplies each weight by its respective input and feeds the products of these multiplications into an activation function, denoted by the Greek letter φ (phi). This process results in a single output from the neuron.

Consider an input vector with the following values:
[1, 2]
Because a bias neuron is present, the program should append the value of one as follows:
[1, 2, 1]
The weights for a 3-input layer (2 real inputs + bias) will always include an additional weight for the bias. A weight vector might be, for example, [0.1, 0.2, 0.3] (hypothetical values), where the final weight applies to the bias input of 1.
The neurons that form a layer share several characteristics. First, every neuron in a layer has the same activation function. However, the activation functions employed by each layer may be different. Each of the layers fully connects to the next layer. In other words, every neuron in one layer has a connection to every neuron in the next layer. The former figure is not fully connected. Several layers are missing connections. For example, I1 and N2 do not connect. The next neural network in Figure 3.5 is fully connected and has an additional layer.
In this figure, you see a fully connected, multilayered neural network. Networks such as this one will
always have an input and output layer. The hidden layer structure determines the name of the network
architecture. The network in this figure is a two-hidden-layer network. Most networks will have between
zero and two hidden layers. Without implementing deep learning strategies, networks with more than two
hidden layers are rare.
You might also notice that the arrows always point downward or forward from the input to the output. This type of neural network is called a feedforward neural network. Later in this course, we will see recurrent neural networks, which form loops among the neurons.
Neural networks typically accept floating-point vectors as their input. To be consistent, we will represent
the output of a single output neuron network as a single-element vector. Likewise, neural networks will
output a vector with a length equal to the number of output neurons. The output will often be a single
value from a single output neuron.
Hidden neurons, which lie between the input and output neurons, help form the output. Programmers often group hidden neurons into fully connected hidden layers. However, these hidden layers do not directly process the incoming data or the eventual output.
A common question for programmers concerns the number of hidden neurons in a network. Since the
answer to this question is complex, more than one section of the course will include a relevant discussion
of the number of hidden neurons. Before deep learning, researchers generally suggested that anything
more than a single hidden layer is excessive.[14] Researchers have proven that a single-hidden-layer neural
network can function as a universal approximator. In other words, this network should be able to learn to
produce (or approximate) any output from any input as long as it has enough hidden neurons in a single
layer.
Training refers to the process that determines good weight values. Before the advent of deep learning,
researchers feared additional layers would lengthen training time or encourage overfitting. Both concerns
are true; however, increased hardware speeds and clever techniques can mitigate these concerns. Before
researchers introduced deep learning techniques, we did not have an efficient way to train a deep network,
which is a neural network with many hidden layers. Although a single-hidden-layer neural network can
theoretically learn anything, deep learning facilitates a more complex representation of patterns in the
data.
$$f(x, w, b) = \frac{1}{1 + e^{-(wx+b)}}$$
The x variable represents the single input to the neural network. The w and b variables specify the
weight and bias of the neural network. The above equation combines the weighted sum of the inputs
and the sigmoid activation function. For this section, we will consider the sigmoid function because it
demonstrates a bias neuron’s effect.
The weights of the neuron allow you to adjust the slope or shape of the activation function. Figure 3.7
shows the effect on the output of the sigmoid activation function if the weight is varied:
The above diagram shows several sigmoid curves using the following parameters:
We did not use bias to produce the curves, which is evident in the third parameter of 0 in each case. Using four weight values yields four different sigmoid curves in the above figure. No matter the weight, we always get the same value of 0.5 when x is 0, because all curves pass through that same point. We might need the neural network to produce values other than 0.5 when the input is near 0.
Bias does shift the sigmoid curve, which allows values other than 0.5 when x is near 0. Figure 3.8 shows
the effect of using a weight of 1.0 with several different biases:
The above diagram shows several sigmoid curves with the following parameters:
We used a weight of 1.0 for these curves in all cases. When we utilized several different biases, sigmoid
curves shifted to the left or right. Because all the curves merge at the top right or bottom left, it is not a
complete shift.
When we put bias and weights together, they produce a curve that creates the necessary output. The above curves are the output from only one neuron. In a complete network, the output from many different neurons will combine to produce intricate output patterns.
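To make the weight and bias effects concrete, the following short sketch (not from the original text) plots the sigmoid for a fixed weight of 1.0 and several biases, reproducing the shift effect described above:

Code

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x, w, b):
    # f(x) = 1 / (1 + e^-(wx+b))
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

x = np.linspace(-5, 5, 200)
for b in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    plt.plot(x, sigmoid(x, 1.0, b), label=f"w=1.0, b={b}")
plt.legend()
plt.show()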
Historically, neural networks used the sigmoid or hyperbolic tangent activation function. However, modern deep neural networks primarily make use of the following activation functions:
• Rectified Linear Unit (ReLU) - Used for the output of hidden layers.[8]
• Softmax - Used for the output of classification neural networks.
• Linear - Used for the output of regression neural networks (or 2-class classification).
φ(x) = x
As you can observe, this activation function simply returns the value that the neuron inputs passed to
it. Figure 3.9 shows the graph for a linear activation function:
Regression neural networks, which learn to provide numeric values, will usually use a linear activation
function on their output layer. Classification neural networks, which determine an appropriate class for
their input, will often utilize a softmax activation function for their output layer.
φ(x) = max(0, x)
Without the softmax, the neuron's outputs are raw numeric values, with the highest indicating the winning class.
To see how the program uses the softmax activation function, we will look at a typical neural network
classification problem. The iris data set contains four measurements for 150 different iris flowers. Each of
these flowers belongs to one of three species of iris. When you provide the measurements of a flower, the
softmax function allows the neural network to give you the probability that these measurements belong to
each of the three species. For example, the neural network might tell you that there is an 80% chance that
the iris is setosa, a 15% probability that it is virginica, and only a 5% probability of versicolor. Because
these are probabilities, they must add up to 100%. There could not be an 80% probability of setosa, a 75%
probability of virginica, and a 20% probability of versicolor---this type of result would be nonsensical.
To classify input data into one of three iris species, you will need one output neuron for each species.
The output neurons do not inherently specify the probability of each of the three species. Therefore, it is
desirable to provide probabilities that sum to 100%. The neural network will tell you the likelihood of a
flower being each of the three species. To get the probability, use the softmax function in the following
equation:
exp(xi )
φi (x) = P
j exp(xj )
In the above equation, i represents the index of the output neuron (φ) that the program is calculating,
and j represents the indexes of all neurons in the group/level. The variable x designates the array of output
neurons. It’s important to note that the program calculates the softmax activation differently than the
other activation functions in this module. When softmax is the activation function, the output of a single
neuron is dependent on the other output neurons.
To see the softmax function in operation, refer to this Softmax example website.
Consider a trained neural network that classifies data into three categories: the three iris species. In
this case, you would use one output neuron for each of the target classes. Consider if the neural network
were to output the following:

[0.9, 0.2, 0.4]
The above output shows that the neural network considers the data to represent a setosa iris. However, these
numbers are not probabilities. The 0.9 value does not represent a 90% likelihood of the data representing
a setosa. These values sum to 1.5. For the program to treat them as probabilities, they must sum to 1.0.
The output vector for this neural network is the following:

[0.9, 0.2, 0.4]
If you provide this vector to the softmax function, it will return the following vector:

[0.4754, 0.2361, 0.2884]
The above three values do sum to 1.0 and can be treated as probabilities. The likelihood of the data
representing a setosa iris is 48% because the first value in the vector rounds to 0.48 (48%). You can
calculate this value in the following manner:
sum = exp(0.9) + exp(0.2) + exp(0.4) = 5.172830567
j0 = exp(0.9)/sum = 0.47548495534876745
j1 = exp(0.2)/sum = 0.2361188410001125
j2 = exp(0.4)/sum = 0.28839620365112
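You can verify this arithmetic with a short NumPy sketch (not from the original text):

Code

import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

print(softmax(np.array([0.9, 0.2, 0.4])))
# [0.47548496 0.23611884 0.2883962 ]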
$$\phi(x) = \frac{1}{1 + e^{-x}}$$
Use the sigmoid function to ensure that values stay within a relatively small range, as seen in Figure
3.12:
As you can see from the above graph, we can force values to a range. Here, the function compressed
values above or below 0 to the approximate range between 0 and 1.
φ(x) = tanh(x)
The graph of the hyperbolic tangent function has a similar shape to the sigmoid activation function, as
seen in Figure 3.13.
The hyperbolic tangent function has several advantages over the sigmoid activation function.
Textbooks often give this derivative in other forms. We use the above form for computational efficiency.
To see how we determined this derivative, refer to the following article.
We present the graph of the sigmoid derivative in Figure 3.15.
The derivative quickly saturates to zero as x moves from zero. This is not a problem for the derivative
of the ReLU, which is given here:
$$\phi'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$$
• TensorFlow Homepage
• TensorFlow GitHub
• TensorFlow Google Groups Support
• TensorFlow Google Groups Developer Discussion
• TensorFlow FAQ
• Supported by Google
• Works well on Windows, Linux, and Mac
• Excellent GPU support
• Python is an easy-to-learn programming language
• Python is extremely popular in the data science community
• TensorFlow - Google’s deep learning API. The focus of this class, along with Keras.
• Keras - Acts as a higher-level interface to TensorFlow.
• PyTorch - PyTorch is an open-source machine learning library based on the Torch library, used for computer vision and natural language processing applications. Facebook's AI Research lab primarily develops PyTorch.
In my opinion, the two primary Python libraries for deep learning are PyTorch and Keras. Generally, PyTorch requires more lines of code to perform the deep learning applications presented in this course, which gives Keras an easier learning curve. However, if you are creating entirely new neural network structures in a research setting, PyTorch can make for easier access to some of the low-level internals of deep learning.
import tensorflow as tf

# Create a Constant that produces a 1x2 matrix (restored; this line
# did not survive conversion).
matrix1 = tf.constant([[3., 3.]])

# Create another Constant that produces a 2x1 matrix.
matrix2 = tf.constant([[2.], [2.]])

# Multiply the two matrices (restored, consistent with the text below).
product = tf.matmul(matrix1, matrix2)

print(product)
print(float(product))
Output
This example multiplied two TensorFlow constant tensors. Next, we will see how to subtract a constant
from a variable.
Code
import tensorflow as tf

x = tf.Variable([1.0, 2.0])
a = tf.constant([3.0, 3.0])

sub = tf.subtract(x, a)
print(sub)
print(sub.numpy())
# ==> [-2. -1.]
Output
Of course, variables are only useful if their values can be changed. The program can accomplish this
change in value by calling the assign function.
Code
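# Sketch of the assign call described above (the original listing did
# not survive conversion); the new values are illustrative.
x.assign([4.0, 6.0])
print(x.numpy())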
Output
The program can now perform the subtraction with this new value.
Code
sub = tf.subtract(x, a)
print(sub)
print(sub.numpy())
Output
In the next section, we will see a TensorFlow example that has nothing to do with neural networks: a general-purpose rendering task. The code presented here can render a Mandelbrot set. Note, I based this code on a Mandelbrot example that I originally found with TensorFlow 1.0. I've updated the code slightly to comply with current versions of TensorFlow.
Code
# Import libraries for simulation
import tensorflow as tf
import numpy as np

# Imports for visualization
import PIL.Image
from io import BytesIO
from IPython.display import Image, display

Y, X = np.mgrid[-1.3:1.3:0.005, -2:1:0.005]
Z = X + 1j * Y
xs = tf.constant(Z.astype(np.complex64))
zs = tf.Variable(xs)
ns = tf.Variable(tf.zeros_like(xs, tf.float32))
# Operation to update the zs and the iteration count.
#
# Note: We keep computing zs after they diverge! This
# is very wasteful! There are better, if a little
# less simple, ways to do this.
#
for i in range(200):
    # Compute the new values of z: z^2 + x
    zs_ = zs * zs + xs

    # Have we diverged with this new value?
    not_diverged = tf.abs(zs_) < 4

    zs.assign(zs_)
    ns.assign_add(tf.cast(not_diverged, tf.float32))

# DisplayFractal was defined in an earlier (elided) listing.
DisplayFractal(ns.numpy())
Output
Mandelbrot rendering programs are both simple and infinitely complex at the same time. This view shows the entire Mandelbrot universe simultaneously, as a view completely zoomed out. However, if you zoom in on any non-black portion of the plot, you will find infinite hidden complexity.
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'year', 'origin']].values
y = df['mpg'].values  # regression
# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))  # Hidden 2
model.add(Dense(1))  # Output
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x, y, verbose=2, epochs=100)
Output
...
13/13 - 0s - loss: 139.3435
Epoch 100/100
13/13 - 0s - loss: 135.2217
• verbose=0 - No progress output (use with Jupyter if you do not want output).
• verbose=1 - Display progress bar, does not work well with Jupyter.
• verbose=2 - Summary progress output (use with Jupyter if you want to know the loss at each
epoch).
pred = model.predict(x)
print(f"Shape: {pred.shape}")
print(pred[0:10])
Output
Shape: (398, 1)
[[22.539425]
[27.995203]
[25.851433]
[25.711117]
[23.701847]
[31.893755]
[35.556503]
[34.45243 ]
[36.27014 ]
[31.358776]]
We would like to see how good these predictions are. We know the correct MPG for each car, so we can measure how close the neural network's predictions were.
Code
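# Sketch of the RMSE measurement (the original listing did not survive
# conversion), consistent with the score printed below.
import numpy as np
from sklearn import metrics

score = np.sqrt(metrics.mean_squared_error(pred, y))
print(f"Final score (RMSE): {score}")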
Output
Final score (RMSE): 11.552907365195134
The number printed above is the RMSE, which roughly indicates the average amount by which the predictions were above or below the expected output.
We can also print out the first ten cars with predictions and actual MPG.
Code
# Sample predictions; cars was defined in an earlier (elided) listing
for i in range(10):
    print(f"{i+1}. Car name: {cars[i]}, MPG: {y[i]}, "
          + f"predicted MPG: {pred[i]}")
Output
Next, we see a straightforward example of how to perform Iris classification using TensorFlow. We use the iris.csv file rather than the built-in data that many Google examples require.
Make sure that you always run previous code blocks. If you run the code block below without the code blocks above, you will get errors.
Code
import pandas as pd
import io
import requests
import numpy as np
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/iris.csv",
    na_values=['NA', '?'])

# Convert to numpy - Classification
x = df[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df['species'])  # Classification
species = dummies.columns
y = dummies.values

# Build neural network
model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(25, activation='relu'))  # Hidden 2
model.add(Dense(y.shape[1], activation='softmax'))  # Output

# Compile and fit (restored; these lines did not survive conversion
# but are implied by the training output below)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(x, y, verbose=2, epochs=100)
Output
...
5/5 - 0s - loss: 0.0851
Epoch 100/100
5/5 - 0s - loss: 0.0880
Code
# Print out number of species found:
print(species)
Output
Now that you have a trained neural network, we would like to use it. The following code makes use of our neural network. Exactly like before, we will generate predictions. Notice that three values come back for each of the 150 iris flowers. There were three types of iris (Iris-setosa, Iris-versicolor, and Iris-virginica).
Code
pred = model.predict(x)
print(f"Shape: {pred.shape}")
print(pred[0:10])
Output
Shape: (150, 3)
[[9.9768412e-01 2.3087766e-03 7.1474560e-06]
 [9.9349666e-01 6.4763017e-03 2.6995105e-05]
 [9.9618298e-01 3.7991456e-03 1.7790366e-05]
 [9.9207532e-01 7.8882594e-03 3.6453897e-05]
 [9.9791318e-01 2.0800228e-03 6.7602941e-06]
 [9.9684995e-01 3.1442614e-03 5.8112000e-06]
 [9.9547136e-01 4.5086881e-03 1.9946103e-05]
 [9.9625921e-01 3.7288493e-03 1.2040506e-05]
 [9.9011189e-01 9.8296851e-03 5.8434536e-05]
 [9.9447203e-01 5.5067884e-03 2.1272421e-05]]
If you would like to turn off scientific notation, the following line can be used:
Code
np.set_printoptions(suppress=True)
print(y[0:10])
Output
[[1 0 0]
[1 0 0]
[1 0 0]
[1 0 0]
[1 0 0]
[1 0 0]
[1 0 0]
[1 0 0]
[1 0 0]
[1 0 0]]
Usually, the program considers the column with the highest prediction to be the prediction of the neural
network. It is easy to convert the predictions to the expected iris species. The argmax function finds the
index of the maximum prediction for each row.
Code
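# Sketch of the argmax conversion described above (the original listing
# did not survive conversion); these variables are used later.
predict_classes = np.argmax(pred, axis=1)
expected_classes = np.argmax(y, axis=1)
print(f"Predictions: {predict_classes}")
print(f"Expected: {expected_classes}")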
Output
Predictions: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1
 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
Expected: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
Of course, it is straightforward to turn these indexes back into iris species. We use the species list that
we created earlier.
Code
print(species[predict_classes[1:10]])
Output
Accuracy might be a more easily understood error metric. It is essentially a test score. For all of the
iris predictions, what percent were correct? The downside is it does not consider how confident the neural
network was in each prediction.
Code
from sklearn.metrics import accuracy_score

correct = accuracy_score(expected_classes, predict_classes)
print(f"Accuracy: {correct}")
Output
Accuracy: 0.9733333333333334
The code below performs two ad hoc predictions. The first prediction is a single iris flower, and the second predicts two iris flowers. Notice that the argmax in the second prediction requires axis=1. Since we now have a 2D array, we must specify which axis to take the argmax over. The value axis=1 specifies we want the max column index for each row.
Code
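# Sketch of the single-flower prediction (the original listing did not
# survive conversion); the measurements are illustrative.
sample_flower = np.array([[5.0, 3.0, 4.0, 2.0]], dtype=float)
pred = model.predict(sample_flower)
print(pred)
pred = np.argmax(pred)
print(f"Predict that {sample_flower} is: {species[pred]}")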
Output
Code
sample_flower = np.array([[5.0, 3.0, 4.0, 2.0], [5.2, 3.5, 1.5, 0.8]],
                         dtype=float)
pred = model.predict(sample_flower)
print(pred)
pred = np.argmax(pred, axis=1)
print(f"Predict that these two flowers {sample_flower} ")
print(f"are: {species[pred]}")
Output
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'year', 'origin']].values
y = df['mpg'].values  # regression

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))  # Hidden 2
model.add(Dense(1))  # Output
model.compile(loss='mean_squared_error', optimizer='adam')

# Fit (restored; this line did not survive conversion but is implied
# by the training output below)
model.fit(x, y, verbose=2, epochs=100)

# Predict
pred = model.predict(x)

# save neural network structure to JSON (no weights);
# save_path was defined in an earlier (elided) listing
model_json = model.to_json()
with open(os.path.join(save_path, "network.json"), "w") as json_file:
    json_file.write(model_json)

# save entire network to HDF5 (save everything, suggested)
model.save(os.path.join(save_path, "network.h5"))
Output
...
13/13 - 0s - loss: 50.2118 - 25ms/epoch - 2ms/step
Epoch 100/100
13/13 - 0s - loss: 49.8828 - 25ms/epoch - 2ms/step
Before save score (RMSE): 7.044431690300903
The code below sets up a neural network and reads the data (for predictions), but it does not clear
the model directory or fit the neural network. The code loads the weights from the previous fit. Now we
reload the network and perform another prediction. The RMSE should match the previous one exactly if
we saved and reloaded the neural network correctly.
Code
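# Sketch of the reload described above (the original listing did not
# survive conversion), using Keras's load_model; save_path, x, and y
# come from the previous listing.
import numpy as np
from sklearn import metrics
from tensorflow.keras.models import load_model

model2 = load_model(os.path.join(save_path, "network.h5"))
pred = model2.predict(x)
score = np.sqrt(metrics.mean_squared_error(pred, y))
print(f"After load score (RMSE): {score}")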
Output
After load score (RMSE): 7.044431690300903
• Training Set
• Validation Set
• Holdout Set
You can construct these sets in several different ways. The following programs demonstrate some of these.
The first method is a training and validation set. We use the training data to train the neural network
until the validation set no longer improves. This attempts to stop at a near-optimal training point. This
method will only give accurate "out of sample" predictions for the validation set; this is usually 20% of the
data. The predictions for the training data will be overly optimistic, as these were the data that we used
to train the neural network. Figure 3.18 demonstrates how we divide the dataset.
Code
import pandas as pd
import io
import requests
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/iris.csv",
    na_values=['NA', '?'])

# Convert to numpy - Classification
x = df[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df['species'])  # Classification
species = dummies.columns
y = dummies.values

# Split into validation and training sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Build neural network
model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(25, activation='relu'))  # Hidden 2

# Output layer, compile, early stopping, and fit (restored sketch;
# these lines did not survive conversion but are implied by the
# discussion below)
model.add(Dense(y.shape[1], activation='softmax'))  # Output
model.compile(loss='categorical_crossentropy', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5,
                        verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          callbacks=[monitor], verbose=2, epochs=1000)
Output
There are a number of parameters that can be specified for the EarlyStopping object.
• min_delta This value should be kept small. It specifies the minimum change in error to be registered as an improvement. Setting it even smaller will not likely have a great deal of impact.
• patience How long should the training wait for the validation error to improve?
• verbose How much progress information do you want?
• mode In general, always set this to "auto". This allows you to specify if the error should be minimized or maximized. Consider accuracy, where higher numbers are desired, vs. log-loss/RMSE, where lower numbers are desired.
• restore_best_weights This should always be set to true. This restores the weights to the values they had when the validation score was best. Unless you are manually tracking the weights yourself (we do not use this technique in this course), you should have Keras perform this step for you.
As you can see from the above, not all of the requested epochs were used. The neural network training stopped once the validation set no longer improved.
Code
from sklearn.metrics import accuracy_score

pred = model.predict(x_test)
predict_classes = np.argmax(pred, axis=1)
expected_classes = np.argmax(y_test, axis=1)
correct = accuracy_score(expected_classes, predict_classes)
print(f"Accuracy: {correct}")
Output
Accuracy: 1.0
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'year', 'origin']].values
y = df['mpg'].values  # regression

# Split into validation and training sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))  # Hidden 2

# Output layer, compile, early stopping, and fit (restored sketch;
# these lines did not survive conversion)
model.add(Dense(1))  # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5,
                        verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          callbacks=[monitor], verbose=2, epochs=1000)
Output
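The first output above is the training run; we then measure the out-of-sample RMSE on the validation set. A minimal sketch of that step (the original listing did not survive conversion):

Code

pred = model.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print(f"Final score (RMSE): {score}")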
Output
Final score (RMSE): 5.291219300799398
Neural network training algorithms begin by initializing the weights to a random state. Training then progresses through iterations that continuously improve the weights to produce better output.
The random weights of a neural network impact how well that neural network can be trained. If a
neural network fails to train, you can remedy the problem by simply restarting with a new set of random
weights. However, this solution can be frustrating when you are experimenting with the architecture of a
neural network and trying different combinations of hidden layers and neurons. If you add a new layer,
and the network’s performance improves, you must ask yourself if this improvement resulted from the new
layer or from a new set of weights. Because of this uncertainty, we look for two key attributes in a weight
initialization algorithm:
• How consistently does this algorithm provide good weights?
• How much of an advantage do the weights of the algorithm provide?
One of the most common yet least practical approaches to weight initialization is to set the weights to
random values within a specific range. Numbers between -1 and +1 or -5 and +5 are often the choice. If
you want to ensure that you get the same set of random weights each time, you should use a seed. The seed
specifies a set of predefined random weights to use. For example, a seed of 1000 might produce random
weights of 0.5, 0.75, and 0.2. These values are still random; you cannot predict them, yet you will always
get these values when you choose a seed of 1000.
Not all seeds are created equal. One problem with random weight initialization is that the random weights
created by some seeds are much more difficult to train than others. The weights can be so bad that training
is impossible. If you cannot train a neural network with a particular weight set, you should generate a new
set of weights using a different seed.
Because weight initialization is a problem, there has been considerable research around it. By default, Keras uses the Xavier weight initialization algorithm, introduced in 2010 by Glorot and Bengio[7], which produces good weights with reasonable consistency. This relatively simple algorithm uses normally distributed random numbers.
To use the Xavier weight initialization, it is necessary to understand that normally distributed random
numbers are not the typical random numbers between 0 and 1 that most programming languages generate.
Normally distributed random numbers are centered on a mean (µ, mu) that is typically 0. If 0 is the center
(mean), then you will get an equal number of random numbers above and below 0. The next question
is how far these random numbers will venture from 0. In theory, you could end up with both positive
and negative numbers close to the maximum positive and negative ranges supported by your computer.
However, the reality is that you will more likely see random numbers within three standard deviations of the center.
The standard deviation (σ, sigma) parameter specifies the size of this standard deviation. For example,
if you specified a standard deviation of 10, you would mainly see random numbers between -30 and +30,
and the numbers nearer to 0 have a much higher probability of being selected.
The above figure illustrates that the center, which in this case is 0, will be generated with a 0.4 (40%)
probability. Additionally, the probability decreases very quickly beyond -2 or +2 standard deviations. By
defining the center and how large the standard deviations are, you can control the range of random numbers
that you will receive.
The Xavier weight initialization sets all weights to normally distributed random numbers. These weights
are always centered at 0; however, their standard deviation varies depending on how many connections are
present for the current layer of weights. Specifically, Equation 4.2 can determine the standard deviation:
$$\mathrm{Var}(W) = \frac{2}{n_{in} + n_{out}}$$
The above equation shows how to obtain the variance for all weights. The square root of the variance
is the standard deviation. Most random number generators accept a standard deviation rather than a
variance. As a result, you usually need to take the square root of the above equation. Figure 3.19 shows
how this algorithm might initialize one layer.
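A short sketch (not from the original text) of how you might draw one layer's weights with this scheme, assuming an illustrative layer with 10 inputs and 5 outputs:

Code

import numpy as np

n_in, n_out = 10, 5  # illustrative layer sizes
std = np.sqrt(2.0 / (n_in + n_out))  # square root of the variance above
weights = np.random.normal(loc=0.0, scale=std, size=(n_in, n_out))
print(std)
print(weights.shape)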
Code
# Create a dataset for the XOR function
import numpy as np

x = np.array([
    [0, 0],
    [1, 0],
    [0, 1],
    [1, 1]
])

y = np.array([
    0,
    1,
    1,
    0
])
# Build the network
# sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)

done = False
cycle = 1

# The training loop did not survive conversion; the following is a
# sketch consistent with the "Cycle #" output below: keep retraining a
# tiny two-hidden-neuron network until it learns XOR.
while not done:
    print(f"Cycle #{cycle}")
    cycle += 1

    model = Sequential()
    model.add(Dense(2, input_dim=x.shape[1], activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(x, y, verbose=0, epochs=10000)

    # Predict
    pred = model.predict(x)

    # Check if successful. It takes several runs with this
    # small of a network.
    done = pred[0] < 0.01 and pred[3] < 0.01 \
        and pred[1] > 0.9 and pred[2] > 0.9

    print(pred)
Output
Cycle #1
[[0.49999997]
[0.49999997]
[0.49999997]
[0.49999997]]
Cycle #2
[[0.33333334]
[1. ]
[0.33333334]
[0.33333334]]
Cycle #3
[[0.33333334]
[1. ]
[0.33333334]
[0.33333334]]
Cycle #4
[[0.]
[1.]
[1.]
[0.]]
Code
pred[3]
Output
array([0.], dtype=float32)
The output above should have two numbers near 0.0 for the first and fourth spots (inputs [0,0] and [1,1]). The middle two numbers should be near 1.0 (inputs [1,0] and [0,1]). These numbers will often appear in scientific notation. Due to random starting weights, it is sometimes necessary to run the above through several cycles to get a good result.
Now that we’ve trained the neural network, we can dump the weights.
Code
# Dump weights
for layerNum, layer in enumerate(model.layers):
    weights = layer.get_weights()[0]
    biases = layer.get_weights()[1]

    for toNeuronNum, bias in enumerate(biases):
        print(f'{layerNum}B -> L{layerNum+1}N{toNeuronNum}: {bias}')
Output
If you rerun this, you will probably get different weights. There are many ways to solve the XOR function.
In the next section, we copy the weights from above and recreate the calculations done by the neural network. Because weights can change with each training, the weights used in the code below came from a previous training run.
Code
input0 = 0
input1 = 1

hidden0Sum = (input0 * 1.3) + (input1 * 1.3) + (-1.3)
hidden1Sum = (input0 * 1.2) + (input1 * 1.2) + (0)

print(hidden0Sum)  # 0
print(hidden1Sum)  # 1.2

# ReLU activation (restored; these lines did not survive conversion)
hidden0 = max(0, hidden0Sum)
hidden1 = max(0, hidden1Sum)

print(hidden0)  # 0
print(hidden1)  # 1.2

# Output neuron (restored sketch; the 0.8 weight is implied by the
# printed 0.96, and the -1.0 hidden0 weight is illustrative, since
# hidden0 is 0)
outputSum = (hidden0 * -1.0) + (hidden1 * 0.8) + (0)
output = max(0, outputSum)
print(outputSum)  # 0.96
print(output)  # 0.96
Output
0.0
1.2
0
1.2
0.96
0.96
Chapter 4

Training for Tabular Data
4.1 Part 4.1: Encoding a Feature Vector for Keras Deep Learning
Neural networks can accept many types of data. We will begin with tabular data, where there are well-defined rows and columns. This is the sort of data you would typically see in Microsoft Excel. Neural networks require numeric input. This numeric form is called a feature vector. Each input neuron receives one feature (or column) from this vector. Each row of training data typically becomes one vector. In this section, we will see how to encode tabular data into a feature vector. You can see an example of tabular data below.
Code
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])
display(df)
Output
You can make the following observations from the above data:
• The target column is the column that you seek to predict. There are several candidates here. However, we will initially use the column "product". This field specifies what product someone bought.
• There is an ID column. You should exclude this column because it contains no information useful for prediction.
• Many of these fields are numeric and might not require further processing.
• The income column does have some missing values.
• There are categorical values: job, area, and product.
To begin with, we will convert the job code into dummy variables.
Code
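# Sketch of the dummy-variable conversion (the original listing did not
# survive conversion); the "job" prefix and the (2000, 33) shape are
# both described in the surrounding text.
dummies = pd.get_dummies(df['job'], prefix="job")
print(dummies.shape)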
display(dummies)
Output
(2000, 33)
Because there are 33 different job codes, there are 33 dummy variables. We also specified a prefix
because the job codes (such as "ax") are not that meaningful by themselves. Something such as "job_ax"
also tells us the origin of this field.
Next, we must merge these dummies back into the main data frame. We also drop the original "job"
field, as the dummies now represent it.
Code
df = pd.concat([df, dummies], axis=1)
df.drop('job', axis=1, inplace=True)
display(df)
Output
Output
Code
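# Sketch of the missing-value fill discussed below (the original
# listing did not survive conversion): fill income with its median.
med = df['income'].median()
df['income'] = df['income'].fillna(med)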
There are more advanced ways of filling in missing values, but they require more analysis. The idea
would be to see if another field might hint at what the income was. For example, it might be beneficial to
calculate a median income for each area or job category. This technique is something to keep in mind for
the class Kaggle competition.
At this point, the Pandas dataframe is ready to be converted to Numpy for neural network training.
We need to know a list of the columns that will make up x (the predictors or inputs) and y (the target).
The complete list of columns is:
Code
print(list(df.columns))
Output
['id', 'income', 'aspect', 'subscriptions', 'dist_healthy',
 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense',
 'crime', 'product', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf',
 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj',
 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw',
 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq',
 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz',
 'area_a', 'area_b', 'area_c', 'area_d']
This data includes both the target and predictors. We need a list with the target removed. We also
remove id because it is not useful for prediction.
Code
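# Sketch of the column selection described above (the original listing
# did not survive conversion); the same expression appears in the full
# listing further below.
x_columns = df.columns.drop('product').drop('id')
print(list(x_columns))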
Output
['income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate',
 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime',
 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv',
 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd',
 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb',
 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp',
 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b',
 'area_c', 'area_d']
# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product'])  # Classification
products = dummies.columns
y = dummies.values

print(x)
print(y)
Output
[[0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 1 ... 0 0 0]
 [0 0 1 ... 0 0 0]]
The x and y values are now ready for a neural network. Make sure that you construct the neural
network for a classification problem. Specifically,
• Classification neural networks have an output neuron count equal to the number of classes.
• Classification neural networks should use categorical_crossentropy and a softmax activation
function on the output layer.
• Binary Classification - Classification between two possibilities (positive and negative). Common in medical testing: does the person have the disease (positive) or not (negative)?
• Classification - Classification between more than two classes, such as the iris dataset (3-way classification).
• Regression - Numeric prediction. How many MPG does a car get? (covered in next video)
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/wcbreast_wdbc.csv",
    na_values=['NA', '?'])
display(df)
Output
ROC curves can be a bit confusing. However, they are prevalent in analytics. It is essential to know
how to read them. Even their name is confusing. Do not worry about their name; the receiver operating
characteristic curve (ROC) comes from electrical engineering (EE).
Binary classification is common in medical testing. Often you want to diagnose if someone has a disease.
This diagnosis can lead to two types of errors, known as false positives and false negatives:
• False Positive - Your test (neural network) indicated that the patient had the disease; however, the
patient did not.
• False Negative - Your test (neural network) indicated that the patient did not have the disease;
however, the patient did have the disease.
• True Positive - Your test (neural network) correctly identified that the patient had the disease.
• True Negative - Your test (neural network) correctly identified that the patient did not have the
disease.
Neural networks classify in terms of the probability of the input being positive. However, at what probability do you give a positive result? Is the cutoff 50%? 90%? Where you set this cutoff is called the threshold. Anything above the cutoff is positive; anything below is negative. Setting this cutoff allows the model to be more sensitive or specific:
More info on Sensitivity vs. Specificity: Khan Academy
Code
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import math

mu1 = -2
mu2 = 2
variance = 1
sigma = math.sqrt(variance)
x1 = np.linspace(mu1 - 5*sigma, mu1 + 4*sigma, 100)
x2 = np.linspace(mu2 - 5*sigma, mu2 + 4*sigma, 100)
plt.plot(x1, stats.norm.pdf(x1, mu1, sigma) / 1, color="green",
         linestyle='dashed')
plt.plot(x2, stats.norm.pdf(x2, mu2, sigma) / 1, color="red")
plt.axvline(x=-2, color="black")
Output
We will now train a neural network for the Wisconsin breast cancer dataset. We begin by preprocessing
the data. Because we have all numeric data, we compute a z-score for each column.
Code
from scipy.stats import zscore

# Z-score every predictor column (restored sketch; these lines did not
# survive conversion but are described in the text above).
x_columns = df.columns.drop('diagnosis').drop('id')
for col in x_columns:
    df[col] = zscore(df[col])

# Convert to numpy - Regression
x = df[x_columns].values
y = df['diagnosis'].map({'M': 1, 'B': 0}).values  # Binary classification,
# M is 1 and B is 0
We can now define two functions. The first function plots a confusion matrix. The second function
plots a ROC chart.
Code
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Plot an ROC. pred - the predictions, y - the expected output.
def plot_roc(pred, y):
    fpr, tpr, _ = roc_curve(y, pred)
    roc_auc = auc(fpr, tpr)

    plt.figure()
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC)')
    plt.legend(loc="lower right")
    plt.show()
Code
# Classification neural network
import numpy as np
import tensorflow.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split

# Split into train/test (matches the splits used elsewhere in this chapter)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], activation='relu',
                kernel_initializer='random_normal'))
model.add(Dense(50, activation='relu', kernel_initializer='random_normal'))
model.add(Dense(25, activation='relu', kernel_initializer='random_normal'))
model.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))
model.compile(loss='binary_crossentropy',
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
                        patience=5, verbose=1, mode='auto',
                        restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          callbacks=[monitor], verbose=2, epochs=1000)
Output
...
14/14 - 0s - loss: 0.0458 - accuracy: 0.9836 - val_loss: 0.0486 - val_accuracy: 0.9860 - 119ms/epoch - 8ms/step
Epoch 13/1000
Restoring model weights from the end of the best epoch: 8.
14/14 - 0s - loss: 0.0417 - accuracy: 0.9883 - val_loss: 0.0477 - val_accuracy: 0.9860 - 124ms/epoch - 9ms/step
Epoch 13: early stopping
Code
pred = model.predict(x_test)
plot_roc(pred, y_test)
Output
Code
import pandas as pd
from scipy.stats import zscore

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product'])  # Classification
products = dummies.columns
y = dummies.values
Code
# Classification neural network
import numpy as np
import tensorflow.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split

# Split into train/test (matches the splits used elsewhere in this chapter)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], activation='relu',
                kernel_initializer='random_normal'))
model.add(Dense(50, activation='relu', kernel_initializer='random_normal'))
model.add(Dense(25, activation='relu', kernel_initializer='random_normal'))
model.add(Dense(y.shape[1], activation='softmax',
                kernel_initializer='random_normal'))
model.compile(loss='categorical_crossentropy',
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
                        patience=5, verbose=1, mode='auto',
                        restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          callbacks=[monitor], verbose=2, epochs=1000)
Output
...
47/47 - 0s - loss: 0.6624 - accuracy: 0.7147 - val_loss: 0.7527 - val_accuracy: 0.6800 - 328ms/epoch - 7ms/step
Epoch 21/1000
Restoring model weights from the end of the best epoch: 16.
47/47 - 1s - loss: 0.6558 - accuracy: 0.7160 - val_loss: 0.7653 - val_accuracy: 0.6720 - 527ms/epoch - 11ms/step
Epoch 21: early stopping
accuracy = c/N
Where c is the number correct and N is the size of the evaluated set (training or validation). Higher
accuracy numbers are desired.
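As a tiny illustration with made-up labels (not the book's data), the calculation is a simple count:

import numpy as np

expected  = np.array([0, 1, 2, 2, 1, 0])  # hypothetical true classes
predicted = np.array([0, 1, 2, 1, 1, 0])  # hypothetical predictions
c = int(np.sum(expected == predicted))    # number correct (5)
N = len(expected)                         # size of the evaluated set (6)
print(c / N)                              # 0.8333...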
As we just saw, by default, Keras will return the percent probability for each class. We can change these prediction probabilities into the actual predicted class with argmax.
Code
pred = model.predict(x_test)
# raw probabilities to chosen class (highest probability)
pred = np.argmax(pred, axis=1)
Now that we have the actual predicted class, we can calculate the percent accuracy (how many were correctly classified).
Code
from sklearn import metrics

# Measure accuracy against the expected classes
y_compare = np.argmax(y_test, axis=1)  # one-hot to class index
score = metrics.accuracy_score(y_compare, pred)
print("Accuracy score: {}".format(score))
Output
Accuracy score: 0.7
Code
# Generate predictions
pred = model.predict(x_test)
score = metrics.log_loss(y_test, pred)
print("Log loss score: {}".format(score))

# raw probabilities to chosen class (highest probability)
pred = np.argmax(pred, axis=1)
Output
Numpy array of predictions
array([[0.    , 0.1201, 0.7286, 0.1494, 0.0018, 0.    , 0.    ],
       [0.    , 0.6962, 0.3016, 0.0001, 0.0022, 0.    , 0.    ],
       [0.    , 0.7234, 0.2708, 0.0003, 0.0053, 0.0001, 0.    ],
       [0.    , 0.3836, 0.6039, 0.0086, 0.0039, 0.    , 0.    ],
       [0.    , 0.0609, 0.6303, 0.3079, 0.001 , 0.    , 0.    ]],
      dtype=float32)
As percent probability
[ 0.0001 12.0143 72.8578 14.9446  0.1823  0.0009  0.0001]
Log loss score: 0.7423401429280638
log loss = −(1/N) Σi=1..N [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]
You should use this equation only as an objective function for classifications that have two outcomes.
The variable y-hat is the neural network’s prediction, and the variable y is the known correct answer. In
this case, y will always be 0 or 1. The training data have no probabilities. The neural network classifies it
either into one class (1) or the other (0).
The variable N represents the number of elements in the evaluated set, essentially the number of questions on the test. We divide by N because this process is customary for an average. We also begin the equation with a negative because the log function is always negative over the domain 0 to 1. This negation produces a positive score for the training to minimize.
You will notice two terms are separated by the addition (+). Each contains a log function. Because y
will be either 0 or 1, then one of these two terms will cancel out to 0. If y is 0, then the first term will
reduce to 0. If y is 1, then the second term will be 0.
If your prediction for the first class of a two-class prediction is y-hat, then your prediction for the second
class is 1 minus y-hat. Essentially, if your prediction for class A is 70% (0.7), then your prediction for class
B is 30% (0.3). Your score will increase by the negative log of your prediction for the correct class. If the neural network had predicted 1.0 for class A, and the correct answer was A, your score would increase by −log(1), which is 0. For log loss, we seek a low score, so a correct answer results in 0. Here are some negative log values for a neural network's probability estimate of the correct class (these illustrative values use the base-10 logarithm; sklearn's log_loss uses the natural logarithm):
• -log(1.0) = 0
• -log(0.95) = 0.02
• -log(0.9) = 0.05
• -log(0.8) = 0.1
• -log(0.5) = 0.3
• -log(0.1) = 1
• -log(0.01) = 2
• -log(1.0e-12) = 12
• -log(0.0) = negative infinity
As you can see, giving a low confidence to the correct answer affects the score the most. Because log (0)
is negative infinity, we typically impose a minimum value. Of course, the above log values are for a single
training set element. We will average the log values for the entire training set.
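As a from-scratch sketch of the equation above (using the natural logarithm, as sklearn does), with the same sample values that reappear in Part 4.5:

import numpy as np

y     = np.array([1, 1, 0, 0, 0])               # expected (always 0 or 1)
y_hat = np.array([0.9, 0.99, 0.1, 0.05, 0.06])  # predicted probabilities

eps = 1e-15                               # impose a minimum to avoid log(0)
y_hat = np.clip(y_hat, eps, 1 - eps)
log_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(log_loss)  # ~0.0668, matching metrics.log_loss in Part 4.5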
The log function is useful for penalizing wrong answers. The following code demonstrates the utility of the log function:
Code
%matplotlib inline
import numpy as np
from matplotlib.pyplot import figure, show
from numpy import arange

#t = arange(1e-5, 5.0, 0.00001)
#t = arange(1.0, 5.0, 0.00001)  # computer scientists
t = arange(0.0, 1.0, 0.00001)  # data scientists

fig = figure(1, figsize=(12, 10))
ax1 = fig.add_subplot(211)
ax1.plot(t, np.log(t))
ax1.grid(True)
ax1.set_ylim((-8, 1.5))
ax1.set_xlim((-0.1, 2))
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.set_title('log(x)')
show()
Output
A confusion matrix shows which predicted classes are often confused for the other classes. The vertical
axis (y) represents the true labels and the horizontal axis (x) represents the predicted labels. When the
true label and predicted label are the same, the highest values occur down the diagonal extending from
the upper left to the lower right. The other values, outside the diagonal, represent incorrect predictions.
For example, in the confusion matrix below, the value in row 2, column 1 shows how often the predicted
value A occurred when it should have been B.
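As a tiny illustration with made-up two-class labels (not the book's data), sklearn's confusion_matrix follows the same row/column convention:

from sklearn.metrics import confusion_matrix

true = ['A', 'B', 'B', 'A', 'B', 'B']   # hypothetical true labels
pred = ['A', 'A', 'B', 'A', 'B', 'B']   # hypothetical predictions
# Rows are true labels, columns are predicted labels; row B, column A
# counts how often a true B was predicted as A.
print(confusion_matrix(true, pred, labels=['A', 'B']))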
Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Compute confusion matrix
y_compare = np.argmax(y_test, axis=1)  # one-hot expected values to class index
cm = confusion_matrix(y_compare, pred)
np.set_printoptions(precision=2)

# Normalize by row (the number of samples in each true class) and display
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
print(cm_normalized)
plt.show()
Output
Normalized confusion matrix
[[0.95 0.05 0.   0.   0.   0.   0.  ]
 [0.02 0.78 0.2  0.   0.   0.   0.  ]
 [0.   0.29 0.7  0.01 0.   0.   0.  ]
 [0.   0.   0.71 0.29 0.   0.   0.  ]
 [0.   1.   0.   0.   0.   0.   0.  ]
 [0.59 0.41 0.   0.   0.   0.   0.  ]
 [1.   0.   0.   0.   0.   0.   0.  ]]
4.3 Part 4.3: Keras Regression for Deep Neural Networks with
RMSE
We evaluate regression results differently than classification. Consider the following code that trains a
neural network for regression on the data set jh-simple-dataset.csv. We begin by preparing the data
set.
Code
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Generate dummies for product
df = pd.concat([df, pd.get_dummies(df['product'], prefix="product")], axis=1)
df.drop('product', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Regression
x_columns = df.columns.drop('age').drop('id')
x = df[x_columns].values
y = df['age'].values

# Create train/test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))  # Hidden 2
model.add(Dense(1))  # Output
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
                        patience=5, verbose=1, mode='auto',
                        restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          callbacks=[monitor], verbose=2, epochs=1000)
Output
Mean squared error (MSE) is a common regression error metric:

MSE = (1/n) Σi=1..n (ŷi − yi)²
The following code calculates the MSE on the predictions from the neural network.
Code
from sklearn import metrics

# Predict
pred = model.predict(x_test)

# Measure MSE error.
score = metrics.mean_squared_error(pred, y_test)
print("Final score (MSE): {}".format(score))
Output
Final score (MSE): 0.5463447829677607
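The root mean squared error (RMSE) is simply the square root of the MSE, which reports the error in the same units as the target: RMSE = √MSE.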
Code
import numpy as np

# Measure RMSE error.
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print("Final score (RMSE): {}".format(score))
Output
Final score (RMSE): 0.7391513938076291
Code
# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Plot the expected values next to the predictions
chart_regression(pred.flatten(), y_test)
Output
4.4 Part 4.4: Training Neural Networks
Researchers have extended and modified classic backpropagation to give rise to many different training algorithms. This section will discuss the most commonly used training algorithms for neural networks. We begin with classic backpropagation and end the chapter with stochastic gradient descent (SGD).
Backpropagation is the primary means of determining a neural network's weights during training. Backpropagation works by calculating a weight change amount (vt) for every weight (θ, theta) in the neural network. This value is subtracted from every weight by the following equation:
θt = θt−1 − vt
We repeat this process for every iteration (t). The training algorithm determines how we calculate the weight change. Classic backpropagation calculates a gradient (∇, nabla) of the neural network's error function (J) for every weight in the network. We scale the gradient by a learning rate (η, eta).
vt = η∇θt−1 J(θt−1 )
The learning rate is an important concept for backpropagation training. Setting the learning rate can
be complex:
• Too low a learning rate will usually converge to a reasonable solution; however, the process will be
prolonged.
• Too high of a learning rate will either fail outright or converge to a higher error than a better learning
rate.
Common values for learning rate are: 0.1, 0.01, 0.001, etc.
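A minimal sketch of this update rule, assuming a simple one-weight error function J(θ) = (θ − 3)² that is purely illustrative (its gradient is 2(θ − 3)):

eta = 0.1      # learning rate
theta = 0.0    # initial weight

for t in range(25):
    grad = 2 * (theta - 3)   # gradient of J at the current weight
    v = eta * grad           # weight change amount v_t
    theta = theta - v        # theta_t = theta_{t-1} - v_t

print(theta)   # approaches 3.0, the weight with the lowest error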
Backpropagation is a gradient descent type, and many texts will use these two terms interchangeably.
Gradient descent refers to calculating a gradient on each weight in the neural network for each training
element. Because the neural network will not output the expected value for a training element, the gradient
of each weight will indicate how to modify each weight to achieve the expected output. If the neural network
did output exactly what was expected, the gradient for each weight would be 0, indicating that no change
to the weight is necessary.
The gradient is the derivative of the error function at the weight's current value. The error function measures the distance of the neural network's output from the expected output. We can use gradient descent, a process in which each weight's gradient guides the weight toward lower values of the error function.
The gradient is the partial derivative of the error function with respect to each weight in the neural network. Each weight has a gradient that is the slope of the error function. A weight is a connection between two neurons. Calculating the gradient of the error function allows the training method to determine whether it should increase or decrease the weight. In turn, this determination will decrease the error of the neural network. The error is the difference between the expected output and actual output of the neural network.
Many different training methods called propagation-training algorithms utilize gradients. In all of them,
the sign of the gradient tells the neural network the following information:
• Zero gradient - The weight does not contribute to the neural network’s error.
• Negative gradient - The algorithm should increase the weight to lower error.
• Positive gradient - The algorithm should decrease the weight to lower error.
Because many algorithms depend on gradient calculation, we will begin with an analysis of this process.
First of all, let’s examine the gradient. Essentially, training is a search for the set of weights that will cause
the neural network to have the lowest error for a training set. If we had infinite computation resources,
we would try every possible combination of weights to determine the one that provided the lowest error
during the training.
Because we do not have unlimited computing resources, we have to use some shortcuts to prevent
the need to examine every possible weight combination. These training methods utilize clever techniques
to avoid performing a brute-force search of all weight values. This type of exhaustive search would be
impossible because even small networks have an infinite number of weight combinations.
Consider a chart that shows the error of a neural network for each possible weight. Figure 4.2 is a
graph that demonstrates the error for a single weight:
Looking at this chart, you can easily see that the optimal weight is where the line has the lowest y-value. The problem is that we see only the error for the current value of the weight; we do not see the entire graph because obtaining the full picture would require an exhaustive search. However, we can determine the slope of the error curve at a particular weight. In the above chart, we see the slope of the error curve at 1.5. The straight line that barely touches the error curve at 1.5 gives the slope. In this case, the slope, or gradient, is -0.5622. The negative slope indicates that an increase in the weight will lower the error.
The gradient is the instantaneous slope of the error function at the specified weight. The derivative of the
error curve at that point gives the gradient. This line tells us the steepness of the error function at the given
weight.
Derivatives are one of the most fundamental concepts in calculus. For this book, you need to understand that a derivative provides the slope of a function at a specific point. Armed with this slope, a training technique can adjust the weight for a lower error. Using our working definition of the gradient, we will show how to calculate it.
Like the learning rate, momentum adds another training parameter that scales its effect. Momentum backpropagation has two training parameters: learning rate (η, eta) and momentum (λ, lambda). Momentum adds the scaled value of the previous weight change amount (vt−1) to the current weight change amount (vt), as shown in the update below.
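In the standard form of momentum backpropagation (using the same notation as the equations above), the update becomes:

vt = λvt−1 + η∇θt−1 J(θt−1)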
This technique has the effect of adding additional force behind the direction a weight is moving. Figure
4.3 shows how this might allow the weight to escape local minima.
A typical value for momentum is 0.9.
• Online Training - Update the weights based on gradients calculated from a single training set
element.
• Batch Training - Update the weights based on the sum of the gradients over all training set elements.
• Batch Size - Update the weights based on the sum of some batch size of training set elements.
• Mini-Batch Training - The same as batch-size training, but with a small batch size. Mini-batches are very popular, often in the 32-64 element range.
Because the batch size is smaller than the full training set size, it may take several batches to make it
completely through the training set.
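A minimal sketch of splitting a training set into mini-batches; the array shape and batch size are illustrative only:

import numpy as np

x = np.arange(100).reshape(50, 2)   # a hypothetical 50-element training set
batch_size = 16

for start in range(0, len(x), batch_size):
    batch = x[start:start + batch_size]
    # gradients would be computed on `batch` and the weights updated here
    print(batch.shape)   # (16, 2), (16, 2), (16, 2), (2, 2)
# four batches are needed to pass once through this training set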
Mini-batch training (as used by stochastic gradient descent) has several benefits:
• Computationally efficient. Each training step can be relatively fast, even with a huge training set.
• Decreases overfitting by focusing on only a portion of the training set each step.
However, this approach also introduces some challenges:
• Learning rate must be adjusted to a small enough level to train an accurate neural network.
• Momentum must be large enough to overcome local minima yet small enough not to destabilize the training.
• A single learning rate/momentum is often not good enough for the entire training process. It is often helpful to automatically decrease the learning rate as the training progresses.
• All weights share a single learning rate/momentum.
Other training techniques extend or replace classic backpropagation:
• Resilient Propagation - Use only the magnitude of the gradient and allow each neuron to learn at its own rate. There is no need for learning rate/momentum; however, it only works in full batch mode.
• Nesterov accelerated gradient - Helps mitigate the risk of choosing a bad mini-batch.
• Adagrad - Allows an automatically decaying per-weight learning rate and momentum concept.
• Adadelta - Extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing
learning rate.
• Non-Gradient Methods - Non-gradient methods can sometimes be useful, though rarely outper-
form gradient-based backpropagation methods. These include: simulated annealing, genetic algo-
rithms, particle swarm optimization, Nelder Mead, and many more.
Adam (adaptive moment estimation) keeps an exponentially decaying average of past gradients, the first moment (mt):

mt = β1 mt−1 + (1 − β1) gt
This average accomplishes a similar goal as classic momentum update; however, its value is calculated
automatically based on the current gradient (gt ). The update rule then calculates the second moment (vt ):
vt = β2 vt−1 + (1 − β2 )gt2
The values mt and vt are estimates of the gradients’ first moment (the mean) and the second moment
(the uncentered variance). However, they will be strongly biased towards zero in the initial training cycles.
The biases of the first and second moments are corrected as follows:

m̂t = mt / (1 − β1^t)

v̂t = vt / (1 − β2^t)
These bias-corrected first and second moment estimates are applied to the ultimate Adam update rule,
as follows:
θt = θt−1 − α · m̂t / (√v̂t + η)
Adam is very tolerant of the initial learning rate (α) and other training parameters. Kingma and Ba (2014) propose default values of 0.9 for β1, 0.999 for β2, and 10^−8 for η.
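A from-scratch NumPy sketch of the Adam update above; the toy gradient of (θ − 3)² is illustrative only:

import numpy as np

alpha, beta1, beta2, eta = 0.01, 0.9, 0.999, 1e-8
theta = np.array([0.0])
m = np.zeros_like(theta)   # first moment estimate
v = np.zeros_like(theta)   # second moment estimate

for t in range(1, 1001):
    g = 2 * (theta - 3)                  # toy gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eta)

print(theta)   # moves toward 3.0, the minimum of the toy error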
Code
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Generate dummies for product
df = pd.concat([df, pd.get_dummies(df['product'], prefix="product")], axis=1)
df.drop('product', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Regression
x_columns = df.columns.drop('age').drop('id')
x = df[x_columns].values
y = df['age'].values

# Create train/test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))  # Hidden 2
model.add(Dense(1))  # Output
model.compile(loss='mean_squared_error', optimizer='adam')  # Modify here
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5,
                        verbose=1, mode='auto', restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          callbacks=[monitor], verbose=0, epochs=1000)
Output
4.5 Part 4.5: Error Calculation from Scratch
Code
from sklearn import metrics
import numpy as np

predicted = [1.1, 1.9, 3.4, 4.2, 4.3]
expected = [1, 2, 3, 4, 5]

score_mse = metrics.mean_squared_error(predicted, expected)
score_rmse = np.sqrt(score_mse)
print("Score (MSE): {}".format(score_mse))
print("Score (RMSE): {}".format(score_rmse))
Output
Score (MSE): 0.14200000000000007
Score (RMSE): 0.37682887362833556
Code
# Calculate the same MSE/RMSE from scratch with NumPy only
score_mse = np.sum((np.array(predicted) - np.array(expected))**2) / len(predicted)
score_rmse = np.sqrt(score_mse)
print("Score (MSE): {}".format(score_mse))
print("Score (RMSE): {}".format(score_rmse))
Output
Score (MSE): 0.14200000000000007
Score (RMSE): 0.37682887362833556
4.5.1 Classification
We will now look at how to calculate log loss by hand. For this, we look at a binary prediction. The prediction is a number between 0 and 1 that indicates the probability of true (1). The expected value is always 0 or 1. Therefore, a prediction of 1.0 is completely correct if the expected value is 1 and completely wrong if the expected value is 0.
Code
from sklearn import metrics

expected = [1, 1, 0, 0, 0]
predicted = [0.9, 0.99, 0.1, 0.05, 0.06]

print(metrics.log_loss(expected, predicted))
Output
0.06678801305495843
5.1 Part 5.1: Introduction to Regularization: Ridge and Lasso
We will look at linear regression to see how L1 and L2 regularization work. The following code sets up
the auto-mpg data for this purpose.
Code
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
names = ['cylinders', 'displacement', 'horsepower', 'weight',
         'acceleration', 'year', 'origin']
x = df[names].values
y = df['mpg'].values  # regression
We will use the data just loaded for several examples. The first examples in this part use several forms
of linear regression. For linear regression, it is helpful to examine the model’s coefficients. The following
function is utilized to display these coefficients.
Code
# Simple function to evaluate the coefficients of a regression
%matplotlib inline
from IPython.display import display, HTML

def report_coef(names, coef, intercept):
    r = pd.DataFrame({'coef': coef, 'positive': coef >= 0},
                     index=names)
    r = r.sort_values(by=['coef'])
    display(r)
    print(f"Intercept: {intercept}")
    r['coef'].plot(kind='barh', color=r['positive'].map(
        {True: 'b', False: 'r'}))
Before jumping into L1/L2 regularization, we begin with linear regression. Researchers first introduced
the L1/L2 form of regularization for linear regression. We can also make use of L1/L2 for neural networks.
To fully understand L1/L2 we will begin with how we can use them with linear regression.
The following code uses linear regression to fit the auto-mpg data set. The RMSE reported will not be
as good as a neural network.
Code
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split settings assumed to match the earlier examples
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# Create and fit/train linear regression, then predict and measure RMSE
regressor = LinearRegression()
regressor.fit(x_train, y_train)
pred = regressor.predict(x_test)
print(f"Final score (RMSE): {np.sqrt(metrics.mean_squared_error(pred, y_test))}")
report_coef(names, regressor.coef_, regressor.intercept_)
Output
coef positive
cylinders -0.427721 False
weight -0.007255 False
horsepower -0.005491 False
displacement 0.020166 True
acceleration 0.138575 True
year 0.783047 True
origin 1.003762 True
Final score (RMSE): 3.0019345985860784
Intercept: -19.101231042200112
L1 (lasso) regularization adds a penalty to the error function proportional to the sum of the absolute values of the weights, scaled by α:

E1 = α Σw |w|
You should use L1 regularization to create sparsity in the neural network. In other words, the L1
algorithm will push many weight connections to near 0. When the weight is near 0, the program drops it
from the network. Dropping weighted connections will create a sparse neural network.
The following code demonstrates lasso regression. Notice the effect of the coefficients compared to the
previous section that used linear regression.
Code
import sklearn
from sklearn.linear_model import Lasso

# Create linear regression
regressor = Lasso(random_state=0, alpha=0.1)

# Fit/train LASSO
regressor.fit(x_train, y_train)

# Predict and measure RMSE
pred = regressor.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print(f"Final score (RMSE): {score}")

report_coef(
    names,
    regressor.coef_,
    regressor.intercept_)
Output
coef positive
cylinders -0.012995 False
weight -0.007328 False
horsepower -0.002715 False
displacement 0.011601 True
acceleration 0.114391 True
origin 0.708222 True
year 0.777480 True
Final score (RMSE): 3.0604021904033303
Intercept: -18.506677982383252
L2 (ridge) regularization instead penalizes the sum of the squares of the weights:

E2 = α Σw w²
Like the L1 algorithm, the α value determines how important the L2 objective is compared to the
neural network’s error. Typical L2 values are below 0.1 (10%). The main calculation performed by L2 is
the summing of the squares of all of the weights. The algorithm will not sum bias values.
You should use L2 regularization when you are less concerned about creating a sparse network and are more concerned about low weight values. The lower weight values will typically lead to less overfitting.
Generally, L2 regularization will produce better overall performance than L1. However, L1 might be useful
in situations with many inputs, and you can prune some of the weaker inputs.
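A tiny sketch of computing the two penalty terms above for a made-up weight vector:

import numpy as np

w = np.array([0.5, -1.2, 0.0, 2.0])  # hypothetical weights (no biases)
alpha = 0.01
E1 = alpha * np.sum(np.abs(w))   # L1: sum of absolute weights
E2 = alpha * np.sum(w ** 2)      # L2: sum of squared weights
print(E1, E2)                    # 0.037, 0.0569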
The following code uses L2 with linear regression (Ridge regression):
Code
import sklearn
from sklearn.linear_model import Ridge

# Create linear regression
regressor = Ridge(alpha=1)

# Fit/train Ridge
regressor.fit(x_train, y_train)

# Predict
pred = regressor.predict(x_test)

report_coef(
    names,
    regressor.coef_,
    regressor.intercept_)
Output
coef positive
cylinders -0.421393 False
weight -0.007257 False
horsepower -0.005385 False
displacement 0.020006 True
acceleration 0.138470 True
year 0.782889 True
origin 0.994621 True
Final score (RMSE): {score}
Intercept: -19.07980074425469
The ElasticNet algorithm combines both forms of regularization, weighting the two penalties: a · L1 + b · L2
Code
import sklearn
from sklearn.linear_model import ElasticNet

# Create linear regression
regressor = ElasticNet(alpha=0.1, l1_ratio=0.1)

# Fit/train ElasticNet
regressor.fit(x_train, y_train)

# Predict and measure RMSE
pred = regressor.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print(f"Final score (RMSE): {score}")

report_coef(
    names,
    regressor.coef_,
    regressor.intercept_)
Output
coef positive
cylinders -0.274010 False
weight -0.007303 False
horsepower -0.003231 False
displacement 0.016194 True
acceleration 0.132348 True
year 0.777482 True
origin 0.782781 True
Final score (RMSE): 3.0450899960775013
Intercept: -18.389355690429767
5.2 Part 5.2: Using K-Fold Cross-validation with Keras
Cross-validation serves several purposes in predictive modeling, such as:
• Estimate a good number of epochs to train a neural network for (early stopping)
• Evaluate the effectiveness of certain hyperparameters, such as activation functions, neuron counts, and layer counts
Cross-validation uses several folds and multiple models to provide each data segment a chance to serve as
both the validation and training set. Figure 5.1 shows cross-validation.
It is important to note that each fold will have one model (neural network). To generate predictions
for new data (not present in the training set), predictions from the fold models can be handled in several
ways:
• Choose the model with the highest validation score as the final model.
• Present new data to the five models (one for each fold) and average the results (this is an ensemble).
• Retrain a new model (using the same settings as the cross-validation) on the entire dataset. Train for the same number of epochs and with the same hidden layer structure.
Generally, I prefer the last approach and will retrain a model on the entire data set once I have selected
hyper-parameters. Of course, I will always set aside a final holdout set for model validation that I do not
use in any aspect of the training process.
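A minimal sketch of the second option above, averaging fold-model predictions; fold_models and x_new are assumed to come from a cross-validation loop like the ones below:

import numpy as np

def ensemble_predict(fold_models, x_new):
    # element-wise average of each fold model's predictions
    preds = [m.predict(x_new) for m in fold_models]
    return np.mean(preds, axis=0)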
The following two sections demonstrate cross-validation with classification and regression.
We begin by preparing a feature vector using the jh-simple-dataset to predict age. This model is set up
as a regression problem.
Code
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Generate dummies for product
df = pd.concat([df, pd.get_dummies(df['product'], prefix="product")], axis=1)
df.drop('product', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Regression
x_columns = df.columns.drop('age').drop('id')
x = df[x_columns].values
y = df['age'].values
Now that the feature vector is created, a 5-fold cross-validation can be performed to generate out-of-sample predictions. We will assume 500 epochs and not use early stopping. Later we will see how we can estimate a more optimal epoch count.
Code
EPOCHS = 500

import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# Cross-Validate
kf = KFold(5, shuffle=True, random_state=42)

oos_y = []
oos_pred = []

fold = 0
for train, test in kf.split(x):
    fold += 1
    print(f"Fold #{fold}")

    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    model = Sequential()
    model.add(Dense(20, input_dim=x.shape[1], activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    # Train this fold's model for a fixed number of epochs
    model.fit(x_train, y_train, verbose=0, epochs=EPOCHS)

    pred = model.predict(x_test)

    oos_y.append(y_test)
    oos_pred.append(pred)

    # Measure this fold's error
    score = np.sqrt(metrics.mean_squared_error(pred, y_test))
    print(f"Fold score (RMSE): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred, oos_y))
print(f"Final, out of sample score (RMSE): {score}")

# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat([df, oos_y, oos_pred], axis=1)
#oosDF.to_csv(filename_write, index=False)
Output
Fold #1
Fold score (RMSE): 0.6814299426511208
Fold #2
Fold score (RMSE): 0.45486513719487165
Fold #3
Fold score (RMSE): 0.571615041876392
Fold #4
Fold score (RMSE): 0.46416356081116916
Fold #5
Fold score (RMSE): 1.0426518491685475
Final, out of sample score (RMSE): 0.678316077597408
When early stopping is used with cross-validation, each fold reports the number of epochs it needed; a common technique is to then train on the entire dataset for the average number of epochs required across the folds.
The following code trains and fits the jh-simple-dataset dataset with cross-validation to generate out-of-sample predictions. It also writes the out-of-sample (predictions on the test set) results.
It is good to perform stratified k-fold cross-validation with classification data. This technique ensures
that the percentages of each class remain the same across all folds. Use the StratifiedKFold object
instead of the KFold object used in the regression.
Code
import pandas as pd
from scipy.stats import zscore

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product'])  # Classification
products = dummies.columns
y = dummies.values
We will assume 500 epochs and not use early stopping. Later we will see how we can estimate a more
optimal epoch count.
Code
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

oos_y = []
oos_pred = []
fold = 0

# Must specify y to StratifiedKFold so it can preserve class proportions
kf = StratifiedKFold(5, shuffle=True, random_state=42)
for train, test in kf.split(x, df['product']):
    fold += 1
    print(f"Fold #{fold}")

    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    model = Sequential()
    # Hidden 1
    model.add(Dense(50, input_dim=x.shape[1], activation='relu'))
    model.add(Dense(25, activation='relu'))  # Hidden 2
    model.add(Dense(y.shape[1], activation='softmax'))  # Output
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.fit(x_train, y_train, verbose=0, epochs=EPOCHS)  # EPOCHS defined above

    pred = model.predict(x_test)

    oos_y.append(y_test)
    # raw probabilities to chosen class (highest probability)
    pred = np.argmax(pred, axis=1)
    oos_pred.append(pred)

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test, axis=1)  # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_y_compare = np.argmax(oos_y, axis=1)  # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score}")

# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat([df, oos_y, oos_pred], axis=1)
#oosDF.to_csv(filename_write, index=False)
Output
Fold #1
Fold score (accuracy): 0.6325
Fold #2
Fold score (accuracy): 0.6725
Fold #3
Fold score (accuracy): 0.6975
Fold #4
Fold score (accuracy): 0.6575
Fold #5
Fold score (accuracy): 0.675
Final score (accuracy): 0.667
Code
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Generate dummies for product
df = pd.concat([df, pd.get_dummies(df['product'], prefix="product")], axis=1)
df.drop('product', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Regression
x_columns = df.columns.drop('age').drop('id')
x = df[x_columns].values
y = df['age'].values
Now that the data has been preprocessed, we are ready to build the neural network.
Code
from sklearn.model_selection import train_test_split
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Keep a 10% holdout
x_main, x_holdout, y_main, y_holdout = train_test_split(
    x, y, test_size=0.10)

# Cross-validate
kf = KFold(5)

oos_y = []
oos_pred = []
fold = 0

for train, test in kf.split(x_main):
    fold += 1
    print(f"Fold #{fold}")

    x_train = x_main[train]
    y_train = y_main[train]
    x_test = x_main[test]
    y_test = y_main[test]

    model = Sequential()
    model.add(Dense(20, input_dim=x.shape[1], activation='relu'))
    model.add(Dense(5, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(x_train, y_train, verbose=0, epochs=EPOCHS)  # EPOCHS defined earlier

    pred = model.predict(x_test)

    oos_y.append(y_test)
    oos_pred.append(pred)

    # Measure accuracy
    score = np.sqrt(metrics.mean_squared_error(pred, y_test))
    print(f"Fold score (RMSE): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred, oos_y))
print()
print(f"Cross-validated score (RMSE): {score}")

# Measure the error on the holdout set (using the last fold's model)
holdout_pred = model.predict(x_holdout)
score = np.sqrt(metrics.mean_squared_error(holdout_pred, y_holdout))
print(f"Holdout score (RMSE): {score}")
Output
Fold #1
Fold score (RMSE): 0.544195299216696
Fold #2
Fold score (RMSE): 0.48070599342910353
Fold #3
Fold score (RMSE): 0.7034584765928998
Fold #4
Fold score (RMSE): 0.5397141785190473
Fold #5
Fold score (RMSE): 24.126205213080077
Cross-validated score (RMSE): 10.801732731207947
Holdout score (RMSE): 24.097657947297677
As you can see, the L1 algorithm is more tolerant of weights further from 0, whereas the L2 algorithm is less tolerant. We will highlight other important differences between L1 and L2 in the following sections.
You also need to note that both L1 and L2 count their penalties based only on weights; they do not count
penalties on bias values. Keras allows l1/l2 to be directly added to your network.
Code
import pandas as pd
from scipy.stats import zscore

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product'])  # Classification
products = dummies.columns
y = dummies.values
Code
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import regularizers

# Cross-validate
kf = KFold(5, shuffle=True, random_state=42)

oos_y = []
oos_pred = []
fold = 0

for train, test in kf.split(x):
    fold += 1
    print(f"Fold #{fold}")

    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    #kernel_regularizer=regularizers.l2(0.01),

    model = Sequential()
    # Hidden 1
    model.add(Dense(50, input_dim=x.shape[1],
                    activation='relu',
                    activity_regularizer=regularizers.l1(1e-4)))
    # Hidden 2
    model.add(Dense(25, activation='relu',
                    activity_regularizer=regularizers.l1(1e-4)))
    # Output
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.fit(x_train, y_train, verbose=0, epochs=500)  # epoch count assumed as in earlier examples

    pred = model.predict(x_test)

    oos_y.append(y_test)
    # raw probabilities to chosen class (highest probability)
    pred = np.argmax(pred, axis=1)
    oos_pred.append(pred)

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test, axis=1)  # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_y_compare = np.argmax(oos_y, axis=1)  # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score}")

# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat([df, oos_y, oos_pred], axis=1)
#oosDF.to_csv(filename_write, index=False)
Output
Fold #1
Fold score (accuracy): 0.64
Fold #2
Fold score (accuracy): 0.6775
Fold #3
Fold score (accuracy): 0.6825
Fold #4
Fold score (accuracy): 0.6675
Fold #5
Fold score (accuracy): 0.645
Final score (accuracy): 0.6625
5.4 Part 5.4: Drop Out for Keras to Decrease Overfitting
Several neural networks can be combined into an ensemble, a term that originates from the musical ensembles in which the final music product that the audience hears is the combination of many instruments.
Bootstrapping is one of the simplest ensemble techniques. The bootstrapping programmer simply trains several neural networks to perform precisely the same task. However, each neural network will perform differently because of the stochastic nature of training and the random numbers used in neural network weight initialization. The difference in weights causes the performance variance. The output from this ensemble of neural networks becomes the average output of the members taken together. This process decreases overfitting through the consensus of differently trained neural networks.
Dropout works somewhat like bootstrapping. You might think of each neural network that results
from a different set of neurons being dropped out as an individual member in an ensemble. As training
progresses, the program creates more neural networks in this way. However, dropout does not require the
same amount of processing as bootstrapping. The new neural networks created are temporary; they exist
only for a training iteration. The final result is also a single neural network rather than an ensemble of
neural networks to be averaged together.
The following animation shows how dropout works: animation link
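A from-scratch sketch of dropout as a random mask applied during training; the layer activations and the 0.5 rate below are illustrative only:

import numpy as np

rng = np.random.default_rng(42)
activations = np.array([0.2, 0.9, 0.5, 0.7, 0.1])  # hypothetical layer outputs
rate = 0.5

mask = rng.random(activations.shape) >= rate  # drop roughly half the neurons
dropped = activations * mask / (1.0 - rate)   # inverted-dropout scaling, as Keras uses
print(dropped)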
Code
import pandas as pd
from scipy.stats import zscore

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product'])  # Classification
products = dummies.columns
y = dummies.values
########################################
# Keras with dropout for Classification
########################################
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras import regularizers

# Cross-validate
kf = KFold(5, shuffle=True, random_state=42)

oos_y = []
oos_pred = []
fold = 0

for train, test in kf.split(x):
    fold += 1
    print(f"Fold #{fold}")

    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    # kernel_regularizer=regularizers.l2(0.01),

    model = Sequential()
    model.add(Dense(50, input_dim=x.shape[1], activation='relu'))  # Hidden 1
    model.add(Dropout(0.5))
    model.add(Dense(25, activation='relu',
                    activity_regularizer=regularizers.l1(1e-4)))  # Hidden 2
    # Usually do not add dropout after final hidden layer
    # model.add(Dropout(0.5))
    model.add(Dense(y.shape[1], activation='softmax'))  # Output
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # The fit call did not survive; 500 epochs is an assumption matching
    # the chapter's other cross-validation examples.
    model.fit(x_train, y_train, validation_data=(x_test, y_test),
              verbose=0, epochs=500)

    pred = model.predict(x_test)

    oos_y.append(y_test)
    # raw probabilities to chosen class (highest probability)
    pred = np.argmax(pred, axis=1)
    oos_pred.append(pred)

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test, axis=1)  # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_y_compare = np.argmax(oos_y, axis=1)  # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score}")

# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat([df, oos_y, oos_pred], axis=1)
# oosDF.to_csv(filename_write, index=False)
Output
Fold #1
Fold score (accuracy): 0.68
Fold #2
Fold score (accuracy): 0.695
Fold #3
Fold score (accuracy): 0.7425
Fold #4
Fold score (accuracy): 0.71
Fold #5
Fold score (accuracy): 0.6625
Final score (accuracy): 0.698
To try out each of these hyperparameters, you will need to train neural networks with multiple settings for each hyperparameter. However, you may have noticed that neural networks often produce somewhat different results when trained multiple times because they start with random weights. Because of this, it is necessary to fit and evaluate a neural network several times to ensure that one set of hyperparameters is actually better than another. Bootstrapping can be an effective means of benchmarking (comparing) two sets of hyperparameters.
Bootstrapping is similar to cross-validation. Both go through a number of cycles/folds, providing validation and training sets. However, bootstrapping can have an unlimited number of cycles. Bootstrapping chooses a new train and validation split each cycle, with replacement. Because each cycle is chosen with replacement, unlike cross-validation, there will often be repeated rows selected between cycles. If you run the bootstrap for enough cycles, there will even be duplicate cycles.
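As a minimal illustration of these repeated selections, the following sketch (not from the book) uses scikit-learn's ShuffleSplit, the class the benchmarking code below relies on, to draw three cycles over ten rows; notice how validation rows recur across cycles:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10).reshape(-1, 1)
ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=1)
for i, (train_idx, val_idx) in enumerate(ss.split(X)):
    print(f"cycle {i}: validation rows {sorted(val_idx)}")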
In this part, we will use bootstrapping for hyperparameter benchmarking. We will train a neural network for a specified number of splits (denoted by the SPLITS constant). For these examples, we use 100. We will compare the average score at the end of the 100 splits. By the end of the cycles, the mean score will have converged somewhat. This ending score will be a much better basis of comparison than a single cross-validation. Additionally, the average number of epochs will be tracked to give an idea of a possible optimal value. Because the early stopping validation set is also used to evaluate the neural network, the resulting score might be slightly inflated. This is because we are both stopping and evaluating on the same sample. However, we are using the scores only as relative measures to determine the superiority of one set of hyperparameters over another, so this slight inflation should not present too much of a problem.
Because we are benchmarking, we will display the amount of time taken for each cycle. The following
function can be used to nicely format a time span.
Code
# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Generate dummies for product
df = pd.concat([df, pd.get_dummies(df['product'], prefix="product")], axis=1)
df.drop('product', axis=1, inplace=True)

# Missing values for income (these two lines are restored from the
# identical preprocessing used elsewhere in this chapter)
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Regression (the target here is age)
x_columns = df.columns.drop('age').drop('id')
x = df[x_columns].values
y = df['age'].values
The following code performs the bootstrap. The architecture of the neural network can be adjusted to
compare many different configurations.
Code
import pandas as pd
import os
import numpy as np
import time
import statistics
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import ShuffleSplit

SPLITS = 50

# Bootstrap
boot = ShuffleSplit(n_splits=SPLITS, test_size=0.1, random_state=42)

# Track progress
mean_benchmark = []
epochs_needed = []
num = 0

# Loop through samples
for train, test in boot.split(x):
    start_time = time.time()
    num += 1

    # Split train and test
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    # Construct neural network
    model = Sequential()
    model.add(Dense(20, input_dim=x_train.shape[1], activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    # This EarlyStopping definition did not survive; it is restored to
    # match the classification version later in this part.
    monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
        patience=25, verbose=0, mode='auto', restore_best_weights=True)

    # Train on the bootstrap sample
    model.fit(x_train, y_train, validation_data=(x_test, y_test),
              callbacks=[monitor], verbose=0, epochs=1000)
    epochs = monitor.stopped_epoch
    epochs_needed.append(epochs)

    # Predict on the out of boot (validation)
    pred = model.predict(x_test)

    # Measure this bootstrap's RMSE
    score = np.sqrt(metrics.mean_squared_error(pred, y_test))
    mean_benchmark.append(score)
    m1 = statistics.mean(mean_benchmark)
    m2 = statistics.mean(epochs_needed)
    mdev = statistics.pstdev(mean_benchmark)

    # Record this iteration
    time_took = time.time() - start_time
    print(f"#{num}: score={score:.6f}, mean score={m1:.6f},"
          f" stdev={mdev:.6f},"
          f" epochs={epochs}, mean epochs={int(m2)},"
          f" time={hms_string(time_took)}")
Output
...
The bootstrapping process for classification is similar, and I present it in the next section.
import pandas as pd
from scipy.stats import zscore

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])  # na_values restored to match the other examples

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product'])  # Classification
products = dummies.columns
y = dummies.values
We now run this data through a number of splits specified by the SPLITS variable. We track the
average error through each of these splits.
Code
import pandas as pd
import os
import numpy as np
import time
import statistics
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import StratifiedShuffleSplit

SPLITS = 50

# Bootstrap
boot = StratifiedShuffleSplit(n_splits=SPLITS, test_size=0.1,
                              random_state=42)

# Track progress
mean_benchmark = []
epochs_needed = []
num = 0

# Loop through samples
for train, test in boot.split(x, df['product']):
    start_time = time.time()
    num += 1

    # Split train and test
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    # Construct neural network
    model = Sequential()
    model.add(Dense(50, input_dim=x.shape[1], activation='relu'))  # Hidden 1
    model.add(Dense(25, activation='relu'))  # Hidden 2
    model.add(Dense(y.shape[1], activation='softmax'))  # Output
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
        patience=25, verbose=0, mode='auto', restore_best_weights=True)

    # Train on the bootstrap sample
    model.fit(x_train, y_train, validation_data=(x_test, y_test),
              callbacks=[monitor], verbose=0, epochs=1000)
    epochs = monitor.stopped_epoch
    epochs_needed.append(epochs)

    # Predict on the out of boot (validation)
    pred = model.predict(x_test)

    # Measure this bootstrap's log loss
    y_compare = np.argmax(y_test, axis=1)  # For log loss calculation
    score = metrics.log_loss(y_compare, pred)
    mean_benchmark.append(score)
    m1 = statistics.mean(mean_benchmark)
    m2 = statistics.mean(epochs_needed)
    mdev = statistics.pstdev(mean_benchmark)

    # Record this iteration
    time_took = time.time() - start_time
    print(f"#{num}: score={score:.6f}, mean score={m1:.6f}," +
          f" stdev={mdev:.6f}, epochs={epochs}, mean epochs={int(m2)}," +
          f" time={hms_string(time_took)}")
Output
...
5.5.3 Benchmarking
Now that we’ve seen how to bootstrap with both classification and regression, we can begin optimizing the hyperparameters for the jh-simple-dataset data. For this example, we will encode the product column for classification. Evaluation will be in log loss.
Code
import pandas as pd
from scipy.stats import zscore

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])

# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product'])  # Classification
products = dummies.columns
y = dummies.values
I performed some optimization, and the code below has the best settings that I could determine. Later in this book, we will see how we can use an automatic process to optimize the hyperparameters.
Code
import pandas as pd
import os
import numpy as np
import time
import tensorflow.keras.initializers
import statistics
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import StratifiedShuffleSplit
from tensorflow.keras.layers import LeakyReLU, PReLU

SPLITS = 100

# Bootstrap
boot = StratifiedShuffleSplit(n_splits=SPLITS, test_size=0.1)

# Track progress
mean_benchmark = []
epochs_needed = []
num = 0

# Loop through samples
for train, test in boot.split(x, df['product']):
    start_time = time.time()
    num += 1

    # Split train and test
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    # Construct neural network
    model = Sequential()
    model.add(Dense(100, input_dim=x.shape[1], activation=PReLU(),
        kernel_regularizer=regularizers.l2(1e-4)))  # Hidden 1
    # The layers between Hidden 1 and the training call did not survive;
    # the output, compile, and early-stopping lines below are restored
    # minimally to match the other bootstrap examples in this part.
    model.add(Dense(y.shape[1], activation='softmax'))  # Output
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
        patience=25, verbose=0, mode='auto', restore_best_weights=True)

    # Train on the bootstrap sample
    model.fit(x_train, y_train, validation_data=(x_test, y_test),
              callbacks=[monitor], verbose=0, epochs=1000)
    epochs = monitor.stopped_epoch
    epochs_needed.append(epochs)

    # Predict on the out of boot (validation)
    pred = model.predict(x_test)

    # Measure this bootstrap's log loss
    y_compare = np.argmax(y_test, axis=1)  # For log loss calculation
    score = metrics.log_loss(y_compare, pred)
    mean_benchmark.append(score)
    m1 = statistics.mean(mean_benchmark)
    m2 = statistics.mean(epochs_needed)
    mdev = statistics.pstdev(mean_benchmark)

    # Record this iteration
    time_took = time.time() - start_time
    print(f"#{num}: score={score:.6f}, mean score={m1:.6f},"
          f" stdev={mdev:.6f}, epochs={epochs},"
          f" mean epochs={int(m2)}, time={hms_string(time_took)}")
Output
...
Figure 5.3: L1 vs L2
%matplotlib inline
print(np.asarray(img))
img
Output
[[[199 213 2 4 0 ]
[200 214 2 4 0 ]
[200 214 2 4 0 ]
...
[ 86 34 9 6 ]
[ 48 4 57]
[ 57 21 6 5 ] ]
[[199 213 2 3 9 ]
[200 214 2 4 0 ]
[200 214 2 4 0 ]
...
[215 215 251]
[252 242 255]
[237 218 250]]
[[200 214 240]
...
[131 98 91]
...
[ 86 82 57]
[ 89 85 60]
6.1. PART 6.1: IMAGE PROCESSING IN PYTHON 197
[ 89 85 60]]]
w, h = 64, 64
data = np.zeros((h, w, 3), dtype=np.uint8)

# Yellow
for row in range(32):
    for col in range(32):
        data[row, col] = [255, 255, 0]
# Red
for row in range(32):
    for col in range(32):
        data[row+32, col] = [255, 0, 0]
# Green
for row in range(32):
    for col in range(32):
        data[row+32, col+32] = [0, 255, 0]
# Blue
for row in range(32):
    for col in range(32):
        data[row, col+32] = [0, 0, 255]
Output
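The output is the 64x64 four-quadrant color image. To render the raw array as an image yourself, a minimal sketch (assuming PIL, as used elsewhere in this chapter) is:

from PIL import Image

img = Image.fromarray(data)  # interpret the (h, w, 3) uint8 array as RGB
img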
%matplotlib inline
img_array = np.asarray(img)
rows = img_array.shape[0]
cols = img_array.shape[1]
# The print producing the output below is assumed
print(f"Rows: {rows}, Cols: {cols}")
Output
Rows: 768, Cols: 1024
%matplotlib inline
from PIL import Image, ImageFile
from matplotlib.pyplot import imshow
import requests
import numpy as np
from io import BytesIO
from IPython.display import display, HTML

images = [
    "https://data.heatonresearch.com/images/jupyter/brookings.jpeg",
    "https://data.heatonresearch.com/images/jupyter/SeigleHall.jpeg",
    "https://data.heatonresearch.com/images/jupyter/WUSTLKnight.jpeg"
]

# The def line and first statement are assumed; the helper is named
# crop_square because that is how it is called below.
def crop_square(image):
    width, height = image.size
    # Crop the image, centered
    new_width = min(width, height)
    new_height = new_width
    left = (width - new_width) / 2
    top = (height - new_height) / 2
    right = (width + new_width) / 2
    bottom = (height + new_height) / 2
    return image.crop((left, top, right, bottom))
x = []
for url in images:
    ImageFile.LOAD_TRUNCATED_IMAGES = False
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    img = Image.open(BytesIO(response.content))
    img.load()
    img = crop_square(img)
    img = img.resize((128, 128), Image.ANTIALIAS)
    print(url)
    display(img)
    img_array = np.asarray(img)
    img_array = img_array.flatten()
    img_array = img_array.astype(np.float32)
    img_array = (img_array - 128) / 128
    x.append(img_array)

x = np.array(x)
print(x.shape)
Output
(3, 49152)
%matplotlib inline
def add_noise(a):
    a2 = a.copy()
    rows = a2.shape[0]
    cols = a2.shape[1]
    s = int(min(rows, cols) / 20)  # size of spot is 1/20 of smallest dimension
    for i in range(100):
        x = np.random.randint(cols - s)
        y = np.random.randint(rows - s)
        a2[y:(y + s), x:(x + s)] = 0
    return a2

img_array = np.asarray(img)
rows = img_array.shape[0]
cols = img_array.shape[1]
Output
Rows: 768, Cols: 1024
(768, 1024, 3)
import os

if COLAB:
    PATH = "/content"
    EXTRACT_TARGET = os.path.join(PATH, "clips")
    SOURCE = os.path.join(PATH, "/content/clips/paperclips")
    TARGET = os.path.join(PATH, "/content/clips-processed")
else:
    # I used this locally on my machine, you may need different
    PATH = "/Users/jeff/temp"
Next, we download the images. This part depends on the origin of your images. The following code
downloads images from a URL, where a ZIP file contains the images. The code unzips the ZIP file.
Code
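This cell's contents depend on your environment; a minimal sketch using the constants defined above and the unzip command used again in Part 6.2 (DOWNLOAD_SOURCE and DOWNLOAD_NAME are assumptions, defined as in Part 6.3) would be:

import os
import urllib.request

urllib.request.urlretrieve(DOWNLOAD_SOURCE, os.path.join(PATH, DOWNLOAD_NAME))
!mkdir -p {EXTRACT_TARGET}
!unzip -o -j -d {SOURCE} {os.path.join(PATH, DOWNLOAD_NAME)} >/dev/null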
The following code contains functions that we use to preprocess the images. The crop_square function
converts images to a square by cropping extra data. The scale function increases or decreases the size of
an image. The standardize function ensures an image is full color; a mix of color and grayscale images
can be problematic.
Code
import imageio
import glob
from tqdm import tqdm
from PIL import Image
import os

# Only the final "return img" of the scale function survives here; this
# minimal reconstruction matches its use as scale(img, 128, 128) below.
def scale(img, scale_width, scale_height):
    # Scale the image
    img = img.resize((scale_width, scale_height), Image.ANTIALIAS)
    return img

def standardize(image):
    rgbimg = Image.new("RGB", image.size)
    rgbimg.paste(image)
    return rgbimg
Next, we loop through each image. The images are loaded, and you can apply any desired transformations. Ultimately, the script saves the images as JPG.
Code
# The definition of "files" is assumed; it globs the JPGs in SOURCE.
files = glob.glob(os.path.join(SOURCE, "*.jpg"))

for file in tqdm(files):
    try:
        target = ""
        name = os.path.basename(file)
        filename, _ = os.path.splitext(name)
        img = Image.open(file)
        img = standardize(img)
        img = crop_square(img)
        img = scale(img, 128, 128)
        # fail_below(img, 128, 128)
        # The save call is assumed; the text says the script saves as JPG.
        img.save(os.path.join(TARGET, name), "JPEG")
    except Exception:
        print(f"Unable to process: {file}")
Now we can zip the preprocessed files and store them somewhere.
6.2 Part 6.2: Keras Neural Networks for Digits and Fashion MNIST
This module will focus on computer vision. There are some important differences and similarities with previous neural networks.
The Fashion-MNIST dataset contains images of clothing divided into ten classes. Fashion-MNIST is a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. You can see this data in Figure 6.2.
The CIFAR-10 and CIFAR-100 datasets are also frequently used by the neural network research community. The CIFAR-10 dataset contains low-resolution images divided into 10 classes. The CIFAR-100 dataset contains 100 classes in a hierarchy.
Although computer vision primarily uses CNNs, this technology has some applications outside of the
field. You need to realize that if you want to utilize CNNs on non-visual data, you must find a way to
encode your data to mimic the properties of visual data.
The order of the input array elements is crucial to the training. In contrast, most neural networks
that are not CNNs treat their input data as a long vector of values, and the order in which you arrange
the incoming features in this vector is irrelevant. You cannot change the order for these types of neural
networks after you have trained the network.
The CNN network arranges the inputs into a grid. This arrangement works well with images because pixels in close proximity to each other are meaningful to each other; the order of pixels in an image is significant. The human face is a relevant example of this type of order: we are accustomed to eyes being near to each other.
This advance in CNNs is due to years of research on biological eyes. In other words, CNNs utilize
overlapping fields of input to simulate features of biological eyes. Until this breakthrough, AI had been
unable to reproduce the capabilities of biological vision.
Scale, rotation, and noise have presented challenges for AI computer vision research. You can observe the
complexity of biological eyes in the example that follows. A friend raises a sheet of paper with a large
number written on it. As your friend moves nearer to you, the number is still identifiable. In the same way,
you can still identify the number when your friend rotates the paper. Lastly, your friend creates noise by
drawing lines on the page, but you can still identify the number. As you can see, these examples demonstrate
the high function of the biological eye and allow you to understand better the research breakthrough of
CNNs. That is, this neural network can process scale, rotation, and noise in the field of computer vision.
So far, we have only seen one layer type (dense layers). This module introduces the convolutional layer, which has several hyperparameters:
• Number of filters
• Filter Size
• Stride
• Padding
• Activation Function/Non-Linearity
The primary purpose of a convolutional layer is to detect features such as edges, lines, blobs of color, and
other visual elements. The filters can detect these features. The more filters we give to a convolutional
layer, the more features it can see.
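In Keras, these hyperparameters map directly onto arguments of the Conv2D layer. The following minimal sketch (the values are illustrative, not a recommendation) names each one:

from tensorflow.keras.layers import Conv2D

layer = Conv2D(
    filters=32,           # number of filters
    kernel_size=(3, 3),   # filter size
    strides=(1, 1),       # stride
    padding='same',       # padding that preserves width and height
    activation='relu')    # activation function/non-linearity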
A filter is a square-shaped object that scans over the image. A grid can represent the individual pixels of an image. You can think of the convolutional layer as a smaller grid that sweeps left to right over each image row. There is also a hyperparameter that specifies both the width and height of the square-shaped filter. The following figure shows this configuration in which you see the six convolutional filters sweeping over the image grid:
A convolutional layer has weights between it and the previous layer or image grid. Each pixel on each convolutional filter is a weight. Therefore, the number of weights between a convolutional layer and its predecessor layer or image field is the following:
[Filter Size] * [Filter Size] * [# of Filters]
For example, if the filter size were 5 (5x5) for 10 filters, there would be 250 weights.
You need to understand how the convolutional filters sweep across the previous layer's output or image grid. Figure 6.5 illustrates the sweep. The figure shows a convolutional filter with a size of 4 and a padding size of 1. The padding size is responsible for the border of zeros in the area that the filter sweeps. Even though the image is 8x7, the extra padding provides a virtual image size of 9x8 for the filter to sweep across. The stride specifies the number of positions at which the convolutional filters will stop. The convolutional filters move to the right, advancing by the number of cells specified in the stride. Once you reach the far right, the convolutional filter moves back to the far left; then, it moves down by the stride amount and continues to the right again.
Some constraints exist concerning the size of the stride. The stride cannot be 0; the convolutional filter would never move if you set the stride to 0. Furthermore, neither the stride nor the convolutional filter size can be larger than the previous grid. There are additional constraints among the stride (s), padding (p), and the filter width (f) for an image of width (w). Specifically, the convolutional filter must be able to start at the far left or top border, move a certain number of strides, and land on the far right or bottom border. The following equation shows the number of steps a convolutional operator must take to cross the image:
$$\text{steps} = \frac{w - f + 2p}{s} + 1$$
The number of steps must be an integer; in other words, it cannot have decimal places. The padding (p) can be adjusted to make this equation produce an integer value.
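A small helper (a sketch for illustration, not from the book) makes it easy to check whether a given combination of width, filter size, padding, and stride satisfies this integer constraint:

def conv_steps(w, f, p, s):
    # Number of positions the filter visits across a width-w input
    return (w - f + 2 * p) / s + 1

print(conv_steps(8, 4, 1, 2))  # 4.0 -> valid configuration
print(conv_steps(8, 4, 1, 5))  # 2.2 -> invalid; adjust the padding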
A max-pool layer has two hyperparameters:
• Spatial Extent (f)
• Stride (s)
Unlike convolutional layers, max-pool layers do not use padding. Additionally, max-pool layers have no
weights, so training does not affect them. These layers downsample their 3D box input. The 3D box output
by a max-pool layer will have a width equal to this equation:
$$w_2 = \frac{w_1 - f}{s} + 1$$
The height of the 3D box produced by the max-pool layer is calculated similarly with this equation:
$$h_2 = \frac{h_1 - f}{s} + 1$$
The depth of the 3D box produced by the max-pool layer is equal to the depth of the 3D box it received as input. The most common setting for the hyperparameters of a max-pool layer is f=2 and s=2. The spatial extent (f) specifies that boxes of 2x2 will be scaled down to single pixels. Of these four pixels, the pixel with the maximum value will represent the 2x2 pixel in the new grid. Because squares of four pixels are replaced with a single pixel, 75% of the pixel information is lost. The following figure shows this transformation as a 6x6 grid becomes a 3x3:
Of course, the above diagram shows each pixel as a single number. A grayscale image would have this
characteristic. We usually take the average of the three numbers for an RGB image to determine which
pixel has the maximum value.
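A minimal NumPy sketch (not from the book) reproduces this f=2, s=2 pooling on a 6x6 grayscale grid:

import numpy as np

g = np.arange(36).reshape(6, 6)                  # a 6x6 grid of pixel values
pooled = g.reshape(3, 2, 3, 2).max(axis=(1, 3))  # max over each 2x2 box
print(pooled.shape)                              # (3, 3)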
For classification, the label specifies what the image is a picture of. For regression, this "label" is some numeric quantity the image should produce, such as a count. We will look at two different means of providing this label.
The first example will show how to handle regression with convolutional neural networks. We will provide
an image and expect the neural network to count items in that image. We will use a dataset that I created
that contains a random number of paperclips. The following code will download this dataset for you.
Code
import o s
i f COLAB:
PATH = " / c o n t e n t "
else :
# I used t h i s l o c a l l y on my machine , you may need d i f f e r e n t
PATH = " / U s e r s / j e f f /temp "
Next, we download the images. This part depends on the origin of your images. The following code
downloads images from a URL, where a ZIP file contains the images. The code unzips the ZIP file.
Code
!mkdir -p {EXTRACT_TARGET}
!unzip -o -j -d {SOURCE} {os.path.join(PATH, DOWNLOAD_NAME)} >/dev/null
The labels are contained in a CSV file named train.csv for regression. This file has just two labels, id and clip_count. The ID specifies the filename; for example, row id 1 corresponds to the file clips-1.jpg. The following code loads the labels for the training set and creates a new column, named filename, that contains the filename of each image, based on the id column.
Code
import pandas as pd

df = pd.read_csv(
    os.path.join(SOURCE, "train.csv"),
    na_values=['NA', '?'])
# This line is assumed; it creates the filename column described above,
# matching the id-to-filename mapping shown in the output below.
df['filename'] = "clips-" + df["id"].astype(str) + ".jpg"
Code
df
Output
id clip_count filename
0 30001 11 clips-30001.jpg
1 30002 2 clips-30002.jpg
2 30003 26 clips-30003.jpg
3 30004 41 clips-30004.jpg
4 30005 49 clips-30005.jpg
... ... ... ...
19995 49996 35 clips-49996.jpg
19996 49997 54 clips-49997.jpg
19997 49998 72 clips-49998.jpg
19998 49999 24 clips-49999.jpg
19999 50000 35 clips-50000.jpg
Code
TRAIN_PCT = 0.9
TRAIN_CUT = int(len(df) * TRAIN_PCT)

df_train = df[0:TRAIN_CUT]
df_validate = df[TRAIN_CUT:]

# The print statements producing the output below are assumed
print(f"Training size: {len(df_train)}")
print(f"Validate size: {len(df_validate)}")
Output
Training size: 18000
Validate size: 2000
We are now ready to create two ImageDataGenerator objects. We use a generator, which creates additional training data by manipulating the source material. This technique can produce considerably stronger neural networks. The generator below flips the images both vertically and horizontally. Keras will train the neural network on both the original images and the flipped images. This augmentation increases the size of the training data considerably. Module 6.4 goes deeper into the transformations you can perform. You can also specify a target size to resize the images automatically.
The function flow_from_dataframe loads the labels from a Pandas dataframe connected to our train.csv file. When we demonstrate classification, we will use flow_from_directory, which loads the labels from the directory structure rather than a CSV.
Code
import tensorflow as tf
import keras_preprocessing
from keras_preprocessing import image
from keras_preprocessing.image import ImageDataGenerator

training_datagen = ImageDataGenerator(
    rescale=1./255,
    horizontal_flip=True,
    vertical_flip=True,
    fill_mode='nearest')

train_generator = training_datagen.flow_from_dataframe(
    dataframe=df_train,
    directory=SOURCE,
    x_col="filename",
    y_col="clip_count",
    target_size=(256, 256),
    batch_size=32,
    class_mode='other')

validation_datagen = ImageDataGenerator(rescale=1./255)

val_generator = validation_datagen.flow_from_dataframe(
    dataframe=df_validate,
    directory=SOURCE,
    x_col="filename",
    y_col="clip_count",
    target_size=(256, 256),
    class_mode='other')
Output
We can now train the neural network. The code to build and train the neural network is not that different from the previous modules. We will use the Keras Sequential class to provide layers to the neural network. We now have several new layer types that we did not previously see.
• Conv2D - The convolution layers.
• MaxPooling2D - The max-pooling layers.
• Flatten - Flattens the 2D (and higher) tensors to allow a Dense layer to process.
• Dense - Dense layers, the same as demonstrated previously. Dense layers often form the final output layers of the neural network.
The training code is very similar to that of previous modules. This code is for regression, so a final linear activation is used, along with mean_squared_error for the loss function. The generator provides both the x and y matrices we previously supplied.
Code
from tensorflow.keras.callbacks import EarlyStopping
import time

model = tf.keras.models.Sequential([
    # Note the input shape is the desired size of the image 256x256
    # with 3 bytes color.
    # This is the first convolution
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu',
                           input_shape=(256, 256, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear')
])

model.summary()

epoch_steps = 250  # needed for 2.2
validation_steps = len(df_validate)
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
                        patience=5, verbose=1, mode='auto',
                        restore_best_weights=True)

start_time = time.time()
history = model.fit(train_generator,
                    verbose=1,
                    validation_data=val_generator, callbacks=[monitor], epochs=25)
elapsed_time = time.time() - start_time
print("Elapsed time: {}".format(hms_string(elapsed_time)))
Output
...
This code will run very slowly if you do not use a GPU. The above code takes approximately 13 minutes
with a GPU.
df_test = pd.read_csv(
    os.path.join(SOURCE, "test.csv"),
    na_values=['NA', '?'])

test_datagen = ImageDataGenerator(rescale=1./255)

test_generator = test_datagen.flow_from_dataframe(
    dataframe=df_test,
    directory=SOURCE,
    x_col="filename",
    batch_size=1,
    shuffle=False,
    target_size=(256, 256),
    class_mode=None)
Output
test_generator.reset()
pred = model.predict(test_generator, steps=len(df_test))
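The flattened predictions can then be joined back to the test IDs, for example to produce a Kaggle-style submission file; a minimal sketch (the column names follow the train.csv schema described earlier, and the output filename is illustrative):

df_submit = pd.DataFrame({
    'id': df_test['id'],
    'clip_count': pred.flatten()
})
df_submit.to_csv("submit.csv", index=False)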
• iris-setosa
• iris-versicolour
• iris-virginica
Code
import os

if COLAB:
    PATH = "/content"
    EXTRACT_TARGET = os.path.join(PATH, "iris")

!ls /content/iris
Output
iris-setosa  iris-versicolour  iris-virginica
We set up the generator, similar to before. This time we use flow_from_directory to get the labels
from the directory structure.
Code
import tensorflow as tf
import keras_preprocessing
from keras_preprocessing import image
from keras_preprocessing.image import ImageDataGenerator

training_datagen = ImageDataGenerator(
    rescale=1./255,
    horizontal_flip=True,
    vertical_flip=True,
    width_shift_range=[-200, 200],
    rotation_range=360,
    fill_mode='nearest')

validation_datagen = ImageDataGenerator(rescale=1./255)
Output
from tensorflow.keras.callbacks import EarlyStopping

class_count = len(train_generator.class_indices)

model = tf.keras.models.Sequential([
    # Note the input shape is the desired size of the image
    # 256x256 with 3 bytes color
    # This is the first convolution
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu',
                           input_shape=(256, 256, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The third convolution
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The fourth convolution; the layers from here to the output did not
    # survive, so this is a minimal completion that flattens and
    # classifies into class_count classes.
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(class_count, activation='softmax')
])

model.summary()
Output
...
_________________________________________________________________
...
10/10 [==============================] - 5s 458ms/step - loss: 0.7957
Epoch 50/50
10/10 [==============================] - 5s 501ms/step - loss: 0.8670
The iris image dataset is not easy to predict; it turns out that a tabular dataset of measurements is more manageable. However, we can achieve an accuracy of about 63%.
Code
from sklearn.metrics import accuracy_score
import numpy as np

validation_generator.reset()
pred = model.predict(validation_generator)

# The lines deriving these two arrays are assumed; they follow the usual
# pattern of taking the argmax class and the generator's true classes.
predict_classes = np.argmax(pred, axis=1)
expected_classes = validation_generator.classes

correct = accuracy_score(expected_classes, predict_classes)
print(f"Accuracy: {correct}")
Output
Accuracy: 0.6389548693586699
import pandas as pd
import numpy as np
import os
import tensorflow.keras
import matplotlib.pyplot as plt
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.mobilenet import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
We begin by downloading the weights for a MobileNet trained on the ImageNet dataset. The weights will take some time to download the first time you run the network.
Code
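The body of this cell is a single line; a minimal version consistent with the prose above and the 224x224 summary below (a sketch, assuming the MobileNet import shown earlier) is:

model = MobileNet(weights='imagenet', include_top=True)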
The loaded network is a Keras neural network. However, this is a neural network that a third party
engineered on advanced hardware. Merely looking at the structure of an advanced state-of-the-art neural
network can be educational.
Code
model.summary()
Output
=================================================================
input_1 (InputLayer)              [(None, 224, 224, 3)]    0
conv1 (Conv2D)                    (None, 112, 112, 32)     864
conv1_bn (BatchNormalization)     (None, 112, 112, 32)     128
conv1_relu (ReLU)                 (None, 112, 112, 32)     0
conv_dw_1 (DepthwiseConv2D)       (None, 112, 112, 32)     288
conv_dw_1_bn (BatchNormalization) (None, 112, 112, 32)     128
conv_dw_1_relu (ReLU)             (None, 112, 112, 32)     0
conv_pw_1 (Conv2D)                (None, 112, 112, 64)     2048
conv_pw_1_bn (BatchNormalization) (None, 112, 112, 64)     256
...
=================================================================
Total params: 4,253,864
Trainable params: 4,231,976
Non-trainable params: 21,888
_________________________________________________________________
Several clues to neural network architecture become evident when examining the above structure.
We will now use the MobileNet to classify several image URLs below. You can add additional URLs of
your own to see how well the MobileNet can classify.
Code
%matplotlib inline
from PIL import Image, ImageFile
from matplotlib.pyplot import imshow
import requests
import numpy as np
from io import BytesIO
from IPython.display import display, HTML
from tensorflow.keras.applications.mobilenet import decode_predictions

IMAGE_WIDTH = 224
IMAGE_HEIGHT = 224
IMAGE_CHANNELS = 3

# The def line of this helper did not survive; "make_square" is an
# assumed name for the centered-crop helper whose body appears below.
def make_square(img):
    cols, rows = img.size
    if rows > cols:
        pad = (rows - cols) / 2
        img = img.crop((pad, 0, cols, cols))
    else:
        pad = (cols - rows) / 2
        img = img.crop((0, pad, rows, rows))
    return img

def classify_image(url):
    x = []
    ImageFile.LOAD_TRUNCATED_IMAGES = False
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    img.load()
    img = img.resize((IMAGE_WIDTH, IMAGE_HEIGHT), Image.ANTIALIAS)
    display(img)
    # The preprocessing and prediction lines are assumed; this minimal
    # reconstruction uses the MobileNet helpers imported above.
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    pred = model.predict(x)
    print(np.argmax(pred, axis=1))
    # Show the top predictions, matching the output shown below
    for p in decode_predictions(pred, top=5)[0]:
        print(p)
We can now classify an example image. You can specify the URL of any image you wish to classify.
Code
classify_image(ROOT + "soccer_ball.jpg")
Output
[805]
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/imagenet_class_index.json
40960/35363 [==================================] - 0s 0us/step
49152/35363 [=========================================] - 0s 0us/step
('n04254680', 'soccer_ball', 0.9999938)
('n03530642', 'honeycomb', 3.862412e-06)
('n03255030', 'dumbbell', 4.442458e-07)
('n02782093', 'balloon', 3.7038987e-07)
('n04548280', 'wall_clock', 3.143911e-07)
Code
classify_image(ROOT + "race_truck.jpg")
Output
import os

URL = "https://github.com/jeffheaton/data-mirror/"
DOWNLOAD_SOURCE = URL + "releases/download/v1/paperclips.zip"
DOWNLOAD_NAME = DOWNLOAD_SOURCE[DOWNLOAD_SOURCE.rfind('/')+1:]

if COLAB:
    PATH = "/content"
else:
    # I used this locally on my machine, you may need different
    PATH = "/Users/jeff/temp"
Output
[751]
('n04037443', 'racer', 0.7131951)
('n03100240', 'convertible', 0.100896776)
('n04285008', 'sports_car', 0.0770768)
('n03930630', 'pickup', 0.02635305)
('n02704792', 'amphibian', 0.011636169)
Next, we download the images. This part depends on the origin of your images. The following code
downloads images from a URL, where a ZIP file contains the images. The code unzips the ZIP file.
Code
The labels are contained in a CSV file named train.csv for the regression. This file has just two
labels, id and clip_count. The ID specifies the filename; for example, row id 1 corresponds to the file
clips-1.jpg. The following code loads the labels for the training set and creates a new column, named
filename, that contains the filename of each image, based on the id column.
Code
We want to use early stopping. To do this, we need a validation set. We will break the data into 90 percent training data and 10 percent validation, matching the TRAIN_PCT setting below. Do not confuse this validation data with the test set provided by Kaggle. This validation set is unique to your program and is for early stopping.
Code
TRAIN_PCT = 0.9
TRAIN_CUT = int(len(df_train) * TRAIN_PCT)

df_train_cut = df_train[0:TRAIN_CUT]
df_validate_cut = df_train[TRAIN_CUT:]

# The print statements producing the output below are assumed
print(f"Training size: {len(df_train_cut)}")
print(f"Validate size: {len(df_validate_cut)}")
Output
Training size: 18000
Validate size: 2000
Next, we create the generators that will provide the images to the neural network during training. We
normalize the images so that the RGB colors between 0-255 become ratios between 0 and 1. We also use
the flow_from_dataframe generator to connect the Pandas dataframe to the actual image files. We
see here a straightforward implementation; you might also wish to use some of the image transformations
provided by the data generator.
The HEIGHT and WIDTH constants specify the dimensions to which the image will be scaled (or
expanded). It is probably not a good idea to expand the images.
Code
import tensorflow as tf
import keras_preprocessing
from keras_preprocessing import image
from keras_preprocessing.image import ImageDataGenerator

WIDTH = 256
HEIGHT = 256

training_datagen = ImageDataGenerator(
    rescale=1./255,
    horizontal_flip=True,
    # vertical_flip=True,
    fill_mode='nearest')

train_generator = training_datagen.flow_from_dataframe(
    dataframe=df_train_cut,
    directory=SOURCE,
    x_col="filename",
    y_col="clip_count",
    target_size=(HEIGHT, WIDTH),
    # Keeping the training batch size small
    # USUALLY increases performance
    batch_size=32,
    class_mode='raw')

validation_datagen = ImageDataGenerator(rescale=1./255)

val_generator = validation_datagen.flow_from_dataframe(
    dataframe=df_validate_cut,
    directory=SOURCE,
    x_col="filename",
    y_col="clip_count",
    target_size=(HEIGHT, WIDTH),
    # Make the validation batch size as large as you
    # have memory for
    batch_size=256,
    class_mode='raw')
Output
We will now use a ResNet neural network as a basis for our neural network. We will redefine both the input shape and output of the ResNet model, so we will not transfer the weights. Since we redefine the input, the weights are of minimal value. We begin by loading the ResNet50 network from Keras. We specify include_top as False because we will change the input resolution. We also specify weights as None because we must retrain the network after changing the top input layers.
Code
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.layers import Input
# Assumed lines: the import and input_tensor, matching HEIGHT/WIDTH above
input_tensor = Input(shape=(HEIGHT, WIDTH, 3))
base_model = ResNet50(
    include_top=False, weights=None, input_tensor=input_tensor,
    input_shape=None)
Now we must add a few layers to the end of the neural network so that it becomes a regression model.
Code
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
x = Dense(1024, activation='relu')(x)
model = Model(inputs=base_model.input, outputs=Dense(1)(x))
We train like before; the only difference is that we do not define the entire neural network here.
Code
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import RootMeanSquaredError

# Important, calculate a valid step size for the validation dataset
STEP_SIZE_VALID = val_generator.n // val_generator.batch_size
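This cell continues with compiling and fitting the model; a minimal sketch consistent with the RMSE metric and 100-epoch output below (the early-stopping settings are assumptions mirroring the regression example in Part 6.2) is:

model.compile(loss='mean_squared_error', optimizer='adam',
              metrics=[RootMeanSquaredError(name="rmse")])
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5,
                        verbose=1, mode='auto', restore_best_weights=True)
history = model.fit(train_generator,
                    validation_data=val_generator,
                    validation_steps=STEP_SIZE_VALID,
                    callbacks=[monitor], verbose=1, epochs=100)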
Output
...
250/250 [==============================] - 61s 243ms/step - loss:
1.9211 - rmse: 1.3860 - val_loss: 17.0489 - val_rmse: 4.1290
Epoch 72/100
250/250 [==============================] - 61s 243ms/step - loss:
2.3726 - rmse: 1.5403 - val_loss: 167.8536 - val_rmse: 12.9558
import urllib.request
import shutil
from IPython.display import Image

with urllib.request.urlopen(URL) as response, \
        open(LOCAL_IMG_FILE, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

Image(filename=LOCAL_IMG_FILE)
Output
Next, we introduce a simple utility function to visualize four images sampled from any generator.
Code
# These imports are assumed; they supply the helpers used below
import numpy as np
import matplotlib.pyplot as plt
from numpy import expand_dims
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def visualize_generator(img_file, gen):
    # Load the requested image
    img = load_img(img_file)
    data = img_to_array(img)
    samples = expand_dims(data, 0)

    # Generate augmentations from the generator
    it = gen.flow(samples, batch_size=1)
    images = []
    for i in range(4):
        batch = it.next()
        image = batch[0].astype('uint8')
        images.append(image)

    images = np.array(images)

    # Create a grid of 4 images from the generator; the grid assembly is
    # reconstructed minimally as a 2x2 tiling of the augmented images.
    index, height, width, channels = images.shape
    nrows = index // 2
    grid = (images.reshape(nrows, 2, height, width, channels)
                  .swapaxes(1, 2)
                  .reshape(height * nrows, width * 2, channels))

    fig = plt.figure(figsize=(15., 15.))
    plt.axis('off')
    plt.imshow(grid)
We begin by flipping the image. Some images may not make sense to flip, such as this landscape.
However, if you expect "noise" in your data where some images may be flipped, then this augmentation
may be useful, even if it violates physical reality.
Code
visualize_generator(
    LOCAL_IMG_FILE,
    ImageDataGenerator(horizontal_flip=True, vertical_flip=True))
Output
Next, we will try moving the image. Notice how part of the image is missing? There are various ways
to fill in the missing data, as controlled by fill_mode. In this case, we simply use the nearest pixel to fill.
It is also possible to rotate images.
Code
visualize_generator(
    LOCAL_IMG_FILE,
    ImageDataGenerator(width_shift_range=[-200, 200],
                       fill_mode='nearest'))
Output
Code
visualize_generator(
    LOCAL_IMG_FILE,
    ImageDataGenerator(brightness_range=[0, 1]))
Output
Shearing may not be appropriate for all image types; it stretches the image.
Code
visualize_generator(
    LOCAL_IMG_FILE,
    ImageDataGenerator(shear_range=30))
Output
Code
visualize_generator(
    LOCAL_IMG_FILE,
    ImageDataGenerator(rotation_range=30))
Output
Code
import urllib.request
import shutil
from IPython.display import Image

!mkdir /content/images/

with urllib.request.urlopen(URL) as response, \
        open(LOCAL_IMG_FILE, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

Image(filename=LOCAL_IMG_FILE)
Output
!git clone https://github.com/ultralytics/yolov5 --tag 6.1
!mv /content/6.1 /content/yolov5
%cd /content/yolov5
%pip install -qr requirements.txt

from yolov5 import utils
display = utils.notebook_init()
Output
Next, we will run YOLO from the command line and classify the previously downloaded kitchen picture.
You can run this classification on any image you choose.
Code
Output
Downloading https://ultralytics.com/assets/Arial.ttf to
/root/.config/Ultralytics/Arial.ttf...
detect: weights=['yolov5s.pt'], source=/content/images/,
data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.25,
iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False,
save_conf=False, save_crop=False, nosave=False, classes=None,
agnostic_nms=False, augment=False, visualize=False, update=False,
project=runs/detect, name=exp, exist_ok=False, line_thickness=3,
hide_labels=False, hide_conf=False, half=False, dnn=False
YOLOv5 v6.1-85-g6f4eb95 torch 1.10.0+cu111 CUDA:0 (A100-SXM4-40GB,
40536MiB)
Downloading https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt to yolov5s.pt...
100% 14.1M/14.1M [00:00<00:00, 135MB/s]
Fusing layers...
...
import sys
sys.path.append(str("/content/yolov5"))
from yolov5 import utils
display = utils.notebook_init()
Output
Next, we obtain an image to classify. For this example, the program loads the image from a URL.
YOLOv5 expects that the image is in the format of a Numpy array. We use PIL to obtain this image. We
will convert it to the proper format for PyTorch and YOLOv5 later.
Code
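The download cell is hidden here; a minimal sketch, assuming URL was defined earlier:

import requests
import numpy as np
from PIL import Image
from io import BytesIO

response = requests.get(URL)
img = Image.open(BytesIO(response.content))
img_raw = np.array(img)  # raw Numpy copy, used later to rescale boxes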
Code
import argparse
import os
import sys
from pathlib import Path

import cv2
import torch
import torch.backends.cudnn as cudnn
We are now ready to load YOLO with pretrained weights provided by the creators of YOLO. It is also
possible to train YOLO to recognize images of your own.
Code
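The loading cell is hidden here. One way to load the pretrained weights, sketched with the repository's DetectMultiBackend helper (the exact call used is an assumption):

from models.common import DetectMultiBackend
from utils.torch_utils import select_device

device = select_device('')  # '' selects CUDA when available
model = DetectMultiBackend('yolov5s.pt', device=device)
names = model.names  # class index to label mapping, used later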
Output
The creators of YOLOv5 built upon PyTorch, which has a particular format for images: PyTorch represents each image as a channels-first tensor, as the shape printed below shows.
import numpy as np

source = '/content/images/'
conf_thres = 0.25  # confidence threshold
iou_thres = 0.45  # NMS IOU threshold
classes = None
agnostic_nms = False  # class-agnostic NMS
max_det = 1000
# https://stackoverflow.com/questions/50657449/
# convert-image-to-proper-dimension-pytorch
img2 = img.resize([imgsz[1], imgsz[0]], Image.ANTIALIAS)
Output
torch.Size([1, 3, 320, 256])
With the image converted, we are now ready to present the image to YOLO and obtain predictions.
Code
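The prediction cell is hidden here. A minimal sketch of the step, assuming the repository's non_max_suppression helper and the thresholds defined above:

from utils.general import non_max_suppression

# Convert the resized PIL image to a normalized, channels-first tensor
img_tensor = torch.from_numpy(np.asarray(img2)).float() / 255.0
img_tensor = img_tensor.permute(2, 0, 1).unsqueeze(0).to(device)

pred = model(img_tensor)  # raw predictions
pred = non_max_suppression(pred, conf_thres, iou_thres,
                           classes=classes, max_det=max_det)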
We now convert these raw predictions into the bounding boxes, labels, and confidences for each of the
images that YOLO recognized.
Code
results = []
for i , d e t in enumerate ( pred ) : # p e r image
gn = t o r c h . t e n s o r ( img_raw . shape ) [ [ 1 , 0 , 1 , 0 ] ]
i f len ( d e t ) :
# R e s c a l e b o x e s from img_size t o im0 s i z e
d e t [ : , : 4 ] = s c a l e _ c o o r d s ( o r i g i n a l _ s i z e , d e t [ : , : 4 ] , imgsz ) . round ( )
# Write r e s u l t s
f o r ∗xyxy , conf , c l s in reversed ( d e t ) :
xywh = ( xyxy2xywh ( t o r c h . t e n s o r ( xyxy ) . view ( 1 , 4 ) ) / \
gn ) . view ( −1). t o l i s t ( )
# Choose b e t w e e n x y x y and xywh as your d e s i r e d f or m a t .
r e s u l t s . append ( [ names [ int ( c l s ) ] , f l o a t ( c o n f ) , [ ∗ xyxy ] ] )
We can now see the results from the classification. We will display the first 3.
Code
for itm in results[0:3]:
    print(itm)
Output
It is important to note that the yolo class instantiated here is a callable object, which can fill the role of both an object and a function. Acting as a function, yolo returns three arrays named boxes, scores, and classes that are of the same length. The function returns all sub-images found with a score above the minimum threshold. Additionally, the yolo function returns an array named nums. The first element of the nums array specifies how many sub-images YOLO found to be above the score threshold.
• boxes - The bounding boxes for each sub-image detected in the image sent to YOLO.
Your program should use these values to perform whatever actions you wish based on the input image. The following code displays the images detected above the threshold.
To demonstrate the correctness of the results obtained, we draw bounding boxes over the original image.
Code
for itm in results:
    b = itm[2]
    print(b)
    draw.rectangle(b)

img3
Output
Chapter 7

Generative Adversarial Networks

7.1 Part 7.1: Introduction to GANs for Image and Data Generation
A generative adversarial network (GAN) is a class of machine learning systems invented by Ian Goodfellow in 2014.[10] Two neural networks compete with each other in a game. The GAN training algorithm starts with a training set and learns to generate new data with the same distribution as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics.
This chapter makes use of the PyTorch framework rather than Keras/TensorFlow. While there are versions of StyleGAN2-ADA that work with TensorFlow 1.0, NVIDIA has switched to PyTorch for StyleGAN. Running this notebook in Google CoLab is the most straightforward means of completing this chapter, and I designed it to run there. It will take some modifications if you wish to run it locally.
The original GAN paper used neural networks to automatically generate images for several previously seen datasets: MNIST and CIFAR. However, it also included the Toronto Face Dataset (a private dataset used by some researchers). You can see some of these images in Figure 7.1.
Only sub-figure D made use of convolutional neural networks. Figures A-C make use of fully con-
nected neural networks. As we will see in this module, the researchers significantly increased the role of
convolutional neural networks for GANs.
We call a GAN a generative model because it generates new data. You can see the overall process in
Figure 7.2.
This module focuses on StyleGAN2 with adaptive discriminator augmentation (ADA).[16] We will see both how to train StyleGAN2 ADA on any arbitrary set of images and how to use pretrained weights provided by NVIDIA. The NVIDIA weights allow us to generate high-resolution, photorealistic-looking faces, such as those seen in Figure 7.3.
The above images were generated with StyleGAN2, using Google CoLab. Following the instructions in this section, you will be able to create faces like these of your own. StyleGAN2 images are usually 1,024 x 1,024 in resolution. An example of a full-resolution StyleGAN image can be found here.
The primary advancement introduced by adaptive discriminator augmentation is that the algorithm augments the training images in real time. Image augmentation is a common technique in many convolutional neural network applications. Augmentation has the effect of increasing the size of the training set. Where StyleGAN2 previously required over 30K images to develop an effective neural network, now far fewer are needed. I used 2K images to train the fish-generating GAN for this section. Figure 7.4 demonstrates the ADA process.
The figure shows the increasing probability of augmentation as p increases. For small image sets, the
discriminator will generally memorize the image set unless the training algorithm makes use of augmen-
tation. Once this memorization occurs, the discriminator is no longer providing useful information to the
training of the generator.
While the above images look much more realistic than images generated earlier in this course, they are not perfect. Look at Figure 7.5. There are usually several tell-tale signs that you are looking at a computer-generated image. One of the most obvious is usually the surreal, dream-like backgrounds. The background does not look obviously fake at first glance; however, upon closer inspection, you usually cannot quite make out what a GAN-generated background actually depicts. Also, look at the image character's left eye.
• Image A demonstrates the abstract backgrounds usually associated with a GAN-generated image.
• Image B exhibits issues that earrings often present for GANs. GANs sometimes have problems with
symmetry, particularly earrings.
• Image C contains an abstract background and a highly distorted secondary image.
• Image D also contains a highly distorted secondary image that might be a hand.
Several websites allow you to generate GAN images of your own without any software.
The first site generates high-resolution images of human faces. The second site presents a quiz to see if
you can detect the difference between a real and fake human face image.
In this chapter, you will learn to create your own StyleGAN pictures using Python. Without a GPU, you will run into the compute limitations of Google CoLab, so make sure to run this code on a GPU instance; the code assumes a GPU.
First, we clone StyleGAN3 from GitHub.
Code
!git clone https://github.com/NVlabs/stylegan3.git
!pip install ninja
!ls /content/stylegan3
Output
Code
!python /content/stylegan3/gen_images.py \
    --network={URL} \
    --outdir=/content/results --seeds=6600-6625
Code
!ls /content/results
Output
!cp /content/results/* \
    /content/drive/My\ Drive/projects/stylegan3
import sys
sys.path.insert(0, "/content/stylegan3")
import pickle
import os
import numpy as np
import PIL.Image
from IPython.display import Image
import matplotlib.pyplot as plt
import IPython.display
import torch
import dnnlib
import legacy

def seed2vec(G, seed):
    return np.random.RandomState(seed).randn(1, G.z_dim)

def display_image(image):
    plt.axis('off')
    plt.imshow(image)
    plt.show()
def generate_image(device, G, z):  # signature assumed from later calls
    label = np.zeros([1] + G.input_shapes[1][1:])
    # [minibatch, height, width, channel]
    images = G.run(z, label, **G_kwargs)
    return images[0]
def get_label(G, device, class_idx):
    label = torch.zeros([1, G.c_dim], device=device)
    if G.c_dim != 0:
        if class_idx is None:
            ctx.fail("Must specify class label with --class when using "
                     "a conditional network")
        label[:, class_idx] = 1
    else:
        if class_idx is not None:
            print("warn: --class=lbl ignored when running on "
                  "an unconditional network")
    return label
Code
Output
Code
# Generate the images for the seeds.
for i in range(SEED_FROM, SEED_TO):
    print(f"Seed {i}")
    z = seed2vec(G, i)
    img = generate_image(device, G, z)
    display_image(img)
Output
Seed 1000
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.
Seed 1001
Seed 1002
Code
def expand_seed(seeds, vector_size):
    result = []
    for seed in seeds:
        # Body reconstructed to mirror seed2vec above
        rnd = np.random.RandomState(seed)
        result.append(rnd.randn(1, vector_size))
    return result

vector_size = G.z_dim
# range(8192,8300)
seeds = expand_seed([8192+1, 8192+9], vector_size)
#generate_images(Gs, seeds, truncation_psi=0.5)
print(seeds[0].shape)
Output
The following code will move between the provided seeds. The constant STEPS specifies how many frames there should be between each seed.
Code
SEEDS = [6624, 6618, 6616]  # Better for faces
#SEEDS = [1000, 1003, 1001]  # Better for fish
STEPS = 100

# Remove any prior results
!rm /content/results/*
# Generate the images for the video.
idx = 0
for i in range(len(SEEDS)-1):
    v1 = seed2vec(G, SEEDS[i])
    v2 = seed2vec(G, SEEDS[i+1])
    diff = v2 - v1
    step = diff / STEPS
    current = v1.copy()

# Link the images into a video.
!ffmpeg -r 30 -i /content/results/frame-%d.png -vcodec mpeg4 -y movie.mp4
Code
from google.colab import files
files.download('movie.mp4')
Output
<IPython.core.display.Javascript object><IPython.core.display.Javascript object>
7.2 Part 7.2: Train StyleGAN3 with Your Images

/content/drive/MyDrive/data

!pip install torch==1.8.1 torchvision==0.9.1
!git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git
!pip install ninja

/content/drive/MyDrive/data
It might be helpful to use an ls command to establish the exact path for your images.
Code
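CMD is defined in a hidden cell; a sketch of what it might hold, with paths assumed from the rest of this section:

CMD = "python /content/stylegan2-ada-pytorch/dataset_tool.py " \
      "--source /content/drive/MyDrive/data/gan/images/circuit " \
      "--dest /content/drive/MyDrive/data/gan/dataset/circuit"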
! {CMD}
You can use the following command to clear out the newly created dataset. If something goes wrong
and you need to clean up your images and rerun the above command, you should delete your partially
completed dataset directory.
Code
#!rm -R /content/drive/MyDrive/data/gan/dataset/circuit/*
from os import listdir
from os.path import isfile, join
import os
from PIL import Image
from tqdm.notebook import tqdm

# List the image files (this line assumed; IMAGE_PATH set earlier)
files = [f for f in listdir(IMAGE_PATH) if isfile(join(IMAGE_PATH, f))]

base_size = None
for file in tqdm(files):
    file2 = os.path.join(IMAGE_PATH, file)
    img = Image.open(file2)
    sz = img.size
    if base_size and sz != base_size:
        print(f"Inconsistent size: {file2}")
    elif img.mode != 'RGB':
        print(f"Inconsistent color format: {file2}")
    else:
        base_size = sz
import os

# Modify these to suit your needs
EXPERIMENTS = "/content/drive/MyDrive/data/gan/experiments"
DATA = "/content/drive/MyDrive/data/gan/dataset/circuit"
SNAP = 10
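The cell that assembles and runs the training command is hidden; a sketch, assuming stylegan2-ada-pytorch's train.py is used:

CMD = f"python /content/stylegan2-ada-pytorch/train.py " \
      f"--snap {SNAP} --outdir {EXPERIMENTS} --data {DATA}"
!{CMD}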
import os

# Modify these to suit your needs
EXPERIMENTS = "/content/drive/MyDrive/data/gan/experiments"
NETWORK = "network-snapshot-000100.pkl"
RESUME = os.path.join(EXPERIMENTS, \
    "00008-circuit-auto1-resumecustom", NETWORK)
DATA = "/content/drive/MyDrive/data/gan/dataset/circuit"
SNAP = 10
7.3 Part 7.3: Exploring the StyleGAN Latent Vector

!git clone https://github.com/NVlabs/stylegan3.git
!pip install ninja
We will use the same functions introduced in the previous part to generate GAN seeds and images.
Code
import sys
sys.path.insert(0, "/content/stylegan3")
import pickle
import os
import numpy as np
import PIL.Image
from IPython.display import Image
import matplotlib.pyplot as plt
import IPython.display
import torch
import dnnlib
import legacy

def seed2vec(G, seed):
    return np.random.RandomState(seed).randn(1, G.z_dim)

def display_image(image):
    plt.axis('off')
    plt.imshow(image)
    plt.show()
def generate_image(device, G, z):  # signature assumed from later calls
    label = np.zeros([1] + G.input_shapes[1][1:])
    # [minibatch, height, width, channel]
    images = G.run(z, label, **G_kwargs)
    return images[0]
def get_label(G, device, class_idx):
    label = torch.zeros([1, G.c_dim], device=device)
    if G.c_dim != 0:
        if class_idx is None:
            ctx.fail('Must specify class label with --class '
                     'when using a conditional network')
        label[:, class_idx] = 1
    else:
        if class_idx is not None:
            print('warn: --class=lbl ignored when running '
                  'on an unconditional network')
    return label
Next, we load the NVIDIA FFHQ (faces) GAN. We could use any StyleGAN pretrained GAN network
here.
Code
# HIDE CODE
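A minimal sketch of what the hidden cell does, assuming NVIDIA's FFHQ StyleGAN3 weights (the URL here is an assumption):

URL = ("https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/"
       "versions/1/files/stylegan3-r-ffhq-1024x1024.pkl")

device = torch.device('cuda')
with dnnlib.util.open_url(URL) as f:
    G = legacy.load_network_pkl(f)['G_ema'].to(device)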
Output
We will begin by generating a few seeds to evaluate potential starting points for our fine-tuning. Try out different seed ranges until you have a seed that looks close to what you wish to fine-tune.
Code
# Generate the images for the seeds.
for i in range(SEED_FROM, SEED_TO):
    print(f"Seed {i}")
    z = seed2vec(G, i)
    img = generate_image(device, G, z)
    display_image(img)
Output
Seed 4020
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.
...
START_SEED = 4022
current = seed2vec(G, START_SEED)
Next, generate and display the current vector. You will return to this point for each iteration of the fine-tuning.
Code
img = generate_image(device, G, current)
SCALE = 0.5
display_image(img)
Output
Choose an explore size; this is the number of different potential images chosen by moving in 10 different
directions. Run this code once and then again anytime you wish to change the ten directions you are
exploring. You might change the ten directions if you are no longer seeing improvements.
Code
EXPLORE_SIZE = 25

explore = []
for i in range(EXPLORE_SIZE):
    explore.append(np.random.rand(1, 512) - 0.5)
Each image displayed from running this code shows a potential direction that we can move in the latent
vector. Choose one image that you like and change MOVE_DIRECTION to indicate this decision. Once
you rerun the code, the code will give you a new set of potential directions. Continue this process until
you have a latent vector that you like.
Code
if MOVE_DIRECTION >= 0:
    current = current + explore[MOVE_DIRECTION]

for i, mv in enumerate(explore):
    print(f"Direction {i}")
    z = current + mv
    img = generate_image(device, G, z)
    display_image(img)
Output
Direction 0
...
7.4 Part 7.4: GANs to Enhance Old Photographs (DeOldify)
!pip install -r colab_requirements.txt
The authors of DeOldify suggest that you might wish to include a watermark to let others know that AI enhanced this picture. The following code downloads this standard watermark. The authors describe the watermark as follows:
"This places a watermark icon of a palette at the bottom left corner of the image. The authors intend this practice to be a standard way to convey to others viewing the image that AI colorizes it. We want to help promote this as a standard, especially as the technology continues to improve and the distinction between real and fake becomes harder to discern. This palette watermark practice was initiated and led by MyHeritage in the MyHeritage In Color feature (which uses a newer version of DeOldify than what you're using here)."
Code
! {CMD}
import sys
import torch

if not torch.cuda.is_available():
    print('GPU not available.')
else:
    print('Using GPU.')
Output
Using GPU.
We can now call the model. I will enhance an image from my childhood, probably taken in the late 1970s. The picture shows three miniature schnauzers. My childhood dog (Scooby) is on the left, followed by his mom and sister. Overall, a stunning improvement; however, the red of the fire-engine riding toy is lost, as is the red of the picnic table where the three dogs were sitting.
Code
import fastai
from deoldify.visualize import *
import warnings
from urllib.parse import urlparse
import os
!wget {URL}

a = urlparse(URL)
before_file = os.path.basename(a.path)

RENDER_FACTOR = 35
WATERMARK = False

colorizer = get_image_colorizer(artistic=True)

after_image = colorizer.get_transformed_image(
    before_file, render_factor=RENDER_FACTOR,
    watermarked=WATERMARK)
#print("Starting image:")
Code
Output
You can see the deoldify version here. Please note that these two images will look similar in a black
and white book. To see it in color, visit this link.
Code
after_image
Output
7.5 Part 7.5: GANs for Tabular Synthetic Data Generation
!{CMD}
!pip install -r requirements.txt
!pip install tabgan
7.5.2 Loading the Auto MPG Data and Training a Neural Network
We will begin by generating fake data for the Auto MPG dataset we have previously seen. The tabgan library can generate categorical (textual) and continuous (numeric) data. However, it cannot generate unstructured data, such as the name of the automobile. Car names, such as "AMC Rebel SST", cannot be replicated by the GAN because every row has a different car name; it is a textual but non-categorical value.
The following code is similar to what we have seen before. We load the Auto MPG dataset. The tabgan library requires Pandas dataframes to train. Because of this, we keep both the Pandas and Numpy values.
Code
import requests
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

COLS_USED = ['cylinders', 'displacement', 'horsepower', 'weight',
    'acceleration', 'year', 'origin', 'mpg']
COLS_TRAIN = ['cylinders', 'displacement', 'horsepower', 'weight',
    'acceleration', 'year', 'origin']

df = df[COLS_USED]

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Split into training and test sets
df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(
    df.drop("mpg", axis=1),
    df["mpg"],
    test_size=0.20,
    #shuffle=False,
    random_state=42,
)

# Create dataframe versions for tabular GAN
df_x_test, df_y_test = df_x_test.reset_index(drop=True), \
    df_y_test.reset_index(drop=True)
df_y_train = pd.DataFrame(df_y_train)
df_y_test = pd.DataFrame(df_y_test)

# Pandas to Numpy
x_train = df_x_train.values
x_test = df_x_test.values
y_train = df_y_train.values
y_test = df_y_test.values

# Build the neural network
model = Sequential()
# Hidden 1
We now evaluate the trained neural network to see the RMSE. We will use this trained neural network to compare the accuracy between the original data and the GAN-generated data. We will later see that you can use such comparisons for anomaly detection; this technique can be used for security systems. If a neural network trained on original data does not perform well on new data, then the new data may be suspect or fake.
Code
pred = model.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print("Final score (RMSE): {}".format(score))
Output
Final score (RMSE): 4.33633936452545
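The cell that trains the GAN and generates the data does not appear here; a minimal sketch using tabgan's GANGenerator (default parameters assumed):

from tabgan.sampler import GANGenerator

# Train the tabular GAN and produce synthetic inputs and targets
gen_x, gen_y = GANGenerator().generate_data_pipe(
    df_x_train, df_y_train, df_x_test,
    deep_copy=True, only_adversarial=False, use_adversarial=True)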
Output
Note: if you receive an error running the above code, you likely need to restart the runtime. You should
have a "restart runtime" button in the output from the second cell. Once you restart the runtime, rerun
all of the cells. This step is necessary as tabgan requires specific versions of some packages.
If we display the results, we can see that the GAN-generated data looks similar to the original. Some
values, typically whole numbers in the original data, have fractional values in the synthetic data.
Code
gen_x
Output
Finally, we present the synthetic data to the previously trained neural network to see how accurately
we can predict the synthetic targets. As we can see, you lose some RMSE accuracy by going to synthetic
data.
Code
# Predict
pred = model.predict(gen_x.values)
score = np.sqrt(metrics.mean_squared_error(pred, gen_y.values))
print("Final score (RMSE): {}".format(score))
Output
Final score (RMSE): 9.083745225633098
Chapter 8

Kaggle Data Sets
PassengerId,Survived
892,0
893,1
894,1
895,0
896,0
897,1
...
The above file states the prediction for each of the various passengers. You should only predict on IDs that are in the test file. Likewise, you should render a prediction for every row in the test file. Some competitions will have different formats for their answers. For example, a multi-class classification will usually have a column for each class, holding your predicted probability for that class.
• An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data.[27] Ecological Modelling, 178(3), 389-397.
For this chapter, we will use the input perturbation feature ranking algorithm. This algorithm will work with any regression or classification network. In the next section, I provide an implementation of the input perturbation algorithm for scikit-learn. The code below implements a function that will work with any scikit-learn model.
Leo Breiman provided this algorithm in his seminal paper on random forests.[Breiman, 2001] Although he presented this algorithm in conjunction with random forests, it is model-independent and appropriate for any supervised learning model. This algorithm, known as the input perturbation algorithm, works by evaluating a trained model's accuracy with each input individually shuffled from a data set. Shuffling an input causes it to become useless, effectively removing it from the model. More important inputs will produce a less accurate score when they are removed by shuffling them. This process makes sense because important features will contribute to the model's accuracy. I first presented the TensorFlow implementation of this algorithm in the following paper.
This algorithm will use log loss to evaluate a classification problem and RMSE for regression.
Code
from sklearn import metrics
import scipy as sp
import numpy as np
import math
import pandas as pd

# Function header and ending reconstructed; the parameter names are
# assumed from the surrounding text
def perturbation_rank(model, x, y, names, regression):
    errors = []
    for i in range(x.shape[1]):
        hold = np.array(x[:, i])
        np.random.shuffle(x[:, i])
        if regression:
            pred = model.predict(x)
            error = metrics.mean_squared_error(y, pred)
        else:
            pred = model.predict(x)
            error = metrics.log_loss(y, pred)
        errors.append(error)
        x[:, i] = hold
    max_error = np.max(errors)
    importance = [e/max_error for e in errors]
    # Rank the inputs from most to least important
    data = {'name': names, 'error': errors, 'importance': importance}
    result = pd.DataFrame(data, columns=['name', 'error', 'importance'])
    result.sort_values(by=['importance'], ascending=[0], inplace=True)
    result.reset_index(inplace=True, drop=True)
    return result
import pandas as pd
import io
import requests
import numpy as np
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/iris.csv",
    na_values=['NA', '?'])

# Convert to numpy - Classification
x = df[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df['species'])  # Classification
species = dummies.columns
y = dummies.values

# Split into train/test (assumed; x_train etc. are used below)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Build neural network
model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(25, activation='relu'))  # Hidden 2
model.add(Dense(y.shape[1], activation='softmax'))  # Output
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(x_train, y_train, verbose=2, epochs=100)
Next, we evaluate the accuracy of the trained model. Here we see that the neural network performs
great, with an accuracy of 1.0. We might fear overfitting with such high accuracy for a more complex
dataset. However, for this example, we are more interested in determining the importance of each column.
Code
from s k l e a r n . m e t r i c s import a c c u r a c y _ s c o r e
pred = model . p r e d i c t ( x _ t e s t )
p r e d i c t _ c l a s s e s = np . argmax ( pred , a x i s =1)
e x p e c t e d _ c l a s s e s = np . argmax ( y_test , a x i s =1)
c o r r e c t = accuracy_score ( expected_classes , p r e d i c t _ c l a s s e s )
print ( f " Accuracy : ␣ { c o r r e c t } " )
Output
Accuracy : 1 . 0
We are now ready to call the input perturbation algorithm. First, we extract the column names and remove the target column. The target column is not ranked, as it is the objective, not one of the inputs, even though in supervised learning the target is of the utmost importance.
We can see the importance displayed in the following table. The most important column is always 1.0, and lesser columns will continue in a downward trend. The least important column will have the lowest rank.
Code
# Rank the features
from IPython.display import display, HTML
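The ranking call itself is omitted here; a sketch, assuming the perturbation_rank function defined earlier and the iris column names:

names = list(df.columns)  # x + y column names
names.remove("species")   # remove the target (y)
rank = perturbation_rank(model, x_test, y_test, names, False)
display(rank)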
Output
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])
# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
    'acceleration', 'year', 'origin']].values
y = df['mpg'].values  # regression

# Split into train/test (assumed; mirrors the iris example above)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))  # Hidden 2
model.add(Dense(1))  # Output
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, verbose=2, epochs=100)

# Predict
pred = model.predict(x)
Just as before, we extract the column names and discard the target. We can now create a ranking of
the importance of each of the input features. The feature with a ranking of 1.0 is the most important.
Code
# Rank the features
from IPython.display import display, HTML
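Again the call is omitted; a sketch for the regression case (validation names assumed):

names = list(df.columns)
names.remove("mpg")  # remove the target (y)
rank = perturbation_rank(model, x_test, y_test, names, True)
display(rank)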
Output
Code
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import KFold
from IPython.display import HTML, display

df_train = pd.read_csv(
    URL+"bio_train.csv",
    na_values=['NA', '?'])

df_test = pd.read_csv(
    URL+"bio_test.csv",
    na_values=['NA', '?'])
A large number of columns is evident when we display the shape of the dataset.
Code
print(df_train.shape)
Output
(3751, 1777)
The following code constructs a classification neural network and trains it for the biological response
dataset. Once trained, the accuracy is measured.
Code
import os
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np
import sklearn

# Encode feature vector
# Convert to numpy - Classification
x_columns = df_train.columns.drop('Activity')
x = df_train[x_columns].values
y = df_train['Activity'].values  # Classification
x_submit = df_test[x_columns].values.astype(np.float32)
# Predict
pred = model.predict(x_test).flatten()
# Clip so that min is never exactly 0, max never 1
pred = np.clip(pred, a_min=1e-6, a_max=(1-1e-6))
print("Validation logloss: {}".format(
    sklearn.metrics.log_loss(y_test, pred)))

# Build real submit file
pred_submit = model.predict(x_submit)
Output
Fitting/Training...
Epoch 7: early stopping
Fitting done...
Validation logloss: 0.5564708781752792
Validation accuracy score: 0.7515991471215352
Code
# Rank the features
from IPython.display import display, HTML
Output
import numpy as np
import os
import pandas as pd
import math
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier  # used below
from sklearn.linear_model import LogisticRegression  # used for blending

SHUFFLE = False
FOLDS = 10
def build_ann(input_size, classes, neurons):
    model = Sequential()
    model.add(Dense(neurons, input_dim=input_size, activation='relu'))
    model.add(Dense(1))
    model.add(Dense(classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model
def mlogloss(y_test, preds):
    epsilon = 1e-15
    sum = 0
    for row in zip(preds, y_test):
        x = row[0][row[1]]
        x = max(epsilon, x)
        x = min(1-epsilon, x)
        sum += math.log(x)
    return ((-1/len(preds))*sum)
def stretch(y):
    return (y - y.min()) / (y.max() - y.min())
models = [
    KerasClassifier(build_fn=build_ann, neurons=20,
        input_size=x.shape[1], classes=2),
    KNeighborsClassifier(n_neighbors=3),
    RandomForestClassifier(n_estimators=100, n_jobs=-1,
        criterion='gini'),
    RandomForestClassifier(n_estimators=100, n_jobs=-1,
        criterion='entropy'),
    ExtraTreesClassifier(n_estimators=100, n_jobs=-1,
        criterion='gini'),
    ExtraTreesClassifier(n_estimators=100, n_jobs=-1,
        criterion='entropy'),
    GradientBoostingClassifier(learning_rate=0.05,
        subsample=0.5, max_depth=6, n_estimators=50)]
print()
print("Blending models.")
blend = LogisticRegression(solver='lbfgs')
blend.fit(dataset_blend_train, y)
return blend.predict_proba(dataset_blend_test)
df_train = pd.read_csv(
    URL+"bio_train.csv",
    na_values=['NA', '?'])

df_submit = pd.read_csv(
    URL+"bio_test.csv",
    na_values=['NA', '?'])
predictors = list(df_train.columns.values)
predictors.remove('Activity')
x = df_train[predictors].values
y = df_train['Activity']
x_submit = df_submit.values

if SHUFFLE:
    idx = np.random.permutation(y.size)
    x = x[idx]
    y = y[idx]
####################
# Build submit file
####################
ids = [id+1 for id in range(submit_data.shape[0])]
submit_df = pd.DataFrame({'MoleculeId': ids,
    'PredictedProbability': submit_data[:, 1]},
    columns=['MoleculeId', 'PredictedProbability'])
submit_df.to_csv("submit.csv", index=False)
• Batch Normalization
• Training Parameters
The following sections will introduce each of these categories for Keras. While I will provide some general
guidelines for hyperparameter selection, no two tasks are the same. You will benefit from experimentation
with these values to determine what works best for your neural network. In the next part, we will see how
machine learning can select some of these values independently.
• Activation - You can also add activation functions as layers. Using the activation layer is the same
as specifying the activation function as part of a Dense (or other) layer type.
• ActivityRegularization - Used to add L1/L2 regularization outside of a layer. You can specify L1 and L2 as part of a Dense (or other) layer type.
• Dense - The original neural network layer type. In this layer type, every neuron connects to the next layer. The input vector is one-dimensional, and placing specific inputs next to each other has no special meaning.
• Dropout - Dropout consists of randomly setting a fraction rate of input units to 0 at each update
during training time, which helps prevent overfitting. Dropout only occurs during training.
• Flatten - Flattens the input to 1D and does not affect the batch size.
• Input - A Keras tensor is a tensor object from the underlying back end (Theano, TensorFlow, or CNTK), which we augment with specific attributes so that we can build a Keras model just by knowing the inputs and outputs of the model.
• Lambda - Wraps an arbitrary expression as a Layer object.
• Masking - Masks a sequence using a mask value to skip timesteps.
• Permute - Permutes the input dimensions according to a given pattern. Useful for tasks such as
connecting RNNs and convolutional networks.
• RepeatVector - Repeats the input n times.
• Reshape - Similar to Numpy's reshape.
• SpatialDropout1D - This version performs the same function as Dropout; however, it drops entire
1D feature maps instead of individual elements.
• SpatialDropout2D - This version performs the same function as Dropout; however, it drops entire 2D feature maps instead of individual elements.
• SpatialDropout3D - This version performs the same function as Dropout; however, it drops entire
3D feature maps instead of individual elements.
There is always trial and error for choosing a good number of neurons and hidden layers. Generally, the number of neurons on each layer will be larger closer to the input layer and smaller towards the output layer. This configuration gives the neural network a somewhat triangular or trapezoid appearance.
Batch normalization normalizes the activations of the previous layer at each batch; i.e., it applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1. It can allow the learning rate to be larger.
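For example, a minimal sketch of inserting batch normalization between Dense layers in Keras (the layer sizes here are arbitrary):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(50, input_dim=10, activation='relu'))
model.add(BatchNormalization())  # normalize the previous layer's activations
model.add(Dense(25, activation='relu'))
model.add(Dense(1))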
• bayesian-optimization
• hyperopt
• spearmint
Code
# Ignore useless W0819 warnings generated by TensorFlow 2.0.
# Hopefully can remove this ignore in the future.
# See https://github.com/tensorflow/tensorflow/issues/31308
import logging, os
logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import pandas as pd
from scipy.stats import zscore

# Read the dataset
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA', '?'])
# Generate dummies for job
df = pd.concat([df, pd.get_dummies(df['job'], prefix="job")], axis=1)
df.drop('job', axis=1, inplace=True)

# Generate dummies for area
df = pd.concat([df, pd.get_dummies(df['area'], prefix="area")], axis=1)
df.drop('area', axis=1, inplace=True)

# Missing values for income
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product'])  # Classification
products = dummies.columns
y = dummies.values
Now that we’ve preprocessed the data, we can begin the hyperparameter optimization. We start by creating a function that generates the model based on just three parameters. Bayesian optimization works on a vector of numbers, not on a problematic notion like how many layers and neurons are on each layer. To represent this complex neuron structure as a vector, we use several numbers to describe this structure. These three numbers define the structure of the neural network. The comments in the code below show exactly how the program constructs the network.
Code
import pandas as pd
import os
import numpy as np
import time
import tensorflow.keras.initializers
import statistics
import tensorflow.keras
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, InputLayer
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import ShuffleSplit
from tensorflow.keras.layers import LeakyReLU, PReLU
from tensorflow.keras.optimizers import Adam
# Construct neural network
model = Sequential()

# Add dropout after each hidden layer

# Shrink neuron count for each layer
neuronCount = neuronCount * neuronShrink
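Only a fragment of the generator function survives above. A fuller sketch, reconstructed under assumptions (the layer loop, the 5,000-neuron scale, and the PReLU activation are assumptions consistent with the comments):

def generate_model(dropout, neuronPct, neuronShrink):
    # Start with a neuron count that is a percentage of 5,000 (assumed scale)
    neuronCount = int(neuronPct * 5000)

    # Construct neural network
    model = Sequential()

    layer = 0
    while neuronCount > 25 and layer < 10:
        if layer == 0:
            # The first hidden layer needs the input dimension
            model.add(Dense(neuronCount,
                input_dim=x.shape[1], activation=PReLU()))
        else:
            model.add(Dense(neuronCount, activation=PReLU()))
        layer += 1

        # Add dropout after each hidden layer
        model.add(Dropout(dropout))

        # Shrink neuron count for each layer
        neuronCount = neuronCount * neuronShrink

    model.add(Dense(y.shape[1], activation='softmax'))  # Output
    return model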
We can test this code to see how it creates a neural network based on three such parameters.
Code
Output
We will now create a function to evaluate the neural network using three such parameters. We use
bootstrapping because one training run might have "bad luck" with the assigned random weights. We use
this function to train and then evaluate the neural network.
Code
SPLITS = 2
EPOCHS = 500
PATIENCE = 10

# for Classification
boot = StratifiedShuffleSplit(n_splits=SPLITS, test_size=0.1)
# for Regression
# boot = ShuffleSplit(n_splits=SPLITS, test_size=0.1)
# Track progress
mean_benchmark = []
epochs_needed = []
num = 0

# Loop through samples
for train, test in boot.split(x, df['product']):
    start_time = time.time()
    num += 1

    # Split train and test
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]

    # Train on the bootstrap sample
    model.fit(x_train, y_train, validation_data=(x_test, y_test),
        callbacks=[monitor], verbose=0, epochs=EPOCHS)
    epochs = monitor.stopped_epoch
    epochs_needed.append(epochs)

    # Predict on the out of boot (validation)
    pred = model.predict(x_test)

    # Measure this bootstrap's log loss

    # Record this iteration
    time_took = time.time() - start_time

tensorflow.keras.backend.clear_session()
return (-m1)
You can try any combination of our three hyperparameters, plus the learning rate, to see how effective
these four numbers are. Of course, our goal is not to manually choose different combinations of these four
hyperparameters; we seek to automate.
Code
print(evaluate_network(
    dropout=0.2,
    learning_rate=1e-3,
    neuronPct=0.2,
    neuronShrink=0.2))
Output
-0.6668764846259546

!pip install bayesian-optimization
We will now automate this process. We define the bounds for each of these four hyperparameters and begin the Bayesian optimization. Once the program finishes, the best combination of hyperparameters found is displayed. The optimize function accepts two parameters that will significantly impact how long the process takes to complete:
• n_iter - How many steps of Bayesian optimization you want to perform. The more steps, the more likely you will find a reasonable maximum.
• init_points - How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
Code
# Suppress NaN warnings
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

# Import assumed; provided by the bayesian-optimization package
from bayes_opt import BayesianOptimization

# Bounded region of parameter space
pbounds = {'dropout': (0.0, 0.499),
           'learning_rate': (0.0, 0.1),
           'neuronPct': (0.01, 1),
           'neuronShrink': (0.01, 1)
          }

optimizer = BayesianOptimization(
    f=evaluate_network,
    pbounds=pbounds,
    verbose=2,  # verbose = 1 prints only when a maximum
                # is observed, verbose = 0 is silent
    random_state=1,
)

start_time = time.time()
optimizer.maximize(init_points=10, n_iter=20,)
time_took = time.time() - start_time
Output
...
Total runtime: 1:36:11.56
{'target': -0.6955536706512794, 'params': {'dropout':
0.2504561773412203, 'learning_rate': 0.0076232346709142924,
'neuronPct': 0.012648791521811826, 'neuronShrink':
0.5229748831552032}}
As you can see, the algorithm performed 30 total iterations. This total iteration count includes ten random
and 20 optimization iterations.
Previous Kaggle competition sites for this class (NOT this semester’s assignment, feel free to use code):
• kaggle_iris_test.csv - The data that Kaggle will evaluate you on. It contains only input; you must provide answers. (contains x)
• kaggle_iris_train.csv - The data that you will use to train. (contains x and y)
• kaggle_iris_sample.csv - A sample submission for Kaggle. (contains x and y)
Important features of the Kaggle iris files (that differ from how we’ve previously seen files):
The following program generates a submission file for "Iris Kaggle". You can use it as a starting point for
assignment 3.
Code
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

df_train = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/"+\
    "kaggle_iris_train.csv", na_values=['NA', '?'])

# Encode feature vector
df_train.drop('id', axis=1, inplace=True)

# Convert to numpy - Classification
x = df_train[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df_train['species'])  # Classification
species = dummies.columns
y = dummies.values
# Train, with early stopping
model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))
model.add(Dense(25))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
    patience=5, verbose=1, mode='auto',
    restore_best_weights=True)
Output
Number of classes: 3
Restoring model weights from the end of the best epoch: 103.
Epoch 108: early stopping
Now that we’ve trained the neural network, we can check its log loss.
Code
from sklearn import metrics
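The computation itself is omitted here; a sketch, assuming a validation split named x_test/y_test as in the earlier examples:

pred = model.predict(x_test)
score = metrics.log_loss(y_test, pred)
print(f"Log loss score: {score}")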
Output
Log loss score: 0.10988010508939623
Now we are ready to generate the Kaggle submission file. We will use the iris test data that does not
contain a y target value. It is our job to predict this value and submit it to Kaggle.
Code
# Generate Kaggle submit file

# Encode feature vector
df_test = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/"+\
    "kaggle_iris_test.csv", na_values=['NA', '?'])

# Convert to numpy - Classification
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)
x = df_test[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
y = dummies.values

# Generate predictions
pred = model.predict(x)
#pred

# Create submission dataset (these three lines assumed; they match
# the column layout shown in the output below)
df_submit = pd.DataFrame(pred)
df_submit.insert(0, 'id', ids)
df_submit.columns = ['id', 'species-0', 'species-1', 'species-2']

# Write submit file locally
df_submit.to_csv("iris_submit.csv", index=False)
print(df_submit[:5])
Output
    id  species-0  species-1  species-2
0  100   0.022300   0.777859   0.199841
1  101   0.001309   0.273849   0.724842
2  102   0.001153   0.319349   0.679498
3  103   0.958006   0.041989   0.000005
4  104   0.976932   0.023066   0.000002
• kaggle_mpg_test.csv - The data that Kaggle will evaluate you on. It contains only input; you must provide answers. (contains x)
• kaggle_mpg_train.csv - The data that you will use to train. (contains x and y)
• kaggle_mpg_sample.csv - A sample submission for Kaggle. (contains x and y)
Important features of the Kaggle MPG files (that differ from how we’ve previously seen files):
The following program generates a submission file for "MPG Kaggle".
Code
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/"+\
    "kaggle_auto_train.csv",
    na_values=['NA', '?'])

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
    'acceleration', 'year', 'origin']].values
y = df['mpg'].values  # regression

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))  # Hidden 2
model.add(Dense(1))  # Output
model.compile(loss='mean_squared_error', optimizer='adam')

# Remaining arguments assumed from the parallel iris example above
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5,
    verbose=1, mode='auto', restore_best_weights=True)

# Predict
pred = model.predict(x_test)
Now that we’ve trained the neural network, we can check its RMSE error.
Code
import numpy as np
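The computation is omitted here; a sketch, assuming the pred and y_test values from the validation split:

from sklearn import metrics
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print("Final score (RMSE): {}".format(score))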
Output
Final score (RMSE): 6.023776405947501
Now we are ready to generate the Kaggle submission file. We will use the MPG test data that does not
contain a y target value. It is our job to predict this value and submit it to Kaggle.
Code
import pandas as pd

# Generate Kaggle submit file

# Encode feature vector
df_test = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/datasets/"+\
    "kaggle_auto_test.csv", na_values=['NA', '?'])

# Convert to numpy - regression
ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)

# Handle missing value
df_test['horsepower'] = df_test['horsepower'].\
    fillna(df['horsepower'].median())

x = df_test[['cylinders', 'displacement', 'horsepower', 'weight',
    'acceleration', 'year', 'origin']].values

# Generate predictions
pred = model.predict(x)
#pred

# Create submission dataset (these two lines assumed; they match
# the columns shown in the output below)
df_submit = pd.DataFrame({'id': ids, 'mpg': pred.flatten()},
    columns=['id', 'mpg'])

# Write submit file locally
df_submit.to_csv("auto_submit.csv", index=False)
print(df_submit[:5])
Output
id mpg
0 350 27.158819
1 351 24.450621
2 352 24.913355
3 353 26.994867
4 354 26.669268
Chapter 9
Transfer Learning
What if we had a dataset that included the four measurements, plus a cost as the target? This dataset does not contain the species; as a result, it uses the same four inputs as the base model we just trained.
We can take our previously trained iris network and transfer the weights to a new neural network that
will learn to predict the cost through transfer learning. Also of note, the original neural network was a
classification network, yet we now use it to build a regression neural network. Such a transformation is
common for transfer learning. As a reference point, I randomly created this iris cost dataset.
The first step is to train our neural network for the regular Iris Dataset. The code presented here is
the same as we saw in Module 3.
Code
import pandas as pd
import io
import requests
import numpy as np
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/iris.csv",
    na_values=['NA', '?'])

# Convert to numpy - Classification
x = df[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
dummies = pd.get_dummies(df['species'])  # Classification
species = dummies.columns
y = dummies.values

# Build neural network
model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(25, activation='relu'))  # Hidden 2
model.add(Dense(y.shape[1], activation='softmax'))  # Output
Output
...
To keep this example simple, we are not setting aside a validation set. The goal of this example is to
show how to create a multi-layer neural network, where we transfer the weights to another network. We
begin by evaluating the accuracy of the network on the training set.
Code

from sklearn.metrics import accuracy_score

pred = model.predict(x)
predict_classes = np.argmax(pred, axis=1)
expected_classes = np.argmax(y, axis=1)
correct = accuracy_score(expected_classes, predict_classes)
print(f"Training Accuracy: {correct}")
Output

Training Accuracy: 0.9866666666666667
The model summary is as expected; we can see the three layers previously defined.
Code
model.summary()
Output

We now create a second neural network, model2, and copy each layer of the original network into it.

Code

model2 = Sequential()
for layer in model.layers:
    model2.add(layer)
model2.summary()
Output
As a sanity check, we would like to calculate the accuracy of the newly created model. The in-sample accuracy should be the same as that of the previous model whose layers the new model received.
Code

from sklearn.metrics import accuracy_score

pred = model2.predict(x)
predict_classes = np.argmax(pred, axis=1)
expected_classes = np.argmax(y, axis=1)
correct = accuracy_score(expected_classes, predict_classes)
print(f"Training Accuracy: {correct}")
Output
Training Accuracy: 0.9866666666666667
The in-sample accuracy of the newly created neural network is the same as the first neural network.
We’ve successfully transferred all of the layers from the original neural network.
Next, we load the iris cost dataset.

Code

df_cost = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/iris_cost.csv",
    na_values=['NA', '?'])

df_cost
Output
For transfer learning to be effective, the input to the newly trained neural network should conform as closely as possible to the input of the first neural network we transfer from.
We will strip away the last output layer that contains the softmax activation function that performs
this final classification. We will create a new output layer that will output the cost prediction. We will
only train the weights in this new layer. We will mark the first two layers as non-trainable. The hope is
that the first few layers have learned to abstract the raw input data in a way that is also helpful to the
new neural network.
This process is accomplished by looping over the first few layers and copying them to the new neural
network. We output a summary of the new neural network to verify that Keras stripped the previous
output layer.
Code

model3 = Sequential()
for i in range(2):
    layer = model.layers[i]
    layer.trainable = False
    model3.add(layer)
model3.summary()
Output
We add a final regression output layer to complete the new neural network.
Code

model3.add(Dense(1))  # Output
model3.compile(loss='mean_squared_error', optimizer='adam')
model3.summary()
Output

=================================================================
Total params: 1,551
Trainable params: 26
Non-trainable params: 1,525
_________________________________________________________________
Now we train just the output layer to predict the cost. The cost in the made-up dataset is dependent
on the species, so the previous learning should be helpful.
Code

# Convert to numpy - regression
x = df_cost[['sepal_l', 'sepal_w', 'petal_l', 'petal_w']].values
y = df_cost.cost.values

# Train the last layer of the network
model3.fit(x, y, verbose=2, epochs=100)
Output
...
8/8 - 0s - loss: 1.8851 - 17ms/epoch - 2ms/step
Epoch 100/100
8/8 - 0s - loss: 1.8838 - 9ms/epoch - 1ms/step
We can evaluate the in-sample RMSE for the new model containing transferred layers from the previous
model.
Code

pred = model3.predict(x)
score = np.sqrt(metrics.mean_squared_error(pred, y))
print(f"Final score (RMSE): {score}")
Output
Final score (RMSE): 1.3716589625823072
9.2 Part 9.2: Keras Transfer Learning for Computer Vision
Keras contains built-in support for several pretrained models. In the Keras documentation, you can find
the complete list.
Code

import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

train_ds, validation_ds = tfds.load(
    "cats_vs_dogs",
    split=["train[:40%]", "train[40%:50%]"],
    as_supervised=True,  # Include labels
)

num_train = tf.data.experimental.cardinality(train_ds)
num_test = tf.data.experimental.cardinality(validation_ds)

print(f"Number of training samples: {num_train}")
print(f"Number of validation samples: {num_test}")
Output

Number of training samples: 9305
Number of validation samples: 2326
We begin by displaying several of the images from this dataset. The labels are above each image. As can
be seen from the images below, 1 indicates a dog, and 0 indicates a cat.
Code

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for i, (image, label) in enumerate(train_ds.take(9)):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image)
    plt.title(int(label))
    plt.axis("off")
Output
Upon examining the above images, another problem becomes evident: the images are of various sizes. We will standardize all images to 150x150 with the following code.
Code

size = (150, 150)

train_ds = train_ds.map(lambda x, y: (tf.image.resize(x, size), y))
validation_ds = validation_ds.map(lambda x, y: \
    (tf.image.resize(x, size), y))
We will batch the data and use caching and prefetching to optimize loading speed.
Code

batch_size = 32

train_ds = train_ds.cache().batch(batch_size).prefetch(buffer_size=10)
validation_ds = validation_ds.cache() \
    .batch(batch_size).prefetch(buffer_size=10)
Augmentation is a powerful computer vision technique that increases the amount of training data
available to your model by altering the images in the training data. To use augmentation, we will allow
horizontal flips of the images. A horizontal flip makes much more sense for cats and dogs in the real world
than a vertical flip. How often do you see upside-down dogs or cats? We also include a limited degree of
rotation.
Code

from tensorflow import keras
from tensorflow.keras import layers

data_augmentation = keras.Sequential(
    [layers.RandomFlip("horizontal"), layers.RandomRotation(0.1),]
)
The following code displays the first image of a training batch after several random augmentations.

Code

import numpy as np

for images, labels in train_ds.take(1):
    plt.figure(figsize=(10, 10))
    first_image = images[0]
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        augmented_image = data_augmentation(
            tf.expand_dims(first_image, 0), training=True
        )
        plt.imshow(augmented_image[0].numpy().astype("int32"))
        plt.title(int(labels[0]))
        plt.axis("off")
Output
The batch normalization layers do require special consideration. We need to keep these layers in
inference mode when we unfreeze the base model for fine-tuning. To do this, we make sure that the base
model is running in inference mode here.
Code

base_model = keras.applications.Xception(
    weights="imagenet",  # Load weights pre-trained on ImageNet.
    input_shape=(150, 150, 3),
    include_top=False,
)  # Do not include the ImageNet classifier at the top.

# Freeze the base_model
base_model.trainable = False

# Create new model on top
inputs = keras.Input(shape=(150, 150, 3))
x = data_augmentation(inputs)  # Apply random data augmentation

# Pre-trained Xception weights requires that input be scaled
# from (0, 255) to a range of (-1., +1.), the rescaling layer
# outputs: `(inputs * scale) + offset`
scale_layer = keras.layers.Rescaling(scale=1 / 127.5, offset=-1)
x = scale_layer(x)

# Keep the base model in inference mode, even after later unfreezing
x = base_model(x, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dropout(0.2)(x)  # Regularize with dropout
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

model.summary()
Output

Model: "model"
_________________________________________________________________
 Layer (type)                 Output Shape              Param #
=================================================================
 input_2 (InputLayer)         [(None, 150, 150, 3)]     0
 sequential (Sequential)      (None, 150, 150, 3)       0
 rescaling (Rescaling)        (None, 150, 150, 3)       0
 xception (Functional)        (None, 5, 5, 2048)        20861480
 global_average_pooling2d (G  (None, 2048)              0
 lobalAveragePooling2D)
...
=================================================================
Total params: 20,863,529
Trainable params: 2,049
Non-trainable params: 20,861,480
_________________________________________________________________
Next, we compile and fit the model. The fitting will use the Adam optimizer; because we are performing
binary classification, we use the binary cross-entropy loss function, as we have done before.
Code

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[keras.metrics.BinaryAccuracy()],
)

epochs = 20
model.fit(train_ds, epochs=epochs, validation_data=validation_ds)
Output

...
291/291 [==============================] - 11s 37ms/step - loss:
0.0907 - binary_accuracy: 0.9627 - val_loss: 0.0718 -
val_binary_accuracy: 0.9729
Epoch 20/20
291/291 [==============================] - 11s 37ms/step - loss: ...
The training above shows that the validation accuracy reaches the mid 90% range. This accuracy is good; however, we can do better by unfreezing the base model and fine-tuning the entire network at a very low learning rate.

Code

# Unfreeze the base model; its batch normalization layers stay in
# inference mode because the model was built with training=False.
base_model.trainable = True
model.summary()

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),  # Low learning rate
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[keras.metrics.BinaryAccuracy()],
)

epochs = 10
model.fit(train_ds, epochs=epochs, validation_data=validation_ds)
Output

 lobalAveragePooling2D)
 dropout (Dropout)            (None, 2048)              0
 dense (Dense)                (None, 1)                 2049
=================================================================
Total params: 20,863,529
Trainable params: 20,809,001
...
val_binary_accuracy: 0.9837
Epoch 10/10
291/291 [==============================] - 41s 140ms/step - loss:
0.0162 - binary_accuracy: 0.9944 - val_loss: 0.0548 -
val_binary_accuracy: 0.9819
9.3 Part 9.3: Transfer Learning for NLP with Keras
These examples use TensorFlow Hub, which allows pretrained models to be loaded into TensorFlow easily. To install TensorFlow Hub, use the following command.
Code

!pip install tensorflow_hub
It is also necessary to install TensorFlow Datasets, which you can install with the following command.
Code

!pip install tensorflow_datasets
Movie reviews are a good source of training data for sentiment analysis. These reviews are textual,
and users give them a star rating which indicates if the viewer had a positive or negative experience with
the movie. Load the Internet Movie DataBase (IMDB) reviews data set. This example is based on a
TensorFlow example that you can find here.
Code

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

train_data, test_data = tfds.load(
    name="imdb_reviews", split=["train", "test"],
    batch_size=-1, as_supervised=True)

train_examples, train_labels = tfds.as_numpy(train_data)
test_examples, test_labels = tfds.as_numpy(test_data)

# /Users/jheaton/tensorflow_datasets/imdb_reviews/plain_text/0.1.0
Load a pretrained embedding model called gnews-swivel-20dim. Google trained this network on GNEWS data, and it can convert raw text into vectors.

Code

embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
The following code displays three movie reviews. This display allows you to see the actual data.
Code
train_examples[:3]
Output
...
The embedding layer can convert each review to a 20-number vector, which the neural network receives as input in place of the actual words.
Code
hub_layer(train_examples[:3])
Output
...
We add additional layers to classify the movie reviews as either positive or negative.
Code
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()
Output
We are now ready to compile the neural network. For this application, we use the adam training method for binary classification. We also save the initial random weights so that we can easily start over later.

Code

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

init_weights = model.get_weights()
Before fitting, we split the training data into the train and validation sets.
Code
x_val = train_examples[:10000]
partial_x_train = train_examples[10000:]
y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]
We can now fit the neural network. This fitting will run for 40 epochs and allow us to evaluate the effectiveness of the neural network, as measured by the validation set.
Code
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)
Output
...
30/30 [==============================] - 1s 37ms/step - loss: 0.0711 -
accuracy: 0.9820 - val_loss: 0.3562 - val_accuracy: 0.8738
Epoch 40/40
30/30 [==============================] - 1s 37ms/step - loss: 0.0661 -
accuracy: 0.9847 - val_loss: 0.3626 - val_accuracy: 0.8728
We will now examine how loss and accuracy progressed during fitting for the training and validation sets. Loss measures the degree to which the neural network was confident in incorrect answers. Accuracy is the percentage of correct classifications, regardless of the neural network's confidence.
We begin by looking at the loss as we fit the neural network.
Code
%matplotlib inline
import matplotlib.pyplot as plt

acc = history.history['accuracy']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, history.history['loss'], 'bo', label='Training loss')
plt.plot(epochs, history.history['val_loss'], 'b',
         label='Validation loss')
plt.legend()
plt.show()
Output
We can see that training and validation loss are similar early in the fitting. However, as fitting continues
and overfitting sets in, training and validation loss diverge from each other. Training loss continues to fall
consistently. However, once overfitting happens, the validation loss no longer falls and eventually begins
to increase a bit. Early stopping, which we saw earlier in this course, can prevent some overfitting.
Code
plt.plot(epochs, history.history['accuracy'], 'bo', label='Training acc')
plt.plot(epochs, history.history['val_accuracy'], 'b',
         label='Validation acc')
plt.legend()
plt.show()
Output
The accuracy graph tells a similar story. Now let’s repeat the fitting with early stopping. We begin by
creating an early stopping monitor and restoring the network’s weights to random. Once this is complete,
we can fit the neural network with the early stopping monitor enabled.
Code
from tensorflow.keras.callbacks import EarlyStopping

monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5,
                        verbose=1, mode='auto',
                        restore_best_weights=True)
model.set_weights(init_weights)
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    callbacks=[monitor],
                    validation_data=(x_val, y_val),
                    verbose=1)
Output
...
30/30 [==============================] - 1s 39ms/step - loss: 0.1475 -
accuracy: 0.9508 - val_loss: 0.3220 - val_accuracy: 0.8700
Epoch 34/40
29/30 [============================>.] - ETA: 0s - loss: 0.1419 -
accuracy: 0.9528 Restoring model weights from the end of the best
epoch: 29.
30/30 [==============================] - 1s 38ms/step - loss: 0.1414 -
accuracy: 0.9531 - val_loss: 0.3231 - val_accuracy: 0.8704
Epoch 00034: early stopping
Code

epochs = range(1, len(history.history['accuracy']) + 1)
plt.plot(epochs, history.history['loss'], 'bo', label='Training loss')
plt.plot(epochs, history.history['val_loss'], 'b',
         label='Validation loss')
plt.legend()
plt.show()
Output
Finally, we evaluate the accuracy of the best neural network before early stopping occurred.
Code
from sklearn.metrics import accuracy_score
import numpy as np

pred = model.predict(x_val)
# Round the sigmoid outputs to 0 or 1 to obtain class predictions
predict_classes = (pred.flatten() > 0.5).astype(int)
correct = accuracy_score(y_val, predict_classes)
print(f"Accuracy: {correct}")
Output
Accuracy: 0.8685
9.4 Part 9.4: Transfer Learning for Facial Points and GANs
I designed this notebook to work with Google Colab. You can run it locally; however, you might need to
adjust some of the installation scripts contained in this notebook.
In this part, we will see how to use a third-party neural network to detect facial features, particularly the location of an individual's eyes. By locating the eyes, we can crop portraits consistently. Previously, we saw
that GANs could convert a random vector into a realistic-looking portrait. We can also perform the reverse
and convert an actual photograph into a numeric vector. If we convert two images into these vectors, we
can produce a video that transforms between the two images.
NVIDIA trained StyleGAN on portraits consistently cropped with the eyes always in the same location.
To successfully convert an image to a vector, we must crop the image similarly to how NVIDIA used
cropping.
The code presented here allows you to choose a starting and ending image and use StyleGAN2 to
produce a "morph" video between the two pictures. The preprocessing code will lock in on the exact
positioning of each image, so your crop does not have to be perfect. The main point of your crop is for you
to remove anything else that might be confused for a face. If multiple faces are detected, you will receive
an error.
Also, make sure you have selected a GPU runtime in Colab. Choose "Runtime," then "Change Runtime Type," and choose GPU for "Hardware Accelerator."
These settings allow you to change the high-level configuration. The number of steps determines how
long your resulting video is. The video plays at 30 frames a second, so 150 is 5 seconds. You can also
specify freeze steps to leave the video unchanged at the beginning and end. You will not likely need to
change the network.
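A sketch of what such a settings cell might look like follows; the variable names and the network URL are illustrative assumptions, not the notebook's exact values.

Code

# Hypothetical settings cell (names and URL are assumptions)
NETWORK = "https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/" \
          "pretrained/ffhq.pkl"
STEPS = 150        # 150 frames at 30 fps yields a 5-second video
FREEZE_STEPS = 30  # frames held unchanged at the start and the end

Next, upload a starting (source) and ending (target) image.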
Code

import os
from google.colab import files

uploaded = files.upload()

if len(uploaded) != 1:
    print("Upload exactly 1 file for source.")
else:
    for k, v in uploaded.items():
        _, ext = os.path.splitext(k)
        os.remove(k)
        SOURCE_NAME = f"source{ext}"
        open(SOURCE_NAME, 'wb').write(v)

uploaded = files.upload()

if len(uploaded) != 1:
    print("Upload exactly 1 file for target.")
else:
    for k, v in uploaded.items():
        _, ext = os.path.splitext(k)
        os.remove(k)
        TARGET_NAME = f"target{ext}"
        open(TARGET_NAME, 'wb').write(v)
Code

import sys

!git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git
!pip install ninja

import cv2
import numpy as np
from PIL import Image
import dlib
from matplotlib import pyplot as plt
Let’s start by looking at the facial features of the source image. The following code detects the five
facial features and displays their coordinates.
Code

# The detector and predictor come from dlib; the landmark file name
# here is an assumption (dlib's 5-point face landmark model).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")

img = cv2.imread(SOURCE_NAME)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
rects = detector(gray, 0)

if len(rects) == 0:
    raise ValueError("No faces detected")
elif len(rects) > 1:
    raise ValueError("Multiple faces detected")

shape = predictor(gray, rects[0])

w = img.shape[0] // 50

for i in range(0, 5):
    pt1 = (shape.part(i).x, shape.part(i).y)
    pt2 = (shape.part(i).x + w, shape.part(i).y + w)
    cv2.rectangle(img, pt1, pt2, (0, 255, 255), 4)
    print(pt1, pt2)
Output
We can easily plot these features onto the source image. You can see the corners of the eyes and the
base of the nose.
Code
Output
Next, we define a find_eyes function that uses these five landmarks to compute the center of each eye.

Code

def find_eyes(img):
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    rects = detector(gray, 0)

    if len(rects) == 0:
        raise ValueError("No faces detected")
    elif len(rects) > 1:
        raise ValueError("Multiple faces detected")

    shape = predictor(gray, rects[0])
    features = []

    for i in range(0, 5):
        features.append((i, (shape.part(i).x, shape.part(i).y)))

    return (int(features[3][1][0] + features[2][1][0]) // 2, \
            int(features[3][1][1] + features[2][1][1]) // 2), \
           (int(features[1][1][0] + features[0][1][0]) // 2, \
            int(features[1][1][1] + features[0][1][1]) // 2)
def crop_stylegan(img):
    left_eye, right_eye = find_eyes(img)
    # Calculate the size of the face
    d = abs(right_eye[0] - left_eye[0])
    z = 255 / d
    # Consider the aspect ratio
    ar = img.shape[0] / img.shape[1]
    w = img.shape[1] * z
    img2 = cv2.resize(img, (int(w), int(w * ar)))
    bordersize = 1024
    img3 = cv2.copyMakeBorder(
        img2,
        top=bordersize,
        bottom=bordersize,
        left=bordersize,
        right=bordersize,
        borderType=cv2.BORDER_REPLICATE)

    left_eye2, right_eye2 = find_eyes(img3)

    # Adjust to the offset used by StyleGAN2
    crop1 = left_eye2[0] - 385
    crop0 = left_eye2[1] - 490
    return img3[crop0:crop0 + 1024, crop1:crop1 + 1024]
The following code will preprocess and crop your images. If you receive an error indicating multiple
faces were found, try to crop your image better or obscure the background. If the program does not see a
face, then attempt to obtain a clearer and more high-resolution image.
Code

image_source = cv2.imread(SOURCE_NAME)
if image_source is None:
    raise ValueError("Source image not found")

image_target = cv2.imread(TARGET_NAME)
if image_target is None:
    raise ValueError("Target image not found")

cropped_source = crop_stylegan(image_source)
cropped_target = crop_stylegan(image_target)

#print(find_eyes(cropped_source))
#print(find_eyes(cropped_target))
Output
True
The two images are now 1024x1024 and cropped similarly to the ffhq dataset that NVIDIA used to train StyleGAN.
With the conversion complete, let's have a look at the two GANs.
Code
Output
Code
Output
As you can see, the two GAN-generated images look similar to their real-world counterparts. However,
they are by no means exact replicas.
Code

import torch
import dnnlib
import legacy
import PIL.Image
import numpy as np
import imageio
from tqdm.notebook import tqdm

diff = lvec2 - lvec1
step = diff / STEPS
current = lvec1.copy()
target_uint8 = np.array([1024, 1024, 3], dtype=np.uint8)

for i in range(repeat):
    video.append_data(synth_image)
    current = current + step

video.close()

from google.colab import files
files.download("movie.mp4")
9.5 Part 9.5: Transfer Learning for Keras Style Transfer
I based the code presented in this part on a style transfer example in the Keras documentation created by François Chollet.
We begin by uploading two images to Colab. If running this code locally, point these two filenames at the local copies of the images you wish to use.
Code

import os
from google.colab import files

uploaded = files.upload()

if len(uploaded) != 1:
    print("Upload exactly 1 file for source.")
else:
    for k, v in uploaded.items():
        _, ext = os.path.splitext(k)
        os.remove(k)
        base_image_path = f"source{ext}"
        open(base_image_path, 'wb').write(v)

uploaded = files.upload()

if len(uploaded) != 1:
    print("Upload exactly 1 file for target.")
else:
    for k, v in uploaded.items():
        _, ext = os.path.splitext(k)
        os.remove(k)
        style_reference_image_path = f"style{ext}"
        open(style_reference_image_path, 'wb').write(v)
The loss function balances three different goals defined by the following three weights. Changing these
weights allows you to fine-tune the image generation.
• total_variation_weight - How much emphasis to place on the visual coherence of nearby pixels.
• style_weight - How much emphasis to place on emulating the style of the reference image.
• content_weight - How much emphasis to place on remaining close in appearance to the base image.
Code
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications import vgg19

# Weights of the different loss components
total_variation_weight = 1e-6
style_weight = 1e-6
content_weight = 2.5e-8

# Dimensions of the generated picture.
width, height = keras.preprocessing.image.load_img(base_image_path).size
img_nrows = 400
img_ncols = int(width * img_nrows / height)
We now display the two images we will use, first the base image followed by the style image.
Code
Output
Source Image
Code
Output
Style Image
Code

def preprocess_image(image_path):
    # Util function to open, resize and format
    # pictures into appropriate tensors
    img = keras.preprocessing.image.load_img(
        image_path, target_size=(img_nrows, img_ncols)
    )
    img = keras.preprocessing.image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = vgg19.preprocess_input(img)
    return tf.convert_to_tensor(img)

def deprocess_image(x):
    # Util function to convert a tensor into a valid image
    x = x.reshape((img_nrows, img_ncols, 3))
    # Remove zero-center by mean pixel
    x[:, :, 0] += 103.939
    x[:, :, 1] += 116.779
    x[:, :, 2] += 123.68
    # 'BGR'->'RGB'
    x = x[:, :, ::-1]
    x = np.clip(x, 0, 255).astype("uint8")
    return x
We calculate the Gram matrix by multiplying a matrix by its transpose. To calculate two parts of the loss function, we will take the Gram matrix of the outputs from several convolution layers in the VGG network. To determine both style and similarity to the original image, we compare the convolution layer outputs of VGG rather than directly comparing the image pixels. In the third part of the loss function, we directly compare pixels near each other.
Because we are taking convolution output from several different levels of the VGG network, the Gram
matrix provides a means of combining these layers. The Gram matrix of the VGG convolution layers
represents the style of the image. We will calculate this style for the original image, the style-reference
image, and the final output image as the algorithm generates it.
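The style_loss function used later depends on a gram_matrix helper that the excerpt above does not show; the following sketch reproduces both from the Keras style-transfer example this part is based on.

Code

def gram_matrix(x):
    # Flatten the spatial dimensions and correlate the channels
    x = tf.transpose(x, (2, 0, 1))
    features = tf.reshape(x, (tf.shape(x)[0], -1))
    gram = tf.matmul(features, tf.transpose(features))
    return gram

# The "style loss" keeps the generated image close to the local
# textures of the style reference image
def style_loss(style, combination):
    S = gram_matrix(style)
    C = gram_matrix(combination)
    channels = 3
    size = img_nrows * img_ncols
    return tf.reduce_sum(tf.square(S - C)) / \
        (4.0 * (channels ** 2) * (size ** 2))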
Code
# An auxiliary loss function
# designed to maintain the "content" of the
# base image in the generated image
def content_loss(base, combination):
    return tf.reduce_sum(tf.square(combination - base))

# The 3rd loss function, total variation loss,
# designed to keep the generated image locally coherent
def total_variation_loss(x):
    a = tf.square(
        x[:, : img_nrows - 1, : img_ncols - 1, :] \
        - x[:, 1:, : img_ncols - 1, :]
    )
    b = tf.square(
        x[:, : img_nrows - 1, : img_ncols - 1, :] \
        - x[:, : img_nrows - 1, 1:, :]
    )
    return tf.reduce_sum(tf.pow(a + b, 1.25))
The style_loss function compares how closely the current generated image (combination) matches the style of the reference style image. The Gram matrices of the style image and the current generated image are subtracted and normalized to calculate this difference in style. Precisely, it consists of a sum of L2 distances between the Gram matrices of the representations of the style reference image and the generated image, extracted from different layers of VGG. The general idea is to capture color/texture information at different spatial scales (fairly large scales, as defined by the depth of the layer considered).
The content_loss function compares how closely the current generated image matches the original image. Here we calculate the L2 distance between the base image's VGG features and the generated image's features, keeping the generated image close enough to the original one.
Finally, the total_variation_loss function imposes local spatial continuity between the pixels of the
generated image, giving it visual coherence.
Code

# Build a VGG19 model loaded with pre-trained ImageNet weights
model = vgg19.VGG19(weights="imagenet", include_top=False)

# Get the symbolic outputs of each layer, keyed by layer name
outputs_dict = dict([(layer.name, layer.output) for layer in model.layers])

# Set up a model that returns the activation values for every layer in
# VGG19 (as a dict).
feature_extractor = keras.Model(inputs=model.inputs, outputs=outputs_dict)
We can now generate the complete loss function. The following images are input to the compute_loss
function:
• combination_image - The current iteration of the generated image.
• base_image - The starting image.
• style_reference_image - The image that holds the style to reproduce.
The layers specified by style_layer_names indicate which layers should be extracted as features from VGG
for each of the three images.
Code
# List of layers to use for the style loss.
style_layer_names = [
    "block1_conv1",
    "block2_conv1",
    "block3_conv1",
    "block4_conv1",
    "block5_conv1",
]
# The layer to use for the content loss.
content_layer_name = "block5_conv2"

def compute_loss(combination_image, base_image, style_reference_image):
    input_tensor = tf.concat(
        [base_image, style_reference_image, combination_image], axis=0
    )
    features = feature_extractor(input_tensor)

    # Initialize the loss
    loss = tf.zeros(shape=())

    # Add content loss
    layer_features = features[content_layer_name]
    base_image_features = layer_features[0, :, :, :]
    combination_features = layer_features[2, :, :, :]
    loss = loss + content_weight * content_loss(
        base_image_features, combination_features
    )
    # Add style loss
    for layer_name in style_layer_names:
        layer_features = features[layer_name]
        style_reference_features = layer_features[1, :, :, :]
        combination_features = layer_features[2, :, :, :]
        sl = style_loss(style_reference_features, combination_features)
        loss += (style_weight / len(style_layer_names)) * sl

    # Add total variation loss
    loss += total_variation_weight * \
        total_variation_loss(combination_image)
    return loss
Code

@tf.function
def compute_loss_and_grads(combination_image, \
        base_image, style_reference_image):
    with tf.GradientTape() as tape:
        loss = compute_loss(combination_image, \
            base_image, style_reference_image)
    grads = tape.gradient(loss, combination_image)
    return loss, grads

optimizer = keras.optimizers.SGD(
    keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=100.0, decay_steps=100, decay_rate=0.96
    )
)
Code

base_image = preprocess_image(base_image_path)
style_reference_image = preprocess_image(style_reference_image_path)
combination_image = tf.Variable(preprocess_image(base_image_path))

result_prefix = "generated"  # filename prefix for saved images (value assumed)

iterations = 4000
for i in range(1, iterations + 1):
    loss, grads = compute_loss_and_grads(
        combination_image, base_image, style_reference_image
    )
    optimizer.apply_gradients([(grads, combination_image)])
    if i % 100 == 0:
        print("Iteration %d: loss=%.2f" % (i, loss))
        img = deprocess_image(combination_image.numpy())
        fname = result_prefix + "_at_iteration_%d.png" % i
        keras.preprocessing.image.save_img(fname, img)
Output
Iteration 100: loss=4890.20
Iteration 200: loss=3527.19
Iteration 300: loss=3022.59
Iteration 400: loss=2751.59
Iteration 500: loss=2578.63
Iteration 600: loss=2457.19
Iteration 700: loss=2366.39
Iteration 800: loss=2295.66
Iteration 900: loss=2238.67
Iteration 1000: loss=2191.59
Iteration 1100: loss=2151.88
Iteration 1200: loss=2117.95
Iteration 1300: loss=2088.56
Iteration 1400: loss=2062.86
Iteration 1500: loss=2040.14
...
Code
Output
from google.colab import files
files.download(result_prefix + "_at_iteration_4000.png")
Chapter 10

Time Series in Keras

10.1 Part 10.1: Time Series Data Encoding
The following simple example encodes a series of five values (such as stock prices), each with a corresponding label y.
Code
x = [
    [32],
    [41],
    [39],
    [20],
    [15]
]

y = [
    1,
    -1,
    0,
    -1,
    1
]

print(x)
print(y)
Output
The following code converts this data to a NumPy array and also displays it as a data frame:
Code

import numpy as np
import pandas as pd

x = np.array(x)
df = pd.DataFrame({'x': x[:, 0], 'y': y})
print(df)
print(x[:, 0])
Output
    x  y
0  32  1
1  41 -1
2  39  0
3  20 -1
4  15  1
[32 41 39 20 15]
You might want to put volume in with the stock price. The following code shows how to add a dimension
to handle the volume.
Code
x = [
    [32, 1383],
    [41, 2928],
    [39, 8823],
    [20, 1252],
    [15, 1532]
]

y = [
    1,
    -1,
    0,
    -1,
    1
]

print(x)
print(y)
Output
[[32, 1383], [41, 2928], [39, 8823], [20, 1252], [15, 1532]]
[1, -1, 0, -1, 1]
Again, very similar to what we did before. The following shows this as a data frame.
Code

x = np.array(x)
df = pd.DataFrame({'price': x[:, 0], 'volume': x[:, 1], 'y': y})
print(df)
print(x[:, 0])
Output
   price  volume  y
0     32    1383  1
1     41    2928 -1
2     39    8823  0
3     20    1252 -1
4     15    1532  1
[32 41 39 20 15]
Now we get to sequence format. We want to predict something over a sequence, so the data format
needs to add a dimension. You must specify a maximum sequence length. The individual sequences can
be of any size.
Code
x = [
    [[32, 1383], [41, 2928], [39, 8823], [20, 1252], [15, 1532]],
    [[35, 8272], [32, 1383], [41, 2928], [39, 8823], [20, 1252]],
    [[37, 2738], [35, 8272], [32, 1383], [41, 2928], [39, 8823]],
    [[34, 2845], [37, 2738], [35, 8272], [32, 1383], [41, 2928]],
    [[32, 2345], [34, 2845], [37, 2738], [35, 8272], [32, 1383]],
]

y = [
    1,
    -1,
    0,
    -1,
    1
]

print(x)
print(y)
Output
[[[32, 1383], [41, 2928], [39, 8823], [20, 1252], [15, 1532]], [[35,
8272], [32, 1383], [41, 2928], [39, 8823], [20, 1252]], [[37, 2738],
[35, 8272], [32, 1383], [41, 2928], [39, 8823]], [[34, 2845], [37,
2738], [35, 8272], [32, 1383], [41, 2928]], [[32, 2345], [34, 2845],
[37, 2738], [35, 8272], [32, 1383]]]
[1, -1, 0, -1, 1]
Even if there is only one feature (price), you must use 3 dimensions.
Code
x = [
    [[32], [41], [39], [20], [15]],
    [[35], [32], [41], [39], [20]],
    [[37], [35], [32], [41], [39]],
    [[34], [37], [35], [32], [41]],
    [[32], [34], [37], [35], [32]],
]

y = [
    1,
    -1,
    0,
    -1,
    1
]

print(x)
print(y)
Output
[[[32], [41], [39], [20], [15]], [[35], [32], [41], [39], [20]],
[[37], [35], [32], [41], [39]], [[34], [37], [35], [32], [41]], [[32],
[34], [37], [35], [32]]]
[1, -1, 0, -1, 1]

10.2 Part 10.2: Programming LSTM with Keras and TensorFlow
An LSTM unit makes use of two kinds of transfer function. The first is the sigmoid, which compresses its input to the range 0 to 1:

$$S(t) = \frac{1}{1 + e^{-t}}$$
The second type of transfer function is the hyperbolic tangent (tanh) function, which allows you to
scale the output of the LSTM. This functionality is similar to how we have used other transfer functions
in this course.
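For reference, the hyperbolic tangent can be written in a form similar to the sigmoid:

$$\tanh(t) = \frac{e^{t} - e^{-t}}{e^{t} + e^{-t}}$$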
We provide the graphs for these functions here:
Code
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import math

def sigmoid(x):
    a = []
    for item in x:
        a.append(1/(1+math.exp(-item)))
    return a

def f2(x):
    a = []
    for item in x:
        a.append(math.tanh(item))
    return a

x = np.arange(-10., 10., 0.2)
y1 = sigmoid(x)
y2 = f2(x)

plt.plot(x, y1)
plt.title("Sigmoid")
plt.show()

plt.plot(x, y2)
plt.title("Hyperbolic Tangent (tanh)")
plt.show()
Output

[Figure: Sigmoid]

[Figure: Hyperbolic Tangent (tanh)]
Both of these two functions compress their output to a specific range. For the sigmoid function, this
range is 0 to 1. For the hyperbolic tangent function, this range is -1 to 1.
LSTM maintains an internal state and produces an output. The following diagram shows an LSTM
unit over three timeslices: the current time slice (t), as well as the previous (t-1) and next (t+1) slice, as
demonstrated by Figure 10.1.
The values ŷ are the output from the unit; the values (x) are the input to the unit, and the values c
are the context values. The output and context values always feed their output to the next time slice. The
context values allow the network to maintain the state between calls. Figure 10.2 shows the internals of a
LSTM layer.
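To make the output and context values concrete, the following short sketch (not from the original text) asks a Keras LSTM layer to return its final hidden output and cell state explicitly:

Code

import tensorflow as tf
from tensorflow.keras.layers import LSTM

# return_state=True exposes the hidden output (h) and the
# context/cell state (c) that carry over between time slices.
lstm = LSTM(4, return_state=True)
x = tf.random.uniform((1, 5, 1))  # batch of 1, 5 time slices, 1 feature
output, state_h, state_c = lstm(x)
print(output.shape, state_h.shape, state_c.shape)  # (1, 4) each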
Code

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
import numpy as np

max_features = 4  # 0, 1, 2, 3 (total of 4)
x = [
    [[0], [1], [1], [0], [0], [0]],
    [[0], [0], [0], [2], [2], [0]],
    [[0], [0], [0], [0], [3], [3]],
    [[0], [2], [2], [0], [0], [0]],
    [[0], [0], [3], [3], [0], [0]],
    [[0], [0], [0], [0], [1], [1]]
]

x = np.array(x, dtype=np.float32)
y = np.array([1, 2, 3, 2, 3, 1], dtype=np.int32)

# Convert y2 to dummy variables
y2 = np.zeros((y.shape[0], max_features), dtype=np.float32)
y2[np.arange(y.shape[0]), y] = 1.0
print(y2)

print('Build model...')
model = Sequential()
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2,
               input_shape=(None, 1)))
model.add(Dense(max_features, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x, y2, epochs=200)
pred = model.predict(x)
predict_classes = np.argmax(pred, axis=1)
print("Predicted classes: {}", predict_classes)
print("Expected classes: {}", y)
Output
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 1. 0. 0.]]
Build model...
Train...
...
1/1 [==============================] - 0s 66ms/step - loss: 0.2622 -
accuracy: 0.6667
Epoch 200/200
1/1 [==============================] - 0s 39ms/step - loss: 0.2329 -
accuracy: 0.6667
Predicted classes: {} [1 2 3 2 3 1]
Expected classes: {} [1 2 3 2 3 1]
Code
def runit(model, inp):
    inp = np.array(inp, dtype=np.float32)
    pred = model.predict(inp)
    return np.argmax(pred[0])

print(runit(model, [[[0], [0], [0], [0], [0], [1]]]))
Output
1
We will now apply an LSTM to a real-world time series: predicting sunspots. We begin by loading the data.

Code
import pandas as pd
import os

names = ['year', 'month', 'day', 'dec_year', 'sn_value',
         'sn_error', 'obs_num', 'unused1']
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/SN_d_tot_V2.0.csv",
    sep=';', header=None, names=names,
    na_values=['-1'], index_col=False)

print("Starting file:")
print(df[0:10])

print("Ending file:")
print(df[-10:])
Output

Starting file:
   year  month  day  dec_year  sn_value  sn_error  obs_num  unused1
0  1818      1    1  1818.001        -1       NaN        0        1
1  1818      1    2  1818.004        -1       NaN        0        1
2  1818      1    3  1818.007        -1       NaN        0        1
3  1818      1    4  1818.010        -1       NaN        0        1
4  1818      1    5  1818.012        -1       NaN        0        1
5  1818      1    6  1818.015        -1       NaN        0        1
6  1818      1    7  1818.018        -1       NaN        0        1
7  1818      1    8  1818.021        65      10.2        1        1
8  1818      1    9  1818.023        -1       NaN        0        1
9  1818      1   10  1818.026        -1       NaN        0        1
Ending file:
       year  month  day  dec_year  sn_value  sn_error  obs_num  unused1
...
72863  2017      6   29  2017.492        12       0.5       25        0
72864  2017      6   30  2017.495        11       0.5       30        0
As you can see, there is quite a bit of missing data near the end of the file. We want to find the starting
index where the missing data no longer occurs. This technique is somewhat sloppy; it would be better to
find a use for the data between missing values. However, the point of this example is to show how to use
LSTM with a somewhat simple time-series.
Code

# Find the last row still marked as missing and start just beyond it
start_id = max(df[df['obs_num'] == 0].index.tolist()) + 1
print(start_id)

Output

11314

Code

df = df[start_id:]  # Trim the rows with missing values
df['sn_value'] = df['sn_value'].astype(float)
df_train = df[df['dec_year'] < 2000]
df_test = df[df['dec_year'] >= 2000]

spots_train = df_train['sn_value'].tolist()
spots_test = df_test['sn_value'].tolist()

print(f"Training set has {len(spots_train)} observations.")
print(f"Test set has {len(spots_test)} observations.")

Output

Training set has 55160 observations.
Test set has 6391 observations.
To create an algorithm that will predict future values, we need to consider how to encode this data for presentation to the algorithm. The data must be submitted as sequences, using a sliding window algorithm to encode the data. We must define how large the window will be. Consider an n-sized window: each sequence's x values will be a sequence of n data points, and the y value will be the next value, after the sequence, that we are trying to predict. You can use the following function to take a series of values, such as sunspots, and generate sequences (x) and predicted values (y).
Code

import numpy as np

def to_sequences(seq_size, obs):
    x = []
    y = []

    for i in range(len(obs) - seq_size):
        window = obs[i:(i + seq_size)]
        after_window = obs[i + seq_size]
        window = [[value] for value in window]
        x.append(window)
        y.append(after_window)

    return np.array(x), np.array(y)

SEQUENCE_SIZE = 10
x_train, y_train = to_sequences(SEQUENCE_SIZE, spots_train)
x_test, y_test = to_sequences(SEQUENCE_SIZE, spots_test)

print(f"Shape of training set: {x_train.shape}")
print(f"Shape of test set: {x_test.shape}")
Output

Shape of training set: (55150, 10, 1)
Shape of test set: (6381, 10, 1)
We can see the internal structure of the training data. The first dimension is the number of training
elements, the second indicates a sequence size of 10, and finally, we have one data point per timeslice in
the window.
Code

x_train.shape

Output

(55150, 10, 1)
Code

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np

print('Build model...')
model = Sequential()
model.add(LSTM(64, input_shape=(None, 1)))
model.add(Dense(32))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', patience=5, verbose=1,
                        restore_best_weights=True)
print('Train...')
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          callbacks=[monitor], verbose=2, epochs=1000)
Output

Build model...
Train...
...
1724/1724 - 10s - loss: 497.0393 - val_loss: 215.1721 - 10s/epoch -
6ms/step
Epoch 11/1000
Code

from sklearn import metrics

pred = model.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print("Score (RMSE): {}".format(score))
10.3 Part 10.3: Text Generation with LSTM
Text generation can occur at either the word or the character level; the character level is the more interesting of the two. The LSTM is learning to construct its own words without even being shown what a word is. We will begin with character-level text generation. In the next module, we will see how we can use nearly the same technique to operate at the word level. We will implement word-level automatic captioning in the next module.
We import the needed Python packages and define the sequence length, named maxlen. Time-series
neural networks always accept their input as a fixed-length array. Because you might not use all of the
sequence elements, filling extra pieces with zeros is common. You will divide the text into sequences of
this length, and the neural network will train to predict what comes after this sequence.
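As a quick illustration of this padding (an aside, not part of the original example), Keras provides pad_sequences to bring variable-length sequences to one fixed length:

Code

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[32, 41, 39], [20, 15], [35, 32, 41, 39, 20]]
# Shorter sequences are filled with leading zeros by default
print(pad_sequences(seqs, maxlen=5))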
We now import the needed packages and set maxlen.

Code

from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop
import numpy as np
import random
import sys
import re

maxlen = 40  # sequence length
We will train the neural network on the classic children’s book Treasure Island. We begin by loading
this text into a Python string and displaying the first 1,000 characters.
Code

import requests

# Load Treasure Island from the book's data site (URL assumed)
url = "https://data.heatonresearch.com/data/t81-558/text/treasure_island.txt"
raw_text = requests.get(url).text
print(raw_text[:1000])

Output

Illustrator: Milo Winter

Release Date: January 12, 2009 [EBook #27780]

Language: English

*** START OF THIS PROJECT GUTENBERG EBOOK TREASURE ISLAND ***

Produced by Juliet Sutherland, Stephen Blundell and the
Online Distributed Proofreading Team at http://www.pgdp.net

THE ILLUSTRATED CHILDREN'S LIBRARY
...
Milo Winter

[Illustration]

GRAMERCY BOOKS
NEW YORK

Foreword copyright 1986 by Random House V
We will extract all unique characters from the text and sort them. This technique allows us to assign a unique ID to each character. Because we sorted the characters, these IDs should remain the same; they will change only if we add new characters to the original text. We build two dictionaries. The first, char_indices, is used to convert a character into its ID. The second, indices_char, converts an ID back into its character.
Code

processed_text = raw_text.lower()
processed_text = re.sub(r'[^\x00-\x7f]', r'', processed_text)

print('corpus length:', len(processed_text))

chars = sorted(list(set(processed_text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
Output

corpus length: 397400
total chars: 60
We are now ready to build the actual sequences. Like previous neural networks, there will be an x and
y. However, for the LSTM, x and y will be sequences. The x input will specify the sequences where y is
the expected output. The following code generates all possible sequences.
Code

step = 3
sentences = []
next_chars = []
for i in range(0, len(processed_text) - maxlen, step):
    sentences.append(processed_text[i: i + maxlen])
    next_chars.append(processed_text[i + maxlen])
print('nb sequences:', len(sentences))

Output

nb sequences: 132454
Code
sentences
Output
...
Code

print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Output

Vectorization...
Next, we create the neural network. This neural network’s primary feature is the LSTM layer, which
allows the sequences to be processed.
Code
# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
Output
Build model...
Code
model.summary()
Output
The LSTM will produce new text character by character. We will need to sample the correct letter from the LSTM predictions each time. The sample function accepts the following two parameters:

• preds - The raw probability predictions for each character, produced by the neural network.
• temperature - How much randomness to introduce; lower values stay closer to the most likely character.

The sample function below essentially performs a softmax on the neural network predictions. This process causes each output neuron to become a probability of its particular letter.
Code

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
Keras calls the following function at the end of each training epoch. The code generates sample text that visually demonstrates how the neural network improves at text generation. As the neural network trains, the generations should look more realistic.
Code

def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print('----- Generating text after Epoch: %d' % epoch)
    start_index = random.randint(0, len(processed_text) - maxlen - 1)
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('----- temperature:', temperature)
        generated = ''
        sentence = processed_text[start_index: start_index + maxlen]
        print('----- Generating with seed: "' + sentence + '"')

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.
            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = indices_char[next_index]
            generated += next_char
            sentence = sentence[1:] + next_char
            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
We are now ready to train. Depending on how fast your computer is, it can take up to an hour to train
this network. If you have a GPU available, please make sure to use it.
Code
# Ignore useless W0819 warnings generated by TensorFlow 2.0.
# Hopefully can remove this ignore in the future.
# See https://github.com/tensorflow/tensorflow/issues/31308
import logging, os
logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

# Fit the model
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])
Output
...
1035/1035 [==============================] - 74s 71ms/step - loss:
1.1361
Epoch 60/60
1029/1035 [============================>.] - ETA: 0s - loss:
1.1339**********************************************************
----- Generating text after Epoch: 59
----- temperature: 0.2
----- Generating with seed: "
tail of it on his unruly followers. th"
"it's a don't be men belief in my till be captain silver had been the
blows and the stockade, and a man of the boats was place and the
captain. he was a pairs and seemed to the barrows and the part, and i
saw he was a state before the pirates of his s
----- temperature: 0.5
...
10.4 Part 10.4: Introduction to Transformers
Sequence-to-sequence models produce an output sequence from an input sequence. Transformers focus primarily on this sequence-to-sequence configuration.
We use a transformer that translates between English and Spanish for this example. We present the
English sentence "the cat likes milk" and receive a Spanish translation of "al gato le gusta la leche."
We begin by placing the English source sentence between the beginning and ending tokens. This input can be of any length, and we present it to the neural network as a ragged tensor. Because the tensor is ragged, no padding is necessary. Such input is acceptable for the attention layer that will receive the source sentence. The encoder transforms this ragged input into a hidden state containing a series of key-value pairs representing the knowledge in the source sentence. The encoder knows how to read English and convert it to a hidden state; the decoder knows how to output Spanish from this hidden state.
We initially present the decoder with the hidden state and the starting token. The decoder will predict
the probabilities of all words in its vocabulary. The word with the highest probability is the first word of
the sentence.
The highest-probability word is concatenated to the translated sentence, which initially contains only the beginning token. This process continues, growing the translated sentence in each iteration, until the decoder predicts the ending token.
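The following toy sketch illustrates this greedy decoding loop; the vocabulary and the stub decoder are hypothetical stand-ins for a real trained transformer.

Code

import numpy as np

vocab = ["<start>", "al", "gato", "le", "gusta", "la", "leche", "<end>"]

def decoder(hidden_state, sentence):
    # Stub: a real decoder would attend to the hidden state and the
    # sentence so far; here we simply emit the next word in order.
    probs = np.zeros(len(vocab))
    probs[len(sentence)] = 1.0
    return probs

sentence = ["<start>"]
while sentence[-1] != "<end>":
    probs = decoder(None, sentence)
    sentence.append(vocab[int(np.argmax(probs))])  # greedy choice
print(" ".join(sentence))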
A typical transformer configuration uses hyperparameters such as:
• num_layers = 4
• d_model = 128
• dff = 512
• num_heads = 8
• dropout_rate = 0.1
Multiple encoder and decoder layers can be present. The num_layers hyperparameter specifies how
many encoder and decoder layers there are. The expected tensor shape for the input to the encoder layer
is the same as the output produced; as a result, you can easily stack these layers.
We will see embedding layers in the next chapter. However, you can think of an embedding layer as a
dictionary for now. Each entry in the embedding corresponds to each word in a fixed-size vocabulary. Sim-
ilar words should have similar vectors. The d_model hyperparameter specifies the size of the embedding
vector. Though you will sometimes preload embeddings from a project such as Word2vec or GloVe, the
optimizer can train these embeddings with the rest of the transformer. Training your own embeddings allows the d_model hyperparameter to be set to any desired value. If you transfer the embeddings, you must set the d_model hyperparameter to the same value as the transferred embeddings.
The dff hyperparameter specifies the size of the dense feedforward layers. The num_heads hyperpa-
rameter sets the number of attention layers heads. Finally, the dropout_rate specifies a dropout percentage
to combat overfitting. We discussed dropout previously in this book.
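As a minimal sketch (assuming the hyperparameter values above), the Keras MultiHeadAttention layer shows how num_heads and d_model come together, and why encoder layers stack so easily: the output shape matches the input shape.

Code

import tensorflow as tf
from tensorflow.keras import layers

num_heads = 8
d_model = 128
dropout_rate = 0.1

attention = layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=d_model, dropout=dropout_rate)

x = tf.random.uniform((1, 10, d_model))  # 10 embedded tokens
print(attention(x, x).shape)  # (1, 10, 128): same shape in and out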
Key building blocks of the transformer include:
• Embeddings
• Positional Encoding
• Attention and Self-Attention
• Residual Connection
10.5 Part 10.5: Programming Transformers with Keras
We will now apply a transformer encoder to the same sunspot time series that we used for the LSTM.
Code

import pandas as pd
import os

names = ['year', 'month', 'day', 'dec_year', 'sn_value',
         'sn_error', 'obs_num', 'extra']
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/SN_d_tot_V2.0.csv",
    sep=';', header=None, names=names,
    na_values=['-1'], index_col=False)

print("Starting file:")
print(df[0:10])

print("Ending file:")
print(df[-10:])
Output

Starting file:
   year  month  day  dec_year  sn_value  sn_error  obs_num  extra
0  1818      1    1  1818.001        -1       NaN        0      1
1  1818      1    2  1818.004        -1       NaN        0      1
2  1818      1    3  1818.007        -1       NaN        0      1
3  1818      1    4  1818.010        -1       NaN        0      1
4  1818      1    5  1818.012        -1       NaN        0      1
5  1818      1    6  1818.015        -1       NaN        0      1
6  1818      1    7  1818.018        -1       NaN        0      1
7  1818      1    8  1818.021        65      10.2        1      1
8  1818      1    9  1818.023        -1       NaN        0      1
9  1818      1   10  1818.026        -1       NaN        0      1
Ending file:
       year  month  day  dec_year  sn_value  sn_error  obs_num  extra
72855  2017      6   21  2017.470        35       1.0       41      0
...
As you can see, there is quite a bit of missing data near the end of the file. We want to find the starting index where the missing data no longer occurs. This technique is somewhat sloppy; it would be better to find a use for the data between missing values. However, the point of this example is to show how to use a transformer encoder with a somewhat simple time series.
Code

start_id = max(df[df['obs_num'] == 0].index.tolist()) + 1
print(start_id)

Output

11314

Code

df = df[start_id:]
df['sn_value'] = df['sn_value'].astype(float)
df_train = df[df['dec_year'] < 2000]
df_test = df[df['dec_year'] >= 2000]

spots_train = df_train['sn_value'].tolist()
spots_test = df_test['sn_value'].tolist()

print(f"Training set has {len(spots_train)} observations.")
print(f"Test set has {len(spots_test)} observations.")

Output

Training set has 55160 observations.
Test set has 6391 observations.
The to_sequences function converts linear time-series data into x and y, where x holds all possible sequences of seq_size values. After each x sequence, this function places the next value into the y variable. These x and y data can train a time-series neural network.
Code
import numpy as np

def to_sequences(seq_size, obs):
    x = []
    y = []
    # Slide a window of seq_size values across the series; the value
    # immediately after each window becomes the prediction target.
    for i in range(len(obs) - seq_size):
        window = obs[i:(i + seq_size)]
        after_window = obs[i + seq_size]
        window = [[v] for v in window]
        x.append(window)
        y.append(after_window)
    return np.array(x), np.array(y)

SEQUENCE_SIZE = 10
x_train, y_train = to_sequences(SEQUENCE_SIZE, spots_train)
x_test, y_test = to_sequences(SEQUENCE_SIZE, spots_test)
Output
Shape of training set: (55150, 10, 1)
Shape of test set: (6381, 10, 1)
We can view the results of the to_sequences encoding of the sunspot data.
Code
print(x_train.shape)
Output
(55150, 10, 1)
Next, we create the transformer_encoder; I obtained this function from a Keras example. This layer includes residual connections, layer normalization, and dropout. The resulting layer can be stacked multiple times. We implement the projection layers with the Keras Conv1D layer.
Code
from tensorflow import keras
from tensorflow.keras import layers
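The encoder function itself is not reproduced here; a sketch consistent with the published Keras time-series transformer example follows.
Code

def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    # Layer normalization followed by multi-head self-attention.
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout)(x, x)
    x = layers.Dropout(dropout)(x)
    res = x + inputs  # residual connection

    # Feed-forward projection implemented with Conv1D layers.
    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    return x + res  # second residual connection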
The following function is provided to build the model, including the attention layer.
Code
def build_model(
    input_shape,
    head_size,
    num_heads,
    ff_dim,
    num_transformer_blocks,
    mlp_units,
    dropout=0,
    mlp_dropout=0,
):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)
    outputs = layers.Dense(1)(x)
    return keras.Model(inputs, outputs)

input_shape = x_train.shape[1:]

model = build_model(
    input_shape,
    head_size=256,
    num_heads=4,
    ff_dim=4,
    num_transformer_blocks=4,
    mlp_units=[128],
    mlp_dropout=0.4,
    dropout=0.25,
)

model.compile(
    loss="mean_squared_error",
    optimizer=keras.optimizers.Adam(learning_rate=1e-4)
)
#model.summary()

callbacks = [keras.callbacks.EarlyStopping(patience=10,
    restore_best_weights=True)]

model.fit(
    x_train,
    y_train,
    validation_split=0.2,
    epochs=200,
    batch_size=64,
    callbacks=callbacks,
)
Output
...
from sklearn import metrics

pred = model.predict(x_test)
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print("Score (RMSE): {}".format(score))
Output
Score (RMSE): 14.647875946283007
!pip install transformers
!pip install transformers[sentencepiece]
Now that we have Hugging Face installed, the following sections will demonstrate how to apply Hugging
Face to a variety of everyday tasks. After this introduction, the remainder of this module will take a deeper
look at several specific NLP tasks applied to Hugging Face.
Code
from urllib.request import urlopen
Usually, you have to preprocess text into embeddings or other vector forms before presentation to a
neural network. Hugging Face provides a pipeline that simplifies this process greatly. The pipeline allows
you to pass regular Python strings to the transformers and return standard Python values.
We begin by loading a text-classification model. We do not specify the exact model type wanted, so
Hugging Face automatically chooses a network from the Hugging Face hub named:
• distilbert-base-uncased-finetuned-sst-2-english
To specify the model to use, pass the model parameter, such as:
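A sketch of such a call, naming the model explicitly:
Code

# Assumes: from transformers import pipeline
pipe = pipeline("text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english")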
The following code loads a model pipeline and a model for sentiment analysis.
Code
import pandas as pd
from transformers import pipeline
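# The pipeline construction and sample text are omitted above; this is
# a minimal sketch, and the sample string is hypothetical.
classifier = pipeline("text-classification")
text = "This movie was great, I really enjoyed it!"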
We can now display the sentiment analysis results with a Pandas dataframe.
Code
outputs = classifier(text)
pd.DataFrame(outputs)
Output
label score
0 POSITIVE 0.984666
• Location (LOC)
• Organizations (ORG)
• Person (PER)
• Miscellaneous (MISC)
The following code requests a "named entity recognizer" (ner) and processes the specified text.
Code
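# A sketch of the request; the sample sentence is assumed from the
# results described below.
text2 = "Abraham Lincoln was a president who lived in the United States."
ner = pipeline("ner")
outputs = ner(text2)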
We similarly view the results as a Pandas data frame. As you can see, the person (PER) Abraham Lincoln and the location (LOC) United States are recognized.
Code
Output
Code

reader = pipeline("question-answering")
question = "What now shall fade?"
For this example, we pose the question "What now shall fade?" to Hugging Face about Sonnet 18. We see the correct answer of "eternal summer."
Code
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])
Output
Code

outputs = translator(text, clean_up_tokenization_spaces=True,
    min_length=100)
print(outputs[0]['translation_text'])
Output
11.1.5 Summarization
Summarization is an NLP task that condenses a longer text into just a few sentences.
Code
text2 = """
An a p p l e i s an e d i b l e f r u i t produced by an a p p l e t r e e ( Malus d o m e s t i c a ) .
Apple t r e e s a r e c u l t i v a t e d w o r l d w i d e and a r e t h e most w i d e l y grown s p e c i e s
i n t h e genus Malus . The t r e e o r i g i n a t e d i n C e n t r a l Asia , where i t s w i l d
a n c e s t o r , Malus s i e v e r s i i , i s s t i l l found t o d a y . A p p l e s have been grown
f o r t h o u s a n d s o f y e a r s i n Asia and Europe and were b r o u g h t t o North America
by European c o l o n i s t s . A p p l e s have r e l i g i o u s and m y t h o l o g i c a l s i g n i f i c a n c e
i n many c u l t u r e s , i n c l u d i n g Norse , Greek , and European C h r i s t i a n t r a d i t i o n .
"""
Output
An apple is an edible fruit produced by an apple tree (Malus
domestica) Apple trees are cultivated worldwide and are the most
widely grown species in the genus Malus. Apples have religious and
mythological
Code

from urllib.request import urlopen

generator = pipeline("text-generation")
Here is an example that generates additional text after Sonnet 18.
Code
outputs = generator(text, max_length=400)
print(outputs[0]['generated_text'])
Output
Sonnet 18 original text
William Shakespeare

Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance or nature's changing course untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
...
[Italian: The Tale of the
Cat] .....................................................................
'Sir! sir la verde'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~
[Irish: The Tale of
• This is a test.
• Ok, but what about this?
• Is U.S.A. the same as USA.?
Hugging Face includes tokenizers that can break these sentences into words and subwords. Because English and some other languages are made up of common word parts, we tokenize subwords. For example, a gerund such as "sleeping" will be tokenized into "sleep" and "##ing".
We begin by installing Hugging Face if needed.
Code
!pip install transformers
!pip install transformers[sentencepiece]
First, we create a Hugging Face tokenizer. There are several different tokenizers available from the
Hugging Face hub. For this example, we will make use of the following tokenizer:
• distilbert-base-uncased
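The loading and encoding code is not reproduced above; a minimal sketch, with a sample sentence matching the token output shown below:
Code

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded = tokenizer("Tokenizing text is easy.")
print(encoded)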
Output
• input_ids - The individual subword indexes; each index uniquely identifies a subword.
• attention_mask - Indicates which values in input_ids are meaningful and not padding.
This sentence had no padding, so all elements have an attention mask of "1". Later, we will request the
output to be of a fixed length, introducing padding, which always has an attention mask of "0". Though
each tokenizer can be implemented differently, the attention mask of a tokenizer is generally either "0" or
"1".
Due to subwords and special tokens, the number of tokens may not match the number of words in the
source string. We can see the meanings of the individual tokens by converting these IDs back to strings.
Code
tokenizer.convert_ids_to_tokens(encoded.input_ids)
Output
['[CLS]', 'token', '##izing', 'text', 'is', 'easy', '.', '[SEP]']
As you can see, there are two special tokens placed at the beginning and end of each sequence. We will soon see how we can include or exclude these special tokens. These special tokens can vary per tokenizer; however, [CLS] begins a sequence for this tokenizer, and [SEP] ends a sequence. You will also see that the word "tokenizing" is broken into "token" and "##izing".
For this tokenizer, the special tokens occur between 100 and 103. Most Hugging Face tokenizers use
this approximate range for special tokens. The value zero (0) typically represents padding. We can display
all special tokens with this command.
Code
tokenizer.convert_ids_to_tokens([0, 100, 101, 102, 103])
Output
['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
Code

text = [
    "This movie was great!",
    "I hated this move, waste of time!",
    "Match?"
]
Output
**Input IDs**
[101, 2023, 3185, 2001, 2307, 999, 102, 0, 0, 0, 0]
[101, 1045, 6283, 2023, 2693, 1010, 5949, 1997, 2051, 999, 102]
[101, 8680, 1029, 102, 0, 0, 0, 0, 0, 0, 0]
**Attention Mask**
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
Notice the input_ids for the three movie-review text sequences. Each of these sequences begins with 101, and we pad with zeros. Just before the padding, each group of IDs ends with 102. The attention masks also have zeros for each of the padding entries.
We used two parameters to the tokenizer to control the tokenization process. Some other useful parameters, illustrated in the sketch after this list, include:
• add_special_tokens (defaults to True) Whether or not to encode the sequences with the special tokens relative to their model.
• padding (defaults to False) Activates and controls padding.
• max_length (optional) Controls the maximum length to use by one of the truncation/padding parameters.
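A short hypothetical illustration of these parameters in use:
Code

batch = tokenizer(text, padding='max_length', truncation=True,
    max_length=12, add_special_tokens=True)
print(batch.input_ids)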
When you use Hugging Face data sets, the data is in a format specific to Hugging Face. In this part, we will explore this format and see how to convert it to Pandas or TensorFlow data.
We begin by installing Hugging Face if needed. It is also essential to install Hugging Face datasets.
Code
!pip install transformers
!pip install transformers[sentencepiece]
!pip install datasets
We begin by querying Hugging Face to obtain the total count and names of the data sets. This code
obtains the total count and the names of the first five datasets.
Code
from datasets import list_datasets

all_datasets = list_datasets()
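# The display code is omitted; a sketch matching the description above.
print(f"There are {len(all_datasets)} datasets available.")
print(f"The first 5 are: {all_datasets[:5]}")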
Output
We begin by loading the emotion data set from the Hugging Face hub. Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.[30] The following code loads the emotion data set from the Hugging Face hub.
Code
from datasets import load_dataset
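# The load call is omitted above; a minimal sketch, assuming the
# dataset's hub name is "emotion".
emotions = load_dataset("emotion")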
Output
A quick scan of the downloaded data set reveals its structure. In this case, Hugging Face already separated the data into training, validation, and test data sets. The training set consists of 16,000 observations, while the test and validation sets contain 2,000 observations each. The dataset is a Python dictionary that includes a Dataset object for each of these three divisions. The datasets contain only two columns: the text and the emotion label for each text sample.
Code
emotions
Output
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})
You can see a single observation from the training data set here. This observation includes both the text
sample and the assigned emotion label. The label is a numeric index representing the assigned emotion.
Code
emotions['train'][2]
Output
emotions['train'].features
Output
{'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love',
'anger', 'fear', 'surprise'], id=None),
'text': Value(dtype='string', id=None)}
Hugging Face can provide these data sets in a variety of formats. The following code receives the emotion data set as a Pandas data frame.
Code
import pandas as pd
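# The conversion is omitted above; a sketch using set_format to
# switch the dataset to Pandas and view the first rows.
emotions.set_format(type='pandas')
df = emotions['train'][:]
df.head()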
Output
text label
0 i didnt feel humiliated 0
1 i can go from feeling so hopeless to so damned... 0
2 im grabbing a minute to post i feel greedy wrong 3
3 i am ever feeling nostalgic about the fireplac... 2
4 i am feeling grouchy 3
We can use the Pandas "apply" function to add the textual label for each observation.
Code
def label_it(row):
    return emotions["train"].features["label"].int2str(row)
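# Hypothetical usage of label_it with the Pandas apply function.
df['label_name'] = df['label'].apply(label_it)
df.head()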
Output
With the data in Pandas format and textually labeled, we can display a bar chart of the frequency of
each of the emotions.
Code
import matplotlib.pyplot as plt
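# A sketch of the bar chart; the label_name column comes from the
# earlier apply step.
df['label_name'].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()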
Output
Finally, we utilize Hugging Face tokenizers and data sets together. The following code tokenizes the
entire emotion data set. You can see below that the code has transformed the training set into subword
tokens that are now ready to be used in conjunction with a transformer for either inference or training.
Code
def tokenize(rows):
    return tokenizer(rows['text'], padding=True, truncation=True)

emotions.set_format(type=None)
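# The map call (shown explicitly in the next part) completes the
# tokenization of the entire dataset.
tokenized_datasets = emotions.map(tokenize, batched=True)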
Output
!pip install transformers
!pip install transformers[sentencepiece]
!pip install datasets
We begin by loading the emotion data set from the Hugging Face hub. Emotion is a dataset of English
Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. The following code
loads the emotion data set from the Hugging Face hub.
Code
from datasets import load_dataset
You can see a single observation from the training data set here. This observation includes both the text
sample and the assigned emotion label. The label is a numeric index representing the assigned emotion.
Code
Output
Code
emotions['train'].features
Output
{'label': ClassLabel(num_classes=6, names=['sadness', 'joy', 'love',
'anger', 'fear', 'surprise'], id=None),
'text': Value(dtype='string', id=None)}
Next, we utilize Hugging Face tokenizers and data sets together. The following code tokenizes the entire
emotion data set. You can see below that the code has transformed the training set into subword tokens
that are now ready to be used in conjunction with a transformer for either inference or training.
Code
def tokenize(rows):
    return tokenizer(rows['text'], padding="max_length", truncation=True)

emotions.set_format(type=None)

tokenized_datasets = emotions.map(tokenize, batched=True)
We will utilize the Hugging Face DefaultDataCollator to transform the emotion data set into TensorFlow-type data that we can use to fine-tune a neural network.
Code
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")
We can now generate the TensorFlow data sets. We specify which columns should map to the input
features and labels. We do not need to shuffle because we previously shuffled the data.
Code
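# A sketch following the datasets library's to_tf_dataset pattern;
# the column names are assumptions based on the tokenized dataset.
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)
tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    label_cols=["label"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)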
We will now load the distilbert model for classification. We will adjust the pretrained weights to predict
the emotions of text lines.
Code
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=6)
We now train the neural network. Because the network is already pretrained, we use a small learning
rate.
Code
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

model.fit(tf_train_dataset, validation_data=tf_validation_dataset,
    epochs=5)
Output
...
2000/2000 [==============================] - 346s 173ms/step - loss:
0.1092 - sparse_categorical_accuracy: 0.9486 - val_loss: 0.1654 -
val_sparse_categorical_accuracy: 0.9295
Epoch 5/5
2000/2000 [==============================] - 347s 173ms/step - loss:
0.0960 - sparse_categorical_accuracy: 0.9585 - val_loss: 0.1830 -
val_sparse_categorical_accuracy: 0.9220
Now we create a neural network with a vocabulary size of 10, which reduces integer values between 0 and 9 to vectors of 4 numbers. This neural network does nothing more than pass the embedding on to the output, but it does let us see what the embedding is doing. Each incoming feature vector will have two such features.
Code
model = Sequential()
embedding_layer = Embedding(input_dim=10, output_dim=4, input_length=2)
model.add(embedding_layer)
model.compile('adam', 'mse')
Let’s take a look at the structure of this neural network to see what is happening inside it.
Code
model.summary()
Output
For this neural network, which is just an embedding layer, the input is a vector of size 2. These two inputs are integer numbers from 0 to 9 (corresponding to the requested input_dim quantity of 10 values). Looking at the summary above, we see that the embedding layer has 40 parameters. This value comes from the embedded lookup table that contains four values (output_dim) for each of the 10 (input_dim) possible integer values for the two inputs. The output is 2 (input_length) vectors of length 4 (output_dim), resulting in a total output size of 8, which corresponds to the Output Shape given in the summary above.
Now, let us query the neural network with a single row containing two integer values, as specified when we created the neural network.
Code
input_data = np.array([
    [1, 2]
])

pred = model.predict(input_data)
print(input_data.shape)
print(pred)
Output
(1, 2)
[[[-0.04494917  0.01937468 -0.00152863  0.04808659]
  [-0.04002655  0.03441895  0.04462588 -0.01472597]]]
Here we see the two length-4 vectors that Keras looked up for each input integer. Recall that Python arrays are zero-based: Keras replaced the value of 1 with the second row of the 10 x 4 lookup matrix and the value of 2 with the third row. The following code displays the lookup matrix in its entirety. The embedding layer performs no mathematical operations other than inserting the correct row from the lookup table.
Code
embedding_layer.get_weights()
Output
The values above are random parameters that Keras generated as starting points. Generally, we will transfer an embedding or train these random values into something useful. The following section demonstrates how to supply a hand-coded embedding.
Code

embedding_lookup = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])

model = Sequential()
embedding_layer = Embedding(input_dim=3, output_dim=3, input_length=2)
model.add(embedding_layer)
model.compile('adam', 'mse')
embedding_layer.set_weights([embedding_lookup])
We query the neural network with two categorical values to see the lookup performed.
Code
input_data = np.array([
    [0, 1]
])

pred = model.predict(input_data)
print(input_data.shape)
print(pred)
Output
(1, 2)
[[[1. 0. 0.]
  [0. 1. 0.]]]
The given output shows that we provided the program with two rows from the one-hot encoding table.
This encoding is a correct one-hot encoding for the values 0 and 1, where there are up to 3 unique values
possible.
The following section demonstrates how to train this embedding lookup table.
We create a neural network that classifies restaurant reviews according to positive or negative. This
neural network can accept strings as input, such as given here. This code also includes positive or negative
labels for each review.
Code
from numpy import array

# Define 10 restaurant reviews.
reviews = [
    'Never coming back!',
    'Horrible service',
    'Rude waitress',
    'Cold food.',
    'Horrible food!',
    'Awesome',
    'Awesome service!',
    'Rocks!',
    'poor work',
    'Couldn\'t have done better']

# Define labels (1=negative, 0=positive)
labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
Notice that the second-to-last label is incorrect. Errors such as this are not out of the ordinary, as most training data contains some noise.
We define a vocabulary size of 50 words. Though we do not have 50 words, it is okay to use a value
larger than needed. If there are more than 50 words, the least frequently used words in the training set are
automatically dropped by the embedding layer during training. For input, we one-hot encode the strings.
We use the TensorFlow one-hot encoding method here rather than Scikit-Learn. Scikit-learn would expand
these strings to the 0’s and 1’s as we would typically see for dummy variables. TensorFlow translates all
words to index values and replaces each word with that index.
Code
VOCAB_SIZE = 50
encoded_reviews = [one_hot(d, VOCAB_SIZE) for d in reviews]
print(f"Encoded reviews: {encoded_reviews}")
Output
Encoded reviews: [[40, 43, 7], [27, 31], [49, 46], [2, 28], [27, 28],
[20], [20, 31], [39], [18, 39], [11, 3, 18, 11]]
The program one-hot encodes these reviews to word indexes; however, their lengths are different. We
pad these reviews to 4 words and truncate any words beyond the fourth word.
Code
MAX_LENGTH = 4
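# The padding call is omitted above; a sketch using the Keras
# pad_sequences utility with post-padding, as described below.
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_reviews = pad_sequences(encoded_reviews, maxlen=MAX_LENGTH,
    padding='post')
print(padded_reviews)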
Output
[[40 43  7  0]
 [27 31  0  0]
 [49 46  0  0]
 [ 2 28  0  0]
 [27 28  0  0]
 [20  0  0  0]
 [20 31  0  0]
 [39  0  0  0]
 [18 39  0  0]
 [11  3 18 11]]
Each review is padded by appending zeros at the end, as specified by the padding='post' setting.
Next, we create a neural network to learn to classify these reviews.
Code
model = Sequential()
embedding_layer = Embedding(VOCAB_SIZE, 8, input_length=MAX_LENGTH)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',
    metrics=['acc'])
Output
This network accepts four integer inputs that specify the indexes of a padded restaurant review. The embedding layer converts these four indexes into four vectors of length 8. These vectors come from the lookup table that contains 50 (VOCAB_SIZE) rows of vectors of length 8. This encoding is evident by the 400 (8 times 50) parameters in the embedding layer. The output size from the embedding layer is 32 (4 words expressed as 8-number embedded vectors). A single output neuron is connected to the embedding layer by 33 weights (32 from the embedding layer and a single bias neuron). Because this is a binary classification network, we use the sigmoid activation function and binary_crossentropy loss.
The program now trains the neural network. The embedding lookup and the 33 dense weights are updated to produce a better score.
Code
# fit the model
model.fit(padded_reviews, labels, epochs=100, verbose=0)
Output
We can see the learned embeddings. Think of each word's vector as a location in an 8-dimensional space, where words associated with positive reviews are close to one another. Similarly, training places words from negative reviews close to each other. In addition to the training setting these embeddings, the 33 weights between the embedding layer and the output neuron learn to transform these embeddings into an actual prediction. You can see these embeddings here.
Code
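# A sketch: display the shape of the lookup table and its weights.
print(embedding_layer.get_weights()[0].shape)
print(embedding_layer.get_weights())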
Output
(50, 8)
[array([[-0.11389559, -0.04778124,  0.10034387,  0.12887037,  0.05670259,
         -0.09982903, -0.15423775, -0.06774805],
        [-0.04839246,  0.00527745,  0.0084306 , -0.03498586,  0.010772  ,
          0.04015711,  0.03564452, -0.00849336],
        [-0.11003157, -0.05829103,  0.12370535, -0.07124459, -0.0667479 ,
         -0.14339209, -0.13791779, -0.13947721],
        [-0.15395765, -0.08560142, -0.15915371, -0.0882007 ,  0.15756004,
         -0.10337664, -0.12412377, -0.10282961],
        [ 0.04919637, -0.00870635, -0.02393281,  0.04445953,  0.0124351 ,
...
          0.04153964, -0.04445877, -0.00612149, -0.03430663],
        [-0.08493928, -0.10910758,  0.0605178 , -0.10072854, -0.11677803,
         -0.05648913, -0.13342443, -0.08516318]], dtype=float32)]
We can now evaluate this neural network’s accuracy, including the embeddings and the learned dense
layer.
Code
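# A sketch of the evaluation matching the accuracy output below.
loss, accuracy = model.evaluate(padded_reviews, labels, verbose=0)
print(f'Accuracy: {accuracy}')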
Output
Accuracy: 1.0
The accuracy is a perfect 1.0, indicating likely overfitting. For a more complex data set, it would be good to use early stopping so as not to overfit.
Code
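# A sketch: compute log-loss from the predicted probabilities.
from sklearn.metrics import log_loss

pred = model.predict(padded_reviews)
print(f'Log-loss: {log_loss(labels, pred)}')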
Output
Log-loss: 0.48446863889694214
However, the loss is not perfect. Even though the predicted probabilities indicated a correct prediction in
every case, the program did not achieve absolute confidence in each correct answer. The lack of confidence
was likely due to the small amount of noise (previously discussed) in the data set. Some words that
appeared in both positive and negative reviews contributed to this lack of absolute certainty.
Chapter 12
Reinforcement Learning
Because OpenAI Gym requires a graphics display, an embedded video is the only way to display Gym in
Google CoLab. The presentation of OpenAI Gym game animations in Google CoLab is discussed later in
this module.
You must provide a write-up with sufficient instructions to reproduce your result if you submit a score. A
video of your results is suggested but not required.
• action space: What actions can we take on the environment at each step/episode to alter the
environment.
• observation space: What is the current state of the portion of the environment that we can observe.
Usually, we can see the entire environment.
Before we begin to look at Gym, it is essential to understand some of the terminology used by this library.
• Agent - The machine learning program or model that controls the actions.
• Step - One round of issuing actions that affect the observation space.
• Episode - A collection of steps that terminates when the agent fails to meet the environment’s
objective or the episode reaches the maximum number of allowed steps.
• Render - Gym can render one frame for display after each episode.
• Reward - A positive reinforcement that can occur at the end of each episode, after the agent acts.
• Non-deterministic - For some environments, randomness is a factor in deciding what effects actions
have on reward and changes to the observation space.
It is important to note that many gym environments specify that they are not non-deterministic even though they use random numbers to process actions. Based on the gym GitHub issue tracker, the non-deterministic property means that an environment behaves randomly even when you give it a consistent seed value. The program can use the seed method of an environment to seed its random number generator.
The Gym library allows us to query some of these attributes from environments. I created the following
function to query gym environments.
Code
import gym
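# A sketch of such a query function; the attributes reported follow
# the output shown below.
def query_environment(name):
    env = gym.make(name)
    spec = gym.spec(name)
    print(f"Action Space: {env.action_space}")
    print(f"Observation Space: {env.observation_space}")
    print(f"Max Episode Steps: {spec.max_episode_steps}")
    print(f"Nondeterministic: {spec.nondeterministic}")
    print(f"Reward Range: {env.reward_range}")
    print(f"Reward Threshold: {spec.reward_threshold}")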
We will look at the MountainCar-v0 environment, which challenges an underpowered car to escape the valley between two mountains. The following code describes the Mountain Car environment.
Code
Output
Action Space: Discrete(3)
Observation Space: Box(-1.2000000476837158, 0.6000000238418579, (2,),
float32)
Max Episode Steps: 200
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: -110.0
This environment allows three distinct actions: accelerate forward, apply no acceleration, or accelerate backward. The observation space contains two continuous (floating-point) values, as evident by the Box object; the observation space is simply the position and velocity of the car. The car has 200 steps to escape for each episode. You would have to look at the code, but the mountain car receives no incremental reward. The only reward for the vehicle occurs when it escapes the valley.
Code
Output
Action Space: Discrete(2)
Observation Space: Box(-3.4028234663852886e+38,
3.4028234663852886e+38, (4,), float32)
Max Episode Steps: 500
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: 475.0
The CartPole-v1 environment challenges the agent to balance a pole while the agent moves the cart left and right. The environment has an observation space of 4 continuous numbers:
• Cart Position
• Cart Velocity
• Pole Angle
• Pole Angular Velocity
Output
Note: If you see a warning above, you can safely ignore it; it is a relatively minor bug in OpenAI Gym.
Atari games, like breakout, can use an observation space that is either equal to the size of the Atari
screen (210x160) or even use the RAM of the Atari (128 bytes) to determine the state of the game. Yes,
that’s bytes, not kilobytes!
Code
Output
Action Space: Discrete(4)
Observation Space: Box(0, 255, (210, 160, 3), uint8)
Code
Output
Action Space: Discrete(4)
Observation Space: Box(0, 255, (128,), uint8)
Max Episode Steps: 10000
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: None
Next, we define the functions used to show the video by adding it to the CoLab notebook.
Code
import gym
from gym.wrappers import Monitor
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

display = Display(visible=0, size=(1400, 900))
display.start()

"""
Utility functions to enable video recording of gym environment
and displaying it.
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay
            loop controls style="height: 400px;">
            <source src="data:video/mp4;base64,{0}" type="video/mp4" />
            </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")
Now we are ready to play the game. We use a simple random agent.
Code
observation = env.reset()

while True:
    env.render()
    # your agent goes here
    action = env.action_space.sample()
    # Apply the action; done becomes True when the episode ends.
    observation, reward, done, info = env.step(action)
    if done:
        break

env.close()
show_video()
A continuous environment can exist in an infinite number of states. Q-Learning handles continuous states by binning these numeric values into ranges.
Out of the box, Q-Learning does not deal with continuous inputs, such as a car's accelerator that can range from released to fully engaged. Additionally, Q-Learning primarily deals with discrete actions, such as pressing a joystick up or down. Researchers have developed clever tricks to allow Q-Learning to accommodate continuous actions.
Deep neural networks can help solve the problems of continuous environments and action spaces. In
the next section, we will learn more about deep reinforcement learning. For now, we will apply regular
Q-Learning to the Mountain Car problem from OpenAI Gym.
This section will demonstrate how Q-Learning can create a solution to the mountain car gym environment. The mountain car is an environment where a car must climb a mountain. Because gravity is stronger than the car's engine, it cannot merely accelerate up the steep slope, even with full throttle. The vehicle is situated in a valley and must learn to utilize potential energy by driving up the opposite hill before it can make it to the goal at the top of the rightmost hill.
First, it might be helpful to visualize the mountain car environment. The following code shows this environment, making use of TF-Agents to perform the render. Usually, we use TF-Agents for the type of deep reinforcement learning that we will see in the next module; here, TF-Agents is just used to render the mountain car environment.
Code
import tf_agents
from tf_agents.environments import suite_gym
import PIL.Image
import pyvirtualdisplay

display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
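# A sketch: load the environment through the TF-Agents suite and
# render a single frame.
env = suite_gym.load('MountainCar-v0')
env.reset()
PIL.Image.fromarray(env.render())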
Output
• state[0] - Position
• state[1] - Velocity
The car is not strong enough; it will need to use potential energy from the mountain behind it. The following code shows an agent that applies full throttle to climb the hill.
Code
import gym
from gym.wrappers import Monitor
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

display = Display(visible=0, size=(1400, 900))
display.start()
def show_video():
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay
            loop controls style="height: 400px;">
            <source src="data:video/mp4;base64,{0}"
            type="video/mp4" />
            </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")
Code
import gym

if COLAB:
    env = wrap_env(gym.make("MountainCar-v0"))
else:
    env = gym.make("MountainCar-v0")

env.reset()
done = False

i = 0
while not done:
    i += 1
    state, reward, done, _ = env.step(2)
    env.render()
    print(f"Step {i}: State={state}, Reward={reward}")

env.close()
Output
...
It helps to visualize the car. The following code shows a video of the car when run from a notebook.
Code
show_video ( )
Code
import gym

if COLAB:
    env = wrap_env(gym.make("MountainCar-v0"))
else:
    env = gym.make("MountainCar-v0")

state = env.reset()
done = False

i = 0
while not done:
    i += 1
    # Accelerate in the direction the car is already moving.
    if state[1] > 0:
        action = 2
    else:
        action = 0
    # Apply the action; done becomes True when the episode ends.
    state, reward, done, _ = env.step(action)
    env.render()

env.close()
Output
...
show_video()
The Q-values can dictate action by selecting the action column with the highest Q-value for the current environment state. The choice between a random action and a Q-value-driven action is governed by the epsilon (ε) parameter, the probability of taking a random action.
Each time through the training loop, the training algorithm updates the Q-values according to the following equation.
$$Q^{\text{new}}(s_t, a_t) \leftarrow \underbrace{Q(s_t, a_t)}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \overbrace{\Big( \underbrace{\underbrace{r_t}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_a Q(s_{t+1}, a)}_{\text{estimate of optimal future value}}}_{\text{new value (temporal difference target)}} - \underbrace{Q(s_t, a_t)}_{\text{old value}} \Big)}^{\text{temporal difference}}$$
• Q(s_t, a_t) - The Q-table. For each combination of states, what reward would the agent likely receive for performing each action?
• s_t - The current state.
• r_t - The last reward received.
• a_t - The action that the agent will perform.
The equation works by calculating a delta (temporal difference) that the equation should apply to the old Q-value. The learning rate (α) scales this delta. A learning rate of 1.0 would fully implement the temporal difference in the Q-values each iteration and would likely be very chaotic.
There are two parts to the temporal difference: the new and old values. The old value is subtracted from the new value to provide a delta, the full amount by which we would change the Q-value if the learning rate did not scale this value. The new value is a summation of the reward received from the last action and the maximum Q-values from the resulting state when the agent takes this action. Adding the maximum of action Q-values for the new state is essential because it estimates the optimal future values from proceeding with this action.
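A small worked example with hypothetical numbers makes the update concrete.
Code

alpha = 0.1    # learning rate
gamma = 0.95   # discount factor

old_q = 2.0          # Q(s_t, a_t): hypothetical current table entry
reward = 1.0         # r_t: hypothetical reward just received
max_future_q = 3.0   # max_a Q(s_{t+1}, a): hypothetical best next value

# Temporal difference: the new value minus the old value.
temporal_difference = reward + gamma * max_future_q - old_q  # 1.85
new_q = old_q + alpha * temporal_difference
print(new_q)  # 2.185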
Q-Learning Car

We will now use Q-Learning to produce a car that learns to drive itself. Look out, Tesla! We begin by defining two essential functions.
Code
import gym
import numpy as np

# This function converts the floating point state values into
# discrete values. This is often called binning. We divide
# the range that the state values might occupy and assign
# each region to a bucket.
def calc_discrete_state(state):
    discrete_state = (state - env.observation_space.low) / buckets
    return tuple(discrete_state.astype(int))
# Run simulation step
new_state, reward, done, _ = env.step(action)

# Convert continuous state to discrete
new_state_disc = calc_discrete_state(new_state)

# Update q-table
if should_update:
    max_future_q = np.max(q_table[new_state_disc])
    current_q = q_table[discrete_state + (action,)]
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * \
        (reward + DISCOUNT * max_future_q)
    q_table[discrete_state + (action,)] = new_q

discrete_state = new_state_disc

if render:
    env.render()

return success
Several hyperparameters are very important for Q-Learning. These parameters will likely need adjust-
ment as you apply Q-Learning to other problems. Because of this, it is crucial to understand the role of
each parameter.
• LEARNING_RATE The rate at which previous Q-values are updated based on new episodes run
during training.
• DISCOUNT The amount of significance to give estimates of future rewards when added to the
reward for the current action taken. A value of 0.95 would indicate a discount of 5% on the future
reward estimates.
• EPISODES The number of episodes to train over. Increase this for more complex problems; how-
ever, training time also increases.
• SHOW_EVERY How many episodes to allow to elapse before showing an update.
• DISCRETE_GRID_SIZE How many buckets to use when converting each continuous state
variable. For example, [10, 10] indicates that the algorithm should use ten buckets for the first and
second state variables.
• START_EPSILON_DECAYING Epsilon is the probability that the agent will select a random
action over what the Q-Table suggests. This value determines the starting probability of randomness.
• END_EPSILON_DECAYING How many episodes should elapse before epsilon goes to zero
and no random actions are permitted. For example, EPISODES//10 means only the first 1/10th of
the episodes might have random actions.
Code
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 50000
SHOW_EVERY = 1000
DISCRETE_GRID_SIZE = [10, 10]
START_EPSILON_DECAYING = 0.5
END_EPSILON_DECAYING = EPISODES // 10
We can now make the environment. If we are running in Google COLAB, we wrap the environment to be
displayed inside the web browser. Next, create the discrete buckets for state and build Q-table.
Code
if COLAB:
    env = wrap_env(gym.make("MountainCar-v0"))
else:
    env = gym.make("MountainCar-v0")

epsilon = 1
epsilon_change = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)
buckets = (env.observation_space.high - env.observation_space.low) \
    / DISCRETE_GRID_SIZE
q_table = np.random.uniform(low=-3, high=0, size=(DISCRETE_GRID_SIZE
    + [env.action_space.n]))
success = False
We can now run the Q-Learning training loop, counting successes and decaying epsilon as the episodes progress.
Code
episode = 0
success_count = 0

# Loop through the required number of episodes
while episode < EPISODES:
    episode += 1
    done = False

    # Count successes
    if success:
        success_count += 1

    # Move epsilon towards its ending value, if it still needs to move
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon = max(0, epsilon - epsilon_change)

print(success)
Output
...
As you can see, the number of successful episodes generally increases as training progresses. It is not
advisable to stop the first time we observe 100% success over 1,000 episodes. There is a randomness to
most games, so it is not likely that an agent would retain its 100% success rate with a new run. It might
be safe to stop training once you observe that the agent has gotten 100% for several update intervals.
import pandas as pd
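# A sketch of how the Q-table's preferred actions might be displayed;
# argmax over the action dimension gives the best action per bucket.
df = pd.DataFrame(q_table.argmax(axis=2))
df.columns = [f'v-{x}' for x in range(DISCRETE_GRID_SIZE[0])]
df.index = [f'p-{x}' for x in range(DISCRETE_GRID_SIZE[1])]
df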
Output
v-0 v-1 v-2 v-3 v-4 v-5 v-6 v-7 v-8 v-9
p-0 2 2 2 2 2 2 2 0 2 0
p-1 0 1 0 1 2 2 2 2 2 1
p-2 1 0 0 2 2 2 2 1 1 0
p-3 2 0 0 0 2 2 2 1 2 2
p-4 2 0 0 0 0 2 0 2 2 2
p-5 1 1 2 1 1 0 1 1 2 2
p-6 2 2 0 0 0 0 2 2 2 2
p-7 0 2 1 0 0 1 2 2 2 2
p-8 2 0 1 2 0 0 2 2 1 2
p-9 2 2 2 1 1 0 2 2 2 1
Code
df.mean(axis=0)
Output
v-0    1.4
v-1    1.0
v-2    0.8
v-3    0.9
v-4    1.0
v-5    1.1
v-6    1.7
v-7    1.5
v-8    1.8
v-9    1.4
dtype: float64
Code
df.mean(axis=1)
Output
p-0    1.6
p-1    1.3
p-2    1.1
p-3    1.3
p-4    1.0
p-5    1.2
p-6    1.2
p-7    1.2
p-8    1.2
p-9    1.5
dtype: float64
This chapter will use TF-Agents to implement a DQN to solve the cart-pole environment. TF-Agents makes designing, implementing, and testing new RL algorithms easier by providing well-tested modular components that can be modified and extended. It enables fast code iteration with functional test integration and benchmarking.
To apply DQN to this problem, you need to create the following components for TF-Agents.
• Environment
• Agent
• Policies
• Metrics and Evaluation
• Replay Buffer
• Data Collection
• Training
These components are standard in most DQN implementations. Later, we will apply these same components to an Atari game and, after that, to a problem of our own design. This example is based on the cart-pole tutorial provided for TF-Agents.
First, we must install TF-Agents.
Code
if COLAB:
    !sudo apt-get install -y xvfb ffmpeg x11-utils
    !pip install -q 'gym==0.10.11'
    !pip install -q 'imageio==2.4.0'
    !pip install -q PILLOW
    !pip install -q 'pyglet==1.3.2'
    !pip install -q pyvirtualdisplay
    !pip install -q tf-agents
    !pip install -q pygame
import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay
import tensorflow as tf
To allow this example to run in a notebook, we use a virtual display that will output an embedded
video. If running this code outside a notebook, you could omit the virtual display and animate it directly
to a window.
Code
# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
12.3.2 Hyperparameters
We must define several hyperparameters for the algorithm to train the agent. The TF-Agents example provides reasonably well-tuned hyperparameters for cart-pole. Later, we will adapt these to an Atari game.
Code
# How long should training run?
num_iterations = 20000
# How many initial random steps, before training starts, to
# collect initial data.
initial_collect_steps = 1000
# How many steps should we run each iteration to collect
# data from.
collect_steps_per_iteration = 1
# How much data should we store for training examples.
replay_buffer_max_length = 100000

batch_size = 64
learning_rate = 1e-3
# How often should the program provide an update.
log_interval = 200

# How many episodes should each evaluation use, and how often should
# an evaluation occur (assumed values; both are referenced by the
# training loop below).
num_eval_episodes = 10
eval_interval = 1000
12.3.3 Environment
TF-Agents use OpenAI gym environments to represent the task or problem to be solved. Standard en-
vironments can be created in TF-Agents using tf_agents.environments suites. TF-Agents has suites
for loading environments from sources such as the OpenAI Gym, Atari, and DM Control. We begin by
loading the CartPole environment from the OpenAI Gym suite.
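A minimal sketch of the load, following the TF-Agents cart-pole tutorial:
Code

from tf_agents.environments import suite_gym

env_name = 'CartPole-v0'
env = suite_gym.load(env_name)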
Code
env.reset()
PIL.Image.fromarray(env.render())
Output
The environment.step method takes an action in the environment and returns a TimeStep tuple
containing the following observation of the environment and the reward for the action.
The time_step_spec() method returns the specification for the TimeStep tuple. Its observation
attribute shows the shape of observations, the data types, and the ranges of allowed values. The reward
attribute shows the same details for the reward.
Code
Output
Observation Spec:
BoundedArraySpec(shape=(4,), dtype=dtype('float32'),
name='observation', minimum=[-4.8000002e+00 -3.4028235e+38
-4.1887903e-01 -3.4028235e+38], maximum=[4.8000002e+00 3.4028235e+38
4.1887903e-01 3.4028235e+38])
Code
Output
Reward Spec:
ArraySpec(shape=(), dtype=dtype('float32'), name='reward')
The action_spec() method returns the shape, data types, and allowed values of valid actions.
Code
Output
Action Spec:
BoundedArraySpec(shape=(), dtype=dtype('int64'), name='action',
minimum=0, maximum=1)
Code
time_step = env.reset()
print('Time step:')
print(time_step)

action = np.array(1, dtype=np.int32)

next_time_step = env.step(action)
print('Next time step:')
print(next_time_step)
Output
Time step:
TimeStep(
{'discount': array(1., dtype=float32),
 'observation': array([-0.03279859,  0.03562892, -0.04014493,
-0.04911802], dtype=float32),
 'reward': array(0., dtype=float32),
 'step_type': array(0, dtype=int32)})
Next time step:
TimeStep(
{'discount': array(1., dtype=float32),
 'observation': array([-0.03208601,  0.23130283, -0.04112729,
-0.35419184], dtype=float32),
 'reward': array(1., dtype=float32),
 'step_type': array(1, dtype=int32)})
Usually, the program instantiates two environments: one for training and one for evaluation.
Code
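# A sketch following the TF-Agents tutorial: one environment for
# training and one for evaluation.
train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)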
The cart-pole environment, like most environments, is written in pure Python and is converted to TF-Agents and TensorFlow using the TFPyEnvironment wrapper. The original environment's API uses NumPy arrays; the TFPyEnvironment converts these to Tensors to make them compatible with TensorFlow agents and policies.
Code
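# A sketch: wrap the pure-Python environments for TensorFlow.
from tf_agents.environments import tf_py_environment

train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)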
12.3.4 Agent
An Agent represents the algorithm used to solve an RL problem. TF-Agents provides standard implementations of a variety of Agents:
You can only use the DQN agent in environments with a discrete action space. The DQN uses a QNetwork,
a neural network model that learns to predict Q-Values (expected returns) for all actions given a state from
the environment.
The following code uses tf_agents.networks.q_network to create a QNetwork, passing in the ob-
servation_spec, action_spec, and a tuple describing the number and size of the model’s hidden layers.
Code
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.utils import common

fc_layer_params = (100,)
q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)
agent.initialize()
12.3.5 Policies
A policy defines the way an agent acts in an environment. Typically, reinforcement learning aims to train
the underlying model until the policy produces the desired outcome.
In this example:
• The desired outcome is keeping the pole balanced upright over the cart.
• The policy returns an action (left or right) for each time_step observation.
12.3. PART 12.3: KERAS Q-LEARNING IN THE OPENAI GYM 449
• agent.policy - The algorithm uses this main policy for evaluation and deployment.
• agent.collect_policy - The algorithm uses this secondary policy for data collection.
Code
eval_policy = agent.policy
collect_policy = agent.collect_policy
You can create policies independently of agents. For example, use random_tf_policy to create a policy
that will randomly select an action for each time_step. We will use this random policy to create initial
collection data to begin training.
Code
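# A sketch following the TF-Agents tutorial.
from tf_agents.policies import random_tf_policy

random_policy = random_tf_policy.RandomTFPolicy(
    train_env.time_step_spec(), train_env.action_spec())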
To get an action from a policy, call the policy.action method. The time_step contains the obser-
vation from the environment. This method returns a PolicyStep, which is a named tuple with three
components:
Output
The most common metric used to evaluate a policy is the average return. The return is the sum of rewards
obtained while running a policy in an environment for an episode. Several episodes are run, creating an
average return. The following function computes the average return, given the policy, environment, and
number of episodes. We will use this same evaluation for Atari.
Code
def compute_avg_return(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        # Accumulate the rewards for one episode under the policy.
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return

    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

# See also the metrics module for standard implementations
# of different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics
Running this computation on the random_policy shows a baseline performance in the environment.
Code
Output
15.2
The replay buffer keeps track of data collected from the environment. This tutorial uses TFUniformReplayBuffer. The constructor requires the specs for the data it will be collecting. This value is available from the agent using the collect_data_spec method. The batch size and maximum buffer length are also required.
Code
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)
For most agents, collect_data_spec is a named tuple called Trajectory, containing the specs for
observations, actions, rewards, and other items.
Code
agent.collect_data_spec
Output
Trajectory(
{'action': BoundedTensorSpec(shape=(), dtype=tf.int64, name='action',
minimum=array(0), maximum=array(1)),
 'discount': BoundedTensorSpec(shape=(), dtype=tf.float32,
name='discount', minimum=array(0., dtype=float32), maximum=array(1.,
dtype=float32)),
 'next_step_type': TensorSpec(shape=(), dtype=tf.int32,
name='step_type'),
 'observation': BoundedTensorSpec(shape=(4,), dtype=tf.float32,
name='observation', minimum=array([-4.8000002e+00, -3.4028235e+38,
-4.1887903e-01, -3.4028235e+38],
dtype=float32), maximum=array([4.8000002e+00, 3.4028235e+38,
4.1887903e-01, 3.4028235e+38],
dtype=float32)),
 'policy_info': (),
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')})
# Add trajectory to the replay buffer
buffer.add_batch(traj)
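The two lines above are the tail of a data-collection helper; a sketch of the full function, following the standard TF-Agents cart-pole tutorial, might look like this.
Code

from tf_agents.trajectories import trajectory

def collect_step(environment, policy, buffer):
    time_step = environment.current_time_step()
    action_step = policy.action(time_step)
    next_time_step = environment.step(action_step.action)
    traj = trajectory.from_transition(time_step, action_step,
        next_time_step)
    # Add trajectory to the replay buffer
    buffer.add_batch(traj)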
The replay buffer is now a collection of Trajectories. The agent needs access to the replay buffer. TF-
Agents provides this access by creating an iterable tf.data.Dataset pipeline, which will feed data to the
agent.
Each row of the replay buffer only stores a single observation step. But since the DQN Agent needs both
the current and following observation to compute the loss, the dataset pipeline will sample two adjacent
rows for each item in the batch (num_steps=2).
The program also optimizes this dataset by running parallel calls and prefetching data.
Code
# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=batch_size,
    num_steps=2).prefetch(3)
dataset
Output
Code
iterator = iter(dataset)
print(iterator)
Output
The following loop trains the agent, periodically evaluates the policy, and prints the current score. It will take roughly five minutes to run.
Code
# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy,
                                num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):
    # Collect a few steps using collect_policy and
    # save to the replay buffer.
    for _ in range(collect_steps_per_iteration):
        collect_step(train_env, agent.collect_policy, replay_buffer)

    # Sample a batch of data from the buffer and update
    # the agent's network.
    experience, unused_info = next(iterator)
    train_loss = agent.train(experience).loss

    step = agent.train_step_counter.numpy()

    if step % log_interval == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    if step % eval_interval == 0:
        avg_return = compute_avg_return(eval_env, agent.policy,
                                        num_eval_episodes)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)
Output
packages/tensorflow/python/util/dispatch.py:1082: calling foldr_v2
(from tensorflow.python.ops.functional_ops) with back_prop=False is
deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient
instead.
Instead of:
results = tf.foldr(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.foldr(fn, elems))
step = 200: loss = 23.158374786376953
step = 400: loss = 7.158817768096924
step = 600: loss = 30.97699737548828
step = 800: loss = 9.831337928771973
...
Code
iterations = range(0, num_iterations + 1, eval_interval)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.ylim(top=250)
Output
(3.859999799728394, 250.0)
12.3.11 Videos
The charts are nice, but it is more exciting to see the agent actually performing a task in its environment.
First, create a function to embed videos in the notebook.
Code
def embed_mp4(filename):
    """Embeds an mp4 file in the notebook."""
    video = open(filename, 'rb').read()
    b64 = base64.b64encode(video)
    tag = '''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4">
    Your browser does not support the video tag.
    </video>'''.format(b64.decode())

    return IPython.display.HTML(tag)
Now iterate through a few episodes of the Cartpole game with the agent. The underlying Python
environment (the one "inside" the TensorFlow environment wrapper) provides a render() method, which
outputs an image of the environment state. We can collect these frames into a video.
Code
def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
    filename = filename + ".mp4"
    with imageio.get_writer(filename, fps=fps) as video:
        for _ in range(num_episodes):
            time_step = eval_env.reset()
            video.append_data(eval_py_env.render())
            while not time_step.is_last():
                action_step = policy.action(time_step)
                time_step = eval_env.step(action_step.action)
                video.append_data(eval_py_env.render())
    return embed_mp4(filename)

create_policy_eval_video(agent.policy, "trained-agent")
For fun, compare the trained agent (above) to an agent moving randomly. (It does not do as well.)
Code
12.4 Part 12.4: Atari Games with Keras Neural Networks
Atari games have become popular benchmarks for AI systems, particularly reinforcement learning. OpenAI
Gym internally uses the Stella Atari Emulator. You can see the Atari 2600 in Figure 12.3.
• Maximum resolution: 160 x 192 pixels (NTSC). Max resolution is achievable only with programming
tricks that combine sprite pixels with playfield pixels.
• 128 colors (NTSC). 128 possible on screen. Max of 4 per line: background, playfield, player0 sprite,
and player1 sprite. Palette switching between lines is common. Palette switching mid-line is possible
but not common due to resource limitations.
• 2 channels of 1-bit monaural sound with 4-bit volume control.
import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay
import tensorflow as tf

from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
12.4.3 Hyperparameters
The hyperparameter names are the same as the previous DQN example; however, I tuned the numeric
values for the more complex Atari game.
Code
initial_collect_steps = 200
collect_steps_per_iteration = 10
replay_buffer_max_length = 100000

batch_size = 32
learning_rate = 2.5e-3
log_interval = 1000

num_eval_episodes = 5
eval_interval = 25000
The algorithm needs more iterations for an Atari game. I also found that increasing the number of
collection steps helped the algorithm train.
Code
env = suite_atari.load(
    env_name,
    max_episode_steps=max_episode_frames / ATARI_FRAME_SKIP,
    gym_env_wrappers=suite_atari.DEFAULT_ATARI_GYM_WRAPPERS_WITH_STACKING)
#env = batched_py_environment.BatchedPyEnvironment([env])
We can now reset the environment and display one step. The following image shows how the Pong
game environment appears to a user.
Code
env.reset()
PIL.Image.fromarray(env.render())
Output
We are now ready to load and wrap the two environments for TF-Agents. The algorithm uses the first
environment for evaluation and the second to train.
Code
train_py_env = suite_atari.load(
    env_name,
    max_episode_steps=max_episode_frames / ATARI_FRAME_SKIP,
    gym_env_wrappers=suite_atari.DEFAULT_ATARI_GYM_WRAPPERS_WITH_STACKING)

eval_py_env = suite_atari.load(
    env_name,
    max_episode_steps=max_episode_frames / ATARI_FRAME_SKIP,
    gym_env_wrappers=suite_atari.DEFAULT_ATARI_GYM_WRAPPERS_WITH_STACKING)
12.4.5 Agent
I used the following code from the TF-Agents examples to wrap the regular Q-network class. The AtariCategoricalQNetwork class ensures that the pixel values from the Atari screen are divided by 255. This division assists the neural network by normalizing the pixel values to the range 0 to 1.
Code
class AtariCategoricalQNetwork(network.Network):
    """CategoricalQNetwork subclass that divides observations by 255."""

    def __init__(self, input_tensor_spec, action_spec, **kwargs):
        super(AtariCategoricalQNetwork, self).__init__(
            input_tensor_spec, state_spec=())
        input_tensor_spec = tf.TensorSpec(
            dtype=tf.float32, shape=input_tensor_spec.shape)
        self._categorical_q_network = \
            categorical_q_network.CategoricalQNetwork(
                input_tensor_spec, action_spec, **kwargs)

    @property
    def num_atoms(self):
        return self._categorical_q_network.num_atoms

    def call(self, observation, step_type=None, network_state=()):
        state = tf.cast(observation, tf.float32)
        # We divide the grayscale pixel values by 255 here rather than
        # storing normalized values because uint8s are 4x cheaper to
        # store than float32s.
        # TODO(b/129805821): handle the division by 255 for
        # train_eval_atari.py in a preprocessing layer instead.
        state = state / 255
        return self._categorical_q_network(
            state, step_type=step_type, network_state=network_state)
Next, we introduce two hyperparameters specific to the neural network we are about to define.
Code
fc_layer_params = (512,)
conv_layer_params = ((32, (8, 8), 4), (64, (4, 4), 2), (64, (3, 3), 1))

q_net = AtariCategoricalQNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    conv_layer_params=conv_layer_params,
    fc_layer_params=fc_layer_params)
Convolutional neural networks usually comprise several alternating pairs of convolution and max-pooling layers, ultimately culminating in one or more dense layers. These are the same layer types we saw previously in this course. The QNetwork accepts two parameters that define the convolutional neural network structure.

The simpler of the two parameters is fc_layer_params, a tuple that specifies the number of units in each of the dense layers.

The second parameter, conv_layer_params, is a list of convolution layer parameters, where each item is a length-three tuple of (filters, kernel_size, stride). This implementation of QNetwork supports only convolution layers; if you desire a more complex convolutional neural network, you must define your own variant of the QNetwork. The sketch below shows the layers these parameters correspond to.
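To make the tuple encoding concrete, here is a sketch of the stack of Keras layers implied by the parameters above. This is an illustration only, not the internal QNetwork implementation:

import tensorflow as tf

# conv_layer_params=((32,(8,8),4), (64,(4,4),2), (64,(3,3),1)) and
# fc_layer_params=(512,) roughly correspond to:
illustrative_net = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (8, 8), strides=4, activation='relu'),
    tf.keras.layers.Conv2D(64, (4, 4), strides=2, activation='relu'),
    tf.keras.layers.Conv2D(64, (3, 3), strides=1, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
])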
The QNetwork defined here is not the agent. Instead, the DQN agent uses the QNetwork to implement the actual neural network. This arrangement allows flexibility, as you can substitute your own network class if needed.

Next, we define the optimizer. For this example, I used RMSPropOptimizer; AdamOptimizer is another popular choice. We also create the DQN agent and reference the Q-network.
Code
optimizer = tf.compat.v1.train.RMSPropOptimizer(
    learning_rate=learning_rate,
    decay=0.95,
    momentum=0.0,
    epsilon=0.00001,
    centered=True)

observation_spec = tensor_spec.from_spec(train_env.observation_spec())
time_step_spec = ts.time_step_spec(observation_spec)

action_spec = tensor_spec.from_spec(train_env.action_spec())
target_update_period = 32000  # ALE frames
update_period = 16  # ALE frames
_update_period = update_period / ATARI_FRAME_SKIP
agent = categorical_dqn_agent.CategoricalDqnAgent(
    time_step_spec,
    action_spec,
    categorical_q_network=q_net,
    optimizer=optimizer,
    # epsilon_greedy=epsilon,
    n_step_update=1.0,
    target_update_tau=1.0,
    target_update_period=(
        target_update_period / ATARI_FRAME_SKIP / _update_period),
    gamma=0.99,
    reward_scale_factor=1.0,
    gradient_clipping=None,
    debug_summaries=False,
    summarize_grads_and_vars=False)

agent.initialize()
def compute_avg_return(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return

    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

# See also the metrics module for standard implementations of
# different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics
DQN works by training a neural network to predict the Q-values for every possible environment state. A neural network needs training data, so the algorithm accumulates this training data as it runs episodes. The replay buffer is where this data is stored. Only the most recent episodes are retained; older episode data rolls off the queue as new data accumulates.
Code
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)

# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=batch_size,
    num_steps=2).prefetch(3)
Output
# Add trajectory to the replay buffer
buffer.add_batch(traj)

collect_data(train_env, random_policy, replay_buffer,
             steps=initial_collect_steps)
iterator = iter(dataset)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy,
                                num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):
    # Collect a few steps using collect_policy and
    # save to the replay buffer.
    for _ in range(collect_steps_per_iteration):
        collect_step(train_env, agent.collect_policy, replay_buffer)

    # Sample a batch of data from the buffer and update
    # the agent's network.
    experience, unused_info = next(iterator)
    train_loss = agent.train(experience).loss

    step = agent.train_step_counter.numpy()

    if step % log_interval == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    if step % eval_interval == 0:
        avg_return = compute_avg_return(eval_env, agent.policy,
                                        num_eval_episodes)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)
Output
12.4.10 Videos
Perhaps the most compelling way to view an Atari game’s results is a video that allows us to see the agent
play the game. We now have a trained model and observed its training progress on a graph. The following
functions are defined to watch the agent play the game in the notebook.
Code
def embed_mp4(filename):
    """Embeds an mp4 file in the notebook."""
    video = open(filename, 'rb').read()
    b64 = base64.b64encode(video)
    tag = '''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4">
    Your browser does not support the video tag.
    </video>'''.format(b64.decode())

    return IPython.display.HTML(tag)

create_policy_eval_video(agent.policy, "trained-agent")
For comparison, we observe a random agent play. While the trained agent is far from perfect, with
enough training, it does outperform the random agent considerably.
Code
import base64
import math
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import PIL.ImageDraw
import pyvirtualdisplay
import tensorflow as tf

import gym
from gym import spaces
from gym.utils import seeding
from gym.envs.registration import register
from PIL import ImageFont
If you get an error here, restart and rerun the Google Colab environment. Sometimes a restart is needed after installing TF-Agents.
# Set up a virtual display for rendering OpenAI gym environments.
vdisplay = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
The class presented below implements a financial planning simulation. The agent must save for retirement and should attempt to amass the greatest possible net worth. The action space is composed of floating-point values (between 0 and 1), one for each way the agent can allocate money: the home loan payment, the tax-advantaged savings deposit, the taxable savings deposit, and luxury spending. The actions are weights that the program converts to percentages of the total. For example, the home loan percentage is the home loan action value divided by the sum of all actions (including the home loan). The following code implements the environment and provides implementation details in the comments.
Code
class SimpleGameOfLifeEnv(gym.Env):
    STATE_ELEMENTS = 7
    STATES = ['age', 'salary', 'home_value', 'home_loan', 'req_home_pmt',
              'acct_tax_adv', 'acct_tax', "expenses", "actual_home_pmt",
              "tax_deposit", "tax_adv_deposit", "net_worth"]
    STATE_AGE = 0
    STATE_SALARY = 1
    STATE_HOME_VALUE = 2
    STATE_HOME_LOAN = 3
    STATE_HOME_REQ_PAYMENT = 4
    STATE_SAVE_TAX_ADV = 5
    STATE_SAVE_TAXABLE = 6

    MEG = 1.0e6

    ACTION_ELEMENTS = 4
    ACTION_HOME_LOAN = 0
    ACTION_SAVE_TAX_ADV = 1
    ACTION_SAVE_TAXABLE = 2
    ACTION_LUXURY = 3

    INFLATION = (0.015) / 12.0
    INTEREST = (0.05) / 12.0
    TAX_RATE = (.142) / 12.0
    EXPENSES = 0.6
    INVEST_RETURN = 0.065 / 12.0
    SALARY_LOW = 40000.0
    SALARY_HIGH = 60000.0
    START_AGE = 18
    RETIRE_AGE = 80
    def __init__(self, verbose=False):
        self.verbose = verbose
        self.action_space = spaces.Box(
            low=0.0,
            high=1.0,
            shape=(SimpleGameOfLifeEnv.ACTION_ELEMENTS,),
            dtype=np.float32
        )
        self.observation_space = spaces.Box(
            low=0,
            high=2,
            shape=(SimpleGameOfLifeEnv.STATE_ELEMENTS,),
            dtype=np.float32
        )
        self.seed()
        self.reset()
        self.state_log = []
    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
    def _calc_net_worth(self):
        home_value = self.state[
            SimpleGameOfLifeEnv.STATE_HOME_VALUE]
        principal = self.state[
            SimpleGameOfLifeEnv.STATE_HOME_LOAN]
        worth = home_value - principal
        worth += self.state[
            SimpleGameOfLifeEnv.STATE_SAVE_TAX_ADV]
        worth += self.state[
            SimpleGameOfLifeEnv.STATE_SAVE_TAXABLE]
        return worth
    def _eval_action(self, action, payment):
        # Convert the raw action weights into percentages of salary.
        act_home_payment = action[SimpleGameOfLifeEnv.ACTION_HOME_LOAN]
        act_tax_adv_pay = action[SimpleGameOfLifeEnv.ACTION_SAVE_TAX_ADV]
        act_taxable = action[SimpleGameOfLifeEnv.ACTION_SAVE_TAXABLE]
        act_luxury = action[SimpleGameOfLifeEnv.ACTION_LUXURY]

        total_act = act_home_payment + act_tax_adv_pay + act_taxable + \
            act_luxury + self.expenses

        if total_act < 1e-2:
            pct_home_payment = 0
            pct_tax_adv_pay = 0
            pct_taxable = 0
            pct_luxury = 0
        else:
            pct_home_payment = act_home_payment / total_act
            pct_tax_adv_pay = act_tax_adv_pay / total_act
            pct_taxable = act_taxable / total_act
            pct_luxury = act_luxury / total_act

        return pct_home_payment, pct_tax_adv_pay, pct_taxable, pct_luxury
    def step(self, action):
        self.last_action = action
        age = self.state[SimpleGameOfLifeEnv.STATE_AGE]
        salary = self.state[SimpleGameOfLifeEnv.STATE_SALARY]
        home_value = self.state[SimpleGameOfLifeEnv.STATE_HOME_VALUE]
        principal = self.state[SimpleGameOfLifeEnv.STATE_HOME_LOAN]
        payment = self.state[SimpleGameOfLifeEnv.STATE_HOME_REQ_PAYMENT]

        net1 = self._calc_net_worth()
        remaining_salary = salary

        # Calculate actions
        pct_home_payment, pct_tax_adv_pay, pct_taxable, pct_luxury = \
            self._eval_action(action, payment)

        # Expenses
        current_expenses = salary * self.expenses
        remaining_salary -= current_expenses
        if self.verbose:
            print(f"Expenses: {current_expenses}")
            print(f"Remaining Salary: {remaining_salary}")

        # Tax advantaged deposit action
        my_tax_adv_deposit = min(salary * pct_tax_adv_pay,
                                 remaining_salary)
        # Govt CAP
        my_tax_adv_deposit = min(my_tax_adv_deposit,
                                 self.year_tax_adv_deposit_left)
        self.year_tax_adv_deposit_left -= my_tax_adv_deposit
        remaining_salary -= my_tax_adv_deposit

        # Company match
        tax_adv_deposit = my_tax_adv_deposit * 1.05
        self.state[SimpleGameOfLifeEnv.STATE_SAVE_TAX_ADV] += \
            int(tax_adv_deposit)
        if self.verbose:
            print(f"IRA Deposit: {tax_adv_deposit}")
            print(f"Remaining Salary: {remaining_salary}")

        # Tax
        remaining_salary -= remaining_salary * \
            SimpleGameOfLifeEnv.TAX_RATE
        if self.verbose:
            print(f"Tax Salary: {remaining_salary}")

        # Home payment
        actual_payment = min(salary * pct_home_payment,
                             remaining_salary)
        if principal > 0:
            ipart = principal * SimpleGameOfLifeEnv.INTEREST
            ppart = actual_payment - ipart
            principal = int(principal - ppart)
            if principal <= 0:
                principal = 0
                self.state[SimpleGameOfLifeEnv.STATE_HOME_REQ_PAYMENT] = 0
            elif actual_payment < payment:
                self.late_count += 1
                if self.late_count > 15:
                    sell = (home_value - principal) / 2
                    sell -= 20000
                    sell = max(sell, 0)
                    self.state[SimpleGameOfLifeEnv.STATE_SAVE_TAXABLE] \
                        += sell
                    principal = 0
                    home_value = 0
                    self.expenses += .3
                    self.state[SimpleGameOfLifeEnv.STATE_HOME_REQ_PAYMENT] \
                        = 0
                    if self.verbose:
                        print(f"Foreclosure!!")
                else:
                    late_fee = payment * 0.1
                    principal += late_fee
                    if self.verbose:
                        print(f"Late Fee: {late_fee}")

        self.state[SimpleGameOfLifeEnv.STATE_HOME_LOAN] = principal
        remaining_salary -= actual_payment
        if self.verbose:
            print(f"Home Payment: {actual_payment}")
            print(f"Remaining Salary: {remaining_salary}")
        # Taxable savings
        actual_savings = remaining_salary * pct_taxable
        self.state[SimpleGameOfLifeEnv.STATE_SAVE_TAXABLE] \
            += actual_savings
        remaining_salary -= actual_savings
        if self.verbose:
            print(f"Tax Save: {actual_savings}")
            print(f"Remaining Salary (goes to Luxury): {remaining_salary}")

        # Investment income
        return_taxable = self.state[
            SimpleGameOfLifeEnv.STATE_SAVE_TAXABLE] \
            * self.invest_return
        return_tax_adv = self.state[
            SimpleGameOfLifeEnv.STATE_SAVE_TAX_ADV] \
            * self.invest_return
        return_taxable *= 1 - SimpleGameOfLifeEnv.TAX_RATE
        self.state[SimpleGameOfLifeEnv.STATE_SAVE_TAXABLE] \
            += return_taxable
        self.state[SimpleGameOfLifeEnv.STATE_SAVE_TAX_ADV] \
            += return_tax_adv

        # Yearly events
        if age > 0 and age % 12 == 0:
            self.perform_yearly()

        # Monthly events
        self.state[SimpleGameOfLifeEnv.STATE_AGE] += 1
        # Time to retire (by age?)
        done = self.state[SimpleGameOfLifeEnv.STATE_AGE] > \
            (SimpleGameOfLifeEnv.RETIRE_AGE * 12)

        # Calculate reward
        net2 = self._calc_net_worth()
        reward = net2 - net1

        # Track progress
        if self.verbose:
            print(f"Networth: {net2}")
            print(f"*** End Step {self.step_num}: State={self.state}, "
                  f"Reward={reward}")

        self.state_log.append(self.state + [current_expenses,
                                            actual_payment,
                                            actual_savings,
                                            my_tax_adv_deposit,
                                            net2])
        self.step_num += 1

        # Normalize state and finish up
        norm_state = [x / SimpleGameOfLifeEnv.MEG for x in self.state]
        return norm_state, reward / SimpleGameOfLifeEnv.MEG, done, {}
    def perform_yearly(self):
        salary = self.state[SimpleGameOfLifeEnv.STATE_SALARY]
        home_value = self.state[SimpleGameOfLifeEnv.STATE_HOME_VALUE]

        self.inflation = SimpleGameOfLifeEnv.INTEREST + \
            self.np_random.normal(loc=0, scale=1e-2)
        self.invest_return = SimpleGameOfLifeEnv.INVEST_RETURN + \
            self.np_random.normal(loc=0, scale=1e-2)

        self.year_tax_adv_deposit_left = 19000

        self.state[SimpleGameOfLifeEnv.STATE_SALARY] = \
            int(salary * (1 + self.inflation))
        self.state[SimpleGameOfLifeEnv.STATE_HOME_VALUE] \
            = int(home_value * (1 + self.inflation))
    def reset(self):
        self.expenses = SimpleGameOfLifeEnv.EXPENSES
        self.late_count = 0
        self.step_num = 0
        self.last_action = [0] * SimpleGameOfLifeEnv.ACTION_ELEMENTS
        self.state = [0] * SimpleGameOfLifeEnv.STATE_ELEMENTS
        self.state_log = []

        salary = float(self.np_random.randint(
            low=SimpleGameOfLifeEnv.SALARY_LOW,
            high=SimpleGameOfLifeEnv.SALARY_HIGH))
        house_mult = self.np_random.uniform(low=1.5, high=4)
        value = round(salary * house_mult)

        # 30-year mortgage payment on 90% of the home value
        p = (value * 0.9)
        i = SimpleGameOfLifeEnv.INTEREST
        n = 30 * 12
        m = float(int(p * (i * (1 + i)**n) / ((1 + i)**n - 1)))

        self.state[SimpleGameOfLifeEnv.STATE_AGE] = \
            SimpleGameOfLifeEnv.START_AGE * 12
        self.state[SimpleGameOfLifeEnv.STATE_SALARY] = salary / 12.0
        self.state[SimpleGameOfLifeEnv.STATE_HOME_VALUE] = value
        self.state[SimpleGameOfLifeEnv.STATE_HOME_LOAN] = p
        self.state[SimpleGameOfLifeEnv.STATE_HOME_REQ_PAYMENT] = m
        self.year_tax_adv_deposit_left = 19000
        self.perform_yearly()
        return np.array(self.state)
    def render(self, mode='human'):
        # (Most of the PIL drawing code is omitted in this excerpt.)
        balance_taxable = self.state[
            SimpleGameOfLifeEnv.STATE_SAVE_TAXABLE]
        net_worth = self._calc_net_worth()
        return np.array(img)

    def close(self):
        pass
You must register the environment class with Gym before your program can use it.
Code
register(
    id='simple-game-of-life-v0',
    entry_point=f'{__name__}:SimpleGameOfLifeEnv',
)
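Registration makes the environment constructible by its string id. The creation line is not shown in the next cell, but a line such as the following (a sketch) produces the env object it uses:

# Instantiate the registered environment by name (sketch).
env = gym.make('simple-game-of-life-v0')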
env.reset()
done = False
i = 0
env.verbose = False
while not done:
    i += 1
    state, reward, done, _ = env.step([1, 1, 0, 0])
env.render()
env.close()
Code
import pandas as pd
Output
1810888.5833333335
12.5.3 Hyperparameters
I tuned the following hyperparameters to get a reasonable result from training the agent. Further opti-
mization would be beneficial.
Code
# How long should training run?
num_iterations = 3000
# How often should the program provide an update.
log_interval = 500

batch_size = 64
We are now ready to make use of our environment. Because we registered the environment, the program can load it by its name "simple-game-of-life-v0".
Code
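The loading cell itself is not reproduced here; a minimal sketch of what it likely contains, using TF-Agents' Gym suite (identifiers assumed):

from tf_agents.environments import suite_gym

# Load the registered environment by name for TF-Agents.
env_name = 'simple-game-of-life-v0'
env = suite_gym.load(env_name)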
We can now take a quick look at the first rendered state. Here we can see the randomly chosen salary and home value for this agent. The learned policy must be able to consider different starting salaries and home values and find an appropriate strategy.
Code
env.reset()
PIL.Image.fromarray(env.render())
Output
Just as before, the program instantiates two environments: one for training and one for evaluation.
Code
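That cell is not shown; a sketch of the usual TF-Agents wrapping pattern it follows (identifiers assumed):

from tf_agents.environments import suite_gym, tf_py_environment

# One Python environment each for training and evaluation,
# wrapped for TensorFlow.
train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)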
You might be wondering why a DQN does not support continuous actions. This limitation exists because the DQN algorithm maps each discrete action to an output neuron. Each of these neurons predicts the expected future reward of taking its action, and the agent generally performs the action with the highest predicted reward. However, because a continuous number represented in a computer has an effectively infinite number of possible values, it is not possible to create an output neuron, and thus a future reward estimate, for every one of them.
We will use the Deep Deterministic Policy Gradients (DDPG) algorithm to provide a continuous action space.[22] This technique uses two neural networks. The first neural network, called the actor, acts as the agent and outputs an action for a given state. The second neural network, called the critic, is trained to estimate the expected reward of the actions the actor chooses; its feedback is used to improve the actor. Training two neural networks in parallel that interact in this way is a popular technique; earlier in this course, we saw that Generative Adversarial Networks (GANs) use a similar method. Figure 12.4 shows the structure of the DDPG network that we will use.

The environment provides the same input (x(t)) at each time step to both the actor and critic networks. The reward r(t) enters into the temporal difference error, which reports the difference between the estimated reward and the reward actually observed at any given state or time step.
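For reference, the one-step temporal difference error that drives the critic update in DDPG can be written as (standard formulation, not the book's own notation):

$$\delta_t = r_t + \gamma \, Q\big(s_{t+1}, \mu(s_{t+1})\big) - Q(s_t, a_t)$$

where $Q$ is the critic, $\mu$ is the actor, and $\gamma$ is the discount factor.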
The following code creates the actor and critic neural networks.
Code
actor_learning_rate = 1e-4
critic_learning_rate = 1e-3

debug_summaries = False
summarize_grads_and_vars = False

global_step = tf.compat.v1.train.get_or_create_global_step()

actor_net = actor_network.ActorNetwork(
    train_env.time_step_spec().observation,
    train_env.action_spec(),
    fc_layer_params=actor_fc_layers,
)

critic_net_input_specs = (train_env.time_step_spec().observation,
                          train_env.action_spec())

critic_net = critic_network.CriticNetwork(
    critic_net_input_specs,
    observation_fc_layer_params=critic_obs_fc_layers,
    action_fc_layer_params=critic_action_fc_layers,
    joint_fc_layer_params=critic_joint_fc_layers,
)
tf_agent = ddpg_agent.DdpgAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    actor_network=actor_net,
    critic_network=critic_net,
    actor_optimizer=tf.compat.v1.train.AdamOptimizer(
        learning_rate=actor_learning_rate),
    critic_optimizer=tf.compat.v1.train.AdamOptimizer(
        learning_rate=critic_learning_rate),
    ou_stddev=ou_stddev,
    ou_damping=ou_damping,
    target_update_tau=target_update_tau,
    target_update_period=target_update_period,
    dqda_clipping=dqda_clipping,
    td_errors_loss_fn=td_errors_loss_fn,
    gamma=gamma,
    reward_scale_factor=reward_scale_factor,
    gradient_clipping=gradient_clipping,
    debug_summaries=debug_summaries,
    summarize_grads_and_vars=summarize_grads_and_vars,
    train_step_counter=global_step)

tf_agent.initialize()
def compute_avg_return(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        while not time_step.is_last():
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return

    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]

# See also the metrics module for standard implementations of
# different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics
# Add trajectory to the replay buffer
buffer.add_batch(traj)

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=tf_agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)

# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=batch_size,
    num_steps=2).prefetch(3)
Output
iterator = iter(dataset)

# Reset the train step
tf_agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, tf_agent.policy,
                                num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):
    # Collect a few steps using collect_policy and
    # save to the replay buffer.
    for _ in range(collect_steps_per_iteration):
        collect_step(train_env, tf_agent.collect_policy, replay_buffer)

    # Sample a batch of data from the buffer and update
    # the agent's network.
    experience, unused_info = next(iterator)
    train_loss = tf_agent.train(experience).loss

    step = tf_agent.train_step_counter.numpy()

    if step % log_interval == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    if step % eval_interval == 0:
        avg_return = compute_avg_return(eval_env, tf_agent.policy,
                                        num_eval_episodes)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)
Output
12.5.8 Visualization
The notebook can plot the average return over training iterations. The average return should increase as
the program performs more training iterations.
12.5.9 Videos
We use the following functions to produce a video in the Jupyter notebook. As the simulated person moves through their career, the policy focuses on paying off the house and tax-advantaged investing.
Code
def embed_mp4(filename):
    """Embeds an mp4 file in the notebook."""
    video = open(filename, 'rb').read()
    b64 = base64.b64encode(video)
    tag = '''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4">
    Your browser does not support the video tag.
    </video>'''.format(b64.decode())

    return IPython.display.HTML(tag)

create_policy_eval_video(tf_agent.policy, "trained-agent")
Chapter 13
Advanced/Other Topics
13.1 Part 13.1: Flask and Deep Learning Web Services

app = Flask(__name__)
This program starts a web service on port 9000 of your computer. This cell will remain running
(appearing locked up). However, it is merely waiting for browsers to connect. If you point your browser
at the following URL, you will interact with the Flask web service.
• http://localhost:9000/
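A skeletal Flask service of the kind described here looks like the following; the route shown is illustrative only, not the book's mpg_server_1.py:

from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative route; mpg_server_1.py defines its own endpoints.
@app.route('/api/hello', methods=['GET'])
def hello():
    return jsonify({"status": "ok"})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9000)  # the port mentioned above

The MPG service we build next accepts a JSON body that describes a single car, such as the following: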
{
    "cylinders": 8,
    "displacement": 300,
    "horsepower": 78,
    "weight": 3500,
    "acceleration": 20,
    "year": 76,
    "origin": 1
}
We will see two different means of POSTing this JSON data to our web server. First, we will use a utility called Postman. Second, we will use Python code to construct the JSON message and interact with Flask.

First, it is necessary to train a neural network with the MPG dataset. This process is very similar to what we've done many times before; however, this time we save the neural network to an .H5 file so that we can load it later. We do not want Flask to train the neural network; we wish to deploy an already-trained network. The following code trains an MPG neural network.
Code
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'year', 'origin']].values
y = df['mpg'].values  # regression

# Split into validation and training sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Build the neural network
model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu'))  # Hidden 1
model.add(Dense(10, activation='relu'))  # Hidden 2
model.add(Dense(1))  # Output
model.compile(loss='mean_squared_error', optimizer='adam')

# Train the network (this fit call is assumed; the epoch count may differ).
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          verbose=2, epochs=100)
Output
Next, we evaluate the score. This evaluation is more of a sanity check to ensure the code above worked
as expected.
Code
pred = model.predict(x_test)
# Measure RMSE error. RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(pred, y_test))
print(f"After load score (RMSE): {score}")
Output
After load score (RMSE): 5.465193688130732
We want the Flask web service to check that the input JSON is valid. To do this, we need to know
what values we expect and their logical ranges. The following code outputs the expected fields and their
ranges, and packages all of this information into a JSON object that you should copy to the Flask web
application. This code allows us to validate the incoming JSON requests.
Code
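The generating code is not shown; a sketch that derives these ranges from the training dataframe (the column list is assumed from the dataset) would be:

import json

# Compute min/max per input column to validate incoming requests.
cols = ['cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'year', 'origin']
ranges = {col: {"min": df[col].min().item(), "max": df[col].max().item()}
          for col in cols}
print(json.dumps(ranges, indent=2))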
Output
{
    "cylinders": {"min": 3, "max": 8},
    "displacement": {"min": 68.0, "max": 455.0},
    "horsepower": {"min": 46.0, "max": 230.0},
    "weight": {"min": 1613, "max": 5140},
    "acceleration": {"min": 8.0, "max": 24.8},
    "year": {"min": 70, "max": 82},
    "origin": {"min": 1, "max": 3}
}
Finally, we set up the Python code to call the model for a single car and get a prediction. You should
also copy this code to the Flask web application.
Code
import os
from tensorflow.keras.models import load_model
import numpy as np

pred = model.predict(x)
float(pred[0])
Output
6.212100505828857
• mpg_server_1.py
You can run this server from the command line with the following command:
python mpg_server_1.py
If you are using a virtual environment (described in Module 1.1), use the activate tensorflow com-
mand for Windows or source activate tensorflow for Mac before executing the above command.
Code
import requests

json = {
    "cylinders": 8,
    "displacement": 300,
    "horsepower": 78,
    "weight": 3500,
    "acceleration": 20,
    "year": 76,
    "origin": 1
}
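The POST call itself is not shown above. A sketch follows; the endpoint path and port here are hypothetical, so use the route actually defined in mpg_server_1.py:

# Hypothetical URL; adjust to the route in mpg_server_1.py.
r = requests.post("http://localhost:5000/api/mpg", json=json)
if r.status_code == 200:
    print("Success: {}".format(r.text))
else:
    print("Failure: {}".format(r.text))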
Output
Success: {
    "errors": [],
    "id": "643d027e-554f-4401-ba5f-78592ae7e070",
    "mpg": 23.885438919067383
}
To successfully use Postman to query your web service, you must enter the following settings:
• Use "Form Data" and create one entry named "image" that is a file. Choose an image file to classify.
• Click Send, and you should get a correct result.
Figure 13.2 shows a successful result.
import requests

response = requests.post('http://localhost:5000/api/image', files=\
    dict(image=('hickory.jpeg', open('./photos/hickory.jpeg', 'rb'))))
if response.status_code == 200:
    print("Success: {}".format(response.text))
else:
    print("Failure: {}".format(response.text))
Output
Success: {
    "pred": [
        {
            "name": "boxer",
            "prob": 0.9178281426429749
        },
        {
            "name": "American_Staffordshire_terrier",
...
13.2 Part 13.2: Interrupting and Continuing Training

import os
import re
import sys
import time
import numpy as np
from typing import Any, List, Tuple, Union

import tensorflow as tf
import tensorflow.keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import EarlyStopping, \
    LearningRateScheduler, ModelCheckpoint
from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
def generate_output_dir(outdir, run_desc):
    prev_run_dirs = []
    if os.path.isdir(outdir):
        prev_run_dirs = [x for x in os.listdir(outdir) if os.path.isdir(
            os.path.join(outdir, x))]
    prev_run_ids = [re.match(r'^\d+', x) for x in prev_run_dirs]
    prev_run_ids = [int(x.group()) for x in prev_run_ids if x is not None]
    cur_run_id = max(prev_run_ids, default=-1) + 1
    run_dir = os.path.join(outdir, f'{cur_run_id:05d}-{run_desc}')
    assert not os.path.exists(run_dir)
    os.makedirs(run_dir)
    return run_dir
# From StyleGAN2
class Logger(object):
    """Redirect stderr to stdout, optionally print stdout to a file, and
    optionally force flushing on both stdout and the file."""

    def __init__(self, file_name=None, file_mode="w", should_flush=True):
        self.file = None
        if file_name is not None:
            self.file = open(file_name, file_mode)
        self.should_flush = should_flush
        self.stdout = sys.stdout
        self.stderr = sys.stderr
        sys.stdout = self
        sys.stderr = self

    def write(self, text):
        """Write text to stdout (and a file) and optionally flush."""
        if self.file is not None:
            self.file.write(text)
        self.stdout.write(text)
        if self.should_flush:
            self.flush()

    def flush(self):
        """Flush written text to both stdout and a file, if open."""
        if self.file is not None:
            self.file.flush()
        self.stdout.flush()

    def close(self):
        """Flush, close possible files, and remove stdout/stderr mirroring."""
        self.flush()
        # if using multiple loggers, prevent closing in wrong order
        if sys.stdout is self:
            sys.stdout = self.stdout
        if sys.stderr is self:
            sys.stderr = self.stderr
        if self.file is not None:
            self.file.close()
def obtain_data():
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    print("Shape of x_train: {}".format(x_train.shape))
    print("Shape of y_train: {}".format(y_train.shape))
    print()
    print("Shape of x_test: {}".format(x_test.shape))
    print("Shape of y_test: {}".format(y_test.shape))
    # input image dimensions
    img_rows, img_cols = 28, 28
    num_classes = 10  # the ten MNIST digit classes

    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
        x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
        input_shape = (1, img_rows, img_cols)
    else:
        x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
        x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
        input_shape = (img_rows, img_cols, 1)

    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
    print('x_train shape:', x_train.shape)
    print("Training samples: {}".format(x_train.shape[0]))
    print("Test samples: {}".format(x_test.shape[0]))

    # convert class vectors to binary class matrices
    y_train = tf.keras.utils.to_categorical(y_train, num_classes)
    y_test = tf.keras.utils.to_categorical(y_test, num_classes)
    # (return statement assumed; the listing's tail is not shown)
    return x_train, y_train, x_test, y_test, input_shape
We define the basic training parameters and where we wish to write the output.
Code
run_dir = generate_output_dir(outdir, run_desc)
print(f"Results saved to: {run_dir}")
Output
Keras provides a prebuilt checkpoint class named ModelCheckpoint that contains most of our desired
functionality. This built-in class can save the model’s state repeatedly as training progresses. Stopping
neural network training is not always a controlled event. Sometimes this stoppage can be abrupt, such
as a power failure or a network resource shutting down. If Microsoft Windows is your operating system
of choice, your training can also be interrupted by a high-priority system update. Because of all of this
uncertainty, it is best to save your model at regular intervals. This process is similar to saving a game at
critical checkpoints, so you do not have to start over if something terrible happens to your avatar in the
game.
We will create our checkpoint class, named MyModelCheckpoint. In addition to saving the model,
we also save the state of the training infrastructure. Why save the training infrastructure, in addition to
the weights? This technique eases the transition back into training for the neural network and will be more
efficient than a cold start.
Consider if you interrupted your college studies after the first year. Sure, your brain (the neural network)
will retain all the knowledge. But how much rework will you have to do? Your transcript at the university
is like the training parameters. It ensures you do not have to start over when you come back.
Code
class MyModelCheckpoint(ModelCheckpoint):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def on_epoch_end(self, epoch, logs=None):
        super().on_epoch_end(epoch, logs)
        # Also save the optimizer state
        filepath = self._get_file_path(epoch=epoch,
                                       logs=logs, batch=None)
        filepath = filepath.rsplit(".", 1)[0]
        filepath += ".pkl"
        with open(filepath, 'wb') as fp:
            pickle.dump({'opt': self.model.optimizer.get_config(),
                         'epoch': epoch + 1},
                        fp, protocol=pickle.HIGHEST_PROTOCOL)
The optimizer applies a step decay schedule during training to decrease the learning rate as training progresses. To resume training correctly, it is essential to preserve the epoch number we stopped on.
Code
return LearningRateScheduler(schedule)
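Only the final line of the schedule-building code survives above. A minimal sketch of a step-decay factory that ends in that line follows; the function and parameter names here are assumptions, not the book's exact code:

import numpy as np
from tensorflow.keras.callbacks import LearningRateScheduler

def build_step_decay_scheduler(initial_lr=1e-3, decay_factor=0.5,
                               step_size=10):
    # Multiply the learning rate by decay_factor every step_size epochs.
    def schedule(epoch):
        return initial_lr * (decay_factor ** np.floor(epoch / step_size))
    return LearningRateScheduler(schedule)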
We build the model just as we have in previous sessions. However, the training function requires a few
extra considerations. We specify the maximum number of epochs; however, we also allow the user to select
the starting epoch number for training continuation.
Code
checkpoint_cb = MyModelCheckpoint(
    os.path.join(run_dir, 'model-{epoch:02d}-{val_loss:.2f}.hdf5'),
    monitor='val_loss', verbose=1)

elapsed_time = time.time() - start_time
print("Elapsed time: {}".format(hms_string(elapsed_time)))
We now begin training, using the Logger class to write the output to a log file in the output directory.
Code
Output
...
469/469 - 2s - loss: 0.1575 - accuracy: 0.9541 - val_loss: 0.0837 -
val_accuracy: 0.9746 - lr: 7.5000e-05 - 2s/epoch - 5ms/step
Test loss: 0.08365701138973236
Test accuracy: 0.9746000170707703
Elapsed time: 0:00:22.09
You should notice that the above output displays the names of the hdf5 and pickle (pkl) files produced at each checkpoint. The following command lists the output directory:

!ls ./data
Output
00000-test-train
Code
!ls ./data/00000-test-train
Output
Keras stores the model itself in an HDF5, which includes the optimizer. Because of this feature, it is
not generally necessary to restore the internal state of the optimizer (such as ADAM). However, we include
the code to do so. We can obtain the internal state of an optimizer by calling get_config, which will
return a dictionary similar to the following:
{'name': 'Adam', 'learning_rate': 7.5e-05, 'decay': 0.0,
 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False}
In practice, I’ve found that different optimizers implement get_config differently. This function will
always return the training hyperparameters. However, it may not always capture the complete internal
state of an optimizer beyond the hyperparameters. The exact implementation of get_config can vary per
optimizer implementation.
The following code loads the HDF5 and PKL files and then recompiles the model based on the PKL
file. Depending on the optimizer in use, you might have to recompile the model.
Code
import tensorflow as tf
from tensorflow.keras.models import load_model
import pickle

# Load the saved model and optimizer state. The checkpoint filenames
# below are illustrative; use the files written to your run_dir.
model = load_model(os.path.join(run_dir, 'model-03-0.08.hdf5'))
with open(os.path.join(run_dir, 'model-03-0.08.pkl'), 'rb') as fp:
    state = pickle.load(fp)
opt = state['opt']

# note: often it is not necessary to recompile the model
model.compile(
    loss='categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam.from_config(opt),
    metrics=['accuracy'])
Finally, we train the model for additional epochs. You can see from the output that the new training
starts at a higher accuracy than the first training run. Further, the accuracy increases with additional
training. Also, you will notice that the epoch number begins at four and not one.
Code
run_dir = generate_output_dir(outdir, run_desc)
print(f"Results saved to: {run_dir}")
Output
13.3 Part 13.3: Using a Keras Deep Neural Network with a Web
Application
In this part, we will extend the image API developed in Part 13.1 to work with a web application. This
technique allows you to use a simple website to upload/predict images, such as in Figure 13.3.
I added neural network functionality to a simple ReactJS image upload and preview example. To do
this, we will use the same API developed in Module 13.1. However, we will now add a ReactJS website
around it. This single-page web application allows you to upload images for classification by the neural
network. If you would like to read more about ReactJS and image uploading, you can refer to the blog
post that provided some inspiration for this example.
13.4 Part 13.4: When to Retrain Your Neural Network
Once your application is released, you will hopefully obtain new data. This data will come from job seekers using your app. These people are your customers. You have x-values (their attributes), but you do not have y-values (their jobs). Your customers have come to you to find out what their best jobs will be. You will provide the customer's attributes to the neural network, and it will predict their jobs. Usually, companies develop neural networks on initial data and then use the neural network to perform predictions on new data obtained over time from their customers.
Your job prediction model will become less relevant as the industry introduces new job types and the demographics of your customers change. Companies must therefore check whether their model remains relevant as time passes. This change in your underlying data is called dataset drift. In this section, we will see ways that you can measure dataset drift.
You can present your model with new data and see how its accuracy changes over time. However, to calculate accuracy, you must know the expected outputs from the model (y-values), and you may not know the correct outcomes for new data that you are obtaining in real time. Therefore, we will look at algorithms that examine the x-inputs and determine how much their distribution has changed from the original x-inputs that we trained on.
Let’s begin by creating generated data that illustrates drift. We present the following code to create a
chart that shows such drift.
Code
import numpy as np
import matplotlib.pyplot as plot
from sklearn.linear_model import LinearRegression

def true_function(x):
    x2 = (x * 8) - 1
    return ((np.sin(x2) / x2) * 0.6) + 0.3

#
x_train = np.arange(0, 0.6, 0.01)
x_test = np.arange(0.6, 1.1, 0.01)
x_true = np.concatenate((x_train, x_test))

#
y_true_train = true_function(x_train)
y_true_test = true_function(x_test)
y_true = np.concatenate((y_true_train, y_true_test))

#
y_train = y_true_train + (np.random.rand(*x_train.shape) - 0.5) * 0.4
y_test = y_true_test + (np.random.rand(*x_test.shape) - 0.5) * 0.4

#
lr_x_train = x_train.reshape((x_train.shape[0], 1))
reg = LinearRegression().fit(lr_x_train, y_train)
reg_pred = reg.predict(lr_x_train)
print(reg.coef_[0])
print(reg.intercept_)

#
plot.xlim([0, 1.5])
plot.ylim([0, 1])
l1 = plot.scatter(x_train, y_train, c="g", label="Training Data")
l2 = plot.scatter(x_test, y_test, c="r", label="Testing Data")
l3, = plot.plot(lr_x_train, reg_pred, color='black', linewidth=3,
                label="Trained Model")
l4, = plot.plot(x_true, y_true, label="True Function")
plot.legend(handles=[l1, l2, l3, l4])

#
plot.title('Drift')
plot.xlabel('Time')
plot.ylabel('Sales')
plot.grid(True, which='both')
plot.show()
Output
-1.1979470956001936
0.9888340153211445
The "True function" represents what the data does over time. Unfortunately, you only have the training
512 CHAPTER 13. ADVANCED/OTHER TOPICS
portion of the data. Your model will do quite well on the data that you trained it with; however, it will
be very inaccurate on the new test data presented. The prediction line for the model fits the training data
well but does not fit the est data well.
Because Kaggle provides its datasets as separate training and test files, we must load both of them.
Code
import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
I provide a simple preprocess function that fills in missing values and converts all categorical columns to integer codes.
Code
def preprocess(df):
    for i in df.columns:
        if df[i].dtype == 'object':
            df[i] = df[i].fillna(df[i].mode().iloc[0])
        elif (df[i].dtype == 'int' or df[i].dtype == 'float'):
            df[i] = df[i].fillna(np.nanmedian(df[i]))

    enc = LabelEncoder()
    for i in df.columns:
        if (df[i].dtype == 'object'):
            df[i] = enc.fit_transform(df[i].astype('str'))
            df[i] = df[i].astype('object')
Next, we run the training and test datasets through the preprocessing function.
Code
preprocess(train_df)
preprocess(test_df)
Finally, we remove the target variable. We are only looking for drift on the x (input) data.
Code
13.4.2 KS-Statistic
We will use the KS statistic to determine the difference in distribution between columns in the training and test sets. As a baseline, consider what happens if we compare a field to itself; here we compare kitch_sq in the training set to the same column. Because there is no difference between a distribution and itself, the p-value is 1.0 and the KS statistic is 0. The p-value estimates the probability that the observed difference would occur if the two distributions were actually the same. A p-value below some threshold lets us reject this null hypothesis and conclude there is a difference; the value of 0.05 is a standard threshold. Because the p-value here is NOT below 0.05, we conclude the two distributions are the same. When the p-value is below the threshold, the statistic value becomes interesting: it tells you how different the two distributions are, with 0.0 meaning no difference at all.
Code
from scipy import stats
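# The call itself was lost in extraction; a sketch of the baseline
# self-comparison described above:
stats.ks_2samp(train_df['kitch_sq'], train_df['kitch_sq'])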
Output
Now let’s do something more interesting. We will compare the same field kitch_sq between the test
and training sets. In this case, the p-value is below 0.05, so the statistic value now contains the amount
of difference detected.
Code
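# The comparison was lost in extraction; a sketch matching the text
# (kitch_sq compared between the training and test sets):
stats.ks_2samp(train_df['kitch_sq'], test_df['kitch_sq'])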
Output
Next, we pull the KS-Stat for every field. We also establish a boundary for the maximum p-value and
how much of a difference is needed before we display the column.
Code
for col in train_df.columns:
    ks = stats.ks_2samp(train_df[col], test_df[col])
    if ks.pvalue < 0.05 and ks.statistic > 0.1:
        print(f'{col}: {ks}')
Output
...
cafe_sum_2000_max_price_avg:
Ks_2sampResult(statistic=0.10732529051140638,
pvalue=1.1100804327460878e-61)
cafe_avg_price_2000: Ks_2sampResult(statistic=0.1081218037860151,
pvalue=1.3575759911857293e-62)
Code
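# The sampling code was lost in extraction; a sketch that draws equally
# sized random samples from each set (SAMPLE_SIZE is an assumption):
SAMPLE_SIZE = min(len(train_df), len(test_df))
training_sample = train_df.sample(SAMPLE_SIZE, random_state=42)
testing_sample = test_df.sample(SAMPLE_SIZE, random_state=42)
print(SAMPLE_SIZE)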
Output
7662
We take the random samples from the training and test sets and add a flag called source_training
to tell the two apart.
Code
# Is the data from the training set?
training_sample['source_training'] = 1
testing_sample['source_training'] = 0
Next, we combine the data that we sampled from the training and test data sets and shuffle them.
Code
# Build combined training set
combined = testing_sample.append(training_sample)
combined.reset_index(inplace=True, drop=True)

# Now randomize
combined = combined.reindex(np.random.permutation(combined.index))
combined.reset_index(inplace=True, drop=True)
We will now generate x and y to train. We attempt to predict the source_training value as y, which
indicates if the data came from the training or test set. If the model successfully uses the data to predict
if it came from training or testing, then there is likely drift. Ideally, the train and test data should be
indistinguishable.
Code
# Get ready to train
y = combined['source_training'].values
combined.drop('source_training', axis=1, inplace=True)
x = combined.values
Output
array([1, 1, 1, ..., 1, 0, 0])
We will consider anything above a 0.75 AUC as having a good chance of drift.
Code
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=60, max_depth=7,
                               min_samples_leaf=5)
lst = []
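# The scoring loop was lost in extraction; a sketch of the per-column
# drift check: cross-validate the classifier on one column at a time
# and report the mean AUC (the 3-fold setup is an assumption):
from sklearn.model_selection import cross_val_score

for i, col in enumerate(combined.columns):
    auc = cross_val_score(model, x[:, i].reshape(-1, 1), y,
                          cv=3, scoring='roc_auc').mean()
    if auc > 0.75:
        lst.append((col, auc))
        print(f'{col} {auc}')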
Output
id                          1.0
timestamp                   0.9601862111975688
full_sq                     0.7966785611424911
life_sq                     0.8724218330166038
build_year                  0.8004825176688191
kitch_sq                    0.9070093804672634
cafe_sum_500_min_price_avg  0.8435920036035689
cafe_avg_price_500          0.8453533835344671
13.5 Part 13.5: Tensor Processing Units (TPUs)
Code
import os
import pandas as pd

if COLAB:
    PATH = "/content"
else:
    # I used this locally on my machine, you may need different
    PATH = "/Users/jeff/temp"

# Download paperclip data
!wget -O {os.path.join(PATH, DOWNLOAD_NAME)} {DOWNLOAD_SOURCE}
!mkdir -p {SOURCE}
!mkdir -p {TARGET}
!mkdir -p {EXTRACT_TARGET}

# Add filenames
df_train = pd.read_csv(os.path.join(SOURCE, "train.csv"))
df_train['filename'] = "clips-" + df_train.id.astype(str) + ".jpg"
import tensorflow as tf
import keras_preprocessing
import glob, os
import tqdm
import numpy as np
from PIL import Image

IMG_SHAPE = (128, 128)
BATCH_SIZE = 32

# Process training data
df_train = pd.read_csv(os.path.join(SOURCE, "train.csv"))
df_train['filename'] = "clips-" + df_train.id.astype(str) + ".jpg"

# Load images
images = [os.path.join(SOURCE, x) for x in df_train.filename]
x = load_images(images, IMG_SHAPE)
y = df_train.clip_count.values

# Convert to dataset
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.batch(BATCH_SIZE)
TPUs are typically Cloud TPU workers, which differ from the local process running the user's Python program. Thus, some initialization work is needed to connect to the remote cluster and initialize the TPUs. The tpu argument to tf.distribute.cluster_resolver.TPUClusterResolver is a unique address just for Colab. If you are running your code on Google Compute Engine (GCE), you should instead pass in the name of your Cloud TPU. The following code performs this initialization.
Code
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
    print("Device:", tpu.master())
    strategy = tf.distribute.TPUStrategy(tpu)
except:
    strategy = tf.distribute.get_strategy()

print("Number of replicas:", strategy.num_replicas_in_sync)
We will now use a ResNet neural network as a basis for our neural network. We begin by loading the ResNet50 network from Keras. We will redefine both the input shape and output of the ResNet model, so we will not transfer the weights. Since we redefine the input, the weights are of minimal value. We specify include_top as False because we will change the input resolution. We also specify weights as None because we must retrain the network after changing the input layers.
Code
from tensorflow.keras.layers import Input, Dense, GlobalAveragePooling2D
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model

def create_model():
    input_tensor = Input(shape=IMG_SHAPE + (3,))
    base_model = ResNet50(
        include_top=False, weights=None, input_tensor=input_tensor,
        input_shape=None)
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dense(1024, activation='relu')(x)
    x = Dense(1024, activation='relu')(x)
    model = Model(inputs=base_model.input, outputs=Dense(1)(x))
    return model

with strategy.scope():
    model = create_model()
Output
...
32/32 [==============================] - 1s 44ms/step - loss: 18.3960 - rmse: 4.2891
Epoch 100/100
32/32 [==============================] - 1s 44ms/step - loss: 10.4749 - rmse: 3.2365
You might receive the following error while fitting the neural network.
InvalidArgumentError: Unable to parse tensor proto
If you do receive this error, it is likely because you are missing proper authentication to access Google
Drive to store your datasets.
Chapter 14

Other Neural Network Techniques

14.1 Part 14.1: What is AutoML

AutoML frameworks currently available include:
• AutoKeras
• Auto-SKLearn
• Auto PyTorch
• TPOT
This module will show how to use AutoKeras. First, we download the paperclips counting dataset that
you saw previously in this book.
Code
import os
import pandas as pd

if COLAB:
    PATH = "/content"
else:
    # I used this locally on my machine, you may need different
    PATH = "/Users/jeff/temp"

# Download paperclip data
!wget -O {os.path.join(PATH, DOWNLOAD_NAME)} {DOWNLOAD_SOURCE}
!mkdir -p {SOURCE}
!mkdir -p {TARGET}
!mkdir -p {EXTRACT_TARGET}
!unzip -o -j -d {SOURCE} {os.path.join(PATH, DOWNLOAD_NAME)} >/dev/null

# Process training data
df_train = pd.read_csv(os.path.join(SOURCE, "train.csv"))
df_train['filename'] = "clips-" + df_train.id.astype(str) + ".jpg"
One limitation of AutoKeras is that it cannot directly utilize generators. Without resorting to complex
techniques, all training data must reside in RAM. We will use the following code to load the image data
to RAM.
Code
import tensorflow as tf
import keras_preprocessing
import glob, os
import tqdm
import numpy as np
from PIL import Image

IMG_SHAPE = (128, 128)

!pip install autokeras
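# The loader did not survive extraction; a minimal sketch of a
# load_images helper consistent with how it is used in this chapter
# (reading with PIL and resizing; the lack of normalization is an
# assumption):
def load_images(files, img_shape):
    cnt = len(files)
    x = np.zeros((cnt,) + img_shape + (3,), dtype=np.float32)
    for i, file in tqdm.tqdm(enumerate(files), total=cnt):
        img = Image.open(file).convert('RGB').resize(img_shape)
        x[i] = np.asarray(img)
    return x

x = load_images([os.path.join(SOURCE, f) for f in df_train.filename],
                IMG_SHAPE)
y = df_train.clip_count.values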
AutoKeras contains several examples demonstrating image, tabular, and time-series data. We will make use of the ImageRegressor. Refer to the AutoKeras documentation for other classifiers and regressors that fit specific uses.
We define several variables to determine how AutoKeras operates.
Setting MAX_TRIALS and EPOCHS will have a great impact on your total runtime. You must balance how many models to try (MAX_TRIALS) against how deeply to train each one (EPOCHS). AutoKeras utilizes early stopping, so setting EPOCHS too high simply means that early stopping will prevent you from reaching the EPOCHS number of epochs.
One strategy is to do a broad, shallow search: set MAX_TRIALS high and EPOCHS low. The resulting model likely has the best hyperparameters. Finally, train this resulting model fully.
Code
import numpy as np
import autokeras as ak

MAX_TRIALS = 2
SEED = 42
VAL_SPLIT = 0.1
EPOCHS = 1000
BATCH_SIZE = 32

auto_reg = ak.ImageRegressor(overwrite=True,
                             max_trials=MAX_TRIALS,
                             seed=SEED)
auto_reg.fit(x, y, validation_split=VAL_SPLIT, batch_size=BATCH_SIZE,
             epochs=EPOCHS)
print(auto_reg.evaluate(x, y))
Output
Trial 2 Complete [00h 04m 17s]
val_loss: 36.5126953125

Best val_loss So Far: 36.123992919921875
Total elapsed time: 01h 05m 46s
INFO:tensorflow:Oracle triggered exit
...
32/32 [==============================] - 3s 85ms/step - loss: 24.9218 - mean_squared_error: 24.9218
Epoch 1000/1000
32/32 [==============================] - 2s 78ms/step - loss: 24.9141 - mean_squared_error: 24.9141
INFO:tensorflow:Assets written to: ./image_regressor/best_model/assets
32/32 [==============================] - 2s 30ms/step - loss: 24.9077 - mean_squared_error: 24.9077
[24.90774917602539, 24.90774917602539]
This top model can be saved and either utilized or trained further.
Code
model = auto_reg.export_model()  # export the best model (AutoKeras API)
try:
    model.save("model_autokeras", save_format="tf")
except Exception:
    model.save("model_autokeras.h5")

from tensorflow.keras.models import load_model
loaded_model = load_model("model_autokeras",
                          custom_objects=ak.CUSTOM_OBJECTS)
print(loaded_model.evaluate(x, y))
Output
# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()
Next, we will attempt to approximate a slightly random variant of the trigonometric sine function.
Code
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt
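# The data-generation step was lost in extraction; a sketch of a noisy
# sine curve consistent with the text (range, step, and noise level are
# assumptions):
rng = np.random.default_rng(42)
x = np.arange(0, 25.6, 0.1, dtype=np.float32).reshape(-1, 1)
y = np.sin(x.flatten()) + rng.uniform(-0.1, 0.1, size=len(x))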
model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(25, activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x, y, verbose=0, batch_size=len(x), epochs=25000)

pred = model.predict(x)
chart_regression(pred.flatten(), y, sort=False)
Output
Actual
[[0.00071864]
 [0.01803382]
 [0.11465593]
 [0.1213861 ]
 [0.1712333 ]]
Pred
[[0.00078334]
 [0.0180243 ]
 [0.11705872]
 [0.11838552]
 [0.17200738]]
As you can see, the neural network creates a reasonably close approximation of the random sine function.
from sklearn import metrics

model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(25, activation='relu'))
model.add(Dense(2))  # Two output neurons
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x, y, verbose=0, batch_size=len(x), epochs=25000)

# Fit regression DNN model.
pred = model.predict(x)
score = np.sqrt(metrics.mean_squared_error(pred, y))
print("Score (RMSE): {}".format(score))
np.set_printoptions(suppress=True)
Output
Score (RMSE): 0.06136952220466956
Predicted:
[[2.720404   1.590426  ]
 [2.7611256  1.5165515 ]
 [2.9106038  1.2454026 ]
 [3.005532   1.0359662 ]
 [3.0415256  0.90731066]]
Expected:
[[2.70765313 1.59317888]
 [2.75138445 1.51640628]
 [2.89299999 1.22480835]
 [2.97603942 1.00637655]
 [3.01381723 0.88685404]]
The following program demonstrates a very simple autoencoder that learns to encode a sequence of numbers. Fewer hidden neurons will make it more difficult for the autoencoder to reproduce the input.
Code
from sklearn import metrics
import numpy as np
import pandas as pd
from IPython.display import display, HTML
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

x = np.array([range(10)]).astype(np.float32)
print(x)

model = Sequential()
model.add(Dense(3, input_dim=x.shape[1], activation='relu'))
model.add(Dense(x.shape[1]))  # Multiple output neurons
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x, x, verbose=0, epochs=1000)

pred = model.predict(x)
score = np.sqrt(metrics.mean_squared_error(pred, x))
print("Score (RMSE): {}".format(score))
np.set_printoptions(suppress=True)
print(pred)
Output
[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]
Score (RMSE): 0.024245187640190125
[[0.00000471 1.0009701  2.0032287  3.000911   4.0012217  5.0025473
  6.025212   6.9308095  8.014739   9.014762 ]]
%matplotlib inline
from PIL import Image, ImageFile
from matplotlib.pyplot import imshow
from tensorflow.keras.optimizers import SGD
import requests
from io import BytesIO

model = Sequential()
model.add(Dense(10, input_dim=img_array.shape[1], activation='relu'))
model.add(Dense(img_array.shape[1]))  # Multiple output neurons
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(img_array, img_array, verbose=0, epochs=20)
Output
49152
[[203. 217. 240. ...  94.  92.  68.]]
Neural network output
[[238.31088 239.55913 194.47536 ...  67.12295  66.15083  74.94332]]
[[203. 217. 240. ...  94.  92.  68.]]
Code
%matplotlib inline
from PIL import Image, ImageFile
from matplotlib.pyplot import imshow
import requests
import numpy as np
from io import BytesIO
from IPython.display import display, HTML

images = [
    "https://data.heatonresearch.com/images/jupyter/Brown_Hall.jpeg",
    "https://data.heatonresearch.com/images/jupyter/brookings.jpeg",
    "https://data.heatonresearch.com/images/jupyter/WUSTLKnight.jpeg"
]

def make_square(img):
    cols, rows = img.size
    if rows > cols:
        pad = (rows - cols) / 2
        img = img.crop((pad, 0, cols, cols))
    else:
        pad = (cols - rows) / 2
        img = img.crop((0, pad, rows, rows))
    return img

x = []

for url in images:
    ImageFile.LOAD_TRUNCATED_IMAGES = False
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    img = Image.open(BytesIO(response.content))
    img.load()
    img = make_square(img)
    img = img.resize((128, 128), Image.ANTIALIAS)
    img_array = np.asarray(img).flatten().astype(np.float32)
    img_array = (img_array - 128) / 128
    x.append(img_array)

x = np.array(x)
print(x.shape)
Output
...
Autoencoders can learn the same encoding for multiple images. The following code learns a single encoding
for numerous images.
Code
%matplotlib inline
from PIL import Image, ImageFile
from matplotlib.pyplot import imshow
import requests
from io import BytesIO
from sklearn import metrics
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython.display import display, HTML

# Fit regression DNN model.
print("Creating/Training neural network")
model = Sequential()
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))
model.add(Dense(x.shape[1]))  # Multiple output neurons
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x, x, verbose=0, epochs=1000)

print("Score neural network")
pred = model.predict(x)

cols, rows = img.size
for i in range(len(pred)):
    print(pred[i])
    img_array2 = pred[i].reshape(rows, cols, 3)
    img_array2 = (img_array2 * 128) + 128
    img_array2 = img_array2.astype(np.uint8)
    img2 = Image.fromarray(img_array2, 'RGB')
    display(img2)
Output
Creating/Training neural network
Score neural network
WARNING:tensorflow:5 out of the last 11 calls to <function
Model.make_predict_function.<locals>.predict_function at
0x7fe605654320> triggered tf.function retracing. Tracing is expensive
and the excessive number of tracings could be due to (1) creating
@tf.function repeatedly in a loop, (2) passing tensors with different
shapes, (3) passing Python objects instead of tensors. For (1), please
define your @tf.function outside of the loop. For (2), @tf.function
has experimental_relax_shapes=True option that relaxes argument shapes
that can avoid unnecessary retracing. For (3), please refer to
https://www.tensorflow.org/guide/function#controlling_retracing and
https://www.tensorflow.org/api_docs/python/tf/function for more
details.
[ 0.98446846  0.9844943   0.98456836 ... -0.17971231 -0.20315537
 -0.20320868]
...
Code
%matplotlib inline

def add_noise(a):
    a2 = a.copy()
    rows = a2.shape[0]
    cols = a2.shape[1]
    s = int(min(rows, cols) / 20)  # size of spot is 1/20 of smallest dimension

    for i in range(100):
        x = np.random.randint(cols - s)
        y = np.random.randint(rows - s)
        a2[y:(y + s), x:(x + s)] = 0

    return a2

img_array = np.asarray(img)
rows = img_array.shape[0]
cols = img_array.shape[1]
print("Rows: {}, Cols: {}".format(rows, cols))

# Create new image
img2_array = img_array.astype(np.uint8)
print(img2_array.shape)
Output
Rows: 768, Cols: 1024
(768, 1024, 3)
%matplotlib inline
from PIL import Image, ImageFile
from matplotlib.pyplot import imshow
import requests
import numpy as np
from io import BytesIO

images = [
    "https://data.heatonresearch.com/images/jupyter/Brown_Hall.jpeg",
    "https://data.heatonresearch.com/images/jupyter/brookings.jpeg",
    "https://data.heatonresearch.com/images/jupyter/WUSTLKnight.jpeg"
]

def make_square(img):
    cols, rows = img.size
    if rows > cols:
        pad = (rows - cols) / 2
        img = img.crop((pad, 0, cols, cols))
    else:
        pad = (cols - rows) / 2
        img = img.crop((0, pad, rows, rows))
    return img

x = []
y = []
loaded_images = []

for url in images:
    ImageFile.LOAD_TRUNCATED_IMAGES = False
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    img = Image.open(BytesIO(response.content))
    img.load()
    img = make_square(img)
    img = img.resize((128, 128), Image.ANTIALIAS)

    img_array = np.asarray(img)
    img_array_noise = add_noise(img_array)

    img_array = img_array.flatten()
    img_array = img_array.astype(np.float32)
    img_array = (img_array - 128) / 128

    img_array_noise = img_array_noise.flatten()
    img_array_noise = img_array_noise.astype(np.float32)
    img_array_noise = (img_array_noise - 128) / 128

    x.append(img_array_noise)
    y.append(img_array)

x = np.array(x)
y = np.array(y)
print(x.shape)
print(y.shape)
Output
...
We now train the autoencoder neural network to transform the noisy images into clean images.
Code
%matplotlib inline
from PIL import Image, ImageFile
from matplotlib.pyplot import imshow
import requests
from io import BytesIO
from sklearn import metrics
import numpy as np
import pandas as pd
import tensorflow as tf
from IPython.display import display, HTML

# Fit regression DNN model.
print("Creating/Training neural network")
model = Sequential()
model.add(Dense(100, input_dim=x.shape[1], activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(x.shape[1]))  # Multiple output neurons
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x, y, verbose=1, epochs=20)
Output
Creating/Training neural network
...
1/1 [==============================] - 0s 105ms/step - loss: 0.0068
Epoch 20/20
1/1 [==============================] - 0s 110ms/step - loss: 0.0056
Neural network trained
Code
for z in range(3):
    print("*** Trial {}".format(z + 1))

    # Add noise
    img_array_noise = add_noise(img_array)

    # Display noisy image
    img2 = img_array_noise.astype(np.uint8)
    img2 = Image.fromarray(img2, 'RGB')
    print("With noise:")
    display(img2)

    # Present noisy image to autoencoder
    img_array_noise = img_array_noise.flatten()
    img_array_noise = img_array_noise.astype(np.float32)
    img_array_noise = (img_array_noise - 128) / 128
    img_array_noise = np.array([img_array_noise])
    pred = model.predict(img_array_noise)[0]

    # Display neural result
    img_array2 = pred.reshape(rows, cols, 3)
    img_array2 = (img_array2 * 128) + 128
    img_array2 = img_array2.astype(np.uint8)
    img2 = Image.fromarray(img_array2, 'RGB')
    print("After auto encode noise removal")
    display(img2)
Output
*** Trial 1
With noise:
*** Trial 2
With noise:
*** Trial 3
With noise:
14.3 Part 14.3: Anomaly Detection in Keras
Code
import pandas as pd
from tensorflow.keras.utils import get_file

try:
    path = get_file('kdd-with-columns.csv', origin=
        'https://github.com/jeffheaton/jheaton-ds2/raw/main/'
        'kdd-with-columns.csv', archive_format=None)
except:
    print('Error downloading')
    raise

print(path)

df = pd.read_csv(path)

# display 5 rows
pd.set_option('display.max_columns', 5)
pd.set_option('display.max_rows', 5)
df
Output
The KDD99 dataset contains many columns that define the network state over time intervals during which a cyber attack might have taken place. The "outcome" column specifies either "normal," indicating no attack, or the type of attack performed. The following code displays the counts for each type of attack and "normal".
Code
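# The display code was lost in extraction; a sketch that produces the
# alphabetical counts shown below:
df.groupby('outcome')['outcome'].count()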
Output
outcome
back.                 2203
buffer_overflow.        30
...
warezclient.          1020
warezmaster.            20
Name: outcome, Length: 23, dtype: int64
14.3.2 Preprocessing
We must perform some preprocessing before we can feed the KDD99 data into the neural network. We
provide the following two functions to assist with preprocessing. The first function converts numeric
columns into Z-Scores. The second function replaces categorical values with dummy variables.
Code
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd

# Encode text values to dummy variables (i.e. [1,0,0],[0,1,0],[0,0,1]
# for red, green, blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
This code converts all numeric columns to Z-Scores and all textual columns to dummy variables. We
now use these functions to preprocess each of the columns. Once the program preprocesses the data, we
display the results.
Code
# Now encode the feature vector
for name in df.columns:
    if name == 'outcome':
        pass
    elif name in ['protocol_type', 'service', 'flag', 'land', 'logged_in',
                  'is_host_login', 'is_guest_login']:
        encode_text_dummy(df, name)
    else:
        encode_numeric_zscore(df, name)

# display 5 rows
Output
To perform anomaly detection, we divide the data into two groups: "normal" and the various attacks. The following code divides the data into two data frames and displays each group's size.
Code
# Divide into "normal" and attack groups based on the outcome column
normal_mask = df['outcome'] == 'normal.'
attack_mask = df['outcome'] != 'normal.'
df.drop('outcome', axis=1, inplace=True)

df_normal = df[normal_mask]
df_attack = df[attack_mask]

print(f"Normal count: {len(df_normal)}")
print(f"Attack count: {len(df_attack)}")
Output
Next, we convert these two data frames into Numpy arrays. Keras requires this format for data.
Code
# This is the numeric feature vector, as it goes to the neural net
x_normal = df_normal.values
x_attack = df_attack.values

from sklearn.model_selection import train_test_split

x_normal_train, x_normal_test = train_test_split(
    x_normal, test_size=0.25, random_state=42)
Output
We are now ready to train the autoencoder on the normal data. The autoencoder will learn to compress the data to a vector of just three numbers. The autoencoder should also be able to decompress the data with reasonable accuracy. As is typical for autoencoders, we are merely training the neural network to produce the same output values as were fed to the input layer.
Code
from sklearn import metrics
import numpy as np
import pandas as pd
from IPython.display import display, HTML
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(25, input_dim=x_normal.shape[1], activation='relu'))
model.add(Dense(3, activation='relu'))  # size to compress to
model.add(Dense(25, activation='relu'))
model.add(Dense(x_normal.shape[1]))  # Multiple output neurons
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_normal_train, x_normal_train, verbose=1, epochs=100)
Output
...
2280/2280 [==============================] - 6s 3ms/step - loss: 0.0512
Epoch 100/100
2280/2280 [==============================] - 5s 2ms/step - loss: 0.0562
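The evaluation step that followed did not survive extraction. The idea is to compare the reconstruction error (RMSE) of normal traffic, which the autoencoder was trained on, against attack traffic, which it has never seen; a much higher attack RMSE flags the attacks as anomalies. A minimal sketch of that comparison:

pred = model.predict(x_normal_test)
score1 = np.sqrt(metrics.mean_squared_error(pred, x_normal_test))
pred = model.predict(x_attack)
score2 = np.sqrt(metrics.mean_squared_error(pred, x_attack))
print(f"Out-of-sample normal RMSE: {score1}")
print(f"Attack RMSE: {score2}")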
14.4 Part 14.4: Training an Intrusion Detection System with KDD99
Code
import pandas as pd
from tensorflow.keras.utils import get_file

try:
    path = get_file('kdd-with-columns.csv', origin=
        'https://github.com/jeffheaton/jheaton-ds2/raw/main/'
        'kdd-with-columns.csv', archive_format=None)
except:
    print('Error downloading')
    raise

print(path)

df = pd.read_csv(path)

# display 5 rows
pd.set_option('display.max_columns', 5)
pd.set_option('display.max_rows', 5)
df
Output
Before we preprocess the KDD99 dataset, let’s look at the individual columns and distributions. You can
use the following script to give a high-level overview of how a dataset appears.
Code
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore

def expand_categories(values):
    result = []
    s = values.value_counts()
    t = float(len(values))
    for v in s.index:
        result.append("{}:{}%".format(v, round(100 * (s[v] / t), 2)))
    return "[{}]".format(",".join(result))

def analyze(df):
    print()
    cols = df.columns.values
    total = float(len(df))

    print("{} rows".format(int(total)))
    for col in cols:
        uniques = df[col].unique()
        unique_count = len(uniques)
        if unique_count > 100:
            print("** {}:{} ({}%)".format(
                col, unique_count, int((unique_count / total) * 100)))
        else:
            print("** {}:{}".format(col, expand_categories(df[col])))
The analysis looks at how many unique values are present. For example, duration, a numeric value, has 2495 unique values, which rounds to 0% of the total rows. A text/categorical value such as protocol_type has only a few unique values, and the program shows the percentages of each. Columns with many unique values do not have their item counts shown to save display space.
Code
# Analyze KDD-99
analyze(df)
Output
494021 rows
** duration:2495 (0%)
** protocol_type:[icmp:57.41%,tcp:38.47%,udp:4.12%]
** service:[ecr_i:56.96%,private:22.45%,http:13.01%,smtp:1.97%,
other:1.46%,domain_u:1.19%,ftp_data:0.96%,eco_i:0.33%,ftp:0.16%,
finger:0.14%,urp_i:0.11%,telnet:0.1%,ntp_u:0.08%,auth:0.07%,
pop_3:0.04%,time:0.03%,csnet_ns:0.03%,remote_job:0.02%,gopher:0.02%,
imap4:0.02%,discard:0.02%,domain:0.02%,iso_tsap:0.02%,systat:0.02%,
shell:0.02%,echo:0.02%,rje:0.02%,whois:0.02%,sql_net:0.02%,
printer:0.02%,nntp:0.02%,courier:0.02%,sunrpc:0.02%,netbios_ssn:0.02%,
mtp:0.02%,vmnet:0.02%,uucp_path:0.02%,uucp:0.02%,klogin:0.02%,
bgp:0.02%,ssh:0.02%,supdup:0.02%,nnsp:0.02%,login:0.02%,
hostnames:0.02%,efs:0.02%,daytime:0.02%,link:0.02%,netbios_ns:0.02%,
pop_2:0.02%,ldap:0.02%,netbios_dgm:0.02%,exec:0.02%,http_443:0.02%,
kshell:0.02%,name:0.02%,ctf:0.02%,netstat:0.02%,Z39_50:0.02%,IRC:0.01%,
urh_i:0.0%,X11:0.0%,tim_i:0.0%,pm_dump:0.0%,tftp_u:0.0%,red_i:0.0
...
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd

# Encode text values to dummy variables (i.e. [1,0,0],
# [0,1,0],[0,0,1] for red, green, blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
Again, just as we did for anomaly detection, we preprocess the data set. We convert all numeric values to Z-Scores and translate all categorical values to dummy variables.
Code
# Now encode the feature vector
# display 5 rows

# Convert to numpy - Classification
x_columns = df.columns.drop('outcome')
x = df[x_columns].values

dummies = pd.get_dummies(df['outcome'])  # Classification
outcomes = dummies.columns
num_classes = len(outcomes)
y = dummies.values
We will attempt to predict what type of attack is underway. The outcome column specifies the attack
type. A value of normal indicates that there is no attack underway. We display the outcomes; some attack
types are much rarer than others.
Code
Output
outcome
back.                 2203
buffer_overflow.        30
...
warezclient.          1020
warezmaster.            20
Name: outcome, Length: 23, dtype: int64
import pandas as pd
import io
import requests
import numpy as np
import os
from sklearn.model_selection import train_test_split
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

# Create a test/train split. 25% test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

# Create neural net
model = Sequential()
model.add(Dense(10, input_dim=x.shape[1], activation='relu'))
model.add(Dense(50, input_dim=x.shape[1], activation='relu'))
model.add(Dense(10, input_dim=x.shape[1], activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
                        patience=5, verbose=1, mode='auto',
                        restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_test, y_test),
          callbacks=[monitor], verbose=2, epochs=1000)
Output
...
11579/11579 - 22s - loss: 0.0139 - val_loss: 0.0153 - 22s/epoch - 2ms/step
Epoch 19/1000
Restoring model weights from the end of the best epoch: 14.
11579/11579 - 23s - loss: 0.0141 - val_loss: 0.0152 - 23s/epoch - 2ms/step
Epoch 19: early stopping
We can now evaluate the neural network. As you can see, the neural network achieves a 99% accuracy
rate.
Code
# Measure accuracy
pred = model.predict(x_test)
pred = np.argmax(pred, axis=1)
y_eval = np.argmax(y_test, axis=1)
score = metrics.accuracy_score(y_eval, pred)
print("Validation score: {}".format(score))
Output
Validation score: 0.9977005165740935

14.5 Part 14.5: New Technologies
This section seeks only to provide a high-level overview of these emerging technologies. I provide links to
supplemental material and code in each subsection. I describe these technologies in the following sections.
Transformers are a relatively new technology that I will soon add to this course. They have resulted in many NLP applications. Projects such as the Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT-1, 2, 3) have received much attention from practitioners. Transformers allow sequence-to-sequence machine learning, letting the model use variable-length, potentially textual, input. The output from the transformer is also a variable-length sequence. This feature enables the transformer to learn to perform such tasks as translation between human languages or even complicated NLP-based classification. Considerable compute power is needed to take advantage of transformers; thus, you should take advantage of transfer learning to train and fine-tune your transformers.
Complex models can require considerable training time. It is not unusual to see a GPU cluster train for days to achieve state-of-the-art results, and this complexity carries a substantial monetary cost. Because of this cost, you must consider transfer learning. Services such as Hugging Face and NVIDIA GPU Cloud (NGC) contain many advanced pretrained neural networks for you to implement.
Augmentation is a technique where algorithms generate additional training data by creating modified versions of the original training items. This technique has seen many applications in computer vision. In the most basic example, the algorithm can flip images vertically and horizontally to quadruple the training set's size, as sketched below. Projects such as NVIDIA StyleGAN3 ADA have implemented augmentation to substantially decrease the amount of training data that the algorithm needs.
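As a tiny illustration of the flip idea (this sketch is not from the original text; img stands in for any training image array):

import numpy as np

img = np.random.rand(64, 64, 3)  # stand-in for a real training image
augmented = [img,
             np.fliplr(img),             # horizontal flip
             np.flipud(img),             # vertical flip
             np.flipud(np.fliplr(img))]  # both flips
print(len(augmented))  # 4x the original data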
Currently, this course makes use of TF-Agents to implement reinforcement learning. TF-Agents is
convenient because it is based on TensorFlow. However, TF-Agents has been slow to update compared
to other frameworks. Additionally, when TF-Agents is updated, internal errors are often introduced that
can take months for the TF-Agents team to fix. When I compare simple "Hello World" type examples
for Atari games on platforms like Stable Baselines to their TF-Agents equivalents, I am left wanting more
from TF-Agents.