
INDEX

S.NO TOPICS PAGE.NO


Week 1

1 Course Introduction 5

2 History 26

3 Image Formation 49

4 Image Representation 63

5 Linear Filtering 79

6 Image in Frequency Domain 103

7 Image Sampling 129

Week 2

8 Edge Detection 155

9 From Edges to Blobs and Corners 189

10 Scale Space, Image Pyramids and Filter Banks 215

11 Feature Detectors: SIFT and Variants 242

12 Image Segmentation 280

13 Other Feature Spaces 312

14 Human Visual System 331

Week 3

15 Feature Matching 351

16 Hough Transform 395

17 From Points to Images: Bag-of-Words and VLAD Representations 416

18 Image Descriptor Matching 438

19 Pyramid Matching 454

20 From Traditional Vision to Deep Learning 486

Week 4

21 Neural Networks: A Review - Part 1 493

22 Neural Networks: A Review - Part 2 514

23 Feedforward Neural Networks and Backpropagation - Part 1 526

24 Feedforward Neural Networks and Backpropagation - Part 2 541

25 Gradient Descent and Variants - Part 1 554

26 Gradient Descent and Variants - Part 2 580

27 Regularization in Neural Networks - Part 1 601

28 Regularization in Neural Networks - Part 2 631

29 Improving Training of Neural Networks - Part 1 664

30 Improving Training of Neural Networks - Part 2 680

Week 5

31 Convolutional Neural Networks: An Introduction - Part 01 695

32 Convolutional Neural Networks: An Introduction - Part 02 736

33 Backpropagation in CNNs 759

34 Evolution of CNN Architectures for Image Classification-Part 01 772

35 Evolution of CNN Architectures for Image Classification-Part 02 790

36 Recent CNN Architectures 806

37 Finetuning in CNNs 827

Week 6

38 Explaining CNNs: Visualization Methods 837

39 Explaining CNNs: Early Methods 853

40 Explaining CNNs: Class Attribution Map Methods 868

41 Explaining CNNs: Recent Methods-Part 01 893

42 Explaining CNNs: Recent Methods-part 02 908

43 Going Beyond Explaining CNNs 927

Week 7

44 CNNs for Object Detection-I-PART 01 943

45 CNNs for Object Detection-I-PART 02 966

46 CNNs for Object Detection-II 986

47 CNNs for Segmentation 1013

48 CNNs for Human Understanding: Faces-PART 01 1039

49 CNNs for Human Understanding: Faces-PART 02 1050

50 CNNs for Human Understanding: Human Pose and Crowd 1071

51 CNNs for Other Image Tasks 1089

Week 8

52 Recurrent Neural Networks: Introduction 1102

53 Backpropagation in RNNs 1135

54 LSTMs and GRUs 1147

55 Video Understanding using CNNs and RNNs 1169

Week 9

56 Attention in Vision Models: An Introduction 1186

57 Vision and Language: Image Captioning 1212

58 Beyond Captioning: Visual QA, Visual Dialog 1250

59 Other Attention Models 1279

60 Self-Attention and Transformers 1307

Week 10

61 Deep Generative Models: An Introduction 1329

62 Generative Adversarial Networks-Part 01 1345

63 Generative Adversarial Networks-Part 02 1358

64 Variational Autoencoders 1376

65 Combining VAEs and GANs 1398

66 Beyond VAEs and GANs: Other Deep Generative Models - Part 01 1419

67 Beyond VAEs and GANs: Other Deep Generative Models - Part 02 1439

Week 11

68 GAN Improvements 1461

69 Deep Generative Models across Multiple Domains 1485

70 VAEs and Disentanglement 1509

71 Deep Generative Models: Image Applications 1529

72 Deep Generative Models: Video Applications 1545

Week 12

73 Few-shot and Zero-shot Learning - Part 01 1568

74 Few-shot and Zero-shot Learning - Part 02 1579

75 Self-Supervised Learning 1601

76 Adversarial Robustness 1621

77 Pruning and Model Compression 1649

78 Neural Architecture Search 1667

79 Course Conclusion 1684

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Course Introduction

(Refer Slide Time: 0:15)

Hello and welcome to this first lecture for this course Deep Learning for Computer Vision.
My name is Vineeth Balasubramanian. I am a faculty member in the Department of Computer Science
and Engineering at IIT Hyderabad. I hope all of you had a chance to look at the welcome
video for this course, which may have given you an introduction to what to expect. We will
however revisit some of those things in this particular lecture and talk about what to expect
and why we need to study this particular course.

(Refer Slide Time: 0:51)

To start with, we will talk about what and why Computer Vision. Then we will talk about the
topics that we will cover in this course, the structure that we will follow, what the objectives are,
what you will not learn and finally we will talk about the resources and references for this
course in this first part of the lecture.

(Refer Slide Time: 1:11)

Firstly, to start with, what is computer vision? If you looked at this image on the left and asked the questions: where is the glue stick? Or can you find the book, and what is its full title? You would probably take a few seconds, but if you observe closely, you will then find this glue stick here standing vertically, and there is this book hidden behind other books; that is the book that slightly draws your attention, and looking at some partial information you would say the book's name is Lord of the Rings.

The brain is really good at filling in information to get at what is hidden in the image. And for the image on the right, if I ask you the question, what is wrong with this image? It seems very naturally made, but from our knowledge of the world and of how the world behaves, we would probably say that somebody has tampered with and created this image, and that it does not seem to obey some natural laws that we know of.

Now, what we want to find out in this course is, can a machine answer the same questions? Is it trivial for a machine to understand the clutter in an image and isolate one particular glue stick whose pose is very atypical, where the name 'glue stick' is not revealed on that particular object? Or can a computer, from the word 'rings' and from the font of that particular word, fill in the rest and see that the book is Lord of the Rings?

And the same here for the image on the right: can a computer have knowledge of the world that can help it understand that there is something wrong with this image? And be able to say what is
wrong. These are very difficult problems for a computer and the human brain is very very
good at filling in partial information and bringing in context and all of the knowledge that we
have gained to be able to help us understand the world around us visually.

(Refer Slide Time: 3:21)

So, what is Computer Vision? More formally speaking, you could say Computer Vision is a field that seeks to automate and endow a computing framework with the ability to interpret images the way humans do. At the end of the day, we want that system to be able to serve humans. So, we want a computer to interpret the world around us the way human users do. And as all of us may know, Computer Vision is considered a sub-topic of Artificial Intelligence today.

There are other definitions that many researchers have given over the years: you could call it "the construction of explicit meaningful descriptions of physical objects from images", as defined by Ballard and Brown; or, as Trucco and Verri defined it, "computing properties of the 3D world from one or more digital images"; or, as Stockman and Shapiro say, "to make useful decisions about real physical objects and scenes based on sensed images".

And one thing that I should mention here is that when we say images, we also mean videos; 'images' is more of a placeholder for the visual world that a camera captures. It could be images or videos in this particular context.

(Refer Slide Time: 4:45)

Why do we need to study this field, apart from the curiosity to understand the world, apart from the need to have a computer do what humans do? What are the applications of Computer Vision in today's world? I am quite sure many of you may already know this, but Computer Vision is all around us today: be it face detection on Facebook or Picasa or any other visual platform that you have, or autonomous vehicles, where the perception in front of a car is through various cameras and sensors, including LiDAR.

Surveillance, such as CCTV in airports or railway stations or wherever it may be. Factory automation, one of the oldest applications of Computer Vision, where cameras are fixed in various settings in manufacturing pipelines or warehouses to be able to automate various aspects of factory use cases. Medical imaging, a very important application today.

Human-Computer Interaction: some of you may have played with the Kinect on the Xbox, where there is a stereo camera looking at a user and tracking the user to say, if the user is playing a bowling game or a tennis game, whether the movement is a particular forehand or backhand or any other movement for that game. And finally, visual effects in movies, where computer vision is used extensively.

These are just some sample applications from different domains, but you can imagine that there are many, many kinds of applications where one can use cameras, and computer vision on the feed you get from those cameras, to extract knowledge in an automated manner.

(Refer Slide Time: 6:33)

There are many, many more applications. Here are a few more, including some recent ones, that I have tried to put together on this particular slide. In fact, one of your homeworks for this particular lecture is going to be to click on some of these links and see how computer vision is used there. And I can guarantee you that some of these are very, very interesting applications which you may or may not have come across so far.

One of them is retail and retail security. Some of you may have heard of Amazon Go, an initiative of Amazon where the entire shopping experience is without billing counters. So, you go into a store, pick up whatever objects you want from the different shelves and you simply walk out of the store with no billing. The retail store automatically tries to understand what objects you have picked and put into your basket and just deducts the amount from your Amazon Wallet or any other wallet that you configure for your shopping experience.

Similarly, there is virtual try-on, where you try on a piece of clothing before buying it online to see how it looks on you; StopLift, to prevent shoplifting; healthcare, where you can use it for blood loss detection or for dermatology; agriculture, for monitoring various aspects of farming or for livestock face recognition; and banking and finance, to be able to deposit your cheques automatically through a feeder. There is no human involved in finding out to whom the cheque should be credited and what the amount is; all of that is automated through a vision system.

Insurance risk profiling; remote sensing, where you do land use understanding using satellite imagery or forestry modelling using drone or satellite imagery; structural health monitoring, a very important application today, where you can do oil well inspection, drone-based bridge inspection, 3D reconstruction, or try to understand the health of railway tracks using drone imagery, and so on and so forth.

This is extremely important in various settings where you may want to monitor the health of an infrastructure without having to physically inspect it. For example, it may not be trivial to go under a bridge and try to find out if there are cracks under it, but you could use a drone to fly through that area and process the images that you get from that drone to understand the health of bridges, railway tracks, oil refineries and any of those large infrastructures.

Document understanding: optical character recognition is one of the oldest successes of computer vision, in fact one of the oldest products that computer vision has delivered to the world. Robotic process automation for document understanding; you can click on these links to understand what each of these terms means. Tele and social media: image understanding, brand exposure analytics.

Augmented reality, to be able to do visual support or warehouse and enterprise management, and so on and so forth. In case any of these terms are not clear, please do click on these links; each of them will take you to a video or a description of what that application is and how computer vision is used in that setting. Please do this to understand the various applications that computer vision has.

(Refer Slide Time: 9:56)

So, in terms of studying computer vision, very broadly speaking there are a few different terms that often overlap: computer vision, machine vision and image processing. One of the questions you could ask here is, how are these different? Today the field is popularly called computer vision. Image processing often refers to low-level processing: you try to process an image at a low level, such as extracting edges, inverting an image, or applying various processing operations directly on that image.
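To make 'low-level processing' a bit more concrete, here is a minimal sketch (assuming NumPy; the tiny random image is just a placeholder, nothing from the lecture) of two such operations:

```python
import numpy as np

# A hypothetical 8-bit grayscale image: values in [0, 255], shape (rows, cols).
img = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)

# Low-level operation 1: invert the image (bright becomes dark and vice versa).
inverted = 255 - img

# Low-level operation 2: a crude horizontal-gradient map; large values hint at vertical edges.
# Cast to a signed type first so the subtraction does not wrap around.
gradient_x = np.abs(np.diff(img.astype(np.int16), axis=1))

print(inverted)
print(gradient_x)
```

Both operations act directly on pixel values without any notion of what the image contains, which is exactly the sense in which they are 'low level'.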

But computer vision is more about understanding higher-level abstractions or knowledge of the world from images. So, there is a difference in the level of abstraction between image processing and computer vision, although these days I think the lines between them are very blurry, and where one ends and the other begins is very difficult to say. We will obviously flow from one to the other in this particular course.

Regarding computer vision and machine vision, this is a little trickier; it is more a matter of the legacy of how the field developed. Machine vision was one of the popular terms used in the 80s and 90s, when computer vision systems were used in industry automation. So, today when you say machine vision, you often talk about vision systems that are deployed as part of a larger machine or an industrial system. That is what you typically mean by machine vision.

Whereas computer vision refers to the particular set of methods or algorithms that you use to extract knowledge from a camera. So, in a sense, computer vision methods are used as part of machines in machine vision. Once again, I think it is the perspective of usage, context and scope in a particular setting that determines which of these terms you use.

So, in terms of the various topics that are allied to computer vision: obviously, Artificial Intelligence is the larger realm, and computer vision forms one aspect of Artificial Intelligence at this time, although computer vision has other facets too. Machine Learning, of course: a lot of computer vision problems today are solved through Machine Learning methods. Mathematics, no surprise.

Neurobiology, because that is what helps us understand how the human visual system works. Imaging: how cameras image the scene around us is important for solving computer vision, as are the physics and optics involved in lenses, how light falls on a particular object, and how an object's appearance changes when the light source changes, when the direction of the light source changes, when the colour of the light source changes, or when there are multiple light sources, and so on and so forth.

Signal processing, because at the end of the day an image is a 2-dimensional signal, so a lot of basic computer vision concepts follow from signal processing. Control and robotics play an important role, especially in robotic vision. And of course there are various other peripheral areas such as cognitive vision, statistics, geometry, optimization, biological vision, smart cameras, optics, multi-variable and non-linear signal processing, robotic vision, and so on and so forth.

And these are all connected to the larger topics that you see them organised next to. So, that gives you a perspective of the various topics that are connected to computer vision. Obviously, we will not have the scope to cover all of them, and we will soon talk about what we are going to cover in this particular course.

(Refer Slide Time: 13:34)

So one question here is: why is this hard? Our brain seems to be so good at perceiving objects around us. In fact, we will talk about the history of computer vision in a subsequent lecture, but when people started studying computer vision in the 1960s, it was assumed that computer vision was something you could solve through a summer project.

And obviously, researchers have long since realised that it is not that simple, but let us try to understand why it is really hard. One simple way of trying to understand it is by looking at optical illusions, because they help you see that even we do not understand how we perceive certain things in the visual world around us. A very popular example is the Muller-Lyer illusion, where one asks the question: which line is longer?

If you did not think too hard, you would probably say that the longer line is the one below, but the true answer is that both lines are exactly the same length. Just the contextualization of those lines with those angle parentheses gives you the illusion that one of the lines is longer than the other.

And there is a very popular illusion called the Hermann grid illusion, named after a German physiologist called Hermann, where if you simply look at that grid in front of your eyes, you start hallucinating black dots and white dots at the intersections of those grid points. In fact, some argue over whether this is an illusion or a hallucination, but it still proves the point that human vision is hard to understand, and how we perceive things is still hard to understand.

Another popular example is Adelson's brightness constancy illusion. This is again a link; please do take a look at it later on. The question we want to ask here is: if you look closely, there are two squares A and B labelled on that checkerboard, and we ask, which is brighter, A or B? What is your answer? Obviously, it looks like B is brighter and A is darker.

The truth of it is that A and B have exactly the same intensity at the level of a pixel. The pixels in the A block and the B block have exactly the same intensities. This can be a little surprising, but once again it comes down to how the human brain uses context: it sees a shadow falling on B and compensates for it, which makes B look brighter than A. But A and B have exactly the same pixel intensity.

Once again, the brain does an amazing job here, but to make a computer do this job is hard. And finally, here is an example where you have two sets of crosses. On the left, you have a bunch of red crosses; on the right, you again have a bunch of red crosses. The task for you is to count the red crosses in both figures. I will let you take a moment; now, if I ask you the question, which is harder?

You would say the one on the right, because there are some red circles that are putting off your counting process and making it harder to count, although otherwise it is the same red crosses on both sides. So, this again shows that understanding how visual perception works is hard.

(Refer Slide Time: 17:25)

More deeply, if you were to ask why computer vision is hard: most of the practical use cases of computer vision are what are known as inverse model applications. Let us see what this means. It means that we have no knowledge of how an image was taken or what the camera parameters were when the image was taken. You do not know at what angle the image was taken.

What were the settings on your camera when the image was taken? Yet we need to model the real world in which the picture or the video was taken. That is why we call it an inverse problem: we do not know what the world in which the image was taken was like, but we actually have to find out from the image what the 3D characteristics of that world are.

And often in practice, we almost always have to model this from incomplete, partial, noisy information. By that we mean there could be noise in the image: the noise could come from motion blur, from a noisy speck on the lens, from noise on the CCD or CMOS silicon that captures the image, or from the processing elements involved at various stages of the pipeline. The noise could come from anywhere.

On the other hand, just to make this clear: we said this is an inverse model, so an obvious counter question is, what is a forward model? A forward model is what could be used in, say, physics or computer graphics, where you define the various parameters that you have and create the image in that setting, like in animation or graphics, where you say: I am going to place a light source at this angle, the light source is blue in colour, I have an object sitting at a particular location, I have another light source that is red in colour in some other part of the room falling on the object, and now we can actually create the image of that room.

Those kinds of applications would be called forward models of vision, while the computer vision that we are going to talk about in this course is the inverse model, and a lot of applications such as object recognition, detection or segmentation are all inverse models. Secondly, one of the problems in computer vision is that image data is very high dimensional. Consider even a single 1-megapixel image, which today is not very high resolution compared to the kind of cameras we all have in our smartphones.

Remember, a 1-megapixel image is 10^6 pixels. And if you took a colour image, you typically would have a red channel, a green channel and a blue channel in an RGB image, which means you actually have 3 x 10^6 values in that particular 1-megapixel image. That is 3 million measurements for that one image, and that is a huge dimension to work with when you use machine learning algorithms on this kind of data.
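Just to spell that arithmetic out, a quick back-of-the-envelope sketch (the 1000 x 1000 resolution is an illustrative stand-in for roughly one megapixel):

```python
# Rough count of the measurements in a ~1-megapixel RGB image.
height, width, channels = 1000, 1000, 3       # ~10^6 pixels, three colour channels

num_pixels = height * width                   # 1,000,000 pixels
num_measurements = num_pixels * channels      # 3,000,000 values fed to a learning algorithm

print(f"pixels = {num_pixels:,}, measurements = {num_measurements:,}")
```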

This means that processing such images and getting understanding from them has very heavy computational requirements. Another thing here is that computer vision is said to be what is known as AI-complete. Once again, anything that is blue in the slides is clickable, so please do feel free to click on it and learn more about what AI-complete is, but in very simple words, AI-complete means that a computer cannot solve the problem by itself and you need to bring in a human to be able to solve the problem.

That is what we mean by AI-complete, in very crude words; please feel free to click on that link to understand more. And as you can see here on the right, if you simply want to check whether a photo that you clicked was taken in a particular national park, you can probably do it easily, but if you want to find out finer details of a particular photo that you took, such as what bird it is, or which particular kind or sub-kind of bird you are looking at, these kinds of things are very hard computer vision problems.

(Refer Slide Time: 21:28)

Further points here are that no complete models of the human visual system itself are known at this time. Most of the existing models of the human visual system relate to subsystems and are not holistic in their understanding in terms of the entire central nervous system. There are also questions you can ask about what is perceived and what is cognized; especially with things like optical illusions there is a significant difference. If you take the example of the Hermann grid illusion that we saw, what you perceive and what you cognize are two different things.

And you may look at a piece of thread in the night and think it is a snake; once again, what you perceive and what you cognize are different here, and that makes the problem even harder. You could also ask the question: when is an object important for a task, and when is the context around the object important for a task? These are difficult questions to answer; some of them have been answered very convincingly today and some of them still remain open research questions for the field of computer vision.

Also, the verifiability of mathematical or physical models for these kinds of systems is non-trivial. For example, if you found a good way to represent an image as a low-dimensional vector, per se, how should similarity or dissimilarity between representations be defined? Would that be a proper distance metric mathematically? Do all images in all settings, RGB images, thermal images, LiDAR images, follow the same distance metric?
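To make the question concrete, here is a minimal sketch (assuming NumPy; the 128-dimensional vectors are arbitrary placeholders, not anything from the lecture) contrasting two common choices of dissimilarity between image representations:

```python
import numpy as np

# Two hypothetical low-dimensional representations of images (e.g., 128-d feature vectors).
x = np.random.rand(128)
y = np.random.rand(128)

# Euclidean distance: a proper metric (non-negative, symmetric, obeys the triangle inequality).
euclidean = np.linalg.norm(x - y)

# Cosine dissimilarity: widely used for comparing descriptors, but 1 - cosine similarity
# does not satisfy the triangle inequality, so strictly speaking it is not a distance metric.
cosine_dissimilarity = 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(euclidean, cosine_dissimilarity)
```

Whether either choice is 'right' depends on the setting, which is precisely the verifiability issue being raised here.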

How do you change this distance metric between these images? Unfortunately, there is no universal answer and no clear verifiability for such choices. Even if you develop a method, how you verify it in a very reliable way is also an open question. Finally, there is the question of manipulating an image, what is typically called a counterfactual. A counterfactual is where you take an image and ask the question: what if a particular object in this image were changed?

Say you had an image of the scene in front of you, and instead of a laptop you placed a table lamp there; how would the scene look, and what would change in your perception of that image? Those are called counterfactuals. If you had a manipulation in a given environment, how would that change the behaviour of the perception, of the cognition, of the image?

Can a physical model or a mathematical model of your environment actually capture those
kinds of counterfactuals and how the counterfactuals affect your perception? These are open
questions at this point and these are the reasons why computer vision stays a hard problem
today.

(Refer Slide Time: 24:14)

So, with that brief overall introduction to computer vision, why we need to study it and why it is hard, let us actually talk about this course, what we are going to cover in it and what the structure of the topics is going to be.

(Refer Slide Time: 24:30)

Looking at the way computer vision has evolved over the last few decades (we will talk about the history of computer vision in Lecture 2), you can broadly classify the topics into learning-based vision, geometry-based vision and physics-based vision. Once again, these topics are fairly porous, so it is not that one topic does not flow into the other; often, to solve problems, you need to use concepts from various sub-topics.

By learning-based vision we generally refer to problems in computer vision such as visual recognition, which includes object recognition, gesture recognition or emotion recognition from images, or detecting your face in a Facebook photo; or segmentation, where you segment the scene into various parts at the pixel level; or tracking an object; or retrieving an image based on a search query, and so on and so forth.

Typically, all of these are solved using learning-based methods, so that is one segment of computer vision. Then there is geometry-based vision, where you talk about feature-based alignment, image stitching such as the panorama feature that you see on your cell phone, epipolar geometry, structure from motion, 3D reconstruction and so on and so forth. Typically, you study those as part of geometry-based vision.

And finally, there is physics-based vision, where you talk about computational photography, photometry, light fields, colour spaces, shape-from-X, where X could be shading or structure or any other cue, reflection, refraction, polarization, diffraction, interference, and so on and so forth.

So, clearly you can see a lot of this has to do with the physics of how an image is created.
So, these are all different subfields of computer vision that have developed over the years.
And the focus of this course is going to be learning-based vision and that is the reason this
course is called deep learning for computer vision.

(Refer Slide Time: 26:25)

So, let us talk about how this course is structured. We have tried to divide the course into 5 different segments just for simplicity of covering topics. In the first segment we will talk about the journey so far, which covers traditional computer vision topics. We will talk about the basics here, including image formation, linear filtering, convolutions, correlation, edge detection, blob detection, feature detection, feature descriptors, feature matching and so on and so forth.

That would be the traditional computer vision segment. Then we will go on to segment 2, where we will talk about the building blocks of deep learning for computer vision. We will quickly review neural networks and backpropagation, which we assume you have knowledge of; as I said, this course is built on top of Machine Learning and deep learning courses. We will quickly review neural networks though, because they are important for understanding various other topics.

We will cover convolutional neural networks, the various architectures and models that have evolved over the last few years, and also try to understand how you visualize and understand how CNNs work. In the third segment, we will talk about the various applications, use cases and tasks in which CNNs are used: recognition, verification, retrieval, detection, segmentation. There are various models, loss functions and settings in which CNNs are used here; we will talk about all of them in segment 3.

Then we will go to segment 4, where we will add a new dimension and talk about recurrent neural networks and spatio-temporal models, covering computer vision for video especially: action recognition, activity recognition and so on and so forth. We will also talk about attention models here, and about vision-language tasks such as image captioning, visual dialog or visual question answering, and so on and so forth.

And finally, in the last segment, we will focus on staying contemporary, where we will talk about deep generative models such as GANs and variational autoencoders, and about learning with limited supervision, such as few-shot learning, one-shot learning, zero-shot learning, continual learning, multitask learning and so on and so forth. We will also wrap up the course with some recent trends in the field. There will be some programming assignments on these areas, there will be weekly quizzes, and we will obviously have an exam at the end for those of you who are trying to credit this course in your own college.

(Refer Slide Time: 28:53)

What are the prerequisites? Completion of a basic course in Machine Learning: we are going to assume that you know Machine Learning, and we are going to build on top of that. I would really hope that you have also done a course on deep learning before, because we are not going to cover deep learning in its entirety; we will review concepts in deep learning to the extent you need for this course, but it is highly recommended that you also take a complete deep learning course that covers the subject in a holistic way.

And obviously you need strong mathematical basics in probability, linear algebra and calculus. This course is designed as an advanced undergraduate or postgraduate elective. From a programming perspective, we are going to hope that you are comfortable with programming in Python; knowledge of a deep learning framework such as PyTorch or TensorFlow is highly recommended. You can pick one up as part of the course if you want, but it will be easier if you already know it.

(Refer Slide Time: 29:54)

In terms of resources and references, the popular computer vision books that we will refer to, which are also followed widely across the world, are Computer Vision: A Modern Approach by David Forsyth and Jean Ponce (the book website is linked here), and Computer Vision: Algorithms and Applications by Richard Szeliski, a very popular book that is openly available for you to download as a PDF; we will follow this book for certain topics too.

And also Computer Vision: Models, Learning, and Inference by Simon Prince; once again, the book website has the book and you can download it if you like. These are the popular computer vision books that we will follow.

(Refer Slide Time: 30:33)

There are also the popular deep learning references: Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville (the book website has the book); a very nice short online book by Michael Nielsen, which we will refer to when required; and also the book by Christopher Bishop, Neural Networks for Pattern Recognition. These are good references for you to follow.

For each lecture we will point you to what you can look at, and sometimes we will just point you to good blogs and posts that are online. These days you have some excellent blogs that explain concepts well, so we will point you to those as and when required and possible.

(Refer Slide Time: 31:08)

I said that this course is focused on learning-based vision. So, if you want to learn geometry-based vision, what do you do? There is a very nice book called Multiple View Geometry in Computer Vision, again linked here; please do take a look at that book if you want to learn more. There is also a course on NPTEL for geometry-based vision; please click on this link for the relevant weeks in that particular course if you want to learn a little bit more about topics in geometry-based vision.

And if you want to learn physics-based vision, once again a very nice book called Physics-Based Vision: Principles and Practice is available at this link. There is also a web-based course, linked here, which you can follow if you want to learn more about physics-based vision.

(Refer Slide Time: 31:55)

So, one piece of homework that we are going to let you take away from this inaugural, introductory lecture is: go through all the links on the applications-of-computer-vision slide, which was slide 6 for you. They are very, very interesting views or reads that will open your world to a lot of different computer vision applications, which I think at least some of you may not have come across. That is going to be the homework for your first lecture today.

(Refer Slide Time: 32:17)

To wrap up the first lecture, I should thank the many, many people that have contributed to the creation of the slides and to my own knowledge in delivering this lecture. We are very grateful for the deep learning and computer vision courses and content that are publicly available online. Wherever possible and relevant, these sources have been cited, but if you notice an oversight, please let us know and we will be glad to acknowledge it.

I will thank all of the specific people that have contributed to the creation of the lectures at the end of this course, because there is a chance that I may add some content a little later, and I want to ensure that I thank all of them then. Obviously, any errors in the material are our own; please do point out any issues and we will be glad to rectify them. To the extent possible, the images used have been taken from free stock photos or public sources to avoid copyright violations.

But if there is an oversight, please let us know and we will take it down or replace the image with something else. There are also teaching assistants that have been involved in creating the slides and helping make this course possible to offer to you. Some of these teaching assistants' names and contacts will be shared with you in the next few weeks, but we will also acknowledge all of them at the end of this course, since there may be other people that contribute as the course progresses. So that is the end of the first lecture; we will continue soon with the next topics. Thank you.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
History

(Refer Slide Time: 0:15)

In the last lecture we gave an introduction to the course, and now we will actually get started with the contents. We will review the history of computer vision just to give a perspective of where the field started from and how it has evolved over the last few decades. This lecture is structured into four parts.

(Refer Slide Time: 0:35)

We will briefly describe the very initial forays in the field in the fifties, sixties and seventies. Then we will talk about efforts that contributed to low-level understanding of images, largely in the 80s; then we will go on to the high-level understanding that the community took up in the 90s and 2000s; and of course we will then cover a brief history of deep learning in the last decade or so.

(Refer Slide Time: 1:01)

To start with a disclaimer: this is going to be a history of the field as captured from multiple sources, Szeliski's book as well as many other sources that are mentioned on each of the slides. It may be a slightly biased history from multiple perspectives: 1) it reflects the way I have seen the field and what I have found to be important, so please bear with that personal bias; 2) it may also be biased towards the topics that we cover in the course, and may not cover physics-based vision or geometry-based vision in too much detail.

Once again, I will refer you to the books that we talked about in the previous lecture if you want to know those areas in more detail. There is also a slight predisposition towards work around images, more than videos, but hopefully these slides still give you a perspective of the field and how it has evolved over the last few decades.

(Refer Slide Time: 1:53)

The earliest history of computer vision goes way back to the 50s, when two researchers, David Hubel and Torsten Wiesel, published their work called "Receptive fields of single neurons in the cat's striate cortex". They conducted multiple experiments to understand how the mammalian visual cortex functions: they took a cat, inserted electrodes into the sedated cat, and then tried to see how the cat's neurons fire with respect to visual stimuli presented to the cat.

Incidentally, for quite a long time they could not make headway, and then they accidentally found that the cat's neurons fired when they switched slides in the projector in front of the cat. They were initially perplexed, but they later realized, and this was one of their propositions, that the edge created on the screen by the slide being inserted into the projector was what fired a neuron in the cat.

One of the outcomes of their early experiments was the finding that simple and complex neurons exist in the mammalian visual cortex, and that visual processing starts with simple structures such as oriented edges. Hubel and Wiesel in fact did many more experiments over the next two decades; they actually won the Nobel Prize in 1981 for their work on understanding the mammalian visual cortex. So, this is one of the earliest efforts related to computer vision.

(Refer Slide Time: 3:35)

In the same year, 1959, there was actually another major development, by Russell Kirsch and his colleagues, who for the first time represented an image as a set of 1s and 0s. Representing an image as a grid of numbers is a huge achievement, and something we inherit to this day. In fact, the first image taken was of Russell's infant son: a 5 centimetre by 5 centimetre photo captured as roughly a 176 x 176 array at that particular time. This is considered such a big achievement in the field of vision that this particular photo is still preserved in the Portland Art Museum in the USA.
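As a toy illustration of that idea (this is not Kirsch's actual data, just a sketch of what 'an image as a grid of numbers' means):

```python
import numpy as np

# A binary image is just a grid of 0s (dark) and 1s (bright); here we use the same
# 176 x 176 size as Kirsch's scan, with an arbitrary bright square in the middle.
binary_image = np.zeros((176, 176), dtype=np.uint8)
binary_image[60:116, 60:116] = 1

print(binary_image.shape)    # (176, 176)
print(binary_image[88, 88])  # 1 -> this pixel is "on"
```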

(Refer Slide Time: 4:24)

Then in 1963, there was a significant development by Lawrence Roberts, who wrote a PhD thesis on "Machine Perception of Three-Dimensional Solids". The PhD thesis is in fact hyperlinked on this particular slide, so please take a look at it if you are interested. I think this thesis had some ideas that were even beyond its time. The thesis by Roberts talked about extracting 3D information about solid objects from 2D photographs of line drawings.

If you recall what we spoke about in the previous lecture, we said that the aim of computer vision is to understand the 3D world around us from the 2D images or the 2D videos that we get. To some extent this is what was talked about way back in that PhD thesis in the early 60s. The thesis discussed issues such as camera transformations, perspective effects, and the rules and assumptions of depth perception, so on and so forth.

Interestingly, Lawrence Roberts moved on from this topic, and he is actually more famous for another big development that all of us owe him for. I am going to leave that as a trivia quiz for you to find out; we will talk about it in the next class. Try to find out what Lawrence Roberts is known for; the hint is that it is not for anything in computer vision, but it is a huge technological development that all of us today owe him for. Take a look and try to find it out before the next lecture.

(Refer Slide Time: 6:06)

Subsequently, in 1966, came one of the earliest efforts to build an end-to-end system for computer vision, which happened at MIT, where Papert and Sussman decided that they could use a bunch of their summer interns to develop it. They thought they could take a few summer interns and develop a platform to automatically segment foreground and background and extract non-overlapping objects from real-world images, and this is something they thought they could achieve within a summer.

This was actually a note written by Papert at that time. Obviously, you and I know now that the project did not succeed; rather, the project opened researchers up to the fact that this was a very deep problem, not something that could be solved in 2-3 months. We know that certain aspects of this problem are solved today, but many other aspects still remain unsolved.

(Refer Slide Time: 7:13)

Then the years went on, and in the early 1970s there were also works where people tried to study how lines in an image could be labelled as, say, convex, concave or occluded, or things of that kind. That was one of the efforts by Huffman and Clowes in the early 70s.

(Refer Slide Time: 7:35)

And in 1973 came an important approach called Pictorial Structures, by Fischler and Elschlager, which was reinvented in the early 2000s; I will talk about that a bit later. What they wanted was that, given the description of a visual object, one should be able to find that object in a photograph. Part of the solution was to define an object as a combination of individual components and the connections between those components.

They proposed a solution that consisted, firstly, of a specification of a descriptive scheme of an object, as I said, in terms of individual parts and connections between parts; and secondly, of a metric on which one could base the decision of goodness of matching or detection using such a descriptive scheme. This was a significant development at the time, and a lot of the models developed in the 2000s inherited this approach to the problem.

(Refer Slide Time: 8:39)

Then, between 1971 and 1978, there were a lot of efforts by researchers, even though that period was also known as the "Winter of AI". At that time there were many efforts on object recognition using shape understanding, in some sense trying to envision objects as a summation of parts; the parts could be cylinders or different kinds of skeletal components, and this was an important line of work at that time.

Generalised cylinders, skeletons and cylinders were all efforts at that particular time. Importantly, there was also the world's first machine vision course, offered by MIT's AI lab in the 1970s. I will talk about the applications later, but in the 1970s one of the first products of computer vision was also developed, which was optical character recognition, developed by Ray Kurzweil, who is considered a visionary in the field of AI.

(Refer Slide Time: 9:42)

Then, between 1979 and 1982, came again a landmark development for computer vision, from David Marr, whose research is followed to this day. In fact, the ICCV conference, the International Conference on Computer Vision, gives out a prize named after David Marr for landmark achievements in computer vision. David Marr proposed a pretty important framework in his book called 'Vision: A Computational Investigation into the Human Representation and Processing of Visual Information'.

Firstly, he established that vision is hierarchical, and he also introduced a framework where low-level algorithms that detect edges, curves and corners feed into a high-level understanding of visual data. In particular, his representational framework first had a primal sketch of an image, where you have edges, bars, boundaries, etc. Then you have a 2.5D sketch representation, where surfaces and information about depth and discontinuities are all pieced together.

And finally, a 3D model that is hierarchically organized in terms of surface and volumetric primitives. To some extent you could say that this also resembles how the human brain perceives information, but we will talk about that a bit later. This was Marr's representational framework, which led to a lot of research in subsequent years and decades.

(Refer Slide Time: 11:19)

In the same period, around 1980-81, there was also a significant development by Kunihiko Fukushima called the Neocognitron, which is actually the precursor of the convolutional neural networks we see today. It was a significant development for the time: Fukushima introduced a self-organizing artificial network of simple and complex cells to recognize patterns. In fact, you could call this the original ConvNet. It also talked about convolutional layers with weight vectors, which are also called filters today. It was one of the earliest versions of the convolutional neural networks which are used to this day.

(Refer Slide Time: 12:00)

So, those were the initial years, and now we will talk about some developments in low-level understanding of images, which largely happened in the 80s. We may not cover all of the methods, but at least some of the important ones as we go forward.

(Refer Slide Time: 12:17)

In 1981, there was a very popular method called Optical Flow, developed by Horn and Schunck, and the idea of this method was to understand and estimate the direction and speed of a moving object across two images captured along a timeline. So, if an object moved from position A to position B, what was the velocity of that object across the two images?

Flow was formulated as a global energy functional, which was minimized to obtain the solution. This is a method that was used extensively over many decades, especially for video understanding, and I think it is still used in certain applications such as video compression or other video understanding tasks.
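To give a sense of what that global energy functional looks like, the Horn-Schunck objective is usually written as below (standard textbook notation, not taken from the slides); here (u, v) is the flow field, I_x, I_y, I_t are the image derivatives, and alpha controls how smooth the estimated flow should be:

```latex
E(u, v) = \iint \Big( (I_x u + I_y v + I_t)^2
        + \alpha^2 \big( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \big) \Big) \, dx \, dy
```

The first term asks the flow to be consistent with the brightness changes between the two images, and the second term penalizes flow fields that vary abruptly.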

(Refer Slide Time: 13:12)

In 1986 came the Canny edge detector, which was a significant development for edge detection. John Canny proposed a multi-stage edge detection operator, also known as a computational theory of edge detection. It used calculus of variations to find the function that optimizes a given functional. It was a well-defined, principled method, simple to implement, and it became very, very popular for edge detection. It was used extensively for many years to detect edges, probably up to this day in certain industries.
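For reference, here is a minimal sketch of running Canny through OpenCV's off-the-shelf implementation (the file name and the hysteresis thresholds below are illustrative placeholders, not values from the lecture):

```python
import cv2

# Read a grayscale image (the path is a placeholder).
img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)

# Smooth first to suppress noise before the gradient stage.
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)

# Hysteresis thresholds: gradients above 200 are strong edges; pixels between
# 100 and 200 are kept only if they connect to a strong edge.
edges = cv2.Canny(blurred, 100, 200)

cv2.imwrite("edges.png", edges)
```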

(Refer Slide Time: 13:47)

In 1987, there was also the recognition-by-components theory proposed by Biederman, which was a bottom-up process to explain object recognition, where an object is constituted in terms of parts labelled as geons; geons are simply basic three-dimensional shapes such as cylinders, cones and so on, as you can see in some of these images here, which are assembled to form an object. Again, this was a theory of visual recognition, to see if we could recognise objects in terms of their parts.

(Refer Slide Time: 14:26)

In 1988 came what are known as Snakes, or active contour models, which help delineate an object outline from a potentially noisy 2D image. They were widely used in applications like tracking, shape recognition, segmentation, edge detection, and so on and so forth.

(Refer Slide Time: 14:48)

In 1989 came the first version of backpropagation for convolutional neural networks. It is not necessarily low-level visual understanding, but it happened in the 80s and that is why I am talking about it here. It was applied to handwritten digit recognition, as we will discuss very soon.

(Refer Slide Time: 15:08)

Other things that happened in the 80s were the development of image pyramids, a representation of an image at multiple scales; scale-space processing, the processing of an image at different scales; wavelets, which were a landmark development at that time; shape-from-X, which is shape from shading, shape from focus, shape from silhouette, basically trying to get shape from various aspects of image formation; variational optimization methods; and Markov Random Fields. All of these were developed in the 1980s.

(Refer Slide Time: 15:41)

Then came the 1990s, when the community stepped into a higher level of understanding beyond low-level artefacts such as edges or corners and so on and so forth.

(Refer Slide Time: 15:53)

It started with Eigenfaces, which used a variant of eigen decomposition for doing face recognition. It happened in 1991 and was successful for face recognition, at least in constrained settings. There were also computational theories of object detection by Edelman, proposed in 1997. Then came perceptual grouping and Normalized Cuts, a landmark step for image segmentation methods, in 1997.

Then came particle filters and mean shift in 1998, and the Scale Invariant Feature Transform, an important image keypoint detector and representation method developed in the late 90s and early 2000s; we will talk about some of these methods in detail. Then Viola-Jones face detection, which came in the early 2000s, and Conditional Random Fields, which were an improvement over Markov Random Fields.

Then pictorial structures: the method proposed in 1973 was revisited in 2005, and an improved statistical approach was developed to estimate the individual parts and the connections between parts, which was again called pictorial structures, and they actually showed that it could work in practice and give good performance for image matching.

PASCAL VOC, a dataset that is popular to this day, actually started in 2005, and around that time, between 2005 and 2007, a lot of methods for scene recognition, panorama recognition and location recognition also grew. Constellation models, which were part-based probabilistic generative models, also grew at that time, to again recognize objects in terms of parts and how the parts were put together into the whole.

And deformable part models, a very popular approach considered one of the major developments of the first decade of the twenty-first century, came in 2009.

(Refer Slide Time: 18:10)

And since then, of course, the big developments have been in deep learning. So, let us briefly review them too.

(Refer Slide Time: 18:17)

In 2010, the ImageNet dataset was developed. The motivation was that until then a lot of developments in computer vision relied on lab-scale datasets; of course, the PASCAL VOC dataset changed this to some extent in 2005 and 2006, but many other developments relied on lab-scale datasets developed in various labs around the world, and this did not give a standard way to benchmark methods and compare them across a unified platform, across a unified dataset.

That is the purpose ImageNet sought to achieve at that particular time. So, 2010 was when ImageNet arrived, and 2012 was a turning point for deep learning: as many of you may be aware, AlexNet won the ImageNet challenge that year. Until 2012, all the models that won ImageNet were what you might call shallow models: you extracted some features out of the images and then used Machine Learning models such as support vector machines to do object recognition.

In 2012, AlexNet came into the picture; it was the first convolutional neural network to win the ImageNet challenge, and it was a significant achievement because it improved the accuracy on the ImageNet challenge by a significant amount beyond the previous year's best performers. We will talk about the numbers and all of these details when we get to that point in the course.

(Refer Slide Time: 19:51)

Then in 2013 came a variant of the convolutional neural network called ZFNet, which stands for Zeiler and Fergus; it won the ImageNet challenge that year. Region-based CNNs, or R-CNNs, were also first developed in 2013 for the object detection task, and people also started investing effort in trying to understand how CNNs work.

(Refer Slide Time: 20:17)

In 2014, the InceptionNet and VGG models arrived. Human pose estimation models were developed, so CNNs started being used for tasks beyond just object recognition. Deep generative models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) were also developed in 2014. In 2015, residual networks, or ResNets, arrived, and CNNs matched human performance on ImageNet, which was again a landmark achievement.

(Refer Slide Time: 20:53)

2015 also saw segmentation networks come into the picture: fully convolutional networks, SegNet and U-Net were all developed in 2015 for the task of semantic segmentation, or labelling every pixel in an image with a particular class label. The COCO dataset also started appearing at that time, and the first visual question answering (VQA) dataset was developed in 2015.

In 2016, moving beyond region-based CNNs for object detection, single-stage methods such as You Only Look Once and the Single Shot Detector, YOLO and SSD, were developed. The Cityscapes dataset arrived, the Visual Genome dataset arrived, and 2017 was the start of a higher level of abstraction in understanding images, which is scene graph generation: given an image, how do you work out its scene graph, a person sitting on a horse, a man riding a motorbike, and so on and so forth?

And in 2018 and 2019, higher levels of abstraction appeared, such as the visual commonsense reasoning dataset, where we try to see if we can not only give an answer to a question on an image but also give a rationale for that answer, and tasks such as panoptic segmentation have been developed. As you can see, this journey has focused on going from low-level image understanding to higher and higher abstractions of the world we see around us from images.

(Refer Slide Time: 22:34)

From an application standpoint, we are not going to walk through every application, but at a high level: in the 1970s, as I already mentioned, one of the earliest products developed was optical character recognition, by Kurzweil Technologies, founded by Ray Kurzweil. That was one of the earliest successes of computer vision, you could say. In the 1980s, most of the industry developments were in machine vision, which installed cameras in various manufacturing setups or industrial settings.

Probably finding defects in processing chips, for example, or embedding some of these algorithms, such as edge detection, into the cameras themselves at manufacture, which is known as smart cameras, a field that is important even today. In the 1990s, the applications of vision slowly started growing: machine vision in manufacturing environments continued to grow, and biometrics, recognising people from images, whether from gait, face, iris or gestures, started growing as well.

Medical imaging started becoming important. Recording devices and video surveillance all started growing in the 90s. In the 2000s there was more of all of these: better medical imaging, object and face detection, autonomous navigation starting in the mid-2000s, Google Goggles, vision on social media, all of that started in the 2000s. And in the 2010s, I am not even going to try listing the applications; it has grown to a point where vision applications are in various domains all around us.

(Refer Slide Time: 24:25)

Hopefully, that gave you a brief perspective of the history of computer vision over the last few decades. I would recommend you to read Szeliski's chapter 1 at this time and also read some of the links that have been shared as part of these slides; every slide had a footnote indicating where the information was taken from. So, go through some of these slides, go through the links, and you will be able to understand how some of these topics grew in the specific areas covered by those links. We will stop here for now and continue with the next topic very soon.

(Refer Slide Time: 25:01)

Here are some references if you would like to take a look.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology Hyderabad
Image Formation

Moving on from the history of computer vision, in this lecture we will talk about the formation of images. Before we go there, did you have a chance to check the answer to the trivia question that we had last class? What was Lawrence Roberts known for? Besides his contribution to computer vision, he is better known for being one of the founders of the internet. In fact, he was the project leader of the ARPANET project, the precursor to the internet, at the US defense organization DARPA.

(Refer Slide Time: 00:57)

Let us move on to the topic of this lecture. As most of you may know, images are formed when light from a source hits the surface of an object, some of that light is reflected onto an image plane, and it is then captured through optics onto a sensor plane. So, that is the overall process, and the factors that affect image formation are the light source strength and direction; the surface geometry and material, such as its texture, as well as nearby surfaces whose light could get reflected onto the surface; the sensor capture properties, which we will talk more about as we go; and the image representation and colour space itself. We will talk about some of these as we go.

(Refer Slide Time: 01:54)

So, to study all of these, one would probably need to approach this from a geometric perspective, where you study 2D transformations, 3D transformations, camera calibration and distortion; from a photometric perspective, where you study lighting, reflectance, shading, optics, and so on; from a colour perspective, where you study the physics of colour, human colour perception and colour representation; and from a sensor perspective, looking at human perception, camera design, sampling and aliasing, compression, so on and so forth. We will not cover all of these, but only a few relevant topics in this particular lecture. If you are interested in a more detailed coverage of these topics, please read chapters 1 to 5 of the book by Forsyth and Ponce.

(Refer Slide Time: 02:48)

Starting with how light gets reflected off a surface, the typical models of reflection state that when light hits a surface there are three simple reactions possible; there are more than three, but three simple ones to start with. Firstly, some light is absorbed, and that depends on a factor called albedo (ρ): when a surface has low albedo, more light gets absorbed, which is why the fraction absorbed is 1 − ρ. Secondly, some light is reflected diffusely: it scatters in multiple directions, and this happens independent of the viewing angle.

Examples of surfaces where light scatters diffusely are brick, cloth, rough wood or any other textured material, and in this scenario Lambert's cosine law states that the amount of reflected light is proportional to the cosine of the angle of incidence. Thirdly, some light is reflected specularly, where the reflected light depends on the viewing direction. An example of a surface where this happens is a mirror, where we all know that the reflected light follows the same angle as the incident light.

(Refer Slide Time: 04:15)

Generally, in the real world most surfaces have both specular and diffuse components, and the intensity that you receive at the output also depends on the illumination angle, because at an oblique angle less light comes through. In addition to absorption, diffuse reflection and specular reflection, other interactions are possible: there is transparency, where light passes through the surface; there is refraction, as in a prism, where light gets bent; there is subsurface scattering, where multiple layers of the surface result in certain levels of scattering; and finally, there are phenomena such as fluorescence, where the output wavelength can be different from the input wavelength, or phosphorescence.

An important concept studied here is the BRDF, or Bidirectional Reflectance Distribution Function, which is a model of local reflection that tells us how bright a surface appears from one direction when light falls on it from another, prespecified direction. There are models to evaluate how bright the surface appears.

(Refer Slide Time: 05:46)

From the viewpoint of colour itself, we all know that visible light is one small portion of the vast electromagnetic spectrum: infrared falls on one side, ultraviolet on the other, and there are many other forms of radiation across the spectrum. So, the coloured light which arrives at a sensor typically involves two factors: the colour of the light source and the colour of the surface itself.

(Refer Slide Time: 06:26)

So, an important development in the sensing of colour in cameras is what is known as the Bayer grid or Bayer filter. The Bayer grid describes the arrangement of colour filters on a camera sensor. Not every sensing element in a camera captures all three components of light; you may be aware that we typically represent coloured light as RGB: Red, Green and Blue.

We will talk a little more about other ways of representing coloured light later, but this is the typical representation, and not every sensing element on the camera captures all three colours. Instead, a person called Bayer proposed arranging the colour filters in a grid with 50 percent green sensors, 25 percent red sensors and 25 percent blue sensors, which is inspired by human visual receptors.

And this is how these sensors are checkered: in a real camera device you have a sensor array where one set of sensors captures only red light, one set captures only green light and one set captures only blue light, and to obtain the full colour image, demosaicing algorithms are used, where surrounding pixels contribute to the value of the missing colour channels at a given pixel.

So, each sensing element records only its own colour, and you use the surrounding elements to assign the remaining colour values at that particular sensing element. These are known as demosaicing algorithms. This is not the only kind of colour filter; the Bayer filter is the one that is most popular, especially in single-sensor cameras, but other kinds of filters and colour grid mechanisms have been developed over the years too. You can read a little more about this in the Wikipedia entry on the Bayer filter, which also talks about other kinds of mechanisms that are used.
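
As a rough illustration of what a demosaicing algorithm does, here is a minimal sketch that assumes an RGGB Bayer layout and uses plain neighbourhood averaging; real camera pipelines use considerably more sophisticated interpolation.

    import numpy as np

    def demosaic_naive(raw):
        # raw: H x W mosaic where each pixel holds only one of R, G, B (RGGB layout assumed)
        H, W = raw.shape
        rgb = np.zeros((H, W, 3))
        masks = np.zeros((H, W, 3), dtype=bool)
        masks[0::2, 0::2, 0] = True   # R at even rows, even columns
        masks[0::2, 1::2, 1] = True   # G at even rows, odd columns
        masks[1::2, 0::2, 1] = True   # G at odd rows, even columns
        masks[1::2, 1::2, 2] = True   # B at odd rows, odd columns
        for c in range(3):
            known = np.where(masks[:, :, c], raw.astype(float), 0.0)
            count = masks[:, :, c].astype(float)
            total = np.zeros_like(known)
            hits = np.zeros_like(count)
            # Average whatever samples of this colour fall inside a 3 x 3 window
            # (borders wrap around with np.roll; good enough for a sketch)
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    total += np.roll(np.roll(known, dy, axis=0), dx, axis=1)
                    hits += np.roll(np.roll(count, dy, axis=0), dx, axis=1)
            rgb[:, :, c] = total / np.maximum(hits, 1e-8)
        return rgb

    raw = np.arange(16, dtype=float).reshape(4, 4)   # a toy 4 x 4 mosaic
    print(demosaic_naive(raw).shape)                 # (4, 4, 3): a full-colour estimate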

(Refer Slide Time: 08:41)

So, here is a question for you to think about: if the visible light spectrum is VIBGYOR, Violet, Indigo, Blue, Green, Yellow, Orange, Red, why do we use an RGB representation for colour? That is something for you to think about; we will answer it in the next class, but at least try to find it out yourself if you can.

(Refer Slide Time: 09:01)

So, the image sensing pipeline in a camera follows a flowchart such as this, where you have the optics, such as the lens, through which light falls in. You have aperture and shutter parameters that you can specify or adjust, and from there light falls onto the sensor. The sensor can be CCD or CMOS; we will talk about these variants very soon.

Then there is a gain factor, which we will also talk about soon. Then the image is obtained in an analog or digital form, which is the raw image that you get. Typically cameras do not stop there: you then use demosaicing algorithms, which we just talked about, you could sharpen the image if you like, or apply other image processing algorithms, some white balancing and other digital signal processing methods to improve the quality of the image, and finally you compress the image into a suitable format to store it. So, this is the general pipeline of image capture.

(Refer Slide Time: 10:12)

So, let us revisit some of these components over the next few minutes. The first thing is the camera sensor itself; you must all have heard of CCD and CMOS. This used to be a common decision when buying a camera; these days it is less of an issue, but in earlier days it mattered even more. What is the difference? CCD stands for Charge-Coupled Device.

In a CCD, you generate a charge at each sensing element and then move that photogenerated charge, the charge generated by photons striking that sensing element, from pixel to pixel, and convert it to a voltage at an output node on that particular column. Then an ADC, or analog-to-digital converter, converts each pixel's value into a digital value. This is how CCD sensors work.

(Refer Slide Time: 11:15)

On the other hand, CMOS sensors, Complementary Metal Oxide Semiconductor sensors, work by converting charge to voltage inside each element. So, where a CCD accumulates the charge and converts it at the end of a column, a CMOS sensor converts at each element: it uses transistors at each pixel to amplify and move the charge using more conventional wires.

The CMOS signal is therefore digital, so it does not need an ADC at a later point. Originally CMOS technology had some limitations, but today it is fairly well developed, and most of the cameras we use today are actually CMOS devices.

(Refer Slide Time: 11:59)

So, there are many properties that you may see when you take a picture on a camera. Shutter speed, also called exposure time, controls the amount of light reaching the sensor. Sampling pitch defines the spacing between the sensor cells on the imaging chip. Fill factor is the ratio of the active sensing area size to the theoretically available sensing area of the sensing element.

Chip size is the overall area of the chip itself. Analog gain is the amplification of the sensed signal using automatic gain control logic. We will not go into the details of each of these; once again, if you are interested, you can read the references provided at the end of this lecture to get more details on all of them.

Typically, analog gain is what you control using the ISO setting on your camera. You can also have sensor noise, which comes from various sources in the sensing process. Resolution tells you how many bits are specified for each pixel, which is decided by the analog-to-digital conversion module in a CCD, or within the sensing elements themselves in a CMOS sensor.

So, if you use 8 bits to represent each pixel, you get a value going from 0 to 255 for each pixel, and that gives you the sensing resolution for that pixel. Finally, there are also post-processing steps, as we already briefly mentioned, such as digital image enhancement methods used before compression and storage of the captured image.

(Refer Slide Time: 13:48)

So, one popular question that is often asked here is: these days smartphones seem to be so good, with very high-resolution cameras, do you really need what are known as DSLR cameras? So, what are DSLR cameras? DSLR stands for Digital Single Lens Reflex camera, and the main difference between a DSLR camera and any other point-and-shoot or cell phone camera is the use of mirrors. A DSLR camera uses a mirror mechanism to reflect light to a viewfinder, and can also move the mirror out of the way to let the light fall on the image sensor.

So, effectively the comparison here becomes one between mirrored cameras and mirrorless cameras. Mirrorless cameras, such as those in your smartphones, are more accessible, portable and low cost, whereas when you have a mirror, the picture quality tends to be better and more functionality is possible. Again, we will not step into more details here, but please do read the sources and links given under each slide if you want to know more. Mirrored cameras such as DSLRs also give you a physical shutter mechanism, variable focal length and aperture, so on and so forth. That is the reason there is still value in DSLR cameras despite the advancement in smartphone cameras.

(Refer Slide Time: 15:22)

So, another factor that you need to understand when you talk about image formation is the concept of sampling and aliasing; we will talk about this in more detail a bit later, but here is a brief review. The Shannon sampling theorem states that if the maximum frequency of the data in your image is f_max, you should sample at at least twice that frequency. Why so, we will see a bit later, but for the moment, half the sampling frequency is called the Nyquist frequency, and if your image contains frequencies above the Nyquist frequency, then the phenomenon called aliasing happens.

So, why is this bad, and what impact can it have on image formation? This can often create issues when you upsample or downsample an image. If you capture an image at a particular resolution, say 256 x 256, and choose to upsample or downsample it, aliasing can hurt in those settings; we will see this in more detail in a later lecture.
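
As a small illustration of aliasing during downsampling (the stripe pattern and the factor of 2 below are made up for demonstration): subsampling a fine pattern directly misrepresents it, while smoothing first merely averages it out.

    import numpy as np

    # A 256 x 256 image of 1-pixel-wide vertical stripes: the highest frequency the grid can hold
    img = np.tile(np.array([0.0, 1.0] * 128), (256, 1))

    naive = img[:, ::2]                               # keep every 2nd column: aliasing
    smoothed = (img + np.roll(img, 1, axis=1)) / 2.0  # crude low-pass filter before sampling
    antialiased = smoothed[:, ::2]

    print(naive[0, :8])        # [0. 0. 0. ...]   the stripes collapse into a solid image
    print(antialiased[0, :8])  # [0.5 0.5 ...]    the pattern is averaged, not misrepresented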

(Refer Slide Time: 16:37)

Also, in terms of representing the image itself, multiple colour spaces are possible. While RGB is the most common one, people today use various other kinds of colour spaces, not necessarily in cameras but in other kinds of devices; I will mention that briefly now. Popular colour spaces are RGB and CMYK; CMYK stands for cyan, magenta, yellow and black, which is what you see here. R, G and B are additive colours, while C, M and Y are subtractive colours. A particular application where CMYK is used in practice is printing.

It happens that it is a lot easier to control colours using CMYK in printers; you can read more about this in the links provided below. Other colour spaces used in practice are XYZ, YUV, Lab, YCbCr, HSV, so on and so forth. There is an organization called the CIE which establishes standards for colour spaces, because this is actually important for the printing and scanning industry and for people working in that space. That is the reason there are standards established for these kinds of spaces. We will not get into more details here; once again, if you are interested, please go through the links below to know more about colour spaces and what is meant by additive and subtractive colours.

(Refer Slide Time: 18:19)

Finally, the last stage in image formation is image compression, because you have to store the image that you captured. Typically you convert the signal into a form called YCbCr, where Y is the luminance and Cb, Cr are the chroma, or chrominance, components, and the reason for this is that you typically compress luminance with a higher fidelity than chrominance.

Because of the way the human visual system perceives light, luminance is more important than chrominance, so you ensure that luminance is compressed with a higher fidelity, which means your reconstruction is better for luminance than for chrominance. That is one reason why YCbCr is a popular colour space before storage. Once again, if you do not understand YCbCr, go back to the previous slide and look at those links; YCbCr is one of the colour space representations available in practice.

And as I just mentioned, the most common compression technique used to store an image is the Discrete Cosine Transform (DCT), which is popularly used in standards such as MPEG and JPEG. The Discrete Cosine Transform is a variant of the Discrete Fourier Transform, and you can call it a reasonable approximation of an eigendecomposition of image patches.

We will not get into that for now; this is how images are compressed using the DCT. Videos additionally use what is known as block-level motion compensation: you divide the video into frames and sets of frames into blocks, and then store certain frames based on concepts from motion compensation. This is typically used in the MPEG standard, which divides the frames into what are known as I-frames, P-frames and B-frames and then uses strategies to decide how each frame should be coded; that is how videos are compressed.

And compression quality is finally measured through a metric called PSNR, which stands for Peak Signal-to-Noise Ratio. PSNR is defined as 10 log10( I_max^2 / MSE ), where I_max is the maximum intensity you can have in an image and MSE is simply the mean squared error, pixel-wise, between the original image and the compressed image.

This is what is typically called PSNR, which is used to measure the quality of image compression. There are other kinds of metrics based on human perception, but this is the most popular statistical metric in use.
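
Here is a minimal sketch of computing PSNR for an 8-bit image; the noisy "compressed" image below is synthetic, just to exercise the formula.

    import numpy as np

    def psnr(original, compressed, i_max=255.0):
        # PSNR = 10 * log10(I_max^2 / MSE), reported in decibels
        mse = np.mean((original.astype(float) - compressed.astype(float)) ** 2)
        if mse == 0:
            return float("inf")              # identical images
        return 10 * np.log10(i_max ** 2 / mse)

    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(64, 64)).astype(float)
    degraded = np.clip(img + rng.normal(0, 5, size=img.shape), 0, 255)
    print(round(psnr(img, degraded), 2), "dB")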

(Refer Slide Time: 21:40)

That is about it for this lecture on image formation. If you want to read more, please read chapter 2 of Szeliski's book, and also read the links provided on some of the slides, especially if one of those topics interests you or you are left with some questions. If you want to know in more detail how images are captured, including the geometric and photometric aspects, please read chapters 1 to 5 of Forsyth and Ponce.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology Hyderabad
Image Representation

In the last lecture we spoke about image formation, and now we will move on to how you represent an image so that you can process it using transformations.

(Refer Slide Time: 00:30)

So, we did leave one question during the last lecture, which is: if the visible light spectrum is VIBGYOR, from violet to red, why do we use an RGB colour representation? Hope you all had a chance to think about it, read about it and figure out the answer. The answer is that the human eye is made up of rods and cones.

The rods are responsible for detecting the intensity of the world around us and the cones are responsible for capturing colour. It happens that in the human eye there are mainly three kinds of cones, and these cones have sensitivities that peak at specific wavelengths, which are represented by S, M and L in this particular figure.

So, if you look at where these cones peak, it happens to be close to red, green and blue, and that is the reason for representing images as red, green and blue. In all honesty, the peaks are not exactly at red, green and blue; they actually fall at colours in between, but for convenience we just use R, G and B.

Some interesting facts here: the genes for the M and L cone pigments are carried on the X chromosome, which means that males, who have XY chromosomes (females have XX), are more likely to be colour-blind. Also, not all animals have the same three cones that humans have: nocturnal animals have one kind of cone, dogs have two, and fish and birds have more colour sensitivity, with four, five, or, in the mantis shrimp, up to twelve different kinds of cones. So, nature has an abundance of ways in which colour is perceived.

(Refer Slide Time: 02:42)

Moving on to how an image is represented: the simplest way, which you may have already thought of, is to represent an image as a matrix. So, here is a picture of the Charminar, and if you look at one small portion of the image, the clock part, you can zoom into it and represent it as a matrix of values, in this case lying between 0 and 1, and obviously you would have a similar matrix for the rest of the image too.
While we are talking here about using values between 0 and 1, in practice it is very common to use a byte to represent each pixel, which means every pixel can take a value between 0 and 255, and in practice we also often normalize these values to lie between 0 and 1; that is why you see these kinds of values in the representation.
Also keep in mind that for every colour channel you would have one such matrix: if you had a Red, Green, Blue image, you would have one matrix for each of these channels. What would be the size of this matrix? The size of the matrix depends on the resolution of the image. Recall what we spoke about the image sensing components in the last lecture: the resolution at which the image sensor captures the image decides the resolution, and hence the size, of the matrix.
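
A minimal sketch of this representation with NumPy (the 4 x 4 size and random values are just for illustration): one matrix per colour channel, stored as bytes and optionally normalized to [0, 1].

    import numpy as np

    rng = np.random.default_rng(0)
    img_uint8 = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)  # toy 4 x 4 RGB image

    red_channel = img_uint8[:, :, 0]            # each channel is itself a matrix
    img_norm = img_uint8.astype(float) / 255.0  # normalized values in [0, 1]

    print(red_channel)
    print(np.round(img_norm[:, :, 0], 2))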

(Refer Slide Time: 04:27)

A matrix is not the only way to represent an image; an image can also be represented as a function. Why? Because it helps us define certain operations on images more effectively. In this case, we can talk about the function going from R2 to R, where R2 simply corresponds to a particular coordinate location on the image, say (i, j); that is what we mean by R2.

And the range R is the intensity of the image, which could assume a value between 0 and 255, or 0 and 1 if you choose to normalize the image. A digital image is a discrete, sampled, quantized version of that continuous function we just spoke about. Why sampled? Because the original function is continuous, like the real world in which the image was captured, and we sample the real world at particular pixel locations on a grid with respect to a point of reference; that is what we call a sampled, discrete version of the original continuous function.

Why quantized? Because we are saying that the intensity can only be represented as values between 0 and 255, in unit steps; you cannot have a value of 0.5, at least in this particular example. Obviously you can change this in a particular capture setting, but when we talk about using a byte to represent a pixel, you can only have 0, 1, 2 and so on up to 255; you cannot have 0.5, so you have discretized, or quantized, the intensity values in the image.

(Refer Slide Time: 06:25)

So, let us talk about transforming images when we look at them as functions. Here is an example transformation: you have a face and we seem to have lightened it in some way. What do you think the transformation is? Can you guess? In case you have not, the transformation here is: if your input image is I and your output image is Î, then Î(x, y) = I(x, y) + 20. And 20 is just a number; if you wanted it to be lighter, you would add 30 or 40. Again, we are assuming that the values lie between 0 and 255.
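
A minimal sketch of this point operation (the clipping to [0, 255] is an implementation detail added here so that values stay in range; it is not discussed on the slide):

    import numpy as np

    def brighten(image, amount=20):
        # Point operation: I_hat(x, y) = I(x, y) + amount, clipped to the valid range
        return np.clip(image.astype(int) + amount, 0, 255).astype(np.uint8)

    img = np.array([[10, 100], [200, 250]], dtype=np.uint8)
    print(brighten(img))     # 250 + 20 is clipped to 255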

(Refer Slide Time: 07:11)

One more example: on the left you have a source image and on the right a target image. What do you think the transformation is? The transformation is Î(x, y) = I(−x, y): the image is reflected around the vertical axis; the y axis is fixed and you flip the x axis values. Notice that in both of these examples the transformations happen point-wise, or pixel-wise; in both cases we have defined the transformation at the pixel level. Is that the only way you can perform a transformation? Not necessarily.
(Refer Slide Time: 08:06)

Very broadly speaking, there are three different kinds of operations that you can perform on an image: point operations, local operations and global operations. Point operations are what we have just spoken about, where a pixel in the output depends only on the pixel at the same coordinate location in the input.

A local operation is one where a pixel in the output depends on an entire region or neighbourhood around that coordinate in the input image, and a global operation is one in which the value that a pixel assumes in the output image depends on the entire input image. In terms of complexity, for a point operation the complexity per pixel is constant; for a local operation the complexity per pixel is p^2, assuming a p x p local neighbourhood around the coordinate you are considering for that operation; and for a global operation the complexity per pixel is obviously N^2, where the image is N x N.

(Refer Slide Time: 09:29)

Let us see a couple more point operations and then we will look at local and global ones. Here is a very popular point operation that you may have used in your smartphone camera, Adobe Photoshop or any other image editing task you have taken on. It is an image enhancement task where we want to reverse the contrast: we want black to become white, dark grey to become light grey, so on and so forth.

What do you think? How would you implement this operation? In case you have not worked it out yet: since it is a point operation, at a particular pixel (m0, n0) your output will be I_max minus the original pixel value at that location, plus I_min. You are flipping the intensities: if you had a value of, say, 240, which is close to white (generally white is 255 and black is 0), it now becomes 15, because I_max in our case is 255 and I_min is 0. I_min obviously does not matter here, but the formula assumes a more general setting where I_min could be some other value in practice.

(Refer Slide Time: 10:57)

Moving on, let us take one more image enhancement example, but this time we are going to talk about stretching the contrast. When we stretch the contrast, we take the set of values in the image and stretch it to use the entire range of values each pixel can occupy. This is again a very common operation that you have used if you have edited images.

What do you think the operation is here? It is slightly more complicated than the previous one. In case you do not already have the answer: let us first find the scaling ratio. You have (I_max − I_min), which is 255 − 0, divided by (max of I in this image − min of I in this image). Let us assume hypothetically that the image on the left has its max value at 200 and its min value at 100.

In that case this entire ratio becomes 2.55, which is (255 − 0)/100. So, you take the original pixel; let us assume for the moment that it had a value of, say, 150. You subtract the minimum, which is 100, so you have 50 times 2.55 plus I_min, which for us is 0, and that comes to roughly 128, which is 50 percent of the overall output range.

So, what was 150, the middle of the range of values in the input image, now becomes 128, the middle of the entire range of values between 0 and 255. You are simply stretching the contrast so that you use all the values between 0 and 255, which means what would have gone from dark grey to light grey now goes from black to white; that is how you increase the contrast. This is called linear contrast stretching, a simple operation; in practice we do something more complicated.
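
A minimal sketch of linear contrast stretching, using the same hypothetical 100-200 range as above (the contrast reversal from the previous example is noted in a comment for comparison):

    import numpy as np

    def stretch_contrast(image, out_min=0, out_max=255):
        # Map [image.min(), image.max()] linearly onto [out_min, out_max]
        img = image.astype(float)
        scale = (out_max - out_min) / (img.max() - img.min())
        return ((img - img.min()) * scale + out_min).astype(np.uint8)

    # Contrast reversal, by comparison, would simply be: out_max - image + out_min

    img = np.array([[100, 150], [175, 200]], dtype=np.uint8)   # values only span 100..200
    print(stretch_contrast(img))                               # now spans 0..255; 150 -> ~127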

(Refer Slide Time: 13:27)

So, we do what is known as histogram equalization. You may have heard about it, perhaps used it in certain settings. If you have not heard about it, read about it; that is going to be your homework for this particular lecture.

(Refer Slide Time: 13:41)

So, let us ask the question: do point operations satisfy all the requirements we have for operating on images? Let us take one particular example. We know that a single point's intensity is influenced by multiple factors, which we talked about last time, and it may not tell us everything, because it is influenced by light source strength and direction, surface geometry, sensor capture, image representation and so on.

So, a single pixel may not be fully informative. Let us take an example to show this: assume we give you a camera and a still scene with no movement; how do you reduce noise using point operations? The noise could be caused by some dust blowing in the scene, by a speck of dust on the lens of your camera, by damage to one of the sensors, or by any other reason for that matter.

Noise could be at various levels. How would you reduce noise using only point operations? The answer: you take many images and average them. Because it is a still scene, we can keep taking images and hope that the noise gets averaged out across all of the images; they are a bunch of matrices, so you can simply take an element-wise average of all of those matrices, and that can help mitigate the issue of noise to some extent.

But clearly that is a stretch: you do not get multiple images of every scene all the time, and you do not get a scene that is absolutely still all the time; there is always some motion, and so this may not be a method that works very well in practice. So, to handle this we have to graduate from point operations to local operations.

(Refer Slide Time: 15:31)

So, let us see what a local operation means. As we already said, a pixel value in the output depends on an entire neighbourhood of pixels in the input around the coordinate at which we want to evaluate the output.

(Refer Slide Time: 15:47)

So, here is a very simple example to understand what a local operation is; the standard example is what is known as the moving average. Here you have the original input image I; as you can see, it is simply a white box placed on a black background (the zeros that you see as values mean a black background).

The image has a particular resolution, in this case 10 × 10, and the white box is located in a particular region. The problem for us is that we are going to assume that this black pixel in the middle of the box and this white pixel outside it are noise pixels that came in inadvertently. So, how do you remove them?

The way we are going to remove them is to compute a moving average: you take a 3 × 3 window (it need not be 3 × 3 all the time, it could be a different size, but for the moment we take 3 × 3) and simply take the average of the pixels in that particular region. The average here comes out to be 0, so you fill that in at the centre location of the box.

(Refer Slide Time: 17:03)

Moving on, you now move the 3 × 3 box to the next location and again take an average; the sum turns out to be 90, and 90/9 = 10. Similarly, you slide the box further and once again take the average of all pixels inside the box in the input, and that gives you one value in the output. Clearly, this is a local operation: the output pixel depends on a local neighbourhood around the same coordinate location in the input image.
(Refer Slide Time: 17:46)

You can continue this process and you will finally end up creating the entire output image, which looks somewhat like this. You may have to squint to see it, but the seemingly noisy pixels here and here in the input have been smoothed out because of the values of their neighbours, and the output looks much smoother; this is a low-resolution image, so it looks a bit blocky.

But at a higher resolution it would look much smoother to your eyes. So, what is the operation that we did? Let us try to write it out. We said that Î at a particular location (x, y) is obtained by taking a neighbourhood: you take the same location in the input image and go from x − k to x + k for some window size k.

Similarly, you go from y − k to y + k; call these indices i and j, and you take the values of all those pixels in the input image. And obviously we are going to average all of them, so you multiply this entire sum by 1/(2k + 1)^2, because the neighbourhood going from x − k to x + k contains 2k + 1 pixels in each direction.

So, you have (2k + 1)^2 pixels in total, because for x you have 2k + 1 pixels and for y you have 2k + 1 pixels, and the total number of pixels is the product of the two. In the particular example that we saw, k was 1: for a given output location, you took the corresponding location in the input along with one pixel to the left and one to the right, going from x − 1 to x + 1 and y − 1 to y + 1. That creates a 3 × 3 window, and that is what you normalize by. So the operation is Î(x, y) = (1/(2k + 1)^2) Σ_{i = x−k}^{x+k} Σ_{j = y−k}^{y+k} I(i, j); this is the moving average, an example of a local operation.
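
A minimal sketch of the moving-average local operation (the zero padding at the borders and the 0/90 toy values are my own choices for illustration):

    import numpy as np

    def moving_average(image, k=1):
        # Local operation: each output pixel is the mean of its (2k+1) x (2k+1) input window
        H, W = image.shape
        padded = np.pad(image.astype(float), k)     # zero padding so border pixels have a full window
        out = np.zeros((H, W))
        for x in range(H):
            for y in range(W):
                out[x, y] = padded[x:x + 2 * k + 1, y:y + 2 * k + 1].mean()
        return out

    img = np.zeros((10, 10))
    img[3:7, 3:7] = 90        # a bright box on a dark background
    img[5, 5] = 0             # a "noise" pixel inside the box
    print(moving_average(img)[4:6, 4:6])   # the hole is smoothed over by its neighbours
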
(Refer Slide Time: 20:11)

Moving on to the last kind of operation, the global operation: as we already mentioned, in this case the value of an output pixel depends on the entire input image. Can you think of examples?

(Refer Slide Time: 20:27)

In case you have not already figured it out, a standard example of something like this is what is known as the Fourier transform, which we will see in a later lecture; there are other operations too that can be global, depending on the application. We will see more of this a bit later, and we will specifically talk about the Fourier transform then.
(Refer Slide Time: 20:49)

That is about it for this lecture. Your reading is chapter 3.1 of Szeliski's book; also, as we mentioned, think about the question, read about histogram equalization, and try to find out how it works and what expression you would write to make it work.

Deep Learning for Computer Vision
Professor. Vineeth Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture – 05
Linear Filtering

In the last lecture we spoke about representing an image, and we also saw a few operations that can be performed on images. Let us quickly review them and then move ahead.

(Refer Slide Time: 00:28)

We talked about three types of operations: point, local and global, and we did leave one question for you to find out about; hope you put in some effort to answer it. The question was: how do you perform histogram equalization? We talked about the example of linear contrast stretching, and histogram equalization is a more complex variant of the contrast stretching operation.

So, let us now see how histogram equalization, a very popular operation for improving the quality of images, is done. Say I is an image with M x N pixels, let I_max be the maximum image intensity value, and let us create a histogram of the image, which we denote h(I). Remember that a histogram is nothing but taking the entire range of image values and counting how many pixels fall in each bin; it is simply a frequency count in bins.

So, now you can integrate h(I) to obtain a cumulative distribution c(I) for your image; you will see an example of this in a moment. The cumulative distribution at intensity k is simply c_k = (1/(M N)) Σ_{i=1}^{k} h_i: you count the number of pixels up to that particular intensity value (the histogram bins them individually, the cumulative distribution accumulates them) and then normalize by the number of pixels.

So, here is an example of a cumulative distribution: this red curve is a cumulative distribution, which simply adds up the histogram as you go from the lowest intensity to the highest. The final transformed image after histogram equalization is simply I_max, the maximum intensity, times c(p_ij), the value of the cumulative distribution at that particular pixel's intensity.

So in this particular figure, we are saying that if you had a grey-level intensity of 90 in the image, then in the histogram-equalized image you look at the cumulative proportion of pixels up to intensity 90 in that image. In this case it happens to be 0.29, and you simply multiply that by I_max, the maximum intensity. An intuitive way of understanding histogram equalization is to look at the distribution of intensities in the image: if a lot of the intensities lie between 200 and 250, then a pixel with intensity 90 has only a small cumulative proportion below it, and hence in the histogram-equalized image it gets a lower value than another pixel with a higher intensity. That is the idea of histogram equalization. You can read the references for this course, either Simon Prince's book or Szeliski's book, to understand more.
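
A minimal sketch of histogram equalization following the formula above (the toy low-contrast image is synthetic):

    import numpy as np

    def histogram_equalize(image, i_max=255):
        # Map each pixel p to i_max * c(p), where c is the cumulative intensity distribution
        hist = np.bincount(image.ravel(), minlength=i_max + 1)   # h(I): count per intensity
        cdf = np.cumsum(hist) / image.size                       # c(I): cumulative, in [0, 1]
        return (i_max * cdf[image]).astype(np.uint8)

    rng = np.random.default_rng(0)
    img = rng.integers(100, 160, size=(32, 32)).astype(np.uint8)   # low-contrast image
    eq = histogram_equalize(img)
    print(img.min(), img.max(), "->", eq.min(), eq.max())          # intensities now span roughly 0..255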

(Refer Slide Time: 04:26)

Let us move on now to understanding what filters on an image are. We did talk about operations; let us now formalize them into a concept called a filter. An image filter is a local operation (typically local, though it can be global) where you modify image pixels based on some function of a local neighbourhood of that pixel, exactly what we defined a local operation to be in the previous lecture.

Let us take this example. You have a 3 x 3 image with the pixel values indicated. Assume that applying some filter gives the output at the central pixel as 4. Can you guess what the function is? In this particular case it simply turns out to be the average of all the pixels in the neighbourhood. You could also make this more complex and introduce a filter which has specific values in different locations, and you simply take a dot product of every value at a location in the image with the corresponding value at that location in the filter. Such a filter is also sometimes called a mask or a kernel. These terms are overloaded; they mean different things in other fields, but when we talk about image processing or the low-level computer vision of today, we call them filters, masks or kernels.

So, here is another example of a linear filter, where you take this particular kernel and simply compute a linear combination of the original image values with the kernel, and you get the output 6.5. Why is it called a linear filter? Because the output is a linear combination of the local neighbourhood, as defined by the coefficients in the kernel.

(Refer Slide Time: 06:48)

Let us formally define this now. The operation that we saw on the earlier slide can now be written formally: if you have a kernel of size (2k + 1) x (2k + 1), so in our case a 3 x 3 filter means k = 1, then we can define this operation as correlation, because it is simply a dot product between two quantities. Your input is I, and your output G at a particular location (i, j) is given by the first term here.

It is simply an averaging term: we take a sum with an index u going from −k to k and v going from −k to k and add up all the pixels of I in that window. This is plain correlation. In cross-correlation, the key difference is that you now have non-uniform weights, where you specify what the linear combination should be.

So, when you add up all the pixels in the neighbourhood, should you simply take an average, or should you multiply each element in the neighbourhood by some weight, and if so, which weight? That weight is given by H(u, v), and the resulting operation, G(i, j) = Σ_{u=−k}^{k} Σ_{v=−k}^{k} H(u, v) I(i + u, j + v), is called the cross-correlation between H and I. This is one of the simplest operations, and it just formalizes what we did on the earlier slide.

So, cross-correlation is formally denoted as an operation on H and I giving the output G. As I already mentioned, it can be viewed as a dot product between the local neighbourhood and the kernel (or filter, or mask) for that particular pixel. The entries of the kernel, mask or filter are called the filter coefficients; on the previous slide, the values in that kernel are the filter coefficients.

(Refer Slide Time: 09:25)

Let us now view the moving average operation from the last lecture as a linear filter. We said that, given an image, the moving average filter removes certain kinds of noise by replacing each location with the average of the local neighbourhood in the input image. If we view this as a cross-correlation operation, or as a linear filter, what would the filter be?

Remember, it has to take the average of the values in the neighbourhood. The answer is that every element of the filter would simply be 1/9; the 1/9 normalizes the entire set of values. Obviously, if you were taking a 5 x 5 window, this would be 1/25 in each entry of a 5 x 5 filter. So, this is the linear filter for the moving average operation that we saw in the last lecture, which can help remove certain kinds of noise.

(Refer Slide Time: 10:50)

Here is a more real-world example of a moving average filter. On the left you see the input image, and on the right the output image, where the output is an averaged or smoothed version; you could also call it a blurry version, because you are smudging the image at every pixel. This is also sometimes known as a box filter, because the filter coefficients are all exactly the same, 1/9 in all locations as we saw on the earlier slide, so the filter looks like a box.

(Refer Slide Time: 11:38)

Now let us complicate this a bit. Say we do not want a box filter: we still want an average, but we want the centre element to have the highest influence on the output, the immediate neighbours to have the next level of influence, neighbours two steps away to have a lower influence, and so on, depending on the size of the filter.

Remember, you could take the average not just over a 3 x 3 neighbourhood; you could use a 5 x 5 or 7 x 7 neighbourhood, and that is something you have to define when you perform the operation. A Gaussian averaging filter looks somewhat like this, where the centre element has the highest influence, its immediate neighbours have the next level of influence, and so on. The 1/16 is simply a normalizing factor, because you do not want the pixel intensities in the output image to blow up; you want them to stay within a predefined range, and that is why you need to normalize. This filter is a discrete version of a 2D Gaussian function, and it is again an averaging filter because it smooths or blurs the image, but rather than acting like a box filter it gives more importance to the central pixel.

(Refer Slide Time: 13:21)

So, if you take the same example image from a couple of slides back and compare the box filter output with the Gaussian filter output, you see slight differences in how the two filters behave. If you observe carefully, you will see that the box filter output has some blockiness artifacts, whereas the Gaussian filter output does not. Why? Because the box filter does not taper off towards the edges of the window; it keeps the same weight across the entire neighbourhood, and that can introduce certain kinds of blockiness artifacts.

(Refer Slide Time: 14:05)

What other kinds of linear filters can you build? You will see more and more of them as we go through this course, but another typical example is what is known as an edge filter. Given an image, you may want to extract the edges; let us take the example on the slide. You have this input image and you want to extract, say, some vertical edges and some horizontal edges; the question then is what your filter, kernel or mask should be.

The last set of images here are simply the absolute values of the output images. You will see more on edge filters a bit later, but the question for you now is: what would H(u, v) be if you simply wanted to extract edge information from your input image? For convenience, let us stay with a 3 x 3 filter; we could increase the size in practice, but we are not going to do that now.

(Refer Slide Time: 15:17)

If you have not guessed the filter already, a vertical edge filter would look something like this, with −1, −1, −1 in one column and 1, 1, 1 in another. Effectively, the filter looks for places in the image where there is a significant difference between the pixels to the left and the pixels to the right of a location, and it exaggerates those pixels in the output image.

A horizontal edge filter does the same thing, but along the horizontal direction. The 1/9 here is again a normalizing factor, but you typically do not need it in an edge filter, because we are not interested in the absolute edge value but in high responses where there is an edge. We will talk about this a bit more when we get to edge detection a few lectures down the line.
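
A minimal sketch applying such a vertical edge filter (a Prewitt-style kernel; the synthetic step-edge image is made up for illustration), using SciPy's correlate:

    import numpy as np
    from scipy.ndimage import correlate

    vertical_edge = np.array([[-1, 0, 1],
                              [-1, 0, 1],
                              [-1, 0, 1]], dtype=float)   # -1s on one side, +1s on the other

    img = np.zeros((6, 6)); img[:, 3:] = 255.0   # dark left half, bright right half

    response = np.abs(correlate(img, vertical_edge, mode="nearest"))
    print(response)   # large values only where the intensity changes from left to right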

(Refer Slide Time: 16:26)

So, let us ask an important question now. So, let us take an impulse signal for those of you
with a signal processing background we will be able to appreciate this. So, let us take an
impulse signal, which for us let us say, we define it as a single white pixel in the middle of an
entire black image. We are going to call that an impulse signal and if you take a particular
filter let us say the filter is given by a set of values a, b, c, d, e, f, g, h, i; so it is neither a box
filter nor a Gaussian filter. It is just a set of values that you have organized as a filter.

What would the output be if you used cross-correlation? You can work this out carefully by yourself, but you will find that the output is completely flipped: the bottom-right value of the kernel goes to the top left and the top-left value goes to the bottom right; in other words, the output is a double-flipped version of the kernel. You can verify this by working out the cross-correlation pixel by pixel. For example, to get the output at this pixel here, you place the kernel over the corresponding input location, and the white pixel is the only value that gets multiplied by a non-zero image entry; all the other kernel values get multiplied by the surrounding black pixels, which are 0, and have no bearing on the output, so the output at that location is determined entirely by the single kernel entry sitting over the white pixel. Taking one more example, for this output pixel you place the same kernel over the corresponding 3 x 3 input neighbourhood, and again only one kernel value lines up with the white pixel while everything else gets multiplied by 0. Doing this for every pixel, you realize that the output is the double-flipped kernel. This is not what we expect from a typical identity-like operation on an impulse signal.

We would expect that if we perform the operation of a kernel on an impulse signal, we get the kernel itself, but sadly here the output is double-flipped. So what do we do if we want an operation that behaves like an identity, where applying the kernel to an impulse signal gives back the kernel itself, which, as you can see, does not happen with cross-correlation?

(Refer Slide Time: 20:09)

That brings us to the operation of convolution. Convolution is very similar to cross-correlation; the main difference is that, given a kernel of size (2k + 1) x (2k + 1), an input image I and a filter H, your output is G(i, j) = Σ_{u=−k}^{k} Σ_{v=−k}^{k} H(u, v) I(i − u, j − v). What are you doing here? While performing the operation you are effectively double-flipping the filter, so that the output for an impulse turns out to be the filter itself.

So, this slide has a minor issue: there should be a normalizing factor here, but it does not really matter in practice as long as the values in the filter itself are normalized. If your filter values are not normalized, you have to normalize explicitly at the end; if they are, this normalization is not required. As I just mentioned, convolution is equivalent to flipping the filter in both directions, top to bottom and left to right, and then applying cross-correlation.

So, convolution is typically denoted in this manner, H * I, where H is the filter and I is the image. Here is an example of how it works on the same impulse signal: you take your filter H(u, v) and double-flip it, which gives the filter on the right; you then perform a cross-correlation with this double-flipped filter, and you end up with an output that is the same as the filter we gave as input to the operation. Why do we really need this? One reason, as we said, is that we wanted some kind of identity behaviour; we will come back to this in a moment.
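
A small sketch confirming both facts above: convolving a unit impulse with a kernel returns the kernel, and convolution equals cross-correlation with a double-flipped filter (the kernel values a..i are arbitrary; SciPy's convolve and correlate are used for brevity):

    import numpy as np
    from scipy.ndimage import convolve, correlate

    kernel = np.arange(1.0, 10.0).reshape(3, 3)      # the values a..i arranged as a 3 x 3 filter
    impulse = np.zeros((5, 5)); impulse[2, 2] = 1.0  # single "white" pixel on a black image

    flipped = kernel[::-1, ::-1]                     # double flip: up-down and left-right
    via_corr = correlate(impulse, flipped, mode="constant")
    via_conv = convolve(impulse, kernel, mode="constant")

    print(via_conv[1:4, 1:4])                        # equals the original kernel
    print(np.allclose(via_corr, via_conv))           # True: convolution = flip + cross-correlation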

(Refer Slide Time: 22:41)

Before we answer that question, recall the slide from the history of computer vision lecture about the early experiments in the late 1950s that established that there are simple and complex cells in the mammalian visual cortex.

(Refer Slide Time: 23:01)

There has been follow-up work along these lines showing that simple cells in the visual cortex perform simple linear spatial summation over their receptive fields. The receptive field is simply the part of the input image you focus on while performing an operation; for example, when you apply a 3 x 3 convolution or correlation filter, your receptive field is of size 3 x 3, because that is the part of the input image you look at while performing one such operation.

Obviously, you keep repeating this at every pixel in your image to get the output, but for one particular operation the receptive field is 3 x 3. Coming back to the point, this work from the late 1970s showed that simple cells perform linear spatial summation, which seems to hint that correlation and convolution could be operations that let us process images in a way similar to the mammalian visual cortex.

(Refer Slide Time: 24:16)

It happens that correlation and convolution are both what are known as linear shift-invariant operators. Those of you with a linear systems or signal processing background may already know this. Linearity means that if you have an image I and two filters h0 and h1, where the operation could be convolution or correlation, then applying h0 to I and h1 to I and adding the two output images gives the same result as first adding the filters and then applying the combined filter to I.

This is linearity, also known as the superposition principle. The other property is shift invariance. This is an important property, something that accounts for why convolution is used to this day in deep learning and so on; it simply states that shifting or translating a signal commutes with applying the operator.

Let me try to explain this slightly differently. The general definition is given here, but suppose Î is simply a translated version of an image I: for every pixel (i, j) of Î, the value is the value of I at (i + k, j + l). It is like moving the entire image I a little to one side; k and l could also be negative, so the shift could be left or right, up or down. Now, if you have a filter f, then whether you convolve it with Î, or convolve it with I and read off the result at the translated locations, you get the same output. This may seem trivial at first, but it is an important property: it tells us that applying a filter f at, say, location (1, 1) of the image matrix produces the same output as applying the same filter at (3, 3), provided the two regions have exactly the same intensity values. Why is this important? If there is a cat at the top left of the image or a cat at the bottom right, a given filter will produce the same response wherever the cat is located, and that is what shift invariance means.

And that is how convolution induces translation invariance in any approach that uses
convolution. Another way of saying this is that the effect of the operator is the same
everywhere in the image, as long as the image has the same characteristics. If two images
have the same characteristics, but the object corresponding to those characteristics is at
different locations in the image, the corresponding output at those locations will be the same
for a given filter. Why we need this in computer vision is what I just explained.
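
As a side note for those following along in code, here is a minimal sketch in Python (assuming
NumPy and SciPy are available; the random image, box filter and shift amounts are made-up
examples). It checks numerically that filtering a translated image gives the translated version of
the filtered image; circular shifts with 'wrap' boundary handling are used so the equality holds
exactly.

    import numpy as np
    from scipy.ndimage import correlate

    rng = np.random.default_rng(0)
    I = rng.random((10, 10))            # a small random "image"
    f = np.ones((3, 3)) / 9.0           # a 3 x 3 box filter
    k, l = 2, 3                         # hypothetical translation amounts

    I_hat = np.roll(I, (k, l), axis=(0, 1))                  # translated version of I
    out_a = correlate(I_hat, f, mode='wrap')                 # filter the shifted image
    out_b = np.roll(correlate(I, f, mode='wrap'), (k, l), axis=(0, 1))  # shift the filtered image
    print(np.allclose(out_a, out_b))    # True: shifting commutes with the operator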

(Refer Slide Time: 28:20)

So, this is an important slide. We said that cross-correlation unfortunately did not help us
maintain an identity with the impulse function, whereas convolution did, and convolution has
many more mathematical properties that make it an elegant operation. The first one is that it is
commutative: a * b is the same as b * a. This is an interesting and important property because
it simply means that if a is your image I and b is your filter h, whether you call a the

93
image and b the filter, or the other way round, does not matter because the operation is
commutative.

The second property is associativity, which says a * (b * c) is the same as (a * b) * c. Why is
this important? If you take an image given by a and two different filters b and c (call them h1
and h2), you can either apply the filter h1 on the image first, get an output, and then apply
filter h2; or you can pre-compute the convolution of h1 and h2 and simply apply that combined
filter to the image.

We will see some tangible, useful applications of this property a bit later. Other properties:
convolution is distributive over addition, so a * (b + c) = (a * b) + (a * c); I think we saw this
with linearity too. It factors out scalars: (ka) * b = a * (kb), which is nothing but k(a * b). And,
as we already saw, if you have a unit impulse, defined by a vector with a 1 in the center and 0
everywhere else (this is the one-dimensional impulse; for the two-dimensional impulse we
already saw an example with a white value in the middle of the image and black everywhere
else), then convolving a with this unit impulse gives back a itself, which did not hold for
cross-correlation. So convolution has a few elegant properties which, as you will see, help us
in many applications.
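
As a quick numerical sanity check of these properties, here is a small sketch in Python (NumPy
assumed; the 1D signal and filters below are arbitrary examples chosen only for illustration).

    import numpy as np

    a = np.array([1., 2., 3., 4., 5.])        # treat this as the "image" (1D for brevity)
    b = np.array([1., 2., 1.]) / 4.0          # filter h1
    c = np.array([-1., 0., 1.])               # filter h2

    # Commutativity: a * b == b * a
    print(np.allclose(np.convolve(a, b), np.convolve(b, a)))
    # Associativity: (a * b) * c == a * (b * c), i.e. the two filters can be pre-combined
    print(np.allclose(np.convolve(np.convolve(a, b), c),
                      np.convolve(a, np.convolve(b, c))))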

(Refer Slide Time: 31:03)

94
Another important property of convolution is the notion of separability. A typical convolution
operator requires k^2 operations per pixel, assuming k x k to be your kernel; so for a 3 x 3
kernel, if you look at a particular output pixel, you need 9 operations (3 x 3) to compute the
value at that output pixel. This can be costly when the image is very large; remember we have
to repeat the same operations for every pixel in the image, so it could be 9 x N^2 where N^2 is
the size of the image, assuming you have an N x N image. Can we do something to reduce the
cost? It happens that in certain cases you can. For certain kinds of filters you can speed up the
convolution operation by first performing a 1D horizontal convolution followed by a 1D
vertical convolution.

So, you first convolve along each row. Remember that we defined convolution in an image
processing context, which is why we took two-dimensional filters such as 3 x 3 and so on.
Convolution is a more general signal processing concept that can also be defined for
one-dimensional signals, three-dimensional signals, or any other dimension for that matter.

It is just that the kernel or filter you use would have to be defined in that dimension. So, if you
are performing a 1D convolution along every row of the image, you would define a
one-dimensional filter; such a filter would be something like [1/3, 1/3, 1/3]. Because there are
3 values and you are normalizing, each entry should be 1/3; this is the one-dimensional filter.

What we saw as the block or box filter was 1/9 at every location of a 3 x 3 mask. Similarly, a
vertical filter would be [1/3, 1/3, 1/3], and we can perform exactly the convolution operation
that we discussed (we still do the double flip), but only along every row of the image or only
along every column of the image. If you do that, you only require 2k operations, because every
1D horizontal kernel requires k operations and every 1D vertical convolution requires k
operations; you only need a total of k + k operations, which is cheaper than k^2. Remember
that for a 5 x 5 filter, k^2 would mean 25 operations while 2k would mean 10 operations, and
this difference obviously becomes more prominent

95
when k becomes larger and larger. One point to add here is that you did see these kernels to be
1/3, 1/3, 1/3 here, or the same value everywhere in the box filter.

Remember that when you have filters such as this, the double flip really does not matter,
because the filter in this particular case is exactly the same everywhere. In such a case
convolution and correlation give you exactly the same output. So, to make this a bit clearer,
we define a kernel K to be separable if you can write it as an outer product of two 1D vectors,
v and h.

If you can do that, then such a kernel is said to be separable because you can separate it into
two 1D kernels, and then you can use this trick to reduce the number of operations from k^2
to 2k by first performing a 1D convolution with v and then performing a 1D convolution with
h. Here is an example: if you have a 2D kernel given by this, recall that this was called the
Gaussian kernel. We saw it a few slides back; you can write it as an outer product of v and h,
where both are equal and given by 1/4 (1, 2, 1). So what you do is simply take this kernel, do a
convolution in one dimension, take the transpose of it, and do a one-dimensional convolution
in the other dimension, across rows and columns. Should v and h always be the same? Not
necessarily.

Here is another example where you have a filter such as this; in case you do not recall, this
was your edge filter. In this case, as you can see, it can be written as an outer product of v
equal to 1/4 times this vector and h equal to 1/2 times this vector. You can test it out to see if
this actually is the outer product, but it happens to be this way. So now you can replace a 2D
convolution by two 1D convolutions, which can help reduce the cost.
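
Here is a minimal sketch in Python (NumPy and SciPy assumed; the random image is just a
stand-in) showing that one 2D convolution with a separable kernel can be replaced by a
row-wise and a column-wise 1D convolution.

    import numpy as np
    from scipy.ndimage import convolve, convolve1d

    img = np.random.default_rng(0).random((256, 256))   # stand-in for a real image

    v = np.array([1., 2., 1.]) / 4.0          # vertical 1D kernel
    h = np.array([1., 2., 1.]) / 4.0          # horizontal 1D kernel
    K = np.outer(v, h)                        # the full 3 x 3 Gaussian kernel

    out_2d = convolve(img, K, mode='reflect')                 # ~k^2 operations per pixel
    out_sep = convolve1d(convolve1d(img, v, axis=0, mode='reflect'),
                         h, axis=1, mode='reflect')           # ~2k operations per pixel
    print(np.allclose(out_2d, out_sep))       # True up to floating-point error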

96
(Refer Slide Time: 37:09)

But this raises one question: how can you look at a kernel and tell if it is separable? I said that
if a kernel can be written as an outer product of two vectors, then you can say that the kernel is
separable, but how can you check this? One option is to look at it visually, try out various
combinations, and find which pair gives you the outer product; you can probably work that out
with the examples on the previous slide.

But there is a slightly more principled way to do this: take the singular value decomposition
(SVD) of your kernel. Remember that the singular value decomposition of any matrix K is
given by K = U Σ V^T = Σ_i σ_i u_i v_i^T, where U is a matrix whose column vectors are u_i,
V is a matrix whose column vectors are v_i, and the σ_i are the elements of the diagonal
matrix Σ, given by σ_1 up to σ_k assuming K is a k x k matrix. This is your standard singular
value decomposition. If you write out the singular value decomposition of K, it happens that
√σ_1 u_1 and √σ_1 v_1 will be the vertical and horizontal kernels (this works when only the
first singular value is non-zero, that is, when the kernel has rank one).

Try working this out, or check whether this actually works with the two examples that we saw
a couple of slides back, the Gaussian kernel and the edge kernel.
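
A small sketch of this SVD test in Python (NumPy assumed), applied to the Gaussian kernel
mentioned above:

    import numpy as np

    K = np.array([[1, 2, 1],
                  [2, 4, 2],
                  [1, 2, 1]]) / 16.0          # the separable Gaussian kernel

    U, S, Vt = np.linalg.svd(K)
    print(S)                                  # only the first singular value is (near) non-zero

    v = np.sqrt(S[0]) * U[:, 0]               # vertical 1D kernel (up to a sign)
    h = np.sqrt(S[0]) * Vt[0, :]              # horizontal 1D kernel (up to a sign)
    print(np.allclose(np.outer(v, h), K))     # True: the outer product recovers K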

97
(Refer Slide Time: 39:13)

So, let us talk about a few practical issues when you actually start using convolution, or even
correlation, as an operation. What do you think should be the ideal size for a filter? We talked
about a filter being 3 x 3, we said it can be 5 x 5, 7 x 7 and so on, but what is the ideal size?
Obviously, it depends on the given application, but the bigger the mask or filter, the more
neighbors are going to contribute.

There is going to be a smaller noise variance in the output. For example, suppose some noise
got introduced in certain parts of your input image. If your convolution uses a very small local
neighborhood, that noise will have a strong impact on the output, but if you take a larger mask,
the noise will get subdued among the other pixels that you consider for that linear filter or
local operation.

On the other hand, it can also result in a bigger noise spread, because if you have noise in your
input and you take a bigger filter, the impact of that noise now extends over a larger region.
Remember we talked about the receptive field: a larger number of pixels in your output image
will be impacted by that noise. That is the other side of the same point we saw in the previous
bullet.

98
The larger the size of the filter, the more blurring an averaging filter produces. It is also more
expensive to compute; a smaller filter means fewer operations, so a 3 x 3 filter has 9 operations
in the non-separable case, but a 5 x 5 filter has 25 operations.

What about the boundaries? This is perhaps a question that would have been on your mind all
through. When you do convolution, we said that if you have an input image and a kernel, you
convolve and get an output image; but for a particular location of the output image you have to
place this 3 x 3 kernel at that particular location in the input image. And if you now place this
3 x 3 kernel on one of those edge pixels, there is no value outside the image with which to
perform that linear combination, which means you will only be able to compute output pixels
one pixel into the input image if you have a 3 x 3 filter. Obviously, if it were a 5 x 5 filter, you
would have this impact for two sets of border pixels all around the image.

What do we do? Do we have to lose information? Will the output image be smaller than the
input image? If you have a 100 x 100 image and a 3 x 3 filter, placing the filter at an edge pixel
means part of the filter falls outside the image, where you do not have values; will your output
become smaller? The answer turns out to be yes: the valid output shrinks by one pixel on each
side and becomes 98 x 98 in this particular case.

What do we do if we want the output image to be the same size as the input image? We can do
what is known as padding. Without padding, yes, we lose information at the boundaries and
the output image will be smaller than the input image, depending on the size of the filter; but
there are many strategies for padding your input image with certain values so that we do not
lose that information.

99
(Refer Slide Time: 43:18)

Let me show a few examples. You can do what is known as zero padding, where you simply
pad beyond your image with black pixels. Now you can place your filter at that corner location;
you simply have black values for the filter coefficients that go outside the image. You can wrap
your image, so if the filter extends 2 pixels beyond the border, you simply wrap in the two
pixels from the other end.

You can clamp the values, which means you take the last value at the image border and
continue with the same value outside the image. You can mirror the values at the edges, and so
on. In each of these cases, if you apply, say, an averaging filter, these are the kinds of outputs
that you would get. As you can see there is not too much variation except at the edges, and the
larger the image becomes, the less the outputs of these padding strategies differ in visual
impact.
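
For reference, here is how these padding strategies can be expressed in Python with NumPy
(assumed available); the tiny 3 x 3 array is only for illustration.

    import numpy as np

    img = np.arange(9, dtype=float).reshape(3, 3)
    pad = 1                                           # one-pixel border for a 3 x 3 filter

    zero_pad   = np.pad(img, pad, mode='constant')    # zero padding (black pixels)
    wrap_pad   = np.pad(img, pad, mode='wrap')        # wrap values around from the other end
    clamp_pad  = np.pad(img, pad, mode='edge')        # clamp: repeat the border value
    mirror_pad = np.pad(img, pad, mode='reflect')     # mirror the values at the edges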

100
(Refer Slide Time: 44:36)

So, let us end this lecture with a couple of questions for you. We went from defining
cross-correlation, saw that it has a problem when you deal with an impulse signal, then defined
convolution, talked about a few elegant properties that convolution has which correlation does
not have, and then gave a few examples. Now, one question that lingers is: do we need
cross-correlation at all?

Can it be completely replaced by convolution in all kinds of applications? Think about it. We
will answer this in the next lecture, but please do spend some time thinking about it. The other
question here is: why should we take a linear combination? Why can we not take a non-linear
combination of the local neighborhood of an input image? Obviously, in our case we defined a
linear filter that way, but let us try to see if this can be made non-linear. Think about it, and we
will answer these questions in the next lecture.

(Refer Slide Time: 45:42)

101
So, for this lecture these are your readings with a specific section number listed here. Please
do read them as a follow up.

102
Deep Learning for Computer Vision
Professor. Vineeth Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture – 06
Image in Frequency Domain

Moving on from the last lecture, we will now talk about representing images in a very different
way: looking at an image as a two-dimensional signal composed of multiple frequencies.

(Refer Slide Time: 00:31)

Before we go there, I think we left two questions from the last lecture, which were: do we need
cross-correlation at all, or is convolution sufficient for all applications of image processing;
and, second, do filters always have to be linear? Hope you had a chance to dig up a bit about
those questions.

103
(Refer Slide Time: 00:55)

Regarding correlation: while convolution has mathematical properties that make it useful for
various reasons, correlation can still be useful for tasks such as template matching. Template
matching is a task where you are given a template and you have to find that template in a
larger image. For example, you could be looking for this template in an image such as this,
and your goal is to highlight the regions in the image where that particular template is present.
In tasks such as this, correlation becomes a more direct choice.

(Refer Slide Time: 01:39)

Another example is finding an object in a more real-world image. Here again, a point to note
is that your template and the object in the scene may not exactly match. It may not exactly

104
match, but as long as they are similar in scale, orientation, and general appearance, you will
still be able to use template matching to catch the object. As you can see, if the car in the scene
had a different pose angle or a different size, then this may not have worked as well. There are
ways to improve upon it, which we will talk about later, but as is, correlation may not work in
those cases. Still, template matching is a task where correlation can be useful.
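
As a hedged sketch of how template matching via correlation is typically done in practice, here
is an example using OpenCV's normalized cross-correlation (assuming OpenCV is installed;
the file names are placeholders, not files from this course).

    import cv2

    image = cv2.imread('scene.png', cv2.IMREAD_GRAYSCALE)       # hypothetical scene image
    template = cv2.imread('template.png', cv2.IMREAD_GRAYSCALE) # hypothetical template

    # Slide the template over the image and score each position by correlation.
    scores = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(scores)               # best-matching location
    h, w = template.shape
    cv2.rectangle(image, max_loc, (max_loc[0] + w, max_loc[1] + h), 255, 2)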

(Refer Slide Time: 02:31)

Regarding linear and nonlinear filters, let us try to give an example of a non-linear filter using
a practical use case. We all know that there are different types of noise possible in images; so
this was your original image. You could have salt and pepper noise, or what is known as
impulse noise. Salt and pepper noise has both black and white specks (pepper and salt);
impulse noise has only white specks. You could also have Gaussian noise, where
Gaussian-distributed noise is added to the signal at each location, so you get some noise due to
that noise distribution.

So, let us consider one particular noise case: salt and pepper noise, which I am sure all of you
will agree is present in a lot of images that we see. Salt and pepper noise looks like this. If you
now use a Gaussian filter, which we saw in the earlier lecture, and convolve this image with a
3×3 Gaussian filter you get this output, with a 5×5 this output, with a 7×7 this output. Do you
see the problem? Irrespective of the size of the Gaussian filter, we are simply smudging the
salt and pepper noise around and not really removing it. What do we do?

(Refer Slide Time: 04:04)

105
The answer is obviously a nonlinear filter, and in this case the nonlinear filter is going to be a
median filter. We did see an average or mean filter in the last lecture. The median filter is
simply a variant of the mean filter where you take that 3×3 window, sort the values, and
simply take the median of the sorted list. If you simply use this as the filter, you see that things
become much better in the 3×3, 5×5 and 7×7 cases, and the salt and pepper noise is almost
removed. Why does this happen?

(Refer Slide Time: 04:49)

To take a more numerical example, let us assume this is your image, so to speak. You take one
particular window, take those values, sort them, take the median, which is 10 in this case, put
that median back here, and keep repeating this

106
process all through your image. So in this particular example, the 99 here is salt noise; it has a
very high (white) value compared to the remaining set of values in this image.

And when you come to that particular portion of the image, you list out your elements, sort
them, and take the median; taking the median becomes an automatic way of discarding the
outliers in that sequence of values. If you had used the mean filter, the 99 would have been
used to compute the mean, and hence the value at that location would still have been high.

But because you took the median, that value is no longer relevant to you, and you are going to
get only 10 in that particular location; this helps you eliminate the salt and pepper noise.
Clearly, you cannot obtain the median through a linear combination of your input values, so it
is a nonlinear filter.
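
A minimal sketch in Python (NumPy and SciPy assumed; the image and noise fractions are
made up) comparing the linear Gaussian filter with the non-linear median filter on
salt-and-pepper noise:

    import numpy as np
    from scipy.ndimage import gaussian_filter, median_filter

    rng = np.random.default_rng(0)
    img = rng.random((128, 128))                  # stand-in for a clean image
    noisy = img.copy()
    mask = rng.random(img.shape)
    noisy[mask < 0.05] = 0.0                      # pepper: black specks
    noisy[mask > 0.95] = 1.0                      # salt: white specks

    smoothed = gaussian_filter(noisy, sigma=1.0)  # linear filter: smudges the specks around
    cleaned  = median_filter(noisy, size=3)       # non-linear filter: discards the outliers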

(Refer Slide Time: 06:14)

Another example, which you may have seen in several places, is what is known as bilateral
filtering. An example output from bilateral filtering looks like this; you may have seen such
images. As you can see, it almost looks like someone has taken particular regions of the face
in a photo and smoothed out the pixels in those regions.

For example, if you take the cheek region, the pixel values in that region are smooth, the pixel
values in some other region are smooth by themselves, and they do not bleed into each other.
Now how do you achieve such an effect using a filter? You may have seen this in

107
image editing software such as Adobe Photoshop, too. Bilateral filtering is a nonlinear,
edge-preserving smoothing; let us see how this is achieved.

Our main objective here is: if a pixel differs too much from the central pixel value in its
intensity, we do not want to consider that pixel. How do we implement it? It is again going to
be a filter. If you remember, this is what you have: I is your image, your filter is a bunch of
weights that you associate with each of the filter locations (i, j) or (k, l) in this particular case,
and the denominator is simply a normalization factor.

We already know that normalizing a filter is important, especially for an averaging filter, but
you do see that there are 4 coordinates here. What do you think those 4 coordinates mean? Let
us see that using a specification of w(i, j, k, l). You only have to define w(i, j, k, l); it is like a
Gaussian filter. We say that if the coordinate location is far from (i, j), then weight it less; that
is one part of the term. The first term is very similar to a Gaussian filter: if the neighborhood
location is far from the central pixel, weight it a bit less; that part is the same as a Gaussian
filter. But you have an additional term here which says: I do not want to decide this based only
on the location of the pixel; I also want to decide the filter value by what value is in the image
at that location.

Remember, so far in all our filters, the filter coefficients depended only on the position within
the mask; they did not change for different locations in the image. But here we are talking
about changing the filter coefficients for different locations. How do we change them? We say
that if an image pixel at (k, l), which is one of the locations in the neighborhood around (i, j),
has an intensity very far from the intensity value at the central pixel (i, j), you want to weight
it less.

So even if a pixel is an immediate neighbor of the central pixel you are evaluating at, if it has
an intensity that is very different from the central pixel, you are going to weight it less. That is
what this function does, and we call this bilateral because you are doing it along the
coordinates as well as along the intensities.
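
To make the definition concrete, here is a sketch (NumPy assumed) of the bilateral weight
computed for a single interior pixel (i, j); sigma_s and sigma_r are assumed spatial and
intensity scales, and boundary handling is omitted for brevity.

    import numpy as np

    def bilateral_pixel(I, i, j, radius=2, sigma_s=2.0, sigma_r=0.1):
        num, den = 0.0, 0.0
        for k in range(i - radius, i + radius + 1):
            for l in range(j - radius, j + radius + 1):
                spatial = np.exp(-((i - k) ** 2 + (j - l) ** 2) / (2 * sigma_s ** 2))
                similar = np.exp(-((I[i, j] - I[k, l]) ** 2) / (2 * sigma_r ** 2))
                w = spatial * similar        # weight depends on position AND intensity
                num += w * I[k, l]
                den += w
        return num / den                      # normalized weighted average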

(Refer Slide Time: 09:43)

108
Why does this work? On the left of this slide you can see a representation of the image: a
bunch of intensity values in the vicinity of a particular value, then a large separation, followed
by a neighborhood which has a different set of intensity values with some variation; and this is
how the filter will look. Remember again that this filter will vary for different parts of the
image based on the intensities in that local neighborhood.

Here you see that the filter in this part of the neighborhood varies like a Gaussian; as we said,
if the intensities are similar, the only thing that affects the weight is the coordinate location.
Then the value of the filter drops off significantly, because there is a huge change in intensity
between a value here and a value down there. So what is the result? If you convolve this filter
with this input, you get something like this, where all the intensity values in the top range get
smoothed out here, and all the intensity values in the bottom range get smoothed out here.
That is how you get the boy's image with a smooth cheek area and smooth regions in other
parts of the face and neck.

109
(Refer Slide Time: 11:10)

Moving on to the main topic of this lecture, let us start by asking ourselves: what do we really
lose in a low-resolution image? If you actually look at these two images, perhaps our eye has
the same response to both; we probably see a very similar scene in both cases. So what did we
really lose when we reduced the resolution? You all know that we probably lost some
sharpness or some edges in the image, but what this really means is what we are going to talk
more about today.

(Refer Slide Time: 11:47)

So, when we talk about images in the frequency domain we mean the Fourier domain, and
when we say Fourier we go back to the scientist Fourier, who lived in the eighteenth and

110
early nineteenth centuries, and who proposed way back in 1807 that any univariate function
can be written as a weighted sum of sines and cosines of different frequencies. Today you
could say, of course, what is new; you have probably heard about it somewhere.

But in his time, many scientists, including Lagrange, Laplace, Poisson and others, did not
believe the idea, and this is the statement they gave: they felt that it really was not complete.
The work was not even translated into English until 1878, about 70 years after he proposed the
idea and long after he passed away. But today we know that this is mostly true. This is called
the Fourier series; it has some subtle restrictions, but is more or less true for the signals that we
know, and we are going to talk about viewing images as 2D signals now.

(Refer Slide Time: 13:01)

But before we go there, let us talk about the Fourier series itself. For simplicity we are going to
talk about the Fourier series as representing any time-varying 1D signal as a sum of sinusoids;
we will move to images after some time. The building blocks are sine waves A sin(ωx + φ),
where ω is the frequency, φ is the phase, and A is the amplitude of the signal. The idea is that
with these building blocks, using signals of different frequencies and phases, you can
approximate any function. Let us see a more tangible example of this.

111
(Refer Slide Time: 13:41)

Let us take this particular example. The function g(t) that you see, a time-varying
one-dimensional signal, can be rewritten as sin(2πft) + (1/3) sin(2π(3f)t). So there are two
frequencies, f and 3f: f is the frequency that gives you this signal, and 3f is the frequency that
gives you the other signal. Now let us say this is the input signal given to us; we can write it as
a sum of two frequencies because we know the actual definition.

In the Fourier domain this kind of signal can be represented as follows: along the x axis you
have different frequencies, and we now record that at the frequency f you have a certain
amplitude and at the frequency 3f you have a certain amplitude; here, 1/3 at 3f and 1 at f. That
is how you represent this signal in the frequency domain.

112
(Refer Slide Time: 14:55)

Now consider a slightly more complicated waveform, such as this square waveform, which is
something like a periodic step function. Let us try to see how this can be written using a
Fourier series. We already know that by using frequencies f and 3f we can get a signal such as
this. To this output signal we can add one more signal of an even higher frequency, and then
you would get this signal.

To that signal you add a signal of an even higher frequency and you get this signal, and you
keep repeating this process over and over again until you get closer and closer to your original
square wave function. As you can see, as you keep adding higher and higher frequencies, but
with smaller and smaller amplitudes, you get closer and closer to a square wave. In the Fourier
domain the final square wave can be written in this form, where frequency f has the highest
amplitude, the next frequency 3f has a slightly smaller amplitude, 5f has an even smaller
amplitude, and so on; as you keep adding more and more frequencies, your waveform gets
closer and closer to the original square wave.
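
A small numeric illustration of this idea (NumPy assumed; the base frequency is arbitrary):
adding odd harmonics f, 3f, 5f, 7f with amplitudes 1, 1/3, 1/5, 1/7 gives successively better
approximations to a (scaled) square wave.

    import numpy as np

    t = np.linspace(0.0, 1.0, 1000)
    f = 2.0                                        # hypothetical base frequency
    approx = np.zeros_like(t)
    for n in (1, 3, 5, 7):
        approx += (1.0 / n) * np.sin(2 * np.pi * n * f * t)
    # Each added term makes 'approx' look more like a square wave.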

113
(Refer Slide Time: 16:27)

Now, that was a 1D example. If we come to images, how would this look? This is slightly
more non-trivial; we will not get into every detail here, but I will try to explain the basics.
Suppose your original intensity image is something like this. What you see here is that there is
no variation along the y axis, that is, no variation within any particular column; all the
variations are along the rows, which means there are only horizontal frequencies here, no
vertical frequencies.

Remember we are talking about 2D signals, so there are going to be frequencies in both
directions. But in this particular example there is no change along the columns, so there is only
a horizontal frequency. Typically, in the Fourier image, if you observe carefully, there are 3
dots. The central dot is called the DC component, or simply the average intensity of your
original spatial domain image.

The dot on the right here is the actual frequency. Let me try to draw it on this one so it is easier
for you to see. Imagine that to be your axis and the central point to be your origin; that origin,
as I said, is your DC component, and on the x axis this particular point refers to one kind of
frequency, and you can see a white dot there. The white dot represents the amplitude of that
frequency in your spatial domain image.

So the Fourier domain image is simply telling us that there is this much of that frequency in
this image. Now you could ask me: what is the third dot here, on the left-hand side of the DC
component? Just hold on to that question and we will talk
114
about it on the next slide. Similarly, in the second column you see something very similar to
the first column, but the frequency is a bit higher.

And because the frequency is a bit higher (let me erase these lines), you can now see that on
the x axis we have a slightly higher frequency present in this input image. Remember that if
you consider that to be an axis, the previous case had a lower value of frequency and this one
has a higher value. If you take the third column, you can clearly see that there are frequencies
in both the horizontal and vertical directions, which boils down to points somewhere here.

So if you again thought of this as an axis in the Fourier domain, you now have some frequency
along the x axis and some frequency along the y axis, and that gives you that dot. We will
come to the third point in a couple of slides from now.

(Refer Slide Time: 19:27)

Another way of summarizing what we just said is that in the Fourier domain an image can be
represented in terms of the presence of different frequencies, starting from the zero frequency,
which is nothing but the average intensity of the image. Zero frequency means a constant
value: the average intensity of the image is your constant value, on top of which you add your
frequency changes.

And f is a particular frequency at which you record what amplitude of that frequency is present
in the given image; the x axis goes all the way until what is known as the Nyquist frequency,
and we will come to that in the next lecture. For the moment you can just assume

115
that there is something called the Nyquist frequency. So, a signal is plotted as a single peak at
point f if there is only one frequency in that image, and the height of that peak corresponds to
the amplitude or the contrast of your sinusoidal signal. Now, if you noticed, on these images
we did say that there is a third dot, and I said I would come back to it.

(Refer Slide Time: 20:40)

It happens that when you plot the Fourier domain version of a spatial domain image, or of a
time-varying signal for that matter, you typically mirror the entire representation about the
origin. Why is it mirrored? We may not have the scope to get into it in this course, but if you
are interested, please look at the referenced material to understand why the Fourier domain
representation is mirrored.

So, the third dot that we saw on the slide a couple of minutes back is just the mirrored version
of the frequency that we saw along the x axis.

116
(Refer Slide Time: 21:21)

Here are more examples. This is one kind of harmonic frequency, and this is something like 3f,
and this is 5f. Note that along the x axis this dot is moving further and further out because the
frequency is higher; this is 7f, and so on. Now you can combine these frequencies: this is again
1f, this is (1+3)f, which means you now get dots at both those locations; this is (1+3+5)f, with
dots at all of those locations; and this is (1+3+5+7)f, with dots at all of these locations.

(Refer Slide Time: 21:59)

And we can keep doing this further and further; if you had an image that is a single white dot,
that is going to get transformed into a horizontal line in the Fourier transform. This

117
could involve a slightly deeper understanding of Fourier transforms, but for those of you who
are comfortable, remember that a discrete signal in the original domain becomes periodic and
continuous in the Fourier domain. Again, we will not get into those details, keeping the focus
of this course in mind, but I will share references through which you can learn more about
this.

(Refer Slide Time: 22:43)

Here are some examples of how real world images look in the Fourier domain. So you can
see here that this is a real world image. These are two animals. This is the log of the
magnitude spectrum and what you see on the third column is the phase of the image. So, as
we already mentioned, remember we said that in the Fourier domain you write out any signal
as a sum of sinusoids with different frequencies and phases.

The phase is where the sinusoid starts. For example, this is a certain sinusoid with a certain
frequency, and this is a sinusoid with the same frequency but a slightly different phase. So the
third column represents the phase component of the signals, and we will talk about what it
means over the next few slides.

118
(Refer Slide Time: 23:39)

So, the Fourier transform consists of both magnitude and phase. In practice this is often
written in terms of real and imaginary (complex) numbers. If φ denotes the Fourier
representation, the amplitude or magnitude of the Fourier signal is simply

|φ| = √(Re(φ)² + Im(φ)²),

and the phase is given by

θ = tan⁻¹(Im(φ) / Re(φ)),

that is, the inverse tangent of the imaginary part over the real part. Once again we will not get
into the depth of this, but I will share references where you can understand this in more detail.

(Refer Slide Time: 24:17)

119
To complete this discussion: if you take a continuous signal such as this and take its Fourier
transform, its Fourier magnitude spectrum would look something like this, which simply says
that this is the DC component and there are certain frequencies with different amplitudes, and
so on. If you now sample that signal and get a discrete version, remember that when we talk
about images this is what we have.

We do not know the real, continuous signal across the image; what we have is the signal
sampled at different pixels. Remember, this goes back to the representation of an image as a
function. If you take a Fourier transform of a sampled signal, your output would be a periodic,
continuous Fourier spectrum. These are concepts for which you may need to do a little more
reading on Fourier analysis.

And I will again point you to the references, but you can keep in mind that when your input is
a discrete aperiodic signal, your Fourier domain representation is going to be a periodic
continuous signal, and it will look very similar to copying and shifting the magnitude spectrum
of the continuous version of the signal.

Mathematically speaking, the continuous Fourier transform is given by

H(ω) = ∫_{−∞}^{∞} h(x) e^{−jωx} dx,

where h(x) is your original signal. The discrete version sums over a total of N samples, going
from 0 up to N − 1:

H(k) = Σ_{x=0}^{N−1} h(x) e^{−j 2π k x / N},

and that is the discrete version of the Fourier transform.

(Refer Slide Time: 26:15)

120
If you want to learn more about the Fourier transform, a highly recommended resource is the
first one, “An Interactive Guide to the Fourier Transform”; it gives a very intuitive explanation.
There are also other references here; if you would like to know more, you can click on those
links.

(Refer Slide Time: 26:32)

Let us come back to why we are talking about the Fourier transform in this lecture, and why
we had to take a detour into something for which you need to read a lot more. There is a
particular reason for that, and that reason is what is known as the convolution theorem. The
convolution theorem states that the Fourier transform of the convolution of two functions is
the product of their Fourier transforms.

In other words, if you take two signals g and h and compute the Fourier transform of their
convolution g * h, it happens to equal the product of the Fourier transforms of the individual
signals: F{g * h} = F{g} · F{h}. So what does that mean? If we want to do convolution in the
spatial domain, we can obtain it through multiplication in the frequency domain.

And this can lead to some computational savings, not in all cases, but in certain cases,
especially for large filters and large images, which can be useful; we will talk about that in
more detail. That is the reason we needed to talk about the frequency domain here, beyond
just getting a fundamental understanding.
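
Here is a minimal sketch of the convolution theorem in Python (NumPy assumed; the random
image and the zero-padded 3 x 3 kernel are placeholders). Note that multiplying FFTs gives
circular convolution; in practice the arrays are padded to avoid wrap-around effects.

    import numpy as np

    img = np.random.default_rng(0).random((256, 256))
    kern = np.zeros_like(img)
    kern[:3, :3] = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0

    F_img = np.fft.fft2(img)
    F_kern = np.fft.fft2(kern)                      # kernel zero-padded to the image size
    out = np.real(np.fft.ifft2(F_img * F_kern))     # inverse FFT of the pointwise product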

121
(Refer Slide Time: 27:39)

The Fourier transform has a lot of interesting properties which fit nicely with convolution:
superposition, shift, reversal, convolution, multiplication, differentiation, scaling and so on.
We are not going to walk through all of these, but there are a lot of useful properties of the
Fourier transform.

(Refer Slide Time: 27:57)

If you now ask me, how does it really matter, since it looks like the Fourier transform will also
take the same number of operations as convolution, let us try to analyze this. What is the
number of arithmetic operations needed to compute a Fourier transform of N numbers, say a
function defined with N samples? What do you think the number of operations would be?

122
It would actually be proportional to N², just by the very definition of the Fourier transform
itself.

But it happens that there is an algorithm called the Fast Fourier Transform (FFT) which helps
reduce this computation to N log N. The Fast Fourier Transform is a recursive
divide-and-conquer algorithm; it does this by dividing the set of samples you have into
different portions and then combining them to compute the overall Fourier transform. You can
see this link to understand it more intuitively.

This reduces the number of computations from N² to N log N, and this is what is going to help
us make convolution more feasible by doing the operations in the frequency domain. This is a
trick that is often used to this day when convolution is implemented in libraries that you may
simply call when you learn deep learning.

(Refer Slide Time: 29:18)

Applications of FFT are in convolution and correlation.

123
(Refer Slide Time: 29:22)

Let us see an example of this. If you have an intensity image and you apply such a filter (this is
an example of an edge filter), you get an output such as this, where only the edges are
highlighted. This is your normal spatial convolution. So what does it mean to go to the
frequency domain?

(Refer Slide Time: 29:44)

To go to the frequency domain means this: you take the original image and compute the
Fourier domain version of this particular image. As you can see, the Fourier domain version is
still symmetric about that central (0, 0) point; remember we talked about it being a mirrored

124
version. There are a lot of complex frequencies now; looking at an image like this and trying to
eyeball the frequencies is hard, so there are methods to do that.

So you get the frequency domain representation of the image, take your edge filter and get its
frequency domain representation, multiply these two pointwise, and you get this output; then
you do an inverse Fourier transform and you get back the filtered image in the spatial domain.
Remember, if you go back here, this step is your inverse Fourier transform, which is what we
implement using an inverse FFT. Now coming back to the slide.

Now we can ask the question: in this particular case, what specific cost improvement did the
use of the convolution theorem really give? I am going to leave this as a question for you to
think about. In this particular example, can you try to work out what actual cost improvement
the use of the Fourier transform, or doing this operation in the frequency domain (frequency
and Fourier are synonymous terms here), would give? Think about it, and we will discuss the
answer in the next lecture.

(Refer Slide Time: 31:28)

One more important concept here is that of low pass filters and high pass filters. When we talk
about filters, masks, or kernels, it is common practice to refer to some of them as low pass
filters, which means that they allow low frequencies to go through and block high frequencies.
What would be an example? Your Gaussian filter, because it does smoothing.

125
Remember that when you smooth an image you remove certain edge information, which
corresponds to high frequency components. A high frequency component simply means that
there is a significant change between one pixel and the pixel immediately next to it; that is
what we mean by a high frequency in an image. High pass filters are filters that allow high
frequencies to pass through and block low frequencies.

Can you think of an example here? It would be an edge filter, because an edge is a high
frequency component, and it allows edges to go through but blocks out the low frequency
components across your image. Now, if you go back to the first slide that we used to start this
entire discussion: when you go from a high resolution image to a low resolution image, you
actually lose your high frequency components.

But it happens that the human eye is good enough at working with medium frequency
components that you can still make out the meaning in an image; the high frequencies are what
you lose when you go to the lower resolution image.

(Refer Slide Time: 33:11)

One last question before we wind up this discussion, which is also an interesting topic: we said
that there is a magnitude component and a phase component in every Fourier version of an
image. The magnitude is something we said is easy to interpret; the phase is a little harder to
describe, so we will not get into it now, but this is how it looks. Which do you think has more
information?

126
We will try to understand this by swapping the magnitudes and the phases, doing an inverse
Fourier transform, and seeing what we get. These are the kinds of images that you would get.
If you now take the magnitude, in this case I think it is the magnitude of your leopard or
cheetah, and the phase from the zebra, and similarly if you take the magnitude of the zebra and
the phase from the leopard, you get these images.

What does this tell us? This tells us that the texture information comes from phase whereas
the frequencies actually come from your magnitudes.
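
A sketch of this magnitude/phase swap in Python (NumPy assumed; img_a and img_b are
placeholders for two same-sized grayscale images such as the leopard and the zebra):

    import numpy as np

    def swap_magnitude_phase(img_a, img_b):
        Fa, Fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
        hybrid = np.abs(Fa) * np.exp(1j * np.angle(Fb))   # magnitude of A with phase of B
        return np.real(np.fft.ifft2(hybrid))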

(Refer Slide Time: 34:28)

Here is a popular example of doing this magnitude and phase switch, using a method called
hybrid images, where you do exactly what I talked about. You take an image and first get a low
resolution version of it by applying a Gaussian filter; you can in fact do it at multiple
resolutions, which we will come to a bit later, and then you swap the magnitudes and phases of
different images and you can get interesting-looking images. These are called hybrid images,
and this was an interesting development way back in the mid-2000s.

127
(Refer Slide Time: 35:11)

One more exercise before we stop. Here are a bunch of spatial domain images, and the
corresponding frequency domain images on top. Your task is to match which one goes where;
try this and we will talk about it in the next lecture.

(Refer Slide Time: 35:30)

We will stop here; please do follow up on the readings and the exercises given to you.

128
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Image Sampling

The next topic we are going to look at is another important fundamental topic in processing of
images which is sampling and interpolation. Before we go there, we will quickly review the
questions that we left in the last lecture.

(Refer Slide Time: 0:31)

So one of the questions we asked in the last lecture was: by implementing convolution by going
through the Fourier domain, that is, using the convolution theorem, what cost improvement do
you really get? The convolution theorem allows us to compute convolution in the spatial domain
as a multiplication in the frequency domain followed by an inverse Fourier transform. You take
the individual Fourier transforms of g and h, which can be your image and your filter or kernel,
multiply them, and then take the inverse Fourier transform.

Let us take a simple, basic case to explain the cost improvement. Direct image convolution takes
O(N²k²) time, where N × N is the image size and k × k is the kernel size. By performing the
convolution in the Fourier domain, you still have a cost of O(N²) for

129
doing a single pass over the image (the pointwise multiplication). In addition to that, the FFT
needs O(N² log N²) for the image; recall that when we spoke about the Fast Fourier Transform,
we said that if you have N samples, the cost is going to be N log N.

In our case an image has N² pixels, so the cost of computing the fast Fourier transform is going
to be O(N² log N²). Similarly, it is O(k² log k²) for the kernel, which is of course k × k. So
together your computation is going to be O(N² log N²) + O(k² log k²); any other computations
are additive, and these will be the largest components, which overshadow the others.

And clearly you can see that this is cheaper than the original image convolution cost, especially
when k gets bigger. This means that if your filter or mask size gets bigger, using the fast Fourier
transform to implement convolution is going to give you some cost benefits. A lot of the deep
learning libraries that you use today already implement convolution through Fourier transforms;
you may not know it, you may just be using a deep learning library, for instance, and you just
invoke convolution, but in the backend many of them already use fast Fourier transforms to
implement convolution, which makes it fast.

(Refer Slide Time: 3:09)

130
The other exercise we had was to match spatial domain images with their frequency domain
counterparts. I hope you had a chance to figure out these answers yourselves; in case you had
doubts about the answers, let us do the simple ones first.

1 goes to D, that is straightforward. 3 goes to A, also straightforward, and we also had that clue
in one of the slides in the earlier lecture. And 4 goes to C, mainly because when you look at 4,
you see a very strong vertical set of edges, and image C has a lot of horizontal edges in it; the
horizontal edges in the original image lead to a vertical line in the Fourier transform. Go back
and look at the sample Fourier images that we saw last lecture: we saw an example of a vertical
line in the spatial domain which led to a horizontal line in the Fourier domain, and the inverse
holds the other way.

So if you have a horizontal line in your spatial domain, you get a vertical line in your frequency
domain. Once again, the scope of this course does not allow us to go too deep into the Fourier
domain and related concepts, but do read the links that we shared in the last lecture.

131
(Refer Slide Time:​ ​4:37)

Regarding the other two images, which can be a bit tricky: image 2 goes to E, and the reason is
that if you look at image 2, there is a pretty strong vertical line there, and that comes from the
horizontal lines at the horizon of the water, and even the horizontal lines that you have in the
ripples; there are lots of horizontal lines in the ripples, and that translates to this vertical line in
the Fourier image.

And the fifth one is a straightforward extension: as you can see, there are lots of stalks in the
flowers that give the horizontal aspect of the frequencies, and there are also a lot of horizontal
edges across the different flowers that give you some of these vertical frequencies.

132
(Refer Slide Time:​ ​5:26)

Let us review one of the questions we asked last time again: what sense does a low-resolution
image make to us? If we go back and try to understand this from a human perception
perspective, it is understood today that the filters used in early processing in the human visual
system look for different orientations and scales of frequency. We will review this a bit more
carefully later, but sometimes people also relate them to what are known as Gabor filters.

Secondly, the perceptual cues in the human visual system are largely based on mid to high
frequencies. Remember once again, a high frequency means an edge, because there is a sudden
change in intensity value as you move from one pixel to the next along a row or a column. If the
change is along a row, you would call that a vertical edge; if it is along a column, you would call
that a horizontal edge.

So perceptual cues depend on mid to high frequencies in general. But when you see a
low-resolution image, it is equivalent to sub-sampling the image; or if you are seeing an image
from very far away, you are actually sub-sampling it. This means you may be losing some of the
high frequency information, but you do get a lot of the low and mid frequency information,
which is good enough to get a sense of what the object is but may not give you finer details.

133
(Refer Slide Time:​ ​6:56)

So, how do you sub-sample an image as an operation? The simplest way to sub-sample is, given
an image such as this, to simply throw away every other row and column: consider only every
alternate pixel, both row-wise and column-wise. That straight away gives you a half-size image;
you can then repeat this to go to lower and lower resolutions.
(Refer Slide Time:​ ​7:27)

134
But if you actually zoom into these low-resolution, sub-sampled images, this is how they look.
We are just zooming into these images here: the half-sampled image gives you something like
this, the quarter-sampled image at a 2x zoom gives you something like this, and the 1/8th image
at a 4x zoom looks something like this.

You may ask: yes, that is expected; if you sub-sample, why worry about how it looks up close?
That is a valid question. But the reason for bringing this up here is that the last image looks
pretty crufty, it has too many artifacts; can we at least make it look smooth? We understand that
by sub-sampling you have lost some information, but can it appear smooth when you zoom in,
is what we are looking for. We understand that you may have lost sharpness, but we still want to
make it look smooth. What do you think? How would you achieve this?

135
(Refer Slide Time:​ ​8:43)

Let us wait on the answer for a moment, but let us also see another example of an effect such as
this. Do you see what is happening in this kind of image? This is the original image; we have
sub-sampled it in some way and you get an effect such as this.
(Refer Slide Time:​ ​9:05)

This effect is also known as aliasing. Aliasing generally happens when your sampling rate is not
high enough to capture the amount of detail in the image. We are going to see a couple of
examples and then revisit the painting that we saw, and how to ensure that it is not crufty.

136
So whenever your sampling rate is not high enough, you will not get the details. For example,
you see a sinusoidal waveform on top here. If you sample only at these points, remember that
you are not capturing the downward trend of the sine wave.

So if you only sample at these black points, you may think that this (the red wave) is the actual
sine wave. Notice that the frequency of this red wave that we have just drawn is very different
from that of the original wave. And why is it so different? Only because we did not sample the
original signal (by original signal I mean the one here in purple) often enough. Because we did
not sample it enough, we end up getting this kind of shape, which is very different from the
original sinusoidal wave.
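
A small numeric demonstration of this effect (NumPy assumed; the frequencies are made up): a
9 Hz sinusoid sampled at only 10 samples per second produces exactly the same samples as a
1 Hz sinusoid, which is the aliased wave you would wrongly infer.

    import numpy as np

    f_true = 9.0                                   # true frequency of the signal (Hz)
    fs = 10.0                                      # sampling rate, well below 2 * f_true
    t = np.arange(0.0, 1.0, 1.0 / fs)
    samples = np.sin(2 * np.pi * f_true * t)
    alias = np.sin(-2 * np.pi * (fs - f_true) * t) # a 1 Hz wave hits the same sample values
    print(np.allclose(samples, alias))             # True: the samples cannot tell them apart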

(Refer Slide Time: 10:25)

So what do you need to do to get the sampling right? Ideally you need to know the structure of
your original signal, and this is where an important concept called the Nyquist rate comes into
the picture. We also mentioned this briefly in the previous lecture: the minimum sampling rate
required to reconstruct the original signal, or to capture the original signal in its actual form, is
called the Nyquist rate. So what does the Nyquist rate mean?

137
(Refer Slide Time:​ ​10:58)

The Nyquist rate goes back to fundamentals from information theory and signal processing,
where Shannon, whose sampling theorem is very popular, proposed that the minimum sampling
rate, or Nyquist rate, is given by f_s, which has to be at least 2 * f_max; f_max here is the highest
frequency in the signal. For an image, we are saying that your sampling rate should be at least
twice the highest frequency that you have in your image or signal.

138
Why this is so, we are going to see in a moment, but we can illustrate the impact of not doing this
in an image setting, in a video setting, or even in a graphics setting. Let us see a couple of
examples.
(Refer Slide Time:​ ​12:07)

In an image setting, if you do not sample your original signal at the appropriate frequency, which
is twice f_max, the shirt in this image is going to look very weird, and this, as I said, is what is
known as aliasing.
(Refer Slide Time: 12:21)

139
In a video, this is something that you may have observed a lot: sometimes when you go out on a
road and see a bicycle (it especially happens with bicycle or car wheels), although the vehicle is
moving forward, when you look at the wheels alone it may appear that the wheels are spinning
backward. This could be a bicycle or a car; try observing this if you have not already done so.

If you have not asked yourself why this happens, here is the reason. Take the motion of a wheel
to be the top row here, so the top row is the actual motion. You can see that the arrow is moving
in a clockwise direction, so you can assume that this is something like a cycle wheel moving
clockwise. But we have chosen to sample it only at three locations across all of these. So the top
row is the actual movement of the wheel, but these boxes here are the positions that are actually
sampled.

If you look only at the sampled positions, you see that the arrow is first here, then here, then
here, and so on. In fact, the second and third samples could give you the reason why this effect
happens: if you took only the second and the third one, you would actually think that the wheel
is moving counter-clockwise, because those are the only two samples you have and the arrow
appears to move in the counter-clockwise direction between them, whereas originally the wheel
was moving clockwise. This simply means that you are not sampling the wheel at twice the
maximum frequency at which it is rotating.

(Refer Slide Time:​ ​14:16)

140
Here is another example from graphics, and the reason is again the same. You can see in this
particular case that the aliasing is prominent in the regions where the frequency is very high.
Remember once again, when we say frequency in an image, we mean how quickly the pixels
change between black and white (it can be dark grey or light grey, but changing from black to
white), and that change becomes very rapid closer to the horizon of this image. That is where the
frequency is very high, which means your sampling rate has to be twice that frequency to be true
to the original image, and this is an effect that we often see in graphics.
(Refer Slide Time:​ ​14:57)

141
For a more tangible way to understand why you need to sample at twice the maximum
frequency: we are not going to derive the Shannon sampling theorem here, because that is not
the focus of this particular course, but let us try to intuitively understand why you need to sample
at twice the maximum frequency. Let us take this example of a chess board. You have a chess
board here, and we argue now that the top row shows examples of good sampling practices and
the bottom row shows bad sampling practices.
Let us look at this one here, the bottom left. You can see that the samples are these circles. When
you look at those samples, you can clearly see that all of them land only on black squares. So if
you had only those four samples, you would assume this entire board is black, that there are no
white squares in between, because you never sampled in the white squares.

Clearly you are not sampling at the appropriate frequency; the frequency here is how quickly
white and black change, that is, over how many pixels they change. You are not sampling at
twice that maximum frequency, and it is going to give you poor information. Here is slightly
better sampling, but still not enough: once again you can see these hollow circles, and you can
see that you improved your sampling a little, but you see a black, then a white, and the third
sample on that first row is a white again.

This means your perception, or any further processing, is going to imagine that this is a black square followed by a series of white: that is not a checker board or a chess board, it is just one small square of black followed by an entire block of white, which again is poor sampling. So here is a sampling where we actually go at twice the frequency. You may ask why twice: remember that one period of the signal is one complete cycle through its variations, in this case from here till here, that is, one black square and one white square, and we are sampling twice inside those two boxes. Which means we are sampling at twice the maximum frequency. When you sample like that, you get a sample in every black square and every white square, and you now know this is a checker board. Obviously, if you sample more than that, it is even better. Hopefully that gives you an intuitive understanding of why you need to sample any signal, any image, at least at twice the maximum frequency of the content in the image.
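
As a tiny one-dimensional sketch of the same chess-board argument (the pattern and sampling steps below are assumed for illustration): a row that alternates black and white every pixel has a period of two pixels, so you need at least one sample per pixel to see the alternation at all.

    import numpy as np

    row = np.array([0, 255] * 8)   # one row of a checker pattern: black, white, black, ...
    print(row[::2])                # one sample per period: all zeros, looks like a solid black row
    print(row[::1])                # two samples per period: the black/white alternation survives
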
(Refer Slide Time: 17:53)

So a good anti-aliasing technique is Gaussian pre-filtering; we will talk about what Gaussian pre-filtering means in a moment. This is the image that we saw earlier, which we said looked crufty, and this smoother version is what we want to try to achieve. We do understand that if you sub-sample, you are going to lose some information, but we at least want to make the result look smooth.

(Refer Slide Time: 18:21)

So how do you do Gaussian pre-filtering? You first take an image and do a Gaussian blur on it; you know now what a Gaussian blur is, it is Gaussian smoothing. You can take a 3x3 Gaussian kernel; in case you do not recall what a Gaussian kernel is, remember it looks something like (1/16) [[1, 2, 1], [2, 4, 2], [1, 2, 1]].
Similarly, you can also build 5x5 kernels and so on. So you first apply that kernel on your input image, which gives you a blurred version of the image. Then you sub-sample; before you sub-sample again, you blur, then sub-sample, blur, then sub-sample, and so on. Why do you think this works? There is a reason for this. Let me ask you this question: this, as we said, is a Gaussian filter.
Do you recall whether a Gaussian filter is a low pass filter or a high pass filter? A Gaussian filter is a low pass filter, which means it removes high frequencies and only passes through the relatively lower frequencies; remember, any smoothing filter removes a certain amount of high frequencies. So by applying the Gaussian blur, you have removed the high frequencies, which means your sampling rate can now actually be reduced. Remember again that the Shannon sampling theorem states that you must sample at least at twice the maximum frequency in the image.
If I reduce the maximum frequency in the image, I can sample at a lower rate. And that is the trick that we employ to be able to sub-sample but still have a smooth effect in the output image.

So you do a Gaussian blur, and now your overall frequencies have come down to a smaller range; you can afford to sub-sample without having an aliasing effect or cruftiness in the image, and that is what we mean by Gaussian pre-filtering.
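
Here is a minimal sketch of that pipeline, assuming the image is a 2D grayscale numpy array and using the 3x3 kernel from above (the function name is just illustrative):

    import numpy as np
    from scipy.signal import convolve2d

    kernel = np.array([[1, 2, 1],
                       [2, 4, 2],
                       [1, 2, 1]], dtype=float) / 16.0

    def blur_and_subsample(image):
        # Gaussian blur first: this removes the highest frequencies ...
        blurred = convolve2d(image, kernel, mode='same', boundary='symm')
        # ... so that keeping only every second row and column is now safe
        return blurred[::2, ::2]

    # for a full pyramid you repeat: blur, sub-sample, blur, sub-sample, and so on
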
(Refer Slide Time: 20:40)

Sub-sampling is one side of the story; the other side of the story is how you upsample. Sometimes you need to go from a low-resolution image to a high-resolution image, and the simplest way to do this would be a very standard interpolation operation. On the real axis, the interpolated values between any two points are simply values that lie on the line connecting them, at least for linear interpolation.
The simplest form of interpolation we can do to upsample an image is what is known as nearest neighbour interpolation, where we simply repeat each row and column, say, 10 times. Every pixel is repeated 10 times along the column and 10 times along the row, so one entire 10 by 10 block has exactly the same value. Then you take the next pixel, repeat it for 10 columns and 10 rows, which means the next 10 by 10 block has the same value, and so on.
Good news: your resolution goes up. Bad news: you are going to have blocky artifacts across the image, because that is exactly what you did, you took individual pixels and simply expanded them into blocks. How can we do better than this? Clearly we do not want to upsample images like this and have a blocky effect in the upsampled image.
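
A minimal sketch of this nearest-neighbour upsampling (the factor of 10 mirrors the description above; 'image' is assumed to be a 2D numpy array):

    import numpy as np

    def nearest_neighbour_upsample(image, r=10):
        up = np.repeat(image, r, axis=0)   # repeat every row r times
        up = np.repeat(up, r, axis=1)      # repeat every column r times
        return up                          # each input pixel becomes a constant r x r block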

(Refer Slide Time: 22:13)

To do interpolation or upsampling while retaining some smoothness in the output, let us first recall how a digital image is formed: an image is a discretely sampled version of an original continuous function, the original continuous signal. So you have an original signal f, you sample it at certain locations, which gives you your image or your signal F[x, y]. In our case it is going to be an image.
(Refer Slide Time: 22:51)

As we already said, this is nothing but a discrete point sampling of a continuous function. So when we do interpolation, we are simply asking the question: can we somehow reconstruct that original continuous function from the discrete samples? If we could do that, we could interpolate and go to any resolution that we want.
(Refer Slide Time: 23:16)

Unfortunately, in practice we do not know the small f, which is the continuous function. We do not know that continuous function; we do not know what the real-world scene was from which your image was captured in a sampled manner. However, we can try to approximate it in a principled way, and the way we are going to do that is through a convolution approach.

So the first step is to convert F, which is your discrete sampled version, to a continuous function. Because we do not know the original continuous function, we are going to define a pseudo-continuous function which simply states that wherever you know the value, you put the image's value there, and everywhere else you say it is 0. This is at least a continuously defined function, let me put it that way, at least for the moment.

(Refer Slide Time: 24:18)

So now we try to reconstruct f hat, the signal at the resolution to which you want to take f, by convolving this pseudo-continuous image with some filter. In other words, given this continuously defined function F, can we get some kernel h such that we can reconstruct, not the original, but at least an upsampled signal? What do you think h should be?

This image should give you some intuition: maybe a kernel in the shape of a tent could help, at least for a 1D signal. But that still does not answer how convolution can take us from one resolution to a higher resolution, because the way we saw it, convolution maintains the same resolution if you include padding, and if you do not pad, you will probably lose a few pixels at the boundaries. So how do we go to a higher resolution using convolution?

(Refer Slide Time: 25:36)

Now, to be able to perform interpolation and then upsample an image to a higher resolution
using convolution, we are going to do it using an interpolation kernel which is defined as what
you see here. Remember again that your original convolution is defined as g(i, j) = ∑_{k,l} f(k, l) h(i − k, j − l), and the new definition that we are going to use for interpolation is g(i, j) = ∑_{k,l} f(k, l) h(i − rk, j − rl).

If you observe carefully, the main difference between these two expressions is the quantity r that multiplies k and l; r is what we call the upsampling rate. So if you want to double the size of your image, r is 2; if you want to quadruple the size of your image, remember we are doing upsampling, r is going to be 4, and so on. Let us try to understand how this actually works.

Just for the sake of simplicity, let us look at the one-dimensional variant of this: g(i) = ∑_k f(k) h(i − rk). Remember once again that an image is a 2D signal for these purposes, so for a lot of these expressions that we write for images, if you remove one dimension, you can apply them as convolutions for one-dimensional signals.

So this is what we actually have in one dimension; it is just easier to explain in one dimension. Let us try to understand what is actually happening. Consider an x axis, and assume that there are some sample values f(k-1), f(k), f(k+1), and so on. This is very similar to what you saw in the earlier slide.

I just drew lines such as these to represent the various values of f that I have with me; that is my input signal whose resolution I want to increase. Now, because I want to increase the resolution, the question I am asking is: if I take some index i in between those values, what would g(i) be? Here g is the output and f is the input. The tricky part is that the indices in g are different from the indices in f, because f has only a certain number of indices whereas g has (for an upsampling rate of 2) double that number of indices.

So in the indices of g, these sample positions become r(k-1), rk and r(k+1). Now, if we try to understand what g(i) is doing here, it is ∑_k f(k) h(i − rk): you sum over several values of k, for instance from k-1 to k+1, or over a larger window; the support of the kernel h simply defines the length of the window of samples that contribute.

So what we need now is h(i − rk). If you again take an example such as this, remember that h is one of the interpolation kernels; in this case it is called the tent kernel because it looks like a tent. What this kernel does is that as you go further out from the central pixel on which you place the kernel, the influence of those pixels falls off linearly, not exponentially but linearly. That is your tent kernel.

So if you place such a tent-shaped h kernel at position i over the samples of f, then g(i) collects the term f(k) h(i − rk), where i − rk is this value here with respect to h, the green curve, plus the term f(k+1) h(i − r(k+1)), which is this quantity here, and so on. You obviously sum up those values, they are still inside the summation, and that sum gives you the value of g(i). That is how these interpolation kernels work.
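
Here is a minimal 1D sketch of exactly this operation with a tent kernel; the small input array is just an assumed example.

    import numpy as np

    def tent(t, r):
        # linear fall-off: 1 at t = 0, reaching 0 at |t| = r
        return np.maximum(0.0, 1.0 - np.abs(t) / r)

    def upsample_1d(f, r):
        # g(i) = sum over k of f(k) * h(i - r*k)
        g = np.zeros((len(f) - 1) * r + 1)
        for i in range(len(g)):
            for k in range(len(f)):
                g[i] += f[k] * tent(i - r * k, r)
        return g

    print(upsample_1d(np.array([0.0, 4.0, 2.0]), r=2))
    # [0. 2. 4. 3. 2.] -- the new samples lie on straight lines joining the original ones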

(Refer Slide Time: 30:49)

The linear interpolator we just saw corresponds to a tent kernel; you can also have more complex kernels such as B-splines.
(Refer Slide Time: 30:59)

Here are some qualitative results of applying these interpolations. The first one, which we already saw, is the effect of nearest neighbour interpolation, where you have the blockiness effects. If you do bilinear interpolation, which is the linear interpolation we spoke about but along the two axes, you get something like this; if you go to bicubic interpolation, you get something like this, and so on.
The main difference between these interpolations is the h kernel that you use: in one case it is nearest neighbour, in the second case it is bilinear, in the third case it is bicubic. We are not going to define every kernel, but you can look at the references at the end of this lecture to understand the various kernels that have been defined in practice.

(Refer Slide Time: 31:48)

Just like we defined interpolation, you can also define the complementary operation, which is decimation. Decimation means sub-sampling. So far we saw sub-sampling as simply omitting every other pixel, where if you do Gaussian pre-filtering you also ensure that you get a smoother output. You can also achieve sub-sampling using a convolution-like operation that is very similar; the only difference is that instead of rk you would have k/r and l/r here, because you now want to sub-sample and not interpolate.

(Refer Slide Time: 32:28)

That is where we are going to stop this lecture. To know more about interpolation kernels and decimation kernels, you can look at these references as part of your homework; please read them to understand this better, and we will continue with the next topic.

Deep Learning for Computer Vision
Prof. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 8
Edge Detection
(Refer Slide Time: 0:15)

Moving on from the previous set of lectures, which introduced you to basics of image
processing, we will now move to the next segment, which is on understanding and extracting
higher level features from images. So we will start this with our first lecture on Edge
Detection.

(Refer Slide Time: 0:40)

The idea of detecting edges on an image is about mapping the image from a 2D matrix to a
set of lines or curves on the image. In a sense, these curves or lines or what we call edges are
a more compact representation of the image. Why is this representation important? If you see
these three images and I asked you the question, what are the objects in the image? Can you
guess? It is not hard for you to say, the first image is one of a person, the middle image looks
like one of a horse and the last image looks like one of an aircraft.

This is a bit surprising because if you give such an image as input to a machine learning algorithm or to a neural network, it is not going to be trivial for the model to make these classifications the way you and I can. So for the human visual system, edges are extremely important to complete the picture and for the entire process of perception. In fact, even if some of the edges are taken away in these images, we can fill in the rest and still be able to say what objects are present. That is how important edges are in an image.

So let us ask the question. We have been introduced to a few concepts so far: convolution, the frequency representation of images to some extent, sampling and interpolation. Using the ideas that you have studied so far, how would you go about finding edges in images?

(Refer Slide Time: 2:50)

The key idea is to look for strong gradients and then do some kind of post-processing to get good-looking, stable edges. So why gradients? Let us try to understand that over the next few slides.

(Refer Slide Time: 3:09)

Firstly, before we mathematically try to understand how we are going to detect edges in the
images, let us try to understand how edges are formed in the first place. There can be many
factors. Given an image such as what you see on the screen right now, edges are
fundamentally some form of discontinuity in the image. The discontinuity could be because
of surface normals. For example, you see that this particular pole is cylindrical, and as you go around the cylinder, the surface normal direction changes; beyond a point, there is a discontinuity in the surface normal direction, which appears as an edge to the human eye.

Another option is simply a color or an appearance discontinuity. As you can see here, that
there is some discontinuity in the surface color or appearance, maybe while this is a black and
white image, you can imagine a red block below which there is a blue block. Just that color
difference is going to lead to a discontinuity. There could also be depth-based discontinuities. If you observe this tower at the back, you can see that there is one portion here which is protruding out of the building, and obviously the building is behind the protrusion. That gap in depth between these two artefacts in your image also leads to a discontinuity and hence an edge.

And lastly, you have an illumination discontinuity, which is caused by changes in light, such as shadows. For example, you will see here that a portion of a shadow falls on that region and you see an edge at that particular place. These are not the only kinds of discontinuities that can cause edges, but they are representative examples.

(Refer Slide Time: 5:07)

If you look more locally at these regions, you see that where there is no edge, the image looks fairly smooth, but where there is an edge, there is some kind of transition in the image pixels along a particular direction. Edges can be in different directions in an image, and you see that there is one particular direction in which there is a change in intensities. If you take another region of the image, you can see that in that region there are many kinds of edges in different orientations. Ideally, we want to be able to detect all of these in an image.

(Refer Slide Time: 5:47)

Why are edges important? We talked about this a few slides ago: edges are very important to how the human visual system perceives the environment around us. But it is not just the human visual system; edges are also important for a machine-based intelligence system.

Where do you use edges? You can use edges to group pixels into objects or parts; for example, edges tell you that the entire region inside one particular boundary belongs to a common object. They can also help us track important features across images, serve as cues for 3D shape, or help you do interactive image editing.

What do we mean? Let us say I want to take this building, which is from the IIT Hyderabad campus, and put snow or mountains in the background. I ideally need edge information to isolate the building from the background and then change the background accordingly, so edges are important for those kinds of applications.

(Refer Slide Time: 7:14)

We talked about images being represented in multiple ways: images can be looked at as matrices, and they can also be looked at as functions. When you look at an image as a function, edges look like very steep cliffs. What does that mean? If you had an image such as what you see on the left, our job now is to find out where these steep cliffs exist in the image.

(Refer Slide Time: 7:46)

That brings us to derivatives (gradients), because finding steep cliffs means finding places where a very small change in pixel position, going from one pixel to the next or a couple of pixels away, produces a huge change in intensity. In other words, an edge is a place of rapid change in the image intensity function, and it can effectively be measured using a derivative or a gradient.

How? Let us see an example. So if this was your image, so where you have a white patch
followed by a black patch, then a white patch again. So if you took one particular row of this
image, even in that row, you have the same pattern a set of white pixels then a set of black
pixels and then a set of white pixels, your intensity function looks something like this.
Remember again, that white has higher intensity, black has lower intensity, so your image
actually looks something like this in the image intensity space, that particular row of pixels
that is indicated in red on the left side.

So if you take the derivative of all of these values in the intensity function, this is how the derivative would look (the third column). There is an initial point where you have a negative derivative, because the value is falling, and then there is a later point where you have a gradient equal in magnitude but opposite in sign, because the intensity increases at that point.

Here is another example of the same setting, but this time we are also showing you how the second derivative looks. Remember that the second derivative is simply the derivative of the first derivative: it is positive while the first derivative is increasing, crosses zero where the first derivative peaks, and becomes negative as the first derivative falls off; the same happens in the opposite sense for the second half of the gradient, so its graph would look something like this. We will see more examples of the second derivative over the next few slides.

(Refer Slide Time: 10:01)

Now our challenge is: how do you get these derivatives? We understand that edges are important and that they can be obtained using derivatives or gradients, but the question now is how to achieve this using convolution. To do that, let us look at the first-principles definition of a derivative. If you have x, y directions among the pixels in an image, and the intensity function of the image is f(x, y), then in d(f(x,y))/d(x) we are trying to measure the derivative with respect to x.

For discrete data, you would define d(f(x, y))/d(x) to be (f(x + 1, y) - f(x, y)) / 1; that would be your discrete definition of the derivative.

(Refer Slide Time: 11:08)

Now looking at these definitions, what do you think would be the associated mask or kernel
or filter to ensure that we get this particular gradient? Can you guess what the filter would
be?

(Refer Slide Time: 11:23)

A simple answer is what you see on the right; it is exactly this definition. It simply says that if you are at a particular pixel (x, y), you compute f(x + 1, y) - f(x, y): by placing the filter at that particular pixel, this difference is the output you get at that location. Similarly, you can get a very similar effect using a vertical version of the filter, which will give you the gradient along the y direction. Is this the only way to get the gradient? Not necessarily.

There are other ways of defining the gradient too. Remember, on the previous slide, when we said that for discrete data we obtain the gradient using this particular formula, that is just one kind of approximation. There could be several other ways to approximate the gradient. For example, the gradient can also be written as the central difference (f(x + 1, y) - f(x - 1, y)) / 2, which is also a valid approximation. So there are multiple ways of defining the gradient, depending on how far you want to go in the neighborhood around the point at which you are computing the gradient.
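
As a minimal sketch, the two discrete approximations just mentioned can be written directly with array slicing; 'f' is assumed to be a 2D grayscale numpy array whose first axis is y (rows) and second axis is x (columns).

    import numpy as np

    def forward_difference_x(f):
        # f(x+1, y) - f(x, y)
        return f[:, 1:] - f[:, :-1]

    def central_difference_x(f):
        # (f(x+1, y) - f(x-1, y)) / 2
        return (f[:, 2:] - f[:, :-2]) / 2.0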

(Refer Slide Time: 12:53)

Here is another example of a 3 x 3 filter. I am going to let you think carefully about what
would be the derivative expansion here, it is not too difficult. Remember again, if this was a
f(x, y) remember that this is going to be f(x+1, y) , this is going to be f(x-1, y) and this is
going to be f(x, y+1) and so on and so forth. You can define the coordinate locations for each
of those values, and you simply now write out what this would mean as a derivative.

In this case, you can see the derivative is being defined as 2f(x-1, y) + f(x-1, y-1) + f(x-1, y+1), which correspond to these locations here, minus the corresponding terms at these locations on the other side, and you finally divide by the overall number of locations, which in this case is 9, to get your final gradient.

This kind of filter is called a Sobel edge filter and, once again, it is another approximation of the gradient. But clearly, in this case you are computing the gradient along only the horizontal direction: you are weighting values to the left and to the right, which means you would actually find the edges oriented along the vertical direction.

What you see on the right-hand side is the absolute value of the gradient; we are not looking at negative or positive because it does not matter to us. Wherever there is a sharp change, there is an edge, whether the intensity falls or rises. So the sign of the gradient does not matter; it is the magnitude, the absolute value, of the gradient that matters to us here.
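
A minimal sketch of applying Sobel-style filters by convolution; the kernels below are the commonly used weights, not necessarily the exact filter on the slide, and the tiny test image is an assumed example.

    import numpy as np
    from scipy.signal import convolve2d

    sobel_x = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=float)    # differences left/right: responds to vertical edges
    sobel_y = sobel_x.T                              # differences up/down: responds to horizontal edges

    image = np.zeros((8, 8)); image[:, 4:] = 255.0   # a black half next to a white half
    gx = convolve2d(image, sobel_x, mode='same', boundary='symm')
    gy = convolve2d(image, sobel_y, mode='same', boundary='symm')
    edges = np.abs(gx)   # absolute value: the sign of the gradient does not matter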

(Refer Slide Time: 15:20)

Here is the complementary Sobel filter for finding the horizontal edges. Are these the only two filters? Not necessarily.

(Refer Slide Time: 15:31)

There are more such filters: there is the Prewitt filter, there is the Sobel filter, which we just saw, and there is also the Roberts filter, which finds edges in diagonal directions. Clearly, you can handcraft several kinds of filters to find edges in different directions. An obvious follow-up question now is: how do we find edges in any direction? Do we have to convolve with many different filters to find edges in different directions? Let us see that in a moment.

(Refer Slide Time: 16:08)

Before we get there, we will build up a complete edge detector as we go forward. Let us formally define a few quantities that we are going to use for the rest of this lecture. Remember that the gradient of an image, grad(f), where f is the image, is given by the tuple (d(f)/d(x), d(f)/d(y)).

(Refer Slide Time: 16:32)

The gradient always points in the direction of the most rapid change in intensity. For example, if the gradient is (d(f)/d(x), 0), this tells you that there is no change in the y direction and all the change is only along the x direction; the edge in the image would look something like this. If the gradient is (0, d(f)/d(y)), then there is no change along the x direction and all the change is along the y direction, and this is how such an edge would look in an image.

On the other hand, if you had an edge in some other direction, say a diagonal direction not aligned with the vertical or horizontal axis, then you would have a gradient given by (d(f)/d(x), d(f)/d(y)): there is a non-zero gradient in both directions, which gives you an edge in a different direction.

(Refer Slide Time: 17:36)

How do you find the orientation of the edge? We said that edges have different orientations. From simple calculus, the orientation of the gradient is given by tan inverse of the gradient along y divided by the gradient along x; these are high-school calculus results.

(Refer Slide Time: 17:55)

And finally, the strength of the edge is given by the magnitude of the gradient. That is, sqrt((d(f)/d(x))^2 + (d(f)/d(y))^2) gives you the magnitude of the edge at that particular location. In other words, this tells you how strong the edge is at that location.
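
Putting the last few slides together as a minimal sketch, given the two derivative images gx = d(f)/d(x) and gy = d(f)/d(y):

    import numpy as np

    def gradient_orientation_and_magnitude(gx, gy):
        theta = np.arctan2(gy, gx)           # tan inverse of (gy / gx), quadrant-aware, in radians
        magnitude = np.sqrt(gx**2 + gy**2)   # sqrt((d(f)/d(x))^2 + (d(f)/d(y))^2)
        return theta, magnitude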

(Refer Slide Time: 18:17)

Let us again take this example of a single row in an image. Assume now that the image is slightly changed: we have a black patch followed by a white patch, that is what we have here. Now let us take one particular row; we are taking one particular row just for simplicity of understanding. You could have taken the full image, it is just that the figures on the right would have looked more complex.

So let us take one particular row of the image. Even in that particular row, the first set of pixels is black and the next set of pixels is white, so in that row the image intensity function would look something like this: first the set of black pixels, then the set of white pixels.

If you took the gradient of such a function, we know that the derivative would look something like this: it is flat in all these locations, and there is a spike right in the middle where the intensity changes. So if I asked you the question, where is the edge? It is simple: you say that wherever the gradient has a high value, that is where the edge is, and clearly you would have been right in this particular setting. Unfortunately, real-world images do not come as well polished as what you see here.

(Refer Slide Time: 19:43)

Real-world images have a lot of artefacts, especially noise. So we take the same image, but this time it has a lot of noise; seen up close, there is a lot of salt-and-pepper noise in this particular image. Now, if we plot the image intensity, it would look something like this: there is a lot of noise, in the middle there is a big change in intensity, which is where your edge is, and then there is a lot of noise again in the white region.

If you now took the gradient, you would observe that it looks something like this. Remember that the gradient measures infinitesimal change, so in all of these noisy regions the small fluctuations also get picked up by the gradient. Even around the true edge, if you look at small steps, the gradient there looks more or less like the gradient in the noisy regions of the same image.

Now, how would you find the edge? Remember, this is how real-world images look. If you simply took the gradient, you are going to be confused, because even the noisy parts of the image will look like they are edges too, but to the human eye the edge is clearly right in the middle. So how do you solve this problem? The answer is something you have already seen: you first smoothen the image and then do edge detection.

(Refer Slide Time: 21:11)

Let us see a concrete example. What we do now is smooth the image. You have already seen things like the box filter and the Gaussian filter; we are going to use a Gaussian filter. This is the original row of the image that we just saw on the previous slide. You take the Gaussian filter and convolve the original row with it; remember that a row of an image is a one-dimensional signal, so we are using a one-dimensional Gaussian filter here.

If you did it on the full image, you would use a two-dimensional Gaussian filter; remember, we have already seen what a two-dimensional Gaussian filter looks like, the 3x3 kernel with entries 1, 2, 1, 2, 4, 2, 1, 2, 1 (scaled by 1/16) that we saw earlier.

In this case, on this single row, it is going to be a one-dimensional Gaussian filter. We convolve the row with that one-dimensional Gaussian filter, and the output would then be something like this: the entire row is smoothened out, all the noise becomes flat, and even the edge gets slightly smoothened out. Remember, it was a sharp edge and it has got a little smoothed out; that is what Gaussian filters do. While they remove noise, they also blur out portions of the image, which is what happens here in that middle region where the edge is, and the rest of the region becomes flat too.

Now, on this Gaussian-smoothed signal, you can run an edge detector by computing the gradient, and you would find that an edge is detected in the middle of it. But there is more to it than what we just saw.

(Refer Slide Time: 22:49)

Remember again that we said convolution has some nice mathematical properties that correlation does not satisfy, and we are going to use one of them to make our life easier: convolution is associative. Consider the derivative of f convolved with g, and recall that the derivative itself can be implemented as a convolution, which we just saw a couple of slides back with the Sobel and other filters.

Because of the associative property, we can write this as f convolved with the derivative of g, or the derivative of f convolved with g; all of these are equivalent because of the properties of convolution. Which means what I can do now is take the Gaussian filter, compute its derivative once, pre-compute it and store it with me.

You know what a Gaussian filter looks like and you know what a derivative filter looks like; you have seen examples of both so far. Convolve the two and you get one output filter; store it, and that filter will look something like this. This is the derivative-of-Gaussian filter. Assuming g is the Gaussian filter and f is the original signal, you pre-compute d/dx(g) and store it as a signal, and now all you need to do is directly convolve this one filter with the original signal to get the output from which you can find the edge.

So it becomes a one-step process rather than a two-step process, which can save a lot of time and effort when computing edges on a noisy image. And remember, this was possible because of the associative property of convolution, which we could leverage to make things easier.
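
A minimal 1D sketch of this trick (the signal, noise level and sigma are assumed values): pre-compute the derivative of the Gaussian once, and a single convolution of the noisy signal with it localizes the edge.

    import numpy as np

    def gaussian_1d(sigma, radius=4):
        x = np.arange(-radius, radius + 1, dtype=float)
        g = np.exp(-x**2 / (2 * sigma**2))
        return g / g.sum()

    g = gaussian_1d(sigma=2.0)
    dg = np.convolve(g, [1, -1])                 # derivative of Gaussian, pre-computed and stored

    np.random.seed(0)
    f = np.r_[np.zeros(50), np.ones(50)] + 0.05 * np.random.randn(100)   # noisy step edge
    response = np.convolve(f, dg, mode='same')   # one convolution instead of smooth-then-differentiate
    print(response.argmax())                     # peaks near index 50, which is where the edge is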

(Refer Slide Time: 24:52)

An obvious follow-up question is: what about the second derivative? We have been insisting that gradients are good estimators of edges; can the second derivative be used as well? Yes, you can use the second derivative to compute edges, and this is how it would look. Take your original signal f again; in the previous slide we used the first derivative of the Gaussian, and if you instead take the second derivative of the Gaussian, this is how it looks. The second derivative of the Gaussian is called the Laplacian of Gaussian, and we will see it in more detail very soon.

The Laplacian of Gaussian can also be convolved directly with the signal, and you now get an output something like this, and this time the edge is located here. Remember, in the earlier case, when we used the derivative-of-Gaussian filter, we found that the response was peaking at around the 1000 mark on the axis; that is where the edge was. But if you apply the Laplacian of Gaussian, you find that at 1000 the value is actually zero.

So if you use the Laplacian of Gaussian, an obvious question that might be in your mind is: what would the filter look like? We know what a Gaussian filter looks like, for instance the 3x3 kernel with entries 1, 2, 1, 2, 4, 2, 1, 2, 1; the exact values depend on the size and spread of the Gaussian.

And we saw how to obtain the derivative of the Gaussian: we can take a derivative filter such as a Sobel filter, take a Gaussian filter, and convolve them. How do you do this for the Laplacian of Gaussian? We will talk about that in some time; we will actually show a tangible 3x3 Laplacian of Gaussian filter and how it is obtained in a few slides from now.

So if you apply the Laplacian of Gaussian filter, edges now become zero crossings. When we took the plain first derivative, edges were found where the gradient was high, but with the second derivative of the Gaussian, edges are found where you have a zero crossing: some value that is positive transitions to a negative value, and the zero crossing is where the edge can be localized. That is what the Laplacian of Gaussian gives us here.

(Refer Slide Time: 27:16)

Here is a visual illustration of these different filters. This is your Gaussian filter and this is your derivative-of-Gaussian filter; I hope you can see, even geometrically, that if you differentiate the Gaussian, this is what you get as the derivative of Gaussian. And you can also work out and check that if you differentiate this once more, you get the Laplacian of Gaussian, which looks somewhat like this.

The Laplacian operator in general is written as ∇²f (nabla squared f).

Now consider the derivative of the Gaussian; remember, the derivative can be taken in two directions. You can have the derivative of the Gaussian along the x direction and along the y direction. The derivative of Gaussian along the x direction would look like this; I hope you can connect the surface plot to the heat map, where red means high value and blue means low value. And the derivative of Gaussian along the y direction would look something like this. From an image perspective, this is how the derivative of Gaussian along one direction looks and this is how it looks along the other direction.

Now, if I ask you which one of these finds horizontal edges and which finds vertical edges, what do you think? If you have not got the answer already, the second image finds horizontal edges and the first image finds vertical edges; it is actually quite evident when you look at the images themselves. The second one responds to edges like this, separating a white region from a black region horizontally, and the first one responds to edges like this, separating the white region from the black region vertically. So one of them finds horizontal edges and the other finds vertical edges.

One takeaway here is that these images on the last row are simply masks or kernels, the filters you are going to use. To some extent, the way a filter visually looks is exactly the kind of pattern it will try to look for in the image, and that is what we see in these examples.

(Refer Slide Time: 29:57)

So if you now compute the gradients, you can compute the x derivative of the Gaussian and the y derivative of the Gaussian, and then simply take the gradient magnitude, which is the square root of the x derivative squared plus the y derivative squared, and you will find that the gradient magnitude gives you a set of edges something like this.

There is a very popular image called the Lena image, which people use for examples in image processing. This gives you a first level of edges, but is that all there is to it? Not necessarily.

(Refer Slide Time: 30:28)

Let us now take a closer look at one of these edges in the image we just saw on the earlier slide; it actually looks somewhat like this. I could now ask the question: where is the edge, really? There seem to be so many white pixels here that the edge could be anywhere among them. Clearly, when we say we want an edge, we expect certain properties to be met; we have certain expectations of what the edges should look like. What are they? Let us first define them.

(Refer Slide Time: 31:04)

The properties of a good edge detector are as follows. Firstly, it should do good detection. What does that mean? It should find all the real edges in the image and ignore noise or other artefacts: if there is an edge in the image, it should detect it. That is the good detection we are looking for from an edge detection method. Remember, without good detection you are going to keep detecting the noise points in the image as edges, which you do not want. So that is one property you want.

The second thing you want is good localization, which means that if the edge is at this location, you want to detect the edge at the same location. If you detect edges in the vicinity, in the neighborhood, and not exactly where the edge is, that is called poor localization; what we want is good localization. Exactly the pixel where the edge is, is where you want to detect it; you do not want to be two pixels to the left or two pixels to the right, which would be poor localization.

And lastly, we want a single response: we want to return only one point for a true edge point and not an entire region. Once again, if this was the true edge, a detection that is spread out is not very useful, and we want to be exactly at this location for a good edge detection method. So far, we only know how to use the gradient to detect edges; how do we achieve good localization, a single response, and so on? That is what we are going to see now.

(Refer Slide Time: 32:48)

What we can do to ensure a single response in a particular region is what is known as Non-Maximum Suppression, and the idea is very literal: if something is not a maximum, suppress it. How do we do it? Suppose there was a circular kind of edge in an image, as shown here. Clearly, every white pixel could get detected as an edge; as we just saw, all of them could have high responses depending on what neighborhood you use to compute your gradient. So we may end up with many pixels corresponding to the edge, which you do not want; you want it to be thin and isolated.

What do you do? You take a particular pixel and compute the orientation of the gradient at that pixel. How? As we saw a few slides back: tan inverse of the gradient along the y direction divided by the gradient along the x direction gives you the orientation of the gradient at that pixel. In this particular case, the orientation of the gradient would be like this; remember that the gradient is always normal to the edge, so this is how the orientation of the gradient would be.

So if you are at a pixel q and the gradient is oriented along a particular direction, you go along the direction of the gradient and check whether this pixel is a maximum in that direction or not. For example, if you go along the direction of the gradient in this image, you want to retain only the pixel that has the highest gradient value and set any other pixel that is not the maximum to zero. So you just check whether the pixel is a local maximum along the gradient direction: if it is a maximum, retain its value, the edge magnitude you computed; if it is not, make the edge magnitude zero and do not treat it as an edge.

One thing to remember in this process is that when you go along the gradient orientation, you may end up at a location that is not defined on the pixel grid; it could be between two pixels. In those cases you have to interpolate: you can use simple linear interpolation, or any simple interpolation method, to get the value of the gradient at that location, and you can then continue with non-maximum suppression.
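
A minimal sketch of non-maximum suppression; here the gradient direction is quantized to four bins (0, 45, 90, 135 degrees) instead of interpolating between pixels, which is a common simplification of what is described above.

    import numpy as np

    def non_maximum_suppression(magnitude, theta):
        h, w = magnitude.shape
        out = np.zeros_like(magnitude)
        angle = (np.rad2deg(theta) + 180.0) % 180.0      # fold directions into [0, 180)
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                a = angle[y, x]
                if a < 22.5 or a >= 157.5:               # gradient roughly along x
                    n1, n2 = magnitude[y, x - 1], magnitude[y, x + 1]
                elif a < 67.5:                           # roughly 45 degrees
                    n1, n2 = magnitude[y - 1, x + 1], magnitude[y + 1, x - 1]
                elif a < 112.5:                          # gradient roughly along y
                    n1, n2 = magnitude[y - 1, x], magnitude[y + 1, x]
                else:                                    # roughly 135 degrees
                    n1, n2 = magnitude[y - 1, x - 1], magnitude[y + 1, x + 1]
                if magnitude[y, x] >= n1 and magnitude[y, x] >= n2:
                    out[y, x] = magnitude[y, x]          # keep only local maxima along the gradient
        return out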

(Refer Slide Time: 35:38)

Here is the same image after applying non-maximum suppression. You can see now that the edges have thinned out and are fairly well localized where the edges should be, rather than being very thick. Are we done yet? Not exactly. One more problem we find in this set of edges is that there are many discontinuities. We would ideally have wanted the hat on her head to be one long edge; unfortunately, it appears as multiple pieces, and in fact there are even pieces missing where we would have expected an edge.

(Refer Slide Time: 36:22)

How do we handle this? The method proposed here is called Hysteresis Thresholding. What does that mean? Hysteresis thresholding means that you threshold your gradient magnitude with two thresholds, a high threshold and a low threshold. These are just values that you have to set; you pick a particular value for the high threshold and for the low threshold.

How do we use them? If the gradient at a pixel is greater than the high threshold, remember, a high gradient means an edge, then it is definitely an edge pixel. Similarly, if the gradient at a pixel is less than the low threshold, it is definitely a non-edge pixel. Those two cases are straightforward.

Now, what happens if the gradient at a particular pixel lies between the two thresholds? Then you consider that pixel to be an edge pixel if and only if it is connected to a definite edge pixel, either directly or via other pixels whose gradients lie between low and high. Let us see an example.

Remember that all the pixels above the high threshold are already edge pixels, and any pixel below the low threshold is a non-edge pixel. Now consider the pixels with gradient values lying between these two thresholds; remember, high and low are just numbers you come up with as thresholds.

If you take one particular pixel there, you check whether it is connected, directly or through a chain of such in-between pixels, to any definite edge pixel. If so, you call it an edge pixel. On the other hand, if you take this other pixel and try to connect it, it does not connect to any pixel that is greater than the high threshold, and hence this entire set of pixels gets classified as non-edge pixels.

This means that instead of relying on a single threshold to say whether you have an edge or not, you now have two thresholds, so that even for pixels in the borderline region you can decide whether they should be edges or not. This gives more complete-looking edges, where the edges fully trace the object rather than the broken lines we saw on the earlier slide.

(Refer Slide Time: 39:29)

So this process of finding edges, where we defined a good edge detector to have three characteristics and addressed each of them through non-maximum suppression and hysteresis thresholding over the last two slides, was proposed in a paper by John Canny in 1986. This entire methodology is called the Canny Edge Detector, one of the most popular edge detectors used in computer vision applications for decades together.

So what is the Canny edge detection algorithm? It is just a summary of what we have seen so far. You filter the image with the derivative of the Gaussian; remember, this effectively smooths the image with the Gaussian and then takes the derivative to get edges. Then you find the magnitude and orientation of the gradient. Then you do non-maximum suppression along the gradient orientation at every pixel. Finally, you do linking and thresholding using hysteresis: you define two thresholds, low and high, use the high threshold to start edge curves, and use the low threshold to continue those curves, so as to get more connected edges in the image.
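
In practice one rarely implements all of these steps by hand; OpenCV bundles them in cv2.Canny. A minimal usage sketch (the file name, kernel size and thresholds are assumed values you would tune):

    import cv2

    img = cv2.imread('building.jpg', cv2.IMREAD_GRAYSCALE)   # assumed file name
    blurred = cv2.GaussianBlur(img, (5, 5), 1.4)             # Gaussian smoothing with sigma = 1.4
    edges = cv2.Canny(blurred, 50, 150)                      # low and high hysteresis thresholds
    cv2.imwrite('building_edges.png', edges)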

(Refer Slide Time: 40:47)

Let us see an example of the Canny edge detector. Here you see an original image; it is smoothed using a Gaussian filter, and here are the gradient magnitudes after applying the derivative. Then you have the edges after non-maximum suppression; you can see that the edges have become thinner.

Then you do the double thresholding: you keep the edge pixels that are greater than the high threshold and remove anything lower than the low threshold. Then you do the hysteresis step to give continuity to the edges in the region between high and low, and here you have your final output after applying all of these steps. Here are a few more examples of applying Canny edges, and you can see that for these kinds of images they do a fairly good job, in a fairly cohesive way, across various artefacts in the image.

(Refer Slide Time: 41:51)

One of the parameters of the Canny edge detector is the Gaussian kernel: you can choose various sigmas, and the Gaussian filter can be 3x3, 5x5, and so on. Let us see what happens when you change the sigma of the Gaussian.

You will notice that a large sigma detects large-scale edges and a small sigma detects finer edges; it is very straightforward. If you run Canny with sigma equal to 1, you get lots of small edges, whereas if you increase sigma, you end up with only the large-scale edges and not the finer details, mainly because the larger sigma smooths away the fine structure before the gradient is taken.

(Refer Slide Time: 42:48)

While the Canny edge detector, a fairly simple image processing operation, was the most popular edge detector for many decades, there have been other efforts where people have come up with different, more sophisticated methods for edge detection, even in recent years. Let us visit a few of them. We will not go into detail on all of them, since some of them need topics that we will cover in later lectures, but we will look at them at a high level to give you a perspective of how edge detection can be approached in other ways.

Here is a method that was proposed by Martin and others in 2004, called Learning to Detect Natural Image Boundaries Using Local Brightness, Color and Texture Cues. As the title says, this group of researchers took an image, computed the texture, the brightness and the color in the image, and gave these to a machine learning based classifier. They observed that non-boundary regions show one set of patterns in these cues, whereas boundary regions show a sharp change in the values.

They use this idea to provide the texture, brightness and color cues at every pixel to a classifier, which tells you whether a particular pixel is an edge or not, and they then combine the information from the multiple sources, brightness, color and texture, to give the final edge detection.

(Refer Slide Time: 44:32)

Here is an example of how their method works on a bunch of different images. If you only used brightness to detect edges, this is how the edges look; if you only used color, this is how they look; if you only used texture, this is how they look. Now, if you combine all of them, you end up getting stronger edges. This was one method proposed beyond the Canny style of edge detection.

(Refer Slide Time: 45:04)

There is another method called Structured Forests for Fast Edge Detection, proposed in 2013, where the idea was once again to use machine learning based methods to quickly predict whether a pixel is an edge or not. One insight they use is that edge predictions can be learned from past data: you do not need to explicitly compute a derivative at a pixel, you can learn from past data what edge pixels look like and simply apply that learned classifier to a pixel in a new image to decide whether it is an edge or not.

The other insight is that we want predictions for nearby pixels to influence each other: if there are two nearby pixels and one of them is an edge, there is a good chance that the neighbouring pixel is also going to be an edge. These are the two insights they use. The method operates at the level of patches in an image and uses a random forest classifier to tell which pixels in a patch correspond to an edge.

(Refer Slide Time: 46:24)

Let us see this in a slightly more algorithmic manner. The algorithm works as follows: you extract a bunch of overlapping 32 x 32 patches at three scales in an image. What is a scale? A scale is a resolution of the image. For example, if you had an image of size 64 x 64, sub-sampling it gives 32 x 32, and sub-sampling further gives 16 x 16; these are three different scales of an image.

So you extract overlapping 32 x 32 patches at multiple scales, and then you compute many features in each of these patches, such as pixel values, pairwise differences in colour, gradient magnitudes, or oriented gradients. Using these features, you build a random forest classifier based on some training data where you knew where the edges were. In each of these 32 x 32 patches, you take the central 16 x 16 region, for which you predict the edge pixels using the random forest classifier.

You would get such a prediction for the central 16 x 16 region across many different patches. You simply average the predictions for each pixel across all the patches in which that pixel occurred and got a prediction; this average tells you whether that pixel is an edge pixel or not. For more details, you are welcome to look at this paper to understand it better.

(Refer Slide Time: 48:07)

Another method is called Crisp Boundary Detection using Pointwise Mutual Information. This method uses the idea that pixels within the same region share common information, while pixels across a boundary do not; that is where pointwise mutual information comes in. They take an image and model the pairwise co-occurrences of pixel values as a density function, and they compute a quantity called PMI, the pointwise mutual information, for the same image. Based on these quantities, they build what is called an affinity matrix, which tells you how close one pixel is to another in terms of how often they co-occur. Based on that affinity matrix, they use what is known as spectral clustering, which is a clustering method.

If you do not know spectral clustering, that is completely all right at this point, but they use it to segment the image and obtain the edges. Once again, the purpose of this slide is not to give you in-depth information, but just to give you an idea of how edges can be found through other means.

(Refer Slide Time: 49:30)

Here is an example where they show that, given these input images, their method gives fairly good edges, very similar to the outputs that human labelers give for the same images.

(Refer Slide Time: 49:44)

Lastly, one more recent method called Holistically Nested Edge Detection. This is actually a
deep learning based method. So while we have not covered those topics yet at this time, I
hope that some of you have the background in deep learning. So this method actually uses
convolutional neural networks to find edges, where you can see given an input, these are
what are known as convolutional layers.

And at the end of every convolutional layer, you try to predict the edges in the image and
then you try to have a loss at that particular point, which goes back and updates the weights
in that convolutional layer, and you keep doing it multiple times to help improve the
performance of that convolutional neural network. And your final edge detection is a fusion
of all of these edges in different layers. Once again, for more details, you are welcome to look
at this paper called Holistically Nested Edge Detection.
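
For readers already familiar with deep learning, here is a toy PyTorch sketch of the
deep-supervision structure just described: each stage produces a side edge map, and a 1 x 1
convolution fuses the side maps. This is not the actual HED network (which is built on VGG-16
with five side outputs); it only illustrates the idea.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyHED(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.side1 = nn.Conv2d(16, 1, 1)   # side edge map after stage 1
        self.side2 = nn.Conv2d(32, 1, 1)   # side edge map after stage 2
        self.fuse = nn.Conv2d(2, 1, 1)     # fuses the (upsampled) side maps

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        s1 = self.side1(f1)
        s2 = F.interpolate(self.side2(f2), size=s1.shape[-2:], mode='bilinear', align_corners=False)
        fused = self.fuse(torch.cat([s1, s2], dim=1))
        # During training, a loss (e.g. weighted binary cross-entropy against the
        # ground-truth edge map) would be applied to each side output and to the fusion.
        return [torch.sigmoid(s1), torch.sigmoid(s2)], torch.sigmoid(fused)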

(Refer Slide Time: 50:45)

That concludes our discussion on edges. As readings, you can read chapter two of the
Szeliski’s book. And we are going to leave you with one question for today's lecture, which
is, let us say we had an image such as this, you know how to find the canny edges now, you
know the methodology for that. How do you go from canny edges to straight lines?
Remember, because you use the gradient magnitude, canny edges will give
you edges in any orientation. But how do you find the straight lines, and what will the
parameterization of the straight line be? Think about it and we will answer this in the
next lecture.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 09
From Edges to Blobs and Corners

Last class, we spoke about edge detection and went into details of the canny edge detector
and this lecture, we will move a little forward and talk about other kinds of artefacts from
images that are practically useful, specifically blobs and corners.

(Refer Slide Time: 0:40)

Before we go there, we did leave one question for you at the end of the last lecture, hope you
spend some time working out the solution yourself, if not, we will briefly discuss the outline
right now and wait for you to work out the details yourself. So, we talked about trying to use
canny edges to get straight lines in images. How do you do this? You first compute canny
edges in your image, which means you would compute your gradients in the X direction and
the Y direction, and you could also compute your thetas, which are the angles of each of
your edges, given by $\theta = \tan^{-1}\!\left(\frac{\nabla_y f}{\nabla_x f}\right)$.

Now as the next step, what we do is assign each edge pixel to one of 8 directions; 8 is just a
number, you could choose anything else, but you assign each edge to one of 8 directions. So
you probably combine several angles into a single bin. And for each of these 8 directions, let
us call one of those directions d, we obtain what are known as edgelets. And what are
edgelets? Edgelets are connected components of edge pixels with directions in (d-1, d, d+1).

So you have 8 possible directions into which you have binned all your edge orientations, and
for a given direction d, you connect edge pixels whose orientations fall in d-1, d or d+1. So
you will get a set of what are known as edgelets.

(Refer Slide Time: 2:28)

Once you compute your edgelets, we then compute two quantities: the straightness and the
orientation of each edgelet. And the way we go about doing this is using the eigenvectors and
eigenvalues of the second moment matrix of the edge pixels. So what we do here is you have
your edge pixels that belong to a particular edgelet; you take all of those pixels and
compute such a matrix,
$M = \sum \begin{bmatrix} (x-\mu_x)^2 & (x-\mu_x)(y-\mu_y) \\ (x-\mu_x)(y-\mu_y) & (y-\mu_y)^2 \end{bmatrix}$.
So if you look at the first entry here, it corresponds to the sum of $(x - \mu_x)^2$, where $\mu_x$
is the mean x-coordinate of all of those pixels in the edgelet and each x
corresponds to the x-coordinate of one of those edgelet pixels; so you are simply
computing, in some sense, a variance with respect to the mean, along the X
direction. Similarly, you have a variance along the Y direction of those edge pixels with
respect to their mean, and you also find the covariance between the X and Y directions.

So once you have this, very similar to how principal component analysis works, we take the
eigenvectors and eigenvalues of this second moment matrix. And remember, the eigenvector
corresponding to the largest eigenvalue will give you the direction of maximum variance
among these pixels, and that is what we are looking for at this time.

So we finally decide that the orientation of the edgelet is going to be given by
$\tan^{-1}\!\left(\frac{v_1}{v_0}\right)$,
where $v_1$ is the larger eigenvector. By larger we mean the eigenvector corresponding to the
larger eigenvalue. Remember here that M is a two cross two matrix, which means you will
have at most two eigenvalues, and you are going to take the eigenvector
corresponding to the larger eigenvalue and call that $v_1$, and $v_0$ is the other
eigenvector.

So if you take the tan inverse of these two, that theta will give you the direction of that
overall edgelet. So in case you are finding this difficult to understand, please go back and
read principal component analysis and you will be able to connect that to this particular idea.
And we define straightness to be $\frac{\lambda_2}{\lambda_1}$, where $\lambda_2$ is the second largest eigenvalue and $\lambda_1$ is the
largest eigenvalue. So the straightness here is going to take its highest value when $\lambda_2$
and $\lambda_1$ are close to equal. And once you get this quantity, you just threshold the straightness
appropriately and store your line segments.

So it really does not matter whether you measure straightness as $\frac{\lambda_2}{\lambda_1}$ or $\frac{\lambda_1}{\lambda_2}$; you just have to
ensure that you construct your threshold appropriately. If you invert the straightness ratio,
you just have to ensure that you threshold it at an appropriate value and then obtain your line
segments.
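
Here is a minimal sketch of the edgelet analysis just described, assuming you already have the
(x, y) coordinates of the pixels belonging to one edgelet. The orientation is taken as the angle
of the eigenvector of the larger eigenvalue (the direction of maximum variance), which is one
concrete reading of the formula above.

import numpy as np

def edgelet_orientation_and_straightness(xs, ys):
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    dx, dy = xs - xs.mean(), ys - ys.mean()
    M = np.array([[np.sum(dx * dx), np.sum(dx * dy)],
                  [np.sum(dx * dy), np.sum(dy * dy)]])        # second moment matrix
    eigvals, eigvecs = np.linalg.eigh(M)                       # eigh returns ascending eigenvalues
    v_major = eigvecs[:, 1]                                    # eigenvector of the larger eigenvalue
    orientation = np.arctan2(v_major[1], v_major[0])           # direction of maximum variance
    straightness = eigvals[0] / max(eigvals[1], 1e-12)         # lambda2 / lambda1; threshold appropriately
    return orientation, straightness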

(Refer Slide Time: 6:02)

So here is a visual example. You can see that after applying canny, you get all these edges
on the left. So you do get lots of edges, which are fairly well connected and not too noisy, but not
all of them correspond to straight lines; many of them correspond to edges that represent, say,
the texture of the floor and so on and so forth. But let us say there is an application where we
really do not want those. So then you use this kind of an approach to obtain the straight lines
from your canny edges. Think more about this; knowing more about PCA could help you
understand this a bit better.

(Refer Slide Time: 6:44)

Moving forward. As we said, the focus is going to be going from edges to newer artifacts
such as blobs and corners. So remember, last lecture, we talked about taking the Laplacian of
Gaussian and we said that taking the zero crossings of the Laplacian of Gaussian could give
you a measure of edges in an image. That was another way you could obtain edges in an
image. We are going to talk about it slightly differently now. So once again, just to recap,
remember this is your Gaussian, this is your derivative of the Gaussian and this is your
Laplacian of Gaussian, where the Laplacian is defined as $\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$.

(Refer Slide Time: 7:35)

Now, let us say an example of a three cross three Laplacian of Gaussian filter is going to look
somewhat like this, where you have a minus four in the center and one, one, one, one in its
nearest neighbors and 0, 0, 0, 0 in its next level of nearest neighbors. But can you try to guess
why this is a relevant Laplacian of Gaussian filter? One point to notice here is an equivalent
Laplacian of Gaussian filter could also have been 0, 0, 0, 0, -1, -1, -1, -1 and 4.

So if you actually visualize this filter, you would see that at the center there is a peak; at its
immediate nearest neighbors there is a -1, a value that dips below zero; and then there are 0s
in other places. So if you try to visualize it, this filter would be very, very similar to the shape
of the Laplacian of Gaussian that we just saw on the earlier slide. But that is from a geometric
or a conceptual perspective.

Why did we say 4? Why not 8, why not any other number? Once again it goes back to
approximating the gradients in some manner and we can work this out to show you how this
works out.

(Refer Slide Time: 8:59)

So remember, we are computing second derivatives here. So $\frac{\partial^2 f}{\partial x^2}$ can be
approximated from first principles as $f(x+1, y) + f(x-1, y) - 2f(x, y)$. Similarly,
you can write it out for $\frac{\partial^2 f}{\partial y^2}$. Now, if you try to put both of these into your
Laplacian of Gaussian equation, or your Laplacian equation in this particular case, you would then have
$\nabla^2 f = f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4f(x, y)$. This should straight
away ring a bell for you as to why you have got this kind of a filter.

Remember, at f(x, y) the coefficient is -4; at its neighbors x + 1, x - 1 and similarly y + 1, y - 1,
the coefficient is 1. So this simply came from taking an approximation of the gradient.
Remember, you could have other kinds of approximations of the gradient with respect to the
local neighborhood; you could consider larger windows and so forth. If you do that,
obviously the definition of the Laplacian of Gaussian filter would have to be appropriately
modified.
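
As a quick sanity check of this derivation, here is a small sketch that builds the 3 by 3
Laplacian kernel above and applies it by convolution (assuming SciPy is available):

import numpy as np
from scipy import ndimage

laplacian_3x3 = np.array([[0,  1, 0],
                          [1, -4, 1],
                          [0,  1, 0]], dtype=np.float64)

def apply_laplacian(image):
    # Convolving with this kernel computes f(x+1,y) + f(x-1,y) + f(x,y+1) + f(x,y-1) - 4 f(x,y)
    return ndimage.convolve(image.astype(np.float64), laplacian_3x3, mode='nearest')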

(Refer Slide Time: 10:26)

Here is another visual illustration. Here is the original image, here is simply taking the
Laplacian and here is the result if you take the Laplacian of Gaussian. Remember again, that
Laplacian of Gaussian gives you the smoothing effect, which smoothens out the noise and
then takes the Laplacian of your original image. Once again, remember that we are just taking
a filter and convolving the filter by moving it around at every point of the image and getting
your output.

(Refer Slide Time: 10:56)

Now let us ask the question. So this is something that you partially saw the last time. So
what else can the Laplacian of Gaussian do? Do you have ideas?

(Refer Slide Time: 11:12)

The other thing Laplacian of Gaussian can do is detect blobs. Why? Remember that a
Laplacian of Gaussian filter looks somewhat like this. Remember, you could write the
Laplacian this way or the other way, it really does not matter. So at the end of the day for
edge detection and similar detection of other artifacts, they only take the absolute value,
whether the output is negative or positive, we just take the absolute value.

So the sign would change if you had a white blob on a black background or a black blob on a
white background. We ideally do not care; both of them are blobs to us, and we ideally want to
recognize blobs in both these cases. So you could write your Laplacian of Gaussian one way,
or another way where your central peak goes up on top and the other part comes below. This
is similar to writing the Laplacian of Gaussian with a -4 in the center and 1, 1, 1, 1 around, or
a 4 in the center and -1, -1, -1, -1 around.

So now if you try to visualize this as an image: a 3 by 3 Laplacian of Gaussian filter may be too
small, so let us say you expand it and take a larger neighborhood, say a 7 by 7 or an 11 by 11
Laplacian of Gaussian filter. You would find that the filter itself, remember it is also a matrix
that can be visualized as an image, would look something like this: a black blob in the
middle, a white ring around it, which is where you have a peak, and gray all the way in other
places.

You could also have an inverse of this, where you have white in the middle, a black ring, and
then gray all over. Both of these are similar. And by looking at the filter, you can say that it is
likely to detect blobs. In some sense, convolution with the filter can be viewed as comparing a
little picture of what you want to find against all local regions in the image. There is a
slight nuance here: this does look a bit similar to template matching, for which
you use cross-correlation, but in convolution you would flip the filter in both directions and search
for it across the image. But when your filter is symmetric, it really does not matter; both cross-
correlation and convolution will be looking for the same pattern in the image.

So once you have a Laplacian of Gaussian like this, you can probably count sunflowers in a
field, or you can probably detect red blood cells in a blood test, or any structure with blobs
across an image. Clearly, in this image with sunflowers there are blobs of different sizes, so
you would have to run the Laplacian of Gaussian at different scales to be able to capture
all of those blobs in the image.
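
A minimal sketch of this multi-scale blob search, assuming SciPy is available, could look like
the following; real detectors would additionally do non-maximum suppression over space and
scale, but the basic idea is just strong absolute Laplacian-of-Gaussian responses at several sigmas:

import numpy as np
from scipy import ndimage

def log_blob_response(image, sigmas=(2, 4, 8, 16)):
    responses = []
    for sigma in sigmas:
        # The sigma**2 factor is a common scale normalization so responses are comparable across scales
        r = (sigma ** 2) * ndimage.gaussian_laplace(image.astype(np.float64), sigma)
        responses.append(np.abs(r))
    return np.stack(responses)   # shape: (num_scales, H, W); peaks indicate blobs of different sizes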

(Refer Slide Time: 14:06)

Let us now move forward to the next artifact, which is very useful to extract from images.
For a long time, extracting corners from an image was a very important area of research
in computer vision, probably through the entire 90s and early 2000s.

So we will try to describe one popular method today and let us start by asking the question.
Suppose we had an image such as this, what would be interesting features that set apart this
image from any other image? Remember that if you want to do any processing, we want to be
able to extract some unique elements of the image. So what are those unique elements in this
image?

(Refer Slide Time: 14:53)

And you are going to say that those are going to be your corners, but let us try to get there. We
ideally want to look for image regions that are unusual, something that sets apart that
region in the image from other images. If you have a region that is textureless, for
example the blue sky, then that could be common across several images; it may not really
be unique to that particular image, and you may not be able to localize that region in any other
image with similar content.

You are ideally looking for patches with large contrast changes, large gradients. Edges are a
good example, but we will talk about why edges may not be the right artifact in a moment.
But we are ideally looking for some patches or some regions in the image which have large
contrast changes, such as gradients. But the problem with edges is edges suffer from what is
known as the aperture problem. By aperture problem we will just see a visual illustration on
the next slide, but while edges are good, unique aspects of a particular image, they do suffer
from a problem, which we will talk about in the next slide.

So what we ideally are looking for are regions of the image or what we are going to call
corners, where gradients in at least two different directions are significant. So remember an
edge is an artifact where you have a significant gradient in one direction, which is going to be
normal to that edge, it could be whatever direction the edge maybe, the normal direction to
that edge is going to be the orientation of that edge. But we are saying, we want those points
where there could be significant gradient in two directions. What does that mean? We will try
to answer.

(Refer Slide Time: 16:50)

Let us take a tangible example. Let us assume that there is an artifact such as this, this looks
like some inverted V let us say. And we have such an artifact and we want to find out, which
part of this image, let us assume this is the full image or ignore the blue box, the blue box is
for explaining it to you. Just consider the image with just this inverted V. We want to find out
now which aspect of the image is unique to it, which can help us recognize it, say other times,
or if you view the image from other angles.

So if you consider this blue box placed here, you see that that is a flat textureless region.
There is no change in intensity in that region. So it is not going to be very useful. It is like the
blue sky with absolutely no change. It is not going to be very useful when you try to compare
this image with other images.

So if you now place the blue box on the edge part of the image, this is good. There is some
artifact that is useful, but the problem is if you move this patch here or here, it would all have
the same response. You would never know whether you placed your box here or here or here,
because all of them would have exactly the same response. And there is no difference in the
local characteristics in all of these places.

So which means while edges are useful, there is something that they are lacking. So which
means if you try to match or let us say you take a panoramic photo in your phone and you try
to align two images, you may not know which part of the edge to align at. We are ideally
looking for placing the box at that kind of a point where there is change in two directions and
that point could be unique to this particular image. And ideally, we are looking for many such
points in an image, but such points are the kind of points that we want to detect in an image.

(Refer Slide Time: 18:59)

How do we find such points in an image? We know how to do edges now, but let us try to go
one level further to try to see how do you find such corners in an image. To do that, we are
going to define a quantity called autocorrelation. As the name states, autocorrelation is
correlation with itself. So you are not going to use any external filter; we are going to take a
patch in the image and see how it correlates with itself. What does that mean? Let us try to
quantify that.

So we are going to define the autocorrelation function as follows: you take a patch in the image and
you compute the sum of squared differences between pixel intensities with small variations in
the image patch position. So if you had a point $p_i$ in the image, let us say you place the patch
centered at $p_i$. And if you now have a small $\Delta u$, which is going to be the offset, you are going
to move that patch by that $\Delta u$. Remember $p_i$ and $\Delta u$ are both 2D
coordinates, so $\Delta u$ will have two components, say $\Delta u$ and $\Delta v$; it is a vector. So you take
$I(p_i)$, which is the image intensity at that point $p_i$, then you move that point to $p_i + \Delta u$,
you move it a little bit and you see what the image intensity is at that new location, and you take the
sum of squared differences for all points in that particular patch. That is,
$E_{AC}(\Delta u) = \sum_i w(p_i)\,[I(p_i + \Delta u) - I(p_i)]^2$.

So if you take a square patch, take every point, move it a bit, so move that entire box a bit and
then you compute pairwise distances between the same locations in the original patch and the
new patch and sum them all up. This is what we define as autocorrelation.

Now, about this function $w(p_i)$: $w(p_i)$ is a function that tells you how much weight you want to
give to each particular point in that patch. Maybe for the central pixel you want to give
more weight, and for a pixel at the periphery of that blue square that you have, you
may want to weight it a little less. So you could have a fixed weight for all points in that
square, or you could have a Gaussian which defines $w(p_i)$, where you weight the center pixel
more and the pixels at the periphery a little less. We define this as autocorrelation.

(Refer Slide Time: 21:39)

Now let us try to see how do you compute autocorrelation and then come back to how do you
compute corners using this autocorrelation. So let us look at this more deeply. Let us consider
the Taylor Series expansion of your I(𝑝𝑖 + Δu), remember (𝑝𝑖 + Δu) is about taking the patch

which was centered at 𝑝𝑖 and moving it by a Δu and placing it at a slightly offset location in

the same image. From the Taylor series expansion, you can write this as $I(p_i) + \nabla I(p_i)^\top \Delta u$, where
$\nabla I(p_i)$ is the image gradient at that particular location, $\left(\frac{\partial I(p_i)}{\partial x}, \frac{\partial I(p_i)}{\partial y}\right)$. We know how to
compute the gradient; we have already seen that so far.

(Refer Slide Time: 22:35)

Now let us write out the autocorrelation. Remember, the definition of autocorrelation that we
wrote on the previous slide is this one, where we said the autocorrelation is
$\sum_i w(p_i)\,[I(p_i + \Delta u) - I(p_i)]^2$. Now we are going to replace $I(p_i + \Delta u)$ with its Taylor
series expansion. So plugging that in here, you are going to have
$\sum_i w(p_i)\,[I(p_i) + \nabla I(p_i)^\top \Delta u - I(p_i)]^2$.

So just to explain the notation here: by $\Delta u$ we mean a vector with components, say, $\Delta u$ and $\Delta v$;
for brevity we write the expression with a single $\Delta u$ inside, but the summation takes care of the
other component as well.

Once you have this, you can see that $I(p_i)$ and $-I(p_i)$ get canceled and you are left with
$\sum_{x,y} w(p_i)\,[\nabla I(p_i)^\top \Delta u]^2$. And we are going to write that as a quantity $\Delta u^\top A\,\Delta u$.

So remember that you have this squared term; you just split it into two parts, where one $\Delta u$
comes on one side and one on the other, and the rest of what you have, $w(p_i)\,\nabla I(p_i)\,\nabla I(p_i)^\top$, you
combine into a matrix called A.
combine that into a matrix called A.

How would A look?

(Refer Slide Time: 24:23)

How would A look? A would look something like this. Remember, A is going to be a
combination of w and your image gradients, so A is going to look something like this:
$A = \sum_{u,v} w(u,v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$.
Those are going to be your gradients, and that is going to be your A matrix.

So we took the definition. So we defined autocorrelation in a particular manner where we
said, we will take a patch, move it a bit and see how the image properties change in that local
region. We then took the definition of autocorrelation, played with it a little bit and then
came up with an expansion which looks somewhat like this, and we are now focusing on this
matrix A.

So this matrix A, so δu is simply the change that you imposed in the patch location, that is
what you imposed. A is what is giving you how the gradients changed between those two
patches and a w weighting factor that tells you how much you should weight the central part
of that patch versus the peripheral parts.

So since A is what defines the intensity change, we are going to consider an eigen
decomposition of A, which is given by $U \Lambda U^\top$, where $\Lambda$ is going to be a diagonal matrix with $\lambda_1$
and $\lambda_2$, the two eigenvalues. Remember A is a two cross two matrix, which means the
maximum number of eigenvalues you can have is two, and $A u_i = \lambda_i u_i$ is your standard eigen
decomposition, which is wonderful.

So once again, we started with autocorrelation, wrote it in a slightly different manner, and
now we have done an eigen decomposition of A. Where do we go from here? How do we go
from here to finding a corner?

(Refer Slide Time: 26:29)

That is the question we are asking. Think about it for a moment. How do you think you can
go from the eigenvalues of A to the position of a corner?

Very similar to the discussion that we had in the early part of the lecture: when both $\lambda_1$ and $\lambda_2$
are large, you know that the intensity changes in both directions at that point. When either
$\lambda_2$ is much greater than $\lambda_1$ or $\lambda_1$ is much greater than $\lambda_2$, it is going to be an edge, because there
is going to be change only along one direction. And if both $\lambda_1$ and $\lambda_2$ are small, you are going
to say it is a flat or textureless region. Which means, from the eigen
decomposition of A, all that we are looking for is both the eigenvalues to be high, and then we
know we have probably hit a corner.

(Refer Slide Time: 27:35)

So let us see how you actually do this inference. Another way of looking at it is:
wherever there is a vertical edge or a horizontal edge, you are going to have either $\lambda_1$ greater
than $\lambda_2$ or $\lambda_2$ greater than $\lambda_1$. At a corner, we are going to have both $\lambda_1$ and $\lambda_2$ large, and
in a flat region, you are going to have both $\lambda_1$ and $\lambda_2$ very small.

(Refer Slide Time: 28:02)

So what do we do with this? So the way we are going to compute your corner, this was a
method given by a person called Harris and that is why it is known as the Harris Corner
Detector, it is a very popular detector. It was used for several years. Of course, there have
been lots of improvements and better methods that people have developed but this was one of
the earliest corner detectors that was developed and was used for many years. So the entire
procedure follows something like this.

You compute gradients at each point in the image. Using that, you compute your A matrix.
You can use a weighting function, or if you do not use a weighting function, you just assume
that all of the positions in that patch are weighted equally. Then we ideally want to compute
the eigenvalues and decide the cornerness based on the eigenvalues, but because eigen
decomposition by itself can be a costly process, we use a slight deviation to compute our
cornerness measure.

So we are going to define a cornerness measure as $\lambda_1 \lambda_2 - \kappa(\lambda_1 + \lambda_2)^2$. I will let you work this
out to show that when this entire quantity is high, you will know that both $\lambda_1$ and $\lambda_2$
are high. Work this out for yourself; try out different $\lambda_1$s and $\lambda_2$s and you will see what I am
saying. But what is interesting here is that $\lambda_1 \lambda_2$ is nothing but the determinant of A, and $(\lambda_1 + \lambda_2)$ is
nothing but the trace of A, which means we can define our cornerness measure as
$\det(A) - \kappa\,\mathrm{trace}^2(A)$. $\kappa$ is just a constant that you have to set to get what you want.
For different images, you may have to set kappa differently.

Why is this useful? You no longer need to compute the eigen decomposition of the A matrix; you
only need to compute the determinant and trace, which is a bit easier than computing an
eigen decomposition.

Finally, you then take all points in the image whose cornerness measure is greater than the
threshold. So you can take lots of different points, probably all points, compute their
cornerness using autocorrelation, and then whichever is greater than a particular threshold
you actually call them corners. You can finally also perform non-maximum suppression,
where if you find that there are many corners in a very small local neighborhood, you pick
the one with the highest cornerness measure. That is your non-maximum suppression that we
also talked about in the canny edge detector. So this non-maximum suppression will keep
coming back to us at various stages in this course in various use cases.
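
Putting the whole procedure together, here is a minimal sketch of a Harris-style corner detector
for a grayscale image; the Gaussian smoothing of the gradient products plays the role of the
weighting function w, and the parameter values are only illustrative:

import numpy as np
from scipy import ndimage

def harris_corners(image, sigma=1.5, kappa=0.05, rel_threshold=0.01):
    I = image.astype(np.float64)
    Ix = ndimage.sobel(I, axis=1)                  # gradient along x
    Iy = ndimage.sobel(I, axis=0)                  # gradient along y
    # Entries of the weighted autocorrelation matrix A at every pixel
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy * Sxy                    # det(A) = lambda1 * lambda2
    trace = Sxx + Syy                              # trace(A) = lambda1 + lambda2
    R = det - kappa * trace ** 2                   # cornerness measure
    # Simple non-maximum suppression: keep only local maxima of R above a threshold
    local_max = (R == ndimage.maximum_filter(R, size=5))
    corners = np.argwhere(local_max & (R > rel_threshold * R.max()))
    return corners, R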

(Refer Slide Time: 31:05)

So, here is a visual illustration of the Harris corner detector. So, let us consider 2 images of
the same object taken from different angles. The objects are at different poses and the
illumination is also different. Let us try to run the Harris corner detector. Ideally what we
want is that in both these images the same parts or the same corners in the doll get picked.

Why is that relevant? Once again if I take the example of say stitching different images in
your cell phone a panoramic mode or something like that, we want the same points in both
those objects to be located, so that they can be matched and probably stitched together to get
a panoramic image. Just as an example of an application.

(Refer Slide Time: 31:53)

Let us see it now. So here is the step of computing the cornerness where you do auto
correlation, get your cornerness values.

(Refer Slide Time: 32:02)

And then you take all the high responses while using the threshold.

(Refer Slide Time: 32:06)

You do non-maximum suppression; that is why, where you see many of those points in a region,
you just take the one with the highest cornerness value. You do non-maximum suppression and
you get a bunch of different points.

(Refer Slide Time: 32:21)

And now let us try to visualize them on the image. You can actually see that, if you look at
say these highlighted examples, although those regions are fairly differently tilted,
you get similar corners in both of them. You can see many other regions too.
Obviously, you get a few more corners which are not there in the second image, but those can
be overcome at the matching phase, as you will see a bit later in this course. The focus of this
particular lecture is on cornerness, just detecting the corners. How you do the matching
between the two images, we will come back to that a bit later.

(Refer Slide Time: 33:03)

While we saw one variant of the Harris corner detector here, which was developed by Harris
and Stephens in 1988, where we used the determinant minus $\kappa$ (or $\alpha$) times the squared trace,
there have been other improvisations of the same method. A researcher,
Triggs, suggested that you can use $\lambda_0 - \alpha \lambda_1$, where $\lambda_0$ is the first (larger) eigenvalue and
$\lambda_1$ is the second eigenvalue. A researcher, Brown, and his team proposed $\frac{\det(A)}{\mathrm{trace}(A)}$
instead of $\det(A) - \kappa\,\mathrm{trace}^2(A)$.

We talked about $\kappa$ two slides ago; it is just a constant. So you have $\frac{\det(A)}{\mathrm{trace}(A)}$; all
of these are different ways of playing around with the same quantities to get what you
want. All of them effectively try to measure the cornerness using the eigenvalues of your A
matrix, which comes from autocorrelation.
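
For completeness, here is a tiny sketch comparing these measures at a single pixel, given the
entries of its autocorrelation matrix; the Triggs line follows the convention stated in the
lecture ($\lambda_0$ the larger eigenvalue), and alpha, kappa are just illustrative constants:

import numpy as np

def cornerness_variants(sxx, sxy, syy, kappa=0.05, alpha=0.05, eps=1e-12):
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    lam = np.linalg.eigvalsh(np.array([[sxx, sxy], [sxy, syy]]))   # ascending: lam[0] <= lam[1]
    return {
        "harris_stephens": det - kappa * trace ** 2,
        "triggs": lam[1] - alpha * lam[0],     # lambda0 - alpha*lambda1, lambda0 taken as the larger eigenvalue
        "brown": det / (trace + eps),          # det(A) / trace(A)
    }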

(Refer Slide Time: 34:14)

Let us ask a few questions about properties of the Harris corner detector before concluding
this lecture. The first question we are going to ask is, Is the Harris corner detector scale
invariant? What do we mean by scale? Remember, scale is complementary to resolution. So
if you have an object which is very small in an image, we call that smaller scale, or if it is
large, we call that larger scale. So, you could have an artifact such as this curved line here in
a particular image or you could also have this curved line on a larger canvas, maybe you just
took one image from close up and another image further out. But it is the same image, it is
the same object that is being taken.

So, in this particular case if you observe, in the second image this point would be considered
a corner by the Harris corner detector. In the first image if you took the same size patch to do
the autocorrelation, you would find that all of these would get categorized as edges and none
of them would get categorized as corners, which means the Harris corner detector need not
necessarily be scale invariant. How do you make it scale invariant? We will talk about that later.

(Refer Slide Time: 35:36)

What about rotation invariance? Is the Harris corner detector rotation invariant? It happens that
it is rotation invariant. So, whether you have this particular inverted V like this, or
whether you rotate it and have the inverted V like this in an image, this particular corner will
have high change in both directions in both cases, and you would detect this corner both in
this image as well as that image. As long as there is no change in scale, you would detect the
same corner in both images, and the Harris corner detector in this particular case would be
rotation invariant.

(Refer Slide Time: 36:19)

What about other kinds of invariances? What about a photometric change? What does a
photometric change mean? A photometric change is an affine intensity change. An affine
intensity change means that you take the intensity at every pixel of the image, scale the
intensity by a value a, and maybe translate it by a value b, that is, $I'(x,y) = a\,I(x,y) + b$.
Remember that intensity values lie between 0 and 255, or if you normalize them, between 0 and 1.
You just multiply all of those values by a scalar, say a, which could be less than 1 if you like,
and also add a quantity b if you like. We call that a photometric change or an affine intensity change.

So, in this particular case if you notice, for the translation alone it really does not matter: you
would find the corner whether or not you increase the intensity everywhere by some amount; it
would still find the corner. But for the scaling part of the intensity change, multiplying
by a, it depends on the threshold that you choose. If you had a particular
curve, any function for that matter, and you had a certain threshold, consider all the points that
exceed that threshold; remember that the points close to a peak would get suppressed due to
non-maximum suppression, so you will only keep the peaks among all of those points.

So, those peaks would get detected. But if you scale the curve, there could be newer points that
cross the threshold and get added. So, which means it is perhaps invariant to translation but not
necessarily invariant to scaling the intensity. A couple of slides back we spoke about invariance
to scale with respect to the size of the artefact itself; now we are talking about scaling with
respect to the intensity values. Hope you see that these are two different things. In this case,
you could still have a curve which is small, but now you brighten it or darken it in a different
way, and that is what we mean by a photometric change in this particular context.

(Refer Slide Time: 38:32)

So, the homework that you are going to have is to go ahead and continue to read chapter
2 of Szeliski's book, and since a lot of the concepts that we used in this lecture were based
on eigenvalues and eigenvectors, go ahead and brush up your linear algebra: try to show that
the trace of a matrix is the sum of its eigenvalues and the determinant of a matrix is the product
of its eigenvalues.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 10
Scale Space, Image Pyramids and Filter Banks

Moving on from the last lecture, we get into Scale Space, Image Pyramids and Filter Banks
in this one. If you recall, one of the limitations of the Harris Corner Detector that we stated
the last time was that it is not scale invariant. That is, what would have been a corner in an
image could have been an edge in another image which is zoomed in. Let us try to quickly
recall this before we move forward.

(Refer Slide Time: 00:49)

Once again, an application where you would want to detect key points or corners is when you
have two different images, or even more images. Let us say you want to stitch
these images together; this is typically called image mosaicking or panorama building.
So, let us say we have these two images here, which correspond to the same scene from two
different locations. We really do not know the camera movement between these two images,
but we want to stitch these two images. How do we go about it?

(Refer Slide Time: 1:28)

We typically detect key points in each of these images independently. How do we do that?
An example would be the Harris Corner Detector. We run the Harris Corner Detector on
image one, we run the Harris Corner Detector on image two, and then we match which key
point or set of key points in image one matches which set of key points in image two. How do
we do the matching? We will talk about it a little later in this course, but our focus now is on
finding those key points.

(Refer Slide Time: 2:04)

And one method that we spoke for finding those key points is the Harris Corner Detector. We
said that in the Harris Corner Detector, we build something called the autocorrelation matrix
and then take the eigen decomposition of the autocorrelation matrix. And then we said that when
$\lambda_1$ and $\lambda_2$, which are the two eigenvalues of your autocorrelation matrix A, are small,
then it means that the region is flat, there is no change.

When one of the eigenvalues is much greater than the other, we say it is an edge, and when
both the eigenvalues are large, it means that that particular patch has a lot of changes in
multiple directions and we call such a point a corner. And that is our methodology for
coming up with the Harris Corner Detector.

(Refer Slide Time: 2:59)

But some observations that we made towards the end is that the Harris Corner Detector is
rotation invariant.

(Refer Slide Time: 3:09)

But the Harris Corner Detector is not necessarily scale invariant. Which means, what is a
corner in one image need not be a corner in another image which is zoomed in where it could
seem just like an edge.

(Refer Slide Time: 3:28)

So, we are ideally looking for a setting where we can analyze both these image artifacts in
different scales and be able to match them at the right scale and that is the way we would
make the Harris Corner Detector scale invariant.

(Refer Slide Time: 3:45)

Before we go there, let us try to ask how we can independently select interest points in each
image, such that the detections are repeatable across different scales. So, which means, there
are two images at different scales. When we say scales, they are zoomed in differently, one of
them is zoomed in a lot, one of them is say zoomed out. We ideally want to be able to detect
a key point in both these images. Remember that if you take the patch size to be the same in
both these cases, in one of those images which is zoomed in, a key point may just look like an
edge when you zoom in a lot. How do we counter this?

A simple approach could be that we extract features at a variety of scales by using say,
multiple resolutions in a pyramid and then we match features at the same level. That could be
one of the simplest things that we can do.

(Refer Slide Time: 4:44)

Do you think this will actually work? If you think carefully, you would find that as
long as you match features at the same level, the properties of the Harris Corner Detector will
only be compared at the same scale. But to be scale invariant, we ideally need to compare the
Harris cornerness measure (recall the Harris cornerness measure) at one scale in one
image with the Harris cornerness measure at a possibly different scale in the next image. So, how
do we do that?

(Refer Slide Time: 5:24)

So, what we try to do now is to extract features that are stable in both location and scale and
we are going to try to describe how we are going to do that over the next few minutes.

(Refer Slide Time: 5:39)

So, if you have two images, notice again that we have two images now which definitely
differ in scale: in one of them, the artifact of interest inside (the paintings) is quite small, and
in the right image we are zooming into that artifact. We now
ideally want to find a corner, which is indicated by the yellow cross there. We want to find
the same corner in both these images irrespective of the scale differences. How do we go
ahead and do that? So, we want to find a function f which gives a maximum at both x
and sigma, where sigma denotes the scale of the image in this context.

(Refer Slide Time: 6:28)

And the way we are going to go about doing it is, we compute your scale signature, in this
case, it could be the Harris Cornerness measure. At that particular point for a particular scale
and let us say that particular point has a particular Harris Cornerness measure, which is
plotted on a graph.

(Refer Slide Time: 6:50)

We then change the scale. In our case, a simple way to change the scale is simply to take a
larger patch for your autocorrelation window. So, if you take a larger patch for your
autocorrelation window and now take your Harris Cornerness measure, you are going to get a
slightly different value for the Harris Cornerness measure. So, remember in the x axis, we are
measuring scale, so we have changed the scale, which is the size of your autocorrelation
window and we now get a different Cornerness measure.

(Refer Slide Time: 7:23)

And you do this for different scales. So, which means you again take a different patch size,
compute the Cornerness measure for that patch size. Remember again, that from a definition
of our Harris Cornerness measure, the autocorrelation matrix would change if the size of your
patch changes. Remember again, that we did a summation with all the pixels and that will
change now when the size of your patch changes.

(Refer Slide Time: 7:50)

So, we do this for more scales, more scales, and you see that you will get such a graph when
you do this for multiple scales. So, the takeaway from this graph is that we seem to be getting
the maximum Cornerness measure or the maximum Cornerness response at a particular scale,
which is going to be important to us. At that particular scale is when the Cornerness measure
is the highest for that particular key point in that given image. So, how do we take this
forward?

(Refer Slide Time: 8:27)

We now take another image. So, in this case, as I said, that is the maximum, that is what we
are showing in this particular slide.

(Refer Slide Time: 8:34)

Now, we take another image, which is the one on the right, a zoomed-in version of the first image
with perhaps a slight rotation. And now, again, at the first scale, compute
the Cornerness measure. At the second scale, compute the Cornerness measure, and we repeat
this process for all the different scales that we considered for the first image.

(Refer Slide Time: 9:02)

And as we keep doing this, we are going to get another graph for the second image where the
peak now is at a different scale.

(Refer Slide Time: 9:15)

The peak now is at a different scale which is denoted by that particular value. So which
means now we have made some progress. We have been able to find what would be the
Cornerness measure, the maximum Cornerness measure for a particular point, both in
location and scale. So, location would be the coordinate of that center of that patch and scale
would be the scale at which we got the maximum cornerness measure. One question that we
ask ourselves now is, is there a better way to implement this?
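
Before we get to the better implementation, here is a minimal sketch of the naive search just
described: compute a cornerness signature for one point over several window sizes and pick the
scale where the response peaks. harris_response is a hypothetical placeholder for the Harris
cornerness computed over a window of the given size around the point.

import numpy as np

def best_scale(image, point, harris_response, windows=(5, 9, 13, 17, 21, 25)):
    signature = np.array([harris_response(image, point, w) for w in windows])
    return windows[int(np.argmax(signature))], signature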

(Refer Slide Time: 9:51)

The answer to that is to use what are known as image pyramids. Instead of changing your
patch size in each of your images, you fix your patch size across any images you may
encounter, but change your image size by doing a Gaussian pyramid, recall our discussion of
Gaussian pyramids when we spoke about interpolation and frequencies. Remember a
Gaussian pyramid is constructed by taking an image, Gaussian smoothing the image, then sub
sampling the image and repeating this process again and again.

So, we do the same thing now and keep the patch size the same and construct a Gaussian
pyramid. Keep in mind, that when you construct your Gaussian pyramid, it need not always
be reduced by half each time; you can also reduce your sizes by, say, three-fourths or by any
other fraction by using interpolation methods.

(Refer Slide Time: 10:54)

When we consider an image pyramid, there are several kinds of pyramids that you can
construct and use in practice. So, the Gaussian pyramid is what we have seen before, which is
what the top part of this diagram shows, which is about taking the original image, let us call it
G1. Then you smooth the image and then down sample it and then get a G2. Then you again
smooth G2, down sample it, get a G3 and you keep repeating this process.

There is also another way of getting another kind of pyramid, called the Laplacian pyramid,
which is obtained as follows: you take G1, which is your original image. Once you get G2,
which is your smoothened and down-sampled image, you again up-sample G2 and again
smoothen it. Now, when you compute G1 minus this up-sampled G2, that gives you a quantity
called L1. The reason why we call this a Laplacian pyramid is because the Laplacian can be
written as a difference of two Gaussians. Why so? Let us try to see it a little illustratively at this time.

So, recall the Laplacian filter that we discussed in the last lecture. A Laplacian filter could be
drawn as something like this; this was one way of drawing it. Remember, we could also have
drawn it the other way, where you have it as something like this. So, both are Laplacian
filters, depending on whether your central value is negative or positive. If you see it purely
from a graph perspective, this is a 1D Laplacian.

Such a Laplacian can be written as the difference of two Gaussians: one that is, say, wide, let us
call that Gaussian G1, and another Gaussian which is narrow, let us call it
Gaussian G2. When you subtract G2 from G1, you will actually get a shape which is similar
to the Laplacian.

Clearly, you will have to choose the variances for G1 and G2 appropriately to get the kind of
Laplacian that you are looking for. And because in this particular example G1 minus the
smoothed, up-sampled version of G2 turns out to be a difference of Gaussians, it effectively
turns out to be some kind of a Laplacian, which is why we call it a Laplacian pyramid.

And you repeat the Laplacian for every successive lower resolution representation in your
Gaussian pyramid and get multiple L2s and L3s so on and so forth, to also get a Laplacian
pyramid. For different applications, you may want to use a Gaussian pyramid or a Laplacian
pyramid.
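
Here is a minimal sketch of building both pyramids, assuming OpenCV (cv2) is available; each
Laplacian level is a Gaussian level minus its smoothed, down- and up-sampled version, as
described above:

import cv2
import numpy as np

def build_pyramids(image, levels=4):
    gaussian = [image.astype(np.float32)]
    for _ in range(levels - 1):
        gaussian.append(cv2.pyrDown(gaussian[-1]))         # smooth + subsample
    laplacian = []
    for i in range(levels - 1):
        h, w = gaussian[i].shape[:2]
        up = cv2.resize(gaussian[i + 1], (w, h), interpolation=cv2.INTER_LINEAR)
        laplacian.append(gaussian[i] - up)                 # difference-of-Gaussians level
    laplacian.append(gaussian[-1])                         # keep the coarsest Gaussian level
    return gaussian, laplacian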

(Refer Slide Time: 14:02)

But where do you use image pyramids in practice, in multiple applications? You could use it
for compression because you may want to just transmit a low-resolution version of the
original image and send some other information through other means and be able to
reconstruct a high resolution from the low-resolution image.

(Refer Slide Time: 14:21)

You could use an image pyramid for object detection. How and why? You could use it by
doing a coarse-to-fine scale search over features. What we mean here is, you could look for
an object first in a low-resolution level of the pyramid and, once you find the region of the
image where you get the object, you then go to the next higher resolution, search in that
region a bit more carefully, find where the object is, and then you can keep repeating this at
higher resolutions.

(Refer Slide Time: 14:55)

You can also use an image pyramid for stable interest points, which is what we have been
discussing so far.

(Refer Slide Time: 15:02)

Another application of image pyramids could be registration. Registration is the process of
aligning key points from two different images. How do you use image pyramids in
registration? You can do what is known as coarse-to-fine image registration, where you start
by constructing a pyramid for each image that you have.

So, you have a coarse level, a medium level and a fine level. You first compute this
Gaussian pyramid and then align features at the coarse pyramid level, just at this level to start
with. Once you do that, you then continue successively aligning at the finer pyramid levels by only
searching smaller ranges for the finer match.

(Refer Slide Time: 15:59)

Moving on from the image pyramid, we will go to the third topic that we are covering in this
lecture, which is the notion of textures, which is closely connected to and built upon the other
concepts that we are covering. What are textures, to start with? Textures are regular or
stochastic patterns that are caused by bumps, grooves and/or markings, what we literally
call textures.

(Refer Slide Time: 16:32)

So, these textures give us some information about the spatial arrangement of colors or
intensities in an image. On the right side, you will see that textures can give you an idea of
materials, textures can give you an idea of the orientation. Textures can also give you an idea
of the scale that you are dealing with. So, textures contain significant information to be able
to make higher level decisions or predictions from images.

(Refer Slide Time: 17:02)

It is also important to keep in mind what can happen even with a single image. Let us say you
obtained a high-level statistic, such as the histogram of an image, which contains 50 percent white
pixels and 50 percent black pixels. In this scenario, we could have images of multiple kinds;
three samples of what you see on the slide. You could have the image to be something like
this. You could have the image to be something like this, or you could have the image to be
something like this.

In all these three cases, the histogram contains 50 percent white and 50 percent black, but the
textures are vastly different. So, it is not only important to get global statistics, it is also
important to get local texture information to be able to understand what is in images.

(Refer Slide Time: 17:59)

So, how do we actually represent textures? Let me let you think for a moment. So far, we
have seen edges, we have seen corners, we have seen corners at different scales. How do we
represent textures? The answer is you put together whatever you have seen so far. And how
do we put them together?

(Refer Slide Time: 18:26)

We compute responses of blobs and edges at different orientations and scales and that is one
way of getting textures. So, the way we process an image is we record simple statistics, such
as mean and standard deviation of absolute filter responses of an image. And then we could
take the vectors of filtered responses at each pixel and cluster them to be able to represent
your textures. There are multiple ways of doing this, but that could be the general process of
capturing the textures in images. We will see a couple of examples of how texture can be
captured in an image.

(Refer Slide Time: 19:11)

A simple way to do this is by what is known as filter banks. Filter banks are as the word says,
a bank of filters. We are not going to use just a Sobel filter or a Harris Corner Detector or
Laplacian to compute blobs, we are going to use a set of different filters, a bank of different
filters. And what do each of these filters do?

Each of these filters can be viewed as what are known as band-pass filters. This goes back to
our discussion on extracting low-frequency components and high-frequency components in
images. Band-pass filters are filters that allow a certain band of frequencies to pass through
and get as output when you convolve a filter with the image.

So, remember we have seen examples of filters that extract high-frequency components, for
edge detection. We can also do the opposite to get low-frequency components by doing Gaussian
smoothing. At this point, with band-pass filters, we are saying that we want only a certain set
of frequencies to pass through, and we are going to use a bank of such filters to be able to
separate the input signal into multiple components, each one carrying a certain sub-band of
your original image signal, and that can be used to represent the texture in your image.

(Refer Slide Time: 20:40)

Here is a visual illustration. So, you process an image with different filters. So, you see here
eight different filters that you can come up with. This is your input image. So, you convolve
each filter with the image, and these are the responses that you get when you
convolve each of those filters with the input image. As you can see, each of these outputs
capture different aspects of the texture or the content in that butterfly, and they all put
together give you a sense of what is the texture in the image.

(Refer Slide Time: 21:23)

We will talk about a more concrete example, known as Gabor filters. Gabor filters
are a very popular set of band-pass filters. At a certain level, they are known to
mimic how the human visual system works. They allow a certain band of frequencies and
reject the others.

(Refer Slide Time: 21:46)

The way Gabor filters work, intuitively, is that they can be seen as a combination of a Gaussian
filter and a sinusoidal filter. So, here is an example of a sinusoidal filter at a certain
orientation. Here is an example of a Gaussian filter. If you multiply the Gaussian filter with the
sinusoidal filter, you would get something like this. Imagine superimposing your sinusoid on
your Gaussian; you would get something like this.

(Refer Slide Time: 22:20)

Mathematically speaking, a 2D Gabor filter is a function of x and y with parameters λ, θ, ψ, σ and γ;
we will talk about each of them in a moment. It is given by

$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\,\exp\!\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right)$
)

We will talk about each of those quantities, we are not going to derive this in this particular
course, that may be outside the scope.

But in this particular formula, x′ = xcos(θ) + y sin(θ) , we will talk about what theta is. θ is the
orientation of the normal to the parallel stripes of the Gabor. We saw that the sinusoid could
be oriented in a particular direction and that is given by theta. So, x′ = xcos(θ) + y sin(θ) ,
y ′ =− xsin(θ) + y cos(θ)

(Refer Slide Time: 23:24)

λ is the wavelength of your sinusoidal component; remember your sinusoid has a
wavelength and a frequency. So, λ is your wavelength. ψ is the phase offset of your sinusoidal
function; once again, recall our discussion on image frequencies earlier. σ is the standard
deviation of your Gaussian envelope.

And γ is a spatial aspect ratio and specifies the ellipticity of the support of your Gabor
function. So, if you want to elongate it, all of this can be controlled in this particular
context. Instead of having a circular Gaussian, you can use the gamma parameter to be able to
control the elliptical nature of your Gabor response function.
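
To make the formula concrete, here is a minimal sketch that generates the real part of a Gabor
kernel from these parameters; the specific parameter values are only illustrative:

import numpy as np

def gabor_kernel(size=31, lam=10.0, theta=0.0, psi=0.0, sigma=5.0, gamma=0.5):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    x_p = x * np.cos(theta) + y * np.sin(theta)            # rotated coordinate x'
    y_p = -x * np.sin(theta) + y * np.cos(theta)           # rotated coordinate y'
    envelope = np.exp(-(x_p ** 2 + gamma ** 2 * y_p ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_p / lam + psi)          # real part of the complex sinusoid
    return envelope * carrier

# A bank of 16 orientations in steps of 11.25 degrees, like the one discussed next:
bank = [gabor_kernel(theta=np.deg2rad(11.25 * k)) for k in range(16)]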

(Refer Slide Time: 24:15)

So, this is a 2D Gabor filter. As you can see, it gives you an idea of certain textures. Here
is a filter bank of Gabor filters. This has 16 Gabor filters at orientation increments of 11.25 degrees,
which means, if your first filter has an orientation of 0, your next filter will be at 11.25, the next
filter at 22.5, and so on and so forth. You can see the Gabor filter being rotated, and you now
have an entire bank of Gabor filters.

(Refer Slide Time: 24:48)

You can now take an image and convolve each of these filters with the image, and you will
get 16 different responses of the image to these 16 different filters. As you can see here, each
of these responses captures a certain aspect of your original image. In the case of a circle, they
simply seem to highlight a different perspective of the circle, but when you have more
complex textures, each of these responses captures a certain dimension of that texture.

(Refer Slide Time: 25:24)

And putting these together gives us an overall response of the image to different set of
orientations and frequencies. There has also been another popular set of filter banks called
Steerable Filter Banks. Steerable filters are a class of oriented filters that can be expressed as
a linear combination of a set of basis filters.

For example, if you have an isotropic Gaussian filter, $G(x, y) = e^{-(x^2 + y^2)}$, you can define a steerable
filter as $G_1^{\theta} = G_1^{0^\circ}\cos(\theta) + G_1^{90^\circ}\sin(\theta)$, where $G_1^{\theta}$ is the first derivative of
G at a certain angle θ. For example, if you have an original image, you can now consider $G_1$
along the y axis to be the derivative at a particular angle.

You can consider G1 of 15 degrees to be the derivative at a different angle and so on and so
forth. So, now you can construct combinations of these two, of these different images to
construct an overall response that you have. So, each of them is a Steerable filter where you
can control the angle at which you are getting your response.

So, this is another example: Gabor filter banks were one example that could be used to extract
textures from images, and steerable filter banks are another example that could be used to extract
textures from images.
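
A minimal sketch of steering the first derivative of a Gaussian, assuming SciPy is available:
only the two basis responses are ever computed, and the response at any angle is their linear
combination.

import numpy as np
from scipy import ndimage

def steered_gradient(image, theta, sigma=2.0):
    I = image.astype(np.float64)
    g0 = ndimage.gaussian_filter(I, sigma, order=(0, 1))    # derivative along x (0 degrees)
    g90 = ndimage.gaussian_filter(I, sigma, order=(1, 0))   # derivative along y (90 degrees)
    return g0 * np.cos(theta) + g90 * np.sin(theta)         # G1 response at angle theta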

(Refer Slide Time: 27:07)

Here is an example, another illustration of Steerable filter banks, where you can take a
band-pass filter, B0. As you can see this band-pass filter allows a certain set of frequencies to
pass through. Another band-pass filter B1, B2, so on and so forth. You can have a low-pass
filter, so on and so forth.

Now, you can combine the responses of an image to all these kinds of filters and store some
statistics at each pixel. Remember, you are going to get a value at each pixel for each filter; you
can store the mean and standard deviation, you can cluster, you can do various things with those
values that you get at each pixel across the filter bank responses, and be able to
get a representation for your texture.
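
As a small illustration of this last step, here is a minimal sketch that turns a bank of filters
(for example, the Gabor bank sketched earlier) into a simple texture descriptor by recording the
mean and standard deviation of the absolute responses:

import numpy as np
from scipy import ndimage

def texture_descriptor(image, filter_bank):
    stats = []
    for kernel in filter_bank:
        response = np.abs(ndimage.convolve(image.astype(np.float64), kernel, mode='nearest'))
        stats.extend([response.mean(), response.std()])
    return np.array(stats)   # two numbers per filter; per-pixel response vectors could instead be clustered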

(Refer Slide Time: 27:53)

That concludes this lecture. Please do continue to read Chapter 2 in Szeliski's book. Some
interesting questions for you to take away now, which we may not have really answered, but
which are something for you to think about: From the discussions we have had in this lecture, why is
a camouflage attire effective? Think about it. Obviously, it connects to our lecture, so think
carefully about what we discussed and how you can extend it to understanding how a
camouflage attire works.

Another question to ask here is, how is texture different from, say, salt and pepper noise? Salt
and pepper noise could also look like a texture. So, how is a texture different from salt
and pepper noise? Something for you to think about and read to understand. And a last
question: will scale-invariant filters be effective in matching pictures containing
Matryoshka dolls (I think we also have equivalents in India, nesting dolls)? Will scale-invariant
filters be able to match pictures across these dolls? Think about these questions as
your exercise for this lecture.

Deep Learning for Computer Vision
Prof. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 11
Feature Detectors: SIFT and Variants
Last lecture, we saw filter banks, image pyramids and also a scale-invariant Harris corner
detector. We will now move on to one of the most significant feature detectors that has been
developed in computer vision, known as the Scale Invariant Feature Transform or SIFT.

(Refer Slide Time: 0:45)

SIFT was developed by David Lowe from the University of British Columbia. It was
developed way back in the late nineties, but by the time it was formally published in its final form, it was close to
2004. To this date, it has over, I think, 56,000 or 57,000 citations. That speaks of the impact
it has had on the community over the last decade and a half.

The main objective of SIFT is to transform image data into scale-invariant key point coordinates.
So once again, very similar to the Harris corner detector, our goal is to extract key points from
images, but we are going to go one step further now. In addition to detecting those key points in
the images, we also want to find an effective way of describing them. A simple way
of describing key points that we have seen so far is (x, y), the coordinate location
of that key point, to which you could perhaps add a scale such as the sigma that we saw in the previous
lecture.

But now with SIFT, we are going to talk about a full-fledged feature descriptor that describes
the local characteristics around that particular key point. SIFT is fundamental to many core
vision problems and applications, including recognition, motion tracking, multi-view geometry,
and so on.

For many years, until deep learning became popular, SIFT was extensively used for simple
object recognition in images.

(Refer Slide Time: 02:30)

As I just mentioned, in SIFT, image content is transformed to local feature coordinates that are
ideally invariant to translation, rotation, scale and shear. So for example, consider two images
of the same object taken from different angles, different poses, and perhaps different illuminations.
We ideally want to ensure that the same key points are detected in both images irrespective
of variations in translation, rotation, and scale, as well as shear, an affine distortion that arises
under changes in viewpoint.

(Refer Slide Time: 03:15)

SIFT consists of four steps, and we will go through each of them individually. But before we
cover them, I should mention that SIFT is an excellent engineering effort. Each step has a proper
motivation as to why it was developed and what part of the objective it helps fulfil. I would
highly advise you to read the SIFT paper, just to understand how a beautiful engineering paper in
computer vision is written.

The first step in SIFT is to do what is known as scale space extrema detection. This is very
similar to the scale invariant Harris Corner detector. Although there is a little bit more than that,
which I will clarify in some time.

The second step is key point localization. Once you have found a key point, can we try to
find out exactly where that key point is? It could perhaps be half a pixel further, and the
refinement could be in location as well as in scale.

The third step is to determine the orientation of that key point; there is a reason for doing this,
which we will come to. We are going to use the local image gradients around the key point to
assign the orientation, and we ideally want to preserve the orientation, scale and location for
each feature.

Finally, as I mentioned, we are going to talk about how to describe those key points in terms
of local information around each key point. Once again, we are going to use some information
from the gradients, but we want to develop a representation for that key point which is invariant
to a lot of transformations. Why do we need a representation? Remember, we said that if you
want to match two images and stitch them together, we need to compare key points in the first image
to key points in the second image; having a representation allows you to compare. Otherwise you
would not be able to compare just the coordinate locations, because they might be very different in
the two images.

(Refer Slide Time: 05:33)

So the first step, as I just said, is scale space extrema detection. The way SIFT implements
this is to first construct what is known as a scale space. A scale space, as shown in the diagram
here, is built by simply taking an image and convolving it with a Gaussian of standard deviation
sigma; that is one image. Then you convolve it with a Gaussian with K times sigma; that is another
image, but the size of the image stays the same, it is only more blurry. Then you have a
convolution with a Gaussian with K squared sigma. Once again, the size of the image remains the
same; we are not sub-sampling at this point, simply making it more and more blurred depending on
the values you choose.

All these images form what is known as one octave. So you are going to have many such images
in one octave, depending on the value of K and how many images you want. In the next step,
you construct your second octave, where the ideal goal would be to sub-sample the image and
then repeat the same process that you had for your first octave.

Instead of sub-sampling and repeating the same process, you can also simply do a
Gaussian convolution with 2-sigma. Remember, that will make the Gaussian wider, so a
Gaussian convolution with 2-sigma gives you a similar effect to constructing a
second octave of sub-sampled images.

Remember, when you take a Gaussian with 2-sigma, the spread of the Gaussian is going to be
wider. So you are going to blur out more pixels and consider more pixels that are further out,
which is what would have happened if you had sub-sampled and then convolved with a Gaussian
with sigma as the standard deviation.
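
A minimal sketch of this construction is shown below; the number of octaves, the number of blur levels per octave, the base sigma and the factor k are assumed values, and the sub-sampling of the base image for the next octave is kept deliberately simple rather than following the SIFT paper's exact bookkeeping.

import cv2
import numpy as np

def build_gaussian_scale_space(gray, num_octaves=4, levels=5, sigma0=1.6, k=np.sqrt(2)):
    octaves = []
    base = gray.astype(np.float32)
    for _ in range(num_octaves):
        octave = [cv2.GaussianBlur(base, (0, 0), sigmaX=sigma0 * (k ** i))
                  for i in range(levels)]             # sigma, k*sigma, k^2*sigma, ...
        octaves.append(octave)
        base = base[::2, ::2]                          # sub-sample by 2 for the next octave
    return octaves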

(Refer Slide Time: 07:39)

To be able to detect the extrema amongst these Gaussians, we ideally want to use the idea of
Laplacians. If you recall the Laplacian as an edge detector, we said that unlike
an edge detector such as Sobel, where you get high intensities wherever there is an edge, for a
Laplacian you get an edge wherever there is a zero crossing; please go back and refresh the
lecture on the Laplacian.

We said that whenever the Laplacian assumes a zero value and is surrounded by non-zero responses,
you can assume that there is edge information there. So we ideally want to use a Laplacian to
obtain information about corners and edges in the image.

But remember, we spoke in the last lecture about how the Laplacian can be approximated as a difference
of Gaussians. This means that in the first octave, one of those filters would have been G(sigma),
another G(K sigma), another G(K^2 sigma), and so on, and you would have convolved your input
image with each of those filters.

Now, you simply take the difference between successive Gaussian-blurred images in the
octave, and you get a set of difference-of-Gaussian images, which, remember, is very similar
to the Laplacian of those original images. You construct these difference-of-Gaussian images in
each octave separately.

(Refer Slide Time: 09:30)

We are going to write this as D(x, y, sigma) = I-hat(x, y, sigma) - I-hat(x, y, k sigma),
where I-hat(., ., sigma) is the convolution of a Gaussian with the appropriate sigma with the original image.
This is just a mathematical statement of what we described so far.
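
A minimal sketch of this step, assuming the octaves produced by the scale-space sketch above:

def difference_of_gaussians(octave):
    # octave: progressively more blurred images of the same size
    return [octave[i] - octave[i + 1] for i in range(len(octave) - 1)]

# dog_per_octave = [difference_of_gaussians(octv) for octv in octaves]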

(Refer Slide Time: 09:50)

Now, to perform scale space extrema detection, what we are going to do for every pixel, say the
pixel marked with a cross in the image on the right, is take a 3x3 neighborhood around that point
in its own image. Then you take a 3x3 neighborhood at the same location in the image at the next
higher scale and a 3x3 neighborhood at the same location in the next lower scale.

With this, you will have a set of 27 pixels in total: 9 pixels at the higher scale, 9 pixels at the
lower scale and 9 pixels at the current scale, including the pixel at the center. So you have all
of these pixels given to you.

(Refer Slide Time: 10:41)

What you do now is compare a pixel with all of the 26 pixels in and around it: around it in terms
of spatial location, as well as around it in terms of scale, at the next higher scale and the next
lower scale. You then select a pixel as a candidate key point if it is larger or smaller than all
of the 26 pixels around it; that is, you select both the minima and the maxima over those 27
pixels in the spatial and scale neighborhood.

That is why we call this scale space extrema detection: we are trying to detect extrema in both
scale and space. That is your first step.
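
A minimal sketch of this 26-neighbour test on a list of DoG images dog within one octave (indexing and border handling kept deliberately simple):

import numpy as np

def is_scale_space_extremum(dog, s, y, x):
    centre = dog[s][y, x]
    cube = np.stack([dog[s - 1][y - 1:y + 2, x - 1:x + 2],   # next lower scale
                     dog[s][y - 1:y + 2, x - 1:x + 2],       # same scale
                     dog[s + 1][y - 1:y + 2, x - 1:x + 2]])  # next higher scale
    # the centre pixel is an extremum if it is the largest or the smallest of all 27 values
    return centre >= cube.max() or centre <= cube.min()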

(Refer Slide Time: 11:30)

So we detect interesting points, invariant to scale and orientation to a certain extent, using DoG
or difference of Gaussians. Remember again that the difference of Gaussians is an approximation
of the Laplacian of Gaussian. Just one clarification here: last time we said the difference of
Gaussians is an approximation of the Laplacian; to be precise, the difference of Gaussians is an
approximation to the Laplacian of Gaussian.

The next step is key point localization. We ideally want to check whether we have found the exact
point, in space and in scale, where the extremum (the maximum or the minimum) is attained.

(Refer Slide Time: 12:20)

To do this, we are going to look at this difference-of-Gaussian function D as we would any other
function. If our detected extrema were at these blue points, we ideally want to find the red
points where the minimum and the maximum are actually achieved. This means the minimum or
maximum could be achieved in between two coordinate locations, or in between two scales that we
have considered so far. How do we find this?

(Refer Slide Time: 12:55)

The way we find this is by considering the Taylor series expansion. Let s_0 = (x_0, y_0, sigma_0)
and delta s = (delta x, delta y, delta sigma), where s_0 is the coordinate location and scale of
the extremum that we already found in step one, and delta x, delta y and delta sigma are what we
want to find, to determine where the exact extremum is achieved.

So we write out the traditional Taylor series expansion of D, the difference-of-Gaussian output:
D(s_0 + delta s) ≈ D(s_0) + (dD/ds)^T delta s + 1/2 delta s^T (d^2 D/ds^2) delta s, where it is
only approximate because we are not considering the higher-order terms.

(Refer Slide Time: 14:00)

Now, to find out where this exactly attains the extremum, we have to differentiate this Taylor
series expansion with respect to delta s and solve for delta s. The derivative of the first term,
D(s_0), is zero with respect to delta s, and the gradient in the second term is evaluated at s_0,
so it is a constant with respect to delta s. If you differentiate the Taylor series expansion and
set it to zero, you end up with dD/ds + (d^2 D/ds^2) delta s = 0.

When you solve for delta s, the solution is what we denote by s-hat. It comes from simple
differentiation of the Taylor series expansion.

(Refer Slide Time: 15:20)

And how do we solve this? We can compute the second derivative and the first derivative simply
by finite differences. To compute the first derivative in both space and scale, you can simply
take finite differences with the neighboring locations on the left and right, or with the adjacent
scales, and so on; simple finite differences, exactly the way we talked about computing gradients
for edge detection. We can then solve for delta s, which gives us the delta x, delta y and delta
sigma at which the actual extremum is achieved.
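
A minimal sketch of this localization step: finite differences give the gradient and Hessian of D in (x, y, scale), and solving the resulting 3x3 linear system gives the offset (delta x, delta y, delta sigma).

import numpy as np

def localize_offset(dog, s, y, x):
    d = dog  # list of DoG images in one octave
    # first derivatives (central finite differences)
    dx = (d[s][y, x + 1] - d[s][y, x - 1]) / 2.0
    dy = (d[s][y + 1, x] - d[s][y - 1, x]) / 2.0
    ds = (d[s + 1][y, x] - d[s - 1][y, x]) / 2.0
    # second derivatives
    dxx = d[s][y, x + 1] - 2 * d[s][y, x] + d[s][y, x - 1]
    dyy = d[s][y + 1, x] - 2 * d[s][y, x] + d[s][y - 1, x]
    dss = d[s + 1][y, x] - 2 * d[s][y, x] + d[s - 1][y, x]
    dxy = (d[s][y + 1, x + 1] - d[s][y + 1, x - 1] - d[s][y - 1, x + 1] + d[s][y - 1, x - 1]) / 4.0
    dxs = (d[s + 1][y, x + 1] - d[s + 1][y, x - 1] - d[s - 1][y, x + 1] + d[s - 1][y, x - 1]) / 4.0
    dys = (d[s + 1][y + 1, x] - d[s + 1][y - 1, x] - d[s - 1][y + 1, x] + d[s - 1][y - 1, x]) / 4.0
    grad = np.array([dx, dy, ds])
    hess = np.array([[dxx, dxy, dxs],
                     [dxy, dyy, dys],
                     [dxs, dys, dss]])
    delta = -np.linalg.solve(hess, grad)   # (delta_x, delta_y, delta_sigma)
    return delta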

So now we know exactly where the extremum is achieved, which helps us localize better. Once we
do this, we also do one more step: we want to remove all points which have low contrast, and all
points which are edge points, because edge points are not useful corners, as we already discussed
in the last lecture.

(Refer Slide Time: 16:26)

How do we remove low-contrast points? We remove them by simply checking whether the magnitude
of D at s-hat, the refined location, is smaller than a certain value, 0.03. If that value is
small, the difference-of-Gaussian output there is very small, assuming of course that your image
values have been normalized to a certain range.

In that case we say the value is too small, and we can reject it and not consider that point for
further processing. How do you remove edge points, though?

(Refer Slide Time: 17:00)

That should ring a bell: we use a very similar approach to the Harris corner detector. However,
SIFT takes a slightly different route and uses the Hessian of D, your difference-of-Gaussian
output, instead of using an autocorrelation matrix.

(Refer Slide Time: 17:20)

Others have also extended the Harris corner detector to detect corners based on the Hessian,
and SIFT uses that kind of approach, where the Hessian of D can be looked at in terms of
curvatures. The Hessian, which is the matrix of pairwise second derivatives, effectively captures
the curvatures, giving you an idea of how sharply the function changes in different directions.

So the eigenvalues of the Hessian are also good estimates for understanding corners. Very similar
to the Harris corner detector, SIFT proposes using the Hessian and computing its largest and
smallest eigenvalues, denoted by alpha and beta. You then compute the trace and determinant of the
Hessian matrix, very similar to what we did for the Harris corner detector, and evaluate a ratio
involving the trace and determinant.

(Refer Slide Time: 18:20)

So the trace is going to be alpha + beta, and the determinant is going to be alpha x beta. Your
trace squared by determinant is therefore (alpha + beta)^2 / (alpha x beta), which can be written
as (r beta + beta)^2 / (r beta^2), where r = alpha / beta.

(Refer Slide Time: 19:00)

Simplifying, the trace squared by determinant can be written as (r + 1)^2 / r. The way we remove
edges is by noting that this quantity is minimum when r is equal to 1, because when r is equal to
1, alpha and beta are equal, i.e., both curvatures are equally high, which is what we want for
corner points.

So SIFT proposes that you reject the key point if trace(H)^2 / det(H) is greater than the
threshold (r + 1)^2 / r, because you want r to be close to 1. If the ratio is greater than this
threshold, you simply reject the key point. The original SIFT paper uses r = 10 to set this
threshold.
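
A minimal sketch of this test on a single DoG image (dog_img), using finite differences for the 2x2 spatial Hessian:

def passes_edge_test(dog_img, y, x, r=10.0):
    dxx = dog_img[y, x + 1] - 2 * dog_img[y, x] + dog_img[y, x - 1]
    dyy = dog_img[y + 1, x] - 2 * dog_img[y, x] + dog_img[y - 1, x]
    dxy = (dog_img[y + 1, x + 1] - dog_img[y + 1, x - 1]
           - dog_img[y - 1, x + 1] + dog_img[y - 1, x - 1]) / 4.0
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:               # curvatures of opposite sign: not a usable key point
        return False
    return (trace * trace) / det < ((r + 1) ** 2) / r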

(Refer Slide Time: 19:50)

So at the end of step two, we have determined the exact location and scale of every extremum
point. We have also selected a subset of these key points based on stability, by eliminating
low-contrast key points as well as edge points.

(Refer Slide Time: 20:15)

Now coming to the next step, which is estimating the orientation of these key points. Let us
first ask, why do we really need the orientation? The answer is simple: we want to achieve
rotation invariance by getting a sense of orientation, and that will become clear when we
describe how we are going to do it.

So we use the scale of the point to choose the correct image. Remember that every key point
is denoted as (x, y, sigma), where x and y are the coordinate locations and sigma is the scale at
which it was found to be a key point, very similar to the scale-invariant Harris detector, where
the corner measure is highest at a particular scale.

Here again, a corner point or key point is defined by (x, y, sigma). You use the scale of the
point to choose an appropriate Gaussian, and you convolve that Gaussian with your input image to
get an image called I-hat. Now you compute the gradient magnitude and orientation for the I-hat
image using a simple finite-differences method. So you have m(x, y), the gradient magnitude at
every point, given by the square root of (the x-direction gradient squared + the y-direction
gradient squared). Similarly, the orientation of the gradient at that location, theta(x, y), is
the arctangent of the y gradient divided by the x gradient.

At the end of this step, we have a magnitude and an orientation for every point in your image,
including the key points that we have identified so far. Now, what do we do?

(Refer Slide Time: 22:05)

Now, what we are going to do is take a certain region around the key point. We are not
going to rely on the magnitude and orientation of the key point alone, because we want to
capture the local characteristics, not just the characteristics at the point. You take a region
around the key point, a certain window, say a 5 x 5 window or whatever that might be, and you
consider the magnitudes and orientations of all the points in the neighborhood around the key
point.

You then construct a histogram out of the orientations of all of these points. This histogram in
the SIFT paper has 36 bins, with 10 degrees per bin, so each point votes into the bin
corresponding to its orientation angle. For example, if you had an orientation angle of 185
degrees, you would put it in the 180-to-190-degree bin; every bin spans 10 degrees, and since
there are 360 degrees, you have 36 bins.

(Refer Slide Time: 23:10)

The SIFT paper also recommends a few heuristics to improve performance here. The histogram
entries are weighted by the gradient magnitude: if the gradient magnitude of a point is high,
its contribution to its bin is increased, and if the gradient magnitude is low, its contribution
is reduced. The entries are also weighted by a Gaussian function with sigma equal to 1.5 times
the scale of the key point.

So if you have a key point, you place a Gaussian on top of it with a certain sigma (given as 1.5
times the scale of the key point itself). Then, if a particular point in the neighborhood you
are considering is farther away, it contributes less to the histogram, and if a point is closer
to the key point, it contributes more to the histogram.

These are some heuristics that were used to improve performance. Once you construct this
histogram with these heuristics, the peak, i.e., whichever bin has the highest number of votes,
is taken as the orientation of that particular key point. Keep in mind that although we call it
the orientation of that key point, it is decided by the local characteristics around the key
point rather than by the point alone.

The SIFT work also suggests introducing additional key points at the same location if there is
another peak in the histogram that is within 80% of the value of the original peak. If there are
such peaks, you introduce another key point at the same location with a different orientation.
So it is possible to have two key points at the same location with different orientations, which
gives more robust results.
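
A minimal sketch of this orientation-assignment step; the window radius is an assumed value, and the sketch returns both the 36-bin histogram and the dominant orientation (additional peaks above 80% of the maximum could be handled the same way).

import numpy as np

def orientation_histogram(img, kx, ky, scale, radius=8):
    hist = np.zeros(36)
    sigma = 1.5 * scale
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = ky + dy, kx + dx
            if y <= 0 or x <= 0 or y >= img.shape[0] - 1 or x >= img.shape[1] - 1:
                continue
            gx = img[y, x + 1] - img[y, x - 1]
            gy = img[y + 1, x] - img[y - 1, x]
            mag = np.sqrt(gx * gx + gy * gy)
            theta = np.degrees(np.arctan2(gy, gx)) % 360.0
            weight = np.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))  # Gaussian fall-off
            hist[int(theta // 10) % 36] += weight * mag                  # magnitude-weighted vote
    dominant = np.argmax(hist) * 10 + 5      # centre of the peak bin, in degrees
    return hist, dominant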

(Refer Slide Time: 25:10)

Here are some visual illustrations of how this works. Here is an input image. The first step is
about capturing the extrema of the difference of Gaussians. You can see here that each point
also has an arrow: the arrow denotes the orientation of the key point and the length of the arrow
denotes the magnitude of the gradient at that key point. The bases of those arrows are the actual
key points. This is your first extrema detection step.

(Refer Slide Time: 25:50)

Then, after applying the low-contrast threshold, these 832 DoG extrema come down to about 729;
you eliminate key points whose contrast is below the threshold.

(Refer Slide Time: 26:00)

Then we also apply the ratio test based on the Hessian to eliminate edge-like artifacts, and the
key points come down from 729 to 536.

(Refer Slide Time: 26:18)

So you are pruning out noisy key points that may not really be of value. As we already said,
the arrows that you saw are the orientations obtained in step three. Now we go to the last step,
which is about finding a way to describe each key point. So far, we have found the key point,
the scale at which it exists, and an orientation for it. Now the question is, how do we describe
the key point?

(Refer Slide Time: 26:50)

To do this, we are once again going to use the gradient information in the local neighborhood
around the key point. We take a 16x16 window around the detected key point. Bear with this
image: it shows an 8x8 window just for illustrative purposes, but the actual method suggests
taking a 16x16 window around the key point.

So you take your 8x8 or 16x16 window and divide it into quadrants of 4x4 pixels each. For an
8x8 window you will have 4 such quadrants, and for a 16x16 window you will have 16 such
quadrants, where each quadrant is 4x4.

Inside each of those quadrants, you construct a histogram of the orientations of the points
inside the patch, this time using only 8 bins. So instead of the 36 bins used for orientation
assignment, you now use a coarser binning of 8 bins; this is what SIFT proposes. Once again,
just to remind you, in an 8x8 window you would have 4 such histograms, but in a 16x16 window
you would have 16 such histograms.

Each histogram has 8 orientation bins, and every point in that 4x4 patch contributes to one of
the bins in the histogram.

(Refer Slide Time: 28:40)

Similar to how we found the orientation, here also we use heuristics such as down-weighting the
gradients by a Gaussian fall-off function. Within a particular neighborhood, a point farther out
contributes a little less to the histogram and a point closer in contributes more.

Just to clarify the image on the right: this is just another way of representing a histogram.
Keep in mind that the same diagram could also have been drawn as a standard bar histogram with
8 bins, as I said.

(Refer Slide Time: 29:22)

So the length here denotes the strength of the bin in the histogram. So in each 4x4 quadrant, you
compute your gradient orientation histogram using 8 orientation bins.

(Refer Slide Time: 29:30)

Once you do this, you are going to have 16 histograms in total, each with 8 values. This set of
16x8 = 128 values becomes the raw version of the SIFT descriptor. This is how we are going to
describe that key point: just by the gradient orientations of the neighborhood around it. To
reduce the effects of contrast changes, this 128-dimensional vector is normalized to unit length.

Further, to ensure that the descriptor is robust to other kinds of illumination variations, the
values are also clipped at 0.2, i.e., any value larger than 0.2 is set to 0.2, and the resulting
vector is once again renormalized to unit length. You could consider this as a step that limits
the influence of a few very large gradient magnitudes, which often arise from non-linear
illumination effects. These are heuristics that the paper recommends to get better performance.
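
A minimal sketch of the descriptor computation; for readability it omits the rotation of the patch to the key point's dominant orientation and the Gaussian fall-off weighting, but it shows the 16 quadrants, the 8-bin histograms and the normalize-clip-renormalize steps.

import numpy as np

def raw_sift_descriptor(img, kx, ky):
    desc = []
    for cy in range(-8, 8, 4):            # top-left corner of each 4x4 quadrant
        for cx in range(-8, 8, 4):
            hist = np.zeros(8)
            for dy in range(4):
                for dx in range(4):
                    y, x = ky + cy + dy, kx + cx + dx
                    gx = img[y, x + 1] - img[y, x - 1]
                    gy = img[y + 1, x] - img[y - 1, x]
                    mag = np.sqrt(gx * gx + gy * gy)
                    theta = np.degrees(np.arctan2(gy, gx)) % 360.0
                    hist[int(theta // 45) % 8] += mag
            desc.extend(hist)
    desc = np.array(desc)                                  # 16 quadrants x 8 bins = 128
    desc /= (np.linalg.norm(desc) + 1e-12)                 # normalize to unit length
    desc = np.minimum(desc, 0.2)                           # clip large values at 0.2
    desc /= (np.linalg.norm(desc) + 1e-12)                 # renormalize
    return desc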

(Refer Slide Time: 30:40)

Let us see a few examples. Here you can see a couple of tourist images of a popular monument,
with markers overlaid showing the key points detected by SIFT. If you compare them, even on the
second image, where there is a significant rotation and perhaps almost a morning-to-evening
change in illumination, we detect similar key points in both settings.

So it is an extraordinarily robust feature detector. It handles up to about 60 degrees of change
in viewpoint (out-of-plane rotation), and it can also handle a good amount of change in
illumination. It is also pretty fast and efficient and can run in real time. There is a lot of
code available for people wanting to use SIFT, and SIFT, as I already mentioned, has been used
in many vision applications since it was initially proposed in the early 2000s.

(Refer Slide Time: 31:43)

Here is another challenging example, with images from the Mars Rover. This is a practical
application: when the Mars Rover takes images of different scenes, since there is no human
there, it would be good to understand which scenes actually match each other, so that we know
which part of Mars the rover is currently traversing.

But because the scales and the angles may be different in these images, matching becomes a
little hard.

(Refer Slide Time: 32:23)

In this particular example, although these two images look very different, SIFT matching between
them finds some features which actually have very similar descriptors in the two images.

You could then match these two images to say where the rover was with respect to a broader
scene of the landscape.

(Refer Slide Time: 32:40)

Here are more examples. SIFT is fairly invariant to geometric transformations such as scale,
translation and rotation.

(Refer Slide Time: 32:55)

It is also reasonably invariant to photometric transformations such as changes in color hues,
because we largely rely on gradients rather than the pixel intensities themselves.

(Refer Slide Time: 33:07)

One of the popular applications that SIFT can be used for, which we started the last lecture
with, is image stitching, where you find SIFT key points in both images, match their
descriptors, and find out which key point in image one corresponds to which key point in image
two.

Once you find these matching key points, you can solve a system of equations to find the
transformation between the two images, and once you have that transformation, you can stitch
them together to make a panorama. This part we will talk about a little later, when we discuss
image matching.

(Refer Slide Time: 34:16)

So here is an example. You detect feature points in both images and then find correspondences
using the descriptors that you get from SIFT. Remember, every key point is represented by a
128-dimensional vector in the case of SIFT. You try to match a 128-dimensional vector in one
image with a 128-dimensional vector in the other; if the vectors match, you know that that key
point corresponds to that key point in the second image.
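
In practice you rarely implement this from scratch; here is a minimal sketch using OpenCV's built-in SIFT (available in recent OpenCV releases) to detect key points in two images and match their 128-dimensional descriptors, keeping only matches that are clearly better than the second-best candidate. The file names are placeholders.

import cv2

img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)

# keep a match only if it is clearly better than the second-best candidate (ratio test)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "putative correspondences")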

(Refer Slide Time: 32:40)

Once you do that, you can align the images and place them one over the other.

(Refer Slide Time: 32:23)

If you want to know more about SIFT, there are plenty of resources online. David Lowe has an
excellent paper which, as I already mentioned, you should read. There are also Python tutorials,
the Wikipedia entry is fairly informative, and there is an open SIFT library.

(Refer Slide Time: 34:43)

Before we conclude this lecture, we are also going to look at a few variants of SIFT that have
been developed since. One of the most popular improvements on SIFT is known as SURF, which
stands for Speeded-Up Robust Features. What SURF does is take the pipeline of SIFT and modify
parts of it.

You could now write the pipeline of SIFT as: construct the scale space, take the difference of
Gaussians, locate the DoG extrema, find the sub-pixel localizations of those key points, filter
out the edge and low-contrast responses, assign key point orientations, build key point
descriptors, and then use those features for whatever you want. That is the pipeline we just
finished discussing. One of the improvements that SURF makes on top of SIFT is that instead of
using differences of Gaussians and finding their extrema, it uses what are known as box filters.

(Refer Slide Time: 36:26)

These are also known as Haar filters; we will talk about them in the next slide. Box filters are
filters such as these, where one part of the filter is white, another part black, and another
part white. You could do this in various combinations: a checkerboard kind of effect, or you
could increase the number of black and white regions, have two black stripes and three white
ones, or flip it. All of those combinations are typically called box filters, and together they
form what are known as Haar wavelets. We are not going to go into wavelets here, but you could
look at wavelets as a generalization of a Fourier basis.

So SURF uses Haar wavelets, i.e., Haar filters such as these, to get key point orientations.
That is one difference in SURF, and it allows SURF to be very good at handling blur and rotation
variations.

(Refer Slide Time: 36:51)

SURF is not as good as SIFT in terms of invariance to illumination and viewpoint changes, but
it handles blur and rotations fairly well. Importantly, SURF is almost three times faster than
SIFT because of these changes. There is a particular reason for that: using Haar filters can
make computations significantly faster. I am going to leave that to you as a homework question;
please do read up on why using Haar wavelets can make computations faster. A hint for you is to
read up on something called integral images.
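
As a small sketch connected to that hint (not part of the lecture), an integral image lets you compute the sum of any axis-aligned box, and hence any box-filter response, with just a few additions, independent of the box size:

import numpy as np

def integral_image(img):
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, y0, x0, y1, x1):
    # sum of img[y0:y1+1, x0:x1+1] using four look-ups into the integral image ii
    total = ii[y1, x1]
    if y0 > 0: total -= ii[y0 - 1, x1]
    if x0 > 0: total -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0: total += ii[y0 - 1, x0 - 1]
    return total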

(Refer Slide Time: 37:40)

There is also another variant of SIFT called MOPS, which stands for Multi-Scale Oriented Patches
descriptor. In this case, the patch around the key point is rotated according to its dominant
gradient orientation, and then you compute the histogram and the descriptor.

Why is this useful? By rotating the patch to its dominant gradient orientation, you are ensuring
that all key points have the same canonical orientation. So if a particular artifact appears
rotated between image one and image two, although the orientations would be different, once you
canonicalize them they would both end up with the same descriptor. That is the idea behind the
multi-scale oriented patches descriptor.

(Refer Slide Time: 38:40)

Finally, there is also an approach called Gradient Location-Orientation Histogram or GLOH, which
is a variant of SIFT that uses a log-polar binning structure instead of quadrants. As you can
see here, this approach has 17 spatial bins and 16 orientation bins, with 8 spatial bins in each
concentric ring. So you have 16 orientation bins, instead of the 8 orientation bins that SIFT
uses in its final step.

Because of this, the representation becomes 272-dimensional: 17 spatial bins times 16 orientation
bins gives 272. To make it computationally efficient, this method then uses principal component
analysis to reduce the 272-dimensional descriptor to a 128-dimensional descriptor, which is what
is actually used as the descriptor in this method.

(Refer Slide Time: 39:40)

That concludes this lecture. Please read section 3.5 of Szeliski’s book. For more information on
SURF, there are OpenCV Python tutorials as well as an entry on Wikipedia, which again is
informative.

Please also look at the other links on the respective slides to understand further. A couple of
questions to take away, other than why Haar wavelets are fast to compute: which descriptor
performs better, SIFT or MOPS? Think about it. And why is the SIFT descriptor better than the
Harris corner detector?

The answers to these questions are in the lectures themselves. That is going to be your hint.

(Refer Slide Time: 40:25)

And here are other references to take away.

Deep Learning for Computer Vision
Prof. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 12
Image Segmentation
Last lecture, we spoke about feature detectors and feature descriptors. In particular, we spoke
about SIFT and its variants. Before we go to other kinds of features, we will take a slight
detour and talk about another important task in image processing, which is image
segmentation.

(Refer Slide Time: 0:46)

Human vision has been studied extensively from several perspectives, and one perspective on
human perception has been based on what is known as Gestalt theory, which emerged almost 100
years ago and whose fundamental principle is that the whole is greater than the sum of its parts.

One of the important cornerstones of this theory, for understanding human vision, is the idea of
grouping. This is a very popular optical illusion known as the Muller-Lyer illusion, which asks
a viewer which of these horizontal lines is longer. Because of human perception's bias towards
viewing things as a group and not as individual components, the illusion arises and makes one
line look longer than the other, while the horizontal lines are exactly the same length in both
of these figures.

(Refer Slide Time: 2:00)

So Gestalt theory proposes various factors in images which can result in grouping. On one hand,
proximity could be a reason: when there are objects or artifacts in the image which are simply
close to each other, and these groups lie in different parts of the image, that could result in
one kind of grouping. Or it could just be the similarity of the objects themselves.

Some objects are hollow circles and some are filled circles, and just because these objects look
different, they get grouped differently when we perceive the image. Another factor could be
parallelism: when there are groups of lines that have similar properties in terms of parallelism,
we end up grouping them. Another option could also be just symmetry.

Just seeing artifacts arranged symmetrically makes us group them in a particular manner. Further,
you could also have similarity of a different kind: a common fate, in terms of these arrows
telling you where the dots are going. In the second example here, all the dots are more or less
lying in a straight line, but the arrows let us group them differently; the sense of where they
are going to go, along the direction of the arrows, gives the human visual system a cue for how
they should be grouped.

Or the grouping could be purely because of a common region that was already marked out, say by a
circle; it could be due to continuity of artifacts; or it could be due to closure of artifacts.
As you can see, there are many factors that result in human perception viewing things as groups.

(Refer Slide Time: 3:57)

Gestalt theory is fairly descriptive: it lays down various rules and principles that could lead
to grouping when humans perceive images. However, these rules by themselves are not defined
clearly enough to become algorithms, so you cannot really take a rule directly and implement
pseudo-code for it that could help you find, say, groups in an image.

So for the rest of this lecture, we are going to talk about the task called image segmentation,
where we try to group similar pixels into different groups. You could call this akin to
clustering, where you group pixels into their own clusters. A significant part of this lecture
is taken from Forsyth's and Szeliski's books, and we will share these references at the end.

(Refer Slide Time: 4:53)

One of the oldest methods for image segmentation is known as watershed segmentation, developed
way back in 1979. Image segmentation, remember, is the task of grouping the pixels in an image;
that is the task we are looking at. The watershed segmentation method segments an image into
what are known as catchment basins, or regions. The approach goes along with the name, which is
why it is called watershed segmentation; that will become clearer in the next slide.

You can view any grayscale image as a 3D topographic surface. For example, given this grayscale
image, very similar to how we viewed an image as a function in one of the earlier lectures, you
can view the image as a 3D topographic surface where the intensity peaks at the bright locations
and subsides in other locations, and the low-lying areas are where you have dark pixels in the
original image.

What watershed segmentation tries to do is segment the image into regions using the conceptual
idea of rain water flowing into the same lake. We are going to assume that you flood the image
with water; imagining this 3D topographic representation of the original image, wherever you
have catchment basins in which water stagnates, those groups of pixels form segments. Let us see
this in a bit more detail.

(Refer Slide Time: 6:50)

So you identify various local minima in your image; remember, local minima are simply pixel
locations where the image intensity value is locally lowest. Recall again that image intensity
can lie between 0 and 255, or between 0 and 1 if you normalize the values in the image.

You take certain neighborhoods, say a 5x5 patch or a 7x7 patch, whatever that may be, and using
that process you find the local minima that lie across the image. You then flood the landscape
from the local minima while preventing water from different minima from merging, which means you
start at a local minimum and slowly keep expanding, which is the equivalent of flooding here,
until you reach a point that is reachable in the same number of pixels from another local
minimum.

That is where the boundary between the two regions would lie. It is a simple process; there is
nothing very fancy about it other than what I just mentioned. This process results in
partitioning the image into catchment basins and watershed lines. Remember, this method was
proposed in the late 70s and was used for many years to segment images into various parts.

(Refer Slide Time: 8:20)

It was generally applied on image gradients rather than the image itself. Just to give you a
visual example: the original image is this one, and we ideally want to group its pixels. You
first take a gradient image; you know by now how to take a gradient image, by running an edge
filter on the original image.

The watersheds of the gradient image are shown in the bottom-left image, and you finally refine
this to get your segmentation output and overlay it back on the original image, as in the last
image. It is a simple method that works rather effectively; however, it does have some
limitations.

(Refer Slide Time: 9:06)

If there is a lot of noise and irregularity in the image, for example if you have a heavily
textured image with a lot of undulations in its texture, or if there is noise in the image, you
are going to end up with something like this, because there are going to be a lot of catchment
basins. Trying to flood those areas will lead to many watershed lines separating many different
regions, resulting in what is known as over-segmentation.

Keeping this in mind, watershed segmentation is typically used as part of an interactive system,
where a user points out a few different centers in the image, and the algorithm then floods from
those centers alone, not worrying about other centers that could be in the image; this idea can
help overcome the limitation of over-segmentation. If you would like to know more, you can read
chapter 5.2.1 of Szeliski’s book. This is one of the earliest methods.
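
A minimal sketch of such marker-driven watershed segmentation, using scikit-image as one possible (assumed) implementation choice; the markers here stand in for the centers a user would click.

import numpy as np
from skimage import filters, segmentation

def watershed_from_markers(gray, marker_coords):
    # gray: 2-D float image; marker_coords: list of (row, col) seed points chosen by a user
    gradient = filters.sobel(gray)                     # flood the gradient image, as above
    markers = np.zeros(gray.shape, dtype=int)
    for label, (r, c) in enumerate(marker_coords, start=1):
        markers[r, c] = label
    return segmentation.watershed(gradient, markers)   # label image: one label per basin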

(Refer Slide Time: 10:17)

Since then, many methods have been developed, and one could broadly categorize them into two
kinds: region splitting and region merging. In region splitting methods, as the name suggests,
the idea is to start with the whole image and keep splitting it into finer and finer regions.

We will see one example of such a method a little later in this lecture. The other kind, which
is a bit more popular, are known as region merging methods, where you start with each pixel as a
region and keep merging similar pixels to form regions, and you can keep going all the way up to
the complete image.

Here is an example of an image that has been merged into groups of pixels; such groups of pixels
are also sometimes called superpixels, and this is often used as a preprocessing step for
higher-level segmentation algorithms.

(Refer Slide Time: 11:19)

One of the popular methods for image segmentation is graph-based segmentation, proposed in the
early 2000s by Felzenszwalb and Huttenlocher. They proposed a graph-based segmentation algorithm
which uses dissimilarities between regions to decide which regions to merge in a given iteration.
Consider an image as a graph G with vertices V and edges E: the pixels form your vertices
initially, and edges are defined between adjacent pixels.

We define a pixel-to-pixel dissimilarity metric; for example, this could be based on the
intensity values of the two pixels. You could also define it using more sophisticated
mechanisms, perhaps by taking a neighborhood and using oriented gradients and so on. Whatever
the measure, you have a pixel-to-pixel dissimilarity metric defined as w(e), the weight of the
edge e = (v1, v2) joining the two pixels v1 and v2 on which we are defining the dissimilarity.

For example, this dissimilarity metric could be the intensity differences between the N8
neighbors of the two pixels. When we say N8 neighbors, we take a 3x3 neighborhood; excluding the
central pixel, you have eight different neighbors, which we call the N8 neighbors. So you take
two different pixels, take their N8 neighbors and find their intensity differences; that could
be one way of computing w(e).

(Refer Slide Time: 13:10)

Once w(e) is defined, the method defines a quantity called the internal difference for every
region C. Initially, remember, a region C would just be a single pixel, but in later iterations
a region could be a collection of pixels.

The internal difference Int(C) is defined as the largest edge weight in the region's minimum
spanning tree. That is, given a bunch of pixels, you find the minimum spanning tree over those
pixels and take the largest edge weight in it; we define that as the internal difference. To
some extent, it is the largest edge weight in that region, stated in a simplified way. That is
one quantity we are going to define.

(Refer Slide Time: 14:06)

Another quantity we are going to define is called the minimum internal difference. If you have
two regions C1 and C2, the minimum internal difference between them is given by
MInt(C1, C2) = min(Int(C1) + τ(C1), Int(C2) + τ(C2)), where τ(C1) is some manually chosen penalty.

I will explain that in a moment; the second quantity inside the min is the internal difference of
C2 plus τ(C2). The penalties τ(C1) and τ(C2) could, say, be based on the number of pixels in each
region, normalized accordingly, and so on. So you have these two quantities and you take the
minimum of both.

That is, for region C1 you take its internal difference plus its penalty, for region C2 its
internal difference plus its corresponding penalty, and you take the minimum of these two.

(Refer Slide Time: 15:08)

Once you have this, for any two adjacent regions which have at least one edge connecting their
vertices, we define a quantity Diff(C1, C2), given by the minimum-weight edge connecting the two
regions. For example, you consider all edges (v1, v2) such that v1 comes from region C1 and v2
comes from region C2, i.e., all edges between the two regions, and you take the minimum weight
among them.

(Refer Slide Time: 15:46)

With these quantities defined, the method defines a predicate D for any two regions C1 and C2:
if Diff(C1, C2) is greater than the minimum internal difference MInt(C1, C2), the predicate is
true; otherwise it is false. This illustration should help you understand this a bit better.
Remember, Diff is the minimum edge weight between the two regions, whereas MInt is the smaller of
the (penalized) internal differences within C1 and C2. So if the difference between the two
regions is greater than the minimum internal difference, you do not merge them.

(Refer Slide Time: 16:36)

If it is the other way around, the regions are merged. That is, if the predicate turns out to be
false, you know that Diff is less than or equal to the minimum internal difference, which means
C2 is as close to C1 as the most widely separated pixels within C1 itself, so you may as well
merge C1 and C2. That is the main idea of this method. To summarize, this algorithm merges any
two regions whose between-region difference is smaller than the minimum internal difference of
the two regions.
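
The algorithm is available off the shelf; here is a minimal usage sketch with scikit-image's implementation, where (roughly speaking) the scale parameter plays the role of the penalty tau and min_size merges very small leftover regions. The parameter values are illustrative.

from skimage import data, segmentation

img = data.astronaut()                                          # any RGB image
labels = segmentation.felzenszwalb(img, scale=100, sigma=0.8, min_size=50)
print(labels.max() + 1, "segments found")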

(Refer Slide Time: 17:18)

Here is an example of how this works using the N8 pixel neighborhood. You can see that the
results are fairly good: the different players get separated reasonably well, and parts of the
body get separated reasonably well. If you are interested in more detail, please look at chapter
5.2.4 in Szeliski’s book.

(Refer Slide Time: 17:41)

The next method we are going to look at is another popular method called probabilistic
aggregation, proposed by Alpert et al. in 2007. This method is again a region merging algorithm,
and it is a probabilistic approach, as the name suggests. It uses two different cues: the
intensity (gray-level) values, as well as the texture content of the specific regions in the
image.

Initially, the method considers each pixel as a region, and you assign a merging likelihood pij:
for every pair of pixels i and j, there is a likelihood pij which says how likely these two
pixels are to merge. And how do you assign pij? Based on their intensity and texture
similarities.

We are not going to go into all the details; I will point you to the reference, and if you are
interested in knowing exactly how it is implemented you can look at the paper. But broadly, you
know how to compute intensity by now, and you know how to compute texture, which we talked about
last time, so pij is based on the intensity and texture similarities between the two pixels.

(Refer Slide Time: 19:00)

Now, given the graph at a particular iteration, say the (s-1)-th iteration, the graph is given by
a set of vertices V^(s-1) and edges E^(s-1). The graph at the next iteration, G^(s), where you
hope to merge a few regions, is obtained as follows: you consider a subset of nodes C from
V^(s-1), and you merge nodes or regions with C if they are strongly coupled to it.

What you do here is consider the remaining pixels, i.e., the nodes in V^(s-1) - C. For each such
pixel i, you compute the ratio of the sum of pij over j belonging to C to the sum of pij over j
belonging to V^(s-1) - C. When this ratio is greater than a threshold, you know that pixel i is
quite likely to merge with the pixels in the region C itself.

So, if this ratio is greater than the threshold, which the paper recommends setting to 0.2, then
you merge those pixels.

(Refer Slide Time: 20:35)

Once you merge those regions, you propagate the assignments down to the finest-level children, as
you would in any region merging algorithm, and you repeat this coarsening process recursively.

(Refer Slide Time: 20:54)

Here is an example of how probabilistic aggregation works in practice; this image is taken from
Szeliski’s book. Image A is the original gray-level pixel grid, a 5x5 image. You can see the
inter-pixel couplings in image B: a thick edge means a stronger coupling and a thin edge means a
weaker coupling, so you can see that the pixels that are white are strongly coupled to each other
because of their intensity similarity.

Then, in image C, you perform one level of coarsening: you pick a subset of nodes to start with,
which would be C, take all the other pixels in V^(s-1) - C, and try to find out which of them
have a high enough coupling to merge, based again on intensity similarity and texture similarity.

You keep repeating this process, and you can see that after two levels of coarsening, you get a
fairly good estimate of the different regions in the image.

(Refer Slide Time: 22:12)

Moving on to another popular method known as mean shift segmentation: this method takes a
different approach and uses a mode-finding technique based on non-parametric density estimation.

Non-parametric density estimation is a standard procedure that people in machine learning use to
estimate the density of any given set of observations. The feature vectors of the pixels in the
image are assumed to be samples from some unknown probability distribution, and we are going to
estimate this probability density function and find its modes.

In some sense, the modes play a role like the one the local minima played in watershed
segmentation, although in that case we looked for minima and now we are talking about maxima.
The image is then segmented pixel-wise by considering every set of pixels which climb to the
same mode as one consistent segment. We will describe this over the next few slides.

(Refer Slide Time: 23:24)

Let us consider an example image, as you can see here; it is a fairly real-world image. On the
right, you see a representation of the image. We talked about color spaces in week one, and one
of the color spaces that is popularly discussed is the L*u*v* or LUV color space. For simplicity
of plotting, this is a plot of only the L* and u* features of this image.

In this feature space, we ideally want to find the modes of the overall distribution of the
values in the L*u* space. So how do you find the modes of such a distribution?

(Refer Slide Time: 24:13)

If you recall your studies in machine learning, one of the popular approaches to estimate a
probability density function and its modes is kernel density estimation. We are going to use the
same method here to estimate the PDF and the modes of the distribution of values in a particular
space; while we have taken the LUV space as the example, you could do such mode finding in any
other feature space too.

Just for simplicity, we are going to consider a one-dimensional example. Remember once again that
this is a one-dimensional signal, and the idea extends to an image as a 2D signal. The way we
estimate this function is by convolving the data with a kernel: assume that your observations
are given by these impulses, the vertical lines you can see. You estimate the density function f,
which is what we want, by convolving your observations with some kernel of width h.

Assume you have a kernel k; you can take it to be a falling-off function of (x - xi), for example
a quadratic fall-off, where h is the width. So the estimate has the form
f(x) = sum over i of k(||x - xi||^2 / h^2).

Convolving your observations with this kernel gives you an estimate of the original density f
that your observations came from. What do we do with this?
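
A minimal 1-D sketch of such a kernel density estimate, using a Gaussian kernel as one example choice (the sample values are toy data):

import numpy as np

def kde(xs, samples, h=1.0):
    # each observation contributes a kernel bump of width h; the estimate is their sum
    u = (xs[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

samples = np.array([1.0, 1.2, 3.5, 3.7, 3.9])
xs = np.linspace(0.0, 5.0, 101)
f_hat = kde(xs, samples)          # density estimate on a grid; its peaks are the modes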

(Refer Slide Time: 25:49)

To find the modes, mean shift segmentation uses a gradient ascent method with multiple restarts.
Why ascent? Because we want to find the maxima, i.e., the modes, of that probability
distribution.

We start with some guess y0 for a local maximum, which could just be any point. Remember, we are
going to perform gradient ascent, which is an iterative procedure: you start with some point, and
at every iteration you add some constant times the gradient to the previous estimate and update
your estimate again and again. So you have a guess y0, which could be any of the points xi.

(Refer Slide Time: 26:35)

We calculate the gradient of the density estimate f(x); we just talked about how f(x) is computed
using the kernel (how you choose that kernel we will talk about in a moment). You take the
gradient of the density estimate f(x) at y0 and take an ascent step in that direction. The
gradient is given by the summation over i of (xi - x) g(||x - xi||^2 / h^2), where g is nothing
but -k', the negated derivative of the kernel k.

(Refer Slide Time: 27:14)

The gradient of the density function can actually be written slightly differently. Let us see the
previous slide again: the gradient we just had was the summation over i of
(xi - x) g(||x - xi||^2 / h^2), where g(.) = -k'(.), the negated derivative of the kernel.

We are going to show that this can actually be rewritten: the gradient can be written as
[Σ G(x - xi)] m(x), where G(x - xi) is simply the same small g from the previous slide evaluated
at ||x - xi||^2 / h^2, and m(x) is a weighted combination of the xi values you are considering,
namely m(x) = (Σ xi G(x - xi)) / (Σ G(x - xi)) - x.

(Refer Slide Time: 28:16)

Why is this a correct way of rewriting the expression from the previous slide? I am going to
leave that to you as homework. It is not too difficult; try to work it out step by step and you
will get the answer.

This vector m(x), which corresponds to the gradient up to a scalar factor, is known as the mean
shift. Note the -x term: the mean shift is the difference between the gradient-weighted mean of
the neighbors around x and x itself, where x is where you place the kernel and the (x - xi)
values are what fall within the range of that kernel, very similar to the convolutions we saw
earlier. In this case, we are only talking about 1D for simplicity.

So the vector m(x), the mean shift, is the difference between the gradient-weighted mean of the
neighbors around x and x.

(Refer Slide Time: 29:13)

And once you get your mean shift, which in this particular case is also your gradient (up to a
scalar multiple), then in the next iteration of your gradient ascent you are going to set
yk+1 = yk + m(yk), which can also be written this way.

Remember, m(yk) contains a −yk term (since x is where we place yk at this point in time), which
cancels with the yk outside, and you are left only with the gradient-weighted mean as the
quantity which becomes yk+1.
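As an illustration of this update, here is a minimal Python/NumPy sketch (hypothetical names, 1D data, a quadratic/Epanechnikov kernel, so that g = −k′ is constant on its support) of the mean shift iteration: each step replaces the current estimate with the weighted mean of the nearby points.

```python
import numpy as np

def mean_shift_mode(x_init, points, h, n_iters=50, tol=1e-6):
    # Gradient ascent on the kernel density estimate: each step moves y to the
    # g-weighted mean of the data points. For the quadratic (Epanechnikov)
    # kernel, g = -k' is constant on its support, so the weighted mean is just
    # the mean of the points within distance h of y.
    y = x_init
    for _ in range(n_iters):
        weights = (np.abs(points - y) <= h).astype(float)  # g(||y - x_i||^2 / h^2)
        if weights.sum() == 0:
            break
        y_new = np.sum(weights * points) / np.sum(weights)  # y + m(y)
        if abs(y_new - y) < tol:
            break
        y = y_new
    return y

points = np.array([1.0, 1.1, 1.2, 3.0, 3.1, 3.2, 3.3])
print(mean_shift_mode(1.0, points, h=0.5))   # climbs to the mode near 1.1
print(mean_shift_mode(3.3, points, h=0.5))   # climbs to the mode near 3.15
```

Points that climb to the same mode would then be grouped into the same segment, which is exactly the grouping rule described next.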

(Refer Slide Time: 29:58)

And you keep doing this process iteratively and you finally reach a mode of the distribution;
all the neighboring points which, by climbing, lead to this mode will form one segment.

Other points, which by climbing reach other modes, go to those segments. Clearly, this method
relies on selecting a suitable kernel width in addition to the kernel itself; we use a quadratic
kernel here, and the width you choose is also important, will impact the final result, and is
chosen empirically.

In the above description, the approach was color-based; we took it in the LUV space. You can
also consider other kinds of features, other feature spaces, and find the modes in them. If you
are more interested, please look at chapter 5.3.2 in Szeliski's book.

(Refer Slide Time: 30:55)

One more method, which is not a region merging method but a region splitting method, is
normalized cuts for segmentation. Once again, this was a very popular method used for many
years; it is again a graph-based method, but with a different approach.

In this case, a graph representing the pixels in an image is successively split into parts. It
is not about merging this time, it is about splitting. How do you split it? You mark edge
weights between pixels based on their similarity: if two pixels are very similar, maybe they
have similar intensities, you give them a high edge weight, and if two pixels are further apart
in their similarity you give them a low weight.

So here is an example of such a graph that is constructed. The goal would be to find the
min-cut; we are not going to work out the entire math here, and if you are interested, you can
look at the references and learn more about it, but you try to find the min-cut, which is going
to give you two different regions as shown here. And when you remove those edges, you get the
corresponding two segments, which is your original objective.

(Refer Slide Time: 32:19)

The min-cut is defined as the sum of all weights being cut: you consider two sets of vertices A
and B, and for all i belonging to A and j belonging to B, you add up the edge weights,
cut(A, B) = Σ_{i∈A, j∈B} w(i, j). Whichever choice of A and B gives you the minimum cut value
is the cut that you are going to choose to divide the graph into two segments.

So, clearly, as you can observe here, this method only divides the graph into two segments at a
time. If you want more segments, you will have to do this more times, or perhaps use other
heuristics if you want to combine them later into fewer segments.

(Refer Slide Time: 33:11)

But one of the problems with using the min-cut is that it can also result in trivial solutions.
Why so? Can you think about it? Why does this min-cut problem result in trivial solutions?

The answer is very simple: when A consists of, say, only one pixel, and B consists of another
region of pixels, this summation is restricted to a smaller set and hence may have a smaller
value when you sum it all up, which means min-cut could favor cuts where there is only one
pixel on one side, which may simply be an outlier and may not really form a good segmentation
of the image.

How do you overcome this problem? This problem is overcome by a method called normalized cut,
which improves upon the min-cut, and the normalized cut is defined by
cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V).

(Refer Slide Time: 34:29)

We will see in a moment what these terms are. Association is defined, again, as a sum of
weights: assoc(A, V) is the sum of all edge weights from A to all vertices V, and is given by
assoc(A, V) = assoc(A, A) + assoc(A, B).

(Refer Slide Time: 34:50)

And what you have achieved by dividing by assoc(A, V) and assoc(B, V) is to ensure that the
denominator accounts for how much each region is connected (roughly, the number of pixels in
it). So if there is only one pixel in one region, remember that the corresponding term will
become very high, as against another region with many more pixels, where the denominator pulls
the value down.

This ensures that you do not get trivial solutions such as what you saw in min-cut, but you get
more useful solutions in practice.
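To see how the criterion behaves, here is a small Python/NumPy sketch (hypothetical helper names) that evaluates cut(A, B), assoc(·, V) and the normalized cut for a candidate partition of a weighted graph; note that it only evaluates the criterion for a given split, it does not solve the (NP-complete) minimization.

```python
import numpy as np

def cut(W, A, B):
    # Sum of edge weights between the two vertex sets A and B.
    return W[np.ix_(A, B)].sum()

def assoc(W, A, V):
    # Sum of edge weights between A and all vertices V.
    return W[np.ix_(A, V)].sum()

def ncut(W, A, B):
    V = list(range(W.shape[0]))
    return cut(W, A, B) / assoc(W, A, V) + cut(W, A, B) / assoc(W, B, V)

# Toy graph: vertices 0-2 are strongly connected, 3-4 are strongly connected,
# with one weak edge between the two groups.
W = np.array([[0, 5, 5, 0.1, 0],
              [5, 0, 5, 0,   0],
              [5, 5, 0, 0,   0],
              [0.1, 0, 0, 0, 5],
              [0, 0, 0, 5,   0]], dtype=float)
print(ncut(W, [0, 1, 2], [3, 4]))   # small value: a good, balanced split
print(ncut(W, [0], [1, 2, 3, 4]))   # much larger: a one-vertex cut is penalized
```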

(Refer Slide Time: 35:28)

So the normalized cut happens to be an NP-complete problem, but there are approximate solutions
to it; these are slightly more involved, and you can look at the paper called "Normalized Cuts
and Image Segmentation" if you are interested in how the method is actually implemented. This is
the overall formulation.

(Refer Slide Time: 35:49)

Here is an example of how normalized cuts work. This is the input image, and you can see the
various regions get segmented using normalized cuts over the iterations of the method.

(Refer Slide Time: 36:08)

There have been many other methods for image segmentation too, such as simply using K
means clustering. There have also been methods based on Markov Random Fields and
Conditional Random Fields, and many more.

If you want a detailed understanding, please read chapter five of Szeliski's book; there are
more references at the end of this lecture too.

(Refer Slide Time: 36:29)

But the question that people tend to ask now is: do we really need all of these segmentation
methods? After all, the entire course is based on deep learning for computer vision. Are there
not deep learning methods that can segment images today? Do you really need all of this? It is
a valid question; for many tasks today, you do not need these methods. The purpose of covering
these topics is to give you the context of computer vision in the pre-deep-learning era, but
some of these methods have also been used in the deep learning era.

(Refer Slide Time: 37:06)

For those of you who are a little more aware of these areas: consider the earlier deep learning
methods used for object detection, the task where, given an image, you try to find an object and
draw a bounding box around where in the image that object lies, and where there could be
multiple instances of objects or multiple objects in an image.

In the initial deep learning methods for object detection, these kinds of segmentation methods
were actually used to generate what are known as region proposals, where the objects could lie.
As I said, one of the first such works, R-CNN (region CNN), which was used for object detection,
used a min-cut segmentation method known as CPMC, Constrained Parametric Min-Cuts, to generate
region proposals for foreground segments.

We will talk about this in detail when we come to that section of this course.

(Refer Slide Time: 38:06)

But that is the reason why image segmentation is useful for you to know at this time.
Segmentation can also go beyond images into videos. If you go into videos, the principles behind
the methods are very similar, although a third dimension comes into the picture. The kinds of
tasks one wants to solve in videos include, say, shot boundary detection, which is an important
problem in video segmentation, where you have to divide an entire video into shots.

Remember that if there is a very long video and you want to understand it, let us say you have
a movie and you want to understand it by automatically parsing the video of the movie, you first
have to divide the entire movie into shots, and then try to analyze each shot to understand the
scene, the characters, and so on and so forth.

So that is an important task in video segmentation. Another important task with videos is what
is known as motion segmentation, where you may want to isolate the motion of an object in the
video. So you could want to isolate a person running: for example, there is a football game and
you want to track a person moving with the ball from one end of the field to the other, or you
may want to track a car, and so on and so forth.

So for further information on this, please look at chapters 15 and 17 of the David Forsyth
book in the references of this course.

(Refer Slide Time: 39:37)

That concludes this lecture, which gave you an overview of how image segmentation methods were
used in the pre-deep-learning era. Some of them, as I said, were actually used even with deep
learning methods, although not as much at this time. For more reading, please do go through the
references given here, and recall that we left one derivation as homework for you: go back and
see how to derive the final expression for the gradient of the kernel density function used in
the mean shift method.

(Refer Slide Time: 40:12)

There are some references.

Deep Learning for Computer Vision
Prof. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 13
Other Feature Spaces
(Refer Slide Time: 0:15)

We started this week by talking about edges. Then we went from edges to blobs to corners, talked
about a few varieties of corner detectors, including an important detector called SIFT, and
talked about the feature descriptor, that is, how you would describe your corner as a vector.
And then, in the last lecture, we also talked about image segmentation.

Now, we are also going to talk about other kinds of feature spaces that people worked on in
computer vision, before deep learning came into the picture. That is going to be the focus of
this lecture.

(Refer Slide Time: 1:04)

One of the other important feature representations that was developed is known as shape
context. Shape context is about taking a pixel and computing a log-polar representation of the
neighborhood around that pixel, using this kind of representation. So we define a coordinate
system in two dimensions that is parameterized by the logarithmic distance from the origin,
which is the pixel at which you are trying to get the representation, as well as the angle.

So for each point that you consider, you count the number of points in each log-polar bin
around that particular pixel. In this log-polar representation you can see that there are
something like one, two, three, four rings in terms of log radial distance, and then 12 bins in
terms of the angles themselves.

So, as you can see, there is more precision for nearby points, very similar to the Gaussian
kernel where you give more weightage to nearby points, and more flexibility for points farther
from the pixel at which we are computing the representation. To some extent, this representation
is translation and scale invariant, depending on the size of the neighborhood you take.

(Refer Slide Time: 2:53)

Let us study this in a bit more detail. Consider two different images. If you look carefully,
there is a point marked with a circle, one marked with a triangle and one marked with a
rhombus. The first plot is the representation around the circle, the second is the
representation around the rhombus, and the third is the representation around the triangle.

How did these representations come about? There are five bins in terms of log r, the log radial
distance, and 12 bins in terms of θ, the angle around the center. We can see here that the
circle and the rhombus have very similar representations, so one could consider these two points
as correspondence points between these two representations, or these two images of A.

And when do we say two points are in correspondence? We say two points i and j are in
correspondence if they minimize the value of a cost Cij, where

Cij = (1/2) Σ_{k=1}^{K} (hi(k) − hj(k))² / (hi(k) + hj(k)),

and hi and hj are the histograms of the representations around the ith and jth pixels: hi is
the histogram of the number of points in each of those bins around the ith pixel, and hj is the
same for the jth pixel. hi(k) corresponds to the count in one of those bins, and capital K is
the total number of bins; so hi(k) is the count in bin k.

So when would Cij get minimized? When the numerator is zero, that is, the histograms exactly
match, or when the histograms are fairly close to each other, these values will turn out to be
low.

The denominator ensures that you also take into account the number of points being considered
while evaluating this; very similar to what we said about normalized cut versus min-cut, the
denominator helps in normalizing based on the counts you are considering.
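Here is a minimal Python/NumPy sketch (hypothetical names, simplified binning) of the two pieces described above: a log-polar histogram around a point, and the chi-square-style cost Cij between two such histograms.

```python
import numpy as np

def shape_context(points, center, n_r=5, n_theta=12, r_max=2.0):
    # Log-polar histogram of the points around `center`:
    # n_r bins in log radial distance, n_theta bins in angle.
    d = points - center
    r = np.log1p(np.hypot(d[:, 0], d[:, 1]))
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    hist, _, _ = np.histogram2d(r, theta,
                                bins=[n_r, n_theta],
                                range=[[0, r_max], [0, 2 * np.pi]])
    return hist.ravel()

def matching_cost(h_i, h_j, eps=1e-9):
    # C_ij = 1/2 * sum_k (h_i(k) - h_j(k))^2 / (h_i(k) + h_j(k))
    return 0.5 * np.sum((h_i - h_j) ** 2 / (h_i + h_j + eps))

# Toy example: the same point cloud, slightly perturbed, should give a low cost.
pts1 = np.random.rand(50, 2)
pts2 = pts1 + 0.01 * np.random.randn(50, 2)
h1 = shape_context(pts1, pts1[0])
h2 = shape_context(pts2, pts2[0])
print(matching_cost(h1, h2))
```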

(Refer Slide Time: 5:31)

So once you get these correspondence points between two images based on the representation, you
could then do a one-to-one matching between these correspondence points, and when the distance
is less than a threshold, this kind of approach was used to check for a shape match: not just a
point match, but matching shapes between two different images. That is the reason it was called
shape context.

(Refer Slide Time: 5:58)

Another feature space, or important method, that was proposed is known as MSER, or maximally
stable extremal regions. This is a method for blob detection based on watershed segmentation. We
just saw image segmentation in the previous lecture, and one of those methods, simple watershed
segmentation, is used to arrive at MSER, maximally stable extremal regions. And how does this
work? You identify regions in an image that stay nearly the same as you keep varying a
threshold.

(Refer Slide Time: 6:42)

Let us try to describe this. You would sweep an intensity threshold going from white to black,
or black to white, and in each step you simply threshold based on that particular intensity
value. So if you have a threshold of 75, everything above that is white and everything below is
black, a simple thresholding operation. Then you extract connected components within that
thresholded image, and the final region descriptors serve as the features.

(Refer Slide Time: 7:12)

Let us see this more clearly with some illustrations. Say we start thresholding the image at a
particular intensity level g, which means you consider a threshold g, and everything above it is
considered white and everything below it is considered black. So you binarize your image based
on the threshold g. Clearly, when you threshold, you are going to see a bunch of blobs, or
collections of pixels, appear.

They are going to be cohesive regions that appear together when you threshold. As you increase
or decrease the value of g, in this case if you decrease the value of g, newer regions start
appearing: you are now giving a lower threshold for considering regions to be white and the rest
to be black, so more regions with white pixel values start appearing in the thresholded image.

And as you keep changing g, regions start coalescing or splitting, and the regions that develop
as you change g can be depicted in a tree-like structure. It is very similar to region merging
and region splitting: as you reduce g some regions may merge, and as you increase g regions may
split, based on their intensity values.

(Refer Slide Time: 8:36)

So now, what do you do with this? Let the regions at a particular threshold level g be denoted
R1^g to Rn^g; say there are n regions, each corresponding to that particular threshold g, and
let the cardinality |Ri^g| denote the total number of pixels in region Ri^g. Then we define a
quantity

ψ = ( |Rj^(g−Δ)| − |Rk^(g+Δ)| ) / |Ri^g|,

where Rj^(g−Δ) and Rk^(g+Δ) are the parent and child of the corresponding region Ri^g at
slightly different thresholds, g − Δ and g + Δ.

Let us try to study this a bit more carefully. Remember, again, that Ri^g is one of the regions
you are considering at the threshold value g. Now, if you subtract a little and go to g − Δ,
then you get a parent region Rj, which is perhaps larger than Ri: when you lower the threshold,
the white region probably gets bigger. And at g + Δ, when you increase the threshold, the white
region may get smaller; you define that as the child, denoted Rk.

And if this quantity ψ, the normalized difference in size between Rj and Rk, is below a
user-defined threshold, you call that region a maximally stable extremal region. Remember, if it
is smaller than the threshold, it simply means there is almost no change between the parent and
the child; if there is a lot of change, then you probably need to expand a little further before
calling it a stable, extremal region.
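A small sketch of the stability test described above (hypothetical names; the region areas are assumed to come from the component tree built while sweeping the threshold):

```python
def stability(area_parent, area_region, area_child):
    # psi = (|R_j^(g-Delta)| - |R_k^(g+Delta)|) / |R_i^g|
    # area_parent : area of the region at threshold g - Delta (larger region)
    # area_child  : area of the region at threshold g + Delta (smaller region)
    return (area_parent - area_child) / float(area_region)

def is_maximally_stable(area_parent, area_region, area_child, psi_threshold=0.2):
    # A region is declared stable when its relative change in area across
    # neighbouring thresholds is below a user-defined threshold.
    return stability(area_parent, area_region, area_child) < psi_threshold

# Example: a region whose area barely changes across thresholds is stable.
print(is_maximally_stable(1050, 1000, 980))   # True  (psi = 0.07)
print(is_maximally_stable(2500, 1000, 400))   # False (psi = 2.1)
```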

(Refer Slide Time: 11:02)

Here is a visual illustration. This is the input image; as you can see, it has some characters
in it. When you set your threshold to 75, assuming again that you use 8 bits to represent a
pixel so values are between 0 and 255, the regions are not very clearly defined; when you go to
g = 105, the regions get slightly better defined. At 135 there are a few changes, but not many;
at 165 very minor changes; at 195 almost no change; and at 225 things again start going to black
because the threshold is very high.

So you could now represent that as a tree: you start with a region, that region becomes two
regions at level g = 105, region 2 becomes region 4 and region 8 at level g = 135, and so on to
build the rest of the tree. Finally, as you can see, at level g = 195 you have two regions, one
with the k and the other with an r, which is what you were originally looking for. So this gives
you one way of separating different regions; this is not exactly a corner detector, but another
method to separate regions.

(Refer Slide Time: 12: 29)

We also said that it could be used as a blob detector, and here is an example of doing blob
detection using maximally stable extremal regions. Given this input image, the top two rows
depict going from white to black and the bottom two rows depict going from black to white. If
you go from white to black, you start with a high threshold and then decrease it; more regions
in black start showing up, regions start merging, and when the regions stop changing even as you
change the threshold, you know that that is a stable blob representation of your original image.

Remember, when you see it from black to white, you are going to look at it in terms of merging:
when everything is black there is basically no region, and as you slightly increase the
threshold certain regions start showing up; as you keep increasing the threshold this way, going
into the third row, you can see more stable regions showing up as you go forward. So this is
another approach to find blobs or regions in an image, called maximally stable extremal regions.

(Refer Slide Time: 13:48)

Another famous method, which you have probably already been exposed to when we talked about
SIFT, but which was also independently developed to detect humans in images, is known as
histograms of oriented gradients, popularly known as HOG. This was originally proposed in a
paper on human detection in different kinds of images.

So, let us say you had a portion of an image, or an image, containing a human. At each location
where the window is applied, you compute gradients; remember, computing gradients is simply edge
detection with whatever edge detector you want to choose. Then you divide the entire gradient
image into a uniformly spaced grid, and for every 2 x 2 block in this grid you compute the
orientations of the gradients, very similar to how we talked about it for SIFT: you would have,
say, multiple orientation bins.

Let us say you define 8 different orientations; you see how many pixels in that block had
gradient orientations falling in each bin, and you bin them to get a histogram for each 2 x 2
block. You do this with overlapping 2 x 2 blocks, which gives you the histograms that you see on
this image at every grid center.

So you can see here that each of these is just a different way of drawing a histogram. If you
recall, something like this was just another way of drawing a histogram with 8 bins in 8
directions, where the length of the arrow denotes the frequency count in each of those bins;
this is simply another way of representing the gradient orientations when forming the histogram.

So the histogram of oriented gradients was shown to be very effective for detecting humans
in an image in the mid-2000s.
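Here is a simplified Python/NumPy sketch (hypothetical names) of the core HOG idea, magnitude-weighted orientation histograms computed per cell; the overlapping-block normalization used in the original detector is omitted for brevity.

```python
import numpy as np

def hog_cells(image, cell_size=8, n_bins=8):
    # Gradients via simple finite differences (any edge detector would do).
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi          # unsigned orientation in [0, pi)

    h, w = image.shape
    n_cy, n_cx = h // cell_size, w // cell_size
    hist = np.zeros((n_cy, n_cx, n_bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = (slice(cy * cell_size, (cy + 1) * cell_size),
                  slice(cx * cell_size, (cx + 1) * cell_size))
            bins = (ang[sl] / np.pi * n_bins).astype(int).clip(0, n_bins - 1)
            # Magnitude-weighted orientation histogram for this cell.
            hist[cy, cx] = np.bincount(bins.ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=n_bins)
    return hist.reshape(-1)                    # concatenated descriptor

img = np.tile(np.linspace(0, 255, 64), (64, 1))     # a simple horizontal ramp
print(hog_cells(img, cell_size=8, n_bins=8).shape)  # (8*8*8,) = (512,)
```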

(Refer Slide Time: 16:00)

And this was also improved upon to give what is known as pyramidal HOG, or PHOG, where at every
step you divide the image into parts, construct a HOG representation for each part individually,
and then concatenate all of them to form a representation.

How do you do this? You divide an image into 2^l x 2^l cells at each pyramid level l. So if l is
1, you divide the entire image into 2 x 2 cells, that is, 4 cells; if l is 2, you divide it into
4 x 4 cells, and so on. Now, for each of these cells you extract HOG descriptors just the way we
spoke about on the previous slide, and then you concatenate the HOGs at different pyramid levels
to give the entire histogram-of-oriented-gradients representation.

As you can see, this can be done in several ways: you would get a histogram of oriented
gradients for each cell and concatenate all of them; then you would get a histogram of oriented
gradients at l = 0, l = 1, and so on, and you can concatenate all of these as well to get your
final representation.

So this method was shown to capture spatial relationships between say objects a little bit
better than HOG, which does not do this at a pyramid level.
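A minimal sketch of the pyramid-level concatenation (hypothetical names); for brevity a plain intensity histogram stands in for the per-part descriptor, but any per-region descriptor, such as the cell histograms sketched earlier, could be plugged in.

```python
import numpy as np

def phog(image, descriptor_fn, levels=2):
    # Pyramidal HOG idea: at level l, split the image into 2^l x 2^l parts,
    # compute a descriptor for each part, and concatenate across all levels.
    h, w = image.shape
    pieces = []
    for l in range(levels + 1):
        n = 2 ** l
        for i in range(n):
            for j in range(n):
                part = image[i * h // n:(i + 1) * h // n,
                             j * w // n:(j + 1) * w // n]
                pieces.append(descriptor_fn(part))
    return np.concatenate(pieces)

img = np.tile(np.linspace(0, 255, 64), (64, 1))
desc = phog(img, lambda p: np.histogram(p, bins=8, range=(0, 256))[0], levels=2)
print(desc.shape)   # (8 + 4*8 + 16*8,) = (168,)
```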

(Refer Slide Time: 17:33)

Another popular method, which was extensively used for face images in particular, was known as
local binary patterns (LBP). Local binary patterns had an interesting idea: given an image, for
every pixel you take, for example, a 3 x 3 neighborhood around that pixel; here, the example
image is the IIT Hyderabad logo. So you consider a pixel and a 3 x 3 neighborhood around it, and
write down the intensity values of those pixels. Now, with respect to the central pixel, you
decide whether each of the other pixels around it is lower or higher.

If it is lower, you set a 0; if it is equal or higher, you set it to 1. That now gives you a
binary pattern around the central pixel. Once you fix a canonical ordering, which means you say,
I will start at the top left and then go in a circular manner, you can define a binary
representation for this particular pixel.

So, once again, I will repeat that: you define a binary representation based on how each pixel
compares to the central pixel's intensity. In this example, the value is 15 in decimal, and you
replace that pixel's value with 15; the values at other locations are similarly obtained by
computing the LBP positioned at those pixels, and so on. Now, what do you do with this new LBP
representation?

(Refer Slide Time: 19:14)

Before we go there: you can also define neighborhoods in multiple ways. We said you would take a
3 x 3 neighborhood and use the 8 pixels around the center, but that need not always be the case;
you can vary two different parameters here, the radius r and the number of neighbors P.

So you can choose a particular radius at which you want to compute your local binary pattern,
which may not be the immediate neighbors; it can be a radius of four or five pixels around the
central pixel. And you can also define how many neighbors you want at that radius. So you could
set P = 8 neighbors and r = 2, as in the example that you see on this particular image.

Similarly, you could have a closer neighborhood, r = 1, with the number of neighbors kept the
same, P = 8. You could also have r = 2 and increase the number of neighbors to P = 12, and at
r = 3, further out, you could still have 12 neighbors.

Once you define the neighbors, if these neighbors lie in between two pixels, you can use
bilinear interpolation to get those values. So once you define R and P, based on those values,
you would get a binary pattern around the central pixel. You write that out in a circular
manner as a binary number, convert it into decimal and replace the central pixel with that
particular value. What do you do with that?

(Refer Slide Time: 20:51)

So once you get the LBP result, which is an LBP value for each pixel, remember that the process
we talked about on the earlier slide gives you one decimal value for each pixel around which you
considered a neighborhood. You can repeat the same thing for every pixel and construct such an
LBP image.

Once you do this, you divide the LBP result into regions or grids, and once again you can get a
histogram for each region, in this particular case of the LBP values of the pixels in that
region, and concatenate all of them to get a single representation for the region or the entire
image.

In this case we are not considering the gradients; it is simply the histogram, but you could
also extend LBP to consider gradients in each cell, and so on. There have been several
improvements of LBP, or local binary patterns, that have considered these kinds of variations.
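Here is a minimal Python/NumPy sketch (hypothetical names) of the basic 3 x 3 LBP computation and the grid-wise histogram representation described above; the variable-radius (r, P) version with bilinear interpolation is omitted for brevity.

```python
import numpy as np

def lbp_image(image):
    # Basic 3x3 LBP: compare each of the 8 neighbours with the centre pixel
    # (1 if neighbour >= centre, else 0), read the bits in a fixed circular
    # order and convert to a decimal value per pixel.
    img = image.astype(int)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=int)
    # Offsets in a fixed (canonical) circular order starting at the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = img[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out |= (neighbour >= centre).astype(int) << (7 - bit)
    return out

def lbp_histogram(image, grid=(4, 4)):
    # Divide the LBP image into a grid and concatenate per-region histograms.
    lbp = lbp_image(image)
    h, w = lbp.shape
    gy, gx = grid
    hists = []
    for i in range(gy):
        for j in range(gx):
            region = lbp[i * h // gy:(i + 1) * h // gy,
                         j * w // gx:(j + 1) * w // gx]
            hists.append(np.bincount(region.ravel(), minlength=256))
    return np.concatenate(hists)

img = (np.random.rand(34, 34) * 255).astype(np.uint8)
print(lbp_histogram(img).shape)   # (4*4*256,) = (4096,)
```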

(Refer Slide Time: 21:54)

So here is a quick summary of various feature detectors. We have seen some of them; we may not
have the scope to see all of them, but you are welcome to look at the source of this particular
table to understand more of them. You can view all these features in terms of whether they are
corner detectors, blob detectors, or whether they help in finding regions; whether they are
rotation invariant, scale invariant, affine invariant; and how repeatable they are. Remember,
repeatability is about ensuring that if you found a particular pixel to be a corner in one
image, and the same scene is captured in another picture from a different angle, that corner
should also be detected as a corner in the second image. This is known as repeatability.

How good is the localization accuracy? Remember, we talked about localization accuracy even for
the Canny edge detector, where we asked: is the point exactly where it should be, or is it a few
pixels away from the actual corner that we know? How robust is the method to noise, and how fast
is it?

So these are various parameters using which you can evaluate the goodness of the methods we have
discussed so far. Let us take a few of them and go over them. Harris is a corner detector; it is
rotation invariant, as we have seen, but not necessarily scale or affine invariant, as we
already saw. It is fairly repeatable, and reasonably good in terms of localization accuracy,
robustness, as well as efficiency.

Let us take another one. Consider SURF, which is an improvement over SIFT that we saw a couple
of lectures back; it is a corner detector or a blob detector, it can be used either way.

It is rotation invariant and scale invariant; being scale invariant is the purpose of SIFT and
SURF, it is in the name of SIFT itself (Scale-Invariant Feature Transform). It is also fairly
repeatable, fairly accurate in terms of localization, robust, and extremely efficient; we saw
that SURF is almost three times as fast as SIFT.

You can similarly look at, say, MSER, which we just saw a few slides ago: it is a region
detector, which is reasonably rotation invariant, scale invariant and affine invariant, and it
is also repeatable, accurate in terms of localization, robust and efficient. For more details on
other feature detectors, please look at the references, and you can look at the cited paper to
get a more detailed survey of other feature extractors.

(Refer Slide Time: 24:43)

Why we are not discussing each of these in detail is what we are going to talk about next: many
image representations that used to be used in the pre-deep-learning era are no longer as
relevant, because of the success of deep learning in extracting features that are extremely good
for various computer vision tasks.

Remember, when we talked about the history, we said that in the mid-1960s Marvin Minsky started
a summer project to solve the problem of computer vision; it is only many decades later that
deep learning has been able to make significant progress on it.

(Refer Slide Time: 25:34)

So the homework for this is going to be reading chapters 4.1.1 and 4.1.2. If you are further
interested in other feature detectors, please do read some of these links as well as some of
these references; in particular, this reference gives an overview of various different feature
detectors if you are interested.

Deep Learning for Computer Vision
Prof. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 14
Human Visual System
(Refer Slide Time: 0:15)

For the last lecture of this week, we are going to look at whatever we have seen from a very
different perspective: that of the human visual system. We saw that processing images can be
done to achieve several tasks, such as extracting edges, blobs, corners and key points,
extracting representations around key points, segmenting images, and so on. For many decades,
these were used extensively in computer vision applications.

In particular, one of the topics that we covered in the lectures, a bank of filters using a
Gabor filter bank or steerable filters, was about using multiple different filters at different
orientations and scales to extract content out of images. To some extent, we will now see how
this approach is similar to how the human visual system processes images. It is not exactly an
imitation, but there are similarities between how these methods were used to process images and
how things happen in the human visual system. So, to complete that, let us look at a slightly
detailed view of the human visual system.

(Refer Slide Time: 1:47)

To start with an acknowledgement: most of this lecture's slides are taken from Professor Rajesh
Rao's slides at the University of Washington, so unless stated explicitly the image sources are
also the same.

(Refer Slide Time: 2:03)

So the human visual system can be summarized in this diagram. There is a lot more detail than
what you see here, but what you see is the eye and the retina; the scene around the human forms
the left visual field and the right visual field, which fall on both eyes, and then you can see
that the input to the right eye goes to the left part of the brain, drawn in blue here, while
the input to the left eye goes to the right part of the brain, drawn in red.

The primary visual cortex is located at the back, and there are other components that the visual
signal passes through, such as the pulvinar nucleus, the LGN or lateral geniculate nucleus, the
superior colliculus, the optic radiation, and so on. If you observe carefully, among all the
inputs that come in through the retina, most of them go to the visual cortex, but some content
deviates and goes into the superior colliculus, and the superior colliculus is what is
responsible for the feedback that moves the eye.

So the superior colliculus is what tells you to move your eyeballs to see something to get a
better understanding, so on and so forth, while the visual cortex is what gives us
understanding and perception of the scene around us itself. Let us see this in a bit more detail.

(Refer Slide Time: 3:48)

To start, once again, we talked about this in an earlier lecture too: light visible to the human
eye is restricted to one part of the electromagnetic spectrum, going roughly from a little less
than 400 nanometers to a little over 700 nanometers, from violet to red. The radiation to the
left of violet is called ultraviolet, and the radiation to the right of red is called infrared.
So this is known to us.

(Refer Slide Time: 4:24)

If you ask why our eye receives mostly this part of the spectrum, it seems that, as we have
evolved, our vision appears to be optimized for receiving the most abundant spectral radiance of
our star, the sun. In the graph on top, you see the energy of the various components of the
electromagnetic spectrum; the sun's energy peaks in the visible spectrum and then falls off over
the rest of the electromagnetic radiation. That is potentially a reason why our eyes seem to
have settled on that spectrum as the most useful one from a vision perspective.

(Refer Slide Time: 5:18)

So the retina, which is the sensor of our visual pathway, consists of photoreceptors and also
does a lot of image filtering before it passes information on to the next phase of the human
visual pathway. If this is our retina and light falls from left to right here, the back of the
retina is blown up on the right side so you can see it a bit more closely. At the far end it
consists of, of course, epithelial cells, and just before the epithelial cells the retina has
what are known as the rods and cones, which you may have heard of. But before the photons fall
on the rods and cones, there are many other cells too, such as ganglion cells, bipolar cells and
so forth, through which the information passes. Each of the rods and cones has specific
properties.

(Refer Slide Time: 6:26)

The rods are sensitive to intensity but not color. Why are they called rods and cones? Because
of their shapes, as you can see here: the rods are rod-shaped and the cones are conical. So the
rods are sensitive to intensity but not to color; in some sense they get a blurred image of what
is happening around us. Cones are sensitive to color, form sharp images, and require many more
photons to absorb the information. Cones come in three different types in humans, and each type
is sensitive to specific wavelengths.

(Refer Slide Time: 7:16)

And what are these wavelengths? You have a set of cones that respond very well to blue, a set
that respond very well to green, and a set that respond very well to red. Rods lie somewhere in
between: they are not color sensitive, but respond to the intensity of the photons falling on
the retina. This also explains the RGB aspect of the color representations we choose, because
that seems to be where our cones peak in the VIBGYOR spectrum.

So this also explains why a person could be colorblind. So for example, if a person does not
have green cones the person may not be able to see green color in the world around us.

(Refer Slide Time: 8:06)

Before the photons reach the rods and cones, there are what are known as ganglion cells and
other cells in the retina, which typically operate in an excitatory or an inhibitory manner. In
the diagram on the slide, plus denotes an excitatory reaction and minus denotes an inhibitory
reaction.

Cells are organized this way: there is a central cell which gets excited when a photon falls on
it, and a set of cells around it that get suppressed when a photon falls on them. So what
happens? Remember, as we will see through this lecture, even the eye acts as an image filter,
and that is the reason we are talking about it now; having discussed image filters, edges,
features and so forth, it is perhaps the right time to relate what we have discussed so far to
how things happen in the human visual system.

One key difference between what we have studied so far and what we are going to talk about in
the human visual system is that the human visual system does spatiotemporal filtering. It is not
only spatial filtering, which is largely what we have seen so far in this course; it also does
filtering over time. We will talk about this in a bit more detail in the next few slides.

Before we go there: as we were saying, arrangements of cells in the retina have excitatory and
inhibitory elements. There could be an excitatory cell flanked by inhibitory cells on either
side. When a spot of light shines on the central cell, when the light is on, you can see a set
of impulses here; remember that at the end of the day these cells release spikes of electricity,
as you can see, which are known as action potentials.

Each of these is a spike, and when the light is turned on there is an excitatory reaction,
because the light falls on the excitatory part of the cell. On the other hand, if the light is
on and falls on the inhibitory part of the cells, you see that there is no response, no spikes,
from the cells, because those inhibitory cells, even when photons fall on them, suppress and do
not emit any potential. This idea of inhibitory and excitatory responses is extremely key to how
our human visual system works.

(Refer Slide Time: 11:06)

So there are two kinds. The earlier kind, where the excitatory cell is in the middle, is called
an on-center, off-surround cell. You also have the converse, an off-center, on-surround cell, in
which case you have an inhibitory cell in the middle flanked by excitatory cells on either side.
In this case, when the light is on and the photons fall on the middle cell, your action
potentials, or spikes, stop for some time; this particular set of spikes is what you get over
time.

The light was on for the duration you see there; the graph is a graph over time going from left
to right. When the light was on, you can see that no spikes come out of that particular cell.
Whereas, in the other case, when the light falls in the region outside the inhibitory cell, on
the excitatory cells, you can actually see that they throw out a bunch of spikes. So this idea
of off-center and on-center cells, where there are cells that inhibit and cells that excite, is
an important component of how our visual system works.

(Refer Slide Time: 12:30)

As I just mentioned, the human visual system is a spatiotemporal filter. The filter on the
spatial side largely resembles a blob detector, or a Laplacian of Gaussian, to a large extent;
it could be either polarity, so you could also have a Laplacian of Gaussian that peaks in the
other direction. For the most part, they seem to resemble Laplacians of Gaussians. But, as I was
just mentioning, there is also a temporal filter, which acts something like the graph here.

What does this graph mean? When the light is brightest, you get the highest response. After
that, you actually get a negative response before stabilizing. Remember, again, that the human
visual system is a spatiotemporal filter: when a photon shines, or an edge falls on your retina,
you first detect the edge, then for a few milliseconds the reaction is the opposite in time, and
then you revert back to a stable state. That is what the temporal filter does.

Where can you see it taking effect? Why do you think this happens? Here is an example for
that.

(Refer Slide Time: 13:51)

You have probably seen this common optical illusion: what do you think you see at the
intersections, black dots or white dots? This should explain what is happening in the eye. If
you see a white dot and then move your eye away from it, remember that the response over time is
to swing to the other side and make it look like a black dot before you recover and find out it
is a white dot. The reason such an illusion happens is how the temporal filter in the human
visual system works.

(Refer Slide Time: 14:30)

Another effect that you may have seen is what is known as color-opponent processing. If you look
at these examples, these are also optical illusions, but you may have seen this in many other
settings. When you focus on some very strong colors, you typically get a negative afterimage. So
if you focus on the yellow and quickly look away, you may find it to be a blue color; you get a
negative afterimage, which again corresponds to the temporal filter we are talking about, where
you get an opposite response over time before stabilizing to an equilibrium.

(Refer Slide Time: 15:16)

As we mentioned, the human visual pathway also has a component called the LGN, which lies
somewhere in between. The LGN has a very similar center-surround, on-off structure in its cells:
there are sets of cells where one cell could be inhibitory and surrounded by excitatory cells,
and vice versa, in that same region. So you have combinations of both kinds of cells, which
together lead to perception the way we see things.

Originally, the LGN, or lateral geniculate nucleus, was considered to be more of a relay system
that takes the input from the retina and passes it to the visual cortex, but it is now
understood to receive a lot of feedback from various parts of the brain, which comes back into
the LGN to give it a more holistic picture of the scene. So there are other feedback signals
that come in to shape the perception it produces.

(Refer Slide Time: 16:23)

The visual cortex, or V1 cortex, lies at the far end; let us talk about the visual pathway in a
bit more detail in the next few slides.

(Refer Slide Time: 16:35)

For the V1 cortex, we go back and recall the history of computer vision that we talked about
last week, where we said that two researchers, Hubel and Wiesel, were the first to characterize
V1 receptive fields by recording from a cat viewing stimuli on a screen. We also talked about
them receiving the Nobel Prize in 1981 for this work.

(Refer Slide Time: 17:02)

And one of their largest contributions was to show that the V1 cortex has two kinds of cells:
simple cells, which simply detect oriented bars and edges (for example, a bar detector, where a
bar is simply a white region flanked by two black regions or the other way around, and an edge
detector is the edge detector that we already know), and complex cells, which may be invariant
to position but are sensitive to orientation.

So if you have certain orientations of edges the complex cells are what pick up those kinds of
orientations in their structure.

(Refer Slide Time: 17:46)

The cortical cells actually end up computing derivatives. Remember, again, that the spatial
derivative is orientation sensitive: depending on how you place your filter, you are going to
detect different orientations of edges in your image. So if such an edge in the scene fell on
your eyes, the spatial receptive field would look something like this, a derivative in space,
and the derivative in time, as we already said, would peak, then fall off to the other extreme,
and then gradually settle.

To some extent the spatial derivative and this temporal derivative look similar, but the
temporal derivative leads to illusions based on time when we are looking at an image.

(Refer Slide Time: 18:45)

Some of these cortical cells also have direction selectivity. As we said, the complex cells
respond to specific orientations, and the oriented derivative can actually be in X-T space
rather than just in X space. For example, with all the edge detectors we saw so far, you could
have an edge detector that detects a vertical edge, one that detects a horizontal edge, or one
that detects an edge with a certain orientation. But because the brain is processing information
in three dimensions, X, Y and T, you could also have an edge that is moving.

For example, you could have a vertical edge that is actually moving, which is what you see here:
a rightward-moving edge. As you keep moving the edge from left to right, you now have a cuboid
of space (X and Y) and time (T), and you will notice that, over time, the edge moves from left
to right. Remember again that, unlike the simple cases we saw so far with filters and masks, the
human visual system is responding to stimuli that change over time. It is not a still image but
a changing image, so the human eye has to adapt to those changes in the image too.

So it appears that over T you are going to have an edge in a different direction, because the
edge is actually moving from one part of the image to another. In the X-T plane, this particular
cortical cell will end up seeing an edge along this direction: the T component comes from the
movement in one direction, and the X component is the edge itself; remember it is a vertical
edge, so you have change along the X direction, and you also have change along the T direction
because it is a moving edge.

So an oriented derivative now need not be just in X-Y space, which is what we have seen so far;
it can be in X-T space, Y-T space, and so on. Remember that the concept of an oriented edge
detector is quite different in the human visual system because of the element of time.

(Refer Slide Time: 21:13)

Why are oriented filters important? Even from the human visual system perspective, people have
shown that, given natural images, if we had to learn independent filters whose linear
combination would best represent natural images, the optimal set of such filters turns out to be
oriented filters, localized to different regions of the image.

Another way of saying this is that a natural image simply becomes a positive response to a
filter bank with several orientations, with each of these filters placed at different regions in
the image. This should connect you to the discussion that we had on filter banks, Gabor
wavelets, Gabor filters and steerable filters. Even at that time we mentioned that Gabor filters
are known to be a little similar to how the human visual system performs, and this is the
context of why we made that statement.

(Refer Slide Time: 22:20)

Also, the final processing at the visual cortex has two pathways, called the dorsal and ventral
pathways. The dorsal pathway is responsible for the "where" information, which part of the scene
in front of you contains what you are seeing, and the ventral pathway corresponds to the "what"
information, what object you are seeing in front of you. Each of these parts leads to different
aspects of our perception of the scene around us.

(Refer Slide Time: 23:02)

The What pathway is what you see here: it goes from the V1 cortex to the V2 cortex to the V4
cortex, and then to a couple of regions called TEO and TE. We are not going to get into these
today; there are references at the end of this lecture if you would like to know more, but those
are different parts of the brain, as you can see here, which finally lead to understanding what
the object is.

And as you go from the V1 cortex to the V2 cortex to the V4 cortex to TEO and TE, each region
captures higher abstractions of the information around us. Remember again that if the rods and
cones and other early processing in the human visual system only respond to edges and textures,
there have to be later layers in the human visual system that make us understand the scene
around us: maybe a table, a desk, a wall, a water bottle, and so on. So V4 gets higher levels of
abstraction, TEO gets even higher levels, and this is put together as you go deeper and deeper.

(Refer Slide Time: 24:25)

On the other hand, in the Where pathway you go from V1 to V2 to regions called MT and MST, and
then to what is known as the posterior parietal cortex. These cells respond to more and more
complex forms of motion and spatial relationships, and that is where the Where pathway comes
into the picture: while the What pathway takes different features and puts them at higher levels
of abstraction, the Where pathway responds to more complex forms of motion and spatial
relationships.

In fact, it has been shown that damage to the right parietal cortex can lead to a condition
called spatial hemi-neglect, which is considered a disability, where a patient cannot see one
side of their world at all times.

Once again, that relates to the Where pathway: if one part of the parietal cortex is damaged,
the patient really cannot see one side of the scene around them, and behaves as if that left
field does not exist at all. There have been experiments where eye movements were tracked on a
screen, and you can see that the patient focuses only on the right part of the screen; in
another case, where a patient is asked to draw a clock, the patient ends up drawing only the
right side of the clock and does not draw the left side.

These are ways in which this condition is diagnosed; the condition is known as spatial
hemi-neglect or hemispatial neglect.

(Refer Slide Time: 26:02)

To summarize the visual processing hierarchy: you go from the retina to the LGN to the V1
cortex, and from the V1 cortex there are two pathways, the Where pathway and the What pathway.
The What pathway goes from V1 to V2 to V4, where V1 gives you certain low-level attributes of
the image; V2 puts things together and gets edges, borders, colors and so on; V4 gets angles,
curvatures, kinetic contours, motion and so on; TEO gets simple shapes; and TE gets complex body
parts, perceiving the world around us as we see it.

In the Where pathway, you go from V1 to V2 to MT, which detects things like spatial frequency,
temporal frequency, and local and global motion; MST gets even higher levels of abstraction in
terms of movement, such as contractions, rotations, translations and optical flow; and finally
you have multimodal integration and a better understanding in the parietal regions.

(Refer Slide Time: 27:15)

This set of slides was primarily intended to give you a parallel between what we have been
discussing so far and how the human visual system perceives. If you are further interested,
there is a nice summary of what we discussed in the lecture notes of Dr. Aditi Majumder at UCI
on Visual Perception, and there are many more links on the slide which you can read to
understand more; the lectures of Dr. Rajesh Rao, from whom these slides were borrowed, are also
there as one of those links if you want to read more.

(Refer Slide Time: 27:49)

Here are some references for you to read.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 15
Feature Matching
Over the lectures so far, we have talked about basic methods to process images. We talked about
operations such as convolution and correlation, and then about how we can use such operations to
detect edges in images, corners in images, different kinds of corners, different methods to
extract those corners, as well as how to describe these corners in ways in which they can be
used for further tasks. We also talked about how this process could be similar to how the human
visual system perceives the world around us.

One of the aspects that we mentioned is that if you have two different images, and let us say
you want to stitch a panorama consisting of these two images (or more than two), we ideally
detect interest points in both of these images, get descriptors of each of these points, and
then match points across these images.

How you match is what we are going to get into next. Over the next few lectures we will talk
about a few different methods to match key points between images; and not just key points, we
will also use these methods for other kinds of tasks, like finding different kinds of shapes in
images, such as circles, lines, or whatever shape you like, as well as even more descriptors
beyond what we have seen so far.

(Refer Slide Time: 01:52)

Most of this week's lectures are based on the excellent lectures of Professor Yannis, at the
University of Rennes, Inria in France.

(Refer Slide Time: 05:07)

If you recall, we gave this example earlier of two images taken of the same scene, perhaps
from different viewpoints, perhaps at different parts of the day, or perhaps with just different
illuminations or different camera parameters. And if you want to stitch a panorama of these
two images the standard process is to find key points and match them.

So, we know how to find key points in both these images individually. We also know how to
describe each of those key points as a vector. We have seen SIFT, we have seen HOG, we
have seen LBP, we have seen a few different methods to do this. The question that's left is if
you now have the key points and descriptors in two different images, how do you really
match them and be able to align them? That's what we will do next.

(Refer Slide Time: 3:05)

We will start with a very simple method called dense registration through optical flow, a fairly
old method, which pertains to a setting where you have very small changes between images. If you
again take the example of the cell phone, when you gradually move your cell phone over a scene
and then want to stitch a panorama, the differences between successive images are going to be
very small.

If you have tried this yourself, you will notice that in certain cases, if you move your hand
very fast, you will get an error message asking you to repeat and move your hand slowly to get a
panorama from the app on your cell phone.

So, in these kinds of cases the displacement of the scene between successive images is very
little. In these settings you can use this kind of a method called dense registration through
optical flow. Here is a visual example of a scene where a boat is going across water. You can
see that the scene is more or less the same, but a few changes in the positions of the boats.

Our goal here is, for each location in the image, say a key point, to find a displacement with
respect to another reference image. Once you have the displacement, you can simply place one
image on top of the other and align them. This kind of dense registration method is generally
useful for small displacements, such as in stereopsis or optical flow.

(Refer Slide Time: 4:50)

To understand how to do this, let's first take a one-dimensional case, work out the math, and then go to the two-dimensional case. So, let's consider a function 𝑓(𝑥), which is given by this green curve, and a function 𝑔(𝑥), which is simply a displaced version of the same 𝑓(𝑥); mathematically speaking, 𝑔(𝑥) = 𝑓(𝑥 + 𝑡) is just a displaced version of 𝑓(𝑥), and we also assume that 𝑡 is small, since we are only looking at small changes between successive images.

By the first-principles definition of the derivative, ∂𝑓/∂𝑥 = lim_{𝑡→0} (𝑓(𝑥 + 𝑡) − 𝑓(𝑥)) / 𝑡, which would be the formal definition. But we now know that 𝑓(𝑥 + 𝑡) is 𝑔(𝑥), which means we can approximate ∂𝑓/∂𝑥 as (𝑔(𝑥) − 𝑓(𝑥)) / 𝑡. Where do we go from here?

(Refer Slide Time: 6:09)

Now, we define the error between these two signals. In this one-dimensional case it is going to be a weighted combination, very similar to the weighted autocorrelation that we talked about for the Harris corner detector, except that there we talked about autocorrelation, while here we are looking at differences between two signals 𝑓 and 𝑔.

So, the difference is 𝑓(𝑥 + 𝑡) − 𝑔(𝑥), and you take a weighted sum of squared differences to be able to find the actual displacement:

E(𝑡) = Σ_𝑥 𝑤(𝑥) (𝑓(𝑥 + 𝑡) − 𝑔(𝑥))²

Now, the term 𝑓(𝑥 + 𝑡) can be written, using a first-order Taylor series expansion, as 𝑓(𝑥) + 𝑡ᵀ∇𝑓(𝑥), giving

E(𝑡) ≈ Σ_𝑥 𝑤(𝑥) (𝑓(𝑥) + 𝑡ᵀ∇𝑓(𝑥) − 𝑔(𝑥))²

The remaining terms are the same across these two equations; the first term is simply expanded as a first-order Taylor series, which gives the right-hand side of this equation. Where do we go from here?

(Refer Slide Time: 7:21)

We know that the error is minimized when the gradient vanishes, so we take ∂𝐸/∂𝑡, which is just a simple derivative of the right-hand side: the summation over 𝑥 and the weight 𝑤(𝑥) stay the same, and the term that depends on 𝑡 is (𝑓(𝑥) + 𝑡ᵀ∇𝑓(𝑥) − 𝑔(𝑥))².

So, if you take the gradient of that, you are going to have 2 times the entire term inside the brackets times the derivative of the part affected by 𝑡, which is ∇𝑓(𝑥). In other words,

∂𝐸/∂𝑡 = Σ_𝑥 𝑤(𝑥) · 2∇𝑓(𝑥) (𝑓(𝑥) + 𝑡ᵀ∇𝑓(𝑥) − 𝑔(𝑥))

We want to set this gradient to 0 and then solve for what we are looking for.

(Refer Slide Time: 8:13)

So, now simply expanding this equation and moving terms to either side (we are going to ignore the summation and the arguments just for simplicity of explanation), you have 𝑤∇𝑓(∇𝑓)ᵀ𝑡 on one side and 𝑤∇𝑓(𝑔 − 𝑓) on the other; the factor of 2 does not matter because we are anyway equating to 0. So the condition becomes

Σ_𝑥 𝑤 ∇𝑓(∇𝑓)ᵀ 𝑡 = Σ_𝑥 𝑤 ∇𝑓 (𝑔 − 𝑓)

and by solving this for 𝑡 you can figure out the displacement between these two signals.

What is the two-dimensional equivalent? It is exactly the same set of equations, just that instead of a 1D signal you now have an image patch defined by a window 𝑤, and we try to find the error between the patch shifted by 𝑡 in the reference image 𝑓 and the patch at the origin in the shifted image 𝑔.

So, if you moved 𝑓 by a certain 𝑡 in the original image, do you get 𝑔? That is the question we want to ask. We want to find the 𝑡 that minimizes this error, because that would give you the displacement between 𝑓 and 𝑔. By solving for this you can get the value of 𝑡, find the displacement, and be able to match or align these two images. A very simple solution.
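To make this concrete, here is a minimal sketch in Python of how such a displacement could be estimated for a small patch (the function name, the NumPy-based gradients, and the uniform weighting are my own illustrative choices, not from the lecture):

    import numpy as np

    def estimate_displacement(f, g, w=None):
        """Estimate a small 2D shift t between two equal-size patches f and g by
        solving  sum_x w * grad_f * grad_f^T * t = sum_x w * grad_f * (g - f)."""
        f = f.astype(np.float64)
        g = g.astype(np.float64)
        if w is None:
            w = np.ones_like(f)                      # uniform weighting over the window
        fy, fx = np.gradient(f)                      # image gradients of the reference patch
        A = np.array([[np.sum(w * fx * fx), np.sum(w * fx * fy)],
                      [np.sum(w * fy * fx), np.sum(w * fy * fy)]])
        b = np.array([np.sum(w * fx * (g - f)),
                      np.sum(w * fy * (g - f))])
        return np.linalg.solve(A, b)                 # estimated displacement (t_x, t_y)

In practice this single-step estimate would typically be iterated, and it is only meaningful when the true shift is small, as discussed above.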

(Refer Slide Time: 9:51)

One of the problems of this approach is the same aperture problem that we dealt with when we discussed the Harris corner detector.

(Refer Slide Time: 10:02)

Remember, the aperture problem simply means that you can only solve this problem for a very local neighborhood. Why so? Because the entire formulation, and the way we solve the problem, assumes a local neighborhood. If you look at the first-order Taylor series expansion, that approximation holds only for a local neighborhood, which means this entire formulation holds only if the displacement is inside a very small neighborhood; and that is the reason why we said that this method works only when there are very small changes between successive frames.

So, what do we do if there is more than a minute difference between these two images? For
example, a few slides ago, we saw those images of those mountain ranges. It did not look like
those two images were displaced by a very small amount; it looked like there was a
significant rotation or a significant perspective difference in how those pictures were taken.
How do you solve those kinds of problems?

(Refer Slide Time: 11:08)

And for that we move into what is known as wide baseline spatial matching. Wide baseline spatial matching differs from dense registration: just to repeat, in dense registration we started from a very local template matching process and found an efficient solution based on a Taylor approximation. Both of these make sense only when you have small displacements.

(Refer Slide Time: 11:37)

But in wide baseline spatial matching, we are going to assume now that every part of one
image may appear in any part of the second image. It's no longer a small displacement; you
could have a corner point that was lying in the top left of one image and the bottom right of
the other image and we still want to be able to match these points across these images.

How do we go about this? The key intuition is going to be that we start by pairwise matching
of local descriptors. So, you have a bunch of key points in image one, and a bunch of key
points in image two. For each of these key points you have a descriptor, you now match those
descriptors with the descriptors of all key points in the second image.

Wherever you have the best match of descriptors, you are going to say that this point in image one is likely to match with a certain point in image two, and these points could be at completely different coordinate positions in the first image and the second image.

So, we start by pairwise matching of local descriptors with no other order imposed, and then we try to enforce some kind of geometric consistency according to a rigid motion model. We know that in the real world you can perhaps rotate an image, translate or pan your camera, or zoom in and zoom out; there are a few different transformations that are generally possible, and all of them together are what we mean by a rigid motion model or geometric consistency.

So, we are going to assume a particular model of what could have taken place, and using these pairwise matches of local descriptors we are going to try to solve for the parameters of the transformation between the two images. This is going to be the key idea, but we will now talk about how we actually go about doing this.

(Refer Slide Time: 13:29)

So, once again, in wide baseline spatial matching you could have two images such as these, where a region in one image may appear anywhere in the other. There could be a zoom in or zoom out, a different angle, or a translation by some amount; any of those could happen when we try to do this kind of matching.

(Refer Slide Time: 13:52)

So, as we already said, we first independently detect features in both these images, so each of
them are different features that you see across these images.

(Refer Slide Time: 14:04)

Then we do pairwise descriptor matching. For each detected feature, we can compute a descriptor such as the histogram of oriented gradients, local binary patterns, or the variant of the histogram of oriented gradients that SIFT uses, and so on. You then do a pairwise matching of the descriptors between the key points of these two images.

Clearly, when there is a lot of change between two images, it is not necessary that every key point will match with some key point in the other. In this particular case you can see that the car does not even exist in the second image, so any key points on the car would not have an equivalent match in the second image, which is perfectly fine with us. So, only a subset of the features detected in the first step would actually lead to matches.

In both these cases, even in the first image, only a subset of features will match with the
second image. Even among all the features detected in the second image, only a subset of
features from the second image would match with features in the first image. How do you
match? Once you get the descriptors in terms of vectors you can simply take the Euclidean
distance to match, you can use other kinds of distances too, but you can simply use the
Euclidean distance between the descriptors of the features in both these images to be able to
match.

(Refer Slide Time: 15:32)

So, once we get these tentative correspondences, we assume a certain geometric model. For example, we may know that in our particular domain only a translation is possible, or only a translation and rotation, because the camera cannot zoom in or zoom out.

So, if you knew the conditions under which a particular capture was taken, you know what transformation could have taken place between the first image and the second image; or you assume a certain rigid transformation, and you find, among the pairwise correspondences that we saw on the previous slide, which of them hold true for the rigid transformation that you assumed.

We will come back a bit later in this lecture to how that rigid transformation is represented and how we find points that are inliers; we will return to this in a few slides from now, but this is the overall idea. So, among all of those correspondences you narrow down to a few which satisfy your hypothesis of what could have happened.

(Refer Slide Time: 16:38)

And then once you get that subset of inlier correspondences you can simply match and find
the transformation and align one image on top of the other. So, let's talk about this in more
detail over the next few slides. So, we first extract descriptors from the key points in each
image, so for each detected feature you could do something like construct a local histogram
of gradient orientations, you could do other kinds of things too, this is just an example, you
find one or more dominant orientations corresponding to the peaks of the histogram.
Remember, in SIFT, we talked about finding the orientation of each key point, that's what
we're talking about here.

At that point, you may want to resample the local patch at the given location, scale, and orientation, based on what feature detector you used. You could have a location for that particular key point, a scale, and also an orientation, so you resample the local patch accordingly. When we say resample, we mean that if it is a rotated patch, you may want to resample it by doing some interpolation, and so on.

You resample the local patch and then find a descriptor for each dominant orientation. That gives you your descriptors. Remember, again, just as we discussed for SIFT, you could take multiple descriptors for each corner key point if there are different dominant orientations. We talked about this earlier too.

(Refer Slide Time: 18:06)

Okay, now at the end of that step we have a bunch of descriptors in image one and a bunch of descriptors in image two. How do we go forward? For each descriptor in one image we find its two nearest neighbors in the other image. Why two? That is just one choice; you can also take a different number of nearest neighbors if you like.

In this method we take two nearest neighbors and then evaluate the ratio of the distance to the first over the distance to the second: you have the distance between the descriptor in the first image and its closest match in the second image, and the distance from the same descriptor to the second-closest match. If the ratio between the two is close to one, it means both are equally good matches.

If in one case the distance is very low, but in the second case the distance is very high, you now know which of them is significantly closer, and you can threshold to find out which matches are strong. So, whenever this ratio is small you know that you have found a very strong match, because the second-nearest distance is much farther away; that is what this ratio measures.

So, whenever you have a strong match, you are going to consider that a correspondence, and after you do all these pairwise matchings you have a list of correspondences between image one and image two. What do we mean by correspondences? We are simply saying that descriptor D1 in image one corresponds to descriptor D10 in image two, something like that. You can just write out a table of correspondences between the descriptors of these two images.

(Refer Slide Time: 19:52)

Okay, here is an illustration of the ratio test. You can see that for correct matches the ratio of distances forms this kind of distribution; it is much smaller. Whereas for incorrect matches the ratio keeps moving up towards one: the first match is about as good as the second match, and then you are not very sure whether the match is strong enough. When the first match's distance is much smaller than the second match's distance, you know that you are doing a good job. You can, as I said, also extend this to more nearest neighbors and generalize the concept of the ratio if you would like a better idea of the robustness of the match.
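As a rough sketch of how this nearest-neighbor ratio matching could be implemented (the 0.8 threshold and the brute-force search over all descriptors are assumed, illustrative choices, not values from the lecture):

    import numpy as np

    def ratio_test_matches(desc1, desc2, ratio=0.8):
        """Match each descriptor in image 1 to its nearest neighbour in image 2,
        keeping only matches that pass the distance-ratio test."""
        matches = []
        for i, d in enumerate(desc1):
            dists = np.linalg.norm(desc2 - d, axis=1)   # Euclidean distance to every descriptor in image 2
            first, second = np.argsort(dists)[:2]       # indices of the two nearest neighbours
            if dists[first] < ratio * dists[second]:    # strong match: clearly closer than the runner-up
                matches.append((i, first))
        return matches

For large numbers of descriptors, approximate nearest-neighbor structures such as k-d trees are typically used instead of this brute-force loop.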

(Refer Slide Time: 20:40)

Once you have identified these good matches, we move on and try to estimate which of them are inliers with respect to the rigid transformation that we assumed. Before we go there, let us first see why this is a difficult process in itself. We have so far spoken about a few steps. Firstly, we have to choose key points, or these kinds of correspondences, that allow for a geometric transformation, which may not be trivial in several images.

Fitting the model, the geometric transformation, to the correspondences we have found could be sensitive to outliers. It is possible, just by chance, that a correspondence is wrong: maybe in the new image there was an artifact that was not there in the first image, which ended up matching a key point of the first image. That would simply be an outlier match, which could make fitting your geometric model a little harder.

To find inliers to a transformation, you first of all need a transformation. So far, I kept telling you that you can assume a transformation, but assuming a transformation is not trivial. You need domain knowledge, and you may perhaps need to do something more to find out what the transformation should be in the first place, before fitting these correspondences to a transformation.

Apart from outliers, correspondences can also have gross errors. It is likely that in certain cases the correspondences lead to mistakes; it is possible that HOG was not the right descriptor to get correspondences for certain features, so you could have errors in these kinds of cases. And inliers are often less than 50 percent of your total correspondences, generally even fewer, but typically less than 50 percent. This means the number of inliers that you are left with at the end, which you can actually work with, is very small.

(Refer Slide Time: 22:44)

So for the next part, to understand how we match these correspondences to the rigid transformation model, let us talk about what we mean by geometric transformations and rigid transformations here, and then we will come back and try to align the correspondences to one particular transformation.

Given two images 𝐼 and 𝐼′ that are equal at two points 𝑥 and 𝑥′, we know that 𝐼(𝑥) = 𝐼′(𝑥′). This simply says that across these two images you could map the point 𝑥 to the point 𝑥′ in the second image, or rather, you can write this as 𝑥′ being some transformation of 𝑥.

We got the point 𝑥′ by perhaps rotating the first image, translating the first image, or zooming into the first image. We are going to refer to all those kinds of transformations, such as rotation, translation, and scaling, through a transformation matrix 𝑇. And what does 𝑇 do? 𝑇 is an operation that takes a vector in ℝ² and gives you another vector in ℝ². Remember, any matrix can be looked at as a transformation from this perspective.

So, given a point at a coordinate location (𝑥, 𝑦) in image 1, the transformation matrix 𝑇 takes
you to another point (𝑥', 𝑦') in your second image. And this transformation is going to be a
bijection, which means it's a one to one match between image one and image two. Every
point in image one matches to only one point in image two and every point in image two
matches to only one point in image one, it's going to be a bijection.

(Refer Slide Time: 24:35)

Let us try to study what T looks like. T is a transformation, we said it's a matrix. So, for a
certain set of common transformations, T is fairly well defined, especially rigid body
transformations, and this has been extensively studied, especially in the graphics-based vision
that we talked about in the first lecture.

So, we will briefly talk about this now to understand how the matching is done. Suppose you have this green triangle in the first image, and you translate it, that is, you just move it slightly along the x-axis or the y-axis or both, so that it moves to a slightly different location in the second image.

In this particular case, you would define the transformation by a 3×3 matrix whose top-left 2×2 block is the identity (1, 0; 0, 1), and whose last column contains 𝑡_𝑥 and 𝑡_𝑦, the translations along the x-axis and the y-axis. If you work this out, whenever you apply this transformation to (𝑥, 𝑦, 1), where the 1 is simply the homogeneous (normalized) coordinate used to represent the transformation, you get an outcome (𝑥′, 𝑦′, 1).

Why so? Let us analyze this a bit carefully. It is simply a matrix-vector multiplication. If you carry out the multiplication, you will see that this is just another way of writing a system of equations, and the system of equations says 𝑥′ = 𝑥 + 𝑡_𝑥 and, similarly, 𝑦′ = 𝑦 + 𝑡_𝑦. The third equation does not matter; you just get 1 = 1. But this is exactly what we are looking for.

This is just a system of equations, and we are simply writing it in terms of a matrix transformation on a vector that gives you another vector. This is a translation. Let us see one more.

(Refer Slide Time: 26:39)

If you take rotation, this green triangle is now simply rotated. There is no translation, only rotation. You can see that zeros are placed in the translation entries, which means there is zero translation, but there is rotation, and in this case the upper 2×2 block of the 3×3 matrix is (cos θ, −sin θ; sin θ, cos θ). I will let you look at this more carefully; it is a simple expansion again: you would have 𝑥′ = 𝑥 cos θ − 𝑦 sin θ and 𝑦′ = 𝑥 sin θ + 𝑦 cos θ, which simply represents the new coordinates based on your rotation angle θ.

(Refer Slide Time: 27:30)

So, you can see that, going back to the previous slide, translation has two degrees of freedom, 𝑡_𝑥 and 𝑡_𝑦. Rotation has just one degree of freedom, given by the angle θ.

(Refer Slide Time: 27:43)

Another transformation is called the similarity transformation, which has four degrees of freedom: it combines rotation (one degree of freedom) and translation (two degrees of freedom), and you also have a scaling aspect given by 𝑟, which can change the size of the object in the second image. When we say size or scale, remember that it would correspond to zoom in or zoom out in terms of the camera parameters. So, now you have 𝑟, θ, 𝑡_𝑥, and 𝑡_𝑦: four degrees of freedom in this geometric transformation.

(Refer Slide Time: 28:23)

Going forward, this is another example of a similarity transformation where you can see the zoom out in action: 𝑟 has a value different from one, showing a similarity transformation where the scaling is operational.

(Refer Slide Time: 28:42)

Another transformation is known as the shear transformation; you can see here how the triangle gets transformed between image one and image two. Shear is as if you apply pressure along one side of the triangle and extend it while keeping the other side constant. It is given by setting just the quantities 𝑏_𝑥 and 𝑏_𝑦 in your transformation matrix while the diagonal entries stay one. You can write out the equations of shear as 𝑥′ = 𝑥 + 𝑏_𝑥 𝑦 and 𝑦′ = 𝑏_𝑦 𝑥 + 𝑦. This is simply a linear system of equations and a way of writing the transformation.

(Refer Slide Time: 29:37)

Furthermore, a popular transformation known as the affine transformation has six degrees of freedom, where you can have values in any of those six spots of the transformation matrix that we spoke about. We are going to stick to these sets of rigid-body transformations at this time. There are transformations that also use the values in the bottom row, which are projective or perspective transformations; we are not going to get into those at this point, we are going to stick to affine transformations.
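To make these matrices concrete, here is a small Python sketch of a few of them in homogeneous coordinates, applied to a point (the helper names and example values are my own, purely for illustration):

    import numpy as np

    def translation(tx, ty):
        """3x3 homogeneous matrix for a pure translation (two degrees of freedom)."""
        return np.array([[1.0, 0.0, tx],
                         [0.0, 1.0, ty],
                         [0.0, 0.0, 1.0]])

    def rotation(theta):
        """3x3 homogeneous matrix for a pure rotation by theta (one degree of freedom)."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, 0.0],
                         [s,  c, 0.0],
                         [0.0, 0.0, 1.0]])

    def similarity(r, theta, tx, ty):
        """Scale r, rotation theta and translation together (four degrees of freedom)."""
        c, s = r * np.cos(theta), r * np.sin(theta)
        return np.array([[c, -s, tx],
                         [s,  c, ty],
                         [0.0, 0.0, 1.0]])

    # Applying a transformation to the point (x, y) = (2, 3) in homogeneous coordinates
    p = np.array([2.0, 3.0, 1.0])
    T = similarity(r=1.5, theta=np.pi / 6, tx=4.0, ty=-1.0)
    x_new, y_new, _ = T @ p        # the transformed point (x', y')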

(Refer Slide Time: 30:07)

So, in all of these cases, using the tentative correspondences that we get between two images, we can find out which 𝑥′ matches which 𝑥: a point (𝑥′, 𝑦′) in image 2 could be matching with (𝑥, 𝑦) in image 1. We already have a list of correspondences based on the matching of descriptors. Our job is to find the parameters of this transformation 𝑇; that is what we want to look for. Clearly, this is about solving a linear system of equations.

(Refer Slide Time: 30:44)

So, we want to solve a linear system 𝐴𝑥 = 𝑏, where 𝑥 and 𝑏 are the coordinates of the known point correspondences from images 𝐼 and 𝐼′, and 𝐴 contains our model parameters that we want to learn.

(Refer Slide Time: 31:00)

Ideally speaking, if we have 𝑑 degrees of freedom in a given transformation, we need ⌈𝑑/2⌉ correspondences. For example, translation has two degrees of freedom, which means you need only one correspondence: if you have one point in one image and the corresponding point in the second image, you can find both 𝑡_𝑥 and 𝑡_𝑦, because you would know how much you moved in 𝑥 and how much you moved in 𝑦. So, for 𝑑 degrees of freedom, you need ⌈𝑑/2⌉ correspondences from the descriptor matching.

(Refer Slide Time: 31:44)

Okay, now, how do you solve this? Just to recall what we have talked about so far: we found key points in each of the images, we found descriptors, and then we matched the descriptors between these two images; based on the nearest-neighbor ratio approach we pruned those descriptor matches to a small set of strong matches, and among those we now want to find out which of them suit the rigid-body model that I am going to assume for the transformation between the two images.

(Refer Slide Time: 32:29)

So, if I assume an affine transformation, then using the set of correspondences that I have, I ideally have to solve for these six values as my transformation; and once I have solved for these values, I know what the transformation between these two images was, so I can simply place one image on top of the other using the transformation, blend them, and create a panorama.
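A minimal sketch of how those six affine values could be solved for from point correspondences via least squares (the function name and the array layout are my own assumptions, not the lecture's notation):

    import numpy as np

    def fit_affine(src, dst):
        """Least-squares estimate of an affine transform mapping src -> dst.
        src and dst are (N, 2) arrays of corresponding points, with N >= 3."""
        rows, rhs = [], []
        for (x, y), (xp, yp) in zip(src, dst):
            rows.append([x, y, 1, 0, 0, 0]); rhs.append(xp)   # x' = a11*x + a12*y + a13
            rows.append([0, 0, 0, x, y, 1]); rhs.append(yp)   # y' = a21*x + a22*y + a23
        params, *_ = np.linalg.lstsq(np.array(rows, float), np.array(rhs, float), rcond=None)
        return np.vstack([params.reshape(2, 3), [0.0, 0.0, 1.0]])   # full 3x3 matrix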

(Refer Slide Time: 32:50)

So, we are left with one task: how do you actually estimate those parameters given those correspondences?

(Refer Slide Time: 32:54)

Let's start with the simplest model that we can all imagine: if you have two points, you can fit a line through them. Let us take this as the example we are going to use, and describe it further.

(Refer Slide Time: 33:09)

So, if you had a least-squares approach to fitting correspondences, this is what you would have. If you have a bunch of correspondences like these, this is clean data with not many outliers, and the least-squares fit would give you a fairly good equation for the line. We are talking about the transformation in a slightly more abstract sense now, but we will come back and make clear how you really estimate the parameters of the transformation.

(Refer Slide Time: 33:37)

However, if there are outliers in your matches, then the least-squares fit fails and gives a very different answer compared to what should have been the right answer. So, what do we do here?

(Refer Slide Time: 33:58)

At this point, one of the most popular methods for matching features between images comes to the rescue: RANSAC, which stands for Random Sample Consensus. It is a very simple iterative method; it can be somewhat computationally intensive, but it works very well and has been used for many decades to match correspondences between two images.

For that matter, it can match correspondences not just between images but between any sets of observations in data. So, let us assume these are the correspondences that we have: data with some outliers, and we ideally want to fit a line to this data. We will talk about fitting a line to data at this point, and later come back to the analogy of fitting an affine transformation, or any other rigid-body transformation, to a set of correspondences in images.

(Refer Slide Time: 35:05)

So, if you have such a set of data points and we want to fit a line to this particular set of data,
what random sample consensus suggests to us is, you take two points at random, any two
points, you can see two points that have been picked in red. We know how to fit a line for two
points.

(Refer Slide Time: 35:18)

So you fit a line for those two points. You take a neighborhood of that line on both sides and
see how many points in that neighborhood are inlier points or how many of them correspond
to points that you are looking for. You can see here in green that there are about six points
that are in the neighborhood of this line that fit on these two points.

(Refer Slide Time: 35:47)

Now, you take another two points at random, again fit a line through those two points, take a neighborhood, and count how many of your given data points lie inside this neighborhood. You can see that this is going to be a fair few, about 15 or 20, which is greater than the 6 that we got with the previous random pick.

What does this tell you? That this line may be a better fit than the previous line. So among all
the lines that we have seen so far we try to retain the line which has the highest number of
inliers, and throw away the previous lines that we came across. This is the line that we have
now.

(Refer Slide Time: 36:32)

You again repeat this process: pick another two points, fit a line, see how many points are in the neighborhood of that line. In this case you get a lesser number for such a line, and you keep repeating this process over and over again, but keep your best hypothesis saved with you. You see that in this particular example, as you keep repeating this, the line that you have here, which goes through one of the random picks, is the best possible hypothesis, and that is the hypothesis you are going to go with: the model that you are going to say best fits the data.

(Refer Slide Time: 37:07)

So here are other examples of lines that you could draw, and in all of those cases, you know that the line in the middle has the highest number of inliers in its neighborhood; that is the one you will go with. Let's try to write out RANSAC more formally in terms of an algorithm.

So, you are given a set of tentative correspondences X, the correspondences we started with; that is your data. Let n be the minimum number of samples needed to fit a model. Once again, remember, if you are trying to fit a rotation model, it has one degree of freedom, so you need at least one correspondence to solve the problem; that one correspondence is what we are referring to as n here.

And let us assume that in each of these cases you have a score 𝑠(𝑥; θ), which gives the score of a sample 𝑥 given model parameters θ. Here θ is just the set of model parameters we are trying to learn: if it is rotation, you have just one parameter θ; if it is translation, then θ would be (𝑡_𝑥, 𝑡_𝑦); if it is affine, we already saw that you have 𝑎₁₁ all the way to 𝑎₂₃, six such values; and so on. Those are the model parameters that you want to learn. What RANSAC says is: you start with a hypothesis, verify the hypothesis, and keep iterating, storing the best hypothesis so far.

So, you draw 𝑛 samples from X at random; let's call that set of 𝑛 samples 𝐻. You fit the model to 𝐻 and compute the parameters θ. You then compute a score measuring how much of your data is consistent with this hypothesis. If this score is better than the earlier best score, you make the new hypothesis the best hypothesis, and you keep repeating this process.

Let's talk about this in the context of matching features. Take an affine transformation, which has six parameters, so you need three pairs of correspondences to be able to fit it. You draw three pairs at random from the tentative correspondences you have and fit an affine transformation model; by fitting we mean you can actually solve for 𝑎₁₁ to 𝑎₂₃ using a linear system of equations.

Once you solve that linear system (for example, with a least-squares fit), you get a set of values 𝑎₁₁ to 𝑎₂₃. Now, you try to see, among all your correspondences, which of them stay within an ϵ neighborhood of this particular transformation. That is, you take all the points from image 1 that you are considering in your tentative correspondences, apply this transformation, for which you now have the 𝑎₁₁ to 𝑎₂₃ values, and see where those points go in image 2.

If, for many of your correspondences, these mapped points lie within an ϵ neighborhood of their matches, you are going to consider that hypothesis one with a good number of inliers, and possibly the best hypothesis so far. You just keep repeating this, randomly drawing three correspondences each time in the case of an affine transformation; for a translation transformation you have to draw just one correspondence each time to fit the model. That's the main idea of RANSAC.
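Here is a minimal RANSAC sketch for the line-fitting example discussed above (the implicit line representation, the fixed iteration count, and the inlier threshold eps are illustrative choices, not values from the lecture); fitting an affine transformation instead would only change the model-fitting and scoring steps:

    import numpy as np

    def ransac_line(points, n_iters=1000, eps=1.0):
        """Fit a line a*x + b*y + c = 0 to 2D points with RANSAC.
        Each iteration: draw 2 points, fit, count inliers within eps, keep the best."""
        best_model, best_inliers = None, 0
        rng = np.random.default_rng()
        for _ in range(n_iters):
            i, j = rng.choice(len(points), size=2, replace=False)
            p1, p2 = points[i], points[j]
            a, b = p2[1] - p1[1], p1[0] - p2[0]        # normal of the line through p1, p2
            norm = np.hypot(a, b)
            if norm == 0:
                continue                                # the two samples coincide; redraw
            c = -(a * p1[0] + b * p1[1])
            dists = np.abs(points @ np.array([a, b]) + c) / norm
            inliers = int(np.sum(dists < eps))          # score = number of points near the line
            if inliers > best_inliers:
                best_inliers, best_model = inliers, (a / norm, b / norm, c / norm)
        return best_model, best_inliers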

(Refer Slide Time: 40:54)

One limitation of RANSAC arises when the inlier ratio, which means the number of inliers divided by the total number of points in the data, is unknown; then you could have problems fitting. Another arises when the inlier ratio is very small and the minimum number of samples to be drawn is large.

Let us assume that your model has, say, 12 free parameters, which means you need to draw at least six correspondences to fit each model, and then you need to check how many of the correspondences lie within a particular ϵ distance, and so on. If your original inlier ratio itself was small in these images, then you could have a tough time; in such a case it could be equivalent to needing about 10⁶ iterations to ensure a 1% probability of failure. I am going to leave this for you to work out later, by just giving a few hints.

(Refer Slide Time: 42:05)

Remember that 𝑤 is the inlier ratio, which means 𝑤ⁿ is the probability that all 𝑛 points you picked are inliers; here 𝑛 is the number of points you have to pick to be able to solve for the model. This means 1 − 𝑤ⁿ is the probability that at least one of those points is an outlier, and when one of those points is an outlier, we are going to get a bad model which may not have too many correspondences in its neighborhood.

(Refer Slide Time: 42:46)

If you run 𝑘 different iterations, then the probability that the algorithm never selects a set of 𝑛 points which are all inliers is (1 − 𝑤ⁿ)ᵏ. Using these terms, you should be able to work out how many iterations are needed to keep the failure probability below a desired level, for example the 1% mentioned earlier. Try working it out by yourself.
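As a quick worked example of that calculation: requiring (1 − 𝑤ⁿ)ᵏ ≤ p_fail and solving for 𝑘 gives 𝑘 ≥ log(p_fail) / log(1 − 𝑤ⁿ). A small helper (my own, purely illustrative) could look like this:

    import numpy as np

    def ransac_iterations(w, n, p_fail=0.01):
        """Smallest k with (1 - w**n)**k <= p_fail."""
        return int(np.ceil(np.log(p_fail) / np.log(1.0 - w**n)))

    # Example: inlier ratio w = 0.5 and an affine model needing n = 3 correspondences
    k = ransac_iterations(0.5, 3)      # about 35 iterations for a 1% failure probability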

(Refer Slide Time: 43:16)

Here are a few visual illustrations of how well RANSAC works for different kinds of transformations. Here is an example of rotation: this is the original object, a food box, rotated by a certain degree and placed in a different position in the second image, and RANSAC finds a fairly good transformation between these two settings.

(Refer Slide Time: 43:43)

It also works well at estimating what is known as a fundamental matrix, which relates two views of the same scene. If you have two different views, remember, this is how you would build a 3D model of a given scene: if you wanted to build a 3D model of, say, this statue, you would ideally take multiple images by slowly moving around the 3D object. In each of those cases, between every pair of images that you capture, you have to estimate this transformation matrix, which is known as the fundamental matrix in this particular setting.

(Refer Slide Time: 44:22)

RANSAC is also used for what we started with in this lecture, which is to compute what is known as a homography, the term typically used for the transformation matrix in the case of image stitching. We said it could be an affine transformation, a translation, a rotation, and so on; that transformation matrix is typically called a homography when you work in the context of image stitching.

(Refer Slide Time: 44:50)

That concludes this first lecture on feature matching. We will see a few other methods as we go, but your readings for this lecture are Chapter 4.3 and Chapter 6.1 of Szeliski's book, as well as, if you are interested, the papers cited on the respective slides for more illustration; the references are listed here.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 16
Hough Transform

The next topic that we are going to talk about, in trying to match shapes or other templates in an image, is the method known as the Hough Transform.

(Refer Slide Time: 00:31)

Once again, these lecture slides are also based on the excellent lectures of Professor Yannis at Inria Rennes, as well as the lectures of Professor Mubarak Shah at the University of Central Florida.

(Refer Slide Time: 00:48)

We have already seen a couple of line fitting methods. We saw least squares fit and then we
saw RANSAC. The question that you are going to ask now is, what do you do if there are
multiple lines placed in a particular orientation or placed in a particular configuration in an
image? It could be multiple lines, it could be a polygon, it could be a circle, then how do you
deal with fitting a shape onto the set of data points that you have? That is what we are going to talk about now, with the Hough Transform.

(Refer Slide Time: 1:30)

This method dates back to the early 60s, to Hough, who filed a U.S. patent for it, but since then there have been many efforts that have translated it into what we see today. In the early 70s, the Hough Transform was used for detecting lines and curves. Then in the early 80s came a method called the Generalized Hough Transform, and we will see all of these over the next few slides.

Let us start with the simple equation of a line in Cartesian coordinates, which we all know is y = mx + c, where m is the slope and c is the y-intercept. You could just play with the terms of the equation a little and write this as c = −mx + y.

In other words, we now live in an (m, c) space, where −x defines the slope and y defines the c-intercept of a line in that space. So, we were originally in the (x, y) space, where m defined the slope and c defined the y-intercept, and we rearranged the terms slightly to go to an (m, c) space where −x becomes the slope and y becomes the c-intercept. In this context, every point in (m, c) space becomes the equation of a line in the (x, y) space; remember, for every choice of m and c, you get the equation of a line, as we see here. And every point in the (x, y) space similarly becomes the equation of a line in the (m, c) space. There is a dual correspondence between these two spaces. So, what do we do with this?

(Refer Slide Time: 3:35)

We know that to fit any model we need a certain number of samples. For example, to fit a line, we need two samples; for any other model, say a square or a circle, we need a different number of points. Let us start with the line example. Even if you had just one point instead of two, you know that it could belong to a certain family of lines; it gives you some partial information about the line equation itself.

So, what we can ask every point in the (x, y) space to do is to vote for certain configurations of m and c, in case you are talking about a line. Every point can vote for the configurations of m and c that would have resulted in that point being at that particular location. You then collect the votes from all such points and try to seek consensus, and wherever the majority of votes fall, you are going to say that that is the equation of the line you are looking for. Let us elaborate on this and go through it slowly.

(Refer Slide Time: 4:50)

Before we go forward: we spoke about Cartesian coordinates on the previous slide. Unfortunately, the Cartesian formulation can be problematic for vertical lines. Why? When you have a vertical line, the slope is unbounded, and it becomes difficult to represent such a line in the (m, c) space because m has to be infinity.

So, in such a scenario, it may be wiser to use a slightly different parameterization, the polar parameterization, written in terms of ρ and θ, where ρ is the distance of the line from the origin and θ is the angle the normal to the line makes with the x-axis. You can then write ρ = x cos θ + y sin θ. This is the standard polar parameterization used when going from Cartesian coordinates to polar coordinates.

We know now that in this space, ρ is lower bounded by 0 and θ lies between 0 and 360 degrees. These quantities are bounded, which makes them slightly easier to work with.

(Refer Slide Time: 6:17)

So, every point in your original Cartesian space (x, y) votes for many points in your parameter space. Suppose you have a certain point x₁ in your original Cartesian space. For the moment, we are going to talk about this method generally, in terms of a Cartesian space and fitting a line or a circle and so on.

A little later in this lecture, we will talk about how you use this with images. If you already want a sense of what is coming: we are trying to find out how to use the Hough Transform to find lines, circles, or any other shape in an image. That is the goal of the method discussed in this lecture.

But before we go to an image, we are just going to talk generally in Cartesian coordinates. If you have a point x₁ in the (x, y) space, you know that there is a family of lines that can pass through x₁, and each of those lines has a corresponding (m, c): a slope and a y-intercept. All of them result in corresponding points in the (m, c) space, or equivalently, if you use the polar parameterization, in the (ρ, θ) space, where each line corresponds to a point.

So, every point (x, y) can vote for many points in your parameter space, namely the parameters of the lines that the point could be lying on. Each line through this point (x₁, y₁) is a vote for a point in the parameter space that satisfies ρ = x₁ cos θ + y₁ sin θ. Let us see a few examples.
examples.

(Refer Slide Time: 8:12)

Here is another line that passes through x₁, and that would vote for a different point in the parameter space, which is here.

(Refer Slide Time: 8:23)

Here is another line that passes through x1 , that would vote for a different point in the
parameter space. Remember the parameter space now is defined in a polar parameterization.

(Refer Slide Time: 8:36)

And you keep repeating this for several lines that could have passed through x1 and all of
them get votes in the parameter space.

(Refer Slide Time: 8:46)

Similarly, you could have another point x₂ in your original Cartesian space, through which another set of lines can pass, and each of those lines votes for a different parameterization in your parameter space. You can see again here, as examples, that for several lines passing through x₂ you get several votes in the parameter space.

(Refer Slide Time: 9:17)

Let us try to recount this through an algorithm. You have your data X, and you have a set of quantized parameters, from θ_min to θ_max. For example, with the polar parameterization, you may not want to vote for every angle between 0 and 360 degrees; you may want to divide 0 to 360 into 10-degree bins and have only 36 possible θ values. That is what we mean by a quantized parameter here.

You also initialize an accumulator array A, which is simply a frequency count of votes; you can look at it that way. Then, for every point (x, y) in your Cartesian space, and for every θ in your quantized set of θ values, you write out ρ = x cos θ + y sin θ; this tells you, for each set of model parameters consistent with the sample, which ρ and which θ align with this particular (x, y). You increment A for that θ and ρ.

So, you can imagine the accumulator as a 2D matrix with θ values on one axis and ρ values on the other; ρ is also quantized, say into bins of a certain number of distance units. For whichever ρ and θ correspond to the (x, y) that we have, you go to that cell, check for consistency, and increment that cell of the accumulator matrix by 1. You keep doing this for every (x, y) point given to you, keep accumulating, and at the end you do a non-maximum suppression in A.

So, if you have many different bins with high votes, you do a local non-maximum suppression to see which bin truly has the highest votes, and you take that θ and ρ to be the model that corresponds to the data in X. This gives you a way of estimating θ and ρ, or in Cartesian coordinates, finding the equation of the line corresponding to the set of points given to you.

As you can see, this is very similar to a least-squares fit, but done in a slightly different way. We are still trying to fit a line to a set of points given to us, but we are doing it using the Hough voting approach.
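A compact sketch of this voting procedure in Python could look as follows (the bin counts and the nearest-bin lookup are illustrative choices; a real implementation would vectorize this and apply non-maximum suppression over the accumulator rather than take a single maximum):

    import numpy as np

    def hough_lines(points, n_theta=180, n_rho=200):
        """Vote in (rho, theta) space for lines rho = x cos(theta) + y sin(theta).
        points is an (N, 2) array of (x, y) samples."""
        rho_max = np.max(np.abs(points)) * np.sqrt(2)
        thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)   # quantized angles
        rhos = np.linspace(-rho_max, rho_max, n_rho)                # quantized distances
        A = np.zeros((n_rho, n_theta), dtype=int)                   # accumulator array
        for x, y in points:
            for j, theta in enumerate(thetas):
                rho = x * np.cos(theta) + y * np.sin(theta)
                i = np.argmin(np.abs(rhos - rho))                   # nearest rho bin
                A[i, j] += 1                                        # cast a vote
        i, j = np.unravel_index(np.argmax(A), A.shape)              # strongest line
        return rhos[i], thetas[j], A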

(Refer Slide Time: 11:54)

Now, as promised, here is an example from an image perspective. Let us say we have an image with several lines in it, and we want to find the equations of those lines. There could be several applications where this is required; for instance, you may be taking images of a computer chip and trying to find out how the lines on the chip are aligned. You want to find the equations of those lines.

(Refer Slide Time: 12:24)

So, using the same approach that we just talked about on the earlier slide, you assume a certain parameterization, you ask each point to vote for the parameterizations that may have generated that particular point, and you end up with votes over your 2D accumulator matrix, indexed by ρ and θ. You can see here that you have a bunch of votes spread across many different accumulator bins; we said the accumulator is a matrix over ρ and θ.

(Refer Slide Time: 12:54)

Now, you do non-maximum suppression to retain only the local maxima in your accumulator. Once you do this, you find that there are four maxima here, which correspond to four parameterizations of lines: the four lines that you actually have in the original image. And those are the points that were voting for each of those accumulator bins. This way you can find the equations of those lines, as well as which points in the original image correspond to those equations, and thereby extract that shape from the image. So, you can find the equations of the lines in any image given to you using this approach.

(Refer Slide Time: 13:44)

Let us make this a bit harder now. Let us say we want to find circles in an image. We did talk about blob detection earlier, but now we want to find a circle exactly, with its radius and center, and parameterize it perfectly. Circle fitting is very similar to line fitting; the equation you are fitting is (x − x₀)² + (y − y₀)² − r² = 0. What should the dimensions of the accumulator be? For a line, the accumulator was a 2D matrix. What would the dimension of the accumulator be for circle fitting?

(Refer Slide Time: 14:29)

It would be a three-dimensional accumulator, where you have x₀ and y₀, which correspond to the center of the circle, and r, which corresponds to the radius of the circle. Those are the three quantities that each point is going to be voting for. Since we have three parameters here, one option is to fix one of them, generally the radius, and then look for the rest: ask every point to vote for where the center is, and then go around in a round-robin manner to ensure that every point votes for all three parameters in the end.

So, you keep incrementing the accumulator for every vote that a point makes for each of these values, and at the end you find the local maxima in A; this way you can estimate the parameterization of the circle you are looking for. Here is a visual illustration: given a set of coins in an image, you can find out where those coins lie and what the radius of each coin is by first extracting edges.

So, here you have to first extract edges, and then use that edge information to find the exact parameterizations. When I say parameterizations, I mean the centers and radii of each of the circles that you have.
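As an illustration of one slice of that three-dimensional accumulator, here is a sketch that votes for circle centers at one fixed, known radius (the function name and the angular sampling are my own assumptions); repeating it over a range of radii fills the full accumulator:

    import numpy as np

    def hough_circle_centres(edge_points, radius, acc_shape, n_angles=64):
        """Vote for circle centres at a fixed radius; edge_points is an (N, 2) array of (x, y)."""
        A = np.zeros(acc_shape, dtype=int)
        angles = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
        for x, y in edge_points:
            # every centre at distance 'radius' from this edge point gets a vote
            for a in angles:
                xc = int(round(x - radius * np.cos(a)))
                yc = int(round(y - radius * np.sin(a)))
                if 0 <= xc < acc_shape[0] and 0 <= yc < acc_shape[1]:
                    A[xc, yc] += 1
        return A   # peaks in A are likely circle centres for this radius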

(Refer Slide Time: 15:56)

You need not stop with just lines or circles; you can obviously do this for any other shape. As long as you can parameterize a shape in an analytical manner, that is, give a set of equations that define the shape, all you have to do is take a point in your original image and ask it to vote for the parameters of the polygon that the point could belong to or lie on. Then you just count which parameterization got the highest votes amongst all the possibilities in your parameter space, and define that to be the polygon or shape you are looking for.

But what if you are looking for a shape that has no analytical description? For example, on this slide you see this irregular shape, which has no analytical definition. How do you find such shapes in an image? We do assume that the shape is given to us.

So, you do know that this is the shape that you are looking for in a given image and it could
be of different sizes, could be rotated in different ways, it could be scaled, all of those
transformations are possible. But we want to find where in the image it lies. The only thing
we know is that the shape in this particular case may not have an analytical definition. You
cannot write it as an equation. What do you do in this particular case?

You define a reference point in the shape. For example, you could define a point (x₀, y₀) as a reference point in that shape, and now every edge point of the shape can be defined with respect to that reference point. So, every point (x, y) here has a certain distance from the reference point (x₀, y₀) and a certain angle with respect to it. As an example, (x₀, y₀) could be the centroid of the object you are looking for, but it need not be the centroid; it could be any reference point.

So, once you have a centroid or any other reference point: remember that given a new image, you first have to do edge detection to get the edge points. You would first do this for the template object you have, before you run it on any new image. That is, knowing that this is the shape you are looking for, you define a reference point, and for each edge point in this shape you compute its distance to the reference point, rᵢ, and its edge orientation ϕᵢ.

Then you create something called an r-table. What the r-table says is: for a given edge orientation ϕ₁, what are all the possibilities, that is, all the points that could have certain radii while sharing that orientation with respect to the reference point? Similarly, for another orientation ϕ₂, say this particular edge point, there is a certain radius with respect to the reference point; another point with the same orientation could have a certain radius r₂₂, a third point with the same edge orientation could have a radius r₂₃, and so on. You create an r-table which connects these orientations to these distances to the reference point. So, once you have built this r-table, what do we do?

(Refer Slide Time: 19:31)

Here is the algorithm of what is known as the Generalized Hough Transform. You have your data, and you are given your r-table, which you have already built from the shape that was given to you. Remember, once again, that the shape is given to you rather than defined mathematically. Using that shape, you construct the r-table, and you also construct an accumulator array over however many dimensions you have to estimate.

So, if that reference point has two coordinates, you have only two coordinates to vote for.
Otherwise, you may have other parameters too. If you allow your shape to change in multiple
ways, you could have other dimensions in your accumulator for which also the points can
vote.

In this particular case, to keep it simple, let us assume that the shape is going to be the same, just that it could be placed anywhere in the image: no scaling, no rotation. Let us assume that that is the scenario here. So, the only thing we are trying to find is where the centroid of that object would be.

So, you have an accumulator array of two dimensions, going from x_c^min to x_c^max and from y_c^min to y_c^max, the minimum and maximum values that the centroid can take along the x-axis and the y-axis respectively. You initialize every cell of the accumulator to 0. Then, for every edge point (x, y), and for every (rᵢ, ϕᵢ) entry in your r-table, the candidate centroid can be written as (x + rᵢ cos ϕᵢ, y + rᵢ sin ϕᵢ).

For each such candidate centroid computed from a given (x, y) using this formula, you increment the accumulator cell for that centroid. You keep doing this over the different centroid possibilities, which range from x_c^min to x_c^max and y_c^min to y_c^max; you discretize the possibilities of the centroid and fill the accumulator array.

That is the way you would do it. At the end, you get a set of frequency counts of what each (x, y) voted for; you do a non-maximum suppression to find the local maxima, and that gives you the final (x_c, y_c) corresponding to the new set of points given to you.
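A small sketch of this procedure in Python (the orientation binning, the dictionary-based r-table, and the function names are my own illustrative choices; a full implementation would also add accumulator dimensions for scale and rotation if those were allowed to vary):

    import numpy as np
    from collections import defaultdict

    def build_r_table(edge_points, edge_orientations, ref_point, n_bins=36):
        """Build the r-table from the template shape: for each orientation bin,
        store the offsets from edge points to the chosen reference point."""
        table = defaultdict(list)
        for (x, y), phi in zip(edge_points, edge_orientations):
            b = int(phi / (2 * np.pi) * n_bins) % n_bins
            table[b].append((ref_point[0] - x, ref_point[1] - y))
        return table

    def vote_for_reference(edge_points, edge_orientations, table, acc_shape, n_bins=36):
        """Each edge point in a new image votes for possible reference-point locations."""
        A = np.zeros(acc_shape, dtype=int)
        for (x, y), phi in zip(edge_points, edge_orientations):
            b = int(phi / (2 * np.pi) * n_bins) % n_bins
            for dx, dy in table.get(b, []):
                xc, yc = int(round(x + dx)), int(round(y + dy))
                if 0 <= xc < acc_shape[0] and 0 <= yc < acc_shape[1]:
                    A[xc, yc] += 1           # cast a vote for this candidate centre
        return A   # a peak in A is the most likely reference-point location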

Speaking in terms of images: once again, let us assume that this is the shape you are looking for. As I said, in this example we are only looking for the shape as is, with no scale change and no rotation change, which means the only thing that can change is where the center of the object lies. That is why we are voting only for where the center is. So, given a new image, you run an edge detector, and for each edge point, you let it vote for where the centroid would be; wherever you get the maximum votes for the centroid is where the centroid potentially is.

You could use this kind of approach to, say, find a car in an image. If you knew a car has a particular shape, you define that shape, run an edge detector on the new image, and then ask each point on the edge to vote for where the center of the car is; based on that, you can probably trace where the car is located in the given image.

So, you could use such an approach even for object detection, again only within certain changes in the possibilities of the shape. If a car's pose completely changes, obviously this kind of approach will not work in that scenario. So, it works only within a certain tolerance.

(Refer Slide Time: 23:23)

To summarize the Hough Transform: it is an effective approach for the detection of shapes and objects, including multiple instances; even if there were multiple cars, you could probably find all of them. You could have multiple maxima where edges vote for the centroids of the cars, and each of them may be a different instance of a car in the given image. So, you can also use this for the detection of multiple instances of an object in an image.

One advantage of the Hough Transform is that it deals with occlusion fairly well because it is a voting procedure: as long as the number of votes is high, it does not matter if certain votes went wrong. So, even if a certain part of the object is occluded, it can work; obviously, if a significant part of the object is occluded, it will not. It is also robust to noise for the same reason: you only need to look at a certain set of pixels, and as long as a majority of them vote consistently, you are going to get your shape or object in the image.

The problem is that it can be computationally expensive, depending on how you quantize your
parameter space. Remember that every point has to cast a vote for different possibilities in
your parameter space, which can be time consuming. And choosing your parameters and
quantizing them may not be easy for different kinds of shapes. For lines, circles and a few
well-defined shapes it may be straightforward; for certain other shapes this may not be easy,
or quantizing them may not give very accurate answers. Those are some limitations to work
with here.

(Refer Slide Time: 25:02)

Here is the visualization of the car example. So, you have a model image and there are some
key points with respect to a reference point at the center of the image. So, you record the
coordinates relative to the reference point in the model image and for every test image, you
look for the same configuration of the key points with respect to the centroid of the object.

(Refer Slide Time: 25:37)

So, in these particular cases the car is inverted, but when the test image has the same
configuration as the original object, you would find a match in this particular scenario.

(Refer Slide Time: 25:47)

Here is one more example. Consider your model image to be the Eiffel Tower, and here is a
test image which is also the Eiffel Tower, taken at a different scale, on a different day,
with different lighting conditions.

(Refer Slide Time: 26:02)

Then, after you find your model image points in the original image and in the test image,
you have your accumulator that votes for the centroid of the object with respect to the
various key points of that object. You look for a local maximum; you may not see it very
clearly here, but the local maximum is somewhere in the middle.

(Refer Slide Time: 26:29)

And based on that you vote for where the Eiffel Tower is located in the test image.

(Refer Slide Time: 26:36)

Your homework for this lecture is Chapter 4.3 of Szeliski’s book and a couple of questions
for you to think about. How would you use the Hough Transform to detect ellipses, squares and
rectangles? Try to work out what the parameterizations or the analytical forms of each of
these shapes are, and try to find out how the accumulator array would look in each of these
cases. That should help you address these problems.

And a real-world use case: let us assume a friend working in a diagnostics startup asks
you how to count the number of red blood cells in a blood sample automatically. So you
have a blood sample, and you want an automated way to count the number of red blood cells
using the Hough Transform. What would you advise him or her? Something for you to think
about.

(Refer Slide Time: 27:29)

And here are the references.

Deep Learning for Computer Vision
Prof. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 17
From Points to Images: Bag-of-Words and VLAD Representations
(Refer Slide Time: 0:15)

Moving on from the previous lecture, we will now talk about how you describe entire
images. So far, we spoke about how to obtain descriptors for certain key points in images,
of which there could be several in each image. How do you aggregate these descriptors to
obtain a single descriptor for an image, and why do we need this?

We talked about image panorama stitching, where you try to match key points. You may
now want to match images themselves. For example, in image search or retrieval or even
to perform image classification as we will see when we go through these slides. In this
lecture, we will focus on two specific kinds of descriptors, the bag-of-words descriptor
and another extension of it known as the VLAD descriptor.

(Refer Slide Time: 01:15)

Once again, the lecture slides are based on Professor Yannis Avrithis’ lectures at Inria Rennes.

(Refer Slide Time: 01:21)

So we have spoken so far about obtaining key points from two different views of the
same scene or object. You have two different views. You extract key points in both of
these views. You get descriptors of key points in both of these views and you match them
and perhaps try to match the geometry of the configuration between them using methods
such as RANSAC or Hough transform, so on and so forth.

(Refer Slide Time: 01:54)

But the question that we are asking here is: what if we want to match different instances of
an object? The object is no longer the same. In the earlier slide, it was the same object from
different viewpoints. Now the object could be completely different. In this case, it is a
helicopter, but two completely different helicopters; they are not the same object.
Then how do you match or how do you search? For example, imagine Google image
search, where you search for helicopter-like images: you post an image and ask the system
to retrieve similar images for you. How do you do this is what we are going to talk about now.

(Refer Slide Time: 02:37)

The main observation here is that rigid transformations may not work here. So far, if you
only wanted to match two views of the same object or scene taken from different perspectives,
you could consider one to be a rigid transformation of the other. You could try to estimate
your affine parameters, your rotation parameters or your translation parameters. But these
are two completely different instances of an object; one may not be a rigid transformation
of the other at all.

So, ideally, what we want to do now is to see if we can completely discard geometry itself,
because it is no longer going to be a rigid transformation. Can we completely discard the
idea of geometry and try to match images in some other way beyond geometry? Then maybe
later, using other methods, we will try to bring back certain elements of geometry, a little
loosely, into the mix.

(Refer Slide Time: 03:35)

So our first attempt towards getting image-level descriptors is a method known as
bag-of-words, an old method that has been used a lot for text before; let us see how we use
it for images. We are going to start with a set of samples, a set of data points. Then you
form a vocabulary from those data points. What do we mean by vocabulary? A set of
common words that exist across images. And finally, based on the vocabulary, we are
going to get a histogram, which is simply a frequency count of words in a certain
document, in our case, visual words in a certain image. Let us try to describe that in more
detail.

So if you had text information, your samples would simply be different sentences. Your
vocabulary would be the words that occur in these sentences, all the possible words that
you have occurring in your sentences. And finally, for each of these sentences, you would
come up with a histogram of how many occurrences of a particular word in your
vocabulary happened in this particular sentence. So, for each of your sentences, you can
look up a vocabulary or a dictionary and try to see how many times each word in the
vocabulary or dictionary occurred in this particular sentence.
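
As a toy illustration of this text analogy, here is a minimal sketch; the sentence and the fixed vocabulary are made up purely for illustration.

from collections import Counter

vocabulary = ["the", "cat", "dog", "sat", "ran"]      # assumed toy vocabulary
sentence = "the cat sat while the dog ran"

counts = Counter(sentence.split())
# Histogram: frequency of each vocabulary word in this sentence.
histogram = [counts.get(word, 0) for word in vocabulary]
print(histogram)   # [2, 1, 1, 1, 1]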

Let us try to draw a parallel to images now. Let us say you have three different images in
this particular case, and you form a vocabulary. What does a vocabulary mean here? For
text it is perhaps simple: it is all the words that occur in your documents. But if you had
images, what would be your vocabulary? The way we are going to define it is very simple
again. You take different parts of each of these images, group them together and see which
of them correspond to something common. For example, it could be possible that there are
six kinds of visual words that occur in the first image, which all look similar; maybe they
look similar in color, in texture or in any other aspect of the representation you choose.

Similarly, the second image has a set of visual words, and the third image has a set of
visual words. Once again, we obtain visual words by extracting several parts from an
image. So if you have a set of images, you extract parts from all of those images, pool them
in some way, and then group them: parts that seem to have similar properties are grouped
into one particular word. We will see this a little more clearly over the next few slides. And
once you get these visual words, similar to text, you have a histogram of how many times
each of these visual words occurred in a given image. That histogram is what is drawn here
for each of these three images.

(Refer Slide Time: 06:56)

Let us see a more real-world example to understand this a bit better. Before we go there,
what could bag-of-words be used for? It could be used for image retrieval as well as image
classification, as we will see shortly. Why? Because this is a way to represent an entire
image as a single vector. Let us see this using a more tangible example.

(Refer Slide Time: 07:17)

Here is an example of an image. And let us assume that this is the query image. And here
is an image from a dataset. So you could imagine that the dataset images are your entire
repository of images that you have. And the query image is like Google image search,
where you put up this image and you are asking for a search engine to retrieve all images
similar to this image. For example, you could upload a cat image on any search engine
and ask the search engine to retrieve similar cat images, a very similar example here.

Let us assume that the 15th image in your dataset has an occurrence of a similar structure.
Now let us see what would have happened earlier and how we would change things now.

(Refer Slide Time: 08:08)

Earlier, we would take key points in each of these images, compute descriptors around these
key points, and try to do a pairwise matching between the descriptors in both of these
images. This is obviously a time-consuming process, because if you had many key points and
many descriptors in each image, you would ideally have to do a pairwise matching between
every descriptor in the first image and every descriptor in the reference image. This can get
computationally expensive.

(Refer Slide Time: 08:43)

If you had another image, you would have to repeat the same process between the query
image and the second reference image, and so on and so forth for all reference images.

(Refer Slide Time: 08:55)

So how do we want to resolve this problem? What we are going to say now is that if there
were descriptors that are similar and would match between the two images, then instead of
matching them pairwise between the query image and the reference image, let us assume
that similar descriptors will anyway correspond to similar representations in the
representation space under a certain metric, let us say the Euclidean metric.

So a group of visual words or regions that are similar to each other would probably have
similar representations, and they would all be grouped together in the representation space.
Similarly for the regions marked with blue and the regions marked with red: they are
similar and they would probably match in the representation space. How do we use this?

(Refer Slide Time: 09:46)

(Refer Slide Time: 09: 52)

So what we would do now is to come up with a common representation for all similar
descriptors. How would we do that? We could simply take your entire repository of images
(or training images), take visual words from them that are common to each other, and
simply obtain their mean in that space. That mean becomes the visual word corresponding
to all of those similar regions in the repository of images or training images. You would do
this process offline, before the retrieval process starts, so you can construct your visual
words offline and keep them stored and ready when your search process actually begins.
So these visual words are already stored.

Given a query image, you take the descriptors corresponding to the key points in your
query image and now you simply match them with the visual words, each of which
represents the mean of similar descriptors in your reference images. You do not need to do
pairwise matching with every descriptor in your reference images anymore. You have a set
of visual words that are common across your repository of images, and you only need to
match with them now to be able to get your answer. This makes the process very feasible
as well as very effective: there is no pairwise matching required, and your visual words
now act as a proxy for the descriptors in your repository.
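
A minimal sketch of this offline codebook construction, assuming descriptors have already been extracted (e.g., SIFT vectors stacked into a NumPy array) and that scikit-learn is available; the function and variable names are illustrative, not from the lecture.

import numpy as np
from sklearn.cluster import KMeans

# all_descriptors: (N, 128) array of descriptors pooled from all repository images
def build_codebook(all_descriptors, k=100):
    kmeans = KMeans(n_clusters=k, n_init=10).fit(all_descriptors)
    return kmeans   # the cluster centers are the visual words

def bow_histogram(image_descriptors, kmeans):
    # Assign each descriptor of one image to its nearest visual word ...
    words = kmeans.predict(image_descriptors)
    # ... and count the occurrences of each word.
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)   # L2-normalized histogram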

(Refer Slide Time: 11:20)

Now let us see this in a more well-defined way. Let us imagine that your image is now
represented by a vector z ∈ R^k, where k is the size of the codebook, i.e., the number of
visual words that you choose, which would be a parameter that the user sets for a
particular dataset. Each element of z is z_i = w_i n_i, where w_i is a fixed weight per visual
word and n_i is the number of occurrences of that particular word in this query image. I
will repeat that: w_i is a weight per visual word, which is optional (if you do not assign
weights, they are just uniform), and n_i is the number of occurrences of that word in this
image.

So a set of n images in your repository, your reference images, is represented by a matrix
Z ∈ R^{k×n}. Remember there are going to be k visual words; for each image you have a
k-dimensional vector, and with n such images this becomes k × n.

For a query image, you simply compare the similarities between your query image's vector
representation, which counts the number of each visual word in the query image, and the
corresponding counts in each of the reference images. You take a dot product or cosine
similarity between these two and get a set of scores; you sort the scores in descending
order and that gives you the best match, followed by the next best match, and so on.

Note here that when we compute this cosine similarity based on the dot product Z^T q, we
take every column of Z, which corresponds to one specific image, and take its individual
dot product with the query vector q: the first reference image with q, the second reference
image with q, and so on. Whichever has the highest dot product is the image you would
recommend as the closest match to your query image.

A more general observation: the similarity you measure using a dot product or cosine
similarity is complementary to the use of Euclidean distance. Why? Because for
unit-normalized vectors, ||z − q||² can be written as 2(1 − z^T q). Work it out as
homework. This means that something with high cosine similarity will have a low
Euclidean distance. So, in some sense, these two metrics for comparing similarities or
distances are complementary.
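
A small sketch of this retrieval scoring, together with a numerical check of the relation above; the vectors here are made-up examples and, as noted, unit normalization is assumed.

import numpy as np

k, n = 5, 3
rng = np.random.default_rng(0)
Z = rng.random((k, n))
Z /= np.linalg.norm(Z, axis=0)          # each column = one reference image, L2-normalized
q = rng.random(k)
q /= np.linalg.norm(q)                  # query histogram, L2-normalized

scores = Z.T @ q                        # cosine similarity with every reference image
ranking = np.argsort(-scores)           # best match first
print(ranking)

# Check: squared Euclidean distance = 2 * (1 - cosine similarity) for unit vectors
z = Z[:, 0]
print(np.linalg.norm(z - q) ** 2, 2 * (1 - z @ q))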

An important observation: consider the case k >> p, where p is the average number of
features per image and k is the number of visual words in your codebook, which is
user-defined. We saw with examples like SIFT that p could be several hundred. If k >> p,
then z and q are going to be sparse. Why? If you have 1,000 visual words and say 100
features in each image, then at least 900 entries for a given image will have zero count,
because those visual words do not occur in that image. So, if k >> p, both z and q are
going to be sparse.

To make the computation more effective here, rather than checking, for every word, whether
it occurs in a given image, which is what you would typically do even though the image is
a very sparse representation over the possible set of words, we can invert the problem and
check which images contain a particular word. We will now see that this can be a more
effective way to compute similarity in this scenario. Let us see this a bit more intuitively.

(Refer Slide Time: 16:11)

So let us imagine now that you have a query image which has visual words 54, 67 and 72.
Remember, you have a bunch of visual words and every image has a certain number of
occurrences of each of them; let us assume you have just one occurrence each of visual
words 54, 67 and 72 in the given query image. 54, 67 and 72 are just random numbers;
you could have taken any other numbers. This is just to explain.

Now, you have a repository of images that you want to retrieve from; let us say they are
indexed by a set of numbers in your dataset. So you draw this out and look at each of these
images. Let us say the 15th image in your repository has the 72nd, 67th and 54th words,
whereas the 13th image has only the 67th word, the 17th image has only the 72nd word,
the 19th image has the 67th and the 54th words, and so on. Now what do you do?

(Refer Slide Time: 17:27)

You take one of the visual words occurring in your query image, see which of your
repository images also contain that particular visual word, and add a count for each such
image. Similarly, 19 would get a count of 1 and 21 would get a count of 1 for this word.
So you are now comparing which of these repository images contain the same word that
is there in your query image.

(Refer Slide Time: 18:03)

Similarly, you would do this for the 67th word, and you get an increase in count for these
two images; the 21st had only the green word, so there is no increase there, but the 13th
image gets an increase in count now.

(Refer Slide Time: 18:25)

Similarly, you would do this for the red visual word or the 72nd visual word. And at this
time the 15th image gets one more count. The 17th image gets a new count added here.
And the 22nd image gets a new count added here. So it is evident from this that the 15th
image has the highest count with respect to the query image and you now rank order all
your repository images based on a short list with respect to these visual words.

So you find that the 15th image has the best match. The second best match is the 19th
image with two common visual words. And you repeat this process to be able to get your
best match from your repository of images and this would be known as the image
retrieval problem.
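
A minimal sketch of such an inverted-index lookup; the word IDs and image IDs mirror the toy example above, and the data structures are illustrative assumptions.

from collections import defaultdict

# Inverted file: visual word id -> list of image ids containing it (toy data)
inverted_index = {54: [15, 19, 21], 67: [13, 15, 19], 72: [15, 17, 22]}
query_words = [54, 67, 72]

votes = defaultdict(int)
for w in query_words:
    for image_id in inverted_index.get(w, []):
        votes[image_id] += 1            # one vote per shared visual word

# Rank repository images by number of visual words shared with the query.
ranking = sorted(votes.items(), key=lambda kv: -kv[1])
print(ranking)    # image 15 comes first with 3 shared words, then image 19 with 2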

(Refer Slide Time: 19:16)

Let us try to ask how you would extend bag-of-words to classification. Now, we talked
about retrieval, but we now want to classify a given image as belonging to a particular
scene. For example, in the previous image that we saw, maybe that building has a
particular name and there are many monuments that you have and you want to classify an
image as belonging to a particular monument. So you want to treat this now as a
classification problem. How do you adapt this for classification?

(Refer Slide Time: 19:49)

The way you would do it is, once again, you represent an image by a vector z ∈ R^k, where
k is the number of visual words, very similar to how we did it for the retrieval problem.

(Refer Slide Time: 20:00)

But once you do this, you can use a classifier on these frequencies of visual words. Every
image now becomes a vector representation, a histogram of visual words. You would have
a set of such images in your training data, each with a class label associated with it,
because remember classification is a supervised learning problem. So now you could just
use a classifier such as, say, naive Bayes or a support vector machine, or any other
classifier for that matter.

If you are using naive Bayes, you would estimate the maximum posterior probability of a
class c given an image z, assuming features are independent, that is, assuming the presence
or the count of each visual word is independent. You would simply run a naive Bayes
classifier in this particular scenario; remember naive Bayes would become a linear
classifier. You could also do this using a support vector machine with an appropriate
kernel function to perform the classification, or for that matter you could use any other
classifier too.
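
A minimal sketch of this classification step on top of bag-of-words histograms, assuming scikit-learn is available; the training data and labels here are random placeholders, not real image features.

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# X_train: (n_images, k) matrix of bag-of-words word counts per training image
# y_train: class label per image (e.g., which monument it depicts)
X_train = np.random.randint(0, 5, size=(20, 100))
y_train = np.random.randint(0, 3, size=20)

nb = MultinomialNB().fit(X_train, y_train)          # naive Bayes over word counts
svm = SVC(kernel="rbf").fit(X_train, y_train)       # SVM with a kernel on the histograms

x_query = np.random.randint(0, 5, size=(1, 100))
print(nb.predict(x_query), svm.predict(x_query))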

(Refer Slide Time: 21:06)

An extension of bag-of-words is known as the vector of locally aggregated descriptors, or
VLAD. It is very similar to bag-of-words, with a small but significant difference.
Bag-of-words, as we just saw, gives you a scalar frequency: how many times each visual
word, obtained by clustering a lot of regions from your training images or repository of
images, occurred in a query image. In some sense, this gives you limited information. You
do not have geometry here, nor do you keep what the exact regions in your query image
were; you map them anyway to a common visual word, which is the average of similar
regions in your repository.

In VLAD, instead, you again have visual words, all of that remains the same, but now,
instead of a scalar frequency, you have a vector per visual word, which looks at how far
each of the features in your query image is from the visual word. So you have a visual
word, which you would have obtained by clustering groups of features in your training
images.

Now, given a new feature in your query image, you look at how far it is from its visual
word, which gives you a residual vector. Similarly, you take another feature that is mapped
to the same visual word and see how far it is from that visual word, and you add up all the
residuals. You get one residual vector, the sum of all residuals of features mapped to that
visual word, that corresponds to the visual word. So it is not a scalar frequency anymore;
it is a vector describing how far, in each dimension, the features mapped to this visual word
are from it. This gives you a little more information, which could be more effective.

(Refer Slide Time: 23:09)

Let us see this a bit more clearly. With a bag-of-words representation, given a color
image, which is a three-channel RGB input, you first convert it to gray-scale (you could
adapt bag-of-words to color too; we are just taking a simpler example here). Once you
have that, you get a set of, say, 1,000 key points and convert them to 128-dimensional
SIFT descriptors, one for each of those 1,000 key points. Then you do an element-wise
encoding to, say, 100 visual words: assuming you have 100 visual words, you take each of
those 1,000 features and map it to one of the 100 visual words.

So you would have k such visual words, and you count the occurrences of each of those
1,000 features with respect to these 100 visual words to get a set of frequencies. Then you
do a global sum pooling and an L2 normalization of that final histogram vector to get your
final bag-of-words representation.

On the other hand, with VLAD, you would do the same: convert the three-channel RGB
input into one-channel gray-scale, again get 1,000 features in your query image, convert
them to 128-dimensional representations, and again assign each of those 1,000 features
to one of k visual words. But this is where the difference comes in: you are now not going
to get a scalar count or histogram; you are going to get a residual vector for each of those
k visual words, which means this is now going to be a 128 × k representation rather than
a k-dimensional representation.

Because for each visual word, you look at all the features or key points that got mapped to
it and compute the residual in each of those 128 dimensions.

Finally, you get a 128 × k representation, which you L2-normalize and use in practice. L2
normalization simply ensures that the vector has unit L2 norm, and that is the VLAD
descriptor.
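
A minimal sketch of computing a VLAD descriptor, reusing a fitted k-means codebook as in the earlier sketch; the function and variable names are illustrative assumptions.

import numpy as np

def vlad_descriptor(image_descriptors, kmeans):
    centers = kmeans.cluster_centers_              # (k, d) visual words
    assignments = kmeans.predict(image_descriptors)
    k, d = centers.shape
    vlad = np.zeros((k, d))
    for c in range(k):
        members = image_descriptors[assignments == c]
        if len(members):
            # Sum of residuals of all features assigned to visual word c.
            vlad[c] = (members - centers[c]).sum(axis=0)
    vlad = vlad.flatten()                          # (k * d,) vector
    return vlad / (np.linalg.norm(vlad) + 1e-12)   # L2-normalize to unit norm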

(Refer Slide Time: 25:39)

To conclude this lecture, please read chapters 14.3 and 14.4 of Szeliski’s book. Something
for you to think about at the end of this lecture: how is bag-of-words connected to the
k-means clustering algorithm? We spoke about it briefly during the lecture, but think about
it and answer this more carefully.

Assuming you understood that bag-of-words is connected to k-means clustering, how are
extensions of k-means clustering, such as hierarchical k-means clustering or approximate
k-means clustering relevant for the bag-of-words problem. Your hint here is to look up
what are known as vocabulary trees.

(Refer Slide Time: 26:22)

And here are some references for this lecture.

Deep Learning for Computer Vision
Prof. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 18
Image Descriptor Matching
(Refer Slide Time: 0:14)

Last lecture we spoke about describing images using the bag-of-words approach or the
VLAD approach. We will now take this forward and show how these descriptors can be
used for matching between images.

(Refer Slide Time: 00:39)

Before we go there, once again, an acknowledgement that these slides are taken from the
excellent lectures of Professor Avrithis at Inria Rennes.

(Refer Slide Time: 00:47)

We also left behind one question from last time: since bag-of-words is inherently
dependent on k-means to define its cluster centers, can we consider extensions of k-means
to improve how bag-of-words performs? One specific example is hierarchical k-means, an
extension of the k-means clustering algorithm where the cluster centers are organized in a
hierarchical manner, starting from a root node all the way to a set of leaf nodes. It happens
that this has been explored for bag-of-words too, using a method known as a vocabulary
tree, way back in 2006.

What this method does is take hierarchical k-means and build a partition tree; your image
descriptors then descend from the root to one of the leaves, level by level down the tree.
So you have a bunch of cluster centers that you put together from all of the visual words
that you have from different images. Recall that in bag-of-words, when you build your
cluster centers, you pool all the images in your data together, take all the features and the
descriptors around them, cluster them using some method, and then take those clusters as
what are known as codebook centers, to which each key point is assigned.

Your image is finally represented by a vector whose element x_i is given by w_i n_i, where
w_i is the weight of a particular node in the tree and n_i is the number of key points
assigned to that node's cluster centroid. One evident issue here is that it is difficult to know
how w_i should be assigned. One could argue that levels further down the tree must have a
higher weight because they are a better match; one could also argue otherwise, depending
on the setting, where a high-level match is perhaps more important than a low-level match.
It depends on the application and on what is important in a particular context.

A constraint of this method, then, is that there is no principled way of defining w_i,
although you could come up with some heuristic on top of the method. The dataset is again
searched using inverted file indices, just the way we saw in bag-of-words. And
fundamentally there is an issue here: distortion is often minimized only locally, for
example around a particular cluster centroid. Any error that you make, or any differences
between images, are only measured locally with respect to that particular cluster centroid;
distortion is not measured in a global sense. It is all with respect to each cluster centroid,
across the occurrences of each cluster centroid in the two images that you are matching.
That is how hierarchical k-means can be extended for bag-of-words. One could also
consider other extensions of k-means and use them in such methods to describe images.
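
A rough sketch of descending such a tree built with hierarchical k-means; this is a two-level toy version, and the branching factor, helper names and the assumption that each cell has enough descriptors are all illustrative, not the vocabulary-tree implementation itself.

import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, branching=4):
    # Level 1: coarse clustering; level 2: a finer k-means inside each coarse cell.
    root = KMeans(n_clusters=branching, n_init=10).fit(descriptors)
    children = []
    for c in range(branching):
        members = descriptors[root.labels_ == c]   # assumes enough descriptors per cell
        children.append(KMeans(n_clusters=branching, n_init=10).fit(members))
    return root, children

def assign_leaf(descriptor, root, children):
    # Descend from the root: nearest coarse center, then nearest child center.
    c = root.predict(descriptor[None, :])[0]
    leaf = children[c].predict(descriptor[None, :])[0]
    return c * len(children) + leaf     # flat leaf (visual word) id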

(Refer Slide Time: 04:28)

Let us now talk about how you match descriptors from two different images in a more
principled manner. Before we go there, let us assess what we know so far. One of the
simplest methods we have explored is nearest neighbor matching, where you have one
image, which is a set of feature points, and another image, which is another set of feature
points. At this point, we are not talking about any bag-of-words aggregation; it is simply a
set of features in one image and a set of features in the other image.

We use each feature in a set and its corresponding descriptor to independently index into
a feature from the second set. We simply do a nearest neighbor matching based on the
descriptor of the feature in each of these images. Can you think of what could be a
limitation of such an approach?

An inherent limitation of this approach is that you could be ignoring useful co-occurrence
information. It is possible that in one image there are multiple instances of the same
feature; think of, say, spots on a leopard or stripes on a zebra. You may then map a single
feature in one image to multiple different features in the other image because they all look
similar.

For example, a spot on a leopard in one image could get mapped to multiple spots on the
leopard in a second image. You would ideally not want this to happen: you would want
each spot on the leopard to get mapped to one spot on the leopard in the other image,
another spot to another spot, and so on.

The example that you see on the slide here is a giant panda, which again has some features
that repeat in its structure and could get mapped to the same instance in the second image,
which is what you see illustrated here. The ideal setting is where each feature gets mapped
to an independent feature in the second image; the other scenario, where two different
instances of the same feature get mapped to the same feature in the second image, is not
desirable. This can lead to problems when you use a nearest neighbor matching approach.

(Refer Slide Time: 07:01)

We have also seen the bag-of-words matching approach so far. Any glaring limitation that
you see of this approach? An evident limitation is that bag-of-words is limited to
full-image matching, not partial matching, because you compare the histogram of
cluster-centroid occurrences in image one against such a histogram for image two, and you
say two images match only when the complete histograms are fairly close to each other.
So if only a part of the second image matches the first image, these histograms will not
match and you will not get a good match in this setting. In other words, bag-of-words is an
all-to-all matching, but we ideally want some sense of a one-to-one matching, of one part
of an image matching another part, so that we can still match images that are only partially
close to one another. Nearest neighbor matching was a one-to-one matching approach, but
it has its own limitations.

(Refer Slide Time: 08:16)

Now let us try to generalize how you can do this descriptor matching using an idea you
would have seen in machine learning: kernels. You may have heard about kernels in
support vector machines; we are going to use a similar idea here to generalize matching
between descriptors. Recall that in support vector machines, or for that matter any other
machine learning algorithm that works with kernels, kernels express a sense of similarity
between two data points. It is the same principle being used here.

To be able to do that, let us define the setting. You have an image described by n
descriptors, say. So X is an image given by x_1, x_2, ..., x_n, where each of these is a
d-dimensional vector. Remember, these could just be the cluster centroids or could be
individual features.

Looking at the bag-of-words example, these descriptors are typically quantized using
k-means clustering, or any other clustering method for that matter. Quantized means you do
not simply keep all the features in an image; you see which cluster center each belongs to
and only keep a count for that cluster center as a representation of that feature. This
quantization function is given by q : R^d → C ⊂ R^d, where C is called a codebook,
given by c_1 to c_k.

So there are k possible cluster centroids, which, as I just mentioned, you get by doing a
k-means clustering on the features from all images. And for a single image, you use the
quantizer function, which takes every feature to the cluster centroid nearest to it. This is
the setting.

Now let us define the kernel. Given two images X and Y, the matching kernel is
K(X, Y) = γ(X) γ(Y) Σ_{c ∈ C} M(X_c, Y_c), where γ(X) and γ(Y) are, as we will see soon,
normalization functions (we will see why we need them in a moment), the summation is over
all the cluster centers, and M is the within-cell matching function.

So M is the matching function we are talking about, and it is computed between the
occurrences of each cluster centroid in one image and the occurrences of that cluster
centroid in the second image. We still have to define what M is; we will see a few examples
of that as we go forward.

γ(X) γ(Y) is required here because you do not want to be biased by the number of features
in a given image. For example, it is possible that you detect, say, 1,000 features in one
image and just 200 in the other. If you simply do a summation, you would always be
biased towards an image that has a lot of features, because the count would go up even
when it is not really a good match.

So γ(X) and γ(Y) are normalization functions, where you can divide by the total number of
features in the image, so that the matching is not biased simply by the number of features
in an image.

(Refer Slide Time: 12:01)

Now let us see a few examples of how M would look for the approaches we have seen so
far. Bag-of-words matching is, in a way, a cosine similarity between the word counts of the
two images. Whatever we have seen as bag-of-words similarity so far can be defined by a
matching kernel as follows: for a codebook element (cluster centroid) c, you count how
many occurrences of it there are in image X and how many in image Y, and you sum 1 over
all such pairs, i.e., M(X_c, Y_c) = |X_c| · |Y_c|.

In other words, for one particular codebook entry c_3, if you had 10 features in image X
that correspond to c_3 and three features in image Y that correspond to c_3, the
corresponding matching term is simply 10 × 3, which is 30; the double summation over
pairs gives exactly this count. That is simply the bag-of-words model. Remember that the
normalization function, as I said, takes care of normalizing by the total number of features
in the image itself. But this is how M is defined.

(Refer Slide Time: 13:24)

You could extend this to another approach known as Hamming embedding for matching.
If you assume that each descriptor can be binarized in some way, for example, you take a
descriptor and say anything greater than a threshold is one and anything less is zero (you
can binarize any descriptor this way), then you compute your matching kernel similarly to
the bag-of-words kernel. The only difference is that you only count the instances where the
Hamming distance between the binary codes b_x and b_y is less than a threshold. That is
Hamming embedding: h is the Hamming distance between the two binary vectors, and τ is
a threshold you have to specify to get a matching kernel in this setting.
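
A minimal sketch of this thresholded Hamming matching within one visual-word cell; the binarization scheme and the threshold value are illustrative assumptions.

import numpy as np

def binarize(descriptors, thresholds):
    # One simple binarization: compare each dimension against a per-dimension threshold.
    return (descriptors > thresholds).astype(np.uint8)

def hamming_match(Bx, By, tau):
    # Count pairs (x, y) in the same cell whose Hamming distance is below tau.
    score = 0
    for bx in Bx:
        for by in By:
            h = np.count_nonzero(bx != by)   # Hamming distance between binary codes
            if h < tau:
                score += 1
    return score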

(Refer Slide Time: 14:25)

You can also define VLAD matching in the same framework. Remember, again, that VLAD
is similar to bag-of-words; the only difference is that you do not count how many features
belong to a codebook element or cluster center. Instead, you take all the features that
belong to a codebook element and accumulate their residual vectors.

Recall again: in VLAD you have a cluster center; for example, you could have two cluster
centers. You have a set of features closest to one cluster center and another set of features
closest to the other. You take the difference between each feature and its cluster center,
which gives a residual vector, add up all the residual vectors that belong to one particular
cluster center, and that sum becomes the representation for that cluster center. Similarly,
you do this for the other cluster centers.

(Refer Slide Time: 15:30)

Now, if this is the representation, how do you define the matching kernel? The entire
representation of image X is a vector (V(X_{c_1}), V(X_{c_2}), ..., V(X_{c_k})), i.e., the
representations corresponding to each of your codebook elements. The matching kernel is
then given by M(X_c, Y_c) = V(X_c)^T V(Y_c), an inner product between the
representation for the c-th codebook entry in X and the c-th codebook entry in Y. This is
simple to expand, because V(X_c) is a summation, over all the features of X that belong to
that particular codebook entry, of their residuals.
particular code entry and their corresponding residuals.

So you expand V(X_c) as a summation, and similarly V(Y_c), and the matching kernel
becomes a double summation, over all the features that belong to that codebook entry in
image X and over those in image Y, of r(x)^T r(y), measuring how well the residuals are
aligned with each other.

To understand the intuition: suppose you have a cluster center, say c_3, in image X and the
same cluster center c_3 in image Y. Let us say you have three features in image X
belonging to this cluster center, and similarly three features belonging to it in image Y.
You take the residuals on one side and see how well they match the residuals on the other
side. If the residuals match, you get a high matching score. In other words, if even the way
the features are configured around the cluster center matches between the two images, it is
a better match.
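
A minimal sketch of this per-word residual matching, reusing the notion of residuals from the VLAD sketch earlier; the function and variable names are illustrative.

import numpy as np

def vlad_cell_match(Xc, Yc, center):
    # Residuals of the features assigned to this visual word in each image.
    rX = Xc - center                     # (nx, d)
    rY = Yc - center                     # (ny, d)
    # Double summation of r(x)^T r(y) equals the inner product of the summed residuals,
    # i.e., V(X_c)^T V(Y_c).
    return rX.sum(axis=0) @ rY.sum(axis=0)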

(Refer Slide Time: 17:41)

A more general way of combining these ideas is known as the aggregated selective match
kernel, or ASMK, which combines a few ideas we have seen so far: the non-linear
selectivity that we saw with Hamming embedding (we will see that in a moment) and ideas
from VLAD.

The way this method works is that you take VLAD, which is what you see as the argument
inside, and you normalize the VLAD vectors, which is what you define as V̂: as shown at
the bottom of the slide, V̂(X_c) = V(X_c) / ||V(X_c)||. Effectively, you make the VLAD
vector, which is the sum of all residuals, into a unit-norm vector. You then take an inner
product between these normalized VLAD vectors, very similar to what we saw on the
previous slide. Because this is an inner product, its output is a scalar.

449
Now you use a non-linear selective function σα of this scalar to get your final matching

function. What is this σα? The σα is defined as for any input u, σα is defined as the
α
𝑠𝑖𝑔𝑛(𝑢) |𝑢| . If u is greater than a threshold τ, remember the inner product is a measure
of similarity. Higher the value, the better for u. So if u is greater than a threshold, you
may want to weigh things a bit more which you can control using α. And if u is less than
a threshold, it is zero, very similar to the hamming distance. We say that if the hamming
distance was less than a threshold, we will count it more. If not, we would not count it.
This is a very similar idea, although we use VLAD vectors for achieving the same goal.
So you can see here that if α is 1, this is just u itself. If α is 1, 𝑠𝑖𝑔𝑛(𝑢) |𝑢| will be u itself.
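
A small sketch of this selectivity function applied to normalized VLAD cell vectors; the α and τ values are just example settings.

import numpy as np

def sigma_alpha(u, alpha=3.0, tau=0.25):
    # Selective non-linearity: keep only strong similarities, shaped by alpha.
    return np.sign(u) * np.abs(u) ** alpha if u > tau else 0.0

def asmk_cell_match(Vx, Vy, alpha=3.0, tau=0.25):
    vx = Vx / (np.linalg.norm(Vx) + 1e-12)   # normalized VLAD vector of this cell in X
    vy = Vy / (np.linalg.norm(Vy) + 1e-12)   # ... and in Y
    return sigma_alpha(float(vx @ vy), alpha, tau)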

(Refer Slide Time: 20:01)

Let us see a few illustrations of this idea, for different choices of the α and τ values, to
make it clearer. On the top left, α is 1, which, as we just said, means the score is u itself:
simply the dot product of the normalized VLAD vectors. τ in this case is zero, so anything
greater than zero is kept, with the score being u itself.

You can see here that yellow corresponds to roughly zero similarity and red corresponds to
the maximum similarity for this image pair. So there are a few features that do not match
at all and a few features that match very well, and all of them are shown in the top-left
image.

On the top right, α is again equal to 1, but τ is equal to 0.25. You can see that a lot of the
yellows have come down: you are now saying that the score u must be at least 0.25 for a
pair to be considered a match, and many of the false matches disappear because any low
score is now disregarded.

The bottom row shows examples where α is equal to 3. Remember that because you are
normalizing your VLAD vectors, u is going to be a value lying between 0 and 1, so if α,
the exponent of u, is high, the values become even smaller. That is what you see on the
bottom left, where some of the reds have disappeared because they have gone closer to 0.
On the bottom right, where α is equal to 3 and τ is equal to 0.25, you again get a few more
yellows, which are scores that have become smaller because the exponent α = 3 has
reduced values of u lying between 0 and 1.

So you can see that larger selectivity down-weights false correspondences, which is what
we see with τ = 0.25 in both images on the right. This entire approach replaces the hard
thresholding that we have in Hamming embedding and gives a different way of going
about a similar idea.

(Refer Slide Time: 22:41)

Here is another illustration of results after applying the ASMK method. Each color across
these different images corresponds to the same visual word: the green, the yellow or the
blue is the same visual word occurring in different images. If you take any particular
example, say the pink or the red, you see that that visual word corresponds to some pointed
corner in each of these images.

(Refer Slide Time: 23:18)

All of these kernels can be generalized into efficient match kernels, where you define the
kernel as some continuous function K(X, Y) and avoid using any codebook at all. Because
codebooks can be computationally intensive to compute, along with the residuals and so
on, you could directly impose a kernel function between the individual features in one
image and the individual features in the other image.

This is, once again, very similar to how kernel functions are used in support vector
machines or other machine learning algorithms that have a kernel component. Ideally, you
would want the per-feature kernel to decompose as φ(x)^T φ(y), where φ is the
representation of each feature in a different space. Then your final kernel K(X, Y) is a
normalization γ(X) γ(Y) times the summation of the representations φ(x) over all features
of X, transposed, times a similar aggregation for Y. This is a generalization of the methods
we have seen so far.

(Refer Slide Time: 24:42)

That concludes this lecture. For more reading, the paper "To Aggregate or Not to
Aggregate" summarizes these kernels very well; please do read through it when you get a
moment. Chapter 14.4 from Szeliski’s book is your other reference for this lecture.

Deep Learning for Computer Vision
Prof. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 19
Pyramid Matching
Continuing with the previous lecture, we will now go from matching kernels to matching
kernels for image pyramids. For example, having a multi-resolution pyramid of images
and being able to use that idea to develop matching kernels.

(Refer Slide Time: 00:34)

The slides have once again borrowed from the lectures of Professor Avrithis at Inria
Rennes.

(Refer Slide Time: 00:41)

Descriptor matching, as we just saw in the earlier lecture, starts from X_c and, similarly,
Y_c: the features that belong to a particular visual word c in image X and in image Y. A
matching kernel for something like bag-of-words is then given by a summation over the
cluster centroids, just counting the number of features belonging to each; you could also
include a weighting factor for each of these terms if required.

And a more general form that we saw last time is

K(X, Y) = γ(X) γ(Y) Σ_{c ∈ C} w_c M(X_c, Y_c)

where w_c is a weight that we are introducing, which we can choose to use or not, and
M(X_c, Y_c) is the matching function.

(Refer Slide Time: 01:45)

Now, we will talk about going beyond single level matching and matching at the level of
pyramids and we will describe a seminal work in this context known as pyramid match
kernels. So pyramid matching is an efficient method that maps unordered feature sets,
which is what each image is, each image is an unordered set of features. We are going to
convert that into multi-resolution histograms and then do matching using weighted
multi-resolution histograms.

So we ideally start with the finest resolution histogram cell where a matched pair first
appears, and then we keep merging histograms as we go up the pyramid. The work also has
a very nice interpretation: it can be shown that it approximates similarity under partial
matching, where only a partial set of features in one image matches a partial set of features
in another image; the pyramid match kernel approximates that optimal partial matching
between the two images. For more details, I recommend that you read the paper called
pyramid match kernel. It is written very well and explains some of these ideas in detail if
you are interested in knowing more.

(Refer Slide Time: 03:22)

Let us start by defining histogram intersection, because we are going to build histograms
for both images. Obviously, we will define them at multiple levels, but first let us talk
about how to match histograms in this context. So suppose you have two histograms x and
y of b bins each.

We define the histogram intersection as κ_HI(x, y) = Σ_{i=1}^{b} min(x_i, y_i): you take the
minimum of x_i and y_i, the corresponding elements of the two histograms, and sum these
up over all the b bins. So you take the first bin of one histogram and the first bin of the
second, take the minimum value, then the minimum value of the next bin, and add them
all up. That is what we define as the histogram intersection.

Interestingly, you can show that this notion of histogram intersection, which we denote
κ_HI, has a relation to the L1 distance. We are not going to prove it here, but will leave it
as an exercise for you. The L1 distance between two vectors x and y can be given by

||x − y||_1 = ||x||_1 + ||y||_1 − 2 κ_HI(x, y)

Try it out for yourself: take a couple of examples of x and y, and you will see that it holds
in practice. Please try proving it also if you can; it is an interesting exercise to work out
that this histogram intersection is related to the L1 distance. Remember, the L1 norm is the
sum of absolute values of a vector, and the L1 distance is the sum of absolute differences.
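
A quick numerical check of the histogram intersection and of the L1 relation above; the histograms here are arbitrary non-negative examples.

import numpy as np

def kappa_hi(x, y):
    # Histogram intersection: sum of bin-wise minima.
    return np.minimum(x, y).sum()

x = np.array([3.0, 0.0, 2.0, 5.0])
y = np.array([1.0, 4.0, 2.0, 3.0])

lhs = np.abs(x - y).sum()                          # ||x - y||_1
rhs = x.sum() + y.sum() - 2 * kappa_hi(x, y)       # ||x||_1 + ||y||_1 - 2 kappa_HI(x, y)
print(kappa_hi(x, y), lhs, rhs)                    # the last two values agree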

(Refer Slide Time: 05:16)

Let us come back to the pyramid match kernel now. We said that the pyramid match kernel
is a weighted sum of histogram intersections at different levels for two images, and that it
approximates optimal pairwise matching. We first talk about it conceptually and then give
a concrete example of how it is done. So if you had these two images of the same object
from different poses or view angles, you extract key points in each; their descriptors lie in
R^d, and you have one such set of features for the first image and a similar set lying in
R^d for the second image.

You then divide that entire feature space into a grid, for instance, and count how many of
the features of one image fall in each cell of that grid of the feature space; that defines a
histogram. You match the histograms at that level. Then you collapse the grid, merging
regions of the grid in R^d, the d-dimensional descriptor space, match at that level, and so
on.

One intuition here is that you want to give a higher weight to matches at a finer level and a
lower weight to matches at a higher, coarser level, where the histogram bins may be
merged. We will give a concrete example and walk through this idea.

(Refer Slide Time: 06:50)

Let us consider now that you have an unordered set of features in image X, given by these
blue points, and a similar unordered set of features in image Y, given by the red points.
Remember, these points are the descriptors of those features, lying in R^d, and you are
going to bin them into very fine bins in that space.

So it is possible that this blue point lies in this bin, this blue point lies in that bin, and so
on. You are just binning the entire R^d region into different bins and placing each key
point occurring in each image into one of those bins based on its descriptor values. Now
you have the 1-D point sets X and Y on a grid of size 1; we will call it size 1. This is the
finest resolution.

(Refer Slide Time: 07:51)

So now we define histograms. For your level-zero histograms, this particular bin of your
R^d grid has one feature of X, that bin has one feature of X, another bin has one feature of
image X and one feature of image Y, and so on; from these you construct the histograms.
It is of course possible that one more feature of Y falls in the same bin, but at this first
level we create the bins, or you can always define bins at a fine enough level, such that
there is only one feature in each of the bins. We will obviously merge these as we go, to
combine them in a more effective manner.

Based on these histograms, when you try to match them, remember that the histogram
intersection takes the minimum of each pair of corresponding elements. So you are left
with the intersection, which is simply one value for this bin and one value for that bin; all
of the other bins have one of the elements in X or Y equal to zero, which means they
contribute nothing.

(Refer Slide Time: 09:09)

So you have two matches now between images X and Y, and you weight them by a value
of 1. Your total similarity score is therefore 2 × 1, which is 2.

(Refer Slide Time: 09:24)

Now we merge the histogram bins. If you originally had, say, about 20 histogram bins, you
merge every pair of contiguous bins to make 10 bins. Now you see that it is possible that
two features in image X belong to the same bin, and so on.

So now we construct what are known as level 1 histograms where we count the number
of features in each of these merged bins in image X and image Y. You see that there are
two occurrences of features in this bin. Similarly, there are two occurrences of features in
this bin in image X, but image Y has just one feature still in each bin. So based on that,
we build the histogram for image X, build the histogram for image Y.

Now you compute the intersection of these two histograms and find that there are four
matches. But you do not count all four; you count how many new matches are added by
matching these histograms. We had two matches earlier and we have four matches now, so
the number of new matches is two.

You now take those new matches and weight them by 1/2. Why 1/2? Remember, a match at
a coarser level is given less weight than a match at a finer level, because a finer-level
match means a closer match. So you take these two new matches, weight them by 1/2, and
your similarity score becomes 2 × 1 from the earlier slide plus 2 × 1/2, which is 3 in total.

(Refer Slide Time: 11:19)

We continue this process: you now make the histogram bins just five in number, which
means the number of features in each bin increases. You can now have three features in
this bin of image X, and so on. Once again, you get the histogram for X and the histogram
for Y and compute the intersection, which now gives 1 + 2 + 2 = 5 matches; but you
already had four matches at the previous level, so the number of new matches is just one.

The similarity score is now 2 × 1 + 2 × (1/2) + 1 × (1/4), because you reduce the weight
even further when you go to an even coarser level. So your total similarity score is
2 + 1 + 0.25, which is 3.25.
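
A minimal sketch of this multi-level computation for 1-D point sets, mirroring the worked example above; the bin sizes double at each level, the normalization γ(X) γ(Y) is omitted for simplicity, and the point coordinates are made up.

import numpy as np

def pyramid_match(x_pts, y_pts, levels, max_val):
    score, prev = 0.0, 0.0
    for i in range(levels):
        bins = np.arange(0, max_val + 2 ** i, 2 ** i)       # grid of side length 2^i
        hx, _ = np.histogram(x_pts, bins=bins)
        hy, _ = np.histogram(y_pts, bins=bins)
        matches = np.minimum(hx, hy).sum()                   # histogram intersection
        new = matches - prev                                 # newly matched pairs at this level
        score += new / (2 ** i)                              # finer levels get higher weight
        prev = matches
    return score

# Example with made-up 1-D descriptors:
x = np.array([0.5, 3.2, 6.1, 9.7])
y = np.array([0.6, 4.9, 6.4])
print(pyramid_match(x, y, levels=4, max_val=16))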

(Refer Slide Time: 12:29)

Let us put this together. You are given a set X consisting of n features, each belonging to
R^d. Assume the distances between those elements range between 1 and D; this just helps
us build the bins for constructing the histograms. Once you know the maximum distance
between elements, you can play around with your histogram bins to define them
accordingly.

We define X_i as the histogram of X in R^d on a regular grid of side length 2^i. So we
build histograms at level 0, level 1, level 2, and so on. Technically speaking, we start i at
minus 1, but at level minus 1 there are no matches; it is purely for mathematical
convenience, as we will see in a moment. And we keep building histogram levels up to
L = log D, where, remember, D is the maximum distance between the elements.

So now, given two images with descriptors X and Y, we are going to formally define the pyramid match as

$$K_{\Delta}(X, Y) = \gamma(X)\,\gamma(Y) \sum_{i=0}^{L} \frac{1}{2^{i}}\left(\kappa_{HI}(X_i, Y_i) - \kappa_{HI}(X_{i-1}, Y_{i-1})\right)$$

And at each level you are going to count the number of new matches. The first term counts the number of matches at this level, the second term counts the matches at the previous level, and you are going to keep building that. So at each point, this is going to refer to the number of new pairs matched.

This summation of differences can also be written as a telescoping sum. If you expand it, at $i = 0$ you have $\frac{1}{2^{0}}\left(\kappa_{HI}(X_0, Y_0) - \kappa_{HI}(X_{-1}, Y_{-1})\right)$, where the $\kappa_{HI}(X_{-1}, Y_{-1})$ term is ignored, since there are no matches at level $-1$. Then you have $\frac{1}{2}\left(\kappa_{HI}(X_1, Y_1) - \kappa_{HI}(X_0, Y_0)\right)$, and so on; the $\kappa_{HI}(X_0, Y_0)$ terms are common to these two elements and keep getting telescoped.

If you put them all together, you will find that the telescoping sum can be written as $\frac{1}{2^{L}}\kappa_{HI}(X_L, Y_L)$, the term at the highest (coarsest) level, plus partially cancelled terms from all the other levels. For example, take $i = 1$ and $i = 2$. At $i = 1$ you have $\frac{1}{2}\left(\kappa_{HI}(X_1, Y_1) - \kappa_{HI}(X_0, Y_0)\right)$, and at $i = 2$ you have $\frac{1}{4}\left(\kappa_{HI}(X_2, Y_2) - \kappa_{HI}(X_1, Y_1)\right)$.

So the $\frac{1}{2}\kappa_{HI}(X_1, Y_1)$ and the $-\frac{1}{4}\kappa_{HI}(X_1, Y_1)$ partially cancel, and you are left with $\frac{1}{4}\kappa_{HI}(X_1, Y_1)$, which is what is written out here. In general, each level $i < L$ is left with a coefficient of $\frac{1}{2^{i+1}}$ on $\kappa_{HI}(X_i, Y_i)$, so the kernel can equivalently be written as

$$K_{\Delta}(X, Y) = \gamma(X)\,\gamma(Y)\left(\frac{1}{2^{L}}\,\kappa_{HI}(X_L, Y_L) + \sum_{i=0}^{L-1} \frac{1}{2^{i+1}}\,\kappa_{HI}(X_i, Y_i)\right)$$

This is just a simplification of the telescoping sum in the above equation, and a mathematical representation of the example that we just saw over the last few slides.
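
To make the computation concrete, here is a minimal Python sketch of this kernel (without the $\gamma$ normalization), assuming 1-D features and bins of side length $2^i$ at level i; the function names and toy feature values are illustrative assumptions, not code from the original paper.

```python
# A minimal sketch of the (unnormalized) pyramid match kernel for 1-D features,
# using bins of side length 2**i at level i and weighting new matches by 1/2**i.
import numpy as np

def histogram(points, level, max_val):
    # Count how many points fall into each bin of side length 2**level.
    bin_size = 2 ** level
    n_bins = int(np.ceil(max_val / bin_size))
    hist = np.zeros(n_bins)
    for p in points:
        hist[int(p // bin_size)] += 1
    return hist

def histogram_intersection(h1, h2):
    return np.minimum(h1, h2).sum()

def pyramid_match(X, Y, max_val):
    L = int(np.ceil(np.log2(max_val)))
    score, prev = 0.0, 0.0
    for i in range(L + 1):
        cur = histogram_intersection(histogram(X, i, max_val),
                                     histogram(Y, i, max_val))
        score += (cur - prev) / (2 ** i)   # new matches at this level, weighted
        prev = cur
    return score

X = np.array([1.0, 3.0, 7.5, 9.0])
Y = np.array([1.2, 3.4, 6.0, 9.9])
print(pyramid_match(X, Y, max_val=16.0))
```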

(Refer Slide Time: 16:32)

Now, it can be shown that this $K_{\Delta}$ function that we just defined actually happens to be a positive definite kernel. If you recall the discussion of kernels in support vector machines from machine learning, you will recall that a positive definite kernel has benefits because it satisfies Mercer's theorem, and computational efficiency increases if your kernel satisfies this property. Let us see how that holds here.

Recall now that $K_{\Delta}$ is written as a weighted sum of $\kappa_{HI}$ terms with non-negative coefficients. What are those non-negative coefficients? They are the $\frac{1}{2^i}$ factors. So you have a weighted sum of different $\kappa_{HI}$ terms; that is what $K_{\Delta}$ is, whichever of the two equations above you look at. And we also know that each of these $\kappa_{HI}$'s, which are your histogram intersections, is simply a sum over bins of the min of the values in each bin. So it is a summation of min terms.

Now, we know that min can be written as a dot product. How? If you had the number 3 and the number 5, I can write 3 as 1, 1, 1 in the first three positions followed by zeros (assuming I can go up to the value 8). Similarly, for 5, I have 1 in the first five positions followed by three zeros.

Now, the min of these two values, which is 3, is simply the dot product between these two binary vectors, which means I can write min as a dot product. The rest of it now falls in place nicely: since min is a dot product, a sum of min terms can also be written as a dot product, and a weighted sum of such $\kappa_{HI}$ terms with non-negative coefficients can also be written this way, which means you can write your entire $K_{\Delta}$ as a positive definite kernel. In case there are parts that are not clear to you, please go ahead and read the pyramid match kernel paper to get a better sense of this.
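
As a quick illustration of the min-as-dot-product trick described above, here is a tiny sketch; the encoding length of 8 is just an assumption for the example.

```python
# A small sketch of the argument that min(a, b) over non-negative integers can be
# written as a dot product of "unary" binary encodings.
import numpy as np

def unary(value, length=8):
    # e.g. 3 -> [1, 1, 1, 0, 0, 0, 0, 0]
    v = np.zeros(length)
    v[:value] = 1
    return v

a, b = 3, 5
print(min(a, b))                   # 3
print(int(unary(a) @ unary(b)))    # also 3: the dot product of the encodings
```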

So, one question here: we just said that min can be written as a dot product by writing out each of the numbers in this enumerative form. We then extrapolated from that dot product to a sum of min terms, and from the sum of min terms to a weighted sum of $\kappa_{HI}$ terms with non-negative coefficients, and concluded that $K_{\Delta}$ is positive definite.

You could then ask: what would be the representation of the elements on which $K_{\Delta}$ acts as a positive definite kernel, that is, what would be the corresponding embedding? For min, the embedding was writing each number out in this simple enumerative way. What would be the corresponding embedding for which $K_{\Delta}$ becomes a positive definite kernel? To know what the embedding is, let us try to analyze this a bit more carefully.

(Refer Slide Time: 19:57)

So if you had two images X and Y, for convenience let us assume that X has a smaller number of features than image Y. Remember, both of them are unordered sets of features. It could be the other way too; this is without loss of generality, and in that case it would just be flipped. So you can assume that one is smaller than the other in terms of cardinality of features. And let us define a function $\pi$ that takes us from image X to image Y in such a way that $\pi$ is one-to-one, which means for every feature in image X you find the closest feature in image Y.

In that case, the optimal pairwise matching would be given by: you take a feature from image X, you find the corresponding closest feature in image Y, you take the L1 distance between these two features, and you find the mapping $\pi$ from image X to image Y which maximizes the sum of the reciprocals of these distances. Remember, the reciprocal of a distance gives you a sense of similarity, which is why you want to find the function $\pi$ which gives you the maximum such similarity.

For those of you who are a bit more familiar with distance metrics, you would find that such a formulation is similar to what is known as the earth mover's distance, which is given by $\min_{\pi} \sum_{x \in X} \|x - \pi(x)\|_1$. Remember that this is a distance metric, while the formulation of optimal pairwise matching above is a similarity measure, which is why you have a max there and a min here. Remember that distance and similarity are complementary ideas: if one is high, the other should be low, and so on.

So it happens that defining X the way we did, in terms of grid locations and histograms and so on, and taking a one-norm between those intersections, actually gives us the embedding. The details are a bit mathematically involved; for more, please see the paper called "Fast image retrieval via embeddings". But the core idea to take away from here is that the pyramid match defines a positive definite kernel, which makes it efficient, because a positive definite kernel satisfying Mercer's theorem has certain computational benefits through the kernel trick; and also that the embedding corresponding to the kernel can be related to the L1 distance between these X values, which this particular paper describes in more detail.

And remember once again that the pyramid match kernel is a similarity measure like any other kernel function, and it does not penalize clutter except through the normalization. By that we mean it is possible that many features are congregated in a certain section of your entire $\mathbb{R}^d$ space, and you are not going to penalize that; it would just increase the histogram intersection count in a particular bin. There is no penalization for that. The only penalization that you could have is the normalization factor that you may have in your kernel definition.

(Refer Slide Time: 23:51)

One could extend this: instead of dividing $\mathbb{R}^d$ into a uniform grid and counting how many features lie in each cell of that grid, you could also cluster all your features and build the histograms based on a vocabulary. Until now, in the method that we discussed, the histograms were not based on a vocabulary; they were obtained by dividing your entire $\mathbb{R}^d$ into several bins and counting how many features occurred in each of those grid cells. But you could also consider clustering your keypoints into a vocabulary and then building your bins based on those cluster centers.

This would simply be an extension of the method that we have so far, where we would replace the regular grid with, say, hierarchical or non-hierarchical vocabulary cells. Compare this with the vocabulary tree: at the beginning of the last lecture, we talked about how hierarchical k-means can be used in bag of words, and we said that one of the concerns there is that there is no principled way of giving weights to each level in the tree. Now, in the pyramid match kernel we actually have a principled way, given by $\frac{1}{2^i}$. Even here, the approximation quality can suffer in high dimensions, simply because of the curse of dimensionality and how distances get distorted in higher dimensions.

(Refer Slide Time: 25:25)

One could extend this idea of the pyramid match kernel to a pure spatial matching approach. So far, we talked about taking all the features from different images, dividing the entire $\mathbb{R}^d$ space of d-dimensional descriptors into grids, and then building your histograms. But you could also build these histograms in your image space.

In this context, what you will do is, let us say you have an image such as this, where a person is performing a certain action. You could divide the image into four parts, into 16
parts and so on and so forth. And you have two different images. Now you can do
matching based on histograms. How many points belong to this bin, how many points
belong to the top right bin, so on and so forth. Clearly in this approach, you are only
considering the coordinate locations of the features. You are not considering the
descriptor or the appearance of how that feature looks at all.

But this approach could be used in trying to match, say, a person's position, or how different a person's position was with respect to an earlier position, and so on. So this can be used, but it has its own limitations, because in this case you are simply building the histograms in the spatial image space, dividing the image into parts, rather than taking the descriptor of the keypoint and building the histogram in the descriptor space. So you are only considering the coordinates, or the geometry of the points in the image, rather than how each of those keypoints appears.

(Refer Slide Time: 27:14)

You could also combine these ideas to perform what is known as spatial pyramid matching. This was an extension of the pyramid match kernel. In this context, you have a level zero again, very similar to the pyramid match kernel, where you take a vocabulary: you cluster all your features into a vocabulary, count how many points belong to each of these cluster centers, and you would get histogram bins such as these.

Now, you divided your image into four parts. And now similarly, get a histogram bin for
each of these visual words for each of these segments. For the top left segment, you
would once again get a histogram of three bins. For the bottom right segment, you would
get a histogram of three bins, so on and so forth. So the three bins come from the
vocabulary guided pyramid match kernel, where instead of dividing your descriptor space
into uniform bins, you build cluster centers similar to bag of words and then you count
the number of features belonging to each of those visual words.

You can once again divide the image even further. Now, you are going to get even higher
number of histogram bins corresponding to each of these locations. So in this case, your
kernel is going to be, you have your pyramid match kernel, but you are now going to do
that for each part of the image and add them all up. So the pyramid match kernels still
exists for each part of the image and then you keep doing this over different parts of the
image.
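
A rough sketch of this combination might look as follows, under the simplifying assumptions that matches are counted at every level (rather than only new matches, which is a slight simplification of the actual kernel) and that each feature comes with a visual-word id and normalized image coordinates; the function and variable names are illustrative.

```python
# A minimal sketch of spatial pyramid matching: per-cell visual-word histograms at
# several grid levels, compared with histogram intersection and pyramid weights.
import numpy as np

def spm_histogram(words, xy, n_words, level):
    cells = 2 ** level                       # the image grid is cells x cells
    hist = np.zeros((cells, cells, n_words))
    for w, (x, y) in zip(words, xy):         # xy assumed normalized to [0, 1)
        hist[int(y * cells), int(x * cells), w] += 1
    return hist.ravel()

def spm_kernel(wordsA, xyA, wordsB, xyB, n_words, L=2):
    score = 0.0
    for level in range(L + 1):
        weight = 1.0 / 2 ** (L - level)      # finer levels get larger weight
        hA = spm_histogram(wordsA, xyA, n_words, level)
        hB = spm_histogram(wordsB, xyB, n_words, level)
        score += weight * np.minimum(hA, hB).sum()
    return score

rng = np.random.default_rng(0)
wordsA, wordsB = rng.integers(0, 3, 20), rng.integers(0, 3, 20)
xyA, xyB = rng.random((20, 2)), rng.random((20, 2))
print(spm_kernel(wordsA, xyA, wordsB, xyB, n_words=3))
```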

(Refer Slide Time: 28:51)

So, one could look at it as a joint appearance-geometry histogram. The pyramid match kernel was a pure appearance histogram, because you were building the histograms in the descriptor space. We saw an example of how pyramid match kernels can be brought to spatial matching, which was a pure geometry histogram, and spatial pyramid matching brings these two together to create what are known as appearance-geometry histograms.

So these are robust to deformation, not completely invariant to transformations, but fairly
robust to deformation by simply the process that you are defining, where you are
considering the appearance as well as where each of these features occurred in a given
image, which was not there in the pyramid match kernel at all. So this can be used for
global scene classification where a different organization of objects should not distort
your final result.

(Refer Slide Time: 29:55)

A last method that we will talk about in this lecture is Hough pyramid matching, which is clearly an extension of Hough voting, if you recall. In this method, the idea is: remember that in typical pyramid matching, you would take a set of features and match them to features from another image, and you could do this in a fast manner by using image pyramids, as in the discussions in earlier lectures, where you first do matching at a coarse level and then do finer matching at a deeper level of the pyramid, and so on.

(Refer Slide Time: 30:35)

So you could have a bunch of correspondences which you get from matching at the level
of key points. And what we are going to do now is work with these correspondences
instead of two sets of unordered features. So you have a set of correspondences that you
already get by doing fast pyramid matching. And remember the central idea of hough
voting is each of your correspondences votes for a particular transformation or you have a
hypothesis of transformation based on say the rotation angle or scale or translation and
each of these correspondences votes for a particular hypothesis and we are now going to
build histograms in that transformation space.

(Refer Slide Time: 31:21)

Let us see an example here. You could assume that a local feature p in image P has a certain scale, orientation and position to it, which is given by this transformation matrix. Remember that this transformation matrix is just a different way of writing what we saw earlier,

$$\begin{bmatrix} r\cos\theta & -r\sin\theta & t_x \\ r\sin\theta & r\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}$$

which constitutes an affine transformation, where r is the scale, $\theta$ is the orientation, and $t_x$, $t_y$ give the position.

So this is just a concise way of writing such a matrix. These two zeros correspond to this zero vector here, the 1 is there for mathematical convenience, and the product $s(p)R(p)$ corresponds to the scale and orientation of that point p, which can be written as a two-by-two matrix, while the vector $t(p)$ corresponds to the position of that particular point in the image.

(Refer Slide Time: 32:41)

Assuming this is how a local feature is given to us, a correspondence between a pair of features $p \in P$ and $q \in Q$ can be given by $F(c) = F(q)F(p)^{-1}$; remember, $F(p)$ is point p's representation and, similarly, $F(q)$ is point q's representation in image Q, and the correspondence between these two points is given by $F(c) = \begin{bmatrix} s(c)R(c) & t(c) \\ 0^{T} & 1 \end{bmatrix}$. Once again this boils down to your rotation-and-scale matrix coming here, your translation $t_x$, $t_y$ coming here, and your zero, zero, 1. Now, $t_x$, $t_y$ are not just coordinates; they are how much you move from the coordinate of point p in image P to the coordinate of q in image Q.

Similarly, the scale and the rotation tell you what the transformation is: how much did you rotate to go from image P to image Q, and how much did you zoom in or zoom out to go from image P to image Q. We are not going to go deeper into this, but just to complete the discussion, $t(c)$ can be written as $t(c) = t(q) - s(c)R(c)\,t(p)$. Why is that so? $t(q)$ is the position of q in image Q, $t(p)$ is the position of point p in image P, and $s(c)$, $R(c)$ say how you rotated and zoomed p to get to a point in image Q; the difference between those two locations is going to be the actual translation $t(c)$.

Similarly, you can define the relative zoom in or zoom out to be $s(c) = s(q)/s(p)$, and the rotation to be $R(c) = R(q)R(p)^{-1}$, or equivalently the angle $\theta(c) = \theta(q) - \theta(p)$, the orientation of q in image Q minus the orientation of p in image P. This is how the correspondence is given.
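
To make these relations concrete, here is a small sketch that computes the 4-D transformation vector of a correspondence from the two features' scale, orientation and position; the function name and the example values are illustrative assumptions.

```python
# Relative transformation of a correspondence c = (p, q) from the scale,
# orientation and position of the two local features, following the relations above.
import numpy as np

def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def correspondence_transform(s_p, theta_p, t_p, s_q, theta_q, t_q):
    s_c = s_q / s_p                          # relative zoom
    theta_c = theta_q - theta_p              # relative rotation
    t_c = t_q - s_c * rot(theta_c) @ t_p     # relative translation
    return np.array([t_c[0], t_c[1], s_c, theta_c])   # the 4-D vector

print(correspondence_transform(1.0, 0.0, np.array([10.0, 20.0]),
                               2.0, np.pi / 6, np.array([40.0, 50.0])))
```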

(Refer Slide Time: 34:52)

So, now let us come back to Hough pyramid matching. The transformation of a correspondence can thus be given by a 4-D vector consisting of the translation $t_x(c)$ and $t_y(c)$, the scaling factor $s(c)$, and $\theta(c)$, the rotation angle difference. We are going to define one more thing before we go into Hough pyramid matching: if you had two correspondences (p, q) and (p', q'), we say that these two correspondences are conflicting if either p is equal to p' or q is equal to q', or rather, if two points from image P match to the same point in image Q, or one point from image P matches to two points in image Q. We are going to see how to use this when we go to the next slide.

(Refer Slide Time: 35:43)

So let us see how Hough pyramid matching actually works. You have a set of correspondences now, which are laid out in your 4-D space; remember, each correspondence has $t_x$, $t_y$, s and $\theta$. So in this 4-D space you are going to have each of these correspondences laid out. Now you should be able to see the similarity to the pyramid match kernel, because you are going to do all your pyramid matching in this 4-D transformation space, and that is why we call it Hough pyramid matching. Each correspondence c is weighted by some w(c) based on, say, a visual word; you can choose to use this, or you can use a uniform weight if you prefer.

(Refer Slide Time: 36:28)

Then at level zero, which is the first level of matching, remember where you have very
granular bins. If there are conflicting correspondences in the same bin, you are going to
erase them. For example, you see that c7 and c8 have two different points from image P
matching to the same point in image Q. So you are going to remove one of them. So c7 is
removed in this case and you retain only c8.

(Refer Slide Time: 36:56)

Now, consider each of the bins in this pyramid; remember, this binning is again in the transformation space, that 4-D space of translation, scale and rotation. Consider a bin b with, say, $n_b$ correspondences: for example, one bin here has two correspondences and another has three. Each correspondence groups with the $n_b - 1$ other correspondences in its bin, and your weight at level zero is going to be 1, very similar to how we did it for the pyramid match kernel.

So now, if you look at the similarity scores here: c1 has two other correspondences that it gets grouped with, so you put its score as two. For c2, similarly, you have two other correspondences, so you put the score two. For c3 you have two; for c4 you have just one other correspondence in its bin, so you put a one; for c5 you have one other correspondence in its bin, so you put a one, and so on. You see that for c6, c7, c8 and c9 there are no new correspondences; c9 belongs to the next bin on its own, so it does not get counted here. Those are the similarity scores that you are going to have for this level.

(Refer Slide Time: 38:33)

When you go to the next level, you merge those bins; now you are left with just a two-by-two gridding of that transformation space. Once again, you erase all conflicting correspondences: here you see that in this bin c7 and c8 are matching to the same point, so you again remove one of them, c7. Now, in each of these bins you count the number of new correspondences that you have.

So c1 now has four correspondences in its bin, of which two were already counted at the previous level; so it has two new correspondences now, which is what you record for c1. Similarly, c2 has two new correspondences and c3 has two new correspondences. c4 had one correspondence counted at the previous level and four now, which means it has three new correspondences; that is the three you see here. Similarly, c5 has three, and so on.

And the weight for this level is half. In both these levels, you do not count c6, because c6
had a conflict and it was already counted as part of c5, those two points. So you do not
count c6 as a correspondence in this level, so which means only c9 is left alone in this bin
and hence has no new correspondences in that group and does not contribute to any new
counts.

(Refer Slide Time: 40:08)

Now, when you go to the next level, where you combine everything into a single bin, you see that each of these correspondences gets new correspondences. Once again, conflicting correspondences are erased, so c6 is erased and, similarly, c7 is erased. And now you see that c1 already had four correspondences and is going to get two new correspondences, which is why you get a two here. Similarly, c2 will get two new correspondences, c3 will get two new correspondences, c4 and c5 will also get two new correspondences each, and c8 and c9 now get six new correspondences each, because they did not have any correspondences counted in previous bins.

But the weight for each of these correspondences, since this is at a higher, coarser level, is going to be 1/4. So your final similarity score can be given by $\left(2 + \tfrac{1}{2}\times 2 + \tfrac{1}{4}\times 2\right)$ times whatever weight you want to give that particular correspondence, $w(c_1)$. You can always choose a uniform value for $w(c_1)$ through $w(c_9)$, so that the weights do not matter, or you can choose specific weights for a given application. This is how adding up all of these gives you the similarity scores for each of these correspondences.
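
A very rough sketch of this level-wise scoring, ignoring conflict removal and the per-correspondence weights w(c), and with an assumed bin count per level, could look like the following; it is meant only to convey the grouping-and-weighting idea, not the exact algorithm of the paper.

```python
# Level-wise Hough pyramid matching scores: bin 4-D transformation vectors on
# successively coarser grids; at each level, credit every correspondence with the
# number of *new* correspondences it gets grouped with, weighted by 1 / 2**level.
import numpy as np

def hpm_scores(T, levels=3, bins0=8):
    n = len(T)
    # Normalize each of the 4 dimensions to [0, 1) so a uniform grid can be used.
    T = (T - T.min(0)) / (np.ptp(T, axis=0) + 1e-9) * 0.999
    scores = np.zeros(n)
    seen = np.zeros(n)                        # group size already credited
    for level in range(levels):
        bins = max(bins0 // 2 ** level, 1)
        keys = [tuple((t * bins).astype(int)) for t in T]
        counts = {}
        for k in keys:
            counts[k] = counts.get(k, 0) + 1
        for i, k in enumerate(keys):
            new = (counts[k] - 1) - seen[i]   # newly grouped correspondences
            scores[i] += new / 2 ** level
            seen[i] = counts[k] - 1
    return scores

rng = np.random.default_rng(1)
T = rng.random((9, 4))                        # 9 correspondences in (tx, ty, s, theta)
print(hpm_scores(T))
```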

(Refer Slide Time: 41:45)

And now you try to see which of these gives you the highest density, because that is what Hough voting is all about: we look for regions where the density is maximized in the transformation space, and that gives you an effective match. That final bin, or translation, scale and rotation value, is the final match between those two points in these two images.

(Refer Slide Time: 42:16)

So Hough pyramid matching is linear in the number of correspondences, which makes things easier; and unlike pyramid match, where you may have to match several sets of points, you are only dealing with correspondences now, assuming that points have already been matched. There is no need to count inliers here: every correspondence simply votes, and you take whichever region has the highest density. And it is robust to deformations and multiple matching surfaces, and reasonably invariant to transformations because of the way the voting proceeds.

But one important limitation here is that it only applies to same-instance matching. So, once again, if you had, say, a bird in one image and two such birds in the second image, you would not be able to do the matching cleanly, because that one bird could have matched to either of the two birds in the second image, which means the transformations would be very different, and one of those could cause a problem in your final result.

(Refer Slide Time: 43:21)

Your homework for this lecture is chapter 16.1.4 in the Forsyth and Ponce book, and the pyramid match kernel also has a nice project page with more details, results and code, with a link to the paper too, if you are interested in knowing more about these methods.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 20
Transitioning from Traditional
Vision to Deep Learning
To complete this week's lecture, we will just summarize some of the things that we have seen
so far, as we will be transitioning to deep learning from next week. What we have seen so far
is a breezy summary of work in computer vision that took two to three decades. So, we have
covered a few topics, but we have not covered several more. An important topic that we have
probably missed is part based approaches, so on and so forth. Hopefully, we will be able to
cover that in a future course, but we will try to summarize the learnings that we have had so
far, which will perhaps help us in transitioning to going to deep learning for computer vision.

(Refer Slide Time: 01:04)

So, one of the things that we learned so far is that convolution is a very unique operation. It is linear and shift invariant; it has useful properties such as commutativity and associativity, and it distributes over addition, so it is very unique in its processing of signals. It forms the basis of image operations, and it also forms the basis of the neural networks most commonly used in computer vision, known as convolutional neural networks. So convolution remains in use to this day, even as part of deep learning.
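
As a quick illustration of two of these properties, here is a small numerical check with 1-D signals; the signals themselves are arbitrary examples.

```python
# Numerical check of commutativity and distributivity over addition for 1-D convolution.
import numpy as np

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.5, -1.0, 2.0])
h = np.array([2.0, 0.0, 1.0])

print(np.allclose(np.convolve(f, g), np.convolve(g, f)))          # commutativity
print(np.allclose(np.convolve(f, g + h),
                  np.convolve(f, g) + np.convolve(f, h)))         # distributivity
```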

(Refer Slide Time: 1:48)

We have also seen that the common pipeline in traditional vision tasks is: we typically extract some interest points in images, which could be edges or key points that have significant change in more than one direction, and we then extract descriptors out of these key points.

This was a common theme if you saw over the last week of lectures at least and we also saw
an idea of trying to use banks of filters such as steerable filters or Gabor filters to be able to
get multiple responses from a single image and then concatenate them to be able to do any
further task or processing. We also saw that these descriptors are useful for tasks such as
retrieval, matching or classification.

(Refer Slide Time: 2:48)

If you had to abstract out the understanding that we had so far, it is about the fact that each of these methods that we spoke about went from lower-level image understanding to aggregation of descriptors at a higher level. So, we used banks of filters to capture responses at different scales and orientations: steerable filters, Gabor filters, and so on. Then there were histograms,
which could be considered as doing some form of encoding because you are trying to
quantize different key points into a similar scale or even doing some kind of pooling of
features to a common cluster centroid or a common codebook element.

So, one could see that there are some similarities here between how this processing was
happening to how the processing happens in the human visual system. We at least briefly
talked about the various levels of the human visual system, which also bears a similarity of
trying to get different kinds of responses at different orientations and scales of the input
visual and then trying to assimilate and aggregate them over different levels in the human
visual system.

So, there is a similarity here. Although, it was not by design, perhaps it was about solving
tasks for computer vision, but there is a similarity about trying to get some lower level
features probably features of different kinds with different scales and orientations because
choosing only one feature can be limiting for certain applications, so you want to use a bank
of different responses and then combine them and be able to assimilate them for further
information.

(Refer Slide Time: 04:38)

Another important thing that we learned over the last few weeks is that there are applications for which local features are more important. The entire image may be important for certain tasks such as image-level matching, maybe an image-level search on one of your search engines; or there could be tasks for which only the local features are important, for example a certain key point, or when you want to find a correspondence between partially matching images, and so on.

So, it depends on the task, stereopsis is about detecting depth in images, if you want to
estimate motion or if you want to recognize an instance of an object, rather than just
recognize a class in an image. It depends, as to whether a local region matters or the full
image matters.

We also saw that encoding using methods such as bag of words can make your image representation sparse. For example, if you had, say, 10 cluster centers in your k-means for bag of words, it is possible that one of the images in your data set has features belonging to only three of those cluster centers. The remaining seven cluster centers have no occurrence in that particular image, which means your image would have a histogram where three of those bins have some frequency counts, but the rest of the seven bins will have a 0 count.
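
As a tiny illustration of this sparsity, here is a sketch with a made-up 10-word vocabulary and an image whose features fall near only three of the words; all values are synthetic.

```python
# Sparse bag-of-words histogram: assign each feature to its nearest cluster center
# and count occurrences; most of the 10 bins end up being zero.
import numpy as np

centers = np.random.default_rng(0).random((10, 2))       # a toy 10-word vocabulary
features = centers[[2, 2, 5, 5, 5, 7]] + 1e-3            # features near words 2, 5, 7

assignments = np.argmin(
    np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2), axis=1)
bow_histogram = np.bincount(assignments, minlength=10)
print(bow_histogram)   # e.g. [0 0 2 0 0 3 0 1 0 0] -- seven of the ten bins are zero
```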

That leads to a sparse representation where there are lots of zeros for that particular image.
So, encoding can result in that kind of representation for an image. And an important
takeaway here is that a lot of operators that detect local features or even global
representations of images for that matter can be viewed as performing convolution to get
some estimate of features because to detect your key points you need convolution as the key
operation that you are relying on, and then that is followed by some kind of a competition.

So, for example, be it the cluster centers, so each of the cluster centers is trying to win votes
of different features that correspond to that cluster center, and one of them wins. So, there
seems to be some kind of competition or pooling of the result of the convolution operation,
which leads to the next step or a higher-level understanding or description of the image.

(Refer Slide Time: 7:13)

So, we also find that the goal so far has been to learn descriptors and representations that
make it easy for us to match. You do not want to spend too much time on matching. Of
course, we use some intelligence in coming up with matching kernels and so on and so forth,
but the key idea is to be able to describe key points, describe images in such a way that the
simple dot product or simple matching kernels can be used to be able to match images or
parts of images or regions, in images.

These kinds of descriptors have some invariance to geometric transformations, a certain scale, rotation or translation; in certain cases that invariance is designed into the algorithm, and in other cases it may have to be learned through other means. This is a brief summary of the topics that we have seen so far, put into an abstract, concise, succinct form.

(Refer Slide Time: 8:16)

But what we're going to conclude with here is to show that we are going to move to deep learning, and as I just mentioned, although not by design, deep learning seems to be building upon some of these principles. Some of these will become clearer when we start discussing deep learning approaches, but we see that the idea of detecting lower-level responses of images to different kinds of filters, then aggregating them and building higher-level abstractions, and then arriving at a final representation that makes the task very simple, seems very similar to the idea that deep neural networks also use for solving vision tasks.

Although this may not have been by design, it seems to be similar in the overall structure. But
a key difference between all of these methods that we have seen so far and what we are going
to see with deep learning over the next remaining weeks of this course is that in deep learning
all of this is done in a learnable manner rather than we having to decide, which key point
should I use, should I use SIFT or should I use SURF.

Which descriptor should I use? Should I use oriented gradients, or GLOH, or local binary patterns? All of these become design decisions that are sometimes difficult because they may depend on the task, and there was no complete knowledge of which kind of descriptor should be used for which kind of task.

For example, for face recognition, would local binary patterns always be the choice of feature, or could something else be used? This kind of complete understanding of which method to use for which task was not very well known, and deep neural networks have in some sense changed the game there by simulating a similar pipeline, but with the entire pipeline purely learned for the given task at hand. We will see more of this soon, as we go into the next few weeks of lectures on deep neural networks and how they are applied in computer vision.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Neural Networks: A Review – Part 01

In the lecture so far we focused on computer vision as it was studied and practiced before the
advent of deep learning. Deep learning has completely revolutionized how computer vision is
done today. While there have been several dimensions to the efforts in computer vision before
this advent of deep learning, hopefully the few lectures that we have had so far give you a peek
into all those efforts, some of which continue to be relevant today when combined with deep
learning especially.

Moving forward, we focus our lectures on deep learning and its use in computer vision. We will start with the first lecture as a review of neural networks. For those of you who may have done a deep learning or neural networks course before, this may be a review, but even if you have not done such a course before, all the necessary material will be covered in these lectures.

(Refer Slide Time: 01:25)

I would like to acknowledge that most of the content of this lecture is based on the excellent
lectures of Mitesh Khapra on deep learning at IIT Madras.

(Refer Slide Time: 01:35)

To start with the history of neural networks: the history of neural networks is perhaps older than the history of computing itself. It started in the early 1940s, when two researchers, McCulloch and Pitts, worked on developing a computational model of the human brain and on using that model to simulate simple functions such as logic gates.

Later in the 1940s, a psychologist named Donald Hebb came up with what is known as Hebbian learning, which is used to this day. Hebbian learning states that in the human brain, neurons that fire together wire together, and neurons that fire out of sync fail to link; rather, if two neurons keep getting fired together, the strength of the connection between these two neurons gets better and better over time.

While if there are two neurons that rarely fire together, the strength of their connection becomes weaker and weaker over time. This is a principle that is used to this day. Later, in the 1950s, Frank Rosenblatt came up with the first model of a neural network, known as the perceptron. The perceptron is the simplest neural network one could have: it has a set of neurons, known as a layer of neurons, which receives a set of inputs, acts on them through a set of weights, and then gives an output which is thresholded to get the final decision. Rosenblatt also proposed a learning algorithm to obtain the weights in this simple neural network.

In the early 1960s, this learning rule was improved by Widrow and Hoff into what is known as the Widrow-Hoff learning rule or the Delta learning rule. This was the foundation of algorithms that we use to this day, which we will talk about. However, Rosenblatt claimed that this perceptron could classify any kind of data in a binary classification setting; Minsky and Papert, two researchers at that time, disproved this using the example of the XOR gate, which clearly cannot be separated by a simple perceptron.

This led to a downfall of neural networks in the 1970s, because all the hype around the perceptron had simply been crushed by its inability to handle the XOR scenario as well as similarly complex functions. In the 1980s, interest revived once again with the development of the Neocognitron in 1980, which is largely considered the first version of a convolutional neural network, and in 1986 came the development of a groundbreaking algorithm, backpropagation, by Rumelhart, Hinton and Williams, which we use to this day.

Later, in the 1980s and the mid 1990s, versions of neural networks such as convolutional neural networks were developed, and these neural networks were adopted in various applications. One popular application at that time was the handwritten digit recognition problem, today known through the MNIST data set. The reason for the popularity of this data set was the application setting at that time, a requirement of the United States Postal Service, which wanted handwritten digits on postal mail to be automatically sorted by analyzing the digits on the mail.

Convolutional neural networks were among the forerunners at that time in performance on the USPS and MNIST data sets. In the mid 90s also came support vector machines (SVMs), which were backed by an elegant theory and also started performing very well on these applications. For a large part of the period between the mid 90s and perhaps the first decade of the 21st century, machine learning applications were dominated by what are today known as traditional machine learning algorithms, such as support vector machines, boosting, bagging, decision trees, and many other variants of these algorithms.

In the mid 2000s, Geoffrey Hinton and Ruslan Salakhutdinov came up with a hierarchical feature learning algorithm, also known as unsupervised pretraining, using which they could initialize the weights of a deep neural network, and they showed that this could give neural networks distinct advantages and allow them to outperform other methods. That was the seed of the deep learning revolution. In the years following, there were further efforts to improve the training of deep neural networks; until then, neural networks could not be trained with many layers, which seemed to be a major limitation.

This culminated in the development of AlexNet, developed by Alex Krizhevsky in 2012 to win the ImageNet challenge. The ImageNet challenge is based on a data set consisting of images of objects from about 1000 categories, with over a million images at that time. AlexNet outperformed all competitors by a significant margin in 2012, capturing the attention of the entire computer vision community.

Since then, every year's winner of the ImageNet challenge has been a deep neural network. Not only the ImageNet challenge but other benchmark challenges in vision have also largely been dominated by deep neural networks, and that has led to the golden age of deep neural networks, which we hope to see over the rest of this course.

(Refer Slide Time: 09:10)

Starting with the McCulloch-Pitts neuron: McCulloch, a neuroscientist, and Pitts, a logician, proposed a simple computational model of the neuron in 1943. The way this model worked was that you have a function g which aggregates a bunch of inputs, which were all binary at that time, and a function f that takes a decision based on the aggregation. The inputs were considered to be either excitatory (a 1) or inhibitory (a 0 or minus 1, depending on what notation you choose for binary numbers).

So, your g function was denoted as a summation over the inputs, $g(x) = \sum_i x_i$, and your final output y was given as $y = f(g(x))$, where the vector x denotes the inputs in the input space, and $f(g(x)) = 1$ if $g(x) \geq \theta$ and $0$ if $g(x) < \theta$. This is what is illustrated here on the right: you have a set of n inputs, all coming into the function g, which aggregates them, and f is the function that acts on g based on a pre-specified threshold $\theta$ and decides the final output, which is also a binary value.

(Refer Slide Time: 10:52)

Here are a few examples of how the McCulloch-Pitts model of the neuron can be used to simulate different logic gates. Here is the basic McCulloch-Pitts neuron, and here is the McCulloch-Pitts neuron as an AND function: you have $x_1, x_2, x_3$, which are aggregated, and the threshold is chosen to be 3 (the $\theta$ from the previous slide is chosen to be 3 here), which means this McCulloch-Pitts neuron would give an output of 1 only if all three inputs were 1, because that is when the aggregate reaches the threshold 3 — a simple example of an AND gate.

Similarly, you could use a similar approach to simulate an OR function, where the threshold would be 1: if any of $x_1, x_2$ or $x_3$ is 1, the threshold condition is satisfied and you get a 1 as the output. For the NOR function you have a McCulloch-Pitts neuron like this, where, if you notice, there is a slight change in notation: these inputs now designate inhibitory inputs rather than excitatory inputs, which means that if one of them is on, the output is definitely going to be 0; and as long as you keep the threshold at 0 here, you would get an output of 1 when none of the inputs is on.

Similarly, the NOT function is another setting where you have a single input and it is an inhibitory input, and if you keep the threshold as 0, you can get your output to be 1, simulating the NOT gate. As you can see, you can probably simulate most logic gate functions using McCulloch-Pitts neurons; in fact, it is possible to show that not just a single McCulloch-Pitts neuron,

but a network of feed-forward McCulloch-Pitts neurons — which means you keep taking the input and feeding it forward through a set of McCulloch-Pitts neurons — can compute any Boolean function f that maps a Boolean input in n dimensions to a Boolean output in one dimension. In other words, if you had an n-dimensional binary input and a binary classification problem as output, you could solve it, as long as your inputs and outputs are all binary.
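
Here is a minimal sketch of such a neuron and the gates above, assuming the $g(x) \geq \theta$ firing rule and treating any active inhibitory input as forcing the output to 0; the thresholds follow the examples in the slides.

```python
# A McCulloch-Pitts neuron with excitatory and inhibitory binary inputs,
# used to simulate the AND, OR and NOT gates described above.
def mcculloch_pitts(inputs, inhibitory, theta):
    # If any inhibitory input fires, the output is forced to 0.
    if any(x for x, inh in zip(inputs, inhibitory) if inh):
        return 0
    return 1 if sum(inputs) >= theta else 0

AND = lambda x1, x2, x3: mcculloch_pitts([x1, x2, x3], [False] * 3, theta=3)
OR  = lambda x1, x2, x3: mcculloch_pitts([x1, x2, x3], [False] * 3, theta=1)
NOT = lambda x:          mcculloch_pitts([x], [True], theta=0)

print(AND(1, 1, 1), AND(1, 0, 1))   # 1 0
print(OR(0, 0, 1), OR(0, 0, 0))     # 1 0
print(NOT(0), NOT(1))               # 1 0
```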

Early work has shown that recursive McCulloch-Pitts networks can actually simulate any deterministic finite automaton. We are not going to describe that here, but if you are interested, you can look at this work to understand the connection between neural networks and automata in computer science.

(Refer Slide Time: 14:04)

As I just mentioned a few minutes ago Frank Rosenblatt was the psychologist who proposed the
perceptron model in 1958, this was refined much later in the later part 60’s very carefully
analysed by Minsky and Papert who concluded that this cannot work in certain settings, but
before we go there, let us introduce what a perceptron is. A perceptron was a generalization of
the McCulloch-Pitts neuron and now the inputs were no longer limited to Boolean values, you
could have any real value as input to a perceptron.

In addition, each of these inputs were weighted by a set of values which are denoted as 𝑤1 to 𝑤𝑛

which means mathematically your perceptron is going to look like you have your g function,
which was what we had with the McCulloch-Pitts neurons, which takes an input vector x of n
dimensions, once again this could be any n dimensions in any data you could be looking at a
patient record and you may want to say whether the patient is at risk for cancer, so these each of
these attributes could be say the patient blood group the patients age any real value not just
binary values anymore.

And the output of the g function is 𝑤𝑖𝑥𝑖 an inner product between a vector w and the input

vectors x and the final output is again 𝑓(𝑔(𝑥), which is going to be 1, if 𝑔(𝑥) is greater than
threshold or 0, if 𝑔(𝑥) is less than a threshold, as you can see it is a generalization of the
McCulloch-Pitts neuron where the inputs could be beyond and there are weights which multiply
the inputs.

(Refer Slide Time: 16:22)

A more general way of writing a perceptron is where the index starts at 0 instead of 1. And the 0
is simply to substitute the threshold on as the part of the equation itself, rather you are now going
to have i is equal to 0 to n, where 𝑥0 is going to be 1, it is a constant input 1 all of the time. Today

this is also known as bias. So, if you look at a perceptron as modelling a line, remember the
equation of a line can be written as w is equal to say 𝑤. 𝑥 + 𝑏 where each of w and x can be a
vector of a certain di mension say, b dimensions.

So, b is the weight w not that we are talking about which is multiplied by the value 1, that is now
subsumed into a common submission which goes from i is equal to 0 to n where 𝑤0 is going to

be denoted as − θ and the x corresponding to 𝑤0 which is 𝑥0 will always be 1. Why do we do it?

The reason we do it is now your output y which is 𝑓(𝑔(𝑥)) will be 1 when 𝑔(𝑥) is simply greater
than or equal to 0 and 0 if 𝑔(𝑥) is less than 0.

Earlier we had instead 0, we had θ which was the threshold, we are now subsuming that on the
left hand side of that equation inside the 𝑔(𝑥), we are bringing the θ, which means now 𝑔(𝑥)
^
will be the old 𝑔(𝑥), let us call that 𝑔(𝑥) − θ, so that − θ is now written as 𝑤0 which is − θ𝑥0,

so this is − θ𝑥0 is 1 which is what contributes to this − θ. So, this is a more accepted

convention of writing out a perceptron.

(Refer Slide Time: 18:41)

Let us now look at how the weights of the perceptron are derived. You have a set of weights $w_0$ to $w_n$ and a set of inputs 1, $x_1$ to $x_n$; remember, the 1 corresponds to the input for the weight $w_0$. You have a set of inputs with label 1, which we call the positive class, the set P, and another set of inputs with label 0, which we call the negative class, the set N. We start by initializing the vector w randomly and then loop over a set of steps; let us see what they are.
we loop over a set of steps, let us see what they are.

(Refer Slide Time: 19:31)

The first step says: randomly pick an input data point from P or N; that is, you select $x \in P \cup N$.

(Refer Slide Time: 19:44)

If x belongs to P, which means it is in the positive class, and your output was less than 0 — remember, for a positive class you would have wanted your output to be greater than or equal to 0.

(Refer Slide Time: 20:00)

But if it is less than 0, you simply add x to w. Why is this so? We will see in a moment. And if x belongs to the negative class and your output turned out to be non-negative, that is, greater than or equal to 0,

(Refer Slide Time: 20:20)

then you once again subtract x from w. Now, you could ask: what about the scenario where x belongs to P and $\sum_{i=0}^{n} w_i x_i \geq 0$? You simply do not do anything in that particular scenario. The same holds when x belongs to N and the summation is less than 0: you assume things are good and do not touch the weights in that scenario. Now, we have a termination criterion here, which says: keep doing this until convergence. So how do you converge?

You converge when all of your inputs are classified correctly by the w's that you have. What does
classified correctly mean? For all the data points P the output of a perceptron was greater than or
equal to 0 and for all your inputs coming from the set N the output of the perceptron is less than
0. That is the overall idea of the perceptron learning algorithm, but now let us ask the question.
Why would this work? Why should we add x to w? Why does that constitute a learning
procedure and why would that improve the quality of w? Let us see that a bit more carefully.
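
Before we answer that, here is a minimal sketch of the algorithm just described, with the bias folded in as $w_0$ via a constant input of 1; for simplicity it sweeps over all points instead of picking them randomly, and the toy data is an assumption.

```python
# Perceptron learning: add misclassified positive points to w, subtract
# misclassified negative points from w, until every point is classified correctly.
import numpy as np

def train_perceptron(P, N, max_epochs=1000):
    rng = np.random.default_rng(0)
    dim = P.shape[1] + 1                       # +1 for the constant input x0 = 1
    w = rng.standard_normal(dim)               # random initialization
    P1 = np.hstack([np.ones((len(P), 1)), P])
    N1 = np.hstack([np.ones((len(N), 1)), N])
    for _ in range(max_epochs):
        converged = True
        for x in P1:
            if w @ x < 0:                      # positive point on the wrong side
                w = w + x
                converged = False
        for x in N1:
            if w @ x >= 0:                     # negative point on the wrong side
                w = w - x
                converged = False
        if converged:
            break
    return w

P = np.array([[2.0, 2.0], [3.0, 1.0]])         # points labelled 1
N = np.array([[-1.0, -2.0], [-2.0, -1.0]])     # points labelled 0
w = train_perceptron(P, N)
print(w)
print([float(w @ np.r_[1.0, x]) for x in P])   # all >= 0 after convergence
print([float(w @ np.r_[1.0, x]) for x in N])   # all < 0 after convergence
```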

(Refer Slide Time: 21:42)

Remember, as we said on the previous slide, a perceptron is a model of a line, because it is effectively $w^Tx + b$; the line's equation is given by $\sum_{i=0}^{n} w_i x_i = 0$. This line divides your input space into two halves: the half of inputs where $\sum_{i=0}^{n} w_i x_i < 0$ and the half where $\sum_{i=0}^{n} w_i x_i \geq 0$.

(Refer Slide Time: 22:27)

Every point on that line satisfies the equation $w^Tx = 0$. Remember, saying $w^Tx = 0$ is equivalent to saying $w \cdot x = 0$, which is equivalent to $\sum_{i=0}^{n} w_i x_i = 0$. This is something that we keep using interchangeably for the rest of this course; please keep in mind that whether we write $w^Tx = 0$, $w \cdot x = 0$ or $\sum_{i=0}^{n} w_i x_i = 0$, all of them are the same. We know now that every point x on that line has to satisfy the equation of that line, which means $w^Tx = 0$.

(Refer Slide Time: 23:15)

Now, what can you tell about the angle between w and any point x which lies on that line? This comes from geometry: the angle will always be 90 degrees. For any line given by $w^Tx = 0$ (remember once again that the intercept b is subsumed into $w_0$), the vector w will always be perpendicular to the line. Why is that the case?

Because if you take a particular point x on this line, the angle $\alpha$ between x and w is given by $\cos\alpha = \frac{w^Tx}{\|w\|\,\|x\|}$, a simple expansion of the dot product. And this cosine, we know, has to be 0, because $w^Tx = 0$ for all points on the line, and hence the angle has to be 90 degrees. In other words, the vector w is perpendicular to points on the line. Why does this matter?

(Refer Slide Time: 24:41)

Now, let us look at a point belonging to class P. Remember, this point does not lie on the line; you want it to lie on the side of the line where $w^Tx \geq 0$, because it belongs to the positive class. You want your perceptron to classify this point correctly, which means the perceptron should give you a value greater than or equal to 0. Let us see what that requires.

If you want the perceptron to classify this point correctly, the angle between this vector x belonging to P and w has to be less than 90 degrees. Why so? Let us again draw out that line given by $w^Tx = 0$: this is w, this is the set P, and this is the set N, so all the input points of the negative class are on one side of that line, and all the input points on the positive side of the line belong to the positive class. So any point in the positive class lies on the same side as w, which means its angle with w is going to be less than 90 degrees.

(Refer Slide Time: 26:06)

Similarly, what about the angle between an x belonging to the negative set and w? This should be straightforward: it has to be greater than 90 degrees, in the same way we visualized the positive class a minute ago. If you do not see it, you can spend a couple of minutes drawing it, and you will see that this works out.

Now, if you have a point x belonging to P for which $w^Tx$ becomes less than 0, it means that the angle $\alpha$ between this x and the current w is greater than 90 degrees, which you do not want. You want to change w to ensure this does not happen, but currently it is greater than 90 degrees.

(Refer Slide Time: 26:59)

So, what do we do? We want the angle to be less than 90 degrees. So we are going to add x to w to achieve this purpose and make the angle between w and this x belonging to P less than 90 degrees.

(Refer Slide Time: 27:16)

Why do you think that helps? Let us understand that. Consider $w_{new} = w + x$, which means we have added that x belonging to P to w and obtained the new w for this iteration; let us try to understand what the new angle will be.

(Refer Slide Time: 27:35)

We just found out that when $w^Tx < 0$ for a point belonging to P, the angle turns out to be greater than 90 degrees, which we do not want; we want it to be less than 90 degrees.

(Refer Slide Time: 27:47)

Now, let us see what happens for $w_{new}$. We know that $\cos\alpha_{new} = \frac{w_{new}^T x}{\|w_{new}\|\,\|x\|}$, so we are going to ignore the denominator and simply say $\cos\alpha_{new} \propto w_{new}^T x$.

(Refer Slide Time: 28:06)

This is proportional to $(w + x)^T x$, because $w_{new} = w + x$, which can be written as $w^T x + x^T x$; and since $w^T x \propto \cos\alpha$, this is proportional to $\cos\alpha + x^T x$ (up to positive scaling factors).

(Refer Slide Time: 28:30)

Now, $x^T x$ has to be a positive quantity, because it is the dot product of a vector with itself: all the values get squared and added up. So you know that $\cos\alpha_{new} > \cos\alpha$.

(Refer Slide Time: 28:52)

In other words, this tells us that $\alpha_{new} < \alpha$, and hence we are moving towards the right weight vector, which will ensure that points belonging to P have the output of the perceptron greater than or equal to 0. You can work out a very similar scenario for points belonging to the negative class; the only difference is that you would have a negative sign there (you subtract x from w), and you will find that the new angle will be greater than the old angle, which is what we would want for a negative class.
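
A quick numerical check of this argument, with arbitrary example vectors:

```python
# For a positive point x that is currently misclassified (w.x < 0), updating w to
# w + x increases cos(alpha), i.e. reduces the angle between w and x.
import numpy as np

def cos_angle(w, x):
    return (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))

w = np.array([1.0, -2.0])
x = np.array([1.0, 1.0])          # belongs to P, but w @ x = -1 < 0
w_new = w + x
print(cos_angle(w, x), cos_angle(w_new, x))   # the second value is larger
```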

(Refer Slide Time: 29:33)

That should convince you that this perceptron learning algorithm keeps improving in each iteration. Now, when would it converge? We only know that it keeps improving in each iteration, but will it actually get us to a final solution? We have not proved that, and we are not going to work out that detail here; but if you want a formal convergence proof showing that, as the number of iterations increases, you are guaranteed to reach a solution as long as the points can be separated into the two classes P and N using a line, you can see the link shared here for the formal convergence proof.

Deep Learning for Computer Vision
Professor. Vineeth N. Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology – Hyderabad
Lecture – 22
Neural Networks: A Review-Part 2

(Refer Slide Time: 00:15)

We said that although Rosenblatt proposed the perceptron, Minsky and Papert later showed that the perceptron is limited to only certain kinds of data configurations and does not work for data configurations beyond those types, and the example they used was the XOR gate. Let us try to understand that. We know the truth table of the XOR gate: it is given by something like this, where if exactly one of the inputs is 1 you get a 1, otherwise a 0.

So from a perceptron perspective, what we would want is: in the first scenario we would want $\sum_i w_i x_i$ to be less than 0, because we want the output to be 0. Similarly, in these two cases we would want it to be greater than or equal to 0, and in the last case we want it to be less than 0 again. So let us analyze this. Take the first equation, which is $w_0 + w_1 \cdot 0 + w_2 \cdot 0$, where the zeros come from the first row of the truth table.

We ideally want this to be less than 0, which is what we want the perceptron to do, which means $w_0$ has to be less than 0. Similarly, from the second line of the truth table you have $w_0 + w_1 \cdot 1 + w_2 \cdot 0$, which are the inputs in the second line. We want that to be greater than or equal to 0, which means $w_1 \geq -w_0$. So since $w_0$ is less than 0, $w_1$ would be a positive quantity at least as large as the absolute value of $w_0$.

Similarly, the third line of the truth table would give you $w_2 \geq -w_0$, and the final line would show that $w_0 + w_1 + w_2$ should be less than 0, or $w_1 + w_2 < -w_0$. It is quite clear that, because $w_1$ and $w_2$ are positive, $w_1 + w_2$ cannot be less than $-w_0$: we know that individually each of them is at least $-w_0$, where $w_0$ itself is a negative number.

w0 by itself is a negative number, so − w0 will be a positive number. So we can see a


contradiction in these criteria here and that should clearly show you what the perceptron
cannot solve the XOR problem. Also visually speaking you have the XOR problem to be
represented as you want these two elements which are 0, 1 and 1, 0 to have labeled to be 1
and you have these two elements 0, 0 and 1, 1 to have labeled it to be 0.

And as we said, a perceptron simply embodies a line: if you draw a line here you are going to get this element wrong, and if you draw a line there you are going to get that element wrong, so the XOR gate cannot be solved by a linear model. As you can see, it is impossible to draw a line which separates the red points from the blue points here.

(Refer Slide Time: 03:52)

And that leads us to the concept of the multi-layer perceptron, as the figure shows. Unlike the perceptron, it is not restricted to only an input layer and an output layer; it also includes a hidden layer of neurons. The number of neurons in this hidden layer is a design decision.

(Refer Slide Time: 04:21)

So here is an example of how a multi-layer perceptron can be used to solve the XOR problem. This is just one weight configuration that can solve it; there could be other weight configurations that may or may not solve the XOR problem. All that we are trying to say here is that there is at least one weight configuration of a multi-layer perceptron that can solve the XOR problem.

Let us look at this example. You have the weights denoted on each of these connecting edges here, and the bias inputs are given to be minus 1. With these values given to us, let us take a couple of cases from the truth table and verify that this indeed gives a solution. If you have 0, 0 as the input, z1 receives 0 into 2 plus 0 into minus 1 plus minus 1 from the bias, so z1 gets an input of minus 1.

Since each hidden unit is a perceptron and its input is negative, its output is 0. You get exactly the same thing for z2: it receives minus 1 and its output is 0, which means the input to y is 0. Assuming the threshold is such that only values strictly greater than 0 give an output of 1, the output at y is 0, which is what we want for the input 0, 0.

Let us take the second case now and see how this works out. Consider the input to be 0 and 1. In that case z1 receives a 0, a minus 1 and a minus 1 from the bias, which turns out to be minus 2, so its output is 0. For z2 you would get a 0, a 2 and a minus 1, which is 1, so the output of z2 will be 1.

That means y receives 2 into 1 plus 2 into 0, which is greater than 0 and corresponds to an output of 1. A similar computation holds for the third row, the tuple 1, 0; let us rather work out the last one now just to be sure. Consider 1, 1: z1 receives 2, minus 1 and minus 1, which is 0, so its output is 0. Similarly, z2 receives a 2, a minus 1 and a minus 1 including the bias, which is also 0, so its output is 0, and y therefore receives 0 and outputs 0. If you work this out carefully, you will notice that this is a valid solution for the XOR problem.
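As a small check (not from the slides), here is a sketch that verifies one weight configuration consistent with the numbers read out above; the exact weights on the slide may differ, so treat these values as an illustrative assumption.

```python
import numpy as np

def step(v):
    # Fires 1 only for strictly positive pre-activation, as assumed above
    return (v > 0).astype(int)

# Hypothetical weights consistent with the values read out in the lecture
W_hidden = np.array([[2, -1],    # weights into z1 (from x1, x2)
                     [-1, 2]])   # weights into z2
b_hidden = np.array([-1, -1])    # hidden biases
w_out = np.array([2, 2])         # weights from (z1, z2) into y
b_out = 0                        # output bias (assumed to be 0 here)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = step(W_hidden @ np.array(x) + b_hidden)   # hidden layer outputs
    y = step(np.array([w_out @ z + b_out]))[0]    # output neuron
    print(x, "->", y)                             # prints 0, 1, 1, 0: XOR
```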

(Refer Slide Time: 07:15)

You can in fact show that any Boolean function of n inputs can be represented exactly by a network of perceptrons containing one hidden layer with 2ⁿ perceptrons and one output layer containing one perceptron. So, if you have a multi-layer perceptron with 2ⁿ perceptrons in its hidden layer, you can represent any Boolean function of n inputs. How do you prove this?

We are not going to formally prove it, but informally it is fairly simple, because every case that you have can be represented as a neuron in your hidden layer. Remember, if you have n Boolean inputs there are 2ⁿ combinations, and you can ensure that the right perceptron fires for each of those combinations.

So you can come up with a weight configuration that ensures that each of those 2ⁿ perceptrons in the middle layer corresponds to one combination of the n inputs, which automatically gives you your solution; it is fairly straightforward to see this informally at least. One thing to keep in mind, though, is that while we say any Boolean function can be implemented by a hidden layer with 2ⁿ perceptrons, this is sufficient but not necessary, which means you could solve a problem with fewer than 2ⁿ neurons too.

For example, we just solved the XOR problem with a hidden layer of only two hidden neurons; according to what we said above we should have needed 2² = 4 neurons, but that is not the case. That is the reason why we say that a network of 2ⁿ perceptrons (plus 1 for the bias) is not necessary, but it is sufficient. You can probably find solutions with even less.

But you can definitely find a solution with 2ⁿ neurons, or perceptrons, in this way. Why does this distinction between sufficiency and necessity matter? The reason is that as n increases, the number of perceptrons in the hidden layer increases exponentially as 2ⁿ. So your multi-layer perceptron can become too computationally intensive to train, or even just to take a forward pass through, and you do not want it to be that computationally intensive.

You ideally want to find the multilayer perceptron solution that has the least number of
neurons in your hidden layer.

(Refer Slide Time: 10:00)

So let us ask the question: what do we do if we want to go beyond binary inputs and outputs? We only spoke about Boolean inputs. We did say that the perceptron could handle inputs beyond binary, but the previous result we showed was only that a multi-layer perceptron with 2ⁿ hidden neurons can represent any Boolean function. What if the function is not Boolean? Can we use the same multi-layer perceptron to represent such functions?

(Refer Slide Time: 10:40)

The answer is that we need something called activation functions. Why activation functions? We will see that in a moment. So far we noticed that a perceptron only fires when the weighted sum of its inputs is greater than the threshold −w0, which we called θ. This thresholding logic can become very harsh at times. For example, if your −w0 was 0.5, then even 0.49 and 0.51, which are very close to each other, will end up giving very different results, because one of them is below the threshold.

And one of them is above the threshold, which means your thresholding function is a step function where you have a sudden change in your output even with a very small change in your input. Typically this behavior does not occur in the real world, and even here the behavior is not a characteristic of the problem; it is a characteristic of using a step function as the thresholding function.

And we all know that in the real world you generally expect a smoother decision function such as the one shown in red here: as you go from one value to another, you do not want the output to jump; you do not want the output of the perceptron to become one suddenly when the value goes from 0.49 to 0.5, or from 0.499 to 0.5. How do we handle this?

(Refer Slide Time: 12:24)

The way we handle this is to introduce what are known as activation functions, which aim to replace the threshold function with smoother functions. One early example, which was used for several decades, is known as the sigmoid activation function, and a perceptron or neuron that uses a sigmoid activation function is known as a sigmoid neuron.

You could use any logistic function with a shape such as this to obtain a smoother output function than a step function. The one we are particularly going to talk about here is the sigmoid logistic function: given the input wᵀx, the sigmoid function computes σ(wᵀx) = 1/(1 + e^(−wᵀx)), which in graph form has this particular shape.

Clearly, you no longer have a sharp transition at a threshold, but a smooth transition as your input changes. Also, your output is no longer just binary, not just 0 or 1; it can be any value lying between 0 and 1, which can potentially be interpreted as a probability of the output.

Which means, if you used a sigmoid activation function on the neuron in your output layer, it would give you a value between 0 and 1, which associates a probability with whether a point belongs to the positive class or the negative class. So if your output was, say, 0.6, then assuming your input is a patient record, you would say this patient has a 60 percent risk of, say, suffering from cancer or a heart attack, or whatever problem you are modelling in this particular scenario.

More importantly, unlike the step function, this function is smooth: it is continuous at −w0, which is your threshold, it is continuous everywhere, and it is also differentiable. Why is this important? We will see very soon; its being differentiable is extremely important for how we are going to train these kinds of networks.

(Refer Slide Time: 15:13)

There are other popular activation functions; we will describe them briefly here and cover them in detail in a later lecture this week. The sigmoid activation function, as we just said, is given by σ(z) = 1/(1 + e^(−z)) for an input z, and the hyperbolic tangent by tanh(z) = (e^z − e^(−z))/(e^z + e^(−z)). There is also an activation function called the rectified linear unit, the blue line on this particular graph, which is given by: if your input is z, the output is max(0, z).

So if your input is negative it gives you 0, and if your input is positive it gives you the value itself; that is a very popular activation function. There is also a variant of the ReLU activation function called the leaky ReLU, which does not make the negative part 0 but keeps it a very small value, and we will see the design motivation for each of these a little later this week.

A more general variant of the ReLU activation function is known as the exponential linear unit or ELU, which is simply a smooth form of the ReLU or leaky ReLU activation function: for negative inputs it is given by α(e^z − 1), and for positive inputs it is z itself, where α is a number greater than 0; you can see the ELU activation function in the other color on this particular graph.

We will see these activation functions in detail a bit later, but all that we are trying to say here is that activation functions are important for understanding how well multi-layer perceptrons can model non-binary, non-Boolean data.
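As a quick reference, here is a small sketch (not part of the lecture) of the activation functions just mentioned, written with NumPy; the slope and α values are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    # Smooth, differentiable squashing to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes to (-1, 1); equivalent to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    # max(0, z): zero for negative inputs, identity for positive ones
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Keeps a small slope instead of zero for negative inputs
    return np.where(z > 0, z, slope * z)

def elu(z, alpha=1.0):
    # Smooth variant: alpha * (e^z - 1) for negative inputs, z otherwise
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.linspace(-3, 3, 7)
print(sigmoid(z), relu(z), elu(z), sep="\n")
```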

(Refer Slide Time: 17:12)

And that is where the study of the representation power of multi-layer perceptrons (MLPs stands for multi-layer perceptrons) comes in. A very well-studied, very well-cited theorem known as the universal approximation theorem states that a multi-layer network of sigmoid neurons with a single hidden layer can be used to approximate any continuous function to any desired precision.

This is a fairly strong statement: we are saying that given any continuous function, we can use a simple multi-layer perceptron with sigmoid neurons and one hidden layer to approximate that continuous function. We are not going to formally prove it here; if you are interested, the papers cited here are good pointers to the proof. There is also a very nice visual explanation of the universal approximation theorem in chapter 4 of Michael Nielsen's online book on Neural Networks.

(Refer Slide Time: 18:27)

So your homework for this lecture is to solve XOR using a multi-layer perceptron with four hidden units; come up with your own weights. This should also help you understand the result we spoke about, that any Boolean function can be represented by a multi-layer perceptron with 2ⁿ hidden units, and give you an intuition for it.

For further reading, please feel free to refer to Mitesh Khapra's original lecture slides, which are on the website linked here. There are other good resources: the Deep Learning book, which is publicly available at deeplearningbook.org, is a general resource that we may point to in various parts of this course; chapter 6 is a good introduction to multi-layer perceptrons.

There is also the Stanford CS231n course, whose notes are available here, the Stanford UFLDL tutorial, and a very nice introduction to neural networks by Raul Rojas.

(Refer Slide Time: 19:42)

There are some references and we will stop here for now.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 23
Feedforward Neural Networks and Backpropagation - Part 1

This lecture continues the discussion on multilayer perceptrons and how we can train them. For perceptrons, we talked about the perceptron learning algorithm. For multilayer perceptrons, how do we train them, and would the same method work? That is what we will try to find out now.

(Refer Slide Time: 00:45)

Multilayer perceptrons are also known as feedforward neural networks because the information is fed forward from the input, through the layers, to the output, and all neurons are organized in layers. While we saw examples of multilayer perceptrons with one hidden layer in the previous lecture, you could have as many hidden layers as you need between the input layer and the output layer. Obviously, at the end of the day a neural network, a feedforward neural network or a multilayer perceptron, is approximating a function that takes you from input to output.

So whatever function you are approximating can perhaps be better approximated if you have more hidden layers, if that function is complex; this is not a necessity, but having more hidden layers could help you. We will see later how adding too many layers can also cause problems, but at this point let us move forward with this discussion of training a feedforward neural network.

So, as I just mentioned, a neural network is typically used to approximate some function f*. f* is the ideal function that, in machine learning, can assign the correct label to a given input; so f* could just be a classifier, and with the neural network we try to approximate this function f*.

The neurons are arranged in the form of a directed acyclic graph: directed because the edges from one layer to the next layer are directed edges, so information does not flow the other way (although gradients do flow backward during training, as we will see in a moment); when you propagate information, you propagate it only in the forward direction. That is the directed nature of this graph.

And it is acyclic because there are no cycles in this particular graph. The information flows only
in one direction from the input all the way to the output and that is why they are also known as
feedforward networks.

(Refer Slide Time: 03:20)

The number of layers in the neural network is typically known as its depth. Each neuron can be seen as a vector-to-scalar function, which takes a vector of inputs from the previous layer and computes a scalar value. Remember, even a single perceptron is a vector-to-scalar function because it has the n different dimensions of x as input and gives one output; each neuron is similar to a perceptron in this sense.

And when you have multiple layers stacked one after the other in a neural network, you can look at the entire network as a composition of several functions, where each layer is a function that takes you from one representation to the next.

So let us assume that you have layer 1, denoted by a function f₁ that takes you from R^(d_i) to R^(h₁); that is what the function f₁ does. Similarly, f₂ is a function that goes from R^(h₁) to R^(h₂), and f₃ goes from R^(h₂) to R^(d_o). So each layer is a function by itself, and the overall neural network can be envisioned as a composition of these functions that achieves the purpose you are looking for.
looking for.

(Refer Slide Time: 05:21)

So in machine learning we all know that we are given training data to train an algorithm and if
you consider the supervised learning settings, then you have data as well as labels provided to
you in your training data.

Now, to approximate some function f*, we are generally given noisy estimates of f* at different points in the form of a data set (xᵢ, yᵢ), where xᵢ is a vector of a certain dimension and yᵢ is the output label corresponding to that particular input data point xᵢ, and there are a total of m such examples in your data set. You can assume that these examples in your training data set are instances of the overall function f* that takes you from x to y.

So our neural network defines a function y = f(x; θ), where x is the input and by θ we refer to all the weights and biases that we have in the neural network; we will henceforth call θ the weights and biases of the neural network. These are the values that parameterize the neural network, and they define the output you are going to get when you propagate a value through the network.

Our goal is to somehow find a way for f to best approximate f*. Once again, let me clarify what f* is: f* is a hypothetical function which takes you from input to output, so each of your training data points is an instance of that function, and in machine learning we typically do not know what f* actually is.

If you knew f*, you actually would not need machine learning: when new data points come in, you would simply apply f* to them and get the label that you need. In machine learning we are only given noisy instances of f*, but we do not know what f* is, and that is what defines the field of machine learning itself.

And our goal is to use the neural network to best approximate f*. Now the question is: how do you find the values of the parameters θ, the weights and biases, to train this network? You are given a training data set, and let us assume you are given a neural network of a certain architecture, say two hidden layers, the first with 10 neurons and the second with 100 neurons; whatever it is, that is user defined.

That is given to you, the training data set is given to you, and your goal is to find the weights of the neural network which will do well on the training data. That is the process of training neural networks, and to do that we introduce a very well-known method in optimization known as Gradient Descent. Gradient descent is a very simple but very well-studied method in optimization which is used to minimize objective functions in general, and that is what we are going to use to train feedforward neural networks.

(Refer Slide Time: 09:06)

Let us take a simple example and then take it to feedforward neural networks. Neural networks are usually trained by minimizing a loss function such as mean squared error. When you give an input to a neural network, it produces a certain output. What is that output? f(x; θ). Once again, take the patient example: suppose you initialize θ to some random values to start the neural network with,

and you now pass one patient's information, such as blood group, whether the person smokes or not, and so forth, to the neural network; you get a certain output at the output layer based on the weights that you have initialized. That value is defined by f(x; θ), but in your training data you already know what x should give you, because that is why it is called training data.

You have the correct labels given to you, so you have at least f*(x) defined for that particular value of x; you may not know what f* is at other places, but you know what f* is at that particular location. You now sum up this error over all your input data points, where m is the total number of data points, and take the average error across all of your data points.

This is typically known as the mean squared error: the mean of the squared errors over all your training data samples. Let us take a simple one-dimensional example and then try to study this further. Let us say we would like to minimize the function f(x) = x²; specifically, let us assume that we want the value x* which gives the smallest value of f(x).

Of course, we know that x* = argminₓ f(x). The min gives you the minimum value of f(x) across all possible values you can choose for x, whereas argminₓ is the value of x that achieves that minimum f(x); that is the difference between min and argmin.

So x* is the x that gives you the minimum value of f(x). To summarize this slide: you have the mean squared error, which we typically use to train neural networks, but before we deal with that loss, let us just consider any function f(x) which we want to minimize, and let us see the 1D case.

(Refer Slide Time: 12:14)

What gradient descent suggests to us is that, given the function f(x), we can obtain the slope of f at x by taking its derivative; let f'(x) denote that slope. Now if you give a very small push to x in the direction of the slope, the function will increase: that is, if you move from x to x + p·sign(f'(x)), which means you take the sign of the gradient and take a small step of size p in that direction,

then f(x + p·sign(f'(x))) > f(x). You could now ask the question: what if the sign itself was negative? The inequality still holds, of course, because that is how the gradient is defined. And the reverse is also true, which means that if you take one step in the negative direction of the gradient, your function value will become smaller than the function value you had at x.

That should give us a clue: if I take the gradient at a particular point x and go one step in the negative direction of the gradient, I am going to find another point where the function value is lower, which means if I keep repeating this over and over again I will finally reach a point where the function value is the least. This is the basic idea of gradient descent. We start off at a random x and keep taking small steps in the direction of the negative gradient, and we iteratively reach a point at which f(x) is optimal.
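As a small illustration (not from the lecture), here is a minimal gradient descent loop for f(x) = x², whose derivative is f'(x) = 2x; the starting point and learning rate are arbitrary choices.

```python
def f(x):
    return x ** 2

def f_prime(x):
    # Derivative of x^2
    return 2 * x

x = -1.0      # arbitrary starting point
lr = 0.1      # learning rate (step size), an illustrative choice
for step in range(50):
    x = x - lr * f_prime(x)   # move in the negative gradient direction

print(x, f(x))                # x approaches 0, the minimizer of f
```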

(Refer Slide Time: 14:31)

Here is an illustration of what we just spoke about. Let us assume that f is indeed a convex function, and the minimum is attained at x = 0. Since the minimum is attained there, we know that that particular point is a critical point, which means f' = 0 at that point.

So let us consider a certain x, say −1, which is this particular point; we can see that at that point the gradient is negative. If the gradient is negative, we take one step in the direction of the negative gradient, which means we go one step towards the positive side, or rightward, and that will definitely take you to a lower function value.

On the other hand, if your current x were here at 1, the gradient of the tangent at that point is positive; you still take a step in the negative direction of the gradient, which means you go left when you run this algorithm, and that again takes you to a point where the function value is smaller.

As you can see, it is a fairly simple method, but given a function it can help you reach the minimum of that function and find the x at which that minimum is attained. How is this connected to neural networks? We will come to that in a moment.

(Refer Slide Time: 16:12)

Let us now consider a slightly more complex setting, a multivariate setting. In the previous slides we considered f(x), which is univariate. While training neural networks, the loss function we minimize is parameterized by the many weights of the neural network; let us subsume all of them into one quantity known as θ, the weights and the biases.

Let us now denote this loss function as L(θ); our aim is to find the weight vector θ which minimizes L(θ), very similar to what we saw on the earlier slide.

(Refer Slide Time: 16:56)

Now, let u be a unit vector in the direction that takes us to the minimum of this loss function. Rather, we are saying that we have the gradient ∇_θ L(θ), and some component of the gradient may help us move towards the minimum of the loss function that we want to minimize.

Mean squared error was one such loss function; at this point we are saying that mean squared error need not be the only loss function for a neural network, there could be other loss functions that you minimize the parameters with respect to. Let us just call the loss L(θ) and its gradient ∇_θ L(θ).

And let u be the direction, relative to the gradient, that will take you to the minimum; for simplicity we assume that u is a unit vector. So we want to minimize over all possible unit vectors: min over u with uᵀu = 1 of uᵀ ∇_θ L(θ).

(Refer Slide Time: 18:03)

Now, this is simply a dot product, so it can be written as ||u||₂ ||∇_θ L(θ)||₂ cos β, where β is the angle between the vector u and the gradient of the loss with respect to θ. Clearly, the 2-norm is a non-negative quantity; by its standard definition, the 2-norm is simply the square root of the sum of the squares of the components, √((∂L/∂θ₁)² + (∂L/∂θ₂)² + ···), however many components there are.

That is your standard 2-norm definition, and that quantity is non-negative. So, since we want to minimize this dot product and u is a unit vector, the only way to minimize this quantity is to make cos β = −1. Remember, the 2-norm of u is 1 because u is a unit vector; we already know that.

So the only way to minimize this entire quantity is to make cos β as low as possible, and the least value of cos β is minus 1, which means, since ||u|| is 1, u has to be in the direction of the negative gradient. That is the vector that will actually minimize this particular quantity. Remember, for β to be 180 degrees, u has to point opposite to the gradient, because β is simply the angle between u and the gradient.

So u simply has to be in the opposite direction of the gradient to minimize this quantity. Rather, if we want to use the gradient in some way to reach the minimum of a function, we have to go in the opposite direction of the gradient. Keep in mind that this is the gradient descent algorithm; it has a complement known as gradient ascent, where if you go in the positive direction of the gradient you will reach a maximum. That algorithm also exists, but in this context we are interested in minimizing a loss function and hence we are going to focus on gradient descent.

(Refer Slide Time: 20:34)

Let us now try to see how you actually use gradient descent in practice to train neural networks, or to minimize any other function for that matter. For neural networks in particular, we start with a random weight vector θ and compute the loss L(θ) of the current network over the training data set using a suitable loss function such as mean squared error.

You could have other loss functions, and we will see many of them over this course, but at this point let us take one of the simplest, which is the mean squared error. We compute the gradient of the loss function with respect to each parameter in the network; it could be a weight, it could be a bias, it could be any other parameter.

For every parameter that matters in producing the output of the neural network (every parameter that you need to learn, of course), you compute the gradient of the loss function with respect to that value, and we denote it as ∂L/∂θ. Now, based on gradient descent, which is an iterative procedure, you define the update step as follows.

You take ∂L/∂θᵢ, the partial derivative of the loss with respect to that particular parameter θᵢ; if there are many weights and biases in the neural network, you have to compute the derivative of the loss function with respect to every weight and every bias of the network. Then θᵢ^(next) = θᵢ^(current) − η ∂L/∂θᵢ^(current), where η is simply known as the learning rate or step size.

It tells us how large a step you want to take in that direction; we will talk about ways in which it can be chosen a little later. We keep repeating this process until the gradient is 0. Why? Because when the gradient is 0 we have reached a critical point, which is perhaps the minimum of that function.

(Refer Slide Time: 22:45)

Let us look at a neural network again. As we just said some time ago, a feedforward neural network is a composition of multiple functions organized as layers. Now, how do we implement gradient descent in this kind of neural network? We know we have to compute the gradient of the loss function. Let us assume the loss function is given to us, say mean squared error. We have to compute the gradient of the loss function with respect to every weight in the neural network. How do we do this?

(Refer Slide Time: 23:17)

We do this by taking advantage of the chain rule in calculus. Computing the gradient with respect to any weight in a layer requires the computation of the gradients with respect to the outputs that involve that weight, in every layer from that layer to the output.

So, for example, if you had a certain weight wᵢ here, remember the loss function L(θ) is defined at the output, where θ is nothing but the set of all the weights and biases. We want to find ∂L(θ)/∂wᵢ, where wᵢ is one of the weights in the neural network.

To do that, remember we need to apply the chain rule, which means we have to compute partial derivatives through all the quantities to which wᵢ contributed; for example, wᵢ probably contributes to every output of the next layer, because this neuron feeds into every other neuron in that layer, and so on. So we now have to sum all of those contributions and then find the gradient of the loss function with respect to this wᵢ. That is the overall intuition, but let us see how we actually do it as an algorithm.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 24
Feed-forward Neural Networks and Back-propagation Part-2

(Refer Slide Time: 0:16)

This leads us to the popular methodology, which we also talked about in the history of neural networks, known as back-propagation. Back-propagation is a procedure that combines gradient computation using the chain rule and parameter updates using gradient descent. Remember, we said we want to minimize the loss function and that gradient descent is the way to minimize it; but gradient descent requires the computation of derivatives, for which we need the chain rule. That is going to be the overall strategy for training the neural network; let us see more details.

(Refer Slide Time: 0:55)

Let us consider once again the simple feed-forward neural network, the other name for a multi-layer perceptron, and a set of M training samples given by {(x^(i), y^(i))} for i = 1, ..., M. Let θ constitute all the weights and biases in the neural network, let h_θ(x) denote the output of your neural network, and let y be the expected output, because we know that is the correct output for the corresponding training sample.

So, given one particular sample x and its correct label y, your loss function is given by L(θ; x, y) = ½ (h_θ(x) − y)². Remember, we said mean squared error; we are not taking the mean here because it is a single sample. We multiply by half purely for mathematical simplicity; you will see why a bit later. The reason it does not matter is that we are interested in the θ that minimizes the loss function and not the loss function value itself.

So, whether we minimize half of the loss function or the full loss function, it is the same θ that gives the minimum in both cases; it does not matter to us, and we introduce the half purely for mathematical simplicity.

(Refer Slide Time: 2:24)

Now, what we just saw is the cost function for a single example. The overall cost function across all of these examples is going to be given by L(θ) = (1/M) Σ_{i=1}^{M} L(θ; x^(i), y^(i)): the same loss function summed up over all training samples and then averaged. This term can be replaced by (1/2M) Σ_{i=1}^{M} ||h_θ(x^(i)) − y^(i)||², your mean squared error definition. That is the overall cost function that we are going to go with. We still have to compute the gradient, but that is the overall cost function.
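As a tiny illustration (not from the slides), this overall cost can be computed directly from arrays of predictions and labels; `preds` and `targets` are hypothetical NumPy arrays holding h_θ(x^(i)) and y^(i) for all M samples.

```python
import numpy as np

def mse_cost(preds, targets):
    # (1/2M) * sum over samples of ||h_theta(x) - y||^2
    m = preds.shape[0]
    return np.sum((preds - targets) ** 2) / (2.0 * m)
```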

(Refer Slide Time: 3:01)

Let us now set up the notation needed to describe the back-propagation procedure completely.

Let us assume there are n_l layers in the neural network, going from l = 1, ..., n_l, and let us denote the activation of node i at layer l as a_i^(l). Remember, every layer has a set of neurons; we are looking at the i-th neuron, node i, in the l-th layer of the neural network.

The activation is simply the output of that particular neuron. Remember, every neuron is individually like a perceptron, and we already said that for a perceptron, instead of a threshold function, we can have activation functions such as sigmoid, hyperbolic tangent, and so on; the output of that function is what we denote as the activation of node i, and in the l-th layer we call it a_i^(l).

We denote the weight connecting node i in layer l and node j in layer (l + 1) as W_ij^(l), and the entire weight matrix between layer l and layer (l + 1) as W^(l). For the three-layer network that we saw earlier, a compact vectorized form of a forward pass can be written as follows, and hopefully this will give you clarity on the entire procedure.

Given input x, you take the first layer's weights and biases and compute the pre-activation of the next layer, z^(2) = W^(1) x + b^(1). You apply an activation function f to get a^(2) = f(z^(2)), the activation of the second layer. We are going to assume one neuron in each layer just to keep the notation simple, but you can extrapolate this to the case where x, z and a are vectors, and so on.

Now you take that activation and the next layer's weights, which are given by W^(2) and b^(2), and compute z^(3) = W^(2) a^(2) + b^(2), the pre-activation of the third layer. You apply an activation function on that to get a^(3) = f(z^(3)), the output of the third layer, which is also the output of the neural network. Here the function f denotes an activation function such as sigmoid, tanh, identity, and so on.
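A minimal sketch (assuming the three-layer setup above, with made-up layer sizes, random placeholder weights and a sigmoid activation) of this vectorized forward pass could look like the following.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: 3 inputs, 4 hidden units, 1 output; weights are random placeholders
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.5, -1.0, 2.0])      # one input vector

z2 = W1 @ x + b1                    # pre-activation of layer 2
a2 = sigmoid(z2)                    # activation of layer 2
z3 = W2 @ a2 + b2                   # pre-activation of layer 3 (output layer)
a3 = sigmoid(z3)                    # network output h_theta(x)
print(a3)
```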

(Refer Slide Time: 5:52)

Let us now try to see what back-propagation actually does while training a neural network. Training can be divided into a forward pass and a backward pass. The forward pass is when you give an input to the neural network and propagate it through the weights of the various layers to get an output at the output layer.

Now, based on the output at that output layer, you would get a certain loss based on the loss function that you chose. Because it is training data, you know what the expected output should be and you know what the neural network is producing; the loss function of the output, which need not be just the difference between the two, is what we now have at the last layer. We now have to use that loss to compute the gradient for each of the weights in the neural network; that is the backward pass.

So, during the forward pass we simply compute each layer's outputs and move from left to right. During the backward pass we compute the loss at the last layer and move from the rightmost layer to the leftmost layer, starting from n_l and going all the way to the first layer, computing gradients as we go. Once we have all the gradients computed, we can use gradient descent to update the parameters: new parameter equals old parameter minus η, the learning rate or step size, times the gradient with respect to that parameter.

Now, let us also try to see how this specific propagation of gradients happens from the last
layer till the first layer.

(Refer Slide Time: 7:44)

The first thing is that we denote, for each node, an error term that says how much that node was responsible for the loss at the output layer. Let us first look at the error term at the i-th node in the output layer. Remember, even the output layer can have multiple output nodes; the output need not be a single value, you could be giving a vector as an output too.

For example, you may want to predict at what location an event may occur next. Let us say you want to use machine learning to predict where the next FIFA World Cup will happen; you may want to predict that as a longitude and a latitude, for instance. Those could be the two outputs you want to predict, so your output layer could have many values in that layer.

So δ_i^(n_l) is the delta, or error, for the i-th node in the last layer n_l, and the error term at the output layer is given by δ_i^(n_l) = −(y_i − a_i^(n_l)) · f'(z_i^(n_l)). How did we get that?

This is simply the gradient of the mean squared error: remember, we said we are taking mean squared error as the loss function, so we simply take its derivative.

In our case the loss for this output is ½(y_i − a_i^(n_l))², the squared difference between the label and the final layer's output after applying the activation function, with the half we introduced for convenience. When you take the gradient of this with respect to a_i, the two from the square and the half cancel and you are left with −(y_i − a_i^(n_l)), the minus coming from the inner derivative; that is the gradient of the mean squared error with respect to a_i. But remember that a_i itself is a function of z_i, the way we defined it: you apply an activation function f and then you get a_i, which means the chain rule gives you the additional factor f'(z_i^(n_l)), where a_i = f(z_i^(n_l)).

So, that is going to be the error term at the output layer. Now, we are going to claim that the
error term of the hidden layer is going to be given by this quantity and let us try to find how.

(Refer Slide Time: 10:34)

For the hidden layers we have to rely on the error terms from the subsequent layers. So, once again, let the error term at the output layer be given as above. Now the error term at a hidden layer is given by a weighted sum of the error terms in the next layer, which means δ_i^(l) = ( Σ_{j=1}^{n_{l+1}} W_ij^(l) δ_j^(l+1) ) · f'(z_i^(l)).

Here f'(z_i^(l)) is simply the derivative of the activation function at that particular node. Remember, this is why we said it is nice that the sigmoid function is continuous and differentiable: we need this derivative of the activation function to be able to compute the chain rule.

Now, you could ask why we multiply by the weights. Because that is the contribution this node had to the error at that particular node in the next layer: the i-th node contributed W_ij times to the error at the j-th node in the next layer, and that is what the weighted sum of errors denotes. As we just mentioned, f'(z_i^(l)) denotes the derivative of the activation function; for a linear neuron f(x) = x the derivative is 1, so that term simply would not matter for the rest of the computations.

But for a sigmoid neuron, f(x) = σ(x) = 1/(1 + e^(−x)), and the derivative turns out to be σ(x)(1 − σ(x)). I suppose you already know this; those of you who did Assignment 0 in this course should have seen it, but you can also work it out as homework.
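For completeness, a short derivation of this identity (standard calculus, not from the slides) is sketched below.

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
\quad\Rightarrow\quad
\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}}
           = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
           = \sigma(x)\,\bigl(1 - \sigma(x)\bigr).
```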

(Refer Slide Time: 12:47)

So, the way we are going to train the full network now is: we perform a feed-forward pass, computing the activations for all layers. For each output unit i in the last layer, denoted by n_l, we compute the error δ_i^(n_l), which is the derivative of the loss function with respect to the final activation times the derivative of the final activation with respect to z^(n_l), the value before applying the activation function. Then, for the previous layers, each δ^(l) can be written as (W^(l))ᵀ δ^(l+1), multiplied elementwise by f'(z^(l)) as before.

Remember, we have now written this as a matrix, and that is why the summation disappeared: this W^(l) matrix, transposed, times the δ^(l+1) vector gives you exactly the summation that we talked about on the earlier slide.

(Refer Slide Time: 13:48)

Having computed these deltas for the different layers (remember, delta is the contribution of every node to the error), we are now going to use them to find the partial derivatives of the loss with respect to every weight in the neural network. How we do it is very simple: the gradient of the loss with respect to any weight in layer l is simply given by taking δ^(l+1), the error term of the next layer, and multiplying it by the activation a^(l) of the l-th layer. Let us try to understand why this is correct. Consider a neural network; these are some layers of the network and these are nodes.

Let us assume that there are many more layers before this but, this is the last layer which is 𝑛𝑙

this is the penultimate one 𝑛𝑙 − 1, one before that 𝑛𝑙 − 2 Now, in the last node in the output

layer, remember let us draw that a little bit bigger, so this side is going to be this w is what we
want to compute it with respect to the gradient z and when you apply the activation on z it
becomes a.

So, the way z and a are related is 𝑓(𝑧) = 𝑎 where f is the activation function of your choice
and this a is finally used to compute your final loss function, remember we said it is
2
(𝑦 − 𝑎) where, a is the output and y is the expected output. So, to compute the gradient
with respect to w that is unnecessary information here, so to compute ∂𝐿/∂𝑊 for the moment
let us consider this weight that is what we are computing with respect to which we apply
chain rule, chain rule tells us this is ∂𝐿/∂𝑎 * ∂𝑎/∂𝑧 * ∂𝑧/∂𝑤.

Now, ∂𝐿/∂𝑎 is exactly this term here that is the derivative of the loss with respect to a. ∂𝑎/∂𝑧
'
is exactly this term here which is the gradient of the activation function 𝑓 . ∂𝑧/∂𝑤 is now
what we have to compute but, we know now that ∂𝑧/∂𝑤 is nothing but the activation in the
previous layer. So, let us probably write it this way the activation is right here the previous
layer z would be here and activation is applied and a is what would come here remember, for
(𝑙) (𝑙) (𝑙+1)
𝑊 we will go from 𝑎 to 𝑧 .

(𝑙) (𝑙)
So, if you want to compute the derivative with respect to 𝑊 , ∂𝑧/∂𝑊 will simply be 𝑎
(𝑙) (𝑙) (𝑙+1)
from the previous layer, why? Because, 𝑎 𝑊 will be say 𝑧 , we are simplifying it and
talking in terms of scalars but, you could now expand this to all the other nodes in that layer
and also write it as a vector but, that is the broad idea.

So, once again if you instead had a delta that was, let us assume now that we want to compute
𝑛𝑙−2
the gradient with respect to a previous layer let us call that 𝑊 , so one of these weights
once again is going to be the same. You just have to keep computing the chain rule over and
over again and you will find that the chain rule will give you delta at this node as the
contribution of that node of the error.

Now, you simply have to differentiate that with respect to w which will give you the
activation in the previous layer as the additional term that gives you your gradient, work it
out a bit carefully and you will follow that this is fairly straightforward. Now, this gives you
the gradient with respect to w, similarly if you noted the second equation here that has the
gradient with respect to the bias it is exactly the same thing where the activation is 1
remember because the activation that goes into the bias in any layer is going to be 1

Remember once again that a bias can be there for every layer, not just the first layer, but the value feeding into it will always come out to be 1.

So, to summarize what we have done: we computed those deltas, the errors describing how each node contributes to the output loss, and simply used them to make our chain rule computation simpler. We could instead have not written out delta and worked out the chain rule directly, and you would have got exactly the same value; delta gives us a placeholder to hold how a particular node in any layer contributed to the error, and now it is very easy to compute the gradient for a previous layer's weight using that error.

(Refer Slide Time: 19:20)

That gives us our final algorithm for training a neural network; let us go over it. We start by setting ∆W = 0 and ∆b = 0. Just a minor clarification here to differentiate ∆ and ∇: ∆ denotes the change in a value, while ∇ denotes the gradient of a function with respect to a value, so please keep this in mind in the notation. So ∆W and ∆b are the changes that we are going to make to W and b in a given iteration of gradient descent.

Let us initialize them to 0. Then, for each data point in your training data set, you forward propagate that data point, compute the loss, and use back propagation to compute the gradient with respect to every weight in the neural network; that gives you ∇_θ L.

Now your ∆θ, composed of ∆W and ∆b, is increased by that gradient, and you keep doing this in a loop for every data point in your data set. That gives you the total ∆θ, that is, ∆W and ∆b: the total change that you want to make to your W and b.

Now, to update the parameters using gradient descent, you are going to set W^(l) = W^(l) − η · (1/M) ∆W^(l). Dividing by M is simply to average the gradient across all of your data points; remember, here we were summing up the gradients over all of the data points.

Now, we simply divide by M to average the gradient across all of your data points; that tells you how your current weights impacted the loss across all of your training data points, and that is the gradient you use to update W. You then get W^(l) at the next time step and, similarly, b at the next time step, and you keep repeating this until convergence. Remember, convergence is when the gradient of the loss with respect to W and b becomes 0. In practice this may be difficult; how do we handle it in practice? We will talk about it in a later lecture.
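Putting the forward pass, the deltas and the update together, here is a minimal sketch of one such training iteration, assuming a single hidden layer, sigmoid activations and the ½-squared-error loss; the helper names, sizes and learning rate η are illustrative choices, not the exact code from the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_epoch(W1, b1, W2, b2, data, eta=0.5):
    """One gradient descent step over the whole dataset for a 1-hidden-layer
    network with sigmoid activations and 1/2 * squared-error loss."""
    dW1, db1 = np.zeros_like(W1), np.zeros_like(b1)
    dW2, db2 = np.zeros_like(W2), np.zeros_like(b2)
    for x, y in data:
        # forward pass
        z2 = W1 @ x + b1; a2 = sigmoid(z2)
        z3 = W2 @ a2 + b2; a3 = sigmoid(z3)
        # backward pass: deltas as defined in the lecture
        delta3 = -(y - a3) * a3 * (1 - a3)           # output layer error
        delta2 = (W2.T @ delta3) * a2 * (1 - a2)     # hidden layer error
        # accumulate gradients: dL/dW is delta of next layer times activation of this layer
        dW2 += np.outer(delta3, a2); db2 += delta3
        dW1 += np.outer(delta2, x);  db1 += delta2
    m = len(data)
    # averaged gradient descent update
    W1 -= eta * dW1 / m; b1 -= eta * db1 / m
    W2 -= eta * dW2 / m; b2 -= eta * db2 / m
    return W1, b1, W2, b2
```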

(Refer Slide Time: 21:52)

For further reading, please read chapters 4 and 6 of the Deep Learning book; not all sections in these two chapters may be relevant, so I would advise you to read what is relevant to the sections we have covered so far. Also go through the Stanford UFLDL tutorial and the CS231n notes; the links are right on this slide if you are interested.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology Hyderabad
Gradient Descent and Variants – Part 01

We have so far seen an Introduction to Neural Networks and how one can train a feed-forward
neural network or a multi-layer perceptron using back propagation and gradient descent. We will
now move on to understanding the challenges of training such a neural network using gradient
descent and propose variants and adaptations that could be useful in improving the effectiveness
of training neural networks using gradient descent.

(Refer Slide Time: 0:55)

We will start with a brief review of gradient descent. As we already said, gradient descent is an optimization algorithm that is used to find the minima of any differentiable function. Remember again that the loss function we use to train a neural network, mean squared error so far, has to be differentiable, and the same holds even if we use any other loss function.

Please do not worry if you are not aware of what other loss functions to use, we will see plenty
of them as we go through the rest of this course. And when we use gradient descent at each step
parameters are pushed in the negative direction of the gradient of a cost function or error
function or loss function whatever we choose to call it.

And the parameter update rule is given by θ_new = θ_old − α Δθ_old, where α is the learning rate or step size and Δθ_old is the change in the parameters. We saw this visualization in the last lecture as well: if you have a simple convex function and you consider a point to the left of the minimum, see this blue point here, the gradient there is negative, so when you go in the negative direction of the gradient you move towards the positive side, the right side, which takes you towards the minimum.

If you had started at this point here on the right side, the gradient there is positive, so the negative gradient goes in the opposite direction, which will once again be towards your minimum. This is a visual illustration of how gradient descent can be used to minimize any objective function. Keep in mind that this example of a function, the black curve here, is a convex function.

So, you are going to have an exercise at the end of this lecture to know what convex functions
are and we will discuss them a little bit later if you are not aware. But for the moment let us go
ahead and try to understand how we can improve gradient descent.

(Refer Slide Time: 3:20)

Here is the gradient descent algorithm. You are given a learning rate α, some initial parameters θ^t and a training data set D_tr. As long as the stopping criterion is not met, you initialize your parameter updates, and for each point in your training dataset you compute the gradient of the loss function with respect to each of your parameters using back propagation, and you aggregate your gradients in your Δθ variable.

Once again, just to remind you, Δθ denotes the change in variables and ∇L denotes the gradient; please note the difference in notation since they can look similar at first glance. Then finally you apply your update θ^(t+1) = θ^t − α (1/|D_tr|) Δθ^t.

How long do you train this? Until your gradient becomes zero. Each of these for loops here is
typically called one epoch, one epoch corresponds to one full iteration over your training dataset,
so we compute the gradients across an epoch, average all your gradients and then update your
parameters, that is the simple summary of gradient descent.
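A minimal sketch of this epoch-style loop is shown below, under the assumption of a generic `grad_loss(theta, x, y)` helper supplied by back propagation; the fixed number of epochs is just a stand-in for the stopping criterion.

```python
import numpy as np

def gradient_descent(theta, data, grad_loss, alpha=0.01, epochs=100):
    """theta: dict of parameter arrays; data: list of (x, y) pairs;
    grad_loss: assumed helper returning dL/dtheta for one sample."""
    for epoch in range(epochs):                  # stand-in stopping criterion
        delta = {k: np.zeros_like(v) for k, v in theta.items()}
        for x, y in data:                        # one pass over D_tr = one epoch
            g = grad_loss(theta, x, y)           # backprop gives per-sample gradients
            for k in theta:
                delta[k] += g[k]                 # aggregate gradients
        for k in theta:                          # averaged update: theta <- theta - alpha/|D| * delta
            theta[k] -= alpha * delta[k] / len(data)
    return theta
```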

(Refer Slide Time: 5:01)

Now, let us move on to understanding what is known as the error surface of neural networks.
The error surface of neural networks is simply a plot of all your parameters of the neural network
versus the corresponding cost value or the loss function value that you would have. From now on
when we say weights we are going to assume that it also encompasses biases in them, so just
please assume it that way.

So, if you see this undulated surface in this slide, each point on it corresponds to one particular weight configuration, which means one value assigned to each weight, and the corresponding loss that was incurred when you propagate all your training data points through that weight configuration. Remember, for every training data point you forward propagate it, you get an output, and you get a loss.

For the next training data point, you forward propagate it and get an output and a loss; you then average all of these losses, and that is going to be your overall loss for that weight configuration before you update further. This error surface can be very complex: modern neural networks used in practice, especially in computer vision, have millions of parameters; a very popular network known as AlexNet, which again we will see later, has close to 60 million parameters.

This one is ResNet which again has millions of parameters, which means although we see it in
three dimensional space here, this kind of an error surface actually exists in a 60 million
dimensional space. So, you can imagine the complexity that is going to be there on that particular
surface.

And the goal when you train a neural network is to start somewhere on this surface, which means you could be starting at any one particular point (remember, that is the random initialization of the weights), and your gradient descent has to take you to the minimum. Clearly, you can see here that this is a very non-convex surface; it is not a single bowl-like shape, it is a non-convex surface with lots of undulations, so it is not a trivial task to traverse it.

(Refer Slide Time: 7:32)

One of the major problems when you have a non-convex objective function is that you are going to have many local minima, which means the solution you finally converge to after applying gradient descent is going to depend on where you start: where you start decides which local minimum you converge to. Why so?

Because when you hit one of these local minima, your gradient is going to be zero and your gradient descent algorithm will terminate, which means it is going to be extremely important to know where to start, and unfortunately when we start training a neural network we do not understand the complexity of the error surface involved.

While we saw one such visualization in the previous slide, it is not possible to understand where your initialization could be or where the actual local minima could be, and so on, in a very high dimensional space. While there have been efforts to visualize these error surfaces, they are not good enough for us to use to guide training directly, which means training to find a suitable local minimum is still an important challenge here.

(Refer Slide Time: 8:55)

Similarly, you could also have what are known as saddle points. Saddle points, as you know from high school calculus, are again critical points, which means the gradient is 0, but they are minima along one set of dimensions and maxima along another set of dimensions. For us, dimensions in this context are weights; we are talking about the dimensions of that error surface, and remember the error surface is a plot between weights and the cost function.

So here is an example of a simple saddle function, or a saddle point: this is a point which is a minimum along this dimension and a maximum along the other dimension. Why does this matter to us? Once again, because the gradient is zero at a saddle point: if your gradient descent converges to a saddle point, you are going to end your training right there, and a saddle point may not be the ideal solution, because it is a maximum along certain dimensions and you may want to reach a better minimum which can give you a better neural network solution to use in practice.

Another important observation which was made in this particular work called identifying and
attacking the saddle point problem in high dimensional non-convex optimization was that as you
go to higher and higher dimensional spaces which means as your network becomes more and
more complex, more and more weights then your local minima which would be more prevalent
in low dimensional spaces keep getting replaced by saddle points as the dimensionality increases.

Why so? A simple intuition for that is if your dimension of your error surface, so the dimension
of your weights is going to be say 1 million, let us say you found a critical point maybe that
critical point is a minimum for nine hundred and ninety-nine thousand nine hundred and ninety
nine dimensions or weights, but there could be still that one dimension where it ends up
becoming a maximum.

And when you have a very large dimensional space it is more probable that there are some
points, there are some dimensions where that particular weight configuration could be a
maximum, that is a simple intuition for why saddle points proliferate as you go to higher and
higher dimensions.

(Refer Slide Time: 11:31)

560
Let us see a simple gradient descent traversal example to understand an issue before we go there
just to explain what is happening here, this is a plot of the error surface, as you can see, red is a
high value, blue is a low value, so what this plot is showing you is it is taking one value of w,
one value of b, it is a very simple neural network and this z-axis here plots the error, so you can
see here that there are certain points where the error is high and certain points where the error is
low and it is the surface looks like a carpet.

So, now let us, just before we, so we understand this which is the local minima here, where
would you want your neural network to converge? At the deepest blue color because that is
where your error is leased. So, let us now initialize from a random point on this error surface and
see what gradient descent does, watch carefully.

You see now, on the top surface you see red plots, on the bottom surface you see black spots
going, watch now that the updates are slow and then after some time the updates pick up speed
and finally you converge to that blue minima that you were looking for. Did you notice
something interesting?

Yes, there were some points in the traversal of gradient descent when the traversal became slow
and there were certain points when traversal became fast. Let us keep this in mind when we go
forward with the discussion.

(Refer Slide Time: 13:22)

561
Let us look at the same error surface again, but let us initialize at a different point and now see
what happens, we are going to initialize somewhere in the top left here, that is what you want to
be looking for when you see this visualization. You see these red lines, the black lines, gradient
descent is traversing initially a bit slow, it travels a bit fast, again a bit slow, again a bit slow and
then finally converges. So, you can see that there are different points at which the speed of
gradient descent can keep changing.

(Refer Slide Time: 14:07)

Let us see this now from a slightly different perspective we are going to see this from the
viewpoint of what is known as a contour plot. A contour plot is simply a two dimensional view
of any surface, so if you take any surface let us imagine the Himalayas in front of your head, let
us imagine a mountain range in front of your head, imagine that you are slicing the mountain
range with a huge knife and now you are going to view from the top all of these slices laid down
on a single table and that view is what you see as a contour plot.

What do you mean by a contour plot? So, if you take any of these contours what you see on the
screen here, if you take one of these lines like let us follow this particular line here. All it means
is that for all the points on that line the error is the same, these are also sometimes called iso
contours, which means all the error values at those points are the same. Why? Because you took
a cross section, a cross section simply means that the error value was the same at that particular
point.

562
Now, let us see the same example that we saw on the earlier slide as a contour point. If you see
the contour plot now you would notice that parameter updates which are given here, you can see
the parameter update values are smaller at points where gradient of error surface is small.
Remember we said that these are points which have a lesser gradient the blue areas, and the red
areas have a higher error value. And whenever the gradient of the error surface is small the
parameter updates are small and whenever the gradient is large these are points where the
gradient is large the parameter updates are also large on those particular points.

(Refer Slide Time: 16:15)

That brings us to this another facet of understanding how to train neural networks in terms of
plateaus and flat regions. Plateaus and flat regions constitute portions of the error surface where
the gradient is highly non-spherical. What does this mean? When we say non-spherical it means
that each dimension of the gradient is of different, different values.

Remember that the entire gradient is a vector; it is the derivative of the loss function with respect
to every weight in your neural network, you unroll all of your weights in the neural network
across all your layers into a single vector. Now, you take the loss with respect to each of those
weights and put them in the same vector, you would use chain rule and back propagation to
compute those gradients but once you compute them you input all of them in a single vector.

563
Now, if all of those gradients have similar values at a particular weight configuration you would
then say that that area of the error surface is spherical because it all has equivalent values in all
dimensions. But when you go to a plateau or a flat region you are going to have very largely
elliptical shapes where in one dimension things move very slowly, whereas in another dimension
things could move rapidly, that is what we mean by non-spherical in this particular context.
Non-spherical is elongated elliptical, spherical as a sphere and elongated elliptical.

So, gradient descent spends a long time traversing in these kinds of regions as the updates could
be very, very small. Remember when you have plateaus and flat regions, it is like going through
an area where the gradient very gradually tapers and when the gradient gradually tapers the
gradient values are small, the parameter updates are going to be small as straightforward is that.

So, one question we could ask is when you are going through a plateau, cannot you just walk
faster, cannot you just expedite this process? We can but there are some tradeoffs let us see what
we can do there. One simple option is to take longer steps, you know your gradient just take
longer steps. What does it mean to take longer steps in gradient descent?

Increase the learning rate or the step size alpha that we talked about. This is a good idea in
general when you go through a plateau, but if you simply just increase the learning rate alpha to
a very high value for your entire gradient descent traversable that may not be of use, let us see
why.

Here is an example that you see on the right. So, you see a certain error surface, this is a very
simple error surface, where on the x-axis is a weight, on the y-axis is an error, a very simple
one-dimensional, one parameter situation. So, you have these updates that go along the direction
of the negative gradient that happens and at a particular point the gradient changes rapidly and
you go to the next point.

At a certain point on your error surface you see here that the curve started jumping around. Why
is that so? Because this happens when your learning rate is very, very high, let us try to
understand this carefully. So, if you are at a particular point here where the mouse cursor is
currently showing you.

564
Let us say the gradient is along a particular direction and if you choose to take a long step, your
alpha was high on your x which is where you are making the parameter updates you may jump
by a large amount and go somewhere here. And when you go there the corresponding error there
is this particular value.

So, while you wanted to go towards this minimum, you took a large step and went to this value
along the w axis and at that value along the w axis the error turns out to be high. So, the learning
rate is too high you actually go to a point of higher error and once again from there if learning
rate is high an even higher error and you diverge out of that local minimum rather than converge
to that local minimum, not maybe this is the right local minimum you would have wanted to
converge to, in that case keeping the learning rate very high is really not going to help you.

(Refer Slide Time: 21:02)

565
That leads us to a method known as momentum based gradient descent. Momentum based
gradient descent relies on the intuition that with increasing confidence increase the step size and
with decreasing confidence decrease the step size. Simple intuition is, why not we make use of
the direction that we have been traversing on so far to make our next update? Why should we
rely only on the gradient at the current time step?

A good way to imagine all of this is as a blind person traversing Himalayas. So, Himalayas is a
mountain range and you have to ideally, a helicopter takes you and drops you at a particular
point on the Himalayas and your job is to navigate and reach the bottom of the Himalayas where
the error value is the least.

Why did I say a blind person? Because if you had vision you could see far and directly go to
where the minimum is, unfortunately gradient descent does not have that privilege. What does
gradient descent do? Has a stick in its hand, keeps tapping everywhere around it and sees where
the negative slope is the highest and goes in that direction at one step. This is the analogy that we
are following now to traverse the error surface of the neural network.

Why is this important, why is this analogy important? Because now, if I were traversing the error
surface of this neural network I would simply ask the question, let us say I was at a particular
point and I have been coming down in a particular direction all the while and at one point the
gradient at that point as I tuck my stick around tells me to go in a very different direction. The

566
question I would ask is should I completely change direction now, I have been coming along one
way so far, should I completely change direction now?

Maybe the answer is let us not do that, let us keep in mind that we were coming along a certain
direction, use that to some extent and the new gradient direction to some extent and combine
them in some way. This is exactly what momentum based gradient descent does. So,
mathematically speaking, how do you implement it?

You say, you define something known as a velocity vector v t , velocity is simply change of
distance over time, so it is simply your update in parameters over one iteration, it is simply a
name in this particular context. v t which is the change of parameters that you are going to have
in this time step is given by γ v t−1 . What was your change in weights in the previous time step
into weighted by a particular coefficient γ plus the rest of it is standard gradient descent, alpha
learning rate times the gradient of the loss with respect to the parameter θ .

What is this telling us? It is telling us that if you were coming along a particular direction and
your gradient tells you to go to a different direction, combine them in some way and then get a
velocity vector v t which is what you use to update your parameters. Then you have
θt+1 = θt − v t , which tells you the weight update.

So, keep in mind here that when v t−1 and the gradient are in the same direction, this will only
give us more momentum in the same direction and that explains the reason for this name of the
term if your v t -1 and l are in the same direction you are going to use your past momentum to
walk even faster in that direction. And this should give you an idea of why we are discussing this
in the context of plateaus and flat surfaces. If we were going along the same direction for a while
let that give us momentum to take longer steps in that direction and go faster to the minimum.

Now, you can probably see how momentum can avoid divergence in the previous figure, let us
go back to the previous figure here, so can momentum avoid divergence here let us see. In this
particular case you would have got the gradient to be something like this, the previous gradient
was something like that and the combination of the gradients would have taken you in a direction
which was directly close to the minimum.

567
And one important thing to keep in mind here is both of these while learning rate is typically
chosen based on the designs what the user wants to give a value for, γ is typically a value
between 0 and 1. Why is that so, why cannot γ be greater than 1? There is an important reason
here, firstly you do not want that to out shadow the gradient that is a simple reason, but the more
important reason is remember v t−1 has a component of v t−2 in it, because momentum was used
in computing v t−1 in the previous time step.

So, now this would become γ so the component of v t−2 that influences v t will be γ 2 into v t−2
and by ensuring that γ lies between 0 and 1, γ 2 will be smaller than γ which means how much
v t−2 contributes to v t will be lesser than how much v t−1 contributes to v t , that is one reason to
choose γ between 0 and 1.

And that should again tell you how combining these should now give you a smaller time step,
you are not giving, your alpha is not a very high learning rate but it now keeps you within check.
So, now you can also see that you are not going to randomly diverge. So, if you are at this point
and the gradient told you to go in this direction but your previous gradient had told you to go to a
particular direction, you now combine these two and you may actually come back to the minima
from the next data point, from the next weight configuration, this could help you avoid
divergence even in these scenarios.

(Refer Slide Time: 27:41)

568
So, here is a contour plot visualization, a very simple contour plot visualization for momentum.
So, once again these are contour plots, So you could imagine this to be an elliptical mountain
cross section and you are seeing it from the top. And you can see here that when you do not use
momentum, you start from a particular point on this error surface, this is the point on the other
surface that you start.

When you see things as contour plots the gradient would always be normal to the contour plots.
So, when you see it as the entire surface you would be going up but when you see it as a contour
plot the gradient is going to be normal to the contour plot. So, at a particular point your gradient
will take you this way and then you go there and the gradient takes you the other way, you go
this then the gradient will take you again the other way and so on and so forth.

You still are going towards your minimum which is in the center of this contour plot, but you are
going to be oscillating to reach that minimum that you are going for. Keep in mind here that if
this error surface or this contour plot was spherical, currently we are looking at it as elliptical
which means certain dimensions gradient is higher, certain dimensions gradient is lower.

But if this error surface was spherical then your gradient would be normal and it would straight
point to the center or the minimum and you would probably reach in one step. So, imagine an
spherical error surface you can try to draw it and see that the normal at any contour point will
point straight to the center and that will take you provided you choose suitable step size you will
a learning rate you would straight go to the center.

But you will see here that you probably may not find the need for momentum when your error
surface is spherical, momentum is more useful when your error surface is non-spherical which is
once again more likely to happen in a complex error surface. When you have momentum you see
that the oscillation seems to be reducing, let us try to understand why.

You first go for one step to the next step, the first gradient. When the next gradient comes you
will combine it with some portion of the previous gradient so that takes you here, now let us see
what happens at this particular point your gradient is going to point you in a particular direction
but your previous gradient was pointing you in this direction, you take a sum of these two
gradients and that gradient will take you in this direction.

569
So, the sum of those two gradients will take you in this direction and this way your oscillations
reduce, you still oscillate but your number of oscillations reduce and you can quickly get to your
minimum. More simply speaking momentum dams the step sizes along directions of high
curvature where you get an effective larger learning rate on the directions of low curvature,
rather when your error surface is very steep go slow, when your error surface is flat go fast.

Larger the γ more the previous gradients affect the current step. A general practice is to start γ
with 0.5 until your initial learning stabilizes. Why? Because you may not be able to trust your
gradients, initially your gradients could be going in different directions, it is remember again the
blind man on Himalayas, if you keep traversing that for some time then your gradient probably,
you understand what direction is the right direction to go, it stabilizes after some time then you
can increase your momentum parameter γ to 0.9 or 0.95, wait for your train to stabilize a bit and
then increase momentum because you can then trust the previous directions you are walking on
or traversing by. But generally in practice people often just set it to 0.9 or 0.95 from the very
beginning itself.

570
(Refer Slide Time: 31:49)

Here is the algorithmic version of a momentum based gradient descent, you have a learning rate
alpha, a momentum parameter γ , initial parameters, training data set, pretty much the same
algorithm as gradient descent, the only change now is lines 8 and 9 where you compute your
velocity which was γ times v t−1 plus alpha times Δθt , which is your update based on your
gradient and then you complete your parameters.

One small change here is we remove the 1 by t training cardinality of the retraining set here just
for convenience, you can assume that when we aggregate the weight updates you are subsuming
that there and taking the average across the training data points. So, for convenience you are
going to avoid that.

571
(Refer Slide Time: 32:42)

Here is a visualization of convergence of momentum, this is gradient descent once again this is a
contour plot you can see here that red values are high values, blue values are low values, so you
know that you want to converge to somewhere in this bottom right region where error values are
low.

So, this is your gradient descent curve until a certain point, a certain number of iterations. Now,
let us try to see how momentum does on the same surface. You can see the red curve now, you
can see the steps that it is taking are long steps, it seems to overshoot the minima, take a u-turn,
come back and seems to get close to the minimum.

So, what is happening this is what we saw on the previous slide that when the error surface is
elliptical the gradient makes you oscillate a little bit around the minimum before converging to
the minimum itself. So, clearly the momentum performance was better than GD, in fact we will
mention later in terms of what step and what error was there in this particular example. But it
seems to make a sense that momentum this seems to go faster than GD in this scenario, but is it
always good is a question that we would like to ask.

572
(Refer Slide Time: 34:05)

To see that let us take this particular example, here is the error surface, you can see here that this
is something like a carpet with a hollow somewhere in between. So, you see here that blue
corresponds to the minimum, red is again a high value, this is your error surface and the
corresponding contour plot is shown on the right, so this is the contour plot you are seeing this
from top, where you want to get to this center point here which is the minimum of that error
surface.

Now, you can see gradient descent traversals for a few iterations here, in the same number of
iterations the gradient descent traveled this distance you can see the mouse cursor, let us try to
see what momentum does. You can see the red line again, it seems to be shooting forward and it
reaches the region of the minima, then keeps oscillating and finally gets to the minimum. Let us
try to summarize that clearly.

573
(Refer Slide Time: 35:07)

Momentum based gradient descent oscillates around your minimum before eventually reaching
it, even then it converges much faster than vanilla GD. So, just as I said after 100 iterations
which was the number of iterations that was shown for gradient descent of the same figure,
momentum based gradient descent has an error of 10 power minus 5, whereas vanilla gradient
descent is still at an error of 0.36. Nonetheless, we see that momentum seems to be wasting some
time in oscillating around that area, let us ask ourselves if we can reduce that oscillation time in
some way.

(Refer Slide Time: 35:50)

574
That leads us to another method known as Nesterov Accelerated momentum. This is based on
Yurii Nesterov’s work, Yurii Nesterov is a huge researcher who has created a significant impact
in the field of optimization. This work called accelerated gradient descent of his was published in
1983 and was recently reintroduced in the Machine Learning context in 2013.

And the key idea of Nesterov accelerated momentum is to develop the idea of momentum to
include one key thought which is look before Yurii rather I am going in a particular direction on
the Himalayas again, traversing the Himalayas trying to find out that minimum point of the
valley.

And I do get some sense that my current gradient is telling me to go in a different direction. Can
I go one step further along the way I was coming, see the gradient there and then decide rather
than use the gradient at the current time step? So, you want to compute a look ahead gradient or
and then use that knowledge in taking your current step. Let us see what that means when we
write out the equations.

So, we are saying now that this equation is very similar to what you saw for the momentum
equation v t = γ v t−1 + α∇θ˜ L(θt − γ v t−1 ; x(i) , y (i) ) , what is the difference here? If you observe
t

carefully the main difference between this equation and the equation that we had a couple of
slides back for momentum is this term here, this term for momentum was simply θt but this
term for us now is θt − γ v t−1 , what does this mean?

575
This means that we take θt which is the current parameters, v t−1 was my previous parameter
update, I will have one more update of that with γ with my momentum parameter, go to that
step, compute my loss on that weight configuration and then use that gradient to make my move
in this step. This idea is empirically found to give good performance.

576
(Refer Slide Time: 38:23)

Here is a visualization of how momentum, Nesterov momentum differ. So, you can see here in
the momentum update that momentum based gradient descent is a combination of two vectors,
the momentum step which is the direction in which you have been traversing and the gradient
step which is what the current gradient is telling you to do, and your actual step is an weighted
addition of these two vectors which is perhaps going to tell you the step that you have to take
now.

In Nesterov momentum update you first follow your momentum step and then at that step you
find what is the gradient and use that gradient to be able to take your actual step, so you do a
momentum plus the look ahead gradient which will probably give you this vector. So, there is a
slight difference in where you would reach after one step using momentum and Nesterov
accelerated momentum.

577
(Refer Slide Time: 39:20)

Here is the summarized algorithm for Nesterov accelerated momentum, it is once again the same
as the momentum based gradient descent, but for step 4 here, step 4 gets your look ahead
parameters θ t minus γ v t−1 , you are going to denote that as tilde θ t and your loss is computed
with respect to tilde θ t and that is the gradient that you keep accumulating in your parameter
updates and the rest of it follows your momentum based gradient descent algorithm.

(Refer Slide Time: 40:02)

578
Here is the illustration of the same example that we saw a few slides ago. This was the
momentum traversal that we saw a few slides ago. Let us see how Nesterov momentum performs
in the same setting. You see a blue curve starting out there at the same initialization, it seems to
form a momentum and here is where it differs, it does not oscillate as much as what momentum
does.

The reason being, when you do momentum remember you are oscillating so you have your
gradient which takes you to the other bank of that minimum, that gradient tells you to go back to
the other bank of that minimum. So, now because your Nesterov momentum is looking ahead it
is going to also see where you would go in the next step and use that right away this allows you
to minimize these jumping around the minima before you actually converge.

579
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology Hyderabad
Gradient Descent and Variants – Part 02
Let us now try to analyze, we have seen gradient descent, we have now seen at least one or two
methods, momentum based gradient descent and Nesterov momentum based gradient descent,
which can help you improve gradient descent training, what are the pros and cons of gradient
descent you have seen so far?

(Refer Slide Time: 0:36)

For every parameter update gradient descent passes the entire dataset, we already saw that it
computes the gradients, it keeps aggregating the gradients and then takes the average, we called
it an epoch. So, gradient descent passes the entire dataset and it is hence called batch gradient
descent.

And the advantages of batch gradient descent is, theoretically speaking from an optimization
standpoint, the conditions of convergence are well understood and many acceleration techniques,
there are improvements from an optimization perspective, such as conjugate gradient operate
very well in the batch gradient descent setting, once again it is called batch gradient descent.
Because you are waiting for the entire batch of data points to get completed, take all of these
gradients, average them and then make an update.

580
So, the disadvantage of batch gradient descent is, it can be computationally slow for the same
reason. Especially if your training dataset had say 10000 data points or 1 million data points, you
are going to wait for passing through all of them before making an update. ImageNet which is a
computer vision challenge, it is a very commonly used dataset in vision, has today about 14
million samples and one iteration over all of those 14 million samples can be very slow to wait
for one parameter update.

(Refer Slide Time: 2:10)

That brings us to a very popular approach known as stochastic gradient descent. Stochastic
gradient descent or SGD stands for, you randomly shuffle your training dataset and update your
parameters after gradients are computed for each training example. So, you randomly shuffle and
pick one data point, forward propagate it, compute the error, compute the gradient, update all
your parameters right away before propagating your next data point. So, here is your algorithmic
difference with gradient descent.

The third, the third step here, sorry, this step 5 here on this algorithm was outside the for loop in
gradient descent, you did all, you went through every data point accumulated all your gradients
and then made the parameter update, whereas in stochastic gradient descent that step of updating
your parameters goes inside the for loop, which means for every data point you are going to
make a parameter update, this is going to be faster, let us see a few pros and cons for stochastic
gradient descent too.

581
(Refer Slide Time: 3:26)

But before we go there we are going to talk about the setting which lies in between, which is
known as mini batch stochastic gradient descent. What is mini batch stochastic gradient descent?
Mini batch stochastic gradient descent as the name says is you take a mini batch of training
examples, do not just take one, do not take all, take 20 examples from your training dataset,
forward propagate all of them, compute all the outputs, compute all the errors, compute all the
gradients and then taking average of those 20 gradients alone and update your parameters.

Why do we need this? It is possible that if you keep updating your gradient with every sample, it
may give you random directions, one data point will tell you go like this, this is where Himalayas
lowest point is there, another data point will tell you another step, you will just be taking
different zigzag steps as you traverse through the surface if you listen to every data points
gradient.

Averaging gives you a better sense of a gradient direction that is consistent across a few more
data points. This mini batch version of stochastic gradient descent is the default option for
training neural networks today, often today when people say they use SGD they actually mean
mini batch SGD.

Remember mini batch SGD with batch size to be 1 is your SGD itself. But typically people use
larger batch sizes 20, 100, so on and so forth to compute these gradients. So, algorithmically

582
speaking the applying update once again went out of the loop but this time you are not going to
compute, you are not going to aggregate gradients for the entire training dataset, but only for one
mini batch which you denote as B-mini and you are only going to aggregate the gradients and
average the gradient only for that mini batch in the algorithm.

(Refer Slide Time: 5:32)

Let us try to see an illustration of stochastic gradient descent, this is once again a contour plot of
an error surface, let us try to see how stochastic gradient descent works. You can see it
progressing towards the minimum, remember the minimum is the blue surface, as it goes closer
to the minimum, it seems to oscillate. Do the oscillations remind you of something?

583
(Refer Slide Time: 6:08)

Momentum may be, momentum also had similar oscillations when it reached the minimum.

(Refer Slide Time: 6:16)

Here is an illustration of mini batch SGD, once again the same error surface, let us see if
batching data points and computing the gradient helped us a little bit, you do see here the
oscillations have mildly reduced for many batch SGD when compared to SGD but the batch size
here was just 2, you could probably see lesser oscillations when you keep increasing the batch

584
size, because then you are not going to vacillate based on every data points gradient, but you are
going to batch them across a larger set of data points before you update your parameters.

(Refer Slide Time: 7:04)

What are the pros and cons of stochastic gradient descent when you compare it to batch gradient
descent, let us try to understand this? The advantages of stochastic gradient descent far
outnumber its disadvantages, it is usually much faster than batch learning when you train neural
networks. Why is that so? It is simply because there is a lot of redundancy in batch learning.

Because if you have one million data points in your dataset, while training your neural network
model, then many of these data points may give you similar information, you probably do not
need to wait for all of them to give you the gradient and then average, even if you take 100 of
them maybe you are going to get some sense of where the gradient is for that particular iteration
and move forward in that direction.

So, you are exploiting the redundancy in data to go much faster than standard batch gradient
descent, it also often results in better solutions, this may seem a little counterintuitive because
you would think that by using the entire batch it could be computationally slow but that is the
ideal way to do it, then why does stochastic gradient descent yield better solutions?

The reason is stochastic gradient descent is a noisy version of gradient descent. What do we
mean by that? We mean that you are only taking 20 data points, let us assume the mini batch size

585
was 20, you are taking just 20 data points getting the gradient and you keep moving along those
directions.

So, it is like asking once again if you are traversing Himalayas, if you ask a thousand people
assuming that you had infinite computation and infinite access to people, if you ask a thousand
people and they avoid going in a particular direction you probably can rely on that direction. If
you asked only two people and asked them what direction to go in, maybe they may get it wrong.

Which means that when you ask a smaller set of people you may oscillate, you may take one
direction then somebody else tells you go in another direction, you keep moving around and
fluctuating a little bit but you eventually get there because between across all of these people
they may point you in the direction, in expectation you would get there. But there may be
fluctuations along the way.

But this noisy updates of stochastic gradient descent is what would actually help you in escaping
local minima, bad local minima or saddle points because of these noisy updates you may avoid
saddle points and probably get to better local minimum over time, that is what researchers have
been able to understand from how SGD has performed in deep learning.

Stochastic gradient descent is also useful for tracking changes. What do we mean? If you have a
new set of data points that come in, it is very easy to update with just 20 data points rather than
have to run an entire epoch with 1 million plus 20 data points, so remember in batch gradient
descent you have to recompute the gradient for all the data points before you make your next
move, but in stochastic gradient descent if new data points come that is what we mean by
tracking changes here, you can easily update the system and keep moving.

What about disadvantages? Disadvantages of stochastic gradient descent is that the noise in
stochastic gradient weight updates can lead to no convergence. To this day sometimes remember
as we said that because of the complexity of the error surface where you initialize your weights
is very important as to where which local minima you converge to, so often today in practice if in
the first few iterations you find too much fluctuations in your error people often simply change
the initialization and try again.

586
Which means noise in SGD weight updates can lead to no convergence but the solution is fairly
simple if in the first few iterations things are not going well for you just stop, go back and start at
a different location and start all over again, it is fairly a simple solution. This can also be
controlled using learning rate, if the gradient you know is noisy do not use a high learning rate,
so it can also be controlled using a learning rate but then identifying a proper learning rate can be
a problem of its form and that is the problem we are going to talk about next.

(Refer Slide Time: 11:54)

How do you choose a learning rate? We said that momentum was one way to artificially inject a
learning rate into the system because the momentum in the previous step and the current gradient
aligned it is almost like you had a high learning rate in those directions. In general, though if the
learning rate is too small it can take a long time to converge, if the learning rate is too large the
gradients can explore.

Now, how do you choose a learning rate in practice, assuming we do not use momentum, how do
you choose a learning rate in practice? One simple option is naive linear search, you just keep
your learning rate to be constant throughout the process, just fix it, just use the same learning rate
across all your iterations until you reach convergence.

Or you could use annealing based methods, annealing based methods are methods where you
gradually reduce the learning rate over time using some formula that you come up with, let us

587
see a couple of them now. One of them is known as step decay, where you reduce the learning
rate after every n iterations or if certain degenerative conditions are met.

For example, if the current error is more than the previous error which should not happen and
you do gradient descent, you expect that the current error should always be lesser than the
previous error but when you do stochastic gradient descent that may not be guaranteed because
you are using only 20 data points to design your gradient.

So, what you can do is keep reducing your learning rate after every n iterations. So, that is one
simple approach. The idea here is that as you keep training over iterations in epochs you are
probably getting closer to your minimum and when you get very close to your minima you want
to take small steps so that you reach your minima and do not over shoot it, that is the reason for
reducing the learning rate over time, that is called step decay.

Another popular option is known as exponential decay. In exponential decay you set an initial
−𝑘𝑡
learning rate called α0 and then α = α0 where α0 and k are hyper parameters and t is an

iteration number. So, you start with the learning rate α0 and keep progressively reducing it

exponentially in this particular form over every few iterations, because t is an iteration number
here, as the iteration number increases the learning rate automatically starts falling down.

You can also use a simple 1 by t decay formula, where you start with an α0 and α at a particular

time step can be given by α = α0/(1 + 𝑘𝑡), where α0 and k are hyper parameters and t is an

iteration number. All of these as you can see are heuristics, we are just saying let us just do one
of these and maybe it will work. Can we do something better, can we be more aware of what a
neural network is doing and accordingly choose a learning rate rather than have these global
rules that we fix irrespective of what the network is doing? Is there a way to do this?

588
(Refer Slide Time: 15:29)

Yes, there are a few methods which we are going to discuss now. One of the earliest methods in
this context was known as adaptive gradient or adagrad. Adagrad has a simple intuition as to the
fact that sparse but important features may often have small gradients when compared to others
and learning could be slow in that direction.

In our case features means dimensions of weights rather a subset of weights, so you have your
entire space of weights, a subset of weights which change only once a while, so let us assume
that there are certain weights that to keep do not change with every mini batch, in mini batch
SGD, but change only once a while.

They may change by a small amount but that amount is important because they only change once
in a while. Sparse features could be important and learning could be slow in this direction
because the gradients could be small in that direction. So, a simple thought then is why should
the learning rate be the same for every gradient in your neural network?

Once again remember that the gradient is a vector of the derivatives of the loss with respect to
every weight in your neural network. Why should you use the same learning rate for all of them?
Why can't you use a different learning rate for each of these weights? So, what you do in
Adagrad is you accumulate the squared gradients as a running sum, rt is a quantity that holds a
running sum of squared gradients, once again keep in mind that the gradient is in ∆θ𝑡 is simply

589
the gradient that is the parameter update and that is a vector of the gradient with respect to every
weight in the neural network.

So, for each of those weights you maintain a running sum of the gradient squares and then your

parameter update states θ𝑡+1 = θ𝑡 − α/(δ + 𝑟𝑡) * ∆θ𝑡, ∆θ𝑡 is your gradient in your current

time step.

Now, why does this address this intuition? Delta here is simply used for numerical stability, you
−6 −7
just set it to be a value like 10 or 10 so that you do not face a divide by zero error, in case 𝑟𝑡

is zero you do not want to divide by zero error when you compute so that is the reason you have
a δ there.

But otherwise what this is doing is, when 𝑟𝑡 is small which is going to be a very small value, this

is going to become large or rather the learning rate for that weight is going to become large,
when 𝑟𝑡 is large the learning rate will become small. We are saying that for one of your weights

if your gradient was large give it a small learning rate and for one of your weights if your
gradient is small then give it a larger learning rate.

This seems to fit with the intuition that we had so far and the nice part now is this kind of an
approach because 𝑟𝑡 is maintained for each gradient or each weight individually. This

automatically takes care of changing the learning rate for each weight differently.

590
(Refer Slide Time: 19:08)

Let us try to understand this from a simple example. So, we do understand that your error
surface, these are contour plots, when your contour plots are spherical, things are easier, it is
easier to converge, when your contour plots get elliptical is where your oscillations can occur.
Another important property of elliptical contour plots is that there are certain dimensions where
the change can be rapid, and certain dimensions where the change could be rather small, which is
what elliptical basically means.

So, if you see this example here which denotes, let us assume the gradient and the corresponding
𝑦𝑡 value, they are now saying here that if you look at one of these, let us look the black frequent

and predict irrelevant which simply states that all those values keep changing regularly but they
do not seem to be connected to the class output, these are some values and they do not seem to
be connected to the class output in this particular scenario or the output of whatever function you
are modeling.

While there could be other values such as red and green which change only once a while so you
can see in ϕ𝑡, 2 it does not matter here as to what these quantities are as this is only an example,

so when ϕ𝑡, 2 is 1 we see that 𝑦𝑡 definitely becomes 1, it does not matter when it is 0 but when it

is 1 the output is definitely 1.

591
Similarly, for ϕ𝑡, 3 you see that when it is 1 the output is definitely minus 1. So, it looks like this

particular parameter changes only once a while but when it changes it makes an impact on the
final decision. So, this is a scenario that says that even if the gradient is small but if it happens
only once a while, let us give more weight to changes in those directions, in those weights and
that is what this algorithm does.

(Refer Slide Time: 21:18)

So, here is the algorithm, so you aggregate the gradient, you update your gradient accumulation,
so you initialize your gradient accumulation in a parameter called 𝑟𝑡−1 = 0 and you update your
2
gradient accumulation to be 𝑟𝑡 = 𝑟𝑡−1 + (∆θ) , you keep accumulating the square gradients and

then this is the formula.

One small observation here which is probably not clear from this algorithm is this particular
quantity on step nine is an element-wise product. So, which means, remember in this case that

while alpha was a scalar so far, 𝑟𝑡 is a vector, 𝑟𝑡 will also be a vector, so this entire

α/(δ + 𝑟𝑡) will become a vector, it is a vector of learning rates multiplied by a vector of

gradients element-wise multiplication, it is sometimes also called Hadamard product. Hadamard


product is nothing but element-wise multiplication, remember it is not like dot product, in a dot
product you take element-wise multiplication and then add them all up and get a scalar.

592
In a Hadamard product, you simply take element wise multiplication and your output is still a
vector, you do not add them all, that is the main difference and this operation here in step nine is
actually a Hadamard product, element wise multiplication between two vectors. There is one
problem with the Adagrad method, can you try this for the problem?

The problem is that this gradient accumulation term 𝑟𝑡 is a running sum that simply keeps

accumulating over iterations, which means as you run more and more iterations, it will only keep
increasing and get higher and higher and higher over time which means this denominator is
definitely going to get high over time for all the weights.

(Refer Slide Time: 23:22)

In particular, there are some of the weights for which gradients keep occurring in every iteration
where the denominator term can very quickly increase, this will lead to a lower learning rate for
those dimensions in the weights and learning may not happen along those directions. RMSProp
which was a method that came along the same time as Adagrad proposed a slightly different
2
approach to this problem and said let 𝑟𝑡 = ρ𝑟𝑡−1 + (1 − ρ)(∆θ𝑡) , it is a linear combination

where the two coefficients add up to 1.

593
This now ensures that because of the ρ and 1 − ρ, the sum cannot keep exploding and is now
controlled. Think about it for a while as to why and you will get why this will not allow you to
explode. And now your θ𝑡+1 and θ𝑡, the entire expression is exactly the same as Adagrad.

(Refer Slide Time: 24:42)

Here is the algorithm for RMSProp with only that change in the step where you have now the
running sum of the accumulated squared gradients to be a convex combination monitored by a
quantity called ρ which is known as the decay rate. The decay rate decides how much you want
to consider the previous squared gradient while you get the current squared gradient, very similar
to momentum but in a different context.

(Refer Slide Time: 25:12)

594
The most popular algorithm that is used for training neural networks today for adapting learning
rates is known as ADAM or adaptive moments. Adaptive moments uses a simple intuition of
combining ideas of RMSProp and momentum in a way, let us see how that is done. While
Adagrad and RMSProp accumulate the squared gradients ADAM accumulates both the gradient
and the square gradients, 𝑠𝑡 accumulates the gradient, 𝑟𝑡 accumulates the squared gradient just

like RMSProp. Why?

~ ~
You look at the final update equation here, the update says that θ𝑡+1 = θ𝑡 − α𝑠𝑡/(δ + 𝑟 𝑡). I

will explain the tilde part in a moment but let us first assume that there is no tilde let us just

assume 𝑠𝑡 and 𝑟𝑡. So, what is this doing α𝑠𝑡/(δ + 𝑟𝑡) is exactly the same that we saw in

RMSProp and Adagrad, 𝑠𝑡 would have been the current gradient in Adagrad and RMSProp but

now 𝑠𝑡 is the running sum of the gradient. And what is the running sum of the gradient?

Momentum, remember momentum was what you use the previous gradient, the previous gradient
so on and so forth using momentum term ρ1 is the momentum coefficient for us in this particular

context, that is why ADAM can be looked at as a combination of RMSProp and momentum.

Now, coming to why this is tilde here, why not just 𝑠𝑡 and 𝑟𝑡 is for a very simple reason of doing

what is known as bias correction in the initial iterations. In the initial iterations if ρ1 is set to a

595
high value which is what is typically done, ρ1 just like momentum ρ1 could be set to something

like 0.9 or something like that.

In your initial step 𝑟𝑡−1 would be initialized to 0 that is where 𝑟𝑡−1 will start, 1 minus ρ1 will

make it a very small value because ρ1 is close to 1, so which means the gradients will not have

any impact on 𝑠𝑡 in the initial iterations until they add up, so there may not be much training that

happens in the initial iterations.

~ 𝑡 ~ 𝑡
So, what do we do for that? We say let 𝑠𝑡 = 𝑠𝑡/(1 − ρ1 ) similarly 𝑟𝑡 = 𝑟𝑡/(1 − ρ1 ). What

does this do? If ρ1 is 0.9, remember 1 minus 0.9 is 0.1, so 𝑠𝑡 by 0.1 will now increase it 10 fold,

so by whatever fold it was reduced here, it is now going to increase it by 10 fold in the initial
iteration.

What happens in later iterations? We have 𝑡 here, t is the number of the iteration that you are
talking about. What happens here? Remember ρ1 is less than 1 it is chosen to be 0.9 or
𝑡
something like that, so ρ1 will keep reducing all the iterations, 0.9 into 0.9 is 0.81, so it keeps as

long as you have a value less than 1 as you raise it to the power t it is going to keep reducing
which means this quantity the denominator will keep reducing over iterations and sorry this is
𝑡 𝑡
not the quantitative denominator that ρ1 will keep reducing to 0 and 1 − ρ1 will get closer to 1

over the iterations.

~ ~
Which means after a few iterations 𝑠𝑡 will become 𝑠𝑡, 𝑟𝑡 will become 𝑟𝑡. So, the step of bias

correction was only to handle the initial epochs where the gradient may not have a significant
impact.

(Refer Slide Time: 29:27)

596
This is your ADAM algorithm, it is simply a summary of what we just said. These are the two
steps, you update a first moment estimate which is the gradient, second moment estimate which
is the square gradient you maintain a running sum of both, correct for biases and do your update.
The key thing in the update is you do not have the gradient here because the gradient is
subsumed into 𝑠𝑡.

597
(Refer Slide Time: 29:54)

That summarizes the various methods of improving gradient descent while training neural
networks. So, some of the issues that we saw are plateaus and flat regions, local minima and
saddle points, other issues that we will see as we go forward are vanishing and exploding
gradients, as neural networks become deeper and deeper the gradients that propagate from the
last layer to the first layer can keep vanishing or exploding, that can cause what are known as
cliffs in the error surface, we will see this a bit later.

Other challenges that could happen is sheer ill conditioning of the matrix of weights or the loss
functions or the gradients that you are dealing with, please look at this link here to understand
what this means, you could face from a (prob), you could face a problem of inexact gradients, the
gradients numerically may not be computed properly, for example, finding if a gradient is 0 is
not trivial because numerically attaining 0 while training is not possible, you have to ensure that
you manage it in some way.

But in general numerical issues and gradients could also cause problems. There could be poor
correspondence between local and global structure, which means you have a global error surface
and you want to get to the global minimum but the local region in which you are in may not
really correspond with that global structure, you will be stuck with trying to handle the
undulations in that local structure. Choosing learning rate and other hyper parameters like the
momentum hyper parameter or the decay rate are also problems to deal with.

598
(Refer Slide Time: 31:34)

To summarize, the error surface or cost surface in neural networks is often non-quadratic,
non-convex, high dimensional. Potentially in many minima and flat regions, there are no strong
guarantees that the network will converge to a good solution, the convergence is swift or that
convergence occurs at all, but it works. This is an issue that is being investigated today, but it is
fabulous that training neural networks with stochastic gradient descent works and is being used
around the world in several applications.

(Refer Slide Time: 32:15)

599
Here is your homework, read up on Deep Learning book chapter 8, at least the relevant sections
in that, as well as you can also go through this particular lecture on SGD. A few questions for
you to take away is how to know if you are in a local minima or any other critical point on the
loss surface, I did mention a few times that finding that or finding the convergence criterion is
not trivial. Let us ask that question for you to find before we answer.

Why is training deep neural networks using gradient descent and mean square error a
non-convex optimization problem? This is the first question that we started the lecture with, this
is something for you to ponder upon. And finally if we assume a deep linear neural network, no
activation functions at all, would using gradient descent and mean square error still be a
non-convex optimization problem when you train deep neural networks? These are some
questions to think about.

(Refer Slide Time: 33:18)

And here are references.

600
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 27
Regularization in Neural Networks
(Refer Slide Time: 00:14)

In this lecture, we continue and talk about Regularization in Neural Networks, which is an
important topic to obtain good test set performance with these models.

601
(Refer Slide Time: 00:33)

Before we start, let us review some of the questions we left behind in the previous lecture. The
first question was, how to know if you are in a local minimum, or any other critical point on the
loss surface, when you train Neural Networks. All the algorithms that we spoke about so far,
backpropagation Gradient Descent, all the variants of Gradient Descent always had this clause in
its while loop put follow, which said until convergence or until stopping criterion is achieved.

What should be the stopping criterion? Hope you had a chance to think about it. We are not
going to talk about the answer now. But we are going to talk about it over the course of this
lecture a little later in this lecture. The second question that we left behind was why is training
deep Neural Networks using Gradient Descent and Mean Squared error as cost function, a non
convex optimization problem?

I think we should probably spend some time here. Hope you tried to answer this question. And
you have a good understanding of convex and non convex functions, but let us very briefly
review that before answering this question. So, remember, this is a simple example of a convex
function and a convex function says that if your x axis was here, and here was your 𝑓(𝑥) axis, a
convex function says the definition says that if you have a point 𝑥1, and if you have a point 𝑥2,

this is 𝑓(𝑥1) and this is 𝑓(𝑥2).

602
Then all that we say is that this line connecting 𝑓(𝑥1) and 𝑓(𝑥2) will always lie above the curve.

So, mathematically speaking, 𝑓(λ𝑥1 + (1 − λ)𝑥2) <= λ𝑓(𝑥1) + (1 − λ)𝑓(𝑥2)

That is the definition of a convex function. So, clearly, a non-convex function is one which is not
convex. So, a non-convex function as an example could be a surface something like this. So, you
can clearly see that you will find lines such as these for instance, where the curve does not
necessarily lie below the line at all times. Now, let us come back to our question as to why is
training deep Neural networks using Gradient Descent and Mean Squared Error as cost function,
a non convex optimization problem?

One thing that you have to first understand here before answering this question is the nature of
the cost function Mean Squared error itself. Remember, Mean Squared Error you all know is
given by let us just take let us use the same, let us assume that y was the label and say ℎ(𝑥) was
the output of the Neural Network, Mean Squared Error tries to minimize the square of these
terms and of course, you take a mean, half can be added for simplicity, you of course, take a
mean of these quantities, quantities, that is Mean Squared Error.

Now, at one look, you can say that Mean Squared Error will be a convex function; it is a quadratic function, which means it could be very similar to what we drew here and it could turn out to be a convex function. Now, in this context, let us ask the question: then why is training deep Neural Networks using Gradient Descent and Mean Squared Error a non-convex optimization problem?

603
(Refer Slide Time: 05:27)

The reason is that you could have nonlinear activation functions in your Neural Network that can make the problem non-convex with respect to the weights. Okay, then your counter-question is: let us assume now a linear deep neural network, where you do not use any nonlinear activation function at all, no sigmoid, no tanh, no ReLU and so on; you only use linear activations, which is as good as saying no additional activation function beyond the linear transformations of the Neural Network itself.

Now, would training a Neural Network using Gradient Descent and Mean Squared Error be a
convex optimization problem or non convex optimization problem? From the answer I said
earlier, it should be a convex optimization problem, because the cost function is Mean Squared
Error. And I just said that nonlinear activation functions are what could make the problem non
convex. So, which means, if you have linear activations, it should be convex.

604
(Refer Slide Time: 06:35)

But the answer is no, it is still not convex. Or rather, yes, it is non-convex. A reason for that is the phenomenon of Weight Symmetry. There are many other symmetries in a Neural Network, but Weight Symmetry is at least one reason. What does Weight Symmetry mean? Remember that, when we talk about an error surface, we have to give an index to each weight in the Neural Network; that error surface has to have a certain dimensionality to it.

So, the first dimension would be the weight connecting the first neuron in the first layer to the first neuron in the second layer, the second dimension would be the weight connecting the first neuron in the first layer to the second neuron in the second layer, and so on. That is how you index all your weights and build your error surface. Now, let us take any two layers in a Neural Network.

And let us connect them using some set of weights; they are obviously fully connected, but these are some set of weights. Now, in a trained Neural Network, what I could do is take this neuron here and this neuron in the second layer, simply swap them, and also swap all the weights coming into those neurons. Once again, you have a trained model; you are no longer learning it, it is fully trained.

But what I am going to do now is take a couple of these neurons and simply swap them and also
swap all the weights coming into these neurons, will the output change? No, because the weights

605
are the same; we simply changed the ordering of the weights in one of the layers. But on the error surface, this corresponds to a very different position. Let us take an example in three dimensions. Let us say we had an x dimension, a y dimension and a z dimension, and let us say we had the values 2, 1 and 0 in these three dimensions. What we have done now is swap, say, x and y. That is what we did for some of the neurons and hence some of the weights.

(0, 1, 2), (2, 1, 0) and (2, 0, 1) are different points in three dimensional space. Similarly, for a
Neural Network, if you swap neurons, after you swap these points are different weight
configurations, because of the index to construct the error surface. But because of the symmetry
of the Neural Network, they would give exactly the same output, exactly the same error.

Which means, for these two weight configurations (as I said, we take a trained model, which means it is already at a critical point, let us say a minimum), we are now saying that here is another weight configuration which will also give you the same minimum, and which is going to be far away, because (2, 1, 0) and (2, 0, 1) are not necessarily close by in the grid; there are many points in between (2, 1, 0) and (2, 0, 1) if you plotted them in three-dimensional space.

Which means there is another point further away which also has the same minimum value of error, and this is just one swap of a neuron. For a Neural Network, assuming you have a million neurons, you could swap a million times. So, which means there are going to be many similar weight configurations which all have the same error, which clearly tells us that the error surface must be non-convex. It is something like this: all these points have the same error, which can happen only if your error surface is non-convex, because there are many other points in between those minima.
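To make the weight-symmetry argument concrete, here is a small sketch (in plain NumPy, with made-up layer sizes) that builds a tiny two-layer network, permutes two hidden units together with their incoming and outgoing weights, and checks that the output is unchanged even though the permuted weights are a different point in weight space.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 3 -> 4 -> 2 network with tanh hidden activations (sizes are arbitrary).
W1 = rng.standard_normal((4, 3))   # hidden x input
b1 = rng.standard_normal(4)
W2 = rng.standard_normal((2, 4))   # output x hidden
b2 = rng.standard_normal(2)

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2

# Swap hidden units 0 and 1: permute rows of W1/b1 and the matching columns of W2.
perm = [1, 0, 2, 3]
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.standard_normal(3)
print(np.allclose(forward(x, W1, b1, W2, b2),
                  forward(x, W1p, b1p, W2p, b2)))   # True: same function
print(np.allclose(W1, W1p))                          # False: different point in weight space
```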

606
(Refer Slide Time: 11:07)

Let us move on to Regularization, which is the focus of this lecture. Let us first start with an
Intuitive understanding of Regularization. Most of you may have perhaps covered this in an
introductory Machine Learning course, but let us briefly review this, and then talk about how
Regularization is done in deep Neural Networks or Neural Networks.

Let us take a very simple example. Take these sets of values: 1 2 3, which we say satisfies some rule; 4 5 6 satisfies the same rule; 7 8 9 also satisfies the same rule; 9 2 31 does not satisfy the rule. The question for you now is: what is the rule? If you observe carefully, this is very similar to the Machine Learning setting that we have. Imagine now that these are all data points.

And we are given, in training data, that certain data points have a certain label and certain other data points have another label. And our job is to find the model that tells us which label a new data point should get. Coming back to this example, what could be the rule here, which is satisfied by the first three sequences and not by the last set of values? A simple thought process should tell you that there are many possibilities.
should tell you that there are many possibilities.

You could say three consecutive single digits. You could say three consecutive integers, three numbers in ascending order, three numbers whose sum is less than 25, three numbers less than 10, or sequences that have 1, 4 or 7 in the first position. Or, for all you know, you can

607
just say Yes to the first three sequences and No to all others, which is also correct in this particular case.

Clearly, the last two cases would not have been the answers that you came up with intuitively.
Why is that so? The reason is, even as humans, we are always looking for a rule that can
generalize well if I gave you newer data points. I never even mentioned to you at this time that I
am going to give you newer data points. But still, it is human tendency to always come up with
rules that can generalize well to data that we may see tomorrow.

And this is exactly what we want to do with Neural Networks, or Machine Learning in general.
And we call this process Regularization, where we use methods and Machine Learning to
improve generalization performance and avoid overfitting to training data. So, we do not want to
necessarily come up with a rule that fits only the training data.

But we want to come up with a rule that can do well, maybe on data that we will see tomorrow,
which is what we mean by test set performance or generalization performance. Sometimes even
at the cost of not getting hundred percent accuracy on your training data set. Let us see now how
we can do this with Neural Networks.

(Refer Slide Time: 14:36)

We need one of the concepts to be able to follow some of the discussions in this lecture, which is
the concept of 𝐿𝑝 norms. Hopefully you are all aware of this, but let us quickly review them and

608
move forward. Remember, the 𝐿𝑝 norm is formally written as 𝐿𝑝(𝑥‾). Let us assume that x is a vector of d dimensions, [𝑥1, 𝑥2, ···, 𝑥𝑑]. 𝐿𝑝(𝑥‾), also denoted ||𝑥‾||𝑝, is given by

$(|x_1|^p + |x_2|^p + \cdots + |x_d|^p)^{1/p}$.

You can make out that if you put p as 2, you will get your standard Euclidean norm or 𝐿2 norm. Similarly, you can have the 𝐿0 norm, the 𝐿1 norm and so on, all the way to 𝐿∞. You can actually let p be any non-negative number for that matter. You can see here, diagrammatically, how the unit ball of each of these 𝐿𝑝 norms looks.

If you take the unit ball of the 𝐿1 norm, it looks like a rhombus, the unit ball of the 𝐿2 norm looks like a circle, and the unit ball of the 𝐿∞ norm looks like a square; the 𝐿∞ norm is given by the maximum element of the vector. This is important because it helps us understand certain aspects of Regularization that we will visit. So, a couple of points to keep in mind here: when p is less than 2, the norm tends to create sparse weights, and when p is greater than or equal to 2, it tends to create similar weights. We will see why this is the case in some time from now.
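As a quick sanity check, here is a small NumPy sketch (vector values chosen arbitrarily) computing a few 𝐿𝑝 norms with the formula above and comparing them against numpy.linalg.norm.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])   # an arbitrary example vector

def lp_norm(x, p):
    """(|x_1|^p + ... + |x_d|^p)^(1/p) for a finite p > 0."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(lp_norm(x, 1), np.linalg.norm(x, 1))           # L1: sum of absolute values
print(lp_norm(x, 2), np.linalg.norm(x, 2))           # L2: Euclidean norm
print(np.max(np.abs(x)), np.linalg.norm(x, np.inf))  # L-infinity: max |element|
```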

(Refer Slide Time: 16:55)

609
The most popular Regularization method for training, not just Neural Networks but other Machine Learning models too, is known as 𝐿2 Regularization, which imposes a penalty on the 𝐿2 norm of the parameters; for this reason it is also known as 𝐿2 weight decay, or simply weight decay. In this case, the loss function that we have for a Neural Network is appended with another term containing the 𝐿2 norm of the weights: $\tilde{L}(w) = L(w) + \frac{\alpha}{2}\|w\|_2^2$.

This loss function can be any loss function used to train your Neural Network. So far we have seen Mean Squared Error, so you can assume it is Mean Squared Error, but in future we will see other loss functions that you can use to train your Neural Network, and the 𝐿2 penalty is added to any of those loss functions. So, the α here denotes how much importance you want to give to penalizing the 𝐿2 norm of the weights, and the division by 2 is for mathematical simplicity.

(Refer Slide Time: 18:08)

So, the gradient of the total objective function will look like $\nabla \tilde{L}(w) = \nabla L(w) + \alpha w$. The gradient of this new objective function with the 𝐿2 weight decay term equals $\nabla L(w)$, the gradient of your original loss function, plus $\alpha w$: when you differentiate the penalty term with respect to w, the 2 and the 2 cancel and you are left with $\alpha w$. So, your Gradient Descent update rule is going to look like $w_{t+1} = w_t - \eta \nabla L(w_t) - \eta \alpha w_t$. That is going to be your new Gradient Descent rule for your regularized loss function.
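As a minimal sketch of this update rule (the quadratic `grad_loss` below is only a stand-in for your model's real gradient, and the learning rate and α values are arbitrary), one step of gradient descent with 𝐿2 weight decay could look like this:

```python
import numpy as np

def grad_loss(w):
    # Stand-in gradient of an arbitrary quadratic loss; replace with your model's gradient.
    return 2.0 * (w - 1.0)

def step_with_weight_decay(w, lr=0.1, alpha=0.01):
    # w_{t+1} = w_t - lr * grad L(w_t) - lr * alpha * w_t
    return w - lr * grad_loss(w) - lr * alpha * w

w = np.array([5.0, -3.0, 0.5])
for _ in range(100):
    w = step_with_weight_decay(w)
print(w)   # converges near the unregularized optimum (1.0), shrunk slightly toward zero
```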

610
(Refer Slide Time: 19:07)

Let us do some analysis of this Regularization to understand what is really happening. Let us first talk about it conceptually and intuitively, and then we will go over one mathematical way of looking at what is going on in Regularization. We just said some time back that the shape of the 𝐿2 norm ball can help you understand what you are doing with Regularization. Let me add one point here.

(Refer Slide Time: 19:40)

611
You could also minimize other norms of w in this term. You could have 𝐿1 weight decay, which is also a common approach and which enforces sparsity in the weights of the Neural Network. By sparsity we mean that many of the weights in the Neural Network will be forced to go to 0. In 𝐿1 weight decay, the penalty is with respect to the 𝐿1 norm, and remember the 𝐿1 norm of any vector is the sum of the absolute values in that vector. This is also possible.

(Refer Slide Time: 20:19)

Let us try to see what the effect would be in these different cases. Remember, in the last lecture, we talked about contour plots. So, you have these contour plots of your error surface with respect to the weights, contour plots such as these, if you recall, which are cross sections of your error surface, something like this. This is the contour plot of your error surface, which is the term without a regularizer.

So, when we add the regularizer, what are we doing? We are now saying that you want to minimize, over the set of weights, your loss with respect to the weights, plus some constant times the squared 2-norm of the weights (or the 1-norm of the weights in certain cases).

This is equivalent to saying that you want to minimize your loss function such that the 2-norm of your weights is less than a certain quantity; you can come up with some number, some constant C. These formulations are equivalent to a certain degree, under certain conditions. From

612
an optimization standpoint, why are we saying this? What this means now is that the minimum of this new objective function is equivalent to the minimum of the original objective function subject to the 2-norm being less than some value.

If you recall, this is nothing but a norm ball of the weights. And it is not the unit norm ball now, it is a C-norm ball: a unit norm ball is when the 2-norm of the weights lies within 1, that is, all the weights lie within a norm of 1 around the origin. Now we have a C-norm ball, of radius C, within which the norm should lie. So, which means the solution here will be at the intersection; let us assume that the C-norm ball was something like this.

Let us assume that here is the origin and the C-norm ball is somewhere here. You can see now that the C-norm ball intersects the error surface at some point; that is now going to be the minimum of this new function, not the point at the center, because that is outside the C-norm ball. So, you have to find a minimum somewhere here, one which is minimum for the loss and also lies within the C-norm ball, which is what you are trying to do.

So, when you have your 𝐿2 norm, your norm ball is going to look like a sphere, and you intersect it with the error surface and get a certain solution. However, if you had 𝐿1 weight decay, then in that particular case your norm ball is going to be a rhombus, not a sphere. A rhombus has sharp corners, which means it is very likely that you will find the minimum at points where one of the axes is 0.

You can probably see this in a slightly different way too; let us draw the error surface elsewhere just to make the point clear. We want to find the intersection of the error surface and the 1-norm ball, and the intersection will typically happen at one of these corners. And what happens at one of these corners of the 𝐿1 norm ball? The other weight is going to be 0. And in a high-dimensional rhombus, there may be many weights that go to 0.

And that is why enforcing 𝐿1 weight decay will enforce sparsity in your weights. Both are valid

regularizers. In one case, you are reducing the 2 norm of the weights. In the other case, you are

613
reducing the 1-norm of the weights, but the effect can be slightly different. Let us now also see what we are trying to do from a mathematical perspective.

614
(Refer Slide Time: 24:49)

Let us assume that we have an optimal solution for your original problem; let us call it $w^*$. So, which means $\nabla L(w^*) = 0$. So, $w^*$ was our optimal solution before Regularization, with only the original loss function. Let us now consider a term $u = w - w^*$, and let us write a Taylor series expansion of the loss function at $w^* + u$.

By Taylor series expansion, this is given by $L(w^*) + u^T \nabla L(w^*) + \frac{1}{2} u^T H u$, where H is the Hessian of the loss with respect to your weights. What is the Hessian? The matrix of all second partial derivatives: H contains entries such as $\partial^2 L / \partial x^2$ or $\partial^2 L / \partial x \partial y$, and so on. If you had many more weights, you would have that many entries; that is the size of your Hessian. For example, if you had 1 million weights, the size of your Hessian would be 1 million times 1 million. It is definitely a large matrix to compute.

615
(Refer Slide Time: 26:19)

*
But let us move on with this mathematical discussion. So, which means 𝐿(𝑤 + 𝑢) is actually w.
* * * * 𝑇 *
So, I can say 𝐿(𝑤) = 𝐿(𝑤 ) + (𝑤 − 𝑤 )∇𝐿(𝑤 ) + 1/2(𝑤 − 𝑤 ) 𝐻(𝑤 − 𝑤 ). We now know
* *
that ∇𝐿(𝑤 ) = 0, because that is our assumption of what 𝑤 is, it is an optimal point for your
original loss function. Which means we are left only with the first and the third term.
* * 𝑇 *
𝐿(𝑤) = 𝐿(𝑤 ) + 1/2(𝑤 − 𝑤 ) 𝐻(𝑤 − 𝑤 ).

616
(Refer Slide Time: 27:16)

Let us take its gradient now. The gradient will be given by

$\nabla L(w) = \nabla L(w^*) + H(w - w^*) = H(w - w^*)$.

(Refer Slide Time: 27:51)

(Refer Slide Time: 28:05)

617
Going back to this equation that we have on the previous slide, let us write that out.

(Refer Slide Time: 28:12)

Here, $\nabla \tilde{L}(w) = \nabla L(w) + \alpha w$, which is what we saw earlier. So, $\nabla \tilde{L}(w) = H(w - w^*) + \alpha w$. Why all this? What do we want to do?

618
(Refer Slide Time: 28:30)

Let us now consider $\tilde{w}$ to be the optimal solution in the presence of regularization, which means $\nabla \tilde{L}(\tilde{w}) = 0$.

(Refer Slide Time: 28:46)

Which means $H(\tilde{w} - w^*) + \alpha \tilde{w} = 0$, because we just showed on the previous slide that $\nabla \tilde{L}$ can be written this way. Since that gradient is 0, this expression also ought to be 0.

619
(Refer Slide Time: 29:08)

Now, let us rearrange the terms a little bit. Let us group all the $\tilde{w}$ terms; that would give us $H + \alpha I$, because when you add the two, you have to make the scalar a matrix, so we put $\alpha I$ here to ensure that $\alpha$ is added to every diagonal element of H. You are going to have $(H + \alpha I)\tilde{w} = H w^*$.

(Refer Slide Time: 29:37)

620
This means $\tilde{w}$, which is a solution of your regularized loss function, is going to be given by $\tilde{w} = (H + \alpha I)^{-1} H w^*$. Just to remind you, the notation I here means the identity matrix, in case you did not get that already.

(Refer Slide Time: 30:12)

Now, in this expression, if $\alpha$ goes to 0, it is very evident that you would be left with $H^{-1} H w^*$, and $H^{-1} H = I$, or rather, $\tilde{w} = w^*$ itself. That is equivalent to doing no Regularization, of course, because if $\alpha = 0$, the coefficient of your 𝐿2 weight decay term goes to 0 and there is no Regularization.

(Refer Slide Time: 30:39)

621
So, we are only concerned about the case when $\alpha$ is not equal to 0: what can we say about $\tilde{w}$? It is not trivial to say something in general about $\tilde{w}$ unless you assume some form for H.

(Refer Slide Time: 30:58)

So, to do that, let us assume that H is symmetric positive semidefinite. Positive semidefinite means that the Eigenvalues of H are all greater than or equal to 0. In such a scenario, H can be decomposed as $H = Q \Lambda Q^T$, where Q is an orthogonal matrix, which means $Q^T Q = I$. This should also tell you that $Q^T = Q^{-1}$.

622
(Refer Slide Time: 31:39)

So, why do we do this? With this, let us try to analyze what $\tilde{w}$ is: $\tilde{w} = (H + \alpha I)^{-1} H w^*$. Substituting the decomposition, we finally get $\tilde{w} = Q(\Lambda + \alpha I)^{-1} \Lambda Q^T w^*$.

So, this is what you get as $\tilde{w}$. This seems to be a complex representation of $\tilde{w}$; what do we want to do with it?

The main takeaway here is that $\tilde{w}$ is connected to $w^*$: the optimal solution of the regularized loss is connected to the optimal solution of the original loss by this transformation, the transformation that multiplies $w^*$. Let us try to assess this transformation more carefully.

623
(Refer Slide Time: 34:19)

If you observe here,

(Refer Slide Time: 34:23)

You first take $w^*$ and multiply it by the matrix $Q^T$, which is equivalent to doing some transformation on $w^*$. After you do that, you are multiplying by this term here. Remember, $\Lambda$ is a diagonal matrix; whenever you have a decomposition such as this, $\Lambda$ will be a diagonal matrix. So, you have $(\Lambda + \alpha I)^{-1} \Lambda$.

624
(Refer Slide Time: 35:01)

You can now write this more succinctly: you take every element of $Q^T w^*$, and it gets scaled by $\lambda_i / (\lambda_i + \alpha)$. How did we get this? This is one diagonal element of $(\Lambda + \alpha I)^{-1} \Lambda$, which is what we had on the earlier slide. These are diagonal matrices, so each diagonal term of that product turns out to be $\lambda_i / (\lambda_i + \alpha)$. Which means you are first taking $w^*$, transforming it by $Q^T$, and then scaling each element by $\lambda_i / (\lambda_i + \alpha)$.

(Refer Slide Time: 35:50)

625
And then at the end, you are rotating it by Q again; when we say rotate, we mean the transformation imposed by Q on the output.

(Refer Slide Time: 35:58)

Now, you are initially transforming it by $Q^T$ and later transforming it back by Q. So, in some sense, those two will probably cancel each other out in terms of effect, but in between, you are imposing a scaling on your optimal weight vector.

(Refer Slide Time: 36:20)

626
Let us see what that scaling means. The scaling says that if $\lambda_i \gg \alpha$, this factor is going to become approximately 1.

(Refer Slide Time: 36:31)

And if $\lambda_i \ll \alpha$, this factor will become close to 0. Rather, the scaling retains a significant value only when $\lambda_i \gg \alpha$, because in the other cases it is going to become 0. So, those components of $w^*$ may just become 0 after this transformation.

(Refer Slide Time: 37:00)

627
So, only when we have large Eigenvalues will the corresponding components of $w^*$ be retained in $\tilde{w}$, which is the new solution for your regularized loss. And the effective number of parameters is going to be given by $\sum_{i=1}^{n} \lambda_i / (\lambda_i + \alpha)$, which has to be less than n, because each $\lambda_i / (\lambda_i + \alpha)$ is a quantity less than or equal to 1. So, which means the sum over all n elements will be less than n. So, which means, even if you had n parameters, or n components in your $w^*$ vector, the effective number of parameters will definitely be less than the number of parameters, that is, the number of components or dimensions you had in your $w^*$.

(Refer Slide Time: 37:52)

628
The summary here is that your original solution $w^*$ is getting rotated to $\tilde{w}$ when you do 𝐿2 Regularization. All of its elements shrink, because $\lambda_i / (\lambda_i + \alpha)$ is always less than 1 as long as $\alpha$ is a non-zero value. Which means all of its elements are shrinking, but some are shrinking more than others, depending on those Eigenvalues $\lambda_i$. And what are those? The Eigenvalues of H.

629
(Refer Slide Time: 38:33)

That is why we wrote it out a couple of slides back: these are the Eigenvalues of H, the Hessian.

(Refer Slide Time: 38:37)

This ensures that only important features get high weights and other features may not get high
weights. That is one way of understanding 𝐿2 weight decay.
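To see this shrinkage numerically, here is a small NumPy sketch (with an arbitrary symmetric positive semidefinite H, α and $w^*$) that checks $(H + \alpha I)^{-1} H w^* = Q(\Lambda + \alpha I)^{-1}\Lambda Q^T w^*$ and prints the per-eigendirection scaling factors $\lambda_i/(\lambda_i + \alpha)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Build an arbitrary symmetric positive semidefinite "Hessian" H.
A = rng.standard_normal((4, 4))
H = A @ A.T
alpha = 0.5
w_star = rng.standard_normal(4)   # pretend unregularized optimum

# Direct formula: w_tilde = (H + alpha I)^{-1} H w*
w_tilde_direct = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Via the eigendecomposition H = Q Lambda Q^T
lam, Q = np.linalg.eigh(H)
scale = lam / (lam + alpha)                  # per-eigendirection shrinkage
w_tilde_eig = Q @ (scale * (Q.T @ w_star))

print(np.allclose(w_tilde_direct, w_tilde_eig))  # True
print(scale)  # directions with small eigenvalues are shrunk the most
```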

630
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No 28
Regularization in Neural Networks
(Refer Slide Time: 00:13)

That was one regularization approach, imposing a penalty on the L2 norm or L1 norm of the weights, which is very popular. But even then, you have to specify a coefficient for the L2 weight decay, L1 weight decay or any other weight decay that you choose. Another approach is known as Early Stopping, which is perhaps the crudest, simplest approach that you can use. The simple idea here is: keep monitoring the cost function.

And do not let it become too low; consistently stop training at an earlier iteration. Rather, do not let your model overfit your training data. If it goes to 0 training error, you are probably overfitting your data, and you do not want that to happen. So, you want to stop a little earlier. Here is a visual example, where your y axis is accuracy and your x axis is the epoch of training.

So, it is possible that your red curve here tells you that as you keep training, your training set accuracy keeps increasing, or your training set loss keeps decreasing; they are complementary. But it is possible that you held aside some portion of your training data, called

631
a test set or a validation set, and did not use it for training. As and when you complete an epoch, you can take that model, test it on that holdout set and see how it is performing.

As long as the performance on that holdout set or test set keeps improving, you keep training. When it starts dipping on that holdout set, it is probably time to stop, even if your training error keeps going lower and your training set performance keeps increasing. The moment the performance on the holdout set, which you do not use for training, starts decreasing, it is time to stop training your Neural Network. This is the idea of early stopping.
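Here is a minimal sketch of this idea (the `train_one_epoch` and `evaluate` functions, the patience value, and `model.copy()` as a stand-in for saving a checkpoint are all placeholder assumptions): keep training while validation accuracy improves, and stop once it has not improved for a few epochs.

```python
def early_stopping_loop(model, train_one_epoch, evaluate, max_epochs=200, patience=5):
    """Train until validation accuracy stops improving for `patience` consecutive epochs."""
    best_val_acc = float("-inf")
    best_state = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)              # one pass over the training data
        val_acc = evaluate(model)           # accuracy on the held-out validation set

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_state = model.copy()       # placeholder for checkpointing the best model
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # validation performance has started dipping

    return best_state, best_val_acc
```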

(Refer Slide Time: 02:13)

This reminds us of the question that we asked last lecture, which we said we would cover here: when do you stop? We said we would talk about the stopping condition or the convergence criteria of the algorithms we saw last lecture. So, when do you really stop? There are a few heuristics that you can use when working with them. The problem here is that when you work with computers, where numbers are approximated at some finite precision, it may not be possible to get an absolute 0 value for the gradient.

You may have to check whether your gradient is, say, $10^{-5}$ or $10^{-6}$ or something like that, and say I am going to stop here. But even that can lead to numerical errors. So, we are going to talk about a few heuristics which we can use to decide when to stop training. One thing you can do is to

632
train n epochs, lower the learning rate, train m more epochs, and so on and so forth. This is one approach.

(Refer Slide Time: 03:18)

But it is not a very advisable approach, because you do not know what n and m should be for all kinds of Neural Network models. Instead, we can use a couple of criteria. One is known as the Error Change condition: we keep checking the error and how it has been dropping over a window of epochs, which could be 10 epochs, 5 epochs, 3 epochs, whatever.

And if the error is not dropping significantly across those epochs, or mini-batch iterations, you say it is time to stop your training. After you stop, you can always train for a fixed number of additional iterations: you decided to stop even though you still have not got to a critical point, which means your gradient is not absolutely 0, so you may as well train a little bit more just to ensure that you probably get closer to the critical point. That is one heuristic you can use.

Another heuristic is a Weight Change criterion: you can compare the weights at iteration or epoch t-10 and at t, which is your current iteration, and you can test whether the maximum weight change is bounded by a value. Why do we say maximum weight change? If we simply took the L2 norm of the weight change between the earlier iteration and this iteration, that could be less than the threshold ρ.

633
Because there were some weights that were not changing, but some weights were changing by a large amount, and you may have concluded that the L2 norm of the weight change is fairly small. But by focusing on element-wise differences between the weights at t-10 and now, and ensuring that the maximum difference between the weights at that epoch and this epoch is less than a constant, you get a good heuristic to stop training too.

(Refer Slide Time: 05:13)

So in this case, it may be better not to base it on the length of the overall weight-change vector. As I said, this ρ could also be a percentage of your weights: instead of ρ being a number such as $10^{-3}$, you could say that the change must be less than 10% of the overall norm of the weights, or any other value for that matter. Those are ways in which you can use this as a stopping criterion for your Neural Network.
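A minimal sketch of the weight-change criterion described above (the window of 10 epochs and the threshold ρ are arbitrary choices):

```python
import numpy as np

def max_weight_change(w_old, w_new):
    """Largest element-wise change between two weight vectors."""
    return np.max(np.abs(w_new - w_old))

def should_stop(weight_history, window=10, rho=1e-3):
    """Stop if the maximum element-wise change over the last `window` epochs is below rho."""
    if len(weight_history) <= window:
        return False
    return max_weight_change(weight_history[-window - 1], weight_history[-1]) < rho
```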

634
(Refer Slide Time: 05:48)

Another regularization method is Dataset Augmentation, where, in addition to the data that is given to you in your training set, you also add more data through transformations on your original data, which, to some extent, exposes your Neural Network to data beyond the training data and hence gives you better generalization performance. So, let us see how you really augment your data.

You exploit the fact that certain transformations to the image do not change the label of the image. For example, if you had a cat in an image and you want to call it a cat, whether the cat was small in size, large in size, rotated by 30 degrees or rotated the other way by 60 degrees, the cat is a cat. So why not take the original image, make these transformations, and then train the model with all of these transformations, in the hope that it will probably do well on some data that it has not seen so far?

Because the cat that comes tomorrow in a new image could be rotated. So, there are different things that you can do. You can do Data Jittering, which means you can blur the image or distort the image a little bit to handle noisy variations in your test data; you could rotate the image, as we just said; you could impose color changes on the image. How do you do color changes? Remember that every image has an R channel, a G channel and a B channel, and it is common to have 3 channels as input to the Neural Network.

635
People generally do not use other color spaces, although you can. You can take the intensities in each of these channels and mix them up: take the intensities in the green channel and call it the red channel, take the intensities in the blue channel and call it the green channel, and you can get several permutations and combinations out of these. You could also inject noise into your data, for example some Gaussian noise across your image, and you could also mirror your image, in cases where the mirroring does not change the label.

If there are objects, say a cat, whether you see the cat one way or the other, it is still a cat, and you may still want your model to not change its decision based on which way the cat looked. This helps, firstly, to increase data, because Neural Networks need large amounts of data for training; and in addition to increasing data, it also acts as a regularizer because it exposes the model to newer kinds of data beyond the training data alone.
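As a rough illustration of a few of these augmentations (mirroring, channel permutation and additive Gaussian noise) on an image stored as an H x W x 3 NumPy array; this is only a sketch with arbitrary parameters, and real pipelines typically use library transforms:

```python
import numpy as np

def horizontal_mirror(img):
    # Flip left-right; valid when mirroring does not change the label.
    return img[:, ::-1, :]

def permute_channels(img, order=(1, 2, 0)):
    # e.g. feed the G channel where R used to be, B where G used to be, and so on.
    return img[:, :, list(order)]

def add_gaussian_noise(img, sigma=10.0, rng=np.random.default_rng()):
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Example with a random array standing in for a real photo.
img = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = [horizontal_mirror(img), permute_channels(img), add_gaussian_noise(img)]
```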

(Refer Slide Time: 08:39)

Here are some examples of how these augmentations look. So, here is an Original photo, here is
Red color casting, just by increasing the intensity of the red channel, Green color casting,
similarly Blue color casting, RGB all changed as we said just keep flipping color channels,
Vignette, more Vignette, Blue casting plus Vignette.

636
(Refer Slide Time: 09:05)

Here are some more examples: Left rotation and crop. See, the image was rotated to the left; so let us look at the original image again.

(Refer Slide Time: 09:13)

So, you can see the difference: this was the original photo, observed on the top left of the next slide.

637
(Refer Slide Time: 09:19)

You can see that there was a left rotation, and then whatever went outside the frame of the original image was cropped. Similarly, Right rotation and crop, Pincushion distortion, Barrel distortion, Horizontal stretch, more Horizontal stretch, Vertical stretch, more Vertical stretch, and so on. Horizontal stretch is this way; vertical stretch is this way. It is a good exercise for you to see which image processing operation would produce each of these.

Remember, these are all byproducts of image processing operations that we talked about at the very beginning of this course; you can now try to see how you would construct these distortions with simple image processing operations. If you need help, you can also look at this paper called Deep Image: Scaling up Image Recognition to understand what kinds of augmentations were used and how you can arrive at them.

638
(Refer Slide Time: 10:22)

Newer methods in the last one or two years have done data augmentation in a very different manner. A very popular method today is called Mixup. What Mixup does for dataset augmentation is create virtual training examples by interpolating data. If you have two data points $x_i$ and $x_j$, which could be two images, you construct a new sample $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$, a convex combination of $x_i$ and $x_j$.

So, the new image lies on the line drawn between $x_i$ and $x_j$. And for the label: $x_i$ may have a label $y_i$ and $x_j$ a label $y_j$. Remember that when you use labels for training the Neural Network, if you had three class labels, $y_i$ would maybe be [1 0 0], where the positions could mean a dog, a cat and a horse. So, you represent $y_i$ for a dog as [1 0 0] and $y_j$ for a cat as [0 1 0].

So, now you can take combinations of these: $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$. That gives you a new label whose values need not be 1 and 0; they can lie between 1 and 0, and that is the purpose here. If you mix up the image in a particular way, you also mix up the label in the same way; that is what Mixup does, and Mixup over the last couple of years has led to several variants like Manifold Mixup, AugMix, CutMix, and so on.
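A minimal sketch of the Mixup idea on a pair of samples (the Beta-distribution parameter is an arbitrary but commonly used choice; this is not the authors' reference implementation):

```python
import numpy as np

def mixup_pair(x_i, x_j, y_i, y_j, alpha=0.2, rng=np.random.default_rng()):
    """Return a convex combination of two samples and of their one-hot labels."""
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    x_tilde = lam * x_i + (1.0 - lam) * x_j
    y_tilde = lam * y_i + (1.0 - lam) * y_j
    return x_tilde, y_tilde

x_i, x_j = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
y_i, y_j = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])   # dog, cat
x_new, y_new = mixup_pair(x_i, x_j, y_i, y_j)
```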

This is one interesting and effective way of performing dataset augmentation today. Another popular method is known as Cutout, where you randomly mask out square regions of the input. You can see these gray boxes
639
here; these are boxes of gray which have masked out certain regions during training. That is the reason it is called Cutout: we just cut out certain portions, fill that location with a constant (black or gray), keep the rest of the image, and let the Neural Network train on that.

Even these kinds of augmentations have shown fairly good generalization performance when you train Neural Networks. A more recent variant of Cutout is called CutMix, and there are many other follow-up methods, but these are the broad ones which you perhaps should know.
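A minimal sketch of the Cutout idea (mask size and fill value are arbitrary choices) on an H x W x C image array:

```python
import numpy as np

def cutout(img, mask_size=16, fill_value=0, rng=np.random.default_rng()):
    """Mask out a randomly placed square region of the image with a constant value."""
    h, w = img.shape[:2]
    out = img.copy()
    cy, cx = rng.integers(0, h), rng.integers(0, w)      # random centre of the mask
    y0, y1 = max(0, cy - mask_size // 2), min(h, cy + mask_size // 2)
    x0, x1 = max(0, cx - mask_size // 2), min(w, cx + mask_size // 2)
    out[y0:y1, x0:x1] = fill_value
    return out

img = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = cutout(img, mask_size=20, fill_value=128)    # 128 gives a gray patch
```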

(Refer Slide Time: 13:13)

Another approach, as we said, for regularization is Injection of Noise. This noise can be injected at the data level, and it can also be injected at the label level or the gradient level, as we will see. In data noise, you add noise to the data while training, and you can mathematically show that adding Gaussian noise to the input is equivalent to doing L2 weight decay when the loss function is the sum of squared errors.

That is an interesting connection; we are going to leave it to you as homework. This is the paper that actually showed it, way back in the 90s. You can look at that paper if you want to understand how to prove it, but we will review it next lecture; please try on your own to see if you can prove this.

640
(Refer Slide Time: 14:08)

So, this was just adding noise to your input data. As I just said, adding Gaussian noise to the input is equivalent to doing L2 weight decay for a particular kind of loss function, which means adding noise is also a regularizer, and it also becomes a certain kind of augmentation. We can also do Label noise and Gradient noise.

(Refer Slide Time: 14:31)

And let us see how that is done. In the case of label noise, you disturb each training sample with some probability α, and for each disturbed sample, you draw the label

641
randomly from a uniform distribution, regardless of the true label. Here is the algorithm, drawn from this paper known as DisturbLabel, published in 2016. You generate a disturbed label: for a particular training sample, picked with probability α, instead of taking its correct label, you uniformly sample from any of the labels that you have in your set of categories when you do a classification problem. This just adds some label noise; the sampled label is often clearly incorrect, but this, you hope, will keep the Neural Network from overfitting to the training data and hence generalize better. This is an idea, but it is not used too often in practice.

(Refer Slide Time: 15:34)

An idea that is used in practice is known as Gradient noise. In this case, you add noise to the gradient instead of the input or the output; you add noise to the gradient while training. How do you do that? You take the gradient of any weight in the Neural Network when you are doing backpropagation, and to that gradient, before you update your parameter, you add some noise. Remember, you would have forward propagated, got an output, got an error and got a gradient.

And then you would ideally use that gradient to backpropagate; before you backpropagate and update your weights, you add noise. What noise? Gaussian noise with mean 0 and variance $\sigma_t^2$. This work also suggests that you anneal the Gaussian noise by decaying its variance over the iterations, so $\sigma_t^2 = \frac{\eta}{(1+t)^\gamma}$, where $\eta$ and $\gamma$ are user-specified constants.

642
You have to specify them to use this method. But what it is doing is: as you keep training over time, you are trying to reduce the variance of this Gaussian, which means you are trying to keep the noise lesser and lesser as you go through training. This does seem to show significant improvement in performance in certain applications.
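A minimal sketch of a gradient-descent step with annealed gradient noise as described above (the values of η and γ are arbitrary placeholders):

```python
import numpy as np

def noisy_gradient_step(w, grad, t, lr=0.01, eta=0.3, gamma=0.55,
                        rng=np.random.default_rng()):
    """One update where Gaussian noise with variance eta / (1 + t)^gamma is added to the gradient."""
    sigma_t = np.sqrt(eta / (1.0 + t) ** gamma)
    noisy_grad = grad + rng.normal(0.0, sigma_t, size=grad.shape)
    return w - lr * noisy_grad
```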

(Refer Slide Time: 17:03)

A general approach to achieve regularization in Machine Learning is ensemble methods. In ensemble methods, you train several different models separately and then have all models vote on the output. Why is this a regularizer? Because if one of those models overfits your training data, you still hope that the other models would not have overfit, and they would perhaps help you do well on test data in the generalization setting.

A standard example of a strategy for ensembling is model averaging: you have k different models, you get all their outputs and average them, and the hope is that not all of them would have overfit your training data. These different models that you choose for ensembling could come from different hyperparameters; you could use different learning rates or different numbers of layers in a Neural Network.

It could be different features in terms of input that you give to these models, or it could be
different samples of your training data. So, if you have 10,000 data points in your training data

643
you can take 1000 of them and train one model, take another 1000 and train another model; we will see that in a slide from now.

(Refer Slide Time: 18:29)

So, here is that example: bagging, which stands for bootstrap aggregating, is an ensemble method in traditional Machine Learning which does ensembling using the training dataset. How does it do this? Suppose you had, say, k logistic regression models, or any machine learning models for that matter; we just take the example of logistic regression here.

Given a training dataset, you construct multiple training sets by sampling with replacement. So, you construct k different datasets from your original dataset by sampling with replacement: if you had a dataset with 10,000 data points, you take a thousand of them, train a model, replace them, take another thousand, and so on; you can train as many models as you would like when you sample with replacement. Each such model is trained with the corresponding training set, so you get k such models using k samples of your training data.

644
(Refer Slide Time: 19:39)

Now, let us try to analyze why this can be useful. Suppose each model makes an error $\epsilon_i$ on a test sample. To be able to analyze this mathematically, let us assume that $\epsilon_i$ is drawn from a zero-mean multivariate normal distribution with variance $E[\epsilon_i^2] = V$ and covariance $E[\epsilon_i \epsilon_j] = C$, the covariance between the errors of two different models.

The error made by the average prediction of these models is $\frac{1}{k}\sum_i \epsilon_i$; that is the average error made by all models for the given test example. Let us now look at the expected squared error of this ensemble predictor. The mean squared error of the ensemble is going to be $E\big[\big(\frac{1}{k}\sum_i \epsilon_i\big)^2\big]$. That is what you have here.

Now, that can be expanded: since $\frac{1}{k}$ is a constant inside the square, it comes out as $\frac{1}{k^2}$. And the squared sum can be expanded this way; a simple way to understand why that is correct is to remember that $(a+b)^2 = a^2 + b^2 + ab + ba$. That is exactly the way we are writing it: we are writing out $\big(\sum_i \epsilon_i\big)^2$, which is equivalent to $(\epsilon_1 + \epsilon_2 + \cdots + \epsilon_k)^2$, and that expands just like $(a+b)^2 = a^2 + b^2 + ab + ba$.

645
So, in this expansion we get terms $\epsilon_i \epsilon_j$: when i and j are the same, those give you the square terms, and when i and j are not equal, those are the ab and ba terms. You can work this out for more terms, but that is how the expansion comes. So, the terms with j equal to i can be summarized as $\sum_i \epsilon_i^2$, and the rest, the cross terms $\sum_i \sum_{j \neq i} \epsilon_i \epsilon_j$, remain as they are.

Now, by our definitions here and the linearity of expectation (you can say the expectation of a plus b is the expectation of a plus the expectation of b), you can split the expectation over the two groups of terms: the expectation of $\sum_i \epsilon_i^2$ plus the expectation of the cross terms. Let me write that out to make it clear.

The first term can also be written as $\sum_i E[\epsilon_i^2]$, once again because of the linearity of expectation. When you do that, each $E[\epsilon_i^2]$ is V, and you take a summation k times over V, so you will have kV.

Similarly, for the cross terms you would have $k(k-1)C$. Why is that the case? You have i going from 1 to k, and j not equal to i, so you get k-1 values in the second summation; that gives $k(k-1)$ terms $E[\epsilon_i \epsilon_j]$, each of which is C. Now, simplifying the factors of k, you will be left with $\frac{1}{k}V + \frac{k-1}{k}C$.

646
(Refer Slide Time: 23:56)

Why? What are we going to do with this? Let us write that out again. We are saying that the mean squared error of this ensemble is given by $MSE = \frac{1}{k}V + \frac{k-1}{k}C$. What does this tell us?
It does tell us something interesting. It tells us that if the errors of the models are perfectly correlated, then V and C are the same.

(Refer Slide Time: 24:20)

So, remember, in this case, they are perfectly correlated, which means all of them are exactly
correlated the same way.

647
(Refer Slide Time: 24:26)

So then V is equal to C, and when you substitute V = C here, you will end up with V itself as the answer. Rather, bagging really does not help.

(Refer Slide Time: 24:46)

Because we are saying now that V is the variance of any one model, and it is also the covariance between the models.

648
(Refer Slide Time: 24:50)

And from the mean squared error, our ensemble also has the same variance. So, it does not really help much when the errors of the models are all exactly the same, all correlated in exactly the same way.

(Refer Slide Time: 25:03)

However, if the errors of the models are independent or uncorrelated, in particular if C is equal to 0 (remember, C was the covariance between $\epsilon_i$ and $\epsilon_j$), so if two different models' errors are uncorrelated, then C will be 0 and your mean squared error will be $\frac{1}{k}V$. So, which

649
means the MSE of the ensemble will have significantly lesser variance. Another way of summarizing this discussion is that the ensemble will perform at least as well as its individual members.
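A quick simulation of this result (with arbitrary values of k and V, and synthetic Gaussian errors) can make it concrete: averaging k models with uncorrelated errors reduces the error variance roughly by a factor of k, while perfectly correlated errors give no benefit.

```python
import numpy as np

rng = np.random.default_rng(0)
k, V, trials = 10, 1.0, 200_000

# Uncorrelated errors: C = 0, so ensemble MSE should be about V / k.
eps_uncorr = rng.normal(0.0, np.sqrt(V), size=(trials, k))
print(np.mean(eps_uncorr.mean(axis=1) ** 2))   # close to 0.1

# Perfectly correlated errors: every model makes the same error, so MSE stays about V.
shared = rng.normal(0.0, np.sqrt(V), size=(trials, 1))
eps_corr = np.repeat(shared, k, axis=1)
print(np.mean(eps_corr.mean(axis=1) ** 2))     # close to 1.0
```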

(Refer Slide Time: 25:44)

Now, why did we talk about ensembles? We want to see how to bring that idea of ensembles to regularize a Neural Network, or an ensemble of Neural Networks for that matter. Before we design ensembling for Neural Networks, one major issue that occurs when you learn large Neural Networks is what is known as co-adaptation. Co-adaptation should remind you of what we talked about as Hebbian learning at the start of this week: as a Network is trained iteratively, powerful connections are learned more and more while weaker ones are ignored.

You could ask me what is wrong with that; that is what Hebbian learning does. Hebbian learning was described by a person called Donald Hebb, and it seems to say the same thing. It is all right in general. But when the same connections get to learn more and more and other weaker connections are ignored, you may have to be concerned that your model is overfitting to your training data.

Because, as we said earlier, it is okay if you make a few mistakes on your training data, but you want to do well on your test set. So, it is possible that

650
after many iterations, only a fraction of node connections actually participate in generating the
output, and just increasing your Neural Network size is not really going to help because the
situation will continue to proliferate even then.

(Refer Slide Time: 27:28)

So, what can we do here? We use a method called Dropout, which is a regularization method. What does Dropout do? In the training phase, for each hidden layer and for each training sample, you ignore a random fraction of the nodes.

(Refer Slide Time: 27:49)

651
And in the test phase, you use all the activations, but reduce them by a factor of p, where p is the probability with which you pick nodes; in the test phase, you multiply all the activations by p. p could be 0.5, for that matter. When p is 0.5, in every layer you drop 50 percent of the nodes in each mini-batch iteration when you train a Neural Network using SGD. So, you do that for each mini-batch iteration, and you probably do a thousand iterations.

In each iteration, your Neural Network could be slightly changing, because you are randomly dropping nodes in each layer. So, at the end, after training, what is the model that you have to use for testing? You take the original full model, but you multiply the activations that you get from every layer by 0.5, in this case. Why do you need to do that? Because each node could have participated only 50 percent of the time, that is how it was sampled, so you probably should give only 50 percent weightage to its activation.
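Here is a minimal sketch of this train/test behaviour for one layer's activations (the keep probability and sizes are arbitrary; note that many modern libraries instead use "inverted dropout", scaling by 1/p at training time so nothing changes at test time):

```python
import numpy as np

def dropout_forward(activations, p_keep=0.5, train=True, rng=np.random.default_rng()):
    """Dropout as described above: drop units while training, scale by p_keep at test time."""
    if train:
        mask = rng.random(activations.shape) < p_keep   # keep each unit with probability p_keep
        return activations * mask
    return activations * p_keep                          # use all units, scaled down

h = np.random.rand(8)                        # activations of one hidden layer for one sample
h_train = dropout_forward(h, train=True)     # roughly half the entries are zeroed
h_test = dropout_forward(h, train=False)     # all entries, each multiplied by 0.5
```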

(Refer Slide Time: 28:58)

So, here is an illustration of how Dropout works; it was proposed in 2014. You have a standard Neural Network; in this case, this is the input layer and here is the output layer. You see that when you apply Dropout in each layer, there are certain neurons that you simply do not consider for that mini-batch iteration when you forward propagate. You only consider the other weights; it is almost as if those weights were 0 or did not exist. You could also do dropout in the input layer, so certain input features are just not considered at all in that iteration. Why do we do this?

652
(Refer Slide Time: 29:42)

Let us examine this a bit more, and we will also give you an intuition of how this can be useful. Remember, we said that ensemble methods are good regularizers because they can minimize the variance of a model and hence do well on future test data; an ensemble is generally more robust than a single model. So, we want to see how to do this with a Neural Network. And if a Neural Network has H hidden units, each of which can be dropped, you can have $2^H$ possible models. That is the total number of possible models you can get from the Neural Network to create an ensemble.

653
(Refer Slide Time: 30:23)

So, how do we go about creating such an ensemble? You probably do not want to train $2^H$ models and then ensemble them by voting or any other means. So, what we do now is impose one small constraint. We say that, for a particular hidden unit h, there are $2^{H-1}$ models that can vary when that hidden unit is fixed, because apart from this node there are H-1 other units, giving $2^{H-1}$ models. We are saying now that across all of those $2^{H-1}$ models, this unit h must have the same weights; its weights are fixed, and only the other weights can change. This is the trick that we are going to use.

654
(Refer Slide Time: 31:15)

And how does that actually work? It happens that if you drop out with a particular probability p, say 0.5, during training in every mini-batch iteration, then at test time, simply multiplying your activations by half (or by p, for that matter) is equivalent to taking the geometric mean of all your $2^H$ models. We are going to prove this. But before we prove it, let us try to understand intuitively why this could be useful as a regularizer.

(Refer Slide Time: 32:02)

655
An intuitive example: let us assume that there is a leader of an organization with 1000 employees. Each of the neurons here is like an employee of the organization, and the CEO puts these employees together in certain configurations to achieve a certain task. Now, the way a Neural Network learns to perform a task is not through knowledge; it is by repetitively doing the same thing again and again.

It is effectively trial and error, where you keep changing the weights and eventually get to performing well on your task. So, if an organization had to run this way, what could happen is that, over time, certain people in the organization gather certain specializations and keep getting better and better at them, while others who are not exposed to a task may not do so well on that kind of task.

This is what we mentioned as co-adaptation. And this could be dangerous, because after
finishing a particular task, if the CEO had to take up a new project, and certain employees left,
then the CEO would be left with certain people who may not really know certain tasks well. So,
what does the CEO do? The CEO simply takes an interesting decision of asking employees to show up randomly: only 50 percent of the employees show up every week.

And they have to do all the tasks between them, even the tasks that the other 50 percent were doing. How does this help? Now each employee is learning even the other tasks, which they may not have specialized in. And this builds a robust organization that can handle newer tasks in the future. This should give you the intuition of how Dropout works. Now let us see mathematically how Dropout creates an ensemble.

656
(Refer Slide Time: 34:05)

So, let us consider a single example: a logistic sigmoidal function on an input x. The output is equal to $\sigma(x)$, which for the moment let us define as $\sigma(x) = \frac{1}{1 + ce^{-\lambda x}}$, where $c \geq 0$. This is a valid logistic sigmoidal function.

(Refer Slide Time: 34:30)

Let us assume now that there are $2^n$ possible sub-networks, which are indexed by k.

657
(Refer Slide Time: 34:41)

k is the index we use to cover all of those $2^n$ sub-networks. Let us define the geometric mean of the outputs from all of these sub-networks as G; this is the geometric-mean formula over all those models, $G = \big(\prod_k o_k\big)^{2^{-n}}$, where k indexes the $2^n$ possible sub-network models and $o_k$ is the output of sub-network k. G is the geometric mean of the outputs.

(Refer Slide Time: 35:03)

Similarly, you could also get the geometric mean G′ of the complementary outputs: instead of $o_k$, you take $1 - o_k$. If you have a sigmoid, it is going to give you a value between 0

658
and 1. This is typically used in the binary classification setting, where if your output is 0.6, you are saying it is class 1 with probability 0.6. But that also means the probability of class 2 is 0.4. That is the information that we are using here to get the geometric mean of the complementary outputs.

(Refer Slide Time: 35:41)

The normalized geometric mean is now given by G/(G + G′), where G and G′ are as defined here. This can be written as 1/(1 + G′/G), and substituting for G′ and G, and then for $o_k$, you get the term here: G′/G involves the ratio $(1 - o_k)/o_k$ inside a product over k. And instead of $o_k$ you substitute $\sigma(x)$, and instead of $1 - o_k$ you substitute $1 - \sigma(x)$.

That is how you get the first term there. Now, $(1 - \sigma(x))/\sigma(x) = ce^{-\lambda x}$; we will leave that as an exercise for you to work out, but it comes by simple substitution using the definition of $\sigma(x)$. So, that is what we are replacing here. Now, this product can be converted to a summation inside the exponential term: remember, $e^a \cdot e^b = e^{a+b}$. Using that idea, the product becomes a summation inside the exponent, and that is how you go from here to here.

659
(Refer Slide Time: 35:14)

And now, if you look at this carefully, it looks very similar to the original definition of $\sigma$; just the input x is different. So, this is given by $\sigma(E(x))$, where the expectation over x is the term inside. But what is this telling us? You could now summarize this as: the normalized geometric mean of $\sigma(x)$ over the sub-networks can be written as $\sigma(E(x))$. So, why is this important? We are saying that if you sample these $2^n$ models in this way, where you consider only half of the units at a particular point in time, then simply multiplying the values of x by half will give you the same value.

(Refer Slide Time: 38:15)

660
So, we say that the NGM, or normalized geometric mean, is equivalent to the output of the overall Network with weights divided by 2. Rather, this tells us that doing Dropout, by dropping nodes in each layer with a certain probability, becomes equivalent to taking the overall Network and multiplying the activations by the same probability at test time. And this makes this ensemble easy to work with: in each mini-batch iteration you drop a few neurons, and then at the end of training, you really are not taking $2^n$ different models and averaging; you are using just one model, multiplying each layer's outputs by p, and you are done with your ensemble result. That is why it is a powerful tool.

(Refer Slide Time: 39:08)

661
With that, your homework for this lecture is Chapter 7, Sections as given here. And if you would like a tutorial on Dropout to understand it better, please look at this link. As we already said earlier, your exercise for this lecture is to show that adding Gaussian noise with 0 mean to the input is equivalent to L2 weight decay when your loss function is mean squared error. Give it a try.

662
(Refer Slide Time: 39:38)

Here are references.

663
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 29
Improving Training of Neural Networks
In the last lecture we spoke about how regularization is an important component to improve the
generalization performance of Neural Networks, we also spoke about a few different ways of
performing regularization. In this lecture we will talk about other components that can help
improve the training of Neural Networks.

(Refer Slide Time: 0:46)

Before we go forward, we left one question behind: your exercise was to show that adding Gaussian noise with zero mean to the input is equivalent to L2 weight decay when your loss function is mean squared error. Hope you had a chance to work it out. Let us try to understand why this is the case.

So, when you add Gaussian noise to inputs the variance of the noise is amplified by the squared
weight as you go to the next layer of the neural network. So, which means if you simply had a
simple neural network where the output layer was directly connected to inputs, the amplified
noise would directly get added to the output. And that added output makes a contribution to the
loss function term, the squared error term and that is what makes it look like an L2 weight decay
in the loss function.

664
So, you have Gaussian noise, say, sampled from a distribution with zero mean and variance $\sigma^2$, and when you send it through a layer it is amplified by $w_i$ and arrives at the output layer with a new variance amplified by $w_i^2$. That is the intuition; let us work it out mathematically also.

(Refer Slide Time: 2:22)

So, mathematically, if you consider one input with added noise, you would have your output $y_{noisy} = \sum_i w_i x_i + \sum_i w_i \epsilon_i$. Remember here that originally your output would have been only $\sum_i w_i x_i$; so, the second term is a component that got added to the output because you added noise to the input. And what is the noise? $\epsilon_i$, which is sampled from a Gaussian with mean 0 and variance $\sigma_i^2$.

Now, let us try to look at what happens to the mean squared error. So, your expected mean
2
square error now, so your new mean squared error is going to be 𝐸[(𝑦𝑛𝑜𝑖𝑠𝑦 − 𝑡) ], where t is

your ground truth or target. So, that is the correct value. So, your expected mean squared error

665
will be given this way, let us expand it out, we know that 𝑦𝑛𝑜𝑖𝑠𝑦 = ∑ 𝑤𝑖𝑥𝑖 + ∑ 𝑤𝑖ϵ𝑖 term that
𝑖 𝑖

2
comes here and then you have the − 𝑡 that comes this way.

Now, by grouping terms this can be written as E[((y − t) + Σ_i w_i ε_i)²]. We use the standard expansion of (a + b)² and get three terms. Keep in mind that the first term, (y − t)², does not need an expectation because there is no random variable, no ε_i, in it, so you can drop the expectation on that first term. The expansion follows from the linearity of expectation.

Now, look at the middle cross term, 2(y − t) E[Σ_i w_i ε_i]. The weights are treated as fixed here, and all the ε_i have mean 0 by the very definition of how ε_i is sampled, which means E[Σ_i w_i ε_i] = 0. So you are left only with the first term and the third term.

For the third term, the expectation can be pulled inside and it can be written as Σ_i w_i² σ_i², because σ_i² is the variance of ε_i. As you can see, this clearly looks like a penalty on the L2 norm of the weights, where the coefficient of the penalty is given by σ_i². In other words, when we add noise to the input, your expected mean squared error, the new loss function, is going to look like your old loss function plus an L2 term.
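
As a quick numerical check of this result, the NumPy sketch below (illustrative only; the weights, input, target and noise level are made-up values) compares a Monte Carlo estimate of the noisy squared error against the clean squared error plus the σ²-weighted L2 penalty.

import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.5, -1.2, 2.0])   # assumed weights of a single linear output
x = np.array([1.0, 0.3, -0.7])   # assumed input
t = 0.25                         # assumed target
sigma = 0.1                      # std of the zero-mean Gaussian noise on each input

y = w @ x
mse_clean = (y - t) ** 2

# Monte Carlo estimate of E[(y_noisy - t)^2]
eps = rng.normal(0.0, sigma, size=(200000, x.size))
y_noisy = (x + eps) @ w
mse_noisy = np.mean((y_noisy - t) ** 2)

# analytical prediction: clean squared error + sigma^2 * sum_i w_i^2 (the L2 term)
print(mse_noisy, mse_clean + sigma ** 2 * np.sum(w ** 2))  # the two should be close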

(Refer Slide Time: 5:55)

Let us move on now to this lecture’s focus which as I said is going to talk about a few different
components on how you can improve the training of neural networks. The first component that
we are going to talk about is activation functions. In the very first lecture of this week, we gave
an introduction to activation functions. We are going to spend some more time now to
understand each activation function and its pros and cons.

We know now that an activation function is a non-linear function that is applied to the output of
every neuron in your neural network. Traditionally people use the same activation function for
the entire neural network or at least a layer. Why do you do this? Why can't you change the
activation function for every neuron in the neural network?

The simple reason for this is to keep things less complex, if you have the same activation for all
neurons in a layer, you can use matrix vector operations directly to make your computations
rather than have to compute values for each neuron separately. And matrix vector operations are
already optimized using linear algebra subroutines and we would like to take advantage of those
fast subroutines to make computations in a neural network.

Let us now try to ask what characteristics must activation functions possess? What do you think?
By now you must know that they must be differentiable, because otherwise you cannot back
propagate gradients. But what else should they possess? What kind of functions can become
activation functions?

We have seen a few so far: the sigmoid, the tanh, ReLU, and so on. Do you see any commonality in the shape of these functions, in the nature of these functions? There are a few important aspects. The functions must be continuous so that you do not have discontinuities where the gradient is not defined.

It must be differentiable because we need to perform back propagation. It must be


non-decreasing too. You do not want an activation function which is something like this, at least
ideally speaking because as the input increases you do not want the output of the activation to be
fluctuating between high and low values, so you ideally want it to be monotonic or
non-decreasing.

And obviously you also want it to be easy to compute. Remember anything hard to compute is
going to make your neural network slower. Common activation functions that we have seen are
sigmoid, hyperbolic tangent, rectified linear unit and so on. Which one do you choose, does any
of these activation functions have a specific effect on training? We are going to see that over the
next few slides.

(Refer Slide Time: 9:18)

Here are some of the activation functions that we visited earlier too. Let us quickly review them
before we look at each of them in more detail. The sigmoid activation function was one of the
earliest activation functions we used to overcome the threshold problem in perceptrons and it is a
logistic function given by this shape.

The hyperbolic tangent is also a logistic function that looks very similar in shape to sigmoid, except that its range is between minus one and one, whereas sigmoid's range is between 0 and 1. The
rectified linear unit developed in 2012 as part of the AlexNet contributions has a form given by
this.

The equation for this graph is max(0, x): if the input is negative, the output is 0; if the input is positive, the output remains the input itself. A variant of ReLU is known as Leaky ReLU, where, when the input is negative, the output is 0.1x; 0.1 is just an example, and this could be any value α which is reasonably close to 0; it is not 0, but it is somewhat close to 0.

There is a particular reason why Leaky ReLU was proposed and we will see that in some time.
Let us see ELU before coming to max out. ELU is Exponential Linear Unit and ELU was
developed because the ReLU activation function actually seems non-continuous, it has a
discontinuity at 0 where it is not differentiable.

In practice people get around this by assuming that the derivative at that particular location is 0,
which is valid because for those of you who are aware of the concept of sub gradients, 0 is a
valid sub gradient for ReLU at that particular point, so 0 is a valid choice, but there are other
gradients possible at that particular location.

Nevertheless, it is a point of discontinuity, and people get around this by assuming the gradient of ReLU at 0 to be 0. If you do not want to make that assumption, ELU is an option. It is a smooth continuous function given by: for any input value greater than 0 it is x, just like what we saw for ReLU, but for input values less than 0 it is α(e^x − 1), which has the shape you see here. As you can see, this now becomes a smooth continuous function with no discontinuities.

Another activation function that was proposed is known as Maxout; it is used occasionally these days but is not as popular. You can look at Maxout as a generalization of ReLU because of the max function there. But it does not take a max over a single neuron's activation; it takes a group of neurons and uses the max value of that group as the output of all of those neurons.

In this mathematical definition, you see it written in terms of two neurons. Remember that this would be the output of two such neurons; let us assume these were two neurons in some layer of the network. Assuming that the weights coming into the first neuron form the vector w_1 and the weights coming into the second neuron form the vector w_2, the output of the first neuron would be w_1^T x + b_1 and the output of the second neuron would be w_2^T x + b_2.

So, now the activation of both of these neurons is given by the max of these two values, it is not
a very traditional activation function, the activation function is not defined on a single neuron but
on a group of neurons. Now, let us try to visit some of these at least in more detail and try to
understand how they can affect the training of neural networks.
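
For reference, here is a small NumPy sketch (not from the lecture slides) of the activation functions just listed; the α values and the Maxout group shape are illustrative choices.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.1):
    # alpha = 0.1 is just one choice for the small negative-side slope
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W, b):
    # W: (k, d), b: (k,) -- the output is the max over a group of k linear units
    return np.max(W @ x + b, axis=0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), leaky_relu(z), elu(z))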

(Refer Slide Time: 14:10)

Let us consider the earliest one, sigmoid, which is given by σ(x) = 1/(1 + e^(−x)); clearly it compresses all inputs to the range (0, 1). The gradient of the sigmoid function, which we hope you have worked out as homework, is ∂σ/∂x = σ(x)(1 − σ(x)); it should follow very easily from this definition of sigmoid.

But while sigmoid activation functions were very popular in the 80s, 90s, they are not used as
often today. Do you know why? Of course, one reason is today you have other activation
functions such as ReLU which was not available in the 80s, 90s and 2000s but there are specific
reasons why sigmoid neurons can create certain problems.

(Refer Slide Time: 15:10)

The one important problem is if you look at the shape of the sigmoid which is somewhat like
this, you see here that when the input is very small or even very large, remember that your input
is along the x axis here, the y axis is your σ(𝑥). When your input is small or large, even a very
large change in the input may result only in a very small change because this function is almost
flat after a certain point.

This is the case whether the input is small or whether the input is large, this could create some
problems for us while training, because even as you keep increasing an input you would find that
it has no impact on the gradient for certain weights in the neural network where you have a
sigmoid, this can slow down training.

(Refer Slide Time: 16:17)

Sigmoid is not zero-centered because its output lies in the range between 0 and 1, which means if you want your activation to treat negative and positive values differently, sigmoid will not be able to do that for you. Let us also see another reason why sigmoid could cause problems. Look at this simple neural network here, or rather a part of a neural network: the inputs to a particular neuron in a particular layer are h_1 and h_2, the weights are w_1 and w_2, the weighted inputs are combined to become the pre-activation a, you apply a sigmoid and then get y; one simple perceptron, or one simple part of a neural network.

Let us analyze how the gradients flow in this part of the network using back propagation and
chain rule, you perhaps know now that if I had to find out the gradient of the loss function L
whatever the loss function was for this neural network ∂𝐿/∂𝑤1 is going to be given by

∂𝐿/∂𝑦 * ∂𝑦/∂σ * ∂σ/∂𝑎 where a was the pre-activation of that neuron before applying the
sigmoid activation and then into ℎ1.

Why h_1? Because h_1 is ∂a/∂w_1, and similarly h_2 appears in the gradient for w_2. Now, if h_1 and h_2 were themselves outputs of sigmoid neurons from a previous layer, they would both be positive. ∂σ/∂a, by the definition of the gradient of the sigmoid function, would also be positive, which means both gradients will assume the sign of the remaining quantity, which is common to both of them.

∂𝐿/∂𝑦 * ∂𝑦/∂σ is common for the gradient of 𝑤1 and 𝑤2, which means if that gradient turned

out to be negative both 𝑤1 and 𝑤2 will have negative gradients, if that turned out to be positive

both w_1 and w_2 will end up having positive gradients. Moreover, this may not hold just for one specific neuron; it could hold for an entire layer where you use a sigmoid activation function, because all of those activations, being outputs of sigmoids, will be positive. This may slow down learning too.

Also, sigmoid is inherently computationally expensive. Why? Because you have to compute an exponential function, and an exponential function, being an infinite series expansion, always takes time to compute at the most basic level.

(Refer Slide Time: 19:42)

What about the hyperbolic tangent? It is given by this equation and this visual form; it compresses all inputs to the range (-1, 1), and the gradient of the tanh function is 1 − tanh²(x), which you may have worked out as homework again. What advantage do you think this has over a sigmoid activation function?

(Refer Slide Time: 20:10)

It is zero centered, which means the outputs of a 𝑡𝑎𝑛ℎ activation function can be both negative
and positive which could help in learning better. However, very similar to sigmoid it also suffers
from some disadvantages, it also has saturation at lower inputs and higher inputs as you can see
from the shape here, any logistic activation function may suffer from the same issues. It is also
computationally expensive because of the presence of the exponential functions.

(Refer Slide Time: 20:52)

Moving on to the rectified linear unit, which, as I mentioned, was popularized as part of AlexNet in 2012: it is the most often used activation function today. Its equation is max(0, x), and here is the graph of the activation function. The gradient is defined to be 0 if x is less than or equal to 0 (at 0 you simply define it to be 0), and if x is greater than 0 the gradient is 1 because f(x) is x itself.

For positive inputs the ReLU activation function is non-saturated which means if your input
keeps increasing or if you give a higher input to a particular neuron, the ReLU will ensure that
that input also gets a higher gradient based on how it contributed to the final loss, which is
considered desirable, you do want a weight to get its reasonable share of the gradient depending
upon how it contributed to the output.

In the AlexNet work in 2012 this was found to accelerate convergence of SGD compared to
sigmoid or 𝑡𝑎𝑛ℎ functions by a factor of six. In other words convergence happened six times
faster using a ReLU activation function as compared to using a sigmoid or a 𝑡𝑎𝑛ℎ activation
function. It is computationally very cheap, all that you need to do in an implementation is
threshold and just move on, if it is greater than 0 just take the value and move on, if it is less than
0 set it to 0 and move on.

(Refer Slide Time: 22:50)

But ReLU has one problem which is known as the dying neuron or a dead neuron problem, if the
input to a ReLU neuron is negative, the output would be 0. Now, let us try to understand how the
gradients would flow in this kind of a situation. Let us assume now that a previous layer's inputs
were given this way with inputs x1, x2, weights w1, w2, a is the pre-activation of that neuron,
you then apply ReLU activation and you get the output y.

If you want to now compute ∂𝐿/∂𝑤1 by chain rule, you have

∂𝐿/∂𝑦 * ∂𝑦/∂𝑅𝑒𝐿𝑈 * ∂𝑅𝑒𝐿𝑈/∂𝑎 * 𝑥1. Because the input a is negative ∂𝑅𝑒𝐿𝑈/∂𝑎 would

straight away become 0 and none of these weights would get any contribution through the ReLU
activation, which means those weights 𝑤1, 𝑤2 may never get updated, they stay the same and

this cycle could continue.

For example, if 𝑤1 and 𝑤2 were negative which means in every cycle it is very likely that a will

keep getting a negative value and because of ReLU there will be no gradient that comes in, 𝑤1

and 𝑤2 will stay negative and this cycle keeps on going and the neuron never participates in the

neural networks output, this is known as the dead neuron problem. And it has been observed
through empirical studies that for a neural network that has a ReLU activation in all its layers,
many neurons end up suffering from the dying neuron problem.

(Refer Slide Time: 24:52)

How do you overcome this? A couple of ways, one is you could initialize your bias weight to a
positive value. How does this help? In case 𝑤1𝑥1 + 𝑤2𝑥2 turns out to be negative, having a

positive bias 𝑏 here may push a towards the non-negative side and this could help ReLU get
activated and the gradients start flowing through this particular neuron.

Another option is to use Leaky ReLU. Now you should be able to appreciate why Leaky ReLU is useful: if you go back and see its shape, we said that it is very similar to ReLU on the positive side, but on the negative side, instead of making the output 0, it lets a small value αx go through, where α could be a small value such as 0.1.

This keeps the output of the ReLU from becoming 0, which would kill the gradient, and lets at least a small gradient pass through the neuron, so it does not stop contributing to the neural network; it at least contributes marginally. The Exponential Linear Unit, or ELU, has the same effect.

(Refer Slide Time: 26:20)

Now, having seen these, you could ask: which one do I choose when I design a neural network? In practice, people most often use ReLU; it just seems to work very well, and it is cheap and fast. But it may be handy to keep monitoring the fraction of dead units in a neural network, neurons that are not contributing, and see if that requires you to change something in your design.
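
One simple way to do such monitoring is sketched below (an illustrative NumPy snippet; the function name and the toy pre-activations are hypothetical). A unit whose pre-activation is non-positive for every sample in a batch outputs zero and passes no gradient back.

import numpy as np

def dead_relu_fraction(pre_activations):
    # pre_activations: (num_samples, num_units) values entering a ReLU layer
    # a unit counts as "dead" for this batch if its input is <= 0 for every sample
    dead = np.all(pre_activations <= 0, axis=0)
    return dead.mean()

rng = np.random.default_rng(0)
a = rng.normal(loc=-3.0, scale=1.0, size=(256, 100))  # a badly shifted layer
print(dead_relu_fraction(a))  # a large fraction suggests rethinking init or bias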

In case ReLU does not work, consider using Leaky ReLU, ELU or Maxout. tanh can be tried, but you may have to expect performance that is not as good as ReLU, depending on the circumstances. Sigmoid is not used much today unless a gating kind of response is required; we will see such networks later in this course, where you want a neuron to have a gating effect, that is, an output lying between 0 and 1. Another such case is the output neuron, for example, in a binary classification problem.

In those scenarios the sigmoid neuron is the default choice, and tanh is also used in certain settings where you want the output of the neuron to be bounded in a certain range. In general, a takeaway here is that you would want an activation function to be in its linear region for as much of training as possible.

What does that mean? If you take the sigmoid activation function, you would want to ensure that the activation function is used mostly where the inputs lie in its linear range. So, you could modify your sigmoid activation function, modify certain coefficients of it, to make this linear region a little wider: instead of lying between specific values, you could make it span a larger range if you knew that is how the inputs in your neural network are going to be.

You can do such things too to improve the performance. In fact, some of these suggestions here
are given in a very nice reference which I will point you to at the end of this lecture.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 30
Improving Training of Neural Networks
(Refer Slide Time: 00:15)

Another important component we are going to talk about here is weight initialization. You
know by now that for a neural network to learn, you must first initialize the weights with
certain values. And then you perform gradient descent to learn. Is this really important? Can I
just initialize all of them with 0’s and go?

It is important because, if you recall the neural network error surface, where you start is critical to which local minimum you will reach. We saw that the error surface can be very complex, which means the outcome depends on which starting point you have, and you may accordingly converge to a particular local minimum.

If you want to get to a better local minimum, you may have to start at a better position initially to be able to reach that solution. There have been recent studies which try to analyse this from other perspectives, but weight initialization in general is considered a very important aspect of designing neural networks.

In fact, to a certain extent, this was a game changer when deep learning started becoming
popular in the first decade of the 21st century. Let us first ask the trivial question: Why do not
we just initialize the neural network weights to 0's and let it learn what it should learn? Is this a good answer? No. In fact, any constant weight initialization scheme will perform poorly.
Why so? Let us take an example and understand this.

Let us consider a neural network with two hidden units. Let us say ReLU activation. And let
us make all biases 0 just for us to understand this and let all the weights be initialized with
some constant α. It could be zero; it could be non-zero; any constant initialization is okay
here.

If we do this and forward propagate an input in the network, the output of both hidden units
in the next layer will be 𝑅𝑒𝐿𝑈(α𝑥1 + α𝑥2) because the weights are the same. Now, why is

this an issue? Which means both hidden units will have the same output. They will have an
identical influence on the cost or the loss, which means they will have identical gradients.

If they have identical gradients, they will have identical updates and the weights will always remain the same. They start the same, the gradients will be the same, and they will keep taking the same values throughout training. This is not desirable. So, we want to do something more than simply initialize the weights with the same values.
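
A tiny NumPy sketch of this symmetry problem (illustrative; α = 0.5 is an arbitrary constant): both hidden units compute exactly the same quantity, so they will always receive the same gradient and the same update.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.array([0.7, 1.3])      # a toy input with two features
alpha = 0.5                   # every weight set to the same constant
W1 = np.full((2, 2), alpha)   # first layer: 2 hidden units, biases = 0

h = relu(W1 @ x)
print(h)  # both hidden units output relu(alpha*x1 + alpha*x2), i.e. the same value

# since their outputs (and hence gradients) are identical, both units get identical
# updates and remain equal forever -- the symmetry is never broken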

(Refer Slide Time: 03:24)

How are weights initialized then? They are generally chosen randomly. Or for example, you
can sample weights from a Gaussian distribution. You have some variety in the weights. Both
very large and small weights can cause activation functions such as sigmoid and tanh to
saturate. So, that is something important to keep in mind when you initialize. Let us see more
specifics on how we can initialize.

(Refer Slide Time: 03:59)

In 2006, weight initialization was one of the major factors in turning around how deep neural
networks were paid attention to. Salakhutdinov and Hinton introduced a method known as
greedy layerwise unsupervised pre-training which is considered to be a significant catalyst in
the training and success of deep neural networks.

For a few years since 2006, the general understanding was this was extremely essential to
make deep neural networks work in practice and led to widespread adoption of deep neural
networks for various applications. However, today it is understood that you do not need this
method. That there are simpler weight initialization methods that you can use and you get
good results. But let us try to understand what this method tried to do.

As the name says, greedy layer wise, unsupervised pre-training. The suggestion was when
you have a deep neural network, before you start training the neural network we are talking
about the weight initialization step here, you take the first two layers and train them using
some unsupervised technique. For example, you could use methods such as Boltzmann
machines or auto encoders. We will see that later to be able to train these two layers in an
unsupervised way.

Using that, you get a set of weights. Now, you freeze those weights and then take the next two layers and train them in an unsupervised manner. Because the first layer's weights are frozen, you can provide inputs, multiply them by those frozen weights, and the results become inputs to the second layer.

So, you take the second and third layers and train them in an unsupervised manner, and you get a certain set of weights. Then you freeze those weights, take the next pair of layers, train them in an unsupervised manner, and keep repeating this process to get an initialization of the weights of the neural network.

(Refer Slide Time: 06:27)

Why is this called so? Greedy because it optimizes each piece independently rather than
jointly. Layer wise because you are doing initialization in a layer wise manner. Unsupervised,
as I just said, is because you use only the data, no labels to be able to train those networks.
Because this is not a formal training process. It is only to initialize the neural network. And it
is called pre-training because this is done before the formal training process starts. As I
mentioned, it is for weight initialization.

So, a question that you may have now is how do you do that unsupervised training between
every pair of layers? You may have to wait for that answer in this course. This is achieved by
networks known as autoencoders or Restricted Boltzmann Machines. Or you could even use
other methods to be able to achieve this. But we will see these examples; we will see
autoencoders and RBMs, a little later in this course.

(Refer Slide Time: 07:35)

But as I mentioned, this got superseded by newer weight initialization methods over the last
decade. Let us try to understand what has been the thought process behind coming up with
newer initialization methods. If you took a neural network such as what you see here, let us
consider that all your inputs have been normalized already. With mean 0 and variance 1. It is
very common these days to normalize your data inputs before providing it as input to any
machine learning algorithm. Let us also assume now that your weights have to have mean 0
and a certain variance. We do not know what that variance should be. That is what we want to
find.

Let us now take one of these neurons; call it a_1 for simplicity. It is given by a_1 = Σ_{i=1}^{n} w_{i1} x_i. Then the variance of a_1 would be Var(a_1) = Σ_{i=1}^{n} Var(w_{i1} x_i).

Assuming now that w and x are not correlated with each other, you can write this as n Var(w) Var(x): the joint covariance goes to 0, and assuming that all the weights (and inputs) have the same variance, the summation turns into the factor n. Let us now try to understand what happens to the variance as we go deeper in the neural network. The variance keeps getting multiplied, and the variance of a pre-activation at the k-th layer would be (n Var(w))^k Var(x). This could blow up or shrink to 0, depending on what the variance of w originally was.

Remember that we already know the variance of x is 1, and we would want the variance of every succeeding layer also to be 1, because that would ensure that even the layers in between stay normalized. So, we would ideally want the variance of a_1 to also be 1. The way to achieve this is to set n Var(w) = 1: the variance of x is 1, and with n Var(w) = 1 the variance of a_1 is also 1, so even the variances of activations in later layers will remain normalized.

With n Var(w) = 1, for a good initialization you can draw weights from a normal (Gaussian) distribution and scale them by 1/√n, where n is the node's fan-in. By fan-in we mean the number of weights coming into a particular neuron.
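
The effect of this fan-in scaling can be seen in a small NumPy experiment (illustrative; purely linear layers are used so that only the variance behaviour is visible).

import numpy as np

rng = np.random.default_rng(0)
n, num_layers = 512, 10
x = rng.standard_normal((n, 1000))  # inputs normalized to mean 0, variance 1

h_naive, h_scaled = x, x
for _ in range(num_layers):
    W_naive = rng.standard_normal((n, n))                # Var(w) = 1
    W_scaled = rng.standard_normal((n, n)) / np.sqrt(n)  # Var(w) = 1/n (fan-in scaling)
    h_naive = W_naive @ h_naive
    h_scaled = W_scaled @ h_scaled

print(h_naive.var())   # explodes roughly like n^num_layers
print(h_scaled.var())  # stays close to 1 across all layers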

(Refer Slide Time: 11:04)

Using this idea, a few weight initialization methods have been developed. The most popular ones today are Xavier's (or Glorot's) initialization, given here, where you uniformly sample from a certain range, and Kaiming initialization, where you again sample uniformly in a given range. These came from two papers published in 2010 and 2015. Let us try to understand at least one of them in detail, as to how that range came about.

Let us now see how Xavier's or Glorot's initialization is derived. We just showed that it is good to have Var(w) = 1/fan_in. But we also have to consider the backward pass during backprop, which affects the gradients: the gradients are distributed based on how the weights in a particular layer are, and this flows through the previous layers, and so on.

So, it may be wise to also consider Var(w) = 1/fan_out, where fan-out is the number of weights going out of a neuron to the next layer. Let us hence go with an average, Var(w) = 2/(fan_in + fan_out). Now, from the statistics of a uniform distribution, we know that for a uniform distribution on the range [m, n] the variance is (n − m)²/12. Which means that if we sample uniformly from [−a, a], the variance is (2a)²/12 = a²/3.

Now, let us put these together. We said that fan_in · Var(w) must be 1; fan-in is the same as the n of the previous slide. We know that the variance of w sampled uniformly from [−a, a] is a²/3. Putting these together, fan_in · a²/3 = 1, or a = √(3/fan_in). And now, considering both the backward pass and the forward pass, you get Xavier's initialization to be Uniform(−√(6/(fan_in + fan_out)), √(6/(fan_in + fan_out))).
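
A minimal sketch of Xavier (Glorot) uniform initialization in NumPy (the fan-in and fan-out values are arbitrary examples); the printed comparison checks that the empirical variance is close to the target 2/(fan_in + fan_out).

import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # sample from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_out, fan_in))

W = xavier_uniform(fan_in=256, fan_out=128)
print(W.var(), 2.0 / (256 + 128))  # empirical variance vs the intended variance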

Kaiming initialization is built on top of this. But that is going to be homework for you to
work out as to how you get this as another possible initialization. You can by looking at it say
that the ideas have to be similar, but there are some minor changes. You can also look at the
paper that proposed this, which was in the footnote on the previous slide, to be able to do this
homework.

(Refer Slide Time: 14:18)

Let us move to the last component of this lecture that we are going to talk about which is an
important development that came about in 2015 known as batch normalization. Covariate
shift is a problem that in machine learning refers to a change in input distribution between
training and test scenarios. This generally causes problems because the model needs to adapt
to a new distribution. This could also happen even during the training process itself between
the distribution at one stage and the distribution at another stage.

In a neural network, this issue can show up as internal covariate shift, where the input distribution changes across the layers. At the first layer we could normalize the data, so its inputs follow a certain distribution; depending on the weights of the first layer, the second layer would receive a different distribution, the third layer yet another distribution based on those weights, and so on. The neural network has to learn to handle these changing distributions. Can we do something to address this?

We know that a network trains well when its inputs are whitened or normalized, that is, linearly transformed to have zero mean and unit variance. Can we now do something to have a similar effect in each layer? Can we make each layer's activations also unit Gaussian? That is not entirely in our hands because it depends on the weights. But how do we achieve such an effect? The answer is that we can do this by considering a mini-batch.

Since we train neural networks using mini-batch SGD, we could compute the mean and variance of a layer's outputs over a mini-batch, and then subtract the mean and divide by the standard deviation to normalize them. To do this, the method known as batch normalization, proposed in 2015, introduces two additional learnable parameters per layer, γ^(k) and β^(k), where the superscript k denotes the specific layer in which you introduce these parameters, since you do this layer-wise; setting γ^(k) = √(Var(x^(k))) and β^(k) = E[x^(k)] would simply recover the original activations.

(Refer Slide Time: 16:59)

Let us look at how this batch normalization works. Batch normalization is often introduced as a layer in itself that succeeds one of the existing layers of a neural network. So, let us consider this: after a certain layer you have certain output values. Let the mean of all of them, across the mini-batch that you are propagating in a certain SGD iteration, be μ_B.

Similarly, you can compute the mini-batch variance across those values, for the layer you are considering, to be σ_B². We then normalize: x̂_i = (x_i − μ_B)/√(σ_B² + ε).

What we do then is to say that the output of this batch normalization layer is y_i = γ x̂_i + β, where γ and β are learned.

How does this help? Once you learn γ and β, in a sense the output of that layer is renormalized: the new, so-called mean would be β and γ would be the standard deviation of this new distribution. By learning γ and β, you are asking the neural network to normalize each layer the way it wants to; you do not need to do standard normalization, that is, zero mean and unit variance. Choose the mean, choose the variance that helps you perform best, and let the neural network learn γ and β.

The Batch Norm layer is generally inserted before you apply your non-linearity. So, you have
a layer which receives inputs from the previous layer, you do batch normalization, and then
apply your non-linearity. This is typically the way it is implemented. And this has given good
empirical results. Batch normalization allows higher learning rates. Why?

One of the reasons to keep your learning rate low was to avoid the weights and activations in the neural network exploding, or, on the other side, vanishing. But with batch normalization, higher learning rates become possible: by ensuring that the activations are controlled, the learning rate can be increased, because the learned γ and β take care of keeping the network from exploding.

It reduces the strong dependence on initialization, because now, irrespective of the initialization, the activations can be controlled using γ and β in every successive iteration of SGD. It also acts as a form of regularization: whatever activations come out of a layer are normalized with mini-batch statistics and then scaled and shifted by γ and β, which has the sense of adding noise to those activations, and that acts as the regularizer.

An important concern with batch normalization is that at test time, your data may not be forward propagated in batches; you may want to predict on only a single point. Then you do not have a mini-batch mean and variance at test time. How do you handle it? You choose μ_B and σ_B² based on your training data, maybe from the last few training batches, or you keep a running average of μ_B and σ_B² over the training iterations, and you use those values at testing.
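
Putting these pieces together, here is a simplified batch-normalization layer in NumPy (a sketch under assumptions, not a production implementation; the class name and momentum value are arbitrary choices), including the running averages used at test time.

import numpy as np

class BatchNorm1D:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(dim)   # learned scale
        self.beta = np.zeros(dim)   # learned shift
        self.eps = eps
        self.momentum = momentum
        self.running_mean = np.zeros(dim)  # statistics used at test time
        self.running_var = np.ones(dim)

    def forward(self, x, training=True):
        # x: (batch_size, dim) pre-activations of one layer
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # keep running averages of the batch statistics for test time
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(dim=4)
batch = np.random.default_rng(0).normal(3.0, 2.0, size=(32, 4))
print(bn.forward(batch, training=True).mean(axis=0))  # roughly 0, since beta = 0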

(Refer Slide Time: 21:15)

That concludes our discussion of batch normalization. Your recommended readings for this
lecture are chapter 8, section 8.7 of the deep learning book. And a very nice article called
Efficient Backprop by Yann LeCun. It is a very old article. Some of it is not relevant to deep
neural networks but some of it has very nice insights on efficient training of neural networks.
I would recommend you to read it.

Your exercises: visit the link here, which has some nice animations of how a neural network behaves with different initializations; it is a simple "click and try" exercise for you. The second exercise is to prove that batch normalization is fully differentiable. Can you find out how you would compute gradients for the Batch Norm layer? By looking at it, you can see that it is simply a multiplicative factor and an additive factor, which should be differentiable. But how would you get ∂L/∂γ and ∂L/∂β? Please work it out. And, as we already left behind, derive Kaiming He's initialization.

(Refer Slide Time: 22:32)

Here are some references.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 31
Convolutional Neural Networks: An Introduction Part 01
Continuing our discussion, we will now move on from simple feed forward neural networks into
the building blocks of deep learning for computer vision, which are Convolution Neural
Networks.

(Refer Slide Time: 00:35)

Before we start that discussion, we will review the homework or the exercise that we left behind
in the last lecture of the previous week. One of the questions there was, how do you backprop
across the batch normalization layer. So, what you see here is the algorithm of batch
normalization that we saw last time. Let us quickly review that in terms of how it looks on this
graph, what is known as the computational graph. So, you have forward propagation which is
given by, you are given x1 and x2.

Given a mini-batch of data, you find the mean and variance of x, the activations out of a particular layer. Then you normalize that data using the mean and variance. Once you normalize, you renormalize them by scaling and shifting; that is, you adopt a scale γ and a shift β that are suitable for the problem.

This is the main idea of batch normalization and that is represented by this graph here. The
forward prop as we can see, is quite straightforward and we just follow this algorithm. The
question here is, how do we do the backprop. Let us try to work through the different terms here
and understand this.

(Refer Slide Time: 02:04)

The first term we have to find is ∂L/∂β; it is one of the simplest terms to evaluate, so we will start with it. To recall, let us go back and see where β appears. Note, from the last line of the algorithm, y_i = γ x̂_i + β. Keeping that in mind, this is the main equation to keep in mind for the next few slides. You can treat β almost as an input into y_i. So ∂L/∂β = ∂L/∂y_1 · ∂y_1/∂β + ∂L/∂y_2 · ∂y_2/∂β. Since ∂y_i/∂β = 1, this simplifies to ∂L/∂β = ∂L/∂y_1 + ∂L/∂y_2.

Let us move on to the next one. Similarly, consider ∂L/∂γ; remember γ is the constant that multiplies x̂_i. So ∂L/∂γ = ∂L/∂y_1 · ∂y_1/∂γ + ∂L/∂y_2 · ∂y_2/∂γ, and so on for all the y terms. And ∂y_1/∂γ, as you can see, is x̂_1 by the very definition of these terms, so that term gets replaced by x̂_1. You have ∂L/∂y_1 · x̂_1 + ∂L/∂y_2 · x̂_2, which gets summarized as the summation ∂L/∂γ = Σ_i ∂L/∂y_i · x̂_i.

Let us move on to the next term, ∂L/∂x̂_1. x̂_1 is not a parameter, but we need to compute this gradient so as to propagate it to earlier parameters in the network. ∂L/∂x̂_1 = ∂L/∂y_1 · ∂y_1/∂x̂_1, and ∂y_1/∂x̂_1 = γ as we just saw, which is what we put in here. Similarly, you would be able to calculate ∂L/∂x̂_2.

(Refer Slide Time: 04:45)

Let us move on to the next term, ∂L/∂σ². We again need this because the gradient has to pass through it in order to reach ∂L/∂x_i, which is an earlier term, and all the other gradients in the previous layers. ∂L/∂σ² is given by ∂L/∂x̂_1, which we already computed on the previous slide, times ∂x̂_1/∂σ², which is the new term we need to compute. Similarly, we need to do this for all the x terms, so you also have ∂L/∂x̂_2 · ∂x̂_2/∂σ², and it is written as a summation: ∂L/∂σ² = Σ_i ∂L/∂x̂_i · ∂x̂_i/∂σ².

The main term we need to evaluate here is ∂x̂_i/∂σ², which comes from the form of x̂_i itself. Let us go back a couple of slides and see what that is: x̂_i is given by (x_i − μ)/√(σ² + ε). So ∂x̂_i/∂σ² just follows from differentiating this term, which gives (x_i − μ) · (−1/2) · (σ² + ε)^(−3/2).
(σ 2 + ε)−3/2 .

699
(Refer Slide Time: 06:05)

∂L ∂L ∂xˆ1 ∂L ∂xˆ2 ∂L ∂σ 2
Similarly, we have ∂μ = ∂xˆ1 ∂μ + ∂xˆ2 ∂μ + ∂σ 2 ∂μ So, the first two terms get absorbed into a
∂xˆi
common summation and the last term stays as it is. ∂μ
is given by this gradient here; this once
∂σ 2
again follows directly from the definition of xˆi to be (xi − μ)/√σ 2 + ε . And the term here ∂μ

gets written in terms of x's and μ ’s.

Because σ 2 , once again by the same expression can be rewritten in terms of all the x's and the
corresponding μ , which is to repeat that. Remember that if we go back to the algorithm from the

700
same boxed equation here, you could take sigma b squared + epsilon the other side and get xi hat
as the denominator and you would do this for xˆ1 , xˆ2 and all of that.

So, doh sigma doh μ by sigma square becomes a summation of all of these gradients and that is
what we get here. Each of this is one gradient with respect to one of those inputs x1. Similarly,
this would be the gradient with respect to x2, and so on and so forth. And that is how you get
your doh L by doh μ .

(Refer Slide Time: 07:47)

701
Finally, ∂L/∂x_1, the last gradient you need in order to backpropagate to earlier layers, is given by ∂L/∂x_1 = ∂L/∂x̂_1 · ∂x̂_1/∂x_1 + ∂L/∂σ² · ∂σ²/∂x_1 + ∂L/∂μ · ∂μ/∂x_1. The first factor carries over as it is; ∂x̂_1/∂x_1, once again by the definition of x̂_1, gives 1/√(σ² + ε). ∂σ²/∂x_1 comes from the expansion of σ² with respect to x_1; only the term containing x_1 contributes (treating μ as fixed, since its dependence on x_1 is handled by the last chain-rule term), and you get 2(x_1 − μ)/m. And finally, ∂μ/∂x_1 is simply 1/2 here, because μ was defined as the average of the two inputs, which is how it is connected to x_1.

Similarly, you would compute ∂L/∂x_2, and so on for all the ∂L/∂x_i's. That is how backprop is done for batch normalization. A takeaway at this stage: for any new parameters you introduce in a neural network, carefully working out the chain rule for every term involved takes care of how you compute the gradients and backpropagate in that scenario. All you need to ensure is that you have only differentiable components in all your new additions to a neural network.
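
For completeness, the gradients derived above can be collected into a single backward-pass routine. The NumPy sketch below is an illustration written for a mini-batch of m samples rather than just x1 and x2; it follows exactly these chain-rule terms.

import numpy as np

def batchnorm_backward(dL_dy, x, gamma, eps=1e-5):
    # dL_dy: (m, d) upstream gradient, x: (m, d) inputs to the batch norm layer
    m = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)

    dL_dbeta = dL_dy.sum(axis=0)             # dL/dbeta  = sum_i dL/dy_i
    dL_dgamma = (dL_dy * x_hat).sum(axis=0)  # dL/dgamma = sum_i dL/dy_i * x_hat_i
    dL_dxhat = dL_dy * gamma                 # dL/dx_hat_i = dL/dy_i * gamma

    dL_dvar = np.sum(dL_dxhat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
    dL_dmu = np.sum(-dL_dxhat / np.sqrt(var + eps), axis=0) \
             + dL_dvar * np.mean(-2.0 * (x - mu), axis=0)
    dL_dx = dL_dxhat / np.sqrt(var + eps) \
            + dL_dvar * 2.0 * (x - mu) / m \
            + dL_dmu / m
    return dL_dx, dL_dgamma, dL_dbeta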

(Refer Slide Time: 09:26)

With that, let us move on to convolutional neural networks. And this lecture is largely based on
the excellent lectures of Mitesh Khapra, from IIT Madras.

(Refer Slide Time: 09:37)

Let us start with reviewing convolution. So, we did do it in the first week of this course. We
know now that convolution is a mathematical way of combining two signals to form a third
signal. We also know that it is an extremely important technique in signal processing in general,
images are 2D signals. But convolution is also used for other kinds of signals such as speech or
accelerometer data, time series data, so on and so forth.

In the case of 2D data, let us assume the simple case of grayscale images. The convolution operation between a filter W, whose dimension is k x k, and an image X, whose dimension is n1 x n2, is expressed as y[i, j] = Σ_{u=-k}^{k} Σ_{v=-k}^{k} W[u, v] X[i-u, j-v], where W is the filter and X is the input image. We have already seen this as the definition of convolution.

(Refer Slide Time: 10:46)

More generally, we can also write convolution in the form you see here, where the indices go from ⌊-k1/2⌋ to ⌊k1/2⌋ and from ⌊-k2/2⌋ to ⌊k2/2⌋. There is a particular reason we are talking about this: if we write the equation of convolution this way, that is the only difference, otherwise the rest of it is more or less the same. Obviously, the indices of W change because of how we wrote the summation.

All that we are saying in this definition is that, we are going to center the filter on the pixel
around which we are computing the convolution. We already did that earlier, too. But we are just
writing that formally here. So, for different parts in the course, based on what we want to
establish, we may write the definition of convolution slightly differently. But you can see that, at the heart of it, it is the same equation; you have the W indices and X indices defined this way, and you get the output at a particular pixel.
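
A direct, unoptimized NumPy implementation of this centered 2D convolution (a sketch for illustration; real libraries use far faster routines, and deep learning frameworks usually skip the kernel flip, which makes no difference when the filter is learned).

import numpy as np

def conv2d(X, W):
    # computes y[i, j] = sum_{u, v} W[u, v] * X[i - u, j - v]
    # with the k x k filter (k odd) centered on pixel (i, j);
    # only interior pixels are computed here, padding is discussed separately
    k = W.shape[0]
    r = k // 2
    n1, n2 = X.shape
    Y = np.zeros((n1, n2))
    for i in range(r, n1 - r):
        for j in range(r, n2 - r):
            patch = X[i - r:i + r + 1, j - r:j + r + 1]
            # flipping the patch realizes the X[i-u, j-v] indexing of true convolution
            Y[i, j] = np.sum(W * patch[::-1, ::-1])
    return Y

img = np.random.default_rng(0).random((7, 7))
print(conv2d(img, np.ones((3, 3)) / 9.0).shape)  # (7, 7); border entries stay 0 here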

(Refer Slide Time: 11:58)

We know from what we have seen so far that in the 1D case, when you do convolution, you slide a 1-dimensional filter over a 1-dimensional input. Similarly, in the 2D case, you slide a 2-dimensional filter, such as a Sobel edge filter, over a 2-dimensional input such as an image. Let us take this further: what would happen in a 3D case? Why do we even need a 3D case, is an image not 2D? An image is 2D, but when you get to color images, you have 3 channels, which become a third dimension, if you want to convolve with them together in one go. So, let us try to understand how convolution would look in the 3D case.

(Refer Slide Time: 12:51)

Let us try to understand what a 3D filter would look like. Let us assume now that our input will
be a color image, which has 3 channels R, G, and B which makes the input now a volume. And
what would a 3D filter look like, it would also be a volume. And we are going to refer to that as a
volume going forward. So, when you now slide the volume over your input volume and perform
the convolution operation, you will be moving it across the input volume at different locations,
and getting 1 scalar output as the output of the convolution in the output image.

As you can see, as you move the filter volume to different locations in the input volume, you get the outputs as pixels in the output image. We also call this output image a feature map in convolutional neural networks. One assumption we make here is that the depth of this filter is the
same as the depth of your input volume and this is important. Why is it important, because the
depth of the filter is the same as the depth of the input volume.

In effect, we are doing a 2D convolution on a 3D input. Why do we say that, we slide along the
height and the width of the input, but we do not slide along the depth, because the depth of the
filter is the same as the depth of your input volume. So, your output of this operation will also be
a 2D image. When you did 1D convolution with a 1D filter on a 1D signal, you got a 1D output.
When you took a 2D filter on a 2D input and image you got a 2D output.

Now we are taking a 3D filter on a 3D input and our output now is not 3D, but 2D. And why is
that so? Because the depth of the filter is the same as the depth of the input. Then can you not get a volume, a 3D output, here? You can, even with this operation, by defining multiple filters. If you define multiple such filter volumes, for each filter volume you would get one such output, one such feature map. And by using multiple filters, you would get multiple such feature maps, which together form a volume.
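
The NumPy sketch below (illustrative shapes and names) shows this: each of the K filter volumes has the same depth as the input and slides only along the height and width, producing one 2D feature map per filter.

import numpy as np

def conv3d_filters(X, filters):
    # X: (H, W, D) input volume, e.g. an RGB image with D = 3
    # filters: (K, F, F, D) -- K filter volumes whose depth equals the input depth
    # returns an output volume of K stacked 2D feature maps (no padding, stride 1)
    H, W_in, D = X.shape
    K, F, _, _ = filters.shape
    H_out, W_out = H - F + 1, W_in - F + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):
        for i in range(H_out):
            for j in range(W_out):
                # slide only along height and width; the depth is consumed entirely
                out[i, j, k] = np.sum(X[i:i + F, j:j + F, :] * filters[k])
    return out

rng = np.random.default_rng(0)
maps = conv3d_filters(rng.standard_normal((7, 7, 3)), rng.standard_normal((4, 3, 3, 3)))
print(maps.shape)  # (5, 5, 4): one feature map per filter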

(Refer Slide Time: 15:48)

Let us now try to understand the dimensions. We know that even when we did convolution, we
had some issues around the edges of the image where convolution cannot be performed and we
use techniques such as padding, let us revisit some of those in the context of convolution the way
we see it for color images.

Let us assume now that our input dimensions are W1 x H1 x D1, with width, height and depth. Each filter has a spatial extent F x F, and its depth is the same as D1, because that is the depth of the input volume. Let us also assume that the output dimensions are W2 x H2 x D2. So, we are now allowing the output to have some depth D2; if you use only 1 filter, D2 will be 1, and you can increase D2 by increasing the number of filters you have. Let us try to work out a formula for W2 and H2 in this context.

We will also talk about a quantity called stride, which decides how much you overlap successive
convolutions on different parts of the image. And we will also talk about the number of filters in
deciding the size of this output volume or output feature maps.

(Refer Slide Time: 17:26)

Let us take it one by one. Let us first start with trying to compute the dimensions W​2 and H2 of
the output. So, you have your standard convolution, where you place your filter on a particular
location in your input image, and you convolve the filter with the input at that location. You get one single scalar output at the centered pixel location in your output, and as you move the filter along, you keep getting different pixels as output in the output feature map.

Let us try to understand, what happens if you place the kernel at the corner. So, remember that if
you place the kernel at the corners, you do not have inputs to compute the convolution of. So,
this is a problem with all these shaded regions here, depending on the size of the filter. So, this
results in an output, which is of smaller dimensions than the input. We have seen this earlier with
convolution. But we are revisiting this, to find out what would be the size of W2 and H2. As the
size of the filter, or the kernel increases, you can make out that even more pixels along the
boundary are going to get affected.

And for example, if you take a 5 x 5 kernel now, as you slide this 5 x 5 kernel across, you are going to get an even smaller output, because more border pixels are affected: you will not be able to convolve at positions where the kernel goes outside the boundary. So, you would have an entire region along the boundary where you cannot compute the convolution, and your output in this case would be a 3 x 3 matrix.

We have seen this before. In general, we can say that with this restricted setup, W2 = W1 - F + 1 and H2 = H1 - F + 1. So, if your input was 7 x 7 (W1 and H1) and F was 5, you have 7 - 5 + 1 = 3, and the same for H2, which makes your output a 3 x 3 matrix.

(Refer Slide Time: 20:02)

We saw earlier, that if we want our output to be the same size as the input, we do what is known
as padding. In the first week, we also talked about a few different ways of doing padding. The
simplest way to do padding is to add zeros along multiple rows and columns outside the image,
so as to extend the original dimensions of the image.

So that way you can convolve, even the boundary pixels. Now, you would have something like
this. So, your 7 by 7 becomes a 9 by 9. And now you can place your kernel, even at the corners,
as you convolve your 3 x 3 filter on this image, and your output remains 7 x 7 as your input
image. We now have W2 = W1 - F + 2P + 1, where P is the amount of padding you have done. And it is 2P because you have to pad along both ends of the columns and both ends of the rows; that is the reason you need 2P. We are going to further refine this formula.

(Refer Slide Time: 21:26)

We are not done yet. And that is where we bring in the concept of a stride. When we convolve, a
filter over an input, we can also define the interval at which the filter is applied. When you move
the filter from 1 location on the input to the next location, you can decide whether you would
take it to the immediate next pixel, or whether you want to jump 2 steps and go to 2 pixels later
and that decides what your stride S.

So, if you went to the immediate pixel, your stride is 1, that is the standard convolution. If you
went 2 pixels further, your stride is 2 and that is where stride comes into the picture. So, let us
see an example. In this example, S is equal to 2 or stride is equal to 2. So, you can see here that
when you took your first 3 x 3 window, and the next 3 x 3 window, you skipped 1 step in
between, and then went forward. And let us see these skips a bit more carefully, you can see
those skips again. It is not only along the columns, it is also along the rows, you also skip pixels
along the rows, the stride happens in both directions.

And this is another way in which the size of your output image gets reduced. Your new formula with the stride included becomes W2 = ((W1 - F + 2P)/S) + 1, where the (W1 - F + 2P) part is what we had earlier and S is the stride. Since the stride along the columns and the rows is the same, you use the same S. Technically speaking, the strides could be different between the rows and the columns, but in practice the same value is generally used. Are we done with finding the values of W2 and H2? Not yet; there is one more detail left.

(Refer Slide Time: 23:29)

And that detail is the depth of the output. As we already mentioned, the depth of the output
comes from using multiple feature maps or multiple filters. Each filter gives us one 2D output
and K such filters will give us K such 2D outputs, as you can see now in the output region. We
are saying now that the depth D2 is K, that is, the number of filters you used. For each filter, you would completely convolve that filter volume with the input volume and get one 2D image.

Then you take another filter; you could imagine taking a vertical Sobel filter and a horizontal Sobel filter, which would be 2 different filters. You run both of them, and you get 2 images as output. You simply place them one after the other to form a volume. So, D2, the depth of the output, would be your number of filters, K.

(Refer Slide Time: 24:33)

Let us try to work out an exercise so you are clear about this. Let us consider this input to be 227
x 227 x 3. Let us assume that your filter size is 11 x 11 x 3. Let us assume that we are going to
use 96 such filters. These numbers have a purpose which will probably be clear to you a few, a
couple of lectures down but take these numbers for now as they are, you have 96 such filters. Let
us assume a stride of 4 and a padding of 0, no padding.

We already know the formula W2 = ((W1 - F + 2P)/S) + 1, and similarly for H2. Can we now find out what D2, W2 and H2 are? Let us start with the simplest one: what is D2? D2, as we said, is the number of filters, which means D2 is 96.

(Refer Slide Time: 25:40)

Let us then go to H​2, H2 was given by H2 = ((227-11)/4)+1 = 55. And we get a similar value for
W2. That is how you can get the output, the dimensions of the feature map. One important thing
to keep in mind here is when you embed these kinds of operations in a neural network. Unlike a
feed forward neural network, where you would specify how many neurons you want in a first
hidden layer, here, the size of that next layer is automatically decided by the size of the filter and
these choices that you make.

In a feed-forward neural network, the number of weights gets decided by the number of hidden layers and the neurons in each hidden layer. Here, it is the reverse: you decide the number of weights, that is, the filters, and the size of the output map gets decided by them. We will explain this in more detail as we go forward.
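
These output dimensions can be computed with a small helper function; the sketch below (illustrative) reproduces the worked example from this lecture.

def conv_output_size(W1, H1, F, P, S, K):
    # W2 = (W1 - F + 2P)/S + 1, H2 = (H1 - F + 2P)/S + 1, D2 = number of filters K
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K

# the worked example: 227 x 227 x 3 input, 11 x 11 x 3 filters, 96 filters,
# stride 4, no padding
print(conv_output_size(227, 227, 11, 0, 4, 96))  # (55, 55, 96)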

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 32
Convolutional Neural Networks: An Introduction - Part 02

(Refer Slide Time: 0:14)

Having reviewed convolution so far, let us try to ask the question, what is the connection
between convolution and neural networks? If we want to use it on images, can’t we use feed-
forward neural networks? We will try to understand this more carefully by taking the example of
image classification.

(Refer Slide Time: 0:44)

Let us try to understand how traditional machine learning was done earlier in computer vision. Given an image, traditional ML-based computer vision solutions for image classification would extract what are known as handcrafted features from images, examples being SIFT, LBP, HoG, and so forth. Why are they called handcrafted? Because we came up with a procedure to extract those features using some thought process and heuristics. These were effective for quite a while.

But this became a bottleneck in scaling the solutions to more datasets and to more real-world images. One would be given an image; you take the raw pixels, or extract edges, or extract features such as SIFT and HoG, or many others. This was known as static feature engineering, where there was no learning involved at all. These features were then provided to a classifier, such as a neural network, to finally give the output, say "monument" in this case.

(Refer Slide Time: 2:05)

The question that we try to ask now is, instead of those handcrafted kernels, for edge detection
for SIFT or any other method, can we learn meaningful kernels or filters to solve the problem? In
addition to learning the weights or the parameters of the classifier? In other words, if you have an edge detector, such as a Laplacian of Gaussian, which has certain weights in its kernel, or any other kernel for that matter, can we automatically learn these weights as part of the learning algorithm, rather than separating the two steps, with the first step hand-engineered and only the second step being a learning step? That is one of the first questions we ask.

We could go even further and ask: can we learn multiple meaningful filters or kernels, in addition
to learning the weights of the classifier? Why learn just one? If we are learning, we may as well
learn many of them.

738
(Refer Slide Time: 3:09)

And in addition to that, can we also learn multiple layers of meaningful kernels and filters? We
now know that convolution has some nice mathematical properties, and that you can have a
composition of convolution operations across different layers of a neural network. So, can we
use that in some way to learn multiple layers of meaningful kernels or filters,
before we learn the weights of the classifier, or along with learning the weights of the classifier?
This is possible, and it is possible by simply treating the kernel
coefficients as parameters and learning them, in addition to learning the weights of the final
classifier.

So, you would have an input, you would have a convolution kernel that gives
you a certain output map, which becomes the second layer; you could have another convolution
kernel, which would give you the feature map at the third layer.
This can then be used as input to a final layer of fully connected neurons, which gives us the
final output. And we can use backpropagation to update these filter coefficients at each
intermediate step. This network is called a convolutional neural network.

Why is that so? Because in each step, instead of taking a fully-connected bipartite graph to
connect one layer to the next layer, we define a set of weights that convolve on
the input image to give an output feature map, and we learn those weights using
backpropagation.

739
(Refer Slide Time: 5:02)

Let us pause here and ask 1 or 2 questions again. Learning filters seems interesting, but
why not use flattened images with fully connected neural networks? Why do we want to
complicate our lives and learn the filters? Why can we not just take the raw pixels of the input
image? The neural network is anyway going to learn the weights it should; why should there be
a convolution operation in between? Why can the neural network not simply
take the pixels of the image, flatten them out, take them as input,
and let the layers learn whatever they should? Why can we not do this?

(Refer Slide Time: 5:48)

740
There are a few reasons; let us try to evaluate them. Take even a simple data set, such as
what is known as the MNIST data set. The MNIST
data set, as we mentioned earlier, is a data set of handwritten digits, which was
used by the United States Postal Service in the 90s to automatically classify mail, based on zip
code recognition from images.

So, if you use a feed-forward neural network without convolution, or without learning the filters,
on a data set such as MNIST, you get fairly good performance, close to about 98
percent, with a simple feed-forward network. But there are certain limitations, and
let us try to analyze them. The first limitation is that this method ignores spatial correlations, or the
structures, in images.

By taking pixels and flattening all of them into a single vector, we have removed the spatial
relationships that exist between different parts or corners of the image. Spatially
separate pixels are treated the same way as adjacent pixels, so we seem to be losing some important
information that characterizes images. That is the first concern.

The second concern is that there is no obvious way for such networks to learn the same features at
different places in the input image. Unlike a few other machine learning problems, in images one
is expected to recognize a cat in the image whether the cat is located at the top left or the
bottom right. If we simply flatten an image into a vector, a cat at the top left would have a
very different vector representation, because the dimensions in which the cat exists would be very
different from a flattened vector where the cat is at the bottom right.

We are not enforcing the network to learn that a cat is a cat, irrespective of where it is in the
image. And lastly, it can get computationally very expensive if you take an image, flatten it,
and use only a fully connected feed-forward neural network. Why is that so? Let us take an
example. If you had a 1-megapixel color image with 20 neurons in the first hidden layer, how
many weights would you have in the first layer?

Think about it. The answer is 60 million. 1 megapixel is 10^6 pixels; a color image has 3
channels, so that is 3 million input values. With 20 neurons, a fully connected graph is going to end up
with 60 million weights in only the first layer. You now have to add layers, probably, for getting
better performance, and that is going to lead to an explosion in the number of weights.

741
(Refer Slide Time: 9:12)

So, how do we overcome this? We overcome this by using the idea of convolution, where we
know that a filter only operates on a local neighborhood in the original image. We call these
local receptive fields; rather, the region of the input which is used for convolution is typically
known as the receptive field. That receptive field is a local part of the image, so each hidden
unit of the next layer is connected to only 1 local part of the input image.

This serves a few purposes. Firstly, it captures spatial relationships, because if you do
convolution of a filter with a patch, you are capturing the 2D spatial relationships in that region;
such relationships may not be captured effectively by feed-forward neural networks. Secondly, it
greatly reduces the number of parameters in the model, because now you only need to connect to
a local region rather than have a fully connected graph where every neuron in the first
layer is connected to every neuron in the second layer.

So, if we take the same example, a 1-megapixel color image, and let us assume
now that you have a filter of size K1xK2 that you want to learn in that first hidden layer, how
many weights would you have now? The number of weights would simply be the size of the
filter itself. As we said, when you use convolution in a neural network, the filter coefficients
themselves become the weights. So in this case, the number of weights would be K1xK2,
compared with the 60 million that we talked about for feedforward networks on the previous slide.
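
A rough back-of-the-envelope comparison in Python, using the numbers from the text (the filter size K1 = K2 = 11 below is only an illustrative assumption):

pixels = 10**6 * 3                       # 1 megapixel, 3 colour channels
hidden_neurons = 20
fc_weights = pixels * hidden_neurons     # 60,000,000 weights in a fully connected first layer
K1, K2 = 11, 11                          # an example filter size (assumption)
conv_weights = K1 * K2                   # 121 shared weights, reused at every image location
print(fc_weights, conv_weights)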

742
Secondly, convolutional neural networks introduce a concept known as weight sharing.
Weight sharing means that whether you convolve a filter with the top left or the bottom right of the image,
you use the same values in the filter; you do not change the filter weights between those
positions, which is what you would effectively have done if you had flattened the image and used a
fully connected layer to estimate the weights in that particular layer. This has 2 purposes too.

Firstly, as we just mentioned, it enables translation invariance of the neural network to objects in
images. Whether the cat is at the top left or the bottom right, the same
weights operate on it, so the network would recognize a cat in whichever part of the
image it appears. Secondly, it reduces the number of parameters in the model, as we just saw. A third thing
that CNNs do is known as pooling. Pooling serves to condense information from the
previous layer, and this also serves 2 purposes.

One, by condensing information, it aggregates information, so that minor variations get
averaged out into a common output in the next layer. Secondly, pooling
subsamples the original image into a smaller image in the next layer. By doing that, when
convolution is applied on that next layer, the number of convolutions you have to do
reduces, which reduces the number of computations in later layers of the CNN.

743
(Refer Slide Time: 13:11)

Let us look at each of these in more detail now. To recap, CNNs have local receptive fields,
weight sharing, and pooling; let us review each of them in sequence, starting with local receptive
fields. If we used a regular feed-forward neural network, it would look something like this:
your input is these red neurons, and here your input is a handwritten digit image.

Because there are 16 pixels here, you have 16 input neurons. The blue squares here are the
neurons in the intermediate layers. Finally, in the output layer, you have 10 green neurons, which
correspond to the 10 digits. You can see that every input neuron is connected to each neuron in
the first hidden layer, so that is a dense connection; and if every neuron in the first hidden layer
is connected to all the input pixels, and so on, this becomes a very dense collection of weights, as
we just mentioned. Let us contrast this with what happens if we do convolution.

744
(Refer Slide Time: 14:27)

If you observe this figure, once again the input is 16 input neurons, because that is the
size of the input image. But now, only pixels 1, 2, 5, and 6, these 4 neurons here, are the ones
that contribute to the first convolution; pixels 3, 4, 7, and 8 do not contribute to that first
convolution. So, only pixels 1, 2, 5 and 6 contribute to the computation of h11, which is the first
pixel in the hidden layer.

Similarly, as you move forward, you have a similar number of operations for each of the pixels
in the hidden layer. How does this help? The connections now are much sparser. This sparse
connectivity automatically reduces the number of weights you have to learn, as we just
mentioned. Importantly, we are also taking advantage of the structure of the image. As you can
see, it is 1, 2, 5, and 6 that participate in the h11 output, and 1, 2, 5, 6 are
spatially correlated, as compared to, say, 1, 2, 3, 4, 7, 8. That is another advantage of the
local receptive field.

745
(Refer Slide Time: 15:46)

One question you could ask here is: is sparse connectivity really a good thing? Would it not
be better to have full connections, assuming we have infinite compute and storage
bandwidth, and let the neural network learn what it should? By having sparse connections, are we
not disallowing certain pixels from interacting and contributing to the output of the next layer? The
answer is: not really; we do not lose information through this process. As we go through
the depth of such a neural network, look at the 2 highlighted neurons x1 and x5.

They may not interact when we consider the first hidden layer, but they may start
interacting in later hidden layers indirectly: x1, x2, x3 contribute to h2; h4 gets input from x5;
and h2 and h4 both contribute in the successive layer, so indirectly x1 and x5 interact over the
depth of the neural network. So, we are not losing out on this
information, as long as we have a few layers in the neural network.

746
(Refer Slide Time: 17:07)

The second idea is weight sharing. Let us try to understand this with an example. Again, let us say
that pixels 1, 2, 5, 6 are connected through a set of weights to the first hidden neuron in the
first hidden layer, and similarly the last 4 pixels, say pixels 11, 12, 15, and 16, are
connected to the last hidden neuron. We would like to ask the question: do we
want these kernel weights to be different? Do we want a different set of weights here
and a different set of weights there?

The answer is no. We would want the filter to respond to an object or an artifact in the image in
the same way, irrespective of where it is present in the image. This is known as translation
invariance: we want our neural network to be invariant to the translation of an object in an
image from one position to another. Secondly, we can also have different kernels to
capture different kinds of artifacts. So, it is not required that you have 1 kernel for the
top left part and another kernel for the bottom right part; you have the same
kernel everywhere, and you have another kernel to capture another kind of artifact, maybe one of them
capturing one part of a face and another capturing the nose of a person.

This idea of having the same weights for all parts of your input, which comes by default from
convolution, is known as weight sharing. This also reduces the number of
parameters.

747
(Refer Slide Time: 18:59)

Let us now see how our CNN looks in completion. Here is a sample CNN (convolutional
neural network). Given an input, you have a convolution in layer 1, which can have many filters:
stride 1, filter size F = 5x5, and K = 6 filters, which is the number of feature maps and hence the
depth of the output in that first layer; padding P = 0.
The total number of parameters that you see here is 150. Then you have
something called a pooling layer, and then a convolution layer, then a pooling layer.

And then you have what are known as FC layers, or fully connected layers. Let us now try to
understand what these pooling layers in between are. You can see that most CNNs have
alternating convolution and pooling layers; at least the initial CNNs were designed this way.
These days, there are other options, more kinds of layers that you can fit into a CNN
architecture, which we will revisit later in this week's lectures. Let us now try to understand what a
pooling layer does.

748
(Refer Slide Time: 20:13)

A pooling layer is a parameter-free subsampling or downsampling operation. It is parameter-free:
with no weights involved, you do not need to learn any weights in this layer. Let us see an
example of one such pooling operation. What you see here could be a feature map, the output of a
convolution step in one of the hidden layers, which is of size, say, 4x4. If we now try to
do what is known as max pooling with stride 2, here is how it would look.

See here that you get 1, 4, 5, and 8 in the first window; you do what is known as max pooling, which means you
take the maximum value from that window, which is 8, and you put that here. Then you apply a stride of 2,
which means you skip 1 pixel in between and then go to the next window. Once again, you take this
2x2 window, whose max value is 4, and you put the 4 here. If you look below, we have the
same max pool operation, the same 2x2 window, but now with a stride of 1.

With stride 1, you go to the immediate neighbor; you do not skip a step and go 2
pixels further. So in this case, you again have 8 as the max-pooled value here. And similarly, when you
go to the next step, the maximum value here again is 8, and you put 8 in the corresponding
output in that layer. Let us complete this operation: you go to the next step, and with stride 1 you
still have to complete that row before going to the next row.

749

And you keep repeating this at every step. All you have to do in the 2x2 window you are
looking at, with no weights involved, is take the max element and put it in the corresponding position in
the next layer. As you can see, this turns out to be a sub-sampling operation, which means it
reduces whatever image or feature map you had here to a smaller size, based
on your choice of window size and stride.

In this example, we saw what is known as max pooling, where you take the max value in a
2x2 window. But there are other kinds of pooling operations too, such as
average pooling or L2 pooling. Average pooling is where you take the
average of the values in a given 2x2 window; L2 pooling is where you take the L2 norm of all the
values in a given 2x2 window, and so on.

There are many more cousins of pooling, such as mixed pooling, which combines max and
average pooling, spatial pyramid pooling, spectral pooling, and so forth. We
will visit some of these in later lectures in this course.
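
As a quick illustration, here is a minimal NumPy sketch (a toy under assumptions, not from the lecture: 2x2 windows, stride 2, input side divisible by 2) of max and average pooling; note that no weights are involved.

import numpy as np

def pool2x2(x, mode="max"):
    # Slide a non-overlapping 2x2 window over x and keep the max (or mean) of each window.
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            window = x[i:i + 2, j:j + 2]
            out[i // 2, j // 2] = window.max() if mode == "max" else window.mean()
    return out

x = np.arange(16).reshape(4, 4)   # a toy 4x4 feature map
print(pool2x2(x, "max"))          # 2x2 map of window maxima
print(pool2x2(x, "mean"))         # 2x2 map of window averages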

(Refer Slide Time: 23:19)

In addition to these 3 ideas, that is, local receptive fields, weight sharing, and pooling, there
are also variants that one can introduce in the layers we discussed, specifically in the
convolution layer of a convolutional neural network. Let us see a few of these variants before we
close this lecture. A popular variant is known as dilated convolution, where a new
parameter is introduced, known as the dilation rate, which, as the name says, controls the spacing
between values in a kernel. So, when you apply a kernel on an input image, you change the
spacing with which the kernel values are applied, or convolved, with respect to the input.

750
Let us see an example here of a 3x3 kernel, shown in green, with the input image in blue, and a
dilation rate of 2. Let us see how it works. You can see that when you convolve, you are not placing
the 3x3 kernel directly on a 3x3 window of the input; you are spacing things out and placing the
3x3 kernel on that spaced-out region.

One could also say this is equivalent to subsampling your original image by the dilation rate and
then applying the same standard convolution. Both of them turn out to be
equivalent, but here we are doing it in 1 step. Remember, we already said that sampling and
interpolation can be interpreted as convolution operations; we are just taking advantage of that now
using dilated convolution. That is the animation of how dilated convolution looks.

You can notice that when the dilation rate is 1, it becomes standard convolution. There is a
subtle difference between dilated convolution and standard convolution with, say, stride 2. What is
it? When you do standard convolution with stride 2, the convolution is still with the
original neighboring pixels in the image; it is just that the next convolution is placed 2 pixels
further. But in a dilated convolution, each convolution itself is dilated and sees a larger
neighborhood in the input image.
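
A minimal sketch of this difference, assuming PyTorch is available: on the same 7x7 input, a strided 3x3 convolution still looks at contiguous 3x3 neighbourhoods but at fewer positions, while a dilated 3x3 convolution looks at a spaced-out 5x5 neighbourhood at each position.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)                               # batch=1, channel=1, 7x7 input
standard = nn.Conv2d(1, 1, kernel_size=3)                 # contiguous 3x3 neighbourhood
strided  = nn.Conv2d(1, 1, kernel_size=3, stride=2)       # contiguous 3x3, applied every 2 pixels
dilated  = nn.Conv2d(1, 1, kernel_size=3, dilation=2)     # 3x3 weights spread over a 5x5 region
print(standard(x).shape, strided(x).shape, dilated(x).shape)
# torch.Size([1, 1, 5, 5]) torch.Size([1, 1, 3, 3]) torch.Size([1, 1, 3, 3])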

(Refer Slide Time: 25:54)

Another popular variant of convolution that is used while training neural networks or CNNs is
known as transpose convolution. As the name suggests, transpose convolution allows
upsampling of an input. What do we mean? In traditional convolution, you have an input image,
you take a filter, and if you do not pad, you are going to get an output that is smaller in size. We
are going to transpose that now.

751

We are going to see whether, given an input image, you can convolve and get a bigger image
without doing anything else; that process is what is known as transpose convolution. It is
sometimes also called deconvolution, although it is not technically deconvolution; deconvolution
means a different thing in a signal processing context, even though certain articles call such an
operation deconvolution.

Traditionally, we would achieve upsampling using interpolation or similar rules. We saw this in
the first week of lectures: one could take something like a tent kernel and convolve that
with an image to get a higher resolution image, that is, to upsample the image. We now ask:
why should we use the tent kernel? Why can we not learn the weights of that interpolation
kernel? Let us see how this transpose convolution is done using an example.

So, you have an input. Let us consider a 1D example, just to keep things simple and easy to
explain, with a dimension of 4; (2, 3, 0, -1) are the values in
this input. Let us consider the kernel to also be 1D, which is (1, 2, -1). Now we want to find out
how we get an output which is larger than the input; let us see
how we do it.

We take the (1, 2, -1) and multiply it by the first value of the input, which gives 2, 4, -2.
Then we slide (1, 2, -1) to the second value of the input, and we now get 3, 6, -3. We further slide
(1, 2, -1) to the next location of the input and get 0, 0, 0. And finally, we get -1, -2, 1. Now, in each of
these output locations, we add up the values that we get as we slide this filter over each location of
the input.

So, your final output would look like 2, (4+3=) 7, (-2+6+0=) 4, (-3+0-1=) -4, (0-2=) -2, and 1. This
gives you an output which is larger than the input. You could do the same thing in
2 dimensions to make an image larger. We will see examples of where this is used a little later
in this course.
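
A minimal NumPy sketch of the 1D worked example above: each input value scales the whole kernel, and the scaled copies are accumulated into an output that is longer than the input.

import numpy as np

def transpose_conv1d(x, w):
    # Output length = len(x) + len(w) - 1 for a stride of 1 and no cropping.
    out = np.zeros(len(x) + len(w) - 1)
    for i, xi in enumerate(x):
        out[i:i + len(w)] += xi * w   # place the scaled kernel at position i and add
    return out

print(transpose_conv1d(np.array([2, 3, 0, -1]), np.array([1, 2, -1])))
# [ 2.  7.  4. -4. -2.  1.]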

752
(Refer Slide Time: 29:08)

Let us understand this with a visual illustration. You can see 2 examples here: one
upsampling a 2x2 input to a 4x4 output, and another upsampling a 2x2 input to a 5x5 output.
In these images, the blue image in the middle is the 2x2 input and the green image is the
output. The shaded regions show the filter and how it is applied. You can see that
you first apply the 3x3 filter on just 1 pixel of your input, and that gives you one of the values in
the output.

Similarly, you move the filter across locations; in this case, the last 2 values of the filter get
multiplied by 2 locations in the 2x2 input, and that gives you the value in the second location of
the 4x4 output. In the 5x5 output scenario, you leave gaps in the 2x2 image, so your values
are going to be slightly different here, but you see the same sliding. As you move that 3x3
filter over different parts of the 2x2 image, you get different values for each pixel in
your output image.

753
(Refer Slide Time: 30:34)

There are other kinds of convolution that are also possible. Some of these we
will see in detail when we talk about their applications in different contexts, but let us briefly
review them now. 3D convolution, as the name says, is where you convolve in 3 dimensions
and the output is also a 3D volume. In the scenario that we talked about in this
lecture, our filter was 3D, but our output feature map was 2D, and we made it 3D
by taking many filters and hence many feature maps.

But what if we ensure that the output is a 3D volume by the nature of the convolution itself? That is
3D convolution. We will also talk about 1x1 convolution, where you only convolve along
the depth of a volume, and that becomes 1 single value in the output. So, an input of WxHxD
becomes WxH, because you convolve the entire depth at 1 particular pixel with a depth
kernel and you get one scalar at that particular location.

You can also have grouped convolution, where different filters convolve with different subsets of the depth of
your input. So, if you had, say, 100 channels in your input, 1 set of filters could interact with
the first 10 channels, another with the next 10 channels, and so on; that is grouped
convolution.

754
(Refer Slide Time: 32:13)

We can also have separable convolution, the way we saw it in the first week, where we could
separate convolution in 2 dimensions into a convolution along 1 dimension followed by a convolution
along the second dimension. Recall that we said this could help reduce computations, which
is what is shown here: given an input, if you originally had a 3x3 kernel to get a certain
output, you first apply a column kernel to get an intermediate output, then a row kernel,
and then you get your final output. This is just a visualization of the separable convolutions that we
saw earlier in this course.

We could also have what is known as Depthwise Separable Convolution, where, given an input
volume, you have a separate filter for each of the 3 channels. You convolve
each of them individually to get 3 different feature maps, and then you run a convolution along the
depth (a pointwise convolution) to compress all of them into a single feature map. This can be repeated with different filters
to get multiple feature maps in the next layer, as shown in the sketch below.
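
A minimal PyTorch sketch of this idea (PyTorch and the layer sizes below are assumptions chosen only for illustration): a per-channel spatial filter implemented with groups=in_channels, followed by a 1x1 pointwise convolution that mixes information across channels.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # One spatial filter per input channel (depthwise step)
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        # 1x1 convolution across the depth (pointwise step)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 3, 32, 32)
print(DepthwiseSeparableConv(3, 8)(x).shape)   # torch.Size([1, 8, 32, 32])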

755
(Refer Slide Time: 33:36)

We can also look at what are known as flattened convolutions, which are very similar to separable
convolutions: you first have a filter along 1 of the dimensions, then a filter
along another dimension, and then a third filter along a third dimension. In some
sense, this is a combination of separable convolution and depthwise convolution; it is just the
general paradigm for that.

And finally, we can also have spatial and cross-channel convolutions where we can do
convolutions in parallel and then concatenate them to get an output. We will see examples of
several of these types of convolutions in later lectures this week, in the context of how they are
used.

756
(Refer Slide Time: 34:27)

To conclude this lecture, here are your recommended readings. For an interactive illustration
of convolution, please see this link. For a very nice discussion of the deconvolution operation,
please see this link at distill.com. Other good resources are chapter 9 of the deep learning book
and certain notes of the CS231n course. A couple of questions for you to take away: given a
32x32x3 image and 6 filters of size 5x5x3, what is the dimension of the output with a stride of 1
and a padding of 0?

Work it out. Another question: is the max-pooling layer differentiable? How do you
backpropagate across it? Think about these questions at the end of this lecture.

757
(Refer Slide Time: 35:23)

And here are some references.

758
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 33
Backpropagation in CNNs

Having seen an introduction to convolutional neural networks, we saw a couple of new layers
that you can have in a neural network: a convolutional layer and a pooling layer. Now, we ask the
question: with these new layers, how does this affect backpropagation the way we saw it earlier?

(Refer Slide Time: 0:42)

Before we go there, let us revisit our exercise from last class, which was: given a 32x32x3 image
and 6 filters of size 5x5x3, what is the dimension of the output volume with a stride of 1 and a
padding of 0? The answer is straightforward: $W_2 = \frac{W_1 - F + 2P}{S} + 1$, and similarly
for $H_2$. Simply plugging in the values, you get 28x28x6, where the depth of the output is 6
because of the number of filters. We did leave behind another question, which was whether the
max-pooling layer is differentiable and how you backpropagate across it; this is something that we will
visit at the end of this lecture.

759
(Refer Slide Time: 1:37)

Let us now start and try to do backpropagation across a convolutional layer first. A large part
of this lecture’s content is adapted from Dhruv Batra’s excellent lectures at Georgia Tech.

(Refer Slide Time: 1:51)

To begin, let us assume a grayscale image, i.e., the number of input channels C = 1.
This does not affect the derivation per se; it just makes it simpler to explain.
Also, we are going to assume the number of convolutional filters to be 1, which means the depth
of your output map will also be 1. Each convolutional filter
follows the same procedure to compute its gradients.

760
(Refer Slide Time: 2:27)

To start, let us consider a single convolutional filter, and let us assume the size of that filter to be
K1xK2. This is applied to an image of size N1xN2, and you get an output Y of
size M1xM2. Visually speaking, you have an input X, a filter W, and an output
Y. From the definition of convolution, we know that an element of the output Y at (i, j) is

$$Y[i,j] = \sum_{a=0}^{K_1-1} \sum_{b=0}^{K_2-1} X[i-a,\, j-b]\, W[a,b]$$

Note here that we are not placing the filter at the center of the input pixel but at a corner. This does
not matter; it only makes the expression a little simpler to see and the rest of the derivation a
little simpler to understand.

(Refer Slide Time: 3:42)

761
We are now given a loss function L which is used to train the CNN. For our convolutional
layer, there are two quantities that we have to compute. One is $\frac{\partial L}{\partial W}$, the gradient of the loss with
respect to every weight in every filter, and the other is $\frac{\partial L}{\partial X}$, the gradient of the loss with respect to every pixel of
the input, because that is necessary to backpropagate further to an earlier layer. We are going
to derive the gradient for each of these quantities using the chain rule.

(Refer Slide Time: 4:23)

Let us start with $\frac{\partial L}{\partial W}$, the gradient with respect to the weights of the filter. To do this, let us
consider one particular weight in the convolutional filter. Let us say that you have $W[a', b']$ as
one location in your filter: if your filter W has dimensions K1xK2, one of the values inside it is
$W[a', b']$. We now want to compute the gradient of the loss with respect to that particular weight,
and then this can be generalized to all weights in that filter.

762

To do this, let us ask the question: if you use a filter W in a convolutional layer, how
many pixels, and which pixels, in the output (the next layer) Y does that particular weight value
affect? Remember, the way backpropagation is done, for all the values in the next layer that a particular
pixel impacts, we have to accumulate all those gradients back into the gradient at
that particular pixel. This is what we saw with feed-forward neural networks.

So, in this case, the question is: if we take the weight at $W[a', b']$, which pixels in the
next layer's map does that particular weight affect? The answer is every pixel in Y, because
every pixel in Y is obtained by convolution of the filter with a certain location in X. While a
certain pixel in Y depends only on a few pixels in X, it does depend on every value in W, which
is the filter.

So, each pixel in the output corresponds to one position of the filter overlapping the input, and
every pixel in the output is a weighted sum of a part of the input image; but each weight in the
filter affects every pixel in the output.
(Refer Slide Time: 6:47)

Now, let us move forward. By the chain rule, we write:

$$\frac{\partial L}{\partial W[a',b']} = \sum_{i=0}^{M_1-1}\sum_{j=0}^{M_2-1} \frac{\partial L}{\partial Y[i,j]} \, \frac{\partial Y[i,j]}{\partial W[a',b']}$$

The summation runs over the entire dimension of Y; remember, we said Y has dimension M1xM2. So we have to sum over all of these gradients, where $\frac{\partial L}{\partial Y[i,j]}$ is the gradient of the loss up to one particular pixel in the next layer's map, multiplied by $\frac{\partial Y[i,j]}{\partial W[a',b']}$ from the chain rule.

763

$\frac{\partial L}{\partial Y}$, we assume, is already known through backpropagation until that particular step. Our challenge now, for the convolutional layer, is to compute $\frac{\partial Y}{\partial W}$, in particular $\frac{\partial Y[i,j]}{\partial W[a',b']}$. That is the quantity we now have to compute to get this overall gradient of the loss with respect to $W[a', b']$.

(Refer Slide Time: 7:57)

Let us consider that particular quantity now, which is $\frac{\partial Y[i,j]}{\partial W[a',b']}$. By the definition of convolution,
once again we have $Y[i,j] = \sum_{a=0}^{K_1-1}\sum_{b=0}^{K_2-1} X[i-a,\, j-b]\, W[a,b]$, where K1, K2 are the filter sizes; this is the
first-principles definition of convolution. Using this, we can expand $Y[i,j]$ inside $\frac{\partial Y[i,j]}{\partial W[a',b']}$ in terms of the entire
output from the first equation: you take the entire RHS of the first equation
and substitute it in there.

Now, in all these summations, there is only one term that is going to depend
on $W[a', b']$. Remember, in one convolution operation, one particular filter value is multiplied with only
1 pixel of the input; when you move the filter to the next
location, it may contribute something else at that other location, but in one single convolution
operation, one value in the filter is multiplied by only 1 input pixel. That term is
$W[a', b'] \cdot X[i-a',\, j-b']$.

Every other term in this double summation, when you differentiate with respect to $W[a', b']$,
will become 0 because it does not depend on $W[a', b']$. This quantity therefore trivially becomes
$X[i-a',\, j-b']$. Let us see how to use this.

764
(Refer Slide Time: 9:47)

Let us plug this back in. We now have

$$\frac{\partial L}{\partial W[a',b']} = \sum_{i=0}^{M_1-1}\sum_{j=0}^{M_2-1} \frac{\partial L}{\partial Y[i,j]} \, X[i-a',\, j-b']$$

that is, a summation over all the pixels in Y of $\frac{\partial L}{\partial Y}$ at every pixel of Y, multiplied by the second term we just computed on the previous slide, which turns out to be $X[i-a',\, j-b']$. Look carefully: this is now the convolution of X and $\frac{\partial L}{\partial Y}$, which is beautiful.

Remember, $\frac{\partial L}{\partial Y}$ also has the dimensions of Y; that set of gradients forms an M1xM2 matrix, because you have a gradient of the loss with respect to every pixel in Y. So, that forms one matrix, X is a matrix by itself, and the convolution of X and $\frac{\partial L}{\partial Y}$ gives $\frac{\partial L}{\partial W}$. So, you can compute this as a convolution in the backpropagation step.

765
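
A minimal NumPy sketch of this idea, written with cross-correlation indexing (which differs from the lecture's convolution convention only by a flip of the filter): the gradient of the loss with respect to the filter is itself a 'valid'-size correlation of the input X with the upstream gradient dL/dY.

import numpy as np

def corr2d(A, B):
    # 'Valid' cross-correlation: out[i, j] = sum of A[i:i+K1, j:j+K2] * B
    K1, K2 = B.shape
    M1, M2 = A.shape[0] - K1 + 1, A.shape[1] - K2 + 1
    out = np.zeros((M1, M2))
    for i in range(M1):
        for j in range(M2):
            out[i, j] = np.sum(A[i:i + K1, j:j + K2] * B)
    return out

X = np.random.randn(5, 5)        # input to the convolutional layer
W = np.random.randn(3, 3)        # the filter
dL_dY = np.random.randn(3, 3)    # upstream gradient, same shape as Y = corr2d(X, W)
dL_dW = corr2d(X, dL_dY)         # same shape as W
print(dL_dW.shape)               # (3, 3)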
(Refer Slide Time: 10:52)

Let us now move on to the next quantity, $\frac{\partial L}{\partial X}$, the gradient with respect to the inputs of that
convolutional layer. Let us once again consider a single input pixel $X[i', j']$, and again ask the
question: which pixels in Y would be affected by this particular pixel in X? Visually, given an X,
let us say there is 1 pixel, shown by the red square, at $[i', j']$. When X is convolved with W, that
pixel affects the entire range of pixels that go from $[i', j']$ to $[i' + K_1 - 1,\, j' + K_2 - 1]$.

Why is it like this? Why is it not centered at $[i', j']$? Because that is the way we defined
convolution in this context, for simplicity. If we had defined convolution as going from $-K_1/2$
to $+K_1/2$, it would have been centered; but because we defined convolution with indices a going
from 0 to $K_1 - 1$ and b going from 0 to $K_2 - 1$, you now have these pixels, represented by the
dotted square, which are the pixels that would be affected by $[i', j']$.

Now, we call this region P, and we know that the gradients of the loss with respect to these
pixels will influence the gradient of the loss with respect to this input pixel. All other pixels in
Y do not have a contribution from this pixel, and hence those gradients of L with respect to Y
do not matter to us.

766
(Refer Slide Time: 12:44)

Let us now write

$$\frac{\partial L}{\partial X[i',j']} = \sum_{p \in P} \frac{\partial L}{\partial Y[p]} \, \frac{\partial Y[p]}{\partial X[i',j']}$$

From the figure on the previous slide, we can define that region P by a going from 0 to $K_1 - 1$ and b going from 0 to $K_2 - 1$, so each term is $\frac{\partial L}{\partial Y[i'+a,\, j'+b]}$ times the derivative of $Y[i'+a,\, j'+b]$ with respect to $X[i', j']$. Rather, if you look at the previous slide, we are taking each pixel in this region P and adding up the terms; all I am saying is that we want the gradient at a particular pixel of this region P with respect to this pixel in X, and that is what we are writing in the summation.

The left quantity here is known, because we assume that all the $\frac{\partial L}{\partial Y}$ are known up to that point;
we are only worried about how to compute the backprop across a single convolutional
layer. If there were more convolutional layers, the same procedure would be applied iteratively.

767
But the second quantity here, $\frac{\partial Y[i'+a,\, j'+b]}{\partial X[i',j']}$, is currently not known to us. Let us try to compute
this on the next slide.

(Refer Slide Time: 14:17)

From the definition of convolution, once again we have $Y[i', j']$ given by the basic first-principles
equation. So, if that is the definition of $Y[i', j']$, then the definition of $Y[i'+a,\, j'+b]$, which is
another pixel location, is obtained by replacing $i'$ with $i'+a$ and $j'$ with $j'+b$ wherever they
appear. In the term of the summation where the indices are exactly a and b, $i'+a-a$ becomes $i'$,
$j'+b-b$ becomes $j'$, and $W[a,b]$ stays as it is.

Now, $\frac{\partial Y[i'+a,\, j'+b]}{\partial X[i',j']} = W[a,b]$ for that particular choice of a and b; for a different choice of a
and b, it would be the corresponding term in the summation. As we said for the previous derivative $\frac{\partial L}{\partial W}$,
all the other terms in the summation do not affect the derivative with respect to one
particular a, b. So, this is the quantity that we have for $\partial Y$ at a particular pixel location with
respect to $\partial X[i', j']$. Let us now plug this back.

768
(Refer Slide Time: 15:45)

We have

$$\frac{\partial L}{\partial X[i',j']} = \sum_{a=0}^{K_1-1}\sum_{b=0}^{K_2-1} \frac{\partial L}{\partial Y[i'+a,\, j'+b]} \, W[a,b]$$

that is, summing over all the pixels in the region P, $\frac{\partial L}{\partial Y}$ at one of those pixels multiplied by $W[a,b]$ for that choice of a and b. Once again, this becomes interesting. This looks like the definition of cross-correlation: it is the cross-correlation of $\frac{\partial L}{\partial Y}$ with W, the filter. Rather, we can say it is the convolution of the flipped filter with $\frac{\partial L}{\partial Y}$.

We know that cross-correlation and convolution differ by a flip of the filter, a flip by 180 degrees, which is a flip in 2 directions; and that is what we have here: $\frac{\partial L}{\partial Y}$ convolved with the flipped filter. This is interesting again, because even the chain rule across the convolutional layer to compute $\frac{\partial L}{\partial X}$ can be computed as a convolution.
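
A minimal sketch of this step, again in cross-correlation indexing and assuming SciPy is available: the input gradient is a 'full'-size convolution of the upstream gradient with the filter, which is the same as a full correlation with the 180-degree flipped filter.

import numpy as np
from scipy.signal import convolve2d

W = np.random.randn(3, 3)                     # the filter
dL_dY = np.random.randn(3, 3)                 # upstream gradient for a 5x5 input
dL_dX = convolve2d(dL_dY, W, mode="full")     # pads dL/dY and convolves with W
print(dL_dX.shape)                            # (5, 5), the same shape as the input X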

Why does this matter? Recall our discussion of image frequencies in the first week. We said
that convolution can be evaluated through very efficient methods such as fast Fourier
transform. You could use all that to efficiently compute these gradients during backpropagation
in this step.

769
(Refer Slide Time: 17:22)

Now, we are left with one question; that concludes how backpropagation is
done across a convolutional layer, where there are only 2 quantities, W and X. We are still left
with: how do you backpropagate across a pooling layer? As we said before, a pooling layer is
parameter-free. It does not have any weights, so there are no weights to update and no weight gradients.
The only gradients we have to compute are the gradients with respect to X in the previous layer,
to allow the gradients to propagate through to earlier layers.

How do we do this? If we have max pooling, let us take the visual example on the right. You
can see that these red, green, yellow, and blue regions are max pooled in the forward
pass: in the window with values 1, 1, 5, 6, the max value is 6; in 2, 4, 7, 8, the max value is 8;
in 3, 2, 1, 2, the max value is 3; and in 1, 0, 3, 4, the max value is 4. So, in the forward
propagation, for every 2x2 region, you take the max element, put it here, and then take it further
through later layers.

In the backprop step, for the same layer, you keep track of what the winning position was in
each 2x2 window. For example, we know that 6 came from this position of the first 2x2
window. So, whatever gradient we have at 6, when we backpropagate, goes to that location
in the previous layer; backpropagation is drawn the other way here, but what you see on the
right is the previous layer.

The full gradient at 6, remember, is known to us: we assume that when we backpropagate, the
gradients up to this step for each of these pixels are known, and we only have to see how to
backpropagate those gradients across the pooling layer. Each of these gradients goes to the
corresponding location in the previous layer, because that was the position of the winning neuron,
and similarly for the others.

770
What happens if we do not use max-pooling but, say, average pooling? If we used
average pooling, you would take whatever gradient came into one particular location, say the
position of 6, and distribute that gradient equally to each of these 4 locations in the case of 2x2 pooling.

If you did 3x3 pooling, you would divide that among 9 pixels. In this case, we have 2x2 pooling,
so whatever gradient you have at 6 will be divided into 4 and that gradient would be given to
each of these pixels in the 2x2 neighborhood that led to 6 and that is how backpropagation is
done across pooling layers.
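
A minimal NumPy sketch of both cases (assuming 2x2 windows with stride 2): max pooling routes the whole upstream gradient to the winning position, while average pooling splits it equally over the window. The 4x4 array below is only an assumed layout consistent with the 2x2 windows described above.

import numpy as np

def maxpool_backward(x, dL_dout):
    dx = np.zeros_like(x, dtype=float)
    for i in range(0, x.shape[0], 2):
        for j in range(0, x.shape[1], 2):
            window = x[i:i + 2, j:j + 2]
            a, b = np.unravel_index(np.argmax(window), window.shape)  # winning position
            dx[i + a, j + b] = dL_dout[i // 2, j // 2]
    return dx

def avgpool_backward(x, dL_dout):
    dx = np.zeros_like(x, dtype=float)
    for i in range(0, x.shape[0], 2):
        for j in range(0, x.shape[1], 2):
            dx[i:i + 2, j:j + 2] = dL_dout[i // 2, j // 2] / 4.0      # split among the 4 pixels
    return dx

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(maxpool_backward(x, np.ones((2, 2))))   # nonzero only at the positions of 6, 8, 3, 4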

(Refer Slide Time: 20:23)

That concludes our discussion of backpropagation in CNNs. For more details please read the
lecture notes of Dhruv Batra on the same topic or a very nice write-up by Jefkine on
Backpropagation in CNNs.

771
Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 34
Evolution of CNN Architectures for Image Classification - Part 01

Having seen an introduction to convolutional neural networks and how backpropagation works
in convolutional and pooling layers, let us now move forward and see how CNN architectures
have evolved over the years. Today, depending on what task you have at hand and what
constraints you have for that task, there is a wide variety of architectures
available; we will try to cover a significant number of them over the next couple of lectures. Let
us start with the earliest architectures.

(Refer Slide Time: 0:53)

The history of CNNs dates back to the Neocognitron in 1980, which was perhaps the earliest
precursor of CNNs. The concept of feature extraction, the concept of pooling layers as well as
using convolution in a neural network; and finally having recognition or classification at the end
was proposed in the Neocognitron.

772
(Refer Slide Time: 1:20)

But the name convolutional neural networks came with the design of LeNet by Yann LeCun
and team, which was largely developed between 1989 and 1998 for the handwritten digit
recognition task. The LeNet convolutional neural network architecture looks like what you see
here: it starts with a convolutional layer, then a pooling layer, then again a convolutional
layer, then a pooling layer, then two fully connected layers, and then finally the output layer.

The convolutional filters were of size 5x5, applied at stride 1. The pooling layers were 2x2,
applied at stride 2. You can see the overall architecture written this way; this is a
common way of describing neural network architectures, where you write
CONV-POOL-CONV-POOL-FC-FC, denoting a convolutional layer followed by a pooling
layer, followed by a convolutional layer, a pooling layer, an FC layer and another FC layer.
Even architectures today follow similar principles in their design of CNN architectures.

773
(Refer Slide Time: 2:40)

A major effort in the development of newer CNN architectures and designs was thanks to the
ImageNet classification challenge, which started in 2010. ImageNet is a dataset, and its creators
organized a challenge called the Large Scale Visual Recognition Challenge in 2010. The image
database in ImageNet is organized according to the WordNet hierarchy; it currently has images
corresponding to various nouns in the WordNet hierarchy, with millions of images at this time
and currently about 500 images per node in the WordNet hierarchy.

This led to a significant effort across researchers to benchmark their machine learning models,
computer vision models in particular, for image classification on a common dataset on which
everybody could check how their models performed. One performance measure on the dataset
is the top-1 error, which is the error when you compare only the best prediction
of the neural network against the ground truth.

The other is the top-5 error, where you check whether the ground truth is among the top 5 labels in
the output layer of the neural network, which is still a good measure because the ImageNet
dataset has about 1000 object classes, which means your output layer in the neural network has 1000
neurons. As long as your ground truth label is among the top 5 scores in
the output layer, you consider this a correct match; not doing so counts towards the top-5 error. For
more details on the ImageNet dataset, which is very popularly used by computer vision
researchers today, please see this link.

774
(Refer Slide Time: 4:46)

Before we move forward and discuss CNN architectures, let us take a step back and see how
neural networks in general can be used for the classification problem. When we discussed feed
forward neural networks, we took the loss function as the mean square error; which was fine for
that time. But, when you are working on a classification problem, mean square error may not be
the right choice of a loss function. Mean square error works better for regression problems.

In classification problems, the most common loss function is known as the cross-entropy loss
function which measures the entropy between the probability distribution on the output labels in
the ground truth, versus the probability distribution on the output labels as output by the neural
network. This gives you a sense of error committed by the neural network; let us see the formal
definition. So, the cross-entropy loss function is given by $-\frac{1}{C}\sum_{i=1}^{C} y_i \log \hat{y}_i$, where C is the total number of classes, $\hat{y}_i$ is the prediction of the neural network for a particular class i, i.e., the score of the last layer of the neural network for the i-th class, and $y_i$ is the correct label for that particular class.

Rather, you can say that if you had a certain number of classes, say 1 to C, and the second class was the correct class, then your expected y vector would be $[0, 1, 0, \ldots, 0]$ up to the C-th entry. This is the probability distribution on the label space in the ground truth, because the second class is the correct class; the second entry is $y_2$, the first is $y_1$, and so on. That is the $y_i$ we are talking about here, and $\hat{y}_i$ is the score of the output layer of the neural network. In the binary case, this simplifies to $-\left(y \log \hat{y} + (1-y)\log(1-\hat{y})\right)$.

775

For example, when you use a sigmoid, you have only one score $\hat{y}$ that comes out of the neural
network, and $1 - \hat{y}$ acts as the score for the negative class; this is how cross-entropy
simplifies for a binary classification setting. Now, if we introduce a new loss
function, the first thing to check is whether it is differentiable; by looking at it, one can say that
cross-entropy is definitely differentiable. The next question is what its gradients are with respect to, say, a
weight in the last layer, because once we get that, we can propagate it further through earlier
layers using whatever we have discussed so far.

So, let us consider the activation function to be sigmoid and assume a binary
classification problem. The derivative of the cross-entropy loss function with respect to a weight
in the last layer, $w_j$, is given by

$$\frac{\partial L}{\partial w_j} = -\frac{1}{n}\sum_{x}\left(\frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)}\right)\frac{\partial \sigma}{\partial w_j} = -\frac{1}{n}\sum_{x}\left(\frac{y}{\sigma(z)} - \frac{1-y}{1-\sigma(z)}\right)\sigma'(z)\, x_j = \frac{1}{n}\sum_{x}\left(\sigma(z) - y\right) x_j$$

Remember that $\sigma(z) = \hat{y}$.

So, even with mean square error, when you have $(y - \hat{y})^2/2$, if you differentiate, you would get
$(y - \hat{y})$ in the gradient. So, the gradient of the cross-entropy looks very similar to the gradient
of the mean square error.
of the mean square error looks.

(Refer Slide Time: 11:16)

776
One more point before we move to CNN architectures concerns classification. While we have
talked about different kinds of activation functions, such as sigmoid, tanh, ReLU,
leaky ReLU and so on, we typically use what is known as the softmax activation
function in the output layer alone. What is the softmax activation function? Given $z_1$ to $z_C$ in the
last layer, that is, the outputs of the C neurons in the last layer, the final output of
the network for the j-th neuron is given by $\frac{e^{z_j}}{\sum_k e^{z_k}}$. What does this do?

Remember that the summation of all the $e^{z_k}$ over the last layer is always at least as large as
any individual $e^{z_j}$, which means these values are definitely going to lie between 0 and 1, because the
denominator is the summation of the exponential values of all the neurons in the last layer.
Doing this for every single neuron, the outputs also add up to 1.

In a way, we have now converted whatever scores the neural network outputs into a vector of
probability values, where each lies between 0 and 1 and they all add up to 1. This allows us
to interpret the output of a neural network as a probability distribution over your output label
space. So, softmax is the default activation function used in the output layer of
neural networks for classification. A small point of nomenclature here: the values
before the probability scores, that is, before the softmax is applied, are typically called the logits of the neural network.

777
So, the neural network outputs logits, you apply the softmax activation function, and you convert them into
probability scores. Here is an example on the right: the output layer gave a set of real
values as output, and after applying the softmax activation function, this became [0.02, 0.9, 0.05, 0.01,
0.02]. This is because of the exponential scaling of the values, and these add up to 1, allowing us
to interpret that vector as a distribution over your output labels.

(Refer Slide Time: 14:18)

Let us now come back to the ImageNet classification challenge. All the CNNs that we will talk
about now use the cross-entropy loss, and more or less all of them use the softmax
activation function in the last layer; that is the common choice of activation function and loss across
these CNN architectures. As we said a few minutes ago, the ImageNet dataset was created and
launched in 2010, and researchers around the world participated in the ILSVRC challenge. In
2010, the winning error rate was 28.2; this was achieved without neural networks. In 2011,
researchers improved the error rate from 28.2 to 25.8.

In 2012, Alex Krizhevsky and Geoffrey Hinton came up with a CNN architecture, popular to
this day as AlexNet, which reduced the error from 25.8 to 16.4, a significant
improvement at that time. This was the first winner of the ImageNet challenge
based on a CNN, and since 2012, every year's challenge has been won by a CNN, significantly
outperforming shallow machine learning methods. Traditional machine
learning methods are often called shallow learning, because you do not have deep learning over
many, many layers.

778

But this will become clearer to us in the next week’s lectures, when we try to understand what
value the depth of the neural network brings to the architecture. Let us now start with the
AlexNet’s architecture and try to understand it.

(Refer Slide Time: 16:22)

Here was the architecture of AlexNet; we actually looked at the first layer of AlexNet as an
example when we worked out sizes in the first lecture on CNNs. The input is a volume of
224x224x3, the RGB channels. The filter size was 11x11, with, of course, a depth
of 3, which follows in all of these scenarios. With a stride of 4, this led to a 55x55x96 output,
of which 55x55x48 (48 filters) went to one processing unit and the remaining 55x55x48 went to
another processing unit. We will describe this in more detail over the next few slides.

But this was the overall architecture that won the ILSVRC in 2012. It had design principles
similar to LeNet, but it also had many features which were different; it was deeper than
LeNet, and there were a few other important improvements which we will talk about. It was
trained over 1.2 million images using SGD with regularization.

(Refer Slide Time: 17:54)

779
Let us look at this architecture more carefully. It has the input image, which was
224x224x3, the size of images in the ImageNet dataset. The first layer had Conv plus
Pool, whose output you see together here: the output of 55x55x48 was after doing the
convolution part, and then a max pooling reduces the size further. Then once again
they had a Conv plus Pool, then only a Conv, then only a Conv, then again a Conv plus Pool,
then a fully connected layer, then another fully connected layer, and finally the output layer.

Let us at least look at the first layer's numbers once more; you can perhaps work out the
remaining layers based on the principles that we talked about in the first lecture this week. The
input is 224x224x3, 11x11 is the size of the filter, and the stride is 4. They also had a certain
level of padding, so you would have 227x227 as the
final input given to the network, and the output size works out to (227-11)/4 + 1, which gives
you 55. That is what we talked about even earlier. Similarly, you can work out the parameters in each of these layers, by
knowing the number of filters and the kernel size in each layer, which
is shown in the figure.

780
(Refer Slide Time: 19:39)

It had 8 layers in total: 5 convolutional layers and 3 fully connected layers, where the last
3 layers were fully connected. It was trained on the ImageNet dataset, as we mentioned. The
AlexNet designers also introduced a normalization layer called the response normalization layer.
We have seen a batch normalization layer before; similarly, the response normalization layer
normalizes all the values at a particular location across the channels in a given layer.

For example, if you had a certain depth, you would take a particular pixel location, go
along the depth, and normalize all the values along the depth at that particular spatial position.
This was known as response normalization, and the authors argued that this would give an effect
known as lateral inhibition: it would allow one of those responses to win over the others by getting a
higher value after normalization. In addition to this, this
particular architecture also introduced the rectified linear unit, or ReLU, as an activation function.

That was one of the contributions that made this architecture a success. Of course, ReLU is now
a default choice of activation function in many other CNNs. As you
can see in the architecture, max-pooling follows the first, second and fifth convolutional
layers; the third and fourth did not have pooling. The ReLU non-linearity was applied to the
output of every layer in the entire architecture. Why was such a design chosen? At that time,
perhaps, that was the architecture that led to the best empirical performance.

781
(Refer Slide Time: 21:47)

Let us try to analyze the complexity of this architecture in terms of parameters and neurons. The
parameters in the first layer are given by 11x11x3, the size of the
filter, plus 1 for a bias, times 96, the total number of filters chosen for the first layer: 48 that
went to one processing unit and 48 that went to another processing unit. The designers of
this architecture tried to take advantage of the two processors they had, in particular two GPUs,
into which they simultaneously forked these volumes.

They then combined them in later layers, where you see these cross inputs as dotted lines, which
is where the output of one GPU was also fed to the other GPU, and so on. These cross links in
the architecture should help you point out in which layers that was done. Coming back to the
parameters: the total number of parameters in the first layer is (11x11x3 + 1) x 96, which roughly
comes to about 35,000. On the other
hand, the number of neurons is given by the size of the output feature maps themselves, 55x55x96, and
that turns out to be something like 253440.

Similarly, you can compute the numbers for the other layers. For the second layer, the parameters
turn out to be about 300k and the neurons about 186,000. For the third layer, the
parameters are about 884k and the neurons about 64,000; so you can see that the number of
neurons is reducing, because the feature map size is reducing after applying pooling. By a neuron we mean
each pixel in a feature map of the next layer; that is what constitutes a neuron here. And because
we do pooling, the size of the image goes from 224x224 to 55x55 to 27x27 to 13x13.

782

Because of the reducing size, the number of neurons reduces; but the number of parameters keeps
increasing, because the number of filters is significant. Initially it was 96; the next layer
had 128 + 128 filters, i.e., 256 in total; then 192 + 192, about 384 filters; and
so on. You can see that as you go deeper, the number of parameters keeps
increasing and the number of neurons decreases, until you reach the fully connected layers, where the
number of parameters simply takes off and goes to the order of millions. So, of the total of about 60
million parameters, about 57 million are located only in the fully connected layers.

And that should again go back and explain to you, how efficient the convolutional and pooling
layers are in terms of number of parameters. So, and the number of neurons comes to totally
about 0.63 million.

(Refer Slide Time: 25:17)

Another way of looking at it is to visualize this from a different perspective. You can see here that the parameters go from 35k to millions in the last three layers, while the neurons start at 253k and reduce towards the last layers. One interesting observation about this architecture is that the convolutional layers contain only about 5 percent of the parameters; remember, they contain only 3 million of the total 60 million parameters that AlexNet has. Yet they account for about 95 percent of the computation. Why is that so?

Although they have only 5 percent of the parameters, remember that in convolution you have to take the same weights and place them at every location of the input and compute the output. Although the weights are few, the number of computations is high in a convolutional layer. So, 90 to 95 percent of the computation sits in the convolutional layers, whereas 90 to 95 percent of the parameters lie in the fully connected layers.

AlexNet was trained with SGD, as I mentioned, on two NVIDIA GTX 580 3GB GPUs, and that is what allowed them to fork the feature maps onto two different GPUs. It took them about a week to train this architecture; today it can be done in much less time thanks to improvements in GPUs, but at that time it took a week.

(Refer Slide Time: 26:55)

In the following year, 2013, the winner of the ImageNet ILSVRC was a network known as ZFNet, named after Zeiler and Fergus, who came up with this CNN design.

(Refer Slide Time: 27:15)

The main contribution of ZFNet in 2013 over AlexNet was not a new architecture; the architecture was fairly the same, with the overall pattern Conv-Pool, Conv-Pool, Conv, Conv, Conv-Pool, fully connected, fully connected. But there were a few changes in hyperparameters. What were they? In Conv1, the filter size was changed from 11x11 with stride 4 to 7x7 with stride 2. In Conv3, 4 and 5 the number of filters was increased from 384, 384, 256 to 512, 1024, 512. This careful choice of hyperparameters led to a significant decrease in the top-5 error, from 16.4 percent to 11.7 percent.

One term I should clarify at this stage is the notion of hyperparameters: why do we call values such as 11x11, 7x7 and 384 hyperparameters? Because the parameters that we learn are the weights of the neural network; all these other values are user-defined and are not learnt, so we call them hyperparameters. This careful tuning was the main contribution of ZFNet in 2013.

(Refer Slide Time: 28:39)

Moving on to 2014, one of the major contributions that year was a new architecture known as VGGNet.

(Refer Slide Time: 28:50)

VGGNet is an architecture developed by the VGG group at Oxford, where VGG stands for Visual Geometry Group. They came up with an architecture with a certain design philosophy, building models at multiple depths. They observed that AlexNet took LeNet from a certain depth to the next level, while ZFNet maintained the same depth as AlexNet; and they argued that by making a CNN deeper, you could solve problems better and get a lower error rate on the ImageNet classification challenge.

The way they went about it was to build multiple architectures of different depths. You can see here that the input is at the top and the output at the bottom in all of these architectures: Conv layers, max-pool layers, Conv layers, max-pool layers and so on, where the numbers 64, 128, etc. denote the number of filters in those layers. This architecture was the runner-up in the ImageNet challenge in 2014. Their philosophy was that by increasing the depth, you can model more non-linearities in your function; the key contribution was showing that depth is a critical component.

They maintained a homogeneous architecture from beginning to end. What does that mean? They used only 3x3 convolutions across all of their layers. We saw that AlexNet had an 11x11 filter in the first layer, and ZFNet changed that to a 7x7 filter the next year. VGG argued that, instead of struggling to find the right filter size while designing the architecture, they would simply fix it at 3x3; how that helps, we will come to in a moment. They also used a 2x2 max-pool across all of the pooling layers in the network.

Their main idea was that a smaller filter means fewer parameters: if you have a 3x3 filter, there are fewer parameters to learn. More importantly, two 3x3 convolutions applied in sequence have the same receptive field as one 5x5 convolution. When you do one 3x3 convolution, you get an output; if you then do a 3x3 convolution on that output, a neighborhood of 5x5 pixels in the original image influences each value of the later output. In other words, two 3x3 convolutions have the same receptive field as a single 5x5 convolution. Similarly, three 3x3 convolutions have the same receptive field as a 7x7 convolution, and so on.

So, instead of trying to choose filter sizes like 11x11 or 7x7, they argued that they could always keep the filter size at 3x3. Importantly, as we said, this has fewer parameters: with C channels in and C channels out, a single 7x7 convolution has 7*7*C*C = 49 C^2 parameters, whereas three 3x3 convolutions arranged in sequence have 3*(3*3*C*C) = 27 C^2 parameters, that is 27 versus 49 times C^2. So this has fewer parameters than a single 7x7 convolution.
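
As a quick sanity check of this argument, here is a short Python sketch, assuming C input channels and C output channels for every layer and ignoring biases:

C = 64                                       # any channel count; used only for illustration
params_single_7x7 = 7 * 7 * C * C            # 49 * C^2
params_three_3x3 = 3 * (3 * 3 * C * C)       # 27 * C^2
print(params_single_7x7, params_three_3x3)   # 200704 vs 110592 for C = 64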

(Refer Slide Time: 32:44)

They found that VGG19 was only marginally better than VGG16. So, up to VGG16 the performance kept improving with depth, but it seemed to saturate after a certain depth; this issue was addressed later, as we will see in the next few slides. They also tried using ensembles of different networks to get improved performance on the same task.

(Refer Slide Time: 33:13)

VGG, however, had a significant memory overhead and parameter overhead. You can see here a comparison of AlexNet versus VGG16 in terms of memory. Memory here is simply the number of neurons across all of the layers in VGG versus the number of neurons across all of the layers in AlexNet. Remember again, when we say number of neurons we mean the total size of the feature maps, because every pixel in every feature map is a neuron for us in a CNN.

You can work this out carefully and find that AlexNet has a total of about 1.9 MB of memory versus 48.6 MB for VGG16. In terms of parameters, the first few convolution layers are reasonably small for both AlexNet and VGG; it is in the fully connected layers that VGG has a significant increase in parameters. While AlexNet had a total of about 60 to 61 million parameters, VGG16 ended up having about 138 million parameters. So, while VGG works well in many settings, its increased parameter count and memory are a limitation.

While each choice of 3x3 by itself reduces the parameters, the parameters add up because of the later layers, the depth, the fully connected layers and so on. VGG uses a lot more memory and parameters; and as we saw, most of these parameters are in the first fully connected layer. Most of the memory, as we saw earlier, is in the early convolution layers, because that is where you need the maximum number of neurons to hold all of those feature maps.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 35
Evolution of CNN Architectures for Image Classification - Part 02

(Refer Slide Time: 0:16)

Along with VGG in 2014 came another network called GoogleNet. GoogleNet, from Google, also focused on deeper networks, but with an explicit objective of efficiency: to reduce the parameter count, the memory usage and the computation. How did they go about achieving all of this? They had up to 22 layers in the network; remember VGG went up to 19, so GoogleNet went to 22 layers. The networks were getting deeper each year. Importantly, GoogleNet did not have any FC layers.

How did that help? Recall that the FC layers had the maximum number of parameters for both AlexNet and VGGNet. So, by getting rid of FC layers, they reduced the number of parameters. What else did they introduce? A module known as the inception module, which we will describe in detail over the next few slides; please do not worry if you are not able to see this figure clearly yet. GoogleNet had a total of 5 million parameters, which is 12x fewer than AlexNet and significantly fewer than VGG.

It was the winner of the 2014 ImageNet challenge, ahead of VGG. VGG was the runner-up that year, and GoogleNet won the challenge with a 6.7 percent top-5 error. So, as you can see, every year as the networks got deeper, the error kept dropping significantly.

(Refer Slide Time: 2:04)

Let us now talk about the inception module, which was a key innovation in their approach. The inception module looks somewhat like this; it is one of those units that keeps repeating in the architecture. The output of the previous layer is branched out into one branch which does a 1x1 convolution; in parallel, another branch does a 1x1 followed by a 3x3 convolution, another does a 1x1 followed by a 5x5 convolution, and another does max-pooling followed by a 1x1 convolution.

So, this was a local unit with parallel branches, and this local structure was repeated many times in the neural network, as you can see. These inception modules were stacked on top of each other to complete the GoogleNet architecture.

(Refer Slide Time: 3:03)

The naive inception module was not the final inception module used in the architecture; let us develop this slowly. The naive inception module takes the previous layer and, instead of committing to one particular filter size (the way VGG always relies only on 3x3), the GoogleNet designers decided to apply multiple filter sizes in parallel and concatenate all of them. So you have a 1x1, a 3x3, a 5x5 and even a 3x3 max-pool, and they concatenated all of those. The depth you get in this layer after the concatenation is the sum of the depths you get as the output of each of these convolutions or the max-pooling.

So, they had multiple receptive field sizes and a pooling operation, which they concatenated depth-wise. But there is a problem here; do you see it? The problem is that this is computationally very expensive. Remember we said for AlexNet that most of its computation is in its convolutional layers, even though they have few parameters; you have to take each filter and convolve it over every part of the image, and that can be pretty expensive. When the inception module does this across 1x1, 3x3 and 5x5 convolutions simultaneously and then concatenates the results, things become even more computationally expensive. So, what do we do?

(Refer Slide Time: 4:50)

They introduced what is called a 1x1 bottleneck layer to reduce the channel dimension before the expensive convolution layers. Let us look at what this means. If you have a 56x56x64 volume, you can do a 1x1 convolution across the depth. That filter would be 1x1x64, and if you apply it, the entire 56x56x64 volume collapses to 56x56x1: for each spatial location, a 1x1x64 convolution gives you one scalar as output, which means you are left with one pixel at each location in the next layer, leading to a 56x56 map.

Remember, the filter size for this 1x1 convolution is 1x1x64; that is the 1x1 convolution of the bottleneck layer we are talking about. But they had 32 such filters, that is, 32 such 1x1x64 filters, which results in the output becoming 56x56x32. So, by doing the 1x1 convolution, a 56x56x64 volume became a 56x56x32 volume.

You can imagine that instead of 32 filters, you could have chosen just 3 filters, 4 filters and so on; the depth could be reduced arbitrarily using this trick of 1x1 convolution, and that is what the GoogleNet designers did in the inception module. What does this help with? It preserves the spatial dimensions while projecting the depth to lower dimensions.
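
Here is a minimal PyTorch sketch of this 1x1 projection, using the same 56x56x64 example as above (the exact layer sizes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)            # (batch, channels, height, width)
bottleneck = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=1)
y = bottleneck(x)                         # 1x1 convolution across the depth
print(y.shape)                            # torch.Size([1, 32, 56, 56])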

(Refer Slide Time: 7:02)

The final inception module builds on the naive inception module we saw a couple of slides back. It does the same convolutions, but before the expensive ones it applies a 1x1 convolution to reduce the depth, which means the number of operations you have to do reduces, and only then does it perform the 3x3 or 5x5 convolutions. This trick makes it far more efficient in the number of computations. Also, you see here that there is a 1x1 convolution after the max-pool, instead of before it as in the other branches.

The reason is that max-pooling maintains the depth: it is done in a small spatial neighborhood on each channel of the volume, so the number of channels remains the same. They therefore first do the max-pooling and then a 1x1 convolution to reduce the number of channels. This inception module brings down the computations and thereby increases the efficiency of GoogleNet.

(Refer Slide Time: 8:25)

Here is the complete architecture. This is where the input comes in; then you have a small stem network, which is a Conv-Pool repeated twice; then come all these inception modules, stacked one after the other. At the end you have the classifier output, with no fully connected layers, and in between you see that there are auxiliary classifier outputs. This is because a gradient from the last layer may not reach the earlier layers when you have many stacked layers across the neural network.

So, they also had intermediate classification outputs, where they took the output of a particular layer, attached a small set of classification layers with their own loss, and backpropagated that loss's gradient from these locations to strengthen the gradient reaching the earlier layers for weight updates. As you can see, the classification layers, both here and here, had the structure average pool, 1x1 convolution, FC, FC, Softmax. There were 22 layers in total; the parallel branches inside an inception module are counted as one layer, and the auxiliary output layers are not counted. Just from one end to the other, GoogleNet has 22 layers in total.

(Refer Slide Time: 10:08)

As you can see, each year since 2012 the networks got deeper and deeper. So the obvious question researchers had the next year was: how far can we take this? They took it to about 150 layers, as we will see in a moment.

(Refer Slide Time: 10:28)

While asking this question, researchers tried to understand how deep you can go with a CNN. When they studied stacking deeper and deeper layers in a plain CNN, they found that a 20-layer CNN had a lower training error over the iterations of training than a 56-layer CNN. This was a little non-intuitive, and the same observation was made on the test error in the same setting: the 20-layer CNN also had a lower test error over iterations than the 56-layer CNN.

Why would this be happening? A deeper model performs worse than a shallower model, which was counterintuitive to how research with CNNs was progressing at that time. One possible explanation, an initial guess, could be that the deep model is overfitting, and that is why the test error increases for the 56-layer network: it has more parameters, is more complex, and hence overfits. But if that were true, the training error of the 56-layer network would have been lower than that of the 20-layer network.

Unfortunately, even there the 20-layer network had the lower error, which means the 56-layer network was also underfitting. What could be the reason for this to happen as the depth of the neural network increases?

(Refer Slide Time: 12:18)

The answer lies in what is known as the vanishing gradient problem. The complementary problem is known as the exploding gradient problem; let us talk about each of them one after the other. Take a simple feed-forward neural network, say with sigmoid activations in each layer. Here is a very simplistic visual for that: you have x_0, a weight w_1 in the first layer followed by a sigmoid, giving x_1; then w_2 is the weight in the second layer, followed by a sigmoid, and so on until the last layer.

If you now take a loss L and want to find ∂L/∂x_0, the gradient with respect to one of the initial layers' inputs, you have

∂L/∂x_0 = σ'(w_3^T x_2) w_3 · σ'(w_2^T x_1) w_2 · σ'(w_1^T x_0) w_1,

where σ' is the derivative of the sigmoid function. This comes from applying the chain rule: working out the gradient at each step, with a sigmoid activation function in each layer, you get this expression.

The key point I would like to draw your attention to is that the derivative of the sigmoid is at most 1/4. Why is that the case? This is how the derivative of the sigmoid function looks: it starts close to 0, peaks at 0.25, and then falls again. This comes from the shape of the sigmoid, which is steep in the middle and flattens out on both sides; since σ'(x) = σ(x)(1 − σ(x)) and σ(x) lies between 0 and 1, the derivative is capped at 0.25. This means ∂L/∂x_0 is a product containing several factors that are all less than one-fourth.

What do you expect will happen to ∂L/∂x_0? The product of very small values will quickly go to 0.
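
A quick numeric illustration of this argument: since each sigmoid derivative is at most 0.25, the product over many layers shrinks rapidly (the best case is shown below; in practice the factors are even smaller).

grad = 1.0
for layer in range(20):
    grad *= 0.25                 # upper bound on the sigmoid derivative at each layer
print(grad)                      # about 9.1e-13 after 20 layers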

(Refer Slide Time: 14:52)

And so you have the problem known as vanishing gradients. As you go deeper into a neural network, especially with such squashing activation functions, the gradients are less than one, and as they get multiplied over several layers during backpropagation, the gradient becomes feebler and feebler as it proceeds towards the initial layers. In other words, the gradients never effectively reach the initial layers, and those layers may get stuck near the random initialization we started with and never improve.

You could argue: if we randomly initialize those early layers and then never update them convincingly, why have those layers in the first place? Simply not taking such a deep network could be one way of side-stepping the vanishing gradient problem; but we will talk about a different approach in a slide from now. The complementary problem, known as exploding gradients, can happen when there is another activation function, such as ReLU, where the gradient could be 1 or greater than 1 depending on what the activations are.

As I said, with any function where these terms end up greater than 1, the product of those values as you go deeper through many layers can explode to a large value, and that also causes problems. However, exploding gradients are often easily handled by a trick known as gradient clipping, which is very popular in practice. You simply say that when the gradient goes beyond, say, 10, you cap it at 10: anything greater than 10 is just clipped to 10, and that ensures the gradient does not explode.
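
As a minimal sketch of gradient clipping in PyTorch (the tiny model, data and the threshold of 10 are purely illustrative):

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
# Clip the overall gradient norm at 10, as in the example above
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
# (element-wise clipping is also available via clip_grad_value_)
optimizer.step()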

(Refer Slide Time: 17:02)

Coming back to the vanishing gradients problem, it was addressed in a CNN architecture introduced in 2015 called ResNet, or residual networks. The goal was to ensure that a deep model should perform at least as well as a shallow model. How is this possible? The idea was to introduce what are known as residual blocks, which are connected through identity connections. What does that mean? In the original CNN, you have an input x; it goes through a few layers, say a Conv layer, then a ReLU, then another Conv layer, and gets transformed over those layers into a certain H(x). This could be one portion of a neural network, for instance.

In a residual block, the designers let x go through a Conv layer, a ReLU and a Conv, just as in a vanilla CNN; but they also added a side channel that takes the input x as it is and adds it back two layers later. What was originally the function H(x) now becomes F(x), the new transformation computed by those layers, plus an x that comes directly through the side channel and is added after the second Conv. Each such block, where the identity connection comes back and adds the original input, is known as a residual block. Why is it called a residual block?

Because originally these layers captured a function H(x); now they only need to capture the residual H(x) - x. Since the output of the block, F(x) + x, must equal H(x), we have F(x) = H(x) - x, which is the residual rather than H(x) itself.
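
A minimal PyTorch sketch of such a residual block is given below; batch normalization and other details of the actual ResNet blocks are omitted here for simplicity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv2(F.relu(self.conv1(x)))   # F(x), the residual transformation
        return F.relu(out + x)                    # H(x) = F(x) + x via the identity path

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)    # torch.Size([1, 64, 56, 56])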

(Refer Slide Time: 19:26)

The residual network was designed as a stack of many residual blocks. Each residual block had two 3x3 Conv layers, as you can see, and whatever representation came out at this particular junction was passed directly to the output of those two Conv layers. This was repeated again and again over the blocks in the neural network.

Periodically, the number of filters was doubled: 64, the number of filters in the initial residual blocks, became 128 after a set of residual blocks, which after a few more sets became 256 and then 512. The feature maps were also down-sampled spatially using stride 2, that is, divided by 2 in each dimension, to make them smaller as you went deeper. Remember that this helps you control the number of neurons in later layers by downsampling the feature maps. They also used a technique called global average pooling.

Global average pooling is like max-pooling or average pooling, but it is the average over an entire feature map. If you have a feature map, you take the average of all the pixels in that feature map and get one scalar. So, if you had k channels in a particular layer, by doing global average pooling you get k scalars, one scalar per channel, which is the global average of that channel. You can think of it as taking the average intensity of each feature map; that is global average pooling, and for each channel you get one global average pooled value.

So, if you had k feature maps, you would get a k-dimensional vector as the output of global average pooling. They had a single linear layer at the end; you can see that there are no fully connected layers here, except for the FC-1000. The FC-1000 is needed because ImageNet has 1000 classes, so it gives the 1000 outputs in the last layer, and finally a Softmax converts those logits into probabilities. They built networks with total depths of 34, 50, 101 and 152 layers for ImageNet, called ResNet-34, ResNet-50, ResNet-101, ResNet-152 and so on.

(Refer Slide Time: 22:17)

For very deep networks, beyond 50 layers in a ResNet, they also used a bottleneck layer: 1x1 convolutions to improve efficiency, similar to GoogleNet. Remember, a 1x1 convolution helps you reduce the depth to a smaller value so that you can perform your operations at that smaller depth; you can then do another 1x1 convolution to go back to the original depth, which is what they did to keep the depth in the same range while reducing the computations, with a couple of 1x1 or bottleneck convolutions placed periodically.

In the example shown, they start with a 28x28x256 input; the bottleneck reduces it to 28x28x64 with a 1x1 convolution, the 3x3 convolution is then done only on those 64 feature maps, and another 1x1 convolution takes it back to 28x28x256. This keeps the dimensions the same as the input, but these 1x1 convolutions make the number of computations far smaller and hence training more efficient.

(Refer Slide Time: 23:43)

With ResNet, they were able to train very deep networks; as we mentioned, they went up to 152 layers on ImageNet, and they found that deeper networks now perform better than shallower ones. Why so? Why does this identity map, or residual block, help train deeper networks? When we backpropagate, the gradient that arrives at a particular point in the residual network has a path through the layers, which may attenuate it, but there is also a side path along which the gradient can flow as it is to a previous layer, without any diminishing of its value.

These identity connections that connect residual blocks serve like highways for the gradient to flow through; they are like short circuits. The gradient flows through them directly, while part of the gradient still comes through the layers. That allows even the earlier layers to get a strong gradient signal, and now deeper networks perform better than shallow networks. ResNet won first place in all the ImageNet competitions and also in a competition called COCO that was introduced in 2015.

So they won first place in all the major challenges in 2015: ImageNet classification; ImageNet detection, where they were 16 percent better than the second best; ImageNet localization, where they were 27 percent better than the second best; COCO detection, where they were 11 percent better than the second best; and COCO segmentation, where they were 12 percent better than the second best. We will discuss detection, localization, segmentation and the COCO dataset a bit later; but just to let you know, this model swept through all the challenges in 2015 and has since been an important architecture used for various applications in computer vision.

(Refer Slide Time: 25:56)

The homework for this lecture is to go through this illustration of 10 CNN architectures, which is very well done and will perhaps help you understand these architectures better. If you are further interested, you can go through the respective papers, on ImageNet, ResNets, VGG and so on, which could help you understand some of the decisions made in the design of these architectures.

The exercise for this lecture is to show that minimizing the negative log-likelihood in a neural network with a Softmax activation function in the last layer is equivalent to minimizing a cross-entropy loss function. We are not going to work this out in the next lecture; please read Chapter 3 of Nielsen's online book if you would like to understand this further.

(Refer Slide Time: 26:57)

With that here are your references.

Deep Learning For Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 36
Recent CNN Architectures

In the architectures that we have discussed so far, you will notice that whatever components were added, be it the inception module, the bottleneck layer or the residual block, none of them affect backpropagation in any fundamental way. They are all just different ways of connecting inputs to outputs, and as long as one computes the chain rule carefully and accounts for the gradient through all possible paths that reach a particular node, backpropagation can be implemented as usual.

In this lecture, we will move on to talk about a few more recent CNN architectures that have been
developed since residual nets. ResNet won the ImageNet challenge in 2015 and has continued to
be a popular choice for several applications. We will talk about ways in which they have been
extended in this lecture.

(Refer Slide Time: 1:22)

In particular, we will talk about WideResNets and ResNeXt, which are extensions of residual nets, deep networks with stochastic depth, DenseNets, MobileNets, EfficientNet, squeeze-and-excitation networks and so on. Hopefully, this will give you a flavor of how CNN architectures can be improved.

(Refer Slide Time: 1:51)

One of the earliest works that followed ResNets tried to understand the importance of the identity mapping in residual networks. As we discussed in the last class, the identity mapping between residual blocks allows the gradient to flow through a highway during backpropagation. The creators of ResNet improved their design with this effort: if you notice, in the original residual unit there was a ReLU applied on the output of the residual block, after the addition.

They noticed that this could affect the flow of information along the direct path, and hence changed the design to bring the ReLU inside the residual block. They found that this created a more direct path for propagating information, because now the identity connection is not followed by a ReLU and the information passes as is. It was also found to give better performance in their empirical studies.

(Refer Slide Time: 3:16)

A popular variant of ResNets which is used a lot today is known as WideResNets. As the name suggests, it is an improvement of residual nets, but the design was based on the hypothesis that residual networks work because of the residual block, and not because of increasing depth. In residual networks, remember, the residual blocks were introduced to mitigate the vanishing gradient problem.

On the other hand, the original authors argued that this is what allowed us to create deeper and deeper networks, which gave better performance. In WideResNets, the authors argued instead that you do not need depth as long as you have strong residual blocks. To study that hypothesis, they increased the width of each residual block by multiplying the number of filters k-fold.

In the basic residual block, you had two 3x3 convolutions with F filters, or F feature maps, each; recall that this is the depth of the output of each layer. They increased each of them k-fold, making it F x k, and showed that with this increase of width in each residual block, a 50-layer WideResNet could outperform a 152-layer original ResNet. Increasing width instead of depth can also make computation efficient through techniques like parallelization.

If operations are done one after the other, which is what happens when you increase depth, it is not possible to parallelize them. But when you increase the width, you could potentially send each filter to a different computing unit, then retrieve the results and take them forward. So, increasing width instead of depth has a computational benefit too.

(Refer Slide Time: 5:38)

Another work, again by the creators of ResNet, was known as aggregated residual transformations, also called ResNeXt. As the visual shows, if the left block is the original residual block, the block proposed in ResNeXt on the right has multiple branches in each residual block. You can see that each branch has a 1x1 Conv with four filters, a 3x3 Conv with four filters, and then a 1x1 Conv with 256 filters; this reduces the depth to four and then increases it back to 256, keeping in mind the same idea as the bottleneck layer.

They did this along 32 parallel paths and showed that it could improve performance. In a sense, this tried to bring the idea of the inception module into residual networks.

(Refer Slide Time: 6:42)

Another interesting extension of residual networks was the idea of deep networks with stochastic depth. The way to think of this approach is as dropout applied to residual blocks: because residual blocks are connected through identity connections, you now have an interesting way of dropping entire layers. You could not have done this with simple feed-forward networks or feed-forward CNNs, because removing a layer would disconnect the network; but because of the identity connections through residual blocks, you can randomly drop a few of the residual blocks in every mini-batch iteration.

This can be done stochastically, by sampling from a uniform or any other probability distribution and accordingly deciding which blocks to drop. The motivation is that this further reduces vanishing gradients; it can also reduce training time, because forward propagation is now shorter, and it acts as an added regularizer, injecting some noise, just as dropout serves as a regularizer.

At test time, very similar to dropout again, you use the full network as it is.

(Refer Slide Time: 8:24)

Moving on from these methods came DenseNets, which, after ResNets and a variant known as Inception-ResNets, are perhaps the most popular choice for computer vision tasks. As the name suggests, DenseNet stands for densely connected convolutional network.

In this case, you have a dense block, similar to a residual block, but one in which each layer is connected to every other layer. Remember that in ResNets you had identity connections only periodically, after every residual block; in a DenseNet, within a dense block, you have identity connections from every layer to every other layer. These are also called skip connections.

These connections that take you from one layer to another are called identity connections because the weight on the connection is the identity. They are also called skip connections because you skip a few layers in between and go straight to a later layer. Similar to ResNets, this also alleviates the vanishing gradient, because those identity connections keep giving a stronger gradient through the DenseNet blocks.

They showed that using this approach, a shallower 50-layer network can outperform a deeper 152-layer ResNet. Of course, keep in mind that the word shallow here is relative to 152 layers. DenseNets are popularly in use today for image classification. You can see the overall architecture of this network on the far right: you have an input, a conv layer, and then comes one entire dense block with all of these dense connections.

Within each dense block, you have a conv layer, then you concatenate its output with the input; then another conv layer, and you again concatenate with the outputs of the previous conv layers as well as the original input; then another conv layer and another concatenation, and so on, before you come out of that dense block. After the dense block, you have conv, pool, then once again a dense block, conv, pool, another dense block, a pooling layer, a fully connected layer and finally a Softmax layer for classification.
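
A minimal PyTorch sketch of this dense connectivity is shown below; the growth rate and number of layers are illustrative choices, not the exact values used in the DenseNet paper.

import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    def __init__(self, in_channels, growth=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1)))
            ch += growth

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))   # each layer sees all earlier maps
            features.append(out)
        return torch.cat(features, dim=1)

block = TinyDenseBlock(64)
print(block(torch.randn(1, 64, 28, 28)).shape)        # torch.Size([1, 160, 28, 28])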

(Refer Slide Time: 11:12)

As architectures evolved through interesting uses of skip connections, all the variants of ResNet and DenseNets were different ways of using skip connections to avoid the vanishing gradient problem and to make computations more efficient. As CNNs became popular for vision applications, another need emerged: can we run vision applications on low-power devices, embedded devices, or what are known as edge devices, devices that sit at the edge, at the cusp of interaction with users or with an environment?

MobileNets was one such effort; it refers to a class of efficient models for mobile and embedded vision applications, first developed in 2017. Before we describe MobileNets, what are the desirable properties of a CNN for use in small devices? You need low latency, which means the forward pass of the neural network should run in real time, and low power consumption: running the neural network should not consume too much power. Often in these edge devices the power sources are very limited, and the device may have other components to run, so you do not want too much power to go into running a neural network. You also want the model size to be small; we saw that from AlexNet to VGG there was an increase in model size, but then with ResNets model sizes started decreasing while achieving similar performance.

And you still want sufficiently high accuracy; you may be fine with dropping one or two percentage points of accuracy in exchange for a significantly smaller model size and a low power footprint. MobileNets are aimed in that direction: they are small, low-latency networks which are trained directly. A complementary approach to this problem, which has also been popular over the last few years, is the idea of compressing neural networks: in compression, you prune out weights, or use other methods to remove the weights that you think do not matter for the outcome. MobileNets instead try to achieve efficiency by design.

(Refer Slide Time: 14:03)

The key ingredient of MobileNets is one of the variants of convolution that we talked about in the first lecture this week: depth-wise separable convolutions.

In the standard way of convolving, you would apply a Dk x Dk x M convolution, with N such filters. In depth-wise separable convolutions, you instead first convolve on each channel separately: that is, you apply a Dk x Dk x 1 filter, M different times. If there are M channels, you apply Dk x Dk x 1 on the first channel, Dk x Dk x 1 on the second channel, Dk x Dk x 1 on the third channel and so on, until all M channels are covered.

This gives you a set of M outputs. You concatenate these outputs and then run a 1x1xM convolution along the depth. Remember, each Dk x Dk x 1 filter gives you an output that is the same spatial size as the input (if you pad), so you have M such outputs; now you do a 1x1xM convolution across all of those channels, and with N such 1x1xM filters you finally get an output of the same shape as you would have got with a standard convolution.
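
A minimal PyTorch sketch of this factorization is shown below; the depth-wise step is implemented with a grouped convolution, and M, N and the kernel size are illustrative values.

import torch
import torch.nn as nn

M, N, K = 32, 64, 3
depthwise = nn.Conv2d(M, M, kernel_size=K, padding=1, groups=M)  # one Dk x Dk x 1 filter per channel
pointwise = nn.Conv2d(M, N, kernel_size=1)                       # N filters of size 1 x 1 x M
x = torch.randn(1, M, 56, 56)
y = pointwise(depthwise(x))
print(y.shape)                                                    # torch.Size([1, 64, 56, 56])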

Let us see this in some more detail. DSC, depth-wise separable convolution, replaces a standard convolution with a depth-wise convolution followed by a 1x1 (point-wise) convolution. As you can see, 1x1 convolution is a very popular choice for reducing computations in CNNs, and the depth-wise convolution applies a single filter to each input channel. How do you think this helps over standard convolution?

(Refer Slide Time: 16:13)

Let us take a few dimensions and analyze this carefully. Assume the input is Df x Df x M and the output feature map after a conv layer is Df x Df x N, assuming padded convolution so that the spatial size does not reduce. Let the convolution kernel be square with size K, that is, K x K. A standard convolutional layer then has K*K*M*N parameters: K*K for the filter size, M for the depth, and N such filters. Its computational cost is K*K*M*N*Df*Df, since Df x Df is the width and height of the input. Let us see what happens if you use a depth-wise separable convolution instead.

Depth-wise separable convolution factorizes the above computation into two parts: first depth-wise convolutions, then point-wise convolutions, which is the 1x1xM part. The depth-wise convolutions, which operate channel-wise, have K*K*M parameters, each filter being K x K with one filter per channel; their total cost is K*K*M*Df*Df. The point-wise convolutions are 1x1xM, with N such filters, that is 1*1*M*N parameters; their total cost is M*N*Df*Df. So what was multiplicative in the standard case now becomes a sum: K*K*M*Df*Df + M*N*Df*Df. By what fraction is the computation reduced?

You can try computing it yourself, but you can see that what was multiplicative has become additive, and that should tell you by what fraction the computations reduce. This is very similar in principle to the separable convolutions we saw way back in the first week. Remember, in separable convolutions, instead of using a 3x3 filter, which takes nine computations per location, you apply a 3x1 filter followed by a 1x3 filter along the other dimension, which takes three plus three, that is six computations.

It is a very similar principle here, and you can work out what fraction of computation is saved. To summarize, depth-wise convolutions filter the feature maps channel-wise, and point-wise convolutions combine the feature maps across channels; a standard convolution tries to do both in the same step.
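
As a quick numeric check of this reduction (with illustrative values of K, M, N and Df), the ratio of the two costs works out to 1/N + 1/K^2:

K, M, N, Df = 3, 32, 64, 56
standard_cost = K * K * M * N * Df * Df
separable_cost = K * K * M * Df * Df + M * N * Df * Df
print(separable_cost / standard_cost)     # about 0.127, i.e. 1/N + 1/K^2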

(Refer Slide Time: 19:35)

The MobileNet architecture also had a couple of other components to improve performance. After the depth-wise convolution they had a batch norm and a ReLU layer, and after the point-wise (1x1) convolution they again had a batch norm and a ReLU layer; that is what they found to give good empirical performance. To obtain faster and smaller models, they also had two hyperparameters: a width multiplier that controls the number of channels, and a resolution multiplier.

If M and N are the number of input channels and the number of output channels of the next layer, α is a constant that scales M and N based on the constraints under which you are operating the model. MobileNets was intended to be a family of models that can be designed for given constraints, rather than a single model.

Until now, every model we saw, ResNet, VGG, GoogleNet, ResNeXt or any of those variants, was one specific model with a specific architecture. MobileNets is about developing a family of architectures, where the architecture can be tuned based on certain constraints; the width multiplier α and the resolution multiplier ρ cater to that need of changing constraints. These could be hardware constraints or power constraints. Based on a particular constraint, you could use the width multiplier to automatically adjust the architecture: the M and N in the architecture get scaled down, or scaled up if you have additional resources. Similarly, the resolution multiplier scales the input image to a fraction of its size based on the constraints you may have in a particular setting.

That allows you to use the same idea to develop a family of models that cater to the different constraints of the setting in which the model is used.

(Refer Slide Time: 21:58)

A similar approach, known as EfficientNet, was proposed in 2019. It had a different premise and went about the idea differently, but the objective was similar: to develop a family of models that cater to the constraints you may have in terms of computational resources.

The way they approached this problem was to note that conventional wisdom says that if you scale up a CNN architecture in terms of width, depth or input resolution, the model should work better. But when they ran studies to observe the trend of increasing depth, increasing width and increasing resolution, they found that accuracy did increase but plateaued after a certain point.

They therefore explored a principled way in which a CNN can be scaled based on resource constraints.

(Refer Slide Time: 23:05)

As we just said, scaling up any single dimension independently improves accuracy, but the returns diminish for bigger models, which is what you see in these graphs. The first graph plots FLOPs on the x-axis and ImageNet accuracy on the y-axis. You can see that as w, the width of the model, keeps increasing, the FLOPs keep increasing and the accuracy also keeps increasing; but at a certain point the accuracy saturates even as you keep increasing the width. You see similar behavior for depth, and also for the resolution of the input image. So the inference they made was that it may be critical to balance all of these dimensions while scaling a model up or down.

It is not only about increasing width, as in WideResNets, nor only about increasing depth, as in ResNet itself; one should change the width, the depth and the input resolution carefully, in a balanced manner, to get maximum performance within your resource constraints. How do you go about doing this? They proposed a new scaling method known as compound scaling.

The idea is that, given a particular baseline architecture, you scale its depth, width and input resolution using a single scaling coefficient φ; that is why it is known as compound scaling. It is based on one single coefficient, but all your dimensions, depth, width and resolution, get changed based on that coefficient.

Through extensive empirical studies, they came up with the specific formulation you see on the right:

depth: d = α^φ; width: w = β^φ; resolution: r = γ^φ, subject to α * β^2 * γ^2 ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1,

where α, β and γ are constants.

You may wonder where this comes from. Firstly, why α * β^2 * γ^2 and not, say, α^2 * β^2 * γ^2? The reason is that convolutions are the costliest operation in a CNN: when you increase the depth, the number of operations increases linearly, but when you increase the width (the number of filters) or the resolution, the computations increase in a squared manner. That is why the relationship has this form, and it was confirmed empirically by the authors.

Now, why should the product be about two? One can explain this through an example. Assume φ = 1; in their framework this means you are now provided with double your current resources. You had a certain computational resource so far, for example a certain GPU, and today you bought another one, so your resources doubled. In that scenario φ is set to one, and you solve for α * β^2 * γ^2 = 2 with α, β, γ ≥ 1. They solve this using a grid search, and you would get values such as α = 1.1, β = 1.15, γ = 1.3, or some such combination that satisfies the constraint.

When you do that, d becomes α times the old depth, w becomes β times the old width, and r becomes γ times the old resolution. The cost of the network is proportional to d * w^2 * r^2, so the new cost is scaled by a factor of α * β^2 * γ^2 relative to the old cost, which is about two. So when φ = 1 and your resource availability has doubled, solving the equation α * β^2 * γ^2 = 2 allows you to scale your depth, width and resolution in a way that takes you to at most about two times your current number of computations.

That is the way you go about doing it; once you decide your α, β, γ for a given problem, you then vary φ to get a family of models.

(Refer Slide Time: 28:36)

One question here: for any compound coefficient φ, the total FLOPs will increase by approximately 2^φ. Why is that so? I partially gave you the answer already. If you go back and look at what I just said: remember that for φ = 1, d becomes α times the old depth.

Remember, these are multipliers, scaling coefficients: when we say d = α^φ, we mean that if α is 1.2 and φ is 1, then the depth becomes 1.2 times the old depth. The total number of computations is proportional to d * w^2 * r^2; for width and resolution, computations increase in a squared manner. After scaling, the new number of computations is therefore (α * β^2 * γ^2) times the old one, which by the constraint is about two times your earlier computations.

That was for the choice φ = 1. If you choose φ = 2, you can work this out in a similar manner and show that the number of computations increases by about 2^φ, since each dimension is raised to the power φ. Using this approach you first fix φ = 1, do a grid search on α, β, γ solving the equation α * β^2 * γ^2 = 2, and that gives you particular values of α, β, γ.

Then, for that choice of α, β, γ, you vary φ based on whatever your computational budget is. As I said, if your computational budget doubles, set φ equal to one; if your computational budget is smaller or larger, change φ accordingly. Using this, you get a family of models where each model has scaled depth, scaled width and scaled input resolution, and with this they show that they can outperform MobileNets and achieve significantly high accuracy even under resource constraints.
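
The following small Python sketch illustrates the compound scaling rule; the alpha, beta, gamma values are illustrative numbers chosen to roughly satisfy alpha * beta^2 * gamma^2 ≈ 2, not the exact grid-search results.

alpha, beta, gamma = 1.2, 1.1, 1.15       # illustrative base factors for depth, width, resolution

def compound_scale(phi):
    d = alpha ** phi                      # depth multiplier
    w = beta ** phi                       # width multiplier
    r = gamma ** phi                      # resolution multiplier
    flops_factor = d * w ** 2 * r ** 2    # = (alpha * beta^2 * gamma^2) ** phi, roughly 2 ** phi
    return d, w, r, flops_factor

for phi in (1, 2):
    print(phi, compound_scale(phi))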

The baseline model in this approach is obtained through a method known as neural architecture search. The baseline was not a ResNet; it is not as if they took a ResNet or an AlexNet and then scaled its width, resolution and depth. The baseline model, which is then scaled based on different constraints, is found using neural architecture search. This is a topic we will cover much later in this course, since it is an advanced topic and needs a few more concepts before we can talk about it.

(Refer Slide Time: 31:57)

The last architecture that we will talk about in this lecture is again a fairly recent one, introduced in 2018, known as the squeeze-and-excitation network. It is once again a network designed to improve the model, in this case by improving its representations at little added cost. The representations of a CNN are the feature maps output by its layers; they wanted to improve those feature maps by modeling the interdependencies between channels in each convolutional layer.

This was achieved by introducing a new architectural unit known as a squeeze-and-excitation block, or SE block, which consists of two operations: a squeeze operation that embeds global information, and an excitation operation that recalibrates the feature maps channel-wise. Let us see how this happens.

To understand it, consider an input x with dimensions H' x W' x C', and let F_tr be a transformation that takes you from x to u, which has dimensions H x W x C.

(Refer Slide Time: 33:35)

The SE block works on u, in the way shown in this visual. You have an input x; F_tr, which could be a convolutional layer, takes x to u with dimensions H x W x C.

The squeeze-and-excitation block is inserted at the output u, and it does two things, as we just said. First, there is a squeeze function, denoted F_sq in this visual, where you take each of the C channels in u and do global average pooling. As we discussed earlier, in global average pooling you take the average of all the pixels in each channel.

So, for each of the C channels in u, taking the average by global average pooling gives you one scalar per channel: this channel gives you one scalar, the next channel, whatever its values, averages to another scalar, and so on for each channel. Together these form a 1x1xC vector; that is the squeeze part.

Then there is an excitation function, denoted F_ex here, which learns a set of weights. Squeezing is only global average pooling, and remember that pooling layers are parameter-free; they only sub-sample. The excitation function, however, is parameterized by some weights W, and it re-weights each of these scalars by values computed from those weights.

What are we trying to do? We are trying to learn the relationships between the channels and weight each channel's output accordingly. The learned outputs of the excitation function are then used to multiply each of the channels in u by these values to get the new output X̃. You can see that X̃ has the same dimensions H x W x C, but now the values in each channel are scaled by the values learned through the excitation function.
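
A minimal PyTorch sketch of such an SE block is given below; the reduction ratio r and the channel count are illustrative.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, u):                       # u: (B, C, H, W)
        s = u.mean(dim=(2, 3))                  # squeeze: global average pool -> (B, C)
        w = self.excite(s)                      # excitation: per-channel weights in (0, 1)
        return u * w.view(u.size(0), -1, 1, 1)  # rescale each channel of u

se = SEBlock(256)
print(se(torch.randn(1, 256, 28, 28)).shape)    # torch.Size([1, 256, 28, 28])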

(Refer Slide Time: 36:16)

The SE block can also be placed inside a ResNet, in the way you see in this visual. You have a residual block, and its output is passed through an SE block embedded inside the ResNet. In this case, the output of the residual block is H x W x C. The first step is global average pooling, after which the size becomes 1x1xC. Then, using a fully connected layer, you reduce the size to 1x1x(C/r).

Remember the bottleneck layer, where you try to reduce the number of computations; it is the same idea here, with r a hyperparameter that controls the size of that hidden layer. You then apply a ReLU, increase the size back to C with another fully connected layer, apply a sigmoid, and use the result to scale the original volume that came out of the residual block. This is known as SE-ResNet, where the squeeze-and-excitation block is embedded into the ResNet module.

Just to explain this: the output of F_ex is a set of C numbers between 0 and 1, each detailing how much attention the corresponding channel receives; we talked about it as the scaling for each channel. This becomes a simple way to artificially increase the model's capacity, since scaling each channel gives an effect similar to added depth, and it can be added to a wide variety of convolutional architectures, not just ResNets, helping improve performance with very little added cost.

(Refer Slide Time: 38:06)

To conclude, please follow the readings of Lecture 9 of CS231n. There is a very nice Google AI blog post on MobileNets worth reading, and an optional lecture by Svetlana Lazebnik. Your exercises: by what fraction is computation reduced when depth-wise separable convolution (DSC) is used instead of standard convolution (from slide ten)? And why, for a compound coefficient φ in EfficientNets, do the total FLOPs increase by approximately 2^φ? We partially answered these questions in this lecture, so we will not answer them next time; they are a couple of things for you to think through at the end of this lecture, to check that you have understood what we discussed.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 37
Fine tuning in CNNs

So far we have seen several variants of CNN architectures, which firstly try to increase the
depth to improve performance, then try to make computations more efficient, while not
sacrificing performance. And then, finally, variants in which it was okay to sacrifice some
performance, because you needed to bring the CNN into a lower regime of computations.

We saw all of these variants. And in all of these variants, including the squeeze excitation
network, the back propagation of the CNN by itself was not affected, which let us play with
these architectures. In this lecture, we will talk about an important aspect of CNNs that
allows us to design architectures for newer tasks or newer domains, given an architecture that
works on a particular domain. For example, if we know AlexNet works on the ImageNet dataset,
how can we leverage that knowledge to develop a model for another domain or another task?

(Refer Slide Time: 01:30)

To discuss this, let us start with the limitations of CNNs themselves. CNNs have a few fundamental problems. Firstly, optimization of the weights or parameters in these deep models (CNNs are typically deep models) is generally hard: it requires proper weight initialization and a lot of tuning of various kinds of hyperparameters. That is one of the fundamental limitations. Secondly, CNNs can suffer from overfitting, because the data samples used for training are generally much smaller in number than the parameters being trained. We already saw that AlexNet has about 60 million parameters, VGG has 138 million, Inception v1 (the first version of GoogLeNet) had 5 million parameters, an improved version of it called Inception v3 had 23 million parameters, ResNet-50 had 25 million parameters, and so on.

So CNNs fundamentally require a long time and a lot of computational power to train. AlexNet, as we said, took about a week to train on 2 Nvidia GTX 580 GPUs, VGG took two to three weeks on 4 Nvidia Titan Black GPUs, and Inception v1, or GoogLeNet version 1, took about a week on hardware that was not shared.

All of this becomes even more difficult when you have to design an architecture for a new domain: you have to design an architecture, run it for several days or a week, see how it works, go back and change the hyperparameters, run it for one more week, see how it works again. This kind of approach to designing new CNN architectures will not scale. So how do we manage this for newer tasks and newer domains?

(Refer Slide Time: 03:36)

Some things you can try: perhaps a good weight initialization, so that you become less dependent on the design of the architecture — we have seen Glorot or He initialization, that is one possibility. You can also initialize your CNNs with hand-designed filters: you could use an edge filter, or one of those filters, to initialize the filters in convolution layers, and then ask the CNN to fine-tune them. Or, we have already seen unsupervised greedy layer-wise pre-training as another approach that could provide an initialization which the neural network can then capitalize on.

As we said earlier, this approach of unsupervised greedy layer-wise pre-training is not used these days, due to the increased computational power and data set sizes; it takes a long time to do greedy layer-wise pre-training merely as an initialization method. The other thing that we talked about to improve training of neural networks is to use regularization methods, be it L2 weight decay, L1 weight decay, dropout, batch norm, adding input noise, adding gradient noise, data augmentation, and so on. This may alleviate overfitting, but may not really lead to faster training of CNNs. So what do we do?

(Refer Slide Time: 05:01)

An interesting property of CNNs, which has been observed and which we will focus on in next week's lectures, is that if you visualize the filters learned by the layers of a convolutional network after training, one observes that the earlier layers have filters that capture oriented edges of different kinds, color blobs, and so on.

In other words, these are what we call low-level features — simple low-level features that we get out of images. Filters learned by intermediate layers in a CNN seem to capture mid-level features, which could be considered combinations of low-level features: a certain texture that is a combination of two edges, or a certain combination of colors and edges, and so on.

And finally, the filters at later layers learn high-level features, such as impressions of objects, which are higher-level abstractions compared to the edges in the initial layers. The question we ask now is: what can we do with this? Can we use it in some way? Can we use this understanding of what CNN filters learn to design architectures better for newer tasks and newer domains?

(Refer Slide Time: 06:36)

The answer lies in a setting in machine learning known as transfer learning. Before we get to transfer learning: in traditional machine learning, if one had different tasks, you would have a data set for each of these tasks, and using the data set for each task, you would learn a different model through a learning system.

In transfer learning, you have a set of source tasks that are given to you, and a target task on which you want to perform well — that is the new domain, the new task, on which you want to solve the problem. When we say source and target, you could imagine, for example, that you have trained AlexNet on the ImageNet dataset and tomorrow you have data coming from a different domain.

For example, let us say you want to build recognition of objects for self-driving cars in India; these objects may be different from the objects in ImageNet. So how do you transfer the model that you learnt on ImageNet to a model that you want to develop for the new domain? In this case, you try to use the data in the target domain, as well as the model or the knowledge that you got from the source domain, to learn a model for the target domain. This is what we call transfer learning: transferring knowledge from an earlier source task to a target task that we are currently interested in. How does this relate to the hierarchical learning of feature abstractions in CNNs? We will see that in a moment.

(Refer Slide Time: 08:23)

So we know now that using knowledge learned on a different task to aid the training of the current task is transfer learning. We also know that we have pre-trained models with good results available to us — for example, an AlexNet pre-trained on ImageNet is already given to us. Can we leverage this in some way to improve performance on a target task?

We can, in a couple of ways. We can do this because we know that AlexNet's filters — or any other CNN model's filters, for that matter — contain rather general features in the first few layers, and the features slowly start getting specialized as you go to deeper layers. Why do we say that?

(Refer Slide Time: 09:15)

If you go back to an earlier slide, you see that the earlier layers' filters seem to capture general information like edges and color blobs, whereas the later filters seem to capture some object-specific information, which could be specialized for that task.

(Refer Slide Time: 09:40)

So what we can do now is take a pre-trained neural network, such as an AlexNet, to a newer domain, and keep all of those AlexNet weights exactly the same. Do not update them for this new task; instead take only the last layer, which we call the classification layer, and train only those parameters for the newer task.

We call this fine-tuning, where you initialize the weights of a network for a new task with the pre-trained weights of another model, and you fine-tune only certain layers in the new network for the new task. The first setting here trains only the classification layer.

You could also use the pre-trained weights as an initialization and then fine-tune the entire network, or certain layers at the end of the CNN — not just the classification layer; you could take the last 3 layers or the last 4 layers, anything of your choice, to fine-tune to better model the target task. Which one you choose depends on your data set size and on how similar your target task is to the source task.

(Refer Slide Time: 11:05)

Let us see a few possibilities and see what we can do. Suppose your data set is fundamentally small, and your target and source data sets are fairly similar to each other. Because the target and source datasets are similar, you could say that the specialized features are likely to remain the same for models trained on both these datasets.

So what you can do now is take the source task's generic layers and the source task's specialized layers — this entire block that you see in the middle could be an entire pre-trained network, such as an AlexNet or a ResNet, trained on another data set, which we call the source task. We leave that as it is; we randomly initialize the classification layer alone and train only that layer for the outputs of your new task.

This is because your data set is small: you may not have enough data to retrain or fine-tune all the feature abstraction layers, so you may as well focus only on training the classification layer properly. In this case, we are initializing the new network with the pre-trained weights and we are fine-tuning only the classification layer.
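As a concrete illustration, here is a minimal PyTorch sketch of this setting, using the torchvision pretrained AlexNet as a stand-in for the source model; the number of target classes and the optimizer settings are assumed placeholders, not part of the lecture.

import torch
import torch.nn as nn
from torchvision import models

num_target_classes = 10                       # assumed size of the new target task

model = models.alexnet(pretrained=True)       # source model trained on ImageNet

# Freeze every pre-trained parameter: these weights will not be updated.
for p in model.parameters():
    p.requires_grad = False

# Replace only the last (classification) layer; it is randomly initialized
# and is the only part that will be trained for the target task.
in_features = model.classifier[6].in_features
model.classifier[6] = nn.Linear(in_features, num_target_classes)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)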

(Refer Slide Time: 12:26)

Another setting is where the data set is small and, this time, the target and the source data sets are dissimilar. This means the generic features could still be common — the edge detectors and those kinds of filters could be shared by the source and target tasks — but the specialized features may not really be common between the two tasks.

What can we do here? We now take the pre-trained CNN and use only the feature maps of its generic layers — by generic layers we mean the early layers of the CNN. You take those feature maps from your pre-trained network as the output, and then train another classifier, such as a support vector machine (it could also be a neural network), to give you the output for the target task.

Why are we doing this? Because the data set is small and the target and source data sets are dissimilar, we cannot use the specialized features that the pre-trained network learned on the source task. We assume instead that the generic representations that we get from the first few layers — by features or representations we mean the feature maps for each input — will help us improve performance with a classifier on top, even though the data set is small.
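A minimal sketch of this feature-extraction setting, assuming a pretrained torchvision AlexNet as the source network and scikit-learn's SVM as the separate classifier; the random tensors stand in for your actual target-domain images and labels.

import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

model = models.alexnet(pretrained=True).eval()

# Use only the generic (early) layers as a frozen feature extractor:
# convolutional features, average pooling, then flatten to a vector.
feature_extractor = nn.Sequential(model.features, model.avgpool, nn.Flatten())

# Placeholder target-domain data: 100 images of size 224x224 with binary labels.
images = torch.randn(100, 3, 224, 224)
labels = torch.randint(0, 2, (100,)).numpy()

with torch.no_grad():
    feats = feature_extractor(images).numpy()   # one feature vector per image

clf = SVC(kernel="linear")                      # a separate classifier on top
clf.fit(feats, labels)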

(Refer Slide Time: 13:57)

A third setting is when your data set is fundamentally large. When your data set is large, you have more luxury: you can use the pre-trained network as a good initialization and fine-tune the entire network on the target data set. But you can also do a couple of things intelligently to improve performance.

While fine-tuning, you can keep the learning rate quite low. Remember, the learning rate is the coefficient of the gradient in your gradient descent method; you keep it low so that your pre-trained parameters are not changed significantly. You are saying: I still want my overall network weights to be similar to the weights that I learned on my source task; I have the data and I am going to update them, but let me keep the learning rate low so that I do not make too many updates to my weights from what they were on the source task.
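One way to realize this "fine-tune everything, but gently" idea is to give the pretrained backbone a much smaller learning rate than the freshly initialized classification layer, as in the sketch below; the choice of ResNet-50 and the specific learning-rate values are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision import models

num_target_classes = 100                          # assumed target task size
model = models.resnet50(pretrained=True)          # pre-trained source network
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

# The pre-trained backbone gets a small learning rate so its weights stay
# close to the source-task weights; the new head gets a larger one.
backbone_params = [p for name, p in model.named_parameters()
                   if not name.startswith("fc.")]
optimizer = torch.optim.SGD([
    {"params": backbone_params, "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-2},
], momentum=0.9)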

On the other hand, if the data set is large but very different from the source, there are explicit methods for this, such as what is known as transitive transfer learning. You can either train from scratch — you do not need to fine-tune at all; you can randomly initialize using Glorot initialization and train from scratch on the new target task, because you have a large data set and you are saying the target and source tasks are not similar to each other — or use methods such as transitive transfer learning, which is a known method in traditional machine learning and can help you improve performance.

The main idea of this discussion is to tell you that the design of CNN architectures — while we saw a few different designs — is not trivial when you face a new problem. There are many hyperparameters: the number of layers, their resolution, the number of filters, and also the learning hyperparameters; coming up with the appropriate combination for a domain and a task can by itself be a very difficult task.

And using the idea of transfer learning and fine tuning could help you take an existing
architecture that has solved a different problem and adapt it to a new problem. This is a very
common procedure that people follow to take CNNs to newer domains and newer tasks.

(Refer Slide Time: 16:33)

For more reading, you can read chapter 9 of the deep learning book and the lecture on transfer learning in CS231n. And if you are interested, have a look at one of the earliest articles that tried to study how transferable the features of CNNs are, in the context that we just discussed.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 38
Explaining CNNs - Visualization Methods

Continuing from where we left off last time: this week, we will focus on understanding, visualizing, and explaining the model predictions of a CNN — a convolutional neural network — or, for that matter, any other neural network. We will start with the first lecture, on visualization methods that look at the different kernels or filters in a CNN, or perhaps the activations in a particular layer of a CNN, or other methods that we will see later in this lecture.

(Refer Slide Time: 01:08)

Most of these lecture slides are based on lecture 13 of CS 231n at Stanford, and some of the
content is borrowed from the excellent lectures of Mitesh Khapra at IIT, Madras.

(Refer Slide Time: 01:16)

Let us start with the simplest form of visualization, which is visualizing the filters or kernels
themselves. Remember, that when you have a CNN, in every convolution layer, you have a
certain number of filters. For example, if you recall, in the AlexNet, the first convolutional
layer had 11x11 filters, it had 96 of them, 48 going to one GPU, and 48 going through the other
GPU if you recall AlexNet architecture.

In this particular slide, we are looking at a variant of AlexNet, which was developed by Alex Krizhevsky a little later, in 2014, when he came up with a method to parallelize CNNs; this is just an example chosen so we can visualize this more easily. In this variant of AlexNet, the architecture had 64 filters in the first convolutional layer.

So what you see on the top left here is 64 filters, each 11x11 in size, and each of them has three channels — the R channel, G channel, and B channel, the three colors. That is the total set of filters. We can visualize each of them on a grid such as this; remember that a filter is an image in its own right — just as convolution is commutative, you can always choose to look at an image as a filter or a filter as an image, it does not matter.

Any matrix of the size of the filter can also be plotted as an image. When you do that, you get something like what you see on the top right here. Let us look at some of them more carefully. If you visualize these filters closely, you see that there are filters that try to capture oriented edges: you can see this one on the bottom row, fourth from the left, which looks like a Gaussian edge detector that smoothens along a certain orientation. Similarly, you have another edge detector here, and another edge detector on top. You also have some which capture slightly higher-order variations, such as a checkerboard kind of pattern or a series of striations, and so on. You also have color-based edge detectors: in the last filter here on the bottom right, you see an edge detector that goes from green to a pinkish or red color, and you see similar color-based filters on the top left here.

So is this a characteristic of AlexNet alone? No — if you take the filters of ResNet-18, ResNet-101, or DenseNet-121, in each of these the filters in the first convolutional layer have a very similar structure. All of them detect edges of different orientations, certain higher-order interactions such as checkerboard patterns, striations in different orientations, color blobs, and certain color gradations, that is, edges in different colors, and so on.

You will see this as part of the assignment this week where you try out some of these
experiments. So this tells us that the first layer seems to be acting like low-level image
processing, edge detection, blob detection, maybe a checkerboard detection, so on and so forth.
Remember here, that these are filters that are completely learned by a neural network, which
we did not prime in any way.
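If you want to reproduce this kind of grid yourself, one simple way is to read out the first convolutional layer's weights of a pretrained torchvision model and plot each filter as a small RGB image; the sketch below assumes AlexNet and matplotlib, and normalizes each filter to [0, 1] purely for display.

import matplotlib.pyplot as plt
from torchvision import models

model = models.alexnet(pretrained=True)
filters = model.features[0].weight.data.clone()    # first conv layer: 64 x 3 x 11 x 11

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    f = (f - f.min()) / (f.max() - f.min())        # per-filter normalization for display
    ax.imshow(f.permute(1, 2, 0).numpy())          # C x H x W  ->  H x W x C
    ax.axis("off")
plt.show()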

(Refer Slide Time: 05:32)

You can also visualize the kernels of higher layers, just like we did for the first convolutional layer: you could take all the filters of the second convolutional layer, the third convolutional layer, and so on. But it happens that, across applications in general, they are not that interesting. We did see an example last week where we took face images and showed that filters in the first layer correspond to low-level image features; then we talked about middle layers extracting noses and eyes and so on, and the later layers extracting face-level information. That does happen in certain applications — if you focus only on faces or a smaller group of objects, maybe you could make sense of the higher layers' filters — but it is harder when you have a wider range of objects.

(Refer Slide Time: 6:52)

In a more general context such as ImageNet, which has a thousand classes in the data set, these kinds of visualizations of filters at higher layers are not that interesting. Here are some examples; remember that in a CNN, the weights are the filters themselves.

So if you look at the weights in a later layer, you see that they may not be that interesting for understanding what a CNN is learning. That is because of the variety of classes, which may result in various abstractions across the data set. Moreover, the input to the higher layers is no longer the images we understand: at the input layer we know what we are providing as input, but at higher layers we do not know what is being provided as input, so it becomes a little more difficult to understand what is happening.

(Refer Slide Time: 07:21)

However, if you take the filters of the first layer alone, across various models and datasets — hopefully by now you are familiar with the various CNN models such as AlexNet, ResNet, DenseNet, VGG, and so forth — you get very similar kinds of filters in the first convolutional layer across all of these models, and this is generally called Gabor-like filter fatigue. Why is that so? Recall the Gabor filters discussion that we had earlier in the course, where we said a Gabor filter is like a combination of a Gaussian and a sinusoid.

You can change the scale and the orientation of the Gabor filter, and you end up detecting edges in different orientations, perhaps different striations, checkerboard patterns, and so on — which is exactly what we see the CNN learning on its own too. That is the reason why we call this entire visualization of the filters of the first convolutional layer Gabor-like filter fatigue; by fatigue here we just mean that it is exactly the same across all of these models and data sets.

(Refer Slide Time: 08:45)

Another option, other than visualizing the filters in different layers (when we visualize the filters, remember, it is an 11x11 or a 7x7 or whatever be the size of the filter, and you simply plot it as an image), is to visualize the representation space learned by the CNN. What do we mean? If you take AlexNet, remember that the output of FC7 — the fully connected layer at the 7th position in the depth of the network, which we denote as FC7 — is a 4096-dimensional vector.

That is the layer immediately before the classification layer. So what we can do is take all the images in your test set (or validation set, for that matter), forward propagate those images until this particular layer, and collect all these 4096-dimensional vectors.

(Refer Slide Time: 9:48)

What do we do with them? You can now visualize the space of these FC 7 feature vectors by
reducing the dimensionality from 4096 to any dimension of your choice. But for simplicity, let
us say two dimensions. How do we do this? Once again, hopefully, you have done a basic
machine learning course.

And you know that you can use any dimensionality reduction method to be able to do this. A
simple example could be principal component analysis. So you take all of those 4096-
dimensional vectors of several input images, and you do a PCA on top of them to bring all of
them into two-dimensional space.

A more popular approach, which is considered to be a very strong dimensionality reduction method, is t-SNE, which stands for t-distributed Stochastic Neighbor Embedding; this was a method developed by Laurens van der Maaten and Hinton in 2008. We also have a link for this towards the end of this lecture.

So you can play around with t-SNE if you would like to understand it more. When you apply t-SNE on the representations that you get at the output of the CNN's penultimate layer, you end up getting results such as this. This is for the MNIST data set — the handwritten digit data set, where you have 10 classes — and you see here that each class invariably goes to its own cluster.

This seems to tell us that, while we cannot in reality visualize a 4096-dimensional space, by bringing it down to two dimensions we see that the representations belonging to different classes are fairly well separated into different clusters. And why is that important? Developing a classification algorithm on these representations now becomes an easy task, and that is why having a classification layer right after that penultimate layer of representations makes the entire CNN classify well.
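A sketch of this pipeline: forward-propagate images up to the penultimate, FC7-like layer of a pretrained AlexNet, collect the 4096-dimensional vectors, and run scikit-learn's t-SNE on them; the random image batch and the perplexity value are placeholders for a real test set and a tuned setting.

import torch
import torch.nn as nn
from torchvision import models
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model = models.alexnet(pretrained=True).eval()
# Drop the final classification layer so the network outputs the
# 4096-dimensional penultimate (FC7) representation.
model.classifier = nn.Sequential(*list(model.classifier.children())[:-1])

images = torch.randn(200, 3, 224, 224)          # placeholder for real test images
with torch.no_grad():
    codes = model(images).numpy()               # 200 x 4096 feature vectors

emb = TSNE(n_components=2, perplexity=30).fit_transform(codes)
plt.scatter(emb[:, 0], emb[:, 1], s=5)
plt.show()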

(Refer Slide Time: 11:58)

Here is an example of the same for ImageNet. This is a two-dimensional plot of various images in the ImageNet data set, taken to the 4096-dimensional space by AlexNet and then brought down to two dimensions and plotted on a two-dimensional map; the only thing we are doing here is placing the respective image at each location, just to understand what is happening. This is a huge map, so let us zoom in to one particular part of it.

(Refer Slide Time: 12:42)

You see that all the images corresponding to, say, a field seem to come together in this space of representations. For that matter, if you scroll around and look at other parts of it, you will see at many points of these embeddings that very similar objects are grouped together — you see all the cars somewhere here, and so on. This tells us that these embeddings or representations that you get out of the penultimate layer actually capture the semantic nature of these images: objects that have similar semantics are grouped, while objects with different semantics are far apart from each other. So this gives us the understanding that CNN representations seem to be capturing semantics.

Keep this in mind: when we talked about handcrafted features versus learned representations, this is what we meant. With handcrafted features such as SIFT or HOG or LBP, you have to decide what may be useful for a given application and then hand-design the filter that you want to use as a representation of the image, after which you may apply a machine learning algorithm. But now we are letting the neural network — the CNN in particular — automatically learn the representations it needs to solve a particular task.

(Refer Slide Time: 14:16)

Here is a visualization of the Conv5 feature maps, in AlexNet the Conv5 feature map is
128x13x13. So there are 128 feature maps each 13 by 13. If you now visualize them as
grayscale images, you can see something interesting here. So when this specific image with
two people is given as input, you see that one of the filters, or actually, there are quite a few of
them in fact, seem to be capturing the fact that there are two entities in the image. So this could
give you a hint that the later layers in the CNN can capture these higher-level semantics of the
objects in the images.

(Refer Slide Time: 15:00)

Another way of visualizing and understanding CNNs is, you could extend the same thought
and consider a trained CNN. Remember that all of this analysis is for a trained CNN, after
training the CNN, you want to understand what it has learned. Remember, that is the context
in which we are talking about this.

So you can consider a CNN, and consider any single neuron in its intermediate layers, so let us
consider that particular one in green. Now, you can try to visualize which images cause that
particular neuron to fire the most. So you can give different images as input to the CNN, and
keep monitoring that particular neuron and see which image is making it fire the most. What
can we do with that?

Now we can work backward and understand that this particular pixel here will have a certain
receptive field in the previous convolution layer, which means that is the region that led to this
pixel being formed in this particular convolutional layer. Similarly, you can take the pixels in
the previous layer, and look at the receptive field of each of them and find the receptive field
in the earlier layer, in this case, the first convolution layer, you can go further again and find
out the receptive field in the original image, which was responsible for this pixel in the third
convolutional layer.

Remember, we also discussed this when we talked about backpropagation through CNNs, where we tried to understand the receptive field that leads to a particular pixel getting affected in a particular layer. It is the same principle here.
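A practical way to find which images fire a given neuron the most is to register a forward hook on the layer of interest, record that neuron's activation for every image, and keep the top scorers. The sketch below does this for one channel of an intermediate AlexNet layer; the layer index, the channel index, and the random image batch are all arbitrary assumptions.

import torch
from torchvision import models

model = models.alexnet(pretrained=True).eval()
layer = model.features[10]        # an intermediate conv layer (assumed choice)
channel = 5                       # the neuron/channel we monitor (assumed choice)

recorded = {}
def hook(_module, _inputs, output):
    # Record the mean activation of the chosen channel for each image.
    recorded["score"] = output[:, channel].mean(dim=(1, 2))

layer.register_forward_hook(hook)

images = torch.randn(64, 3, 224, 224)             # placeholder image batch
with torch.no_grad():
    model(images)

top = torch.topk(recorded["score"], k=5).indices
print("Images that most activate this channel:", top.tolist())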

(Refer Slide Time: 16:56)

Now, if we take images and try to understand which of them caused a particular neuron to fire, we end up seeing several patterns. In this set of images, each row corresponds to one particular neuron that was fired, together with the set of images — and the region inside each image, shown as a white bounding box — that caused that neuron to fire.

So here is the first row: interestingly, all of those images correspond to people, especially busts of people; it looks like that particular neuron was capturing people from the chest up. The second neuron here seems to be capturing different dogs, and maybe some honeycomb kind of pattern — even a US flag shows up, where that honeycomb-like pattern is presumably present.

A third neuron captures a certain red blob across all of the images. The fourth neuron captures digits in images. And if you look at the last neuron here, in the sixth row, you see that it seems to be capturing specular reflection in all of the images. So over time, as you train the CNN, each neuron gets fired for certain artifacts in the images.

And this should help you connect back to dropout, where we try to ensure that no particular neuron or weight overfits to the training data, and we allow all neurons to learn a diverse set of artifacts in images.

(Refer Slide Time: 18:40)

Here are further examples of the same idea, where you take different neurons in the CNN and see which images or patches of images fired each neuron the most. Once again, you see a fairly consistent trend: some of them seem to fire for what looks like the eye of an animal, there is text in images, there is vertical text in images, there are faces, there are dog faces again, and so on.

(Refer Slide Time: 19:12)

And the last method that we will talk about in this lecture is what is known as occlusion experiments, which directly address our objective: we finally want to understand which pixels in an image correspond to the object recognized by the CNN. Why does this matter? We would like to know whether the CNN looked at the cat in the image while giving the label "cat" for the image, or whether it looked at a building in the background or grass on the ground. Remember, a neural network learns correlations between various pixels present in your data set to be able to give good classification performance. So if all of the cats in your data set were found only on grass, the neural network could assume that the presence of grass means the presence of a cat.

If you then have a test set where a cat is found on a different background, the neural network may not be able to predict that as a cat. So, to get that kind of trust in the model — that the model was indeed looking at the cat while calling it a cat — the occlusion experiments use a specific methodology.

Given the images that you see here, we occlude different patches in the image, centered on each pixel, and observe the effect on the predicted probability of the correct class. Let us take an example: you can see a gray patch here on the image, so you occlude that part of the image, fill it with gray, send the whole image as input to the CNN, and you get a particular probability for the correct label — in this case, Pomeranian. That probability is plotted at that particular location.

Similarly, you gray out a patch here, send the full image as input to the CNN, and get a probability for Pomeranian. How do you get the probability? As the output of the softmax activation function; that probability value is plotted here in this image. By moving your gray patch across the image in this way, you get an entire heat map of probabilities, showing whether occluding a patch around a pixel reduces the probability of a Pomeranian or keeps it the same.
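A minimal sketch of this occlusion procedure, assuming a pretrained torchvision AlexNet, a gray patch of fixed size slid with a fixed stride, and a placeholder input; in a real experiment you would use a properly preprocessed image and the index of its true class.

import torch
import torch.nn.functional as F
from torchvision import models

model = models.alexnet(pretrained=True).eval()

image = torch.randn(1, 3, 224, 224)       # placeholder for a preprocessed image
true_class = 259                          # assumed index of the correct label
patch, stride = 32, 16

positions = list(range(0, 224 - patch + 1, stride))
heatmap = torch.zeros(len(positions), len(positions))
with torch.no_grad():
    for i, y in enumerate(positions):
        for j, x in enumerate(positions):
            occluded = image.clone()
            occluded[:, :, y:y + patch, x:x + patch] = 0.5    # gray patch
            prob = F.softmax(model(occluded), dim=1)[0, true_class]
            heatmap[i, j] = prob    # a low value means the patch hid the evidence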

So in this particular heat map, red is the highest value and blue is the lowest value. You notice here that when the patch is placed on the dog's face, the probability of the image being a Pomeranian drops to a low value. This tells us that the CNN model was, in fact, looking at the dog's face when calling this a Pomeranian.

In fact, this entire discussion came out of Zeiler and Fergus's work on visualizing and understanding convolutional neural networks, which is a good read for you at the end of this lecture. They observe that when you place a gray patch on the dog's face, the class label predicted by the CNN is a tennis ball, which is perhaps the object that takes precedence when the dog's face is covered. Similarly, in the second image, the true label is the car wheel, and as you keep moving this gray patch all over the image, you see that the probability drops to its lowest when the wheel is covered.

The third image is more challenging: there are two humans with a dog in between them, an Afghan hound, which is the true label. You once again see that when the pixels corresponding to the dog are occluded by the gray patch, the probability for the Afghan hound drops low. This is even trickier because there are humans present, and the model could have been biased or affected by them, but the model does well in this particular case.

(Refer Slide Time: 23:47)

So occluding the face of the dog causes the maximum drop in the prediction probability.

(Refer Slide Time: 23:51)

To summarize the methods that we covered in this lecture, given a CNN, we are going to call
all of these methods as do not disturb the model methods, which means we are not going to
touch anything in the model, we are only going to use the model as it is and be able to leverage
various kinds of understanding from the model.

You can take a convolutional layer and visualize its filters or kernels — that is the first method we spoke about. Unfortunately, this is only interpretable at the first layer and may not be interesting enough at higher layers; we also talked about the Gabor-like filter fatigue here. You could also take any other neuron, for example in a pooling layer, and visualize the patches that maximally activate that neuron; you can get some understanding of what the CNN is learning using this kind of approach.

The third thing that we talked about is that you can take the representations you get at the fully connected layer and apply dimensionality reduction methods such as t-SNE to them, and you get an entire set of embeddings for ImageNet.

And lastly, we spoke about the occlusion experiment, where you perturb the input and then see what happens at the final classification layer; that gives you a heat map telling you which part of the image the model was looking at while making a prediction.

(Refer Slide Time: 25:37)

Recommended readings: the lecture notes of CS231n, as well as a nice Deep Visualization Toolbox demo video on the web page by Jason Yosinski — I would advise you to look at that. You can also get to know more about t-SNE and t-SNE visualizations as a dimensionality reduction technique from the links provided here.

(Refer Slide Time: 25:53)

Here are some references.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 39
Explaining CNNs - Early Methods

Continuing from the previous lecture, let us now discuss a few more methods that can help us
understand CNN and its predictions.

(Refer Slide Time: 00:27)

Let us start with a question which is a bit different from what we saw in the previous lecture. The question is: can we find an image that maximizes a class score? When AlexNet, or any other network trained on ImageNet, has been exposed to a lot of — for example, say — cat images, then at the end of training, does AlexNet know what an average cat looks like? If we ask it to reconstruct an image of a cat, can it do it? That is the question we are trying to ask. Can you think of how we can do this?
Can you think of how we can do this?

(Refer Slide Time: 01:09)

In case you do not have the answer: you can do this using a different application of the gradient descent approach that we used for updating the weights of a neural network. This was first introduced by a work called Deep Inside Convolutional Networks in 2014. The idea here is that you take the trained AlexNet model and provide a zero image as input — a zero image could be a black image or a gray image; you can choose whichever one you would like.

You give this as input to the CNN model, and you now want the final prediction to be a one-hot vector, where the one is in the position of the cat (or any other class that you want to get an image of) and there are 0s in all other places. That is what you would like to see at the last layer, the output classification layer.

But when you feed a zero image as input, you will not get that; you will probably get a different probability vector. Based on the difference between these two, you have a loss, and you can now make an update. But this time, you are not going to update your network; you are going to do what we call backprop to the image. What is backprop to the image?

If you had a loss at the last layer: we have spoken so far of different ways of computing 𝑑𝐿/𝑑𝑊, depending on which weight you are trying to compute the gradient of. If it is the last layer, it is straightforward; if it is an intermediate layer, you use the chain rule; if it is a convolutional layer, you have to think it through a bit differently; for a batch normalization layer, you have to work these things out, and so on — and we have been able to do all of that.

Can we extend this to compute 𝑑𝐿/𝑑𝑋, where X is the input? That can be done. This is just another version of the chain rule; you just have to carefully work 𝑑𝐿/𝑑𝑋 through all the activations and weights in the neural network. If you do not believe it, try it out as homework. Once you have such a gradient, we do an image update — and how do we do the image update? Using gradient ascent. So far, we spoke about gradient descent as a methodology to minimize an objective function; in gradient ascent, we maximize an objective function, and we will see this objective function on the next slide. So we will use gradient ascent to get the final image.

So once you do a small image update, you have an updated image. It could have started as the initial zero image, and after doing an update, the image I — which is equal to X for us — becomes different. Now this new, updated I is forward propagated through the network, you again get an output, you compare it to the expected output — which should have 1 at the position of the cat and 0s elsewhere — and you repeat this process over and over again.

(Refer Slide Time: 04:32)

What could be the objective function? Formally, you want the arg max over I, where I is the input image, of 𝑆𝑐(𝐼) minus a regularization term; here 𝑆𝑐(𝐼) is the score in the last layer for class c when you propagate I through the network. You want to find an I that maximizes the score corresponding to a particular class.

And you have a regularizer on the image, just so that you do not overfit. Because this is a maximization problem, the regularizer carries a negative sign — you would ideally like to minimize the two-norm of the image, and that is the reason for the negative sign. So how do you solve this problem? Because it is a maximization problem, you use gradient ascent.

So you go in the positive direction of the gradient and keep climbing, updating the image over and over again. At the end of many iterations, when the gradient between your output and your expected output in the last layer is close to 0, you will have converged, and you will have obtained one such image. A minimal sketch of this loop is given below, and after that let us see a few examples.
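Here is a small PyTorch sketch of the gradient-ascent loop just described, assuming a pretrained AlexNet; the target class index, the step size, the regularization weight and the number of steps are all illustrative assumptions.

import torch
from torchvision import models

model = models.alexnet(pretrained=True).eval()
target_class = 281                    # assumed class index we want to maximize
lam, lr, steps = 1e-4, 1.0, 200       # assumed regularizer weight, step size, iterations

img = torch.zeros(1, 3, 224, 224, requires_grad=True)    # start from a zero image
for _ in range(steps):
    score = model(img)[0, target_class]                   # S_c(I)
    objective = score - lam * img.norm() ** 2             # S_c(I) - lambda * ||I||^2
    model.zero_grad()
    if img.grad is not None:
        img.grad.zero_()
    objective.backward()                                   # backprop to the image
    with torch.no_grad():
        img += lr * img.grad                               # gradient *ascent* step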

(Refer Slide Time: 05:45)

Here are a few images that maximize a class score. This may not feel like the way we perceive these images, but this is what the model thinks is representative of those objects: given such artifacts in an image, it is going to consider it as belonging to that object. The top left is a washing machine; below it is a goose — you can see the structure of a goose repeated at different points; then you have an ostrich, a limousine, a computer keyboard, and so on.

(Refer Slide Time: 06:21)

You can do such an optimization not just with respect to the last layer; you can do it with respect to any neuron in the neural network. In the previous lecture, we saw one way of finding which image maximally activates a neuron: keep forward propagating several images and see which of them fire that particular neuron.

But now what we can do is start with a gray image or a zero image, forward propagate it until a particular layer — say there is a particular neuron in one layer that you want to maximally activate, that you want to fire — set the gradient with respect to that neuron to 1 and everything else to 0 (which means you want that neuron to fire), backpropagate from there, and update the input image iteratively using gradient ascent. You will then be able to get an average image that fires that particular neuron.

(Refer Slide Time: 07:32)

Another idea that was proposed in the same work was the concept of simply visualizing the data gradient, 𝑑𝐿/𝑑𝐼. How do you go about doing it? Remember, in this case we are saying that I is equal to x, the input image for us.

So we have 𝑑𝐿/𝑑𝐼, which gives us the gradient of the output with respect to the input. In our case, we are defining L as 𝑆𝑐(𝐼), the score that we want to maximize — it is not a loss here, but a score being maximized. But because you have three channels in your input, how do you interpret the gradient by itself? This paper suggests that you take the absolute value of the gradient along each channel, and then take the maximum of those as the final gradient at a particular pixel location.

Instead of updating an image iteratively, this suggests that simply visualizing that gradient — the shape of that gradient — gives you a rough picture of what maximizes that particular output neuron. Because there are color channels, you take the maximum among them, and you assume that this is a good representation.
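Computing this data gradient in PyTorch needs only one backward pass to the input, as in the sketch below: take the score of the predicted class, backpropagate it, then take the absolute value of the input gradient and its maximum over the three color channels. The random input is a placeholder for a real image.

import torch
from torchvision import models

model = models.alexnet(pretrained=True).eval()

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # placeholder input image
scores = model(img)
c = scores.argmax(dim=1).item()                          # predicted class
scores[0, c].backward()                                  # dS_c / dI

# Absolute gradient, maximum over the R, G, B channels: one value per pixel.
saliency = img.grad.abs().max(dim=1).values[0]           # 224 x 224 saliency map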

(Refer Slide Time: 08:55)

Here are a few examples. You see a sailboat, and these were the pixels that had the highest gradient across the color channels — the pixels that maximize the score. Similarly, this one for the dog and this one for this object, and so on. So what can you do with this gradient? It does seem to give us an indication of which part of the image is responsible for, or probably caused, the particular probability score to go up.

(Refer Slide Time: 09:30)

They suggested that you could combine this with a segmentation method known as GrabCut, which can be applied to the data gradient to get an object mask; GrabCut is an extension or adaptation of the graph cut segmentation that we saw earlier in this course. Using this, if you have an input image and you get the gradient corresponding to a particular class, then using those gradients and the GrabCut segmentation algorithm — which is a way of taking those pixels and segmenting the region around them — you end up getting a mask in the input image that is responsible for a particular class being predicted. Here are a couple more examples; you can see the third row a bit more clearly.

You can see here that you have a bird and its data gradient; you then use GrabCut to segment that object out from the background, and you get a nice mask of the object, which you can use for other purposes. For details of GrabCut, please see the link below — that is also an exercise for you in this lecture: see how GrabCut can be used with the data gradient to obtain these kinds of masks.

(Refer Slide Time: 10:56)

Another question that we can ask is: given the FC7 representation of data in a CNN — that is, the output representation of the FC7, or fully connected 7th, layer of AlexNet — is it possible to reconstruct the original image? In the last lecture, we saw that a two-dimensional embedding of these representations from the FC7 layer does seem to capture semantics, and similar images seem to be placed together in that embedding space.

But now we are going further and asking: if I gave you the code of a particular object, can you reconstruct how that object looks by doing some kind of inverse mapping? How do you think you would do this? This can be done, again, by solving an optimization problem with two criteria. One, we would like the code of the reconstructed image to be similar to the code that we are given — by code, we mean the representation obtained at the output of the FC7 layer. And the second is that we want the image to look as natural as possible; we are going to call that image prior regularization, or image regularization. So we keep these two criteria in the objective function, which means x* is obtained by a minimization problem over x: take x, propagate it through AlexNet, and take the FC7 layer output — that is what we refer to as 𝜑(𝑥). We want the 𝜑(𝑥) that we optimize through this process to be close to the 𝜑₀ which is given to us.

Remember, we said there is a code given to us; we want to find an image whose code is close to the code that we have in our hand. So we do that using an optimization approach: take an x, minimize the mean squared error between 𝜑(𝑥) and 𝜑₀, and add a regularizer on top of x, similar to what we saw on the earlier slide — this is just an image regularization term.

(Refer Slide Time: 13:09)

Here are some results using the AlexNet model; this is done by taking the log probabilities for the ImageNet classes in the last layer. This is the original image, and it corresponds to a particular class in ImageNet; we are now taking the representation for this particular image and trying to reconstruct similar images that would give the same representation. Why five different images?

(Refer Slide Time: 13:45)

If you go back to the previous slide, you see that this is an optimization problem on x, which means you start with some value of x and then do gradient descent, updating x until you reach a minimum of this objective function. So if you start with different x's, you will get different solutions, and those are the different solutions that you see here — five different initializations.

(Refer Slide Time: 14:10)

And you see that there is an overall similarity to the original image, which we wanted to
reconstruct.

(Refer Slide Time: 14:23)

Here are more examples. This is the original image; we take its FC7 representation and then ask this optimization methodology to rediscover the original image whose representation would have been that FC7 embedding. You see a fairly close reconstruction here, and similarly for this one, this one, this one, and this one.

(Refer Slide Time: 14:53)

Moving further, we talk about an important method called guided backpropagation, developed in 2015, which helped improve the quality of visualized data gradients. Let us look at this approach. It is sometimes also called the deconvolution method of visualizing and understanding CNNs, but the more popular name today is guided backpropagation. Guided backpropagation is also used along with other explanation methods from more recent years, which we will see in later slides this week. Let us once again start with AlexNet — and, needless to say, AlexNet can be replaced with any other CNN model; we are only explaining all of these methods using AlexNet.

Let us take the AlexNet model and feed an image into the trained network. Let us pick a layer and set the gradient there to 0, except for one particular neuron which we want to maximally activate. Assuming this is our setting, the way we would go about it is: you take the input image and forward propagate it through the layers.

And remember, you have a ReLU activation function in AlexNet. What does the ReLU activation function do? In any layer where you have such a matrix — such a set of activations — wherever there are negative values, denoted by these red boxes here, you replace them with 0, and anything that is non-negative is retained as it is. That is the standard ReLU operation.

Now, when you do this, you can then backprop to the image, as we just mentioned: we set the gradient of that particular neuron to 1 and everything else to 0 in that layer, backprop from there to the image, and do gradient ascent on the image to understand which image maximally activates that neuron in that conv layer, for instance.

And when we backpropagate, suppose these were your gradients — let us assume that what you see here were your gradients. We know that, because of the effect of ReLU, wherever the forward activations were negative — the locations that had 0s after the ReLU — the gradient would become 0 there, because when you go through the chain rule, the backpropagated gradient is masked by the ReLU at those locations.

And because the activations at those locations become 0, the gradient also becomes 0, and this is what you are left with, which you backpropagate further to reconstruct the original image. Note here that, in this particular representation of the gradients, the gradient values can be negative or positive; it is just that where the input was negative, the gradient becomes 0.

In the other locations, the gradient can be negative or positive. And when you do this, you can visualize your data gradient, and you see an image that looks somewhat like a cat here; if you observe closely, you can make out that the gradients correspond to the outline of a cat. While it has some resemblance to the cat, you can also make out that it is fairly noisy. So what can you do about it?

(Refer Slide Time: 18:19)

To handle this, guided backpropagation — a method proposed in 2015 — suggested that in the backward pass, instead of allowing all the gradients to pass through, we do not allow the negative gradients to pass through. What does that mean? Originally, we said that only the gradients at locations where the input was greater than 0 pass through, because the rest of the gradients are cut off by the ReLU that was applied in the forward pass. But now we are adding that the gradient itself must also be non-negative when you propagate it backward.

(Refer Slide Time: 19:03)

Rather, if you go back to the previous slide: in addition to making those four values 0, you would also make this minus 2 a 0, this minus 1 a 0, and this minus 1 a 0 — only the positive gradients are passed through to the previous layers when you do your gradient update for the reconstructed image. And doing this greatly improves the final visualization of the data gradient.
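One common way to implement this is to replace every ReLU's backward rule so that, besides zeroing gradients where the forward input was negative, it also zeros negative incoming gradients. The sketch below does this with a custom autograd function for AlexNet's feature layers; it is a simplified illustration under these assumptions, not the paper's reference code.

import torch
import torch.nn as nn
from torchvision import models

class GuidedReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)                 # usual ReLU in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Pass a gradient only where the input was positive AND the
        # incoming gradient is positive (the "guided" condition).
        return grad_out * (x > 0).float() * (grad_out > 0).float()

class GuidedReLUModule(nn.Module):
    def forward(self, x):
        return GuidedReLU.apply(x)

model = models.alexnet(pretrained=True).eval()
# Swap every ReLU in the convolutional feature extractor for the guided version.
for i, m in enumerate(model.features):
    if isinstance(m, nn.ReLU):
        model.features[i] = GuidedReLUModule()

img = torch.randn(1, 3, 224, 224, requires_grad=True)    # placeholder input
model(img)[0].max().backward()                            # backprop a chosen score
guided_grad = img.grad                                    # a much cleaner data gradient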

(Refer Slide Time: 19:36)

And now you get a clearer image of the cat in the data gradient. Why does this happen? Earlier, because you allowed negative gradients to propagate, even aspects of the image that negatively correlated with the activation of that particular neuron contributed to the reconstruction of the image.

By removing those, we retain only the pixels that had a positive impact on the activation of the neuron in the layer that we are interested in. This is known as guided backpropagation, and it is something that we will use in other methods in the rest of this week's lectures as well.

(Refer Slide Time: 20:22)

The recommended readings for this lecture are, once again, the lecture notes of CS231n on visualizing CNNs, as well as three papers: Deep Inside Convolutional Networks; Visualizing and Understanding Convolutional Networks (ECCV); and Striving for Simplicity: The All Convolutional Net, which is the paper that introduced guided backpropagation. And one exercise: use the hyperlink here to understand GrabCut and how it can be used to generate masks using data gradients.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 40
Explaining CNNs: Class Attribution Map Methods

Having seen a few different methods that use various aspects of a CNN model to explain its
predictions such as its activations of different layers, its gradients, its output probabilities so on
and so forth. We will now look at what are known as class discriminative saliency maps or class
discriminative attribution maps. By saliency maps we mean maps or regions in an image that are
salient for a given prediction.

And by class discriminative saliency map, we mean a saliency map that helps distinguish one
class from another, for example if we had a cat and a dog in an image which part of the image
led to it being predicted to be a cat and which part of the same image led to it being predicted as
a dog. Let us see this in more detail over the next few slides.

(Refer Slide Time: 1:19)

So the question we continue to ask is: can we know what a network was looking at while predicting a class? But the approach, as you will see, is different from what we have seen so far.

(Refer Slide Time: 1:35)

One of the earliest methods in this regard is known as Class Activation Maps, or CAM, published in 2015-16. It takes a convolutional neural network and uses the notion of global average pooling to achieve this objective. Let us look at it in more detail. Say you had five to six convolutional layers in your architecture; you take the last convolutional layer.

Then, for each map in that convolutional layer — remember, if you had 100 filters you would have 100 maps — you do global average pooling (GAP). What does global average pooling do? You take a particular feature map, say this green one, and you average all the intensity values in it into one single scalar, and that becomes this green circle here.

Similarly, you take the red feature map and average all its values, and it becomes the red scalar here; you take the blue feature map, average all its values, and it becomes the blue scalar here. Global average pooling takes the average over the entire map. Now, what do we do with these global averages? Each of these scalars represents one feature map.

Now, we learn a simple regression model — a linear model — that takes us from these scalars to each of the class labels in the last layer. So for each class label in the last layer, we learn w1 times the first feature map's average, plus w2 times the second feature map's average, and so on, up to wn, where n is the number of feature maps in that last convolutional layer.

Now, how does this help us? For a given image such as the one you see here, if you forward propagate it through a trained AlexNet, you get a set of n activation maps at the last convolutional layer. Let us assume these are those activation maps: you see one here, the second one here, and so on until the last one here. Now, the weights that we learned between the output of the GAP layer and the classification layer are used to weight each of these activation maps.

And when you do a weighted sum of all of these activation maps, that gives us the contribution of these activation maps towards one particular class label. In this example, let us say we want to predict this image as belonging to an Australian terrier. Then you use the weights learned from each of those activation maps to the Australian terrier class and weight each activation map accordingly.

And this weighted combination of activation maps of that conv5 layer corresponds to the Australian terrier. If, say, you wanted to predict a man in the image instead, you would have a different set of weights of the same activation maps that connect you to the man class. So a different weighted combination of the same activation maps will tell you which part of the image corresponded to the man class. This approach gives us a way of getting class-discriminative saliency maps.
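To make the weighting step concrete, here is a minimal sketch (not from the lecture slides) of how a CAM could be computed once the last convolutional feature maps and the learned GAP-to-class weights are available; `feature_maps` and `class_weights` are hypothetical, assumed inputs.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    # feature_maps: (K, H, W) activations of the last conv layer
    # class_weights: (K,) learned weights w_k^c for the chosen class c
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # sum_k w_k^c * A^k
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)  # normalize to [0, 1] for visualization
```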

(Refer Slide Time: 5:36)

There are some advantages and disadvantages, as we will see soon, but here are some visual examples to start. The first three images are examples of the class Briard, and the next three are examples of the class barbell. You can see that in each of these cases the CNN model looks at this particular region and calls it a Briard.

Similarly in this case and in this case; and for the barbells, you can actually see that the CNN looks at the weight plates to make the decision that the image corresponds to barbells.

(Refer Slide Time: 6:22)

So obviously you can ask the question: what if I change the class? So let us see an example for
that too. So here is an image and here are the predicted saliency maps or class activation maps
for the top five predicted classes based on this image. The top five predicted classes were palace,
dome, church, altar and monastery and you see here that when the model was predicting the
palace it was looking at the entire structure.

When it was predicting dome, it was only looking at the dome. When it was predicting church, it looked only at the facade close to the dome, and similarly for the altar; for the monastery it was looking at certain parts which probably made it think it was a monastery. So this gives us, as we just mentioned, class-specific activation maps, which can be useful in practice.

(Refer Slide Time: 7:24)

An intuition for CAM is that in most CNNs we have seen so far, we have a few convolutional layers followed by some fully connected layers. Convolutional layers do maintain a certain level of object localization capability. If you recall, in the first lecture we saw an example of the conv5 feature map for an image with two people sitting in it.

And we saw that the feature map actually showed where they were positioned in the image. So the convolutional layer did give us an idea of where the objects were localized, with no supervision or explicit details given to us. However, if you look at the representation that you get after a fully connected layer, for example the fc7 representation, you would lose this information.

That is the very nature of the difference between a convolutional layer and a fully connected layer. By taking the CAM-based approach, we can retrieve the activation maps of the last convolutional layer and use them to explain the decisions at the classification layer.

These are more examples of the localization capability of CNNs in different feature maps: you can see the receptive fields of different convolutional units and the patches that maximally activate each unit. You do get a certain sense of localization through these kinds of feature maps at different convolutional layers; you obviously lose that when you go to fully connected layers.

(Refer Slide Time: 9:21)

Here is a comparison of CAM maps across different models. This is applying CAM, i.e., using global average pooling (GAP), on GoogLeNet, VGG and AlexNet, on GoogLeNet alone, and on another architecture known as NIN (Network in Network), and these are compared with backpropagation on AlexNet and backpropagation on GoogLeNet.

Backpropagation here is your data gradient; remember, we talked about the data gradient in the earlier lecture. So this is that visualization, and you can clearly see that CAM gives a far stronger, far more useful saliency map when compared to the data gradient.

(Refer Slide Time: 10:09)

What are the pros and cons of CAM? Can you think of any significant disadvantage of CAM? Among the advantages: it is class discriminative, it can localize objects without any positional supervision, and it does not require a backward pass through the entire model, unlike something like guided backprop or backprop to the image. What are the disadvantages? The key one is actually the third bullet here: there is a need to retrain these models to get those weights after global average pooling.

After training an AlexNet, you still have to do global average pooling and learn those linear models at the last layer to understand the relationship between the activation maps and each class label. You have to do this retraining explicitly for explanations, on top of training your AlexNet or any other CNN model, and that can become an additional computational burden.

And we are imposing a constraint on the architecture by saying that you have to introduce a global average pooling layer to be able to explain your model. That may cause problems when you want to generalize this kind of method to many vision tasks, and there is a chance that the model may trade off accuracy for interpretability: to get better interpretability, it may end up achieving lower accuracy if you used the GAP model and the corresponding weights themselves for classification. Now let us try to see how we can address these disadvantages, which was done in a follow-up method called Grad-CAM, published at ICCV 2017.

(Refer Slide Time: 12:05)

ICCV is a top-tier computer vision conference. Grad-CAM stands for Gradient-weighted Class Activation Mapping. As we will see in the next few slides, it is a very intelligent approach that repurposes CAM using quantities that already exist in a CNN; let us see how that works.

But here is the overall idea and architecture; we will describe each of these components over the next few slides. You have the input image here, and you send it through a CNN. You get your convolutional feature maps, and these could be followed by any task-specific network.

You could be doing classification, which is what we have seen so far, but you could also use this approach for other tasks such as image captioning or visual question answering; these are tasks that we will see later in this course. Irrespective of the task, we assume that there is a last layer, a loss, and a gradient, which can be assumed for any neural network.

Once you have the gradient of the loss for any task, you have gradients with respect to all of the feature maps, that is, the activation maps. You now combine the gradients that you get for each of those activation maps, and they automatically become your weights for each of the feature maps.

And in Grad-CAM, because we want to ensure that only positive correlations are shown in the final saliency map, we apply a ReLU on the weighted combination of the activation maps, and that becomes our final Grad-CAM saliency map. The method also talks about adding guided backpropagation to make a variant of Grad-CAM called Guided Grad-CAM. Let us see this in a bit more detail, and also mathematically, as to why Grad-CAM becomes an extension of CAM.

(Refer Slide Time: 14:08)

From CAM, mathematically speaking, we have Y^c, the class scores in the last layer, given by a summation over k, that is, over all your k feature maps. We assume you have k such feature maps and a class weight for each of them, and (1/Z) ∑_i ∑_j A^k_ij is the global average pooling of the k-th feature map:

Y^c = \sum_k w_k^c \, \frac{1}{Z} \sum_i \sum_j A_{ij}^k

And then you have the weights: in CAM, each weight w_k^c carries two indices, where c corresponds to the class and k corresponds to the activation map, so you need both. This is what we learn through linear models in that last layer. Now let us go from there. Let us assume that F^k is given by the last terms, (1/Z) ∑_i ∑_j A^k_ij:

F^k = \frac{1}{Z} \sum_i \sum_j A_{ij}^k

Then Y^c is given by Y^c = \sum_k w_k^c F^k; we are simply replacing the last terms with F^k. If you now take the gradient of Y^c with respect to F^k, you get ∂Y^c/∂F^k equal to ∂Y^c/∂A^k_ij divided by ∂F^k/∂A^k_ij:

\frac{\partial Y^c}{\partial F^k} = \frac{\partial Y^c / \partial A_{ij}^k}{\partial F^k / \partial A_{ij}^k}

We just have the same component that we are taking the derivative with respect to. If you look at this closely, ∂Y^c/∂F^k is exactly w_k^c; we can see that from the equation above. Remember once again that Y^c = \sum_k w_k^c F^k, which means ∂Y^c/∂F^k will be w_k^c for a particular k, whichever F^k you chose for a particular feature map. And that is given by ∂Y^c/∂A^k_ij divided by ∂F^k/∂A^k_ij:

\frac{\partial Y^c}{\partial F^k} = w_k^c = \frac{\partial Y^c / \partial A_{ij}^k}{1/Z} = Z \, \frac{\partial Y^c}{\partial A_{ij}^k}

\sum_i \sum_j w_k^c = \sum_i \sum_j Z \, \frac{\partial Y^c}{\partial A_{ij}^k}

Z \, w_k^c = Z \sum_i \sum_j \frac{\partial Y^c}{\partial A_{ij}^k}

w_k^c = \sum_i \sum_j \frac{\partial Y^c}{\partial A_{ij}^k}

Now, ∂F^k/∂A^k_ij, by the very definition of F^k, turns out to be 1/Z, and because that 1/Z is in the denominator, you get a factor of Z in the numerator when you write this out more clearly. So what does this tell us? If we sum the terms on the left-hand side over i and j, which are all the pixel locations in each feature map, you similarly have a summation on the right-hand side.

The summation over i and j of w_k^c, because w_k^c does not depend on i and j, is just Z w_k^c, where Z is the total number of pixels in each feature map or activation map; similarly the constant Z comes out on the right-hand side, and you are left with the summation over i and j of ∂Y^c/∂A^k_ij. In other words, w_k^c is given by ∑_i ∑_j ∂Y^c/∂A^k_ij.

This tells us something important: the w_k^c that we actually learned in the CAM model are simply the gradients of the class score with respect to every pixel in the feature map, added up. So in truth, the class feature weights here are the gradients themselves, and we do not need to do the retraining the way we saw it with CAM.

You do not need the global average pooling, and you do not need the retraining: those weights that you did the global average pooling for can be obtained as gradients of the last-layer scores with respect to whichever feature map or activation map you want to use to compute your saliency maps.

(Refer Slide Time: 18:35)

So this means we can now write out w_k^c as the summation over i and j of ∂y^c/∂A^k_ij, with a normalization factor 1/Z because we want to average over all of the pixels:

w_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}

This becomes our new weight, and the final saliency map or localization map in Grad-CAM is given by ReLU(\sum_k w_k^c A^k), where the A^k are the k different activation maps that we have. Because we want only the positive correlations to be shown in the final saliency map, we apply a ReLU on the weighted sum to get the final image.
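As a concrete illustration, here is a minimal Grad-CAM sketch in PyTorch. It is not the authors' reference implementation; `conv_forward` and `head_forward` are hypothetical stand-ins for the part of the network up to the last convolutional layer and for the rest of the network, respectively.

```python
import torch
import torch.nn.functional as F

def grad_cam(conv_forward, head_forward, image, class_idx):
    # image: (1, 3, H, W); conv_forward returns the last conv activations (1, K, h, w)
    activations = conv_forward(image)
    activations.retain_grad()                                    # keep dY^c / dA^k
    score = head_forward(activations)[0, class_idx]              # Y^c for the chosen class
    score.backward()
    # w_k^c = (1/Z) sum_ij dY^c/dA^k_ij, one weight per feature map
    weights = activations.grad.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations).sum(dim=1)).squeeze(0)  # ReLU(sum_k w_k^c A^k)
    return cam / (cam.max() + 1e-8)                              # normalized (h, w) saliency map
```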

(Refer Slide Time: 19:18)

You now see examples: for the original image, here is the Grad-CAM for "cat", which focuses on the cat, and similarly with the ResNet model, here is the location of the cat. Likewise, for "dog" it looks at the dog to be able to say it is a dog, and the ResNet model looks at the entire dog, which is quite good. But one careful observation here is that the network actually predicts this to be a tiger cat and not just a cat.

So can we elaborate on this further and see why it is a tiger cat and not just a cat? Can we try to do anything further?

(Refer Slide Time: 20:02)

And here is where the method proposes a variant of Grad-CAM called Guided Grad-CAM, which brings together guided backpropagation, which we saw in the earlier lecture, and Grad-CAM, by juxtaposing them one on top of the other using what is known as a Hadamard product, or pixel-wise product.

And we now see that taking this particular region, which was pointed out by Grad-CAM, and combining it with the guided backpropagation output in terms of what was salient in the image, we get a clearer estimate of what the dog was. Remember, guided backpropagation was not necessarily class discriminative, so if you only used that, you would also have other kinds of pixels which are active, as you can see here.

But by combining it with Grad-CAM, we get a more localized understanding of what the CNN was looking at while calling it a dog. And now you see why the model called it a tiger cat: you take the Grad-CAM saliency map and combine it with guided backprop's output again, and you get this kind of output, which shows the striations on the body of the cat, explaining why it was a tiger cat.

(Refer Slide Time: 21:27)

Grad-CAM also went on to show how this method could be used for what are known as counterfactual explanations. That is, can you find out which regions in the image worked against my model calling an image a dog? I have an image that I want to label as a dog, and the model gives me a certain probability of the label being a dog.

But can I find out which other pixels in the image may have suppressed my probability for a dog as the output of the model? This can be done using the exact same Grad-CAM procedure; the only difference is to use negative gradients instead of the positive gradients, and this tells us which weighted combination of feature maps negatively influenced a particular class probability.

That gives us what are known as counterfactual explanations. And how do you use this eventually? You can remove or suppress these features in some way to improve the confidence of the model, if you would like to use it in that particular way.

(Refer Slide Time: 22:55)

Grad-CAM, however, had some limitations when there were multiple instances of objects or when there were occlusions in the image. Here is an image with multiple dogs, and you can see that Grad-CAM gives an output that does not seem to capture all the dogs in the image. It does somewhat better when there are fewer objects: there are still three dogs here, and Grad-CAM captures two of them but misses the one in the middle. That is one of the limitations of Grad-CAM. Another case where it does not quite get a good saliency map is where there are occlusions.

You see a bird here whose legs are hidden underneath the water and which also has a beak, and here you see a hedgehog which has a beak-like structure too. In both of these cases, Grad-CAM does not seem to capture those aspects, which are actually salient, class-discriminative aspects of the object, in its visualization. Can we do something about it, or are there limitations in the formulation of Grad-CAM itself that we can improve?

(Refer Slide Time: 24:13)

This was done in a work called Grad-CAM++. The main motivation of Grad-CAM++ is the observation that Grad-CAM takes the gradients of Y^c with respect to each of the pixels in your activation maps and then averages all of them to get its final weight. In a sense, it weights each pixel equally when it computes the final weight.

Grad-CAM++'s idea is that pixels that contribute more towards the class should get more weight, rather than an equal share, while computing this weight w_k^c. Let us see how we can do that. Equal weighting can especially suppress activation maps with a smaller spatial footprint; we saw this example on the previous slide too: when you have three dogs and one dog is smaller, the other two dogs get most of the gradient and the third does not, because it has a smaller spatial footprint.

We will see this more clearly on the next slide. When you have this kind of bias in the visualization, some of the smaller objects may just not be picked up in the saliency maps. What can we do about this? Grad-CAM++ suggests that we retain the same formulation as Grad-CAM; however, this time, while computing the final weight w_k^c, we give each pixel in each activation map a certain weight for how it contributes to that saliency map.

Let us call those constants α_ij^kc: the subscript ij corresponds to the (i, j)-th location of a feature map, k corresponds to the k-th feature map, and c corresponds to the class we want to maximize. Grad-CAM++ also adds a ReLU here to ensure that only positive gradients are considered in the computation of this weight. But the larger question is: how do you get these weights?

w_k^c = \sum_i \sum_j \alpha_{ij}^{kc} \, \text{ReLU}\!\left(\frac{\partial y^c}{\partial A_{ij}^k}\right)

In Grad-CAM, it was simpler: you averaged all of these gradients and used that as w_k^c, and remember, each w_k^c then becomes the weight of the k-th feature map towards the c-th class. But now, how do you compute these α_ij values at each pixel?

(Refer Slide Time: 26:44)

Before we go there, let us try to understand the intuition of Grad-CAM++ again, visually. Suppose you had an image with three different objects, say dogs: one dog occupying a larger spatial footprint, another occupying a mid-level spatial footprint, and another occupying a small footprint, and for the moment let us assume that different feature maps capture different dogs. These could, for that matter, be any other objects, such as a dog, a cat and a jug.

So let us assume each feature map captures one of these objects. You see here that in Grad-CAM, when the saliency map weights are computed, the area with the largest footprint ends up getting most of the weight, while the weights towards the other objects are smaller because they have fewer pixels and hence contribute less towards the output. Grad-CAM++ tries to overcome this by doing the pixel-wise weighting, and you can see that in Grad-CAM++ the resulting weights are in the same range when you use this kind of approach.

The final saliency map is then in a comparable range across the objects for Grad-CAM++. So we are still left with the question, in Grad-CAM++, of how to compute those alphas at a pixel level.

(Refer Slide Time: 28:10)

We are not going to derive this here; it can be lengthy, and it is going to be part of your homework. But by reorganizing the gradients and using some arithmetic on the expressions of the gradients, Grad-CAM++ shows that α_ij^kc can actually be obtained as a closed-form expression of several gradients that you already have with you. Both (a, b) and (i, j) here are iterators over the same activation map, and you can go ahead and look at the Grad-CAM++ paper to understand how this derivation is done.

But once this derivation is done, the rest stays very similar to Grad-CAM. In Grad-CAM++ you still take ReLU(\sum_k w_k^c A^k), where w_k^c = \sum_i \sum_j \alpha_{ij}^{kc} \, \text{ReLU}(\partial Y^c / \partial A_{ij}^k). How does this help?
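For concreteness, here is a small NumPy sketch of only the weighting and combination step just described, assuming the per-pixel coefficients `alphas`, the gradients `grads` (∂Y^c/∂A^k) and the activations have already been computed with shape (K, H, W); computing the alphas themselves is the homework mentioned above.

```python
import numpy as np

def gradcam_pp_map(activations, grads, alphas):
    # w_k^c = sum_ij alpha_ij^kc * ReLU(dY^c / dA^k_ij)
    weights = (alphas * np.maximum(grads, 0.0)).sum(axis=(1, 2))
    # saliency = ReLU(sum_k w_k^c * A^k), normalized for visualization
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    return cam / (cam.max() + 1e-8)
```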

(Refer Slide Time: 29:18)

You now see that, given the same image of multiple dogs, while Grad-CAM did not localize all of them effectively, Grad-CAM++ gets a better saliency map around the dogs and works a bit better even when there are three dogs. It also handles the setting where structures such as the beak or legs are under occlusion better than Grad-CAM.

(Refer Slide Time: 29:44)

The Grad-CAM++ work also showed that its saliency maps give better localization, measured against the ground-truth bounding boxes provided with the images, than Grad-CAM.

You see several results here: the first column in this left block contains the original images, then the corresponding Grad-CAM visualizations for each of these classes (hare, American lobster, gray whale and necklace), and then the Grad-CAM++ localizations, which seem better, especially for classes like gray whale, when compared to Grad-CAM. You see a similar set of images for another gray whale here,

a kite, a go-kart and an eel, where the Grad-CAM++ localizations improve over Grad-CAM by considering the pixel-wise weighting strategy.

(Refer Slide Time: 30:37)

Here are more examples of Grad-CAM++ for multiple occurrences of objects, once again showing improved performance over Grad-CAM.

(Refer Slide Time: 30:50)

For homework, there are these three papers: CAM, Grad-CAM and Grad-CAM++. Your job is to read through them; the other exercise is to work out how we get the closed-form expression for the alphas in Grad-CAM++.

(Refer Slide Time: 31:10)

References.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Explaining CNNs: Recent Methods
Part 01
(Refer Slide Time: 00:13)

In the last few years, researchers have extended methods such as GradCAM, GradCAM++, as well
as the use of other statistics such as Gradients and Activations, and so on, to improve explanation
methods. This is what we will focus on in this lecture, which is on Recent Methods for explaining
CNNs. So, this is not just about explaining CNNs. These methods are developed more broadly for
explaining any kind of neural network.

(Refer Slide Time: 00:54)

Let us see a few of them in this lecture. If you recall our discussion on visualizing the data gradients, our overall approach was: you forward pass your data x, any input image, through the neural network to get the output; assuming the neural network is given by a function f, you get y = f(x). And y is the output of the neural network corresponding to a particular class; it was a particular score, the way we saw it.

Then we backprop to the image to get the gradient ∂y/∂x. Using those gradients, we visualized the image back in terms of the gradients by taking the maximum of the gradient across the channels, and we get such an attribution map. Recall that an attribution map is a map of how much each input attributes to the output; that is the reason it is called an attribution map. Let us now ask if this is sufficient to explain a deep neural network. The answer is: not always. So let us see a counterexample where such an approach of using gradients can fail.

(Refer Slide Time: 02:14)

Let us consider a scenario where we have two inputs, i1 and i2, which are fed into the next layer. The bottom is the input, the top is the output; the neural network is presented this way. These go into the next layer, where let us say we have a ReLU activation function, which gives an output h = max(0, 1 − i1 − i2). Let us assume that those are the weights the neural network has learned at that time.

And let us also assume that the final output y is given by 1 − h. This is one such neural network that may be learned when you train a model. If you visualize this setting with i1 + i2 on the x-axis, you would notice that h = max(0, 1 − i1 − i2), which means when i1 + i2 becomes greater than 1, h is always 0.

Until then, it would be 1 − (i1 + i2). So the graph of h is this blue line here: it is 1 − (i1 + i2) until i1 + i2 reaches 1, and when i1 + i2 exceeds 1, it becomes 0. On the other hand, the graph of y, given by 1 − h, increases as h decreases until i1 + i2 reaches 1, and when h becomes 0, y stays static at 1. This is evident from this construction of the neural network. This is fair.

Now, why are we bringing up this example? What do we want to see? We have already given a
hint here, we are talking about something called a Saturation problem. So, what does that mean?
We notice here that the gradient of h with respect to both 𝑖1 and 𝑖2 is 0 when 𝑖1 + 𝑖2 is greater than
1, which means as soon as the sum of the inputs exceeds 1, the gradient is going to become 0.

So if the gradient is 0, guided backprop or any other method that you use to visualize your data gradients will also give 0. That is one part of it. The second part is that the gradient of y with respect to h is negative: you can see that as h goes down, y goes up.

Remember that in guided backprop, we said that any negative gradient would also be made 0 when you backpropagate, which means even in the range where i1 + i2 goes from 0 to 1, the gradient arriving at the ReLU will be negative, and we will make it 0 due to the guided backprop approach. This means the gradient is now going to be 0 everywhere, which is not useful at all. We call this the saturation problem. So how do you solve such a problem?
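As a quick numeric check of the saturated part of this example (a hypothetical snippet, not from the lecture slides), a few lines of autograd confirm that for i1 + i2 > 1 the plain gradients of y with respect to the inputs are exactly zero:

```python
import torch

i = torch.tensor([0.7, 0.6], requires_grad=True)   # i1 + i2 = 1.3 > 1
h = torch.clamp(1.0 - i[0] - i[1], min=0.0)         # h = max(0, 1 - i1 - i2) = 0 here
y = 1.0 - h
y.backward()
print(i.grad)   # both gradients are zero: the inputs matter, but the gradient says nothing
```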

(Refer Slide Time: 05:41)

This problem was first pointed out by a work called DeepLIFT in 2017. What DeepLIFT proposed to address this issue is: let us not use gradients, but a variant of a gradient. Remember, a gradient, by definition from first principles, is obtained by infinitesimally perturbing an input and seeing what change it causes in the output. Now we are saying: let us not look at an infinitesimal change or perturbation in the input.

Let us instead see, if we had some reference input, and our current input moves away from that reference input by a certain ∆x_i in one of the attributes x_i (for an image, you could consider it to be one of your input pixels), what is the difference in the output with respect to some reference output? That is, if we moved from some reference input by a certain amount, how much does the output move from some reference output?

So it is no longer an infinitesimal change: we keep a baseline for the input and the output and see, if we move from the input baseline by a certain amount, how much we move from the baseline in the output; as you can see, it is an extension of the idea of a gradient. And then you assign contribution scores, given by C_{∆x_i ∆t}, such that the summation of C_{∆x_i ∆t} over all your attributes equals ∆t, because that is the overall change in your output.

That is how the contributions of each input towards the change in the output are measured. You can see now that with this kind of approach, the saturation problem that we saw on the earlier slide goes away, because you will no longer be considering the difference between successive points on your x-axis; you will always be looking at the difference with respect to the reference, which you could keep at 0.

Then, whether a point is at i1 + i2 = 1.1 or i1 + i2 = 1.2, it will still have a difference with respect to the reference at 0, so there will be a valid attribution and it will no longer be 0. That is the way DeepLIFT counters this saturation problem.

(Refer Slide Time: 08:22)

DeepLIFT introduces a few different rules, three in all, but we will talk about one of them here to explain the idea; for more details, you can look at the paper. This rule is known as the Rescale rule, and it broadly conveys the idea behind the paper. You start from the output layer L and proceed backward, layer by layer, redistributing the difference of the prediction score from the baseline until the input layer is reached.

Let us explain that more clearly. Let us define z_{ji} = w_{ji}^{l+1} x_i^l, the weight of layer l+1 times the activation x_i^l of the previous layer; that is the notation we are using. Similarly, \bar{z}_{ji} = w_{ji}^{l+1} \bar{x}_i^l, where \bar{x}_i^l is the activation obtained from the baseline or reference input that we talked about. So how do we use this? Let us now denote by r_i^l the relevance of unit i, where the superscript l denotes layer l.

So r_i^L in the last layer, where L is the total number of layers, is given by y_i(x) − y_i(\bar{x}), where \bar{x} is the baseline reference input. You look at the change in the output as you move from the reference input to the actual input, and that is the relevance of unit i in your last layer; if there is no change, it is going to be 0.

For all previous layers, r_i^l is given by a summation over j, over all the neurons in the next layer that this particular neuron is connected to, of (z_{ji} − \bar{z}_{ji}) divided by the sum of all such differences, weighted by the relevance of unit j in the next layer:

r_i^l = \sum_j \frac{z_{ji} - \bar{z}_{ji}}{\sum_{i'} (z_{ji'} - \bar{z}_{ji'})} \, r_j^{l+1}

Let us try to draw that out to make it a little clearer. You have a particular layer with a few different neurons, and a next layer with a few different neurons; we already know how to compute the relevances in the last layer.

We now take a specific neuron i in layer l, and we take all the j's in layer l+1 that it is connected to. Consider one particular j: its relevance r_j^{l+1} would already have been computed. How much does it contribute to the relevance r_i^l? That is given by (z_{ji} − \bar{z}_{ji}) divided by the summation of all the (z_{ji'} − \bar{z}_{ji'}) terms in that particular layer.

That gives you an estimate of the relevance of node i at layer l. You keep backpropagating these relevances, and you eventually get an estimate of the relevance of each input at layer 1, your input layer. Note that the process is very similar to backpropagation of gradients, but we are not computing gradients here; we are computing how much the activation at any layer changes when we give the baseline input instead of the current input.

That difference is what we measure in place of the gradient, and the rest of the process stays very similar to what we did with backpropagation. This is the idea of using DeepLIFT to understand how much each input neuron attributes to every output neuron y_i in this particular case.
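As a toy illustration of the Rescale rule just described (a sketch under simplifying assumptions, not the paper's reference code), here is how relevance could be pushed back through one fully connected layer, assuming `w` is the (n_out, n_in) weight matrix, `x` the actual activations of layer l and `x_ref` the activations obtained from the baseline input:

```python
import numpy as np

def rescale_backward(w, x, x_ref, relevance_next):
    z = w * x[None, :]            # z_ji    = w_ji * x_i
    z_ref = w * x_ref[None, :]    # z_bar_ji = w_ji * x_bar_i
    delta = z - z_ref
    denom = delta.sum(axis=1, keepdims=True) + 1e-12
    # distribute each output unit's relevance proportionally to (z_ji - z_bar_ji)
    return (delta / denom * relevance_next[:, None]).sum(axis=0)
```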

(Refer Slide Time: 12:42)

Another popular method, very widely used today, is known as Integrated Gradients. Integrated Gradients is motivated by an observation similar to DeepLIFT, shown here. If you have a given image, say an image of a fireboat, and you only used the vanilla data gradients to see where the fireboat is, you get a set of gradients such as what you see on the right side.

This kind of structure in the gradients is predominantly because of a saturation problem that you will see on the next slide. So what do we do? What Integrated Gradients does is avoid this problem of saturated gradients by accumulating gradients at different pixel intensities of the given image.

(Refer Slide Time: 13:42)

Let us see this in more detail. Here are the same kind of gradients, but now each image in this set of tiles is the input image with its intensity at each pixel reduced by a particular scale. You notice there is an alpha of 0.02, 0.04, 0.06, 0.08, and so on. This says that the first set of gradients was obtained by taking the given image and scaling down its intensity to the level of 0.02.

Imagine an image that is an interpolation between a black image and the given image, where you weight the black image at 98 percent and the given image at 2 percent; that is what alpha = 0.02 gives you. Similarly, you can use alpha = 0.04, alpha = 0.06, and so on, and you get a different set of images. For each such input image you can compute the data gradient, and now you start seeing clearer structure in the data gradients.

(Refer Slide Time: 15:01)

Why is this structure clearer? Because this is a fireboat, you would ideally want to localize the boat as well as all of these water streams, because they are part of the nature of the boat itself.

(Refer Slide Time: 15:12)

And that is what you see in all of these gradients here. But as you get closer and closer to the full image, you see that the gradients become close to each other, and hence the method, if you use a plain data gradient, thinks that all of the pixels have about the same gradient. You then end up with a cluster of gradients like this, which does not isolate the key pixels. Let us also see this mathematically to make it a bit clearer. Mathematically, this is achieved using what is known as a path integral, and that is the reason it is called Integrated Gradients.

(Refer Slide Time: 15:54)

Let us try to see how that is defined. The integrated gradient along the i-th dimension for input x and baseline x', which we just took as a black image in the example a minute ago, is given by

IG_i(x) = (x_i - x_i') \int_0^1 \frac{\partial f\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha

where f is the neural network. When x' is a black image, x_i' is always 0, so the first factor is just x_i itself. The partial derivative is taken at the point x' + α(x − x'): you have a black image, and you keep adding a little more of the given image to the black image. You take the output of the neural network f as you forward propagate that constructed image between the black image and the given image, and differentiate it with respect to the input.

Now, you compute the data gradient of this interpolated image, you form all such interpolated images as you move alpha from 0 to 1, and then you integrate all of them to get an integrated data gradient. The authors show that this kind of approach can lead to more robust attributions. But in practice, it is not possible to integrate over all possible alpha values, so we use an approximation, where we define a set of intervals between 0 and 1.

We keep stepping through those intervals: the summation goes from k = 1 to m, with alpha taken at k/m, that is, at 1/m, 2/m, 3/m, and so on until m/m = 1. So you keep taking those steps of your input image added to a baseline (black) image, you compute the data gradient at each step, and you average all of those data gradients; that becomes your integrated gradient. Now you have an IG attribution map which focuses on the fireboat as well as those streams of water, which is far more convincing than the vanilla gradients we saw initially.
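Here is a minimal Integrated Gradients sketch in PyTorch to make the approximation above concrete. It assumes a hypothetical `model` that maps an image tensor to class scores and a `target` class index; the all-zero (black) baseline matches the example in the lecture.

```python
import torch

def integrated_gradients(model, x, target, steps=50):
    baseline = torch.zeros_like(x)              # black image as the baseline x'
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        # interpolated input at alpha = k / steps
        xk = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        score = model(xk)[0, target]
        total_grad += torch.autograd.grad(score, xk)[0]
    return (x - baseline) * total_grad / steps   # (x - x') * average gradient along the path
```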

(Refer Slide Time: 18:21)

Another approach that was proposed is known as SmoothGrad, which uses a very similar idea to Integrated Gradients but is more heuristic and simpler. It says: let us add pixel-wise Gaussian noise to many copies of the image and average the resulting gradients. Why do we have to do the path integral from, say, a black image to the given image? Instead, take the given image and distort it in several ways by adding Gaussian noise.

Now compute a data gradient for each of those noisy images when you forward propagate it through the model, and average all of those gradients. You get a new data gradient, which is used as the final attribution map. An interesting observation here is that this method removes noise from the saliency map by adding noise to the input, which is an interesting approach, and it works reasonably well.

There are many other methods now which have added a smooth variant to their approach. For example, there is a Smooth IG approach, which takes the integrated gradient and smooths it by adding Gaussian noise to different inputs and then averaging the resulting gradients. Here you see an example of a result.

You have the original image, and you can see that the vanilla gradients are spread out across the image; you do see an outline of the structure, but otherwise the gradients are spread out. By doing SmoothGrad, you get a fairly robust attribution map of the gradients which corresponds to the object.
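A SmoothGrad computation can be sketched in a few lines; as before, `model` and `target` are hypothetical stand-ins, and `sigma` controls the pixel-wise Gaussian noise added to each copy of the input.

```python
import torch

def smooth_grad(model, x, target, n_samples=25, sigma=0.15):
    accumulated = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)
        score = model(noisy)[0, target]
        accumulated += torch.autograd.grad(score, noisy)[0]   # data gradient of a noisy copy
    return accumulated / n_samples                            # averaged (smoothed) gradient
```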

(Refer Slide Time: 20:06)

A more recent variant of Integrated Gradients is known as XRAI, published at ICCV 2019, where the idea is to take integrated gradients but, in the context of computer vision, treat them in terms of segments rather than pixels. So far, we have talked about attribution maps of every pixel towards a particular output class, but doing it at the level of every pixel can become tedious in an image.

Instead, can we reason at the level of segments in an image? That is what XRAI tries to do, a fairly practical approach. What it does is: you first get the attribution map given by IG; that is the first step. I should point out again that for all these methods we are covering this entire week, including this lecture and the previous one, we already have a trained model.

We are talking about what we can do after training a model. These are not methods that affect the training of the neural network; that part we already did. We are now trying to see, given a trained network, how you can explain its behavior. Let us come back to XRAI. You get an attribution map given by IG, and then you over-segment the image, which means you get a lot of segments in the image.

Now you start with an empty mask, which means there are no regions selected to start with. Then you add the region which has the highest sum of attributions in a given segment. So, among the several segments in the image, you pick the segment with the highest total attribution over its pixels, and that is the first segment you add as corresponding to a particular class.

Remember, the attributions that you add up within a segment are attributions with respect to a particular output or class. Let us see a couple of examples. Here you see an original image: Integrated Gradients does give you a region around those hot air balloons, but you also get attributions in a few other places. By doing XRAI, you get a much neater presentation of which regions of the image were responsible for these objects being called hot air balloons.

(Refer Slide Time: 22:47)

Let us see another example. Here is an original image, and as you keep adding segments, the top 3 percent, then the top 10 percent, you see more and more objects getting added. The way it works is that you start with the XRAI segments; this over-segmented image is shown at the bottom. You then try to find out which of these segments has the highest sum of attributions over its pixels for the object "bird".

You take that region; if you take the top 3 percent of such segments, you get this region, and if you take the top 10 percent of such segments across all your segments, you get these two regions, which become the corresponding XRAI heat maps. So it is a way of extending Integrated Gradients to reasoning at the level of segments instead of pixels.
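A rough sketch of this region-ranking idea is below; it assumes per-pixel IG `attributions` and an over-segmentation label map `segments` (for example from a superpixel algorithm such as SLIC) are already available, and it simply ranks segments by their summed attribution.

```python
import numpy as np

def rank_segments(attributions, segments):
    # attributions, segments: 2-D arrays of the same spatial size
    labels = np.unique(segments)
    sums = {lab: attributions[segments == lab].sum() for lab in labels}
    # segments with the largest total attribution come first
    return sorted(labels, key=lambda lab: sums[lab], reverse=True)
```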

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Explaining CNNs: Recent Methods
Part 02
(Refer Slide Time: 00:13)

A popular approach for explainable machine learning in general, not just neural networks or CNNs, is known as LIME. LIME stands for Local Interpretable Model-agnostic Explanations; each of those words means something, which should become clearer when we explain the method. The way the method works is to approximate any underlying model, locally, by an interpretable linear model.

Let us see how it does that. You first have a trained model, and now you are going to train an interpretable model on top of some small perturbations of a given instance. Here is an intuitive visualization of the LIME approach. The blue and pink background gives the black-box model's decision function f; that is the classification boundary between two classes, assuming it is a binary classification problem.

All of the pink regions correspond to one class, and all of the blue regions correspond to the other class. The bold red cross is one specific test instance that you are given, which you have to explain. So I have been given an image, for instance of a cat, and let us assume that this is the decision boundary of a CNN; I now have to explain why this red cross is called a cat.

That is what we have been talking about so far; this is just a different way of viewing the same problem. Finally, you also have a dashed line here, which is the learned explanation. Now let us see how this explanation is learned. The way it is learned is to take this bold red cross, which is a given image, and construct a few perturbations of the image. You could take the image and occlude a part of it, a segment of it, or an object in it; these are different kinds of perturbations that you construct around the image.

You notice that when you remove a certain portion, our model classifies it as positive, and when you remove another portion, our model classifies it as negative. That is why you see these red pluses and these blue dots: those respective perturbations were classified by our model as belonging to class one or class two. You should be able to intuitively understand now how this can be used. If I know which perturbations make me lose the class label and move to another class, I know that those are the most important regions in the image for that particular class label to be assigned to the image.

(Refer Slide Time: 03:10)

Let us see this in more detail. Given a point x, that is, your bold red cross, let z' be a point obtained by perturbing one dimension or region in x. You could perturb a pixel, perturb a particular region, do occlusion-style experiments, or perturb an entire region of the image. Z is the set of all such perturbations. Let us assume that we have some measure π_x(z'), which is a proximity measure between instances z' and x.

For example, if you take this non-bold plus and the bold plus, the distance between them is what we refer to as π_x(z'). This could be an inverse exponential distance measure: you could use exp(−d(x, z')^2 / σ^2), where d is a distance and σ^2 can be predefined by the user. f is, again, the model being explained. Our goal now is, given these perturbations z' and your original image x, to build a linear model g(z') = w_g · z' such that the output of g(z') is the same as the output of f(z').

This means we want g(z') to also equal f(z'). What do we mean? Remember that f is the function that is already given to us, and g is the linear model that we are trying to develop using only those perturbed examples. We are saying that g should be learned in the neighborhood of x in such a way that it makes the same decisions on all of these perturbations that the original model would if you gave them as inputs.

So, in a sense, g is a local approximation of f: f is defined across the whole domain of inputs, but if you take a particular example and perturbations of that example around its neighborhood, then g is a linear model that approximates the original decision surface of f in that local neighborhood.

(Refer Slide Time: 05:42)

Now we want to learn w_g; that is our goal, because we want that linear model. w_g is learned by minimizing \sum_{z, z' \in Z} \pi_x(z) \, (f(z) - g(z'))^2, where π_x(z) is the proximity of z to x. You want to ensure that the decisions of f and g agree, and you also want to keep in mind how far z is from x.

If it is very far away, you can perhaps allow a little bit of a mistake, but if it is close by, you do not want to lose anything in the decision at all. That is the loss function that we use to learn g(z').
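A minimal sketch of this fitting step is shown below, under the assumption that the perturbations have already been generated: `perturbations` is a binary matrix saying which interpretable components (for example, superpixels) are kept in each sample, `model_scores` holds f(z) for each sample, and `proximity` holds π_x(z). A weighted ridge regression plays the role of the interpretable linear model g here; an L1-penalized model could be swapped in for sparsity, as discussed later.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_lime(perturbations, model_scores, proximity):
    # weighted least squares: proximity pi_x(z) weights each perturbed sample
    g = Ridge(alpha=1.0)
    g.fit(perturbations, model_scores, sample_weight=proximity)
    return g.coef_   # w_g: importance of each interpretable component
```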

(Refer Slide Time: 06:38)

Let us see how this is used. Given an image such as this, your perturbations could be: take, say, the face of the dog, remove it, give the rest of the image to the network and see what it predicts; similarly, remove the strings of the guitar and see what the model predicts; then remove the body of the guitar and see what the model predicts. Based on all of these, if you learn a g

(Refer Slide Time: 07:11)

the way we talked about it, you would get your w_g values, which tell you how much each of these z' perturbations weighs towards the decision. In a sense, that tells you which of these z' perturbations is most important for a particular decision.

(Refer Slide Time: 07:27)

Rather, you would be able to say that these are the regions in the image that are most important to
call this image as corresponding to electric guitar. These are the regions that are most important to
call this image as corresponding to acoustic guitar. And these are the regions corresponding to this
image, which correspond to the label Labrador.

(Refer Slide Time: 07:53)

LIME also has a fidelity versus interpretability trade-off. The reason is that the loss function we used tries to ensure that g, the linear model that we learn, approximates f in the locality defined by π_x; so the goodness of the function g is very important. At the same time, you do not want g to be too complex a function. Ideally, g is a model that belongs to the class of all interpretable models.

A simple example would be a linear model. We define Ω(g) to be the complexity of the model. What do we mean by complexity? If g is a decision tree, it could be the depth of the tree; if g is a linear model, it could be the number of nonzero weights. If we are using it for image-based analysis, we may simply say that the number of superpixels used should not be greater than a certain amount.

What is a superpixel? A superpixel is just a region of an image; we would not want more than three or four regions to be nonzero in that model when you make your decision. Remember that in this case g is a specific model learned for only one image, that bold red cross we saw; if you gave a different image, you would have to learn a different g. So this is the complexity of the g model.

In the earlier example that we talked about a couple of slides back, we learned a sparse linear model; you could use, say, L1 regularization to make your model sparse, as you may have learned in machine learning. That is what we use here to get these models to be sparse.

(Refer Slide Time: 09:45)

So why does that matter? You can view explanations in LIME as a trade-off between fidelity, that is, how faithfully g approximates f, which is given by the first term, and the complexity or interpretability of the explanation, which is the second term. It may not be possible to minimize both: if you want g to be truer to f, you may need a more complex g, but if you make g more complex, interpretability could be reduced. So it can lead to this kind of trade-off. Regardless, if you choose a linear model for g, it works fairly well in various settings.

(Refer Slide Time: 10:26)

The last model that we will see here is inspired by what are known as Shapley values in game theory. This method is called SHAP and was introduced at NeurIPS 2017. To understand this model, let us define a few notations. Let N be the total number of features in our setting, and let v be a value function, similar to an attribution but, as we will see, with a slight difference: a value function assigns a real number to any coalition S, a subset of the N features.

By coalition, we mean a subset of features. Why are we calling it a coalition? Because this is inspired by game theory. An example where Shapley values can be used: imagine a football team, and you want to find out how important Messi is for, say, the Barcelona team. To find that out, you take a team that already exists, say five to six players, and you see, if you added Messi to it, what value the team would obtain with Messi versus without Messi, with just the other players.

So the team is what we refer to as a coalition. In game theory these could be players or entities in any other context, but for us each of them is an attribute: we are saying that if we already have five attributes and we now add a sixth attribute, how much does our value change? That is what we want to measure. And φ_v(i) is the attribution score for a feature i.

Given these notations, the attribution score in the SHAP method is given by the marginal contribution that a player, in our case a feature, an attribute or a dimension, whatever you want to call it, makes upon joining the team, averaged over all orders in which the team can be formed. Let us again take the example of Messi. If you want to understand what value Lionel Messi brings to the Barcelona football team, you consider the other players in the team in all possible combinations.

And you see, if Messi is added, how the value changes. So you take the first three players, add Messi, and see how the value changes; you take seven players, add Messi, and see how the value changes; you take seven different players, add Messi, and see how the value changes. You average the value increase you got across all of these possible subsets, and that is the value Messi brings to the Barcelona team.

Formally, you would write this as

\phi_v(i) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (N - |S| - 1)!}{N!} \left[ v(S \cup \{i\}) - v(S) \right]

where the summation is over all subsets S without i (without Messi), the factorial term weights all possible orders in which such a coalition can be formed, N is the total number of players (features, for us), and the term v(S ∪ {i}) − v(S) is the value of the team with i added minus the value of the team S without i.
𝑣(𝑆), what if you add i to the team S minus only the team S without i.

That is the value of adding player i to a coalition. So, I hope you could translate this to
understanding the usefulness of a feature in a Neural network. But we have to answer one question,
what is v for a neural network? What is that value? How do you measure that? So, if you have a
function f, which is a neural network, which is for predicting, then we are going to define v as v
of S, which is there S is a subset of features is given by.

v(S) = E_{p(x' | x_S)} [ f(x_S ∪ x'_{S̄}) ], where S̄ is the complement of S. In other words, the value of a set S is the expectation, over the probability of picking values x' for the features in S̄ given that the features in S are already fixed, of the model output f(x_S ∪ x'_{S̄}): what would the output be if you filled in the remaining features S̄ with values x'?

So you take all of those outputs of f and take the expectation; that becomes the value of S. In this particular method, SHAP assumes that the features are independent, which means the conditional probability simply becomes the probability of picking particular values x' for the features in S̄. You look at what happens, given x_S, if you fill in the remaining features from x'; the expectation of the output over all such completions is the value of S.
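A brute-force computation of the Shapley attribution for a small number of features can be sketched as follows, assuming a hypothetical `value(coalition)` callable that implements the value function v(S) described above (for many features this exact enumeration becomes intractable, which is why approximations such as SHAP exist).

```python
from itertools import combinations
from math import factorial

def shapley(value, n_features, i):
    others = [j for j in range(n_features) if j != i]
    phi = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            # weight = |S|! (N - |S| - 1)! / N!
            weight = factorial(len(S)) * factorial(n_features - len(S) - 1) / factorial(n_features)
            phi += weight * (value(set(S) | {i}) - value(set(S)))  # marginal contribution of i
    return phi
```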

(Refer Slide Time: 15:48)

There is an extension of SHAP called Deep SHAP, where once again the input features are assumed to be independent of one another and the explanation model is linear. The reason it is called Deep SHAP is that it is a combination of SHAP and DeepLIFT: you take a distribution of baselines, compute the DeepLIFT attribution for each input-baseline pair, and then average the resulting attributions per input example to define your value function v. In short, the value function is inspired by DeepLIFT; that is the idea here.

(Refer Slide Time: 16:31)

The rest follows as before: the final attribution score is still given by the same Shapley equation here; we are only changing how v is defined, using these equations.

(Refer Slide Time: 16:45)

Having seen these different methods to explain the decisions of CNNs, or neural network models more broadly,

(Refer Slide Time: 16:58)

an important question to ask is: how do you evaluate these explanations? Do you have an idea of how to check the goodness of these explanations? In all of the results that we showed so far, we showed qualitative results: we took an image and said, look at this, it seems to work well.

But when you have a million images in your test set, you cannot look at every image and study whether CAM, DeepLIFT or Deep SHAP is working better. You need some objective metric that averages the performance across all of your images. How do you do this? A simple way would be to take the intersection over union (IoU) of the thresholded salient region with a ground-truth bounding box, if one is available.

Suppose somebody already told you that, for a given image, there is a cat in the image, and also gave you a bounding box around the cat; let us say that is part of the dataset. Then, if my saliency map has a high intersection over union with that box (we will see intersection over union in more detail when we cover other tasks such as detection), we assume that our saliency map, or the method used to compute it, has worked fairly well. What else can we use?

(Refer Slide Time: 18:30)

There is another metric, recently proposed in NeurIPS 2018 and BMVC 2018, called faithfulness. In faithfulness, we measure the correlation between the attribution scores and the output differences on perturbation. We are trying to see: if you perturb an input and see a significant output change, then the attribution of that input should also be high, because it affected the output significantly.

So the correlation between the attribution scores and the output differences on perturbation should ideally be high. Here is the equation: faithfulness is given by ρ(R, Δ), the correlation between R and Δ, where R_i is the relevance of a particular pixel i and Δ_i is the output difference when pixel i is perturbed in the image. Another metric, defined in the BMVC paper, is known as the causal metric.

It has two variants, known as the deletion and insertion metrics; the insertion metric turns out to be complementary to the deletion metric. In the deletion metric, you delete pixels sequentially, most relevant first according to our attribution method, for a particular outcome. So, if we think certain pixels of a cat are the most relevant for calling it a cat, remove those pixels from the image.

So, replace them with black or gray pixels; remove those pixels from the image, and then compute the AUC. I hope all of you know what AUC is: the area under a curve, as in the ROC curve, which is a standard metric used in several machine learning settings. You compute the area under a curve of the network's output; such a curve, as you would know, has two axes.

Generally, you vary some threshold on one axis and see what happens to a metric you are monitoring on the other, and for an ROC curve you ideally want something like this with a high area. But in this particular case, we plot the network's output on the perturbed input against the amount of perturbation: as you keep increasing the perturbation, what happens to the output? If the y-axis changes a lot as you vary the x-axis, then those pixels do have a high influence on the output.

But because we have deleted the most relevant pixels first, we would want the AUC to be low here. So, if you remove the most relevant pixels and plot the curve the way we saw it, that is, perturb the input, see how much the output changes, and plot this as a curve, you would want the AUC to be low after removing the most relevant pixels.
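A minimal sketch of the deletion metric could look as follows; it assumes a hypothetical model_score function and a saliency map R, deletes pixels in order of decreasing relevance, and reports the area under the resulting score-versus-fraction-deleted curve using np.trapz.

```python
import numpy as np

def deletion_auc(model_score, image, R, steps=50, baseline=0.0):
    """Delete the most relevant pixels first and track how the class score drops."""
    h, w = R.shape
    order = np.argsort(R.ravel())[::-1]          # most relevant pixels first
    per_step = max(1, len(order) // steps)
    perturbed = image.copy()
    scores = [model_score(perturbed)]
    for s in range(steps):
        idx = order[s * per_step:(s + 1) * per_step]
        rows, cols = np.unravel_index(idx, (h, w))
        perturbed[rows, cols] = baseline         # "remove" pixels by setting them to a baseline
        scores.append(model_score(perturbed))
    x = np.linspace(0.0, 1.0, len(scores))       # fraction of pixels deleted
    return np.trapz(scores, x)                   # lower AUC = better attribution
```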

(Refer Slide Time: 21:55)

You could also look at this complementarily as an insertion metric, where you insert pixels sequentially, least relevant first; in this case, a higher AUC is better, by the same reasoning as for the deletion metric.

(Refer Slide Time: 22:14)

Another recent metric, proposed at NeurIPS 2019, is known as ROAR, which stands for Remove and Retrain. What the ROAR metric suggests is: you get a saliency map for each image in your training data, you retrain the model after perturbing the most relevant pixels, and the new model should then show a large reduction in accuracy. So if you take an image of a cat, you know what saliency map you got for it; you now perturb all of those pixels and retrain, and you should ideally see a reduction in the accuracy of the model calling it a cat.
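A rough sketch of this protocol is given below, assuming hypothetical helpers get_saliency, train_model and evaluate_accuracy; the key point is that the perturbation is applied to the training data and the model is retrained before measuring the accuracy drop.

```python
import numpy as np

def roar_accuracy_drop(train_images, train_labels, test_set,
                       get_saliency, train_model, evaluate_accuracy,
                       fraction=0.3, baseline=0.0):
    """Remove the top `fraction` most relevant pixels per training image, retrain, compare accuracy."""
    perturbed = []
    for img in train_images:
        R = get_saliency(img)                          # saliency map for this training image
        k = int(fraction * R.size)
        idx = np.argsort(R.ravel())[::-1][:k]          # most relevant pixels
        img_p = img.copy()
        np.put(img_p, idx, baseline)                   # occlude them
        perturbed.append(img_p)

    original_model = train_model(train_images, train_labels)
    retrained_model = train_model(perturbed, train_labels)
    return (evaluate_accuracy(original_model, test_set)
            - evaluate_accuracy(retrained_model, test_set))   # large drop = informative saliency
```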

That is the ROAR metric. There have also been Sanity Checks for Saliency Maps proposed; we are going to leave this as homework reading for you. And there is also a work published at ICML 2017, known as Axiomatic Attribution, which defined a set of axioms for any of these attribution methods. It says that any attribution method, be it Grad-CAM, be it LIME, be it DeepLIFT, any of the different methods that we saw this week, must satisfy certain axioms for it to be reliable in practice. What are those axioms? Once more, homework for you to go back and read.

(Refer Slide Time: 23:40)

To summarize this lecture: both DeepLIFT and Integrated Gradients overcome the saturating-gradients problem, although DeepLIFT is generally faster, because Integrated Gradients has to compute gradients across an entire path of different inputs. However, DeepLIFT violates one of the axioms we talked about on the previous slide, known as the Implementation Invariance axiom. Maybe you should try finding out why as homework.

Smooth Integrated Gradients is generally preferred over plain Integrated Gradients when sparsity is desired. For better interpretability in terms of visual coherence, XRAI is a good choice, because it reasons at the level of regions rather than pixels. LIME is a model-agnostic method: it can be used on top of any machine learning model, because the model is anyway approximated locally as a linear model at the end.

And it can be used for image data, text data, tabular data, and so on. It is a bit slow, because you have to train a surrogate model for every input you explain, and it is also sometimes inconsistent between runs. But it is a method used by many companies today. SHAP has a strong game-theoretic background, but to make it work, you have to use some approximations in the real world.

(Refer Slide Time: 25:12)

So, your homework is to go through the list of axioms of attribution in this work and, for each axiom, try to find out which of the attribution methods we covered this week satisfy that particular axiom. Also go through the proposed sanity checks and experimental findings in this paper. Both of these papers are good reads.

And from a programming perspective, play with Captum, a popular open-source library for model interpretation from Facebook. You can also try visualizing any models that you have built in your assignments so far through the lens of OpenAI Microscope, for visualization and understanding.

(Refer Slide Time: 25:57)

Here are some other resources. There is a very nice book by Christoph Molnar, known as Interpretable Machine Learning, which covers all of these methods in more detail if you are interested; it is an online book. There is also a collection of tutorials and software packages at the link that you see here.

(Refer Slide Time: 26:16)

The references for this lecture are here

Deep Learning for Computer Vision
Professor Vineeth N. Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Going Beyond Explaining CNNs
(Refer Slide Time: 00:14)

For the last lecture of this week, we are going to try to go beyond explaining CNNs and look at a couple of interesting applications that use the concepts we have seen so far, gradients and various other artifacts for explaining CNNs, in interesting ways. Before we go there:

(Refer Slide Time: 00:41)

Let us review one of the homeworks that we left behind last lecture, which was for you to read the axioms of attribution. I hope you had a chance to go through that work. Here are a few of the axioms claimed in the paper, which all attribution methods should satisfy. In this particular context, attribution methods in general refer to methods that broadly cover all of machine learning, maybe even beyond vision; but in the context of computer vision, attribution methods, saliency maps and visual explanations are all broadly synonymous.

We would like these attribution methods to satisfy a few of these axioms. The completeness axiom says that for any input x, the sum of the feature attributions equals F(x): if you had, say, N or 10 different attributes in your input, the sum of their attributions with respect to your neural network should be the actual output that you get from the network.

This is a bit similar to what you saw in DeepLIFT, where we said that the contributions must add up to δt, the difference of the output with respect to the reference output. So the completeness property is something that DeepLIFT satisfies. Another axiom, sensitivity, states that if x has only one non-zero feature and F(x), the output of the neural network, is non-zero, then the attribution to that feature should be non-zero.

And you would actually find that vanilla gradient-based methods do not satisfy this sensitivity axiom, whereas Integrated Gradients does. Think about it and try to answer why; you can also read this paper for the answer. Another axiom is implementation invariance: when you have two neural networks that compute the same mathematical function F(x),

regardless of how differently they are implemented, the attributions to all features should be identical. So let us assume you had two neural networks, one with three layers and one with two layers, but it happens that they give exactly the same output for every input you can give; by some means they have learned the same function, maybe using ReLU in the three-layer network and a sigmoid in the two-layer network.

But let us assume they learned exactly the same function. If that happens, the attributions to all features that you get from both of these networks should be the same. And you will actually see that DeepLIFT violates implementation invariance; once again, for the reason, please look at the paper cited at the bottom. Another axiom concerns symmetry preservation, and states that for any input x where the values of two symmetric features are the same, their attributions should be identical as well.

By symmetric features, we mean features that can be swapped while the neural network output remains the same. This could be at the input or at an intermediate layer; symmetry is a broader concept in general, but here we are saying that if two features could be swapped and the output would still stay the same, then when their values are equal, both those features should receive the same attribution.

Please do read this work below for more details about the Axioms. With that, let us go on to
these interesting applications that go beyond explanations, but use the concepts that we discussed
over the course of this week.

(Refer Slide Time: 04:52)

The first one is a work known as Deep Dream, developed in 2015, which uses the ideas we talked about in the earlier lectures of this week, but in a different and interesting way. It modifies a given image in a way that boosts the activations at a chosen layer, creating a feedback loop. This means that if you have a feature or a kernel that detects a dog's face in a particular layer, you backpropagate that onto a given sky image and keep adding the dog's face to the cloud image, and you end up with a pretty interesting artistic image at the end of this process.

(Refer Slide Time: 05:41)

Let us study this method in more detail. You have an input image; let us choose a particular layer and a particular filter in that layer, say this layer here and a filter in it. Now you feed the image through the CNN and forward propagate until that particular layer of the neural network. Once you do that, you backpropagate the activations of that filter back to the input image.

So, you do a backprop-to-image very similar to what we did in one of the earlier lectures. But this time the input is not a black or gray image; this time, the input is any other image that you want to overlay the pattern onto. So you take an image of the sky, take the filter for a dog's face or any other filter, and keep adding to this image the gradient that you get from that filter: you try to see what in the input image would have caused that filter to fire, which is given by the gradient of that filter's activation with respect to the input.

Remember, we talked about maximizing a filter's activation: you set the gradient for only that filter to one, everything else to zero, backprop to the image, and update the image using gradient ascent. You do the same thing here: you multiply those gradients by your learning rate and add them to the input image. That is the gradient ascent we talked about.
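A minimal PyTorch-style sketch of this gradient ascent loop is shown below. It assumes a callable model_features that runs a pretrained network up to the chosen layer (for example, a torch.nn.Sequential built from the first few children of the CNN); names and the normalised step are illustrative assumptions, not the exact Deep Dream implementation.

```python
import torch

def deep_dream_step(model_features, image, filter_idx, lr=0.01, steps=20):
    """Boost one filter's activations by gradient ascent on the input image (Deep Dream style)."""
    img = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        acts = model_features(img)              # feature maps at the chosen layer, shape (1, C, H, W)
        loss = acts[0, filter_idx].mean()       # activation of the chosen filter
        loss.backward()
        with torch.no_grad():
            g = img.grad
            img += lr * g / (g.abs().mean() + 1e-8)   # gradient ascent with a normalised step
            img.grad.zero_()
    return img.detach()
```

For example, model_features could be torch.nn.Sequential(*list(pretrained_cnn.children())[:k]) for some layer index k; repeating this step (possibly across scales and filters) gives the dream-like images shown on the slides.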

What happens? You will get an image such as this; in this case it is not a dog, it looks like a bunch of buildings added to the original image. And then you go back to step 2 again: you feed this image into the neural network again, and maybe you use the same filter, maybe you use a different filter.

(Refer Slide Time: 07:44)

And you can keep playing with this to create different kinds of images.

(Refer Slide Time: 07:47)

You could now take a different filter in a different layer, add that, and you end up getting a different kind of construction of the image. You would see that higher layers produce more complex features, while lower ones enhance the edges and textures. You can see that in these two images here: the red-bordered one was created from a slightly earlier layer, and the yellow-bordered one was constructed from a slightly later layer.

(Refer Slide Time: 08:22)

A few more examples of Deep Dream. Given an image of a horizon, you can convert it into one with towers: it is the same image, but you keep juxtaposing a tower filter onto it, and you end up with a construction such as this. Another example goes from trees to buildings being added on, and you come up with a nice pagoda-like structure on the same scene. And here is a photo of leaves; you keep adding the gradients of bird and insect filters, and you get an interesting variant of the image, starting from the original leaf image.

(Refer Slide Time: 09:04)

More examples.

(Refer Slide Time: 09:10)

And now let us see another similar, interesting application, known as Neural Style, or Neural Style Transfer as it is popularly known. The idea here is: given an image and a particular style, say a modern art style, how do I render this image in that style? For example, can I take my photograph and make it look like Van Gogh had painted it? How would I get that style into my photo? That is what we want to study here. One option is to simply overlay one of Van Gogh's paintings on top of the image I have, but unfortunately this really does not give good results.

(Refer Slide Time: 09:56)

Another option is to ask a human to do it for us, but that is not what we want to go for. So the third option, the one we are going to look at, is what is known as Neural Style Transfer, a method that was also proposed in 2015.

(Refer Slide Time: 10:12)

And let us see how this works. Once again, you have a trained AlexNet model; remember, throughout this entire week we already have a pre-trained model, we are not training the model in any way, we are only working with the trained model in different ways. So, you provide your input image to the network, and you extract the activations at different layers for that given input image.

Then you also extract what are known as style targets. You give the style image, which could be a painting by Picasso or Van Gogh or something like that, as input to the same neural network. But this time, you do not keep the raw activations of this input at different layers; instead, you compute what are known as Gram matrices of the ConvNet activations at all layers. What does that mean?

So, if you take a particular layer here, remember it is composed of a set of feature maps, a volume of feature maps. Each feature map could be, say, 20 × 20 as an example. What we do here is linearize or rasterize this entire feature map into a 400-dimensional vector, and then take the covariance of this 400-dimensional vector with that of another channel, another feature map, in that volume.

You would now get one such value, the covariance of one channel with the next channel. Similarly, you compute the covariance of every channel with every other channel, and you get a Gram matrix of the ConvNet activations at that layer. Why is this a style target? There is no clear answer, but one could surmise that it is because the relationships between the filters give us an idea of what style is being captured by the neural network. That is the broad idea here.
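As a small sketch, the Gram matrix of a layer's activations can be computed as below (a PyTorch-style illustration; the normalisation by the number of elements is a common but not universal choice).

```python
import torch

def gram_matrix(feature_maps):
    """feature_maps: tensor of shape (C, H, W) -> (C, C) matrix of channel-wise correlations."""
    c, h, w = feature_maps.shape
    flat = feature_maps.view(c, h * w)        # rasterize each channel into a vector
    gram = flat @ flat.t()                    # inner product of every channel with every other channel
    return gram / (c * h * w)                 # optional normalisation
```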

(Refer Slide Time: 12:33)

So, what do we do with the style targets? Once you obtain these Gram matrices of ConvNet activations from the style image, you again take a trained network and give it a blank image as input. You backprop to this image again, but you try to ensure that the activations you get at any layer of this model are close to the activations you get for the content image on the same model.

You also try to ensure that the Gram matrices of activations you get here resemble the Gram matrices of activations you get for the style image. Remember, the first of these is the content part and the second is the style part. So you now backprop to the image, starting from a gray image, on another of these AlexNet models or any CNN model. Recall that backprop-to-image was the argmax over I: we tried to maximize the score with respect to a particular neuron by optimizing the image I.

But this time, we are going to minimize the distance between the activations we get for this input and the activations we get for the dog (content) image on this network. Similarly, we minimize the distance between the Gram matrices of the covariances between activations here and the Gram matrices of the covariances between activations obtained when you forward propagate the style image.

(Refer Slide Time: 14:12)

When you combine these two, you end up getting an image which takes its style from the style image and its content from the input image; and that is the output you get, even though you started with a simple gray image.

(Refer Slide Time: 14:29)

Here are some examples of various neural styles. You can see the input image here and the style here, and you get a pretty interesting output that resembles the style. Here is Steve Jobs' picture with a style applied, and you see how the picture changes. Similarly, you see several other examples, which are fairly good in terms of what they achieve.

(Refer Slide Time: 14:56)

More examples here, if you would like to take a look, where the content and style are varied: an entire grid of examples of taking a content image and trying different styles on it.

(Refer Slide Time: 15:10)

If you would like to understand this more, there are a couple of interesting links here which give you an intuitive understanding of Neural Style Transfer; both are on Towards Data Science. There is also a nice explanation of Deep Dream on Hacker Noon, with code provided. And an interesting exercise for you is to watch this fun YouTube video on doing Deep Dream on videos.

The example that we talked about in this lecture was about doing Deep Dream on a given input
image. What if you do Deep Dream on videos? That is what this fun video on YouTube does.
Please take a look at it. More importantly, try to figure out how it was done. That is the exercise
for you this week.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 44
CNNs for Object Detection-I

We have so far seen neural networks in their simplest form, feedforward neural networks; learnt backpropagation; seen how different variants of gradient descent can be used; and learnt a few practical tricks for training neural networks. We then moved on to convolutional neural networks, the building blocks of computer vision: we saw how they can be trained using backpropagation, and we saw various kinds of CNN architectures.

And then, last week, we also saw how one can visualize and understand what CNNs are learning, as well as explain their predictions. But in all these lectures so far, we focused on the task of classification: given an image, our objective was to classify what object was in the image. You could extend that to regression, where you predict a real value, by changing the cross-entropy loss to a mean squared error.

However, there are many other kinds of vision tasks, and we will now move on to those tasks this week. We will start with one of the most widely used tasks today, which is object detection.

(Refer Slide Time: 1:38)

So, the difference between classification, localization and detection is as follows. In classification, given a cat in an image, the objective is to say that there is a cat in the image. In localization, in addition to saying that there is a cat, you also produce a bounding box around the cat to say where in the image the cat is located. In object detection, you go one step further: you detect all possible occurrences of the objects in your set of classes in a given image, as well as localize them. In the localization task, there is only one object which you localize, whereas in object detection there could be multiple objects, multiple instances of the same object; you could have a cat and a dog, or two dogs and one cat, all of these possibilities.

The objective here is to recognize each of these objects as well as localize them using these kinds of bounding boxes. Finally, there is the task of segmentation, where the job is to label each pixel as belonging to one of, say, C classes, depending on the number of classes we have. We will start this lecture with detection and then move on to segmentation, as well as other tasks, later this week.

(Refer Slide Time: 3:17)

Before deep learning came into the picture, there were other methods used to deploy computer vision for object detection on various devices. Object detection and face detection were available even in point-and-shoot cameras 15 years ago, when deep learning methods had still not matured. Most of those methods used a variant of what is known as the Viola-Jones algorithm, developed by Viola and Jones (hence the name) and published at CVPR 2001.

This is a framework to perform object detection in real time. It was primarily used for detecting frontal upright faces, although it was adapted to detect other kinds of objects, or even parts of objects, such as eye detection on a face. The main contributions of this work were: weak classification using what are known as Haar-like features (we will see what these are on the next slide), and a very powerful idea known as integral images, which makes the computation of such Haar-like features significantly faster, as we will see.

They then used AdaBoost as the classifier of choice, although how they used AdaBoost was slightly different from the traditional formulation in machine learning; and they also used a cascade of classifiers, as we will see over the next few slides.

(Refer Slide Time: 5:07)

Haar features are rectangular features based on Haar wavelets; this is how they look. You can see that there is generally a black region and a white region, but there can be different combinations: just two regions, white and black, arranged horizontally or vertically; a black region sandwiched between white regions, vertically or horizontally; or checkerboard patterns. You can keep going and get various kinds of patterns using the same idea.

The idea behind a Haar filter is that the feature value is given by the sum of the pixels in the white rectangles subtracted from the sum of the pixels in the black rectangles. Remember, this is a convolution, so things get flipped, which is why the output of convolving with a Haar filter is the sum of the pixels inside the black rectangle minus the sum of the pixels inside the white rectangle; black is 0 and white is 1, but because of the flipping this is what the output turns out to be.

And to handle images at different scales, these features can be scaled in height and width, and can come in different patterns, as we will see: a 6 × 4 filter, a 10 × 6 filter, a 10 × 24 filter, and so on, of various sizes depending on what you are applying these Haar wavelets for.

(Refer Slide Time: 6:52)

The intuition for using Haar features for object detection, and face detection in particular, is that many features of the human face are about contrast between a region and the adjacent region. For example, the eye region, as you can see here, is likely to be a dark area with a noticeably brighter area right above it; Haar features such as this would capture that difference. You could also represent and capture eyes using a white block sandwiched between two black blocks in a Haar wavelet. So these are likely to be good detectors of parts of the face.

(Refer Slide Time: 7:44)

As I just mentioned, this work also introduced the idea of what is known as an integral image. This is a fundamental concept, not necessarily specific to face detection; it is a way of making computations faster, especially when using Haar-like features, but it was introduced in this particular work by Viola and Jones for object detection.

So, let us see what an integral image is. The idea is to reduce the computational complexity of adding up pixel intensities. The integral image is defined as follows: the value at any pixel of the integral image is the sum of all the pixel values up to that point, starting from the origin at the top left.

Mathematically, you can write the integral image at location (x, y) as

ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')

where x is the current x location and y the current y location. You add up all the pixel values up to that particular pixel, and the sum of all those intensities is what you fill in at that pixel of the integral image. So, given this kind of an image, the integral image would look like this; you can see, for example, that for the 2 × 2 window the value 290 is given by 38 + 66 + 89 + 97, and so on.

The recurrence for the integral image is given by

s(x, y) = s(x, y − 1) + i(x, y)
ii(x, y) = ii(x − 1, y) + s(x, y)

that is, the cumulative row sum up to the previous column plus the current pixel, and the integral image value one step back plus that row sum. You can work this out and see that it is correct. Here s(x, y) is the cumulative row sum, and to make this recursive definition work, you define s(x, −1) = 0 and ii(−1, y) = 0.
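A small sketch of building the integral image with exactly this recurrence is shown below (plain NumPy; two calls to np.cumsum would give the same result in one line).

```python
import numpy as np

def integral_image(i):
    """ii(x, y) = sum of i over all pixels up to and including (x, y); x indexes rows here."""
    h, w = i.shape
    s = np.zeros((h, w), dtype=np.int64)    # cumulative row sum s(x, y)
    ii = np.zeros((h, w), dtype=np.int64)   # integral image ii(x, y)
    for x in range(h):
        for y in range(w):
            s[x, y] = (s[x, y - 1] if y > 0 else 0) + i[x, y]      # s(x, -1) = 0
            ii[x, y] = (ii[x - 1, y] if x > 0 else 0) + s[x, y]    # ii(-1, y) = 0
    return ii
```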

(Refer Slide Time: 10:11)

Let us now see how the integral image can be used, in particular for Haar filters. Suppose we want to find the sum of the pixel intensities inside the rectangle defined by the corners 1, 2, 3, 4. Let A be the sum over the top-left rectangle, B and C the sums over the two larger rectangles that each include A and extend to corners 2 and 3 respectively, and D the sum over the full rectangle up to corner 4. Then the sum of the pixels within the rectangle 1, 2, 3, 4 is given by D + A − (B + C). You can work this out and see that it is true.

More formally, the sum of pixels inside a rectangle with upper-left pixel (x1, y1) and bottom-right pixel (x2, y2) is given by

ii(x2, y2) + ii(x1 − 1, y1 − 1) − (ii(x1 − 1, y2) + ii(x2, y1 − 1))

and you can work this out to see that it is actually true.
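A corresponding lookup, reusing the integral_image sketch above and the convention that indices outside the image contribute 0, could be:

```python
def rect_sum(ii, x1, y1, x2, y2):
    """Sum of pixels in the rectangle with corners (x1, y1) and (x2, y2), inclusive."""
    def at(x, y):
        return ii[x, y] if x >= 0 and y >= 0 else 0   # ii(-1, .) = ii(., -1) = 0
    return at(x2, y2) + at(x1 - 1, y1 - 1) - (at(x1 - 1, y2) + at(x2, y1 - 1))
```

Each rectangle costs 4 lookups, and adjacent rectangles of a Haar feature share corner values, which is why a 2-rectangle feature needs only 6 lookups, as discussed below.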

So, in this particular case, say you wanted to find the sum of pixel intensities in this particular square block: since you already have the integral image computed, you only have to look up a few values and compute 1492 + 532 − 798 − 1827. Why is this useful?

When you want to apply a convolution such as a Haar filter to these kinds of images, consider a 2-rectangle feature, by which we mean a Haar filter given by just a black region followed by a white region. Remember, its response is the sum of pixel intensities in the black region minus the sum of pixel intensities in the white region. Such a filter needs only 6 lookups in your integral image. Why?

You need one lookup for each of the corners; there are six corners here, including the middle ones, so with those 6 lookups you can compute the output of the Haar filter without actually doing any convolution: you just look up the integral image and get these values. As an exercise, think about how many lookups you would need for a 3-rectangle feature. What do we mean by a 3-rectangle feature? Consider a Haar filter with, say, black in the center and white on both sides.

That would be a 3-rectangle feature; how many lookups would it need? It is simple, but work it out later. Similarly, what if you had a 4-rectangle feature, something like a checkerboard? How many lookups would you need then? Think about it as homework.

(Refer Slide Time: 13:32)

Once you have the outputs of these Haar filters, the algorithm of Viola and Jones treats each feature as a weak classifier. Given a feature f_j, which is one of those Haar wavelets (a 2-rectangle, 3-rectangle or 4-rectangle feature, in any of its possible variants), a threshold θ_j and a parity p_j, which only indicates the sign, the weak classifier is defined as

h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise.

So, you are saying that if the feature value is below a threshold, you call it 1, and 0 otherwise; of course, you can use the parity to flip this if you like. Using this, you get one weak classifier for every Haar feature. We know that a combination of weak classifiers can give a strong classifier, but do you see a problem with this approach so far?

If you think carefully: even with a simple 24 × 24 sliding window that you slide across the image, inside each window you would compute various Haar features and use this thresholding to decide whether each Haar feature says there is a face or not. The problem is that even within just a 24 × 24 window there are over 160,000 possible features: you could have a very large checkerboard, and you can imagine all possible combinations of arranging black and white blocks adjacent to each other inside the window, giving a huge number of features.

Now, computing 160,000 weak classifiers inside each sliding window seems tedious. What do we do?

(Refer Slide Time: 15:51)

That is where AdaBoost comes into the picture. The way AdaBoost was used in the Viola-Jones algorithm is given by the algorithm shown here, taken from the paper; we will only walk over its salient parts. It is the standard AdaBoost setup: given example images (x_1, y_1), ..., (x_n, y_n), you start by weighting all your training samples uniformly, and you normalize those weights across all of your training samples.

Now, for each feature j, which is one of your Haar features among all possible combinations, you train a weak classifier h_j restricted to using that single feature. The error of this weak classifier is

ϵ_j = Σ_i w_i |h_j(x_i) − y_i|

where h_j(x_i) is the output of the weak classifier, y_i is the expected output, and w_i is the weight of the sample being evaluated; so the error measures how far the weak classifier is from the expected output, weighted by each sample's weight, which is the standard AdaBoost mechanism of giving more erroneous samples a higher weight and less erroneous samples a lower weight.

Now, the Viola-Jones use of AdaBoost says: among all those possible features, which would give you several different weak classifiers, choose the classifier h_t with the lowest error ϵ_t, that is, the one that gives the least error among all your possible classifiers. Based on the error of this classifier, you re-weight all your samples as you see here; this is again the standard re-weighting step in AdaBoost. You repeat this over and over again, and your final strong classifier is obtained after a certain number of iterations. Remember, you would not be using all features; you use only T features, where T is a parameter that you can set.

In each iteration you pick the single feature whose classifier is strongest, and at the end you use a weighted combination of T different features. Once again, this final strong-classifier rule is the standard AdaBoost combination.

(Refer Slide Time: 18:19)

A third component of the Viola-Jones algorithm is known as the classifier cascade. This came from the observation that very few sub-windows of an image actually contain objects or faces. You could have hundreds of thousands of sub-windows in a given image, but you are not going to have that many faces, so it is very likely that the false positive rate will be very high if you evaluate every possible window. How do we overcome this issue?

What the classifier cascade idea suggests is to eliminate negative windows early, thereby reducing the computation time spent on them: if a sub-window is classified as negative, ignore it and never process it any further. Speaking at a very high level: all possible sub-windows go to the first stage, which rules out a certain number of them, assuming they will never contain a face; only the sub-windows where a face is likely go to the next stage of the classifier, which checks for faces again, and sub-windows rejected at that stage are also dropped. Only the remaining sub-windows go through to the next stage, and so on.

And each of these stages is a set of weak classifiers; you can constrain the number of them using T. This way, you use a sequence of stages of classifiers to keep refining the decision over time.

(Refer Slide Time: 20:05)

Why does this classifier cascade matter? We just mentioned that for a problem like face detection or object detection, the false positive rate is very important. If you have a one-megapixel image, you are dealing with 10^6 pixels and at least a comparable number of candidate face locations; even if you assume a face could be at each of these locations, you are looking at at least 10^6 possibilities. You could also have faces of different sizes, which would expand the candidate face regions even further, but for the moment let us assume you have at least 10^6 candidate locations.

So, your ideal false positive rate has to be less than 10^-6, assuming there is only one face, or of the order of a few faces, in the image; you want your false positive rate to be about 10^-6. If you took the basic Haar filters that we spoke about a few slides ago, a combination that yields a 100 percent detection rate would give you a 50 percent false positive rate: it flags a lot of possible faces, and not all of them actually turn out to be faces.

If you went a little further and included 200 features (remember, the above example had only two kinds of Haar filters; by 200 kinds we mean arranging the black and white rectangles in many ways: a 2 × 2 checkerboard, a 5 × 5 checkerboard, 3 rectangles, 5 rectangles, 7 rectangles, all those possibilities are open), then 200 features yield a 95 percent detection rate and a false positive rate of roughly 1 in 14,000. That is better, but still not the 10^-6 false positive rate we want.

What do we do? The classifier cascade comes to our rescue. Remember that in a cascade, the final detection rate and false positive rate are obtained by multiplying the detection rates and false positive rates of the individual stages. So, if each stage was, say, a 200-feature classifier (that is, T = 200), then you could have an overall detection rate of approximately 0.9 and achieve a false positive rate of about 10^-6 across a 10-stage cascade, if each stage had a detection rate of 0.99 and a false positive rate of 0.3.

Because if you had a false positive rate of 0.3 in each stage of the cascade, you would have 0.3^10 overall: you are talking about a 10-stage cascade, so 0.3 × 0.3 × ⋯ × 0.3 gives 0.3^10, which is roughly of the order of 10^-6. So the idea of the classifier cascade allows each stage in the cascade to have a reasonable detection rate of 0.99 and a false positive rate of 0.3, which, as we saw, is possible with just a few features, and we keep refining over the stages of the cascade to obtain the false positive rate we ideally wanted.
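You can check the arithmetic directly; the numbers below are just the products of the per-stage rates quoted above.

```python
stages = 10
detection_rate_per_stage = 0.99
false_positive_rate_per_stage = 0.3

print(detection_rate_per_stage ** stages)       # ~0.904, i.e. an overall detection rate of about 0.9
print(false_positive_rate_per_stage ** stages)  # ~5.9e-6, roughly of the order of 10^-6
```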

(Refer Slide Time: 23:39)

Another aspect of this method, and of any detection method for that matter, is the notion of non-maximum suppression. It is likely that many windows around an object will be classified as containing the object; how do you choose which one of them is the right detection?

(Refer Slide Time: 24:06)

We have to use some kind of bounding-box similarity measure, and the most popular one is known as intersection over union (IoU). If you have two bounding boxes B1 and B2, you take their intersection and their union; the ratio of the intersection to the union gives you a sense of how close the bounding boxes are. If the ratio is 1, the bounding boxes B1 and B2 are exactly the same, and as the ratio gets close to 1, the bounding boxes overlap heavily with each other. How do we use this IoU?

(Refer Slide Time: 24:45)

We can now use it for non-maximum suppression. If you have a set of candidate bounding boxes that contain an object, you select a box from them (for now, say at random), and then compare that box with the rest of the boxes: if the IoU is greater than 0.5, you remove the other box from the list. So you are ensuring that if there are multiple boxes with more than a 50 percent overlap with each other, you retain only one of them.

We will see a little later that we do not just follow this; we also try to find which of these boxes has the highest confidence of the object being in the box. We will see that in a few slides from now. This process is known as non-maximum suppression (NMS), which is extensively used in detection today, as we will see.
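Here is a sketch of the score-based variant (the one described again a few slides later), reusing the iou helper above; boxes are (x1, y1, x2, y2) tuples with associated confidence scores.

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it by more than the threshold, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                 # box with the highest remaining confidence
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep                             # indices of the retained boxes
```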

(Refer Slide Time: 25:39)

That was the Viola-Jones algorithm for face detection: very popular, and used in several technologies. Another popular approach at that time was the Histogram of Oriented Gradients (HOG). We have seen this when we talked about extracting handcrafted features from images, but we will now talk about how it is used for the task of pedestrian detection.

So, if you have, say, a pedestrian in an image, you have a detection window that slides over the image, and you first compute all your gradients; these could come from a Sobel filter, an LoG filter, and so on. Now you divide the entire window into cells, and in each cell you build a histogram of gradient orientations; we have done this before with SIFT, so you perhaps know how to do it. Once you get these histograms of oriented gradients in each of these cells, you then consider overlapping 2 × 2 blocks of cells.

So you see here that you have 4 different cells in a 2 × 2 block, and these blocks can overlap. You take a 2 × 2 block of cells, normalize the histograms of all 4 cells in that block, and that is what you define as the final histogram for that block. When you do this for all the cells in your image, you get a final descriptor that looks like this: a set of histograms of oriented gradients across different locations in the image. How do you use this? Once you get this final descriptor, you train a support vector machine to say whether the window is a pedestrian or not. This approach was introduced by Dalal and Triggs in 2005.

(Refer Slide Time: 27:38)

You can also visualize how the weights of the support vector machine look in this particular scenario. If you take the average gradient image over all your training examples, you can visualize the positive and negative SVM weights; remember, the cells correspond to the dimensions of your descriptor vector, so you can picture positive and negative SVM weights per cell after you learn a support vector machine.

Given a test image and its corresponding histogram of oriented gradients descriptor, you can multiply the descriptor by the positive and negative weights, and you get something like this; you will notice that, looking at the positive weights, it would be easy for the SVM to classify this as an image containing a pedestrian.

(Refer Slide Time: 28:31)

This was also extended to work at multiple resolutions, where you have an image pyramid and perform the same approach at each level of the pyramid. For example, at the lowest resolution you take a specific window and get a score for a pedestrian being there using the support vector machine; you do the same at the higher resolutions and at the highest resolution, and then you can use various techniques to integrate the scores. You could integrate them the way we did for pyramid matching, by giving a high weight to the highest resolution and a low weight to the lowest resolution.

Or you could take a vote, or use other kinds of heuristics to combine these scores, to decide whether a person is in the image. These were the different approaches used for detection before deep learning came into the picture.

(Refer Slide Time: 29:38)

Now let us see how one would do object detection using CNNs in the simplest possible manner. If you have, say, images of cars during training, you train a CNN on those images to learn a car classifier, using a standard cross-entropy loss and a simple CNN. At test time, when you get an image where a car could be located anywhere in the image, you take sliding windows at multiple scales.

So you may have to take, say, 100,000 sliding windows from this image, and each of those windows is given as input to the CNN, assuming they are all of the same size or can be brought to the same size. The CNN then classifies each window of the original image as car or not car, and based on that you would detect a car here and here.
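A sketch of this brute-force procedure in PyTorch follows, assuming a trained binary classifier model that takes fixed-size crops and returns two logits (not car / car); the window size, stride and threshold are illustrative, and running it at several window sizes gives the multi-scale version.

```python
import torch
import torch.nn.functional as F

def sliding_window_detect(model, image, win=128, stride=32, size=64, thresh=0.9):
    """image: (3, H, W) tensor. Returns (x1, y1, x2, y2, score) for windows classified as 'car'."""
    detections = []
    _, H, W = image.shape
    model.eval()
    with torch.no_grad():
        for top in range(0, H - win + 1, stride):
            for left in range(0, W - win + 1, stride):
                crop = image[:, top:top + win, left:left + win].unsqueeze(0)
                crop = F.interpolate(crop, size=(size, size),
                                     mode='bilinear', align_corners=False)
                prob = torch.softmax(model(crop), dim=1)[0, 1].item()   # P(car)
                if prob > thresh:
                    detections.append((left, top, left + win, top + win, prob))
    return detections   # typically followed by non-maximum suppression
```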

(Refer Slide Time: 30:42)

Once again, in this approach you can include non-maximum suppression to improve the performance. As we said before, once you get the list of all possible bounding boxes where a car could be present, you take the bounding box with the highest class score; in this case, the red box, with score 0.93. Now you take the IoU of all the other bounding boxes with respect to this red bounding box; if any of them have a high overlap, a high IoU, with the red box, you eliminate them and retain only the red bounding box, because it had the highest confidence of a car being present.

(Refer Slide Time: 31:34)

Do you see problems with this approach? This is the simplest way of adapting CNNs for object detection, but there are a few issues. Firstly, the bounding boxes may not be tight around the object: you can see here that some of these bounding boxes are fairly loose, whereas we ideally want a tight bounding box so that we focus exactly on the part of the image that contains the object. That can happen with this kind of approach.

The other problem is obviously that you have to evaluate thousands of sliding windows through the CNN to find which of them contain your object of choice. We are going to talk about methods that overcome these issues.

(Refer Slide Time: 32:33)

But before we go there, we will talk about object localization, which, as we said, is a precursor to detection. If you have only a single object in an image, you can give the image as input to a CNN, get your feature maps and your fully connected network, and then branch the network into two parts: one part gives you a classification score, learnt using a cross-entropy loss (the standard CNN for classification we have spoken about so far), and the other performs bounding-box regression.

So, you try to predict the exact coordinates of the bounding box, say with respect to the entire image or as an offset with respect to another box in the image, and you learn that as part of your training procedure. Your fully connected network at the end has two heads, a classification head and a regression head. The classification head requires a cross-entropy loss; the regression head learns x and y (the top-left corner of the box), a height and a width: 4 coordinates, 4 values.

These values can be learnt using a normal L2 or mean squared error loss, so the entire network is trained with the sum of these two losses, to not only classify the image but also localize the object.

(Refer Slide Time: 34:16)

One of the first successes of a good CNN localizer was OverFeat, the winner of the ImageNet localization challenge. Every year, ImageNet, in addition to the classification challenge, also had a localization challenge, and OverFeat won it in 2013. The way OverFeat did localization was to make the entire network convolutional, instead of having fully connected layers. This avoids recomputation over sub-windows by applying the filters directly to the full image; let us see how.

(Refer Slide Time: 35:02)

So, given an input which is, say, 14 × 14 (or padded to 16 × 16), you perform a 5 × 5 convolution; this is only an example for visualization. Without padding you get a 10 × 10 output, and with the 2-pixel padding it becomes 12 × 12. Then a 2 × 2 pooling makes it 5 × 5, a further 5 × 5 convolution makes it 1 × 1, and now, using multiple 1 × 1 convolutions, you can get the output directly without having to perform any fully connected layer operations.

(Refer Slide Time: 35:45)

This idea made OverFeat's computations significantly faster, as you can see: instead of using fully connected layers, OverFeat replaces them with fully convolutional layers that end in 1 × 1 convolutions, similar to what we saw with InceptionNet (GoogLeNet), if you recall. Why is this useful? Because, as we said, fully connected layers use up a lot of parameters; using a fully convolutional approach overcomes this and hence can speed up performance.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 45
CNNs for Object Detection-I Part 2

(Refer Slide Time: 0:15)

Having seen how object detection was performed in the pre-deep-learning era, as well as how basic CNNs can be adapted to object detection, let us now move to how object detection is done today with contemporary methods. Existing methods, which are obviously built on top of CNNs, can be broadly divided into region proposal based methods and dense sampling based methods. We will cover region proposal based methods in the rest of this lecture and take up dense sampling methods in the next lecture.

In region proposal based methods, the framework is composed of two stages: in the first stage, a network or module proposes different regions that could contain objects, and in the second stage a classifier, in our case a CNN, takes those regions and classifies them as belonging to an object class or to the background. This is more robust in performance, but slow because of the two-stage approach.

On the other hand, dense sampling approaches are single-stage: the proposal of regions and the classification or detection of objects in those regions are all done in one shot through a dense sampling procedure. This is simpler and faster, but can sometimes fall short of the accuracy of region proposal methods.

(Refer Slide Time: 2:05)

Let us now discuss region proposal methods. The earliest one was known as R-CNN, the region-based CNN, proposed at CVPR 2014 by Ross Girshick et al. This approach has a first stage that uses a method known as selective search, an image processing method that does not use a neural network, to propose several regions that could contain objects. Each of these regions is then evaluated by a CNN, which performs classification as well as bounding-box regression to get the correct localization inside those region proposals.

Remember, when we talked about OverFeat, we said that we could divide the network into two heads; we will continue to use that for all the detection methods we will see.

(Refer Slide Time: 3:02)

How do you do the first-stage selective search? The selective search module uses a graph-based or mean-shift-based image segmentation to get an initial set of region hypotheses. These hypotheses are then hierarchically grouped based on region similarity measures such as colour, texture, size, region filling, and so on. If you used colour for grouping regions, you may end up with an object such as this; if you used texture, you may get a region such as this; and if you used region filling, you may end up with a region such as this.

(Refer Slide Time: 3:50)

In practice, you take an image and use an existing segmentation method (we covered this in the initial weeks) to get the various segments you see here; then you combine these segments using agglomerative clustering, if you recall, using colour cues, texture cues, region-filling cues or similar cues, and this gives us regions at a larger scale.

(Refer Slide Time: 4:24)

Now, how do we use these regions? Once you get these region proposals, this particular method selects about 2000 of them by combining and coalescing different sub-regions. Each of these region proposals is warped to the same size. Why do we need that? Because we are going to give them as input to the CNN, and remember, the CNN can only take a fixed input size; you cannot give it a 20 × 20 image and a 100 × 100 image.

A CNN can only accept one input size; that is the nature of the architecture we use. So, whatever the size of the region proposals, they are all warped to the same size, so there could be some stretching artefacts because of this. These proposals are then given to a CNN, which in this particular work was an AlexNet or a VGG; you get feature maps out of the CNN, which go through a fully connected network.

Then there are 3 components: a softmax component that classifies each of these regions into one of the known classes you are looking for while performing detection; an SVM, which is what is actually used for classification, as we will explain in a moment; and finally, the feature maps also go through a different fully connected layer to perform bounding-box regression. I will repeat that the bounding-box regression step here does not start from scratch: you already have a region, which is a part of the original image, so what the bounding-box regression module does is give you the offset from this region in the original image to the correct bounding box.

So, if this region was at a specific location of the original image, the bounding-box regression module may give values such as +1, −2, +3, −4, meaning that the top-left corner should be moved by one pixel further along the x-axis and by two pixels to the left along the y-axis, and that the height and width must be adjusted by the remaining two values, and so on. The bounding-box regression module predicts the offsets from the current location of the box that you have.

(Refer Slide Time: 7:19)

Now let us see the entire training procedure. In the first step, you consider all of these modules, the CNN, the feature maps and the fully connected network, and you fine-tune them on the warped region images using a softmax and a cross-entropy loss. We generally refer to the cross-entropy loss also as the log loss, because when you have multiple classes, the cross-entropy loss turns out to be simply the negative log of the probability of the ground-truth class. Why so? Because the cross-entropy loss is given by

−Σ_i y_i log(p_i) → −log(p_c)

where c is the ground-truth class.

Remember that your ground truth y_i is going to be one only for the correct class and 0 for all other classes, so you are left with just −log(p_c), with no summation; that is why we sometimes call the cross-entropy loss just the log loss when it is used in a multi-class classification setting. So, that is the first step, to tune the CNN: you initialise the CNN with the weights of an AlexNet or VGG model, and then further refine those weights using a log loss with a softmax in the last layer, through the fully connected network and the feature maps.

(Refer Slide Time: 8:50)

In the second step, you take the outputs of the fully connected network as representations of each of these regions and train class-specific SVMs. Depending on the number of classes in your detection problem (you could be detecting, say, up to 20 different kinds of objects, or more), you learn 20 different SVMs on the features obtained from the fully connected network, and the SVMs are then used to make the final decision about the class in each region.

And stage 3 is training the fully connected layer for bounding-box regression, where one uses the L2 loss, or the mean squared error loss; remember, they are synonymous.

(Refer Slide Time: 9:41)

What are the limitations of R-CNN? Firstly, it is a multi-stage training pipeline, as we just saw: you first train one head, then you take representations and train 20 different SVMs (or as many SVMs as there are classes), and then you also have to train a bounding-box regression head. Can we make this simpler? We will see in a minute. Beyond that, the extraction of region proposals requires a lot of time and space: remember, we have to do segmentation, coalesce the segments, and then pass each region proposal through the CNN to get the classification, bounding-box offsets, and so on.

A dataset such as PASCAL VOC, which has approximately 5000 images, required about two and a half GPU-days and several hundred GBs of storage, which can become impractical. And using R-CNN at test time, object detection took about 47 seconds, which is not really real time.

(Refer Slide Time: 10:57)

This led to the development of the Fast R-CNN approach, by the same authors in fact, published at ICCV 2015. A couple of different innovations were introduced here. One idea was that, while we still do selective search to get the region proposals, we map these region proposal locations directly onto the feature maps obtained after sending the full image through the CNN. If you go back and see how it was done earlier, each region proposal was fed individually to the CNN.

(Refer Slide Time: 11:44)

But now we are saying that we will feed only the entire image through the CNN. This helps
us because otherwise we have to do 2000 forward propagations through the CNN, where 2000 is
the number of region proposals. Now we just need to do one forward pass through the
CNN, for the entire image. Through selective search, we would get a lot of region
proposals, and because those region proposals have certain coordinates in the original
image, you can map those coordinates to the feature maps that you get as the output of the
CNN.

And once you get these feature maps you have two heads, one for classification where a log loss
is used, and one for bounding box regression where the L2 loss is used; there is no more need for
SVMs, which seems obvious in hindsight, but that is how this method was improved.

(Refer Slide Time: 12:40)

So now, how is the region of interest projected? We said that you take these region
proposals that are outputs of the selective search approach and map them onto the feature
maps; how would you do that? If this was one of your region proposals, let us assume that a
6 × 8 patch was a region proposal in your original image. Then remember that your CNN
follows a set of steps where the image, in all likelihood, gets downsized as you go through
each layer: if you perform convolution the image size reduces, if you perform
pooling it reduces further, and so on.

So, you take the same region proposal's dimensions, keep forward propagating them through the
same set of operations, and you will notice that this 6 × 8 patch maps to this 2 × 3
patch on the feature maps.
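
To make this mapping concrete, here is a minimal sketch (not from the lecture slides) of how one might project a proposal's image coordinates onto feature-map coordinates, assuming a single overall subsampling factor for the network (for example 16 for a VGG-style backbone); the function name and the floor/ceil rounding convention are illustrative assumptions.

import math

def project_roi(x1, y1, x2, y2, subsampling=16):
    # Map a proposal given in image pixel coordinates onto the coordinates of the
    # CNN's output feature map, assuming the network downsizes the input by a
    # single overall factor (an illustrative simplification).
    fx1 = int(math.floor(x1 / subsampling))
    fy1 = int(math.floor(y1 / subsampling))
    fx2 = int(math.ceil(x2 / subsampling))
    fy2 = int(math.ceil(y2 / subsampling))
    return fx1, fy1, fx2, fy2

# A 96 x 128 proposal on the image maps to a 6 x 8 region on the feature map.
print(project_roi(0, 0, 128, 96, subsampling=16))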

(Refer Slide Time: 13:45)

As a next step, after projecting the region proposals on the feature maps, we need to bring all
of these region proposals to the same size, for the same reason we mentioned earlier: the next layers
can only take inputs of the same size. To do this, fast R-CNN proposes a module known as
RoI pooling; RoI stands for region of interest. Again, the RoIs can be of different scales, so to
get a scale-invariant way of pooling, what RoI pooling does is, given any region which is say
of size h × w, it divides it into an H × W grid of subwindows, with each subwindow
of approximate size h/H × w/W.

So for example, if you had an 8 × 8 region and your target size was 2 × 2, you would
accordingly divide this 8 × 8 into a 2 × 2 grid of 4 × 4 subwindows and do pooling inside each of
these subwindows to get your output in the target location. So, your pooling here would happen in this
location: you would do say a max pooling and only the max value would go here, and so on and so
forth. So, if you had a 4 × 6 region of interest, you still want a 2 × 2 output.

So, you divide it this way now, pool across all of these 6 pixels within each subwindow, and that
value goes to the corresponding output location. So, that is how RoI pooling works.
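
As a rough illustration of this operation, the following NumPy sketch divides a region of interest into a fixed grid of subwindows and max pools within each one; the function name and the way boundaries are split are our own simplifying assumptions, not the exact fast R-CNN implementation.

import numpy as np

def roi_max_pool(roi, out_h=2, out_w=2):
    # Divide an h x w region of interest into an out_h x out_w grid of
    # sub-windows and take the maximum inside each sub-window.
    h, w = roi.shape
    out = np.zeros((out_h, out_w), dtype=roi.dtype)
    row_edges = np.linspace(0, h, out_h + 1).astype(int)
    col_edges = np.linspace(0, w, out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            sub = roi[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            out[i, j] = sub.max()
    return out

roi = np.arange(24).reshape(4, 6)      # a 4 x 6 region of interest
print(roi_max_pool(roi, 2, 2))         # pooled down to a fixed 2 x 2 output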

(Refer Slide Time: 15:19)

And once RoI pooling is done, you then send the result through further fully connected layers,
at the end of which you have two heads with two different loss functions. So you have a
multi-task loss: one task for classification and one task for regression. Say, if we assume that
u is the true class, v are the ground truth bounding box regression targets, p is the
predicted class probability distribution and t^u are the bounding box regression offsets predicted for the true class,
then the loss for each region of interest would be the classification loss plus your localisation
loss. By localisation loss, we mean the bounding box regression loss.

The classification loss, as we just said, is given by − log(p_u); the cross entropy loss
simplifies into a log loss. The localisation loss is prefixed by an indicator function, which
states that we are going to compute it only for the true class; for the other classes we do not
need to compute this kind of localisation, and that is what the indicator function dictates. What
is the localisation loss in fast R-CNN?

(Refer Slide Time: 16:49)

Earlier, we said we typically use the L2 loss, but the fast R-CNN method recommends the
use of a new loss known as the smooth L1 loss. Note that the L1 loss is the sum of the absolute
values of the pairwise differences between the predicted and target dimensions, and a smooth
version of this L1 loss is used instead of the L2 loss. Remember that the L1 loss, like any
absolute value function, looks somewhat like this:

f(x) = |x|

But this function is non-differentiable at 0, so to avoid that you have a smooth
variant of the L1 loss, known as the smooth L1 loss, which is given by 0.5x² if |x| < 1,
and |x| − 0.5 otherwise. So, when x comes close to 0 the function becomes quadratic, which
gives you a slightly smooth function, but otherwise it is the absolute value or L1 loss everywhere
else. Why smooth L1 loss? It turns out that smooth L1 loss is less sensitive to outliers than L2
loss. Why is this so? That is homework for you, and we will
discuss it in the next lecture.
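
For concreteness, here is a minimal NumPy sketch of the smooth L1 function described above, applied elementwise to the difference between predicted and target offsets; the function name is ours.

import numpy as np

def smooth_l1(x):
    # 0.5 * x^2 when |x| < 1 (quadratic near zero, hence differentiable at 0),
    # |x| - 0.5 otherwise (linear, so large errors are not squared).
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

print(smooth_l1([-3.0, -0.5, 0.0, 0.5, 3.0]))   # [2.5, 0.125, 0., 0.125, 2.5]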

(Refer Slide Time: 18:13)

So, to summarise fast R-CNN: given an image, you still run selective search, and you also
forward propagate the full image through the CNN, say an AlexNet or VGG, whose output
gives you a stack of feature maps like the output of any convolutional layer. Now you project
all the region proposals that you got through selective search onto these feature maps; this
avoids, as we said, multiple forward propagations through the convolutional layers. Then you
perform RoI pooling to get all those region proposals to the same size, and you send the result
through a fully connected network which has a softmax (classification) head and a
bounding box regression head.

(Refer Slide Time: 19:06)

With these improvements, fast R-CNN significantly outperformed R-CNN in training time
and test time. Fast R-CNN took about 8.75 hours for training while R-CNN took 84 hours,
which is roughly a 10x reduction. Importantly, at test time, while R-CNN took 47 to 49
seconds for doing detection on each image, this came down to 2.3 seconds for fast R-CNN,
and just 0.3 seconds if you exclude the region proposal finding step.

So, that gives us a clue on what needs to be done next: we have to find a way of
avoiding the process of getting region proposals using selective search, because of the 2.3
seconds, it looks like about 2 seconds is required just for that step.

(Refer Slide Time: 20:06)

This leads us to faster R-CNN. Faster R-CNN replaces selective search with a region proposal
network. So, instead of getting region proposals through offline segmentation methods,
we are going to see if this can also be done as part of the neural network itself;
this was proposed in NeurIPS of 2015. Let us see how this is done: given an image,
you forward propagate it through a CNN like AlexNet or VGG and you get a set of
feature maps.

Now these feature maps are provided to the region proposal network, whose job is to tell us
which regions are likely to have an object; it does not worry about classifying the object, it
only worries about the objectness of a region. These regions are then provided to an RoI
pooling layer; that part is very similar to fast R-CNN. And finally you have classification
and bounding box regression, just like we had with fast R-CNN. How does the region
proposal network work? It uses what are known as anchor boxes of different scales and
aspect ratios to identify multiple objects in a given window.

(Refer Slide Time: 21:33)

Let us see how that works. Consider a particular window, say any particular window in
an image. The network takes a set of different predefined anchor boxes: given a particular
location, you can define an anchor box given by this rectangle here, another anchor box given
by this rectangle here, and yet another given by this different rectangle here.

So, these are a set of predefined rectangles which, given the centre, can define several kinds
of regions around that centre. What does the region proposal network learn? It has its own
loss function to learn that part of the network; recall that in InceptionNet
we had certain loss functions at different stages of the network, and you could consider this to be
similar to that.

So here, the output expected of the region proposal network is: is any of these anchor boxes
an object? You are only looking for object versus not-an-object, so there are no further classes;
it is a binary classification problem. And you have k anchor boxes which, as I just mentioned, are
predefined; you have to define them when you start. So if you had 512 × 20 × 15 as
your image features, then you are going to have k × 20 × 15 as
the output.

And you would have 4k × 20 × 15 as the bounding box offsets for each of these anchor
regions. So, this part of the network does perform a bounding box offset regression, or fine-tuning
of the boxes, but it also checks whether a region is an object or not an object. Let us see then what happens.

(Refer Slide Time: 23:38)

So, you are going to have a fully convolutional layer which does objectness classification and
bounding box regression. In this case we are talking about, say, 9 anchor boxes, which is
what k was on the previous slide, but this could be different based on what one chooses to do
in a given setting.

(Refer Slide Time: 23:56)

So in the first step, you take the input image, pass it through a pre-trained CNN such as AlexNet,
take the feature maps, send them through the region proposal network, which is a further set of
convolutional layers, check for objectness and bounding box coordinates, and train this
region proposal part of the network; you can fine-tune the earlier weights too. This is step
1.

(Refer Slide Time: 24:24)

In the second step, you take the region proposals obtained from the region proposal
network and project them onto the feature maps. Those are going to be your regions; the rest of
it stays very similar to fast R-CNN, and you train all these weights using your classification
loss and bounding box regression loss, where now the classification loss tries to categorise a
region into any of, say, 20 objects, or however many objects you have in a given problem
setting. So, this is the second step, which is very similar to what is done in fast R-CNN.

(Refer Slide Time: 25:08)

In step 3, the CNN is made common between your detection network and the region proposal
network, so you share those parameters, and you train your RPN while freezing your CNN
parameters to help improve the RPN's performance.

(Refer Slide Time: 25:30)

And finally in step 4, you freeze everything else in the network and only fine-tune your FC
layers to get your final classification output and bounding box offset regression outputs.
This is the four-step training procedure proposed by faster R-CNN.

(Refer Slide Time: 25:50)

And using this approach, the test time reduces from 2.3 seconds to 0.2 seconds per image, which
comes closer to real time.

(Refer Slide Time: 26:04)

The homework for this lecture would be to go through the Viola-Jones object detection
framework and a very nice write-up called Object Detection for Dummies linked here; if
you would like to understand OverFeat, here is a link. There is also a nice video series on the evolution
of object detection models if you like. And here is the exercise that we left for you: smooth L1
loss is less sensitive to outliers than L2 loss, as was proposed in fast R-CNN; can you try to
find out why? In the next lecture, we will talk about dense sampling methods for object
detection.

(Refer Slide Time: 26:47)

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramaniam
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 46
CNNs for Object Detection II

We will continue now with dense sampling methods for object detection.

(Refer Slide Time: 0:23)

Before we go there, let us answer the exercise that we left behind from the last lecture: why is smooth L1 loss
less sensitive to outliers than L2 loss? When the deviation of the predicted output from the
ground truth is high, which happens when you have outliers, the squared loss or the L2 loss
exaggerates the deviation, which also exaggerates the gradient and could cause an
exploding gradient problem. This issue gets mitigated with the L1 loss; and because the L1 loss is not
differentiable at 0, we use a smooth version of it, known as the smooth L1 loss, which mitigates the
problems caused by outliers.

(Refer Slide Time: 1:10)

So, we said that contemporary methods can broadly be divided into Region Proposal methods
and Dense Sampling Methods. So, we will now go into Dense Sampling Methods which are
single stage detection frameworks.

(Refer Slide Time: 1:27)

The most popular one, which is still used today, is known as YOLO or You Only Look Once. The first
version of YOLO was published in CVPR of 2016. It is a single stage detector; you could say
that all these single stage detectors are loosely based on OverFeat, they all use only
convolutional layers, no fully connected layers, and speed along with good performance is the
main aim. So, the high-level pipeline is: you resize an image, you run a convolutional network,
you get a set of outputs, and then you do non-max suppression to identify your final bounding boxes.
Let us see it in more detail.

(Refer Slide Time: 2:16)

So, here is the overall flow. Given an input image, you first divide the image into an S × S grid,
and each grid cell is responsible for the object whose center falls inside it. So for example, if you
took this particular grid cell here, it would be responsible for this dog object and has to
predict those bounding box offsets. Each grid cell predicts B bounding boxes (B is another
hyperparameter) and confidence scores for each of these boxes.

So, you can see here that each grid cell can predict multiple bounding boxes with a confidence for
each of those bounding boxes, and each grid cell predicts just one probability per class, so a total
of C classes would mean C probabilities. You could consider that this is computing the
quantity given by the probability of class i given an object; that is a conditional
probability. The final predictions can be encoded as follows: you have an S × S grid, and in each grid
location you predict B bounding boxes. For each of those B bounding boxes you have 5 values.
What are those 5 values?

4 coordinates, which could be the center of the object and the width and the height, and a confidence
for each bounding box; that amounts to 5. Then you have C class probabilities per grid cell,
and that is how you get S × S × (B × 5 + C) as the total number of outputs that you
would have in your output layer before applying non-max suppression.

(Refer Slide Time: 4:14)

So, each bounding box gives 4 coordinates x, y, w, h; x, y are the coordinates representing the
center of the object relative to the grid cell, and w, h are the width and height of the object relative
to the whole image. The confidence score given for each bounding box reflects how confident
the model is that the bounding box contains an object and also how accurate the box is. So, you
could view the confidence as the probability of an object, any object; at this point you do not
know which class it belongs to, so it is just P(object) × IoU(ground truth, pred). So,
confidence takes into account both of these quantities.

(Refer Slide Time: 5:10)

And you have the conditional class probabilities for each class in each grid cell, regardless of the
number of boxes B predicted by each grid cell; YOLO predicts only C class probabilities for one
grid cell. So, you could assume that those C class probabilities are the conditionals given by the
probability of class i given an object. If you multiply these class conditionals with your confidences,
that is, the conditional class probability P(class i | object) times the confidence
P(object) × IoU(ground truth, pred), the product amounts to
P(class i) × IoU(ground truth, pred).

So, for each cell we are saying what the probability is that this class occurs in this grid cell and,
for the predicted box, how much IoU it has with the ground truth.

(Refer Slide Time: 6:19)

What loss function do you use to train YOLO? The loss function effectively combines the
several components you need to ensure that each of the quantities being predicted is
close to the ground truth. You can see here that there are multiple terms, so let me
first explain these indicators. The indicator 1_ij^obj indicates whether the jth bounding box predictor in cell
i is responsible for that prediction; remember, every cell predicts B bounding boxes, so i, j here
corresponds to the jth bounding box predicted by the ith cell. Similarly, 1_i^obj denotes whether an object appears in
cell i.

Now let us look at what each of these terms is doing. If you are looking at a particular
bounding box predicted by a particular cell, you are first trying to ensure that the x_i, y_i predicted
by the network are close to the expected x̂_i, ŷ_i. Then you are ensuring that the width and the height
predicted by the network are close to the original width and the height; these are taken with a square
root because you have a width and a height in two dimensions, so you take a square root here. Then
C_i and Ĉ_i are the confidence values.

So, if the confidence is low with respect to what the confidence should have been, which would
have been 1 if an object was there, you also try to ensure that that deviation is minimized, and all
these summations, as you can see, are done across all of your grid cells and across all of the
bounding boxes predicted by each of these grid cells. You do the same even
when an object is not present: you would want the confidence to match the expected confidence,
which in this case would be 0. So, these two terms are in a sense complementary, for
positive and negative classes.

And finally, the last term here denotes the conditional class probabilities matching the expected
conditional class probabilities. So, these are the different terms in the loss function used to train
the network.
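
For reference, the full YOLO v1 loss from the original paper, using the indicators described above, reads roughly as follows; λ_coord and λ_noobj are weighting hyperparameters (set to 5 and 0.5 in the paper).

\begin{aligned}
L &= \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &\quad + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &\quad + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
   + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
  &\quad + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}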

(Refer Slide Time: 8:44)

What are the limitations of such an approach? YOLO v1 had a few.
Firstly, it detects only a small number of objects per image. It misses objects that are small or close to each
other, because of the spatial constraints of the methodology itself and the limit on
how many overlapping objects each grid cell can represent. It also ended up having a high
localization error and a relatively low recall.

(Refer Slide Time: 9:21)

To overcome these issues, YOLO v2, which was proposed as an extension of YOLO v1,
introduced the idea of anchor boxes into the YOLO framework; these anchor boxes are similar to
what you saw with faster R-CNN. Let us see how these anchor boxes work. There are 5
coordinates predicted per anchor box: t_x, t_y, t_h, t_w and t_o. Let us see each of them. If the cell
is offset from the top left corner of the image by (c_x, c_y), and the anchor box prior
has width and height p_w and p_h, then the predictions corresponding to the anchor box are
given by b_x = σ(t_x) + c_x.

So, you can see here that σ(t_x) is right here; similarly b_y = σ(t_y) + c_y, which is the other
coordinate of the same location added to c_y. Then b_w and b_h are written in terms of scaling
multiples over p_w and p_h: b_w = p_w × e^{t_w} and b_h = p_h × e^{t_h}. So, t_w is a predicted quantity,
and b_w is given by asking what scaling factor should change the width (or the height, for b_h). It is similar to
predicting an offset, but the offset is a multiplicative factor for the width and height. And finally,
p(object) × IoU, which is the confidence quantity that we just saw for YOLO v1, is given by σ(t_o).

So, those are the different quantities predicted by the network, and they relate to the
actual bounding box that that grid location is trying to point to. Remember that each anchor box
predicts only these 5 values; b_x, b_y, b_w, b_h are the actual bounding box coordinates derived
from these values for the anchor box.
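
As a small illustrative sketch of this decoding (the function names are ours, and the example values are arbitrary raw network outputs), the relations above can be written as:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_yolo_v2(t_x, t_y, t_w, t_h, t_o, c_x, c_y, p_w, p_h):
    # Cell offset (c_x, c_y) and prior dimensions (p_w, p_h) turn the raw
    # network outputs into an actual box plus a confidence.
    b_x = sigmoid(t_x) + c_x        # sigmoid keeps the centre inside the cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)       # multiplicative scaling of the prior width
    b_h = p_h * math.exp(t_h)       # multiplicative scaling of the prior height
    confidence = sigmoid(t_o)       # P(object) x IoU(ground truth, prediction)
    return b_x, b_y, b_w, b_h, confidence

print(decode_yolo_v2(0.2, -0.1, 0.5, 0.3, 1.0, c_x=3, c_y=4, p_w=2.0, p_h=1.5))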

(Refer Slide Time: 11:42)

Moving on, there was another single stage detection method known as the Single Shot MultiBox
Detector, or SSD. SSD again uses an OverFeat-like methodology. You can see here that,
given an input image, the initial part of the network is a VGG-like network; from then on there are
convolution layers that keep reducing the size, and there are some skip connections
that take you from a convolutional layer directly to the classifier. Let us see them in a bit more
detail.

(Refer Slide Time: 12:18)

So you have multi-scale feature maps for detection, because you are sending the feature maps of
several of these convolutional layers directly to the output. The output layer thus receives feature
maps from convolutional layers at different scales; that is why they are called multi-scale feature maps.

(Refer Slide Time: 12:42)

Then we also see that for each of these convolutional layers there is a different set of
convolutional filters that connects it to the output layer. So, this initial layer here goes through a
3 × 3 × 4 × (classes + 4) convolution; that is the size of the convolutional filters, rather than the
number of outputs, and it gives the corresponding
output in the output layer, and so on and so forth for each of these intermediate convolutional
layers.

So, if you took any one of these feature maps, say a layer with a feature map of size m × n with
c channels, that would give m × n locations. Each of those pixel locations in one of those feature
maps could be the center of an object, for instance; each of them is like a grid cell, if you compare
this to YOLO, and the bounding box offset values are relative to that grid cell location. Remember,
you see here that each of these convolutional maps predicts classes + 4 values: a class probability
for each class plus 4 values for the bounding box offset, corresponding to each pixel location in
these convolutional feature maps.

(Refer Slide Time: 14:12)

So, feature maps get smaller and smaller as you go through later parts of the network. You can
see here the ground truth boxes in your original image, and then an 8 × 8 feature map followed by
a 4 × 4 feature map. In the 8 × 8 feature map, if you looked at the anchor boxes for each of these
grid cell locations, each of those anchor boxes would predict a set of values the way we talked
about on the previous slide. So, if you had k anchor boxes with different aspect ratios, as you can
see here, SSD would predict c class-specific scores plus 4 anchor box offsets; that is what we saw
on the earlier slide, where c + 4 is the number of channels that you had for each of your
convolutional layers when you connected them to the output layer.

So, for an m × n feature map, you would have a total of (c + 4) × k anchor boxes × m × n outputs,
because you are assuming now that each pixel can be the center of an object, you have k anchor
boxes around each of those pixels, and each anchor box predicts c class probabilities and 4
bounding box offsets. That is why, for each m × n feature map, you would have this many
outputs in the SSD framework.

(Refer Slide Time: 15:45)

What is the loss function you use here? Very similar to the loss functions we have seen so far,
SSD combines a localization loss function and a confidence loss function. The confidence
loss function compares x and c, where c are the predicted class probabilities and x
are the ground truth match indicators. The way the ground truth is written is that x_ij^p is 1 or 0
depending on whether the ith default box (default box and anchor box are synonymous here)
matches the jth ground truth box of category p. That is the notation for these x's, and l and g are
the predicted and ground truth box parameters. Let us see each of these loss functions in a bit more detail.

(Refer Slide Time: 16:45)

The localization loss for SSD is given by, once again, a smooth L1 loss. It also has a factor x_ij^k
at the beginning, indicating whether that box matches a ground truth of that class, because you
only want to evaluate this quantity when a match is involved; that is when this indicator turns
out to be 1. The l's are the predicted offsets and ĝ_j are the ground truth offsets, and these
ground truth offsets are given by

ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w

where d_i^cx is the center of the current anchor box. You are just ensuring that your ĝ_j represents
the correct offset with respect to the anchor box in question, and you are normalizing it with
respect to the width and the height for the x and y dimensions.

And this quantity here corresponds to the exponential factor that we were using to scale the width and the
height. We said that, instead of an additive offset, YOLO uses a multiplicative factor for the width
and the height, and because that had an exponential term, you use a log here to reverse
the operation. So that is going to be your ground truth scaling factor that you would want your
network to predict, and when you take an exponent of that l_i, you get rid of the log
and recover the expected width and height. The confidence loss, which is the
other loss in SSD, is a softmax loss over class confidence probabilities.

So, it is given by a first term over all your positive bounding boxes, which is your standard cross
entropy loss, and a second term over your negative bounding boxes, where there is a class label
corresponding to a background class whose probability you would want to maximize. So, c_i^0 here corresponds
to the class label known as background, which is considered one of the classes you would like to
predict, and ĉ_i is your standard softmax output.
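
Putting the ground-truth encoding together, here is a minimal sketch (function name and example numbers are ours) of how the regression targets for a matched (default box, ground truth box) pair are formed, following the ĝ definitions above; practical SSD implementations additionally divide these targets by so-called variance terms, which we omit here.

import math

def encode_ssd_offsets(g_cx, g_cy, g_w, g_h, d_cx, d_cy, d_w, d_h):
    # Regression targets for a matched (default box, ground truth box) pair:
    # centre offsets normalized by the default box size, and log scale factors
    # for width and height (the network's exponent later undoes the log).
    t_cx = (g_cx - d_cx) / d_w
    t_cy = (g_cy - d_cy) / d_h
    t_w = math.log(g_w / d_w)
    t_h = math.log(g_h / d_h)
    return t_cx, t_cy, t_w, t_h

# A ground truth box slightly shifted and larger than its default box.
print(encode_ssd_offsets(0.55, 0.50, 0.30, 0.40,
                         d_cx=0.50, d_cy=0.50, d_w=0.25, d_h=0.35))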

(Refer Slide Time: 19:07)

In practical implementations, both YOLO and SSD have the problem that most anchor
boxes are likely to be negative, when compared with something like faster R-CNN. To
counter this, you select the negative examples that have the highest confidence loss, such that you
maintain a ratio of negatives to positives of about 3:1. Otherwise, remember that even if
you had 100 objects in a single image, the number of anchor boxes and grid cells that you
have (in SSD it is in fact per pixel, with k different anchor boxes each) can be a huge
number, most of which are negative and have no object in them.

Learning can then get affected, which is the reason you do this hard negative mining: when you
train, you only select some of those negative boxes that have a very high confidence loss
and only use them in the loss function that we talked about; remember, the loss function
considered those negative boxes also. SSD also used a data augmentation strategy where, given
an original image, the original image itself was used, and patches were also randomly sampled from
images while trying to ensure that the minimum IoU with the actual object lies in a predefined set of
ranges, so that the network gets exposed to different kinds of patches from images and
different kinds of objects and their overlaps.

(Refer Slide Time: 20:52)

With this approach, SSD could outperform YOLO and faster R-CNN. All detection methods are
measured using an evaluation metric known as mAP, or mean Average Precision. Precision
refers to the standard precision metric used in machine learning methods; average precision is
the precision averaged over the predicted boxes (at different recall levels) for a class, and the
mean is then taken across all of your classes. One typically measures mAP at a particular IoU.
So, how do you confirm whether you have predicted a bounding box correctly or not?

You pre-specify a particular IoU such as 0.5 and say that, as long as my predicted box has at
least an IoU of 0.5 with my ground truth box, I am going to consider my prediction correct. So,
that is how correctness is defined to get your precision, and then you take the average across the
classes and the boxes that you have predicted. So, you can see here that SSD matched faster
R-CNN in its mAP but at a significantly higher FPS, a significantly higher frames-per-second
rate, which was the main objective: to make the single stage methods much faster in practice.

You can also see that the number of output boxes is significantly higher with SSD, obviously,
because you predict for every pixel, and they also showed that this works reasonably well
with different input resolutions.

(Refer Slide Time: 22:37)

A third single stage detector approach is known as the Feature Pyramid Network or FPN. FPN
starts from the observation that feature maps from the initial layers may not really be suitable
for detection, even though they are high resolution. When you go through a convolutional network, the initial
layers are high resolution and then, as you go deeper, the resolution gets lower and
lower; but although the initial layers are high resolution, we have seen from our
visualizations of CNNs that they may not really be capturing the semantics of objects in those initial
feature maps.

So, we are caught in a dilemma where the lower resolution feature maps have richer
features for detection, whereas the higher resolution is in the initial feature maps. How do we
bridge this gap is what the feature pyramid network tries to answer.

(Refer Slide Time: 23:41)

So, here is a visualization of how a Feature Pyramid Network attempts to bridge this gap.
If you were to use an image pyramid to do detection, one thing you could
do is subsample your images and, for each resolution of the image, construct a feature
map and predict from each of those feature maps. Or you could take your input image, construct
feature maps at lower and lower resolutions and finally predict at the lowest resolution. Or, given an
input image, you could construct feature maps at different resolutions as you build many convolutional
layers and predict at each of these resolutions.

What the feature pyramid network suggests is that you do construct feature maps at
different resolutions as you go through a convolutional network, but you then upsample to get
back feature maps at higher resolutions and make predictions there. This way, you capture your
semantics at the lowest resolution but upsample back to transfer those semantics to a
higher resolution.

(Refer Slide Time: 24:59)

Let us see how this is done in the architecture. This is the overall architecture: you have an
image, then a conv layer with stride 2, then 0.5x denotes a subsampling 2 × 2 max pool layer;
then you have a conv2 block at overall stride 4, a max pool, a conv3 block at stride 8, a max
pool, a conv4 block at stride 16, a max pool and a conv5 block at stride 32. You can consider
that to be like a ResNet backbone. In the FPN, the convolutional feature maps C1 to C5 are treated
with a 1 × 1 convolution with 256 channels, which you see on these arrows here, and M5 is
upsampled by a factor of 2 to get the next map M4.

But before you get M4, you take the signal from C4 after applying the 1 × 1 convolution and
combine this output of C4 with the upsampled M5 to get M4, and you similarly
continue to do this to get M3 and M2. Once you do this, a 3 × 3 convolution is applied on M4,
M3 and M2, and this is done to reduce the aliasing effect in the M's. Remember that we are
upsampling when we go from M5 to M4, M4 to M3 and M3 to M2, and upsampling, recall, we said
could result in aliasing.

So, to smoothen out those aliasing effects we use a 3 × 3 convolution, which takes us from M4,
M3, M2 to P4, P3, P2 and so on. Finally, you are left with all of these P's here, which are
provided to individual object detectors to get your final predictions.
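
A minimal PyTorch-style sketch of this top-down pathway is shown below; the channel sizes are illustrative (ResNet-like defaults) and the class name is ours, so treat it as a sketch of the idea rather than the exact FPN implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    # A minimal top-down pathway: 1x1 lateral convolutions on the backbone maps,
    # upsample-and-add to form the M maps, then 3x3 convolutions to get the P maps.
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):            # feats = [C2, C3, C4, C5], high to low resolution
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        merged = [laterals[-1]]           # M5 is just the lateral of C5
        for lat in reversed(laterals[:-1]):
            up = F.interpolate(merged[-1], scale_factor=2, mode='nearest')
            merged.append(lat + up)       # M4 = lateral(C4) + upsample(M5), and so on
        merged = merged[::-1]             # back to [M2, M3, M4, M5]
        return [s(m) for s, m in zip(self.smooth, merged)]   # [P2, P3, P4, P5]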

(Refer Slide Time: 27:03)

A more recent approach to object detection focused on the loss function that is used to train these
object detectors. This was known as RetinaNet, and the loss function that was proposed is
known as the focal loss, proposed in ICCV of 2017. It relies on the intuition that
two stage detectors were known to be more accurate than one stage detectors; one stage
detectors were obviously faster, and the work surmised that two stage detectors are more accurate
because of the lesser class imbalance between background (negative) regions and object-containing
(positive) proposals.

Remember, when you do selective search we restricted ourselves to only about 2000 region proposals,
whereas in one stage detectors you could be dealing with around 100,000 regions, because you could be
getting multiple anchor boxes around each grid cell; in SSD it is on each pixel of the feature map,
but even with YOLO you may be predicting B bounding boxes for each of those S × S grid cells,
which could be a very high number.

(Refer Slide Time: 28:33)

Whenever you have such an issue, where there are many negative examples which really do
not help the model and there are only a few positive examples, certain things could
happen. Training could become inefficient, because the easy negatives are not really giving any
useful signal; there could be so many kinds of easy negatives that they are not really going to help the
model learn how to distinguish the positives from the negatives, because the class of negatives,
the background, is extremely vast in detection: the background could be sky, grass, buildings,
any of those, and all of them are still background.

Secondly, the loss could get overwhelmed by the negatives instead of the positives, and
this could degenerate the training process and lead to degenerate models. To some extent, the hard
negative mining that we spoke about in SSD, where we ensure that the final loss only uses negatives
with a significant confidence loss and maintains a ratio of 3:1 between negatives and
positives, does alleviate these issues, but the problem still remains.

(Refer Slide Time: 29:53)

So, what this particular paper proposes is that the cross entropy loss used for the classification
branch of detection could be inherently bad. Let us try to see why. Remember that the cross
entropy loss is defined by − log(p); in the multi-class setting, it just turns out to be − log(p_t),
the log loss as we mentioned. In their empirical studies they observed this graph here: notice the
curve for γ = 0. When γ is 0, the coefficient in the loss introduced by them, known as the focal loss,
turns out to be 1.

And the loss then just becomes your standard log loss or cross entropy loss; so when γ is
equal to 0, the blue curve is your standard cross entropy loss. You can see here that even when the
network predicts a high probability for the ground truth class, the loss value is fairly non-trivial;
you get a fairly high loss even when the model is predicting a high probability for the correct
class, and this can defeat the purpose of learning.

(Refer Slide Time: 31:17)

So, what RetinaNet with focal loss proposes is: one, you could do a balanced cross entropy,
where the − log(p_t), the log loss, is weighted by some quantity α, and α can be given by, say,
the inverse class frequency. That is one way to do this, but the RetinaNet method also proposes a
focal loss which considers the predicted probability itself to fine-tune the loss. So you weight
your log loss with (1 − p_t)^γ, where γ is a tunable focusing parameter, a hyperparameter
that you have to provide while training the network, and the final focal loss is given by
−(1 − p_t)^γ log(p_t).

If you observe here, let us assume γ is a certain value such as 5. When p_t is high, the factor
(1 − p_t)^γ becomes low and you bring down the overall loss, because when p_t is high you
want the loss to be low. And when p_t is low, let us say 0.1, then even with γ equal to 5 this
factor remains a reasonable quantity and the loss is maintained at a high level
when your predicted probability p_t for the ground truth class is low. That is the main
idea of the focal loss.
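
A minimal NumPy sketch of the focal loss as described above; the function name and the example values are ours, and the paper typically uses γ = 2 together with an α-balancing term.

import numpy as np

def focal_loss(p_t, gamma=2.0, alpha=1.0):
    # -(1 - p_t)^gamma * log(p_t), optionally weighted by a class-balancing alpha.
    # p_t is the probability the model assigns to the ground truth class.
    p_t = np.asarray(p_t, dtype=float)
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# Well-classified examples (p_t close to 1) are strongly down-weighted
# compared to the plain cross entropy -log(p_t).
for p in (0.1, 0.5, 0.9):
    print(p, -np.log(p), focal_loss(p, gamma=2.0))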

(Refer Slide Time: 32:48)

The RetinaNet architecture otherwise uses the FPN, the feature pyramid network that we spoke
about, along with the focal loss. You can see here that the first part of it is the feature pyramid
network itself; then, for each of these scales, you have a classification subnetwork and a bounding
box regression subnetwork, and the classification subnetwork is trained using the focal loss.

(Refer Slide Time: 33:17)

There are implementations of all of these contemporary detection methods, both dense
sampling methods and region proposal based methods, in a popular library known as
Detectron. Detectron was provided by Facebook AI Research especially to promote the usage of
object detection algorithms. So, if you are interested in implementing any of these object
detection algorithms for any of your projects, you can look at Detectron or Detectron2 for further
details.

(Refer Slide Time: 33:50)

So, your readings are a continuation of Object Detection for Dummies, this time Part 4 for the
dense sampling methods. Here is a tutorial of the entire YOLO family of methods and tutorials
on understanding SSD, FPN and RetinaNet.

(Refer Slide Time: 34:11)

A few exercises to leave behind: we only covered YOLO v1 and YOLO v2 in this lecture. YOLO
also had YOLO9000, which talked about scaling YOLO to 9000 categories, and YOLO v3,
which was very close to YOLO9000 in its ideas. How these differ from YOLO v2 is
going to be homework for you. Please do read the link that was given in the reading section in
the previous slide for understanding YOLO. And a couple of simpler problems: given two
bounding boxes in an image, an upper left box which is 2 × 2 and a lower right box which is
2 × 3, with an overlapping region of 1 × 1, what is the IoU between the two boxes?

To understand YOLO better, consider using YOLO on a 19 × 19 grid on a detection problem
with 20 classes and 5 anchor boxes. During training you have to construct an output volume as
the target value for the neural network. What would be the dimension of this output volume?
Remember that in YOLO we said S × S × (5B + C). Try to use that formula here to find out
what the output volume should be for this particular YOLO object detector. Please do these
exercises and we will continue the discussion in the next lecture.

(Refer Slide Time: 35:52)

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramaniam
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 47
CNNs for Segmentation

Having seen detection, we now move on to segmentation of images using CNN architectures.

(Refer Slide Time: 0:26)

We will quickly review the exercise from the previous lecture. Given two bounding boxes in an
image, an upper left box which is 2 × 2, a lower right box which is 2 × 3, and an overlapping
region of 1 × 1, what is the IoU? This should be a simple one. You have an upper left box
which is 2 × 2, and a lower right box which is 2 × 3, with a 1 × 1 overlap. So, the
total number of cells in the union is 4 + 6 − 1 = 9, of which the intersection is one cell. So, your overall IoU
would be 1/9.
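
For completeness, here is a small sketch that computes IoU for axis-aligned boxes in (x1, y1, x2, y2) form; the coordinates in the example are one way (an assumption on our part) of placing a 2 × 2 and a 2 × 3 box so that they overlap in a single 1 × 1 cell, reproducing the 1/9 answer.

def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# A 2x2 upper-left box and a 2x3 lower-right box overlapping in a 1x1 cell.
print(iou((0, 0, 2, 2), (1, 1, 4, 3)))   # 1 / (4 + 6 - 1) = 1/9 ≈ 0.111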

And the second question: consider using YOLO on a 19 × 19 grid, with 20 classes and five anchor
boxes. For the output volume, remember we said it is
S × S × (5B + C), where B is the number of anchor boxes and C the number of classes; here that
gives 19 × 19 × (5 × 5 + 20) = 19 × 19 × 45, and that is the final answer you get. We assume here
that C, the number of classes, already accounts for the background. If you want to have background
as a separate class, then you will have to take C to be 21.

(Refer Slide Time: 1:47)

So, let us recall image segmentation in the first place. Early on in the lectures, we talked about
different image segmentation methods such as Watershed, Graph Cut, Normalized Cut, Mean
Shift, and so on. To an extent, these methods inspired early versions of CNN
architectures for tasks such as detection and segmentation. In fact, R-CNN, which uses selective
search to get region proposals, used a min-cut segmentation method known as CPMC
(Constrained Parametric Min Cuts) to generate the region proposals. But we will now move on to
deep neural networks for segmentation.

(Refer Slide Time: 2:36)

So, the task at hand for us to start with is going to be semantic segmentation, where we would
like to assign a class label to every pixel in the image. How do you solve this problem using
neural networks, in particular convolutional neural networks? We are going to cast this as a
pixel-wise classification problem. So far we spoke about image level classification, where we
had only a label at the level of an image; then we went to detection, where we had labels at the
level of bounding boxes, and we not only gave a class label but also an offset to the bounding
box to make the final prediction.

But in semantic segmentation, we are going to classify at the level of every pixel. So, for each
image, each pixel has to be labelled with its semantic category. You can imagine that the creation of
a dataset for semantic segmentation is very annotation intensive. There have been various
architectures in recent years, such as FCN, SegNet, U-Net, PSPNet, DeepLab and Mask R-CNN.
We will cover each of them in this lecture.

(Refer Slide Time: 3:58)

Starting with FCNs, or Fully Convolutional Networks for semantic segmentation, which were
proposed in 2015-16: they adapted various classification networks such as VGGNet,
GoogleNet and so on into fully convolutional networks by converting the FC layers into
1 × 1 convolutional layers. We saw the same idea with single shot detection methods, because
for semantic segmentation you still have to classify at the level of a pixel, so you do not want to
move away from convolutional layers.

What do we do to obtain a classification for each pixel? We can have a 1 × 1 convolutional layer at
the end of any CNN and have C+1 channels in that final output volume, where C is the number
of classes and the +1 is for, say, a background class. But do you see any problem with this
approach? There is a problem when you use convolutional layers, which is that all standard CNNs
keep downsampling as you go further down the layers, be it through convolution or through
pooling, which means your output feature maps are not going to be the same size as your input.

But we finally want to give a pixel level label at the input resolution, the input image's
resolution. So, what do we do? We already know the answer: we can upsample after a certain
stage. So, you have convolutional layers, and after a certain stage you upsample and bring the
feature maps to the same dimensions as your input image. How do we upsample? One of the
methods that we have discussed before is transpose convolution.

(Refer Slide Time: 6:00)

Let us quickly recall transpose convolution. This was the one-dimensional example we spoke
about when we discussed convolutional neural networks, where your output can be larger than your
input.

(Refer Slide Time: 6:15)

And here is the illustration that we gave for transpose convolution, where you have a dilation in
your convolution when you take the filter and apply it on an input image. Because there is a
dilation, transpose convolution is also known as dilated convolution, or it is also sometimes
referred to as Atrous convolution, as we will see a little later, because it dilates when you perform
the convolution.

(Refer Slide Time: 6:51)

So, in FCNs, this is the idea used to upsample after a certain level. Let us see an FCN
architecture with a VGG backbone. If you have a VGG-16 backbone, the first thing done in FCN
is to remove all the fully connected layers, so what you have now are only the
convolutional layers of VGG-16; you can see all the standard feature maps of VGG-16. The
3 FC layers are replaced with 1 × 1 convolutional layers, and the last of these convolutional
layers has C+1 filters.

So, you can see here that the last one has 21 filters; this was applied on a dataset that
had 20 classes, so that output was 21 channels deep. However, you notice that while you have 21
channels, the output is still smaller in size when compared to the input resolution. So, you finally
have one upsampling layer, which is done using transpose convolution, to get back to the original
size. This is what was done in fully convolutional networks. However, is transpose convolution
the only way to obtain the upsampling? Not necessarily; we can do several things, including skip
connections, which we will see later.
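
A minimal PyTorch-style sketch of this idea, with illustrative shapes (assuming a 512-channel feature map at roughly 1/16 of the input resolution, which is about what a VGG-16 backbone would give for a 224 × 224 input):

import torch
import torch.nn as nn

num_classes = 21                          # 20 classes + 1 background, as in the example above
features = torch.randn(1, 512, 14, 14)    # illustrative backbone feature map for a 224x224 input

score = nn.Conv2d(512, num_classes, kernel_size=1)          # per-location class scores
upsample = nn.ConvTranspose2d(num_classes, num_classes,
                              kernel_size=32, stride=16, padding=8)   # x16 upsampling

logits = upsample(score(features))
print(logits.shape)                       # torch.Size([1, 21, 224, 224]) - one score map per class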

(Refer Slide Time: 8:19)

But before we go there, we will talk about another method that came around the same time as
FCN, known as SegNet. SegNet is what you see here in the image: you have an input image,
you have an output, and the architecture is a fully convolutional encoder-decoder architecture.
How is this different from FCN? It is very similar to FCN: the encoder is VGG-16
without the FC layers, and the decoder maps the low resolution encoder feature maps back to the
input resolution. Here is where it differs from FCN.

FCN had only one upsampling layer towards the end, whereas SegNet has a sequence of decoder
layers which mirror the architecture of your encoder; the encoder and decoder architectures
are mirrors of each other. The important contribution in SegNet was to avoid transpose
convolution in upsampling. So, how does SegNet do the upsampling in that case? It uses
the pooling indices that were obtained in the encoder layers to upsample.

(Refer Slide Time: 9:40)

Let us see how this is done. If you had an output map from a particular decoder stage, we
use the max pool indices from the corresponding encoder stage. Let us assume the max pool indices were
(0,0), (1,0), (1,0), (0,1). Then a would go to the (0,0) location of its block, which is the top left; b would go
to the (1,0) location, the bottom left; c would go to the (1,0) location, the bottom left; and d would
go to (0,1), the top right. Now this becomes an upsampled version of the previous layer's output
map in the decoder, and one can continue to do this to get your final upsampled output.

What do we mean by max pool indices? In your encoder, whenever you had a pooling layer, you
keep track of which entries in each 2 × 2 patch or 3 × 3 patch (depending on what pooling you
are doing) won the max; you store these winning indices in a separate memory buffer, and we
are going to use those max pool indices to upsample back to higher resolutions. Why is this good?
Storing max pool indices is more efficient than storing the full feature maps of the encoder. And in
each stage of the decoder, the max pooling indices of the corresponding encoder stage are used to
produce the upsampled feature maps.
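
As a small illustration of this mechanism, PyTorch's pooling layers can return the winning indices, which can then be used to unpool; this is only a sketch of the idea, not the SegNet code itself.

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)   # encoder side
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)                    # decoder side

x = torch.arange(16.0).reshape(1, 1, 4, 4)
pooled, indices = pool(x)          # keep only the winning locations ("max pool indices")
upsampled = unpool(pooled, indices)
print(pooled.squeeze())            # the 2x2 pooled map
print(upsampled.squeeze())         # 4x4 map: pooled values placed back at their winning
                                   # locations, zeros (later filled by convolutions) elsewhere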

So, if you see the complete image, the pooling indices of this encoder feature map go to this
decoder, this encoder's feature map goes to this decoder, and so on and so forth. What is the loss
function? We have seen a couple of architectures now, FCN and SegNet; what would one
use as a loss function to train such networks? Simple: it is the sum of pixel-wise cross entropy
losses. Remember, in both FCN and SegNet, your output is going to be a volume whose spatial
size is the input resolution, but the number of channels would be the number of classes + 1
for background.

So, for each pixel you actually have a vector of probabilities in the output volume.
Now you can compute the pixel level cross entropy for each pixel and sum all of them up,
and that becomes your loss function, which, because it is a sum of cross entropies, remains
differentiable.
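
A minimal PyTorch-style sketch of this pixel-wise loss; the shapes are illustrative assumptions.

import torch
import torch.nn as nn

num_classes = 21
logits = torch.randn(2, num_classes, 64, 64)          # network output: one score per class per pixel
target = torch.randint(0, num_classes, (2, 64, 64))   # ground truth class index per pixel

# CrossEntropyLoss applied to (N, C, H, W) logits against (N, H, W) labels
# computes the per-pixel cross entropy and averages (or sums) it.
loss = nn.CrossEntropyLoss(reduction='mean')(logits, target)
print(loss)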

(Refer Slide Time: 12:23)

A third kind of architecture, which also came in the 2015-16 time frame, is known as U-Net.
U-Net is also a fully convolutional encoder-decoder architecture, but it introduces the concept of
skip connections in a way different from SegNet. The contracting path, that is the encoder path,
is an existing classification network with the FC layers removed, similar to SegNet and FCN. But in
the expanding path, in addition to upsampling the feature maps, the network also receives the
corresponding feature maps of the contracting path, followed by further convolutional layers.

Finally, at the end, you have a 1 × 1 convolution with C+1 channels, C being the number
of classes. The upsampling in U-Net is performed using transpose convolution: a 2 × 2 transpose
convolution with stride 2 and padding 0, and the number of feature channels is halved in each
upsampling step. This means the spatial dimensions of the feature map double with the 2 × 2
transpose convolution while the channels are halved. In the original U-Net architecture, unpadded
convolutions are performed, which means the output segmentation is smaller than the input by a
constant border width.

(Refer Slide Time: 14:00)

Let us look at the visual. Here is the U-Net, and its shape tells you why it is called the U-Net
architecture. You can once again see that the contracting path is similar to any other architecture
without the FC layers. In the expanding path, in addition to doing a transpose convolution at
each step, you also receive the corresponding feature maps of your contracting path, which are
concatenated with the feature maps in your expanding path. That provides localization information from the encoder to each
step of the decoder, beyond just upsampling. So, you get additional information from the
corresponding layer of the encoder rather than relying only on upsampling as in SegNet or FCN.

(Refer Slide Time: 14:57)

U-Net was originally proposed for biomedical image segmentation; it was, in fact, published in a
medical imaging conference, MICCAI. How do you think this can affect the learning of the U-Net model?
Biomedical image segmentation has some unique challenges. Firstly, you often do not have
training datasets of the order of 1 million or even tens of thousands of images; very few training
images may be available. So, this particular work employed extensive data augmentation by
applying elastic deformations to the available training images, which was one of its contributions.

Another challenge in the medical domain is that there could be many objects of the same class
touching each other, because similar objects are often placed next to
each other. So, U-Net also proposes a loss function which penalises pixels closer to edges more
than pixels away from edges; edges are extremely critical for medical image segmentation. Over
the years, there have been several variants of U-Net, mostly changing the architecture, for example by
replacing the convolutional blocks with dense blocks, increasing the depth of the U-Net, and so on.

(Refer Slide Time: 16:30)

Another network used for segmentation is known as PSPNet. PSPNet starts by
asking the question: what challenges could you face if you applied FCNs to complex scenes? If
you think deeper, you would notice that FCNs do not learn patterns that are co-occurrent; for
example, we know that cars are on roads while boats are on rivers. FCNs can also
predict parts of an object as different categories; for example, they can look at a skyscraper and
think its pixels belong to different buildings because of the sheer scale of that object.

Thirdly, FCNs could fail to detect large objects or small objects, depending on the architecture
and the receptive field it employs. So, how can we address some of these
constraints? PSPNet hypothesizes that one should use global context information to improve
segmentation performance. How do you get global context information? Segment at multiple
scales: when you segment at a lower resolution, you are actually trying to bring more global
information into the segmentation result.

(Refer Slide Time: 18:15)

So, PSPNet introduces a pyramid pooling module where, given an image, you forward propagate
it through a CNN, and then there is a pooling layer that pools at multiple scales. What are the
different scales? There is a coarse scale which is simply global average pooling; this will be
1 × 1 × C, where C is the number of channels. Each successive pooling level gives
increasingly localized information. So, the coarsest pooled output is 1 × 1 × C, the global average
pooling: for each of the C channels in the convolutional output feature map, you take that channel,
global average pool it and get one scalar, so across the C channels you get 1 × 1 × C. Similarly
you get 2 × 2 × C, 3 × 3 × C and 6 × 6 × C; those are the 4 scales that PSPNet employs.
Remember that when you do pooling, the number of channels typically does not reduce; you
operate channel-wise.

(Refer Slide Time: 19:39)

So, once pooling at these 4 scales is done, PSPNet applies a 1 × 1 convolution on each of the
maps that came from the pooling operations to reduce the number of channels in the output. By
how much does each of these 1 × 1 convolutions reduce the channel count? To one fourth. There
is a reason for that: once these 1 × 1 convolutions are performed, all of these maps are upsampled
using simple bilinear interpolation and then concatenated, and the pooled branches of this
concatenated volume together contribute the same number of channels as the original feature map.
Once this is done, the result is sent through a convolutional layer to get your final prediction of
segmentation.
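
A minimal PyTorch-style sketch of such a pyramid pooling module is given below; the class name, channel counts and bin sizes (1, 2, 3 and 6, as described above) are written as assumptions for illustration rather than the exact PSPNet code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    # Pool the feature map at bin sizes 1, 2, 3 and 6, reduce each pooled map to
    # C/4 channels with a 1x1 convolution, upsample back and concatenate with the input.
    def __init__(self, channels=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(channels, channels // len(bins), kernel_size=1))
            for b in bins])

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode='bilinear', align_corners=False)
                  for stage in self.stages]
        # The four pooled branches together contribute C channels (matching the
        # original map), concatenated on top of the input feature map.
        return torch.cat([x] + pooled, dim=1)

feat = torch.randn(1, 2048, 60, 60)
print(PyramidPooling()(feat).shape)       # torch.Size([1, 4096, 60, 60])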

(Refer Slide Time: 20:46)

Another popular method for semantic segmentation, similar to PSPNet, is known as DeepLab.
In this case, the image goes through several blocks of convolution, and then, instead of having a
pooling layer that splits the image into 4 different scales, it performs Atrous Spatial Pyramid
Pooling; this module is known as the ASPP module. Remember, Atrous convolution is the same as
dilated convolution or transposed convolution, sometimes even called fractionally strided
convolution, where your stride is less than one. This gives you the same effect as performing
pooling at multiple scales. So, Atrous spatial pyramid pooling, you could say, is a different way of
implementing the pyramid pooling module of PSPNet. Once you do your Atrous spatial
pyramid pooling, you once again upsample, concatenate, and finally compute your result on that
concatenated volume.

(Refer Slide Time: 22:00)

DeepLab has gone through a few different versions over the years, and one of the more recent
versions, known as DeepLab V3, adds a few modules to the basic DeepLab architecture. In this
DeepLab V3 module, image level features are also passed on to the ASPP module. There is batch
normalization which is used for easier training. And there is also an entire decoder architecture
which is used to refine the segmentation results. So, you can see here that you have the ASPP
module using which you perform 1 × 1 convolution, 3 × 3 convolution, so on and so forth.

You get a set of feature maps which are concatenated. 1 × 1 convolution is done to reduce the
depth of those channels and that is then upsampled by 4 and passed on to the decoder along with
the output of the ASPP module. These are then concatenated and you finally have one more
convolution at the end and then upsample and make your predictions. Remember that when we say prediction, it is again going to be an output volume whose resolution is the same as the input image, and whose number of channels is C+1, where C is the number of classes and the +1 is for background.
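As a rough illustration of the ASPP idea described above, here is a minimal PyTorch sketch with parallel dilated convolutions plus an image-level pooling branch whose outputs are concatenated and reduced with a 1 × 1 convolution. The dilation rates and channel counts are assumptions, not the DeepLab release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Minimal sketch of an ASPP block (assumed rates/channels; not the official DeepLab code)."""
    def __init__(self, in_channels, out_channels=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, kernel_size=1)] +            # 1x1 branch
            [nn.Conv2d(in_channels, out_channels, kernel_size=3,
                       padding=r, dilation=r) for r in rates]                  # dilated 3x3 branches
        )
        self.image_pool = nn.Sequential(                                       # image-level features
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, kernel_size=1)
        )
        self.project = nn.Conv2d(out_channels * (len(rates) + 2), out_channels, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2], x.shape[3]
        outs = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        outs.append(pooled)
        return self.project(torch.cat(outs, dim=1))   # concatenate, then 1x1 conv to reduce depth

aspp = ASPP(in_channels=2048)
print(aspp(torch.randn(1, 2048, 60, 60)).shape)        # torch.Size([1, 256, 60, 60])
```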

(Refer Slide Time: 23:34)

Moving on from semantic segmentation, instance segmentation is a slightly more challenging task than semantic segmentation.

(Refer Slide Time: 23:52)

In semantic instance segmentation, or what is popularly known as instance segmentation, our goal is to not only give a class label for every pixel, but to also assign an object ID for each of the objects. So, if there are multiple people or multiple chairs or multiple dogs, you will want to also separate the instances of the people and the instances of the dogs, which the basic semantic segmentation task did not consider. In semantic segmentation, a dog would be a dog
pixel whether there were 2 dogs or 3 dogs or 4 dogs, all of them would just be classified as dog.
But in instance segmentation we want dog 1, dog 2, dog 3 and dog 4. And the challenge here is that you may not know the number of dogs in advance, which is similar to the detection problem.

This is why one of the most popular approaches for instance segmentation, known as Mask R-CNN, actually tries to improve upon Faster R-CNN to achieve instance segmentation. Mask R-CNN was proposed in ICCV 2017. It uses a Faster R-CNN-like architecture but adds a branch to also predict a mask and get instance-level segmentation. So, you could look at Mask R-CNN as Faster R-CNN with an FCN on the regions of interest; it adds a parallel head. Along with your classification module and your bounding box offset regressor, you now add a mask predictor as a third head.

(Refer Slide Time: 25:42)

An important contribution in mask R-CNN to be able to achieve the mask accurately was a
module known as RoIAlign. Since Mask R-CNN operates in the framework of Faster R-CNN, the main limitation that this module addresses arises when you map object proposals to feature space: recall that in both Fast R-CNN and Faster R-CNN, you had certain object proposals and you mapped them to a certain feature space.

And feature space could be of a different resolution than the resolution in which you obtained
your object proposals. In both Fast R-CNN and faster R-CNN, we simply warp it to the size
required and then bring your object proposals to that dimension. However, when you do
something like this, you could face errors due to quantization. Why does this matter? This
matters because one pixel in feature space could be equivalent to many pixels on an image
because the feature space is smaller in later convolutional layers when compared to the initial
input.

When you try to get a mask, these few pixels can bring a significant error in the final
performance. So, how do we overcome this problem? RoIAlign performs bilinear interpolation to get exact values at the locations where the object proposals map onto a given feature map. This preserves the translation equivariance of masks.

(Refer Slide Time: 27:48)

What do we mean? We mean that given an image where an object could be located in this yellow
box or red box, both could be valid detections. But the mask position in each of these bounding
boxes is different. So, by translation equivariance, we mean that if the object has moved by a certain amount inside the bounding box, the mask also has to move by the same amount. And this can be achieved by the RoIAlign layer.

(Refer Slide Time: 28:23)

So, the way RoIAlign does this is: when you convert your object proposals to locations on your feature map, if a location turns out to be in between a few pixels on the feature map, you perform bilinear interpolation of the neighbouring pixels and thereby get the value at the specific point indicated by the object proposal. That is what is then pooled to get your fixed-dimensional representation, which goes to the later layers of your Faster R-CNN architecture.
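To illustrate the interpolation step, here is a small sketch of bilinear sampling at a fractional coordinate, which is the operation performed at each sampling point inside an RoI bin. This is a simplified illustration of the idea only, not the full RoIAlign operator.

```python
import math
import torch

def bilinear_sample(fmap, y, x):
    """Sample a (C, H, W) feature map at a fractional location (y, x) by weighting the
    four surrounding pixels (a simplified sketch; no boundary handling)."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = y0 + 1, x0 + 1
    wy1, wx1 = y - y0, x - x0            # fractional offsets
    wy0, wx0 = 1.0 - wy1, 1.0 - wx1
    return (wy0 * wx0 * fmap[:, y0, x0] + wy0 * wx1 * fmap[:, y0, x1] +
            wy1 * wx0 * fmap[:, y1, x0] + wy1 * wx1 * fmap[:, y1, x1])

fmap = torch.randn(256, 50, 50)          # e.g., a backbone feature map
val = bilinear_sample(fmap, 12.3, 7.8)   # value at a non-integer proposal coordinate
print(val.shape)                          # torch.Size([256])
```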

(Refer Slide Time: 29:08)

So, in addition to your standard Faster R-CNN architecture, with the classification loss and a bounding box loss, we also have a mask head in Mask R-CNN. What Mask R-CNN does is use an FCN kind of a branch, which means it is fully convolutional. For each RoI the mask size is 28 × 28, as you can see here. It is finally rescaled to the bounding box size and then overlaid on the image during inference. So, this mask, as you can see here, is obtained for every RoI and then finally overlaid back on the input image at inference time.

(Refer Slide Time: 29:55)

A last and more recent form of segmentation that is popular is known as panoptic segmentation.
Panoptic segmentation combines semantic segmentation and instance segmentation. Semantic segmentation, as you can see, is where you segment the pixels of all objects, but all instances of a class get the same label. Instance segmentation is where you only care about the instances of the objects in your categories, and you label each instance separately.

And panoptic segmentation is the union of the two where you label every pixel as belonging to
different objects. And you also ensure that each instance of an object gets a different label
altogether.

(Refer Slide Time: 30:52)

So, how do you perform panoptic segmentation? So, this work was first introduced in CVPR of
2019, about a year old. One could achieve panoptic segmentation using a couple of architectures. You could perform Mask R-CNN to get instance segmentation results. You
could do a dilated FCN to get semantic segmentation results and then you merge the two to get
panoptic segmentation. That is one approach. So, this particular work actually implements a
couple of approaches and then studies how this needs to be evaluated differently.

(Refer Slide Time: 31:31)

So, another approach is to take a feature pyramid network instead of what we saw here. Instead of using a dilated FCN or a Mask R-CNN, you could use an FPN, a feature pyramid network, and then send those feature maps through a Mask R-CNN head to get your instance-level segmentations. You could also send those feature maps through a pixel-level recognition head to get your semantic segmentation. So, in both these approaches, there is an instance segmentation pathway and there is a semantic segmentation pathway, which are combined to give your panoptic segmentation. What would be the loss function here? The loss function recommended is a combination of both the semantic segmentation and the instance segmentation losses.

Remember that instance segmentation, the way we saw with Mask R-CNN, has three losses: classification, bounding box and the mask loss. And for semantic segmentation, you have your standard pixel-wise cross-entropy loss.

(Refer Slide Time: 32:46)

One of the challenges of panoptic segmentation is that it cannot be evaluated using the metrics
one may use for semantic segmentation or instance segmentation. Why is this so? In semantic
segmentation, we use IoU and per-pixel accuracy. So, given the set of pixels predicted to belong to a particular object, you could compute the intersection over union with the ground truth of that object. You could also compute a per-pixel accuracy, because you are going to predict probabilities at each pixel for all the classes that you have, so you can also look at that accuracy as a metric.

On the other hand, for instance segmentation, which is similar to detection, you would have
average precision over different IoU thresholds. Why can we not use these for panoptic segmentation? That is because this could cause asymmetry between classes with and without instance-level annotations. Remember that in panoptic segmentation you could have certain classes with multiple instances, for example people or dogs, and there could be certain classes, such as sky or road, which may not have instances at all. So, combining these metrics without considering these asymmetries may not be wise.

(Refer Slide Time: 34:28)

So, this particular work recommends a new metric known as the PQ metric. How does the PQ
metric work, you can look at this example here. Let us assume that this is your correct ground
truth. So, there is sky, there is grass, there are three people. That is the instance segmentation part
of it and then there is a dog. Let us assume now that whatever module we came up with predicted
it this way.

The sky, it got it right. The grass, it got it right. The dog became person. And instead of three
people, it only predicted two people where it combined two of these people. So, how do you now
measure the error of this kind of a system? The way this approach recommends is, you first write out all your true positives, which would be the two people that match between your ground truth and prediction. There is a false negative, which is the person you missed, and there is a false positive, which is the person you predicted instead of a dog.

So, for a ground truth segment g and a predicted segment p, the PQ metric is computed as

$$PQ = \frac{\sum_{(p,g)\in TP} IoU(p,g)}{|TP| + 0.5\,|FP| + 0.5\,|FN|} = \frac{\sum_{(p,g)\in TP} IoU(p,g)}{|TP|} \times \frac{|TP|}{|TP| + 0.5\,|FP| + 0.5\,|FN|}$$

So, the first term here gives the segmentation quality: among the true positives, what was the average IoU of your segments? And the second term gives the recognition quality: among all of your segments, how many did we get as true positives? So, you have a part that checks for segmentation quality and a part that checks for recognition quality, and that is why this metric is useful for panoptic segmentation.
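To make the formula concrete, here is a tiny sketch that computes PQ from already-matched segments. The segment-matching step (a prediction and a ground-truth segment are matched when their IoU exceeds 0.5) is assumed to have been done, and the numbers in the usage line are made up.

```python
def panoptic_quality(iou_per_tp, num_fp, num_fn):
    """PQ from the IoUs of matched (true positive) segment pairs plus FP/FN counts.
    A toy sketch of the metric formula only; real evaluators also perform the matching."""
    tp = len(iou_per_tp)
    if tp == 0:
        return 0.0
    sq = sum(iou_per_tp) / tp                        # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)     # recognition quality
    return sq * rq                                    # PQ = SQ x RQ

# toy numbers for the slide's example: 2 matched people, 1 missed person, 1 spurious person
print(panoptic_quality([0.8, 0.7], num_fp=1, num_fn=1))   # 0.75 * (2/3) = 0.5
```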

(Refer Slide Time: 37:07)

For more details, please read a good overview of semantic segmentation by Nanonets, a good overview of the DeepLab semantic segmentation method at Analytics Vidhya, and also a good introduction to panoptic segmentation.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 48
CNNs for Human Understanding: Faces – Part 01

We will now move to another important task, where CNNs have been extremely useful over
the last few years. In computer vision tasks, CNNs are used in understanding humans from
various different perspectives. In this first lecture, we look at understanding faces and
processing faces for tasks such as recognition and verification.

(Refer Slide Time: 00:46)

Face Recognition has remained an extremely important computer vision task for several
decades now as part of biometrics. It has applications in security, finance, healthcare, and
various other aspects of society. Unconstrained face recognition, which is about
recognizing faces in the wild, is a very challenging problem, because of the variations that
you could have in lighting, in occlusions, in pose or alignment, in expressions, so on and so
forth.

(Refer Slide Time: 01:29)

Face recognition is not a new topic; it has been around for several decades now because of its importance. Algorithmically speaking, the first face recognition efforts started in the 1960s with the work of Woodrow Bledsoe. Then came work in the 1970s from Takeo Kanade. Then, in the 90s, came one of the most popular algorithms in early face recognition, known as Eigenfaces, which repurposes principal component analysis in a slightly different way to perform face recognition. Then came various approaches using Local Feature Analysis for face recognition, followed by Elastic Bunch Graph Matching. Of course, we had the Face Detector from Viola-Jones, which was an important component of face recognition systems.

Local binary patterns, which we talked about when we discussed handcrafted features, were extensively used for face analysis tasks. Then, towards the end of the first decade of the twenty-first century, came Sparse Feature Representations for face analysis tasks. And of course, since 2014, Deep Neural Network-based Approaches for face recognition, which we will focus on in this lecture. From a hardware perspective, cameras began way back in 1915, with the digital camera coming in the 90s. Then face recognition shifted to surveillance cameras, camera smartphones, Microsoft Kinect-based devices that you see on your Xbox, and Google Glass kind of devices towards the end of the first decade of the twenty-first century.

And in 2011, Samsung Galaxy had their face unlock feature implemented as part of the
smartphone. And then came the RGB-D Camera. And more recently, Body Cameras that can
do face recognition. All of this has been very well chronicled in a recent article known as “50 Years of Biometric Research: Accomplishments, Challenges and Opportunities.” Do look at it if you have time.

(Refer Slide Time: 03:51)

A standard face recognition pipeline, as used when deep learning is employed, is given by this diagram here. So, you start with an Input Image. You first perform Face Detection, because you have to isolate the faces from the image. Once face detection is done, there are a few pre-processing tasks that need to be done before you give the cropped face to a recognition task. The first task is to align the face to a predefined geometry. The second task, which is optional and depends on the application, is to check whether this image is a spoof or live: “Did I hold up an image to the camera? Or is it really my face?” Once that check is done, if it is a spoof, you conclude the pipeline right there.

But if it is not a spoof, we go on to the next stage. The next stage could be looked at as Face
Processing. There are two kinds, “one-to-many” and “many-to-one”, once again, depending
on what application you want to use these for. We will see both of these in more detail in a
couple of slides from now. Of course, the pre-processing task is an optional task, depending
on whether you need it for a given application. And once you have your training data, after
all of these processing steps, you provide this to a Convolutional Neural Network for Feature
Extraction.

And at the end, you use different kinds of Loss Functions, such as Euclidean Distance or
Angular Distance, so on and so forth to train the CNN. At test time, you have a test image,
which goes through a similar round of processing. Once you have the test data after processing, you extract the features, and you finally match the final embedding of that face using various approaches such as matching against a threshold, doing Nearest Neighbour Matching, or, as we will see, Metric Learning or Sparse Representations.

We will see some of these over the next few slides. Broadly speaking, one could say that to
deploy face recognition systems in the wild, it is a combination of detection, alignment, and
matching. As we will see, matching can be of several kinds too when we talk about faces.

(Refer Slide Time: 06:41)

So, the key components are face processing, where you could have a one-to-many
augmentation or a many-to-one normalization. We will describe both of these very soon.
Then you have a Deep Feature Extraction through a Network Architecture and a
corresponding loss function. And finally, you match these embeddings or features or
representations using various different approaches.

(Refer Slide Time: 07:14)

What is this pre-processing step that we just spoke about? One can perform one-to-many
augmentation, which is the standard data augmentation that we spoke about, where we could
generate many patches or many images by varying the face image in different ways. An
example could be by rotating the face image in different ways, you could be simulating
different poses of the person to an extent. And this kind of an augmentation can help the face
recognition system be more robust. On the other hand, one may also want to get a single
canonical view of a person by normalizing several face images onto a standard model.

For example, you could have all these different images of a single person. As you can see,
these have variations from an illumination standpoint, these have variations from a pose
standpoint. How do you get these? These could be obtained at different points in time or
these could be obtained as frames in a single video sequence, say as a person was speaking,
and maybe his head was moving, and you capture all of those frames, and then you normalize
all of those frames into a single frame, which gives a frontal view which can be used for
further processing.

A lot of work in this space, which we may not be able to focus on in this course, also
corresponds to the idea of using 3D models for face recognition. Such a normalization
approach can help in preserving identity, despite variations in pose, lighting, expression and
background. Because once you normalize for these factors, you are likely to have a canonical
image that a CNN may do better on while performing a recognition task.

(Refer Slide Time: 09:24)

So, what network architectures have been used for face recognition over the years, so we will
focus on the deep learning era for face recognition in this particular lecture. At each stage
over the last few years, the architecture used for face recognition mirrors the CNN
development at that stage. So, if you see this timeline below here, 2012 was AlexNet. So DeepFace, which was one of the first comprehensive efforts to use the new generation of CNNs for face recognition, used AlexNet as its main backbone architecture.

Subsequently came FaceNet in 2015, which used GoogLeNet; VGGFace in 2015, which used VGGNet; a network known as SphereFace in 2017, which used ResNet; and a network known as VGGFace2 in 2017, which used Squeeze-and-Excitation networks. A more comprehensive listing
can be seen here, a table that was obtained from this paper known as Deep Face Recognition:
A Survey. And you can see here that the backbone architecture of all of these different
methods have predominantly been the popular architectures that have been successful over
the last few years. While there have been a few deviations, where researchers have varied the
architecture slightly, for a large part, the backbones of face recognition architectures in recent
years have been AlexNet, VGG, ResNet, and so on and so forth.

(Refer Slide Time: 11:14)

The entire space of face recognition can broadly be divided into two kinds of tasks that are
required from an application standpoint, verification and identification. In both these cases,
generally, it is the last stage of the architecture that changes, you have an image, you perform
detection and alignment, you get a cropped image that is well aligned, then you take a deep
CNN, a feature extraction network, get a feature representation, which is then passed on to a
verification system or an identification system.

(Refer Slide Time: 11:57)

What is face identification? Face identification is the task of assigning a given input image to
a person name, or identity from a database. It would be a one-to-many matching. So, you
have a given image, and you have to match this image with many identities in your database

to find the closest match. So, this task is very similar to the classification task that we have
been speaking about so far, such as ImageNet, where, given an image, you match against
1000 different classes, and find which is the class that should be assigned to this image.

This is generally formulated as a K + 1 multi-class classification problem, where you have one additional class in case the face comes from outside the database. So, your input is the face image and the output is the identity class or the face ID.

(Refer Slide Time: 13:04)

What is the loss function used in this scenario? Standard Softmax and CrossEntropy loss that
we have been visiting all the while. So, you have a last linear layer, which parameterizes the
subjects, the representations of the subjects. And the cross entropy can be given by this form,
where the x here is the penultimate layer's representation, on top of which you have some
weights, which gives you the final outputs in that last layer, on which you apply the Softmax
Activation function, and then the CrossEntropy loss. To remind you here, these values are
also known as Logits, the set of outputs of a neural network before you apply the Softmax
Activation function are also referred to as the logits of a neural network.

And generally, when you use the Softmax and CrossEntropy loss, the embeddings that you get before you apply the Softmax geometrically look like ellipsoids, where you have a large intra-class variance, because that is not the focus of the cross-entropy loss, but a good inter-class separation, since this loss separates the classes.

(Refer Slide Time: 14:29)

Now for the verification task, which is about verifying whether two images
belong to the same identity. This is a one-to-one matching task. Given an input image and an
identity that the person claims to have, this task has to ascertain whether the input image
belongs to that identity. You could imagine this in an immigration setting,
where a passport is presented to an immigration officer, and the officer looks at your face to
check if the face matches the image and the identity on the passport. It is a one-to-one
matching. The immigration officer is not comparing your image with a database of identities,
but is only verifying whether you are the person you claim to be.

So, in this case, your input is a face image and the face ID, unlike the identification setting, and the output is a binary classification, which states whether it is a match or not a
match. So illustratively speaking, you have an image, you have a convolutional neural
network, you get a feature representation, you have your face database from which one
identity is picked out based on what the person claims himself or herself to be. And there is a
similarity measure that is used to check the similarity between these 2 images, and a binary
decision of Match or Not Match is taken.

(Refer Slide Time: 16:18)

One can also perform identification as a verification task. How is this done? You can now
take a test image, take the feature representation through a convolutional neural network, and
for each of your identities in your database, you can check whether there is a match or not a
match. This would turn an identification task into a verification task. Why is this even talked about? Why is this useful? It removes the need to retrain the model on addition of new face classes to the database. So, if you have 1000 people in your database, and you have already trained a CNN, then if a new person is added, you have to retrain your CNN, because the number of neurons in your last layer will now increase by 1.

However, if we treat this as a verification problem, we only need the representation of the
neural network, and the matching with respect to every new identity can be done using other
similarity metrics. This makes the approach scalable. Identification, however, could involve multiple verification steps, which means error could get amplified: if there was an error anywhere in the pipeline, it could now get amplified, because you are going through multiple verification steps. In this case, the goal would be to develop a very accurate and efficient verification system, so that it can then also be used effectively for identification.

(Refer Slide Time: 18:06)

One of the earliest efforts for verification was the Siamese Networks way back in 1994. This
was first proposed for the task of signature verification. So, a certain bank has a person’s
signature on their records. And if a person visits the bank on a particular date, and signs
again, the bank has to check if the signature matches the signature on the records. This task is
signature verification. So, the architecture proposed in a Siamese network, as the name says it
comes from the Siamese Twins, is to have the same CNN architecture replicated twice, where
the signature currently given passes through the same CNN, the signature on the records
passes through the same CNN, and you get 2 different representations f1 and f2.

And there has to be a distance measure that checks how close these 2 representations are,
which is then passed on to a binary classifier to give the decision of a match or not a match.
So, an important takeaway from this approach is that the representations have to be learned in
such a way to respect the distance measure that is used to compare these 2 features. So, we
will see now that this setting has been improved in many ways over the last few years.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 49
CNNs for Human Understanding: Faces – Part 02

(Refer Slide Time: 00:10)

Now, we will move on to some of the recent efforts that have used CNNs to perform face
identification and verification. One of the first efforts in recent years since the success of
AlexNet was DeepFace, which was published in CVPR of 2014, which first performed a
pre-processing step, which included Face Localization, Fiducial Point Detection and
alignment to get a frontal crop of the face. This image on the left shows the steps that were
involved to obtain the frontal crop.

Given the input image, the first step is to detect the face and crop out the face from the image.
Then, about 67 fiducial points are detected on the 2D crop and they are triangulated. These 67
fiducial points are different markers on the face, such as tip of the nose, corners of the eyes,
corners of the lips, so on and so forth. There are existing methods that can do that. Once this
is done, these are then cast on to a 3D model, onto which this particular face image is
projected. And that 3D model is then normalized into a 2D face image with its corresponding
fiducial points, which is converted to a frontal crop of the image.

And finally, since we now have the 3D model of the same person's face, you could also generate other poses of the same person. So, the frontal crop that we talk about here in Step 1 is this image g, which you can see is fairly well normalized when you compare it to the image b. This frontal crop is passed on to the deep CNN architecture with a K-way softmax for differentiating between K different classes, trained using the standard softmax and cross-entropy loss.

(Refer Slide Time: 02:32)

For the verification part, DeepFace used a Siamese Network kind of an approach, where the
representation learned from the identification task is frozen for verification. So, these
architecture parameters are frozen. Given a face image I1 and the corresponding image I2 on record for the identity the person claims to be, you get 2 representations f1 and f2, which are G(I1) and G(I2). The rest of the network is then trained by taking an element-wise absolute difference between f1 and f2, followed by a fully connected layer and a Sigmoid activation for the final binary decision.

So, the distance induced in this particular scenario is $\sum_i \alpha_i\,|f_1[i] - f_2[i]|$, where α_i is a weight learned in this layer. And the final output is then a match or not-match, or a
same or a not same. This approach that was proposed in DeepFace, which was taken from the
Siamese Network idea of the 90s has since been expanded in several ways.
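As a minimal sketch of this kind of verification head (the embedding size is an assumption, and this is not the DeepFace implementation), the shared backbone produces f1 and f2, a linear layer learns per-dimension weights on |f1 − f2|, and a sigmoid gives the match probability. It would be trained with a binary cross-entropy loss on same/not-same pairs.

```python
import torch
import torch.nn as nn

class SiameseVerifier(nn.Module):
    """Shared-backbone verification head: weighted |f1 - f2| -> sigmoid
    (a sketch; the 4096-d embedding size is an assumption)."""
    def __init__(self, backbone, embed_dim=4096):
        super().__init__()
        self.backbone = backbone              # the same CNN is applied to both images
        self.fc = nn.Linear(embed_dim, 1)     # learns the alpha_i weights (plus a bias)

    def forward(self, img1, img2):
        f1, f2 = self.backbone(img1), self.backbone(img2)
        d = torch.abs(f1 - f2)                # element-wise |f1[i] - f2[i]|
        return torch.sigmoid(self.fc(d))      # probability of "same identity"
```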

(Refer Slide Time: 03:52)

And one of the primary ideas, which has been used is the notion of what is known as
Contrastive Loss. Contrastive loss is a loss that is used for training the neural network based
on the paradigm of metric learning in machine learning. Metric learning methods in machine
learning attempt to learn a distance metric, rather than use a Euclidean distance metric, or a
cosine distance metric. So, the same idea is now used to learn distinctive discriminative
feature representations for the task that you want to use these representations for.

So, in the face verification scenario, our objective is to map the input into an embedding space where the distances between points correspond to semantic similarity. Euclidean distance could do it, but if we could learn a distance that does this more effectively, that
would be more useful, and that is what most of these methods try to do. There have been
various formulations, such as Pairwise Contrastive Loss, Ranking Loss, Triplet Loss, so on
and so forth. We will see some of these over the rest of this lecture.

(Refer Slide Time: 05:15)

Pairwise contrastive loss came from a work in 2006, introduced by Hadsell et al., where an approach to learn a mapping invariant to complex transformations was introduced. And as we just said, the idea was to make similar points close to each other on the embedding
manifold. So, this representation that you get as output of your CNN is also called an
embedding. And if we now assume that those low dimensional embeddings, generally these
embeddings have a lower dimensionality than the image itself, the image could be a
200 × 200 image, whereas the embedding could just be a 1000-dimensional vector or a
4000-dimensional vector.

We ideally want these embeddings that lie on, say, 1000-dimensional manifold or a
4000-dimensional manifold to be similar to each other for similar entities, and far away from
each other for dissimilar entities. So, our goal here is to learn W, the weights of the neural network, in such a way that the distance ||G_W(X_1) − G_W(X_2)||_2, where X_1 and X_2 are inputs, approximates the semantic similarity of the inputs. Let us see how this is done.

(Refer Slide Time: 06:42)

So, the overall algorithm's ideology can be given by the algorithm shown here. Given a training data set, you first identify, for each input X_i, the set of samples X_j that are similar to X_i and put them in a set, say, S_{X_i}. In the face verification context, these could be other pictures of the same person; for such pairs we set the label Y_ij = 0. Similarly, you identify dissimilar points, which could be images of 2 different people, and give them the label Y_ij = 1. Now, what do we want to do when we train?

For each pair (X_i, X_j) in the training set, if Y_ij = 0, that means they are similar images, and we update W to decrease the distance D_W = ||G_W(X_i) − G_W(X_j)||_2. And if Y_ij = 1, that is, dissimilar images, then we would like to increase this distance D_W. Let us see now how this is done using a single loss function. Let X_1 and X_2 be a pair of high-dimensional input vectors, and let y be a binary label that is 0 if X_1 and X_2 are similar, and 1 otherwise.

Then the pairwise contrastive loss L_contrastive is given by

$$L_{contrastive} = \frac{1-y}{2}\, D_W^2 \;+\; \frac{y}{2}\, \max\!\left(0,\; m - D_W^2\right)$$
The first term, as you can see, gets activated only when y = 0, and the second term gets activated only when y = 1. When y = 0, the images are similar, so you are minimizing D_W², which is what we would like to do. And when y = 1, you have dissimilar images as input, and you are then minimizing max(0, m − D_W²).
)
Let us analyse this. m is a user-defined margin: you want the dissimilar images to be separated by at least a certain amount, 10 units or 20 units in your manifold, and so on. This states that if D_W² > m, then the quantity m − D_W² becomes negative, the max becomes 0, and there is nothing to do: D_W², the distance between the dissimilar images, is already greater than the user-defined margin m, so this term disappears. However, if D_W² < m, then you would be minimizing m − D_W², which is equivalent to maximizing D_W², the distance between these dissimilar images.
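Here is a short sketch of this pairwise contrastive loss on a batch of embedding pairs, following the form used above where the margin is applied to the squared distance; the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, y, m=1.0):
    """Pairwise contrastive loss (sketch). y = 0 for similar pairs, y = 1 for dissimilar pairs;
    m = 1.0 is an assumed margin on the squared distance."""
    d_sq = ((f1 - f2) ** 2).sum(dim=1)              # D_W^2 for each pair
    similar_term = (1 - y) * 0.5 * d_sq             # pull similar pairs together
    dissimilar_term = y * 0.5 * F.relu(m - d_sq)    # push dissimilar pairs beyond the margin
    return (similar_term + dissimilar_term).mean()

# toy usage: 8 pairs of 128-d embeddings, half similar and half dissimilar
f1, f2 = torch.randn(8, 128), torch.randn(8, 128)
y = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1], dtype=torch.float32)
print(contrastive_loss(f1, f2, y))
```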

(Refer Slide Time: 10:14)

This idea is used in the subsequent network DeepID2, which was proposed in NIPS of 2014.
So this, similar to DeepFace, trains a deep CNN to jointly perform identification and
verification, where it can be argued that the identification task increases inter-personal
variations, you are trying to separate different identities apart, images of different identities or
embeddings of different identities. And the verification task reduces intra-personal variations
by bringing together features or embeddings of the same identity. How is this implemented?

(Refer Slide Time: 10:59)

In DeepID2, cross-entropy loss is used for the identification parameters and pairwise
contrastive loss is used for learning the verification parameters. So, you have θ𝑖𝑑, which

correspond to the weights of the identification module and θ𝑣𝑒, which correspond to the

verification module. The identification loss is given by the cross-entropy, as you can see here, where θ_id are the Softmax-layer parameters, the last-layer parameters for identification; and similarly, the verification loss is the pairwise contrastive loss, whose parameters are given by θ_ve.

(Refer Slide Time: 11:43)

But DeepID2 uses a slightly different way of learning the entire pipeline. Let us see this algorithm here. You sample 2 training samples (x_i, l_i) and (x_j, l_j) from your training data set. You then have f_i, which is the output of the backbone CNN for x_i, and f_j, which is the output of the same backbone CNN for x_j; this backbone CNN is parametrized by weights θ_c. Now, the gradient for θ_id is given by

$$\nabla \theta_{id} = \frac{\partial\, Ident(f_i, l_i, \theta_{id})}{\partial \theta_{id}} + \frac{\partial\, Ident(f_j, l_j, \theta_{id})}{\partial \theta_{id}}$$

So, in this case, you are using the cross-entropy loss for both f_i and f_j.

Similarly, the gradient for the verification parameters is given by

$$\nabla \theta_{ve} = \lambda \cdot \frac{\partial\, Verif(f_i, f_j, y_{ij}, \theta_{ve})}{\partial \theta_{ve}}$$

where the verification loss is the pairwise contrastive loss, and in this particular case y_ij is set to 1 if the identities are the same and to −1 if the identities are dissimilar. Then, ∇f_i and ∇f_j, the gradients with respect to the embeddings f_i and f_j (which are the outputs of the CNN), are given by

$$\nabla f_i = \frac{\partial\, Ident(f_i, l_i, \theta_{id})}{\partial f_i} + \lambda \cdot \frac{\partial\, Verif(f_i, f_j, y_{ij}, \theta_{ve})}{\partial f_i}$$

and similarly for ∇f_j. Finally, the gradient for the backbone CNN's parameters is given by

$$\nabla \theta_{c} = \nabla f_i \cdot \frac{\partial\, Conv(x_i, \theta_c)}{\partial \theta_c} + \nabla f_j \cdot \frac{\partial\, Conv(x_j, \theta_c)}{\partial \theta_c}$$

In other words, the weight updates for the backbone CNN use the gradients from both the x_i and the x_j outputs, and the same weights are shared by the two CNN branches applied to x_i and x_j.

So, to summarize, θ_id and θ_ve are separately updated via their respective objectives. But θ_c, the weights of the backbone CNN before you branch out into verification and identification, is trained via a weighted contribution from both the identification and verification objectives. This enables the backbone network to learn both good inter-personal and intra-personal features.
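A minimal sketch of this joint training idea is shown below: a shared backbone (θ_c) feeds an identification head (θ_id) trained with cross-entropy on both samples, while the verification term is a contrastive loss on the two embeddings weighted by λ. The layer sizes, the λ value and the use of a simple 0/1 same-or-not label are simplifying assumptions, not the original DeepID2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointFaceModel(nn.Module):
    """Shared backbone plus identification head; verification uses the embeddings directly."""
    def __init__(self, backbone, embed_dim, num_identities):
        super().__init__()
        self.backbone = backbone                              # parameters theta_c
        self.id_head = nn.Linear(embed_dim, num_identities)   # parameters theta_id

    def forward(self, x):
        f = self.backbone(x)
        return f, self.id_head(f)

def joint_loss(model, x_i, l_i, x_j, l_j, same, margin=1.0, lam=0.05):
    """Identification (cross-entropy) on both samples + lambda * verification (contrastive)."""
    f_i, logits_i = model(x_i)
    f_j, logits_j = model(x_j)
    ident = F.cross_entropy(logits_i, l_i) + F.cross_entropy(logits_j, l_j)
    d_sq = ((f_i - f_j) ** 2).sum(dim=1)
    # same = 1 for matching identities, 0 otherwise (label convention assumed here)
    verif = torch.where(same.bool(), 0.5 * d_sq, 0.5 * F.relu(margin - d_sq)).mean()
    return ident + lam * verif
```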

(Refer Slide Time: 15:09)

Subsequently came FaceNet, which was presented at CVPR 2015. It premised that existing architectures for face recognition, including DeepFace and DeepID2, typically learned an entire neural network and then used one of the penultimate layers, called a bottleneck layer, and those embeddings for the final decision making in matching. Is there a problem with this approach? FaceNet claimed yes, it has an indirectness, because we do not know how well those embeddings from an intermediate layer, a penultimate layer, or what is called a bottleneck layer transfer to later tasks.

Because the network was not explicitly trained for the later task that those embeddings are used for, it may be ineffective. So, FaceNet rather tries to ensure that its goal is to output a
good representation. There are no further goals of classification, which are considered
separately based on the learned representation. In fact, FaceNet learns a 128-dimensional
embedding using a new loss function known as a Triplet Loss Function, which is commonly
used across several settings in deep learning today.

(Refer Slide Time: 16:37)

So, here is the idea of triplet loss. It is an extension of contrastive loss. While contrastive loss
used similar and dissimilar examples, triplet loss uses a triplet. What is the triplet made of?
An anchor point, a positive point and a negative point. So, given any input, which is an
anchor point for us now, we now then have a positive point, which has the same identity as
the anchor, and the negative point, which has a different identity as the anchor. We now try to
use all these 3 in, while learning through a loss function. And our goal here, obviously, is to
ensure that the distance between the anchor and the negative sample is at least a certain
margin more than the distance between the anchor and the positive sample.

How do we implement this using a loss function? We have a triplet constraint that states

$$\|f(x_a) - f(x_p)\|_2^2 + \alpha < \|f(x_a) - f(x_n)\|_2^2$$

We are saying that the distance of the anchor to the negative sample must be at least a certain margin α, which is a positive quantity, more than the distance of the anchor to x_p, the positive sample.

(Refer Slide Time: 18:13)

So, you could visualize this as, you have an anchor image that is given as input to your CNN
model. You have a positive example for that anchor image and you also have a negative
example for that anchor image in your data set. You first ensure that after you get the
representations as output of your CNN architecture, the same CNN architecture is used for all
these 3 inputs, very similar to before. You ensure that these representations are normalized, so that you get ||f(x)||_2 = 1 for each of these representations. So, you normalize those vector representations that you get.

Now, this would be equivalent to each of these points lying on a unit hypersphere or unit ball in whichever dimensional space, say the 128-dimensional space that you are considering here. And your goal now is to train θ, the weights of the neural network, given all these triplets, to ensure that ||f(x_a) − f(x_p)||_2² + α < ||f(x_a) − f(x_n)||_2², where α is the margin. This is achieved by using the triplet loss, which is given by

$$L_{triplet} = \max\!\left(0,\; \|f(x_a) - f(x_p)\|_2^2 - \|f(x_a) - f(x_n)\|_2^2 + \alpha\right)$$

summed over the triplets. It is simple to see that minimizing this loss tries to ensure a margin of at least α between the anchor-to-negative distance and the anchor-to-positive distance.
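Here is a minimal sketch of the triplet loss on a batch of (anchor, positive, negative) embeddings, with the embeddings L2-normalized as described above; the margin value of 0.2 is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss on L2-normalized embeddings (sketch; alpha=0.2 is an assumed margin)."""
    f_a, f_p, f_n = (F.normalize(f, dim=1) for f in (f_a, f_p, f_n))  # unit hypersphere
    d_ap = ((f_a - f_p) ** 2).sum(dim=1)   # squared anchor-positive distance
    d_an = ((f_a - f_n) ** 2).sum(dim=1)   # squared anchor-negative distance
    return F.relu(d_ap - d_an + alpha).mean()

# toy usage: a batch of 16 anchor/positive/negative embeddings of dimension 128
f_a, f_p, f_n = torch.randn(16, 128), torch.randn(16, 128), torch.randn(16, 128)
print(triplet_loss(f_a, f_p, f_n))
```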

(Refer Slide Time: 19:55)

One question here is, “How do you select these triplets, given an anchor image?” You could
have what are known as Easy Triplets, where it is very clear what the positive sample is
and what the negative sample is. However, easy triplets may not really facilitate learning,
they may not be able to allow the network to learn distinctive features. So, it may be
important to select triplets that initially violate the triplet constraint, so that the network
learns. And eventually, perhaps over time change the nature of triplets.

(Refer Slide Time: 20:39)

So, you could first choose what are known as Hard Triplets. Given a mini-batch of triplet samples that you are going to use to train this neural network, you could choose only the hard positives and the hard negatives. What are hard positives? Those where ||f(x_a) − f(x_p)|| is maximized: samples where the distance of the anchor to the positive is the highest are hard positives. And samples where the distance of the anchor to a negative is minimum are hard negatives. So, we choose those kinds of samples in a given mini-batch as hard triplets that we can train the network on.

So, FaceNet actually uses online triplet sampling, where hard positives and negatives are sampled from every mini-batch. There is one problem here though: if you choose very hard negatives early on, there could be bad local minima, and the training could collapse. So, the idea would be to choose semi-hard negatives in an initial stage, and then move on to harder negatives in later stages of training. How do you choose semi-hard negatives? We ideally want to choose negatives that lie inside the margin α, so we want to choose those kinds of samples to start with. They may not necessarily be at the minimum distance, but we want negative samples that lie inside the margin.

How do you choose such a semi-hard example? You pick negatives satisfying ||f(x_a) − f(x_p)||_2² < ||f(x_a) − f(x_n)||_2². You ignore the α that we had, and this way, you may end up choosing samples that lie inside the margin while you train initially.
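Below is a simplified sketch of online semi-hard negative selection: for each anchor, among candidate negatives that are farther than the positive, pick the closest one. Real pipelines mine within the whole mini-batch and also take the margin into account, so treat this only as an illustration.

```python
import torch

def semi_hard_negative_indices(f_a, f_p, candidates):
    """For each anchor, choose a negative that is farther than the positive but as close as
    possible (a simplified sketch of online semi-hard mining)."""
    d_ap = ((f_a - f_p) ** 2).sum(dim=1, keepdim=True)               # (B, 1)
    d_an = torch.cdist(f_a, candidates, p=2) ** 2                    # (B, N) squared distances
    semi_hard = d_an > d_ap                                          # farther than the positive
    # mask out candidates that are not semi-hard, then take the closest remaining one
    d_masked = torch.where(semi_hard, d_an, torch.full_like(d_an, float('inf')))
    return d_masked.argmin(dim=1)                                    # index of chosen negative

f_a, f_p = torch.randn(8, 128), torch.randn(8, 128)
candidates = torch.randn(32, 128)     # embeddings of other identities in the mini-batch
print(semi_hard_negative_indices(f_a, f_p, candidates))
```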

(Refer Slide Time: 22:39)

Going further over triplet loss, in 2017, there was an approach known as Angular Softmax
Loss that intended to improve identification and verification performance with faces. It can
also be used for other domains. This set of diagrams from the paper illustrates the overall idea. So, if you had the original softmax loss, and you considered the embeddings of the
neural network before applying the softmax loss, whatever embedding you had from the
penultimate layer, before applying the Softmax Activation function and the cross-entropy loss
after that, you may have an embedding something like this, where all these points, yellow in
colour belong to one class, all these points that are pink in colour or purple in colour belong
to the other class.

So, one can see that if you now normalize these vectors to unit norm, it may be a little difficult to use the cosine distance between them to separate the identities. You can see here that there are certain points where the yellow and the purple
almost overlap. This could be improved by using a Modified Softmax Loss, where you
normalize your weight vectors before obtaining these embeddings. We will see the formal
definition soon.

And this particular work proposes an Angular Softmax, where you enhance the distinction
between the positive class embeddings and the negative class embeddings, so that then you
could use a cosine distance metric to effectively separate the identities or get similarity
between the identities. Why is this important in this context? Remember that for most of
these identification or verification methods, the last step is a step of matching using a distance
metric. And one distance metric that could be used for matching is the cosine similarity or the
angle between these representations. So, we now look at different definitions of these
modified Softmax losses.

(Refer Slide Time: 24:59)

So, the original Softmax loss can be written as an exponential over the logits z_i of the neural network, where the logits can be written as W x_i + b, with the x_i being the activations of the layer before applying the Softmax; that is what you have in the definition of Softmax. Now, one could write out the dot product W x_i that you have here as ||W|| ||x_i|| cos θ_{y_i}, that is the definition of the dot product, plus the bias b_{y_i}. Remember, we are looking at the y_i-th label, and that is why the subscript on b_{y_i}. So, this is your standard Softmax loss, which for a sample x_i with label y_i takes the form

$$L_i = -\log\frac{e^{\|W_{y_i}\|\,\|x_i\|\cos\theta_{y_i} + b_{y_i}}}{\sum_j e^{\|W_j\|\,\|x_i\|\cos\theta_{j} + b_j}}$$

In the modified Softmax loss, after the penultimate layer of the neural network, before you apply the Softmax, you normalize the weights and then apply the Softmax. So, in this case, the norms of the weights W_1, W_2, and those of every other neuron connecting to that last layer (remember that you are going to have different weights for every neuron in that last layer) are normalized to 1. Now, your Softmax loss becomes this first term here, since ||W|| becomes 1.

The operation also sets the bias to 0. So, your new modified loss looks something like this, where ||W|| becomes 1 and the bias becomes 0; these are operations performed before applying the Softmax. The angular Softmax loss extends this to make the angle corresponding to the positive class be scaled by m, while the angles corresponding to all other classes are not scaled by the same m value; m is a positive constant. This ensures that the angle is further separated between the positive class and the rest of the classes.

Remember that the final match, to check whether a feature belongs to one class or the other, is decided by whether cos θ_i > cos θ_j; you would then say that the point belongs to the i-th label and not the j-th label. By doing Angular Softmax, we instead require cos(m θ_i) to be greater than cos θ_j, so that for the positive class the cosine value is further separated from the cosine values of the angles with respect to the other classes. That is the main idea of Angular Softmax.
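A minimal sketch of the "modified softmax" idea is shown below: the class weight vectors are normalized and the bias is dropped, so the logits are proportional to cos θ. Here the features are also normalized and rescaled by a constant, which is an assumed design choice; the multiplicative angular margin of A-Softmax (scaling the target-class angle by m) is only indicated in a comment, not implemented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedSoftmaxHead(nn.Module):
    """'Modified softmax' head: unit-norm class weights, no bias, so the logit for class j is
    s * cos(theta_j). Angular-margin variants additionally enlarge the target-class angle
    before the softmax (not implemented in this sketch)."""
    def __init__(self, embed_dim, num_classes, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.scale = scale   # assumed rescaling constant

    def forward(self, x):
        cos_theta = F.linear(F.normalize(x, dim=1), F.normalize(self.weight, dim=1))
        return self.scale * cos_theta   # feed to F.cross_entropy as logits

head = NormalizedSoftmaxHead(embed_dim=128, num_classes=1000)
logits = head(torch.randn(4, 128))
loss = F.cross_entropy(logits, torch.randint(0, 1000, (4,)))
```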

(Refer Slide Time: 28:03)

So, here is another visualization from the same paper. The Softmax decision boundary is given by (W_1 − W_2) x + b_1 − b_2 = 0. Remember, you are going to have multiple classes in that last layer; we are considering 2 classes, whose corresponding weights are W_1 and W_2, and this is your Softmax decision boundary. You can see here that the Softmax decision boundary separates those 2 classes in this particular way.

In your modified Softmax boundary, you are normalizing your weights, and hence, all your
points lie on a unit ball. And you are now trying to ensure that you get a separation such as
this: your points belonging to the positive class lie on one side of the ball and your points belonging to the negative class lie on the other side of the ball. This is a little bit better,
because in the previous Softmax approach, you do not know where these 2 balls may overlap
with each other. You are now trying to ensure that these go on different sides of the same ball.

But modified softmax still has a problem at the points where the two classes meet on the ball, since those points are close to each other. Angular Softmax tries to ensure that this separation is further maximized, by pushing the class-1 points into a narrower arc on one side of the ball and the class-2 points into a narrower arc on the other side. The further separation helps better classification in the final step.

(Refer Slide Time: 29:57)

Over the last few years, there have been other ideas to improve these loss functions. A lot of
these have been used in the face analysis context. CenterLoss is another approach that was
proposed in 2016, which was a loss added to the cross-entropy loss, where in addition to the
standard cross-entropy loss, we also obtain a centroid for each class label. And for a point
belonging to a particular class, we also minimize the distance of this representation to the
centroid of that particular class, which this data point belongs to. So, these centroids are also
updated online during learning as you keep training over different iterations.

You get these centroids based on the representations in each particular iteration over training, and in that iteration, we try to ensure that the representations are close to the centroid of their particular class. The diagram on the right shows what happens as you vary the coefficient λ, which tells you how much you are going to weigh that distance to the centroid. You can see that initially, when λ has a very small value, you get a certain separation with respect to the centroid.

And as you weight λ more and more, you see that all the embeddings of a particular class end up going close to the centroid, given by the white dot in each of these classes. This allows good separation between the classes when you want to make your final decision on whether there is a match with a certain identity, or a match versus a no-match.
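Here is a small sketch of a center-loss term with learnable per-class centroids. In the original paper the centers are updated with a dedicated rule each iteration; letting the optimizer update them, and the λ value shown, are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Center-loss term: pull each embedding towards a per-class centroid (sketch)."""
    def __init__(self, num_classes, embed_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, features, labels):
        return 0.5 * ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

# total loss = cross-entropy + lambda * center loss (lambda is the weighting on the slide)
center_loss = CenterLoss(num_classes=1000, embed_dim=128)
feats, labels, logits = torch.randn(8, 128), torch.randint(0, 1000, (8,)), torch.randn(8, 1000)
lam = 0.01   # assumed value
loss = F.cross_entropy(logits, labels) + lam * center_loss(feats, labels)
```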

(Refer Slide Time: 31:54)

L2-Softmax is another improvement, which tries to ensure that the features before the Softmax lie on a hypersphere of a fixed radius α. Instead of normalizing the features to ensure they have unit norm, this approach tries to ensure that they have a particular norm α. The intuition behind this approach comes from the observation that features with a high L2-norm are easy to classify, while features with a low L2-norm, because they could be close to the origin, could be more difficult to classify.

When points are closer to the origin, they can look more similar to each other; when the points have a high L2-norm, they are further away from the origin and hence can be separated a bit more easily. So, in this case, the optimization turns out to be: minimize the cross-entropy loss such that the representations before the Softmax satisfy ||f(x_i)||_2 = α.

(Refer Slide Time: 33:14)

This was further improved in RingLoss in 2018, which is a very similar idea to L2-Softmax. In this case too, you minimize the cross-entropy loss such that ||f(x_i)||_2 = R. But the difference from L2-Softmax is that here you also learn R as part of your training process. So, the neural network decides the value of the target L2-norm in addition to learning its weights. This is known as RingLoss because you are trying to project the feature representations before the Softmax layer onto a ring whose radius is also decided by the neural network.
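A minimal sketch of the ring-loss term is shown below, with the radius R as a learnable parameter; the initial radius and the loss weight are assumptions.

```python
import torch
import torch.nn as nn

class RingLoss(nn.Module):
    """Ring-loss term: penalize deviation of feature norms from a learnable radius R (sketch)."""
    def __init__(self, init_radius=1.0):
        super().__init__()
        self.radius = nn.Parameter(torch.tensor(init_radius))  # R is learned with the network

    def forward(self, features):
        norms = features.norm(p=2, dim=1)
        return ((norms - self.radius) ** 2).mean()

# used as: total_loss = cross_entropy + lam * ring_loss(features), with lam an assumed weight
ring_loss = RingLoss()
print(ring_loss(torch.randn(8, 128)))
```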

(Refer Slide Time: 34:06)

Based on these approaches, there have been a slew of efforts. CosFace proposed in 2018
proposes a large margin-based cosine loss for face recognition. UniformFace proposed in
2019, proposes a uniform loss to learn equidistributed representations, so that you exploit the
full feature space rather than focus on one ball around the origin. RegularFace proposed in
2019, again proposes an exclusive regularization to explicitly enlarge angular distance
between different entities. GroupFace, which was very recently proposed in 2020 uses
multiple group-aware representations to improve the quality of the embedding.

CurricularFace, which was again a recent work in 2020, proposes adaptive curriculum
learning to adjust the relative importance of easy and hard samples in different training
stages. Curriculum learning is a facet of training deep neural networks, where initially you
expose the model to say, simple training examples, and then gradually increase the difficulty
of the training examples to help the neural network learn better. So, we saw this with triplet
mining, where we said, “let us start with semi-hard negatives, before we go to hard
negatives.” And this approach called CurricularFace took this further to propose an adaptive
curriculum learning loss.

(Refer Slide Time: 35:49)

To conclude this lecture, face recognition also has what is known as a Closed-Set and an
Open-Set setting. In a Closed-Set setting, you have a set of pre-given identities that you have
to match an input image to. One could use a Softmax-plus-CrossEntropy loss, Center Loss, L2-Softmax loss, RingLoss, Angular Softmax Loss, and so on to implement Closed-Set face recognition.

On the other hand, in Open-Set face recognition, one could have images of people that are not in your database, and the neural network has to at least say that this person does not belong to the database and is unknown to the current system. What kind of loss functions do you use to train such a network? There is what is known as Double Margin Contrastive Loss, which could be repurposed to achieve this. Even triplet loss could be used to achieve Open-Set face recognition. For more details, you can look at the work called Deep Face Recognition: A Survey by Masi and others.

(Refer Slide Time: 37:03)

So, your homework readings for this lecture are going to be a very nice blog on understanding
Ranking Loss, Contrastive Loss, Margin Loss, Triplet Loss, so on and so forth, in case they
are all confusing to you. A very detailed tutorial on deep face recognition by Masi and others.
And if you are interested, look at a Python code base for face recognition with OpenCV and
also the entire paper on Deep Face Recognition: A Survey. So, the question that we left
behind was, “What is Double Margin Contrastive Loss? And how can it be used for Open-Set
face recognition?” The hint for you is read this tutorial by Masi et al to get your answer.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 50
CNNs for Human Understanding: Human Pose and Crowd

Beyond face understanding, CNNs have been used for many other human understanding
tasks; gesture recognition, emotion recognition, gait recognition, so on and so forth. But to
give a flavor of different varieties of tasks, we will now look at two other tasks, human pose
estimation and crowd counting, using deep learning and CNNs.

(Refer Slide Time: 00:52)

The task of human pose estimation is the problem of localizing human joints, such as elbows, wrists and so on, in both images and videos. Where would such a task be used? It could be for sports analytics, or for the Microsoft Xbox, which detects your pose and accordingly asks your avatar on the screen to play a particular tennis shot or a golf swing.

Existing methods, especially deep learning based methods for human pose estimation, are
broadly categorized into single person pipelines, where you are trying to get the pose of just a
single person in the frame, or a multi person pipeline, where there could be multiple people in
the frame. And you would like to know the pose of each of them.

What you see here are illustrations of how a pose estimation model works. What we are
ideally looking for is the positions of each of these joints that you see on any of these images.

As you can see here, there are many challenges. Firstly, this seems different from the tasks
that we have seen so far, different from image level classification, or detection or
segmentation or face verification as another task. And we will see how this is done using
CNNs in this lecture. Beyond that, you can also see that when occlusions come into play,
especially self-occlusions where a part of the human body occludes another part, this task can
get very challenging.

(Refer Slide Time: 02:52)

Before we actually go ahead and discuss the methods, we will try to first ask the question,
how would you know whether the model that you developed for human pose estimation is
good or not? How do you evaluate it? There are well known metrics today. So some of the
metrics are PCP, which stands for percentage of correct parts, which states that a limb is
considered detected, if the distance between the detected joint and the true joint is less than
half the limb’s length.

So you could have a long limb or a short limb depending on which part of the body you are
trying to model. And if the distance between the predicted joint position and the correct joint
position is less than half the length of that limb, we consider that to be a fairly correct
prediction. This is known as PCP at 0.5, if you considered quarter limb length, it would be
PCP at 0.25.

A related metric is known as percentage of detected joints (PDJ), which states that a detected joint is correct if the distance between the predicted and the true joint is within a certain fraction of the torso diameter. So you could look at the torso diameter as a reference scale for the person whose pose you are trying to predict. If you say PDJ at 0.2, you want to ensure that the distance between the predicted and the true joint is less than 0.2 times the torso diameter of the person under consideration.

There has also been a different metric known as Object Keypoint Similarity (OKS) based mAP, which you can think of as the equivalent of IoU for human pose estimation. It is given by

$$\mathrm{OKS} \;=\; \frac{\sum_i \exp\!\big(-d_i^2 \,/\, (2 s^2 k_i^2)\big)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

where $\delta(v_i > 0)$ is an impulse function that checks whether $v_i > 0$ or not.

Let us try to explain each of these quantities. The sums run over all keypoints i. $d_i$ is the Euclidean distance between a detected keypoint and the corresponding ground truth, which is the actual location of the joint; that is the $d_i$ you see in the numerator. $v_i$ is the visibility flag of the ground truth. As we said, there could be certain joints that are occluded by other parts, and trying to get a correct estimate of such a joint would be an impossible task.

So by ensuring that you have a visibility flag for each joint, you know that if a joint was not visible, any error on that joint can be weighted to a lower extent. So $v_i$ is the visibility flag of the ground truth, and s is the object scale: the distance between the detected point and the ground truth has to be normalized by the scale of that particular object.

So if an object is very large in the image, this distance could be relatively larger than for another human in the same image whose scale is small. Finally, $k_i$ is a per-keypoint, user-defined constant to control falloff, which you could consider a hyperparameter. So as you can see, we now have a metric that is an inverse exponent of the distance, which means it gives an extent of how good the prediction is, but it factors in scale as well as the visibility of the joint to evaluate your final performance.
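As a small aside, here is a rough Python sketch of how one might compute this OKS quantity. The function name, the exact values of the per-keypoint constants, and the averaging over visible joints are illustrative assumptions; the COCO reference implementation differs in details.

```python
import numpy as np

def oks(d, s, k, v):
    """Rough sketch of Object Keypoint Similarity (OKS).

    d: per-keypoint Euclidean distances between prediction and ground truth
    s: object scale of the person instance
    k: per-keypoint falloff constants (hyperparameters)
    v: visibility flags of the ground-truth keypoints
    """
    d, k, v = np.asarray(d, float), np.asarray(k, float), np.asarray(v)
    visible = v > 0                                  # the impulse function delta(v_i > 0)
    if not visible.any():
        return 0.0
    sim = np.exp(-d**2 / (2 * s**2 * k**2))          # per-keypoint similarity in [0, 1]
    return float(sim[visible].mean())                # average over visible keypoints only

# Example: three keypoints, the last one not visible in the ground truth
print(oks(d=[5.0, 10.0, 3.0], s=50.0, k=[0.1, 0.1, 0.2], v=[1, 1, 0]))
```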

(Refer Slide Time: 06:42)

With these metrics specified, let us now talk about methods for human pose estimation using deep neural network models. The precursor of all such models was called DeepPose, which was proposed in 2014 and was perhaps the first work to kick off deep learning based human pose estimation. In this approach, an image was given to a model whose architecture was, of course, AlexNet-inspired, because in 2014 that was one of the most common models in practice. The only change from the original AlexNet architecture is in the output space.

Now, you are not predicting a class label for the entire image; instead, you are going to predict a set of 2D joint positions, $(x_i, y_i)$ for each joint in your input image. So your output is given by y, which is a set of $y_i$'s, where each $y_i$ contains the x and y coordinates of the corresponding joint. What would be the loss to use here? The loss would be a mean squared error, or the L2 norm, on the position of every joint with respect to the ground truth of that joint.
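To make this concrete, here is a minimal sketch of such a regression head trained with an L2 loss on joint coordinates. The backbone choice (a torchvision ResNet-18 instead of the original AlexNet-style network), the number of joints, and the dummy data are all illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_joints = 16                                        # assumed number of body joints
backbone = models.resnet18(weights=None)               # stand-in for the AlexNet-style backbone
backbone.fc = nn.Linear(backbone.fc.in_features, 2 * num_joints)  # (x, y) per joint

images = torch.randn(8, 3, 224, 224)                   # dummy batch of input images
targets = torch.rand(8, num_joints, 2)                 # normalized ground-truth joint positions

preds = backbone(images).view(8, num_joints, 2)
loss = nn.functional.mse_loss(preds, targets)          # mean squared error on joint positions
loss.backward()
```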

This method did not stop here; to further improve performance, it also used what are known as cascaded regressors to improve the precision of the location of each joint. What did this mean? Once you identify a joint position in your initial stage, you crop out a region around that joint position, and that region is scaled up to the input size of the CNN.

This again predicts a refined value of the joint position inside that patch, which helps you fine-tune the joint position over multiple stages. These cropped images are fed into the network in the next stages, and this ensures that across the finer image scales you get precise keypoint locations as output.

(Refer Slide Time: 8:58)

Another approach, which took this in a different direction, was also an iterative error feedback based method, proposed in 2016, where the entire prediction starts with a mean pose skeleton that is then updated iteratively over many steps. So you have an input image, and you have an average human pose skeleton, and the job of the neural network is only to predict the deviation from this mean pose skeleton. Given the image concatenated with its output representation, the neural network is trained to predict the correction that brings the mean pose closer to the ground truth.

Let us see this in a bit more detail. In the first stage, $x_0$ would be the image itself, that is, the image given here. This image is given to the neural network, and the neural network predicts an $\epsilon_t$ for each joint, which is the deviation from the mean pose. This $\epsilon_t$ is added to $y_t$, which was the pose in the previous step, and you get a new pose $y_{t+1}$.

Now this $y_{t+1}$ is overlaid on the image, and the patch around it is used to again refine the deviation and get a new $\epsilon_t$; this is iterated over and over again to get a final estimate of the pose. Visually, here are a couple of illustrations. You can see that on this image the mean pose is overlaid, and the mean pose is often a person standing upright.

In every step, the mean pose is adjusted towards the pose of this particular person, and you can see that after four steps the predicted pose becomes close to the ground truth, which is shown in the last column. You see a couple more examples here where this can be challenging: you can see, once again, the standing pose, and over a few steps you get the pose of the person squatting at a particular location. A rough sketch of this iterative update is given below.
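Here is a tiny sketch of that iterative error feedback loop, just to show the update $y_{t+1} = y_t + \epsilon_t$. The helpers `model`, `render_pose`, `mean_pose`, `image`, and the number of steps are hypothetical placeholders, not the paper's actual interfaces.

```python
import torch

def iterative_error_feedback(model, image, mean_pose, render_pose, num_steps=4):
    # Start from the average human pose skeleton and refine it step by step.
    y = mean_pose.clone()
    for _ in range(num_steps):
        # Concatenate the image with a rendering of the current pose estimate
        # along the channel dimension, as the network's input representation.
        inp = torch.cat([image, render_pose(y)], dim=1)
        eps = model(inp)          # predicted correction (deviation) for each joint
        y = y + eps               # y_{t+1} = y_t + eps_t
    return y
```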

(Refer Slide Time: 11:31)

While the regression based methods we saw on the earlier slides try to predict the joint locations directly, another family of methods for human pose estimation is detection based, where, in a single shot, you try to get the regions of each of these joints as a heat map. Methods that have used this kind of approach, one of which was proposed in 2015, also try to employ a coarse-to-fine multi-scale strategy to help refine these heat maps and get better joint localizations.

Let us look at this approach here. An input image is given to a coarse heat-map model, which gives you the heat map of joint locations; this could be a standard CNN backbone with minor modifications. For each predicted joint, which could be located at the center of a peak in the heat map, you crop out a patch around it and then have a fine heat-map model, which improves the localization of the joint in that particular region.

Let us now see this fine heat-map model in a bit more detail. For each joint location in the coarse heat map, which could be the center of a particular region of the heat map, you build a multi-scale pyramid: if the original image was 64 × 64, you also have 128 × 128 and 256 × 256 versions of it, and you correspondingly crop out a 9 × 9, an 18 × 18, and a 36 × 36 region.

Each of these three regions now goes through a separate convolutional pipeline. You can see here that the 9 × 9 region goes through this pipeline, the 18 × 18 goes through two different pipelines with different convolutional filters, and the 36 × 36 goes through a different pipeline, with different convolution plus ReLU layers. All of these are then upsampled to 36 × 36; obviously, the top one does not need upsampling.

The remaining ones are upsampled to 36 × 36, and all these feature maps are concatenated to make the final prediction, which is more precise for that particular joint. This refinement process helps the final performance.

(Refer Slide Time: 14:00)

Another category of methods, as we mentioned, is for when there are multiple people whose poses you would like to estimate. Here, there are broadly two kinds of pipelines; the underlying methods are very similar to what we saw earlier, but the pipelines differ. One of them is known as the top-down pipeline. In this case, we would like to estimate the poses of all persons in the given image, so we first start by detecting all people in the image.

Then, for each bounding box of the people detected, you run a single-person pose estimation approach like the ones we just saw on the previous slides. You could further refine the pose estimates using some global context information if you like. Here is an illustration: you have an input image, and two people detected using a human detector or any other detection approach. You crop out these two people and run a single-person pose estimator, using a regression or a detection based approach, to get the skeleton, and similarly for the other person; you then overlay both of these skeletons on the input image.

(Refer Slide Time: 15:22)

On the other hand, you can also have a bottom-up pipeline, where you reverse the process: initially, you detect all the keypoints in the image, irrespective of whom they belong to. You could use a detection based approach, where you get a heat map for the full image, and the centers of all the heat-map regions could be different keypoints; you do not know which keypoint belongs to which person.

Once these keypoints are detected, you associate them to human instances using different methods. For more details of these methods, you can see the survey on deep learning based 2D human pose estimation. Evidently, you can see that in this approach inference is likely to be much faster, because you are processing all people's information at the same time, rather than running each person's bounding box through a separate pipeline.

(Refer Slide Time: 16:21)

The other task we mentioned we will look at is crowd counting. Crowd counting is an extremely important task for urban planning, public safety, security, governance, and so forth. However, it can be a very challenging task in practice.

(Refer Slide Time: 16:43)

You could face several challenges, such as occlusion: as you can see here, in a single patch of this image there are many different faces, denoted by green dots, each of which is heavily occluded with respect to the others. You could have a very complex background, scale variations depending on the perspective from which the camera took the picture, a non-uniform distribution of people across the image, perspective distortions, rotation issues, illumination variations (such as at a show where you may want to count the number of people), or you could be faced with weather changes. All of these make the problem of crowd counting extremely hard.

(Refer Slide Time: 17:40)

Existing methods using CNNs for crowd counting can be categorized into three kinds. The first uses a basic CNN architecture to achieve the purpose, as we will see soon: you have an input, a simple CNN, and you get a density map as the output of the model itself.

The peaks in the heat map, or the density map, can give you an estimate of the count of people in a given picture. Another approach is a multi-column approach, which hypothesizes that to count crowds in which people's faces appear at different scales (a big face, a small face, and so on), you need a multi-column architecture where each column looks at faces at a different scale.

You can see here that, given an input, you convert it into a multi-scale pyramid kind of approach, where each individual pipeline tries to detect faces of a certain scale. A third approach is again a single-column approach, which observes the performance of multi-column approaches and simplifies them to an extent, to ensure that the network architecture is not too complex. Let us see each of these approaches in more detail.

(Refer Slide Time: 19:09)

The basic CNN approach looks at the crowd counting problem as a regression problem: given an image, you have to predict a number as the output. Here are some training data points: a set of positive crowd examples with the count in each of these images, and similarly negative examples of other scenes where the crowd count is 0. Each training sample is fed through a CNN architecture, and the output is the count of people in that image, which can be solved directly through regression with, say, an L2 loss or a smooth L1 loss.

You could also have an expanded set of negative samples here, because it is perhaps easier to get images without people whose ground truth count is zero, and this helps reduce interference and get better performance on the positive samples. However, you can make out that this is a crude approach, and it can be sensitive to the density and distribution of the crowd, the scale of people, and so forth.

(Refer Slide Time: 20:26)

That leads us to multi-column based approaches, which try to address the fact that people's faces can appear at different scales in an image. In such a multi-column CNN, each individual pipeline looks at the input image at a different scale. You can see here a conv 9 × 9 filter, a conv 7 × 7 filter, and a conv 5 × 5 filter, which look at multiple scales in the input, and this helps get better performance towards the end: after the last layers of each of these pipelines, the feature maps are merged, and then a conv 1 × 1 layer gives a density map at the same resolution as the input.

Here are a few illustrations: given a test image and the corresponding ground truth, you can see the estimated heat map, whose peaks you can use to get an estimate of the count if required. In certain applications, a density map by itself may serve the purpose, but if a count is required, one could infer the count from these heat maps.
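Here is a simplified sketch of such a multi-column density estimator: three columns with different receptive fields whose feature maps are concatenated and fused by a 1 × 1 convolution. The channel counts and depths are illustrative assumptions, not the exact published architecture.

```python
import torch
import torch.nn as nn

def column(kernel, channels):
    # One column of the multi-column CNN, with a fixed receptive-field size.
    return nn.Sequential(
        nn.Conv2d(3, channels, kernel, padding=kernel // 2), nn.ReLU(),
        nn.Conv2d(channels, channels, kernel, padding=kernel // 2), nn.ReLU(),
    )

class MultiColumnCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.col_large = column(9, 8)    # large faces
        self.col_mid = column(7, 10)     # medium faces
        self.col_small = column(5, 12)   # small faces
        self.fuse = nn.Conv2d(8 + 10 + 12, 1, kernel_size=1)  # 1x1 fusion into a density map

    def forward(self, x):
        feats = torch.cat([self.col_large(x), self.col_mid(x), self.col_small(x)], dim=1)
        return self.fuse(feats)

model = MultiColumnCNN()
density = model(torch.randn(1, 3, 128, 128))
print(density.shape, density.sum().item())   # summing the density map gives a crude count
```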

(Refer Slide Time: 21:45)

A further multi-column CNN approach, from 2017, expanded on the approach on the previous slide and introduced the concept of a switch classifier, where each patch of an input image is given to a switch layer that decides which of these resolutions is the right scale for that particular patch.

The switch classifier, which you can see in the rightmost region here, takes the patch of the image and then outputs which of the three scales, which you had as individual columns in the multi-column CNN, is appropriate. It may say that for this particular patch, R3 is the right scale at which to detect faces; that patch is then given to the corresponding CNN, and the density map at that scale is used to get the final outcome. Why does this make sense?

This comes from the fact that, within a local region of an image, the scales of faces are likely to stay within the same range, whereas in another patch of the image the scale could be very different. So this method uses that locality of scale in crowds to switch to the corresponding CNN pipeline for each region and thereby achieve good performance. At the end, it uses weighted averaging to fuse the features, which can be used globally.

(Refer Slide Time: 23:21)

Single-column CNNs are derived from multi-column CNNs by making some observations about their performance. One of the first efforts here, in 2018, observed that a single column of a multi-column CNN for crowd counting retained about 70 percent of the accuracy on certain datasets, so why make the architecture complex? This single-column CNN uses a standard set of convolutional layers initially, and then passes feature maps from earlier layers on to later layers, which are obtained after deconvolution.

You can see here that after conv 6, the feature map from the previous layer is concatenated, and then you do a deconvolution, which is the equivalent of upsampling here, to get a higher resolution map. Then a feature map from an earlier layer is added to this upsampled map to get a new feature map; these are passed through a certain set of convolutions and, finally, a 1 × 1 convolution to get the final density map.

In this particular case, deconvolution was used instead of upsampling or element-wise summation. Also, this work used both a density-map based loss and a count based loss to train the neural network. All these approaches assume that both the ground truth density map and the head count are given.

So you can use the losses corresponding to both of these to backpropagate and train the rest of the neural network. What loss do you use? An L2 loss on the density map, and for the count, you can simply use an L1 loss, which is the absolute value of the difference between your predicted count and the correct count.
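A tiny sketch of this combined objective is shown below, assuming the model outputs a density map and that the count is obtained by summing that map; the loss weighting is an illustrative hyperparameter.

```python
import torch
import torch.nn.functional as F

# Stand-ins for a model's predicted density map and the ground truth.
pred_density = torch.rand(4, 1, 96, 96, requires_grad=True)
gt_density = torch.rand(4, 1, 96, 96)
gt_count = gt_density.sum(dim=(1, 2, 3))                  # head count = sum of the density map

density_loss = F.mse_loss(pred_density, gt_density)       # L2 loss on the density map
count_loss = F.l1_loss(pred_density.sum(dim=(1, 2, 3)), gt_count)  # L1 loss on the count
loss = density_loss + 0.1 * count_loss                    # 0.1 is an assumed weighting
loss.backward()
```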

(Refer Slide Time: 25:21)

Another single-column CNN, a more recent one from 2019, observed that the low-level features of the different columns in multi-column CNNs serve very similar functions in the crowd counting context. What they propose is to retain a single shared pipeline for the initial layers of the CNN and, in the later layers, have a multi-scale pyramid, known as a scale pyramid module, which combines features at different scales to get the final density map output.

In this particular case, the scale pyramid module was implemented using dilated convolutions at different scales, to be able to analyze faces at different scales, and it was placed between conv4_3 and conv5_1 of a VGG-16 architecture. You can see here a few examples showing, for a given image, the estimated count and the correct count, and you can see that in most of these cases the estimate is fairly close, within a certain error tolerance, to the ground truth.

(Refer Slide Time: 26:40)

The homework for this lecture is a very nice blog post on human pose estimation by Nanonets, as well as the survey on density estimation and crowd counting that was released in 2020. If you are interested, you can also read the survey on 2D human pose estimation.

The exercise, or rather a thought experiment, for this lecture: CNNs, especially when used for human understanding, can suffer from biases in datasets. Depending on which race dominates the people in a particular dataset for human pose estimation or face recognition, the decisions could be biased by those statistics of the dataset. Before we talk about addressing such biases, how do you first find out whether a model you trained for a human-centric task is biased? Think about it.

(Refer Slide Time: 27:39)

Here are some references to conclude.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 51
CNNs for Other Image Tasks

In the last lecture of this week, we will very briefly look at a few other applications, such as depth estimation, super resolution, and anomaly detection, and the ways in which CNNs can be used for these tasks.

(Refer Slide Time: 00:37)

Taking the application of depth estimation: given a scene such as the one in this image, there are objects at several depths from the perspective of the camera. You have the road, a pillar at one corner of the scene, the car itself, and the person standing on the pavement, each of which almost acts like a different layer at a different depth from the camera.

(Refer Slide Time: 01:09)

The question we would like to ask here is: can we get estimates of depth using just single images? Existing methods for depth estimation, including the human visual system, rely on stereo estimates where there are two cameras. But can you get depth from just a single 2D image? There have been efforts to do this, and we will look at a couple of examples briefly in this lecture.

(Refer Slide Time: 01:41)

One of the earliest efforts in using deep CNNs for depth estimation was presented at NeurIPS 2014. Given an input, a coarse-level CNN is used to forward propagate the image and, at the end of the last layer, get an estimate of the depth for each pixel in the image. Very similar to the approaches for pose estimation and similar applications that we have seen, this method also uses a refinement step to improve the precision of the depth estimate. In this case, the same image, with a different filter at a different resolution, is passed to a fine depth estimation network. In addition, the coarse estimate of the other network is also passed in and concatenated as an input at an intermediate layer. Together, these two are combined to give a refined depth estimate, which is more accurate.

(Refer Slide Time: 03:00)

Here are examples of this kind of an approach, where given an input, the output of the coarse
network looks something like this. And after refinement, it starts becoming more usable.

(Refer Slide Time: 3:13)

Here is another example: given an input image, here is the depth estimate from the coarse network, and here is the depth estimate obtained from the final network. The assumption here is that you have a pixel-wise estimate of the depth provided as ground truth, so you can use a simple pixel-wise L2 error, or mean squared error, as the loss function to train these networks.
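A minimal sketch of that training setup is below, using a tiny fully convolutional network as a stand-in for the coarse depth network; the network, shapes, and dummy data are assumptions for illustration only, and the paper's exact loss and architecture differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in for a depth network: predicts one depth value per pixel.
coarse_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

image = torch.randn(2, 3, 64, 64)       # dummy RGB inputs
gt_depth = torch.rand(2, 1, 64, 64)     # dummy per-pixel ground-truth depth

loss = F.mse_loss(coarse_net(image), gt_depth)   # pixel-wise L2 / mean squared error
loss.backward()
```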

(Refer Slide Time: 3:39)

A more contemporary and very popularly used approach for depth estimation is called GeoNet. We will only visit it briefly, since it brings in concepts from vision that we have not covered in detail in this course. This network performs depth estimation, camera motion estimation, and pose estimation in a combined framework, but let us focus on the depth estimation part here.

It assumes that depth can be estimated when you have a sequence of images rather than a single image, which is likely if you are using this for an application such as autonomous driving: you have a camera capturing as the car drives, and you get a sequence of frames, which can be given as input as a volume. Just like how we gave RGB channels as inputs to a CNN so far, this network takes the frames captured over an entire image sequence over time as different channels of an input volume.

Then a depth network estimates depth on each of these channels, and interactions between the depth maps of these temporal frames give an estimate of what is known as rigid flow, which is the flow of stationary objects in the scene. This is then combined with a non-rigid motion localizer which uses optical flow; if you recall, we talked about optical flow in the initial lectures of this course. So it uses an optical flow based approach for the non-rigid motion localizer, which handles dynamic content in the scene. Finally, combining the rigid flow and the non-rigid motion localizer, the final flow and the depth are predicted for the entire scene. At the end, a consistency check is done to verify whether the flow predictions take care of occlusions and even non-Lambertian surfaces. For more details, you can see this paper, which was published in CVPR 2018.

(Refer Slide Time: 06:02)

Another task that is increasingly popular with deep neural networks is super resolution. Can you see a difference between these two images? There is no semantic difference, but there is a perceptual difference in terms of the resolution of the right image over the left image. This is used extensively to improve the resolution of content.

(Refer Slide Time: 06:31)

Today, televisions have already started enhancing the resolution of the image presented on the screen using such approaches. Here is an example of an LG screen which, in its settings, allows you to choose an AI-enhanced resolution up-scaling option; it in fact performs super resolution, using CNN based approaches to super-resolve images and get a better, ultra HD resolution on the screen. So how do you do super resolution? There are multiple ways in which this can be done.

(Refer Slide Time: 7:12)

We will talk about a CNN based approach now. Later in this course, we will also talk about generative approaches, or what are known as GAN based approaches, to super-resolve image content. Let us see how this is done using CNNs. You have a seemingly low resolution image that is provided as input to a set of convolutional layers, which use the idea that a patch around a pixel can be used to improve the resolution of the output image around that pixel.

Let us see this in more detail. You have a low resolution image input Y, which is passed through a first layer to get $n_1$ feature maps of the low resolution image; that is, $F_1(Y) = \max(0, W_1 * Y + B_1)$, where Y is assumed to be the input. Using that $F_1(Y)$, you forward propagate to the $n_2$ feature maps of the next layer, which gives you $F_2(Y)$, a ReLU over a linear layer applied to the $F_1(Y)$ feature maps. Finally, these $F_2(Y)$'s are passed through one more linear layer, over another set of weights, to get the final reconstruction of the high resolution image. Assuming that you have data with high resolution versions of low resolution images, you can use a pixel-wise loss to train this entire CNN. How does this work?
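Here is a minimal sketch of such a three-layer super resolution CNN in PyTorch, following the structure above. The filter sizes and numbers of feature maps (9-1-5, with 64 and 32 maps) and the single-channel input are common choices assumed here for illustration.

```python
import torch
import torch.nn as nn

srcnn = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=9, padding=4), nn.ReLU(),   # F1: n1 = 64 feature maps
    nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),             # F2: n2 = 32 feature maps
    nn.Conv2d(32, 1, kernel_size=5, padding=2),              # reconstruction of the output
)

low_res = torch.rand(1, 1, 33, 33)       # an upscaled low-resolution patch (dummy data)
high_res = torch.rand(1, 1, 33, 33)      # the corresponding high-resolution ground truth

loss = nn.functional.mse_loss(srcnn(low_res), high_res)   # pixel-wise loss
loss.backward()
```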

(Refer Slide Time: 8:45)

Given this input image of a butterfly wing, the first convolutional layer produces feature maps such as these; some of them pick up edges, slightly varying textures, and so on. These feature maps are passed on to the next convolutional layer, whose feature maps may look something like this. Finally, all of this information is put together to get the final output.

(Refer Slide Time: 9:20)

So you can see that, given this input to the super resolution CNN, you get an output where the resolution and the contrast look far better from a perceptual point of view.

(Refer Slide Time: 9:35)

A third task that we will briefly describe, to close the lectures of this week, is anomaly detection. It is needed in many applications today where, given a set of inputs or data points, one may need to find out which of them is an outlier, or anomalous, with respect to the distribution we are handling.

Generally speaking, given, say, keywords such as table, chair, computer, cupboard, bed; or Faraday, Newton, Edison, Beethoven; or pen, calculator, pencil, ink; you are looking to find the odd one out: computer in the first case, Beethoven in the second case, and calculator in the third case. But how do we do this with images using a CNN is what we talk about now.

(Refer Slide Time: 10:30)

Our goal now is to find an out-of-distribution image with respect to a given training dataset. An image comes in that is not from the training distribution; how do you distinguish it from what you have in the dataset? Let us consider a known CNN architecture, and let us assume this neural network is trained on, say, cats and dogs.

(Refer Slide Time: 10:53)

Now, at test time, you get an image that does not belong to the classes the network was trained on. We call such an image an out-of-distribution image. If you simply try to classify it as a cat or a dog, because that is the label space you have in the neural network, this is not really going to give you a useful answer, and trying to assign probabilities to cat and dog is flawed by design. So what do we do?

(Refer Slide Time: 11:28)

We are going to look at the softmax activation function that you use in the last layer. A recent work that proposed an interesting approach for detecting such anomalies, or out-of-distribution images, suggested that you could use the notion of what is known as temperature, which is the newly added component in the softmax activation function.

What does temperature do? T is a constant that you provide as input to the softmax activation function. As you can see here, T scales down the inputs that are provided to the exponential function. How does this help? By scaling things down, the exponential function is now able to better separate the values you get for different classes. This idea of using temperature in softmax is used in many other kinds of applications, and we will see some of them a little later in this course.
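Just to make the temperature term concrete, here is a small sketch of a temperature-scaled softmax; the logit values and temperatures are arbitrary illustrative numbers.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])          # raw class scores from the last layer
for T in (1.0, 10.0, 1000.0):
    probs = F.softmax(logits / T, dim=0)        # divide the inputs to exp() by T
    print(f"T = {T}: {probs.tolist()}")
```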

(Refer Slide Time: 12:39)

In addition to temperature, there is another contribution this approach, published in 2018, makes. Assuming the model predicted dog as the output for a given image, you do not know whether the image was really of a dog or an out-of-distribution image. You now consider the gradient of the loss function with respect to this input.

Consider the sign of the negative gradient, and now add an epsilon perturbation to the input, adding or subtracting based on that sign. Why do we do this? We expect that if this was really a dog image, adding this perturbation would help us recognize even better that it is a dog; that is, the softmax outputs would make it even clearer that it is a dog, since the perturbation is computed from the gradient of the log softmax and its sign.

On the other hand, if it was an out-of-distribution image and you added this perturbation, the softmax outputs would get a bit more confused. This simple intuition is used to construct a slightly perturbed image, which you provide as input to the neural network again; you then look at its output and check whether it is greater than a threshold.

If it is, you say it is not an anomaly; it is perhaps an image of a dog after all. But if the output is less than the threshold, because that is when the confusion in the softmax increases, it is an anomaly. This is a simple approach that does not change much in the pipeline of the neural network or the training architecture, but it works well in practice to detect an out-of-distribution image, or anomaly.
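A rough sketch of this temperature-plus-perturbation test is given below. The temperature, epsilon, the tiny stand-in model, and the function name are all assumptions for illustration; the published method tunes these values and works with a trained network.

```python
import torch
import torch.nn.functional as F

def ood_score(model, x, T=1000.0, eps=0.0014):
    x = x.clone().requires_grad_(True)
    # negative log of the maximum temperature-scaled softmax probability
    loss = -F.log_softmax(model(x) / T, dim=1).max(dim=1).values.sum()
    loss.backward()
    x_perturbed = x - eps * x.grad.sign()          # nudge the input using the gradient's sign
    with torch.no_grad():
        score = F.softmax(model(x_perturbed) / T, dim=1).max(dim=1).values
    return score                                    # low score (below a threshold) -> anomaly

# Tiny stand-in classifier, just to make the sketch runnable.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
print(ood_score(model, torch.rand(1, 3, 32, 32)))
```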

(Refer Slide Time: 14:37)

For more details, please read these links for depth estimation, super resolution or anomaly
detection.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 52
Recurrent Neural Networks: Introduction

(Refer Slide Time: 00:16)

So far, we have seen how CNNs, Convolutional Neural Networks, work: how backpropagation is used to train them, how they can be used for image classification, what loss functions we use, and how you visualize and understand CNNs. We have also seen how CNNs can be extended to other tasks such as detection, segmentation, face recognition, as well as miscellaneous other tasks. This week, we will move on to another pillar of the space of deep learning, which is recurrent neural networks.

Recurrent neural networks are intended to process time sequences, which means that, in terms of applying them to computer vision, they become relevant when you have videos as input.

(Refer Slide Time: 01:17)

Before we get started on the topic, we had one question that we left behind from last week, when we completed our lectures on CNNs for human understanding. The question, an important one in today's times, was: how do you find bias in a model?

If you had a face recognition model, how do you know whether the model is biased towards a certain segment of the population in the dataset? Hopefully you had a chance to think about it. This is an important problem in today's times, when such technologies are being deployed at a fairly large scale. A simple check: if you think there could be a particular feature or attribute, such as, say, the background behind a particular person, that the model may rely on, you could change that attribute in your input and see if the prediction changes.

If this happens, the model is relying on that feature to make its prediction. And if you think that
should not exist, you should try to regularize based on that attribute, use data augmentation
methods to vary that attribute’s value and ensure the model does not change its prediction.

(Refer Slide Time: 02:54)

We will now move on to recurrent neural networks. This lecture’s slides are based on the CS231n
course by Fei-Fei Li and the course from IIT Madras by Mitesh Khapra.

(Refer Slide Time: 03:08)

The broader context of recurrent neural networks is sequence learning problems. So far, with
both feed forward and convolutional neural networks, the size of the input was always fixed.
Given a CNN, you always fed say a 32 × 32 image for image classification. Each input was
also independent of the previous or future inputs and the computations, outputs, decisions for
two successive images were completely independent of each other.

What do we mean? This also relates to the common IID property that we associate with datasets in machine learning. IID stands for Independent and Identically Distributed: if we have a dataset of images, we assume that each image in the dataset is independent and identically distributed, which means that irrespective of the order in which you provide these images, your model is likely to learn the same prediction function at the end.

(Refer Slide Time: 04:41)

So irrespective of the input being say, this image or this image or this image, obviously the class
label changes, but nothing else changes in the model parameters or the architecture.

(Refer Slide Time: 04:56)

However, there is another category of problems, known as sequence learning problems, which requires something more than what we have seen so far. Take this example of text auto-completion, something that you perhaps encounter on a daily basis: if you typed "Google autocomplete a", you get a certain set of recommendations.

If you typed "Google autocomplete b", you get a certain set of recommendations, and similarly for c. In such problems, successive inputs are no longer independent. To make a prediction at a particular step, you do need to look at the inputs at previous time steps. This is something we did not need so far, when we talked about feedforward networks or CNNs. Also, in this kind of problem, the length of the inputs and the number of predictions you need to make may not be fixed.

An example here: if you type "Google autocomplete c", the last word could be "code" or could be "country"; these have different lengths, and you want your model to be independent of such constraints. Another important takeaway in such problems is that, irrespective of what you have as the input, whether you had two words before the c, or three words, or one word, the model has to perform the same task across all of these contexts.

Given a previous phrase, it has to predict the next set of characters; that task does not change irrespective of the time step at which you are trying to make the prediction. If the user typed one word, you still have to predict the next set of characters. If the user typed two words, your temporal context increases in length, but what the task has to achieve remains the same. These kinds of problems are known as sequence learning problems.

(Refer Slide Time: 07:24)

The neural network architecture designed to handle such sequence learning problems is known as the recurrent neural network, and we will see soon why it is called so. If you visualize a standard feedforward, vanilla neural network as an input, a set of hidden layers, and an output, then the variants of recurrent neural networks can be visualized as follows. You could have a one-to-many variant, where your input is at one time step but your output is a sequence of outputs over time steps.

What is an example? Image captioning, where your input is a single image, and your output is a
sequence of words where one word has to follow the previous word to make a complete
sentence. You call this one to many where you have one input and many outputs. Similarly, you
could have a many to one architecture, where you have inputs at multiple time steps, but you
make a prediction at one particular time step.

What is an example here? Action prediction, where you have a sequence of video frames as input, and the model has to give one particular class after seeing the full input sequence. For example, after seeing a complete video snippet of, say, a football game, this output should predict whether that snippet of the video was a goal-scoring event or not. There is also the scenario of a many-to-many setting, where, given inputs at multiple time steps, the model also has to produce multiple outputs over time steps.

This is called many-to-many, and an example would be video captioning, where the input is a video, which is a set of frames over time, and the output is a set of words over time. You could also have a slightly varied many-to-many setting, where you have a set of inputs over time and a set of outputs over time, but the outputs and the inputs are synchronized at each time step.

An example here could be video classification at frame level. So you have a video coming in as
input to your model which means, remember video has a temporal component to it. But at the
same time, you want to make predictions at the level of every frame. That would be a many to
many setting in this particular context.

(Refer Slide Time: 10:30)

So how do we really model such tasks involving sequences? We saw the high-level architectures, but how do you actually learn models with these architectures? How do we account for the dependence between inputs, which is common in time series data? How do you account for a variable number of inputs, or a variable number of outputs? If you take image captioning, given one image you could have a caption with 7 words, and given another image you could have a caption with 10 words; how do you allow a variable number of outputs, or similarly a variable number of inputs for another application?

A third consideration is: how do you ensure that the function executed at each time step is the same? Why do we need this? Irrespective of where a frame occurs in a video, you want the same logic to be used to make your prediction. This is similar to what we said about convolution, where, irrespective of whether a cat was on the top left or the bottom right, we want to say that the cat exists in the image.

We overcame that problem through the idea of convolution. But how do we do that now over
time series, and sequence learning problems is what we are going to talk about over the next few
slides.

(Refer Slide Time: 12:10)

So the high level abstraction of a recurrent neural network is this figure here. You have an input
x and output y. And in between, you have what is known as a recurrent neural network. As the
word says, there is a recurrence here, which is now represented by this self-loop.

The key idea of RNNs is that they have an internal state, which is updated as a sequence is
processed. One could view this internal state as some kind of a memory that retains some
information from past data of the same time series. And this idea is something that we will talk
about later this week also.

(Refer Slide Time: 13:02)

If one unfolded the recurrence, you would get a network such as this, where you have a sequence of inputs $x_1$ to $x_t$ over time; that is why we are looking at this as a sequence learning problem. Each input is given to an RNN block. This RNN block could be a single hidden layer, could be multiple hidden layers, could have skip connections, or could have any other architecture inside it, but we are going to look at it as one single abstraction.

And the output of that RNN is the output of the model at that time step. This would be the entire
unfolded RNN. And when you work with RNNs, you have to decide beforehand how much you
want to unroll an RNN. What do we mean? Given a particular problem, you have to decide what
this small t must be. Do you want to look at the past 20 frames to make a decision? Do you want
to look at the past 100 frames to make a decision or do you want to look at the past 200 frames to
make a decision.

This is something that has to be decided before you start learning an RNN. This t would correspond to the length of the sequence you want to look at while making the prediction. Once t is decided, you can unroll the RNN this way, where the same operations inside the RNN block are used for the input at every time step.

As we said earlier, depending on what application you are using RNNs for, you could have an
output at each time step, or you could have an output only at the last time step or you could also

have a sequence of outputs that follows the input at the last time step. All those possibilities are
variants of this abstraction that we see on the screen.

(Refer Slide Time: 15:19)

Let us go into some more details now. We said it is a recurrent neural network, and we also said that we want this RNN block to repeat the same operation on the inputs at every time step. So what is that recurrence operation? The recurrence formula at every time step is: given an input $x_t$ at a particular time step and $h_{t-1}$, the state of the RNN at the previous time step, you apply some function, parameterized by the weights U and W inside the RNN, and you get an output $h_t$ at that time step, which is passed on to the next stage of the RNN, which processes the input at the next time step.

(Refer Slide Time: 16:19)

Here is a visualization. At time step $t_1$, given an input $x_1$ and an initial state $h_0$, the RNN processes these two inputs, $h_0$ and $x_1$, using weight matrices, say U and W, and outputs a $y_1$, if an output is required at that time step, and a state $h_1$, which is passed on to the next time step of the RNN.

At the next time step, the RNN receives $x_2$ and $h_1$ as input and uses the same weights U and W as the previous time step. Remember, we said we want the RNN to treat every frame as a similar kind of input and perform similar operations, although the frames are different and differently placed in time. So the RNN at the second time step outputs $y_2$, if an output is required at that stage depending on the problem, and $h_2$, which is its hidden state and is passed on to the RNN at the third time step. This is continued all the way till the last time step.

(Refer Slide Time: 17:40)

So the recurrence formula, just to repeat, is $h_t = f_{UW}(x_t, h_{t-1})$, where U and W are those weight matrices. Given input $x_t$ and $h_{t-1}$, the key observation here is that the same function and the same set of parameters are used at every time step. That does not change, and that is where the recurrence actually lies in a recurrent neural network.

(Refer Slide Time: 18:10)

You could now think of using activation functions depending on the task that you have. The recurrence relation that we just wrote would then be $h_t = \tanh(U x_t + W h_{t-1})$, and $y_t$ could be a softmax of $V h_t$. So if you have a state $h_t$ that comes out of the RNN, you have a set of weights V that multiplies it to give $V h_t$.

If you are dealing with a classification problem, you apply a softmax on that $V h_t$ to give the final output of the RNN. The weights going from x into the RNN are U, and the weights that take $h_{t-1}$ to $h_t$ are W. So you can see that there are three matrices of weights; of course, W could change based on how many layers you have inside the RNN block.

Otherwise, you have three sets of weights here, which operate on the input, on the hidden state, and from the hidden state to the output. This is sometimes also called an Elman RNN, after Professor Jeffrey Elman, who was responsible for initiating these ideas.
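Here is a bare-bones sketch of that recurrence, $h_t = \tanh(U x_t + W h_{t-1})$ and $y_t = \mathrm{softmax}(V h_t)$, unrolled over a short sequence with the same weights reused at every step. The dimensions and random weights are illustrative assumptions.

```python
import torch

input_size, hidden_size, output_size = 10, 20, 5
U = 0.1 * torch.randn(hidden_size, input_size)    # input  -> hidden
W = 0.1 * torch.randn(hidden_size, hidden_size)   # hidden -> hidden
V = 0.1 * torch.randn(output_size, hidden_size)   # hidden -> output

def rnn_step(x_t, h_prev):
    h_t = torch.tanh(U @ x_t + W @ h_prev)         # h_t = tanh(U x_t + W h_{t-1})
    y_t = torch.softmax(V @ h_t, dim=0)            # y_t = softmax(V h_t)
    return y_t, h_t

h = torch.zeros(hidden_size)                       # h_0
for x_t in torch.randn(4, input_size):             # a sequence of length 4
    y, h = rnn_step(x_t, h)                        # same U, W, V at every time step
print(y)
```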

(Refer Slide Time: 19:56)

To go forward and understand how the RNN is actually trained, we will take one step back and look at the notion of what are known as computational graphs. Computational graphs are a very common way of implementing neural networks, both for inference and for backpropagation, including in frameworks such as PyTorch or TensorFlow. So we will review computational graphs in general first, and then talk about how computational graphs are defined for RNNs. A computational graph is a directed graph whose nodes correspond to either operations or variables.
1118
The values that are fed into the nodes and come out of the nodes are tensors. Recall that tensors are, in general, multi-dimensional arrays: the two-dimensional generalization of a vector is a matrix, and the generalization of a matrix to higher dimensions is generally called a tensor. The term tensor subsumes scalars, vectors, and matrices; it is the more general term for these quantities. As I just mentioned, the computational graph can be instantiated to do both a forward pass through a neural network and a backward pass through it.

(Refer Slide Time: 21:35)

Let us take a simple example before we talk about neural networks: how we can use computational graphs to build expressions and compute them. Suppose you had the arithmetic expression $e = (a + b) * (b + 1)$. There are three operations here: two additions and one multiplication. One could write this entire expression as a sequence of intermediate operations. What are those intermediate operations? You can first say $c = a + b$ and $d = b + 1$, and then $e = c * d$. So the first term here is c and the second term here is d.

Then you finally multiply the two to get e. This seems trivial, so why are we talking about it? Because one could now write this as a computational graph: you have values a and b given as input; you compute c, given by a + b; then you compute d, which depends only on b and becomes b + 1; and then you combine c and d to get your final output e, which is c * d.

(Refer Slide Time: 23:12)

How do you evaluate these expressions? Let us say you had specific values of a and b given to
you. How do you evaluate this expression using a computational graph? You would set the input
variable to certain values that are given to you. For example, a is a particular value, and b is a
particular value. And then you would compute the nodes up the graph from those variable
instance values all the way till the root node to get your final answer.

So it would look something like this, you set a = 2, you set b = 1, you first compute c = a + b
which means you get c = 3. Similarly, d = 2. And then you combine them and get your final
answer for e which is, e = 6.

(Refer Slide Time: 24:03)

That was nice and simple, but how do you use such a computational graph for the backward pass, if you want to do backpropagation? In other words, how do you use a computational graph to compute gradients or derivatives?
understand the edges of a computational graph, because the edges denote how a particular
variable affects another variable. For example, if you had the same computational graph, you
have a at this leaf node here and you have c = a + b at a parent of that leaf node.

Now one could write ∂𝑐/∂𝑎, as you have c = a + b. Hence, ∂𝑐/∂𝑎 will be equal to 1. Similarly,
on this edge connecting b and c, ∂𝑐/∂𝑏 will also be 1, because c = a + b, and so on and so forth
for every set of edges in this graph. So how are we doing this, we are applying the standard sum
rule and product rule appropriately to the gradients to compute each of those gradients at the
edges. And to get your final gradient, you could sum over all possible paths from one node to
another, multiplying the derivatives on each edge of the path together.

So, for example, if we want $\partial e/\partial a$, we look at all possible paths that take us from a to e. In this particular case there is just this one path; you multiply all the gradients on that path, which here gives 1 × 2, and that is $\partial e/\partial a$. Similarly, if you had to compute $\partial e/\partial b$, you would have to consider all possible paths that take you from b to e, multiply the gradients along each of these paths, add up the contributions of both paths, and you get your final gradient.

(Refer Slide Time: 26:21)

So once again, ∂𝑒/∂𝑏 in this case would be 1 × 2 + 1 × 3. That is how you would use a
computational graph to get your gradients in a particular expression.

(Refer Slide Time: 26:37)

To give you a more concrete example in a framework you may be using, PyTorch: in PyTorch, operations are tracked on the go during the forward pass, allowing the creation of such a computational graph, which is called a dynamic graph. When you call backward(), say loss.backward() when you actually train a neural network, or c.backward() if you are just evaluating an expression like the one we saw, the PyTorch framework triggers the computation of the gradients, and you can then print out the gradient of every variable in your expression computation.
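As a quick, concrete illustration with the same expression e = (a + b) * (b + 1) from before, here is how the dynamic graph in PyTorch computes those gradients when you call backward() on the output (the variable names follow the earlier example).

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

c = a + b          # c = 3
d = b + 1          # d = 2
e = c * d          # e = 6

e.backward()       # traverse the dynamic graph backwards
print(a.grad)      # de/da = d = 2
print(b.grad)      # de/db = d + c = 2 + 3 = 5  (sum over both paths from b to e)
```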

(Refer Slide Time: 27:36)

What about computational graphs for feedforward neural networks? Suppose you had a couple of layers in a neural network: given an input x, let us say your hidden layer is h = tanh(Wx + b), where Wx + b is a linear combination of your inputs x, and you have a nonlinear tanh activation on each of those nodes, which gives you the vector h. You then apply another set of weights V on h to give your final output y, where b and a are your biases. If this is your mathematical representation of a feedforward neural network, the corresponding computational graph would look like this: your first inputs are W and x; based on them you get an intermediate output; you then combine that with + b and get another output.

Finally, you take that, apply tanh, and get tanh(u). So this first step can be represented as a computational graph: a sequence that takes W, x, and b as inputs and outputs h. Similarly, for the next operation, you now consider h as one input, and V and a as the other inputs, and the computational graph for this step would be that set of operations. Obviously, the input to one of the nodes in this computational graph is the h which came as the output of the previous computational graph.
(Refer Slide Time: 29:22)

Now, coming back to RNNs: since RNNs are slightly more complex, let us try to view them from the point of view of computational graphs. Given input $x_1$ and $h_0$, the input is given to $f_{UW}$, which is what we talked about earlier, and $f_{UW}$ outputs $h_1$. Then $h_1$ and $x_2$ go to $f_{UW}$ again, which is the same function evaluated at each time step, as we already mentioned. Then $f_{UW}$ at the second time step gives you $h_2$, which is again given recurrently to the next time step, and so on, until you get the final hidden state at the last time step of the sequence.

(Refer Slide Time: 30:13)

So, from a computational graph perspective, the inputs U and W, which are the weight matrices, are the same for every time step, and that is what is shown by these arrows that go into $f_{UW}$ at every time step.

(Refer Slide Time: 30:36)

If you had a many to many RNN, you would only slightly change this, because now you are
going to have outputs at every time step, which you did not have in the previous RNN that we
just saw.

(Refer Slide Time: 30:51)

You could now have an output $L_1$ that comes out of the many-to-many RNN, which is what is finally given as output to the user. You could also look at combining all of $L_1, L_2, L_3$, up to $L_t$, to give your final L as the output. Why does this matter? If you look at, say, video captioning, which would be an instance of such a many-to-many architecture, you could look at $L_1, L_2, L_3, \ldots, L_t$ as variables that hold each word of the caption, and L as the final caption that is given as output of the RNN.

(Refer Slide Time: 31:39)

If you considered a many-to-one architecture, the computational graph would look something like this; the initial part more or less remains the same, but the output y now appears only at the last time step. Is this complete? Not really: in terms of the computational graph, you also need the connections through which y receives $h_1$, $h_2$, $h_3$, and so on, as inputs to compute the final output value.

(Refer Slide Time: 32:14)

In the case of a one-to-many setting, you could have this kind of computational graph, where the input x is given only at the first time step, but you now have outputs at every time step, $y_1, y_2, y_3$ through $y_t$. One question here is: what would you give as inputs at the intermediate time steps? Why do we need to talk about this? Remember, $f_{UW}$ is the same function at every time step. We already know that at the first time step $f_{UW}$ needed two inputs, x and $h_0$, which means at every time step it needs two inputs. What do we do?

One option is to input zeros at every time step. Another option, which is also sometimes followed, is to give the output of the previous time step as the input at the next time step; by adjusting for dimensions, you can use the output at the previous time step as the input to the next. Why is this relevant? To generate the next word in a caption, you could provide the previous word of the caption to help the model generate the appropriate next word.

(Refer Slide Time: 33:44)

To give a more tangible example, let us consider an RNN model. Assume that the inputs and the outputs come from a vocabulary: h, e, l, o. Your input character, let us say, is h, which is denoted as 1 0 0 0. Such a representation is known as a one-hot vector: a vector which has a 1 at the position of the given input in the vocabulary and 0 elsewhere. Since h is the first character here, if your input character is h, the corresponding one-hot vector is 1 0 0 0.

Given that as input, assume that you have a hidden layer and an output layer, and that the output layer produces a certain set of outputs. Applying a softmax, you end up with a set of probabilities where the second one is the highest, which means you are going to predict the next character as e.

(Refer Slide Time: 34:55)

Now this character is given as input to the next time step; remember, we just said that $y_1$ is given as input to the next time step as $x_2$. So e, because of its position as the second index in your vocabulary, is given by the one-hot vector 0 1 0 0. This is then passed through the same network, and you now get a softmax output of 0.25, 0.2, 0.5, 0.05. The highest is the third one, which is an l, and that is the output of the RNN at that time step. That l is then again given as input to the next time step, denoted as 0 0 1 0 because of the position of l in the vocabulary.

Once again you do this process, and it again gives l as the output; that l is given to the next time step, and the output then is o. At some point, the RNN model can give an end token as the output, to indicate where the sequence of outputs ends. An end token would be a default character in the vocabulary, predicted when the RNN model decides to end the generation of that particular word.
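Below is a tiny sketch of this character-level generation loop over the vocabulary (h, e, l, o), just to show the one-hot encoding and how each prediction is fed back as the next input. The weights are random stand-ins, so the outputs will not actually spell "ello" as in the trained example.

```python
import torch

vocab = ['h', 'e', 'l', 'o']
U = 0.1 * torch.randn(8, 4)      # input (one-hot) -> hidden
W = 0.1 * torch.randn(8, 8)      # hidden -> hidden
V = 0.1 * torch.randn(4, 8)      # hidden -> scores over the vocabulary

def one_hot(ch):
    v = torch.zeros(4)
    v[vocab.index(ch)] = 1.0     # 1 at the character's position, 0 elsewhere
    return v

h = torch.zeros(8)
x = one_hot('h')                 # start with the character 'h' -> 1 0 0 0
for _ in range(4):
    h = torch.tanh(U @ x + W @ h)
    probs = torch.softmax(V @ h, dim=0)
    idx = int(probs.argmax())    # pick the most probable next character
    print(vocab[idx], [round(p, 2) for p in probs.tolist()])
    x = one_hot(vocab[idx])      # feed the prediction back as the next input
```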

(Refer Slide Time: 36:24)

As homework, your readings are Chapter 10 of the Deep Learning book and Andrej Karpathy's excellent blog post on RNNs; additionally, if you would like, you can go through the Stanford CS231n course or the IIT Madras CS7015 course. In the next lecture, we will talk about how these RNNs can be trained using backpropagation. Before that, a couple of questions for you. Can RNNs have more than one hidden layer?

We answered that question, but think about it to convince yourself that they can. The next question: the state $h_t$ of an RNN records information from all previous time steps. At each new time step, in a sense, the old information gets morphed slightly by the current input. You have an $h_{t-1}$ that comes in, and an $x_t$ that comes in, and the previous step's state is modified by the input that arrives at this time step.

What would happen if we morph the state too much? Rather, what would happen if you pay a lot of attention only to the current input and less to the previous time step's state? Think about it and we will discuss this soon.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 52
Backpropagation in RNNs

(Refer Slide Time: 00:22)

We will now move on to Backpropagation in RNNs. Before we go there, we left behind a couple of questions. Can RNNs have more than one hidden layer? We already answered this question: you can have as many hidden layers as you want in each RNN block. In fact, you could also stack RNN blocks if you like, one on top of the other. Say you have an input that goes to one RNN block, whose output goes to another RNN block, which is then given to the output at that particular time step.

And then similarly, this RNN block would go over time. And its outputs would go to the upper
RNN block. So in such an architecture, which is also known as a stacked RNN, the weights at
each level are all shared. So at this level, all the weights are the same across all the time steps.
And at level 2, all the weights are the same across all of the time steps. Such an architecture is
known as a stacked RNN.
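As a small hedged sketch of this stacked-RNN idea (PyTorch is my own choice of library here, not something the lecture prescribes), nn.RNN with num_layers=2 stacks two RNN levels, and within each level the same weights are applied at every time step:

import torch
import torch.nn as nn

# Two stacked RNN levels; weights are shared across time within each level.
rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True)

x = torch.randn(4, 7, 10)          # batch of 4 sequences, 7 time steps, 10 features each
out, h_n = rnn(x)                  # out: top-level outputs at every time step
print(out.shape)                   # torch.Size([4, 7, 20])
print(h_n.shape)                   # torch.Size([2, 4, 20]) -> final hidden state of each level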

Going forward we asked the question, given that the state of an RNN records information from
all previous time steps, what would happen if we morph the state at a given time too much with
the current input?

(Refer Slide Time: 01:57)

The answer is evident here again, the effect of previous time steps will be reduced, which may
not be desirable for sequence learning problems.

(Refer Slide Time: 02:10)

Moving on now to backpropagation in RNNs, let us first revisit the forward pass in RNNs. Assuming this is now your diagram for visualizing an RNN, with an input x, weights U, a hidden state h, weights W, and then weights V that take you to an output 𝑦, your forward pass equations are given by ℎ𝑡 = 𝑡𝑎𝑛ℎ(𝑈𝑥𝑡 + 𝑊ℎ𝑡−1) and 𝑦𝑡 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑉ℎ𝑡). These would be your forward pass equations for an RNN that is solving a classification problem, where the output layer is defined by a softmax. What is the loss in this setting? Because it is a classification problem, you could use the standard cross-entropy loss, as given by this formula.
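A minimal numpy sketch of these forward-pass equations and a per-step cross-entropy loss is below. The dimensions, the random data and the assumption that "this formula" is the usual E_t = -Σ y_t log ŷ_t are my own, for illustration only.

import numpy as np

rng = np.random.default_rng(1)
D_IN, D_H, D_OUT, T = 5, 8, 3, 4                        # assumed dimensions and length
U = rng.standard_normal((D_H, D_IN)) * 0.1
W = rng.standard_normal((D_H, D_H)) * 0.1
V = rng.standard_normal((D_OUT, D_H)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = [rng.standard_normal(D_IN) for _ in range(T)]            # inputs x_1..x_T
ys = [np.eye(D_OUT)[rng.integers(D_OUT)] for _ in range(T)]   # one-hot targets y_1..y_T

h = np.zeros(D_H)                                       # h_0
loss = 0.0
for x, y in zip(xs, ys):
    h = np.tanh(U @ x + W @ h)                          # h_t = tanh(U x_t + W h_{t-1})
    y_hat = softmax(V @ h)                              # y_hat_t = softmax(V h_t)
    loss += -np.sum(y * np.log(y_hat + 1e-12))          # cross-entropy at this time step
print(loss)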

(Refer Slide Time: 03:20)

Now, if we want to compute the gradients of error E with respect to the 3 sets of weights that we
have here; U, V, and W, let us assume that these gradients are going to be used to update the
weights using stochastic gradient descent exactly the same way we did this for feed forward
neural networks, or CNNs.

And it is also important to keep in mind that depending on the kind of RNN variant that you are
using, you could have an error in each time step. If you had a many to one setting, you may have
an error only at one time step. But in a more general case of an RNN, you could have an error for
your output at every time step. So you could have an error 𝐸0 at time step t = 0, similarly 𝐸1 at time step t = 1, and so on, in this case till 𝐸4.

The question now is, how do you compute the gradient of the error with respect to U, V and W?
How do you do this? It is similar to the general principle of computing the gradient for any other
neural network. If a weight influenced an output through multiple paths, then you have to sum up
the contribution of that weight to the output along all possible paths. In our case, you would have
a weight here, here, here, here for all the time steps.

And all of them are the same weights in an RNN. So if we had to compute ∂𝐸/∂𝑊, where E is the overall error, ∂𝐸/∂𝑊 would be given by ∑𝑡 ∂𝐸𝑡/∂𝑊, where 𝐸𝑡 is the error at each time step. So our next question boils down to: how do you compute each of these ∂𝐸𝑡/∂𝑊? Let us see that now.

(Refer Slide Time: 05:35)

Before we go into computing ∂𝐸𝑡/∂𝑊, let us take a simpler case and try to compute ∂𝐸𝑡/∂𝑉. In particular, let us consider ∂𝐸3/∂𝑉, that is, let us say, the third time step. To compute ∂𝐸3/∂𝑉, let us write 𝑧3 = 𝑉ℎ3. Then the gradient can be computed as ∂𝐸3/∂𝑉 = (∂𝐸3/∂𝑦3)(∂𝑦3/∂𝑉). Now 𝑦3 is a softmax of 𝑧3; that is the way we have defined this network.

So, by the chain rule, you would have ∂𝐸3/∂𝑉 = (∂𝐸3/∂𝑦3)(∂𝑦3/∂𝑧3)(∂𝑧3/∂𝑉). Now, assuming that the activation function here is trivial (linear), and assuming that ∂𝐸3/∂𝑦3, if you use mean squared error or cross entropy, boils down to a simple 𝑦̂3 − 𝑦3, where 𝑦̂3 is the predicted output and 𝑦3 is the expected output, then ∂𝑧3/∂𝑉 would be ℎ3, because of the definition of 𝑧3 itself.

This becomes the gradient ∂𝐸3/∂𝑉. So you would sum up ∂𝐸3/∂𝑉 + ∂𝐸2/∂𝑉 + ∂𝐸1/∂𝑉 and so on to get the gradient of the overall error with respect to V. Once you compute that, you can update all the weights in V using gradient descent.
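A hedged numpy sketch of this ∂𝐸/∂𝑉 computation is below, assuming the softmax-plus-cross-entropy case where the per-step gradient simplifies to (ŷ_t − y_t) h_t^T; the toy values and the helper name grad_E_wrt_V are my own placeholders, as if carried over from a forward pass like the one above.

import numpy as np

def grad_E_wrt_V(y_hats, ys, hs):
    """Sum over t of (y_hat_t - y_t) h_t^T: the gradient of the total loss w.r.t. V."""
    dV = np.zeros((y_hats[0].shape[0], hs[0].shape[0]))
    for y_hat, y, h in zip(y_hats, ys, hs):
        dV += np.outer(y_hat - y, h)         # dE_t/dV = (y_hat_t - y_t) h_t^T
    return dV

# Toy example: 2 time steps, 3 output classes, hidden size 4.
y_hats = [np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.1, 0.8])]
ys     = [np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
hs     = [np.ones(4), 0.5 * np.ones(4)]
print(grad_E_wrt_V(y_hats, ys, hs).shape)    # (3, 4), same shape as V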

(Refer Slide Time: 07:39)

Now, let us move on to the next case, which is ∂𝐸3/∂𝑊. Recall again that in RNNs we have U, V and W, and we need to compute the gradients of the error with respect to each of them. So, let us say we have to compute ∂𝐸3/∂𝑊. Very similar to what we wrote for V, ∂𝐸3/∂𝑊 would be (∂𝐸3/∂𝑦3)(∂𝑦3/∂ℎ3)(∂ℎ3/∂𝑊), where W is the weight coming into ℎ3 from the previous time step. The question now is: is this good enough? If we now took this quantity and summed up ∂𝐸3/∂𝑊 + ∂𝐸2/∂𝑊 and so on, would we have computed ∂𝐸/∂𝑊 overall?

(Refer Slide Time: 08:42)

Unfortunately, no, because while ℎ3 depends on W directly, ℎ3 also depends on ℎ2, which in turn depends on W again, which means the chain rule needs to be applied again to be able to complete this computation of ∂𝐸3/∂𝑊. Why did we not need this with V? Because we did not have this problem there: V directly connects h to the error. So, how do we complete this?

(Refer Slide Time: 09:19)

So, ℎ3 depends on W via ℎ2, ℎ1 and all other earlier hidden states, which means ∂𝐸3/∂𝑊 can be written as

∂𝐸3/∂𝑊 = Σ_{k=0}^{3} (∂𝐸3/∂𝑦3)(∂𝑦3/∂ℎ3)(∂ℎ3/∂ℎ𝑘)(∂ℎ𝑘/∂𝑊).

What about ∂𝐸3/∂𝑈? We are going to leave that as homework, because it is going to be very similar to ∂𝐸3/∂𝑊; you only have to apply the chain rule in a principled manner. Just to complete this discussion, if one had to look at this, how would it look in expansion?

It would look like

(∂𝐸3/∂𝑦3)(∂𝑦3/∂ℎ3)(∂ℎ3/∂𝑊) + (∂𝐸3/∂𝑦3)(∂𝑦3/∂ℎ3)(∂ℎ3/∂ℎ2)(∂ℎ2/∂𝑊) + ...

and you will have further terms that apply similar chain rules through ℎ1 and so on. Remember, this summation is only for ∂𝐸3/∂𝑊; you will similarly have another summation for ∂𝐸4/∂𝑊, ∂𝐸2/∂𝑊 and so on. And your final gradient for W has to add up all of those to compute ∂𝐸/∂𝑊.

(Refer Slide Time: 11:20)

So, if you now observe ∂ℎ3/∂ℎ𝑘 when k = 1, as we just said, it can be expanded as ∂ℎ3/∂ℎ1 = (∂ℎ3/∂ℎ2)(∂ℎ2/∂ℎ1). So, this entire gradient can now be succinctly written as

∂𝐸3/∂𝑊 = Σ_{k=0}^{3} (∂𝐸3/∂𝑦3)(∂𝑦3/∂ℎ3) ( Π_{j=k+1}^{3} ∂ℎ𝑗/∂ℎ𝑗−1 ) (∂ℎ𝑘/∂𝑊).
(Refer Slide Time: 12:08)

Do you see any problem in this particular approach? If you think carefully, you will realize that RNNs are often used for time series data that can be reasonably long. You could be using it for data that has 20 time steps, 50 time steps or 100 time steps, depending on the nature of the problem that you are dealing with.

So when you now backpropagate, you are going to be multiplying the gradients across all of these time steps. As you saw on the earlier slide, you have this term which continues to multiply these activation gradients across multiple time steps. Now, why could that cause a problem? If the gradient for each of those factors is less than 1, multiplying these terms will lead to a vanishing gradient problem, because a product of values less than 1 will quickly go to 0. Is this really a problem?
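Here is a tiny numeric illustration of my own (not from the slides) of why the product Π ∂ℎ𝑗/∂ℎ𝑗−1 shrinks: if each per-step factor has magnitude below 1 (the 0.5 below is an arbitrary assumption), the product decays roughly exponentially in the number of time steps.

# Effective gradient magnitude reaching k steps back, if each factor is about 0.5.
factor = 0.5
for k in [1, 5, 10, 20, 50]:
    print(k, factor ** k)
# 1  -> 0.5
# 10 -> ~0.001
# 50 -> ~1e-15, so earlier time steps effectively stop influencing the update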

(Refer Slide Time: 13:15)

Let us consider, say, a sigmoid activation function used in a layer of the RNN. We know that the sigmoid function is upper bounded by 1; its values lie between 0 and 1. Even if we took a tanh activation function, its values would lie between -1 and 1. The gradient of the sigmoid activation function is also upper bounded by 1, which means all these terms will have gradients that are upper bounded by 1.

And what does that tell us? It means that the gradients in this particular computation of ∂𝐸3/∂𝑊 will quickly vanish over time, and the impact of an earlier time step may never be felt on a later time step, because the gradient contribution due to an earlier time step will most likely become 0 as a result of this product over a long range of activations across many time steps.

So effectively, although you want RNNs to model long-term temporal relations, you may not really be able to achieve that purpose because of the vanishing gradient problem, since an earlier time step may not really influence an output at a later time step. How do you combat this problem? There are already solutions for this problem, and we will see them in the next lecture.

(Refer Slide Time: 14:56)

But before we go there, let us ask the counter question: what if I did not use a sigmoid activation function? What if I just used a linear activation function? Let us assume, on the contrary, that each of my gradients ∂ℎ3/∂ℎ2, ∂ℎ2/∂ℎ1 were very high values. Then multiplying all of them could lead to what is known as the exploding gradient problem, because with values, say, in the range of 10, by multiplying 3 such values you quickly reach a magnitude of 10³.

And that can lead to an exploding gradient problem. This generally is not too much of an issue during implementation. Can you think why this may be the case? The answer is, firstly, it is likely to show up as NaN (not a number) during implementation. And more importantly, you can simply clip the gradients beyond a particular value.

This is known as gradient clipping, and it is very popularly done today while training neural networks: you say that if the gradient exceeds 10, you cap the maximum value it can take at 10. So even if your gradient was 10³, you are only going to use 10 and move on with the rest of the computations. This generally takes care of the exploding gradient problem, although the vanishing gradient problem remains, and we will see this in the next lecture.
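A hedged PyTorch sketch of gradient clipping follows (the library, the toy model and the threshold of 10 are my assumptions); clip_grad_norm_ rescales the gradients whenever their overall norm exceeds the threshold, which is one common way of doing what is described above.

import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(4, 30, 10)             # a reasonably long sequence
target = torch.randn(4, 30, 20)

out, _ = model(x)
loss = loss_fn(out, target)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so that their global norm does not exceed 10.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()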

(Refer Slide Time: 16:40)

So your homework for this lecture is to continue reading chapter 10 of the Deep Learning book, and also to go through this excellent WildML RNN tutorial by Denny Britz, which explains backpropagation in RNNs very well. The question that we are going to leave behind at the end of this lecture is this: as I just mentioned, in the next lecture we will see how you can change the RNN architecture to avoid the vanishing gradient problem.

But can you solve or address the vanishing gradient problem without any change in the overall architecture? Through some choices, can you solve the vanishing gradient problem? Think about it and we will discuss this next time.

(Refer Slide Time: 17:33)

The references.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
LSTMs and GRUs

(Refer Slide Time: 0:19)

We will now talk about how to solve the vanishing gradient problem using changes in the
architecture of an RNN, in particular we will talk about LSTMs and GRUs.

(Refer Slide Time: 00:31)

One question that we left behind was: how do you tackle the vanishing gradient problem in RNNs without changing the architecture? You could do a few things. You could use ReLU instead of tanh and sigmoid; remember, ReLU does not shrink the gradient for positive activations, it passes the gradient through as is. You could regularize in some way. You could initialize the weights in a way that helps carry the gradients backward. And you could also consider using only short time sequences, so that the gradient does not vanish within that short time frame.

(Refer Slide Time: 01:18)

Let us revisit this problem of long-term dependencies. Suppose you were dealing with a sequence learning problem which, say, wants to predict a sentence such as "the clouds are in the sky", or wants to classify, say, the sentiment or the category of the sentence. An RNN may be good enough here, because the word sky may depend only on the word clouds, which is just a couple of words before it if you remove all the basic words in the phrase.

However, if you have a more detailed, longer sentence such as "I grew up in France, I speak fluent French", the word French at a particular time step may depend on some other word that was used several words ago. These long-term dependencies may not really get captured by an RNN, because the presence of the word France may not really have an impact on the word French, since the gradients may vanish between these two time steps. So, we are going to talk about how we address this in this lecture.

(Refer Slide Time: 02: 40)

Just to recall: RNNs are trained using backpropagation through time. The limitations of backpropagation through time that we discussed were vanishing gradients and exploding gradients. Exploding gradients, we said, can be handled using gradient clipping. How do you now address vanishing gradients using changes in the RNN architecture? LSTMs and GRUs are two examples, and we will discuss both of them in detail in this lecture.

(Refer Slide Time: 03:15)

LSTM stands for Long Short-Term Memory. It was introduced way back in 1997 by Hochreiter & Schmidhuber, primarily intended to address this long-term dependency problem, although it was not in the same form that we will discuss now; we will also discuss its evolution as we go forward.

So, here is the design of an LSTM. At each step t, in an RNN we had a hidden state which the RNN block would output. Now, in an LSTM, we have a hidden state which the LSTM block outputs, and also an internal cell state which maintains information across a temporal context. The cell stores long-term information, and the LSTM block can erase, write and read information from that cell state based on what the context requires.

The selection of what information to forget, read or write is controlled by three gating mechanisms: one for erasing, one for writing and one for reading. At each time step, these gates could assume the value open (1), which allows all the information to pass through, closed (0), which does not allow any information to pass through, or somewhere in between (a value between 0 and 1). These gate values are dynamic: they are learnt and computed based on the input at a particular time step and the hidden state that comes from the previous time step. Let us see each of these components in more detail.

(Refer Slide Time: 05:20)

An RNN has the general form of a chain of repeating modules. If you took a vanilla RNN, the repeating module is perhaps a single layer with a tanh activation function. As we said earlier, it can also be not just a single layer but two or three layers, but that would be the repeating block applied to the input at each time step.

So, in this diagram you have an input 𝑋𝑡−1 at time t-1, 𝑋𝑡 at time t and 𝑋𝑡+1 at time t+1, and what you see here is a single neural network layer with a tanh activation function. The input at a particular time step 𝑋𝑡, as well as ℎ𝑡−1, which is the output of the previous RNN block, is provided as input; a tanh is applied, which gives us ℎ𝑡, and that ℎ𝑡 is given as output as well as passed to the next RNN block at the next time step. That is the vanilla RNN that we have seen so far, drawn in a different way. Let us now try to draw a parallel with LSTMs.

(Refer Slide Time: 06:44)

LSTMs have a similar structure. Once again it is a chain of repeating modules where you apply
the same block at every time step. But now the structure of each block is not just one layer or
two layers. It is different. It contains four different layers and they are not sequential layers
unlike the networks we have seen so far. They also interact with each other. We will see each of
these layers in detail.

(Refer Slide Time: 07:17)

So, in this case, you can see that you have four different layers denoted by these four blocks, four
yellow blocks. One of them is known as a forget gate which decides to forget some cell content
coming from the previous state. Another of them is known as an input gate and that is why you
see a sigmoid activation function here given by σ. So, these ones are sigmoid activation
functions. And the reason we use sigmoid here is because we want the output to lie between 0
and 1.

So, the input gate along with a tanh activation function decides how much of the input should be
written and also adds that new cell content at the end. So, that is what happens here. And finally
there is an output gate which decides how much of the cell state should be exposed as the hidden
state. So, cell state is denoted as 𝐶𝑡 and the hidden state is denoted as ℎ𝑡 and a part of the cell

state is revealed as hidden state to the next time step as well as an output of the cell for further
processing. Let us see each of these in more detail.

(Refer Slide Time: 08:46)

So, the cell state contains information which you can look at as a memory across a temporal context, and information can flow along the cell state unchanged. If you see this diagram here, it is just the diagram from the previous slide with a few components grayed out for the sake of explanation. You can see here that information from the previous cell state 𝐶𝑡−1 passes largely unchanged into the cell state 𝐶𝑡. Why is this important? Try to connect this to the vanishing gradient problem; we will come back and talk about this point later in this lecture.

The only operations here are a multiplication and an addition. The multiplication operation decides whether you want to forget something from the previous cell state 𝐶𝑡−1; it is driven by a sigmoid neural network layer whose output has the same dimension as the cell state, with a value between 0 and 1 for each dimension of that cell state.

So, for each dimension of the cell state, you can decide how much of that information you want to retain using this multiplication, which is a pointwise or element-wise multiplication operation. Then you also have an addition operation here, which decides how much of the incoming information gets added to the new cell state.

(Refer Slide Time: 10:36)

Now, let us talk about the forget gate to start with. The forget gate, which is one of the layers inside the LSTM block, takes as input 𝑥𝑡 and ℎ𝑡−1, very similar to the RNN block that we saw so far. It has a set of learnable weights 𝑊𝑓, which are learnt while training, and a corresponding bias 𝑏𝑓. You then apply a sigmoid activation function to ensure the outputs lie between 0 and 1; that is the output of the forget gate. And what does the forget gate do? It does an element-wise multiplication with the previous cell state 𝐶𝑡−1 to decide how much of that information should be erased and how much should be kept.

(Refer Slide Time: 11:32)

Similarly, the input gate decides what new information to write to the cell state. The input gate also receives a copy of 𝑥𝑡 and ℎ𝑡−1 as input; similar to the forget gate, it has its own weights 𝑊𝑖 and a bias 𝑏𝑖, and applies a sigmoid activation function to ensure the output lies between 0 and 1. In this part of the architecture, you also have another component which takes the same inputs ℎ𝑡−1 and 𝑥𝑡 (remember, those are the only two inputs an RNN block gets). It has its own weights 𝑊𝐶 and bias 𝑏𝐶, and applies a tanh activation function so that you can have both negative and positive values; this gives the candidate cell content 𝐶̃𝑡.

(Refer Slide Time: 12:34)

Let us see how these are combined now. The final cell state is given by 𝐶𝑡 = 𝑓𝑡 * 𝐶𝑡−1 + 𝑖𝑡 * 𝐶̃𝑡. The first term, 𝑓𝑡 * 𝐶𝑡−1, tells you which dimensions of the previous cell state 𝐶𝑡−1 should be forgotten and to what extent, and the second term, 𝑖𝑡 * 𝐶̃𝑡, decides what should be written onto the current cell state. So, you have a cell state which is like a memory; at the current time step you decide which aspects of the memory you want to remove and what new content you want to add.

(Refer Slide Time: 13:16)

Finally, you have the output gate, which controls what parts of the cell state are provided as the hidden state that goes to the next time step and to the output. So 𝑜𝑡 is another gate, similar to the forget gate and the input gate: it once again receives ℎ𝑡−1 and 𝑥𝑡 as input, has its own weights 𝑊𝑜 and bias 𝑏𝑜, and uses a sigmoid activation function because it is also a gating mechanism. This is then combined element-wise with (a tanh of) the current cell state 𝐶𝑡 to decide the ℎ𝑡 that is provided to the next time step as well as to the output.

(Refer Slide Time: 14:09)

Now, to summarize these in equations. 𝑓𝑡 is the forget gate, which controls what is kept versus what is forgotten from the previous cell state. The input gate controls what parts of the new cell content are written to the cell. The output gate controls what parts of the cell are output to the hidden state. All of them use a sigmoid activation function so that they act as gating mechanisms, with values lying between 0 and 1.

𝐶̃𝑡 is the new candidate cell content that you want to write to the cell; it is just a simple processing of the inputs. 𝐶𝑡 is finally obtained by forgetting some content from the previous state and writing some new cell content: 𝑓𝑡 controls 𝐶𝑡−1, and the input gate 𝑖𝑡 controls the 𝐶̃𝑡 which you ideally want to write onto the cell state.

And finally, the hidden state reads some content from the cell as output using the output gate; it again uses a tanh to maintain negative and positive values. Let us ask a couple of questions here to understand how this entire setup works. What can you tell about the cell state 𝐶𝑡 if the forget gate is set to all 1s and the input gate is set to all 0s?

Let us try to analyse this: the forget gate is set to all 1s and the input gate is set to all 0s. What would happen? From this equation for 𝐶𝑡, you can notice that the information coming in from the previous cell state will continue to be preserved, because no input would be added due to the input gate being 0, and the candidate content 𝐶̃𝑡 would have no effect on the cell state.

Let us ask another question: what would happen if you fix the input gate to all 1s, the forget gate to all 0s and the output gate to all 1s? If you think carefully, this will almost be the standard RNN; think about it to make sure you understand this. Why do we say almost a standard RNN? The only difference now is that there would be an extra tanh here, which was not there in the vanilla RNN.
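To tie the gate equations together, here is a minimal numpy sketch of a single LSTM step. It is my own illustrative code, not the lecture's: the weight shapes and the convention of concatenating ℎ𝑡−1 and 𝑥𝑡 into one input per gate are implementation assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x_t])          # both inputs, used by every gate
    f_t = sigmoid(Wf @ z + bf)                 # forget gate: what to erase from C_{t-1}
    i_t = sigmoid(Wi @ z + bi)                 # input gate: how much new content to write
    C_tilde = np.tanh(Wc @ z + bc)             # candidate cell content
    C_t = f_t * C_prev + i_t * C_tilde         # new cell state
    o_t = sigmoid(Wo @ z + bo)                 # output gate: what to expose
    h_t = o_t * np.tanh(C_t)                   # new hidden state
    return h_t, C_t

# Toy usage with assumed sizes: input dimension 3, hidden/cell dimension 4.
rng = np.random.default_rng(0)
D_X, D_H = 3, 4
params = [rng.standard_normal((D_H, D_H + D_X)) * 0.1 if i % 2 == 0 else np.zeros(D_H)
          for i in range(8)]                   # alternating weight matrices and zero biases
h, C = np.zeros(D_H), np.zeros(D_H)
h, C = lstm_step(rng.standard_normal(D_X), h, C, *params)
print(h.shape, C.shape)                        # (4,) (4,)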

(Refer Slide Time: 17:02)

Now, let us come back and ask this question. It seems like a complex architecture. One point to
add to the discussion so far is that these four layers in an LSTM are not necessarily sequential the
way we have seen networks so far. We of course saw skip connections where any layer could be
connected to any other layer, similarly even in an LSTM there are interactions between the layers
to achieve a specific objective.

Now, let us ask: how does the LSTM really solve the vanishing gradient problem? The key to that lies in this highway here between 𝐶𝑡 and 𝐶𝑡−1; we are going to call it the gradient highway. Whatever gradient of the error you receive at 𝐶𝑡 will be passed on as is to 𝐶𝑡−1.

Earlier we had to worry about the gradient being attenuated by the activation function across a layer, and so on. Here, if you notice, between 𝐶𝑡−1 and 𝐶𝑡 there is no layer per se; the only thing that exists between them is the output of the forget gate, which is just a vector, not a layer; there are no weights, it is just a vector. Why is this not a problem? Because the job of the forget gate is to decide how much 𝐶𝑡−1 should contribute to 𝐶𝑡.

So, if the gradient flowing to a previous state is reduced based on how much that state contributed, that is a fair gradient; we will not be worried about it. The gradient does not get attenuated by any other operations in the LSTM: it depends only on the forget gate, and the forget gate's job is to control how much of the previous cell state goes to the next cell state. So, the gradient would only get scaled by that much for each of the dimensions of the previous cell state. This allows the LSTM to solve the vanishing gradient problem.

So, once again, if you go back to the equations here, this equation for 𝐶𝑡 tells us that 𝐶𝑡 depends on 𝐶𝑡−1 only through 𝑓𝑡, while the second term, 𝑖𝑡 * 𝐶̃𝑡, could still be affected by the vanishing gradient problem. We do not worry about that, because the other component allows the gradient to pass through the LSTM network as is to earlier time steps. This is how LSTMs address the vanishing gradient problem.

(Refer Slide Time: 19:56)

Over the years since LSTMs came out way back in 1997, a few variants of LSTMs have been developed. One popular variant is known as the LSTM with peephole connections. The reason it is called that is that, in the computation of the forget gate, input gate and output gate, in addition to ℎ𝑡−1 and 𝑥𝑡 you also provide 𝐶𝑡−1 as input, and for the output gate you also give 𝐶𝑡 as input.

So, you are allowing your gates to peep into the cell state and then decide which dimensions should be cut down and which dimensions should be let through. That is why these are known as LSTMs with peephole connections.

(Refer Slide Time: 20:57)

Similarly, there has also been another variant known as the LSTM with coupled forget and input gates, where instead of having a separate forget gate and input gate, the cell state is computed as 𝐶𝑡 = 𝑓𝑡 * 𝐶𝑡−1 + (1 − 𝑓𝑡) * 𝐶̃𝑡. We had an 𝑖𝑡 in the vanilla version of the LSTM, but in this version 𝑓𝑡 doubles up to serve both purposes: 𝑓𝑡 tells you how much to forget, and 1 − 𝑓𝑡 tells you how much to let in. This is a variant that has been proposed.

(Refer Slide Time: 21:39)

To summarize the history of the LSTM: in 1997, Hochreiter and Schmidhuber first proposed LSTMs, and the learning method used at that time was known as real-time recurrent learning with backpropagation. There was no forget gate in this late-90s version of the LSTM. Then, in 1999, Schmidhuber's group introduced the forget gate into LSTMs, and in 2000 the variant with peephole connections was introduced. In 2005, Alex Graves, who is probably responsible for the version of RNNs and LSTMs that we see today, introduced the vanilla LSTM as we know it today, with all the components.

(Refer Slide Time: 22:28)

Over the last few years, especially between 2013 and 2015, LSTMs started achieving state-of-the-art results on various applications such as handwriting recognition, speech recognition, machine translation, parsing and image captioning, so much so that they became the default choice for sequence learning.

However, while we will not have the opportunity to discuss this now, in 2020, as we speak, one of the hottest trends for handling sequence learning problems is what are known as Transformers. Transformers use the idea of what is known as self-attention. In fact, a summary report of all the methods that participated in a recent WMT competition illustrates this; WMT is a machine translation competition, and machine translation is a sequence learning problem.

When we say machine translation, we mean an application such as Google Translate, going from English to German, or Hindi to English, and so on. Of all the participants that provided entries to this competition, only 7 were RNN based, while 105 were Transformer based. That should give you an idea of the latest trend at this time.

(Refer Slide Time: 23:51)

In 2014, another variant similar to the LSTM, known as the gated recurrent unit (GRU), was proposed; it is also used popularly as an alternative to the LSTM today. It was proposed by Chung et al. The main idea here is very similar to the coupled forget and input gates: GRUs combine the forget and input gates into a single update gate. They also merge the cell state and the hidden state, and this is the overall architecture. Let us see this in some more detail.

(Refer Slide Time: 24:31)

So, in this GRU you have only two gates: instead of an input gate, a forget gate and an output gate, you now have an update gate and a reset gate. These look exactly like any other gates that we saw with the LSTM.

The new candidate hidden state content is computed using the reset gate: the reset gate looks at ℎ𝑡−1, the hidden state coming from the previous time step, decides how much of it should be used, and then gives you an updated candidate hidden state ℎ̃𝑡. So, the reset gate selects useful parts of the previous hidden state and uses them to compute the new hidden state content, and the final hidden state that is exposed out of the GRU block is ℎ𝑡 = (1 − 𝑧𝑡) * ℎ𝑡−1 + 𝑧𝑡 * ℎ̃𝑡.

So, the update gate controls what is kept from the previous hidden state and what is updated with the new hidden state content. To understand this further, let us ask: what happens if the reset gate is set to all 1s and the update gate is set to all 0s? Let us try to analyse this. If the reset gate is all 1s, the previous hidden state ℎ𝑡−1 enters the candidate computation unchanged; there is no impact there.

If the update gate is all 0s, the second term 𝑧𝑡 * ℎ̃𝑡 disappears because all those terms become 0, and the coefficient (1 − 𝑧𝑡) of the first term becomes all 1s, which means ℎ𝑡 simply becomes ℎ𝑡−1; the entire computation from the current time step is not considered at all. So, effectively, very similar to one of the cases we discussed with LSTMs, the same state from the previous time step is retained and there is no influence of the current input.
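A corresponding hedged numpy sketch of one GRU step follows (illustrative only; the shapes and the concatenation convention are my assumptions, as in the LSTM sketch earlier).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, bz, Wr, br, Wh, bh):
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ z_in + bz)                                     # update gate
    r_t = sigmoid(Wr @ z_in + br)                                     # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]) + bh)  # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                        # final hidden state
    return h_t

# Toy usage with assumed sizes: input dimension 3, hidden dimension 4.
rng = np.random.default_rng(0)
D_X, D_H = 3, 4
Wz, Wr, Wh = [rng.standard_normal((D_H, D_H + D_X)) * 0.1 for _ in range(3)]
bz, br, bh = np.zeros(D_H), np.zeros(D_H), np.zeros(D_H)
h = gru_step(np.ones(D_X), np.zeros(D_H), Wz, bz, Wr, br, Wh, bh)
print(h.shape)                                                        # (4,)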

(Refer Slide Time: 26:50)

To summarize the differences between GRUs and LSTMs: the input and forget gates of the LSTM are combined into an update gate in the GRU, and the reset gate is applied directly to the previous hidden state in the GRU. So, GRUs have two gates and LSTMs have three gates. What does this tell us? Fewer parameters to learn in GRUs, so learning could work better even with less data.

GRUs do not have a separate internal memory (a cell state 𝐶𝑡); whatever internal state they maintain is also exposed as is, because there is no formal output gate to control how much of the state is output. In general, the LSTM is the common preferred choice, especially if you know that your data has long-range dependencies or if you have large amounts of training data; but if you want better speed and fewer parameters, and perhaps work with smaller datasets, GRUs are a good choice to try for sequence learning problems.
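As a quick, hedged way to see the "fewer parameters" point (assuming PyTorch and an arbitrary size of 128 for both the input and hidden dimensions):

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=128)
gru  = nn.GRU(input_size=128, hidden_size=128)
print("LSTM parameters:", n_params(lstm))   # four gates' worth of weights
print("GRU parameters: ", n_params(gru))    # three sets of weights, roughly 3/4 of the LSTM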

(Refer Slide Time: 28:06)

Your homework would be to continue reading chapter 10 of the Deep Learning book and these nice links on understanding LSTMs and GRU networks. If you would like to understand how LSTMs are trained using backpropagation through time, there is Alex Graves's book on RNNs; Alex Graves, as we said, was responsible for the vanilla LSTM as we know it today. There is also a nice code tutorial here if you would like to understand the hands-on side. You will have assignments to get hands-on experience as well, but if you would like to go through this, you can.

Let us leave one question for you to take away. We did answer how LSTMs address vanishing gradients. How do GRUs address vanishing gradients? Try to look at the architecture and ponder this question.

(Refer Slide Time: 29:06)

References.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Video Understanding using CNNs and RNNs
(Refer Slide Time: 0:17)

Having seen RNNs, which deal with temporal sequences, we will now move to temporal sequences in computer vision, which are videos. We will try to see how videos are processed using CNNs and RNNs.

(Refer Slide Time: 00:36)

The one question that we left behind in the last lecture was: how do GRUs address vanishing gradients? The answer is the same as for LSTMs. They also have a gradient highway between the hidden state at a particular time step and the hidden state at the previous time step, which lies in the final hidden state update equation.

You would notice that it was given by ℎ𝑡 = (1 − 𝑧𝑡) * ℎ𝑡−1 + 𝑧𝑡 * ℎ̃𝑡, where 𝑧𝑡 is the update gate and ℎ̃𝑡 is the current candidate hidden state. In that equation, the gradient of the current hidden state with respect to the previous hidden state depends only on a gate, which does not affect the gradient per se as long as the update gate allows the gradient to pass through, and that is by design.

(Refer Slide Time: 01:37)

Moving on to videos, let us first ask the question: why do we need to understand a video at all? Here are a couple of examples. This is an example of an automated baseball bat hitting a baseball. If we wanted to understand where this baseball bat would hit the ball, what direction the ball will fly in, or what kind of activity this is, one way would be to look at all the frames while coming up with the decision; a single frame in the sequence may not let you answer a question such as where the ball will go.

(Refer Slide Time: 02:21)

Another example is if we wanted to understand in which direction this screw was being turned
in, one of these frames will not give you the information of which direction the screw was being
turned in. You would need the sequence of frames to be able to make this decision.

(Refer Slide Time: 02:44)

So, the question is: how do we use whatever we have learned so far? We have seen feedforward neural networks, CNNs and now RNNs. How do we use these to understand a video? Can one of them suffice, or do we need multiple of them? Let us try to figure this out ourselves, from the ground up.

(Refer Slide Time: 03:06)

So, if we have to understand a video, say the screw video we just saw, you could take one specific location, one position on that screw, track it over time, and then be able to say which direction the screw was turning in. So, how do we get this information? How do we actually track a particular point and then see where it goes over time?

(Refer Slide Time: 03:37)

There are a few possibilities. Considering the focus of this course is on deep learning, we are going to look at deep learning approaches to understanding videos. Broadly speaking, one could use only CNNs to understand videos, or one could use CNNs and RNNs together; we will see both kinds of examples in this lecture. Even within CNNs, there are two ways to perform video-based understanding: one is to do what is known as 3D convolution, and the other is to see if you can repurpose 2D convolution itself to understand videos.

In a 3D convolution approach, time becomes another dimension of analysis: just as images are two-dimensional signals, videos are three-dimensional signals, with time as the extra dimension. So, for every operation we saw so far, you have to add another dimension to the operation, for example, 3D convolution.

Another way of looking at it is to use 2D convolution but to consider time as a separate entity, not as a separate dimension by itself; we will see an example of this to make it clear. One question you could have is: were we not doing 3D convolution all along? The inputs to our CNNs so far were volumes; we had an R channel, a G channel and a B channel.

So were we not already doing convolution over a volume, and was that not 3D convolution? If you have not thought through that before: not really. The reason we do not call that 3D convolution is that we only moved the convolution window along the two dimensions of the image. The depth, that is, the number of channels, was always fixed; you did not move across that dimension when you performed convolution.

The movement was only along two dimensions, and so whatever we saw so far is 2D convolution. Although the filter extends into the third dimension and we process it, we do not actually slide windows along the third dimension. But now, when we talk about videos and 3D convolution, we are talking about moving along the temporal dimension as well.
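A minimal hedged PyTorch illustration of this difference follows (the sizes are arbitrary assumptions): a 2D convolution covers all input channels but slides only over height and width, while a 3D convolution also slides along the temporal dimension of a clip.

import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)     # (batch, RGB channels, 16 frames, H, W)

# 2D conv applied to one frame: kernel spans the channels but slides only over H and W.
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
frame_out = conv2d(clip[:, :, 0])          # one frame -> (1, 8, 112, 112)

# 3D conv: a (3, 3, 3) kernel also slides along the temporal dimension.
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=(3, 3, 3), padding=1)
clip_out = conv3d(clip)                    # (1, 8, 16, 112, 112)

print(frame_out.shape, clip_out.shape)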

(Refer Slide Time: 06:27)

Let us try to understand 3D CNNs first as a way to understand videos. Traditionally speaking if
you had one frame from a video or more frames, if you performed standard 2D convolution, you
would look at a particular patch of each frame, perform convolution and you would get a certain
output in the output feature map.

Unfortunately, when we apply convolution this way, the two frames are not connected and you
are not sharing any information between the two frames. However, when you do video
understanding the core idea is to be able to pass information or combine information from
multiple frames to make a prediction if it was a classification problem for instance. So, how do
we do this?

(Refer Slide Time: 07:26)

As I just mentioned, the idea is to extend convolution to a further dimension. So, if you had three
frames instead of doing 2D convolution on each frame individually, what we are going to
propose is to do convolution across the frames. Let us try to study this more carefully.

So, you can see here that the first patch has a set of weights which gives you a pixel in the output
feature map, but that feature map is also influenced by another set of weights that is applied to
the second frame and a third set of weights that is applied to the third frame. Similarly, you can
see that in a second feature map you would have a different set of weights applied to each frame
of your video. So, you have another dimension that you have to convolve on now, which is the
temporal dimension.

(Refer Slide Time: 08:28)

We are going to look at one specific paper, which was perhaps the initial harbinger that brought on a lot of other work. This was known as 3D Convolutional Neural Networks, first published at ICML 2010 and then in PAMI 2012. Let us look at the architecture of this 3D CNN first; the architecture has evolved over the years, but this should give you one example.

In this particular work, 3D convolutional neural networks were used for human action recognition: to look at a video and predict the action. The input was a set of 7 grayscale frames, each at a resolution of 60 x 40. This work initially had one layer where hardwired filters were used.

So, these filters here were not learnt. There were five different kinds: a grayscale filter, gradients along the X and Y directions, and optical flow along the X and Y directions. Those gave these sets of images: a total of 33 feature maps, each still at the resolution 60 x 40, after this initial processing with a few hardwired filters. These were not learnt; they were filters that detected edges, computed optical flow and so on.

The next layer was 3D convolution, performed with learned weights. They performed 7 x 7 x 3 3D convolution, where, for each of the five kinds of feature maps we just described, the 3D convolution was done within that kind of feature map only, which is what you see colour coded here. You can see these five colours, and those five colours are also maintained in subsequent layers.

So, the 3D convolution is within only the yellow region, within only the red region, within only the purple region and so on. And to get some variety, they also introduced a completely different set of feature maps with a different set of weights but performing the same operations. You could consider this similar to AlexNet's two sets of 48 feature maps in the first layer going to two different GPUs; although in this case the intent was not to use two different GPUs, the intent was to get more variety in the feature maps. The rest of the layers follow a traditional CNN approach.

They then performed 2 x 2 subsampling within each of these feature maps, then another 3D convolution of size 7 x 6 x 3, which again had two sets of feature maps, then 3 x 3 subsampling, then a 7 x 4 convolution to bring everything together, and then a fully connected layer to make the final decision. How are these learnt? Nothing changes from that perspective: you still have a cross-entropy loss at the end, and you can still backpropagate through all of these layers, very similar to what we discussed for CNNs.

(Refer Slide Time: 12:10)

Using this kind of an approach, they could show that you could perform action recognition. Certain actions in their dataset were, for example, a cell-to-ear action, where someone puts a cell phone to their ear; an action where somebody leaves behind an object in a public environment; or when someone points to another person.

If you observe carefully, for each of these boxes there is also a trajectory that follows it. So if you looked at this particular box here, you would see a trajectory showing the positions in the previous frames that led to this particular outcome on this frame, and also a probability, or rather a score, of the outcome. So, this is the 3D CNN approach to understanding videos, and there have been several variants over the
years in trying to perform such operations.

(Refer Slide Time: 13:13)

Let us now turn the problem around and ask: could we have used 2D CNNs themselves to solve a video understanding problem? It turns out there is a very popular method that does this, known as two-stream convolutional networks, which was, and still is, popular for video understanding tasks.

Let us try to understand how this works. Suppose you had a video where, over a few time steps, you see a person pulling something out from behind their shirt: the hand moves and something comes out. We could use a traditional optical flow approach to find out which pixels moved, and by how much, along the X and Y directions. So, from this optical flow, you get a horizontal component and a vertical component. How do we use this?

(Refer Slide Time: 14:14)

In the two-stream CNN, your input video is a volume. You have a spatial stream of the CNN, which takes a single frame (which could be colour) and sends it through a standard CNN architecture, with conv1, conv2, conv3, conv4, conv5 and two fully connected layers, very similar to an AlexNet architecture, to get a final classification of what the action may be.

But this is only for a single frame; this is known as the spatial stream. The other stream works on multi-frame optical flow, capturing how the content changes across frames. So, you compute the optical flow between every pair of consecutive frames across all the frames in the input video.

Now, this set of optical flow images becomes a set of channels across the temporal dimension, and one can perform standard 2D convolution on this volume of optical flows: each optical flow image is a 2D image, you get optical flows between every pair of successive frames across time, and you stack them all together to make a volume. This volume is sent through another standard 2D CNN architecture, which we call the temporal stream convolutional network. Both streams provide their own classifications of what the action may be.

The final step is to perform a fusion of class scores: you can aggregate them in several ways, for example by majority voting or averaging, and then make a final decision. This gives the final decision on what the action is. Note here that all these convolutions are 2D convolutions and not 3D convolutions.
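A hedged sketch of this two-stream idea follows. The tiny CNNs, the class count of 10 and the averaging fusion are my own simplifications, not the original architecture: one 2D CNN sees a single RGB frame, the other sees a stack of optical-flow channels, and their softmax scores are averaged.

import torch
import torch.nn as nn
import torch.nn.functional as F

def tiny_cnn(in_channels, num_classes=10):
    # Stand-in for the AlexNet-like stream described above.
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, num_classes))

spatial_stream  = tiny_cnn(in_channels=3)        # one RGB frame
temporal_stream = tiny_cnn(in_channels=2 * 10)   # x and y flow for 10 frame pairs, stacked

rgb_frame  = torch.randn(1, 3, 224, 224)
flow_stack = torch.randn(1, 20, 224, 224)

scores = (F.softmax(spatial_stream(rgb_frame), dim=1) +
          F.softmax(temporal_stream(flow_stack), dim=1)) / 2   # late fusion by averaging
print(scores.argmax(dim=1))                      # predicted action class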

(Refer Slide Time: 16:26)

What about using RNNs with CNNs to perform video understanding? Given a visual input such as an image, you first extract visual features from a CNN: you forward propagate the image through a CNN and take, say, the penultimate layer of AlexNet or something like that as the representation of the image. These representations can then be provided to a set of LSTMs or RNNs to perform sequence learning and give your final predictions.

So, if you have a video with multiple frames, you provide each frame to the same CNN to get a feature representation per frame, and these can then be provided to an LSTM. You could assume now that each column here is your LSTM unfolded over time (this could also be a stacked LSTM unfolded over time), and the outputs of these LSTMs are the predictions that you make. You could have a prediction per frame, or you could ignore all of those and make a prediction only at the last time step. That would be the way to combine CNNs with RNNs.

How do you learn such a network? Remember that every component that we see here is differentiable, and we know how to backpropagate through an RNN, an LSTM and a CNN. So, given a cross-entropy loss at the end, or a sum of cross-entropy losses over the time steps, we can compute the backpropagated error, that is, the gradient of that error with respect to every weight in the CNN and the LSTM. The same CNN model is used on each of these frames.
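A hedged PyTorch sketch of this CNN-plus-RNN pipeline follows; the small CNN, the feature size and the choice of predicting only from the last time step are assumptions for brevity. The same CNN is applied to every frame and the per-frame features are fed to an LSTM.

import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        # One shared CNN applied to every frame (stand-in for e.g. AlexNet features).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim))
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, video):                    # video: (B, T, 3, H, W)
        B, T = video.shape[:2]
        feats = self.cnn(video.reshape(B * T, *video.shape[2:]))   # per-frame features
        out, _ = self.lstm(feats.reshape(B, T, -1))
        return self.head(out[:, -1])             # predict from the last time step only

model = CNNLSTMClassifier()
logits = model(torch.randn(2, 8, 3, 64, 64))     # 2 videos of 8 frames each
print(logits.shape)                              # torch.Size([2, 10])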

(Refer Slide Time: 18:27)

Now, with these approaches, be it the two-stream CNN, a CNN with an RNN, or a 3D CNN, what kinds of problems in video understanding can we solve? If you took a dataset such as UCF101, which is a very popular dataset for video understanding, you could perform action recognition: is the person doing push-ups, rock climbing, trampoline jumping, and so on.

(Refer Slide Time: 19:00)

Or, on a different kind of dataset, such as another popular one known as Hollywood in Homes, which is again an activity understanding dataset, you could perform action recognition such as smiling at a book and so on, or you could also perform sentence prediction, such as caption generation for a full video, and similar problems.

(Refer Slide Time: 19:29)

There are more problems that can be solved in video understanding. Action forecasting: if you have a sports video and you have seen a particular sequence of events so far, what will happen next? For example, in a water polo game, given the sequence of moves seen so far, where will the player pass the ball next? Or in basketball, for instance. Object tracking is another important video understanding task.

Dynamic scene understanding: so far we have seen methods that, given a scene, can classify it as outdoor or indoor, morning or night, urban or rural, and so on, using image-level classification methods. We know a CNN and a cross-entropy loss; given data, we can learn a scene classification model.

But what happens if you want to understand a dynamic scene, a dynamic video? Then we have to switch to video understanding models such as a 3D CNN, a two-stream CNN or a CNN plus RNN to solve such a problem. One could also think of other problems such as temporal action segmentation and so on, which would be more of an unsupervised kind of approach; these are also important problems in video understanding.

(Refer Slide Time: 20:55)

Your homework would be to go through this tutorial on large-scale holistic video understanding, which should give you a clearer picture of more tasks in video understanding and how these architectures can be adapted to solve them; this particular link also gives a good listing of papers in the space.

We will conclude the lectures for this week with a very interesting question: what do you think will happen if you train a model on normal videos and, at test time, do inference on a reversed video? Would it work? Would it not work? Think about it and we will discuss this next time.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture - 56
Attention in Vision Models: An Introduction

(Refer Slide Time: 00:14)

Having discussed RNNs last week, we will now move to a very contemporary topic that addresses some of the technical limitations of what RNNs bring to deep learning: attention models.

(Refer Slide Time: 00:34)

Before we go into attention models, let us discuss the question that we left behind: what do you think will happen if you train a model on normal videos and do inference on a reversed video? I hope you had a chance to think about this. It depends on the application or task. For certain activities, say if you want to differentiate walking from jumping, it could work to a certain extent even if you tested it on a reversed video.

However, for certain other activities, say a sports action such as a tennis forehand, this may not be that trivial. An interesting related problem in this context is known as finding the arrow of time. There are a few interesting papers in this direction, where the task at hand is to find out whether the video is playing forward or backward. This can be trivial in some cases, but it can get complex in others. If you are interested, please read the paper known as Learning and Using the Arrow of Time if you would like to know more.

(Refer Slide Time: 01:59)

So far with RNNs, we saw that RNNs can be used to efficiently model sequential data, and that RNNs use backpropagation through time as the training method. RNNs, unfortunately, suffer from the vanishing and exploding gradients problems. To handle the exploding gradient problem, one can use gradient clipping, and to handle the vanishing gradient problem one can use RNN variants such as LSTMs or GRUs.

We saw how to use these for handling sequence learning problems. But the question we ask now is: is this sufficient? Are there tasks where RNNs may not be able to solve the problem? Let us find out more about this.

(Refer Slide Time: 02:58)

Let us consider a couple of popular tasks where RNNs may be useful. One is the task of image captioning: given an image, one has to generate a sequence of words to make a caption that describes the activity or the scene in the image. Another example where RNNs are extremely useful is the task of Neural Machine Translation, or NMT. It is what you see in the translate apps that you may have used, where you have a sentence given in a particular language and you have to produce the equivalent sentence in a different language. Both of these are RNN tasks.

(Refer Slide Time: 03:51)

A standard approach to handling such tasks is: given any input, which could be a video, an image, audio or text, you first pass it through an encoder network, which gets you a representation of that input which we call the context vector. Given this context vector, you pass it through a decoder network which gives you your final output text. These are known as encoder-decoder models, and they are extensively used in such contexts.

(Refer Slide Time: 04:37)

Now let us take a brief detour to understand encoder-decoder models a bit more. A standard example of such encoder-decoder models is the autoencoder. In this case, the decoder is trying to reconstruct the input itself, and that is the reason why it is called an autoencoder. Not all encoder-decoder models need to be autoencoders; however, the conceptual framework of encoder-decoder models comes from autoencoders, which is why we are discussing this briefly before we come back to encoder-decoder models.

An autoencoder is a neural network architecture where you have an input vector and a network which we call the encoder network. Then you have a context vector, which we also call the bottleneck layer, which is a representation of the input, and then you have a decoder layer or network that outputs a certain vector. In an autoencoder, we set the target values to the inputs themselves.

So you are asking the network to predict the input itself. What are we trying to learn here? We are trying to learn a function f parametrized by weights and biases (W, b) such that 𝑓(𝑥) ≈ 𝑥. Rather, we are trying to learn the identity function itself and predict an output x hat which is close to x.

So how do you learn such a network using backpropagation? What kind of loss function would
you use? It would be a mean squared error, where you are trying to measure the error between x
and x hat, which is the reconstruction of the autoencoder. Then you can learn the weights in the
network using backpropagation as with any other feed-forward neural network.
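A minimal hedged sketch of such an autoencoder and its reconstruction loss follows; the dimensions and the single-hidden-layer choice are assumptions, not the lecture's exact architecture.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())   # input -> bottleneck (code)
decoder = nn.Sequential(nn.Linear(32, 784))              # bottleneck -> reconstruction

x = torch.rand(16, 784)                    # a batch of inputs (e.g. flattened images)
x_hat = decoder(encoder(x))                # the network tries to reproduce its own input
loss = nn.functional.mse_loss(x_hat, x)    # mean squared error between x_hat and x
loss.backward()                            # weights learned by backpropagation as usual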

(Refer Slide Time: 06:56)

Now, the encoder and the decoder need not be just one layer each: you could have several layers in the encoder and, similarly, several layers in the decoder. In the autoencoder setting, traditionally, the decoder is a mirror of the encoder architecture. So if you have a set of layers in the encoder with a certain number of hidden nodes in each layer, then the decoder mirrors the same architecture the other way, to ensure that you get an output of the same dimension as the input; that is when you can measure the mean squared error between the reconstruction and the input. However, while this is the case for an autoencoder, not all encoder-decoder models need to have such architectures: you can have a different architecture for the encoder and a different architecture for the decoder, depending on what task you are trying to solve.

(Refer Slide Time: 08:00)

Just to understand a variant of the autoencoder: a popular one is known as the Denoising Autoencoder. In a denoising autoencoder, you take your input data and intentionally corrupt the input vector, for example by adding something like Gaussian noise, to get a set of values x̂1 to x̂n; those are your corrupted input values. You pass these through your encoder to get a representation, then through a decoder, and you finally try to reconstruct the original input itself. What is the loss function here?

The loss function would again be a mean squared error, but this time it is the mean squared error between your output and the original, uncorrupted input. What are we trying to do here? We are trying to ensure that the autoencoder generalizes well, so that at the end of training, even if there is some noise in the input, the autoencoder is able to recover your original data.
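And a corresponding hedged sketch of the denoising variant: corrupt the input with Gaussian noise, but compute the MSE against the clean input. The noise level and shapes here are assumptions.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())
decoder = nn.Sequential(nn.Linear(32, 784))

x = torch.rand(16, 784)                      # clean inputs
x_noisy = x + 0.1 * torch.randn_like(x)      # intentionally corrupted inputs
x_hat = decoder(encoder(x_noisy))            # reconstruct from the corrupted version
loss = nn.functional.mse_loss(x_hat, x)      # compare against the original, clean input
loss.backward()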

(Refer Slide Time: 09:18)

With that introduction to autoencoders, let us ask one question. In all the architectures that we saw so far with autoencoders, the hidden layers were always smaller in dimension than the input layer. Is this always necessary? Can you go larger? Autoencoders where the hidden layers have a smaller dimension than the input layer are called under-complete autoencoders.

So you can say that such autoencoders learn a lower-dimensional representation, on a suitable manifold of the input data, from which, if you use the decoder, you can reconstruct your original input. On the other hand, if you had an autoencoder architecture where the hidden layer dimension is larger than your input, you would call such an autoencoder an over-complete autoencoder.

While technically this is possible, the limitation here is that the autoencoder could blindly copy certain inputs to certain dimensions of that larger hidden layer and still be able to reconstruct, which means such an over-complete autoencoder can learn trivial solutions that do not really give you useful performance; it may simply memorize all the inputs and copy them back to the output layer.

Then the question is: are all autoencoders also dimensionality reduction methods, assuming we are talking about under-complete autoencoders? Partially yes; largely speaking, autoencoders can be used as dimensionality reduction techniques. A follow-up question then is: can an autoencoder be considered similar to principal component analysis, which is a popular dimensionality reduction method? The answer is actually yes again. But I am going to leave this for you as homework, to work out the connection between autoencoders and PCA.

(Refer Slide Time: 11:54)

Let us now come back to what we were talking about, which was one of the tasks of RNNs, Neural Machine Translation, or NMT. These kinds of encoder-decoder models are also called sequence-to-sequence models, especially when both the input and the output are sequences. So suppose you had an input sentence that says 'India got its independence from the British'.

Let us say now that we want to translate this English sentence to Hindi. What you do now is you have an encoder network, which is a recurrent neural network (RNN), where each word of your input sentence is given at one time step of the RNN. The final output of the RNN is what we call a context vector, and this context vector is fed into a decoder RNN, which gives you the output 'Bharat ko',

the rest of the sentence, 'mili', and then you have an end-of-sentence token. This is what we saw as a many-to-many RNN last week. Why are we not giving an output at each time step of the encoder RNN? For the machine translation task, if you recall the recommended architecture, we said that it is wiser to read the full sentence and then start giving the output of the translated sentence.

1195
Why so? Because different languages have different grammar and sentence constructions. It may not be correct for the first word in English to be the first word in Hindi, and the Hindi sentence may not follow exactly the same sequence of words as the English one because of grammatical rules. That is the reason why, in machine translation tasks, you generally read the entire input sentence, get a context vector, and then start generating the entire translated output.

(Refer Slide Time: 14:28)

Similarly, if you considered the image captioning task, you would have an image, and in this case your encoder would be a CNN followed by, say, a fully connected network, out of which you get a representation or a context vector. This context vector goes to a decoder, which outputs the caption, 'A woman ... in the park', followed by an end-of-sentence token.

1196
(Refer Slide Time: 14:57)

What is the problem? This seems to be working well; is there a problem at all? Let us analyze this a bit more closely. In an RNN, the hidden states are responsible for storing relevant input information. So you could say that the hidden state at time step t, $h_t$, is a compressed form of all previous inputs. That hidden state represents some information from all the previous inputs, which is required for processing at that state as well as at future states.

(Refer Slide Time: 15:41)

1197
Now, let us consider a longer sequence. If you considered language processing and a large paragraph, where your input is very long, can $h_t$, the hidden state at any time step, encode all this information? Not really; you may be faced with an information bottleneck problem in this kind of context.

(Refer Slide Time: 16:07)

So if you considered a sentence such as this one here, which has to be translated to German, can we guarantee that words seen at earlier time steps can be reproduced at later time steps? Remember, when you go from a language such as English to a language such as German, the positions of the verbs and the nouns may all change, and to reproduce the sentence, one may have to use a word that occurs much earlier in the English sentence but much later in, say, the German sentence. Is this possible? Unfortunately, RNNs do not work that well when you have such long sequences.

1198
(Refer Slide Time: 16:57)

Similarly, consider image captioning and related problems such as visual question answering, which we will see later. If you had this image that we saw at the very beginning of this course, and we ask the question 'What is the name of the book?', the expected answer is 'The name of the book is Lord of the Rings.' The relevant information in a cluttered image may also need to be preserved in case there are follow-up questions in a dialog.

(Refer Slide Time: 17:34)

1199
So a statistical way of understanding this is through what is known as the BLEU score. The BLEU score is a common performance metric used in NLP (Natural Language Processing); BLEU stands for Bilingual Evaluation Understudy. It is a metric for evaluating the quality of machine-translated text, and it is also used for other tasks such as image captioning, visual question answering, and so on.

And when one looks at the BLEU score, one observes that, while the desired behaviour is to retain a high BLEU score even as the sentence length increases, unfortunately, once the sentence length goes beyond a threshold, the BLEU score starts falling. This means that such encoder-decoder models, where both the encoder and the decoder are RNNs, start failing when the sequences are long by nature. If you would like to know more about BLEU, you can see its entry on Wikipedia for more information.

(Refer Slide Time: 18:57)

So what is the solution to this problem? The solution that is extensively used today is what is known as attention, which is going to be the focus of this week's lectures.

1200
(Refer Slide Time: 19:13)

So what is this attention? Intuitively speaking, given an image, if we had to answer the question 'What is this boy doing?', the human way of doing this would be to first identify the artifacts in the image and then pay attention to the relevant ones, in this case, the boy and the activity the boy is associated with.

(Refer Slide Time: 19:42)

Similarly, if you had an entire paragraph and you had to summarize it, you would probably look at certain parts of the paragraph and write them out in a summarized form. So paying attention to parts of inputs, be it images or be it long sequences like text, is an important part of how humans process data.

(Refer Slide Time: 20:12)

So let us now see this in a Sequence learning problem in the traditional encoder-decoder model
setting. So this is once again, the many to many RNN setting, similar to what we saw for Neural
Machine Translation. So you have your inputs, then you have a context vector that comes out at
the end of the inputs, that context vector is fed to a decoder RNN, which gives you the outputs Y1
to Yk.

Now let us assume that the $h_j$'s are the hidden states of the encoder and the $s_j$'s are the hidden states of the decoder. So what does attention do? Attention suggests that, instead of directly passing $h_T$, the last hidden state, to your decoder RNN, we instead have a context vector that relies on all of the hidden states of the input. This creates a shortcut connection between this context vector $c_t$ and the entire source input x.

How would you learn this context vector? We will see that there are multiple different ways. Given this context vector, the decoder hidden state is given by $s_t = f(s_{t-1}, y_{t-1}, c_t)$: a function of the previous hidden state in the decoder, of $y_{t-1}$, the output of the previous time step in the decoder (which can be given as input to the next time step), and of the context vector $c_t$.

1202
(Refer Slide Time: 22:05)

And what is this context vector? The context vector is given by $c_t = \sum_{j} \alpha_{t,j}\, h_j$, where the sum runs over all the time steps of your encoder RNN. So it is a weighted combination of all of your hidden state representations in the encoder RNN. How do you find $\alpha_{t,j}$, the weights of the different inputs? A standard framework is that $\alpha_{t,j}$ is obtained as a softmax over some scoring function that captures the score between $s_{t-1}$ and each of the hidden states in your encoder.

So $s_{t-1}$ gives us the current context of the output. We try to understand the alignment of the current output context with each of the inputs, and accordingly pay attention to specific parts of the inputs. Now there is an open question: how do you compute this score of $s_{t-1}$ with each of the $h_j$'s in the encoder RNN? Once we have a way of computing that score, we can take a softmax of the score for $h_j$ with respect to the scores of all the $h_j$'s. We do this for each of the $h_j$'s in the encoder RNN, and using that we can compute the $\alpha_{t,j}$'s. Using the $\alpha_{t,j}$'s, we can compute the context vector. Once you get the context vector, you give the corresponding context vector as input to each time step of the decoder RNN. How do you compute this score?

1203
(Refer Slide Time: 24:05)

There are a few different approaches in the literature. We will review many of them over the lectures this week, but to give you a summary: you could have content-based attention, which scores a particular hidden state of your decoder RNN, $s_t$, against a particular hidden state of the encoder RNN, $h_i$, using the cosine similarity between the two, $\text{score}(s_t, h_i) = \text{cosine}(s_t, h_i)$. That is one way of measuring the score.

You could also learn weights to compute this alignment: take $s_t$ and $h_i$, apply a learned weight matrix, take a tanh, and use another learned vector to get the score, $\text{score}(s_t, h_i) = v_a^T \tanh(W_a[s_t; h_i])$. This is a learned procedure for getting the final score. One could also obtain $\alpha_{t,i}$ directly as a softmax over a learned set of weights applied to $s_t$, $\alpha_{t,i} = \text{softmax}(W_a s_t)$. One could also use a more general framework, $\text{score}(s_t, h_i) = s_t^T W_a h_i$, which is similar to a dot product

but with a learned set of weights in between that tells you how to compare the two vectors $s_t$ and $h_i$. Remember, any $W_a$ here is learned by the network to compute the score. Or you could simply use the dot product by itself, $s_t^T h_i$, which would be similar to content-based attention; the cosine and the dot product give similar values. Or there is a variant known as scaled dot-product attention, where you use the dot product between the two vectors $s_t$ and $h_i$ but scale it by $\sqrt{n}$, where n is the dimension of the hidden states.
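As a small illustration of these formulas (my own NumPy sketch, not from the lecture), here is soft attention with a dot-product score; the sizes and the random data are assumptions.

```python
# Soft attention over encoder states with a dot-product score (illustrative sketch).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

T, d = 6, 8                                  # 6 encoder time steps, hidden size 8 (assumed)
H = np.random.randn(T, d)                    # encoder hidden states h_1 ... h_T
s_prev = np.random.randn(d)                  # decoder state s_{t-1}

scores = H @ s_prev                          # dot-product score for each h_j
# scores = (H @ s_prev) / np.sqrt(d)         # scaled dot-product variant
alpha = softmax(scores)                      # alpha_{t,j}: attention weights over inputs
c_t = alpha @ H                              # context vector: weighted sum of the h_j's
print(alpha.round(3), c_t.shape)
```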

1204
(Refer Slide Time: 26:06)

What about spatial data? We saw how it is done for temporal data, where you had a sequence-to-sequence, many-to-many RNN. What if you had an image captioning task, where you have spatial data? In this case, your image would give you a certain representation $s_0$ out of the encoder network. Unfortunately, when you use a fully connected layer after the CNN, you lose spatial information in that vector.

So instead of using the fully connected layer, we typically take the output of the convolution layers themselves, which gives you a certain volume, say m × n × c. Now, if you consider one specific patch of this m × n × c volume, you can trace it back to a particular patch of the original image that was passed through the CNN.

So in the output feature map, say a conv5 feature map, if you look at one particular part of that depth volume, it corresponds to a certain patch in the input image. This gives you spatial information. So what can we do? We take the feature map that we get at the output of a certain convolution layer and unroll it into 1 × 1 × c vectors.

So you can unroll the m × n × c volume into m × n different vectors, each of dimension c. Then you can apply attention over them to get a context vector. In what way is this useful? This context vector can now be understood as paying attention to certain parts of the image while generating the output, because each of these sub-volumes, highlighted here in yellow, corresponds to a certain part of the input image, and one can now apply the same weighted attention concept.

The alignment part of it can be implemented very similarly to what we saw on the previous slide, but now it is computed over different parts of the input image.
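As a rough sketch of this idea (my own, with assumed shapes), unrolling an m × n × c convolutional feature map into m·n location vectors and attending over them looks like this:

```python
# Attention over spatial locations of a CNN feature map (illustrative sketch).
import torch
import torch.nn.functional as F

m, n, c = 14, 14, 512                         # assumed conv feature map size
feat = torch.randn(m, n, c)                   # output of some convolution layer
V = feat.reshape(m * n, c)                    # unroll into m*n vectors of dimension c
s = torch.randn(c)                            # current decoder/query state (assumed same size)

scores = V @ s                                # one score per spatial location
alpha = F.softmax(scores, dim=0)              # attention over the m*n image regions
context = alpha @ V                           # context vector summarizing attended regions
attn_map = alpha.reshape(m, n)                # can be visualized over the image grid
```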

(Refer Slide Time: 28:35)

Another benefit of attention is that it gives you explainability of the final model. How so? If you have, say, a machine translation task, then when you generate a certain output word from the decoder RNN, your attention model, or your context vector, tells you which part of the input you looked at while predicting that word as the output. That automatically tells you which words in your input sequence corresponded to a word in your output.

So in this case, you can see that this particular sequence, 'European Economic Area', depended on 'zone économique européenne', which is highlighted by these white patches here. White means higher dependence; black means no dependence. Looking at this heat map gives you an understanding of how the model translated from one language to another.

1206
(Refer Slide Time: 29:55)

What about images, say the image captioning task? In this case too, you can use the same idea. Given an image, if the model is generating a caption, you can see that the model generates each word of the caption by looking at certain parts of the image. For example, when it says 'A', it seems to be looking at a particular part of the image; when it says 'woman', it seems to be looking at a certain part of the image, while the other objects are also of some relevance.

If you keep going, you see that when it says the word 'throwing', it seems to be focusing on the woman in the image; for the word 'Frisbee', it focuses on the Frisbee in the image; and for the word 'park', it focuses on everything other than the woman and the child. This gives you an understanding of, and trust in, whether the model is looking at the right things while giving a caption as output.

1207
(Refer Slide Time: 31:03)

What are the kinds of attention one can have? You could consider hard versus soft attention. What do these mean? In hard attention, you choose one part of the image as the only focus for giving a certain output; in image captioning, say, you look at only one patch of the image to be able to give a word as output. This choice of a position ends up becoming a stochastic sampling problem.

Hence, one may not be able to backpropagate through such a hard attention model, because that stochastic sampling step could be non-differentiable. We will see this in more detail in the next lecture. On the other hand, one could have soft attention, where you do not choose a single part of the image but simply assign weights to every part of the image. In this case, you effectively get a new weighted image, where each part of the image has a certain weight. Your output then turns out to be deterministic and differentiable, and hence you can use such an approach with standard backpropagation.

1208
(Refer Slide Time: 32:30)

Another categorization of Attention is Global versus Local attention. In global attention, all the
input positions are chosen for attention whereas in local attention, maybe only a neighborhood
window around the object of interest or the area of interest is chosen for Attention.

(Refer Slide Time: 32:55)

A third kind, which is very popular today, is known as self-attention, where the attention is not of a decoder RNN with respect to the encoder, or of an output RNN with respect to parts of an image, but of a part of a sequence with respect to another part of the same sequence. This is known as self-attention or intra-attention. And we will see this in more detail in a later lecture this week.

(Refer Slide Time: 33:32)

Your homework for this lecture is to read the excellent blog post by Lilian Weng titled 'Attention? Attention!', hosted on GitHub. And there is one question that we left behind: is there a connection between an autoencoder and Principal Component Analysis? Think about it and we will discuss this in the next lecture.

1210
(Refer Slide Time: 34:01)

References.

1211
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture - 57
Vision and Language: Image Captioning

(Refer Slide Time: 00:13)

We will now focus on one of the tasks that we discussed in the previous lecture, which is Image
Captioning.

(Refer Slide Time: 00:25)

1212
We left behind one question from the previous lecture: are autoencoders, which we understood as a dimensionality reduction technique, similar in some manner to PCA, Principal Component Analysis? The answer is yes; let us try to understand the connection. Suppose you normalize the data given to the autoencoder in a particular way, where $\hat{x}_{ij}$, the normalized data, is given by $\hat{x}_{ij} = \frac{1}{\sqrt{m}}(x_{ij} - \bar{x}_j)$, i.e., you subtract the mean of the dimension under consideration and scale by $\frac{1}{\sqrt{m}}$.

So what is this normalization doing? It is making the mean of each dimension 0; it is a standard normalization step. And when you do this kind of normalization, the covariance matrix, which is usually given by $\frac{1}{m}X^TX$, is now simply $\hat{X}^T\hat{X}$. This is because of the $\frac{1}{\sqrt{m}}$ factor that you have when you compute $\hat{x}$. Given this pre-processing, let us try to look at the autoencoder loss.

The autoencoder loss, with the parameters of your autoencoder summarized as $\theta$, is the mean squared error between your reconstruction and the original input. One can also write this loss in matrix form as $\min \|X - HW^*\|_F^2$, where the Frobenius norm $\|A\|_F^2 = \sum_i\sum_j a_{ij}^2$ is the squared sum of all elements of a matrix.

H here is the representation of the autoencoder in the middle layer, the bottleneck layer or code vector, and $W^*$ contains the weights of your decoder. So $HW^*$ is the reconstructed output from that intermediate representation, and X is the original input. Both of these are equivalent ways of writing the loss function. Why did we write it this way? Let us see a bit more.

We know that for a loss function written this way, the optimal $HW^*$ is given by the truncated SVD of X: $HW^* = U_{.,\le k}\,\Sigma_{k,k}\,V^T_{.,\le k}$, where you consider only the first k columns of U and $V^T$, corresponding to the non-zero (top-k) singular values. That is how we write this $HW^*$; it comes from the SVD decomposition of X.

So if this is indeed the solution, then one possible factoring is $H = U_{.,\le k}\,\Sigma_{k,k}$ and $W^* = V^T_{.,\le k}$. This is one possible factoring; let us consider this to be one of the outcomes of the neural network and move forward with it. So consider $H = U_{.,\le k}\,\Sigma_{k,k}$.

(Refer Slide Time: 04:03)

1213
Then H can be rewritten by multiplying with $(XX^T)(XX^T)^{-1}$, which we know is the identity: $H = (XX^T)(XX^T)^{-1}\,U_{.,\le k}\,\Sigma_{k,k}$. We are just taking the expansion of H and multiplying by an identity given by $(XX^T)(XX^T)^{-1}$. Now, from the SVD of X, remember that if $X = U\Sigma V^T$, then $X^T = V\Sigma^T U^T$.

That is what we use here: $XX^T = U\Sigma V^T\,V\Sigma^T U^T$, and $(XX^T)^{-1} = (U\Sigma V^T\,V\Sigma^T U^T)^{-1}$.

1214
(Refer Slide Time: 05:02)

Now, by looking at this, you can make out, again from the definition of SVD, that $V^TV$ is the identity, because V is an orthogonal (orthonormal) matrix, where $V^T = V^{-1}$. Hence $V^TV = V^{-1}V = I$. So that term collapses, and you are left with $XX^T = U\Sigma\Sigma^T U^T$.

(Refer Slide Time: 05:43)

Now, you can also write $(XX^T)^{-1} = (U\Sigma\Sigma^T U^T)^{-1} = U\,(\Sigma\Sigma^T)^{-1}\,U^T$: the first U comes out and you have $(\Sigma\Sigma^T)^{-1}U^T$. How did this happen? This comes from the inverse of the matrix product: remember, if you have three matrices A, B, C, then $(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$. Now $(U^T)^{-1}$ is U itself, $(\Sigma\Sigma^T)^{-1}$ stays as it is, and $U^{-1} = U^T$. Once again, you can interchange inverse and transpose because U is an orthogonal matrix in the SVD decomposition.

(Refer Slide Time: 06:35)

Substituting, $H = X\,(V\Sigma^T U^T)\,(U(\Sigma\Sigma^T)^{-1}U^T)\,U_{.,\le k}\,\Sigma_{k,k}$. Now the $U^TU$ in the middle, because U is orthogonal, is the same as $U^{-1}U$, which is the identity matrix. So that collapses, and you are left with the rest of the terms: $H = XV\Sigma^T(\Sigma\Sigma^T)^{-1}U^T\,U_{.,\le k}\,\Sigma_{k,k}$.

1216
(Refer Slide Time: 06:48)

And once again, $(\Sigma\Sigma^T)^{-1} = (\Sigma^T)^{-1}\Sigma^{-1}$, which is what we use here. Now $\Sigma^T(\Sigma^T)^{-1}$ is the identity and collapses, so you are left with $XV\Sigma^{-1}U^T\,U_{.,\le k}\,\Sigma_{k,k}$. Once again, $U^TU_{.,\le k}$, where U is constrained to its first k columns, is given by $I_{.,\le k}$, the identity constrained to its first k columns, which is what we write here.

(Refer Slide Time: 07:35)

So with that, we go on to the last step: $\Sigma^{-1}I_{.,\le k}$ is $\Sigma^{-1}$ constrained to its first k columns, and $\Sigma^{-1}_{.,\le k}\,\Sigma_{k,k} = I_{.,\le k}$, the identity constrained to its first k columns. So the whole expression can be written as $H = XV_{.,\le k}$, X times V constrained to its first k columns. Why is this derivation important?

(Refer Slide Time: 08:09)

This now tells us that H is a linear transformation $H = XW$ of X, where the weights W are, in this case, V constrained to its first k columns. What does this mean? From SVD, we know that given a matrix X and its SVD $U\Sigma V^T$, V is the matrix of eigenvectors of $X^TX$. Go back and see Assignment 0 in the course and refresh the basics if you are not aware of this.

So this means our encoder weights, which are given by V, are the eigenvectors of the covariance matrix. What does this tell you? It should remind you of PCA, where the projection matrix is the matrix of eigenvectors of the covariance matrix. So are autoencoders always a variant of PCA; do they become PCA once you learn the model? Not really; this happens only when your encoder and decoder are linear.

Remember, in this entire example, we did not talk about having any activation functions in the encoder or decoder. You also have to normalize the inputs the way we saw, and you have to use the mean squared error loss function. Under these conditions, the mapping learned by the encoder of the autoencoder is indeed the projection matrix of principal component analysis.
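A quick numerical check of this claim (my own sketch, not from the lecture): for the closed-form optimal linear autoencoder, the encoder weights $V_{.,\le k}$ span the same subspace as the top-k principal directions of the covariance matrix. All sizes and the random data below are assumptions.

```python
# Check that the optimal linear-autoencoder code H = X V_k matches the PCA projection.
import numpy as np

m, n, k = 200, 10, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))

# Normalize as in the lecture: zero-mean columns, scaled by 1/sqrt(m),
# so that X_hat^T X_hat is the covariance matrix.
X_hat = (X - X.mean(axis=0)) / np.sqrt(m)

# SVD of the normalized data; encoder weights = first k right singular vectors.
U, S, Vt = np.linalg.svd(X_hat, full_matrices=False)
W_enc = Vt[:k].T                    # n x k
H = X_hat @ W_enc                   # code / bottleneck representation

# PCA projection: top-k eigenvectors of the covariance matrix X_hat^T X_hat.
eigvals, eigvecs = np.linalg.eigh(X_hat.T @ X_hat)
P = eigvecs[:, ::-1][:, :k]         # top-k principal directions

# The two k-dimensional bases coincide (columns may differ only by sign).
print(np.allclose(np.abs(W_enc.T @ P), np.eye(k), atol=1e-6))
```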

(Refer Slide Time: 10:01)

1218
Having answered this question, let us come back to the topic of this lecture, which is Image
Captioning. Let us say we wanted to describe an image such as what you see here, or what you see
here, or what you see here. As humans, we can describe these images in words. How can we ensure
a computer or a Machine Learning model can do the same? How do we go about this?

1219
(Refer Slide Time: 10:34)

Ideally, we need some method, which can work on the images and give us this output. So given
some images, and given some captions, which describe those images, we give them as input to
some method, which now takes a new image and gives the output dog is looking outside the
window.

1220
(Refer Slide Time: 11:02)

Now, how do we go about coming up with this method? We look at the images and use vision-based deep learning models, and we look at the text and use NLP-based deep learning models, which would be RNNs in this case. So in this case, the computer vision part would be replaced by CNNs.

1221
(Refer Slide Time: 11:22)

The NLP part would be replaced by RNNs and you combine them to get your final output as the
caption.

(Refer Slide Time: 11:38)

Let us see each of these components in detail. Let us say we are in the training phase of such a model. You have an image, and you know the corresponding caption is 'a straw hat'. How do we train a relevant model? You take the image and give it as input to a CNN, and you take the text and use that to train an RNN. However, it is important here to be able to learn the interactions between your CNN and RNN, between the image and the text.

How do we do that? We briefly mentioned this in the previous lecture: you discard the last classification layers of your CNN, take the representation you have at the output of the CNN, and give that also as input to the RNN. So what happens to the RNN now? Earlier, your RNN was something like this: the hidden state $h_0$, if you use a ReLU, was given by $h_0 = \max(0, W_{xh}x_0)$, with some weight matrix $W_{xh}$ applied to your input $x_0$.

But now, you are going to have $h_0 = \max(0, W_{xh}x_0 + W_{ih}v)$, where the max with 0 comes from the ReLU activation function, $W_{ih}$ is a new weight matrix, and the vector v comes from the CNN's output.
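A tiny sketch of this modified first step (my own illustration; the dimensions are assumptions), where the CNN feature v is injected into the initial hidden state:

```python
# Injecting the CNN image feature into the RNN's first hidden state (illustrative).
import torch

vocab_dim, img_dim, hid_dim = 1002, 4096, 512      # assumed sizes (vocab incl. start/end, FC7, hidden)
W_xh = torch.randn(hid_dim, vocab_dim) * 0.01
W_ih = torch.randn(hid_dim, img_dim) * 0.01

x0 = torch.zeros(vocab_dim); x0[0] = 1.0           # one-hot start token
v = torch.randn(img_dim)                           # FC7 feature of the image from the CNN

h0_plain = torch.relu(W_xh @ x0)                   # earlier: h_0 = max(0, W_xh x_0)
h0_image = torch.relu(W_xh @ x0 + W_ih @ v)        # now:     h_0 = max(0, W_xh x_0 + W_ih v)
```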

(Refer Slide Time: 13:24)

What happens at test time? Given a different image, this image is given as input to the CNN, and the CNN's FC7 representation is given as input to an RNN along with a start token. Remember, for an RNN to start the output process, you have to give some starting token. There is no actual word there; you simply assume that, in addition to all the words in your vocabulary, you also have two special tokens known as the start token and the end token.

The start token is the input you give to the first time step of the RNN. So the RNN receives the start token as well as the FC7 representation of the input image, and together these result in a certain output $y_0$. Then $y_0$ is used as $x_1$ to generate a new word $y_1$, and $y_1$ is given as input $x_2$,

(Refer Slide Time: 14:42)

and that generates a new word. If that final word is the end token, you know you have completed your generation of words, which in this case is 'straw hat'. What does it mean to generate words? How does the RNN generate words? In practice, in the implementation, it does not generate words out of nowhere: you have a vocabulary, which could be a set of a thousand words or even more, and the RNN predicts a softmax over your vocabulary.

You then pick the word with the highest probability. There are a few things you can change in how you pick that word from the vocabulary based on the softmax, but the most standard way is to pick the word with the highest probability in your softmax vector. So in that sense, all of these are treated as classification problems; in this case, it is not image classes, but you have a vocabulary.

And the RNN predicts a softmax over each word in your vocabulary. What is the start token then? If you had a thousand words in your vocabulary, the start token and the end token would be two additional words, so you would now have 1002 words. When you give the start token, you set a 1 at that particular location and 0 everywhere else in the 1002-dimensional vector. This is also known as a one-hot vector,

where the currently active word is given a 1 and every other word is set to 0. That is how such an RNN is implemented in practice. But coming back to captioning: given the image, you get an FC7 representation (you could also take a representation from an earlier layer), which is given as input to the RNN, and the RNN generates the phrase 'straw hat'.
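Putting this test-time procedure together, here is a hedged sketch of greedy decoding; every name here (rnn_step, embed, W_out, the token ids) is a placeholder of my own, not the actual implementation.

```python
# Greedy caption decoding sketch (illustrative; rnn_step, embed, W_out are placeholders).
import torch

def greedy_decode(v, rnn_step, embed, W_out, start_id, end_id, max_len=20):
    """v: CNN image feature; rnn_step(x, h, v) -> new hidden state (assumed interface)."""
    h = torch.zeros(512)                        # assumed hidden size
    x = embed(torch.tensor(start_id))           # embedding of the start token
    caption = []
    for _ in range(max_len):
        h = rnn_step(x, h, v)                   # one RNN step conditioned on the image
        probs = torch.softmax(W_out @ h, dim=0) # softmax over the vocabulary
        word = int(probs.argmax())              # pick the highest-probability word
        if word == end_id:                      # stop when the end token is produced
            break
        caption.append(word)
        x = embed(torch.tensor(word))           # feed the predicted word back in
    return caption
```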

(Refer Slide Time: 16:58)

Here are a few results of such an approach. You can see that for this image, the approach predicts 'a group of people standing around a room with remotes', which is a fairly good caption; 'a young boy is holding a baseball bat', again a fairly good caption; and 'a cow is standing in the middle of a street', a fairly good caption again. The log probability, in a sense, shows the loss that you get for each particular test instance.

1225
(Refer Slide Time: 17:33)

On the other hand, here are some failure cases of such a model. In this particular case, for this image, the model predicts 'a man standing next to a clock on a wall'. For this image, the model predicts 'a young boy is holding a baseball bat'. And for this image, the model predicts 'a cat is sitting on a couch with remote control'. All of these are wrong captions, but one can understand why the method failed: it was reasonably close to the output but was perhaps misled by a few elements and the interactions between the elements in the images. There are also other failure cases,

1226
(Refer Slide Time: 18:18)

which are more blatant. So you see here an image with the caption 'a woman holding a teddy bear in front of a mirror', or for this image, you have 'a horse standing in the middle of a road'. For these cases, it is difficult to justify why the method failed. So how can we do better? How can we go beyond just CNNs and RNNs to improve image captioning performance?

We already know the theme and we already know the answer: we want to use attention to help improve the performance. Let us now see in more detail how attention can be used to improve the performance of this kind of approach.

1227
(Refer Slide Time: 19:08)

This method is based on a paper known as Show, Attend and Tell, published in ICML 2015, a very popular image captioning approach. In this approach, you have an input image from which you extract a certain feature map. Remember, we mentioned this in the last lecture too: you do not use the FC7 layers, because if you need attention, you have to retain the spatial properties of the representation.

Instead, you extract a certain convolution layer's output, in this case, say, a 14 x 14 feature map. Then you have an RNN with attention on this feature map, and you finally get word-by-word generation: 'a bird flying over a body of water'. Let us now see how the RNN with attention works in this particular method.

1228
(Refer Slide Time: 20:14)

Given an input image whose dimensions are H x W x 3, you give it as input to a CNN, and the feature map after a particular convolution layer, let us assume, is L x D, where L = W x H. Now, as we said, we give this as input to the RNN's first hidden state $h_0$.

(Refer Slide Time: 20:42)

So $h_0$, in addition to, let us assume, a start token, receives this feature map as input, and $h_0$ produces, this time, a distribution over locations. Recall, we just said that an RNN outputs a distribution over a vocabulary. This time, we are going to ask the RNN to generate a distribution over L possible locations.

What are these locations? These are the locations of each of the grid cells in the feature map. So we are going to ask the RNN: can you give me a softmax vector that distributes attention over each of these grid locations? What do we do with this $a_1$? $a_1$ is a softmax vector whose length is the number of grid locations you have; that could be H x W, which we denote by the single value L here.

Now we are going to use $a_1$ to weight each of these locations in the feature map, and that gives a weighted combination of features, which we denote as $z_1 = \sum_i a_{1,i}\, v_i$. The $v_i$'s, remember, are the features at each of these grid locations in the output feature map. And that $z_1$ now becomes the input to the RNN along with the first word, or the start token.

Now $h_1$, the hidden state of the next time step in the RNN, gives us two outputs: $a_2$, which is once again a distribution over the L locations, and $d_1$, which is a distribution over the vocabulary that tells us the predicted word. What do we do with these two? $a_2$ is once again multiplied with the feature map to give us a new weighted representation of the image feature map, $z_2$.

And $d_1$, the word generated, could be the input $y_2$; or, when you are training, you could just give the second word of the ground-truth caption as input $y_2$ to the next step of the RNN. So $z_2$ and $y_2$ are the inputs to the hidden state of the next step in the RNN. Once again, this generates $a_3$ and $d_2$: $d_2$ is the distribution over the vocabulary, which tells you the word, and $a_3$ tells you which part of the image to focus on next.
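As a rough sketch of one such decoder step (my own illustration, not the authors' code; all sizes and the GRU cell are assumptions), the hidden state produces both a location distribution and a word distribution, and the attended features feed the next step:

```python
# One decoder step with attention over L feature-map locations (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

L, D, hid, vocab = 196, 512, 512, 1002           # assumed: 14x14 locations, feature/hidden/vocab sizes
V = torch.randn(L, D)                            # feature vectors v_i for each grid location
W_att, W_out = torch.randn(L, hid), torch.randn(vocab, hid)
rnn = nn.GRUCell(D + 300, hid)                   # input = [attended features; word embedding]

h = torch.zeros(1, hid)                          # current hidden state h_t
y = torch.randn(1, 300)                          # embedding of the current input word y_t
a = F.softmax(W_att @ h.squeeze(0), dim=0)       # a_{t+1}: distribution over the L locations
d = F.softmax(W_out @ h.squeeze(0), dim=0)       # d_t: distribution over the vocabulary (word)
z = a @ V                                        # z_{t+1}: attention-weighted image features
h = rnn(torch.cat([z.unsqueeze(0), y], dim=1), h)  # next hidden state h_{t+1}
```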

1230
(Refer Slide Time: 23:52)

This is repeated again and again, and the caption is generated. As we said in the previous lecture, there are two ways of going about this attention model: soft attention and hard attention.

(Refer Slide Time: 24:13)

In soft attention, the weight vector $a_1$ is a softmax vector, and you use its entries as weights to multiply each of the grid locations in the feature map. So you have $z_1 = \sum_i a_{1,i}\, v_i$, with each $a_{1,i}$ multiplying $v_i$ depending on which entry you are looking at. This gives you soft attention over the feature map of the entire image. What, then, is hard attention? Hard attention is when you look at that particular $a_1$ softmax vector, pick one winning entry, and use only that grid location as the input to the next time step. This is known as hard attention.

Because you are not using a soft distribution of weights across all parts of the image, you choose a winning entry and pick only one part of the image, one particular grid location, maybe this location alone, as the input $z_1$ to the next time step. Are there advantages and disadvantages? Yes, indeed: hard attention requires an argmax at this step, because you want to select the location that was the winning entry of the softmax vector.

And the argmax is non-differentiable, so you may not be able to use backpropagation to train the hard attention approach. How is it implemented then? It is implemented using reinforcement learning methods, which we will not focus on at this time. The soft attention method, on the other hand, is a weighted combination of all locations of the grid, which is deterministic, can be backpropagated through, and hence can be learned through backpropagation.
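The contrast can be summarized in a few lines of code (my own sketch; shapes are assumed):

```python
# Soft vs. hard attention over L location features (illustrative sketch).
import torch

L, D = 196, 512
V = torch.randn(L, D)                       # feature vectors of the grid locations
a = torch.softmax(torch.randn(L), dim=0)    # attention weights a_1 (softmax vector)

# Soft attention: deterministic weighted sum -> differentiable, backprop-friendly.
z_soft = a @ V

# Hard attention: pick a single winning location -> the sampling/argmax step is
# non-differentiable, so training typically needs reinforcement-learning-style methods.
idx = torch.multinomial(a, num_samples=1)   # stochastic sample (or a.argmax() for the winner)
z_hard = V[idx.item()]
```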

So how do you learn such a network using backpropagation? Remember, you have a distribution over locations and a distribution over the vocabulary. For the distribution over the vocabulary, you know the word to be predicted, so you have a cross-entropy error there. Similarly, for each step of the RNN, you have a cross-entropy error, and using that you can learn all the weights in the RNN.

If required, you can also backpropagate that error back to the CNN to fine-tune the CNN's weights. The loss here would be the sum of the cross-entropy losses of all time steps.

1232
(Refer Slide Time: 27:08)

Here is a visualization of soft attention versus hard attention. You can see that in soft attention, you spread your attention over different parts of the image. When the method generates 'a bird flying over a body of water', you can see that when it generates the word 'bird', its attention is on the bird. And gradually, as it starts generating 'a body of water', it looks at everything other than the bird; if you see this part of it,

the bird is black and the rest of the image is white, which means the attention was suggesting that the model look at everything outside the bird to generate the words 'body of water'. In the case of hard attention, remember, the model focuses only on one patch. When it comes to 'bird', it does focus on one part of the image that corresponds to the word, and when it comes to 'body of water', it focuses on one part of the image that corresponds to the body of water.

But the localization may not be as evident, because we are choosing only one single part that is
the winner of that softmax of a1 vector.

1233
(Refer Slide Time: 28:33)

Here are some results of this approach, using image captioning with attention based on the paper called Show, Attend and Tell. In this case, you can see that for this image, the caption generated by the model was 'a woman is throwing a Frisbee in a park', and you can see that when the word 'Frisbee' was generated, the attention was completely on the Frisbee. Here are a few more examples: 'a dog is standing on a hardwood floor',

and when the model generated 'dog', the attention was completely on the dog; 'a stop sign is on a road with a mountain in the background', with the attention on the stop sign when it generated the words 'stop sign'; and so on. The last image here is a challenging one, where the generated caption says 'giraffe standing in a forest with trees in the background', and when it generated the word 'trees', it looked at everything other than the giraffe, which substantiates the choice in this particular context.

1234
(Refer Slide Time: 29:44)

In terms of more recent efforts (Show, Attend and Tell was published in 2015), image captioning has been explored in a few different ways; let us see a couple of examples. In one work from 2017, the idea was to see if the performance of image captioning could be boosted with attributes in LSTMs. Suppose you have an image I and a set of words W that form the caption.

Then $x_{-1}$ is some transformation $T_v$ applied to the image; that could just be a set of weights. And $x_t$, the input at every time step, is a transformation of each word of the caption. This is the vanilla image captioning approach that we have seen so far, where the image is processed and given to the first step of the LSTM, and each word of the caption is given as input at each time step of the unrolled LSTM.

And $h_t$, the hidden state, processes the input at each time step using some function f. This is the vanilla version of using an image representation and the caption words to train the model. So this particular work tried a few different variants.

1235
(Refer Slide Time: 31:22)

Let us see each of these variants. In the first variant, which we denote A1, they considered transforming the images into attributes. Given an image, you could use an image classifier, or you could use attributes provided as metadata with the image. The attributes associated with this image are 'boat', 'water', 'man', 'riding', 'dog', 'small', 'person' and 'river', with different probabilities.

These become an attribute vector, drawn from a set of attributes. In this approach, $x_{-1}$, the input you give to the first hidden state, is the attribute vector and not the image. So the image is not given as input to your captioning model; only the attributes extracted from the image are. Why do this? Because the output is a caption anyway, you may as well give a textual version of the image as input; the rest stays the same. This was variant A1.

1236
(Refer Slide Time: 32:44)

In a second variant, A2, both the image and the attributes are given as input to the image captioning model. The image is first given to one step of the LSTM, and in the subsequent time step, the attributes are given to the LSTM. So you have an $x_{-2}$, which is based on the image, and an $x_{-1}$, which is based on the attributes. Once again, $T_v$ and $T_A$ are transformations given by learned weights.

You could imagine that $T_v$ would be one set of weights and $T_A$ another set of weights, both learned when you backpropagate from the output. This was the variant known as A2.
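A schematic of how this variant might be wired (my own sketch of the idea, not the authors' code; all layer sizes are assumptions):

```python
# Schematic of variant A2: image first, then attributes, then caption words (illustrative).
import torch
import torch.nn as nn

img_dim, attr_dim, emb_dim, hid = 2048, 1000, 512, 512     # assumed sizes
T_v = nn.Linear(img_dim, emb_dim)          # learned transformation of the image feature
T_a = nn.Linear(attr_dim, emb_dim)         # learned transformation of the attribute vector
lstm = nn.LSTMCell(emb_dim, hid)

image_feat = torch.randn(1, img_dim)       # CNN feature of the image
attr_vec = torch.rand(1, attr_dim)         # attribute probabilities (boat, water, man, ...)
caption_embs = torch.randn(5, 1, emb_dim)  # embeddings of the caption words

h, c = torch.zeros(1, hid), torch.zeros(1, hid)
h, c = lstm(T_v(image_feat), (h, c))       # x_{-2}: image
h, c = lstm(T_a(attr_vec), (h, c))         # x_{-1}: attributes
for x_t in caption_embs:                   # x_t: caption words, one per time step
    h, c = lstm(x_t, (h, c))
```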

1237
(Refer Slide Time: 33:37)

A3 was a variant where the attributes were given first and the image next; it is a variant of A2 with the inputs swapped: first the attributes, then the image.

(Refer Slide Time: 33:51)

Then came a variant A4, where at $x_{-1}$ only the attributes are given, but at each time step you give both the word of the caption and the image as input. So: the attributes initially, and the image at each time step. This was variant A4.

1238
(Refer Slide Time: 34:21)

And A5 was a variant where the image was given as input at $x_{-1}$ and the attributes were given along with the caption at each time step. This was variant A5. This particular work, by Yao et al., published in ICCV 2017, concluded that variant A1 was better than the plain LSTM, which indicated the advantage of using high-level attributes instead of image representations.

Variant A2, where the image representations were also integrated with the attributes, helped further improve performance. Similarly, by feeding the attributes first and the image later, A3 performed better than A2. Then came a surprise: A4 did not perform as well as A3.

(Refer Slide Time: 35:30)

Recall if you go back and see what A4 was, A4 was where you provided attributes at x minus 1,
and the image as input at every time step.

1240
(Refer Slide Time: 35:42)

Unfortunately, this did not perform as well, and a possible hypothesis for why is that any noise in the image could now be accumulated at each time step of the RNN; hence, the model could be suffering from some overfitting and not learning the relationship between the image and the text well. However, when this was swapped in A5, where you have the image given at $x_{-1}$ and the attributes given at each time step,

1241
(Refer Slide Time: 36:25)

Now, this model performed better than A3. And that gives the conclusion that image
representations, as well as attributes, could be complementary. And both could help improve image
captioning performance, depending on how you use them in the model.

(Refer Slide Time: 36:46)

Another interesting approach for image captioning is known as StyleNet, which tries to change the tone of the caption. If you had an image whose original caption was 'a man on a rocky hillside next to a stone wall', one could now give a romantic version of it, which says 'a man uses rock climbing to conquer the height', or a humorous version, which is 'a man is climbing the rock like a lizard'. How do you change this tone in the caption?

(Refer Slide Time: 37:30)

1243
To do this, recall our standard LSTM, where we had the input gate, forget gate, and output gate. This particular approach suggests a change in the weight matrices of the LSTM: the input-to-hidden weight matrix $W_x$ in particular can be written as $W_x = U\,S_x\,V$, similar to an SVD-style factorization. They call this a factored LSTM. How does this help?

(Refer Slide Time: 38:08)

Now, given an input image and a caption, you send the image through a CNN and then use a factored LSTM, where the weights are $W_x = U\,S_F\,V$ for the factual caption.

1244
(Refer Slide Time: 38:32)

For the romantic caption, $W_x$ would be $U\,S_R\,V$, and for the humorous caption, it would be $U\,S_H\,V$. Each of these matrices U, $S_F$, $S_R$, $S_H$ and V is learned while training the model.
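A minimal sketch of this factorization idea (my own illustration, assuming arbitrary sizes; only the style-specific middle matrix changes):

```python
# Style-factored input weights, W_x = U @ S_style @ V (illustrative sketch of the idea).
import torch

emb, hid, r = 300, 512, 64                       # assumed embedding, hidden, and factor sizes
U = torch.randn(4 * hid, r) * 0.01               # shared factor (4*hid for the LSTM's four gates)
V = torch.randn(r, emb) * 0.01                   # shared factor
S = {                                            # style-specific middle matrices
    "factual":  torch.randn(r, r) * 0.01,
    "romantic": torch.randn(r, r) * 0.01,
    "humorous": torch.randn(r, r) * 0.01,
}

def input_weights(style):
    return U @ S[style] @ V                      # swap only S at test time to change the tone

x_t = torch.randn(emb)                           # current word embedding
gate_preact = input_weights("romantic") @ x_t    # feeds the LSTM gate pre-activations
```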

1245
(Refer Slide Time: 38:56)

But now, at test time, we can swap the $S_x$ accordingly to get the desired effect on the caption. So at test time, given an image such as this, you could have a factual caption, which states 'a snowboarder in the air', a romantic caption, which says 'a man is doing a trick on a skateboard to show his courage', or a humorous caption, which is 'a man is jumping on a snowboard to reach outer space'.

The tone may not be completely convincing, but it gives an idea of how you can vary it by factoring the weight matrix in the LSTM and then changing the S matrix in the middle to get different tones in the caption. This was StyleNet.

1246
(Refer Slide Time: 39:55)

Another recent work is known as Neural Baby Talk, from CVPR 2018, which again provides a different spin on how image captioning is performed. In a more traditional computer vision sense, before deep neural networks came along, captioning would have needed specific kinds of approaches, such as Deformable Part Models and Conditional Random Fields.

We will not get into those details now, but using such methods, which are based on graphical models, you could say that 'this is a photograph of one dog and one cake', and maybe the caption continues after that. In a more natural sense, with today's deep neural networks, given an image, you provide it to a CNN and then an RNN, and you get 'a dog is sitting on a couch with a toy'.

Now, can we go somewhere in between? That is what this method tries to do. Neural Baby Talk says: let us try to use detection models to give better caption generation performance. Given an image, you have a detector, you get your region features, and you give these region features to an RNN to get your final generation, which states 'a [box] with a [box] is sitting at a [box] with a [box]', where each of these boxes is a different bounding box or region proposal.

The first yellow box is a puppy, the second one is a tie, the green one is a cake, and the blue one is a table. So: a puppy with a tie is sitting at a table with a cake. That is what would be done if you use detection here.

1247
(Refer Slide Time: 41:57)

Let us see this model in more detail. In the training phase, given an image, using an object detector such as a Faster R-CNN model or a Mask R-CNN model, you get a set of region proposals, on which, if you use RoI Align (which comes from Mask R-CNN), you get a set of aligned region proposals. Similarly, you also have a caption given in your training dataset.

You transform that to get a certain embedding of the caption. By embedding, we mean a representation of the caption, which is learned. How do we learn it? Using some weights obtained through backpropagation; that is what we refer to as an embedding here. Those embeddings are given to an RNN with attention, along with each of these region proposals. You then have two parts.

One, you have the bottom part here, which contains a softmax and gives you the textual output; that is the caption part, which you would use at inference or test time. In the second part, using the RNN with attention, you decide which word in the caption you are currently focusing on, and you take all these region proposals.

Together, you construct a vector, which is then given to a model that predicts the label of that particular region in the image. So by combining the regions and the textual prediction, one can predict the word, and one can also say which part of the image was responsible for predicting the word.

1248
(Refer Slide Time: 43:49)

The homework for this lecture is to read the papers on the respective slides. It is all right if you did not understand every paper; this lecture was meant to give you an idea of different image captioning methods. And if you would like to get some hands-on experience, in addition to the assignment that we have, there is a nice notebook on image captioning released by Georgia Tech.

The question for this lecture: can we now do the opposite? Can we go from caption to image, as opposed to what we learned in this lecture? How do we do this? Think about it and we will discuss it next time.

1249
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture - 58
Going Beyond Captioning: Visual QA, Visual Dialog

(Refer Slide Time: 00:14)

The relevance of RNN models and attention models in computer vision becomes pronounced when you look at problems at the intersection of vision and language. This setting results in many sequence learning problems. We saw one in the last lecture, image captioning. We will now go even further and talk about tasks such as visual question answering and visual dialogue.

1250
(Refer Slide Time: 00:51)

One question that we left behind was, instead of image captioning, can we do the opposite, go
from caption to image, using any models that we have seen so far? We can, but not using the
models that we have seen so far. We will see generative models quite soon. And see how we can
use them to do caption to image generation.

(Refer Slide Time: 01:19)

1251
Visual question answering can be thought of as an extension of the image captioning task. The overview of the task is: given an image such as what you see here, and given a question such as 'What is the mustache made of?', the goal is for our system, a deep learning model, to give the answer 'bananas'. So you need an understanding of the question, an understanding of the image, and of the relevance of that question to a specific part of the image, and then be able to give a reasonable answer.

(Refer Slide Time: 02:05)

Here is a demo of how this VQA task works. Given an image and the question 'Where is the kid sitting?', if you had a well-trained model for this task, the kind of answer you would get is 'sink' with 99 percent confidence, which is true if you look at the image. This is a demo from a website known as CloudCV, developed by Dhruv Batra and Devi Parikh at Georgia Tech. You are welcome to go to this link and try out visual question answering on your own images.

1252
(Refer Slide Time: 02:49)

Before we go forward and talk about what models you can use to solve VQA problems (we will be looking at combinations of CNNs, RNNs and attention; when we say RNNs, we also subsume LSTMs and GRUs under the same term), let us discuss the kind of datasets one needs to solve VQA problems. So far, we have needed image classification datasets, image detection datasets, image segmentation datasets and, to some extent, what we saw in the last lecture, image captioning datasets.

In each of those, the dataset expected to train the model is self-evident. For this task, VQA, things are a bit more complex, so let us see a few datasets that have been developed to address this problem. The first dataset, known as the VQA dataset, was developed in 2015. It has images and questions such as what you see in this image here. Given an image, there are two questions: 'What color are her eyes?' and 'What is the mustache made of?'

Similarly, for the second image: 'How many slices of pizza are there?' and 'Is this a vegetarian pizza?' For the third image: 'Is this person expecting company?' and 'What is just under the tree?' And for the last one: 'Does it appear to be rainy?' and 'Does this person have 20/20 vision?' As you can see, for each of these questions, the model needs to understand the question as well as the image and the relationship between the image and the question.

1253
In this VQA dataset, you can see that there are open-ended answers and also multiple-choice answers, where you are given a set of options and the model has to choose one of them as the output. Why is this important? It means the task for the model becomes classification again: given a set of options, the model has to output a softmax, or probability vector, over the set of options.

The option with the highest probability is the predicted output of the model. This VQA dataset has about 250,000 images, obtained from a dataset known as MS COCO (Microsoft Common Objects in Context), which was developed for the image captioning task. That dataset is extended for the VQA dataset, plus about 50,000 abstract images such as these.

There are a total of about 750,000 questions across these images; as you can see, given a single image, you can have multiple questions. And there are a total of about 10 million answers, because for each question you need multiple answers to choose from, and each question is answered by 10 human annotators. This is the VQA dataset.

(Refer Slide Time: 06:45)

Another dataset developed in 2015 is known as the COCO-QA dataset. This again used the Microsoft COCO dataset; as I just mentioned, Microsoft COCO was an image captioning dataset. So COCO-QA used those captions to automatically generate question-answer (QA) pairs.

So, for example, given this image, if there was a caption, the caption is now converted into a question and answer. The question could be 'How many leftover doughnuts is the red bicycle holding?' and the answer is 'three'. 'What is the color of the T-shirt?' The answer is 'blue'. And 'Where is the gray cat sitting?' The answer is 'window'. You can see that one can obtain question-answer pairs from a caption.

The only constraint here is that the answer is a single word, which is true of many VQA datasets. This dataset has about 118,000 question-answer pairs on 123,000 images. There are four types of questions in the dataset: what object, how many, what color, and where.

(Refer Slide Time: 08:23)

In subsequent years, in 2016, another popular dataset known as Visual7W was developed. This dataset has images and questions such as what you see here: 'What endangered animal is featured on the truck in this image?' The answer is a bald eagle, and the other options are a sparrow, a hummingbird, and a raven. Similarly, 'Where will the driver go if turning right?' If you look at this particular image here, you could then say 'onto 24 3 4th Road', and then you have other options, which are similar road names that may be relevant.

1255
The 7W in the title of the dataset stands for What, Where, When, Who, Why, How, and Which. Finally, there is a different kind of question compared to other datasets: given the image at the bottom left, the question could be 'Which pillow is farther from the window?', and the answer would be what you see in this yellow box here. 'Which step leads to the tub?' Once again, the answer would be the yellow box here. 'Which is the small computer in the corner?' Once again, the answer is in the yellow box here. That is the reason it is called Visual7W.

This dataset has about 328,000 question-answer pairs on around 47,000 images. As we just saw, there are two kinds of tasks: the telling task, which is what we saw in the top row, and the pointing task, where we have to point to a particular part of the image to answer the question.

(Refer Slide Time: 10:35)

Another dataset, more popular in recent times, is known as CLEVR and was developed in 2017. This is a semi-synthetic dataset, which helps evaluate reasoning questions, so the questions can be a bit complex. Given an image such as what you see in the top left here, the question is: 'How big is the gray rubber object that is behind the big shiny thing, behind the big metallic thing that is on the left side of the purple ball?'

That sounds like a complex question, but if one reasons it out, we are talking about this gray object at the back; this particular one is what we are talking about. One has to understand several subparts of the question to be able to answer such reasoning questions. The CLEVR dataset has about 100,000 rendered images and 1 million automatically generated questions. The answers to these questions are single-word answers, but to get the answer, one needs to parse the question very carefully. The questions are indeed complex and require reasoning skills.

(Refer Slide Time: 12:07)

Having seen a few of the datasets that are used for visual question answering, let us now see a few models that have been developed for addressing the VQA problem. One of the baseline models you can think of for visual question answering is the combination of an LSTM and image features that you get out of a CNN. So given an image, you have a CNN; in this case, you see a VGGNet.

At the end of the CNN, you get a fully connected layer's representation, which in this case is 1024-dimensional. Similarly, for the question, you pass each word through the different time steps of an LSTM. The output of the LSTM is a vector, again of 1024 dimensions. You concatenate these two and send them through a few fully connected layers to make a final prediction of what the answer should be.

If your answers are among a set of options, you would have a cross-entropy loss, and all the parameters of every part of the network that you see here can be learned by backpropagation on that cross-entropy loss. So I hope you understand how you can train these architectures.

Although we are combining the abstractions that we have learned so far (CNNs, RNNs, attention) and mixing and matching them to solve these problems, I hope you can take away that none of these components prevents you from using backpropagation to learn the weights of the network. Using backpropagation and gradient descent stays constant through all of these kinds of architectures.
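
To make this wiring concrete, here is a minimal PyTorch sketch of such a CNN-plus-LSTM baseline. The layer sizes, vocabulary size, and number of answer options are illustrative assumptions, not the exact values used in the original work.

import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    """CNN image features + LSTM question encoding, concatenated and classified."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=1024,
                 img_feat_dim=1024, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(img_feat_dim + hidden_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_answers),
        )

    def forward(self, img_feat, question_tokens):
        # img_feat: (B, img_feat_dim) from a pretrained CNN (e.g. a VGG fc layer)
        # question_tokens: (B, T) integer word indices
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_feat = h_n[-1]                          # (B, hidden_dim): last hidden state
        fused = torch.cat([img_feat, q_feat], dim=1)
        return self.classifier(fused)             # logits over the answer options

Training would then apply nn.CrossEntropyLoss to these logits against the index of the ground-truth answer, and backpropagation updates every component end to end.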

(Refer Slide Time: 14:12)

Another baseline model developed in 2015 for visual question answering was again a simple method, where, given an image, an image feature was extracted from the output of one of the layers of a CNN, and for the question, a simple bag-of-words frequency histogram was used as the text input.

These two were concatenated, and using a simple neural network, the model predicted the right answer among a set of options. Having seen these simple baseline models, let us now see if we can improve upon them by building more complex models.

(Refer Slide Time: 15:05)

One of the efforts in this direction appeared at CVPR 2016, known as Stacked Attention Networks for Image Question Answering. In this particular model, given an image, you get a representation out of a CNN. Very similar to the attention models that we talked about, on top of the CNN you take a certain convolutional layer's feature map; you would then be able to map that feature map to spatial locations in the original image.

Now you have feature vectors corresponding to each part of your image. You then take the question and pass it through an LSTM. You can use convolution if you like; we will see this a couple of slides later. The output of the LSTM gives you a representation of the question. This question representation is compared with the feature vectors of different parts of the image, and between the image feature vectors and the question representation, the model gives us a certain attention map over different parts of the image.

Based on this attention map and the representation of the question, there is another level of attention that is performed on the image feature vectors of the different parts again. That leads to a different representation, which we expect captures the answer. That representation is finally taken to the output layer, where a softmax over the answer options gives you the outcome.
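
As a rough illustration of one such attention layer (not the exact implementation from the paper; the dimensions and module names here are assumptions), the computation could look like this in PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttentionLayer(nn.Module):
    """One attention hop over image regions, guided by the question vector."""
    def __init__(self, feat_dim=1024, att_dim=512):
        super().__init__()
        self.w_img = nn.Linear(feat_dim, att_dim, bias=False)
        self.w_qst = nn.Linear(feat_dim, att_dim)
        self.w_att = nn.Linear(att_dim, 1)

    def forward(self, img_feats, q_vec):
        # img_feats: (B, R, feat_dim), R spatial regions from a conv feature map
        # q_vec:     (B, feat_dim), question representation from the LSTM
        h = torch.tanh(self.w_img(img_feats) + self.w_qst(q_vec).unsqueeze(1))
        alpha = F.softmax(self.w_att(h).squeeze(-1), dim=1)     # (B, R) attention map
        attended = (alpha.unsqueeze(-1) * img_feats).sum(dim=1)
        return attended + q_vec, alpha    # refined query for the next attention hop

Stacking two such layers and sending the final refined query vector through a softmax over the answer options gives the overall architecture described above.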

(Refer Slide Time: 16:56)

Let us see a visualization. In this case, if you have the original image, what the stacked attention network does is that the first attention layer focuses on all the concepts referred to in the question. We saw in the previous slide that the first attention layer decides which parts of the image to focus on, and it focuses wherever there are concepts that are present in the question. In this case, that could be a bicycle, a basket, and objects in the basket. Here, the question was: what is sitting in the basket on a bicycle?

(Refer Slide Time: 17:39)

The second attention layer, based on the attention that you get in the first attention layer, chooses to focus only on the part of the image that contains the answer, and that is what helps us finally predict the outcome. So, in the second step, you narrow down the focus of the attention to only the part of the image that contains the answer, and hence the model can predict the answer among the set of options as a dog.

(Refer Slide Time: 18:19)

This leads us to a follow-up question. In the model that we just saw, we performed attention only on the image; we took the question representation as it is and tried to use a relationship between the question representation and different parts of the image based on attention. Even the attention that we just used on the previous slide was soft attention, which weights different parts by certain weights that are learned, and that facilitates backpropagation. But now we ask the question: can we also introduce attention in the question?

(Refer Slide Time: 19:06)

This was done in a model known as the Hierarchical Co-Attention model, published at NeurIPS 2016. In this particular model, as the name suggests, we have co-attention, which means we attend both to parts of the image as well as to parts of the question. And the hierarchical part comes from the fact that we first get an embedding at the word level of the question, then at the phrase level of the question, and finally for the entire question itself.

Let us see this in more detail. Given a question, we extract its word-level embeddings, phrase-level embeddings, and question-level embeddings. At this point, you can assume that this is something you get by passing the question through some neural network. So what is the input in these cases? Do you give the raw text as input? Not really. Remember, for all of these problems that involve text, to a large extent, you have a vocabulary of words.

Each word in your question corresponds to a one-hot vector over your vocabulary, where you put a one at the index in the vocabulary that this word belongs to, and zero everywhere else. That would be the input you give for a particular word. So you extract word-level, phrase-level, and question-level embeddings. At each level, you apply co-attention: for a word-level embedding, you apply attention to the question as well as to the corresponding image.

Similarly for the phrase level and the question level; the final answer is based on all the co-attended image and question features. So if you look carefully here, when we say 'what color on the stoplight is lit up', you can see that this is the question, and the ideal answer you would expect is green. You first take word-level embeddings and attend to one part of the question; in this case, you can see that the attention is focused on 'stop'. At the same time, the attention on the image also seems to be focused somewhere around the traffic light. At the next higher level of the hierarchy, the attention is focused on different phrases in the question, and in this case you can see the maximum weight goes to the phrase 'the stoplight'.

Once again, you can see here that the attention on the image is a bit more confident now and focuses completely on the stoplight. Then, at the highest level of the hierarchy, the question attention map tells us that 'light' is what we are focusing on, and at that stage the attention on the image focuses only on the light inside the traffic signal. That results in the answer green.

(Refer Slide Time: 22:35)

So let us understand the question hierarchy in some more detail. At the word level, you get a word-level embedding, as we just saw. How do we get the phrase-level and question-level embeddings? To get the phrase-level embedding, you take the embeddings of each word of your question and perform convolution across those inputs with multiple filters of different widths.

You will then get vector representations such as these, which are what we call phrase-level embeddings. We then perform a max pool over the convolution outputs, and we get a question-level embedding through an LSTM to which we give that max-pooled output. That gives us a representation at the question level.
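
A minimal sketch of this question hierarchy, assuming convolution widths of 1, 2 and 3 and illustrative dimensions, could look as follows:

import torch
import torch.nn as nn

class QuestionHierarchy(nn.Module):
    """Word -> phrase (multi-width 1-D conv + max) -> question (LSTM) embeddings."""
    def __init__(self, embed_dim=512):
        super().__init__()
        # Convolutions over the word sequence with window sizes 1, 2 and 3.
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, embed_dim, kernel_size=k, padding=k // 2)
            for k in (1, 2, 3)
        ])
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)

    def forward(self, word_embs):
        # word_embs: (B, T, embed_dim) word-level embeddings of the question
        x = word_embs.transpose(1, 2)                        # (B, embed_dim, T)
        phrase = torch.stack(
            [conv(x)[..., :x.size(-1)] for conv in self.convs], dim=0
        ).max(dim=0).values.transpose(1, 2)                  # max over filter widths
        question, _ = self.lstm(phrase)                      # (B, T, embed_dim)
        return word_embs, phrase, question

Each of the three returned levels is then co-attended with the image features, as described above.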

(Refer Slide Time: 23:31)

Here are some results. You see here the image, and the question is: what is the man holding a snowboard on top of a snow covered ...? The answer is mountain. The other example is: what is the color of the bird? The answer is white. Here are the word-level co-attention maps. In this case, the part of the question being focused on is 'snowboard', 'top', 'snow-covered', and the image focus at that point is a bit on the snowboard and a bit on the person.

And in the other case, the focus is on the color of the bird, and the model attends to the part of the image containing the bird. For phrase-level co-attention, the focus in the first example is on 'holding snowboard' and 'snowboard on top of snow-covered', and the focus in the image is shown with red being high and blue being low, similar to the heat maps that we saw with explanation methods. So you see that the model is focusing on parts of the image other than the person.

Similarly, for 'what is the color of the bird', the model seems to be focusing on the area just around the bird to be able to isolate the color. And finally, when the model comes to the question-level attention map for 'what is the man holding a snowboard on top of a snow covered ...?', the model looks at everything other than the person to be able to get the answer, mountain. And in the other example, 'what is the color of the bird', the model is looking exactly at the bird to answer, white.

(Refer Slide Time: 25:24)

We will now move on to the second task that we planned to discuss in this lecture, which you can consider an extension of Visual QA: Visual Dialog. The overview of the task is given in the illustration here; this was first done in 2017. Given this image, and given this caption, which says 'a cat drinking water out of a coffee mug', visual dialog is like a chatbot, but a chatbot that asks and answers questions with respect to an image rather than simple free text.

So let us see an example. What color is the mug? The bot answers: white and red. So you can see the mug there. Are there any pictures on it? No, something is there, cannot tell what it is. Are the mug and cat on a table? Yes, they are. Are there other items on the table? Yes, magazines, books, a toaster and basket, and a plate. And as you can see, there is a toaster in the background and there is a basket; this is a fairly comprehensive list for a machine to produce.

(Refer Slide Time: 26:53)

So here is a demo of how visual dialog works. Given this image, the bot says: a man and a woman are holding umbrellas. The question that a human can ask is: what color is his umbrella? The bot answers: his umbrella is black. What about hers? This is a tricky question, because the model has to understand that the question is about the umbrella, and secondly, that the question is about her umbrella.

This means the model needs to know that there are two people, one of them male and one of them female, and then be able to answer this question. And the bot does say: hers is multicolored. How many other people are in the image? I think three, they are occluded. Once again, this demo comes from Cloud CV, developed by Dhruv Batra and Devi Parikh at Georgia Tech. You are welcome to go to this link, try out visual dialog, and see for yourself how it works.

(Refer Slide Time: 28:01)

Let us now formally define how you would go about solving this task. This is not a traditional classification problem; it does not even look like a traditional sequence learning problem where you give out a caption for an image. So how would you go about formalizing and solving it? Given an image I, you have a dialogue history of question-answer pairs (Q1, A1), (Q2, A2), and so on, up to (Q t-1, A t-1).

Those are your previous question-answer tuples, and you have a follow-up question Qt that is current at this time. In this case, the question is: is the other one holding anything? The task here is to produce a free-form natural language answer At that answers this current question Qt.

(Refer Slide Time: 29:00)

One question you could ask here is: how do you evaluate the answer? Evaluation here can be tricky, and typical scores used in natural language processing, such as BLEU (which we saw earlier), METEOR, and ROUGE, which we will not get into now, are known to often correlate poorly with human judgment. For example, you could have multiple answers phrased in different ways that could answer a question, but not all of them may score well against one single ground truth answer.

So how do you address this? You could have what are known as human Turing tests, where you show a generated sentence and ask a group of humans to rate the quality of the sentence. This could work; it is called a human Turing test because it checks whether the human can make out whether the answer was generated by a machine or by a human.

You could introduce that as one of the questions you ask the human, in addition to rating the quality of the answer. The challenge here is that this kind of evaluation can become expensive, because you may have to compensate humans to participate in the evaluation. It can also be subjective, depending on which human is evaluating the system; different humans could evaluate the same answers differently. So how do we go about evaluating such a system?

(Refer Slide Time: 30:50)

Let us see that now. Given an image, the dialogue history, and the current question, what datasets for visual dialog do is provide about 100 answer options. These answer options could be answers to the 50 most similar questions, 30 popular answers, and maybe 20 random answers. You take these 100 answers and ask your model to rank all of them as the outcome.

So the model's job is, given these 100 answers, to rank them. You could once again have a softmax, which gives you a probability vector across these 100 options, and using that probability vector you can rank-order your options: the highest probability gets the first rank, the second-highest probability gets the second rank, and so on. But how do you judge the goodness of this ranking?

You can do a couple of things to measure the performance. You could check the mean rank with respect to the ground truth. You expect that the model will rank the ground truth answer first, but that may not always happen: in one case the third-ranked answer may be the ground truth, in another the seventh-ranked answer, and in yet another the 41st-ranked answer could be the ground truth, which would be a poor outcome.

So the mean rank with respect to the ground truth across all your test samples could be one performance metric. Another metric that is also used is known as the mean reciprocal rank. For one particular test question, suppose the third position in your predicted ranking was the correct answer, which means your correct rank is three. Let us assume that for another question your second-ranked option was the correct answer, and for yet another question your first-ranked option was the correct answer. Then your mean reciprocal rank would be given by (1/3 + 1/2 + 1) divided by 3, because that is the total number of test questions, which works out to 1.83 divided by 3, giving you approximately 0.61.

This would be the mean reciprocal rank. When would the mean reciprocal rank be 1? If your first-ranked option is always the ground truth. And it would approach 0 if the ground truth always lands near the bottom of your ranking. So, mean reciprocal rank is another performance metric that you can use to study the performance.
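
As a quick sanity check of the arithmetic above, here is a small helper (a sketch, not from the original papers) that computes the mean reciprocal rank from the ranks at which the ground-truth answers were placed:

def mean_reciprocal_rank(gt_ranks):
    """gt_ranks: rank (1-indexed) of the ground-truth answer for each test question."""
    return sum(1.0 / r for r in gt_ranks) / len(gt_ranks)

# Example from the lecture: ground truth ranked 3rd, 2nd and 1st on three questions.
print(mean_reciprocal_rank([3, 2, 1]))   # (1/3 + 1/2 + 1) / 3, approximately 0.61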

(Refer Slide Time: 33:58)

Let us now come to the kind of models you can use to perform the visual dialog task. These are once again encoder-decoder frameworks, where the encoder could be a late fusion encoder, a hierarchical recurrent encoder, or a memory network encoder, and the decoder can be generative or discriminative. We will see some of these examples over the next few slides.

(Refer Slide Time: 34:33)

Here is an example of a Late Fusion Encoder. In this particular case, the image is the one that you see here. You have a previous history of questions and answers: the caption says a man is riding his bicycle on the sidewalk. Is the man wearing a helmet? No, he does not have a helmet on. Are there any people nearby? Yes, a woman is walking behind him. That is the dialogue history. And the current question Qt is: do you think the woman is with him?

So we have to produce the answer At for this question now. You provide the image to a CNN, you provide the current question to an LSTM, and you provide the entire history, concatenated as a single text input, to another LSTM. You then fuse, or concatenate, the outputs of each of these models: the CNN, the current-question LSTM, and the history LSTM.

You can now have layers after them to reduce the dimension of that final representation, and you provide this to a decoder to get the final answer. We will talk about decoders in a moment, but at this point we are discussing the encoders of models used for visual dialog. This is one kind of model, known as the Late Fusion Encoder.
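
A minimal sketch of a late fusion encoder, with illustrative dimensions and names that are assumptions of this sketch:

import torch
import torch.nn as nn

class LateFusionEncoder(nn.Module):
    """Concatenates CNN image features, question LSTM state and history LSTM state."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512, img_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.h_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fuse = nn.Linear(img_dim + 2 * hidden_dim, hidden_dim)

    def forward(self, img_feat, question, history):
        # img_feat: (B, img_dim); question: (B, Tq) token indices;
        # history:  (B, Th) all previous rounds concatenated into one token sequence.
        _, (q_h, _) = self.q_lstm(self.embed(question))
        _, (h_h, _) = self.h_lstm(self.embed(history))
        fused = torch.cat([img_feat, q_h[-1], h_h[-1]], dim=1)
        return torch.tanh(self.fuse(fused))   # encoding passed on to the decoder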

(Refer Slide Time: 36:05)

Another kind of encoder is a Hierarchical Recurrent Encoder, where once again an image goes through a CNN to get a representation and the current question goes to an LSTM; but now this LSTM also gets the image representation as input. Then, for each round of the dialogue history, you send that round through one instance of the LSTM and get a different representation for each step in the history.

Now you have another LSTM that combines these inputs from the previous dialogue history, the current question, and the image, and the output of that next-level LSTM is passed on to the decoder to get the output. This is called a Hierarchical Recurrent Encoder because you have one LSTM to take inputs of the image, current question, and dialogue, and another LSTM that combines these inputs before passing them on to the decoder.

(Refer Slide Time: 37:23)

The third kind of encoder is a Memory Network Encoder, which uses an attention idea. Once again, the image goes through a CNN and the question through an LSTM; you concatenate the representations and compress them. That is the first part. Each of the rounds in the dialogue history goes through a different instance of the same LSTM, and you get a different representation for each question-answer pair in the dialogue history.

Now you bring together the current image-question representation along with the representations of the history, and you have an attention step, attention over history, to decide which part of the dialogue is relevant to answer the current question. Remember, for a long dialogue, you may be referring to a sentence that was stated two or three exchanges earlier to answer the current question.

Using an attention idea gives us a somewhat different approach. You take an inner product of the current question-and-image encoding with the history encodings, which gives you an idea of the alignment. Remember, we talked about alignment when we discussed attention for encoder-decoder models, where we talked about the alignment of a hidden state in the decoder with respect to a hidden state in the encoder, and we said that one way you could compute alignment is through a dot product or inner product.

So what you see here is attention that checks for alignment between the current image and question on the one hand and the history on the other, to get an attention map over the history, which then gives a weighted sum. You combine this with the current feature vector of the image and current question, and then you pass that to the decoder to get the final answer.

Once again, to remind you, every component that you see here is differentiable, and you can learn all of these through the backpropagation step. Even attention over history is simply a weighted combination of inputs, which can be backpropagated through.
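
The attention-over-history step can be sketched as a dot product between the fused image-question vector and each round's history encoding, followed by a softmax and a weighted sum; the function and tensor names below are assumptions for illustration:

import torch
import torch.nn.functional as F

def attend_over_history(query, history_encs):
    # query:        (B, D)    fused image + current-question encoding
    # history_encs: (B, R, D) one LSTM encoding per previous dialogue round
    scores = torch.bmm(history_encs, query.unsqueeze(-1)).squeeze(-1)   # (B, R)
    weights = F.softmax(scores, dim=1)                                  # attention over rounds
    attended = (weights.unsqueeze(-1) * history_encs).sum(dim=1)        # (B, D)
    return query + attended   # combined vector passed on to the decoder

Since every operation here is a matrix product, softmax or sum, gradients flow through the attention weights during backpropagation.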

(Refer Slide Time: 39:48)

Coming to the decoder part of such models: as we said, decoders can be generative or discriminative. In a generative decoder, during training you maximize the log-likelihood of the ground truth answer sequence; you want your decoder to generate the correct answer sequence, and during training you maximize the log-likelihood of generating that ground truth answer sequence.

When you implement this, you would have a decoder LSTM; at each step, you output a distribution over the vocabulary, which gives you a cross-entropy term, and you have multiple time steps in your LSTM. The sum of those cross-entropy terms would be your loss; minimizing cross-entropy is equivalent to minimizing the negative log-likelihood in this case.

During test time or evaluation, you use the model's log-likelihood scores to rank your candidate answers. In a discriminative decoder, on the other hand, you take the input encoding and compute a dot-product similarity between the input encoding and the LSTM encoding of each answer option. You could use any encoding method here; we will not get into that, since you probably need some knowledge of NLP to understand how you can get encodings of text.

But let us assume for now that you can get some encoding; there are what are known as Word2Vec embeddings, GloVe embeddings, and so on. You could use them to get an encoding for the answer options, along with the input encoding. You then compute a dot product, and that dot product is fed into a softmax to compute a posterior probability over all your answer options.

During training, you would maximize the log-likelihood of the correct option, and at test time or evaluation, options are simply ranked based on their posterior probabilities, which you get as the output of the softmax. This is how you could model the decoder of such encoder-decoder models to solve the visual dialog task.
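
A minimal sketch of the discriminative decoder's scoring step, assuming you already have an encoding of the dialogue context and LSTM encodings of the 100 candidate answers:

import torch
import torch.nn.functional as F

def score_answer_options(context_enc, option_encs, gt_index=None):
    # context_enc: (B, D)      output of the encoder (late fusion / HRE / memory net)
    # option_encs: (B, 100, D) LSTM encodings of the 100 candidate answers
    logits = torch.bmm(option_encs, context_enc.unsqueeze(-1)).squeeze(-1)   # (B, 100)
    if gt_index is not None:                       # training: maximize log-likelihood
        loss = F.cross_entropy(logits, gt_index)   # of the correct option
        return logits, loss
    return logits.argsort(dim=1, descending=True)  # evaluation: ranked options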

(Refer Slide Time: 42:27)

Let us see a couple of examples. Here you have an image, and the initial caption is 'the skiers stood on top of the mountain'. Then you have the dialogue. How many skiers are there? Hundreds. Are they getting ready to go downhill? I think so, my view is at the end of the line. Is it snowing? No, there is a lot of snow though. Can you see anybody going downhill? No, my view shows people going up a small hill on their skis; I cannot see what is going on from there. And so on. As you can see, it is a meaningful dialogue.

(Refer Slide Time: 43:14)

Here is one more example of visual dialog, for you to go through at your leisure.

(Refer Slide Time: 43:21)

If you would like to know more, you can read these papers. Some of them are more recent, and we did not get a chance to cover them in this lecture; we covered only the most basic methods. If you would like to know more, you can follow some of these recent papers. And we are going to leave one question here: can you come up with better methods to evaluate visual dialog systems? Think about it; it may or may not be a trivial answer. We will discuss it next time.

(Refer Slide Time: 43:58)

A few more readings on Visual Dialog, including some very recent papers that could be relevant.

(Refer Slide Time: 44:06)

Here are some references.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 59
Other Attention Models in Vision

(Refer Slide Time: 00:14)

Moving on from Visual QA and Visual Dialog, we will now look at how attention has been modeled and used in various other ways. In particular, we look at three specific models: Neural Turing Machines, DRAW, which stands for Deep Recurrent Attentive Writer, and Spatial Transformers. Each of these has had a significant impact on the field of deep learning and computer vision, and they either use or implement attention in very different ways. Let us see each of them in this lecture.

(Refer Slide Time: 01:03)

Let us start with Neural Turing Machines. Neural Turing Machines, as the name states, denote a class of neural networks that are intended to perform tasks analogous to a Turing machine. Just like a Turing machine, an NTM, or Neural Turing Machine, has a memory and read-write operations to access that memory in order to perform subsequent tasks. However, unlike a Turing machine, it needs to have a fixed-size memory so that it can be treated as a neural network whose operations can be learned.

(Refer Slide Time: 01:43)

Here is a high-level visualization of a Neural Turing Machine, where you have a controller and a memory, and the controller accesses the memory using read and write heads to read from and write into the memory. The controller is a neural network with certain layers, which executes the read and write operations on memory.

The memory itself is an RxC matrix with R rows, each of C dimensions. Now let us ask a question: like a Turing machine, if we do read-write operations on an NTM by specifying a row and a column index, let us say we want to read the 3rd row and the 10th column, can we train such an NTM end to end using backpropagation?

(Refer Slide Time: 02:48)

The answer is, unfortunately, no, because you cannot take the gradient of an index, since it is a hard choice, similar to hard attention. It is just like the situation where, instead of a matrix, you had a feature map in a CNN and you wanted to focus only on one patch of a particular feature map, where that patch could be decided differently for different inputs and only that patch should be processed through the rest of the network. That would be equivalent to hard attention, where you focus on only one part, and that cannot be backpropagated through.

Instead, what can we do? Very similar to what we talked about as soft attention, you could assume that you have a soft weighting function over different parts of the matrix. For an image, it was called soft attention; here, we are looking at the memory matrix, which is RxC, and we have soft attention over all the locations in the memory, which can be used to read from and write to the memory.

This makes the Neural Turing Machine end-to-end trainable using backpropagation and gradient descent. So we use a very similar idea to soft attention.

(Refer Slide Time: 04:14)

Now let us look at these components in more detail. You have a memory matrix at time t given as 𝑀𝑡, which has R rows and C columns, as we already saw. The normalized attention vector used to read from memory is given by a set of weights 𝑊𝑡(𝑖); the weights add up to one, and each of them lies between 0 and 1. The read head output is a sum of the matrix rows weighted by the attention vector.

So if 𝑀𝑡(𝑖) is the i-th row of the memory at time t, then you have a corresponding weight 𝑊𝑡(𝑖) for each row, given by an attention mechanism, and the weighted sum rt = Σi 𝑊𝑡(𝑖) 𝑀𝑡(𝑖) is what the read head gets from memory.

(Refer Slide Time: 05:19)

Similarly, for writing, a part of the memory at time 𝑡 − 1 is first erased using an erase vector 𝑒𝑡; one minus the erase term is what is retained. So the erased memory is given by M̃t(i) = Mt−1(i)[1 − 𝑊𝑡(𝑖) 𝑒𝑡]. You then add a weighted combination of an external input, called the add vector 𝑎𝑡, so that Mt(i) = M̃t(i) + 𝑊𝑡(𝑖) 𝑎𝑡; that weighted add vector is what gets written into that particular i-th row of the memory at the new time step.
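
In code, the read and write steps look roughly as follows (a NumPy sketch; the attention weights w and the erase and add vectors are assumed to be produced by the controller):

import numpy as np

def ntm_read(memory, w):
    # memory: (R, C) matrix; w: (R,) attention weights summing to 1
    return w @ memory                               # r_t: weighted sum of memory rows

def ntm_write(memory, w, erase, add):
    # erase, add: (C,) vectors emitted by the controller's write head
    memory = memory * (1 - np.outer(w, erase))      # erase, scaled per row by w(i)
    memory = memory + np.outer(w, add)              # add new content, scaled by w(i)
    return memory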

(Refer Slide Time: 06:09)

How do you obtain this attention vector 𝑊𝑡? To do that, the NTM uses both what is known as content-based addressing and location-based addressing, similar to how addressing is done on a computer. What does this mean in terms of learning a neural network? Each head, be it the write head or the read head, has a key vector 𝑘𝑡 at time step t, which is compared with each row of the memory matrix 𝑀𝑡. How is it compared? You take the key 𝑘𝑡 and compute the cosine similarity between 𝑘𝑡 and the i-th row of the matrix 𝑀𝑡. That tells you how similar the key is to one specific location in your memory.

This is content-based addressing, where you compare the content that you have with the content in one particular row of the memory. The content-based attention is then obtained like a softmax: the exponentiated alignment of the current key vector with a particular row, divided by the sum over all rows.

You do have a constant 𝛽𝑡 here, which you could consider a weighting factor, or what the Neural Turing Machines work calls the key strength; it decides how concentrated the content-based weight vector should be. You can consider it a scaling factor for now. This is the first step, which is largely content-based addressing.

(Refer Slide Time: 07:57)

Once you finish the content-based addressing, you combine this with the attention vector from the previous time step: you interpolate the content-based attention with the previous attention 𝑊𝑡−1 using a scalar gate 𝑔𝑡. That interpolated vector is what you carry forward in this time step. Are we done?

No, we still have the location-based addressing. To handle this, NTMs perform a circular convolution of the resultant vector with a shift kernel. If you have an attention vector, which is a vector over all possible memory rows, you perform a convolution operation with a kernel 𝑠𝑡; this is done in case you have to shift locations in memory relative to the matched content.

So this is a kernel that can be decided based on what the controller needs to achieve. If locations of particular content have moved for some reason, those kinds of changes can be handled by convolving the attention vector with a suitable shift kernel to get the shifted attention vector.

This attention distribution is then sharpened by raising each attention value to a power 𝛾𝑡 and renormalizing, to get the final sharpened attention vector. Sharpening here means that the weight increases for a position that already had a high attention value and decreases for a row where the original value was low.

(Refer Slide Time: 10:01)

One can summarize all these steps as follows: you have the attention vector from the previous time step, 𝑊𝑡−1, the memory matrix at time 𝑡, 𝑀𝑡, and your controller outputs: 𝑘𝑡, the key; 𝛽𝑡, the key strength; 𝑔𝑡, a scalar that controls the interpolation between the previous step's attention and this step's attention; and 𝑠𝑡 and 𝛾𝑡, for shifting and sharpening. You first perform content addressing, then interpolation between your current attention weights and the previous time step's attention weights.

Then you perform a convolutional shift with respect to a kernel, which handles the location-based behaviour that the controller may want to achieve. And finally, a sharpening of the weights decides which part of the memory to attend to at a particular time step.
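
Putting the four addressing steps together, here is a NumPy sketch; for simplicity, the shift kernel s is assumed to span all R positions, whereas the original work restricts it to a small set of allowed shifts:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ntm_addressing(memory, w_prev, k, beta, g, s, gamma):
    # 1. Content addressing: cosine similarity of key k with each memory row.
    sim = memory @ k / (np.linalg.norm(memory, axis=1) * np.linalg.norm(k) + 1e-8)
    w_c = softmax(beta * sim)
    # 2. Interpolation with the previous time step's attention.
    w_g = g * w_c + (1 - g) * w_prev
    # 3. Circular convolution with the shift kernel s (location-based addressing).
    R = len(w_g)
    w_s = np.array([sum(w_g[j] * s[(i - j) % R] for j in range(R)) for i in range(R)])
    # 4. Sharpening and renormalization.
    w = w_s ** gamma
    return w / w.sum()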

(Refer Slide Time: 11:04)

That is the working of the Neural Turing Machine; as you saw, it is a different use of what we saw as attention. Now we will move on to the second method of this lecture, known as DRAW, or Deep Recurrent Attentive Writer. This work was published in 2015 and is based on the intuition that when humans generate images, such as drawings and sketches, they draw sequentially.

One first draws the outlines, then keeps adding details iteratively. DRAW uses what is known as a variational autoencoder (VAE), a variant of an autoencoder used in generative models to generate images, which we will see very soon in the coming week's lectures. It uses a VAE to mimic this kind of sequential approach to generating images. Why are we talking about this here, although we have not covered generative models yet? Because it uses a unique attention mechanism, which is of interest in this lecture.

(Refer Slide Time: 12:27)

Here is an example of how the DRAW method works. An image is generated sequentially, instead of being generated all at once. What you see here are 10 different generations, where each row is the generation of an image, done iteratively. So, if you look at the last row here, in the initial step you have a blank canvas. The model then attends to a particular region, which is given by this red block here, draws something in that block, and the canvas gets updated.

The model then focuses on another block and draws something there. In the next time step, the model focuses on a third area of the image and draws something there, and iteratively over time the model ends up generating this particular image at the end. This is an MNIST image; the model is trained on the MNIST dataset. As we already mentioned, DRAW uses a variational autoencoder, a variant of an autoencoder, which we will see soon.

This autoencoder has an encoder and a decoder, both of which are RNNs, and an attention mechanism is used to focus on specific parts to generate parts of an image iteratively. Let us see each of these components in more detail.

(Refer Slide Time: 14:05)

The encoder and decoder are RNNs, which means your overall architecture is going to be like an autoencoder, which is one column of this diagram here. You have an input x, a read module, which is an attention module, then an encoder RNN, and then an intermediate step, which you can ignore for now; we will talk about it in more detail when we get to variational autoencoders.

From that intermediate layer, you go to the decoder RNN, and then you finally have a write head that writes onto a canvas to generate some content. Both the encoder and the decoder are RNNs, which means they share weights and you have that particular block repeated over multiple time steps. The image is drawn on a canvas matrix c, which is what you see on top here, over t time steps.

DRAW has both read and write heads; you could say it is inspired by NTMs, focusing on specific parts of the image for reading as well as for writing.

(Refer Slide Time: 15:22)

The encoder RNN takes four inputs: the input image x; a residual image 𝑥̂ (we will see what 𝑥̂ is in a moment); the encoder hidden state at the previous time step; and the decoder hidden state at the previous time step. You can see that the decoder hidden state at the previous time step is passed to the encoder at the next time step. Notationally, 𝑥̂𝑡 = 𝑥 − 𝜎(𝑐𝑡−1).

𝑐𝑡−1 is the canvas that has been drawn up to time step 𝑡 − 1; the sigmoid of that gives you values between 0 and 1, and x minus that gives us 𝑥̂, the portion of the image which the canvas so far has not captured. So we are looking at the residual image, which is what we call 𝑥̂. 𝑟𝑡 is the output of the read module, which takes in 𝑥, 𝑥̂𝑡 and the decoder hidden state at time step 𝑡 − 1, and uses these to decide which part of the image to read at this time step.

The encoder hidden state ℎ𝑡 is then produced by the encoder RNN from its own previous hidden state ℎ𝑡−1 and the concatenated input [𝑟𝑡, ℎ𝑡−1 of the decoder]. Here σ is the sigmoid function, and when we write something in square brackets, we mean a concatenation of those values.

(Refer Slide Time: 17:21)

For the decoder, you have a sampling step here, which we will talk about when we come to variational autoencoders. You sample a certain vector, which is passed to the decoder RNN. The decoder RNN uses the hidden state of the decoder at the previous time step and the sample that came from what we will, for now, call the bottleneck layer of this variational autoencoder.

Finally, the canvas at time step t is updated as the canvas at the previous time step plus the output of the write module at that time step. This is the overall functioning of the DRAW framework.
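
One time step of DRAW can be summarized in the following sketch, where read and write are the attention modules described in the next few slides, and encoder_rnn, decoder_rnn and sample_latent are placeholders for the components just described; all names are assumptions made for illustration:

import torch

def draw_step(x, c_prev, h_enc_prev, h_dec_prev, read, write,
              encoder_rnn, decoder_rnn, sample_latent):
    x_hat = x - torch.sigmoid(c_prev)                        # residual image
    r = read(x, x_hat, h_dec_prev)                           # attended glimpse
    h_enc = encoder_rnn(torch.cat([r, h_dec_prev], -1), h_enc_prev)
    z = sample_latent(h_enc)                                 # VAE bottleneck (covered later)
    h_dec = decoder_rnn(z, h_dec_prev)
    c = c_prev + write(h_dec)                                # update the canvas
    return c, h_enc, h_dec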

(Refer Slide Time: 18:14)

Now let us try to understand how the read and write modules work here; these are the attention mechanisms in this particular framework. The goal of the read and write modules is to focus on specific parts of an image. DRAW cannot simply pick one hard region, because you would not be able to differentiate through that choice, which could change for every input.

So you have to somehow distribute the weights across the full image. This is done in two steps in the DRAW framework. First, the model predicts certain filter parameters, which we will talk about, and using those filter parameters it places Gaussians over a grid of locations in the image, whose coordinates are also learned; that becomes your attention mechanism.

To explain this with the image: given a particular image, the attention mechanism has to focus on a certain region. This region is given by a grid of locations; the center of that grid, given by gx and gy, is learned, as is δ, the stride between adjacent points in the grid. So by designing a 3x3 grid and learning gx, gy, and δ, the grid is completely specified.

We still have some more details to add, but at this point the grid is specified. You can see here that if you place the grid at a particular location in the image, it focuses on a certain part, and as the grid location is moved, the model focuses on a different part of the image.

(Refer Slide Time: 20:15)

So what are these grid locations, and what is done to get the attention? At each of these grid locations, say in a 3x3 or, in general, an NxN grid, a Gaussian filter is placed. For a Gaussian filter, you have to specify a variance. So, in addition to the grid center coordinates gx, gy and the stride δ, the model also learns the variance σ² of the Gaussian placed at each of these grid locations, and also an intensity value γ, which we will see very soon.

How do you learn all of these parameters? They are all produced as part of the read or write module. In the case of the decoder, or the write module, given the hidden state of the decoder at a particular time step, you have a set of weights which outputs all of these values: g̃x, g̃y, σ², δ̃, and γ.

Why do g̃x, g̃y and δ̃ carry a tilde? It is only because they are relative, normalized values that are converted to absolute locations on the image. We will see that now.

(Refer Slide Time: 21:45)

So given the predicted g̃x and g̃y, which specify the center of this grid in normalized coordinates, if we want to translate this to a particular position on the image, and we assume the image is an A x B input image, then the grid center is given by

gx = ((A + 1)/2)(g̃x + 1) and gy = ((B + 1)/2)(g̃y + 1).

That is, we are converting the predicted values g̃x and g̃y into specific positions gx and gy on an A x B image. Similarly, δ̃ is the stride in the same normalized coordinates, and you convert that to a stride on a given A x B image as

δ = ((max(A, B) − 1)/(N − 1)) δ̃,

where N is the number of grid locations along each side; as we said, a 3x3 grid, a 5x5 grid, and so on.

What you see in the figure here is a 3x3 grid attention map, in which case N would be 3. So you are considering max(A, B): if A and B were 100, for instance, this factor would be (100 − 1)/(3 − 1) = 99/2, which is about 49. So you would be multiplying δ̃ by a factor of about 49 to separate the grid points by the stride required for that image and get δ.

Now, to find the grid point locations of this, say, 3x3 grid, you have the final equations, which give the coordinates of the (i, j)-th point of that grid:

μx(i) = gx + (i − N/2 − 0.5) δ and μy(j) = gy + (j − N/2 − 0.5) δ.

These give you the grid locations if (gx, gy) is the center of the grid. Let us assume that center is at (50, 40), for instance. Then, for the first location, i − N/2 − 0.5 would be 1 − 1.5 − 0.5 = −1, so you would get gx − 1·δ; if δ was 10 pixels, you would go 10 pixels to the left. Similarly for j, you would go 10 pixels up, and that would give you the top-left location of the grid point, and so on.
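
A NumPy sketch of how the grid centres are computed from the predicted values for an A x B image and an N x N grid, following the equations above:

import numpy as np

def grid_centres(g_x_tilde, g_y_tilde, delta_tilde, A, B, N):
    g_x = (A + 1) / 2 * (g_x_tilde + 1)            # map predicted value to image coords
    g_y = (B + 1) / 2 * (g_y_tilde + 1)
    delta = (max(A, B) - 1) / (N - 1) * delta_tilde
    i = np.arange(1, N + 1)
    mu_x = g_x + (i - N / 2 - 0.5) * delta          # x-coordinates of the N grid columns
    mu_y = g_y + (i - N / 2 - 0.5) * delta          # y-coordinates of the N grid rows
    return mu_x, mu_y, delta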

(Refer Slide Time: 25:10)

What do you do once you have these grid locations? As we said, at each of these grid locations a Gaussian filter is placed, given by the variance σ²; Fx and Fy denote the resulting filterbanks here. The Gaussian filters placed at a specific grid location, which could be at a particular part of the image, give you different portions of the image. The bottom-most example and the second one from the bottom are perhaps positioned at the same central grid location, but they have different Gaussian variances, which you can see in the output of the attended region. And on top, the center grid location is different, and perhaps the δ is small, so you are focusing on a narrower region, with its own variance for the Gaussian.

(Refer Slide Time: 26:13)

To complete this discussion, the read function is finally implemented as a concatenation of the Gaussian-filtered input image and the Gaussian-filtered residual image: γ[Fy x Fx^T, Fy 𝑥̂ Fx^T], where x is the image, 𝑥̂ is your residual image, and Fy and Fx are your Gaussian filterbanks. You apply the filters to both, concatenate the results, and scale by γ, the intensity factor, which, if you recall, is also learned. Why do we have the intensity factor?

You could assume that when you draw, you may initially draw a light shade and then make it darker. So, in a specific time step, γ controls how much intensity you want to put in; even in read, you can read only a certain intensity of a part of the image, and an intensity factor is used for the write as well. The write operation follows a complementary form, given by (1/γ̂) Fy^T wt Fx, where wt is the patch emitted by the decoder.

So you can see that for the write patch, the order of the transposition is reversed: in the read you have Fy x Fx^T, and in the write you have Fy^T wt Fx. The order of the transposition is reversed to keep the dimensions consistent and to maintain complementarity. Otherwise, the attention mechanism is implemented in a very similar manner for the read and the write.
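
Given those grid centres, here is a minimal single-channel NumPy sketch of the Gaussian filterbanks and the read and write operations; variable names are assumptions of this sketch:

import numpy as np

def filterbank(mu, size, sigma2):
    # mu: (N,) grid-centre coordinates along one axis; size: number of pixels on that axis
    coords = np.arange(size)
    F = np.exp(-((coords[None, :] - mu[:, None]) ** 2) / (2 * sigma2))   # (N, size)
    return F / (F.sum(axis=1, keepdims=True) + 1e-8)                     # normalize rows

def draw_read(x, x_hat, Fx, Fy, gamma):
    # x, x_hat: (img_h, img_w) image and residual; Fy: (N, img_h), Fx: (N, img_w)
    return gamma * np.concatenate([Fy @ x @ Fx.T, Fy @ x_hat @ Fx.T])    # two N x N patches

def draw_write(w_patch, Fx, Fy, gamma_hat):
    # w_patch: (N, N) patch emitted by the decoder; output has the image's shape
    return (1.0 / gamma_hat) * (Fy.T @ w_patch @ Fx)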

(Refer Slide Time: 27:58)

And to recall, once you have this attention mechanism: each of these rows is a different generation, and at each time step the grid location, the stride, and the Gaussians tell you which region of the image you are focusing on. In each time step, the region you focus on can change. In this case, what you are seeing here is the output of the write head.

You can see that in each time step a different region of the canvas is attended to, and in that region something is drawn which is obtained as the output of the decoder. You repeat this over time steps to get your final generated image.

(Refer Slide Time: 28:51)

The third method that we are going to talk about in this lecture is Spatial Transformer Networks, which is again a very different approach to using attention and benefiting from the value attention brings to CNNs. The broad idea here is that CNNs on their own can lack spatial invariance: if a certain object is rotated in different ways or located in different corners, the CNN may not handle it well in practice.

To a certain extent, max pooling operations help, but they only provide invariance within a small neighbourhood, especially in the deeper parts of the network. A spatial transformer is a fully differentiable module, and hence it can be inserted into any CNN to bring in a certain attention capability to solve a particular problem. Let us see spatial transformers in more detail.

(Refer Slide Time: 30:01)

Here is an example of how spatial transformer networks work. In the first column, you have three different images. As you can see, in the first image the 7 is slightly distorted from how a 7 would look in the MNIST dataset. Similarly, the 5 is reduced in size and rotated a little, and the 6 is moved away from the central patch, with a few noise patches included to confuse a model that may want to look at this image and classify it as the digit 6.

What the spatial transformer does is shown in column (b): its job is to find out which part of the image to focus on, and that is where, in a sense, the attention mechanism comes in. You can see that for the top image it focuses on this part, which is the 7. Similarly, in the middle image it focuses on the 5, but the rotated box shows the orientation in which the model needs to look at the content, and similarly for the third image, which is the 6.

In column (c), the spatial transformer applies this transformation and converts the original image to an untransformed, uncorrupted version, where each digit occupies the full image and is centered. And with the kind of images that you get in column (c), column (d) shows how a CNN can now give good classification performance if it attends to the specific part of the image, rather than being confused by the full image, which could contain other artifacts.

(Refer Slide Time: 31:52)

The spatial transformer module consists of three components: a localization network, a grid generator, and a sampler. As you can see here, the spatial transformer module is something you can insert between any two convolution layers in a CNN. So if this was conv4, you could insert a spatial transformer module between conv4 and conv5.

What do you expect to happen? If this was an image where the 6 was rotated, moved to a corner, and there were other noisy artifacts in the image, we expect the spatial transformer to focus only on the region containing the 6 and pass it on as V to the next step for further processing. Why is this challenging? Remember that, in a sense, we are doing hard attention: we are trying to focus on a specific part of the image, which could vary for each input.

How we manage to do this in a differentiable manner is the core contribution of spatial transformer networks. The three modules are the localization network, the grid generator, and the sampler. The idea is to take an input feature map U and transform it into an output feature map V, which can then be given for further processing, such as classification. Let us see each of these modules, which are designed to be differentiable so that they can be included in any CNN.

(Refer Slide Time: 33:49)

The localization network takes an input feature map U, which let us assume is of dimension H x W x C, and outputs a set of parameters θ, which are the parameters of the transformation. What does this mean? The localization network could have multiple hidden layers, but its output layer contains the number of transformation parameters required to model the problem.

For example, if we wanted the spatial transformer network to deal only with affine transformations, the localization network would have to output six values.

(Refer Slide Time: 34:43)

The next step in the pipeline, as we saw, is the grid generator, and the final step is the sampler. In the grid generator, we want to take the transformation parameters and produce the actual sampling grid. If we assume the output V has dimensions H′ x W′ x C, then for a point (xi^t, yi^t) in the output grid, the corresponding point (xi^s, yi^s) in the input grid is given by applying the affine transformation defined by θ to (xi^t, yi^t, 1).

Remember, an affine transformation can include rotation, scaling, and translation. So now we have an equivalence between the coordinates in the output feature map V and the coordinates in the input feature map U. The next step is to find out how to sample the points in U in a differentiable manner.

(Refer Slide Time: 35:59)

Before we go there, let us see an example of the grid generator. In figure (a) here, you have an identity transformation, where the learned transformation retains the same position for every pixel from U to V. In example (b), U contains a rotated 9, and the learned transformation takes a particular set of pixels in U and maps them to the corners of V.

Now we need to find out how to sample these pixels, which will form the output pixels of V. You can see that, in this case, a distorted 9 gets transformed into a canonical 9, which the CNN is familiar with and can classify.

(Refer Slide Time: 37:00)

This leads us to the final and important step, the sampler, which draws this equivalence between V and U. How do we do this? Vi^c denotes the target feature value at the i-th location in channel c, and Unm^c denotes the input feature value at location (n, m) in channel c. You then have iterators n and m that run over the full H and W dimensions of each channel in U, and a sampling kernel k, which tells you how much the input at (n, m) contributes given the sampling location (xi^s, yi^s):

Vi^c = Σn Σm Unm^c k(xi^s − m; Φx) k(yi^s − n; Φy).

So what kind of a kernel can we use?

(Refer Slide Time: 37:54)

One example you can use is a bilinear sampling kernel, which gives

Vi^c = Σn Σm Unm^c max(0, 1 − |xi^s − m|) max(0, 1 − |yi^s − n|).

What does this do? Let us assume that, for a particular pixel location in V, the corresponding xi^s and yi^s given by the transformation were, say, 45 and 34. Here m is an iterator that goes from 1 to W and n is an iterator that goes from 1 to H.

So if xi^s were 45 and m were 50, then |45 − 50| = 5, and 1 − 5 = −4, so the max would be 0 and you would not sample the value at 50. Where would you sample? The term is non-zero only when m is within one pixel of xi^s, so effectively you sample only the pixels around (xi^s, yi^s) to get the corresponding value Vi^c. This becomes a differentiable way of sampling and focusing on one specific region of U to get the subsequent feature map V.

How do you backpropagate through such a kernel? We know how to backpropagate through the rest of the operations, but how do you backpropagate through this sampling kernel? It is not hard. The max(0, 1 − |xi^s − m|) term is similar to many other operations we have seen so far for differentiation. So ∂Vi^c/∂Unm^c is simply the product of the two max terms themselves, max(0, 1 − |xi^s − m|) max(0, 1 − |yi^s − n|), because Unm^c appears linearly.

Similarly, ∂Vi^c/∂xi^s = Σn Σm Unm^c max(0, 1 − |yi^s − n|) multiplied by a term that is 0 if |m − xi^s| ≥ 1, 1 if m ≥ xi^s, and −1 if m < xi^s. That is how you can compute the partial derivatives of this transformation.

And this completes each of the modules, the localization network, the grid generator, and the sampler, which together allow the network to focus on one specific part of the input for further processing.
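
To make the sampler concrete, here is a minimal single-channel NumPy sketch of bilinear sampling; only the input pixels within one unit of (x_s, y_s) contribute, which is exactly what the max terms in the kernel encode. Boundary handling here is simplistic and is an assumption of this sketch.

import numpy as np

def bilinear_sample(U, x_s, y_s):
    # U: (H, W) single-channel input feature map; x_s, y_s: (H_out, W_out) source coords
    H, W = U.shape
    V = np.zeros_like(x_s, dtype=float)
    x0, y0 = np.floor(x_s).astype(int), np.floor(y_s).astype(int)
    for dx in (0, 1):                              # the four neighbouring pixels
        for dy in (0, 1):
            m = np.clip(x0 + dx, 0, W - 1)
            n = np.clip(y0 + dy, 0, H - 1)
            k = np.maximum(0, 1 - np.abs(x_s - m)) * np.maximum(0, 1 - np.abs(y_s - n))
            V += U[n, m] * k                       # kernel weight times input value
    return V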

(Refer Slide Time: 40:54)

Your homework readings for this lecture are, once again, the nice blog post of Lilian Weng on attention, a nice introduction and overview of DRAW by Jonathan Hui, and a nice review of Spatial Transformer Networks by Sik-Ho Tsang.

(Refer Slide Time: 41:13)

Here are references.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 60
Self-Attention and Transformers

(Refer Slide Time: 00:14)

For the last lecture of this week, we will look at a topic that is becoming extremely popular in
recent months, which is Self Attention and Transformers.

(Refer Slide Time: 00:30)

One question that we left behind in an earlier class on Visual Dialog was: what are other ways in which one can evaluate visual dialog systems? I hope you had a chance to think about it or look for it online. One answer to this question is to look to natural language processing for performance metrics from the text domain that can measure consensus between the answers generated by a model and a relevant set of answers.

If you would like to know more about such an approach, please see this recent work called A Revised Generative Evaluation of Visual Dialogue. This work also talks about how one can generate a set of relevant answers.

(Refer Slide Time: 01:31)

So let us now come to Transformers. An acknowledgement first: most of this lecture's slides are based on Jay Alammar's excellent article, The Illustrated Transformer. You can assume that any images that are not credited are adapted or borrowed from Jay Alammar's article.

(Refer Slide Time: 01:55)

Let us first try to understand why one wants to go beyond RNNs or LSTMs for sequence modeling. Sequential computation prevents parallelization. Suppose you had a series of layers: a linear layer, followed by an LSTM layer, followed by a linear layer, followed by an LSTM layer. When we say LSTM layer, we mean a series of LSTM blocks that process a sequence of inputs.

If you had an architecture similar to this (some of the architectures we saw for complex tasks such as VQA or visual dialog could have had such a sequence of layers), it is not possible to parallelize the processing of information through such a pipeline. Can we overcome this? That is a question we want to try to answer.

Further, despite the availability of GRUs and LSTMs, RNNs can only maintain information for a limited period. If you have very long time-series data, for example a chapter of a book, will an RNN, a GRU, or an LSTM hold the information over such a long sequence? Possibly not. Even in that scenario, one perhaps needs attention to focus on the right part of the previous content.

We saw this with models for, say, visual question answering or visual dialog, at least some of which used attention to focus on the relevant parts of the previous sequence information. So if we are anyway using attention along with RNNs, why use RNNs at all? Attention can anyway tell us which part of a sequence to look at while giving a specific output.

If you are doing, say, machine translation, then while you are generating a particular word, you can decide which part of the input sequence to focus on. You perhaps do not need an entire RNN to hold a hidden state across multiple time steps. That is the hypothesis behind Transformers, which we will talk about in this lecture.

(Refer Slide Time: 04:52)

Transformers were first introduced in the work called Attention Is All You Need, published at NeurIPS 2017, which introduced an encoder-decoder framework to perform sequence-to-sequence modeling without any RNNs. This work proposed the Transformer model, which introduced the self-attention mechanism without any recurrent architectures. There are a few key components of this architecture:

a concept known as self-attention, multi-head attention, positional encoding, and the overall encoder-decoder architecture. We will see each of these in detail. What you see on the left here is the overall architecture; hopefully this will become clear over the next few slides.

1311
(Refer Slide Time: 05:57)

So here is a quick summary of how Transformers work. Consider a machine translation task where
you want to translate an input sentence into the output "I am a student". The transformer uses an
encoder-decoder architecture where, in the encoder, you have several encoder modules, each of
which feeds into the encoder at the next level, and the encoder at the highest level feeds into each
of the decoders, which have a similarly stacked architecture.

The final output is "I am a student". Each encoder has several components inside it: there is a
self-attention module, which outputs a vector for every input word. So if you had a sequence of
inputs, for each element of that sequence, you get a certain representation as the output of the
self-attention module. This representation is passed through a feed-forward neural network to give
you another vector as output, which is passed on to the next encoder layer, and so on.

Let us see each of these components in detail.

1312
(Refer Slide Time: 07:32)

Let us start with self-attention, which is one of the unique features that this work introduced. So
let us consider two input sentences that we would like to translate. One sentence reads, "the animal
did not cross the street because it was too tired". The second sentence is, "the animal did not cross
the street because it was too wide". For a human, it is evident that the first "it" corresponds to the animal.

And the second corresponds to the street. But for an RNN-based model, this is difficult to catch,
and that is where self-attention brings value. Self-attention allows the model to look at which other
words need to be paid attention to while you process one particular word. The reason it is called
self-attention is that, given a sequence, when you are processing one element of that sequence, you
make a copy of that sequence and see which other parts of the same sequence are important for
processing this element.

And that is why it is called self-attention. So now if you had to draw a parallel to RNNs, you would
see that you perhaps do not need a hidden state to maintain the history of previous sequence inputs.
Whenever you process one element of a sequence, at that time, you process an attention vector
over an entire sequence. You do not need to store a compressed hidden state that has all the
sequence information in one vector or one matrix. That again is the idea of transformers.

1313
(Refer Slide Time: 09:50)

So let us see how self-attention is implemented. The first step starts from an input vector, which
is an embedding. Let us say you have the phrase "thinking machines". For the first element of that
sequence, "thinking", you would get a word embedding; assume that this is done using some text
processing technique that goes from the word "thinking" to an embedding, a vector representation
of that word.

Similarly, x2 here is the vector representation of the word "machines". Given these inputs, the
self-attention module creates three vectors, known as the query vector q, the key vector k, and the
value vector v. We will see how these are used very soon. But how are these vectors created? To
create the q, k, and v vectors, one uses learnable weight matrices W^Q, W^K, and W^V, which are
multiplied by the input x to give us q, k, and v.

In this particular work, q, k, and v were 64-dimensional vectors, and x itself, the embedding of
each word in the sequence, was a 512-dimensional vector. One question here: do q, k, and v always
have to be smaller than x? In this case, you see that x is 512-dimensional while q, k, and v are
64-dimensional. Does this always have to be the case for Transformers to work? The answer is no.

This was perhaps done in this work to keep the computation of multi-headed attention constant;
we will see this very soon. What are the dimensions of W^Q, W^K, and W^V? You should be able
to get that from x and q, k, v: each would be 512x64. Those are the weights that are learned while
you train the transformer end to end.

(Refer Slide Time: 12:28)

So once you get your q, k, and v vectors for each word in your input sequence, step two in self-
attention is to compute what is known as self-attention scores, which is the score of all words of
the input sentence against this particular word under consideration. So if you had the word
thinking, which is what you are processing, you would compare this with every other word in the
sequence and compute a self-attention score. How is that computed?

Let us see, it is by taking the dot product of the query vector with the key vector of the respective
words. So for this input, thinking, the first score would be 𝑞1 ⋅ 𝑘1 , that is the query vector, which
is q corresponding to the word thinking, and the key vector corresponding to the word thinking
𝑞1 ⋅ 𝑘1 . Similarly, you would compute 𝑞1 ⋅ 𝑘2 for the second word in the sequence, 𝑞1 ⋅ 𝑘3 for the
third word in the sequence, and so on.

These scores are then divided by the square root of the dimension of k, which is the dimension of
the query, key, and value vectors. In our case, the dimension was 64, so its square root is 8. So the
score q1 · k1, which was 112, would be divided by eight to get 14. Similarly, q1 · k2 was 96, and
that would get divided by 8 to give 12. This is known as scaled dot-product attention.

1315
You can recall our discussion in the first lecture of this week, where we spoke about several kinds
of attention alignments; scaled dot-product attention, where you take a dot product and then scale
the value, is one such alignment mechanism. Why do we do this here? Scaled dot-product attention
helps get more stable gradients by scaling the values to a manageable range, instead of letting them
grow too high or too low with too much variance.

(Refer Slide Time: 15:12)

Once this is done, the next step is to use softmax to get a distribution of attention over each of
the words in the sequence with respect to the word under consideration. So once you get the scores
14 and 12, you can do a softmax on them to get a weight for the first word and a weight for the
second word with respect to the first word; we are still processing the first word here.

The score of the first word with respect to the first word is likely to have the highest weight,
because you would expect that, to process the first word, the first word itself is the most important.
However, we still have attention values over other words in the sequence, which also add value to
processing the current word. As the next step, we multiply the softmax output with the value vector
that we got earlier.

Remember, we had a query, a key, and a value vector; we now use that value vector with the
softmax output to get a weighted representation of the value vector. Similarly, you will have a
weighted representation of the second value vector, and so on, for every input in the sequence. We
finally do a weighted sum of all of these weighted v1, v2, and so on vectors to get a final z1, z2,
and so forth.

So z1 here is formed as v1 times its softmax weight, plus v2 times its softmax weight, plus v3
times its softmax weight, and so on. Similarly, z2 would be formed when we process the word
"machines".

(Refer Slide Time: 17:29)

What do we do with these z vectors? Before we go there, let us try to summarize our discussion so
far. You have your input embedding x; you see two rows in x here, corresponding to vector
representations for two words, "thinking" and "machines" in this particular case. Then you have the
matrices W^Q, W^K, and W^V, which are learnable weights. When you multiply x with each
weight matrix, you get Q, K, and V.

Then you take the dot product between Q and K, scale it by the square root of the dimension of the
K vectors, do a softmax on it, multiply with the V vectors, and that gives you your z vector for each
of the words in the input sequence.
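
To make this computation concrete, here is a minimal NumPy sketch of a single self-attention head; the toy sizes (4-dimensional embeddings, 3-dimensional q/k/v) and the random weight matrices are illustrative assumptions, not the values used in the paper.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 4, 3                      # toy sizes (512 and 64 in the paper)
X = rng.normal(size=(2, d_model))        # embeddings of "thinking", "machines"

W_Q = rng.normal(size=(d_model, d_k))    # learnable weight matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)          # scaled dot-product scores
weights = softmax(scores)                # attention distribution per word
Z = weights @ V                          # z1, z2: weighted sums of value vectors
print(Z.shape)                           # (2, 3): one z vector per input word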

1317
(Refer Slide Time: 18:24)

Now let us move on to multi-head attention; we have seen self-attention so far. Multi-head
attention corresponds to the different layers that you see in the illustration here: instead of using
only one W^Q, one W^K, and one W^V, the transformer uses multiple such sets of W^Q, W^K,
W^V, so you get multiple query vectors, key vectors, and value vectors.

You then perform multiple such self-attention computations, and finally concatenate the results
and send them through a linear layer. How does this help? Firstly, it expands the model's ability
to focus on different positions: different W^Q, W^K, W^V weight matrices could help the model
focus on different parts of the sentence, which could together enrich the processing of the current
word.

Secondly, it also gives the self-attention layer multiple representation spaces; instead of choosing
only one set of weight matrices, getting these different representation spaces, concatenating them,
and sending them through a linear layer gives better performance at the end. This part of
Transformers is known as multi-head attention.

1318
(Refer Slide Time: 20:11)

Here is the illustration. If you have your input sentence, in our case "thinking machines", we embed
each word; each row of x is an embedding of "thinking" or "machines". We have multiple sets of
W^Q, W^K, W^V. In this case, we see eight such heads, resulting in eight sets of query, key, and
value vectors; we get eight different Zs, and then we use a linear layer to combine all of those Zs
to get one Z for each input of the sequence.

You also see here something called R. R simply indicates that in your transformer architecture
there are multiple encoder layers. The first encoder layer receives x as the input; every other
encoder layer receives the output of the previous encoder layer as input, and that is denoted by R.
So the original sentence and its words may not be given as input to the second encoder in that
stack of encoders; it is the output of the first encoder that is fed there.
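
Building on the single-head sketch above, here is a minimal sketch of multi-head attention; the two heads and the toy dimensions are illustrative assumptions (the paper used eight heads with 512/64-dimensional vectors).

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    # One scaled dot-product self-attention head.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
d_model, d_k, n_heads = 8, 4, 2          # toy sizes; the paper used 512, 64, 8
X = rng.normal(size=(2, d_model))        # two input word embeddings

heads = []
for _ in range(n_heads):                 # one (W_Q, W_K, W_V) set per head
    Ws = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
    heads.append(attention_head(X, *Ws))

W_O = rng.normal(size=(n_heads * d_k, d_model))   # final linear layer
Z = np.concatenate(heads, axis=-1) @ W_O          # concatenate heads, then project
print(Z.shape)                            # (2, 8): one combined z per word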

1319
(Refer Slide Time: 21:36)

A third component of the transformer architecture is known as positional encoding. Unlike CNN
and RNN encoders, attention-encoded outputs do not depend on the order of the inputs. Why?
Because you can choose to focus on what you want, irrespective of the order of the inputs. Having
said that, the position of a particular word in a sentence is still important for making meaning out
of it.

So how do we bring that information into a self-attention model? That is done in Transformers
using an approach known as positional encoding. The position of a word in a sequence is given as
a separate input along with the x embedding. How is this position encoded? You could do it in
different ways, but the paper suggests one specific way, although it may not be the only way: it
takes the position of that particular word within the sequence and encodes it as sine and cosine
signals.

That becomes the positional encoding of the word in the sequence. This is given as input along
with what we saw as x in the previous slide. So given "thinking machines", if you took the word
"thinking", you would get a certain word embedding; you add the positional encoding to that
embedding, and that becomes the input to your self-attention layer. So the role of positional
encoding is to convey where in the sequence the specific input that you are currently processing
is.
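
As a concrete illustration, here is a minimal sketch of the sinusoidal positional encoding described in the paper; the layout used here (sine on even indices, cosine on odd indices) follows the commonly used convention and is a sketch, not the authors' reference code.

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]           # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]        # index of each (sin, cos) pair
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)   # (10, 16): one encoding vector per position, added to the word embeddings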

1320
(Refer Slide Time: 23:41)

Finally, let us come to the architecture of the entire model. As we mentioned, it is an encoder-
decoder architecture. Let us first see the encoder. The encoder is a stack of six identical layers in
the particular transformer that was proposed. Each of these six layers looks somewhat like what
you see in the image here. You have a multi-head self-attention layer, which we have already seen:
take multiple heads of self-attention, concatenate them, and then apply a linear layer to get a vector
output.

That is fed into a fully connected feed-forward network that you see on top. But between the
multi-head attention and the feed-forward layer, there is a residual connection, and there is also a
normalization that happens before that output is fed to the feed-forward layer. So you first perform
multi-head attention; then the input also comes through a skip connection directly to the output of
multi-head attention, where you add and normalize.

The Z output that you get out of multi-head attention and the X + P input, that is, the X-plus-
positional-encoding input that enters the multi-head attention, are added and normalized, and that
result is fed into a feed-forward network. Once again, adding and normalizing gives the output of
one encoder module. You would have six such encoder modules in this transformer architecture.

All the sub-layers that you see within this particular encoder layer output data of the same
dimension, 512 in this work.
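
As a rough sketch of how one such encoder layer could be written, here is a minimal PyTorch version of the multi-head attention, add-and-normalize, feed-forward, add-and-normalize pattern; this is an illustrative post-norm sketch with assumed sizes, not the authors' reference implementation.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # multi-head self-attention, then residual add and normalize
        z, _ = self.attn(x, x, x)
        x = self.norm1(x + z)
        # position-wise feed-forward network, then residual add and normalize
        x = self.norm2(x + self.ff(x))
        return x

layer = EncoderLayer()
out = layer(torch.randn(1, 10, 512))   # (batch, sequence length, d_model)
print(out.shape)                       # torch.Size([1, 10, 512])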

1321
(Refer Slide Time: 25:45)

Similarly, the decoder also has six identical layers. Each layer again has multi-head attention and
a feed-forward network, and again you have a residual connection and a normalization step before
the output of multi-head attention is given to the feed-forward layer. There is one difference here:
the initial step performs what is known as masked multi-head attention. The goal is to not let the
architecture see further words in the output sequence when it is processing one specific word.

So if you are translating from English to Hindi and are trying to predict the third word in the
output, you can use attention on the first two words, but you do not want to use attention on the
remaining words, because the model is not supposed to know them while predicting the third
word. This is done by masking the attention mechanism in the first attention block of the decoder,
and the outputs are then passed on as they are to further multi-head attention mechanisms.
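
To illustrate the masking, here is a minimal sketch of how a look-ahead (causal) mask could be applied to the attention scores before the softmax; the toy scores are made up purely for illustration.

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # toy q.k scores

# Positions a word is NOT allowed to attend to (future words) get -inf,
# so their softmax weight becomes exactly zero.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

weights = softmax(scores)
print(np.round(weights, 2))   # row i has non-zero weights only on positions <= i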

1322
(Refer Slide Time: 27:10)

That is the architecture of the decoder. So here is a revisit of the complete architecture: the input,
the input embedding, then the positional encoding added to it; you have the multi-head attention
mechanism, then a skip connection, add and normalize, feed-forward, add and normalize; you have
six such stacked encoders. Similarly, you have the decoder as we just discussed, which has a
masked multi-head attention mechanism,

add and normalize, then a multi-head attention mechanism, add and normalize, feed-forward, add
and normalize, and you once again have six such modules stacked on top of one another. The
scaled dot-product attention can also be visualized this way: you have Q, K, and V; you first
perform a matrix multiplication of Q and K, which gives you your dot product; then you scale the
dot product; then you may optionally mask it, depending on which module you are talking about.

Masking is done only in this first part of the decoder; then you apply the softmax, then multiply
with V, and that becomes the output of the scaled dot-product attention module.

1323
(Refer Slide Time: 28:37)

Now, that was about transformers for general text processing or sequence processing problems.
Coming back to computer vision, how can Transformers be used in computer vision? This has
been explored over the last few months. One very recent work, published at ECCV 2020, uses
Transformers for object detection: an image is passed through a CNN, whose output is broken
down into a set of image features.

You also have a positional encoding of where each feature belongs in the input image. These are
combined and given to a transformer encoder, which processes each patch; those outputs are given
to a transformer decoder. You also give object queries here; object queries take a patch from the
input and then try to find out what object is inside each of those patches.

The transformer decoder processes those object queries and gets an output that is passed to a feed-
forward network, which finally makes a prediction for each of those object queries. Object queries
correspond to certain patches of the input image, which are queried for what object is there in that
specific patch. So, as you can see in the outputs here, you could have a class label and a bounding
box offset.

Or you could simply say "no object", which means that patch corresponds to the background.
Similarly, for other regions, you have a class label with a bounding box offset, or a "no object".
So this was the use of Transformers for object detection, published a couple of months ago at
ECCV.

1324
(Refer Slide Time: 30:37)

And here is a visualization of how this approach simplifies the object detection pipeline. If you
consider a Faster R-CNN, if you recall, you have your initial CNN features from which you extract
perhaps about 200,000 coarse proposals; you filter and de-duplicate them, do non-maximum
suppression, and then do an RoI align of the proposals to map them to specific regions in the input
image. And then you finally predict your bounding boxes and classes.

Using Transformers here replaces all of these steps with a single end-to-end trainable
encoder-decoder.

1325
(Refer Slide Time: 31:28)

This work showed that they could beat the performance of Faster R-CNN methods on the MS
COCO validation set. You can see the numbers here: AP stands for average precision, and AP at
50 stands for average precision at a particular threshold. You can see that the transformer models
seem to outperform state-of-the-art Faster R-CNN models across a range of different metrics.

(Refer Slide Time: 32:01)

More recently, just about a month back, there was a work published on arXiv on Transformers
for image recognition. This approach takes a set of fixed-size patches from the image. Each of
those is linearly projected to get a vector representation. Positional encodings of each of these
patches with respect to the original image are also added to form the inputs. These inputs are fed
to a standard transformer encoder. Finally, there is a feed-forward network or MLP head that is
used to get the final classification.

To perform the classification, this approach also uses an extra learnable classification token, an
embedding that is given as input to the transformer encoder along with the patches. The
transformer encoder block itself is composed of a normalization step, then multi-headed attention
with a skip connection that adds the input back, followed by another normalization, a feed-forward
network or MLP, and another skip connection, which together give the output of the encoder block.

This is very recent work, and the space of self-attention and Transformers is completely
revolutionizing how we look at deep learning methods for sequences, and probably now even
images.
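
Before moving on, here is a minimal sketch of the patch-based input described above: splitting an image into fixed-size patches, linearly projecting them, and prepending a learnable classification token. The image size, patch size, and dimensions are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
H = W = 32                     # toy image size
P = 8                          # patch size -> (32/8)**2 = 16 patches
C, d_model = 3, 64             # channels and embedding dimension
img = rng.normal(size=(H, W, C))

# Split the image into non-overlapping P x P patches and flatten each patch.
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)          # (16, 192)

W_E = rng.normal(size=(P * P * C, d_model))       # learnable linear projection
tokens = patches @ W_E                            # (16, 64) patch embeddings
cls_token = rng.normal(size=(1, d_model))         # extra learnable class token
sequence = np.concatenate([cls_token, tokens])    # input sequence to the encoder
print(sequence.shape)                             # (17, 64)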

(Refer Slide Time: 33:43)

For homework, there is a "Transformers in action" video provided along with your lecture slides
for this week; please watch it to understand how Transformers work. Also go through the excellent
article by Jay Alammar, The Illustrated Transformer, and, if you need to understand positional
encoding better, the specific article on positional encoding referenced in the slides. Finally, if you
need more information, read the original paper, Attention Is All You Need.

1327
Let us end this with one question, are Transformers faster or slower than LSTMs? What do you
think? And why do you think so? Think about it and we will discuss this the next time.

1328
Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 61
Deep Generative Models: An Introduction

(Refer Slide Time: 00:14)

From Attention methods in Deep Learning Models and Transformers, we will now move to 2
weeks of a contemporary topic in Deep Learning, Deep Generative Models. Deep generative
models include topics such as GANs, Generative Adversarial Networks, and Variational Auto
Encoders, VAEs, which have been extremely successful over the last few years.

1329
(Refer Slide Time: 00:49)

Before we talk about the overall perspective of Deep Generative Models, let us talk about the
landscape of machine learning itself and try to understand where deep generative models fit into
the scheme of Machine Learning methods. If you considered Supervised Learning, supervised
learning is about learning a mapping between inputs and outputs, that is, data and class labels.
Your data is typically given in the form of input-label pairs, such as (x_1, y_1), (x_2, y_2), and so on
till (x_n, y_n).

And the goal of a machine learning algorithm is to learn a model that can map x to y. So, it aims
to learn a function f, which, when applied on x, gives us y, where we would like y to be as close
to the original ground truth, y, that is provided. So, what are the supervised learning settings or
problems that we have seen so far?

1330
(Refer Slide Time: 2:00)

In computer vision, we have seen problems including Classification, Localization with
Classification, Detection, Segmentation, and so on. These are supervised learning problems
where there was an expected output for a given data point. Our deep neural network was trying
to learn the function that takes us from input to output. The nature of the output varies with
different kinds of tasks. For detection, the output looks different from the output for
Classification.

For Segmentation, the output looks different from that of Detection and Classification. That
resulted in different kinds of architectures, loss functions, and so on.

1331
(Refer Slide Time: 02:57)

So, what are we talking about here? Going beyond supervised learning. If you think about
paradigms beyond supervised learning, we have unsupervised learning, where the broad goal is
simply understanding data. Detecting outliers is another setting where one can use unsupervised
learning. Finally, this week's lectures focus on generating data from past data, with no labels
involved, but just generating data.

(Refer Slide Time: 03:39)

1332
So, suppose you considered unsupervised learning, which is perhaps the most popular form of
going beyond supervised learning. The main idea is to capture the underlying structure of the
data. You are given only the data; there are no labels. Since no labels are required, often training
data is cheap to obtain. The most difficult part often is to get annotations or labels for the data,
which can be a challenging task depending on the domain.

For example, in healthcare or genomics, or bioinformatics, getting labels for a data point may
require experts to participate and may be a very costly affair both time-wise and money-wise.
Unsupervised learning is generally categorized into clustering methods, such as K-means,
DBSCAN, and dimensionality reduction methods, such as Principal Component Analysis, Isomap,
and many other such methods.

(Refer Slide Time: 04:53)

The focus of this week's lectures, as I just mentioned, is Generative Models, which is another
variant of unsupervised learning. Given a set of data points, our objective is to generate more
data samples, similar to those in the training set. What does that mean? Given a set of data
points, as you see here, those are points that lie on some manifold. Remember, that manifold
corresponds to a locally Euclidean surface, which you can look at as the intrinsic dimensionality
of your data.

1333
So, suppose data lies on a certain manifold, which corresponds to a distribution. In that case, we
learn a model using some method, which can then sample from a distribution that the model has
learned and generate new data points. What does this mean mathematically? If you had training
data, x_1 to x_n, which comes from some underlying distribution, say p_D, we would like a
generative model to sample data from a distribution p_M(x) so as to minimize some notion of
distance between p_D and p_M. We would like to learn p_M so that p_M becomes close to p_D in
some notion of a distance.

(Refer Slide Time: 06:40)

How do you do this? Can we elaborate on this? Let us try to go a bit further. We said we would
like to minimize some distance between p_D and p_M. Given a dataset from an underlying
distribution p_D, consider an approximating distribution p_θ coming from a family of distributions
M. Our objective is to find that parameterization in M, that is, the distribution in M given by
parameters θ, which minimizes some notion of distance between p_θ and p_D.

So, the objective function will be θ* = argmin over θ in M of some distance function d(p_θ, p_D).
What distance function do we use? We choose KL divergence as the distance function, which is a
loose distance function between two probability distributions: it does not satisfy all properties of a
distance metric, but it is commonly used to measure the distance between two probability
distributions.

If we replace this distance function with KL divergence, it can be shown that the above problem
becomes one of maximum likelihood estimation: θ* = argmin over θ in M of the expectation, over
data points x drawn from the distribution p_D, of −log p_θ(x). This is maximum likelihood
estimation, or minimization of the negative log-likelihood. Why is this so? We are going to leave
that as homework. It is fairly trivial to show that if you replace the distance with KL divergence,
you will automatically arrive in two steps at the negative log-likelihood. Please do work it out. We
will discuss it in the next class.
out. We will discuss it in the next class.

This idea of learning p_M, or p_θ in this case, using maximum likelihood estimation is one way of
learning generative models, or deep generative models in our case. A set of methods such as
PixelCNN and PixelRNN explicitly use this objective to learn deep generative models, and we
will see these methods a little later this week.

(Refer Slide Time: 09:32)

What are the applications? Why should we learn Deep Generative models? Deep generative
models can be used for image super-resolution. Given an input image of low resolution, you can
use generative adversarial networks or any other deep generative model to learn a super-resolved
version, or a high-resolution version, of the original image. You could use such models to
colourize images, go from sketches to color images, or go from black and white to color images.

1336
(Refer Slide Time: 10:10)

You could use this for cross-domain image translation to go from one domain to the other. In this
case, we talk about taking a zebra picture and going to a picture of a horse. But you can also talk
about going from a picture to a cartoon, and so on. Finally, you can also talk about generating
realistic face data sets for performing face recognition, and so on.

(Refer Slide Time: 10:44)

There are many more applications; the goal can also be to learn good, generalizable latent
features by learning to generate new images. The model also allows you to learn latent
representations, by which we mean the hidden representations at certain layers of that model,
which can then be used for other downstream tasks such as, say, classification. You can also use
this generation of new data to augment small datasets; you then get larger datasets, on which you
can train for downstream tasks such as classification, again.

You can use such models for enabling mixed reality applications, such as virtual try-on. Imagine
going to any shopping website that lets you try on a shirt or a trouser before you purchase it. You
could now take your image and see how you would look in that particular shirt, or how you would
look with a specific kind of glasses, or sunglasses, for that matter. There are many more
applications that we will see this week, as well as in next week's lectures.

(Refer Slide Time: 12:05)

Before we move on to methods for generative models, what do generative models mean? Let us
try to understand this from a conceptual perspective. Generative models can be distinguished from
their complement, known as discriminative models. In machine learning, discriminative models
aim to learn differentiating features between different classes in a dataset. In contrast, generative
models aim to learn the underlying distribution of each class in a dataset.

Note that we are now using an example from supervised learning to understand generative models
better; all the examples of generative models that we saw so far were in an unsupervised learning
setting. Generative models can also be used in supervised settings, where you can specify the
class label and the model will generate data from that specific class. So, for example, if you are
generating a face image, you can say which person you want, and the model can generate the face
image for that person, which would correspond to a supervised rather than an unsupervised
setting.

So, in supervised learning in Discriminative models, the goal is to learn a discriminator that
separates two classes. An example would be support vector machines, which is a maximum
margin classifier. On the other hand, in a generative model, the goal is to learn the
parametrizations of distributions for each class. In a Discriminative model, one would finally do
inference by asking if a test data point belongs to one side of a hyperplane or the other side. If it
fell on one side of the hyperplane, you would say it belongs to class + 1 , and if it belongs to the
other side of the hyperplane, you would say it belongs to class − 1.

Now, in a Generative model, we try to find out that given a data point, what is the probability
that the distribution of class + 1 generated this data point? We will try to find out for the same
data point what is the probability that the distribution of class − 1 generated the same data point.
You can see here that the entire approach to assigning a class label for the new point is generative
and tries to understand how likely a certain distribution would have had a probability of
generating a particular data point. That is the difference between discriminative and generative
models more broadly in machine learning.

(Refer Slide Time: 15:25)

1339
So, consider a binary classification problem of classifying, say, images consisting of 1s
and 0s. A discriminative classifier tries to model the posterior p(y | x), a conditional distribution,
where x is given as input. Your model has to give a particular probability for y given x. So,
discriminative classifiers model this conditional distribution called the posterior. On the other
hand, generative models model the joint distribution 𝑝(𝑥, 𝑦).

So, how would you use a generative model for assigning a class label? You would take a test data
point and use this joint distribution to determine if this data point was assigned a class label + 1.
What is the probability of having generated such a data point? If this data point was allotted a
class label − 1, what is the probability of generating such a data point? And that would give you
a way of getting a class label from this joint distribution.

Remember how the posterior (conditional) distribution and the joint distribution are related from
your probability basics: p(y | x) = p(x, y) / p(x), which relates to Bayes' theorem, which states
p(y | x) = p(x | y) p(y) / p(x). Some of these ideas are something we will use when we talk about
different deep generative models.

1340
(Refer Slide Time: 17:27)

Existing deep generative models can be broadly divided into two kinds. Fully visible models
directly try to model the observations. Here, the observations are x, and the model tries to capture
these observations themselves without introducing any extra random variables; the only random
variables considered are the x's of the data that you have. For example, you could be considering
each pixel value of an image as an observation, and based on that, you will learn a generative
model which can generate new images.

The other category of methods is latent variable models. You introduce other kinds of random
variables known as latent variables, or hidden variables, which we assume generate the data and
which we try to learn as part of learning the generative model. Within latent variable models there
are, broadly speaking, two kinds: explicit and implicit models.

In explicit models, we try to explicitly define and learn a certain likelihood of data by defining a
certain distribution. We may, for example, define a Gaussian on data or a mixture of Gaussians
on data and then try to learn the means and covariances of those Gaussians as part of the learning
procedure. On the other hand, in implicit models, there is no distribution assumed. No
probability density function parameterization is assumed.

1341
Given a set of data points, we do have some latent variables. But we would like to learn the
latent variables so that we can generate data points that look similar to the data points that we
have in our training data. We do not impose any specific distributional form on data or the
model’s distribution in this particular case.

1342
(Refer Slide Time: 20:11)

More generally speaking, there is an entire taxonomy for dividing existing approaches to
generative models. As we just said, very broadly speaking, generative models can be divided into
explicit methods and implicit methods. In explicit methods, you assume a certain probability
density function and try to learn its parameterization. In implicit methods, we do not assume any
such distribution; we just want to learn a model which can generate data similar to what we have
seen in our original training data.

Within implicit density generative models, a popular example is GANs (Generative Adversarial
Networks), which use the implicit density learned to generate data points. Another variant of
implicit generative models is Markov chain based models such as generative stochastic networks.
These are not as popular as GANs these days, but that is where they fit into the overall landscape
of generative models.

On the other hand, within explicit density generative models, you again have two kinds: one where
the density is tractable, and one can estimate it and move forward to generate data points.
Examples of methods here include autoregressive models, NADE, MADE, NICE, GLOW,
FFJORD, and so on. We will at least see some of these in later lectures this week.

Another kind is where you do have an explicit density function, but you cannot directly compute
it; still, you can compute it through an approximation. The most popular method here is known as
variational inference, which leads to a model called the Variational Autoencoder. Another
approach is again to use Markov chains to approximate this density function, which leads to a
family of methods known as Boltzmann machines.

We will not cover all of these methods in the lectures, but we will cover the most popular ones,
including GANs, VAEs, and some tractable density methods.

(Refer Slide Time: 22:50)

For more information, you can go through this excellent “Tutorial on Deep Generative models”
by Aditya and Stefano, delivered at IJCAI 2018. If you would like to go further, there is another
tutorial, delivered at the UAI 2017 conference, by Shakir and Danilo. Let us revisit one question that we left
behind: why does using KL-divergence in finding the generator model simplify to maximum
likelihood estimation?

As I already mentioned, it is a straightforward derivation. Please do work it out and try it before
we discuss it.

1344
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Generative Adversarial Networks
Part 01
(Refer Slide Time: 00:14)

We now talk about the first kind of deep generative model, which is arguably the most popular
deep generative model too: Generative Adversarial Networks, or GANs.

(Refer Slide Time: 00:31)

1345
We left behind the question from the last lecture: why does using KL divergence in finding the
generative model simplify to maximum likelihood estimation? If you recall, the problem was
θ* = argmin over all θ's from a family of distributions M of some distance function of p_θ from
p_D. This distance is KL divergence; remember that the definition of KL divergence, for two
probability distributions p and q, is the expectation under p of log(p/q). That is the formal
definition of KL divergence.

So, in this case, you would have the KL divergence of p_D from p_θ, which is the expectation
under p_D of log(p_D / p_θ). The expectation is taken with respect to p_D, which does not depend
on θ, so it does not matter for our optimization. Inside the expectation, log(p_D / p_θ) can be
written out as log p_D minus log p_θ; once again, log p_D depends only on the data distribution
and not on θ.

So, you are left with argmin over θ of the expectation, over data samples x from your distribution
p_D, of −log p_θ(x). This is maximum likelihood estimation, or minimizing the negative
log-likelihood. So, it is a fairly trivial thing to show, as we mentioned last time.
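
Written out compactly, the two-step derivation described above is (a sketch, using the same symbols):

\theta^{*} = \arg\min_{\theta \in M} \mathrm{KL}(p_D \,\|\, p_\theta)
           = \arg\min_{\theta \in M} \mathbb{E}_{x \sim p_D}\!\left[\log p_D(x) - \log p_\theta(x)\right]
           = \arg\min_{\theta \in M} \mathbb{E}_{x \sim p_D}\!\left[-\log p_\theta(x)\right],

since the term E_{x ~ p_D}[log p_D(x)] is a constant with respect to θ; the last expression is exactly the negative log-likelihood objective.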

(Refer Slide Time: 02:20)

Let us recap some of the concepts that we spoke about in the last lecture. We said that generative
models try to learn a probability density function p(x) over a given set of images, the training data
x. If you knew the distribution, say parametrized as a Gaussian, p(x) assigns a positive number to
each possible image depending on its probability of being generated by that distribution; it tells us
how likely that image x is under the distribution p(x).

And we said that applications of generative models include sampling and generating new data,
likelihood estimation, which could help you in outlier detection, because you have a certain
distribution, and you have a certain likelihood of a data point belonging to a distribution. If that
likelihood is very low, it is likely that the point was an outlier. Similarly, you can also do feature
learning, which we will talk about more in this lecture.

(Refer Slide Time: 03:36)

We also talked in the last lecture about density estimation of the probability distribution from an
implicit perspective and an explicit perspective. These also define the different approaches for
deep generative models. In explicit density estimation methods, you try to write an explicit
function for the probability distribution. For example, you may say p(x) = f(x, θ), where f could be
a Gaussian function in the case where you assumed a Gaussian distribution for the density.

In this case, the input would be an image x, the output p(x) would be the likelihood value for image
x, and the parameters would be the weights θ, which define the function that you assign for the
density. So, in this case, an explicit likelihood is assigned to each image. With explicit density
estimation approaches, you will realise as you see different methods that they are very good at
outlier detection, for the same reason that we just said.

So, if a point's likelihood of belonging to the distribution is very low, it is perhaps an outlier. But
explicit density estimation approaches do struggle to generate very high-quality images, and can
sometimes also be slow in generating images. On the other hand, implicit density estimation
approaches do not assign a functional form to 𝑝(𝑥). Instead, they only aim to ensure that given a
model, you can sample images from that model without worrying about an explicit likelihood
assignment for each such sample. So, that is not part of the model. It happens that such methods
end up providing avenues for better sample generation and faster sampling speed.

(Refer Slide Time: 05:58)

With that recap, let us move on to the focus of this lecture, which is Generative Adversarial
Networks. These networks were developed in 2014 by Goodfellow et al. The goal of GANs is to
build a good sampler that allows drawing high-quality samples from p_model(x), where p_model(x)
denotes the distribution of samples drawn from a model learned through this algorithm.

As we just mentioned, there is no explicit computation of a certain functional form for p(x). The
only objective we have concerns the distribution of whatever images we sample from the model,
which we call p_model(x); remember, we will not assign any likelihood value under p_model(x).
We want to ensure that if you collected a set of samples from that model, the distribution of those
samples should look similar to your original data distribution.

There is no likelihood assignment for each sample. So, we ideally want the output samples to be
similar but not the same as your training data. Why do we say that? Because if we have the same
samples as training data, the neural network is perhaps memorizing and not learning much. Then
you do not need such a model. You only need a model, which can then generate diverse samples
beyond what you already have.

So, how do you achieve this goal? The key idea in Generative Adversarial Networks, or GANs, is
to introduce a latent variable z. This is a latent variable with a simple prior, for example a Gaussian
prior. Once you sample from that Gaussian, you pass the sample through a module called a
generator. In our case, a generator could be a neural network: you give the input to that generative
neural network, and the network outputs an image.

This module is known to us from whatever we have seen so far; we have seen semantic
segmentation methods where the output layer is the size of an image itself. Similarly, here also,
the output of the generator is an image. That output defines the distribution coming out of the
generator: let us call a generated sample x̂, a sample from p_G, where p_G denotes the distribution
of images generated by the generator G.

Now, what do we want? If the distribution of the training data is given by p_data, we want to
ensure that the distribution p_G is close to p_data. Observe that we are not trying to assign a
likelihood to every data point x̂, or even x, for that matter; we only want the distribution p_G to be
close to the distribution p_data. The challenge in implementing this is that we are not imposing any
particular parameterization on p_G or p_data.

So, we have to ensure that p_G becomes close to p_data without knowing its distributional form.
Remember, here the Gaussian is only the prior on the input vector; we are not assuming any
Gaussian or any other parameterization on your p_G or p_data distributions.

1349
(Refer Slide Time: 10:23)

So, how do we ensure that p_G is more or less close to p_data? To do this, the originators of this
particular method, Goodfellow et al., had an interesting idea. They introduced a classifier called
a discriminator, which we will denote as D in this lecture, to differentiate between real samples
and generated samples. If data came from p_data, it would be given class 1.

And if data came from the generated distribution, that is, x̂ coming from p_G, the discriminator
would give class 0 for that image. How does this help? Our goal is to train the generator to ensure
that the discriminator misclassifies the generated sample x̂ into class 1, so that it can no longer
differentiate between the original distribution p_data and the new distribution p_G. So, the way we
will equate p_G and p_data is through this discriminator, which the generator will seek to confuse.
The discriminator's job is to separate the fake samples from the real samples; by fake samples, we
mean the generated samples.

The job of the generator is to fool the discriminator. So, you can also think of this like a cop-and-
thief game, where the discriminator is like a cop that can separate real from fake, and the job of the
generator is to fool the discriminator. That is the overall idea.

1350
(Refer Slide Time: 12:24)

How do you train such a generator and discriminator? The training objective for GANs is given by
a min-max optimization problem: you minimize over G, the parameters of your generator network,
and maximize over D, the parameters of your discriminator or classifier network, the quantity
consisting of the expectation, for data coming from the real distribution, of log D(x), plus the
expectation, for the generated samples, of log(1 − D(G(z))). Why is this correct?

The first term says that we would like the discriminator to maximize the log-likelihood of those
data points that come from your original training distribution. The second part states that if you
now take a sample z from the Gaussian, give it to your generator G, and pass the generator's output
to the discriminator, we want the discriminator's output to be 0; at least, the discriminator must aim
to make this value 0.

And that would happen when this entire quantity is maximized. On the other hand, the generator's
job is to push D(G(z)) towards 1, minimizing this entire quantity. That is why you minimize over
G and maximize over D. This is also intuitive because the generator and discriminator are playing
a cop-and-thief game; each would like to outwit the other.

So, while the discriminator wants to maximize terms in the objective, the generator wants to
minimize the corresponding terms in the objective. Such a min-max problem is also known as a
zero-sum game; this has origins in game theory, which we will not get into now. We assume that
the discriminator has a sigmoid activation at its output layer. Remember, the discriminator in this
particular example is a binary classifier.

It only has to say real or fake. So, the best activation function that you can have in the output
layer of a binary classifier is the sigmoid activation. We now assume that a discriminator has a
sigmoid activation function in its output layer.

(Refer Slide Time: 15:18)

So, let us try to parse this objective function a bit differently. Let us take the first term and call it
objective O1. Objective O1 states that we would like to train the discriminator such that if a sample
belongs to p_data, which is the true training data distribution, we maximize the log probability of
it being a real sample. The second part is objective O2, which talks about minimizing over the
generator with respect to the second term.

Remember that the first term does not have anything to do with G; hence, in the minimization over
G, the first term does not matter and can be excluded. So, while minimizing over G, what are we
looking for? We are looking to train the generator G such that if a sample belongs to p_G, that is,
it is an output of the generator G, we would like to maximize the log probability of it being a real
sample. That is what the generator would like to do.
being a real sample. That is what the generator would like to do.

1352
Although, the discriminator would also like to maximize this quantity. What do we mean by
expectation in these two terms? In practical implementation, the expectation simply means that
the losses are averaged over a batch of samples. What does a batch of samples mean here? We
will see that in a moment when we see the algorithm.

(Refer Slide Time: 16:57)

Now, coming to the training strategy. The first thing to note is that you have two networks to train,
D and G. So far, we have seen several other approaches where we had two networks to train: we
had detection with two heads, we talked about a Siamese network with two branches, and we
talked about a two-stream CNN with two branches. But in all of those examples, both branches
were trained with the same loss or the same objective.

We had examples like triplet loss where things slightly changed, but otherwise it was the same
loss; you minimize the same quantity across the complete network to a large extent. So, what can
we do here? One option is to train D completely first, so that we have a good discriminator.

1353
(Refer Slide Time: 17:56)

And that can be done by optimizing only the first part of the objective O1.

(Refer Slide Time: 18:01)

And once we have that, we can train G to optimize O2. Does that work? Does that have any
problems? Let us try to think this through. If D is initially very confident, which means you have
trained an excellent discriminator or a classifier, then it would be able to say that any sample that
comes from G is a fake. This means that if you get an x that is obtained from G, or
corresponding to the distribution, 𝑝𝐺, then 𝐷(𝑥), or σ(𝑥), which is sigmoid activation function in

1354
D would be equal to 0. which means 𝑙𝑜𝑔 (1 − 𝐷(𝑥)), in this case, would be 𝑙𝑜𝑔 (1 − σ(𝑥))
would be equal to 0.

(Refer Slide Time: 19:01)

And if you see the graph of log(1 − σ(x)) and the gradient of log(1 − σ(x)), when
log(1 − σ(x)) is 0, the gradient is also 0. So, you will get a zero gradient, which means G will not
get any gradients to train on and will never learn. So, training the discriminator completely well in
the beginning will not help us get a good generator, which is the main objective of a GAN. What
do we do, then? We alternate between training the discriminator and training the generator, using
O1 and O2, respectively.

1355
(Refer Slide Time: 19:54)

Let us try to see how this is done in the algorithm for GANs. The original paper by Goodfellow
et al. recommends the following. For K steps, you first sample a mini-batch of M noise samples
from the noise prior p_g(z); remember, we had a Gaussian from which we get a few vectors, which
we call noise samples. Similarly, we also sample a mini-batch of M examples from the original
data distribution p_data, which is your training data.

Now, remember, your overall objective is log(D(x)) + log(1 − D(G(z))), and the discriminator
wants to maximize both of these terms; that is what we have said so far. A maximization problem
is solved by gradient ascent, just like a minimization problem is solved by gradient descent. In
gradient ascent, in each iteration you move in the direction of the positive gradient, not the negative
gradient the way you did with gradient descent.

So, we update the discriminator by ascending its stochastic gradient, which is obtained by taking
this loss function and differentiating it with respect to each weight in the discriminator network.
You do this for K steps; the discriminator is not yet completely trained at this point. Once you have
done it for K steps, you switch over and train the generator: you sample a mini-batch of M noise
samples from the noise prior,

and update the generator by descending its stochastic gradient, because for the generator we want
to minimize the second term in the objective; the first term did not depend on G anyway, only the
second term did. So, we would now like to minimize log(1 − D(G(z))) with respect to the
parameters of the generator network G. This is repeated over training iterations.
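
As a rough sketch of this alternating training procedure, here is a minimal PyTorch-style loop. The toy network sizes, optimizers, single discriminator step per iteration, and the non-saturating generator loss (maximizing log D(G(z)) instead of minimizing log(1 − D(G(z)))) are illustrative assumptions, not the exact recipe of the original paper.

import torch
import torch.nn as nn

# Hypothetical toy networks for illustration (e.g., for flattened 28x28 images).
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                      # real: (batch, 784) mini-batch of training data
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
    z = torch.randn(batch, 100)            # noise samples from the Gaussian prior
    fake = G(z).detach()                   # do not backprop into G here
    loss_D = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: fool the discriminator (non-saturating variant).
    z = torch.randn(batch, 100)
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

losses = train_step(torch.randn(64, 784))  # one update with dummy "real" data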

And this is finally used to come up with a model for the generator. At test time, what do you do?
Once you have trained the entire model, you can discard the discriminator. You take a sample
from your Gaussian, send it through the generator network, and the image you get is one that you
would assume belongs to your original data distribution. One point here: what does it mean for
such a network to converge?

So far, whenever we had a loss function, we always wanted to minimize that loss function and to
see the loss go towards 0. To avoid overfitting, we perhaps may not let it go all the way to 0, but
we at least wanted to see the loss reducing over iterations. However, here we have two components,
where one is trying to increase the objective function value and the other is trying to reduce it. So,
what does convergence mean for such a network? Let us see that in more detail.

1357
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Generative Adversarial Networks: Part 02
(Refer Slide Time: 00:13)

So, to evaluate optimality for training such a network, let us reconsider the objective, min over
G, max over D, the expectation for data coming from the training distribution, 𝑙𝑜𝑔 𝐷(𝑥) plus
expectation for the generated samples, 𝑙𝑜𝑔 (1 − 𝐷(𝐺(𝑧))), where z comes from your Gaussian
prior for instance. Now, if we expanded out your expectation, you have min over G, max over D,
integral over x, (𝑝𝑑𝑎𝑡𝑎(𝑥) 𝑙𝑜𝑔 𝐷(𝑥) + 𝑝𝐺(𝑥) 𝑙𝑜𝑔 (1 − 𝐷(𝑥))) 𝑑𝑥.

We assume that the x in the second term is generated data, which we define as 𝑝𝐺(𝑥). So, this is

simply an expansion of the Expectation term in terms of an integral. So, let us now take the max
inside the integral. So, the only difference between step 1 and step 2 is that the max in step 1 has
gone inside; otherwise, the rest of the terms are exactly the same.

Now, to understand what the max of such a function would be, let us try to write out using some
variables. In general, let us say 𝑦 = 𝐷(𝑥), 𝑎 = 𝑝𝑑𝑎𝑡𝑎, and 𝑏 = 𝑝𝐺. So, then you can write this

entire integrand here as 𝑓(𝑦) = 𝑎 𝑙𝑜𝑔 𝑦 + 𝑏 𝑙𝑜𝑔 (1 − 𝑦). We ideally need to know when do
you attain the maximum over D or maximum over y for such a function.

1358
To find the maximum of a function, you take its derivative and set it to 0. So let us do that: f'(y),
the derivative of f(y), is a/y − b/(1 − y), which follows from the derivatives of log y and
log(1 − y). Setting this to 0, we get that the maximum is attained at y = a/(a + b).

What does that mean when we substitute back the variables? For a given x, the optimal
discriminator is D*_G(x) = p_data(x) / (p_G(x) + p_data(x)). Let us keep this in mind and
continue to look at the objective. So, that is the optimal discriminator, but it does not end the
story, because we also have a generator to think about.
story because we also have a generator to think about.

(Refer Slide Time: 03:29)

1359
Let us now look at the generator side of the optimality. You have a min over G of the integral of
p_data(x) log D*_G(x), using the optimal discriminator (let us assume the max over D has been
evaluated and we substitute the optimal D inside the integral), plus p_G(x) log(1 − D*_G(x)), dx.
This can now be written out by replacing D*_G, the optimal discriminator, with
p_data(x) / (p_data(x) + p_G(x)), which we got from the previous slide; we replace it in the first
term and also in the second term.

Now, from here, we can see that the second term, 1 − p_data(x) / (p_data(x) + p_G(x)), can be
rewritten as p_G(x) / (p_data(x) + p_G(x)), a simple arithmetic operation on top of the earlier
expression. Next, we are going to write this expression back in terms of expectations.

We are going to bring back expectations from the integral. The integral over the entire term dx can
now be rewritten as the expectation over x coming from p_data of
log[ p_data(x) / (p_data(x) + p_G(x)) ], plus the expectation over x coming from p_G, which is the
generated distribution, of log[ p_G(x) / (p_data(x) + p_G(x)) ]. What do we do with this?

1360
(Refer Slide Time: 05:13)

Let us now multiply and divide both of these terms by 2. Once we do that, we can take the
denominator’s 2 and add those two terms up and get a minus. Using the first 2, you would get a
minus 𝑙𝑜𝑔 2. Using the second 2, you will get a minus 𝑙𝑜𝑔 2. When you put these two together,
you will have a minus 𝑙𝑜𝑔 4 term. Now, the first term can be written as a KL-divergence between
𝑝𝑑𝑎𝑡𝑎(𝑥) and (𝑝𝑑𝑎𝑡𝑎(𝑥) + 𝑝𝐺(𝑥) )/2.

Remember, this would be 𝑙𝑜𝑔 (𝑝/𝑞), and you will be left with 𝑝𝑑𝑎𝑡𝑎(𝑥) and

(𝑝𝑑𝑎𝑡𝑎(𝑥) + 𝑝𝐺(𝑥) )/2. Similarly, the second term would be the KL divergence between 𝑝𝐺(𝑥)

and (𝑝𝑑𝑎𝑡𝑎(𝑥) + 𝑝𝐺(𝑥) )/2. You still have the minus 𝑙𝑜𝑔 4. Now, let us briefly review some

standard notations and definitions. Remember, the KL divergence is given by the integral of 𝑝 𝑙𝑜𝑔(𝑝/𝑞); to simplify things, I can equivalently write it as the expectation, over x drawn from 𝑝, of 𝑙𝑜𝑔 (𝑝/𝑞).

Jensen-Shannon divergence is another divergence measure for the distance between two probability distributions. Given two distributions 𝑝𝑑𝑎𝑡𝑎 and 𝑝𝐺, the Jensen-Shannon divergence between them is given by 𝐾𝐿 (𝑝𝑑𝑎𝑡𝑎, (𝑝𝑑𝑎𝑡𝑎 + 𝑝𝐺)/2)/2 plus 𝐾𝐿 (𝑝𝐺, (𝑝𝑑𝑎𝑡𝑎 + 𝑝𝐺)/2)/2. So, this is the Jensen-Shannon divergence between these two distributions, which means we can now replace the KL divergences in our expression.

So remember, each KL term in the Jensen-Shannon divergence carries a divide-by-2; accounting for that, the objective now becomes min over G of 2 * 𝐽𝑆𝐷(𝑝𝑑𝑎𝑡𝑎, 𝑝𝐺) − 𝑙𝑜𝑔 4. Remember that the Jensen-Shannon divergence, just like KL, is also

a non-negative quantity by definition. Now, what does this mean? Let us put all things together
now.

(Refer Slide Time: 07:53)

We already saw that the optimal discriminator is given by 𝑝𝑑𝑎𝑡𝑎(𝑥) / (𝑝𝑑𝑎𝑡𝑎(𝑥) + 𝑝𝐺(𝑥)).

(Refer Slide Time: 08:03)

Consider this particular term, minimized over G: 2 * 𝐽𝑆𝐷(𝑝𝑑𝑎𝑡𝑎, 𝑝𝐺) − 𝑙𝑜𝑔 4. Because the Jensen-Shannon divergence is a non-negative quantity, this is minimized when 𝑝𝑑𝑎𝑡𝑎 = 𝑝𝐺: the divergence then becomes 0, and you are left with minus 𝑙𝑜𝑔 4. So, the optimal generator G is obtained when 𝑝𝑑𝑎𝑡𝑎 = 𝑝𝐺. We knew this intuitively, but now we also see it mathematically.
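To see this numerically, here is a minimal sketch (with made-up discrete distributions, not from the lecture) showing that 2 * 𝐽𝑆𝐷(𝑝𝑑𝑎𝑡𝑎, 𝑝𝐺) − 𝑙𝑜𝑔 4 reaches its minimum value of −𝑙𝑜𝑔 4 exactly when 𝑝𝑑𝑎𝑡𝑎 = 𝑝𝐺:

import numpy as np

def kl(p, q):
    # KL divergence between two discrete distributions
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.1, 0.4, 0.5])
p_g_far = np.array([0.5, 0.3, 0.2])    # a generator distribution far from the data
p_g_same = p_data.copy()               # a perfect generator, p_G = p_data

for p_g in (p_g_far, p_g_same):
    print(2 * jsd(p_data, p_g) - np.log(4))
# The second value equals -log 4 (about -1.386), the minimum of the generator objective.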

(Refer Slide Time: 08:44)

So, let us bring that back. We also know that at optimality for a generator, 𝑝𝑑𝑎𝑡𝑎, or the

probability distribution of the training data is equal to 𝑝𝐺 , which is the distribution of the

generator. Now, putting the two together, it states that the optimal discriminator is also
𝑝𝐺(𝑥) / (𝑝𝐺(𝑥) + 𝑝𝐺(𝑥) ) because 𝑝𝑑𝑎𝑡𝑎 = 𝑝𝐺 at optimality. And that can also be written as

𝑝𝑑𝑎𝑡𝑎(𝑥) / (𝑝𝑑𝑎𝑡𝑎(𝑥) + 𝑝𝑑𝑎𝑡𝑎(𝑥) ), both of which equate to half. At optimality, the discriminator

should give you an output half to maintain the balance between generator and discriminator.

We do not want the discriminator to always give one or always give zero. If it outputs half for any sample provided to it, we consider it fooled, because it cannot distinguish between a real sample and a fake one.

(Refer Slide Time: 09:51)

So, that is about GANs. A follow-up architecture developed in 2016 by Radford et al. was called the Deep Convolutional GAN, or DCGAN. DCGAN was a landmark achievement in
improvements over GANs because it came up with a few different ways of improving the
generation quality of images using GANs. The main idea of DCGAN was to bridge the gap
between the success of CNNs for supervised learning and unsupervised generative models.

It brings the best practices of training CNNs to training GANs with deep architectures. The authors also show that you can play with the latent space representations, the z vectors you sample from a Gaussian, and do what is called vector arithmetic, which we will see soon. They also show how such an architecture gives the GAN strong feature learning capabilities. Let us see each of these one by one.

(Refer Slide Time: 11:08)

So, in terms of training practices for good GAN-based generation, DCGAN introduced a few strategies. It replaced deterministic spatial pooling functions such as max-pooling with strided convolutions, which allows the network to learn its own spatial downsampling. It removed fully connected hidden layers for deeper architectures, so there are just convolution layers, nothing more. It introduced batch normalization in both the generator and the discriminator.

This helps prevent generator collapse and helps gradient flow in deep architectures. However, batch normalization was not applied at the output of G and the input of D. DCGAN used a ReLU non-linearity for the generator and a leaky ReLU non-linearity for the discriminator. Finally, the generator used a 𝑡𝑎𝑛ℎ non-linearity as its final output. And, as before, a sigmoid activation was used in the output layer of the discriminator to say whether an input is real or fake.
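To make these practices concrete, here is a minimal PyTorch sketch of a DCGAN-style generator; the exact layer sizes are illustrative assumptions rather than the paper's reference architecture, but it follows the listed practices: strided transposed convolutions instead of pooling, no fully connected hidden layers, batch normalization everywhere except the output, ReLU inside, and tanh at the output.

import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Maps a latent vector z ~ N(0, I) to a 64x64 RGB image in [-1, 1]."""
    def __init__(self, z_dim=100, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            # z_dim x 1 x 1 -> (ngf*8) x 4 x 4
            nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            # -> (ngf*4) x 8 x 8: strided transposed conv acts as learned upsampling
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            # -> (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            # -> ngf x 32 x 32
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            # -> 3 x 64 x 64: no batch norm at the output, tanh non-linearity
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):                       # z: (batch, z_dim)
        return self.net(z.view(z.size(0), -1, 1, 1))

fake_images = DCGANGenerator()(torch.randn(4, 100))   # -> shape (4, 3, 64, 64)

A discriminator built along the same lines would mirror this with strided convolutions, leaky ReLUs, and a sigmoid output.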

(Refer Slide Time: 12:29)

So, we talked about playing with the latent space to generate different kinds of images. Here is an interesting experiment. What DCGANs demonstrated is that if you have two latent samples, 𝑧1 and 𝑧2, sampled from the Gaussian and sent through the generator, you get two different outputs, one for each latent sample. If you now interpolate between those latent vectors, that is, pass α * 𝑧1 + (1 − α) * 𝑧2 through the generator for values of α between 0 and 1, you end up getting a smooth transition of generated images from 𝐺(𝑧1) to 𝐺(𝑧2).

(Refer Slide Time: 13:20)

You could now look at this as doing vector arithmetic in latent space. Suppose you had a set of images of men with glasses, a set of images of men without glasses, and a set of images of women without glasses. You can take the latent vectors corresponding to all the men with glasses and average them. Similarly, take an average latent vector for the men without glasses and for the women without glasses.

Now, you can do arithmetic on those average z vectors or latent vectors: take the average latent vector for a man with glasses, subtract the latent vector corresponding to a man without glasses, add the latent vector corresponding to a woman without glasses, and pass the resultant latent vector through the generator; you end up getting images of a woman with glasses. This is interesting for understanding how the latent variables are combined and how the generator learns what the equivalent change in the image space should be.

(Refer Slide Time: 14:43)

Here is another example of a similar idea, for pose transformation. Given a set of images looking left, with corresponding average latent vector 𝑧𝑙𝑒𝑓𝑡, and a set of images looking right, whose latent vectors are averaged to form 𝑧𝑟𝑖𝑔ℎ𝑡, consider 𝑧𝑡𝑢𝑟𝑛 = 𝑧𝑟𝑖𝑔ℎ𝑡 − 𝑧𝑙𝑒𝑓𝑡, the difference between the extreme vectors, and a new latent sample 𝑧.

Remember that the z's are all inputs to the generator. If you set 𝑧𝑛𝑒𝑤 = 𝑧 + α * 𝑧𝑡𝑢𝑟𝑛 and provide 𝑧𝑛𝑒𝑤 to G, you get transformed images with various poses between the right and the left pose.
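A minimal sketch of these latent-space manipulations is shown below; here G is only a stand-in for a trained generator (any network mapping z to an image would do), and the attribute-specific average vectors are hypothetical placeholders rather than values from the paper.

import torch
import torch.nn as nn

# Stand-in for a trained generator such as the DCGAN generator sketched earlier
G = nn.Sequential(nn.Linear(100, 3 * 64 * 64), nn.Tanh())

# 1) Interpolation: walk smoothly from G(z1) to G(z2)
z1, z2 = torch.randn(1, 100), torch.randn(1, 100)
interpolated = [G(alpha * z1 + (1 - alpha) * z2) for alpha in torch.linspace(0, 1, 8)]

# 2) Vector arithmetic on (hypothetical) average latent vectors per attribute group
z_man_glasses, z_man, z_woman = torch.randn(1, 100), torch.randn(1, 100), torch.randn(1, 100)
woman_with_glasses = G(z_man_glasses - z_man + z_woman)

# 3) Pose transformation: z_new = z + alpha * z_turn, with z_turn = z_right - z_left
z_left, z_right, z = torch.randn(1, 100), torch.randn(1, 100), torch.randn(1, 100)
turned = G(z + 0.5 * (z_right - z_left))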

(Refer Slide Time: 15:39)

As we mentioned earlier, this work also showed that GANs learn good features that can be used for classification. This was demonstrated by training the DCGAN on ImageNet-1k and then using the discriminator's convolution features for images from another dataset, CIFAR-10. So, the GAN is not trained on CIFAR-10; after it is trained on ImageNet, you take images from the CIFAR-10 dataset, which is a different dataset with ten classes.

You pass those images through the discriminator of the GAN, take its features at a particular layer of the discriminator, and use these features with an SVM to classify those CIFAR-10 images into the ten classes. That is what is done in this particular experiment. You see that the result obtained is fairly competitive with many other contemporary methods of that time, in 2016. This shows the robustness of the features learned by the discriminator.

(Refer Slide Time: 17:05)

Now, the final discussion of this lecture is: how do you evaluate GANs? So far, for supervised learning, we could use accuracy; but for GANs, how do you evaluate these models? One option is to use human judgment. How would you use human judgment? You would say a good generator is one which can generate images with distinctly recognizable objects and which also generates semantically diverse samples.

How do you measure this? Recognizable objects would mean that an independent classifier would take these generated images and predict the class with high confidence. Semantic diversity would mean that the generator, or the GAN, generates samples of various classes in the training set, ideally all classes in the training set. That is one way of evaluating GANs.

Another way of evaluating GANs is by looking at the prediction power. Say you take an
ImageNet pre-trained Inception network V3 and see how it performs on the generated images to
understand the quality of generated images. If it performs well, perhaps the images are fairly
representative of a dataset, such as ImageNet. However, it is also to be kept in mind that the
evaluation of generative models is still an open research problem. Some metrics are popularly
followed, but there is still scope for improvement of these metrics.

(Refer Slide Time: 18:49)

One such metric that is popularly used is the Inception score, which is intended to correlate with
human judgment. You consider two quantities, 𝑝(𝑦 | 𝑥), the softmax output over all class labels
of an Inception model, given a data point x. You also have 𝑝(𝑦), the generated samples’ class
distribution. What are you looking for here? We ideally want 𝑝(𝑦 | 𝑥) to be a peaked distribution.

So, we would like one class label, the correct class label, to have a very high probability, and all other class labels to have very low probabilities. That is what we want 𝑝(𝑦 | 𝑥) to be. At first glance, we might want such a distribution to be as far away from a uniform distribution as possible; that would tell us that the Inception model is highly confident in recognising the objects generated in a given image. But is this correct? Not necessarily.

We do not want it to be far away from a uniform distribution; we want it to be far away from how class labels are distributed among the generated images. That is the baseline for us, because it is possible that the GAN generated more images of a certain class and fewer images of another class. So, the marginal distribution of the class labels over your generated images is the baseline, and you want the class distribution for a specific data point x to be as far away from that baseline distribution as possible.

(Refer Slide Time: 20:56)

So, the Inception score is based on the KL divergence between 𝑝(𝑦 | 𝑥) and 𝑝(𝑦). The KL term is the expectation of 𝑙𝑜𝑔 (𝑝(𝑦 | 𝑥)) − 𝑙𝑜𝑔 (𝑝(𝑦)), which, averaged over the generated samples, can also be written as the entropy of y, 𝐻(𝑦), minus the conditional entropy of y given x, 𝐻(𝑦 | 𝑥). What are these quantities? 𝐻(𝑦) is the entropy of the generated samples' class labels. Remember, semantic diversity would mean that you have a high entropy 𝐻(𝑦): you are generating samples from all the different classes.

High entropy means equally distributed class labels in your generated data distribution. 𝐻(𝑦 | 𝑥) is the entropy of the class labels output by the classifier for an individual data point. If a generated image is distinctly classifiable, 𝐻(𝑦 | 𝑥) will be very low, which means one class label dominates the output of the softmax activation function rather than all labels. What are we looking for? We are looking for a higher Inception score: we would like 𝑝(𝑦 | 𝑥) to be as far away from 𝑝(𝑦) as possible.
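As a rough illustration (not the exact official implementation), here is a minimal sketch of the Inception-score computation; `probs` is assumed to be an (N, C) array of softmax outputs 𝑝(𝑦 | 𝑥) from a pre-trained classifier such as Inception v3 on N generated images, and practical implementations usually also average the score over several splits of the generations.

import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) array of softmax outputs p(y|x) for N generated images."""
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal p(y) over generations
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return np.exp(kl.mean())                                # exponentiated average KL(p(y|x) || p(y))

# Toy example: confident and diverse predictions score high, uninformative ones score 1
confident = np.eye(10)[np.arange(100) % 10]                 # one-hot over 10 classes
uniform = np.full((100, 10), 0.1)                           # classifier is totally unsure
print(inception_score(confident), inception_score(uniform)) # roughly 10.0 vs 1.0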

(Refer Slide Time: 22:24)

While the Inception score is good, one of the limitations of the inception score is it does not
consider the real data at all. Its metric is purely based on the generated distribution alone. But we
know that the purpose of GAN was to ensure that the generated distribution is close to the real
distribution. So, how do we come up with a metric to measure this? The Frechet Inception Distance, or FID score, tries to address this need.

So, we need to find the distance between the real-world data distribution and the generative model's data distribution. How do we do this? You take your real images and your generated images and embed them in the feature space of an Inception V3 model. So, whatever images you have, your real training images and your generated images, you pass all of them through an Inception network and take the activations of the pool3 layer of Inception V3.

Now, you compute the Frechet distance between the two multivariate Gaussians fitted to those two sets of features. You are given one Gaussian distribution with mean m and covariance C, and another with mean 𝑚𝑤 and covariance 𝐶𝑤. The Frechet distance is defined as the squared 2-norm of the difference of the two means, plus the trace of the first covariance plus the second covariance minus twice the matrix square root of the product of the two covariances, i.e. ||m − 𝑚𝑤||² + Tr(C + 𝐶𝑤 − 2(C 𝐶𝑤)^(1/2)).

So, that is the definition of the Frechet distance between two multivariate Gaussians. Given the two sets of features for real and generated samples, you compute their means and covariances, and then you can compute the Frechet distance using this formula. In this particular case, the lower the FID, the better: a lower FID tells you that the generated distribution is close to the real distribution.
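A minimal sketch of this computation is given below, assuming `feats_real` and `feats_fake` are (N, D) arrays of Inception pool3 features extracted beforehand; it uses scipy's matrix square root, and robust implementations add further numerical safeguards.

import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two sets of (N, D) features."""
    m, mw = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    C = np.cov(feats_real, rowvar=False)
    Cw = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(C @ Cw)                     # matrix square root of the product of covariances
    if np.iscomplexobj(covmean):                # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return np.sum((m - mw) ** 2) + np.trace(C + Cw - 2 * covmean)

# Toy usage with random "features"; a lower value indicates closer distributions
print(frechet_distance(np.random.randn(500, 16), np.random.randn(500, 16) + 1.0))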

(Refer Slide Time: 24:56)

Let us see an example of how these metrics look when you use them. So, these are some results
on a data set known as the CelebA, which contains face images of celebrities. In this particular
example, the first column and the third column are FID scores, and the second column and the
fourth column are Inception scores. So, remember the lower FID, the better, the higher Inception,
the better.

So, you can see here that, in this particular case, the top row denotes images with added Gaussian noise. The FID score keeps increasing, which means it is getting worse and worse as you add more and more noise; that is expected, because your distribution is changing. Similarly, when Gaussian blur is added, the FID keeps increasing, and likewise when salt-and-pepper noise is added, the FID keeps increasing.

And when ImageNet crops are added to the CelebA dataset, that is, you take some images from ImageNet, crop certain portions and keep adding them to your CelebA dataset, the FID score once again goes up. On the other hand, if you look at the Inception score, it is evident in this particular case that as more ImageNet crops are added, the Inception score goes down, which is once again expected.

Remember, a higher Inception score is good, so as more noise comes in, the Inception score drops. However, for Gaussian noise, Gaussian blur, and salt-and-pepper noise, you see that the Inception score does not show much sensitivity to these kinds of corruptions of the distribution. There is some variation, but it is not as stark as it is for the ImageNet crops.

(Refer Slide Time: 26:58)

With that, your reading for this GAN lecture is a very nice dive into deep learning on GANs provided at this particular link. If you would like to play with GANs in your browser, here is a nice tool: https://poloclub.github.io/ganlab. And finally, there is an implementation of a GAN at this particular link, which you can use to play with GAN code yourself.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Variational AutoEncoders
(Refer Slide Time: 00:14)

Let us move on from GANs to another kind of Deep Generative Models called Variational Auto
Encoders, popularly known as VAEs, before we talk about Variational Auto Encoders. Let us
briefly recall autoencoders, which we covered last week.

(Refer Slide Time: 00:34)

Autoencoders are neural networks used for unsupervised learning, where the network attempts to reconstruct the input itself as the output. Autoencoders have an architecture such as the one shown, where the input is compressed into a bottleneck layer that holds a certain representation of the input, also called a latent code, which is then passed through a decoder to reconstruct the output, which we would ideally like to resemble the input itself.

The question we would like to ask in the context of this week's topic is: how can we use such an approach to generate images automatically? Remember, when we spoke about autoencoders, we generally discard the decoder once the autoencoder is trained; given an input, we use the encoder alone to get a low-dimensional representation of the data. Let us now see if we can reverse this idea.

At the end of training, can we discard the encoder instead of the decoder, and use some vector as the code to generate images close to the training data distribution? As you can see, the goal is close to what GANs were trying to achieve, but approached from a different perspective.

(Refer Slide Time: 02:27)

The idea of variational autoencoders was introduced simultaneously by two groups of researchers: Kingma and Welling, in the paper "Auto-Encoding Variational Bayes", published at ICLR 2014, and Rezende, Mohamed, and Wierstra, in "Stochastic Backpropagation and Approximate Inference in Deep Generative Models", published at ICML 2014.

(Refer Slide Time: 02:58)

Let us try to understand variational autoencoders now. At the crux of VAEs is learning what is known as a latent variable model. In a latent variable model, our objective is to learn a mapping from some latent random variable z to a possibly complex distribution on x, which we assume is the real-world training distribution that we would like to generate data similar to. Mathematically speaking, you would then write 𝑝(𝑥) as the integral over z of the joint distribution 𝑝(𝑥, 𝑧), where 𝑝(𝑥, 𝑧) is equal to 𝑝(𝑥 | 𝑧) into 𝑝(𝑧). We assume that the prior, or the marginal distribution on z, the latent variable, is something simple.

How would you get 𝑝(𝑥 | 𝑧)? We assume it is given by some function of the latent vector z. So, that is the simple mathematical construction here. Given training data, and assuming this is unsupervised learning, so only x is given, the question we are asking is: can we learn to decouple the true explanatory factors, the latent variables underlying the data distribution, which are perhaps generating the data?

For example, in face images, you may want to assume a latent random variable called identity, a
latent random variable called expression. So, if we say person A is happy, you should generate a
facial image of person A in a happy mood. So, in this case, identity and expression would be
latent variables. And those latent variables may assume different values, identity may assume
values of different people like ABC, and expression may assume different values like happy, sad,
angry, so on and so forth.

And the combination results in the generation of specific kinds of images. So, we are talking
about a latent set of variables, in this case, represented as 𝑧1, 𝑧2, and a function g that takes us

from this latent space to the original data space. How do we learn such a function g that takes us
from z to x?

(Refer Slide Time: 05:45)

We will assume that we are going to use a neural network to learn such a function that takes us from z to x, that is, this conditional distribution: given a vector z, the neural network should give us a vector x. The question now remains: where does z come from? Computing the posterior 𝑝(𝑧 | 𝑥) is intractable, and we do not know what z looks like; it is not available to us. Unfortunately, we need it to be able to train the 𝑔(𝑧) model; without values of z, we cannot train the 𝑔(𝑧) model. So, how do we get around this problem?

(Refer Slide Time: 06:34)

So, as you can see, variational autoencoders offer a Bayesian spin on an autoencoder. We assume that our data is generated using the following generative process: z is a set of latent random variables, which we assume belong to a certain prior distribution; we sample from a true conditional distribution given z, which results in x, the training data distribution that we see. The intuition here is that x could be an image generated from some latent attributes, such as a class label, an orientation, other attributes, and so on.

So, we could say that the class label identifies whose face it is, the orientation is the direction in which they are looking, and the attributes could be the colour of the face, the colour of the eyes, the colour of the hair, etc. The problem for us now is that, without knowing what these z's are, we need to estimate the parameters of this conditional distribution, 𝑝(𝑥 | 𝑧). The parameters θ are what we would like to estimate without having access to z. How we do this is the core idea of VAEs.

(Refer Slide Time: 08:04)

To start this process, let us assume now that there is a prior distribution on the random variable z,
which for convenience, we are going to assume is a unit Gaussian. And the moment you have z,
you can now construct a neural network, which we call the decoder network with certain
parameters theta, which are the weights of those layers in the decoder network.

And the decoder outputs a set of means and variances, which can then be used to sample and get
data similar to x. So, the conditional distribution, 𝑝θ(𝑥 | 𝑧) is assumed to be a diagonal Gaussian

for simplicity. One could assume a multivariate Gaussian with a certain covariance matrix, but
the procedure gets a bit complex. We will keep it simple with a diagonal Gaussian, whose means and variances are predicted by the decoder network: for each dimension of that multivariate Gaussian, the mean and the variance are predicted.

(Refer Slide Time: 09:19)

From Bayes rule, we know that 𝑝θ(𝑧 | 𝑥), which is our posterior, is given by 𝑝θ(𝑥 | 𝑧) into 𝑝θ(𝑧)

by 𝑝θ(𝑥 ). Now, if we observe the terms on the right-hand side, 𝑝θ(𝑥 | 𝑧) , we just talked about

how you can obtain it. Assume that your z’s are a unit Gaussian and learn a decoder network to
give you the output x. So, that gives you 𝑝θ(𝑥 | 𝑧). What about 𝑝θ(𝑧)?

We just said we are going to assume that to be a Gaussian. So, that is an assumption. What about the denominator 𝑝θ(𝑥)? Unfortunately, that is an intractable integral: we do not know, at this point, what distribution x comes from or how that distribution is parameterized. So, what do we do to get around this? We will now assume that there is some other distribution 𝑞ϕ(𝑧 | 𝑥), which we will call an approximate posterior.

It is obtained using an encoder network. What does the encoder network do? It takes in your training data, so that is where your neural network takes in all the inputs from your training data. It has a set of layers parameterized by some weights, which we denote as ϕ, and it outputs a set of means and variances of a multivariate diagonal Gaussian distribution. That gives you the approximate posterior 𝑞ϕ(𝑧 | 𝑥).

(Refer Slide Time: 11:21)

Putting all the pieces together, we have an encoder at the bottom and the decoder on top. The
encoder takes x and gives out a set of means and variances for this approximate posterior
distribution 𝑞ϕ(𝑧 | 𝑥). Once you have the distribution, you can sample a z from that distribution

and that sample z is given as input to your decoder network. The decoder network outputs a set
of means and variances corresponding to 𝑝θ(𝑥 | 𝑧).

Given these means and variances, you can sample from 𝑝θ(𝑥 | 𝑧) to get your output images or

reconstructions. We assume that both 𝑞ϕand 𝑝θ are multivariate Gaussians, with a certain mean

and a diagonal covariance. Each dimension is independent. Now, how do we train such a
network? Let us try to understand that. We would ideally like to train the decoder network, just
like the normal autoencoder.

So, because it will give us an output x, we would ideally like to use something like a
reconstruction loss, something like a Mean Squared Loss to ensure that this reconstruction here
is as close as possible to x itself. That should be one term of our loss function. What else do we
need to do? We also want to ensure that this approximate posterior 𝑞ϕ(𝑧 | 𝑥) should be close to

the prior of z that we assumed.

We assumed for the decoder network that z comes from a unit Gaussian prior. So, we would ideally like 𝑞ϕ(𝑧 | 𝑥) to be close to the distribution 𝑝θ(𝑧), which, in our case, we assume to be a Gaussian with mean zero and the identity matrix as its covariance matrix.

(Refer Slide Time: 13:46)

Let us try to put these together and get the formal loss function to train such a variational
autoencoder. Remember, the goal is to do a maximum likelihood estimation. We have a training
dataset given by 𝑥𝑖’s. We ideally like to learn the parameters θ, which are the parameters of the

decoder, in such a way that you maximize the likelihood of that distribution, generating your
training data points.

That is your maximum likelihood estimation for your dataset. We will convert that to a maximum log-likelihood estimation, where the product gets converted to a sum; then you have 𝑙𝑜𝑔 𝑝θ, which makes things a bit simpler mathematically. Now, 𝑝θ(𝑥), which is this

internal term here, can be given by integral 𝑝θ(𝑥, 𝑧) 𝑑𝑧 by definition, and this can be expanded

as 𝑝θ(𝑥 | 𝑧) * 𝑝θ(𝑧) 𝑑𝑧.

Unfortunately, we are once again left with an intractable integral in this particular case. So, let us
try to see how to solve this maximum likelihood estimation problem. We are going to solve it
with a twist.

(Refer Slide Time: 15:18)

So, let us rewrite the maximum log-likelihood estimation problem. We have the log-likelihood
𝑙𝑜𝑔 𝑝θ, which can be given by, expectation over z vectors sampled from the approximate

posterior of 𝑙𝑜𝑔 𝑝θ(𝑥). So, if we sample z vectors from your approximate posterior and give

them as input to the decoder, the likelihood of generating samples similar to the training
distribution must be high. That is what the first sentence means.

Now, 𝑙𝑜𝑔 𝑝θ(𝑥) can be written using Bayes theorem, as 𝑝θ(𝑥 | 𝑧) * 𝑝θ(𝑧) by 𝑝θ(𝑧 | 𝑥). That is

just the Bayes theorem written differently. To this particular expression, we are going to multiply and divide by the approximate posterior 𝑞ϕ(𝑧 | 𝑥); it is just multiplying and dividing by the same quantity.

Once we have this expression equivalent to the maximum likelihood estimation setting for this
problem, we can group terms to write this differently. The first term, 𝑝θ(𝑥 | 𝑧), stays the same

way, and we write that as the first term here, expectation over z, 𝑙𝑜𝑔 𝑝θ(𝑥 | 𝑧). You have another

term here with 𝑞ϕ(𝑧 | 𝑥) and 𝑝θ(𝑧); you would be able to write that as plus the expectation over z of 𝑙𝑜𝑔 (𝑝θ(𝑧) / 𝑞ϕ(𝑧 | 𝑥)).

Now, instead of adding it, we are going to subtract it and write it this way: we have minus the expectation over z of 𝑙𝑜𝑔 (𝑞ϕ(𝑧 | 𝑥) / 𝑝θ(𝑧)). It is just the reciprocal, with a sign change in front. The remaining terms are these two terms here, 𝑝θ(𝑧 | 𝑥) in the denominator and 𝑞ϕ(𝑧 | 𝑥) in the numerator, which is what the third term gives us. So, it is just grouping terms and writing them differently.

Now, if you observe the second and third terms here, you will notice that the second term is the
KL divergence between 𝑞ϕ(𝑧 | 𝑥), the approximate posterior and 𝑝θ(𝑧), the prior on z. The third

term is the KL divergence between the true posterior, 𝑝θ(𝑧 | 𝑥) and the approximate posterior,

𝑞ϕ(𝑧 | 𝑥). We will write the second term as a certain KL divergence and the third term as a

certain KL divergence.

Now, if you observe here, KL divergence is a non-negative quantity, so the third term will be greater than or equal to 0. Because this is a log-likelihood estimation problem, we would ideally like to maximize the quantity comprised of these three terms; since the third term is greater than or equal to 0, this can be achieved by maximizing the first two terms by themselves.

(Refer Slide Time: 19:18)

So, we are saying now that we have a lower bound, which is given by the first two terms.
Remember, the third term is non-negative and is added to that. So, this entire log-likelihood will
be lower bounded by these first two terms. So, if we would like to maximize the log-likelihood,

we can maximize this lower bound. This lower bound is known as the Evidence Lower BOund, or ELBO (read out as "elbow", which is why it is often written that way); it is also known as the variational lower bound.

And this entire procedure of optimization is known as a method called Variational Inference,
where we introduce a variational distribution. In this case, the variational distribution for us is
the approximate posterior 𝑞ϕ, and using that, we try to learn the parametrization of the distribution. So, our goal is to find θ* and ϕ*, which maximize this evidence lower bound. That is what we would like to do.

(Refer Slide Time: 20:40)

So, let us put this together. We now are saying that we have introduced an inference model
𝑞ϕ(𝑧 | 𝑥), the approximate posterior that tries to approximate the original posterior by optimizing

the variational lower bound that we saw.

Remember, these two terms here are exactly what we had as the two terms here: expectation over
z, 𝑙𝑜𝑔 𝑝θ(𝑥 | 𝑧) minus the KL divergence between approximate posterior and prior on z. We

parameterize 𝑞ϕ(𝑧 | 𝑥) using our encoder network. So, we now have an encoder and a decoder

network.

Let us see them individually before we put them together in this context. Given training data x,
provided as input to an encoder, the encoder outputs means and variances of an approximate
posterior distribution. We would like to have it as close to the prior on z. Then we sample from z
and use the decoder to generate x, which we would like to have as close to the training data
distribution.

(Refer Slide Time: 22:06)

So, how are we going to train this? Look at the overall objective; in this case, we will maximize it, so it is not quite a loss but an objective function. But there is one problem with this training procedure. If you noticed, what we said was: we give the input to the encoder, the encoder gives out a parameterization of the approximate posterior, we sample a z from that distribution, which is then given to the decoder, and then you get your final output.

This means we now have an encoder-decoder model, an autoencoder, in which we sample a vector rather than forward-propagate a vector at the middle step, the bottleneck layer. Why is this a problem? That sampling step is not deterministic, and we would not know how to pass the gradients through that bottleneck layer back to the encoder. That is a challenge. What do we do?

We ideally want to ensure that the sample that we get becomes independent of the encoder
output. Suppose it becomes independent of the encoder output. In that case, you can simply
backpropagate your gradients through your decoder and encoder and assume that the randomness
is coming from some external quantity. How do we separate the randomness from the encoder
output?

(Refer Slide Time: 23:49)

This is done using a trick known as the Re-parameterization trick. This was one of the main
contributions of the Variational Auto Encoder method, a simple trick that solves the problem of
back-propagating through this network. So, remember that we will sample z from, say, some
approximate posterior distribution, which was given by a Gaussian distribution with mean at a
certain µ𝑧(𝑥) with variances, so standard deviations are given by σ𝑧(𝑥).

Now, a sample that is drawn from such a distribution can be rewritten slightly differently. We can
parameterize that sample z. You randomly sample a vector from a unit standard normal
distribution multiplied by the standard deviation learned by the encoder network and add the
mean learned by the encoder network, and now this becomes a sample. How did this get around
the real problem?

To see why this is a problem, consider the network as a set of layers. In the formulation of the variational autoencoder so far,
one of the layers became stochastic, so you had to sample a vector. It was not forward
propagated through a neural network, as we have seen so far. So, because that was a sample, we
would not have known how to backpropagate through that layer to the earlier layers, in our case,
the encoder.

So, the re-parameterization trick suggests that we make that layer a deterministic function of the previous layer's output and instead bring in the stochasticity from a different variable from outside, rather than from that layer's output itself. If you see here, the new sample is given by the µ and σ output by the encoder, combined with ε, which you could almost consider now like a bias vector that comes from noise.

It does not affect backpropagation into earlier layers. It is an external input given in that
particular layer. And now, you can backpropagate through the entire network using gradient
descent, the way we have seen it before. So, you now give x to the encoder, which generates the mu's and sigma's; then you sample z using the re-parameterization trick, go through the decoder, and generate mu's and sigma's at the output, from which you can sample and get your generations.
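A minimal PyTorch sketch of the re-parameterization trick is shown below: the noise ε is drawn externally, so gradients flow through µ and σ back into the encoder. The encoder itself is assumed to exist elsewhere and is not shown.

import torch

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); differentiable w.r.t. mu and log_var."""
    std = torch.exp(0.5 * log_var)       # encoders typically predict log-variance for stability
    eps = torch.randn_like(std)          # external randomness, independent of the encoder output
    return mu + std * eps

mu = torch.zeros(4, 20, requires_grad=True)
log_var = torch.zeros(4, 20, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                       # gradients reach mu and log_var despite the sampling step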

(Refer Slide Time: 26:58)

So, with the re-parameterization trick, the entire variational autoencoder becomes amenable to backpropagation. You can use gradient descent and backpropagation to update the weights in the VAE
and train the network. So, as before, the objective function here is given by minus KL divergence
of 𝑞ϕ and 𝑝θ plus the expectation of 𝑙𝑜𝑔 𝑝θ(𝑥 | 𝑧). Now, 𝑙𝑜𝑔 𝑝θ(𝑥 | 𝑧) turns out to be a simple

reconstruction loss. We would like this to be close to your training distribution.

So, you can use even a Mean Square Error to maximize this likelihood in the second term. What
about the first term? How can we make that differentiable? How can we get the gradient for it to
you for backpropagation? The key here is that we have assumed 𝑞ϕ to be a Gaussian with certain

means and variances, which the encoder provides; we also assumed 𝑝θ to be a unit Gaussian.

And the KL divergence between Gaussians is a well-defined formula that is differentiable. So,
that helps us ensure that we can compute the gradients and backpropagate and train this network
end to end.
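Putting the two terms together, a minimal sketch of the resulting objective could look as follows; it assumes an MSE reconstruction term, a diagonal Gaussian 𝑞ϕ(𝑧 | 𝑥) matched against a unit Gaussian prior, µ and log σ² coming from the encoder, and x_recon from the decoder. This is an illustrative implementation rather than the only possible one.

import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, log_var):
    # Reconstruction term: with a Gaussian p(x|z) of identity covariance, -log-likelihood ~ MSE
    recon = F.mse_loss(x_recon, x, reduction='sum')
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl        # minimizing this is maximizing the ELBO (up to constants)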

(Refer Slide Time: 28:28)

So, that completes our discussion of how VAEs are learned. To summarize, and to differentiate them from autoencoders: traditional autoencoders are learned by reconstructing the input, and they are often used to learn features, i.e., low-dimensional representations, which can then be used to train another neural network, an SVM, or any other supervised model. Variational autoencoders, on the other hand, are conceptually a marriage between Bayesian learning and deep learning.

And after training a variational autoencoder, you could discard the encoder, just sample from
your prior 𝑝θ(𝑧) , a unit Gaussian. And now your decoder has learned to generate images, taking

a unit Gaussian as input to generate images that look close to your input data distribution. That is
what allows VAE to become a generative model.

(Refer Slide Time: 29:39)

What can VAEs do? Here are a couple of examples where a VAE was trained on the MNIST dataset and the Frey Face dataset. For MNIST, you can see what happens when the latent variables, the z's in your VAE, are modified by small amounts. Remember, we sample these vectors; you can see that the generated images, that is, the outputs of the decoder, vary gradually as the z value is varied.

That shows us that the variational autoencoder has indeed learned latent variable values that are capable of generating data smoothly along the manifold on which it lies. Similarly, just like what we saw for GANs, on the Frey Face dataset you can learn latent variables such as expression and pose. As you change the values of the corresponding latent random variables, you end up getting generations of face images with different poses and different expressions.

(Refer Slide Time: 30:54)

What else can VAEs be used for? They have plenty of applications, and we will see some of these in detail next week. Here are a few examples: image and video generation, super-resolution, forecasting from static images, image inpainting, and many more; we will see some of them next week.

(Refer Slide Time: 31:19)

VAEs have also been extended in numerous ways over the last few years. While we may not have the scope to cover all of these in this course, here are a few pointers if you would like to know more. The original authors extended VAEs to get a semi-supervised variant that uses labelled and unlabelled data to learn the VAE.

Another popular extension, rather a powerful extension was conditional VAE, where you could
generate given a particular class label for instance. So, in the example that we have seen so far, in
the construction that we have seen so far, we considered it to be an unsupervised learning
problem. And just assumed only data was given and we were generating more data, you could
modify that to assume that label data was given to us.

And at inference, if you give a certain class label, the VAE would generate data points belonging
to that class label. Those issues are discussed in conditional VAE. Another variant known as
Importance-Weighted VAE weights the latent random variables differently to get more powerful
generations. Then there is a Denoising VAE, which you could consider as an extension of the
Denoising Auto Encoder to a VAE setting.

Then VAEs have also been used for graphics. A popular network is called the inverse graphics
network, which uses a VAE like strategy to learn latent variables that can give you control in
generating various kinds of objects, 3D objects, etc. You could control the light angle, the
rotation, the kind of object, so on and so forth using an inverse graphics network, which
effectively uses a VAE strategy. And finally, another kind of network called Adversarial Auto
Encoders, which brings together GANs and VAEs, will be a topic that we discuss immediately
next lecture.

(Refer Slide Time: 33:38)

Your homework for this time is to go through “Tutorial on Variational Auto Encoder”, as well as
go through this VAE example in PyTorch. It may also be worthwhile going through the original
paper on Variational Auto Encoders if you are interested. There are a couple of questions for you
to think about. Why does the encoder of a VAE map to a vector of means and standard
deviations?

Why does the encoder not map to a vector of means and a full covariance matrix? (This encoder gives you the approximate posterior.) A similar question concerns the decoder: if we assumed a mean squared error for the reconstruction loss (remember, the second term in your VAE objective can use a mean squared error as the reconstruction loss), what is the covariance of the 𝑝(𝑥 | 𝑧) Gaussian distribution? What do you think it will look like? Think about it and we will discuss it soon.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 65
Combining VAEs and GANs

(Refer Slide Time: 00:15)

Having discussed GANs and VAEs so far, let us now talk about methods that have attempted to combine GANs and VAEs in a single framework.

(Refer Slide Time: 00:31)

Before we go there, let us try to recall the questions we left behind in the last lecture. One of the
questions was, why does the encoder of a VAE map to a vector of means and standard
deviations? Why does it not map to a vector of means and an entire covariance matrix? I hope
you had a chance to think about it. In this case, by design, we are explicitly learning a set of
independent Gaussians.

That is the reason you only need a standard deviation per dimension of the Gaussian; we are not going to go with learning an entire covariance matrix. Technically speaking, it is possible to learn a complete covariance matrix, but that complicates how the VAE learns. Importantly, the diagonal approach works and is relatively easy to learn compared to a complete covariance matrix, and that is the reason we went ahead with that choice.

(Refer Slide Time: 01:45)

What about the decoder? If we assumed a Mean Squared Error for the reconstruction loss, if you
recall, VAE had two terms in its objective function: reconstruction loss, which is about
maximizing one of a conditional probability and then a KL divergence term. If we used a mean
square error for the reconstruction loss, what would be the covariance of 𝑝 (𝑥 | 𝑧), assuming it is
a Gaussian. If you thought carefully about this, this particular case of assuming a mean squared
error would be equivalent to modelling 𝑝 (𝑥 | 𝑧) as a Gaussian with identity covariance.

In that case, you only need to learn the means; the standard deviations are fixed to one. So, the decoder output would be the mean alone, and the identity covariance would be a given. In this case, the reconstruction loss would become − 𝑙𝑜𝑔 𝑝(𝑥𝑖 | 𝑡𝑖), where the index runs over, say, each dimension. If you expand the Gaussian formula in this particular case, you will notice that, because you have an identity matrix as the covariance matrix, minimizing this negative log-likelihood simplifies to minimizing the mean squared error: the normalization term becomes a constant that does not depend on the minimization, and since minus log and exponential are inverse operations, you are left only with the mean squared error term. So, using the mean squared error for the reconstruction loss in a VAE is equivalent to assuming that your distribution 𝑝(𝑥 | 𝑧) is a multivariate Gaussian with an identity covariance matrix.
covariance matrix.

(Refer Slide Time: 03:57)

Let us try to look at the positives and negatives of VAEs and GANs before we discuss methods
that combine them. In VAEs, the biggest positive is learning a very strong inference mechanism, mapping data to a latent space with a distribution of choice, with a fast and effective inference step. The negative, however, is that because of the use of the KL divergence, VAEs tend to distribute probability mass diffusely over the data space and may not cover it faithfully, which is one reason why VAEs result in blurry or low-quality image samples.

The other reason is that by sampling from a distribution, there is always an averaging effect that
could also result in blurry generations rather than having sharp image generations from the latent
space of a VAE. On the other hand, GANs do not have an inference step. You do not try to learn
a latent from data. Recall that for GANs, you just give a Gaussian vector this input without
worrying about whether that is the real latent manifold, which captures the data distribution.

You learn a generative model that produces high-quality samples at a good sampling speed. That
is the objective of a GAN. The negative is that GANs lack that inference mechanism, which prevents reasoning about data at an abstract level. For example, you cannot look at the latent variables and attach semantics to each of them: you may not be able to say that the first latent variable corresponds to identity, the second to expression or pose, and so on. That is difficult to do with a GAN, whereas with a VAE, that procedure
is implicit in its design.

(Refer Slide Time: 06:15)

Now, the question we ask is: can we combine a VAE and a GAN to get high-quality samples as well as an effective inference network that lets us reason at the level of latent variables? You see here the variational autoencoder: at training time, you have an encoder, you learn a latent space, which then feeds into a decoder; at test time, you sample from that latent space and pass the sample to the decoder.

A GAN has a generator, which competes with the discriminator; at test time, you provide random noise to the generator, and the generator can generate images. So, what we are going to see is whether these two pipelines can be combined in some way.

(Refer Slide Time: 07:14)

To do that, let us first discuss a few limitations of VAEs in more detail. Recall the VAE objective,
which is given by the conditional distribution, the log-likelihood, and a KL divergence term that
matches the approximate posterior with the prior. If you look at these terms, the first can be viewed as a reconstruction loss, and the second as a certain kind of prior loss.

The first term is implemented, perhaps, through a mean squared error, and the second term is implemented using the KL divergence itself. Suppose you look at mean squared error as a reconstruction loss: the mean squared error is inherently limited in its capabilities. Why is this so? Mean squared error is a pixel-wise L2 distance between two images.

The moment you do that, you are assuming that the reconstruction fidelity, or signal fidelity (in our case, the signal is the generated 2D image), is independent of spatial or temporal relationships across the pixels. This is not true for images: images have a lot of local spatial correlation. The element-wise metric of computing a mean squared error between pixels at the same location does not model human perception of image fidelity and quality.

We will see an example soon. This could also lower the image quality in VAEs. Finally, the same pixel-based loss metric, the mean squared error, does not respect semantics-preserving transforms. You could have two images containing the same object, but in one image it is rotated relative to the other; semantically, this is still the same image, but pixel-wise, this could result in a huge mean squared error.

(Refer Slide Time: 09:35)

A very nice study in this work known as “Mean Squared Error: Love it, or Leave it ?” shows a
few tangible examples. Here you see an image of Albert Einstein. So you have an image a here,
and you can see images b through i. If you took these images and observed the ones from b to g,
these images are in the first and second row. If you compare them to a, you can see that they
have significant differences.

You can see some images to be very sharp, some images to have a blur, a certain noise, a
significant blur, so on and so forth. But it happens that each of those images from b to g has
almost the same mean squared error with respect to the first image a. On the other hand, if you consider the images from h to i, they all look the same to the human eye, yet they have very large mean squared error values with respect to the original image. That says a lot about mean squared error as a metric for capturing the goodness of reconstruction.

(Refer Slide Time: 10:59)

In addition to this, VAEs have a couple of other problems that arise from using the KL divergence to match the approximate posterior to the prior on the latent variables z. Inherently, the KL divergence encourages 𝑞(𝑧), the approximate posterior in our case, to pick the modes of 𝑝(𝑧). So, if 𝑝(𝑧) were a distribution like the one shown, q tries to match p at the points of high density, because that is what gives a low KL divergence score between q and p; by doing that, q may not completely match the entire distribution of p.

(Refer Slide Time: 11:52)

That could leave spaces or holes in the learned latent space of z, which may result in failing to
capture the data manifold. It could also miss several local regions in the data space, affecting
generalization capability of generating examples out of a VAE. Lastly, even the prior considered
in VAEs could become a limitation. Remember that VAEs require you to assume a certain functional form for the prior, such as a unit Gaussian.

And sometimes, for different kinds of priors, VAEs may be difficult to optimize; you may not get a closed-form solution. In our case, because we assumed the approximate posterior and the prior to be Gaussian, the KL divergence term had a closed form: there is a closed-form expression for the KL divergence between two Gaussian distributions, which is differentiable, and that allowed us to use it for training the VAE.

That may not be true for other kinds of priors. And this limits us to choices of priors that can be
used in a VAE. How do you address these limitations of VAEs? That is what we will talk about
by integrating elements of GANs in a VAE to help improve its performance. We will talk about a
couple of seminal methods in this context in this lecture.

(Refer Slide Time: 13:34)

And one of the first efforts here is known as an Adversarial Autoencoder. This illustration on the
left gives the adversarial autoencoder. So on the top row, here is a standard variational
autoencoder. As you can see, x going to z, a latent variable is a sample from that latent space.
You have a decoder that gives you a reconstruction. The bottom part of an adversarial
autoencoder has a discriminator. The discriminator's job is not to look at images and say whether they are real or fake, but to look at latent vectors and decide whether a given latent was sampled from the desired prior distribution or is a latent code produced by the encoder while learning the VAE, much as a GAN's discriminator distinguishes real from generated samples. Why do we do this? This has a
very important meaning here. The goal here is to make the VAEs prior match the original prior of
the data distribution.

This is important because you are no longer asking the approximate posterior to match a unit Gaussian; you are asking it to match a known prior of your choice, which allows the VAE to be more powerful in its functioning. This yields a more continuous learned latent space, which allows you to capture the data manifold well. Through this process, we are converting the data distribution into the prior distribution.

And the decoder learns a deep generator model that maps that imposed prior to data distribution.
So instead of using a KL divergence between the approximate posterior and the unit Gaussian,
we now use an adversarial objective to match that approximate posterior to the prior of the data
generating distribution.

(Refer Slide Time: 15:57)

So, if you look at the original objective of the VAE, we had two terms: the reconstruction term and the KL regularizer. In an adversarial autoencoder, the KL regularizer is replaced by the adversarial loss of a discriminator that tries to classify a latent code as coming from the VAE's encoder or from the imposed prior distribution. In the reconstruction phase, you introduce a latent variable with a simple prior, sample z, and pass it through the generator. Remember that we need to introduce a mechanism to ensure that 𝑝𝐺 is 𝑝𝑑𝑎𝑡𝑎.

(Refer Slide Time: 16:45)

And in the adversarial autoencoder, this is done by matching the aggregated posterior, the one
from the variational autoencoder, to an arbitrary prior using adversarial objective-based training
that comes from your GANs discriminator objective.
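A minimal sketch of this regularization phase is given below: a small discriminator operates on latent codes, treating samples from the chosen prior as "real" and encoder outputs as "fake". The encoder, decoder, and reconstruction phase are assumed to be defined elsewhere, and the layer sizes here are arbitrary illustrative choices.

import torch
import torch.nn as nn

latent_dim = 8
# Discriminator on latent codes (not on images)
d_latent = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                         nn.Linear(64, 1), nn.Sigmoid())
bce = nn.BCELoss()

def aae_regularization_losses(z_from_encoder):
    """z_from_encoder: latent codes produced by the encoder for one mini-batch."""
    z_prior = torch.randn_like(z_from_encoder)      # samples from the imposed prior, here N(0, I)
    ones = torch.ones(z_from_encoder.size(0), 1)
    zeros = torch.zeros(z_from_encoder.size(0), 1)
    # Discriminator update: prior samples are "real", encoder codes are "fake"
    d_loss = bce(d_latent(z_prior), ones) + bce(d_latent(z_from_encoder.detach()), zeros)
    # Encoder (generator) update: fool the discriminator into calling its codes "real"
    g_loss = bce(d_latent(z_from_encoder), ones)
    return d_loss, g_loss

d_loss, g_loss = aae_regularization_losses(torch.randn(16, latent_dim))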

(Refer Slide Time: 17:07)

And by doing so, adversarial autoencoders give a very strong performance. Here is an example comparing an adversarial autoencoder against a variational autoencoder on the MNIST dataset. What you see on top is the case where a spherical 2D Gaussian prior is used, and at the bottom the prior is a mixture of 10 2D Gaussians. You can see from the top row that the adversarial autoencoder learns a more continuous latent space, whereas the VAE has many discontinuities in its latent space. If you look at the bottom row, the adversarial autoencoder learns a fairly smooth multimodal distribution, with all those modes along the different directions, whereas the VAE still struggles even in that setting.

(Refer Slide Time: 18:11)

This particular work for adversarial autoencoders also showed that you could also use more
complex priors if you choose to. Here is an example of where a latent space of an adversarial
autoencoder was trained on MNIST, with the prior being a Swiss roll distribution, as you see
here. So, you can sample from this distribution by walking along its axis: in this particular case, the samples were generated by walking along the Swiss roll axis, passing each point to the autoencoder's decoder and generating samples.

And you can see that you get a fairly good amount of variety in the generated MNIST samples by walking along such a prior.
autoencoders replaces the KL divergence term in the objective of a variational autoencoder with
an adversarial learning term. When we say adversarial learning, we mean the loss corresponding
to the discriminator calling an item fake or real.

And the nice part of this approach is there is no functional form of a prior required; whatever
prior is provided is what q tries to match.

(Refer Slide Time: 19:41)

A second popular method looks at the objective of a VAE from the other direction. While adversarial autoencoders replaced the KL divergence term with an adversarial objective, VAE-GANs replace the reconstruction loss with a different term. What do they replace it with? Instead of a pixel-wise mean squared error, VAE-GANs use a feature-wise distance, computed in the discriminator's representation space, between the VAE's outputs and the original data.

This approach combines the advantage of GAN as a high-quality generator model and VAE as a
method that can produce an encoding of data into a latent space and then further reasoning at the
latent space level.

(Refer Slide Time: 20:42)

So, the loss formulation in this particular case is based on representations of the discriminator. Remember, the discriminator is yet another neural network, so you can take a certain layer of the discriminator, denoted 𝐷𝑖𝑠𝑙, the l-th layer of the discriminator. Let the output of that l-th layer be 𝐷𝑖𝑠𝑙(𝑥); these are the feature representations that we are going to compare between the VAE's generated output and the real data input.

So, 𝑝(𝐷𝑖𝑠𝑙(𝑥) | 𝑧) is assumed to be a Gaussian distribution centred at 𝐷𝑖𝑠𝑙(x̃), where x̃ is the output of the decoder obtained using the VAE. The first loss term is 𝐿recon−content (carrying a superscript 𝐷𝑖𝑠𝑙 to indicate it is defined on those features); it plays the role of the first term of your VAE objective and is given by the expectation, over samples coming from the approximate posterior, of 𝑙𝑜𝑔 𝑝(𝐷𝑖𝑠𝑙(𝑥) | 𝑧), which is very close to the first term that we had in the VAE objective.

The second term, 𝐿recon−style (with superscript GAN), comes from a GAN-based objective. If the data comes from the real distribution, which is x, the discriminator tries to maximize 𝑙𝑜𝑔 𝐷𝑖𝑠(𝑥) and 𝑙𝑜𝑔 (1 − 𝐷𝑖𝑠(𝐺𝑒𝑛(𝑧))), where z is a latent that comes from the VAE's latent space. Finally, you have your prior loss, which tries to match the approximate posterior to 𝑝(𝑧).

Recall that the main difference now is that the first term is based on the features of the discriminator. So, the total loss is given by the addition of these three losses.
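A minimal sketch of the feature-wise ("recon-content") term is shown below; `dis_l` is a hypothetical stand-in for the first l layers of the discriminator rather than the paper's exact architecture, and under the Gaussian assumption on 𝑝(𝐷𝑖𝑠𝑙(𝑥) | 𝑧) the negative log-likelihood reduces to a mean squared error in that feature space.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical first few layers of the discriminator, used here as a feature extractor
dis_l = nn.Sequential(nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                      nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2))

def recon_content_loss(x_real, x_tilde):
    """Compare the real image and the VAE reconstruction in the discriminator's feature space."""
    return F.mse_loss(dis_l(x_tilde), dis_l(x_real))

x_real = torch.randn(2, 3, 64, 64)      # a mini-batch of real images
x_tilde = torch.randn(2, 3, 64, 64)     # stand-in for the VAE decoder's reconstruction
print(recon_content_loss(x_real, x_tilde))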

(Refer Slide Time: 23:03)

So the overall training algorithm for the VAE-GAN is given. You sample a mini-batch of
samples from your training dataset. You get an encoding of X using the encoder in a variational
autoencoder. You have your prior loss, which tries to match the approximate posterior obtained through the encoder part of your VAE with 𝑝(𝑍). X̂ is obtained through the decoder of the VAE when Z, the latent variable, is given as input.

Coming to the discriminator, we already saw that one of the loss terms is the negative expectation of 𝑙𝑜𝑔 𝑝(𝐷𝑖𝑠(𝑋) | 𝑍). There is one other component that VAE-GAN adds while training, which improves performance: you also sample from a unit normal prior, giving 𝑍𝑝, which is passed through the decoder to get a generation 𝑋𝑝. In addition to minimizing the likelihood of X̂ fooling the discriminator, the GAN loss also tries to minimize the likelihood of 𝑋𝑝 fooling the discriminator.

So, that is the loss of the GAN. Finally, the encoder, decoder and discriminator are updated using
the corresponding gradients that affect each output. So obviously, each of those networks only
uses the losses that are relevant for that particular network. So if you observe, you would notice
here that the discriminator loss should not try to minimize the reconstruction content loss, which
is the first term here, as that would collapse the discriminator to give a 0 at all times.

This is very similar to GANs, where we talked about training the discriminator fully initially. As
we just mentioned, the VAE-GAN model also allows using samples 𝑋𝑝, obtained from a unit normal prior, in addition to X̃, which is generated as the output of the VAE. Finally, the recon-content and recon-style losses, the first two losses, are weighted against each other to control a tradeoff between reconstruction quality and fooling the discriminator.

(Refer Slide Time: 26:00)

Here are some examples of results. You can see here that when you train a VAE on face images,
you see that there is no complete clarity while you get an overall sense of a face image. In
contrast, the centre of the image has a certain acceptable degree of the face. You can see this as
you go away to the periphery. The clarity keeps dropping down as you keep going away from the
centre. Again, you see that the VAE gives a fairly good performance of obtaining the sharpness
and clarity of the face images.

1415
While GANs also do a good job, GANs suffer from some artifacts that miss global information.
These are excellent at retaining global information but miss finding sharpness. GANs, on the
other hand, do have local sharpness but at times miss global content and can sometimes place
different parts in different locations. And VAE-GANs bring the best of both worlds together to
generate globally relevant content and keep each of the pixels and each of the local areas sharp in
terms of perception.

1416
(Refer Slide Time: 27:32)

Another thing that you can do with these kinds of models, which was shown with the VAE again,
is conditional generation. In this particular example, in VAE-GANs, the authors concatenated the
face attribute vector. So you could, for example, have these attributes such as the white, fully
visible forehead, mouth closed, male, curly hair, eyes open, pale skin, frowning, pointy nose,
teeth not visible, and no eyewear, for instance.

So this can be represented as an attribute vector. So you can put 0s and 1s for different attributes,
for instance, and that is appended to the vector representation of the input in the encoder-decoder
and discriminator modules while training and this trained model is used to generate faces, which
are conditioned on some held-out test attributes. So, a test time, a new face attribute vector,
which is held-out which was not used before, is concatenated to the input representation, and
now, the model can generate faces that satisfy these requirements in the attributes.

So all these images here are generations of a face conditioned on these attributes. And you see
that for most of them, certain attributes such as the fully visible forehead, white, male, so on and
so forth, are met to a reasonable extent, justifying this kind of an approach. Compared to VAE,
the VAE GAN gives significantly good results for such conditional generation experiments.

1417
(Refer Slide Time: 29:33)

Your homework for this lecture will be to go through this excellent link on “A wizard’s guide to
adversarial autoencoders, Part 1 and Part 2”, as well as an excellent tutorial on VAE-GANs and
a nice video on YouTube on VAE-GANs, if you are interested, go through them.

1418
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 66
Beyond VAEs and GANs:
Other Methods for Deep Generative Methods - 01

(Refer Slide Time: 00:15)

Although VAEs and GANs are the most popular kinds of Deep Generative models, other
methods have also been successful over the last few years. Let us see a few of them in the last
lecture for this week.

1419
(Refer Slide Time: 00:34)

These models are generally known broadly as Flow-based models. How do they differ from
GANs and VAEs? There is a significant difference between them and GANs and VAEs. Both
GANs and VAEs do not explicitly learn the probability density function of the real data. In the
case of GANs, we already saw that the density estimation is implicit. You do not explicitly
assign a probability density function and try to estimate it in a GAN.

In the case of VAEs, we get an approximate density estimation by optimizing your evidence
lower bound using variational inference. In both cases, you do not get the exact density function.
However, the exact PDF 𝑝(𝑥) of your real data may be useful for many tasks, such as missing
values, sampling data, or even identifying bias in data distributions. Knowing the density
function could be handy for these kinds of tasks.

So, the methods that we will discuss in this lecture are methods that estimate the real density of
the provided training data. These can be categorized into two different kinds: Normalizing Flows
and Autoregressive Methods or Autoregressive Flows.

1420
(Refer Slide Time: 02:13)

Let us start with normalizing flows. Recall that in the first lecture for this week, we tried to ask
this question, that if we had a dataset of N data points coming from an underlying distribution 𝑝𝐷

(x), we wanted to find a θ belonging to a family of distributions M in such a way that the
distance between 𝑝θ and 𝑝𝐷 is minimized by that choice of parameterization in M. We also noted

at that time, that if the distance here was replaced by KL-divergence, this turns out to become a
maximum likelihood estimation problem, or by minimizing the negative log-likelihood. Why is
this important?

All methods that we have covered so far while training neural networks can be looked at to
minimise negative log-likelihood. Whether training a neural network using mean squared error
or training a neural network or an LSTM using cross-entropy loss for a classification problem,
one can show that both of these finally amounts to minimizing the negative log-likelihood.

So we are now saying that we could use a similar approach even in an unsupervised learning
setting. Where do we go from here?

1421
(Refer Slide Time: 03:54)

So what Flow-Based models trying to do here is if you could look at GAN as a discriminator and
the generator, where the discriminator is the one that distinguishes between real and fake data.
𝑃
The generator takes you from a latent to a generated image 𝑥 , which is then provided as input to
the discriminator along with the real data x. Similarly, VAE is an encoder-decoder model, where
the encoder captures the approximate posterior q parameterized by parameters ϕ, which are the
weights of the encoder.

Similarly, the decoder captures 𝑝θ(𝑥 | 𝑧), where θ are the parameters of the decoder. So, in this

case, it is an approximate density estimation, as we just mentioned. In flow-based generative


models, we do what should have been obvious but not very simple. Given x, find the function f
𝑃
to get us a latent representation, which if we invert, we get back 𝑥 or the reconstruction.

So, the challenge here is how do you find these functions f, which are exactly invertible to get
back your original data? That is the challenge.

1422
(Refer Slide Time: 05:27)

So, the first class of Normalizing Flows methods is to identify a transformation f, which takes
you from Z to X.

(Refer Slide Time: 05:43)

If you recall, Z is once again the latent very similar to VAE. The main difference between VAE
and Flow-based models is in VAE, the encoder captures an approximate posterior with the
different parameterization, and the decoder captures 𝑝θ itself. So, we know that 𝑞ϕ is an

1423
approximation of the true posterior 𝑝θ(𝑧 | 𝑥), whereas, in flow-based models, the functions are

an exact inverse.

(Refer Slide Time: 06:16)

So here, we identify a transformation f that goes from Z to X. The transformation 𝑓 is a series of


differentiable bijective functions (𝑓1, 𝑓2,.... 𝑓𝐾) in such a way that x can be written as 𝑓(𝑧) ,

which in turn can be written as 𝑓𝐾 , composition, 𝑓𝐾−1 , composition, so on and so forth until
−1 −1
𝑓1(𝑧). Conversely, z can be written as 𝑓 (𝑥), which in turn can be written as 𝑓1 composition,
−1 −1
𝑓2 , so on and so forth, till the final composition, 𝑓𝐾 (𝑥).

Diagrammatically given a latent variable z, which could be a Gaussian, very similar to GANs,
you pass this vector input vector sampled from z through a neural network. At this point, we will
just call it a function, which outputs 𝑧1, which goes through 𝑓2, and so on, till 𝑓𝐾, which we

finally expect to output the original data distribution that x comes from.

1424
(Refer Slide Time: 07:43)

Let us see this in more detail before going to the implementation. For any invertible function f
from Z to X, using change of variables of probability density functions, we can write 𝑝𝑋(𝑥) as
−1
𝑝𝑍(𝑧) into the determinant of ∂𝑓 (𝑥)/∂𝑥. Why? It is not difficult to show it; if z belongs to

distribution, π(𝑧)is a random variable, and x is equal to 𝑓(𝑧) such that f is invertible.

−1
So, z would be given as 𝑓 (𝑥). From the definition of the probability distribution, we have
integral 𝑝(𝑥)𝑑𝑥is equal to 1. It would also be integral π(𝑧)𝑑𝑧 is equal to 1. So from this equality,
we can say that 𝑝(𝑥) can be given as π(𝑧) into the gradient of 𝑑𝑧 / 𝑑𝑥, the absolute value. But
−1 −1
we already know that z is 𝑓 (𝑥), so we will put that here. We would have π (𝑓 (𝑥)) into the
−1 −1 −1 '
gradient of 𝑓 (𝑥)/𝑑𝑥 , which we are writing as to π(𝑓 (𝑥)) into (𝑓 ) (𝑥 ) where prime here
stands for the gradient.

−1
In vector form, this would be 𝑝(𝑥) is equal to π(𝑧) into a determinant of 𝑑𝑧 / 𝑑𝑥, or π(𝑓 (𝑥))
−1
into determinant, 𝑑𝑓 (𝑥)/𝑑𝑥, which is what we wrote here in the first place.

1425
(Refer Slide Time: 09:42)

If we took the same expression here, and if we expanded it and applied log on both sides, we
would then get 𝑙𝑜𝑔 𝑝𝑋(𝑥), which is the log-likelihood of the density that we are looking for

𝑝𝑋(𝑥) would now become the log-likelihood of z plus summation going from 𝑖 equal to 1 to 𝐾,
−1
log of the determinant of ∂𝑓𝑖 / ∂𝑧𝑖. How did this come? This came by substituting 𝑓 with a

composition of functions, 𝑓1 through 𝑓𝐾, the composition while taking log becomes a

summation. That is how we got this expression here.

So the intuition of these two terms in estimating the log-likelihood of the probability density
function is the first term here that can be looked at as the transformation moulds the density
𝑝𝑍(𝑧) into 𝑝𝑋(𝑥), that is what it would like to do. The second term here quantifies the relative

change of volume of a small neighbourhood 𝑑𝑧 around 𝑧. Remember that is what gradient


measures. That is what determinant would also measure. You can look at it as capturing the
volume of the space that you are trying to measure.

1426
(Refer Slide Time: 11:18)

Here is an illustration to understand how normalizing flows work. Assuming that you sample
from a Gaussian initially, which is 𝑧𝑜 you apply a function 𝑓1 get 𝑧1, then you would apply

another function, so on and so forth, till you get 𝑧𝑖−1, 𝑧𝑖. Finally, you keep applying many

functions until you get 𝑧𝐾 when the probability density function transforms this way, which is the

density function of x that we are looking at.

(Refer Slide Time: 11:52)

1427
Why did we say we want a differentiable bijective function? Why should each of the f's be a
differentiable bijective function? It should be fairly straightforward. We already saw how the
inverse was being used. But let us define it more formally. Such a bijective function, the way we
are using it, is called a Diffeomorphism. Diffeomorphic functions are composable, which means
given two transformations 𝑓1 and 𝑓2, the composition is also invertible and differentiable, which

is very important when working with neural networks. So any complex transformation from a
Gaussian to a complex probability density function of the real-world data can be modelled by
composing multiple instances of simple transformations.

What do we need for normalizing flows? We want the transformation function to be


differentiable, so that should give us the answer. If each of those functions that take you from z
to x is a layer of a neural network, or an LSTM, or a set of layers of a neural network, they are
differentiable, and we have met the first prerequisite. The second is that the function must be
easily invertible.

How do we do this? We will see in a moment, and the last is the determinant of the Jacobian
should be easy to compute, why because that is 1 of the terms in your loss function, and you
want that to be easy to compute so that you can train the entire network using gradient descent.

(Refer Slide Time: 13:46)

1428
Let us see one of the earliest efforts in implementing Normalizing flows. This is known as NICE
Non-Linear Independent Components Estimation. This work was published in 2015 by Lauren
Dinh et al. The idea here is to introduce known as Reversible Coupling layers. Let us see what
those transformations are. So the coupling layer operation used was 𝑦1 was the same as 𝑥1. So if

𝑥1, 𝑥2 are the different dimensions of the data x. 𝑦1, 𝑦2 so on and so forth, are the different

dimensions that you're trying to generate, which you would like to match x in principle.

So 𝑦1 is equal to 𝑥1, 𝑦2 is given by some function 𝑔(𝑥2; 𝑚(𝑥1)). So you can see here pictorially

𝑦2 is given by function g, which is applied on 𝑥2 and 𝑚(𝑥1). Note that in this particular

formulation, 𝑦1 does not depend on 𝑥2 , but 𝑦2 depends on 𝑥1. In this case, the Jacobean will be a

lower triangular matrix. Why do you say so? Because 𝑦1 does not depend on 𝑥2. So, all these

upper triangular elements of the Jacobean matrix would become 0’s.

Let us take a moment to recall a Jacobean matrix. A Jacobean matrix is the matrix of all partial
derivatives. So, if you had an output vector y, which contains 𝑦1 to 𝑦𝐾, and if you had an input

vector x, which was 𝑥1, to 𝑥𝑑, all the pairwise gradients, ∂𝑦1/ ∂𝑥1 , ∂𝑦2/ ∂𝑥1 , ∂𝑦3/ ∂𝑥1, so on

and so forth, till ∂𝑦𝐾/ ∂𝑥1.

And similarly, ∂𝑦1/ ∂𝑥2 , ∂𝑦1/ ∂𝑥3 so on and so forth will form the rows and the columns of the

Jacobean matrix. So, it is a matrix of first partial derivatives between a vector and a vector. So, in
this case, you can see that all the upper triangular elements, because of the construction of the
operation, 𝑦2 would depend on 𝑥1, but 𝑦1 would not depend on 𝑥2 𝑥3 or anything till 𝑥𝑑.

Similarly, 𝑦2 would depend on 𝑥1 and 𝑥2, but not on 𝑥3, to 𝑥𝑑, which means all those upper

triangle elements here would become 0.

You will be left with a lower triangular matrix, and the determinant of a lower triangular matrix
is simply the product of the diagonal elements. So, the determinant becomes easy to compute in
this construction. What would the inverse mappings be? The inverse mappings would be 𝑥1

1429
−1
would be. Of course, 𝑥2 would be 𝑔 (𝑦2) and which would be the inverse operation in this

particular case.

1430
(Refer Slide Time: 17:27)

In case you would like all data to be considered, so, in the previous construction,

(Refer Slide Time: 17:33)

We considered that 𝑦2 could depend on 𝑥1, but 𝑦1 cannot depend on 𝑥2.

(Refer Slide Time: 17:42)

1431
If we want all the output elements to depend on all the input elements, it can be done, but you
may have to flip the inputs after each layer. So that way, in the next iteration, next function,
remember it is a series of layers or composition of functions, you can have 𝑦1 dependent on 𝑥2,

and you can continue this process to ensure that all the output variables depend on all the input
variables.

(Refer Slide Time: 18:13)

So, you can now write this as Additive Coupling Operations also. In additive coupling
operations, you can say that 𝑦1 is equal to 𝑥1 and 𝑦2 is equal to 𝑥2 plus 𝑚(𝑥1). So, in this

1432
particular case, it is not a function. Still, you just have 𝑦2 is equal to 𝑥2 plus 𝑚(𝑥1). The inverse

operation here would be 𝑥1 is equal to 𝑦1 itself, and 𝑥2 is equal to 𝑦2 minus 𝑚(𝑦1). In such

construction in an additive coupling layer in the previous case, you had a g function here. Still, in
the additive coupling layer, the Jacobian determinant will always be 1.

Why do you say so? You can look at the two equations and give the answer; because it is
additive coupling, you would have ∂𝑦1/∂𝑥1 will be 1, ∂𝑦2/∂𝑥2 will also be one because the

second term here is additive and does not depend on 𝑥2. So, all those diagonal elements of

Jacobian, what are the diagonal elements in your Jacobian matrix? You would have ∂𝑦1/∂𝑥1,

∂𝑦2/∂𝑥2, so on and so forth till ∂𝑦𝐾/∂𝑥𝐾 . For the moment, let us assume both are the same

dimensions.

If you look at these equations, all these values will be 1 product of the diagonal elements will be
1. The Jacobian determinant will be one; this is called a volume-preserving operation.

(Refer Slide Time: 20:05)

Remember, we said, the Jacobian determinant is an estimate of how much the volume changes.
For example, if you have a random variable x, which has a certain density, say between 0 and 1,
in this case, a uniform density.

1433
If you add another random variable, say z, which goes from say 0 to 3. If its volume was like
this, you can see that the gradient would give you an answer to be 3. So for every 1 unit that you
move in x, you will move three units in z. So this tells you that the volume between x and the
PDF of z triples. The determinant of the gradient, ∂𝑧 / ∂𝑥 , also gives you the answer 3, which
intuitively tells you how much is the volume changing in the new PDF.

So, when the Jacobian determinant is always 1, you have an additive coupling layer. It is a
volume-preserving operation. So, in this case, the log-likelihood becomes very simple. You
simply have the determinant of the Jacobian term is always 1. So that term would disappear, and
you would have the log-likelihood of 𝑝𝑋 be the log-likelihood of 𝑝𝑌 itself.

(Refer Slide Time: 21:31)

An improvement over NICE was another normalizing flow method, a popular one known as Real
NVP, Real-Valued Non-Volume Preserving Normalizing Flow. As the name states, it is
non-volume preserving. So we have to do something beyond additive coupling. What do we do?
We have an affine coupling operation. What does the affine coupling operation do? We say 𝑦1

equals 𝑥1, and 𝑦2 equals 𝑥2 into this Hadamard product, which says it is an element-wise

product, so 𝑥2 is a vector.

1434
And the second term here is also a vector, and it is an element-wise product into an exponent of
𝑠(𝑥1) plus 𝑡(𝑥1). So there is a translation component, and there is a scale component, which

together becomes an affine transformation. The Jacobian of such a transformation would be, it
would still be a lower triangular matrix because 𝑦1 does not depend on 𝑥2, 𝑥3, so on and so forth.

Similarly, 𝑦2 does not depend on 𝑥3, 𝑥4, and so forth.

So the Jacobian will have the upper triangular entries to be 0. You would then have in the lower
triangle entries, ∂𝑦2/∂𝑥1. The diagonal entry, in this case, would be 𝑑𝑖𝑎𝑔(𝑒𝑥𝑝[𝑠(𝑥1)]) because

that is what the differentiation of ∂𝑦2/∂𝑥2 will be. In this case, the Jacobian need not be one, and

hence, the transformation may not be volume-preserving, which perhaps is more likely in
real-world data.

For example, if we gave z, the random variable that we considered for normalizing flow to be a
unit Gaussian, expecting that any real-world data distribution will also have the same volume as
that unit Gaussian may not be a correct assumption. So in that sense, real NVP is more realistic.
The inverse operation, what would it be here, 𝑥1 would be 𝑦1, equality, 𝑥2 would be 𝑦2 minus

𝑡(𝑦1) into an exponent of minus 𝑠(𝑦1).

Remember this exponential function. When you go to the other side, it becomes an inverse
exponential function, which this term denotes.

(Refer Slide Time: 24:11)

1435
Figuratively speaking, this is how a real NVP affine coupling would look like. So you have your
𝑦2 depends on 𝑥1, 𝑥2, and 𝑦1 depends only on 𝑥1. How does 𝑦2 depend on 𝑥1 and 𝑥2? A

translation and a scale component. The inverse looks something like in subfigure b. 𝑥2 would

depend on a scale and a translation component from 𝑦1and the contribution from 𝑦2. So how are

all of these networks train?

So the rest of it should work like any other neural network, so it is about maximizing the
likelihood. You can use standard error functions like mean squared error, cross-entropy to learn
the networks depending on what you are trying to reconstruct the output of each of these
networks. And each layer corresponds to one of these functions that you are trying to model.

(Refer Slide Time: 25:15)

1436
For example, 𝑦1 is equal to 𝑥1 , would be one of the layers of the neural network. The second

layer would be this, where the scale and the translation are learned by the neural network, so on
and so forth. That is how the network learns.

1437
(Refer Slide Time: 25:29)

(Refer Slide Time: 25:29)

1438
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 67
Beyond VAEs and GANs:
Other Methods for Deep Generative Methods - 02

(Refer Slide Time: 00:12)

The second kind of Deep Generative model that we will discuss in this lecture is Autoregressive
flows. Autoregressive flows come, as the name suggests, from Autoregression. Autoregressive
models are used in time series data in standard machine learning to look at the past n steps of a
time series data and predict the next value. For example, if you look at the stock price for the last
ten days, can you predict the stock price on the 11th day? It can be solved by using
autoregressive models such as ARMA, ARIMA NARIMA, so on and so forth.

These are popular time series models for this scenario. What we are talking about here is an
extension of that thought process. However, now, we want to generate data and not just predict
an outcome. Once again, this is a transition from supervised learning to unsupervised learning.
So let us see how to use Autoregression for a generation. So we decompose the task of finding
the PDF of the real-world data as 𝑝(𝑥!, 𝑥2,.... 𝑥𝑛), which is the probability density function of the

1439
real data given to us. It is written as 𝑝(𝑥!) into conditional 𝑝(𝑥2 | 𝑥1), and so on, till

𝑝(𝑥𝑛 | 𝑥1,...., 𝑥𝑛−1).

So, while calculating a conditional probability at a certain level, the model can only see inputs
occurring prior to it. Let us see how this would work. So, if you had 𝑥1, to 𝑥𝑛, as different

dimensions have an input, 𝑥1 is used as input to a second network 𝑥2, predicting the probability

of 𝑥2 given 𝑥1. Remember that probability, depending on the values 𝑥2 can take, can be learned

using a mean squared error or cross-entropy or any other loss that we have seen so far. 𝑥1 and 𝑥2

contribute to the third network to learn the probability of 𝑥3 given 𝑥1, 𝑥2, so on and so forth. For

a later 𝑥𝑖 given 𝑥1 to 𝑥𝑖−1, and finally, 𝑥𝑛 given 𝑥1 to 𝑥𝑛−1.

So, you could consider that autoregressive flows are a special case of normalizing flows, where
each intermediate transformation masks certain inputs. You are not going to look at all inputs in
every step. But in each, if you assume that network 1 through network n were different functions,
similar to normalizing flows. Each of these networks looks at only a certain part of the input and
not the complete input. You could look at it that way, although that is not exactly the way it is
implemented.

(Refer Slide Time: 03:30)

1440
So, one of the popular methods for autoregressive models is called NADE. NADE stands for
Neural Autoregressive Distribution Estimation. Here is how NADE works. You have inputs 𝑥1 to

𝑥𝑛 in dimensions of every input vector. Similarly, you have 𝑝(𝑥1), 𝑝(𝑥2 | 𝑥<2) - the probability of

𝑥2given all the random variables less than 2, 𝑝(𝑥3 | 𝑥<3), so on and so forth till 𝑝(𝑥𝑛 | 𝑥<𝑛).

We are trying to understand the probability density function of the real-world data, which is a
product of all of these. So, we provide all of these inputs to a shared neural network with as
many layers as you choose. The output of that neural network is then passed on to a layer of
hidden representations obtained over masked inputs. That is ℎ𝑛, which is one of the nodes here,

depends on all the x's, which is less than n.

So, here is how it is done. So, 𝑥1 will only be considered to get ℎ2, 𝑥1 and 𝑥2 will be considered

to get ℎ3, and 𝑥1, 𝑥2 𝑥3 and so on, all the way till 𝑥𝑛−1 will be considered to get ℎ𝑛. So one would

have to mask the inputs to ensure that the corresponding inputs reach the corresponding nodes in
the hidden layer.

(Refer Slide Time: 05:17)

What happens after this? Each of these ℎ1'𝑠 to ℎ𝑛'𝑠 are passed through different networks to get

𝑝(𝑥1), 𝑝(𝑥2 | 𝑥<2), etc.

1441
(Refer Slide Time: 05:31)

What happens at inference time? How do you generate? So once you have trained such a
network, remember in GANs, we knew how you could generate an image after training a GAN,
you would just sample a vector from a Gaussian, send it through a generator and get an image.
How do you do this with NADE? It is similar. We are going to assume that ℎ1 is equivalent to z

in sample generation. You could now assume that ℎ1 comes from a Gaussian; you sample a value

from there, send it through the first network, you would get a specific 𝑥1.

(Refer Slide Time: 06:09)

1442
That 𝑥1 is now.

(Refer Slide Time: 06:11)

fed to get ℎ2

1443
(Refer Slide Time: 06:16)

That is used to get 𝑝(𝑥2), then 𝑥1 and 𝑥2 together is fed to get ℎ3, which is used to get 𝑝(𝑥3),

then fed in to get. You continue this process until the last network predicts 𝑝(𝑥𝑛 | 𝑥<𝑛). You can

keep repeating this until the complete data sample is generated.

(Refer Slide Time: 06:43)

A variant of NADE, which is an improvement over NADE, is called MADE. MADE stands for
Masked Auto Encoder for Distribution Estimation, which improves upon the idea of NADE in a
different way. In this case, we use an autoencoder to achieve the same effect as NADE. So you

1444
have 𝑥1, 𝑥2, 𝑥3 𝑥4, you have 𝑝(𝑥1), 𝑝(𝑥2 | 𝑥<2), 𝑝(𝑥3 | 𝑥<3), so on and so forth till the last

random variable.

So how do we adapt NADE to this kind of architecture? There is an interesting methodology for
this. Firstly, you give an order to each of your input nodes and your output nodes. In this case,
we will keep it simple and say 1234 and 1234.

(Refer Slide Time: 07:48)

Once you do this, you also give an ordering for each hidden layers nodes. However, each hidden
layers nodes should get an ordering less than ‘n’. So, for example, in this hidden layer, n is 4. So
this means each node should get a number 1, 2 or 3. It has to be less than 4. Similarly, in the next
hidden layer, also each node gets a random number between 1, 2, and 3. It has to be less than 4.
Why do we do this? What do we do with this? Let us see that now.

1445
(Refer Slide Time: 08:22)

So, now, we do retain weights that connect node number ‘i’ to ‘j’, such that𝑖 ≤ 𝑗for all hidden
layers. What do we mean here? If you have a hidden node 1, it should be connected to an input
node less than or equal to 1, which means only 1 will connect to 1. So, the second node is 3. It
will be connected by 1, 2 and 3. Similarly, the hidden node labelled with 2 will be connected
with 1 and 1 and 2, so on and so forth.

(Refer Slide Time: 09:07)

1446
So we retain only those weights and discard all other weights of those layers. We also do this for
the last layer, where we only retain weights that connect node ‘i’ in output 𝑝(𝑥𝑖 | 𝑥<𝑗) such that

‘i’ is less than ‘j’. So, if you have 2, you only connect that with 1. Similarly, 3 is connected with
all 1s and 2s, so on and so forth.

(Refer Slide Time: 09:38)

And once again, you eliminate all other weights. What did we achieve through this process?

(Refer Slide Time: 09:46)

1447
We ensured that if you had 𝑝 (𝑥𝑖 | 𝑥<𝑖), all the other random variables less than that value ‘i’

depends only on inputs less than 𝑥𝑖. So this procedure ensured that if you had 𝑝(𝑥2 | 𝑥<2). You

notice the blue arrows that it is connected to only nodes labelled 1. It is not connected to even a
node with label 2, which was our goal in the first place, ensuring that only input 𝑥1 influences

𝑝(𝑥2), given all the other random variables less than 2.

(Refer Slide Time: 10:32)

Similarly, you can say it for 𝑝(𝑥3 | 𝑥<3), which depends only on all nodes labelled 1 and 2.

1448
(Refer Slide Time: 10:38)

And for 𝑝(𝑥4 | 𝑥<4), which depends on nodes labelled 1, 2, and 3.

(Refer Slide Time: 10:45)

In the end, the autoencoder has only a few weights saved, and it appears when you implement
that you have a full autoencoder. Still, you use a mask only to retain a certain set of weights and
discard the other set of weights while forward propagating or doing backpropagation. That gives
you MADE, and through this process. At the same time, you train this network. You are

1449
automatically learning each of these functions, your 𝑝(𝑥1 | 𝑥<1), 𝑝(𝑥2 | 𝑥<2), and so on, which

together gives us the real density function of the given data.

(Refer Slide Time: 11:29)

Another class of models for Autoregression are known as PixelCNNs. These were developed in
NeurIPS 2016. Given an image, our overall idea would be to send it through a CNN without
disturbing the shape and get a pixel-wise softmax distribution to generate the same data. But the
way we are going to do it is to introduce a pixel masking filter to say which pixels should be
used to predict the value at a specific pixel in the output.

For example, to predict a specific pixel here, you may not need all the pixels to predict the pixel
value at one specific location. You may only need a certain set of pixels in that neighbourhood
around the pixel you are predicting. That is the idea for pixel CNNs. It is very fast to compute.
As you can see, it is not a very complex procedure. However, pixel CNNs may not make use of
the full context.

For example, pixel masking is done to predict a particular pixel. It may look at a set of pixel
values that occurred before its presence. So, in this case, if you are trying to predict this red pixel
here, you may be using all these blue pixels, which is a sense of a local context. But it is also
missing the other pixels in this region, which also form a local context. So it could lose some
information, while generation, although it is extremely fast to compute.

1450
(Refer Slide Time: 13:18)

A variant of this by the same authors, which came in ICML 2016, is known as pixelRNNs and
pixelRNNs have a very similar idea. However, the generation is done using LSTMs instead of
CNNs. It is an autoregressive model, where images are generated pixel by pixel. Each pixel
depends on previous pixels based on a directed graph. So you have the overall joint distribution
2
𝑝(𝑥) given by, i is equal to 1 to 𝑛 , assuming an 𝑛 * 𝑛 image, probability of 𝑥𝑖 given all the other

pixels until that particular pixel.

So you can see here that for any specific pixel, you look at all the other pixels that came before it
to influence the generation of the value at that pixel for any specific pixel. So these dependencies
between pixels are modelled using LSTMs. That is why the name pixelRNN. The learning is
very similar to other models that we have seen in this lecture, maximising the likelihood of using
gradient descent.

The likelihood is tractable because we are using an LSTM. We know how it works. We just have
to structure it accordingly. An LSTM is trained the same way that we saw LSTMs earlier. The
image generation in pixelRNNs can be slow because you have to use an LSTM to generate each
pixels value. Unlike GANs, where the entire image is generated in one shot.

1451
PixelRNNs can also be considered as an example of a fully visible model. There are no latent
variables. pixelCNN is also an example of that. For that matter, most Autoregressive models turn
out to be fully visible models, where you do not model any latents per se as part of the method.

(Refer Slide Time: 15:31)

Let us look at how the method works. The PixelRNN has two variants; one is known as a Row
LSTM, where each LSTM generates an entire row of pixels at a time.

(Refer Slide Time: 15:52)

1452
So when you have to generate a specific row ‘t’, let us see how to generate one specific value in
that row ‘t’, which would be the output of a one-time step of the LSTM to generate that
particular row. Where should the LSTM look? How should the LSTM operate to generate a
𝑖𝑠
particular pixel ‘i’? We consider the set of pixels before that pixel in that row to be 𝐾 and the
𝑠𝑠
pixels just above that row in the immediate neighbourhood of ‘i’ to be 𝐾 .

They both become hidden state contexts, very similar to LSTM to generate the current pixel. So
you would have the hidden states and cell states to be given by some weight
𝑠𝑠 𝑖𝑠
𝐾 * ℎ𝑖−1 + 𝐾 * 𝑥𝑖 , a sigmoid that will give you all your different gates. The rest of it is very
𝑖𝑠 𝑠𝑠
similar to how an LSTM operates. So your inputs are based on you 𝐾 and 𝐾 . Those are the
values you get to generate ‘i’.

One could look at this as if you are generating a particular pixel, say the red pixel. You get the
context from the previous pixel in the same row and the top 3 pixels in the previous row.
However, those top 3 pixels received inputs from the 5 pixels in the previous row. Hence, you
could look at this red pixel that you generate to have a triangular context in its generation.

When you look at this model, it will be far slower than pixel CNNs because of its approach to
generation.

(Refer Slide Time: 17:52)

1453
Another variant that was proposed in PixelRNNs is known as the Diagonal BiLSTM. In
Diagonal BiLSTM, the approach is similar to what is known as a Bidirectional RNN. A
Bidirectional RNN is an RNN where, if you had your traditional RNN to be this way, let us
assume three timesteps. A bidirectional RNN is where your inputs are 𝑥0, 𝑥1 𝑥2 , and the outputs

are 𝑦0, 𝑦1, 𝑦2. We know from a standard Vanilla RNN that 𝑥0 influences 𝑦0, 𝑥0 and 𝑥1 influence

𝑦1, 𝑥0, 𝑥1, and 𝑥2 influence 𝑦2.

That is your standard RNN. In a bidirectional RNN, you also go the reverse way. All these
arrows can also operate reversely, which means you can reverse the entire RNN. You can now
say 𝑦2 is generated based only on 𝑥2, 𝑦1is generated based on 𝑥1and 𝑥2 , and 𝑦0 is based on 𝑥0,

𝑥1, and 𝑥2 altogether.

How do you learn such an RNN? You will learn the weights in both directions. You would get
two different sets of outputs 𝑦0, 𝑦1, and 𝑦2. You can average those outputs, which may give you

a better sense of what your output should be, especially when the direction in the sequence may
not matter when there could be context from both directions. So that is the idea used in the
diagonal BiLSTM variant of Pixel RNNs.

(Refer Slide Time: 19:46)

1454
In this case, the image is filled diagonal-wise. So you can see here that each diagonal is filled at a
time and when you fill one pixel at that particular location.

1455
(Refer Slide Time: 19:57)

(Refer Slide Time: 20:00)

Those are the values that are given us context to fill that ‘i’. In turn, these 2 pixels are denoted in
blue squares dependent on the previous pixels, so all of them would eventually influence the
generation of the value at this pixel ‘i’.

1456
(Refer Slide Time: 20:20)

We now repeat this process for the diagonal from the other side, very similar to a Bidirectional
RNN. So now, let us see if you had to generate this particular pixel ‘i’ here. What are the pixels
We would consider?

(Refer Slide Time: 20:35)

You would say the immediate neighbours, which means the pixel on top and the pixel to the
right.

1457
(Refer Slide Time: 20:42)

Unfortunately, you cannot use the pixel to the right. It has not been generated yet, according to
rasterization of the image. When we say rasterization, we mean this exact pattern involved in
generating an image, first row, then the second row, third row, and so on. So you cannot use this
pixel. So how do we overcome this?

(Refer Slide Time: 21:06)

So in a Diagonal BiLSTM approach, when you come from the other diagonal, you use these top
2 pixels as the context to generate this particular pixel ‘i’. That is the only change that occurs.

1458
(Refer Slide Time: 21:20)

You now combine the context you get from both of these and then get your generation at the
value ‘i’, which we talked about for a Bidirectional RNN. So now you can look at this pixel, the
red pixel in the middle, using the complete context from both directions. So to generate the red
pixel, you consider all the dark blue pixels that come from 1st diagonal and these light blue
pixels that you populate from the other direction, which give us the complete context. Once
again, here, you can see that these would be slower than pixel CNNs.

1459
(Refer Slide Time: 22:07)

So your homework for this lecture is a very nice blog on “Flow-Based Deep Generative models”
by Lilian Weng. Please do read it. If you are interested in knowing each of these methods, please
read the respective papers linked here. There are also newer methods, such as Glow for
Normalizing flows, which are also covered in this blog if you would like to know more.

(Refer Slide Time: 22:32)

And here are the references.

1460
Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
GAN Improvements
(Refer Slide Time: 00:14)

Having seen various kinds of Deep Generative Models like GANs, VAEs, Normalizing flows,
Autoregressive flows last week. We will now move on to the various ways in which GANs have
been improved over the last few years. We will talk a bit of disentangling of VAEs, and then
move on to applications of GANs to Images and Videos. Let us start with a few improvements
over Vanilla GANs.

1461
(Refer Slide Time: 00:50)

The first method that we will talk about is known as Stack GAN. This work was published in
ICCV 2017. The goal of this work was to generate reasonably good resolution 256 * 256
photorealistic images conditioned on Text Descriptions. So here is a high-level flow. The entire
GAN model is conditioned on a Text Description which could be a caption, for instance.
Standard NLP methods such as Word2Vec, GloVe, BERT, etc., are used to get an embedding of
this Text Description, which is provided as an input to the GAN.

With this input in the first stage, the GAN generates 64 * 64 images, but perhaps a few
high-level details or low-frequency details. These images are then provided as an input to the
next stage of the Stack GAN, which generates 256 * 256 reasonably high detail photorealistic
images and both these stages are conditioned on the same text input.

1462
(Refer Slide Time: 02:20)

Let us see the entire architecture now. So you have the Text description, which is given as input
to the entire Stack GAN. In this example, the text description reads, “This bird is grey with white
on its chest and has a very short beak”. So you obtain an embedding of this text, as I just
mentioned, using, say, Word2Vec or GloVe or BERT or any other embedding of your choice and
this embedding is added to the standard Gaussian noise is given to a GAN.

So, you can see that you get a mean and standard deviation from this embedding and this mean
and standard deviation is used to obtain a sample. The obtained sample is concatenated with a
sample from the standard normal, which becomes the entire input to the first stage generator. So
the first stage generator then goes through an Upsampling Generator Module, which generates a
64 * 64image that you see here.

The generated image and a set of real images form a minibatch. This minibatch is provided to the
discriminator, which downsamples these images. And before the final step of classifying this
input as real or fake, you also include the embedding phi t, again here. So you see here that an
arrow comes all the way and gets combined to this output of the Downsampling Module.

And the job of this Discriminator now is to classify this tuple of a generated image whose
representation we are considering, which is this blue block here, and the text embedding together
to be real or fake. This forms the first stage. In the second stage, the results of the first stage are

1463
given as input along with the embedding of the text again; both of these then go through the
second generator stage.

So, you can see here that the image goes through a Downsampling, then the Text embedding is
then concatenated to the Downsample representation. This then goes through a set of residual
blocks, which is then Upsampled to get the final output 256 * 256 image, which is what you see
here.

Now, that goes to the second discriminator, where you give the generated 256 * 256 images and
one of the real 256 * 256 images. Once again, the output of the Downsample representation is
concatenated with the text embedding, and the Discriminator has to classify each such tuple as
real or fake.

(Refer Slide Time: 05:33)

Now, let us understand how to, how the Discriminator can look at the tuple and classify Real or
Fake. So, there are three kinds of Scores the Discriminator has to obtain/provide in this case. So,
the first one is when the real image is provided with the correct text. Obviously, the discriminator
would want the score of such an input to be one, because both are correct. There is a real image
and the corresponding correct text. There is also another setting where you have a real image and
an incorrect text.

1464
In this case, the discriminator should ideally give a low score. Similarly, you have a fake image
with the correct text, which the discriminator should give a low score, but the generator should
try to increase this particular score. These are denoted by 𝑠𝑟, 𝑠𝑤and 𝑠𝑓. Now, let us try to

understand how optimization actually works. So, the Stack GAN alternately maximizes the
discriminator loss and minimizes the generator loss.

Similar to what we saw with a GAN. In this case, the discriminator, as we just saw, would try to
maximize the log likelihood of the first score and minimize the log likelihood of the second and
third scores, which is what is given by the second and third terms here. We just saw that when we
understood the Scores. Similarly, the generator tries to maximize 𝑙𝑜𝑔(1 − 𝑠𝑓) because that is the

job of the generator. It also tries to minimize the KL divergence between the output of the mu’s
and sigma’s that you see in the initial layer at the end of the text embedding with the standard
normal distribution.

(Refer Slide Time: 07:47)

Here are some results that are obtained from Stack GAN. What you see on the top row is a
baseline method called GAN-INT-CLS that was published in 2016. And the bottom row shows
the results of the Stack GAN. So you can see here in the first column, the caption says, this
flower has petals that are white and has pink shading. You can see here that the quality obtained

1465
by the Stack GAN is far more photorealistic than of earlier methods. And this holds for all these
images that you see at the bottom row.

1466
(Refer Slide Time: 08:29)

Another popular method that came in 2018 is known as the Progressive GAN. The Progressive
GAN is designed for generating high-resolution images up to 1024 * 1024. And the key idea is
in the name of the method itself is the method progressively grows, the generator and the
discriminator. We will soon try to decipher what this means. This work was published in ICLR
2018. It also had a few other design decisions, such as using a standard deviation in a given
Minibatch, a concept of an equalized learning rate, and a Pixel wise feature vector normalization
in the generator, which we will also see soon.

(Refer Slide Time: 09:28)

1467
The idea of Progressive GAN is shown in the image on the left. The generator first produces a
4 * 4 image shown in the left-most part of this image. So, you have the latent vector that comes
from a standard normal, a very simple network, which generates a 4 * 4 image. This is provided
to a Discriminator D, along with a real sample, to judge whether it is real or fake.

This then increases to 8 * 8 in the next iterations of training. You can see here that compared to
the image generated in the 4 * 4 setting, the 8 * 8 setting is still blurry but has some more
details when compared to the 4 * 4 setting. You can also notice here that there are some layers
added in the generator and the discriminator when the next higher resolution is generated.

And this is repeated over and over again until we get the final generation of 1024 * 1024
image. The key idea that this allows is for stable training of GANs for generating high-resolution
images.

(Refer Slide Time: 11:02)

1468
Whenever the Progressive GAN goes from generating a certain resolution to the subsequent
higher resolution, as I just mentioned, there are new layers introduced. How are these layers
introduced? This procedure is very similar to ResNets. So, you can see that in subfigure b in the
image. You see now that while you had an initial resolution 16 * 16, when you step up to
32 * 32.

The nearest neighbour interpolated version of the 16 * 16 layer output; when you interpolate,
you get a 32 * 32 output added to the 32 * 32 layer through a skip connection. How is that
added? You can see an α and 1 − α. So, the new output layer is the final output is given by α
into the new output layer plus 1 − α into the projected layer. By projected layer, we mean the
output of the nearest neighbour interpolation. This allows Progressive GAN to fade in new layers
organically to generate better images.

(Refer Slide Time: 12:31)

1469
As I mentioned, Progressive GAN also introduces a few other design decisions. One such
contribution is known as Minibatch standard deviation. Here, the standard deviation is computed
and averaged at each spatial location over a Minibatch, So, you can imagine that when the
generator generates a particular feature map, you take every spatial location, which is at every
pixel, you compute its average, across the entire Minibatch, you will get a certain standard
deviation for that value across the Minibatch.

That is concatenated to all spatial locations at a later layer of a discriminator. Why do you think
this is done? Think about it. It will be your homework for this lecture. Another contribution that
Progressive GAN brings to the table is known as the Equalized learning rate. In this case, the
weights in each layer of the generator network are normalized by a constant c, which is specified
per-layer.

So, this constant c is a per-layer normalization constant. Why is this done? This allows us to vary
c in every layer and thus help keep the weights at a similar scale during training. So, why is this
called Equalized learning rate? Because the value of the weight effectively affects the gradient
and hence the learning rate. While methods such as Adam, AdaGrad, and so forth adapt the
learning rate, they may have low values with the weights themselves very low or very high.

Normalizing by a constant allows these weights to be of a similar scale during training across all
layers. This allows better learning. Another contribution that Progressive GAN makes is a
pixel-wise feature vector normalization in the generator. For each convolutional layer of the

1470
𝑡ℎ
generator, the normalization is defined by 𝑎𝑥,𝑦. Here, (x,y) is the pixel in 𝑎 feature map. It is

divided by N, where N is the number of feature maps. The summation j goes from 0 to 𝑁 − 1
𝑗 2
that is across the N feature maps, (𝑎 𝑥,𝑦
) + ϵ .

The denominator here considers the pixel value at the same location across all of the feature
𝑡ℎ
maps and normalizes the value at the 𝑎 feature map with this denominator. The ϵ here is for
numerical stability to avoid a divide by 0 error. And the output is denoted as 𝑏𝑥,𝑦, which is the

normalized value.

(Refer Slide Time: 15:59)

With these contributions, Progressive GAN reports impressive results. The results are compared
with earlier works such as Mao et al. and Gulrajani et al., and one can see that with the
Progressive GAN approach, the result is fairly photorealistic and high resolution with more
details when compared to earlier methods.

(Refer Slide Time: 16:25)

1471
A third more recent improvement of GAN is known as StyleGAN, published in CVPR 2019.
Progressive GAN, also known as ProGAN, generates high-quality images but does not give the
capability to control specific features in the generation. For example, it is difficult to use a
Progressive GAN and say that you would like to take an image and add a specific colour or
change a particular attribute in the image.

StyleGANs objective is to be able to control the generation of an image using a particular


predefined style. How does this do it? It automatically learns unsupervised separation of
high-level attributes, could be like, could be examples could be pose and identity for face
images. Stochastic variation such as hair could have a lot of randomnesses and scale specific
control of attributes.

1472
(Refer Slide Time: 17:38)

The intuition for StyleGAN is that a Coarse-resolution in an image, if we consider face images,
could affect attributes such as pose, a general hairstyle, face shape, etc. Suppose you go to the
next level of resolution, anywhere between 162 and 322. In that case, this resolution throws in
final attributes such as facial features, more specific hairstyle, eyes being open, closed, so on and
so forth.

At the highest resolution, a Fine resolution anywhere between 642 and 10242 affects the colour
scheme, both for eye hair and skin, and introduces micro features on the face. StyleGAN tries to
address and introduce some style-specific components at each resolution to attain its desired
effect.

1473
(Refer Slide Time: 18:43)

How does it do this? Given a Random input vector, the same as in the Vanilla GAN, StyleGAN
first normalizes this vector and then sends this through eight fully connected layers without
changing the input dimension and obtaining a 512 * 1 vector. Remember, the input latent was
also 512 * 1. So, these eight fully connected layers do not change the dimension of the input.
And this transformed vector which we call w, is given as input to the generator or the Synthesis
Network. Now, let us see how this vector w affects different resolutions in the generation.

(Refer Slide Time: 19:35)

1474
This is achieved by introducing a matrix A, which learns an affine transformation at different
resolutions. So, you can see here that the Synthesis Network or the generator in StyleGAN has an
Upsampling module and then has something called Adaptive Instance Normalization, which we
will see in a minute. A convolution layer follows it, and then another Adaptive Instance
Normalization Layer.

And this is then repeated over multiple blocks. Let us take a look at one of these combinations of
convolutional and Adaptive Instance Normalization layers. So, if you have a convolutional layer,
as seen on the top. In that case, a specific channel of the feature maps in that convolutional layer
is normalized by its mean and variance to get an Adaptive Instance Normalization.

In conclusion, the weight vector w is given as an input to each of these blocks. It is then
transformed by an affine map, which gives us a set of values 𝑦𝑠, 𝑖 and 𝑦𝑏, 𝑖, which is used to

change the scale and bias of the output of the convolutional layer. This is where Adaptive
Normalization comes into play. So, you can see here that this now becomes 𝑥𝑖 − µ(𝑥𝑖), which is

normalization with respect to mean and variance.

So, this quantity here on the inside corresponds to subtraction by mean and division by variance.
It multiplies them by 𝑦𝑠, 𝑖, which is a scaling value obtained as an output of the affine

transformation A and then biased by 𝑦𝑏, 𝑖, which is also the output of the affine transformation.

Note that these values, 𝑦𝑠, 𝑖 and 𝑦𝑏, 𝑖 could be different in different blocks of the Synthesis

Network. And each such transformation thus defines the visual expression of features in that
level. And that allows the input latent to have a different influence at different resolutions of
generation.

1475
(Refer Slide Time: 22:19)

Another method, the recent one, published in CVPR 2019 again, is known as SPADE. We will
see its expansion soon. And the key idea of SPADE is that previous methods directly feed the
semantic layout as input to the network. You have a certain image and the entire semantic layout
with the pixel configuration of the image. And in SPADE, which is given by Spatially Adaptive
Normalization. The input layout for modulating activations in normalization layers happens
through this spatially Adaptive Learned Transformation.

(Refer Slide Time: 23:08)

1476
Let us see how this happens. You could look at the output as going through an affine
transformation in the standard Batch Normalization layer. You scale up the value, and you add a
constant, which is an affine transformation. This is the standard batch norm operation, where you
have a value, you multiply it by gamma and add a beta which together defines an affine
transformation. However, in SPADE, a semantic segmentation map is given as input.

This semantic segmentation map goes through a convolutional layer, and the convolutional layer
outputs a gamma and a beta used to normalize the previous layer of the generator. So, as you can
see here, unlike blindly normalizing the output feature map in a specific layer, in this case, the
normalization is done in a spatially adaptive manner, where the gamma and beta come from the
spatial relationships in this Semantic Segmentation Map.

What is the Semantic Segmentation Map? Recall that a Semantic Segmentation Map has a
pixel-wise association of a class label. So, you can see here in the sample, you have a tree, you
have a sky, you have a mountain, you have grass, and you have a road, perhaps. That is used to
define the gamma and beta for normalization. A random latent vector can also be used to
manipulate the style of generated images.

But the semantic segmentation map gives a way of normalizing using spatial content at each
pixel. So, you can notice here that the normalization now is defined at each pixel. These gammas
and betas are defined at the pixel-wise level, denoted by the cross and plus, which are done
element-wise, which can allow each pixel to be normalised differently, based on the Semantic
Segmentation mask.

1477
(Refer Slide Time: 25:38)

Here are some interesting results. So, you can see here that the Semantic Segmentation output is
given as the input to SPADE. So, these are masks of Semantic Segmentation, Pixel-wise Class
Labeling, and this is the actual ground truth image corresponding to these Semantic
Segmentation masks. You can see here, while the third and fourth columns correspond to other
methods, SPADE gives a fairly photorealistic output close to the ground truth.

On the left side is the architecture of the generator in SPADE. It is similar to many other GAN
architectures, but the input of the semantic segmentation mask is coming in at each layer.

1478
(Refer Slide Time: 26:35)

Finally, the last method we will discuss in this lecture is a very popularly used one these days,
known as BigGAN, published in ICLR 2019. The focus of BigGAN was to scale up GANs for a
better high-resolution generation. It was designed for class conditional image generation, which
means the input is both a noise vector, similar to a Vanilla GAN and some class information.

For instance, it could be one hot vector, which together is given to the GAN to generate an image
corresponding to that class. BigGAN also introduces a few different design decisions, as we will
see in the following few slides.

1479
(Refer Slide Time: 27:28)

An important design decision in BigGAN is to use the idea of a Self-Attention GAN or a


SAGAN, an earlier work introduced in late 2018 and published in ICML 2019, which introduces
Self-Attention. We already saw Self-Attention in Transformers a week ago. It is the same idea
here, where you have a set of convolutional maps as the output of a particular layer in the
generator.

This output goes through multiple 1 * 1convolutions. Some of which transform and transpose to
obtain an Attention map. These three branches are very similar to the query, key and value of the
transformer architecture. You could consider them to be similar. So, two of those branches
generate an Attention map, which is used to focus on a specific part of the convolutional map to
generate that component of the image in the next layer. Why is this required?

Self-Attention GANs were introduced because often GANs, while they generated crisp images,
would miss out on finer details. For example, in a Dog's image, they may generate the fur on the
dog but miss out on the legs of a dog. So, the idea of using Self-Attention is to focus on specific
parts of the image and generate every local detail more appropriately. In BigGAN, along with
Self-Attention GAN, Hinge Loss is used.

This Hinge Loss is similar to the hinge loss used in support vector machines, given by max(0, 1 − t·y), where t is the target output and y is the predicted output. This is relevant because BigGAN is class conditional: you also want the output image to belong to a particular class, which is then used at the discriminator stage, since the discriminator not only says real or fake but also accounts for the class label, and the loss in this case is given by the Hinge Loss.

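As an aside, a hinge-style adversarial loss of this kind can be written as below. This is a generic sketch, not the exact BigGAN code; the logits are assumed to come from the class-conditional discriminator.

import torch.nn.functional as F

def d_hinge_loss(real_logits, fake_logits):
    # Discriminator: push logits on real samples above +1 and on fakes below -1,
    # mirroring max(0, 1 - t*y) with targets t = +1 for real and t = -1 for fake.
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()

def g_hinge_loss(fake_logits):
    # Generator: raise the discriminator's logits on generated samples.
    return -fake_logits.mean()
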
(Refer Slide Time: 30:04)

In addition, BigGAN also introduces Class-conditional Latents in a slightly different way. Instead of giving the class-conditional latent, which could be a one-hot vector, directly at the generator's input, it is given as a separate input at multiple stages of generation. It is concatenated with a certain input and then given to each of the Residual blocks. What is done inside each of these Residual blocks? The concatenated class label vector passes through two different linear transformations, which provide the parameters of the two batch norm layers inside each residual block.

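A minimal sketch of such class-conditional batch normalization is shown below (the module and argument names are illustrative assumptions): a conditioning vector, formed from the class embedding concatenated with a chunk of the latent, is linearly transformed into the gamma and beta of a batch-norm layer.

import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    # Batch norm whose scale and shift are linear functions of a conditioning vector.
    def __init__(self, num_features, cond_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.to_gamma = nn.Linear(cond_dim, num_features)
        self.to_beta = nn.Linear(cond_dim, num_features)

    def forward(self, x, cond):
        # cond: (batch, cond_dim) vector of class embedding + latent chunk.
        gamma = self.to_gamma(cond).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(cond).unsqueeze(-1).unsqueeze(-1)
        return self.bn(x) * (1 + gamma) + beta
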
(Refer Slide Time: 31:02)

BigGAN also had other design decisions that helped its performance. One such was Spectral Normalization, where the weight matrix in each layer is normalised so that its spectral norm satisfies the Lipschitz constraint σ(W) = 1. You can see the paper Spectral Normalization for GANs, ICLR 2018, for more details. The broad idea of such a step is similar to Batch Normalization: to constrain the weights with specific properties that ensure that learning can be better and faster.

In this case, we try to constrain the highest singular value of the weight matrix. A second idea is Orthogonal Weight Initialization, a well-known weight initialisation method: each layer's weights are initialised to ensure W^T W = I, i.e., the weight matrix is orthogonal. BigGAN also introduced Skip-z Connections, where the latent input z is connected to specific layers deep in the network.

It also introduces another method known as Orthogonal Regularization, which encourages the weights to be orthogonal in each training iteration. This is done using a regulariser R_β, where β is a coefficient. The regulariser tries to ensure that the Frobenius norm between W^T W and the identity I is minimised. So, this would be a way to ensure that W^T W is close to the identity I, and hence that the weight matrix is orthogonal. Why would we want Orthogonal Regularization? Think about it; this is another homework for you for this lecture.

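A small sketch of such a regularizer is given below, assuming the squared-Frobenius-norm form R_β(W) = β‖W^T W − I‖_F² as one plausible reading of the description above, applied to the rows of each flattened weight matrix; the coefficient value is also an assumption.

import torch

def orthogonal_regularization(model, beta=1e-4):
    # Penalize the squared Frobenius norm between the Gram matrix of each
    # flattened weight and the identity, pushing the rows towards orthonormality.
    penalty = 0.0
    for name, param in model.named_parameters():
        if param.dim() >= 2 and 'weight' in name:
            w = param.view(param.size(0), -1)          # flatten to (out, fan_in)
            gram = w @ w.t()                           # Gram matrix of the rows
            eye = torch.eye(gram.size(0), device=w.device)
            penalty = penalty + ((gram - eye) ** 2).sum()
    return beta * penalty
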
(Refer Slide Time: 33:16)

Other hacks employed in BigGAN were: 1. They updated the Discriminator model twice before updating the Generator model in each iteration. 2. The model weights were averaged across a few training iterations using a moving average approach; Progressive GAN also did this. 3. BigGAN also used very large batch sizes, such as 256, 512, 1024, and 2048. By batch sizes, we mean minibatch sizes while performing Minibatch SGD; they observed the best performance with the largest minibatch size of 2048. 4. They also doubled the model parameters, or the number of channels in each layer.

5. They employed a trick known as the Truncation Trick. The generator receives a latent vector sampled from a Gaussian during training time. At test time, you sample a latent from a Gaussian, and if its magnitude exceeds a threshold, you discard it and sample another value. This is called a Truncated Gaussian. You could imagine that this Truncated Gaussian looks something like this, where you ensure that samples falling in the tails beyond the threshold are never given as input to the GAN. That way, all inputs to the GAN come from the shaded part of the PDF. The main idea is to avoid anomalous generations from such latent vectors and instead get generations that belong to the core part of the PDF of that Gaussian.

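A simple way to draw such truncated latents is sketched below; the threshold value is an illustrative assumption.

import numpy as np

def truncated_normal(size, threshold=0.5, rng=None):
    # Resample any entries whose magnitude exceeds the threshold, so every latent
    # value comes from the central, high-density part of the Gaussian.
    rng = rng if rng is not None else np.random.default_rng()
    z = rng.standard_normal(size)
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = rng.standard_normal(mask.sum())
        mask = np.abs(z) > threshold
    return z

# Example: a batch of 8 truncated 128-dimensional latent vectors.
latents = truncated_normal((8, 128), threshold=0.7)
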
(Refer Slide Time: 35:22)

Your homework for this lecture is to go through this excellent survey of GANs recently released, and Chapter 20 of the Deep Learning book. Here are also the code links for most of the GANs discussed in this lecture, in case you would like to see them and try them out. We left behind the questions: why is Minibatch standard deviation used in Progressive GAN, and why is Orthogonal Regularization used in BigGAN? Think about them and we will discuss them next time.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Deep Generative Models Across Multiple Domains
(Refer Slide Time: 00:12)

In this lecture, we will talk about an interesting use case of GANs, which is Generating Images
across Domains.

(Refer Slide Time: 00:26)

Before we get there, let us answer the questions that we left behind. One of the questions was why Minibatch standard deviation is used in Progressive GAN, and why it is useful. I hope you had a
chance to try to find this. The answer is, in Minibatch standard deviation, the standard deviation
at each spatial location in a feature map across a Minibatch is concatenated in a later layer of a
discriminator.

The standard deviation gives an idea of the diversity of the images generated in a given
Minibatch. If this diversity is significantly different from the diversity in the real images from a
real dataset, that would incur a penalty, and the generator would learn to generate diverse images.
And that is the main idea of including this in Progressive GAN. The second question was, why is
Orthogonal Regularization of weights used in BigGAN?

The answer comes from linear algebra. Multiplication of a matrix by an orthogonal matrix leaves
the norm of the original matrix unchanged. Why is this useful? You have to recall Weight
Initialization and Batch Normalization. It is useful and important to maintain the same norm
across all layers and orthogonal regularization is a method that tries to achieve this during
training.

(Refer Slide Time: 02:22)

With that, let us move on to using GANs for generation across Domains, a task known as Domain Translation. The goal here is, given an image from a source domain, we would like to generate an image in a target domain. We would like to learn a function G that takes us from source to target. You could look at this as a variant of GANs, where the input is not a noise vector but a source domain image; there is more to it than just changing the input. Examples of use cases could be to take a male face image and change it to female, to go from sketches to photos, or to transform a summer scene into a winter one, and so on.

(Refer Slide Time: 03:21)

Here are some examples of how Domain Translation can be used. Here is an example of going from Semantic Segmentation Labels to a Street Scene. Similarly, Labels to a Facade; Black and White to Color; Day to Night, which could be very useful for autonomous navigation or self-driving datasets; going from an aerial view to a Google map; and going from a sketch to a photo. All of these are examples of domain translation.

(Refer Slide Time: 04:01)

So, there are a few settings under which different methods have been proposed, which we will
focus on in this lecture. In the first setting, known as Paired Training or the supervised setting,
you are given images from both domains in a paired manner. The goal is to train a GAN that can translate a new image from one domain to the other.

So, for every sketch, you are also given its corresponding photo in your dataset. This is the first
and simplest setting. In the second setting, we will talk about Unpaired training or unsupervised,
where you have a set of sketches, and you have a set of photos. They are not necessarily paired.
So, you do not know if, for a given sketch, the corresponding photo is there in the dataset or not.

But you still have to learn to go from a sketch to a photo, and we call this Unpaired image to
Image Translation. Finally, we will talk about Multi-modal generation, where you can go seamlessly between domains; the popular methods here are UNIT and MUNIT. Let us see each
one in detail.

(Refer Slide Time: 05:27)

The first method is for Paired translation. The popular method here is Pix2Pix. Pix2Pix defines
an image to image translation task as predicting pixels from pixels, and that is why the name
Pix2Pix. It provides a framework to perform all such tasks.

(Refer Slide Time: 05:53)

Pix2Pix builds upon the standard GAN objective. Recall that in the standard GAN objective, the discriminator maximizes its likelihood of telling real samples from generated ones, while the generator tries to fool the discriminator through the second term. However, when adapting this for image to image translation tasks, the objective changes slightly.

(Refer Slide Time: 06:23)

Pix2Pix defines this as a conditional GAN objective. Instead of just an image x coming from the real domain, you have an image x and the corresponding image y from another domain. In a standard GAN, given a real image and a generated image, the discriminator would have to say which is real and which is fake. In Pix2Pix, given an image x and another image, which may not be exactly G(x), the discriminator has to tell whether this is a correct translation or not. So, in the conditional GAN objective, the discriminator has to maximize the probability of (x, y), the correctly paired images from sketches and photos, being classified as real. That is something the discriminator has to do.

The second term is where, given x, a sketch for example, and z, a latent vector, G(x, z) generates a photo. So, you could consider G to be given an input from one domain, a sketch in this case, and it generates an image from the other domain. The discriminator's job is to take the original sketch and the generated photo and classify the pair as fake, i.e., it tries to keep log(1 − D(x, G(x, z))) high, while the generator tries to push that quantity down. That is the min-max game the generator and the discriminator play here. In addition to the Vanilla objective, you now have the (x, y) tuples to manage inside the GAN objective.

(Refer Slide Time: 08:34)

In addition to doing this, Pix2Pix also introduces an L1 objective to ensure that the generated image matches the original expected photo from the second domain. How is this done? The generator also tries to minimize the L1 loss, which is the sum of absolute values of the element-wise differences between y, the image from domain two (a photo in our case), and G(x, z), where x is again a sketch, an input from domain one, and z is the latent noise vector given as input to the generator. G(x, z) is the generated image in domain two, and we would like it to match y as closely as possible, which is captured by this term in the loss function. The overall objective now becomes the standard min-max GAN objective, captured in the first term and represented as cGAN or conditional GAN, plus some coefficient λ times the L1 loss that forces the generated images to be close to the ground truth.

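Putting the conditional GAN term and the L1 term together, here is a hedged sketch of the generator-side loss; the binary cross-entropy form of the adversarial part and the value of λ are common implementation choices assumed here, not something fixed by the discussion above.

import torch
import torch.nn.functional as F

def pix2pix_generator_loss(D, x, y, g_out, lam=100.0):
    # Adversarial term: the generator wants D(x, G(x, z)) to be judged real.
    pred_fake = D(x, g_out)
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    # L1 term: the translated image should stay close to the paired ground truth y.
    l1 = F.l1_loss(g_out, y)
    return adv + lam * l1
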
With this objective, Pix2Pix obtains fairly impressive results. A couple of examples are shown, where the input is a Semantic Segmentation mask, which is one domain, and the aim is to generate the scene image, which is the other domain. One can see that using only the L1 loss, the generation is very blurry. Using the conditional GAN, the generation improves, and when the two are put together, the generation has a fairly good amount of detail and is close to the ground truth image for this example. You see a similar observation with the second image, where one goes from the Semantic Segmentation mask to the actual facade picture.

(Refer Slide Time: 10:51)

The generator architecture in Pix2Pix resembles a U-Net-based architecture. The encoder reduces the dimensions of the layers down to a set of bottleneck features, which are then upsampled to get the final dimension of the input image, or the desired image, as the output. Similar to U-Net, there are skip connections: these connections go from each layer in the encoder to its corresponding mirror layer in the decoder, similar to what we saw in U-Net for Semantic Segmentation.

(Refer Slide Time: 11:34)

In addition, Pix2Pix also uses what is known as a PatchGAN Discriminator. With a standard discriminator in a GAN, even if an L1 or L2 loss is used as a regularizer the way Pix2Pix introduced it, this only ensures the crispness of the low-frequency components of generated images. So, you do get crispness in the output for the large objects in the image, but if one wanted finer details, you need to do better than an L1 loss at the image level.

The PatchGAN Discriminator instead reasons at a patch level between the generated image and the original expected ground truth or input. In this case, there is an enforcement of a patch-level classification of the generated image, comparing it to the ground truth and saying whether it is real or fake. This is done for all patches in the generated image, and the average is taken to decide whether the generated image is real or fake, which is then used in the loss to backpropagate and train the generator.

One could also look at PatchGAN as a form of texture- or style-aware generation, so that finer details or textures in the image can be generated better using a local patch-wise discriminator approach.

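A minimal PatchGAN-style discriminator is sketched below (the number of layers and channel widths are assumptions): it outputs a grid of real/fake logits, one per local patch, which are averaged when computing the loss.

import torch.nn as nn

def patchgan_discriminator(in_channels=6, base=64):
    # Input: the source image concatenated channel-wise with the real or generated target.
    # Output: an N x N map of logits, one per receptive-field patch of the image.
    return nn.Sequential(
        nn.Conv2d(in_channels, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(base * 4, 1, 4, stride=1, padding=1),
    )
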
(Refer Slide Time: 13:21)

A second domain translation or image to image translation method is Unpaired image to image translation, an unsupervised approach called CycleGAN, again a popular approach. CycleGAN is premised on the observation that paired data from different domains can be challenging to collect. It is not easy to obtain the corresponding photo for every sketch and thus build a paired dataset. On the other hand, what may be more accessible is large amounts of unpaired data, where you have a set of sketches and a set of photos; you do not necessarily have a paired photo for each sketch. They could just be two independently collected sets. In this case, it becomes challenging to learn domain-conditional distributions the way Pix2Pix learned to generate these images.

The challenge here is that you could have infinite possible translations for a given source sample, because the pairing is not known in the training data. Given a sketch, there could be infinite ways to transform it into a photorealistic image. How do you handle this? That is where CycleGAN comes into the picture. It uses two generators, G and F, which are intended to be inverse functions of each other. It uses a concept called Cycle consistency, where the idea is that the output from the target domain should also map back to the source domain and match the input image. It uses adversarial training for generators and discriminators to achieve this.

(Refer Slide Time: 15:32)

Let us see what Cyclic Consistency means. The name CycleGAN comes from this idea. Cyclic Consistency is similar to a concept from machine translation, where a phrase translated from English to French should also translate back from French to English and give you back the original sentence in English. This would ensure that the translation is complete and that the original sentence can be recovered from the generated French sentence.

One would want the reverse process also to be true: if you start with French and go to English, then the generated English sentence should give back the French sentence. This is shown pictorially in the slide. So, given an image from the input domain X, you can go to an image in the second domain Y. If you then generate back an image in the original domain from Y, you should get back the original image you started with. This is the key idea of Cyclic Consistency in CycleGANs.

(Refer Slide Time: 17:00)

The way this is implemented is, given an input image x coming from Domain 1, G generates the version of the image in Domain 2, given by G(x). This output is given to F, and F(G(x)) should be close to x. The loss function to minimize is the L1 norm ||F(G(x)) − x||_1.

Similarly, if you start from the second domain with y, generate an image in the first domain, given by F(y), and then apply the transformation G(F(y)), the reconstruction error to be minimized is the L1 norm ||G(F(y)) − y||_1, which should be close to 0. These are the criteria that help CycleGAN work, even with unpaired images.

(Refer Slide Time: 18:15)

Adversarial losses give the loss functions; let us elaborate on each of them. To go from domain X to domain Y, the adversarial loss tries to maximize the discriminator's output on real Y images: the discriminator would want D_Y(G_XY(x)) to go to 0, and the generator would want 1 − D_Y(G_XY(x)) to go to 0, or equivalently D_Y(G_XY(x)) to go to 1. This would ensure that the image generated from X in the Y domain looks like a real Y to the discriminator D_Y.

This is the first adversarial loss. Similarly, one could have a reverse adversarial loss for an image going from domain Y to domain X. The second loss is simply the converse or the complement of the first loss function; the only difference is that the input is an image from domain Y, and the output is an image in X given by the generator G_YX.

While these two losses give domain-specific losses to go from X to Y and Y to X, we then have the Cyclic loss for domain X to Y. As we already mentioned, it starts with G_XY(x), which gives you an output in domain Y, and then applies G_YX, the generator going from Y to X, to this value. So, remember, the first quantity gives you an element in domain Y, and applying the generator that goes from Y to X gives you back an element in domain X.

You would want this final generated output G_YX(G_XY(x)) to be close to the x that you started with. So, the L1 loss tries to minimize the difference between the final reconstruction in the original domain and the ground truth that we started with.

(Refer Slide Time: 20:56)

We also need a similar cyclic loss to go from Y to X, which is a complement of the loss from X
to Y. These four losses are put together to help train the CycleGAN.

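A compact sketch of how these four losses could be combined for one generator update is given below; the least-squares form of the adversarial terms and the weight λ are assumptions of this sketch.

import torch
import torch.nn.functional as F

def cyclegan_generator_losses(G_xy, G_yx, D_x, D_y, x, y, lam=10.0):
    fake_y, fake_x = G_xy(x), G_yx(y)
    # Adversarial terms: each translation should look real to the target-domain discriminator.
    pred_y, pred_x = D_y(fake_y), D_x(fake_x)
    adv = F.mse_loss(pred_y, torch.ones_like(pred_y)) + \
          F.mse_loss(pred_x, torch.ones_like(pred_x))
    # Cycle-consistency terms: translating forth and back should recover the input image.
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)
    return adv + lam * cyc
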
(Refer Slide Time: 21:10)

And with this, Cycle-GAN gives impressive results, where given an input image, it gives an
output image from another domain. You can see different kinds of images.

(Refer Slide Time: 21:28)

To make this a little more tangible and clear, here is an input image, and the output images show it rendered in different artists' styles, such as Van Gogh, Monet, Cezanne, etc. You can see here that the style of each artist remains the same, but it is the input image that is translated into that artist's style.

(Refer Slide Time: 21:55)

However, CycleGAN has one problem: since there are multiple possible translations for a given source image, it can suffer from what is known as Mode Collapse, where the model may not be able to produce diverse images. CycleGAN could just generate one image from which the original image in the source domain can be retrieved, and still get a low loss. It does not explicitly try to generate different kinds of images in the second or target domain, for that matter.

So, the solution to address the mode collapse problem in CycleGANs is to embed latent spaces
inside the GAN framework. How do we do that? By combining VAEs and GANs. Recall that
VAEs introduce a latent space that is learned through an encoder-decoder framework. So, we will
talk about these methods now that embed latent spaces inside the GAN framework to address
this Mode collapse problem.

What we ideally want is that, given an input edge image or sketch, we can obtain several kinds of domain translations: you may want a pink shoe, a black shoe, or a beige shoe, based on certain changes in a latent variable that is learned through a VAE. You can see applications of such an approach in, say, fashion and apparel shopping, and so on.

(Refer Slide Time: 23:55)

One of the earliest methods in this direction is known as UNIT. UNIT stands for Unsupervised Image to Image Translation Network, which was published in NeurIPS 2017. UNIT uses a VAE-GAN framework to learn latent spaces and domain translation simultaneously. So, in addition to the cycle consistency that we saw with CycleGANs, UNIT introduces cycle consistency even at the latent space level.

Let us see this with the loss functions, but let us first try to understand the entire setup before we go there. The UNIT architecture is based on this illustration. Given two inputs x_1 and x_2 from the two domains, you have corresponding encoders E_1 and E_2 for each domain, which give a latent vector in the same shared space; the dimension of the latent vector for both domains is the same.

Given a latent vector from that shared space, you have two generators, G_1 corresponding to Domain 1 and G_2 corresponding to Domain 2. This gives us four possibilities of generation: x_{1→1}, where the input is from Domain 1 and the output generated is also from Domain 1; x_{2→1}, where the input is from Domain 2 but the output is from Domain 1; and similarly x_{1→2} and x_{2→2}. All these images are then passed to a discriminator corresponding to each domain to say whether the generation is real or fake.

(Refer Slide Time: 25:56)

The loss functions are given by both a VAE loss and a GAN loss, since UNIT is a VAE-GAN framework. The VAE loss in the x_1 direction is given by a KL-divergence between the approximate posterior and the prior, KL(q_1(z_1 | x_1) || p_η(z)), together with the log-likelihood of generating x_1 given z_1 using the generator G_1, log p_{G_1}(x_1 | z_1); we take the negative log-likelihood, which needs to be minimized.

Similarly, you have the GAN loss, which corresponds to log D_1(x_1) for x_1 coming from p_{x_1}, which is maximized by the discriminator D_1, and log(1 − D_1(G_1(z_2))), where z_2 comes from a sample drawn from the second domain but the generation is in the first domain; this term is minimized by G_1 and maximized by D_1. You would have similar VAE and GAN losses in the x_2 direction to complete this picture.

In addition, you also have a Cyclic loss for x_1 → x_2, given by a KL divergence between the approximate posterior q_1(z_1 | x_1) and the prior, a similar KL divergence between q_2(z_2 | x_{1→2}), the posterior of the translated image, and the prior, and also the negative log-likelihood of G_1 generating x_1 given z_2. This would be the Cyclic loss from x_1 to x_2 and back, and one would again have the x_2 to x_1 direction defined similarly. All of these can be carefully understood as extensions of GANs and VAEs, keeping in mind that one needs to ensure generation across domains.

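For reference, the three kinds of terms described above for the x_1 direction can be summarized compactly as follows; this is a restatement of the description above (weighting coefficients omitted), not the exact equations from the slide.

L_{VAE_1} = \mathrm{KL}\big(q_1(z_1 \mid x_1)\,\|\,p_\eta(z)\big) - \mathbb{E}_{z_1 \sim q_1}\big[\log p_{G_1}(x_1 \mid z_1)\big]
L_{GAN_1} = \mathbb{E}_{x_1 \sim p_{x_1}}\big[\log D_1(x_1)\big] + \mathbb{E}_{z_2 \sim q_2}\big[\log\big(1 - D_1(G_1(z_2))\big)\big]
L_{CC_1} = \mathrm{KL}\big(q_1(z_1 \mid x_1)\,\|\,p_\eta(z)\big) + \mathrm{KL}\big(q_2(z_2 \mid x_{1\to 2})\,\|\,p_\eta(z)\big) - \mathbb{E}_{z_2 \sim q_2}\big[\log p_{G_1}(x_1 \mid z_2)\big]
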
(Refer Slide Time: 28:49)

An extension of UNIT-GAN is MUNIT-GAN. In MUNIT-GAN, the latent space of the image data is divided into a content space and a domain-specific style space. The idea here is that each domain has a certain style; we talked about this with CycleGANs, where you could have each artist's style be a different domain. The style encoder then helps transfer content from a different domain into the style of that particular domain.

This is implemented using a within-domain autoencoder framework, as well as a cross-domain framework. Let us see this in some more detail. You have x_1 and x_2, which are images in two different domains. The latent variable is now divided into two parts, s_1 and c_1, which together would have been z_1 in UNIT; s_1 corresponds to the style latent. So, you could now imagine a latent vector z divided into two parts, not necessarily equal: the latent vector could be, say, 100-dimensional, with 40 dimensions corresponding to the style and 60 dimensions corresponding to the content, just as an example. So, s_1 corresponds to the style of the first domain, and c_1 is a set of latent dimensions corresponding to the content of that particular domain. Similarly, c_2 and s_2 for the second domain. By themselves, you could consider these as two individual variational autoencoders for Domain 1 and Domain 2.

This is the within-domain autoencoder framework. In the cross-domain framework, one gives x_1 as input, which leads to c_1, the content latent of x_1. Similarly, the content latent of x_2 is c_2. c_2 is now combined with s_1 to form a new latent, and the decoder is applied to this new concatenated latent to get an x_{2→1}, which is a translation from the second domain to the first domain.

Similarly, the content variable of Domain 1 with the style variable of Domain 2 gives us x_{1→2}, which is a translation from Domain 1 to Domain 2. Now, to complete the cycle in the latent variable space, one could again take these translated x's and derive their latent variables through the encoder of a variational autoencoder, which would give us ŝ_1, ĉ_2, ĉ_1 and ŝ_2. Let us keep this structure and the entire illustration in mind when we look at the loss functions.

(Refer Slide Time: 32:13)

So, you have an Image Reconstruction Loss, given by the L1 loss ||G_1(E_1^c(x_1)) − x_1||_1. Remember, E_1^c(x_1) gets us the content part of the latent variables from the encoder of the first domain. The generator of Domain 1 is applied to that, which gives a reconstruction in the same domain, and we would like it to be close to x_1. This is a simple Image Reconstruction Loss.

(Refer Slide Time: 32:57)

We also have the Latent Reconstruction Loss, where we would like to ensure that the latents reconstruct themselves. This is given by the L1 loss ||E_2^c(G_2(c_1, s_2)) − c_1||_1. Let us try to understand this. Given the content from Domain 1 and the style from Domain 2, when one applies G_2, we get an output; what would that output be? That output would be x_{1→2}. Now, given this x, when the content encoder of the second domain is applied, which is what E_2^c corresponds to, one would want the recovered content variable to be close to the content of the first domain, c_1; this is the c_1 reconstruction loss.

One would have a similar reconstruction loss for s_2. Let us see that too. Given c_1 and s_2, once again one gets a construction of x_{1→2}. When the style encoder of the second domain is applied to this, one expects to retrieve s_2, given by ŝ_2 if you recall the earlier diagram. One would want this to be close to the original s_2, and an L1 loss tries to capture this latent reconstruction. You would similarly have terms for the corresponding s_1 reconstruction and c_2 reconstruction.

(Refer Slide Time: 34:46)

Finally, the Adversarial Loss tries to ensure that D_2(x_2) is maximized by the discriminator D_2. Similarly, (1 − D_2(G_2(c_1, s_2))) is maximized by the discriminator and minimized by the generator in the second domain. This is the standard GAN loss for the second domain, and one would similarly define a GAN loss for the first domain, complementing this loss function.

In case some of this is hard to follow, I would recommend going through these equations
carefully. This is an extension of Vanilla GAN across two domains, and there is no further
complexity.

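To tie the reconstruction terms together, here is a hedged sketch in code; the encoders are assumed to return a content code and a style code, the within-domain reconstruction here uses both codes, and the adversarial terms discussed above would be added on top.

import torch.nn.functional as F

def munit_reconstruction_losses(E1, E2, G1, G2, x1, x2):
    # Within-domain autoencoding: each image is reconstructed from its own codes.
    c1, s1 = E1(x1)
    c2, s2 = E2(x2)
    image_rec = F.l1_loss(G1(c1, s1), x1) + F.l1_loss(G2(c2, s2), x2)
    # Cross-domain translation followed by latent reconstruction.
    x12 = G2(c1, s2)               # content of domain 1 rendered in the style of domain 2
    c1_hat, s2_hat = E2(x12)
    latent_rec = F.l1_loss(c1_hat, c1) + F.l1_loss(s2_hat, s2)
    return image_rec, latent_rec
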
(Refer Slide Time: 35:40)

With these loss functions, MUNIT-GAN shows impressive results of translations from one domain to the other, this time with more diversity and variety obtained by changing the latent values. Given an input image from one domain, in this case cats, if one wants to generate larger jungle cats or big cats, you could now get several translations by playing with the latent style vector of the second domain.

You can see that with different examples, you get several varieties of outputs in the second
domain for a given input, which can be obtained by changing the style vector of the second
domain. Remember, the style vector of the latent in the second domain can be interpolated to
change these kinds of outputs in the generation.

(Refer Slide Time: 36:45)

And as we wanted when we started, this also allows us to vary style and content gradually across the generations. So, you can see an example where the content comes from these sketches and the style comes from these images. In each case, the colour and style are of the second domain, while the content comes from the first domain. You can see in subfigures (a) and (b) that the first set of images goes from edges to shoes and the second set from big cats to house cats.

(Refer Slide Time: 37:31)

That concludes this lecture. Each link provides more details of the paper and the code if you
would like to know more. The link at the end has a list of all image to image translation work if
you would like to understand them more.

(Refer Slide Time: 37:53)

Here are references.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 70
VAEs and Disentanglement
(Refer Slide Time: 00:14)

We have seen different variants of GANs over the last couple of lectures. We will now move on
to another important notion in Generative Models, which is called Disentanglement. This notion
is more closely associated with Variational Auto Encoders VAEs, and we will also discuss why
this is so as part of this lecture.

(Refer Slide Time: 00:45)

To start with, what is Disentanglement? Disentanglement is about isolating sources of variation in observational data. If you had an image of a big red apple, can you separate the generative factors for such images as corresponding to size (big), color (red), and shape or object (apple)? Can we enforce Deep Learning models, Deep Generative Models in particular, to isolate these factors while learning such a model? Why do we need such an approach?

If we could disentangle the generative factors, it allows us to generate new images that may not
be in an observed dataset. Suppose your training dataset had images of small black grapes and
big red apples. Can we generate an image corresponding to a small black apple? You may not
find such an image in a real-world dataset.

But a deep generative model can hypothesize how this would look by setting the color to a particular value, the size to a particular value, and the object to a particular value. You would be able to do this reliably only if the latent variables in your generative model isolate these components of images.

(Refer Slide Time: 02:32)

Here is an example with face images. In this case, the latent variables could correspond to gender, age, hair, perhaps race, and so on. So if we knew which latent variable corresponded to gender, one could manipulate that latent variable alone to generate different variations going from, say, the female to the male gender, as you can see in the example here.

(Refer Slide Time: 03:12)

Why Variational Auto Encoders? You perhaps know the reason already. We probably already
used the word latent multiple times. In GANs, generative adversarial networks, the latents are
not learned per se. The latent vector is a noise from a Gaussian. In a Variational Auto Encoder,
the latent variables are learned. If one could now ensure that those latent variables are
disentangled, you may have a lot of control over what kind of images you can generate out of the
VAE. So recall the VAE overall architecture and formulation.

So you have your input data x; the encoder provides the mean and variance of an approximate posterior, which over the course of learning tries to become close to a pre-assumed prior. A vector is then sampled from this distribution, and the decoder reconstructs the data from that sampled vector. These latent variables could form a vector of multiple dimensions, and if they are disentangled, you can generate data with much more control.

VAE-GAN frameworks, such as Adversarial Auto Encoders, can benefit from disentangling this
latent variable in a VAE.

(Refer Slide Time: 04:56)

The first work that brought this notion to the community's attention and developed a method to encourage disentanglement was β-VAE, published in ICLR 2017. It is primarily a variant of the VAE itself; let us see what kind of a variant. If you recall the variational autoencoder loss, there are two terms in your evidence lower bound: one term minimizes the negative log-likelihood.

In other words, it maximizes the log-likelihood of generating the kind of data that is in the training set. The second part minimizes the KL-divergence between the approximate posterior and the true posterior. This breaks down into the two terms that we finally use while training the VAE.

In particular, we finally use only the KL divergence between q_Φ(z | x) and the prior p_θ(z), after applying the evidence lower bound. The KL with the true posterior is the correct one to start with, but it gets simplified to the KL that is written on the right side.

Now, this entire objective can be written in a slightly different manner. We can say that we would like to maximize the log-likelihood of generating x from z, subject to the constraint that the KL divergence between the approximate posterior q_Φ(z | x) and the prior p_θ(z) is as small as possible. We say that the KL divergence should be less than some small positive constant δ.

This is another way of writing out the same objective. You can now say that we are maximizing the probability of generating the real data while keeping the distance between the true and approximate posterior distributions small, which, using the evidence lower bound, boils down to keeping the distance between the approximate posterior and the prior small. How does this help?

(Refer Slide Time: 07:52)

Now, keeping this optimization problem in mind, we can write it as a Lagrangian with a Lagrange multiplier, using the KKT conditions. This is very similar to how one would write out the support vector machine objective. So this would turn out to be maximizing the log-likelihood while minimizing the constraint term that we had; in the Lagrangian, this constraint term turns out to be D_KL − δ.

That constraint term enters the objective weighted by a Lagrange multiplier β, following the standard Lagrangian approach to optimization. So here, we write the first term, the objective function, as it is, minus β, the Lagrange multiplier, times the constraint, which is the KL-divergence between the approximate posterior and the prior, minus δ. If you expand this, the first term stays as it is, and the second term becomes minus β times the KL-divergence, plus β·δ.

Since both β and δ are quantities that are greater than or equal to 0 (that is how we define them), you are left with saying that this overall quantity will be greater than or equal to the log-likelihood minus β times the KL-divergence. We are writing this as a maximization problem; when we do minimization, the sign will change.

(Refer Slide Time: 09:37)

One can now write the β-VAE loss to minimize as the negative log-likelihood plus β times the KL-divergence between the approximate posterior and the prior. It almost seems like nothing changed from a standard VAE, which is partly true: when β is equal to 1, you have the standard VAE. However, when β is made greater than 1, it introduces stronger disentanglement in the generative model. Why is that so?

Of the two terms used to train a VAE, recall that the goal of the first term is to improve the reconstruction capability of the decoder, and it is the second term that tries to learn the latents of the variational autoencoder. So, by giving the second term a stronger weight, we are trying to make the latents be learned better, in a more disentangled way. The only problem now is that this could limit the representation capacity of z, thus causing reconstruction problems in the entire VAE.

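A minimal sketch of this loss for a Gaussian-latent VAE is given below; the closed-form KL term assumes a standard normal prior, and the choice β = 4 is just an illustrative value greater than 1.

import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    # Reconstruction term (negative log-likelihood up to constants).
    recon = F.mse_loss(x_recon, x, reduction='sum')
    # Analytic KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
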
(Refer Slide Time: 11:05)

That brings us to another question which almost looks like a tradeoff between disentanglement
and reconstruction capability. By increasing β in a β-VAE, we get better disentanglement, but the
training procedure now thinks that the second term is more important.

(Refer Slide Time: 11:28)

In this case, the second term becomes more important and the first term slightly less important. If the first term is less important, this leads to poorer reconstruction performance.

(Refer Slide Time: 11:49)

So, to address this issue and be able to get both good reconstruction and good disentanglement, another method was introduced in NeurIPS 2018 by Chen et al., called β-TCVAE. Let us try to understand it. β-TCVAE looks at the KL-divergence term between the approximate posterior and the prior and then decomposes it. How is this decomposition done? It is done by looking at the approximate posterior q_Φ(z | x), which could also be written as q_Φ(z | x_n).

Now assume that you have a set of data points indexed from 1 to N, and each x_n is one data point, where n goes from 1 to N; this is just an expanded way of writing the approximate posterior. By standard probability, we can now write it as the joint probability q_Φ(z, x_n) divided by the probability of x_n, p(x_n). Assuming all data points are equally likely, the denominator here would be 1/N, which is a constant.

So, you could now say that we can replace the approximate posterior with the joint probability between the latent and each data point x_n. This means that the KL-divergence between the approximate posterior and the prior can also be written as a KL between the joint q_Φ(z, x_n) and the prior on z, and this can then be broken down into two parts.

The first part would be the KL-divergence between the marginal q(z) and the prior, and the second part would be a KL-divergence between the joint distribution q_Φ(z, x_n) and the product of the marginals q(z)·p(x_n). This second part is the mutual information between z and n, where n denotes the indices of the data points x. Note that mutual information is defined, up to a constant factor here, as the KL-divergence between the joint distribution of two random variables and the product of their marginals. Now, how does this decomposition help?

(Refer Slide Time: 15:13)

Once we have this decomposition, one notices that the second term is the marginal KL; we call it the marginal KL because the approximate posterior has now been marginalized. Earlier we had q_Φ(z | x), but that got marginalized, and the other part went into the mutual information term. This marginal KL is the component responsible for disentanglement.

Hence, penalizing the mutual information term may lead to poorer reconstruction. Keeping this in mind, we now want to ensure that we focus on the marginal KL while learning a VAE; that is the term we want to weigh with a β.

(Refer Slide Time: 16:08)

Before we do that, we will do one more thing: decompose the marginal KL divergence even further. The marginal KL divergence can be decomposed into a term that looks at the KL divergence between the distribution of z and the product of the marginals of each dimension of z. This term is known as Total Correlation. Although the name is a bit of a misnomer, Total Correlation is a concept from Information Theory which is a generalization of mutual information to multiple random variables.

If you had two random variables, such as the z and n that we saw on the previous slide, you would look at the joint and the product of the marginals of the two random variables and take the KL divergence. In Total Correlation, we do this for all the random variables involved in z in this particular context; those random variables, for us, are the different dimensions of z. The second term here is the dimension-wise KL-divergence, summed over all dimensions.

So, we have broken the overall KL divergence of z into dimension-wise quantities. Why is that important? Because in disentanglement, we would like each dimension to have a unique existence.

(Refer Slide Time: 17:45)

Now, if you look at this decomposition, one notices that the total correlation term is perhaps the most important for disentangled representations; that is the term that relates each dimension of z to the overall z. This leads us to the final loss for the β-TCVAE, which simply puts together all the components that we have seen so far: the negative log-likelihood, the mutual information, the total correlation, and the dimension-wise KL.

What is different? Notice that β is only on the total correlation and not on any of the other terms in the overall objective.

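Written out, one common way to express the resulting objective (with unit weights on the non-total-correlation terms, as an assumption) is:

L_{\beta\text{-TCVAE}} = -\mathbb{E}_q\big[\log p_\theta(x_n \mid z)\big] + I_q(z; n) + \beta\,\mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big) + \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)
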
(Refer Slide Time: 18:36)

This allows us to focus β on disentanglement only on that term and not affect the reconstruction
capabilities of the VAE.

(Refer Slide Time: 18:47)

Having seen β-VAE and β-TCVAE, one question that arises now is how do you evaluate whether
your generative model has learned to disentangle effectively? While one way is to generate
different images and check qualitatively whether those images represent different generative
factors, that can become a tedious exercise when there are many generative factors. One such metric that has been proposed in recent years is known as the Mutual Information Gap (MIG). The idea is to use the mutual information between the generative factors g and the latent dimensions z in some way.

What do we mean by generative factors? These are factors that we know exist in the dataset. If we say big red apple, then size, color, and object represent the generative factors. The idea is to see if
the latent dimensions z that are learned capture these generative factors somehow. We would like
to use the mutual information between the random variables g and the latent dimensions z to
capture this.

How do we do this? We compute the mutual information between each generative factor g_i and each latent dimension z_j. You would then have an entire matrix of mutual information values, one for every pair: g_1 and z_1, g_1 and z_2, g_2 and z_1, and so on. What do we do with all of these mutual information values? For each generative factor g_i, consider the latent dimensions that have the top two mutual information values; let us call them z_j and z_l.

Once we have this, we define the mutual information gap as the difference between the mutual information of g_i with z_j and the mutual information of g_i with z_l, with a normalization factor on the outside. What is the normalization factor? 1/H(g_i), where H(g_i) is the entropy of g_i, i.e., −Σ p log p over the values of that generative factor.

This quantity is then averaged across all of the generative factors. Averaging over the K factors and normalizing by the entropy H provides values between 0 and 1. If the mutual information gap is zero, the top two latent dimensions have equally high mutual information with the same generative factor; that would be considered bad disentanglement, because both those latents are learning the same thing. They are not disentangled.

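A small sketch of the computation described above is given below; it assumes that the mutual-information matrix between factors and latent dimensions, and the factor entropies, have already been estimated (for example, by discretizing the latents).

import numpy as np

def mutual_information_gap(mi, entropies):
    # mi: (K, D) matrix, mutual information of each of K generative factors
    #     with each of D latent dimensions; entropies: (K,) factor entropies H(g_i).
    gaps = []
    for i in range(mi.shape[0]):
        top2 = np.sort(mi[i])[::-1][:2]             # two largest MI values for factor i
        gaps.append((top2[0] - top2[1]) / entropies[i])
    return float(np.mean(gaps))                     # average over the K factors
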
On the other hand, when MIG is 1, it is good disentanglement. One question here is why we use the mutual information gap and not just the mutual information itself. Think about it; it is homework for you. If you need to understand this better, read the NeurIPS 2018 paper "Isolating Sources of Disentanglement", which defined this metric.

(Refer Slide Time: 23:06)

Another metric for checking disentanglement that has been recently proposed in ICLR 2018 is
known as the DCI Metric. DCI stands for Disentanglement, Completeness and Informativeness.
There is a quantity defined for each of these. To compute them, let us first train any model, say β-VAE, to learn latent representations, and get the latent representation of each image in a training dataset (or the test dataset, for that matter, if that is where you would like to study disentanglement).

Then we train linear regressors. You learn k different linear regressors, f_1, ..., f_k, each of which predicts one generative factor g_j given the entire latent vector z. You have k different generative factors and hence k different linear regressors. How do you learn them? For each input image, you get a latent vector z, and you want to use that to predict, say, the gender here, or the color of this apple; that would be the value of each generative factor.

Once we train these linear regressors, each one gives you a weighted combination of the latent dimensions; that is what linear regression does. So you would now have an entire matrix W_ij, which tells us how important a latent dimension z_i is for predicting a generative factor g_j. We will call the matrix of absolute values of the W_ij's, obtained through these regressors, the relative importance matrix R.

(Refer Slide Time: 25:10)

Once we get the relative importance matrix, there are quantities defined for each of disentanglement, completeness, and informativeness. Let us look at disentanglement first. This metric tries to capture the degree to which the representation disentangles the underlying factors of variation. It is obtained by defining D_i as 1 − H(P_i), where H is the entropy of P_i. What is the probability distribution P_i? P_i is defined as a vector of P_ij's over the generative factors g_j, where each P_ij is given by R_ij divided by the sum of all entries in row i of the matrix, the row corresponding to that latent factor.

So, D_i is the disentanglement score of the i-th latent; we ideally want each latent to predict only one generative factor and not all of them. The total disentanglement score is then given by the sum over i of ρ_i·D_i, where D_i is the disentanglement score of the i-th latent alone, and the coefficient ρ_i is given by the sum over j of R_ij divided by the sum over all i and j of R_ij, i.e., the total importance of row i normalized by the total of the importance matrix.

The second metric is completeness, which is the degree to which each underlying generative factor is captured by a single latent variable. For each generative factor g_j, the completeness is defined as 1 − H(P_j), where P_j is defined analogously, now normalizing over the column corresponding to g_j. If a single latent variable contributes to g_j's prediction, the score would be 1; we only want one latent variable to correspond to a generative factor. If all latent variables contribute equally to g_j's prediction, the score is 0, because that represents the opposite of disentanglement. We call that situation maximally overcomplete, where all latent factors correspond to just one generative factor in the data.

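A sketch of the disentanglement and completeness computations from a relative importance matrix R is given below; reporting the overall completeness as an average over factors is an assumption of this sketch.

import numpy as np

def dci_disentanglement_completeness(R, eps=1e-12):
    # R: (num_latents, num_factors) matrix of absolute regressor weights.
    P_rows = R / (R.sum(axis=1, keepdims=True) + eps)       # per-latent distribution over factors
    P_cols = (R / (R.sum(axis=0, keepdims=True) + eps)).T   # per-factor distribution over latents

    def entropy(p, base):
        p = p + eps
        return -(p * np.log(p) / np.log(base)).sum(axis=-1)

    D_i = 1.0 - entropy(P_rows, base=R.shape[1])    # disentanglement of each latent
    C_j = 1.0 - entropy(P_cols, base=R.shape[0])    # completeness of each factor
    rho = R.sum(axis=1) / (R.sum() + eps)           # weight of each latent
    return float((rho * D_i).sum()), float(C_j.mean())
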
(Refer Slide Time: 28:00)

The third metric is Informativeness: how informative are the disentangled latent representations? This measures how useful a representation is in capturing the underlying factors, and it is considered with respect to a specific task, for example a classification task in which you would like the latent representation to capture information about the object of interest. How do we measure this?

The informativeness of z about a particular generative factor g_j is given by the prediction error between the original generative factor and the predicted one, obtained using the corresponding regressor that we defined earlier. Note that these metrics, including informativeness, depend on the goodness of the regressors that we train in the first step.

(Refer Slide Time: 29:11)

That completes our discussion of disentanglement. I hope it provided you with an introduction to
the topic, a couple of methods that enforce disentanglement, as well as how to measure whether a
generative model is disentangled. As homework, please read this excellent blog, “From
Autoencoders to Beta-VAE”. Another blog on “Review of Disentanglement with VAEs” and
optionally, the papers that we referred to in the slides.

We left behind a question on the Mutual Information Gap metric: why the gap, and not just the mutual information itself? Think about it and we will discuss it next time.

(Refer Slide Time: 30:00)

References.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology Hyderabad
Lecture 71
Deep Generative Models: Image Applications

In the remaining lectures of this week, we will see a few applications of GANs and Generative Models to images and videos. We will try to cover simple applications of GANs to different domains, as well as a few adaptations of these Generative Models.

(Refer Slide Time: 00:39)

Before we go there, let us address the question that we left behind from the last lecture: why is the Mutual Information Gap used, and not just the mutual information, in the metric that we discussed for disentanglement? I hope you had a chance to try finding this out. The Mutual Information Gap tries to penalize unaligned latent variables, because we are looking at the gap between the two latent variables that have the highest mutual information with a generative factor.

So, if a certain generative factor is captured by a combination of two latent variables, that would be penalized by the Mutual Information Gap: we would get a 0 or a lower value, which indicates poorer disentanglement. The other part of the answer is that, if one latent variable models a generative factor, we do not need any other latent variable to model the same generative factor, and if that happens, we would like to say that we have poorer disentanglement.

This also necessitates the definition of a gap, rather than just using the mutual information of the top latent variable with respect to the generative factor. More on this can be found in the "Isolating Sources of Disentanglement in VAEs" paper; do read it if you are interested and need more information.
information.

(Refer Slide Time: 02:27)

Let us now come to a few applications of GANs. The first application that we will talk about is
for Image Editing. Simple image edits, such as making an image grayscale or adjusting the
brightness or contrast, do not require an understanding of the semantics inside the image.
However, if we want to take a face image and make the hair black, add bangs, add a smile, or make the face male, then you need to understand the semantic content in the image.

Can GANs help here? Possibly. However, GANs do not have an inherent mechanism to map an
image to its latent representation. We saw VAEs can do it, but VAEs sometimes can suffer from
reconstruction issues, as we discussed in the VAE GAN lecture. How do you use GANs to be
able to achieve something like this?

(Refer Slide Time: 03:38)

This was proposed in a variant of GANs called Invertible Conditional GANs (IcGAN), presented at a NeurIPS workshop in 2016, and IcGAN achieves this objective using the architecture shown here. It uses a conditional GAN, very similar in idea to the GANs across domains that we saw earlier this week, but now there is also an encoder in the architecture, which can encode an image into its latent representation and attributes.

So what is the overall architecture here? You have two encoders: one encoder gives the latent of an image, and the other gives the attributes of an image. Now, to generate a new image with different attributes, you pass the image through the encoders, get an attribute vector y, and change its attributes: change black hair from 0 to 1, change brown hair from 1 to 0, and now you get the same image with black hair.

(Refer Slide Time: 04:54)

So how does this actually work? The generator G samples an image from a latent representation z and some conditional information y. So G is defined on (z, y) and gives us a generated image x'. The encoder performs the inverse: given an image x, it outputs (z, y).

(Refer Slide Time: 05:22)

How do you train such a model? The broad objective is similar to what we saw with conditional GANs in the earlier lecture on GANs across domains. You have a log-likelihood of (x, y), the image and the attributes as given in the input training dataset, which we would like to maximize, and you also have the second term of the GAN objective involving D(G(z, y')), where G(z, y') gives you an x'. The discriminator then looks at the tuple (x', y') and says whether it is real or fake. This is similar to the conditional GAN we saw when generating images from text.

The encoder here has two parts, E_z and E_y: one for the latent and one for the attributes. To train the E_z encoder, the generator is first trained, and the trained generator is used to generate a dataset of images. Now we have pairs of images and their corresponding latent representations as a dataset, obtained through the generator. That is used to train the encoder: you have the generated image G(z, y') and the corresponding z that was used to generate it, and minimizing the L2 norm, or mean squared error, between the encoder's output on this generated image and z is used to train E_z.

The other encoder, E_y, is trained using the data itself. We assume that the training dataset comes with attributes, so you consider y to be the attributes provided in the dataset and E_y(x) to be the predicted attributes, and you minimize the mean squared error between these two to train E_y.

(Refer Slide Time: 07:42)

Here are some results on a popular dataset known as CelebA. You have an original image, and by varying different attributes you get several generations that are fairly realistic: a balding version of the person, the same face with bangs, with black hair, blonde, glasses, more makeup, a gender change, pale skin, smiling, and so on.

(Refer Slide Time: 08:18)

Another application of GANs is for a task known as Super-Resolution. We also talked about this
in the context of CNNs. Let us revisit it now. Image Super Resolution is the task of obtaining a
high resolution image from a low resolution image, which can be a challenging task, because you
have to fill in information that was not available in the first place. SR-GAN or Super Resolution
GAN aims to generate photo-realistic images with up to 4 x upscaling factors.

(Refer Slide Time: 09:01)

Why SR-GAN? Even CNNs can be used for Super-Resolution by doing deconvolution or
upsampling through the second stage of a network through an encoder decoder architecture. In
those architectures, the loss minimizes the mean square error between a generated output and the
true ground truth high resolution image. Mean square error can tend to overly smoothen output
images, because of the averaging effect that you get across all pixels.

So, because you take the mean squared error, it may not make specific pixels sharper, which may be important for Super-Resolution. GANs, by using an adversarial loss, drive outputs closer to the natural image manifold. So, if this were the natural image manifold given by these red boxes, GANs try to push the generated image onto that manifold, which helps in getting high quality super-resolved images.

(Refer Slide Time: 10:18)

How is SR-GAN trained? SR-GAN has a couple of components. One of the losses is known as the content loss. The content loss minimizes the mean squared error between the output of the generator, given by G_{θ_G}, and the true high resolution image. But instead of measuring the mean squared error between the images themselves, it minimizes the mean squared error between the feature representations of these two images.

What features? Both images are sent through a VGG-19 network, the output of a certain convolutional layer in the VGG-19 network is taken as the feature representation, and the mean squared error between those representations is minimized as a loss function for training the SR-GAN. In this particular case, φ_{i,j} corresponds to the feature map obtained by the j-th convolution before the i-th max-pooling layer, and W_{i,j} and H_{i,j} are the width and height of that feature map, respectively, which are used for normalization.

The second loss used to train SR-GAN is the standard adversarial loss, which looks at fooling the discriminator, similar to every other GAN that we have seen so far. The final loss used to train SR-GAN is a weighted sum of the content loss and the adversarial loss. In this particular case, it was shown that the adversarial loss can be weighted by a small quantity, 10^-3 in their case, to get good performance on Super-Resolution.

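A hedged sketch of this combined loss is shown below; the particular VGG-19 cut point and the binary cross-entropy form of the adversarial term are assumptions of this sketch, while the 10^-3 weight follows the discussion above.

import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG-19 feature extractor used for the content loss
# (which convolutional layer to stop at is an assumption here).
vgg_features = vgg19(pretrained=True).features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def srgan_generator_loss(sr, hr, d_logits_on_sr, adv_weight=1e-3):
    # Content loss: MSE between VGG feature maps of the super-resolved and true images.
    content = F.mse_loss(vgg_features(sr), vgg_features(hr))
    # Adversarial loss: push the discriminator towards labelling the output as real.
    adversarial = F.binary_cross_entropy_with_logits(
        d_logits_on_sr, torch.ones_like(d_logits_on_sr))
    return content + adv_weight * adversarial
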
(Refer Slide Time: 12:21)

With this loss, here are some sample results. You can see that bicubic interpolation of this image
gives us this super resolved image. Using a super-resolution CNN ResNet gives this image and
using a GAN, gives a far sharper generation of the image when compared to the original.

(Refer Slide Time: 12:46)

More images on natural scenes: bicubic interpolation, a super-resolution CNN with a ResNet backbone, SR-GAN, and the final original picture in high resolution. You can see that SR-GAN recovers local details more clearly than the super-resolution CNN, and much better than bicubic interpolation.

(Refer Slide Time: 13:11)

The third application is GANs for 3D Object Generation. So far, we have seen GANs generate
2D images. Can GANs also generate 3D objects? Fairly straightforward. Recent advances with
3D CNNs and Volumetric Convolution Networks have made it possible to learn 3D Object
representations. So, the layers in the generator have to now generate a volume. So as long as you
can replace those layers, you can generate any object through a GAN framework.

(Refer Slide Time: 13:52)

In this particular example, a work published in NeurIPS 2016, the 3D-GAN works as follows: the generator G maps a 200-dimensional latent vector to a $64 \times 64 \times 64$ cube, which represents an object $G(z)$ in 3D voxel space. The discriminator classifies whether the object is real or synthetic. So, only the architecture of the generator differs from a standard GAN. What is the loss? The standard GAN loss. This is a straightforward application of GANs to 3D objects obtained by changing the generator architecture to handle 3D volumes.
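
As a rough illustration of what "changing the generator to handle volumes" means, here is a minimal PyTorch sketch of a generator that maps a 200-dimensional latent vector to a 64x64x64 voxel grid using 3D transposed convolutions. The channel widths and layer choices are illustrative assumptions, not the exact published 3D-GAN architecture.

```python
import torch
import torch.nn as nn

class VoxelGenerator(nn.Module):
    """Sketch of a 3D-GAN-style generator: 200-d latent -> 64x64x64 occupancy grid."""
    def __init__(self, z_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(z_dim, 256, kernel_size=4, stride=1),          # 4^3
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(256, 128, kernel_size=4, stride=2, padding=1), # 8^3
            nn.BatchNorm3d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),  # 16^3
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),   # 32^3
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),    # 64^3
            nn.Sigmoid(),                                                     # voxel occupancy in [0, 1]
        )

    def forward(self, z):
        # Reshape the latent vector to a 1x1x1 volume before upsampling
        return self.net(z.view(z.size(0), -1, 1, 1, 1))

voxels = VoxelGenerator()(torch.randn(2, 200))   # shape: (2, 1, 64, 64, 64)
```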

(Refer Slide Time: 14:47)

And with that loss, when a GAN is trained to generate 3D objects, you get fairly good generations of different kinds of 3D objects, as you see on this slide.

(Refer Slide Time: 15:06)

The last application that we will see is GANs for Image Inpainting. Image Inpainting is a task
where a part of an image is missing and the job of a GAN is to fill in that missing part in the
image. Can a GAN be trained to perform Image Inpainting? Do you know the answer?

(Refer Slide Time: 15:31)

The answer is yes. One of the methods proposed for this is known as PG-GAN, which uses a standard GAN framework where the generator is a residual network that takes as input the image with missing parts and generates an image of the same resolution, and the discriminator is divided into two parts: a global discriminator and a Patch GAN discriminator.

The global discriminator reasons at a global or an image level to decide whether the image is real
or fake and the Patch GAN discriminator reasons at a local level, at a patch level to decide
whether an image is real or fake. So, the Patch GAN discriminator helps capture local texture
details and fill in those missing gaps in the image.

(Refer Slide Time: 16:32)

What is the loss? There is a reconstruction loss, which measures a pixel-wise L1 distance between the generated image and the ground truth, and an adversarial loss, which is divided into two parts, one for the global GAN and one for the Patch GAN. The overall objective is a combination of the reconstruction loss, the adversarial loss for the global GAN, and the adversarial loss for the Patch GAN.

(Refer Slide Time: 17:07)

With this kind of training procedure and architecture, here are some results of Image Inpainting: when a patch is taken out of an image, the GAN produces a fairly faithful reconstruction of those parts. You can see many examples here; there are some artifacts where the GAN generates content, but the generation is still fairly realistic.

(Refer Slide Time: 17:37)

Here are more results on building facades, where once again the GAN fills in information in a fairly photorealistic manner.

(Refer Slide Time: 17:49)

For more information on these applications, please see this link for 3D GANs and a very nice
link called GAN Zoo, which presents an overview of various kinds of GANs, which you can use
for different applications.

(Refer Slide Time: 18:06)

Here are references.

Deep Learning for Computer Vision
Professor Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture 72
Deep Generative Models: Video Applications

(Refer Slide Time: 00:15)

For the last lecture of this week, we will talk about Applications of Generative Models in video
understanding. In particular, we will talk about three different methods which have done different
things in the scope of video understanding and generative modeling.

(Refer Slide Time: 00:38)

Let us start by asking, can we use GANs to generate videos? We already answered the question
partly in the previous lecture, when we talked about using GANs for 3D object generation. In a
sense, videos are volumes too. More generally speaking, if you look at the GAN objective, the min-max objective, the discriminator D is parametrized by weights $w_D$ and the generator is parametrized by weights $w_G$.

G and D can take any form appropriate for a task as long as they are differentiable with respect to the parameters under consideration, and the min-max problem can be solved using this objective
function. Let us now see one method that extends GANs for video generation. However, with
one unique change. That unique change appears in a method called Generating Videos with Scene Dynamics, published in NeurIPS 2016.

This method considers the output of the generator to be given by $m(z) \odot f(z) + (1 - m(z)) \odot b(z)$, where z is the latent vector that is the input to the generator. Here f is the foreground, b is the background, and m is a mask, which is also learned, that indicates whether to use the foreground or the background for each pixel. Let us see how this is used in a GAN architecture to generate videos.
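
The composition itself is easy to write down. Here is a small PyTorch sketch of the $m \odot f + (1 - m) \odot b$ operation with a static background replicated over time; the function name and tensor shapes are assumptions for illustration.

```python
import torch

def compose_video(foreground, background, mask):
    """Combine a generated foreground video with a static background using a
    per-pixel mask: m * f + (1 - m) * b.
    foreground: (B, C, T, H, W); mask: (B, 1, T, H, W); background: (B, C, H, W)."""
    # Replicate the static background frame across all T time steps
    background = background.unsqueeze(2).expand_as(foreground)
    return mask * foreground + (1.0 - mask) * background

f = torch.rand(1, 3, 32, 64, 64)                    # foreground video volume
b = torch.rand(1, 3, 64, 64)                        # static background frame
m = torch.sigmoid(torch.randn(1, 1, 32, 64, 64))    # soft mask in [0, 1]
video = compose_video(f, b, m)                      # (1, 3, 32, 64, 64)
```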

(Refer Slide Time: 02:43)

So we have this function here that we just wrote. And we have an input noise vector to the
generator of a GAN which is 100 dimensional. Now, the noise is given as input to two branches,
one branch which is a foreground stream which generates a volume, a 3D volume. This volume
generator, this foreground stream forks into two parts at the end, where one part generates the
video foreground and the other part generates a mask for every pixel in every frame of that
video.

So the mask also has the same volume, same dimensions as the foreground. And the input noise
also goes through another stream which generates a static background. When we talk about
generating short videos, the background is not expected to change. And it is most likely that it is
the foreground that changes on a static background. So in this case, the second branch of the
generator generates a static image background.

And now, the foreground and the background are combined using this mask, whose values are output by the layer shown in blue here, as $m \odot f + (1 - m) \odot b$. That is what leads to the generated output. As you can see, the foreground stream captures the moving features.

The background stream captures the static features, and the background is replicated over time, with the same background used at every frame. That leads to the final generated video, which is a space-time cuboid. All of this corresponds to the generator
of this model. What about the discriminator?

(Refer Slide Time: 05:09)

The discriminator is the standard discriminator in a GAN, where the generated video is down
sampled over a series of layers. And the last layer has a binary classifier which says whether the
video was real or fake. And the loss function remains the same as a standard GAN.

(Refer Slide Time: 05:35)

And now we are able to generate videos with scene dynamics. Here you see snapshot examples
of videos. On top, you can see a background scene that is generated, a foreground that is
generated, a mask that is generated along with the foreground and together, we see the final
generation to be a combination of the background and the foreground using the mask.

You see below another example, where if you observe carefully, you can see here that the waves
at the bottom here seem to be receding, which shows the movement across the frames generated
in the scene. If you would like to see more video examples, please go to the website of this paper
right here. And you should be able to see many more examples of videos generated through this
method.

(Refer Slide Time: 06:34)

Another interesting method for generating videos, in fact a different task by itself, is video forecasting. We will talk about one such method, published in ICCV 2017, called The Pose Knows. This work argues that when GANs and VAEs are used directly for video generation, they generate the video directly in pixel space: each pixel is generated in every frame, which means the structure and the scene dynamics are captured at the same time.

And this could often lead to uninterpretable results. To avoid this problem, this method proposes
that one needs to forecast first at a higher abstraction, and then build the video based on that high
level abstraction. What is that high level abstraction in this method? This method uses human
pose as a high level abstraction. And it exploits human pose detectors as a free source of
supervision, and thus breaks the video forecasting problem into two steps.

One, it uses a VAE to predict the possible movements of humans in pose space. And these
predicted or generated human poses are used as conditional information to help the GAN predict
future frames in pixel space. The task here is: given a set of frames in a video, the model needs to predict the next frame or the next set of frames. You could imagine multiple kinds of applications; you could think of a sports application.

If somebody gave you a certain configuration of football players on a football field, and you saw the last 10 seconds of video, what would happen next, whom would the player pass the ball to: that could be one way of looking at it. It could also have applications from a security perspective if you take it to a different context.

(Refer Slide Time: 09:11)

So in this method, as we just mentioned, there is a first stage where the pose of the human involved in the video is predicted, and the predicted pose, or human skeleton, is then passed as conditional information to a GAN to actually generate the frames of the video. Let us first look at the pose prediction part, which is done using a VAE: a pose prediction encoder-decoder model that is a variational autoencoder in their implementation.

The encoder reads in image features from $X_t$, which is the frame at time t, the corresponding past poses $P_1$ to $P_t$, which are the poses observed so far, and the velocities of the poses at the different joint positions at each of the times 1 to t. What is velocity? It is how much each joint location is likely to move between one frame and the next. You can think of these velocities as the optical flow of the joint locations of the pose skeletons that we use to represent human pose.

So, the encoder reads in these image features, the corresponding past poses $P_1$ to $P_t$, which are what you see here, and the corresponding velocities $Y_1$ to $Y_t$, which give how much each joint location in each pose at time t moves towards the next time step. On the output side is the decoder, which replays the pose velocities in reverse order.

So, given the output of the encoder, which is $H_t$ and which goes to the decoder, the decoder now tries to predict $Y_t$ to $Y_1$, which are the pose velocities in the reverse order. What do we do with this? Once such a model is trained, the decoder is taken away. Only the encoder is used to get these hidden states $H_1$ to $H_t$. What do we do with that? We now use that to go to a future encoder and a future decoder.

(Refer Slide Time: 12:03)

In the future encoder, $H_t$ is given as input. $H_t$ is what we got out of the past encoder; that output is fed into the future encoder. So, the future encoder receives the past $H_t$, the future pose velocities $Y_{t+1 \cdots T}$, and the future pose skeletons $P_{t+1 \cdots T}$, which you see here. For pose skeletons, recall methods like DeepPose, which give us an (x, y) coordinate for certain joint locations of a human skeleton. Since this is a Variational Autoencoder, the encoder outputs an approximate posterior Q, which is shown here. And what does the decoder do?

(Refer Slide Time: 13:06)

The decoder samples not exactly from Q, but from a prior that is close to the approximate posterior Q, which is how the VAE learns to reconstruct the motions $Y_{t+1 \cdots T}$, the pose velocities, given the past $H_t$ and the poses at times $t+1 \cdots T$.

This is shown at the input of the future decoder. So, you draw samples from the prior; those samples here are given by $z_{t+1}, z_{t+2}, \ldots, z_T$, and each of these red elements is a sample from a Gaussian. Along with that, you also have $H_t$, which comes from the past encoder.

And along with that, you also have $P_{t+1}, P_{t+2}$ to $P_T$. Given these, the decoder of the VAE predicts $Y_{t+1 \cdots T}$, which are the velocities of each joint location for times $t+1 \cdots T$. Now, what happens once you train such a VAE?

(Refer Slide Time: 14:33)

Like any other VAE, the encoder is not used at inference; you only sample from the prior and then let the future decoder predict how the poses will look, given the history up to a particular time summarized by $H_t$, shown here. The decoder tries to predict what the pose will look like from $t+1$ to T using these velocities. Remember, given a pose and the velocities, one can construct the pose at all of the time steps $t+1 \cdots T$ just by adding the velocity to the current pose location.

(Refer Slide Time: 15:18)

Now coming to the second stage: once the pose information is obtained, remember that it is given as conditional input to the generator, which finally generates the videos. Let us see that part now. This is the generator, which is given a set of frames, denoted as input I, and the pose information that comes from the previous module we just discussed.

The generator generates the next few frames, which is the goal of video forecasting. This is a GAN module, so it has a loss for the discriminator and a loss for the generator. Let us see each of them. The discriminator loss has a term $l(D(V_i), l_r)$. What is $V_i$? $V_i$ is the ground truth video. What is $l_r$? The real label. So we would like the discriminator to assign the real label to the ground truth video.

On the other hand, we would like the discriminator to consider the generation produced from the input I and the pose skeleton $S_T$ shown here. Given these two, the generator generates the future frames, and the discriminator's job is to give them a fake label. What are these summations here? These summations are over a mini-batch whose size is M.

So one half of the mini-batch are real, ground truth videos, and the other half are generated videos, which are given by this term. This is what the discriminator tries to do. Let us now see what the generator tries to do. The generator, which has a counter objective, wants the discriminator to look at the generator's output and give it a real label. In addition to this adversarial objective, we also minimize the $L_1$ distance between the generated future frames and the ground truth future frames $V_i$, and this $L_1$ term is weighted by a coefficient $\alpha$.

This together helps train a model that can generate videos where in the first stage poses are
predicted and the poses are then passed on for the GAN to generate videos.
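
Here is a minimal PyTorch-style sketch of these two second-stage losses, assuming `G` takes the input frames and pose skeletons and `D` returns logits over videos; the function and argument names are placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def forecaster_losses(G, D, input_frames, pose_skeletons, gt_future, alpha=1.0):
    """Discriminator: real label for ground-truth future, fake label for generated.
    Generator: fool the discriminator plus an alpha-weighted L1 term to the ground truth."""
    fake_future = G(input_frames, pose_skeletons)

    real_logits = D(gt_future)
    fake_logits = D(fake_future.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

    fake_logits_g = D(fake_future)
    g_loss = (F.binary_cross_entropy_with_logits(fake_logits_g, torch.ones_like(fake_logits_g))
              + alpha * F.l1_loss(fake_future, gt_future))
    return d_loss, g_loss
```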

(Refer Slide Time: 18:15)

So the final end-to-end architecture looks like this. There is a past encoder that generates $H_t$. That $H_t$ is given as input to the future decoder along with the poses $P_{t+1}$ to $P_T$. This generates $Y_{t+1}$ to $Y_T$, the pose predictions, which are given as input to the GAN, which finally uses them together with the input frames so far to generate the future frames.

(Refer Slide Time: 18:56)

With this approach, the method shows interesting results. Here is an example of a person skiing on a snowy slope. You can see the pose skeleton here and the predicted movement here. The blue lines represent the optical flow, or the velocities, at each of those joint locations; the red skeleton is where the person moves to from the green skeleton at the current location.

Now, using that skeleton to condition the generation of the next frame, the GAN generates this image where the person has moved to that location. Clearly, the generation is not very sharp compared to the real-world frame, but it gives a fairly good estimate of where the person is likely to be in the future of the video that is provided to us.

(Refer Slide Time: 20:04)

A third method in Video understanding that we will discuss is an interesting one called
Everybody Dance Now, a recent work published in ICCV of 2019, where the objective is, given
a professional’s dancing video, and an amateur’s dancing video, can we generate a video of an
amateur dancing professionally? And one of the basic modules of this framework tries to have a
generator which captures the temporal coherence between frames using the ground truth
information and the generated information.

(Refer Slide Time: 20:54)

So once we have this generated information, you have $G(x_t)$ and $G(x_{t+1})$; $x_t$ and $x_{t+1}$ are the poses in two successive frames, and $G(x_t)$ and $G(x_{t+1})$ are the generations corresponding to those poses. $y_t$ and $y_{t+1}$ are the actual ground truth frames for that person corresponding to the skeletons $x_t, x_{t+1}$. The job of the discriminator in this module of the framework is to look at the generated frames $G(x_t)$ and $G(x_{t+1})$ and decide whether they are temporally incoherent, which is equivalent to fake here.

For real data, the label given is temporally coherent, or real. So it is no longer real versus fake; the discriminator tries to say temporally coherent or temporally incoherent. The standard GAN loss then involves the log likelihood of the discriminator looking at $x_t, x_{t+1}$, the pose information at times t and $t+1$, $y_t$ and $y_{t+1}$, the expected ground truth frames for the same poses, and $G(x_t)$ and $G(x_{t+1})$, the generated frames.

So this GAN module of the framework ensures that successive frames of the generated video are temporally coherent, like a given input video. This is one module of this work.
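
As a sketch of this idea, the following PyTorch-style snippet builds the "coherent versus incoherent" discriminator loss by showing D the pose pair together with either the real frame pair or the generated frame pair. Concatenating along channels and the binary cross-entropy form are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def temporal_discriminator_loss(D, x_t, x_t1, y_t, y_t1, g_t, g_t1):
    """D sees (pose_t, pose_t+1, frame_t, frame_t+1); real frames are labeled
    temporally coherent (1) and generated frames temporally incoherent (0)."""
    real_in = torch.cat([x_t, x_t1, y_t, y_t1], dim=1)
    fake_in = torch.cat([x_t, x_t1, g_t.detach(), g_t1.detach()], dim=1)
    real_logits, fake_logits = D(real_in), D(fake_in)
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
```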

(Refer Slide Time: 22:56)

Another module of this work handles the source person's dance moves. In this case, you have $y_1', \cdots, y_t'$, which are the video frames of the source person, a professional dancer. The pose of the professional dancer is extracted to get $x_1', \cdots, x_t'$. Since this has to be overlaid on the target person's pose, it is normalized to get the pose information in the dimensions of the person being considered.

Different people may have different limb proportions, so this normalization step maps the skeleton of the source person, the professional dancer, to the skeleton of the amateur dancer in terms of sizes and distances between joints. Given this normalized $x_1$ to $x_t$, recall that in the previous step we trained a generator that can take pose information and generate frames. Now the pose information comes from the professional's skeleton normalized into the dimensions of the target person, and the generator trained in the previous step generates videos of the target person with the same pose moves as the professional.

(Refer Slide Time: 24:41)

Finally, because transposing someone else's moves onto a target person can cause distortions and loss of detail on the target person's face, this method also introduces one module to refine the generated face. This is another GAN within the overall framework. In this module, from the image $G(x)$ generated in the previous step, a certain region around the face, around the nose of the person, is cropped out, which is denoted $G(x)_F$.

And $x_F$ is the pose crop around the same area. These two are given to a new generator module, which outputs a residual that is added to the original generated face to add more detail, giving the final outcome, which is then transposed back onto the final generation. This is a separate GAN by itself, where the discriminator tries to maximize the likelihood of $(x_F, y_F)$, where $x_F$ is the pose skeleton around the face, $y_F$ is the actual face region in the target image, and $G(x)_F + r$ is the face generated through this generator, where the residual r is the generator's output. This GAN helps refine the face.

(Refer Slide Time: 26:38)

Putting these together, in stage 1 of this framework, the first GAN is trained, and its objective has a few components. It has $L_{smooth}(G, D_k)$, which is the standard GAN loss we already saw; this is the GAN loss that ensures temporal coherence. $L_{FM}$ is a discriminator feature matching loss, very similar to pix2pix, computed between discriminator features of the generated image and the real image.

There is also an $L_P$ loss, a perceptual reconstruction loss, which again comes from pix2pix. This is for $G(x_{t-1})$ matching $y_{t-1}$ and $G(x_t)$ matching $y_t$. Remember that the perceptual loss measures the loss at an intermediate feature space level. This is stage 1.

(Refer Slide Time: 27:50)

Once this GAN is trained, the stage 1 weights are frozen, and then comes stage 2, where the face GAN is trained using $L_{face}$, the standard GAN loss for the face region, and a perceptual loss also defined for the face region. Once stage 2 is trained, stage 1 is retrained again, and this continues in iterations to get the final generations.

(Refer Slide Time: 28:23)

Now here are some interesting results. Here is a source subject and the source subject's corresponding poses. These poses are translated to target subjects, who now show the same pose at different scales, in different backgrounds, and with different body structures, including male and female. A similar example is also seen on the right side here.

(Refer Slide Time: 28:55)

The paper also shows interesting results of multi-subject synchronized dancing, where the source subject's poses are transposed onto multiple different subjects with different backgrounds, sizes, and genders, and the same pose is shown by all of those people to give a sense of synchronized dancing.

(Refer Slide Time: 29:22)

Your homework for this lecture is to check out the demo video from the Everybody Dance Now paper, which is fairly interesting, as well as a very nice article on open questions on GANs, to wrap up our discussion so far on generative models. And of course, if you are interested, please do read the papers on the respective slides.

Let us end this lecture with one question. Through this lecture, we saw methods that used videos
as input where the discriminator looked at the real video and the generated video and said
whether it was real or fake or temporally coherent or incoherent. Can you generate a video from
a single image? Think about it, and we will discuss soon.

(Refer Slide Time: 30:18)

Here are references.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 73
Few Shot and Zero Shot Learning – Part 01

In this last week of lectures, we will discuss a few advanced topics related to Deep Learning for
Computer Vision. We will start with a contemporary topic, Zero shot and Few shot learning.

(Refer Slide Time: 00:32)

Before we move forward, let us revisit the question we left behind in the last lecture on applications of deep generative models to video understanding, where we saw methods that use videos as input for generating videos. The question was: can we generate a video from a single image? You may have guessed the answer. The answer is yes; we are not going to discuss the specific method now, but you can see the article linked on the slide and the corresponding paper relevant to the article.

The core idea is a method known as single-shot learning, where using a single instance, which in our case is an image, the network learns to generate a complete video. Interestingly, single-shot learning is one specific setting of few-shot learning, which is the focus of this lecture. So, this becomes a nice segue into what we are going to discuss. For more details, please come back and visit this article to understand how a video can be generated from a single image.

(Refer Slide Time: 02:05)

Deep learning models are known to be heavily reliant on large amounts of labeled data during
training. That is one of the reasons they work so well. But that is also one of their major
limitations. If you have only a few samples to learn from, deep learning models may not be that effective in giving you a good model and good prediction performance. To a certain extent, regularization methods are used to avoid overfitting in low-data regimes, but they do not really give you good performance there. So, what do we do?

The solution is to explicitly train models that are capable of rapidly generalizing to new tasks with only a few samples, or perhaps no samples at all. That is the focus of this lecture. The objective is to enable models to perform under practical settings, where data annotation could be infeasible or expensive, or new classes get added over time. Illustratively speaking, you have a few what are known as base classes, where you may have many examples.

The goal is to use them to learn a classifier in such a way that if you have other classes, for which you may have very few samples or no samples at all, the classifier can still work and give you a good prediction. Let us see how this can be done over the rest of this
lecture.

(Refer Slide Time: 04:04)

This problem setting is called few-shot learning or zero-shot learning. Let us first review the formulation for both. Let x be an image or a feature; the input to these models need not be an image per se, it could be, say, a VGG or ResNet feature obtained by passing the image through a VGG or ResNet network. Let y denote the class label for that data point. Then few-shot learning is formalized as follows: you have training data $D_{train}$ given by $(x_i, y_i)$ tuples; let us assume there are a total of $I$ tuples, where for some of the classes there are only a few training samples.

Few-shot learning is also characterized as an N-way-K-shot setting: if the total number of classes is, say, C, then for N of those classes, with N less than C, you have only K samples each, while for the rest of the classes you have many samples. The rest of the classes are also called base classes, as shown on the earlier slide. That is few-shot learning.

What is zero-shot learning? Zero-shot learning is when "few" becomes zero. What does that mean? You have training data (x, y), but now you want to classify data points from classes that were never in your training set. That is the zero-shot setting: you have reasonable amounts of data for some of the classes, while for other classes, also called zero-shot classes, there is absolutely no labeled data at all.

In such a setting, your training data consists of x, y, and $a(y)$, which are some attributes related to each class y. It could be a textual description, for instance. Suppose you have images of animals; say you have seen a lion, a tiger and a zebra, but you have never seen a horse, that image was never available to you. But you know what the description of a horse looks like in text, or as a set of attributes. By a set of attributes, we mean whether the animal has a tail, four legs, what colors it can assume, whether it has a mane, and so forth. Those are what we refer to as attributes $a(y)$; you can look at them as metadata that are provided even if the data is not available.

So, that is the zero-shot setting at training time: $x \in X^s$, the set of seen classes, $y \in Y^s$, the labels of the seen classes, and $a(y)$ is the set of attributes for all classes. At test time in zero-shot learning, you have $(x, y, a(y))$, where x comes from an unseen class in $X^u$, y is the corresponding label, and $a(y)$ is the corresponding attribute. Importantly, $Y^s \cap Y^u$ is the null set; the seen classes and unseen classes have no intersection.

(Refer Slide Time: 08:12)

Within few-shot and zero-shot learning, there are two further settings. One is known as conventional few-shot or zero-shot learning, where the goal is to learn a classifier f such that the image or feature x to be recognized at test time comes only from the unseen or few-shot classes, not from the base or seen classes. That is what is known as conventional few-shot or zero-shot learning, and that is how the initial methods in this space were developed about 4 or 5 years ago.

The other setting, which is more challenging and more practical, is generalized zero-shot and few-shot learning, where the goal is to learn a classifier mapping x to both seen and unseen classes, or base and few-shot classes, and where the image or feature x to be recognized at test time may belong to either seen or unseen, base or few-shot, classes.

Clearly, the generalized zero-shot or few-shot learning setting is the more challenging and more useful one, because in the real world you cannot predict whether your image at test time will come only from an unseen class or only from a few-shot class; it could come from any class in your universe of classes.

(Refer Slide Time: 09:59)

At this juncture, Let us recall one of the fundamental tenets of Supervised Machine Learning,
which is Empirical Risk Minimization. Let $p(x, y)$ be the ground truth joint probability distribution of input x and output y. The goal of supervised learning, given a hypothesis or model h, is to minimize the expected risk, i.e., the loss measured with respect to the joint probability distribution. This is given by $R(h) = \int l(h(x), y)\, p(x, y)\, dx\, dy$: the loss l has two inputs, $h(x)$, the prediction of the model, and y, the expected output, and the loss is computed between these two quantities.

You integrate this over the entire joint probability distribution, which can also be written as the expectation of the loss over that distribution. Unfortunately, in the real world we cannot estimate the joint probability distribution, so this expectation is approximated by the empirical risk with respect to the training data provided to us. The empirical risk is given by
$$R_I(h) = \frac{1}{I} \sum_{i=1}^{I} l(h(x_i), y_i).$$

(Refer Slide Time: 11:41)

Now, let us assume that we have a family of models. For instance, if we were solving a linear problem with a linear SVM, the family of models would be the set of lines; what we denote by H depends on the algorithm and the choices you make. Let us assume that the family of models you are searching over to address this problem is given by capital H.

Let one of the hypotheses, or models, in this family be small h. Then we have 3 kinds of functions. One is $\hat{h}$, the function that minimizes the expected risk, which is what we are really looking for; that is the ground truth function. Then we have $h^*$, the function within the family capital H that minimizes the expected risk. And finally, we have $h_I$, the function in H that minimizes the empirical risk, which is the loss with respect to your training data.

Given these 3 terms, one can now decompose the generalization gap between what our model learned from the training data, $h_I$, and the original ground truth model $\hat{h}$, into two parts. The first part is $R(h^*) - R(\hat{h})$: if you chose this family of models capital H, and the best model in it is $h^*$, how close is that to the actual optimal model, which could belong to any family, linear, quadratic, or anything else. So, $\hat{h}$ is the unrestricted optimum and $h^*$ is the best model in the family we have chosen; hypothesis and model here mean the same thing.

The second term, $R(h_I) - R(h^*)$, measures how close $h_I$, the model that we obtain from our training data, is to $h^*$, the ideal function within the family. The first component is called the approximation error, which measures how closely functions in the family H can approximate the optimal hypothesis $\hat{h}$. The second term is the estimation error, which measures the effect of minimizing the empirical risk instead of the expected risk within the family of functions H. Why are we talking about all of this now?
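
Before answering that, the decomposition just described can be written compactly as:
$$\mathbb{E}\big[R(h_I) - R(\hat{h})\big] = \underbrace{\mathbb{E}\big[R(h^{*}) - R(\hat{h})\big]}_{\text{approximation error}} + \underbrace{\mathbb{E}\big[R(h_I) - R(h^{*})\big]}_{\text{estimation error}}.$$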

(Refer Slide Time: 14:55)

We will come to that in the context of few-shot and zero-shot learning. The second term on the previous slide can be reduced if you have large amounts of data: the empirical risk minimizer then becomes close to the expected risk minimizer, because with large amounts of training data, the data resembles the overall joint probability distribution more closely. So, if you had sufficient labeled data to train with, your empirical risk minimizer would give a good approximation to the best possible $h^*$.

Unfortunately, with few-shot and zero-shot learning, we do not have a sufficient amount of training data, especially for a certain set of classes. So, what is the problem? Let us look at this illustration. On the left is the case where you have a large number of data samples. You have $\hat{h}$, the optimal hypothesis, denoted by the star here; then you have this ellipse, which represents the family of models capital H that you are asking your machine learning algorithm to pick from.

The best solution there would be the projection of $\hat{h}$ onto this family of models, which we denote as $h^*$. You then start your machine learning algorithm somewhere, where the triangle is shown, and over the learning procedure you converge to a model $h_I$, which is fairly close to $h^*$. So, with a large number of samples, you get an $h_I$ that is close to $h^*$.

On the other hand, when you do not have a large number of samples, you may not be able to get very close to $h^*$, which is the best model you can get in this family. This means that the resultant empirical risk $R_I(h)$ is going to be far from a good approximation of the expected risk, and the resulting model $h_I$ may overfit to your training data.

(Refer Slide Time: 17:33)

How do you address this problem? We can look at it from 3 different perspectives: a data perspective, a model perspective, and an algorithm perspective. Let us now see each of them. From a data perspective, one could simply augment the training dataset, increasing the number of samples to $\tilde{I}$, which is far greater than I. That should help you get a better estimate of the empirical risk minimizer. You can see here that by increasing the number of samples, using what we call prior knowledge here, we push the $h_I$ model closer to $h^*$, which is our ideal goal. That is the approach of leveraging data.

The second option is to constrain the space of models to a smaller space, which is less likely to push $h_I$ far away from $h^*$. You can see here that using some kind of prior knowledge or domain understanding, we ensure that the space of hypotheses, or models, capital H is restricted to only the non-shaded region in the center, and any model here is likely better than most of the models outside this non-shaded region.

This is another way of addressing the setting where you may have only a few samples from certain classes. What are we doing here? We are constraining the complexity of H to a much smaller hypothesis space, hoping that even a small amount of training data will be sufficient to train a reliable $h_I$ that is once again close to $h^*$.
to train a reliable ℎ𝐼which is once again close to ℎ .

(Refer Slide Time: 19:37)

The third approach is algorithmic. We aim to search for the parameters $\theta$ of the best hypothesis $h^*$ in capital H. One way this can be achieved is by using some kind of prior knowledge; we will see examples of such methods later in this lecture, which alter the search strategy and can also provide a good initialization. You can see here that using some prior knowledge, you now have a starting point that is much better than the one in figure (a), which helps you converge to a better $h_I$. Each of these 3 approaches is a valid way to address this problem. Let us see methods belonging to each of these kinds in this lecture.

(Refer Slide Time: 20:34)

So, from an overall taxonomy perspective, one could constrain the hypothesis space using prior knowledge, which is the model approach to solving few-shot and zero-shot learning; examples of methods here are known as embedding learning methods. The second category of methods, based on augmenting training data, the data approach, are broadly grouped as hallucination or feature synthesis methods. And the third case, altering the search strategy in hypothesis space using prior knowledge, the algorithmic approach, is sometimes categorized as parameter initialization methods. Let us see each of them in detail now.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 74
Few Shot and Zero- Shot Learning – Part 02

(Refer Slide Time: 00:14)

The first kind of methods we will talk about are embedding learning methods, or model-based methods, as we just said. The key intuition of these methods is to address the few-shot learning problem by learning to compare. If a model can determine the similarity between two images, or the similarity between the semantics of class labels, which is required for zero-shot learning, one can classify an unseen input by comparison, or relation, to labeled instances seen during training. What is the overall idea? One learns separate embedding functions, which in this context are the output representations of a neural network, for the training samples $D_{train}$ and the test samples $D_{test}$, and these comparison models are trained end to end using an approach known as meta learning, which we will describe soon.

At test time, the prediction is made by comparing distances between the test feature and the training set features from each class. Here is the overall schematic: you have a few-shot training set $D_{train}$ and a test sample $D_{test}$; you have an embedding g for the train set and an embedding f for the test sample. The similarity is computed to make the final prediction of the class label of interest. What is meta learning here? Meta learning is known as learning to learn, and we will describe it in more detail.

(Refer Slide Time: 02:14)

Consider the N-way-K-shot setting of few-shot learning, where N of the total number of classes have only K examples each. In meta learning, in each episode you have a different set of training classes and a set of test examples. This set of training classes need not be exhaustive or cover the full training dataset; some classes are sampled from your dataset. The goal in the episode is to train a learner that can make a prediction on this test set based on this train set. That completes one episode of meta learning, with one learner as the output.

Now, in the second episode of meta learning, a different set of classes is again sampled to form
your training dataset, you have a test set, and once again, the meta learner refines the learner to
be able to solve the problem in this episode. This is repeated over multiple episodes, where in
each episode, a different set of training classes may be sampled. And at meta test, you finally
have your setting of your training classes, and your test samples where you would like to deploy
your final model.

A point to note is that $D_{meta-train}$ and $D_{meta-test}$ have disjoint classes, which means the 5 classes you have in meta-train may be different from the 5 classes in meta-test; the goal is that you have learned how to learn a model.
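
To make the episodic setup concrete, here is a small Python sketch of N-way-K-shot episode sampling; `dataset_by_class` is an assumed dictionary mapping each class label to a list of its samples, and the split sizes are illustrative.

```python
import random

def sample_episode(dataset_by_class, n_way=5, k_shot=1, n_query=15):
    """Pick N classes, then K support (train) examples and a few query (test)
    examples per class, forming one meta-learning episode."""
    classes = random.sample(list(dataset_by_class.keys()), n_way)
    support, query = [], []
    for label in classes:
        examples = random.sample(dataset_by_class[label], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query
```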

(Refer Slide Time: 04:12)

Here is another way of differentiating the two. In a classical machine learning setting, you have a training set, and you learn a model with a certain number of parameters with the objective of good performance on a test set. In meta learning, you have a meta-training set, which has its own train and test splits corresponding to episodes of meta learning; you learn a set of meta parameters $\theta$, which is then used to train learners on a new meta-test set.

(Refer Slide Time: 04:49)

Here is another way of understanding this. You have your training set in each episode of meta
learning that is used to learn a meta learner which teaches a learner to learn a model, and there is
a loss induced because of learning across these episodes in each round of meta learning. If you
see here, between the meta train and the meta test settings, the problem set up matches, so that
once you have trained a Meta-learner, the Meta-learner knows how to provide a model for the
new set of classes that may come in the training part of the Meta test episode.

Remember that in Meta learning, each episode can have a training and test. So in the Meta test,
you have a set of training classes. The learnt Meta-learner knows how to produce a model for the
set of classes to be able to solve these test samples.

(Refer Slide Time: 05:55)

Let us see a concrete example of this idea of meta learning, known as Matching Networks, a method developed and published in NeurIPS 2016. Matching networks are premised on bringing together parametric and non-parametric models. Recall that parametric models learn model parameters from training samples slowly; you ideally require large datasets to avoid overfitting.

Unfortunately, in this setting, where we are talking about few-shot and zero-shot learning, you may not have large datasets. On the other hand, non-parametric models like K-Nearest Neighbors allow novel examples to be assimilated quickly, and they are robust to a phenomenon called catastrophic forgetting. Catastrophic forgetting refers to the following: suppose a model is trained on a set of classes, say MNIST classes 0 to 6.

Let us assume that classes 7 and 8 arrive a bit later. When you train or refine your model on classes 7 and 8, the model forgets what a 0 or a 1 looks like, unless you retrain on the complete dataset, which may not be possible in all settings. This is known as catastrophic forgetting, and neural network models are known to be prone to this phenomenon. Non-parametric models, such as K-Nearest Neighbors, are automatically robust to it.

So, in matching networks, the authors propose to combine the best of both worlds. You have a training phase where you learn cosine-similarity-based embedding models. You see here that you have a set of samples S and a query sample Q; S is like your train set in a meta learning episode, and Q is like your test set in that episode. You learn embedding functions, which give you corresponding embeddings for the train set and the test sample, and then you use the cosine distance to measure similarity. That is what happens in the training phase to learn the embeddings. At test time, once you get these embeddings, a nearest-neighbor approach based on the cosine distance is used to classify the test sample.

(Refer Slide Time: 08:43)

Here is the architecture used in Matching Networks. You can see one meta learning episode here: you have a set of samples from a given set of training classes and a test sample in that episode. Remember that you repeat these meta learning episodes with a different set of training classes and test samples in each episode.

These training samples are given as a sequential set of inputs to an LSTM, and the LSTM outputs a representation for each of these input training samples. You also get a representation of the test sample. Then, there is a comparison module that looks at the similarity between the test sample and each of the training samples. That comparison is given in the equation here as $a(\hat{x}, x_i)$, where the $x_i$'s are the different training samples. How do you compute this function a?

In this approach, it is a very simple computation: a softmax over the cosine similarities between the test embedding $f(\hat{x})$ and each train embedding $g(x_i)$. So, you take the cosine similarity between the test image embedding and each of the training image embeddings, and then apply a softmax over these similarities to get a distribution over which of the training samples this test sample is closest to. Multiplying that with the class labels gives the final class label prediction for this test sample.
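
Here is a minimal PyTorch sketch of this matching step, a softmax over cosine similarities followed by a weighted sum of support labels; the function name and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def matching_prediction(test_emb, train_embs, train_labels_onehot):
    """test_emb: (d,), train_embs: (k, d), train_labels_onehot: float (k, n_classes).
    Returns a distribution over the n_classes for the test sample."""
    sims = F.cosine_similarity(test_emb.unsqueeze(0), train_embs, dim=1)  # (k,)
    attention = F.softmax(sims, dim=0)                                    # (k,)
    return attention @ train_labels_onehot                                # (n_classes,)
```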

One detail to notice is that the LSTM here is a bi-directional LSTM. Why is that so? We do not want the LSTM to depend on the order of the training samples given as input, and by going bi-directional, processing the sequence in both forward and reverse directions, we try to counter that dependence on the order.

(Refer Slide Time: 11:01)

An improved idea, called Relation Networks, was published in CVPR 2018. In this approach, the authors question the use of a fixed cosine metric; instead, the method proposes to learn a data-driven nonlinear metric. This enables the model to capture complex nonlinear relationships among the data samples, and thus generalize better to novel classes.

In fact, relation networks also extend easily to the more challenging zero-shot setting, where you have no samples for the unseen classes. Here is the overall schematic. You still have S and Q, your train samples and your query sample. They go through a learned embedding $f_\theta$, where $\theta$ are the parameters of that network. You then take the class mean $\mu$ over the set of samples belonging to a class, and compare the query sample to that mean representation of the class.

How do you compare? That comparison is also learned by the model, instead of using the cosine distance. In the training phase, you meta-learn both the embedding module f and the relation module, which learns the relationship, or distance metric, between representations. At test time, given a query sample, you use its relation scores in the embedding space to the mean of each training class to finally predict the class of the query sample.

(Refer Slide Time: 13:00)

Here is the overall architecture. You have your training set in a given meta learning episode, and a query sample that comes in as a separate test sample. All of these go through the same embedding module $f_\varphi$. You get feature maps corresponding to the query $x_j$ and to all the $x_i$'s from S. Different from matching networks, relation networks concatenate these feature maps: the yellow bar here corresponds to the features of $x_j$, and each of the other colored bars is the features of one of the $x_i$'s in the train set of that meta learning episode. What happens after you concatenate? The result is fed to a relation module g, whose parameters are also learned, which finally produces a scalar score for the similarity between $x_j$ and each of the $x_i$'s, given as a vector of relation scores. Finally, by taking a softmax, or an argmax, over the relation scores, a final one-hot vector can be predicted.

How is this network trained? The relation scores $r_{i,j}$ are critical for training the network. The relation score is $r_{i,j} = g_\phi\big(C(f_\varphi(x_i), f_\varphi(x_j))\big)$, where C denotes the concatenation, and the loss function is
$$\varphi, \phi \leftarrow \arg\min_{\varphi, \phi} \sum_{i=1}^{m} \sum_{j=1}^{n} \big(r_{i,j} - \mathbf{1}(y_i == y_j)\big)^2.$$
This says that whenever $y_i$ equals $y_j$, that is, the query sample's label matches a train sample's label, we want $r_{i,j}$ to be 1; remember the term is squared, so the overall quantity is non-negative. We want $r_{i,j}$ to be 1 in that scenario because the similarity is high, and whenever the labels do not match, $r_{i,j}$ is pushed towards 0.
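
A minimal PyTorch-style sketch of this training objective is shown below, using flat feature vectors for simplicity instead of convolutional feature maps; `f_embed` and `g_relation` are assumed embedding and relation modules, not the authors' code.

```python
import torch
import torch.nn.functional as F

def relation_scores_and_loss(f_embed, g_relation, support_x, support_y, query_x, query_y):
    """Concatenate embedded support/query features, score each pair with the learned
    relation module g, and regress scores to 1 when labels match and 0 otherwise."""
    s_feat, q_feat = f_embed(support_x), f_embed(query_x)          # (k, d), (q, d)
    k, q = s_feat.size(0), q_feat.size(0)
    pairs = torch.cat([s_feat.unsqueeze(0).expand(q, -1, -1),
                       q_feat.unsqueeze(1).expand(-1, k, -1)], dim=2)   # (q, k, 2d)
    scores = g_relation(pairs.view(q * k, -1)).view(q, k)          # relation scores r_ij in [0, 1]
    targets = (query_y.unsqueeze(1) == support_y.unsqueeze(0)).float()
    return scores, F.mse_loss(scores, targets)
```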

(Refer Slide Time: 15:18)

How is this used for few-shot or zero-shot learning? In the few-shot scenario, exactly what we described is done: you take an element-wise sum over the embedding module outputs to get your class feature map, combine the class feature map with the query feature map through the concatenation we just talked about, and then compute the relation scores to finally get the predicted label.

For zero-shot learning, because you may not have any samples at all for some of your classes, you need what are known as semantic class embeddings. You get some meta information about a class: it could be a set of attributes, any other textual description of the class label, or simply a Word2Vec or other word-level representation of the class label itself. Now, the concatenation and the relation module compare the embedding of the class label with the embedding of the query image, and that similarity is used to give the final outcome.

(Refer Slide Time: 16:39)

That is about embedding learning methods. Let us now move to the second kind, data-based methods, which are hallucination or feature synthesis methods.

(Refer Slide Time: 16:52)

The intuition with these methods is learning to augment. We saw learning to compare, now we
talk about learning to augment. Here, we learn a generative model to hallucinate new novel class
data for data augmentation. By generating more data, we reduce the few-shot or zero-shot learning problem to a standard supervised learning problem. How do we do this? A standard approach is to learn a generator conditioned on some meta information, using the data in the base classes; we then generate novel class features conditioned on the unseen class meta information, and finally we train a classifier on the base class samples and the generated novel class samples.

To illustrate this, you have your set of training samples in your meta learning episode and a test sample. Using some meta information and the train samples themselves, you train a generative model whose outputs are finally given to a classifier. Of course, since it is a generative model, you may have an adversarial component to decide whether the generated features are real or fake, but you also have a classification component that learns how to classify the generated features. Let us see a more concrete example.

(Refer Slide Time: 18:32)

A popular method in this space is f-CLSWGAN. Here the goal is: given a train set of seen classes, learn a conditional generator G that receives as input z, the standard random noise given to a GAN, and a class embedding corresponding to each class. Using these, the generator learns to generate image features. We do not need to generate images here, because our goal is not really to generate pretty looking images; our goal is to generate image representations that can be classified, which makes the problem more feasible.

To ensure that the features generated by this approach are good, there is also a discrimination module, which minimizes a classification loss over the generated features. So, you see here that f-CLSWGAN conditions on some attributes belonging to different classes to generate image features. And what kind of image features are generated? These are image features that are outputs of CNN models on standard images.

So if you used a VGG or a ResNet as a CNN, you would get a certain representation of the
image. And we are now asking the GAN to generate similar vector features instead of generating
images. How do you extend this to Few shot learning? So, this would be the approach for Zero
shot learning, where there are no samples for certain classes, which is why we need the class
embedding.

However, in few-shot learning, you already have samples from some of the classes and they can
also be used to help with generation. So, you may not need the class embeddings for few-shot; you may instead use a generator with just noise as the input when you extend such an
approach to few-shot learning.

(Refer Slide Time: 20:53)

How is such a GAN trained? This method uses two components, CLS and WGAN, which is the reason for its name. The WGAN component uses a GAN loss, but with a variant known as WGAN, or Wasserstein GAN, which tries to mitigate the mode collapse issue in GANs; that is the GAN loss used here. There is also a classification loss used to classify the generated features. So, you can see here that you have some images in your dataset and a CNN whose output gives you real features x.

On the other hand, you have your generator, which takes in a noise vector and some class information and generates image features $\tilde{x}$. A discriminator then takes x and $\tilde{x}$ along with the conditional class information and says whether they are real or fake. But there is also another loss, which takes the generated features and tries to classify them into one of the classes in your class universe. The final loss is a combination of the GAN loss and the classification loss, weighted suitably using a coefficient $\beta$.

So just to repeat: to train the classifier, you use the pre-trained generator to generate samples of novel, unseen classes conditioned on their class embeddings. The classifier is often a softmax classifier trained on the train set, which gives you the real x's, and on the generated unseen class image features. So, both x and $\tilde{x}$ are used together to train the classifier module in this framework.
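
Here is a minimal PyTorch-style sketch of the generator side of such an objective, combining a WGAN critic term on conditionally generated features with a classification loss weighted by beta; `G`, `critic` and `classifier` are assumed modules, and conditioning by concatenation is an illustrative choice rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def fclswgan_generator_loss(G, critic, classifier, z, class_emb, labels, beta=0.01):
    """WGAN-style critic term on generated class-conditional features
    plus a beta-weighted classification loss on those same features."""
    fake_feats = G(torch.cat([z, class_emb], dim=1))                 # generated image features
    wgan_term = -critic(torch.cat([fake_feats, class_emb], dim=1)).mean()
    cls_term = F.cross_entropy(classifier(fake_feats), labels)       # classify generated features
    return wgan_term + beta * cls_term
```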

Hope that gave you an understanding of Hallucination or Feature Synthesis methods. Now let us
move to the third category, which are Parameter Initialization methods, which try to use prior
knowledge to alter the search strategy.

(Refer Slide Time: 23:23)

The key intuition here is learning to fine-tune. We saw learning to compare, we saw learning to
augment and now we talk about addressing few-shot using learning to fine-tune. What does that
mean? We learn a set of meta model parameters which are common to all classes. In this context,
we say classes are tasks, each class is a task. So, we learn a set of meta parameters that can be
applied to all classes or tasks in such a way that with very few samples for a given class, you can
fine tune those meta model parameters to learn the parameters for that class or task.

So, here is the overall intuition of how this works. For each meta-learning episode, you have $D_{train}$ and $D_{test}$. You update task-specific parameters $\phi_i$ to minimize the loss for that class. Remember, you have $\theta$, an overall set of model parameters with which you initialize your model; you then fine-tune it to solve one particular class in your current meta-learning episode. Then, across all the tasks in a given meta-learning episode, you update the meta-parameter $\theta$ to minimize the loss across all of those classes.

In this diagram, $\phi_1$ is the model parameter for one specific task, and $\phi_2$ and $\phi_3$ are the model parameters for other specific tasks. We want to learn a $\theta$, shown here, that can easily be fine-tuned to get $\phi_1$, $\phi_2$ or $\phi_3$. Once you learn $\phi_1$, $\phi_2$, $\phi_3$ in a given meta-learning episode, you update $\theta$ using a weighted sum of the losses corresponding to each of those tasks.

What happens at test time? If you have a few samples from one of those few-shot classes, and you have already learned $\theta$, your meta-model parameters, you can now fine-tune $\theta$ using those few samples to get the model for that particular few-shot class.

(Refer Slide Time: 26:04)

Let us see this in more detail. The most popular approach in this category is called MAML, which stands for Model-Agnostic Meta-Learning. There have been several improvements of MAML over the years, but we will talk about the basic method, where the idea is very similar to what we just described. The task-specific parameters, say $\phi_i$, are obtained through optimization. The overall formulation goes like this. Remember, the goal of this category of methods is to learn a prior such that the model can easily adapt to new tasks. Remember, we are talking about this in the context of few-shot learning, where we expect a few samples to be available for those few-shot classes.

And by prior here, we mean a good parameter initialization, which is what $\theta$, the meta-model parameters, try to achieve. For the task-specific update, you start with any initialization of $\theta$, and in each meta-learning episode you update the task-specific parameters $\theta_i'$, given by $\theta_i' = \theta - \alpha \nabla_\theta L(\theta, D^i_{train})$. That gives you the parameters for that specific class in that meta-learning episode.

Once this is done, $\theta$ is then updated using $\theta \leftarrow \theta - \beta \nabla_\theta \sum_i L(\theta_i', D^i_{test})$, where of course the images now come from the test set of the episode. Visually, you can look at it as we just explained on the previous slide: you are trying to learn a $\theta$ which, if fine-tuned, can easily get you $\theta_1^*$, $\theta_2^*$ and $\theta_3^*$, the ideal model parameters for class 1, class 2 and class 3.

So, by meta-model parameters, we mean the $\theta$ which can be easily fine-tuned. If you observe the task-specific update and the meta-update, you would notice that you can replace $\theta_i'$ inside the loss by $\theta - \alpha \nabla_\theta L(\theta, D^i_{train})$. When you substitute it that way, you see that

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_i L\big(\theta - \alpha \nabla_\theta L(\theta, D^i_{train}),\; D^i_{test}\big).$$

So in a sense, this becomes equivalent to computing second-order derivatives with respect to $\theta$. Nevertheless, this gives us an efficient and effective way to compute the updates to $\theta$ using this meta-learning approach.

(Refer Slide Time: 29:10)

Here is an algorithmic view of the same methodology. You randomly initialize $\theta$ and sample a set of tasks for a given meta-learning episode; remember, classes and tasks mean the same thing here. For each task in that episode, you sample a set of data points from that class, evaluate the gradient of the loss, and update the task-specific parameters $\theta_i'$. You then sample another set of data points from the same classes for the meta-update, and perform the meta-update by combining the gradients of all the task-specific losses. That gives you your meta-model parameter update and completes one meta-learning episode. You then take the next set of tasks and repeat this process.
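Here is a minimal sketch of this inner and outer loop. It assumes a recent PyTorch (torch.func.functional_call), a simple task-batch interface and fixed step sizes; these are illustrative simplifications, not the original MAML implementation.

    import torch
    from torch.func import functional_call

    def maml_episode(model, loss_fn, tasks, alpha=0.01, beta=0.001):
        # `tasks` is a list of ((x_tr, y_tr), (x_te, y_te)) pairs, one per sampled class/task.
        names = [n for n, _ in model.named_parameters()]
        theta = [p for _, p in model.named_parameters()]
        meta_grads = [torch.zeros_like(p) for p in theta]

        for (x_tr, y_tr), (x_te, y_te) in tasks:
            # inner (task-specific) update: theta_i' = theta - alpha * grad_theta L(theta, D_train^i)
            loss_tr = loss_fn(model(x_tr), y_tr)
            grads = torch.autograd.grad(loss_tr, theta, create_graph=True)
            theta_i = [p - alpha * g for p, g in zip(theta, grads)]

            # loss of the adapted parameters on D_test^i (run the model with theta_i substituted in)
            preds_te = functional_call(model, dict(zip(names, theta_i)), (x_te,))
            loss_te = loss_fn(preds_te, y_te)

            # gradient w.r.t. the original theta; it flows back through the inner step
            grads_te = torch.autograd.grad(loss_te, theta)
            meta_grads = [mg + g for mg, g in zip(meta_grads, grads_te)]

        # outer (meta) update: theta <- theta - beta * sum_i grad_theta L(theta_i', D_test^i)
        with torch.no_grad():
            for p, mg in zip(theta, meta_grads):
                p -= beta * mg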

(Refer Slide Time: 30:29)

We said that MAML was for few-shot learning; can this be used for zero-shot learning? A more recent approach shows how this can be done. It is called Meta-ZSL, or meta-learning for generalized zero-shot learning. Here, the idea is to learn a GAN conditioned on class attributes, train it using a meta-learning framework like MAML, and thereby facilitate generalization to novel classes.

So, the key idea is to use the meta-learning framework where, in each episode, you have a set of training classes and test classes, and within each episode we simulate a zero-shot learning kind of setup. Let us see how this is done. There is a GAN coupled with a classifier module, which leads to three meta-learners: a generator G, which learns meta-model parameters; a discriminator D, which again learns meta-model parameters; and a classifier C, which checks the goodness of the generated features.

So, you can look at this as a combination of a feature synthesis method for zero-shot learning with a MAML-style meta-learning approach. Let the parameters of the modules D, G and C be $\theta_d$, $\theta_g$ and $\theta_c$, and denote by $\theta_{gc}$ the combined set of parameters $[\theta_g, \theta_c]$. Now, how is this learning done?

You are given images and attributes for seen classes: by passing an image through a CNN, you get image features $X$, and you have the corresponding attributes $a$. Then, given noise and attributes, you ask the generator to generate features $\tilde{X}$. Given $\tilde{X}$ and $a$, the discriminator tries to say whether this is real or fake. And then you have a classifier, which says whether the generated features belong to a particular class. So, this is similar to the feature synthesis framework.

(Refer Slide Time: 33:01)

But there is a key component here, the meta-learning framework, which differentiates this from f-CLSWGAN. The objective now is for the discriminator to maximize the likelihood assigned to $(x, a_c)$, the true image features and the corresponding attributes, and to minimize the likelihood assigned to the generated image features with the corresponding attributes. On the other hand, the generator tries to fool the discriminator by taking the attributes and noise, generating an $\tilde{X}$, and making the discriminator call it real; and you also have the classification loss. Beyond this objective, you now have the meta-learning episodes.

For the basic updates, given a meta-model parameter $\theta_d$, you now obtain $\theta_d'$ for each of your tasks or classes in a meta-learning episode, very similar to MAML.

(Refer Slide Time: 34:08)

And then you perform a meta-update at the end of that meta-learning episode, which combines the losses for all of the tasks in that episode and updates $\theta_d$, $\theta_{gc}$ and so on. So, you put f-CLSWGAN in a meta-learning framework and you get Meta-ZSL.

(Refer Slide Time: 34:36)

Once this is done, inference is performed by generating unseen-class samples. Using the learned generator, you get $\hat{x}$, assuming you can give the attributes of the unseen class as input to condition the generator. And once the unseen-class samples, or rather the features of the unseen-class samples, are generated, you can use the classifier; you can also train any other classifier to predict the class label for a zero-shot class or any other class in the problem being considered.

(Refer Slide Time: 35:13)

Hope that gave you an overview of different kinds of methods for few-shot and zero-shot learning. There are far more; if you would like to understand them better, please go through the excellent blog article on meta-learning, Learning to Learn Fast, by Lilian Weng, a nice YouTube video on few-shot learning with meta-learning, a tutorial at ICML 2019 on meta-learning, and a very nice introduction to zero-shot learning.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 75
Self-Supervised Learning

The next topic that we will get into for this week is a very popular one these days, called Self
Supervised Learning.

(Refer Slide Time: 00:27)

If you recall our discussion around unsupervised learning, the tasks that could be categorized under this topic include clustering, which groups data into clusters to reveal something meaningful about the data; dimensionality reduction, which learns low-dimensional representations of data that could be useful in some tasks; and data generation, which is where we talked about GANs, VAEs and other generative models, where the goal is to generate data belonging to a given training distribution.

The last one we will include now is the broader task of representation learning. In essence, several of the methods that we have talked about so far do perform representation learning. The specific context in which we approach representation learning now is unsupervised learning, where our goal is to learn a distribution that implicitly reveals a data representation that can eventually help a downstream task. And this leads us to the topic of self-supervised learning.

(Refer Slide Time: 01:54)

What is self-supervised learning? It is a twist on unsupervised learning, where we exploit unlabeled data to obtain labels. There is no explicit annotation or class label associated with the data; we exploit the unlabeled data itself to get some kind of labels and induce a supervised learning model on it. Specifically, we design supervised tasks, called pretext or auxiliary tasks, that learn meaningful representations through which the model becomes better prepared to solve a downstream task, such as classification, semantic segmentation or any other supervised learning task.

A sample task in this context could be to predict a certain part of the input from another part,
somewhat like fill in the blanks of a given input.

(Refer Slide Time: 03:02)

So, what part of an input can you then predict, or what can you learn? You can predict any part of the input from any other part. With images and videos, you could predict the future from the past, the future from the recent past, the past from the present, or the top from the bottom in the case of an image. In a broader sense, you could predict the occluded from the visible.

In general, you pretend there is a part of the input that you do not know, and try to predict that. Those are the different pretext tasks that you can use in self-supervised learning. We will see a set of more concrete examples over the remainder of this lecture.

(Refer Slide Time: 03:49)

Why do we do this? Why self-supervised learning? We know that deep supervised learning works well when there are large amounts of labeled data. However, in the real world, we also have large amounts of unlabeled data. How can we exploit this? We take inspiration from humans: humans do not need supervision to learn everything. They learn, or rather we learn, by observation, prediction and self-feedback.

You try something and you see how that boomerangs on you, or how that affects you. And then
you keep recalibrating based on a self supervised or a self feedback, paradigm of learning. That
is the idea of self supervision.

(Refer Slide Time: 04:45)

So, how do you do self-supervision in computer vision? Let us see a few examples of tasks that are fairly popular. One such example, which we already saw in the GAN context but which also becomes relevant in a self-supervised context, is image in-painting. The goal of this task is to occlude or remove a certain part of the image and ask a network to complete the image. No external labels required, no external annotation required.

In this particular work, called Context Encoders, published in CVPR 2016, an encoder-decoder framework was learned to perform image in-painting. A context encoder, as it was called, was trained to fill in the missing parts. The mask of the missing region in this example is a square, but it could be of any shape. The encoder in this particular work was derived from an AlexNet architecture, and the final model was trained using an $L_2$ loss between the completed image and the original ground truth.

In addition, this method also introduced an adversarial loss. What is the adversarial loss? A discriminator looks at the model-completed image and the original image and says which of them is real and which is fake.

(Refer Slide Time: 06:31)

Using these two losses, this work showed fairly good results. You can see this example of an input where the central region is missing; next to it is a human artist's impression of filling in the details, and then the result of the same context encoder work with only the $L_2$ loss. You can see that in the central patch, the $L_2$ loss has an averaging effect and does not give sharp details at every pixel, because it minimizes error across the pixels and does not focus on each pixel individually. Adding an adversarial loss to the context encoder improves performance and makes the final in-painted image more realistic.

(Refer Slide Time: 07:22)

Another popular example of a self-supervised pretext task is solving jigsaw puzzles. In this case, the objective is to teach the model that an image is made of multiple parts and to coax it into learning the mapping of parts to objects as well as their spatial arrangement in an image. This is done by solving a 9-tile jigsaw puzzle. But how do you teach a neural network to solve a jigsaw puzzle?

(Refer Slide Time: 08:08)

This is done using a neural network. Nine tiles, taken from the image (it could be the entire image or a part of it), are shuffled by a random permutation, and a neural network is asked to predict the right permutation. How is this implemented? Given nine tiles, the possible permutations are assigned index values, and the job of the neural network is to predict the index associated with the right permutation for this jumbled-up set of patches. What is the loss? A cross-entropy loss on the index of the permutation can be used to train the network.

(Refer Slide Time: 09:01)

Another popular Self Supervised pretext task is predicting rotations. Given an image, multiple
rotations of the image are performed. You can see here, this is the unrotated image. This is
rotated by 90 degrees, rotated by 180 degrees, 270 degrees, so on and so forth. And the same
CNN model in all of these cases, is used to predict the rotation angle. This can help the network
learn images from the domain, as well as perhaps learn certain artifacts, such as an object's
location in an image, its type, pose, etcetera.

(Refer Slide Time: 09:50)

So, K rotations are applied, and the model outputs a probability distribution over all rotations. A log loss, your standard cross-entropy loss over the set of rotation categories that you have as outputs, is used for training. In this particular loss, g is the transformation function, which is our rotation, and F denotes the CNN shown in this picture.
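As a rough illustration, here is a minimal sketch of how one might create rotation-prediction training pairs and the associated loss. The four-way rotation set and the model interface are assumptions for illustration, not the original implementation.

    import torch
    import torch.nn.functional as F

    def rotation_pretext_batch(images):
        # images: (N, C, H, W); returns 4N rotated images with labels 0: 0°, 1: 90°, 2: 180°, 3: 270°
        rotated, labels = [], []
        for k in range(4):
            rotated.append(torch.rot90(images, k, dims=(2, 3)))   # rotate in the spatial plane
            labels.append(torch.full((images.size(0),), k, dtype=torch.long))
        return torch.cat(rotated), torch.cat(labels)

    def rotation_loss(model, images):
        x, y = rotation_pretext_batch(images)
        logits = model(x)                  # the CNN F predicts a distribution over the 4 rotations
        return F.cross_entropy(logits, y)  # standard log loss on the rotation category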

(Refer Slide Time: 10:26)

Another interesting Self Supervised task is Image Colorization. The goal here is to take a
grayscale image, and then predict the colored version of the grayscale image. The assumption
here is that you already have colored images in your dataset. So, you take grayscale versions of
them and now ask your network to predict the color version of the grayscale image. Once you
have trained such a network, remember, you could use this network as an initialization for say a
semantic segmentation task or any other pixel wise classification task such as Surface normal
prediction or depth estimation, so on and so forth, just as examples.

So, in this particular case, the image is mapped to a distribution over 313 quantized (a, b) color pairs. The model does not predict arbitrary real values in the ab output space; instead, about 313 bins are created and the model has to predict one of those 313 values, making this a classification problem. Why the Lab space? Remember, Lab is one of the color spaces, where L is the grayscale intensity and the a and b channels bring the color into the final representation.

Why the Lab space and not RGB? The answer is simple. In Lab space, the L channel is given as input; you only have to predict a and b and add L to get the image, which means you need to predict only 2 channels instead of 3, making it an easier task. Secondly, why not predict the color value as it is? Why quantize and then predict? The researchers in this paper found that quantizing and then predicting gives sharper colors, whereas predicting the actual color value necessitates an $L_2$-style loss, which can once again smoothen the colors over the image and not give a sharp coloring effect.

(Refer Slide Time: 12:57)

More recently, over the last year in particular, self-supervised learning has been dominated by what are known as contrastive learning based methods, where the overall goal is to learn representations by contrasting positive and negative samples; you can go back and revisit triplet loss, contrastive loss, margin loss, ranking loss and so on. The goal is the same, but the approach here is slightly different because we do not have class labels.

Let us see what the difference is. The goal here is to learn an encoder that produces a representation of an image such that a similarity score between an image and a similar sample is greater than the score between the same image and a negative sample. Now, if you had class labels, positive and negative would be easy to define, similar to what we saw with triplet loss. How to go about it here is what we will talk about.

So, in this particular case, a softmax classifier is used at the end to separate the positive and negative samples; we will soon talk about how those positive and negative samples are obtained. The general form of the loss used for this set of tasks is the negative log of a softmax, where the numerator is the (exponentiated) similarity score between the current sample and a similar sample, and the denominator sums the similarity scores between the current sample and all samples, both positive and negative; roughly, $L = -\log \frac{\exp(\mathrm{sim}(x, x^{+})/\tau)}{\sum_{k}\exp(\mathrm{sim}(x, x_{k})/\tau)}$.

Generally, with such softmax operators, there is also a temperature hyperparameter $\tau$, which is added to help learning. What is the role of a temperature parameter in softmax? Depending on the value of $\tau$, it either smoothens or sharpens the softmax distribution. If $\tau$ is greater than 1, it softens the distribution: a distribution that was perhaps concentrated on one label now gets spread over all possible labels. On the other hand, if $\tau$ is less than 1, the value inside the exponent increases, and any increase in the argument further increases the value of the exponential, so the softmax distribution gets sharper around certain modes. In short, $\tau > 1$ smoothens the distribution, and $\tau < 1$ sharpens the modes of the distribution.
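Here is a minimal sketch of this softmax-with-temperature contrastive loss for a batch of embedding pairs. The cosine-similarity choice and the batch layout (row i of the two views forms a positive pair, all other rows act as negatives) are assumptions for illustration, in the spirit of InfoNCE-style losses.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(z1, z2, tau=0.1):
        # z1, z2: (N, d) embeddings of two views of the same N samples
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)  # cosine similarity via dot products
        logits = z1 @ z2.t() / tau                               # (N, N) similarity matrix / temperature
        targets = torch.arange(z1.size(0), device=z1.device)     # positives sit on the diagonal
        return F.cross_entropy(logits, targets)                  # -log softmax of the positive entry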

(Refer Slide Time: 16:09)

One of the earliest methods in this space is known as MoCo, which was first made public about a year ago and was published in CVPR 2020. MoCo stands for Momentum Contrast, and this method proposes unsupervised learning of visual representations using the idea of a dynamic dictionary lookup. The dictionary, stored here as a set of keys $x_0^{key}, x_1^{key}, x_2^{key}$ and so on, is structured as a large first-in-first-out queue of encoded representations of data samples.

Given a query sample $x_q$, an encoder transforms it to a representation $q$. Similarly, all the key samples are transformed by another encoder, which gives us $k_0, k_1, k_2$ and so on. We then measure the similarity between $q$ and $k_0$, $q$ and $k_1$, $q$ and $k_2$, and so on, pairwise, and use the softmax log loss that we saw on the previous slide. One question is: how do we ensure that $q$ is similar to one of these $k$'s?

Remember, when we used episodes for few-shot learning, we would ensure that the query class was one of the key classes. But now, how do you ensure that? It is ensured by making one of the keys an augmented version of $x_q$. And, as I just mentioned, it is trained using the loss that we saw on the previous slide, the log loss with a softmax and a temperature hyperparameter.

(Refer Slide Time: 18:21)

Why is this called Momentum Contrast? The full network is trained end to end, which means both the query and the key encoders are updated based on the loss we just talked about, and the dictionary is maintained as a queue of data samples. The interesting contribution of this work was to have this dictionary of encoded keys come from the current mini-batch as well as from immediately preceding mini-batches.

So, the dictionary size would be larger than a mini-batch size. All the data points that are not queries in a mini-batch become keys for that mini-batch, but you may also have a few images from previous mini-batches as keys. It is a queue: as newer keys come in from the current mini-batch, the oldest keys, from the oldest mini-batch visited, are pushed out of the queue, and this is repeated over and over again. It is called momentum because the update is based on the philosophy of momentum, where the idea is to carry over information from previous iterations, here the samples rather than the gradient, into the current iteration.

(Refer Slide Time: 19:55)

One question here is: we say that we have an encoder as well as a momentum encoder. Why two encoders? Can't we use the same encoder in both places? It would be easier to update. Unfortunately, we cannot do that, because the representation may not be consistent for both the query and the keys: the momentum encoder is more stable than the query encoder, in the sense that it maintains a larger number of data points which stay consistent across a couple of mini-batches at least, whereas the query encoder can vary with each query image.

So, the task of the two encoders is slightly different, and that is why two different encoders are used to get the corresponding embeddings. While the query encoder is updated using normal backpropagation, the key or momentum encoder is updated using a momentum rule, $\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$, where $\theta_q$ are the parameters of the query encoder. That is the idea used here to update both the query encoder and the momentum encoder.
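A minimal sketch of this momentum update for the key encoder is shown below; the module names and the momentum value are assumptions for illustration.

    import torch

    @torch.no_grad()
    def momentum_update(key_encoder, query_encoder, m=0.999):
        # theta_k <- m * theta_k + (1 - m) * theta_q ; no gradients flow into the key encoder
        for p_k, p_q in zip(key_encoder.parameters(), query_encoder.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)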

(Refer Slide Time: 21:27)

Another framework that came up around the same time, published in ICML 2020, is known as SimCLR, where the goal is closer to standard contrastive learning: the model learns by maximizing agreement between differently augmented views of the same data sample. How is this done? In a given mini-batch, if you have n samples, you can obtain 2n samples using two different augmentations of each.

Now, a positive pair is formed by giving an image on one branch and an augmented (for instance, rotated) version of the same image on the second branch. Then, for that one pair, there exist 2(n − 1) negative pairs, formed with the other (n − 1) images of the same mini-batch. So, we automatically have similar and dissimilar pairs in each mini-batch without changing anything, and this is useful to learn representations.

In SimCLR, there are two networks: an f network, which performs the representation learning step, and a g network, which is then used to obtain a projection over which agreement is maximized. You could consider this similar to the relation network that we saw in few-shot or zero-shot learning, where we first learned an embedding and then learned a scoring mechanism to compare those embeddings. This is similar in that sense, but the goal here is not few-shot or zero-shot learning, but simply to learn representations in an unsupervised setting, using the data in a mini-batch itself.

(Refer Slide Time: 23:39)

SimCLR and MoCo are direct competitors, so one can compare the two. SimCLR has some advantages: it uses strong data augmentation techniques and an MLP projection head over the representation layer, so you have a g on top of f, your embedding network. However, one disadvantage of SimCLR is that the number of negative samples is limited by the batch size. In MoCo, that can be expanded into a dictionary comprising the images of many mini-batches; MoCo decouples the dictionary from the mini-batch and allows more negative samples, which helps learning.

You can see here that in terms of number of parameters, when SimCLR has a small number of parameters, it still gets a reasonable accuracy compared to a fully supervised setting; it gets close to 70 percent ImageNet Top-1 accuracy, whereas a fully supervised method is slightly over 75.

However, if the SimCLR network has more parameters, up to about 4x, say around 620 million, you can see the performance reaches close to supervised learning without using any labels, in terms of ImageNet Top-1 accuracy. More recently, there has been a version known as MoCo v2, which combines the benefits of MoCo and SimCLR. We leave this for your reading after this lecture.

(Refer Slide Time: 25:38)

The last method that we will talk about in the context of self supervised learning is another
recent method called Bootstrap your Own Latent or BYOL, where the method claims to achieve
state of the art results without depending on any negative sample at all. How does it achieve this?
It bootstraps the outputs of a network itself to serve as targets, remember, it is self supervised
learning. So, every input is passed through two networks, an online network shown in blue, and a
target network shown in red. So, what is the objective?

The objective is that the online network predicts the target network's representation of another augmentation of the same image. So, you give a certain image, let the online network compute a series of representations, and ask it to provide a prediction; that prediction must match the output provided by the target network for an augmented version of the same input. Here t and t' are two different transformations or augmentations of the same input. So, why is the bottom one used as the target?

(Refer Slide Time: 27:15)

Not really; you can flip it, as we will see soon. The loss for BYOL is an $L_2$ loss between the prediction of the online network, $q_\theta$, and the output of the target network, which in this case is given by $z_\xi$. One can flip the networks to also get the loss the other way. In both cases, $q_\theta$ and $z_\xi$ are $L_2$-normalized. When flipped, the lower network becomes the online network and the top network becomes the target network, so one can switch the roles of $v'$ and $v$.

And the final learning is through a combination of the original loss $L$ and the loss obtained by flipping, denoted $\bar{L}$, where the bottom network is online and the top network is target.
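Here is a minimal sketch of the normalized $L_2$ loss used above; for unit-normalized vectors it reduces to $2 - 2\cos$ similarity. The predictor/projection interfaces and the explicit stop-gradient are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def byol_loss(q_online, z_target):
        # q_online: prediction of the online network; z_target: output of the target network (N, d)
        q = F.normalize(q_online, dim=1)
        z = F.normalize(z_target.detach(), dim=1)      # stop-gradient: the target branch is not backpropagated
        return (2 - 2 * (q * z).sum(dim=1)).mean()     # ||q - z||^2 for unit vectors

    # Symmetrized (flipped) loss: swap which augmented view goes through which network and add the two terms:
    # total = byol_loss(q1, z2) + byol_loss(q2, z1)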

(Refer Slide Time: 28:19)

Your homework is to read the excellent article on Self-Supervised Representation Learning by Lilian Weng. Also, try to go through how MoCo v2 combines MoCo and SimCLR.

(Refer Slide Time: 28:38)

Here are some references.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 76
Adversarial Robustness

The next topic that we will discuss, again, a contemporary one is Adversarial Robustness.

(Refer Slide Time: 00:24)

In supervised machine learning, the distributions that we deploy machine learning models on, that is, the test distributions, are often not the distributions that the models were trained on, unfortunately. Whenever there is such a distribution shift, machine learning models, especially deep learning models, suffer and tend to perform poorly.

(Refer Slide Time: 00:53)

One scenario where this issue becomes even more prominent is what are known as adversarial examples. This is different from the word "adversarial" in GANs; it has a different notion here. Adversarial examples are data points that are indistinguishable from your dataset but lead to wrong predictions. Here you see an example of an image of a pig, which a deep learning model said was a pig with 91 percent confidence. If you add a little bit of noise, scaled by just 0.005, to this pig, the same deep learning model now calls it an airliner with 99 percent confidence.

This is a serious problem, especially considering that humans do not seem to suffer from it: for us humans, the output image is still quite clearly a pig. So, why do deep neural networks suffer from this problem? Unfortunately, it is still not well understood. But this question has led to a body of work over the last few years on adversarial attacks, which try to hack or disturb the model, and adversarial defenses, which try to protect the model against adversarial attacks. Let us see both of these kinds over the rest of this lecture.

(Refer Slide Time: 02:36)

In general, it is now understood that supervised frameworks are vulnerable to adversarial attacks in any domain. Here is an example of a detection and segmentation task. Given an image, the first detection output calls these dogs, and you also get segmentation masks around them as dogs. But when a small adversarial perturbation is added to this image, the magenta box that you see here is now labeled a train with fairly high confidence, and the masks too now reveal a different label for the same image.

Similarly, for text, there is an original input with a certain sentence, and the prediction is that the tone is positive with 77 percent confidence. If an adversarial example, a small change, is made to the sentence, and as you can see the change here is very small, the prediction now is that the tone is negative with 52 percent confidence. And if the word "film" is changed to "footage", which is semantically similar, even then the prediction becomes negative with 54 percent confidence. This is worrisome, again, considering that humans do not suffer from such problems.

(Refer Slide Time: 04:17)

Similarly, here is an example for an audio-based network. Given an audio signal, the neural network transcribes it as "it was the best of times, it was the worst of times". When a noise signal multiplied by a small scalar constant is added to the signal, the same neural network now transcribes it as "it is a truth universally acknowledged that a single..."; the meaning completely changes.

(Refer Slide Time: 04:49)

Now, coming to what adversarial attacks are: over the years, methods of different kinds have been developed, and there are a few different ways of categorizing them. If one looks at the threat model, attacks can be called white-box or black-box attacks. In white-box attacks, the attacker assumes access to the model's parameters, whereas in black-box attacks, the attacker does not have access to the model's parameters; the perturbation perhaps comes from a different model, or no model at all, to generate these adversarial images.

Remember, adversarial attacks are intended to cause problems for the model, so you could look at studying them as something akin to ethical hacking. Similarly, on the basis of the objective, adversarial attacks are differentiated into targeted and untargeted attacks. In an untargeted attack, the aim of the attacker is to force the model to misclassify, whereas in a targeted attack, the aim is to make the model classify the input as a particular target label. So, in an untargeted attack, if you had an image of a cat, you want to add a perturbation that ensures the model no longer calls it a cat.

In a targeted attack, the attacker wants to add a perturbation that ensures the cat is now called, for instance, a dog; it could be any other label, but you want a specific target label. Targeted attacks could be a big problem in biometrics, where one person may want to add some noise to an image to impersonate somebody else to a deep learning model trained for face recognition.

A third categorization is based on the distance metric used. In all of these adversarial perturbations, the constraint is that the perturbation should not have a very large norm, because you do not want the perturbation to be too large; it is always easy to add a very large perturbation and change the outcome of a model. The challenge is to keep the perturbation as small as possible, and measuring "small" requires a distance, which is done using different $L_p$ norms such as the $L_0$, $L_2$ or $L_\infty$ norm.

With the $L_0$ norm, the measure is the total number of pixels that differ between the clean and adversarial images; with the $L_2$ norm, it is the squared difference between the clean image and the adversarially perturbed image; and with the $L_\infty$ norm, it is the maximum pixel difference between the clean and adversarial images. The $L_\infty$ norm is the one most often used in these methods.

(Refer Slide Time: 08:10)

Let us now see a few methods that perform what are known as white-box adversarial attacks, which means they assume access to the model's parameters in order to attack the model. One of the prominent methods in this context is FGSM, or the Fast Gradient Sign Method, where one computes an adversarial image by adding a pixel-wise perturbation of magnitude $\epsilon$ in the direction of the gradient. In this single-step method, with $x$ the original image, you take the gradient of the loss with respect to $x$, take its sign, and add $\epsilon$ in the direction of that sign: $x_{adv} = x + \epsilon\,\mathrm{sign}(\nabla_x L(\theta, x, y_{true}))$.

Remember that the gradient is a vector with one component per input dimension, so each dimension has its own sign, and for each dimension you add $\epsilon$ times its corresponding sign. What are you doing? You are adding a perturbation that increases the loss, hence resulting in an adversarial perturbation. Being a single-step method, just one update gives an adversarial perturbation, which makes it very efficient in terms of computation time.

If this had to be done as a targeted attack, it is the same method; the only difference is that you now move in the direction of the negative of the gradient, with the gradient computed with respect to the target label: $x_{adv} = x - \epsilon\,\mathrm{sign}(\nabla_x L(\theta, x, y_{target}))$. Remember that by taking the negative, you are trying to reduce the loss with respect to the target label, so you are trying to make the model think that the sample belongs to a different, chosen label.

Here, $x$ is the image, $x_{adv}$ is the adversarially perturbed sample, $L$ is the classification loss, $y_{true}$ is the actual label, $y_{target}$ is the target label for a targeted attack, and $\epsilon$ is the budget, an $L_\infty$ (or other norm) budget that you allow for the perturbation.
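A minimal sketch of the untargeted FGSM update in PyTorch is shown below; the model/loss interfaces, the epsilon value and the clamping to a valid image range are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, x, y_true, eps=8 / 255):
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_true)
        loss.backward()
        # x_adv = x + eps * sign(grad_x L(x, y_true)), then keep pixels in a valid range
        x_adv = x + eps * x_adv.grad.sign()
        return x_adv.clamp(0, 1).detach()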

(Refer Slide Time: 10:47)

A more popular and complete method is known as Projected Gradient Descent (PGD), introduced in 2017. It is considered one of the most popular and widely used methods today, and it is considered a complete method because it does not impose constraints on the amount of time or effort. Unlike FGSM, which is a single-step method, PGD is an iterative method that keeps improving the adversarial perturbation until it succeeds.

How do you do this? You start with your given sample x as the first iteration's adversarial sample. Then, in each iteration, you take a gradient step computed at that iteration's adversarial sample with respect to the true label; remember, you are trying to increase the loss for the true label. You then project the sample back into the $\epsilon$-neighborhood. So, you try to find the sample that takes you to a higher loss.

You project it back to the $\epsilon$-ball within which you allow your perturbation; remember that $\epsilon$ imposes a constraint on the size or norm of the perturbation. This gives you a new adversarial perturbation. You repeat this over multiple iterations, until the classification output changes. In the case of the $L_2$ norm, the step becomes $\epsilon$ times the gradient of the loss divided by the 2-norm of the gradient of the loss; the gradient divided by its 2-norm plays the role that the sign played before.

The visualization here shows a loss surface: the sample is initially in a region of very low loss, and by taking a series of iterative steps, it is moved to a position where the loss is very high; remember, yellow here denotes high loss. In two different runs, the sample ends up at two different locations where the loss is very high.
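A minimal sketch of an $L_\infty$ PGD attack follows; the random start, step size and iteration count are common choices assumed for illustration, not values prescribed by the lecture.

    import torch
    import torch.nn.functional as F

    def pgd_attack(model, x, y_true, eps=8 / 255, alpha=2 / 255, steps=10):
        # random start inside the eps-ball (a common choice)
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y_true)
            grad = torch.autograd.grad(loss, x_adv)[0]
            with torch.no_grad():
                x_adv = x_adv + alpha * grad.sign()           # ascend the loss
                x_adv = x + (x_adv - x).clamp(-eps, eps)      # project back into the eps-ball
                x_adv = x_adv.clamp(0, 1)                     # keep a valid image
        return x_adv.detach()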

(Refer Slide Time: 13:21)

Another popular method, developed in CVPR 2016, is known as DeepFool. In DeepFool, the idea is to consider the decision boundary while coming up with the adversarial perturbation. Consider an affine (linear) classifier given by $f(x) = w^T x + b$. The minimum perturbation needed to change the class of an example $x_0$ is the one that takes it to the decision boundary, the hyperplane $w^T x + b = 0$, and this minimal perturbation is given by $r^* = -\frac{f(x_0)}{||w||_2^2}\, w$, where $x_0$ is the point.

This is a known quantity, and it takes us from any point to the decision boundary, beyond which the class label changes. For a more general differentiable classifier, this method assumes that $f$, and hence the decision boundary, is approximately linear around the current point. It then computes a perturbation that minimizes the 2-norm of the perturbation, subject to the perturbation taking the point onto the linear (local) approximation of the decision boundary around the current point $x_t$, and you keep running this iteratively until the decision on the perturbed point changes from that on $x_0$. So, you find a perturbation that lies on a linear approximation of the decision boundary around the current point, with a 2-norm as low as possible; remember, you want the perturbation to be a very small quantity, and you run this iteratively until the class label changes.

What happens if you have a multi-class classifier? Then the distance is computed to the surface of the convex polyhedron formed by the decision boundaries between all classes. The distance from $x_0$ to the nearest face of this convex polyhedron gives the minimum perturbation you need to add to $x_0$ to change the class label to something else.
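For the binary affine case described above, here is a minimal sketch of the closed-form minimal perturbation $r^* = -\frac{f(x_0)}{||w||_2^2} w$; the 1-D tensor interface and the small overshoot factor are assumptions for illustration.

    import torch

    def deepfool_affine(x0, w, b, overshoot=1e-4):
        # x0, w: 1-D tensors; minimal L2 perturbation that moves x0 onto the hyperplane w^T x + b = 0
        f_x0 = torch.dot(w, x0) + b
        r = -f_x0 / (w.norm() ** 2) * w
        # a tiny overshoot pushes the point just across the boundary so the label actually flips
        return x0 + (1 + overshoot) * r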

(Refer Slide Time: 16:07)

Another white-box adversarial attack that has been used over the years is the CW attack, the Carlini and Wagner attack. It is again an optimization-based adversarial attack, but the problem is now formulated as $\min_\delta D(x, x+\delta)$ such that $f(x+\delta)$ is a particular target label (or any label different from the true label), subject to the constraint that $x+\delta$ also lies between 0 and 1. The $D$ here can be the $L_0$, $L_2$ or $L_\infty$ distance measure.

To ensure that $x+\delta$ yields a valid image, that is, that $x+\delta$ lies between 0 and 1, this method introduces a new variable $\kappa$ and writes $\delta$ as $\delta = \frac{1}{2}[\tanh(\kappa) + 1] - x$. Why is this done? Because if we write $\delta$ this way, $x+\delta$ becomes $\frac{1}{2}[\tanh(\kappa) + 1]$. We know that $\tanh$ has range $-1$ to $+1$, so $\tanh(\kappa) + 1$ has range 0 to 2, and dividing by 2 ensures that this value always lies between 0 and 1.

(Refer Slide Time: 17:47)

One more white-box adversarial attack is the Jacobian-based Saliency Map Attack. In this case, to fool the DNN with small $L_0$ perturbations, the method computes the Jacobian matrix of the logits; remember, logits are the outputs of the neural network before the softmax layer. The Jacobian contains the gradient of every output in the logit layer with respect to every input, which gives you a matrix of gradients.

Once you have the Jacobian matrix, you get an understanding of which input pixels affect which logits. Using this, the method suggests creating an adversarial saliency map. The saliency value for an input dimension $x_j$ is set to 0 if the gradient of the true-label output with respect to $x_j$ is less than or equal to 0, or if the sum of the gradients of the non-ground-truth outputs with respect to $x_j$ is greater than 0. What does this mean? These are directions in which perturbing the pixel does not help the attack, so the saliency map is set to 0 there. In every other case, where the gradient of the true-label output with respect to $x_j$ is greater than 0 and the gradients of the other labels' outputs with respect to the input are less than 0, we would like to use those directions to change the label.

So, this method defines the adversarial saliency map as the gradient of the true-label output multiplied by the absolute value of the sum of the gradients for the other classes. Why the absolute value? Because in the second case, we know that the other gradients are all less than 0, so we take the absolute value. Once you have this adversarial saliency map, you perturb the element with the highest saliency value, so as to significantly increase or decrease the logit outputs of the target or the other classes, as desired.

(Refer Slide Time: 20:33)

The last white-box adversarial attack we will see is known as the Universal Adversarial Attack. The goal here is to find one perturbation for all your examples. If you observe carefully, so far all the adversarial attack methods try to find a perturbation for a given sample that will fool the deep neural network model, but that can be laborious. The aim here is to see if we can find one perturbation that will fool them all.

(Refer Slide Time: 21:13)

How do you do this? Consider the illustration here: $x_1$, $x_2$ and $x_3$ are juxtaposed for convenience; they are not co-located and do not have the same vector value. $\Re_1$ denotes the region corresponding to $x_1$, that is, the ball or set of perturbed values within which the class label for $x_1$ stays the same, and beyond which it changes.

Similarly, you see an $\Re_2$ for $x_2$ and an $\Re_3$ for $x_3$, which are the regions where the label stays the same and beyond which the label changes. With this knowledge, how do you learn the universal perturbation? The method initializes the perturbation $v$ to 0, and for every sample $x_i$ finds an additional $\Delta v_i$ such that the decision changes; it then projects $v$ onto a ball, ensuring that if you found a $\Delta v_i$, the resulting perturbation $v$ still satisfies the $\epsilon$ constraint on its $L_p$ norm.

So, if you started with $x_1$, you would get this particular point as the point at which its label changes. When the next point comes, you may realize that at that perturbation, the label of $x_2$ does not change, so you further add a $\Delta$ for $x_2$, then a $\Delta$ for the next point, and so on. And $v$ is the sum of all of these $\Delta v_i$'s, which is the total perturbation you need to add for any of these points to change its class label.

(Refer Slide Time: 23:16)

Now let us look at a couple of non-$L_p$-based white-box adversarial attacks, where you do not necessarily use norms to create the attack, but use other kinds of methods. One of them is called the spatially transformed adversarial attack. In this method, given a benign, clean input image, the goal is to find a flow, that is, a displacement of every pixel by a certain amount, such that when that flow is applied (with bilinear interpolation, to handle positions that fall between pixel locations), you get the final adversarial image.

To learn this flow, the objective minimizes, over the flow $f$, a combination of $L_{adv}(x, f)$ and $L_{flow}(f)$: here, $L_{adv}$ encourages the generated samples to be misclassified, and $L_{flow}$ tries to keep the flow perturbation as small as possible. Another set of methods in this category are known as functional adversarial attacks. So far, we looked at attacks using additive perturbations, where you add a perturbation to a sample; here, we instead transform the sample using a function. Those are known as functional adversarial attacks.

You can also combine the additive and the functional attack to get a combined attack. As an example, you could apply an attack that changes all red pixels in an image to light red pixels. This is not additive but an intensity-scaling operation, which would be considered a functional attack.

(Refer Slide Time: 25:33)

Moving on from white-box attacks, we will now talk about a few black-box attacks for adversarial perturbations. Remember, in a black-box attack, you do not have access to the model's parameters. But from what we have seen so far, we need to be able to estimate the gradients of the target neural network to produce an adversarial image, because that gives a sense of the direction in which the loss or the output of the neural network will change.

In the black-box case, we do not have access to the model's parameters, only to the outputs of the model. Assuming we have access to the logits of the target model, we still have to come up with an adversarial perturbation that changes the output from the true label to some other class label by at least a certain margin. The method called ZOO, for Zeroth Order Optimization, suggests that you can approximate the gradient by doing multiple forward passes on the model.

You compute $\frac{f(x + h e_i) - f(x - h e_i)}{2h}$, where $e_i$ is a small perturbation along the $i$-th coordinate and $h$ is a small step. We know from first principles that this is an approximation of the gradient along that coordinate. You can then use this estimate of the gradient in a method similar to one of the white-box attacks, as in the short sketch below.
of the gradient and then do any other any method similar to one of the White-box attacks.
Another approach in this direction is called an Opt-Attack, where the target model here can only

1636
be queried to get the hard label, not even the logits. Remember, in this approach, in the ZOO
approach, we said you could get the probability scores for the logits of the model.

Now we say even that is not available, you can only get the winning label. So, the perturbation
θ
that we need here, 𝑔(θ) is the minimum λ such that 𝑓(𝑥 + λ · ||θ||
≠ 𝑦𝑡𝑟𝑢𝑒. But how do we do

this when we do not have access to the gradients, and we only have access to the final prediction
of the model? So, one has to rely on a brute force kind of an approach, where a coarse grained
search is finally performed is initially performed to find a decision boundary and then this is
fine-tuned using binary search.

So you try to see how much do I add to cross the decision boundary. And if you crossed it by a
lot, now within that perturbation, do a binary search to find the exact perturbation that makes you
cross the decision boundary. And you do have the output of the target model to check whether
you crossed the decision boundary or not.

(Refer Slide Time: 28:51)

Another black-box adversarial attack is a gradient-free attack, where there are absolutely no gradients: you create a neighborhood consisting of all images that differ from the previous round's image by just one pixel. You initially pick a random pixel and add a perturbation; you can then calculate the importance of that pixel by observing the change in the classification output after adding the noise.

Next, you choose the following pixel location among pixels lying within a square of side length 2p, staying in that neighborhood, and again find the importance of each of those pixels. This can give you an approximation of the loss function's gradient, which you then use to attack the model.

(Refer Slide Time: 29:53)

We now come to the other side of the story. So far, we have seen several methods that perform adversarial attacks; now we will talk about a few methods that try to defend deep neural networks against them. Why do we need this? Adversarial attacks pose a serious security threat to services that use modern deep learning models. Here is an example performed using a TensorFlow camera demo app to classify clean and adversarial images: given an image from the dataset, the model initially calls it a washer, and when an adversarial perturbation is added to the image, the model thinks it is a safe, which it clearly is not.

(Refer Slide Time: 30:52)

In the physical world, for applications like autonomous navigation, an adversarial attack can cost lives, and it becomes important to defend against them. One method to perform an adversarial defense is called random input transformation. In this case, given an input image at test time, the approach is to first randomly resize the image and then randomly pad it in different ways, and pick one of these outputs to give to the CNN model for classification. Why are we doing this?

We expect that irrespective of where the object is in an image, the CNN will be able to classify it; that is the first part. The second part is that by choosing the transformation randomly, we hope the adversarial attack may not have targeted the pixels in the part of the canvas where the image content actually ends up.

Another approach is called random noising, where a noise layer is added before each convolutional layer in both the training and testing phases. You can see here the convolutional block, which has a convolutional layer, a batch norm layer and an activation layer; now you also have a noise layer added to each of your convolutional blocks. What is the purpose? Both at training and testing time, you add multiple random noises and ensemble your prediction results across all of these randomly perturbed inputs, and this gives you robustness against adversarial attacks, at least to an extent.

(Refer Slide Time: 32:56)

Another defense method is known as Defense-GAN, where a generator is first trained to model the distribution of clean images. You take your training dataset, without considering any adversarial perturbations, and train a GAN to generate images like those in the training set. At test time, you cleanse an adversarial input by finding the nearest generated image. How do you do that? You sample different random noise vectors, obtain the corresponding generated images from the generator, and see which of those generated images is closest to the current input, which could be adversarial. You then use that nearest generated image as the input to your classifier to get the output. Why do we do this? We hope that any adversarial perturbation gets hidden by considering the closest generated image, so that whatever value may have been added to a specific pixel is offset by using a different image. Other methods in the same space, such as PixelDefend, MagNet and APE-GAN, operate on a similar principle.

(Refer Slide Time: 34:33)

Another defense is based on an approach called network distillation. We will look at the idea of distillation more closely in the next lecture, but here is the high-level idea. You have an initial neural network, trained on the training data and training labels, and this network outputs a vector of logits, or a softmax probability distribution, as its output.

The distilled network then learns using these probability-vector predictions instead of the original one-hot training labels. So, if your initial network produced a distribution over, say, 10 labels, instead of recognizing a digit as a hard "3", it would output a probability distribution over all labels, and the second network is asked to predict that probability distribution rather than the digit 3. How does this help?

Using a very high temperature softmax, remember, gives you a smoother distribution over the probability values, and that reduces the model's sensitivity to small perturbations: you do not easily flip to another label, because you have smoothened out your probability distribution across all labels.

(Refer Slide Time: 36:11)

But the most important and most widely used adversarial defense is known as adversarial training, which is a computationally intensive process. You simply add a PGD or FGSM attack as part of your original training loop; this can be seen as the ultimate data augmentation step. In each step of training the original deep neural network model, you add an inner maximization step, which tries to find the $\delta$ that maximizes the loss, and you then find the $\theta$ that minimizes the loss for this adversarial perturbation.

So, within each iteration of training, for each data point in the mini-batch, you have to solve a maximization problem that asks: how much should I perturb this input to cause maximum damage? You find that perturbation, call it $\delta$, add $\delta$ to the input, and then minimize the loss of the overall network. This is arguably the most powerful adversarial defense; it is widely adopted and underlies the current state of the art in adversarial robustness.
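Here is a minimal sketch of such a training loop, using the hypothetical pgd_attack routine sketched earlier as the inner maximization; the optimizer, data loader and hyperparameters are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def adversarial_training_epoch(model, loader, optimizer, eps=8 / 255):
        model.train()
        for x, y in loader:
            # inner maximization: find a delta (within the eps-ball) that maximizes the loss
            x_adv = pgd_attack(model, x, y, eps=eps)      # uses the PGD sketch from earlier
            # outer minimization: update theta to reduce the loss on the perturbed inputs
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_adv), y)
            loss.backward()
            optimizer.step()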

(Refer Slide Time: 37:48)

There have been a few variations of adversarial training based on similar ideas. One of them is
known as Adversarial Logit Pairing, where you have a clean sample, which is given by this green
vector. You perturb it and you get this red vector. You pass both of them through the Neural
Network, you try to ensure that the logits of the clean sample and the adversarial sample are
close to each other using something like 𝐿2 loss. This also helps as a form of adversarial training.

A more recent method called TRADES, which was published in ICML of 2019. Uses a similar
idea, but a different method. It decomposes the prediction error as the sum of natural error and
boundary error. So, you have a natural error here, which is the error that you typically deal with
when you train any Neural Network model. And then you have a boundary error here, which is
trying to find the delta that will push the decision function to give a different label. That is the
second part here, which has another maximization here, which is why it is an adversarial training
step.

While ALP, which is Adversarial Logit Pairing, uses PGD adversarial samples, TRADES
computes adversarial samples as the δ for which the discrepancy between 𝑓(𝑥) and 𝑓(𝑥 + δ) is
maximized. So, you want that loss to be as high as possible, not necessarily through a PGD
attack. You can look at ALP as enforcing an 𝐿2 loss, while TRADES uses a classification-calibrated loss.
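To make the contrast concrete, here is a simplified PyTorch sketch of a TRADES-style objective; it assumes the adversarial sample x_adv has already been found by maximizing the KL term in an inner loop, and beta is the trade-off hyperparameter (the names here are illustrative, not the authors' code):

import torch.nn.functional as F

def trades_style_loss(model, x, x_adv, y, beta=6.0):
    # Natural error term: standard cross-entropy on the clean sample.
    natural_loss = F.cross_entropy(model(x), y)
    # Boundary error term: discrepancy between predictions on the clean
    # and adversarial inputs (x_adv is assumed to maximize this KL term).
    log_p_clean = F.log_softmax(model(x), dim=1)
    log_p_adv = F.log_softmax(model(x_adv), dim=1)
    boundary_loss = F.kl_div(log_p_adv, log_p_clean,
                             reduction="batchmean", log_target=True)
    return natural_loss + beta * boundary_loss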

(Refer Slide Time: 39:57)

An important takeaway from the studies around adversarial robustness is that one has to deal
with a tradeoff between robust and clean accuracy. Unfortunately, adversarial robustness comes
at the cost of decreased clean or standard accuracy. Here is an example: on the y-axis, you see
the difference between adversarial error and standard error. An attack with ε = 4/255 is more
powerful than one with 3/255, 2/255 or 1/255, because you are allowing a larger perturbation.

You can see here that when the attack is stronger, the difference between adversarial error and
standard error goes up. However, this graph also gives us some consolation: as the number of
labeled samples increases, these differences seem to reduce. So, as you keep increasing the
number of training samples, it looks like even with a stronger adversarial attack, the difference
between adversarial error and standard error is not too large.

So, more recent methods, such as Interpolated Adversarial Training or Adversarial Vertex
Mixup, try to reduce this tradeoff by increasing the training set size using interpolations between
your standard data and adversarial samples. You take an input data point and find its adversarial
perturbation in the inner maximization loop of adversarial training.

Now take interpolations of the points that lie on the line between the clean point and its
adversarial counterpart. All of these, when added to training, help increase the training set size,
and thus mitigate this tradeoff between adversarial error and standard error.

(Refer Slide Time: 42:21)

Over the last year, other notions of robustness have also surfaced, such as attributional
robustness, where methods attack explanations and saliency maps instead of attacking
predictions. So, you have an image, and you perturb it in such a way that the class label output
stays the same, but the explanation changes significantly. There is also the notion of robustness
to natural corruptions, such as fog, blur, or snow.

An example could be autonomous navigation: when you train a model to drive on a normal road,
it should be able to drive in the same scene even if there is rain or snow. Robustness against
natural corruptions also becomes important in such scenarios.

(Refer Slide Time: 43:26)

Your homework for this lecture is to go through these excellent tutorials on Adversarial Machine
Learning, Parts 1, 2 and 3. If you are interested, you can read further on these links here. The
space of Adversarial Robustness is quite vast today, but these links give you a fair picture of the
area, and links to a few codebases are also provided here so you can experiment and understand
how these methods work in practice.

(Refer Slide Time: 44:01)

Here is a very comprehensive set of references if you would like to follow up further.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 77
Pruning and Model Compression

Moving on from adversarial robustness, we will now talk about Pruning and Model
Compression, another important component in taking Deep Learning models to in-the-wild,
real-world applications.

(Refer Slide Time: 00:32)

Neural Networks in general are optimized to improve predictive accuracy, be it accuracy for
classification models, mean average precision for detection models, or pixel-wise classification
accuracy for segmentation. Chasing accuracy alone makes neural networks very large. As a
result, the models that are state of the art today have a very large number of parameters, often of
the order of millions.

Recall that we said that AlexNet has over 61 million parameters, occupying about 200 MB of
space in memory. VGG occupies up to 500 MB just to store the model weights in memory.

(Refer Slide Time: 01:26)

Is this really a problem? When you train your models, it is alright to have a very high storage
and memory footprint, and one can use powerful GPUs to train these models. However,
expecting the availability of heavy compute at test or inference time may be limiting. If one
considers the deployment of Deep Learning models in low-compute applications, such as mobile
phones, drones or unmanned aerial vehicles, or IoT devices, which could be deployed at any
edge location in the world, even in harsh conditions, having bulky Neural Network models
becomes a limiting factor in taking their success to these kinds of compute platforms.

(Refer Slide Time: 02:22)

Another way of viewing this is from the viewpoint of the energy expended in carrying out such
operations in memory. An interesting analysis was done by Song Han, who came up with one of
the most popular papers for deep model compression. The analysis in this table shows that a
32-bit integer addition consumes about 0.1 picojoules (10^-12 joules) of energy, and a 32-bit
float addition consumes 0.9 picojoules. Going further, a 32-bit SRAM cache access consumes
5 picojoules, and when you go to DRAM, the cost goes up by orders of magnitude, to about
640 picojoules.

Accessing DRAM, or dynamic RAM, is significantly more costly than accessing SRAM. Why
are we talking about this? It means that for low-compute devices, we would ideally like these
Deep Learning models to be housed in SRAM and not have to go to DRAM, because DRAM
accesses incur large energy costs, especially in environments such as drones, edge devices, or
IoT devices, where battery also becomes a concern when deploying these models.

So, one key requirement that emerges is the need to prune these bulky neural network models
into smaller memory footprints that can be deployed in low-compute environments. This
category of methods is broadly called model compression, where a trained model is compressed
into a smaller memory footprint for deployment on low-compute devices.

(Refer Slide Time: 04:34)

Over the last few years, several efforts have been made by different researchers, and a broad
categorization of these methods can be given as follows. Parameter pruning and quantization is
one family of methods, which focuses on removing redundant parameters that do not affect
performance. A second family of methods is based on low-rank factorization, where matrix and
tensor decomposition methods are used to estimate only the informative parameters and discard
the rest.

Transferred or compact convolution filters are a family of methods where special structural
convolution filters are designed to save parameters. And finally, there is an interesting family of
methods called knowledge distillation, which distills knowledge from a large neural network
model into a small student neural network model. We will not see all of them in this lecture, but
will cover a few briefly and point to other resources for more reading. Specifically, we will see a
very popular pruning-based approach, a knowledge distillation approach, and a more recent
approach called the Lottery Ticket Hypothesis.

(Refer Slide Time: 06:05)

One of the most popular and reasonably early methods for model compression is called Deep
Compression, developed by Song Han and published at ICLR 2016. It was a game-changing
method, which also took the idea forward to hardware design. It uses a 3-stage pipeline to reduce
the storage requirement of neural nets: the first step is pruning of a trained model, then
quantization of the weights, and finally a Huffman encoding step. Together these reduce model
size by 35 to 49x with very minimal loss in accuracy.

(Refer Slide Time: 06:54)

Let us see each of these steps. The first step is to prune the model. What does pruning mean
here? Once you train the full model, weights with values below a certain threshold are removed
from the network. So, if any weight is lower than, say, 10^-5, that weight is removed, and the
remaining sparse network with only the surviving connections is retrained to get a better
network. This is an iterative process: in the new retrained sparse network, if any weights fall
below the threshold, they are again removed, the remaining sparse network is retrained, and this
step is repeated.
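A minimal sketch of this magnitude-based pruning step on a single weight tensor is shown below; the threshold is illustrative, and in practice the resulting mask is kept fixed while the surviving weights are retrained:

import torch

def magnitude_prune(weight, threshold=1e-5):
    # Zero out weights whose magnitude falls below the threshold, and return
    # a binary mask so the same connections stay pruned during retraining.
    mask = (weight.abs() >= threshold).float()
    return weight * mask, mask

w = torch.randn(256, 128) * 1e-4              # hypothetical layer weights
pruned_w, mask = magnitude_prune(w, threshold=1e-4)
print(f"fraction of weights pruned: {(mask == 0).float().mean():.2%}")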

(Refer Slide Time: 07:45)

Just with this simple step alone, the authors showed that many of the popular networks could be
reduced in size significantly. For example, for AlexNet, while the original size is 61 million
parameters, this simple pruning step brings it down to 6.7 million, which is a 9x compression.
And when one looks at the top-1 or top-5 error on ImageNet, you notice that there is no
significant drop in performance because of this reduction in parameters. In the case of VGG,
pruning alone reduced the number of parameters by 13x. This was also observed for smaller
networks, such as LeNet.

(Refer Slide Time: 08:39)

The second step after pruning is known as weight sharing. In each layer, the weights are
partitioned into k clusters using simple K-means clustering, and each weight is replaced by the
centroid of the cluster it belongs to. Here you see an example of a 4x4 matrix of weights, the
cluster assignment in the subsequent matrix, and the centroid value shown for each cluster.

At the end, each of the blue weights is replaced by the centroid of the blue cluster, and so on for
each of the colors. How does this help? We need to store fewer values to represent this layer's
weights. A subsequent question is: if the weights are changed and clustered like this, what
happens to the gradients?
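A rough sketch of this weight-sharing step on a single layer, using scikit-learn's KMeans (the 4x4 matrix and k = 4 clusters are only illustrative):

import numpy as np
from sklearn.cluster import KMeans

def share_weights(weights, k=4):
    # Cluster the layer's weights into k groups and replace each weight
    # by the centroid of the cluster it belongs to; only the k centroids
    # and the small cluster indices then need to be stored.
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(flat)
    shared = km.cluster_centers_[km.labels_].reshape(weights.shape)
    return shared, km.labels_.reshape(weights.shape), km.cluster_centers_.ravel()

layer = np.random.randn(4, 4)                 # hypothetical 4x4 weight matrix
shared, assignments, centroids = share_weights(layer, k=4)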

(Refer Slide Time: 09:45)

The gradients follow a similar process. If you have certain gradient values for each of these
locations in that particular layer, the gradients are also grouped using the same cluster
assignment, and the aggregated gradient value for each cluster is then subtracted from the shared
weight to get the new weight. In this way, even the gradients participate in the weight-sharing
exercise.

(Refer Slide Time: 10:18)

Having done pruning and weight sharing, the next step the authors used was quantization. This
was based on an empirical observation that if we used just 8 bits instead of 32-bit float values,
the performance really did not reduce much. So, this quantization step was used to further reduce
the storage footprint. The final step was Huffman coding, a popular compression method in
computer science, where a frequently occurring pattern is stored with fewer bits and a rarely
occurring pattern is stored with more bits to capture its additional information.

Huffman coding is a long-standing compression method, which is used here to once again reduce
the storage footprint. Applying these methods sequentially, one after the other, the overall
approach showed a 35 to 49x reduction in model size with minimal loss of accuracy. As you can
see here, AlexNet went from 240 MB to 6.9 MB in this particular case, and VGG went from
552 MB to 11.3 MB, which is a 49x reduction in storage.

If you look at the error rates, there is no significant increase in error because of this
compression, which is the main objective. So, once you get down to 6.9 MB or about 10 MB,
these models become amenable to deployment on low-compute devices.

(Refer Slide Time: 12:17)

A second method that we will talk about is Knowledge Distillation. The key intuition of this
family of methods is to transfer knowledge from a cumbersome, large model to a small model,
which we call a student model, whose size is more suited to deployment. The obvious question
here is: what do we mean by knowledge in a Deep Neural Network?

(Refer Slide Time: 12:51)

The first idea used here was that knowledge can be viewed as the mapping between inputs and
the softmax probabilities. If there was an image of a cat, you would like the probability for the
cat class to be 1 and everything else to be 0, but a neural network may not necessarily give you
that output; it may say the probability for a cat is 0.8, and the probabilities for other class labels
could be 0.01, 0.05, and so on.

Now, these outputs represent the knowledge that the Neural Network has gained over the process
of training. So, in knowledge distillation, the idea is to take a small, shallow student network and,
instead of training this network with hard labels or one-hot labels, ask the student to predict the
softmax probabilities or even the logits of the teacher network.

You can see here you have the cumbersome model, and through distillation, the distilled model’s
objective is to match the soft outputs or targets of the teacher model. In this sense, the knowledge
gained by the teacher model is distilled into the student model, which performs as well with a
smaller storage footprint.
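A minimal sketch of such a distillation objective in PyTorch: the student's temperature-softened predictions are matched to the teacher's, with a small weight on the usual hard-label loss (the temperature and mixing weight are illustrative hyperparameters):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    # Hard-label term: standard cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss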

(Refer Slide Time: 14:25)

Here is a simple experimental example on MNIST. The cumbersome model has 2 layers with
1200 ReLU nodes each and dropout, and the small model has 2 layers with 800 ReLU nodes and
no regularization. The cumbersome model makes 67 errors on the MNIST test set. If you train
the small model using standard training, it makes 146 errors, while the small model trained with
distillation makes only 74 errors, which is close to the bulky model.

Over the years, knowledge distillation has resulted in several variants: instead of matching only
the logits or the probabilistic outputs of the teacher model, you can also match intermediate
representations of hidden layers, add some noise, or ensemble multiple teachers, and so on.
These are provided in the references for further reading.

(Refer Slide Time: 15:35)

A third approach that we will talk about here is a recent one published at ICLR 2019, called the
Lottery Ticket Hypothesis. As the name suggests, this was based on an observation that when
you train a full, bulky Deep Neural Network, a very sparse sub-network obtained after pruning
often produces accuracy close to the full model. So, you randomly initialize weights, train the
full network, and get 90 percent accuracy; you prune, and get a sparse sub-network with 90
percent accuracy.

But if you took the same kind of sub-network, randomly initialized it and trained it, you would
get only 60 percent accuracy. So, there seems to be something about training the full network,
and then pruning.

(Refer Slide Time: 16:32)

So, this work hypothesized that a randomly initialized dense neural network contains a
sub-network that is initialized such that, when trained in isolation, it can match the test accuracy
of the original network after training for at most the same number of iterations. The obvious
question now is: how do you find the sub-network? To do this, the approach proposed a simple
idea called one-shot pruning: you first train a full network with random initialization, you prune
a certain percentage p of the smallest weights of the full network, and you reset the remaining
weights to their original initialization to create the winning ticket.

They showed that following this procedure helps us find the lottery ticket, which is that one
sparse sub-network that seems to match the accuracy of the complete network. One could also
repeatedly prune the network over multiple rounds, similar to the iterative pruning we spoke
about for Deep Compression. This does give better results, but of course requires more
computation.
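In pseudocode-like Python, the one-shot procedure could be sketched as below; train_fn and prune_smallest are hypothetical helpers standing in for full training and magnitude-based pruning:

import copy

def find_winning_ticket(model, init_state, train_fn, prune_smallest, p=0.2):
    # `train_fn` trains the model in place; `prune_smallest` returns one binary
    # mask per parameter, zeroing the p fraction of smallest-magnitude weights.
    model.load_state_dict(copy.deepcopy(init_state))   # start from the random init
    train_fn(model)                                    # 1. train the full network
    masks = prune_smallest(model, fraction=p)          # 2. prune p% smallest weights
    model.load_state_dict(copy.deepcopy(init_state))   # 3. reset survivors to their init
    for param, mask in zip(model.parameters(), masks):
        param.data.mul_(mask)                          # the candidate winning ticket
    return model, masks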

(Refer Slide Time: 18:06)

Here are some results shown in this work. On the left, what you see is the percentage of weights
remaining versus early-stop iterations for MNIST and CIFAR-10. If you look at these two
curves, the dotted line is a randomly sampled sparse network and the bold line is the one
obtained by the Lottery Ticket Hypothesis. You see that even when the number of early-stop
iterations is very low, the percentage of weights remaining for the Lottery Ticket Hypothesis
approach remains high.

A similar pattern is also observed for CIFAR-10. More importantly, a plot of the percentage of
weights remaining versus accuracy is again favorable for the Lottery Ticket Hypothesis: the
dotted line is a randomly sampled sparse sub-network and the bold line is the one obtained using
the Lottery Ticket Hypothesis.

You see that as the percentage of weights remaining decreases from 100 to 0.2, the randomly
sampled sub-network has a quicker fall in accuracy, while the Lottery Ticket Hypothesis
maintains a higher accuracy for longer. A similar result is also seen on CIFAR-10 with different
kinds of layers, shown against the percentage of weights remaining.

(Refer Slide Time: 19:57)

So, is this always a good strategy? Not really. While iterative pruning produces better results, it
requires training the network perhaps about 15 times per round of pruning. On the other hand,
finding the lottery ticket with one-shot pruning, while feasible, sometimes may not give you as
good a performance as the original network. And if you did iterative pruning, it is going to be
harder to study large datasets such as ImageNet.

So, over the last few years, there have been improvements over the vanilla Lottery Ticket
Hypothesis work, where researchers have studied whether these winning lottery tickets can be
found early on in training, rather than waiting for many iterations; whether such winning tickets
generalize to newer datasets and optimizers; and whether the hypothesis holds in other domains
such as text processing, or NLP.

(Refer Slide Time: 21:01)

These methods have also been extended in several different ways. XNOR-Net is a popular
compression method where binary weights are used: the understanding here is that you do not
need the precision of representing each weight with many bits; just using binary weights and
replacing convolutions with XNOR operations can make neural networks attain a reasonable
amount of accuracy with a very small memory footprint. ThiNet compresses CNNs with filter
pruning.

Knowledge distillation methods have used noisy teachers, where the teacher logits are perturbed
to get the effect of multiple teachers training a student. Relational Knowledge Distillation adapts
metric learning to distillation. There have also been specific architectures, a few of which we
discussed when we covered CNNs, such as MobileNets, ShuffleNet, SqueezeNet, SqueezeDet
for detection, and SEP-Net, which have been used for model compression.

(Refer Slide Time: 22:18)

As we said earlier, the space is fairly large. There are also low-rank factorization methods, as
well as methods that design convolutional filters in a particular way to save parameters, which
we leave for further reading.

(Refer Slide Time: 22:34)

So, the homework for you is to read a very nice survey of the Lottery Ticket Hypothesis and this
Comprehensive survey of different Model Compression and Acceleration Methods for Deep
Neural Networks.

(Refer Slide Time: 22:50)

Here are some references.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 78
Neural Architecture Search

As the last advanced topic for this week, we will talk about Neural Architecture Search, another
contemporary topic that deals with searching for the right Neural Network architecture for a
given problem. It should straightaway appeal to you that this is a very important problem,
considering the many hyperparameters one has to choose when designing Neural Networks.

(Refer Slide Time: 00:44)

To get a better understanding of why we need this, recall that we have seen over the duration of
this course that accuracy and performance have improved with better design of architectures.
This often requires human experts to spend hundreds of hours on training, testing and
hyperparameter tuning. Even then, it does not mean that the entire space of possible
architectures has been searched exhaustively to find the right architecture.

Often, researchers use architectures that were proposed earlier and adapt a few elements to
arrive at an architecture for a given problem. That raises the question: can we adopt a systematic
and automatic way of learning high-performance model architectures? The solution that we are
going to talk about is Neural Architecture Search.

(Refer Slide Time: 01:54)

Neural Architecture Search, perhaps not under the same name, has been around for some time.
In very early phases, one would use evolutionary algorithms, such as genetic algorithms, to
design neural network architectures. Over the last decade, this led to the exploration of random
search for architectures, and even Bayesian optimization for architectures and hyperparameter
tuning.

Over the last 2 to 3 years, Neural Architecture Search has emerged as an independent important
problem. It primarily started with searching for these architectures using principles from
reinforcement learning.

(Refer Slide Time: 02:42)

The general problem setup in NAS, or Neural Architecture Search, is as follows. You have a
search space, which defines the space of all possible architectures you want to consider for a
given problem. You have a search strategy to look for an architecture within the search space,
which can often be extremely large. And you need a performance estimation strategy to estimate
an architecture's performance; using standard training and validation performance could be
computationally expensive when trying various kinds of architectures, so this also needs to be
chosen carefully.

So, you have a search space and a search strategy, which results in an architecture whose
performance is estimated. Based on the performance estimate, the search strategy looks for a
newer architecture, and this loop continues until a desired performance is met on a consistent
basis.

(Refer Slide Time: 03:57)

As we just mentioned, the search space of NAS methods is extremely important. So, how is the
search space defined? One can view a neural network as a function that transforms input
variables to output variables through a series of operations. If you look at neural networks as
computational graphs (recall the lecture we had earlier), each node represents a tensor and is
associated with an operation applied to its parent nodes.

So, you can say that the computation at a node k is given by x^(k) = o^(k)(I^(k)), where o^(k) is
the operation at node k and I^(k) is the set of its parent nodes. What are the operations? They
could be convolutions, pooling, activation functions, or even n-ary operations like concatenation
or element-wise addition. A NAS search space is generally a subspace of this general definition
of neural architectures (a tiny sketch of this node computation follows below). So, what kind of
search spaces do NAS methods use?
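Before moving to search spaces, here is a tiny illustrative sketch of such a node computation, where each node applies its operation to the outputs of its parent nodes; the graph and operations below are purely hypothetical:

import torch
import torch.nn.functional as F

def eval_graph(x, nodes):
    # `nodes` maps a node id to (operation, list of parent ids); node 0 is the input.
    outputs = {0: x}
    for k, (op, parents) in nodes.items():
        outputs[k] = op(*[outputs[p] for p in parents])
    return outputs[max(nodes)]

graph = {
    1: (lambda a: F.relu(a), [0]),        # unary operation on the input
    2: (lambda a: torch.tanh(a), [0]),    # another unary operation on the input
    3: (lambda a, b: a + b, [1, 2]),      # n-ary operation: element-wise addition
}
y = eval_graph(torch.randn(1, 4, 8), graph)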

(Refer Slide Time: 05:16)

Broadly speaking, one could divide this into two kinds: a global search space versus a cell-based
search space. In a global search space, the method allows any kind of architecture, so you have a
large degree of freedom in arranging the operations. You have a set of allowed operations, such
as convolutions, pooling, dense layers, and global average pooling, with different hyperparameter
settings like number of filters, filter width, filter height, and so on.

But a few constraints are also specified. For example, you may not want to start the neural
network with pooling as the first operation, or have dense layers before the first set of
convolution operations. In global search, the issue is that the search space is a bit rigid, and it
could be impractical to scale and transfer considering the extensiveness of this search space.

On the other hand, the cell-based search space was first introduced in a CVPR 2018 work that
proposed a method called NASNet. The idea here is that a lot of handcrafted architectures are
actually designed as repetitions of specific structures; we have seen residual blocks in ResNets.
This idea is taken forward here by defining a cell structure and constructing a network
architecture by repeating the cell structure.

What is a cell? You could look at it as a small directed acyclic graph that represents a
transformation or a series of transformations. NASNet learns two kinds of cells: a normal cell,
where the dimensions of the input and the output of the transformation are retained (convolution
would be an example), and a reduction cell, where the output feature map has its width and
height reduced (like pooling operations).

(Refer Slide Time: 07:44)

Here is an example of a global search space: you could have a chain-structured search space as
given below, where, given an input, you have a series of operations, and each operation has to be
chosen from your universe of allowed operations. Another variant is one where you also have
skip connections in your chain-structured search space.

(Refer Slide Time: 08:15)

On the other hand, in a cell-based search space, an architecture template is defined. For example,
here you see an input which first goes through a normal cell, then a reduction cell, then a normal
cell, a reduction cell, a normal cell, and a softmax. In very simple terms, you could assume that
the normal cell was convolution plus batch norm and the reduction cell could have been pooling,
in a standard AlexNet context.

But now, in NASNet, each of these cells is searched for within a universe of operations. So, here
is the reduction cell of one of the architectures of NASNet, called NASNet-A, where you see 5
blocks with different kinds of operations mentioned on each block, obtained after the
architecture search.

(Refer Slide Time: 09:15)

As we mentioned some time ago, the space of Neural Architecture Search became popular in
2017 with the introduction of NAS through reinforcement learning. While we have not covered
reinforcement learning as a topic in this course, we will speak at a very high level to explain how
this NAS method works. This method proposes the use of a controller, which proposes a child
architecture. The controller is implemented as a Recurrent Neural Network, or RNN, which
outputs a sequence of tokens that configure the network architecture.

The controller RNN is trained, similar to a reinforcement learning task, using a popular
algorithm called REINFORCE, which uses Monte-Carlo policy gradients. The action space in
this case is the list of tokens for defining a child network, the reward is the accuracy of the child
network, and the loss is the REINFORCE loss given by this term. Since we have not covered
reinforcement learning in this course, we will not go deeper than this, but if you are interested,
you can read this paper for more details.

Here is the schematic: you have the controller, which samples an architecture with probability p.
You train a child network with that architecture to get an accuracy, which serves here as the
reward in the reinforcement learning sense. Then the policy gradient is computed to update the
controller RNN.

(Refer Slide Time: 11:09)

Here is an example of how the controller RNN, trained using REINFORCE, works. At each step
of the RNN, across different layers, multiple hyperparameters such as the number of filters, filter
height, filter width, stride height, and stride width are predicted for that layer. These are then
passed on to the next layer, whose number of filters, filter height, and other hyperparameters are
predicted in turn.

So, the prediction could be something like 3, 7, 1, 2, 36, where the first token gives the filter
height, and the subsequent ones give the filter width, the stride height, the stride width, and the
number of filters.
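As a toy illustration of how such a token sequence could configure one layer (the field order follows the example above; the decoding itself is hypothetical):

# Hypothetical decoding of one layer's tokens predicted by the controller RNN.
tokens = [3, 7, 1, 2, 36]
fields = ["filter_height", "filter_width", "stride_height", "stride_width", "num_filters"]
layer_config = dict(zip(fields, tokens))
print(layer_config)
# {'filter_height': 3, 'filter_width': 7, 'stride_height': 1, 'stride_width': 2, 'num_filters': 36}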

(Refer Slide Time: 11:58)

One of the limitations of these NAS methods is that, when you view the entire search space as a
graph with each path being a certain sequence of operations (that is, an architecture), each of
these paths is looked at and trained independently to find which architecture suits a particular
task. So, given the overall search space, the path 1-2-3, which could be, say, a convolution
followed by a pooling, and the path 1-2-4, which could be a convolution followed by a batch
norm, are each trained independently, and their performances are assessed to see which sequence
of operations suits a given task.

A recent method called Efficient NAS, proposed at ICML 2018, asks the question: can we treat
all model trajectories as subgraphs of a single directed acyclic graph? The answer is yes, and
Efficient NAS shows a way to do this by sharing parameters across the child models.

(Refer Slide Time: 13:18)

So, in this case, what you see here is a supergraph of all possible sequences of operations, and
each architecture becomes a subgraph, or a trajectory, in this graph. You could look at these red
arrows as one sequence of operations corresponding to one architecture, and that sequence of
operations is decided by an RNN controller, which is trained using reinforcement learning.

How is this trained? The weights of the controller and the weights corresponding to the chosen
paths are trained in an alternating manner, and this leads to the final training. They show that
this approach gives about 2.89 percent test error on the CIFAR-10 dataset, which is close to the
state of the art, and took less than 16 hours to search for the right architecture, which is
significantly less than other NAS-based models; NAS methods based on reinforcement learning,
in particular, can take many days to search for the right architecture. One other observation to
point out here is that a lot of the state-of-the-art models today in classification and detection are
based on architectures obtained through Neural Architecture Search.

(Refer Slide Time: 14:54)

A more recent development called DARTS, Differentiable Architecture Search, was proposed at
ICLR 2019. The key idea here was to define the search space using a set of binary variables on
the operations. So, the output of layer k, z^(k), is now an operation O applied to an input z^(i)
from a previous layer, multiplied by a binary variable indicating whether that operation is used.
The key insight in this approach is that, until this method, you could either have an operation or
not have it.

But this method says: why not relax this and make α learnable? Hence, it relaxes the categorical
choice of an operation into a softmax over all possible operations. So, the operation is not just
binary in terms of its existence; it is now a softmax over the space of all possible operations.
This allows us to backpropagate through the choice of the operation itself.
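A hedged PyTorch sketch of this continuous relaxation, where the output of an edge is a softmax-weighted mixture of candidate operations (the candidate set here is illustrative and much smaller than in the paper):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    # A DARTS-style relaxed choice: a softmax-weighted sum over candidate operations.
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # One architecture parameter alpha per candidate operation, learned by backprop.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)      # relaxed, differentiable choice
        return sum(w * op(x) for w, op in zip(weights, self.ops))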

(Refer Slide Time: 16:21)

What does this mean, and how do you train? In DARTS, an alternating bilevel optimization
method is proposed, similar to Efficient NAS. In the first step, the model parameters w of a
specific architecture are trained by minimizing the loss on a training set. In the second step, the
structural parameters α, which weight each of your operations, are learned by minimizing
another loss on a validation set, defined such that w*(α) is the set of weights that minimizes the
training loss for the current α.

So, you use those weights to compute a validation loss, which is then used to update α. The final
architecture is chosen based on the sequence of operations that maximizes α, that is, at each step
the operation with the maximum α is kept. This is an elegant solution that makes all parameters
differentiable, and thus the entire architecture search becomes a differentiable procedure.
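The alternating updates could be sketched roughly as follows (a simplified first-order version; the paper also uses a second-order approximation for the α step, and the model, optimizers and loaders here are assumed to be set up separately):

def darts_search_epoch(model, w_optimizer, alpha_optimizer, loss_fn,
                       train_loader, val_loader):
    for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
        # Step 1: update the weights w on a training batch (alpha held fixed).
        w_optimizer.zero_grad()
        loss_fn(model(x_tr), y_tr).backward()
        w_optimizer.step()
        # Step 2: update the architecture parameters alpha on a validation batch,
        # approximating w*(alpha) by the current weights (first-order DARTS).
        alpha_optimizer.zero_grad()
        loss_fn(model(x_val), y_val).backward()
        alpha_optimizer.step()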

(Refer Slide Time: 17:42)

Another consideration that has been looked at in the NAS community is architecture
transferability: if we learn an architecture through a NAS method on CIFAR-10, how well does
it transfer to ImageNet? Research has shown that a lot of contemporary NAS methods deliver
fairly well on this count, where an architecture found on CIFAR-10 also does reasonably well on
ImageNet, as well as on other datasets and tasks.

(Refer Slide Time: 18:23)

While this discussion on NAS was intended to be brief, keeping in view the scope of this course,
let us discuss a few future directions and open problems in NAS. One of the first issues is search
efficiency: as mentioned, performing NAS using reinforcement learning can create a training
overhead of many days on standard GPU hardware, so improving the search becomes an
important problem. A different perspective is to move towards less constrained search spaces.
One way you could improve search efficiency is to constrain your search space more.

But we also do not want to constrain the search space too much, so that better architectures can
still be explored. Designing efficient architectures, such as those we get after pruning or
quantization, or perhaps finding the lottery ticket using NAS, could also be an interesting
direction in this space. Another important direction is the joint optimization of all components of
a deep learning pipeline, not just the architecture: we could also decide what data augmentation,
activation functions, and optimizers to use, and so on.

So, including all of that in the NAS search pipeline would be an interesting future task in the
space. Finally, designing architectures for multimodal tasks, such as vision and language tasks,
could also be a very important and useful direction.

(Refer Slide Time: 20:22)

Before we conclude this lecture, there is also an interesting observation by a few researchers
about the curious case of random search. In this work, from AAAI 2019, researchers observed
that when one randomly searched for a neural network architecture, the difference in accuracy
between random search, reinforcement learning and evolutionary algorithms was not too large.
This leads to an important conversation as to whether we really need Neural Architecture Search.

(Refer Slide Time: 21:02)

This observation was also seconded by a couple of other papers, which showed that random
search and random wiring of neural networks can perform reasonably well, in contrast to what
was observed earlier with searched architectures. This leaves an interesting open question for the
community.

(Refer Slide Time: 21:31)

The recommended readings for this lecture are once again, a very nice blog article by Lilian
Weng on Neural Architecture Search and these two survey papers on Neural Architecture Search
if you need more information.

Deep Learning for Computer Vision
Professor. Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad
Lecture No. 79
Course Conclusion

After these 12 weeks of lectures, which you hopefully learned something valuable from, we
come to the conclusion of this course. In these last few minutes, I would like to share a few
topics that we did not get a chance to cover in this course, as well as look at some emerging areas
and applications of Computer Vision, which are worth looking into.

(Refer Slide Time: 00:44)

One topic, which we briefly touched upon earlier this week but which has many facets and
dimensions, is learning with limited supervision. In the standard supervised learning setting, you
have a training set, you train a learner, and you get a model which is used on the test set. Now,
you can adapt this paradigm to several other paradigms, some of which we have seen, and some
of which we have not.

Multitask Learning is a setting where you train a learner using data from multiple different tasks
to learn a single model that can then handle all the tasks on an image at test time. An example of
multitask learning could be that, given a face image, you may want to recognize the identity of
the person, the expression of the person, the pose of the person, and so on, using a single model.

The second kind of setting is Transfer Learning, which we have seen earlier in the course, where
you have source domain data on which you have trained a model. The knowledge of that model
is then used to train a different model on a different target dataset, so as to perform well on the
target domain. We saw fine-tuning as one of the most preferred and used approaches in deep
learning for computer vision, but there are other methods one can use for transfer learning too.

A variant, or a generalization, of this idea is called domain adaptation, where in the source
domain you do not necessarily learn a model, but directly use the source data to train a model
which performs on the target domain. When you see the diagram, domain adaptation looks
similar to multitask learning, but there is a subtle difference: in multitask learning, your goal is to
do well on each of the tasks at test time, whereas in domain adaptation, your goal is to do well on
the target domain alone. The source domain is intended merely to provide that adaptation, or
extra information, from another domain.

K-shot learning is a generalization of few-shot or zero-shot learning, which we have already
seen: you have a set of classes with a lot of data and a set of classes with only k samples each,
where k can be 0, and you now have to train a model that does well on all the classes. We have
seen a few example methods for this. Another setting in the context of learning with limited
supervision is Continual Learning, which has become popular in recent years. You once again
have multiple tasks, similar to multitask learning, the only difference being that the tasks could
keep coming incrementally over time, and the universe of tasks may not be limited.

Now, you want to train a model that works on all the tasks seen so far. Initially, you have data
from task 1 and you train a model; it should work on other data from task 1. Then data for task 2
comes and you update the model, and the model now has to work on data from both task 1 and
task 2. Similarly, when you get data from task 𝑇𝑛, you have to ensure that your updated model
works on data from task 1, task 2, through task n. This gets especially challenging when you do
not have access to the data from the previous tasks.

The phenomenon of catastrophic forgetting, which we briefly spoke about in the lecture on
few-shot and zero-shot learning, becomes even more prominent here: you do not have access to
data from task 1 or task 2 when you come to task 𝑇𝑛, but you still have to perform well on them.
Neural networks in their vanilla versions suffer from catastrophic forgetting in this context, and
addressing it is one of the most important objectives of continual learning methods.

The last setting we will talk about is called Domain Generalization. You could consider this an
extension of domain adaptation: you have multiple domains, in this case, say, two domains 𝐷𝑠
and 𝐷𝑡. You train a model, but now the model has to work on a third, different domain 𝐷𝑡+1;
this is why we call it domain generalization. All of these are very contemporary topics in deep
learning for computer vision, which aim to go beyond standard supervised learning, which we
understand can be solved by engineering networks, training them, and evaluating them in a
comprehensive manner.

(Refer Slide Time: 06:43)

Another topic that we did not have the scope to cover in this course, and an important one, is the
intersection of Reinforcement Learning and Computer Vision. Reinforcement learning, as you
may know, is a paradigm of machine learning, alongside supervised and unsupervised learning,
where an agent takes actions and interacts with an environment to maximize potentially delayed
rewards. This is usually modeled as a Markov Decision Process.

Where is reinforcement learning used in computer vision? We already saw a couple of examples:
we at least briefly spoke about it in the context of hard attention, where one cannot backpropagate
and a reinforcement learning based algorithm is used instead. There is also NAS, which we just
saw, where reinforcement learning based methods are used to search for architectures.

Other use cases are in games: for example, to automatically play Atari games, one needs to
understand the visual elements on the screen, as well as use reinforcement learning to give a
sequence of actions that the agent has to take to attain a specific reward. Visual servoing, where
a motor is moved to capture or achieve a specific purpose, is also done using reinforcement
learning, and visual tracking can be done using reinforcement learning as well. There are many
more applications, but these are a few sample use cases of reinforcement learning in vision.

(Refer Slide Time: 08:36)

More generally speaking, other contemporary topics in terms of methods include egocentric
vision, where the camera is placed on the person traversing an environment, rather than mounted
as a CCTV or web camera. The problems here are associated with how one views the world
around an individual using such a camera feed.

An example application could be based on something like Google Glass, a wearable glass, and
this could be very useful for developing technologies for individuals with visual impairments,
where the problems have to be seen from that individual's egocentric perspective; when we say
egocentric, we mean from the viewpoint of that individual. You can also have embodied vision,
which is similar to egocentric vision, where the vision technologies are embodied in a human.
We also have the intersection of visual perception and robotics, which is a very useful and
contemporary topic.

There is also visual tracking, and hyperspectral image analysis, where we try to understand
images from different spectra, for example ultraviolet or IR images. A lot of the methods we
discussed so far can be applied directly to hyperspectral images; it may just be a difference in the
number of channels in the input. However, studying images beyond the visible spectrum is an
important area of research for various problems, such as nighttime vision, astronomy, and many
other such domains.

Further, doing computer vision for augmented and virtual reality is also becoming an important
topic. While it may have started with games and the entertainment industry, these days it is also
becoming important for remote interactions of any kind, be they instructional, educational, or
simply a family interaction. And finally, having fair, explainable and trusted models continues to
be a very important topic.

We saw various methods for explaining the decisions of a CNN, but one can go beyond that and
also talk about the fairness of these models, de-bias them from any biases that may be there in
the dataset, and help humans develop trust in these models. On the application front, some of the
applications that are quite popular and have received a lot of attention in recent years are vision
for autonomous navigation and self-driving cars; this is a very important topic with a lot of
investment from industry, with companies like Uber, Tesla, Google, and many others putting a
lot of effort in this direction.

There is vision for drone images: drones are becoming increasingly popular for various tasks
ranging from security to disaster management, and understanding the images that come from
drones becomes an important task. There is vision for all seasons: a lot of the available datasets
are based on daylight scenes, and making these tasks work in adverse weather and lighting
conditions is a very general concern for vision applications. Vision for healthcare and biomedical
imaging continues to be an important topic.

Vision for agriculture has received a lot of traction in the last few years, using deep learning and
computer vision models for precision agriculture: to understand plant phenotypes, to spray
pesticides, to estimate how much harvest you would get out of a patch of land, and so on.
Similarly, vision for fashion and retail, such as virtual try-on to see how a person looks in a
particular dress, or using computer vision as in Amazon Go to automate the shopping experience,
is an important dimension of applications in vision.

And finally, there is an application of huge commercial interest, which is vision for sports. In
many different sports, computer vision is being used these days to understand the biomechanics
of humans, perhaps for rehabilitation, to understand strategies on a football field, or to
understand the movements of a particular player for analysis, either for an opponent or for that
player's coach.

(Refer Slide Time: 14:20)

To conclude, let us revisit the outline of this course that we started with. We had 5 different
segments. We started with trying to understand Image Formation and Linear Filtering. We then
talked about Edges, Blobs and Features, went into Visual Descriptors, Feature Matching, and so
on, and then moved from Traditional Computer Vision to Deep Learning for Computer Vision,
where we started with a review of Neural Networks and backpropagation and went on to the
building blocks of Neural Networks for Computer Vision, which are Convolutional Neural
Networks.

We talked about various different convolutional Neural Network Architectures, how to visualize
and understand how they work. Then we also looked at the many forms and uses of CNNs,
ranging from recognition, verification, detection, segmentation. We talked about the changes in
Architectures, the changes in Loss Functions, and so on. Then we moved on to adding a
dimension to the input, going from images to videos, or time series data.

In this context, we talked about Recurrent Neural Networks, and variants such as LSTMs, we at
least briefly discussed Spatio-Temporal Models, as well as Attention and Vision Language tasks.
And in the final 2-3 weeks, we discussed a lot of contemporary topics such as Deep Generative
models, learning with Limited Supervision, and recent trends such as Pruning, Adversarial
Robustness, Neural Architecture Search, and so on.

Hopefully, this course gave you a good introduction, and a good grip on the topics related to
contemporary Deep Learning Models for Computer Vision.

(Refer Slide Time: 16:22)

Let me recall that Computer Vision is far broader than the topics that we covered in this course.
Recall that in the first lecture of this course, we said that one could divide Computer Vision into
3 parts, Learning based vision, Geometry based vision, and Physics based vision. For a large
part, we focused on Learning based vision in this course. And you can go back and revisit the
links that we provided for you if you would like to learn more about Geometry based vision, or
Physics based vision.

(Refer Slide Time: 16:59)

With that, let me take the opportunity to acknowledge and thank all the resources, including
publicly available Deep Learning and Computer Vision courses, which definitely inspired and
helped create the content for this course. While the content of this course was created
exclusively, there was definitely influence, and content adapted and perhaps borrowed, from
different sources; we thank all of these sources. If there are any errors in the material, we will
gladly take your suggestions and improve them as we move forward.

(Refer Slide Time: 17:35)

With that, thank you to each one of you for being a part of this course and participating in it. I
hope you learned something valuable by going through the lectures, as well as the quizzes and
the assignments. Wish you happy learning, and I hope you get a chance to do more in this field.
Thank you.

THIS BOOK
IS NOT FOR
SALE
NOR COMMERCIAL USE

(044) 2257 5905/08


nptel.ac.in
swayam.gov.in
