
Images and convolutional neural networks

Practical deep learning

1
Computer vision

● Computer vision = giving computers the ability to understand visual information
● Examples:
  ○ A robot that can move around obstacles by analysing the input of its camera(s)
  ○ A computer system finding images of cats among millions of images on the Internet

2
From picture to pixels

● An image has to be digitized for computer processing: it is turned into millions of “pixel” elements
● Each pixel is a set of numbers quantifying the color of that element

0.49411765 0.49411765 0.4745098 0.49019608 0.4745098
0.49411765 0.49411765 0.5058824 0.49411765 0.49803922
0.49803922 0.49411765 0.4862745 0.47058824 0.49411765
0.5019608 0.49803922 0.49803922 0.49019608 0.50980395
0.50980395 0.5058824 0.52156866 0.50980395 0.5058824

Picture source: https://pixabay.com/en/kitty-cat-kid-cat-domestic-cat-2948404/
3
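A minimal sketch of this digitization step, assuming Pillow and NumPy are available; the file name kitty.jpg is only a placeholder for the picture above:

# Sketch: digitize an image into normalized pixel values.
import numpy as np
from PIL import Image

img = Image.open("kitty.jpg").convert("L")           # load and convert to grayscale
pixels = np.asarray(img, dtype=np.float32) / 255.0   # scale values to [0, 1]

print(pixels.shape)    # (height, width)
print(pixels[:5, :5])  # a 5x5 block of values, like the grid above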
From pixels to … understanding?

0.49411765 0.49411765 0.4745098 0.49019608 0.4745098
0.49411765 0.49411765 0.5058824 0.49411765 0.49803922
0.49803922 0.49411765 0.4862745 0.47058824 0.49411765
0.5019608 0.49803922 0.49803922 0.49019608 0.50980395
0.50980395 0.5058824 0.52156866 0.50980395 0.5058824

⇒ “There’s a cat among some flowers in the grass”

● This is easy for humans
● But for AI it’s actually one of the harder problems!
● How do you transform that grid of numbers into understanding… or even something useful?
4
Image understanding
• Humans are so good at vision that it’s not even considered intelligence

5
Convolutional neural networks
Convolutional neural network (CNN, ConvNet)

[Figure: dense layer vs. convolutional layer connectivity]

● Dense or fully-connected layer: each neuron connected to all neurons in the previous layer
● CNN: each neuron connected only to a small “local” set of neurons
● Radically reduces the number of network connections

7
Convolution for image data
[Figure: a 3✕3 image area multiplied by 3✕3 weights (conv. kernel) gives one output neuron]

● Image represented as a 2D grid of values
● Each output neuron connected to a small 2D area in the image
● Output value = weighted sum of inputs
● Idea: nearby pixels are related ⇒ we can learn local relationships of pixels
8
Image source: https://mlnotebook.github.io/post/CNN1/
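A minimal sketch of this weighted sum for a single output neuron, using NumPy and made-up values for the 3✕3 image area and the 3✕3 kernel:

# Sketch: one output neuron of a convolution is a weighted sum
# of a 3x3 image area and a 3x3 kernel (illustrative values only).
import numpy as np

patch  = np.array([[0.49, 0.49, 0.47],
                   [0.49, 0.51, 0.49],
                   [0.50, 0.49, 0.49]])   # 3x3 image area
kernel = np.array([[ 1.0, 0.0, -1.0],
                   [ 1.0, 0.0, -1.0],
                   [ 1.0, 0.0, -1.0]])    # 3x3 weights (e.g., a vertical edge detector)

output_neuron = np.sum(patch * kernel)    # weighted sum of inputs
print(output_neuron)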
Convolution for image data
[Figure: sliding the 3✕3 weights (conv. kernel) over the image input produces a feature map]

● We repeat for each output neuron
● Weights stay the same (shared weights)
● Border effect: without padding the output area is smaller
● Outputs form a “feature map”
9
Image source: https://mlnotebook.github.io/post/CNN1/
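A minimal NumPy sketch of sliding one shared 3✕3 kernel over a toy image; it also shows the border effect (the feature map is smaller than the input when no padding is used):

# Sketch: sliding the same kernel over the whole image (shared weights),
# with no padding ("valid" convolution), so the feature map shrinks.
import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    fmap = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            fmap[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return fmap

image  = np.random.rand(8, 8)             # toy 8x8 image
kernel = np.random.rand(3, 3)             # shared 3x3 weights
print(conv2d_valid(image, kernel).shape)  # (6, 6): border effect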
A real example

Image from: http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx


Side note: color images
● Example: 256 ✕ 256 color image with 3 color channels (red, green,
and blue)
⇒ single image is a 3D tensor: 256 ✕ 256 ✕ 3
● Example: 5 ✕ 5 convolution is actually also a 3D tensor: 5 ✕ 5 ✕ 3
● The kernel slides over width and height, but covers the full color depth

11
Convolution for image data

[Figure: K kernels, each 5✕5(✕3), applied to a 256✕256✕3 image produce K feature maps, each 252✕252✕1]

● We can repeat for different sets of weights (kernels)
● Each learns a different “feature”
● Typically: edges, corners, etc.
● Each outputs a feature map
12
Convolution for image data
[Figure: K kernels, each 5✕5(✕3), applied to a 256✕256✕3 image produce an output tensor of 252✕252✕K]

● We stack the feature maps into a single tensor
● Depth of the output tensor = number of kernels K
● This tensor is the output of the entire convolutional layer
13
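A short tf.keras sketch checking these shapes; K = 32 is an arbitrary choice of the number of kernels:

# Sketch (tf.keras): K kernels of size 5x5 over a 256x256x3 image
# produce an output tensor of 252x252xK when no padding is used.
import tensorflow as tf
from tensorflow.keras import layers

K = 32                                               # number of kernels (chosen for illustration)
x = tf.random.uniform((1, 256, 256, 3))              # one RGB image
conv = layers.Conv2D(filters=K, kernel_size=5, padding="valid")
print(conv(x).shape)                                 # (1, 252, 252, 32)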
Convolution in layers: intuition
[Figure: stacked convolutional layers transform an image into the prediction “cat”]

● We can then add another convolutional layer
● This operates on the previous layer’s output tensor (feature maps)
● Features layered from simple to more complex
14
[Figure: learned low-level features → learned mid-level features → learned high-level features → learned classifier → “cat”]

Image from lecture by Yann Le Cun, original from Zeiler & Fergus (2013)

15
Image datasets

• Color image mini-batches are 4D tensors: samples ✕ width ✕ height ✕ color channels
• Plenty of big datasets for training exist, e.g., ImageNet with 1.2 million images in 1000 classes
• Data augmentation for small datasets: generate more training data by transforming existing data
  • E.g., shifting, rotation, cropping, scaling, adding noise, etc. (see the sketch below)

16
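A possible data augmentation sketch using the tf.keras preprocessing layers (assuming a recent TensorFlow version that provides RandomFlip, RandomRotation, RandomTranslation and RandomZoom):

# Sketch: generate transformed training samples with tf.keras preprocessing layers.
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),      # mirror images left-right
    layers.RandomRotation(0.1),           # rotate by up to ~36 degrees
    layers.RandomTranslation(0.1, 0.1),   # shift in height and width
    layers.RandomZoom(0.1),               # zoom in/out (cropping/scaling)
])

images = tf.random.uniform((8, 256, 256, 3))   # a toy mini-batch
augmented = augment(images, training=True)     # new, transformed samples
print(augmented.shape)                         # (8, 256, 256, 3)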
Convolutional layers

• Input: tensor of size N × Wi × Hi × Ci


• Hyperparameters:
• K: number of filters
• w, h: kernel size
• padding: how to handle image borders
• activation function
• Output: tensor of size N × Wo × Ho × K
• In tf.keras (see the sketch below):
  layers.Conv2D(filters, kernel_size, padding=..., activation=...)

  (there are also Conv1D and Conv3D)

17
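A short sketch of the Conv2D hyperparameters above, showing how the padding choice affects the output size:

# Sketch: "same" padding keeps the spatial size, "valid" shrinks it.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.uniform((4, 64, 64, 3))   # N=4, Wi=Hi=64, Ci=3

same  = layers.Conv2D(filters=16, kernel_size=3, padding="same",  activation="relu")
valid = layers.Conv2D(filters=16, kernel_size=3, padding="valid", activation="relu")

print(same(x).shape)    # (4, 64, 64, 16): borders are zero-padded
print(valid(x).shape)   # (4, 62, 62, 16): output area is smaller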
Pooling layers

• Used to reduce the spatial resolution
  • operates independently on each channel
  • reduces complexity and the number of parameters
• MAX operator most common
  • sometimes also AVERAGE
• In tf.keras:
layers.MaxPooling2D(pool_size)
layers.AveragePooling2D(pool_size)

18 Image from http://cs231n.github.io/convolutional-networks/
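A short sketch of the pooling layers above, showing the spatial resolution being halved while the number of channels stays the same:

# Sketch: pooling halves width and height, channel by channel.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.uniform((4, 64, 64, 16))
print(layers.MaxPooling2D(pool_size=2)(x).shape)       # (4, 32, 32, 16)
print(layers.AveragePooling2D(pool_size=2)(x).shape)   # (4, 32, 32, 16)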


• Flatten
  • flattens the input into a vector (typically before dense layers)
• Dropout
  • works the same way as with dense layers
• In tf.keras:
  layers.Flatten()
  layers.Dropout(rate)

19
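A short sketch of Flatten and Dropout applied to a feature-map tensor:

# Sketch: Flatten turns the feature-map tensor into a vector;
# Dropout randomly zeroes a fraction of the values during training.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.uniform((4, 32, 32, 16))
flat = layers.Flatten()(x)
print(flat.shape)                                       # (4, 16384)
dropped = layers.Dropout(rate=0.5)(flat, training=True)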
Non-Linearity Layer

• Non-linear activations are needed to learn complex (non-linear) data representations
  • Otherwise, NNs would be just a linear function (such as W₂(W₁x) = (W₂W₁)x = Wx)
  • NNs with a large number of layers (and neurons) can approximate more complex functions

20
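A tiny NumPy sketch of this point: without a non-linearity in between, two weight matrices collapse into a single linear map:

# Sketch: W2 (W1 x) equals (W2 W1) x, i.e., one equivalent weight matrix W.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)
one_layer  = (W2 @ W1) @ x                   # a single equivalent weight matrix W
print(np.allclose(two_layers, one_layer))    # True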
Activation: Sigmoid

• Sigmoid function σ: takes a real-valued number and “squashes” it into the range between 0 and 1
§ The output can be interpreted as the firing rate of a biological neuron
o Not firing = 0; Fully firing = 1
§ When the neuron’s activations are near 0 or 1, sigmoid neurons saturate
o Gradients at these regions are almost zero (almost no signal will flow)
§ Sigmoid activations are less common in modern NNs

f(x) = σ(x) = 1 / (1 + e^(−x)),   ℝ → (0, 1)

Slide credit: Ismini Lourentzou – Introduction to Deep Learning 21


Activation: Tanh

• Tanh function: takes a real-valued number and “squashes” it into range between -1 and 1
§ Like sigmoid, tanh neurons saturate
§ Unlike sigmoid, the output is zero-centered
o It is therefore preferred over sigmoid
§ Tanh is a scaled sigmoid: tanh(x) = 2σ(2x) − 1

f(x) = tanh(x),   ℝ → (−1, 1)

Slide credit: Ismini Lourentzou – Introduction to Deep Learning 22


Activation: ReLU

• ReLU (Rectified Linear Unit): takes a real-valued number and thresholds it at zero

f(x) = max(0, x),   ℝ → [0, ∞)

§ Most modern deep NNs use ReLU activations
§ ReLU is fast to compute
  o Compared to sigmoid, tanh
  o Simply threshold a matrix at zero
§ Accelerates the convergence of gradient descent
  o Due to its linear, non-saturating form
§ Prevents the vanishing gradient problem

23
Activation: Leaky ReLU

• The problem of ReLU activations: they can “die”
  § ReLU could cause the weights to update in such a way that the gradients become zero and the neuron will never activate again on any data
  § E.g., when a large learning rate is used

• Leaky ReLU activation function is a variant of ReLU
  § Instead of the function being 0 when x < 0, a leaky ReLU has a small negative slope (e.g., α = 0.01, or similar)
  § This resolves the dying ReLU problem
  § Most current works still use ReLU
    o With a proper setting of the learning rate, the problem of dying ReLU can be avoided

f(x) = αx for x < 0,  x for x ≥ 0

24
Activation: Linear Function

• Linear function means that the output signal is proportional to the input signal of the neuron

f(x) = c·x,   ℝ → ℝ

  § If the value of the constant c is 1, it is also called the identity activation function
  § This activation type is used in regression problems
    o E.g., the last layer can have a linear activation function, in order to output a real number (and not a class membership)

25
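The activation functions discussed above, written out as a small NumPy sketch (the constants α and c are example values):

# Sketch: the activations above as plain NumPy functions.
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))   # R -> (0, 1)
def tanh(x):               return np.tanh(x)                 # R -> (-1, 1), equals 2*sigmoid(2x) - 1
def relu(x):               return np.maximum(0.0, x)         # R -> [0, inf)
def leaky_relu(x, a=0.01): return np.where(x < 0, a * x, x)  # small slope for x < 0
def linear(x, c=1.0):      return c * x                      # identity when c = 1

x = np.linspace(-3, 3, 7)
print(relu(x))
print(leaky_relu(x))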
Fully Connected Layer

• A Fully Connected (FC) layer, also known as a dense layer, is a


type of layer used in artificial neural networks where each neuron
or node from the previous layer is connected to each neuron of
the current layer.
• It’s called “fully connected” because of this complete linkage. FC layers are typically found towards the end of a neural network architecture and are responsible for producing the final output predictions.

26
Fully Connected Layer

Key Features:
• In CNNs, FC layers often come after the convolutional and pooling
layers. They are used to flatten the 2D spatial structure of the
data into a 1D vector and process this data for tasks like
classification.
• The number of neurons in the final FC layer usually matches the
number of output classes in a classification problem. For instance,
for a 10-class digit classification problem, there would be 10
neurons in the final FC layer, each outputting a score for one of
the classes.

27
Typical architecture

1. Input layer = image pixels


2. Convolution
3. ReLU
4. Pooling
   (repeat steps 2–4 one or more times)
5. One or more fully connected layers (+ReLU)
6. Final fully connected layer to get to the number of
classes we want
7. Softmax to get probability distribution over classes
28
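A minimal tf.keras sketch of this typical architecture; the 28✕28✕1 input size, the filter counts and the 10 output classes are illustrative choices, not values from the slides:

# Sketch: input -> [conv + ReLU + pooling] x2 -> dense -> softmax over classes.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),             # 1. input layer = image pixels
    layers.Conv2D(32, 3, activation="relu"),       # 2.-3. convolution + ReLU
    layers.MaxPooling2D(2),                        # 4. pooling
    layers.Conv2D(64, 3, activation="relu"),       # repeat steps 2-4
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),          # 5. fully connected (+ReLU)
    layers.Dense(10, activation="softmax"),        # 6.-7. output layer + softmax
])
model.summary()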
CNN architectures and
applications

29
AlexNet

VGG

30
Inception /
GoogLeNet

ResNet

DenseNet

31
Large-scale CNNs with pre-trained weights
[Figure: a pre-trained CNN re-used either by taking extracted features from an intermediate layer, or by replacing the output layer and retraining the top layers]

• For many applications, an existing CNN can be re-used instead of training a new model from scratch: extract features from a suitable layer, or retrain the top layers with new data (see the sketch below)
• Keras contains several models trained with ImageNet:
  • Xception, VGG16, VGG19, ResNet50, InceptionV3, InceptionResNetV2, MobileNet, DenseNet, NASNet
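A possible transfer-learning sketch using one of the listed Keras models (VGG16) as a frozen feature extractor with a replaced output layer; the 5 output classes and the 224✕224 input size are example choices:

# Sketch: re-use a pre-trained ImageNet model and add a new output layer.
import tensorflow as tf
from tensorflow.keras import layers

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pre-trained weights

model = tf.keras.Sequential([
    base,                                    # extracted features
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation="softmax"),   # replaced output layer (5 classes as an example)
])
model.summary()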
Computer vision applications

Image credit: Li Fei-Fei et al


33
Image credit: Noh et al, Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
Some selected applications

• Object detection: https://pjreddie.com/darknet/yolo/


• Semantic segmentation:
https://www.youtube.com/watch?v=qWl9idsCuLQ
• Human pose estimation:
https://www.youtube.com/watch?v=pW6nZXeWlGM
• Video recognition: https://valossa.com/
• Digital pathology: https://www.aiforia.com/

34
