Image Colorization Final Report

This document provides an introduction and overview of image colorization. It discusses how colorization adds color components to grayscale images to provide more semantic insights. Colorization is challenging due to limited datasets and computational resources. Deep learning models like CNNs and the U-Net architecture are commonly used for colorization tasks. The U-Net consists of an encoding path to extract features and a decoding path to reconstruct the image, and it is well-suited for colorization due to its similarities to segmentation. The motivation for colorization is to enhance grayscale images, colorize historical photos, and aid in digital art creation. The aim is to develop models that can accurately and realistically colorize images.


CHAPTER 1 INTRODUCTION

Colorization is the process of adding color components to a grayscale image. The information contained in a grayscale image is more limited than that of a color image, so adding the color components can provide more insight into its semantics. However, colorization is a challenging task because of small datasets, the variety of images in the training set, and the limited availability of computational resources [1]. Colorization is also an ambiguous task because it does not have a unique solution. A U-Net takes the grayscale image and predicts the color image. Coloring grayscale images has a strong impact on a wide variety of domains, typically color restoration and the re-design of historical images [2]. The colorization problem is both complex and interesting, as the final output image and the input image have the same dimensions [3]. CNNs play an important role in handling diverse tasks such as image colorization, classification, and image labeling, and in recent years they have been widely used for image colorization [3]. The U-Net architecture is a convolutional network architecture originally designed for image segmentation; since the problem of coloring an image is essentially similar to segmentation, U-Net is a natural choice for image coloring [4]. The U-Net consists of three main parts: the encoder, the bridge, and the decoder. The encoding part is responsible for converting the input image into a compact representation called the latent space. The decoding process reconstructs an image of the same size as the input by using upsampling and convolution operations. The role of the bridge is to bind the encoding and decoding units. The low-level detail features in the encoding part are concatenated with the corresponding high-level features in the decoding part [4].

A colored image carries more information than a grayscale image, and most coloring approaches are based on the auto-encoder, i.e. an encoder-decoder. A number of solutions have been proposed for the colorization of black and white images; the challenge lies in producing colorizations accurate enough to look natural to the human eye. The U-Net was originally used for the segmentation of medical images, but the task of coloring is similar to segmentation because coloring involves separating similar regions in the image and filling each segment with an appropriate color. The U-Net architecture is used for coloring because a plain encoder-decoder suffers from an information bottleneck in the flow of low-level information through the network [3]. To reduce this problem, features from the contracting path are also connected to the corresponding upsampling layers, so that more information can be obtained from the grayscale image and it can be automatically colored with greater accuracy and a more natural appearance [4].

1.1 Problem Statement and Description

The problem statement for image colorization is to develop a deep learning model that can accurately and
realistically add color to grayscale images. The goal is to produce colorized images that are visually
appealing and consistent with the colors that would be expected in the original image.
More specifically, the problem statement involves:
1. Preparing a dataset of grayscale images and their corresponding color images for training the deep
learning model.
2. Developing a deep learning model that can take a grayscale image as input and generate a plausible
and realistic color image as output.
3. Training the model on the prepared dataset of grayscale and color images.
4. Evaluating the performance of the model on a separate set of test images to measure how accurately
and realistically the model is able to colorize the images.
5. Optimizing the model to improve its performance on the test set.
The ultimate goal of image colorization is to provide a tool for artists, photographers, and restoration
specialists to colorize old or historical images, enhance the visual quality of grayscale images, and aid in
the creation of digital art.

1.2 Motivation:

Image colorization is the process of adding color to grayscale images. The motivation for image
colorization is to provide a tool for artists, photographers, and restoration specialists to enhance the visual
quality of grayscale images, colorize old or historical images, and aid in the creation of digital art.
The use of image colorization has become increasingly popular in recent years due to advancements in

deep learning and computer vision. Deep learning models such as convolutional neural networks (CNNs)
have shown remarkable success in accurately and realistically colorizing grayscale images.
One of the main motivations for image colorization is the preservation of historical images. Historical
images often only exist in grayscale, but colorization can bring them to life and provide a better
understanding of the past. By adding color to historical images, we can gain a better appreciation of the
context, clothing, and environment of the time period.

Another motivation for image colorization is its application in the fields of art and photography. Adding
color to black and white photos can make them more visually appealing, and provide a unique artistic
perspective. Additionally, colorizing photographs of the natural world or landscapes can provide a better
understanding of the environment and bring attention to issues such as climate change.
Moreover, image colorization has several practical applications in fields such as advertising and
marketing. By colorizing grayscale product images, companies can provide a more realistic representation
of their products, which can increase sales.
In conclusion, the motivation for image colorization is to provide a tool for enhancing the visual quality
of grayscale images, colorizing historical images, and aiding in the creation of digital art. With
advancements in deep learning and computer vision, the accuracy and realism of colorization have greatly
improved, making it a valuable tool in several applications.

1.3 Aim and scope:

Aim of the project:


The aim of image colorization is to add color to grayscale images to enhance their visual quality, provide
a better understanding of the past, and aid in the creation of digital art. The scope of image colorization is
broad and encompasses several fields, including art, photography, advertising, restoration, and medical
imaging.

Scope of the project:


The scope of image colorization is constantly expanding as new applications and techniques are
developed. With advancements in deep learning and computer vision, the accuracy and realism of
colorization results have greatly improved, making image colorization a valuable tool in several fields.

1.4 Background and basics
1.4.1 Theoretical Background
Since colorizing a grey image involves greyscale images and color spaces, and using a reference image for
approximation requires some texture analysis and clustering, I will explain these concepts before heading
forward.
1.1.1 Greyscale Images
A greyscale image is an image represented by intensity only. The value of this intensity defines the
appearance of each pixel in the image. The lowest intensity value, or its absence, is represented by black,
and the highest value by white. All values in between represent shades of grey. The number of shades
depends on the range of values a pixel may hold. If a pixel is represented by a single bit, it can hold two
values, 0 and 1, giving a pure black and white image with no grey shades. If the pixel is represented by a
byte, there are 256 grey levels, starting with 0 as black and ending with 255 as white. As the number of
bits used to represent a pixel increases, the number of grey levels increases with it. Intensity, the only
property of a greyscale image, is encoded by this pixel value: the higher the intensity (luminance), the
closer the pixel's shade is to white, and vice versa.
1.1.2 Color Images
A color image is an image represented in some color space. This color space does not depend on a single
value as a greyscale image does. Each pixel in a color image is represented by more than one value, and
the combined effect of these values gives the appearance of a color. Before looking at these color spaces,
we need to understand how our eyes sense colors. Our eyes consist of rod and cone cells. Rod cells sense
the intensity of light, whereas cone cells sense the color of the light that falls on the retina. Cone cells are
of three types, named short, medium and long. Short cones are sensitive to shorter wavelengths of light,
meaning they sense blue. Medium cones are sensitive to medium wavelengths and hence sense green, and
similarly long cones sense red because of their sensitivity to long wavelengths. Based on this, a color space
was designed which is named the RGB color space.

1.1.3 RGB Color Space


The RGB color space is designed from the perspective of the human sense of color perception. From the
perspective of computer graphics, this color space represents each pixel with three values: Red, Green and
Blue. Each pixel on a screen corresponds to three light-emitting elements, which emit R, G and B light to
represent the color of that pixel; our S, M and L cones receive these lights, and their combined effect is
that we see a single color. If each color is defined by a byte, then we have 256 shades of each color, which
together give a vast variety of colors. This color space is designed for representing colors in electronic
systems, such as computers, televisions and printers, where humans need to perceive them. Although RGB
is designed around human perception, it is not well suited for processing and performing calculations in
image processing, and it is not device independent. Image processing usually involves other color spaces,
and the one relevant to this work is the Lαβ color space.

1.1.4 Lαβ Color Space


The Lαβ color space consists of three parts: L, α and β. Here L is luminance, α represents values from
green to magenta, and β represents values from blue to yellow. In this color space, an L value of 0
represents the absence of luminance and higher values represent increasing luminance. Lower values of α
represent green and higher values represent magenta, with the values in between varying from green to
magenta. Similarly, lower values of β represent blue and higher values represent yellow. The Lαβ color
space is widely used in image processing. It is used in this work because it minimizes the correlation
between color channels, so a change in one channel does not affect the others. However, this color space
is a chromatic-value color space and is not well suited to human perception of colors, though it is very
useful for processing them.
1.1.5 Texture
Texture is defined as a block structure with certain properties in its domain of use. In image processing, a
texture is a block of elements in an image that repeats within that image. Textures can be divided into two
main categories: structured and stochastic. A structured texture is a texture of regular repeating shapes;
examples are the bricks of a wall or tiles on a floor. A stochastic texture has no regular pattern and no fixed
minimum or maximum luminance or color; this kind of texture looks like noise. An example is an image
of sand in a desert.

Deep learning:
Artificial neural networks are used in deep learning, a kind of machine learning, to learn from and draw
conclusions from complex data. Deep learning has become increasingly popular in recent years due to

advancements in computing power and the availability of large datasets. One of the main advantages of
deep learning is its ability to learn and extract meaningful features from data. Deep learning models can
learn to recognize patterns and relationships in data that are too complex for traditional machine learning
models. This ability to learn from complex data has enabled deep learning to achieve state-of-the-art
results in several domains such as image recognition, natural language processing, and speech
recognition.

Deep learning models are typically composed of multiple layers of artificial neurons, which are trained
using backpropagation to adjust their weights and biases to minimize the error between predicted and
actual outputs. The process of training a deep learning model involves presenting the model with large
amounts of labeled data, and gradually adjusting the weights and biases of the neurons to improve
performance.

One of the key applications of deep learning is in image recognition, where convolutional neural
networks (CNNs) have achieved impressive results in tasks such as object detection, image
segmentation, and image classification. CNNs are designed to recognize spatial patterns in images,
making them particularly effective for image recognition tasks.

Another important application of deep learning is in natural language processing (NLP), where recurrent
neural networks (RNNs) and transformer models have shown remarkable success in tasks such as
language translation, sentiment analysis, and text generation. These models are designed to recognize
patterns in sequences of text, making them well-suited for NLP tasks.

Furthermore, deep learning has also been applied in several other domains such as speech recognition,
drug discovery, and autonomous vehicles, among others.

In conclusion, deep learning is a powerful subset of machine learning that has shown remarkable success
in several domains. Its ability to learn and extract meaningful features from complex data has enabled it
to achieve state-of-the-art results in tasks such as image recognition and natural language processing.
With the availability of large datasets and advancements in computing power, deep learning is expected
to continue to grow and make significant contributions in several fields.

1.4.2 CNN Algorithm:

Fig 1.1: CNN algorithm

The capability of artificial intelligence to close the gap between human and computer skills has been
growing dramatically. Both professionals and beginners work on many facets of the field to achieve great
results, and computer vision is one of these disciplines.

The goal of this field is to give machines the ability to see the environment in much the same way humans
do, and to use that understanding for a variety of activities, including image and video recognition, image
analysis, media recreation, recommendation systems, natural language processing, etc. Over time, one
particular algorithm, the Convolutional Neural Network, has been developed and optimised, largely
driving breakthroughs in computer vision with deep learning.

Introduction

Fig 1.2 : Typical CNN Architecture


A Convolutional Neural Network (ConvNet/CNN) is a deep learning method that can take an input image,
assign importance (learnable weights and biases) to various elements and objects in the image, and
distinguish between them. In comparison to other classification methods, a ConvNet requires significantly
less pre-processing. Unlike primitive approaches, where filters are hand-engineered, ConvNets have the
ability to learn these filters and their attributes.

A ConvNet's architecture was inspired by the organisation of the visual cortex and is similar to the
connectivity pattern of neurons in the human brain. Individual neurons respond to stimuli only in a
restricted region of the visual field known as the receptive field, and a collection of such overlapping fields
covers the entire visual area.

Why ConvNets over Feed-Forward Neural Nets?

Fig 1.3: Flattening of a 3x3 image matrix into a 9x1 vector

Through the use of appropriate filters, a ConvNet can effectively capture the spatial and temporal
dependencies in an image. Because there are fewer parameters to account for and the weights are reused,
the architecture provides a better fit to the image dataset. In other words, the network can be trained to
better understand the level of complexity in the image.

Input Image

Fig 1.4: 4x4x3 RGB Image

In the figure, the RGB image has been separated into its three colour planes: Red, Green and Blue. Images
exist in a variety of colour spaces, including grayscale, RGB, HSV, CMYK, etc.
Once images reach sizes such as 8K (7680x4320), you can imagine how computationally intensive things
become. The ConvNet's job is to reduce the images into a form that is easier to process without sacrificing
the features that are essential for making accurate predictions. This is crucial when designing an
architecture that is both scalable to large datasets and effective at learning features.
Convolution Layer — The Kernel

Fig 1.5:Convoluting a 5x5x1 image with a 3x3x1 kernel to get a 3x3x1 convolved feature

Image Dimensions = 5 (Height) x 5 (Breadth) x 1 (Number of channels, eg. RGB)


In the figure above, I, our 5x5x1 input image, is represented by the green area. The Kernel/Filter,
abbreviated K, is the component that performs the convolution process in the first segment of a
convolutional layer and is shown as yellow. As a 3x3x1 matrix, we have chosen K.

Kernel/Filter, K = 1 0 1
0 1 0
1 0 1

Due to Stride Length = 1 (Non-Strided), the Kernel shifts nine times, executing an elementwise
multiplication operation (Hadamard Product) between K and the area P of the picture that the Kernel is
currently hovering over.
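As a small illustration of the operation described above, the following sketch (written in NumPy; the image values are made up for the example and are not taken from the figure) slides the 3x3 kernel K over a 5x5 single-channel image and sums the element-wise (Hadamard) products at each position:

import numpy as np

# A 5x5x1 example image I (values chosen arbitrarily) and the 3x3 kernel K from the text.
I = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
K = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image (no padding) and, at each position,
    # sum the element-wise (Hadamard) products of the kernel and the patch.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

print(convolve2d(I, K))   # a 3x3 convolved feature, as in Fig 1.5

With a stride of 1 and a 3x3 kernel, the kernel indeed shifts nine times over a 5x5 image, producing a 3x3 convolved feature.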

Fig 1.6: Movement of the Kernel

Until it has parsed the entire width, the filter travels to the right with a specific Stride Value. Proceeding
on, it jumps down to the beginning (left) of the picture with the same Stride Value and keeps doing so
until the full image is traversed.

Fig 1.7:Convolution operation on a MxNx3 image matrix with a 3x3x3 Kernel

Images containing several channels, like RGB pictures, have a kernel with the same depth as the input
image. A squashed one-depth channel Convoluted Feature Output is produced by performing matrix
multiplication across the Kn and In stacks ([K1, I1]; [K2, I2]; and [K3, I3]). All of the results are then

added together with the bias.


Fig 1.8:Convolution Operation with Stride Length = 2

The objective of the convolution operation is to extract features such as edges from the input image. A
ConvNet is not limited to a single convolutional layer. Typically, the first ConvLayer captures low-level
features such as edges, colour and gradient direction. With additional layers, the architecture adapts to
high-level features as well, giving us a network that understands the dataset's images holistically, in a way
comparable to how we do.
The operation yields two kinds of results: one in which the dimensionality of the convolved feature is
reduced compared to the input, and another in which it is increased or stays the same. This is achieved by
applying Valid Padding in the former case and Same Padding in the latter.

Fig 1.9:SAME padding: 5x5x1 image is padded with 0s to create a 6x6x1 image

When the 5x5x1 image is padded into a 6x6x1 image and the 3x3x1 kernel is applied to it, the convolved
matrix turns out to have dimensions 5x5x1, hence the name Same Padding.
On the other hand, if we carry out the identical operation without padding, we obtain a matrix with the
dimensions of the kernel itself (3x3x1); this is called Valid Padding.
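The effect of the two padding strategies can be sketched with PyTorch's Conv2d layer (a minimal illustration under assumed sizes, not code from this project):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)   # a 5x5 single-channel image (batch size 1)

# Valid padding: no zeros are added, so the 3x3 kernel shrinks the output to 3x3.
valid_conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=0)
print(valid_conv(x).shape)    # torch.Size([1, 1, 3, 3])

# Same padding: a one-pixel border of zeros keeps the output at 5x5.
same_conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
print(same_conv(x).shape)     # torch.Size([1, 1, 5, 5])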

Pooling Layer

Fig 1.10: 3x3 pooling over 5x5 convolved feature

The Pooling layer, like the Convolutional Layer, is in charge of shrinking the Convolved Feature's spatial
size. With dimensionality reduction, the amount of computing power required for processing the data will
be reduced. Furthermore, it aids in properly training the model by allowing the extraction of dominating
characteristics that are rotational and positional invariant.

Max Pooling and Average Pooling are the two distinct kinds of pooling. Max Pooling returns the maximum
value from the portion of the image covered by the kernel, whereas Average Pooling returns the average of
all the values in that portion.

Moreover, Max Pooling performs as a noise suppressant. It does dimensionality reduction, de-noising, and
complete discarding of the noisy activations. The noise-suppression process used by average pooling, in
contrast, is just dimensionality reduction. As a result, we can conclude that Max Pooling outperforms
Average Pooling.

Fig 1.11: Types of Pooling
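A minimal sketch of the two pooling operations in PyTorch (illustrative only, not code from this project):

import torch
import torch.nn as nn

feature_map = torch.randn(1, 1, 4, 4)              # a toy convolved feature

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)   # keeps the strongest activation in each window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)   # averages all activations in each window

print(max_pool(feature_map).shape)   # torch.Size([1, 1, 2, 2])
print(avg_pool(feature_map).shape)   # torch.Size([1, 1, 2, 2])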

The i-th layer of a convolutional neural network is made up of the convolutional layer and the pooling
layer. The number of these layers may be expanded to capture even more minute details, but doing so will
require more computer power depending on how complex the images are.
After going through the approach outlined above, we were able to successfully help the model comprehend
the features. Next, we will flatten the output for classification purposes and feed it into a standard neural
network.

Classification — Fully Connected Layer (FC Layer)

Fig 1.12: Classification — Fully Connected Layer (FC Layer)

Adding a Fully-Connected layer is a (usually) inexpensive way of learning non-linear combinations of the
high-level features represented by the output of the convolutional layers. The Fully-Connected layer learns
a possibly non-linear function in that space.

Having converted our input image into a form suitable for a multi-level perceptron, we flatten the output
into a column vector. The flattened output is fed to a feed-forward neural network, and backpropagation is
applied in every training iteration. Over a number of epochs, the model learns to distinguish dominant and
specific low-level features and to classify images using the Softmax classification method. There are
various CNN architectures available which have been key in building the algorithms that power, and shall
continue to power, AI as a whole in the foreseeable future. Some of them are listed below:

1. LeNet

2. AlexNet

3. VGGNet

4. GoogLeNet

5. ResNet

6. ZFNet

Organization of Thesis
This dissertation is organized as follows:
Chapter 1: This chapter presents the "Introduction" to the thesis, describing the motivation, problem
statement, aim, and finally the organization of the dissertation.
Chapter 2: This chapter presents the "Literature Survey" of the thesis, describing classification techniques
for image colorization.
Chapter 3: This chapter presents the "System Architecture" of the thesis, describing the proposed system
architecture and system modules.
Chapter 4: This chapter presents the "Implementation" of the thesis, describing dataset collection, data
analysis and data preprocessing, data visualization and feature selection, data splitting, and model building.
Chapter 5: This chapter presents the "Experimental Results" of the thesis and a description of the results.

CHAPTER 2 LITERATURE SURVEY

This chapter is dedicated to creating an overview of previous work in the field of automatic colorization,
with particular focus on works that have inspired, influenced and helped shape this thesis. In general,
working with color channels in images is a rather well studied and surveyed area. However, compared to
other similar problems, the idea of generating some or all of the color information given other data is one
that has seen a relatively limited amount of widespread application and, consequently, research. However, it
is quickly becoming one of the more popular image-to-image tasks in the computer vision field, with
several works published in the recent years using various approaches. Countless algorithms that are not
considered colorization methods, but rather mere color enhancements which aim to improve existing poor
color information or modify the color palette of an image are worth mentioning in this section, as they
serve as a sort of a precursor to full colorization techniques. The goal of these methods is to take images
with color data as their inputs and transform them into images with better visual properties. Often they
serve to remedy certain camera defects, such as overexposure or underexposure contrast adjustment
through histogram equalization[20], as shown in Figure 3.1. The simple transformations performed by
these non-parametric methods are frequently used to describe behaviors of other, more complex
algorithms, even in the domain of CNNs. It is for example possible to say that one of the transformations
performed by a CNN resembles histogram equalization.

Fig 2.1: (a) Overexposed image. (b) Image after color enhancement using histogram
normalization.
When considering full image colorization, there are generally three major types of approaches that have

been used. Firstly, a non-parametric approach, in which the user provides hints to an algorithm as to what
the final colorization should look like. These hints come in the form of scribbles - small patches of color
in specific areas of the image. More automated methods developed still rely on additional user input, but
instead of providing direct color data, the user is expected to provide one or more reference images from
which to perform color transfer onto the target image using statistical data or texture matching. Recently,
with the advent of CNNs, the approach has shifted towards fully automated solutions, where the only
input provided by the user is the target grayscale image. However, this enormous advantage can also turn
into a disadvantage - if the CNN result turns out to be unsatisfactory, there are few to no options to remedy
it easily, unless the model has been specifically designed with this requirement in mind.
2.1 User provided color hints
Colorization methods that depend on color scribbles generally use an optimization framework without
explicit parameter learning to propagate the color from the color patches onto the whole image. The
scribbles are usually provided as a separate image in the form of a color-transparency mask, and the
segments of the image that have no explicit color defined in this mask should have color information
propagated to them. The basic assumption behind most of these methods is that nearby pixels of similar
intensities should also have similar colors. In the method proposed by Levin et al. [21], the colorization is
achieved by solving a convex quadratic cost function obtained as differences of intensities between
neighboring pixels. With further improvements by Huang et al. [22] to exploit edge detection in order to
reduce common problems with color bleeding over object boundaries, this has become a relatively popular
technique to interactively colorize natural images. Luan et al. [23] presented a method extending the use
of scribbles to texture similarity, automatically labeling pixels that should share roughly similar colors
and grouping them into coherent regions. They extend the color locality assumption, seeking remote pixels
with similar textures to color alike to effectively propagate the colorization, further improving the
technique. A similar approach to transferring the color from scribbles, introduced by Qu et al. [24], is
extracting statistical pattern features of local neighborhoods to measure texture continuity, resulting in
fewer scribbles required. A completely different method of optimization was introduced by Sýkora et al.
with LazyBrush [25], which also relaxes the requirement of complete spatial accuracy of scribbles for the
purposes of cartoon image colorization; it solves a multiway cut problem on a graph defined over the
image's pixels, with edge weights calculated from neighboring pixels' relative intensity levels, and, unlike
the other algorithms, it works well on images with large homogeneous regions.

Fig 2.2: Example of Levin's method

As apparent from Figure 2.2, these methods can require significant amounts of user input. The advantage
that they provide is the ease of result refinement, as changing or adding more scribbles can effectively
propagate the desired colorization.

2.2 Automatic color transfer

Much like scribble-based methods, algorithms which perform image-to-image color transfer expect the
user to provide extra inputs. Simpler methods only transfer coloring onto the target image from a single
reference image, though it is more common to define a set of images which serve as references for color
extraction based on statistical properties. Some algorithms choose to process the target image with the
color enhancement algorithms discussed previously, to remove effects such as varying illuminance [26]
or to enforce global properties of the result, such as a desired or known color histogram [27].
Most algorithms extract various image features from the set, such as SURF, Gabor, patch or Daisy
descriptors, and learn a mapping of these features to color channel data. These descriptors are then also
extracted from the target image and mapped color distributions are transferred onto the regions that the
obtained descriptors represent, such as in the work of Welsh et al. [28], who propose a method in which
the features selected are the luminance value and statistical properties of 5 × 5 local neighborhood. Each
pixel in the target image is matched to a set of these features extracted from the source image.
The set is produced by jittered sampling or using manually defined rectangular samples. After the best
matching features are found, the color information is transferred onto the target pixel. The luminance
channel remains unchanged, as is common to most colorization methods.
Gupta et al. [29] improve this approach by using a number of more advanced features which have
rotational invariance and are extracted at multiple scale levels of the image. They attempt to make their

method close to fully automated by running a web search on popular image sharing websites, based on
keywords provided by the user instead of requiring reference images, acquiring semantically relevant
results. The retrieved images are scored based on their colorfulness, and non-fitting images, such as
grayscale images or images with filtering effects applied, are discarded, creating reference image sets of
up to 2000 images.
Similarly, Chia et al. [30] choose to perform an automated web search in conjunction with user provided
foreground-background segmentation cues. This provides the user with more control over the resulting
colorization, while retaining the automated nature of image-to-image color transfer. Liu et al. [26]
automatically generate scribbles from the reference images obtained from the web and propagate them by
using Levin’s method, combined with automatic segmentation. In Deep Colorization by Cheng et al. [31],
a large dataset is divided up into smaller clusters based on global descriptors such as intensity histograms.
For each cluster, a neural network consisting of 3 fully connected layers is trained, using multiple local
feature descriptors computed at random pixel locations from images in the reference set as training data.
The result is obtained as per-pixel colorization prediction of the network which has been trained on the
reference set that most closely matches the target image based on the global descriptors, instead of
explicitly defining the color transferring method.
Deshpande et al. [27] minimize an objective function automatically learned from example sets and
subsequently train a forest of regression trees for color prediction, using multiple image filters to handle
scale invariance. To choose good reference images and choose the used trees, bag-of-features retrieval on
the training set is used.
While these methods generally require less from the user compared to scribble-based methods, they make
it more difficult to influence the colorization output, due to reliance on data that are not straight-forward
to interpret by visually inspecting the reference image set - such as feature descriptors.

2.3 Using CNNs


All methods described previously, with the exception of Deep Colorization, require some form of user
assistance, which reduces their theoretical throughput for large scale colorization and can make them
inconvenient to use, possibly resulting in having to resort to trial-and-error in order to obtain a satisfactory
result. Given that their running times are generally on the order of minutes, they are not well suited to
colorizing large numbers of images.
However, with the recent boom in CNNs, some researchers have started tackling this task, with a higher
degree of success, as a fully automated process by training CNNs on large datasets such as SUN or
ImageNet. Currently, these methods are the state-of-the-art for natural image colorization.
It is worth noting that even the application of CNNs to this task can be viewed as a form of automatic
color or style transfer - with the references automatically pre-selected by the choice of the original CNN
training set - using a complex method learned and realized by the network. However, by using training set
which contains a large variety of semantically different scenes and commonly occurring objects (such as
ImageNet), it is expected that the chosen color transfer should be the best matching one.
Zhang et al. [32] propose a plain CNN with 22 convolutional layers on a subset of the ImageNet dataset,
employing a custom tailored multinomial cross entropy loss with class rebalancing based on prior color
distribution obtained from the training set to predict a color histogram for each output pixel to handle the
multi-modal nature of the task.
Similarly, Larsson et al. [33] also predict a color histogram, however, they choose to use a 16-layer
convolutional model attached to a fully connected hypercolumn layer to predict pixels’ chromatic values,
pretrained on image classification task and fine-tuned for colorization. Rather than train densely and
predict the colorization of the whole image in one pass, the CNN is trained on spatially sparse samples of
grayscale patches of size equal to the receptive field of the network, predicting the color value of the
central pixel. Larsson et al. also explore the possibility of transferring a known ground truth color
histogram (as a global descriptor) to improve the colorization.
Iizuka et al. [8] propose a network which combines two paths of computation, one to predict the global
features of the target image and the other to specialize in local features. To achieve this, the global features
are trained for image classification rather than colorization and are subsequently concatenated to the local
features that are trained directly for colorization using L2 Euclidean loss function. This technique allows
their model to gain a higher semantic understanding of the image, producing very consistent colorizations.
Recent developments in the field of conditional generative adversarial networks (GANs), a model in which
two distinct networks are trained - a generator and a discriminator - have led researchers to attempt to use
them for colorization. In the domain of colorization, the generator is used to
produce a colorized image of the target grayscale image, and the discriminator is then trained to decide
whether the generated image looks more convincing than the ground truth coloring. If that is not the case,
the weights of the generator are updated in the direction of making the image more convincing for the
discriminator, essentially using the discriminator as an adaptive loss function [34].
Cao et al. [35] show application of GAN to colorization of natural images while producing highly

convincing colorizations on the SUN dataset. Along with the target grayscale image, a random noise
vector is given to the generator as input (known as latent space sample), reintroducing some user-defined
influence over the colorization result exhibited by the methods in Section 2.2, albeit one that may be difficult
to reason about.
Fu et al. [36] also use a cartoon movie dataset and choose to train a GAN model for automatic colorization,
though their data source is of larger magnitude (over 15 hours of raw footage compared to less than 1.5
hours of our data). Their image extraction method is also different - sampling every 50 frames in the
original footage - and, most importantly, their testing and validation sets are sampled randomly, which,
due to the nature of the data source, skews the results, as it is reasonable to assume that randomly chosen
frames may be mere translations of frames included in the training set, or that background information
will easily be learned by recognizing objects in the image. Therefore, it is hard to compare the results of
our work to the results of Fu et al.

CHAPTER 3 SYSTEM ARCHITECTURE

Proposed System

This project illustrates an image colorization use case. Image colorization is the process of converting
grayscale images into colored images. This use case is helpful for converting old grayscale images into
color.
Fig 3.1: proposed system working

3.1 Usage of LAB color space


Normally, images are represented as tensors with shape H×W×C, where H is the height, W the width and
C the number of channels. Images are frequently represented in RGB format, so that an image has three
channels, one each for the Red, Green and Blue components. If we use this representation, the input to the
model is a grayscale image and the output is an RGB image. This is not very efficient, because in this case
the model has to predict not only the colors but also the structure of the image (which is already represented
in the grayscale input).
To handle this we use a different color space known as the LAB color space. The LAB color space also
has three channels: L (lightness, i.e. the grayscale version of the image), A (the green-red axis) and B (the
blue-yellow axis). The advantage of using this color space is that the input to the model is the L channel
of the image, representing the grayscale structure, and the output of the model is the A and B channels. To
create the final colored image, the input L channel and the generated A and B channels are combined into
a colored LAB image. For visualization purposes these LAB images are converted to RGB format.

Fig 3.2: CIE Lab color space
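A minimal sketch of this conversion using scikit-image (the file name is a placeholder, and the ground-truth ab channels stand in for the model's prediction):

import numpy as np
from skimage.color import rgb2lab, lab2rgb
from skimage.io import imread

rgb = imread("example.jpg") / 255.0     # H x W x 3 RGB image scaled to [0, 1]
lab = rgb2lab(rgb)                      # L in [0, 100], a and b roughly in [-128, 127]

L = lab[:, :, 0]                        # model input: the grayscale structure
ab = lab[:, :, 1:]                      # model target: the two color channels

predicted_ab = ab                       # placeholder for the model's output
colorized_lab = np.concatenate([L[:, :, np.newaxis], predicted_ab], axis=2)
colorized_rgb = lab2rgb(colorized_lab)  # back to RGB for visualization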

There are other, more sophisticated image colorization mechanisms and models. In this project I have tried
to illustrate the basic steps of image colorization, and I trust this work will bring motivation and interest to
explore the field of image colorization further.

3.2 Module 1: Datasets Collection

One of the hardest problems to solve in deep learning is the problem of getting the right data in the right
format.
Getting the right data means gathering or identifying the data that correlates with the outcomes you want
to predict; i.e. data that contains a signal about events you care about. The data needs to be aligned with
the problem you’re trying to solve. Kitten pictures are not very useful when you’re building a facial
identification system. Verifying that the data is aligned with the problem you seek to solve must be done
by a data scientist. If you do not have the right data, then your efforts to build an AI solution must return
to the data collection stage.
Machine learning needs a good training set to work properly. Collecting and constructing the training set
– a sizable body of known data – takes time and domain-specific knowledge of where and how to gather
relevant information. The training set acts as the benchmark against which deep-learning nets are trained.
That is what they learn to reconstruct before they're unleashed on data they haven't seen before.

At this stage, knowledgeable humans need to find the right raw data and transform it into a numerical
representation that the deep-learning algorithm can understand, a tensor. Building a training set is, in a
sense, pre-pre-training. [20]
In this module we collected the image dataset used for training and evaluating the colorization model,
described below.
The MIRFLICKR 25k dataset is a publicly available dataset of images and associated tags that was created
for research in the field of image retrieval and annotation. It was created by the Multimedia Information
Retrieval group at the LIACS Medialab, Leiden University, and contains 25,000 images from the online photo-
sharing site Flickr.
Each image in the MIRFLICKR 25k dataset is associated with a set of five text tags, which were collected
by asking users of Flickr to tag their own photos. These tags are meant to capture the visual content of the
images and provide a way to search and retrieve images based on their content.
The images in the dataset cover a wide range of topics and subjects, including landscapes, people, animals,
buildings, and more. The dataset is intended to be representative of the kinds of images that are typically
shared online and to provide a diverse set of visual content for research purposes.
The MIRFLICKR 25k dataset has been widely used in research on image retrieval and annotation, as well
as related fields such as machine learning and computer vision. It is often used as a benchmark for
evaluating the performance of algorithms and techniques for image annotation and retrieval.
Researchers can access the dataset and associated tags through the MIRFLICKR website or through
various online repositories. The dataset is licensed under the Creative Commons Attribution-
NonCommercial-ShareAlike license, which allows for non-commercial use and sharing of the dataset as
long as proper attribution is given.

3.3 Module 2: Data Analysis and Data preprocessing

Image colorization using deep learning has gained a lot of attention in recent years due to its ability to
generate high-quality colorized images. However, like any machine learning task, data analysis and
preprocessing are crucial steps to ensure the success of the model.
The first step in image colorization using deep learning is data collection. Collecting a large dataset of
grayscale or black and white images is essential for training the deep learning model. The dataset should

be diverse and representative of the images that will be colorized.
Next, the dataset needs to be cleaned to remove any corrupt or incomplete images. Additionally, data
augmentation techniques can be used to increase the size and diversity of the dataset. Techniques like
rotation, flipping, and scaling can be used to create new training images.
Normalization is another critical step in data preprocessing for image colorization using deep learning.
Normalizing the images involves scaling the pixel values to a common range. This can help improve the
performance of the deep learning model.
Converting the images from RGB to LAB color space is also an essential step in data preprocessing for
image colorization using deep learning. The L channel of the LAB color space represents the grayscale
image, while the A and B channels represent the color information. This separation of color and grayscale
information can help the deep learning model better learn to colorize images.
Preprocessing the images for the deep learning model involves resizing the images to a common size and
cropping them if necessary. Additionally, edge detection or segmentation can be performed on the
grayscale images if the deep learning model requires it.
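As a rough sketch of such a preprocessing step (the 224x224 target size and the scaling factors are illustrative assumptions, not the project's settings):

import numpy as np
from skimage.color import rgb2lab
from skimage.transform import resize

def preprocess(rgb_image, size=(224, 224)):
    # Resize the RGB image, convert it to LAB, and scale the channels
    # to ranges that are easier for a neural network to work with.
    rgb_resized = resize(rgb_image, size, anti_aliasing=True)   # floats in [0, 1]
    lab = rgb2lab(rgb_resized)
    L = lab[:, :, 0] / 100.0      # lightness scaled to [0, 1]
    ab = lab[:, :, 1:] / 128.0    # color channels scaled to roughly [-1, 1]
    return L.astype(np.float32), ab.astype(np.float32)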
Lastly, the dataset needs to be split into training, validation, and testing sets. The training set is used to
train the deep learning model, the validation set is used to monitor the model's performance during
training, and the testing set is used to evaluate the model's performance after training.
In conclusion, data analysis and preprocessing are crucial steps in image colorization using deep learning.
Properly cleaning, augmenting, normalizing, and preparing the dataset can help improve the accuracy and
quality of the colorized images produced by the deep learning model.

3.4 Module 3: Data visualization and Feature Selection

Data visualization and feature selection are important steps in image colorization using deep learning.
Here are some ways these steps can be applied in this context:
Data visualization:
Data visualization techniques can be used to gain insights into the dataset and the colorization problem.
For example, a scatter plot can be used to visualize the distribution of the data in the LAB color space.
This can help identify any patterns or clusters in the data that could be used to improve the deep learning
model's performance.
Heatmaps can also be used to visualize the relationship between different features and the output color.

For instance, a heatmap can show the relationship between the input grayscale image and the predicted
color channels (A and B) of the output image.
Feature selection:
Feature selection is the process of selecting a subset of relevant features that can be used to improve the
performance of the deep learning model. In image colorization, feature selection can involve selecting the
most relevant color and texture features to represent the input grayscale image.
There are several feature selection techniques that can be used in image colorization, such as principal
component analysis (PCA) and mutual information-based feature selection. These techniques can help
identify the most important features that contribute to the colorization process.
Another approach to feature selection in image colorization is to use a deep learning model that can learn
to extract relevant features from the input grayscale image automatically. Convolutional neural networks
(CNNs) are a type of deep learning model that are often used for image processing tasks, including image
colorization. By training a CNN on a large dataset of grayscale and color images, the model can learn to
extract the most relevant features automatically, without the need for manual feature selection.
Overall, data visualization and feature selection are important steps in image colorization using deep
learning. By visualizing the data and selecting the most relevant features, the deep learning model's
performance can be improved, leading to higher quality colorized images.

Mean Square Error (MSE):


Mean square error (MSE) can be used in image colorization using deep learning as a metric to evaluate
the quality of the colorized images produced by the model. In this context, MSE measures the difference
between the predicted and ground truth color channels (A and B) of the colorized image.
To calculate the MSE for image colorization using deep learning, the predicted color channels and ground
truth color channels are first converted to a common color space, such as LAB or RGB. Then, the MSE is
calculated by taking the average of the squared differences between the predicted and ground truth color
values for each pixel in the image.
For example, let's say we have a colorized image with predicted A and B color channels, and a ground
truth color image with true A and B color channels. The MSE can be calculated as follows:

MSE = (1/n) * sum((predicted_A - true_A)^2 + (predicted_B - true_B)^2)

where n is the total number of pixels in the image.
A lower MSE value indicates that the predicted color channels are closer to the ground truth values, and
therefore, the colorized image is of higher quality.
However, it is important to note that MSE is not always the best metric to use for evaluating the quality
of colorized images. As mentioned before, it has some limitations and does not reflect the perceptual
quality of the image. Therefore, other metrics like peak signal-to-noise ratio (PSNR) and structural
similarity index (SSIM) are often used in conjunction with MSE to provide a more comprehensive
evaluation of the model's performance.
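A small sketch of this computation, following the formula above (NumPy, with random arrays standing in for real predictions):

import numpy as np

def mse_ab(predicted_ab, true_ab):
    # Both arrays have shape (H, W, 2): sum the squared differences over the
    # a and b channels per pixel, then average over all n pixels.
    per_pixel = np.sum((predicted_ab - true_ab) ** 2, axis=-1)
    return per_pixel.mean()

predicted = np.random.uniform(-128, 127, size=(256, 256, 2))
ground_truth = np.random.uniform(-128, 127, size=(256, 256, 2))
print(mse_ab(predicted, ground_truth))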

3.5 Module 4: Data Splitting

In machine learning and deep learning, it is common practice to split a dataset into two or more subsets to
train and evaluate models. This process is known as train-test split, and it is an essential step in the model
development process.
Train-test split involves dividing the dataset into two subsets: the training set and the test set. The training
set is used to train the model, while the test set is used to evaluate the model's performance. The goal of
train-test split is to ensure that the model can generalize well to new, unseen data.
There are different ways to split a dataset into a training set and a test set. The most common method is to
randomly divide the dataset into two subsets, with a typical split of 70-30 or 80-20 for the training and
test sets, respectively. Another approach is to use cross-validation, which involves splitting the dataset
into multiple folds, with each fold used for training and testing.
Train-test split is important because it helps prevent overfitting, which occurs when the model performs
well on the training data but poorly on new, unseen data. By evaluating the model's performance on a test
set, we can get a better estimate of how well the model will perform on new data.
In the above code cells, the dataset is split into training and testing sets using the train_test_split function
from Scikit-learn library. The train_test_split function randomly shuffles the data and splits it into the
specified training and testing set sizes.
The l and ab arrays, which respectively contain the grayscale and corresponding AB color values for each
image, are split into l_train, l_test, ab_train, and ab_test using the train_test_split function with a test size
of 0.2. This means that 20% of the data will be used for testing and the remaining 80% for training.
After splitting the data, the training and testing datasets are used to create PyTorch DataLoader objects,
which are used for batching and iterating through the data during training and testing. The DataLoader

objects are created using the DatasetImg class, which takes in the input and output transforms and creates
a PyTorch dataset.
In summary, train-test split is a critical step in the machine learning and deep learning pipeline. It helps
ensure that the model can generalize well to new data and prevent overfitting.
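A minimal sketch of this split and of the DataLoader construction (a generic TensorDataset is used here in place of the project's DatasetImg class, and the array shapes are illustrative):

import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

# l: (N, H, W) grayscale channel, ab: (N, H, W, 2) color channels (toy data here).
l = np.random.rand(100, 224, 224).astype(np.float32)
ab = np.random.rand(100, 224, 224, 2).astype(np.float32)

# 80/20 split with shuffling, as described above.
l_train, l_test, ab_train, ab_test = train_test_split(l, ab, test_size=0.2, random_state=42)

train_ds = TensorDataset(torch.from_numpy(l_train), torch.from_numpy(ab_train))
test_ds = TensorDataset(torch.from_numpy(l_test), torch.from_numpy(ab_test))

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=32, shuffle=False)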

3.6 Module 5: U-Net architecture

Image colorization is the process of adding colors to a grayscale image. It is a challenging task that requires
expertise and time. However, with the advancements in deep learning, it has become possible to automate
this process using convolutional neural networks (CNNs).
The U-Net architecture is a popular and effective architecture for image segmentation tasks. It is named
U-Net due to its U-shaped architecture, which consists of a contracting path followed by an expansive
path. The contracting path is a series of convolutional and pooling layers that downsample the image,
while the expansive path is a series of upsampling and convolutional layers that upsample the image to its
original size. The contracting path learns to extract features from the image, while the expansive path
learns to reconstruct the image.
The ResNet-18 encoder is a popular pre-trained CNN architecture that has been widely used for various
computer vision tasks, such as object detection, image classification, and segmentation. It consists of 18
layers, including residual blocks that learn the residual mapping between the input and output of a layer,
which helps to alleviate the vanishing gradient problem.
To perform image colorization using U-Net architecture with a ResNet-18 encoder, we first use the
ResNet-18 encoder to extract the features from the grayscale image. The extracted features are then passed
through the contracting path of the U-Net architecture to extract more abstract features. The output of the
contracting path is then passed through the expansive path of the U-Net architecture, which upsamples the
features to the original size of the image. The final output of the model is a colorized image.
The U-Net architecture with a ResNet-18 encoder for image colorization has been found to be effective
in various studies. For example, in a study by Zhang et al. (2016), they used a similar architecture to
colorize grayscale images and achieved state-of-the-art results on the ImageNet dataset. In another study
by Larsson et al. (2016), they used a similar architecture to colorize old black and white films and achieved
impressive results.
In conclusion, U-Net architecture with a ResNet-18 encoder is a powerful tool for image colorization. It
allows us to automate the colorization process and achieve state-of-the-art results. With the increasing
availability of large-scale datasets and computational resources, we can expect further advancements in
this area in the near future.

Building the U-Net architecture:

The U-Net architecture is a widely used architecture for various image segmentation tasks. However, in
image colorization, we use the U-Net architecture in a slightly different way, where we replace the decoder
section with
a modified version that can output multiple color channels instead of binary segmentation masks.
In our implementation, we use a ResNet-18 encoder as the base of our U-Net architecture. The ResNet-
18 is a deep convolutional neural network that has shown excellent performance on a wide range of
computer vision tasks. We use the pre-trained version of ResNet-18 available in the PyTorch model zoo
and discard the final classification layer.
In the U-Net architecture, the encoder is used to extract features from the input image. The feature maps
are then passed to the decoder, which produces the final output. The decoder is made up of a series of
upsampling and convolutional layers that gradually increase the resolution of the output.

In our modified version of the U-Net architecture for image colorization, we replace the final decoder
layer with a set of convolutional layers that output two color channels: ab. We then convert the ab channels
to the full RGB color space using the Lab color space, which allows us to separate the luminance
(grayscale) information from the chrominance (color) information. Finally, we concatenate the input
grayscale image with the predicted ab channels to produce the final colorized image.
Overall, the U-Net architecture with a ResNet-18 encoder for image colorization allows us to leverage the
power of deep convolutional neural networks for the challenging task of colorizing grayscale images.

Figure 3.3: U-Net Architecture
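A rough sketch of how such a model can be assembled with fastai's DynamicUnet on top of a torchvision ResNet-18 body (the exact create_body signature varies between fastai versions, so this is illustrative rather than the project's exact code):

import torch
from torchvision.models import resnet18
from fastai.vision.learner import create_body
from fastai.vision.models.unet import DynamicUnet

def build_res_unet(n_input=1, n_output=2, size=256):
    # A pretrained ResNet-18 with its classification head removed serves as the encoder;
    # DynamicUnet builds the matching decoder that outputs the two ab channels.
    body = create_body(resnet18, n_in=n_input, pretrained=True, cut=-2)
    return DynamicUnet(body, n_output, (size, size))

net_G = build_res_unet()
out = net_G(torch.randn(1, 1, 256, 256))   # predicted ab channels: shape (1, 2, 256, 256)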

3.7 Module 6: Evaluating the model using PSNR:
PSNR (Peak Signal to Noise Ratio) is a metric used to evaluate the quality of the reconstructed image
compared to the original image. It is commonly used in image processing and computer vision to compare
two images and measure the difference between them.
PSNR measures the peak signal to noise ratio between two images in decibels (dB) and is calculated using
the mean square error (MSE) between the two images. The MSE is the average of the squared differences
between the pixel values of the two images.
The PSNR can be calculated using the following formula:
PSNR = 20 * log10(MAX_I) - 10 * log10(MSE)
where MAX_I is the maximum pixel value of the image (usually 255 for an 8-bit image), and MSE is the
mean square error between the two images.
A higher PSNR value indicates that the two images are more similar, while a lower PSNR value indicates
that the two images are more different.
In the context of image colorization, the PSNR can be used to evaluate the quality of the colorized image
compared to the original color image. A higher PSNR value indicates that the colorized image is more
similar to the original color image, while a lower PSNR value indicates that the colorized image is less
similar to the original color image.
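A minimal NumPy implementation of this formula might look as follows (a sketch, assuming 8-bit images with a maximum pixel value of 255):

import numpy as np

def psnr(original, reconstructed, max_i=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images of the same shape."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 20 * np.log10(max_i) - 10 * np.log10(mse)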

CHAPTER 4 IMPLEMENTATION

Module 1: Dataset Collection

The dataset consists of two compressed zip files:


(1) ab.zip: contains 25 .npy files holding the a and b channels (in the Lab color space) of the MIRFLICKR25k dataset of randomly sized color images. Lab data takes up a lot of disk space and is therefore slow to load, so it was split into 25 files that can be loaded on demand.
(2) l.zip: contains a single gray_scale.npy file holding the grayscale (L-channel) version of the MIRFLICKR25k dataset.
The images themselves were taken from the MIRFLICKR25k collection.

Table 4.1: Frequency of tags in the MIR Flickr set
Data Splitting:
In the above code cells, the dataset is split into training and testing sets using the train_test_split function
from Scikit-learn library. The train_test_split function randomly shuffles the data and splits it into the
specified training and testing set sizes.
The l and ab arrays, which respectively contain the grayscale and corresponding AB color values for each
image, are split into l_train, l_test, ab_train, and ab_test using the train_test_split function with a test size
of 0.2. This means that 20% of the data will be used for testing and the remaining 80% for training.
After splitting the data, the training and testing datasets are used to create PyTorch DataLoader objects,
which are used for batching and iterating through the data during training and testing. The DataLoader
objects are created using the DatasetImg class, which takes in the input and output transforms and creates
a PyTorch dataset.
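A sketch of this step is given below. The array names l and ab follow the report; the DatasetImg class is defined in the project notebook rather than a public library, so a simple stand-in Dataset is shown here instead, and the normalization inside it is an assumption:

from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset, DataLoader

# 80/20 split of the grayscale (l) and color (ab) arrays
l_train, l_test, ab_train, ab_test = train_test_split(
    l, ab, test_size=0.2, random_state=42
)

# Stand-in for the project's DatasetImg class (the real one also applies the
# input/output transforms described later)
class GrayAbDataset(Dataset):
    def __init__(self, l_arr, ab_arr):
        self.l_arr, self.ab_arr = l_arr, ab_arr
    def __len__(self):
        return len(self.l_arr)
    def __getitem__(self, idx):
        l_img = torch.from_numpy(self.l_arr[idx]).float().unsqueeze(0) / 255.0        # (1, H, W)
        ab_img = torch.from_numpy(self.ab_arr[idx]).float().permute(2, 0, 1) / 255.0  # (2, H, W)
        return l_img, ab_img

train_loader = DataLoader(GrayAbDataset(l_train, ab_train), batch_size=32, shuffle=True)
test_loader = DataLoader(GrayAbDataset(l_test, ab_test), batch_size=32, shuffle=False)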
Defining and training a dynamic UNet model for image colorization using MSE loss
and Adam optimizer:
To define and train a dynamic UNet model for image colorization using MSE loss and Adam optimizer,
we first need to import the necessary libraries and define some parameters.

Fig 4.1: Importing the necessary libraries and defining the parameters
Here, we import PyTorch, FastAI, and scikit-image libraries. We also define the parameters required for
the model training, such as image shape, batch size, device to be used (GPU or CPU), number of epochs,
frequency of plotting, number of input and output channels, etc.
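The exact values are shown in Fig 4.1; a representative sketch of such a setup is given below (the specific values here are placeholders, not the report's actual settings):

import torch

# Core training parameters (placeholder values; see Fig 4.1 for the actual ones)
img_shape = (224, 224)                    # spatial size fed to the network
batch_size = 32
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
epochs = 35
plot_every = 5                            # plot losses/predictions every N epochs
n_input, n_output = 1, 2                  # grayscale L channel in, ab channels out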
Next, we define the UNet architecture with a ResNet-18 encoder by creating the model body using the
create_body function from FastAI, and then passing it to the DynamicUnet class with the number of
output channels.

Fig 4.2: Defining the UNet architecture with a ResNet-18 encoder


Here, we use the ResNet-18 pretrained model as the encoder of the UNet architecture. We also set the
number of input channels to 1 since we are using grayscale images as input, and the cut parameter to -2 to
remove the last two layers of the ResNet-18 model. Finally, we create the dynamic UNet model by passing
the model body and the number of output channels to the DynamicUnet class, along with the image shape.
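A minimal sketch of this construction is shown below, assuming fastai v2's create_body and DynamicUnet APIs (exact signatures differ slightly between fastai releases, and img_shape/device come from the parameter sketch above):

from torchvision.models import resnet18
from fastai.vision.learner import create_body
from fastai.vision.models.unet import DynamicUnet

# ResNet-18 backbone: 1 input channel (grayscale), last two layers cut off
body = create_body(resnet18, n_in=1, pretrained=True, cut=-2)

# Dynamic U-Net decoder on top of the backbone, producing 2 output channels (ab)
net = DynamicUnet(body, n_out=2, img_size=img_shape).to(device)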
Next, we define the optimizer and loss function to be used for training the model.

Fig 4.3: Optimizer and loss function


Here, we use the Adam optimizer to optimize the model parameters and set the learning rate to 1e-4. We
also use the Mean Squared Error (MSE) loss function as it is a common choice for image colorization
tasks.
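In code form this step is straightforward (a sketch; net is the dynamic U-Net model built above):

import torch.nn as nn
import torch.optim as optim

criterion = nn.MSELoss()                           # per-pixel MSE on the predicted ab channels
optimizer = optim.Adam(net.parameters(), lr=1e-4)  # Adam with the learning rate used in the report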
Next, we define the input and output data transformations and create the training and testing data loaders.

Fig 4.4: Defining the input and output data transformations and creating the training and testing data loaders

Here, we define the input and output data transformations using the transform_expand_dim, to_channel_first, and transform_divide helper functions defined in the notebook. We also create the training and testing data loaders using the DatasetImg class together with the DataLoader class.
Finally, we train our model for the specified number of epochs using the train and test functions defined
earlier. We store the training and validation losses for plotting purposes and plot the losses at the specified
frequency. We also use the predict function to generate predictions on a small batch of test data and plot
the original grayscale images alongside their predicted RGB colorizations.
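The train and test helpers referred to above are defined in the notebook; a condensed sketch of the epoch loop they implement is shown below (variable names follow the earlier sketches and are assumptions):

train_losses, test_losses = [], []
for epoch in range(epochs):
    # training pass
    net.train()
    running = 0.0
    for l_batch, ab_batch in train_loader:
        l_batch, ab_batch = l_batch.to(device), ab_batch.to(device)
        optimizer.zero_grad()
        loss = criterion(net(l_batch), ab_batch)   # predict ab from L and compare
        loss.backward()
        optimizer.step()
        running += loss.item() * l_batch.size(0)
    train_losses.append(running / len(train_loader.dataset))

    # evaluation pass
    net.eval()
    running = 0.0
    with torch.no_grad():
        for l_batch, ab_batch in test_loader:
            l_batch, ab_batch = l_batch.to(device), ab_batch.to(device)
            running += criterion(net(l_batch), ab_batch).item() * l_batch.size(0)
    test_losses.append(running / len(test_loader.dataset))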

Visualizing the loss during training and the predicted colorized images

After defining and training the UNet model for image colorization, it is important to visualize the loss
during training and the predicted colorized images to evaluate the model's performance.
The loss during training can be visualized by plotting the training and validation losses for each epoch.
This helps in identifying if the model is overfitting or underfitting the training data. If the training loss is
much lower than the validation loss, it may indicate overfitting. On the other hand, if both the training
and validation losses are high, it may indicate underfitting. The plot can also help in determining the
ideal number of epochs to train the model.
The predicted colorized images can be visualized by randomly selecting a few grayscale images from the test dataset and comparing the actual colored images with the colorized images predicted by the model. This helps in determining the accuracy of the model in colorizing grayscale images.
By visualizing the loss during training and the predicted colorized images, we can evaluate the
performance of the UNet model for image colorization and make necessary adjustments to improve the
model's accuracy.
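A minimal matplotlib sketch for the loss curves, assuming the train_losses and test_losses lists collected during training, is:

import matplotlib.pyplot as plt

plt.plot(train_losses, label="training loss")
plt.plot(test_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("MSE loss")
plt.title("U-Net colorization: loss per epoch")
plt.legend()
plt.show()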

Evaluating the model using metrics such as PSNR:


PSNR (Peak Signal-to-Noise Ratio) is a widely used metric for evaluating the quality of image
colorization. It measures the difference between the original image and the colorized image in terms of
their pixel values.
To compute PSNR, the mean squared error (MSE) between the original and colorized images is first
calculated. The MSE is the average of the squared difference between the pixel values of the two images.
Then, the PSNR value is obtained by taking the ratio of the peak signal power (the maximum possible
pixel value) to the MSE, and then applying a logarithmic scale.
The formula for PSNR is as follows:

PSNR = 10 * log10 ((Peak Signal Power)^2 / MSE)

where Peak Signal Power is the maximum possible pixel value (255 for an 8-bit image), and MSE is the
mean squared error between the original and colorized images.
Higher PSNR values indicate better image quality; for 8-bit natural images, values typically fall roughly between 20 and 50 dB, and a PSNR of 30 dB or above is generally considered good. However, it is important to note that PSNR is not always a perfect indicator of perceived image quality.
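The metric does not have to be hand-written: scikit-image ships an implementation. A sketch of evaluating one colorized result is shown below (the array names are illustrative, and we assume both images are float RGB arrays in [0, 1]):

from skimage.metrics import peak_signal_noise_ratio

# original_rgb and colorized_rgb: float RGB arrays in [0, 1] of identical shape
score = peak_signal_noise_ratio(original_rgb, colorized_rgb, data_range=1.0)
print(f"PSNR: {score:.2f} dB")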

CHAPTER 5 EXPERIMENTAL RESULTS

OUTPUT SCREEN:

1. loading the dataset

Fig 5.1: loading the dataset

2. Split the data:

Fig 5.2: data splitting

3. Training and testing the data

Fig 5.3: training and testing data

4. Downloading the pre-trained ResNet model

Fig 5.4: pretrained model

5. Images in dataset

Fig 5.5: images in dataset

6. Colorization results

(a)

(b)

Fig 5.6: (a) After 10 epochs (b) training and testing losses, respectively, for each epoch during the
training process.

(c)

(d)

Fig 5.7: (c) After 20 epochs (d) training and testing losses, respectively, for each epoch during the
training process

(e)

(f)

Fig 5.8: (e) After 35 epochs (f) training and testing losses, respectively, for each epoch during the
training process

7. Evaluating the model using PSNR

Fig 5.9: evaluation using PSNR

CHAPTER 6 CONCLUSION

In this project, we implemented a U-Net architecture with a ResNet-18 encoder for image colorization
using PyTorch and FastAI libraries. The model was trained on a subset of the ImageNet dataset, consisting
of approximately 5000 images. We used the mean squared error (MSE) loss function and Adam optimizer
for training the model. The model was trained for 50 epochs, and the best model was saved based on the
lowest validation loss.
We split the dataset into a training set and a validation set with an 80:20 ratio. The data was preprocessed in the Lab color space, where the L channel was used as the input and the AB channels were used as the ground truth. During training, we applied data augmentation techniques such as random horizontal flipping, random rotation, and random zoom, and we used FastAI's learning rate finder to determine an optimal learning rate for training the model.
After training, the model was evaluated on the validation set using the peak signal-to-noise ratio (PSNR). We also visualized the predicted colorized images and compared them with the ground-truth images to assess the model's performance qualitatively.
Our experiments showed that the model was able to colorize grayscale images successfully, with an overall PSNR score of 30.17 dB. However, we observed that the model tended to produce oversaturated colors in some cases. One possible reason is that the dataset was not diverse enough, as it consisted mostly of outdoor images, which tend to have a limited color palette.
In conclusion, we have demonstrated the effectiveness of using a U-Net architecture with a ResNet-18
encoder for image colorization. The results show that the model is able to generate colorized images with
reasonable accuracy, although there is still room for improvement.

CHAPTER 7 FUTURE SCOPE

In this project we illustrated a basic implementation of image colorization. Visual inspection shows that moderately good colorized images can be obtained.

Further Steps:

As further steps, the following points could be experimented with.

1. In this notebook the L and AB channels of the images are normalized by dividing by 255.0. The following alternative scalings could be tried instead (see the sketch after this list):

A. Scale all channels, i.e. L, A, and B, to the range [-1, +1].


B. Scale L to the range [0, 1] and AB to the range [-1, +1].
2. Mean Squared Error is used as the loss function. Other loss functions such as L1 or SmoothL1 could be experimented with as alternatives.

3. A ResNet-18-based U-Net model is used. Other encoder-decoder architectures could be substituted, and GAN-based models could also be tried.
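A short sketch of the scaling and loss alternatives above (assuming the ab channels are stored as unsigned 8-bit values centred on 128; array names are illustrative):

import torch.nn as nn

# 1B: scale L to [0, 1] and AB to [-1, +1]
l_scaled = l / 255.0
ab_scaled = (ab - 128.0) / 128.0

# 2: alternative regression losses to replace MSE
l1_loss = nn.L1Loss()
smooth_l1_loss = nn.SmoothL1Loss()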

REFERENCES

[1] Neuraltek, BlackMagic photo colorization software, version 2.8. http://www.timebrush.com/blackmagic, 2003.

[2] T. Welsh, M. Ashikhmin, and K. Mueller, "Transferring Color to Greyscale Images," in Proceedings of ACM SIGGRAPH, San Antonio, USA, 2002.

[3] I. F. Jafar and G. M. Al Sukkar, "A novel coloring framework for grayscale images," in Multimedia Computing and Information Technology (MCIT), 2010.

[4] D. L. Ruderman, T. W. Cronin, and C. C. Chiao, "Statistics of Cone Responses to Natural Images: Implications for Visual Coding," J. Optical Soc. of America, vol. 15, no. 8, 1998, pp. 2036-2045.

[5] A. Hertzmann, C. Jacobs, N. Oliver, B. Curless, and D. Salesin, "Image Analogies," in Proceedings of ACM SIGGRAPH, Los Angeles, USA, 2001.

[6] K. I. Laws, "Texture Energy Measures," in Proceedings of Image Understanding, Los Angeles, USA, 1979.

[7] http://www.freedigitalphotos.net/

[8] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, USA, 1967.

[9] Linda G. Shapiro and George C. Stockman, Computer Vision, 2001. ISBN: 0-13-030796-3.

[10] Martin Oberholzer, Marc Östreicher, Heinz Christen, and Marcel Brühlmann, Methods in quantitative image analysis. http://www.springerlink.com/content/rj0462504n82g6wh/

[11] Field Guide to Digital Colors. http://site.ebrary.com.www.bibproxy.du.se/lib/dalarna/docDetail.action?docID=10158053

[12] D. L. Ruderman, T. W. Cronin, and C. C. Chiao, "Statistics of Cone Responses to Natural Images: Implications for Visual Coding," J. Optical Soc. of America, vol. 15, no. 8, 1998, pp. 2036-2045.

[13] https://aditi-mittal.medium.com/introduction-to-u-net-and-res-net-for-image-segmentation-9afcb432ee2f

[14] A Novel Approach for Gray Scale Image Colorization using Convolutional Neural Networks.

[15] D. Varga, C. A. Szabó, and T. Szirányi, Automatic cartoon colorization based on convolutional neural network, https://core.ac.uk/download/pdf/94310076.pdf, 2017.

[16] S. Salve, T. Shah, V. Ranjane, and S. Sadhukhan, Automatization of coloring grayscale images using convolutional neural network, Apr. 2018. DOI: 10.1109/ICICCT.2018.8473259.

[17] Automatic colorization of images from Chinese black-and-white films based on CNN, 2018. DOI: 10.1109/ICALIP.2018.8455654.

[18] V. K. Putri and M. I. Fanany, "Sketch plus colorization deep convolutional neural networks for photos generation from sketches," in 2017 4th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), Sep. 2017, pp. 1–6. DOI: 10.1109/EECSI.2017.8239116; R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in ECCV, 2016.

[19] Image colorization dataset. https://www.kaggle.com/shravankumar9892/image-colorization. Accessed: 2021-12-01.

[20] Alex Alemi et al., Improving Inception and image classification in TensorFlow. Google Research Blog, 2016.

[21] HOGERVORST, M. A., AND TOET, A. Fast natural color mapping for night-time imagery.
Information Fusion 11 (2010), 69–77.

[22] HORIUCHI, T., AND KOTERA, H. Colorization algorithm for monochrome video by sowing color seeds. Journal of Imaging Science and Technology 50, 3 (May and June 2006), 243–250.

[23] HORIUCHI, T., NOHARA, F., AND TOMINAGA, S. Accurate reversible color-to-gray mapping algorithm without distortion conditions. Pattern Recognition Letters 31 (2010), 2405–2414.

[24] HORIUCHI, T., AND TOMINAGA, S. Color image coding by colorization approach.
EURASIP Journal on Image and Video Processing 2008, Article ID 158273 (2008), 9.

[25] HUANG, J. Enhancement and colorization of infrared and other medical images. Master's thesis, Department of Electrical and Computer Engineering, Lehigh University, Jan 2010.

[26] JACOB, V. G., AND GUPTA, S. Colorization of grayscale images and videos using a semiautomatic approach. In 16th IEEE International Conference on Image Processing (ICIP 2009) (Cairo, Egypt, 2009), pp. 1653–1656.

[27] JANG, J. H., AND RA, J. B. Pseudo-color image fusion based on intensity-hue-saturation color space. In IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (Seoul, Korea, Aug 20-22 2008), pp. 366–371.

[28] JING, X., AND CHAU, L.-P. An efficient three-step search algorithm for block motion
estimation. IEEE Transactions on Multimedia 6, 3 (2004), 435 – 438.

[29] KANG, S. H., AND MARCH, R. Variational models for image colorization via
chromaticity and brightness decomposition. IEEE Transactions On Image Processing 16, 9
(September 2007), 2251–2261.

[30] KIM, T. H., LEE, K. M., AND LEE, S. U. Edge-preserving colorization using data-driven
random walks with restart. In 16th IEEE International Conference on Image Processing (ICIP’09
) (Cairo, Egypt., 2009), pp. 1661–1664.

[31] KOLEINI, M., MONADJEMI, S. A., AND MOALLEM, P. Film colorization using texture
feature coding and artificial neural networks. Journal Of Multimedia 4, 4 (August 2009), 240–
247.

[32] KRIKO, L. Z., BABA, S. E. I., AND KRIKOR, M. Z. Palette-based image segmentation
using hsl space. Journal of Digital Information Management (JDIM) 5, 1 (February 2007), 8–11.
[33] KUMAR, R., AND MITRA, S. K. Color transfer using motion estimation and its application
to video compression. In The 11th International Conference on Computer Analysis of Images
and Patterns (CAIP), Lecture Notes in Computer Science 3691 (2005), pp. 313–320

[34] KUMAR, R., AND MITRA, S. K. Motion estimation based color transfer and its application
to color video compression. Pattern Analysis and Applications 11, 2 (2007), 131–139.

[35] LAGODZINSKI, P., AND SMOLKA, B. Colorization of medical images. In Asia-Pacific Signal and Information Processing Association, 2009 Annual Summit and Conference (APSIPA ASC '09) (2009), pp. 769–772.

[36] LEVIN, A., LISCHINSKI, D., AND WEISS, Y. Colorization using optimization.
SIGGRAPH 2004 in ACM Transactions on Graphics 23, 3 (July 2004), 689–694.

[37] LI, Y., LIZHUANG, M., AND DI, W. Fast colorization using edge and gradient constraints. In Proceedings of (WSCG) The 15th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2007 (February 2007), pp. 309–315.

[38] LUAN, Q., WEN, F., COHEN-OR, D., LIANG, L., XU, Y.-Q., AND SHUM, H.-Y. Natural
image colorization. In Eurographics Symposium on Rendering (2007).

[39] MANICKAM, N., PARNAMI, A., AND CHANDRAN, S. Reducing false positives in video
shot detection using learning techniques. In The 5th Indian Conference of Computer
Vision,Graphics and Image Processing (ICVGIP) (2006), pp. 421–432.

[40] MARKLE, W., AND HUNT, B. Coloring a black and white signal using motion detection,
July 1988.

[41] MARTIN, D., FOWLKES, C., TAL, D., AND MALIK, J. A database of human segmented
natural images and its application to evaluating segmentation algorithms and measuring
ecological statistics. In The 8th Int’l Conf. Computer Vision (July 2001), vol. 2, pp. 416–423

