Image Colorization Final Report
Image Colorization Final Report
Colorization is the process in which color components are added to the grayscale image. The information
contained in the grey-scale image is limited than that of color image. Thus adding the color components
can provides more insights about its semantics. However,Colorization is one of the challenging task due
to the small data set,variety of images in the training set and the availability of computational resources
[1]. Colorization is an ambiguous task because it does not have unique solution. An U-net takes the
grayscale and predict the color image .The coloring of the gray scale image has the strong impact on the
wide verity of the domains typically color restoration re-design of the historical images [2]. The
colorization problem is both complex and interesting as the final output image and input image will be of
same dimension [3]. CNNs plays the important role to handle different diversified tasks such as image
colorization, classification, image-labeling and so on. In recent years the use of CNN widely used to get
the solution of image colorization purposes[3].The U-Net architecture is convolution network architecture
for the purpose of image segmentation. U-Net is widely used for image segmentation, and the problem of
coloring image is essentially same as segmentation. Thus, it is chosen for image coloring[4]. The U-Net
consist of mainly there parts Encoding, Bridge and Decoding. The encoding part is responsible for
converting the input image into compact representation called latent space of the input. The decoding
process involves the reconstruction of the input image with the same size of the input images by using
upsampling and convolution operation. The role of the bridge is to bind the encoding and decoding units.
The low level detail features in the encoding part are concatenated with corresponding high level features
in the decoding part [4]. The colored image carries the more information than the gray scale images and
most of the coloring tasks are based on the auto-encoder i.e. encoder-decoder. A number of proposed
solutions are available for colorization of Black and White images. The challenges lies in focusing on
accuracy colorizing standard would look natural to human eye. The U-net is basically used for the
segmentation of the medical images but the task of coloring is similar to the segmentation because coloring
involve the separation of similar region in the image and fill the appropriate color in that segment. Thus
the U-Net architecture is used for the coloring because the encoder-decoder suffers from an information
bottleneck during the flow of low level information in the network[3]. To reduce this problem features
from the contracting path are also connected with the upsampling output layer within the network then
1
more information can be obtained from greyscale image and can be automatically colored with greater
accuracy and natural representation [4].
The problem statement for image colorization is to develop a deep learning model that can accurately and
realistically add color to grayscale images. The goal is to produce colorized images that are visually
appealing and consistent with the colors that would be expected in the original image.
More specifically, the problem statement involves:
1. Preparing a dataset of grayscale images and their corresponding color images for training the deep
learning model.
2. Developing a deep learning model that can take a grayscale image as input and generate a plausible
and realistic color image as output.
3. Training the model on the prepared dataset of grayscale and color images.
4. Evaluating the performance of the model on a separate set of test images to measure how accurately
and realistically the model is able to colorize the images.
5. Optimizing the model to improve its performance on the test set.
The ultimate goal of image colorization is to provide a tool for artists, photographers, and restoration
specialists to colorize old or historical images, enhance the visual quality of grayscale images, and aid in
the creation of digital art.
1.2 Motivation:
Image colorization is the process of adding color to grayscale images. The motivation for image
colorization is to provide a tool for artists, photographers, and restoration specialists to enhance the visual
quality of grayscale images, colorize old or historical images, and aid in the creation of digital art.
The use of image colorization has become increasingly popular in recent years due to advancements in
2
deep learning and computer vision. Deep learning models such as convolutional neural networks (CNNs)
have shown remarkable success in accurately and realistically colorizing grayscale images.
One of the main motivations for image colorization is the preservation of historical images. Historical
images often only exist in grayscale, but colorization can bring them to life and provide a better
understanding of the past. By adding color to historical images, we can gain a better appreciation of the
context, clothing, and environment of the time period.
Another motivation for image colorization is its application in the fields of art and photography. Adding
color to black and white photos can make them more visually appealing, and provide a unique artistic
perspective. Additionally, colorizing photographs of the natural world or landscapes can provide a better
understanding of the environment and bring attention to issues such as climate change.
Moreover, image colorization has several practical applications in fields such as advertising and
marketing. By colorizing grayscale product images, companies can provide a more realistic representation
of their products, which can increase sales.
In conclusion, the motivation for image colorization is to provide a tool for enhancing the visual quality
of grayscale images, colorizing historical images, and aiding in the creation of digital art. With
advancements in deep learning and computer vision, the accuracy and realism of colorization has greatly
improved, making it a valuable tool in several applications.
3
1.4 Background and basics
1.4.1 Theoretical Background
Since colorizing grey image involves greyscale, color spaces and while using reference image for
approximation, we need to do some texture analysis and clustering, so I will explain these before
heading forward.
1.1.1 Greyscale Images
A greyscale image is an image which is represented by intensity only. Value of this intensity defines the
appearance of a certain pixel in an image. Lowest value of intensity or its absence in a greyscale image is
represented with black color and its highest value represents white color. All values between these highest
and lowest represent shades of grey. These number of shades are dependent on possible values which a
pixel may hold. If pixel is represented by a bit, then it can hold two values, 0 and 1 and thus we get a pure
black and white image with no grey shades. If this pixel is represented by a byte value, then we can have
256 grey levels starting with 0 as black and ending with 255 as white. So this way as we increase number
of bits for representation of pixel, grey levels increase with it. Intensity which is only property with
greyscale image, is represented by this value of representation of greyscale. If this intensity/luminance is
more, pixel’s shade is closer to white and vice versa.
1.1.2 Color Images
A color image is an image which is represented by some color space. This color space is not dependent
on only one value like greyscale image. Each pixel in color image is represented by more than one value
and combined effect of these values gives appearance of a color. Before heading into these color spaces,
we need to understand how our eyes sense colors. Our eyes consist of Rod and Cone cells. Rod cells are
designed to feel intensity of light where as Cone cells are designed to sense color of light which falls on
retina of eye. Cone cells are usually of three types, named Short, Medium and Long. Short cones are
sensitive to shorter wavelengths of light meaning they sense Blue color. Medium cones are sensitive to
medium wavelengths and hence they sense green color and similarly Long cones, sense Red color because
of their sensitivity to long wavelengths. Based on this a color space is designed which is named as RGB
color space.
4
of computer graphics, this color space represents each pixel with three values of Red, Green and Blue.
Each pixel on screen represents 3 light emitting devices, which emit R,G and B to represent color of that
pixel and our S,M and L cones receive these lights and combined effect of these is that we see some color.
If each color is defined by a byte, then we can have 256 shades of each color, which in result, give us a
vast variety of colors with 256 shades of each color. This color space is designed for representing colors
with electromagnetic systems, like computers, television, printers etc where human need to sense them.
Though RGB is the designed for human sense, but its not good for processing and performing calculations
in image processing. Also this space is not device independent. Image processing usually involves other
color spaces and relevant to this work is Lαβ color space.
Deep learning:
Artificial neural networks are used in deep learning, a kind of machine learning, to learn from and draw
conclusions from complex data. Deep learning has become increasingly popular in recent years due to
5
advancements in computing power and the availability of large datasets. One of the main advantages of
deep learning is its ability to learn and extract meaningful features from data. Deep learning models can
learn to recognize patterns and relationships in data that are too complex for traditional machine learning
models. This ability to learn from complex data has enabled deep learning to achieve state-of-the-art
results in several domains such as image recognition, natural language processing, and speech
recognition.
Deep learning models are typically composed of multiple layers of artificial neurons, which are trained
using backpropagation to adjust their weights and biases to minimize the error between predicted and
actual outputs. The process of training a deep learning model involves presenting the model with large
amounts of labeled data, and gradually adjusting the weights and biases of the neurons to improve
performance.
One of the key applications of deep learning is in image recognition, where convolutional neural
networks (CNNs) have achieved impressive results in tasks such as object detection, image
segmentation, and image classification. CNNs are designed to recognize spatial patterns in images,
making them particularly effective for image recognition tasks.
Another important application of deep learning is in natural language processing (NLP), where recurrent
neural networks (RNNs) and transformer models have shown remarkable success in tasks such as
language translation, sentiment analysis, and text generation. These models are designed to recognize
patterns in sequences of text, making them well-suited for NLP tasks.
Furthermore, deep learning has also been applied in several other domains such as speech recognition,
drug discovery, and autonomous vehicles, among others.
In conclusion, deep learning is a powerful subset of machine learning that has shown remarkable success
in several domains. Its ability to learn and extract meaningful features from complex data has enabled it
to achieve state-of-the-art results in tasks such as image recognition and natural language processing.
With the availability of large datasets and advancements in computing power, deep learning is expected
to continue to grow and make significant contributions in several fields.
6
1.4.2 CNN Algorithm:
The capabilities of artificial intelligence to close the gap between human and computer skills has been
growing dramatically. Both professionals and beginners focus on many facets of the field to achieve great
results. The field of computer vision is one of several such disciplines.
The goal of this field is to give robots the ability to see the environment similarly as humans do and to use
that understanding for a variety of activities, including image and video recognition, image analysis, media
recreation, recommendation systems, natural language processing, etc. With time, one particular
algorithm—a Convolutional Neural Network—has been developed and optimised, primarily leading to
breakthroughs in computer vision with deep learning.
Introduction
7
able to distinguish between them. In comparison to other classification methods, a ConvNet requires
significantly less pre-processing. Unlike to primitive approaches, where filters are hand-engineered,
ConvNets have the ability to learn these filters and their attributes.
A ConvNet's architecture was influenced by how the Visual Cortex is organised and is similar to the
connectivity network of neurons in the human brain. Only in this constrained area of the visual field,
known as the Receptive Field, do individual neurons react to stimuli. Some of these fields overlap
Through the use of appropriate filters, a ConvNet may effectively capture the spatial and temporal
dependencies in a picture. As there are less variables to account for and the weights can be reused, the
architecture offers improved fitting to the picture dataset. In other terms, the network can be trained to
better understand the degree of complexity in the image.
8
Input Image
In the figure ,
The three colour planes of the RGB image—Red, Green, and Blue—have been used to split it in the
picture. Images can be found in a variety of different colour spaces, including grayscale, RGB, HSV,
CMYK, etc.
Whenever the photos reach size like 8K (76804320), you can imagine how computationally intensive
things would grow. ConvNet's job is to simplify the images without sacrificing any of the elements that
are essential for making accurate predictions. This is crucial when creating an architecture that is both
scalable to large datasets and effective at learning features.
Convolution Layer — The Kernel
9
Fig 1.5:Convoluting a 5x5x1 image with a 3x3x1 kernel to get a 3x3x1 convolved feature
Kernel/Filter, K = 1 0 1
0 1 0
1 0 1
Due to Stride Length = 1 (Non-Strided), the Kernel shifts nine times, executing an elementwise
multiplication operation (Hadamard Product) between K and the area P of the picture that the Kernel is
currently hovering over.
10
Fig 1.6: Movement of the Kernel
Until it has parsed the entire width, the filter travels to the right with a specific Stride Value. Proceeding
on, it jumps down to the beginning (left) of the picture with the same Stride Value and keeps doing so
until the full image is traversed.
11
Fig 1.7:Convolution operation on a MxNx3 image matrix with a 3x3x3 Kernel
Images containing several channels, like RGB pictures, have a kernel with the same depth as the input
image. A squashed one-depth channel Convoluted Feature Output is produced by performing matrix
multiplication across the Kn and In stacks ([K1, I1]; [K2, I2]; and [K3, I3]). All of the results are then
12
The Convolution Operation's objective is to identify the input image's high-level characteristics, such
edges, and extract them. There is no requirement that ConvNets have just one convolutional layer.
Typically, low-level features like edges, colour, gradient direction, etc. are captured by the first ConvLayer.
With more layers, the architecture also adjusts to the High-Level characteristics, giving us a network that
comprehends the dataset's images holistically in a way that is comparable to how we do.
The procedure yields two different types of results: one where the dimensionality of the convolved feature
is decreased as compared to the input, and the other where it is either increased or stays the same. Using
valid padding in the makes this possible.case of the former, or Same Padding in the case of the latter.
Fig 1.9:SAME padding: 5x5x1 image is padded with 0s to create a 6x6x1 image
When the 5x5x1 image is enhanced into a 6x6x1 image and the 3x3x1 kernel is applied to it, we see that
the convolved matrix has the dimensions 5x5x1. Hence, Similar Padding was born.
On the other hand, if we carry out the identical operation without padding, we are given a matrix called
Valid Padding that has the same dimensions as the kernel itself (3x3x1).
13
Pooling Layer
The Pooling layer, like the Convolutional Layer, is in charge of shrinking the Convolved Feature's spatial
size. With dimensionality reduction, the amount of computing power required for processing the data will
be reduced. Furthermore, it aids in properly training the model by allowing the extraction of dominating
characteristics that are rotational and positional invariant.
Max Pooling and Average Pooling are the two distinct kinds of pooling. The highest values from the area
of the image that the Kernel has covered is given by Max Pooling. The average of all the values from the
image's region that is covered by the Kernel is the value given by average pooling, on the other hand.
Moreover, Max Pooling performs as a noise suppressant. It does dimensionality reduction, de-noising, and
complete discarding of the noisy activations. The noise-suppression process used by average pooling, in
contrast, is just dimensionality reduction. As a result, we can conclude that Max Pooling outperforms
Average Pooling.
14
The i-th layer of a convolutional neural network is made up of the convolutional layer and the pooling
layer. The number of these layers may be expanded to capture even more minute details, but doing so will
require more computer power depending on how complex the images are.
After going through the approach outlined above, we were able to successfully help the model comprehend
the features. Next, we will flatten the output for classification purposes and feed it into a standard neural
network.
We will now flatten the input image into a column vector after converting it to a format that is appropriate
for our multi-level perceptron. A feed-forward neural network receives the flattened output, and
backpropagation is used for each training iteration. The model can categorise images using the Softmax
15
Classification method across a number of epochs by identifying dominant and specific low-level
features.There are various architectures of CNNs available which have been key in building algorithms
which power and shall power AI as a whole in the foreseeable future. Some of them have been listed below:
1. LeNet
2. AlexNet
3. VGGNet
4. GoogLeNet
5. ResNet
6. ZFNet
16
Organization of Thesis
This dissertation organized into six chapters and structured as follows:
Chapter 1:This chapter presents, “Introduction” of the thesis describing the Motivation,
Problem Statement, Aim and finally organization of the dissertation.
Chaper 2:This chapter presents, “Literature Survey” of the thesis describing about
Classification Techniques of image colorization
Chapter 3: This chapter presents, “System Architecture” of the thesis is describing the
proposed System Architecture and System Modules.
Chapter 4:This chapter presents, “Implementation” of the thesis describing about
Dataset Collection, Data Analysis and Data Preprocessing, Data Visualization and
Feature Selection, Data Splitting, Model Building.
Chapter 5: This chapter presents, “Experimental results” of the thesis and the
description of the results.
17
CHAPTER 2 LITERATURE SURVEY
This chapter is dedicated to creating an overview of previous work in the field of automatic colorization,
with particular focus on works that have inspired, influenced and helped shape this thesis. In general,
working with color channels in images is a rather well studied and surveyed area. However, compared to
other similar problems, the idea of generating some or all of the color information given other data is one
that has seen relatively limited amount of widespread application and, conversely, research. However, it
is quickly becoming one of the more popular image-to-image tasks in the computer vision field, with
several works published in the recent years using various approaches. Countless algorithms that are not
considered colorization methods, but rather mere color enhancements which aim to improve existing poor
color information or modify the color palette of an image are worth mentioning in this section, as they
serve as a sort of a precursor to full colorization techniques. The goal of these methods is to take images
with color data as their inputs and transform them into images with better visual properties. Often they
serve to remedy certain camera defects, such as overexposure or underexposure contrast adjustment
through histogram equalization[20], as shown in Figure 3.1. The simple transformations performed by
these non-parametric methods are frequently used to describe behaviors of other, more complex
algorithms, even in the domain of CNNs. It is for example possible to say that one of the transformations
performed by a CNN resembles histogram equalization.
Fig 2.1: (a) Overexposed image. (b) Image after color enhancement using histogram
normalization.
When considering full image colorization, there are generally three major types of approaches that have
18
been used. Firstly, a non-parametric approach, in which the user provides hints to an algorithm as to what
the final colorization should look like. These hints come in the form of scribbles - small patches of color
in specific areas of the image. More automated methods developed still rely on additional user input, but
instead of providing direct color data, the user is expected to provide one or more reference images from
which to perform color transfer onto the target image using statistical data or texture matching. Recently,
with the advent CNNs, the approach has shifted more towards a fully automated solution, where the only
input provided by the user is the target grayscale image. However, this enormous advatage can also turn
into a disadvantage - in case the CNN result turns out to be unsatisfactory, there are few to no options to
easily remedy it trivially, unless the model has been specifically designed with this requirement in mind.
2.1 User provided color hints
Colorization methods that depend on color scribbles generally use an optimization framework without
explicit parameter learning to propagate the color from the color patches onto the whole image. The
scribbles are usually provided as a separate image in the form of a color-transparency mask, and the
segments of the image that have no explicit color defined in this mask should have color information
propagated to them. The basic assumption behind most of these methods is that nearby pixels of similar
intensities should also have similar colors. In the method proposed by Levin et al. [21], the colorization is
achieved by solving a convex quadratic cost function obtained as differences of intensities between
neighboring pixels. With further improvements by Huang et al. [22] to exploit edge detection in order to
reduce common problems with color bleeding over object boundaries, this has become a relatively popular
technique to interactively colorize natural images. Luan et al. [23] presented a method extending the use
of scribbles to texture similarity, automatically labeling pixels that should share roughly similar colors
and grouping them into coherent regions. They extend the color locality assumption, seeking remote pixels
with similar textures to color alike to effectively propagate the colorization, further improving the
technique. A similar approach to transferring the color from scribbles, introduced by Qu et al. [24], is
extracting statistical pattern features of local neighborhoods to measure texture continuity, resulting in
fewer scribbles required. A completely different method of optimization was introduced by S´ykora et al.
with LazyBrush [25], along with relaxing the requirement of complete spatial accuracy of scribbles for
the purposes of cartoon image colorization, by solving a multiway cut problem
19
Fig 2.2: Example of Levin’s method
20
method close to fully automated by running a web search on popular image sharing websites, based on
keywords provided by the user instead of requiring reference images, acquiring semantically relevant
results. The retrieved images are scored based on their colorfulness and non fitting images, such as
grayscale or images with filtering effects applied, are discarded, creating reference image sets of up to
2000 images.
Similarly, Chia et al. [30] choose to perform an automated web search in conjunction with user provided
foreground-background segmentation cues. This provides the user with more control over the resulting
colorization, while retaining the automated nature of image-to-image color transfer. Liu et al. [26]
automatically generate scribbles from the reference images obtained from the web and propagate them by
using Levin’s method, combined with automatic segmentation. In Deep Colorization by Cheng et al. [31],
a large dataset is divided up into smaller clusters based on global descriptors such as intensity histograms.
For each cluster, a neural network consisting of 3 fully connected layers is trained, using multiple local
feature descriptors computed at random pixel locations from images in the reference set as training data.
The result is obtained as per-pixel colorization prediction of the network which has been trained on the
reference set that most closely matches the target image based on the global descriptors, instead of
explicitly defining the color transferring method.
Deshpande et al. [27] minimize an objective function automatically learned from example sets and
subsequently train a forest of regression trees for color prediction, using multiple image filters to handle
scale invariance. To choose good reference images and choose the used trees, bag-of-features retrieval on
the training set is used.
While these methods generally require less from the user compared to scribble-based methods, they make
it more difficult to influence the colorization output, due to reliance on data that are not straight-forward
to interpret by visually inspecting the reference image set - such as feature descriptors.
21
degree of success as a fully automated process by training CNNs on large datasets such as SUN or
ImageNet. Currently, these methods are the state-of-the-art for natural image colorization.
It is worth noting that even the application of CNNs to this task can be viewed as a form of automatic
color or style transfer - with the references automatically pre-selected by the choice of the original CNN
training set - using a complex method learned and realized by the network. However, by using training set
which contains a large variety of semantically different scenes and commonly occurring objects (such as
ImageNet), it is expected that the chosen color transfer should be the best matching one.
Zhang et al. [32] propose a plain CNN with 22 convolutional layers on a subset of the ImageNet dataset,
employing a custom tailored multinomial cross entropy loss with class rebalancing based on prior color
distribution obtained from the training set to predict a color histogram for each output pixel to handle the
multi-modal nature of the task.
Similarly, Larsson et al. [33] also predict a color histogram, however, they choose to use a 16-layer
convolutional model attached to a fully connected hypercolumn layer to predict pixels’ chromatic values,
pretrained on image classification task and fine-tuned for colorization. Rather than train densely and
predict the colorization of the whole image in one pass, the CNN is trained on spatially sparse samples of
grayscale patches of size equal to the receptive field of the network, predicting the color value of the
central pixel. Larsson et al. also explore the possibility of transferring a known ground truth color
histogram (as a global descriptor) to improve the colorization.
Iizuka et al. [8] propose a network which combines two paths of computation, one to predict the global
features of the target image and the other to specialize in local features. To achieve this, the global features
are trained for image classification rather than colorization and are subsequently concatenated to the local
features that are trained directly for colorization using L2 Euclidean loss function. This technique allows
their model to gain a higher semantic understanding of the image, producing very consistent colorizations.
Recent developments in the field of conditional generative adversarial networks (GAN), a model in which
two distinct networks are trained - a generator and a discriminator - have 3.3. USING CNNS 19 lead
researchers to attempt to use them for colorization. In the domain of colorization, the generator is used to
produce a colorized image of the target grayscale image, and the discriminator is then trained to decide
whether the generated image looks more convincing than the ground truth coloring. If that is not the case,
the weights of the generator are updated in the direction of making the image more convincing for the
discriminator, essentially using the discriminator as an adaptive loss function [34].
Cao et al. [35] show application of GAN to colorization of natural images while producing highly
22
convincing colorizations on the SUN dataset. Along with the target grayscale image, a random noise
vector is given to the generator as input (known as latent space sample), reintroducing some user-defined
influence over the colorization result exhibited by methods in Section 3.2, albeit one that may be difficult
to reason about.
Fu et al. [36] also use a cartoon movie dataset and choose to train a GAN model for automatic colorization,
though their data source is of larger magnitude (over 15 hours of raw footage compared to less than 1.5
hours of our data). Their image extraction method is also different - sampling every 50 frames in the
original footage - and, most importantly, their testing and validation sets are sampled randomly, which,
due to the nature of the data source, skews the results, as it is reasonable to assume that randomly chosen
frames may be mere translations of frames included in the training set, or that background information
will easily be learned by recognizing objects in the image. Therefore, it is hard to compare the results of
our work to results of Fu et al
23
CHAPTER 3 SYSTEM ARCHITECTURE
Proposed System
This project illustrates image colorization usecase. Image colorization is the process of converting gray
scale images to colored images. This usecase is helpful in converting your old grayscale images to
colored.
There are other sophisticated image colorization mechanisms and models. In this project I tried to
illustrate basic steps of image colorization. I trust this work will bring motivation and interest to further
One of the hardest problems to solve in deep learning is the problem of getting the right data in the right
format.
Getting the right data means gathering or identifying the data that correlates with the outcomes you want
to predict; i.e. data that contains a signal about events you care about. The data needs to be aligned with
the problem you’re trying to solve. Kitten pictures are not very useful when you’re building a facial
identification system. Verifying that the data is aligned with the problem you seek to solve must be done
by a data scientist. If you do not have the right data, then your efforts to build an AI solution must return
to the data collection stage.
Machine learning needs a good training set to work properly. Collecting and constructing the training set
– a sizable body of known data – takes time and domain-specific knowledge of where and how to gather
relevant information. The training set acts as the benchmark against which deep-learning nets are trained.
That is what they learn to reconstruct before they‟re unleashed on data they haven‟t seen before.
25
At this stage, knowledgeable humans need to find the right raw data and transform it into a numerical
representation that the deep-learning algorithm can understand, a tensor. Building a training set is, in a
sense, pre-pre-training. [20]
In this module we have collected five different datasets which contains the data of different liver diseases
test reports. The datasets are as follows,
The MIRFLICKR 25k dataset is a publicly available dataset of images and associated tags that was created
for research in the field of image retrieval and annotation. It was created by the Multimedia Information
Retrieval Lab (MIR) at the University of Amsterdam and contains 25,000 images from the online photo-
sharing site Flickr.
Each image in the MIRFLICKR 25k dataset is associated with a set of five text tags, which were collected
by asking users of Flickr to tag their own photos. These tags are meant to capture the visual content of the
images and provide a way to search and retrieve images based on their content.
The images in the dataset cover a wide range of topics and subjects, including landscapes, people, animals,
buildings, and more. The dataset is intended to be representative of the kinds of images that are typically
shared online and to provide a diverse set of visual content for research purposes.
The MIRFLICKR 25k dataset has been widely used in research on image retrieval and annotation, as well
as related fields such as machine learning and computer vision. It is often used as a benchmark for
evaluating the performance of algorithms and techniques for image annotation and retrieval.
Researchers can access the dataset and associated tags through the MIRFLICKR website or through
various online repositories. The dataset is licensed under the Creative Commons Attribution-
NonCommercial-ShareAlike license, which allows for non-commercial use and sharing of the dataset as
long as proper attribution is given.
explore the field of image colorization .
Image colorization using deep learning has gained a lot of attention in recent years due to its ability to
generate high-quality colorized images. However, like any machine learning task, data analysis and
preprocessing are crucial steps to ensure the success of the model.
The first step in image colorization using deep learning is data collection. Collecting a large dataset of
grayscale or black and white images is essential for training the deep learning model. The dataset should
26
be diverse and representative of the images that will be colorized.
Next, the dataset needs to be cleaned to remove any corrupt or incomplete images. Additionally, data
augmentation techniques can be used to increase the size and diversity of the dataset. Techniques like
rotation, flipping, and scaling can be used to create new training images.
Normalization is another critical step in data preprocessing for image colorization using deep learning.
Normalizing the images involves scaling the pixel values to a common range. This can help improve the
performance of the deep learning model.
Converting the images from RGB to LAB color space is also an essential step in data preprocessing for
image colorization using deep learning. The L channel of the LAB color space represents the grayscale
image, while the A and B channels represent the color information. This separation of color and grayscale
information can help the deep learning model better learn to colorize images.
Preprocessing the images for the deep learning model involves resizing the images to a common size and
cropping them if necessary. Additionally, edge detection or segmentation can be performed on the
grayscale images if the deep learning model requires it.
Lastly, the dataset needs to be split into training, validation, and testing sets. The training set is used to
train the deep learning model, the validation set is used to monitor the model's performance during
training, and the testing set is used to evaluate the model's performance after training.
In conclusion, data analysis and preprocessing are crucial steps in image colorization using deep learning.
Properly cleaning, augmenting, normalizing, and preparing the dataset can help improve the accuracy and
quality of the colorized images produced by the deep learning model.
Data visualization and feature selection are important steps in image colorization using deep learning.
Here are some ways these steps can be applied in this context:
Data visualization:
Data visualization techniques can be used to gain insights into the dataset and the colorization problem.
For example, a scatter plot can be used to visualize the distribution of the data in the LAB color space.
This can help identify any patterns or clusters in the data that could be used to improve the deep learning
model's performance.
Heatmaps can also be used to visualize the relationship between different features and the output color.
27
For instance, a heatmap can show the relationship between the input grayscale image and the predicted
color channels (A and B) of the output image.
Feature selection:
Feature selection is the process of selecting a subset of relevant features that can be used to improve the
performance of the deep learning model. In image colorization, feature selection can involve selecting the
most relevant color and texture features to represent the input grayscale image.
There are several feature selection techniques that can be used in image colorization, such as principal
component analysis (PCA) and mutual information-based feature selection. These techniques can help
identify the most important features that contribute to the colorization process.
Another approach to feature selection in image colorization is to use a deep learning model that can learn
to extract relevant features from the input grayscale image automatically. Convolutional neural networks
(CNNs) are a type of deep learning model that are often used for image processing tasks, including image
colorization. By training a CNN on a large dataset of grayscale and color images, the model can learn to
extract the most relevant features automatically, without the need for manual feature selection.
Overall, data visualization and feature selection are important steps in image colorization using deep
learning. By visualizing the data and selecting the most relevant features, the deep learning model's
performance can be improved, leading to higher quality colorized images.
28
where n is the total number of pixels in the image.
A lower MSE value indicates that the predicted color channels are closer to the ground truth values, and
therefore, the colorized image is of higher quality.
However, it is important to note that MSE is not always the best metric to use for evaluating the quality
of colorized images. As mentioned before, it has some limitations and does not reflect the perceptual
quality of the image. Therefore, other metrics like peak signal-to-noise ratio (PSNR) and structural
similarity index (SSIM) are often used in conjunction with MSE to provide a more comprehensive
evaluation of the model's performance
In machine learning and deep learning, it is common practice to split a dataset into two or more subsets to
train and evaluate models. This process is known as train-test split, and it is an essential step in the model
development process.
Train-test split involves dividing the dataset into two subsets: the training set and the test set. The training
set is used to train the model, while the test set is used to evaluate the model's performance. The goal of
train-test split is to ensure that the model can generalize well to new, unseen data.
There are different ways to split a dataset into a training set and a test set. The most common method is to
randomly divide the dataset into two subsets, with a typical split of 70-30 or 80-20 for the training and
test sets, respectively. Another approach is to use cross-validation, which involves splitting the dataset
into multiple folds, with each fold used for training and testing.
Train-test split is important because it helps prevent overfitting, which occurs when the model performs
well on the training data but poorly on new, unseen data. By evaluating the model's performance on a test
set, we can get a better estimate of how well the model will perform on new data.
In the above code cells, the dataset is split into training and testing sets using the train_test_split function
from Scikit-learn library. The train_test_split function randomly shuffles the data and splits it into the
specified training and testing set sizes.
The l and ab arrays, which respectively contain the grayscale and corresponding AB color values for each
image, are split into l_train, l_test, ab_train, and ab_test using the train_test_split function with a test size
of 0.2. This means that 20% of the data will be used for testing and the remaining 80% for training.
After splitting the data, the training and testing datasets are used to create PyTorch DataLoader objects,
which are used for batching and iterating through the data during training and testing. The DataLoader
29
objects are created using the DatasetImg class, which takes in the input and output transforms and creates
a PyTorch dataset.
In summary, train-test split is a critical step in the machine learning and deep learning pipeline. It helps
ensure that the model can generalize well to new data and prevent overfitting.
30
3.6 Module 5: U-Net architecture
Image colorization is the process of adding colors to a grayscale image. It is a challenging task that requires
expertise and time. However, with the advancements in deep learning, it has become possible to automate
this process using convolutional neural networks (CNNs).
The U-Net architecture is a popular and effective architecture for image segmentation tasks. It is named
U-Net due to its U-shaped architecture, which consists of a contracting path followed by an expansive
path. The contracting path is a series of convolutional and pooling layers that downsample the image,
while the expansive path is a series of upsampling and convolutional layers that upsample the image to its
original size. The contracting path learns to extract features from the image, while the expansive path
learns to reconstruct the image.
The ResNet-18 encoder is a popular pre-trained CNN architecture that has been widely used for various
computer vision tasks, such as object detection, image classification, and segmentation. It consists of 18
layers, including residual blocks that learn the residual mapping between the input and output of a layer,
which helps to alleviate the vanishing gradient problem.
To perform image colorization using U-Net architecture with a ResNet-18 encoder, we first use the
ResNet-18 encoder to extract the features from the grayscale image. The extracted features are then passed
through the contracting path of the U-Net architecture to extract more abstract features. The output of the
contracting path is then passed through the expansive path of the U-Net architecture, which upsamples the
features to the original size of the image. The final output of the model is a colorized image.
The U-Net architecture with a ResNet-18 encoder for image colorization has been found to be effective
in various studies. For example, in a study by Zhang et al. (2016), they used a similar architecture to
colorize grayscale images and achieved state-of-the-art results on the ImageNet dataset. In another study
by Larsson et al. (2016), they used a similar architecture to colorize old black and white films and achieved
impressive results.
In conclusion, U-Net architecture with a ResNet-18 encoder is a powerful tool for image colorization. It
allows us to automate the colorization process and achieve state-of-the-art results. With the increasing
availability of large-scale datasets and computational resources, we can expect further advancements in
this area in the near future.
31
Building unet architecture:
The U-Net architecture is a widely used architecture for various image segmentation tasks. However, in
image colorization, we use the U-Net architecture in a slightly different way, where we replace the decoder
section with
a modified version that can output multiple color channels instead of binary segmentation masks.
In our implementation, we use a ResNet-18 encoder as the base of our U-Net architecture. The ResNet-
18 is a deep convolutional neural network that has shown excellent performance on a wide range of
computer vision tasks. We use the pre-trained version of ResNet-18 available in the PyTorch model zoo
and discard the final classification layer.
In the U-Net architecture, the encoder is used to extract features from the input image. The feature maps
are then passed to the decoder, which produces the final output. The decoder is made up of a series of
upsampling and convolutional layers that gradually increase the resolution of the output.
In our modified version of the U-Net architecture for image colorization, we replace the final decoder
layer with a set of convolutional layers that output two color channels: ab. We then convert the ab channels
to the full RGB color space using the Lab color space, which allows us to separate the luminance
(grayscale) information from the chrominance (color) information. Finally, we concatenate the input
grayscale image with the predicted ab channels to produce the final colorized image.
Overall, the U-Net architecture with a ResNet-18 encoder for image colorization allows us to leverage the
power of deep convolutional neural networks for the challenging task of colorizing grayscale images.
32
3.7 Module 6:Evaluating the model using PSNR:
PSNR (Peak Signal to Noise Ratio) is a metric used to evaluate the quality of the reconstructed image
compared to the original image. It is commonly used in image processing and computer vision to compare
two images and measure the difference between them.
PSNR measures the peak signal to noise ratio between two images in decibels (dB) and is calculated using
the mean square error (MSE) between the two images. The MSE is the average of the squared differences
between the pixel values of the two images.
The PSNR can be calculated using the following formula:
PSNR = 20 * log10(MAX_I) - 10 * log10(MSE)
where MAX_I is the maximum pixel value of the image (usually 255 for an 8-bit image), and MSE is the
mean square error between the two images.
A higher PSNR value indicates that the two images are more similar, while a lower PSNR value indicates
that the two images are more different.
In the context of image colorization, the PSNR can be used to evaluate the quality of the colorized image
compared to the original color image. A higher PSNR value indicates that the colorized image is more
similar to the original color image, while a lower PSNR value indicates that the colorized image is less
similar to the original color image.
33
CHAPTER 4 IMPLEMENTATION
34
Table 4.1: Frequency of tags in the MIR Flickr set
Data Splitting:
In the above code cells, the dataset is split into training and testing sets using the train_test_split function
from Scikit-learn library. The train_test_split function randomly shuffles the data and splits it into the
specified training and testing set sizes.
The l and ab arrays, which respectively contain the grayscale and corresponding AB color values for each
image, are split into l_train, l_test, ab_train, and ab_test using the train_test_split function with a test size
of 0.2. This means that 20% of the data will be used for testing and the remaining 80% for training.
After splitting the data, the training and testing datasets are used to create PyTorch DataLoader objects,
which are used for batching and iterating through the data during training and testing. The DataLoader
objects are created using the DatasetImg class, which takes in the input and output transforms and creates
a PyTorch dataset.
Defining and training a dynamic UNet model for image colorization using MSE loss
and Adam optimizer:
To define and train a dynamic UNet model for image colorization using MSE loss and Adam optimizer,
we first need to import the necessary libraries and define some parameters.
35
Fig 4.1: necessary libraries and define some parameters.
Here, we import PyTorch, FastAI, and scikit-image libraries. We also define the parameters required for
the model training, such as image shape, batch size, device to be used (GPU or CPU), number of epochs,
frequency of plotting, number of input and output channels, etc.
Next, we define the UNet architecture with a ResNet-18 encoder by creating the model body using the
create_body function from FastAI, and then passing it to the DynamicUnet class with the number of
output channels.
36
Fig 4.4: define the input and output data transformations and create the training and testing data loaders
Here, we define the input and output data transformations using the transform_expand_dim,
to_channel_first, and transform_divide functions from FastAI. We also create the training and testing data
loaders using the DatasetImg and DataLoader classes from FastAI.
Finally, we train our model for the specified number of epochs using the train and test functions defined
earlier. We store the training and validation losses for plotting purposes and plot the losses at the specified
frequency. We also use the predict function to generate predictions on a small batch of test data and plot
the original grayscale images alongside their predicted RGB colorizations.
Visualizing the loss during training and the predicted colorized images
After defining and training the UNet model for image colorization, it is important to visualize the loss
during training and the predicted colorized images to evaluate the model's performance.
The loss during training can be visualized by plotting the training and validation losses for each epoch.
This helps in identifying if the model is overfitting or underfitting the training data. If the training loss is
much lower than the validation loss, it may indicate overfitting. On the other hand, if both the training
and validation losses are high, it may indicate underfitting. The plot can also help in determining the
ideal number of epochs to train the model.
The predicted colorized images can be visualized by randomly selecting a few grayscale images from
37
the test dataset and comparing the actual colored images with the colorized images predicted by the
model. This helps in determining the accuracy of the model in colorizing grayscale images.
By visualizing the loss during training and the predicted colorized images, we can evaluate the
performance of the UNet model for image colorization and make necessary adjustments to improve the
model's accuracy.
where Peak Signal Power is the maximum possible pixel value (255 for an 8-bit image), and MSE is the
mean squared error between the original and colorized images.
PSNR values typically range from 0 to 100, with higher values indicating etter image quality. A PSNR
value of 30 or above is generally considered to be good. However, it's important to note that PSNR is not
always a perfect indicator of perceived image quality.
38
CHAPTER 5 EXPERIMENTAL RESULTS
OUTPUT SCREEN:
39
3. Training and testing the data
40
5. Images in dataset
41
6. Colorization results
(a)
(b)
Fig 5.6: (a) After 10 epochs (b) training and testing losses, respectively, for each epoch during the
training process.
42
(c)
(d)
Fig 5.7: (c) After 20 epochs (d) training and testing losses, respectively, for each epoch during the
training process
43
(e)
(f)
Fig 5.8: (e) After 35 epochs (f) training and testing losses, respectively, for each epoch during the
training process
44
7. Evaluating the model using PSNR
45
CHAPTER 6 CONCLUSION
In this project, we implemented a U-Net architecture with a ResNet-18 encoder for image colorization
using PyTorch and FastAI libraries. The model was trained on a subset of the ImageNet dataset, consisting
of approximately 5000 images. We used the mean squared error (MSE) loss function and Adam optimizer
for training the model. The model was trained for 50 epochs, and the best model was saved based on the
lowest validation loss.
It split the dataset into a training set and a validation set, with a 80:20 ratio. The data was preprocessed
using LAB color space, where the L channel was used as the input, and the AB channels were used as the
ground truth. During training, we applied data augmentation techniques such as random horizontal
flipping, random rotation, and random zoom. It also used FastAI's learning rate finder to determine an
optimal learning rate for training the model.
After training the model, it is evaluated on the validation set using a metric called peak signal-to-noise
ratio (PSNR). We also visualized the predicted colorized images and compared them with the ground truth
images to get an idea of the performance of the model.
Our experiments showed that the model was able to successfully colorize grayscale images, with an overall
PSNR score of 30.17 dB. However, it is observed that the model tended to produce oversaturated colors
in some cases. One possible reason for this is that the dataset was not diverse enough, as it consisted of
mostly outdoor images, which tend to have a limited color palette.
In conclusion, we have demonstrated the effectiveness of using a U-Net architecture with a ResNet-18
encoder for image colorization. The results show that the model is able to generate colorized images with
reasonable accuracy, although there is still room for improvement.
46
CHAPTER 7 FUTURE SCOPE
In this project , tried illustrate a basic implementation of Image Colorization. From visual inspections
we could see that moderate level colorized images could be obtained.
Further Steps:
1. In this notebook L and AB channels of images are normalized by dividing by 255.0. Following
alternate approaches could be followed:
3. Resnet18 based UNET model is used. As an alternative other Encoder-Decoder based models
could be used. GAN models could also be tried.
47
REFERENCES
[4] - D.L. Ruderman , T.W. Cronin , and C.c. Chiao, "Statistics of Cone Responses to Natural
Images: Implications for Visual Coding," 1. Optical Soc. Of America, vol. 15, no. 8, 1998, pp.
2036-2045.
[5] - Hertzmann, C. Jacobs , N. Oliver , B. Curless , and D. Salesin , " Image Analogies," in
Proceedings of ACM SIGGRAPH, Los Angeles , USA, 2001.
[7] - http://www.freedigitalphotos.net/
[9] - Linda G. Shapiro and George C. Stockman, Computer Vision 2001 ISBN: 0-13- 030796-3
[10] - Martin Oberholzer, Marc Östreicher, Heinz Christen and Marcel Brühlmann, Methods in
quantitative image analysis http://www.springerlink.com/content/rj0462504n82g6wh/
[12] - D.L. Ruderman , T.W. Cronin , and C.c. Chiao, "Statistics of Cone Responses to Natural
Images: Implications for Visual Coding," 1. Optical Soc. Of America, vol. 15, no. 8, 1998, pp.
2036-2045.
[13] - https://aditi-mittal.medium.com/introduction-to-u-net-and-res-net-for-image-segmentation-
9afcb432ee2f
48
[14]- A Novel approach for Gray Scale Image Colorization using Convolutional Neural Networks
[15] C. A. S. Domonkos Varga and T. Szirfffdfffdnyi, Auto- matic cartoon colorization based on
convolutional neural network, https://core.ac.uk/download/pdf/94310076.pdf, 2017.
[16] S. Salve, T. Shah, V. Ranjane, and S. Sadhukhan, Au- tomatization of coloring grayscale
images using convolu- tional neural network, Apr. 2018. DOI: 10.1109/ICICCT. 2018.8473259.
[17] Automatic colorization of images from chinese black and white films based on cnn, 2018.
DOI: 10.1109/ICALIP. 2018.8455654.
[18] V. K. Putri and M. I. Fanany, “Sketch plus colorization deep convolutional neural networks
for photos generation from sketches,” in 2017 4th International Conference on Electrical
Engineering, Computer Science and Informat- ics (EECSI), Sep. 2017, pp. 1–6. DOI:
10.1109/EECSI. 2017.8239116. [5] R. Zhang, P. Isola, and A. A. Efros, “Colorful image
colorization,” in ECCV, 2016
[20] Alex Alemi et al. Improving inception and image classification in tensorflow. Google
Research Blog, 2016. 1
[21] HOGERVORST, M. A., AND TOET, A. Fast natural color mapping for night-time imagery.
Information Fusion 11 (2010), 69–77.
[22] HORIUCHI, T., AND KOTERA, H. Colorization algorithm for monochrome colorization
algorithm for monochrome video by sowing color seeds. Journal of Imaging Science and
Technology 50, 3 (May and June 2006), 243–250.
[23] HORIUCHI, T., NOHARA, F., AND TOMINAGA, S. Accurate reversible color-togray
mapping algorithm without distortion conditions. Pattern Recognition Letters 31 (2010), 2405–
2414.
[24] HORIUCHI, T., AND TOMINAGA, S. Color image coding by colorization approach.
EURASIP Journal on Image and Video Processing 2008, Article ID 158273 (2008), 9.
[25] HUANG, J. Enhancement and colorization of infrared and other medical images. Master,
Department of Electrical and Computer Engineering, Lehigh University, Jan 2010
[26] JACOB, V. G., AND GUPTA, S. Colorization of grayscale images and videos using a
49
semiautomatic approach. In 16th IEEE International Conference on Image Processing
(ICIP2009) (Cairo-Egypt, 2009), pp. 1653–1656.
[27] JANG, J. H., AND RA, J. B. Pseudo-color image fusion based on intensity-huesaturation
color space. In IEEE International Conference on Multisensor Fusion and Integration for
Intelligent Systems (Seoul, Korea, Aug 20 - 22 2008), pp. 366–371.
[28] JING, X., AND CHAU, L.-P. An efficient three-step search algorithm for block motion
estimation. IEEE Transactions on Multimedia 6, 3 (2004), 435 – 438.
[29] KANG, S. H., AND MARCH, R. Variational models for image colorization via
chromaticity and brightness decomposition. IEEE Transactions On Image Processing 16, 9
(September 2007), 2251–2261.
[30] KIM, T. H., LEE, K. M., AND LEE, S. U. Edge-preserving colorization using data-driven
random walks with restart. In 16th IEEE International Conference on Image Processing (ICIP’09
) (Cairo, Egypt., 2009), pp. 1661–1664.
[31] KOLEINI, M., MONADJEMI, S. A., AND MOALLEM, P. Film colorization using texture
feature coding and artificial neural networks. Journal Of Multimedia 4, 4 (August 2009), 240–
247.
[32] KRIKO, L. Z., BABA, S. E. I., AND KRIKOR, M. Z. Palette-based image segmentation
using hsl space. Journal of Digital Information Management (JDIM) 5, 1 (February 2007), 8–11.
[33] KUMAR, R., AND MITRA, S. K. Color transfer using motion estimation and its application
to video compression. In The 11th International Conference on Computer Analysis of Images
and Patterns (CAIP), Lecture Notes in Computer Science 3691 (2005), pp. 313–320
[34] KUMAR, R., AND MITRA, S. K. Motion estimation based color transfer and its application
to color video compression. Pattern Analysis and Applications 11, 2 (2007), 131–139.
[36] LEVIN, A., LISCHINSKI, D., AND WEISS, Y. Colorization using optimization.
SIGGRAPH 2004 in ACM Transactions on Graphics 23, 3 (July 2004), 689–694.
[37] LI, Y., LIZHUANG, M., AND DI, W. Fast colorization using edge and gradient constraints.
50
In Proceedings of (WSCG) The 15th International Conference in Central Europe on Computer
Graphics, Visualization and Computer Vision 2007 . (February 2007), pp. 309–315.
[38] LUAN, Q., WEN, F., COHEN-OR, D., LIANG, L., XU, Y.-Q., AND SHUM, H.-Y. Natural
image colorization. In Eurographics Symposium on Rendering (2007).
[39] MANICKAM, N., PARNAMI, A., AND CHANDRAN, S. Reducing false positives in video
shot detection using learning techniques. In The 5th Indian Conference of Computer
Vision,Graphics and Image Processing (ICVGIP) (2006), pp. 421–432.
[40] MARKLE, W., AND HUNT, B. Coloring a black and white signal using motion detection,
July 1988.
[41] MARTIN, D., FOWLKES, C., TAL, D., AND MALIK, J. A database of human segmented
natural images and its application to evaluating segmentation algorithms and measuring
ecological statistics. In The 8th Int’l Conf. Computer Vision (July 2001), vol. 2, pp. 416–423
51
52