
GOVERNMENT COLLEGE OF ENGINEERING AURANGABAD

(An Autonomous Institute of Government of Maharashtra)

SEMINAR REPORT
ON

“VISION TRANSFORMER”

SUBMITTED
BY
Prathamesh Vijay Pendam (BE20F05F051)

GUIDED
BY
Prof. V.A. Injamuri
(Assistant Professor in Computer Science Department)

ACADEMIC YEAR 2022-23


DEPARTMENT
OF
COMPUTER SCIENCE ENGINEERING
GOVERNMENT COLLEGE OF ENGINEERING AURANGABAD
(An Autonomous Institute of Government of Maharashtra)
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

CERTIFICATE
This is to certify that
Prathamesh Vijay Pendam (BE20F05F051)

has successfully completed the seminar work titled “Vision Transformer” during


the academic year 2022-2023, in partial fulfillment of the Degree in Computer
Science Engineering of Government College of Engineering, Aurangabad. To
the best of our knowledge and belief this project work has not been submitted
elsewhere.

Date:

Prof. V.A. Injamuri                                     Prof. S.G. Shikalpure
(Asst. Prof., Computer Science Department)              H.O.D., Computer Science Engg.
SEMINAR GUIDE

Dr. A. S. Bhalchandra
(Principal)
ACKNOWLEDGEMENT

It gives me immense pleasure to submit this seminar report on the topic
“VISION TRANSFORMER” to my guide Prof. V.A. Injamuri Ma'am, who was a
constant source of guidance and inspiration while developing the report and
preparing the seminar. I am also very thankful to all the staff members of
the Computer Engineering department, who have indirectly guided and helped me
in the preparation of this seminar. I also express my sincere gratitude to
our honorable HOD, Prof. S.G. Shikalpure Sir, for providing the necessary
facilities. Lastly, I am thankful to my friends, whose encouragement and
constant inspiration helped me in preparing this seminar.

Regards,
Prathamesh Vijay Pendam
(BE20F05F051)
CONTENTS

Abstract
1. INTRODUCTION
2. RELATED WORK
3. VISION TRANSFORMER (VIT)
4. PRE-TRAINING DATA REQUIREMENTS
5. INSPECTING VISION TRANSFORMER
6. SELF-SUPERVISION
7. ORIGIN AND HISTORY OF VIT MODELS
8. VISION TRANSFORMER ARCHITECTURE
9. USE CASES AND APPLICATIONS
10. CONCLUSION
11. REFERENCES
ABSTRACT

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its
applications to computer vision remain limited. In vision,
attention is either applied in conjunction with
convolutional networks, or used to replace certain
components of convolutional networks while keeping
their overall structure in place. We show that this
reliance on CNNs is not necessary and a pure
transformer applied directly to sequences of image
patches can perform very well on image classification
tasks. When pre-trained on large amounts of data and
transferred to multiple mid-sized or small image
recognition benchmarks (ImageNet, CIFAR-100, VTAB,
etc.), Vision Transformer (ViT) attains excellent results
compared to state-of-the-art convolutional networks
while requiring substantially fewer computational
resources to train.
INTRODUCTION

Self-attention-based architectures, in particular Transformers (Vaswani et al.,


2017), have become the model of choice in natural language processing (NLP).
The dominant approach is to pre-train on a large text corpus and then fine-tune
on a smaller task-specific dataset (Devlin et al., 2019). Thanks to Transformers’
computational efficiency and scalability, it has become possible to train models
of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin
et al., 2020). With the models and datasets growing, there is still no sign of
saturating performance. In computer vision, however, convolutional
architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et
al., 2016). Inspired by NLP successes, multiple works try combining CNN-like
architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some
replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al.,
2020a). The latter models, while theoretically efficient, have not yet been scaled
effectively on modern hardware accelerators due to the use of specialized
attention patterns. Therefore, in large-scale image recognition, classic ResNet-like
architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020;
Kolesnikov et al., 2020). Inspired by the Transformer scaling successes in NLP, we
experiment with applying a standard Transformer directly to images, with the
fewest possible modifications. To do so, we split an image into patches and
provide the sequence of linear embeddings of these patches as an input to a
Transformer. Image patches are treated the same way as tokens (words) in an
NLP application. We train the model on image classification in supervised
fashion. When trained on mid-sized datasets such as ImageNet without strong
regularization, these models yield modest accuracies of a few percentage points
below ResNets of comparable size. This seemingly discouraging outcome may be
expected: Transformers lack some of the inductive biases inherent to CNNs, such as
translation equivariance and locality, and therefore do not generalize well when
trained on insufficient amounts of data. However, the
picture changes if the models are trained on larger datasets (14M-300M images).
We find that large scale training trumps inductive bias. Our Vision Transformer
(ViT) attains excellent results when pre-trained at sufficient scale and transferred
to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k
dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the
art on multiple image recognition benchmarks. In particular, the best model
reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL,
94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

RELATED WORK

Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since become
the state-of-the-art method in many NLP tasks. Large Transformer-based models are often pre-trained
on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising
self-supervised pre-training task, while the GPT line of work uses language modelling as its pre-training
task (Radford et al., 2018; 2019; Brown et al., 2020). Naive application of self-attention to images would
require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this
does not scale to realistic input sizes. Thus, to apply Transformers in the context of image processing,
several approximations have been tried in the past. Parmar et al. (2018) applied the self-attention only
in local neighbourhoods for each query pixel instead of globally. Such local multi-head dot-product
self-attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019;
Zhao et al., 2020). In a different line of work, Sparse Transformers (Child et al., 2019) employ scalable
approximations to global self-attention in order to be applicable to images. An alternative way to scale
attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only
along individual axes (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention
architectures demonstrate promising results on computer vision tasks, but require complex
engineering to be implemented efficiently on hardware accelerators. Most related to ours is the model
of Cordonnier et al. (2020), which extracts patches of size 2 × 2 from the input image and applies full
self-attention on top. This model is very similar to ViT, but our work goes further to demonstrate that
large scale pre-training makes vanilla transformers competitive with (or even better than) state-of-the-
art CNNs. Moreover, Cordonnier et al. (2020) use a small patch size of 2 × 2 pixels, which makes the
model applicable only to small-resolution images, while we handle medium-resolution images as well.
There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms of
self-attention, e.g. by augmenting feature maps for image classification (Bello et al., 2019) or by further
processing the output of a CNN using self-attention, e.g. for object detection (Hu et al., 2018; Carion
et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu et al.,
2020), unsupervised object discovery (Locatello et al., 2020), or unified text-vision tasks (Chen et al.,
2020c; Lu et al., 2019; Li et al., 2019). Another recent related model is image GPT (iGPT) (Chen et al.,
2020a), which applies Transformers to image pixels after reducing image resolution and color space.
The model is trained in an unsupervised fashion as a generative model, and the resulting
representation can then be fine-tuned or probed linearly for classification performance, achieving a
maximal accuracy of 72% on ImageNet. Our work adds to the increasing collection of papers that
explore image recognition at larger scales than the standard ImageNet dataset. The use of additional
data sources makes it possible to achieve state-of-the-art results on standard benchmarks (Mahajan et al., 2018;
Touvron et al., 2019; Xie et al., 2020). Moreover, Sun et al. (2017) study how CNN performance scales
with dataset size, and Kolesnikov et al. (2020); Djolonga et al. (2020) perform an empirical exploration
of CNN transfer learning from large scale datasets such as ImageNet-21k and JFT-300M. We focus on
these two latter datasets as well, but train Transformers instead of ResNet-based models used in prior
works.

Figure 1: Model overview. We split an image into fixed-size patches, linearly embed each of them, add
position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
In order to perform classification, we use the standard approach of adding an extra learnable
“classification token” to the sequence. The illustration of the Transformer encoder was inspired by
Vaswani et al. (2017).

VISION TRANSFORMER (VIT)

An overview of the model is depicted in Figure 1. The standard Transformer receives as input
a 1D sequence of token embeddings. To handle 2D images, we reshape the image x ∈ R^(H×W×C)
into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)), where (H, W) is the resolution of the
original image, C is the number of channels, (P, P) is the resolution of each image patch, and
N = HW/P² is the resulting number of patches, which also serves as the effective input sequence
length for the Transformer. The Transformer uses a constant latent vector size D through all of
its layers, so we flatten the patches and map them to D dimensions with a trainable linear projection
(Eq. 1). We refer to the output of this projection as the patch embeddings.
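As an illustration, the following is a minimal PyTorch sketch of this patch extraction and trainable linear projection (Eq. 1). It is not the authors' reference code: the use of a Conv2d with kernel and stride equal to the patch size as the projection, and the default sizes (224×224 input, 16×16 patches, D = 768), are implementation assumptions.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and project each patch to D dimensions."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2      # N = HW / P^2
        # A Conv2d with kernel_size = stride = P is equivalent to flattening each
        # P x P patch and applying a shared trainable linear projection E.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                     # x: (B, C, H, W)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) patch embeddings

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))   # shape (1, 196, 768)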

Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of
embedded patches (z_0^0 = x_class), whose state at the output of the Transformer encoder (z_L^0)
serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a
classification head is attached to z_L^0. The classification head is implemented by an MLP with
one hidden layer at pre-training time and by a single linear layer at fine-tuning time.
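For reference, the four equations cited in this section (Eq. 1–4), as given in the original ViT paper (Dosovitskiy et al., 2021), are:

    z_0  = [x_class; x_p^1 E; x_p^2 E; … ; x_p^N E] + E_pos,   E ∈ R^((P²·C)×D),  E_pos ∈ R^((N+1)×D)    (1)
    z'_ℓ = MSA(LN(z_{ℓ−1})) + z_{ℓ−1},   ℓ = 1, …, L                                                     (2)
    z_ℓ  = MLP(LN(z'_ℓ)) + z'_ℓ,         ℓ = 1, …, L                                                     (3)
    y    = LN(z_L^0)                                                                                     (4)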

Position embeddings are added to the patch embeddings to retain positional information. We use
standard learnable 1D position embeddings, since we have not observed significant performance gains
from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of
embedding vectors serves as input to the encoder.
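A hedged sketch of this step, continuing the PatchEmbedding example above: the learnable [class] token is prepended and learnable 1D position embeddings are added. The zero initialization shown here is an assumption for illustration; the paper's initialization details are not reproduced in this report.

import torch
import torch.nn as nn

class ViTEmbeddings(nn.Module):
    """Prepend a learnable [class] token and add learnable 1D position embeddings."""

    def __init__(self, num_patches, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                 # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))   # E_pos

    def forward(self, patch_embeddings):                  # (B, N, D)
        B = patch_embeddings.shape[0]
        cls = self.cls_token.expand(B, -1, -1)            # (B, 1, D)
        z0 = torch.cat([cls, patch_embeddings], dim=1)    # (B, N+1, D)
        return z0 + self.pos_embed                        # z_0, input to the encoder

z0 = ViTEmbeddings(num_patches=196)(torch.randn(1, 196, 768))   # shape (1, 197, 768)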

The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-
attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every
block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019).

The MLP contains two layers with a GELU non-linearity.
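The pre-norm encoder layer described above (Eq. 2, 3) can be sketched as follows. This is an illustrative approximation built on PyTorch's nn.MultiheadAttention, not the authors' exact implementation; the head count and MLP ratio shown correspond to the ViT-Base configuration.

import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm encoder layer: LN -> MSA -> residual, then LN -> MLP(GELU) -> residual."""

    def __init__(self, embed_dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        hidden = int(embed_dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, hidden),
                                 nn.GELU(),
                                 nn.Linear(hidden, embed_dim))

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # Eq. 2: z' = MSA(LN(z)) + z
        z = z + self.mlp(self.norm2(z))                      # Eq. 3: z  = MLP(LN(z')) + z'
        return z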

Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are
baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally
equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is
used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning
time for adjusting the position embeddings for images of different resolution (as described below).
Other than that, the position embeddings at initialization time carry no information about the 2D
positions of the patches and all spatial relations between the patches have to be learned from scratch.

Hybrid Architecture. As an alternative to raw image patches, the input sequence can be formed from
feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E
(Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have
spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial
dimensions of the feature map and projecting to the Transformer dimension. The classification input
embedding and position embeddings are added as described above.
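A minimal sketch of the hybrid idea follows, under the assumption of a simple stand-in convolutional stem rather than an actual ResNet backbone: the CNN feature map is flattened spatially and each 1x1 "patch" is projected to the Transformer dimension.

import torch
import torch.nn as nn

class HybridEmbedding(nn.Module):
    """Hybrid input: flatten a CNN feature map and project its 1x1 'patches' to D dimensions."""

    def __init__(self, in_channels=3, feat_dim=256, embed_dim=768):
        super().__init__()
        # Stand-in for a CNN backbone (e.g. a ResNet stage); reduces H x W by 16x.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, kernel_size=7, stride=16, padding=3),
            nn.ReLU(),
        )
        self.proj = nn.Linear(feat_dim, embed_dim)     # patch embedding projection E

    def forward(self, x):                              # (B, C, H, W)
        f = self.backbone(x)                           # (B, feat_dim, H/16, W/16)
        f = f.flatten(2).transpose(1, 2)               # (B, N, feat_dim)
        return self.proj(f)                            # (B, N, D)

tokens = HybridEmbedding()(torch.randn(1, 3, 224, 224))   # shape (1, 196, 768)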

PRE-TRAINING DATA REQUIREMENTS


The Vision Transformer performs well when pre-trained on the large JFT-300M dataset. With
fewer inductive biases for vision than ResNets, how crucial is the dataset size? We perform
two series of experiments. First, we pre-train ViT models on datasets of increasing size:
ImageNet, ImageNet-21k, and JFT-300M. To boost performance on the smaller datasets,
we optimize three basic regularization parameters – weight decay, dropout, and label
smoothing. Figure 3 shows the results after fine-tuning to ImageNet (results on other datasets
are shown in Table 5)². When pre-trained on the smallest dataset, ImageNet, ViT-Large models
underperform compared to ViT-Base models, despite (moderate) regularization. With
ImageNet-21k pre-training, their performances are similar. Only with JFT-300M do we see the
full benefit of larger models.

------------------------------------------------------------------------------

² Note that the ImageNet pre-trained models are also fine-tuned, but again on ImageNet. This
is because the resolution increase during fine-tuning improves the performance.

Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB (Table 2)
seem promising for very low-data transfer. Further analysis of few-shot properties of ViT is an exciting
direction of future work.
INSPECTING VISION TRANSFORMER
To begin to understand how the Vision Transformer processes
image data, we analyse its internal representations. The first layer
of the Vision Transformer linearly projects the flattened patches
into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the
top principal components of the learned embedding filters. The
components resemble plausible basis functions for a low-
dimensional representation of the fine structure within each
patch. After the projection, a learned position embedding is added
to the patch representations. Figure 7 (center) shows that the
model learns to encode distance within the image in the similarity
of position embeddings, i.e., closer patches tend to have more
similar position embeddings. Further, the row-column structure
appears; patches in the same row/column have similar
embeddings. Finally, a sinusoidal structure is sometimes apparent
for larger grids (Appendix D). That the position embeddings learn
to represent 2D image topology explains why hand-crafted 2D-
aware embedding variants do not yield improvements (Appendix
D.4). Self-attention allows ViT to integrate information across the entire image even in the
lowest layers. We investigate to what degree the network makes use of this capability.
Specifically, we compute the average distance in image space across which information is
integrated, based on the attention weights (Figure 7, right). This “attention distance” is
analogous to receptive field size in CNNs. We find that some heads attend to most of the image
already in the lowest layers, showing that the ability to integrate information globally is indeed
used by the model. Other attention heads have consistently small attention distances in the
low layers. This highly localized attention is less pronounced in hybrid models that apply a
ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function
as early convolutional layers in CNNs. Further, the attention distance increases with network
depth. Globally, we find that the model attends to image regions that are semantically relevant
for classification (Figure 6).
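The "attention distance" analysis can be illustrated with a small sketch. The function below is one interpretation of the quantity described above, not the authors' code: for each head, it averages, over query patches, the attention-weighted pixel distance between patch centres, assuming the [class] token has already been removed from the attention matrix.

import torch

def mean_attention_distance(attn, grid_size, patch_size=16):
    """attn: (num_heads, N, N) attention weights over N = grid_size**2 patch tokens.
    Returns one average attention distance (in pixels) per head."""
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    centres = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float() * patch_size
    dist = torch.cdist(centres, centres)            # (N, N) pairwise pixel distances
    # Expected distance per query patch, then averaged over queries.
    return (attn * dist).sum(dim=-1).mean(dim=-1)   # (num_heads,)

heads = mean_attention_distance(torch.softmax(torch.randn(12, 196, 196), dim=-1), grid_size=14)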

SELF-SUPERVISION
Transformers show impressive performance on NLP tasks. However, much of their success stems not
only from their excellent scalability but also from large-scale self-supervised pre-training (Devlin et al.,
2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch prediction for
self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised
pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant
improvement of 2% over training from scratch, but still 4% behind supervised pre-training. Appendix
B.1.2 contains further details. We leave exploration of contrastive pre-training (Chen et al., 2020b; He
et al., 2020; Bachman et al., 2019; Henaff et al., 2020) to future work.
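A minimal, hedged sketch of the masked patch prediction idea mentioned above: a random subset of patch embeddings is replaced with a learnable mask token, the corrupted sequence is encoded, and a small head regresses a simple per-patch target (here, the mean pixel colour of the masked patches). The masking ratio, target, and head shown are illustrative assumptions, not the exact recipe from Appendix B.1.2.

import torch
import torch.nn as nn

def masked_patch_loss(patch_embeds, patch_pixels, encoder, head, mask_token, mask_ratio=0.5):
    """patch_embeds: (B, N, D) patch embeddings; patch_pixels: (B, N, P*P, C) raw patches.
    encoder maps (B, N, D) -> (B, N, D); head maps D -> C (predicted mean colour)."""
    B, N, _ = patch_embeds.shape
    mask = torch.rand(B, N, device=patch_embeds.device) < mask_ratio         # which patches to corrupt
    corrupted = torch.where(mask.unsqueeze(-1), mask_token, patch_embeds)    # swap in the mask token
    pred = head(encoder(corrupted))                                          # (B, N, C)
    target = patch_pixels.mean(dim=2)                                        # mean colour per patch, (B, N, C)
    return ((pred - target) ** 2)[mask].mean()                               # MSE on masked patches only

# toy usage with stand-in modules
enc, head, tok = nn.Identity(), nn.Linear(768, 3), nn.Parameter(torch.zeros(768))
loss = masked_patch_loss(torch.randn(2, 196, 768), torch.rand(2, 196, 256, 3), enc, head, tok)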
Vision Transformer (ViT) in Image Recognition

While the Transformer architecture has become the de-facto standard for tasks involving
Natural Language Processing (NLP), its use cases in Computer Vision (CV) remain
comparatively few. In many computer vision tasks, attention is either used in conjunction
with convolutional networks (CNNs) or used to substitute certain aspects of convolutional
networks while keeping their overall composition intact. Popular convolutional architectures
include ResNet and VGG for image classification, and YOLOv3 and YOLOv7 for object detection.
However, this dependency on CNNs is not mandatory, and a pure transformer applied directly
to sequences of image patches can work exceptionally well on image classification tasks.

Performance of Vision Transformers in Computer Vision
Vision Transformers (ViT) have recently achieved highly competitive performance
in benchmarks for several computer vision applications, such as image
classification, object detection, and semantic image segmentation. CSWin
Transformer is an efficient and effective Transformer-based backbone for general-
purpose vision tasks that uses a new technique called “Cross-Shaped Window self-
attention” to analyze different parts of the image at the same time, which makes it
much faster. The CSWin Transformer has surpassed previous state-of-the-art
methods, such as the Swin Transformer. In benchmark tasks, CSWIN achieved
excellent performance, including 85.4% Top-1 accuracy on ImageNet-1K, 53.9 box
AP and 46.4 mask AP on the COCO detection task, and 52.2 mIoU on the ADE20K
semantic segmentation task.
ORIGIN AND HISTORY OF VISION TRANSFORMER MODELS

Vision transformer models are based on the Transformer architecture, which was originally
proposed for natural language processing (NLP) by Vaswani et al. in 2017. Since then, a
number of significant vision transformer models have been developed.
VISION TRANSFORMER ARCHITECTURE

Several vision transformer models have been proposed in the literature. The overall structure
of the vision transformer architecture consists of the following steps (a minimal end-to-end
sketch follows the list):

1. Split an image into fixed-size patches
2. Flatten the image patches
3. Create lower-dimensional linear embeddings from these flattened image patches
4. Add positional embeddings
5. Feed the sequence as input to a standard transformer encoder
6. Pre-train the ViT model with image labels, fully supervised, on a large dataset
7. Fine-tune on the downstream dataset for image classification
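As referenced above, the following end-to-end sketch strings the listed steps together using PyTorch's built-in Transformer encoder. The hyperparameters (a 192-dimensional, 6-layer, 3-head model) are arbitrary illustrative choices, not those of any published ViT variant, and nn.TransformerEncoderLayer is used as a stand-in for the encoder block described earlier.

import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Patchify -> embed -> add [class] token and positions -> encode -> classify."""

    def __init__(self, img_size=224, patch_size=16, num_classes=1000,
                 dim=192, depth=6, heads=3):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)  # steps 1-3
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))             # step 4
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)                   # step 5
        self.head = nn.Linear(dim, num_classes)          # linear classification head (steps 6-7)

    def forward(self, images):                                              # (B, 3, H, W)
        x = self.patch_proj(images).flatten(2).transpose(1, 2)              # (B, N, dim)
        x = torch.cat([self.cls_token.expand(x.size(0), -1, -1), x], dim=1)
        x = self.encoder(x + self.pos_embed)
        return self.head(x[:, 0])                                           # classify from [class] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))        # shape (2, 1000)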
REAL-WORLD VISION TRANSFORMER (VIT) USE CASES AND APPLICATIONS

Vision transformers have extensive applications in popular image recognition tasks
such as object detection, segmentation, image classification, and action recognition.
Moreover, ViTs are applied in generative modeling and multi-modal tasks, including
visual grounding, visual question answering, and visual reasoning. Video processing
tasks such as video forecasting and activity recognition also make use of ViTs. In
addition, image enhancement, colorization, and image super-resolution use ViT models.
Last but not least, ViTs have numerous applications in 3D analysis, such as
segmentation and point cloud classification.
Image Classification
The task of image classification is the most common problem in vision. CNN-based
methods are state-of-the-art for image classification tasks. ViTs do not produce
comparable performance on small to medium datasets; however, they have outperformed
CNNs on very large datasets.

This is because CNNs encode the local information in the image more effectively than
ViTs, due to their locally restricted receptive fields.
Image Captioning
A more advanced form of image categorization can be achieved by generating a caption
describing the content of an image instead of a one-word label. This has become
possible with the use of ViTs. Because ViTs learn a general representation of a given
data modality instead of a crude set of labels, it is possible to generate descriptive
text for a given image; for example, ViT-based captioning models have been trained on
the COCO dataset for this purpose.

Image Segmentation
DPT (Dense Prediction Transformer) is a segmentation model released by Intel in March
2021 that applies vision transformers to images. It can perform semantic image
segmentation with 49.02% mIoU on ADE20K. It can also be used for monocular depth
estimation, with an improvement of up to 28% in relative performance compared to a
state-of-the-art fully-convolutional network.
Anomaly Detection
A transformer-based image anomaly detection and localization network combines a
reconstruction-based approach and patch embedding. The use of transformer networks
helps preserve the spatial information of the embedded patches, which is later
processed by a Gaussian mixture density network to localize the anomalous areas.
Action Recognition
In an interesting paper, the Google Research team uses a pure-transformer-based model
for video classification, drawing upon the recent success of such models in image
classification. The model extracts spatiotemporal tokens from the input video, which
are then encoded by a series of transformer layers.

To handle the long sequences of tokens encountered in video, the authors propose
several efficient variants of their model that factorize the input's spatial and
temporal dimensions.

Although transformer-based models are known to be effective only when large training
datasets are available, the authors show how to effectively regularize the model during
training and leverage pretrained image models in order to train on comparatively small
datasets.
Autonomous Driving

At Tesla AI Day 2021, Tesla revealed many intricate inner workings of the neural
network powering Tesla FSD. One of the most intriguing building blocks is the one
dubbed “image-to-BEV transform + multi-camera fusion”. At the center of this block
is a Transformer module, or more concretely, a cross-attention module.

CONCLUSION

We have explored the direct application of Transformers to image recognition.


Unlike prior works using self-attention in computer vision, we do not introduce
image-specific inductive biases into the architecture apart from the initial patch
extraction step. Instead, we interpret an image as a sequence of patches and
process it by a standard Transformer encoder as used in NLP. This simple, yet
scalable, strategy works surprisingly well when coupled with pre-training on large
datasets. Thus, Vision Transformer matches or exceeds the state of the art on many
image classification datasets, whilst being relatively cheap to pre-train. While these
initial results are encouraging, many challenges remain. One is to apply ViT to other
computer vision tasks, such as detection and segmentation. Our results, coupled
with those in Carion et al. (2020), indicate the promise of this approach. Another
challenge is to continue exploring self-supervised pre-training methods. Our initial
experiments show improvement from self-supervised pre-training, but there is still a
large gap between self-supervised and large-scale supervised pre-training. Finally,
further scaling of ViT would likely lead to improved performance.
REFERENCES

• Vision Transformer: What It Is & How It Works [2023 Guide], V7 Labs.
  https://www.v7labs.com/blog/vision-transformer-guide#h2

• Vision Transformers (ViT) in Image Recognition – 2023 Guide, viso.ai.
  https://viso.ai/deep-learning/vision-transformer-vit/

• Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition
  at Scale”, ICLR 2021. [Submitted on 22 Oct 2020 (v1), last revised 3 Jun 2021 (v2)]
  https://arxiv.org/abs/2010.11929

• Vaswani et al., “Attention Is All You Need”, 2017.
  [Submitted on 12 Jun 2017 (v1), last revised 6 Dec 2017 (v5)]
  https://arxiv.org/abs/1706.03762
