Government College of Engineering, Aurangabad
SEMINAR REPORT
ON
“VISION TRANSFORMER”
SUBMITTED
BY
Prathamesh Vijay Pendam (BE20F05F051)
GUIDED
BY
Prof. V. A. Injamuri
(Assistant Professor, Computer Science Department)
CERTIFICATE
This is to certify that
Prathamesh Vijay Pendam (BE20F05F051)
Date:
Dr. A. S. Bhalchandra
(Principal)
ACKNOWLEDGEMENT
Regards,
Prathamesh Vijay Pendam
(BE20F05F051)
CONTENTS
Abstract……………………………………………………………………………………
1.INTRODUCTION……………………………………………………………………………
2.RELATED WORK……………………………………………..……………………………
5.SELF-SUPERVISION……………………………………………………………………….
6.CONCLUSION……………………………………………..………………………………..
7.REFERENCE……………………….………………………………………………………..
ABSTRACT
RELATED WORK
Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since become
the state-of-the-art method in many NLP tasks. Large Transformer-based models are often pre-trained
on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising
self-supervised pre-training task, while the GPT line of work uses language modelling as its pre-training
task (Radford et al., 2018; 2019; Brown et al., 2020). Naive application of self-attention to images would
require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this
does not scale to realistic input sizes. Thus, to apply Transformers in the context of image processing,
several approximations have been tried in the past. Parmar et al. (2018) applied the self-attention only
in local neighbourhoods for each query pixel instead of globally. Such local multi-head dot-product
self-attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019;
Zhao et al., 2020). In a different line of work, Sparse Transformers (Child et al., 2019) employ scalable
approximations to global self-attention in order to be applicable to images. An alternative way to scale
attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only
along individual axes (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention
architectures demonstrate promising results on computer vision tasks, but require complex
engineering to be implemented efficiently on hardware accelerators. Most related to ours is the model
of Cordonnier et al. (2020), which extracts patches of size 2 × 2 from the input image and applies full
self-attention on top. This model is very similar to ViT, but our work goes further to demonstrate that
large scale pre-training makes vanilla transformers competitive with (or even better than) state-of-the-
art CNNs. Moreover, Cordonnier et al. (2020) use a small patch size of 2 × 2 pixels, which makes the
model applicable only to small-resolution images, while we handle medium-resolution images as well.
There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms of
self-attention, e.g. by augmenting feature maps for image classification (Bello et al., 2019) or by further
processing the output of a CNN using self-attention, e.g. for object detection (Hu et al., 2018; Carion
et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu et al.,
2020), unsupervised object discovery (Locatello et al., 2020), or unified text-vision tasks (Chen et al.,
2020c; Lu et al., 2019; Li et al., 2019). Another recent related model is image GPT (iGPT) (Chen et al.,
2020a), which applies Transformers to image pixels after reducing image resolution and color space.
The model is trained in an unsupervised fashion as a generative model, and the resulting
representation can then be fine-tuned or probed linearly for classification performance, achieving a
maximal accuracy of 72% on ImageNet. Our work adds to the increasing collection of papers that
explore image recognition at larger scales than the standard ImageNet dataset. The use of additional
data sources makes it possible to achieve state-of-the-art results on standard benchmarks (Mahajan et al., 2018;
Touvron et al., 2019; Xie et al., 2020). Moreover, Sun et al. (2017) study how CNN performance scales
with dataset size, and Kolesnikov et al. (2020); Djolonga et al. (2020) perform an empirical exploration
of CNN transfer learning from large scale datasets such as ImageNet-21k and JFT-300M. We focus on
these two latter datasets as well, but train Transformers instead of ResNet-based models used in prior
works.
Figure 1: Model overview. We split an image into fixed-size patches, linearly embed each of them, add
position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
In order to perform classification, we use the standard approach of adding an extra learnable
“classification token” to the sequence. The illustration of the Transformer encoder was inspired by
Vaswani et al. (2017).
An overview of the model is depicted in Figure 1. The standard Transformer receives as input
a 1D sequence of token embeddings. To handle 2D images, we reshape the image x ∈ ℝ^(H×W×C)
into a sequence of flattened 2D patches x_p ∈ ℝ^(N×(P²·C)), where (H, W) is the resolution of the
original image, C is the number of channels, (P, P) is the resolution of each image patch, and
N = HW/P² is the resulting number of patches, which also serves as the effective input sequence
length for the Transformer. The Transformer uses constant latent vector size D through all of
its layers, so we flatten the patches and map to D dimensions with a trainable linear projection
(Eq. 1). We refer to the output of this projection as the patch embeddings.
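To make this flattening-and-projection step concrete, the following is a minimal PyTorch-style sketch of Eq. 1, assuming a 224×224 RGB input, 16×16 patches and D = 768; the class name PatchEmbed and the hyperparameter values are illustrative choices, not the paper's reference implementation.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Cut an image into P x P patches and linearly project each to D dimensions (Eq. 1)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        # Trainable linear projection from a flattened P*P*C patch to D dimensions.
        self.proj = nn.Linear(patch_size * patch_size * in_chans, embed_dim)

    def forward(self, x):                                         # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Extract non-overlapping P x P patches and flatten each one.
        x = x.unfold(2, P, P).unfold(3, P, P)                     # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P) # (B, N, P^2 * C)
        return self.proj(x)                                       # (B, N, D) patch embeddings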
Position embeddings are added to the patch embeddings to retain positional information. We use
standard learnable 1D position embeddings, since we have not observed significant performance gains
from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of
embedding vectors serves as input to the encoder.
The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-
attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every
block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019).
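The layer structure described here (pre-norm MSA followed by a pre-norm MLP block, each with a residual connection) can be sketched as follows, using PyTorch's built-in multi-head attention; the width, number of heads and MLP ratio are illustrative assumptions.

import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder layer: LN -> MSA -> residual, LN -> MLP -> residual (Eq. 2, 3)."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):                                  # z: (B, N+1, D)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # Eq. 2: z' = MSA(LN(z)) + z
        z = z + self.mlp(self.norm2(z))                    # Eq. 3: z  = MLP(LN(z')) + z'
        return z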
Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than
CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are
baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally
equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is
used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning
time for adjusting the position embeddings for images of different resolution (as described below).
Other than that, the position embeddings at initialization time carry no information about the 2D
positions of the patches and all spatial relations between the patches have to be learned from scratch.
Hybrid Architecture. As an alternative to raw image patches, the input sequence can be formed from
feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection E
(Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have
spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial
dimensions of the feature map and projecting to the Transformer dimension. The classification input
embedding and position embeddings are added as described above.
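A minimal sketch of this 1×1 special case of the hybrid model is given below, assuming a ResNet-50 backbone from torchvision; the backbone choice and the class name HybridEmbed are assumptions for illustration, not a fixed part of the architecture.

import torch.nn as nn
from torchvision.models import resnet50

class HybridEmbed(nn.Module):
    """Hybrid input: treat each spatial position of a CNN feature map as a 1x1 'patch'."""
    def __init__(self, embed_dim=768):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to the last convolutional stage (drop avgpool and fc).
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.proj = nn.Linear(2048, embed_dim)   # patch embedding projection E on 1x1 patches

    def forward(self, x):                        # x: (B, 3, H, W)
        fmap = self.backbone(x)                  # (B, 2048, h, w)
        seq = fmap.flatten(2).transpose(1, 2)    # (B, h*w, 2048): flatten the spatial dimensions
        return self.proj(seq)                    # (B, h*w, D)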
Note that the ImageNet pre-trained models are also fine-tuned, but again on ImageNet, because the resolution increase during fine-tuning improves performance.
Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB (Table 2)
seem promising for very low-data transfer. Further analysis of few-shot properties of ViT is an exciting
direction of future work.
INSPECTING VISION TRANSFORMER
To begin to understand how the Vision Transformer processes
image data, we analyse its internal representations. The first layer
of the Vision Transformer linearly projects the flattened patches
into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the
top principal components of the learned embedding filters. The
components resemble plausible basis functions for a low-
dimensional representation of the fine structure within each
patch. After the projection, a learned position embedding is added
to the patch representations. Figure 7 (center) shows that the
model learns to encode distance within the image in the similarity
of position embeddings, i.e., closer patches tend to have more
similar position embeddings. Further, the row-column structure
appears; patches in the same row/column have similar
embeddings. Finally, a sinusoidal structure is sometimes apparent
for larger grids (Appendix D). That the position embeddings learn
to represent 2D image topology explains why hand-crafted 2D-
aware embedding variants do not yield improvements (Appendix
D.4).
Self-attention allows ViT to integrate information across the entire image even in the
lowest layers. We investigate to what degree the network makes use of this capability.
Specifically, we compute the average distance in image space across which information is
integrated, based on the attention weights (Figure 7, right). This “attention distance” is
analogous to receptive field size in CNNs. We find that some heads attend to most of the image
already in the lowest layers, showing that the ability to integrate information globally is indeed
used by the model. Other attention heads have consistently small attention distances in the
low layers. This highly localized attention is less pronounced in hybrid models that apply a
ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function
as early convolutional layers in CNNs. Further, the attention distance increases with network
depth. Globally, we find that the model attends to image regions that are semantically relevant
for classification (Figure 6).
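To make the "attention distance" measure concrete, here is a minimal sketch, assuming we are given a single head's attention weights over a regular grid of patch tokens (with the classification token already dropped); the function name, the 16-pixel patch size and the 14×14 grid in the example are illustrative assumptions.

import numpy as np

def mean_attention_distance(attn, grid_size, patch_size=16):
    """Average image-space distance over which one attention head integrates information.

    attn: (N, N) attention weights of one head over the N = grid_size**2 patch tokens.
    """
    # Pixel-space centre coordinates of every patch on the grid.
    coords = np.stack(np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij"), -1)
    coords = coords.reshape(-1, 2) * patch_size + patch_size / 2             # (N, 2)
    # Pairwise Euclidean distances between patch centres, in pixels.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1) # (N, N)
    # Attention-weighted average distance per query token, then averaged over queries.
    return float((attn * dists).sum(axis=-1).mean())

# Example: a head with uniform attention on a 14x14 grid integrates information
# across a large fraction of a 224x224 image.
uniform = np.full((196, 196), 1.0 / 196)
print(mean_attention_distance(uniform, grid_size=14))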
SELF-SUPERVISION
Transformers show impressive performance on NLP tasks. However, much of their success stems not
only from their excellent scalability but also from large scale self-supervised pre-training (Devlin et al.,
2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch prediction for
self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised
pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant
improvement of 2% over training from scratch, but still 4% behind supervised pre-training. Appendix
B.1.2 contains further details. We leave exploration of contrastive pre-training (Chen et al., 2020b; He
et al., 2020; Bachman et al., 2019; Henaff et al., 2020) to future work.
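Since this report only summarizes the masked-patch-prediction setup, the following is a loose sketch of the general idea rather than the exact recipe from the paper's Appendix B.1.2: a fraction of flattened patches is corrupted, and the model is trained to predict a simple target for the masked positions (here, the mean colour of each masked patch). The mask ratio, the prediction target, and the function names are assumptions.

import torch

def masked_patch_pretrain_step(patches, encoder, predictor, mask_ratio=0.5):
    """One illustrative self-supervised step on flattened patches of shape (B, N, C*P*P),
    assuming channel-first flattening as in the PatchEmbed sketch above."""
    B, N, _ = patches.shape
    # Target: per-channel mean colour of every patch.
    target = patches.reshape(B, N, 3, -1).mean(dim=-1)     # (B, N, 3)
    # Randomly choose which patches to corrupt.
    mask = torch.rand(B, N, device=patches.device) < mask_ratio
    corrupted = patches.clone()
    corrupted[mask] = 0.0                                   # replace masked patches with zeros
    z = encoder(corrupted)                                  # (B, N, D) contextual representations
    pred = predictor(z)                                     # (B, N, 3) predicted mean colours
    # Compute the loss only on masked positions.
    loss = ((pred - target) ** 2)[mask].mean()
    return loss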
Vision Transformer (ViT) in Image Recognition
While the Transformer architecture has become the de facto standard for tasks involving
Natural Language Processing (NLP), its applications to Computer Vision (CV) remain
comparatively few. In many computer vision tasks, attention is either used in conjunction
with convolutional networks (CNNs) or used to substitute certain components of
convolutional networks while keeping their overall composition intact. Popular convolutional
architectures include ResNet and VGG for image recognition, and YOLOv3 and YOLOv7 for
object detection. However, this dependency on CNNs is not mandatory, and a pure transformer
applied directly to sequences of image patches can work exceptionally well on
image classification tasks.
Performance of Vision Transformers in Computer Vision
Vision Transformers (ViT) have recently achieved highly competitive performance
in benchmarks for several computer vision applications, such as image
classification, object detection, and semantic image segmentation. CSWin
Transformer is an efficient and effective Transformer-based backbone for general-
purpose vision tasks that uses a new technique called “Cross-Shaped Window self-
attention” to analyze different parts of the image at the same time, which makes it
much faster. The CSWin Transformer has surpassed previous state-of-the-art
methods, such as the Swin Transformer. In benchmark tasks, CSWin achieved
excellent performance, including 85.4% Top-1 accuracy on ImageNet-1K, 53.9 box
AP and 46.4 mask AP on the COCO detection task, and 52.2 mIoU on the ADE20K
semantic segmentation task.
ORIGIN AND HISTORY OF VISION TRANSFORMER MODELS
The overall structure of the vision transformer architecture consists of the following
steps (a code sketch of these steps is given after this list):
1. Split an input image into fixed-size patches.
2. Flatten the patches and map each of them to D dimensions with a trainable linear projection (the patch embeddings).
3. Prepend a learnable classification token and add position embeddings.
4. Feed the resulting sequence of vectors to a standard Transformer encoder.
5. Classify the image with an MLP head attached to the classification token's output.
When training data is limited, however, CNNs still tend to outperform ViTs. This is because CNNs encode the local information in the image more effectively than
ViTs, due to the application of locally restricted receptive fields.
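The following is a compact end-to-end sketch of the steps listed above, assuming ImageNet-style 224×224 inputs and ViT-Base-like hyperparameters; the strided-convolution patch embedding and PyTorch's TransformerEncoderLayer are implementation conveniences, not the paper's reference code.

import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Minimal ViT classifier following the steps above; all sizes are illustrative."""
    def __init__(self, img_size=224, patch_size=16, num_classes=1000, dim=768, depth=12, heads=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Steps 1-2: cut into patches and linearly embed (implemented as a strided convolution).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Step 3: learnable classification token and 1D position embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Step 4: standard (pre-norm) Transformer encoder.
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Step 5: classification head on the [class] token.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # x: (B, 3, H, W)
        z = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(z.size(0), -1, -1)
        z = torch.cat([cls, z], dim=1) + self.pos_embed      # prepend [class] token, add positions
        z = self.encoder(z)
        return self.head(z[:, 0])                            # predict from the [class] token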
Image captioning
A more advanced form of image categorization can be achieved by generating a caption
describing the content of an image instead of a one-word label. This has become
possible with the use of ViTs. ViTs learn a general representation of a given data
modality instead of a crude set of labels. Therefore, it is possible to generate descriptive
text for a given image, for example with a ViT-based captioning implementation trained on the
COCO dataset.
Image segmentation
DPT (Dense Prediction Transformers) is a segmentation model released by Intel in March
2021 that applies vision transformers to images. It can perform image semantic
segmentation with 49.02% mIoU on ADE20K. It can also be used for monocular depth
estimation with an improvement of up to 28% relative performance compared to a
state-of-the-art fully-convolutional network.
Anomaly detection
A transformer-based image anomaly detection and localization network combines a
reconstruction-based approach and patch embedding. The use of transformer networks
helps preserve the spatial information of the embedded patches, which is later
processed by a Gaussian mixture density network to localize the anomalous areas.
Action recognition
An interesting paper by the Google Research team uses pure-transformer-based models
for video classification, drawing upon the recent success of such models in image
classification. The model extracts spatiotemporal tokens from the input video,
which are then encoded by a series of transformer layers.
To handle the long sequences of tokens encountered in the video, the authors propose
several efficient variants of their model that factorize the input's spatial and temporal
dimensions.
Although transformer-based models are known to be effective only when large training
datasets are available, the authors show how the model can be effectively regularised
during training, and how pretrained image models can be leveraged, to train on
comparatively small datasets.
Autonomous driving
On Tesla AI Day in 2021, Tesla revealed many intricate inner workings of the neural
network powering Tesla FSD. One of the most intriguing building blocks is one dubbed
REFERENCE
1. A. Vaswani et al., "Attention Is All You Need," 2017. https://arxiv.org/abs/1706.03762