Transformer Segmentation: A survey
Hans Thisanke(a), Chamli Deshan(a), Kavindu Chamith(a), Sachith Seneviratne(b,c), Rajith Vidanaarachchi(b,c), Damayanthi Herath(a,*)

(a) Department of Computer Engineering, University of Peradeniya, Peradeniya, 20400, Sri Lanka
(b) Melbourne School of Design, University of Melbourne, Parkville, VIC 3010, Australia
(c) Faculty of Engineering and IT, University of Melbourne, Parkville, VIC 3010, Australia
Abstract
Semantic segmentation has a broad range of applications in a variety of domains including land coverage analysis, autonomous driving, and medical image analysis. Convolutional neural networks (CNNs) and Vision Transformers (ViTs) provide the architectural backbones for semantic segmentation models. Even though ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection, since the plain ViT is not a general-purpose backbone because of its patch partitioning scheme. In this survey, we discuss several ViT architectures that can be used for semantic segmentation and how their evolution addressed this challenge. The success of ViTs has motivated the community to gradually replace traditional convolutional neural networks in various computer vision tasks. This survey reviews and compares the performance of ViT architectures designed for semantic segmentation on benchmark datasets. It is intended to help the community understand existing implementations of semantic segmentation and to discover more efficient ViT-based methodologies.
Keywords: vision transformer, semantic segmentation, review, survey, convolutional neural networks, self-supervised learning, deep learning
∗ Corresponding author
Email addresses: e16368@eng.pdn.ac.lk (Hans Thisanke), e16076@eng.pdn.ac.lk
(Chamli Deshan), e16057@eng.pdn.ac.lk (Kavindu Chamith),
sachith.seneviratne@unimelb.edu.au (Sachith Seneviratne),
rajith.vidanaarachchi@unimelb.edu.au (Rajith Vidanaarachchi),
damayanthiherath@eng.pdn.ac.lk (Damayanthi Herath)
results on benchmark datasets. Even though several surveys have been published [12, 13, 14], a comparison of segmentation models across several benchmark datasets to identify the best-performing model has not been carried out. In our survey, we cover a set of segmentation models and, for each, identify the best variant on each benchmark dataset. This is useful for identifying optimal parameters, such as patch size and iteration count, for each model variant. By reporting the mIoU (%) of these models over several semantic segmentation benchmark datasets, an overall evaluation and the highest-performing model variant for each dataset can be identified.
In Section 2 we discuss the applications of semantic segmentation, ViTs, their challenges, and loss functions. Section 3 describes benchmark datasets used in semantic segmentation. Section 4 describes existing work on semantic segmentation using ViTs and presents a quantitative analysis. Finally, Section 5 provides a discussion and Section 6 concludes the paper with future directions.
U-Net, a state-of-the-art FCN, and further improved architectures with higher accuracy and efficiency have been developed in [16, 17, 18].
One limitation identified in the FCN architecture is the low resolution of the final segmentation map, caused by passing the feature maps through several convolutional and pooling layers. Furthermore, the locality of FCN-based methods limits their ability to capture long-range dependencies in the feature maps. To address this, researchers have explored attention mechanisms to augment or replace these models, which led to trying out Transformer architectures, already successful in NLP, in the computer vision domain.
Self-attention-based architectures have become dominant in NLP by avoiding drawbacks such as vanishing gradients in sequence modeling and transduction tasks. Designed specifically for these tasks, Transformers with attention are able to model long-range dependencies in sequences. When training an NLP model, one of the most effective approaches is to pre-train on a large text corpus and then fine-tune on a small task-specific dataset, which was previously a challenging task for deep neural networks. Because Transformers have high computational efficiency and scalability, it became easier to train them on large datasets [19].
With the success of self-attention in enhancing input-output interaction in NLP, works have proposed combining convolutional architectures with self-attention, especially for object detection and semantic segmentation where such interaction is highly needed [20]. However, applying attention within convolutional architectures demands high computational power in practice, even though it is theoretically efficient [1].
For images, computing self-attention over pixels is quadratic in the image size, since each pixel attends to every other pixel, i.e., the cost is quadratic in the pixel count [2]. Therefore, [2] proposed dividing the image into a sequence of patches and treating them as tokens, as is done in NLP. Instead of pixel-wise attention, patch-wise attention is used in the architecture, which reduces the computational complexity compared to applying self-attention within convolutional architectures.
This architecture showed promising results, surpassing state-of-the-art convolution-based methods with an accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, and 94.55% on CIFAR-100 [2]. A notable characteristic of the ViT is that it needs more data for model training; experiments in [2] show that ViT performance improves as the dataset size increases.
Figure 1: Architecture of the Vision Transformer. The model splits an image into a
number of fixed-size patches and linearly embeds them with position embeddings (left).
Then the result is fed into a standard transformer encoder (right). Adapted from [2].
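To make the patch-token idea in Figure 1 concrete, the following is a minimal PyTorch sketch (dimensions are illustrative defaults, not values prescribed by [2]) of splitting an image into fixed-size patches and linearly embedding them, together with learnable position embeddings, to obtain the token sequence fed to a Transformer encoder.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing out patches and applying
        # a shared linear projection to each flattened patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learnable position embedding per patch token.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, D) sequence of patch tokens
        return x + self.pos_embed

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 196, 768])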
image analysis using computer vision and AI. The use of neural networks makes it possible to process large amounts of image data for object detection, semantic segmentation, and change detection tasks. Advances in the remote sensing domain have further improved satellite sensors, and the introduction of drone technology for aerial imagery has been vital for capturing finer details of the earth's surface. This has resulted in precise and accurate data for processing with AI techniques [26].
Remote sensing images of the earth's surface cover land areas that can be divided into different segmentation classes. Each pixel is assigned a class label while preserving the spatial resolution of the image. Many datasets containing such remote sensing images and their segmentation masks are available [25, 27, 28] for applications such as change detection, land cover segmentation, and classification. Common land cover classes addressed by this pixel-level classification include forests, crops, buildings, water resources, grasslands, and roads. Research has adapted ViT architectures, by adding layers and attention mechanisms efficiently, to process high-resolution remote sensing images for semantic segmentation with improved performance, for example the Efficient Transformer [10] and the Wide-Context Transformer [29].
Manual segmentation of these different environmental areas from complex satellite or aerial images is a difficult task that is time-consuming, error-prone, and requires expertise in the remote sensing domain.
Therefore, the images generated are bound by the limitations of the available technology and require intervention by medical personnel to examine them [30]. Segmenting these images across different biological domains thus requires experts in each field to work with these systems and spend a great deal of time examining the images. To overcome these difficulties, automatic feature extraction has been introduced through deep learning based techniques, which have proven valuable for medical imagery. With advances in segmentation, better-performing models built on medical images have been introduced by many researchers. One famous architecture is the U-Net [31], which was initially introduced for medical image analysis. Based on it, several improved versions have followed, using medical imaging datasets for heart, lesion, and liver segmentation [32, 33, 18]. This demonstrates how beneficial improvements in segmentation have been in the medical setting. In recent years, emerging ViT architectures have also been applied to the medical domain, with TransUNet [34] and Swin-Unet [35]. These are hybrid Transformer architectures that retain the advantages of the U-Net, and they achieved better accuracy in cardiac and multi-organ segmentation applications.
One limitation of medical imaging is the relatively small number of images available compared to natural image datasets (landscapes, people, animals, and automobiles) containing millions of images. The medical domain also spans several imaging modalities, and annotating medical images requires expertise in the corresponding medical field. Among the modalities, MRI and microscopy images are particularly difficult to annotate [36]. Such datasets typically contain fewer images than ultrasound, X-ray, and lesion datasets, which are obtained with existing scanning systems and are easier to annotate owing to less complex structures and finer boundaries. Limitations remain, however, because privacy restrictions and other medical policies make it difficult to obtain these images in large quantities. To mitigate this, several image segmentation challenges take place every year and provide publicly available, well-annotated medical image datasets. Most improvements made through research on semantic segmentation models have been based on these challenge datasets, and most of them are used as benchmark datasets for segmentation [37, 38, 39].
the video is considered as a set of uncorrelated fixed images [44]. The common challenge with this type of semantic segmentation is the computational complexity of scaling the spatial dimension of the video with the temporal frame rate. Discarding temporal features and focusing only on spatial frame-by-frame features is inadequate for video segmentation: since there is a continuous flow between the frames of a video, considering the temporal context is essential in video semantic segmentation, even though it is computationally expensive.
Research has been conducted to reduce this high computational cost on videos, with feature reuse and feature warping [45] proposed as solutions. Cityscapes [46] and CamVid [47] are among the largest video segmentation datasets available for the frame-by-frame approach to video segmentation [48]. Recent papers have proposed segmentation methods such as selective re-execution of feature extraction layers [49], optical flow-based feature warping [50], and LSTM-based, fixed-budget keyframe selection policies [51]. The key problem with these approaches is that they pay little attention to the temporal context of the video. Researchers have shown that, to satisfy both spatial and temporal contexts, using the optical flow of a video as temporal information to speed up uncertainty estimation is effective [52]. VisTR [53], TeViT [54], and SeqFormer [55] are some of the Transformer models used for video segmentation tasks.
task as described in Figure 2. A pretext task is a pre-designed task from which the network can learn features; the trained weights can then be transferred so that the network can solve downstream tasks. A downstream task is the specific target task; common downstream tasks in computer vision include semantic segmentation and object detection.
Figure 2: The general pipeline of self-supervised learning. The trained weights from solving
a pretext task are applied to solve some downstream tasks.
• Generative: train an encoder to encode the given input and a decoder to reconstruct the input from the encoding.
• Contrastive: train an encoder to encode inputs and measure the similarities between the encoded representations.
• Generative-Contrastive (Adversarial): train an encoder-decoder to generate fake outputs and compare the features of the input and the generated output [57].
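As a rough, hedged sketch of the pipeline in Figure 2 (illustrative PyTorch code, not tied to any specific method discussed here), an encoder can first be trained on a generative pretext task without labels and then reused, with a lightweight head, for downstream segmentation:

import torch
import torch.nn as nn

# Hypothetical encoder shared between the pretext and downstream tasks.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)

# Pretext task (generative example): reconstruct the input image, no labels needed.
decoder = nn.Sequential(
    nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 2, stride=2),
)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
images = torch.randn(8, 3, 64, 64)                 # stand-in for unlabeled data
loss = nn.functional.mse_loss(decoder(encoder(images)), images)
loss.backward()
opt.step()

# Downstream task: reuse the pre-trained encoder for semantic segmentation.
num_classes = 5
seg_head = nn.Conv2d(64, num_classes, kernel_size=1)   # lightweight segmentation head
logits = seg_head(encoder(images))                     # coarse per-class score maps
logits = nn.functional.interpolate(logits, size=images.shape[-2:], mode="bilinear")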
Equation 1 computes the average loss per pixel in an image, where p_i is the true probability of the i-th class and q_i is the predicted probability of the same class. This encourages the model to generate probability maps that closely resemble the actual segmentation masks while penalizing inaccurate predictions more heavily. By minimizing the cross-entropy loss during training, the model becomes better at precise image segmentation.
Although widely used, this loss can be biased by dataset imbalance, since the majority class dominates. To compensate when the dataset is skewed, a weighted cross-entropy loss was introduced in [31].
$$\mathrm{WCE}_{loss}(p, q) = -\sum_{i=1}^{n} p_i\, w_i \log(q_i) \qquad (2)$$
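As a small hedged sketch (PyTorch; the per-class weights w are assumed to be supplied by the user, for example from inverse class frequencies), the weighted cross-entropy of equation (2) can be computed per pixel as follows:

import torch

def weighted_cross_entropy(logits, target, class_weights):
    """Per-pixel weighted cross-entropy following Eq. (2).

    logits:        (B, C, H, W) raw class scores
    target:        (B, H, W)    integer ground-truth labels
    class_weights: (C,)         weight w_i for each class
    """
    log_q = torch.log_softmax(logits, dim=1)                      # predicted log-probabilities
    # The one-hot true distribution p_i selects the log-probability of the true class.
    true_log_q = log_q.gather(1, target.unsqueeze(1)).squeeze(1)  # (B, H, W)
    w = class_weights[target]                                     # per-pixel weight w_i
    return -(w * true_log_q).mean()

# A closely related built-in is torch.nn.functional.cross_entropy(logits, target,
# weight=class_weights), which normalizes by the summed weights rather than the pixel count.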
However, cross-entropy calculates the average per-pixel loss without considering adjacent pixels, which may lie on object boundaries.
As a further improvement over cross-entropy, the focal loss [58] was introduced by altering the structure of the cross-entropy loss. For samples that are already classified accurately, a scaling factor down-weights their contribution. This ensures that harder samples are emphasized, so that a high class imbalance does not bias the overall loss.
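A hedged sketch of the (binary) focal loss in this spirit, following the standard formulation of [58] with the usual focusing parameter gamma and balancing factor alpha:

import torch

def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights well-classified pixels by (1 - p_t)**gamma.

    logits: (B, 1, H, W) raw scores; target: (B, 1, H, W) binary labels in {0, 1}.
    """
    ce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, target.float(), reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)             # probability of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()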
$$D_{loss}(g, p) = 1 - \frac{2\sum_{i=1}^{n} g_i p_i}{\sum_{i=1}^{n} g_i + \sum_{i=1}^{n} p_i + \epsilon} \qquad (4)$$
Here, in equation (4), g and p denote the ground-truth and predicted segmentations. The sums are computed over the n pixels, and a small constant ε is added to avoid division by zero. The Dice coefficient measures the overlap between the two samples (ground truth and prediction) and yields a score between 0 and 1, where 1 means perfect overlap. Since this loss considers pixels in both global and local contexts, it typically gives higher accuracy than cross-entropy-based calculations.
Another related measure used both as a metric and as a loss is the IoU (Intersection over Union), also known as the Jaccard index. It is quite similar to the Dice measure and quantifies the overlap of positive instances between the two samples. As shown in equation 5, it differs from the Dice loss in that the correctly classified region is measured relative to the total pixels in either the ground truth or the predicted segments.
$$IoU_{loss}(g, p) = 1 - \frac{\sum_{i=1}^{n} g_i p_i}{\sum_{i=1}^{n} g_i + \sum_{i=1}^{n} p_i - \sum_{i=1}^{n} g_i p_i + \epsilon} \qquad (5)$$
For multi-class segmentation, the mean IoU (mIoU) is obtained by averaging the per-class IoU values. It is widely used for performance comparison and evaluation of dense prediction models [60].
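The following is a minimal sketch (PyTorch, with soft probability predictions for the losses and hard label maps for the metric) of the Dice loss of equation (4), the IoU loss of equation (5), and the mIoU metric:

import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss, Eq. (4): pred and target are flattened probabilities / labels."""
    inter = (pred * target).sum()
    return 1 - (2 * inter) / (pred.sum() + target.sum() + eps)

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU (Jaccard) loss, Eq. (5)."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return 1 - inter / (union + eps)

def mean_iou(pred_labels, target_labels, num_classes, eps=1e-6):
    """mIoU over hard label maps: average of the per-class IoU values."""
    ious = []
    for c in range(num_classes):
        p, t = pred_labels == c, target_labels == c
        inter = (p & t).sum().float()
        union = (p | t).sum().float()
        if union > 0:                      # ignore classes absent from both maps
            ious.append(inter / (union + eps))
    return torch.stack(ious).mean() if ious else torch.tensor(0.0)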
3. Datasets
In this section, we consider common datasets used for training and testing semantic segmentation models. Factors affecting the creation of real datasets include lighting conditions, weather, and season, and datasets can be grouped accordingly. Data collected under normal daytime environmental conditions are categorized as no cross-domain datasets. Data collected under deviating environmental conditions, such as rain, clouds, nighttime, or snow, are categorized as cross-domain datasets. A third category is synthetic data, which is artificially created and collected for training purposes, mostly as a cost-effective supplement. The following are some of the benchmark datasets made specifically for semantic segmentation tasks, with a summary presented in Table 1.
PASCAL-Context [61] This dataset was created by manually labeling every pixel of the PASCAL-VOC 2010 [62] dataset with semantic categories. Its domain is not restricted and it contains a wide range of objects. The semantic categories fall into three main classes: (i) objects, (ii) stuff, and (iii) hybrids. Objects are well-defined categories such as cups and keyboards. Stuff covers classes without a specific shape, such as sky and water regions. Hybrids are intermediate categories such as roads, which have a clear boundary but whose shape cannot be predicted reliably.
ADE20K [63] Annotations in this dataset cover scenes, objects, and object parts; many objects are annotated together with their parts. Annotation is carried out continuously, so the dataset keeps growing.
KITTI [64] This dataset contains both 2D and 3D images collected from urban and rural areas and expressway traffic scenarios, and it is useful for robotics and autonomous driving. It has different variants, namely KITTI-2012 and KITTI-2015, which differ in their ground truth.
Cityscapes [46] This dataset contains large-scale pixel-level and instance-level semantic segmentation annotations derived from a set of stereo video sequences. Compared to other datasets, it ranks highly in quality, data size, and annotations, and its data were collected from 50 different cities in Germany and neighboring countries.
IDD [65] This dataset is specially designed for road scene understanding, with data collected from 182 Indian road scenes. Because these scenes are taken from Indian roads, there are variations in weather and lighting conditions due to dust and air quality. A key feature of this dataset is that it contains special classes such as auto-rickshaws and animals on the road.
Virtual KITTI [66] Apart from differing weather and imaging conditions, virtual vision datasets such as Virtual KITTI closely resemble real vision datasets, which makes them useful for pre-training. This dataset was created from 5 different urban scene videos of the real-world KITTI dataset. The data are automatically labeled and can be used for object detection, semantic segmentation, instance segmentation, and related tasks.
IDDA [67] This dataset contains 1 million frames generated with the CARLA simulator, based on 7 different city models. It can be used for semantic segmentation across more than 100 different visual domains and is specially designed for autonomous driving models.
Table 1: Summary of the benchmark datasets.

Dataset | Classes | Size | Train | Validation | Test | Resolution (pixels) | Category
PASCAL-Context | 540 | 19740 | 4998 | 5105 | 9637 | 387 × 470 | No cross-domain
ADE20K | 150 | 25210 | 20210 | 2000 | 3000 | - | No cross-domain
KITTI | 5 | 252 | 140 | - | 112 | 1392 × 512 | No cross-domain
Cityscapes | 30 | 5K fine, 20K coarse | 2975 | 500 | 1525 | 1024 × 2048 | Cross-domain
IDD | 34 | 10004 | 7003 | 1000 | 2001 | 1678 × 968 | Cross-domain
Virtual KITTI | 14 | 21260 | - | - | - | 1242 × 375 | Synthetic
IDDA | 24 | 1M | - | - | - | 1920 × 1080 | Synthetic
4. Meta-analysis
In this section, we discuss some of the ViT models specialized for semantic segmentation. The models were selected based on the datasets on which they were benchmarked (ADE20K, Cityscapes, PASCAL-Context), so that all models can be compared on a common basis. The benchmark results are summarized in Table 2.
4.1. SEgmentation TRansformer (SETR)
SETR [5] formulates semantic segmentation as a sequence-to-sequence prediction task and adopts a pure Transformer as the encoder of the segmentation model, without any convolution layers. In this model, the prevalent stacked-convolution encoder, which gradually reduces spatial resolution, is replaced with a pure Transformer.
Figure 3: SETR architecture and its variants adapted from [5]. (a) SETR consists of a
standard Transformer. (b) SETR-PUP with a progressive up-sampling design. (c) SETR-
MLA with a multi-level feature aggregation.
4.2. Swin Transformer
The Swin Transformer [4] (Hierarchical Vision Transformer using Shifted Windows) can serve as a general-purpose backbone for computer vision tasks such as image classification and dense prediction.
Figure 4: An overview of the Swin Transformer adapted from [4]. (a) Hierarchical feature maps for reducing computational complexity. (b) Shifted window approach used when calculating self-attention. (c) Two successive Swin Transformer blocks applied at each stage. (d) Core architecture of the Swin Transformer.
convolutional architectures such as ResNet [68]. Therefore, the Swin Transformer can efficiently replace ResNet backbones in computer vision tasks.
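As a rough illustration of the windowed attention in Figure 4 (a sketch under assumed sizes, not the authors' implementation), a feature map can be partitioned into fixed-size windows within which self-attention is computed, and a cyclic shift produces the shifted-window configuration used in alternating blocks:

import torch

def window_partition(x, ws):
    """Split a feature map (B, H, W, C) into (num_windows*B, ws*ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(2, 56, 56, 96)                           # example feature map
windows = window_partition(x, ws=7)                      # attention is computed per window
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))    # shifted-window configuration
shifted_windows = window_partition(shifted, ws=7)
print(windows.shape)                                     # torch.Size([128, 49, 96])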
4.3. Segmenter
Segmenter [11] is a purely Transformer-based approach for semantic segmentation, consisting of a ViT backbone pre-trained on ImageNet and a mask transformer introduced as the decoder (Figure 5). Although the model is built for segmentation, it takes advantage of models designed for image classification for pre-training, and then fine-tunes them on moderate-sized segmentation datasets.
Figure 5: Segmenter architecture adapted from [11]. It consists of a ViT backbone with a mask transformer as the decoder.
4.4. SegFormer
SegFormer [69] is a semantic segmentation architecture consisting of a hierarchical Transformer encoder and a lightweight multilayer perceptron (MLP) decoder (Figure 6), where the MLP decoder predicts the final mask. To obtain precise segmentation, it uses a patch size of 4 × 4, in contrast to ViT's 16 × 16, and an overlapped patch merging process maintains local continuity around the patches.
Generally, ViT has a fixed resolution for positional encoding [70]. This
leads to a drop in accuracy since it needs to interpolate the positional en-
coding of testing images when they have a different resolution than training
images. Thus, SegFormer introduces a Positional-Encoding-Free design as a
key feature.
Moreover, the authors claim their architecture is more robust against common corruptions and perturbations than existing methods, which makes SegFormer appropriate for safety-critical applications. SegFormer achieved competitive results on the ADE20K, Cityscapes, and COCO-Stuff datasets, as shown in Table 2. It comes in several variants, from SegFormer-B0 to SegFormer-B5. The largest model, SegFormer-B5, surpasses SETR [5] on the ADE20K dataset, achieving the highest mIoU while being 4× faster. The SegFormer variants trade off model size, accuracy, and runtime.
4.5. Pyramid Vision Transformer (PVT)
ViT cannot be directly applied to dense prediction tasks because its output feature map is single-scale and generally of low resolution, obtained at a relatively high computational cost. PVT [71] overcomes these concerns by introducing a progressive shrinking pyramid backbone that reduces computational cost while producing finer-grained segmentation. PVT comes in two variants: PVT v1 [71] is the authors' first work, and PVT v2 [72] adds several improvements to the previous version.
4.5.1. PVT v1
This initial version has some noteworthy changes compared to the ViT. It takes 4 × 4 input patches, in contrast to the 16 × 16 patches in ViT, which improves the model's ability to learn high-resolution representations. It also reduces the computational demand of the traditional ViT by using a progressive shrinking pyramid: the pyramid structure progressively shrinks the output resolution from high to low across the stages that generate the scaled feature maps (Figure 7). Another major difference is that it replaces the multi-head attention (MHA) layer in ViT with a novel spatial reduction attention (SRA) layer, which reduces the spatial scale before the attention operation. This further reduces computational and memory demand, because SRA has a lower computational complexity than MHA.
Figure 7: PVT v1 architecture adapted from [71]. The pyramid structure of the stages
progressively shrinks the output resolution from high to low.
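The following is a hedged sketch of the spatial-reduction idea (illustrative PyTorch, not the exact SRA layer of [71]): the keys and values are spatially downsampled by a reduction ratio before standard multi-head attention, which shrinks the attention cost roughly by the square of that ratio.

import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Attention whose keys/values come from a spatially reduced feature map."""
    def __init__(self, dim=64, num_heads=1, sr_ratio=8):
        super().__init__()
        # Strided convolution reduces the number of key/value tokens by sr_ratio**2.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                       # x: (B, H*W, dim) token sequence
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)   # (B, H*W / sr_ratio**2, dim)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out

x = torch.randn(2, 56 * 56, 64)
out = SpatialReductionAttention()(x, H=56, W=56)      # (2, 3136, 64)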
4.5.2. PVT v2
The former version has a few drawbacks. Its computational demand is relatively large when processing high-resolution images; it loses the local continuity of the images by processing them as sequences of non-overlapping patches; and it cannot process variable-sized inputs because of its fixed-size position encoding. The new version introduces three major improvements that circumvent these design issues. The first is linear spatial reduction attention (LSRA), which reduces the spatial dimension of the image to a fixed size using average pooling (Figure 8); unlike SRA in PVT v1, LSRA enjoys linear complexity. The second is overlapping patch embedding (Figure 9a), obtained by zero-padding the border of the image and using enlarged patch windows that overlap with adjacent windows, which helps capture more local continuity. The third is a convolutional feed-forward network (Figure 9b), which allows inputs of different resolutions to be processed. With these improvements, PVT v2 brings the complexity of PVT v1 down to linear.
Figure 8: Comparison of spatial reduction attention (SRA) layers in PVT versions [72]
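A minimal sketch of overlapping patch embedding (the kernel, stride, and padding values below are illustrative choices, not necessarily those used in PVT v2): zero-padding plus a window larger than the stride makes neighboring patches share pixels, preserving local continuity.

import torch
import torch.nn as nn

# Overlapping patch embedding: a 7x7 window moved with stride 4 and zero padding 3,
# so neighbouring patches overlap instead of tiling the image disjointly.
overlap_embed = nn.Conv2d(in_channels=3, out_channels=64,
                          kernel_size=7, stride=4, padding=3)

x = torch.randn(1, 3, 224, 224)
tokens = overlap_embed(x).flatten(2).transpose(1, 2)   # (1, 56*56, 64) token sequence
print(tokens.shape)                                    # torch.Size([1, 3136, 64])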
4.6. Twins
Twins [73] proposes two modern Transformer designs for computer vision, named Twins-PCPVT and Twins-SVT, by revisiting the work on PVT v1 [71] and the Swin Transformer [4].

Figure 9: Improved patch embedding and feed-forward networks in PVT v2 [72].

Twins-SVT uses a spatially separable self-attention (SSSA) mechanism inspired by the depth-wise separable convolutions used in neural networks. SSSA combines two attention mechanisms, capable of capturing local and global information respectively: locally grouped self-attention (LSA) and global sub-sampled attention (GSA). These techniques greatly reduce the heavy computational demand of high-resolution image inputs while keeping the segmentation fine-grained.
Figure 10: Twins-PCPVT architecture adapted from [73]. It uses conditional position
encoding with a positional encoding generator (PEG) to overcome some of the drawbacks
of fixed-positional encoding.
encoding. This hinders the performance of PVT. To alleviate this challenge, Twins-PCPVT uses conditional position encoding (CPE), first introduced in the Conditional Position encoding Vision Transformer (CPVT) [70] and illustrated as the positional encoding generator (PEG) in Figure 10. It is capable of alleviating some of the issues encountered with fixed-position encoding.
Twins architectures have shown outstanding performance on computer
vision tasks including image classification and semantic segmentation. The
semantic segmentation results achieved by the two Twins architectures are
highly competitive compared to the Swin Transformer [4] and PVT [71].
Figure 11: DPT architecture adapted from [74]. (a) Non-overlapping image patches are fed
into the Transformer block. (b) Reassemble operation for assembling tokens into feature
maps. (c) Fusion blocks for combining feature maps.
In the DPT paper [74], the authors introduce several models based on the image embedding technique used. The DPT-Base and DPT-Large models use patch-based embedding, where the input image is separated into non-overlapping patches that are fed into the Transformer block together with a learnable position embedding locating the spatial position of each token (Figure 11a). DPT-Base has 12 Transformer layers, whereas DPT-Large has 24 layers with wider feature sizes. The third model, DPT-Hybrid, uses a convolutional ResNet-50 backbone as the feature extractor and feeds the resulting pixel-based feature maps as token inputs
to a 12-layer Transformer block. The Transformer blocks process the tokens through sequential multi-head self-attention (MSA) [1] blocks for global interaction between tokens. The tokens are then reassembled into image-like feature representations at various resolutions (Figure 11b). Finally, these representations are combined using residual convolutional units in the decoder and fused for the final dense prediction (Figure 11c).
The experimental results of the dense prediction transformer show improved accuracy across several benchmark dataset comparisons, with the best performance obtained when the training dataset is large. The comparisons cover depth estimation and semantic segmentation. For segmentation, the ADE20K dataset was used, and the DPT-Hybrid model outperformed all fully-convolutional models [74]. DPT is able to identify precise object boundaries with less distortion. The DPT model was also evaluated on the PASCAL-Context dataset after fine-tuning.
Figure 12: HRFormer architecture adapted from [75]. (a) Self-attention blocks. (b) FFN
with depth-wise convolutions.
In HRFormer [75], the feature maps are divided into non-overlapping windows and self-attention is performed on each window separately. This improves efficiency significantly compared to the overlapping local window mechanisms introduced in earlier studies [77]. The self-attention blocks (Figure 12a) are followed by an FFN with depth-wise convolutions (Figure 12b) to increase the receptive field size by exchanging information between local windows, which is vital for dense prediction. By incorporating a multi-resolution parallel transformer architecture with convolutional multi-scale fusion, the overall HRFormer architecture repeatedly exchanges information between different resolutions, producing a high-resolution output with both local and global context information.
Figure 13: Mask2Former architecture adapted from [78]. The model consists of a backbone
feature extractor, a pixel decoder, and a Transformer decoder.
The architecture of Mask2Former is similar in design to the earlier MaskFormer [79]. Its main components are the backbone feature extractor, the pixel decoder, and the Transformer decoder (Figure 13). The backbone can be either a CNN-based or a Transformer-based model. As the pixel decoder, the authors use the more advanced multi-scale deformable attention Transformer (MSDeformAttn) [6], in contrast to the feature pyramid network [80] used in MaskFormer [79]. Masked attention is used to enhance the effectiveness of the Transformer decoder.
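As a simplified, hedged illustration of the masked-attention idea (a sketch, not the exact Mask2Former implementation), each object query's cross-attention can be restricted to the foreground region of its current mask prediction by setting the attention logits of background locations to -inf:

import torch

def masked_attention(q, k, v, mask_pred):
    """Cross-attention restricted to predicted foreground regions.

    q:         (B, Q, D)  object queries
    k, v:      (B, N, D)  flattened image features (N = H*W)
    mask_pred: (B, Q, N)  per-query mask probabilities from the previous layer
    """
    scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5      # (B, Q, N)
    # Block attention to locations the current mask prediction deems background.
    attn_mask = mask_pred < 0.5
    # If a query's mask is entirely background, fall back to full attention.
    empty = attn_mask.all(dim=-1, keepdim=True)
    attn_mask = attn_mask & ~empty
    scores = scores.masked_fill(attn_mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

B, Q, N, D = 2, 100, 32 * 32, 256
out = masked_attention(torch.randn(B, Q, D), torch.randn(B, N, D),
                       torch.randn(B, N, D), torch.rand(B, Q, N))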
Despite being a universal architecture for segmentation, Mask2Former still needs to be trained separately for each specific task, a common limitation of universal segmentation architectures. Mask2Former has achieved new state-of-the-art performance on all three segmentation tasks (panoptic, instance, and semantic) on popular datasets such as COCO, ADE20K, and Cityscapes. Its semantic segmentation results on ADE20K and Cityscapes are compared in Table 2.
Table 2: Semantic segmentation results (mIoU %) of the ViT model variants on ADE20K, Cityscapes, and PASCAL-Context.

Model | Variant | Backbone | #Params (M) | ADE20K | Cityscapes | PASCAL-Context
SETR [5] | SETR-Naïve (16,160k)ρ | ViT-L‡ [2] | 305.67 | 48.06 / 48.80 | - | -
SETR [5] | SETR-PUP (16,160k) | ViT-L‡ | 318.31 | 48.58 / 50.09 | - | -
SETR [5] | SETR-MLA (16,160k) | ViT-L‡ | 310.57 | 48.64 / 50.28 | - | -
SETR [5] | SETR-PUP (16,40k) | ViT-L‡ | 318.31 | - | 78.39 / 81.57 | -
SETR [5] | SETR-PUP (16,80k) | ViT-L‡ | 318.31 | - | 79.34 / 82.15 | -
SETR [5] | SETR-Naïve (16,80k) | ViT-L‡ | 305.67 | - | - | 52.89 / 53.61
SETR [5] | SETR-PUP (16,80k) | ViT-L‡ | 318.31 | - | - | 54.40 / 55.27
SETR [5] | SETR-MLA (16,80k) | ViT-L‡ | 310.57 | - | - | 54.87 / 55.83
Swin [4]ℵ | Swin-T | - | 60 | 46.1 | - | -
Swin [4]ℵ | Swin-S | - | 81 | 49.3 | - | -
Swin [4]ℵ | Swin-B‡ | - | 121 | 51.6 | - | -
Swin [4]ℵ | Swin-L‡ | - | 234 | 53.5 | - | -
Segmenter [11]§ | Seg-B | DeiT-B† [81] | 86 | 48.05 | 80.5 | 53.9
Segmenter [11]§ | Seg-B/Mask | DeiT-B† | 86 | 50.08 | 80.6 | 55.0
Segmenter [11]§ | Seg-L | ViT-L‡ | 307 | 52.25 | 80.7 | 56.5
Segmenter [11]§ | Seg-L/Mask | ViT-L‡ | 307 | 53.63 | 81.3 | 59.0
SegFormer [69] | MiT-B0† | - | 3.4 | 37.4 / 38.0 | 76.2 / 78.1 | -
SegFormer [69] | MiT-B1† | - | 13.1 | 42.2 / 43.1 | 78.5 / 80.0 | -
SegFormer [69] | MiT-B2† | - | 24.2 | 46.5 / 47.5 | 81.0 / 82.2 | -
SegFormer [69] | MiT-B3† | - | 44.0 | 49.4 / 50.0 | 81.7 / 83.3 | -
SegFormer [69] | MiT-B4† | - | 60.8 | 50.3 / 51.1 | 82.3 / 83.9 | -
SegFormer [69] | MiT-B5† | - | 81.4 | 51.0 / 51.8 | 82.4 / 84.0 | -
PVT v1 [71]ℵ | PVT-Tiny‡ | - | 17.0 | 35.7 | - | -
PVT v1 [71]ℵ | PVT-Small‡ | - | 28.2 | 39.8 | - | -
PVT v1 [71]ℵ | PVT-Medium‡ | - | 48.0 | 41.6 | - | -
PVT v1 [71]ℵ | PVT-Large‡ | - | 65.1 | 42.1 | - | -
PVT v1 [71]ℵ | PVT-Large‡ * | - | 65.1 | 44.8 | - | -
PVT v2 [72]ℵ | PVT v2-B0‡ | - | 7.6 | 37.2 | - | -
PVT v2 [72]ℵ | PVT v2-B1‡ | - | 17.8 | 42.5 | - | -
PVT v2 [72]ℵ | PVT v2-B2‡ | - | 29.1 | 45.2 | - | -
PVT v2 [72]ℵ | PVT v2-B3‡ | - | 49.0 | 47.3 | - | -
PVT v2 [72]ℵ | PVT v2-B4‡ | - | 66.3 | 47.9 | - | -
PVT v2 [72]ℵ | PVT v2-B5‡ | - | 85.7 | 48.7 | - | -
Twins [73] | Twins-PCPVT-S† | - | 54.6 | 46.2 / 47.5 | - | -
Twins [73] | Twins-PCPVT-B† | - | 74.3 | 47.1 / 48.4 | - | -
Twins [73] | Twins-PCPVT-L† | - | 91.5 | 48.6 / 49.8 | - | -
Twins [73] | Twins-SVT-S† | - | 54.4 | 46.2 / 47.1 | - | -
Twins [73] | Twins-SVT-B† | - | 88.5 | 47.7 / 48.9 | - | -
Twins [73] | Twins-SVT-L† | - | 133 | 48.8 / 50.2 | - | -
DPT [74]§ | DPT-Hybrid | ViT-Hybrid‡ | 123 | 49.02 | - | 60.46
DPT [74]§ | DPT-Large | ViT-L‡ | 343 | 47.63 | - | -
HRFormer [75] | OCRNet (7,150k)ρ | HRFormer-S | 13.5 | 44.0 / 45.1 | - | -
HRFormer [75] | OCRNet (7,150k) | HRFormer-B | 50.3 | 46.3 / 47.6 | - | -
HRFormer [75] | OCRNet (7,80k) | HRFormer-S | 13.5 | - | 80.0 / 81.0 | -
HRFormer [75] | OCRNet (7,80k) | HRFormer-B | 50.3 | - | 81.4 / 82.0 | -
HRFormer [75] | OCRNet (15,80k) | HRFormer-B | 50.3 | - | 81.9 / 82.6 | 57.6 / 58.5
HRFormer [75] | OCRNet (7,60k) | HRFormer-B | 50.3 | - | - | 56.3 / 57.1
HRFormer [75] | OCRNet (7,60k) | HRFormer-S | 13.5 | - | - | 53.8 / 54.6
Mask2Former [78] | - | Swin-T | - | 47.7 / 49.6 | - | -
Mask2Former [78] | - | Swin-L‡ | 216 | 56.1 / 57.3 | - | -
Mask2Former [78] | - | Swin-L-FaPN‡ | - | 56.4 / 57.7 | - | -
Mask2Former [78] | - | Swin-L‡ | 216 | - | 83.3 / 84.3 | -
Mask2Former [78] | - | Swin-B‡ | - | - | 83.3 / 84.5 | -
5. Discussion
In this survey, we discussed how ViTs have become a powerful alternative to classical CNNs in various computer vision applications, their strengths and limitations, and how ViTs have contributed to semantic segmentation across domains such as remote sensing, medical imaging, and video processing. Although we included some of the CNN architectures widely used in these domains to provide a comparison between ViTs and CNNs, an in-depth discussion of CNN architectures is beyond the scope of this paper. We summarized statistics of the popular datasets used for semantic segmentation and the results of different ViT architectures for semantic segmentation, to give the reader a clear, high-level overview of the semantic segmentation landscape.
References
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[4] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin
transformer: Hierarchical vision transformer using shifted windows, in:
Proceedings of the IEEE/CVF International Conference on Computer
Vision, 2021, pp. 10012–10022.
[6] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: De-
formable transformers for end-to-end object detection, arXiv preprint
arXiv:2010.04159 (2020).
[10] Z. Xu, W. Zhang, T. Zhang, Z. Yang, J. Li, Efficient transformer for re-
mote sensing image segmentation, Remote Sensing 13 (18) (2021) 3585.
[12] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah,
Transformers in vision: A survey, ACM computing surveys (CSUR)
54 (10s) (2022) 1–41.
[22] M. Schmitt, J. Prexl, P. Ebel, L. Liebel, X. X. Zhu, Weakly super-
vised semantic segmentation of satellite images for land cover mapping–
challenges and opportunities, arXiv preprint arXiv:2002.08254 (2020).
[31] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for
biomedical image segmentation, in: International Conference on Medical
image computing and computer-assisted intervention, Springer, 2015,
pp. 234–241.
[32] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao,
J. Liu, Ce-net: Context encoder network for 2d medical image segmen-
tation, IEEE transactions on medical imaging 38 (10) (2019) 2281–2292.
[33] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-
W. Chen, J. Wu, Unet 3+: A full-scale connected unet for medical image
segmentation, in: ICASSP 2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp.
1055–1059.
[34] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille,
Y. Zhou, Transunet: Transformers make strong encoders for medical
image segmentation, arXiv preprint arXiv:2102.04306 (2021).
[35] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang,
Swin-unet: Unet-like pure transformer for medical image segmentation,
in: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October
23–27, 2022, Proceedings, Part III, Springer, 2023, pp. 205–218.
[36] A. Işın, C. Direkoğlu, M. Şah, Review of mri-based brain tumor image
segmentation using deep learning methods, Procedia Computer Science
102 (2016) 317–324.
[37] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti,
S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, et al., Skin
lesion analysis toward melanoma detection: A challenge at the 2017
international symposium on biomedical imaging (isbi), hosted by the
international skin imaging collaboration (isic), in: 2018 IEEE 15th in-
ternational symposium on biomedical imaging (ISBI 2018), IEEE, 2018,
pp. 168–172.
[38] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C.-W.
Fu, X. Han, P.-A. Heng, J. Hesser, et al., The liver tumor segmentation
benchmark (lits), arXiv preprint arXiv:1901.04056 (2019).
[39] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani,
J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al., The multi-
modal brain tumor image segmentation benchmark (brats), IEEE trans-
actions on medical imaging 34 (10) (2014) 1993–2024.
[40] D. Gorecky, M. Schmitt, M. Loskyll, D. Zühlke, Human-machine-interaction in the industry 4.0 era, in: 2014 12th IEEE international conference on industrial informatics (INDIN), IEEE, 2014, pp. 289–294.
[42] J. Janai, F. Güney, A. Behl, A. Geiger, et al., Computer vision for au-
tonomous vehicles: Problems, datasets and state of the art, Foundations
and Trends® in Computer Graphics and Vision 12 (1–3) (2020) 1–308.
[45] M. Ding, Z. Wang, B. Zhou, J. Shi, Z. Lu, P. Luo, Every frame counts:
Joint learning of video segmentation and optical flow, in: Proceedings
of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp.
10713–10720.
[50] X. Zhu, Y. Xiong, J. Dai, L. Yuan, Y. Wei, Deep feature flow for video
recognition, in: Proceedings of the IEEE conference on computer vision
and pattern recognition, 2017, pp. 2349–2358.
[51] B. Mahasseni, S. Todorovic, A. Fern, Budget-aware deep semantic video
segmentation, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 1029–1038.
[58] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense
object detection, in: Proceedings of the IEEE international conference
on computer vision, 2017, pp. 2980–2988.
[60] S. Jadon, A survey of loss functions for semantic segmentation, in: 2020
IEEE Conference on Computational Intelligence in Bioinformatics and
Computational Biology (CIBCB), IEEE, 2020, pp. 1–7.
[61] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Ur-
tasun, A. Yuille, The role of context for object detection and semantic
segmentation in the wild, in: Proceedings of the IEEE conference on
computer vision and pattern recognition, 2014, pp. 891–898.
[62] M. Everingham, J. Winn, The pascal visual object classes challenge 2012
(voc2012) development kit, Pattern Anal. Stat. Model. Comput. Learn.,
Tech. Rep 2007 (2012) 1–45.
[68] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image
recognition, in: Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[71] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
L. Shao, Pyramid vision transformer: A versatile backbone for dense
prediction without convolutions, in: Proceedings of the IEEE/CVF In-
ternational Conference on Computer Vision, 2021, pp. 568–578.
[72] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
L. Shao, Pvt v2: Improved baselines with pyramid vision transformer,
Computational Visual Media 8 (3) (2022) 415–424.
[77] H. Hu, Z. Zhang, Z. Xie, S. Lin, Local relation networks for image
recognition, in: Proceedings of the IEEE/CVF International Conference
on Computer Vision, 2019, pp. 3464–3473.
[81] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou,
Training data-efficient image transformers & distillation through atten-
tion, in: International Conference on Machine Learning, PMLR, 2021,
pp. 10347–10357.