Transformer Segmentation: A survey
Hans Thisanke(a), Chamli Deshan(a), Kavindu Chamith(a), Sachith Seneviratne(b,c), Rajith Vidanaarachchi(b,c), Damayanthi Herath(a,*)

(a) Department of Computer Engineering, University of Peradeniya, Peradeniya, 20400, Sri Lanka
(b) Melbourne School of Design, University of Melbourne, Parkville, VIC 3010, Australia
(c) Faculty of Engineering and IT, University of Melbourne, Parkville, VIC 3010, Australia
Abstract
Semantic segmentation has a broad range of applications in a variety of domains including land coverage analysis, autonomous driving, and medical image analysis. Convolutional neural networks (CNNs) and Vision Transformers (ViTs) provide the architectural backbones for semantic segmentation models. Even though ViTs have proven successful in image classification, they cannot be directly applied to dense prediction tasks such as image segmentation and object detection, since the plain ViT is not a general-purpose backbone because of its patch partitioning scheme. In this survey, we discuss several ViT architectures that can be used for semantic segmentation and how their evolution addressed this challenge. The success of ViTs has motivated the community to gradually replace traditional convolutional neural networks in various computer vision tasks. This survey reviews and compares the performance of ViT architectures designed for semantic segmentation on benchmark datasets. It is intended to help the community understand existing implementations of semantic segmentation and to discover more efficient ViT-based methodologies.
Keywords: vision transformer, semantic segmentation, review, survey, convolutional neural networks, self-supervised learning, deep learning
∗ Corresponding author
Email addresses: e16368@eng.pdn.ac.lk (Hans Thisanke), e16076@eng.pdn.ac.lk
(Chamli Deshan), e16057@eng.pdn.ac.lk (Kavindu Chamith),
sachith.seneviratne@unimelb.edu.au (Sachith Seneviratne),
rajith.vidanaarachchi@unimelb.edu.au (Rajith Vidanaarachchi),
damayanthiherath@eng.pdn.ac.lk (Damayanthi Herath)
results on benchmark datasets. Even though several surveys have been published [12, 13, 14], a comparison of segmentation models across several benchmark datasets to identify the best-performing model has not been carried out. In our survey, we cover a set of segmentation models and, for each, identify the best variant on each benchmark dataset. This is useful for identifying optimal parameters, such as patch size and iteration count, for each model variant. By reporting the mIoU (%) of these models over several semantic segmentation benchmark datasets, an overall evaluation and the highest-performing model variant for each dataset can be identified.
In Section 2 we discuss the applications of semantic segmentation, ViTs, their challenges, and loss functions. Section 3 describes benchmark datasets used in semantic segmentation. Section 4 describes existing work on semantic segmentation using ViTs and presents a quantitative analysis. Finally, Section 5 provides a discussion and Section 6 concludes the paper with future directions.
U-Net, a state-of-the-art FCN, and further improved architectures with higher accuracy and efficiency have been developed in [16, 17, 18].
One limitation identified in the FCN architecture is the low resolution of the final segmentation map, caused by passing the feature maps through several convolutional and pooling layers. Furthermore, the locality of FCN-based methods limits their ability to capture long-range dependencies in the feature maps. To address this, researchers have explored attention mechanisms to augment or replace these models, which led to trying out Transformer architectures, already successful in NLP, in the computer vision domain.
Self-attention-based architectures have become dominant in NLP by avoiding drawbacks such as vanishing gradients in sequence modeling and transduction tasks. Designed specifically for these tasks, Transformers with attention are able to model long-range dependencies in sequences. When training an NLP model, one of the most effective approaches is to pre-train on a large text corpus and then fine-tune on a small task-specific dataset, which was previously a challenging task for deep neural networks. Because Transformers have high computational efficiency and scalability, it became easier to train them on large datasets [19].
With the success of self-attention in enhancing input-output interaction in NLP, works have proposed combining convolutional architectures with self-attention, especially for object detection and semantic segmentation where such interaction is highly needed [20]. However, applying attention within convolutional architectures demands high computational power in practice, even though it is theoretically efficient [1].
For images, computing self-attention over pixels is quadratic in the image size, since each pixel attends to every other pixel, i.e., the cost is quadratic in the pixel count [2]. Therefore, [2] proposed dividing the image into a sequence of patches and treating them as tokens, as is done in NLP. Instead of pixel-wise attention, patch-wise attention is used in the architecture, which reduces the computational complexity compared to applying self-attention within convolutional architectures.
This architecture showed promising results, surpassing state-of-the-art convolution-based methods with an accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, and 94.55% on CIFAR-100 [2]. A notable characteristic of the ViT is that it needs more data for model training; experiments in [2] show that ViT performance improves as the dataset size increases.
Figure 1: Architecture of the Vision Transformer. The model splits an image into a
number of fixed-size patches and linearly embeds them with position embeddings (left).
Then the result is fed into a standard transformer encoder (right). Adapted from [2].
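To make the patch-token idea in Figure 1 concrete, the following is a minimal PyTorch sketch (dimensions are illustrative defaults, not values prescribed by [2]) of splitting an image into fixed-size patches and linearly embedding them, together with learnable position embeddings, to obtain the token sequence fed to a Transformer encoder.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing out patches and applying
        # a shared linear projection to each flattened patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learnable position embedding per patch token.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, D) sequence of patch tokens
        return x + self.pos_embed

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 196, 768])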
image analysis using computer vision and AI. The use of neural networks makes it possible to process large amounts of image data for object detection, semantic segmentation, and change detection tasks. Advances in the remote sensing domain have further improved satellite sensors, and the introduction of drone technology for aerial imagery has been vital for capturing finer details of the earth's surface. This has resulted in precise and accurate data for processing with AI techniques [26].
Remote sensing images of the earth's surface cover land areas that can be divided into different segmentation classes. Each pixel is assigned a class label while preserving the spatial resolution of the image. Many datasets containing such remote sensing images and their segmentation masks are available [25, 27, 28] for applications such as change detection, land cover segmentation, and classification. Common land cover classes addressed by this pixel-level classification include forests, crops, buildings, water resources, grasslands, and roads. Research has adapted ViT architectures, by adding layers and attention mechanisms efficiently, to process high-resolution remote sensing images for semantic segmentation with improved performance, for example the Efficient Transformer [10] and the Wide-Context Transformer [29].
Manual segmentation of these different environmental areas from complex satellite or aerial images is a difficult task that is time-consuming, error-prone, and requires expertise in the remote sensing domain.
Therefore, the images generated are bound by the limitations of the available technology and require intervention by medical personnel to examine them [30]. Segmenting these images across different biological domains thus requires experts in each field to work with these systems and spend a great deal of time examining the images. To overcome these difficulties, automatic feature extraction has been introduced through deep learning based techniques, which have proven valuable for medical imagery. With advances in segmentation, better-performing models built on medical images have been introduced by many researchers. One famous architecture is the U-Net [31], which was initially introduced for medical image analysis. Based on it, several improved versions have followed, using medical imaging datasets for heart, lesion, and liver segmentation [32, 33, 18]. This demonstrates how beneficial improvements in segmentation have been in the medical setting. In recent years, emerging ViT architectures have also been applied to the medical domain, with TransUNet [34] and Swin-Unet [35]. These are hybrid Transformer architectures that retain the advantages of the U-Net, and they achieved better accuracy in cardiac and multi-organ segmentation applications.
One limitation of medical imaging is the relatively small number of images available compared to natural image datasets (landscapes, people, animals, and automobiles) containing millions of images. The medical domain also spans several imaging modalities, and annotating medical images requires expertise in the corresponding medical field. Among the modalities, MRI and microscopy images are particularly difficult to annotate [36]. Such datasets typically contain fewer images than ultrasound, X-ray, and lesion datasets, which are obtained with existing scanning systems and are easier to annotate owing to less complex structures and finer boundaries. Limitations remain, however, because privacy restrictions and other medical policies make it difficult to obtain these images in large quantities. To mitigate this, several image segmentation challenges take place every year and provide publicly available, well-annotated medical image datasets. Most improvements made through research on semantic segmentation models have been based on these challenge datasets, and most of them are used as benchmark datasets for segmentation [37, 38, 39].
the video is considered as a set of uncorrelated fixed images [44]. The common challenge with this type of semantic segmentation is the computational complexity of scaling the spatial dimension of the video with the temporal frame rate. Discarding temporal features and focusing only on spatial frame-by-frame features is inadequate for video segmentation: since there is a continuous flow between the frames of a video, considering the temporal context is essential in video semantic segmentation, even though it is computationally expensive.
Research has been conducted to reduce this high computational cost on videos, with feature reuse and feature warping [45] proposed as solutions. Cityscapes [46] and CamVid [47] are among the largest video segmentation datasets available for the frame-by-frame approach to video segmentation [48]. Recent papers have proposed segmentation methods such as selective re-execution of feature extraction layers [49], optical flow-based feature warping [50], and LSTM-based, fixed-budget keyframe selection policies [51]. The key problem with these approaches is that they pay little attention to the temporal context of the video. Researchers have shown that, to satisfy both spatial and temporal contexts, using the optical flow of a video as temporal information to speed up uncertainty estimation is effective [52]. VisTR [53], TeViT [54], and SeqFormer [55] are some of the Transformer models used for video segmentation tasks.
task as described in Figure 2. A pretext task is a pre-designed task from which the network can learn features; the trained weights can then be transferred so that the network can solve downstream tasks. A downstream task is the specific target task; common downstream tasks in computer vision include semantic segmentation and object detection.
Figure 2: The general pipeline of self-supervised learning. The trained weights from solving
a pretext task are applied to solve some downstream tasks.
• Generative: train an encoder to encode the given input and a decoder to reconstruct the input from the encoding.
• Contrastive: train an encoder to encode inputs and measure the similarities between the encoded representations.
• Generative-Contrastive (Adversarial): train an encoder-decoder to generate fake outputs and compare the features of the input and the generated output [57].
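As a rough, hedged sketch of the pipeline in Figure 2 (illustrative PyTorch code, not tied to any specific method discussed here), an encoder can first be trained on a generative pretext task without labels and then reused, with a lightweight head, for downstream segmentation:

import torch
import torch.nn as nn

# Hypothetical encoder shared between the pretext and downstream tasks.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)

# Pretext task (generative example): reconstruct the input image, no labels needed.
decoder = nn.Sequential(
    nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 2, stride=2),
)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
images = torch.randn(8, 3, 64, 64)                 # stand-in for unlabeled data
loss = nn.functional.mse_loss(decoder(encoder(images)), images)
loss.backward()
opt.step()

# Downstream task: reuse the pre-trained encoder for semantic segmentation.
num_classes = 5
seg_head = nn.Conv2d(64, num_classes, kernel_size=1)   # lightweight segmentation head
logits = seg_head(encoder(images))                     # coarse per-class score maps
logits = nn.functional.interpolate(logits, size=images.shape[-2:], mode="bilinear")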
Equation 1 computes the average loss per pixel in an image, where p_i is the true probability of the i-th class and q_i is the predicted probability of the same class. This encourages the model to generate probability maps that closely resemble the actual segmentation masks while penalizing inaccurate predictions more heavily. By minimizing the cross-entropy loss during training, the model becomes better at precise image segmentation.
Although widely used, this loss can be biased by dataset imbalance, since the majority class dominates. To compensate when the dataset is skewed, a weighted cross-entropy loss was introduced in [31].
$$\mathrm{WCE}_{loss}(p, q) = -\sum_{i=1}^{n} p_i\, w_i \log(q_i) \qquad (2)$$
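As a small hedged sketch (PyTorch; the per-class weights w are assumed to be supplied by the user, for example from inverse class frequencies), the weighted cross-entropy of equation (2) can be computed per pixel as follows:

import torch

def weighted_cross_entropy(logits, target, class_weights):
    """Per-pixel weighted cross-entropy following Eq. (2).

    logits:        (B, C, H, W) raw class scores
    target:        (B, H, W)    integer ground-truth labels
    class_weights: (C,)         weight w_i for each class
    """
    log_q = torch.log_softmax(logits, dim=1)                      # predicted log-probabilities
    # The one-hot true distribution p_i selects the log-probability of the true class.
    true_log_q = log_q.gather(1, target.unsqueeze(1)).squeeze(1)  # (B, H, W)
    w = class_weights[target]                                     # per-pixel weight w_i
    return -(w * true_log_q).mean()

# A closely related built-in is torch.nn.functional.cross_entropy(logits, target,
# weight=class_weights), which normalizes by the summed weights rather than the pixel count.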
However, cross-entropy calculates the average per-pixel loss without considering adjacent pixels, which may lie on object boundaries.
As a further improvement over cross-entropy, the focal loss [58] was introduced by altering the structure of the cross-entropy loss. For samples that are already classified accurately, a scaling factor down-weights their contribution. This ensures that harder samples are emphasized, so that a high class imbalance does not bias the overall loss.
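A hedged sketch of the (binary) focal loss in this spirit, following the standard formulation of [58] with the usual focusing parameter gamma and balancing factor alpha:

import torch

def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights well-classified pixels by (1 - p_t)**gamma.

    logits: (B, 1, H, W) raw scores; target: (B, 1, H, W) binary labels in {0, 1}.
    """
    ce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, target.float(), reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)             # probability of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()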
$$D_{loss}(g, p) = 1 - \frac{2\sum_{i=1}^{n} g_i p_i}{\sum_{i=1}^{n} g_i + \sum_{i=1}^{n} p_i + \epsilon} \qquad (4)$$
Here, in equation (4), g and p denote the ground-truth and predicted segmentations. The sums are computed over the n pixels, and a small constant ε is added to avoid division by zero. The Dice coefficient measures the overlap between the two samples (ground truth and prediction) and yields a score between 0 and 1, where 1 means perfect overlap. Since this loss considers pixels in both global and local contexts, it typically gives higher accuracy than cross-entropy-based calculations.
Another related measure used both as a metric and as a loss is the IoU (Intersection over Union), also known as the Jaccard index. It is quite similar to the Dice measure and quantifies the overlap of positive instances between the two samples. As shown in equation 5, it differs from the Dice loss in that the correctly classified region is measured relative to the total pixels in either the ground truth or the predicted segments.
$$IoU_{loss}(g, p) = 1 - \frac{\sum_{i=1}^{n} g_i p_i}{\sum_{i=1}^{n} g_i + \sum_{i=1}^{n} p_i - \sum_{i=1}^{n} g_i p_i + \epsilon} \qquad (5)$$
For multi-class segmentation, the mean IoU (mIoU) is obtained by averaging the per-class IoU values. It is widely used for performance comparison and evaluation of dense prediction models [60].
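The following is a minimal sketch (PyTorch, with soft probability predictions for the losses and hard label maps for the metric) of the Dice loss of equation (4), the IoU loss of equation (5), and the mIoU metric:

import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss, Eq. (4): pred and target are flattened probabilities / labels."""
    inter = (pred * target).sum()
    return 1 - (2 * inter) / (pred.sum() + target.sum() + eps)

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU (Jaccard) loss, Eq. (5)."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return 1 - inter / (union + eps)

def mean_iou(pred_labels, target_labels, num_classes, eps=1e-6):
    """mIoU over hard label maps: average of the per-class IoU values."""
    ious = []
    for c in range(num_classes):
        p, t = pred_labels == c, target_labels == c
        inter = (p & t).sum().float()
        union = (p | t).sum().float()
        if union > 0:                      # ignore classes absent from both maps
            ious.append(inter / (union + eps))
    return torch.stack(ious).mean() if ious else torch.tensor(0.0)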
3. Datasets
In this section, we consider common datasets used for training and testing semantic segmentation models. Factors affecting the creation of real datasets include lighting conditions, weather, and season, and datasets can be grouped accordingly. Data collected under normal daytime environmental conditions are categorized as no cross-domain datasets. Data collected under deviating environmental conditions, such as rain, clouds, nighttime, or snow, are categorized as cross-domain datasets. A third category is synthetic data, which is artificially created and collected for training purposes, mostly as a cost-effective supplement. The following are some of the benchmark datasets made specifically for semantic segmentation tasks, with a summary presented in Table 1.
PASCAL-Context [61] This dataset was created by manually labeling every pixel of the PASCAL-VOC 2010 [62] dataset with semantic categories. Its domain is not restricted and it contains a wide range of objects. The semantic categories fall into three main classes: (i) objects, (ii) stuff, and (iii) hybrids. Objects are well-defined categories such as cups and keyboards. Stuff covers classes without a specific shape, such as sky and water regions. Hybrids are intermediate categories such as roads, which have a clear boundary but whose shape cannot be predicted reliably.
ADE20K [63] Annotations in this dataset cover scenes, objects, and object parts; many objects are annotated together with their parts. Annotation is carried out continuously, so the dataset keeps growing.
KITTI [64] This dataset contains both 2D and 3D images collected from urban and rural areas and expressway traffic scenarios, and it is useful for robotics and autonomous driving. It has different variants, namely KITTI-2012 and KITTI-2015, which differ in their ground truth.
Cityscapes [46] This dataset contains large-scale pixel-level and instance-level semantic segmentation annotations derived from a set of stereo video sequences. Compared to other datasets, it ranks highly in quality, data size, and annotations, and its data were collected from 50 different cities in Germany and neighboring countries.
IDD [65] This dataset is specially designed for road scene understanding, with data collected from 182 Indian road scenes. Because these scenes are taken from Indian roads, there are variations in weather and lighting conditions due to dust and air quality. A key feature of this dataset is that it contains special classes such as auto-rickshaws and animals on the road.
Virtual KITTI [66] Apart from differing weather and imaging conditions, virtual vision datasets such as Virtual KITTI closely resemble real vision datasets, which makes them useful for pre-training. This dataset was created from 5 different urban scene videos of the real-world KITTI dataset. The data are automatically labeled and can be used for object detection, semantic segmentation, instance segmentation, and related tasks.
IDDA [67] This dataset contains 1 million frames generated with the CARLA simulator, based on 7 different city models. It can be used for semantic segmentation across more than 100 different visual domains and is specially designed for autonomous driving models.
Table 1: Summary of the benchmark datasets.

Dataset | Classes | Size | Train | Validation | Test | Resolution (pixels) | Category
PASCAL-Context | 540 | 19740 | 4998 | 5105 | 9637 | 387 × 470 | No cross-domain
ADE20K | 150 | 25210 | 20210 | 2000 | 3000 | - | No cross-domain
KITTI | 5 | 252 | 140 | - | 112 | 1392 × 512 | No cross-domain
Cityscapes | 30 | 5K fine, 20K coarse | 2975 | 500 | 1525 | 1024 × 2048 | Cross-domain
IDD | 34 | 10004 | 7003 | 1000 | 2001 | 1678 × 968 | Cross-domain
Virtual KITTI | 14 | 21260 | - | - | - | 1242 × 375 | Synthetic
IDDA | 24 | 1M | - | - | - | 1920 × 1080 | Synthetic
4. Meta-analysis
In this section, we discuss some of the ViT models specialized for semantic segmentation. The models were selected based on the datasets on which they were benchmarked (ADE20K, Cityscapes, PASCAL-Context), so that all models can be compared on a common basis. The benchmark results are summarized in Table 2.
4.1. SEgmentation TRansformer (SETR)
SETR [5] formulates semantic segmentation as a sequence-to-sequence prediction task and adopts a pure Transformer as the encoder of the segmentation model, without any convolution layers. In this model, the prevalent stacked-convolution encoder, which gradually reduces spatial resolution, is replaced with a pure Transformer.
Figure 3: SETR architecture and its variants adapted from [5]. (a) SETR consists of a
standard Transformer. (b) SETR-PUP with a progressive up-sampling design. (c) SETR-
MLA with a multi-level feature aggregation.
4.2. Swin Transformer
The Swin Transformer [4] (Hierarchical Vision Transformer using Shifted Windows) can serve as a general-purpose backbone for computer vision tasks such as image classification and dense prediction.
Figure 4: An overview of the Swin Transformer adapted from [4]. (a) Hierarchical feature maps for reducing computational complexity. (b) Shifted window approach used when calculating self-attention. (c) Two successive Swin Transformer blocks applied at each stage. (d) Core architecture of the Swin Transformer.
convolutional architectures such as ResNet [68]. Therefore, the Swin Transformer can efficiently replace ResNet backbones in computer vision tasks.
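As a rough illustration of the windowed attention in Figure 4 (a sketch under assumed sizes, not the authors' implementation), a feature map can be partitioned into fixed-size windows within which self-attention is computed, and a cyclic shift produces the shifted-window configuration used in alternating blocks:

import torch

def window_partition(x, ws):
    """Split a feature map (B, H, W, C) into (num_windows*B, ws*ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(2, 56, 56, 96)                           # example feature map
windows = window_partition(x, ws=7)                      # attention is computed per window
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))    # shifted-window configuration
shifted_windows = window_partition(shifted, ws=7)
print(windows.shape)                                     # torch.Size([128, 49, 96])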
4.3. Segmenter
Segmenter [11] is a purely Transformer-based approach for semantic segmentation, consisting of a ViT backbone pre-trained on ImageNet and a mask transformer introduced as the decoder (Figure 5). Although the model is built for segmentation, it takes advantage of models designed for image classification for pre-training, and then fine-tunes them on moderate-sized segmentation datasets.
Figure 5: Segmenter architecture adapted from [11]. It consists of a ViT backbone with a mask transformer as the decoder.
4.4. SegFormer
SegFormer [69] is a semantic segmentation architecture consisting of a hierarchical Transformer encoder and a lightweight multilayer perceptron (MLP) decoder (Figure 6), where the MLP decoder predicts the final mask. To obtain precise segmentation, it uses a patch size of 4 × 4, in contrast to ViT's 16 × 16, and an overlapped patch merging process maintains local continuity around the patches.
Generally, ViT has a fixed resolution for positional encoding [70]. This
leads to a drop in accuracy since it needs to interpolate the positional en-
coding of testing images when they have a different resolution than training
images. Thus, SegFormer introduces a Positional-Encoding-Free design as a
key feature.
Moreover, the authors claim their architecture is more robust against common corruptions and perturbations than existing methods, which makes SegFormer appropriate for safety-critical applications. SegFormer achieved competitive results on the ADE20K, Cityscapes, and COCO-Stuff datasets, as shown in Table 2. It comes in several variants, from SegFormer-B0 to SegFormer-B5. The largest model, SegFormer-B5, surpasses SETR [5] on the ADE20K dataset, achieving the highest mIoU while being 4× faster. The SegFormer variants trade off model size, accuracy, and runtime.
4.5. Pyramid Vision Transformer (PVT)
ViT cannot be directly applied to dense prediction tasks because its output feature map is single-scale and generally of low resolution, obtained at a relatively high computational cost. PVT [71] overcomes these concerns by introducing a progressive shrinking pyramid backbone that reduces computational cost while producing finer-grained segmentation. PVT comes in two variants: PVT v1 [71] is the authors' first work, and PVT v2 [72] adds several improvements to the previous version.
4.5.1. PVT v1
This initial version has some noteworthy changes compared to the ViT. It takes 4 × 4 input patches, in contrast to the 16 × 16 patches in ViT, which improves the model's ability to learn high-resolution representations. It also reduces the computational demand of the traditional ViT by using a progressive shrinking pyramid: the pyramid structure progressively shrinks the output resolution from high to low across the stages that generate the scaled feature maps (Figure 7). Another major difference is that it replaces the multi-head attention (MHA) layer in ViT with a novel spatial reduction attention (SRA) layer, which reduces the spatial scale before the attention operation. This further reduces computational and memory demand, because SRA has a lower computational complexity than MHA.
Figure 7: PVT v1 architecture adapted from [71]. The pyramid structure of the stages
progressively shrinks the output resolution from high to low.
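The following is a hedged sketch of the spatial-reduction idea (illustrative PyTorch, not the exact SRA layer of [71]): the keys and values are spatially downsampled by a reduction ratio before standard multi-head attention, which shrinks the attention cost roughly by the square of that ratio.

import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Attention whose keys/values come from a spatially reduced feature map."""
    def __init__(self, dim=64, num_heads=1, sr_ratio=8):
        super().__init__()
        # Strided convolution reduces the number of key/value tokens by sr_ratio**2.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                       # x: (B, H*W, dim) token sequence
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)   # (B, H*W / sr_ratio**2, dim)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return out

x = torch.randn(2, 56 * 56, 64)
out = SpatialReductionAttention()(x, H=56, W=56)      # (2, 3136, 64)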
4.5.2. PVT v2
The former version has a few drawbacks. Its computational demand is relatively large when processing high-resolution images; it loses the local continuity of the images by processing them as sequences of non-overlapping patches; and it cannot process variable-sized inputs because of its fixed-size position encoding. The new version introduces three major improvements that circumvent these design issues. The first is linear spatial reduction attention (LSRA), which reduces the spatial dimension of the image to a fixed size using average pooling (Figure 8); unlike SRA in PVT v1, LSRA enjoys linear complexity. The second is overlapping patch embedding (Figure 9a), obtained by zero-padding the border of the image and using enlarged patch windows that overlap with adjacent windows, which helps capture more local continuity. The third is a convolutional feed-forward network (Figure 9b), which allows inputs of different resolutions to be processed. With these improvements, PVT v2 brings the complexity of PVT v1 down to linear.
Figure 8: Comparison of spatial reduction attention (SRA) layers in PVT versions [72]
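A minimal sketch of overlapping patch embedding (the kernel, stride, and padding values below are illustrative choices, not necessarily those used in PVT v2): zero-padding plus a window larger than the stride makes neighboring patches share pixels, preserving local continuity.

import torch
import torch.nn as nn

# Overlapping patch embedding: a 7x7 window moved with stride 4 and zero padding 3,
# so neighbouring patches overlap instead of tiling the image disjointly.
overlap_embed = nn.Conv2d(in_channels=3, out_channels=64,
                          kernel_size=7, stride=4, padding=3)

x = torch.randn(1, 3, 224, 224)
tokens = overlap_embed(x).flatten(2).transpose(1, 2)   # (1, 56*56, 64) token sequence
print(tokens.shape)                                    # torch.Size([1, 3136, 64])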
4.6. Twins
Twins [73] proposes two modern Transformer designs for computer vision, named Twins-PCPVT and Twins-SVT, by revisiting the work on PVT v1 [71] and the Swin Transformer [4].

Figure 9: Improved patch embedding and feed-forward networks in PVT v2 [72].

Twins-SVT uses a spatially separable self-attention (SSSA) mechanism inspired by the depth-wise separable convolutions used in neural networks. SSSA combines two attention mechanisms, capable of capturing local and global information respectively: locally grouped self-attention (LSA) and global sub-sampled attention (GSA). These techniques greatly reduce the heavy computational demand of high-resolution image inputs while keeping the segmentation fine-grained.
Figure 10: Twins-PCPVT architecture adapted from [73]. It uses conditional position
encoding with a positional encoding generator (PEG) to overcome some of the drawbacks
of fixed-positional encoding.
encoding. This hinders the performance of PVT. To alleviate this challenge, Twins-PCPVT uses conditional position encoding (CPE), first introduced in the Conditional Position encoding Vision Transformer (CPVT) [70] and illustrated as the positional encoding generator (PEG) in Figure 10. It is capable of alleviating some of the issues encountered with fixed-position encoding.
Twins architectures have shown outstanding performance on computer
vision tasks including image classification and semantic segmentation. The
semantic segmentation results achieved by the two Twins architectures are
highly competitive compared to the Swin Transformer [4] and PVT [71].
Figure 11: DPT architecture adapted from [74]. (a) Non-overlapping image patches are fed
into the Transformer block. (b) Reassemble operation for assembling tokens into feature
maps. (c) Fusion blocks for combining feature maps.
In the DPT paper [74], the authors introduce several models based on the image embedding technique used. The DPT-Base and DPT-Large models use patch-based embedding, where the input image is separated into non-overlapping patches that are fed into the Transformer block together with a learnable position embedding locating the spatial position of each token (Figure 11a). DPT-Base has 12 Transformer layers, whereas DPT-Large has 24 layers with wider feature sizes. The third model, DPT-Hybrid, uses a convolutional ResNet-50 backbone as the feature extractor and feeds the resulting pixel-based feature maps as token inputs
to a 12-layer Transformer block. The Transformer blocks process the tokens through sequential multi-head self-attention (MSA) [1] blocks for global interaction between tokens. The tokens are then reassembled into image-like feature representations at various resolutions (Figure 11b). Finally, these representations are combined using residual convolutional units in the decoder and fused for the final dense prediction (Figure 11c).
The experimental results of the dense prediction transformer show improved accuracy across several benchmark dataset comparisons, with the best performance obtained when the training dataset is large. The comparisons cover depth estimation and semantic segmentation. For segmentation, the ADE20K dataset was used, and the DPT-Hybrid model outperformed all fully-convolutional models [74]. DPT is able to identify precise object boundaries with less distortion. The DPT model was also evaluated on the PASCAL-Context dataset after fine-tuning.
Figure 12: HRFormer architecture adapted from [75]. (a) Self-attention blocks. (b) FFN
with depth-wise convolutions.
In HRFormer [75], the feature maps are divided into non-overlapping windows and self-attention is performed on each window separately. This improves efficiency significantly compared to the overlapping local window mechanisms introduced in earlier studies [77]. The self-attention blocks (Figure 12a) are followed by an FFN with depth-wise convolutions (Figure 12b) to increase the receptive field size by exchanging information between local windows, which is vital for dense prediction. By incorporating a multi-resolution parallel transformer architecture with convolutional multi-scale fusion, the overall HRFormer architecture repeatedly exchanges information between different resolutions, producing a high-resolution output with both local and global context information.
Figure 13: Mask2Former architecture adapted from [78]. The model consists of a backbone
feature extractor, a pixel decoder, and a Transformer decoder.
The architecture of Mask2Former is similar in design to the earlier MaskFormer [79]. Its main components are the backbone feature extractor, the pixel decoder, and the Transformer decoder (Figure 13). The backbone can be either a CNN-based or a Transformer-based model. As the pixel decoder, the authors use the more advanced multi-scale deformable attention Transformer (MSDeformAttn) [6], in contrast to the feature pyramid network [80] used in MaskFormer [79]. Masked attention is used to enhance the effectiveness of the Transformer decoder.
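As a simplified, hedged illustration of the masked-attention idea (a sketch, not the exact Mask2Former implementation), each object query's cross-attention can be restricted to the foreground region of its current mask prediction by setting the attention logits of background locations to -inf:

import torch

def masked_attention(q, k, v, mask_pred):
    """Cross-attention restricted to predicted foreground regions.

    q:         (B, Q, D)  object queries
    k, v:      (B, N, D)  flattened image features (N = H*W)
    mask_pred: (B, Q, N)  per-query mask probabilities from the previous layer
    """
    scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5      # (B, Q, N)
    # Block attention to locations the current mask prediction deems background.
    attn_mask = mask_pred < 0.5
    # If a query's mask is entirely background, fall back to full attention.
    empty = attn_mask.all(dim=-1, keepdim=True)
    attn_mask = attn_mask & ~empty
    scores = scores.masked_fill(attn_mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

B, Q, N, D = 2, 100, 32 * 32, 256
out = masked_attention(torch.randn(B, Q, D), torch.randn(B, N, D),
                       torch.randn(B, N, D), torch.rand(B, Q, N))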
Despite being a universal architecture for segmentation, Mask2Former still needs to be trained separately for each specific task, a common limitation of universal segmentation architectures. Mask2Former has achieved new state-of-the-art performance on all three segmentation tasks (panoptic, instance, and semantic) on popular datasets such as COCO, ADE20K, and Cityscapes. Its semantic segmentation results on ADE20K and Cityscapes are compared in Table 2.
Table 2: Semantic segmentation results (mIoU %) of the ViT model variants on ADE20K, Cityscapes, and PASCAL-Context.

Model | Variant | Backbone | #Params (M) | ADE20K | Cityscapes | PASCAL-Context
SETR [5] | SETR-Naïve (16,160k)ρ | ViT-L‡ [2] | 305.67 | 48.06 / 48.80 | - | -
SETR [5] | SETR-PUP (16,160k) | ViT-L‡ | 318.31 | 48.58 / 50.09 | - | -
SETR [5] | SETR-MLA (16,160k) | ViT-L‡ | 310.57 | 48.64 / 50.28 | - | -
SETR [5] | SETR-PUP (16,40k) | ViT-L‡ | 318.31 | - | 78.39 / 81.57 | -
SETR [5] | SETR-PUP (16,80k) | ViT-L‡ | 318.31 | - | 79.34 / 82.15 | -
SETR [5] | SETR-Naïve (16,80k) | ViT-L‡ | 305.67 | - | - | 52.89 / 53.61
SETR [5] | SETR-PUP (16,80k) | ViT-L‡ | 318.31 | - | - | 54.40 / 55.27
SETR [5] | SETR-MLA (16,80k) | ViT-L‡ | 310.57 | - | - | 54.87 / 55.83
Swin [4]ℵ | Swin-T | - | 60 | 46.1 | - | -
Swin [4]ℵ | Swin-S | - | 81 | 49.3 | - | -
Swin [4]ℵ | Swin-B‡ | - | 121 | 51.6 | - | -
Swin [4]ℵ | Swin-L‡ | - | 234 | 53.5 | - | -
Segmenter [11]§ | Seg-B | DeiT-B† [81] | 86 | 48.05 | 80.5 | 53.9
Segmenter [11]§ | Seg-B/Mask | DeiT-B† | 86 | 50.08 | 80.6 | 55.0
Segmenter [11]§ | Seg-L | ViT-L‡ | 307 | 52.25 | 80.7 | 56.5
Segmenter [11]§ | Seg-L/Mask | ViT-L‡ | 307 | 53.63 | 81.3 | 59.0
SegFormer [69] | MiT-B0† | - | 3.4 | 37.4 / 38.0 | 76.2 / 78.1 | -
SegFormer [69] | MiT-B1† | - | 13.1 | 42.2 / 43.1 | 78.5 / 80.0 | -
SegFormer [69] | MiT-B2† | - | 24.2 | 46.5 / 47.5 | 81.0 / 82.2 | -
SegFormer [69] | MiT-B3† | - | 44.0 | 49.4 / 50.0 | 81.7 / 83.3 | -
SegFormer [69] | MiT-B4† | - | 60.8 | 50.3 / 51.1 | 82.3 / 83.9 | -
SegFormer [69] | MiT-B5† | - | 81.4 | 51.0 / 51.8 | 82.4 / 84.0 | -
PVT v1 [71]ℵ | PVT-Tiny‡ | - | 17.0 | 35.7 | - | -
PVT v1 [71]ℵ | PVT-Small‡ | - | 28.2 | 39.8 | - | -
PVT v1 [71]ℵ | PVT-Medium‡ | - | 48.0 | 41.6 | - | -
PVT v1 [71]ℵ | PVT-Large‡ | - | 65.1 | 42.1 | - | -
PVT v1 [71]ℵ | PVT-Large‡ * | - | 65.1 | 44.8 | - | -
PVT v2 [72]ℵ | PVT v2-B0‡ | - | 7.6 | 37.2 | - | -
PVT v2 [72]ℵ | PVT v2-B1‡ | - | 17.8 | 42.5 | - | -
PVT v2 [72]ℵ | PVT v2-B2‡ | - | 29.1 | 45.2 | - | -
PVT v2 [72]ℵ | PVT v2-B3‡ | - | 49.0 | 47.3 | - | -
PVT v2 [72]ℵ | PVT v2-B4‡ | - | 66.3 | 47.9 | - | -
PVT v2 [72]ℵ | PVT v2-B5‡ | - | 85.7 | 48.7 | - | -
Twins [73] | Twins-PCPVT-S† | - | 54.6 | 46.2 / 47.5 | - | -
Twins [73] | Twins-PCPVT-B† | - | 74.3 | 47.1 / 48.4 | - | -
Twins [73] | Twins-PCPVT-L† | - | 91.5 | 48.6 / 49.8 | - | -
Twins [73] | Twins-SVT-S† | - | 54.4 | 46.2 / 47.1 | - | -
Twins [73] | Twins-SVT-B† | - | 88.5 | 47.7 / 48.9 | - | -
Twins [73] | Twins-SVT-L† | - | 133 | 48.8 / 50.2 | - | -
DPT [74]§ | DPT-Hybrid | ViT-Hybrid‡ | 123 | 49.02 | - | 60.46
DPT [74]§ | DPT-Large | ViT-L‡ | 343 | 47.63 | - | -
HRFormer [75] | OCRNet (7,150k)ρ | HRFormer-S | 13.5 | 44.0 / 45.1 | - | -
HRFormer [75] | OCRNet (7,150k) | HRFormer-B | 50.3 | 46.3 / 47.6 | - | -
HRFormer [75] | OCRNet (7,80k) | HRFormer-S | 13.5 | - | 80.0 / 81.0 | -
HRFormer [75] | OCRNet (7,80k) | HRFormer-B | 50.3 | - | 81.4 / 82.0 | -
HRFormer [75] | OCRNet (15,80k) | HRFormer-B | 50.3 | - | 81.9 / 82.6 | 57.6 / 58.5
HRFormer [75] | OCRNet (7,60k) | HRFormer-B | 50.3 | - | - | 56.3 / 57.1
HRFormer [75] | OCRNet (7,60k) | HRFormer-S | 13.5 | - | - | 53.8 / 54.6
Mask2Former [78] | - | Swin-T | - | 47.7 / 49.6 | - | -
Mask2Former [78] | - | Swin-L‡ | 216 | 56.1 / 57.3 | - | -
Mask2Former [78] | - | Swin-L-FaPN‡ | - | 56.4 / 57.7 | - | -
Mask2Former [78] | - | Swin-L‡ | 216 | - | 83.3 / 84.3 | -
Mask2Former [78] | - | Swin-B‡ | - | - | 83.3 / 84.5 | -
5. Discussion
In this survey, we discussed how ViTs have become a powerful alternative to classical CNNs in various computer vision applications, their strengths and limitations, and how ViTs have contributed to semantic segmentation across domains such as remote sensing, medical imaging, and video processing. Although we included some of the CNN architectures widely used in these domains to provide a comparison between ViTs and CNNs, an in-depth discussion of CNN architectures is beyond the scope of this paper. We summarized statistics of the popular datasets used for semantic segmentation and the results of different ViT architectures for semantic segmentation, to give the reader a clear, high-level overview of the semantic segmentation landscape.
References
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[4] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin
transformer: Hierarchical vision transformer using shifted windows, in:
Proceedings of the IEEE/CVF International Conference on Computer
Vision, 2021, pp. 10012–10022.
[6] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable detr: De-
formable transformers for end-to-end object detection, arXiv preprint
arXiv:2010.04159 (2020).
[10] Z. Xu, W. Zhang, T. Zhang, Z. Yang, J. Li, Efficient transformer for re-
mote sensing image segmentation, Remote Sensing 13 (18) (2021) 3585.
[12] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah,
Transformers in vision: A survey, ACM computing surveys (CSUR)
54 (10s) (2022) 1–41.
[22] M. Schmitt, J. Prexl, P. Ebel, L. Liebel, X. X. Zhu, Weakly super-
vised semantic segmentation of satellite images for land cover mapping–
challenges and opportunities, arXiv preprint arXiv:2002.08254 (2020).
[31] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for
biomedical image segmentation, in: International Conference on Medical
image computing and computer-assisted intervention, Springer, 2015,
pp. 234–241.
[32] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao,
J. Liu, Ce-net: Context encoder network for 2d medical image segmen-
tation, IEEE transactions on medical imaging 38 (10) (2019) 2281–2292.
[33] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-
W. Chen, J. Wu, Unet 3+: A full-scale connected unet for medical image
segmentation, in: ICASSP 2020-2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp.
1055–1059.
[34] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille,
Y. Zhou, Transunet: Transformers make strong encoders for medical
image segmentation, arXiv preprint arXiv:2102.04306 (2021).
[35] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang,
Swin-unet: Unet-like pure transformer for medical image segmentation,
in: Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October
23–27, 2022, Proceedings, Part III, Springer, 2023, pp. 205–218.
[36] A. Işın, C. Direkoğlu, M. Şah, Review of mri-based brain tumor image
segmentation using deep learning methods, Procedia Computer Science
102 (2016) 317–324.
[37] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti,
S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, et al., Skin
lesion analysis toward melanoma detection: A challenge at the 2017
international symposium on biomedical imaging (isbi), hosted by the
international skin imaging collaboration (isic), in: 2018 IEEE 15th in-
ternational symposium on biomedical imaging (ISBI 2018), IEEE, 2018,
pp. 168–172.
[38] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C.-W.
Fu, X. Han, P.-A. Heng, J. Hesser, et al., The liver tumor segmentation
benchmark (lits), arXiv preprint arXiv:1901.04056 (2019).
[39] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani,
J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest, et al., The multi-
modal brain tumor image segmentation benchmark (brats), IEEE trans-
actions on medical imaging 34 (10) (2014) 1993–2024.
[40] D. Gorecky, M. Schmitt, M. Loskyll, D. Zühlke, Human-machine-interaction in the industry 4.0 era, in: 2014 12th IEEE international conference on industrial informatics (INDIN), IEEE, 2014, pp. 289–294.
[42] J. Janai, F. Güney, A. Behl, A. Geiger, et al., Computer vision for au-
tonomous vehicles: Problems, datasets and state of the art, Foundations
and Trends® in Computer Graphics and Vision 12 (1–3) (2020) 1–308.
[45] M. Ding, Z. Wang, B. Zhou, J. Shi, Z. Lu, P. Luo, Every frame counts:
Joint learning of video segmentation and optical flow, in: Proceedings
of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp.
10713–10720.
[50] X. Zhu, Y. Xiong, J. Dai, L. Yuan, Y. Wei, Deep feature flow for video
recognition, in: Proceedings of the IEEE conference on computer vision
and pattern recognition, 2017, pp. 2349–2358.
[51] B. Mahasseni, S. Todorovic, A. Fern, Budget-aware deep semantic video
segmentation, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 1029–1038.
[58] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense
object detection, in: Proceedings of the IEEE international conference
on computer vision, 2017, pp. 2980–2988.
[60] S. Jadon, A survey of loss functions for semantic segmentation, in: 2020
IEEE Conference on Computational Intelligence in Bioinformatics and
Computational Biology (CIBCB), IEEE, 2020, pp. 1–7.
[61] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Ur-
tasun, A. Yuille, The role of context for object detection and semantic
segmentation in the wild, in: Proceedings of the IEEE conference on
computer vision and pattern recognition, 2014, pp. 891–898.
[62] M. Everingham, J. Winn, The pascal visual object classes challenge 2012
(voc2012) development kit, Pattern Anal. Stat. Model. Comput. Learn.,
Tech. Rep 2007 (2012) 1–45.
[68] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image
recognition, in: Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[71] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
L. Shao, Pyramid vision transformer: A versatile backbone for dense
prediction without convolutions, in: Proceedings of the IEEE/CVF In-
ternational Conference on Computer Vision, 2021, pp. 568–578.
[72] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo,
L. Shao, Pvt v2: Improved baselines with pyramid vision transformer,
Computational Visual Media 8 (3) (2022) 415–424.
[77] H. Hu, Z. Zhang, Z. Xie, S. Lin, Local relation networks for image
recognition, in: Proceedings of the IEEE/CVF International Conference
on Computer Vision, 2019, pp. 3464–3473.
[81] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou,
Training data-efficient image transformers & distillation through atten-
tion, in: International Conference on Machine Learning, PMLR, 2021,
pp. 10347–10357.