
ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

Ziqin Zhou1   Yinjie Lei2   Bowen Zhang1   Lingqiao Liu1*   Yifan Liu1
1 The University of Adelaide, Australia    2 Sichuan University, China
{ziqin.zhou,b.zhang,lingqiao.liu,yifan.liu04}@adelaide.edu.au, yinjie@scu.edu.cn
*Corresponding author

arXiv:2212.03588v3 [cs.CV] 20 Jun 2023

Abstract

Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level. Our investigation starts with a straightforward extension as our baseline that generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm could heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple-but-effective designs and find that they can significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP achieves a speedup of about 5 times during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.

Figure 1. Quantitative improvements (seen and unseen mIoU, %) achieved by our proposed designs on the VOC dataset. (1)(2) represent our one-stage Baseline model in its two versions (Fix or Fine-Tune the CLIP image encoder), while (3)-(5) show the effectiveness of applying our proposed designs, i.e., Deep Prompt Tuning (DPT), Non-mutually Exclusive Loss (NEL), and Relationship Descriptor (RD), on the baseline model step by step. We highlight that our designs can dramatically increase the segmentation performance on unseen classes.

1. Introduction

Semantic segmentation is one of the fundamental tasks in the computer vision field, which aims to predict the category of each pixel of an image [7, 15, 31, 41]. Extensive works have been proposed [8, 26, 30], e.g., Fully Convolutional Networks [29], U-Net [38], the DeepLab family [4–6] and, more recently, Vision Transformer based methods [16, 53, 58]. However, the success of deep semantic segmentation models heavily relies on the availability of a large amount of annotated training images, which involves a substantial amount of labor. This gives rise to a surging interest in low-supervision semantic segmentation approaches, including semi-supervised [7], weakly-supervised [48], few-shot [46], and zero-shot semantic segmentation [3, 34, 44].

Among them, zero-shot semantic segmentation is particularly challenging and attractive since it is required to directly produce the semantic segmentation results based on the semantic description of a given class. Recently, the pre-trained vision-language model CLIP [36] has been adopted in various dense prediction tasks, such as referring segmentation [43], semantic segmentation [33], and detection [14]. It also offers a new paradigm and has enabled a breakthrough for zero-shot semantic segmentation. Initially built for matching text and images, CLIP has demonstrated a remarkable capability for image-level zero-shot classification.

However, zsseg [49] and ZegFormer [12] follow the common strategy of two-stage processing: first generate region proposals and then feed the cropped regions to CLIP for zero-shot classification. Such a strategy requires two image encoding processes, one for generating proposals and one for encoding each proposal via CLIP.
Table 1. Differences between our approach and related zero-shot semantic segmentation methods based on CLIP.

Methods | Need an extra image encoder? | CLIP as an image-level classifier? | Can do inductive?
zsseg [49] | Yes | Yes | Yes
ZegFormer [12] | Yes | Yes | Yes
MaskCLIP+ [56] | Yes | No | No
ZegCLIP (Ours) | No | No | Yes

This design creates additional computational overhead and cannot leverage the knowledge of the CLIP encoder at the proposal generation stage. Besides, MaskCLIP+ [56] utilizes CLIP to generate pseudo labels of novel classes for self-training, but becomes invalid if the unseen class names at inference are unknown during training (the "inductive" zero-shot setting).

This paper pursues simplifying the pipeline by directly extending the zero-shot capability of CLIP from image level to pixel level. The basic idea of our method is straightforward: we use a lightweight decoder to match the text prompts against the local embeddings extracted from CLIP, which can be achieved via the self-attention mechanism in a transformer-based structure. We train the vanilla decoder and fix or fine-tune the CLIP image encoder on a dataset containing pixel-level annotations from a limited number of classes, expecting the text-patch matching capability to generalize to unseen classes. Unfortunately, this basic version tends to overfit the training set: while the segmentation results for seen classes generally improve, the model fails to produce reasonable segments on unseen classes. Surprisingly, we discover that such an overfitting issue can be dramatically alleviated by incorporating three modified design choices, and we report the quantitative improvements in Fig. 1. The following highlights our key discoveries:

Design 1: Using Deep Prompt Tuning (DPT) instead of fine-tuning or fixing the CLIP image encoder. We find that fine-tuning could lead to overfitting to seen classes, while prompt tuning tends to retain the inherent zero-shot capacity of CLIP.

Design 2: Applying a Non-mutually Exclusive Loss (NEL) function when performing pixel-level classification, generating the posterior probability of one class independently of the logits of the other classes.

Design 3: Most importantly, and our major innovation: introducing a Relationship Descriptor (RD) that incorporates the image-level prior into the text embedding before matching text and patch embeddings from CLIP, which can significantly prevent the model from overfitting to the seen classes.

By incorporating those three designs into our one-stage baseline, we create a simple-but-effective zero-shot semantic segmentation model named ZegCLIP. Tab. 1 summarizes the differences between our proposed method and existing approaches based on CLIP. More details can be found in the Appendix. We conduct extensive experiments on three public datasets and show that our method outperforms the state-of-the-art methods by a large margin in both the "inductive" and "transductive" settings.

2. Related Works

Pre-trained Vision Language Models [19, 22, 36, 40], which connect image representations with text embeddings, have achieved significant performance on various downstream tasks, such as image retrieval [28], dense prediction [37], visual referring expression [43], and visual question answering [21]. CLIP [36], one of the most popular vision-language models, is pre-trained via contrastive learning on 400 million text-image pairs and shows powerful zero-shot classification ability. In this work, we explore how to adapt CLIP's pre-trained vision-language knowledge from image-level to pixel-level prediction efficiently.

Semantic Segmentation, as a fundamental dense prediction task in computer vision, requires annotating each pixel of the input image. Previous work follows two principles: 1) directly treating semantic segmentation as a per-pixel classification task [29, 39, 45, 52, 55]; 2) decoupling mask generation and semantic classification [8, 9, 51]. Both approaches achieve significant progress with a pre-defined closed set of semantic classes.

Zero-shot Semantic Segmentation remains an important but challenging task due to the inevitable imbalance toward seen classes. The target model is required to segment unseen classes after training on seen classes with annotated labels. Previous works such as SPNet [44], ZS3 [3], CaGNet [17], SIGN [10], Joint [1] and STRICT [34] follow the strategy of improving the generalization ability of semantic mapping from seen classes to unseen classes.

As the popular pre-trained vision-language model CLIP shows powerful zero-shot classification ability, it has also been applied to zero-shot semantic segmentation recently. ZegFormer [12] and zsseg [49] develop an extra proposal generator, use CLIP to classify each region, and then ensemble the predictions. Although this retains CLIP's zero-shot ability at the image level, the computational cost inevitably increases because each proposal must be classified. MaskCLIP+ [56] creatively applies CLIP to generate pseudo annotations on novel classes ("transductive") for self-training. It has achieved competitive performance but becomes invalid in the "inductive" setting, where the names of unseen classes at inference are unavailable during training.

Fine-tuning and Prompt Tuning are different ways to update the parameters of a pre-trained model when adapting it to downstream tasks. Intuitively, fine-tuning [18, 42] achieves significant performance in fully supervised learning tasks but struggles in transfer learning. Recently, prompt tuning, which freezes the pre-trained model while introducing a set of learnable prompts [2], provides an alternative. It has achieved satisfactory performance in both NLP tasks [24, 27, 35] and CV tasks [47, 50, 57].
Figure 2. Overview of our proposed ZegCLIP. Our method modifies a One-Stage Baseline framework (Subsec. 3.2) that matches text and patch embeddings from CLIP to generate semantic masks. The key contribution of our work is three simple-but-effective designs (labeled as the red circles 1, 2, 3 in the figure): Deep Prompt Tuning (DPT, Subsec. 3.3), the Non-mutually Exclusive Loss (NEL, Subsec. 3.4), and the Relationship Descriptor (RD, Subsec. 3.5). Incorporating these three designs into our one-stage baseline framework upgrades the poorly performing baseline method to a strong zero-shot segmentation model. Training uses annotations of seen classes only; inference covers both seen and unseen classes.

Visual Prompt Tuning [20] proposes an effective solution that inserts trainable parameters into each layer of the transformer.

Self-training in Zero-shot Segmentation introduces another setting of zero-shot semantic segmentation called "transductive". Unlike the traditional "inductive" setting, where the novel class names and annotations are both unavailable in the training stage, [34] proposed that self-training via pseudo labels on unlabeled pixels helps solve the imbalance problem. In such a "transductive" situation, both the ground truth of seen classes and pseudo labels of unseen classes are utilized to supervise the target model for self-training [17, 49, 56], which can achieve better performance.

3. Method

3.1. Problem Definition

Our proposed method follows generalized zero-shot semantic segmentation (GZLSS) [44], which requires segmenting both seen classes C^s and unseen classes C^u after training only on a dataset with pixel annotations of the seen part. In the training stage, the model generates per-pixel classification results from the semantic description of all seen classes. In the testing stage, the model is expected to produce segmentation results for both known and novel classes. Note that C^s ∩ C^u = ∅ and the labels of C^u are unavailable during training. The key problem of zero-shot segmentation is that merely training on seen classes inevitably leads to a bias toward known categories at inference. This naturally corresponds to the "inductive" zero-shot segmentation setting, in which both unseen class names and images are not accessible during training. Besides "inductive" zero-shot segmentation, there is a "transductive" zero-shot learning setting, which assumes that the names of unseen classes are known before the testing stage. These works [17, 56] suppose that the training images include the unseen objects, and only the ground truth masks for these regions are unavailable. Our method can easily be extended to both settings and achieves excellent performance.

3.2. Baseline: One-stage Text-Patch Matching

As the large-scale pre-trained model CLIP shows impressive zero-shot classification ability, recent methods explore applying CLIP to zero-shot segmentation via a two-stage paradigm. In stage 1, they train a class-agnostic mask generator; in stage 2, they leverage CLIP as a zero-shot image-level classifier by matching the similarity between text embeddings and the [cls] token of each proposal. While effective, such a design requires two image encoding processes and brings expensive computational overhead.

To simplify the two-stage pipeline when adapting CLIP to zero-shot semantic segmentation, in this work we aim to address the critical problem of how to transfer CLIP's powerful generalization capability from image-level to pixel-level classification effectively. Motivated by the observation of recent work [56] that the text embedding can be implicitly matched to patch-level image embeddings, we build a one-stage baseline by adding a vanilla lightweight transformer as a decoder, inspired by [13, 51]. We then formulate semantic segmentation as a matching problem between a representative class query and the image patch features.
Formally, let us denote the C class embeddings as T = [t_1, t_2, ..., t_C] ∈ R^{C×d}, where d is the feature dimension of the CLIP model and t_i represents the i-th class, and the N patch tokens of an image as H = [h_1, h_2, ..., h_N] ∈ R^{N×d}, with h_j denoting the j-th patch.

We then apply linear projections φ to generate Q (query), K (key) and V (value) as:

\mathbf{Q}=\phi_{q}(\mathbf{T})\in\mathbb{R}^{C\times d},\quad \mathbf{K}=\phi_{k}(\mathbf{H})\in\mathbb{R}^{N\times d},\quad \mathbf{V}=\phi_{v}(\mathbf{H})\in\mathbb{R}^{N\times d}.     (1)

The semantic masks can be calculated by the scaled dot-product attention, which is an intermediate product of the multi-head attention (MHA) module:

\mathtt{Masks}=\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{k}}}\in\mathbb{R}^{C\times N},     (2)

where d_k is the dimension of the keys used as a scaling factor, and the final segmentation results are obtained by applying an Argmax operation over the class dimension of Masks. The detailed architecture of the decoder is shown on the right of Fig. 2 and consists of three transformer layers.
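To make the baseline concrete, here is a minimal PyTorch sketch of the text-patch matching in Eqs. (1)-(2). The module name is hypothetical, the value projection is omitted because the masks only need Q and K, and the single-head, single-layer form is a simplification of the three-layer decoder, so this is a sketch under those assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class TextPatchMatcher(nn.Module):
    """Sketch of the baseline decoder attention in Eqs. (1)-(2):
    class (text) embeddings act as queries, patch embeddings as keys,
    and the scaled dot product QK^T / sqrt(d_k) gives C x N mask logits."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.phi_q = nn.Linear(dim, dim)  # projects class embeddings T
        self.phi_k = nn.Linear(dim, dim)  # projects patch embeddings H
        self.scale = dim ** -0.5          # 1 / sqrt(d_k)

    def forward(self, T: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        Q = self.phi_q(T)               # (C, d)
        K = self.phi_k(H)               # (N, d)
        return Q @ K.t() * self.scale   # (C, N) semantic mask logits, Eq. (2)

# toy usage: 20 classes, 32x32 = 1024 patches, d = 512
masks = TextPatchMatcher(512)(torch.randn(20, 512), torch.randn(1024, 512))
print(masks.shape)  # torch.Size([20, 1024])
```

Reshaping the (C, N) logits back to the patch grid and taking the argmax over the class dimension then yields the per-pixel prediction described above.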
How to update the CLIP image encoder: The patch feature representation is generated by the CLIP image encoder. How to update the CLIP image encoder, i.e., how to calculate H, is an important factor. In our baseline approaches, we consider H obtained from a parameter-fixed CLIP or a parameter-tunable CLIP, denoted as Baseline-Fix and Baseline-FT, respectively. Later, in Subsec. 3.3, we discuss Deep Prompt Tuning (DPT), which turns out to be a better way to adapt CLIP for zero-shot segmentation.

How to train the segmentation model: To properly train the decoder and (optionally) the CLIP model, we apply the commonly used softmax operator to convert the logits calculated from Eq. 2 into posterior probabilities. An Exclusive Loss (EL) such as cross-entropy is then used as the objective function. Later, in Subsec. 3.4, we will point out that this seemingly straightforward strategy can be potentially harmful for generalization.

Design of query embedding T: The query embeddings T are the key to our approach. In our baseline model, we use the embeddings from the CLIP text encoder. However, as will be pointed out in Subsec. 3.5, such a choice might cause severe overfitting, and we propose a relationship descriptor that uses the relationship between the text and image tokens as class queries. In Subsec. 4.5-A, we further explore other choices of T. Please refer to the relevant sections for more details.

As shown in Fig. 1-(1)(2) and Tab. 5, our baseline model turns out to perform poorly in practice, especially for the unseen classes. It seems that the model overfits severely to the seen classes and forgets the zero-shot learning capability of CLIP during training. Fortunately, we identify three simple-but-effective designs that can dramatically alleviate this issue, as described in the subsequent sections.

Figure 3. The architecture of deep prompt tuning: learnable prompt tokens are inserted alongside the [cls] token and patch embeddings at each layer (Layer 1 to Layer 12) of the CLIP ViT-B/16 image encoder.

3.3. Design 1: Deep Prompt Tuning (DPT)

The first design modification is to use deep prompt tuning (DPT) [57] for the CLIP backbone rather than fine-tuning CLIP. As described in Sec. 2, prompt tuning [47] is a recently proposed scheme for adapting a pre-trained transformer model to a target domain. It has become a competitive alternative to fine-tuning in the transfer learning setting [20]. In this work, we explore deep prompt tuning to adapt CLIP for zero-shot image segmentation. Prompt tuning fixes the original parameters of CLIP and adds learnable prompt tokens as additional input to each layer. Since the zero-shot segmentation model is trained on a limited number of seen classes, directly fine-tuning the model tends to overfit the seen classes, as the model parameters are adjusted to optimize the loss only for the seen classes. Consequently, knowledge learned for vision concepts unseen in the training set might be discarded during fine-tuning. Prompt tuning can potentially alleviate this issue since the original parameters remain intact during training.

Formally, we denote the input embeddings to the l-th MHA module of the ViT-based image encoder in CLIP as {g^l, h^l_1, h^l_2, ..., h^l_N}, where g^l denotes the [CLS] token embedding and H^l = {h^l_1, h^l_2, ..., h^l_N} denotes the image patch embeddings. Deep prompt tuning appends learnable tokens P^l = {p^l_1, p^l_2, ..., p^l_M} to the above token sequence in each ViT layer of the CLIP image encoder. The l-th MHA module then processes the input tokens as:

[\mathbf{g}^{l},\,\_\,,\mathbf{H}^{l}]=\mathtt{Layer}^{l}([\mathbf{g}^{l-1},\mathbf{P}^{l-1},\mathbf{H}^{l-1}])     (3)

where the output embeddings corresponding to {p^l_1, ..., p^l_M} are discarded (denoted as _) and are not fed into the next layer. Therefore, {p^l_1, p^l_2, ..., p^l_M} merely act as a set of learnable parameters to adapt the MHA module.

As shown in Fig. 1-(3) and Tab. 5, compared with fine-tuning, deep prompt tuning of the CLIP image encoder achieves similar performance on seen classes but significantly improves the segmentation results on unseen classes.
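Below is a minimal sketch of the deep prompt tuning update in Eq. (3). The stand-in transformer blocks and the prompt initialization are assumptions for illustration; they are not CLIP's own modules.

```python
import torch
import torch.nn as nn

class DeepPromptedEncoder(nn.Module):
    """Sketch of deep prompt tuning (Eq. 3): frozen ViT blocks plus a
    per-layer set of M learnable prompt tokens whose outputs are discarded."""

    def __init__(self, layers: nn.ModuleList, num_prompts: int = 10, dim: int = 768):
        super().__init__()
        self.layers = layers
        for p in self.layers.parameters():       # keep the pre-trained weights intact
            p.requires_grad_(False)
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(num_prompts, dim) * 0.02) for _ in layers]
        )

    def forward(self, g: torch.Tensor, H: torch.Tensor):
        # g: (B, 1, d) [CLS] token, H: (B, N, d) patch tokens
        B, M = g.shape[0], self.prompts[0].shape[0]
        for layer, prompt in zip(self.layers, self.prompts):
            P = prompt.unsqueeze(0).expand(B, -1, -1)   # (B, M, d)
            x = layer(torch.cat([g, P, H], dim=1))      # [g, P, H] as in Eq. (3)
            g, H = x[:, :1], x[:, 1 + M:]               # prompt outputs are dropped
        return g, H

# toy usage with generic transformer blocks standing in for CLIP's ViT-B/16 layers
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(768, 12, batch_first=True) for _ in range(12)]
)
g, H = DeepPromptedEncoder(blocks)(torch.randn(2, 1, 768), torch.randn(2, 196, 768))
print(g.shape, H.shape)  # torch.Size([2, 1, 768]) torch.Size([2, 196, 768])
```

Only the prompts (and the external decoder) receive gradients, which is what lets the frozen CLIP weights retain their zero-shot knowledge.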
3.4. Design 2: Non-mutually Exclusive Loss (NEL)

The general practice of semantic segmentation models is to treat segmentation as a per-pixel multi-way classification problem: a Softmax operation is used to calculate the posterior probability, followed by a mutually Exclusive Loss (EL) such as cross-entropy as the loss function. However, Softmax essentially assumes a mutually exclusive relationship between the to-be-classified classes: a pixel has to belong to one of the classes of interest. Thus only the relative strength of the logits, i.e., the ratio of logits, matters for the posterior probability calculation. However, when applying the model to unseen classes, the class space will differ from the training scenario, making the logit of an unseen class poorly calibrated against the other unseen classes.

To handle this issue, we suggest avoiding the mutually exclusive mechanism at training time and using a Non-mutually Exclusive Loss (NEL), more specifically Sigmoid with Binary Cross Entropy (BCE), to ensure the segmentation result for each class is generated independently. In addition, we use the focal loss [25] variation of the BCE loss and combine it with an extra dice loss [32, 54], as in previous work [9]:

\mathcal{L}_{\mathtt{focal}}=-\frac{1}{\mathrm{hw}}\sum_{i=1}^{\mathrm{hw}}(1-y_{i})^{\gamma}\times\hat{y}_{i}\,\mathrm{log}(y_{i})+y_{i}^{\gamma}\times(1-\hat{y}_{i})\,\mathrm{log}(1-y_{i}),     (4)

\mathcal{L}_{\mathtt{dice}}=1-\frac{2\sum_{i=1}^{\mathrm{hw}}y_{i}\hat{y}_{i}}{\sum_{i=1}^{\mathrm{hw}}y_{i}^{2}+\sum_{i=1}^{\mathrm{hw}}\hat{y}_{i}^{2}},     (5)

\mathcal{L}=\alpha\cdot\mathcal{L}_{\mathtt{focal}}+\beta\cdot\mathcal{L}_{\mathtt{dice}},     (6)

where γ = 2 balances hard and easy samples and {α, β} are coefficients combining the focal loss and the dice loss. As shown in Fig. 1-(4) and Tab. 5, compared with CE, BCE performs better on unseen classes, and adding the focal loss and dice loss on top of BCE [9] boosts the performance further, as demonstrated in Subsec. 4.5-C.
3.5. Design 3: Relationship Descriptor (RD)

In the above design, the class embeddings extracted from the CLIP text encoder are matched against the patch embeddings from the CLIP image encoder in the decoder head. While quite intuitive, we find this design could lead to severe overfitting. We postulate that this is because the matching capability between the text query and image patterns is only trained on the seen-class data.

This motivates us to incorporate the matching capability learned from the original CLIP training into the transformer decoder. Specifically, we note that CLIP calculates the matching score between a text prompt and an image by

{\mathbf{t}^c}^{\top}\mathbf{g}=\sum_{j}^{d}t^{c}_{j}g_{j}=\sum_{j}^{d}r^{c}_{j},     (7)

where t^c ∈ R^d denotes the text embedding for the c-th class and t^c_j is its j-th dimension; g is the image embedding ([cls] token); and r^c_j = t^c_j g_j. We postulate that r^c = [r^c_1, r^c_2, ..., r^c_d] characterizes how the image and the text, i.e., the text prompt representing class c, are matched, and call it the text-image Relationship Descriptor (RD), denoted as R ∈ R^{C×d} for all C classes in this work. We then concatenate the RDs with the original text embeddings T ∈ R^{C×d} to obtain image-specific text queries T̂ = {t̂_1, t̂_2, ..., t̂_C} ∈ R^{C×2d} for the transformer decoder. Specifically, with this scheme, the input text query of the transformer decoder for each class becomes:

\mathbf{\hat{t}}=concat[\mathbf{r},\mathbf{t}]=concat[\mathbf{t}\odot\mathbf{g},\mathbf{t}],     (8)

where ⊙ is the Hadamard product. Note that we apply learnable linear projection layers to both T̂ and H so that they have the same feature dimension.

We compare the effectiveness of applying the relationship descriptor in Fig. 1-(5) as well as in Tab. 5. As seen, it can dramatically improve the segmentation results on both seen and unseen categories. In Subsec. 4.5-A, we further discuss the effect of using different combination formats of element-wise operations for text-image matching in the relationship descriptor and the text queries T̂.
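A minimal sketch of the relationship descriptor in Eqs. (7)-(8) follows; the linear layer at the end is an assumed stand-in for the learnable projection mentioned above.

```python
import torch
import torch.nn as nn

def relationship_descriptor(T: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Build image-specific text queries (Eq. 8): t_hat = concat[t * g, t].

    T: (C, d) class text embeddings from the CLIP text encoder.
    g: (d,)  [cls] token of the current image from the CLIP image encoder.
    Returns T_hat with shape (C, 2d).
    """
    R = T * g.unsqueeze(0)               # (C, d) Hadamard products t ⊙ g, Eq. (7)
    return torch.cat([R, T], dim=-1)     # (C, 2d) image-specific queries

# toy usage: project T_hat back to the decoder width d before matching patches
C, d = 20, 512
T_hat = relationship_descriptor(torch.randn(C, d), torch.randn(d))
proj = nn.Linear(2 * d, d)
print(proj(T_hat).shape)  # torch.Size([20, 512])
```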
decoder. Specifically, we noticed that CLIP calculates the sification accuracy (pAcc) and the mean of class-wise inter-
matching score between a text prompt and an image by section over union (mIoU) on both seen and unseen classes,
denoted as mIoU (S) and mIoU (U ), respectively. We also
{\mathbf {t}^c}^\top \mathbf {g} = \sum _j^d t^c_j g_j = \sum _j^d r^c_j, \vspace {-2mm} (7) evaluate the harmonic mean IoU (hIoU ) among seen and
unseen classes.
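For reference, a small sketch of these metrics under stated assumptions (per-class IoU accumulated from a confusion matrix, with hIoU taken as the harmonic mean of the seen and unseen mIoU):

```python
import numpy as np

def miou_and_hiou(conf: np.ndarray, seen: list, unseen: list):
    """conf: (K, K) confusion matrix, rows = ground truth, cols = prediction.
    Returns mIoU over seen classes, mIoU over unseen classes, and their harmonic mean."""
    inter = np.diag(conf).astype(float)
    union = conf.sum(0) + conf.sum(1) - np.diag(conf)
    iou = inter / np.maximum(union, 1)
    miou_s, miou_u = iou[seen].mean(), iou[unseen].mean()
    hiou = 2 * miou_s * miou_u / (miou_s + miou_u + 1e-12)
    return miou_s, miou_u, hiou

# toy usage: 20 classes, the last 5 treated as unseen
conf = np.random.randint(0, 100, size=(20, 20))
print(miou_and_hiou(conf, seen=list(range(15)), unseen=list(range(15, 20))))
```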
Table 2. Comparison with the state-of-the-art methods on PASCAL VOC 2012, COCO-Stuff 164K, and PASCAL Context datasets. “ST”
represents applying self-training via generating pseudo labels on all unlabeled pixels, while “*”+“ST” denotes that pseudo labels are
merely annotated on unseen pixels excluding the ignore part.
Methods | PASCAL VOC 2012: pAcc, mIoU(S), mIoU(U), hIoU | COCO-Stuff 164K: pAcc, mIoU(S), mIoU(U), hIoU | PASCAL Context: pAcc, mIoU(S), mIoU(U), hIoU
Inductive
SPNet [44] - 78.0 15.6 26.1 - 35.2 8.7 14.0 - - - -
ZS3 [3] - 77.3 17.7 28.7 - 34.7 9.5 15.0 52.8 20.8 12.7 15.8
CaGNet [17] 80.7 78.4 26.6 39.7 56.6 33.5 12.2 18.2 - 24.1 18.5 21.2
SIGN [10] - 75.4 28.9 41.7 - 32.3 15.5 20.9 - - - -
Joint [1] - 77.7 32.5 45.9 - - - - - 33.0 14.9 20.5
ZegFormer [12] - 86.4 63.6 73.3 - 36.6 33.2 34.8 - - - -
zsseg [49] 90.0 83.5 72.5 77.5 60.3 39.3 36.3 37.8 - - - -
ZegCLIP (Ours) 94.6 91.9 77.8 84.3 62.0 40.2 41.4 40.8 76.2 46.0 54.6 49.9
Transductive
SPNet+ST [44] - 77.8 25.8 38.8 - 34.6 26.9 30.3 - - - -
ZS5 [3] - 78.0 21.2 33.3 - 34.9 10.6 16.2 49.5 27.0 20.7 23.4
CaGNet+ST [17] 81.6 78.6 30.3 43.7 56.8 35.6 13.4 19.5 - - - -
STRICT [34] - 82.7 35.6 49.8 - 35.3 30.3 34.8 - - - -
zsseg+ST [49] 88.7 79.2 78.1 79.3 63.8 39.6 43.6 41.5 - - - -
ZegCLIP+ST (Ours) 95.1 91.8 82.2 86.7 68.8 40.6 54.8 46.6 77.2 46.6 65.4 54.4
*MaskCLIP+ [56] - 88.8 86.1 87.4 - 38.1 54.7 45.0 - 44.4 66.7 53.3
*ZegCLIP+ST (Ours) 96.2 92.3 89.9 91.1 69.2 40.7 59.9 48.5 77.3 46.8 68.5 55.6
Fully Supervised
ZegCLIP (Ours) 96.3 92.4 90.9 91.6 69.9 40.7 63.2 49.6 77.5 46.5 78.7 56.9

4.3. Implementation Details

Our proposed method is implemented based on the open-source toolbox MMSegmentation [11] with PyTorch 1.10.1. All experiments are based on the pre-trained CLIP ViT-B/16 model and conducted on 4 Tesla V100 GPUs, with a batch size of 16 and an image resolution of 512x512. For "inductive" zero-shot learning, the total number of training iterations is 20K for PASCAL VOC 2012, 40K for PASCAL Context, and 80K for COCO-Stuff 164K. In the "transductive" setting, we train our ZegCLIP model on seen classes for the first half of the training iterations and then apply self-training via generated pseudo labels for the remaining iterations. The optimizer is AdamW with the default training schedule of the MMSeg toolbox.
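As a rough illustration of the "transductive" self-training step mentioned above, the sketch below fills unlabeled pixels with confident unseen-class predictions; the confidence threshold and the exact assignment rule are assumptions made for illustration and are not taken from the paper.

```python
import torch

def pseudo_label(probs: torch.Tensor, gt: torch.Tensor, unseen: list,
                 ignore_index: int = 255, thresh: float = 0.9) -> torch.Tensor:
    """probs: (C, H, W) per-class probabilities for one image.
    gt: (H, W) ground truth holding seen-class ids, with ignore_index on
    unlabeled pixels. Unlabeled pixels whose best unseen-class probability
    exceeds the threshold receive that class id as a pseudo label."""
    labels = gt.clone()
    conf, idx = probs[unseen].max(dim=0)                    # best unseen class per pixel
    fill = (gt == ignore_index) & (conf > thresh)
    labels[fill] = torch.as_tensor(unseen, device=gt.device)[idx[fill]]
    return labels

# toy usage: 20 classes (15 seen + 5 unseen) on a 64x64 image
probs = torch.rand(20, 64, 64)                              # stand-in sigmoid outputs
gt = torch.full((64, 64), 255, dtype=torch.long)
gt[:32] = 3                                                 # some labeled seen pixels
print(pseudo_label(probs, gt, unseen=[15, 16, 17, 18, 19]).unique())
```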
Table 3. Efficiency comparison with different metrics. All models are evaluated on a single 1080Ti GPU. #Params represents the number of learnable parameters in the whole framework.

Datasets | Methods | #Params(M) ↓ | Flops(G) ↓ | FPS ↑
VOC | ZegFormer [12] | 60.3 | 1829.3 | 1.7
VOC | ZegCLIP | 13.8 | 110.4 | 9.0
COCO | ZegFormer [12] | 60.3 | 1875.1 | 1.5
COCO | ZegCLIP | 14.6 | 123.9 | 6.7

4.4. Comparison with State-of-the-art Methods

To demonstrate the effectiveness of our method, the evaluation results compared with previous state-of-the-art methods are reported in Tab. 2. We also provide fully supervised learning results as an upper bound to show the performance gap between fully supervised segmentation and zero-shot segmentation on unseen classes. Qualitative results on COCO-Stuff 164K are shown in Fig. 4, and more visualization results are provided in the Appendix.

From Tab. 2, we can see that our proposed method achieves significant performance under the "inductive" setting and outperforms previous works, especially on unseen classes. This clearly demonstrates the superior generalization capability of our method over the previous approaches. In addition, we also find our approach excels in the "transductive" setting, although our method is not specifically designed for it. As seen, our model dramatically improves the performance on unseen classes while maintaining excellent performance on the seen part after self-training.

Fig. 4 shows the segmentation results of the Baseline-Fix version and our proposed ZegCLIP on seen and unseen classes. After applying our designs, ZegCLIP shows impressive segmentation ability on both seen and unseen classes and can clearly distinguish similar unseen categories, for example, the unseen "tree", "grass" and "playingfield" categories in (1).

In addition, to demonstrate the efficiency of our proposed method, we compare the number of learnable parameters and the inference speed of our one-stage ZegCLIP against the typical two-stage method ZegFormer [12] in Tab. 3. Our proposed method achieves significant performance while requiring only approximately 14M learnable parameters and 117 GFLOPs on average, which is only 23% and 6% of [12], respectively. In terms of frames per second (FPS), our method achieves a speedup of about 5 times during inference compared with the two-stage method.
Figure 4. Qualitative results on COCO-Stuff 164K. (a) the original testing images; (b) the performance of our proposed one-stage baseline (fine-tuning the image encoder); (c) the visualization results of our proposed ZegCLIP; (d) the ground truths of each image. Note that the white and red tags represent seen and unseen classes, respectively.

4.5. Ablation Study

A. Effect of different formats of the text query t̂

In this work, we propose an important design called the Relationship Descriptor (RD), as described in Subsec. 3.5. In this module, we combine the queried text embeddings T with the image [cls] token g extracted from CLIP to generate the text-image description T̂ before feeding the queries into the segmentation decoder. Such image-specific queries can dramatically improve the zero-shot segmentation performance on unseen classes. To further explore the effect of different element-wise operations in the relationship descriptor, we evaluate various formats of T̂ on VOC and report the results in Tab. 4. The image-specific text query of each class t̂ is formulated as shown in the second column.

Table 4. Effect of different formats of the text query t̂.

dim | format of t̂ | pAcc | mIoU(S) | mIoU(U) | hIoU
512 | t | 86.8 | 89.5 | 33.7 | 49.0
512 | t⊙g | 93.1 | 90.2 | 68.4 | 77.8
512 | |t-g| | 92.4 | 90.6 | 64.2 | 75.1
512 | t-g | 88.7 | 87.9 | 46.5 | 60.8
512 | t+g | 82.2 | 89.9 | 13.9 | 24.1
512*2 | [t, g] | 88.9 | 88.8 | 39.3 | 54.5
512*2 | [t⊙g, t] | 94.6 | 91.9 | 77.8 | 84.3
512*2 | [|t-g|, t] | 90.9 | 91.5 | 54.2 | 68.1
512*2 | [t⊙g, t+g] | 88.3 | 90.0 | 38.0 | 53.4
512*2 | [t+g, t] | 82.8 | 89.4 | 20.7 | 33.6
512*2 | [t⊙g, |t-g|] | 94.1 | 91.2 | 73.9 | 81.6
512*3 | [t⊙g, |t-g|, t] | 93.4 | 91.6 | 67.3 | 77.6

We can see that the dot product and the absolute difference between the text embedding t and the [cls] token g provide more useful information, while the sum and plain concatenation operations perform poorly on both seen and unseen classes. This is understandable since the dot product and absolute difference characterize the relationship between the image and text encodings, which is in line with the interpretation made in Subsec. 3.5. The [t⊙g, t] row in Tab. 4 is the format of t̂ we finally chose for our ZegCLIP model.

B. Detailed results of applying the designs to the baseline

To demonstrate the effectiveness of our proposed designs, we further report the improvements from applying the designs to our Baseline model in Tab. 5. It shows that a fixed CLIP (Baseline-Fix) with a learnable segmentation decoder fails to achieve satisfactory performance on both seen and unseen classes due to weak image representations for the dense prediction task. Meanwhile, fine-tuning CLIP (Baseline-FT) performs better on seen classes but deteriorates dramatically on unseen classes. It seems that the zero-shot transfer ability of CLIP is destroyed after updating on seen classes. Changing the objective function from a mutually exclusive loss to the Non-mutually Exclusive Loss (NEL, Design 2) also leads to a significant performance boost in both seen and unseen scenarios. The most dramatic improvement comes from the Relationship Descriptor (RD, Design 3), which almost doubles the unseen performance in many cases. Finally, we notice that Deep Prompt Tuning (DPT, Design 1) works well with NEL (Design 2) and RD (Design 3). If we replace DPT with fine-tuning CLIP and combine it with NEL and RD, the issue of overfitting to seen classes remains.
Table 5. Quantitative results on the VOC and COCO datasets demonstrating the effectiveness of our three proposed designs.

method | PASCAL VOC 2012: pAcc, mIoU(S), mIoU(U), hIoU | COCO-Stuff 164K: pAcc, mIoU(S), mIoU(U), hIoU
Baseline-Fix 69.3 71.1 16.3 26.5 33.3 17.1 15.4 16.2
Baseline-Fix + NEL 85.5 85.2 36.6 51.2 52.4 31.7 20.8 25.1
Baseline-Fix + RD 86.0 82.5 46.6 59.6 41.0 23.3 23.4 23.3
Baseline-Fix + NEL + RD 89.6 83.3 66.4 73.9 53.7 32.3 32.5 32.4
Baseline-FT 77.3 76.5 13.8 23.4 48.4 32.4 17.5 22.7
Baseline-FT + NEL 83.8 84.1 27.5 41.4 56.5 39.9 25.4 31.0
Baseline-FT + RD 79.4 77.8 20.7 32.7 54.0 39.6 22.4 28.6
Baseline-FT + NEL + RD 89.6 90.2 42.4 57.7 60.2 42.7 22.3 29.3
Baseline-DPT 76.2 75.9 28.3 41.2 39.0 22.5 17.5 19.7
Baseline-DPT + NEL 89.2 89.9 40.4 55.7 58.5 38.0 27.4 31.8
Baseline-DPT + RD 85.5 81.0 55.2 65.7 46.4 28.4 27.8 28.1
Baseline-DPT + NEL + RD (ZegCLIP) 94.6 91.9 77.8 84.3 62.0 40.2 41.4 40.8
Figure 5. Detailed performance (mIoU, %) on the unseen classes of the COCO dataset for Baseline-Fix+NEL+RD, Baseline-FT+NEL+RD, Baseline-DPT+NEL+RD (ZegCLIP), and Baseline-DPT+NEL+RD (ZegCLIP) + ST. Note that "ST" represents self-training in the "transductive" setting.

In conclusion, our proposed method can ensure semantic segmentation performance while maintaining the powerful ability of CLIP, achieving excellent performance on seen and unseen classes simultaneously.

C. Effect of the advanced loss function

As described in Subsec. 3.4, we propose that applying a non-mutually exclusive objective, i.e., binary cross-entropy (BCE) with sigmoid, performs better for zero-shot semantic segmentation. To better handle the imbalance problem among categories, we further combine BCE with focal loss [25] and dice loss [23] following [9, 51], as in Eq. 6. The results of applying these two versions of the loss function, denoted as "plain" and "plus", are reported separately in Tab. 6. We can see that the "plus" loss function achieves better performance on both seen and unseen classes on the VOC, COCO, and Context datasets.

Table 6. Comparison of introducing the advanced loss function. Note that "plain" represents merely Binary Cross Entropy (BCE), while "plus" means adding focal loss and dice loss on top of BCE.

dataset | loss | pAcc | mIoU(S) | mIoU(U) | hIoU
VOC | plain | 93.4 | 89.7 | 73.6 | 80.9
VOC | plus | 94.6 | 91.9 | 77.8 | 84.3
COCO | plain | 59.8 | 38.8 | 39.0 | 38.9
COCO | plus | 62.0 | 40.2 | 41.4 | 40.8
Context | plain | 75.3 | 43.5 | 50.0 | 46.5
Context | plus | 76.2 | 46.0 | 54.6 | 49.9

D. Generalization ability to other datasets

To further explore the generalization ability of our proposed method, we conduct extra experiments in Tab. 7. We take the model trained on the source dataset via supervised learning on its seen classes and evaluate the segmentation results on both seen and unseen classes of the target datasets. Our method shows better cross-domain generalization capability than the latest related work ZegFormer, which is also based on the CLIP model.

Table 7. Generalization ability to other datasets.

source | target | method | pAcc | mIoU | mAcc
COCO | Context | ZegFormer [12] | 56.8 | 36.1 | 64.0
COCO | Context | ZegCLIP | 60.9 | 41.2 | 68.4
COCO | Context | *ZegCLIP+ST | 68.4 | 45.8 | 70.9
COCO | VOC | ZegFormer [12] | 92.8 | 85.6 | 92.7
COCO | VOC | ZegCLIP | 96.9 | 93.6 | 96.4
COCO | VOC | *ZegCLIP+ST | 97.2 | 94.1 | 96.7

5. Conclusion

In this work, we propose an efficient, straightforward one-stage zero-shot semantic segmentation method based on the pre-trained vision-language model CLIP. To transfer its image-level classification ability to dense prediction tasks while maintaining its zero-shot knowledge, we identify three designs that achieve competitive results on seen classes while substantially improving the performance on novel classes. Our proposed method relies on text embeddings as queries, which is flexible enough to cope with both the "inductive" and "transductive" zero-shot settings. To demonstrate the effectiveness of our method, we conduct extensive experiments on three public benchmark datasets and outperform previous state-of-the-art methods. Meanwhile, our one-stage framework is about 5 times faster than two-stage methods at inference. In general, our work explores how to leverage the pre-trained vision-language model CLIP for semantic segmentation and successfully utilizes its zero-shot knowledge in downstream tasks, which may provide inspiration for future research.

Acknowledgement. This work was done in the Adelaide Intelligence Research (AIR) Lab. Lingqiao Liu is supported by the Centre of Augmented Reasoning (CAR).
References of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 11583–11592, 2022. 1, 2,
[1] Donghyeon Baek, Youngmin Oh, and Bumsub Ham. Ex- 5, 6, 8
ploiting a joint embedding space for generalized zero-shot
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
semantic segmentation. In Proceedings of the IEEE/CVF In-
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
ternational Conference on Computer Vision (ICCV), pages
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
9536–9545, 2021. 2, 6
vain Gelly, et al. An image is worth 16x16 words: Trans-
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- formers for image recognition at scale. arXiv preprint
biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- arXiv:2010.11929, 2020. 3
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
[14] Sepideh Esmaeilpour, Bing Liu, Eric Robertson, and Lei
guage models are few-shot learners. Proceedings of the Ad-
Shu. Zero-shot out-of-distribution detection based on the
vances in Neural Information Processing Systems (NeurIPS),
pretrained model clip. In Proceedings of the AAAI confer-
33:1877–1901, 2020. 2
ence on artificial intelligence (AAAI), 2022. 1
[3] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick
[15] Mingyuan Fan, Shenqi Lai, Junshi Huang, Xiaoming Wei,
Pérez. Zero-shot semantic segmentation. Proceedings of
Zhenhua Chai, Junfeng Luo, and Xiaolin Wei. Rethinking
the Advances in Neural Information Processing Systems
bisenet for real-time semantic segmentation. In Proceedings
(NeurIPS), 32, 2019. 1, 2, 6
of the IEEE/CVF Conference on Computer Vision and Pat-
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, tern Recognition (CVPR), pages 9716–9725, 2021. 1
Kevin Murphy, and Alan L Yuille. Semantic image segmen-
[16] Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li,
tation with deep convolutional nets and fully connected crfs.
Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, and David Z
arXiv preprint arXiv:1412.7062, 2014. 1
Pan. Multi-scale high-resolution vision transformer for se-
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, mantic segmentation. In Proceedings of the IEEE/CVF
Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image Conference on Computer Vision and Pattern Recognition
segmentation with deep convolutional nets, atrous convolu- (CVPR), pages 12094–12103, 2022. 1
tion, and fully connected crfs. IEEE Transactions on Pattern
[17] Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, and
Analysis and Machine Intelligence (TPAMI), 40(4):834–848,
Liqing Zhang. Context-aware feature generation for zero-
2017. 1
shot semantic segmentation. In Proceedings of the ACM
[6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and International Conference on Multimedia (ACMMM), pages
Hartwig Adam. Rethinking atrous convolution for seman- 1921–1929, 2020. 2, 3, 5, 6
tic image segmentation. arXiv preprint arXiv:1706.05587,
[18] Stefan Hinterstoisser, Vincent Lepetit, Paul Wohlhart, and
2017. 1
Kurt Konolige. On pre-trained image features and synthetic
[7] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong images for deep learning. In Proceedings of the European
Wang. Semi-supervised semantic segmentation with cross Conference on Computer Vision (ECCV) Workshops, 2018.
pseudo supervision. In Proceedings of the IEEE/CVF 2
Conference on Computer Vision and Pattern Recognition
[19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh,
(CVPR), pages 2613–2622, 2021. 1
Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom
[8] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- Duerig. Scaling up visual and vision-language representa-
der Kirillov, and Rohit Girdhar. Masked-attention mask tion learning with noisy text supervision. In Proceedings of
transformer for universal image segmentation. In Proceed- the IEEE/CVF International Conference on Computer Vision
ings of the IEEE/CVF Conference on Computer Vision and (ICCV), pages 4904–4916, 2021. 2
Pattern Recognition (CVPR), pages 1290–1299, 2022. 1, 2
[20] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie,
[9] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Vi-
pixel classification is not all you need for semantic segmen- sual prompt tuning. arXiv preprint arXiv:2203.12119, 2022.
tation. Proceedings of the Advances in Neural Information 3, 4
Processing Systems (NeurIPS), 34:17864–17875, 2021. 2, 5,
[21] Jingjing Jiang, Ziyi Liu, and Nanning Zheng. Finetuning
8
pretrained vision-language models with correlation informa-
[10] Jiaxin Cheng, Soumyaroop Nandi, Prem Natarajan, and tion bottleneck for robust visual question answering. arXiv
Wael Abd-Almageed. SIGN: Spatial-information incorpo- preprint arXiv:2209.06954, 2022. 2
rated generative network for generalized zero-shot seman-
[22] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh,
tic segmentation. In Proceedings of the IEEE/CVF Interna-
and Kai-Wei Chang. Visualbert: A simple and perfor-
tional Conference on Computer Vision (ICCV), pages 9556–
mant baseline for vision and language. arXiv preprint
9566, 2021. 2, 6
arXiv:1908.03557, 2019. 2
[11] MMSegmentation Contributors. MMSegmentation:
[23] Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei
Openmmlab semantic segmentation toolbox and
Wu, and Jiwei Li. Dice loss for data-imbalanced nlp tasks.
benchmark. https : / / github . com / open -
arXiv preprint arXiv:1911.02855, 2019. 8
mmlab/mmsegmentation, 2020. 6
[24] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimiz-
[12] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De-
ing continuous prompts for generation. arXiv preprint
coupling zero-shot semantic segmentation. In Proceedings
arXiv:2101.00190, 2021. 3
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, vision. In Proceedings of the International Conference on
and Piotr Dollár. Focal loss for dense object detection. In Machine Learning (ICML), pages 8748–8763, 2021. 1, 2
Proceedings of the IEEE/CVF International Conference on [37] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong
Computer Vision (ICCV), pages 2980–2988, 2017. 5, 8 Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu.
[26] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Denseclip: Language-guided dense prediction with context-
Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto- aware prompting. In Proceedings of the IEEE/CVF Confer-
deeplab: Hierarchical neural architecture search for seman- ence on Computer Vision and Pattern Recognition (CVPR),
tic image segmentation. In Proceedings of the IEEE/CVF pages 18082–18091, 2022. 2
Conference on Computer Vision and Pattern Recognition [38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net:
(CVPR), pages 82–92, 2019. 1 Convolutional networks for biomedical image segmentation.
[27] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao In International Conference on Medical Image Computing
Du, Zhilin Yang, and Jie Tang. P-tuning: Prompt tuning and Computer Assisted Intervention (MICCAI), pages 234–
can be comparable to fine-tuning across scales and tasks. In 241, 2015. 1
Proceedings of the Association for Computational Linguis- [39] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia
tics (ACL), pages 61–68, 2022. 3 Schmid. Segmenter: Transformer for semantic segmenta-
[28] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and tion. In Proceedings of the IEEE/CVF Conference on Com-
Stephen Gould. Image retrieval on real-life images with puter Vision and Pattern Recognition (CVPR), pages 7262–
pre-trained vision-and-language models. In Proceedings of 7272, 2021. 2
the IEEE/CVF International Conference on Computer Vision [40] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu
(ICCV), pages 2125–2134, 2021. 2 Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-
[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully linguistic representations. arXiv preprint arXiv:1908.08530,
convolutional networks for semantic segmentation. In Pro- 2019. 2
ceedings of the IEEE/CVF Conference on Computer Vision [41] Wouter Van Gansbeke, Simon Vandenhende, Stamatios
and Pattern Recognition (CVPR), pages 3431–3440, 2015. Georgoulis, and Luc Van Gool. Unsupervised semantic seg-
1, 2 mentation by contrasting object mask proposals. In Proceed-
[30] Xiankai Lu, Wenguan Wang, Martin Danelljan, Tianfei ings of the IEEE/CVF International Conference on Com-
Zhou, Jianbing Shen, and Luc Van Gool. Video object seg- puter Vision (ICCV), pages 10052–10062, 2021. 1
mentation with episodic graph memory networks. In Pro- [42] Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Yun Xiong,
ceedings of the IEEE conference on European Conference on Wei Dong, and Xiangke Liao. Bridging pre-trained models
Computer Vision (ECCV), pages 661–679. Springer, 2020. 1 and downstream tasks for source code understanding. In Pro-
[31] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and ceedings of the International Conference on Software Engi-
Andrea Vedaldi. Deep spectral methods: A surprisingly neering (ICSE), pages 287–298, 2022. 2
strong baseline for unsupervised semantic segmentation and [43] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong
localization. In Proceedings of the IEEE/CVF Conference Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-
on Computer Vision and Pattern Recognition (CVPR), pages driven referring image segmentation. In Proceedings of
8364–8375, 2022. 1 the IEEE/CVF Conference on Computer Vision and Pattern
[32] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. Recognition (CVPR), pages 11686–11695, 2022. 1, 2
V-net: Fully convolutional neural networks for volumetric [44] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt
medical image segmentation. In Proceedings of the Inter- Schiele, and Zeynep Akata. Semantic projection network
national Conference on 3D Vision (3DV), pages 565–571. for zero-and few-label semantic segmentation. In Proceed-
IEEE, 2016. 5 ings of the IEEE/CVF Conference on Computer Vision and
[33] Daniil Pakhomov, Sanchit Hira, Narayani Wagle, Kemar E Pattern Recognition (CVPR), pages 8256–8265, 2019. 1, 2,
Green, and Nassir Navab. Segmentation in style: Unsuper- 3, 6
vised semantic image segmentation with stylegan and clip. [45] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
arXiv preprint arXiv:2107.12518, 2021. 1 Jose M Alvarez, and Ping Luo. Segformer: Simple and ef-
[34] Giuseppe Pastore, Fabio Cermelli, Yongqin Xian, Massimil- ficient design for semantic segmentation with transformers.
iano Mancini, Zeynep Akata, and Barbara Caputo. A closer Proceedings of the Advances in Neural Information Process-
look at self-training for zero-label semantic segmentation. ing Systems (NeurIPS), 34:12077–12090, 2021. 2
In Proceedings of the IEEE/CVF Conference on Computer [46] Guo-Sen Xie, Jie Liu, Huan Xiong, and Ling Shao. Scale-
Vision and Pattern Recognition (CVPR), pages 2693–2702, aware graph neural network for few-shot semantic segmenta-
2021. 1, 2, 3, 6 tion. In Proceedings of the IEEE/CVF Conference on Com-
[35] Guanghui Qin and Jason Eisner. Learning how to ask: puter Vision and Pattern Recognition (CVPR), pages 5475–
Querying lms with mixtures of soft prompts. arXiv preprint 5484, 2021. 1
arXiv:2104.06599, 2021. 3 [47] Yinghui Xing, Qirui Wu, De Cheng, Shizhou Zhang, Guo-
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya qiang Liang, and Yanning Zhang. Class-aware visual prompt
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, tuning for vision-language pre-trained model. arXiv preprint
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- arXiv:2208.08340, 2022. 3, 4
ing transferable visual models from natural language super-
[48] Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid on Computer Vision and Pattern Recognition (CVPR), pages
Boussaid, Ferdous Sohel, and Dan Xu. Leveraging auxil- 12083–12093, 2022. 1
iary tasks with affinity learning for weakly supervised se- [54] Rongjian Zhao, Buyue Qian, Xianli Zhang, Yang Li, Rong
mantic segmentation. In Proceedings of the IEEE/CVF In- Wei, Yang Liu, and Yinggang Pan. Rethinking dice loss for
ternational Conference on Computer Vision (ICCV), pages medical image segmentation. In Proceedings of the IEEE In-
6984–6993, 2021. 1 ternational Conference on Data Mining (ICDM), pages 851–
[49] Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue 860. IEEE, 2020. 5
Cao, Han Hu, and Xiang Bai. A simple baseline for zero- [55] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu,
shot semantic segmentation with pre-trained vision-language Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao
model. arXiv preprint arXiv:2112.14757, 2021. 1, 2, 3, 5, 6 Xiang, Philip HS Torr, et al. Rethinking semantic segmen-
[50] Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat- tation from a sequence-to-sequence perspective with trans-
Seng Chua, and Maosong Sun. CPT: Colorful prompt tun- formers. In Proceedings of the IEEE/CVF Conference on
ing for pre-trained vision-language models. arXiv preprint Computer Vision and Pattern Recognition (CVPR), pages
arXiv:2109.11797, 2021. 3 6881–6890, 2021. 2
[51] Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xi- [56] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free
aolin Wei, Chunhua Shen, and Yifan Liu. Segvit: Semantic dense labels from clip. In Proceedings of the IEEE confer-
segmentation with plain vision transformers. arXiv preprint ence on European Conference on Computer Vision (ECCV),
arXiv:2210.05844, 2022. 2, 3, 8 2022. 2, 3, 5, 6
[52] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, [57] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei
Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Con- Liu. Learning to prompt for vision-language models. Inter-
text encoding for semantic segmentation. In Proceedings of national Journal of Computer Vision (IJCV), 130(9):2337–
the IEEE/CVF Conference on Computer Vision and Pattern 2348, 2022. 3, 4
Recognition (CVPR), pages 7151–7160, 2018. 2 [58] Fangrui Zhu, Yi Zhu, Li Zhang, Chongruo Wu, Yanwei Fu,
[53] Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, and Mu Li. A unified efficient pyramid transformer for se-
Xinggang Wang, Wenyu Liu, Gang Yu, and Chunhua Shen. mantic segmentation. In Proceedings of the IEEE/CVF In-
Topformer: Token pyramid transformer for mobile semantic ternational Conference on Computer Vision (ICCV), pages
segmentation. In Proceedings of the IEEE/CVF Conference 2667–2677, 2021. 1
Appendix

A. Effect of the number of deep prompt tokens

We find that the best number of deep prompt tokens varies across datasets; detailed results are shown in Fig. 6. For the PASCAL VOC 2012 dataset (VOC), which contains fewer training samples and categories, 10 tokens are enough to obtain significant performance on both seen and unseen classes. However, for large-scale datasets, more deep prompts, i.e., 100 for COCO-Stuff 164K (COCO) and 35 for PASCAL Context (Context), are beneficial for better segmentation performance. In general, the best number of deep prompt tokens increases with the scale of the dataset and the complexity of the per-pixel classification task. Meanwhile, using too many visual prompts may instead be detrimental to our model.

Figure 6. The quantitative results (mIoU, %) of applying different numbers of deep prompt tokens (5, 10, 20, 35, 50, 100, 200) on each dataset. Note that "S" and "U" represent seen and unseen classes, respectively.

B. Effect of the depth of deep prompt tokens

Besides the number of deep prompt tokens, we also conduct extensive experiments on PASCAL VOC 2012 (VOC) to explore the effect of inserting the learnable prompts into different layers of the CLIP image encoder. The quantitative results are reported in Tab. 8. For a clearer explanation, we number the 12 vision transformer layers in the CLIP image encoder from 1 ("bottom") to 12 ("top"); the layers in which prompts are inserted are denoted in the first column of Tab. 8. We find that adding prompt tokens to "bottom" layers generally tends to perform better than adding them to "top" layers. Meanwhile, inserting learnable prompt tokens in every ViT layer (layers 1→12) achieves the best performance, which is also the default setting in our experiments.

Table 8. Effect of the depth of deep prompt tuning on VOC.

layer | pAcc | mIoU(S) | mIoU(U) | hIoU
1 | 91.4 | 87.5 | 67.8 | 76.4
1→3 | 91.7 | 86.7 | 70.2 | 77.6
1→6 | 92.7 | 87.8 | 75.3 | 81.1
1→9 | 93.3 | 88.9 | 72.4 | 79.8
1→12 | 94.6 | 91.9 | 77.8 | 84.3
10→12 | 92.5 | 88.3 | 70.9 | 78.6
7→12 | 92.5 | 89.0 | 68.0 | 77.1
4→12 | 93.6 | 91.5 | 66.9 | 77.3

C. Effect of single and multiple text templates

Following the training details of CLIP, we apply a single template "A photo of a {}" on PASCAL VOC 2012 (VOC) and multiple templates on the large-scale datasets, i.e., COCO-Stuff 164K (COCO) and PASCAL Context (Context), when obtaining the class embeddings from the CLIP text encoder. We provide the quantitative results of using single and multiple templates in Tab. 9, where we can see that multiple descriptions achieve reasonable improvements on both datasets.

Table 9. Comparison of using single and multiple templates on the COCO-Stuff 164K and PASCAL Context datasets.

dataset | template | pAcc | mIoU(S) | mIoU(U) | hIoU
COCO | single | 61.4 | 39.5 | 40.6 | 40.0
COCO | multiple | 62.0 | 40.2 | 41.4 | 40.8
Context | single | 75.8 | 45.1 | 52.1 | 48.3
Context | multiple | 76.2 | 46.0 | 54.6 | 49.9

The 15 augmented templates we use on the COCO-Stuff 164K and PASCAL Context datasets are:

'A photo of a {}.'
'A photo of a small {}.'
'A photo of a medium {}.'
'A photo of a large {}.'
'This is a photo of a {}.'
'This is a photo of a small {}.'
'This is a photo of a medium {}.'
'This is a photo of a large {}.'
'A {} in the scene.'
'A photo of a {} in the scene.'
'There is a {} in the scene.'
'There is the {} in the scene.'
'This is a {} in the scene.'
'This is the {} in the scene.'
'This is one {} in the scene.'
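As a sketch of how the multi-template class embeddings can be produced with the openai/CLIP package (averaging the normalized per-template embeddings follows CLIP's usual prompt ensembling and is an assumption about how ZegCLIP combines the 15 templates):

```python
import torch
import clip  # openai/CLIP package: pip install git+https://github.com/openai/CLIP.git

TEMPLATES = ["A photo of a {}.", "A photo of a small {}.", "This is a {} in the scene."]

@torch.no_grad()
def class_embeddings(model, classnames, templates=TEMPLATES, device="cpu"):
    """Return one L2-normalized embedding per class, averaged over templates."""
    embs = []
    for name in classnames:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        feats = model.encode_text(tokens)                  # (#templates, d)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        embs.append(feats.mean(dim=0))
    T = torch.stack(embs)                                   # (C, d)
    return T / T.norm(dim=-1, keepdim=True)

model, _ = clip.load("ViT-B/16", device="cpu")
T = class_embeddings(model, ["tree", "grass", "river"])
print(T.shape)  # torch.Size([3, 512])
```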
D. Brief frameworks of the related two-stage methods and our one-stage method

As described above, previous zero-shot semantic segmentation methods based on CLIP follow a two-stage paradigm, as shown in Fig. 7-(a). In stage 1, they generate abundant class-agnostic region proposals according to learnable queries. In stage 2, each cropped proposal region is encoded via the CLIP image encoder to utilize its powerful image-level zero-shot classification capability. CLIP is thus still used as a zero-shot image-level classifier.

Instead, we propose a simpler and more efficient one-stage solution as our baseline that directly leverages the feature embeddings from CLIP and extends CLIP from image-level classification to pixel-level (or patch-level) classification, as shown in Fig. 7-(b). We introduce a light transformer-based decoder to generate semantic masks by computing the similarities between text-wise and patch-wise embeddings extracted from CLIP. In the baseline, the CLIP text encoder is frozen and the CLIP image encoder can be fixed (Baseline-Fix) or fine-tuned (Baseline-FT).

However, our baseline model still faces the overfitting problem on seen classes. To improve the generalization ability to unseen classes, we propose three important designs on top of our baseline, as shown in Fig. 7-(c). After combining the three designs, our model ZegCLIP can transfer the zero-shot ability of CLIP from image level to pixel level and achieve significant performance on both seen and unseen classes.

In conclusion, in the two-stage methods, N cropped class-agnostic images are fed into CLIP for image-wise classification, which may heavily increase the computational cost. Our proposed one-stage paradigm is simple-but-efficient, since the original image is encoded only once. The inference speed is compared in Tab. 3: our one-stage method ZegCLIP achieves a speedup of about 5 times over the two-stage method in the inference stage.

Figure 7. Brief frameworks of (a) the related two-stage methods, (b) our one-stage baseline, and (c) our proposed ZegCLIP model. The happy faces in (b)(c) indicate good performance on seen or unseen classes, while the sad face in (b) indicates that the baseline model achieves poor performance on unseen classes.

E. The details of unseen class names

For a fair comparison, we provide the detailed unseen class names of the PASCAL VOC 2012 (VOC), COCO-Stuff 164K (COCO), and PASCAL Context (Context) datasets in Tab. 10.

Table 10. The details of unseen class names.

Dataset | Unseen class names
VOC | pottedplant, sheep, sofa, train, tvmonitor
COCO | cow, giraffe, suitcase, frisbee, skateboard, carrot, scissors, cardboard, clouds, grass, playingfield, river, road, tree, wall concrete
Context | cow, motorbike, sofa, cat, boat, fence, bird, tv monitor, keyboard, aeroplane

F. More visualization details

To further demonstrate the effectiveness of our designs on the one-stage baseline (Baseline-FT version), we provide more visualizations, including the predicted segmentation results and the semantic masks of different class queries from the decoder, in Fig. 8. Note that the class names in red are the novel categories.

We can see that, after applying our proposed designs, the segmentation performance of (b) ZegCLIP improves on both seen and unseen classes compared with (a) Baseline-FT. Meanwhile, similar unseen classes are more clearly distinguished by our ZegCLIP model, as shown in the heat maps. For example, in the "COCO-000000079188" testing image, although Baseline-FT can classify "grass" and "tree" (both unseen classes) correctly in the semantic masks, our ZegCLIP distinguishes these novel classes far more discriminatively.
Figure 8. Visualization of semantic masks for different text query embeddings on COCO-Stuff 164K test images (COCO-000000017178, COCO-000000027768, COCO-000000079188, COCO-000000031217). For each image, row (a) shows the input image and the Baseline-FT (Baseline-Finetune) results, and row (b) shows the ground truth and our ZegCLIP results, together with per-class heat maps for selected seen and unseen class queries.
