Zegclip: Towards Adapting Clip For Zero-Shot Semantic Segmentation
Ziqin Zhou1  Yinjie Lei2  Bowen Zhang1  Lingqiao Liu1*  Yifan Liu1
1 The University of Adelaide, Australia    2 Sichuan University, China
{ziqin.zhou,b.zhang,lingqiao.liu,yifan.liu04}@adelaide.edu.au, yinjie@scu.edu.cn
Abstract

[Figure 1: mIoU (%) on seen and unseen classes for the baseline variants and the proposed designs (cf. Tab. 5).]

... learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its ...
Figure 2. Overview of our proposed ZegCLIP. Our method modifies a one-stage baseline framework of matching text and patch embeddings from CLIP to generate semantic masks. The key contribution of our work is three simple-but-effective designs (labeled as the red circles 1, 2, 3 in the figure). Incorporating these three designs into our proposed one-stage baseline framework upgrades the poorly performing baseline into a strong zero-shot segmentation model.
NLP tasks [24, 27, 35] and CV tasks [47, 50, 57]. Visual Prompt Tuning [20] proposes an effective solution that inserts trainable parameters into each layer of the transformer.

Self-training in Zero-shot Segmentation introduces another setting of zero-shot semantic segmentation called "transductive". Unlike the traditional "inductive" setting, where the novel class names and annotations are both unavailable in the training stage, [34] proposed that self-training via pseudo labels on unlabeled pixels helps to solve the imbalance problem. In such a "transductive" situation, both the ground truth of seen classes and the pseudo labels of unseen classes are utilized to supervise the target model for self-training [17, 49, 56], which can achieve better performance.

3. Method

3.1. Problem Definition

Our proposed method follows the generalized zero-shot semantic segmentation (GZLSS) setting [44], which requires segmenting both seen classes C^s and unseen classes C^u after training only on a dataset with pixel annotations for the seen part. In the training stage, the model generates per-pixel classification results from the semantic descriptions of all seen classes. In the testing stage, the model is expected to produce segmentation results for both known and novel classes. Note that C^s ∩ C^u = ∅ and the labels of C^u are unavailable during training. The key problem of zero-shot segmentation is that merely training on seen classes inevitably biases the model toward known categories at inference time.

This naturally corresponds to the "inductive" zero-shot segmentation setting, in which both unseen class names and images are inaccessible during training. Besides "inductive" zero-shot segmentation, there is a "transductive" zero-shot learning setting, which assumes that the names of unseen classes are known before the testing stage. These works [17, 56] suppose that the training images may include unseen objects and that only the ground-truth masks for these regions are unavailable. Our method can easily be extended to both settings and achieves excellent performance.

3.2. Baseline: One-stage Text-Patch Matching

As the large-scale pre-trained model CLIP shows impressive zero-shot classification ability, recent methods explore applying CLIP to zero-shot segmentation via a two-stage paradigm. In stage 1, they train a class-agnostic proposal generator, and in stage 2 they leverage CLIP as a zero-shot image-level classifier by matching the similarity between text embeddings and the [cls] token of each proposal. While effective, such a design requires two image encoding processes and brings expensive computational overhead.

To simplify the two-stage pipeline when adapting CLIP to zero-shot semantic segmentation, in this work we aim to cope with the critical problem of how to effectively transfer CLIP's powerful generalization capability from image-level to pixel-level classification. Motivated by the observation of recent work [56] that the text embeddings can be implicitly matched to patch-level image embeddings, we build a one-stage baseline by adding a vanilla lightweight transformer as a decoder, inspired by [13, 51]. We then formulate semantic segmentation as a matching problem between a representative class query and the image patch features.
Formally, let us denote the C class embeddings as T = [t_1, t_2, ..., t_C] ∈ R^{C×d}, with d the feature dimension of the CLIP model and t_i representing the i-th class, and the N patch tokens of an image as H = [h_1, h_2, ..., h_N] ∈ R^{N×d}, with h_j denoting the j-th patch. We then apply linear projections ϕ to generate Q (query), K (key) and V (value) as:

\begin{array}{c} \mathbf{Q}=\phi_{q}(\mathbf{T})\in \mathbb{R}^{C\times d}\\ \mathbf{K}=\phi_{k}(\mathbf{H})\in \mathbb{R}^{N\times d},\ \mathbf{V}=\phi_{v}(\mathbf{H})\in \mathbb{R}^{N\times d}. \end{array} (1)

\mathtt{Masks}=\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{k}}}\in \mathbb{R}^{C\times N}, (2)

where d_k is the dimension of the keys and √d_k acts as a scaling factor. The final segmentation result is obtained by applying an Argmax operation along the class dimension of Masks. The detailed architecture of the decoder, which consists of three transformer layers, is shown on the right of Fig. 8.
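To make this matching step concrete, the following is a minimal PyTorch sketch of Eqs. 1 and 2. It is an illustrative reimplementation rather than the released code: the module name, the single attention head, and the tensor layout are assumptions of ours; the actual decoder stacks three full transformer layers, which also use a value projection and feed-forward blocks.

```python
import math
import torch
import torch.nn as nn

class TextPatchMatcher(nn.Module):
    """Sketch of the one-stage baseline decoder step (Eqs. 1-2): class queries
    attend over patch keys to produce a C x N map of mask logits."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.phi_q = nn.Linear(dim, dim)  # projects class (text) embeddings T
        self.phi_k = nn.Linear(dim, dim)  # projects patch embeddings H
        # the full transformer layer also has a value projection phi_v; omitted here

    def forward(self, T: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        # T: (B, C, d) class embeddings, H: (B, N, d) patch embeddings
        Q = self.phi_q(T)                                           # (B, C, d)
        K = self.phi_k(H)                                           # (B, N, d)
        masks = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])    # (B, C, N), Eq. 2
        return masks

# Usage: reshape the C x N logits back to the patch grid, upsample to the image
# size, then take Argmax over the class dimension to obtain the segmentation map.
```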
How to update the CLIP image encoder: The patch feature representation is generated by the CLIP image encoder. How to update the CLIP image encoder, i.e., how to calculate H, is an important factor. In our baseline approaches, we consider H obtained from either a parameter-fixed CLIP or a parameter-tunable CLIP, denoted as Baseline-Fix and Baseline-FT, respectively. Later, in SubSec. 3.3, we discuss Deep Prompt Tuning (DPT), which turns out to be a better way to adapt CLIP for zero-shot segmentation.

How to train the segmentation model: To properly train the decoder and (optionally) the CLIP model, we apply the commonly used softmax operator to convert the logits calculated from Eq. 2 into posterior probabilities. A mutually Exclusive Loss (EL) such as cross-entropy is then used as the objective function. Later, in SubSec. 3.4, we point out that this seemingly straightforward strategy can be harmful for generalization.

Design of query embedding T: The query embeddings T are the key to our approach. In our baseline model, we use the embeddings from the CLIP text encoder. However, as will be pointed out in SubSec. 3.5, such a choice might cause severe overfitting, and we propose a relationship descriptor that uses the relationship between the text and image tokens as class queries. In SubSec. 4.5-A, we further explore other choices of T. Please refer to the relevant sections for more details.

As shown in Fig. 1-(1)(2) and Tab. 5, our baseline model turns out to perform poorly in practice, especially on the unseen classes. It seems that the model overfits severely to the seen classes and forgets the zero-shot learning capability of CLIP during training. Fortunately, we identify three simple-but-effective designs, described in the following subsections, that upgrade this poorly performing baseline into a strong zero-shot segmentation model.

3.3. Design 1: Deep Prompt Tuning (DPT)

The first design modification is to use deep prompt tuning (DPT) [57] for the CLIP backbone rather than fine-tuning CLIP. As described in Sec. 2, prompt tuning [47] is a recently proposed scheme for adapting a pre-trained transformer model to a target domain. It has become a competitive alternative to fine-tuning in the transfer learning setting [20]. In this work, we explore deep prompt tuning to adapt CLIP for zero-shot image segmentation. Prompt tuning fixes the original parameters of CLIP and adds learnable prompt tokens as additional input to each layer. Since the zero-shot segmentation model is trained on a limited number of seen classes, directly fine-tuning the model tends to overfit the seen classes, as the model parameters are adjusted to optimize the loss only for the seen classes. Consequently, knowledge learned for vision concepts outside the training set might be discarded during fine-tuning. Prompt tuning can potentially alleviate this issue since the original parameters remain intact during training.

Formally, we denote the token embeddings produced by the l-th MHA module of the ViT-based image encoder in CLIP as {g^l, h^l_1, h^l_2, ..., h^l_N}, where g^l denotes the [cls] token embedding and H^l = {h^l_1, h^l_2, ..., h^l_N} denotes the image patch embeddings. Deep prompt tuning appends learnable tokens P^l = {p^l_1, p^l_2, ..., p^l_M} to the above token sequence in each ViT layer of the CLIP image encoder. The l-th MHA module then processes the input tokens as:

[\mathbf{g}^{l},\ \_\ ,\mathbf{H}^{l}]=\mathtt{Layer}^{l}([\mathbf{g}^{l-1},\mathbf{P}^{l-1},\mathbf{H}^{l-1}]) (3)

where the output embeddings corresponding to {p^l_1, ..., p^l_M} are discarded (denoted as _) and are not fed into the next layer. Therefore, {p^l_1, p^l_2, ..., p^l_M} merely acts as a set of learnable parameters that adapt the MHA module.

As shown in Fig. 1-(3) and Tab. 5, compared with fine-tuning, deep prompt tuning of the CLIP image encoder achieves similar performance on seen classes but significantly improves the segmentation results on unseen classes.
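A minimal sketch of this procedure (Eq. 3) is shown below. It assumes a generic sequence of pre-trained ViT blocks and an illustrative prompt length and embedding size; it is not the paper's released implementation.

```python
import torch
import torch.nn as nn

class DeepPromptViT(nn.Module):
    """Sketch of deep prompt tuning: frozen ViT blocks, with M learnable prompt
    tokens inserted per layer and their outputs discarded after each block."""

    def __init__(self, blocks: nn.ModuleList, embed_dim: int = 768, num_prompts: int = 10):
        super().__init__()
        self.blocks = blocks                      # pre-trained CLIP ViT layers (kept frozen)
        for p in self.blocks.parameters():
            p.requires_grad = False
        # one set of learnable prompts per layer: (L, M, d)
        self.prompts = nn.Parameter(torch.randn(len(blocks), num_prompts, embed_dim) * 0.02)

    def forward(self, cls_and_patches: torch.Tensor) -> torch.Tensor:
        # cls_and_patches: (B, 1 + N, d) -> [cls] token followed by N patch tokens
        x = cls_and_patches
        M = self.prompts.shape[1]
        for l, block in enumerate(self.blocks):
            prompts = self.prompts[l].expand(x.shape[0], -1, -1)   # (B, M, d)
            x = torch.cat([x[:, :1], prompts, x[:, 1:]], dim=1)    # [g, P^l, H]
            x = block(x)                                           # frozen MHA + MLP layer
            x = torch.cat([x[:, :1], x[:, 1 + M:]], dim=1)         # drop the prompt outputs
        return x  # (B, 1 + N, d): updated [cls] token g and patch embeddings H
```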
3.4. Design 2: Non-mutually Exclusive Loss (NEL)

The general practice of semantic segmentation models is to treat segmentation as a per-pixel multi-way classification problem: a Softmax operation is used to calculate the posterior probability, followed by a mutually Exclusive Loss (EL) such as cross-entropy as the loss function. However, Softmax essentially assumes a mutually exclusive relationship between the to-be-classified classes: a pixel has to belong to one of the classes of interest. Thus only the relative strength of the logits, i.e., the ratio of the logits, matters for the posterior probability calculation. However, when applying the model to unseen classes, the class space differs from the training scenario, making the logit of an unseen class poorly calibrated against the other unseen classes.

To handle this issue, we suggest avoiding the mutually exclusive mechanism at training time and using a Non-mutually Exclusive Loss (NEL), more specifically, Sigmoid activation with a Binary Cross Entropy (BCE) loss, to ensure that the segmentation result for each class is generated independently. In addition, we use the focal loss [25] variation of the BCE loss and combine it with an extra dice loss [32, 54], following previous work [9]:

\mathcal{L}_{\mathtt{focal}}=-\frac{1}{\mathrm{hw}}\sum_{i=1}^{\mathrm{hw}}\Big[(1-\hat{y}_{i})^{\gamma}\,y_{i}\log(\hat{y}_{i})+\hat{y}_{i}^{\gamma}\,(1-y_{i})\log(1-\hat{y}_{i})\Big], (4)

\mathcal{L}_{\mathtt{dice}}=1-\frac{2\sum_{i=1}^{\mathrm{hw}}y_{i}\hat{y}_{i}}{\sum_{i=1}^{\mathrm{hw}}y_{i}^{2}+\sum_{i=1}^{\mathrm{hw}}\hat{y}_{i}^{2}}, (5)

\mathcal{L}=\alpha \cdot \mathcal{L}_{\mathtt{focal}}+\beta \cdot \mathcal{L}_{\mathtt{dice}}, (6)

where ŷ_i is the predicted probability at pixel i, y_i is the binary ground truth, γ = 2 balances hard and easy samples, and {α, β} are the coefficients that combine the focal loss and the dice loss. As shown in Fig. 1-(4) and Tab. 5, compared with CE, BCE performs better on unseen classes, and adding the focal loss and dice loss on top of BCE [9] boosts the performance further, as demonstrated in SubSec. 4.5-C.
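A compact sketch of this combined objective (Eqs. 4-6), written against per-class sigmoid outputs, is given below. The tensor layout, the value γ = 2, and the smoothing epsilon are assumptions for illustration; α and β are left as arguments.

```python
import torch

def focal_dice_loss(logits: torch.Tensor, targets: torch.Tensor,
                    gamma: float = 2.0, alpha: float = 1.0, beta: float = 1.0,
                    eps: float = 1e-6) -> torch.Tensor:
    """Non-mutually exclusive loss: per-class sigmoid + focal BCE (Eq. 4)
    combined with dice loss (Eq. 5) as in Eq. 6.
    logits, targets: (B, C, H, W); targets are binary masks per class."""
    probs = torch.sigmoid(logits)
    # focal variant of binary cross-entropy, averaged over all pixels and classes
    focal = -((1 - probs) ** gamma * targets * torch.log(probs + eps)
              + probs ** gamma * (1 - targets) * torch.log(1 - probs + eps)).mean()
    # dice loss over the spatial dimensions, averaged over batch and classes
    inter = (probs * targets).sum(dim=(-2, -1))
    denom = (probs ** 2).sum(dim=(-2, -1)) + (targets ** 2).sum(dim=(-2, -1))
    dice = (1 - 2 * inter / (denom + eps)).mean()
    return alpha * focal + beta * dice
```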
3.5. Design 3: Relationship Descriptor (RD)

In the above design, the class embeddings extracted from the CLIP text encoder are matched against the patch embeddings from the CLIP image encoder in the decoder head. While quite intuitive, we find this design can lead to severe overfitting. We postulate that this is because the matching capability between the text queries and image patterns is only trained on the seen-class data.

This motivates us to incorporate the matching capability learned from the original CLIP training into the transformer decoder. Specifically, we note that CLIP calculates the matching score between a text prompt and an image by

{\mathbf{t}^c}^\top \mathbf{g} = \sum_{j}^{d} t^c_j g_j = \sum_{j}^{d} r^c_j, (7)

where t^c ∈ R^d denotes the text embedding for the c-th class and t^c_j is its j-th dimension; g is the image embedding (the [cls] token); and r^c_j = t^c_j g_j. We postulate that r^c = [r^c_1, r^c_2, ..., r^c_d] characterizes how the image and text, i.e., the text prompt representing class c, are matched, and we call it the text-image Relationship Descriptor (RD), denoted as R ∈ R^{C×d} for all C classes in this work. We then concatenate the RDs with the original text embeddings T ∈ R^{C×d} to form image-specific text queries T̂ = {t̂_1, t̂_2, ..., t̂_C} ∈ R^{C×2d} for the transformer decoder. Specifically, with this scheme, the input text query of the transformer decoder for each class becomes

\mathbf{\hat{t}}=\mathrm{concat}[\mathbf{r},\mathbf{t}]=\mathrm{concat}[\mathbf{t}\odot \mathbf{g},\mathbf{t}], (8)

where ⊙ is the Hadamard product. Note that we apply learnable linear projection layers on both T̂ and H to give them the same feature dimension.

We compare the effectiveness of applying the relationship descriptor in Fig. 1-(5) as well as in Tab. 5. As shown, it dramatically improves the segmentation results on both seen and unseen categories. In SubSec. 4.5-A, we further discuss the effect of using different combination formats of element-wise operations for text-image matching in the relationship descriptor and the text queries T̂.
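The relationship descriptor reduces to a few tensor operations. The sketch below builds the image-specific queries T̂ of Eq. 8 from the text embeddings and the image [cls] token; it is an illustrative reimplementation, with the module name and projection dimension assumed rather than taken from the released code.

```python
import torch
import torch.nn as nn

class RelationshipDescriptor(nn.Module):
    """Builds image-specific class queries t_hat = [t * g, t] (Eq. 8) and projects
    them back to the decoder dimension. Dimensions here are illustrative."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)  # maps the 2d-dim query back to d

    def forward(self, text_emb: torch.Tensor, cls_token: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, C, d) class embeddings from the CLIP text encoder
        # cls_token: (B, d) image [cls] embedding from the CLIP image encoder
        g = cls_token.unsqueeze(1)                  # (B, 1, d), broadcast over classes
        rd = text_emb * g                           # r = t ⊙ g, the summands of Eq. 7
        t_hat = torch.cat([rd, text_emb], dim=-1)   # (B, C, 2d), Eq. 8
        return self.proj(t_hat)                     # (B, C, d) queries for the decoder
```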
4. Experiments

4.1. Datasets

To evaluate the effectiveness of our proposed method, we conducted extensive experiments on three public benchmark datasets: PASCAL VOC 2012, COCO-Stuff 164K, and PASCAL Context. The unseen classes of each dataset, chosen following previous works [12, 17, 49, 56], are listed in the Appendix. The details of the datasets are as follows:

PASCAL VOC 2012 contains 10,582 augmented images for training and 1,449 for validation. We ignore the "background" category and use 15 classes as the seen part and 5 classes as the unseen part.

COCO-Stuff 164K is a large-scale dataset that contains 171 categories, with 118,287 images for training and 5,000 for testing. The whole dataset is divided into 156 seen classes and 15 unseen classes.

PASCAL Context includes 60 classes, with 4,996 images for training and 5,104 for testing. The dataset is divided into 50 known classes (including "background"), and the remaining 10 classes are used as unseen classes in the test set.

4.2. Evaluation Metrics

Following previous works, we measure pixel-wise classification accuracy (pAcc) and the mean of class-wise intersection over union (mIoU) on both seen and unseen classes, denoted as mIoU(S) and mIoU(U), respectively. We also evaluate the harmonic mean IoU (hIoU) between seen and unseen classes.
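The hIoU follows the standard definition used in generalized zero-shot segmentation; a small helper (ours, for reference) is:

```python
def harmonic_iou(miou_seen: float, miou_unseen: float) -> float:
    """hIoU = 2 * mIoU(S) * mIoU(U) / (mIoU(S) + mIoU(U))."""
    return 2 * miou_seen * miou_unseen / (miou_seen + miou_unseen)

# e.g. harmonic_iou(91.9, 77.8) ≈ 84.3, matching the ZegCLIP row of Tab. 2
```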
Table 2. Comparison with the state-of-the-art methods on the PASCAL VOC 2012, COCO-Stuff 164K, and PASCAL Context datasets. "ST" denotes applying self-training via pseudo labels generated on all unlabeled pixels, while "*"+"ST" denotes that pseudo labels are only generated on unseen pixels, excluding the ignored regions.

Methods | PASCAL VOC 2012 (pAcc, mIoU(S), mIoU(U), hIoU) | COCO-Stuff 164K (pAcc, mIoU(S), mIoU(U), hIoU) | PASCAL Context (pAcc, mIoU(S), mIoU(U), hIoU)
Inductive
SPNet [44] - 78.0 15.6 26.1 - 35.2 8.7 14.0 - - - -
ZS3 [3] - 77.3 17.7 28.7 - 34.7 9.5 15.0 52.8 20.8 12.7 15.8
CaGNet [17] 80.7 78.4 26.6 39.7 56.6 33.5 12.2 18.2 - 24.1 18.5 21.2
SIGN [10] - 75.4 28.9 41.7 - 32.3 15.5 20.9 - - - -
Joint [1] - 77.7 32.5 45.9 - - - - - 33.0 14.9 20.5
ZegFormer [12] - 86.4 63.6 73.3 - 36.6 33.2 34.8 - - - -
zsseg [49] 90.0 83.5 72.5 77.5 60.3 39.3 36.3 37.8 - - - -
ZegCLIP (Ours) 94.6 91.9 77.8 84.3 62.0 40.2 41.4 40.8 76.2 46.0 54.6 49.9
Transductive
SPNet+ST [44] - 77.8 25.8 38.8 - 34.6 26.9 30.3 - - - -
ZS5 [3] - 78.0 21.2 33.3 - 34.9 10.6 16.2 49.5 27.0 20.7 23.4
CaGNet+ST [17] 81.6 78.6 30.3 43.7 56.8 35.6 13.4 19.5 - - - -
STRICT [34] - 82.7 35.6 49.8 - 35.3 30.3 34.8 - - - -
zsseg+ST [49] 88.7 79.2 78.1 79.3 63.8 39.6 43.6 41.5 - - - -
ZegCLIP+ST (Ours) 95.1 91.8 82.2 86.7 68.8 40.6 54.8 46.6 77.2 46.6 65.4 54.4
*MaskCLIP+ [56] - 88.8 86.1 87.4 - 38.1 54.7 45.0 - 44.4 66.7 53.3
*ZegCLIP+ST (Ours) 96.2 92.3 89.9 91.1 69.2 40.7 59.9 48.5 77.3 46.8 68.5 55.6
Fully Supervised
ZegCLIP (Ours) 96.3 92.4 90.9 91.6 69.9 40.7 63.2 49.6 77.5 46.5 78.7 56.9
4.3. Implementation Details

Our proposed method is implemented on top of the open-source toolbox MMSegmentation [11] with PyTorch 1.10.1. All reported experiments are based on the pre-trained CLIP ViT-B/16 model and conducted on 4 Tesla V100 GPUs, with a batch size of 16 and an image resolution of 512x512. For "inductive" zero-shot learning, the total number of training iterations is 20K for PASCAL VOC 2012, 40K for PASCAL Context, and 80K for COCO-Stuff 164K. In the "transductive" setting, we train our ZegCLIP model on seen classes for the first half of the training iterations and then apply self-training via generated pseudo labels for the remaining iterations. The optimizer is AdamW with the default training schedule in the MMSeg toolbox.
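This setup maps onto a small set of MMSegmentation-style options. The sketch below uses the MMSegmentation 0.x config format; the batch size, crop size, optimizer type, and iteration counts follow the text above, while the learning rate, weight decay, schedule, and interval values are placeholders rather than the paper's released configuration.

```python
# Sketch of the training setup in MMSegmentation 0.x config style (placeholder values noted).
crop_size = (512, 512)
data = dict(samples_per_gpu=4, workers_per_gpu=4)           # 4 GPUs x 4 images = batch size 16
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=1e-4)  # lr/weight_decay are placeholders
lr_config = dict(policy='poly', power=0.9, min_lr=1e-6, by_epoch=False)  # default MMSeg schedule
runner = dict(type='IterBasedRunner', max_iters=20000)      # 20K for VOC; 40K Context; 80K COCO
checkpoint_config = dict(by_epoch=False, interval=2000)     # placeholder interval
evaluation = dict(interval=2000, metric='mIoU')
```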
4.4. Comparison with State-of-the-art Methods

To demonstrate the effectiveness of our method, the evaluation results compared with previous state-of-the-art methods are reported in Tab. 2. We also provide fully supervised learning results as an upper bound, showing the performance gap between fully supervised segmentation and zero-shot segmentation on unseen classes. Qualitative results on COCO-Stuff 164K are shown in Fig. 4, and more visualization results are provided in the Appendix.

From Tab. 2, we can see that our proposed method achieves strong performance under the "inductive" setting and outperforms previous works, especially on unseen classes. This clearly demonstrates the superior generalization capability of our method over previous approaches. In addition, we find that our approach also excels in the "transductive" setting, although it is not specifically designed for that setting. As shown, our model dramatically improves the performance on unseen classes while maintaining excellent performance on the seen part after self-training.

Fig. 4 shows the segmentation results of the fine-tuned baseline (Baseline-FT) and our proposed ZegCLIP on seen and unseen classes. After applying our designs, ZegCLIP shows impressive segmentation ability on both seen and unseen classes and can clearly distinguish similar unseen categories, for example, the unseen "tree", "grass" and "playingfield" categories in (1).

In addition, to demonstrate the efficiency of our proposed method, we compare the number of learnable parameters and the inference speed between our one-stage ZegCLIP and the typical two-stage method ZegFormer [12] in Tab. 3. Our proposed method achieves strong performance while only requiring approximately 14M learnable parameters and 117 GFlops on average, which is only 23% and 6% of [12], respectively. In terms of Frames Per Second (FPS), our method achieves a speedup of about 5 times during inference compared with the two-stage method.

Table 3. Efficiency comparison with different metrics. All models are evaluated on a single 1080Ti GPU. #Params represents the number of learnable parameters in the whole framework.

Datasets | Methods        | #Params(M) ↓ | Flops(G) ↓ | FPS ↑
VOC      | ZegFormer [12] | 60.3         | 1829.3     | 1.7
VOC      | ZegCLIP        | 13.8         | 110.4      | 9.0
COCO     | ZegFormer [12] | 60.3         | 1875.1     | 1.5
COCO     | ZegCLIP        | 14.6         | 123.9      | 6.7
Figure 4. Qualitative results on COCO-Stuff 164K. (a) the original testing images; (b) the results of our proposed one-stage baseline (fine-tuning the image encoder); (c) the visualization results of our proposed ZegCLIP; (d) the ground truths of each image. Note that the white and red tags represent seen and unseen classes, respectively.
4.5. Ablation Study

A. Effect of different formats of the text query t̂

In this work, we propose an important design called the Relationship Descriptor (RD), as described in SubSec. 3.5. In this module, we combine the queried text embeddings T with the image [cls] token g extracted from CLIP to generate the text-image descriptions T̂ before feeding the queries into the segmentation decoder. Such image-specific queries can dramatically improve the zero-shot segmentation performance on unseen classes. To further explore the effect of different element-wise operations in the relationship descriptor, we evaluate various formats of T̂ on VOC and report the results in Tab. 4. The image-specific text query of each class t̂ is formulated as shown in the second column.

As we can see, the dot product and the absolute difference between the text embedding t and the [cls] token g provide more general information, but the sum and concatenation operations perform poorly on both seen and unseen classes. This is understandable, since the dot product and absolute difference characterize the relationship between the image and text encodings, which is in line with the interpretation made in SubSec. 3.5. The format we finally chose for our ZegCLIP model is [t⊙g, t], highlighted in Tab. 4.

Table 4. Effect of different formats of text queries t̂.

dim   | format of t̂     | pAcc | mIoU(S) | mIoU(U) | hIoU
512   | t               | 86.8 | 89.5    | 33.7    | 49.0
512   | t⊙g             | 93.1 | 90.2    | 68.4    | 77.8
512   | |t-g|           | 92.4 | 90.6    | 64.2    | 75.1
512   | t-g             | 88.7 | 87.9    | 46.5    | 60.8
512   | t+g             | 82.2 | 89.9    | 13.9    | 24.1
512*2 | [t, g]          | 88.9 | 88.8    | 39.3    | 54.5
512*2 | [t⊙g, t]        | 94.6 | 91.9    | 77.8    | 84.3
512*2 | [|t-g|, t]      | 90.9 | 91.5    | 54.2    | 68.1
512*2 | [t⊙g, t+g]      | 88.3 | 90.0    | 38.0    | 53.4
512*2 | [t+g, t]        | 82.8 | 89.4    | 20.7    | 33.6
512*2 | [t⊙g, |t-g|]    | 94.1 | 91.2    | 73.9    | 81.6
512*3 | [t⊙g, |t-g|, t] | 93.4 | 91.6    | 67.3    | 77.6

B. Detailed results of applying the designs to the baseline

To demonstrate the effectiveness of our proposed designs, we further report the improvements obtained by applying the designs to our baseline model in Tab. 5. It shows that a fixed CLIP (Baseline-Fix) with a learnable segmentation decoder fails to achieve satisfactory performance on either seen or unseen classes, due to weak image representations for the dense prediction task. Meanwhile, fine-tuning CLIP (Baseline-FT) performs better on seen classes but deteriorates dramatically on unseen classes. It seems that the zero-shot transfer ability of CLIP is destroyed after updating on seen classes. Changing the objective function from a mutually exclusive loss to the Non-mutually Exclusive Loss (NEL, Design 2) also leads to a significant performance boost in both seen and unseen scenarios. The most dramatic improvement comes from the Relationship Descriptor (RD, Design 3), which almost doubles the unseen performance in many cases. Finally, we notice that Deep Prompt Tuning (DPT, Design 1) works well with NEL (Design 2) and RD (Design 3). If we replace DPT with fine-tuning CLIP and combine it with NEL and RD, the issue of overfitting to seen classes remains.
Table 5. Quantitative results on the VOC and COCO datasets demonstrating the effectiveness of our three proposed designs.

Method | PASCAL VOC 2012 (pAcc, mIoU(S), mIoU(U), hIoU) | COCO-Stuff 164K (pAcc, mIoU(S), mIoU(U), hIoU)
Baseline-Fix 69.3 71.1 16.3 26.5 33.3 17.1 15.4 16.2
Baseline-Fix + NEL 85.5 85.2 36.6 51.2 52.4 31.7 20.8 25.1
Baseline-Fix + RD 86.0 82.5 46.6 59.6 41.0 23.3 23.4 23.3
Baseline-Fix + NEL + RD 89.6 83.3 66.4 73.9 53.7 32.3 32.5 32.4
Baseline-FT 77.3 76.5 13.8 23.4 48.4 32.4 17.5 22.7
Baseline-FT + NEL 83.8 84.1 27.5 41.4 56.5 39.9 25.4 31.0
Baseline-FT + RD 79.4 77.8 20.7 32.7 54.0 39.6 22.4 28.6
Baseline-FT + NEL + RD 89.6 90.2 42.4 57.7 60.2 42.7 22.3 29.3
Baseline-DPT 76.2 75.9 28.3 41.2 39.0 22.5 17.5 19.7
Baseline-DPT + NEL 89.2 89.9 40.4 55.7 58.5 38.0 27.4 31.8
Baseline-DPT + RD 85.5 81.0 55.2 65.7 46.4 28.4 27.8 28.1
Baseline-DPT + NEL + RD (ZegCLIP) 94.6 91.9 77.8 84.3 62.0 40.2 41.4 40.8
In conclusion, our proposed method can ensure semantic segmentation performance while maintaining the powerful ability of CLIP, achieving excellent performance on seen and unseen classes simultaneously.

C. Effect of the advanced loss function

As described in SubSec. 3.4, we propose that applying non-mutually exclusive objective functions, i.e., binary cross entropy (BCE) with sigmoid, performs better for zero-shot semantic segmentation. To better handle the imbalance problem among categories, we further combine BCE with the focal loss [25] and dice loss [23] following [9, 51], as in Eq. 6. The results of applying these two versions of the loss function, denoted as "plain" and "plus", are reported separately in Tab. 6. We can see that the "plus" loss function achieves better performance on both seen and unseen classes on the VOC, COCO, and Context datasets.

Table 6. Comparison of introducing the advanced loss function. Note that "plain" represents Binary Cross Entropy (BCE) only, while "plus" means adding the focal loss and dice loss on top of BCE.

dataset | loss  | pAcc | mIoU(S) | mIoU(U) | hIoU
VOC     | plain | 93.4 | 89.7    | 73.6    | 80.9
VOC     | plus  | 94.6 | 91.9    | 77.8    | 84.3
COCO    | plain | 59.8 | 38.8    | 39.0    | 38.9
COCO    | plus  | 62.0 | 40.2    | 41.4    | 40.8
Context | plain | 75.3 | 43.5    | 50.0    | 46.5
Context | plus  | 76.2 | 46.0    | 54.6    | 49.9

D. Generalization ability to other datasets

To further explore the generalization ability of our proposed method, we conduct extra experiments in Tab. 7. We take the model trained on the source dataset via supervised learning on its seen classes and evaluate the segmentation results on both seen and unseen classes of the target datasets. Our method shows better cross-domain generalization capability than the latest related work ZegFormer, which is also based on the CLIP model.

Table 7. Generalization ability to other datasets.

source | target  | method         | pAcc | mIoU | mAcc
COCO   | Context | ZegFormer [12] | 56.8 | 36.1 | 64.0
COCO   | Context | ZegCLIP        | 60.9 | 41.2 | 68.4
COCO   | Context | *ZegCLIP+ST    | 68.4 | 45.8 | 70.9
COCO   | VOC     | ZegFormer [12] | 92.8 | 85.6 | 92.7
COCO   | VOC     | ZegCLIP        | 96.9 | 93.6 | 96.4
COCO   | VOC     | *ZegCLIP+ST    | 97.2 | 94.1 | 96.7

5. Conclusion

In this work, we propose an efficient, straightforward one-stage zero-shot semantic segmentation method based on the pre-trained vision-language model CLIP. To transfer its image-level classification ability to the dense prediction task while maintaining its zero-shot knowledge, we introduce three designs that achieve competitive results on seen classes while substantially improving the performance on novel classes. Our proposed method relies on text embeddings as queries, which makes it flexible enough to cope with both the "inductive" and "transductive" zero-shot settings. To demonstrate the effectiveness of our method, we conduct extensive experiments on three public benchmark datasets and outperform previous state-of-the-art methods. Meanwhile, our one-stage framework is about 5 times faster than two-stage methods during inference. In general, our work explores how to leverage the pre-trained vision-language model CLIP for semantic segmentation and successfully utilizes its zero-shot knowledge in downstream tasks, which may provide inspiration for future research.

Acknowledgement. This work was done in the Adelaide Intelligence Research (AIR) Lab, and Lingqiao Liu is supported by the Centre of Augmented Reasoning (CAR).
References of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 11583–11592, 2022. 1, 2,
[1] Donghyeon Baek, Youngmin Oh, and Bumsub Ham. Ex- 5, 6, 8
ploiting a joint embedding space for generalized zero-shot
[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
semantic segmentation. In Proceedings of the IEEE/CVF In-
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
ternational Conference on Computer Vision (ICCV), pages
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
9536–9545, 2021. 2, 6
vain Gelly, et al. An image is worth 16x16 words: Trans-
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- formers for image recognition at scale. arXiv preprint
biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- arXiv:2010.11929, 2020. 3
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
[14] Sepideh Esmaeilpour, Bing Liu, Eric Robertson, and Lei
guage models are few-shot learners. Proceedings of the Ad-
Shu. Zero-shot out-of-distribution detection based on the
vances in Neural Information Processing Systems (NeurIPS),
pretrained model clip. In Proceedings of the AAAI confer-
33:1877–1901, 2020. 2
ence on artificial intelligence (AAAI), 2022. 1
[3] Maxime Bucher, Tuan-Hung Vu, Matthieu Cord, and Patrick
[15] Mingyuan Fan, Shenqi Lai, Junshi Huang, Xiaoming Wei,
Pérez. Zero-shot semantic segmentation. Proceedings of
Zhenhua Chai, Junfeng Luo, and Xiaolin Wei. Rethinking
the Advances in Neural Information Processing Systems
bisenet for real-time semantic segmentation. In Proceedings
(NeurIPS), 32, 2019. 1, 2, 6
of the IEEE/CVF Conference on Computer Vision and Pat-
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, tern Recognition (CVPR), pages 9716–9725, 2021. 1
Kevin Murphy, and Alan L Yuille. Semantic image segmen-
[16] Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li,
tation with deep convolutional nets and fully connected crfs.
Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, and David Z
arXiv preprint arXiv:1412.7062, 2014. 1
Pan. Multi-scale high-resolution vision transformer for se-
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, mantic segmentation. In Proceedings of the IEEE/CVF
Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image Conference on Computer Vision and Pattern Recognition
segmentation with deep convolutional nets, atrous convolu- (CVPR), pages 12094–12103, 2022. 1
tion, and fully connected crfs. IEEE Transactions on Pattern
[17] Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, and
Analysis and Machine Intelligence (TPAMI), 40(4):834–848,
Liqing Zhang. Context-aware feature generation for zero-
2017. 1
shot semantic segmentation. In Proceedings of the ACM
[6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and International Conference on Multimedia (ACMMM), pages
Hartwig Adam. Rethinking atrous convolution for seman- 1921–1929, 2020. 2, 3, 5, 6
tic image segmentation. arXiv preprint arXiv:1706.05587,
[18] Stefan Hinterstoisser, Vincent Lepetit, Paul Wohlhart, and
2017. 1
Kurt Konolige. On pre-trained image features and synthetic
[7] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong images for deep learning. In Proceedings of the European
Wang. Semi-supervised semantic segmentation with cross Conference on Computer Vision (ECCV) Workshops, 2018.
pseudo supervision. In Proceedings of the IEEE/CVF 2
Conference on Computer Vision and Pattern Recognition
[19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh,
(CVPR), pages 2613–2622, 2021. 1
Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom
[8] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexan- Duerig. Scaling up visual and vision-language representa-
der Kirillov, and Rohit Girdhar. Masked-attention mask tion learning with noisy text supervision. In Proceedings of
transformer for universal image segmentation. In Proceed- the IEEE/CVF International Conference on Computer Vision
ings of the IEEE/CVF Conference on Computer Vision and (ICCV), pages 4904–4916, 2021. 2
Pattern Recognition (CVPR), pages 1290–1299, 2022. 1, 2
[20] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie,
[9] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Vi-
pixel classification is not all you need for semantic segmen- sual prompt tuning. arXiv preprint arXiv:2203.12119, 2022.
tation. Proceedings of the Advances in Neural Information 3, 4
Processing Systems (NeurIPS), 34:17864–17875, 2021. 2, 5,
[21] Jingjing Jiang, Ziyi Liu, and Nanning Zheng. Finetuning
8
pretrained vision-language models with correlation informa-
[10] Jiaxin Cheng, Soumyaroop Nandi, Prem Natarajan, and tion bottleneck for robust visual question answering. arXiv
Wael Abd-Almageed. SIGN: Spatial-information incorpo- preprint arXiv:2209.06954, 2022. 2
rated generative network for generalized zero-shot seman-
[22] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh,
tic segmentation. In Proceedings of the IEEE/CVF Interna-
and Kai-Wei Chang. Visualbert: A simple and perfor-
tional Conference on Computer Vision (ICCV), pages 9556–
mant baseline for vision and language. arXiv preprint
9566, 2021. 2, 6
arXiv:1908.03557, 2019. 2
[11] MMSegmentation Contributors. MMSegmentation:
[23] Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei
Openmmlab semantic segmentation toolbox and
Wu, and Jiwei Li. Dice loss for data-imbalanced nlp tasks.
benchmark. https : / / github . com / open -
arXiv preprint arXiv:1911.02855, 2019. 8
mmlab/mmsegmentation, 2020. 6
[24] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimiz-
[12] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De-
ing continuous prompts for generation. arXiv preprint
coupling zero-shot semantic segmentation. In Proceedings
arXiv:2101.00190, 2021. 3
[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, vision. In Proceedings of the International Conference on
and Piotr Dollár. Focal loss for dense object detection. In Machine Learning (ICML), pages 8748–8763, 2021. 1, 2
Proceedings of the IEEE/CVF International Conference on [37] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong
Computer Vision (ICCV), pages 2980–2988, 2017. 5, 8 Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu.
[26] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Denseclip: Language-guided dense prediction with context-
Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto- aware prompting. In Proceedings of the IEEE/CVF Confer-
deeplab: Hierarchical neural architecture search for seman- ence on Computer Vision and Pattern Recognition (CVPR),
tic image segmentation. In Proceedings of the IEEE/CVF pages 18082–18091, 2022. 2
Conference on Computer Vision and Pattern Recognition [38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net:
(CVPR), pages 82–92, 2019. 1 Convolutional networks for biomedical image segmentation.
[27] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao In International Conference on Medical Image Computing
Du, Zhilin Yang, and Jie Tang. P-tuning: Prompt tuning and Computer Assisted Intervention (MICCAI), pages 234–
can be comparable to fine-tuning across scales and tasks. In 241, 2015. 1
Proceedings of the Association for Computational Linguis- [39] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia
tics (ACL), pages 61–68, 2022. 3 Schmid. Segmenter: Transformer for semantic segmenta-
[28] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and tion. In Proceedings of the IEEE/CVF Conference on Com-
Stephen Gould. Image retrieval on real-life images with puter Vision and Pattern Recognition (CVPR), pages 7262–
pre-trained vision-and-language models. In Proceedings of 7272, 2021. 2
the IEEE/CVF International Conference on Computer Vision [40] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu
(ICCV), pages 2125–2134, 2021. 2 Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-
[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully linguistic representations. arXiv preprint arXiv:1908.08530,
convolutional networks for semantic segmentation. In Pro- 2019. 2
ceedings of the IEEE/CVF Conference on Computer Vision [41] Wouter Van Gansbeke, Simon Vandenhende, Stamatios
and Pattern Recognition (CVPR), pages 3431–3440, 2015. Georgoulis, and Luc Van Gool. Unsupervised semantic seg-
1, 2 mentation by contrasting object mask proposals. In Proceed-
[30] Xiankai Lu, Wenguan Wang, Martin Danelljan, Tianfei ings of the IEEE/CVF International Conference on Com-
Zhou, Jianbing Shen, and Luc Van Gool. Video object seg- puter Vision (ICCV), pages 10052–10062, 2021. 1
mentation with episodic graph memory networks. In Pro- [42] Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Yun Xiong,
ceedings of the IEEE conference on European Conference on Wei Dong, and Xiangke Liao. Bridging pre-trained models
Computer Vision (ECCV), pages 661–679. Springer, 2020. 1 and downstream tasks for source code understanding. In Pro-
[31] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and ceedings of the International Conference on Software Engi-
Andrea Vedaldi. Deep spectral methods: A surprisingly neering (ICSE), pages 287–298, 2022. 2
strong baseline for unsupervised semantic segmentation and [43] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong
localization. In Proceedings of the IEEE/CVF Conference Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-
on Computer Vision and Pattern Recognition (CVPR), pages driven referring image segmentation. In Proceedings of
8364–8375, 2022. 1 the IEEE/CVF Conference on Computer Vision and Pattern
[32] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. Recognition (CVPR), pages 11686–11695, 2022. 1, 2
V-net: Fully convolutional neural networks for volumetric [44] Yongqin Xian, Subhabrata Choudhury, Yang He, Bernt
medical image segmentation. In Proceedings of the Inter- Schiele, and Zeynep Akata. Semantic projection network
national Conference on 3D Vision (3DV), pages 565–571. for zero-and few-label semantic segmentation. In Proceed-
IEEE, 2016. 5 ings of the IEEE/CVF Conference on Computer Vision and
[33] Daniil Pakhomov, Sanchit Hira, Narayani Wagle, Kemar E Pattern Recognition (CVPR), pages 8256–8265, 2019. 1, 2,
Green, and Nassir Navab. Segmentation in style: Unsuper- 3, 6
vised semantic image segmentation with stylegan and clip. [45] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar,
arXiv preprint arXiv:2107.12518, 2021. 1 Jose M Alvarez, and Ping Luo. Segformer: Simple and ef-
[34] Giuseppe Pastore, Fabio Cermelli, Yongqin Xian, Massimil- ficient design for semantic segmentation with transformers.
iano Mancini, Zeynep Akata, and Barbara Caputo. A closer Proceedings of the Advances in Neural Information Process-
look at self-training for zero-label semantic segmentation. ing Systems (NeurIPS), 34:12077–12090, 2021. 2
In Proceedings of the IEEE/CVF Conference on Computer [46] Guo-Sen Xie, Jie Liu, Huan Xiong, and Ling Shao. Scale-
Vision and Pattern Recognition (CVPR), pages 2693–2702, aware graph neural network for few-shot semantic segmenta-
2021. 1, 2, 3, 6 tion. In Proceedings of the IEEE/CVF Conference on Com-
[35] Guanghui Qin and Jason Eisner. Learning how to ask: puter Vision and Pattern Recognition (CVPR), pages 5475–
Querying lms with mixtures of soft prompts. arXiv preprint 5484, 2021. 1
arXiv:2104.06599, 2021. 3 [47] Yinghui Xing, Qirui Wu, De Cheng, Shizhou Zhang, Guo-
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya qiang Liang, and Yanning Zhang. Class-aware visual prompt
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, tuning for vision-language pre-trained model. arXiv preprint
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- arXiv:2208.08340, 2022. 3, 4
ing transferable visual models from natural language super-
[48] Lian Xu, Wanli Ouyang, Mohammed Bennamoun, Farid on Computer Vision and Pattern Recognition (CVPR), pages
Boussaid, Ferdous Sohel, and Dan Xu. Leveraging auxil- 12083–12093, 2022. 1
iary tasks with affinity learning for weakly supervised se- [54] Rongjian Zhao, Buyue Qian, Xianli Zhang, Yang Li, Rong
mantic segmentation. In Proceedings of the IEEE/CVF In- Wei, Yang Liu, and Yinggang Pan. Rethinking dice loss for
ternational Conference on Computer Vision (ICCV), pages medical image segmentation. In Proceedings of the IEEE In-
6984–6993, 2021. 1 ternational Conference on Data Mining (ICDM), pages 851–
[49] Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue 860. IEEE, 2020. 5
Cao, Han Hu, and Xiang Bai. A simple baseline for zero- [55] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu,
shot semantic segmentation with pre-trained vision-language Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao
model. arXiv preprint arXiv:2112.14757, 2021. 1, 2, 3, 5, 6 Xiang, Philip HS Torr, et al. Rethinking semantic segmen-
[50] Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat- tation from a sequence-to-sequence perspective with trans-
Seng Chua, and Maosong Sun. CPT: Colorful prompt tun- formers. In Proceedings of the IEEE/CVF Conference on
ing for pre-trained vision-language models. arXiv preprint Computer Vision and Pattern Recognition (CVPR), pages
arXiv:2109.11797, 2021. 3 6881–6890, 2021. 2
[51] Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xi- [56] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free
aolin Wei, Chunhua Shen, and Yifan Liu. Segvit: Semantic dense labels from clip. In Proceedings of the IEEE confer-
segmentation with plain vision transformers. arXiv preprint ence on European Conference on Computer Vision (ECCV),
arXiv:2210.05844, 2022. 2, 3, 8 2022. 2, 3, 5, 6
[52] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, [57] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei
Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Con- Liu. Learning to prompt for vision-language models. Inter-
text encoding for semantic segmentation. In Proceedings of national Journal of Computer Vision (IJCV), 130(9):2337–
the IEEE/CVF Conference on Computer Vision and Pattern 2348, 2022. 3, 4
Recognition (CVPR), pages 7151–7160, 2018. 2 [58] Fangrui Zhu, Yi Zhu, Li Zhang, Chongruo Wu, Yanwei Fu,
[53] Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen, and Mu Li. A unified efficient pyramid transformer for se-
Xinggang Wang, Wenyu Liu, Gang Yu, and Chunhua Shen. mantic segmentation. In Proceedings of the IEEE/CVF In-
Topformer: Token pyramid transformer for mobile semantic ternational Conference on Computer Vision (ICCV), pages
segmentation. In Proceedings of the IEEE/CVF Conference 2667–2677, 2021. 1
Appendix

A. Effect of the number of deep prompt tokens

We find that the best number of deep prompt tokens varies across datasets; the detailed results are shown in Fig. 6. For the PASCAL VOC 2012 dataset (VOC), which contains fewer training samples and categories, 10 tokens are enough to obtain significant performance on both seen and unseen classes. However, for large-scale datasets, more deep prompts, i.e., 100 for COCO-Stuff 164K (COCO) and 35 for PASCAL Context (Context), are beneficial for better segmentation performance. In general, the best number of deep prompt tokens increases with the scale of the dataset and the complexity of the per-pixel classification task. Meanwhile, using too many visual prompts may instead be detrimental to our model.

B. Effect of the layers for inserting prompt tokens

The quantitative results are reported in Tab. 8. For a better explanation, we number the 12 vision transformer layers in the CLIP image encoder from 1 ("bottom") to 12 ("top"), and the layers at which prompts are inserted are denoted in the first column of Tab. 8. We find that adding prompt tokens to the "bottom" layers generally tends to perform better than adding them to the "top" layers. Meanwhile, inserting learnable prompt tokens in every ViT layer (layers 1→12) achieves the best performance, which is also the default setting in our experiments.
[Figure 6: mIoU (%) on the seen (S) and unseen (U) classes of VOC, COCO, and Context as a function of the number of deep prompt tokens.]

C. Effect of single and multiple text templates

Following the training details of CLIP, we apply a single template "A photo of a {}" on PASCAL VOC 2012 (VOC) and multiple templates on the large-scale datasets, i.e., COCO-Stuff 164K (COCO) and PASCAL Context (Context), when obtaining the class embeddings from the CLIP text encoder. We provide the quantitative results of using single and multiple templates in Tab. 9, where we can see that multiple descriptions achieve reasonable improvements on both datasets.

Table 9. Comparison of using single and multiple templates on the COCO-Stuff 164K and PASCAL Context datasets.
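As an illustration of this step, the snippet below builds class embeddings with the open-source OpenAI `clip` package and prompt templates. The single template string follows the paper; the class list and the additional template, as well as the averaging over templates, are assumed for the example and are not taken from the released code.

```python
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Example class names (hypothetical subset) and prompt templates.
class_names = ["person", "tree", "grass"]
templates = ["A photo of a {}.", "A photo of the {}."]  # one template for VOC, several for COCO/Context

with torch.no_grad():
    class_embeddings = []
    for name in class_names:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        feats = model.encode_text(tokens)                   # (num_templates, d)
        feats = feats / feats.norm(dim=-1, keepdim=True)    # normalize each template embedding
        class_embeddings.append(feats.mean(dim=0))          # average over templates
    T = torch.stack(class_embeddings)                       # (C, d) text queries for the decoder
```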
Figure 7. Brief frameworks of related two-stage methods, our one-stage baseline, and our proposed ZegCLIP model. The happy face
in (b)(c) means good performance in seen or unseen classes, while the sad face in (b) represents that the baseline model achieves poor
performance in unseen classes.
In conclusion, in the two-stage methods, N cropped class-agnostic image regions must be fed into CLIP for image-wise classification, which heavily increases the computational cost. Our proposed one-stage paradigm is simple but efficient because the original image is encoded only once. The inference speed comparison is given in Tab. 3: our one-stage method ZegCLIP achieves a speedup of about 5 times over the two-stage method in the inference stage.

Table 10. The details of the unseen class names.

Dataset | Unseen classes
VOC     | pottedplant, sheep, sofa, train, tvmonitor
COCO    | cow, giraffe, suitcase, frisbee, skateboard, carrot, scissors, cardboard, clouds, grass, playingfield, river, road, tree, wall concrete
Context | cow, motorbike, sofa, cat, boat, fence, bird, tv monitor, keyboard, aeroplane
[Additional qualitative comparisons on COCO-Stuff images (COCO-000000027768, COCO-000000079188, COCO-000000031217, and others): for each image, the original input, the Baseline-Finetune result, the ZegCLIP result, and the ground truth are shown, together with per-class masks for seen classes (e.g., horse, people, bus, fence, tennis racket, sky-other) and unseen classes (e.g., grass, tree, giraffe, wall-concrete, playingfield).]