CLIP Report
Abstract
Vision-Language Models (VLMs) pre-trained on web-scale image-text pairs have shifted visual recognition away from task-specific training on large labeled datasets and toward zero-shot prediction. This report reviews the development of VLMs such as CLIP, how they are evaluated through zero-shot prediction and linear probing, and the challenges that remain in fine-grained vision-language modeling, data efficiency, and multilingual capability.
1. Introduction
Visual recognition, including tasks such as image classification [22, 9], object detection [33], and semantic segmentation [30], has been a central focus of computer vision research. Traditionally, these tasks have relied on deep learning techniques that require large amounts of labeled data and task-specific fine-tuning to achieve strong performance [9, 2], a process that is labor-intensive and time-consuming. The advent of Vision-Language Models (VLMs) has transformed this process by enabling zero-shot prediction with models pre-trained on web-scale image-text pairs [19, 11, 17, 31, 4, 15]. Because VLMs do not need to be fine-tuned for each task, they simplify the recognition pipeline and reduce the dependency on labeled data. This paper provides a systematic review of VLMs, detailing their development, applications, and the challenges that remain in the field [7, 38, 37].
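To make the zero-shot setting concrete, the sketch below shows how a pre-trained CLIP model [19] can classify an image without any task-specific training: class names are turned into text prompts, both modalities are embedded, and the class whose prompt is most similar to the image is predicted. This is a minimal illustration assuming the Hugging Face transformers implementation of CLIP; the checkpoint name, image path, and class list are placeholders.

# Minimal zero-shot classification sketch with a pre-trained CLIP model.
# Assumes the Hugging Face `transformers` and `Pillow` packages; the checkpoint,
# image path, and class names below are illustrative placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "car"]                   # placeholder label set
prompts = [f"a photo of a {c}" for c in class_names]  # simple prompt template
image = Image.open("example.jpg")                     # placeholder image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)      # image-text similarity -> class probabilities
print(class_names[probs.argmax().item()])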
2. Background
The development of VLMs has been a response to the limitations of traditional deep learning approaches [19, 11, 17, 31, 4, 15]. Initially, visual recognition tasks relied on deep neural networks (DNNs) trained from scratch on large-scale labeled datasets [9, 2]. Although effective, this approach presented two significant challenges: slow convergence during training and the need for vast amounts of task-specific data. The introduction of VLMs, particularly CLIP [19], marked a paradigm shift: by pre-training on large-scale, often unannotated image-text pairs, VLMs capture rich vision-language correlations that allow them to perform well on a variety of downstream tasks without fine-tuning. This new paradigm has accelerated the training process and broadened the applicability of models to a wider range of tasks.
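The core of this pre-training paradigm, as popularized by CLIP [19], is a symmetric contrastive objective: image and text embeddings of matched pairs are pulled together while mismatched pairs within the same batch are pushed apart. The sketch below is a schematic PyTorch rendition of that objective rather than any particular reference implementation; the batch size, embedding dimension, and temperature value are illustrative, and in CLIP the temperature is actually a learned parameter.

# Schematic sketch of the symmetric image-text contrastive loss used in
# CLIP-style pre-training. Encoder outputs are stand-ins (random tensors);
# in practice they come from an image encoder and a text encoder.
import torch
import torch.nn.functional as F

batch_size, embed_dim = 8, 512                  # illustrative sizes
image_emb = torch.randn(batch_size, embed_dim)  # placeholder image-encoder output
text_emb = torch.randn(batch_size, embed_dim)   # placeholder text-encoder output
temperature = 0.07                              # fixed here; learned in CLIP

# L2-normalize, then compute pairwise cosine similarities scaled by temperature.
image_emb = F.normalize(image_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)
logits = image_emb @ text_emb.t() / temperature

# Matched pairs lie on the diagonal; cross-entropy is applied in both directions.
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2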
5. Evaluation and Benchmarking
The evaluation of VLMs typically involves zero-shot prediction, where the model is applied directly to downstream tasks without any task-specific fine-tuning [2, 14, 6, 1, 16]. This has been demonstrated on tasks such as image classification, object detection, and semantic segmentation, where VLMs show impressive performance and often outperform traditional models that require fine-tuning on task-specific data. In addition, linear probing, which freezes the pre-trained VLM and trains a linear classifier on the encoded embeddings, is commonly used to assess the quality of the learned representations. Together, these evaluation protocols have established the effectiveness of VLMs across a wide range of visual recognition tasks.
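As a concrete illustration of linear probing, the sketch below freezes a pre-trained CLIP image encoder, embeds a labeled dataset with it, and fits a logistic-regression classifier on the resulting features. It assumes the Hugging Face transformers implementation of CLIP together with scikit-learn; the checkpoint name and the image paths and labels are placeholders, not part of any benchmark described above.

# Linear-probe sketch: freeze a pre-trained CLIP image encoder and train a
# linear classifier on its embeddings. Assumes `transformers`, `torch`,
# `scikit-learn`, and `Pillow`; the images and labels below are placeholders.
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode(paths):
    """Embed a list of image files with the frozen CLIP image encoder."""
    images = [Image.open(p) for p in paths]
    pixel_values = processor(images=images, return_tensors="pt")["pixel_values"]
    return model.get_image_features(pixel_values=pixel_values).numpy()

# Placeholder file lists and integer labels for the train/test splits.
train_paths, train_labels = ["img_0.jpg", "img_1.jpg"], [0, 1]
test_paths, test_labels = ["img_2.jpg"], [0]

probe = LogisticRegression(max_iter=1000).fit(encode(train_paths), train_labels)
print("linear-probe accuracy:", probe.score(encode(test_paths), test_labels))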
7. Conclusions
Vision-Language Models have revolutionized visual recognition by enabling zero-shot prediction and removing the need for task-specific fine-tuning. By harnessing large-scale image-text data, VLMs have demonstrated impressive performance across a wide range of tasks, from image classification to complex object detection and semantic segmentation. Challenges remain, however, in fine-grained vision-language modeling, data efficiency, and multilingual capabilities. As research continues to address these challenges, VLMs are poised to push the boundaries of what is possible in visual recognition, offering exciting opportunities for future advances in the field.
References
[1] Lukas Bossard, Matthieu Guillaumin and Luc Van Gool. ‘Food-101 – Mining discriminative components with random forests’. In: Computer vision – ECCV 2014: 13th European conference, Zurich, Switzerland, September 6-12, 2014, proceedings, part VI. Springer. 2014, pp. 446–461.
[2] Jia Deng et al. ‘ImageNet: A large-scale hierarchical image database’. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE. 2009, pp. 248–255.
[3] Peng Gao et al. ‘CLIP-Adapter: Better vision-language models with feature adapters’. In: International Journal of Computer Vision 132.2 (2024), pp. 581–595.
[4] Shijie Geng et al. ‘HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention’. In: The Eleventh International Conference on Learning Representations. 2023.
[5] Jianping Gou et al. ‘Knowledge distillation: A survey’. In: International Journal of
Computer Vision 129.6 (2021), pp. 1789–1819.
[6] Gregory Griffin, Alex Holub, Pietro Perona et al. Caltech-256 object category dataset. Technical Report 7694. California Institute of Technology, Pasadena, 2007.
[7] Xiuye Gu et al. ‘Open-vocabulary Object Detection via Vision and Language Knowledge Distillation’. In: International Conference on Learning Representations. 2022.
[8] Zixian Guo et al. ‘Texts as images in prompt tuning for multi-label image recogni-
tion’. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition. 2023, pp. 2808–2817.
[9] Kaiming He et al. ‘Deep residual learning for image recognition’. In: Proceedings of
the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.
[10] Dat Huynh et al. ‘Open-vocabulary instance segmentation via robust cross-modal
pseudo-labeling’. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2022, pp. 7020–7031.
[11] Chao Jia et al. ‘Scaling up visual and vision-language representation learning with
noisy text supervision’. In: International conference on machine learning. PMLR.
2021, pp. 4904–4916.
[12] Muhammad Uzair Khattak et al. ‘MaPLe: Multi-modal prompt learning’. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023, pp. 19113–19122.
[13] Dahun Kim, Anelia Angelova and Weicheng Kuo. ‘Region-aware pretraining for
open-vocabulary object detection with vision transformers’. In: Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition. 2023, pp. 11144–
11154.
[14] Alex Krizhevsky, Geoffrey Hinton et al. ‘Learning multiple layers of features from
tiny images’. In: (2009).
[15] Liunian Harold Li et al. ‘Grounded language-image pre-training’. In: Proceedings
of the IEEE/CVF conference on computer vision and pattern recognition. 2022,
pp. 10965–10975.
[16] Tsung-Yi Lin et al. ‘Microsoft COCO: Common objects in context’. In: Computer vision – ECCV 2014: 13th European conference, Zurich, Switzerland, September 6-12, 2014, proceedings, part V. Springer. 2014, pp. 740–755.
[17] Norman Mu et al. ‘SLIP: Self-supervision meets language-image pre-training’. In: European conference on computer vision. Springer. 2022, pp. 529–544.
[18] Sarah Parisot, Yongxin Yang and Steven McDonagh. ‘Learning to name classes
for vision and language models’. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2023, pp. 23477–23486.
[19] Alec Radford et al. ‘Learning transferable visual models from natural language supervision’. In: International conference on machine learning. PMLR. 2021, pp. 8748–8763.
[20] Yongming Rao et al. ‘DenseCLIP: Language-guided dense prediction with context-aware prompting’. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022, pp. 18082–18091.
[21] Manli Shu et al. ‘Test-time prompt tuning for zero-shot generalization in vision-
language models’. In: Advances in Neural Information Processing Systems 35 (2022),
pp. 14274–14289.
[22] Karen Simonyan and Andrew Zisserman. ‘Very deep convolutional networks for
large-scale image recognition’. In: arXiv preprint arXiv:1409.1556 (2014).
[23] Ximeng Sun, Ping Hu and Kate Saenko. ‘DualCoOp: Fast adaptation to multi-label recognition with limited annotations’. In: Advances in Neural Information Processing Systems 35 (2022), pp. 30569–30582.
[24] Vishaal Udandarao, Ankush Gupta and Samuel Albanie. ‘SuS-X: Training-free name-only transfer of vision-language models’. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, pp. 2725–2736.
[25] Karl Weiss, Taghi M Khoshgoftaar and DingDing Wang. ‘A survey of transfer learning’. In: Journal of Big Data 3 (2016), pp. 1–40.
[26] Peng Xia et al. ‘HGCLIP: Exploring Vision-Language Models with Graph Repres-
entations for Hierarchical Understanding’. In: Proceedings of the 31st International
Conference on Computational Linguistics. 2025, pp. 269–280.
[27] Peng Xia et al. ‘LMPT: Prompt Tuning with Class-Specific Embedding Loss for
Long-Tailed Multi-Label Visual Recognition’. In: Proceedings of the 3rd Workshop
on Advances in Language and Vision Research (ALVR). 2024, pp. 26–36.
[28] Hantao Yao, Rui Zhang and Changsheng Xu. ‘Visual-language prompt tuning with
knowledge-guided context optimization’. In: Proceedings of the IEEE/CVF confer-
ence on computer vision and pattern recognition. 2023, pp. 6757–6767.
[29] Lewei Yao et al. ‘DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment’. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 23497–23506.
[30] Changqian Yu et al. ‘BiSeNet: Bilateral segmentation network for real-time semantic segmentation’. In: Proceedings of the European conference on computer vision (ECCV). 2018, pp. 325–341.
[31] Xiaohua Zhai et al. ‘LiT: Zero-shot transfer with locked-image text tuning’. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022, pp. 18123–18133.
[32] Renrui Zhang et al. ‘Tip-Adapter: Training-free adaption of CLIP for few-shot classification’. In: European conference on computer vision. Springer. 2022, pp. 493–510.
[33] Zhong-Qiu Zhao et al. ‘Object detection with deep learning: A review’. In: IEEE
transactions on neural networks and learning systems 30.11 (2019), pp. 3212–3232.
[34] Yiwu Zhong et al. ‘RegionCLIP: Region-based language-image pretraining’. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022, pp. 16793–16803.
[35] Kaiyang Zhou et al. ‘Conditional prompt learning for vision-language models’. In:
Proceedings of the IEEE/CVF conference on computer vision and pattern recogni-
tion. 2022, pp. 16816–16825.
[36] Kaiyang Zhou et al. ‘Learning to prompt for vision-language models’. In: Interna-
tional Journal of Computer Vision 130.9 (2022), pp. 2337–2348.
[37] Qihang Zhou et al. ‘AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection’. In: The Twelfth International Conference on Learning Representations. 2024.
[38] Ziqin Zhou et al. ‘ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation’. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023, pp. 11175–11185.
[39] Beier Zhu et al. ‘Prompt-aligned gradient for prompt tuning’. In: Proceedings of the
IEEE/CVF international conference on computer vision. 2023, pp. 15659–15669.