SVLM Survey For ACL 2025
Figure 1: A timeline of the recently introduced small-scale VLMs by the research community (MiniVLM, Pali-3, MobileVLM, MobileVLM v2, TinyLLaVa, Mipha, Imp, Bunny, Moondream 2B/0.5B, SmolVLM and Qwen2.5-VL, spanning 08-2021 to 01-2025).
2023). Mini-Gemini (Li et al., 2024b) combines a CNN with a ViT, making a dual-encoder system to process both low- and high-resolution images.

2.2 Small Language Model (SLM)

Instead of the larger LLMs, small VLMs use smaller LLMs to reduce computational cost. To avoid the cost of training from scratch, the normal tendency is to use lightweight pre-trained, open-source LLMs in the VLMs. SVLMs require compact, efficient language models to ensure fast inference and low memory usage. The most widely used small-scale language models are Phi-1.5 (Li et al., 2023c), Phi-2 (Javaheripi et al., 2023), TinyLLaMA (Zhang et al., 2024b), StableLM-2 (Bellagente et al., 2024), MobileLLaMA (Chu et al., 2023), a downscaled version of LLaMA (Touvron et al., 2023), Chinchilla-1.4B (Hoffmann et al., 2022), Qwen-1.8B (Bai et al., 2023), FLAN-T5 (Chung et al., 2024) and MiniLM (Wang et al., 2020b). A brief overview of these LLMs is given in Table 1. These models range from 1.1B to 2.7B parameters, making them much smaller than standard LLMs (e.g., LLaMA-7B (Touvron et al., 2023), GPT-4 (Achiam et al., 2023)). Their compact size makes them efficient for edge AI and mobile applications. Many of these models (Phi-1.5, Phi-2, StableLM-2, MobileLLaMA) are trained to run on low-power devices, and they avoid the massive token-generation overhead common in larger LLMs. They also integrate well with CLIP, SigLIP and other ViT-based encoders.
Model | Published | Parameters
MiniLM | 2020 | 30M to 300M
Chinchilla-1.4B | 2022 | 1.4B
FLAN-T5 | 2023 | 1.3B-2.7B
Qwen-1.8B | Aug 2023 | 1.8B
Phi-1.5 | Nov 2023 | 1.3B
Phi-2 | Dec 2023 | 2.7B
TinyLLaMA | Dec 2023 | 1.1B
StableLM-2 | Jan 2024 | 1.6B
MobileLLaMA | Jan 2024 | 1.4B-2.7B

Table 1: Summary of the most widely used small-scale LLMs in SVLMs.

2.3 Modality Connector

This component projects the visual features into the text embedding space understandable to the SLM. These learnable connectors or projectors in SVLMs are designed to reduce the number of visual tokens efficiently before feeding them to the language model to speed up processing (Chu et al., 2023). Efficiency is key for small models, so the projector must be lightweight, occupying less than 1% of the parameters with minimal computational cost. A common type of projector is the Multi-Layer Perceptron (MLP), often a two-layer network, which is a simple yet effective way to map visual features into the dimension of the text embedding space. The MLP typically uses GELU (Lin et al., 2024a) or SwiGLU (Team, 2025) activation functions. Some models use a linear layer (Li et al., 2024b; Chen et al., 2023a) or a series of linear projection layers (Face, 2025) to bridge the dimensionality gap between the visual encoder and the LLM embedding space, while others upgrade to the Lightweight Downsample Projector (LDP) (Chu et al., 2023, 2024), designed to reduce visual tokens and enhance positional information using depth-wise convolutions. Another popular projector is the Querying Transformer (Q-Former) (Li et al., 2023a; Yuan et al., 2023), which uses trainable query tokens to interact with image features and selectively retrieve relevant information (Figure 2). This architecture efficiently extracts task-relevant information from images, improving vision-language understanding.

Figure 2: Architecture and Workflow of Q-Former (diagram components: image/visual features, learned queries, self-attention and feed-forward layers, fully connected output).
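To make the connector designs above concrete, the following is a minimal PyTorch sketch of two common projector styles: a two-layer MLP with GELU that maps vision-encoder features into the LLM embedding space, and a Q-Former-style module that compresses the visual sequence with a small set of learned queries via attention. The dimensions, layer counts and module names are illustrative assumptions, not the exact configuration of any specific SVLM.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP projector: vision features -> LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):          # (B, N, vision_dim)
        return self.net(visual_tokens)         # (B, N, llm_dim)

class LearnedQueryResampler(nn.Module):
    """Q-Former-style compression: a fixed set of learned queries attends to
    the visual tokens, so the LLM sees only `num_queries` tokens regardless
    of the input resolution."""
    def __init__(self, vision_dim=1024, llm_dim=2560, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.out = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens):          # (B, N, vision_dim)
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, visual_tokens, visual_tokens)
        return self.out(pooled)                # (B, num_queries, llm_dim)

if __name__ == "__main__":
    feats = torch.randn(2, 576, 1024)            # e.g. a 24x24 patch grid from a ViT
    print(MLPProjector()(feats).shape)           # torch.Size([2, 576, 2560])
    print(LearnedQueryResampler()(feats).shape)  # torch.Size([2, 64, 2560])
```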
3 Categorizing SVLM Models

Recent research on small-scale VLMs has led to the development of various architectures with varying configurations of vision encoders, language models and modality connectors. These models focus on reducing computational overhead and improving inference speed, while maintaining multimodal reasoning capabilities and balancing the trade-off between model size and performance. SVLMs can broadly be classified by their vision backbone into two main categories, CNN-based models and ViT-based models, where ViT-based models can be further divided into two subcategories based on the ViT architecture. A detailed dendrogram of SVLM categorization is illustrated in Figure 3.

Figure 3: Categorization of SVLMs based on Vision Encoder (CNN-based: MiniVLM, Flamingo-3B, Mini-Gemini; ViT-based with CLIP: TinyGPT-V, LLaVa-Phi, MobileVLM, MobileVLM-v2, MoE-LLaVa; ViT-based with SigLIP: Mipha, Pali-3, TinyLLaVa, Bunny-4B, Moondream, SmolVLM, Imp).

3.1 Models with CNN Backbone

Earlier SVLMs leveraged CNN-based vision encoders due to their efficiency in image recognition tasks. Two such small-scale models, published before 2023, are MiniVLM (Wang et al., 2020a) and Flamingo-3B (Alayrac et al., 2022). The MiniVLM architecture employs a Two-stage Efficient feature Extractor (TEE) for visual feature extraction and MiniLM (Wang et al., 2020b) as the language model. The TEE module uses an EfficientNet (Tan and Le, 2019) with BiFPN as the backbone, based on EfficientDet (Tan et al., 2020), which applies depthwise and pointwise convolutions for lightweight processing and reduced model size. The backbone is followed by a region proposal network (RPN) (Ren et al., 2016) and an RoIAlign operation (He et al., 2020) for visual region feature extraction with reduced computational complexity while ensuring robust multimodal alignment. Similarly, Flamingo (Alayrac et al., 2022) is a family of VLMs developed by DeepMind, whose 3B-parameter member is the small-scale one. It integrates a frozen ResNet-based NFNet backbone (Brock et al., 2021) and a Perceiver Resampler to refine visual features before fusion with a Chinchilla-based language model (Hoffmann et al., 2022). Despite its strength in open-ended tasks, Flamingo has limitations, such as sensitivity to prompt design, computational intensity for long sequences and relatively weaker classification performance compared to contrastive models. CNN-based SVLMs are gradually being outperformed by ViT-based models, as transformers offer superior long-range dependency modeling and global feature extraction. The shift towards ViT-based encoders has been driven by the increasing reliance on pre-trained contrastive vision-language models (CLIP, SigLIP), which provide stronger alignment between vision and language embeddings.

3.2 Models with ViT Backbone

The transition from CNN to ViT-based encoders marks a crucial trend in SVLM development after 2023. Vision Transformers (ViTs) offer a more expressive feature representation, making them better suited for multimodal reasoning tasks. ViT-based SVLMs fall into three major groups: models utilizing CLIP encoders, those leveraging SigLIP encoders and those adopting other ViT architectures.

Models with CLIP Encoders: A large number of SVLMs initially adopted CLIP-based encoders due to their robust contrastive pretraining on large-scale image-text pairs. LLaVA-Phi (Zhu et al., 2024b), MobileVLM (Chu et al., 2023) and MobileVLM-v2 (Chu et al., 2024) integrated CLIP ViT-L/14 (Radford et al., 2021) as the vision backbone, leveraging its zero-shot capabilities and pre-trained multimodal alignment. LLaVa-Phi incorporates the Phi-2 LLM, while the other two use the downsized MobileLLaMA language model (Touvron et al., 2023). These SVLMs introduced lightweight projection layers (e.g., MLPs, LDP) to efficiently map visual embeddings into the language model's hidden space. MobileVLM (Chu et al., 2023), optimized for mobile and edge devices, introduced a Lightweight Downsample Projector (LDP) that uses convolution with stride 2 to reduce the number of visual tokens by 75%. MobileVLM-v2 (Chu et al., 2024) upgraded the projector to LDPv2, using point-wise convolution, average pooling for token reduction and PEG (Chu et al., 2021) with a skip connection for enhanced positional information; this reduces the projector's parameters by 99.8% compared to LDP, making inference faster. Despite their efficiency, CLIP-based SVLMs face limitations in fine-grained vision-language understanding, particularly in text recognition and detailed spatial reasoning, which has led to a shift towards models adopting SigLIP-based encoders.
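As a rough illustration of the token-reduction idea behind LDP, the sketch below reshapes a ViT patch sequence back into its 2D grid and applies a depth-wise convolution with stride 2 followed by a point-wise projection, which cuts the token count by a factor of 4 (a 75% reduction). This is a simplified approximation under assumed dimensions, not the exact LDP/LDPv2 design from MobileVLM.

```python
import math
import torch
import torch.nn as nn

class DownsampleProjector(nn.Module):
    """Sketch of an LDP-style connector: depth-wise stride-2 convolution over
    the 2D patch grid (4x fewer tokens), then a point-wise projection into
    the LLM embedding dimension."""
    def __init__(self, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.depthwise = nn.Conv2d(vision_dim, vision_dim, kernel_size=3,
                                   stride=2, padding=1, groups=vision_dim)
        self.pointwise = nn.Conv2d(vision_dim, llm_dim, kernel_size=1)

    def forward(self, tokens):                       # (B, N, C), N = H*W
        b, n, c = tokens.shape
        h = w = int(math.isqrt(n))                   # assume a square patch grid
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.pointwise(self.depthwise(x))        # (B, llm_dim, H/2, W/2)
        return x.flatten(2).transpose(1, 2)          # (B, N/4, llm_dim)

if __name__ == "__main__":
    vit_tokens = torch.randn(2, 576, 1024)           # 24x24 patches
    print(DownsampleProjector()(vit_tokens).shape)   # torch.Size([2, 144, 2048])
```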
Models with SigLIP Encoders: Recently, a significant number of SVLMs have transitioned to SigLIP-based encoders, which refine the contrastive learning approach by introducing a more stable loss function and better multimodal feature representation. Moondream (moo, 2024), TinyLLaVA (Zhou et al., 2024), Mipha (Zhu et al., 2024a), Imp (Shao et al., 2024) and Bunny-4B (He et al., 2024) integrated SigLIP ViT-L/14 or ViT-G variants, leading to more accurate multimodal alignment while reducing computational overhead. For example, TinyLLaVA combined SigLIP-L/14 with Phi-2, achieving superior performance in instruction-tuning tasks. Moondream-2B, one of the smallest yet effective SVLMs, optimized SigLIP models with quantization techniques, reducing the memory footprint for deployment in edge AI applications. Mipha extended the effectiveness of SigLIP ViT-G by incorporating smaller language models like Phi-1.5 (Li et al., 2023c) and Gemma-2B (Team et al., 2024), demonstrating that large vision encoders can still be paired with compact LLMs for efficiency. Hugging Face released the SmolVLM (Face, 2025) series, including 2B, 0.5B and 0.25B versions, the last being the smallest VLM available; the smaller models use a 93M-parameter SigLIP base patch-16/512 encoder. These models are optimized for on-device deployment, including laptops and potentially web browsers, due to their minimal memory requirements and efficient processing capabilities.

3.3 Categorization on Model Efficiency

SVLMs can also be classified by model size efficiency and deployment scalability as follows:

Memory-Optimized SVLMs: These models, including Moondream-0.5B (moo, 2024), TinyLLaVA-1.5B (Zhou et al., 2024) and SmolVLM-0.5B/0.25B (Face, 2025), focus on extreme compression techniques such as quantization, low-bit encoding and minimal architectural components to enable deployment on mobile, IoT and embedded systems while maintaining competitive multimodal capabilities.
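As a toy illustration of the kind of compression these memory-optimized models rely on, the sketch below applies symmetric 8-bit weight quantization to a weight matrix: values are stored as int8 plus one floating-point scale per output row, roughly a 4x memory saving over float32. This is a didactic example only; deployed models such as Moondream use their own quantization pipelines.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-row int8 quantization of a weight matrix of shape (out, in)."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0     # one scale per output row
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

if __name__ == "__main__":
    w = torch.randn(2560, 1024)                  # a hypothetical projector weight
    q, scale = quantize_weight_int8(w)
    w_hat = dequantize(q, scale)
    fp32_bytes = w.numel() * 4
    int8_bytes = q.numel() * 1 + scale.numel() * 4
    print(f"compression: {fp32_bytes / int8_bytes:.2f}x, "
          f"max abs error: {(w - w_hat).abs().max():.4f}")
```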
Balanced Size and Performance Models: These models include MobileVLM-v2 (Chu et al., 2024), Bunny-4B (He et al., 2024) and Mipha (Zhu et al., 2024a), which balance accuracy and computational efficiency by leveraging moderate-sized encoders and small LLMs, ensuring usability across diverse applications from real-time dialogue systems to AR/VR.

High-Performance Compact SVLMs: These models, including MoE-LLaVA (Lin et al., 2024a) and Pali-3 (Chen et al., 2023a), are larger in size. MoE-LLaVa introduces a sparse VLM framework based on the Mixture of Experts (MoE) approach (Jacobs et al., 1991), with excellent multimodal understanding and hallucination mitigation abilities. These models retain large-model capabilities while optimizing computational costs, making them suitable for scientific research, enterprise AI and high-fidelity multimodal reasoning.

A brief overview of the recent SVLM models is summarized in Table 2.

4 Training Strategy

The training methods for SVLMs are designed to balance computational efficiency with high performance by leveraging innovative strategies. They matter because they can help compensate for the reduction in model size and improve overall performance: better training recipes and quality data allow smaller LMMs to achieve on-par performance with larger models (Zhou et al., 2024). The core training schemes are pre-training, instruction tuning, fine-tuning and multi-task learning. Models usually employ several of these schemes together to enable effective adaptation of SVLMs to diverse real-world tasks.

4.1 Pre-Training

Pre-training is a crucial stage that aims for vision-language alignment. The primary objective is to train the modality connector on large-scale image-text data to learn cross-modal representations, preparing the model for subsequent fine-tuning on specific downstream tasks. This enables the connector to align visual tokens (from models such as ViTs) with textual embeddings, allowing the language model to adapt to visual inputs through the learnable projector while retaining its pre-trained knowledge. MiniGPT-4 (Zhu et al., 2023), Pali (Chen et al., 2023a), TinyLLaVa (Zhou et al., 2024) and MobileVLM (Chu et al., 2023) adopt this method. Some models, like MobileVLM-v2 (Chu et al., 2024), unfreeze the LLM during pre-training to enhance the model's in-context learning capabilities. The quality and diversity of the pre-training data are critical for advancing SVLM performance (He et al., 2024). Hence, high-quality data are constructed by filtering large datasets (e.g., CC-595K (Zhu et al., 2024b), LAION-GPT-4V (LAION/GPT-4V), SBU (Ordonez et al., 2011)) to ensure quality input.
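A minimal sketch of this alignment stage, using stand-in module names: the vision encoder and the language model are frozen, and only the projector's parameters are handed to the optimizer, so gradient updates touch a small fraction of the total weights.

```python
import torch
import torch.nn as nn

def setup_pretraining(vision_encoder: nn.Module,
                      projector: nn.Module,
                      language_model: nn.Module,
                      lr: float = 1e-3):
    """Freeze the two large components; train only the modality connector."""
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in language_model.parameters():
        p.requires_grad = False
    return torch.optim.AdamW(projector.parameters(), lr=lr)

if __name__ == "__main__":
    # Tiny stand-in modules purely for illustration.
    vision_encoder = nn.Linear(1024, 1024)
    projector = nn.Sequential(nn.Linear(1024, 2048), nn.GELU(), nn.Linear(2048, 2048))
    language_model = nn.Linear(2048, 32000)
    opt = setup_pretraining(vision_encoder, projector, language_model)
    trainable = sum(p.numel() for p in projector.parameters())
    total = trainable + sum(p.numel() for m in (vision_encoder, language_model)
                            for p in m.parameters())
    print(f"trainable fraction: {trainable / total:.2%}")
```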
4.2 Instruction Tuning

Instruction tuning is a training strategy that enhances a model's ability to follow user instructions. It involves fine-tuning the language model on instruction datasets to improve its zero-shot capabilities; in this way, models learn to generalize well to unseen tasks. Popular instruction datasets include LLaVa-Instruct (Liu et al., 2024), ALLaVa (Chen et al., 2024a) and other mixtures of datasets. Pali, MobileVLM, LLaVa-Phi, TinyGPT-V and MiniGPT-4 adopt this strategy for improved multimodal instruction-following capability.
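The sketch below shows, with toy token IDs, how an instruction-tuning sample is typically packed for causal-LM training: the instruction and the response are concatenated into one sequence, and the loss is masked (label -100, the usual ignore index) on the instruction part so that only response tokens are supervised. The chat template and token IDs are illustrative assumptions, not those of any particular SVLM or dataset.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def pack_instruction_sample(prompt_ids: List[int],
                            response_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and response; supervise only the response tokens."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

if __name__ == "__main__":
    # Toy IDs standing in for "<image> Describe the picture." / "A cat on a sofa."
    prompt_ids = [1, 901, 45, 78, 12]
    response_ids = [300, 301, 302, 2]
    ids, labels = pack_instruction_sample(prompt_ids, response_ids)
    print(ids)     # [1, 901, 45, 78, 12, 300, 301, 302, 2]
    print(labels)  # [-100, -100, -100, -100, -100, 300, 301, 302, 2]
```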
4.3 Fine-Tuning

Fine-tuning adapts pre-trained components of SVLMs to specific downstream applications such as VQA, image captioning and instruction following. This technique refines the model's pre-trained parameters using a targeted dataset to enhance its performance on a specific task while preserving generalization. Some models fine-tune only small components (e.g., projection layers), as demonstrated by MiniGPT-4 (Zhu et al., 2023). LLaVA-Phi (Zhu et al., 2024b) and Bunny (He et al., 2024) perform instruction fine-tuning on curated datasets for enhanced conversational and instruction-following abilities. TinyGPT-V (Yuan et al., 2023) adopts task-specific training, optimizing on multi-task datasets to address various vision-language challenges. Fine-tuning on high-resolution images can also improve fine-grained visual understanding, as implemented in PaLI-3 (Chen et al., 2023a). MobileVLM (Chu et al., 2023) and Flamingo (Alayrac et al., 2022) likewise adopt a fine-tuning strategy for improved task-specific performance.

4.4 Multi-task Learning

Multi-task Learning (MTL) trains a single model on multiple related tasks, harnessing shared representations to improve generalization. This is typically achieved by employing a shared backbone (vision encoder and language model) along with task-specific output heads. Techniques such as sequential curriculum learning, where tasks are introduced progressively, and unified optimization across diverse data formats help the model adapt to complex challenges such as visual question answering, image captioning and visual grounding. MobileVLM-v2 (Chu et al., 2024), TinyGPT-V (Yuan et al., 2023) and Pali (Chen et al., 2023a) adopt this strategy using multi-task mixtures of TextVQA, OCR, VQA, COCO Caption (Lin et al., 2014), SBU, Flickr30k (Young et al., 2014) and other datasets. Key aspects of multi-task training include mixing different datasets to provide a diverse set of training examples, using task-specific tokens to reduce ambiguity and improve task execution, and carefully designing the training process to balance the learning of different tasks.
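A minimal sketch of the shared-backbone pattern described above, with hypothetical module and task names: one encoder is reused across tasks and a small per-task head is selected at forward time. Real SVLMs usually route all tasks through the language model itself rather than separate heads, so this is only a schematic of the weight-sharing idea.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared backbone with task-specific output heads (schematic)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(768, feat_dim), nn.ReLU())
        self.heads = nn.ModuleDict({
            "vqa": nn.Linear(feat_dim, 3129),        # e.g. a VQA answer vocabulary
            "caption": nn.Linear(feat_dim, 32000),   # e.g. a caption token vocabulary
            "grounding": nn.Linear(feat_dim, 4),     # e.g. a bounding box
        })

    def forward(self, features, task: str):
        shared = self.backbone(features)
        return self.heads[task](shared)

if __name__ == "__main__":
    model = MultiTaskModel()
    x = torch.randn(2, 768)                          # fused multimodal features
    print(model(x, "vqa").shape)                     # torch.Size([2, 3129])
    print(model(x, "grounding").shape)               # torch.Size([2, 4])
```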
By combining robust pre-training with specialized fine-tuning strategies, SVLMs are able to achieve high performance across a range of vision-language tasks while remaining computationally efficient. Extended explanations, examples and additional experimental details are provided in Appendix B.

5 Qualitative Performance of SVLMs

Qualitative examples play a key role in understanding the capabilities of SVLMs. This analysis can reveal the strengths and weaknesses of these models in real-world scenarios, providing insights that go beyond quantitative benchmarks. Some key model performances are described below:

• MiniGPT-4 (Zhu et al., 2023) can identify objective elements within an image, offering detailed descriptions. It can also recognize memes as humorous by interpreting the underlying message, demonstrating humor interpretation capability.

• TinyGPT-V (Yuan et al., 2023) excels in delivering concise and accurate visual interpretations. In a test involving a game of hide-and-seek, TinyGPT-V gave a single, viable suggestion ("under couch"), unlike other models that gave multiple options, some of which were incorrect.

• Imp (Shao et al., 2024) models demonstrate skills such as code generation, math problem solving and medical image understanding. The Imp-3B model provides reasonable responses in these cases, showing its strength in vision-language understanding and reasoning, as well as completeness of knowledge.

• Imp-2B (Qwen-1.5) responds in the expected language for Chinese conversations, while other models fail to generate Chinese responses, highlighting the importance of multilingual LLMs in user-friendly LMMs.

• Mini-Gemini (Li et al., 2024b) can recognize plotted curves in graphical data and translate them into Python code, describe intricate elements within complex indoor scenes and demonstrate an understanding of character associations in memes.

• LLaVA-Phi (Zhu et al., 2024b) exhibits strong generalization ability in handling challenging questions, generating code based on instructions and solving mathematical problems.

• MobileVLM (Chu et al., 2023) performs well on benchmarks covering attribute understanding, spatial and relational reasoning, social and natural science, OCR, object recognition and word knowledge.

• Pali-3 (Chen et al., 2023a) provides excellent video QA results and achieves respectable video captioning results.

Some examples of the qualitative performance of SVLMs are shown in Appendix D.

6 Benchmark Evaluation

Evaluating SVLMs through benchmarks is crucial for assessing their capabilities relative to larger models. This evaluation identifies the strengths and weaknesses of different SVLMs and determines whether they can achieve performance comparable to larger models with reduced computational costs. This is particularly relevant for deployment on edge or mobile devices and in real-time applications, where efficiency matters. The existing benchmarks for SVLMs are the same as those used for larger models and can be broadly categorized into three groups based on the aspects they assess: Visual Reasoning, Comprehensive Multimodal Assessment and Specialized Tasks. The performance of SVLMs on several benchmarks is illustrated in Figure 4.

6.1 Visual Reasoning and Understanding

These benchmarks focus on the model's ability to understand and reason about visual content and answer questions, often requiring integration with basic textual queries. The VQA (Visual Question Answering) benchmark (Antol et al., 2015; Goyal et al., 2017) evaluates the ability to answer open-ended questions about images, challenging models to understand and process visual content in conjunction with textual questions. As shown in Fig. 4(a), PaLI-3 (Chen et al., 2023a), Imp-3B (Shao et al., 2024), Mipha (Zhu et al., 2024a) and Bunny-4B (He et al., 2024) perform notably well on VQA-v2, indicating strong capabilities in integrating visual content with language processing to effectively answer varied and complex questions, whereas Flamingo-3B (Alayrac et al., 2022) performs poorly. The GQA (Graph Question Answering) benchmark (Hudson and Manning, 2019) focuses on complex real-world visual reasoning with structured scene understanding, testing object recognition, spatial reasoning and compositional question answering. Mipha (Zhu et al., 2024a), MoE-LLaVA (Lin et al., 2024a), Imp-3B and Bunny-4B (He et al., 2024) perform relatively well on GQA, TinyLLaVA (Zhou et al., 2024) shows moderate performance, and TinyGPT-V (Yuan et al., 2023) struggles with this assessment. TextVQA (Singh et al., 2019) specifically evaluates how models read and understand text within images to answer questions, emphasizing OCR and text-based reasoning within visual contexts. The recently introduced Moondream-2B (moo, 2024), SmolVLM (Face, 2025) and Qwen2.5-VL (Team, 2025) show high performance, as shown in Fig. 4(c), indicating superior capabilities in synthesizing OCR with visual and language understanding.
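For reference, the VQA-style benchmarks above score open-ended answers against ten human annotations with a soft accuracy: an answer counts as fully correct if at least three annotators gave it. A simplified version of that metric is sketched below (the official implementation additionally normalizes answer strings and averages over subsets of nine annotators).

```python
from typing import List

def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Soft VQA accuracy: min(#matching annotators / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

if __name__ == "__main__":
    answers = ["red", "red", "red", "dark red", "red",
               "red", "maroon", "red", "red", "red"]
    print(vqa_accuracy("red", answers))      # 1.0
    print(vqa_accuracy("maroon", answers))   # 0.333...
```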
6.2 Comprehensive Multimodal Assessment

Rather than evaluating only selected aspects, some benchmarks are designed to assess a wide range of multimodal capabilities, including perception, cognition and reasoning across different modalities. They often include multiple sub-tasks that test different aspects of a model's ability to process and integrate information from both visual and textual inputs. MMBench (Liu et al., 2025) provides a broad evaluation of 20 capabilities using a structured and hierarchical approach, involving fine-grained perception and complex reasoning tasks. Bunny-4B (He et al., 2024), Imp-3B (Shao et al., 2024) and Qwen2.5-VL (Team, 2025) exhibit top-tier performance, while LLaVa-Phi (Zhu et al., 2024b) and Mini-Gemini (Li et al., 2024b) score lower on this benchmark. MME (Multimodal Model Evaluation) (Fu et al., 2024) measures both perception (such as object recognition and OCR) and cognition (including commonsense reasoning and numerical calculation) abilities of VLMs across 14 diverse sub-tasks. Bunny-4B (He et al., 2024) shows exceptional performance, demonstrating its strong ability to handle diverse and challenging multimodal tasks, while LLaVa-Phi and Mini-Gemini (Li et al., 2024b) perform less effectively than the others. MM-Vet (Multimodal Multitask Vision-Language Evaluation Test) (Yu et al., 2023)

Figure 4: Performance of SVLM models on benchmarks: (a) VQA-v2, (b) GQA, (c) TextVQA, (d) ScienceQA, (e) POPE, (f) MM-Vet, (g) MMBench and (h) MME.
Limitations

While this survey provides a comprehensive analysis of Small Visual Language Models (SVLMs), it has several limitations. The field is rapidly evolving, and emerging models may introduce novel architectures or evaluation techniques not covered here. This study primarily focuses on general-purpose SVLMs, overlooking specialized models for domains like medical imaging or industrial automation. Benchmarking comparisons highlight model performance, but the absence of standardized efficiency metrics (e.g., FLOPs, inference time, energy consumption) limits real-world deployability assessments. Additionally, future directions like neuromorphic computing, federated learning and adaptive architectures require further study to enhance scalability and sustainability. Due to the page limit of this paper, some discussions are constrained, and addressing these gaps will be essential for advancing SVLM research and development.

References

2024. Moondream AI.

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.

Marco Bellagente, Jonathan Tow, Dakota Mahan, Duy Phung, Maksym Zhuravinskyi, Reshinth Adithyan, James Baicoianu, Ben Brooks, Nathan Cooper, Ashish Datta, et al. 2024. Stable LM 2 1.6B technical report. arXiv preprint arXiv:2402.17834.

Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. 2021. High-performance large-scale image recognition without normalization. In International Conference on Machine Learning, pages 1059–1071. PMLR.

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. 2024a. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684.

Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, et al. 2024b. Evlm: An efficient vision-language model for visual understanding. arXiv preprint arXiv:2407.14177.

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. 2023a. Pali-3 vision language models: Smaller, faster, stronger. Preprint, arXiv:2310.09199.

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. 2023b. Pali: A jointly-scaled multilingual language-image model. Preprint, arXiv:2209.06794.

Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. 2023. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886.

Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. 2024. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766.

Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. 2021. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint, arXiv:1412.3555.
Patrycja Cieplicka, Julia Kłos, and Maciej Morawski. 2024. Visionqaries at mediqa-magic 2024: Small vision language models for dermatological diagnosis. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science LNCS, Grenoble, France.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint, arXiv:1810.04805.

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. 2024. Mme: A comprehensive evaluation benchmark for multimodal large language models. Preprint, arXiv:2306.13394.

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. 2021. Clip-adapter: Better vision-language models with feature adapters. Preprint, arXiv:2110.04544.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2020. Mask r-cnn. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):386–397.

Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. 2024. Efficient multimodal learning from data-centric perspective. Preprint, arXiv:2402.11530.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. Preprint, arXiv:2102.05918.

Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, and Fahad Shahbaz Khan. 2024. Effectiveness assessment of recent large vision-language models. Visual Intelligence, 2(1).

LAION/GPT-4V. Dataset at huggingface. https://huggingface.co/datasets/laion/gpt4v-dataset.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. Preprint, arXiv:1908.03557.

Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, and Xiao Xiang Zhu. 2024a. Vision-language models in remote sensing: Current progress and future trends. Preprint, arXiv:2305.05726.

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. 2024b. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814.
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023b. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023c. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463.

Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. 2024a. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947.

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. 2024b. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in Neural Information Processing Systems, 36.

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2025. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer.

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.

Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. 2023. Cheap and quick: Efficient vision-language instruction tuning for large language models. Preprint, arXiv:2305.15023.

Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24.

Fang Peng, Xiaoshan Yang, Linhui Xiao, Yaowei Wang, and Changsheng Xu. 2023. Sgva-clip: Semantic-guided visual adapting of vision-language models for few-shot image classification. Preprint, arXiv:2211.16191.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.

David E. Rumelhart and James L. McClelland. 1987. Learning Internal Representations by Error Propagation, pages 318–362.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.

Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, and Jiajun Ding. 2024. Imp: Highly capable large multimodal models for mobile devices. arXiv preprint arXiv:2405.12107.

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326.

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. 2023. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389.

Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR.

Mingxing Tan, Ruoming Pang, and Quoc V Le. 2020. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790.

Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, et al. 2022. Ul2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
Qwen Team. 2025. Qwen2.5-vl.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. 2020a. Minivlm: A smaller and faster vision-language model. arXiv preprint arXiv:2012.06946.

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. 2024. Cogvlm: Visual expert for pretrained language models. Preprint, arXiv:2311.03079.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020b. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2022. Simvlm: Simple visual language model pretraining with weak supervision. Preprint, arXiv:2108.10904.

Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. 2025. Vary: Scaling up the vision vocabulary for large vision-language model. In European Conference on Computer Vision, pages 408–424. Springer.

Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. 2024. Small language model meets with reinforced vision vocabulary. Preprint, arXiv:2401.12503.

Zhenlin Xu, Yi Zhu, Tiffany Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, and Davide Modolo. 2024. Benchmarking zero-shot recognition with vision-language models: Challenges on granularity and specificity. Preprint, arXiv:2306.16048.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2023. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.

Zhengqing Yuan, Zhaoxu Li, Weiran Huang, Yanfang Ye, and Lichao Sun. 2023. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862.

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986.

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. 2024a. Vision-language models for vision tasks: A survey. Preprint, arXiv:2304.00685.

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024b. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588.

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, and Yang You. 2024. A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms. Preprint, arXiv:2412.03324.

Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. 2024. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289.

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.

Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, and Jian Tang. 2024a. Mipha: A comprehensive overhaul of multimodal assistant with small language models. CoRR.

Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. 2024b. Llava-phi: Efficient multi-modal assistant with small language model. In Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited, pages 18–22.

Appendix

A Recent SVLM Models

A brief overview of the SVLM models discussed above, their published year, parameter size, architectural components and other information is summarized in Table 2.
Model | Year | Parameters | Vision Encoder | Language Model | Connector
MiniVLM | Aug 2021 | 53.2M, FLOPs 6.7B | TEE: EfficientNet + BiFPN | MiniLM | RPN + RoIAlign + 2 linear layers
Flamingo-3B | Nov 2022 | 3.2B | NFNet-F6 | Chinchilla-1.4B | Perceiver Resampler + Gated XATTN-Dense layers
Pali-3 | Oct 2023 | 5B | SigLIP ViT G/14 | UL2 Transformer | Linear Projector
MobileVLM | Dec 2023 | 1.7B, 3B | CLIP ViT-L/14 | MobileLLaMA | Lightweight Downsample Projector (LDP)
MobileVLM-v2 | Feb 2024 | 1.7B, 3B | CLIP ViT-L/14 | MobileLLaMA | Upgraded LDPv2
LLaVa-Phi | Feb 2024 | 3B | CLIP ViT-L/14 | Phi-2 | MLP Projector
TinyLLaVa | Feb 2024 | 1.5B / 3.1B | SigLIP-L/14 | TinyLlama / Phi-2 | 2-layer MLP
Mipha | March 2024 | Mipha-1.6B / Mipha-2.4B / Mipha-3B | SigLIP ViT L/14 | Phi-1.5 / Gemma-2B / Phi-2 | 2-layer MLP
Mini-Gemini | March 2024 | 2B | CLIP-ViT + ConvNeXt | Gemma-2B | Linear Projector
Imp | May 2024 | 3B / 2B | SigLIP-SO-L/14 | Phi-2 / Qwen-1.8B | 2-layer MLP
TinyGPT-V | June 2024 | 3.7B | EVA ViT-L/14 | Phi-2 | Q-Former with projection layers
Bunny-4B | July 2024 | 4B | SigLIP-SO | Phi-3-Mini | 2-layer MLP
MoE-LLaVa | Dec 2024 | 2.9B (active 2B) / 3.1B (active 2.2B) / 5.3B (active 3.6B) | CLIP-Large | StableLM-1.6B / Qwen-1.8B / Phi-2 | 2-layer MLP
Moondream | Jan 2025 | 2B, 0.5B | SigLIP | Phi-1.5 | –
SmolVLM | Nov 2024 / Jan 2025 | SmolVLM-2B / SmolVLM-0.5B / SmolVLM-0.25B | SigLIP-SO-L/14 (2B); SigLIP base patch 16/512 (0.5B, 0.25B) | SmolLM2 | Linear Projector
Qwen2.5-VL | Jan 2025 | 3B | ViT-675M | Qwen-2.5 LLM | MLP Projector

Table 2: Brief overview of SVLM models: published year, parameter size, architectural components and training datasets.
… fine-tunes both vision and language modules.

B.2 Multi-task Learning Strategies

Multi-task learning (MTL) enables a single model to learn from multiple related tasks, leveraging shared representations to boost overall generalization. Key strategies include:

1. Shared Backbone with Task-Specific Heads: A common architecture (shared vision encoder and language model) is extended with additional output layers tailored to each task. For example, TinyGPT-V (Yuan et al., 2023) uses this approach for tasks such as VQA, image captioning and referring expression comprehension.

2. Sequential Curriculum Learning: Tasks are introduced progressively, beginning with simpler ones to build a strong foundational understanding before moving to complex tasks. CogAgent (Hong et al., 2024) applies this method to gradually enhance its capabilities.

3. Task-Specific Data Handling: Despite differences in data formats across tasks, a unified optimization strategy is employed. Flamingo-3B (Alayrac et al., 2022) uses gated cross-attention to condition on various tasks, ensuring that each task is effectively learned without compromising overall performance.

B.3 Detailed Model-Based Data Filtering Techniques

High-quality training data is vital for robust model performance. Model-based data filtering techniques help refine large-scale datasets by reducing noise and enhancing relevance:

1. Pseudo-Label Generation: Pre-trained models generate high-quality annotations to serve as pseudo-labels, thereby augmenting weakly labeled datasets. MiniVLM (Wang et al., 2020a) exemplifies this method.

2. Dataset Condensation: Techniques such as k-means clustering and graph-based pruning are used to reduce dataset size while preserving essential information. Bunny (He et al., 2024) uses these methods to condense large datasets like LAION-2B. A toy clustering-based selection is sketched after this list.

3. Attention Map-Guided Pruning: Attention maps from smaller pre-trained models identify and remove non-essential tokens, improving the signal-to-noise ratio. Small Guides Large (SGL) (Zhao et al., 2024) demonstrates this strategy.

4. Contrastive and Knowledge Distillation Losses: These losses are used to align visual and textual features, further refining the training dataset. SgVA-CLIP (Peng et al., 2023) integrates both techniques to enhance performance, particularly in few-shot learning scenarios.
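Below is a small sketch of the clustering-based condensation idea from item 2: embed the samples (here random vectors stand in for image or caption embeddings), cluster them with k-means, and keep only the example nearest to each centroid. The embedding source and keep ratio are assumptions for illustration; Bunny's actual pipeline combines this with graph-based filtering.

```python
import numpy as np
from sklearn.cluster import KMeans

def condense_by_kmeans(embeddings: np.ndarray, n_keep: int) -> np.ndarray:
    """Return indices of the samples closest to each of n_keep cluster centroids."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(embeddings)
    keep = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        keep.append(members[int(np.argmin(dists))])
    return np.array(sorted(keep))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(10_000, 256))    # stand-in for CLIP/SigLIP embeddings
    selected = condense_by_kmeans(feats, n_keep=500)
    print(selected.shape)                     # (500,)
```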
B.4 Pre-Training Methods

Pre-training is crucial for developing robust cross-modal representations in SVLMs. This phase extracts rich semantic features from large-scale datasets while maintaining computational efficiency. Key techniques include:

• Vision-Language Alignment: Lightweight adapters, such as linear projection layers or Q-Formers, align visual tokens from models like ViT with textual embeddings. BLIP-2 (Li et al., 2023a) and MiniGPT-4 (Zhu et al., 2023) effectively employ these methods.

• Contrastive Learning: This technique maximizes similarity between semantically related image-text pairs while minimizing similarity for unrelated pairs. PaLI-3 (Chen et al., 2023a) leverages SigLIP-based contrastive objectives for effective cross-modal alignment; a toy sigmoid-loss sketch follows this list.

• Pseudo-Label Generation: To overcome the challenges of limited annotations, pre-trained captioning models are used to generate pseudo-labels, as implemented in MiniVLM (Wang et al., 2020a).

• Masked Multimodal Modeling: By masking portions of both visual and textual inputs, the model is trained to predict the missing tokens, promoting bidirectional understanding. VinVL (Zhang et al., 2021) is an example of this approach.

• Dataset Preprocessing: Rigorous filtering of large-scale datasets (e.g., CC-595K (Zhu et al., 2024b) and LAION (Schuhmann et al., 2022)) ensures high-quality inputs for pre-training.

• Instruction Fine-Tuning: Models are fine-tuned on curated instruction datasets to improve their conversational and directive capabilities, as seen in LLaVA-Phi (Zhu et al., 2024b) and Bunny (He et al., 2024).

• Dynamic Layer Updates: Gradually unfreezing model components during fine-tuning balances stability and performance enhancement. MobileVLM V2 (Chu et al., 2024) demonstrates this approach effectively.
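A simplified sketch of the pairwise sigmoid objective referenced in the Contrastive Learning bullet, in the spirit of SigLIP: matching image-text pairs on the diagonal get label +1, all other pairs -1, and each pair contributes an independent log-sigmoid term. The temperature and bias are treated as plain constants here, and the batch size and embedding dimension are arbitrary; this is not the exact SigLIP implementation.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in the batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = t * img @ txt.t() + b                       # (B, B) pair logits
    labels = 2.0 * torch.eye(img.size(0)) - 1.0          # +1 on diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    img = torch.randn(8, 512)
    txt = img + 0.1 * torch.randn(8, 512)                # roughly matched pairs
    print(float(sigmoid_contrastive_loss(img, txt)))
```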
Table 3: A Brief Comparison of Training Strategies for SVLMs

Strategy | Description | Key Techniques | Example Models
Transfer Learning | Utilizes pre-trained vision encoders and language models, adapting them to specific tasks by fine-tuning certain components. | Lightweight adapters (e.g., linear layers, Q-Formers); two-stage pipelines (pre-train + task-specific fine-tuning); frozen or partially frozen layers; gradual unfreezing for better alignment; domain-specific fine-tuning | BLIP-2: uses frozen encoders and Q-Formers to bridge modalities. MiniGPT-4: fine-tunes a linear projection layer between frozen components. LLaVA-Phi: pretrains projection layers, followed by instruction tuning. PaLI-3: employs contrastive pretraining for cross-modal alignment. MobileVLM: fine-tunes LLMs and projectors for edge devices. VinVL: fine-tunes object detection models to produce rich vision-language embeddings.
Model-based Data Filtering | Curates and filters datasets to enhance quality and relevance, ensuring models train on meaningful and noise-free data. | Pseudo-label generation; dataset condensation using clustering/graph pruning; attention map-guided pruning; knowledge distillation and contrastive losses; mixed text-only and multimodal task datasets; prompt tuning | MiniVLM: uses pre-trained captioning models to generate pseudo-labels. Bunny: condenses datasets via k-means and graph-based filtering. SGL: uses attention maps to prune datasets. SgVA-CLIP: aligns visual features with semantics through contrastive loss. LaVIN: filters datasets dynamically for mixed text-image and text-only tasks. FEWVLM: curates data for few-shot learning using prompt templates.
These detailed methodologies underscore how SVLM training frameworks efficiently combine robust pre-training with targeted fine-tuning protocols, enabling SVLMs to excel on diverse vision-language tasks even under resource constraints.
Benchmark | Year | Properties | Tasks
VQA (Antol et al., 2015) | 2015 | 0.25M images, 0.76M questions, 10M answers with 3 plausible incorrect answers per question | Recognition, Object Detection, Knowledge Integration, Commonsense Reasoning
VQA-v2 (Goyal et al., 2017) | 2017 | 443K training, 214K validation, 453K test questions, balanced image pairs | Enhanced Visual Understanding, Reduced Language Bias, Fine-grained Visual Discrimination
GQA (Hudson and Manning, 2019) | 2019 | 113K images, 22M questions, structured scene graphs | Spatial Reasoning, Logical Inference, Attribute Recognition, Compositional Understanding
TextVQA (Singh et al., 2019) | 2019 | 28,408 images, 45,336 questions from OpenImages | OCR Integration, Text Comprehension, Visual-Textual Reasoning
ScienceQA (Lu et al., 2022) | 2022 | 21,208 questions, 48.7% image context, 90.5% detailed explanations | Scientific Reasoning, Multimodal Understanding, Zero-shot Generalization
POPE (Li et al., 2023b) | 2023 | Binary questions on COCO dataset subsets (random, popular, adversarial) | Object Hallucination Detection, Visual Accuracy, Robustness Testing
MMBench (Liu et al., 2025) | 2023 | 20 distinct abilities across perception and reasoning tasks, bilingual support | Hierarchical Capability Assessment, Fine-grained Perception, Multi-level Reasoning
MME (Fu et al., 2024) | 2023 | 14 sub-tasks covering perception and cognition | OCR, Object Recognition, Commonsense Reasoning, Numerical Calculation
MM-Vet (Yu et al., 2023) | 2023 | 200 images, 218 questions, 16 capability combinations | Integrated Vision-Language Capabilities, Spatial Understanding, Knowledge Application
Model | VQA-v2 | GQA | POPE | TextVQA | MMBench | ScienceQA | MME | MM-Vet
LLaVA-Phi | 71.4 | – | 85.0 | 48.6 | 59.8 | 68.4 | 1335.1 | 28.9
MobileVLM | – | 59.0 | 84.9 | 47.5 | 59.6 | 61.2 | 1288.9 | –
MobileVLM-v2 | – | 61.1 | 84.7 | 57.5 | 63.2 | 70.0 | 1440.5 | –
TinyLLaVA | 79.9 | 62.0 | 86.4 | 59.1 | 66.9 | 69.1 | 1464.9 | 32.0
TinyGPT-V | – | 38.9 | – | – | – | – | – | –
MiniVLM | 69.09 | – | – | – | – | – | – | –
PaLI-3 | 85.2 | – | – | – | – | – | – | –
MoE-LLaVA | 79.9 | 62.6 | 85.7 | 57.0 | 68.0 | 70.3 | 1431.3 | 35.9
Mipha | 81.3 | 63.9 | 86.9 | 56.6 | 69.7 | 70.9 | 1488.9 | 32.1
Mini-Gemini | – | – | – | 56.2 | 59.8 | – | 1341 | 31.1
Bunny 4B | 82.1 | 63.2 | 87.2 | – | 75.7 | 78.3 | 1581.5 | –
Flamingo-3B | 57.1 | – | – | 32.7 | – | – | – | –
Imp-2B | 79.2 | 61.9 | 86.7 | 54.5 | 63.8 | 66.1 | 1304.8 | 33.5
Imp-3B | 81.2 | 63.5 | 88 | 59.8 | 72.9 | 72.8 | 1446.4 | 43.3
Moondream-0.5B | – | – | 85.1 | 68.92 | – | – | – | –
Moondream-2B | – | – | 89.8 | 73.4 | – | – | – | –
SmolVLM | – | – | – | 72.1 | – | 84.5 | – | –
Qwen2.5-VL | – | – | – | 79.3 | 77.6 | – | – | –
Figure 11: Comprehensive skill demonstrations of Imp, including math problem solving, code generation, multilingual conversation and medical image understanding.
Figure 12: Qualitative analysis of Mini-Gemini in visual understanding.