SVLM Survey For ACL 2025
Figure 1: A timeline of the recently introduced small-scale VLMs by the research community (MiniVLM, Pali-3, MobileVLM, MobileVLM v2, TinyLLaVa, Mipha, Imp, Bunny, Moondream 2B/0.5B, SmolVLM and Qwen2.5-VL, spanning 08-2021 to 01-2025).
2023). Mini-Gemini (Li et al., 2024b) combines a CNN with a ViT, making a dual-encoder system to process both low- and high-resolution images.

2.2 Small Language Model (SLM)

Instead of the larger LLMs, small VLMs use smaller LLMs to reduce computational cost. To avoid the cost of training from scratch, the normal tendency is to use lightweight pre-trained, open-source LLMs in the VLMs. SVLMs require compact, efficient language models to ensure fast inference and low memory usage. The most widely used small-scale language models are Phi-1.5 (Li et al., 2023c), Phi-2 (Javaheripi et al., 2023), TinyLLaMA (Zhang et al., 2024b), StableLM-2 (Bellagente et al., 2024), MobileLLaMA (Chu et al., 2023), a downscaled version of LLaMA (Touvron et al., 2023), Chinchilla-1.4B (Hoffmann et al., 2022), Qwen-1.8B (Bai et al., 2023), FLAN-T5 (Chung et al., 2024) and MiniLM (Wang et al., 2020b). A brief overview of these LLMs is given in Table 1. These models range from 1.1B to 2.7B parameters, making them much smaller than standard LLMs (e.g., LLaMA-7B (Touvron et al., 2023), GPT-4 (Achiam et al., 2023)). Their compact size makes them efficient for edge AI and mobile applications. Many of these models (Phi-1.5, Phi-2, StableLM-2, MobileLLaMA) are trained to run on low-power devices, and they avoid the massive token-generation overhead common in larger LLMs. They also integrate well with CLIP, SigLIP and other ViT-based encoders.
Model | Published | Parameters
MiniLM | 2020 | 30M to 300M
Chinchilla-1.4B | 2022 | 1.4B
FLAN-T5 | 2023 | 1.3B-2.7B
Qwen-1.8B | Aug 2023 | 1.8B
Phi-1.5 | Nov 2023 | 1.3B
Phi-2 | Dec 2023 | 2.7B
TinyLLaMA | Dec 2023 | 1.1B
StableLM-2 | Jan 2024 | 1.6B
MobileLLaMA | Jan 2024 | 1.4B-2.7B

Table 1: Summary of the most widely used small-scale LLMs in SVLMs.

2.3 Modality Connector

This component projects the visual features into the text embedding space understandable to the SLM. These learnable connectors or projectors in SVLMs are designed to reduce the number of visual tokens efficiently before feeding them to the language model to speed up processing (Chu et al., 2023). Efficiency is key for small models, so the projector must be lightweight, occupying less than 1% of the parameters with minimal computational cost. A common type of projector is the Multi-Layer Perceptron (MLP), often a two-layer network, which is a simple yet effective way to map visual features into the dimension of the text embedding space. The MLP typically uses GELU (Lin et al., 2024a) or SwiGLU (Team, 2025) activation functions. Some models use a linear layer (Li et al., 2024b; Chen et al., 2023a) or a series of linear projection layers (Face, 2025) to bridge the dimensionality gap between the visual encoder and the LLM embedding space, while others upgrade to the Lightweight Downsample Projector (LDP) (Chu et al., 2023, 2024), designed to reduce visual tokens and enhance positional information using depth-wise convolutions. Another popular projector is the Querying Transformer (Q-Former) (Li et al., 2023a; Yuan et al., 2023), which uses trainable query tokens to interact with image features and selectively retrieve relevant information (Figure 2). This architecture efficiently extracts task-relevant information from images, improving vision-language understanding.

Figure 2: Architecture and Workflow of Q-Former (diagram components: image/visual features, learned queries, self-attention and feed-forward layers, fully connected output).
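To make the connector designs above concrete, the following is a minimal PyTorch sketch of two common projector styles: a two-layer MLP with GELU that maps vision-encoder features into the LLM embedding space, and a Q-Former-style module that compresses the visual sequence with a small set of learned queries via attention. The dimensions, layer counts and module names are illustrative assumptions, not the exact configuration of any specific SVLM.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP projector: vision features -> LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):          # (B, N, vision_dim)
        return self.net(visual_tokens)         # (B, N, llm_dim)

class LearnedQueryResampler(nn.Module):
    """Q-Former-style compression: a fixed set of learned queries attends to
    the visual tokens, so the LLM sees only `num_queries` tokens regardless
    of the input resolution."""
    def __init__(self, vision_dim=1024, llm_dim=2560, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.out = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens):          # (B, N, vision_dim)
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, visual_tokens, visual_tokens)
        return self.out(pooled)                # (B, num_queries, llm_dim)

if __name__ == "__main__":
    feats = torch.randn(2, 576, 1024)            # e.g. a 24x24 patch grid from a ViT
    print(MLPProjector()(feats).shape)           # torch.Size([2, 576, 2560])
    print(LearnedQueryResampler()(feats).shape)  # torch.Size([2, 64, 2560])
```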
3 Categorizing SVLM Models

Recent research on small-scale VLMs has led to the development of various architectures with varying configurations of vision encoders, language models and modality connectors. These models focus on reducing computational overhead and improving inference speed, while maintaining multimodal reasoning capabilities and balancing the trade-off between model size and performance. SVLMs can broadly be classified by their vision backbone into two main categories, CNN-based models and ViT-based models, where ViT-based models can be further divided into two subcategories based on the ViT architecture. A detailed dendrogram of SVLM categorization is illustrated in Figure 3.

Figure 3: Categorization of SVLMs based on Vision Encoder (CNN-based: MiniVLM, Flamingo-3B, Mini-Gemini; ViT-based with CLIP: TinyGPT-V, LLaVa-Phi, MobileVLM, MobileVLM-v2, MoE-LLaVa; ViT-based with SigLIP: Mipha, Pali-3, TinyLLaVa, Bunny-4B, Moondream, SmolVLM, Imp).

3.1 Models with CNN Backbone

Earlier SVLMs leveraged CNN-based vision encoders due to their efficiency in image recognition tasks. Two such small-scale models, published before 2023, are MiniVLM (Wang et al., 2020a) and Flamingo-3B (Alayrac et al., 2022). The MiniVLM architecture employs a Two-stage Efficient feature Extractor (TEE) for visual feature extraction and MiniLM (Wang et al., 2020b) as the language model. The TEE module uses an EfficientNet (Tan and Le, 2019) with BiFPN as the backbone, based on EfficientDet (Tan et al., 2020), which applies depthwise and pointwise convolutions for lightweight processing and reduced model size. The backbone is followed by a region proposal network (RPN) (Ren et al., 2016) and an RoIAlign operation (He et al., 2020) for visual region feature extraction with reduced computational complexity while ensuring robust multimodal alignment. Similarly, Flamingo (Alayrac et al., 2022) is a family of VLMs developed by DeepMind, whose 3B-parameter member is the small-scale one. It integrates a frozen ResNet-based NFNet backbone (Brock et al., 2021) and a Perceiver Resampler to refine visual features before fusion with a Chinchilla-based language model (Hoffmann et al., 2022). Despite its strength in open-ended tasks, Flamingo has limitations, such as sensitivity to prompt design, computational intensity for long sequences and relatively weaker classification performance compared to contrastive models. CNN-based SVLMs are gradually being outperformed by ViT-based models, as transformers offer superior long-range dependency modeling and global feature extraction. The shift towards ViT-based encoders has been driven by the increasing reliance on pre-trained contrastive vision-language models (CLIP, SigLIP), which provide stronger alignment between vision and language embeddings.

3.2 Models with ViT Backbone

The transition from CNN to ViT-based encoders marks a crucial trend in SVLM development after 2023. Vision Transformers (ViTs) offer a more expressive feature representation, making them better suited for multimodal reasoning tasks. ViT-based SVLMs fall into three major groups: models utilizing CLIP encoders, those leveraging SigLIP encoders and those adopting other ViT architectures.

Models with CLIP Encoders: A large number of SVLMs initially adopted CLIP-based encoders due to their robust contrastive pretraining on large-scale image-text pairs. LLaVA-Phi (Zhu et al., 2024b), MobileVLM (Chu et al., 2023) and MobileVLM-v2 (Chu et al., 2024) integrated CLIP ViT-L/14 (Radford et al., 2021) as the vision backbone, leveraging its zero-shot capabilities and pre-trained multimodal alignment. LLaVa-Phi incorporates the Phi-2 LLM, while the other two use the downsized MobileLLaMA language model (Touvron et al., 2023). These SVLMs introduced lightweight projection layers (e.g., MLPs, LDP) to efficiently map visual embeddings into the language model's hidden space. MobileVLM (Chu et al., 2023), optimized for mobile and edge devices, introduced a Lightweight Downsample Projector (LDP) that uses convolution with stride 2 to reduce the number of visual tokens by 75%. MobileVLM-v2 (Chu et al., 2024) upgraded the projector to LDPv2, using point-wise convolution, average pooling for token reduction and PEG (Chu et al., 2021) with a skip connection for enhanced positional information; this reduces the projector's parameters by 99.8% compared to LDP, making inference faster. Despite their efficiency, CLIP-based SVLMs face limitations in fine-grained vision-language understanding, particularly in text recognition and detailed spatial reasoning, which has led to a shift towards models adopting SigLIP-based encoders.
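As a rough illustration of the token-reduction idea behind LDP, the sketch below reshapes a ViT patch sequence back into its 2D grid and applies a depth-wise convolution with stride 2 followed by a point-wise projection, which cuts the token count by a factor of 4 (a 75% reduction). This is a simplified approximation under assumed dimensions, not the exact LDP/LDPv2 design from MobileVLM.

```python
import math
import torch
import torch.nn as nn

class DownsampleProjector(nn.Module):
    """Sketch of an LDP-style connector: depth-wise stride-2 convolution over
    the 2D patch grid (4x fewer tokens), then a point-wise projection into
    the LLM embedding dimension."""
    def __init__(self, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.depthwise = nn.Conv2d(vision_dim, vision_dim, kernel_size=3,
                                   stride=2, padding=1, groups=vision_dim)
        self.pointwise = nn.Conv2d(vision_dim, llm_dim, kernel_size=1)

    def forward(self, tokens):                       # (B, N, C), N = H*W
        b, n, c = tokens.shape
        h = w = int(math.isqrt(n))                   # assume a square patch grid
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.pointwise(self.depthwise(x))        # (B, llm_dim, H/2, W/2)
        return x.flatten(2).transpose(1, 2)          # (B, N/4, llm_dim)

if __name__ == "__main__":
    vit_tokens = torch.randn(2, 576, 1024)           # 24x24 patches
    print(DownsampleProjector()(vit_tokens).shape)   # torch.Size([2, 144, 2048])
```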
Models with SigLIP Encoders: Recently, a significant number of SVLMs have transitioned to SigLIP-based encoders, which refine the contrastive learning approach by introducing a more stable loss function and better multimodal feature representation. Moondream (moo, 2024), TinyLLaVA (Zhou et al., 2024), Mipha (Zhu et al., 2024a), Imp (Shao et al., 2024) and Bunny-4B (He et al., 2024) integrated SigLIP ViT-L/14 or ViT-G variants, leading to more accurate multimodal alignment while reducing computational overhead. For example, TinyLLaVA combined SigLIP-L/14 with Phi-2, achieving superior performance in instruction-tuning tasks. Moondream-2B, one of the smallest yet effective SVLMs, optimized SigLIP models with quantization techniques, reducing the memory footprint for deployment in edge AI applications. Mipha extended the effectiveness of SigLIP ViT-G by incorporating smaller language models like Phi-1.5 (Li et al., 2023c) and Gemma-2B (Team et al., 2024), demonstrating that large vision encoders can still be paired with compact LLMs for efficiency. Hugging Face released the SmolVLM (Face, 2025) series, including 2B, 0.5B and 0.25B versions, the last being the smallest VLM available; the smaller models use a 93M-parameter SigLIP base patch-16/512 encoder. These models are optimized for on-device deployment, including laptops and potentially web browsers, due to their minimal memory requirements and efficient processing capabilities.

3.3 Categorization on Model Efficiency

SVLMs can also be classified by model size efficiency and deployment scalability as follows:

Memory-Optimized SVLMs: These models, including Moondream-0.5B (moo, 2024), TinyLLaVA-1.5B (Zhou et al., 2024) and SmolVLM-0.5B/0.25B (Face, 2025), focus on extreme compression techniques such as quantization, low-bit encoding and minimal architectural components to enable deployment on mobile, IoT and embedded systems while maintaining competitive multimodal capabilities.
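As a toy illustration of the kind of compression these memory-optimized models rely on, the sketch below applies symmetric 8-bit weight quantization to a weight matrix: values are stored as int8 plus one floating-point scale per output row, roughly a 4x memory saving over float32. This is a didactic example only; deployed models such as Moondream use their own quantization pipelines.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-row int8 quantization of a weight matrix of shape (out, in)."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0     # one scale per output row
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

if __name__ == "__main__":
    w = torch.randn(2560, 1024)                  # a hypothetical projector weight
    q, scale = quantize_weight_int8(w)
    w_hat = dequantize(q, scale)
    fp32_bytes = w.numel() * 4
    int8_bytes = q.numel() * 1 + scale.numel() * 4
    print(f"compression: {fp32_bytes / int8_bytes:.2f}x, "
          f"max abs error: {(w - w_hat).abs().max():.4f}")
```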
Balanced Size and Performance Models: These models include MobileVLM-v2 (Chu et al., 2024), Bunny-4B (He et al., 2024) and Mipha (Zhu et al., 2024a), which balance accuracy and computational efficiency by leveraging moderate-sized encoders and small LLMs, ensuring usability across diverse applications from real-time dialogue systems to AR/VR.

High-Performance Compact SVLMs: These models, including MoE-LLaVA (Lin et al., 2024a) and Pali-3 (Chen et al., 2023a), are larger in size. MoE-LLaVa introduces a sparse VLM framework based on the Mixture of Experts (MoE) approach (Jacobs et al., 1991), with excellent multimodal understanding and hallucination mitigation abilities. These models retain large-model capabilities while optimizing computational costs, making them suitable for scientific research, enterprise AI and high-fidelity multimodal reasoning.

A brief overview of the recent SVLM models is summarized in Table 2.

4 Training Strategy

The training methods for SVLMs are designed to balance computational efficiency with high performance by leveraging innovative strategies. They matter because they can help compensate for the reduction in model size and improve overall performance: better training recipes and quality data allow smaller LMMs to achieve on-par performance with larger models (Zhou et al., 2024). The core training schemes are pre-training, instruction tuning, fine-tuning and multi-task learning. Models usually employ several of these schemes together to enable effective adaptation of SVLMs to diverse real-world tasks.

4.1 Pre-Training

Pre-training is a crucial stage that aims for vision-language alignment. The primary objective is to train the modality connector on large-scale image-text data to learn cross-modal representations, preparing the model for subsequent fine-tuning on specific downstream tasks. This enables the connector to align visual tokens (from models such as ViTs) with textual embeddings, allowing the language model to adapt to visual inputs through the learnable projector while retaining its pre-trained knowledge. MiniGPT-4 (Zhu et al., 2023), Pali (Chen et al., 2023a), TinyLLaVa (Zhou et al., 2024) and MobileVLM (Chu et al., 2023) adopt this method. Some models, like MobileVLM-v2 (Chu et al., 2024), unfreeze the LLM during pre-training to enhance the model's in-context learning capabilities. The quality and diversity of the pre-training data are critical for advancing SVLM performance (He et al., 2024). Hence, high-quality data are constructed by filtering large datasets (e.g., CC-595K (Zhu et al., 2024b), LAION-GPT-4V (LAION/GPT-4V), SBU (Ordonez et al., 2011)) to ensure quality input.
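A minimal sketch of this alignment stage, using stand-in module names: the vision encoder and the language model are frozen, and only the projector's parameters are handed to the optimizer, so gradient updates touch a small fraction of the total weights.

```python
import torch
import torch.nn as nn

def setup_pretraining(vision_encoder: nn.Module,
                      projector: nn.Module,
                      language_model: nn.Module,
                      lr: float = 1e-3):
    """Freeze the two large components; train only the modality connector."""
    for p in vision_encoder.parameters():
        p.requires_grad = False
    for p in language_model.parameters():
        p.requires_grad = False
    return torch.optim.AdamW(projector.parameters(), lr=lr)

if __name__ == "__main__":
    # Tiny stand-in modules purely for illustration.
    vision_encoder = nn.Linear(1024, 1024)
    projector = nn.Sequential(nn.Linear(1024, 2048), nn.GELU(), nn.Linear(2048, 2048))
    language_model = nn.Linear(2048, 32000)
    opt = setup_pretraining(vision_encoder, projector, language_model)
    trainable = sum(p.numel() for p in projector.parameters())
    total = trainable + sum(p.numel() for m in (vision_encoder, language_model)
                            for p in m.parameters())
    print(f"trainable fraction: {trainable / total:.2%}")
```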
4.2 Instruction Tuning

Instruction tuning is a training strategy that enhances a model's ability to follow user instructions. It involves fine-tuning the language model on instruction datasets to improve its zero-shot capabilities; in this way, models learn to generalize well to unseen tasks. Popular instruction datasets include LLaVa-Instruct (Liu et al., 2024), ALLaVa (Chen et al., 2024a) and other mixtures of datasets. Pali, MobileVLM, LLaVa-Phi, TinyGPT-V and MiniGPT-4 adopt this strategy for improved multimodal instruction-following capability.
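The sketch below shows, with toy token IDs, how an instruction-tuning sample is typically packed for causal-LM training: the instruction and the response are concatenated into one sequence, and the loss is masked (label -100, the usual ignore index) on the instruction part so that only response tokens are supervised. The chat template and token IDs are illustrative assumptions, not those of any particular SVLM or dataset.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def pack_instruction_sample(prompt_ids: List[int],
                            response_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and response; supervise only the response tokens."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

if __name__ == "__main__":
    # Toy IDs standing in for "<image> Describe the picture." / "A cat on a sofa."
    prompt_ids = [1, 901, 45, 78, 12]
    response_ids = [300, 301, 302, 2]
    ids, labels = pack_instruction_sample(prompt_ids, response_ids)
    print(ids)     # [1, 901, 45, 78, 12, 300, 301, 302, 2]
    print(labels)  # [-100, -100, -100, -100, -100, 300, 301, 302, 2]
```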
4.3 Fine-Tuning

Fine-tuning adapts pre-trained components of SVLMs to specific downstream applications such as VQA, image captioning and instruction following. This technique refines the model's pre-trained parameters using a targeted dataset to enhance its performance on a specific task while preserving generalization. Some models fine-tune only small components (e.g., projection layers), as demonstrated by MiniGPT-4 (Zhu et al., 2023). LLaVA-Phi (Zhu et al., 2024b) and Bunny (He et al., 2024) perform instruction fine-tuning on curated datasets for enhanced conversational and instruction-following abilities. TinyGPT-V (Yuan et al., 2023) adopts task-specific training, optimizing on multi-task datasets to address various vision-language challenges. Fine-tuning on high-resolution images can also improve fine-grained visual understanding, as implemented in PaLI-3 (Chen et al., 2023a). MobileVLM (Chu et al., 2023) and Flamingo (Alayrac et al., 2022) likewise adopt a fine-tuning strategy for improved task-specific performance.

4.4 Multi-task Learning

Multi-task Learning (MTL) trains a single model on multiple related tasks, harnessing shared representations to improve generalization. This is typically achieved by employing a shared backbone (vision encoder and language model) along with task-specific output heads. Techniques such as sequential curriculum learning, where tasks are introduced progressively, and unified optimization across diverse data formats help the model adapt to complex challenges such as visual question answering, image captioning and visual grounding. MobileVLM-v2 (Chu et al., 2024), TinyGPT-V (Yuan et al., 2023) and Pali (Chen et al., 2023a) adopt this strategy using multi-task mixtures of TextVQA, OCR, VQA, COCO Caption (Lin et al., 2014), SBU, Flickr30k (Young et al., 2014) and other datasets. Key aspects of multi-task training include mixing different datasets to provide a diverse set of training examples, using task-specific tokens to reduce ambiguity and improve task execution, and carefully designing the training process to balance the learning of different tasks.
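A minimal sketch of the shared-backbone pattern described above, with hypothetical module and task names: one encoder is reused across tasks and a small per-task head is selected at forward time. Real SVLMs usually route all tasks through the language model itself rather than separate heads, so this is only a schematic of the weight-sharing idea.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared backbone with task-specific output heads (schematic)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(768, feat_dim), nn.ReLU())
        self.heads = nn.ModuleDict({
            "vqa": nn.Linear(feat_dim, 3129),        # e.g. a VQA answer vocabulary
            "caption": nn.Linear(feat_dim, 32000),   # e.g. a caption token vocabulary
            "grounding": nn.Linear(feat_dim, 4),     # e.g. a bounding box
        })

    def forward(self, features, task: str):
        shared = self.backbone(features)
        return self.heads[task](shared)

if __name__ == "__main__":
    model = MultiTaskModel()
    x = torch.randn(2, 768)                          # fused multimodal features
    print(model(x, "vqa").shape)                     # torch.Size([2, 3129])
    print(model(x, "grounding").shape)               # torch.Size([2, 4])
```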
By combining robust pre-training with specialized fine-tuning strategies, SVLMs are able to achieve high performance across a range of vision-language tasks while remaining computationally efficient. Extended explanations, examples and additional experimental details are provided in Appendix B.

5 Qualitative Performance of SVLMs

Qualitative examples play a key role in understanding the capabilities of SVLMs. This analysis can reveal the strengths and weaknesses of these models in real-world scenarios, providing insights that go beyond quantitative benchmarks. Some key model performances are described below:

• MiniGPT-4 (Zhu et al., 2023) can identify objective elements within an image, offering detailed descriptions. It can also recognize memes as humorous by interpreting the underlying message, demonstrating humor interpretation capability.

• TinyGPT-V (Yuan et al., 2023) excels in delivering concise and accurate visual interpretations. In a test involving a game of hide-and-seek, TinyGPT-V gave a single, viable suggestion ("under couch"), unlike other models that gave multiple options, some of which were incorrect.

• Imp (Shao et al., 2024) models demonstrate skills such as code generation, math problem solving and medical image understanding. The Imp-3B model provides reasonable responses in these cases, showing its strength in vision-language understanding and reasoning, as well as completeness of knowledge.

• Imp-2B (Qwen-1.5) responds in the expected language for Chinese conversations, while other models fail to generate Chinese responses, highlighting the importance of multilingual LLMs in user-friendly LMMs.

• Mini-Gemini (Li et al., 2024b) can recognize plotted curves in graphical data and translate them into Python code, describe intricate elements within complex indoor scenes and demonstrate an understanding of character associations in memes.

• LLaVA-Phi (Zhu et al., 2024b) exhibits strong generalization ability in handling challenging questions, generating code based on instructions and solving mathematical problems.

• MobileVLM (Chu et al., 2023) performs well on benchmarks covering attribute understanding, spatial and relational reasoning, social and natural science, OCR, object recognition and word knowledge.

• Pali-3 (Chen et al., 2023a) provides excellent video QA results and achieves respectable video captioning results.

Some examples of the qualitative performance of SVLMs are shown in Appendix D.

6 Benchmark Evaluation

Evaluating SVLMs through benchmarks is crucial for assessing their capabilities relative to larger models. This evaluation identifies the strengths and weaknesses of different SVLMs and determines whether they can achieve performance comparable to larger models with reduced computational costs. This is particularly relevant for deployment on edge or mobile devices and in real-time applications, where efficiency matters. The existing benchmarks for SVLMs are the same as those used for larger models and can be broadly categorized into three groups based on the aspects they assess: Visual Reasoning, Comprehensive Multimodal Assessment and Specialized Tasks. The performance of SVLMs on several benchmarks is illustrated in Figure 4.

6.1 Visual Reasoning and Understanding

These benchmarks focus on the model's ability to understand and reason about visual content and answer questions, often requiring integration with basic textual queries. The VQA (Visual Question Answering) benchmark (Antol et al., 2015; Goyal et al., 2017) evaluates the ability to answer open-ended questions about images, challenging models to understand and process visual content in conjunction with textual questions. As shown in Fig. 4(a), PaLI-3 (Chen et al., 2023a), Imp-3B (Shao et al., 2024), Mipha (Zhu et al., 2024a) and Bunny-4B (He et al., 2024) perform notably well on VQA-v2, indicating strong capabilities in integrating visual content with language processing to effectively answer varied and complex questions, whereas Flamingo-3B (Alayrac et al., 2022) performs poorly. The GQA (Graph Question Answering) benchmark (Hudson and Manning, 2019) focuses on complex real-world visual reasoning with structured scene understanding, testing object recognition, spatial reasoning and compositional question answering. Mipha (Zhu et al., 2024a), MoE-LLaVA (Lin et al., 2024a), Imp-3B and Bunny-4B (He et al., 2024) perform relatively well on GQA, TinyLLaVA (Zhou et al., 2024) shows moderate performance, and TinyGPT-V (Yuan et al., 2023) struggles with this assessment. TextVQA (Singh et al., 2019) specifically evaluates how models read and understand text within images to answer questions, emphasizing OCR and text-based reasoning within visual contexts. The recently introduced Moondream-2B (moo, 2024), SmolVLM (Face, 2025) and Qwen2.5-VL (Team, 2025) show high performance, as shown in Fig. 4(c), indicating superior capabilities in synthesizing OCR with visual and language understanding.
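For reference, the VQA-style benchmarks above score open-ended answers against ten human annotations with a soft accuracy: an answer counts as fully correct if at least three annotators gave it. A simplified version of that metric is sketched below (the official implementation additionally normalizes answer strings and averages over subsets of nine annotators).

```python
from typing import List

def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Soft VQA accuracy: min(#matching annotators / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

if __name__ == "__main__":
    answers = ["red", "red", "red", "dark red", "red",
               "red", "maroon", "red", "red", "red"]
    print(vqa_accuracy("red", answers))      # 1.0
    print(vqa_accuracy("maroon", answers))   # 0.333...
```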
6.2 Comprehensive Multimodal Assessment

Rather than evaluating only selected aspects, some benchmarks are designed to assess a wide range of multimodal capabilities, including perception, cognition and reasoning across different modalities. They often include multiple sub-tasks that test different aspects of a model's ability to process and integrate information from both visual and textual inputs. MMBench (Liu et al., 2025) provides a broad evaluation of 20 capabilities using a structured and hierarchical approach, involving fine-grained perception and complex reasoning tasks. Bunny-4B (He et al., 2024), Imp-3B (Shao et al., 2024) and Qwen2.5-VL (Team, 2025) exhibit top-tier performance, while LLaVa-Phi (Zhu et al., 2024b) and Mini-Gemini (Li et al., 2024b) score lower on this benchmark. MME (Multimodal Model Evaluation) (Fu et al., 2024) measures both perception (such as object recognition and OCR) and cognition (including commonsense reasoning and numerical calculation) abilities of VLMs across 14 diverse sub-tasks. Bunny-4B (He et al., 2024) shows exceptional performance, demonstrating its strong ability to handle diverse and challenging multimodal tasks, while LLaVa-Phi and Mini-Gemini (Li et al., 2024b) perform less effectively than the others. MM-Vet (Multimodal Multitask Vision-Language Evaluation Test) (Yu et al., 2023)

Figure 4: Performance of SVLM models on benchmarks: (a) VQA-v2, (b) GQA, (c) TextVQA, (d) ScienceQA, (e) POPE, (f) MM-Vet, (g) MMBench and (h) MME.
Limitations

While this survey provides a comprehensive analysis of Small Visual Language Models (SVLMs), it has several limitations. The field is rapidly evolving, and emerging models may introduce novel architectures or evaluation techniques not covered here. This study primarily focuses on general-purpose SVLMs, overlooking specialized models for domains like medical imaging or industrial automation. Benchmarking comparisons highlight model performance, but the absence of standardized efficiency metrics (e.g., FLOPs, inference time, energy consumption) limits real-world deployability assessments. Additionally, future directions like neuromorphic computing, federated learning and adaptive architectures require further study to enhance scalability and sustainability. Due to the page limit of this paper, some discussions are constrained, and addressing these gaps will be essential for advancing SVLM research and development.

References

2024. Moondream AI.

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.

Marco Bellagente, Jonathan Tow, Dakota Mahan, Duy Phung, Maksym Zhuravinskyi, Reshinth Adithyan, James Baicoianu, Ben Brooks, Nathan Cooper, Ashish Datta, et al. 2024. Stable LM 2 1.6B technical report. arXiv preprint arXiv:2402.17834.

Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. 2021. High-performance large-scale image recognition without normalization. In International Conference on Machine Learning, pages 1059–1071. PMLR.

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. 2024a. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684.

Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, et al. 2024b. Evlm: An efficient vision-language model for visual understanding. arXiv preprint arXiv:2407.14177.

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. 2023a. Pali-3 vision language models: Smaller, faster, stronger. Preprint, arXiv:2310.09199.

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. 2023b. Pali: A jointly-scaled multilingual language-image model. Preprint, arXiv:2209.06794.

Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. 2023. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886.

Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. 2024. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766.

Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. 2021. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint, arXiv:1412.3555.
Patrycja Cieplicka, Julia Kłos, and Maciej Morawski. 2024. Visionqaries at mediqa-magic 2024: Small vision language models for dermatological diagnosis. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024), Springer Lecture Notes in Computer Science LNCS, Grenoble, France.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint, arXiv:1810.04805.

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. 2024. Mme: A comprehensive evaluation benchmark for multimodal large language models. Preprint, arXiv:2306.13394.

Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. 2021. Clip-adapter: Better vision-language models with feature adapters. Preprint, arXiv:2110.04544.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2020. Mask r-cnn. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):386–397.

Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, and Bo Zhao. 2024. Efficient multimodal learning from data-centric perspective. Preprint, arXiv:2402.11530.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. Preprint, arXiv:2102.05918.

Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, and Fahad Shahbaz Khan. 2024. Effectiveness assessment of recent large vision-language models. Visual Intelligence, 2(1).

LAION/GPT-4V. Dataset at huggingface. https://huggingface.co/datasets/laion/gpt4v-dataset.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. Preprint, arXiv:1908.03557.

Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, and Xiao Xiang Zhu. 2024a. Vision-language models in remote sensing: Current progress and future trends. Preprint, arXiv:2305.05726.

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. 2024b. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814.
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023b. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023c. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463.

Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. 2024a. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947.

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. 2024b. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in Neural Information Processing Systems, 36.

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2025. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer.

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.

Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, and Rongrong Ji. 2023. Cheap and quick: Efficient vision-language instruction tuning for large language models. Preprint, arXiv:2305.15023.

Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24.

Fang Peng, Xiaoshan Yang, Linhui Xiao, Yaowei Wang, and Changsheng Xu. 2023. Sgva-clip: Semantic-guided visual adapting of vision-language models for few-shot image classification. Preprint, arXiv:2211.16191.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.

David E. Rumelhart and James L. McClelland. 1987. Learning Internal Representations by Error Propagation, pages 318–362.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.

Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, and Jiajun Ding. 2024. Imp: Highly capable large multimodal models for mobile devices. arXiv preprint arXiv:2405.12107.

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326.

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. 2023. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389.

Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR.

Mingxing Tan, Ruoming Pang, and Quoc V Le. 2020. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790.

Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, et al. 2022. Ul2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.
Qwen Team. 2025. Qwen2.5-vl.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, and Zicheng Liu. 2020a. Minivlm: A smaller and faster vision-language model. arXiv preprint arXiv:2012.06946.

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. 2024. Cogvlm: Visual expert for pretrained language models. Preprint, arXiv:2311.03079.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020b. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788.

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. 2022. Simvlm: Simple visual language model pretraining with weak supervision. Preprint, arXiv:2108.10904.

Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. 2025. Vary: Scaling up the vision vocabulary for large vision-language model. In European Conference on Computer Vision, pages 408–424. Springer.

Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, En Yu, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. 2024. Small language model meets with reinforced vision vocabulary. Preprint, arXiv:2401.12503.

Zhenlin Xu, Yi Zhu, Tiffany Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, and Davide Modolo. 2024. Benchmarking zero-shot recognition with vision-language models: Challenges on granularity and specificity. Preprint, arXiv:2306.16048.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2023. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.

Zhengqing Yuan, Zhaoxu Li, Weiran Huang, Yanfang Ye, and Lichao Sun. 2023. Tinygpt-v: Efficient multimodal large language model via small backbones. arXiv preprint arXiv:2312.16862.

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986.

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. 2024a. Vision-language models for vision tasks: A survey. Preprint, arXiv:2304.00685.

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024b. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385.

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588.

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, and Yang You. 2024. A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms. Preprint, arXiv:2412.03324.

Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. 2024. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289.

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.

Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, and Jian Tang. 2024a. Mipha: A comprehensive overhaul of multimodal assistant with small language models. CoRR.

Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. 2024b. Llava-phi: Efficient multi-modal assistant with small language model. In Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited, pages 18–22.

Appendix

A Recent SVLM Models

A brief overview of the SVLM models discussed above, their published year, parameter size, architectural components and other information is summarized in Table 2.
Model | Year | Parameters | Vision Encoder | Language Model | Connector
MiniVLM | Aug 2021 | 53.2M, FLOPs 6.7B | TEE: EfficientNet + BiFPN | MiniLM | RPN + RoIAlign + 2 linear layers
Flamingo-3B | Nov 2022 | 3.2B | NFNet-F6 | Chinchilla-1.4B | Perceiver Resampler + Gated XATTN-Dense layers
Pali-3 | Oct 2023 | 5B | SigLIP ViT G/14 | UL2 Transformer | Linear Projector
MobileVLM | Dec 2023 | 1.7B, 3B | CLIP ViT-L/14 | MobileLLaMA | Lightweight Downsample Projector (LDP)
MobileVLM-v2 | Feb 2024 | 1.7B, 3B | CLIP ViT-L/14 | MobileLLaMA | Upgraded LDPv2
LLaVa-Phi | Feb 2024 | 3B | CLIP ViT-L/14 | Phi-2 | MLP Projector
TinyLLaVa | Feb 2024 | 1.5B / 3.1B | SigLIP-L/14 | TinyLlama / Phi-2 | 2-layer MLP
Mipha | March 2024 | Mipha-1.6B / Mipha-2.4B / Mipha-3B | SigLIP ViT L/14 | Phi-1.5 / Gemma-2B / Phi-2 | 2-layer MLP
Mini-Gemini | March 2024 | 2B | CLIP-ViT + ConvNeXt | Gemma-2B | Linear Projector
Imp | May 2024 | 3B / 2B | SigLIP-SO-L/14 | Phi-2 / Qwen-1.8B | 2-layer MLP
TinyGPT-V | June 2024 | 3.7B | EVA ViT-L/14 | Phi-2 | Q-Former with projection layers
Bunny-4B | July 2024 | 4B | SigLIP-SO | Phi-3-Mini | 2-layer MLP
MoE-LLaVa | Dec 2024 | 2.9B (active 2B) / 3.1B (active 2.2B) / 5.3B (active 3.6B) | CLIP-Large | StableLM-1.6B / Qwen-1.8B / Phi-2 | 2-layer MLP
Moondream | Jan 2025 | 2B, 0.5B | SigLIP | Phi-1.5 | –
SmolVLM | Nov 2024 / Jan 2025 | SmolVLM-2B / SmolVLM-0.5B / SmolVLM-0.25B | SigLIP-SO-L/14 (2B); SigLIP base patch 16/512 (0.5B, 0.25B) | SmolLM2 | Linear Projector
Qwen2.5-VL | Jan 2025 | 3B | ViT-675M | Qwen-2.5 LLM | MLP Projector

Table 2: Brief overview of SVLM models: published year, parameter size, architectural components and training datasets.
… fine-tunes both vision and language modules.

B.2 Multi-task Learning Strategies

Multi-task learning (MTL) enables a single model to learn from multiple related tasks, leveraging shared representations to boost overall generalization. Key strategies include:

1. Shared Backbone with Task-Specific Heads: A common architecture (shared vision encoder and language model) is extended with additional output layers tailored to each task. For example, TinyGPT-V (Yuan et al., 2023) uses this approach for tasks such as VQA, image captioning and referring expression comprehension.

2. Sequential Curriculum Learning: Tasks are introduced progressively, beginning with simpler ones to build a strong foundational understanding before moving to complex tasks. CogAgent (Hong et al., 2024) applies this method to gradually enhance its capabilities.

3. Task-Specific Data Handling: Despite differences in data formats across tasks, a unified optimization strategy is employed. Flamingo-3B (Alayrac et al., 2022) uses gated cross-attention to condition on various tasks, ensuring that each task is effectively learned without compromising overall performance.

B.3 Detailed Model-Based Data Filtering Techniques

High-quality training data is vital for robust model performance. Model-based data filtering techniques help refine large-scale datasets by reducing noise and enhancing relevance:

1. Pseudo-Label Generation: Pre-trained models generate high-quality annotations to serve as pseudo-labels, thereby augmenting weakly labeled datasets. MiniVLM (Wang et al., 2020a) exemplifies this method.

2. Dataset Condensation: Techniques such as k-means clustering and graph-based pruning are used to reduce dataset size while preserving essential information. Bunny (He et al., 2024) uses these methods to condense large datasets like LAION-2B. A toy clustering-based selection is sketched after this list.

3. Attention Map-Guided Pruning: Attention maps from smaller pre-trained models identify and remove non-essential tokens, improving the signal-to-noise ratio. Small Guides Large (SGL) (Zhao et al., 2024) demonstrates this strategy.

4. Contrastive and Knowledge Distillation Losses: These losses are used to align visual and textual features, further refining the training dataset. SgVA-CLIP (Peng et al., 2023) integrates both techniques to enhance performance, particularly in few-shot learning scenarios.
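Below is a small sketch of the clustering-based condensation idea from item 2: embed the samples (here random vectors stand in for image or caption embeddings), cluster them with k-means, and keep only the example nearest to each centroid. The embedding source and keep ratio are assumptions for illustration; Bunny's actual pipeline combines this with graph-based filtering.

```python
import numpy as np
from sklearn.cluster import KMeans

def condense_by_kmeans(embeddings: np.ndarray, n_keep: int) -> np.ndarray:
    """Return indices of the samples closest to each of n_keep cluster centroids."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(embeddings)
    keep = []
    for c in range(n_keep):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        keep.append(members[int(np.argmin(dists))])
    return np.array(sorted(keep))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(10_000, 256))    # stand-in for CLIP/SigLIP embeddings
    selected = condense_by_kmeans(feats, n_keep=500)
    print(selected.shape)                     # (500,)
```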
B.4 Pre-Training Methods

Pre-training is crucial for developing robust cross-modal representations in SVLMs. This phase extracts rich semantic features from large-scale datasets while maintaining computational efficiency. Key techniques include:

• Vision-Language Alignment: Lightweight adapters, such as linear projection layers or Q-Formers, align visual tokens from models like ViT with textual embeddings. BLIP-2 (Li et al., 2023a) and MiniGPT-4 (Zhu et al., 2023) effectively employ these methods.

• Contrastive Learning: This technique maximizes similarity between semantically related image-text pairs while minimizing similarity for unrelated pairs. PaLI-3 (Chen et al., 2023a) leverages SigLIP-based contrastive objectives for effective cross-modal alignment; a toy sigmoid-loss sketch follows this list.

• Pseudo-Label Generation: To overcome the challenges of limited annotations, pre-trained captioning models are used to generate pseudo-labels, as implemented in MiniVLM (Wang et al., 2020a).

• Masked Multimodal Modeling: By masking portions of both visual and textual inputs, the model is trained to predict the missing tokens, promoting bidirectional understanding. VinVL (Zhang et al., 2021) is an example of this approach.

• Dataset Preprocessing: Rigorous filtering of large-scale datasets (e.g., CC-595K (Zhu et al., 2024b) and LAION (Schuhmann et al., 2022)) ensures high-quality inputs for pre-training.

• Instruction Fine-Tuning: Models are fine-tuned on curated instruction datasets to improve their conversational and directive capabilities, as seen in LLaVA-Phi (Zhu et al., 2024b) and Bunny (He et al., 2024).

• Dynamic Layer Updates: Gradually unfreezing model components during fine-tuning balances stability and performance enhancement. MobileVLM V2 (Chu et al., 2024) demonstrates this approach effectively.
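A simplified sketch of the pairwise sigmoid objective referenced in the Contrastive Learning bullet, in the spirit of SigLIP: matching image-text pairs on the diagonal get label +1, all other pairs -1, and each pair contributes an independent log-sigmoid term. The temperature and bias are treated as plain constants here, and the batch size and embedding dimension are arbitrary; this is not the exact SigLIP implementation.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in the batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = t * img @ txt.t() + b                       # (B, B) pair logits
    labels = 2.0 * torch.eye(img.size(0)) - 1.0          # +1 on diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    img = torch.randn(8, 512)
    txt = img + 0.1 * torch.randn(8, 512)                # roughly matched pairs
    print(float(sigmoid_contrastive_loss(img, txt)))
```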
Table 3: A Brief Comparison of Training Strategies for SVLMs

Strategy | Description | Key Techniques | Example Models
Transfer Learning | Utilizes pre-trained vision encoders and language models, adapting them to specific tasks by fine-tuning certain components. | Lightweight adapters (e.g., linear layers, Q-Formers); two-stage pipelines (pre-train + task-specific fine-tuning); frozen or partially frozen layers; gradual unfreezing for better alignment; domain-specific fine-tuning | BLIP-2: uses frozen encoders and Q-Formers to bridge modalities. MiniGPT-4: fine-tunes a linear projection layer between frozen components. LLaVA-Phi: pretrains projection layers, followed by instruction tuning. PaLI-3: employs contrastive pretraining for cross-modal alignment. MobileVLM: fine-tunes LLMs and projectors for edge devices. VinVL: fine-tunes object detection models to produce rich vision-language embeddings.
Model-based Data Filtering | Curates and filters datasets to enhance quality and relevance, ensuring models train on meaningful and noise-free data. | Pseudo-label generation; dataset condensation using clustering/graph pruning; attention map-guided pruning; knowledge distillation and contrastive losses; mixed text-only and multimodal task datasets; prompt tuning | MiniVLM: uses pre-trained captioning models to generate pseudo-labels. Bunny: condenses datasets via k-means and graph-based filtering. SGL: uses attention maps to prune datasets. SgVA-CLIP: aligns visual features with semantics through contrastive loss. LaVIN: filters datasets dynamically for mixed text-image and text-only tasks. FEWVLM: curates data for few-shot learning using prompt templates.
These detailed methodologies underscore how SVLM training frameworks efficiently combine robust pre-training with targeted fine-tuning protocols, enabling SVLMs to excel on diverse vision-language tasks even under resource constraints.
Benchmark | Year | Properties | Tasks
VQA (Antol et al., 2015) | 2015 | 0.25M images, 0.76M questions, 10M answers with 3 plausible incorrect answers per question | Recognition, Object Detection, Knowledge Integration, Commonsense Reasoning
VQA-v2 (Goyal et al., 2017) | 2017 | 443K training, 214K validation, 453K test questions, balanced image pairs | Enhanced Visual Understanding, Reduced Language Bias, Fine-grained Visual Discrimination
GQA (Hudson and Manning, 2019) | 2019 | 113K images, 22M questions, structured scene graphs | Spatial Reasoning, Logical Inference, Attribute Recognition, Compositional Understanding
TextVQA (Singh et al., 2019) | 2019 | 28,408 images, 45,336 questions from OpenImages | OCR Integration, Text Comprehension, Visual-Textual Reasoning
ScienceQA (Lu et al., 2022) | 2022 | 21,208 questions, 48.7% image context, 90.5% detailed explanations | Scientific Reasoning, Multimodal Understanding, Zero-shot Generalization
POPE (Li et al., 2023b) | 2023 | Binary questions on COCO dataset subsets (random, popular, adversarial) | Object Hallucination Detection, Visual Accuracy, Robustness Testing
MMBench (Liu et al., 2025) | 2023 | 20 distinct abilities across perception and reasoning tasks, bilingual support | Hierarchical Capability Assessment, Fine-grained Perception, Multi-level Reasoning
MME (Fu et al., 2024) | 2023 | 14 sub-tasks covering perception and cognition | OCR, Object Recognition, Commonsense Reasoning, Numerical Calculation
MM-Vet (Yu et al., 2023) | 2023 | 200 images, 218 questions, 16 capability combinations | Integrated Vision-Language Capabilities, Spatial Understanding, Knowledge Application
Model | VQA-v2 | GQA | POPE | TextVQA | MMBench | ScienceQA | MME | MM-Vet
LLaVA-Phi | 71.4 | – | 85.0 | 48.6 | 59.8 | 68.4 | 1335.1 | 28.9
MobileVLM | – | 59.0 | 84.9 | 47.5 | 59.6 | 61.2 | 1288.9 | –
MobileVLM-v2 | – | 61.1 | 84.7 | 57.5 | 63.2 | 70.0 | 1440.5 | –
TinyLLaVA | 79.9 | 62.0 | 86.4 | 59.1 | 66.9 | 69.1 | 1464.9 | 32.0
TinyGPT-V | – | 38.9 | – | – | – | – | – | –
MiniVLM | 69.09 | – | – | – | – | – | – | –
PaLI-3 | 85.2 | – | – | – | – | – | – | –
MoE-LLaVA | 79.9 | 62.6 | 85.7 | 57.0 | 68.0 | 70.3 | 1431.3 | 35.9
Mipha | 81.3 | 63.9 | 86.9 | 56.6 | 69.7 | 70.9 | 1488.9 | 32.1
Mini-Gemini | – | – | – | 56.2 | 59.8 | – | 1341 | 31.1
Bunny 4B | 82.1 | 63.2 | 87.2 | – | 75.7 | 78.3 | 1581.5 | –
Flamingo-3B | 57.1 | – | – | 32.7 | – | – | – | –
Imp-2B | 79.2 | 61.9 | 86.7 | 54.5 | 63.8 | 66.1 | 1304.8 | 33.5
Imp-3B | 81.2 | 63.5 | 88 | 59.8 | 72.9 | 72.8 | 1446.4 | 43.3
Moondream-0.5B | – | – | 85.1 | 68.92 | – | – | – | –
Moondream-2B | – | – | 89.8 | 73.4 | – | – | – | –
SmolVLM | – | – | – | 72.1 | – | 84.5 | – | –
Qwen2.5-VL | – | – | – | 79.3 | 77.6 | – | – | –
Figure 11: Comprehensive skill demonstrations of Imp, including math problem solving, code generation, multilingual conversation and medical image understanding.
Figure 12: Qualitative analysis of Mini-Gemini in visual understanding.