HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Zhang, Wenqiao; Lin, Tianwei; Liu, Jiang; Shu, Fangxun; Li, Haoyuan; Zhang, Lei; Wanggui, He; Zhou, Hao; Lv, Zheqi; Jiang, Hao; Li, Juncheng; Tang, Siliang; Zhuang, Yueting

Computer Science > Artificial Intelligence

arXiv:2403.13447 (cs)

[Submitted on 20 Mar 2024]

Abstract: Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to comprehend visual information through visual instruction tuning. Although promising, this static tuning strategy (static tuning refers to a trained model whose parameters are fixed), which shares the same parameters, may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which adaptively tunes the projector and LLM parameters in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training.
Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at this https URL.
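
To make the idea of an "adaptive parameter shift" concrete, the following is a minimal sketch of a projector whose weights are adjusted per input by a small hypernetwork. It is not the authors' implementation: the class name `HyperProjector`, the mean-pooled guidance vector, the single linear hypernetwork, and the `hyper_scale` parameter are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mean_pool(tokens):
    """Average a list of token vectors into one guidance vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


class HyperProjector:
    """Static linear projector plus an input-conditioned weight shift.

    Hypothetical illustration only: the guidance signal and the form of
    the hypernetwork are assumptions, not details taken from the paper.
    """

    def __init__(self, d_in, d_out, seed=0, hyper_scale=0.1):
        rng = random.Random(seed)
        self.d_in, self.d_out = d_in, d_out
        # Static projector weights W, shared by every input.
        self.W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
        # Hypernetwork: a linear map from the guidance vector (d_in)
        # to a flattened weight shift (d_out * d_in).
        self.H = [[hyper_scale * rng.gauss(0, 1) for _ in range(d_in)]
                  for _ in range(d_out * d_in)]

    def shift(self, guidance):
        """Generate the weight shift dW for a given guidance vector."""
        flat = matvec(self.H, guidance)
        return [flat[r * self.d_in:(r + 1) * self.d_in]
                for r in range(self.d_out)]

    def project(self, tokens):
        """Project tokens with W + dW, where dW depends on the tokens."""
        dW = self.shift(mean_pool(tokens))
        W_dyn = [[w + d for w, d in zip(w_row, d_row)]
                 for w_row, d_row in zip(self.W, dW)]
        return [matvec(W_dyn, t) for t in tokens]


# Usage: two 3-dimensional "visual" tokens projected to 2 dimensions.
projector = HyperProjector(d_in=3, d_out=2)
projected = projector.project([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

If the hypernetwork weights `H` are all zero, the sketch reduces to the static projector, which is one way to see the dynamic variant as a perturbation of the static one.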

Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.13447 [cs.AI]
	(or arXiv:2403.13447v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2403.13447

Submission history

From: Wenqiao Zhang
[v1] Wed, 20 Mar 2024 09:42:43 UTC (8,932 KB)
