ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

Wu, Mingrui; Cai, Xinyue; Ji, Jiayi; Li, Jiale; Huang, Oucheng; Luo, Gen; Fei, Hao; Jiang, Guannan; Sun, Xiaoshuai; Ji, Rongrong

arXiv:2407.21534 (cs)

[Submitted on 31 Jul 2024 (v1), last revised 7 Jan 2025 (this version, v6)]

Abstract: In this work, we propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs) through test-time optimization of a learnable latent variable. We observe that attention, as the core module of MLLMs, connects text prompt tokens and visual tokens, ultimately determining the final results. Our approach involves adjusting visual tokens from the MLP output at test time, controlling the attention response to ensure text prompt tokens attend to visual tokens in referring regions. We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map. This enables detailed region description and reasoning without the need for substantial training costs or model retraining. Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point. The results demonstrate that our method exhibits out-of-domain generalization and interpretability.
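The abstract does not give the energy function or the optimizer, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes a single text query attending over a handful of visual tokens, an additive latent variable on those tokens, and a hypothetical energy E = (1 - m)^2, where m is the attention mass inside the referring region. All names (`attention`, `region_mass`, `optimize_latent`) and hyperparameters are illustrative, and the gradient is derived by hand for this specific toy energy.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, tokens, latent):
    """Attention of one text query over visual tokens shifted by the latent."""
    scale = math.sqrt(len(query))
    logits = [
        sum(q * (t + p) for q, t, p in zip(query, tok, lat)) / scale
        for tok, lat in zip(tokens, latent)
    ]
    return softmax(logits)


def region_mass(attn, region):
    """Total attention that falls on the referring region."""
    return sum(attn[i] for i in region)


def optimize_latent(query, tokens, region, steps=50, lr=5.0):
    """Gradient descent on the latent only; no model weights are updated."""
    dim = len(query)
    scale = math.sqrt(dim)
    latent = [[0.0] * dim for _ in tokens]
    for _ in range(steps):
        attn = attention(query, tokens, latent)
        mass = region_mass(attn, region)
        for j, lat in enumerate(latent):
            indicator = 1.0 if j in region else 0.0
            # dE/dlogit_j for the assumed energy E = (1 - mass)^2
            d_energy = -2.0 * (1.0 - mass) * attn[j] * (indicator - mass)
            for d in range(dim):
                # dlogit_j / dlatent[j][d] = query[d] / scale
                lat[d] -= lr * d_energy * query[d] / scale
    return latent


# Toy data: 16 visual tokens of dimension 8, region given as token indices
# (for example, the tokens covered by a box on a 4x4 grid).
random.seed(0)
query = [random.gauss(0, 1) for _ in range(8)]
tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
region = {5, 6, 9, 10}

zero = [[0.0] * 8 for _ in tokens]
before = region_mass(attention(query, tokens, zero), region)
latent = optimize_latent(query, tokens, region)
after = region_mass(attention(query, tokens, latent), region)
print(f"attention mass in region: {before:.3f} -> {after:.3f}")
```

In this toy setting each update raises the logits of region tokens and lowers the others, so the attention mass on the region grows while the stand-in "model" (the query and the original tokens) stays fixed, which mirrors the training-free framing of the abstract.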

Comments:	Accepted to NeurIPS 2024; code is publicly available
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.21534 [cs.CV]
	(or arXiv:2407.21534v6 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.21534

Submission history

From: Mingrui Wu
[v1] Wed, 31 Jul 2024 11:40:29 UTC (1,968 KB)
[v2] Sun, 29 Sep 2024 12:12:06 UTC (1,865 KB)
[v3] Mon, 11 Nov 2024 05:12:01 UTC (5,697 KB)
[v4] Wed, 18 Dec 2024 13:12:29 UTC (6,202 KB)
[v5] Mon, 23 Dec 2024 04:03:44 UTC (6,202 KB)
[v6] Tue, 7 Jan 2025 02:54:18 UTC (6,202 KB)
