Abstract
Prompt tuning has emerged as a flexible approach for adapting pre-trained models by learning only additional inputs while keeping the model parameters frozen. However, simplistic prompts are insufficient to effectively address the challenges posed by complex multi-modal tasks such as visual grounding (VG). In this paper, we propose a novel prompting architecture called Dynamic Multi-modAl Prompting (DMAP) for visual grounding. DMAP incorporates input-dependent prompting to tailor instance-level prompts for more accurate representation, and dynamic multi-modal prompting to capture the relationship between the textual and visual inputs. To this end, we design a Dynamic Prompt Network (DPN) to generate multi-modal prompts based on the specific inputs, enhancing both adaptive prompt generation and multi-modal feature fusion. Extensive experimental results demonstrate the superiority of DMAP over competing methods in parameter-efficient settings. Furthermore, DMAP consistently outperforms state-of-the-art VG methods even when fine-tuning all parameters.
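To make the input-dependent prompting idea concrete, below is a minimal PyTorch sketch of one plausible realization of a Dynamic Prompt Network: a small module that conditions on pooled visual and textual features and emits instance-level prompt tokens for each frozen encoder branch. All names, shapes, and layer choices here (DynamicPromptNetwork, prompt_len, the fusion MLP) are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class DynamicPromptNetwork(nn.Module):
    """Hypothetical sketch of a DPN: generates instance-level,
    multi-modal prompt tokens conditioned on both inputs."""

    def __init__(self, vis_dim=768, txt_dim=768, prompt_len=4):
        super().__init__()
        self.prompt_len = prompt_len
        self.vis_dim, self.txt_dim = vis_dim, txt_dim
        # Fuse pooled visual and textual features into a joint code,
        # so the prompts reflect the cross-modal relationship.
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, vis_dim),
            nn.ReLU(),
        )
        # Separate heads emit prompt tokens for each frozen encoder branch.
        self.to_vis_prompt = nn.Linear(vis_dim, prompt_len * vis_dim)
        self.to_txt_prompt = nn.Linear(vis_dim, prompt_len * txt_dim)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, vis_dim) pooled image features (e.g., a [CLS] token)
        # txt_feat: (B, txt_dim) pooled expression features
        joint = self.fuse(torch.cat([vis_feat, txt_feat], dim=-1))
        b = joint.size(0)
        vis_prompts = self.to_vis_prompt(joint).view(b, self.prompt_len, self.vis_dim)
        txt_prompts = self.to_txt_prompt(joint).view(b, self.prompt_len, self.txt_dim)
        return vis_prompts, txt_prompts

# Usage sketch: only the DPN is trained; the generated prompts would be
# prepended to each encoder's token sequence while backbone weights stay frozen.
dpn = DynamicPromptNetwork()
v = torch.randn(2, 768)   # pooled visual features for a batch of 2
t = torch.randn(2, 768)   # pooled textual features
vp, tp = dpn(v, t)        # each of shape (2, 4, 768)
```

The key contrast with static prompt tuning is that the prompt tokens here are a function of the specific image-expression pair, so each instance receives its own prompts rather than a single learned set shared across the dataset.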
Acknowledgement
This research was partially supported by the National Natural Science Foundation of China (Grant Nos. 62306329, 62103420, 62103425, and 62103428), the Natural Science Foundation of Hunan Province (Grant Nos. 2021JJ40697, 2021JJ40702, 2022JJ40559, and 2023JJ40676), and the Hunan Provincial Innovation Foundation for Postgraduate.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wu, W., Liu, T., Wang, Y., Xu, K., Yin, Q., Hu, Y.: Dynamic multi-modal prompting for efficient visual grounding. In: Liu, Q., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol. 14431. Springer, Singapore (2024). https://doi.org/10.1007/978-981-99-8540-1_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8539-5
Online ISBN: 978-981-99-8540-1
eBook Packages: Computer Science, Computer Science (R0)