Grounded Image Text Matching with Mismatched Relation Reasoning

Wu, Yu; Wei, Yana; Wang, Haozhe; Liu, Yongfei; Yang, Sibei; He, Xuming

Computer Science > Computer Vision and Pattern Recognition

arXiv:2308.01236 (cs)

[Submitted on 2 Aug 2023 (v1), last revised 4 Aug 2023 (this version, v2)]

Abstract: This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating pre-trained models on this task, with a focus on the challenging settings of limited data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained models lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. RCRN can be interpreted as a modular program and delivers strong performance in both length generalization and data efficiency.
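
The two-stage output described in the abstract (a match decision, followed by either object localization or grounding of the mismatched text) can be pictured as a small data structure. The following is only an illustrative sketch, not the authors' code: the class name `GITMMRPrediction`, its fields, the box format, and the use of character spans for mismatched text are all assumptions made here for clarity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed formats (not taken from the paper):
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Span = Tuple[int, int]                   # character offsets [start, end)


@dataclass
class GITMMRPrediction:
    """Hypothetical container for one prediction on the task."""
    expression: str
    is_match: bool
    # Filled when localizing referred objects: phrase -> bounding box.
    object_boxes: Dict[str, Box] = field(default_factory=dict)
    # Filled when the text does not match: spans of the mismatched parts.
    mismatched_spans: List[Span] = field(default_factory=list)

    def validate(self) -> None:
        """Check basic consistency of the prediction."""
        if self.is_match and self.mismatched_spans:
            raise ValueError("a matched expression has no mismatched spans")
        for start, end in self.mismatched_spans:
            if not 0 <= start < end <= len(self.expression):
                raise ValueError(f"span {(start, end)} is out of range")

    def mismatched_text(self) -> List[str]:
        """Return the substrings flagged as mismatched."""
        return [self.expression[s:e] for s, e in self.mismatched_spans]


# Example: the relation phrase is flagged as the mismatched part.
pred = GITMMRPrediction(
    expression="a dog left of a cat",
    is_match=False,
    mismatched_spans=[(6, 13)],
)
pred.validate()
print(pred.mismatched_text())
```

Keeping both output fields on one record mirrors the "match first, then ground" order of the task; an actual benchmark may represent its annotations differently.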

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2308.01236 [cs.CV]
	(or arXiv:2308.01236v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.01236

Submission history

From: Yana Wei [view email]
[v1] Wed, 2 Aug 2023 15:44:36 UTC (8,850 KB)
[v2] Fri, 4 Aug 2023 17:51:57 UTC (8,850 KB)
