Collaborative Three-Stream Transformers for Video Captioning

Wang, Hao; Zhang, Libo; Fan, Heng; Luo, Tiejian

doi:10.1016/j.cviu.2023.103799

arXiv:2309.09611 (cs)

[Submitted on 18 Sep 2023]

Abstract: As the most critical components of a sentence, the subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), that models the three parts separately and lets them complement each other for better representation. Specifically, COST is formed by three branches of transformers that exploit the visual-linguistic interactions of different granularities in the spatial-temporal domain between videos and text, detected objects and text, and actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches, so that the branches can support each other in exploiting the most discriminative semantic information of different granularities for accurate caption prediction. The whole model is trained in an end-to-end fashion. Extensive experiments conducted on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions and MSVD, demonstrate that the proposed method performs favorably against state-of-the-art methods.
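
The abstract does not spell out how the cross-granularity attention module is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each of the three streams (video, object, action) holds a list of same-sized feature vectors, that every stream attends to the tokens of the other two with plain scaled dot-product attention, and that the attended context is added back residually. The names `attend` and `cross_granularity_attention`, the toy features, and the residual fusion are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention where the keys also serve as values.

    Assumption for brevity: no learned projections, a single head.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * kv[j] for w, kv in zip(weights, keys_values))
            for j in range(dim)
        ])
    return outputs


def cross_granularity_attention(streams):
    """Let each stream attend to the tokens of all other streams.

    `streams` maps a stream name to a list of feature vectors. The attended
    context is added to the original features (a residual connection, which
    is an assumption of this sketch).
    """
    fused = {}
    for name, feats in streams.items():
        others = [
            vec
            for other_name, vecs in streams.items()
            if other_name != name
            for vec in vecs
        ]
        context = attend(feats, others)
        fused[name] = [
            [x + c for x, c in zip(feat, ctx)]
            for feat, ctx in zip(feats, context)
        ]
    return fused


# Toy 2-dimensional features; real features would come from the three
# transformer branches described in the abstract.
streams = {
    "video": [[1.0, 0.0], [0.0, 1.0]],
    "object": [[0.5, 0.5]],
    "action": [[1.0, 1.0], [0.0, 0.0], [0.2, 0.8]],
}
fused = cross_granularity_attention(streams)
```

In this sketch each stream keeps its own token count and feature size after fusion, so the fused features could be passed on to a per-stream decoder unchanged in shape.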

Comments:	Accepted by CVIU
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2309.09611 [cs.CV]
	(or arXiv:2309.09611v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.09611
Related DOI:	https://doi.org/10.1016/j.cviu.2023.103799

Submission history

From: Hao Wang
[v1] Mon, 18 Sep 2023 09:33:25 UTC (2,508 KB)
