0% found this document useful (0 votes)

68 views39 pages

Advancing Video Understanding With GPT4

MM-V ID is an advanced system that integrates GPT-4V with specialized tools in vision, audio, and speech to enhance video understanding, particularly for long-form content. It processes videos by generating detailed scripts that capture character actions, dialogues, and events, enabling capabilities such as audio description, grounded question-answering, and character identification. Experimental results indicate its effectiveness in handling various video genres and interactive environments, paving the way for improved comprehension of complex video narratives.

Uploaded by

Ran Bi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

68 views39 pages

Advancing Video Understanding With GPT4

Uploaded by

Ran Bi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 39

MM-V ID : Advancing Video Understanding with GPT-4V(ision)

Kevin Lin∗ , Faisal Ahmed∗ , Linjie Li∗ , Chung-Ching Lin∗ , Ehsan Azarnasab, Zhengyuan Yang,
Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang∗♠
Microsoft Azure AI
∗ ♠
Core Contribution Project Lead
https://multimodal-vid.github.io/
arXiv:2310.19773v1 [cs.CV] 30 Oct 2023

…… …

Faraon Thut Nefer Nimona Ballister

Boldheart

baal bir

……

Figure 1. MM-V ID allocates specialized vision, audio, speech experts with GPT-4V(ision) to address challenging video understanding
tasks. For example, the system could associate information from multiple uploaded episodes and reason the storyline of the queried
characters (“Multi-Video Episodic Analysis”). We highlight key information here and postpone full MM-V ID responses to Figures 4-32.
Demo videos are available at this link.

1
Abstract quiring semantic understanding, and the ability of long-term
strategy planning [11, 54, 67, 75, 83].
We present MM-V ID, an integrated system that har-
nesses the capabilities of GPT-4V1 , combined with special-
ized tools in vision, audio, and speech, to facilitate ad- Recently, substantial advances have been made with
vanced video understanding. MM-V ID is designed to ad- large pre-trained video models [9, 13, 20, 21, 42, 69] and
dress the challenges posed by long-form videos and intri- video-language models [10, 22, 23, 30, 36–40, 66], which
cate tasks such as reasoning within hour-long content and have demonstrated their reasoning capabilities for video
grasping storylines spanning multiple episodes. MM-V ID content. However, these models are usually trained on
uses a video-to-script generation with GPT-4V to transcribe short clips (e.g., 10-second videos in Kinetics [15] and VA-
multimodal elements into a long textual script. The gener- TEX [68]) or pre-defined action classes (e.g., 174 classes in
ated script details character movements, actions, expres- Something-Something v1 [24]). Consequently, these mod-
sions, and dialogues, paving the way for large language els may fall short in providing a detailed comprehension of
models (LLMs) to achieve video understanding. This en- intricate videos in real world [62, 70]. To achieve a more
ables advanced capabilities, including audio description, comprehensive understanding of the videos we encounter in
character identification, and multimodal high-level compre- daily life, we need methods capable of addressing complex
hension. Experimental results demonstrate the effectiveness challenges. It involves not only identifying who are in the
of MM-V ID in handling distinct video genres with various scene and what they do, but also pinpointing when and how
video lengths. Additionally, we showcase its potential when they act, while recognizing subtle nuances and visual cues
applied to interactive environments, such as video games across different scenes. The aim of this work is to address
and graphic user interfaces. these challenges and explore methods that can be applied
directly to real-world video understanding. Our approach
involves breaking down extended video content into coher-
ent narratives and subsequently employing these generated
1. Introduction
stories for video analysis.
People around the world create numerous videos on a
daily basis [14, 27, 48, 55], including user-generated live Recent advances in Large Multimodal Models
streams, video-game live streams, short clips, movies, (LMMs) [7,8,19,47,51,52,78], such as GPT-4V(ision) [52],
sports broadcasts, advertising, and more. Videos serve as have demonstrated significant breakthroughs in processing
a versatile medium for conveying information and content both input images and text for multimodal understanding.
through various modalities [12, 59, 60, 74, 77, 81, 82], such This has sparked interest in applying LMMs to the video
as text, visuals, and audio. Developing methods that can domain. In this work, we present MM-V ID, a system
learn from diverse modalities will enable us to design cogni- that integrates specialized tools with GPT-4V for video
tive machines with enhanced capabilities for analyzing un- understanding. Given an input video, MM-V ID performs
curated real-world videos, extending beyond the confines multimodal pre-processing, including scene detection and
of hand-curated datasets. However, this rich representation automatic speech recognition (ASR), to collect important
introduces many challenges for the study of video under- information in the video. The input video is then split into
standing, particularly when dealing with extended-duration multiple clips according to the scene detection algorithm.
videos [62, 70]. Then, we employ GPT-4V, which takes the clip-level video
Understanding long videos, especially those spanning frames as input and generates a detailed description for
over an hour, is a complex task that demands advanced each video clip. Finally, GPT-4 is adopted to generate
methods capable of analyzing sequences of images and a coherent script for the full video, conditioning on the
audio across multiple episodes. This challenge is com- clip-level video descriptions, ASR, and video metadata if
pounded by the need to extract information from various available. As shown in Figure 1, the generated script allows
sources, such as distinguishing speakers [17, 49, 61], identi- MM-V ID to perform a diverse set of video tasks.
fying characters [29, 46, 50], and maintaining narrative co-
herence [26, 57]. Additionally, answering questions based Experimental results demonstrate the effectiveness of
on video evidence [31] requires a deep comprehension of MM-V ID in different challenging scenarios. MM-V ID
the content, context, and subtitles. When it comes to live is able to comprehend hour-long videos through multiple
streaming and gaming videos [1, 11, 55], there are chal- modalities, and localize specific events with correct times-
lenges in processing dynamic environments in real-time, re- tamps. MM-V ID also demonstrates intriguing results in
1 In this work, we explore GPT-4V(ision) with the vision capability and an interactive environment, such as predicting the possible
refers to the model as “GPT-4V,” following the OpenAI reports [51, 52]. next steps when playing a video game [4] or interacting with
We refer to the text-only version of the model as “GPT-4” [51]. a graphical user interface (GUI) [78].

2
Clip 1

Multimodal Clip 2 Visual Prompt Clip-Level Video

Pre-Processing LLM with
Description Generation
(Scene Detection, ASR) Text Prompt (GPT-4V) Script
Clip N Video Tasks:
Audio Description,
Grounded QA,
Summarization,
External Knowledge Script Generation Speaker Identification,
Script
Input Video (Metadata, Title, Abstract, Text Prompt Character Identification,
Character Snapshot, etc.)
using LLM (GPT-4) etc.

Figure 2. Overview of MM-V ID. Our system takes a video file as input, and outputs a long textual script describing the video contents.
MM-V ID consists
Pleaseof four modules:
generate audio (i) Multimodal Pre-Processing,
Chat with (ii) External Task List: Collection, (iii) Clip-Level Video Description
Knowledge
Audio Description, Grounded QA, Summarization, Speaker
description for the input video Reference (Script)
Generation, and (iv) Script Generation. Identification, Character Identification, Multimodal Reasoning, etc.

2. Related Work researchers [6, 33, 65, 72] explore the LangChain system
paradigm [16], which aims to integrate expert tools with
Conventional Video Understanding Methods. Early existing LLMs to create new functionalities. For example,
work in computer vision centered on building video founda- VLog [6] uses BLIP2 [34] and GRIT [71] as dense image
tion models [9,13,20,21,42,69]. These models, with differ- captioners, Whisper [56] as ASR translator, and ChatGPT
ent neural network architecture designs and training meth- as a reasoner. By transcribing a given video to textual de-
ods, have achieved great breakthrough at analyzing short scriptions (e.g., document), it enables ChatGPT for video
video clips [14,15,28,63], typically lasting less than 30 sec- question-answering tasks. Inspired by the efficacy of these
onds. However, these models are typically pre-trained with tool-using approaches [16, 65, 79], we explore integration
vision modality only, and thus may require specific adjust- with GPT-4V for video understanding.
ment or fine-tuning for multimodal downstream tasks.
Video-Language Models. Recent studies [10, 22, 23, 30, 3. Preliminary Study with GPT-4V(ision)
36–40, 66] have made remarkable improvements in multi-
modal representation learning for video-and-language un- Recent studies [3, 51, 52, 78] show that GPT-4V can ac-
derstanding. These advancements have been particularly cept a range of inputs, such as textual descriptions, ques-
evident in popular downstream tasks such as video questions, or even visual cues like images or short video clips.
tion answering [31], text-video retrieval [32, 73] and video GPT-4V’s inherent ability to comprehend visual inputs and
captioning [68]. Building on this momentum, researchers generate contextually relevant text opens the door for a wide
typically embark on a pretrain-finetune paradigm: initially range of applications. By introducing a sequence of frames
pre-training a video-language foundation model on large- as input, GPT-4V can grasp temporal relationships and in-
scale video-text pairs, followed by a fine-tuning process on teractions, aiding in the identification and interpretation of
specific downstream datasets. However, these methods are dynamic visual content.
usually trained on short video clips, often restricted to du-
rations of around 10 seconds, posing potential challenges in 4. MM-V ID
comprehending longer video sequences. Figure 2 shows the overview of our system pipeline.
Visual Instruction Tuning. Inspired by the breakthrough MM-V ID takes the video file as input, and outputs a script
of Large Language Models (LLMs) [18, 19, 51, 64, 85], re- describing the video contents. The generated script enables
cent studies [35, 43, 44, 62, 84] suggest using a frozen LLM LLMs to achieve various video understanding capabilities.
combined with an image encoder and a few learnable mod- MM-V ID consists of four modules: (i) Multimodal Pre-
ules for video understanding tasks. Specifically, researchers Processing, (ii) External Knowledge Collection, (iii) Clip-
propose the visual instruction tuning [35, 41, 44], which Level Video Description Generation, and (iv) Script Gener-
aims to fine-tune the learnable modules and thus enable ation. We describe each module in detail below.
LLMs to generate textual descriptions for the video content. Multimodal Pre-Processing. Starting with an input video
While promising performance is presented, these models file, our process begins by using the established ASR tool
may fall short when it comes to handling videos with ex- to extract transcriptions from the video. Following this, we
tended duration. Our work aims to fill this gap, exploring divide the video into several short video clips. This pro-
methods that can be directly applied to the understanding of cess involves uniform sampling of video frames, with each
long videos in real-world situations. clip consisting of 10 frames. To enhance the overall qual-
Prompting LLMs for Video Understanding. Recently, ity of frame sampling, we use established scene detection

3
Streaming Inputs

resenting the ongoing visual information unfolding in the

State
(e.g., Video Frames, Scalar Signals) environment. These states are then processed by GPT-4V
to make informed decisions and generate responses.
By continually analyzing the streaming video frames,
MM-V ID plays a crucial role in transforming raw visual
Agent Environment data into meaningful insights, making it valuable for appli-
(GPT-4V and Memory) (Visual World, Game, GUI, etc.)
cations such as video game play, the embodied agent, and
GUI navigation.

Action 6. Experiments
(e.g., Gaming Console Control)
6.1. Experiment Setup
Figure 3. MM-V ID for streaming inputs. MM-V ID can serve as
an agent in an interactive environment, continually receiving and We implement MM-V ID based on MM-R E ACT [79]
processing the streaming video frames. codebase. We use the Automatic Speech Recognition
(ASR) tool publicly available via the Azure Cognitive Ser-
vices APIs [2], and utilize PySceneDetect [5] for scene de-
tools like PySceneDetect [5] to help identify crucial scene tection.
boundaries.
6.2. MM-V ID Capabilities Audio Description Generation,
External Knowledge Collection. We incorporate external
knowledge into our input prompts to GPT-4V. This involves Figures 4-9 provide illustrative examples of MM-V ID’s
gathering available information, such as metadata, title, ab- complete execution flow. When a user uploads a video file,
stract, and face photos of characters within the video. In MM-V ID initiates the process by first assessing the esti-
our experiments, the metadata, title, and abstract are gath- mated video length. Subsequently, it performs multimodal
ered from YouTube. pre-processing by invoking expert tools, including scene de-
Clip-Level Video Description Generation. During our tection and ASR. Additionally, MM-V ID collects external
multimodal pre-processing, the input video is segmented knowledge, encompassing video metadata such as title and
into multiple clips. For each clip, which typically consists abstract.
of 10 frames, we employ GPT-4V to generate video descrip- Following this preliminary stage, MM-V ID proceeds to
tions. By feeding the video frames along with the associ- generate clip-level video descriptions for each segment of
ated text prompt into the model, GPT-4V utilizes the input the video. Finally, it invokes GPT-4, integrating these clip-
to generate detailed descriptions that capture the visual ele- level descriptions into a coherent script. Once the script is
ments, actions, and events depicted in those frames. generated, it empowers LLMs to provide a summarized un-
In addition, we explore the use of visual prompting, derstanding of the video content. That equips the system to
where the character’s face photos are presented alongside address users’ questions with grounded answers. We dis-
the character’s name in the input to GPT-4V. Our empirical cuss MM-V ID’s distinct capabilities as below.
results suggest that visual prompting is helpful to enhance Grounded Question-Answer (QA). The generation of a
the quality of video descriptions, particularly for more accu- comprehensive script empowers our system with the capa-
rate character identification. These findings align with the bility of grounded QA. As shown in Figure 8, let us consider
insights from [78]. a scenario where a user poses the question, “Show me the
Script Generation using LLM. After generating the de- most exciting moment in this video.” In response, MM-
scriptions for each video clip, we use GPT-4 to integrate V ID displays a highlight, specifically featuring a home run,
these clip-level descriptions into a coherent script. This and provides the corresponding timestamp. When a user
script serves as a comprehensive description of the entire asks “Who are the best pitchers in this video?” MM-V ID
video, and is used by GPT-4 for a diverse set of video un- addresses the question by referring to relevant evidence in
derstanding tasks. the generated script. This grounding capability owes its
success to the extensive and detailed script generation pro-
5. MM-V ID for Streaming Inputs cess, which documents essential timestamps and significant
events within the video, enabling accurate and contextually
Figure 3 shows the diagram of MM-V ID when applied to grounded responses to user inquiries.
the context of streaming inputs. Our system operates as an Multimodal Reasoning. MM-V ID considers multimodal
agent within a dynamic environment where streaming video inputs, including video frames, speech transcriptions, and
frames serve as the primary input. In this context, the agent external knowledge if available. In Figure 8, when a
continually receives streaming video frames as states, rep- user inquires, “How did you know the sound is different?”

4
Clip-Level Video Description Generation
Using GPT-4V

Output:
Timestamp: 00:00:00 - 00:00:02
rk heart
Scene: Opening graphics introducing the game
between the Colorado Rockies and the Los Angeles
Dodgers.

Timestamp: 00:00:20 - 00:00:24

Loading video…
Scene: Pitcher on the mound at Dodger Stadium.
Video length is 9 minutes 19 seconds.
:
:
:
Invoke tools: Scene Detection, ASR, Metadata Timestamp 00:09:01-00:09:07
Collection Scene Description: In this scene, we see a baseball
game between the Los Angeles Dodgers and the
Scene Detection: Colorado Rockies. The Dodgers are leading with a
00:00:00, 00:00:02, 00:00:20, 00:00:24, score of 5-2. A player from the Dodgers hits the
ball, and it is fielded by a player from the Rockies.
..., 00:08:42, 00:08:53, 00:09:01, 00:09:07

ASR: Script Generation Using GPT-4:

0:05 gets ready for his second start of the
0:08 Season began this home stand got the win Input Prompt:
: : You are an expert at understanding audio
: : descriptions of different scenes in a video. Can you
9:04 that's the ball game Dodgers sweep the leverage the information provided, including title,
9:07 two-game series and they go four and two abstract, audio descriptions and generate full
9:09 on the season opening home stand audio description of each scene with non
overlapping time ranges. Keep as many scenes
Metadata Collection: possible covering all time ranges. You may find
Title: Rockies vs. Dodgers Game Highlights character names in the title or abstract. Use
Abstract: Rockies vs. Dodgers Game Highlights character names wherever possible in the audio
(4/4/23) | MLB Highlights descriptions. Keep the audio description for each
time range within one short sentence.

Clip-Level Video Description Generation Output:

Using GPT-4V 0:00:00 - 0:00:14: Video begins with a montage
of the Colorado Rockies and the Los Angeles
Dodgers players warming up for their game at
Input Prompt:
Dodger Stadium over the backdrop of the Los
Angeles skyline.

0:00:15 - 0:00:30: Dodge.rs pitcher, Julio Urias,

00:00:00 00:00:02 00:09:07 makes his second start of the season and deals a
You are an expert in understanding scene
powerful pitch towards the Rocky's batter. As the
transitions based on visual features in a
batter swings and makes contact, the ball heads
video. For the given sequence of images per
down the line under the glove of the Dodgers and
timestamp, identify different scenes in the
is marked as a double for the Rockies.
video. Generate audio description for each
scene with time ranges.
0:00:31 - 0:00:46: Urias successfully strikes out

Figure 4. An example of MM-V ID’s execution flow. Given a baseball video, MM-V ID provides an estimated video length, and then
invokes scene detection and ASR tools, and collects external knowledge. Then, we generate clip-level video descriptions by using GPT-4V.
GPT-4V takes the video frames and the text prompt as input, and outputs the video descriptions. Finally, given the clip-level descriptions,
video metadata and ASR, we use GPT-4 to generate a coherent script for the input video. Figures 7-9 show continued outputs. The original
video is available at https://www.youtube.com/watch?v=-pNe0p4H8ec

5
MM-V ID explains that this information was derived from tions provide contextual narration of meaningful visual el-
the commentator’s remarks during the game. The exam- ements, clarify speakers, and convey the essence of visual
ples illustrate MM-V ID’s multimodal reasoning capabili- information within a video. In our experiments, we also ex-
ties, where it integrates both visual and auditory cues to plore MM-V ID’s performance in audio description genera-
provide contextually accurate responses to user queries. tion. We experiment with videos where there is limited or
Hour-Long Video Comprehension. Figures 10-13 no speech content. In Figure 20, we showcase an example
demonstrate MM-V ID’s capabilities in processing lengthy featuring a short film of Mr. Bean taking an exam, which
videos. In this example, MM-V ID effectively analyzes a primarily lacks speech. Without ASR inputs, MM-V ID pro-
documentary video spanning approximately 50 minutes in cesses the video and generates a detailed script. This shows
duration. For simplicity, the intermediate outputs are omit- MM-V ID’s versatility in handling various types of video
ted in the figures, and only the final generated script is pre- content and its potential in creating inclusive and accessible
sented. We observe that MM-V ID is able to generate a long multimedia content.
script with the corresponding timestamps to represent the Self-Refinement. While the generated script offers a com-
documentary video. By leveraging this generated script as prehensive understanding of video content, our experiments
contextual information, MM-V ID is equipped to perform have unveiled occasional inaccuracies, especially in cases
a range of tasks, including summarizing the lengthy video, involving blurry or low-resolution video frames, as demon-
addressing specific queries raised within the video, and in- strated in Figure 21. In this example, MM-V ID mistak-
dexing pivotal moments. enly identifies a bird as a rock due to the challenges posed
Multi-Video Episodic Analysis. MM-V ID’s proficiency by the video’s visual quality. To address such inconsisten-
in handling extensive video content can be expanded to cies and elevate the overall accuracy of the generated script,
encompass multiple lengthy videos, as illustrated in Fig- we employ a self-refinement approach [45, 58, 80]. This
ures 14-16. In these examples, we upload multiple episodes involves revising the script based on both the initially gen-
to MM-V ID, showcasing its ability to perform a variety of erated script and a concurrently generated video summary.
complex tasks. MM-V ID exhibits the capability to sum- Through this process, MM-V ID is able to rectify errors and
marize the video series, engage in cross-episode reasoning, inaccuracies, resulting in a more refined output.
provide detailed descriptions of character journeys across Fast-Changing Short Videos. In Figure 22, we present an
multiple episodes, and facilitate grounded QA interactions. example of our experimentation with fast-changing short-
Character Identification. We found that incorporating form videos, such as those found on platforms like TikTok.
visual prompts enhances the quality of script generation, Short videos often feature non-standard frame sizes and
particularly with regards to character identification. In Fig- significantly shorter durations compared to conventional
ure 17, we illustrate this by providing MM-V ID with ad- videos. Remarkably, MM-V ID excels at accurately describ-
ditional inputs consisting of characters’ face photos and ing the cooking recipes depicted in these short videos, de-
their corresponding names. MM-V ID effectively utilizes spite the distinct characteristics of such content.
these visual prompts to identify the characters depicted in These examples demonstrate the versatility of MM-V ID
the video, based on the provided face photos. As a result, in processing a diverse array of video content. Whether
the script generation process is notably improved, ensur- dealing with lengthy documentaries, episodic series, or
ing more accurate and contextually relevant descriptions of short-form clips, MM-V ID adapts seamlessly to the unique
characters and their interactions within the video content. attributes of each video type, consistently delivering mean-
Speaker Identification. Our exploration has revealed an- ingful and contextually relevant descriptions.
other valuable application of visual prompting in enhanc-
6.3. Applications to Interactive Environments
ing the quality of Automatic Speech Recognition (ASR). In
Figures 18-19, we highlight a scenario where conventional In the following section, we evaluate MM-V ID when ap-
ASR struggles to accurately recognize the number of speak- plying to the context of streaming inputs. MM-V ID serves
ers and their identities in the video. Visual prompting plays as an agent in an interactive environment, continually re-
a pivotal role in enhancing ASR performance by providing ceiving streaming video frames as the inputs.
contextual cues to identify individuals and attribute speech Embodied Agent. Figure 23 illustrates an example where
to specific speakers. This improvement ensures more pre- MM-V ID is applied to an egocentric video captured by a
cise transcriptions, enabling a more accurate representation head-mounted camera. This video, collected from Ego4D
of the dialogue and interactions within the video content. dataset [25], provides a brief glimpse into the wearer’s daily
Audio Description Generation. Audio descriptions [26, life within their home environment. Remarkably, MM-V ID
57] play a crucial role in making videos accessible to in- showcases its capability in understanding such video con-
dividuals who are blind, have low vision, or face difficul- tent and assists the user in a few practical tasks. Specif-
ties in visually understanding the content. These descrip- ically, MM-V ID helps the user locate items like the pink

6
Table 1. Questionnaire for the group with visual impairments. Par- Table 2. Questionnaire for the group with normal vision. Partici-
ticipants listen to a video and subsequently assign scores (ranging pants view a video and subsequently assign scores (ranging from
from 0 to 10) for distinct auditory criteria. 0 to 10) for various auditory and visual criteria.

Effectiveness of Delivery: If the original audio and the Clarity: Are the visual elements clearly and accuratetly
embedded AD are effectively presented? described?
Informative: Is it easy to follow the storyline? Does the Conciseness: Does the AD convey essential visual infor-
AD provide context and background information when mation without overloading the user?
necessary? Timing and Synchronization: Are the original audio
Audio Quality: Is the overall audio production quality and the embedded AD effectively presented? Does the
good? AD properly synchronize with visual contents?
Overall Satisfaction: Are you satisfied with the overall Informative: Is it easy to follow the storyline? Does the
AD experience? AD provide context and background information when
necessary?
Audio Quality: Is the overall audio production quality
good?
Overall Satisfaction: Are you satisfied with the overall
jacket and the laptop within the home. Additionally, it gen-
AD experience?
erates a list of the user’s activities within a specified time
range, offering insights into the wearer’s daily routine.
Playing Video Games. Figures 24-27 demonstrate the
results of applying MM-V ID to a Mario video game [4].
vision. All the participants have normal hearing. For the
In these experiments, our agent consistently receives three
purposes of the experiments, we segregated participants into
video frames as states and calculates the next possible con-
two distinct groups: (i) Group with visual impairments, and
trol action. Remarkably, our agent displays an understand-
(ii) Group with normal vision.
ing of the specific video game dynamics and generates rea-
sonable action controls to play the game effectively. These
examples highlight MM-V ID’s ability to comprehend and 6.4.1 Evaluation Procedure
navigate in an interactive gaming environment. Interested Our experiments utilize a curated set of videos, which are
readers may find the full gameplay demonstration on our mainly suggested by the American Council of the Blind2 .
project website. We also collected accessibility videos from YouTube3 . For
GUI Navigation. Figures 28-32 provide the demonstra- every video used in our evaluation, participants are exposed
tion of MM-V ID’s performance in the GUI navigation sce- to two versions: the first containing human-crafted AD and
nario. In this context, the agent continually receives iPhone the second powered by MM-V ID-generated AD. Both ren-
screenshots and previous user actions as states. The agent ditions are narrated using text-to-speech (TTS) technology.
effectively predicts the possible next steps in the user’s jour- We have designed two questionnaires for the two groups,
ney, which may include clicking on the correct shopping referenced in Table 1 and Table 2, respectively. Participants
apps, initiating searches for items of interest, and ultimately with visual impairments are instructed to base their evalu-
placing an order. These results demonstrate MM-V ID’s re- ation exclusively on auditory cues. In contrast, those with
markable ability to interact with graphical user interfaces, normal vision are instructed to consider both visual and au-
facilitating seamless and intelligent navigation through dig- ditory elements.
ital interfaces. The assessment adopts the standardized Likert scale for
6.4. User Study ratings. For each posed question, participants are guided to
assign a score ranging from 0 to 10, with higher values indi-
We explore the potential of MM-V ID for people who are cating more favorable feedback. Furthermore, participants
blind or have low vision. Audio description (AD) [26, 57] are urged to share feedback and remarks concerning their
provides an auditory narration integrated into the video’s overall experience.
soundtrack, offering important visual details that may not
be discernible from the main video soundtrack. Such de- 6.4.2 Results on the Group with Visual Impairments
scriptions play a pivotal role in conveying essential visual
content to those with visual impairments. We utilized 3 different videos for our evaluation, with
To assess the efficacy of MM-V ID in generating audio durations of 1 minute, 1 minute 42 seconds, and 2 minutes
descriptions (AD), we conduct a user study. We invited 9 2 The
Audio Description Project: https://adp.acb.org/
participants for the evaluation. 4 participants were either 3 Apple
Accessibility: https://www.youtube.com/watch?v=
blind or had low vision, while the remaining 5 had normal SL7YSqlEd8k

7
42 seconds, respectively. Each of the 4 participants with
visual impairment was well versed with screen reader
10Results on Group with Visual Impairments
and other common accessibility tools. After listening to 8
the audio descriptions for each video, they were asked to
respond to the 4 questions outlined in Table 1. 6

Rating
Hypotheses and Results 4
H1: The MM-V ID-generated audio description and 2 Human-written AD
original video dialogues are effectively presented to the MMVid-generated AD
participants. 0

ty
ry

n
tio
Results: Using the Likert scale (0=Not Effective to

ali
ve

ati

ac
Qu
eli

orm
10=Most Effective) the participants rated the effectiveness

f
t. D

tis
dio
Inf

Sa
of the delivery of human-crafted AD and MM-V ID-

Au
Eff
generated AD. On average, participants gave 7.14 ± 1.39
for MM-V ID-generated AD and 8.33 ± 0.90 for human- Figure 5. Results on the group with visual impairments. MM-
crafted AD, which shows a MM-V ID-generated AD very V ID-generated AD is close to human-generated ones in terms
close to human-crafted one in terms of effective delivery of audio quality and effectiveness of delivery. However, MM-
(Figure 5). V ID’s AD yields lower satisfaction levels compared to the human-
generated ones. This was primarily attributed to occasional over-
H2: Participants are able to follow the main story laps between the audio descriptions and the video dialogues.
line of the video based on MM-V ID-generated audio
description only.
vide a cost-effective and scalable solution. Thus, millions
Results: Using the Likert scale (0=Not Informative to
of videos that cannot afford to be professionally audio
10=Highly Informative) the participants rated the infor-
described, can be auto-processed by a tool like MM-V ID
mativeness of human-crafted AD and MM-V ID-generated
to make them accessible to the visual-impaired community.
AD. On average, participants gave 7.14 ± 1.16 for MM-
V ID-generated AD and 9.29 ± 0.91 for human-crafted AD,
which shows little difference in informativeness between 6.4.3 Results on the Group with Normal Vision
MM-V ID-generated AD and human-crafted one (Figure
5). For sighted individuals, we used the same set of videos as
we used for individuals with visual impairments. All of
H3: MM-V ID-generated AD and human-crafted AD our 5 participants answered to 6 questions listed in Table 2
are close in terms of voice and audio quality. after watching videos embedded with MM-V ID-generated
Results: Using the Likert scale (0=Low Quality to AD as subtitles and audio track.
10=High Quality) the participants rated the voice and audio
quality on average as 8.91 ± 1.23 for MM-V ID-generated Hypotheses and Results
AD and 9.07 ± 0.65 for human-crafted AD. This minimal
difference between the scores indicates the close-to-human H1: The MM-V ID-generated AD is accurate and conveys
voice and audio quality of MM-V ID-generated AD (Figure essential information without overloading the listener.
5). Results: The sighted individuals rated the clarify and
accuracy of MM-V ID-generated AD as 7.83 ± 1.24 and
Discussion: human-curated AD as 8.9 ± 0.74 on average, using the
The results show that the participants’ overall satisfaction Likert scale (0=Not Accurate to 10=Most Accurate).
of MM-V ID-generated ADs was on average around 2 In terms of conciseness, the participants on average gave
points less than human-crafted ones in the Likert scale 8.73±0.49 for the MM-V ID-generated AD and 9.16±0.54
(0=Not Satisfied to 10=Highly satisfied) (Figure 5). Some for human-curated AD based on the Likert scale (0=Not
of the difficulties indicated by participants while listening concise to 10=Most concise). These results indicate
to MM-V ID-generated ADs were 1) occasional overlaps MM-V ID-generated ADs are close to human-curated ones
between AD audio and original video dialogues 2) wrong in terms of accuracy and conciseness (Figure 6).
descriptions due to hallucinations of GPT-4V(ision).
Regardless of the difference in overall satisfaction, all the H2: The MM-V ID-generated ADs are in sync with
participants agreed that MM-V ID-generated AD can pro- visual content and do not overlap with other dialogues
ensuring listeners can follow the story line.

8
Results: Participants gave on average 8.90 ± 0.90 and
10 Results on Group with Normal Vision
7.97 ± 1.54 to human-crafted AD and MM-V ID-generated
AD respectively using the Likert scale (0=Not Infor- 8
mative to 10=Highly Informative). Human-crafted AD
and MM-V ID-generated AD received 8.59 ± 0.95 and 6

Rating
8.53 ± 0.58 respectively on the aspect of timing and
synchronization using the Likert scale (0=Not Effective 4
to 10=Most Effective). These indicates while listening to
2 Human-written AD
MM-V ID-generated ADs participants were able to follow MMVid-generated AD
main story line and found the audios are in sync with video 0
content very close to that of human-crafted ADs (Figure 6).

rity

ty
ss

n
tio
yn

ali
ne

ati
Cla

fac
dS

Qu
ise

orm

tis
H3: The voice and audio quality of MM-V ID-generated

an
nc

dio
Inf

Sa
Co

ing

Au
ADs are close to human-crafted ADs.

Tim
Results: The results are very similar to results on group
with visual impairments. Sighted participants rated the Figure 6. Results on the group with normal vision. MM-V ID-
voice and audio quality on average as 8.30 ± 0.89 for MM- generated AD was on average <1 points lower than human-
V ID-generated AD and as 8.93 ± 0.32 for human-crafted generated ADs. The participants were able to follow the main
AD. Therefore the voice and audio experience did not story line and the audios are in sync with the video content.
degrade much while listening to MM-V ID-generated ADs
compare to human-crafted ADs (Figure 6).
and I did not depend on someone else.”
Discussion: P2: “If it’s AI-generated, there are so many
The evaluations on sighted individuals helped to cross places it’s not available, and we need it there.”
verify the hypotheses of individuals with visual impair- P2: “First time listening to auto-generated AD.
ments, that are based on audio cues only. Although the As a user, if I am offered this AD, I would take it.”
overall satisfaction points for sighted participants with P3: “Nearly perfect. Most favorite was the de-
MM-V ID-generated ADs was on average <1 points lower tails.”
than human-generated ADs (Figure 6), the overall satis- P3: “More information helped me follow the sto-
faction points for participants who were blind was worse. ryline.”
This is expected because sighted individuals had access to P3: “It allowed me to follow anything visual. It
both audio and video modalities but individuals with visual felt natural the way AD describes how the actor
impairments did not. We also believe the reason for lower interacts with the environment.”
overall satisfaction, may have been the lack of practice P3: “I love animal kingdom, and I watch Wild
listening to auto generated ADs. Some of the users also Earth safari virtual tour. I would love to have au-
mentioned they have preference between pitches of voice dio descriptions of Wild Earth videos and daily
and number of concurrent audio channels. These may add safaris.”
to the reason of lower overall satisfaction. P4: “I would like to have auto-generated au-
dio description for live conferences in Microsoft
6.4.4 Participant Feedback Teams.”
P4: “It worked best as the original audio had not
We present a collection of interview quotes from our par- much value.”
ticipants who were visually impaired, in which they share
their personal experiences and insights about the audio de- Despite the positive feedback, not all responses were fa-
scriptions (AD) generated by MM-V ID. The participants vorable:
expressed a unanimous desire to continue utilizing this AD
generation service in the future, highlighting its exceptional P4: “I am skeptical when it becomes subjective.
quality (“Nearly perfect”), intricate details (“favorite was Sometimes I feel they make up stories which is not
the details”), extensive applicability (“allowed me to follow good.”
anything visual”), and the profound impact it has on them P4: “After listening to the human-generated AD,
(“I did not depend on someone else”). Below, we provide I figured I misunderstood parts of the original
additional quotes for further insight. story.”
P1: “It keeps referring to the same person using
P1: “I understand what is going on very quickly their names instead of pronouns.”

9
P4: “I don’t deal well with overlapped or two [6] Vlog: Video as a long document. https://github.
parallel audios.” com/showlab/VLog, 2023. 3
[7] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine
Interestingly, even those participants who provided criti- Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch,
cal feedback still rated the MM-V ID-generated AD closely Katie Millican, Malcolm Reynolds, et al. Flamingo: a vi-
to human-generated AD, during the questionnaire sessions. sual language model for few-shot learning. arXiv preprint
This indicates that, similar to human-curated AD, adapting arXiv:2204.14198, 2022. 2
[8] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin John-
to MM-V ID-generated ADs might necessitate some prac-
son, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri,
tice and acclimatization over time.
Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2
technical report. arXiv preprint arXiv:2305.10403, 2023. 2
7. Conclusion [9] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen
Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video
We have presented MM-V ID, a system that synergizes
vision transformer. In Proceedings of the IEEE/CVF inter-
with GPT-4V for advancing video understanding. MM- national conference on computer vision, pages 6836–6846,
V ID employs GPT-4V to transcribe video content into 2021. 2, 3
long and detailed scripts, thereby enriching LLMs with ad- [10] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisser-
vanced video understanding capabilities. Experimental re- man. Frozen in time: A joint video and image encoder for
sults demonstrate the effectiveness of MM-V ID in address- end-to-end retrieval. In Proceedings of the IEEE/CVF Inter-
ing challenging tasks, including comprehension of hour- national Conference on Computer Vision, pages 1728–1738,
long videos, analysis across multiple episodes, identifica- 2021. 2, 3
tion of characters and speakers, and interaction with video [11] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga,
games and graphical user interfaces. Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampe-
Beyond the development of the MM-V ID system, we dro, and Jeff Clune. Video pretraining (vpt): Learning to act
by watching unlabeled online videos. Advances in Neural
conducted an extensive user study, drawing feedback from a
Information Processing Systems, 35:24639–24654, 2022. 2
varied group of participants. The outcomes of this study in-
[12] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe
dicated that the audio descriptions generated by MM-V ID Morency. Multimodal machine learning: A survey and tax-
closely mirror the quality of those crafted by humans. In onomy. IEEE transactions on pattern analysis and machine
our future work, we plan to explore SoM [76] and object intelligence, 41(2):423–443, 2018. 2
tracking techniques to enhance various tasks and function- [13] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is
alities. space-time attention all you need for video understanding?
In ICML, volume 2, page 4, 2021. 2, 3
Acknowledgment [14] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zis-
We are deeply grateful to OpenAI for providing access serman. A short note on the kinetics-700 human action
to their exceptional tool [3, 51–53]. We are profoundly dataset. arXiv preprint arXiv:1907.06987, 2019. 2, 3
thankful to Misha Bilenko for his invaluable guidance and [15] Joao Carreira and Andrew Zisserman. Quo vadis, action
recognition? a new model and the kinetics dataset. In pro-
support. We also extend heartfelt thanks to our Microsoft
ceedings of the IEEE Conference on Computer Vision and
colleagues for their insights, with special acknowledgment Pattern Recognition, pages 6299–6308, 2017. 2, 3
to Cenyu Zhang, Saqib Shaikh, Ailsa Leen, Jeremy Curry, [16] Harrison Chase. Langchain. https://langchain.
Crystal Jones, Roberto Perez, Ryan Shugart, Anne Taylor readthedocs.io/, 2023. 3
for their constructive feedback. [17] Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li,
Takuya Yoshioka, Chengyi Wang, Shujie Liu, and Ming
References Zhou. Continuous speech separation with conformer.
In ICASSP 2021-2021 IEEE International Conference on
[1] Dota 2. https://openai.com/research/dota-2, Acoustics, Speech and Signal Processing (ICASSP), pages
2017. 2 5749–5753. IEEE, 2021. 2
[2] Azure cognitive services apis. https : / / azure . [18] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang-
microsoft . com / products / cognitive - hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yong-
services, 2023. 4 hao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P.
[3] Chatgpt can now see, hear, and speak. https://openai. Xing. Vicuna: An open-source chatbot impressing gpt-4
com/blog/chatgpt- can- now- see- hear- and- with 90%* chatgpt quality, March 2023. 3
speak, 2023. 3, 10 [19] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
[4] Pygame library. https://www.pygame.org/, 2023. 2, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul
7, 31, 32, 33, 34 Barham, Hyung Won Chung, Charles Sutton, Sebastian
[5] Pyscenedetect: Video scene cut detection and analysis tool. Gehrmann, et al. Palm: Scaling language modeling with
https://www.scenedetect.com/, 2023. 4 pathways. arXiv preprint arXiv:2204.02311, 2022. 2, 3

10
[20] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, [31] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg.
Zhicheng Yan, Jitendra Malik, and Christoph Feichten- Tvqa: Localized, compositional video question answering.
hofer. Multiscale vision transformers. In Proceedings of In EMNLP, 2018. 2, 3
the IEEE/CVF International Conference on Computer Vision [32] Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. Tvr:
(ICCV), pages 6824–6835, October 2021. 2, 3 A large-scale dataset for video-subtitle moment retrieval. In
[21] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and ECCV, 2020. 3
Kaiming He. Slowfast networks for video recognition. In [33] Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang,
Proceedings of the IEEE/CVF international conference on Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foun-
computer vision, pages 6202–6211, 2019. 2, 3 dation models: From specialists to general-purpose assis-
[22] Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang tants. arXiv preprint arXiv:2309.10020, 2023. 3
Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to-end [34] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.
video-language transformers with masked visual-token mod- Blip-2: Bootstrapping language-image pre-training with
eling. arXiv preprint arXiv:2111.12681, 2021. 2, 3 frozen image encoders and large language models. arXiv
[23] Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang preprint arXiv:2301.12597, 2023. 3
Wang, Lijuan Wang, and Zicheng Liu. An empirical study [35] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai
of end-to-end video-language transformers with masked vi- Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao.
sual modeling. In Proceedings of the IEEE/CVF Conference Videochat: Chat-centric video understanding. arXiv preprint
on Computer Vision and Pattern Recognition, pages 22898– arXiv:2305.06355, 2023. 3
22909, 2023. 2, 3 [36] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu,
[24] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- and Jingjing Liu. Hero: Hierarchical encoder for video+
ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, language omni-representation pre-training. arXiv preprint
Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz arXiv:2005.00200, 2020. 2, 3
Mueller-Freitag, et al. The” something something” video [37] Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng
database for learning and evaluating visual common sense. Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video-
In Proceedings of the IEEE international conference on com- language understanding as masked language modeling. In
puter vision, pages 5842–5850, 2017. 2 Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition, pages 23119–23129, 2023. 2,
[25] Kristen Grauman, Andrew Westbury, Eugene Byrne,
3
Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson
[38] Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen,
Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d:
Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang,
Around the world in 3,000 hours of egocentric video. In Pro-
William Yang Wang, et al. Value: A multi-task bench-
ceedings of the IEEE/CVF Conference on Computer Vision
mark for video-and-language understanding evaluation. In
and Pattern Recognition, pages 18995–19012, 2022. 6, 30
35th Conference on Neural Information Processing Systems
[26] Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi (NeurIPS 2021) Track on Datasets and Benchmarks, 2021.
Xie, and Andrew Zisserman. Autoad: Movie description 2, 3
in context. In Proceedings of the IEEE/CVF Conference [39] Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe
on Computer Vision and Pattern Recognition, pages 18930– Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. Swin-
18940, 2023. 2, 6, 7 bert: End-to-end transformers with sparse attention for video
[27] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas captioning. In Proceedings of the IEEE/CVF Conference
Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video on Computer Vision and Pattern Recognition, pages 17949–
classification with convolutional neural networks. In Pro- 17958, 2022. 2, 3
ceedings of the IEEE conference on Computer Vision and [40] Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael
Pattern Recognition, pages 1725–1732, 2014. 2 Wray, Rui Yan, Eric Z XU, Difei Gao, Rong-Cheng Tu, Wen-
[28] Hildegard Kuehne, Hueihan Jhuang, Estı́baliz Garrote, zhe Zhao, Weijie Kong, et al. Egocentric video-language
Tomaso Poggio, and Thomas Serre. Hmdb: a large video pretraining. Advances in Neural Information Processing Sys-
database for human motion recognition. In 2011 Inter- tems, 35:7575–7586, 2022. 2, 3
national conference on computer vision, pages 2556–2563. [41] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee.
IEEE, 2011. 3 Visual instruction tuning, 2023. 3
[29] Anna Kukleva, Makarand Tapaswi, and Ivan Laptev. Learn- [42] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang,
ing interactions and relationships between movie characters. Stephen Lin, and Han Hu. Video swin transformer. In Pro-
In Proceedings of the IEEE/CVF Conference on Computer ceedings of the IEEE/CVF conference on computer vision
Vision and Pattern Recognition, pages 9849–9858, 2020. 2 and pattern recognition, pages 3202–3211, 2022. 2, 3
[30] Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, [43] Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui
Mohit Bansal, and Jingjing Liu. Less is more: Clipbert for Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley:
video-and-language learning via sparse sampling. In Pro- Video assistant with large language model enhanced ability.
ceedings of the IEEE/CVF Conference on Computer Vision arXiv preprint arXiv:2306.07207, 2023. 3
and Pattern Recognition (CVPR), pages 7331–7341, June [44] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa-
2021. 2, 3 had Shahbaz Khan. Video-chatgpt: Towards detailed video

11
understanding via large vision and language models. arXiv [61] David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan
preprint arXiv:2306.05424, 2023. 3 McCree, Daniel Povey, and Sanjeev Khudanpur. Speaker
[45] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hal- recognition for multi-speaker conversations using x-vectors.
linan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, In ICASSP 2019-2019 IEEE International conference on
Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: acoustics, speech and signal processing (ICASSP), pages
Iterative refinement with self-feedback. arXiv preprint 5796–5800. IEEE, 2019. 2
arXiv:2303.17651, 2023. 6 [62] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng
[46] EVERINGHAM Mark. Hello! my name is... buffy’- Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan
automatic naming of characters in tv video. In Proceedings Lu, Jenq-Neng Hwang, et al. Moviechat: From dense to-
of British Machine Vision Conference, 2006, 2006. 2 ken to sparse memory for long video understanding. arXiv
[47] Microsoft. Bingchat. https://www.microsoft.com/ preprint arXiv:2307.16449, 2023. 2, 3
en-us/edge/features/bing-chat, 2023. 2 [63] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah.
[48] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Ucf101: A dataset of 101 human actions classes from videos
Makarand Tapaswi, Ivan Laptev, and Josef Sivic. in the wild. arXiv preprint arXiv:1212.0402, 2012. 3
HowTo100M: Learning a Text-Video Embedding by [64] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert,
Watching Hundred Million Narrated Video Clips. In ICCV, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
2019. 2 Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al.
[49] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Llama 2: Open foundation and fine-tuned chat models. arXiv
Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:2307.09288, 2023. 3
preprint arXiv:1706.08612, 2017. 2 [65] Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai,
[50] Arsha Nagrani and Andrew Zisserman. From benedict cum- Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. Chatvideo: A
berbatch to sherlock holmes: Character identification in tv tracklet-centric multimodal and versatile video understand-
series without a script. arXiv preprint arXiv:1801.10442, ing system. arXiv preprint arXiv:2304.14407, 2023. 3
2018. 2 [66] Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge,
Kevin Qinghong Lin, Satoshi Tsutsui, Xudong Lin,
[51] OpenAI. Gpt-4 technical report. 2023. 2, 3, 10
Guanyu Cai, Jianping Wu, Ying Shan, et al. All in one:
[52] OpenAI. Gpt-4v(ision) system card. 2023. 2, 3, 10
Exploring unified video-language pre-training. In Proceed-
[53] OpenAI. Gpt-4v(ision) technical work and authors. https: ings of the IEEE/CVF Conference on Computer Vision and
//cdn.openai.com/contributions/gpt- 4v. Pattern Recognition, pages 6598–6608, 2023. 2, 3
pdf, 2023. 10
[67] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang,
[54] Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Mered- Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai
ith Ringel Morris, Percy Liang, and Michael S Bernstein. Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A
Generative agents: Interactive simulacra of human behavior. survey on large language model based autonomous agents,
arXiv preprint arXiv:2304.03442, 2023. 2 2023. 2
[55] Karine Pires and Gwendal Simon. Youtube live and twitch: [68] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang
a tour of user-generated live streaming systems. In Proceed- Wang, and William Yang Wang. Vatex: A large-scale, high-
ings of the 6th ACM multimedia systems conference, pages quality multilingual dataset for video-and-language research.
225–230, 2015. 2 In The IEEE International Conference on Computer Vision
[56] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, (ICCV), October 2019. 2, 3
Christine McLeavey, and Ilya Sutskever. Robust speech [69] Yunbo Wang, Mingsheng Long, Jianmin Wang, and Philip S
recognition via large-scale weak supervision. In Inter- Yu. Spatiotemporal pyramid network for video action recog-
national Conference on Machine Learning, pages 28492– nition. In Proceedings of the IEEE conference on Computer
28518. PMLR, 2023. 3 Vision and Pattern Recognition, pages 1529–1538, 2017. 2,
[57] Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt 3
Schiele. A dataset for movie description. In Proceedings of [70] Chao-Yuan Wu and Philipp Krahenbuhl. Towards long-form
the IEEE conference on computer vision and pattern recog- video understanding. In Proceedings of the IEEE/CVF Con-
nition, pages 3202–3212, 2015. 2, 6, 7 ference on Computer Vision and Pattern Recognition, pages
[58] Noah Shinn, Federico Cassano, Beck Labash, Ashwin 1884–1894, 2021. 2
Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: [71] Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan,
Language agents with verbal reinforcement learning, 2023. Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A gen-
6 erative region-to-text transformer for object understanding.
[59] Mustafa Shukor, Corentin Dancette, Alexandre Rame, and arXiv preprint arXiv:2212.00280, 2022. 3
Matthieu Cord. Unified model for image, video, audio and [72] Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen
language tasks. arXiv preprint arXiv:2307.16184, 2023. 2 Bach, Ce Liu, and Michael Zeng. Visual clues: Bridging vi-
[60] Cees GM Snoek and Marcel Worring. Multimodal video in- sion and language foundations for image paragraph caption-
dexing: A review of the state-of-the-art. Multimedia tools ing. Advances in Neural Information Processing Systems,
and applications, 25:5–35, 2005. 2 35:17287–17300, 2022. 3

12
[73] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large
video description dataset for bridging video and language. In
CVPR, 2016. 3
[74] Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal
learning with transformers: A survey. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 2023. 2
[75] Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong
Wang, Weidong Liu, and Yang Liu. Exploring large language
models for communication games: An empirical study on
werewolf. arXiv preprint arXiv:2309.04658, 2023. 2
[76] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan
Li, and Jianfeng Gao. Set-of-mark prompting unleashes
extraordinary visual grounding in gpt-4v. arXiv preprint
arXiv:2310.11441, 2023. 10
[77] Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant,
Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao,
Yi-Ling Chen, et al. i-code: An integrative and compos-
able multimodal learning framework. In Proceedings of the
AAAI Conference on Artificial Intelligence, volume 37, pages
10880–10890, 2023. 2
[78] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang,
Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn
of lmms: Preliminary explorations with gpt-4v(ision). arXiv
preprint arXiv:2309.17421, 2023. 2, 3, 4
[79] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin,
Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu,
Michael Zeng, and Lijuan Wang. Mm-react: Prompting
chatgpt for multimodal reasoning and action. arXiv preprint
arXiv:2303.11381, 2023. 3, 4
[80] Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin,
Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. Idea2img:
Iterative self-refinement with gpt-4v (ision) for auto-
matic image design and generation. arXiv preprint
arXiv:2310.08541, 2023. 6
[81] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella,
Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang,
Boxin Li, Chunyuan Li, et al. Florence: A new
foundation model for computer vision. arXiv preprint
arXiv:2111.11432, 2021. 2
[82] Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu,
Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Mer-
lot: Multimodal neural script knowledge models. Advances
in Neural Information Processing Systems, 34:23634–23651,
2021. 2
[83] Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou,
Yilun Du, Joshua B Tenenbaum, Tianmin Shu, and Chuang
Gan. Building cooperative embodied agents modularly with
large language models. arXiv preprint arXiv:2307.02485,
2023. 2
[84] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An
instruction-tuned audio-visual language model for video un-
derstanding. arXiv preprint arXiv:2306.02858, 2023. 3
[85] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe,
Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab,
Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained
transformer language models, 2022. URL https://arxiv.
org/abs/2205.01068. 3

13
Script Generation Using GPT-4 takes a turn at bat. After a quick scuffle with
(continued): the ball, he blasts it into left field, scoring yet
another home run.
Chris Bryant from the Rockies, employing a
rt series of breaking stuff and tempting 0:03:11 - 0:03:26: The Dodgers are in a roll. Max
fastballs at the top of the strike zone. Muncie adds another point to the Dodgers lead by
sending the ball deep into the right field.
0:00:47 - 0:01:02: Next up, KRON from the
Rockies pops the ball to the infield. Dodgers' 0:03:27 - 0:03:42: The Rockies' are unfortunate to
catcher, Will Smith, steps out and catches the lose a run despite a successful hit due to swift
ball despite a one-out double, stopping the coordination between the Dodgers' players.
Rockies from making a score.
0:03:43 - 0:03:58: Jason Hayward from Dodgers
0:01:03 - 0:01:18: The video pans to Dodgers' adds to Rockies' dismay by smacking a lead-off shot
lead-off hitter, Mookie Betts, who manages to in the third inning of the game.
hit a lead-off single to the centre field, setting
a positive start for the Dodgers. 0:03:59 - 0:04:14: Julio Urias, from the Dodgers,
efficiently strikes out the Venezuelans' Luis
0:01:19 - 0:01:34: Will Smith steps up and Montero and Ryan McMahon in quick succession.
drives a fly ball to left which goes deep,
resulting in him scoring a home run for the 0:04:15 - 0:04:30: A moment of respite for the
third game in a row. Rockies as Urias throws a pitch, allowing the batter
to hit a ground ball into infield. However, it is
0:01:35 - 0:01:50: Scene next highlights Max quickly fielded, ending their chances. The innings
Muncy's performance who contributes to the end with the Dodgers in the lead by four runs.
Dodgers scoring by cracking a ball to the
left-center field which is then picked up by the 0:04:31 - 0:04:46: The scene captures dodgers' Max
Rockies' fielder. Muncy blasting a hit into the field, which bumps
the Dodgers' lead to four.
0:01:51 - 0:02:06: The scene changes to a duo
of spectacular plays by the Dodgers. First, a 0:04:47 - 0:05:02: Despite a promising start for the
double play managed by shortstop Tovar, Rockies at the top of the third, their momentum
swiftly eliminating JD Martinez to curb the slows down as they lose a run due to a double play
first inning. Then in the second inning, we see induced by Urias.
Urias successfully striking out Rockies' Ryan
McMahon. 0:05:03 - 0:05:18: Hayward makes a fantastic
lead-off shot for the Dodgers' in the third
0:02:07 - 0:02:22: Urias faces a setback as inning, continuing their advantageous streak.
McMahon bounces back with a high-quality
deep shot to the center field, netting himself 0:05:19 - 0:05:34: Back at the pitching mound,
a double. Urias expertly strikes out the Rockies' Luis Montero
and leaves the Rockies scoreless for 6 games.
0:02:23 - 0:02:38: Despite the Rockies
improving, Urias ends the inning by striking 0:05:35 - 0:05:50: Urias's superior pitching
out the Rockies' third baseman, Ellen Trejo, continues as he strikes out Ryan McMahon and
leaving McMahon stranded at second base. skillfully fields a sharp grounder ending the fourth
inning, serving a huge blow to the Rockies.
0:02:39 - 0:02:54: The tables turn as the
dodgers regain their footing. Mookie Betts 0:05:51 - 0:06:06: Scene cuts to Dodgers' Mookie
smacks the ball into right field and makes his Betts, who despite an injury, hammers a hit into the
way safely to second base. field, landing him a double.

0:02:55 - 0:03:10: Dodgers player Will Smith 0:06:07 - 0:06:22: Despite leading the series,

Figure 7. An example of MM-V ID’s execution flow (continued). We present the full script generated by MM-V ID. The original video is
available at https://www.youtube.com/watch?v=-pNe0p4H8ec

14
Script Generation Using GPT-4 (continued)
(continued):

Dodgers' catcher Will Smith lifts a high ball

rt into center field which brings Mookie Betts
running home, netting the Dodgers another
point.

0:06:23 - 0:06:38: The Rockies' show some

fight as Chris Brian leads the ninth innings
with a double bringing the Rockies trailing the
Dodgers by five.
Will Smith hits a home run in the first inning
0:06:39 - 0:06:54: A series of unfortunate
at timestamp 0:01:19 - 0:01:34
events for the Rockies as batters Montero and
Diaz get directly hit by pitched balls
consecutively. Trejo from the Rockies finally Compared to the last season, did Will Smith
manages to hit a fly ball into left field, become stronger?
allowing Montero to make it to the home
plate, closing the gap between the two teams
to three.
Based
Will on thehits
Smith video, it is run
a home suggested that
at 0:01:19 Will
- 0:01:34
Smith has become stronger compared to
0:06:55 - 0:07:10: The game comes to a close
the last season, as there is a different
as Charlie Blackman grounds to the first base,
sound when the ball leaves his bat.
ending the game in favor of Dodgers, after a
thrilling and eventful match.

0:07:11 - 0:09:09: The scene concludes with How did you know the sound is different?
the teams leaving the field.

The commentator mentions that there is a

Can you summarize the game? different sound when the ball leaves Will
Smith's bat, implying that he might have
become stronger.
The game was a thrilling and eventful match
between the Colorado Rockies and the Los Evidence in ASR:
Angeles Dodgers. The Dodgers dominated 1:27 the track at the wall Smith goes deep
the game with superior pitching and 1:29 again third game in a row
1:34 from Big Time trajectory on that one
powerful hits, while the Rockies showed
1:37 Will Smith must have gotten stronger in
some fight towards the end but ultimately
1:39 the off season there's a different sound
fell short.
1:41 now when the ball leaves his Bat so the
1:44 Dodgers on the board quickly two-run
1:46 home run in the first inning for Will
Show me the most exciting moment in this 1:48 Smith first time in his career he's gone
video 1:50 deep three games in a row one two to
1:52 Muncie is crack the left center field

The most exciting moment in this video is

when Will Smith hits a home run in the first
inning at timestamp 0:01:19 - 0:01:34 in Who won the game, and what are their final
this video scores?

Figure 8. An example of MM-V ID’s execution flow (continued). Given the generated script as reference, MM-V ID achieves various video
understanding capabilities, including grounded QA and multimodal reasoning. The original video is available at https://www.
youtube.com/watch?v=-pNe0p4H8ec

15
The Los Angeles Dodgers won the game
with a final score of 5-2. (continued)

Final scores were displayed at 0:07:11 - 0:09:09,

Julio Urias had his first strikeout at 0:00:31
and mentioned by the commentator at 0:08:53

Who are the best pitchers in this video?

When did Julio Urias induce a double play?

Julio Urias was the best pitcher in this

video, striking out multiple batters and Julio Urias induced a double play at around
inducing a double play. 0:02:07 in the video.
Evidence in the Generated Script:
0:02:07 - 0:02:22: Urias faces a setback as
McMahon bounces back with a high-quality deep
shot to the center field, netting himself a double.
0:03:59 - 0:04:14: Julio Urias, from the Dodgers,
efficiently strikes out the Venezuelans' Luis
Montero and Ryan McMahon in quick succession.
0:04:47 - 0:05:02: Despite a promising start for the
Rockies at the top of the third, their momentum slows
down as they lose a run due to a double play
induced by Urias. Julio Urias induced a double play around 0:02:07
0:05:35 - 0:05:50: Urias's superior pitching
continues as he strikes out Ryan McMahon and
skillfully fields a sharp grounder ending the
fourth inning, serving a huge blow to the Rockies.

When did Julio Urias have his first strikeout?

Julio Urias had his first strikeout in the

video at 0:00:31, striking out Chris Bryant
from the Rockies.

Figure 9. An example of MM-V ID’s execution flow (continued). Given the generated script as reference, MM-V ID achieves various video
understanding capabilities, including grounded QA and multimodal reasoning. The original video is available at https://www.
youtube.com/watch?v=-pNe0p4H8ec

16
Figure 10. Case studies of MM-V ID ’s capabilities and application scenarios: hour-long video comprehension. Figures 11-13 show
continued outputs. The original video is available at https://www.youtube.com/watch?v=um2Q9aUecy0

17
Scene 31: 00:06:46 Scene 46: 00:09:03
An elephant and a gorilla roam the jungle. Close-up of a bug on a leaf.

Scene 32: 00:06:57 Scene 47: 00:09:43

The gorilla relocates deeper into the jungle. Landscape view of mountains with clouds hovering
overhead.
Scene 33: 00:07:05
The gorilla is seen wading in a small body of Scene 48: 00:10:32
water. Close-up of a bird as it takes flight.

Scene 34: 00:07:13 Scene 49: 00:10:50

Two gorillas spotted playing in the grass. Close-up of a gorilla's face hidden behind the leaves.

Scene 35: 00:07:27 Scene 50: 00:11:23

Admiring the grass, the gorilla takes a deserved A gorilla is seen searching the dense forest floor.
break.
Scene 51: 00:12:15
Scene 36: 00:07:32 Close-up of a black bird with large eyes.
Aerial view of gorillas and elephants in the
marsh. Scene 52: 00:13:03
A black bird is seen walking in a jungle path.
Scene 37: 00:07:37
Close-up of the elephant in the water. Scene 53: 00:13:29
Bird of prey perched on tree branches in the jungle.
Scene 38: 00:07:43
Two elephants are seen fighting and splashing in Scene 54: 00:15:05
the water. Close-up of a bird's eye in the forest.

Scene 39: 00:08:00 Scene 55: 00:16:38

A lone elephant is caught grazing in the A shot of a gorilla hiding behind a tree in a dense
landscape. jungle.

Scene 40: 00:08:08 Scene 56: 00:17:50

Scene transitions to a group of gorillas in their Two elephants engage in a playful fight near a water
natural habitat. body.

Scene 41: 00:08:16 Scene 57: 00:19:20

A herd of elephants is spotted in shallow water. The leopard is seen slowly moving through the
foliage.
Scene 42: 00:08:22
Aerial view of a vast landscape with a mountain Scene 58: 00:21:51
peak and flying birds. Bird seen entering a hole in a tree.

Scene 43: 00:08:28 Scene 59: 00:22:22

Aerial view of a wide, green landscape with A red ant crawls on a fern.
scattered water bodies and flying birds.
Scene 60: 00:23:32
Scene 44: 00:08:34 A peculiar mushroom blooms in the darkness near
Aerial view of a mountainous range with a valley the forest floor.
and a river.
Scene 61: 00:25:07
Scene 45: 00:08:41 A deer seen foraging in the foliage.
Close-up of gorilla hanging from a tree branch.

Figure 11. Case studies of MM-V ID ’s capabilities and application scenarios: hour-long video comprehension; We present the full script
generated by MM-V ID. The original video is available at https://www.youtube.com/watch?v=um2Q9aUecy0

18
Can you summarize the video?
Figure 12. Case studies of MM-V ID ’s capabilities and application scenarios: hour-long video comprehension;
What is Cordyceps fungus? MM-V ID is able to
summarize the video and address specific questions with regard to the video contents. Blue highlights the correct answers. The original
video is available at https://www.youtube.com/watch?v=um2Q9aUecy0
The video is a documentary that explores the Cordyceps fungus is a parasitic fungus that
diverse and rich ecosystems of jungles and 19 infects and takes control of its host, usually an
rainforests around the world. It showcases various insect. The fungus manipulates the host's
animal species, their behaviors, and their roles in behavior, eventually leading to the host's death.
maintaining the balance of these ecosystems. The The fungus then erupts from the host's body and
documentary also highlights the threats faced by releases spores to infect more victims. Cordyceps
these habitats, such as deforestation and fungus plays a role in maintaining the balance of
Figure 13. Case studies of MM-V ID ’s capabilities and application scenarios: hour-long video comprehension; MM-V ID is able to
localize the specific moments or topics in the lengthy documentary video. Blue highlights the correct answers. The original video is
available at https://www.youtube.com/watch?v=um2Q9aUecy0
Pitcher plants can be found in the video around
20 the timestamp 00:18:18.
(continued)
the deep web for information. The episode
features various tense conversations and
political strategizing, with characters
navigating complex relationships and making
rt
crucial decisions.

Loading video…
What are the main challenges Frank facing in
Video length is 46 minutes 54 seconds.
the two videos?

Main challenges Frank faces in the two

episodes:

- Erasing all leads connecting him and Doug to

their past actions, particularly their involvement
in Peter Russo's death.
- Managing the political fallout from Zoe's
Loading video… death and preventing any further investigation
Video length is 48 minutes 30 seconds. into his involvement.
- Navigating the complex relationships and
political strategies within the government,
particularly in relation to the Whip race and
Can you summarize the two episodes?
the US-China talks.
- Balancing his personal life with his new role
as Vice President, including dealing with the
Episode 1 Summary: In the first episode of increased security measures and lack of privacy.
House of Cards Season 2, Frank Underwood
prepares for his promotion to Vice President.
Doug Stamper works on erasing any leads
that could connect him and Frank to their When did Frank win the promotion? include
past actions. Claire faces a legal battle against the timestamps
Gillian, while Zoe considers working with
Frank again. Throughout the episode, various
characters engage in tense conversations and
political maneuvering, with Frank ultimately Frank wins the promotion to Vice President in
being sworn in as Vice President in a private Season 2 Episode 1, with the swearing-in
ceremony. ceremony taking place between 0:02:24 and
0:02:36 in Episode 2.
Episode 2 Summary: In the second episode,
Frank is announced as Vice President and
faces obstacles in his new role. The US
engages in talks with China, which fall apart
over cyber warfare issues. Jackie Sharp plans
to run for the new whip position, facing
challenges along the way. Meanwhile, Lucas
investigates Zoe's death and delves into

Describe
Figure 14. Case studies of MM-V ID ’s capabilities and application scenarios: multi-video Frank's
episodic journey
analysis; in the
Given two videos.
multiple video files as
inputs, MM-V ID can summarize the video series, perform cross-episodic reasoning, describe a specific character’s journey across multiple
videos. Figure 15 shows continued outputs. Blue highlights the correct answers. The two videos are from House of Cards, Season 2:
episode 1 and episode 2. Frank's journey in these two episodes involves
21 him preparing for his new role as Vice
President while trying to maintain control over
various situations. He deals with the
consequences of his past actions, such as
Describe Frank's journey in the two videos.

Frank's journey in these two episodes

involves him preparing for his new role as
Vice President while trying to maintain
control over various situations. He deals with
the consequences of his past actions, such as
managing Doug Stamper's attempts to erase
their tracks and handling the fallout from
Zoe's death. Frank also navigates the political
landscape, working with Jackie to secure her
position as Whip and engaging in tense
negotiations with China. Throughout these
challenges, Frank continues to demonstrate
his cunning and ruthlessness in achieving his
goals.

Figure 15. Case studies of MM-V ID ’s capabilities and application scenarios: multi-video episodic analysis; Given multiple video files as
inputs, MM-V ID can summarize the video series, perform cross-episodic reasoning, describe a specific character’s journey across multiple
videos. Blue highlights the correct answers. The two videos are from House of Cards, Season 2: episode 1 and episode 2.

22
Did Pikachu use 10000000 Volts move?

rt Yes, Pikachu used the 10,000,000 Volts move

(Timestamp: 00:20:09 - 00:20:14 in Episode
131).

Loading video…
Video length is 23 minutes 36 seconds.

Did Pikachu use Gigantamax?

Loading video… Yes, Pikachu used Gigantamax. (Timestamp:

Video length is 21 minutes 57 seconds. 00:03:45 - 00:03:49 in Episode 132)

Can you summarize the two episodes?

In Episode 131, Ash and Leon face off in the

finals of the Pokémon World Coronation
Series Masters Eight Tournament. The
intense battle sees various Pokémon,
including Rillaboom, Dragonite, and Show me the keyframe when Pikachu
Cinderace, fighting fiercely. The combined defeats Charizard
power of Pikachu's Z-Move and Charizard's
Gigantamax attack causes Eternatus to
appear unexpectedly.
The keyframe when Pikachu defeats
In Episode 132, the final round of the Charizard is at timestamp 00:16:02-00:16:03
tournament continues with the appearance in Episode 132.
of Eternatus, which recharges Leon and Ash's
Dynamax bands. Ash and Pikachu face Leon's
Gigantamax Cinderace and Charizard in a
thrilling battle. With the support of their
friends and Pokémon, Ash and Pikachu
emerge victorious, making Ash the new
World Champion.

Figure 16. Case studies of MM-V ID ’s capabilities and application scenarios: multi-video episodic analysis; Given multiple video files
as inputs, MM-V ID can summarize the video series, and localize specific moments across multiple videos. Blue highlights the correct
answers. The two videos are from Pokémon Journeys: episode 131 and episode 132.

23
(continued)
0:00:06 - 0:00:08: Nefer looks down the cliff's
edge, revealing her concerned reaction.
0:00:12 - 0:00:14: A shot of a rock with a golden
ankh necklace sets an ominous tone.
rt
0:00:15 - 0:00:16: Transition to a worried Nefer
under the open sky.
0:00:17 - 0:00:18: Faraon overlooks Thut lying
Loading video… on the rocky ground.
Video length is 2 minutes 43 seconds. 0:00:18 - 0:00:24: Various interactions between
Thut, Nefer, and Faraon in the desert transpire.
0:00:24 - 0:00:30: Nefer is hugged by Faraon as
Invoke tools: Scene Detection, Metadata Thut watches closely.
Collection 0:00:30 - 0:00:36: Thut shares a moment with
Nefer before seating himself beside a frog as
Scene Detection: Faraon joins Nefer.
0:00:36 - 0:00:42: Nefer and Faraon approach a
0:00:00, 0:00:01, 0:00:02, 0:00:03, 0:00:04,
slightly confused Thut.
0:00:05, 0:00:06, 0:00:07, 0:00:08, 0:00:09, 0:00:42 - 0:00:48: Thut and Nefer stand alone
..., 0:02:39, 0:02:40, 0:02:41, 0:02:42 exchanging a few words.
0:00:48 - 0:00:54: Faraon joins Thut and Nefer,
Metadata Collection: looking at a ring Thut holds.
Title: Mummies 0:01:00 - 0:01:06: Thut presents a ring between
Abstract: Thut is a young charioteer. Nefer is the his and Nefer's fingers.
daughter of Faraon. In this video, Thut saves 0:01:06 - 0:01:12: Thut listens as Nefer talks
himself from falling off a cliff. Thut and Nefer excitingly about the ring.
0:01:12 - 0:01:18: Faraon oversees a close
meets Faraon. Thut and Nefer gets married in
moment between Thut and Nefer.
front of the goddess statue and all people. Thut 0:01:18 - 0:01:24: Nefer and Thut stand in front
and Nefer celebrates by singing and dancing. of an audience as a small girl with a snake
appears.
0:01:24 - 0:01:30: Thut and Nefer perform in
front of a crowd.
Visual Prompting: 0:01:42 - 0:01:48: Thut and Nefer share a kiss as
the crowd cheers on, transitioning to the girl
with her crocodile.
0:01:54 - 0:02:00: Thut and Nefer perform on
stage with others in a palace courtyard.
0:02:12 - 0:02:18: Multiple performances ensue
in the courtyard, ending with Thut and Nefer
sharing a gaze.
0:02:18 - 0:02:24: Thut and Nefer continue their
romantic interaction under the watchful eyes of
Faraon Thut Nefer their audience.
0:02:24 - 0:02:30: The dancing in the courtyard
continues.
0:02:30 - 0:02:36: Thut, Nefer, and Faraon
Script Generation:
perform on stage as their performers enjoy
0:00:00 - 0:00:06: Nefer, an Egyptian princess, is their display.
seen gazing at the desert landscape and 0:02:36 - 0:02:43: As the sun sets, Thut and
examining a rocky cliff. Nefer conclude their musical performance on a
balcony.

Figure 17. Case studies of MM-V ID ’s capabilities and application scenarios: character identification; By providing the face photos
of the characters and their names, MM-V ID can generate the script with the identified characters. Blue highlights the correct character
identification. The original video is available at https://www.tokyvideo.com/video/mummies-online-free-film, from
1:19:47 to 1:22:30.
24
Regenerated ASR with Identified Speakers:

0:00:01 Ballister Boldheart: Good trade flesh

and we run and we hide.
0:00:33 Unknown: That was weird.
0:00:34 Nimona: Hey boss.
0:00:39 Nimona: I love the secret layer.
0:00:42 Nimona: The garbage and the smell of
sadness really pull the whole thing together.
Loading video…
0:00:45 Ballister Boldheart: Well, wait, wait,
Video length is 2 minutes 57 seconds. wait. Who? Who are you?
0:00:47 Nimona: The names. Pneumonia.
Whoa. Yeah. Sick arm. n
Invoke tools: ASR, Metadata Collection
0:00:48 Ballister Boldheart: And how did you? co
0:00:51 Unknown: Did it bleed a lot? Did they th
ASR: let you keep the old one?
0:00:01 Guest_0: Good trade flesh and we run and 0:00:54 Ballister Boldheart: No, let's go. That's
we hide. wrong with you.
0:00:33 Guest_2: That was weird. 0:00:58 Nimona: Can I have it? What? Put that?
0:00:59 Ballister Boldheart: Down that is.
0:00:34 Guest_3: Hey boss.
0:01:00 Ballister Boldheart: Not for little girls,
0:00:39 Guest_3: I love the secret layer.
little.
… 0:01:02 Nimona: Girls. OK, how old do you
0:02:53 Guest_10: From you. No. Fine then from. think I?
0:02:55 Guest_10: Who? The guy who cut off your 0:01:05 Ballister Boldheart: AM 10.
arms. 0:01:09 Unknown: Alright, home.
0:01:09 Ballister Boldheart: We are more or
less than 10.
Metadata Collection:
0:01:10 Nimona: Not a lot of kids in your life,
Title: Nimona
huh?
Abstract: Ballister boldheart is a knight. Nimona is a 0:01:12 Ballister Boldheart: You know what?
mischievous teen who is a shapeshifting creature. 0:01:13 Ballister Boldheart: No. And then like if
The video shows how Ballister met Nimona first you stay that?
time.Nimona thought boldheart was the villain but 0:01:14 Ballister Boldheart: Way you have to go
disappointed to know he is not. job.
0:01:15 Nimona: But I'm here about this.
0:01:16 Unknown: What job?
0:01:18 Nimona: Oh, it's all here.
Visual Prompting: 0:01:18 Nimona: It's all here in my application.
0:01:22 Nimona: This is just a bunch of
drawings.
0:01:26 Unknown: Very disturbing drawings.
0:01:28 Ballister Boldheart: Oh, look, it's me on
a rhinoceros skewering.
0:01:32 Ballister Boldheart: Guards like a
human kebab.
0:01:34 Nimona: Yeah. Do you like?
0:01:35 Nimona: It I thought a visual aid.
Nimona Ballister 0:01:37 Unknown: Really embarrass me. Puff up
Boldheart here. So about the job. What job?
0:01:45 Nimona: To be your.

Figure 18. Case studies of MM-V ID ’s capabilities and application scenarios: speaker identification; By leveraging visual prompting,
MM-V ID can enhance ASR predictions with the speakers’ identity. Blue and Red highlight the correct and incorrect predictions, respec-
tively. Figure 19 shows continued outputs. The original video is available at https://www.netflix.com/title/81444554,
from 9:52 to 12:52.
25
(continued):
0:01:45 Nimona: Sidekick, you know.
0:01:46 Nimona: To help you do whatever it
takes to.
0:01:47 Nimona: Get revenge on.
0:01:48 Nimona: The cold, cruel world that
rejected you. Shall we pillage village? Lay low
until they don't remember you, and then we
rise like a phoenix. From the Ashes, overthrow
the government.
0:01:58 Unknown: Or we could.
0:01:59 Nimona: Just talk.
0:02:02 Unknown: The the the point.
0:02:03 Nimona: Is whatever your dark heart
desires. Boss, your sidekick has arrived.
0:02:08 Ballister Boldheart: I don't need a.
0:02:09 Ballister Boldheart: Side wait, I am not
a.
0:02:09 Nimona: Every villain needs a sidekick.
0:02:12 Unknown: Sweetener. Murdo. That's
how you want to kill first. Oh.
0:02:14 Nimona: Are these your next targets?
0:02:15 Nimona: Who do?
0:02:16 Ballister Boldheart: I'm not killing
anybody.
0:02:18 Nimona: This one, this guy looks
extremely punchable.
0:02:19 Nimona: Right. He is actually extremely.
0:02:21 Ballister Boldheart: You're right, he is
actually.
0:02:22 Ballister Boldheart: Punchable. Wait,
that is not a murder wall. It's a.
0:02:26 Ballister Boldheart: The innocence wall,
yeah.
0:02:27 Nimona: Citizens will.
0:02:30 Nimona: We are you saying that you're
not a villain? Yes.
0:02:36 Nimona: Yes you are.
0:02:37 Ballister Boldheart: No, no, no. I'm not
a villain, no.
0:02:42 Ballister Boldheart: But you.
0:02:43 Ballister Boldheart: Disappointed that
I'm not a murderer.
0:02:46 Nimona: I mean.
0:02:46 Nimona: I mean a little. Why?
0:02:48 Ballister Boldheart: You know what we.
0:02:49 Ballister Boldheart: Don't know.
0:02:50 Ballister Boldheart: The real villain is
still out there and I do need help not.
0:02:53 Nimona: From you. No. Fine then from.
0:02:55 Nimona: Who? The guy who cut off
your arms.

Figure 19. Case studies of MM-V ID ’s capabilities and application scenarios: speaker identification; By leveraging visual prompting,
MM-V ID can enhance ASR predictions with the speakers’ identity. Blue and Red highlight the correct and incorrect predictions, respec-
tively. The original video is available at https://www.netflix.com/title/81444554, from 9:52 to 12:52.

26
(continued)

0:01:03 - 0:01:09 : Mr. Bean tries to regain

focus in this scene, tapping his fingers as he
nervously attempts to recall information for
the exam.

0:01:11 - 0:01:12 : Two men, including Mr.

Loading video… Bean, are discussing a portion of the exam in a
Video length is 1 minutes 44 seconds. library, pointing at a book while brainstorming
answers.
Script Generation:
0:01:12 - 0:01:21 : Mr. Bean and the man next
0:00:00 - 0:00:08 : Mr. Bean enters a classroom and to him become more desperate, whispering to
takes a seat next to another man. They both each other and trying to cheat as their
anxiously look at their exam papers as they begin frustration grows. The mood is tense, mirroring
their test. their anxiety.

0:00:09 - 0:00:16 : Mr. Bean glances over at the other

man's paper and tries to sneakily copy his answers.
Both men continue to work on their tests, unaware Why Mr. Bean cheat in the exam?
of their surroundings.

0:00:17 - 0:00:23 : Close-up shot of Mr. Bean's face

showing his frustration as he struggles to figure out Mr. Bean cheats in the exam because he is
the answers to the exam. unprepared, lacks confidence in his ability to
answer the questions, and feels desperate
0:00:24 - 0:00:27 : Mr. Bean continues to take the to pass the test. His comedic nature and
exam while holding a pink flamingo pen, struggling inability to cope with the pressure of the
to concentrate. exam lead him to attempt cheating as a last
resort.
0:00:27 - 0:00:35 : Mr. Bean becomes more
desperate, sticking his nose into the exam paper as
he continues to copy answers from the man sitting Why do people find the video funny?
next to him.

0:00:36 - 0:00:45 : Mr. Bean takes a break, rubbing

his eyes as the scene cuts back and forth between People might find the video funny because of
him and his colleague. Mr. Bean's exaggerated expressions and
actions, such as sticking his nose into the
0:00:45 - 0:00:54 : The students continue to work on exam paper and using a pink flamingo pen,
their task as the scene shows the peaceful and quiet which add humor to the otherwise tense
environment of the exam room. and stressful situation of taking an exam.

0:00:54 - 0:01:00 : Mr. Bean is deep in thought as he

taps his pen on the desk, looking more worried and
nervous about his exam performance.

0:01:01 - 0:01:02 : Close-up of Mr. Bean struggling

with a problem and looking upward with a frustrated
expression.

Figure 20. Case studies of MM-V ID ’s capabilities and application scenarios: audio description generation; Without ASR inputs, MM-
V ID processes the video and generates a detailed script. Blue and Red highlight the correct and incorrect predictions, respectively. The
original video is available at https://www.youtube.com/watch?v=29ZmXkICT1g, from 2:50 to 4:34.

27
……

bapl
biotects the

biotects the

rohammad examining a small

Figure 21. Case studies of MM-V ID ’s capabilities and application scenarios: self-refinement; In this example, MM-V ID mistakenly
identifies a bird as a rock due to low-resolution video frames. The self-refinement helps rectify the errors, resulting in a more refined
output. Blue and Red highlight the correct and incorrect predictions, respectively. The original video is available at https://www.
youtube.com/watch?v=o4UKvgBkO2Y, from 1:06 to 4:17.
28
…

Figure 22. Case studies of MM-V ID ’s capabilities and application scenarios: fast-changing short videos. The original videos are
available at https://www.instagram.com/mealtimewithmummy/reels/

29
…

Figure 23. Case studies of MM-V ID ’s capabilities and application scenarios: embodied agent. Blue highlights the correct prediction.
The original video is collected from Ego4D dataset [25].

30
Figure 24. Case studies of MM-V ID ’s capabilities and application scenarios: playing video game. Blue highlights the correct prediction.
Figures 25-27 show continued outputs. The video is generated by Pygame library [4].

31
Figure 25. Case studies of MM-V ID ’s capabilities and application scenarios: playing video game. Blue highlights the correct prediction.
Figures 26-27 show continued outputs. The video is generated by Pygame library [4].

32
{
"con
"rea
You are expert in playing Super Mario Bros.
What is the next controls to be pressed after
the last frame. Output JSON. }

Available controls are: Jump, ArrowLeft,

ArrowRight, Wait

Output:
{
"controls": "ArrowLeft+Jump",
"reason": "Mario should"
} You are expert in playing Super Mario Bros.
What is the next controls to be pressed after
For this sequence of frames of super mario, the last frame. Output JSON.
what is the next controls.
Available controls are: Jump, ArrowLeft,
Output: ArrowRight, Wait

Output:
{
{
"control": "Jump",
"controls": "ArrowLeft+Jump",
"reason": "Mario should jump to
"reason": "Mario should"
avoid the goomba"
}
}
(Agent presses control “Jump”)

Figure 26. Case studies of MM-V ID ’s capabilities and application scenarios: playing video game. Blue highlights the correct prediction.
Figure 27 shows continued outputs. The video is generated by Pygame library [4].

33
Figure 27. Case studies of MM-V ID ’s capabilities and application scenarios: playing video game. Blue highlights the correct prediction.
The video is generated by Pygame library [4].

34
Figure 28. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively. Figures 29-32 show continued outputs.

35
Imagine that you are a robot operating a Imagine that you are a robot operating a
phone. Like how humans operate the phone, phone. Like how humans operate the phone,
you can move your finger to an icon, click an you can move your finger to an icon, click an
icon with your finger, scroll down a screen, or icon with your finger, scroll down a screen, or
type some texts with the keyboard. type some texts with the keyboard.
You are asked to shop for an ergonomic You are asked to shop for an ergonomic
keyboard, your budget is between $50 and keyboard, your budget is between $50 and
$100. $100.
Previous action: "My next move would be to Previous action: "My immediate next move
move my finger to the Amazon icon, which is would be to move my finger to the "Filters"
located in the third row and third column of button, which is located at the top right corner
the screen. I would then click on the icon to of the screen, next to the "Windows" button. I
open the app and search for an ergonomic would then click on the "Filters" button to open
keyboard within my budget of $50. " the filter options and narrow down my search
You have now opened the Amazon app. results to only show ergonomic keyboards
Below is what you see on the phone screen. within my budget range of $50 to $100. "
What will you do next? If the action involves You have now completed the previous action as
moving your finger to an icon, describe the described above. Below is what you see on the
location of that icon as detailed as possible. phone screen. Please predict your immediate
next move. If the action involves moving your
finger to an icon, describe the location of that
icon as detailed as possible.

My next move would be to move my

finger to the search bar, which is located
at the top of the screen. I would then click
on the search bar and type in "ergonomic My immediate next move would be to move
keyboard" using the keyboard. After my finger to the "Price & Deals" filter
typing in the search term, I would click on option, which is located under the "Colors"
the "Search" button, which is located on filter option and above the "Sort by" filter
the bottom right corner of the keyboard, option. I would then click on the "Price &
to initiate the search. Deals" filter option to open the price range

Figure 29. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively. Figures 30-32 show continued outputs.

36
(continued) My next move would be to move my finger
filter and set my minimum price to $50 to the "Show 1,987 results" button, which is
and my maximum price to $100. This located at the bottom of the screen, below
would further narrow down my search the "Clear Filters" button and above the
results to only show ergonomic keyboards navigation bar. I would then click on the
within my budget range. "Show 1,987 results" button to view the
search results for ergonomic keyboards
Imagine that you are a robot operating a
within my budget range.
phone. Like how humans operate the phone,
you can move your finger to an icon, click an Imagine that you are a robot operating a
icon with your finger, scroll down a screen, or phone. Like how humans operate the phone,
type some texts with the keyboard. you can move your finger to an icon, click an
You are asked to shop for an ergonomic icon with your finger, scroll down a screen, or
keyboard, your budget is between $50 and type some texts with the keyboard.
$100. You are asked to shop for an ergonomic
Previous action: " My immediate next move keyboard, your budget is between $50 and
would be to move my finger to the "Price & $100.
Deals" filter option, which is located under the Previous action: " My next move would be to
"Colors" filter option and above the "Sort by" move my finger to the "Show 1,987 results"
filter option. I would then click on the "Price & button, which is located at the bottom of the
Deals" filter option to open the price range filter screen, below the "Clear Filters" button and
and set my minimum price to $50 and my above the navigation bar. I would then click on
maximum price to $100. This would further the "Show 1,987 results" button to view the
narrow down my search results to only show search results for ergonomic keyboards within
ergonomic keyboards within my budget range. " my budget range. "
You have now completed the previous action as You have now completed the previous action as
described above. Below is what you see on the described above. Below is what you see on the
phone screen. What will you do next? If the phone screen. What will you do next? If the
action involves moving your finger to an icon, action involves moving your finger to an icon,
describe the location of that icon as detailed as describe the location of that icon as detailed as
possible. possible.

Figure 30. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively. Figures 31-32 show continued outputs.

37
Figure 31. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively. Figure 32 shows continued outputs.

38
Figure 32. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively.

Class 9 (Moral Science)
No ratings yet
Class 9 (Moral Science)
27 pages
Video Chat GPT
No ratings yet
Video Chat GPT
17 pages
VideoChat Chat-Centric Video Understanding
No ratings yet
VideoChat Chat-Centric Video Understanding
16 pages
2405.03770v1 Video Understanding Survey
No ratings yet
2405.03770v1 Video Understanding Survey
48 pages
Yun Thesesgg
No ratings yet
Yun Thesesgg
127 pages
Generating Video Descriptions With Attention-Driven LSTM Models in Hindi Language
No ratings yet
Generating Video Descriptions With Attention-Driven LSTM Models in Hindi Language
9 pages
Intern Video 2
No ratings yet
Intern Video 2
27 pages
Video-of-Thought:: Step-by-Step Video Reasoning From Perception To Cognition
No ratings yet
Video-of-Thought:: Step-by-Step Video Reasoning From Perception To Cognition
17 pages
Deepseek-Vl2: Mixture-Of-Experts Vision-Language Models For Advanced Multimodal Understanding
No ratings yet
Deepseek-Vl2: Mixture-Of-Experts Vision-Language Models For Advanced Multimodal Understanding
28 pages
2021 NeurIPS VAAT Akbari, Yuan, Qian, Chuang, Chang, Cui, Gong
No ratings yet
2021 NeurIPS VAAT Akbari, Yuan, Qian, Chuang, Chang, Cui, Gong
16 pages
Video Captioning Using Neural Networks
No ratings yet
Video Captioning Using Neural Networks
13 pages
Visually-Aligned Retrieval-Augmented Long Video Comprehension
No ratings yet
Visually-Aligned Retrieval-Augmented Long Video Comprehension
15 pages
Longvu: Spatiotemporal Adaptive Compression For Long Video-Language Understanding
No ratings yet
Longvu: Spatiotemporal Adaptive Compression For Long Video-Language Understanding
17 pages
Vtimellm: Empower LLM To Grasp Video Moments: Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu Tsinghua University
No ratings yet
Vtimellm: Empower LLM To Grasp Video Moments: Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu Tsinghua University
14 pages
Vitron
No ratings yet
Vitron
22 pages
Text-Driven Video Prediction - 3675171
No ratings yet
Text-Driven Video Prediction - 3675171
14 pages
M: Evaluating Complex Video Reasoning
No ratings yet
M: Evaluating Complex Video Reasoning
21 pages
Li MVBench A Comprehensive Multi-Modal Video Understanding Benchmark CVPR 2024 Paper
No ratings yet
Li MVBench A Comprehensive Multi-Modal Video Understanding Benchmark CVPR 2024 Paper
12 pages
Cao MAPLM A Real-World Large-Scale Vision-Language Benchmark For Map and Traffic CVPR 2024 Paper
No ratings yet
Cao MAPLM A Real-World Large-Scale Vision-Language Benchmark For Map and Traffic CVPR 2024 Paper
12 pages
Sigir 2022
No ratings yet
Sigir 2022
11 pages
Vcbench: A Controllable Benchmark For Symbolic and Abstract Challenges in Video Cognition
No ratings yet
Vcbench: A Controllable Benchmark For Symbolic and Abstract Challenges in Video Cognition
24 pages
2025 VideoRAG Ren Et Al
No ratings yet
2025 VideoRAG Ren Et Al
18 pages
Artemis
No ratings yet
Artemis
11 pages
LLaMA VID
No ratings yet
LLaMA VID
18 pages
Video-Mme: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal Llms in Video Analysis
No ratings yet
Video-Mme: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal Llms in Video Analysis
15 pages
Holmes-VAD - Towards Unbiased and Explainable Video Anomaly Detection Via Multi-Modal LLM
No ratings yet
Holmes-VAD - Towards Unbiased and Explainable Video Anomaly Detection Via Multi-Modal LLM
19 pages
Nlp-Enriched Automatic Video Segmentation: Mohannad Almousa Rachid Benlamri Richard Khoury
No ratings yet
Nlp-Enriched Automatic Video Segmentation: Mohannad Almousa Rachid Benlamri Richard Khoury
6 pages
An Efficient Transformer-Based System For Text-Based Video Segment Retrieval Using FAISS
No ratings yet
An Efficient Transformer-Based System For Text-Based Video Segment Retrieval Using FAISS
4 pages
Park VLAAD Vision and Language Assistant For Autonomous Driving WACVW 2024 Paper
No ratings yet
Park VLAAD Vision and Language Assistant For Autonomous Driving WACVW 2024 Paper
8 pages
AdaMML Adaptive Multi-Modal Learning For Efficient Video Recognition
No ratings yet
AdaMML Adaptive Multi-Modal Learning For Efficient Video Recognition
10 pages
Reframe Anything
No ratings yet
Reframe Anything
18 pages
Consulting Proposal
No ratings yet
Consulting Proposal
28 pages
Learning CLIP Guided Visual-Text Fusion Transformer For Video-Based Pedestrian Attribute Recognition
No ratings yet
Learning CLIP Guided Visual-Text Fusion Transformer For Video-Based Pedestrian Attribute Recognition
4 pages
MDMMT: Multidomain Multimodal Transformer For Video Retrieval
No ratings yet
MDMMT: Multidomain Multimodal Transformer For Video Retrieval
18 pages
Yan Multiview Transformers For Video Recognition CVPR 2022 Paper
No ratings yet
Yan Multiview Transformers For Video Recognition CVPR 2022 Paper
11 pages
Video Summarization Using Deep Semantic Features
No ratings yet
Video Summarization Using Deep Semantic Features
16 pages
Neuro-Symbolic Video Search: Minkyu Choi, Harsh Goel, Mohammad Omama Yunhao Yang, Sahil Shah, and Sandeep Chinchali
No ratings yet
Neuro-Symbolic Video Search: Minkyu Choi, Harsh Goel, Mohammad Omama Yunhao Yang, Sahil Shah, and Sandeep Chinchali
12 pages
MVBench - AComprehensive Multi-Modal Video Understanding Benchmark
No ratings yet
MVBench - AComprehensive Multi-Modal Video Understanding Benchmark
19 pages
Reference: Phenomena-Report-2023
No ratings yet
Reference: Phenomena-Report-2023
23 pages
Cross Modal LMM For Video and Audio Analysis
No ratings yet
Cross Modal LMM For Video and Audio Analysis
6 pages
Vision-Language Pre-Training
No ratings yet
Vision-Language Pre-Training
102 pages
Video Description: A Survey of Methods, Datasets, and Evaluation Metrics
No ratings yet
Video Description: A Survey of Methods, Datasets, and Evaluation Metrics
37 pages
Mobile-Videogpt: Fast and Accurate Video Understanding Language Model
No ratings yet
Mobile-Videogpt: Fast and Accurate Video Understanding Language Model
13 pages
Video Understanding With Large Language Models - A Survey
No ratings yet
Video Understanding With Large Language Models - A Survey
24 pages
Essay On My Hero
100% (2)
Essay On My Hero
3 pages
A Machine Learning Pipeline For Semantic Aware and Contexts Rich Video Description Method
No ratings yet
A Machine Learning Pipeline For Semantic Aware and Contexts Rich Video Description Method
9 pages
V I T S D: Ideo Nstruction Uning With Ynthetic ATA
No ratings yet
V I T S D: Ideo Nstruction Uning With Ynthetic ATA
24 pages
IJCRT24A5009
No ratings yet
IJCRT24A5009
5 pages
Pytorchvideo: A Deep Learning Library For Video Understanding
No ratings yet
Pytorchvideo: A Deep Learning Library For Video Understanding
4 pages
Integrating Image and Video Encoders For Enhanced Video Understanding
No ratings yet
Integrating Image and Video Encoders For Enhanced Video Understanding
18 pages
Cross-Domain Modality Fusion For Dense Video Captioning
No ratings yet
Cross-Domain Modality Fusion For Dense Video Captioning
15 pages
CHP 19 Rehman-Et-Al-2022-Developing-The-Integrated-Marketing-Communication-Imc-Through-Social-Media-Sm-The-Modern-Marketing
No ratings yet
CHP 19 Rehman-Et-Al-2022-Developing-The-Integrated-Marketing-Communication-Imc-Through-Social-Media-Sm-The-Modern-Marketing
23 pages
2019 - VideoBERT - A Joint Model For Video and Language Representation Learning
No ratings yet
2019 - VideoBERT - A Joint Model For Video and Language Representation Learning
13 pages
Pixel To Phrases
No ratings yet
Pixel To Phrases
6 pages
My Homework For You
100% (1)
My Homework For You
4 pages
The Dawn of LMMS: Preliminary Explorations With Gpt-4V (Ision)
No ratings yet
The Dawn of LMMS: Preliminary Explorations With Gpt-4V (Ision)
166 pages
Physics For X - Cbse
No ratings yet
Physics For X - Cbse
9 pages
Transformers in Computational Visual Media A Surve
No ratings yet
Transformers in Computational Visual Media A Surve
30 pages
INFO1113 Assignment 2023 S2
No ratings yet
INFO1113 Assignment 2023 S2
11 pages
Sentence Correction Rules
No ratings yet
Sentence Correction Rules
27 pages
Classroom Presentations
No ratings yet
Classroom Presentations
9 pages
Koorathota Editing Like Humans A Contextual Multimodal Framework For Automated Video CVPRW 2021 Paper
No ratings yet
Koorathota Editing Like Humans A Contextual Multimodal Framework For Automated Video CVPRW 2021 Paper
9 pages
Multi-Modal Inductive Framework For Text-Video Retrieval
No ratings yet
Multi-Modal Inductive Framework For Text-Video Retrieval
10 pages
Ramp Check List
No ratings yet
Ramp Check List
1 page
Schrack RE030024
No ratings yet
Schrack RE030024
2 pages
Deep Learning-Based Video Captioning Technique Using Transformer
No ratings yet
Deep Learning-Based Video Captioning Technique Using Transformer
4 pages
Human Values: DR - Sunil Ms Ob LPU
No ratings yet
Human Values: DR - Sunil Ms Ob LPU
11 pages
PP PPT Myp5
No ratings yet
PP PPT Myp5
14 pages
Cashless Economy
No ratings yet
Cashless Economy
9 pages
A Study On Customer Preference Towards Sports Shoes: Bachelor of Business Administration
No ratings yet
A Study On Customer Preference Towards Sports Shoes: Bachelor of Business Administration
8 pages
Irjet V11i617
No ratings yet
Irjet V11i617
7 pages
Adiabatic Compressibility of Liquid Ammonia
No ratings yet
Adiabatic Compressibility of Liquid Ammonia
3 pages
Definitions of Curriculum Bsed
No ratings yet
Definitions of Curriculum Bsed
1 page
Field Record of Concrete: Commercial & Office Building On Plot No. 373-1343 at Al Barsha First Dubai
No ratings yet
Field Record of Concrete: Commercial & Office Building On Plot No. 373-1343 at Al Barsha First Dubai
38 pages
10024947D00 - Turbine Control Board Requirements Specification, PB 540
No ratings yet
10024947D00 - Turbine Control Board Requirements Specification, PB 540
8 pages
Caries Detection
No ratings yet
Caries Detection
7 pages
8 Step Training Model
No ratings yet
8 Step Training Model
1 page
Sneha Sarkar, 127, B, Beta and Gamma Function
No ratings yet
Sneha Sarkar, 127, B, Beta and Gamma Function
12 pages
Friction JEE
No ratings yet
Friction JEE
33 pages
Germination Value A New Formula: Pinus Radiata
No ratings yet
Germination Value A New Formula: Pinus Radiata
5 pages
ITP AND Reports Only Approved
No ratings yet
ITP AND Reports Only Approved
18 pages
Catch-up-Friday-Teaching-Guide-HG V - Week 3
No ratings yet
Catch-up-Friday-Teaching-Guide-HG V - Week 3
5 pages
VVM MCQ On Electricity
No ratings yet
VVM MCQ On Electricity
4 pages
Case Study BARGAIN CITY
No ratings yet
Case Study BARGAIN CITY
1 page
List of Authorised Recyclers 09 07 2024 (2) - 3
No ratings yet
List of Authorised Recyclers 09 07 2024 (2) - 3
1 page
18.question Bank - SA I - ND22
No ratings yet
18.question Bank - SA I - ND22
5 pages
Itl 512 Learning Map Planning 1
No ratings yet
Itl 512 Learning Map Planning 1
12 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Advancing Video Understanding With GPT4

Uploaded by

Advancing Video Understanding With GPT4

Uploaded by

MM-V ID : Advancing Video Understanding with GPT-4V(ision)

Faraon Thut Nefer Nimona Ballister

Multimodal Clip 2 Visual Prompt Clip-Level Video

resenting the ongoing visual information unfolding in the

Timestamp: 00:00:20 - 00:00:24

ASR: Script Generation Using GPT-4:

Clip-Level Video Description Generation Output:

0:00:15 - 0:00:30: Dodge.rs pitcher, Julio Urias,

Dodgers' catcher Will Smith lifts a high ball

0:06:23 - 0:06:38: The Rockies' show some

The commentator mentions that there is a

The most exciting moment in this video is

Final scores were displayed at 0:07:11 - 0:09:09,

Who are the best pitchers in this video?

Julio Urias was the best pitcher in this

When did Julio Urias have his first strikeout?

Julio Urias had his first strikeout in the

Scene 32: 00:06:57 Scene 47: 00:09:43

Scene 34: 00:07:13 Scene 49: 00:10:50

Scene 35: 00:07:27 Scene 50: 00:11:23

Scene 39: 00:08:00 Scene 55: 00:16:38

Scene 40: 00:08:08 Scene 56: 00:17:50

Scene 41: 00:08:16 Scene 57: 00:19:20

Scene 43: 00:08:28 Scene 59: 00:22:22

Main challenges Frank faces in the two

- Erasing all leads connecting him and Doug to

Frank's journey in these two episodes

rt Yes, Pikachu used the 10,000,000 Volts move

Did Pikachu use Gigantamax?

Loading video… Yes, Pikachu used Gigantamax. (Timestamp:

Can you summarize the two episodes?

In Episode 131, Ash and Leon face off in the

0:00:01 Ballister Boldheart: Good trade flesh

0:01:03 - 0:01:09 : Mr. Bean tries to regain

0:01:11 - 0:01:12 : Two men, including Mr.

0:00:09 - 0:00:16 : Mr. Bean glances over at the other

0:00:17 - 0:00:23 : Close-up shot of Mr. Bean's face

0:00:36 - 0:00:45 : Mr. Bean takes a break, rubbing

0:00:54 - 0:01:00 : Mr. Bean is deep in thought as he

0:01:01 - 0:01:02 : Close-up of Mr. Bean struggling

rohammad examining a small

rohammad examining a small

Available controls are: Jump, ArrowLeft,

My next move would be to move my

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.