MM-VID: Advancing Video Understanding with GPT-4V(ision)
Kevin Lin∗ , Faisal Ahmed∗ , Linjie Li∗ , Chung-Ching Lin∗ , Ehsan Azarnasab, Zhengyuan Yang,
Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang∗♠
Microsoft Azure AI
∗ Core Contribution    ♠ Project Lead
https://multimodal-vid.github.io/
Figure 1. MM-VID allocates specialized vision, audio, and speech experts with GPT-4V(ision) to address challenging video understanding tasks. For example, the system can associate information from multiple uploaded episodes and reason about the storyline of the queried characters ("Multi-Video Episodic Analysis"). We highlight key information here and postpone full MM-VID responses to Figures 4-32.
Demo videos are available at this link.
Abstract

We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V¹, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script. The generated script details character movements, actions, expressions, and dialogues, paving the way for large language models (LLMs) to achieve video understanding. This enables advanced capabilities, including audio description, character identification, and multimodal high-level comprehension. Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres with various video lengths. Additionally, we showcase its potential when applied to interactive environments, such as video games and graphical user interfaces.

¹In this work, we explore GPT-4V(ision) with the vision capability and refer to the model as "GPT-4V," following the OpenAI reports [51, 52]. We refer to the text-only version of the model as "GPT-4" [51].

1. Introduction

People around the world create numerous videos on a daily basis [14, 27, 48, 55], including user-generated live streams, video-game live streams, short clips, movies, sports broadcasts, advertising, and more. Videos serve as a versatile medium for conveying information and content through various modalities [12, 59, 60, 74, 77, 81, 82], such as text, visuals, and audio. Developing methods that can learn from diverse modalities will enable us to design cognitive machines with enhanced capabilities for analyzing uncurated real-world videos, extending beyond the confines of hand-curated datasets. However, this rich representation introduces many challenges for the study of video understanding, particularly when dealing with extended-duration videos [62, 70].

Understanding long videos, especially those spanning over an hour, is a complex task that demands advanced methods capable of analyzing sequences of images and audio across multiple episodes. This challenge is compounded by the need to extract information from various sources, such as distinguishing speakers [17, 49, 61], identifying characters [29, 46, 50], and maintaining narrative coherence [26, 57]. Additionally, answering questions based on video evidence [31] requires a deep comprehension of the content, context, and subtitles. When it comes to live streaming and gaming videos [1, 11, 55], there are challenges in processing dynamic environments in real time, requiring semantic understanding and the ability to plan long-term strategies [11, 54, 67, 75, 83].

Recently, substantial advances have been made with large pre-trained video models [9, 13, 20, 21, 42, 69] and video-language models [10, 22, 23, 30, 36–40, 66], which have demonstrated their reasoning capabilities for video content. However, these models are usually trained on short clips (e.g., 10-second videos in Kinetics [15] and VATEX [68]) or pre-defined action classes (e.g., 174 classes in Something-Something v1 [24]). Consequently, these models may fall short in providing a detailed comprehension of intricate videos in the real world [62, 70]. To achieve a more comprehensive understanding of the videos we encounter in daily life, we need methods capable of addressing complex challenges. This involves not only identifying who is in the scene and what they do, but also pinpointing when and how they act, while recognizing subtle nuances and visual cues across different scenes. The aim of this work is to address these challenges and explore methods that can be applied directly to real-world video understanding. Our approach involves breaking down extended video content into coherent narratives and subsequently employing these generated stories for video analysis.

Recent advances in Large Multimodal Models (LMMs) [7, 8, 19, 47, 51, 52, 78], such as GPT-4V(ision) [52], have demonstrated significant breakthroughs in processing both input images and text for multimodal understanding. This has sparked interest in applying LMMs to the video domain. In this work, we present MM-VID, a system that integrates specialized tools with GPT-4V for video understanding. Given an input video, MM-VID performs multimodal pre-processing, including scene detection and automatic speech recognition (ASR), to collect important information in the video. The input video is then split into multiple clips according to the scene detection algorithm. Then, we employ GPT-4V, which takes the clip-level video frames as input and generates a detailed description for each video clip. Finally, GPT-4 is adopted to generate a coherent script for the full video, conditioning on the clip-level video descriptions, the ASR transcript, and the video metadata if available. As shown in Figure 1, the generated script allows MM-VID to perform a diverse set of video tasks.

Experimental results demonstrate the effectiveness of MM-VID in different challenging scenarios. MM-VID is able to comprehend hour-long videos through multiple modalities and localize specific events with correct timestamps. MM-VID also demonstrates intriguing results in an interactive environment, such as predicting the possible next steps when playing a video game [4] or interacting with a graphical user interface (GUI) [78].
[Figure 2 diagram: a user request ("Please generate audio description for the input video") flows through the four MM-VID modules, producing a reference script that the user can chat with for a task list of capabilities: audio description, grounded QA, summarization, speaker identification, character identification, multimodal reasoning, etc.]

Figure 2. Overview of MM-VID. Our system takes a video file as input, and outputs a long textual script describing the video contents. MM-VID consists of four modules: (i) Multimodal Pre-Processing, (ii) External Knowledge Collection, (iii) Clip-Level Video Description Generation, and (iv) Script Generation.
2. Related Work

Conventional Video Understanding Methods. Early work in computer vision centered on building video foundation models [9, 13, 20, 21, 42, 69]. These models, with different neural network architecture designs and training methods, have achieved great breakthroughs in analyzing short video clips [14, 15, 28, 63], typically lasting less than 30 seconds. However, these models are typically pre-trained with the vision modality only, and thus may require specific adjustment or fine-tuning for multimodal downstream tasks.

Video-Language Models. Recent studies [10, 22, 23, 30, 36–40, 66] have made remarkable improvements in multimodal representation learning for video-and-language understanding. These advancements have been particularly evident in popular downstream tasks such as video question answering [31], text-video retrieval [32, 73], and video captioning [68]. Building on this momentum, researchers typically embark on a pretrain-finetune paradigm: initially pre-training a video-language foundation model on large-scale video-text pairs, followed by a fine-tuning process on specific downstream datasets. However, these methods are usually trained on short video clips, often restricted to durations of around 10 seconds, posing potential challenges in comprehending longer video sequences.

Visual Instruction Tuning. Inspired by the breakthrough of Large Language Models (LLMs) [18, 19, 51, 64, 85], recent studies [35, 43, 44, 62, 84] suggest using a frozen LLM combined with an image encoder and a few learnable modules for video understanding tasks. Specifically, researchers propose visual instruction tuning [35, 41, 44], which aims to fine-tune the learnable modules and thus enable LLMs to generate textual descriptions for the video content. While promising performance is presented, these models may fall short when it comes to handling videos with extended duration. Our work aims to fill this gap, exploring methods that can be directly applied to the understanding of long videos in real-world situations.

Prompting LLMs for Video Understanding. Recently, researchers [6, 33, 65, 72] have explored the LangChain system paradigm [16], which aims to integrate expert tools with existing LLMs to create new functionalities. For example, VLog [6] uses BLIP-2 [34] and GRIT [71] as dense image captioners, Whisper [56] as an ASR translator, and ChatGPT as a reasoner. By transcribing a given video into textual descriptions (e.g., a document), it enables ChatGPT to perform video question-answering tasks. Inspired by the efficacy of these tool-using approaches [16, 65, 79], we explore integration with GPT-4V for video understanding.

3. Preliminary Study with GPT-4V(ision)

Recent studies [3, 51, 52, 78] show that GPT-4V can accept a range of inputs, such as textual descriptions, questions, or even visual cues like images or short video clips. GPT-4V's inherent ability to comprehend visual inputs and generate contextually relevant text opens the door for a wide range of applications. By introducing a sequence of frames as input, GPT-4V can grasp temporal relationships and interactions, aiding in the identification and interpretation of dynamic visual content.

4. MM-VID

Figure 2 shows the overview of our system pipeline. MM-VID takes a video file as input, and outputs a script describing the video contents. The generated script enables LLMs to achieve various video understanding capabilities. MM-VID consists of four modules: (i) Multimodal Pre-Processing, (ii) External Knowledge Collection, (iii) Clip-Level Video Description Generation, and (iv) Script Generation. We describe each module in detail below.

Multimodal Pre-Processing. Starting with an input video file, our process begins by using an established ASR tool to extract transcriptions from the video. Following this, we divide the video into several short video clips. This process involves uniform sampling of video frames, with each clip consisting of 10 frames. To enhance the overall quality of frame sampling, we use established scene detection tools like PySceneDetect [5] to help identify crucial scene boundaries.
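To make this pre-processing step concrete, the sketch below shows one way it could be implemented in Python, assuming PySceneDetect [5] for scene boundaries and OpenCV for frame extraction (ASR, e.g., via the Azure Cognitive Services speech APIs [2], would be run separately). The function name and the clip dictionary format are illustrative assumptions, not the released MM-VID code.

```python
# A minimal sketch of multimodal pre-processing: detect scene boundaries, then
# uniformly sample a fixed number of frames within each detected scene.
import cv2
from scenedetect import detect, ContentDetector


def split_into_clips(video_path: str, frames_per_clip: int = 10):
    """Return a list of clips, each holding its time span and sampled frames."""
    scenes = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    cap = cv2.VideoCapture(video_path)
    clips = []
    for start, end in scenes:
        first, last = start.get_frames(), max(start.get_frames(), end.get_frames() - 1)
        step = max(1, (last - first) // frames_per_clip)
        indices = list(range(first, last + 1, step))[:frames_per_clip]
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        clips.append({
            "start_sec": start.get_seconds(),
            "end_sec": end.get_seconds(),
            "frames": frames,
        })
    cap.release()
    return clips
```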
[Figure 3 diagram: streaming video inputs flow into the MM-VID agent, which outputs an action (e.g., a gaming console control).]

Figure 3. MM-VID for streaming inputs. MM-VID can serve as an agent in an interactive environment, continually receiving and processing the streaming video frames.
External Knowledge Collection. We incorporate external knowledge into our input prompts to GPT-4V. This involves gathering available information, such as metadata, title, abstract, and face photos of characters within the video. In our experiments, the metadata, title, and abstract are gathered from YouTube.
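As a hedged illustration of this step, the snippet below gathers a video's title and description from YouTube with yt-dlp; the paper only states that the metadata, title, and abstract come from YouTube, so the specific library and the `collect_external_knowledge` helper are assumptions for illustration.

```python
# Collect lightweight external knowledge (title, abstract, basic metadata) for a
# YouTube video without downloading it.
from yt_dlp import YoutubeDL


def collect_external_knowledge(url: str) -> dict:
    with YoutubeDL({"skip_download": True, "quiet": True}) as ydl:
        info = ydl.extract_info(url, download=False)
    return {
        "title": info.get("title", ""),
        "abstract": info.get("description", ""),  # video description used as the abstract
        "metadata": {"channel": info.get("channel"), "duration_sec": info.get("duration")},
    }
```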
Clip-Level Video Description Generation. During our multimodal pre-processing, the input video is segmented into multiple clips. For each clip, which typically consists of 10 frames, we employ GPT-4V to generate video descriptions. By feeding the video frames along with the associated text prompt into the model, GPT-4V utilizes the input to generate detailed descriptions that capture the visual elements, actions, and events depicted in those frames.

In addition, we explore the use of visual prompting, where the character's face photos are presented alongside the character's name in the input to GPT-4V. Our empirical results suggest that visual prompting helps enhance the quality of video descriptions, particularly for more accurate character identification. These findings align with the insights from [78].
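The sketch below illustrates what a clip-level GPT-4V call could look like through the OpenAI Python SDK, with frames encoded as base64 images. The prompt wording, the model identifier "gpt-4-vision-preview", and the `describe_clip` helper are assumptions for illustration rather than the exact prompts used by MM-VID.

```python
# Send one clip's frames plus a text prompt to GPT-4V and return its description.
import base64
import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def describe_clip(frames, external_knowledge: str = "") -> str:
    content = [{
        "type": "text",
        "text": "Describe the visual elements, actions, and events in these "
                "consecutive video frames. " + external_knowledge,
    }]
    for frame in frames:  # BGR numpy arrays produced by the pre-processing step
        ok, jpeg = cv2.imencode(".jpg", frame)
        b64 = base64.b64encode(jpeg.tobytes()).decode("utf-8")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed vision-capable model name
        messages=[{"role": "user", "content": content}],
        max_tokens=500,
    )
    return response.choices[0].message.content
```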
Script Generation using LLM. After generating the descriptions for each video clip, we use GPT-4 to integrate these clip-level descriptions into a coherent script. This script serves as a comprehensive description of the entire video, and is used by GPT-4 for a diverse set of video understanding tasks.
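A minimal sketch of this integration step is shown below: text-only GPT-4 is conditioned on the clip-level descriptions, the ASR transcript, and the metadata, and asked to merge them into one timestamped script. The prompt text and the `generate_script` helper are illustrative assumptions.

```python
# Merge clip descriptions, ASR, and metadata into a single coherent script with GPT-4.
from openai import OpenAI

client = OpenAI()


def generate_script(clip_descriptions, asr_transcript: str, metadata: str) -> str:
    # clip_descriptions: list of dicts with "start_sec", "end_sec", "description".
    clips_text = "\n".join(
        f"[{c['start_sec']:.0f}s - {c['end_sec']:.0f}s] {c['description']}"
        for c in clip_descriptions
    )
    prompt = (
        "You are given clip-level descriptions of a video, its speech transcript, "
        "and its metadata. Merge them into a single coherent script with timestamps, "
        "covering characters, actions, expressions, and dialogues.\n\n"
        f"Metadata:\n{metadata}\n\nASR transcript:\n{asr_transcript}\n\n"
        f"Clip descriptions:\n{clips_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```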
5. MM-VID for Streaming Inputs

Figure 3 shows the diagram of MM-VID when applied to the context of streaming inputs. Our system operates as an agent within a dynamic environment where streaming video frames serve as the primary input. In this context, the agent continually receives streaming video frames as states, representing the latest observations of the environment.
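A schematic version of this agent loop is sketched below, under the assumption that a small buffer of recent frames acts as the state and that a GPT-4V-based predictor (such as the clip-description call above) maps the state and action history to the next action. The helper names `run_streaming_agent` and `predict_action` are hypothetical.

```python
# Schematic streaming-agent loop: recent frames are the state, the model predicts
# the next action, and the chosen actions are accumulated as context.
from collections import deque


def run_streaming_agent(frame_source, predict_action, state_size: int = 3):
    """frame_source: iterable of frames; predict_action(frames, history) -> action str."""
    state = deque(maxlen=state_size)   # the most recent frames act as the state
    history = []                       # previously taken actions, fed back as context
    for frame in frame_source:
        state.append(frame)
        if len(state) < state_size:
            continue                   # wait until a full state is available
        action = predict_action(list(state), history)
        history.append(action)
        yield action                   # e.g., a gaming-console control or a GUI step
```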
6. Experiments

6.1. Experiment Setup

We implement MM-VID based on the MM-ReAct [79] codebase. We use the Automatic Speech Recognition (ASR) tool publicly available via the Azure Cognitive Services APIs [2], and utilize PySceneDetect [5] for scene detection.

6.2. MM-VID Capabilities

Figures 4-9 provide illustrative examples of MM-VID's complete execution flow. When a user uploads a video file, MM-VID initiates the process by first assessing the estimated video length. Subsequently, it performs multimodal pre-processing by invoking expert tools, including scene detection and ASR. Additionally, MM-VID collects external knowledge, encompassing video metadata such as title and abstract.

Following this preliminary stage, MM-VID proceeds to generate clip-level video descriptions for each segment of the video. Finally, it invokes GPT-4, integrating these clip-level descriptions into a coherent script. Once the script is generated, it empowers LLMs to provide a summarized understanding of the video content. This equips the system to address users' questions with grounded answers. We discuss MM-VID's distinct capabilities below.

Grounded Question-Answer (QA). The generation of a comprehensive script empowers our system with the capability of grounded QA. As shown in Figure 8, let us consider a scenario where a user poses the question, "Show me the most exciting moment in this video." In response, MM-VID displays a highlight, specifically featuring a home run, and provides the corresponding timestamp. When a user asks "Who are the best pitchers in this video?" MM-VID addresses the question by referring to relevant evidence in the generated script. This grounding capability owes its success to the extensive and detailed script generation process, which documents essential timestamps and significant events within the video, enabling accurate and contextually grounded responses to user inquiries.
[Figure 4 panel: clip-level video description generation using GPT-4V. Example output: "Timestamp: 00:00:00 - 00:00:02. Scene: Opening graphics introducing the game between the Colorado Rockies and the Los Angeles Dodgers."]
Figure 4. An example of MM-VID's execution flow. Given a baseball video, MM-VID provides an estimated video length, and then invokes scene detection and ASR tools, and collects external knowledge. Then, we generate clip-level video descriptions by using GPT-4V. GPT-4V takes the video frames and the text prompt as input, and outputs the video descriptions. Finally, given the clip-level descriptions, video metadata and ASR, we use GPT-4 to generate a coherent script for the input video. Figures 7-9 show continued outputs. The original video is available at https://www.youtube.com/watch?v=-pNe0p4H8ec
Multimodal Reasoning. MM-VID considers multimodal inputs, including video frames, speech transcriptions, and external knowledge if available. In Figure 8, when a user inquires, "How did you know the sound is different?", MM-VID explains that this information was derived from the commentator's remarks during the game. These examples illustrate MM-VID's multimodal reasoning capabilities, where it integrates both visual and auditory cues to provide contextually accurate responses to user queries.

Hour-Long Video Comprehension. Figures 10-13 demonstrate MM-VID's capabilities in processing lengthy videos. In this example, MM-VID effectively analyzes a documentary video spanning approximately 50 minutes in duration. For simplicity, the intermediate outputs are omitted in the figures, and only the final generated script is presented. We observe that MM-VID is able to generate a long script with the corresponding timestamps to represent the documentary video. By leveraging this generated script as contextual information, MM-VID is equipped to perform a range of tasks, including summarizing the lengthy video, addressing specific queries raised within the video, and indexing pivotal moments.

Multi-Video Episodic Analysis. MM-VID's proficiency in handling extensive video content can be expanded to encompass multiple lengthy videos, as illustrated in Figures 14-16. In these examples, we upload multiple episodes to MM-VID, showcasing its ability to perform a variety of complex tasks. MM-VID exhibits the capability to summarize the video series, engage in cross-episode reasoning, provide detailed descriptions of character journeys across multiple episodes, and facilitate grounded QA interactions.

Character Identification. We found that incorporating visual prompts enhances the quality of script generation, particularly with regard to character identification. In Figure 17, we illustrate this by providing MM-VID with additional inputs consisting of characters' face photos and their corresponding names. MM-VID effectively utilizes these visual prompts to identify the characters depicted in the video, based on the provided face photos. As a result, the script generation process is notably improved, ensuring more accurate and contextually relevant descriptions of characters and their interactions within the video content.
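A hedged sketch of such visual prompting is shown below: reference face photos and names are placed before the clip frames in the GPT-4V request, so the model can ground names to faces. It builds on the `describe_clip`-style call sketched earlier; the prompt wording and the `build_character_prompt` helper are assumptions.

```python
# Build GPT-4V content items that pair each character's name with a face photo,
# to be prepended to the clip frames in the request.
import base64
import cv2


def build_character_prompt(characters):
    """characters: list of (name, face_image) pairs; returns GPT-4V content items."""
    content = []
    for name, face in characters:
        ok, jpeg = cv2.imencode(".jpg", face)
        b64 = base64.b64encode(jpeg.tobytes()).decode("utf-8")
        content.append({"type": "text", "text": f"This is {name}:"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    content.append({"type": "text",
                    "text": "Refer to these people by name when they appear "
                            "in the following video frames."})
    return content
```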
Speaker Identification. Our exploration has revealed another valuable application of visual prompting in enhancing the quality of Automatic Speech Recognition (ASR). In Figures 18-19, we highlight a scenario where conventional ASR struggles to accurately recognize the number of speakers and their identities in the video. Visual prompting plays a pivotal role in enhancing ASR performance by providing contextual cues to identify individuals and attribute speech to specific speakers. This improvement ensures more precise transcriptions, enabling a more accurate representation of the dialogue and interactions within the video content.

Audio Description Generation. Audio descriptions [26, 57] play a crucial role in making videos accessible to individuals who are blind, have low vision, or face difficulties in visually understanding the content. These descriptions provide contextual narration of meaningful visual elements, clarify speakers, and convey the essence of visual information within a video. In our experiments, we also explore MM-VID's performance in audio description generation. We experiment with videos where there is limited or no speech content. In Figure 20, we showcase an example featuring a short film of Mr. Bean taking an exam, which primarily lacks speech. Without ASR inputs, MM-VID processes the video and generates a detailed script. This shows MM-VID's versatility in handling various types of video content and its potential in creating inclusive and accessible multimedia content.

Self-Refinement. While the generated script offers a comprehensive understanding of video content, our experiments have unveiled occasional inaccuracies, especially in cases involving blurry or low-resolution video frames, as demonstrated in Figure 21. In this example, MM-VID mistakenly identifies a bird as a rock due to the challenges posed by the video's visual quality. To address such inconsistencies and elevate the overall accuracy of the generated script, we employ a self-refinement approach [45, 58, 80]. This involves revising the script based on both the initially generated script and a concurrently generated video summary. Through this process, MM-VID is able to rectify errors and inaccuracies, resulting in a more refined output.
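The self-refinement pass could be realized with a single additional GPT-4 call, as sketched below: the initial script and an independently generated summary are given back to the model, which is asked to reconcile inconsistencies. The prompt wording and the `refine_script` helper are illustrative assumptions.

```python
# Revise the generated script so it is consistent with a separately generated summary.
from openai import OpenAI

client = OpenAI()


def refine_script(script: str, summary: str) -> str:
    prompt = (
        "Below are a detailed video script and a separately generated summary of the "
        "same video. Revise the script so that it is internally consistent with the "
        "summary, correcting likely recognition errors while keeping the timestamps.\n\n"
        f"Script:\n{script}\n\nSummary:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```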
Fast-Changing Short Videos. In Figure 22, we present an example of our experimentation with fast-changing short-form videos, such as those found on platforms like TikTok. Short videos often feature non-standard frame sizes and significantly shorter durations compared to conventional videos. Remarkably, MM-VID excels at accurately describing the cooking recipes depicted in these short videos, despite the distinct characteristics of such content.

These examples demonstrate the versatility of MM-VID in processing a diverse array of video content. Whether dealing with lengthy documentaries, episodic series, or short-form clips, MM-VID adapts seamlessly to the unique attributes of each video type, consistently delivering meaningful and contextually relevant descriptions.

6.3. Applications to Interactive Environments

In the following section, we evaluate MM-VID when applied to the context of streaming inputs. MM-VID serves as an agent in an interactive environment, continually receiving streaming video frames as the inputs.

Embodied Agent. Figure 23 illustrates an example where MM-VID is applied to an egocentric video captured by a head-mounted camera. This video, collected from the Ego4D dataset [25], provides a brief glimpse into the wearer's daily life within their home environment. Remarkably, MM-VID showcases its capability in understanding such video content and assists the user in a few practical tasks. Specifically, MM-VID helps the user locate items like the pink jacket and the laptop within the home. Additionally, it generates a list of the user's activities within a specified time range, offering insights into the wearer's daily routine.
Table 1. Questionnaire for the group with visual impairments. Participants listen to a video and subsequently assign scores (ranging from 0 to 10) for distinct auditory criteria.
Effectiveness of Delivery: Are the original audio and the embedded AD effectively presented?
Informative: Is it easy to follow the storyline? Does the AD provide context and background information when necessary?
Audio Quality: Is the overall audio production quality good?
Overall Satisfaction: Are you satisfied with the overall AD experience?

Table 2. Questionnaire for the group with normal vision. Participants view a video and subsequently assign scores (ranging from 0 to 10) for various auditory and visual criteria.
Clarity: Are the visual elements clearly and accurately described?
Conciseness: Does the AD convey essential visual information without overloading the user?
Timing and Synchronization: Are the original audio and the embedded AD effectively presented? Does the AD properly synchronize with the visual contents?
Informative: Is it easy to follow the storyline? Does the AD provide context and background information when necessary?
Audio Quality: Is the overall audio production quality good?
Overall Satisfaction: Are you satisfied with the overall AD experience?
Playing Video Games. Figures 24-27 demonstrate the results of applying MM-VID to a Mario video game [4]. In these experiments, our agent consistently receives three video frames as states and calculates the next possible control action. Remarkably, our agent displays an understanding of the specific video game dynamics and generates reasonable action controls to play the game effectively. These examples highlight MM-VID's ability to comprehend and navigate in an interactive gaming environment. Interested readers may find the full gameplay demonstration on our project website.
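As a small illustration of how a predicted action could be turned into a control for a Pygame-based game [4], the sketch below maps an action string to a simulated key press; the action vocabulary, the mapping, and the `apply_action` helper are assumptions, and the game loop is assumed to have initialized Pygame. The streaming-agent loop sketched in Section 5 would supply the action strings.

```python
# Translate a predicted action string into Pygame key events for the running game.
import pygame

ACTION_TO_KEY = {
    "move_left": pygame.K_LEFT,
    "move_right": pygame.K_RIGHT,
    "jump": pygame.K_SPACE,
}


def apply_action(action: str):
    key = ACTION_TO_KEY.get(action)
    if key is not None:
        pygame.event.post(pygame.event.Event(pygame.KEYDOWN, key=key))
        pygame.event.post(pygame.event.Event(pygame.KEYUP, key=key))
```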
GUI Navigation. Figures 28-32 provide the demonstration of MM-VID's performance in the GUI navigation scenario. In this context, the agent continually receives iPhone screenshots and previous user actions as states. The agent effectively predicts the possible next steps in the user's journey, which may include clicking on the correct shopping apps, initiating searches for items of interest, and ultimately placing an order. These results demonstrate MM-VID's remarkable ability to interact with graphical user interfaces, facilitating seamless and intelligent navigation through digital interfaces.

6.4. User Study

We explore the potential of MM-VID for people who are blind or have low vision. Audio description (AD) [26, 57] provides an auditory narration integrated into the video's soundtrack, offering important visual details that may not be discernible from the main video soundtrack. Such descriptions play a pivotal role in conveying essential visual content to those with visual impairments.

To assess the efficacy of MM-VID in generating audio descriptions (AD), we conduct a user study. We invited 9 participants for the evaluation. 4 participants were either blind or had low vision, while the remaining 5 had normal vision. All the participants have normal hearing. For the purposes of the experiments, we segregated participants into two distinct groups: (i) the group with visual impairments, and (ii) the group with normal vision.

6.4.1 Evaluation Procedure

Our experiments utilize a curated set of videos, which are mainly suggested by the American Council of the Blind². We also collected accessibility videos from YouTube³. For every video used in our evaluation, participants are exposed to two versions: the first containing human-crafted AD and the second powered by MM-VID-generated AD. Both renditions are narrated using text-to-speech (TTS) technology.

²The Audio Description Project: https://adp.acb.org/
³Apple Accessibility: https://www.youtube.com/watch?v=SL7YSqlEd8k

We have designed two questionnaires for the two groups, referenced in Table 1 and Table 2, respectively. Participants with visual impairments are instructed to base their evaluation exclusively on auditory cues. In contrast, those with normal vision are instructed to consider both visual and auditory elements.

The assessment adopts the standardized Likert scale for ratings. For each posed question, participants are guided to assign a score ranging from 0 to 10, with higher values indicating more favorable feedback. Furthermore, participants are urged to share feedback and remarks concerning their overall experience.

6.4.2 Results on the Group with Visual Impairments

We utilized 3 different videos for our evaluation, with durations of 1 minute, 1 minute 42 seconds, and 2 minutes 42 seconds, respectively.
Each of the 4 participants with visual impairment was well versed with screen readers and other common accessibility tools. After listening to the audio descriptions for each video, they were asked to respond to the 4 questions outlined in Table 1.

Hypotheses and Results

H1: The MM-VID-generated audio description and original video dialogues are effectively presented to the participants.
Results: Using the Likert scale (0=Not Effective to 10=Most Effective), the participants rated the effectiveness of the delivery of human-crafted AD and MM-VID-generated AD. On average, participants gave 7.14 ± 1.39 for MM-VID-generated AD and 8.33 ± 0.90 for human-crafted AD, which shows that MM-VID-generated AD is very close to the human-crafted one in terms of effective delivery (Figure 5).

H2: Participants are able to follow the main storyline of the video based on the MM-VID-generated audio description only.
Results: Using the Likert scale (0=Not Informative to 10=Highly Informative), the participants rated the informativeness of human-crafted AD and MM-VID-generated AD. On average, participants gave 7.14 ± 1.16 for MM-VID-generated AD and 9.29 ± 0.91 for human-crafted AD, which indicates that participants could follow the storyline with MM-VID-generated AD, although they rated it somewhat less informative than the human-crafted one (Figure 5).

H3: MM-VID-generated AD and human-crafted AD are close in terms of voice and audio quality.
Results: Using the Likert scale (0=Low Quality to 10=High Quality), the participants rated the voice and audio quality on average as 8.91 ± 1.23 for MM-VID-generated AD and 9.07 ± 0.65 for human-crafted AD. This minimal difference between the scores indicates the close-to-human voice and audio quality of MM-VID-generated AD (Figure 5).

Discussion: The results show that the participants' overall satisfaction with MM-VID-generated ADs was on average around 2 points less than with human-crafted ones on the Likert scale (0=Not Satisfied to 10=Highly Satisfied) (Figure 5). Some of the difficulties indicated by participants while listening to MM-VID-generated ADs were 1) occasional overlaps between the AD audio and the original video dialogues, and 2) wrong descriptions due to hallucinations of GPT-4V(ision). Regardless of the difference in overall satisfaction, all the participants agreed that MM-VID-generated AD can provide a cost-effective and scalable solution. Thus, millions of videos that cannot afford to be professionally audio described can be auto-processed by a tool like MM-VID to make them accessible to the visually impaired community.

Figure 5. Results on the group with visual impairments (rated criteria: Effectiveness of Delivery, Informative, Audio Quality, Satisfaction; human-written AD vs. MM-VID-generated AD). MM-VID-generated AD is close to human-generated ones in terms of audio quality and effectiveness of delivery. However, MM-VID's AD yields lower satisfaction levels compared to the human-generated ones. This was primarily attributed to occasional overlaps between the audio descriptions and the video dialogues.

6.4.3 Results on the Group with Normal Vision

For sighted individuals, we used the same set of videos as we used for individuals with visual impairments. All of our 5 participants answered the 6 questions listed in Table 2 after watching videos embedded with MM-VID-generated AD as subtitles and audio track.

Hypotheses and Results

H1: The MM-VID-generated AD is accurate and conveys essential information without overloading the listener.
Results: The sighted individuals rated the clarity and accuracy of MM-VID-generated AD as 7.83 ± 1.24 and human-curated AD as 8.9 ± 0.74 on average, using the Likert scale (0=Not Accurate to 10=Most Accurate). In terms of conciseness, the participants on average gave 8.73 ± 0.49 for the MM-VID-generated AD and 9.16 ± 0.54 for human-curated AD based on the Likert scale (0=Not Concise to 10=Most Concise). These results indicate MM-VID-generated ADs are close to human-curated ones in terms of accuracy and conciseness (Figure 6).

H2: The MM-VID-generated ADs are in sync with visual content and do not overlap with other dialogues, ensuring listeners can follow the storyline.
Results: Participants gave on average 8.90 ± 0.90 and 7.97 ± 1.54 to human-crafted AD and MM-VID-generated AD, respectively, using the Likert scale (0=Not Informative to 10=Highly Informative). Human-crafted AD and MM-VID-generated AD received 8.59 ± 0.95 and 8.53 ± 0.58, respectively, on the aspect of timing and synchronization using the Likert scale (0=Not Effective to 10=Most Effective). This indicates that, while listening to MM-VID-generated ADs, participants were able to follow the main storyline and found the audio to be in sync with the video content, very close to that of human-crafted ADs (Figure 6).

H3: The voice and audio quality of MM-VID-generated ADs are close to human-crafted ADs.
Results: The results are very similar to the results on the group with visual impairments. Sighted participants rated the voice and audio quality on average as 8.30 ± 0.89 for MM-VID-generated AD and as 8.93 ± 0.32 for human-crafted AD. Therefore, the voice and audio experience did not degrade much while listening to MM-VID-generated ADs compared to human-crafted ADs (Figure 6).

Figure 6. Results on the group with normal vision (rated criteria: Clarity, Conciseness, Timing and Synchronization, Informative, Audio Quality, Satisfaction; human-written AD vs. MM-VID-generated AD). MM-VID-generated AD was on average <1 point lower than human-generated ADs. The participants were able to follow the main storyline, and the audio is in sync with the video content.

Discussion: The evaluations on sighted individuals helped to cross-verify the hypotheses for individuals with visual impairments, which are based on audio cues only. Although the overall satisfaction scores of sighted participants for MM-VID-generated ADs were on average <1 point lower than for human-generated ADs (Figure 6), the overall satisfaction scores of participants who were blind showed a larger gap. This is expected because sighted individuals had access to both audio and video modalities but individuals with visual impairments did not. We also believe the reason for the lower overall satisfaction may have been the lack of practice listening to auto-generated ADs. Some of the users also mentioned that they have preferences regarding the pitch of the voice and the number of concurrent audio channels. These factors may also contribute to the lower overall satisfaction.

6.4.4 Participant Feedback

We present a collection of interview quotes from our participants who were visually impaired, in which they share their personal experiences and insights about the audio descriptions (AD) generated by MM-VID. The participants expressed a unanimous desire to continue utilizing this AD generation service in the future, highlighting its exceptional quality ("Nearly perfect"), intricate details ("favorite was the details"), extensive applicability ("allowed me to follow anything visual"), and the profound impact it has on them ("I did not depend on someone else"). Below, we provide additional quotes for further insight.

P1: "I understand what is going on very quickly and I did not depend on someone else."
P2: "If it's AI-generated, there are so many places it's not available, and we need it there."
P2: "First time listening to auto-generated AD. As a user, if I am offered this AD, I would take it."
P3: "Nearly perfect. Most favorite was the details."
P3: "More information helped me follow the storyline."
P3: "It allowed me to follow anything visual. It felt natural the way AD describes how the actor interacts with the environment."
P3: "I love animal kingdom, and I watch Wild Earth safari virtual tour. I would love to have audio descriptions of Wild Earth videos and daily safaris."
P4: "I would like to have auto-generated audio description for live conferences in Microsoft Teams."
P4: "It worked best as the original audio had not much value."

Despite the positive feedback, not all responses were favorable:

P4: "I am skeptical when it becomes subjective. Sometimes I feel they make up stories which is not good."
P4: "After listening to the human-generated AD, I figured I misunderstood parts of the original story."
P1: "It keeps referring to the same person using their names instead of pronouns."
P4: "I don't deal well with overlapped or two parallel audios."

Interestingly, even those participants who provided critical feedback still rated the MM-VID-generated AD closely to the human-generated AD during the questionnaire sessions. This indicates that, similar to human-curated AD, adapting to MM-VID-generated ADs might necessitate some practice and acclimatization over time.

7. Conclusion

We have presented MM-VID, a system that synergizes with GPT-4V for advancing video understanding. MM-VID employs GPT-4V to transcribe video content into long and detailed scripts, thereby enriching LLMs with advanced video understanding capabilities. Experimental results demonstrate the effectiveness of MM-VID in addressing challenging tasks, including comprehension of hour-long videos, analysis across multiple episodes, identification of characters and speakers, and interaction with video games and graphical user interfaces.

Beyond the development of the MM-VID system, we conducted an extensive user study, drawing feedback from a varied group of participants. The outcomes of this study indicated that the audio descriptions generated by MM-VID closely mirror the quality of those crafted by humans. In our future work, we plan to explore SoM [76] and object tracking techniques to enhance various tasks and functionalities.

Acknowledgment

We are deeply grateful to OpenAI for providing access to their exceptional tool [3, 51-53]. We are profoundly thankful to Misha Bilenko for his invaluable guidance and support. We also extend heartfelt thanks to our Microsoft colleagues for their insights, with special acknowledgment to Cenyu Zhang, Saqib Shaikh, Ailsa Leen, Jeremy Curry, Crystal Jones, Roberto Perez, Ryan Shugart, and Anne Taylor for their constructive feedback.

References

[1] Dota 2. https://openai.com/research/dota-2, 2017.
[2] Azure Cognitive Services APIs. https://azure.microsoft.com/products/cognitive-services, 2023.
[3] ChatGPT can now see, hear, and speak. https://openai.com/blog/chatgpt-can-now-see-hear-and-speak, 2023.
[4] Pygame library. https://www.pygame.org/, 2023.
[5] PySceneDetect: Video scene cut detection and analysis tool. https://www.scenedetect.com/, 2023.
[6] VLog: Video as a long document. https://github.com/showlab/VLog, 2023.
[7] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: A visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
[8] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
[9] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In ICCV, pages 6836-6846, 2021.
[10] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728-1738, 2021.
[11] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639-24654, 2022.
[12] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423-443, 2018.
[13] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In ICML, 2021.
[14] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the Kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019.
[15] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, pages 6299-6308, 2017.
[16] Harrison Chase. LangChain. https://langchain.readthedocs.io/, 2023.
[17] Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, and Ming Zhou. Continuous speech separation with conformer. In ICASSP, pages 5749-5753. IEEE, 2021.
[18] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
[19] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[20] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In ICCV, pages 6824-6835, 2021.
[21] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In ICCV, pages 6202-6211, 2019.
[22] Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. VIOLET: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021.
[23] Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. An empirical study of end-to-end video-language transformers with masked visual modeling. In CVPR, pages 22898-22909, 2023.
[24] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, pages 5842-5850, 2017.
[25] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In CVPR, pages 18995-19012, 2022.
[26] Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman. AutoAD: Movie description in context. In CVPR, pages 18930-18940, 2023.
[27] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725-1732, 2014.
[28] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In ICCV, pages 2556-2563. IEEE, 2011.
[29] Anna Kukleva, Makarand Tapaswi, and Ivan Laptev. Learning interactions and relationships between movie characters. In CVPR, pages 9849-9858, 2020.
[30] Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. Less is more: ClipBERT for video-and-language learning via sparse sampling. In CVPR, pages 7331-7341, 2021.
[31] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L Berg. TVQA: Localized, compositional video question answering. In EMNLP, 2018.
[32] Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. TVR: A large-scale dataset for video-subtitle moment retrieval. In ECCV, 2020.
[33] Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023.
[34] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[35] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
[36] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. HERO: Hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020.
[37] Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. LAVENDER: Unifying video-language understanding as masked language modeling. In CVPR, pages 23119-23129, 2023.
[38] Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, et al. VALUE: A multi-task benchmark for video-and-language understanding evaluation. In NeurIPS Track on Datasets and Benchmarks, 2021.
[39] Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and Lijuan Wang. SwinBERT: End-to-end transformers with sparse attention for video captioning. In CVPR, pages 17949-17958, 2022.
[40] Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35:7575-7586, 2022.
[41] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
[42] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video Swin transformer. In CVPR, pages 3202-3211, 2022.
[43] Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, and Zhongyu Wei. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023.
[44] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
[45] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
[46] Mark Everingham. "Hello! My name is... Buffy": Automatic naming of characters in TV video. In Proceedings of the British Machine Vision Conference, 2006.
[47] Microsoft. Bing Chat. https://www.microsoft.com/en-us/edge/features/bing-chat, 2023.
[48] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
[49] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: A large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
[50] Arsha Nagrani and Andrew Zisserman. From Benedict Cumberbatch to Sherlock Holmes: Character identification in TV series without a script. arXiv preprint arXiv:1801.10442, 2018.
[51] OpenAI. GPT-4 technical report. 2023.
[52] OpenAI. GPT-4V(ision) system card. 2023.
[53] OpenAI. GPT-4V(ision) technical work and authors. https://cdn.openai.com/contributions/gpt-4v.pdf, 2023.
[54] Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.
[55] Karine Pires and Gwendal Simon. YouTube Live and Twitch: A tour of user-generated live streaming systems. In Proceedings of the 6th ACM Multimedia Systems Conference, pages 225-230, 2015.
[56] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In ICML, pages 28492-28518. PMLR, 2023.
[57] Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. In CVPR, pages 3202-3212, 2015.
[58] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.
[59] Mustafa Shukor, Corentin Dancette, Alexandre Rame, and Matthieu Cord. Unified model for image, video, audio and language tasks. arXiv preprint arXiv:2307.16184, 2023.
[60] Cees GM Snoek and Marcel Worring. Multimodal video indexing: A review of the state-of-the-art. Multimedia Tools and Applications, 25:5-35, 2005.
[61] David Snyder, Daniel Garcia-Romero, Gregory Sell, Alan McCree, Daniel Povey, and Sanjeev Khudanpur. Speaker recognition for multi-speaker conversations using x-vectors. In ICASSP, pages 5796-5800. IEEE, 2019.
[62] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. MovieChat: From dense token to sparse memory for long video understanding. arXiv preprint arXiv:2307.16449, 2023.
[63] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[64] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[65] Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, and Yu-Gang Jiang. ChatVideo: A tracklet-centric multimodal and versatile video understanding system. arXiv preprint arXiv:2304.14407, 2023.
[66] Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Kevin Qinghong Lin, Satoshi Tsutsui, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, et al. All in one: Exploring unified video-language pre-training. In CVPR, pages 6598-6608, 2023.
[67] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. A survey on large language model based autonomous agents, 2023.
[68] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. VATEX: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019.
[69] Yunbo Wang, Mingsheng Long, Jianmin Wang, and Philip S Yu. Spatiotemporal pyramid network for video action recognition. In CVPR, pages 1529-1538, 2017.
[70] Chao-Yuan Wu and Philipp Krahenbuhl. Towards long-form video understanding. In CVPR, pages 1884-1894, 2021.
[71] Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. GRiT: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022.
[72] Yujia Xie, Luowei Zhou, Xiyang Dai, Lu Yuan, Nguyen Bach, Ce Liu, and Michael Zeng. Visual clues: Bridging vision and language foundations for image paragraph captioning. Advances in Neural Information Processing Systems, 35:17287-17300, 2022.
[73] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
[74] Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[75] Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. Exploring large language models for communication games: An empirical study on Werewolf. arXiv preprint arXiv:2309.04658, 2023.
[76] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023.
[77] Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, et al. i-Code: An integrative and composable multimodal learning framework. In AAAI, pages 10880-10890, 2023.
[78] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, 2023.
[79] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. MM-ReAct: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
[80] Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. Idea2Img: Iterative self-refinement with GPT-4V(ision) for automatic image design and generation. arXiv preprint arXiv:2310.08541, 2023.
[81] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
[82] Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. MERLOT: Multimodal neural script knowledge models. Advances in Neural Information Processing Systems, 34:23634-23651, 2021.
[83] Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. arXiv preprint arXiv:2307.02485, 2023.
[84] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
[85] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models, 2022. https://arxiv.org/abs/2205.01068
[Figure 7 panel: script generation using GPT-4 (continued). The panel shows the full timestamped play-by-play script generated for the baseball game, covering 0:00:47 through 0:06:22.]

Figure 7. An example of MM-VID's execution flow (continued). We present the full script generated by MM-VID. The original video is available at https://www.youtube.com/watch?v=-pNe0p4H8ec
Script Generation Using GPT-4 (continued):

0:07:11 - 0:09:09: The scene concludes with the teams leaving the field.

How did you know the sound is different?
Figure 8. An example of MM-V ID’s execution flow (continued). Given the generated script as reference, MM-V ID achieves various video
understanding capabilities, including grounded QA and multimodal reasoning. The original video is available at https://www.
youtube.com/watch?v=-pNe0p4H8ec
(continued)

The Los Angeles Dodgers won the game with a final score of 5-2.
Figure 9. An example of MM-V ID’s execution flow (continued). Given the generated script as reference, MM-V ID achieves various video
understanding capabilities, including grounded QA and multimodal reasoning. The original video is available at https://www.
youtube.com/watch?v=-pNe0p4H8ec
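The grounded QA shown in Figures 8-9 boils down to prompting a text-only LLM with the generated script plus the user question. Below is a minimal sketch of that step, assuming the OpenAI Python SDK and a generic GPT-4 chat model; answer_from_script and the prompt wording are illustrative, not MM-V ID's actual interface.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_from_script(script: str, question: str) -> str:
    """Answer a question about a video using only its generated, timestamped script."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name; substitute any available chat model
        messages=[
            {"role": "system",
             "content": "Answer questions about a video using only the timestamped "
                        "script provided by the user."},
            {"role": "user",
             "content": f"Video script:\n{script}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# Example usage with the baseball script above:
# answer_from_script(full_script, "Who won the game, and what was the final score?")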
Figure 10. Case studies of MM-V ID ’s capabilities and application scenarios: hour-long video comprehension. Figures 11-13 show
continued outputs. The original video is available at https://www.youtube.com/watch?v=um2Q9aUecy0
Scene 31: 00:06:46 - An elephant and a gorilla roam the jungle.
Scene 46: 00:09:03 - Close-up of a bug on a leaf.
Figure 11. Case studies of MM-V ID ’s capabilities and application scenarios: hour-long video comprehension; We present the full script
generated by MM-V ID. The original video is available at https://www.youtube.com/watch?v=um2Q9aUecy0
Can you summarize the video?

The video is a documentary that explores the diverse and rich ecosystems of jungles and rainforests around the world. It showcases various animal species, their behaviors, and their roles in maintaining the balance of these ecosystems. The documentary also highlights the threats faced by these habitats, such as deforestation and …

What is Cordyceps fungus?

Cordyceps fungus is a parasitic fungus that infects and takes control of its host, usually an insect. The fungus manipulates the host's behavior, eventually leading to the host's death. The fungus then erupts from the host's body and releases spores to infect more victims. Cordyceps fungus plays a role in maintaining the balance of …

Figure 12. Case studies of MM-V ID ’s capabilities and application scenarios: hour-long video comprehension; MM-V ID is able to summarize the video and address specific questions with regard to the video contents. Blue highlights the correct answers. The original video is available at https://www.youtube.com/watch?v=um2Q9aUecy0

Pitcher plants can be found in the video around the timestamp 00:18:18.

Figure 13. Case studies of MM-V ID ’s capabilities and application scenarios: hour-long video comprehension; MM-V ID is able to localize the specific moments or topics in the lengthy documentary video. Blue highlights the correct answers. The original video is available at https://www.youtube.com/watch?v=um2Q9aUecy0
(continued)

… the deep web for information. The episode features various tense conversations and political strategizing, with characters navigating complex relationships and making crucial decisions.

Loading video…
Video length is 46 minutes 54 seconds.

What are the main challenges Frank facing in the two videos?

Figure 14. Case studies of MM-V ID ’s capabilities and application scenarios: multi-video episodic analysis; Given multiple video files as inputs, MM-V ID can summarize the video series, perform cross-episodic reasoning, describe a specific character’s journey across multiple videos. Figure 15 shows continued outputs. Blue highlights the correct answers. The two videos are from House of Cards, Season 2: episode 1 and episode 2.

Describe Frank's journey in the two videos.

Frank's journey in these two episodes involves him preparing for his new role as Vice President while trying to maintain control over various situations. He deals with the consequences of his past actions, such as …
Figure 15. Case studies of MM-V ID ’s capabilities and application scenarios: multi-video episodic analysis; Given multiple video files as
inputs, MM-V ID can summarize the video series, perform cross-episodic reasoning, describe a specific character’s journey across multiple
videos. Blue highlights the correct answers. The two videos are from House of Cards, Season 2: episode 1 and episode 2.
Did Pikachu use 10000000 Volts move?
Loading video…
Video length is 23 minutes 36 seconds.
Figure 16. Case studies of MM-V ID ’s capabilities and application scenarios: multi-video episodic analysis; Given multiple video files
as inputs, MM-V ID can summarize the video series, and localize specific moments across multiple videos. Blue highlights the correct
answers. The two videos are from Pokémon Journeys: episode 131 and episode 132.
Loading video…
Video length is 2 minutes 43 seconds.

Invoke tools: Scene Detection, Metadata Collection

Scene Detection:
0:00:00, 0:00:01, 0:00:02, 0:00:03, 0:00:04, 0:00:05, 0:00:06, 0:00:07, 0:00:08, 0:00:09, ..., 0:02:39, 0:02:40, 0:02:41, 0:02:42

Metadata Collection:
Title: Mummies
Abstract: Thut is a young charioteer. Nefer is the daughter of Faraon. In this video, Thut saves himself from falling off a cliff. Thut and Nefer meets Faraon. Thut and Nefer gets married in front of the goddess statue and all people. Thut and Nefer celebrates by singing and dancing.

Visual Prompting:
[Face photos of the characters, labeled Faraon, Thut, and Nefer]

Script Generation:
0:00:00 - 0:00:06: Nefer, an Egyptian princess, is seen gazing at the desert landscape and examining a rocky cliff.
0:00:06 - 0:00:08: Nefer looks down the cliff's edge, revealing her concerned reaction.
0:00:12 - 0:00:14: A shot of a rock with a golden ankh necklace sets an ominous tone.
0:00:15 - 0:00:16: Transition to a worried Nefer under the open sky.
0:00:17 - 0:00:18: Faraon overlooks Thut lying on the rocky ground.
0:00:18 - 0:00:24: Various interactions between Thut, Nefer, and Faraon in the desert transpire.
0:00:24 - 0:00:30: Nefer is hugged by Faraon as Thut watches closely.
0:00:30 - 0:00:36: Thut shares a moment with Nefer before seating himself beside a frog as Faraon joins Nefer.
0:00:36 - 0:00:42: Nefer and Faraon approach a slightly confused Thut.
0:00:42 - 0:00:48: Thut and Nefer stand alone exchanging a few words.
0:00:48 - 0:00:54: Faraon joins Thut and Nefer, looking at a ring Thut holds.
0:01:00 - 0:01:06: Thut presents a ring between his and Nefer's fingers.
0:01:06 - 0:01:12: Thut listens as Nefer talks excitingly about the ring.
0:01:12 - 0:01:18: Faraon oversees a close moment between Thut and Nefer.
0:01:18 - 0:01:24: Nefer and Thut stand in front of an audience as a small girl with a snake appears.
0:01:24 - 0:01:30: Thut and Nefer perform in front of a crowd.
0:01:42 - 0:01:48: Thut and Nefer share a kiss as the crowd cheers on, transitioning to the girl with her crocodile.
0:01:54 - 0:02:00: Thut and Nefer perform on stage with others in a palace courtyard.
0:02:12 - 0:02:18: Multiple performances ensue in the courtyard, ending with Thut and Nefer sharing a gaze.
0:02:18 - 0:02:24: Thut and Nefer continue their romantic interaction under the watchful eyes of their audience.
0:02:24 - 0:02:30: The dancing in the courtyard continues.
0:02:30 - 0:02:36: Thut, Nefer, and Faraon perform on stage as their performers enjoy their display.
0:02:36 - 0:02:43: As the sun sets, Thut and Nefer conclude their musical performance on a balcony.
Figure 17. Case studies of MM-V ID ’s capabilities and application scenarios: character identification; By providing the face photos
of the characters and their names, MM-V ID can generate the script with the identified characters. Blue highlights the correct character
identification. The original video is available at https://www.tokyvideo.com/video/mummies-online-free-film, from
1:19:47 to 1:22:30.
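The execution flow above (scene detection at one-second boundaries, metadata collection, visual prompting, then clip-wise script generation) can be made concrete with a small sketch. This is only an illustration under the assumption of fixed one-second boundaries and six-second clip windows; MM-V ID itself may rely on a learned scene detector, and the helper names below are invented for the example.

def fmt(seconds: int) -> str:
    """Format a second count as H:MM:SS, matching the timestamps in the script."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"

def scene_boundaries(duration_s: int) -> list:
    """One boundary per second: 0:00:00, 0:00:01, ..., as in the Scene Detection output."""
    return [fmt(t) for t in range(duration_s)]

def clip_windows(duration_s: int, clip_len_s: int = 6) -> list:
    """Non-overlapping (start, end) windows; each window is described by GPT-4V in turn."""
    return [(fmt(t), fmt(min(t + clip_len_s, duration_s)))
            for t in range(0, duration_s, clip_len_s)]

duration = 2 * 60 + 43  # the 2-minute-43-second clip in the example above
print(scene_boundaries(duration)[:3], "...", scene_boundaries(duration)[-1])
for start, end in clip_windows(duration)[:2]:
    print(f"{start} - {end}: <clip-level description from GPT-4V goes here>")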
Regenerated ASR with Identified Speakers:
Figure 18. Case studies of MM-V ID ’s capabilities and application scenarios: speaker identification; By leveraging visual prompting,
MM-V ID can enhance ASR predictions with the speakers’ identity. Blue and Red highlight the correct and incorrect predictions, respec-
tively. Figure 19 shows continued outputs. The original video is available at https://www.netflix.com/title/81444554,
from 9:52 to 12:52.
(continued):
0:01:45 Nimona: Sidekick, you know.
0:01:46 Nimona: To help you do whatever it
takes to.
0:01:47 Nimona: Get revenge on.
0:01:48 Nimona: The cold, cruel world that
rejected you. Shall we pillage village? Lay low
until they don't remember you, and then we
rise like a phoenix. From the Ashes, overthrow
the government.
0:01:58 Unknown: Or we could.
0:01:59 Nimona: Just talk.
0:02:02 Unknown: The the the point.
0:02:03 Nimona: Is whatever your dark heart
desires. Boss, your sidekick has arrived.
0:02:08 Ballister Boldheart: I don't need a.
0:02:09 Ballister Boldheart: Side wait, I am not
a.
0:02:09 Nimona: Every villain needs a sidekick.
0:02:12 Unknown: Sweetener. Murdo. That's
how you want to kill first. Oh.
0:02:14 Nimona: Are these your next targets?
0:02:15 Nimona: Who do?
0:02:16 Ballister Boldheart: I'm not killing
anybody.
0:02:18 Nimona: This one, this guy looks
extremely punchable.
0:02:19 Nimona: Right. He is actually extremely.
0:02:21 Ballister Boldheart: You're right, he is
actually.
0:02:22 Ballister Boldheart: Punchable. Wait,
that is not a murder wall. It's a.
0:02:26 Ballister Boldheart: The innocence wall,
yeah.
0:02:27 Nimona: Citizens will.
0:02:30 Nimona: We are you saying that you're
not a villain? Yes.
0:02:36 Nimona: Yes you are.
0:02:37 Ballister Boldheart: No, no, no. I'm not
a villain, no.
0:02:42 Ballister Boldheart: But you.
0:02:43 Ballister Boldheart: Disappointed that
I'm not a murderer.
0:02:46 Nimona: I mean.
0:02:46 Nimona: I mean a little. Why?
0:02:48 Ballister Boldheart: You know what we.
0:02:49 Ballister Boldheart: Don't know.
0:02:50 Ballister Boldheart: The real villain is
still out there and I do need help not.
0:02:53 Nimona: From you. No. Fine then from.
0:02:55 Nimona: Who? The guy who cut off
your arms.
Figure 19. Case studies of MM-V ID ’s capabilities and application scenarios: speaker identification; By leveraging visual prompting,
MM-V ID can enhance ASR predictions with the speakers’ identity. Blue and Red highlight the correct and incorrect predictions, respec-
tively. The original video is available at https://www.netflix.com/title/81444554, from 9:52 to 12:52.
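The speaker identification illustrated in Figures 18-19 amounts to re-emitting each timestamped ASR segment with a speaker label obtained through visual prompting. A minimal sketch follows; identify_speaker is a placeholder for the GPT-4V call that matches on-screen faces against the provided character photos, and all names here are illustrative.

from dataclasses import dataclass

@dataclass
class AsrSegment:
    start: str   # e.g. "0:01:45"
    text: str    # e.g. "Sidekick, you know."

def identify_speaker(start: str, character_photos: dict) -> str:
    """Placeholder for visual prompting: inspect the frame at `start` together with the
    reference face photos and return a character name, or "Unknown" if no confident match."""
    return "Unknown"

def regenerate_transcript(segments: list, character_photos: dict) -> list:
    """Attach a speaker label to every ASR segment, as in the regenerated transcript above."""
    return [f"{seg.start} {identify_speaker(seg.start, character_photos)}: {seg.text}"
            for seg in segments]

# Example usage:
# photos = {"Nimona": "nimona.jpg", "Ballister Boldheart": "ballister.jpg"}
# regenerate_transcript([AsrSegment("0:01:45", "Sidekick, you know.")], photos)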
Figure 20. Case studies of MM-V ID ’s capabilities and application scenarios: audio description generation; Without ASR inputs, MM-
V ID processes the video and generates a detailed script. Blue and Red highlight the correct and incorrect predictions, respectively. The
original video is available at https://www.youtube.com/watch?v=29ZmXkICT1g, from 2:50 to 4:34.
Figure 21. Case studies of MM-V ID ’s capabilities and application scenarios: self-refinement; In this example, MM-V ID mistakenly
identifies a bird as a rock due to low-resolution video frames. The self-refinement helps rectify the errors, resulting in a more refined
output. Blue and Red highlight the correct and incorrect predictions, respectively. The original video is available at https://www.
youtube.com/watch?v=o4UKvgBkO2Y, from 1:06 to 4:17.
Figure 22. Case studies of MM-V ID ’s capabilities and application scenarios: fast-changing short videos. The original videos are
available at https://www.instagram.com/mealtimewithmummy/reels/
Figure 23. Case studies of MM-V ID ’s capabilities and application scenarios: embodied agent. Blue highlights the correct prediction.
The original video is collected from Ego4D dataset [25].
Figure 24. Case studies of MM-V ID ’s capabilities and application scenarios: playing video game. Blue highlights the correct prediction.
Figures 25-27 show continued outputs. The video is generated by Pygame library [4].
Figure 25. Case studies of MM-V ID ’s capabilities and application scenarios: playing video game. Blue highlights the correct prediction.
Figures 26-27 show continued outputs. The video is generated by Pygame library [4].
You are expert in playing Super Mario Bros. What is the next controls to be pressed after the last frame. Output JSON.

Output:
{
"controls": "ArrowLeft+Jump",
"reason": "Mario should"
}

For this sequence of frames of super mario, what is the next controls. Available controls are: Jump, ArrowLeft, ArrowRight, Wait

Output:
{
"control": "Jump",
"reason": "Mario should jump to avoid the goomba"
}

(Agent presses control “Jump”)
Figure 26. Case studies of MM-V ID ’s capabilities and application scenarios: playing video game. Blue highlights the correct prediction.
Figure 27 shows continued outputs. The video is generated by Pygame library [4].
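The agent loop implied by Figures 24-27 parses the model's JSON answer and injects the named controls as key events into the Pygame game. The sketch below is an assumption about how that glue code could look; the key bindings and function names are not the actual interface of the game used here.

import json
import pygame

KEY_MAP = {
    "Jump": pygame.K_SPACE,       # assumed binding
    "ArrowLeft": pygame.K_LEFT,
    "ArrowRight": pygame.K_RIGHT,
    "Wait": None,                 # no key press
}

def press_controls(model_output: str) -> None:
    """model_output is e.g. '{"controls": "ArrowLeft+Jump", "reason": "..."}'.
    Assumes pygame.init() and the game window are already set up by the game itself."""
    data = json.loads(model_output)
    controls = data.get("controls") or data.get("control", "")
    for name in controls.split("+"):
        key = KEY_MAP.get(name.strip())
        if key is not None:
            pygame.event.post(pygame.event.Event(pygame.KEYDOWN, key=key))
            pygame.event.post(pygame.event.Event(pygame.KEYUP, key=key))

# Example usage inside the game loop:
# press_controls('{"control": "Jump", "reason": "Mario should jump to avoid the goomba"}')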
Figure 27. Case studies of MM-V ID ’s capabilities and application scenarios: playing video game. Blue highlights the correct prediction.
The video is generated by Pygame library [4].
Figure 28. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively. Figures 29-32 show continued outputs.
Imagine that you are a robot operating a phone. Like how humans operate the phone, you can move your finger to an icon, click an icon with your finger, scroll down a screen, or type some texts with the keyboard.
You are asked to shop for an ergonomic keyboard, your budget is between $50 and $100.
Previous action: "My next move would be to move my finger to the Amazon icon, which is located in the third row and third column of the screen. I would then click on the icon to open the app and search for an ergonomic keyboard within my budget of $50."
You have now opened the Amazon app. Below is what you see on the phone screen. What will you do next? If the action involves moving your finger to an icon, describe the location of that icon as detailed as possible.

Imagine that you are a robot operating a phone. Like how humans operate the phone, you can move your finger to an icon, click an icon with your finger, scroll down a screen, or type some texts with the keyboard.
You are asked to shop for an ergonomic keyboard, your budget is between $50 and $100.
Previous action: "My immediate next move would be to move my finger to the "Filters" button, which is located at the top right corner of the screen, next to the "Windows" button. I would then click on the "Filters" button to open the filter options and narrow down my search results to only show ergonomic keyboards within my budget range of $50 to $100."
You have now completed the previous action as described above. Below is what you see on the phone screen. Please predict your immediate next move. If the action involves moving your finger to an icon, describe the location of that icon as detailed as possible.
Figure 29. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively. Figures 30-32 show continued outputs.
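The GUI-navigation prompts in Figures 29-32 follow a fixed turn-by-turn template: the persona and task, the previously predicted action, and a request for the next move given the current screenshot. A minimal sketch of assembling that prompt is shown below; build_prompt is illustrative and not part of MM-V ID's released code.

from typing import Optional

PERSONA = ("Imagine that you are a robot operating a phone. Like how humans operate the "
           "phone, you can move your finger to an icon, click an icon with your finger, "
           "scroll down a screen, or type some texts with the keyboard.")
TASK = ("You are asked to shop for an ergonomic keyboard, your budget is between "
        "$50 and $100.")

def build_prompt(previous_action: Optional[str]) -> str:
    """Assemble the text part of one navigation turn; the current screenshot is sent alongside."""
    parts = [PERSONA, TASK]
    if previous_action:
        parts.append(f'Previous action: "{previous_action}"')
        parts.append("You have now completed the previous action as described above.")
    parts.append("Below is what you see on the phone screen. What will you do next? "
                 "If the action involves moving your finger to an icon, describe the "
                 "location of that icon as detailed as possible.")
    return "\n".join(parts)

# Each turn, the model's answer becomes previous_action for the next turn:
# prompt = build_prompt(previous_action=None)            # first screen
# prompt = build_prompt(previous_action=last_response)   # later screens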
(continued)
… filter and set my minimum price to $50 and my maximum price to $100. This would further narrow down my search results to only show ergonomic keyboards within my budget range.

Imagine that you are a robot operating a phone. Like how humans operate the phone, you can move your finger to an icon, click an icon with your finger, scroll down a screen, or type some texts with the keyboard.
You are asked to shop for an ergonomic keyboard, your budget is between $50 and $100.
Previous action: "My immediate next move would be to move my finger to the "Price & Deals" filter option, which is located under the "Colors" filter option and above the "Sort by" filter option. I would then click on the "Price & Deals" filter option to open the price range filter and set my minimum price to $50 and my maximum price to $100. This would further narrow down my search results to only show ergonomic keyboards within my budget range."
You have now completed the previous action as described above. Below is what you see on the phone screen. What will you do next? If the action involves moving your finger to an icon, describe the location of that icon as detailed as possible.

My next move would be to move my finger to the "Show 1,987 results" button, which is located at the bottom of the screen, below the "Clear Filters" button and above the navigation bar. I would then click on the "Show 1,987 results" button to view the search results for ergonomic keyboards within my budget range.

Imagine that you are a robot operating a phone. Like how humans operate the phone, you can move your finger to an icon, click an icon with your finger, scroll down a screen, or type some texts with the keyboard.
You are asked to shop for an ergonomic keyboard, your budget is between $50 and $100.
Previous action: "My next move would be to move my finger to the "Show 1,987 results" button, which is located at the bottom of the screen, below the "Clear Filters" button and above the navigation bar. I would then click on the "Show 1,987 results" button to view the search results for ergonomic keyboards within my budget range."
You have now completed the previous action as described above. Below is what you see on the phone screen. What will you do next? If the action involves moving your finger to an icon, describe the location of that icon as detailed as possible.
Figure 30. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively. Figures 31-32 show continued outputs.
Figure 31. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively. Figure 32 shows continued outputs.
Figure 32. Case studies of MM-V ID ’s capabilities and application scenarios: GUI navigation. Blue and Red highlight the correct and
incorrect predictions, respectively.