SAM 2: Segment Anything in Images and Videos

Nikhila Ravi∗,†, Valentin Gabeur∗, Yuan-Ting Hu∗, Ronghang Hu∗, Chaitanya Ryali∗, Tengyu Ma∗,
Haitham Khedr∗, Roman Rädle∗, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan
Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár†, Christoph Feichtenhofer∗,†

Meta FAIR

∗ core contributor, † project lead

We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable
visual segmentation in images and videos. We build a data engine, which improves model and data
via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple
transformer architecture with streaming memory for real-time video processing. SAM 2 trained on
our data provides strong performance across a wide range of tasks. In video segmentation, we observe
better accuracy, using 3× fewer interactions than prior approaches. In image segmentation, our model
is more accurate and 6× faster than the Segment Anything Model (SAM). We believe that our data,
model, and insights will serve as a significant milestone for video segmentation and related perception
tasks. We are releasing a version of our model, the dataset and an interactive demo.
Demo: https://sam2.metademolab.com
Code: https://github.com/facebookresearch/segment-anything-2
Website: https://ai.meta.com/sam2

1 Introduction
Segment Anything (SA) introduced a foundation model for promptable segmentation in images (Kirillov et al.,
2023). However, an image is only a static snapshot of the real world in which visual segments can exhibit
complex motion, and with the rapid growth of multimedia content, a significant portion is now recorded
with a temporal dimension, particularly in video data. Many important applications in AR/VR, robotics,
autonomous vehicles, and video editing require temporal localization beyond image-level segmentation. We
believe a universal visual segmentation system should be applicable to both images and videos.
Segmentation in video aims to determine the spatio-temporal extent of entities, which presents unique
challenges beyond those in images. Entities can undergo significant changes in appearance due to motion,
deformation, occlusion, lighting changes, and other factors. Videos often have lower quality than images due
to camera motion, blur, and lower resolution. Further, efficient processing of a large number of frames is a
key challenge. While SA successfully addresses segmentation in images, existing video segmentation models
and datasets fall short in providing a comparable capability to “segment anything in videos”.
We introduce the Segment Anything Model 2 (SAM 2), a unified model for video and image segmentation
(we consider an image as a single-frame video). Our work includes a task, model, and dataset (see Fig. 1).
We focus on the Promptable Visual Segmentation (PVS) task that generalizes image segmentation to the
video domain. The task takes as input points, boxes, or masks on any frame of the video to define a segment of
interest for which the spatio-temporal mask (i.e., a ‘masklet’) is to be predicted. Once a masklet is predicted,
it can be iteratively refined by providing prompts in additional frames.
Our model (§4) produces segmentation masks of the object of interest, in single images and across video
frames. SAM 2 is equipped with a memory that stores information about the object and previous interactions,
which allows it to generate masklet predictions throughout the video, and also effectively correct these based
on the stored memory context of the object from previously observed frames. Our streaming architecture is a
natural generalization of SAM to the video domain, processing video frames one at a time, equipped with a
memory attention module to attend to the previous memories of the target object. When applied to images,
the memory is empty and the model behaves like SAM.

Figure 1 We introduce the Segment Anything Model 2 (SAM 2), towards solving the promptable visual segmentation
task (a) with our foundation model (b), trained on our large-scale SA-V dataset collected through our data engine (c).
SAM 2 is capable of interactively segmenting regions through prompts (clicks, boxes, or masks) on one or multiple
video frames by utilizing a streaming memory that stores previous prompts and predictions.

We employ a data engine (§5) to generate training data by using our model in the loop with annotators to
interactively annotate new and challenging data. Different from most existing video segmentation datasets,
our data engine is not restricted to objects of specific categories, but instead targeted to provide training
data for segmenting any object with a valid boundary, including parts and subparts. Compared to existing
model assisted approaches, our data engine with SAM 2 in the loop is 8.4× faster at comparable quality. Our
final Segment Anything Video (SA-V) dataset (§5.2) consists of 35.5M masks across 50.9K videos, 53× more
masks than any existing video segmentation dataset. SA-V is challenging with small objects and parts that
get occluded and re-appear throughout the video. Our SA-V dataset is geographically diverse, and a fairness
evaluation of SAM 2 indicates minimal performance discrepancy in video segmentation based on perceived
gender, and little variance among the three perceived age groups we evaluated.
Our experiments (§6) show that SAM 2 delivers a step-change in the video segmentation experience. SAM 2
can produce better segmentation accuracy while using 3× fewer interactions than prior approaches. Further,
SAM 2 outperforms prior work in established video object segmentation benchmarks, under multiple evaluation
settings, and delivers better performance compared to SAM on image segmentation benchmarks, while being
6× faster. SAM 2 is shown to be effective across a variety of video and image distributions as observed through
numerous zero-shot benchmarks including 17 for video segmentation and 37 for single-image segmentation.
We are releasing our work under permissive open licenses, including the SA-V dataset (CC BY 4.0), a version of
the model SAM 2 (Apache 2.0), along with an interactive online demo at https://sam2.metademolab.com.

2 Related work

Image segmentation. Segment Anything (Kirillov et al., 2023) introduces a promptable image segmentation
task where the goal is to output a valid segmentation mask given an input prompt such as a bounding box
or a point that refers to the object of interest. SAM trained on the SA-1B dataset allows for zero-shot
segmentation with flexible prompting which enabled its adoption to a wide range of downstream applications.
Recent work has extended SAM by improving its quality. For example, HQ-SAM (Ke et al., 2024) enhances
SAM by introducing a High-Quality output token and training the model on fine-grained masks. Another
line of work focuses on SAM’s efficiency to enable wider use in real-world and mobile applications, such as
EfficientSAM (Xiong et al., 2023), MobileSAM (Zhang et al., 2023a), and FastSAM (Zhao et al., 2023). The
success of SAM led to its adoption in a wide range of applications, such as medical imaging (Ma et al., 2024;
Deng et al., 2023; Mazurowski et al., 2023; Wu et al., 2023a), remote sensing (Chen et al., 2024; Ren et al.,
2024), motion segmentation (Xie et al., 2024), and camouflaged object detection (Tang et al., 2023).

Interactive Video Object Segmentation (iVOS). Interactive video object segmentation has emerged as a crucial
task to efficiently obtain object segmentations in videos (masklets) with user guidance, often in the form of
scribbles, clicks, or bounding boxes. A few early approaches (Wang et al., 2005; Bai & Sapiro, 2007; Fan
et al., 2015) deploy graph-based optimization to guide the segmentation annotation process. More recent
approaches (Heo et al., 2020; Cheng et al., 2021b; Delatolas et al., 2024) often adopt a modular design,
converting the user inputs into a mask representation on a single frame and then propagating it to other
frames. Our work shares a similar goal to these works to segment objects across videos with a good interactive
experience, and we build a strong model along with a large and diverse dataset in pursuit of this goal.
In particular, the DAVIS interactive benchmark (Caelles et al., 2018) allows interactively segmenting an
object via scribble inputs on multiple frames. Inspired by the DAVIS interactive benchmark, we also adopt an
interactive evaluation setting for the promptable video segmentation task in §6.1.
Click-based input is easier to collect (Homayounfar et al., 2021) for interactive video segmentation. Recent
works have used a combination of SAM on images with video trackers based on masks (Cheng et al., 2023b;
Yang et al., 2023; Cheng et al., 2023c) or points (Rajič et al., 2023). However, these approaches have
limitations: the tracker may not work for all objects, SAM may not perform well for image frames from videos,
and there is no mechanism to interactively refine a model’s mistakes, other than re-annotating using SAM
from scratch on the erroneous frame and restarting the tracking from there.

Semi-supervised Video Object Segmentation (VOS). Semi-supervised VOS usually begins with an object mask as
input in the first frame, which must be accurately tracked throughout the video (Pont-Tuset et al., 2017). It
is called “semi-supervised” since the input mask can be seen as a supervision signal of the object appearance
that is available only for the first frame. This task has drawn significant attention due to its relevance in
various applications, including video editing, robotics, and automatic background removal.
Early neural network-based approaches have often used online fine-tuning on the first video frame (Caelles
et al., 2016; Perazzi et al., 2016; Yoon et al., 2017; Maninis et al., 2017; Hu et al., 2018a; Bhat et al., 2020;
Robinson et al., 2020) or on all frames (Voigtlaender & Leibe, 2017) to adapt the model to the target object.
Faster inference has been achieved with offline-trained models, conditioned either only on the first frame (Hu
et al., 2018b; Chen et al., 2018), or also integrating the previous frame (Oh et al., 2018; Yang et al., 2018, 2020).
This multi-conditioning has been extended to all frames with RNNs (Xu et al., 2018a) and cross-attention (Oh
et al., 2019; Cheng et al., 2021a; Li et al., 2022a; Yang et al., 2021b, 2024; Cheng & Schwing, 2022; Yang
& Yang, 2022; Wang et al., 2022; Cheng et al., 2023a; Goyal et al., 2023). Recent approaches (Zhang et al.,
2023b; Wu et al., 2023b) extend a single vision transformer to jointly process the current frame along with all
previous frames and associated predictions, resulting in a simple architecture but at a prohibitive inference
cost. Semi-supervised VOS can be seen as a special case of our Promptable Visual Segmentation (PVS) task,
as it is equivalent to only providing a mask prompt in the first video frame. Nevertheless, annotating the
required high-quality object mask in the first frame is practically challenging and time-consuming.

Video segmentation datasets. Many datasets have been proposed to support the VOS task. Early VOS
datasets (Prest et al., 2012; Li et al., 2013; Ochs et al., 2014; Fan et al., 2015), such as DAVIS (Pont-Tuset
et al., 2017; Caelles et al., 2019), include high-quality annotations but their limited size does not allow training
deep-learning based approaches. Covering 94 object categories over 4 thousand videos, YouTube-VOS (Xu
et al., 2018b) is the first large-scale dataset for the VOS task. As algorithms became better and benchmark
performance started to saturate, researchers have looked at increasing the difficulty of the VOS task by
specifically focusing on occlusions (Qi et al., 2022; Ding et al., 2023), long videos (Hong et al., 2023, 2024),
extreme transformations (Tokmakov et al., 2022), object diversity (Wang et al., 2021b, 2023) or scene
diversity (Athar et al., 2022).
We find that current video segmentation datasets lack sufficient coverage to achieve the capability of “segmenting
anything in videos”. Their annotations typically cover entire objects (not parts) and datasets are often centered
around specific object classes, such as people, vehicles, and animals. In comparison to these datasets, our
released SA-V dataset not only focuses on whole objects but also extensively covers object parts and contains
over an order of magnitude more masks.

Figure 2 Interactive segmentation with SAM 2. Step 1 (selection): we prompt SAM 2 in frame 1 to obtain the segment
of the target object (the tongue). Green/red dots indicate positive/negative prompts respectively. SAM 2 automatically
propagates the segment to the following frames (blue arrows) to form a masklet. If SAM 2 loses the object (after frame
2), we can correct the masklet by providing an additional prompt in a new frame (red arrow). Step 2 (refinement): a
single click in frame 3 is sufficient to recover the object and propagate it to obtain the correct masklet. A decoupled
SAM + video tracker approach would require several clicks in frame 3 (as in frame 1) to correctly re-annotate the
object as the segmentation is restarted from scratch. With SAM 2’s memory, a single click can recover the tongue.

3 Task: promptable visual segmentation


The PVS task allows providing prompts to the model on any frame of a video. Prompts can be positive/negative
clicks, bounding boxes, or masks, either to define an object to segment or to refine a model predicted one. To
provide an interactive experience, upon receiving a prompt on a specific frame, the model should immediately
respond with a valid segmentation mask of the object on this frame. After receiving initial (one or multiple)
prompts (either on the same frame or different frames), the model should propagate these prompts to obtain
the masklet of the object across the entire video, which contains the segmentation mask of the target object
on every video frame. Additional prompts can be provided to the model on any frame to refine the segment
throughout the video (example in Fig. 2). For details on the task, see §A.
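To make the interaction protocol concrete, the sketch below walks through the PVS loop in Python: each prompt targets a specific frame, the model answers immediately on that frame, and the full masklet is re-propagated after every interaction. The `PromptableVideoSegmenter` interface and its method names are hypothetical placeholders used only to illustrate the task structure; they are not the released SAM 2 API.

```python
# Hypothetical interface illustrating the PVS interaction protocol (not the SAM 2 API).
from typing import Dict, List, Tuple

import numpy as np


class PromptableVideoSegmenter:
    """Hypothetical model wrapper exposing the two PVS operations."""

    def segment_frame(self, frame_idx: int, points: np.ndarray,
                      labels: np.ndarray) -> np.ndarray:
        """Return an instant (H, W) mask for the prompted frame."""
        raise NotImplementedError

    def propagate(self) -> Dict[int, np.ndarray]:
        """Return the masklet: one mask for every frame of the video."""
        raise NotImplementedError


def interactive_session(model: PromptableVideoSegmenter,
                        prompts: List[Tuple[int, np.ndarray, np.ndarray]]):
    # Each prompt is (frame index, click coordinates, positive/negative labels).
    masklet: Dict[int, np.ndarray] = {}
    for frame_idx, points, labels in prompts:
        # Instant response on the prompted frame ...
        _mask = model.segment_frame(frame_idx, points, labels)
        # ... followed by propagation of all prompts so far across the video.
        masklet = model.propagate()
    return masklet
```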
SAM 2, introduced in the next section (§4), is applied as a data collection tool to the PVS task for building our
SA-V dataset (§5). The model is evaluated (§6) in an online and offline setting by simulating interactive video
segmentation scenarios involving annotations across multiple frames, in the conventional semi-supervised VOS
setting where annotations are limited to the first frame, and for image segmentation on the SA benchmarks.

4 Model
Our model can be seen as a generalization of SAM to the video (and image) domain. SAM 2 (Fig. 3) supports
point, box, and mask prompts on individual frames to define the spatial extent of the object to be segmented
across the video. For image input, the model behaves similarly to SAM. A promptable and light-weight mask
decoder accepts a frame embedding and prompts (if any) on the current frame and outputs a segmentation
mask for the frame. Prompts can be iteratively added on a frame in order to refine the masks.
Unlike SAM, the frame embedding used by the SAM 2 decoder is not directly from an image encoder and is
instead conditioned on memories of past predictions and prompted frames. It is possible for prompted frames
to also come “from the future” relative to the current frame. Memories of frames are created by the memory
encoder based on the current prediction and placed in a memory bank for use in subsequent frames. The
memory attention operation takes the per-frame embedding from the image encoder and conditions it on the
memory bank to produce an embedding that is then passed to the mask decoder.
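The following sketch summarizes how these components compose during streaming inference, following the description above; every module name is a placeholder for the corresponding component rather than the actual implementation.

```python
# Schematic streaming loop: placeholder callables, not the released SAM 2 code.
import torch


@torch.no_grad()
def stream_video(frames, image_encoder, memory_attention, mask_decoder,
                 memory_encoder, memory_bank, prompts_per_frame):
    masks = []
    for t, frame in enumerate(frames):
        # The image encoder runs once per frame and is unconditioned.
        feats = image_encoder(frame)
        # Condition the frame features on memories of past (and prompted) frames.
        cond_feats = memory_attention(feats, memory_bank.get())
        # The mask decoder takes the conditioned features plus any prompts
        # on the current frame (possibly none).
        mask, obj_score = mask_decoder(cond_feats, prompts_per_frame.get(t))
        # Encode the prediction into a memory and push it into the bank.
        memory_bank.add(memory_encoder(mask, feats),
                        prompted=t in prompts_per_frame)
        masks.append(mask)
    return masks
```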
We describe individual components and training below and provide more details in Appendix C.


Figure 3 The SAM 2 architecture. For a given frame, the segmentation prediction is conditioned on the current prompt
and/or on previously observed memories. Videos are processed in a streaming fashion with frames being consumed one
at a time by the image encoder, and cross-attended to memories of the target object from previous frames. The mask
decoder, which optionally also takes input prompts, predicts the segmentation mask for that frame. Finally, a memory
encoder transforms the prediction and image encoder embeddings (not shown in the figure) for use in future frames.

Image encoder. For real-time processing of arbitrarily long videos, we take a streaming approach, consuming
video frames as they become available. The image encoder is only run once for the entire interaction and its
role is to provide unconditioned tokens (feature embeddings) representing each frame. We use an MAE (He
et al., 2022) pre-trained Hiera (Ryali et al., 2023; Bolya et al., 2023) image encoder, which is hierarchical,
allowing us to use multiscale features during decoding.

Memory attention. The role of memory attention is to condition the current frame features on the past
frames features and predictions as well as on any new prompts. We stack L transformer blocks, the first one
taking the image encoding from the current frame as input. Each block performs self-attention, followed by
cross-attention to memories of (prompted/unprompted) frames and object pointers (see below), stored in a
memory bank (see below), followed by an MLP. We use vanilla attention operations for self- and cross-attention,
allowing us to benefit from recent developments in efficient attention kernels (Dao, 2023).
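As a concrete illustration, a single memory-attention block of this form could look like the PyTorch sketch below: self-attention over the current frame's tokens, cross-attention to flattened memory tokens, then an MLP. Layer sizes, pre-norm placement, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the exact SAM 2 configuration.

```python
# Minimal sketch of one memory-attention block (assumed sizes, not the exact config).
import torch
import torch.nn as nn


class MemoryAttentionBlock(nn.Module):
    def __init__(self, dim=256, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, frame_tokens, memory_tokens):
        # Self-attention over the current frame's tokens.
        x = self.norm1(frame_tokens)
        frame_tokens = frame_tokens + self.self_attn(x, x, x)[0]
        # Cross-attention: frame tokens query the memory bank tokens.
        x = self.norm2(frame_tokens)
        frame_tokens = frame_tokens + self.cross_attn(x, memory_tokens, memory_tokens)[0]
        # Per-token MLP.
        frame_tokens = frame_tokens + self.mlp(self.norm3(frame_tokens))
        return frame_tokens


# Example: 64x64 frame tokens attending to memories of 6 past frames plus object pointers.
frame = torch.randn(1, 64 * 64, 256)
memories = torch.randn(1, 6 * 64 * 64 + 16, 256)
out = MemoryAttentionBlock()(frame, memories)
```

In SAM 2 several such blocks are stacked (L = 4 by default, see §C.1).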

Prompt encoder and mask decoder. Our prompt encoder is identical to SAM’s and can be prompted by clicks
(positive or negative), bounding boxes, or masks to define the extent of the object in a given frame. Sparse
prompts are represented by positional encodings summed with learned embeddings for each prompt type,
while masks are embedded using convolutions and summed with the frame embedding.
Our decoder design largely follows SAM. We stack “two-way” transformer blocks that update prompt and
frame embeddings. As in SAM, for ambiguous prompts (i.e., a single click) where there may be multiple
compatible target masks, we predict multiple masks. This design is important to ensure that the model
outputs valid masks. In video, where ambiguity can extend across video frames, the model predicts multiple
masks on each frame. If no follow-up prompts resolve the ambiguity, the model only propagates the mask
with the highest predicted IoU for the current frame.
Unlike SAM where there is always a valid object to segment given a positive prompt, in the PVS task it is
possible for no valid object to exist on some frames (e.g. due to occlusion). To account for this new output
mode, we add an additional head that predicts whether the object of interest is present on the current frame.
Another difference from SAM is that we use skip connections from the hierarchical image encoder (bypassing
the memory attention) to incorporate high-resolution information for mask decoding (see §C).
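A minimal sketch of the resulting output-selection logic is shown below: among the multiple candidate masks, the one with the highest predicted IoU is propagated, and the occlusion head's object-presence score decides whether any mask is emitted on the frame. The tensor shapes and the zero-logit presence threshold are assumptions for illustration.

```python
# Sketch of multi-mask selection plus the object-presence (occlusion) head.
import torch


def select_output(masks: torch.Tensor,       # (B, K, H, W) candidate mask logits
                  pred_ious: torch.Tensor,   # (B, K) predicted IoU per candidate
                  obj_logits: torch.Tensor): # (B,) object-presence logit
    best = pred_ious.argmax(dim=1)                         # highest predicted IoU
    chosen = masks[torch.arange(masks.size(0)), best]      # (B, H, W)
    present = obj_logits > 0.0                             # object visible on this frame?
    # Frames where the object is predicted absent get an empty mask.
    chosen = torch.where(present[:, None, None], chosen,
                         torch.full_like(chosen, float("-inf")))
    return chosen, present


masks = torch.randn(2, 3, 256, 256)
ious = torch.rand(2, 3)
obj = torch.tensor([2.0, -1.0])
mask_logits, visible = select_output(masks, ious, obj)
```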

Memory encoder. The memory encoder generates a memory by downsampling the output mask using
a convolutional module and summing it element-wise with the unconditioned frame embedding from the
image-encoder (not shown in Fig. 3), followed by light-weight convolutional layers to fuse the information.
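A minimal PyTorch sketch of such a memory encoder, assuming a 1024×1024 input frame and a 64×64 frame embedding, is given below; the channel counts, strides, and activation choices are illustrative assumptions.

```python
# Sketch of a memory encoder: downsample the mask, add the frame embedding, fuse.
import torch
import torch.nn as nn


class MemoryEncoder(nn.Module):
    def __init__(self, embed_dim=256, mem_dim=64):
        super().__init__()
        # Downsample the (1-channel) mask to the frame-embedding resolution.
        self.mask_downsampler = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=4, padding=1), nn.GELU(),
            nn.Conv2d(32, embed_dim, 3, stride=4, padding=1))
        # Light convolutional layers fusing mask and image information.
        self.fuser = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(embed_dim, mem_dim, 1))

    def forward(self, mask_logits, frame_embedding):
        # mask_logits: (B, 1, 1024, 1024); frame_embedding: (B, 256, 64, 64)
        m = self.mask_downsampler(mask_logits.sigmoid())
        return self.fuser(m + frame_embedding)


mem = MemoryEncoder()(torch.randn(1, 1, 1024, 1024), torch.randn(1, 256, 64, 64))
```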

Memory bank. The memory bank retains information about past predictions for the target object in the video
by maintaining a FIFO queue of memories of up to N recent frames and stores information from prompts
in a FIFO queue of up to M prompted frames. For instance, in the VOS task where the initial mask is the
only prompt, the memory bank consistently retains the first frame’s memory along with memories of up to N
recent (unprompted) frames. Both sets of memories are stored as spatial feature maps.
In addition to the spatial memory, we store a list of object pointers as lightweight vectors for high-level
semantic information of the object to segment, based on mask decoder output tokens of each frame (Meinhardt
et al., 2022). Our memory attention cross-attends to both spatial memory features and these object pointers.

We embed temporal position information into the memories of N recent frames, allowing the model to represent
short-term object motion, but not into those of prompted frames, because the training signal from prompted
frames is sparser and it is more difficult to generalize to the inference setting where prompted frames may
come from a very different temporal range than seen during training.
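The bookkeeping implied by this design can be sketched as two FIFO queues plus a pointer list, as below; the default sizes (N = 6 recent frames, one prompted frame as in the VOS setting) are taken from the text and ablations, and everything else is an illustrative simplification.

```python
# Sketch of the memory-bank bookkeeping only (sizes are assumptions).
from collections import deque


class MemoryBank:
    def __init__(self, num_recent=6, num_prompted=1):
        self.recent = deque(maxlen=num_recent)      # FIFO of unprompted-frame memories
        self.prompted = deque(maxlen=num_prompted)  # FIFO of prompted-frame memories
        self.object_pointers = []                   # lightweight per-frame pointer vectors

    def add(self, memory, pointer, prompted=False):
        (self.prompted if prompted else self.recent).append(memory)
        self.object_pointers.append(pointer)

    def get(self):
        # Recent-frame memories carry temporal position information,
        # prompted-frame memories do not (handled elsewhere).
        return list(self.prompted) + list(self.recent), self.object_pointers
```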

Training. The model is trained jointly on image and video data. Similar to previous work (Kirillov et al., 2023;
Sofiiuk et al., 2022), we simulate interactive prompting of the model. We sample sequences of 8 frames and
randomly select up to 2 frames to prompt and probabilistically receive corrective clicks which are sampled
using the ground-truth masklet and model predictions during training. The training task is to sequentially
(and “interactively”) predict the ground-truth masklet. Initial prompts to the model can be the ground-truth
mask with probability 0.5, a positive click sampled from the ground-truth mask with probability 0.25, or a
bounding box input with probability 0.25. See §C for more details.
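A small sketch of the initial-prompt sampling described above (ground-truth mask with probability 0.5, a positive click from the mask with probability 0.25, a bounding box with probability 0.25) follows; the dictionary-based prompt representation is an illustrative assumption, and the full interactive training loop with corrective clicks is more involved.

```python
# Sketch of initial-prompt sampling during training.
import numpy as np


def sample_initial_prompt(gt_mask: np.ndarray, rng: np.random.Generator):
    u = rng.random()
    if u < 0.5:
        return {"type": "mask", "mask": gt_mask}
    ys, xs = np.nonzero(gt_mask)
    if u < 0.75:
        i = rng.integers(len(xs))  # positive click sampled from the ground-truth mask
        return {"type": "point", "coords": (xs[i], ys[i]), "label": 1}
    # Bounding box of the ground-truth mask.
    return {"type": "box", "xyxy": (xs.min(), ys.min(), xs.max(), ys.max())}


rng = np.random.default_rng(0)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 10:30] = True
prompt = sample_initial_prompt(mask, rng)
```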

5 Data
To develop the capability to “segment anything” in video, we built a data engine to collect a large and diverse
video segmentation dataset. We employ an interactive model in the loop setup with human annotators.
Similar to Kirillov et al. (2023), we do not impose semantic constraints on the annotated masklets, and focus
on both whole objects (e.g., a person) and parts (e.g., a person’s hat). Our data engine went through three
phases, each categorized based on the level of model assistance provided to annotators. Next, we describe
each data engine phase and our SA-V dataset.

5.1 Data engine

Phase 1: SAM per frame. The initial phase used the image-based interactive SAM (Kirillov et al., 2023) to assist
human annotation. Annotators are tasked with annotating the mask of a target object in every frame of the
video at 6 frames per second (FPS) using SAM, and pixel-precise manual editing tools such as a “brush” and
“eraser”. There is no tracking model involved to assist with the temporal propagation of masks to other frames.
As this is a per-frame method, and all frames require mask annotation from scratch, the process is slow, with
an average annotation time of 37.8 seconds per frame in our experiment. However, this yields high-quality
spatial annotations per frame. In this phase, we collected 16K masklets across 1.4K videos. We further use
this approach to annotate our SA-V val and test sets to mitigate potential biases of SAM 2 during evaluation.

Phase 2: SAM + SAM 2 Mask. The second phase added SAM 2 into the loop, where SAM 2 only accepted masks
as prompts. We refer to this version as SAM 2 Mask. Annotators used SAM and other tools as in Phase 1 to
generate spatial masks in the first frame, and then use SAM 2 Mask to temporally propagate the annotated
mask to other frames to get the full spatio-temporal masklets. At any subsequent video frame, annotators
can spatially modify the predictions made by SAM 2 Mask by annotating a mask from scratch with SAM,
a “brush” and/or “eraser”, and re-propagate with SAM 2 Mask, repeating this process until the masklet is
correct. SAM 2 Mask was initially trained on the Phase 1 data and publicly available datasets. During Phase
2, we re-trained and updated SAM 2 Mask in the annotation loop twice using the collected data. In Phase 2,
we collected 63.5K masklets. The annotation time went down to 7.4 s/frame, a ∼5.1x speed up over Phase 1.
Despite an improvement in annotation time, this decoupled approach requires annotating masks in intermediate
frames from scratch, without previous memory. We then advanced to develop the fully-featured SAM 2,
capable of performing both interactive image segmentation and mask propagation in a unified model.

Phase 3: SAM 2. In the final phase, we utilize the fully-featured SAM 2, which accepts various types of
prompts, including points and masks. SAM 2 benefits from memories of objects across the temporal dimension
to generate mask predictions. This means annotators only need to provide occasional refinement clicks to
SAM 2 to edit the predicted masklets in intermediate frames, as opposed to annotating from scratch with a
spatial SAM which has no such memory context. During Phase 3, we re-trained and updated SAM 2 using
the collected annotations five times. With SAM 2 in the loop, the annotation time per frame went down to
4.5 seconds, a ∼8.4x speed up over Phase 1. In Phase 3, we collected 197.0K masklets.

           Model in the Loop    Time per   Edited     Clicks per       Phase 1 Mask Alignment Score (IoU>0.75)
                                Frame      Frames     Clicked Frame    All       Small     Medium    Large
Phase 1    SAM only             37.8 s     100.00 %   4.80             -         -         -         -
Phase 2    SAM + SAM 2 Mask     7.4 s      23.25 %    3.61             86.4 %    71.3 %    80.4 %    97.9 %
Phase 3    SAM 2                4.5 s      19.04 %    2.68             89.1 %    72.8 %    81.8 %    100.0 %
Table 1 Evolution of data engine phases showing the average annotation time per frame, the average percent of edited
frames per masklet, the number of manual clicks per clicked frame, and Mask Alignment to Phase 1 by mask size.

Quality verification. To uphold a high standard for annotation, we introduce a verification step. A separate set
of annotators are tasked with verifying the quality of each annotated masklet as “satisfactory” (correctly and
consistently tracking the target object across all frames) or “unsatisfactory” (target object is well defined with
a clear boundary but the masklet is not correct or consistent). Unsatisfactory masklets were sent back to the
annotation pipeline for refinement. Any masklets tracking not well defined objects were rejected entirely.

Auto masklet generation. Ensuring diversity in annotation is important to enable the anything capability of
our model. As human annotators might typically focus more on salient objects, we augment the annotations
with automatically generated masklets (referred to as “Auto”). This serves a dual purpose of increasing the
coverage of annotations and helping identify model failure cases. To generate auto masklets, we prompt
SAM 2 with a regular grid of points in the first frame and generate candidate masklets. These are then sent
to the masklet verification step for filtering. Automatic masklets tagged as “satisfactory” are added to the
SA-V dataset. Masklets identified as “unsatisfactory” (i.e., model failure cases) are sampled and presented to
annotators to refine with SAM 2 in the loop (Phase 3 of the data engine). These automatic masklets cover
large salient central objects but also objects of varying sizes and positions in the background.
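A sketch of this procedure is given below: a regular grid of points on the first frame, one candidate masklet per point, with the candidates handed to the verification step. The `predict_masklet` callable stands in for prompting SAM 2 with a single positive click and propagating through the video, and the grid density is an assumption.

```python
# Sketch of automatic masklet generation from a regular grid of point prompts.
import numpy as np


def grid_prompts(height, width, points_per_side=32):
    ys = (np.arange(points_per_side) + 0.5) / points_per_side * height
    xs = (np.arange(points_per_side) + 0.5) / points_per_side * width
    return [(x, y) for y in ys for x in xs]


def auto_masklets(video, predict_masklet, points_per_side=32):
    h, w = video[0].shape[:2]
    candidates = []
    for point in grid_prompts(h, w, points_per_side):
        # One positive click on the first frame, then propagate across frames.
        candidates.append(predict_masklet(video, frame_idx=0, point=point))
    return candidates  # sent to the verification step for filtering
```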

Analysis. Table 1 shows a comparison of the annotation protocol in each data engine phase through a controlled
experiment (details in §D.2.2). We compare the average annotation time per frame, the average percentage
of manually edited frames per masklet, and the average number of clicks per clicked frame. For quality
evaluation, we define the Phase 1 Mask Alignment Score as the percentage of masks whose IoU compared to
the corresponding masks in Phase 1 exceeds 0.75. Phase 1 data is chosen as a reference as it has per-frame
high quality manual annotations. Phase 3 with SAM 2 in the loop leads to increased efficiency and comparable
quality: it is 8.4× faster than Phase 1, has the lowest edited frame percentage and clicks per frame and results
in better alignment.
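For clarity, the alignment score defined above can be computed as in the short sketch below (the fraction of masks whose IoU against the corresponding Phase 1 mask exceeds 0.75).

```python
# Sketch of the Phase 1 Mask Alignment Score computation.
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 1.0


def alignment_score(masks, phase1_masks, thresh=0.75) -> float:
    ious = [mask_iou(m, ref) for m, ref in zip(masks, phase1_masks)]
    return float(np.mean([iou > thresh for iou in ious]))
```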
In Table 2, we show the performance comparison of SAM 2 trained on the available data at the end of each phase,
keeping the number of iterations fixed, therefore measuring solely the impact of the additional data. We evaluate
on our own SA-V val set and also on 9 zero-shot benchmarks (see §E.1 for details) using the standard J&F accuracy
metric (the higher the better) when prompting with 3 clicks on the first frame. We note a consistent improvement
after iteratively including the data from each phase, not only on the in-domain SA-V val set, but also on the 9
zero-shot benchmarks.

Training data    SA-V val   9 zero-shot
VOS + SA-1B      50.0       62.5
+ Phase 1        53.0       66.9
+ Phase 2        58.8       70.9
+ Phase 3        62.5       71.2
+ Auto           63.2       71.5

Table 2 Segmentation accuracy (J&F metric) improvement from adding data from each data engine phase. “VOS” is a
set of video object segmentation datasets. Details are in §E.

5.2 SA-V dataset


The SA-V dataset collected with our data engine comprises 50.9K videos with 642.6K masklets. In Table 3 we
compare the SA-V composition to common VOS datasets across the number of videos, masklets, and masks.
Notably, the number of annotated masks is 53× (15× without auto) larger than any existing VOS dataset,
providing a substantial resource for future work. We are releasing SA-V under a permissive license.

Videos. We collected a new set of 50.9K videos captured by crowdworkers. Videos comprise 54% indoor and
46% outdoor scenes with an average duration of 14 seconds. Videos feature “in-the-wild ” diverse environments,
and cover various everyday scenarios. Our dataset has more videos than existing VOS datasets, and as shown
in Fig. 5, videos span 47 countries and were captured by diverse participants (self-reported demographics).

Figure 4 Example videos from the SA-V dataset with masklets overlaid (manual and automatic). Each masklet has a
unique color, and each row represents frames from one video, with 1 second between them.

#Videos Duration #Masklets #Masks #Frames Disapp. Rate
DAVIS 2017 (Pont-Tuset et al., 2017) 0.2K 0.1 hr 0.4K 27.1K 10.7K 16.1 %
YouTube-VOS (Xu et al., 2018b) 4.5K 5.6 hr 8.6K 197.3K 123.3K 13.0 %
UVO-dense (Wang et al., 2021b) 1.0K 0.9 hr 10.2K 667.1K 68.3K 9.2 %
VOST (Tokmakov et al., 2022) 0.7K 4.2 hr 1.5K 175.0K 75.5K 41.7 %
BURST (Athar et al., 2022) 2.9K 28.9 hr 16.1K 600.2K 195.7K 37.7 %
MOSE (Ding et al., 2023) 2.1K 7.4 hr 5.2K 431.7K 638.8K 41.5 %
Internal 62.9K 281.8 hr 69.6K 5.4M 6.0M 36.4 %
SA-V Manual 50.9K 196.0 hr 190.9K 10.0M 4.2M 42.5 %
SA-V Manual+Auto 50.9K 196.0 hr 642.6K 35.5M 4.2M 27.7 %
Table 3 Comparison of our datasets with open source VOS datasets in terms of number of videos, duration, number
of masklets, masks, frames, and disappearance rate. SA-V Manual contains only manually annotated labels. SA-V
Manual+Auto combines manually annotated labels with automatically generated masklets.
Figure 5 Dataset distribution: (a) masklets size distribution (normalized by video resolution), (b) geographic diversity
of the videos, and (c) self-reported demographics of the crowdworkers who recorded the videos.

Masklets. The annotations comprise 190.9K manual masklet annotations and 451.7K automatic masklets
collected using our data engine. Example videos with masklets overlaid (manual and automatic) are shown
in Fig. 4. SA-V has 53× (15× without auto annotations) more masks than the largest VOS dataset. The
disappearance rate (Ding et al., 2023) in SA-V Manual (the percentage of annotated masklets that disappear
in at least one frame and then re-appear) is 42.5%, competitive among existing datasets. Fig. 5a shows a
comparison of mask size distribution (normalized by video resolution) with DAVIS, MOSE, and YouTubeVOS.
More than 88% of SA-V masks have a normalized mask area less than 0.1.

SA-V training, validation and test splits. We split SA-V based on the video authors (and their geographic
locations) to ensure minimal overlap of similar objects. To create SA-V val and SA-V test sets, we focus
on challenging scenarios in selecting videos, and ask annotators to identify challenging targets that are
fast-moving, have complex occlusions by other objects as well as disappearance/re-appearance patterns. These
targets were annotated at 6 FPS using the data engine Phase 1 setup in §5.1. There are 293 masklets and 155
videos in the SA-V val split, and 278 masklets and 150 videos in the SA-V test split.

Internal dataset. We also used internally available licensed video data to further augment our training set.
Our internal dataset is comprised of 62.9K videos and 69.6K masklets annotated in Phase 2 and Phase 3 (see
§5.1) for training, and 96 videos and 189 masklets annotated using Phase 1 for testing (Internal-test).
See Appendix D for more details on the data engine and SA-V dataset.

6 Zero-shot experiments
Here, we compare SAM 2 with previous work on zero-shot video tasks (§6.1) and image tasks (§6.2). We
report the standard J &F metric (Pont-Tuset et al., 2017) for video and mIoU metric for image tasks. Unless
otherwise mentioned, the results reported in this section follow our default setup using Hiera-B+ image
encoder with a resolution of 1024 and trained on the full combination of datasets, i.e., SAM 2 (Hiera-B+) in
Table 7 (see also §C.2 for details).

6.1 Video tasks
6.1.1 Promptable video segmentation

Figure 6 Zero-shot accuracy over 9 datasets in interactive offline and online evaluation settings.

We first evaluate promptable video segmentation, which involves simulating an interactive setting that
resembles the user experience. We have two settings, offline evaluation, where multiple passes are made
through a video to select frames to interact with based on the largest model error, and online evaluation,
where the frames are annotated in a single forward pass through the video. These evaluations are conducted
on 9 densely annotated zero-shot video datasets using Nclick = 3 clicks per frame (see §E.1 for details).
We create two strong baselines, SAM+XMem++ and SAM+Cutie, based on two state-of-the-art models
for video object segmentation, XMem++ (Bekuzarov et al., 2023) and Cutie (Cheng et al., 2023a). We use
XMem++ to generate a video segmentation based on mask inputs on one or multiple frames. SAM is used
to provide an initial mask or to refine an output (by feeding the current segmentation as a mask prompt to
SAM). For the SAM+Cutie baseline, we modify Cutie to allow taking mask inputs on multiple frames.
In Fig. 6, we report the average J &F accuracy over Nframe = 1, . . . , 8 interacted frames. SAM 2 outperforms
SAM+XMem++ and SAM+Cutie for both offline and online evaluation settings. Across all 9 datasets (see
per-dataset results in §E.1), SAM 2 dominates both methods, confirming that SAM 2 is able to generate
high-quality video segmentation from a few clicks while also allowing continued refinement of the results with
further prompts. Overall, SAM 2 can generate better segmentation accuracy, with >3× fewer interactions.

6.1.2 Semi-supervised video object segmentation


Method 1-click 3-click 5-click bounding box ground-truth mask‡
SAM+XMem++ 56.9 68.4 70.6 67.6 72.7
SAM+Cutie 56.7 70.1 72.2 69.4 74.1
SAM 2 64.3 73.2 75.4 72.9 77.6

Table 4 Zero-shot accuracy across 17 video datasets under semi-supervised VOS evaluation using different prompts.
The table shows the averaged J &F for each type of prompt (1, 3 or 5 clicks, bounding boxes, or ground-truth masks)
in the first video frame (‡ : in this case we directly use masks as inputs into XMem++ or Cutie without using SAM).

We next evaluate the semi-supervised video object segmentation (VOS) setting (Pont-Tuset et al., 2017) with
click, box, or mask prompts only on the first frame of the video. When using click prompts, we interactively
sample either 1, 3 or 5 clicks on the first video frame, and then track the object based on these clicks.
Similar to the interactive setting in §6.1.1, we compare to XMem++ and Cutie, using SAM for click and
box prompts, and in their default setting when using mask prompts. We report the standard J &F accuracy
(Pont-Tuset et al., 2017), except for on VOST (Tokmakov et al., 2022), where we report the J metric following
its protocol. The results are in Table 4. SAM 2 outperforms both baselines on the 17 datasets, using various
input prompts. The results underline that SAM 2 also excels at the conventional, non-interactive VOS task
with mask input, for which these other works are specifically designed. More details are in §E.1.3.

6.1.3 Fairness evaluation

We evaluate SAM 2 for fairness across demographic groups. We collect annotations for the people category in the
Ego-Exo4D (Grauman et al., 2023) dataset, which contains self-reported demographic information supplied by the
subject of the video. We employ the same annotation setup as for the SA-V val and test sets and apply this to
20-second clips from the third-person (exo) videos. We evaluate SAM 2 on this data using 1-, 3-clicks, and the
ground-truth mask on the first frame.

          1-click   3-click   mask
gender
male      81.9      95.1      95.9
female    75.1      94.1      95.2
age
18-26     77.2      95.0      95.7
26-50     76.7      94.7      95.8
50+       81.4      95.1      96.2

Table 5 Fairness evaluation of SAM 2 (under J&F metric) on protected demographic groups.

Table 5 shows the comparison in J &F accuracy of SAM 2 for segmenting people across gender and age. At
3 clicks and with ground-truth mask prompts there is minimal discrepancy. We manually inspect 1 click
predictions, and find the model frequently predicts the mask for a part instead of the person. When limiting
the comparison to clips where the person is correctly segmented, the gap in 1 click shrinks substantially (J &F
male 94.3, female 92.7), suggesting the discrepancy can be partially attributed to ambiguity in the prompt.
In Appendix G, we provide model, data and annotation cards for SA-V.

6.2 Image tasks


We evaluate SAM 2 on the Segment Anything task across 37 zero-shot datasets, including 23 datasets
previously used by SAM for evaluation. 1-click and 5-click mIoUs are reported in Table 6 and we show the
average mIoU by dataset domain and model speed in Frames Per Second (FPS) on a single A100 GPU.
The first column (SA-23 All) shows accuracy on the 23 datasets from SAM. SAM 2 achieves higher accuracy
(58.9 mIoU with 1 click) than SAM (58.1 mIoU with 1 click), without using any extra data and while being
6× faster. This can be mainly attributed to the smaller but more effective Hiera image encoder in SAM 2.

The bottom row shows how training on our SA-1B and video data mix can further improve accuracy to 61.4%
on average on the 23 datasets. We also see exceptional gains on the video benchmarks from SA-23 (video
datasets are evaluated as images, identical to Kirillov et al. (2023)), and the 14 new video datasets we added.

                       1 (5) click mIoU
Model    Data      SA-23 All      SA-23 Image    SA-23 Video    14 new Video    FPS
SAM      SA-1B     58.1 (81.3)    60.8 (82.1)    54.5 (80.3)    59.1 (83.4)     21.7
SAM 2    SA-1B     58.9 (81.7)    60.8 (82.1)    56.4 (81.2)    56.6 (83.7)     130.1
SAM 2    our mix   61.4 (83.7)    63.1 (83.9)    59.1 (83.3)    69.6 (86.0)     130.1

Table 6 Zero-shot accuracy on the Segment Anything (SA) task across 37 datasets. The table shows the average 1-
and 5-click mIoU of SAM 2 compared to SAM by domains (image/video). We report the average metrics on the 23
datasets used by SAM (SA-23) and the average across 14 additional zero-shot video datasets (as detailed in §E.3).

Overall, the findings underscore SAM 2’s dual capability in interactive video and image segmentation, a
strength derived from our diverse training data that encompasses videos and static images across visual
domains. More detailed results including a breakdown by dataset are in §E.3.

7 Comparison to state-of-the-art in semi-supervised VOS


Our primary focus is on the general, interactive PVS task, but we also address the specific semi-supervised
VOS setting (where the prompt is a ground-truth mask on the first frame), as it is a historically common
protocol. We present a comparison with existing state-of-the-art in Table 7, reporting accuracy using standard
protocols. We evaluate two versions of SAM 2 with varying image encoder sizes (Hiera-B+/-L) with different
speed-vs-accuracy tradeoffs. SAM 2 shows significant improvement over the best existing methods in both
accuracy and inference speed (FPS shown in the last column). For best overall results, we observe that using
a larger image encoder brings significant accuracy gains across the board.

Method                                   MOSE val   DAVIS 2017 val   LVOS val   SA-V val   SA-V test   YTVOS 2019 val   FPS
                                         J&F        J&F              J&F        J&F        J&F         G
STCN (Cheng et al., 2021a) 52.5 85.4 - 61.0 62.5 82.7 13.2
SwinB-AOT (Yang et al., 2021b) 59.4 85.4 - 51.1 50.3 84.5 -
SwinB-DeAOT (Yang & Yang, 2022) 59.9 86.2 - 61.4 61.8 86.1 -
RDE (Li et al., 2022a) 46.8 84.2 - 51.8 53.9 81.9 24.4
XMem (Cheng & Schwing, 2022) 59.6 86.0 - 60.1 62.3 85.6 22.6
SimVOS-B (Wu et al., 2023b) - 88.0 - 44.2 44.1 84.2 3.3
JointFormer (Zhang et al., 2023b) - 90.1 - - - 87.4 3.0
ISVOS (Wang et al., 2022) - 88.2 - - - 86.3 5.8
DEVA (Cheng et al., 2023b) 66.0 87.0 55.9 55.4 56.2 85.4 25.3
Cutie-base (Cheng et al., 2023a) 69.9 87.9 66.0 60.7 62.7 87.0 36.4
Cutie-base+ (Cheng et al., 2023a) 71.7 88.1 - 61.3 62.8 87.5 17.9
SAM 2 (Hiera-B+) 75.8 90.9 74.9 73.6 74.1 88.4 43.8
SAM 2 (Hiera-L) 77.2 91.6 76.1 75.6 77.6 89.1 30.2
Table 7 VOS comparison to prior work. SAM 2 performs well in accuracy (J &F, G) and speed (FPS) for video
segmentation based on first-frame ground-truth mask prompts. SAM 2 performs significantly better on SA-V val/test.
All FPS estimates are on A100 GPUs. FPS estimates for other than our models are taken from Cheng et al. (2023a).

We also evaluate existing work on the SA-V val and test sets which measure performance for open-world
segments of “any” object class. When comparing on this benchmark, we see that most previous methods peak
at around the same accuracy. The best performance on SA-V val and SA-V test for prior work is significantly
lower, demonstrating the gap to a “segment anything in videos” capability. Finally, we see that SAM 2 also
brings notable gains in long-term video object segmentation as observed in the LVOS benchmark result.

8 Data and model ablations


This section presents ablations that informed the design decisions for SAM 2. We evaluate on our MOSE
development set (“MOSE dev”) which contains 200 randomly-sampled videos from the MOSE training split
and excluded from the training data in our ablations, SA-V val, and the average over 9 zero-shot video
datasets. As the metric for comparison, we report J &F under 3-click input on the first frame as a balance
between the 1-click regime and the VOS-style mask prompts. Additionally, we report the average 1-click mIoU
on the 23-dataset benchmark used by SAM for the SA task on images. Unless otherwise specified, we run our
ablations at 512 resolution and with SA-V manual and a 10% subset of SA-1B. Additional details are in §C.2.

8.1 Data ablations

Data mix ablation. In Table 8, we compare the accuracy of SAM 2 when trained on different data mixtures.
We pre-train on SA-1B and then train a separate model for each setting. We fix the number of iterations
(200k) and batch size (128) with only the training data changing between experiments. We report accuracy
on our SA-V val set, MOSE, 9 zero-shot video benchmarks, and the SA-23 tasks (§6.2). Row 1 shows that a
model purely trained on VOS datasets (Davis, MOSE, YouTubeVOS) performs well on the in-domain MOSE
dev, but poorly on all the others including the 9 zero-shot VOS datasets (59.7 J &F).
We observe tremendous benefit from adding our data engine data into the training mix, including +12.1%
average performance improvement on 9 zero-shot datasets (row 11 vs 1). This can be attributed to the
limited coverage and size of VOS datasets. Adding SA-1B images improves the performance on the image
segmentation task (rows 3 vs 4, 5 vs 6, 9 vs 10, 11 vs 12) without degrading the VOS capability. Training only
on SA-V and SA-1B (row 4) is enough to obtain strong performance on all benchmarks except for MOSE.
Overall, we obtain the best results when mixing all datasets: VOS, SA-1B, and our data engine data (row 12).

Data quantity ablation. We next study the effect of scaling training data. SAM 2 is pre-trained on SA-1B
before training on varying sizes of SA-V. We report average J &F score (when prompted with 3 clicks in the

     Training data                        J&F                                                      mIoU
     VOS   Internal   SA-V   SA-1B        SA-V val   Internal-test   MOSE dev   9 zero-shot        SA-23
1    ✓                                    48.1       60.2            76.9       59.7               45.4
2          ✓                              57.0       72.2            70.6       70.0               54.4
3                     ✓                   63.0       72.6            72.8       69.7               53.0
4                     ✓      ✓            62.9       73.2            73.6       69.7               58.6
5          ✓          ✓                   63.0       73.2            73.3       70.9               55.8
6          ✓          ✓      ✓            63.6       75.0            74.4       71.6               58.6
7    ✓                       ✓            50.0       63.2            77.6       62.5               54.8
8    ✓     ✓                              54.9       71.5            77.9       70.6               55.1
9    ✓                ✓                   61.6       72.8            78.3       69.9               51.0
10   ✓                ✓      ✓            62.2       74.1            78.5       70.3               57.3
11   ✓     ✓          ✓                   61.8       74.4            78.5       71.8               55.7
12   ✓     ✓          ✓      ✓            63.1       73.7            79.0       71.6               58.9

Table 8 We train our model on different data mixtures including VOS (Davis, MOSE, YouTubeVOS), and subsets of
Internal-train, SA-V, and SA-1B. We report the J &F accuracy when prompted with 3 clicks in the first frame on
SA-V val and Internal-test, MOSE, and 9 zero-shot datasets, and the average 1-click mIoU on SA-23 datasets.

first frame) over 3 benchmarks: SA-V val, zero-shot, and MOSE dev. Fig. 7 shows a consistent power law
relationship between the quantity of training data and the video segmentation accuracy on all benchmarks.
Figure 7 Performance of SAM 2 as a function of the SA-V quantity. We report J &F accuracy for 3-click prompts in
the first frame on SA-V val (left), 9 zero-shot datasets (center), and MOSE dev (right).

Data quality ablation. In Table 9, we experiment with filtering strategies for quality. We subsample 50k
masklets from SA-V, either randomly or by taking the masklets that have been edited the most by annotators.
Filtering based on the number of edited frames leads to strong performance using just 25% of the data, and
outperforms random sampling. However, it is worse than using all 190k SA-V masklets.
J &F mIoU
Setting SA-V val Intern-test MOSE dev 9 zero-shot SA-23
SA-1B + SA-V 50k random 63.7 70.3 72.3 68.7 59.1
SA-1B + SA-V 50k most edited 66.2 73.0 72.5 69.2 58.6
SA-1B + SA-V 69.9 73.8 73.9 70.8 59.8

Table 9 We train our model on different subsets of our SA-V Manual data: 50k randomly sampled masklets, 50k
masklets with the most edited frames, and the full SA-V dataset (190k masklets).

8.2 Model architecture ablations


Here we present model ablations that guided design decisions. We report segmentation accuracy for video
(J &F ) and image (mIoU) tasks, and video segmentation speed (FPS). We find design choices for image and
video components to be largely decoupled – this can be attributed to our modular design and training strategy.

The MOSE dev, SA-V val, and 9 zero-shot columns report J&F; the SA-23 column reports 1-click mIoU.

res.    MOSE dev   SA-V val   9 zero-shot   FPS    SA-23
512     73.0       68.3       70.7          77.3   59.7
768     76.1       71.1       72.5          62.5   61.0
1024    77.0       70.1       72.3          44.6   61.5
(a) Resolution.

#frames   MOSE dev   SA-V val   9 zero-shot   FPS    SA-23
4         71.1       60.0       67.7          77.3   60.1
8         73.0       68.3       70.7          77.3   59.7
10        74.5       68.1       71.1          77.3   59.9
(b) #Frames.

#mem.   MOSE dev   SA-V val   9 zero-shot   FPS    SA-23
4       73.5       68.6       70.5          77.4   59.9
6       73.0       68.3       70.7          77.3   59.7
8       73.2       69.0       70.7          67.7   59.9
(c) #Memories.

chan. dim.   MOSE dev   SA-V val   9 zero-shot   FPS    SA-23
64           73.0       68.3       70.7          77.3   59.7
256          73.4       66.4       70.0          77.0   60.0
(d) Memory channels.

(#sa, #ca)   MOSE dev   SA-V val   9 zero-shot   FPS    SA-23
(2, 2)       73.3       67.3       70.2          85.8   59.9
(3, 2)       72.7       64.1       69.5          84.2   60.0
(4, 4)       73.0       68.3       70.7          77.3   59.7
(e) Memory attention.

img. enc.   MOSE dev   SA-V val   9 zero-shot   FPS    SA-23
S           70.9       65.5       69.4          78.3   57.8
B+          73.0       68.3       70.7          77.3   59.7
L           75.0       66.3       71.9          62.6   61.1
(f) Image encoder size.

Table 10 Capacity ablations. We ablate modeling capacity along input size (resolution, #frames), memory size
(#memories, memory channel dim) and model size (memory attention, image encoder). Ablation defaults in gray .

8.2.1 Capacity ablations

Input size. During training, we sample sequences of frames of fixed resolution and fixed length (here denoted by
# frames). We ablate their impact in Tables 10a, 10b. A higher resolution leads to significant improvements
across image and video tasks. Increasing the number of frames brings notable gains on video benchmarks and
we use a default of 8 to balance speed and accuracy.

Memory size. Increasing the (maximum) number of memories, N , generally helps the performance although
there could be some variance, as in Table 10c. We use a default value of 6 past frames to strike a balance
between temporal context length and computational cost. Using fewer channels for memories does not cause
much performance regression as in Table 10d, while making the memory required for storage 4× smaller.

Model size. More capacity in the backbone or memory-attention generally leads to improved results, as
shown in Tables 10e, 10f. Scaling the backbone brings gains on both image and video metrics, while scaling
the memory-attention only improves video metrics. We default to using a B+ backbone, which provides a
reasonable balance between speed and accuracy; we note that even with a B+ backbone, SAM 2 is already
able to significantly outperform SAM ViT-H (see Table 6).
RPB in img. enc.   2d-RoPE in mem. attn.   MOSE dev   SA-V val   LVOSv2 val   9 zero-shot   FPS    SA-23 (mIoU)
                   ✓                       73.0       68.3       71.6         70.7          77.3   59.7
✓                  ✓                       73.6       67.9       71.0         71.5          77.3   60.0
                                           72.8       67.1       70.3         70.3          84.4   59.9

Table 11 Relative positional encoding. We use 2d-RoPE in memory attention while removing RPB from the image
encoder by default ( gray ). Despite similar FPS at 512 resolution in this table, removing RPB allows us to enable
FlashAttention-2 (Dao, 2023), which gives a significant speed boost at 1024 resolution. At the higher resolution of
1024, the FPS gap between 2d-RoPE (1st row) and the no RoPE baseline (3rd row) becomes much smaller.

8.2.2 Relative positional encoding

By default, we always use absolute positional encoding in both the image encoder as well as memory attention.
In this section, we study relative positional encoding design choices. Here we also evaluate on LVOSv2 (Hong
et al., 2024) with 3 clicks on the 1st frame as a benchmark for long-term video object segmentation.

While SAM (Kirillov et al., 2023) follows Li et al. (2022b) in adding relative positional biases (RPB) to all
backbone layers, Bolya et al. (2023) improve upon this by removing RPB in all but the global attention layers
while adopting “absolute-win” positional encoding which brings significant speed gains. We improve upon
this further by removing all RPB from the backbone, with no performance regression on SA-23 and minimal
regression on video benchmarks (see Table 11), while giving a significant speed boost at 1024 resolution. We
also find it is beneficial to use 2d-RoPE (Su et al., 2021; Heo et al., 2024) in the memory attention.

8.2.3 Memory architecture ablations

Recurrent memory. We investigate the effectiveness of feeding the memory features to a GRU before adding
them to the memory bank. Similar to §8.2.2, we also evaluate on LVOSv2 as an additional benchmark for
long-term object segmentation. While prior works have commonly employed GRU (Cho et al., 2014) states
as a means of incorporating memory into the tracking process, our findings in Table 12 suggest that this
approach does not provide an improvement (except slightly on LVOSv2). Instead, we find it sufficient to
directly store the memory features in the memory bank, which is both simpler and more efficient.

Object pointers. We ablate the impact of cross-attending to the object pointer vectors from the mask decoder
output in other frames (see §4). The results presented in Table 12 show that while cross-attending to
object pointers does not enhance average performance across the 9 zero-shot datasets, it significantly boosts
performance on SA-V val dataset as well as on the challenging LVOSv2 benchmark (validation split). Hence,
we default to cross-attending to object pointers together with the memory bank.
Object Pointers   GRU   MOSE dev   SA-V val   LVOSv2 val   9 zero-shot   FPS    SA-23 (mIoU)
                        73.1       64.5       67.0         70.9          79.7   59.9
                  ✓     72.3       65.3       68.9         70.5          76.9   60.0
✓                       73.0       68.3       71.6         70.7          77.3   59.7

Table 12 Ablations on memory design. We use object pointers by default ( gray ) and also study recurrent GRU memory.

9 Conclusion
We present a natural evolution of Segment Anything into the video domain, based on three key aspects:
(i) extending the promptable segmentation task to video, (ii) equipping the SAM architecture to use memory
when applied to video, and (iii) the diverse SA-V dataset for training and benchmarking video segmentation.
We believe SAM 2 marks a significant advancement in visual perception, positioning our contributions as
milestones that will propel further research and applications in the field.

10 Acknowledgements
We thank Alexander Kirillov and Jitendra Malik for discussions on project direction. Thanks to Andrew
Huang, Sahir Gomez, Miguel Martin, Devansh Kukreja, and Somya Jain for work on the demo, and to Aohan
Li and Meng Wang for creating the dataset visualizer. We thank Shoubhik Debnath and Sagar Vaze for their
work on dataset preparation. Thanks also to William Ngan and Sasha Mitts for their design expertise and to
Grant Gardner and George Orlin for leading product management. We are grateful to Joelle Pineau, Daniel
Bolya, Kate Saenko, Pengchuan Zhang, and Christopher Chedeau, for valuable discussions. Thanks to Rene
Martinez Doehner and Baishan Guo for data support, and to our annotation engineering and management
partners: Robert Kuo, Rishi Godugu, Bob Kamma, Ida Cheng, Claudette Ward, Kai Brown, Jake Kinney,
Jenny Truong, and Karen Bergan. Thanks to Vispi Cassod, Parth Malani, Shiva Koduvayur, Alexander
Miller, and Caleb Ho for their support with compute and infra. Finally, we thank Azita Shokrpour, Mallika
Malhotra, Rodrick Shepard, Jonathan Torres, Luc Dahlin, David Soofian, Alex Bosenberg, and Amanda
Kallet for project level support.

Appendix

Table of contents:
• §A: Task Details
• §B: Limitations
• §C: Model Details
• §D: Dataset Details
• §E: Zero-shot Experiments Details
• §G: Dataset, Annotation, and Model Cards
• §D.2.1: Annotation Guidelines

A Details on the PVS Task


The Promptable Visual Segmentation (PVS) task can be seen as an extension of the Segment Anything (SA)
task from static images to videos. In the PVS setting, given an input video, the model can be interactively
prompted with different types of inputs (including clicks, boxes, or masks) on any frame in the video, with
the goal of segmenting (and tracking) a valid object throughout the video. When interacting with a video,
the model provides an instant response on the frame being prompted (similar to the interactive experience
of SAM on images), and also returns the segmentation of the object throughout the entire video in near
real-time. Similar to SAM, the focus is on valid objects which have a clearly defined boundary, and we do not
consider regions without visual boundaries (e.g. Bekuzarov et al. (2023)). Fig. 8 illustrates the task.

Figure 8 An illustration of the Promptable Visual Segmentation task (PVS). Previously studied tasks such as Segment
Anything (SA) and semi-supervised Video Object Segmentation (VOS) can be seen as special cases of the PVS task.

PVS is related to several tasks in both the static image and video domains. On images, the SA task can be
considered a subset of PVS with the video reduced to a single frame. Similarly, traditional semi-supervised and
interactive VOS (Pont-Tuset et al., 2017) tasks are special cases of PVS, limited to mask prompts provided
only on the first frame and scribbles on multiple frames to segment objects throughout a video, respectively.
In PVS, prompts can either be clicks, masks, or boxes, and the focus is on enhancing the interactive experience,
enabling easy refinement of an object’s segmentation with minimal interaction.

B Limitations
SAM 2 demonstrates strong performance in both static image and video domains, yet it encounters difficulties
in certain scenarios. The model may fail to segment objects across shot changes and can lose track of or
confuse objects in crowded scenes, after long occlusions or in extended videos. To alleviate this issue, we

designed the ability to prompt SAM 2 in any frame: if the model loses the object or makes an error, refinement
clicks on additional frames can quickly recover the correct prediction in most cases. SAM 2 also struggles
with accurately tracking objects with very thin or fine details especially when they are fast-moving. Another
challenging scenario occurs when there are nearby objects with similar appearance (e.g., multiple identical
juggling balls). Incorporating more explicit motion modeling into SAM 2 could mitigate errors in such cases.
While SAM 2 can track multiple objects in a video simultaneously, SAM 2 processes each object separately,
utilizing only shared per-frame embeddings without inter-object communication. While this approach is
simple, incorporating shared object-level contextual information could aid in improving efficiency.
Our data engine relies on human annotators to verify masklet quality and select frames that require correction.
Future developments could include automating this process to enhance efficiency.

C SAM 2 details

C.1 Architecture
Here we discuss further architecture details, expanding on the model description in §4.

Image encoder. We use a feature pyramid network (Lin et al., 2017) to fuse the stride 16 and 32 features from
Stages 3 and 4 of the Hiera image encoder respectively to produce the image embeddings for each frame. In
addition, the stride 4 and 8 features from Stages 1 and 2 are not used in the memory attention but are added
to the upsampling layers in the mask decoder as shown in Figure 9, which helps produce high-resolution
segmentation details. We follow Bolya et al. (2023) in using windowed absolute positional embeddings in the
Hiera image encoder. Bolya et al. (2023) use relative positional biases (RPB) to provide positional information spanning across windows in the backbone; in lieu of RPB, we adopt a simpler approach and interpolate the global positional embedding to span across windows. We do not use any relative positional encoding. We train models with varying
image encoder sizes – T, S, B+ and L. We follow Li et al. (2022b) and use global attention in only a subset of
the image encoder layers (see Table 13).
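As a rough illustration of this fusion, the sketch below sums 1×1-projected stride-16 features with upsampled stride-32 features to form a single 256-dim embedding. The module name and channel widths are illustrative assumptions, not the released configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPNFusion(nn.Module):
    """Sketch: fuse stride-16 and stride-32 backbone features into one 256-dim image embedding.
    Channel widths below are placeholders, not the actual Hiera configuration."""

    def __init__(self, c_s16=576, c_s32=1152, d_model=256):
        super().__init__()
        self.proj_s16 = nn.Conv2d(c_s16, d_model, kernel_size=1)
        self.proj_s32 = nn.Conv2d(c_s32, d_model, kernel_size=1)

    def forward(self, feat_s16, feat_s32):
        p16 = self.proj_s16(feat_s16)                                     # (B, 256, H/16, W/16)
        p32 = self.proj_s32(feat_s32)                                     # (B, 256, H/32, W/32)
        p32_up = F.interpolate(p32, size=p16.shape[-2:], mode="nearest")  # top-down upsampling
        return p16 + p32_up                                               # image embedding at stride 16

# For a 1024x1024 input, feat_s16 would be (B, c_s16, 64, 64) and feat_s32 (B, c_s32, 32, 32);
# the stride-4 and stride-8 features bypass memory attention and feed the mask decoder upsampling.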

Memory attention. In addition to sinusoidal absolute positional embeddings, we use 2d spatial Rotary Positional
Embedding (RoPE) (Su et al., 2021; Heo et al., 2024) in self-attention and cross-attention layers. The object
pointer tokens are excluded from RoPE as they do not have specific spatial correspondence. By default, the
memory attention uses L = 4 layers.
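The sketch below shows the shape of one such layer (self-attention over current-frame tokens, then cross-attention to the memory), assuming the memory tokens are already projected to the model width; the sinusoidal and 2d RoPE positional encodings are deliberately omitted from this sketch.

import torch.nn as nn

class MemoryAttentionLayer(nn.Module):
    """Sketch of one memory-attention layer: self-attention on current-frame tokens followed
    by cross-attention to memory tokens (spatial memories + object pointers). RoPE is omitted
    here; in the model it applies to spatial tokens but not to object pointer tokens."""

    def __init__(self, d_model=256, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, frame_tokens, memory_tokens):
        # frame_tokens:  (B, H*W, d_model) flattened current-frame embedding
        # memory_tokens: (B, M, d_model)   memory-bank entries, assumed projected to d_model
        x = frame_tokens
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x + self.cross_attn(self.norm2(x), memory_tokens, memory_tokens)[0]
        return x + self.mlp(self.norm3(x))

# The full memory attention stacks L = 4 such layers.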

Figure 9 Mask decoder architecture. The design largely follows SAM, and we additionally include the stride 4 and
stride 8 features from the image encoder during upsampling. We also use the mask token corresponding to the output
mask as an object pointer and generate an occlusion score which indicates if the object of interest is visible in the
current frame.

Prompt encoder and mask decoder. The prompt encoder design follows SAM, and we next discuss additional
details on design changes in the mask decoder. We use the mask token corresponding to the output mask
as the object pointer token for the frame, which is placed in the memory bank. As discussed in §4, we also

introduce an occlusion prediction head. This is accomplished by including an additional token along with
the mask and IoU output tokens. An additional MLP head is applied to this new token to produce a score
indicating the likelihood of the object of interest being visible in the current frame (as shown in Figure 9).
SAM introduced the ability to output multiple valid masks when faced with ambiguity about the object being
segmented in an image. For example, when a person clicks on the tire of a bike, the model can interpret this
click as referring to only the tire or the entire bike and output multiple predictions. In videos, this ambiguity
can extend across video frames. For example, if in one frame only the tire is visible, a click on the tire might
relate to just the tire, or as more of the bike becomes visible in subsequent frames, this click could have been
intended for the entire bike. To handle this ambiguity, SAM 2 predicts multiple masks at each step of the
video. If further prompts do not resolve the ambiguity, the model selects the mask with the highest predicted
IoU for the current frame for further propagation in the video.
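A minimal sketch of this selection rule, with illustrative tensor shapes:

import torch

def select_mask_for_propagation(mask_logits, iou_predictions):
    """Sketch: when prompt ambiguity is not resolved by further prompts, propagate the
    candidate mask with the highest predicted IoU on the current frame.

    mask_logits:     (B, K, H, W) logits of K candidate masks
    iou_predictions: (B, K)       predicted IoU per candidate
    """
    best = iou_predictions.argmax(dim=1)                                    # (B,)
    batch_idx = torch.arange(mask_logits.shape[0], device=mask_logits.device)
    return mask_logits[batch_idx, best]                                     # (B, H, W)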

Memory encoder and memory bank. Our memory encoder does not use an additional image encoder and
instead reuses the image embeddings produced by the Hiera encoder, which are fused with the predicted mask
information to produce memory features (as discussed in §4). This design allows the memory features to
benefit from the strong representations produced by the image encoder (especially when we scale the image
encoder to a larger size). Further, we project the memory features in our memory bank to a dimension of 64,
and split the 256-dim object pointer into 4 tokens of 64-dim for cross-attention to the memory bank.
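The sketch below illustrates how a frame's memory-bank entry could be assembled under this design; the layer names and the 256-dim inputs are assumptions made for illustration.

import torch
import torch.nn as nn

class MemoryBankEntry(nn.Module):
    """Sketch: project spatial memory features to 64 channels and split the 256-dim object
    pointer into 4 tokens of 64 dims, yielding the tokens that memory attention cross-attends
    to. Names and input widths are illustrative."""

    def __init__(self, mem_in_dim=256, mem_dim=64):
        super().__init__()
        self.mem_proj = nn.Conv2d(mem_in_dim, mem_dim, kernel_size=1)

    def forward(self, memory_features, object_pointer):
        # memory_features: (B, 256, H, W) image embedding fused with predicted-mask information
        # object_pointer:  (B, 256)       mask token of the selected output mask for this frame
        spatial = self.mem_proj(memory_features)                  # (B, 64, H, W)
        spatial_tokens = spatial.flatten(2).transpose(1, 2)       # (B, H*W, 64)
        pointer_tokens = object_pointer.view(-1, 4, 64)           # (B, 4, 64)
        return torch.cat([spatial_tokens, pointer_tokens], dim=1)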

C.2 Training
C.2.1 Pre-training

We first pre-train SAM 2 on static images on the SA-1B dataset (Kirillov et al., 2023). Table 13a details the
settings used during pre-training on SA-1B – other settings not mentioned here follow Kirillov et al. (2023).
The image encoder is initialized from MAE pre-trained Hiera (Ryali et al., 2023). Similar to SAM, we filter
masks covering more than 90% of the image and restrict training to 64 randomly sampled masks per image.
Unlike SAM, we found it beneficial to use an ℓ1 loss to more aggressively supervise the IoU predictions and
to apply a sigmoid activation to the IoU logits to restrict the output into the range between 0 and 1. For
multi-mask predictions (on the first click), we supervise the IoU predictions of all masks to encourage better
learning of when a mask might be bad, but only supervise the mask logits with the lowest segmentation
loss (linear combination of focal and dice loss). In SAM, during iterative sampling of points, two iterations
were inserted with no additional prompts (only feeding the previous mask logits) – we do not add such
iterations during our training and use 7 correction clicks (instead of 8 in SAM). We also employ horizontal
flip augmentation during training and resize the image to a square size of 1024×1024.
We use AdamW (Loshchilov & Hutter, 2019) and apply layer decay (Clark et al., 2020) on the image encoder
and follow a reciprocal square-root schedule (Zhai et al., 2022). See Table 13 (a) for the hyperparameters in
our pre-training stage.
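The sketch below illustrates this supervision scheme for a single image and its K candidate masks; the focal/dice formulations and helper names are simplified assumptions rather than the exact training code.

import torch
import torch.nn.functional as F

def multimask_pretraining_loss(mask_logits, iou_logits, gt_mask, focal_w=20.0, dice_w=1.0):
    """Sketch: supervise the IoU prediction of every candidate with a sigmoid + l1 loss, but
    apply the segmentation loss (focal + dice) only to the candidate with the lowest
    segmentation loss. mask_logits: (K, H, W), iou_logits: (K,), gt_mask: (H, W) in {0, 1}."""
    gt = gt_mask.float()
    seg_losses, iou_targets = [], []
    for k in range(mask_logits.shape[0]):
        logit, prob = mask_logits[k], mask_logits[k].sigmoid()
        bce = F.binary_cross_entropy_with_logits(logit, gt, reduction="none")
        p_t = prob * gt + (1 - prob) * (1 - gt)
        focal = ((1 - p_t) ** 2 * bce).mean()                     # simplified focal loss (gamma=2)
        dice = 1 - (2 * (prob * gt).sum() + 1) / (prob.sum() + gt.sum() + 1)
        seg_losses.append(focal_w * focal + dice_w * dice)
        pred = (prob > 0.5).float()                               # actual IoU as the l1 target
        inter = (pred * gt).sum()
        union = (pred.sum() + gt.sum() - inter).clamp(min=1)
        iou_targets.append(inter / union)
    seg_losses = torch.stack(seg_losses)
    iou_loss = F.l1_loss(iou_logits.sigmoid(), torch.stack(iou_targets))  # supervise all K
    return seg_losses.min() + iou_loss                                    # mask loss only on the best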

C.2.2 Full training

After pre-training, we train SAM 2 on our introduced datasets SA-V + Internal (section §5.2), a 10% subset
of SA-1B, and a mixture of open-source video datasets including DAVIS (Pont-Tuset et al., 2017; Caelles
et al., 2019), MOSE (Ding et al., 2023), and YouTubeVOS (Xu et al., 2018b). Our released model is trained
on SA-V manual + Internal and SA-1B.
SAM 2 is designed for two tasks; the PVS task (on videos) and the SA task (on images). Training is done
jointly on image and video data. To optimize our data usage and computational resources during training, we
adopt an alternating training strategy between video data (multiple frames) and static images (one single
frame). Specifically, in each training iteration, we sample a full batch either from the image or video dataset,
with their sampling probabilities proportional to the size of each data source. This approach allows for
a balanced exposure to both tasks and a different batch size for each data source to maximize compute
utilization. Settings not explicitly mentioned here for the image task follow settings from the pre-training
phase. See Table 13 (b) for the hyperparameters in our full training stage. The training data mixture consists
of ∼15.2% SA-1B, ∼70% SA-V and ∼14.8% Internal. The same settings are used when open-source datasets

(a) Pre-training
data: SA-1B
steps: ∼90k
resolution: 1024
precision: bfloat16
optimizer: AdamW
optimizer momentum: β1, β2 = 0.9, 0.999
gradient clipping: type: ℓ2, max: 0.1
weight decay: 0.1
learning rate (lr): 4e-4
lr schedule: reciprocal sqrt, timescale=1000
warmup: linear, 1k iters
cooldown: linear, 5k iters
layer-wise decay: 0.8 (T, S), 0.9 (B+), 0.925 (L)
augmentation: hflip, resize to 1024 (square)
batch size: 256
drop path: 0.1 (T, S), 0.2 (B+), 0.3 (L)
mask losses (weight): focal (20), dice (1)
IoU loss (weight): ℓ1 (1)
max. masks per img.: 64
# correction points: 7
global attn. blocks: 5-7-9 (T), 7-10-13 (S), 12-16-20 (B+), 23-33-43 (L)

(b) Full training
data: SA-1B, Internal, SA-V
steps: ∼300k
resolution: 1024
precision: bfloat16
optimizer: AdamW
optimizer momentum: β1, β2 = 0.9, 0.999
gradient clipping: type: ℓ2, max: 0.1
weight decay: 0.1
learning rate (lr): backbone: 6e-5, other: 3.0e-4
lr schedule: cosine
warmup: linear, 15k iters
layer-wise decay: 0.8 (T, S), 0.9 (B+), 0.925 (L)
img. augmentation: hflip, resize to 1024 (square)
vid. augmentation: hflip, affine (deg: 25, shear: 20), colorjitter (b: 0.1, c: 0.03, s: 0.03, h: null), grayscale (0.05), per-frame colorjitter (b: 0.1, c: 0.05, s: 0.05, h: null)
batch size: 256
drop path: 0.1 (T, S), 0.2 (B+), 0.3 (L)
mask losses (weight): focal (20), dice (1)
IoU loss (weight): ℓ1 (1)
occlusion loss (weight): cross-entropy (1)
max. masks per frame: image: 64, video: 3
# correction points: 7
global attn. blocks: 5-7-9 (T), 7-10-13 (S), 12-16-20 (B+), 23-33-43 (L)

Table 13 Hyperparameters and details of SAM 2 pre-training (a) and full training (b). Note that some settings vary with image encoder size (T, S, B+, L).

are included, with the change that the additional data is included (∼1.3% DAVIS, ∼9.4% MOSE, ∼9.2%
YouTubeVOS, ∼15.5% SA-1B, ∼49.5% SA-V, ∼15.1% Internal).
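A minimal sketch of this alternating sampling, with toy data structures standing in for the real data loaders (all names are illustrative):

import random

def sample_training_batch(datasets, batch_sizes):
    """Sketch: each iteration draws a full batch from a single source (image or video),
    choosing the source with probability proportional to its size; each source can use a
    different batch size to maximize compute utilization."""
    names = list(datasets)
    sizes = [len(datasets[n]) for n in names]
    source = random.choices(names, weights=sizes, k=1)[0]
    k = min(batch_sizes[source], len(datasets[source]))
    return source, random.sample(datasets[source], k=k)

# Toy usage: datasets = {"SA-1B": [...], "SA-V": [...], "Internal": [...]}
#            batch_sizes = {"SA-1B": 64, "SA-V": 8, "Internal": 8}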
We train by simulating an interactive setting, sampling 8-frame sequences and randomly selecting up to 2
frames (including the first) for corrective clicks. During training, we use ground-truth masklets and model
predictions to sample prompts, with initial prompts being the ground-truth mask (50% probability), a positive
click from the ground-truth mask (25%), or a bounding box input (25%).
We restrict the maximum number of masklets for each sequence of 8 frames to 3 randomly chosen ones. We
reverse the temporal order with a probability of 50% to help generalization to bi-directional propagation.
When sampling corrective clicks, with a small probability of 10% we instead sample clicks directly from the ground-truth mask, irrespective of the model prediction, to allow additional flexibility in mask refinement.
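The sketch below mirrors the initial-prompt probabilities described above (ground-truth mask 50%, positive click 25%, bounding box 25%); `gt_mask` is assumed to be a non-empty boolean torch tensor and the helper is purely illustrative.

import random

def sample_initial_prompt(gt_mask):
    """Sketch: return the ground-truth mask with probability 0.5, a positive click from the
    mask with probability 0.25, or the object's bounding box with probability 0.25."""
    r = random.random()
    ys, xs = gt_mask.nonzero(as_tuple=True)
    if r < 0.5:
        return ("mask", gt_mask)
    elif r < 0.75:
        i = random.randrange(len(ys))                 # positive click inside the object
        return ("point", (int(xs[i]), int(ys[i]), 1))
    else:
        return ("box", (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))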

Losses and optimization. We supervise the model’s predictions using a linear combination of focal and dice
losses for the mask prediction, mean-absolute-error (MAE) loss for the IoU prediction, and cross-entropy loss
for object prediction with a ratio of 20:1:1:1 respectively. As during pre-training, for multi-mask predictions,
we only supervise the mask with the lowest segmentation loss. If the ground-truth does not contain a mask
for a frame, we do not supervise any of the mask outputs (but always supervise the occlusion prediction head
that predicts whether there should exist a mask in the frame).

C.3 Speed benchmarking


We run all benchmarking experiments on a single A100 GPU using PyTorch 2.3.1 and CUDA 12.1, under
automatic mixed precision with bfloat16. We compile the image encoder with torch.compile for all SAM 2
models and do the same for SAM and HQ-SAM for direct comparison on the SA task (Tables 6 and 15). The
FPS measurements for the SA task were conducted using a batch size of 10 images, which was found to yield
the highest FPS across all three model types. For video tasks, we use a batch size of 1 following common
protocol in video segmentation.
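The sketch below shows the kind of measurement loop this corresponds to (a single GPU, bfloat16 autocast, and a pre-compiled model); `model` and `batch` are placeholders, and details such as warmup counts are assumptions.

import time
import torch

def benchmark_fps(model, batch, warmup=10, iters=50):
    """Sketch of an FPS measurement: run warmup iterations, then time `iters` forward passes
    under bfloat16 autocast and report processed images per second."""
    model = model.to("cuda").eval()
    batch = batch.to("cuda")
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        for _ in range(warmup):                 # warm up kernels / torch.compile artifacts
            model(batch)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return iters * batch.shape[0] / elapsed     # images (or frames) per second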

D Data details

D.1 SA-V dataset details


Videos. Video resolutions range from 240p to 4K with an average
size of 1,401 × 1,037. Video duration ranges from 4 seconds to 2.3
minutes, with an average of 13.8 seconds, totaling 4.2M annotated
frames and 196 hours.
Automatic masklets. Similar to the approach described by Kirillov et al. (2023), automatic masklets are generated by prompting the model with regular grids. We prompt the model with a 32 × 32 grid on the first frame; additionally, we use a 16 × 16 grid on 4 zoomed image crops of the first frame (derived from a 2 × 2 overlapped window) and a 4 × 4 grid on 16 zoomed image crops of the first frame (derived from a 4 × 4 overlapped window). We apply two post-processing steps. First, we remove tiny disconnected components with areas smaller than 200 pixels. Second, we fill in holes in segmentation masks if the area of the hole is less than 200 pixels. These post-processing steps are applied across all frames. By combining these automatically generated masklets with manually created ones, we enhance the coverage of annotations in the SA-V dataset, as illustrated in Fig. 10.

Figure 10 Annotations overlaid on the first frame: (a) manual labels (ML) only, (b) with automatic labels (Auto). Automatic labels increase diversity and coverage.
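A minimal sketch of the two post-processing steps described above, assuming a boolean numpy mask for a single frame and using scipy's connected-component utilities (the actual implementation may differ):

import numpy as np
from scipy import ndimage

def postprocess_mask(mask, min_area=200):
    """Sketch: (1) drop disconnected foreground components smaller than min_area pixels,
    (2) fill background holes smaller than min_area pixels."""
    labels, n = ndimage.label(mask)
    if n:
        areas = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
        keep = np.isin(labels, np.flatnonzero(areas >= min_area) + 1)
    else:
        keep = mask.copy()
    bg_labels, n_bg = ndimage.label(~keep)
    border = np.unique(np.concatenate(
        [bg_labels[0], bg_labels[-1], bg_labels[:, 0], bg_labels[:, -1]]))
    filled = keep.copy()
    for lab in range(1, n_bg + 1):
        if lab in border:
            continue                      # touches the image border: background, not a hole
        hole = bg_labels == lab
        if hole.sum() < min_area:
            filled[hole] = True           # small enclosed hole: fill it
    return filled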

D.2 Data engine details


D.2.1 Annotation protocol

A diagram of the annotation protocol used in our data engine is shown in Fig. 11. The annotation task was
separated into steps each carried out by a different annotator: Steps 1 and 2 focus on object selection, Steps 3
and 4 on masklet tracking, and Step 5 on quality verification. SAM 2 was deployed on GPU as an API and
built into the annotation tool to enable interactive use.
Compared to image segmentation annotation, large-scale video segmentation annotation presents unique
challenges which required innovations in the annotation task design and protocol. To improve our model’s
ability to “segment anything”, it was important to focus annotation on challenging objects where SAM 2
struggled. We leveraged our online model in the loop setup to enable this, requesting annotators to use SAM 2
interactively to identify failure modes and then correct them.
We found the number of edited frames to be a proxy for how challenging an object is, as shown in Table 9.
Therefore, we asked annotators to annotate objects that required at least 2 edited frames with SAM 2 in the
loop. To focus annotation on less prominent and more challenging cases, annotators were presented with videos
pre-filled with verified satisfactory automatic masklets and asked to find un-annotated challenging objects.
We further decouple the object selection task from the annotation task: in the selection task annotators focus
on choosing the challenging objects in one frame, while in the annotation task annotators are presented with
a challenging target object and requested to annotate the masklet consistently throughout the video.

D.2.2 Data engine phase comparison

The comparison of data engine phases shown in Table 1 was conducted as a controlled experiment using 169
videos and 452 masklets. We ask three subsets of annotators to annotate the same set of objects with the
annotation protocol from each phase. We categorize masklets into three buckets based on the mask area in the
first frame (small: 1 to 32², medium: 32² to 96², and large: equal to or greater than 96² pixels). Phase 1 data is used
as the quality reference, due to the high quality masks from frame-by-frame manual annotation with SAM.

E Details on zero-shot transfer experiments


In this section, we describe further details of our zero-shot experiments (§6). Unless otherwise noted, the
results reported in this section follow our default setup using the Hiera-B+ image encoder with a resolution of 1024 and trained on the full combination of datasets, i.e., SAM 2 (Hiera-B+) in Table 7.

Figure 11 Annotation guideline overview (flowchart steps: 1. Play the Video; 2. Identify SAM 2 Failures; 3. Iteratively Correct Predictions; 4. Final Check; 5. Verify the Masklet). There are 3 main annotation tasks: masklet selection, masklet tracking, and masklet verification. Each task has a different set of annotators working on it.

E.1 Zero-shot video tasks


E.1.1 Video dataset details

We evaluate SAM 2 on a diverse benchmark of 17 zero-shot datasets: EndoVis 2018 (Allan et al., 2020)
contains medical surgery videos with robotic instruments. ESD (Huang et al., 2023) contains videos from a
robot manipulator camera often with motion blur. LVOSv2 (Hong et al., 2024) is a benchmark for long-term
video object segmentation. LV-VIS (Wang et al., 2023) contains videos from a diverse set of open-vocabulary
object categories. UVO (Wang et al., 2021b) contains videos for open-world object segmentation, and
VOST (Tokmakov et al., 2022) contains videos with objects undergoing large transformations, such as an egg being broken or paper being torn. PUMaVOS (Bekuzarov et al., 2023) contains videos with segments around object parts
such as a person’s cheek. Virtual KITTI 2 (Cabon et al., 2020) is a synthetic video dataset with driving scenes.
VIPSeg (Miao et al., 2022) provides object segmentation in panoptic videos. Wildfires (Toulouse et al., 2017)

contains wildfire videos under different conditions from the Corsican Fire Database. VISOR (Darkhalil et al.,
2022) contains egocentric videos in kitchen scenes with segments around hands and active objects. FBMS (Brox
et al., 2010) provides motion segmentation over moving objects in videos. Ego-Exo4D (Grauman et al., 2023)
is a large dataset with egocentric videos around various human activities. Cityscapes (Cordts et al., 2016)
contains videos for urban driving scenes. Lindenthal Camera (Haucke & Steinhage, 2021) contains videos in a
wildlife park with segments around observed animals such as birds and mammals. HT1080WT Cells (Gómez-de
Mariscal et al., 2021) contains microscopy videos with cell segments. Drosophila Heart (Fishman et al., 2023)
contains microscopy videos for the heart of fruit flies.
Among these 17 zero-shot video datasets above, 9 of them (EndoVis, ESD, LVOSv2, LV-VIS, UVO, VOST,
PUMaVOS, Virtual KITTI 2, and VIPSeg) have dense object segments annotated for every video frame. In
the remaining 8 datasets (Wildfires, VISOR, FBMS, Ego-Exo4D, Cityscapes, Lindenthal Camera, HT1080WT
Cells, and Drosophila Heart), the object segments are sparsely annotated over only a subset of video frames,
and we compute the metrics on those frames where the ground-truth segmentation masks are available. In
most evaluations of the paper, we only evaluate zero-shot performance on the 9 densely annotated datasets,
while in our semi-supervised VOS evaluation (§6.1.2), we evaluate on all these 17 datasets listed above.

E.1.2 Interactive offline and online evaluation details

Offline evaluation involves multiple passes over the entire video. We start with click prompts on the first
frame, segment the object throughout the entire video, and then in the next pass, we select the frame with the
lowest segmentation IoU w.r.t. the ground-truth as the new frame for prompting. The model then segments
the object again throughout the video based on all prompts received previously, until reaching a maximum of
Nframe passes (with one new prompted frame in each pass).
Online evaluation involves only one pass over the entire video. We start with click prompts on the first
frame and propagate the prompts across the video, pausing propagation when encountering a frame with a
low-quality prediction (IoU < 0.75 with ground-truth). We then add additional click prompts on the paused
frame to correct the segment on this frame and resume the propagation forward until reaching another low
quality frame with IoU < 0.75. This is repeated while the number of prompted frames is less than the
maximum Nframe . Unlike the previous offline evaluation, in this setting, the new prompts only affect the
frames after the current paused frame but not the frames before it.
In both settings, we evaluate on 9 densely annotated datasets in §E.1.1 (EndoVis, ESD, LVOSv2, LV-VIS,
UVO, VOST, PUMaVOS, Virtual KITTI 2, and VIPSeg). If a video contains multiple objects to segment in
its ground-truth annotations, we perform inference on each object independently. We simulate interactive
video segmentation with Nclick = 3 clicks per frame, assuming that the user would visually locate the object
to label it (with initial clicks) or to refine the current segmentation prediction of it (with correction clicks).
Specifically, when starting the first pass (where there are not any existing predictions yet), we place an initial
click on the first frame at the center¹ of the object’s ground-truth mask and then interactively add two more
clicks based on the center of the error region (between the ground-truth mask and the predicted segments on
the first frame). Then in subsequent passes (where there are already predicted segments), we interactively
add three clicks based on the center of the error region (between the ground-truth mask and the predicted
segments on the frame being prompted).
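The sketch below implements this click placement with a distance transform, assuming boolean numpy masks; choosing the larger of the false-negative/false-positive regions for a correction click is an assumption of this sketch rather than a statement of the exact protocol.

import numpy as np
from scipy import ndimage

def sample_click(gt_mask, pred_mask=None):
    """Sketch: the first click goes to the mask 'center' (the pixel farthest from the mask
    boundary); correction clicks go to the center of the error region between prediction and
    ground truth, positive for missing foreground and negative for extra foreground."""
    if pred_mask is None:
        region, positive = gt_mask, True
    else:
        false_neg = gt_mask & ~pred_mask
        false_pos = pred_mask & ~gt_mask
        positive = false_neg.sum() >= false_pos.sum()
        region = false_neg if positive else false_pos
    dist = ndimage.distance_transform_edt(region)      # distance to the region boundary
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    return (int(x), int(y)), bool(positive)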
We report the average J &F metric over Nframe = 1, . . . , 8 interacted frames and the J &F metrics under
different annotation time on a video based on the following assumptions:
• On each frame, it takes Tloc = 1 sec for the annotator to visually locate an object in the frame, and
Tclick = 1.5 sec to add each click, following Delatolas et al. (2024).
• In offline mode, it takes Texam = 30 sec on a 300-frame video to examine the results throughout the
video in each round, including finding the frame with the worst segmentation quality to add corrections
(and for longer or shorter videos, this time is proportional to the video length L, assuming the annotator
could examine the results at 10 FPS).
¹ The center of a mask is defined as the mask pixel that has the largest Euclidean distance to the mask boundary.

(a) J&F performance on each dataset with different numbers of interacted frames (3-click): per-dataset curves of J&F vs. annotation time for SAM 2, SAM+XMem++, and SAM+Cutie on EndoVis 2018, ESD, LVOSv2, LV-VIS, PUMaVOS, UVO, VIPSeg, Virtual KITTI 2, and VOST.

Method          EndoVis 2018   ESD    LVOSv2   LV-VIS   PUMaVOS   UVO    VIPSeg   Virtual KITTI 2   VOST   (average)
SAM + XMem++    68.9           88.2   72.1     86.4     60.2      74.5   84.2     63.8              46.6   71.7
SAM + Cutie     71.8           87.6   82.1     87.1     59.4      75.2   84.4     70.3              54.3   74.7
SAM 2           78.4           89.6   87.3     90.4     70.6      79.8   89.0     74.7              66.6   80.7

(b) average J &F on each dataset over 8 interacted frames (3-click)

Figure 12 Zero-shot performance of SAM 2 vs baselines (SAM+XMem++ and SAM+Cutie) under interactive offline
evaluation with different numbers of interacted frames, using 3 clicks per interacted frame. See §E.1.2 for details.

• In online mode, it takes Texam = 30 sec on a 300-frame video to follow the results throughout the
video in total, including pausing at a frame with low quality for further corrections (and this time is
proportional to the video length L similar to the offline mode).
• The annotation time for an object is (Texam · (L/300) + Tloc + Tclick · Nclick) · Nframe in offline mode and Texam · (L/300) + (Tloc + Tclick · Nclick) · Nframe in online mode, where L is the total number of frames in the video, Nframe = 1, . . . , 8 is the number of frames annotated (i.e., the number of interactive rounds), and Nclick = 3 is the number of clicks per frame² (see the sketch below).
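A small sketch of this annotation-time model (values in seconds; parameter names mirror the symbols above):

def annotation_time(num_frames, n_prompted_frames, mode="offline",
                    t_exam_300=30.0, t_loc=1.0, t_click=1.5, n_clicks=3):
    """Sketch of the timing model: examination time scales with video length (30 s per
    300 frames); offline mode pays it every round, online mode pays it once."""
    t_exam = t_exam_300 * (num_frames / 300.0)
    per_frame = t_loc + t_click * n_clicks
    if mode == "offline":
        return (t_exam + per_frame) * n_prompted_frames
    return t_exam + per_frame * n_prompted_frames

# e.g. a 300-frame video with 3 prompted frames:
# offline: (30 + 5.5) * 3 = 106.5 s,  online: 30 + 5.5 * 3 = 46.5 s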
We show per-dataset results of SAM 2 and the two baselines (SAM+XMem++ and SAM+Cutie, see their
details below) for interactive offline and online evaluation in Fig. 12 and Fig. 13. SAM 2 outperforms both
baselines with a notable margin on all datasets and settings.

E.1.3 Semi-supervised VOS evaluation details

In §6.1.2, we also compare with previous video tracking methods under the semi-supervised VOS setting
(Pont-Tuset et al., 2017), where prompts (which can be foreground/background clicks, bounding boxes, or
ground-truth object masks) are provided only on the first frame of the video. When using click prompts, we
interactively sample either 1, 3 or 5 clicks on the first video frame, and then track the object based on these
clicks. Following the click-based evaluation in prior work (Kirillov et al., 2023; Sofiiuk et al., 2022), the initial
² We note that this estimation does not account for the model’s tracking FPS. The intuition is that human annotators can only examine the results at a lower speed, and therefore the model’s tracking time is covered by Texam.

(a) J&F performance on each dataset with different numbers of interacted frames (3-click): per-dataset curves of J&F vs. annotation time for SAM 2, SAM+XMem++, and SAM+Cutie on EndoVis 2018, ESD, LVOSv2, LV-VIS, PUMaVOS, UVO, VIPSeg, Virtual KITTI 2, and VOST.

Method          EndoVis 2018   ESD    LVOSv2   LV-VIS   PUMaVOS   UVO    VIPSeg   Virtual KITTI 2   VOST   (average)
SAM + XMem++    71.4           87.8   72.9     85.2     63.7      74.7   82.5     63.9              52.7   72.8
SAM + Cutie     70.5           87.3   80.6     86.0     58.9      75.2   82.1     70.4              54.6   74.0
SAM 2           77.8           88.5   85.8     88.7     74.2      79.0   86.1     74.3              63.3   79.7

(b) average J &F on each dataset over 8 interacted frames (3-click)

Figure 13 Zero-shot performance of SAM 2 vs baselines (SAM+XMem++ and SAM+Cutie) under interactive online
evaluation with different numbers of interacted frames, using 3 clicks per interacted frame. See §E.1.2 for details.

click is placed on the object center and subsequent clicks are obtained from the center of the error region.
Similar to the interactive setting, here we also use SAM+XMem++ and SAM+Cutie as two baselines. For
click or box prompts, SAM is first used to handle click or bounding box inputs, and its output mask is then
used as input to XMem++ or Cutie. For mask prompts, the ground-truth object masks on the first frame
are directly used as input to XMem++ and Cutie – this is the standard semi-supervised VOS setting and
evaluates XMem++ and Cutie without using SAM.
In this setting, we evaluate on all 17 zero-shot video datasets in §E.1.1. If a dataset does not follow the
standard VOS format, we preprocess it into a format similar to MOSE (Ding et al., 2023). During processing,
we ensure that all objects in each video have a valid non-empty segmentation mask on the first frame to be
compatible with semi-supervised VOS evaluation. In case an object doesn’t appear in the first frame, we
create a separate video for it starting from the first frame where the object appears.
We report the standard J &F metric (Pont-Tuset et al., 2017) for this evaluation. If a dataset provides an
official evaluation toolkit, we use it for evaluation (on the VOST dataset, we report the J metric instead,
following its official protocol (Tokmakov et al., 2022)). The results are shown in Table 4, where SAM 2
performs better than both baselines on the majority of the 17 datasets across different types of prompts.
We show per-dataset results of SAM 2 and the two baselines (SAM+XMem++ and SAM+Cutie, see their
details below) for semi-supervised VOS evaluation in Fig. 14. SAM 2 outperforms both baselines on the
majority of these datasets across different types of prompts.


Figure 14 Zero-shot performance on 17 video datasets of SAM 2 vs two baselines (SAM+XMem++ and SAM+Cutie)
under semi-supervised VOS evaluation using different prompts (1, 3 or 5 clicks, bounding boxes, or ground-truth masks
on the first video frame), with the averaged performance across datasets for each type of prompt shown in Table 4 in
the main text. See §E.1.3 for details.

E.1.4 SAM+XMem++ and SAM+Cutie baseline details

We adopt SAM+XMem++ and SAM+Cutie as two baselines for promptable video segmentation, where the
click (or box) prompts are first processed by SAM to obtain an object mask, and then XMem++ / Cutie
models track this SAM mask across the video to obtain the final masklet. In these two baselines, SAM can be
used to provide both an initial object mask on the first frame, or to correct an existing object mask output
by XMem++ or Cutie. This is used for subsequent interacted frames during interactive offline and online
evaluation, where new positive and negative clicks are provided as corrections over an existing mask.
When using SAM to apply a correction over an existing mask prediction in a given frame, we follow the
strategy in EVA-VOS (Delatolas et al., 2024) to first initialize SAM with the XMem++ or Cutie output mask
before incorporating the new correction clicks. Specifically, we first reconstruct the XMem++ or Cutie output

mask in SAM by sampling clicks from them and feeding them as inputs to SAM until the reconstructed mask
in SAM reaches IoU > 0.8 with the XMem++ or Cutie output mask. Then, to incorporate new positive
and negative clicks for correction, we concatenate these additional correction clicks with the initial clicks
sampled during mask construction, and feed the joint concatenated list as input into SAM to obtain the
final corrected masks. We find that this strategy works better than several alternatives (such as feeding the
XMem++ or Cutie output mask as a mask prompt together with new correction clicks into SAM, or taking
only the correction clicks as inputs to SAM while ignoring the XMem++ or Cutie output mask).
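The sketch below outlines this correction strategy; `sam_predictor`, `mask_iou`, and `sample_click` are hypothetical helpers (a point-prompted SAM call, an IoU computation, and the error-region click sampler sketched in §E.1.2), and the cap on initialization clicks is an assumption.

def correct_with_sam(sam_predictor, tracker_mask, correction_clicks,
                     iou_thresh=0.8, max_init_clicks=20):
    """Sketch: (1) sample clicks until SAM reconstructs the XMem++/Cutie mask to IoU > 0.8,
    (2) append the new correction clicks and re-run SAM to get the corrected mask."""
    points, labels = [], []
    reconstructed = None
    for _ in range(max_init_clicks):
        (x, y), positive = sample_click(tracker_mask, reconstructed)
        points.append((x, y))
        labels.append(1 if positive else 0)
        reconstructed = sam_predictor(points, labels)
        if mask_iou(reconstructed, tracker_mask) > iou_thresh:
            break
    for (x, y), positive in correction_clicks:       # concatenate the new correction clicks
        points.append((x, y))
        labels.append(1 if positive else 0)
    return sam_predictor(points, labels)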

E.2 DAVIS interactive benchmark


We also evaluate on the DAVIS interactive benchmark (Caelles et al., 2018), which resembles our interactive
offline evaluation in §6.1.1, where in each round of interaction, the evaluation server would provide new
annotations on frames with the worst segmentation performance. The official DAVIS eval toolkit provides
scribble prompts during interactions, while other work such as CiVOS (Vujasinović et al., 2022) has also
extended this to cover click prompts.
Here we follow CiVOS to use positive and negative clicks as input prompts and adopt the same strategy
for click sampling. We report the J &F@60s and AUC-J &F metrics on this benchmark as provided by its
evaluator, and compare to two baselines: MiVOS (Cheng et al., 2021b), which directly uses the provided
scribbles via a scribble-to-mask module (and is also extended to click prompts in Vujasinović et al. (2022)),
and CiVOS, which samples clicks from the provided scribbles. The results are shown in Table 14, where SAM 2 (using click inputs) outperforms both baselines under click inputs. We note that SAM 2 often tends to
segment object parts (e.g. a person’s arm) on the first click while the DAVIS dataset mainly contains whole
objects (e.g. an entire person), which could penalize SAM 2’s J &F performance on this benchmark. We
verified this by observing better accuracy (0.86 AUC-J &F and 0.89 J &F@60s with click input) for an earlier
model trained on fewer part annotations.

Method input type AUC-J &F J &F@60s


MiVOS (Cheng et al., 2021b) scribbles 0.87 0.88

MiVOS (Cheng et al., 2021b) clicks 0.75 0.75
CiVOS (Vujasinović et al., 2022) clicks 0.83 0.84
SAM 2 clicks 0.84 0.87

Table 14 Performance of SAM 2 and other models on the DAVIS interactive benchmark. For SAM 2, we use clicks as
inputs following the click sampling strategy from CiVOS (Vujasinović et al., 2022). See §E.2 for details (‡ : performance
reported in Vujasinović et al. (2022)).

E.3 Zero-shot image tasks


E.3.1 Dataset details

For the interactive segmentation task, we evaluated SAM 2 on a comprehensive suite of 37 datasets. This
suite includes the 23 datasets previously used by SAM for zero-shot evaluation. For completeness, we list
the 23 datasets: LVIS (Gupta et al., 2019), ADE20K (Zhou et al., 2019), Hypersim (Roberts et al., 2021),
Cityscapes (Cordts et al., 2016), BBBC038v1 (Caicedo et al., 2019), DOORS (Pugliatti & Topputo, 2022),
DRAM (Cohen et al., 2022), EgoHOS (Zhang et al., 2022), GTEA (Fathi et al., 2011; Li et al., 2015), iShape
(Yang et al., 2021a), NDD20 (Trotter et al., 2020), NDISPark (Ciampi et al., 2021, 2022), OVIS (Qi et al.,
2022), PPDLS (Minervini et al., 2016), Plittersdorf (Haucke et al., 2022), STREETS (Snyder & Do, 2019),
TimberSeg (Fortin et al., 2022), TrashCan (Hong et al., 2020), VISOR (Darkhalil et al., 2022; Damen et al.,
2022), WoodScape (Yogamani et al., 2019), PIDRay (Wang et al., 2021a), ZeroWaste-f (Bashkirova et al.,
2022), and IBD (Chen et al., 2022). For more detailed information about these datasets, we refer the reader
to Kirillov et al. (2023). In addition to these 23 datasets, we evaluated on frames sampled from 14 video
datasets to assess SAM 2’s performance on images from the video domain. The video datasets used are listed
as follows: Lindenthal Camera Traps (LCT) (Haucke & Steinhage, 2021), VOST (Tokmakov et al., 2022),
LV-VIS (Wang et al., 2023), FBMS (Brox et al., 2010), Virtual KITTI 2 (Cabon et al., 2020), Corsican Fire
Database (CFD) (Toulouse et al., 2017), VIPSeg (Miao et al., 2022), Drosophila Heart OCM (DH OCM)
(Fishman et al., 2023), EndoVis 2018 (Allan et al., 2020), ESD (Huang et al., 2023), UVO (Wang et al.,

2021b), Ego-Exo4d (Grauman et al., 2023), LVOSv2 (Hong et al., 2024), and HT1080WT (Gómez-de Mariscal
et al., 2021). Table 16 has a more detailed description of these datasets. (Some of these datasets are obtained
from the same data source as the zero-shot video datasets in §E.1.1.)

E.3.2 Detailed zero-shot experiments

In this section, we include a more detailed version of the experiments in §6.2. We compare SAM 2 to SAM
and HQ-SAM with different model sizes in Table 15. The main metrics we use for evaluation are the 1- and
5-click mIoU and we categorize the results by the dataset domain.
Table 15 first shows a comparison of the models trained only on images (for the SA task) with different
backbone sizes on both the SA-23 benchmark as well as the 14 newly introduced video datasets. SAM 2
(Hiera-B+) trained only on SA-1B outperforms SAM (ViT-H) on 1-click accuracy, and both SAM (ViT-H)
and HQ-SAM (ViT-H) on 5-click accuracy while being 6x faster. SAM 2 (Hiera-L) further improves the
1-click accuracy by 1 point on average, at the cost of speed. Despite being slower than Hiera-B+, it is still
3.4x faster than SAM (ViT-H) and 1.5x faster than SAM (ViT-B).
The last two rows in Table 15 illustrate the benefits of training with our mix of image and video data, which
boosts the average accuracy to 61.4% across the 23 datasets with the Hiera-B+ backbone. Additionally, we
observe substantial improvements on the video benchmarks of SA-23 as well as the 14 newly introduced video
datasets. We note that we do not scale beyond Hiera-L, but expect better performance for a larger model.

1-click (5-click) mIoU
Model               Data         SA-23 All      SA-23 Image    SA-23 Video    14 new Video    FPS
SAM (ViT-B) SA-1B 55.9 (80.9) 57.4 (81.3) 54.0 (80.4) 54.5 (82.6) 76.7
SAM (ViT-H) SA-1B 58.1 (81.3) 60.8 (82.1) 54.5 (80.3) 59.1 (83.4) 21.7
HQ-SAM (ViT-B) HQSEG-44k 53.9 (72.1) 56.3 (73.9) 50.7 (69.9) 54.5 (75.0) 73.5
HQ-SAM (ViT-H) HQSEG-44k 59.1 (79.8) 61.8 (80.5) 55.7 (78.9) 58.9 (81.6) 21.4
SAM 2 (Hiera-B+) SA-1B 58.9 (81.7) 60.8 (82.1) 56.4 (81.2) 56.6 (83.7) 130.1
SAM 2 (Hiera-L) SA-1B 60.0 (81.8) 62.0 (82.2) 57.4 (81.2) 58.5 (83.8) 61.4
SAM 2 (Hiera-B+) our mix 61.9 (83.6) 63.2 (83.8) 60.3 (83.3) 69.9 (85.9) 130.1
SAM 2 (Hiera-L) our mix 63.5 (83.5) 64.3 (83.7) 62.4 (83.2) 71.2 (85.6) 61.4
Table 15 Zero-shot performance on the Segment Anything (SA) task across a suite of 37 datasets. The table shows
the average 1- and 5- click mIoU of SAM 2 compared to two baselines, categorized by dataset domain. We report
the average metrics on the 23 datasets used by SAM for zero-shot evaluation, as well as the average across 14 newly
introduced zero-shot video benchmarks.

A breakdown of the accuracy across datasets is presented in Fig. 15, where the per-dataset delta in 1-click
mIoU relative to SAM is color-coded to indicate the data type (image or video). Notably, SAM 2 (Hiera-B+)
surpasses SAM on 29 datasets³ by up to 53.9 mIoU, despite using a smaller Hiera-B+ backbone.

F Details on comparison to state-of-the-art in semi-supervised VOS


We provide additional details on the comparison to the previous state-of-the-art in semi-supervised VOS (§7).
We include results from SAM 2 trained only on SA-1B, SA-V and Internal data, for different encoder sizes.
Qualitative comparison: In Fig. 16, we show a comparison between our baseline (Cutie-base+, top row) and
our model (SAM 2, bottom row) when prompted with a mask in the first frame. While the mask prompt in
the first frame only covers the shirt of the person, the masklet predicted by the baseline wrongly propagates
to the whole person. Our model, however, is able to restrict the masklet to the target object.
Quantitative comparison: In Table 17, we compare the performance of our model to previous approaches on
additional semi-supervised VOS metrics. SAM 2 outperforms prior work on all evaluated benchmarks, in all
metrics. Note that unlike these previous approaches, SAM 2 is not specialized on the semi-supervised VOS
³ We note that OVIS is not strictly zero-shot for SAM 2 since its videos are used in MOSE (part of our training data mix).

Per-dataset 1-click mIoU delta of SAM 2 over SAM (positive means SAM 2 is better): DH OCM +53.9, EndoVis +18.7, FBMS +18.3, OVIS +17.1, LVOSv2 +16.0, TrashCan +14.3, GTEA +11.8, EgoHOS +10.1, UVO +9.2, ESD +9.1, DRAM +8.4, PIDRay +7.1, VIPSeg +5.9, DOORS +5.9, LV-VIS +5.9, VISOR +5.3, ZeroWaste-f +5.2, LCT +4.8, iShape +4.7, CFD +4.4, Plittersdorf +3.8, STREETS +3.0, Virtual KITTI +2.9, VOST +2.8, Ego-Exo4D +2.7, IBD +2.7, NDD20 +2.6, BBBC038v1 +2.1, ADE20K +1.0, Hypersim −1.1, LVIS −1.2, WoodScape −1.4, TimberSeg −1.6, Cityscapes −2.1, PPDLS −2.2, HT1080WT −4.2, NDISPark −6.2.

Figure 15 Zero-shot performance of SAM 2 vs SAM on a suite of 37 datasets. The figure shows the per-dataset 1-click (center click) mIoU delta between SAM 2 and SAM. Datasets derived from the video distribution are highlighted in red, while those from the image distribution are highlighted in blue.

Figure 16 Comparison between our baseline (Cutie-base+, top row) and our model (SAM 2, bottom row) when
prompted with a mask in the first frame.

task but capable of more general promptable segmentation. SAM 2 is also not restricted to a specific set of
object classes. The performance of our model on the SA-V benchmark (Table 17a) demonstrates its capability
to segment anything in a video.

dataset (source) | abbreviation | video type | description | annotation type | split | # videos | # masklets | # frames | # masks (sampled)
LV-VIS (Wang et al., 2023) | LV-VIS | Open Vocabulary | Large-scale open-vocabulary video instance segmentation | Dense | Validation | 690 | 2,536 | 15,604 | 54,077
A Drosophila heart optical coherence microscopy database for automatic video segmentation (Fishman et al., 2023) | DH OCM | Microscopy; heart | Segmentation of a fruit fly heart in optical coherence microscopy videos | Sparse | All | 213 | 213 | 608,000 | 607,158
Video Object Segmentation under Transformations (Tokmakov et al., 2022) | VOST | Deformation | Video object segmentation with emphasis on shape transformations | Dense | Validation | 635 | 882 | 67,004 | 89,722
Cityscapes-VPS (Cordts et al., 2016; Kim et al., 2020) | Cityscapes-VPS | Driving | Panoptic segmentation for the Cityscapes driving dataset | Sparse | Validation (instance) | 209 | 1,372 | 4,259 | 4,579
Corsican Fire Database (Toulouse et al., 2017) | CFD | Wildfire | Segmentation of wildfires | Sparse | All | 5 | 5 | 541 | 541
Partial and Unusual Masks for Video Object Segmentation (Bekuzarov et al., 2023) | PUMaVOS | Parts | Video object segmentation with a focus on parts and practical use cases | Dense | All | 24 | 26 | 21,187 | 21,485
EPIC-KITCHENS VISOR (Darkhalil et al., 2022) | VISOR | Egocentric | Video object segmentation benchmark containing egocentric videos of cooking with an emphasis on segmenting active objects | Sparse | Validation | 921 | 921 | 736,030 | 4,426
HT1080WT cells embedded in 3D collagen type I matrices (Gómez-de Mariscal et al., 2021) | HT1080WT | Microscopy; cells | Timelapse videos of HT1080WT cell movement | Sparse | All | 60 | 150 | 1,010 | 2,694
Freiburg-Berkeley Motion Segmentation Dataset (Brox et al., 2010) | FBMS | Moving Object | Precise segmentation of moving objects | Sparse | All | 45 | 70 | 9,734 | 755
Virtual KITTI 2 (Cabon et al., 2020) | Virtual KITTI 2 | Synthetic; Driving | Synthetic driving videos generated by a game engine that recreate real-world KITTI videos | Dense | All (from video angle Camera_0) | 996 | 1,638 | 109,368 | 162,708
EndoVis 2018 (Allan et al., 2020) | EndoVis 2018 | Endoscopic video; surgery | Segmentation of medical tools in endoscopic videos | Dense | All | 15 | 29 | 2,325 | 4,314
Lindenthal Camera Traps (Haucke & Steinhage, 2021) | LCT | Stereo | Wildlife videos captured using stereo cameras | Sparse | All | 12 | 12 | 4,012 | 412
LVOSv2 (Hong et al., 2024) | LVOSv2 | Long videos | Long-term video object segmentation benchmark, on average 1.14 minutes | Dense | Validation | 136 | 225 | 64,523 | 91,510
UVO (Wang et al., 2021b) | UVO | Open World | Open-world instance segmentation of all objects in a video | Dense | Validation | 54 | 311 | 4,860 | 26,747
EgoExo4d (Grauman et al., 2023) | EgoExo4d | Egocentric | Egocentric videos of participants completing skilled activities | Sparse | Validation (egocentric cameras) | 1,185 | 1,185 | 327,080 | 9,035
VIPSeg (Miao et al., 2022) | VIPSeg | Panoptic | Large-scale, real-world scenarios for video panoptic segmentation | Dense | Validation | 152 | 1,457 | 3,416 | 30,408
Event-based Segmentation Dataset (Huang et al., 2023) | ESD | Clutter | Tabletop object segmentation in an indoor cluttered environment | Dense | All | 135 | 814 | 13,325 | 78,243

Table 16 Video segmentation datasets used for zero-shot evaluation.

Figure 17 Examples from SAM 2 zero-shot video benchmark suite.

SA-V val SA-V test
Method J &F J F J &F J F
STCN (Cheng et al., 2021a) 61.0 57.4 64.5 62.5 59.0 66.0
SwinB-AOT-L (Yang et al., 2021b) 51.1 46.4 55.7 50.3 46.0 54.6
SwinB-DeAOT-L (Yang & Yang, 2022) 61.4 56.6 66.2 61.8 57.2 66.3
RDE (Li et al., 2022a) 51.8 48.4 55.2 53.9 50.5 57.3
XMem (Cheng & Schwing, 2022) 60.1 56.3 63.9 62.3 58.9 65.8
SimVOS-B (Wu et al., 2023b) 44.2 40.0 48.3 44.1 40.5 47.7
DEVA (Cheng et al., 2023b) 55.4 51.5 59.2 56.2 52.4 60.1
Cutie-base (Cheng et al., 2023a) 60.7 57.7 63.7 62.7 59.7 65.7
Cutie-base+ (Cheng et al., 2023a) 61.3 58.3 64.4 62.8 59.8 65.8
SAM 2 (Hiera-B+) 73.6 70.4 76.9 74.1 70.6 77.5
SAM 2 (Hiera-L) 75.6 72.3 78.9 77.6 74.0 81.1
SAM 2 (Hiera-T)‡ 73.7 70.3 77.1 75.0 71.5 78.5
SAM 2 (Hiera-S)‡ 72.7 69.4 76.0 74.9 71.4 78.4
SAM 2 (Hiera-B+)‡ 75.3 71.8 78.7 74.7 71.2 78.2
SAM 2 (Hiera-L)‡ 76.1 72.9 79.2 76.0 72.6 79.3

(a) Comparisons between SAM 2 and previous work on our SA-V benchmark for the semi-supervised VOS task. We evaluated prior works on SA-V using their open-sourced code and checkpoints.
LVOS val
Method                              J&F    J      F
DEVA (Cheng et al., 2023b)          55.9   51.1   60.7
DDMemory (Hong et al., 2023)        60.7   55.0   66.3
Cutie-base (Cheng et al., 2023a)    66.0   61.3   70.6
SAM 2 (Hiera-B+)                    74.9   70.2   79.6
SAM 2 (Hiera-L)                     76.1   71.6   80.6
SAM 2 (Hiera-T)‡                    73.6   68.8   78.3
SAM 2 (Hiera-S)‡                    73.5   68.6   78.4
SAM 2 (Hiera-B+)‡                   76.2   71.6   80.7
SAM 2 (Hiera-L)‡                    77.9   73.1   82.7

(b) Comparisons between SAM 2 and previous work on the LVOS (Hong et al., 2023) benchmark.

LVOSv2 val
Method                              J&F    Js     Fs     Ju     Fu
STCN (Cheng et al., 2021a)          60.6   57.2   64.0   57.5   63.8
RDE (Li et al., 2022a)              62.2   56.7   64.1   60.8   67.2
SwinB-DeAOT (Yang & Yang, 2022)     63.9   61.5   69.0   58.4   66.6
XMem (Cheng & Schwing, 2022)        64.5   62.6   69.1   60.6   65.6
SAM 2 (Hiera-B+)                    75.8   78.9   85.3   65.1   73.8
SAM 2 (Hiera-L)                     78.1   78.9   85.2   69.7   78.6
SAM 2 (Hiera-T)‡                    75.3   75.2   81.8   67.7   76.4
SAM 2 (Hiera-S)‡                    76.4   77.1   84.0   67.5   77.0
SAM 2 (Hiera-B+)‡                   75.8   77.0   83.4   67.0   75.6
SAM 2 (Hiera-L)‡                    79.8   80.0   86.6   71.6   81.1

(c) Comparisons between SAM 2 and previous work on the LVOSv2 (Hong et al., 2024) benchmark. We report the performance of prior works as evaluated by the LVOSv2 authors.

Method | MOSE val (J&F, J, F) | DAVIS17 val (J&F, J, F) | DAVIS17 test (J&F, J, F) | YTVOS19 val (G, Js, Fs, Ju, Fu)
STCN (Cheng et al., 2021a) 52.5 48.5 56.6 85.4 82.2 88.6 76.1 72.7 79.6 82.7 81.1 85.4 78.2 85.9
SwinB-AOT-L (Yang et al., 2021b) 59.4 55.5 63.2 85.4 82.4 88.4 81.2 77.3 85.1 84.5 84.0 88.8 78.4 86.7
SwinB-DeAOT-L (Yang & Yang, 2022) 59.9 55.7 64.0 86.2 83.1 89.2 82.8 78.9 86.7 86.1 85.3 90.2 80.4 88.6
RDE (Li et al., 2022a) 46.8 42.4 51.3 84.2 80.8 87.5 77.4 73.6 81.2 81.9 81.1 85.5 76.2 84.8
XMem (Cheng & Schwing, 2022) 59.6 55.4 63.7 86.0 82.8 89.2 79.6 76.1 83.0 85.6 84.1 88.5 81.0 88.9
SimVOS-B (Wu et al., 2023b) - - - 88.0 85.0 91.0 80.4 76.1 84.6 84.2 83.1 - 79.1 -
JointFormer (Zhang et al., 2023b) - - - 90.1 87.0 93.2 88.1 84.7 91.6 87.4 86.5 90.9 82.0 90.3
ISVOS (Wang et al., 2022) - - - 88.2 84.5 91.9 84.0 80.1 87.8 86.3 85.2 89.7 81.0 89.1
DEVA (Cheng et al., 2023b) 66.0 61.8 70.3 87.0 83.8 90.2 82.6 78.9 86.4 85.4 84.9 89.4 79.6 87.8
Cutie-base (Cheng et al., 2023a) 69.9 65.8 74.1 87.9 84.6 91.1 86.1 82.4 89.9 87.0 86.0 90.5 82.0 89.6
Cutie-base+ (Cheng et al., 2023a) 71.7 67.6 75.8 88.1 85.5 90.8 88.1 84.7 91.4 87.5 86.3 90.6 82.7 90.5
SAM 2 (Hiera-B+) 75.8 71.8 79.9 90.9 87.7 94.1 88.3 85.0 91.5 88.4 87.1 91.7 83.3 91.4
SAM 2 (Hiera-L) 77.2 73.3 81.2 91.6 88.3 94.9 89.0 85.8 92.2 89.1 87.5 92.0 84.5 92.4
SAM 2 (Hiera-T)‡ 70.9 66.7 75.2 89.2 85.7 92.7 86.7 83.3 90.0 87.4 85.5 90.0 83.0 91.2
SAM 2 (Hiera-S)‡ 71.5 67.3 75.6 88.8 85.4 92.2 86.3 82.8 89.9 87.5 85.7 90.1 83.2 91.2
SAM 2 (Hiera-B+)‡ 72.8 68.8 76.9 88.9 85.5 92.2 86.7 83.2 90.1 87.9 86.2 90.7 83.2 91.4
SAM 2 (Hiera-L)‡ 74.6 70.6 78.6 89.3 85.7 92.8 88.1 84.6 91.6 88.6 86.6 91.3 84.2 92.2

(d) Comparisons between SAM 2 and previous work on the semi-supervised VOS task.

Table 17 Detailed comparisons between SAM 2 and previous work on various benchmarks (‡ : a version of the model
trained on SA-1B, SA-V, and our internal dataset as described in §5.2).

G Model, data and annotation cards

G.1 Model card

Model Overview
Name SAM 2 (Segment Anything Model 2)
Version 1.0
Date 2024
Organization Meta FAIR
Model type Promptable segmentation model
Architecture See Section 4
Repository https://github.com/facebookresearch/segment-anything-2
License Apache 2.0

Intended Use
Primary intended users SAM 2 was designed as a unified model for promptable video and image segmenta-
tion tasks. The model was primarily developed for research use cases. SAM 2 is
released under an Apache 2.0 license.
Out-of-scope use cases See Ethical considerations and license for restrictions.
Caveats and recommendations See Appendix B for limitations.

Relevant Factors
Groups SAM 2 is class agnostic and was designed for promptable image and video segmen-
tation. It can segment and track any object.
Instrumentation and environment SAM 2 was evaluated across a variety of types of video and image data. The video
benchmark suite included domains such as driving data, microscopy, egocentric
video, robotic surgery. See Table 16 for descriptions of the benchmarks and
Figure 17 for example frames. SAM 2 was evaluated on the same suite of image
benchmarks as Kirillov et al. (2023), which covers domains including underwater
images, paintings, fish-eye images.

Metrics
We evaluate the performance of SAM 2 using the following metrics:
J &F : We evaluate performance using J &F (Pont-Tuset et al., 2017) for the
promptable video segmentation and semi-supervised VOS tasks.
G: We use G for evaluation on YTVOS 2019 for the semi-supervised VOS task.
mIoU : We evaluate performance using mIoU for the promptable image segmentation
task.

Evaluation Data
Data sources See Appendix E

Training Data
Data source SAM 2 was trained on the SA-V dataset alongside internally available licensed
video data. See Section 5 of the main text for more details and Appendix G.2 for
the SA-V dataset data card.

Ethical Considerations
Data See Section 5 for more details about the SAM 2 training data. In Section 5.2 we
show a geographic distribution of the videos and demographic distribution of the
crowdworkers who collected the videos in the SA-V dataset.
Cost and impact of compute The released SAM 2 was trained on 256 A100 GPUs for 108 hours. This corresponds
to 12165.12 kWH and an estimated emissions of 3.89 metric tons of CO2e (Patterson
et al., 2021; Lacoste et al., 2019). The emissions from training the released SAM 2
are equivalent to ∼10k miles driven by an average gasoline-powered passenger
vehicle (Agency, 2022).
Risks and harms In Section 6.1.3 of the main text we analyze SAM 2 performance on people across
demographic groups. When using SAM 2 in new settings, we suggest that re-
searchers perform their own fairness evaluation for SAM 2 specific to their use case.
Use cases We implore users to use their best judgement.

Table 18 Model card for SAM 2 following the structure in Mitchell et al. (2019)

G.2 Dataset card for SA-V dataset
Motivation

1. For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled?
Please provide a description. The dataset was designed for the PVS task. The contributions of our dataset to the vision community
are: (1) The dataset, composed of 50.9K videos and 642.6K masklets, is the largest video segmentation dataset publicly available
today (see 5.2 for comparisons to current VOS datasets) (2) The dataset is available under a Creative Commons Attribution 4.0
International Public License at https://ai.meta.com/datasets/segment-anything-video/, (3) The data is a more geographically diverse,
publicly available, video segmentation dataset than its predecessors.

2. Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
The dataset was created by Meta FAIR. The underlying videos were collected via a contracted third party company.

3. Who funded the creation of the dataset? The dataset was funded by Meta FAIR.

4. Any other comments? No.

Composition

1. What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of
instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.
All of the instances in the dataset are videos. Subject matter diversity was encouraged and no specific themes were applied during
video collection. Common themes of the video include: locations, objects, scenes. All the videos are distinct, however there are some
sets of videos that were taken of the same subject matter.

2. How many instances are there in total (of each type, if appropriate)? There are 50.9K videos.

3. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the
dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so,
please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why
not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable). While the dataset contains
all possible instances, reviewers were advised to refuse to annotate content containing explicit imagery.

4. What data does each instance consist of ? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide
a description. Each instance is a video.

5. Is there a label or target associated with each instance? If so, please provide a description. Each video is annotated with masklets
that track objects throughout the video. There are no categories or text associated with the masklets. The data was annotated at 6
FPS. On average, there are 3.8 manual masklets and 8.9 auto masklets per video, and there are 642.6K masklets in total (a quick consistency check of these counts is sketched after this list).

6. Is any information missing from individual instances? If so, please provide a description, explaining why this information is
missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g.,
redacted text. No.

7. Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please
describe how these relationships are made explicit. No.

8. Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. For manual masklets,
human errors may exist; for example, annotators may miss a frame that needed to be checked or fixed. For auto masklets, as SAM 2 is used
to generate them, model errors such as inconsistencies in the masklets may exist.

9. Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it
links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there
official archival versions of the complete dataset (e.g., including the external resources as they existed at the time the dataset was
created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a
dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links
or other access points, as appropriate. The dataset is self-contained.

10. Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient
confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description. No.

11. Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
If so, please describe why. We have three safety measures to prevent objectionable content: (1) The video-collecting crowdworkers
were instructed not to record videos that might contain objectionable content (e.g., graphic imagery, nudity, or otherwise inappropriate
content). (2) The expert annotators who annotated the videos were instructed to flag and reject videos if objectionable
content was present. (3) Reports about videos in the dataset can be submitted to segment-anything@meta.com.

12. Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified
and provide a description of their respective distributions within the dataset. The dataset does not identify any subpopulations of
the people in the videos. The demographics of the crowdworkers who collected the videos in the dataset are presented in 5.2.

13. Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with
other data) from the dataset? If so, please describe how. Videos were subjected to a face blurring model. Reports about videos in
the dataset can be submitted to segment-anything@meta.com.

14. Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual
orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic
data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.
The dataset is not focused on data that may be considered sensitive. Reports about videos in the dataset can be submitted to
segment-anything@meta.com.

15. Any other comments? No.
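The per-video averages and the overall masklet count quoted in this Composition section are mutually consistent, as the short check below illustrates; all numbers are copied from this card, and the small residual is what one would expect from the per-video averages being rounded to one decimal place.

```python
# Consistency check of the SA-V statistics quoted in this dataset card.
videos = 50_900                # 50.9K videos
manual_per_video = 3.8         # average manual masklets per video (rounded)
auto_per_video = 8.9           # average auto masklets per video (rounded)
reported_total = 642_600       # 642.6K masklets in total

estimated_total = videos * (manual_per_video + auto_per_video)
print(f"estimated total masklets: {estimated_total:,.0f}")  # ~646,430
print(f"reported total masklets:  {reported_total:,}")      # 642,600
# The ~0.6% gap is consistent with rounding of the reported per-video averages.
```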

Collection Process

1. How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings),
reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based
guesses for age or language)? If the data was reported by subjects or indirectly inferred/derived from other data, was the data
validated/verified? If so, please describe how. The released masklets associated with each video were collected using two methods.
(1) SAM 2 assisted manual annotation (2) automatically generated by SAM 2 and verified by annotators.

2. What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation,
software programs, software APIs)? How were these mechanisms or procedures validated? The videos in the dataset were collected
via a contracted third-party vendor. They are videos taken by crowdworkers; the recording equipment used is unknown to us.

3. If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling
probabilities)? N/A.

4. Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g.,
how much were crowdworkers paid)? (1) The videos in the dataset were collected via a contracted third-party vendor. They are
videos taken by crowdworkers who were compensated with an hourly wage set by the vendor. (2) The manually collected masklets in
the dataset were collected by annotators via another third-party vendor. Annotators were compensated with an hourly wage set by
the vendor.

5. Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the
instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the
instances was created. The videos were filmed between November 2023 and March 2024. The masklet annotations were collected
between April 2024 and July 2024.

6. Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these
review processes, including the outcomes, as well as a link or other access point to any supporting documentation. If the dataset
does not relate to people, you may skip the remaining questions in this section. The project underwent an internal review process.

7. Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g. websites)?
We contracted with third-party vendors to collect the videos and to generate or review annotations.

8. Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other
information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of
the notification itself. The videos were collected by crowdworkers via a contracted third-party vendor. The crowdworkers agreed to
consent forms.

9. Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or
other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the
exact language to which the individuals consented. The videos were collected via a contracted third party, which provided appropriate
representations regarding the collection of any notices and consents as required from individuals.

10. If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or
for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).
Pursuant to the contract, the contracted third party collected consents and provided an opportunity for consent revocation.

11. Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been
conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to
any supporting documentation. See details in 6.1.3.

12. Any other comments? No.

Preprocessing / Cleaning / Labeling

1. Was any preprocessing / cleaning / labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging,
SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may
skip the remaining questions in this section. The videos were re-sampled to 24 fps and converted to mp4 format (a minimal conversion sketch follows this list).

2. Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so,
please provide a link or other access point to the “raw” data. No.
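For readers who want to apply a comparable preprocessing step to their own videos, the sketch below shows one plausible way to re-sample a clip to 24 fps and re-encode it as mp4 by invoking ffmpeg from Python. This is an illustrative assumption, not the pipeline actually used to prepare SA-V; the file paths and the choice to drop audio are hypothetical.

```python
# Illustrative 24 fps re-sampling + mp4 conversion via ffmpeg (assumed tooling;
# not the actual SA-V preprocessing pipeline). Requires ffmpeg on the PATH.
import subprocess
from pathlib import Path

def resample_to_24fps_mp4(src: Path, dst: Path) -> None:
    """Re-encode `src` as an H.264 mp4 at 24 fps, writing the result to `dst`."""
    subprocess.run(
        [
            "ffmpeg",
            "-y",               # overwrite the output file if it already exists
            "-i", str(src),     # input video
            "-r", "24",         # force a 24 fps output frame rate
            "-c:v", "libx264",  # standard H.264 encoder for mp4 output
            "-an",              # drop audio (hypothetical choice for this sketch)
            str(dst),
        ],
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    resample_to_24fps_mp4(Path("raw_clip.mov"), Path("clip_24fps.mp4"))
```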

Uses

1. Has the dataset been used for any tasks already? If so, please provide a description. The dataset has been used to train and
evaluate SAM 2.

2. What (other) tasks could the dataset be used for? The data could be used for VOS, iVOS, or PVS tasks. If frames are sampled from
the videos, the dataset can be used for the image segmentation task.

3. Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might
impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in
unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other risks or harms (e.g., legal risks,
financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or
harms? We provide an analysis of the geography and crowdworker demographics of our dataset in 5.2. While we believe our dataset to
be more representative on these factors than most publicly available datasets of its kind at this time, we acknowledge that we do
not have parity across all geographic and demographic groups, and we encourage users of the dataset to be mindful of any potential
biases models may learn from this dataset.

4. Are there tasks for which the dataset should not be used? If so, please provide a description. No. Full terms of use for the dataset
can be found at https://ai.meta.com/datasets/segment-anything-video-downloads/.

5. Any other comments? No.

Distribution

1. Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which
the dataset was created? If so, please provide a description. The dataset will be available under the permissive Creative Commons
Attribution 4.0 International Public License.

2. How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier
(DOI)? The dataset is available at https://ai.meta.com/datasets/segment-anything-video/.

3. When will the dataset be distributed? The dataset will be distributed in July 2024.

4. Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use
(ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any
relevant licensing terms or ToU, as well as any fees associated with these restrictions. Yes, the dataset will be available under the
Creative Commons Attribution 4.0 International Public License. The license agreement and terms of use for the dataset can be found
at https://ai.meta.com/datasets/segment-anything-video-downloads/. Users must agree to the terms of use before downloading or
using the dataset.

5. Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe
these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well
as any fees associated with these restrictions. Full terms of use and restrictions on use of the SA-V dataset can be found at
https://ai.meta.com/datasets/segment-anything-video-downloads/.

6. Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these
restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation. The license and
restrictions on use of the SA-V dataset can be found at https://ai.meta.com/datasets/segment-anything-video-downloads/.

7. Any other comments? No.

Maintenance

1. Who will be supporting/hosting/maintaining the dataset? The dataset will be hosted at https://ai.meta.com/datasets/segment-anything-video/ and maintained by Meta FAIR.

2. How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Please email segment-anything@meta.com.

3. Is there an erratum? If so, please provide a link or other access point. No.

4. Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how
often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)? Updates may be made
in response to inbound requests received at segment-anything@meta.com.

5. If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the
individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe
these limits and explain how they will be enforced. There are no limits on data retention.

6. Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe
how its obsolescence will be communicated to dataset consumers. No. If updates are made to the dataset, previous versions will not
continue to be hosted.

7. If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide
a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for
communicating/distributing these contributions to dataset consumers? If so, please provide a description. We encourage further
annotations for SA-V, but these will not be validated/verified or supported/hosted/maintained by Meta.

8. Any other comments? No.

G.3 Data annotation card


Task Formulation

1. At a high level, what are the subjective aspects of your task? Selecting objects to mask and track in a video is inherently a subjective
task, and annotators might differ in which objects they decide to mask.

2. What assumptions do you make about annotators? We assume our annotators understand the PVS task and are well trained on
video-related tasks. Our annotators worked full-time on our annotation task. This made it possible to train the annotators by sharing
feedback on a regular basis.

3. How did you choose the specific wording of your task instructions? What steps, if any, were taken to verify the clarity of task
instructions and wording for annotators? (1) The task instructions included visual examples (images and videos) to provide clarity.
(2) Annotators were well trained before working on production queues. (3) The research team shared feedback daily and met with the
annotators weekly for Q&A sessions.

4. What, if any, risks did your task pose for annotators and were they informed of the risks prior to engagement with the task?
Annotators were informed to reject objectionable videos.

5. What are the precise instructions that were provided to annotators? See details in 11 for the annotation instructions.

Selecting Annotations

1. Are there certain perspectives that should be privileged? If so, how did you seek these perspectives out? We chose to work with
annotators with previous video annotation experience.

2. Are there certain perspectives that would be harmful to include? If so, how did you screen these perspectives out? No.

3. Were sociodemographic characteristics used to select annotators for your task? If so, please detail the process. For masklet
annotations, sociodemographic characteristics were not used to select the annotators. For video collection, we emphasized the
importance of diversity among the crowdworkers to our third-party vendor. While it was not a strict requirement, we encouraged the
inclusion of a diverse group of crowdworkers to enrich the data collection process with a wide range of perspectives. This approach
aimed to naturally incorporate diversity without imposing strict selection based on sociodemographic factors.

4. If you have any aggregated socio-demographic statistics about your annotator pool, please describe. Do you have reason to
believe that sociodemographic characteristics of annotators may have impacted how they annotated the data? Why or why not?
Aggregated socio-demographic statistics about the crowdworkers who collected the videos are presented in 5.2.

5. Consider the intended context of use of the dataset and the individuals and communities that may be impacted by a model trained
on this dataset. Are these communities represented in your annotator pool? The SA-V dataset is a geographically diverse, publicly
available, video segmentation dataset, as discussed in 5.2. In addition, we analyze the responsible AI axes of a model trained on the
dataset, as discussed in 6.1.3.

Platform and Infrastructure Choices

1. What annotation platform did you utilize? At a high level, what considerations informed your decision to choose this platform?
Did the chosen platform sufficiently meet the requirements you outlined for annotator pools? Are any aspects not covered? We
used an internal annotation platform.

2. What, if any, communication channels did your chosen platform offer to facilitate communication with annotators? How did this
channel of communication influence the annotation process and/or resulting annotations? The research team shared feedback daily
and met with the annotators weekly to align on the task instructions and expectations and to hold Q&A sessions. Outside of those
sessions, annotators had access to a spreadsheet and chat group to facilitate communication with the research team.

3. How much were annotators compensated? Did you consider any particular pay standards, when determining their compensation?
If so, please describe. (1) The video collecting crowdworkers were compensated with an hourly wage set by the vendor. (2) Annotators
were compensated with an hourly wage set by the vendor.

Dataset Analysis and Evaluation

1. How do you define the quality of annotations in your context, and how did you assess the quality in the dataset you constructed?
Annotators were required to complete training before moving to production queues. They first followed a 2-day training session led by
the vendor and were then asked to annotate jobs from a training queue. Annotators were able to move from training to production
after the vendor Q&A team or the research team reviewed their work and assessed its quality. On average, annotators spent 1–2 weeks
in training before moving to production. Similarly, the vendor Q&A team and the research team manually reviewed the production queues’
annotations and shared feedback daily.

2. Have you conducted any analysis on disagreement patterns? If so, what analyses did you use and what were the major findings?
Did you analyze potential sources of disagreement? The disagreement patterns were shared daily and weekly during feedback and
Q&A sessions.

3. How do the individual annotator responses relate to the final labels released in the dataset? The final labels are the individual
annotator responses after data cleaning and post-processing.

Dataset Release and Maintenance

1. Do you have reason to believe the annotations in this dataset may change over time? Do you plan to update your dataset? No.

2. Are there any conditions or definitions that, if changed, could impact the utility of your dataset? No.

3. Will you attempt to track, impose limitations on, or otherwise influence how your dataset is used? If so, how? The SA-V dataset
is released under a permissive CC BY 4.0 license.

4. Were annotators informed about how the data is externalized? If changes to the dataset are made, will they be informed? No.

5. Is there a process by which annotators can later choose to withdraw their data from the dataset? If so, please detail. No.

References
United States Environmental Protection Agency. Greenhouse gas equivalencies calculator.
https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator, 2022.
Max Allan, Satoshi Kondo, Sebastian Bodenstedt, Stefan Leger, Rahim Kadkhodamohammadi, Imanol Luengo, Felix
Fuentes, Evangello Flouty, Ahmed Mohammed, Marius Pedersen, et al. 2018 robotic scene segmentation challenge.
arXiv preprint arXiv:2001.11190, 2020.
Ali Athar, Jonathon Luiten, Paul Voigtlaender, Tarasha Khurana, Achal Dave, Bastian Leibe, and Deva Ramanan.
Burst: A benchmark for unifying object recognition, segmentation and tracking in video. WACV, pp. 1674–1683,
2022.
Xue Bai and Guillermo Sapiro. A geodesic framework for fast interactive image and video segmentation and matting.
In ICCV, 2007.
Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli,
Sarah Adel Bargal, and Kate Saenko. ZeroWaste dataset: Towards deformable object segmentation in cluttered
scenes. CVPR, 2022.
Maksym Bekuzarov, Ariana Bermudez, Joon-Young Lee, and Hao Li. Xmem++: Production-level video segmentation
from few annotated frames. In ICCV, pp. 635–644, 2023.
Goutam Bhat, Felix Järemo Lawin, Martin Danelljan, Andreas Robinson, Michael Felsberg, Luc Van Gool, and Radu
Timofte. Learning what to learn for video object segmentation. ECCV, abs/2003.11540, 2020.
Daniel Bolya, Chaitanya Ryali, Judy Hoffman, and Christoph Feichtenhofer. Window attention is bugged: How not to
interpolate position embeddings. ICLR, 2023.
T Brox, J Malik, and P Ochs. Freiburg-berkeley motion segmentation dataset (fbms-59). In ECCV, volume 1, pp. 9,
2010.
Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv preprint arXiv:2001.10773, 2020.
Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot
video object segmentation. CVPR, pp. 5320–5329, 2016.
Sergi Caelles, Alberto Montes, Kevis-Kokitsi Maninis, Yuhua Chen, Luc Van Gool, Federico Perazzi, and Jordi
Pont-Tuset. The 2018 davis challenge on video object segmentation. arXiv preprint arXiv:1803.00557, 2018.
Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The
2019 davis challenge on vos: Unsupervised multi-object segmentation. arXiv preprint arXiv:1905.00737, 2019.
Juan C. Caicedo, Allen Goodman, Kyle W. Karhohs, Beth A. Cimini, Jeanelle Ackerman, Marzieh Haghighi, CherKeng
Heng, Tim Becker, Minh Doan, Claire McQuin, Mohammad Rohban, Shantanu Singh, and Anne E. Carpenter.
Nucleus segmentation across imaging experiments: the 2018 data science bowl. Nature Methods, 2019.
Jiazhou Chen, Yanghui Xu, Shufang Lu, Ronghua Liang, and Liangliang Nan. 3D instance segmentation of MVS
buildings. IEEE Transactions on Geoscience and Remote Sensing, 2022.
Keyan Chen, Chenyang Liu, Hao Chen, Haotian Zhang, Wenyuan Li, Zhengxia Zou, and Zhenwei Shi. Rsprompter:
Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Transactions
on Geoscience and Remote Sensing, 2024.
Yuhua Chen, Jordi Pont-Tuset, Alberto Montes, and Luc Van Gool. Blazingly fast video object segmentation with
pixel-wise metric learning. CVPR, pp. 1189–1198, 2018.
Ho Kei Cheng and Alexander G Schwing. Xmem: Long-term video object segmentation with an atkinson-shiffrin
memory model. In ECCV, pp. 640–658. Springer, 2022.
Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Rethinking space-time networks with improved memory coverage
for efficient video object segmentation. In NeurIPS, 2021a.
Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Modular interactive video object segmentation: Interaction-to-mask,
propagation and difference-aware fusion. In CVPR, 2021b.
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into
video object segmentation. In arXiv, 2023a.

Ho Kei Cheng, Seoung Wug Oh, Brian L. Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with
decoupled video segmentation. ICCV, pp. 1316–1326, 2023b.
Yangming Cheng, Liulei Li, Yuanyou Xu, Xiaodi Li, Zongxin Yang, Wenguan Wang, and Yi Yang. Segment and track
anything. arXiv preprint arXiv:2305.06558, 2023c.
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In
EMNLP, 2014. arXiv:1406.1078.
Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and Giuseppe Amato. Domain adaptation for traffic
density estimation. International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory
and Applications, 2021.
Luca Ciampi, Carlos Santiago, Joao Costeira, Claudio Gennaro, and Giuseppe Amato. Night and day instance
segmented park (NDISPark) dataset: a collection of images taken by day and by night for vehicle detection,
segmentation and counting in parking areas. Zenodo, 2022.
Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. ELECTRA: Pre-training text encoders as
discriminators rather than generators. In ICLR, 2020.
Nadav Cohen, Yael Newman, and Ariel Shamir. Semantic segmentation in art paintings. Computer Graphics Forum,
2022.
Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke,
Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, pp.
3213–3223, 2016.
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide
Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection,
pipeline and challenges for EPIC-KITCHENS-100. IJCV, 2022.
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint
arXiv:2307.08691, 2023.
Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, and
Dima Damen. Epic-kitchens visor benchmark: Video segmentations and object relations. NeurIPS, 35:13745–13758,
2022.
Thanos Delatolas, Vicky Kalogeiton, and Dim P Papadopoulos. Learning the what and how of annotation in video
object segmentation. In WACV, pp. 6951–6961, 2024.
Ruining Deng, Can Cui, Quan Liu, Tianyuan Yao, Lucas W Remedios, Shunxing Bao, Bennett A Landman, Lee E
Wheless, Lori A Coburn, Keith T Wilson, et al. Segment anything model (sam) for digital pathology: Assess
zero-shot segmentation on whole slide imaging. arXiv preprint arXiv:2304.04155, 2023.
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H. S. Torr, and Song Bai. Mose: A new dataset for video
object segmentation in complex scenes. ICCV, pp. 20167–20177, 2023.
Qingnan Fan, Fan Zhong, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. Jumpcut: non-successive mask
transfer and interpolation for video cutout. ACM Transactions on Graphics, 2015.
Alireza Fathi, Xiaofeng Ren, and James M. Rehg. Learning to recognize objects in egocentric activities. CVPR, 2011.
Matthew Fishman, Abigail Matt, Fei Wang, Elena Gracheva, Jiantao Zhu, Xiangping Ouyang, Andrey Komarov,
Yuxuan Wang, Hongwu Liang, and Chao Zhou. A drosophila heart optical coherence microscopy dataset for
automatic video segmentation. Scientific Data, 10(1):886, 2023.
Jean-Michel Fortin, Olivier Gamache, Vincent Grondin, François Pomerleau, and Philippe Giguère. Instance segmentation
for autonomous log grasping in forestry operations. IROS, 2022.
Estibaliz Gómez-de Mariscal, Hasini Jayatilaka, Özgün Çiçek, Thomas Brox, Denis Wirtz, and Arrate Muñoz-
Barrutia. Search for temporal cell segmentation robustness in phase-contrast microscopy videos. arXiv preprint
arXiv:2112.08817, 2021.
Raghav Goyal, Wan-Cyuan Fan, Mennatullah Siam, and Leonid Sigal. Tam-vt: Transformation-aware multi-scale
video transformer for segmentation and tracking. arXiv preprint arXiv:2312.08514, 2023.

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar
Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al. Ego-exo4d: Understanding skilled human activity
from first-and third-person perspectives. arXiv preprint arXiv:2311.18259, 2023.
Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. CVPR,
2019.
Timm Haucke and Volker Steinhage. Exploiting depth information for wildlife monitoring. arXiv preprint
arXiv:2102.05607, 2021.
Timm Haucke, Hjalmar S. Kühl, and Volker Steinhage. SOCRATES: Introducing depth in visual wildlife monitoring
using stereo vision. Sensors, 2022.
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable
vision learners. In CVPR, 2022.
Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer.
arXiv preprint arXiv:2403.13298, 2024.
Yuk Heo, Yeong Jun Koh, and Chang-Su Kim. Interactive video object segmentation using global and local transfer
modules. In ECCV, 2020.
Namdar Homayounfar, Justin Liang, Wei-Chiu Ma, and Raquel Urtasun. Videoclick: Video Object Segmentation with
a Single Click. arXiv preprint arXiv:2101.06545, 2021.
Jungseok Hong, Michael Fulton, and Junaed Sattar. TrashCan: A semantically-segmented dataset towards visual
detection of marine debris. arXiv:2007.08097, 2020.
Lingyi Hong, Wenchao Chen, Zhongying Liu, Wei Zhang, Pinxue Guo, Zhaoyu Chen, and Wenqiang Zhang. Lvos: A
benchmark for long-term video object segmentation. In ICCV, pp. 13480–13492, 2023.
Lingyi Hong, Zhongying Liu, Wenchao Chen, Chenzhi Tan, Yuang Feng, Xinyu Zhou, Pinxue Guo, Jinglun Li, Zhaoyu
Chen, Shuyong Gao, et al. Lvos: A benchmark for large-scale long-term video object segmentation. arXiv preprint
arXiv:2404.19326, 2024.
Yuan-Ting Hu, Jia-Bin Huang, and Alexander G. Schwing. MaskRNN: Instance level video object segmentation. In
NeurIPS, 2018a.
Yuan-Ting Hu, Jia-Bin Huang, and Alexander G. Schwing. VideoMatch: Matching based video object segmentation.
ECCV, abs/1809.01123, 2018b.
Xiaoqian Huang, Kachole Sanket, Abdulla Ayyad, Fariborz Baghaei Naeini, Dimitrios Makris, and Yahya Zweiri. A
neuromorphic dataset for object segmentation in indoor cluttered environment. arXiv preprint arXiv:2302.06301,
2023.
Lei Ke, Mingqiao Ye, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu, et al. Segment anything in high
quality. NeurIPS, 36, 2024.
Dahun Kim, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Video panoptic segmentation. In CVPR, 2020.
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer
Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.
Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of
machine learning. arXiv preprint arXiv:1910.09700, 2019.
Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M. Rehg. Video segmentation by tracking many
figure-ground segments. CVPR, pp. 2192–2199, 2013.
Mingxing Li, Liucheng Hu, Zhiwei Xiong, Bang Zhang, Pan Pan, and Dong Liu. Recurrent dynamic embedding for
video object segmentation. CVPR, pp. 1322–1331, 2022a.
Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object
detection. In ECCV, 2022b.
Yin Li, Zhefan Ye, and James M. Rehg. Delving into egocentric actions. CVPR, 2015.
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid
networks for object detection. In CVPR, 2017.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019.
Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature
Communications, 15(1):654, 2024.
Kevis-Kokitsi Maninis, Sergi Caelles, Yuhua Chen, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van
Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 41:1515–1530, 2017.
Maciej A Mazurowski, Haoyu Dong, Hanxue Gu, Jichen Yang, Nicholas Konz, and Yixin Zhang. Segment anything
model for medical image analysis: an experimental study. Medical Image Analysis, 89:102918, 2023.
Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking
with transformers. In CVPR, June 2022.
Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. Large-scale video panoptic
segmentation in the wild: A benchmark. In CVPR, pp. 21033–21043, 2022.
Massimo Minervini, Andreas Fischbach, Hanno Scharr, and Sotirios A. Tsaftaris. Finely-grained annotated datasets
for image-based plant phenotyping. Pattern Recognition Letters, 2016.
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer,
Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the conference on
fairness, accountability, and transparency, pp. 220–229, 2019.
Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation of moving objects by long term video analysis. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 36(6):1187–1200, 2014.
Seoung Wug Oh, Joon-Young Lee, Kalyan Sunkavalli, and Seon Joo Kim. Fast video object segmentation by
reference-guided mask propagation. CVPR, pp. 7376–7385, 2018.
Seoung Wug Oh, Joon-Young Lee, N. Xu, and Seon Joo Kim. Video object segmentation using space-time memory
networks. ICCV, pp. 9225–9234, 2019.
David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud
Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
Federico Perazzi, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning video
object segmentation from static images. CVPR, pp. 3491–3500, 2016.
Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alexander Sorkine-Hornung, and Luc Van Gool.
The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017.
Alessandro Prest, Christian Leistner, Javier Civera, Cordelia Schmid, and Vittorio Ferrari. Learning object class
detectors from weakly annotated video. CVPR, pp. 3282–3289, 2012.
Mattia Pugliatti and Francesco Topputo. DOORS: Dataset fOr bOuldeRs Segmentation. Zenodo, 2022.
Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge Belongie, Alan Yuille, Philip Torr, and
Song Bai. Occluded video instance segmentation: A benchmark. IJCV, 2022.
Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point
tracking. arXiv:2307.01197, 2023.
Simiao Ren, Francesco Luzi, Saad Lahrichi, Kaleb Kassaw, Leslie M Collins, Kyle Bradbury, and Jordan M Malof.
Segment anything, from space? In WACV, pp. 8355–8365, 2024.
Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb,
and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding.
ICCV, 2021.
Andreas Robinson, Felix Järemo Lawin, Martin Danelljan, Fahad Shahbaz Khan, and Michael Felsberg. Learning fast
and robust target models for video object segmentation. CVPR, pp. 7404–7413, 2020.
Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu
Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer. Hiera: A
hierarchical vision transformer without the bells-and-whistles. ICML, 2023.
Corey Snyder and Minh Do. STREETS: A novel camera network dataset for traffic flow. NeurIPS, 2019.

Konstantin Sofiiuk, Ilya A Petrov, and Anton Konushin. Reviving iterative training with mask guidance for interactive
segmentation. In ICIP, pp. 3141–3145. IEEE, 2022.
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position
embedding. arXiv preprint arXiv:2104.09864, 2021.
Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camouflaged object detection. arXiv
preprint arXiv:2304.04709, 2023.
Pavel Tokmakov, Jie Li, and Adrien Gaidon. Breaking the “object” in video object segmentation. CVPR, pp.
22836–22845, 2022.
Tom Toulouse, Lucile Rossi, Antoine Campana, Turgay Celik, and Moulay A Akhloufi. Computer vision for wildfire
research: An evolving image dataset for processing and analysis. Fire Safety Journal, 92:188–194, 2017.
Cameron Trotter, Georgia Atkinson, Matt Sharpe, Kirsten Richardson, A. Stephen McGough, Nick Wright, Ben
Burville, and Per Berggren. NDD20: A large-scale few-shot dolphin dataset for coarse and fine-grained categorisation.
arXiv:2005.13359, 2020.
Paul Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation.
ArXiv, abs/1706.09364, 2017.
Stéphane Vujasinović, Sebastian Bullinger, Stefan Becker, Norbert Scherer-Negenborn, Michael Arens, and Rainer
Stiefelhagen. Revisiting click-based interactive video object segmentation. In ICIP, pp. 2756–2760. IEEE, 2022.
Boying Wang, Libo Zhang, Longyin Wen, Xianglong Liu, and Yanjun Wu. Towards real-world prohibited item detection:
A large-scale x-ray benchmark. CVPR, 2021a.
Haochen Wang, Cilin Yan, Shuai Wang, Xiaolong Jiang, Xu Tang, Yao Hu, Weidi Xie, and Efstratios Gavves. Towards
open-vocabulary video instance segmentation. In ICCV, pp. 4057–4066, 2023.
Jue Wang, Pravin Bhat, R Alex Colburn, Maneesh Agrawala, and Michael F Cohen. Interactive video cutout. ACM
Transactions on Graphics, 2005.
Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Chuanxin Tang, Xiyang Dai, Yucheng Zhao, Yujia Xie,
Lu Yuan, and Yu-Gang Jiang. Look before you match: Instance understanding matters in video object segmentation.
CVPR, pp. 2268–2278, 2022.
Weiyao Wang, Matt Feiszli, Heng Wang, and Du Tran. Unidentified video objects: A benchmark for dense, open-world
segmentation. In ICCV, pp. 10776–10785, 2021b.
Junde Wu, Wei Ji, Yuanpei Liu, Huazhu Fu, Min Xu, Yanwu Xu, and Yueming Jin. Medical sam adapter: Adapting
segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620, 2023a.
Qiangqiang Wu, Tianyu Yang, Wei Wu, and Antoni B. Chan. Scalable video object segmentation with simplified
framework. In ICCV, 2023b.
Junyu Xie, Charig Yang, Weidi Xie, and Andrew Zisserman. Moving object segmentation: All you need is sam (and
flow). arXiv preprint arXiv:2404.12389, 2024.
Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang,
Fei Sun, Forrest Iandola, Raghuraman Krishnamoorthi, and Vikas Chandra. Efficientsam: Leveraged masked image
pretraining for efficient segment anything, 2023.
N. Xu, L. Yang, Yuchen Fan, Jianchao Yang, Dingcheng Yue, Yuchen Liang, Brian L. Price, Scott D. Cohen, and
Thomas S. Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In ECCV, 2018a.
N. Xu, L. Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas S. Huang. Youtube-vos: A
large-scale video object segmentation benchmark. ArXiv, abs/1809.03327, 2018b.
Jinyu Yang, Mingqi Gao, Zhe Li, Shang Gao, Fangjing Wang, and Feng Zheng. Track anything: Segment anything
meets videos. arXiv preprint arXiv:2304.11968, 2023.
L. Yang, Yanran Wang, Xuehan Xiong, Jianchao Yang, and Aggelos K. Katsaggelos. Efficient video object segmentation
via network modulation. CVPR, pp. 6499–6507, 2018.
Lei Yang, Yan Zi Wei, Yisheng HE, Wei Sun, Zhenhang Huang, Haibin Huang, and Haoqiang Fan. iShape: A first
step towards irregular shape instance segmentation. arXiv:2109.15068, 2021a.

Zongxin Yang and Yi Yang. Decoupling features in hierarchical propagation for video object segmentation. In NeurIPS,
2022.
Zongxin Yang, Yunchao Wei, and Yi Yang. Collaborative video object segmentation by foreground-background
integration. In ECCV, 2020.
Zongxin Yang, Yunchao Wei, and Yi Yang. Associating objects with transformers for video object segmentation. In
NeurIPS, 2021b.
Zongxin Yang, Jiaxu Miao, Yunchao Wei, Wenguan Wang, Xiaohan Wang, and Yi Yang. Scalable video object
segmentation with identification mechanism. TPAMI, 2024.
Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O’Dea, Michal Uricár, Stefan
Milz, Martin Simon, Karl Amende, et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous
driving. ICCV, 2019.
Jae Shin Yoon, François Rameau, Junsik Kim, Seokju Lee, Seunghak Shin, and In-So Kweon. Pixel-level matching for
video object segmentation using convolutional neural networks. ICCV, pp. 2186–2195, 2017.
Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, pp.
12104–12113, 2022.
Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster
segment anything: Towards lightweight sam for mobile applications, 2023a.
Jiaming Zhang, Yutao Cui, Gangshan Wu, and Limin Wang. Joint modeling of feature, correspondence, and a
compressed memory for video object segmentation. ArXiv, abs/2308.13505, 2023b.
Lingzhi Zhang, Shenghao Zhou, Simon Stent, and Jianbo Shi. Fine-grained egocentric hand-object segmentation:
Dataset, model, and applications. ECCV, 2022.
Xu Zhao, Wenchao Ding, Yongqi An, Yinglong Du, Tao Yu, Min Li, Ming Tang, and Jinqiao Wang. Fast segment
anything, 2023.
Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic
understanding of scenes through the ADE20K dataset. IJCV, 2019.

