
Segment Anything Model for Medical Image Analysis: an Experimental Study

Maciej A. Mazurowski 1,2,3,4   Haoyu Dong 2   Hanxue Gu 2   Jichen Yang 2   Nicholas Konz 2   Yixin Zhang 2
1 Department of Radiology   2 Department of Electrical and Computer Engineering
3 Department of Computer Science   4 Department of Biostatistics & Bioinformatics
Duke University, NC, USA
{maciej.mazurowski, haoyu.dong151, hanxue.gu, jichen.yang, nicholas.konz, yixin.zhang7}@duke.edu
Code: https://github.com/mazurowski-lab/segment-anything-medical-evaluation
arXiv:2304.10517v3 [cs.CV] 17 May 2023

Abstract

Training segmentation models for medical images continues to be challenging due to the limited availability of data annotations. Segment Anything Model (SAM) is a foundation model that is intended to segment user-defined objects of interest in an interactive manner. While the performance on natural images is impressive, medical image domains pose their own set of challenges. Here, we perform an extensive evaluation of SAM's ability to segment medical images on a collection of 19 medical imaging datasets from various modalities and anatomies. We report the following findings: (1) SAM's performance based on single prompts varies highly depending on the dataset and the task, from IoU=0.1135 for spine MRI to IoU=0.8650 for hip X-ray. (2) Segmentation performance appears to be better for well-circumscribed objects with prompts with less ambiguity and poorer in various other scenarios such as the segmentation of brain tumors. (3) SAM performs notably better with box prompts than with point prompts. (4) SAM outperforms similar methods RITM, SimpleClick, and FocalClick in almost all single-point prompt settings. (5) When multiple point prompts are provided iteratively, SAM's performance generally improves only slightly, while other methods' performance improves to the level that surpasses SAM's point-based performance. We also provide several illustrations of SAM's performance on all tested datasets, of iterative segmentation, and of SAM's behavior given prompt ambiguity. We conclude that SAM shows impressive zero-shot segmentation performance for certain medical imaging datasets, but moderate to poor performance for others. SAM has the potential to make a significant impact in automated medical image segmentation, but appropriate care needs to be applied when using it.

1. Introduction

Image segmentation is a central task in medical image analysis, ranging from the segmentation of organs [31], abnormalities [3], and bones [17] to other structures [28], and it has received significant advancements from deep learning [13, 30]. However, developing and training segmentation models for new medical imaging data and/or tasks is practically challenging, due to the expensive and time-consuming nature of collecting and curating medical images, primarily because trained radiologists must typically provide careful mask annotations for images.

These difficulties could be significantly mitigated with the advent of foundation models [39] and zero-shot learning [6]. Foundation models are neural networks trained on an extensive amount of data, using creative learning and prompting objectives that typically do not require traditional supervised training labels, both of which contribute towards the ability to perform zero-shot learning on completely new data in a variety of settings. Foundation models have shown paradigm-shifting abilities in the domain of natural language processing [27]. The recently developed Segment Anything Model is a foundation model that has achieved promising zero-shot segmentation performance on a variety of natural image datasets [18].

1.1. What is SAM?

Segment Anything Model (SAM) is designed to segment an object of interest in an image given certain prompts provided by a user. Prompts can take the form of a single point, a set of points (including an entire mask), a bounding box, or text. The model is asked to return a valid segmentation mask even in the presence of ambiguity in the prompt. The general idea behind this approach is that the model has learned the concept of an object and thus can segment any object that is pointed out. This results in a high potential for
it to be able to segment objects of types that it has not seen, without any additional training, i.e., high performance in the zero-shot learning regime. In addition to the prompt-based definition of the task, the SAM authors utilized a specific model architecture and a uniquely large dataset to achieve this goal, described as follows.

SAM was trained progressively alongside the development of its dataset of images with corresponding object masks (SA-1B). The dataset was developed in three stages. First, a set of images was annotated by human annotators by clicking on objects and manually refining masks generated by SAM, which at that point was trained on public datasets. Second, the annotators were asked to segment masks that were not confidently generated by SAM, to increase the diversity of objects. The final set of masks was generated automatically by prompting the SAM model with a set of points distributed in a grid across the image and selecting confident and stable masks.

1.2. How to segment medical images with SAM?

SAM is designed to require a prompt or a set of prompts to produce a segmentation mask. Technically, the model can be run without a prompt to segment any visible object, but we do not expect this to be useful for medical images, where there are often many other objects in the image beside the one of interest. Given this prompt-based nature, in its basic form SAM cannot be used the same way as most segmentation models in medical imaging, where the input is simply an image and the output is a segmentation mask, or multiple masks for the desired object or objects.

We propose that there are three main ways in which SAM can be used in the process of segmentation of medical images. The first two involve using the actual Segment Anything Model in the process of annotation, mask generation, or training of additional models; these approaches do not involve changes to SAM. The third approach involves training or fine-tuning a SAM-like model targeted at medical images. We detail each approach next. Note that we do not comment here on text-based prompting, as it is still in the proof-of-concept stage for SAM.

Semi-automated annotation ("human in the loop"). The manual annotation of medical images is one of the main challenges of developing segmentation models in this field, since it typically requires the valuable time of physicians. SAM could be used in this setting as a tool for faster annotation. This could be done in different ways. In the simplest case, a human user provides prompts for SAM, which generates a mask to be approved or modified by the user; this could be refined iteratively. Another option is where SAM is given prompts distributed in a grid across the image (the "segment everything" mode) and generates masks for multiple objects, which are then named, selected, and/or modified by the user. This is only the start; many other possibilities could be imagined.

SAM assisting other segmentation models. One version of this usage mode is where SAM works alongside another algorithm to automatically segment images (an "inference mode"). For example, SAM, based on point prompts distributed across the image, could generate multiple object masks, which could then be classified as specific objects by a separate classification model. Similarly, an independent detection model, e.g., ViTDet [20], could generate object bounding boxes to be used as prompts for SAM to generate precise segmentation masks.

Furthermore, SAM could be used in the training loop of some other semantic segmentation model. For example, the masks generated by a segmentation model on unlabeled images during training could be used as prompts to SAM to generate more precise masks for these images, which could be used as iteratively refined supervised training examples for the model being trained. One could conceptualize many other specific modes of including SAM in the process of training new segmentation models.

New medical image foundation segmentation models. In this usage mode, the development process of a new segmentation foundation model for medical images could be guided by SAM's own development process. The largest difficulty would be the much lower availability of medical images and quality annotations compared to natural images, but this is possible in principle. A more feasible option could be to fine-tune SAM on medical images and masks from a variety of medical imaging domains, rather than training from scratch, as this would likely require fewer images.

2. Methodology

In the previous section, we described various usage scenarios of SAM for medical image segmentation. These are conceptually promising but largely rely on the assumption that SAM can generate accurate segmentations of medical images. Here, we experimentally evaluate this claim and assess the performance of SAM within a variety of different realistic usage scenarios and datasets in medical imaging.

2.1. Datasets

We compiled and curated a set of 19 publicly available medical imaging datasets for image segmentation. While the phrase "medical imaging" is sometimes used to refer to all images pertaining to medicine, we focus on the common definition of radiological images. Our collection includes planar X-rays, magnetic resonance images (MRIs), computed tomography (CT) images, ultrasound (US) images, and positron emission tomography (PET) images. The datasets are summarized in Tables 1 and 2. For datasets containing more than one type of object, we considered the segmentation of each object as a separate task.
Abbreviated dataset name | Full dataset name and citation | Modality | Num. classes | Object(s) of interest | Num. masks
MRI-Spine | Spinal Cord Grey Matter Segmentation Challenge [29] | MRI | 2 | Gray matter, spinal cord | 551
MRI-Heart | Medical Segmentation Decathlon [33] | MRI | 1 | Heart | 1,301
MRI-Prostate | Initiative for Collaborative Computer Vision Benchmarking [19] | MRI | 1 | Prostate | 893
MRI-Brain | The Multimodal Brain Tumor Image Segmentation Benchmark (BraTS) [26] | MRI | 3 | GD-enhancing tumor, peritumoral edema, necrotic and non-enhancing tumor core | 12,591
MRI-Breast | Duke Breast Cancer MRI: Breast + FGT Segmentation [14, 32] | MRI | 2 | Breast, fibroglandular tissue | 503
Xray-Chest | Montgomery County and Shenzhen Chest X-ray Datasets [16] | X-ray | 1 | Chest | 704
Xray-Hip | X-ray Images of the Hip Joints [11] | X-ray | 2 | Ilium, femur | 140
US-Breast | Dataset of Breast Ultrasound Images [1] | Ultrasound | 1 | Breast | 647
US-Kidney | CT2US for Kidney Segmentation [35] | Ultrasound | 1 | Kidney | 4,586
US-Muscle | Transverse Musculoskeletal Ultrasound Image Segmentations [25] | Ultrasound | 1 | Muscle | 4,044
US-Nerve | Ultrasound Nerve Segmentation [2] | Ultrasound | 1 | Nerve | 2,323
US-Ovarian-Tumor | Multi-Modality Ovarian Tumor Ultrasound (MMOTU) [38] | Ultrasound | 1 | Ovarian tumor | 1,469

Table 1. All datasets evaluated in this paper. "Num. masks" refers to the number of images with non-zero masks. For 3D modalities, 2D slices are used as inputs.

2.2. Experiments

We performed a thorough evaluation of SAM with both non-iterative prompts (generated prior to SAM being applied) and iterative prompts (generated after seeing the model's predictions). We also explored the "segment everything" mode of SAM and analyzed the different outputs that SAM generates in response to ambiguity in the prompts.

2.2.1 Prompting Strategies

Non-iterative prompts. In this primary mode of evaluation, prompts were simulated to reflect how a human user might generate them while looking at the objects. We focus on five modes of non-iterative prompting designed to capture realistic usage cases of SAM for generating image masks, using either points or bounding boxes. An essential thing to consider is that a single "object" of interest / "ground truth" mask may consist of multiple disconnected parts, which is especially common in medical images. An example of this is a cross-sectional image of a liver (such as MRI or CT) where, in one 2D slice, the liver is portrayed as two non-contiguous areas. Given this consideration, we introduce the following five prompting modes:

• One prompt point is placed at the center of the largest contiguous region of the object of interest/ground truth mask.
• A prompt point is placed at the center of each separate contiguous region of the object of interest (up to three points).
• One box prompt is placed to tightly enclose the largest contiguous region of the object of interest.
• A box prompt is placed to tightly enclose each separate contiguous region of the object of interest (up to three boxes).
• A single box is placed to tightly enclose the entire object mask.
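To make these five modes concrete, the sketch below shows one way such prompts could be derived automatically from a binary ground-truth mask using OpenCV connected components and a distance transform. This is an illustrative reconstruction rather than the authors' released code; the function name and the choice of OpenCV utilities are ours.

import cv2
import numpy as np

def simulate_prompts(gt_mask, max_regions=3):
    """Derive point and box prompts from a non-empty binary ground-truth mask.

    Points are placed at the 'center' of each contiguous region, taken here as
    the pixel farthest from the region boundary (via a distance transform);
    boxes tightly enclose each region. Only the `max_regions` largest regions
    are prompted, mirroring the up-to-three-regions limit described above.
    """
    mask = (gt_mask > 0).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    # Sort foreground components (label 0 is background) by area, largest first.
    order = sorted(range(1, n), key=lambda i: stats[i, cv2.CC_STAT_AREA], reverse=True)

    points, boxes = [], []
    for i in order[:max_regions]:
        region = (labels == i).astype(np.uint8)
        # Pixel farthest from the background = center of the region.
        dist = cv2.distanceTransform(region, cv2.DIST_L2, 5)
        y, x = np.unravel_index(np.argmax(dist), dist.shape)
        points.append((int(x), int(y)))                      # point prompt (x, y)
        l, t = stats[i, cv2.CC_STAT_LEFT], stats[i, cv2.CC_STAT_TOP]
        w, h = stats[i, cv2.CC_STAT_WIDTH], stats[i, cv2.CC_STAT_HEIGHT]
        boxes.append((int(l), int(t), int(l + w), int(t + h)))  # box prompt (x0, y0, x1, y1)

    ys, xs = np.nonzero(mask)                                # Mode 5: one box around everything
    whole_box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return points, boxes, whole_box

Under this sketch, Mode 1 corresponds to points[0], Mode 2 to points, Mode 3 to boxes[0], Mode 4 to boxes, and Mode 5 to whole_box.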

Abbreviated dataset name | Full dataset name and citation | Modality | Num. classes | Object(s) of interest | Num. masks
CT-Colon | Medical Segmentation Decathlon [33] | CT | 1 | Colon cancer primaries | 1,285
CT-HepaticVessel | Medical Segmentation Decathlon [33] | CT | 1 | Vessels, tumors | 13,046
CT-Pancreas | Medical Segmentation Decathlon [33] | CT | 1 | Parenchyma and mass | 8,792
CT-Spleen | Medical Segmentation Decathlon [33] | CT | 1 | Spleen | 1,051
CT-Liver | The Liver Tumor Segmentation Benchmark (LiTS) [4] | CT | 1 | Liver | 5,501
CT-Organ | CT Volumes with Multiple Organ Segmentations (CT-ORG) [31] | CT | 5 | Liver, bladder, lungs, kidney, bone | 4,776
PET-Whole-Body | A FDG-PET/CT dataset with annotated tumor lesions [10] | PET/CT | 1 | Lesion | 1,015

Table 2. (Continued) All datasets evaluated in this paper. "Num. masks" refers to the number of images with non-zero masks. For 3D modalities, 2D slices are used as inputs.

Figure 1. Examples of prompt(s) generated by the five modes respectively. Green contours show the ground-truth masks, and blue star(s)
and box(es) indicate the prompts.

Juxtaposed examples of each prompting mode for the same image are shown in Figure 1. Modes 1 and 2 are equivalent if the object consists of only one contiguous region/part, and the same is true for Modes 3, 4, and 5. The point prompts were generated as the point farthest from the boundary of the mask for the object or its part.

Iterative prompts. We use a common, intuitive strategy for simulating realistic iterative point prompts, which reflects how they could be generated by a user in an interactive way [24]. The details of the prompt generation are illustrated in Algorithm 1. Specifically, once the network makes a prediction, we compute an error map where both false positive and false negative predictions are marked as 1. We then find the location of the next prompt as the central location, i.e., the point farthest from any 0-valued pixel, of the largest component of the error mask. The label of the prompt is based on whether the new location is foreground or background.

Prompt ambiguity and oracle performance. Prompts can be ambiguous in the sense that it may be unclear which object in the image the prompt is referring to. A typical scenario is when objects are nested within each other in the image. For example, when a user provides a point prompt within a necrotic component of a brain tumor, they could intend to segment that component, the entire tumor, one hemisphere of the brain, the entire brain, or the entire head. In response to this issue, SAM provides multiple outputs aimed at disambiguating the prompts. This is a very important and practical feature of SAM since, in the interactive segmentation setting, multiple potential outputs can be presented to the user, from which they can select the one closest to the object that they intended. In our experiments, we display some examples of the multiple outputs generated by SAM to illustrate how it deals with the ambiguity of the prompts.
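As an illustration of this multi-output behavior, the sketch below prompts SAM with a single foreground point through the official segment_anything package and keeps either the most confident of the three returned masks or the one closest (in IoU) to the ground truth, mirroring the oracle selection described in the following paragraphs. The checkpoint path and the variables image, x, y, and gt_mask are placeholders.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)
predictor.set_image(image)                 # image: HxWx3 uint8 RGB array

masks, scores, _ = predictor.predict(
    point_coords=np.array([[x, y]]),       # one foreground point prompt (x, y)
    point_labels=np.array([1]),
    multimask_output=True,                 # ask SAM for three candidate masks
)

def iou(a, b):
    a, b = a.astype(bool), b.astype(bool)
    return np.logical_and(a, b).sum() / max(np.logical_or(a, b).sum(), 1)

default_mask = masks[np.argmax(scores)]                           # most confident output
oracle_mask = masks[np.argmax([iou(m, gt_mask) for m in masks])]  # closest to the ground truth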

Figure 2. Performance of SAM under 5 modes of use. Left: Performance of SAM across 28 segmentation tasks, with results ranked
in descending order based on Mode 4. Oracle performance for each mode is indicated by the inverted triangle. Right: A summarized
performance comparison of all five modes across all tasks, presented in a box and whisker plot format.

Related to the ambiguity, in all experiments we also present what the developers of SAM call "oracle performance". This is the performance of the model when the prediction closest (in terms of IoU) to the true mask, i.e., the oracle prediction, is always used out of SAM's three generated predictions. Note that this prediction may differ from SAM's most confident prediction. While this assumes knowledge of the true mask, and is therefore a biased way to assess performance when there is no additional interaction with the user after providing the initial prompts, it is a practical reflection of performance in a setting where the user can select one of the masks generated by SAM. When prompts are generated iteratively, the oracle prediction is used to create the error map that guides the location of the next prompt.

2.2.2 Comparison with other methods

We compared SAM with three interactive segmentation methods, namely RITM [34], SimpleClick [21], and FocalClick [7]. RITM adopted HRNet-18 [36] as the backbone segmentation model and an iterative mask-correction approach based on the {previous mask, new click} set to achieve outstanding performance on multiple natural imaging datasets. Based on the framework proposed by RITM, SimpleClick replaced the HRNet-18 backbone with a plain ViT [9] and included a "click embedding" layer symmetric to the patch embedding; this allows it to reach above-threshold performance with fewer clicks than RITM. FocalClick replaced the HRNet-18 backbone in RITM with a SegFormer [37] and restricted the update in response to new clicks to be only local. The method proposed a "progressive merge" that can exploit morphological information and prevent unintended changes far away from users' clicks, resulting in faster inference time.

Algorithm 1: Prompt Point Generation Scheme
Input: image X ∈ R^{H×W}; ground-truth mask M ∈ {0,1}^{H×W}; Segment Anything Model SAM; prompt count N; distance-to-closest-zero-pixel function d = distanceTransform() from OpenCV [5].
1: Initialize the first prompt point p_1 as the point within the mask foreground farthest from the background:
2:   P = argmax over (i,j) with M_ij = 1 of d(M)[i,j] (the distance from (i,j) to the nearest (k,l) with M_kl = 0).
3: If multiple points attain the maximum, choose randomly:
4:   p_1 = random_choice(P)
5: Predict the mask Y_1 = SAM(X, p_1)
6: Compute the prediction error region E_1 = (Y_1 ∪ M) − (Y_1 ∩ M)
7: Subsequent prompt points are those farthest from the boundary of the iteratively updated error region E_n:
8: for n = 2, ..., N do
9:   P_n = argmax over (i,j) with [E_{n−1}]_ij = 1 of d(E_{n−1})[i,j]
10:  p_n = random_choice(P_n)
11:  Y_n = SAM(X, p_n)
12:  E_n = (Y_n ∪ M) − (Y_n ∩ M)
13: end for
14: return prompt points p_1, ..., p_N ∈ N^2
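A minimal Python sketch of the prompt-point generation in Algorithm 1 is given below, using OpenCV's distanceTransform as in the algorithm's definition. It is illustrative only: run_sam stands in for a call to SAM (or any interactive method) with all point prompts accumulated so far, and the largest error component is isolated explicitly, as described in the text.

import cv2
import numpy as np

def farthest_from_zero(region):
    """Return an (x, y) location inside `region` farthest from any zero pixel."""
    dist = cv2.distanceTransform(region.astype(np.uint8), cv2.DIST_L2, 5)
    candidates = np.argwhere(dist == dist.max())
    y, x = candidates[np.random.randint(len(candidates))]   # random tie-break (lines 3-4)
    return int(x), int(y)

def generate_prompt_points(image, gt_mask, run_sam, num_prompts):
    """Iterative prompt generation following Algorithm 1.

    `run_sam(image, points, labels)` is assumed to return a binary mask
    prediction given all point prompts provided so far.
    """
    M = gt_mask > 0
    points = [farthest_from_zero(M)]           # first point: foreground center (lines 1-4)
    labels = [1]
    pred = run_sam(image, points, labels)       # Y_1 = SAM(X, p_1)            (line 5)
    for _ in range(1, num_prompts):
        error = np.logical_xor(pred, M)         # E_n = (Y_n ∪ M) − (Y_n ∩ M)  (lines 6, 12)
        if not error.any():
            break
        # Keep only the largest connected error component before picking its center.
        n, comp = cv2.connectedComponents(error.astype(np.uint8))
        largest = max(range(1, n), key=lambda i: (comp == i).sum())
        x, y = farthest_from_zero(comp == largest)
        points.append((x, y))
        labels.append(1 if M[y, x] else 0)      # label by foreground/background at the new point
        pred = run_sam(image, points, labels)   # Y_n = SAM(X, p_n)            (line 11)
    return points, labels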


Figure 3. Visualization of SAM's segmentation results in two different modes. Each dataset is shown in two sequential rows, with its name along the left side. For each dataset, we display three examples from left to right, corresponding to the 25th, 50th, and 75th percentiles of IoU across all images of that dataset. For each example, we visualize (top left) the raw image; (bottom left) a zoomed-in view of the area of interest; (top right) the segmentation result for Mode 2 (one point at each object region); and (bottom right) the segmentation result for Mode 4 (one box at each object region). The IoU is reported above each segmentation result. Examples for all the datasets are shown in Appendix Section B.

[Figure 4 plot. Title: "Comparison of SAM with other methods on multiple medical-imaging datasets in the 1-prompt setting." Legend: SAM vs. RITM, SAM vs. SimpleClick, SAM vs. FocalClick, and oracle performance of SAM. The y-axis shows the IoU difference; the x-axis lists the 28 dataset/task combinations.]

Figure 4. Comparison of SAM with three other competing methods, namely RITM, SimpleClick, and FocalClick, under the 1-point prompt setting. The results are presented as the difference in IoU between SAM and each other method (Δ IoU), and the tasks are ranked in descending order of the largest Δ IoU.

Figure 5. Comparison of SAM and other methods under the interactive prompt setting. (Left) The average performance of SAM and the other methods across all tasks with respect to the number of prompts provided. (Right) The detailed performance of SAM on each task.

2.2.3 Performance evaluation metric

For each dataset, we evaluated the accuracy of the masks that SAM and the aforementioned methods generate given prompts, with respect to the "ground truth" mask annotations for the given dataset and task. In the quantitative evaluation, we always use the mask with the highest confidence generated by SAM for a given prompt. We used IoU as the evaluation metric, as in SAM's original paper [18]. For datasets containing multiple types of objects, performance was reported independently for each type of object.

3. Results

3.1. Performance of SAM for different modes of use for 28 tasks

The performance of SAM for our five prompting modes of use, introduced in Section 2.2.1, is shown in Figure 2. We draw several conclusions. First, SAM's performance varies widely across different datasets: it ranges from an impressive IoU of 0.9118 to a very poor IoU of 0.1136.
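For reference, the per-task evaluation described in Section 2.2.3 reduces to averaging the IoU between SAM's highest-confidence mask and the ground truth over all images of a task; a minimal sketch with hypothetical helper and variable names is:

import numpy as np

def iou(pred, gt):
    """Intersection over union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

# Average IoU over all images of one task; predict_highest_confidence() and
# task_samples are hypothetical stand-ins for the actual evaluation pipeline.
task_iou = np.mean([iou(predict_highest_confidence(image, prompt), gt)
                    for image, prompt, gt in task_samples])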

Figure 6. Examples of SAM’s prediction under the interactive prompt setting. For each dataset, we display the results from 1-point prompts
to 9-point prompts, respectively. The positive prompts are represented as green stars, and the negative prompts are represented as red stars.

Figure 7. Visualizations of examples with ambiguity based on SAM; the first, second, and third most confident predictions are shown sequentially.

Comparing the performance for different prompting modes shows clear superiority of box prompts over point prompts. Moreover, as expected, prompts where each separate part of the object is indicated separately are generally superior to those where only one part is indicated, or all parts are outlined in one box. This was particularly pronounced for datasets where objects typically consist of more than one part. Following these two trends, Mode 4, where each part of an object is indicated by a separate box, showed the best performance with an average IoU of 0.6542.

Additionally, the oracle mode showed a moderate improvement over the default mode. The magnitude of this improvement was highly dependent on the dataset.

Figure 8. (Top) The relative size of the object in each dataset. (Bottom) Object size vs. segmentation performance for Mode 2 and Mode 4 separately; a fitted regression curve is also shown for each.

Figure 3 shows examples of segmentations generated in prompting Mode 2 (a point for each object part) and Mode 4 (a box around each object part) for four selected datasets. For each dataset, we provide examples of SAM's segmentations at the 25th, 50th, and 75th percentiles of IoU. Green contours represent the ground truth masks, red contours represent SAM's predictions, and teal points or boxes represent the prompts given to SAM. This figure illustrates the high variability of SAM's performance, ranging from near-perfect for well-circumscribed objects with unambiguous prompts to very poor, particularly for objects with ambiguous prompts. A similar illustration for all datasets is provided in Appendix Section B.

3.2. Comparing SAM to other interactive segmentation methods

We compare SAM to other interactive methods in the non-iterative prompting setting in Figure 4. SAM performed better than all other methods on 24 out of 28 tasks, with a dramatic improvement in performance for some of them. When used in oracle mode, SAM was better than all other methods for 26 out of 28 tasks. The average performance across different datasets was 0.4595 IoU for SAM, 0.5137 IoU for SAM in oracle mode, 0.2240 IoU for FocalClick, 0.1910 IoU for SimpleClick, and 0.1322 IoU for RITM. Note that SAM exhibits notably better overall performance than all other methods even though it is not used in its best-performing mode of boxes as prompts; we used point prompts to allow a fair comparison between the different methods. If SAM used Mode 3 (a single box), it outperformed all other methods in all but one task; the average performance for Mode 3 was 0.5891 IoU.

3.3. Performance of SAM and other methods for iterative segmentation

We show the average performances of SAM, RITM, SimpleClick, and FocalClick in Figure 5. The detailed performance of the three competing methods over the 28 tasks under the interactive prompt setting is shown in Appendix Section C. As seen in the results above, for the scenario where prompts are provided prior to any interaction with the output of the models, SAM performs notably better than all other methods. However, when additional clicks are provided iteratively with the goal of refining the segmentations
returned by the models, the superiority of SAM diminishes, and it is surpassed by two other methods (SimpleClick and RITM) when five or more points are provided by the users. This is due to the fact that SAM appears to draw hardly any benefit from the additional information provided through the interactive points after two or three points have been provided. Figure 6 illustrates this phenomenon.

We also see that SAM, when given a single prompt, has difficulty segmenting objects with multiple non-contiguous regions. It is more likely to segment one contiguous region instead of trying to find additional semantically similar regions in the entire image. Therefore, in the scenario where multiple regions of interest exist for an object, additional prompt points for SAM can be beneficial if they target additional regions, but beyond that, the benefit of further prompt points is negligible, and in some cases such additional input is detrimental.

3.4. Performance of SAM in the presence of ambiguity of prompts

When applying SAM to medical images, we found a consistent tendency in how SAM interprets point prompts. As shown in Figure 7, SAM's highest-confidence mask predictions tend to look similar to results generated by region-growing-based algorithms: in most cases, a connected region with similar intensity in the image, bounded by regions with dissimilar intensities, is segmented. On the other hand, we found that mask predictions generated with lower confidence scores may expand the highest-confidence predictions and tend to have more variety in intensity/texture.

3.5. Performance of SAM for objects of different sizes

In Figure 8(a) we show how object size relates to the performance of SAM. We describe object size as the ratio of the number of pixels in the object to the total number of pixels in the image. We found object size to vary broadly across our different datasets, by up to over two orders of magnitude. Some datasets also showed high variability of object size within the dataset itself (e.g., kidneys in ultrasound images), while others showed relative consistency of object size (such as bones in hip X-ray images).

Figures 8(b) and 8(c) show the relationship between the average object size in a dataset and the performance of SAM. While the correlations are low, there is a trend towards higher performance of SAM for larger objects. Note that the correlation analysis looks at the average object size per dataset and does not consider the variation of object sizes within the individual datasets.

Figure 9. Examples of the segment-everything mode. For each example, we sampled a different number of grid points per side: 2^5, 2^6, and 2^7.

3.6. Segment-everything mode for medical images

In Figure 9, we provide an example of using SAM's "segment everything" mode on medical images. Segment-everything mode uses a dense, evenly spaced mesh of prompt points, designed to direct SAM to segment the image into many different regions. We find that this mode is of mixed usefulness and that the results are somewhat dependent on the number of prompts. While imperfect, this setup could potentially be useful in some applications, e.g., as a "starting point" for creating segmentation annotations for different objects in a new image. Selecting masks from the output of segment-everything mode is effectively a special case of prompting Mode 2 in our experiments, except with additional post-processing; we therefore do not elaborate further on this mode in this paper.
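For completeness, the segment-everything behavior is exposed in the official segment_anything package as an automatic mask generator; a minimal sketch is shown below. The checkpoint path is a placeholder, and points_per_side controls the density of the prompt grid (Figure 9 varies the number of grid points per side between 2^5 and 2^7).

from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")         # placeholder checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)   # density of the point grid

# image: HxWx3 uint8 RGB array; each entry holds a binary mask plus metadata such as its area.
results = mask_generator.generate(image)
masks = [r["segmentation"] for r in sorted(results, key=lambda r: r["area"], reverse=True)]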

4. Conclusions and discussion

In this study, we evaluated the new Segment Anything Model for the segmentation of medical images. We reached the following conclusions:

• SAM's accuracy for zero-shot medical image segmentation is moderate on average and varies significantly across different datasets and across different images within a dataset.
• The model performs best with box prompts, particularly when one box is provided for each separate part of the object of interest.
• SAM outperforms RITM, SimpleClick, and FocalClick in the vast majority of the evaluated settings where a single non-iterative prompt point is provided.
• In the setting where multiple iteratively refined point prompts are provided, SAM obtains very limited benefit from the additional point prompts, except for objects with multiple parts. The other algorithms, on the other hand, improve notably with additional point prompts, to the level of surpassing SAM's performance. However, the point prompting modes are inferior to SAM's box prompting modes.
• We find a small but not statistically significant correlation between the average object size in a dataset and SAM's performance.

One of the contributions of our study is that we identified five different modes of use for interactive segmentation methods. This is of particular importance for objects which have multiple components, a common feature in medical imaging. These modes also showed different performance in such scenarios. While these modes demonstrate the variety of uses, future work could focus on prompt engineering, both non-iterative and iterative, which could potentially reach even higher performance.

The Segment Anything Model, the associated preprint, and the released code illustrate the strengths of open science and the activity of the machine learning and medical imaging machine learning communities. Since we made the first version of this manuscript available, approximately two weeks after the release of the SAM paper, multiple preprints have appeared evaluating SAM in broadly understood medical imaging and radiological imaging [8, 15] (some even before our release). Some preprints have already shown extensions of SAM to medical imaging [12, 23], and one paper has shown an integration of SAM into 3D Slicer [22]. This demonstrates a high likelihood that SAM will become an important part of image segmentation in medical imaging.

Overall, SAM shows promise for use on medical images, as long as suitable prompting strategies are used for the dataset and task of choice. Future work will include the development of different ways to adapt it to construct medical imaging-specific models, as well as to extend it to 3D segmentation.

Acknowledgments

Research reported in this publication was supported by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under Award Number R01EB031575 and by the National Heart, Lung, and Blood Institute of the National Institutes of Health under Award Number R44HL152825. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

[1] Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. Dataset of breast ultrasound images. Data in Brief, 28:104863, 2020.
[2] Montoya Anna, Hasnin, kaggle446, shirzad, Cukierski Will, and yffud. Ultrasound nerve segmentation, 2016.
[3] Syed Muhammad Anwar, Muhammad Majid, Adnan Qayyum, Muhammad Awais, Majdi Alnowami, and Muhammad Khurram Khan. Medical image analysis using convolutional neural networks: a review. Journal of Medical Systems, 42:1–13, 2018.
[4] Patrick Bilic, Patrick Christ, Hongwei Bran Li, Eugene Vorontsov, Avi Ben-Cohen, Georgios Kaissis, Adi Szeskin, Colin Jacobs, Gabriel Efrain Humpire Mamani, Gabriel Chartrand, et al. The liver tumor segmentation benchmark (LiTS). Medical Image Analysis, 84:102680, 2023.
[5] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[6] Jiaoyan Chen, Yuxia Geng, Zhuo Chen, Ian Horrocks, Jeff Z. Pan, and Huajun Chen. Knowledge-aware zero-shot learning: Survey and perspective. In International Joint Conference on Artificial Intelligence, 2021.
[7] Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. FocalClick: Towards practical interactive image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1300–1309, June 2022.
[8] Ruining Deng, Can Cui, Quan Liu, Tianyuan Yao, Lucas W. Remedios, Shunxing Bao, Bennett A. Landman, Lee Wheless, Lori A. Coburn, Keith T. Wilson, Yaohong Wang, Shilin Zhao, Agnes B. Fogo, Haichun Yang, Yucheng Tang, and Yuankai Huo. Segment Anything Model (SAM) for digital pathology: Assess zero-shot segmentation on whole slide imaging. arXiv, abs/2304.04155, 2023.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] Sergios Gatidis, Tobias Hepp, Marcel Früh, Christian La Fougère, Konstantin Nikolaou, Christina Pfannenberg, Bernhard Schölkopf, Thomas Küstner, Clemens Cyran, and Daniel Rubin. A whole-body FDG-PET/CT dataset with manually annotated tumor lesions. Scientific Data, 9(1):601, 2022.
[11] Daniel Gut. X-ray images of the hip joints. Mendeley Data, version 1, July 2021.
[12] Sheng He, Rina Bao, Jingpeng Li, P. Ellen Grant, and Yangming Ou. Accuracy of Segment Anything Model (SAM) in medical image segmentation tasks. arXiv preprint arXiv:2304.09324, 2023.

[13] Mohammad Hesam Hesamian, Wenjing Jia, Xiangjian He, and Paul Kennedy. Deep learning techniques for medical image segmentation: achievements and challenges. Journal of Digital Imaging, 32:582–596, 2019.
[14] Siping Hu, Christine Park, Christopher Orion Lew, Lars J Grimm, Jay A Baker, Michael W Taylor-Cho, and Maciej A Mazurowski. Fully automated deep learning method for fibroglandular tissue segmentation in breast MRI. 2022.
[15] Yuhao Huang, Xin Yang, Lian Liu, Han Zhou, Ao Chang, Xinrui Zhou, Rusi Chen, Junxuan Yu, Jiongquan Chen, Chaoyu Chen, et al. Segment anything model for medical images? arXiv preprint arXiv:2304.14660, 2023.
[16] Stefan Jaeger, Sema Candemir, Sameer Antani, Yì-Xiáng J Wáng, Pu-Xuan Lu, and George Thoma. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quantitative Imaging in Medicine and Surgery, 4(6):475, 2014.
[17] Baris Kayalibay, Grady Jensen, and Patrick van der Smagt. CNN-based segmentation of medical imaging data. arXiv preprint arXiv:1701.03056, 2017.
[18] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[19] Guillaume Lemaître, Robert Martí, Jordi Freixenet, Joan C Vilanova, Paul M Walker, and Fabrice Meriaudeau. Computer-aided detection and diagnosis for prostate cancer based on mono and multi-parametric MRI: a review. Computers in Biology and Medicine, 60:8–31, 2015.
[20] Yanghao Li, Hanzi Mao, Ross B. Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. arXiv, abs/2203.16527, 2022.
[21] Qin Liu, Zhenlin Xu, Gedas Bertasius, and Marc Niethammer. SimpleClick: Interactive image segmentation with simple vision transformers. arXiv, abs/2210.11006, 2022.
[22] Yihao Liu, Jiaming Zhang, Zhangcong She, Amir Kheradmand, and Mehran Armand. SAMM (Segment Any Medical Model): A 3D Slicer integration to SAM. arXiv, abs/2304.05622, 2023.
[23] Jun Ma and Bo Wang. Segment anything in medical images. arXiv preprint arXiv:2304.12306, 2023.
[24] Sabarinath Mahadevan, Paul Voigtlaender, and B. Leibe. Iteratively trained interactive segmentation. In British Machine Vision Conference, 2018.
[25] Francesco Marzola, Nens van Alfen, Jonne Doorduin, and Kristen M Meiburger. Deep learning segmentation of transverse musculoskeletal ultrasound images for neuromuscular disease assessment. Computers in Biology and Medicine, 135:104623, 2021.
[26] Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging, 34(10):1993–2024, 2014.
[27] OpenAI. GPT-4 technical report, 2023.
[28] Dzung L Pham, Chenyang Xu, and Jerry L Prince. Current methods in medical image segmentation. Annual Review of Biomedical Engineering, 2(1):315–337, 2000.
[29] Ferran Prados, John Ashburner, Claudia Blaiotta, Tom Brosch, Julio Carballido-Gamio, Manuel Jorge Cardoso, Benjamin N Conrad, Esha Datta, Gergely Dávid, Benjamin De Leener, et al. Spinal cord grey matter segmentation challenge. NeuroImage, 152:312–329, 2017.
[30] Muhammad Imran Razzak, Saeeda Naz, and Ahmad Zaib. Deep learning for medical image processing: Overview, challenges and the future. Classification in BioApps: Automation of Decision Making, pages 323–350, 2018.
[31] Blaine Rister, Kaushik Shivakumar, Tomomi Nobashi, and Daniel L. Rubin. CT-ORG: A dataset of CT volumes with multiple organ segmentations, 2019. Version 1, dataset.
[32] Ashirbani Saha, Michael R Harowicz, Lars J Grimm, Connie E Kim, Sujata V Ghate, Ruth Walsh, and Maciej A Mazurowski. A machine learning approach to radiogenomics of breast cancer: a study of 922 subjects and 529 DCE-MRI features. British Journal of Cancer, 119(4):508–516, 2018.
[33] Amber L Simpson, Michela Antonelli, Spyridon Bakas, Michel Bilello, Keyvan Farahani, Bram Van Ginneken, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, et al. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063, 2019.
[34] Konstantin Sofiiuk, Ilya A. Petrov, and Anton Konushin. Reviving iterative training with mask guidance for interactive segmentation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3141–3145, 2022.
[35] Yuxin Song, Jing Zheng, Long Lei, Zhipeng Ni, Baoliang Zhao, and Ying Hu. CT2US: Cross-modal transfer learning for kidney segmentation in ultrasound images with synthesized data. Ultrasonics, 122:106706, 2022.
[36] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. TPAMI, 2019.
[37] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Neural Information Processing Systems (NeurIPS), 2021.
[38] Qi Zhao, Shuchang Lyu, Wenpei Bai, Linghan Cai, Binghao Liu, Meijing Wu, Xiubo Sang, Min Yang, and Lijiang Chen. A multi-modality ovarian tumor ultrasound image dataset for unsupervised cross-domain semantic segmentation. arXiv preprint arXiv:2207.06799, 2022.
[39] Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT. arXiv preprint arXiv:2302.09419, 2023.

Supplementary
A. Graphical Abstract
The graphical abstract of the work is shown in Fig. 10.

B. Examples of segmented results on all datasets


Examples of segmentation results over all 28 tasks are visualized in Figs. 11-16. For each task, we display three examples at the 25th, 50th, and 75th percentiles of IoU. For the PET/CT dataset, the PET information is presented in the red channel of the PNG file, and the CT information in the green and blue channels.
C. Performance of other competing methods
The detailed performance of the other three competing
methods over 28 tasks under the interactive prompt setting
is shown in Figure 17.

Figure 10. Graphical abstract of the work. It summarizes the main findings: five modes of non-iterative prompting were designed across the evaluated medical imaging datasets and 28 segmentation tasks; SAM performs best in Mode 4 (one box at each object region, at most three boxes) and notably better with box prompts than with point prompts; SAM's performance varies highly across datasets and tasks, appearing better for well-circumscribed objects with unambiguous prompts (such as organs in computed tomography) and poorer in various other scenarios (such as muscle in ultrasound imaging); SAM outperforms other state-of-the-art methods in almost all single-prompt settings, but when additional clicks are provided to refine the segmentations, it is surpassed by the other methods.

Figure 11. Visualization examples of SAM's prediction results (part 1).

Figure 12. Visualization examples of SAM's prediction results (part 2).

Figure 13. Visualization examples of SAM's prediction results (part 3).

Figure 14. Visualization examples of SAM's prediction results (part 4).

Figure 15. Visualization examples of SAM's prediction results (part 5).

Figure 16. Visualization examples of SAM's prediction results.

Figure 17. The detailed performance of RITM, SimpleClick, and FocalClick over each task under the interactive prompt setting.
