
LVIS: A Dataset for Large Vocabulary Instance Segmentation

Agrim Gupta Piotr Dollár Ross Girshick


Facebook AI Research (FAIR)

arXiv:1908.03195v1 [cs.CV] 8 Aug 2019

Abstract

Progress on object detection is enabled by datasets that focus the research community's attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced 'el-vis'): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect ∼2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images. Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples. Given that state-of-the-art deep learning methods for object detection perform poorly in the low-sample regime, we believe that our dataset poses an important and exciting new scientific challenge. LVIS is available at http://www.lvisdataset.org.

Figure 1. Example annotations. We present LVIS, a new dataset for benchmarking Large Vocabulary Instance Segmentation in the 1000+ category regime with a challenging long tail of rare objects.
1. Introduction

A central goal of computer vision is to endow algorithms with the ability to intelligently describe images. Object detection is a canonical image description task; it is intuitively appealing, useful in applications, and straightforward to benchmark in existing settings. The accuracy of object detectors has improved dramatically and new capabilities, such as predicting segmentation masks and 3D representations, have been developed. There are now exciting opportunities to push these methods towards new goals.

Today, rigorous evaluation of general purpose object detectors is mostly performed in the few category regime (e.g. 80) or when there are a large number of training examples per category (e.g. 100 to 1000+). Thus, there is an opportunity to enable research in the natural setting where there are a large number of categories and per-category data is sometimes scarce. The long tail of rare categories is inescapable; annotating more images simply uncovers previously unseen, rare categories (see Fig. 9 and [29, 25, 24, 27]). Efficiently learning from few examples is a significant open problem in machine learning and computer vision, making this opportunity one of the most exciting from a scientific and practical perspective. But to open this area to empirical study, a suitable, high-quality dataset and benchmark is required.

We aim to enable this new research direction by designing and collecting LVIS (pronounced 'el-vis'): a benchmark dataset for research on Large Vocabulary Instance Segmentation. We are collecting instance segmentation masks for more than 1000 entry-level object categories (see Fig. 1). When completed, we plan for our dataset to contain 164k images and ∼2 million high-quality instance masks.¹ Our annotation pipeline starts from a set of images that were collected without prior knowledge of the categories that will be labeled in them. We engage annotators in an iterative object spotting process that uncovers the long tail of categories that naturally appears in the images and avoids using machine learning algorithms to automate data labeling.

We designed a crowdsourced annotation pipeline that enables the collection of our large-scale dataset while also yielding high-quality segmentation masks. Quality is important for future research because relatively coarse masks, such as those in the COCO dataset [18], limit the ability to differentiate algorithm-predicted mask quality beyond a certain, coarse point. When compared to expert annotators, our segmentation masks have higher overlap and boundary consistency than both COCO and ADE20K [28].

¹ We plan to annotate the 164k images in COCO 2017 (we have permission to label test2017); ∼2M is a projection after labeling 85k images.

Figure 2. Category relationships from left to right: non-disjoint category pairs may be in partially overlapping, parent-child, or equivalent (synonym) relationships, implying that a single object may have multiple valid labels. The fair evaluation of an object detector must take the issue of multiple valid labels into account.

To build our dataset, we adopt an evaluation-first design principle. This principle states that we should first determine exactly how to perform quantitative evaluation and only then design and build a dataset collection pipeline to gather the data entailed by the evaluation. We select our benchmark task to be COCO-style instance segmentation and we use the same COCO-style average precision (AP) metric that averages over categories and different mask intersection over union (IoU) thresholds [19]. Task and metric continuity with COCO reduces barriers to entry.
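Because the task and metric are kept identical to COCO's, evaluation can be run with the standard pycocotools implementation referenced above [19]. The following is a minimal sketch under the assumption that ground-truth annotations and detections are available as COCO-format JSON files; the file names are placeholders, not files shipped with this paper.

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    # Placeholder file names; both files use the standard COCO JSON formats.
    coco_gt = COCO("instances_gt.json")            # ground-truth annotations
    coco_dt = coco_gt.loadRes("detections.json")   # algorithm outputs

    # iouType='segm' evaluates mask IoU; AP is averaged over categories
    # and over IoU thresholds 0.50:0.05:0.95, as in COCO.
    evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()   # prints AP, AP50, AP75, and related numbers

The 'segm' IoU type selects mask IoU; the category- and threshold-averaged AP it reports is the quantity used throughout this paper.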
Buried within this seemingly innocuous task choice are immediate technical challenges: How do we fairly evaluate detectors when one object can reasonably be labeled with multiple categories (see Fig. 2)? How do we make the annotation workload feasible when labeling 164k images with segmented objects from over 1000 categories?

The essential design choice resolving these challenges is to build a federated dataset: a single dataset that is formed by the union of a large number of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. Each small dataset provides the essential guarantee of exhaustive annotations for a single category: all instances of that category are annotated. Multiple constituent datasets may overlap and thus a single object within an image can be labeled with multiple categories. Furthermore, since the exhaustive annotation guarantee only holds within each small dataset, we do not require the entire federated dataset to be exhaustively annotated with all categories, which dramatically reduces the annotation workload. Crucially, at test time the membership of each image with respect to the constituent datasets is not known by the algorithm and thus it must make predictions as if all categories will be evaluated. The evaluation oracle evaluates each category fairly on its constituent dataset.

In the remainder of this paper, we summarize how our dataset and benchmark relate to prior work, provide details on the evaluation protocol, describe how we collected data, and then discuss results of the analysis of this data.

Dataset Timeline. We report detailed analysis on the 5000 image val subset that we have annotated twice. We have now annotated an additional 77k images (split between train, val, and test), representing ∼50% of the final dataset; we refer to this as LVIS v0.5 (see §A for details). The first LVIS Challenge, based on v0.5, will be held at the COCO Workshop at ICCV 2019.

1.1. Related Datasets

Datasets shape the technical problems researchers study and consequently the path of scientific discovery [17]. We owe much of our current success in image recognition to pioneering datasets such as MNIST [16], BSDS [20], Caltech 101 [6], PASCAL VOC [5], ImageNet [23], and COCO [18]. These datasets enabled the development of algorithms that detect edges, perform large-scale image classification, and localize objects by bounding boxes and segmentation masks. They were also used in the discovery of important ideas, such as Convolutional Networks [15, 13], Residual Networks [10], and Batch Normalization [11]. LVIS is inspired by these and other related datasets, including those focused on street scenes (Cityscapes [3] and Mapillary [22]) and pedestrians (Caltech Pedestrians [4]). We review the most closely related datasets below.

COCO [18] is the most popular instance segmentation benchmark for common objects. It contains 80 categories that are pairwise distinct. There are a total of 118k training images, 5k validation images, and 41k test images. All 80 categories are exhaustively annotated in all images (ignoring annotation errors), leading to approximately 1.2 million instance segmentation masks. To establish continuity with COCO, we adopt the same instance segmentation task and AP metric, and we are also annotating all images from the COCO 2017 dataset. All 80 COCO categories can be mapped into our dataset. In addition to representing an order of magnitude more categories than COCO, our annotation pipeline leads to higher-quality segmentation masks that more closely follow object boundaries (see §4).

ADE20K [28] is an ambitious effort to annotate almost every pixel in 25k images with object instance, 'stuff', and part segmentations. The dataset includes approximately 3000 named objects, stuff regions, and parts. Notably, ADE20K was annotated by a single expert annotator, which increases consistency but also limits dataset size. Due to the relatively small number of annotated images, most of the categories do not have enough data to allow for both training and evaluation. Consequently, the instance segmentation benchmark associated with ADE20K evaluates algorithms on the 100 most frequent categories. In contrast, our goal is to enable benchmarking of large vocabulary instance segmentation methods.
iNaturalist [26] contains nearly 900k images annotated with bounding boxes for 5000 plant and animal species. Similar to our goals, iNaturalist emphasizes the importance of benchmarking classification and detection in the few example regime. Unlike our effort, iNaturalist does not include segmentation masks and is focused on a different image and fine-grained category distribution; our category distribution emphasizes entry-level categories.

Figure 3. Example LVIS annotations (one category per image for clarity). See http://www.lvisdataset.org/explore.

Open Images v4 [14] is a large dataset of 1.9M images. The detection portion of the dataset includes 15M bounding boxes labeled with 600 object categories. The associated benchmark evaluates the 500 most frequent categories, all of which have over 100 training samples (>70% of them have over 1000 training samples). Thus, unlike our benchmark, low-shot learning is not integral to Open Images. Also different from our dataset is the use of machine learning algorithms to select which images will be annotated by using classifiers for the target categories. Our data collection process, in contrast, involves no machine learning algorithms and instead discovers the objects that appear within a given set of images. Starting with release v4, Open Images has used a federated dataset design for object detection.

2. Dataset Design

We followed an evaluation-first design principle: prior to any data collection, we precisely defined what task would be performed and how it would be evaluated. This principle is important because there are technical challenges that arise when evaluating detectors on a large vocabulary dataset that do not occur when there are few categories. These must be resolved first, because they have profound implications for the structure of the dataset, as we discuss next.

2.1. Task and Evaluation Overview

Task and Metric. Our dataset benchmark is the instance segmentation task: given a fixed, known set of categories, design an algorithm that when presented with a previously unseen image will output a segmentation mask for each instance of each category that appears in the image along with the category label and a confidence score. Given the output of an algorithm over a set of images, we compute mask average precision (AP) using the definition and implementation from the COCO dataset [19] (for more detail see §2.3).

Evaluation Challenges. Datasets like PASCAL VOC and COCO use manually selected categories that are pairwise disjoint: when annotating a car, there's never any question if the object is instead a potted plant or a sofa. When increasing the number of categories, it is inevitable that other types of pairwise relationships will occur: (1) partially overlapping visual concepts; (2) parent-child relationships; and (3) perfect synonyms. See Fig. 2 for examples.

If these relations are not properly addressed, then the evaluation protocol will be unfair. For example, most toys are not deer and most deer are not toys, but a toy deer is both; if a detector outputs deer and the object is only labeled toy, the detection will be marked as wrong.
Likewise, if a car is only labeled vehicle, and the algorithm outputs car, it will be incorrectly judged to be wrong. Or, if an object is only labeled backpack and the algorithm outputs the synonym rucksack, it will be incorrectly penalized. Providing a fair benchmark is important for accurately reflecting algorithm performance.

These problems occur when the ground-truth annotations are missing one or more true labels for an object. If an algorithm happens to predict one of these correct, but missing labels, it will be unfairly penalized. Now, if all objects are exhaustively and correctly labeled with all categories, then the problem is trivially solved. But correctly and exhaustively labeling 164k images each with 1000 categories is undesirable: it forces a binary judgement deciding if each category applies to each object; there will be many cases of genuine ambiguity and inter-annotator disagreement. Moreover, the annotation workload will be very large. Given these drawbacks, we describe our solution next.

2.2. Federated Datasets

Our key observation is that the desired evaluation protocol does not require us to exhaustively annotate all images with all categories. What is required instead is that for each category c there must exist two disjoint subsets of the entire dataset D for which the following guarantees hold:

Positive set: there exists a subset of images Pc ⊆ D such that all instances of c in Pc are segmented. In other words, Pc is exhaustively annotated for category c.

Negative set: there exists a subset of images Nc ⊆ D such that no instance of c appears in any of these images.

Given these two subsets for a category c, Pc ∪ Nc can be used to perform standard COCO-style AP evaluation for c. The evaluation oracle only judges the algorithm on a category c over the subset of images in which c has been exhaustively annotated; if a detector reports a detection of category c on an image i ∉ Pc ∪ Nc, the detection is not evaluated.

By collecting the per-category sets into a single dataset, D = ∪c (Pc ∪ Nc), we arrive at the concept of a federated dataset. A federated dataset is a dataset that is formed by the union of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. By not annotating all images with all categories, freedom is created to design an annotation process that avoids ambiguous cases and collects annotations only if there is sufficient inter-annotator agreement. At the same time, the workload can be dramatically reduced.

Finally, we note that positive set and negative set membership on the test split is not disclosed and therefore algorithms have no side information about what categories will be evaluated in each image. An algorithm thus must make its best prediction for all categories in each test image.

Reduced Workload. Federated dataset design allows us to make |Pc ∪ Nc| ≪ |D|, ∀c. This choice dramatically reduces the workload and allows us to undersample the most frequent categories in order to avoid wasting annotation resources on them (e.g. person accounts for 30% of COCO). Of our estimated ∼2 million instances, likely no single category will account for more than ∼3% of the total instances.

2.3. Evaluation Details

The challenge evaluation server will only return the overall AP, not per-category APs. We do this because: (1) it avoids leaking which categories are present in the test set;² (2) given that tail categories are rare, there will be few examples for evaluation in some cases, which makes per-category AP unstable; (3) by averaging over a large number of categories, the overall category-averaged AP has lower variance, making it a robust metric for ranking algorithms.

² It's possible that the categories present in the val and test sets may be a strict subset of those in the train set; we use the standard COCO 2017 val and test splits and cannot guarantee that all categories present in the train images are also present in val and test.

Non-Exhaustive Annotations. We also collect an image-level boolean label, e_{c,i}, indicating if image i ∈ Pc is exhaustively annotated for category c. In most cases (91%), this flag is true, indicating that the annotations are indeed exhaustive. In the remaining cases, there is at least one instance in the image that is not annotated. Missing annotations often occur in 'crowds' where there are a large number of instances and delineating them is difficult. During evaluation, we do not count false positives for category c on images i that have e_{c,i} set to false. We do measure recall on these images: the detector is expected to predict accurate segmentation masks for the labeled instances. Our strategy differs from other datasets that use a small maximum number of instances per image, per category (10-15) together with 'crowd regions' (COCO) or use a special 'group of c' label to represent 5 or more instances (Open Images v4). Our annotation pipeline (§3) attempts to collect segmentations for all instances in an image, regardless of count, and then checks if the labeling is in fact exhaustive. See Fig. 3.

Hierarchy. During evaluation, we treat all categories the same; we do nothing special in the case of hierarchical relationships. To perform best, for each detected object o, the detector should output the most specific correct category as well as all more general categories, e.g., a canoe should be labeled both canoe and boat. The detected object o in image i will be evaluated with respect to all labeled positive categories {c | i ∈ Pc}, which may be any subset of categories between the most specific and the most general.

Synonyms. A federated dataset that separates synonyms into different categories is valid, but is unnecessarily fragmented (see Fig. 2, right). We avoid splitting synonyms into separate categories with WordNet [21]. Specifically, in LVIS each category c is a WordNet synset: a word sense specified by a set of synonyms and a definition.
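To make the evaluation rules of §2.2 and §2.3 concrete, the sketch below shows how detections for a single category c would be filtered before standard COCO-style matching: detections on images outside Pc ∪ Nc are ignored entirely, and on images whose exhaustive-annotation flag e_{c,i} is false, unmatched detections are not counted as false positives. This is our own illustrative simplification with hypothetical field names, not the official evaluation code.

    # Illustrative sketch, not the official LVIS evaluation code.
    # All detections below are assumed to belong to a single category c.

    def filter_detections_for_category(detections, positive_set, negative_set, exhaustive):
        """detections:   list of dicts, each with an 'image_id' key (hypothetical format).
        positive_set: P_c, image ids annotated for category c.
        negative_set: N_c, image ids guaranteed to contain no instance of c.
        exhaustive:   dict image_id -> bool, the e_{c,i} flag for images in P_c.
        Returns (evaluated, ignored) lists of detections."""
        evaluated, ignored = [], []
        for det in detections:
            i = det["image_id"]
            if i not in positive_set and i not in negative_set:
                # Outside P_c ∪ N_c: the evaluation oracle does not judge this detection.
                ignored.append(det)
            elif i in positive_set and not exhaustive.get(i, True):
                # e_{c,i} is false: if this detection stays unmatched it is not
                # counted as a false positive (recall is still measured).
                evaluated.append({**det, "ignore_if_unmatched": True})
            else:
                evaluated.append(det)
        return evaluated, ignored

Matching the surviving detections to ground-truth masks and computing AP then proceed as in COCO evaluation.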
Figure 4. Our annotation pipeline comprises six stages. Stage 1: Object Spotting elicits annotators to mark a single instance of many different categories per image. This stage is iterative and causes annotators to discover a long tail of categories. Stage 2: Exhaustive Instance Marking extends the stage 1 annotations to cover all instances of each spotted category. Here we show additional instances of book. Stages 3 and 4: Instance Segmentation and Verification are repeated back and forth until ∼99% of all segmentations pass a quality check. Stage 5: Exhaustive Annotations Verification checks that all instances are in fact segmented and flags categories that are missing one or more instances. Stage 6: Negative Labels are assigned by verifying that a subset of categories do not appear in the image.

3. Dataset Construction

In this section we provide an overview of the annotation pipeline that we use to collect LVIS.

3.1. Annotation Pipeline

Fig. 4 illustrates our annotation pipeline by showing the output of each stage, which we describe below. For now, assume that we have a fixed category vocabulary V. We will describe how the vocabulary was collected in §3.2.

Object Spotting, Stage 1. The goals of the object spotting stage are to: (1) generate the positive set, Pc, for each category c ∈ V and (2) elicit vocabulary recall such that many different object categories are included in the dataset.

Object spotting is an iterative process in which each image is visited a variable number of times. On the first visit, an annotator is asked to mark one object with a point and to name it with a category c ∈ V using an autocomplete text input. On each subsequent visit, all previously spotted objects are displayed and an annotator is asked to mark an object of a previously unmarked category or to skip the image if no more categories in V can be spotted. When an image has been skipped 3 times, it will no longer be visited. The autocomplete is performed against the set of all synonyms, presented with their definitions; we internally map the selected word to its synset/category to resolve synonyms.

Obvious and salient objects are spotted early in this iterative process. As an image is visited more, less obvious objects are spotted, including incidental, non-salient ones. We run the spotting stage twice, and for each image we retain categories that were spotted in both runs. Thus two people must independently agree on a name in order for it to be included in the dataset; this increases naming consistency.

To summarize the output of stage 1: for each category in the vocabulary, we have a (possibly empty) set of images in which one object of that category is marked per image. This defines an initial positive set, Pc, for each category c.

Exhaustive Instance Marking, Stage 2. The goals of this stage are to: (1) verify stage 1 annotations and (2) take each image i ∈ Pc and mark all instances of c in i with a point. In this stage, (i, c) pairs from stage 1 are each sent to 5 annotators. They are asked to perform two steps. First, they are shown the definition of category c and asked to verify if it describes the spotted object. Second, if it matches, then the annotators are asked to mark all other instances of the same category. If it does not match, there is no second step. To prevent frequent categories from dominating the dataset and to reduce the overall workload, we subsample frequent categories such that no positive set exceeds 1% of the images in the dataset.

To ensure annotation quality, we embed a 'gold set' within the pool of work. These are cases for which we know the correct ground-truth. We use the gold set to automatically evaluate the work quality of each annotator so that we can direct work towards more reliable annotators. We use 5 annotators per (i, c) pair to help ensure instance-level recall.

To summarize, from stage 2 we have exhaustive instance spotting for each image i ∈ Pc for each category c ∈ V.

Instance Segmentation, Stage 3. The goals of the instance segmentation stage are to: (1) verify the category for each marked object from stage 2 and (2) upgrade each marked object from a point annotation to a full segmentation mask. To do this, each pair (i, o) of image i and marked object instance o is presented to one annotator who is asked to verify that the category label for o is correct and, if it is correct, to draw a detailed segmentation mask for it (e.g. see Fig. 3).

We use a training task to establish our quality standards. Annotator quality is assessed with a gold set and by tracking their average vertex count per polygon. We use these metrics to assign work to reliable annotators.

In sum, from stage 3 we have for each image and spotted instance pair one segmentation mask (if it is not rejected).
Segment Verification, Stage 4. The goal of the segment verification stage is to verify the quality of the segmentation masks from stage 3. We show each segmentation to up to 5 annotators and ask them to rate its quality using a rubric. If two or more annotators reject the mask, then we requeue the instance for stage 3 segmentation. Thus we only accept a segmentation if 4 annotators agree it is high-quality. Unreliable workers from stage 3 are not invited to judge segmentations in stage 4; we also use rejection rates from this stage to monitor annotator reliability. We iterate between stages 3 & 4 a total of four times, each time only re-annotating rejected instances.

To summarize the output of stage 4 (after iterating back and forth with stage 3): we have a high-quality segmentation mask for >99% of all marked objects.

Full Recall Verification, Stage 5. The full recall verification stage finalizes the positive sets. The goal is to find images i ∈ Pc where c is not exhaustively annotated. We do this by asking annotators if there are any unsegmented instances of category c in i. We ask up to 5 annotators and require at least 4 to agree that annotation is exhaustive. As soon as two believe it is not, we mark the exhaustive annotation flag e_{c,i} as false. We use a gold set to maintain quality.

To summarize the output of stage 5: we have a boolean flag e_{c,i} for each image i ∈ Pc indicating if category c is exhaustively annotated in image i. This finalizes the positive sets along with their instance segmentation annotations.

Negative Sets, Stage 6. The final stage of the pipeline is to collect a negative set Nc for each category c in the vocabulary. We do this by randomly sampling images i ∈ D \ Pc, where D is all images in the dataset. For each sampled image i, we ask up to 5 annotators if category c appears in image i. If any one annotator reports that it does, we reject the image. Otherwise i is added to Nc. We sample until the negative set Nc reaches a target size of 1% of the images in the dataset. We use a gold set to maintain quality.

To summarize, from stage 6 we have a negative image set Nc for each category c ∈ V such that the category does not appear in any of the images in Nc.

3.2. Vocabulary Construction

We construct the vocabulary V with an iterative process that starts from a large super-vocabulary and uses the object spotting process (stage 1) to winnow it down. We start from 8.8k synsets that were selected from WordNet by removing some obvious cases (e.g. proper nouns) and then finding the intersection with highly concrete common nouns [2]. This yields a high-recall set of concrete, and thus likely visual, entry-level synsets. We then apply object spotting to 10k COCO images with autocomplete against this super-vocabulary. This yields a reduced vocabulary with which we repeat the process once more. Finally, we perform minor manual editing. The resulting vocabulary contains 1723 synsets: the upper bound on the number of categories that can appear in LVIS.

Figure 5. Distribution of object centers in normalized image coordinates for four datasets. ADE20K exhibits the greatest spatial diversity, with LVIS achieving greater complexity than COCO and the Open Images v4 training set.³

4. Dataset Analysis

For analysis, we have annotated 5000 images (the COCO val2017 split) twice using the proposed pipeline. We begin by discussing general dataset statistics next before proceeding to an analysis of annotation consistency in §4.2 and an analysis of the evaluation protocol in §4.3.

4.1. Dataset Statistics

Category Statistics. There are 977 categories present in the 5000 LVIS images. The category growth rate (see Fig. 9) indicates that the final dataset will have well over 1000 categories. On average, each image is annotated with 11.2 instances from 3.4 categories. The largest instances-per-image count is a remarkable 294. Fig. 6a shows the full categories-per-image distribution. LVIS's distribution has more spread than COCO's, indicating that many images are labeled with more categories. The low-shot nature of our dataset can be seen in Fig. 6b, which plots the total number of instances for each category (in the 5000 images). The median value is 9, and while this number will be larger for the full image set, this statistic highlights the challenging long-tailed nature of our data.

Spatial Statistics. Our object spotting process (stage 1) encourages the inclusion of objects distributed throughout the image plane, not just the most salient foreground objects. The effect can be seen in Fig. 5, which shows object-center density plots. All datasets have some degree of center bias, with ADE20K and LVIS having the most diverse spatial distribution. COCO and Open Images v4 (training set³) have similar object-center distributions with a marginally lower degree of spatial diversity.

³ The CVPR 2019 version of this paper shows the distribution of the Open Images v4 validation set, which has more center bias. The peakiness is also exaggerated due to an intensity scaling artifact. For more details, see https://storage.googleapis.com/openimages/web/factsfigures.html.

Scale Statistics. Objects in LVIS are also more likely to be small. Fig. 6c shows the relative size distribution of object masks: compared with COCO, LVIS objects tend to be smaller and there are fewer large objects (e.g., objects that occupy most of an image are ∼10× less frequent). ADE20K has the fewest large objects overall and more medium ones.
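The relative size measure plotted in Fig. 6c is the square root of mask area divided by image area. A minimal sketch for a binary mask stored as a NumPy array (our own helper, not code from the dataset toolkit):

    import numpy as np

    def relative_mask_size(mask):
        """Square root of (mask area / image area) for a boolean instance mask."""
        mask = np.asarray(mask, dtype=bool)
        return float(np.sqrt(mask.sum() / mask.size))

    # Example: a 100x100 object in a 480x640 image covers ~3.3% of the pixels,
    # giving a relative size of about 0.18.
    toy = np.zeros((480, 640), dtype=bool)
    toy[100:200, 100:200] = True
    print(relative_mask_size(toy))  # ~0.18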
Figure 6. Dataset statistics. Best viewed digitally. (a) Distribution of category count per image: LVIS has a heavier tail than COCO and the Open Images training set; ADE20K is the most uniform. (b) The number of instances per category (on 5k images) reveals the long tail with few examples; orange dots mark categories in common with COCO. (c) Relative segmentation mask size (square root of mask-area-divided-by-image-area) compared between LVIS, COCO, and ADE20K.
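The long-tail curve in Fig. 6b is a sorted histogram of instance counts per category. A minimal sketch over COCO/LVIS-style annotation records, under the assumption that each record carries a 'category_id' field:

    from collections import Counter

    def sorted_instance_counts(annotations):
        """Per-category instance counts, sorted from most to least frequent
        (the curve shown in Fig. 6b)."""
        counts = Counter(ann["category_id"] for ann in annotations)
        return sorted(counts.values(), reverse=True)

    # Toy example with a long-tailed distribution over three categories.
    toy = [{"category_id": c} for c in [1] * 80 + [2] * 12 + [3] * 2]
    print(sorted_instance_counts(toy))  # [80, 12, 2]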

Figure 7. Annotation consistency using 5000 doubly annotated images from LVIS. Best viewed digitally. (a) LVIS segmentation quality measured by mask IoU between matched instances from two runs of our annotation pipeline: masks from the runs are consistent with a dataset average IoU of 0.85. (b) LVIS recognition quality measured by F1 score given matched instances across two runs of our annotation pipeline: category labeling is consistent with a dataset average F1 score of 0.87. (c) Illustration of mask IoU vs. boundary quality (the two examples score mask IoU 0.91 / 0.94 with boundary quality 0.82 / 0.99) to provide intuition for interpreting Fig. 7a and Tab. 1a (dataset annotations vs. expert annotators).

Table 1. Annotation quality and complexity relative to experts.

(a) For each metric (mask IoU, boundary quality) and each statistic (mean, median), we show a bootstrapped 95% confidence interval. LVIS has the highest quality across all measures.

                                       mask IoU                   boundary quality
dataset   comparison                   mean         median        mean         median
COCO      dataset vs. experts          0.83 – 0.87  0.88 – 0.91   0.77 – 0.82  0.79 – 0.88
COCO      expert1 vs. expert2          0.91 – 0.95  0.96 – 0.98   0.92 – 0.96  0.97 – 0.99
ADE20K    dataset vs. experts          0.84 – 0.88  0.90 – 0.93   0.83 – 0.87  0.84 – 0.92
ADE20K    expert1 vs. expert2          0.90 – 0.94  0.95 – 0.97   0.90 – 0.95  0.99 – 1.00
LVIS      dataset vs. experts          0.90 – 0.92  0.94 – 0.96   0.87 – 0.91  0.93 – 0.98
LVIS      expert1 vs. expert2          0.93 – 0.96  0.96 – 0.98   0.91 – 0.96  0.97 – 1.00

(b) Comparison of annotation complexity. Boundary complexity is perimeter divided by square root area [1].

                      boundary complexity
dataset   annotation  mean         median
COCO      dataset     5.59 – 6.04  5.13 – 5.51
COCO      experts     6.94 – 7.84  5.86 – 6.80
ADE20K    dataset     6.00 – 6.84  4.79 – 5.31
ADE20K    experts     6.34 – 7.43  4.83 – 5.53
LVIS      dataset     6.35 – 7.07  5.44 – 6.00
LVIS      experts     7.13 – 8.48  5.91 – 6.82
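Boundary complexity in Tab. 1b is the mask perimeter divided by the square root of its area [1]. A minimal sketch for a binary mask using OpenCV contours; this is our own approximation of the measure, not the exact measurement code used for the table:

    import cv2
    import numpy as np

    def boundary_complexity(mask):
        """Perimeter / sqrt(area) of a binary mask; a square scores 4, a disk ~3.5."""
        mask = (np.asarray(mask) > 0).astype(np.uint8)
        res = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        contours = res[-2]  # handles both OpenCV 3.x and 4.x return conventions
        perimeter = sum(cv2.arcLength(c, True) for c in contours)
        area = float(mask.sum())
        return perimeter / np.sqrt(area) if area > 0 else 0.0

Because contours are traced on a pixel grid, the result is an approximation of the continuous perimeter, which is sufficient for comparing annotation sources.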

4.2. Annotation Consistency

Annotation Pipeline Repeatability. A repeatable annotation pipeline implies that the process generating the ground-truth data is not overly random and therefore may be learned. To understand repeatability, we annotated the 5000 images twice: after completing object spotting (stage 1), we have initial positive sets Pc for each category c; we then execute stages 2 through 5 (exhaustive instance marking through full recall verification) twice in order to yield doubly annotated positive sets. To compare them, we compute a matching between them for each image and category pair. We find a matching that maximizes the total mask intersection over union (IoU) summed over the matched pairs and then discard any matches with IoU < 0.5. Given these matches we compute the dataset average mask IoU (0.85) and the dataset average F1 score (0.87). Intuitively, these quantities describe 'segmentation quality' and 'recognition quality' [12]. The cumulative distributions of these metrics (Fig. 7a and 7b) show that even though matches are established based on a low IoU threshold (0.5), matched masks tend to have much higher IoU. The results show that roughly 50% of matched instances have IoU greater than 90% and roughly 75% of the image-category pairs have a perfect F1 score. Taken together, these metrics are a strong indication that our pipeline has a large degree of repeatability.

Comparison with Expert Annotators. To measure segmentation quality, we randomly selected 100 instances with mask area greater than 32² pixels from LVIS, COCO, and ADE20K. We presented these instances (indicated by bounding box and category) to two independent expert annotators and asked them to segment each object using professional image editing tools. We compare dataset annotations to expert annotations using mask IoU and boundary quality (boundary F [20]) in Tab. 1a. The results (bootstrapped 95% confidence intervals) show that our masks are high-quality, surpassing COCO and ADE20K on both measures (see Fig. 7c for intuition). At the same time, the objects in LVIS have more complex boundaries [1] (Tab. 1b).
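A minimal sketch of the matching step described at the start of this subsection, for one doubly annotated (image, category) pair: build the pairwise mask-IoU matrix between the two runs, pick the assignment that maximizes total IoU, and discard matches with IoU < 0.5. The paper does not name a specific solver, so using SciPy's Hungarian algorithm here is our assumption:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def mask_iou(a, b):
        a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
        union = np.logical_or(a, b).sum()
        return float(np.logical_and(a, b).sum() / union) if union else 0.0

    def match_two_runs(masks_run1, masks_run2, thresh=0.5):
        """Returns (matched IoUs, F1 score) for one (image, category) pair."""
        if not masks_run1 or not masks_run2:
            return [], 0.0
        iou = np.array([[mask_iou(a, b) for b in masks_run2] for a in masks_run1])
        rows, cols = linear_sum_assignment(-iou)   # maximize summed IoU
        matched = [iou[r, c] for r, c in zip(rows, cols) if iou[r, c] >= thresh]
        # F1 = 2*TP / (2*TP + FP + FN) = 2*TP / (|run1| + |run2|)
        f1 = 2 * len(matched) / (len(masks_run1) + len(masks_run2))
        return matched, f1

Averaging these per-pair quantities over the dataset yields the kind of summary statistics quoted above (0.85 average IoU, 0.87 average F1).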
Figure 8. Detection experiments using COCO and 5000 annotated images from LVIS. Best viewed digitally. (a) Given fixed detections, we show how AP varies with max |Nc|, the max number of negative images per category used in evaluation. (b) With the same detections from Fig. 8a and max |Nc| = 50, we show how AP varies as we vary max |Pc|, the max positive set size. (c) Low-shot detection is an open problem: training Mask R-CNN on 1k images decreases COCO val2017 mask AP from 36% to 10%.

Table 2. COCO-trained Mask R-CNN evaluated on LVIS annotations. Both annotations yield similar AP values.

Mask R-CNN                              test anno.  box AP  mask AP
R-50-FPN (model id: 35859007)           COCO        38.2    34.1
                                        LVIS        38.8    34.4
R-101-FPN (model id: 35861858)          COCO        40.6    36.0
                                        LVIS        40.9    36.0
X-101-64x4d-FPN (model id: 37129812)    COCO        47.8    41.2
                                        LVIS        48.6    41.7

Figure 9. (Left) As more images are annotated, new categories are discovered. (Right) Consequently, the percentage of low-shot categories (blue curve) remains large, decreasing slowly. Categories are binned by the number of images they appear in: 1-10 (rare), 11-100 (common), >100 (frequent).
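Fig. 9 (right) and the disaggregated metrics discussed in §4.3 rely on binning categories by how many training images they appear in. A minimal sketch of this binning; the threshold values come from the paper, while the input data structure is assumed:

    def frequency_bin(num_images):
        """LVIS frequency bins: rare (1-10 images), common (11-100), frequent (>100)."""
        if num_images <= 10:
            return "rare"
        if num_images <= 100:
            return "common"
        return "frequent"

    def bin_categories(images_per_category):
        """images_per_category: dict category_id -> set of training image ids."""
        return {c: frequency_bin(len(imgs)) for c, imgs in images_per_category.items()}

    print(frequency_bin(7), frequency_bin(42), frequency_bin(240))
    # rare common frequent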
4.3. Evaluation Protocol

COCO Detectors on LVIS. To validate our annotations and federated dataset design we downloaded three Mask R-CNN [9] models from the Detectron Model Zoo [7] and evaluated them on LVIS annotations for the categories in COCO. Tab. 2 shows that both box AP and mask AP are close between our annotations and the original ones from COCO for all models, which span a wide AP range. This result validates our annotations and evaluation protocol: even though LVIS uses a federated dataset design with sparse annotations, the quantitative outcome closely reproduces the 'gold standard' results from dense COCO annotations.

Federated Dataset Simulations. For insight into how AP changes with positive and negative set sizes |Pc| and |Nc|, we randomly sample smaller evaluation sets from COCO val2017 and recompute AP. To plot quartiles and min-max ranges, we re-test each setting 20 times. In Fig. 8a we use all positive instances for evaluation, but vary max |Nc| between 50 and 5k. AP decreases somewhat (∼2 points) as we increase the number of negative images, as the ratio of negative to positive examples grows with fixed |Pc| and increasing |Nc|. Next, in Fig. 8b we set max |Nc| = 50 and vary |Pc|. We observe that even with a small positive set size of 80, AP is similar to the baseline with low variance. With smaller positive sets (down to 5) variance increases, but the AP gap from 1st to 3rd quartile remains below 2 points. These simulations, together with COCO detectors tested on LVIS (Tab. 2), indicate that using smaller per-category evaluation sets is viable.

Low-Shot Detection. To validate the claim that low-shot detection is a challenging open problem, we trained Mask R-CNN on random subsets of COCO train2017 ranging from 1k to 118k images. For each subset, we optimized the learning rate schedule and weight decay by grid search. Results on val2017 are shown in Fig. 8c. At 1k images, mask AP drops from 36.4% (full dataset) to 9.8% (1k subset). In the 1k subset, 89% of the categories have more than 20 training instances, while the low-shot literature typically considers ≪ 20 examples per category [8].

Low-Shot Category Statistics. Fig. 9 (left) shows category growth as a function of image count (up to 977 categories in 5k images). Extrapolating the trajectory, our final dataset will include over 1k categories (upper bounded by the vocabulary size, 1723). Since the number of categories increases during data collection, the low-shot nature of LVIS is somewhat independent of the dataset scale; see Fig. 9 (right), where we bin categories based on how many images they appear in: rare (1-10 images), common (11-100), and frequent (>100). These bins, as measured w.r.t. the training set, will be used to present disaggregated AP metrics.

5. Conclusion

We introduced LVIS, a new dataset designed to enable, for the first time, the rigorous study of instance segmentation algorithms that can recognize a large vocabulary of object categories (>1000) and must do so using methods that can cope with the open problem of low-shot learning. While LVIS emphasizes learning from few examples, the dataset is not small: it will span 164k images and label ∼2 million object instances. Each object instance is segmented with a high-quality mask that surpasses the annotation quality of related datasets. We plan to establish LVIS as a benchmark challenge that we hope will lead to exciting new object detection, segmentation, and low-shot learning algorithms.
A. LVIS Release v0.5

LVIS release v0.5 marks the halfway point in data collection. For this release, we have annotated an additional 77k images (57k train, 20k test) beyond the 5k val images that we analyzed in the previous sections, for a total of 82k annotated images. Release v0.5 is publicly available at http://www.lvisdataset.org and will be used in the first LVIS Challenge to be held in conjunction with the COCO Workshop at ICCV 2019.

Collection Details. We collected this data in two 38.5k image batches using the process described in the main paper. Each batch contained a proportional mixture of train and test set images. After all stages were completed for the first batch, we (the authors of this paper) manually checked all 1415 categories that were represented in the raw data collection and cast an include vs. exclude vote for each category based on its visual consistency. This process led to the removal of ∼18% of the categories and ∼10% of the labeled instances. After collecting the second batch, we repeated this process for the 83 new categories that were newly introduced. After we finish the full data collection for v1 (estimated January 2020), we will conduct another similar quality control pass on a subset of the categories.

LVIS val v0.5 is the same as the set used for analysis in the main paper, except that we: (1) removed categories that were determined to be visually inconsistent in the quality control pass and (2) removed any categories with zero instances in the training set. In this section, we refer to the set of images used for analysis in the main paper as 'LVIS val (unpruned)'.

Statistics. After our quality control pass, the final category count for release v0.5 is 1230. The number of categories in the val set decreased from 977 to 830, due to quality control, and now has 56k segmented object instances. The train v0.5 set has 694k segmented instances.

We now repeat some of the key analysis plots, this time showing the final val and train v0.5 sets in comparison to the original (unpruned) val set that was analyzed in the main paper. The train and test sets are collected using an identical process (the images are mixed together in each annotation batch) and therefore the training data is statistically identical to that of the test data (noting that the train and test images were randomly sampled from the same image distribution when COCO was collected).

Fig. 10a illustrates the category growth rate on the train set and the val set before and after pruning. We expect only modest growth while collecting the second half of the dataset, perhaps expanding by ∼100 additional categories. Next, we extend Fig. 9 (right) from 5k images to 57k images using the train v0.5 data, as shown in Fig. 10b. Due to the slowing category growth, the percent of rare categories (those appearing in 1-10 training images) is decreasing, but remains a sizeable portion of the dataset. Roughly 75% of categories appear in 100 training images or less, highlighting the challenging low-shot nature of the dataset.

Finally, we look at the spatial distribution of object centers in Fig. 11. This visualization verifies that quality control did not lead to a meaningful bias in this statistic. The train and val sets exhibit visually similar distributions.

Figure 10. Category growth and frequency statistics for LVIS v0.5. Best viewed digitally. Compare with Fig. 9. (a) Category growth comparison between the train and val sets (before and after quality control). (b) As category growth slows, the percent of rare categories decreases, but remains large (train v0.5 set).

Figure 11. Distribution of object centers in normalized image coordinates for LVIS val (prior to quality control; i.e. the same as in Fig. 5), LVIS val v0.5 (after quality control), and LVIS train v0.5. The distributions are nearly identical.

Summary. Based on this analysis and our qualitative judgement when performing per-category quality control, we conclude that our data collection process scales well beyond the initial 5k set analyzed in the main paper.

Acknowledgements. We thank Ilija Radosavovic, Amanpreet Singh, Alexander Kirillov, and Tsung-Yi Lin for their help during the creation of LVIS. We would also like to thank the COCO Committee for granting us permission to annotate the COCO test set. We are grateful to Amanpreet Singh for his help in creating the LVIS website.
Figure 12. Example annotations from our dataset. For clarity, we show one category per image. (NE) signifies that the category was not exhaustively annotated in the image. See http://www.lvisdataset.org/explore to explore LVIS in detail.
References

[1] Fred Attneave and Malcolm D. Arnoult. The quantitative study of shape and pattern perception. Psychological Bulletin, 1956.
[2] Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 2014.
[3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[4] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2012.
[5] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
[6] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. TPAMI, 2006.
[7] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[8] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[12] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The Open Images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
[15] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[16] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[17] Marc Liberman. Reproducible research and the common task method. Simons Foundation Lecture, https://www.simonsfoundation.org/lecture/reproducible-research-and-the-common-task-method/, 2015.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. COCO detection evaluation. http://cocodataset.org/#detection-eval, accessed Oct 30, 2018.
[20] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[21] George Miller. WordNet: An electronic lexical database. MIT Press, 1998.
[22] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[24] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A database and web-based tool for image annotation. IJCV, 2008.
[25] Merrielle Spain and Pietro Perona. Measuring and predicting importance of objects in our visual world. Technical Report CNS-TR-2007-002, California Institute of Technology, 2007.
[26] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
[27] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[28] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019.
[29] George Kingsley Zipf. The psycho-biology of language: An introduction to dynamic philology. Routledge, 2013.
