LVIS: A Dataset For Large Vocabulary Instance Segmentation
1. Introduction

…suitable, high-quality dataset and benchmark is required.

To build our dataset, we adopt an evaluation-first design principle. This principle states that we should first determine exactly how to perform quantitative evaluation and only then design and build a dataset collection pipeline to gather the data entailed by the evaluation. We select our benchmark task to be COCO-style instance segmentation and we use the same COCO-style average precision (AP) metric that averages over categories and different mask intersection over union (IoU) thresholds [19]. Task and metric continuity with COCO reduces barriers to entry.

Figure 2. Category relationships from left to right (illustrated with toy/deer, car/vehicle/truck, and backpack/rucksack): non-disjoint category pairs may be in partially overlapping, parent-child, or equivalent (synonym) relationships, implying that a single object may have multiple valid labels. The fair evaluation of an object detector must take the issue of multiple valid labels into account.
Buried within this seemingly innocuous task choice are immediate technical challenges: How do we fairly evaluate detectors when one object can reasonably be labeled with multiple categories (see Fig. 2)? How do we make the annotation workload feasible when labeling 164k images with segmented objects from over 1000 categories?

The essential design choice resolving these challenges is to build a federated dataset: a single dataset that is formed by the union of a large number of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. Each small dataset provides the essential guarantee of exhaustive annotations for a single category—all instances of that category are annotated. Multiple constituent datasets may overlap and thus a single object within an image can be labeled with multiple categories. Furthermore, since the exhaustive annotation guarantee only holds within each small dataset, we do not require the entire federated dataset to be exhaustively annotated with all categories, which dramatically reduces the annotation workload. Crucially, at test time the membership of each image with respect to the constituent datasets is not known by the algorithm and thus it must make predictions as if all categories will be evaluated. The evaluation oracle evaluates each category fairly on its constituent dataset.

In the remainder of this paper, we summarize how our dataset and benchmark relate to prior work, provide details on the evaluation protocol, describe how we collected data, and then discuss results of the analysis of this data.

Dataset Timeline. We report detailed analysis on the 5000 image val subset that we have annotated twice. We have now annotated an additional 77k images (split between train, val, and test), representing ∼50% of the final dataset; we refer to this as LVIS v0.5 (see §A for details). The first LVIS Challenge, based on v0.5, will be held at the COCO Workshop at ICCV 2019.

1.1. Related Datasets

Datasets shape the technical problems researchers study and consequently the path of scientific discovery [17]. We owe much of our current success in image recognition to pioneering datasets such as MNIST [16], BSDS [20], Caltech 101 [6], PASCAL VOC [5], ImageNet [23], and COCO [18]. These datasets enabled the development of algorithms that detect edges, perform large-scale image classification, and localize objects by bounding boxes and segmentation masks. They were also used in the discovery of important ideas, such as Convolutional Networks [15, 13], Residual Networks [10], and Batch Normalization [11]. LVIS is inspired by these and other related datasets, including those focused on street scenes (Cityscapes [3] and Mapillary [22]) and pedestrians (Caltech Pedestrians [4]). We review the most closely related datasets below.

COCO [18] is the most popular instance segmentation benchmark for common objects. It contains 80 categories that are pairwise distinct. There are a total of 118k training images, 5k validation images, and 41k test images. All 80 categories are exhaustively annotated in all images (ignoring annotation errors), leading to approximately 1.2 million instance segmentation masks. To establish continuity with COCO, we adopt the same instance segmentation task and AP metric, and we are also annotating all images from the COCO 2017 dataset. All 80 COCO categories can be mapped into our dataset. In addition to representing an order of magnitude more categories than COCO, our annotation pipeline leads to higher-quality segmentation masks that more closely follow object boundaries (see §4).

ADE20K [28] is an ambitious effort to annotate almost every pixel in 25k images with object instance, 'stuff', and part segmentations. The dataset includes approximately 3000 named objects, stuff regions, and parts. Notably, ADE20K was annotated by a single expert annotator, which increases consistency but also limits dataset size. Due to the relatively small number of annotated images, most of the categories do not have enough data to allow for both training and evaluation. Consequently, the instance segmentation benchmark associated with ADE20K evaluates algorithms on the 100 most frequent categories. In contrast, our goal is to enable benchmarking of large vocabulary instance segmentation methods.

iNaturalist [26] contains nearly 900k images annotated with bounding boxes for 5000 plant and animal species. Similar to our goals, iNaturalist emphasizes the importance of benchmarking classification and detection in the few example regime. Unlike our effort, iNaturalist does not include segmentation masks and is focussed on a different image and fine-grained category distribution; our category distribution emphasizes entry-level categories.
Figure 3. Example LVIS annotations (one category per image for clarity). See http://www.lvisdataset.org/explore.

Open Images v4 [14] is a large dataset of 1.9M images. The detection portion of the dataset includes 15M bounding boxes labeled with 600 object categories. The associated benchmark evaluates the 500 most frequent categories, all of which have over 100 training samples (>70% of them have over 1000 training samples). Thus, unlike our benchmark, low-shot learning is not integral to Open Images. Also different from our dataset is the use of machine learning algorithms to select which images will be annotated by using classifiers for the target categories. Our data collection process, in contrast, involves no machine learning algorithms and instead discovers the objects that appear within a given set of images. Starting with release v4, Open Images has used a federated dataset design for object detection.

2. Dataset Design

We followed an evaluation-first design principle: prior to any data collection, we precisely defined what task would be performed and how it would be evaluated. This principle is important because there are technical challenges that arise when evaluating detectors on a large vocabulary dataset that do not occur when there are few categories. These must be resolved first, because they have profound implications for the structure of the dataset, as we discuss next.

2.1. Task and Evaluation Overview

Task and Metric. Our dataset benchmark is the instance segmentation task: given a fixed, known set of categories, design an algorithm that when presented with a previously unseen image will output a segmentation mask for each instance of each category that appears in the image along with the category label and a confidence score. Given the output of an algorithm over a set of images, we compute mask average precision (AP) using the definition and implementation from the COCO dataset [19] (for more detail see §2.3).
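Since the metric is adopted unchanged from COCO, the computation can be sketched with the standard pycocotools implementation. The snippet below only illustrates that metric; it is not the LVIS evaluation code, and the file names are placeholders.

```python
# Minimal sketch of COCO-style mask AP evaluation using pycocotools.
# Assumes ground truth and detections in standard COCO JSON format;
# "gt.json" and "detections.json" are hypothetical placeholder paths.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("gt.json")                      # exhaustive ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")   # detector output: masks + scores

# 'segm' selects mask IoU; AP is averaged over categories and over the
# IoU thresholds 0.5:0.05:0.95, exactly as in the COCO definition.
evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
print("mask AP:", evaluator.stats[0])
```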
Evaluation Challenges. Datasets like PASCAL VOC and COCO use manually selected categories that are pairwise disjoint: when annotating a car, there's never any question if the object is instead a potted plant or a sofa. When increasing the number of categories, it is inevitable that other types of pairwise relationships will occur: (1) partially overlapping visual concepts; (2) parent-child relationships; and (3) perfect synonyms. See Fig. 2 for examples.
If these relations are not properly addressed, then the evaluation protocol will be unfair. For example, most toys are not deer and most deer are not toys, but a toy deer is both—if a detector outputs deer and the object is only labeled toy, the detection will be marked as wrong. Likewise, if a car is only labeled vehicle, and the algorithm outputs car, it will be incorrectly judged to be wrong. Or, if an object is only labeled backpack and the algorithm outputs the synonym rucksack, it will be incorrectly penalized. Providing a fair benchmark is important for accurately reflecting algorithm performance.

These problems occur when the ground-truth annotations are missing one or more true labels for an object. If an algorithm happens to predict one of these correct, but missing labels, it will be unfairly penalized. Now, if all objects are exhaustively and correctly labeled with all categories, then the problem is trivially solved. But correctly and exhaustively labeling 164k images each with 1000 categories is undesirable: it forces a binary judgement deciding if each category applies to each object; there will be many cases of genuine ambiguity and inter-annotator disagreement. Moreover, the annotation workload will be very large. Given these drawbacks, we describe our solution next.

2.2. Federated Datasets

Our key observation is that the desired evaluation protocol does not require us to exhaustively annotate all images with all categories. What is required instead is that for each category c there must exist two disjoint subsets of the entire dataset D for which the following guarantees hold:

Positive set: there exists a subset of images Pc ⊆ D such that all instances of c in Pc are segmented. In other words, Pc is exhaustively annotated for category c.

Negative set: there exists a subset of images Nc ⊆ D such that no instance of c appears in any of these images.

Given these two subsets for a category c, Pc ∪ Nc can be used to perform standard COCO-style AP evaluation for c. The evaluation oracle only judges the algorithm on a category c over the subset of images in which c has been exhaustively annotated; if a detector reports a detection of category c on an image i ∉ Pc ∪ Nc, the detection is not evaluated.

By collecting the per-category sets into a single dataset, D = ∪c (Pc ∪ Nc), we arrive at the concept of a federated dataset. A federated dataset is a dataset that is formed by the union of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. By not annotating all images with all categories, freedom is created to design an annotation process that avoids ambiguous cases and collects annotations only if there is sufficient inter-annotator agreement. At the same time, the workload can be dramatically reduced.

Finally, we note that positive set and negative set membership on the test split is not disclosed and therefore algorithms have no side information about what categories will be evaluated in each image. An algorithm thus must make its best prediction for all categories in each test image.

Reduced Workload. Federated dataset design allows us to make |Pc ∪ Nc| ≪ |D|, ∀c. This choice dramatically reduces the workload and allows us to undersample the most frequent categories in order to avoid wasting annotation resources on them (e.g. person accounts for 30% of COCO). Of our estimated ∼2 million instances, likely no single category will account for more than ∼3% of the total instances.
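As an illustration of the gating rule described above (a sketch of our own, not the released evaluation code; all names and data structures are hypothetical), a detection of category c on image i enters the evaluation for c only if i ∈ Pc ∪ Nc:

```python
# Sketch of per-category federated evaluation gating.
# pos_sets[c] = Pc (images exhaustively annotated for c),
# neg_sets[c] = Nc (images guaranteed to contain no instance of c).
from collections import defaultdict

def gate_detections(detections, pos_sets, neg_sets):
    """Group detections by category, keeping only those on images
    where the category is actually evaluated (i in Pc ∪ Nc)."""
    per_category = defaultdict(list)
    for det in detections:  # det: dict with 'image_id', 'category', 'score', 'mask'
        c, i = det["category"], det["image_id"]
        eval_images = pos_sets.get(c, set()) | neg_sets.get(c, set())
        if i in eval_images:          # otherwise the detection is simply ignored
            per_category[c].append(det)
    return per_category

# Toy example: category "deer" is evaluated only on images {1, 2, 3}.
pos_sets = {"deer": {1, 2}}
neg_sets = {"deer": {3}}
dets = [{"image_id": 1, "category": "deer", "score": 0.9, "mask": None},
        {"image_id": 7, "category": "deer", "score": 0.8, "mask": None}]
print(gate_detections(dets, pos_sets, neg_sets))  # only the image-1 detection survives
```

Detections of c on images outside Pc ∪ Nc neither help nor hurt the detector, which is what makes |Pc ∪ Nc| ≪ |D| workable.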
2.3. Evaluation Details

The challenge evaluation server will only return the overall AP, not per-category APs. We do this because: (1) it avoids leaking which categories are present in the test set;² (2) given that tail categories are rare, there will be few examples for evaluation in some cases, which makes per-category AP unstable; (3) by averaging over a large number of categories, the overall category-averaged AP has lower variance, making it a robust metric for ranking algorithms.

² It's possible that the categories present in the val and test sets may be a strict subset of those in the train set; we use the standard COCO 2017 val and test splits and cannot guarantee that all categories present in the train images are also present in val and test.

Non-Exhaustive Annotations. We also collect an image-level boolean label, e_{c,i}, indicating if image i ∈ Pc is exhaustively annotated for category c. In most cases (91%), this flag is true, indicating that the annotations are indeed exhaustive. In the remaining cases, there is at least one instance in the image that is not annotated. Missing annotations often occur in 'crowds' where there are a large number of instances and delineating them is difficult. During evaluation, we do not count false positives for category c on images i that have e_{c,i} set to false. We do measure recall on these images: the detector is expected to predict accurate segmentation masks for the labeled instances. Our strategy differs from other datasets that use a small maximum number of instances per image, per category (10-15) together with 'crowd regions' (COCO) or use a special 'group of c' label to represent 5 or more instances (Open Images v4). Our annotation pipeline (§3) attempts to collect segmentations for all instances in an image, regardless of count, and then checks if the labeling is in fact exhaustive. See Fig. 3.
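A minimal sketch of how this rule can be applied to a single detection, assuming hypothetical data structures for Pc, Nc, and the flag e_{c,i} (an illustration, not the official implementation):

```python
# Sketch: decide how a detection of category c on image `image_id` is treated.
# pos_set = Pc, neg_set = Nc, exhaustive = e_{c,i} for this image,
# matched_gt = True if the detection matched an annotated instance of c.
def classify_detection(matched_gt, image_id, pos_set, neg_set, exhaustive):
    if image_id not in pos_set | neg_set:
        return "ignored"          # c is not evaluated on this image at all
    if matched_gt:
        return "true positive"    # recall is always measured on labeled instances
    if image_id in pos_set and not exhaustive:
        return "ignored"          # unmatched detection may be an unannotated instance
    return "false positive"       # exhaustive positive image, or a negative image

print(classify_detection(False, 5, pos_set={5}, neg_set=set(), exhaustive=False))
# -> "ignored": no false-positive penalty when e_{c,i} is false
```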
Hierarchy. During evaluation, we treat all categories the same; we do nothing special in the case of hierarchical relationships. To perform best, for each detected object o, the detector should output the most specific correct category as well as all more general categories, e.g., a canoe should be labeled both canoe and boat. The detected object o in image i will be evaluated with respect to all labeled positive categories {c | i ∈ Pc}, which may be any subset of categories between the most specific and the most general.
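One simple way for a detector to follow this advice is to replicate each prediction for all ancestors of the predicted category. The sketch below is only illustrative; the parent map is hypothetical and not part of LVIS:

```python
# Sketch: report the most specific predicted category plus all of its ancestors.
# PARENT is a hypothetical child -> parent map (None marks a root category).
PARENT = {"canoe": "boat", "boat": None}

def expand_with_ancestors(detection):
    out, category = [], detection["category"]
    while category is not None:
        out.append({**detection, "category": category})
        category = PARENT.get(category)
    return out

dets = expand_with_ancestors({"image_id": 1, "category": "canoe", "score": 0.8})
print([d["category"] for d in dets])  # ['canoe', 'boat']
```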
Synonyms. A federated dataset that separates synonyms into different categories is valid, but is unnecessarily fragmented (see Fig. 2, right). We avoid splitting synonyms into separate categories with WordNet [21]. Specifically, in LVIS each category c is a WordNet synset—a word sense specified by a set of synonyms and a definition.
Figure 4. Our annotation pipeline comprises six stages. Stage 1: Object Spotting elicits annotators to mark a single instance of many different categories per image. This stage is iterative and causes annotators to discover a long tail of categories. Stage 2: Exhaustive Instance Marking extends the stage 1 annotations to cover all instances of each spotted category. Here we show additional instances of book. Stages 3 and 4: Instance Segmentation and Verification are repeated back and forth until ∼99% of all segmentations pass a quality check. Stage 5: Exhaustive Annotations Verification checks that all instances are in fact segmented and flags categories that are missing one or more instances. Stage 6: Negative Labels are assigned by verifying that a subset of categories do not appear in the image.
Figure 6. Dataset statistics. Best viewed digitally. (a) Distribution of category count per image: LVIS has a heavier tail than COCO and the Open Images training set; ADE20K is the most uniform. (b) The number of instances per category (on 5k images) reveals the long tail with few examples. Orange dots: categories in common with COCO. (c) Relative segmentation mask size (square root of mask-area-divided-by-image-area) compared between LVIS, COCO, and ADE20K.
Figure 7. Annotation consistency using 5000 doubly annotated images from LVIS. Best viewed digitally. (a) LVIS segmentation quality measured by mask IoU between matched instances from two runs of our annotation pipeline. Masks from the runs are consistent with a dataset average IoU of 0.85. (b) LVIS recognition quality measured by F1 score given matched instances across two runs of our annotation pipeline. Category labeling is consistent with a dataset average F1 score of 0.87. (c) Illustration of mask IoU vs. boundary quality (e.g., mask IoU 0.91 with boundary quality 0.82 vs. mask IoU 0.94 with boundary quality 0.99) to provide intuition for interpreting Fig. 7a and Tab. 1a (dataset annotations vs. expert annotators).
4.2. Annotation Consistency

Annotation Pipeline Repeatability. A repeatable annotation pipeline implies that the process generating the ground-truth data is not overly random and therefore may be learned. To understand repeatability, we annotated the 5000 images twice: after completing object spotting (stage 1), we have initial positive sets Pc for each category c; we then execute stages 2 through 5 (exhaustive instance marking through full recall verification) twice in order to yield doubly annotated positive sets. To compare them, we compute a matching between them for each image and category pair. We find a matching that maximizes the total mask intersection over union (IoU) summed over the matched pairs and then discard any matches with IoU < 0.5. Given these matches we compute the dataset average mask IoU (0.85) and the dataset average F1 score (0.87). Intuitively, these quantities describe 'segmentation quality' and 'recognition quality' [12]. The cumulative distributions of these metrics (Fig. 7a and 7b) show that even though matches are established based on a low IoU threshold (0.5), matched masks tend to have much higher IoU. The results show that roughly 50% of matched instances have IoU greater than 90% and roughly 75% of the image-category pairs have a perfect F1 score. Taken together, these metrics are a strong indication that our pipeline has a large degree of repeatability.
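This matching step can be computed with a standard assignment solver. The following sketch (our own illustration, assuming masks are boolean NumPy arrays, not the authors' released code) maximizes total IoU with the Hungarian algorithm, discards matches with IoU < 0.5, and reports the kept matches and the F1 score of the matching:

```python
# Sketch: match instances from two annotation runs of one (image, category) pair
# by maximizing total mask IoU, discard weak matches (IoU < 0.5), and report
# the kept matches and the F1 score of the matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a, b):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def match_runs(masks_run1, masks_run2, iou_thresh=0.5):
    if not masks_run1 or not masks_run2:
        return [], 0.0
    iou = np.array([[mask_iou(a, b) for b in masks_run2] for a in masks_run1])
    rows, cols = linear_sum_assignment(-iou)          # maximize total IoU
    kept = [(r, c, iou[r, c]) for r, c in zip(rows, cols) if iou[r, c] >= iou_thresh]
    tp = len(kept)                                    # matched pairs count as TPs
    f1 = 2 * tp / (len(masks_run1) + len(masks_run2))
    return kept, f1

# Toy example with one 4x4 mask per run.
m = np.zeros((4, 4), bool); m[:2, :2] = True
n = np.zeros((4, 4), bool); n[:2, :3] = True
matches, f1 = match_runs([m], [n])
print(matches, f1)   # one match with IoU ~0.67, F1 = 1.0
```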
Comparison with Expert Annotators. To measure segmentation quality, we randomly selected 100 instances with mask area greater than 32² pixels from LVIS, COCO, and ADE20K. We presented these instances (indicated by bounding box and category) to two independent expert annotators and asked them to segment each object using professional image editing tools. We compare dataset annotations to expert annotations using mask IoU and boundary quality (boundary F [20]) in Tab. 1a. The results (bootstrapped 95% confidence intervals) show that our masks are high-quality, surpassing COCO and ADE20K on both measures (see Fig. 7c for intuition). At the same time, the objects in LVIS have more complex boundaries [1] (Tab. 1b).
Figure 8. Detection experiments using COCO and 5000 annotated images from LVIS. Best viewed digitally. (a) Given fixed detections, we show how AP varies with max |Nc|, the max number of negative images per category used in evaluation. (b) With the same detections from Fig. 8a and max |Nc| = 50, we show how AP varies as we vary max |Pc|, the max positive set size. (c) Low-shot detection is an open problem: training Mask R-CNN on 1k images decreases COCO val2017 mask AP from 36% to 10%.
Table 2. COCO-trained Mask R-CNN evaluated on LVIS annotations. Both annotations yield similar AP values.

Mask R-CNN                             test anno.   box AP   mask AP
R-50-FPN (model id: 35859007)          COCO         38.2     34.1
                                       LVIS         38.8     34.4
R-101-FPN (model id: 35861858)         COCO         40.6     36.0
                                       LVIS         40.9     36.0
X-101-64x4d-FPN (model id: 37129812)   COCO         47.8     41.2
                                       LVIS         48.6     41.7

Figure 9. (Left) As more images are annotated, new categories are discovered. (Right) Consequently, the percentage of low-shot categories (blue curve) remains large, decreasing slowly.
4.3. Evaluation Protocol

COCO Detectors on LVIS. To validate our annotations and federated dataset design we downloaded three Mask R-CNN [9] models from the Detectron Model Zoo [7] and evaluated them on LVIS annotations for the categories in COCO. Tab. 2 shows that both box AP and mask AP are close between our annotations and the original ones from COCO for all models, which span a wide AP range. This result validates our annotations and evaluation protocol: even though LVIS uses a federated dataset design with sparse annotations, the quantitative outcome closely reproduces the 'gold standard' results from dense COCO annotations.

Federated Dataset Simulations. For insight into how AP changes with positive and negative set sizes |Pc| and |Nc|, we randomly sample smaller evaluation sets from COCO val2017 and recompute AP. To plot quartiles and min-max ranges, we re-test each setting 20 times. In Fig. 8a we use all positive instances for evaluation, but vary max |Nc| between 50 and 5k. AP decreases somewhat (∼2 points) as we increase the number of negative images, as the ratio of negative to positive examples grows with fixed |Pc| and increasing |Nc|. Next, in Fig. 8b we set max |Nc| = 50 and vary |Pc|. We observe that even with a small positive set size of 80, AP is similar to the baseline with low variance. With smaller positive sets (down to 5) variance increases, but the AP gap from 1st to 3rd quartile remains below 2 points. These simulations together with COCO detectors tested on LVIS (Tab. 2) indicate that including smaller evaluation sets for each category is viable for evaluation.
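A rough sketch of this style of simulation using pycocotools, restricting each category's evaluation to a sampled Pc ∪ Nc (our own approximation for illustration only; file paths are placeholders and the repeated trials and quartile plots are omitted):

```python
# Sketch: per-category AP on a sampled positive/negative image subset,
# mimicking federated evaluation with pycocotools. Paths are placeholders.
import random
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")
max_neg = 50   # max |Nc|

aps = []
for c in coco_gt.getCatIds():
    pos = set(coco_gt.getImgIds(catIds=[c]))                # images containing c
    neg = [i for i in coco_gt.getImgIds() if i not in pos]  # candidate negatives
    sampled = list(pos) + random.sample(neg, min(max_neg, len(neg)))

    ev = COCOeval(coco_gt, coco_dt, iouType="segm")
    ev.params.catIds = [c]          # evaluate a single category...
    ev.params.imgIds = sampled      # ...only on its sampled Pc ∪ Nc
    ev.evaluate(); ev.accumulate(); ev.summarize()
    aps.append(ev.stats[0])

print("category-averaged AP:", sum(aps) / len(aps))
```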
Low-Shot Detection. To validate the claim that low-shot detection is a challenging open problem, we trained Mask R-CNN on random subsets of COCO train2017 ranging from 1k to 118k images. For each subset, we optimized the learning rate schedule and weight decay by grid search. Results on val2017 are shown in Fig. 8c. At 1k images, mask AP drops from 36.4% (full dataset) to 9.8% (1k subset). In the 1k subset, 89% of the categories have more than 20 training instances, while the low-shot literature typically considers 20 examples per category [8].

Low-Shot Category Statistics. Fig. 9 (left) shows category growth as a function of image count (up to 977 categories in 5k images). Extrapolating the trajectory, our final dataset will include over 1k categories (upper bounded by the vocabulary size, 1723). Since the number of categories increases during data collection, the low-shot nature of LVIS is somewhat independent of the dataset scale; see Fig. 9 (right), where we bin categories based on how many images they appear in: rare (1-10 images), common (11-100), and frequent (>100). These bins, as measured w.r.t. the training set, will be used to present disaggregated AP metrics.
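The binning rule itself is simple; a sketch with hypothetical counts:

```python
# Sketch: bin categories by the number of training images they appear in.
def frequency_bin(image_count):
    """rare: 1-10 images, common: 11-100, frequent: >100 (as defined above)."""
    if image_count <= 10:
        return "rare"
    return "common" if image_count <= 100 else "frequent"

# category -> number of distinct training images containing it (toy numbers)
image_counts = {"deer": 4, "backpack": 57, "person": 2300}
bins = {c: frequency_bin(n) for c, n in image_counts.items()}
print(bins)  # {'deer': 'rare', 'backpack': 'common', 'person': 'frequent'}
```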
5. Conclusion

We introduced LVIS, a new dataset designed to enable, for the first time, the rigorous study of instance segmentation algorithms that can recognize a large vocabulary of object categories (>1000) and must do so using methods that can cope with the open problem of low-shot learning. While LVIS emphasizes learning from few examples, the dataset is not small: it will span 164k images and label ∼2 million object instances. Each object instance is segmented with a high-quality mask that surpasses the annotation quality of related datasets. We plan to establish LVIS as a benchmark challenge that we hope will lead to exciting new object detection, segmentation, and low-shot learning algorithms.
A. LVIS Release v0.5

LVIS release v0.5 marks the halfway point in data collection. For this release, we have annotated an additional 77k images (57k train, 20k test) beyond the 5k val images that we analyzed in the previous sections, for a total of 82k annotated images. Release v0.5 is publicly available at http://www.lvisdataset.org and will be used in the first LVIS Challenge to be held in conjunction with the COCO Workshop at ICCV 2019.

Collection Details. We collected this data in two 38.5k image batches. After the first batch, we reviewed all 1415 categories that were represented in the raw data collection and cast an include vs. exclude vote for each category based on its visual consistency. This process led to the removal of ∼18% of the categories and ∼10% of the labeled instances. After collecting the second batch, we repeated this process for the 83 new categories that were newly introduced. After we finish the full data collection for v1 (estimated January 2020), we will conduct another similar quality control pass on a subset of the categories.

LVIS val v0.5 is the same as the set used for analysis in the main paper, except that we: (1) removed categories that were determined to be visually inconsistent in the quality control pass and (2) removed any categories with zero instances in the training set. In this section, we refer to the set of images used for analysis in the main paper as 'LVIS val (unpruned)'.

Statistics. After our quality control pass, the final category count for release v0.5 is 1230. The number of categories in the val set decreased from 977 to 830, due to quality control, and now has 56k segmented object instances. The train v0.5 set has 694k segmented instances.

We now repeat some of the key analysis plots, this time showing the final val and train v0.5 sets in comparison to the original (unpruned) val set that was analyzed in the main paper. The train and test sets are collected using an identical process (the images are mixed together in each annotation batch) and therefore the training data is statistically identical to that of the test data (noting that the train and test images were randomly sampled from the same image distribution when COCO was collected).

Fig. 10a illustrates the category growth rate on the train set and the val set before and after pruning. We expect only modest growth while collecting the second half of the dataset, perhaps expanding by ∼100 additional categories. Next, we extend Fig. 9 (right) from 5k images to 57k images using the train v0.5 data, as shown in Fig. 10b. Due to the slowing category growth, the percent of rare categories (those appearing in 1-10 training images) is decreasing, but remains a sizeable portion of the dataset. Roughly 75% of categories appear in 100 training images or less, highlighting the challenging low-shot nature of the dataset.
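The growth curve in Fig. 10a is simply the cumulative number of distinct categories observed as images are annotated; a small sketch with hypothetical per-image category sets:

```python
# Sketch: number of distinct categories discovered vs. number of images annotated.
def category_growth(images):
    """images: list of sets, each the categories labeled in one image (in
    annotation order). Returns the cumulative count of distinct categories."""
    seen, growth = set(), []
    for cats in images:
        seen |= cats
        growth.append(len(seen))
    return growth

print(category_growth([{"dog", "ball"}, {"dog"}, {"deer", "ball"}, {"canoe"}]))
# -> [2, 2, 3, 4]
```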
Finally, we look at the spatial distribution of object centers in Fig. 11. This visualization verifies that quality control did not lead to a meaningful bias in this statistic. The train and val sets exhibit visually similar distributions.

Figure 10. Category growth and frequency statistics for LVIS v0.5. Best viewed digitally. Compare with Fig. 9. (a) Category growth comparison between the train and val sets (before and after quality control). (b) As category growth slows, the percent of rare categories decreases, but remains large (train v0.5 set).

Figure 11. Distribution of object centers in normalized image coordinates for LVIS val (prior to quality control; i.e. the same as in Fig. 5), LVIS val v0.5 (after quality control), and LVIS train v0.5. The distributions are nearly identical.

Summary. Based on this analysis and our qualitative judgement when performing per-category quality control, we conclude that our data collection process scales well beyond the initial 5k set analyzed in the main paper.

Acknowledgements. We thank Ilija Radosavovic, Amanpreet Singh, Alexander Kirillov, and Tsung-Yi Lin for their help during the creation of LVIS. We would also like to thank the COCO Committee for granting us permission to annotate the COCO test set. We are grateful to Amanpreet Singh for his help in creating the LVIS website.
Figure 12. Example annotations from our dataset. For clarity, we show one category per image. (NE) signifies that the category was not exhaustively annotated in the image. See http://www.lvisdataset.org/explore to explore LVIS in detail.
References

[1] Fred Attneave and Malcolm D Arnoult. The quantitative study of shape and pattern perception. Psychological Bulletin, 1956.
[2] Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 2014.
[3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[4] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2012.
[5] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
[6] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. TPAMI, 2006.
[7] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
[8] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[12] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, et al. The Open Images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982, 2018.
[15] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[16] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[17] Marc Liberman. Reproducible research and the common task method. Simons Foundation Lecture, https://www.simonsfoundation.org/lecture/reproducible-research-and-the-common-task-method/, 2015.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. COCO detection evaluation. http://cocodataset.org/#detection-eval, accessed Oct 30, 2018.
[20] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[21] George Miller. WordNet: An electronic lexical database. MIT Press, 1998.
[22] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[24] Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman. LabelMe: A database and web-based tool for image annotation. IJCV, 2008.
[25] Merrielle Spain and Pietro Perona. Measuring and predicting importance of objects in our visual world. Technical Report CNS-TR-2007-002, California Institute of Technology, 2007.
[26] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In CVPR, 2018.
[27] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[28] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. IJCV, 2019.
[29] George Kingsley Zipf. The psycho-biology of language: An introduction to dynamic philology. Routledge, 2013.