Unsupervised Learning from Videos for Object Discovery in Single Images
Abstract
1. Introduction
- We propose a novel deep network architecture for unsupervised learning, which factors an image into multiple object instances by exploiting the sparsity of images and the inter-frame structure of videos.
- We propose a method to discover the primary object in single images through completely unsupervised learning, without any manual annotation or pre-trained features.
- Our segmentation quality tends to increase logarithmically with the amount of training data, which suggests that the model continues to learn and generalize as more data become available. Moreover, our model remains very fast at test time: the experimental results demonstrate that it is at least two orders of magnitude faster than the related co-segmentation methods [4,27].
2. Related Work
3. Our Approach
3.1. Foreground and Background Model
3.2. Segmentation Mask Model
3.3. Image Reconstruction
4. Experiments and Analysis
4.1. Experimental Setup
4.1.1. Datasets
- The YouTube Objects (YTO) dataset [6]. The YTO dataset is a large-scale database collected from YouTube, containing videos for each of 10 diverse object classes (aeroplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train). The dataset has 5484 video shots for a total of 571,089 frames. The videos display significant clutter, with foreground objects coming in and out of focus and often out of sight, undergoing occlusions and significant changes in scale and viewpoint. The dataset also provides a ground-truth bounding-box on the object of interest in one frame for each of 1407 video shots.
- The Object Discovery dataset [21]. The Object Discovery dataset was collected by automatically downloading images with the Bing API, using queries for airplane, car, and horse. It contains 15k internet images: airplane (4542 images), car (4347 images), and horse (6381 images), and it is annotated with highly detailed segmentation masks.
- The MSRC dataset [66]. The MSRC dataset is composed of 591 photographs of 21 object classes, hand-labeled with assigned colors that act as indices into the list of object classes. All of the images were taken under completely general lighting conditions, camera viewpoints, scene geometry, object poses, and articulation.
- The iCoseg dataset [48]. The iCoseg dataset is built from the Flickr online photo collection and hand-labeled with pixel-level segmentations in all of the images. It contains 38 challenging groups with 643 total images (∼17 images per group), consisting of animals in the wild, popular landmarks, sports teams, and other groups that contain a common theme or common foreground object.
4.1.2. Implementation Details
- Conv(c_in, c_out, k, s, p): convolution with c_in input channels, c_out output channels, kernel size k, stride s, and padding p.
- Deconv(c_in, c_out, k, s, p): deconvolution with c_in input channels, c_out output channels, kernel size k, stride s, and padding p.
- Down(s): max-pooling downsampling with a scale factor of s.
- Up(s): nearest-neighbor upsampling with a scale factor of s.
- GN(n): group normalization with n groups (these building blocks are mapped to PyTorch modules in the sketch below).
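For concreteness, the following sketch maps this notation onto PyTorch modules. It is our own minimal rendering, not the authors' released code; the helper names mirror the notation above, and our `GN` takes the channel count as an extra argument because `nn.GroupNorm` requires it.

```python
import torch.nn as nn

def Conv(c_in, c_out, k, s, p):
    # Conv(c_in, c_out, k, s, p): 2-D convolution
    return nn.Conv2d(c_in, c_out, kernel_size=k, stride=s, padding=p)

def Deconv(c_in, c_out, k, s, p):
    # Deconv(c_in, c_out, k, s, p): transposed convolution (deconvolution)
    return nn.ConvTranspose2d(c_in, c_out, kernel_size=k, stride=s, padding=p)

def Down(s):
    # Down(s): max-pooling downsampling by a scale factor of s
    return nn.MaxPool2d(kernel_size=s)

def Up(s):
    # Up(s): nearest-neighbor upsampling by a scale factor of s
    return nn.Upsample(scale_factor=s, mode="nearest")

def GN(n, c):
    # GN(n): group normalization with n groups; the channel count c is an
    # extra argument our sketch needs, since nn.GroupNorm requires it
    return nn.GroupNorm(num_groups=n, num_channels=c)
```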
4.1.3. Evaluation Metrics
- For the comparison of object localization bounding-boxes, we adopt the correct localization (CorLoc) metric, following previous image localization works [11,12,13,43]. CorLoc measures the percentage of images that are correctly localized according to the PASCAL criterion: a predicted box is counted as correct when its intersection-over-union (IoU) overlap with the ground-truth box is larger than 0.5.
- For the comparison of object segmentation masks, we evaluate the precision P and the Jaccard similarity J, as described by Rubinstein et al. [21]—the higher P and J, the better. P is the per-pixel precision (the ratio of correctly labeled pixels), while J is the Jaccard similarity (the intersection over union of the result and the ground-truth segmentation). Both measures are commonly used for evaluation in image segmentation (see the sketch after this list).
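As a concrete reference, here is a small sketch of both protocols under common assumptions: boxes are given as (x1, y1, x2, y2) corner coordinates and masks as binary arrays. The function names are ours, not from the paper.

```python
import numpy as np

def iou_box(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def corloc(pred_boxes, gt_boxes, thresh=0.5):
    # CorLoc: percentage of images whose predicted box overlaps the
    # ground-truth box with IoU > 0.5 (the PASCAL criterion).
    hits = [iou_box(p, g) > thresh for p, g in zip(pred_boxes, gt_boxes)]
    return 100.0 * np.mean(hits)

def precision_jaccard(pred_mask, gt_mask):
    # P: ratio of correctly labeled pixels; J: IoU of the foreground masks.
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    p = np.mean(pred == gt)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    j = inter / union if union > 0 else 1.0
    return p, j
```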
4.2. Results on Video Dataset
- (1) the primary object is completely separated from the background;
- (2) the background model can automatically fill in the “missing” image parts in its output;
- (3) the produced object masks have smoother shapes, with very few holes, and capture the figure-ground contrast and organization well;
- (4) the UnsupOD model is able to detect multiple objects (see the masks of the sixth column); and
- (5) the UnsupOD model boxes the parts that move with the object (see the bounding box in the penultimate column: only the motorbike in the ground truth, but the motorbike with a man in our bounding box); this is mainly because our motion-based approach groups pixels that share the same motion.
4.3. Results on Single Images
4.4. Results on Single Images of Unseen Classes
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Horn, B.K.; Schunck, B.G. Determining optical flow. In Techniques and Applications of Image Understanding; International Society for Optics and Photonics, 1981; Volume 281, pp. 319–331. [Google Scholar]
- Barron, J.L.; Fleet, D.J.; Beauchemin, S.S. Performance of optical flow techniques. Int. J. Comput. Vis. 1994, 12, 43–77. [Google Scholar] [CrossRef]
- Zhang, R.; Huang, Y.; Pu, M.; Zhang, J.; Guan, Q.; Zou, Q.; Ling, H. Object discovery from a single unlabeled image by mining frequent itemsets with multi-scale features. IEEE Trans. Image Process. 2020, 29, 8606–8621. [Google Scholar] [CrossRef] [PubMed]
- Papazoglou, A.; Ferrari, V. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 1777–1784. [Google Scholar]
- Lu, X.; Wang, W.; Ma, C.; Shen, J.; Shao, L.; Porikli, F. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3623–3632. [Google Scholar]
- Prest, A.; Leistner, C.; Civera, J.; Schmid, C.; Ferrari, V. Learning object class detectors from weakly annotated video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3282–3289. [Google Scholar]
- Dave, A.; Tokmakov, P.; Ramanan, D. Towards segmenting anything that moves. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV), Seoul, Korea, 27–28 October 2019. [Google Scholar]
- Jain, S.D.; Xiong, B.; Grauman, K. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2126. [Google Scholar]
- Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; Sorkine-Hornung, A. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 724–732. [Google Scholar]
- Luiten, J.; Zulfikar, I.E.; Leibe, B. Unovost: Unsupervised offline video object segmentation and tracking. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2000–2009. [Google Scholar]
- Tang, K.; Joulin, A.; Li, L.J.; Fei-Fei, L. Co-localization in real-world images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1464–1471. [Google Scholar]
- Joulin, A.; Tang, K.; Fei-Fei, L. Efficient image and video co-localization with the Frank-Wolfe algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 253–268. [Google Scholar]
- Cho, M.; Kwak, S.; Schmid, C.; Ponce, J. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1201–1210. [Google Scholar]
- Faktor, A.; Irani, M. “Clustering by composition”—Unsupervised discovery of image categories. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; Springer: Cham, Switzerland, 2012; pp. 474–487. [Google Scholar]
- Jun Koh, Y.; Jang, W.D.; Kim, C.S. POD: Discovering primary objects in videos based on evolutionary refinement of object recurrence, background, and primary object models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1068–1076. [Google Scholar]
- Kim, G.; Xing, E.P.; Fei-Fei, L.; Kanade, T. Distributed cosegmentation via submodular optimization on anisotropic diffusion. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 169–176. [Google Scholar]
- Joulin, A.; Bach, F.; Ponce, J. Discriminative clustering for image co-segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 1943–1950. [Google Scholar]
- Rochan, M.; Wang, Y. Efficient object localization and segmentation in weakly labeled videos. In Proceedings of the International Symposium on Visual Computing, Las Vegas, NV, USA, 8–10 December 2014; Springer: Cham, Switzerland, 2014; pp. 172–181. [Google Scholar]
- Joulin, A.; Bach, F.; Ponce, J. Multi-class cosegmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 542–549. [Google Scholar]
- Kuettel, D.; Guillaumin, M.; Ferrari, V. Segmentation propagation in imagenet. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; Springer: Cham, Switzerland, 2012; pp. 459–473. [Google Scholar]
- Rubinstein, M.; Joulin, A.; Kopf, J.; Liu, C. Unsupervised joint object discovery and segmentation in internet images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 1939–1946. [Google Scholar]
- Rubio, J.C.; Serrat, J.; López, A. Video co-segmentation. In Proceedings of the Asian Conference on Computer Vision (ACCV), Daejeon, Korea, 5–9 November 2012; Springer: Cham, Switzerland, 2012; pp. 13–24. [Google Scholar]
- Vicente, S.; Rother, C.; Kolmogorov, V. Object cosegmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 2217–2224. [Google Scholar]
- Leordeanu, M.; Collins, R.; Hebert, M. Unsupervised learning of object features from video sequences. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; Volume 1, p. 1142. [Google Scholar]
- Liu, D.; Chen, T. A topic-motion model for unsupervised video object discovery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8. [Google Scholar]
- Parikh, D.; Chen, T. Unsupervised identification of multiple objects of interest from multiple images: Discovery. In Proceedings of the Asian Conference on Computer Vision (ACCV), Tokyo, Japan, 18–22 November 2007; Springer: Cham, Switzerland, 2007; pp. 487–496. [Google Scholar]
- Stretcu, O.; Leordeanu, M. Multiple frames matching for object discovery in video. In Proceedings of the British Machine Vision Conference (BMVC), Swansea, UK, 7–10 September 2015; Volume 1, p. 3. [Google Scholar]
- Thai, M.T.; Wu, W.; Xiong, H. Big Data in Complex and Social Networks; CRC Press: Boca Raton, FL, USA, 2016. [Google Scholar]
- Stai, E.; Kafetzoglou, S.; Tsiropoulou, E.E.; Papavassiliou, S. A holistic approach for personalization, relevance feedback & recommendation in enriched multimedia content. Multimed. Tools Appl. 2018, 77, 283–326. [Google Scholar]
- Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef] [PubMed]
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
- Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1422–1430. [Google Scholar]
- Noroozi, M.; Pirsiavash, H.; Favaro, P. Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5898–5906. [Google Scholar]
- Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
- Zhang, R.; Isola, P.; Efros, A.A. Colorful image colorization. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 649–666. [Google Scholar]
- Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 69–84. [Google Scholar]
- Wang, X.; Gupta, A. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2794–2802. [Google Scholar]
- Wang, X.; He, K.; Gupta, A. Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1329–1338. [Google Scholar]
- Walker, J.; Doersch, C.; Gupta, A.; Hebert, M. An uncertain future: Forecasting from static images using variational autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 835–851. [Google Scholar]
- Pathak, D.; Girshick, R.; Dollár, P.; Darrell, T.; Hariharan, B. Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2701–2710. [Google Scholar]
- Li, Y.; Liu, L.; Shen, C.; van den Hengel, A. Image co-localization by mimicking a good detector’s confidence score distribution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 19–34. [Google Scholar]
- Wei, X.S.; Zhang, C.L.; Li, Y.; Xie, C.W.; Wu, J.; Shen, C.; Zhou, Z.H. Deep descriptor transforming for image co-localization. arXiv 2017, arXiv:1705.02758. [Google Scholar]
- Wei, X.S.; Luo, J.H.; Wu, J.; Zhou, Z.H. Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Trans. Image Process. 2017, 26, 2868–2881. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Rother, C.; Minka, T.; Blake, A.; Kolmogorov, V. Cosegmentation of image pairs by histogram matching-incorporating a global constraint into mrfs. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; Volume 1, pp. 993–1000. [Google Scholar]
- Mukherjee, L.; Singh, V.; Dyer, C.R. Half-integrality based algorithms for cosegmentation of images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 2028–2035. [Google Scholar]
- Batra, D.; Kowdle, A.; Parikh, D.; Luo, J.; Chen, T. iCoseg: Interactive co-segmentation with intelligent scribble guidance. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3169–3176. [Google Scholar]
- Hochbaum, D.S.; Singh, V. An efficient algorithm for co-segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan, 27 September–4 October 2009; pp. 269–276. [Google Scholar]
- Rubio, J.C.; Serrat, J.; López, A.; Paragios, N. Unsupervised co-segmentation through region matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 749–756. [Google Scholar]
- Lee, C.; Jang, W.D.; Sim, J.Y.; Kim, C.S. Multiple random walkers and their application to image cosegmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3837–3845. [Google Scholar]
- Faktor, A.; Irani, M. Video segmentation by non-local consensus voting. In Proceedings of the British Machine Vision Conference (BMVC), Nottingham, UK, 1–5 September 2014; Volume 2, p. 8. [Google Scholar]
- Lee, Y.J.; Kim, J.; Grauman, K. Key-segments for video object segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 1995–2002. [Google Scholar]
- Wang, W.; Shen, J.; Porikli, F. Saliency-aware geodesic video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3395–3402. [Google Scholar]
- Bideau, P.; Learned-Miller, E. It’s moving! A probabilistic model for causal motion segmentation in moving camera videos. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 433–449. [Google Scholar]
- Narayana, M.; Hanson, A.; Learned-Miller, E. Coherent motion segmentation in moving camera videos using optical flow orientations. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 1577–1584. [Google Scholar]
- Tokmakov, P.; Alahari, K.; Schmid, C. Learning motion patterns in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3386–3394. [Google Scholar]
- Tokmakov, P.; Schmid, C.; Alahari, K. Learning to segment moving objects. Int. J. Comput. Vis. 2019, 127, 282–301. [Google Scholar] [CrossRef] [Green Version]
- Tokmakov, P.; Alahari, K.; Schmid, C. Learning video object segmentation with visual memory. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4481–4490. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
- Mondal, A.K.; Dolz, J.; Desrosiers, C. Few-shot 3d multi-modal medical image segmentation using generative adversarial learning. arXiv 2018, arXiv:1810.12241. [Google Scholar]
- Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar]
- Isensee, F.; Petersen, J.; Kohl, S.A.; Jäger, P.F.; Maier-Hein, K.H. nnU-Net: Breaking the spell on successful medical image segmentation. arXiv 2019, arXiv:1904.08128. [Google Scholar]
- Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv 2016, arXiv:1602.02830. [Google Scholar]
- Li, M.; Zuo, W.; Gu, S.; Zhao, D.; Zhang, D. Learning convolutional networks for content-weighted image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 3214–3223. [Google Scholar]
- Shotton, J.; Winn, J.; Rother, C.; Criminisi, A. Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–15. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Odena, A.; Dumoulin, V.; Olah, C. Deconvolution and checkerboard artifacts. Distill 2016, 1, e3. [Google Scholar] [CrossRef]
Parameter | Value |
---|---|
Optimizer | Adam |
Learning rate | |
Number of epochs | 10 |
Batch size | 16 |
Proportion factor | 0.2 |
Loss weight | 1 |
Loss weight | 1 |
Loss weight | 0.1 |
Loss weight | 1 |
Loss weight | 0.05 |
Input image size | 128 × 128 |
Output image size | 128 × 128 |
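Read as code, the table corresponds to roughly the following setup. This is a hedged sketch: `model` is a stand-in module, the learning-rate value is a placeholder (the table's entry is blank above), and the symbols of the five loss weights were lost in extraction, so they are kept only as an ordered tuple.

```python
import torch

model = torch.nn.Linear(8, 8)        # stand-in for the full UnsupOD network
LEARNING_RATE = 1e-4                 # placeholder; the table's value is missing
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

EPOCHS, BATCH_SIZE = 10, 16
PROPORTION_FACTOR = 0.2
LOSS_WEIGHTS = (1.0, 1.0, 0.1, 1.0, 0.05)  # the five loss weights, in table order
```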
Encoder | Output Size |
---|---|
Conv(3, 64, 3, 1, 1) × 2 | 128 |
Down(2) + Conv(64, 128, 3, 1, 1) × 2 | 64 |
Down(2) + Conv(128, 256, 3, 1, 1) × 2 | 32 |
Down(2) + Conv(256, 512, 3, 1, 1) × 2 | 16 |
Down(2) + Conv(512, 512, 3, 1, 1) × 2 | 8 |

Decoder | Output Size |
---|---|
Up(2) | 16 |
Conv(1024, 256, 3, 1, 1) × 2 + Up(2) | 32 |
Conv(512, 128, 3, 1, 1) × 2 + Up(2) | 64 |
Conv(256, 64, 3, 1, 1) × 2 + Up(2) | 128 |
Conv(128, 64, 3, 1, 1) × 2 | 128 |
Conv(64, 1, 3, 1, 1) + Sigmoid | 128 |
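The two tables above specify a U-Net-style network; the decoder's input channel counts (1024, 512, 256, 128) suggest skip connections that concatenate each upsampled feature map with the encoder feature map of the same resolution. The sketch below is our reading of the tables, with ReLU activations assumed between convolutions (the tables do not state them):

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Conv(c_in, c_out, 3, 1, 1) x 2, with assumed ReLU activations
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, 1, 1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, 1, 1), nn.ReLU(inplace=True))

class MaskModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = double_conv(3, 64)      # 128x128
        self.enc2 = double_conv(64, 128)    # 64x64
        self.enc3 = double_conv(128, 256)   # 32x32
        self.enc4 = double_conv(256, 512)   # 16x16
        self.enc5 = double_conv(512, 512)   # 8x8
        self.pool = nn.MaxPool2d(2)                             # Down(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")   # Up(2)
        self.dec4 = double_conv(1024, 256)  # 512 (skip) + 512 (upsampled)
        self.dec3 = double_conv(512, 128)
        self.dec2 = double_conv(256, 64)
        self.dec1 = double_conv(128, 64)
        self.head = nn.Sequential(nn.Conv2d(64, 1, 3, 1, 1), nn.Sigmoid())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        e5 = self.enc5(self.pool(e4))
        d4 = self.dec4(torch.cat([self.up(e5), e4], dim=1))  # 16x16
        d3 = self.dec3(torch.cat([self.up(d4), e3], dim=1))  # 32x32
        d2 = self.dec2(torch.cat([self.up(d3), e2], dim=1))  # 64x64
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))  # 128x128
        return self.head(d1)                                 # 1-channel mask

mask = MaskModel()(torch.randn(1, 3, 128, 128))  # -> (1, 1, 128, 128)
```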
Encoder | Output Size |
---|---|
ResNet18.conv1 | 64 |
ResNet18.conv2_x | 32 |
ResNet18.conv3_x | 16 |
ResNet18.conv4_x | 8 |
ResNet18.conv5_x | 4 |
Conv(512, , 3, 1, 1) + ReLU | 4 |

Decoder | Output Size |
---|---|
Deconv(, 512, 4, 2, 1) + ReLU | 8 |
Conv(512, 512, 3, 1, 1) + ReLU | 8 |
Deconv(512, 256, 4, 2, 1) + GN(64) + ReLU | 16 |
Conv(256, 256, 3, 1, 1) + GN(64) + ReLU | 16 |
Deconv(256, 128, 4, 2, 1) + GN(32) + ReLU | 32 |
Conv(128, 128, 3, 1, 1) + GN(32) + ReLU | 32 |
Deconv(128, 64, 4, 2, 1) + GN(16) + ReLU | 64 |
Conv(64, 64, 3, 1, 1) + GN(16) + ReLU | 64 |
Up(2) | 128 |
Conv(64, 64, 3, 1, 1) + GN(16) + ReLU | 128 |
Conv(64, 64, 5, 1, 2) + GN(16) + ReLU | 128 |
Conv(64, 3, 5, 1, 2) + Sigmoid | 128 |
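Similarly, a sketch of the ResNet18-based encoder-decoder specified in the tables above. The latent channel count is blank in the tables, so `LATENT_C` is a placeholder; the ResNet18 stages come from torchvision (assuming a recent version with the `weights` argument) and are randomly initialized, consistent with the paper's claim of using no pre-trained features:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

LATENT_C = 256  # placeholder: the latent channel count is blank in the table

class FgBgModel(nn.Module):
    def __init__(self, c=LATENT_C):
        super().__init__()
        r = resnet18(weights=None)              # no pre-trained features
        self.encoder = nn.Sequential(
            r.conv1, r.bn1, r.relu,             # ResNet18.conv1  -> 64x64
            r.maxpool, r.layer1,                # ResNet18.conv2_x -> 32x32
            r.layer2,                           # ResNet18.conv3_x -> 16x16
            r.layer3,                           # ResNet18.conv4_x -> 8x8
            r.layer4,                           # ResNet18.conv5_x -> 4x4
            nn.Conv2d(512, c, 3, 1, 1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(c, 512, 4, 2, 1), nn.ReLU(inplace=True),    # 8x8
            nn.Conv2d(512, 512, 3, 1, 1), nn.ReLU(inplace=True),           # 8x8
            nn.ConvTranspose2d(512, 256, 4, 2, 1),
            nn.GroupNorm(64, 256), nn.ReLU(inplace=True),                  # 16x16
            nn.Conv2d(256, 256, 3, 1, 1),
            nn.GroupNorm(64, 256), nn.ReLU(inplace=True),                  # 16x16
            nn.ConvTranspose2d(256, 128, 4, 2, 1),
            nn.GroupNorm(32, 128), nn.ReLU(inplace=True),                  # 32x32
            nn.Conv2d(128, 128, 3, 1, 1),
            nn.GroupNorm(32, 128), nn.ReLU(inplace=True),                  # 32x32
            nn.ConvTranspose2d(128, 64, 4, 2, 1),
            nn.GroupNorm(16, 64), nn.ReLU(inplace=True),                   # 64x64
            nn.Conv2d(64, 64, 3, 1, 1),
            nn.GroupNorm(16, 64), nn.ReLU(inplace=True),                   # 64x64
            nn.Upsample(scale_factor=2, mode="nearest"),                   # 128x128
            nn.Conv2d(64, 64, 3, 1, 1),
            nn.GroupNorm(16, 64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 5, 1, 2),
            nn.GroupNorm(16, 64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 5, 1, 2), nn.Sigmoid())                       # RGB output

    def forward(self, x):
        return self.decoder(self.encoder(x))

out = FgBgModel()(torch.randn(1, 3, 128, 128))  # -> (1, 3, 128, 128)
```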
Method | Aero | Bird | Boat | Car | Cat | Cow | Dog | Horse | Mbike | Train | Avg | Time |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Prest et al. [6] | 51.7 | 17.5 | 34.4 | 34.7 | 22.3 | 17.9 | 13.5 | 26.7 | 41.2 | 25.0 | 28.5 | N/A |
Joulin et al. [12] | 25.1 | 31.2 | 27.8 | 38.5 | 41.2 | 28.4 | 33.9 | 35.6 | 23.0 | 25.0 | 31.0 | N/A |
Stretcu et al. [27] | 38.3 | 62.5 | 51.1 | 54.9 | 64.3 | 52.9 | 44.3 | 43.8 | 41.9 | 45.8 | 49.9 | 6.9 s |
Papazoglou et al. [4] | 65.4 | 67.3 | 38.9 | 65.2 | 46.3 | 40.2 | 65.3 | 48.4 | 39.0 | 25.0 | 50.1 | 4 s |
Jun et al. [15] | 64.3 | 63.2 | 73.3 | 68.9 | 44.4 | 62.5 | 71.4 | 52.3 | 78.6 | 23.1 | 60.2 | N/A |
(Ours) UnsupOD | 69.5 | 60.5 | 74.2 | 60.8 | 65.7 | 63.2 | 65.8 | 54.8 | 43.7 | 48.9 | 60.7 | 0.03 s |
Method | Supervision | Airplane | Car | Horse | Avg |
---|---|---|---|---|---|
Kim et al. [16] | co-segmentation | 21.95 | 0.00 | 16.13 | 12.69 |
Joulin et al. [17] | co-segmentation | 32.93 | 66.29 | 54.84 | 51.35 |
Joulin et al. [19] | co-segmentation | 57.32 | 64.04 | 52.69 | 58.02 |
Rubinstein et al. [21] | co-segmentation | 74.39 | 87.64 | 63.44 | 75.16 |
Tang et al. [11] | co-localization | 71.95 | 93.26 | 64.52 | 76.58 |
(Ours) UnsupOD | w/o | 82.93 | 91.95 | 67.05 | 80.64 |
Method | Airplane P | Airplane J | Car P | Car J | Horse P | Horse J |
---|---|---|---|---|---|---|
Kim et al. [16] | 80.20 | 7.90 | 68.85 | 0.04 | 75.12 | 6.43 |
Joulin et al. [17] | 49.25 | 15.36 | 58.70 | 37.15 | 63.84 | 30.16 |
Joulin et al. [19] | 47.48 | 11.72 | 59.20 | 35.15 | 64.22 | 29.53 |
Rubinstein et al. [21] | 88.04 | 55.81 | 85.38 | 64.42 | 82.81 | 51.65 |
(Ours) UnsupOD | 89.20 | 61.45 | 90.14 | 68.24 | 86.16 | 57.02 |