Abstract
We propose an encoder–decoder CNN framework that predicts depth from a single image in a self-supervised manner. To this end, we design three encoders based on recent deep neural network architectures and a decoder that generates multiscale predictions, and we define eight loss functions on top of this encoder–decoder framework to evaluate its performance. During training, the CNN takes rectified stereo image pairs as input and learns multiscale disparity maps by reconstructing one view of the pair from the other. At test time, the CNN estimates accurate depth from only a single image. We validate our framework on two public datasets against state-of-the-art methods and our own variants, evaluating different encoder–decoder architectures and loss functions to find the best combination. The results show that our method performs very well for single-image depth estimation without ground-truth supervision.
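The core self-supervised signal described above — reconstructing one stereo view from the other via a learned disparity map and penalizing the photometric error — can be illustrated with a minimal NumPy sketch. This is not the paper's CNN pipeline (which predicts multiscale disparities with an encoder–decoder network); the function names and the simple L1 photometric loss here are illustrative assumptions. For a rectified pair, a pixel at column x in the left image corresponds to column x − d(x) in the right image, so the left view can be reconstructed by horizontally resampling the right view:

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left image by sampling the right image at
    horizontal positions x - d(x), with linear interpolation."""
    h, w = right.shape
    xs = np.arange(w)
    left_rec = np.empty_like(right, dtype=float)
    for y in range(h):
        sample = np.clip(xs - disparity[y], 0, w - 1)  # source column in right image
        x0 = np.floor(sample).astype(int)
        x1 = np.clip(x0 + 1, 0, w - 1)
        frac = sample - x0
        left_rec[y] = (1 - frac) * right[y, x0] + frac * right[y, x1]
    return left_rec

def photometric_l1(left, left_rec):
    """Mean absolute photometric error between the real and reconstructed view."""
    return float(np.mean(np.abs(left - left_rec)))

if __name__ == "__main__":
    # Toy rectified pair: the left view is the right view shifted by a
    # constant disparity of 2 pixels (border columns replicated).
    right = np.tile(np.arange(16.0), (4, 1))
    left = np.empty_like(right)
    left[:, 2:] = right[:, :-2]
    left[:, :2] = right[:, :1]
    disp = np.full((4, 16), 2.0)
    print(photometric_l1(left, warp_right_to_left(right, disp)))  # near zero
```

In the actual method the disparity map is not given but predicted by the network, and minimizing the reconstruction error is what drives it toward the true disparities — no ground-truth depth is needed.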





Acknowledgements
This work is supported by the National Key Research and Development Program of China (Nos. 2018YFC0309100 and 2018YFC0309104), the Six Talent Peaks Project in Jiangsu Province, and the scholarship program of the Jiangsu Provincial Government and Jiangsu University of Science and Technology.
Cite this article
Shi, J., Sun, Y., Bai, S. et al. A self-supervised method of single-image depth estimation by feeding forward information using max-pooling layers. Vis Comput 37, 815–829 (2021). https://doi.org/10.1007/s00371-020-01832-6