Abstract
Occupancy prediction plays a pivotal role in autonomous driving. Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and incurring high computational costs. To bridge this gap, we introduce a novel fully sparse occupancy network, termed SparseOcc. SparseOcc first reconstructs a sparse 3D representation from visual inputs and then predicts semantic/instance occupancy from this sparse representation using sparse queries. A mask-guided sparse sampling scheme enables the sparse queries to interact with 2D features in a fully sparse manner, thereby circumventing costly dense features or global attention. Additionally, we design a thoughtful ray-based evaluation metric, namely RayIoU, to resolve the inconsistent depth penalty inherent in the traditional voxel-level mIoU criterion. SparseOcc demonstrates its effectiveness by achieving a RayIoU of 34.0 while maintaining a real-time inference speed of 17.3 FPS with 7 history frames as input. Increasing the number of preceding frames to 15 further improves performance to 35.1 RayIoU without bells and whistles. Code is available at https://github.com/MCG-NJU/SparseOcc.
H. Liu and Y. Chen—Equal contribution.
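To make the ray-based evaluation concrete, below is a minimal, illustrative sketch of how such a score can be computed. It is not the authors' implementation (see the repository above for the official metric): the voxel size, depth threshold, the FREE label, and the helper names `first_hit`/`ray_iou` are all assumptions made for this example.

```python
import numpy as np

FREE = 0  # assumed label for free space; the released benchmark may encode this differently


def first_hit(grid, origin, direction, step=0.4, max_range=51.2):
    """March a ray through a dense voxel grid and return (depth, class)
    of the first occupied voxel, or (None, None) if the ray escapes.
    Assumes the grid origin is at (0, 0, 0) and voxel size equals `step`."""
    direction = direction / np.linalg.norm(direction)
    for t in np.arange(step, max_range, step):
        idx = np.floor((origin + t * direction) / step).astype(int)
        if np.any(idx < 0) or np.any(idx >= np.array(grid.shape)):
            break  # ray left the volume
        label = grid[tuple(idx)]
        if label != FREE:
            return t, label
    return None, None


def ray_iou(pred, gt, rays, depth_thresh=1.0):
    """A ray counts as a true positive when prediction and ground truth
    agree on the class of the first hit and their depths differ by less
    than `depth_thresh`; the score is TP / (TP + FP + FN)."""
    tp = fp = fn = 0
    for origin, direction in rays:
        d_p, c_p = first_hit(pred, origin, direction)
        d_g, c_g = first_hit(gt, origin, direction)
        if c_p is None and c_g is None:
            continue  # both rays escape the volume: not counted
        if (c_p is not None and c_g is not None
                and c_p == c_g and abs(d_p - d_g) < depth_thresh):
            tp += 1
        else:
            fp += int(c_p is not None)  # wrong or spurious predicted surface
            fn += int(c_g is not None)  # missed or mislocated true surface
    return tp / max(tp + fp + fn, 1)
```

Given two labeled voxel grids and a list of `(origin, direction)` ray pairs, `ray_iou(pred_grid, gt_grid, rays)` then yields a depth-aware score: because each ray is judged by its first hit, a prediction that places a surface at the wrong depth is penalized the same amount regardless of how far off it is, which is exactly the inconsistency in voxel-level mIoU that RayIoU is designed to avoid.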
Acknowledgements
We thank the anonymous reviewers for their suggestions, which made this work better. This work is supported by the National Key R&D Program of China (No. 2022ZD0160900), the National Natural Science Foundation of China (No. 62076119, No. 61921006), the Fundamental Research Funds for the Central Universities (No. 020214380119), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Liu, H. et al. (2025). Fully Sparse 3D Occupancy Prediction. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_4
DOI: https://doi.org/10.1007/978-3-031-72698-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72697-2
Online ISBN: 978-3-031-72698-9
eBook Packages: Computer Science, Computer Science (R0)