Abstract
Multi-modal 3D object detection has been an active research topic in autonomous driving. Nevertheless, it is non-trivial to explore cross-modal feature fusion between sparse 3D points and dense 2D pixels. Recent approaches either fuse image features with point cloud features projected onto the 2D image plane or combine the sparse point cloud with dense image pixels. Both fusion strategies often suffer from severe information loss, causing sub-optimal performance. To address these problems, we construct a homogeneous structure between the point cloud and the images, avoiding projective information loss by transforming the camera features into the LiDAR 3D space. In this paper, we propose a homogeneous multi-modal feature fusion and interaction method (HMFI) for 3D object detection. Specifically, we first design an image voxel lifter module (IVLM) to lift 2D image features into the 3D space and generate homogeneous image voxel features. Then, we fuse the voxelized point cloud features with the image features from different regions by introducing a self-attention based query fusion mechanism (QFM). Next, we propose a voxel feature interaction module (VFIM) that enforces the consistency of semantic information from identical objects across the homogeneous point cloud and image voxel representations, providing object-level alignment guidance for cross-modal feature fusion and strengthening discriminative ability in complex backgrounds. We conduct extensive experiments on the KITTI and Waymo Open Datasets, where HMFI outperforms state-of-the-art multi-modal methods. In particular, for 3D detection of cyclists on the KITTI benchmark, HMFI surpasses all published algorithms by a large margin.
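To make the lifting idea concrete, below is a minimal sketch of the kind of operation an image-to-voxel lifter performs: projecting LiDAR-space voxel centers onto the image plane and bilinearly sampling 2D features, so that image features live in the same 3D grid as the voxelized point cloud. This is an illustrative reconstruction under standard pinhole-projection assumptions, not the authors' IVLM implementation; the function name and the lidar2img matrix are hypothetical.

```python
import torch
import torch.nn.functional as F

def lift_image_features_to_voxels(img_feats, voxel_centers, lidar2img):
    """Sample 2D image features at the projections of 3D voxel centers.

    img_feats:     (C, H, W) image feature map
    voxel_centers: (N, 3) voxel centers in LiDAR coordinates
    lidar2img:     (4, 4) LiDAR-to-image projection matrix (assumed known)
    Returns (N, C) per-voxel image features; voxels projecting outside
    the image or behind the camera receive zeros.
    """
    C, H, W = img_feats.shape
    ones = torch.ones(voxel_centers.shape[0], 1, device=voxel_centers.device)
    pts = torch.cat([voxel_centers, ones], dim=1)       # (N, 4) homogeneous
    proj = pts @ lidar2img.T                            # project to image space
    depth = proj[:, 2:3].clamp(min=1e-5)
    uv = proj[:, :2] / depth                            # pixel coordinates
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=1) * 2 - 1
    sampled = F.grid_sample(
        img_feats[None], grid[None, :, None, :],        # -> (1, C, N, 1)
        align_corners=True, padding_mode="zeros")
    feats = sampled[0, :, :, 0].T                       # (N, C)
    feats[proj[:, 2] <= 0] = 0.0                        # mask voxels behind camera
    return feats
```

Once the image features are expressed on the same voxel grid as the point cloud features, the two homogeneous representations can be fused region-by-region (e.g., with the self-attention based QFM) and aligned at the object level (VFIM); the sketch stops at the homogeneous image voxel representation.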
Acknowledgments
This research is funded by the Science and Technology Commission of Shanghai Municipality (19511120200). The computation was performed on the ECNU Multifunctional Platform for Innovation (001).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, X. et al. (2022). Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13698. Springer, Cham. https://doi.org/10.1007/978-3-031-19839-7_40
DOI: https://doi.org/10.1007/978-3-031-19839-7_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19838-0
Online ISBN: 978-3-031-19839-7