Abstract
Head pose estimation (HPE) is a challenging and critical research subject with a wide range of applications in areas such as driver monitoring, attention recognition, and human-computer interaction. Two problems remain, however. First, occlusion is very common in real application scenarios and greatly reduces the accuracy of HPE. Second, most research works represent the head pose with Euler angles, which may lead to problems in neural network optimization. To solve these problems, we propose an adaptive occlusion hybrid second-order attention network. First, an occlusion-aware module detects facial landmarks and generates heat maps that reflect the presence or absence of occlusion in specific facial parts, thereby enhancing features in the non-occluded parts of the face and suppressing features in the occluded regions. Meanwhile, we design a novel second-order information attention module that lets spatial and channel information interact through second-order statistics, so that the model learns the feature correlations of different facial parts while paying more attention to important channels and suppressing redundant ones, which further reduces the effect of occlusion and extracts more powerful features. Furthermore, to avoid the ambiguity of common head pose representations, we introduce the exponential map to represent the head pose and design a prediction framework capable of capturing the geometry of the pose space. Experimental results show that, on the BIWI dataset, the proposed model is competitive with methods that use depth information, and that it achieves clear advantages on the challenging AFLW2000 dataset, with more robust performance than other models under large poses and occlusion.
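The exponential-map representation mentioned in the abstract can be illustrated with a short sketch. This is not the authors' implementation: it assumes NumPy, and the function names `hat`, `exp_map`, and `log_map` are hypothetical. It shows only the standard Rodrigues formula, which maps a 3-vector (rotation axis scaled by the rotation angle in radians) to a rotation matrix, and its inverse for angles below pi.

```python
import numpy as np


def hat(w):
    """Skew-symmetric matrix K such that K @ v equals np.cross(w, v)."""
    x, y, z = w
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])


def exp_map(w, eps=1e-8):
    """Rodrigues' formula: rotation vector (axis * angle) -> 3x3 rotation matrix."""
    w = np.asarray(w, dtype=float)
    theta = np.linalg.norm(w)          # rotation angle in radians
    K = hat(w)
    if theta < eps:
        # First-order approximation close to the identity rotation.
        return np.eye(3) + K
    a = np.sin(theta) / theta
    b = (1.0 - np.cos(theta)) / theta ** 2
    return np.eye(3) + a * K + b * (K @ K)


def log_map(R, eps=1e-8):
    """Inverse of exp_map; numerically reliable only for angles below pi."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # Entries of the antisymmetric part R - R^T, arranged as a 3-vector.
    vee = np.array([R[2, 1] - R[1, 2],
                    R[0, 2] - R[2, 0],
                    R[1, 0] - R[0, 1]])
    if theta < eps:
        return 0.5 * vee
    return theta / (2.0 * np.sin(theta)) * vee


# Usage: a 3-vector round-trips through the rotation matrix it encodes.
w = np.array([0.3, -0.5, 0.2])
R = exp_map(w)
w_back = log_map(R)
```

A network that regresses such a 3-vector has three unconstrained outputs, and every output maps to a valid rotation matrix. How the paper's prediction framework uses this representation is not specified in the abstract.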
Acknowledgements
This research was funded by the National Natural Science Foundation of China (Grant No. 62272485), the Natural Science Foundation of Xinjiang Uygur Autonomous Region (Grant No. 2020DO1A131), and the Teaching and Research Fund of Yangtze University (Grant No. JY2020101). We gratefully acknowledge all the members who participated in this work.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Fu, Q., Xie, K., Wen, C. et al. Adaptive occlusion hybrid second-order attention network for head pose estimation. Int. J. Mach. Learn. & Cyber. 15, 667–683 (2024). https://doi.org/10.1007/s13042-023-01933-3
DOI: https://doi.org/10.1007/s13042-023-01933-3