Human Pose Estimation via an Ultra-Lightweight Pose Distillation Network
Abstract
1. Introduction
- We design a lightweight one-stage pose estimation network (stage 1) that learns from a progressively refined sequential expert network in an online knowledge distillation manner;
- We construct an ultra-lightweight re-parameterized pose estimation subnetwork that uses a weight-sharing multi-module design to improve the multi-scale feature extraction capability over a single-module design. After training, we retain only the first re-parameterized module as the deployment network, keeping the deployed architecture simple;
- Extensive experimental results demonstrate the superiority of our method on three standard benchmark datasets.
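The first contribution hinges on online distillation: the lightweight stage is trained jointly with the deeper refinement stages and regresses toward both the ground-truth heatmaps and the later stages' predictions, rather than toward a separately pre-trained teacher. The paper's exact loss is not reproduced in this outline; the sketch below is a generic formulation of such a loss (the function names, array shapes, and the weighting factor `alpha` are illustrative assumptions):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two heatmap tensors."""
    return float(np.mean((a - b) ** 2))

def online_distillation_loss(student_maps, teacher_maps, gt_maps, alpha=0.5):
    """Combine a ground-truth heatmap loss with a distillation term.

    student_maps : heatmaps predicted by the lightweight stage-1 network
    teacher_maps : heatmaps from the refined later stages (trained jointly,
                   i.e. 'online' -- no separately pre-trained teacher)
    gt_maps      : ground-truth Gaussian keypoint heatmaps
    alpha        : weight balancing the two terms (an assumption here)
    """
    task_loss = mse(student_maps, gt_maps)        # supervise with labels
    distill_loss = mse(student_maps, teacher_maps)  # mimic the expert stages
    return (1.0 - alpha) * task_loss + alpha * distill_loss

# Toy example: 16 joints on a 64x64 heatmap grid.
rng = np.random.default_rng(0)
gt = rng.random((16, 64, 64))
teacher = gt + 0.01 * rng.standard_normal((16, 64, 64))  # close to gt
student = gt + 0.10 * rng.standard_normal((16, 64, 64))  # noisier
loss = online_distillation_loss(student, teacher, gt)
```

Because the teacher stages are updated in the same training run, the distillation target improves as training proceeds, which is what "increasingly refined sequential expert network" refers to.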
2. Related Work
2.1. Lightweight Pose Estimation Network
2.2. Intermediate Supervision
2.3. Structure Optimization
3. Proposed Methods
3.1. Keypoint Feature Extraction
3.2. Re-Parameterized Structure
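The merge-at-deployment idea in this section follows the general pattern of structural re-parameterization: parallel convolution branches used during training are algebraically folded into a single convolution for inference, so the deployed network keeps the accuracy benefit of the multi-branch design at the cost of one plain convolution. A minimal 1-D illustration of the fusion step (not the paper's actual module; all kernels are toy values):

```python
import numpy as np

def conv1d(x, k):
    """'same'-padded 1-D convolution (correlation) with kernel k."""
    pad = len(k) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(k)], k) for i in range(len(x))])

# Training time: two parallel branches, a 3-tap and a 1-tap convolution.
k3 = np.array([0.2, 0.5, 0.3])
k1 = np.array([0.7])

def branched(x):
    return conv1d(x, k3) + conv1d(x, k1)

# Deployment: fold the 1-tap kernel into the 3-tap one by zero-padding it
# to the same size and adding -- convolution is linear in the kernel, so a
# single convolution reproduces the two-branch output exactly.
k_fused = k3 + np.pad(k1, 1)   # [0.2, 0.5 + 0.7, 0.3]

def fused(x):
    return conv1d(x, k_fused)

x = np.array([1.0, -2.0, 3.0, 0.5])
assert np.allclose(branched(x), fused(x))  # identical outputs
```

The same algebra extends to 2-D kernels (and to absorbing batch-norm parameters), which is why the re-parameterized deployment network can be strictly smaller than the training-time one with no change in output.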
3.3. Learning in the UEPDN
3.4. Summary
Algorithm 1: Ultra-lightweight Pose Estimation Algorithm
4. Experimental Results
4.1. Pose Estimation on the MPII Dataset
4.1.1. Dataset and Performance Metric
4.1.2. Training and Deployment Details
4.1.3. Results on the MPII Dataset
4.2. Pose Estimation on the LSP Dataset
4.2.1. Dataset and Performance Metric
4.2.2. Training and Deployment Details
4.2.3. Results on the LSP Dataset
4.3. Pose Estimation on the UAV-Human Pose Estimation Dataset
4.3.1. Dataset and Performance Metric
4.3.2. Training and Deployment Details
4.3.3. Results on the UAV-Human Pose Estimation Dataset
4.4. Ablation Experiments
4.4.1. Effect of Pose Distillation and Re-Parameterized Modules
4.4.2. Effect of Training Stage Size
4.4.3. Effect of Deployment Network
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Felzenszwalb, P.F.; Huttenlocher, D.P. Pictorial structures for object recognition. Int. J. Comput. Vis. 2005, 61, 55–79.
2. Andriluka, M.; Roth, S.; Schiele, B. Pictorial structures revisited: People detection and articulated pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
3. Wang, F.; Li, Y. Learning visual symbols for parsing human poses in images. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013.
4. Pishchulin, L.; Andriluka, M.; Gehler, P.V.; Schiele, B. Poselet conditioned pictorial structures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
5. Sapp, B.; Toshev, A.; Taskar, B. Cascaded models for articulated pose estimation. In Proceedings of the European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010.
6. Sapp, B.; Taskar, B. MODEC: Multimodal decomposable models for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
7. Chen, X.; Yuille, A. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
8. Cherian, A.; Mairal, J.; Alahari, K.; Schmid, C. Mixing body-part sequences for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
9. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
10. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
11. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
12. Tompson, J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
13. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
14. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016.
15. Fang, H.; Xie, S.; Tai, Y.; Lu, C. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
16. Nie, X.; Li, Y.; Luo, L.; Zhang, N.; Feng, J. Dynamic kernel distillation for efficient pose estimation in videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
17. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
18. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
19. Zhang, J.; Chen, Z.; Tao, D. Towards high performance human keypoint detection. Int. J. Comput. Vis. 2021, 129, 2639–2662.
20. Dong, H.; Wang, G.; Chen, C.; Zhang, X. RefinePose: Towards more refined human pose estimation. Electronics 2022, 11, 4060.
21. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
22. Kato, N.; Li, T.; Nishino, K.; Uchida, Y. Improving multi-person pose estimation using label correction. arXiv 2018, arXiv:1811.03331.
23. Zhang, F.; Zhu, X.; Ye, M. Fast human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
24. Qiang, B.; Zhai, Y.; Chen, J.; Xie, W.; Zheng, H.; Wang, X.; Zhang, S. Lightweight human skeleton key point detection model based on improved convolutional pose machines and SqueezeNet. J. Comput. Appl. 2020, 40, 1806–1811.
25. Weinzaepfel, P.; Brégier, R.; Combaluzier, H.; Leroy, V.; Rogez, G. DOPE: Distillation of part experts for whole-body 3D pose estimation in the wild. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
26. Zhong, F.; Li, M.; Zhang, K.; Hu, J.; Liu, L. DSPNet: A low computational-cost network for human pose estimation. Neurocomputing 2021, 423, 327–335.
27. Wang, W.; Zhang, K.; Ren, H.; Wei, D.; Gao, Y.; Liu, J. UULPN: An ultra-lightweight network for human pose estimation based on unbiased data processing. Neurocomputing 2022, 480, 220–233.
28. Bulat, A.; Tzimiropoulos, G. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
29. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
30. Martinez, G.H.; Raaj, Y.; Idrees, H.; Xiang, D.; Joo, H.; Simon, T.; Sheikh, Y. Single-network whole-body pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
31. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186.
32. Wang, J.; Luo, Z. Pointless pose: Part affinity field-based 3D pose estimation without detecting keypoints. Electronics 2021, 10, 929.
33. Li, Z.; Ye, J.; Song, M.; Huang, Y.; Pan, Z. Online knowledge distillation for efficient pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
34. Xiao, Y.; Wang, X.; He, M.; Jin, L.; Song, M.; Zhao, J. A compact and powerful single-stage network for multi-person pose estimation. Electronics 2023, 12, 857.
35. Wang, J.R.; Li, X.; Ling, C.X. Pelee: A real-time object detection system on mobile devices. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018.
36. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
37. Johnson, S.; Everingham, M. Clustered pose and nonlinear appearance models for human pose estimation. In Proceedings of the British Machine Vision Conference, Aberystwyth, UK, 31 August–3 September 2010.
38. Li, T.; Liu, J.; Zhang, W.; Ni, Y.; Wang, W.; Li, Z. UAV-Human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021.
39. Li, Y.; Shi, Q.; Song, J.; Yang, F. Human pose estimation via dynamic information transfer. Electronics 2023, 12, 695.
40. Jia, Y.; Shelhamer, E.; Donahue, J. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014.
41. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
42. Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z.; Yang, W.; Xia, S.T. SimCC: A simple coordinate classification perspective for human pose estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022.
43. Li, K.; Wang, S.; Zhang, X.; Xu, Y.; Xu, W.; Tu, Z. Pose recognition with cascade transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021.
44. Li, Y.; Zhang, S.; Wang, Z.; Yang, S.; Yang, W.; Xia, S.T.; Zhou, E. TokenPose: Learning keypoint tokens for human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
45. Geng, Z.; Wang, C.; Wei, Y.; Liu, Z.; Li, H.; Hu, H. Human pose as compositional tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 20–22 June 2023.
46. Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-HRNet: A lightweight high-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021.
47. Rafi, U.; Leibe, B.; Gall, J.; Kostrikov, I. An efficient convolutional network for human pose estimation. In Proceedings of the British Machine Vision Conference, York, UK, 19–22 September 2016.
48. Ning, G.; Zhang, Z.; He, Z. Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Trans. Multim. 2018, 20, 1246–1259.
Methods | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean | AUC | #Params | FLOPs |
---|---|---|---|---|---|---|---|---|---|---|---|
Hourglass [14] | 96.9 | 95.9 | 89.5 | 84.4 | 88.4 | 84.5 | 80.7 | 89.1 | – | 25.6 M | 55 G |
SimCC [42] | 97.2 | 96.0 | 90.4 | 85.6 | 89.5 | 85.8 | 81.8 | 90.0 | – | 25.7 M | 32.9 G |
PRTR [43] | 97.3 | 96.0 | 90.6 | 84.5 | 89.7 | 85.5 | 79.0 | 89.5 | – | 57.2 M | 21.6 G |
TokenPose [44] | 97.1 | 95.9 | 91.0 | 85.8 | 89.5 | 86.1 | 82.7 | 90.2 | – | 21.4 M | 9.1 G |
OKDHP-branch2 [33] | 96.7 | 95.4 | 89.9 | 84.1 | 89.0 | 84.7 | 81.1 | 89.2 | – | 15.5 M | 47 G |
OKDHP-branch1 [33] | 96.7 | 95.3 | 89.2 | 84.0 | 87.8 | 83.9 | 79.5 | 88.6 | – | 13.0 M | 41 G |
DSPNet-B1 [26] | 97.1 | 96.1 | 89.7 | 84.8 | 89.6 | 85.5 | 81.3 | 89.7 | – | 12.6 M | 1.6 G |
DSPNet-B0 [26] | 96.7 | 95.7 | 88.9 | 82.6 | 88.7 | 84.1 | 78.7 | 88.5 | – | 7.6 M | 1.2 G |
FPD [23] | – | – | – | – | – | – | – | 90.1 | 62.4 | 3.0 M | 9 G |
PCT [45] | 97.5 | 97.2 | 92.8 | 88.4 | 92.4 | 89.6 | 87.1 | 92.5 | – | 221.5 M | 15.2 G |
OpenPose [31] | 96.2 | 95.0 | 87.5 | 82.2 | 87.6 | 82.7 | 78.4 | 87.7 | – | – | – |
UULPN [27] | 96.0 | 93.6 | 85.3 | 78.7 | 86.2 | 80.4 | 75.6 | 85.7 | – | 2.8 M | 2.23 G |
Lite-HRNet-30 [46] | – | – | – | – | – | – | – | 87.0 | – | 1.8 M | 0.42 G |
UEPDN-R1 (Ours) | 98.1 | 96.7 | 91.0 | 84.4 | 90.3 | 83.8 | 76.3 | 89.3 | 64.3 | 2.75 M | 6.2 G |
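The per-joint columns in the table above report PCKh scores, the standard MPII metric: a predicted joint counts as correct when it lies within a threshold fraction (0.5 here) of the head-segment length from the ground truth, and "Mean" averages over joints. A minimal sketch of the computation (array shapes and names are assumptions):

```python
import numpy as np

def pckh(pred, gt, head_sizes, thresh=0.5):
    """PCKh: fraction of predicted joints within thresh * head size of gt.

    pred, gt   : (N, J, 2) arrays of joint coordinates for N people, J joints
    head_sizes : (N,) per-person head-segment lengths (the normalization
                 factor that gives PCKh its 'h')
    """
    dists = np.linalg.norm(pred - gt, axis=-1)       # (N, J) pixel distances
    correct = dists <= thresh * head_sizes[:, None]  # (N, J) hit mask
    return float(correct.mean())

# Toy example: one person, two joints, head size 10 -> 5-pixel threshold.
gt = np.array([[[10.0, 10.0], [30.0, 30.0]]])
pred = np.array([[[12.0, 10.0], [30.0, 45.0]]])  # 2 px off, 15 px off
score = pckh(pred, gt, np.array([10.0]))
assert score == 0.5  # first joint within threshold, second not
```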
Methods | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean | AUC | #Params | FLOPs | FPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CNGM [12] | 90.6 | 79.2 | 67.9 | 63.4 | 69.5 | 71.0 | 64.2 | 72.3 | 47.3 | – | – | – |
ECN [47] | 95.8 | 86.2 | 79.3 | 75.0 | 86.6 | 83.8 | 79.8 | 83.8 | 56.9 | 56.0 M | 28 G | – |
CPM [13] | 97.8 | 92.5 | 87.0 | 83.9 | 91.5 | 90.8 | 89.9 | 90.5 | 65.4 | 31.0 M | 351 G | 3.5 |
KGDFNN [48] | 98.2 | 94.4 | 91.8 | 89.3 | 94.7 | 95.0 | 93.5 | 93.9 | – | 53.1 M | 124 G | – |
FPD [23] | 97.3 | 92.3 | 86.8 | 84.2 | 91.9 | 92.2 | 90.9 | 90.8 | 64.3 | 3.0 M | 9 G | – |
UEPDN-Stage 1 (Ours) | 97.3 | 92.8 | 88.8 | 86.1 | 91.2 | 91.5 | 89.9 | 91.1 | 66.3 | 3.8 M | 8.4 G | 4.0 |
UEPDN-R1 (Ours) | 96.5 | 91.8 | 86.0 | 80.3 | 88.4 | 88.4 | 80.8 | 87.5 | 62.9 | 2.75 M | 6.2 G | 5.3 |
Methods | mAP (%) | #Params | FLOPs |
---|---|---|---|
HigherHRNet [18] | 56.5 | 28.6 M | 47.9 G |
RMPE [15] | 56.9 | 59.7 M | – |
UEPDN-Stage 1 (Ours) | 56.3 | 3.8 M | 8.4 G |
UEPDN-R1 (Ours) | 54.8 | 2.75 M | 6.2 G |
PD (pose distillation) | RM (re-parameterized modules) | R1 | R2 | R3 | R4 | (R5) Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 |
---|---|---|---|---|---|---|---|---|---|---|---|
× | × | – | – | – | – | 89.5 | 91.1 | 91.8 | 92.1 | 92.1 | 92.1 |
✓ | × | – | – | – | – | 90.7 | 91.4 | 91.8 | 92.1 | 92.2 | 92.1 |
× | ✓ | 85.0 | 88.3 | 90.0 | 90.6 | 90.9 | 91.1 | 91.7 | 91.9 | 92.1 | 92.1 |
✓ | ✓ | 87.5 | 88.8 | 90.2 | 90.6 | 91.1 | 91.2 | 91.2 | 91.4 | 91.6 | 91.2 |
Training Stage Size | Mean | #Params |
---|---|---|
Stage size s = 6 | 87.5 | 2.75 M |
Stage size s = 5 | 87.2 | 2.75 M |
Stage size s = 4 | 87.1 | 2.75 M |
Method | R1 | R2 | R3 | R4 | (R5) Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 |
---|---|---|---|---|---|---|---|---|---|---|
Mean | 87.5 | 88.8 | 90.2 | 90.6 | 91.1 | 91.2 | 91.2 | 91.4 | 91.6 | 91.2 |
#Params | 2.75 M | 3.00 M | 3.27 M | 3.52 M | 3.82 M | 5.90 M | 7.90 M | 10.00 M | 12.00 M | 14.10 M |
FLOPs | 6.20 G | 6.75 G | 7.30 G | 7.85 G | 8.40 G | 12.90 G | 17.32 G | 21.74 G | 26.16 G | 30.58 G |
FPS | 5.3 | 4.9 | 4.7 | 4.5 | 4.0 | 3.3 | 2.6 | 2.0 | 1.7 | 1.4 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, S.; Qiang, B.; Yang, X.; Wei, X.; Chen, R.; Chen, L. Human Pose Estimation via an Ultra-Lightweight Pose Distillation Network. Electronics 2023, 12, 2593. https://doi.org/10.3390/electronics12122593