CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-06-27
Yuanhong ZHONG, Qianfeng XU, Daidi ZHONG, Xun YANG, Shanshan WANG. FaSRnet: a feature and semantics refinement network for human pose estimation[J]. Frontiers of Information Technology & Electronic Engineering, 2024, 25(4): 513-526.
@article{title="FaSRnet: a feature and semantics refinement network for human pose estimation",
author="Yuanhong ZHONG, Qianfeng XU, Daidi ZHONG, Xun YANG, Shanshan WANG",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="25",
number="4",
pages="513-526",
year="2024",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2200639"
}
%0 Journal Article
%T FaSRnet: a feature and semantics refinement network for human pose estimation
%A Yuanhong ZHONG
%A Qianfeng XU
%A Daidi ZHONG
%A Xun YANG
%A Shanshan WANG
%J Frontiers of Information Technology & Electronic Engineering
%V 25
%N 4
%P 513-526
%@ 2095-9184
%D 2024
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2200639
TY - JOUR
T1 - FaSRnet: a feature and semantics refinement network for human pose estimation
A1 - Yuanhong ZHONG
A1 - Qianfeng XU
A1 - Daidi ZHONG
A1 - Xun YANG
A1 - Shanshan WANG
JO - Frontiers of Information Technology & Electronic Engineering
VL - 25
IS - 4
SP - 513
EP - 526
SN - 2095-9184
Y1 - 2024
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2200639
ER -
Abstract: Multi-frame human pose estimation is challenging because of factors such as motion blur, defocused video, and occlusion. Exploiting the temporal consistency between consecutive frames is an efficient way to address this problem. Most current methods exploit temporal consistency by refining the final heatmaps. The heatmaps carry the semantic information of the key points, and refining them improves detection quality to a certain extent. However, heatmaps are generated from features, and feature-level refinement is rarely considered. In this paper, we propose a human pose estimation framework that refines at both the feature level and the semantics level. At the feature level, we align auxiliary features with the features of the current frame to reduce the loss caused by differing feature distributions, and then use an attention mechanism to fuse the auxiliary features with the current features. At the semantics level, we use the difference information between adjacent heatmaps as auxiliary features to refine the current heatmaps. The method is validated on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018, and the results demonstrate its effectiveness.
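The two refinement levels named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the alignment or attention modules, so the function names, the scaled dot-product scoring, and the flat-list representation of features and heatmaps below are all assumptions chosen for illustration. The alignment step is omitted, and the heatmap differences are only computed, not fed into any refinement module.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_features(current, auxiliaries):
    """Attention-weighted fusion (hypothetical scoring rule).

    Each candidate vector (the current frame's feature plus every
    auxiliary feature) is scored by its scaled dot product with the
    current feature; the softmax of the scores gives the fusion weights.
    """
    candidates = [current] + list(auxiliaries)
    scale = math.sqrt(len(current))
    weights = softmax([dot(current, c) / scale for c in candidates])
    fused = [
        sum(w * c[i] for w, c in zip(weights, candidates))
        for i in range(len(current))
    ]
    return fused, weights


def heatmap_differences(prev_hm, cur_hm, next_hm):
    """Element-wise differences between adjacent (flattened) heatmaps.

    Returns (current - previous, next - current). In a full model these
    would serve as auxiliary inputs to a heatmap refinement module.
    """
    backward = [c - p for c, p in zip(cur_hm, prev_hm)]
    forward = [n - c for n, c in zip(next_hm, cur_hm)]
    return backward, forward
```

In this sketch, an auxiliary feature that resembles the current feature receives a larger fusion weight than a dissimilar one, which is the intuition behind attention-based fusion; a learned model would replace the fixed dot-product score with trained projections.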