Human Pose Estimation

Outputs joint positions, rotations, and joint connectivity.

mindmap
基于视觉的人类动作捕捉 (Vision-based human motion capture)
    Input
        Single image / video sequence
        Monocular / multi-view
        Fixed / moving camera pose
    Output format
        SMPL / SMPL-X / SMPL-H skeletal motion
        Custom skeleton motion
    Output targets
        Single-person skeletal motion
        Multi-person skeletal motion
        Human skeletal motion + object pose
        Camera pose
    Methods
        Per-sequence optimization (fitting)
        Feed-forward inference
    Key challenges
        Temporal continuity of motion
        Physical plausibility of motion
        Occlusion and ambiguity in video data
        Real-time streaming output
        Consistency with the image evidence
        Accurate contact without interpenetration
        Coupling among human motion, camera pose, and body shape
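
To make the SMPL-family output format above concrete, here is a minimal sketch of what a single-person, SMPL-format capture result typically contains (per-frame joint rotations, shape coefficients, root translation, and the fixed kinematic tree). This is an illustrative data layout under the common 24-joint, axis-angle SMPL convention, not the API of any particular library.

```python
# Minimal sketch of a single-person, SMPL-format motion-capture output.
# Illustrative only; field names and shapes follow the usual SMPL convention.
from dataclasses import dataclass
import numpy as np

@dataclass
class SMPLMotion:
    global_orient: np.ndarray  # (T, 3)     root rotation per frame, axis-angle
    body_pose: np.ndarray      # (T, 23, 3) rotations of the 23 non-root joints
    betas: np.ndarray          # (10,)      body-shape coefficients (constant per person)
    transl: np.ndarray         # (T, 3)     root translation in world/camera space
    parents: np.ndarray        # (24,)      parent index of each joint -> connectivity

    @property
    def num_frames(self) -> int:
        return self.global_orient.shape[0]

# A 120-frame clip with a neutral pose and zero translation:
T = 120
motion = SMPLMotion(
    global_orient=np.zeros((T, 3)),
    body_pose=np.zeros((T, 23, 3)),
    betas=np.zeros(10),
    transl=np.zeros((T, 3)),
    parents=np.full(24, -1),  # placeholder; the real tree comes from the SMPL model file
)
print(motion.num_frames)  # 120
```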

Single-person HPE

Image-based single-person HPE

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| 31 | | SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation | Transformer-based HMR framework built on the SMPL model | | link |

Solving Depth Ambiguity

Solving Body Structure Understanding

Solving Occlusion Problems

Solving Data Scarcity

Image-based Human-Object Interaction (HOI)

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| | 2025.4.24 | PICO: Reconstructing 3D People In Contact with Objects | | | link |

Video-based single-person HPE

Solving the Single-Frame Limitation

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| | 2025.5.29 | GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion | Generates accurate, temporally consistent depth and normal maps from monocular human video | | link |

Solving Real-time Problems

| ID | Year | Name | Pain point | Main contribution | Tags | Link |
| --- | --- | --- | --- | --- | --- | --- |
| | 2025.8.29 | Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning | Diffusion-based methods are computationally expensive | A Hierarchical Temporal Pruning (HTP) strategy that dynamically prunes redundant pose tokens at the frame and semantic levels while preserving key motion dynamics | | |

Solving Body Structure Understanding

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| | | Learned Neural Physics Simulation for Articulated 3D Human Pose Reconstruction | Proposes a neural network that replaces a traditional physics engine to support motion understanding in video | LARP | |

Solving Occlusion Problems

Solving Data Scarcity

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| 103 | 2025.5.2 | GENMO: A GENeralist Model for Human MOtion | Casts HPE as video-conditioned motion generation; motion estimation and motion generation reinforce each other, improving estimation accuracy | generalist human motion model, motion estimation, motion generation, NVIDIA | link |

Multi-person HPE

Human Mesh Recovery

Template-based human mesh recovery

Naked human body recovery

| ID | Year | Name | Pain point | Main contribution | Tags | Link |
| --- | --- | --- | --- | --- | --- | --- |
| | 2025.8.13 | HumanGenesis: Agent-Based Geometric and Generative Modeling for Synthetic Human Dynamics | 1. Traditional HPE methods ignore the relationship between a specific body shape and 3D pose, sacrificing accuracy. 2. Pose optimization depends on how well constraints derived from 2D images align. | 1. Calibrates the subject's body shape first, then performs personalized pose fitting conditioned on that shape. 2. Introduces a body-shape-conditioned 3D pose prior that mitigates errors caused by over-reliance on 2D constraints. Improves both pelvis-aligned and absolute pose accuracy. | Trained on synthetic data only, plug-and-play | |

Multimodal Methods

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| [123] | 2019 | | | | |
| [124] | 2022 | | | | |
| [125] | 2022 | | | | |
| [126] | 2022 | | | | |
| | 2023 | WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion | single person, moving camera | | link |
| | 2024 | Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment | 2D-to-3D lifting | | link |
Moritz Einfalt, Katja Ludwig, and Rainer Lienhart. Uplift and upsample: Efficient 3d human pose estimation with uplifting transformers. In IEEE Winter Conf. Appl. Comput. Vis., pages 2903–2913, 2023.
Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, and Wenming Yang. Exploiting temporal contexts with strided transformer for 3d human pose estimation. IEEE Trans. Multimedia, 25:1282–1293, 2022a.
Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. P-stmo: Pre-trained spatial temporal many-to-one model for 3d human pose estimation. In Eur. Conf. Comput. Vis., pages 461–478. Springer, 2022.
Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Junsong Yuan. Mixste: Seq2seq mixed spatio-temporal encoder for 3d human pose estimation in video. In IEEE Conf. Comput. Vis. Pattern Recog., pages 13232– 13242, 2022.
Zhenhua Tang, Zhaofan Qiu, Yanbin Hao, Richang Hong, and Ting Yao. 3d human pose estimation with spatio-temporal criss-cross attention. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4790–4799, 2023.
Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, and Chen Chen. Poseformerv2: Exploring frequency domain for efficient and robust 3d human pose estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8877–8886, 2023.

Utilizing Attention Mechanisms

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| | 2023 | Humans in 4D: Reconstructing and Tracking Humans with Transformers | image-based, open source | | link |

Exploiting Temporal Information

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| [134] | 2019 | | | | |
| [135] | 2021 | | | | |
| [136] | 2021 | | | | |
| [137] | 2021 | | | | |
| [138] | 2022 | | | | |
| [139] | 2023 | Global-to-local modeling for video-based 3d human pose and shape estimation | To effectively balance the learning of short-term and long-term temporal correlations, the Global-to-Local Transformer (GLoT) [139] structurally decouples the modeling of long-term and short-term correlations | video, single person, SMPL, non-streaming, transformer | link |
| | 2024 | TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Video | Recovers 3D motion from image features alone | | link |

Multi-view Methods

Boosting Efficiency

Developing Various Representations

Utilizing Structural Information

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| 26 | 2024.4.5 | PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos | Uses physics to regularize human motion; estimates human dynamics from monocular video on top of the SMPL model, but incorporates physical constraints only implicitly through a Lagrangian loss | | link |

Choosing Appropriate Learning Strategies

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| 161 | 2019 | | | | |
| 44 | 2020 | | | | |
| 163 | 2020 | Coherent reconstruction of multiple humans from a single image | | image, multi-person | |
| 164 | 2021 | | | | |
| 46 | 2021 | | | | |
| 214 | 2021 | | | | |
| 165 | 2022 | | | | |
| 166 | 2022 | | | | |
| 167 | 2023 | JOTR: 3D joint contrastive learning with transformers for occluded human mesh recovery | Fuses 2D and 3D features and adds supervision on the 3D features through a Transformer-based contrastive learning framework | | |
| 162 | 2023 | ReFit: Recurrent fitting network for 3d human recovery | Reprojects keypoints and refines the body model through a feedback-update loop | | |
| 4 | 2023 | Co-evolution of pose and mesh for 3d human body estimation from video | Introduces a co-evolution approach to human mesh recovery that uses 3D pose as an intermediary. The process has two stages: first, 3D human pose is estimated from video; then mesh vertices are regressed from the estimated 3D pose combined with temporal image features | open source, single person, video, mesh | link |
| 168 | 2023 | Cyclic test-time adaptation on monocular video for 3d human mesh reconstruction | To bridge the gap between training and test data, CycleAdapt [168] proposes a domain-adaptation method consisting of a mesh reconstruction network and a motion denoising network, enabling more effective adaptation | | |

Detailed human body recovery

With Clothes

With Hands

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| 173 | 2023 | | SGNify, a model that captures hand pose, facial expression, and body movement from sign language videos. It employs linguistic priors and constraints on 3D hand pose to effectively address the ambiguities in isolated signs | | |
| 174 | 2021 | | the relationship between two hands | | |
| 175 | 2021 | | the relationship between hand and object | | |
| | 2023 | HMP: Hand Motion Priors for Pose and Shape Estimation from Video | First learns a hand-motion prior from hand motion data without paired video, then performs hand pose estimation on top of that prior | hands, open source | link |

Whole Body

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| 176 | | | | | |
| 177 | | | | | |
| 178 | 2021 | | independently runs 3D mesh recovery regression for face, hands, and body, then combines the outputs through an integration module | | |
| 179 | 2021 | | integrates independent estimates from the body, face, and hands using the shared shape space of SMPL-X across all body parts | | |
| 180 | 2022 | Accurate 3d hand pose estimation for whole-body 3d human mesh estimation | end-to-end framework for whole-body human mesh recovery named Hand4Whole, which employs joint features for 3D joint rotations to enhance the accuracy of 3D hand predictions | | |
| 181 | 2023 | Pymaf-x: Towards well-aligned full-body model regression from monocular images | resolves the misalignment issues in regression-based, one-stage human mesh recovery methods by employing a feature pyramid approach and refining the mesh-image alignment parameters | | |
| 215 | | | | | |
| 182 | 2023 | One-stage 3d whole-body mesh recovery with component aware transformer | a simple yet effective component-aware transformer that includes a global body encoder and a local face/hand decoder instead of separate networks for each part | | |
| 183 | | | | | |

Template-free human body recovery

Moving-camera scenarios

Extracting the camera trajectory

| ID | Year | Name | Note | Tags | Link |
| --- | --- | --- | --- | --- | --- |
| | 2022 | BodySLAM: Joint Camera Localisation, Mapping, and Human Motion Tracking | | | |
| | 2023 | Decoupling Human and Camera Motion from Videos in the Wild | Jointly optimizes human pose and camera scale so that human displacement matches a learned motion model | multi-person | link |

Evaluation

Evaluation metrics

For pose and shape reconstruction

mean per-joint position error (MPJPE), Procrustes-aligned per-joint error (PA-MPJPE), and per-vertex error (PVE)
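
A minimal NumPy sketch of these three reconstruction metrics, assuming predictions and ground truth are in the same units (typically millimetres) and that MPJPE/PVE are computed after root alignment, as is common practice; `procrustes_align` is an illustrative helper, not a function from any specific library.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: mean Euclidean distance over joints.
    pred, gt: (J, 3) arrays, usually root-aligned beforehand."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def procrustes_align(pred, gt):
    """Similarity (rotation + scale + translation) alignment of pred onto gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment."""
    return mpjpe(procrustes_align(pred, gt), gt)

# PVE is the same distance, averaged over mesh vertices instead of joints:
pve = mpjpe  # call with (V, 3) vertex arrays
```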

To evaluate motion smoothness

acceleration error (ACCEL), computed against the ground-truth acceleration
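
A sketch of ACCEL using second-order finite differences. Conventions differ between papers (some report mm/frame² and skip dividing by the frame interval), so treat the `fps` handling here as an assumption.

```python
import numpy as np

def accel_error(pred_joints, gt_joints, fps=30.0):
    """Acceleration error (ACCEL): mean difference between predicted and
    ground-truth joint accelerations, estimated by finite differences.
    pred_joints, gt_joints: (T, J, 3)."""
    dt = 1.0 / fps
    accel_pred = (pred_joints[2:] - 2 * pred_joints[1:-1] + pred_joints[:-2]) / dt ** 2
    accel_gt = (gt_joints[2:] - 2 * gt_joints[1:-1] + gt_joints[:-2]) / dt ** 2
    return np.linalg.norm(accel_pred - accel_gt, axis=-1).mean()
```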

For human trajectory evaluation,

we slice a sequence into 100-frame segments and evaluate 3D joint error after aligning the first two frames (W-MPJPE100) or the entire segment (WA-MPJPE100) [93].
We also evaluate the error of the entire trajectory after aligning the first frame, using root translation error (RTE), root orientation error (ROE), and egocentric root velocity error (ERVE).
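
A sketch of the 100-frame world-frame metrics described above, assuming rigid (rotation + translation, no scale) alignment of each segment; `rigid_align_params` and `w_mpjpe_100` are illustrative helpers, with `align_frames=2` giving W-MPJPE100 and whole-segment alignment giving WA-MPJPE100.

```python
import numpy as np

def rigid_align_params(src, dst):
    """Rotation R and translation t (no scale) that best map src onto dst.
    src, dst: (N, 3) stacks of 3D points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - mu_s).T @ (dst - mu_d))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

def w_mpjpe_100(pred, gt, align_frames=2, seg_len=100):
    """pred, gt: (T, J, 3) world-frame joints.
    align_frames=2    -> W-MPJPE100  (align the first two frames of each segment)
    align_frames=None -> WA-MPJPE100 (align the entire segment)."""
    errors = []
    for s in range(0, len(pred) - seg_len + 1, seg_len):
        p, g = pred[s:s + seg_len], gt[s:s + seg_len]
        k = seg_len if align_frames is None else align_frames
        R, t = rigid_align_params(p[:k].reshape(-1, 3), g[:k].reshape(-1, 3))
        p_aligned = p @ R.T + t
        errors.append(np.linalg.norm(p_aligned - g, axis=-1).mean())
    return float(np.mean(errors))
```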

For camera trajectory evaluation

absolute trajectory error (ATE) [75], which applies Procrustes alignment with scaling to register the estimate to the ground truth before computing the error.

To evaluate the accuracy of scale estimation

we evaluate ATE using the estimated scale (ATE-S) [35].
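
A sketch of the camera-trajectory metrics: ATE fits a similarity (Procrustes with scale) transform from the estimated trajectory to the ground truth before measuring error, while ATE-S reuses the same alignment but substitutes the method's own scale estimate; `similarity_params` and `ate` are illustrative helpers, not functions from any specific library.

```python
import numpy as np

def similarity_params(src, dst):
    """Scale s, rotation R, translation t of the similarity transform
    minimizing ||s * R @ src + t - dst||. src, dst: (T, 3) trajectories."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    p, g = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    s = S.sum() / (p ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ate(pred_traj, gt_traj, scale=None):
    """Absolute trajectory error. If `scale` is given (ATE-S), it overrides
    the scale fitted by the similarity alignment."""
    s, R, t = similarity_params(pred_traj, gt_traj)
    if scale is not None:
        s = scale
        # re-fit the translation for the fixed scale
        t = gt_traj.mean(0) - s * R @ pred_traj.mean(0)
    aligned = s * pred_traj @ R.T + t
    return np.linalg.norm(aligned - gt_traj, axis=-1).mean()
```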

Reference

  1. Deep Learning for 3D Human Pose Estimation and Mesh Recovery: A Survey