Combining multi-view controlled fusion and joint correlation for 3D human pose estimation
【摘要】 Objective Multi-view 3D human pose estimation can recover the depth of each joint from 2D images captured at multiple viewpoints, overcoming the ill-posedness caused by occlusion and depth ambiguity in monocular 3D human pose estimation. However, when system performance is constrained by the quality of the 2D pose estimates, further improving the final 3D accuracy is difficult. To address this, we propose CFJCNet (controlled fusion and joint correlation network), a 3D human pose estimation algorithm that combines multi-view controllable fusion with joint correlation and consists of three parts: a multi-view fusion optimization module, a 2D pose refinement module, and a structural triangulation module. Method First, a multi-view controllable fusion optimization module built on an epipolar-geometry framework selectively applies epipolar constraints to improve the quality of the estimated 2D heatmaps while limiting the introduction of noise. Then, a 2D pose refinement method that jointly learns with graph convolution and an attention mechanism takes the connectivity between joints within a single view as a constraint, better capturing the global and local information of the human body and refining the 2D pose estimates. Finally, structural triangulation is introduced to obtain prior knowledge of human bone lengths, which is embedded in the 3D reconstruction process to improve 3D human pose estimation. Result The algorithm is evaluated on two public datasets, Human3.6M and Total Capture, and a synthetic dataset, Occlusion-Person, achieving mean per joint position errors of 17.1 mm, 18.7 mm, and 10.2 mm, respectively, clearly outperforming existing multi-view 3D human pose estimation algorithms. Conclusion We propose a multi-view 3D human pose estimation algorithm that establishes the consistency of human joints across multiple views as well as the intrinsic topological constraints of the human skeleton within each view; it refines the 2D estimates, corrects erroneous poses, effectively improves the accuracy of 3D human pose estimation, and achieves the best estimation results.
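The mean errors reported above (17.1 mm, 18.7 mm, and 10.2 mm) follow the standard mean per joint position error (MPJPE) definition: the Euclidean distance between estimated and ground-truth 3D joints, averaged over joints and frames. A minimal sketch, assuming millimetre units and a Human3.6M-style 17-joint skeleton; the function name and array layout are illustrative assumptions, not the authors' evaluation code:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance between
    predicted and ground-truth 3D joints, in millimetres.

    pred, gt: arrays of shape (num_frames, num_joints, 3).
    """
    assert pred.shape == gt.shape
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy usage with a 17-joint, Human3.6M-style skeleton.
gt = np.random.rand(4, 17, 3) * 1000.0        # ground truth in mm
pred = gt + np.random.randn(4, 17, 3) * 10.0  # ~10 mm Gaussian error
print(f"MPJPE = {mpjpe(pred, gt):.1f} mm")
```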
【Abstract】 Objective 3D human pose estimation, which aims to estimate 3D joint positions from images or videos, is fundamental to understanding human behavior and is widely used in downstream tasks such as human-computer interaction, virtual fitting, autonomous driving, and pose tracking. According to the number of cameras, it can be divided into monocular and multi-view 3D human pose estimation. The ill-posed problem caused by occlusion and depth ambiguity makes it difficult for monocular methods to estimate 3D joint positions reliably, whereas multi-view 3D human pose estimation can recover the depth of each joint from multiple images and thus overcome this problem. Most recent methods use a triangulation module to lift the 2D joint positions measured in multiple images to 3D space. This module is usually applied in a two-stage procedure: first, the 2D joint coordinates in each view are estimated separately by a 2D pose detector, and then the 3D pose is recovered from the multi-view 2D poses by triangulation. On this basis, some methods use epipolar geometry to fuse human joint features and establish correlations among views, which improves the accuracy of the 3D estimates. However, when system performance is constrained by the quality of the 2D estimation results, further improving the final 3D accuracy is difficult. Therefore, to extract human contextual information and obtain more effective 2D features, we construct a novel 3D pose estimation network that exploits the correlation of the same joint across multiple views and the correlation between neighboring joints within a single view.
Method We propose a 3D human pose estimation method based on multi-view controllable fusion and joint correlation (CFJCNet), which consists of three parts: a controllable multi-view fusion optimization module, a 2D pose refinement module, and a structural triangulation module. First, a set of RGB images captured from multiple views is fed into a 2D detector to obtain 2D heatmaps, and adaptive weights for each heatmap are learned by a weight-learning network with appearance and geometric branches. On this basis, we construct a multi-view controlled fusion optimization module based on an epipolar-geometry framework, which analyzes the estimation quality of the joints in each camera view to guide the fusion process. Specifically, it selectively applies the principles of epipolar geometry to fuse all views according to the learned weights, ensuring that low-quality estimates benefit from auxiliary views while avoiding the introduction of noise into high-quality heatmaps. Subsequently, a 2D pose refinement module composed of attention mechanisms and graph convolution is applied. The attention mechanism enables the model to capture global context through weight assignment, while the graph convolutional network (GCN) exploits local information by aggregating the features of neighboring nodes and encodes the topological structure of the human skeleton. By combining attention with the GCN, the network not only learns human body information better but also models the interdependence between joints within a single view to refine the 2D pose estimates. Finally, structural triangulation introduces structural constraints of the human body and skeleton-length priors into the 2D-to-3D inference to improve the accuracy of 3D pose estimation. We adopt the pre-trained SimpleBaseline backbone as the 2D detector to extract 2D heatmaps, use a threshold ε = 0.99 to determine joint estimation quality, and set the number of layers for 2D pose refinement to N = 3.
Result We compare the performance of CFJCNet with that of state-of-the-art models on two public datasets, Human3.6M and Total Capture, and a synthetic dataset, Occlusion-Person. The mean per joint position error (MPJPE), which measures the Euclidean distance between the estimated 3D joint positions and the ground truth, is used as the evaluation metric; it reflects the quality of the estimated 3D human poses and provides an intuitive comparison of different methods. On the Human3.6M dataset, the proposed method reduces the error by a further 2.4 mm compared with the baseline AdaFuse. Moreover, because our network introduces rich prior knowledge and effectively models the connectivity of human joints, CFJCNet achieves at least a 10% improvement over most methods that do not use the skinned multi-person linear (SMPL) model. Compared with learnable human mesh triangulation (LMT), which incorporates the SMPL model and volumetric triangulation, our method still achieves a 0.5 mm error reduction. On the Total Capture dataset, our method improves on the strong baseline AdaFuse by 2.6%. On the Occlusion-Person dataset, CFJCNet achieves the best estimates for the vast majority of joints, improving performance by 19%. Furthermore, we compare the visualized 3D pose estimates of our method and the baseline AdaFuse on the Human3.6M and Total Capture datasets to give a more intuitive view of estimation performance. The qualitative results on both datasets demonstrate that CFJCNet can use the prior constraints on skeleton length to correct unreasonable, erroneous poses.
Conclusion We propose CFJCNet, a multi-view 3D human pose estimation method that establishes the consistency of human joints across multiple views as well as intrinsic topological constraints on the human skeleton within each view, and it achieves excellent 3D human pose estimation performance. Experimental results on the public datasets show that CFJCNet has significant advantages over other advanced methods on the evaluation metric, demonstrating its superiority and generalization ability.
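For reference, the two-stage procedure described above lifts per-view 2D joint estimates to 3D by triangulation. Below is a minimal sketch of a conventional confidence-weighted linear (DLT) triangulation of a single joint from calibrated views; it illustrates only the generic lifting step, not the paper's structural triangulation with bone-length priors, and the function name, interface, and NumPy formulation are assumptions.

```python
import numpy as np

def triangulate_joint(proj_mats, points_2d, weights=None):
    """Confidence-weighted linear (DLT) triangulation of one joint.

    proj_mats : (V, 3, 4) camera projection matrices, one per view.
    points_2d : (V, 2) detected 2D joint positions (pixels), one per view.
    weights   : (V,) optional per-view confidences, e.g. heatmap peak values.
    Returns the triangulated 3D joint position as a length-3 array.
    """
    if weights is None:
        weights = np.ones(len(proj_mats))
    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, weights):
        # Each view contributes two homogeneous equations:
        #   u * (P[2] @ X) - P[0] @ X = 0  and  v * (P[2] @ X) - P[1] @ X = 0
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    # The homogeneous 3D point is the right singular vector of A with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

In a full pipeline this is applied once per joint; the structural triangulation described in the abstract additionally constrains the per-joint solutions with bone-length priors.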
【Key words】 multi-view; 3D human pose estimation; joint point correlation; graph convolutional network (GCN); attention mechanism; triangulation
- 【Source】 中国图象图形学报 (Journal of Image and Graphics), 2025, Issue 01
- 【CLC number】 TP391.41