
ESMformer: Error-aware self-supervised transformer for multi-view 3D human pose estimation.

Authors :
Zhang, Lijun
Zhou, Kangkang
Lu, Feng
Li, Zhenghao
Shao, Xiaohu
Zhou, Xiang-Dong
Shi, Yu
Source :
Pattern Recognition. Feb 2025, Vol. 158, Article 110955.
Publication Year :
2025

Abstract

Multi-view 3D human pose estimation (HPE) currently faces several key challenges. Information from different viewpoints exhibits high variability due to complex environmental factors, making cross-view feature extraction and fusion difficult. Additionally, multi-view 3D labeled pose data is scarce, and the impact of input 2D poses on 3D HPE accuracy has received little attention. To address these issues, we propose an Error-aware Self-supervised transformer framework for Multi-view 3D HPE (ESMformer). First, we introduce a single-view multi-level feature extraction module to enhance pose features in individual viewpoints, which incorporates a novel relative attention mechanism for representative feature extraction at different levels. Subsequently, we develop multi-view intra-level and cross-level fusion modules to exploit spatio-temporal feature dependencies among human joints and progressively fuse pose information from all views and levels. Furthermore, we explore an error-aware self-supervised learning strategy to reduce the model's reliance on 3D pose annotations and mitigate the impact of incorrect 2D poses; this strategy adaptively selects reliable input 2D poses based on 3D pose prediction errors. Experiments on three popular benchmarks show that ESMformer achieves state-of-the-art results while maintaining cost-effective computational complexity. Notably, ESMformer does not rely on any 3D pose annotations or prior human body knowledge, making it highly versatile and adaptable in practical applications. The code and models are available at https://github.com/z0911k/ESMformer.

• We develop a novel transformer-based multi-view 3D HPE framework (ESMformer). It hierarchically integrates single-view multi-level pose feature extraction with progressive feature fusion across viewpoints and levels. By mining meaningful multi-level information from multiple viewpoints, ESMformer effectively enriches pose feature expression.
• We design a simple yet effective relative attention mechanism to model the spatial dependencies among all human joints, significantly enhancing pose feature representations and making the features more robust to changes in viewpoint and environmental conditions.
• We explore an error-aware self-supervised learning strategy that reduces the model's reliance on 3D pose annotations and mitigates the impact of incorrect 2D poses. This strategy uses the prediction errors of 3D poses to guide the selection of reliable 2D poses.
• ESMformer achieves state-of-the-art results on three typical 3D HPE benchmarks. It significantly alleviates depth ambiguity and improves 3D HPE performance while maintaining cost-effective computational complexity, without requiring any 3D pose annotations. [ABSTRACT FROM AUTHOR]
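
The abstract does not spell out the relative attention formulation, so the following is only a minimal sketch of one common interpretation: scaled dot-product attention over per-joint features with a learned relative bias for every ordered joint pair. The class name, the 17-joint default (the Human3.6M convention), and all shapes are illustrative assumptions, not the authors' published code.

```python
# Sketch of a "relative attention" layer over human joints (assumed design:
# standard multi-head attention plus a learned per-joint-pair bias).
import torch
import torch.nn as nn

class RelativeJointAttention(nn.Module):
    def __init__(self, dim: int, num_joints: int = 17, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learned bias added to the attention logits for each joint pair,
        # letting the layer encode skeleton-specific spatial dependencies.
        self.rel_bias = nn.Parameter(torch.zeros(num_heads, num_joints, num_joints))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_joints, dim) per-joint pose features
        b, j, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, j, self.num_heads, -1).transpose(1, 2)  # (b, heads, j, d/heads)
        k = k.view(b, j, self.num_heads, -1).transpose(1, 2)
        v = v.view(b, j, self.num_heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.rel_bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, j, d)
        return self.proj(out)
```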
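Likewise, the abstract says only that reliable input 2D poses are selected from 3D pose prediction errors. Since no 3D annotations are used, the sketch below assumes the error is measured by reprojecting the predicted 3D pose into each view and comparing it with the input 2D pose; `model`, `project`, `cameras`, and the pixel threshold are hypothetical stand-ins for the actual pipeline.

```python
# Sketch of error-aware selection of input 2D poses (assumed error proxy:
# multi-view reprojection consistency, since no 3D labels are available).
import torch

def select_reliable_poses(model, project, poses_2d, cameras, threshold=10.0):
    """poses_2d: (batch, views, joints, 2); returns a boolean keep-mask (batch, views)."""
    with torch.no_grad():
        pred_3d = model(poses_2d)                      # (batch, joints, 3), hypothetical API
        errors = []
        for v in range(poses_2d.shape[1]):
            reproj = project(pred_3d, cameras[v])      # (batch, joints, 2), hypothetical API
            # Mean per-joint pixel error between reprojection and input 2D pose.
            err = (reproj - poses_2d[:, v]).norm(dim=-1).mean(dim=-1)
            errors.append(err)
        errors = torch.stack(errors, dim=1)            # (batch, views)
    # Keep only views whose reprojection error falls below the threshold (pixels);
    # the kept poses would then drive the self-supervised training step.
    return errors < threshold
```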

Details

Language :
English
ISSN :
0031-3203
Volume :
158
Database :
Academic Search Index
Journal :
Pattern Recognition
Publication Type :
Academic Journal
Accession number :
180766516
Full Text :
https://doi.org/10.1016/j.patcog.2024.110955