1. Feature‐enhanced representation with transformers for multi‐view stereo
- Author
- Lintao Xiang and Hujun Yin
- Subjects
- computer graphics, computer vision, learning (artificial intelligence), photography, supervised learning, visual perception, Photography, TR1-1050, Computer software, QA76.75-76.765
- Abstract
- Most existing multi‐view stereo (MVS) methods fail to consider global context information in the stage of feature extraction and cost aggregation. As transformers have shown remarkable performance on various vision tasks due to their ability to perceive global contextual information, this paper proposes a transformer‐based feature enhancement network (TF‐MVSNet) to facilitate feature representation learning by combining local features (both 2D and 3D) with long‐range contextual information. To reduce memory consumption of feature matching, the cross‐attention mechanism is leveraged to efficiently construct 3D cost volumes under the epipolar constraint. Additionally, a colour‐guided network is designed to refine depth maps at a coarse stage, hence reducing incorrect depth predictions at a fine stage. Extensive experiments were performed on the DTU dataset and Tanks and Temples (T&T) benchmark and results are reported.
- Published
- 2024