Feature‐enhanced representation with transformers for multi‐view stereo.

Authors :: Xiang, Lintao; Yin, Hujun
Source :: IET Image Processing (Wiley-Blackwell); May 2024, Vol. 18, Issue 6, pp. 1530-1539 (10 pages)
Publication Year :: 2024
Abstract: Most existing multi‐view stereo (MVS) methods do not consider global context information during feature extraction and cost aggregation. Because transformers have shown remarkable performance on various vision tasks owing to their ability to perceive global contextual information, this paper proposes a transformer‐based feature enhancement network (TF‐MVSNet) that facilitates feature representation learning by combining local features (both 2D and 3D) with long‐range contextual information. To reduce the memory consumption of feature matching, a cross‐attention mechanism is used to construct 3D cost volumes efficiently under the epipolar constraint. Additionally, a colour‐guided network is designed to refine depth maps at the coarse stage, thereby reducing incorrect depth predictions at the fine stage. Extensive experiments were performed on the DTU dataset and the Tanks and Temples (T&T) benchmark, and the results are reported.
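
The idea of cross‐attention restricted by the epipolar constraint can be illustrated with a minimal sketch. This is not the TF‐MVSNet implementation, which the abstract does not describe in detail. It assumes that each reference pixel attends only to K source features sampled along its epipolar line (one per depth hypothesis), so the work per pixel grows with K rather than with the number of source pixels. All names (`epipolar_attention`, `expected_depth`) and the toy feature values are hypothetical.

```python
import math

def epipolar_attention(query, keys):
    """Scaled dot-product attention of one reference feature over K source
    features sampled along its epipolar line (one per depth hypothesis).
    Returns K weights that sum to 1."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def expected_depth(weights, depths):
    """Soft depth estimate: attention-weighted mean of the depth hypotheses."""
    return sum(w * d for w, d in zip(weights, depths))

# Hypothetical toy data: one reference feature and K = 3 source samples.
ref_feature = [1.0, 0.0, 1.0, 0.0]
src_features = [
    [0.0, 1.0, 0.0, 1.0],  # poor match
    [1.0, 0.0, 1.0, 0.0],  # identical feature, best match
    [0.5, 0.5, 0.5, 0.5],  # partial match
]
depth_hypotheses = [1.0, 2.0, 3.0]

weights = epipolar_attention(ref_feature, src_features)
depth = expected_depth(weights, depth_hypotheses)
```

In this sketch the attention weights play the role of a per-pixel matching distribution over depth hypotheses; stacking them for every pixel would give one slice of a cost volume. A real network would use learned projections, multiple source views, and batched tensor operations.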