
TAMDepth: self-supervised monocular depth estimation with transformer and adapter modulation.

Authors :
Li, Shaokang
Lyu, Chengzhi
Xia, Bin
Chen, Ziheng
Zhang, Lei
Source :
Visual Computer. Oct 2024, Vol. 40, Issue 10, p6797-6808. 12p.
Publication Year :
2024

Abstract

Self-supervised monocular depth estimation has shown promising results by training on image sequences instead of hard-to-source ground truth. Most current self-supervised depth estimation frameworks are based on fully convolutional or transformer architectures, with little discussion of hybrid architectures. In this paper, we propose TAMDepth, a novel framework that effectively captures both the local and global features of image sequences by combining convolutional blocks and transformer blocks. TAMDepth adopts multi-scale feature-fusion convolutional modules to capture local details in shallow layers, while transformer blocks build global dependencies in higher layers. Furthermore, to enhance the representation of the architecture, we introduce an adapter modulation that injects spatial priors into the transformer blocks through cross-attention, improving the model's ability to capture the scene. Experiments demonstrate that our model achieves state-of-the-art performance on the KITTI dataset and also shows strong generalization on the Make3D dataset. Source code is available at https://github.com/deansaice/TAMDepth. [ABSTRACT FROM AUTHOR]
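The adapter modulation described above, in which spatial priors from the convolutional branch are injected into transformer blocks through cross-attention, can be sketched roughly as follows. This is a minimal illustrative sketch in PyTorch; the module name SpatialPriorAdapter, the dimensions, and all other design details are assumptions made for illustration, not the authors' actual implementation (see the linked repository for that).

import torch
import torch.nn as nn

class SpatialPriorAdapter(nn.Module):
    """Inject CNN-derived spatial-prior features into transformer tokens
    via cross-attention (illustrative sketch, not the TAMDepth code)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm_tokens = nn.LayerNorm(dim)
        self.norm_prior = nn.LayerNorm(dim)
        # Queries come from the transformer tokens; keys/values come from
        # the flattened CNN feature map carrying local spatial detail.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) transformer features; prior: (B, M, C) spatial prior.
        q = self.norm_tokens(tokens)
        kv = self.norm_prior(prior)
        injected, _ = self.cross_attn(q, kv, kv)
        return tokens + injected  # residual modulation of the tokens

if __name__ == "__main__":
    B, N, M, C = 2, 196, 784, 256
    adapter = SpatialPriorAdapter(C)
    tokens = torch.randn(B, N, C)        # e.g. 14 x 14 transformer tokens
    prior = torch.randn(B, M, C)         # e.g. 28 x 28 CNN prior, flattened
    print(adapter(tokens, prior).shape)  # torch.Size([2, 196, 256])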

Details

Language :
English
ISSN :
0178-2789
Volume :
40
Issue :
10
Database :
Academic Search Index
Journal :
Visual Computer
Publication Type :
Academic Journal
Accession number :
180005944
Full Text :
https://doi.org/10.1007/s00371-024-03332-3