641 results for "Monocular depth estimation"
Search Results
2. LapUNet: a novel approach to monocular depth estimation using dynamic Laplacian residual U-shape networks.
- Author
- Xi, Yanhui, Li, Sai, Xu, Zhikang, Zhou, Feng, and Tian, Juanxiu
- Subjects
- ADDITION (Mathematics), SPATIAL resolution, MONOCULARS, PROBLEM solving, PYRAMIDS
- Abstract
Monocular depth estimation is an important but challenging task. Although performance has been improved by adopting various encoder-decoder architectures, the estimated depth maps lack structural details and clear edges due to simple repeated upsampling. To solve this problem, this paper presents the novel LapUNet (Laplacian U-shape networks), in which the encoder adopts ResNeXt101 and the decoder is constructed with the novel DLRU (dynamic Laplacian residual U-shape) module. The DLRU module, based on the U-shape structure, can supplement high-frequency features by fusing a dynamic Laplacian residual into the upsampling process; the residual is dynamically learnable due to the addition of a convolutional operation. Also, the ASPP (atrous spatial pyramid pooling) module is introduced to capture image context at multiple scales through multiple parallel atrous convolutional layers, and the depth map fusion module is used to combine high- and low-frequency features from depth maps with different spatial resolutions. Experiments demonstrate that the proposed model, with moderate model size, is superior to previous competitors on the KITTI and NYU Depth V2 datasets. Furthermore, 3D reconstruction and target ranging using the estimated depth maps prove the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR] (A minimal sketch of the Laplacian-residual upsampling idea follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
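The DLRU idea above, fusing a learnable Laplacian residual into each upsampling step, can be illustrated with a short PyTorch sketch. This is a minimal reading of the abstract, not the authors' code; the module and parameter names are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLaplacianResidualUp(nn.Module):
    """Illustrative sketch: upsampling fused with a learnable Laplacian residual."""
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv makes the Laplacian residual "dynamic" (learnable), per the abstract.
        self.residual_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, coarse: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # Upsample the coarse decoder feature to the skip connection's resolution.
        up = F.interpolate(coarse, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        # Laplacian residual of the skip feature: high frequencies = signal minus low-pass.
        low = F.avg_pool2d(skip, kernel_size=3, stride=1, padding=1)
        laplacian = skip - low
        # Fuse the learnable high-frequency residual into the upsampled feature.
        return up + self.residual_conv(laplacian)

x_coarse = torch.randn(1, 64, 16, 16)
x_skip = torch.randn(1, 64, 32, 32)
print(DynamicLaplacianResidualUp(64)(x_coarse, x_skip).shape)  # torch.Size([1, 64, 32, 32])
```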
3. TAMDepth: self-supervised monocular depth estimation with transformer and adapter modulation.
- Author
- Li, Shaokang, Lyu, Chengzhi, Xia, Bin, Chen, Ziheng, and Zhang, Lei
- Subjects
- TRANSFORMER models, SOURCE code, MONOCULARS, GENERALIZATION
- Abstract
Self-supervised monocular depth estimation has shown promising results, as it trains on image sequences instead of challenging-to-source ground truth. Most current studies on self-supervised depth estimation build on fully convolutional or transformer architectures, and there is little discussion of hybrid architectures. In this paper, we propose TAMDepth, a novel framework that effectively captures the local and global features of image sequences by combining convolutional blocks and transformer blocks. TAMDepth adopts multi-scale feature-fusion convolutional modules to capture local details in shallow layers, while transformer blocks build global dependencies in higher layers. Furthermore, to enhance the representation of the architecture, we introduce an adapter modulation that injects spatial priors into the transformer blocks through cross-attention, improving the model's ability to represent the scene. Experiments demonstrate that our model exhibits state-of-the-art performance on the KITTI dataset and also shows strong generalization on the Make3D dataset. Source code is available at https://github.com/deansaice/TAMDepth. [ABSTRACT FROM AUTHOR] (An illustrative sketch of such a cross-attention adapter follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
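The adapter modulation described above, injecting convolutional spatial priors into transformer blocks via cross-attention, might look roughly like the following PyTorch sketch; all names and shapes are illustrative assumptions, not the released TAMDepth code.

```python
import torch
import torch.nn as nn

class SpatialPriorAdapter(nn.Module):
    """Sketch of an adapter that injects CNN spatial priors into transformer
    tokens through cross-attention (hypothetical reconstruction)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, conv_feat: torch.Tensor) -> torch.Tensor:
        # Flatten the CNN feature map (B, C, H, W) into a key/value sequence.
        kv = conv_feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        # Transformer tokens query the spatial prior; the residual keeps the originals.
        out, _ = self.attn(self.norm(tokens), kv, kv)
        return tokens + out

tokens = torch.randn(2, 196, 64)        # e.g. 14x14 patch tokens
conv_feat = torch.randn(2, 64, 28, 28)  # shallow CNN features
print(SpatialPriorAdapter(64)(tokens, conv_feat).shape)  # torch.Size([2, 196, 64])
```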
4. Lightweight monocular depth estimation using a fusion-improved transformer.
- Author
- Sui, Xin, Gao, Song, Xu, Aigong, Zhang, Cong, Wang, Changqiang, and Shi, Zhengxu
- Subjects
- CONVOLUTIONAL neural networks, TRANSFORMER models, FEATURE extraction, PARAMETER estimation, MONOCULARS
- Abstract
Existing depth estimation networks often overlook computational efficiency while pursuing high accuracy. This paper proposes a lightweight self-supervised network that combines convolutional neural networks (CNNs) and Transformers as the feature extraction and encoding layers, enabling the network to capture both local geometric and global semantic features for depth estimation. First, depthwise-separable convolution is used to construct a dilated convolution residual module on a shallow network, enlarging the receptive field of shallow CNN feature extraction. In the transformer, a multi-head transposed attention module built from depthwise-separable convolutions is proposed to reduce the computational burden of spatial self-attention. In the feedforward network, a two-step gating mechanism is proposed to improve its nonlinear representation ability. Finally, the CNN and transformer are integrated to implement a depth estimation network with local-global context interaction. Compared with other lightweight models, this model has fewer parameters and higher estimation accuracy, and it generalizes better across outdoor datasets. Additionally, the inference speed can reach 87 FPS, achieving good real-time performance while balancing inference speed and estimation accuracy. [ABSTRACT FROM AUTHOR] (A sketch of a dilated depthwise-separable residual block follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
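A dilated, depthwise-separable convolution residual block of the kind the abstract describes for enlarging the shallow-stage receptive field could be sketched as follows; this is a hypothetical reconstruction, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DilatedDWResidualBlock(nn.Module):
    """Sketch of a dilated, depthwise-separable convolution residual block
    intended to enlarge the receptive field of a shallow CNN stage."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=channels, bias=False)  # per-channel spatial mixing
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)  # channel mixing
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps low-level detail while the dilated
        # branch adds context from a wider neighborhood.
        return self.act(x + self.norm(self.pointwise(self.depthwise(x))))

print(DilatedDWResidualBlock(32)(torch.randn(1, 32, 64, 64)).shape)
```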
5. EDFIDepth: enriched multi-path vision transformer feature interaction networks for monocular depth estimation.
- Author
- Xia, Chenxing, Zhang, Mengge, Gao, Xiuju, Ge, Bin, Li, Kuan-Ching, Fang, Xianjin, Zhang, Yan, and Liang, Xingzhu
- Subjects
- TRANSFORMER models, FEATURE extraction, MONOCULARS, COST
- Abstract
Monocular depth estimation (MDE) aims to predict pixel-level dense depth maps from a single RGB image. Some recent approaches mainly rely on encoder–decoder architectures to capture and process multi-scale features. However, they usually exploit heavier networks at the expense of computational cost to obtain high-quality depth maps. In this paper, we propose a novel enriched multi-path vision transformer feature interaction network with an encoder–decoder architecture, denoted as EDFIDepth, which seeks a balance between computational cost and performance rather than pursuing the highest accuracy or an extremely lightweight model. Specifically, an encoder called MPViT-D, incorporating a multi-path vision transformer and a deep convolution module, is introduced to extract diverse features with both fine and coarse details at the same feature level with fewer parameters. Subsequently, we propose a lightweight decoder comprising two effective modules to establish multi-scale feature interaction: an encoder–decoder cross-feature matching (ED-CFM) module and a channel-level feature fusion (CLFF) module. The ED-CFM module establishes connections between encoder and decoder features through a dual-path structure, where a cross-attention mechanism is deployed to enhance the relevance of multi-scale complementary depth information. Meanwhile, the CLFF module utilizes a channel attention mechanism to further fuse crucial depth information within the channels, thereby improving the accuracy of depth estimation. Extensive experiments on the indoor dataset NYUv2 and the outdoor dataset KITTI demonstrate that our method achieves results comparable to the state of the art (SOTA) while significantly reducing the number of trainable parameters. Our codes and approach are available at https://github.com/Zhangmg123/EDFIDEpth. [ABSTRACT FROM AUTHOR] (A generic channel-attention fusion sketch follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
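The CLFF module's channel-attention fusion can be approximated with a squeeze-and-excitation-style sketch; this is a generic stand-in under our own assumptions rather than the EDFIDepth code.

```python
import torch
import torch.nn as nn

class ChannelLevelFusion(nn.Module):
    """Sketch of channel-attention fusion in the spirit of the CLFF module:
    concatenate two feature maps, reweight channels, and project back."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                            # global channel statistics
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),                                       # per-channel gates in [0, 1]
        )
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([a, b], dim=1)
        return self.project(fused * self.attn(fused))

a, b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(ChannelLevelFusion(64)(a, b).shape)  # torch.Size([1, 64, 32, 32])
```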
6. Dyna-MSDepth: multi-scale self-supervised monocular depth estimation network for visual SLAM in dynamic scenes.
- Author
- Yao, Jianjun, Li, Yingzhao, and Li, Jiajia
- Abstract
Monocular Simultaneous Localization And Mapping (SLAM) suffers from scale drift, leading to tracking failure due to scale ambiguity. Deep learning has significantly advanced self-supervised monocular depth estimation, enabling scale drift reduction. Nonetheless, current self-supervised learning approaches fail to provide scale-consistent depth maps, estimate depth in dynamic environments, or perceive multi-scale information. In response to these limitations, this paper proposes Dyna-MSDepth, a novel method for estimating multi-scale, stable, and reliable depth maps in dynamic environments. Dyna-MSDepth incorporates multi-scale high-order spatial semantic interaction into self-supervised training. This integration enhances the model’s capacity to discern intricate texture nuances and distant depth cues. Dyna-MSDepth is evaluated on challenging dynamic datasets, including KITTI, TUM, BONN, and DDAD, employing rigorous qualitative evaluations and quantitative experiments. Furthermore, the accuracy of the depth maps estimated by Dyna-MSDepth is assessed in monocular SLAM. Extensive experiments confirm the superior multi-scale depth estimation capabilities of Dyna-MSDepth, highlighting its significant value in dynamic environments. Code is available at . [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
7. End-to-end learning for joint depth and image reconstruction from diffracted rotation.
- Author
- Mel, Mazen, Siddiqui, Muhammad, and Zanuttigh, Pietro
- Subjects
- IMAGE reconstruction, DEEP learning, MONOCULARS, ROTATIONAL motion, PRICES, OPTICAL apertures
- Abstract
Monocular depth estimation is an open challenge due to the ill-posed nature of the problem. Deep learning techniques have proved capable of producing acceptable depth estimation accuracy, but the lack of robust depth cues within RGB images severely limits their performance. Coded aperture-based methods using phase and amplitude masks encode strong depth cues within 2D images by means of depth-dependent Point Spread Functions (PSFs), at the price of reduced image quality. In this paper, we propose a novel end-to-end learning approach for depth from diffracted rotation. A phase mask that produces a Rotating Point Spread Function (RPSF) as a function of defocus is jointly optimized with the weights of a depth estimation neural network. To this aim, we introduce a differentiable physical model of the aperture mask and exploit an accurate simulation of the camera imaging pipeline. Our approach requires a significantly less complex model and less training data, yet it outperforms existing methods for monocular depth estimation on indoor benchmarks. In addition, we address the image degradation problem by incorporating a non-blind and nonuniform image deblurring module to recover the sharp all-in-focus image from its blurred counterpart. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
8. AI-Powered Obstacle Detection for Safer Human-Machine Collaboration.
- Author
- Krupáš, Maros, Kot, Mykyta, Kajáti, Erik, and Zolotová, Iveta
- Subjects
- ARTIFICIAL intelligence, HUMAN-machine systems, MOBILE robots, MOBILE apps, MONOCULARS
- Abstract
This article deals with ensuring and increasing the safety of mobile robotic systems in human-machine collaboration. The goal of the research was to design and implement an artificial intelligence application that recognizes obstacles, including humans, and increases safety. The resulting mobile Android application uses a MiDaS model to generate a depth map of the environment from the drone's camera and approximate the distance to obstacles, helping the drone avoid collisions. This work also explored the DJI Mobile SDK and neural-network optimizations for use on smartphones. [ABSTRACT FROM AUTHOR] (A sketch of MiDaS-based depth inference follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
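The MiDaS model referenced above is publicly distributed through torch.hub, so the core of such an obstacle-detection pipeline can be sketched as below. The loading calls mirror the documented MiDaS usage; the file name frame.jpg and the percentile-based proximity alert are our illustrative assumptions.

```python
import cv2
import torch

# Load the small MiDaS model and its matching input transform from torch.hub
# (network access is required on first run).
model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
model.eval()

img = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)  # a camera frame
with torch.no_grad():
    prediction = model(transform(img))
    # Resize the prediction back to the original image resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()

# MiDaS outputs *relative* inverse depth: larger values mean closer surfaces,
# so a simple proximity alert can threshold the map's upper percentile.
near = depth > depth.quantile(0.95)
print(f"potential obstacle pixels: {int(near.sum())}")
```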
9. A review of deep learning-based self-supervised monocular depth estimation for dynamic scenes.
- Author
- Cheng, Binbin, Yu, Ying, Zhang, Lei, Wang, Ziquan, and Jiang, Zhipeng
- Subjects
- STANDARD deviations, DEEP learning, MONOCULARS, ONLINE education, OPTICAL flow, RESEARCH personnel, AUTONOMOUS vehicles
- Abstract
Copyright of Journal of Remote Sensing is the property of Editorial Office of Journal of Remote Sensing & Science Publishing Co. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
10. LapUNet: a novel approach to monocular depth estimation using dynamic Laplacian residual U-shape networks
- Author
- Yanhui Xi, Sai Li, Zhikang Xu, Feng Zhou, and Juanxiu Tian
- Subjects
- Monocular depth estimation, Laplacian pyramid, Dynamic Laplacian residual, ASPP, LapUNet, Medicine, Science
- Abstract
Monocular depth estimation is an important but challenging task. Although performance has been improved by adopting various encoder-decoder architectures, the estimated depth maps lack structural details and clear edges due to simple repeated upsampling. To solve this problem, this paper presents the novel LapUNet (Laplacian U-shape networks), in which the encoder adopts ResNeXt101 and the decoder is constructed with the novel DLRU (dynamic Laplacian residual U-shape) module. The DLRU module, based on the U-shape structure, can supplement high-frequency features by fusing a dynamic Laplacian residual into the upsampling process; the residual is dynamically learnable due to the addition of a convolutional operation. Also, the ASPP (atrous spatial pyramid pooling) module is introduced to capture image context at multiple scales through multiple parallel atrous convolutional layers, and the depth map fusion module is used to combine high- and low-frequency features from depth maps with different spatial resolutions. Experiments demonstrate that the proposed model, with moderate model size, is superior to previous competitors on the KITTI and NYU Depth V2 datasets. Furthermore, 3D reconstruction and target ranging using the estimated depth maps prove the effectiveness of the proposed method.
- Published
- 2024
- Full Text
- View/download PDF
11. Lightweight monocular depth estimation using a fusion-improved transformer
- Author
- Xin Sui, Song Gao, Aigong Xu, Cong Zhang, Changqiang Wang, and Zhengxu Shi
- Subjects
- Self-supervision, Monocular depth estimation, Lightweight, CNN, Transformer, Medicine, Science
- Abstract
Existing depth estimation networks often overlook computational efficiency while pursuing high accuracy. This paper proposes a lightweight self-supervised network that combines convolutional neural networks (CNNs) and Transformers as the feature extraction and encoding layers, enabling the network to capture both local geometric and global semantic features for depth estimation. First, depthwise-separable convolution is used to construct a dilated convolution residual module on a shallow network, enlarging the receptive field of shallow CNN feature extraction. In the transformer, a multi-head transposed attention module built from depthwise-separable convolutions is proposed to reduce the computational burden of spatial self-attention. In the feedforward network, a two-step gating mechanism is proposed to improve its nonlinear representation ability. Finally, the CNN and transformer are integrated to implement a depth estimation network with local-global context interaction. Compared with other lightweight models, this model has fewer parameters and higher estimation accuracy, and it generalizes better across outdoor datasets. Additionally, the inference speed can reach 87 FPS, achieving good real-time performance while balancing inference speed and estimation accuracy.
- Published
- 2024
- Full Text
- View/download PDF
12. AI-Powered Obstacle Detection for Safer Human-Machine Collaboration
- Author
- Krupáš Maros, Kot Mykyta, Kajáti Erik, and Zolotová Iveta
- Subjects
- human-machine collaboration, safety, monocular depth estimation, obstacle detection, mobile robots, midas, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
This article deals with ensuring and increasing the safety of mobile robotic systems in human-machine collaboration. The goal of the research was to design and implement an artificial intelligence application that recognizes obstacles, including humans, and increases safety. The resulting mobile Android application uses a MiDaS model to generate a depth map of the environment from the drone's camera and approximate the distance to obstacles, helping the drone avoid collisions. This work also explored the DJI Mobile SDK and neural-network optimizations for use on smartphones.
- Published
- 2024
- Full Text
- View/download PDF
13. UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-modal Learning.
- Author
- Zhu, Xue-Feng, Xu, Tianyang, Liu, Zongtao, Tang, Zhangyong, Wu, Xiao-Jun, and Kittler, Josef
- Subjects
- COMPUTER vision, OBJECT tracking (Computer vision), DEEP learning, MONOCULARS
- Abstract
The emergence of large-scale high-quality datasets has stimulated the rapid development of deep learning in recent years. However, most computer vision tasks focus on the visual modality only, resulting in a huge imbalance in the number of annotated data for other modalities. While several multi-modal datasets have been made available, the majority of them are confined to only two modalities, serving a single specific computer vision task. To redress the data deficiency for multi-modal learning and applications, a new dataset named UniMod1K is presented in this work. UniMod1K involves three data modalities: vision, depth, and language. For the vision and depth modalities, the UniMod1K dataset contains 1050 RGB-D sequences, comprising some 2.5 million frames. Regarding the language modality, the proposed dataset includes 1050 sentences describing the target object in each video. To demonstrate the advantages of training on a larger multi-modal dataset, such as UniMod1K, and to stimulate research enabled by the dataset, we address several multi-modal tasks, namely multi-modal object tracking and monocular depth estimation. To establish a performance baseline, we propose novel baseline methods for RGB-D object tracking, vision-language tracking and vision-depth-language tracking. Additionally, we conduct comprehensive experiments for each of these tasks. The results highlight the potential of the UniMod1K dataset to improve the performance of multi-modal approaches. The dataset and codes can be accessed at https://github.com/xuefeng-zhu5/UniMod1K. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. Self-Supervised Monocular Depth Estimation via Binocular Geometric Correlation Learning.
- Author
- Peng, Bo, Sun, Lin, Lei, Jianjun, Liu, Bingzheng, Shen, Haifeng, Li, Wanqing, and Huang, Qingming
- Subjects
- MONOCULARS, ANNOTATIONS, FORECASTING, SUPERVISION
- Abstract
Monocular depth estimation aims to infer a depth map from a single image. Although supervised learning-based methods have achieved remarkable performance, they generally rely on a large amount of labor-intensive annotated data. Self-supervised methods, on the other hand, do not require any ground-truth depth annotation and have recently attracted increasing attention. In this work, we propose a self-supervised monocular depth estimation network via binocular geometric correlation learning. Specifically, considering the inter-view geometric correlation, a binocular cue prediction module is presented to generate the auxiliary vision cue for the self-supervised learning of monocular depth estimation. Then, to deal with occlusion in depth estimation, an occlusion interference attenuated constraint is developed to guide the supervision of the network by inferring the occlusion region and producing paired occlusion masks. Experimental results on two popular benchmark datasets demonstrate that the proposed network obtains competitive results compared to state-of-the-art self-supervised methods and achieves comparable results to some popular supervised methods. [ABSTRACT FROM AUTHOR] (A generic sketch of the photometric objective underlying such self-supervision follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
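Self-supervised methods of this family typically train on a photometric reprojection error between the target view and a view synthesized from the other camera using the predicted depth. The following is a generic SSIM+L1 sketch of that objective, common in such pipelines, not this paper's exact loss.

```python
import torch
import torch.nn.functional as F

def ssim(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Simplified single-scale SSIM over 3x3 windows; returns a per-pixel
    dissimilarity map in [0, 1]."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    s = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2))
    return torch.clamp((1 - s) / 2, 0, 1)

def photometric_loss(target: torch.Tensor, warped: torch.Tensor, alpha: float = 0.85):
    """Weighted SSIM + L1 error between the target view and the view
    synthesized (warped) from the other image using the predicted depth."""
    l1 = (target - warped).abs()
    return (alpha * ssim(target, warped) + (1 - alpha) * l1).mean()

t, w = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(photometric_loss(t, w).item())
```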
15. Dual-attention-based semantic-aware self-supervised monocular depth estimation.
- Author
- Xu, Jinze, Ye, Feng, and Lai, Yizong
- Subjects
- DATA augmentation, DATA recovery, MONOCULARS, NOISE, ANNOTATIONS
- Abstract
Based on the assumption of photometric consistency, self-supervised monocular depth estimation has been widely studied due to the advantage of avoiding costly annotations. However, it is sensitive to noise, occlusion and photometric changes. To overcome these problems, we propose a multi-task model with a dual-attention-based cross-task feature fusion module (DCFFM). We simultaneously predict depth and semantics with a shared encoder and two separate decoders, aiming to improve depth estimation with the enhancement of semantic supervision information. In DCFFM, we fuse the cross-task features with both pixel-wise and channel-wise attention, which fully exploits the helpful information from the other task. We compute both attentions in a one-to-all manner to capture global information while limiting the rapid growth of computation. Furthermore, we propose a novel data augmentation method called data exchange & recovery (DE&R), which performs inter-batch data exchange in both the vertical and horizontal directions so as to increase the diversity of the input data. It encourages the network to explore more diversified cues for depth estimation and avoids overfitting. The corresponding outputs are then recovered in order to keep the geometric relationship and ensure correct calculation of the photometric loss. Extensive experiments on the KITTI dataset and the NYU-Depth-v2 dataset demonstrate that our method is very effective and achieves better performance than other state-of-the-art works. [ABSTRACT FROM AUTHOR] (An illustrative exchange-and-recover sketch follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
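The exchange-and-recover idea behind DE&R can be illustrated with a small sketch that swaps image halves between neighboring batch items and then inverts the swap on the network outputs; the neighbor-pairing scheme here is our own simplification of the abstract's description.

```python
import torch

def data_exchange(batch: torch.Tensor, vertical: bool = True) -> torch.Tensor:
    """Swap the top/bottom (or left/right) halves between consecutive images
    in a batch. Applying the inverse bookkeeping to the network outputs undoes
    the exchange, so the photometric loss still sees consistent geometry."""
    x = batch.clone()
    h, w = x.shape[-2:]
    rolled = torch.roll(batch, shifts=1, dims=0)       # pair each image with a neighbor
    if vertical:
        x[..., : h // 2, :] = rolled[..., : h // 2, :]  # take the neighbor's top half
    else:
        x[..., :, : w // 2] = rolled[..., :, : w // 2]  # take the neighbor's left half
    return x

def data_recover(outputs: torch.Tensor, vertical: bool = True) -> torch.Tensor:
    """Put the exchanged halves back with the inverse roll."""
    y = outputs.clone()
    h, w = y.shape[-2:]
    unrolled = torch.roll(outputs, shifts=-1, dims=0)
    if vertical:
        y[..., : h // 2, :] = unrolled[..., : h // 2, :]
    else:
        y[..., :, : w // 2] = unrolled[..., :, : w // 2]
    return y

imgs = torch.rand(4, 3, 32, 32)
assert torch.equal(data_recover(data_exchange(imgs)), imgs)  # the exchange is invertible
```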
16. Monocular Absolute Depth Estimation from Motion for Small Unmanned Aerial Vehicles by Geometry-Based Scale Recovery.
- Author
- Zhang, Chuanqi, Weng, Xiangrui, Cao, Yunfeng, and Ding, Meng
- Subjects
- DRONE aircraft, MONOCULARS, REMOTELY piloted vehicles, ORTHOGONAL matching pursuit, IMAGE sensors, GEOMETRIC modeling, ABSOLUTE value
- Abstract
In recent years, there has been extensive research and application of unsupervised monocular depth estimation methods for intelligent vehicles. However, a major limitation of most existing approaches is their inability to predict absolute depth values in physical units, as they generally suffer from the scale problem. Furthermore, most research efforts have focused on ground vehicles, neglecting the potential application of these methods to unmanned aerial vehicles (UAVs). To address these gaps, this paper proposes a novel absolute depth estimation method specifically designed for flight scenes using a monocular vision sensor, in which a geometry-based scale recovery algorithm serves as a post-processing stage for scale-consistent relative depth estimation results. By exploiting the feature correspondence between successive images and using the pose data provided by equipped navigation sensors, the scale factor between relative and absolute scales is calculated according to a multi-view geometry model, and absolute depth maps are then generated by pixel-wise multiplication of relative depth maps with the scale factor. As a result, unsupervised monocular depth estimation technology is extended from relative depth estimation in semi-structured scenes to absolute depth estimation in unstructured scenes. Experiments on the publicly available Mid-Air dataset and customized data demonstrate the effectiveness of our method in different cases and settings, as well as its robustness to navigation sensor noise. The proposed method only requires UAVs to be equipped with a monocular camera and common navigation sensors, and the obtained absolute depth information can be directly used for downstream tasks, which is significant for a class of vehicle rarely explored in previous depth estimation studies. [ABSTRACT FROM AUTHOR] (A toy scale-factor computation follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
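One simple way to realize the geometry-based scale recovery described above is to compare the up-to-scale camera translation estimated from feature correspondences with the metric translation reported by the navigation sensors; the ratio then converts relative depth to absolute depth. The sketch below uses toy numbers and omits the paper's full multi-view geometry model.

```python
import numpy as np

def recover_scale(t_visual: np.ndarray, t_nav: np.ndarray) -> float:
    """The up-to-scale translation from image correspondences (t_visual) and
    the metric translation from navigation sensors (t_nav) differ by the same
    scale factor that separates relative from absolute depth."""
    return float(np.linalg.norm(t_nav) / np.linalg.norm(t_visual))

# Toy numbers (hypothetical): visual odometry says the camera moved ~0.31
# units, the GNSS/IMU says it actually moved ~1.55 m between the frames.
t_visual = np.array([0.30, 0.05, 0.06])
t_nav = np.array([1.50, 0.25, 0.30])
scale = recover_scale(t_visual, t_nav)

relative_depth = np.random.rand(240, 320) + 0.5  # scale-consistent relative depth map
absolute_depth = scale * relative_depth          # pixel-wise multiplication, in meters
print(f"scale factor: {scale:.2f}")
```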
17. Synthetic Data Enhancement and Network Compression Technology of Monocular Depth Estimation for Real-Time Autonomous Driving System.
- Author
- Jun, Woomin, Yoo, Jisang, and Lee, Sungjin
- Subjects
- AUTONOMOUS vehicles, MONOCULARS, DATA augmentation, IMAGE recognition (Computer vision), TRAFFIC safety, COST estimates, SYNTHETIC apertures, DIGITAL cameras
- Abstract
Accurate 3D image recognition, critical for autonomous driving safety, is shifting from LIDAR-based point clouds to camera-based depth estimation technologies, driven by cost considerations and the point cloud's limitations in detecting distant small objects. This research aims to enhance MDE (Monocular Depth Estimation) using a single camera, offering extreme cost-effectiveness in acquiring 3D environmental data. In particular, this paper focuses on novel data augmentation methods designed to enhance the accuracy of MDE. Our research addresses the challenge of limited MDE data quantities by proposing the use of synthetic-based augmentation techniques: Mask, Mask-Scale, and CutFlip. The implementation of these synthetic-based data augmentation strategies has demonstrably enhanced the accuracy of MDE models by 4.0% compared to the original dataset. Furthermore, this study introduces the RMS (Real-time Monocular Depth Estimation configuration considering Resolution, Efficiency, and Latency) algorithm, designed to optimize neural networks and augment the performance of contemporary monocular depth estimation technologies through a three-step process. Initially, it selects a model based on minimum latency and REL criteria, followed by refining the model's accuracy using various data augmentation techniques and loss functions. Finally, the refined model is compressed using quantization and pruning techniques to minimize its size for efficient on-device real-time applications. Experimental results from implementing the RMS algorithm indicated that, within the required latency and size constraints, the IEBins model exhibited the most accurate performance, achieving a REL (absolute RELative error) of 0.0480. Furthermore, the data augmentation combination of the original dataset with Flip, Mask, and CutFlip, alongside the SigLoss loss function, displayed the best REL performance, with a score of 0.0461. The network compression technique using FP16 was analyzed as the most effective, reducing the model size by 83.4% compared to the original while having the least impact on REL performance and latency. Finally, the performance of the RMS algorithm was validated on the on-device autonomous driving platform, NVIDIA Jetson AGX Orin, through which optimal deployment strategies were derived for various applications and scenarios requiring autonomous driving technologies. [ABSTRACT FROM AUTHOR] (A minimal FP16-cast sketch follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
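The FP16 compression step can be reproduced in outline with a plain half-precision cast, which halves parameter storage; the paper's larger 83.4% reduction presumably reflects its full pipeline rather than the cast alone. The tiny network below is a stand-in, not IEBins.

```python
import torch
import torch.nn as nn

def model_size_mb(model: nn.Module) -> float:
    """Parameter storage in megabytes."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

# Stand-in network; the study applied this to depth models such as IEBins.
model = nn.Sequential(
    nn.Conv2d(3, 256, 3, padding=1), nn.ReLU(), nn.Conv2d(256, 256, 3, padding=1)
)
before = model_size_mb(model)

model_fp16 = model.half()  # cast weights from FP32 to FP16 in place
after = model_size_mb(model_fp16)
print(f"{before:.2f} MB -> {after:.2f} MB ({100 * (1 - after / before):.0f}% smaller)")

# Inference must feed half-precision inputs to a half-precision model.
with torch.no_grad():
    out = model_fp16(torch.randn(1, 3, 64, 64).half())
```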
18. Monocular Depth Estimation via Self-Supervised Self-Distillation.
- Author
- Hu, Haifeng, Feng, Yuyang, Li, Dapeng, Zhang, Suofei, and Zhao, Haitao
- Subjects
- MONOCULARS, FILTERS & filtration, FEATURE extraction, DEEP learning
- Abstract
Self-supervised monocular depth estimation can exhibit excellent performance in static environments due to the multi-view consistency assumption during the training process. However, it is hard to maintain depth consistency in dynamic scenes when considering the occlusion problem caused by moving objects. For this reason, we propose a method of self-supervised self-distillation for monocular depth estimation (SS-MDE) in dynamic scenes, where a deep network with a multi-scale decoder and a lightweight pose network are designed to predict depth in a self-supervised manner via the disparity, motion information, and the association between two adjacent frames in the image sequence. Meanwhile, in order to improve the depth estimation accuracy of static areas, the pseudo-depth images generated by the LeReS network are used to provide pseudo-supervision information, enhancing the effect of depth refinement in static areas. Furthermore, a forgetting factor is leveraged to alleviate the dependency on the pseudo-supervision. In addition, a teacher model is introduced to generate depth prior information, and a multi-view mask filter module is designed to implement feature extraction and noise filtering. This enables the student model to better learn the deep structure of dynamic scenes, enhancing the generalization and robustness of the entire model in a self-distillation manner. Finally, on four public datasets, the proposed SS-MDE method outperformed several state-of-the-art monocular depth estimation techniques, achieving an accuracy (δ1) of 89% with an error (AbsRel) of 0.102 on NYU-Depth V2, and an accuracy (δ1) of 87% with an AbsRel of 0.111 on KITTI. [ABSTRACT FROM AUTHOR] (A sketch of a forgetting-factor schedule follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
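The forgetting factor that weans the network off pseudo-depth supervision can be sketched, under our own assumption of exponential decay, as a schedule on the distillation term:

```python
import torch

def distillation_weight(epoch: int, gamma: float = 0.9) -> float:
    """A forgetting factor in its simplest form: exponentially decay the
    weight of the pseudo-depth supervision so the student relies on it early
    and gradually shifts to the self-supervised signal."""
    return gamma ** epoch

def total_loss(self_sup: torch.Tensor, pseudo_sup: torch.Tensor, epoch: int) -> torch.Tensor:
    return self_sup + distillation_weight(epoch) * pseudo_sup

for epoch in [0, 5, 20]:
    print(f"epoch {epoch:2d}: pseudo-supervision weight = {distillation_weight(epoch):.3f}")
```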
19. Chfnet: a coarse-to-fine hierarchical refinement model for monocular depth estimation.
- Author
- Chen, Han and Wang, Yongxiong
- Abstract
In recent years, many researchers have exploited multiple depth estimation architectures to produce high-quality depth maps from a single image. For monocular depth estimation, abundant multiscale features can significantly improve the prediction accuracy. Furthermore, multilevel refinement of the depth map through the model can effectively enhance the overall quality of the depth map. Therefore, we propose an efficient and effective module called light densely connected atrous spatial pyramid (LightDASP), which is employed to extract multiscale information at denser and larger scales from different levels of encoded features without significantly increasing the model size. Next, we propose a hierarchical reconstruction strategy that generates more accurate depth maps by refining the depth maps generated in the previous stage after each decoding stage. Additionally, to provide spatial location information to the decoder, the edge map is incorporated into the generation of a more rational refinement map. The experimental results, conducted on benchmark datasets in both indoor and outdoor scenes, demonstrate that our approach achieves efficient and competitive performance compared to existing methods for monocular depth estimation. We strike a balance between performance and efficiency, resulting in a model with greater potential for practical application. The code will be made available upon article acceptance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
20. Monocular Depth Estimation Based on Dilated Convolutions and Feature Fusion.
- Author
- Li, Hang, Liu, Shuai, Wang, Bin, and Wu, Yuanhao
- Subjects
- CONVOLUTIONAL neural networks, MONOCULARS, OPTICAL radar, LIDAR, OPTIMIZATION algorithms, IMAGE registration
- Abstract
Depth estimation represents a prevalent research focus within the realm of computer vision. Existing depth estimation methodologies utilizing LiDAR (Light Detection and Ranging) technology typically obtain sparse depth data and are associated with elevated hardware expenses. Multi-view image-matching techniques necessitate prior knowledge of camera intrinsic parameters and frequently encounter challenges such as depth inconsistency, loss of details, and the blurring of edges. To tackle these challenges, the present study introduces a monocular depth estimation approach based on an end-to-end convolutional neural network. Specifically, a DNET backbone has been developed, incorporating dilated convolution and feature fusion mechanisms within the network architecture. By integrating semantic information from various receptive fields and levels, the model's capacity for feature extraction is augmented, thereby enhancing its sensitivity to nuanced depth variations within the image. Furthermore, we introduce a loss function optimization algorithm specifically designed to address class imbalance, thereby enhancing the overall predictive accuracy of the model. Training and validation conducted on the NYU Depth-v2 (New York University Depth Dataset Version 2) and KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) datasets demonstrate that our approach outperforms other algorithms in terms of various evaluation metrics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
21. Monocular depth estimation via cross-spectral stereo information fusion.
- Author
- Liu, Huwei
- Subjects
- MONOCULARS, INSURANCE reserves, ARCHITECTURAL design, STEREO vision (Computer science)
- Abstract
Although many works focus on monocular depth estimation, most study the RGB spectrum, which performs poorly at nighttime and in low-light or even zero-light environments. Images from other spectra provide an opportunity to obtain depth without an active projector source. In this paper, we design a three-step architecture to realize monocular depth estimation by fusing cross-spectral stereo information. In the first step, we employ a Spectral Translation Network to tackle the problem that different spectral images have large appearance differences, and propose a disparity reservation loss to preserve disparity during translation. In the second step, we use a Monocular Estimation Network to predict disparity for the principal spectrum, which is used at test time. In the third step, we retrain the Spectral Translation Network with a generative optimization loss to improve the quality of image translation. Experiments show that our method achieves excellent performance and reaches real-time speed. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. Learning Effective Geometry Representation from Videos for Self-Supervised Monocular Depth Estimation.
- Author
- Zhao, Hailiang, Kong, Yongyi, Zhang, Chonghao, Zhang, Haoji, and Zhao, Jiansen
- Subjects
- MONOCULARS, IMAGE representation, GEOMETRY, VIDEOS
- Abstract
Recent studies on self-supervised monocular depth estimation have achieved promising results, which are mainly based on the joint optimization of depth and pose estimation via high-level photometric loss. However, how to learn the latent and beneficial task-specific geometry representation from videos is still far from being explored. To tackle this issue, we propose two novel schemes to learn more effective representation from monocular videos: (i) an Inter-task Attention Model (IAM) to learn the geometric correlation representation between the depth and pose learning networks to make structure and motion information mutually beneficial; (ii) a Spatial-Temporal Memory Module (STMM) to exploit long-range geometric context representation among consecutive frames both spatially and temporally. Systematic ablation studies are conducted to demonstrate the effectiveness of each component. Evaluations on KITTI show that our method outperforms current state-of-the-art techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. Apply Fuzzy Mask to Improve Monocular Depth Estimation.
- Author
- Chen, Hsuan, Chen, Hsiang-Chieh, Sun, Chung-Hsun, and Wang, Wen-June
- Subjects
- MONOCULARS, PIXELS, IMAGE reconstruction, FUZZY logic
- Abstract
A fuzzy mask applied to pixel-wise dissimilarity weighting is proposed to improve monocular depth estimation in this study. The parameters of the monocular depth estimation model are learned unsupervised through the image reconstruction of binocular images. The significant reconstruction dissimilarity, which is challenging to reduce, always occurs at pixels outside the binocular overlap. The fuzzy mask is designed based on the binocular overlap to adjust the weight of the dissimilarity for each pixel. More than 68% of pixels with significant dissimilarity outside the binocular overlap are suppressed with weights less than 0.5. The model with the proposed fuzzy mask thus focuses on learning depth estimation for pixels within the binocular overlap. Experiments on the KITTI dataset show that inference of the fuzzy mask increases the training time of the model by less than 1%, while the number of pixels whose depth is accurately estimated increases, and the monocular depth estimation also improves. [ABSTRACT FROM AUTHOR] (An illustrative fuzzy-mask construction follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
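A fuzzy mask of the kind described, down-weighting reconstruction error for pixels outside the binocular overlap, could be built from a smooth ramp over image columns; the sigmoid form and the parameters below are our illustrative assumptions.

```python
import numpy as np

def fuzzy_overlap_mask(width: int, max_disparity: int, softness: float = 0.1) -> np.ndarray:
    """Illustrative fuzzy mask for a left-camera view: columns within roughly
    one maximum disparity of the left border fall outside the binocular
    overlap, so their reconstruction error gets a weight well below 0.5; the
    weight rises smoothly (sigmoid) instead of cutting off at a hard edge."""
    cols = np.arange(width)
    return 1.0 / (1.0 + np.exp(-(cols - max_disparity) * softness))

mask = fuzzy_overlap_mask(width=640, max_disparity=96)
print(mask[0], mask[96], mask[-1])  # ~0 near the border, 0.5 at the boundary, ~1 inside

# The mask then weights the per-pixel photometric dissimilarity, e.g.:
# weighted_loss = (mask[None, :] * dissimilarity).mean()
```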
24. Monocular depth estimation combining pyramid structures and attention mechanisms.
- Author
- Li, Tao, Hu, Ting, and Wu, Dandan
- Abstract
Copyright of Journal of Graphics is the property of Journal of Graphics Editorial Office and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
25. Simultaneous Monocular Endoscopic Dense Depth and Odometry Estimation Using Local-Global Integration Networks
- Author
- Fan, Wenkang, Jiang, Wenjing, Fang, Hao, Shi, Hong, Chen, Jianhua, Luo, Xiongbiao, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
- Published
- 2024
- Full Text
- View/download PDF
26. EndoDAC: Efficient Adapting Foundation Model for Self-Supervised Depth Estimation from Any Endoscopic Camera
- Author
- Cui, Beilei, Islam, Mobarakol, Bai, Long, Wang, An, Ren, Hongliang, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
- Published
- 2024
- Full Text
- View/download PDF
27. 3DDX: Bone Surface Reconstruction from a Single Standard-Geometry Radiograph via Dual-Face Depth Estimation
- Author
- Gu, Yi, Otake, Yoshito, Uemura, Keisuke, Takao, Masaki, Soufi, Mazen, Okada, Seiji, Sugano, Nobuhiko, Talbot, Hugues, Sato, Yoshinobu, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
- Published
- 2024
- Full Text
- View/download PDF
28. 3DGR-CAR: Coronary Artery Reconstruction from Ultra-sparse 2D X-Ray Views with a 3D Gaussians Representation
- Author
- Fu, Xueming, Li, Yingtai, Tang, Fenghe, Li, Jun, Zhao, Mingyue, Teng, Gao-Jun, Zhou, S. Kevin, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
- Published
- 2024
- Full Text
- View/download PDF
29. Multimodal Monocular Dense Depth Estimation with Event-Frame Fusion Using Transformer
- Author
- Xiao, Baihui, Xu, Jingzehua, Zhang, Zekai, Xing, Tianyu, Wang, Jingjing, Ren, Yong, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Wand, Michael, editor, Malinovská, Kristína, editor, Schmidhuber, Jürgen, editor, and Tetko, Igor V., editor
- Published
- 2024
- Full Text
- View/download PDF
30. MonoNav: MAV Navigation via Monocular Depth Estimation and Reconstruction
- Author
- Simon, Nathaniel, Majumdar, Anirudha, Siciliano, Bruno, Series Editor, Khatib, Oussama, Series Editor, Antonelli, Gianluca, Advisory Editor, Fox, Dieter, Advisory Editor, Harada, Kensuke, Advisory Editor, Hsieh, M. Ani, Advisory Editor, Kröger, Torsten, Advisory Editor, Kulic, Dana, Advisory Editor, Park, Jaeheung, Advisory Editor, and Ang Jr, Marcelo H., editor
- Published
- 2024
- Full Text
- View/download PDF
31. MonoRetNet: A Self-supervised Model for Monocular Depth Estimation with Bidirectional Half-Duplex Retention
- Author
- Fan, Dengxin, Liu, Songyan, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Huang, De-Shuang, editor, Pan, Yijie, editor, and Zhang, Qinhu, editor
- Published
- 2024
- Full Text
- View/download PDF
32. Light-Dark: A Novel Lightweight Self-supervised Monocular Depth Estimation in the Dark
- Author
- Liang, Qi, Wang, Lizhe, Wang, Lanmei, Liu, Xiang, Wang, Guibao, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Huang, De-Shuang, editor, Zhang, Chuanlei, editor, and Zhang, Qinhu, editor
- Published
- 2024
- Full Text
- View/download PDF
33. Fog Obscurity Mitigation
- Author
- Agarwal, Shivam, Gupta, Abhishek Kumar, Singh, Anand, Tyagi, Shivangi, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Mahapatra, Rajendra Prasad, editor, Peddoju, Sateesh K., editor, Roy, Sudip, editor, and Parwekar, Pritee, editor
- Published
- 2024
- Full Text
- View/download PDF
34. Self-supervised Monocular Depth Estimation and Ego-Motion Made Better: A Masking Constraints
- Author
- Wen, Tian, Sun, Gaofei, Zhang, Lifeng, Kacprzyk, Janusz, Series Editor, and Lee, Roger, editor
- Published
- 2024
- Full Text
- View/download PDF
35. Optimize Vision Transformer Architecture via Efficient Attention Modules: A Study on the Monocular Depth Estimation Task
- Author
- Schiavella, Claudio, Cirillo, Lorenzo, Papa, Lorenzo, Russo, Paolo, Amerini, Irene, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Foresti, Gian Luca, editor, Fusiello, Andrea, editor, and Hancock, Edwin, editor
- Published
- 2024
- Full Text
- View/download PDF
36. Saliency Driven Monocular Depth Estimation Based on Multi-scale Graph Convolutional Network
- Author
- Wu, Dunquan, Chen, Chenglizhao, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Qingshan, editor, Wang, Hanzi, editor, Ma, Zhanyu, editor, Zheng, Weishi, editor, Zha, Hongbin, editor, Chen, Xilin, editor, Wang, Liang, editor, and Ji, Rongrong, editor
- Published
- 2024
- Full Text
- View/download PDF
37. Illumination Insensitive Monocular Depth Estimation Based on Scene Object Attention and Depth Map Fusion
- Author
- Wen, Jing, Ma, Haojiang, Yang, Jie, Zhang, Songsong, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Qingshan, editor, Wang, Hanzi, editor, Ma, Zhanyu, editor, Zheng, Weishi, editor, Zha, Hongbin, editor, Chen, Xilin, editor, Wang, Liang, editor, and Ji, Rongrong, editor
- Published
- 2024
- Full Text
- View/download PDF
38. Self-supervised Cascade Training for Monocular Endoscopic Dense Depth Recovery
- Author
- Jiang, Wenjing, Fan, Wenkang, Chen, Jianhua, Shi, Hong, Luo, Xiongbiao, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Qingshan, editor, Wang, Hanzi, editor, Ma, Zhanyu, editor, Zheng, Weishi, editor, Zha, Hongbin, editor, Chen, Xilin, editor, Wang, Liang, editor, and Ji, Rongrong, editor
- Published
- 2024
- Full Text
- View/download PDF
39. SACFormer: Unify Depth Estimation and Completion with Prompt
- Author
- Tang, Shiyu, Wu, Di, Wang, Yifan, Wang, Lijun, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Qingshan, editor, Wang, Hanzi, editor, Ma, Zhanyu, editor, Zheng, Weishi, editor, Zha, Hongbin, editor, Chen, Xilin, editor, Wang, Liang, editor, and Ji, Rongrong, editor
- Published
- 2024
- Full Text
- View/download PDF
40. SwinFusion: Channel Query-Response Based Feature Fusion for Monocular Depth Estimation
- Author
- Lai, Pengfei, Yin, Mengxiao, Yin, Yifan, Xie, Min, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Qingshan, editor, Wang, Hanzi, editor, Ma, Zhanyu, editor, Zheng, Weishi, editor, Zha, Hongbin, editor, Chen, Xilin, editor, Wang, Liang, editor, and Ji, Rongrong, editor
- Published
- 2024
- Full Text
- View/download PDF
41. Can Language Really Understand Depth?
- Author
- Chen, Fangping, Lu, Yuheng, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Luo, Biao, editor, Cheng, Long, editor, Wu, Zheng-Guang, editor, Li, Hongyi, editor, and Li, Chaojie, editor
- Published
- 2024
- Full Text
- View/download PDF
42. Self-supervised Monocular Depth Estimation on Unseen Synthetic Cameras
- Author
- Diana-Albelda, Cecilia, Bravo Pérez-Villar, Juan Ignacio, Montalvo, Javier, García-Martín, Álvaro, Bescós Cano, Jesús, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Vasconcelos, Verónica, editor, Domingues, Inês, editor, and Paredes, Simão, editor
- Published
- 2024
- Full Text
- View/download PDF
43. Multi-view Stereo by Fusing Monocular and a Combination of Depth Representation Methods
- Author
- Yu, Fanqi, Sun, Xinyang, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Luo, Biao, editor, Cheng, Long, editor, Wu, Zheng-Guang, editor, Li, Hongyi, editor, and Li, Chaojie, editor
- Published
- 2024
- Full Text
- View/download PDF
44. AMENet is a monocular depth estimation network designed for automatic stereoscopic display
- Author
- Tianzhao Wu, Zhongyi Xia, Man Zhou, Ling Bing Kong, and Zengyuan Chen
- Subjects
- Depth loss, Monocular depth estimation, CNN, Transformer, Medicine, Science
- Abstract
Monocular depth estimation has a wide range of applications in the field of autostereoscopic displays, while accuracy and robustness in complex scenes remain a challenge. In this paper, we propose a depth estimation network for autostereoscopic displays, which aims at improving the accuracy of monocular depth estimation by fusing a Vision Transformer (ViT) and a Convolutional Neural Network (CNN). Our approach feeds the input image as a sequence of visual features into the ViT module and utilizes its global perception capability to extract high-level semantic features of the image. The relationship between the losses is quantified by adding a weight correction module to improve the robustness of the model. Experimental evaluation on several public datasets shows that AMENet exhibits higher accuracy and robustness than existing methods in different scenarios and complex conditions. In addition, a detailed experimental analysis was conducted to verify the effectiveness and stability of our method. The accuracy improvement on the KITTI dataset compared to the baseline method is 4.4%. In summary, AMENet is a promising depth estimation method with sufficiently high robustness and accuracy for monocular depth estimation tasks.
- Published
- 2024
- Full Text
- View/download PDF
45. TFDEPTH: SELF-SUPERVISED MONOCULAR DEPTH ESTIMATION WITH MULTI-SCALE SELECTIVE TRANSFORMER FEATURE FUSION.
- Author
- HONGLI HU, JUN MIAO, GUANGHUI ZHU, JIE YAN, and JUN CHU
- Subjects
- TRANSFORMER models, MONOCULARS, ALGORITHMS, NOISE
- Abstract
Existing self-supervised models for monocular depth estimation suffer from issues such as discontinuity, blurred edges, and unclear contours, particularly for small objects. We propose a self-supervised monocular depth estimation network with multi-scale selective Transformer feature fusion. To preserve more detailed features, this paper constructs a multi-scale encoder to extract features and leverages the self-attention mechanism of the Transformer to capture global contextual information, enabling better depth prediction for small objects. Additionally, a multi-scale selective fusion module (MSSF) is proposed, which makes full use of multi-scale feature information in the decoding stage and performs selective fusion step by step, effectively eliminating noise and retaining local detail to obtain depth maps with clear edges. Experimental evaluations on the KITTI dataset demonstrate that the proposed algorithm achieves an absolute relative error (Abs Rel) of 0.098 and an accuracy rate (d) of 0.983. The results indicate that the proposed algorithm not only estimates depth values with high accuracy but also predicts continuous depth maps with clear edges. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. Edge-Enhanced Dual-Stream Perception Network for Monocular Depth Estimation.
- Author
- Liu, Zihang and Wang, Quande
- Subjects
- DEPTH perception, TRANSFORMER models, MONOCULARS, CONVOLUTIONAL neural networks
- Abstract
Estimating depth from a single RGB image has a wide range of applications, such as robot navigation and autonomous driving. Currently, Convolutional Neural Networks based on encoder–decoder architectures are the most popular methods for estimating depth maps. However, convolutional operators have limitations in modeling long-range dependencies, often leading to inaccurate depth predictions at object edges. To address these issues, a new edge-enhanced dual-stream monocular depth estimation method is introduced in this paper. ResNet and Swin Transformer are combined to better extract global and local features, which benefits the estimation of the depth map. To better integrate the information from the two branches of the encoder and the shallow branch of the decoder, we designed a lightweight decoder based on a multi-head Cross-Attention Module. Furthermore, to improve the boundary clarity of objects in the depth map, a loss function with an additional penalty for depth estimation error at object edges is presented. Results on three datasets, NYU Depth V2, KITTI, and SUN RGB-D, show that the method presented in this paper achieves better performance for monocular depth estimation. Additionally, it generalizes well to various scenarios and real-world images. [ABSTRACT FROM AUTHOR] (A generic edge-weighted loss sketch follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
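An additional edge penalty like the one described can be sketched by detecting ground-truth depth edges with Sobel filters and up-weighting the error there; this generic formulation is our assumption, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def edge_weighted_l1(pred: torch.Tensor, gt: torch.Tensor, edge_gain: float = 4.0):
    """Sketch of an edge-emphasizing depth loss: detect edges in the ground
    truth with Sobel filters and scale up the L1 error at those pixels."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                       # Sobel-y is the transpose of Sobel-x
    gx = F.conv2d(gt, kx, padding=1)
    gy = F.conv2d(gt, ky, padding=1)
    edges = torch.sqrt(gx ** 2 + gy ** 2)
    # Weight is 1 away from edges and rises to (1 + edge_gain) on the strongest edge.
    weight = 1.0 + edge_gain * edges / (edges.max() + 1e-8)
    return (weight * (pred - gt).abs()).mean()

pred, gt = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(edge_weighted_l1(pred, gt).item())
```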
47. DPDFormer: A Coarse-to-Fine Model for Monocular Depth Estimation.
- Author
- Liu, Chunpu, Yang, Guanglei, Zuo, Wangmeng, and Zang, Tianyi
- Subjects
- OBJECT recognition (Computer vision), MONOCULARS, DISTRIBUTION (Probability theory), COMPUTER vision, DISCRETIZATION methods
- Abstract
Monocular depth estimation attracts great attention from computer vision researchers for its convenience in acquiring environment depth information. Recently, classification-based MDE methods have shown promising performance and begun to play an essential role in many multi-view applications such as reconstruction and 3D object detection. However, existing classification-based MDE models usually apply a fixed depth-range discretization strategy across a whole scene. This fixed discretization leads to an imbalance of discretization scale among different depth ranges, resulting in inexact depth-range localization. In this article, to alleviate the imbalanced depth-range discretization problem in classification-based monocular depth estimation (MDE) methods, we follow the coarse-to-fine principle and propose a novel depth-range discretization method called depth post-discretization (DPD). Based on a coarse depth anchor roughly indicating the depth range, DPD generates the depth-range discretization adaptively for every position. The discretization with DPD is more fine-grained around the actual depth, which helps locate the depth range more precisely for each scene position. Besides, to better manage the prediction of the coarse depth anchor and the depth probability distribution used to calculate the final depth, we design a dual-decoder transformer-based network, i.e., DPDFormer, which is more compatible with our proposed DPD method. We evaluate DPDFormer on the popular depth datasets NYU Depth V2 and KITTI. The experimental results prove the superior performance of our proposed method. [ABSTRACT FROM AUTHOR] (An adaptive-binning sketch follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
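The depth post-discretization idea, generating bins that are finest around a coarse anchor, can be sketched for a single position as below; the quadratic spacing is our illustrative choice, not necessarily the paper's.

```python
import numpy as np

def post_discretize(anchor: float, num_bins: int = 16, span: float = 4.0,
                    d_min: float = 0.5, d_max: float = 80.0) -> np.ndarray:
    """Per-position depth bins centered on a coarse anchor: quadratic spacing
    keeps bins densest near the anchor and sparser toward the range ends."""
    u = np.linspace(-1.0, 1.0, num_bins)
    bins = anchor + np.sign(u) * (u ** 2) * span
    return np.clip(bins, d_min, d_max)

bins = post_discretize(anchor=12.0)
mid = len(bins) // 2
print(np.round(bins, 2))
print("spacing near anchor:", round(bins[mid] - bins[mid - 1], 3))
# The final depth would then be the probability-weighted average of these bin
# centers, using the per-bin classification scores.
```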
48. Towards a Unified Network for Robust Monocular Depth Estimation: Network Architecture, Training Strategy and Dataset.
- Author
- Xiang, Mochu, Dai, Yuchao, Zhang, Feiyu, Shi, Jiawei, Tian, Xinyu, and Zhang, Zhensong
- Subjects
- MONOCULARS, COMPUTER vision, ASPECT ratio (Images)
- Abstract
Robust monocular depth estimation (MDE) aims at learning a unified model that works across diverse real-world scenes, which is an important and active topic in computer vision. In this paper, we present Megatron_RVC, our winning solution for the monocular depth challenge in the Robust Vision Challenge (RVC) 2022, where we tackle the challenging problem from three perspectives: network architecture, training strategy and dataset. In particular, we made three contributions towards robust MDE: (1) we built a neural network with high capacity to enable flexible and accurate monocular depth predictions, which contains dedicated components to provide content-aware embeddings and to improve the richness of the details; (2) we proposed a novel mixing training strategy to handle real-world images with different aspect ratios, resolutions and apply tailored loss functions based on the properties of their depth maps; (3) to train a unified network model that covers diverse real-world scenes, we used over 1 million images from different datasets. As of 3rd October 2022, our unified model ranked consistently first across three benchmarks (KITTI, MPI Sintel, and VIPER) among all participants. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
49. Using full-scale feature fusion for self-supervised indoor depth estimation.
- Author
- Cheng, Deqiang, Chen, Junhui, Lv, Chen, Han, Chenggong, and Jiang, He
- Abstract
Monocular depth estimation is a crucial task in computer vision, and self-supervised algorithms are gaining popularity due to their independence from expensive ground truth supervision. However, current self-supervised algorithms may not provide accurate estimation and may suffer from distorted boundaries when applied to indoor scenes. Combining multi-scale features is an important research direction in image segmentation to achieve accurate estimation and resolve boundary distortion. However, there are few studies on indoor self-supervised algorithms in this regard. To solve this issue, we propose a novel full-scale feature information fusion approach that includes a full-scale skip-connection and a full-scale feature fusion block. This approach can aggregate the high-level and low-level information of all scale feature maps during the network's encoding and decoding process to compensate for the network's loss of cross-layer feature information. The proposed full-scale feature fusion improves accuracy and reduces the decoder parameters. To fully exploit the superiority of the full-scale feature fusion module, we replace the encoder backbone from ResNet with the more advanced ResNeSt. Combining these two methods results in a significant improvement in prediction accuracy. We have extensively evaluated our approach on the indoor benchmark datasets NYU Depth V2 and ScanNet. Our experimental results demonstrate that our method outperforms existing algorithms, particularly on NYU Depth V2, where our precision is raised to 83.8%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
50. Resolution-sensitive self-supervised monocular absolute depth estimation.
- Author
- Zhou, Yuquan, Zhang, Chentao, Deng, Lianjun, Fu, Jianji, Li, Hongyi, Xu, Zhouyi, and Zhang, Jianhuan
- Subjects
- MONOCULARS, APPLICATION software
- Abstract
Depth estimation is an essential component of computer vision applications for environment perception, 3D reconstruction and scene understanding. Among the available methods, self-supervised monocular depth estimation is noteworthy for its cost-effectiveness, ease of installation and data accessibility. However, there are two challenges with current methods. Firstly, the scale factor of self-supervised monocular depth estimation is uncertain, which poses significant difficulties for practical applications. Secondly, the depth prediction accuracy for high-resolution images is still unsatisfactory, resulting in low utilization of computational resources. We propose a novel solution to address these challenges with three specific contributions. Firstly, an interleaved depth network skip-connection structure and a new depth network decoder are proposed to improve the depth prediction accuracy for high-resolution images. Secondly, a data vertical splicing module is suggested as a data enhancement method to obtain more non-vertical features and improve model generalization. Lastly, a scale recovery module is proposed to recover the accurate absolute depth without additional sensors, which solves the issue of uncertainty in the scale factor. The experimental results demonstrate that the proposed framework significantly improves the prediction accuracy of high-resolution images. In particular, the novel network structure and data vertical splicing module contribute significantly to this improvement. Moreover, in a scenario where the camera height is fixed and the ground is flat, the effect of the scale recovery module is comparable to that achieved by using ground truth. Overall, the RSANet framework offers a promising solution to the existing challenges in self-supervised monocular depth estimation. [ABSTRACT FROM AUTHOR] (A generic camera-height scale-recovery sketch follows this record.)
- Published
- 2024
- Full Text
- View/download PDF
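For the fixed-camera-height, flat-ground scenario mentioned above, scale recovery is commonly realized by comparing the known mounting height with the camera height implied by the relative depth of pixels assumed to be ground. The sketch below is that generic construction under simple pinhole assumptions, not RSANet's actual module; all parameter values are illustrative.

```python
import numpy as np

def height_based_scale(rel_depth: np.ndarray, cam_height_m: float,
                       fy: float, cy: float) -> float:
    """Generic camera-height scale recovery: back-project pixels assumed to be
    ground, read off the camera height implied by the relative depth, and
    compare it with the known metric mounting height."""
    h, w = rel_depth.shape
    rows = np.arange(h, dtype=np.float64).reshape(-1, 1).repeat(w, axis=1)
    ground = rows > 0.8 * h                      # assume the lowest 20% of rows is flat ground
    # Pinhole geometry: for a ground pixel, height = depth * (v - cy) / fy.
    implied = rel_depth[ground] * (rows[ground] - cy) / fy
    return cam_height_m / np.median(implied)

rel = np.random.rand(192, 640) * 0.5 + 0.2       # stand-in relative depth map
scale = height_based_scale(rel, cam_height_m=1.65, fy=720.0, cy=96.0)
abs_depth = scale * rel                          # metric depth map
print(f"recovered scale: {scale:.2f}")
```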