UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-modal Learning.

Authors :
Zhu, Xue-Feng
Xu, Tianyang
Liu, Zongtao
Tang, Zhangyong
Wu, Xiao-Jun
Kittler, Josef
Source :
International Journal of Computer Vision; Aug 2024, Vol. 132 Issue 8, p2845-2860, 16p
Publication Year :
2024

Abstract

The emergence of large-scale high-quality datasets has stimulated the rapid development of deep learning in recent years. However, most computer vision tasks focus on the visual modality only, resulting in a huge imbalance in the amount of annotated data for other modalities. While several multi-modal datasets have been made available, the majority of them are confined to only two modalities, serving a single specific computer vision task. To redress the data deficiency for multi-modal learning and applications, a new dataset named UniMod1K is presented in this work. UniMod1K involves three data modalities: vision, depth, and language. For the vision and depth modalities, the UniMod1K dataset contains 1050 RGB-D sequences, comprising a total of some 2.5 million frames. Regarding the language modality, the proposed dataset includes 1050 sentences describing the target object in each video. To demonstrate the advantages of training on a larger multi-modal dataset, such as UniMod1K, and to stimulate research enabled by the dataset, we address several multi-modal tasks, namely multi-modal object tracking and monocular depth estimation. To establish a performance baseline, we propose novel baseline methods for RGB-D object tracking, vision-language tracking and vision-depth-language tracking. Additionally, we conduct comprehensive experiments for each of these tasks. The results highlight the potential of the UniMod1K dataset to improve the performance of multi-modal approaches. The dataset and code can be accessed at https://github.com/xuefeng-zhu5/UniMod1K. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
0920-5691
Volume :
132
Issue :
8
Database :
Complementary Index
Journal :
International Journal of Computer Vision
Publication Type :
Academic Journal
Accession number :
178402107
Full Text :
https://doi.org/10.1007/s11263-024-01999-8