UniMod1K: Towards a More Universal Large-Scale Dataset and Benchmark for Multi-modal Learning.

Authors :: Zhu, Xue-Feng
Xu, Tianyang
Liu, Zongtao
Tang, Zhangyong
Wu, Xiao-Jun
Kittler, Josef
Source :: International Journal of Computer Vision; Aug 2024, Vol. 132, Issue 8, pp. 2845-2860 (16 pages)
Publication Year :: 2024
Abstract: The emergence of large-scale high-quality datasets has stimulated the rapid development of deep learning in recent years. However, most computer vision tasks focus on the visual modality only, resulting in a huge imbalance in the number of annotated data for other modalities. While several multi-modal datasets have been made available, the majority of them are confined to only two modalities, serving a single specific computer vision task. To redress the data deficiency for multi-modal learning and applications, a new dataset named UniMod1K is presented in this work. UniMod1K involves three data modalities: vision, depth, and language. For the vision and depth modalities, the UniMod1K dataset contains 1050 RGB-D sequences, comprising a total of some 2.5 million frames. Regarding the language modality, the proposed dataset includes 1050 sentences describing the target object in each video. To demonstrate the advantages of training on a larger multi-modal dataset, such as UniMod1K, and to stimulate research enabled by the dataset, we address several multi-modal tasks, namely multi-modal object tracking and monocular depth estimation. To establish a performance baseline, we propose novel baseline methods for RGB-D object tracking, vision-language tracking and vision-depth-language tracking. Additionally, we conduct comprehensive experiments for each of these tasks. The results highlight the potential of the UniMod1K dataset to improve the performance of multi-modal approaches. The dataset and codes can be accessed at https://github.com/xuefeng-zhu5/UniMod1K. [ABSTRACT FROM AUTHOR]

Subjects :: COMPUTER vision
OBJECT tracking (Computer vision)
DEEP learning
MONOCULARS

Details

Language :: English
ISSN :: 0920-5691
Volume :: 132
Issue :: 8
Database :: Complementary Index
Journal :: International Journal of Computer Vision
Publication Type :: Academic Journal
Accession number :: 178402107
Full Text :: https://doi.org/10.1007/s11263-024-01999-8

