High performance RGB-Thermal Video Object Detection via hybrid fusion with progressive interaction and temporal-modal difference.
- Source :
- Information Fusion. Feb 2025, Vol. 114, pN.PAG-N.PAG. 1p.
- Publication Year :
- 2025
Abstract
- RGB-Thermal Video Object Detection (RGBT VOD) aims to localize and classify predefined objects in visible and thermal spectrum videos. The key issue in RGBT VOD lies in integrating multi-modal information effectively to improve detection performance. Current multi-modal fusion methods predominantly employ middle fusion strategies, but the inherent modal difference directly affects the effectiveness of multi-modal fusion. Although the early fusion strategy reduces the modality gap in the middle stage of the network, achieving in-depth feature interaction between different modalities remains challenging. In this work, we propose a novel hybrid fusion network called PTMNet, which combines an early fusion strategy based on progressive interaction with a middle fusion strategy based on temporal-modal difference, for high performance RGBT VOD. In particular, we take each modality as a master modality to achieve an early fusion with other modalities as auxiliary information by progressive interaction. Such a design not only alleviates the modality gap but also facilitates middle fusion. The temporal-modal difference models temporal information through spatial offsets and utilizes feature erasure between modalities to motivate the network to focus on shared objects in both modalities. The hybrid fusion achieves high detection accuracy using only three input frames, which gives PTMNet a high inference speed. Experimental results show that our approach achieves state-of-the-art performance on the VT-VOD50 dataset and also operates at over 70 FPS. The code will be freely released at https://github.com/tzz-ahu for academic purposes.
• A hybrid fusion strategy network for RGB-Thermal video object detection.
• An early fusion strategy for reducing modal disparities.
• A novel differential method for modeling multimodal and temporal information.
• The proposed PTMNet achieves SOTA performance on the VT-VOD50 dataset.
[ABSTRACT FROM AUTHOR]
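The master/auxiliary idea in the abstract (each modality takes a turn as the master, with the other modality mixed in as auxiliary information) can be pictured with a toy sketch. This is not PTMNet's progressive interaction, which the abstract does not specify; the function names `blend` and `early_fuse`, the fixed mixing weight, and the pixel-wise addition are all hypothetical simplifications chosen only for illustration.

```python
from typing import Dict, List

Frame = List[List[float]]  # toy single-channel image as nested lists


def blend(master: Frame, auxiliary: Frame, weight: float) -> Frame:
    """Add a down-weighted auxiliary frame to the master frame, pixel by pixel."""
    return [
        [m + weight * a for m, a in zip(m_row, a_row)]
        for m_row, a_row in zip(master, auxiliary)
    ]


def early_fuse(frames: Dict[str, Frame], weight: float = 0.5) -> Dict[str, Frame]:
    """Return one fused stream per modality, each treating that modality as master.

    Assumption: a fixed scalar weight stands in for whatever learned
    interaction the real network uses.
    """
    fused = {}
    for name, master in frames.items():
        out = master
        for other, aux in frames.items():
            if other != name:  # every other modality acts as auxiliary input
                out = blend(out, aux, weight)
        fused[name] = out
    return fused


# Toy 2x2 frames; real inputs would be aligned RGB and thermal video frames.
rgb = [[1.0, 2.0], [3.0, 4.0]]
thermal = [[0.0, 2.0], [4.0, 0.0]]
streams = early_fuse({"rgb": rgb, "thermal": thermal})
```

Here `streams` holds two fused streams, one RGB-led and one thermal-led, which a later middle fusion stage could then combine.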
- Subjects :
- *VISIBLE spectra
*MODALITY (Linguistics)
*VIDEOS
*MOTIVATION (Psychology)
*SPEED
Details
- Language :
- English
- ISSN :
- 15662535
- Volume :
- 114
- Database :
- Academic Search Index
- Journal :
- Information Fusion
- Publication Type :
- Academic Journal
- Accession number :
- 180494341
- Full Text :
- https://doi.org/10.1016/j.inffus.2024.102665