Start Over

M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension

Authors :: Liu, Xuyang
Liu, Ting
Huang, Siteng
Xin, Yi
Hu, Yue
Yin, Quanjun
Wang, Donglin
Chen, Honggang
Publication Year :: 2024
Abstract: Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient multi-modal interaction between pre-trained vision-language foundation models, and (2) high GPU memory usage due to gradients passing through the heavy vision-language foundation models. To this end, we present M$^2$IST: Multi-Modal Interactive Side-Tuning with M$^3$ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we keep the pre-trained uni-modal encoders fixed, updating M$^3$ISAs on side networks to progressively connect them, enabling more comprehensive vision-language alignment and efficient tuning for REC. Empirical results reveal that M$^2$IST achieves an optimal balance between performance and efficiency compared to most full fine-tuning and other PETL methods. With M$^2$IST, standard transformer-based REC methods present competitive or even superior performance compared to full fine-tuning, while utilizing only 2.11\% of the tunable parameters, 39.61\% of the GPU memory, and 63.46\% of the fine-tuning time required for full fine-tuning.

Subjects :: Computer Science - Computer Vision and Pattern Recognition

Details

Database :: arXiv
Publication Type :: Report
Accession number :: edsarx.2407.01131
Document Type :: Working Paper

Tools

Email
Cite

Printer

Authors Abstract Subjects Details

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension

Abstract

Subjects

Details

Tools

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension

Abstract

Subjects

Details

Tools

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources