1. VL-Meta: Vision-Language Models for Multimodal Meta-Learning.
- Author
Ma, Han; Fan, Baoyu; Ng, Benjamin K.; and Lam, Chan-Tong
- Subjects
MACHINE learning, LANGUAGE models, MULTIMODAL user interfaces, ARTIFICIAL intelligence, TASK performance, LEARNING ability, STIMULUS generalization
- Abstract
Multimodal learning is a promising area of artificial intelligence (AI) that enables models to understand different kinds of data. Existing approaches re-train a new model on top of pre-trained models, which requires large amounts of data, computation, and time, and is therefore difficult in low-resource or small-sample settings. We propose VL-Meta, Vision-Language Models for Multimodal Meta-Learning. It (1) presents a vision-language mapper and a multimodal fusion mapper, both lightweight structures, that reuse existing pre-trained models to map image features into the language feature space, saving training data, computation, and time; (2) constructs a meta-task pool that builds sufficient training data from only a small amount of data and improves generalization by letting the model learn both data knowledge and task knowledge; (3) proposes token-level training, which aligns inputs with outputs during training to improve model performance; and (4) adopts a multi-task fusion loss so that the model learns different abilities. VL-Meta achieves good performance on the Visual Question Answering (VQA) task, which demonstrates the feasibility and effectiveness of the model. This solution can help blind or visually impaired individuals obtain visual information. [ABSTRACT FROM AUTHOR]
- Published
2024
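
The abstract's meta-task pool, which builds many training tasks from a small amount of data, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the common N-way K-shot episodic sampling convention from meta-learning, and the function name `build_meta_task_pool`, its parameters, and the toy (question, answer) samples are all hypothetical.

```python
import random
from collections import defaultdict


def build_meta_task_pool(samples, n_way=2, k_shot=1, n_query=1, n_tasks=5, seed=0):
    """Sample N-way K-shot tasks from a small labeled dataset.

    ASSUMPTION: episodic sampling stands in for the paper's meta-task pool;
    the actual construction in VL-Meta may differ.
    `samples` is a list of (item, label) pairs, e.g. a question with its answer.
    """
    rng = random.Random(seed)

    # Group items by label so each task can draw per-label examples.
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    # Keep only labels with enough items for a disjoint support/query split.
    eligible = [lab for lab, items in by_label.items()
                if len(items) >= k_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough labels with sufficient examples")

    pool = []
    for _ in range(n_tasks):
        support, query = [], []
        for label in rng.sample(eligible, n_way):
            picked = rng.sample(by_label[label], k_shot + n_query)
            support += [(x, label) for x in picked[:k_shot]]
            query += [(x, label) for x in picked[k_shot:]]
        pool.append({"support": support, "query": query})
    return pool


# Toy data: 3 answers ("labels") with 3 hypothetical questions each.
toy = [(f"q{i}_{ans}", ans) for ans in ("cat", "dog", "bird") for i in range(3)]
tasks = build_meta_task_pool(toy, n_way=2, k_shot=1, n_query=1, n_tasks=4)
```

Because each task re-samples labels and examples, the same nine samples can yield many distinct tasks, which is the sense in which a small dataset can supply enough training episodes.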