Integrating Transformer into Global and Residual Image Feature Extractor in Visual Question Answering for Blind People

Authors :
Tung Le
Nguyen Tien Huy
Nguyen Le Minh
Source :
KSE
Publication Year :
2020
Publisher :
IEEE, 2020.

Abstract

Visual Question Answering (VQA), a novel task at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), extracts answers from features of both questions and images. Current approaches to VQA rely on a combination of convolutional and recurrent networks, which leads to a huge number of parameters in the learning phase. Building on the success of pre-trained models, we integrate BERT [1] for embedding text and two models, ResNets [2] and VGG [3], for embedding images. In addition, we propose to take advantage of fine-tuning techniques and a stacked attention mechanism to combine textual and visual features in a novel learning phase, chosen for its ability to reduce model size. To demonstrate our model's performance, we conduct experiments on the VizWiz VQA Challenge 2020. According to the experimental results, the proposed approach outperforms existing methods on Yes/No questions in the VizWiz VQA dataset.
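
As a rough illustration of the fusion the abstract describes, the sketch below combines a BERT question embedding with ResNet image-region features through a stacked attention mechanism, loosely following the standard SAN formulation. All module names, dimensions, and the two-hop depth are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch: stacked-attention fusion of BERT text features with
# CNN image features. Dimensions (768-d BERT [CLS], 49 regions of 2048-d
# ResNet features) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttention(nn.Module):
    """One attention hop: the question vector attends over image regions."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.proj_img = nn.Linear(dim, hidden)
        self.proj_q = nn.Linear(dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, q, v):
        # q: (B, dim) question summary; v: (B, R, dim) image region features
        h = torch.tanh(self.proj_img(v) + self.proj_q(q).unsqueeze(1))
        attn = F.softmax(self.score(h).squeeze(-1), dim=1)   # (B, R)
        v_att = (attn.unsqueeze(-1) * v).sum(dim=1)          # (B, dim)
        return q + v_att  # refined query passed to the next hop

class VQAFusion(nn.Module):
    def __init__(self, dim=768, num_answers=2, hops=2):
        super().__init__()
        # Project 2048-d ResNet region features into BERT's 768-d space.
        self.img_proj = nn.Linear(2048, dim)
        self.hops = nn.ModuleList(StackedAttention(dim) for _ in range(hops))
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, q_cls, img_feats):
        # q_cls: (B, 768) BERT [CLS] embedding of the question
        # img_feats: (B, 49, 2048) ResNet conv map, flattened 7x7 grid
        v = self.img_proj(img_feats)
        u = q_cls
        for hop in self.hops:
            u = hop(u, v)
        return self.classifier(u)

# Usage with random tensors in place of real BERT/ResNet outputs:
model = VQAFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 49, 2048))  # (4, 2)
```

The residual update (q + v_att) lets each hop refine the question representation rather than replace it, which is the core idea behind stacking attention layers.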

Details

Database :
OpenAIRE
Journal :
2020 12th International Conference on Knowledge and Systems Engineering (KSE)
Accession number :
edsair.doi...........4c44cdaa2595f63750bf05c3ec9119a6
Full Text :
https://doi.org/10.1109/kse50997.2020.9287539