Integrating Transformer into Global and Residual Image Feature Extractor in Visual Question Answering for Blind People
- Source : KSE
- Publication Year : 2020
- Publisher : IEEE, 2020.
Abstract
- Visual Question Answering (VQA), a novel task at the intersection of Computer Vision (CV) and Natural Language Processing (NLP), extracts answers from features of both questions and images. Current approaches to VQA rely on a combination of convolutional and recurrent networks, which leads to a huge number of parameters in the learning phase. Building on the success of pre-trained models, we integrate BERT [1] for embedding text and two models, ResNet [2] and VGG [3], for embedding images. In addition, we propose to take advantage of fine-tuning techniques and a stacked attention mechanism to combine textual and visual features in a novel learning phase, chosen for its ability to reduce model size. To demonstrate our model's performance, we conduct experiments on the VizWiz VQA Challenge 2020. According to the experimental results, the proposed approach outperforms existing methods on Yes-No questions of the VizWiz VQA dataset.
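- The fusion the abstract describes, BERT question features combined with CNN image features through a stacked attention mechanism, can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' code: the hidden size (512), the two attention stacks, the ResNet region grid (49 regions of 2048 channels), and the two-way Yes-No classifier head are all assumptions made for the example.

```python
# Minimal sketch (assumed, not the paper's implementation) of fusing BERT
# question features with ResNet image features via stacked attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StackedAttention(nn.Module):
    """One attention layer in the style of Stacked Attention Networks."""

    def __init__(self, q_dim, v_dim, hidden_dim):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.v_proj = nn.Linear(v_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, q, v):
        # q: (B, q_dim) question vector; v: (B, R, v_dim) image region features
        h = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))  # (B, R, H)
        p = F.softmax(self.attn(h).squeeze(-1), dim=1)                # (B, R)
        return (p.unsqueeze(-1) * v).sum(dim=1)                       # (B, v_dim)


class VQAFusion(nn.Module):
    def __init__(self, q_dim=768, v_dim=2048, hidden_dim=512,
                 num_answers=2, num_stacks=2):
        # 768 = BERT-base hidden size; 2048 = ResNet final feature channels.
        # num_answers=2 mirrors the Yes-No setting highlighted in the abstract.
        super().__init__()
        self.q_to_v = nn.Linear(q_dim, v_dim)  # align question dim to image dim
        self.stacks = nn.ModuleList(
            [StackedAttention(v_dim, v_dim, hidden_dim) for _ in range(num_stacks)])
        self.classifier = nn.Linear(v_dim, num_answers)

    def forward(self, q_feat, img_feat):
        # q_feat: (B, 768), e.g. the BERT [CLS] embedding of the question
        # img_feat: (B, R, 2048), e.g. a flattened 7x7 ResNet conv grid
        u = self.q_to_v(q_feat)
        for layer in self.stacks:
            u = u + layer(u, img_feat)  # refine the query with attended visual info
        return self.classifier(u)


model = VQAFusion()
logits = model(torch.randn(4, 768), torch.randn(4, 49, 2048))
print(logits.shape)  # torch.Size([4, 2])
```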
Details
- Database : OpenAIRE
- Journal : 2020 12th International Conference on Knowledge and Systems Engineering (KSE)
- Accession number : edsair.doi...........4c44cdaa2595f63750bf05c3ec9119a6
- Full Text : https://doi.org/10.1109/kse50997.2020.9287539