iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition

Authors :
Wei, Yixuan
Cao, Yue
Zhang, Zheng
Yao, Zhuliang
Xie, Zhenda
Hu, Han
Guo, Baining
Publication Year :
2022

Abstract

Image classification, which classifies images by pre-defined categories, has been the dominant approach to visual representation learning over the last decade. Visual learning through image-text alignment, however, has emerged to show promising performance, especially for zero-shot recognition. We believe that these two learning tasks are complementary, and suggest combining them for better visual learning. Rather than shallow fusion through naive multi-task learning, we propose a deep fusion method with three adaptations that effectively bridge the two learning tasks. First, we replace the previous common practice in image classification, a linear classifier, with a cosine classifier, which shows comparable performance. Second, we convert the image classification problem from learning parametric category classifier weights to learning a text encoder as a meta network to generate category classifier weights. The learnt text encoder is shared between image classification and image-text alignment. Third, we enrich each class name with a description to avoid confusion between classes and make the classification method closer to the image-text alignment. We show empirically that this deep fusion approach performs better on a variety of visual recognition tasks and setups than the individual learning or shallow fusion approach, from zero-shot/few-shot image classification, such as the Kornblith 12-dataset benchmark, to downstream tasks of action recognition, semantic segmentation, and object detection in fine-tuning and open-vocabulary settings. The code will be available at https://github.com/weiyx16/iCAR.

Comment: 22 pages, 6 figures
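
Illustrative sketch (not taken from the paper or its repository): the first two adaptations described in the abstract can be pictured as a cosine classifier whose class weights are produced by a text encoder from a class name plus a short description. Everything named below is an assumption made for illustration: the function names, the scale value of 10.0, the 16-dimensional embedding, and in particular toy_text_encoder, which is a fixed hash-based stand-in for a learned text encoder.

```python
import zlib
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def cosine_logits(features, class_weights, scale=10.0):
    """Cosine classifier: scaled cosine similarity between features and class weights.

    `scale` is an assumed temperature-like constant, not a value from the paper.
    """
    return scale * l2_normalize(features) @ l2_normalize(class_weights).T


def toy_text_encoder(text, dim=16):
    """Hypothetical stand-in for a learned text encoder.

    Sums one fixed pseudo-random vector per word, seeded by the word's CRC32,
    so the same text always maps to the same embedding.
    """
    vec = np.zeros(dim)
    for word in text.lower().split():
        seed = zlib.crc32(word.encode("utf-8"))
        vec += np.random.default_rng(seed).standard_normal(dim)
    return vec


def class_weights_from_text(prompts, dim=16):
    """Generate one classifier weight vector per class from its text prompt."""
    return np.stack([toy_text_encoder(p, dim) for p in prompts])


# Class name enriched with a description (the third adaptation in the abstract).
prompts = [
    "tabby cat, a domestic cat with a striped coat",
    "golden retriever, a breed of dog with a golden coat",
]
weights = class_weights_from_text(prompts)

# Pretend image feature that points in the direction of class 1.
features = weights[[1]] * 3.0
logits = cosine_logits(features, weights)
pred = int(np.argmax(logits, axis=1)[0])
```

Because both the feature and the weights are normalized, the logits depend only on direction, not magnitude. In this sketch the same cosine_logits function could score image features against class prompts (classification) or against caption embeddings (image-text alignment), which is the sense in which a single text encoder can be shared between the two tasks.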

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2204.10760
Document Type :
Working Paper