CiT: Curation in Training for Effective Vision-Language Data

Authors :
Xu, Hu
Xie, Saining
Huang, Po-Yao
Yu, Licheng
Howes, Russell
Ghosh, Gargi
Zettlemoyer, Luke
Feichtenhofer, Christoph
Publication Year :
2023

Abstract

Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web). CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternately selects relevant training data from the pool by measuring the similarity between their text embeddings and the embeddings of the metadata. In our experiments, we observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.
Comment: Technical Report
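
The selection step described in the abstract (keeping image-text pairs whose text embedding is similar to the embedding of some task metadata) might be sketched roughly as follows. This is an illustration only, not the authors' implementation: the names `curate` and `toy_embed`, the bag-of-words stand-in for the text encoder, the use of cosine similarity, and the fixed `threshold` are all assumptions introduced here; the abstract does not specify the similarity measure or the selection rule.

```python
import math

# Hypothetical toy vocabulary; a real system would use the model's text encoder.
VOCAB = ["dog", "cat", "car", "invoice"]


def toy_embed(text):
    """Stand-in text encoder (assumption): bag-of-words counts over VOCAB."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def curate(pool, metadata_embs, embed_text, threshold):
    """One curation pass: keep pairs whose caption embedding is close to
    at least one metadata embedding. `threshold` is an assumed cutoff."""
    kept = []
    for image, text in pool:
        emb = embed_text(text)
        if max(cosine(emb, m) for m in metadata_embs) >= threshold:
            kept.append((image, text))
    return kept


# Metadata for the task of interest, e.g. class names.
metadata_embs = [toy_embed(name) for name in ["dog", "cat"]]

# A small pool of (image id, caption) pairs standing in for raw web data.
pool = [
    ("img0", "a photo of a dog"),
    ("img1", "invoice number 42"),
    ("img2", "cat on a car"),
]

curated = curate(pool, metadata_embs, toy_embed, threshold=0.5)
print(curated)
```

In this toy run the invoice caption shares nothing with the metadata and is dropped, while the two animal captions are kept. In CiT as the abstract describes it, the text encoder is itself updated by the inner training loop, so a pass like this would be repeated with the current encoder rather than run once offline.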

Details

Database :
arXiv
Publication Type :
Report
Accession number :
edsarx.2301.02241
Document Type :
Working Paper