
Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning

Authors:
Liang, Weicong
Yuan, Yuhui
Ding, Henghui
Luo, Xiao
Lin, Weihong
Jia, Ding
Zhang, Zheng
Zhang, Chao
Hu, Han
Publication Year:
2022

Abstract

Vision transformers have recently achieved competitive results across various vision tasks but still suffer from heavy computation costs when processing a large number of tokens. Many advanced approaches have been developed to reduce the total number of tokens in large-scale vision transformers, especially for image classification tasks. Typically, they select a small group of essential tokens according to their relevance to the class token, then fine-tune the weights of the vision transformer. Such fine-tuning is less practical for dense prediction because its computation and GPU memory costs are much heavier than those of image classification. In this paper, we focus on a more challenging problem, i.e., accelerating large-scale vision transformers for dense prediction without any additional re-training or fine-tuning. Because high-resolution representations are necessary for dense prediction, we present two non-parametric operators: a token clustering layer to decrease the number of tokens and a token reconstruction layer to increase the number of tokens. The method proceeds in three steps: (i) we use the token clustering layer to cluster neighboring tokens together, producing low-resolution representations that maintain the spatial structure; (ii) we apply the subsequent transformer layers only to these low-resolution representations (clustered tokens); and (iii) we use the token reconstruction layer to re-create high-resolution representations from the refined low-resolution representations. Our method achieves promising results on five dense prediction tasks: object detection, semantic segmentation, panoptic segmentation, instance segmentation, and depth estimation.

Comment: Accepted at NeurIPS 2022, camera-ready version, 22 pages, 14 figures
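The abstract describes the two operators only at a high level. Below is a minimal PyTorch sketch of the idea, assuming simple average pooling over local windows for the clustering step and a parameter-free, cross-attention-style similarity weighting for the reconstruction step; the paper's actual operators differ in detail, and the function names and the `tau` temperature here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def token_clustering(tokens, h, w, window=2):
    """Non-parametric down-sampling sketch: average neighboring tokens
    into clusters while preserving the 2D spatial layout.
    tokens: (B, h*w, C) high-resolution token sequence.
    Assumes h and w are divisible by `window`."""
    B, N, C = tokens.shape
    x = tokens.transpose(1, 2).reshape(B, C, h, w)   # back to a 2D grid
    x = F.avg_pool2d(x, kernel_size=window)          # cluster local windows
    hc, wc = x.shape[-2:]
    return x.flatten(2).transpose(1, 2), hc, wc      # (B, hc*wc, C)

def token_reconstruction(hi_tokens, lo_refined, lo_coarse, tau=0.05):
    """Rebuild high-resolution tokens as a similarity-weighted mixture of
    the refined low-resolution tokens (no learnable parameters).
    hi_tokens:  original high-res tokens (B, N, C), used as queries;
    lo_coarse:  clustered tokens before refinement (B, M, C), as keys;
    lo_refined: clustered tokens after the transformer layers, as values."""
    q = F.normalize(hi_tokens, dim=-1)
    k = F.normalize(lo_coarse, dim=-1)
    attn = torch.softmax(q @ k.transpose(1, 2) / tau, dim=-1)  # (B, N, M)
    return attn @ lo_refined                                   # (B, N, C)
```

In use, the clustering layer would run after the early transformer blocks, the remaining blocks would operate on the smaller clustered sequence, and the reconstruction layer would restore full resolution before the dense-prediction head; because neither operator has learnable weights, the pretrained transformer can be reused without fine-tuning, which is the point the abstract makes.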

Details

Database:
arXiv
Publication Type:
Report
Accession number:
edsarx.2210.01035
Document Type:
Working Paper