
Improved robustness of vision transformers via PreLayerNorm in patch embedding.

Authors :
Kim, Bum Jun
Choi, Hyeyeon
Jang, Hyeonah
Lee, Dong Gu
Jeong, Wonseok
Kim, Sang Woo
Source :
Pattern Recognition. Sep 2023, Vol. 141.
Publication Year :
2023

Abstract

• We provide empirical tests of vision transformers under various image corruptions.
• Vision transformers showed performance degradation on contrast-enhanced images.
• We propose PreLayerNorm for the consistent behavior of positional embedding.
• We observed that PreLayerNorm improved performance on contrast-enhanced images.
• We provide theoretical analyses of the inconsistent behavior of vision transformers.

Vision transformers (ViTs) have recently demonstrated state-of-the-art performance in various vision tasks, replacing convolutional neural networks (CNNs). However, because a ViT has a different architectural design from a CNN, it may behave differently. To investigate whether ViT differs in performance or robustness, we tested ViT and CNN under various imaging conditions in practical vision tasks. We confirmed that for most image transformations, ViT's robustness was comparable to or even better than that of CNN. For contrast enhancement, however, ViT performed particularly poorly. We show that this is because the positional embedding in ViT's patch embedding can work improperly when the color scale changes. We demonstrate that PreLayerNorm, a modified patch embedding structure, ensures consistent behavior of ViT. Results demonstrate that ViT with PreLayerNorm exhibited improved robustness in contrast-varying environments.
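
The abstract suggests a simple architectural change: normalize the patch tokens before the positional embedding is added, so that a change in the input color scale (e.g., contrast enhancement) cannot alter the relative magnitude of the tokens and the positional embedding. The PyTorch sketch below illustrates one plausible reading; the class name PatchEmbedPreLN, the hyperparameters, and the exact placement of the LayerNorm are assumptions drawn from the abstract, not the authors' released implementation.

    import torch
    import torch.nn as nn

    class PatchEmbedPreLN(nn.Module):
        """ViT patch embedding with a LayerNorm inserted before the
        positional-embedding addition (a sketch of 'PreLayerNorm').

        Hypothetical: placement and defaults are inferred from the
        abstract, not taken from the paper's code.
        """
        def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
            super().__init__()
            num_patches = (img_size // patch_size) ** 2
            # Linear patch projection, expressed as a strided convolution.
            self.proj = nn.Conv2d(in_chans, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)
            # With a plain patch embedding, scaling the input by a contrast
            # factor c scales the projected tokens by roughly c (the
            # projection is linear up to its bias) while the positional
            # embedding stays fixed. Normalizing here removes that scale
            # before the addition.
            self.pre_ln = nn.LayerNorm(embed_dim)
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

        def forward(self, x):
            x = self.proj(x)                  # (B, D, H/P, W/P)
            x = x.flatten(2).transpose(1, 2)  # (B, N, D)
            x = self.pre_ln(x)                # normalize before adding pos. emb.
            return x + self.pos_embed

    # Usage: tokens keep a consistent scale relative to pos_embed
    # regardless of the input's color scale.
    x = torch.randn(2, 3, 224, 224)
    tokens = PatchEmbedPreLN()(x)  # shape (2, 196, 768)
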

Details

Language :
English
ISSN :
0031-3203
Volume :
141
Database :
Academic Search Index
Journal :
Pattern Recognition
Publication Type :
Academic Journal
Accession Number :
163870032
Full Text :
https://doi.org/10.1016/j.patcog.2023.109659