Zero-shot referring image segmentation based on frequency-domain fusion of multimodal features.

Authors :: 林浩然
 刘春黔
 薛榕融
 谢勋伟
 雷印杰
Source :: Application Research of Computers / Jisuanji Yingyong Yanjiu. May 2024, Vol. 41 Issue 5, p1562-1568. 7p.
Publication Year :: 2024
Abstract: Semantic segmentation cannot handle categories outside its predefined label set when applied to real-world downstream tasks. Referring image segmentation was proposed to address this problem by locating the target in an image that corresponds to a natural-language description. Most existing methods use a cross-modal decoder to fuse features extracted independently by a visual encoder and a language encoder, but these methods cannot effectively exploit the edge features of the image and are complicated to train. CLIP is a powerful pre-trained vision-language cross-modal model that can effectively extract image and text features. This paper therefore proposes a method that fuses multimodal features in the frequency domain after CLIP encoding. First, an unsupervised model segments the image, and the nouns in the natural-language text are extracted for the follow-up task. Next, the image encoder and text encoder of CLIP encode the image and the text, respectively. The wavelet transform then decomposes the image and text features, which are fused in the frequency domain so as to make full use of the image's edge features and position information, and the fused features are transformed back by the inverse transform. Finally, the text features and image features are matched pixel by pixel to obtain the segmentation result. The method is tested on commonly used datasets. The experimental results show that the network achieves good results in the zero-shot setting, without training, and has good robustness and generalization ability. [ABSTRACT FROM AUTHOR]
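
Two of the steps named in the abstract, wavelet decomposition with its inverse and pixel-by-pixel matching of text and image features, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the choice of a single-level Haar wavelet, the cosine-similarity matching rule, the threshold value, and the function names `haar_dwt2`, `haar_idwt2`, and `match_pixels` are all assumptions made for illustration. The abstract does not specify how the sub-bands are fused, so the fusion step is left out.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Returns the low-frequency band and the three high-frequency (edge) bands.
    The Haar wavelet is an assumption; the abstract only says "wavelet transform".
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # local average (low frequency)
    lh = (a + b - c - d) / 2.0  # differences between rows
    hl = (a - b + c - d) / 2.0  # differences between columns
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, (lh, hl, hh)


def haar_idwt2(ll, highs):
    """Inverse of haar_dwt2: rebuild the (2H, 2W) array from its four bands."""
    lh, hl, hh = highs
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


def match_pixels(image_feats, text_feat, threshold=0.5):
    """Compare each pixel feature with the text feature by cosine similarity.

    image_feats: (H, W, C) array, text_feat: (C,) array.
    Returns a boolean (H, W) mask; the threshold is a hypothetical parameter.
    """
    eps = 1e-8  # avoids division by zero for all-zero feature vectors
    img = image_feats / (np.linalg.norm(image_feats, axis=-1, keepdims=True) + eps)
    txt = text_feat / (np.linalg.norm(text_feat) + eps)
    return (img @ txt) > threshold
```

In this sketch the three high-frequency bands carry the edge information that the abstract says the frequency-domain fusion is meant to exploit, and applying `haar_idwt2` to unmodified bands recovers the input exactly.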