An end-to-end image-text matching approach considering semantic uncertainty.

Authors :
Tuerhong, Gulanbaier
Dai, Xin
Tian, Liwei
Wushouer, Mairidan
Source :
Neurocomputing, Nov. 2024, Vol. 607.
Publication Year :
2024

Abstract

We propose a novel end-to-end image-text matching approach that accounts for semantic uncertainty (SU-ITM). It addresses the one-to-many semantic diversity inherent in image-text matching, with the aim of capturing cross-modal associations more comprehensively and improving model robustness. Traditional methods map images and texts to definite points in an embedding space and measure cross-modal similarity between those points; such point-based embeddings cannot capture semantic uncertainty, which introduces large biases into the matching results. To address this problem, we model the one-to-many associations between image and text as a probability distribution, incorporating the uncertainty information into the final semantic representation of the text. In addition, we optimize the image-text matching loss so that different text features approximate the image features in a distributed manner while the semantic representation remains discriminative, effectively reducing matching uncertainty. Notably, our method is trained end to end, without using pre-trained object-detection branches at any stage of training. Experiments on Flickr30k and MSCOCO validate the method's strong performance on the image-text matching task: it reaches 546.1 and 545.0 on the R@SUM metric for Flickr30k and MSCOCO 1k, respectively. [ABSTRACT FROM AUTHOR]
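The core idea in the abstract — replacing point embeddings with a distribution over embeddings so that one image can match many plausible texts — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the Gaussian parameterization, the sample count `k`, and the averaged-cosine score are all assumptions for demonstration, and in the actual method these pieces would be learned networks trained with the paper's matching loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_text_embeddings(mu, logvar, k=8, rng=rng):
    """Draw k embeddings from N(mu, diag(exp(logvar))) using the
    reparameterization form (mu + eps * std), which is what keeps
    sampling differentiable in a real training setup."""
    std = np.exp(0.5 * logvar)
    eps = rng.standard_normal((k, mu.shape[0]))
    return mu + eps * std

def cosine(a, b):
    """Cosine similarity with a small epsilon for numerical safety."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def match_score(image_emb, text_mu, text_logvar, k=8):
    """Score an image against an uncertain (Gaussian) text embedding
    by averaging cosine similarity over k samples. A point embedding
    is the special case of zero variance."""
    samples = sample_text_embeddings(text_mu, text_logvar, k)
    return float(np.mean([cosine(image_emb, s) for s in samples]))

# Toy example (hypothetical 4-d embeddings): a text whose distribution
# is centered near the image embedding should score higher than one
# centered far away, even after sampling noise.
img = np.array([1.0, 0.0, 0.0, 0.0])
mu_close = np.array([0.9, 0.1, 0.0, 0.0])
mu_far = np.array([0.0, 0.0, 1.0, 0.0])
logvar = np.full(4, -2.0)  # small variance -> low semantic uncertainty

score_close = match_score(img, mu_close, logvar)
score_far = match_score(img, mu_far, logvar)
```

Raising `logvar` widens the text's distribution, which is the mechanism the abstract describes for expressing one-to-many semantic diversity: a vague caption occupies a larger region of the embedding space rather than a single point.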

Details

Language :
English
ISSN :
0925-2312
Volume :
607
Database :
Academic Search Index
Journal :
Neurocomputing
Publication Type :
Academic Journal
Accession number :
179499499
Full Text :
https://doi.org/10.1016/j.neucom.2024.128386