
Cross-Modal Semantic Matching Generative Adversarial Networks for Text-to-Image Synthesis

Authors:
Hongchen Tan
Xiuping Liu
Xin Li
Baocai Yin
Source:
IEEE Transactions on Multimedia, 24:832-845
Publication Year:
2022
Publisher:
Institute of Electrical and Electronics Engineers (IEEE), 2022.

Abstract

Synthesizing photo-realistic images from text descriptions is a challenging image generation problem. Although many recent approaches have significantly advanced the performance of text-to-image generation, guaranteeing semantic matching between the text description and the synthesized image remains very challenging. In this paper, we propose a new model, Cross-modal Semantic Matching Generative Adversarial Networks (CSM-GAN), to improve the semantic consistency between a text description and the synthesized image for fine-grained text-to-image generation. Two new modules are proposed in CSM-GAN: the Text Encoder Module (TEM) and the Textual-Visual Semantic Matching Module (TVSMM). TVSMM aims to make each synthesized image closer, in a global semantic embedding space, to its corresponding text description than to mismatched descriptions. This improves semantic consistency and, consequently, the generalizability of CSM-GAN. In TEM, we introduce Text Convolutional Neural Networks (Text_CNNs) to capture and highlight local visual features in textual descriptions. Thorough experiments on two public benchmark datasets demonstrate the superiority of CSM-GAN over other representative state-of-the-art methods.
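The abstract describes two components that can be sketched concretely: a Text_CNN sentence encoder (1-D convolutions over word embeddings) and a cross-modal matching objective that pulls matched image-text pairs closer together in a shared embedding space than mismatched pairs. The PyTorch sketch below illustrates the general techniques only; it is not the authors' CSM-GAN implementation, and the names (TextCNNEncoder, matching_loss), layer sizes, and margin value are all assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNEncoder(nn.Module):
    # Sketch of a Text-CNN encoder: 1-D convolutions with several
    # kernel widths over word embeddings, max-pooled over time and
    # concatenated into a global sentence embedding. Sizes are
    # illustrative, not taken from the paper.
    def __init__(self, vocab_size=5000, embed_dim=300,
                 kernel_sizes=(2, 3, 4), num_filters=128, out_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.proj = nn.Linear(num_filters * len(kernel_sizes), out_dim)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)    # (batch, embed_dim, seq_len)
        # Each conv highlights local n-gram features; max-pooling keeps
        # the strongest response per filter.
        feats = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.proj(torch.cat(feats, dim=1))  # (batch, out_dim)

def matching_loss(img_emb, txt_emb, margin=0.2):
    # Hinge-based ranking loss (an assumed realization of the matching
    # idea): cosine similarity of a matched image-text pair must exceed
    # that of every mismatched pair in the batch by at least `margin`.
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    sim = img @ txt.t()                     # (batch, batch) similarity matrix
    pos = sim.diag().unsqueeze(1)           # matched-pair similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_txt = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_img = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_txt.mean() + cost_img.mean()

In use, an image encoder producing embeddings of the same dimension as out_dim would be paired with this text encoder, and matching_loss added to the generator objective so that every matched pair outscores the mismatched pairs in its batch.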

Details

ISSN:
1941-0077 and 1520-9210
Volume:
24
Database:
OpenAIRE
Journal:
IEEE Transactions on Multimedia
Accession number:
edsair.doi...........a5fcf90ae840cdc49df34b68e64ad721