TmfimCLIP: Text-Driven Multi-Attribute Face Image Manipulation.

Authors :: Yaermaimaiti, Yilihamu; Wang, Ruohao; Lou, Xudong; Liu, Yajie; Xi, Linfei
Source :: International Journal of Image & Graphics. Dec2024, p1. 21p.
Publication Year :: 2024
Abstract: Text-to-image conversion has garnered significant research attention, with contemporary methods leveraging the latent space analysis of StyleGAN. However, issues with latent code decoupling, interpretability, and controllability often remain, leading to misaligned image attributes. To address these challenges, we propose a refined approach that segments StyleGAN’s latent code using the Visual Language Model (CLIP). Our method aligns the latent code segments with text embeddings via an image-text alignment module and modulates them through a text injection module. Additionally, we incorporate semantic segmentation loss and mouth loss to constrain operations that affect irrelevant attributes. Compared to previous CLIP-driven techniques, our approach significantly enhances decoupling, interpretability, and controllability. Experiments on the CelebA-HQ and FFHQ datasets validate our model’s efficacy through both qualitative and quantitative comparisons. Our model effectively handles a wide range of style variations, achieving an FID score of 21.15 for facial attributes and an ID metric of 0.88 for hair attributes. [ABSTRACT FROM AUTHOR]