End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages

Authors :
Johanes Effendi
Sakriani Sakti
Satoshi Nakamura
Source :
IEEE Access, Vol 9, Pp 55144-55154 (2021)
Publication Year :
2021
Publisher :
IEEE, 2021.

Abstract

Describing orally what we see is a simple task in daily life. In the natural language processing field, however, this task is usually bridged by a textual modality, which helps the system generalize over the various objects in an image and the various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that needs no textual information during training. We use a vector-quantized variational autoencoder (VQ-VAE) to learn a discrete representation of the speech captions in an unsupervised manner; the resulting discrete labels are then used by an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with a minimal amount of paired image-speech data while maintaining the quality of the speech captions. Experimental results on a multi-speaker natural speech dataset show that the proposed text-free Image2Speech system performs close to a system trained with textual information. Furthermore, our approach outperforms the most recent phoneme-based and grounding-based Image2Speech frameworks.
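
The quantization step that turns continuous speech features into discrete labels can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only the nearest-codebook lookup that VQ-VAE models generally use, and the names `quantize`, `CODEBOOK`, and `FRAMES`, the two-dimensional features, and the three-entry codebook are hypothetical values chosen for illustration. In a real system the codebook is learned jointly with an encoder and a decoder.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each encoded frame to the index of its nearest codebook vector.

    The returned indices play the role of the discrete labels that a
    captioning model could predict in place of text tokens.
    """
    indices = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


# Hypothetical toy values (assumptions, not taken from the paper).
CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
FRAMES = [[0.1, -0.1], [0.9, 1.2], [3.5, 4.2], [0.8, 0.8]]

if __name__ == "__main__":
    print(quantize(FRAMES, CODEBOOK))
```

Under these assumptions, each speech frame is replaced by an integer label, so an image-to-label model can be trained without any transcription.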

Details

Language :
English
ISSN :
21693536
Volume :
9
Database :
OpenAIRE
Journal :
IEEE Access
Accession number :
edsair.doi.dedup.....fd7b04709be13d0a6bfe6cb732b311c7