End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages
- Authors
Johanes Effendi, Sakriani Sakti, and Satoshi Nakamura
- Subjects
Image-to-speech, image captioning, self-supervised speech representation, vector-quantized variational autoencoder, untranscribed unknown language, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Describing orally what we see is a simple task we perform in our daily lives. In the natural language processing field, however, this simple task must be bridged by a textual modality that helps the system generalize over the various objects in an image and the various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that needs no textual information in its training. We use a vector-quantized variational autoencoder (VQ-VAE) to learn a discrete representation of a speech caption in an unsupervised manner, and these discrete labels are then used by an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with a minimum amount of paired image-speech data while still maintaining the quality of the speech captions. Our experimental results on a multi-speaker natural speech dataset demonstrate that our proposed text-free Image2Speech system performs close to one trained with textual information. Furthermore, our approach also outperforms the most recent phoneme-based and grounding-based Image2Speech frameworks.
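The core mechanism the abstract describes, replacing text with discrete VQ-VAE codes, can be illustrated with a minimal sketch. This is not the authors' implementation; the codebook size, embedding dimension, and frame count below are illustrative assumptions, and the codebook here is random rather than learned:

```python
import numpy as np

# Minimal sketch of VQ-VAE vector quantization: each encoder output frame
# is snapped to its nearest codebook entry, and the resulting discrete
# indices act as pseudo-text units for the image-captioning model.
rng = np.random.default_rng(0)

K, D = 8, 4                         # codebook size and embedding dim (assumed)
codebook = rng.normal(size=(K, D))  # in practice, learned jointly with the encoder
frames = rng.normal(size=(10, D))   # encoder outputs for 10 speech frames

# Squared Euclidean distance ||z - e_k||^2 for every frame/codeword pair.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
indices = dists.argmin(axis=1)      # discrete "caption" token IDs, shape (10,)
quantized = codebook[indices]       # embeddings passed on to the decoder
```

During training, the non-differentiable `argmin` is typically bypassed with the straight-through estimator (copying decoder gradients back to the encoder), which is what lets the representation be learned without any transcriptions.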
- Published
2021