End-to-End Image-to-Speech Generation for Untranscribed Unknown Languages
- Source :
- IEEE Access, Vol 9, Pp 55144-55154 (2021)
- Publication Year :
- 2021
- Publisher :
- IEEE, 2021.
Abstract
- Describing orally what we see is a simple task that we perform in daily life. In the natural language processing field, however, this task is usually bridged by a textual modality that helps the system generalize across the various objects in an image and the various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that does not need any textual information for training. We use a vector-quantized variational autoencoder (VQ-VAE) model to learn a discrete representation of the speech captions in an unsupervised manner, and the resulting discrete labels are then used by an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with a minimal amount of paired image-speech data while still maintaining the quality of the speech captions. Our experimental results on a multi-speaker natural speech dataset demonstrate that the proposed text-free Image2Speech system performs close to a system that uses textual information. Furthermore, our approach outperforms the most recent existing phoneme-based and grounding-based Image2Speech systems.
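The discretization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `quantize`, the two-dimensional toy features, and the fixed three-entry codebook are all assumptions made for illustration. In an actual VQ-VAE the codebook is learned jointly with an encoder and a decoder; the sketch shows only the nearest-neighbor lookup that turns continuous speech frames into discrete labels.

```python
# Hypothetical illustration of vector quantization (nearest codebook lookup).
# All names and values are assumptions, not taken from the paper.

def quantize(frames, codebook):
    """Return, for each frame, the index of its nearest codebook vector."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]

# Toy codebook with three entries (a learned codebook would be far larger).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]

# Toy "encoder outputs", one vector per speech frame.
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [0.8, 0.7]]

unit_ids = quantize(frames, codebook)
print(unit_ids)
```

A sequence of indices such as `unit_ids` is the kind of discrete label sequence that a captioning model could be trained to predict from an image in place of text tokens.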
- Subjects :
- image captioning
- General Computer Science
- Computer science
- Speech recognition
- Decoding
- 02 engineering and technology
- 010501 environmental sciences
- 01 natural sciences
- Field (computer science)
- Data modeling
- Task (project management)
- untranscribed unknown language
- 0202 electrical engineering, electronic engineering, information engineering
- Training
- General Materials Science
- self-supervised speech representation
- Representation (mathematics)
- Image-to-speech
- 0105 earth and related environmental sciences
- Modality (human–computer interaction)
- General Engineering
- Data models
- Autoencoder
- Bridges
- vector-quantized variational autoencoder
- Task analysis
- Image reconstruction
- 020201 artificial intelligence & image processing
- lcsh:Electrical engineering. Electronics. Nuclear engineering
- lcsh:TK1-9971
- Decoding methods
Details
- Language :
- English
- ISSN :
- 2169-3536
- Volume :
- 9
- Database :
- OpenAIRE
- Journal :
- IEEE Access
- Accession number :
- edsair.doi.dedup.....fd7b04709be13d0a6bfe6cb732b311c7