
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Authors:
Łajszczak, Mateusz
Cámbara, Guillermo
Li, Yang
Beyhan, Fatih
van Korlaar, Arent
Yang, Fan
Joly, Arnaud
Martín-Cortinas, Álvaro
Abbas, Ammar
Michalski, Adam
Moinet, Alexis
Karlapati, Sri
Muszyńska, Ewa
Guo, Haohan
Putrycz, Bartosz
Gambino, Soledad López
Yoo, Kayeon
Sokolova, Elena
Drugman, Thomas
Publication Year: 2024

Abstract

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data and achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating it against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark, and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.

Comment: v1.1 (fixed typos)
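
To make the two-stage pipeline in the abstract concrete, here is a minimal, self-contained PyTorch sketch: an autoregressive Transformer generates discrete speechcodes from text, and a convolutional decoder turns those codes into waveform chunks that can be emitted incrementally. Everything below is an illustrative assumption, not the authors' implementation: the class names (SpeechcodeLM, StreamingDecoder), the toy dimensions, the greedy decoding, and the shared text/code vocabulary are hypothetical, and the speaker-disentangled, BPE-compressed tokenizer that produces the real speechcodes is not modeled here.

```python
import torch
import torch.nn as nn


class SpeechcodeLM(nn.Module):
    """Stand-in for the paper's ~1B-parameter autoregressive Transformer
    that maps text tokens to discrete speechcodes (toy dimensions)."""

    def __init__(self, text_vocab=256, code_vocab=1024, d_model=64):
        super().__init__()
        # Text ids and speechcode ids share one embedding table via an offset.
        self.embed = nn.Embedding(text_vocab + code_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, code_vocab)
        self.text_vocab = text_vocab

    @torch.no_grad()
    def generate(self, text_ids, max_codes=32):
        seq, codes = text_ids.clone(), []
        for _ in range(max_codes):
            mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
            h = self.backbone(self.embed(seq), mask=mask)
            next_code = self.head(h[:, -1]).argmax(-1)  # greedy decoding
            codes.append(next_code)
            seq = torch.cat([seq, (next_code + self.text_vocab).unsqueeze(1)], dim=1)
        return torch.stack(codes, dim=1)  # (batch, max_codes)


class StreamingDecoder(nn.Module):
    """Stand-in for the convolution-based speechcode-to-waveform decoder;
    it yields audio chunk by chunk, mimicking incremental synthesis."""

    def __init__(self, code_vocab=1024, d_model=64, samples_per_code=240):
        super().__init__()
        self.embed = nn.Embedding(code_vocab, d_model)
        self.conv = nn.Conv1d(d_model, samples_per_code, kernel_size=3, padding=1)

    @torch.no_grad()
    def stream(self, codes, chunk=8):
        for start in range(0, codes.size(1), chunk):
            x = self.embed(codes[:, start:start + chunk]).transpose(1, 2)
            yield self.conv(x).transpose(1, 2).reshape(codes.size(0), -1)


if __name__ == "__main__":
    text = torch.randint(0, 256, (1, 16))        # fake text token ids
    lm, decoder = SpeechcodeLM(), StreamingDecoder()
    speechcodes = lm.generate(text)              # stage 1: text -> speechcodes
    for audio in decoder.stream(speechcodes):    # stage 2: codes -> waveform chunks
        print("waveform chunk:", tuple(audio.shape))
```

The design point the sketch mirrors is that the decoder consumes speechcodes in fixed-size windows and yields audio as it goes, so in a real system playback can begin before the Transformer has finished generating the whole utterance.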

Details

Database: arXiv
Publication Type: Report
Accession number: edsarx.2402.08093
Document Type: Working Paper