1. BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights
- Author
-
Hsu, Chan-Jan, Lin, Yi-Cheng, Lin, Chia-Chun, Chen, Wei-Chih, Chung, Ho Lam, Li, Chen-An, Chen, Yi-Chang, Yu, Chien-Yu, Lee, Ming-Ji, Chen, Chien-Cheng, Huang, Ru-Heng, Lee, Hung-yi, and Shiu, Da-Shan
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate a $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme to phoneme prediction model, to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
- Published
- 2025