1. From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
- Authors
Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, and Paula Buttery
- Subjects
Computer Science - Computation and Language
- Abstract
Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
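The abstract does not detail the conversion pipeline, so the sketch below is only a rough illustration of what a grapheme-to-phoneme step producing a "continuous stream" might look like. It assumes the open-source phonemizer library with an espeak-ng backend installed; the authors' actual tooling, phone inventory, and boundary handling may differ. Setting the word separator to a plain space makes word boundaries indistinguishable from phone boundaries, approximating an unsegmented phoneme stream.

```python
# Hypothetical sketch, NOT the paper's pipeline: convert orthographic text
# into a continuous phoneme stream using the `phonemizer` library
# (requires an espeak-ng installation for the "espeak" backend).
from phonemizer import phonemize
from phonemizer.separator import Separator


def to_phoneme_stream(text: str) -> str:
    """Return text as a space-delimited phoneme sequence.

    The word separator is set to the same single space as the phone
    separator, so word boundaries are dropped and the output reads as
    one continuous stream of phonemes.
    """
    return phonemize(
        text,
        language="en-us",
        backend="espeak",
        separator=Separator(phone=" ", word=" "),
        strip=True,
        preserve_punctuation=False,
    )


print(to_phoneme_stream("the cat sat on the mat"))
# Output is a flat phone sequence with no word boundaries;
# the exact phones depend on the espeak-ng version installed.
```

The same function could in principle be mapped over a text corpus and its evaluation benchmarks alike, which is the key requirement the abstract raises: keeping training data and benchmarks in the same phonemic representation.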
- Published
2024