Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
- Publication Year :
- 2023
Abstract
- We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 model and small LLaMA models on the developmentally plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which outperforms both of its teachers as well as a similar model trained without distillation. This suggests that, when the teachers are trained on a sufficiently small dataset, distillation can not only retain their full performance but exceed it, leading to significantly better results than direct training.
- Comment :
- 11 pages, 4 figures, 4 tables; submitted to the BabyLM Challenge and accepted as an archival full paper (CoNLL-CMCL 2023 Shared Task). Checkpoint available at https://huggingface.co/timinar/baby-llama-58m; training code available at https://github.com/timinar/BabyLlama
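- Illustration :
- The sketch below illustrates the general idea of distilling from an ensemble of teachers as described in the abstract: the student is trained on a weighted sum of hard-label cross-entropy and a KL term against each teacher's softened output distribution. It is a minimal PyTorch sketch, not the authors' implementation (their training code is at the GitHub link above); the function name and the `temperature` and `alpha` values are illustrative assumptions.

```python
# Minimal sketch of ensemble knowledge distillation (assumed setup, not the
# authors' exact training code; see https://github.com/timinar/BabyLlama).
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, labels,
                               temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and the mean KL divergence
    between the student and each teacher's temperature-softened distribution."""
    vocab_size = student_logits.size(-1)
    # Hard-label next-token loss (positions marked -100 are ignored).
    ce = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )
    # Soft-label loss: KL(teacher || student) at temperature T, averaged over teachers.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = torch.zeros((), device=student_logits.device)
    for t_logits in teacher_logits_list:
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        kl = kl + F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    kl = kl / len(teacher_logits_list) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl
```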
- Subjects :
- Computer Science - Computation and Language
- I.2.7
Details
- Database :
- arXiv
- Publication Type :
- Report
- Accession number :
- edsarx.2308.02019
- Document Type :
- Working Paper