Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
- Publication Year :
- 2023
Abstract
- We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 model and small LLaMA models on the developmentally plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which outperforms both of its teachers as well as a similar model trained without distillation. This suggests that, when the teachers are trained on a sufficiently small dataset, distillation can not only retain their full performance but exceed it, leading to significantly better results than direct training.
- Comment :
- 11 pages, 4 figures, 4 tables; submitted to the BabyLM Challenge and accepted as an archival full paper (CoNLL-CMCL 2023 Shared Task). Checkpoint available at https://huggingface.co/timinar/baby-llama-58m; training code available at https://github.com/timinar/BabyLlama
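- Illustration :
- The sketch below illustrates the general idea of distilling from an ensemble of teachers as described in the abstract: the student is trained on a weighted sum of hard-label cross-entropy and a KL term against each teacher's softened output distribution. It is a minimal PyTorch sketch, not the authors' implementation (their training code is at the GitHub link above); the function name and the `temperature` and `alpha` values are illustrative assumptions.

```python
# Minimal sketch of ensemble knowledge distillation (assumed setup, not the
# authors' exact training code; see https://github.com/timinar/BabyLlama).
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, labels,
                               temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and the mean KL divergence
    between the student and each teacher's temperature-softened distribution."""
    vocab_size = student_logits.size(-1)
    # Hard-label next-token loss (positions marked -100 are ignored).
    ce = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )
    # Soft-label loss: KL(teacher || student) at temperature T, averaged over teachers.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = torch.zeros((), device=student_logits.device)
    for t_logits in teacher_logits_list:
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        kl = kl + F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    kl = kl / len(teacher_logits_list) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kl
```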
- Subjects :
- Computer Science - Computation and Language
- I.2.7
Details
- Database :
- arXiv
- Publication Type :
- Report
- Accession number :
- edsarx.2308.02019
- Document Type :
- Working Paper