'Vorbești Românește?' A Recipe to Train Powerful Romanian LLMs with English Instructions

Authors :: Masala, Mihai
Ilie-Ablachim, Denis C.
Dima, Alexandru
Corlatescu, Dragos
Zavelca, Miruna
Olaru, Ovio
Terian, Simina
Terian, Andrei
Leordeanu, Marius
Velicu, Horia
Popescu, Marius
Dascalu, Mihai
Rebedea, Traian
Publication Year :: 2024
Abstract: In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds that in other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and to train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low- or less-resourced languages.
Comment: arXiv admin note: text overlap with arXiv:2405.07703
