1. 'Vorbești Românește?' A Recipe to Train Powerful Romanian LLMs with English Instructions
- Authors
Mihai Masala, Denis C. Ilie-Ablachim, Alexandru Dima, Dragos Corlatescu, Miruna Zavelca, Ovio Olaru, Simina Terian, Andrei Terian, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, and Traian Rebedea
- Subjects
Computer Science - Computation and Language
- Abstract
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds their performance in other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks, and to train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four categories of benchmarks, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, and models) to support and encourage research on Romanian LLMs, while concurrently creating a generalizable recipe adequate for other low- or less-resourced languages.
- Comment
Accepted at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024 Findings). arXiv admin note: text overlap with arXiv:2405.07703
- Published
2024
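
Since the abstract notes that the models are publicly released, a minimal sketch of querying such a model with the Hugging Face `transformers` library may be useful. The repository id below is an illustrative assumption, not taken from the entry; check the authors' release for the actual model names.

```python
# Minimal sketch: load a released Romanian LLM and generate a reply.
# MODEL_ID is a hypothetical/illustrative identifier, not confirmed by the entry above.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenLLM-Ro/RoLlama2-7b-Instruct"  # assumption: replace with the released model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "Vorbești românește?"  # "Do you speak Romanian?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```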