
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion

Authors:
Zhu, Jianqing
Huang, Huang
Lin, Zhihang
Liang, Juhao
Tang, Zhengyang
Almubarak, Khalid
Alharthi, Abdulmohsen
An, Bang
He, Juncai
Wu, Xiangbo
Yu, Fei
Chen, Junying
Ma, Zhuoheng
Du, Yuhao
Zhang, He
Alghamdi, Emad A.
Zhang, Lian
Sun, Ruoyu
Li, Haizhou
Wang, Benyou
Xu, Jinchao
Publication Year: 2024

Abstract

This paper addresses the critical need for democratizing large language models (LLMs) in the Arab world, a region where progress on models comparable to state-of-the-art offerings such as GPT-4 or GPT-3.5 (ChatGPT) has been slower, owing to a predominant focus on mainstream languages (e.g., English and Chinese). One practical objective for an Arabic LLM is to use an Arabic-specific tokenizer vocabulary, which can speed up decoding. However, adopting a different vocabulary often degrades the model's learned knowledge, since many words are out-of-vocabulary (OOV) when training starts. Inspired by how humans acquire vocabulary during second language (Arabic) acquisition, the released AraLLaMA employs progressive vocabulary expansion: a modified BPE algorithm progressively extends the Arabic subwords in its dynamic vocabulary during training, thereby keeping the OOV ratio balanced at every stage. An ablation study demonstrates the effectiveness of progressive vocabulary expansion. Moreover, AraLLaMA achieves performance comparable to the best Arabic LLMs across a variety of Arabic benchmarks. Models, training data, benchmarks, and code will all be open-sourced.
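
To make the idea concrete, here is a minimal sketch of one way progressive vocabulary expansion could be implemented on top of the Hugging Face `tokenizers` and `transformers` libraries. This is not the paper's released code: the stage schedule `STAGE_VOCAB_SIZES`, the placeholder corpus `arabic_corpus`, the base model name, and the mean-of-subwords embedding initialization are all illustrative assumptions.

```python
# A minimal sketch of progressive vocabulary expansion, assuming standard
# Hugging Face `tokenizers` and `transformers` APIs. The stage schedule,
# corpus handling, and embedding initialization are illustrative
# assumptions, not the paper's released implementation.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical schedule: the Arabic subword budget grows stage by stage.
STAGE_VOCAB_SIZES = [4_000, 8_000, 16_000, 32_000]


def train_bpe_subwords(corpus, vocab_size):
    """Train a BPE tokenizer of the target size and return its vocabulary."""
    tok = Tokenizer(models.BPE(unk_token="<unk>"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size,
                                  special_tokens=["<unk>"])
    tok.train_from_iterator(corpus, trainer)
    return set(tok.get_vocab())


def expand_stage(model, tokenizer, corpus, vocab_size):
    """Add newly learned Arabic subwords and grow the embedding matrix."""
    new_tokens = sorted(train_bpe_subwords(corpus, vocab_size)
                        - set(tokenizer.get_vocab()))
    # Record how each new token decomposes under the *current* vocabulary,
    # before it becomes a single token itself.
    decomp = {t: tokenizer(t, add_special_tokens=False)["input_ids"]
              for t in new_tokens}
    tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))
    emb = model.get_input_embeddings().weight.data
    for tok_str, piece_ids in decomp.items():
        if piece_ids:  # mean-of-subwords init: a heuristic, not the paper's
            new_id = tokenizer.convert_tokens_to_ids(tok_str)
            emb[new_id] = emb[piece_ids].mean(dim=0)
    return model, tokenizer


if __name__ == "__main__":
    base = "meta-llama/Llama-2-7b-hf"  # hypothetical base model
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)
    arabic_corpus = ["..."]  # placeholder: an iterable of Arabic text
    for size in STAGE_VOCAB_SIZES:
        model, tokenizer = expand_stage(model, tokenizer,
                                        arabic_corpus, size)
        # ... continue pretraining on Arabic data before the next stage ...
```

Initializing each new token's embedding from the subwords it previously decomposed into is a common heuristic for softening the distribution shift when the vocabulary grows; the staged schedule is what keeps the OOV ratio bounded at every step, matching the abstract's description.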

Details

Database: arXiv
Publication Type: Report
Accession number: edsarx.2412.12310
Document Type: Working Paper