
Optimizing Single DGX-A100 System: Overcoming GPU Limitations via Efficient Parallelism and Scheduling for Large Language Models

Authors :
Kyeong-Hwan Kim
Chang-Sung Jeong
Source :
Applied Sciences, Vol 13, Iss 16, p 9306 (2023)
Publication Year :
2023
Publisher :
MDPI AG, 2023.

Abstract

In this study, we introduce a novel training algorithm designed to overcome GPU memory limitations on a single DGX-A100 system. By incorporating the CPU and main memory into the training process and applying a strategy of division and parallelization, our algorithm increases both the maximum trainable language model size and the batch size. In addition, we developed a comprehensive management system that systematically controls the training process and resource usage while enabling the asynchronous deployment of tasks. Finally, we propose a scheduling technique, integrated into the management system, that promotes efficient task scheduling in a complex, heterogeneous training environment. These advances allow researchers to work with larger models and batch sizes even when GPU memory is limited.
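The core idea in the abstract — keeping parameters in host (CPU) memory, dividing them into shards, and streaming each shard through limited device memory while the next shard is prefetched asynchronously — can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the shard size, the `sgd_step` update, and the use of a thread pool to model asynchronous staging are all illustrative assumptions.

```python
# Hypothetical sketch of CPU-offloaded, sharded training: parameters live in
# host memory; only one shard at a time is staged into a small "device"
# buffer, updated, and written back, while the next shard is prefetched
# asynchronously. All names and sizes here are illustrative.
from concurrent.futures import ThreadPoolExecutor

DEVICE_CAPACITY = 4  # pretend the accelerator can hold only 4 parameters


def sgd_step(shard, grad_shard, lr=0.5):
    """Plain SGD update applied to one parameter shard on the 'device'."""
    return [p - lr * g for p, g in zip(shard, grad_shard)]


def train_step(params, grads, capacity=DEVICE_CAPACITY):
    """Stream parameter shards through the small device buffer,
    prefetching the next shard while the current one is updated."""
    shards = [params[i:i + capacity] for i in range(0, len(params), capacity)]
    gshards = [grads[i:i + capacity] for i in range(0, len(grads), capacity)]
    updated = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # "Prefetch" models the host-to-device copy; here it is a list copy.
        prefetch = pool.submit(list, shards[0])
        for i in range(len(shards)):
            device_buf = prefetch.result()          # wait for staged shard
            if i + 1 < len(shards):
                prefetch = pool.submit(list, shards[i + 1])  # stage next
            updated.extend(sgd_step(device_buf, gshards[i]))  # update, write back
    return updated


if __name__ == "__main__":
    params = [1.0] * 10
    grads = [1.0] * 10
    print(train_step(params, grads))  # each 1.0 becomes 1.0 - 0.5*1.0 = 0.5
```

In the real system the shards would be tensors, the copies would be host-device transfers, and the scheduler described in the paper would decide which shard moves when; the sketch only shows the overlap of staging and computation that makes such offloading pay off.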

Details

Language :
English
ISSN :
2076-3417
Volume :
13
Issue :
16
Database :
Directory of Open Access Journals
Journal :
Applied Sciences
Publication Type :
Academic Journal
Accession number :
edsdoj.9ed207289ce44d78321e3b683044dab
Document Type :
Article
Full Text :
https://doi.org/10.3390/app13169306