Challenges and technical development of storage systems in the era of large language models
- Author
FENG Yangyang, WANG Qing, and SHU Jiwu
- Subjects
large language model, storage system, data management, storage scalability, data fault tolerance, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Large language models have shown exceptional performance on complex tasks such as text and visual processing, drawing significant attention from industry and academia. Both the training and inference phases of large language models rely heavily on GPU computational power. However, the limited capacity and volatility of GPU memory make it difficult to meet the storage demands of these phases. This paper presents an in-depth analysis of the challenges faced by storage systems in the era of large language models: (1) the data of large language models is highly fragmented, and its sparse semantic representation further reduces storage system efficiency; (2) large language model training and inference demand high data read/write bandwidth, but the high communication overhead of transferring data across heterogeneous storage media complicates the expansion of GPU memory with these media; (3) the fault tolerance requirements of large language model training are stringent, yet directly applying CPU-centric fault tolerance techniques incurs prohibitive costs. To address these challenges, existing solutions are summarized from three perspectives: data management, storage expansion, and fault tolerance. Finally, the future development trends of storage systems in the era of large language models are discussed.
- Published
2025