A Survey on Data Synthesis and Augmentation for Large Language Models
- Authors
Ke Wang, Jiahui Zhu, Minjie Ren, Zeming Liu, Shiwei Li, Zongye Zhang, Chenkai Zhang, Xiaoyu Wu, Qiqi Zhan, Qingjie Liu, and Yunhong Wang
- Subjects
Computer Science - Computation and Language
- Abstract
The success of Large Language Models (LLMs) is inherently linked to the availability of vast, diverse, and high-quality data for training and evaluation. However, the growth rate of high-quality data is significantly outpaced by the expansion of training datasets, leading to a looming data exhaustion crisis. This underscores the urgent need to enhance data efficiency and explore new data sources. In this context, synthetic data has emerged as a promising solution. Currently, data generation primarily consists of two major approaches: data augmentation and data synthesis. This paper comprehensively reviews and summarizes data generation techniques throughout the lifecycle of LLMs, including data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. Furthermore, we discuss the current constraints faced by these methods and investigate potential pathways for future development and research. Our aspiration is to equip researchers with a clear understanding of these methodologies, enabling them to swiftly identify appropriate data generation strategies when constructing LLMs, while providing valuable insights for future exploration.
- Published
2024