Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU
- Authors: Lingnan Xia and Hua Ma
- Subjects: Parameter-efficient fine-tuning, auto-tuning, multi-tenancy, GPU acceleration, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Low-Rank Adaptation (LoRA) has attracted growing attention as a way to fine-tune large language models (LLMs) with limited resources. However, conventional approaches that serve multiple LoRA models independently incur redundant computation and underutilize the GPU. This study addresses these inefficiencies with Dynamic Operator Optimization, an automated optimization method that adapts the Segmented Gather Matrix-Vector Multiplication (SGMV) operator to the specific serving context. SGMV's design allows GPU operations for different LoRA models to be batched together, yielding a marked improvement in computational efficiency. The approach uses a Search Space Constructor to build a structured search space, separating the program space into high-level structural sketches and low-level implementation details so that operator implementations remain diverse and adaptable. An Optimization Engine then tunes these implementations through evolutionary search guided by a cost model that estimates performance. This progressive optimization lets SGMV implementations adapt to varying scenarios while maintaining high performance. Results show the design raises throughput by up to 1.46x over state-of-the-art multi-tenant LoRA serving systems.
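For readers unfamiliar with the operator, the following is a minimal NumPy sketch of SGMV's reference semantics as the abstract describes them: batched request rows are grouped into contiguous segments, and each segment is gathered and multiplied by its own tenant's low-rank adapter pair. All names and shapes here are illustrative assumptions; the paper's actual operator is a tuned GPU kernel, not this Python loop.

```python
import numpy as np

def sgmv(x, lora_A, lora_B, seg_starts, seg_ends, seg_adapter):
    """Reference semantics of Segmented Gather Matrix-Vector Multiplication.

    x            : (num_tokens, hidden)  batched input rows
    lora_A[j]    : (hidden, rank)        down-projection of adapter j
    lora_B[j]    : (rank, hidden)        up-projection of adapter j
    seg_starts, seg_ends : segment boundaries into x
    seg_adapter  : index of the adapter serving each segment
    """
    y = np.zeros_like(x)
    for s, e, j in zip(seg_starts, seg_ends, seg_adapter):
        # Gather the rows belonging to one tenant's requests and apply
        # that tenant's low-rank update as a pair of small matmuls.
        y[s:e] = (x[s:e] @ lora_A[j]) @ lora_B[j]
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 8), dtype=np.float32)
    A = [rng.standard_normal((8, 2), dtype=np.float32) for _ in range(2)]
    B = [rng.standard_normal((2, 8), dtype=np.float32) for _ in range(2)]
    # Two segments: tokens 0-3 use adapter 0, tokens 3-6 use adapter 1.
    y = sgmv(x, A, B, seg_starts=[0, 3], seg_ends=[3, 6], seg_adapter=[0, 1])
    print(y.shape)  # (6, 8)
```

Batching both segments into one kernel launch, rather than one launch per adapter, is what lets SGMV amortize GPU work across tenants.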
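The two-level tuning pipeline the abstract outlines (a Search Space Constructor that separates structural sketches from implementation details, and an Optimization Engine that runs cost-model-guided evolutionary search) can be sketched as below. Everything here, including `Candidate`, `build_search_space`, `evolve`, and the toy cost model, is a hypothetical illustration under assumed names, not the paper's code.

```python
import itertools
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    """One concrete SGMV implementation: a structural sketch (loop order)
    plus its implementation details (tile size, unroll factor)."""
    loop_order: tuple
    tile: int
    unroll: int

def build_search_space(loop_orders, tiles=(16, 32, 64), unrolls=(1, 2, 4)):
    # Cross structural outlines with detail choices to enumerate candidates.
    return [Candidate(o, t, u)
            for o, t, u in itertools.product(loop_orders, tiles, unrolls)]

def evolve(space, cost_model, measure, generations=10, topk=4):
    """Evolutionary search: a cheap cost model ranks candidates, and only
    the top few are benchmarked on the real device each generation."""
    population = random.sample(space, min(topk, len(space)))
    best, best_time = None, float("inf")
    for _ in range(generations):
        # "Mutation" here simply re-samples the space near the survivors;
        # a real engine would perturb schedules more deliberately.
        candidates = population + random.sample(space, min(8, len(space)))
        candidates.sort(key=cost_model)        # predicted cost, ascending
        for cand in candidates[:topk]:
            t = measure(cand)                  # actual on-device timing
            if t < best_time:
                best, best_time = cand, t
        population = candidates[:topk]
    return best, best_time

# Toy stand-ins for the learned cost model and the hardware measurement.
space = build_search_space(loop_orders=[("segment", "row", "rank"),
                                        ("row", "segment", "rank")])
fake_cost = lambda c: abs(c.tile - 32) + c.unroll
fake_time = lambda c: abs(c.tile - 32) * 0.1 + 1.0 / c.unroll
best, t = evolve(space, fake_cost, fake_time)
```

Ranking by predicted cost before measuring keeps expensive on-GPU benchmarks to a handful per generation, which is the usual rationale for pairing a cost model with evolutionary search.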
- Published: 2024