Author: "Oliaro, Gabriele" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Oliaro, Gabriele"' showing total 12 results

1. SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference

Author: Oliaro, Gabriele, Jia, Zhihao, Campos, Daniel, and Qiao, Aurick
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: We present SuffixDecoding, a novel model-free approach to accelerating large language model (LLM) inference through speculative decoding. Unlike existing methods that rely on draft models or specialized decoding heads, SuffixDecoding leverages suffix trees built from previously generated outputs to efficiently predict candidate token sequences. Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models. SuffixDecoding builds and dynamically updates suffix trees to capture patterns in the generated text, using them to construct speculation trees through a principled scoring mechanism based on empirical token frequencies. SuffixDecoding requires only CPU memory which is plentiful and underutilized on typical LLM serving nodes. We demonstrate that SuffixDecoding achieves competitive speedups compared to model-based approaches across diverse workloads including open-domain chat, code generation, and text-to-SQL tasks. For open-ended chat and code generation tasks, SuffixDecoding achieves up to $1.4\times$ higher output throughput than SpecInfer and up to $1.1\times$ lower time-per-token (TPOT) latency. For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than speculative decoding. Our evaluation shows that SuffixDecoding maintains high acceptance rates even with small reference corpora of 256 examples, while continuing to improve performance as more historical outputs are incorporated.
Published: 2024

2. Optimal Kernel Orchestration for Tensor Programs with Korch

Author: Hu, Muyan, Venkatram, Ashwin, Biswas, Shreyashri, Marimuthu, Balamurugan, Hou, Bohan, Oliaro, Gabriele, Wang, Haojie, Zheng, Liyan, Miao, Xupeng, and Zhai, Jidong
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning
Abstract: Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.
Comment: Fix some typos in the ASPLOS version
Published: 2024
Full Text: View/download PDF

3. FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

Author: Miao, Xupeng, Oliaro, Gabriele, Cheng, Xinhao, Wu, Mengdi, Unger, Colin, and Jia, Zhihao
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.
Published: 2024

4. Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Author: Miao, Xupeng, Oliaro, Gabriele, Zhang, Zhihao, Cheng, Xinhao, Jin, Hongyi, Chen, Tianqi, and Jia, Zhihao
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.
Published: 2023

5. SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

Author: Miao, Xupeng, Oliaro, Gabriele, Zhang, Zhihao, Cheng, Xinhao, Wang, Zeyu, Zhang, Zhengxin, Wong, Rae Ying Yee, Zhu, Alan, Yang, Lijie, Shi, Xiaoxiang, Shi, Chunan, Chen, Zhuoming, Arfeen, Daiyaan, Abhyankar, Reyna, and Jia, Zhihao
Subjects: Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/
Comment: ASPLOS'24
Published: 2023
Full Text: View/download PDF

6. Direct Telemetry Access

Author: Langlet, Jonatan, Basat, Ran Ben, Oliaro, Gabriele, Mitzenmacher, Michael, Yu, Minlan, and Antichi, Gianni
Subjects: Computer Science - Networking and Internet Architecture
Abstract: Fine-grained network telemetry is becoming a modern datacenter standard and is the basis of essential applications such as congestion control, load balancing, and advanced troubleshooting. As network size increases and telemetry gets more fine-grained, there is a tremendous growth in the amount of data needed to be reported from switches to collectors to enable network-wide view. As a consequence, it is progressively hard to scale data collection systems. We introduce Direct Telemetry Access (DTA), a solution optimized for aggregating and moving hundreds of millions of reports per second from switches into queryable data structures in collectors' memory. DTA is lightweight and it is able to greatly reduce overheads at collectors. DTA is built on top of RDMA, and we propose novel and expressive reporting primitives to allow easy integration with existing state-of-the-art telemetry mechanisms such as INT or Marple. We show that DTA significantly improves telemetry collection rates. For example, when used with INT, it can collect and aggregate over 400M reports per second with a single server, improving over the Atomic MultiLog by up to 16x.
Comment: As appearing in the proceedings of ACM SIGCOMM'23
Published: 2022

7. Zero-CPU Collection with Direct Telemetry Access

Author: Langlet, Jonatan, Basat, Ran Ben, Ramanathan, Sivaramakrishnan, Oliaro, Gabriele, Mitzenmacher, Michael, Yu, Minlan, and Antichi, Gianni
Subjects: Computer Science - Networking and Internet Architecture, Computer Science - Data Structures and Algorithms, Electrical Engineering and Systems Science - Systems and Control
Abstract: Programmable switches are driving a massive increase in fine-grained measurements. This puts significant pressure on telemetry collectors that have to process reports from many switches. Past research acknowledged this problem by either improving collectors' stack performance or by limiting the amount of data sent from switches. In this paper, we take a different and radical approach: switches are responsible for directly inserting queryable telemetry data into the collectors' memory, bypassing their CPU, and thereby improving their collection scalability. We propose to use a method we call direct telemetry access, where switches jointly write telemetry reports directly into the same collector's memory region, without coordination. Our solution, DART, is probabilistic, trading memory redundancy and query success probability for CPU resources at collectors. We prototype DART using commodity hardware such as P4 switches and RDMA NICs and show that we get high query success rates with a reasonable memory overhead. For example, we can collect INT path tracing information on a fat tree topology without a collector's CPU involvement while achieving 99.9% query success probability and using just 300 bytes per flow.
Comment: To appear in ACM HotNets 2021
Published: 2021
Full Text: View/download PDF

8. Optimal Kernel Orchestration for Tensor Programs with Korch

Author: Hu, Muyan, primary, Venkatram, Ashwin, additional, Biswas, Shreyashri, additional, Marimuthu, Balamurugan, additional, Hou, Bohan, additional, Oliaro, Gabriele, additional, Wang, Haojie, additional, Zheng, Liyan, additional, Miao, Xupeng, additional, Zhai, Jidong, additional, and Jia, Zhihao, additional
Published: 2024
Full Text: View/download PDF

9. Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

Author: Zhang, Zhengxin, Zhao, Dan, Miao, Xupeng, Oliaro, Gabriele, Li, Qing, Jiang, Yong, and Jia, Zhihao
Abstract: Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks. Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, which only updates a small number of trainable parameters, or attempt to reduce the memory footprint during the training phase of the finetuning. Typically, the memory footprint during finetuning stems from three contributors: model weights, optimizer states, and intermediate activations. However, existing works still require considerable memory and none can simultaneously mitigate memory footprint for all three sources. In this paper, we present Quantized Side Tuning (QST), which enables memory-efficient and fast finetuning of LLMs by operating through a dual-stage process. First, QST quantizes an LLM's model weights into 4-bit to reduce the memory footprint of the LLM's original weights; QST also introduces a side network separated from the LLM, which utilizes the hidden states of the LLM to make task-specific predictions. Using a separate side network avoids performing backpropagation through the LLM, thus reducing the memory requirement of the intermediate activations. Furthermore, QST leverages several low-rank adaptors and gradient-free downsample modules to significantly reduce the trainable parameters, so as to save the memory footprint of the optimizer states. Experiments show that QST can reduce the total memory footprint by up to $2.3\times$ and speed up the finetuning process by up to $3\times$ while achieving competent performance compared with the state-of-the-art. When it comes to full finetuning, QST can reduce the total memory footprint up to $7\times$.
Published: 2024

10. Direct Telemetry Access

Author: Langlet, Jonatan, primary, Ben Basat, Ran, additional, Oliaro, Gabriele, additional, Mitzenmacher, Michael, additional, Yu, Minlan, additional, and Antichi, Gianni, additional
Published: 2023
Full Text: View/download PDF

11. SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification

Author: Miao, Xupeng, Oliaro, Gabriele, Zhang, Zhihao, Cheng, Xinhao, Wang, Zeyu, Wong, Rae Ying Yee, Chen, Zhuoming, Arfeen, Daiyaan, Abhyankar, Reyna, and Jia, Zhihao
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing, Distributed, Parallel, and Cluster Computing (cs.DC), Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. This paper introduces SpecInfer, an LLM serving system that accelerates generative LLM inference with speculative inference and token tree verification. A key insight behind SpecInfer is to combine various collectively boost-tuned small language models to jointly predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified by the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality.
Published: 2023
Full Text: View/download PDF

12. Zero-CPU Collection with Direct Telemetry Access

Author: Langlet, Jonatan, primary, Ben-Basat, Ran, additional, Ramanathan, Sivaramakrishnan, additional, Oliaro, Gabriele, additional, Mitzenmacher, Michael, additional, Yu, Minlan, additional, and Antichi, Gianni, additional
Published: 2021
Full Text: View/download PDF
