Author: "Mohan, Jayashree" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Mohan, Jayashree"' showing total 15 results

1. Towards Efficient Large Multimodal Model Serving

Author: Qiu, Haoran, Biswas, Anish, Zhao, Zihan, Mohan, Jayashree, Khare, Alind, Choukse, Esha, Goiri, Íñigo, Zhang, Zeyu, Shen, Haiying, Bansal, Chetan, Ramjee, Ramachandran, and Fonseca, Rodrigo
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence
Abstract: Recent advances in generative AI have led to large multi-modal models (LMMs) capable of simultaneously processing inputs of various modalities such as text, images, video, and audio. While these models demonstrate impressive capabilities, efficiently serving them in production environments poses significant challenges due to their complex architectures and heterogeneous resource requirements. We present the first comprehensive systems analysis of two prominent LMM architectures, decoder-only and cross-attention, on six representative open-source models. We investigate their multi-stage inference pipelines and resource utilization patterns that lead to unique systems design implications. We also present an in-depth analysis of production LMM inference traces, uncovering unique workload characteristics, including variable, heavy-tailed request distributions, diverse modal combinations, and bursty traffic patterns. Our key findings reveal that different LMM inference stages exhibit highly heterogeneous performance characteristics and resource demands, while concurrent requests across modalities lead to significant performance interference. To address these challenges, we propose a decoupled serving architecture that enables independent resource allocation and adaptive scaling for each stage. We further propose optimizations such as stage colocation to maximize throughput and resource utilization while meeting the latency objectives.
Published: 2025

2. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

Author: Kamath, Aditya K, Prabhu, Ramya, Mohan, Jayashree, Peter, Simon, Ramjee, Ramachandran, and Panwar, Ashish
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, I.2.7, C.1.4
Abstract: Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. This approach optimizes linear operations but remains inefficient for attention computation because existing attention kernels specialize execution independently for the prefill and decode phases. In this paper, we present POD-Attention - the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. POD-Attention speeds up attention computation by up to 59% (mean 28%), enabling higher throughput and lower latency LLM inference compared to the use of independently optimized prefill and decode attention kernels.
Comment: Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '25), March 30 - April 3, 2025, Rotterdam, Netherlands
Published: 2024

3. ASTRA: Accurate and Scalable ANNS-based Training of Extreme Classifiers

Author: Mehta, Sonu, Mohan, Jayashree, Natarajan, Nagarajan, Ramjee, Ramachandran, and Varma, Manik
Subjects: Computer Science - Machine Learning, Computer Science - Information Retrieval
Abstract: "Extreme Classification" (or XC) is the task of annotating data points (queries) with relevant labels (documents), from an extremely large set of $L$ possible labels, arising in search and recommendations. The most successful deep learning paradigm that has emerged over the last decade or so for XC is to embed the queries (and labels) using a deep encoder (e.g. DistilBERT), and use linear classifiers on top of the query embeddings. This architecture is of appeal because it enables millisecond-time inference using approximate nearest neighbor search (ANNS). The key question is how do we design training algorithms that are accurate as well as scale to $O(100M)$ labels on a limited number of GPUs. State-of-the-art XC techniques that demonstrate high accuracies (e.g., DEXML, Renée, DEXA) on standard datasets have per-epoch training time that scales as $O(L)$ or employ expensive negative sampling strategies, which are prohibitive in XC scenarios. In this work, we develop an accurate and scalable XC algorithm ASTRA with two key observations: (a) building ANNS index on the classifier vectors and retrieving hard negatives using the classifiers aligns the negative sampling strategy to the loss function optimized; (b) keeping the ANNS indices current as the classifiers change through the epochs is prohibitively expensive while using stale negatives (refreshed periodically) results in poor accuracy; to remedy this, we propose a negative sampling strategy that uses a mixture of importance sampling and uniform sampling. By extensive evaluation on standard XC as well as proprietary datasets with 120M labels, we demonstrate that ASTRA achieves SOTA precision, while reducing training time by 4x-15x relative to the second best.
Published: 2024

4. Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems

Author: Agrawal, Amey, Agarwal, Anmol, Kedia, Nitin, Mohan, Jayashree, Kundu, Souvik, Kwatra, Nipun, Ramjee, Ramachandran, and Tumanov, Alexey
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Serving large language models (LLMs) in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (e.g., TTFT, TBT, Normalised Latency, and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of user-facing performance crucial for real-time applications such as chat and translation. In this paper, we first identify the pitfalls of current performance metrics in evaluating LLM inference systems. We then propose Etalon, a comprehensive performance evaluation framework that includes fluidity-index -- a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience. Finally, we evaluate various existing open-source platforms and model-as-a-service offerings using Etalon, discussing their strengths and weaknesses. Etalon is available at https://github.com/project-etalon/etalon.
Published: 2024

5. Vidur: A Large-Scale Simulation Framework For LLM Inference

Author: Agrawal, Amey, Kedia, Nitin, Mohan, Jayashree, Panwar, Ashish, Kwatra, Nipun, Gulavani, Bhargav, Ramjee, Ramachandran, and Tumanov, Alexey
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Optimizing the deployment of large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring the large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easily-extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration which would require 42K GPU hours - costing ~218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur.
Published: 2024

6. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Author: Prabhu, Ramya, Nayak, Ajay, Mohan, Jayashree, Ramjee, Ramachandran, and Panwar, Ashish
Subjects: Computer Science - Machine Learning, Computer Science - Operating Systems
Abstract: PagedAttention is a popular approach for dynamic memory allocation in LLM serving systems. It enables on-demand allocation of GPU memory to mitigate KV cache fragmentation -- a phenomenon that crippled the batch size (and consequently throughput) in prior systems. However, in trying to allocate physical memory at runtime, PagedAttention ends up changing the virtual memory layout of the KV cache from contiguous to non-contiguous. Such a design leads to non-trivial programming and performance overheads. We present vAttention -- an approach that mitigates fragmentation in physical memory while retaining the contiguity of KV cache in virtual memory. We achieve this by decoupling the allocation of virtual and physical memory using CUDA virtual memory management APIs. We also introduce various LLM-specific optimizations to address the limitations of CUDA virtual memory support. Overall, vAttention is a simpler, portable, and performant alternative to PagedAttention: it supports various attention kernels out-of-the-box and improves LLM serving throughput by up to 1.23x compared to the use of PagedAttention-based kernels of FlashAttention and FlashInfer.
Comment: To appear in ASPLOS 2025
Published: 2024

7. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Author: Agrawal, Amey, Kedia, Nitin, Panwar, Ashish, Mohan, Jayashree, Kwatra, Nipun, Gulavani, Bhargav S., Tumanov, Alexey, and Ramjee, Ramachandran
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt and produces the first output token, and the second is decode, which generates the rest of the output tokens, one at a time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills, which split a prefill request into near-equal-sized chunks, and creates stall-free schedules that add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations, resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.
Published: 2024

8. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Author: Agrawal, Amey, Panwar, Ashish, Mohan, Jayashree, Kwatra, Nipun, Gulavani, Bhargav S., and Ramjee, Ramachandran
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles. We present SARATHI to address these challenges. SARATHI employs chunked-prefills, which splits a prefill request into equal sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. During inference, the prefill chunk saturates GPU compute, while the decode requests 'piggyback' and cost up to an order of magnitude less compared to a decode-only batch. Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware. For the LLaMA-13B model on A6000 GPU, SARATHI improves decode throughput by up to 10x, and accelerates end-to-end throughput by up to 1.33x. For LLaMa-33B on A100 GPU, we achieve 1.25x higher end-to-end-throughput and up to 4.25x higher decode throughput. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x.
Published: 2023
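
The chunked-prefill and decode-maximal batching idea described in the SARATHI abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Batch`, `split_prefill`, and `decode_maximal_batches`, the notion of a fixed number of batch slots, and the simplification that the same decodes ride along in every batch are all hypothetical. The last chunk may be smaller than the others when the prompt length is not a multiple of the chunk size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Batch:
    prefill_tokens: int            # size of the single prefill chunk in this batch
    decode_request_ids: List[int]  # decode requests piggybacking on that chunk


def split_prefill(prompt_len: int, chunk_size: int) -> List[int]:
    """Split a prompt into chunks of at most chunk_size tokens."""
    if prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("prompt_len and chunk_size must be positive")
    full, rest = divmod(prompt_len, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def decode_maximal_batches(
    prompt_len: int,
    chunk_size: int,
    decode_ids: List[int],
    batch_slots: int,
) -> List[Batch]:
    """Build one batch per prefill chunk and fill the other slots with decodes.

    Assumption: one slot holds the prefill chunk, and each remaining slot
    holds one ongoing decode request that advances by one token per batch.
    """
    if batch_slots < 1:
        raise ValueError("batch_slots must be at least 1")
    piggybacked = decode_ids[: batch_slots - 1]
    return [
        Batch(chunk, list(piggybacked))
        for chunk in split_prefill(prompt_len, chunk_size)
    ]
```

For instance, calling `decode_maximal_batches(1000, 256, [1, 2, 3, 4, 5], 4)` would produce four batches from one prefill request, each carrying a single prefill chunk plus three piggybacked decodes.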

9. Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

Author: Mohan, Jayashree, Phanishayee, Amar, Kulkarni, Janardhan, and Chidambaram, Vijay
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers. Existing schedulers for DNN training consider GPU as the dominant resource, and allocate other resources such as CPU and memory proportional to the number of GPUs requested by the job. Unfortunately, these schedulers do not consider the impact of a job's sensitivity to allocation of CPU, memory, and storage resources. In this work, we propose Synergy, a resource-sensitive scheduler for shared GPU clusters. Synergy infers the sensitivity of DNNs to different resources using optimistic profiling; some jobs might benefit from more than the GPU-proportional allocation and some jobs might not be affected by less than GPU-proportional allocation. Synergy performs such multi-resource workload-aware assignments across a set of jobs scheduled on shared multi-tenant clusters using a new near-optimal online algorithm. Our experiments show that workload-aware CPU and memory allocations can improve average JCT up to 3.4x when compared to traditional GPU-proportional scheduling.
Published: 2021

10. Memory Optimization for Deep Networks

Author: Shah, Aashaka, Wu, Chao-Yuan, Mohan, Jayashree, Chidambaram, Vijay, and Krähenbühl, Philipp
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor computation in top-of-the-line GPUs increased by 32x over the last five years, the total available memory only grew by 2.5x. This prevents researchers from exploring larger architectures, as training large networks requires more memory for storing intermediate outputs. In this paper, we present MONeT, an automatic framework that minimizes both the memory footprint and computational overhead of deep networks. MONeT jointly optimizes the checkpointing schedule and the implementation of various operators. MONeT is able to outperform all prior hand-tuned operations as well as automated checkpointing. MONeT reduces the overall memory requirement by 3x for various PyTorch models, with a 9-16% overhead in computation. For the same computation cost, MONeT requires 1.2-1.8x less memory than current state-of-the-art automated checkpointing frameworks. Our code is available at https://github.com/utsaslab/MONeT.
Comment: 18 pages, ICLR'21
Published: 2020

11. Analyzing and Mitigating Data Stalls in DNN Training

Author: Mohan, Jayashree, Phanishayee, Amar, Raniwala, Ashish, and Chidambaram, Vijay
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Computer Science - Operating Systems
Abstract: Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used computer vision and audio DNNs, which typically involve complex data preprocessing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, GPU generation, etc., on servers that are a part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and preprocessed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique, and perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configs show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data loading library, DNN training time is reduced significantly (by as much as 5x on a single server).
Published: 2020

12. RECIPE: Converting Concurrent DRAM Indexes to Persistent-Memory Indexes

Author: Lee, Se Kwon, Mohan, Jayashree, Kashyap, Sanidhya, Kim, Taesoo, and Chidambaram, Vijay
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Databases, Computer Science - Data Structures and Algorithms
Abstract: We present Recipe, a principled approach for converting concurrent DRAM indexes into crash-consistent indexes for persistent memory (PM). The main insight behind Recipe is that isolation provided by a certain class of concurrent in-memory indexes can be translated with small changes to crash-consistency when the same index is used in PM. We present a set of conditions that enable the identification of this class of DRAM indexes, and the actions to be taken to convert each index to be persistent. Based on these conditions and conversion actions, we modify five different DRAM indexes based on B+ trees, tries, radix trees, and hash tables to their crash-consistent PM counterparts. The effort involved in this conversion is minimal, requiring 30-200 lines of code. We evaluated the converted PM indexes on Intel DC Persistent Memory, and found that they outperform state-of-the-art, hand-crafted PM indexes in multi-threaded workloads by up to 5.2x. For example, we built P-CLHT, our PM implementation of the CLHT hash table, by modifying only 30 LOC. When running YCSB workloads, P-CLHT performs up to 2.4x better than Cacheline-Conscious Extendible Hashing (CCEH), the state-of-the-art PM hash table.
Comment: 3 pages: Added one more reference
Published: 2019

13. Analyzing GDPR Compliance Through the Lens of Privacy Policy

Author: Mohan, Jayashree, Wasserman, Melissa, and Chidambaram, Vijay
Subjects: Computer Science - Computers and Society
Abstract: With the arrival of the European Union's General Data Protection Regulation (GDPR), several companies are making significant changes to their systems to achieve compliance. The changes range from modifying privacy policies to redesigning systems which process personal data. This work analyzes the privacy policies of large-scale cloud services which seek to be GDPR compliant. The privacy policy is the main medium of information dissemination between the data controller and the users. We show that many services that claim compliance today do not have clear and concise privacy policies. We identify several points in the privacy policies which potentially indicate non-compliance; we term these GDPR vulnerabilities. We identify GDPR vulnerabilities in ten cloud services. Based on our analysis, we propose seven best practices for crafting GDPR privacy policies.
Published: 2019

14. Finding Crash-Consistency Bugs with Bounded Black-Box Crash Testing

Author: Mohan, Jayashree, Martinez, Ashlie, Ponnapalli, Soujanya, Raju, Pandian, and Chidambaram, Vijay
Subjects: Computer Science - Operating Systems
Abstract: We present a new approach to testing file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space based on parameters such as the number of file-system operations or which operations to include, and exhaustively generates workloads within this bounded space. Each workload is tested on the target file system by simulating power-loss crashes while the workload is being executed, and checking if the file system recovers to a correct state after each crash. B3 builds upon insights derived from our study of crash-consistency bugs reported in Linux file systems in the last five years. We observed that most reported bugs can be reproduced using small workloads of three or fewer file-system operations on a newly-created file system, and that all reported bugs result from crashes after fsync() related system calls. We build two tools, CrashMonkey and ACE, to demonstrate the effectiveness of this approach. Our tools are able to find 24 out of the 26 crash-consistency bugs reported in the last five years. Our tools also revealed 10 new crash-consistency bugs in widely-used, mature Linux file systems, seven of which existed in the kernel since 2014. Our tools also found a crash-consistency bug in a verified file system, FSCQ. The new bugs result in severe consequences like broken rename atomicity and loss of persisted files.
Published: 2018
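
The bounding step described in the B3 abstract (limit the number of operations and the set of operations, then enumerate every workload inside that bound) can be illustrated with a short sketch. This is an assumption-laden simplification rather than how CrashMonkey or ACE actually work: the function name `bounded_workloads` and the operation list are hypothetical, and a real generator would also have to choose arguments, insert persistence points, and simulate crashes.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def bounded_workloads(
    operations: Sequence[str], max_ops: int
) -> Iterator[Tuple[str, ...]]:
    """Yield every sequence of 1..max_ops operations drawn from `operations`."""
    if max_ops < 1:
        raise ValueError("max_ops must be at least 1")
    for length in range(1, max_ops + 1):
        # product(..., repeat=length) enumerates all ordered sequences,
        # including ones that repeat the same operation.
        yield from product(operations, repeat=length)


# Hypothetical bound: three operation types, at most three operations each.
OPS = ["creat", "rename", "link"]
workloads = list(bounded_workloads(OPS, 3))
```

With k operation types and a bound of n operations, this enumeration yields k + k^2 + ... + k^n workloads, which shows why a small bound keeps exhaustive testing tractable.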

15. Analyzing IO Amplification in Linux File Systems

Author: Mohan, Jayashree, Kadekodi, Rohan, and Chidambaram, Vijay
Subjects: Computer Science - Operating Systems
Abstract: We present the first systematic analysis of read, write, and space amplification in Linux file systems. While many researchers are tackling write amplification in key-value stores, IO amplification in file systems has been largely unexplored. We analyze data and metadata operations on five widely-used Linux file systems: ext2, ext4, XFS, btrfs, and F2FS. We find that data operations result in significant write amplification (2-32X) and that metadata operations have a large IO cost. For example, a single rename requires 648 KB write IO in btrfs. We also find that small random reads result in read amplification of 2-13X. Based on these observations, we present the CReWS conjecture about the relationship between IO amplification, consistency, and storage space utilization. We hope this paper spurs people to design future file systems with less IO amplification, especially for non-volatile memory technologies.
Published: 2017
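
As a small worked illustration of the amplification figures quoted in the abstract above, the sketch below assumes the usual definition: amplification is the ratio of bytes the storage device actually transfers to bytes the application requested. The function name `amplification` and the sample numbers are hypothetical and are not taken from the paper's measurements.

```python
def amplification(device_bytes: int, user_bytes: int) -> float:
    """Ratio of device-level IO to the IO the application asked for."""
    if user_bytes <= 0:
        raise ValueError("user_bytes must be positive")
    if device_bytes < 0:
        raise ValueError("device_bytes must be non-negative")
    return device_bytes / user_bytes


# Hypothetical example: a 4 KiB application write that triggers
# 32 KiB of writes at the device level (data plus metadata and journaling).
example = amplification(32 * 1024, 4 * 1024)
```

Under this definition a value of 1.0 means no amplification, so the 2-32X range reported for writes corresponds to the device writing between 2 and 32 bytes for every byte the application wrote.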
